Sunday, July 28, 2013

Enterprise Data Workflows with Cascading: Streamlined Enterprise Data Management and Analysis, by Paco Nathan

What is Cascading? It is a Java application framework that enables developers to quickly and easily build rich data analytics and data management applications, which can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop 1.0 and API-compatible distributions.
The author, Paco Nathan, is the director of data science at Concurrent and heads up the developer outreach program there. He has a dual background from Stanford in math/stats and distributed computing, with more than 25 years of experience in the tech industry.
It's a hands-on book that is very useful for learning how to use Cascading. It presents the ideas and concepts with lots of examples: readers can work through the sample code and follow each example, and along the way they learn not only Cascading but also related tools such as Gradle, Scala, Clojure, and R. The book runs Hadoop in standalone mode (local files), but readers can adapt the examples to HDFS.
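To give an idea of what a Cascading application looks like, below is a minimal sketch of a copy flow in the style of the first example from the companion "Cascading for the Impatient" series (the impatient.jar run shown next): a source Tap reads a tab-delimited file, a Pipe passes the tuples through unchanged, and a sink Tap writes them back out. It assumes the Cascading 2.x Hadoop API (HadoopFlowConnector, Hfs, TextDelimited) and is only a sketch, not the book's exact listing.

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main {
  public static void main(String[] args) {
    String inPath = args[0];   // e.g. data/rain.txt
    String outPath = args[1];  // e.g. output/rain

    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, Main.class);
    HadoopFlowConnector flowConnector = new HadoopFlowConnector(properties);

    // source tap: read a TSV file that has a header line
    Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

    // sink tap: write the tuples back out as TSV
    Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

    // a pipe that simply connects source to sink (a distributed copy)
    Pipe copyPipe = new Pipe("copy");

    // plan the flow and run it; Cascading turns this into a single MapReduce job
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(copyPipe, inTap)
        .addTailSink(copyPipe, outTap);

    flowConnector.connect(flowDef).complete();
  }
}

Packaged into a single jar (the tutorial uses Gradle for this), it is submitted with hadoop jar exactly as in the run below.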
[surachart@sopun part1]$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain
13/07/17 15:36:03 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.Main
13/07/17 15:36:03 INFO planner.HadoopPlanner: using application jar: /home/surachart/Impatient/part1/./build/libs/impatient.jar
13/07/17 15:36:03 INFO property.AppProps: using app.id: AF4864F3098D32707AECB70969216A30
13/07/17 15:36:04 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/07/17 15:36:04 WARN snappy.LoadSnappy: Snappy native library not loaded
13/07/17 15:36:04 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 15:36:05 INFO util.Version: Concurrent, Inc - Cascading 2.1.2
13/07/17 15:36:05 INFO flow.Flow: [] starting
13/07/17 15:36:05 INFO flow.Flow: [] source: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
13/07/17 15:36:05 INFO flow.Flow: [] sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["output/rain"]"]
13/07/17 15:36:05 INFO flow.Flow: [] parallel execution is enabled: true
13/07/17 15:36:05 INFO flow.Flow: [] starting jobs: 1
13/07/17 15:36:05 INFO flow.Flow: [] allocating threads: 1
13/07/17 15:36:05 INFO flow.FlowStep: [] at least one sink does not exist
13/07/17 15:36:05 INFO flow.FlowStep: [] source modification date at: Wed Jul 17 15:32:54 ICT 2013
13/07/17 15:36:05 INFO flow.FlowStep: [] starting step: (1/1) output/rain
13/07/17 15:36:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/17 15:36:06 INFO flow.FlowStep: [] submitted hadoop job: job_201307171509_0002
13/07/17 15:36:26 INFO util.Hadoop18TapUtil: deleting temp path output/rain/_temporary
[surachart@sopun part1]$ hadoop fs -ls hdfs://localhost:9000/user/surachart/output/rain/
Found 4 items
-rw-r--r-- 1 surachart supergroup 0 2013-07-17 15:36 /user/surachart/output/rain/_SUCCESS
drwxr-xr-x - surachart supergroup 0 2013-07-17 15:36 /user/surachart/output/rain/_logs
-rw-r--r-- 1 surachart supergroup 308 2013-07-17 15:36 /user/surachart/output/rain/part-00000
-rw-r--r-- 1 surachart supergroup 214 2013-07-17 15:36 /user/surachart/output/rain/part-0000
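In this run the relative paths data/rain.txt and output/rain appear to have been resolved against the cluster's default filesystem, which is why the results show up under /user/surachart in HDFS. The Hfs tap also accepts fully qualified URIs, so the same flow can be pointed at HDFS explicitly; a small sketch, where the NameNode address and paths are just examples taken from the listing above:

// the same taps as before, but with explicit HDFS URIs instead of relative paths
Tap inTap  = new Hfs(new TextDelimited(true, "\t"),
                     "hdfs://localhost:9000/user/surachart/data/rain.txt");
Tap outTap = new Hfs(new TextDelimited(true, "\t"),
                     "hdfs://localhost:9000/user/surachart/output/rain");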
Anyway, I don't want to give away too much of the examples or what I learned from them; I just found that I got a much better idea of Cascading and enjoyed working through them. Readers who want to use Cascading should have this book. It will help them, because it has plenty of examples and is easy to read. Readers can also check out the sampler first.

 
