Sunday, October 19, 2014

Learning Spark Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia; O'Reilly Media

Apache Spark started as a research project at UC Berkeley in the AMPLab, which focuses on big data analytics. Spark is an open source cluster computing platform designed to be fast and general-purpose for data analytics - It's both fast to run and write. Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly much quicker than with disk-based systems like Hadoop MapReduce. Users can write applications quickly in Java, Scala or Python. In additional, it's easy to run standalone or on EC2 or Mesos. It can read data from HDFS, HBase, Cassandra, and any Hadoop data source.
If you would like a book about Spark - Learning Spark Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. It's a great book for who is interested in Spark development and starting with it. Readers will learn how to express MapReduce jobs with just a few simple lines of Spark code and more...
  • Quickly dive into Spark capabilities such as collect, count, reduce, and save
  • Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
  • Learn how to run interactive, iterative, and incremental analyses
  • Integrate with Scala to manipulate distributed datasets like local collections
  • Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
  • Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming
With Early Release - 7 chapters. Explained Apache Spark overview, downloading and commands that should know, programming with RDDS (+ more advance) as well as working with Key-Value Pairs, etc. Easy to read and Good examples in a book. For people who want to learn Apache Spark or use Spark for Data Analytic. It's a book, that should keep in shelf.

Book: Learning Spark Lightning-Fast Big Data Analytics
Authors: Holden KarauAndy KonwinskiPatrick WendellMatei Zaharia

Thursday, October 09, 2014

Using Flume - Flexible, Scalable, and Reliable Data Streaming by Hari Shreedharan; O'Reilly Media

Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. How to deliver log to Hadoop HDFS. Apache Flume is open source to integrate with HDFS, HBASE and it's a good choice to implement for log data real-time collection from front end or log data system.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.It uses a simple data model. Source => Channel => Sink
It's a good time to introduce a good book about Flume - Using Flume - Flexible, Scalable, and Reliable Data Streaming by Hari Shreedharan (@harisr1234). It was written with 8 Chapters: giving basic about Apache Hadoop and Apache HBase, idea for Streaming Data Using Apache Flume, about Flume Model (Sources, Channels, Sinks), and some moew for Interceptors, Channel Selectors, Sink Groups, and Sink Processors. Additional, Getting Data into Flume* and Planning, Deploying, and Monitoring Flume.

This book was written about how to use Flume. It's very good to guide about Apache Hadoop and Apache HBase before starting about Flume Data flow model. Readers should know about java code, because they will find java code example in a book and easy to understand. It's a good book for some people who want to deploy Apache Flume and custom components.
Author separated each Chapter for Flume Data flow model. So, Readers can choose each chapter to read for part of Data flow model: reader would like to know about Sink, then read Chapter 5 only until get idea. In addition, Flume has a lot of features, Readers will find example for them in a book. Each chapter has references topic, that readers can use it to find out more and very easy + quick to use in Ebook.
With Illustration in a book that is helpful with readers to see Big Picture using Flume and giving idea to develop it more in each System or Project.
So, Readers will be able to learn about operation and how to configure, deploy, and monitor a Flume cluster, and customize examples to develop Flume plugins and custom components for their specific use-cases.
  • Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
  • Dive into key Flume components, including sources that accept data and sinks that write and deliver it
  • Write custom plugins to customize the way Flume receives, modifies, formats, and writes data
  • Explore APIs for sending data to Flume agents from your own applications
  • Plan and deploy Flume in a scalable and flexible way—and monitor your cluster once it’s running
Book: Using Flume - Flexible, Scalable, and Reliable Data Streaming
Author: Hari Shreedharan

Monday, October 06, 2014

rsyslog: Send logs to Flume

Good day for learning something new. After read Flume book, that something popped up in my head. Wanted to test "rsyslog" => Flume => HDFS. As we know, forwarding log to other systems. We can set rsyslog:
*.* @YOURSERVERADDRESS:YOURSERVERPORT ## for UDP
*.* @@YOURSERVERADDRESS:YOURSERVERPORT ## for TCP
For rsyslog:
[root@centos01 ~]# grep centos /etc/rsyslog.conf
*.* @centos01:7777
Came back to Flume, I used Simple Example for reference and changed a bit. Because I wanted it write to HDFS.
[root@centos01 ~]# grep "^FLUME_AGENT_NAME\="  /etc/default/flume-agent
FLUME_AGENT_NAME=a1
[root@centos01 ~]# cat /etc/flume/conf/flume.conf
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
#a1.sources.r1.type = netcat
a1.sources.r1.type = syslogudp
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 7777
# Describe the sink
#a1.sinks.k1.type = logger
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:8020/user/flume/syslog/%Y/%m/%d/%H/
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.batchSize = 10000
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 10000
a1.sinks.k1.hdfs.filePrefix = syslog
a1.sinks.k1.hdfs.round = true


# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
[root@centos01 ~]# /etc/init.d/flume-agent start
Flume NG agent is not running                              [FAILED]
Starting Flume NG agent daemon (flume-agent):              [  OK  ]
Tested to login by ssh.
[root@centos01 ~]#  tail -0f  /var/log/flume/flume.log
06 Oct 2014 16:35:40,601 INFO  [hdfs-k1-call-runner-0] (org.apache.flume.sink.hdfs.BucketWriter.doOpen:208)  - Creating hdfs://localhost:8020/user/flume/syslog/2014/10/06/16//syslog.1412588139067.tmp
06 Oct 2014 16:36:10,957 INFO  [hdfs-k1-roll-timer-0] (org.apache.flume.sink.hdfs.BucketWriter.renameBucket:427)  - Renaming hdfs://localhost:8020/user/flume/syslog/2014/10/06/16/syslog.1412588139067.tmp to hdfs://localhost:8020/user/flume/syslog/2014/10/06/16/syslog.1412588139067
[root@centos01 ~]# hadoop fs -ls hdfs://localhost:8020/user/flume/syslog/2014/10/06/16/syslog.1412588139067
14/10/06 16:37:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   1 flume supergroup        299 2014-10-06 16:36 hdfs://localhost:8020/user/flume/syslog/2014/10/06/16/syslog.1412588139067
[root@centos01 ~]#
[root@centos01 ~]#
[root@centos01 ~]# hadoop fs -cat hdfs://localhost:8020/user/flume/syslog/2014/10/06/16/syslog.1412588139067
14/10/06 16:37:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
sshd[20235]: Accepted password for surachart from 192.168.111.16 port 65068 ssh2
sshd[20235]: pam_unix(sshd:session): session opened for user surachart by (uid=0)
su: pam_unix(su-l:session): session opened for user root by surachart(uid=500)
su: pam_unix(su-l:session): session closed for user root
Look good... Anyway, It needs to adapt more...