Sunday, October 26, 2014

Getting Started with Impala Interactive SQL for Apache Hadoop by John Russell; O'Reilly Media

Impala is an open source query engine that runs on Apache Hadoop. With Impala, you can query data stored in HDFS or Apache HBase in real time, using SELECT, JOIN, and aggregate functions. If you are looking for a book to get started with it, consider Getting Started with Impala Interactive SQL for Apache Hadoop by John Russell (@max_webster). It helps readers write, tune, and port SQL queries and other statements for a Big Data environment using Impala. The SQL examples in this book start from a simple base for easy comprehension, then build toward best practices that demonstrate high performance and scalability. Readers can download and install the QuickStart VMs, then use them to work through the examples in the book.
The book does not cover installing Impala or troubleshooting installation and configuration issues. It has 5 chapters and is short in page count, but that is enough to guide readers in using Impala (Interactive SQL), and it has good examples. Chapter 5, Tutorials and Deep Dives, is the highlight of the book, and its examples are very useful.
Free Sampler.

This book helps readers:
  • Learn how Impala integrates with a wide range of Hadoop components
  • Attain high performance and scalability for huge data sets on production clusters
  • Explore common developer tasks, such as porting code to Impala and optimizing performance
  • Use tutorials for working with billion-row tables, date- and time-based values, and other techniques
  • Learn how to transition from rigid schemas to a flexible model that evolves as needs change
  • Take a deep dive into joins and the roles of statistics
[test01:21000] > select "Surachart Opun" Name,  NOW() ;
Query: select "Surachart Opun" Name,  NOW()
+----------------+-------------------------------+
| name           | now()                         |
+----------------+-------------------------------+
| Surachart Opun | 2014-10-25 23:34:03.217635000 |
+----------------+-------------------------------+
Returned 1 row(s) in 0.14s
Author: John Russell (@max_webster)

Sunday, October 19, 2014

Learning Spark Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia; O'Reilly Media

Apache Spark started as a research project at UC Berkeley in the AMPLab, which focuses on big data analytics. Spark is an open source cluster computing platform designed to be fast and general-purpose for data analytics: it is fast both to run and to write. Spark provides primitives for in-memory cluster computing: your job can load data into memory and query it repeatedly, much quicker than with disk-based systems like Hadoop MapReduce. Users can write applications quickly in Java, Scala or Python. In addition, it's easy to run standalone or on EC2 or Mesos. It can read data from HDFS, HBase, Cassandra, and any Hadoop data source.
If you would like a book about Spark, consider Learning Spark Lightning-Fast Big Data Analytics by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia. It's a great book for anyone who is interested in Spark development and is just starting with it. Readers will learn how to express MapReduce jobs with just a few simple lines of Spark code, and more:
  • Quickly dive into Spark capabilities such as collect, count, reduce, and save
  • Use one programming paradigm instead of mixing and matching tools such as Hive, Hadoop, Mahout, and S4/Storm
  • Learn how to run interactive, iterative, and incremental analyses
  • Integrate with Scala to manipulate distributed datasets like local collections
  • Tackle partitioning issues, data locality, default hash partitioning, user-defined partitioners, and custom serialization
  • Use other languages by means of pipe() to achieve the equivalent of Hadoop streaming
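To make the partitioning bullet more concrete, here is a minimal sketch of the difference between default hash partitioning and a user-defined partitioner. It is plain Python, not the Spark API: the function names (hash_partition, domain_partition, partition_by) and the sample URLs are hypothetical and exist only for illustration. The idea is that a custom partitioner can send related keys (here, URLs from the same host) to the same partition.

```python
from collections import defaultdict
from zlib import crc32


def hash_partition(key, num_partitions):
    """Default-style partitioner: stable hash of the key, modulo partition count."""
    return crc32(str(key).encode("utf-8")) % num_partitions


def domain_partition(url, num_partitions):
    """Hypothetical user-defined partitioner: partition URLs by host name only."""
    host = url.split("/")[2] if "://" in url else url
    return hash_partition(host, num_partitions)


def partition_by(pairs, num_partitions, partitioner=hash_partition):
    """Distribute (key, value) pairs into num_partitions buckets (local lists)."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partitioner(key, num_partitions)].append((key, value))
    return [buckets[i] for i in range(num_partitions)]


# Illustrative data: page URLs with a visit count.
pairs = [
    ("http://example.com/a", 1),
    ("http://example.com/b", 2),
    ("http://example.org/a", 3),
]

by_key = partition_by(pairs, 4)                     # hash of the full URL
by_host = partition_by(pairs, 4, domain_partition)  # hash of the host only

for index, bucket in enumerate(by_host):
    print(index, bucket)
```

With the host-based partitioner, both example.com pages always land in the same bucket, which is the kind of co-location a user-defined partitioner is meant to provide.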
The Early Release has 7 chapters. They give an overview of Apache Spark, cover downloading it and the commands you should know, and explain programming with RDDs (including more advanced topics) as well as working with key-value pairs, etc. It is easy to read and has good examples. It is for people who want to learn Apache Spark or use Spark for data analytics. It's a book that you should keep on your shelf.

Book: Learning Spark Lightning-Fast Big Data Analytics
Authors: Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

Thursday, October 09, 2014

Using Flume - Flexible, Scalable, and Reliable Data Streaming by Hari Shreedharan; O'Reilly Media

Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. How do you deliver logs to Hadoop HDFS? Apache Flume is an open source tool that integrates with HDFS and HBase, and it's a good choice for real-time collection of log data from front ends or logging systems.
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It uses a simple data model: Source => Channel => Sink.
It's a good time to introduce a good book about Flume: Using Flume - Flexible, Scalable, and Reliable Data Streaming by Hari Shreedharan (@harisr1234). It has 8 chapters: the basics of Apache Hadoop and Apache HBase, the idea of streaming data using Apache Flume, the Flume model (Sources, Channels, Sinks), and more on Interceptors, Channel Selectors, Sink Groups, and Sink Processors. In addition, it covers Getting Data into Flume* and Planning, Deploying, and Monitoring Flume.
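The Source => Channel => Sink flow can be sketched as a toy pipeline. This is plain Python, not Flume code or Flume configuration: the class names (ListSource, MemoryChannel, CollectingSink) and the run_agent helper are hypothetical stand-ins. It only shows the shape of the model, where the channel is a bounded buffer sitting between the producer and the consumer.

```python
import queue


class ListSource:
    """Hypothetical source: turns in-memory log lines into events."""

    def __init__(self, lines):
        self.lines = lines

    def events(self):
        for line in self.lines:
            yield {"body": line}


class MemoryChannel:
    """Hypothetical channel: a bounded in-memory buffer between source and sink."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def put(self, event):
        # Raises queue.Full when the buffer has no room left.
        self._queue.put_nowait(event)

    def take(self):
        try:
            return self._queue.get_nowait()
        except queue.Empty:
            return None


class CollectingSink:
    """Hypothetical sink: delivers event bodies to a list instead of HDFS."""

    def __init__(self):
        self.delivered = []

    def drain(self, channel):
        while True:
            event = channel.take()
            if event is None:
                break
            self.delivered.append(event["body"])


def run_agent(source, channel, sink):
    """Move every event from the source through the channel to the sink."""
    for event in source.events():
        try:
            channel.put(event)
        except queue.Full:
            # Channel is full: let the sink catch up, then retry.
            sink.drain(channel)
            channel.put(event)
    sink.drain(channel)


sink = CollectingSink()
run_agent(ListSource(["log line 1", "log line 2", "log line 3"]), MemoryChannel(2), sink)
print(sink.delivered)
```

The small channel capacity is deliberate: it forces the sink to drain before the source can continue, which is the buffering role the channel plays in the model.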

This book is about how to use Flume. It gives a very good introduction to Apache Hadoop and Apache HBase before starting on the Flume data flow model. Readers should know Java, because the book contains Java code examples, which are easy to understand. It's a good book for people who want to deploy Apache Flume and build custom components.
The author dedicates separate chapters to the parts of the Flume data flow model, so readers can pick the chapter for the part they need: for example, a reader who wants to know about sinks can read Chapter 5 alone. In addition, Flume has a lot of features, and readers will find examples for them in the book. Each chapter has a references section that readers can use to find out more, and it is very easy and quick to use in the ebook.
The illustrations in the book help readers see the big picture of using Flume and give ideas for developing it further in each system or project.
Readers will learn about operations and how to configure, deploy, and monitor a Flume cluster, and how to customize the examples to develop Flume plugins and custom components for their specific use cases.
  • Learn how Flume provides a steady rate of flow by acting as a buffer between data producers and consumers
  • Dive into key Flume components, including sources that accept data and sinks that write and deliver it
  • Write custom plugins to customize the way Flume receives, modifies, formats, and writes data
  • Explore APIs for sending data to Flume agents from your own applications
  • Plan and deploy Flume in a scalable and flexible way—and monitor your cluster once it’s running
Book: Using Flume - Flexible, Scalable, and Reliable Data Streaming
Author: Hari Shreedharan