Apache Flume - Distributed Log Collection for Hadoop, 2nd Edition
Hadoop is a great open source tool for shifting tons of unstructured data into something manageable, so that your business can gain better insight into your customers' needs. It's cheap (mostly free), scales horizontally as long as you have space and power in your datacenter, and can handle problems that would crush your traditional data warehouse. That said, a little-known secret is that your Hadoop cluster requires you to feed it data; otherwise, you just have a very expensive heat generator! Once you get past the "playing around" phase with Hadoop, you will quickly realize that you need a tool to automatically feed data into your cluster. In the past, you had to come up with a solution to this problem yourself, but no more! Flume started as a project at Cloudera, whose integration engineers had to keep writing tools over and over again for their customers to automatically import data. Today, the project lives at the Apache Software Foundation, is under active development, and boasts users who have been running it in their production environments for years.

In this book, I hope to get you up and running quickly with an architectural overview of Flume and a quick-start guide. After that, we'll dive deep into the details of many of the more useful Flume components, including the very important file channel for the persistence of in-flight data records and the HDFS sink for buffering and writing data into HDFS (the Hadoop Distributed File System). Since Flume comes with a wide variety of modules, chances are that the only tool you'll need to get started is a text editor for the configuration file.

By the time you reach the end of this book, you should know enough to build a highly available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.
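To give a flavor of that configuration-file-driven approach, here is a minimal sketch of what a Flume agent definition looks like. The agent name (`agent`), component names (`r1`, `c1`, `k1`), host, port, and directory paths are all hypothetical placeholders; the sketch assumes the standard netcat source, file channel, and HDFS sink that ship with Flume:

```properties
# Name the components of this (hypothetical) agent
agent.sources = r1
agent.channels = c1
agent.sinks = k1

# A netcat source that listens for lines of text on a local port
agent.sources.r1.type = netcat
agent.sources.r1.bind = localhost
agent.sources.r1.port = 44444
agent.sources.r1.channels = c1

# A durable file channel that persists in-flight events to disk
agent.channels.c1.type = file
agent.channels.c1.checkpointDir = /var/flume/checkpoint
agent.channels.c1.dataDirs = /var/flume/data

# An HDFS sink that batches events and writes them into HDFS
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
agent.sinks.k1.channel = c1
```

You would then start the agent with the `flume-ng` launcher included in the distribution, along the lines of `flume-ng agent -n agent -f agent.conf` (exact flags vary by version). The chapters that follow cover these components, and their many tuning knobs, in detail.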