Making the most of this powerful MapReduce platform means mastering a vibrant ecosystem of quickly evolving code
18 essential Hadoop tools for crunching big data
Hadoop has grown to stand for so much more than a smallish stack of code for spreading work to a group of computers. This core proved to be so useful that a large collection of projects now orbits around it. Some offer data housekeeping, others monitor progress, and some deliver sophisticated data storage.
The Hadoop community is fast evolving to include businesses that offer support, rent time on managed clusters, build sophisticated enhancements to the open source core, or add their own tools to the mix.
Here is a look at the most prominent pieces of today’s Hadoop ecosystem. What appears here is a foundation of tools and code that runs together under the collective heading "Hadoop."
While many refer to the entire constellation of map and reduce tools as Hadoop, there's still one small pile of code at the center known as Hadoop. The Java-based code synchronizes worker nodes in executing a function on data stored locally. Results from these worker nodes are aggregated and reported. The first step is known as "map"; the second, "reduce."
Hadoop offers a thin abstraction over local data storage and synchronization, allowing programmers to concentrate on writing code for analyzing the data. Hadoop handles the rest. The job is split up and scheduled by Hadoop. Errors or failures are expected, and Hadoop is designed to work around faults by individual machines.
Setting up a Hadoop cluster involves plenty of repetitive work. Ambari offers a Web-based GUI with wizard scripts for setting up clusters with most of the standard components. Once you get Ambari up and running, it will help you provision, manage, and monitor a cluster of Hadoop jobs. The image at left shows one of the screens after starting up a cluster.
Ambari is an incubator project at Apache and is supported by Hortonworks. The code is available at http://incubator.apache.org/ambari/.
HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System offers a basic framework for splitting up data collections between multiple nodes while using replication to recover from node failure. The large files are broken into blocks, and several nodes may hold all of the blocks from a file. The image from the Apache documentation at left shows how blocks are distributed across multiple nodes.
The file system is designed to mix fault tolerance with high throughput. The blocks are loaded to maintain steady streaming and are not usually cached to minimize latency. The default model imagines long jobs processing lots of locally stored data. This fits the project's motto that “moving computation is cheaper than moving data.”
The HDFS is also distributed under the Apache license from http://hadoop.apache.org/.
When the data falls into a big table, HBase will store it, search it, and automatically share the table across multiple nodes so MapReduce jobs can run locally. The billions of rows of data go in, and then the local versions of the jobs can query them.
The code does not offer the full ACID guarantees of full-function databases, but it does offer a limited guarantee for some local changes. All mutations in a single row will succeed or fail together.
The system, often compared to Google's BigTable, can be found at http://hbase.apache.org. The image at left shows a screenshot from HareDB, a GUI client for working with HBase.
Getting data into the cluster is just the beginning of the fun. Hive is designed to regularize the process of extracting bits from all of the files in HBase. It offers an SQL-like language that will dive into the files and pull out the snippets your code needs. The data arrives in standard formats, and Hive turns it into a query-able stash.
The image at left shows a snippet of Hive code for creating a table, adding data, and selecting information.
Hive is distributed by the Apache project at http://hive.apache.org/
Getting the treasure trove of the data stored in SQL databases into Hadoop requires a bit of massaging and manipulating. Sqoop moves large tables full of information out of the traditional databases and into the control of tools like Hive or HBase.
Sqoop is a command-line tool that controls the mapping between the tables and the data storage layer, translating the tables into a configurable combination for HDFS, HBase, or Hive. The image from the Apache literature at left shows Sqoop living in between the traditional repositories and the Hadoop structures living on the node.
The latest stable version is 1.4.4, but version 2.0 is progressing well. Both are available from http://sqoop.apache.org/ under the Apache license.
Once data is stored in nodes in a way Hadoop can find it, the fun begins. Apache's Pig plows through the data, running code written in its own language, called Pig Latin, filled with abstractions for handling the data. This structure steers users toward algorithms that are easy to run in parallel across the cluster.
Pig comes with standard functions for common tasks like averaging data, working with dates, or finding differences between strings. When those aren't enough -- as they often aren't -- you can write your own functions. The image at left shows one elaborate example from Apache's documentation of how you can mix your own code with Pig's to mine the data.
The latest version can be found at http://pig.apache.org.
Once Hadoop runs on more than a few machines, making sense of the cluster requires order, especially when some of the machines start checking out.
ZooKeeper imposes a file system-like hierarchy on the cluster and stores all of the metadata for the machines so you can synchronize the work of the various machines. (The image at left shows a simple two-tiered cluster.) The documentation shows how to implement many of the standard techniques for data processing, such as producer-consumer queues so the data is chopped, cleaned, sifted, and sorted in the right order. The nodes use ZooKeeper to signal each other when they're done so the others can start up with the data.
For more information, documentation, and the latest builds turn to http://zookeeper.apache.org/.
Not all Hadoop clusters use HBase or HDFS. Some integrate with NoSQL data stores that come with their own mechanisms for storing data across a cluster of nodes. This enables them to store and retrieve data with all the features of the NoSQL database and then use Hadoop to schedule data analysis jobs on the same cluster.
Most commonly this means Cassandra, Riak, or MongoDB, and users are actively exploring the best way to integrate the two technologies. 10Gen, one of the main supporters of MongoDB, for instance, suggests that Hadoop can be used for offline analytics while MongoDB can gather statistics from the Web in real time. The illustration at left shows how a connector can migrate data between the two.
There are a great number of algorithms for data analysis, classification, and filtering, and Mahout is a project designed to bring implementations of these to Hadoop clusters. Many of the standard algorithms, such as K-Means, Dirichelet, parallel pattern, and Bayesian classification, are ready to run on your data with a Hadoop-style map and reduce.
The image at left shows the result of a canopy-clustering algorithm that chooses points and radii to cover the collection of points. It's just one of the various data analysis tools built into Hadoop.
Mahout comes from the Apache project and is distributed under the Apache license from http://mahout.apache.org/.
There is but one tool for indexing large blocks of unstructured text, and it's a natural partner for Hadoop. Written in Java, Lucene integrates easily with Hadoop, creating one big tool for distributed text management. Lucene handles the indexing; Hadoop distributes queries across the cluster.
New Lucene-Hadoop features are rapidly evolving as new projects. Katta, for instance, is a version of Lucene that automatically shards across a cluster. Solr offers more integrated solutions for dynamic clustering with the ability to parse standard file formats like XML. The illustration shows Luke, a GUI for browsing Lucene images. It now sports a plug-in for browsing indices in a Hadoop cluster.
Lucene and many of its descendants are part of the Apache project and available from http://www.apache.org.
When Hadoop jobs need to share data, they can use any database. Avro is a serialization system that bundles the data together with a schema for understanding it. Each packet comes with a JSON data structure explaining how the data can be parsed. This header specifies the structure for the data up top avoiding the need to write out extra tags in the data to mark the fields. The result can be considerably more compact than traditional formats, such as XML or JSON, when the data is regular.
The illustration shows an Avro schema for a file with three different fields: name, favorite number, and favorite color.
Avro is another Apache project with APIs and code in Java, C++, Python, and other languages at http://avro.apache.org.
Breaking up a job into steps simplifies everything. If you break your project into multiple Hadoop jobs, Oozie is ready to start them up in the right sequence. You won't need to babysit the stack, waiting for one job to end before starting another. Oozie manages a workflow specified as a DAG (directed acyclic graph). (Cyclic graphs are endless loops, and they're traps for computers.) Just hand Oozie the DAG and go out to lunch.
The image at left shows one flowchart from the Oozie documentation. The code, protected by the Apache license, is found at http://oozie.apache.org/.
The world is a big place and working with geographic maps is a big job for clusters running Hadoop. The GIS (Geographic Information Systems) tools for Hadoop project has adapted some of the best Java-based tools for understanding geographic information to run with Hadoop. Your databases can handle geographic queries using coordinates instead of strings. Your code can deploy the GIS tools to calculate in three dimensions. The trickiest part is figuring out when the word "map" refers to a flat thing that represents the world and when "map" refers to the first step in a Hadoop job.
The image at left shows several tiers of the tools from the documentation. The tools are available from http://esri.github.io/gis-tools-for-hadoop/.
Gathering all the data is often as much work as storing it or analyzing it. Flume is an Apache project that dispatches "agents" to gather up information that will be stored in the HDFS. An agent might be gathering log files, calling the Twitter API, or scraping a website. These agents are triggered by events and can be chained together. Then the data is ready for analysis.
The code is available under the Apache license from http://flume.apache.org.
SQL on Hadoop
If you want to run a quick, ad-hoc query of all of that data sitting on your huge cluster, you could write a new Hadoop job which would take a bit of time. After programmers started doing this too often, they started pining for the old SQL databases, which could answer questions when posed in that relatively simple language of SQL. They scratched that itch, and now there are a number of tools emerging from various companies. All offer a faster path to answers.
Some of the most notable include: HAWQ, Impalla, Drill, Stinger, and Tajo. There are almost enough for another slideshow.
Many of the cloud platforms are scrambling to attract Hadoop jobs because they can be a natural fit for the flexible business model that rents machines by the minute. Companies can spin up thousands of machines to crunch on a big data set in a short amount of time instead of buying permanent racks of machines that can take days or even weeks to do the same calculation. Some companies, such as Amazon, are adding an additional layer of abstraction by accepting just the JAR file filled with software routines. Everything else is set up and scheduled by the cloud.
The image at left shows some blade computers from Martin Abegglen's Flickr feed.
The future is already coming. For some algorithms, Hadoop can be slow because it generally relies on data stored on disk. That's acceptable when you're processing log files that are only read once, but all of that loading can be a slog when you're accessing data again and again, as is common in some artificial intelligence programs. Spark is the next generation. It works like Hadoop but with data that's cached in memory. The illustration at left, from Apache's documentation, shows just how much faster it can run in the right situations.
Spark is being incubated by Apache and is available from http://spark.incubator.apache.org/.
Don’t have an account? Sign up here
Don't have an account? Sign up now