Facebook moves 30-petabyte Hadoop cluster to new data center

Exponential growth in data volumes prompts Facebook's largest data migration project

As the world's largest social network, Facebook accumulates more data in a single day than many good-sized companies generate in a year.

Facebook stores much of this data on its massive Hadoop cluster, which has grown exponentially in recent years.

Today the cluster holds a staggering 30 petabytes of data, or, as Facebook puts it, about 3,000 times more information than is stored by the Library of Congress. The data store has grown by more than a third in the past year, the company notes.

To accommodate the surging data volumes, the company earlier this year launched an effort to move the ever-growing Hadoop cluster to a new, bigger Facebook data center in Prineville, Ore. The effort, the biggest data migration ever undertaken at Facebook, was completed last month, the company said.

Paul Yang, an engineer on Facebook's data infrastructure team, outlined details of the project this week on the company's blog. Yang said the migration to the new data center was necessary because the company had run out of available power and space, leaving it unable to add nodes to the Hadoop cluster.

Yang was not immediately available to speak with Computerworld about the effort.

Facebook's experience with Hadoop is likely to be of interest to a growing number of companies that are tapping the Apache open source software to capture and analyze huge volumes of structured and unstructured data.

Much of Hadoop's appeal lies in its ability to break very large data sets into smaller blocks that are then distributed across a cluster of commodity hardware systems for faster, parallel processing.
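
To make that distribution concrete, the short Java sketch below uses the standard Hadoop FileSystem API to list where the blocks of one file physically live; the file path is hypothetical, and the host names would come back from the cluster's NameNode.

```java
// A minimal sketch, using the standard Hadoop FileSystem API, of how HDFS
// exposes a file's block layout. The file path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml/hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        FileStatus status = fs.getFileStatus(new Path("/warehouse/clicks/part-00000"));
        // Each block is replicated on several DataNodes; MapReduce schedules
        // tasks close to wherever a block physically lives.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```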

A Ventana Research report released this week showed that a growing number of enterprises have started using Hadoop to collect and analyze huge volumes of unstructured and machine-generated information, such as log and event data, search engine results, and text and multimedia content from social media sites.

Facebook said it uses Hadoop technology to capture and store billions of pieces of content generated by its members daily. The data is analyzed using the open source Apache Hive data warehousing tool set.
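
Hive exposes that stored data through SQL-like queries. The hedged Java sketch below runs one such query over JDBC against a HiveServer2 endpoint; the server address, table, and column names are invented for illustration and are not Facebook's actual schema.

```java
// A hedged sketch of querying a Hive warehouse over JDBC (HiveServer2).
// The endpoint, table, and columns are placeholders, not Facebook's schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDailyCounts {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL-like query into jobs that scan HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT ds, COUNT(*) FROM content_log GROUP BY ds")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```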

Other data-heavy companies using Hadoop in a similar manner include eBay, Amazon and Yahoo. Yahoo is a major contributor of Hadoop code.

Bloggers in May 2010 described Facebook's Hadoop cluster as the largest in the world.

At the time, the cluster consisted of 2,000 machines: 800 16-core systems and 1,200 8-core systems. Each machine in the cluster stored between 12 and 24 terabytes of data.

Facebook considered two methods for moving the cluster to the new data center, Yang said in his post.

The company could have physically moved each node to the new location, a task that could have been completed in a few days "with enough hands at the job," he said. Facebook decided against that route, however, because it would have resulted in unacceptably long downtime.

Instead, Facebook decided to build a new, larger Hadoop cluster and replicate the data from the old cluster onto it. This was the more complex option because the source data sat on a live system, with files being created and deleted continuously, Yang said in his blog.

Thus Facebook engineers built a new replication system that could handle the unprecedented cluster size and data load. "Because replication minimizes downtime, it was the approach that we decided to use for this massive migration," he said.

According to Yang, the data replication project was accomplished in two steps.

First, most of the data and directories from the original Hadoop cluster were copied in bulk to the new one using an open source tool called DistCp.
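
DistCp ships with Hadoop and runs the copy as a parallel MapReduce job, so many copy tasks proceed at once. A minimal sketch of driving the stock `hadoop distcp` command from Java follows; the cluster URIs and warehouse path are placeholders.

```java
// A minimal sketch of step one: driving the stock DistCp tool from Java.
// The cluster URIs and warehouse path are placeholders.
public class BulkCopy {
    public static void main(String[] args) throws Exception {
        Process p = new ProcessBuilder(
                "hadoop", "distcp",
                "hdfs://old-cluster:8020/warehouse",
                "hdfs://new-cluster:8020/warehouse")
                .inheritIO()   // stream DistCp's job progress to this console
                .start();
        System.exit(p.waitFor());
    }
}
```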

Then, any changes made to files after the bulk copy were replicated to the new cluster using Facebook's newly developed file replication system. The changes were captured by a Hive plug-in that Facebook also developed in-house.
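
Facebook has not published the internals of its replication system or the Hive plug-in, but the rough shape of such a catch-up pass can be sketched: read a log of changed paths and mirror each change onto the new cluster. In the sketch below, everything that is not the standard Hadoop FileSystem API, in particular the readChangeLog() helper and the cluster URIs, is hypothetical.

```java
// A highly simplified sketch of the catch-up pass, assuming a change log of
// modified paths. Facebook's actual system is not public; readChangeLog()
// and the cluster URIs are hypothetical stand-ins.
import java.net.URI;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CatchUpReplicator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem src = FileSystem.get(new URI("hdfs://old-cluster:8020"), conf);
        FileSystem dst = FileSystem.get(new URI("hdfs://new-cluster:8020"), conf);

        for (String changed : readChangeLog()) {
            Path path = new Path(changed);
            if (src.exists(path)) {
                // Created or modified after the bulk copy: overwrite the replica.
                FileUtil.copy(src, path, dst, path,
                        false /* deleteSource */, true /* overwrite */, conf);
            } else {
                // Deleted at the source: mirror the deletion on the replica.
                dst.delete(path, true /* recursive */);
            }
        }
    }

    // Hypothetical stand-in for tailing the change log written by the
    // in-house (unpublished) Hive plug-in.
    private static List<String> readChangeLog() {
        return Arrays.asList("/warehouse/clicks/part-00417");
    }
}
```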

At switchover time, Facebook temporarily shut down Hadoop's ability to create new files and let its replication system finish copying all remaining data onto the new cluster. It then changed its DNS settings so they pointed to the new server.
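
The cut-over itself can be sketched as three steps. In the version below, HDFS safe mode (a read-only state) stands in for however Facebook actually froze file creation, and the two shell scripts are placeholders for the site-specific drain and DNS steps.

```java
// A hedged sketch of the cut-over sequence. HDFS safe mode stands in for
// Facebook's actual freeze mechanism; both scripts are placeholders.
import java.io.IOException;

public class Switchover {
    static void run(String... cmd) throws IOException, InterruptedException {
        int rc = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        if (rc != 0) throw new IllegalStateException("failed: " + String.join(" ", cmd));
    }

    public static void main(String[] args) throws Exception {
        // 1. Stop new files from being created on the old cluster.
        run("hdfs", "dfsadmin", "-safemode", "enter");
        // 2. Let the replication system drain the remaining change log
        //    (placeholder script).
        run("./wait-for-replication-drain.sh");
        // 3. Repoint DNS so clients resolve to the new cluster
        //    (placeholder script).
        run("./update-dns-to-new-cluster.sh");
    }
}
```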

According to Yang, the fast, internally built data replication tool was a key contributor to the success of the migration project.

In addition to its use in the migration, the replication tool now provides disaster-recovery capabilities for the Hadoop cluster, he said.

"We showed that it was possible to efficiently keep an active multi-petabyte cluster properly replicated, with only a small amount of lag," he said. "With replication deployed, operations could be switched over to the replica cluster with relatively little work in case of a disaster."

Jaikumar Vijayan covers data security and privacy issues, financial services security and e-voting for Computerworld. Follow Jaikumar on Twitter at @jaivijayan, or subscribe to Jaikumar's RSS feed. His e-mail address is jvijayan@computerworld.com.

Read more about BI and analytics in Computerworld's BI and Analytics Topic Center.
