Facebook goes open source with query engine for big data

Facebook's Presto can sift through petabytes of data and swiftly return query results, according to the company

The architecture for Facebook's Presto

Potentially raising the bar for SQL scalability, Facebook has released as open source Presto, a SQL query engine the company built to work with petabyte-sized data warehouses.

Currently, more than 1,000 Facebook employees use Presto daily to run 30,000 interactive queries, involving more than a petabyte of processing, according to a post authored by Facebook software engineer Martin Traverso. The company has scaled the software to run on a 1,000-node cluster.

Now Facebook wants other data-driven organizations to use and, it hopes, refine Presto. The company has posted the software's source code and is encouraging contributions from other parties. The software is already being tested by a number of other large Internet services, namely Airbnb and Dropbox.

Standard data warehouses would be hard-pressed to offer the responsiveness of Presto given the amount of data Facebook collects, according to engineers at the company. Facebook's data warehouse holds more than 300 petabytes of material from its users, stored on Hadoop clusters. Presto interacts with this data through interactive analysis, as well as through machine-learning algorithms and standard batch processing.

To analyze this data, Facebook originally used Hadoop MapReduce along with Hive. But as the data warehouse grew, this approach proved to be far too slow.

The Facebook Data Infrastructure group first looked for other software for running faster queries, but didn't find anything that was both mature enough and capable of scaling to the required levels. Instead, the group built its own distributed SQL query engine, using Java.

Presto can do many of the tasks that standard SQL engines can, including complex queries, aggregations, left/right outer joins, subqueries, and most of the common aggregate and scalar functions. It lacks the ability to write results back to data tables and cannot create table joins beyond a certain size.
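For readers who want a concrete picture of the query features listed above, here is a minimal sketch that combines an aggregation, a left outer join and a subquery in one statement. It is an illustration only: it runs against Python's built-in sqlite3 module as a stand-in rather than against Presto, and the `users` and `posts` tables, their columns and the `run_demo` function are hypothetical names invented for the example.

```python
import sqlite3


def run_demo():
    """Run one query that uses an aggregation, a left outer join and a subquery.

    SQLite is used only as a stand-in SQL engine; this is not Presto.
    The tables and columns are hypothetical.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, likes INTEGER);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO posts VALUES (1, 1, 10), (2, 1, 30), (3, 2, 5);
        """
    )
    rows = cur.execute(
        """
        SELECT u.name,
               COUNT(p.id)               AS n_posts,      -- aggregate function
               COALESCE(SUM(p.likes), 0) AS total_likes   -- scalar + aggregate
        FROM users u
        LEFT OUTER JOIN posts p ON p.user_id = u.id       -- keeps users with no posts
        WHERE u.id IN (SELECT id FROM users WHERE name <> 'zed')  -- subquery
        GROUP BY u.name
        ORDER BY u.name
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

The left outer join is what keeps the user with no posts in the result, with a count and total of zero.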

Unlike Hive, Presto does not use MapReduce, which involves writing intermediate results back to disk. Instead, Presto compiles parts of the query on the fly and does all of its processing in memory. As a result, Facebook claims Presto is 10 times better in terms of CPU efficiency and latency than the Hive and MapReduce combo.
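The difference between the two approaches can be sketched in a few lines. The toy example below is not how either MapReduce or Presto is implemented; it only contrasts a two-stage job that writes its intermediate output to a file before the next stage reads it with a pipeline that streams rows between stages in memory. The function names and the `likes` field are hypothetical.

```python
import json
import os
import tempfile


def staged_on_disk(records):
    """Two stages; the first writes its whole output to disk before the second starts."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        # Stage 1: filter, then materialize the intermediate result on disk.
        with open(path, "w") as f:
            json.dump([r for r in records if r["likes"] > 0], f)
        # Stage 2: read the intermediate result back and aggregate it.
        with open(path) as f:
            filtered = json.load(f)
        return sum(r["likes"] for r in filtered)


def pipelined_in_memory(records):
    """Same two stages, but rows flow from filter to aggregate without touching disk."""
    filtered = (r for r in records if r["likes"] > 0)  # lazy generator, nothing stored
    return sum(r["likes"] for r in filtered)


if __name__ == "__main__":
    data = [{"likes": 3}, {"likes": 0}, {"likes": 7}]
    print(staged_on_disk(data), pipelined_in_memory(data))
```

Both functions return the same answer; the second simply skips the write and the read between stages, which is the kind of overhead the in-memory design is meant to avoid.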

Presto is one of a number of newly emerging SQL query engines that tackle the problem of offering speedy results for queries run against large Hadoop data sets. Hadoop distributor Pivotal has developed Hawq for this purpose, and fellow Hadoop distributor Cloudera is working on its own software called Impala.

Another benefit Facebook built into Presto is the ability to work with data sources other than Hadoop. Facebook runs a custom data store for its news feed, for instance, which Presto can also tap into. Facebook has also built connectors for HBase and Scribe. The software is extensible to other sources as well, according to Traverso.
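One common way to make a query engine extensible to new data sources is to have the engine talk to every source through the same small interface. The sketch below shows that general idea under stated assumptions; it is not Presto's actual connector API, which is written in Java and is not described in this article. The `Connector` class, its `scan` method and the two example sources are hypothetical.

```python
import csv
import io
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical interface: any data source that can yield rows as dicts."""

    @abstractmethod
    def scan(self, table):
        """Yield each row of the named table as a dict of column -> value."""


class DictConnector(Connector):
    """Source backed by in-memory lists of dicts."""

    def __init__(self, tables):
        self._tables = tables

    def scan(self, table):
        yield from self._tables[table]


class CsvConnector(Connector):
    """Source backed by CSV text, one string per table."""

    def __init__(self, csv_text_by_table):
        self._texts = csv_text_by_table

    def scan(self, table):
        yield from csv.DictReader(io.StringIO(self._texts[table]))


def count_where(connector, table, column, value):
    """Engine-side logic: works with any Connector, knows nothing about storage."""
    return sum(1 for row in connector.scan(table) if str(row[column]) == str(value))


if __name__ == "__main__":
    mem = DictConnector({"events": [{"kind": "a"}, {"kind": "b"}, {"kind": "a"}]})
    txt = CsvConnector({"events": "kind\na\nb\n"})
    print(count_where(mem, "events", "kind", "a"))
    print(count_where(txt, "events", "kind", "a"))
```

Adding a new source in this sketch means writing one more class with a `scan` method; the counting logic does not change.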

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
