Researchers: Databases still beat Google's MapReduce

MapReduce and its open-source version Hadoop have many fans

A team of researchers will release on Tuesday a paper showing that parallel SQL databases perform up to 6.5 times faster than Google Inc.'s MapReduce data-crunching technology.

Google bypassed parallel databases and invented MapReduce as a way to index the World Wide Web on its global grid of low-end PC servers. As of January 2008, Google was using MapReduce to process 20 petabytes of data a day.

In results of in-house tests published last November, Google used MapReduce running on 1,000 servers to sort 1TB of data in just 68 seconds.
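
For a sense of scale, the sort figure above can be turned into a rough throughput number. The sketch below is back-of-the-envelope arithmetic only. It assumes a decimal terabyte (10^12 bytes) and an even split of work across servers, neither of which is stated in Google's results, and the function name is made up for illustration.

```python
# Figures quoted in the article.
SERVERS = 1_000
SECONDS = 68

# Assumption: 1TB is treated as a decimal terabyte (10**12 bytes).
TERABYTE_BYTES = 10**12


def sort_throughput(total_bytes, seconds, servers):
    """Return (aggregate, per-server) throughput in bytes per second.

    Assumes the work is spread evenly across all servers.
    """
    aggregate = total_bytes / seconds
    return aggregate, aggregate / servers


aggregate, per_server = sort_throughput(TERABYTE_BYTES, SECONDS, SERVERS)
print(f"{aggregate / 1e9:.1f} GB/s aggregate, {per_server / 1e6:.1f} MB/s per server")
```

Under those assumptions the cluster sorts roughly 14.7GB per second in aggregate, or about 14.7MB per second per server.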

Such results have won MapReduce and its open-source version Hadoop many fans, who argue that the technology is already superior to the 40-year-old relational one for large-scale grids such as cloud-computing infrastructures, and will eventually render databases obsolete for other tasks.

Microsoft technical fellow David DeWitt and Michael Stonebraker, a database industry legend and chief technology officer at Vertica Systems Inc., who co-authored the paper, have previously argued that MapReduce lacks many key features already standard to databases and was generally a "major step backward."

The paper, titled "A Comparison of Approaches to Large-Scale Data Analysis," is sure to stoke heated discussion among data junkies over the technical merits of each approach. It will be published by the Association for Computing Machinery (ACM), a 92,000-member IT society, in the June 29-July 2 issue of its SIGMOD Record journal of data management.

In addition to DeWitt and Stonebraker, five researchers from Brown University, Yale University, MIT and the University of Wisconsin co-authored the report.

In the paper, DeWitt and Stonebraker put meat on their argument by testing two 100-node parallel, "shared-nothing" database clusters, one running the column-based Vertica and another running a row-based database from "a major relational vendor," against a similarly configured MapReduce cluster of the same size.

Each server had a 2.4-GHz Intel Core 2 Duo processor, 4GB of RAM and two 250GB SATA-I hard drives, and ran 64-bit Red Hat Enterprise Linux. The servers were connected by Gigabit Ethernet.

Their conclusion? Databases "were significantly faster and required less code to implement each task, but took longer to tune and load the data," the researchers write. Database clusters were between 3.1 and 6.5 times faster on a "variety of analytic tasks."

MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, they wrote.
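
A small sketch may help show what that difference in effort looks like. The example below is illustrative only and is not taken from the paper. It totals revenue per visitor address twice: once with hand-written map, shuffle and reduce steps, and once with a single SQL statement run through Python's built-in sqlite3 module. The data, column names and function names are all hypothetical, and a single-process script says nothing about performance on a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (source_ip, ad_revenue) pairs.
VISITS = [
    ("10.0.0.1", 1.50),
    ("10.0.0.2", 0.25),
    ("10.0.0.1", 2.00),
    ("10.0.0.3", 4.00),
    ("10.0.0.2", 0.75),
]


def map_visit(record):
    """Map step: emit one (key, value) pair per input record."""
    source_ip, revenue = record
    yield source_ip, revenue


def shuffle(pairs):
    """Shuffle step: group values by key. A framework such as Hadoop
    does this across machines; here it is a single in-memory dict."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_revenue(key, values):
    """Reduce step: collapse all values for one key into a total."""
    return key, sum(values)


def revenue_mapreduce(records):
    """Hand-written pipeline: map, then shuffle, then reduce."""
    pairs = (pair for record in records for pair in map_visit(record))
    grouped = shuffle(pairs)
    return dict(reduce_revenue(key, values) for key, values in grouped.items())


def revenue_sql(records):
    """The same aggregate, expressed declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue REAL)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    ).fetchall()
    conn.close()
    return dict(rows)


print(revenue_mapreduce(VISITS))
print(revenue_sql(VISITS))
```

Both functions return the same totals. In the SQL version the grouping and summing are left to the database engine, while the map/reduce version spells out each step by hand.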

MapReduce may be "well suited for development environments with a small number of programmers and a limited application domain," they said. "This lack of constraints, however, may not be appropriate for longer-term and larger-sized projects."

Database industry analyst Curt Monash agreed with the results. "The results are pretty clear in favor of databases," Monash said. "Databases are more mature products."

The researchers note about a dozen parallel database vendors, including Teradata, Aster Data, Netezza, DATAllegro (now Microsoft), Dataupia, Vertica, ParAccel, Hewlett-Packard, Greenplum, IBM and Oracle.

The results reinforced Monash's belief that MapReduce was superior only for limited kinds of tasks, such as the text indexing and searching Google does, or data mining, he said.

Otherwise, "using MapReduce makes sense for most organizations only when it would otherwise be awkward to use a SQL database," he said.

The researchers did allow that parallel databases, which can be set up in large-scale grids that crunch hundreds of terabytes or even petabytes of data, were "much more challenging" than Hadoop to install and configure properly.

Loading data into MapReduce or Hadoop was also three times faster than into Vertica, and 20 times faster than the unnamed database, they wrote.

The researchers defend basing their tests on 100-server clusters, rather than the 1,000-server clusters used by Google.

"The superior efficiency of modern [databases] alleviates the need to use such massive hardware on data sets in the range of 1-2 PB," they wrote. "Since few data sets in the world even approach a petabyte in size, it is not at all clear how many MapReduce users really need 1,000 nodes."


Eric Lai

Computerworld (US)