How Digg.com uses the LAMP stack to scale upward

Caching and 'sharding' data speeds up the social media Web site

Digg.com credits two particular features of its LAMP (Linux, Apache, MySQL and PHP) server cluster for helping the news aggregation site maintain speedy performance in the face of high growth.

The site, which lets its users vote on, or "digg," their favorite news stories hosted on other sites, recently passed the 1.2 million-user mark, according to Elliot White III, an engineer at San Francisco-based Digg. He spoke at MySQL's annual conference in Santa Clara, California, on Tuesday.

Today, Digg.com boasts 100 servers scattered in multiple data centers that host a total of 30GB of data, but the site started off in late 2004 as a single Linux server running Apache 1.3, PHP 4, and MySQL 4.0 using the default MyISAM storage engine, White said.

As more users dug Digg, the site moved to an architecture that uses a load balancer in the front that sends queries to PHP servers, MySQL slave servers that feed the PHP servers, and a MySQL master server that feeds data to the slaves.
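The read/write split described above can be sketched in a few lines. This is only an illustration of the general master/slave pattern, not Digg's code: the host names, the `ReplicatedPool` class, and the rule that anything other than a SELECT counts as a write are all assumptions made for the example.

```python
import itertools


class ReplicatedPool:
    """Toy query router: writes go to the master, reads rotate across slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin iterator over the (hypothetical) slave host names.
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        # Simplifying assumption: any statement that is not a SELECT is a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._slaves) if is_read else self.master


pool = ReplicatedPool("db-master", ["db-slave-1", "db-slave-2"])
targets = [
    pool.route("SELECT title FROM stories WHERE id = 1"),
    pool.route("SELECT title FROM stories WHERE id = 2"),
    pool.route("UPDATE stories SET diggs = diggs + 1 WHERE id = 1"),
]
print(targets)
```

Because the master replicates changes to the slaves, read capacity can be increased by adding slaves, while every write still passes through the single master.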

That's a fairly standard setup. But to get away from "sending raw queries against the database," White said Digg.com uses software called Memcached. First developed for use by the LiveJournal site, Memcached is tailored for dynamic sites like Digg.com, which serve Web pages with content that is constantly changing and is personalized according to user preferences, White said.

Memcached stores chunks of data that can be pulled and used to dynamically create a Web page. Conventional caching systems, which store whole Web pages, would be too slow and inefficient for a site like Digg.
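A rough sketch of the idea follows, assuming a simple "check the cache, fall back to the database" pattern. The `FragmentCache` class is an in-memory stand-in for a real Memcached client, and the key format, the 60-second expiry, and the `load_story_from_db` helper are hypothetical names invented for the example.

```python
import time


class FragmentCache:
    """In-memory stand-in for a Memcached client (get/set with expiry)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Expired entries behave like cache misses.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


db_queries = 0  # counts how often the (pretend) database is hit


def load_story_from_db(story_id):
    """Hypothetical database lookup; a real one would run a SQL query."""
    global db_queries
    db_queries += 1
    return {"id": story_id, "title": f"Story {story_id}", "diggs": 42}


def get_story(cache, story_id):
    key = f"story:{story_id}"
    story = cache.get(key)
    if story is None:
        # Cache miss: query the database once, then keep the chunk for reuse.
        story = load_story_from_db(story_id)
        cache.set(key, story, ttl_seconds=60)
    return story


def render_page(cache, username, story_ids):
    # The page itself is personalized, but the story chunks are shared.
    lines = [f"Hello, {username}"]
    for story_id in story_ids:
        story = get_story(cache, story_id)
        lines.append(f"{story['title']} ({story['diggs']} diggs)")
    return "\n".join(lines)


cache = FragmentCache()
page_a = render_page(cache, "alice", [1, 2])
page_b = render_page(cache, "bob", [2, 1])
print(db_queries)
```

Two different pages are built for two users, yet each story is fetched from the database only once. A whole-page cache could not be shared this way, because each user's page differs.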

The other atypical feature of Digg's setup is its use of what Tim Ellis, another Digg engineer, calls "sharding."

A term apparently coined by Google engineers, sharding involves breaking a database into smaller parts in order to isolate heavy loads for better performance.

"If 90 percent of your data is within a certain range, and you can get that part working really fast, then you can help customers," Ellis said. "Then it's OK if the remaining 10 percent is slower."

A database can be sharded by table, date or range. It is similar to partitioning, says Ellis, but with several key differences. Sharding usually involves divvying up data onto different physical machines. Partitioning, in contrast, typically occurs on the same piece of hardware. And while MySQL does not natively allow sharding, it does support partitioned tables, federated tables and clusters.
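Sharding by range, for instance, might look something like the sketch below. The ID boundaries and host names are invented for illustration, and the article does not say how Digg actually assigns data to shards.

```python
import bisect

# Hypothetical layout: each shard holds story IDs below its upper bound
# (exclusive) and at or above the previous shard's bound.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_HOSTS = ["db-shard-a", "db-shard-b", "db-shard-c"]


def shard_for(story_id):
    """Return the host that stores the given story ID."""
    if story_id < 0:
        raise ValueError("story_id must be non-negative")
    # bisect_right counts the bounds that are <= story_id, which is
    # exactly the index of the shard whose range contains the ID.
    index = bisect.bisect_right(SHARD_UPPER_BOUNDS, story_id)
    if index >= len(SHARD_HOSTS):
        raise LookupError(f"no shard configured for story_id {story_id}")
    return SHARD_HOSTS[index]


print(shard_for(42))
print(shard_for(1_000_000))
print(shard_for(2_999_999))
```

Since MySQL does not do this routing itself, a lookup like this has to live in the application or in a layer in front of the database. That is part of the added complexity that comes with sharding.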

While sharding has helped Digg.com achieve much faster performance overall, breaking a database into several smaller ones increases complexity, Ellis said. That can mean more work for developers and database administrators, because of the inability to use common SQL commands such as joining tables. "Developers don't like this crazy stuff. That can create pushback," he said.
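To see why developers might push back, consider what replaces a SQL join once two tables live on different machines. The sketch below is an assumption-laden illustration: the two dictionaries stand in for separate database servers, and all table contents and function names are hypothetical.

```python
# Hypothetical shards: users live on one server and stories on another,
# so a single SQL JOIN cannot span them.
USERS_SHARD = {1: {"name": "alice"}, 2: {"name": "bob"}}
STORIES_SHARD = [
    {"id": 10, "submitter_id": 2, "title": "Sharding explained"},
    {"id": 11, "submitter_id": 1, "title": "Memcached tips"},
    {"id": 12, "submitter_id": 2, "title": "InnoDB vs. MyISAM"},
]


def fetch_stories():
    """Stand-in for a SELECT against the stories shard."""
    return list(STORIES_SHARD)


def fetch_users(user_ids):
    """Stand-in for SELECT ... WHERE id IN (...) against the users shard."""
    return {uid: USERS_SHARD[uid] for uid in user_ids if uid in USERS_SHARD}


def stories_with_submitters():
    # Step 1: query the first shard.
    stories = fetch_stories()
    # Step 2: collect the foreign keys and query the second shard once.
    submitter_ids = {story["submitter_id"] for story in stories}
    users = fetch_users(submitter_ids)
    # Step 3: merge the two result sets in application code.
    return [
        {"title": story["title"], "submitter": users[story["submitter_id"]]["name"]}
        for story in stories
        if story["submitter_id"] in users
    ]


joined = stories_with_submitters()
print(joined)
```

What would be one JOIN on a single database becomes two queries plus merge logic that the developer has to write and maintain.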

Digg's current architecture includes about 20 database servers, 30 Web servers, and a few search servers running Lucene; the balance operate as backup servers. All but one of the database servers run some version of MySQL 5. The transaction-heavy servers as well as the backup units use the InnoDB database engine, while the OLAP ones use MyISAM.

Ellis acknowledges that Digg.com "is really lucky" in that 98 percent of the time the database is accessed, it is being read, as opposed to experiencing more intensive data writes.

"Most people come to Digg's front page, read it and leave, which is kind of nice," said Ellis, drawing a knowing laugh from the audience of mostly PHP developers and DBAs.

Ellis also noted that although many users have complained that upgrading to MySQL 5 from 4.1 caused performance to drop, that was not true in Digg.com's case.

Maintaining Digg.com's high performance as the site grows more and more popular presents challenges to Digg engineers. For one thing, the company is unable to keep scaling by buying more physical memory. "We can't afford that anymore," White said.

Preventing Digg's enthusiastic developers from adding powerful but CPU-intensive features is "a political thing I constantly have to deal with as a DBA," said White.

Also, Digg was having a problem with its storage misreporting the status of data synchronizations. "Our hardware wanted to be fast," White said. "It was telling us things were synced to disk when it was not."

Finally, there is the mundane challenge of minimizing "schema cruft," or redundant tables of data which, if read, can slow down performance, said White.

"Everyone has to do this," he said.

Eric Lai

Computerworld