Hard data

Two research groups study the fallibilities of hard disks

No theory is ever as good as lots of real-world data. So here, based on lots of real-world data, is what you should do to minimize problems with hard disk drives: a) burn them in rigorously; b) replace them as soon as they start throwing errors, especially scan errors; and c) retire them before they turn three years old. Oh, and d) remember that none of those measures is a substitute for regular backups.
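For anyone who wants that checklist in script form, here is a minimal sketch in Python. It is only an illustration: the `DriveStatus` record, its field names, the `recommend` function and the hard three-year cutoff are all assumptions made for this example, not anything either study prescribes.

```python
from dataclasses import dataclass


@dataclass
class DriveStatus:
    """Hypothetical per-drive record; the field names are illustrative."""
    age_years: float
    burned_in: bool
    scan_errors: int
    other_errors: int


def recommend(drive: DriveStatus, max_age_years: float = 3.0) -> str:
    """Apply the column's rules of thumb in order: burn in, replace, retire."""
    if not drive.burned_in:
        return "burn in"   # a) test rigorously before trusting the drive
    if drive.scan_errors > 0 or drive.other_errors > 0:
        return "replace"   # b) any errors, especially scan errors
    if drive.age_years >= max_age_years:
        return "retire"    # c) assumed cutoff: the third birthday
    return "keep"          # d) still needs regular backups


if __name__ == "__main__":
    print(recommend(DriveStatus(age_years=1.5, burned_in=True,
                                scan_errors=2, other_errors=0)))
```

How the error counts get collected is left out on purpose, since that depends on whatever monitoring a shop already runs.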

That's the gist of a pair of amazing studies presented at the FAST '07 storage conference this month. Two separate research groups each collected data on 100,000 disk drives, some of which failed -- then they crunched the numbers to identify how the drives failed, what they (mainly) failed from and what they (mostly) didn't fail from.

And ho boy, do they ever fail. Hard drives are the most commonly replaced hardware item in many data centers, and they account for 16 percent of all hardware-related outages. Anything that tells us how to keep them from dropping dead is money in the bank for IT shops.

One of the studies, from Carnegie Mellon University, got its statistics from a wide range of sites, including the Los Alamos National Laboratory, the Pittsburgh Supercomputing Center and various Internet service providers. (You can find that study online at www.usenix.org/events/fast07/tech/schroeder.html.)

The other study sifted through data from Google's automated system for tracking performance of drives in its own huge storage farms. That one's at http://labs.google.com/papers/disk_failures.pdf.

If those two populations sound very much alike -- well, listen harder. High-performance computing centers tend to buy gear with high-performance specs. Google, on the other hand, is notoriously cheap when it comes to hardware -- it buys garden-variety hard drives in large lots from whoever is offering the best deal that particular week.

But it turns out that high-end and consumer drives have a lot in common. For one thing, they typically don't last the five years that drive vendors say they should, at least not in server-farm settings. Drive failures at Google take a big jump once drives get to be more than two years old. And according to the Carnegie Mellon team, those rising failure rates never level off -- they just keep going up as drives get older.

Think using a drive a lot will make it much more likely to fail? Nope, say the guys from Google. Low-utilization drives fail at almost exactly the same rate as high-utilization drives.

Think RAID is a guarantee against a storage catastrophe? Don't believe it, say the Carnegie Mellon folks. According to their real-world data, in RAID 5 arrays, when one drive fails, another drive failure will often happen much sooner than it theoretically should -- maybe even before you've replaced the bad drive and rebuilt the data set on the RAID array.
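To see what "theoretically should" means here, consider the textbook baseline: drives fail independently at a constant rate, so the chance that one of the surviving drives dies during the rebuild window follows from an exponential model. The sketch below computes that baseline. The function name, the seven surviving drives, the 24-hour rebuild and the 1,000,000-hour MTTF are all hypothetical values chosen for illustration, not figures from either study.

```python
import math


def second_failure_probability(remaining_drives: int,
                               rebuild_hours: float,
                               mttf_hours: float) -> float:
    """Chance that at least one surviving drive fails during the rebuild.

    Assumes independent drives with exponentially distributed lifetimes,
    which is the idealized model, not what the field data shows.
    """
    combined_rate = remaining_drives / mttf_hours  # failures per hour
    return 1.0 - math.exp(-combined_rate * rebuild_hours)


if __name__ == "__main__":
    # Hypothetical array: 8 drives, one already failed, 24-hour rebuild.
    p = second_failure_probability(remaining_drives=7,
                                   rebuild_hours=24.0,
                                   mttf_hours=1_000_000.0)
    print(f"Idealized chance of a second failure: {p:.6f}")
```

The number this model produces is tiny. The Carnegie Mellon point is that real arrays don't behave this politely, so treat the result as a floor, not a forecast.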

Think overheating drives are a major source of failure? Only when they get really, really hot, say the Googlers -- and even then, it's mainly a problem for drives that are three years old.

And while the Carnegie Mellon study found that SCSI, Fibre Channel and Serial ATA drives all fail at about the same rate, the Google study determined that some brands and models actually are much better when it comes to drive survival. Which are the best hard drives to buy? The Google guys aren't saying, the creeps.

But aside from that, these studies are a real gift to data centers.

Until now, all we've had to go on was vendor specsmanship, anecdotal experience and small-scale research. We had our rules of thumb and our theories -- and that was it.

Now we've got hard data to work from, lots of it. We know much more about how drives die, and how they don't. And now we can do something about it.

Burn 'em in. Watch for errors. Don't let 'em grow old.

And test those backups, OK?

Frank Hayes

Computerworld