Supercomputers face growing resilience problems

As the number of components in large supercomputers grows, so does the possibility of component failure

As supercomputers grow more powerful, they'll also grow more vulnerable to failure, thanks to the increased amount of built-in componentry. A few researchers at the recent SC12 conference, held last week in Salt Lake City, offered possible solutions to this growing problem.

Today's high-performance computing (HPC) systems can have 100,000 nodes or more -- with each node built from multiple components of memory, processors, buses and other circuitry. Statistically speaking, all these components will fail at some point, and they halt operations when they do so, said David Fiala, a Ph.D. student at North Carolina State University, during a talk at SC12.

The problem is not a new one, of course. When Lawrence Livermore National Laboratory's 600-node ASCI (Accelerated Strategic Computing Initiative) White supercomputer went online in 2001, it had a mean time between failures (MTBF) of only five hours, thanks in part to component failures. Later tuning efforts improved ASCI White's MTBF to 55 hours, Fiala said.

But as the number of supercomputer nodes grows, so will the problem. "Something has to be done about this. It will get worse as we move to exascale," Fiala said, referring to how supercomputers of the next decade are expected to have 10 times the computational power that today's models do.

Today's techniques for dealing with system failure may not scale very well, Fiala said. He cited checkpointing, in which a running program is temporarily halted and its state is saved to disk. Should the program then crash, the system is able to restart the job from the last checkpoint.
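The checkpoint-and-restart cycle described above can be illustrated with a minimal single-process sketch. It is not how any production HPC checkpointing library works; the function names, the state dictionary, and the `crash_at` parameter are all hypothetical and exist only to simulate a failure partway through a job.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target so a
    crash during the write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, crash_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.
    `crash_at` (hypothetical) simulates a failure at that step."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "job.ckpt")
        try:
            run_job(ckpt, total_steps=10, checkpoint_every=5, crash_at=7)
        except RuntimeError:
            pass  # the job died; work done since the last checkpoint is lost
        # The rerun resumes from the saved step instead of starting over.
        print(run_job(ckpt, total_steps=10, checkpoint_every=5))
```

In this sketch the work done between the last checkpoint and the crash is repeated on restart, which is the recovery cost that grows as failures become more frequent.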

The problem with checkpointing, according to Fiala, is that as the number of nodes grows, the amount of system overhead needed to do checkpointing grows as well -- and grows at an exponential rate. On a 100,000-node supercomputer, for example, only about 35 percent of the activity will be involved in conducting work. The rest will be taken up by checkpointing and -- should a system fail -- recovery operations, Fiala estimated.
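To see why checkpointing eats into useful work as failures become more frequent, here is a rough sketch using Young's classic first-order approximation for the checkpoint interval. This is a textbook simplification, not the model behind Fiala's estimate, and it is not meant to reproduce his 35 percent figure; the checkpoint cost, restart cost, and MTBF values are assumptions chosen for illustration.

```python
import math


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order estimate of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def useful_fraction(checkpoint_cost, restart_cost, mtbf, interval=None):
    """First-order estimate of the share of time spent on real work.

    Waste has two parts: time spent writing checkpoints, and, per failure,
    the restart cost plus on average half an interval of lost work.
    All arguments use the same time unit (hours here).
    """
    if interval is None:
        interval = young_interval(checkpoint_cost, mtbf)
    waste = checkpoint_cost / interval + (interval / 2.0 + restart_cost) / mtbf
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # Assumed values: 6-minute checkpoints and restarts (0.1 hour each).
    for mtbf_hours in (100.0, 10.0, 1.0):
        eff = useful_fraction(0.1, 0.1, mtbf_hours)
        print(f"system MTBF {mtbf_hours:>5} h -> useful work ~{eff:.0%}")
```

Even in this simplified model, the useful share of machine time falls steadily as the system MTBF shrinks, which is the trend Fiala described.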

Because of all the additional hardware needed for exascale systems, which could be built from a million or more components, system reliability will have to be improved by 100 times in order to keep to the same MTBF that today's supercomputers enjoy, Fiala said.
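The arithmetic behind that kind of claim can be sketched with a back-of-the-envelope model. The sketch assumes components fail independently at a constant rate, so that system MTBF is simply component MTBF divided by the number of components; the component counts and the one-million-hour component MTBF are illustrative assumptions, not figures from the talk.

```python
def system_mtbf_hours(component_mtbf_hours, n_components):
    """System MTBF, assuming independent components with constant failure rates."""
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


# Assumed values for illustration only.
component_mtbf = 1_000_000.0   # hours per component
today_components = 10_000
exascale_components = 1_000_000

today = system_mtbf_hours(component_mtbf, today_components)
exascale = system_mtbf_hours(component_mtbf, exascale_components)

# Component MTBF needed for the larger machine to match today's system MTBF.
required_component_mtbf = today * exascale_components
improvement = required_component_mtbf / component_mtbf

print(f"today: {today} h, exascale with same parts: {exascale} h")
print(f"component reliability must improve {improvement:.0f}x")
```

Under these assumptions, a hundredfold increase in component count requires a hundredfold improvement in per-component reliability just to stand still.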

Fiala presented technology that he and fellow researchers developed that may help improve reliability. The technology addresses the problem of silent data corruption, in which systems make undetected errors when writing data to disk.

Basically, the researchers' approach consists of running multiple copies, or "clones," of a program simultaneously and then comparing the answers. The software, called RedMPI, is run in conjunction with the Message Passing Interface (MPI), a library for splitting running applications across multiple servers so the different parts of the program can be executed in parallel.

RedMPI intercepts and copies every MPI message that an application sends, and sends copies of the message to the clone (or clones) of the program. If different clones calculate different answers, then the numbers can be recalculated on the fly, which saves the time and resources that rerunning the entire program would require.

"Implementing redundancy is not expensive. It may be high in the number of core counts that are needed, but it avoids the need for rewrites with checkpoint restarts," Fiala said. "The alternative is, of course, to simply rerun jobs until you think you have the right answer."

Fiala recommended running two backup copies of each program, for triple redundancy. Though running multiple copies of a program would initially take up more resources, over time it may actually be more efficient, because programs would not need to be rerun to check answers. Also, checkpointing may not be needed when multiple copies are run, which would also save on system resources.
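The idea of comparing clones' messages can be sketched with a simple majority vote. This is a plain Python illustration, not RedMPI's code or API, and it does not use MPI at all; the `vote` function and the byte-string "messages" are hypothetical stand-ins for the payloads that each clone would send.

```python
from collections import Counter


def vote(replica_messages):
    """Compare the same message as produced by each clone.

    Returns (majority_value, indices_of_disagreeing_clones).
    Raises ValueError when no value is held by a strict majority,
    meaning corruption was detected but cannot be corrected by voting.
    """
    counts = Counter(replica_messages)
    value, n = counts.most_common(1)[0]
    if n <= len(replica_messages) // 2:
        raise ValueError("no majority among replicas")
    bad = [i for i, m in enumerate(replica_messages) if m != value]
    return value, bad


if __name__ == "__main__":
    # Three clones: one has a silently corrupted byte.
    value, bad = vote([b"\x01\x02", b"\x01\x02", b"\x01\x03"])
    print(value, bad)

    # Two clones can detect a mismatch but cannot tell which one is right.
    try:
        vote([b"\x01\x02", b"\x01\x03"])
    except ValueError as err:
        print("dual redundancy:", err)
```

The sketch shows one reason a third copy is useful: with two clones a mismatch can only be detected, while with three the majority answer can be used to outvote the corrupted one.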

"I think the idea of doing redundancy is actually a great idea. [For] very large computations, involving hundreds of thousands of nodes, there certainly is a chance that errors will creep in," said Ethan Miller, a computer science professor at the University of California, Santa Cruz, who attended the presentation. But he said the approach may not be suitable given the amount of network traffic that such redundancy might create. He suggested running all the applications on the same set of nodes, which could minimize internode traffic.

In another presentation, Ana Gainaru, a Ph.D. student from the University of Illinois at Urbana-Champaign, presented a technique for analyzing log files to predict when system failures would occur.

The work combines signal analysis with data mining. Signal analysis is used to characterize normal behavior, so when a failure occurs, it can be easily spotted. Data mining looks for correlations between separate reported failures. Other researchers have shown that multiple failures are sometimes correlated with each other, because a failure with one technology may affect performance in others, according to Gainaru. For instance, when a network card fails, it will soon hobble other system processes that rely on network communication.

The researchers found that 70 percent of correlated failures provide a window of opportunity of more than 10 seconds. In other words, once the first sign of a failure has been detected, the system may have 10 seconds or more to save its work, or move the work to another node, before a more critical failure occurs. "Failure prediction can be merged with other fault-tolerance techniques," Gainaru said.
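A window of opportunity of this kind can be measured from timestamped log events, as in the rough sketch below. It is not Gainaru's method, which combines signal analysis with data mining; it only pairs an assumed warning event with the failure that follows it. The event names `nic_error` and `node_down`, the timestamps, and the 300-second pairing limit are all made up for illustration.

```python
def lead_times(events, precursor, failure, max_gap):
    """Seconds between the first warning sign and the failure that follows.

    `events` is a list of (timestamp_seconds, event_name) tuples.
    A warning older than `max_gap` seconds is treated as unrelated.
    """
    leads = []
    first_sign = None
    for ts, kind in sorted(events):
        if kind == precursor:
            # Keep the earliest warning unless it has gone stale.
            if first_sign is None or ts - first_sign > max_gap:
                first_sign = ts
        elif kind == failure and first_sign is not None:
            if ts - first_sign <= max_gap:
                leads.append(ts - first_sign)
            first_sign = None
    return leads


def fraction_with_window(leads, min_seconds):
    """Share of correlated failures that left more than `min_seconds` to react."""
    if not leads:
        return 0.0
    return sum(1 for lt in leads if lt > min_seconds) / len(leads)


# Made-up log excerpt for illustration.
log = [
    (0.0, "nic_error"), (25.0, "node_down"),
    (100.0, "nic_error"), (104.0, "node_down"),
    (200.0, "nic_error"), (260.0, "node_down"),
    (1000.0, "nic_error"), (5000.0, "node_down"),  # too far apart to pair
]

leads = lead_times(log, "nic_error", "node_down", max_gap=300.0)
print(leads, fraction_with_window(leads, 10.0))
```

A scheduler could use such a measurement to decide whether there is enough time after a warning to checkpoint or migrate work before the predicted failure.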

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is
