SC13: Elevation plays a role in memory error rates

A study by AMD and the Department of Energy showed that a supercomputer at higher altitude had more memory errors

A study from AMD and the Department of Energy showed that SRAM in the Cielo supercomputer had more transient errors than SRAM in the Jaguar supercomputer, probably because of the difference in elevation between the two machines

With memory, as with real estate, location matters. A group of researchers from Advanced Micro Devices (AMD) and the Department of Energy's Los Alamos National Laboratory have found that the altitude at which SRAM (static random access memory) resides can influence how many random errors the memory produces.

In a field study of two high-performance computers, the researchers found that L2 and L3 caches had more transient errors on the supercomputer located at a higher altitude than on the one closer to sea level. They attributed the disparity largely to lower air pressure and more frequent cosmic-ray-induced neutron strikes.

Strangely, higher elevation even led to more errors within a rack of servers, the researchers found. Their analysis showed that memory at the top of a server rack had 20 percent more transient errors than memory closer to the bottom of the rack. However, it's not clear what causes this smaller-scale effect.

Vilas Sridharan, an AMD technical staff member, presented the findings Thursday at the SC13 supercomputing conference, being held this week in the mile-high city of Denver.

Using the error logs of two large high-performance computers, the study examined the characteristics of transient memory errors, in which a memory module may store a 1 as a 0, or vice versa.
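To make the idea of a transient error concrete, the short Python sketch below simulates a single bit flipping in a stored value. It is only an illustration: the function names, the 8-bit value and the bit position are hypothetical and do not come from the study.

```python
def flip_bit(word: int, position: int) -> int:
    """Return `word` with the bit at `position` inverted (a simulated transient error)."""
    return word ^ (1 << position)


def differing_bits(a: int, b: int) -> int:
    """Count the bit positions at which two values differ."""
    return bin(a ^ b).count("1")


# Hypothetical 8-bit value held in a cache line.
stored = 0b10110010

# Simulate a particle strike that turns the 1 at bit 4 into a 0.
corrupted = flip_bit(stored, 4)

print(f"stored:    {stored:08b}")
print(f"corrupted: {corrupted:08b}")
print("bits changed:", differing_bits(stored, corrupted))
```

Because the underlying hardware is not damaged, rewriting the correct value clears the error, which is what separates a transient fault from a permanent one.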

Transient errors are different from permanent or even intermittent errors, which are usually caused by hardware failure, Sridharan said. Transient errors appear more randomly and are not usually the fault of machinery. They are relatively rare, but depending on where they occur, they can cause a cascade of additional system errors.

The group studied the monthly transient fault rates of SRAM--the L2 and L3 caches within processors--in two large Cray supercomputers, each running thousands of AMD processors.

One supercomputer was the Jaguar system at Oak Ridge National Laboratory in Oak Ridge, Tennessee, which is approximately 817 feet (249 meters) above sea level, according to an online altitude finder.

The other system under study was the Cielo supercomputer at the Los Alamos National Laboratory in Los Alamos, New Mexico, which is about 7,058 feet (2,151 meters) above sea level.

The group found that, once other possible confounding factors were accounted for, Cielo's SRAM had a "significantly higher rate of SRAM faults" than Jaguar's SRAM, Sridharan said.

For example, with L3 caches, Cielo was bedeviled by 735 transient faults for every 219 that Jaguar endured. L2 transient faults across the two machines showed a similar relationship.
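As a back-of-the-envelope check, the figures quoted above can be turned into a ratio. The Python sketch below uses only the numbers reported in this article. The variable names are illustrative, and the calculation assumes the two L3 counts are directly comparable, which the article does not spell out.

```python
# L3 cache transient fault figures quoted in the article.
cielo_l3_faults = 735
jaguar_l3_faults = 219

# Approximate elevations quoted in the article, in feet.
cielo_elevation_ft = 7058
jaguar_elevation_ft = 817

fault_ratio = cielo_l3_faults / jaguar_l3_faults
elevation_diff_ft = cielo_elevation_ft - jaguar_elevation_ft

print(f"Cielo-to-Jaguar L3 fault ratio: {fault_ratio:.2f}")
print(f"Elevation difference: {elevation_diff_ft} feet")
```

On these numbers, Cielo logged a little more than three times as many L3 transient faults as Jaguar while sitting roughly 6,200 feet higher.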

The findings were not a surprise, according to Sridharan. It has long been theorized that transient memory errors can come from the high-energy impact of neutrons from cosmic rays, which is more pronounced at higher elevations. Other factors related to elevation, such as air pressure, may also play a role.

"This is theoretically well-known, but it is nice to see the data," Sridharan said.

Another effect was slightly more mysterious: The SRAM at the top of server racks had a significantly greater number of transient errors than that at the middle or the bottom of the same racks, within both Jaguar and Cielo.

"There is a trend towards a higher rate of SRAM faults as you go up the rack," Sridharan said. "This is something we don't really have a good explanation for."

SRAM in the servers at the top of the rack had 20 percent more transient errors than SRAM in the servers on the lower levels. "This is not a huge effect, but it is a consistent one," Sridharan said.

The difference probably could not be attributed solely to cosmic rays, Sridharan said. He briefly speculated on a number of possible causes. For example, because heat rises, the servers at the top of a rack are hotter than those on the bottom. Heat is a well-known culprit in equipment failure.

A low-cost solution, such as installing heat shielding on server racks, may be worth investigating, Sridharan said.

In the study, the group also looked at DRAM faults. They examined memory from three different vendors and found that the fault rate of one vendor was four times that of another. The group did not release the names of the vendors but did alert the vendor with the highest fault rate about the comparatively high rate of faults in its products.

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
