Amazon comes clean about the great Cloud outage

Amazon's dished the dirt about the great cloud outage, and it's down to human error and software bugs. Will Amazon recover?

Amazon has posted an essay-length explanation of the cloud outage that took offline some of the Web's most popular services last week. In summary, it appears that human error during a system upgrade meant the entire network traffic for the U.S. East Region was accidentally shifted onto a redundant backup network for the Elastic Block Store (EBS), overloading it and jamming up the system.

At the end of a long battle to restore services, Amazon says it managed to recover most data but 0.07 percent "could not be restored for customers in a consistent state". A rather miserly 10-day usage credit is being given to users, although users should check their Amazon Web Services (AWS) control panel to see if they qualify. No doubt several users are also consulting the AWS terms and conditions right now, if not lawyers.

A software bug played a part, too. Although unlikely to occur in normal EBS usage, the bug became a substantial problem because of the sheer volume of failures that were occurring. Amazon also says its warning systems were not "fine-grained enough" to spot secondary issues while other, louder alarm bells were ringing.

Amazon calls the outage a "re-mirroring storm." EBS is essentially the storage component of the Elastic Compute Cloud (EC2), which lets users hire computing capacity in Amazon's cloud service.

EBS works via two networks: a primary one and a secondary network that's slower and used for backup and intercommunication. Both consist of clusters containing nodes, and each node acts as a separate storage unit.

There are always two copies of a node, meant to preserve data integrity. This is called re-mirroring. Crucially, if one node is unable to find a partner node to back up to, then it'll get stuck and will keep trying until it finds a replacement. Similarly, new nodes also need to find a partner to be valid, and will get stuck until they succeed.
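The retry behaviour described above can be sketched in a few lines of Python. This is a toy model, not Amazon's actual code; the function names and the retry limit are my own assumptions:

```python
def find_partner(free_nodes):
    """Hypothetical lookup: return a free node to mirror to, or None."""
    return free_nodes.pop() if free_nodes else None

def remirror(volume, free_nodes, max_attempts=1000):
    """A node that loses its mirror blocks all I/O on the volume and
    keeps retrying until it finds a new partner (in the real outage,
    effectively indefinitely)."""
    for _ in range(max_attempts):
        partner = find_partner(free_nodes)
        if partner is not None:
            return partner  # I/O on the volume resumes
    return None  # still stuck: the volume stays offline
```

Ordinarily the loop exits on the first pass in milliseconds; the outage was the case where `free_nodes` stayed empty and volumes sat stuck.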

It appears that during a routine system upgrade, all network traffic for the U.S. East Region was accidentally sent to the secondary network. Being slower and of lower capacity, the secondary network couldn't handle this traffic. The error was realized and the changes rolled back, but by that point the secondary network had been largely filled -- leaving some nodes on the primary network unable to re-mirror successfully. When unable to re-mirror, a node stops all data access until it's sorted out a backup, a process that ordinarily takes milliseconds but -- it would transpire -- would now take days, as Amazon engineers fought to fix the system.

Because of the re-mirroring storm that had arisen, it became difficult to create new nodes, something that happens routinely during everyday EC2 usage. In fact, so many node-creation requests piled up unserviced that the EBS control system also became partially unavailable.
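That feedback loop can be illustrated with a toy simulation (my own sketch; the model and numbers are assumptions, not Amazon's figures): every stuck volume re-issues its replacement request each round, and once spare capacity runs out the backlog only grows.

```python
def simulate_storm(stuck_volumes, free_capacity, rounds):
    """Toy model of a re-mirroring storm: stuck volumes retry every
    round; requests that can't be served pile up at the control plane."""
    pending = 0  # unserved requests queued at the control plane
    for _ in range(rounds):
        pending += stuck_volumes            # each stuck volume retries
        served = min(pending, free_capacity)
        free_capacity -= served             # spare capacity used up for good
        stuck_volumes -= served             # served volumes unstick
        pending -= served
    return stuck_volumes, pending
```

With `stuck_volumes` far exceeding `free_capacity`, `pending` climbs every round even though almost no volumes are being fixed, which is roughly how the control system got swamped.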

Amazon engineers then turned off the capability to create new nodes, essentially putting the brakes on EBS (and therefore EC2 -- this is probably the moment at which many websites and services went offline). Things began to improve, but that's when a software bug struck: when many EBS nodes close their re-mirroring requests at the same time, they fail. The issue had never shown itself before, because there had never been a situation in which so many nodes were closing requests simultaneously.

As a result, even more nodes attempted to re-mirror and the situation became worse. The EBS control system was again adversely affected.

Fixing the problem was tricky because EBS was configured not to trust any nodes it thought had failed. Therefore, the Amazon engineers had to physically locate and connect new storage in order to create enough new nodes to meet demand -- around 13 percent of existing volumes, which is likely a huge amount of storage. Additionally, they had reconfigured the system to avoid any more failures, but this made bringing the new hardware online very difficult.

Some system reprogramming took place and eventually everything began to return to normal. A snapshot had been made when the crisis hit, and Amazon engineers had to restore 2.2 percent of it manually. Eventually 1.04 percent of the data had to be forensically restored (I'm guessing they had to dip into archives and manually extract and restore files). In the end, 0.07 percent of files couldn't be restored. That might not sound like a lot, but bearing in mind Amazon Web Services is the steam train driving the Internet, I suspect it's quite a lot of data.

Amazon has, of course, promised to improve across the board -- everything from auditing processes to avoid the error that kicked off the event, to speeding up recovery. There's an apology too, but it's surprisingly short and perhaps not as grovelling as some would like. At this stage of the game I suspect all the AWS engineers want to do is take a few days off.

I'm among those who anticipated this outage was an extraordinary event. I thought an act of God might be involved somewhere -- maybe a seagull fell into a ventilation pipe and blew up a server.

Sadly, it looks like I'm wrong. There are clear failures that could have been seen in advance, and they're going to dent the confidence of anybody using Amazon Web Services. Ultimately, it's clear that nobody ever asked, "What if?"

I don't expect anybody to be giving up on Amazon Web Services right now, largely because it remains one of the cheapest and most accessible services out there. But Amazon's going to have to keep its nose clean in the coming months and years until the great cloud outage is just a memory.


Keir Thomas

PC World (US online)