Amazon comes clean about the great Cloud outage

Amazon's dished the dirt about the great cloud outage, and it's down to human error and software bugs. Will Amazon recover?

Amazon has posted an essay-length explanation of the cloud outage that took offline some of the Web's most popular services last week. In summary, it appears that human error during an system upgrade meant a redundant backup network for the Elastic Block Service (EBS) accidentally took up the entire network traffic in the U.S. East Region, overloading it, and jamming up the system.

At the end of a long battle to restore services, Amazon says it managed to recover most data but 0.07 percent "could not be restored for customers in a consistent state". A rather miserly 10-day usage credit is being given to users, although users should check their Amazon Web Services (AWS) control panel to see if they qualify. No doubt several users are also consulting the AWS terms and conditions right now, if not lawyers.

A software bug played a part, too. Although unlikely to occur in normal EBS usage, the bug became a substantial problem because of the sheer volume of failures that were occurring. Amazon also says their warning systems were not "fine-grained enough" to spot when other issues occurred at the same time as other, louder alarm bells were ringing.

Amazon calls the outage a "re-mirroring storm." EBS is essentially the storage component of the Elastic Compute Cloud (EC2), which lets users hire computing capacity in Amazon's cloud service.

EBS works via two networks: a primary one and a secondary network that's slower and used for backup and intercommunication. Both are comprised of clusters containing nodes, and each node acts as a separate storage unit.

There are always two copies of a node, meant to preserve data integrity. This is called re-mirroring. Crucially, if one node is unable to find its partner node to backup to then it'll get stuck until it can find a replacement, and will keep trying until it can find a node. Similarly, new nodes need also to create a partner to be valid, and will get stuck until they can succeed.

It appears that during a routine system upgrade, all network traffic for the U.S. East Region was accidentally sent to the secondary network. Being slower and of lower capacity, the secondary network couldn't handle this traffic. The error was realized and the changes rolled back, but by that point the secondary network had been largely filled -- leaving some nodes on the primary network unable to re-mirror successfully. When unable to re-mirror, a node stops all data access until it's sorted out a backup, a process that ordinarily takes milliseconds but -- it would transpire -- would now take days, as Amazon engineers fought to fix the system.

Because of the re-mirroring storm that had arisen, it became difficult to create new nodes, as happens normally during everyday EC2 usage. In fact, so many new node creation requests arose, which couldn't be serviced, that the EBS control system also became partially unavailable.

Amazon engineers then turned off the capability to create new nodes, essentially putting the brakes on EBS (and therefore EC2 -- this is probably the moment at which many websites and services went offline). Things began to improve but that's when a software bug struck. When many EBS nodes close their requests for re-mirroring at the same time, they fail. Normally this issue hadn't shown its head because there'd never been a situation when so many nodes were closing requests simultaneously.

As a result, even more nodes attempted to re-mirror and the situation became worse. The EBS control system was again adversely affected.

Fixing the problem was problematic because EBS was configured not to trust any nodes it thought had failed. Therefore, the Amazon engineers had to physically locate and connect new storage in order to create new nodes to meet the demand -- around 13 percent of existing volumes, which is likely a huge amount of storage. Additionally, they had reconfigured the system to avoid any more failures, but this made bringing the new hardware online very difficult.

Some system reprogramming took place and eventually everything began to return to normal. A snapshot had been made when the crisis hit and Amazon engineers had to restore 2.2 percent of this manually. Eventually 1.04 percent of the data had to be forensically restored (I'm guessing they had to dip into archives and manually extract and restore files). In the end, 0.07 percent of files couldn't be restored. That might not sound a lot, but bearing in mind Amazon Web Services is the stream train driving the Internet, I suspect it's quite a lot of data.

Amazon has, of course, promised to improve across the board -- everything from auditing processes to avoid the error that kicked off the event, to speeding up recovery. There's an apology too, but it's surprisingly short and perhaps not as grovelling as some would like. At this stage of the game I suspect all the AWS engineers want to do is take a few days off.

I'm among those who anticipated this outage was an extraordinary event. I thought an act of God might be involved somewhere -- maybe a seagull fell into a ventilation pipe and blew up a sever.

Sadly, it looks like I'm wrong. There are clear failures that could have been seen in advance, and they're are going to dent the confidence of anybody using Amazon Web Services. Ultimately, it's clear that nobody ever asked, "What if?"

I don't expect anybody to be giving up on Amazon Web Services right now, largely because it remains one of the cheapest and most accessible services out there. But Amazon's going to have to keep its nose clean in the coming months and years until the great cloud outage is just a memory.

Join the PC World newsletter!

Error: Please check your email address.

Tags Amazon Web ServicesserversCloudhardware systemscloud computinginternet

Our Back to Business guide highlights the best products for you to boost your productivity at home, on the road, at the office, or in the classroom.

Keep up with the latest tech news, reviews and previews by subscribing to the Good Gear Guide newsletter.

Keir Thomas

PC World (US online)
Show Comments


Lexar® JumpDrive® S57 USB 3.0 flash drive

Learn more >

Microsoft L5V-00027 Sculpt Ergonomic Keyboard Desktop

Learn more >


Lexar® JumpDrive® S45 USB 3.0 flash drive 

Learn more >


Lexar® Professional 1800x microSDHC™/microSDXC™ UHS-II cards 

Learn more >

Lexar® JumpDrive® C20c USB Type-C flash drive 

Learn more >

Audio-Technica ATH-ANC70 Noise Cancelling Headphones

Learn more >

HD Pan/Tilt Wi-Fi Camera with Night Vision NC450

Learn more >


Back To Business Guide

Click for more ›

Most Popular Reviews

Latest News Articles


PCW Evaluation Team

Azadeh Williams

HP OfficeJet Pro 8730

A smarter way to print for busy small business owners, combining speedy printing with scanning and copying, making it easier to produce high quality documents and images at a touch of a button.

Andrew Grant

HP OfficeJet Pro 8730

I've had a multifunction printer in the office going on 10 years now. It was a neat bit of kit back in the day -- print, copy, scan, fax -- when printing over WiFi felt a bit like magic. It’s seen better days though and an upgrade’s well overdue. This HP OfficeJet Pro 8730 looks like it ticks all the same boxes: print, copy, scan, and fax. (Really? Does anyone fax anything any more? I guess it's good to know the facility’s there, just in case.) Printing over WiFi is more-or- less standard these days.

Ed Dawson

HP OfficeJet Pro 8730

As a freelance writer who is always on the go, I like my technology to be both efficient and effective so I can do my job well. The HP OfficeJet Pro 8730 Inkjet Printer ticks all the boxes in terms of form factor, performance and user interface.

Michael Hargreaves

Windows 10 for Business / Dell XPS 13

I’d happily recommend this touchscreen laptop and Windows 10 as a great way to get serious work done at a desk or on the road.

Aysha Strobbe

Windows 10 / HP Spectre x360

Ultimately, I think the Windows 10 environment is excellent for me as it caters for so many different uses. The inclusion of the Xbox app is also great for when you need some downtime too!

Mark Escubio

Windows 10 / Lenovo Yoga 910

For me, the Xbox Play Anywhere is a great new feature as it allows you to play your current Xbox games with higher resolutions and better graphics without forking out extra cash for another copy. Although available titles are still scarce, but I’m sure it will grow in time.

Featured Content

Latest Jobs

Don’t have an account? Sign up here

Don't have an account? Sign up now

Forgot password?