AWS says a typo caused the massive S3 failure this week

The cloud provider is implementing several changes to prevent similar events

Everyone makes mistakes. But working at Amazon Web Services means an incorrectly entered input can lead to a massive outage that cripples popular websites and services.

That's apparently what happened earlier this week, when the AWS Simple Storage Service (S3) in the provider's Northern Virginia region experienced an 11-hour system failure.

Other Amazon services in the US-EAST-1 region that rely on S3, including Elastic Block Store, Lambda, and new instance launches for the Elastic Compute Cloud infrastructure-as-a-service offering, were also affected by the outage.

AWS apologized for the incident in a postmortem released Thursday. The outage affected the likes of Netflix, Reddit, Adobe, and Imgur. More than half of the top 100 online retail sites experienced slower load times during the outage, website monitoring service Apica said.

Here’s what set off the outage, and what Amazon plans to do:

According to Amazon, an authorized S3 employee executed a command that was supposed to "remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," in response to the service's billing process working more slowly than anticipated.

One of the parameters for the command was entered incorrectly, and the command took down a large number of servers that support a pair of critical S3 subsystems.

The index subsystem “manages the metadata and location information of all S3 objects in the region,” while the placement subsystem “manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate.”

While those subsystems are built to be fault tolerant, so many servers were shut down that both had to be fully restarted.

As it turns out, Amazon hadn't fully restarted those subsystems in its larger regions for several years, and S3 had experienced massive growth in the intervening time. Restarting them took longer than expected, which prolonged the outage.

In response to this incident, AWS is making several changes to its internal tools and processes. The tool that caused the outage has been modified to remove servers more slowly and to block operations that would take capacity below safe levels.
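
To make those two safeguards concrete, here is a minimal sketch in Python of how a capacity-removal tool might enforce them. It is not AWS's actual tool: the names `Subsystem`, `plan_removal`, `min_servers`, and `max_per_step` are all hypothetical, and the numbers are invented for illustration. The idea is simply that a request is rejected outright if it would drop a subsystem below its floor, and an accepted request is split into small batches rather than executed at once.

```python
from dataclasses import dataclass
from typing import List


class CapacityError(Exception):
    """Raised when a removal request would breach the safety floor."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_servers: int  # hypothetical safety floor for this subsystem


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int) -> List[int]:
    """Return the batch sizes for removing `requested` servers.

    Raises CapacityError instead of planning anything if the request
    would leave fewer than `min_servers` servers running.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if max_per_step <= 0:
        raise ValueError("max_per_step must be a positive number of servers")

    # Safeguard 1: refuse any operation that goes below the floor.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_servers:
        raise CapacityError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_servers}"
        )

    # Safeguard 2: remove capacity slowly, in batches of at most max_per_step.
    batches: List[int] = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With 100 active servers and a floor of 80, a request to remove 10 servers at no more than 4 per step is planned as batches of 4, 4, and 2, while a mistyped request for 30 is rejected before any server is touched.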

AWS is also evaluating its other tools to make sure they have similar safety systems in place.

AWS engineers are also going to start refactoring the S3 index subsystem to help speed up reboots and reduce the blast radius of future problems.

The cloud provider has also changed its Service Health Dashboard administration console to run across multiple regions. AWS employees were unable to update the dashboard during the outage because the console relied on S3 from the affected region.


Blair Hanley Frank

IDG News Service