Cloud failover a challenge for Amazon competitors, too

Building cloud-based applications that can fail over from one data center to another is difficult and may require the customer to have sophisticated technology expertise

The outage in Amazon's Elastic Compute Cloud service last week highlighted the limitations of load balancing and failover systems designed to keep applications running in case of failure. But Amazon isn't the only cloud vendor whose systems can't guarantee 100% uptime.

Building cloud-based applications that can fail over from one data center to another is difficult and may require the customer to have sophisticated technology expertise. Customers may have to work closely with the cloud vendor and purchase third-party load-balancing products to keep applications running in the event of failures like the one that hit Amazon.

GoGrid, which offers infrastructure-as-a-service computing in a fashion similar to Amazon's, offers service credits to customers when uptime falls below 100%, but that doesn't mean the cloud service never goes down.

"For the service elements we deliver, we're saying that we expect them to be up 100% of the time, and if they're not we're going to compensate you," says GoGrid CEO and founder John Keagy. "Things do fail. Customers should not interpret a 100% service-level commitment as a 100% service-level guarantee."

But customers can keep their applications running through downtime if they are willing to put some extra work into it, Keagy says. Amazon customers who didn't have robust disaster recovery and failover plans were more likely to suffer downtime last week than those who planned ahead, he says.


GoGrid's cloud offerings are spread across 11 data centers, mostly run by co-location providers. Customers that want applications to fail over from one data center to another can use global traffic management products made by third parties, Keagy says. Customers can also achieve this extra level of protection entirely through services offered by GoGrid, but this "has to be architected in conjunction with us to get that done," Keagy says.

"That's what infrastructure is all about," Keagy says. "This is not platform as a service or software as a service. This is raw infrastructure that requires the user to have some responsibility for how they implement things."

Amazon lets customers host applications in multiple "availability zones" for an extra fee, but it's not clear how far apart these zones are. Last week, failures hit multiple availability zones.

While sites including Foursquare, Reddit, Quora and Hootsuite went offline, the success of photo-sharing site SmugMug shows how planning ahead can help customers survive what SmugMug CEO Don MacAskill called the "Amazonpocalypse."

SmugMug spread its systems across three availability zones and decided not to use Amazon's Elastic Block Store (EBS) service because of "unpredictable performance and sketchy durability," MacAskill wrote in his blog. The storage service played a key role in last week's failure.

If you're putting mission-critical applications in the cloud, MacAskill advises spreading them across multiple Amazon regions (East Coast and West Coast, for example) or multiple cloud providers.

Amazon's load-balancing service doesn't work across regions, so customers have to do some extra work on their own and use third-party software to make it happen, says Gartner analyst Drue Reeves. Spreading applications across multiple cloud vendors, meanwhile, is not impossible but difficult due to a lack of standards and interoperability.

Rackspace, another infrastructure-as-a-service provider, recently began offering a Cloud Load Balancers service that protects applications against the failure of a single server. But the load balancer does not spread applications across different data centers.

Josh Odom, who leads product development for Rackspace's cloud platform, notes that running an application in multiple data centers is the best way to guarantee 100% uptime, and Rackspace tries to make it easy for customers to use third-party load balancing and failover products to achieve that.
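The arithmetic behind that advice can be sketched in a few lines. The example below is a rough illustration, not a vendor formula: it assumes each data center fails independently of the others, and the 99% per-site figure is hypothetical. Last week's outage, which hit several availability zones at once, shows that the independence assumption can be optimistic.

```python
def combined_availability(per_site: float, sites: int) -> float:
    """Probability that at least one of `sites` data centers is up.

    Assumes sites fail independently, which real outages may violate.
    """
    if not 0.0 <= per_site <= 1.0:
        raise ValueError("per_site must be between 0 and 1")
    if sites < 1:
        raise ValueError("sites must be at least 1")
    # All sites must be down at the same time for the application to be down.
    return 1.0 - (1.0 - per_site) ** sites


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - availability) * 365 * 24


if __name__ == "__main__":
    for n in (1, 2, 3):
        a = combined_availability(0.99, n)  # 0.99 is a hypothetical figure
        print(f"{n} site(s): {a:.6f} available, "
              f"{downtime_hours_per_year(a):.2f} hours down per year")
```

Under these assumptions, each added site shrinks expected downtime sharply, but the result approaches 100% without ever reaching it.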

The biggest challenge isn't the application itself, but the data, Odom says. "Any kind of database replication with relational database systems is fairly cumbersome," Odom says. "We're trying to lower those barriers."

Rackspace's Texas data center suffered a few power outages in 2009, forcing the company to issue service credits to customers. The company has since brought in new data center experts and performed top-to-bottom audits of the facilities, Odom says. Despite past problems, Odom says Rackspace data centers are designed to withstand "catastrophic failures" including the loss of major power sources or network capacity.

While disaster recovery planning in infrastructure as a service requires some tech expertise, not all cloud services are geared toward the experts. Platform-as-a-service offerings -- such as Microsoft's Windows Azure or Google App Engine -- are designed to minimize involvement with underlying infrastructure and provide developers a relatively simple way to build and host Web applications.

But load balancing and the ability to fail over from one data center to another are still a big plus in platform-as-a-service clouds.

Microsoft recently announced “Windows Azure Traffic Manager,” saying it will allow “deployment of the same application to topologically dispersed data centers enabling the distribution of workload between these data centers through round robin, failover and performance based load balancing schemes.” Azure Traffic Manager is available only in a community technology preview, meaning it’s not ready for all customers. While Windows Azure Traffic Manager distributes traffic across multiple data centers, SQL Azure Data Sync, also in beta, replicates "databases across multiple data centers to prevent against a DC getting lost," according to Microsoft.
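To make the three schemes Microsoft names more concrete, here is a minimal sketch of what round robin, failover and performance-based selection could look like. It is not Microsoft's implementation or API: the DataCenter type, its fields and the function names are all hypothetical, and the health and latency values are assumed to come from some external monitoring.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, Sequence


@dataclass(frozen=True)
class DataCenter:
    """Hypothetical record for one deployment of the application."""
    name: str
    healthy: bool
    latency_ms: float  # assumed to be measured from the client's location


def _healthy(centers: Sequence[DataCenter]) -> list:
    healthy = [c for c in centers if c.healthy]
    if not healthy:
        raise RuntimeError("no healthy data center available")
    return healthy


def round_robin(centers: Sequence[DataCenter]) -> Iterator[DataCenter]:
    """Rotate through the healthy data centers in order, indefinitely."""
    return cycle(_healthy(centers))


def failover(centers: Sequence[DataCenter]) -> DataCenter:
    """Pick the first healthy data center; list order is the priority."""
    return _healthy(centers)[0]


def performance(centers: Sequence[DataCenter]) -> DataCenter:
    """Pick the healthy data center with the lowest measured latency."""
    return min(_healthy(centers), key=lambda c: c.latency_ms)
```

In this sketch, failover sends all traffic to a preferred site until it is marked unhealthy, while round robin and performance-based selection spread traffic across every site that is still up.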

Developer Robert McLaws reports on Twitter that, even without Windows Azure Traffic Manager, customers can build applications to fail over across data centers if you "manage it yourself in code."
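McLaws does not describe his approach, so the following is only one guess at what managing failover in code might involve: the application tries each data center's endpoint in priority order and moves on when a connection fails. The function name, the endpoint URLs in the usage note and the injected request callable are all hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result.

    `request` is any callable that takes an endpoint and raises
    ConnectionError when that data center cannot be reached.
    """
    last_error: Optional[ConnectionError] = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except ConnectionError as exc:
            # Remember the failure and fall through to the next data center.
            last_error = exc
    raise ConnectionError("all endpoints failed") from last_error
```

A caller might pass a list such as ["https://east.example.com", "https://west.example.com"] along with a function that performs the actual HTTP request. This handles only the application tier; keeping the data in each data center in sync is a separate problem.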

Google’s App Engine service can shift both applications and data from one data center to another without data loss or downtime in the event of failure, said Google product manager Greg D'alesandre. Google would not say how far apart the data centers are, but said "the system is designed so that there is no single geographic point of failure."

Amazon, meanwhile, has been accused of not providing a full explanation of what actually went wrong last week. Amazon blamed a "networking event" that "triggered a large amount of re-mirroring" of storage volumes, creating a capacity shortage and causing lost connections to virtual machines.

Thorsten von Eicken, CTO and founder of RightScale, which provides services that enhance the functionality of Amazon EC2, said Amazon "earns an F" for communication and has failed to offer a root-cause analysis.

Follow Jon Brodkin on Twitter:

Read more about data center in Network World's Data Center section.


Jon Brodkin

Network World