Google, Amazon reveal their secrets of scalability

As large IT systems scale to unforeseen levels of complexity, new laws of effective management come into play

Internet giants such as Google and Amazon run IT operations that are far larger than most enterprises even dream of, but the lessons they learn from managing those humongous systems can benefit others in the industry.

At a few conferences in recent weeks, engineers from Google and Amazon revealed some of the secrets they use to scale their systems with a minimum of administrative headache.

At the Usenix LISA (Large Installation Systems Administration) conference in Washington, Google site reliability engineer Todd Underwood highlighted one of the company's imperatives that may be surprising: frugality.

"A lot of what Google does is about being super-cheap," he told an audience of systems administrators.

Google is forced to maniacally control costs because it has learned that "anything that scales with demand is a disaster if you are not cheap about it."

As a service grows more popular, its costs must grow in a "sub-linear" fashion, he said.

"Add a million users, you really have to add less than a 1,000 quanta of whatever expense you are incurring," Underwood said. A "quanta" of expense could be people's time, compute resources, or power.

That thinking is behind Google's efforts not to purchase off-the-shelf routing equipment from companies such as Cisco or Juniper. Google would need so many ports that it's more cost-effective to build its own, Underwood said.

He disputed the idea that the challenges Google faces are unique to a company of its size. For one, Google is composed of many smaller services, such as Gmail and Google+.

"The scale of all of Google is not what most application developers inside of Google deal with. They run these things that are comprehensible to each and every one of you," he told the audience.

Another technique Google employs is to automate everything possible. "We're doing too much of the machines' work for them," he said.

Ideally, an organization should get rid of its system administration altogether, and just build and innovate on existing services offered by others, Underwood said, though he admitted that's not feasible yet.

Underwood, who has a flair for the dramatic, stated: "I think system administration is over, and I think we should stop doing it. It's mostly a bad idea that was necessary for a long time but I think it has become a crutch."

Google's biggest competitor is not Bing or Apple or Facebook. Rather, it is itself, he said. The company's engineers aim to make its products as reliable as possible, but that's not their sole task. If a product is too reliable -- which is to say, beyond the five 9's of reliability (99.999 percent) -- then that service is "wasting money" in the company's eyes.

"The point is not to achieve 100 percent availability. The point is to achieve the target availability -- 99.999 percent -- while moving as fast as you can. If you massively exceed that threshold you are wasting money," Underwood said.

"Opportunity costs is our biggest competitor," he said.

The following week at the Amazon Web Services (AWS) re:Invent conference in Las Vegas, James Hamilton, AWS' vice president and distinguished engineer, discussed the tricks Amazon uses to scale.

Though Amazon is selective about what numbers it shares, AWS is growing at a prodigious rate. Each day, it adds the equivalent amount of compute resources (servers, routers, data center gear) that it had in total in the year 2000, Hamilton said. "This is a different type of scale," he said.

Key for AWS, which launched in 2006, was good architectural design. Hamilton admitted that Amazon was lucky to have got the architecture for AWS largely correct from the beginning.

"When you see fast growth, you learn about architecture. If there are architectural errors or mistakes made in the application, and the customers decide to use them in a big way, there are lots of outages and lots of pain," Hamilton said.

The cost of deploying a service on AWS comes down to setting up and deploying the infrastructure, Hamilton explained. For most organizations, IT infrastructure is an expense, not the core of their business. But at AWS, engineers focus solely on driving down costs for the infrastructure.

Like Google, Amazon often builds its own equipment, such as servers. That's not practical for enterprises, he acknowledged, but it works for an operation as large as AWS.

"If you have tens of thousands of servers doing exactly the same thing, you'd be stealing from your customers not to optimize the hardware," Hamilton said. He also noted that servers sold through the regular IT hardware channel often cost about 30 percent more than buying individual components from manufacturers.

Not only does this allow AWS to cut costs for customers, but it also allows the company to talk with the component manufacturers directly about improvements that would benefit AWS.

"It makes sense economically to operate this way, and it makes sense from a pace-of-innovation perspective as well," Hamilton said.

Beyond cloud computing, another field of IT that deals with scalability is supercomputing, in which a single machine may have thousands of nodes, each with dozens of processors. On the last day of the SC13 supercomputer conference, a panel of operators and vendors assembled to discuss scalability issues.

William Kramer, who oversees the National Center for Supercomputing Applications' Blue Waters machine at the University of Illinois at Urbana-Champaign, noted that supercomputing is experiencing tremendous growth, driving the need for new workload scheduling tools to ensure organizations get the most from their investment.

"What is now in a chip -- a single piece of silicon -- is the size of the systems we were trying to schedule 15 years ago," Kramer said. "We've assumed the operating system or the programmer will handle all that scheduling we were doing."

The old supercomputing metric of throughput seems to be fraying. This year, Jack Dongarra, one of the creators of the Linpack benchmark used to rank computers on the Top500 list, called for additional metrics to better gauge a supercomputer's effectiveness.

Judging a system's true efficiency can be tricky, though.

"You want to measure the amount of work going through the system over a period of time," and not just a simplistic measure of how much each node is being utilized, Kramer said.

He noted that an organization can measure utilization as the percentage of time each node is busy. But that approach can be misleading: a workload can be slowed in a way that inflates the utilization rate even as less work flows through the system overall.
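A toy comparison makes Kramer's point concrete. The job counts and timings below are invented for illustration; the direction of the comparison is what matters.

```python
# Toy illustration: high utilization does not imply high throughput.

def stats(jobs_completed: int, busy_hours: float, wall_hours: float, nodes: int):
    """Return (node utilization, jobs completed per wall-clock hour)."""
    utilization = busy_hours / (wall_hours * nodes)
    throughput = jobs_completed / wall_hours
    return utilization, throughput

# Scenario A: jobs run at full speed; nodes sometimes sit idle.
util_a, thru_a = stats(jobs_completed=100, busy_hours=160, wall_hours=24, nodes=10)

# Scenario B: the same jobs are deliberately slowed, keeping nodes busier
# but finishing fewer of them in the same wall-clock window.
util_b, thru_b = stats(jobs_completed=70, busy_hours=230, wall_hours=24, nodes=10)

print(f"A: utilization {util_a:.0%}, throughput {thru_a:.1f} jobs/hour")
print(f"B: utilization {util_b:.0%}, throughput {thru_b:.1f} jobs/hour")
# B reports higher utilization (96% vs. 67%) yet pushes less work through.
```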

John Hengeveld, Intel's director of HPC marketing, suggested the supercomputing community take a tip from manufacturers of airplane jet engines.

"At Rolls-Royce, you don't buy a jet engine any longer, you buy hours of propulsion in the air. They ensure you get that number of hours of propulsion for the amount of money you pay. Maybe that is the way we should be doing things now," Hengeveld said. "We shouldn't be buying chips, we should buy results."

Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com
