Facebook Tech Infrastructure Needs Constant Care
- 02 September, 2008 13:35
Started in a dorm room four years ago, the social networking site Facebook now claims to be the fourth most-trafficked site in the world. Ninety million active users pound on 10,000 servers every day, uploading hundreds of millions of pieces of information each month. For example, "friends," who socialize in 21 languages, add 500 million photos per month.
At last count, Facebook stored 6.6 billion photos total, more than any other photo site. Roughly 400,000 developers and entrepreneurs have built 25,000 applications for the platform and about 140 new applications are added per day. (For more on Facebook's application user interface appeal, see Why Microsoft Should Bring a Facebook-like Look to SharePoint).
Overall there are 25 terabytes of cached data available to help Facebook's 2,000 databases serve up user requests.
Yeah, the infrastructure fairly boils over with activity and Jonathan Heiliger is the lucky VP of technical operations who gets to stir the pot. Heiliger, who has run technology for several start-ups and advised venture capital firm Sequoia Capital, also directed site engineering for Wal-Mart's website. He joined Facebook in October 2007 to oversee its technology set-up, which many of its 600+ employees tinker with continuously. Whew! It's a good thing Heiliger lists as an interest (on his LinkedIn profile!) "anything 24 x 7."
A CIO senior editor recently interviewed Heiliger about his work at Facebook.
You've done a lot of startups in the past. What lessons from that experience do you bring to Facebook?
The decisions you make early on tend to leave a lasting impression. It's difficult to change the way a startup is started. One of the challenges, or opportunities, that drew me here was going from a purely engineering-driven culture - writing software for users [for] sharing information - to now operating this truly large infrastructure. Those are two very different things. [Early on] you make tradeoffs in IT to speed development, and those tradeoffs can lead to disaster when you're still operating the system five years later.
What was the first thing you wanted to accomplish when you got to Facebook?
I spent the first three months coming up to speed. It was the longest coming-up-to-speed process I'd been through because most of my prior experience was at much younger companies. When I joined Facebook, there were 300 employees. [In the past] typically, I was among the first 10 employees. I knew where the bodies were buried, what cultural challenges there were. At Facebook, I had to figure it out.
What did you figure out?
There's not a lot of formal process and structure in place. Here, [the culture dictates that] you can't dip a toe. You have to dive in headfirst and wrestle crocodiles. My first mission here was to build credibility and explain what technology operations does. Until that point, it was ambiguous what engineering, IT and operations each did.
How did you draw the lines between them?
We're constantly looking at the lines. It's not static at Facebook. Most IT organizations love to control change. I threw that out when I came. We're not going to try to control change, but speed it up. We trained people to be pushers. We do a major release once a week and minor releases every few days. Recently, we made some underlying changes - to the photo gallery layout, for example - and started using more Ajax calls on the site so page refreshes are more seamless.
We created a 24x7 team in operations to be the stewards of site reliability. Instead of calling them a NOC or help desk, we called them the "site reliability group." If someone does something dumb or pushes bad code, the team can revert it. There are 20 people, split between Palo Alto, California, and London, to follow the sun. No one has to work the graveyard shift.
One thing we try to balance is, since Facebook is first and foremost a technology company, we don't want to stifle change and innovation. We'd rather innovate and have a little mess to clean up than run something reliable but stale. That's tough to do at a bank, I imagine. The IT organization at a large bank wouldn't have that flexibility-there's other people's money at stake, regulatory oversight.
What's an example?
Over the last two years, we have made a concerted effort to improve the push tool so site updates are seamless to users. Every couple of weeks, someone checks in some bad code, or there's a bad database call, or we fail to do a full design review before pushing into production, and users feel the impact. The site might start running slower or users in one geography will have issues. The operations team isolates the problem, reverts the component or rolls the whole thing back to the previous known good state.
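The "revert the component or roll everything back to the last known good state" workflow can be sketched as a small deploy ledger. This is an illustrative Python sketch under stated assumptions, not Facebook's actual push tool; every class, method, and component name here is hypothetical:

```python
# Hypothetical deploy ledger: each component keeps a push history,
# and a revert means finding its most recent known-good version.

class DeployHistory:
    def __init__(self):
        # component name -> list of {"version": ..., "known_good": ...}
        self.history = {}

    def push(self, component, version):
        """Record a new push; it is not trusted until marked good."""
        self.history.setdefault(component, []).append(
            {"version": version, "known_good": False})

    def mark_good(self, component):
        """Bless the latest push after it survives in production."""
        self.history[component][-1]["known_good"] = True

    def revert_component(self, component):
        """Return the last known-good version of one component."""
        for entry in reversed(self.history[component]):
            if entry["known_good"]:
                return entry["version"]
        return None  # nothing known-good to fall back to

    def revert_all(self):
        """Roll every component back to its last known-good version."""
        return {c: self.revert_component(c) for c in self.history}
```

A bad push is then undone by redeploying whatever `revert_component` returns, without waiting for a forward fix.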
At Wal-Mart, we had the belief that we only roll forward, never back. Once you make schema changes in the database, it's difficult to pull back. If you pushed buggy code into production, you had to fix it in production, with the user impact covered in the press.
Here, it's the opposite approach. We know things are going to break fairly regularly. We are ready. We have emotional shields for them.
You changed the basic Facebook interface a few weeks ago. What sort of things happened with that rollout?
That's a massive change. Similar to how we rolled out Chat, we turned the new interface on gradually, some percentage of users at a time.
How did you roll out the chat feature?
We had the technology running for about a month [detecting who was online] before we had the user interface visible. We turned it off several times, found a bunch of bugs that way. You can't discover that in a QA environment. You need millions of people pounding on it every day. But the actual rollout is gradual.
Are gradual rollouts an approach enterprises should take with big software rollouts?
That control, that knob, gives operations and development organizations a lot of confidence. You can turn up the heat and if there are issues, only a certain percentage of employees have been affected. It's a mentality shift. In some large enterprise apps, you can't necessarily control technology changes to a subset of users. They all have to be using the same iteration at the same time. But in other instances, you can.
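The "knob" Heiliger describes - turning a feature on for some percentage of users at a time - is commonly implemented as a deterministic hash gate. This is a minimal sketch of the general technique, not Facebook's implementation; the function name and hashing scheme are assumptions:

```python
import hashlib

def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Deterministically bucket a user into a gradual rollout.

    Hashing (feature, user_id) assigns each user a stable bucket
    from 0 to 99. The feature is on when the bucket falls below the
    rollout percentage, so raising `percent` only ever adds users:
    anyone who already has the feature keeps it as the knob turns up.
    """
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucket depends only on the user and feature names, the same user sees the same behavior on every request, and issues surface against only the enabled slice of the user base.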
What's going on at Facebook to keep the company and culture flexible?
The product we've built encourages people to be open and share information. A lot of decisions here-design reviews, PR strategy, what servers to buy-are often open for informal debate and input from across the employee base.
We built tools on top of the Facebook platform, including one called Ideas. Any employee can create an idea by category-social, office, product. There's a discussion tool with a star rating. One star is a really bad idea. Five stars is "I'm gonna quit if we don't do this." Ideas are anything from, "I think we should have a chat feature on the site" to "Can we replace sodas with juice in the fridge?" We encourage public comment.
We also live-blog. There's a person who transcribes any large company meeting, monthly presentations from different departments and weekly Q&As with the management team.
There's a combination of management's willingness and desire to continue to push to openness and creativity.
Why did you build these tools rather than buy them? There's no shortage of blog and chat tools out there.
There are a couple of good reasons and some not-so-good reasons. We're a technology company and we like to write software. But really, these tools are integrated with the Facebook interface, so it's that much easier for employees to use them. One thing I've seen at a lot of other companies is a pea-soup style: lots of tools and Web forms and e-mail in-boxes. It's difficult for an employee to know, if they have an HR question, do I e-mail or walk over? Do I have to fill out a comment form? For us, it's all Facebook. Employees use Facebook every day. They don't have to launch a browser window to go to different URLs to communicate.
What does your infrastructure look like?
Our entire Web site runs on free software. We're one of the largest MySQL sites - second or third behind Yahoo, which is No. 1. We're also a PHP site. We contribute to half a dozen open source projects. One is the Memcached project; I've taken over stewardship of that project with some of our developers here.
What have you contributed lately?
One example is Thrift, which is a language-independent network library that allows different software and systems to communicate without developers having to do rewrites of network application layers. That's gotten a respectable following among Web companies.
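To give a flavor of how Thrift removes per-language network plumbing: a developer writes an interface definition once, and the `thrift` compiler generates matching client and server stubs for each target language. The struct and service names below are invented for illustration, not from Facebook's codebase:

```thrift
// Hypothetical interface definition: one file, many languages.
struct User {
  1: i64 id,
  2: string name,
}

service UserService {
  // A PHP front end can call this against a C++ or Java back end
  // without either side hand-writing serialization code.
  User getUser(1: i64 id),
}
```

Running the compiler (e.g. `thrift --gen py user.thrift`) emits the serialization and transport code that developers would otherwise rewrite for every language pair.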