How Google Handles Outages

One of the most challenging problems for companies that provide services over the Internet is keeping systems up and running 100% of the time. No company has managed to keep its services running without interruption, but Google is certainly one of the leaders in this area. Its virtualized architecture was developed on the assumption that many of the servers in the company’s data centers will fail. Still, outages happen, and when they do, the company, like all others, has to react. Here’s a quick excerpt on how it does so.

Data Center Knowledge: Google has many data centers and distributed operations. How do Google’s systems detect problems in a specific data center or portion of its network?

Urs Holzle: We have a number of best practices that we suggest to teams for detecting outages. One way is cross monitoring between different instances. Similarly, black-box monitoring can determine if the site is down, while white-box monitoring can help diagnose smaller problems (e.g. a 2-4% loss over several hours). Of course, it’s also important to learn from your mistakes, and after an outage we always run a full postmortem to determine if existing monitoring was able to catch it, and if not, figure out how to catch it next time.
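To make the black-box versus white-box distinction concrete, here is a minimal Python sketch. It is not Google's monitoring code: the `Sample` shape, the `classify` function, and the 2% threshold (taken from the low end of the range quoted above) are all assumptions for illustration. The black-box signal is a single yes/no probe result, while the white-box signal is an internal failure counter that can reveal a small, sustained loss even when the site looks up from outside.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One white-box measurement interval (hypothetical shape)."""
    requests: int
    failures: int


def loss_rate(samples):
    """Aggregate failure fraction across all samples in the window."""
    total = sum(s.requests for s in samples)
    failed = sum(s.failures for s in samples)
    return failed / total if total else 0.0


def classify(samples, probe_ok, loss_threshold=0.02):
    """Combine a black-box probe result with white-box counters.

    probe_ok: result of an external "is the site up?" check.
    loss_threshold: illustrative value, not a figure from any real system.
    """
    if not probe_ok:
        return "outage"      # black-box check: the site is down
    if loss_rate(samples) >= loss_threshold:
        return "degraded"    # white-box check: small but sustained loss
    return "healthy"
```

A window of samples with 3% failures and a passing probe would be classified as "degraded", which is the kind of problem a black-box check alone would miss.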

DCK: Is there a central Google network operations center (NOC) that tracks events and coordinates a response?

Urs Holzle: No, we use a distributed model with engineers in multiple time zones. Our various infrastructure teams serve as “problem coordinators” during outages, but this is slightly different than a traditional NOC, as the point of contact may vary based on the nature of the outage. On-call engineers are empowered to pull in additional resources as needed. We also have numerous automated monitoring systems, built by various teams for their products, that directly alert an on-call engineer if anomalous issues are detected.

DCK: How much of Google’s ability to “route around” problems is automated, and what are the limits of automation?

Urs Holzle: There are several different layers of “routing around” problems – a failing Google File System (GFS) chunkserver can be routed around by the GFS client automatically, whereas a datacenter power loss may require some manual intervention. In general, we try to develop scalable solutions and build the “route around” behavior into our software for problems with a clear solution. When the interactions are more complex and require sequenced steps or repeated feedback loops, we often prefer to put a human hand on the wheel.
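The automatic case described above, where a client routes around a failing server, can be sketched in a few lines of Python. This is only an illustration of client-side failover in general, not the GFS client: `read_with_failover`, the exception types, and the idea of representing each replica as a callable are hypothetical choices made for the example.

```python
class ReplicaUnavailable(Exception):
    """Raised by a replica that cannot serve the request."""


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def read_with_failover(replicas, chunk_id):
    """Try each replica in turn and return the first successful read.

    replicas: list of callables taking chunk_id and returning bytes,
              standing in for hypothetical chunkserver connections.
    """
    errors = []
    for replica in replicas:
        try:
            return replica(chunk_id)
        except ReplicaUnavailable as exc:
            errors.append(exc)  # remember the failure and try the next one
    raise AllReplicasFailed(f"{len(errors)} replicas failed for {chunk_id}")
```

The caller never sees the failed replica as long as another one answers. When every replica fails, the error is surfaced rather than retried forever, which is roughly the point at which a person would need to step in.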

DCK: How might a facility-level data center power outage present different challenges than more localized types of reliability problems? How does Google’s architecture address this?

Urs Holzle: The Google within-datacenter infrastructure (GFS, machine scheduling, etc.) is generally designed to manage machine-specific outages transparently, and rack/machine group outages as long as the mortality is a fraction of the total pool of machines. For example, GFS prefers to store replicated copies of data on machines on different racks so that the loss of a rack may create a performance degradation but won’t lose data.
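The preference for spreading replicas across racks can be illustrated with a small Python sketch. It is not GFS's actual placement policy, which the interview does not describe in detail. The function name `place_replicas`, the default of three copies, and the `(machine, rack)` tuple input are assumptions made for the example.

```python
def place_replicas(machines, copies=3):
    """Pick `copies` machines, preferring at most one per rack.

    machines: list of (machine_name, rack_name) tuples in preference order.
    Returns a list of machine names.
    """
    chosen = []
    used_racks = set()
    # First pass: take at most one machine from each rack.
    for name, rack in machines:
        if len(chosen) == copies:
            break
        if rack not in used_racks:
            chosen.append(name)
            used_racks.add(rack)
    # Second pass: if there are fewer racks than copies, reuse racks.
    for name, rack in machines:
        if len(chosen) == copies:
            break
        if name not in chosen:
            chosen.append(name)
    if len(chosen) < copies:
        raise ValueError("not enough machines for the requested copies")
    return chosen
```

With replicas on distinct racks, losing one rack removes at most one copy of the data, so reads can continue from the remaining copies at reduced capacity.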

Datacenter level and multi-region unplanned outages are infrequent enough that we use manual tools to handle them. Sometimes we need to build new tools when new classes of problems happen. Also, teams regularly practice failing out of or routing around specific datacenters as part of scheduled maintenance.

DCK: A “Murphy” question: Given all the measures Google has taken to prevent downtime in its many services, what are some of the types of problems that have actually caused service outages?

Urs Holzle: Configuration issues and rate of change play a pretty significant role in many outages at Google. We’re constantly building and re-building systems, so a trivial design decision six months or a year ago may combine with two or three new features to put unexpected load on a previously-reliable component. Growth is also a major issue – someone once likened the process of upgrading our core websearch infrastructure to “changing the tires on a car while you’re going at 60 down the freeway.” Very rarely, the systems designed to route around outages actually cause outages themselves; fortunately, the only recent example is the February Gmail outage (here’s the postmortem in PDF format).