Could the Google Outage Have Been Prevented?

The massive Google Cloud outage this weekend reminds us that the cloud is just a server someone else manages. Even though cloud vendors are experts at managing massive server farms, redundant systems, cybersecurity, and the other challenges IT faces, they too can make mistakes.

We broke the news about the outage on Sunday, and now we know what happened.

Benjamin Treynor Sloss, Google's VP of 24/7, explained the problem: in short, a server configuration change was applied incorrectly. This was the impact:

Overall, YouTube measured a 2.5% drop of views for one hour, while Google Cloud Storage measured a 30% reduction in traffic. Approximately 1% of active Gmail users had problems with their account; while that is a small fraction of users, it still represents millions of users who couldn’t receive or send email. As Gmail users ourselves, we know how disruptive losing an essential tool can be! Finally, low-bandwidth services like Google Search recorded only a short-lived increase in latency as they switched to serving from unaffected regions, then returned to normal.
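Google's postmortem points to a configuration change that reached more of the network than intended. As a hedged illustration of one common safeguard against that class of mistake (not a description of Google's actual tooling), the hypothetical sketch below refuses to apply a change to any server outside the scope it was reviewed for:

```python
# Hypothetical illustration: a guard that refuses to apply a configuration
# change outside its declared scope -- the kind of safety check that can
# stop a change intended for a few servers from reaching a whole fleet.
# The function and field names here are assumptions for illustration.

def apply_config(config: dict, targets: list) -> list:
    """Apply `config` only to targets inside its declared scope.

    Returns the list of servers actually updated; raises if the rollout
    would exceed the scope the change was reviewed for.
    """
    allowed = set(config["scope"])  # servers the change was approved for
    out_of_scope = [t for t in targets if t not in allowed]
    if out_of_scope:
        raise ValueError(
            "Refusing rollout: %d target(s) outside declared scope, e.g. %s"
            % (len(out_of_scope), out_of_scope[0])
        )
    # ... push the configuration to each approved server here ...
    return targets

# A change reviewed for two servers cannot silently reach a third:
config = {"scope": ["edge-1", "edge-2"], "setting": "maintenance"}
print(apply_config(config, ["edge-1"]))  # within scope, so it proceeds
```

Pairing a scope check like this with a staged (canary) rollout means a bad change degrades a handful of machines, not an entire region.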

Benjamin continued:

Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts.

This outage reminds us how important it is to prepare for an outage BEFORE it happens. The sheer complexity of networks continues to grow, especially as public, private, and hybrid clouds become increasingly interdependent and are expected to be up at all times.

We reached out to Enzo Signore, CMO at FixStream, for his thoughts, and he had this to say: “IT operations environments have become so complex and dynamic that even small issues can cause significant harm, even to the most sophisticated organizations. This is because it’s extremely hard to correlate events across the IT stack and the amount of data involved makes it impossible for humans to do it. Companies need to automate the process to correlate event data across all the IT domains so that they can quickly locate the problem and avoid disasters. This is the foundation of AIOps solutions.”
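To make the correlation idea Signore describes concrete, here is a minimal sketch, assuming events arrive from different IT domains with timestamps: alerts that fall within a short time window are clustered into a single incident, so a network fault and its downstream storage and application symptoms surface together instead of as three unrelated tickets. The event names and window size are illustrative assumptions, not any vendor's actual algorithm.

```python
# Hypothetical sketch of cross-domain event correlation: group events from
# different IT domains (network, storage, application) that occur within a
# short time window, so related symptoms appear as one incident.

from collections import namedtuple

Event = namedtuple("Event", ["timestamp", "domain", "message"])

def correlate(events, window_seconds=60):
    """Cluster events whose timestamps fall within `window_seconds` of the
    first event in a cluster. Returns a list of clusters (lists of events)."""
    clusters = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if clusters and event.timestamp - clusters[-1][0].timestamp <= window_seconds:
            clusters[-1].append(event)   # same incident window
        else:
            clusters.append([event])     # start a new incident
    return clusters

# Illustrative events loosely modeled on the outage's symptoms:
events = [
    Event(100, "network", "congestion on backbone link"),
    Event(112, "storage", "object-storage latency spike"),
    Event(130, "application", "mail send failures"),
    Event(500, "network", "unrelated interface flap"),
]

for cluster in correlate(events):
    domains = sorted({e.domain for e in cluster})
    print("incident spanning %s: %d event(s)" % (domains, len(cluster)))
```

Real AIOps platforms add topology and causality analysis on top of time proximity, but even this simple windowing shows how automation narrows millions of raw events down to a handful of candidate incidents a human can act on.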

Of course, any outage can be "prevented" with Monday-morning-quarterback analysis after the fact. AIOps, in many ways, provides the foresight needed to see potential problems coming down the road and to act on them quickly.