Last month’s Amazon Web Services cloud outage sparked a lot of online discussion and debate over the viability of cloud services. According to published reports, an online dating company ditched AWS after massive storms caused power outages and knocked out service in one of Amazon’s U.S. East-1 Availability Zones June 29.
But Netflix – one of Amazon’s biggest cloud customers – said it remains “bullish on the cloud” despite the AWS outage. In a blog post Friday, Greg Orzell, software architect at Netflix and Ariel Tseitlin, director of cloud solutions at the company, wrote a post mortem of the outage, which they said was one of the most significant Netflix had experienced in over a year. The outage showed up things that both AWS and Netflix could do better, they wrote.
“Our own root-cause analysis uncovered some interesting findings, including an edge-case in our internal mid-tier load-balancing service,” they wrote. “This caused unhealthy instances to fail to deregister from the load balancer which black-holed a large amount of traffic into the unavailable zone. In addition, the network calls to the instances in the unavailable zone were hanging, rather than returning no route to host.”
Netflix is working to improve its resiliency and is working closely with Amazon on ways to improve the cloud provider’s systems, “focusing our efforts on eliminating single points of failure that can cause region-wide outage and isolating the failures of individual zones,” Orzell and Tseitlin wrote.
“While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved,” they wrote. “When we dig into the root causes of our biggest outages, we find that we can typically put in resiliency patterns to mitigate service disruption.”
Last summer, I attended a session at the Gartner Catalyst Conference 2011, on planning for resiliency in the cloud. Richard Jones, a managing vice president at Gartner, said the public cloud is a utility and utilities fail, making it critical that customers prepare for downtime. Enterprises often assume cloud services are reliable but they need to take responsibility for uptime, he said.
Seemed like sound advice to me. Other companies may want to look to Netflix for cues on planning for cloud resiliency.