AWS Outage Fallout: What Lessons You Should Learn
A full post-mortem from AWS is still to come, but in the meantime, IT pros should start bolstering their cloud disaster recovery strategies now -- before the next outage.
By Joey D'Antoni
In case you were on the beach yesterday, Amazon Web Services (AWS) had a major outage in its US-EAST-1 region (which cascaded to services hosted in other regions), taking down all manner of Web sites.
The impact on me, personally, was not being able to watch "Top Chef" on Hulu, but the ripple effects of this outage reached far. McDonald's kiosks couldn't show their menu images, which were likely hosted in S3, and smart vacuum cleaners stopped working. Even Amazon.com services like Amazon Music and Prime Video suffered downtime because of this outage.
So, What Happened?
As of this writing, AWS has released little information (though I expect a full post-mortem report in the coming weeks). However, it did say that the root cause of the downtime was an "impairment of several network devices in the US-EAST-1 region."
While always blaming the network can be a cliche, IT pros know that intermittent network problems can cause cascading failures that are unpredictable because they are hard to model.
Is It Time to Go Multicloud?
No. Well...if you are running a major property with a big customer-facing presence, it can be a good strategy to have static Web and app content hosted in a second cloud. In the case of an outage like yesterday's, you'd have the option to direct traffic to the static presence, which can supply some level of experience for your users.
A good example of how this approach can be useful is an outage dashboard. Cloud providers are notoriously bad at properly reporting ongoing status during an outage. This is because they host their dashboards in their own clouds using their own APIs -- and when these APIs go down, they take the monitoring with them. With a static site hosted elsewhere, you can use DNS to quickly redirect traffic to it, and your engineers can post status updates there.
However, unless you are the one operating a cloud like Microsoft Azure or AWS, a better approach is to design your application to take advantage of multiple regions in the public cloud of your choice. While this sounds like a simple approach, there are several hurdles, including some that surfaced during yesterday's outage.
Some mitigations against regional outages are trivial; using geo-replicated storage for your Web site images, for example, can be a cost-effective strategy. Other resiliency decisions are harder: they cost more money and can dramatically increase the complexity of your application. If you are only using infrastructure-as-a-service (IaaS) components like virtual machines and storage, building out a multi-region application is a well-known design pattern. You choose a data replication strategy to keep your stateful information in sync across regions (SQL Server Always On availability groups are a great example here, but other databases offer similar functionality).
Likewise, if you depend on a Lightweight Directory Access Protocol (LDAP) service like Active Directory, you should make sure the service is available across regions, because your components need to be able to authenticate in order to function. Application and Web servers that do not keep any stateful information are easier to deploy across geographies, but you will need some network intelligence -- a service like Azure Traffic Manager or AWS Route 53 -- to route incoming requests to your primary region.
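The routing decision such a service makes can be sketched in a few lines. What follows is a minimal illustration of priority-based failover, not how Azure Traffic Manager or Route 53 is implemented; the `Endpoint` type, the `pick_endpoint` function and the region names are hypothetical, and a real service would set `healthy` from its own health probes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Endpoint:
    """One regional deployment of a stateless Web or app tier (hypothetical)."""
    region: str
    priority: int   # lower number = more preferred
    healthy: bool   # assumed to come from a periodic health probe


def pick_endpoint(endpoints: Sequence[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest priority number.

    Returns None when every endpoint is unhealthy, so the caller can
    fall back to something else (for example, a static status page).
    """
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


# Illustrative data only: the primary region is failing its probes.
endpoints = [
    Endpoint(region="primary-region", priority=1, healthy=False),
    Endpoint(region="secondary-region", priority=2, healthy=True),
]
chosen = pick_endpoint(endpoints)
print(chosen.region if chosen else "no healthy region")
```

With the sample data, traffic goes to the secondary region because the primary is marked unhealthy. The `None` case matters as much as the happy path: it is the point at which you would send users to the static presence described earlier.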
Building out a real-time disaster recovery environment like the one I described above is expensive, both in computing and storage costs and in added complexity to your application. In some cases, it can be challenging even to make the decision to fail over to your second region. How long is the outage going to be? What risks are involved with failing over, and then potentially failing back? You should also note that my proposed architecture includes a new DNS dependency from the cloud provider, which can increase the risk of an outage: Without knowing the architecture of that DNS service, you have to assume it could also be a single point of failure.
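A rough calculation shows why an added dependency matters. This sketch assumes components fail independently (real outages often do not, as yesterday's showed) and uses made-up availability figures, not any vendor's SLA; the function names are hypothetical.

```python
from math import prod
from typing import Iterable


def serial_availability(components: Iterable[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` identical, independent copies suffices."""
    return 1 - (1 - single) ** copies


# Made-up numbers for illustration only.
region = 0.999   # one region on its own
dns = 0.9999     # the DNS routing service in front of both regions

two_regions = redundant_availability(region, copies=2)
with_dns = serial_availability([two_regions, dns])

print(f"two regions alone:    {two_regions:.6f}")
print(f"two regions plus DNS: {with_dns:.6f}")
```

Under these assumptions, the combined figure is lower than that of the two regions alone and can never exceed the availability of the DNS service itself. A second region helps only as much as the shared component in front of it allows.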
Which leads us to the next point.
Cloud Vendors Need More Transparency Around PaaS Architecture
AWS customers who thought they had built their applications to survive a regional outage, but still experienced downtime yesterday, were surprised when the cascading effects of the outage took out other AWS services that had underlying dependencies in US-EAST-1, which in turn left their applications vulnerable.
While you can and should do all manner of disaster recovery testing in the cloud, you are dependent on the cloud vendor's underlying services to stay online. This is why I advocate for cloud providers to be more transparent about their application architectures for platform-as-a-service (PaaS) offerings.
While those architectures may be considered trade secrets, I think sharing high-availability and disaster recovery mechanisms allows customers to make more informed decisions. For years, Microsoft was somewhat reluctant to publicly share its high-availability approach for Azure SQL Database. However, it relented and has since published the full architecture. It shows that your Azure SQL Database will not survive the loss of a region as a standalone entity (though you could use your geo-redundant backups to restore to another region). This means that you, as the customer, get to make the decision to spend more money to have a live copy of your database in a second region.
Having transparency into these PaaS offerings allows customers to make better decisions around which services they choose for their applications. Understanding failure patterns is key to building a solid site reliability plan.
Testing, Testing, Testing
I would be remiss to discuss disaster recovery or failover without discussing failover testing. One of the major benefits of cloud computing is that resources are so elastic. You can quickly spin up a testing environment for disaster recovery, and no matter how much you think you know, you will always learn something new when you do those tests. In fact, just yesterday, I found a bug with a PaaS service's backup solution during a disaster recovery test for a client.
Nothing in technology is ever trivial, and building applications to tolerate regional cloud outages is not easy. You can build a resilient application stack and still see it undermined by an underlying dependency on a single point of failure at your cloud provider. On top of all the testing and design considerations, it is important to agree with your users and management on what your downtime budget is. How long can you be down without drastically affecting both your customers and your bottom line?
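One way to make that conversation concrete is to turn an availability target into hours. This is a small sketch of the arithmetic only; the function name and the sample targets are my own illustration, and the right target for your business is a decision for you and your management.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; ignores leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of downtime allowed in the period at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return (100 - availability_pct) / 100 * period_hours


# Sample targets for illustration only.
for target in (99.0, 99.9, 99.99):
    hours = downtime_budget_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

A 99.9 percent target, for example, allows roughly 8.76 hours of downtime a year, and each additional nine cuts that allowance by a factor of 10. Comparing those numbers against how long a failover and failback actually take in your tests is a quick check on whether the target is realistic.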
Beyond that, make sure you have a mechanism to communicate with your customers, and overcommunicate during an outage. Good luck, and may all your servers (or containers) stay online.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over a decade of experience working in both Fortune 500 and smaller firms. He is currently Principal Consultant for Denny Cherry and Associates Consulting. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. Joey is the co-president of the Philadelphia SQL Server Users Group. He is a frequent speaker at PASS Summit, TechEd, Code Camps, and SQLSaturday events.