The Night the Lights Went Out in the Cloud: Lessons from the AWS Outage -- Redmondmag.com

Last week's AWS outage that broke the Internet showed how critical it is to build applications that can withstand transient failure. Here's what you need to know to design a resilient cloud app (and it doesn't involve multicloud).

By Joey D'Antoni
12/02/2020

Last week, Amazon Web Services (AWS) had a fairly major outage in its U.S. East 1 region, taking down a lot of sites and breaking vacuum cleaners and doorbells.

Cloud outages are going to happen, and it's important to understand how they'll impact your applications and how to design those applications to manage failure. Cloud computing affords you myriad options for building and deploying applications, at a variety of price points. Choosing the right options depends on your team's skillset, your budget and the nature of your application.

The Nature of Cloud Outages
In preparing for this article, I researched a number of recent cloud outages in both Microsoft Azure and AWS. The interesting thing is that for the most part, each outage has been limited to a single region within a single provider -- or, in a few rare cases, a single service across the provider. Generally speaking, the way cloud providers deploy software prevents bugs from spreading to multiple regions. Of course, hardware failures can happen, but typically those sorts of failures are much more isolated.

The other type of failure I found is a single service becoming unavailable. In some cases, this is caused by failures in systems the service depends on, but those have been less common.

Finally, there is the loss of an entire datacenter: In 2018, a lightning strike knocked out a datacenter and took most of an Azure region down with it.

Designing for Downtime
With the above outages in mind, how do you design your application topology to survive transient failure? Well, first things first: Put down Visio and talk to your business.

While designing for high availability and disaster recovery is an IT function, setting a budget for application downtime is a decision the business needs to make.

Some applications, particularly internally facing applications, can tolerate a good amount of downtime. Others -- for example, a customer-facing Web site -- can probably only tolerate seconds to minutes of downtime.

Once you have a downtime budget, you can start building your cloud budget, then settle on a design for your application.
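To make the downtime budget concrete, it can help to turn an availability target into an amount of time. The following is a minimal sketch, not a formula from any provider's SLA; the function name `downtime_budget` and the sample targets are assumptions chosen for illustration.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime allowed in `period` at a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    # The budget is simply the fraction of the period you may be unavailable.
    return period * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Illustrative targets only; the business sets the real number.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% availability allows {downtime_budget(target)} per year")
```

Putting the number in hours and minutes tends to make the conversation with the business easier than quoting a percentage.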

Building a Resilient Cloud Application
Depending on your budget and the services you have chosen for your applications, you can choose options at a few different price points for multiregion support.

If you are running your database tier in virtual machines (VMs), you can use Always On availability groups in SQL Server to easily span both Availability Zones and regions so that your databases are broadly available. (AWS and Azure both use the term "Availability Zones" to specify different physical datacenters within a geographic region.)

Note that you can send data synchronously (incurring no data loss) across Availability Zones, but across regions you will want to send data asynchronously (risking the loss of in-flight transactions) to avoid impacting the performance of your application. This is an expensive solution because you are doubling (or tripling) your compute and storage costs. You can reduce the cost by downsizing your VMs in the secondary regions, keeping in mind that you will incur a small amount of additional downtime when you resize those VMs after failover.
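The difference between synchronous and asynchronous replication can be illustrated with a toy model. This is a deliberately simplified sketch, not how SQL Server ships log records: it assumes a fixed replication lag, and the function name `lost_on_failover` and its parameters are hypothetical.

```python
def lost_on_failover(commit_times, failure_time, replication_lag):
    """Return commits acknowledged on the primary but missing on the secondary.

    Times are in seconds. A replication_lag of 0 models synchronous commit,
    where nothing is acknowledged before the secondary has it.
    """
    # Transactions the primary had committed when it failed.
    committed = [t for t in commit_times if t <= failure_time]
    # Everything committed at or before this point had reached the secondary.
    replicated_cutoff = failure_time - replication_lag
    return [t for t in committed if t > replicated_cutoff]


if __name__ == "__main__":
    commits = [1.0, 2.0, 3.0, 4.0, 5.0]
    print("async, 2s lag:", lost_on_failover(commits, 4.5, 2.0))
    print("synchronous:  ", lost_on_failover(commits, 4.5, 0.0))
```

In this model, the transactions at risk are those committed inside the lag window just before the failure, which is why a busy cross-region workload has more to lose than a quiet one.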

As I mentioned, duplicating your VM infrastructure is an expensive solution, but it will allow you to have minutes of downtime in the event of a regional failure. Some other options at lower price points include shipping your backups to another region and planning to redeploy your infrastructure (causing hours of downtime), or using a tool like Azure Site Recovery (ASR), which provides infrastructure-based replication. While ASR is a good solution for many workloads, extremely busy databases may incur data loss as the replication process is not SQL Server-aware.
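One way to reason about these price points is to pick the cheapest option whose expected downtime fits the budget the business gave you. The sketch below assumes made-up relative costs and recovery times; only the rough ordering (hours for redeploying from backups, minutes for a duplicated VM tier) comes from the discussion above, and the `DrOption` class and `cheapest_option` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    relative_cost: float          # hypothetical; 1.0 = single-region baseline
    expected_downtime: timedelta  # hypothetical recovery time


# Illustrative numbers only; measure your own environment before deciding.
OPTIONS = [
    DrOption("Ship backups to a second region and redeploy", 1.1, timedelta(hours=8)),
    DrOption("Infrastructure-based replication (for example, ASR)", 1.3, timedelta(hours=1)),
    DrOption("Availability group, downsized secondary VMs", 1.6, timedelta(minutes=20)),
    DrOption("Availability group, full-size secondary VMs", 2.0, timedelta(minutes=5)),
]


def cheapest_option(budget: timedelta, options=OPTIONS) -> Optional[DrOption]:
    """Return the lowest-cost option that recovers within `budget`, or None."""
    candidates = [o for o in options if o.expected_downtime <= budget]
    return min(candidates, key=lambda o: o.relative_cost, default=None)


if __name__ == "__main__":
    choice = cheapest_option(timedelta(minutes=30))
    print(choice.name if choice else "No option meets this budget")
```

If no option fits, that is a signal to revisit either the downtime budget or the cloud budget with the business rather than to stretch the design.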

If you are using a Platform as a Service (PaaS) solution like Azure SQL Database or Managed Instance, or Amazon RDS, your choices are narrower, but much easier to implement.

With Azure SQL Database, you can geo-replicate to up to four different regions; with Managed Instance, you can replicate to a single additional region. Geo-replication can be combined with Failover Groups, which provide a single namespace (like an AG listener) to allow for faster failover. Azure also supports Availability Zones in PaaS, which is different from geo-replication and can provide greater uptime in the event of a datacenter failure. In addition, Microsoft recently added the ability to test failover within a region to ensure your application can survive transient failure.

Amazon RDS supports multiregion replication via the AWS Database Migration Service, which uses change data capture in SQL Server. AWS also supports Availability Group deployment into its own Availability Zones.

Both Azure and AWS support geo-replication of PaaS backups so that your backups are available in the event of a regional failure.

The Multicloud Trap
You will notice that I didn't mention multicloud solutions here. This was by design.

While multicloud solutions can be good for stateless applications like a static Web page, implementing multicloud database solutions is really challenging and beyond the skillsets of most organizations.

Some solutions like Azure Arc-enabled data services are attempting to make this easier, but they are still at the bleeding edge of technology and are probably a couple of years away from being easy to implement. Additionally, going to a multicloud solution inherently prevents you from using a PaaS, which can reduce the benefits you get from using the public cloud.

The inherent complexity in trying to handle multitier failover across clouds means that you can easily incur more downtime from your multicloud implementation than you would have from a well-built single-cloud solution.

In summary, cloud failures are going to happen. It is always important to build applications that can survive normal transient failures in the event of a minor hardware problem.
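A common way to survive transient failures in application code is to retry with exponential backoff and jitter. The following is a minimal sketch of that pattern, not a drop-in for any particular database driver: `TransientError` is a hypothetical marker exception, and in practice you would map your driver's timeout and dropped-connection errors to it.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, dropped connections)."""


def call_with_retry(operation, max_attempts=5, base_delay=0.5,
                    max_delay=30.0, sleep=time.sleep):
    """Call `operation`, retrying on TransientError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts; let the caller handle the failure.
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so the behavior can be tested without waiting; production callers can leave it at the default.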

To maintain your application through larger regional failures, you need to build a multiregion solution that allows for failover. You should also regularly test your failover scenarios to catch any misconfigurations (the most common being DNS problems) in your solution. Finally, think very carefully before building a multicloud solution; it's really challenging and not something to undertake on a whim.
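Because DNS problems are the most common misconfiguration, a failover test can start with a simple check that every name your application depends on still resolves. This is a minimal sketch using Python's standard `socket` module; the function name `unresolved_endpoints` is an assumption, and the hostnames you pass in (listener names, failover group endpoints) are your own.

```python
import socket


def unresolved_endpoints(hostnames, resolve=socket.gethostbyname):
    """Return a dict of hostname -> error text for names that fail to resolve."""
    failures = {}
    for host in hostnames:
        try:
            resolve(host)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failures[host] = str(exc)
    return failures
```

A successful lookup only proves the name resolves, not that it points at the new primary, so treat this as a first check before an end-to-end connection test.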

About the Author

Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.
