Joey on SQL Server
Turning the CrowdStrike Outage into a Disaster Recovery Tabletop Exercise
As the dust settles on the massive CrowdStrike and Azure outages, the conversation about what your organization should do in similar situations should happen now.
- By Joey D'Antoni
- 07/19/2024
Last night, our Web site, dcac.com, was down (along with other services) due to the outage in the Azure Central US region. I went to bed thinking about our lessons learned and how they would make a lovely column on building resiliency into disaster recovery processes, along with a handful of interesting tech tips (like putting your resource groups in a different region than your core business resources, so you can still make changes if an entire region is down).
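To make that tip concrete, here is a minimal sketch using the Az PowerShell module; the resource names are placeholders. The resource group's metadata lives in East US 2 while the VM itself runs in Central US, so if Central US goes down you can still perform management operations against the group:

# Minimal sketch (Az PowerShell module); resource names are placeholders.
# The resource group's metadata lives in East US 2 ...
New-AzResourceGroup -Name 'rg-core-prod' -Location 'eastus2'

# ... while the VM itself runs in Central US.
New-AzVM -ResourceGroupName 'rg-core-prod' `
         -Name 'sqlvm01' `
         -Location 'centralus' `
         -Image 'Win2022Datacenter' `
         -Credential (Get-Credential)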
Then, I woke up this morning to the news that American and Delta Air Lines were down as part of another "Microsoft" outage.
After reading a few more tweets, I discovered that it wasn't a Microsoft outage at all. Instead, it was a Windows outage: CrowdStrike, an endpoint detection and response (EDR) security tool, pushed out a new update globally that included a malformed Windows system driver, sending every Windows machine running it into a blue screen of death (BSOD) loop. While the Azure problems took down a lot of Web sites and such, the CrowdStrike outage took down both servers and endpoint devices. There's a key difference here -- even though modern software deployment methods discourage logging onto servers directly to fix things, IT admins are still paranoid and can get into servers under nearly any conditions. End-user devices are much harder to reach.
Let's think about the airlines as an example. I don't have one as a client, but I do fly a lot, so let me speculate about their overall data center architecture. They likely have one or more on-premises data centers and a significant cloud presence. The on-prem data centers house mainframes and other key business systems that aren't cloud-friendly, while the cloud likely serves the apps on your phone and less critical business functions. The number of computers in those data centers and clouds is high but manageable -- probably in the thousands. Now, think about the number of computers at an airport that run Windows. Every airline at every airport has at least 10 to 20 PCs, probably more, and larger airports like hubs and outstations have hundreds or even thousands of machines per airline.
Now you are talking about thousands of machines in hundreds of cities. The fix for this problem is to boot Windows into safe mode and remove the faulty driver file. While that is a one-line batch or PowerShell script, it requires someone to get into safe mode (which should require elevated credentials) and the ability to run that script -- which likely means a helpdesk or admin user on the phone, explaining the process to a ticketing or reservations agent. The need to touch every device means that recovery from this outage will require a massive manual remediation effort.
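For context, the workaround CrowdStrike published boils down to booting into safe mode and deleting the faulty channel file from the CrowdStrike driver directory. A rough PowerShell sketch of that one-liner follows; treat the file pattern as illustrative and confirm it against the vendor's current guidance before running anything like it:

# Run from an elevated prompt after booting into safe mode.
# The wildcard matches the channel file identified as faulty in the published
# workaround; verify the exact pattern against current vendor guidance.
Get-ChildItem 'C:\Windows\System32\drivers\CrowdStrike' -Filter 'C-00000291*.sys' |
    Remove-Item -Force

# Reboot normally once the file is gone.
Restart-Computer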
The concept of blast radius is exactly what it sounds like. Like many technology terms, it derives from military usage, and it refers to the potential impact or damage a systems failure or a data breach could cause. In systems architecture, designs like isolated networks and zero trust minimize the blast radius of a data breach, while technologies like clustering and Kubernetes lower the impact of a single server failure. While blast radius is a focus of engineering large-scale services, it is also something you can apply in your own IT organization.
In the cases of both the Azure and the CrowdStrike-related outages, we can see how a fundamental misunderstanding of blast radius can impact the scope of an outage. There is speculation on Reddit that the Azure outage was related to the decommissioning of a storage service.
While I can't speak to what happened to Azure, I can tell you about this one time I was on call. I got paged on a Sunday morning when an Oracle database server was down. I logged into the VPN and SSHed into the UNIX server, but my standard customized command prompt was not there. I didn't see anything when I tried to list files or processes. I texted my boss, who quickly replied, "I forgot to call you, but $vendor came in to decommission a storage area network (SAN), and they took the wrong one." It wasn't just that one database that was down -- we had to restore much of our datacenter from backup tape.
The CrowdStrike case is a bit more interesting. Typically, end-user devices are something IT thinks about but doesn't lavish care on. IT admins carefully back up servers (and, hopefully, test the restore process), while desktops and laptops are treated more like farm animals, replaced or reimaged whenever there is any kind of issue. However, the mechanisms for doing those replacements and reimages are not designed to operate on "all of the machines in the company," so quickly figuring out a remediation strategy is a challenge IT admins everywhere face today. The big issue is that CrowdStrike apparently didn't do any phased rollout, a common approach for reducing your blast radius.
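To make "phased rollout" concrete, here is a rough sketch of ring-based deployment logic. Deploy-Update and Test-RingHealth are hypothetical stand-ins for whatever your deployment and monitoring tooling actually provides; the point is that an update reaches a small canary ring first, soaks for a while, and only widens if that ring stays healthy:

# Hypothetical ring-based (phased) rollout sketch. Deploy-Update and
# Test-RingHealth are stand-ins for your own deployment and monitoring tooling.
$devices  = Import-Csv .\device-inventory.csv              # assumed inventory with a Name column
$ringSize = [int][math]::Ceiling($devices.Count * 0.05)    # start with a ~5 percent canary ring

$i = 0
while ($i -lt $devices.Count) {
    $ring = $devices[$i..([math]::Min($i + $ringSize - 1, $devices.Count - 1))]

    Deploy-Update -ComputerName $ring.Name                 # hypothetical deployment call
    Start-Sleep -Seconds (4 * 3600)                        # soak time before widening the blast radius

    if (-not (Test-RingHealth -ComputerName $ring.Name)) { # hypothetical health check
        Write-Error 'Ring failed its health check; halting the rollout.'
        break
    }

    $i += $ring.Count                                      # advance past the ring just deployed
    $ringSize *= 2                                         # widen each subsequent ring
}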
One of the challenges modern IT admins face is a lack of control. When your cloud region goes down, there is little you can do if you didn't plan for disaster recovery ahead of time. When your security vendor pushes a bad patch out globally, there isn't much you can do to resolve it. This lack of control means it is more important than ever to think about disaster recovery. In this case, a lot of the discussion is theoretical, which means it only costs time initially (the money part comes later). We call these tabletop DR exercises.
Tabletop DR exercises are an excellent excuse to think crazy thoughts. What happens in an earthquake? What if your security vendor gets banned from the country? Some problems may not have solutions, but at least you are prepared and have some expectation of what will happen when the inevitable failure occurs.
About the Author
Joseph D'Antoni is an Architect and SQL Server MVP with over two decades of experience working in both Fortune 500 and smaller firms. He holds a BS in Computer Information Systems from Louisiana Tech University and an MBA from North Carolina State University. He is a Microsoft Data Platform MVP and VMware vExpert. He is a frequent speaker at PASS Summit, Ignite, Code Camps, and SQL Saturday events around the world.