String of July Outages Fuels Datacenter Cloud Anxiety
Datacenter managers already have a tough time convincing the top brass to embrace cloud. Last week's high-profile service failures at Google, Facebook and Apple haven't helped.
By John K. Waters
Analysts tell us that enterprises like the idea of getting out of the datacenter business, of letting go of the expenses associated with staff and on-premises hardware and software, and ultimately shifting just about everything to the cloud. But they're reluctant to place high-value data and applications in cloud environments because of concerns about reliability and security.
Several disruptions and outages reported last week by top cloud service providers and social networks grabbed headlines that are unlikely to ease those concerns. In fact, they will almost certainly increase the challenges for datacenter managers advocating for cloud-based services at their organizations.
It doesn't help that these outages follow what Google described as a "catastrophic failure" last month, in which datacenter automation software took down Gmail, YouTube, Snapchat and Shopify (to name a few) for four hours. But the causes of this month's outages are unrelated to that disruption, and they are also unrelated to each other.
Downtime incidents like these are expensive and scary, and it's worth sorting them out to provide some clarity -- and maybe ease a bit of the anxiety any splashy headlines might have triggered.
On July 2, Internet services provider Cloudflare experienced a global outage across its network of 140 datacenters that resulted in visitors to Cloudflare-proxied domains being shown 502 errors ("Bad Gateway").
The outage, which lasted about 30 minutes, was the result of a massive spike in CPU utilization across the network, not an attack, company CTO John Graham-Cumming explained in a blog post. The spike was caused by "a bad software deploy," he wrote -- specifically, the "deployment of a single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new Cloudflare WAF Managed rules."
Once the change was rolled back, the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels, he said.
On July 3, users of Facebook and Instagram experienced problems uploading and/or sending images, videos and other files. The disruption, which lasted for more than half a day, left some users unable to load pictures to the Facebook newsfeed, view stories on Instagram or send messages on the company's WhatsApp service.
At the time, Facebook tweeted, "We're aware that some people are having trouble uploading or sending images, videos and other files on our apps. We're sorry for the trouble and are working to get things back to normal as quickly as possible." The problem, explained a company spokesperson, was an error triggered during routine maintenance. No other details were offered.
According to the DownDetector Web site, thousands of users around the world reported outages, but Europe and North America got the worst of it.
On that same day, Google reported latency problems with its Google Cloud Platform (GCP) networking and load balancing services among users in the eastern United States.
The root cause of the disruption, the company explained in a post on the Google status page, was "physical damage to multiple concurrent fiber bundles serving network paths in us-east1," which serve the Alphabet subsidiary's datacenter in South Carolina. In other words, someone cut the cable.
Google began "electively rerouting" some network traffic to minimize service interruptions. After approximately 21 hours, the company reported that it had repaired the damaged fiber bundles and returned to normal routing.
For about three hours on Independence Day, Apple experienced a cloud outage that affected many of its services, including the App Store, Apple Books, Apple Music, Apple Pay and Apple TV. And reports surfaced that iCloud users were having trouble signing on to access Photos, Mail, Backup, Find My Friends, Contacts and Calendars.
Internet analytics startup ThousandEyes tweeted, "Starting just before 9am PDT ThousandEyes tests detected that users connecting to http://apple.com and Apple services, such as Apple Pay, began experiencing significant packet loss, which would have prevented many of them from successfully connecting to those services." ThousandEyes added that the "packet loss appears to have been caused by a BGP [Border Gateway Protocol] route flap issue, where a routing announcement is made and withdrawn in quick succession."
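The route flap ThousandEyes describes, an announcement followed quickly by a withdrawal, can be illustrated with a small Python sketch. This is not ThousandEyes' method or any router's actual flap-detection logic; the event format, the `find_flaps` name, the 60-second window and the sample prefixes (reserved documentation ranges) are all assumptions made for illustration.

```python
from collections import defaultdict


def find_flaps(events, window_seconds=60.0):
    """Count announce-then-withdraw pairs that occur within window_seconds.

    events: iterable of (timestamp_seconds, prefix, action) tuples, sorted by
    time, where action is "announce" or "withdraw". (Hypothetical format.)
    """
    last_announce = {}        # prefix -> time of its most recent announcement
    flaps = defaultdict(int)  # prefix -> number of quick withdrawals seen
    for timestamp, prefix, action in events:
        if action == "announce":
            last_announce[prefix] = timestamp
        elif action == "withdraw":
            announced_at = last_announce.pop(prefix, None)
            # A flap here means: withdrawn soon after being announced.
            if announced_at is not None and timestamp - announced_at <= window_seconds:
                flaps[prefix] += 1
    return dict(flaps)


# Made-up event log: the first prefix flaps twice, the second stays up
# long enough that its withdrawal does not count as a flap.
sample_events = [
    (0.0, "192.0.2.0/24", "announce"),
    (5.0, "192.0.2.0/24", "withdraw"),
    (8.0, "192.0.2.0/24", "announce"),
    (12.0, "192.0.2.0/24", "withdraw"),
    (20.0, "198.51.100.0/24", "announce"),
    (500.0, "198.51.100.0/24", "withdraw"),
]
flap_counts = find_flaps(sample_events)
print(flap_counts)  # {'192.0.2.0/24': 2}
```

While a prefix is flapping, routers that depend on it keep gaining and losing the path, which is consistent with the packet loss ThousandEyes reported.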
The BGP Internet routing system was also at the center of an outage last month. On June 24, Verizon accidentally rerouted IP packets, causing Amazon, Google, Facebook and Cloudflare to experience what Cloudflare characterized as a "small heart attack." Cloudflare network software engineer Tom Strickx took Verizon to task for the error in a blog post:
A small company in Northern Pennsylvania became a preferred path of many Internet routes through Verizon (AS701), a major Internet transit provider. This was the equivalent of Waze routing an entire freeway down a neighborhood street -- resulting in many websites on Cloudflare, and many other providers, to be unavailable from large parts of the Internet. This should never have happened because Verizon should never have forwarded those routes to the rest of the Internet.
His blog post provides interesting details about the incident, and it's well worth reading.
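The complaint that Verizon "should never have forwarded those routes" points at a common safeguard: a transit provider accepts from a customer only the prefixes that customer is expected to announce. The Python sketch below illustrates that general idea. It is not Verizon's configuration and not the specific remedy from the Cloudflare post; the `filter_customer_routes` name, the allowlist and the prefixes (reserved documentation ranges) are hypothetical.

```python
import ipaddress


def filter_customer_routes(announced, allowed):
    """Split a customer's announced prefixes into (accepted, rejected).

    A prefix is accepted only if it falls inside one of the prefixes the
    customer is expected to announce (a hypothetical allowlist).
    """
    allowed_nets = [ipaddress.ip_network(p) for p in allowed]
    accepted, rejected = [], []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # Check the IP version first: subnet_of() raises TypeError on a mismatch.
        ok = any(net.version == a.version and net.subnet_of(a) for a in allowed_nets)
        (accepted if ok else rejected).append(prefix)
    return accepted, rejected


# Hypothetical customer that should only announce 203.0.113.0/24
# but also leaks routes belonging to other networks.
allowed_prefixes = ["203.0.113.0/24"]
announced_prefixes = ["203.0.113.0/25", "198.51.100.0/24", "192.0.2.0/24"]

accepted, rejected = filter_customer_routes(announced_prefixes, allowed_prefixes)
print("accepted:", accepted)  # ['203.0.113.0/25']
print("rejected:", rejected)  # ['198.51.100.0/24', '192.0.2.0/24']
```

With a filter along these lines at the provider's edge, leaked routes are dropped there instead of being passed on to the rest of the Internet.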
John K. Waters is the editor in chief of a number of Converge360.com sites, with a focus on high-end development, AI and future tech. He's been writing about cutting-edge technologies and culture of Silicon Valley for more than two decades, and he's written more than a dozen books. He also co-scripted the documentary film Silicon Valley: A 100 Year Renaissance, which aired on PBS.