In-Depth

Worst-Case Scenarios: When Disaster Strikes

Leading IT experts help answer the question: What if?

When Hurricane Katrina devastated the Gulf Coast and New Orleans in August 2005, federal officials said no one had ever conceived of such large-scale destruction, much less planned for it. From the painfully slow response to the galling red tape, it seemed as if the scope and scale of Katrina took the government by surprise.

It didn't. In July 2004, a FEMA-led exercise modeled the effects of a slow-moving Category 3 hurricane striking New Orleans. The five-day project, which involved local, state, and federal agencies, as well as numerous private enterprises, forecast many of the most destructive aspects of the fictional storm, named Hurricane Pam. By all accounts, it was a forward-thinking exercise that should have prepared both the region and the nation for the fury of Katrina.

That exercise prompted Redmond magazine to run an exercise of its own. We formed a panel of leading IT experts to help us simulate four likely IT-related disasters. Stacy Smith, chief information officer at Intel Corp., has concocted a timeline that reveals how a large financial services firm in San Francisco might weather a magnitude 6.8 earthquake. Johannes Ullrich, chief technical officer of IT training and security monitoring firm The SANS Institute, walks us through a savvy bot-net attack against an online retailer. Jack Orlove, vice president of professional services at IT consultancy Cyber Communication, shows how a localized bird flu outbreak might convulse a regional manufacturer. Finally, Madhu Beriwal, CEO of Innovative Emergency Management, offers a chilling glimpse at how a carefully crafted worm-and-virus assault could cripple an enterprise and leave millions of computer users without Internet access.

Each scenario models a viable threat, and offers IT planners insight into what to expect when the worst case becomes the real case.

Shaken Up
Distributed infrastructure and robust planning help ride out a major earthquake
Feb. 20 - Feb. 23, San Francisco, Calif.
Stacy Smith, CIO, Intel Corp.

The scene: A magnitude 6.8 earthquake, lasting 4 minutes and 20 seconds, strikes San Francisco at 4:20 a.m. on Monday, Feb. 20, 2006, causing major damage to the area. Hundreds of buildings and thousands of homes in the downtown and surrounding areas suffer damage. Several water mains burst, disrupting service to major parts of downtown. Fires ignite where natural gas lines have broken. Power is out in many areas. Several highway overpasses on major commuter routes are damaged and others need to be inspected. The city government pleads with citizens to stay off the roads so emergency services can get through. More than 3,700 people are injured and 81 are dead.

Acme Financial maintains three office buildings within the hardest-hit area. The only people in the buildings at the time of the quake are the security guards. Trained in earthquake drills, they stop, drop and take cover, avoiding injury. After the quake subsides, they use their computers' auto-dialer function to send alerts to the company's local Emergency Response Team (ERT).

Acme's two earthquake-resistant buildings sustain minimal damage, though the skybridge between them has fallen. No power is available to either building, but backup generators restore power to the data center and keep critical business functions running. Emergency lighting enables the ERT to assess the situation.

The older building sustains major damage. Its backup generator fails, leaving the data center in its basement without power. Fortunately, the company houses data centers in three cities as part of its disaster recovery strategy. Data is continuously mirrored in real time between data centers in San Francisco, Los Angeles and Tokyo. The biggest issues facing IT and the Business Continuity team are the closure of the city to commuters, overloaded communications networks, and the need to locate and verify the safety of local employees. In a stroke of good fortune, none of the bridges (Golden Gate, Oakland Bay and San Mateo) that carry the high-speed data communication lines are damaged.
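How such mirroring would be verified isn't spelled out in the scenario, but in rough terms a health check might compare the last transaction each site has applied. The minimal Python sketch below illustrates the idea; the site names, the 30-second lag threshold and the source of the timestamps are assumptions, not Acme's actual replication tooling.

# Hypothetical sketch: flag mirrored data centers that have fallen behind.
# Site names, the 30-second threshold and the timestamp source are
# illustrative assumptions, not Acme's actual replication setup.
def stale_mirrors(applied_ts: dict[str, float],
                  primary: str = "san-francisco",
                  max_lag_seconds: float = 30.0) -> list[str]:
    """Return mirror sites whose last-applied transaction trails the
    primary site by more than max_lag_seconds."""
    primary_ts = applied_ts[primary]
    return [site for site, ts in applied_ts.items()
            if site != primary and primary_ts - ts > max_lag_seconds]

# Example: Los Angeles is 5 seconds behind (fine), Tokyo is 45 seconds behind.
print(stale_mirrors({"san-francisco": 1000.0,
                     "los-angeles": 995.0,
                     "tokyo": 955.0}))   # -> ['tokyo']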

For backup communications, the company has developed Internet and intranet sites, a toll-free 800 number, and a team of amateur radio operators trained to move into emergency operations centers (EOCs) and establish communications. An emergency alternative workspace outside the city is equipped with Internet access and telecommunications services, including a WiMAX (802.16) access point through which any 802.11g wireless access point in the building can connect to a T1 service.

On Monday at 4:30 a.m., crisis management team members access local copies of encrypted business continuity documents on their notebooks. Using DSL and cable services, team members begin communicating from their homes via Microsoft Instant Messaging.

The crisis management team leader is stranded in another city, so the second in command steps in. IT, as part of the crisis management team, posts a notice on the Web and in a recorded phone message telling all employees to stay out of the area and work on laptops from home. Corporate policy requiring employees to take laptops home every night means most employees should have their computers with them.

Fallout: Shaken Up
  • Losses in equipment and IT infrastructure: $1.1 million
  • Distributed infrastructure saved the day, thanks to remote data backup and Web services-based applications.
  • Joint business continuity and IT earthquake planning enabled post-quake transition for home-based workforce.
  • Company will consider satellite broadband as a backup alternative for EOC and homes of key employees.

Employees needing computing equipment are directed to report to the alternative workspace the next day or as soon as they can. Directions are posted on the intranet. At 6:20 a.m., an EOC is established in the alternative workspace and one of the amateur radio operators has arrived to supply communications.

By 8 a.m. the regional phone system and local ISPs are overloaded. E-mail messages from employees trickle in through the Internet, but few phone calls make it through. By the end of the day, 50 percent of local employees have contacted the ERT to confirm that they can work from home. The next morning, nearly all members of IT and business continuity have been contacted and, unless in personal crisis, have either mobilized for recovery efforts or are performing normal job functions from home.

At 8:30 a.m. on Feb. 21, the backup generators are down to about 24 hours of fuel. Using amateur radio, IT employees warn the company's other data center locations to prepare for a server and application failover. They also request replacement backups of all critical San Francisco data from the mirrored sites as additional insurance. The company uses Microsoft Cluster Server and Application Center/Network Load Balancing Service to balance and absorb the extra load on the remote servers.
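The scenario doesn't describe how that failover is coordinated. Purely as a hedged illustration -- not the actual Microsoft Cluster Server or Application Center/NLB configuration -- a coordinating Python script might zero out the failing site's traffic weight and spread its share across the surviving data centers:

# Hypothetical failover sketch: drain traffic from a site that is about to
# lose power. The site names and weights are illustrative; a real deployment
# would use the cluster or load-balancer vendor's own management interface.
SITE_WEIGHTS = {"san-francisco": 40, "los-angeles": 40, "tokyo": 20}

def drain_site(weights: dict[str, int], failing_site: str) -> dict[str, int]:
    """Zero the failing site's weight and redistribute its share
    proportionally across the remaining sites."""
    remaining = {s: w for s, w in weights.items() if s != failing_site}
    total = sum(remaining.values())
    freed = weights.get(failing_site, 0)
    new_weights = {failing_site: 0}
    for site, weight in remaining.items():
        new_weights[site] = weight + round(freed * weight / total)
    return new_weights

print(drain_site(SITE_WEIGHTS, "san-francisco"))
# -> {'san-francisco': 0, 'los-angeles': 67, 'tokyo': 33}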

At 9 a.m., the company experiences heavy Internet traffic loads as credit card and brokerage customers check to see if their accounts are still available, and suppliers on the extranet post inquiries to their direct contacts. IT routes calls and Internet traffic to the company's Los Angeles offices. Employees working from home struggle with slow ISP connections and an overloaded phone system.

At 11 a.m., reports of fires, water outages, street damage and looting downtown make it obvious it will be several days before more members of the ERT can enter the city to further assess damage. Freight and messenger companies are notified via the Web to make deliveries to the EOC.

On Wednesday morning, Feb. 22, ERT members report that they can't find fuel and the generators are running dry; they manage an orderly shutdown of the network before power is lost. Around 4 p.m., ERT teams are finally able to use local streets to access the company buildings. The older headquarters is taped off due to severe structural and water damage. The other two buildings have sustained minor damage (broken windows, cracked drywall), though utility services need to be restored. The teams refuel the backup generator and restart it, taking the transactional load off servers in other locations.

Company officers decide on Thursday to rent another building to temporarily replace the older headquarters building. IT begins assessing power, telecommunications and asset requirements, including the use of WiMAX to distribute wireless service in the facility. Remote IT staff from both the Los Angeles and Tokyo offices plan and assist in backup recovery, load balancing and other tasks. The server OS and application configurations mirrored at other locations will help ensure a faster restore of computer operations.

By 10 a.m. Thursday, Feb. 23, local telecommunications and Internet begin to return to normal as operators add more capacity. More San Francisco staff successfully work at home, relieving the Los Angeles staff that has been covering for them. Power has been restored to the main downtown buildings and IT is organizing shifts to work on-site.

Acme continues over the next several weeks to set up a replacement facility for its old headquarters. It moves back into the other San Francisco buildings two weeks after the earthquake.

Ransom Demand
Sophisticated bot-net attack staggers online retailer
Nov. 1 - Dec. 13, Chicago, Ill.
Johannes Ullrich, CTO, SANS Institute Internet Storm Center

The scene: Throughout November 2006, an eastern European organized crime syndicate uses a new zero-day exploit for Microsoft Internet Explorer to install bot software on a large number of systems. The bot spreads via three primary vectors: spam e-mail carrying a malicious URL, an instant-messenger worm and a number of hacked Web sites that many users consider trusted.

After amassing a botnet of about 100,000 systems, the group on Nov. 29 sends a note to an e-commerce company demanding a $50,000 ransom payment to call off a planned attack. On Dec. 1, to prove its ability to carry out the attack, the syndicate places 100 orders for random items from 100 different systems.

The e-commerce company uses the 100 test orders to look for a pattern. Common anti-DDOS (distributed denial of service) technology, which focuses on simple “packeting” attacks, proves unable to deter the threat -- the attackers use sophisticated scripts to simulate regular browsing and ordering. The e-commerce company considers validating orders with a “captcha” solution, which foils automated systems by requiring the user to type the characters shown in a graphic. This should deter the attack, but initial tests reveal that it will also cause 20 percent of valid orders to be abandoned.

On Dec. 4 the team decides to develop and test the captcha software. It takes two days to write the software and two more to perform load and regression testing. The company is fortunate that it upgraded its test systems in August, which helps minimize response time.

The next day -- Dec. 5 -- the retailer enables real-time order volume alerts to warn of an attack as quickly as possible. The timing of the attack -- during the holiday shopping season -- complicates matters, because order volumes can be volatile. The retailer typically sees 10,000 orders each day, but peak holiday volumes can spike to 30,000 orders a day during limited-time promotions. The team calculates an average volume of about 20 orders per minute on a peak 30,000-order day and decides to set the alert threshold at 200 orders per minute to account for legitimate spikes. The team also considers adopting order volume change thresholds, so that if orders spike anomalously from one minute to the next, the system will alert IT managers to a possible attack. The captcha solution, meanwhile, can go live within hours of an attack being detected.
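In practice, an alert like the one described amounts to counting orders over a sliding one-minute window. The Python sketch below shows the idea; the 200-per-minute threshold comes from the team's calculation, while the alert hook is a hypothetical placeholder.

# Sketch of the real-time order volume alert: count orders in a sliding
# one-minute window and flag anything above 200 orders per minute.
from collections import deque
import time

ALERT_THRESHOLD = 200   # orders per minute, per the team's calculation
WINDOW_SECONDS = 60

class OrderRateMonitor:
    def __init__(self) -> None:
        self._timestamps: deque[float] = deque()

    def record_order(self, now: float | None = None) -> bool:
        """Record one order; return True if the one-minute rate is over threshold."""
        now = time.time() if now is None else now
        self._timestamps.append(now)
        # Drop orders that have aged out of the one-minute window.
        while self._timestamps and now - self._timestamps[0] > WINDOW_SECONDS:
            self._timestamps.popleft()
        if len(self._timestamps) > ALERT_THRESHOLD:
            self._alert(len(self._timestamps))
            return True
        return False

    def _alert(self, rate: int) -> None:
        # Hypothetical placeholder: page the on-call staff, post to the ops
        # console, kick off the captcha rollout, and so on.
        print(f"ALERT: {rate} orders in the last minute (threshold {ALERT_THRESHOLD})")

The minute-over-minute change check the team also considered could be layered on top of the same sliding window.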

Fallout: Ransom Demand
  • Losses due to ...
    • Staff time on task: $10,000-$20,000
    • Lost business: $50,000
    • Return of erroneously shipped orders: $100,000+
  • Good change management and strong situational awareness enabled rapid response
  • Early action allowed the staff to get ahead of the threat and defend against it.

The actual attack starts on Dec. 12 at 9 a.m. EST. The bots are controlled from an IRC channel, which lets the attackers vary the assault and update the bots' attack software from a Web site. Because the company typically sees an increase in orders during this time of day, and the fake orders ramp up slowly, they go undetected for an hour. During that time, the attacker is able to place 2,000 fake orders. The company had expected about 3,000 valid orders during that period.

Once the attack is detected, the company puts the captcha solution in place. The switch occurs within a minute. Some customers placing orders or using shopping carts during the switchover experience odd behavior, but the disruption is limited and temporary. The captcha solution filters out fake orders, but the attacker now contacts the company demanding another $100,000.

The attack then morphs into a DDOS assault, where the bots just browse the site, add items to the cart, and search the large site for specific items. One particular search causes large database loads as it returns most of the available inventory. As a result, the site is no longer accessible to regular users.

A patch for the IE vulnerability is released on Patch Tuesday, Dec. 12.

On Dec. 13, the company analyzes logs in more detail and disables the site search. The action reduces server loads by an order of magnitude and enables the site to resume responsive operation. To further limit the DDOS attack, the team requires captcha input to add items to the cart. Once the captcha is used to identify a “human” shopper, a cookie is placed on the user's system to avoid the need for repeated captcha identification.
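The "captcha once, then set a cookie" approach only works if bots can't mint the cookie themselves, so a sensible implementation signs it. The framework-agnostic Python sketch below illustrates the idea; the cookie name, the secret key and the verify_captcha_answer() helper are assumptions, not the retailer's actual code.

# Hypothetical sketch of the captcha-plus-cookie gate. The cookie is
# HMAC-signed so the bots can't simply fabricate their own.
import hashlib
import hmac
import secrets

SECRET_KEY = b"rotate-me-regularly"   # would be a managed secret in practice
COOKIE_NAME = "human_ok"

def verify_captcha_answer(answer: str) -> bool:
    """Placeholder for the real captcha back-end check."""
    raise NotImplementedError

def issue_human_cookie() -> tuple[str, str]:
    """Mint a signed token to set as a cookie once a captcha is solved."""
    token = secrets.token_hex(16)
    sig = hmac.new(SECRET_KEY, token.encode(), hashlib.sha256).hexdigest()
    return COOKIE_NAME, f"{token}.{sig}"

def cookie_is_valid(value: str) -> bool:
    """Check the signature so a bot can't fabricate its own cookie."""
    try:
        token, sig = value.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, token.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

def add_to_cart(cookies: dict[str, str], captcha_answer: str | None) -> str:
    """Gate cart additions: accept a valid cookie, or a freshly solved captcha."""
    if cookie_is_valid(cookies.get(COOKIE_NAME, "")):
        return "added"
    if captcha_answer is not None and verify_captcha_answer(captcha_answer):
        name, value = issue_human_cookie()
        # The Web tier would set this cookie on the HTTP response.
        return f"added; set cookie {name}={value}"
    return "captcha required"

Each bot would still have to solve a captcha at least once to earn a cookie, which is exactly the cost the scheme is meant to impose.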

After an hour of tweaking and log analysis, the site is again usable for valid users. However, search functionality remains down until the team can implement search rate limiting, so that each customer may place only one search every 30 seconds. The attack soon fades as the attackers realize the company won't pay the ransom and has strengthened its defenses.
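Rate limiting of that sort needs little more than a timestamp of each customer's last search, roughly as in the Python sketch below. Keying on a customer or session ID is an assumption about how the site identifies shoppers.

# Sketch of the per-customer search rate limit: one search every 30 seconds.
import time

SEARCH_INTERVAL_SECONDS = 30

class SearchRateLimiter:
    def __init__(self) -> None:
        self._last_search: dict[str, float] = {}

    def allow_search(self, customer_id: str, now: float | None = None) -> bool:
        """Allow a search only if this customer hasn't searched in the
        last SEARCH_INTERVAL_SECONDS."""
        now = time.time() if now is None else now
        last = self._last_search.get(customer_id)
        if last is not None and now - last < SEARCH_INTERVAL_SECONDS:
            return False
        self._last_search[customer_id] = now
        return True

limiter = SearchRateLimiter()
print(limiter.allow_search("cust-42"))   # True -- first search goes through
print(limiter.allow_search("cust-42"))   # False -- second search inside 30 seconds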

Hand in Glove
Internet worm propagates lethal zero-day exploit
April 4 - April 11, Washington, D.C.
Madhu Beriwal, President & CEO, Innovative Emergency Management Inc.

The scene: On April 4 at 8:13 a.m. EST, a terrorist organization releases an Internet worm into the wild that uses remote code execution vulnerabilities in Apache and IIS to inject downloadable code into each server's default home page. The worm carries a lethal payload -- a virus that exploits an unpublished vulnerability in Microsoft Windows to randomly delete files and disable systems. The malware combines stealthy propagation with a sudden, lethal attack for maximum effect.

To maximize impact and delay detection, the group targets specific Web servers, including high-traffic sites such as Yahoo!, Google, CNN and ESPN, as well as default browser home pages like MSN and AOL. Computer vendor sites like Dell and HP and popular destinations such as Fark and Slashdot are also targeted. The targeted systems -- about 4,000 in all -- are successfully infected within a few minutes of each other. The virus begins downloading to visitors' PCs and will infect millions of Web browsers before a patch becomes available on April 11.

Fallout: Hand in Glove
  • Losses due to ...
    • Productivity lost: $6 million
    • Staff time on task: $10,000+
    • Broken contracts: $100 million
  • Failure to address a critical vulnerability led to unacceptable exposure.
  • Defense in depth and diverse security solutions can help protect against emerging threats.
  • Security must be regarded as a process that engages administrators, management and users.

There is scant warning. On March 20, a Microsoft security bulletin warns of a flaw in recent versions of IE that could allow remote code execution under the right circumstances. BoBobble Corp. security staff review the bulletin and decide that the recommended workaround -- disabling ActiveX -- would be overly disruptive. Having avoided exposure to similar IE issues in the past, the BoBobble staff decide to take no action. It's a stance taken by corporations worldwide, leaving the vast majority of the world's PCs exposed.

By April 7, the first signs of an attack emerge. On the day the virus is released, the help desk fields roughly its normal volume of calls about corrupted files, but attempts to find a rogue system fail. By day three, however, the number of infected PCs at BoBobble Corp. spikes. Several PCs are completely disabled. The team gets an early break when an administrator notices that one of the corrupted files is actually a screenshot of one of his users' desktops. The company's chief information security officer (CISO) decides a virus is the culprit and immediately orders all network switches, routers, hubs and firewalls powered off. All PCs and servers are also powered off to preserve data until the systems can be disinfected. The shutdown halts the spread -- and the effects -- of what appears to the CISO to be a particularly nasty virus.

Mainstream news media pick up the story on April 8, as almost every PC with Internet access is now infected. Most users have noticed something is wrong with their computer, but don't know what to do about it. Internet usage grinds to a halt. E-commerce transactions approach zero for the first time in the Internet age.

First to recover are anti-virus vendors, who are able to concoct a cure on April 9. Their systems come back online and start distributing the software and anti-virus signatures. BoBobble Corp. initiates around-the-clock recovery operations on April 10 to get the systems back up, disinfected and reconnected. Full recovery occurs on April 11, about the same time Microsoft releases a patch to fix the IE vulnerability.

The damage, however, is done. More than 50 percent of the corporation's files have been corrupted. Restoring individual files from tape or shadow copies fails, as the sheer volume of corrupted data overwhelms the help desk. Contact lists, cost estimates, multi-million dollar proposals and other business-critical files are lost, essentially halting BoBobble operations. What's more, the firm's disaster recovery plan proves ineffective with network access and PCs turned off. Business processes that had been streamlined through the use of IT can't be conducted manually, and productivity falls to zero. At least one BoBobble office, which was piloting an IP telephony program, is completely cut off. Cell phones become the only means of communication.

Ultimately, it's deemed most cost-effective to completely restore every file to the network file servers. This effectively sets BoBobble Corp.'s clocks back from April 11 to April 3, erasing a full business week of activity. Other corporations are not so lucky: their disaster recovery systems are inadequate, their backup media is corrupted, or their backups weren't running correctly.

Worldwide, travel becomes more difficult as the internal systems of major airlines, online booking systems and travel agency systems become infected. Only a few communication, water and electric power systems are affected, because most control systems for this type of equipment aren't linked to the Internet. The government's ability to provide services is hurt, but Continuity of Operations (COOP) and Continuity of Government (COG) programs and disaster recovery efforts reduce the impact somewhat.

Pandemic Panic
Suspected bird flu outbreak rocks a regional manufacturer
Feb. 19 – March 3, Eureka, Calif.
Jack Orlove, Vice President of Professional Services, Cyber Communication Inc.

The scene: On Sunday, Feb. 19, St. Joseph Hospital in Eureka, Calif., confirms the first fatal case of H5N1 avian flu in the United States. Broadcast media cover the isolated case aggressively and describe the risks of an almost “certain” pandemic event. The Governor's office acknowledges that if human-to-human transmission is verified, the plan calls for an immediate quarantine, freezing all traffic in and out of the city and disrupting all forms of commerce.

TV stations report on school absenteeism and an “unusually light” Monday commute across Oregon and Northern California. Acme Manufacturing CEO Jim Brown tracks reports from management in Eureka about high absenteeism and missed outbound shipments.

Facing unusually strong demand, the Eureka facility has been running at 115 percent of capacity, with most employees working overtime. Production goals are being exceeded, but getting the product to market and managing customers is another matter. Only one driver shows up on Monday, and the call center has about one-third of its full staff, bringing many operations to a standstill. Goods in transit continue to arrive at the Eureka facility, stacking up at Acme's shipping and receiving docks as employees struggle to find space.

That afternoon, Eureka facility manager Mike Smith asks for Sacramento volunteers to drive up and augment the call center and warehousing operations, but management is hesitant to authorize the move. Smith and executive staff decide to activate the Crisis Management Team (CMT).

Absentee rates in Eureka reach 56.7 percent on Monday, with most of those employees indicating that they won't be in tomorrow. Business operations are disrupted and customers experience hour-plus hold times. Many callers don't get through at all.

The Acme warehouse in Eureka becomes overcrowded with pallets of finished goods as contracted drivers fail to report to work. The company attempts to delay shipments from suppliers, but vendors insist that Acme accept goods already in transit. The CMT seeks to contract for warehouse space, but most warehouse operators aren't answering their phones or have no space available. Management authorizes overtime for employees willing to work.

Tuesday morning starts with reports of sick people filling hospitals as far north as Seattle and as far south as Los Angeles. As anticipated, absenteeism grows to over 70 percent, and most of the workers who do show up wear masks to ward off germs.

The warehouse situation worsens as trucks continue to arrive. Workers report some fistfights with delivery drivers, and many drivers simply unhook their loads and leave their trailers in the parking lot, causing further frustration. Acme's distributors still demand delivery to accommodate their own customers and insist that Acme meet its service-level agreements.

With business processes breaking down, Brown activates the formal Business Continuity Plan and convenes the team in the Sacramento emergency operations center (EOC). There is discussion of activating a center in Reno.

The general situation worsens. Supermarkets are emptied and lines at gas stations make it nearly impossible to fill gas tanks. All schools have officially closed, and there's wide speculation that the National Guard will be activated to keep the peace. There have been no other “official” cases of bird flu, but many people are skeptical of the government reports. Rumors abound of new cases in Eureka.

On Tuesday afternoon, management decides to close the Eureka location and relocate operations to a temporary building complex in Lincoln, Calif., 15 miles east of the Sacramento location. A private carrier has been contracted to fly employees from Eureka to Lincoln, but officials won't authorize flights into the Arcata-Eureka airport.

IT attempts to establish remote operations for employees, but with no existing remote-access infrastructure, the effort is limited to key executives and managers. Technicians deploy VPN and VNC remote-control software so key employees can use broadband links to access internal PCs from their homes. Labor-intensive installation of the remote client and VPN software slows the effort.

Tuesday evening, the FBI and Centers for Disease Control (CDC) announce that no new cases of H5N1 flu have been seen in the last 72 hours. The only death is traced back to an infected flock of birds handled by the victim. Local authorities are optimistic that the virus is not human-to-human transmissible and that restrictions may be relaxed.

Fallout: Pandemic Panic
  • Losses due to ...
    • Lost sales and revenue: $300,000
    • Staff time on task and related costs: $150,000
  • With 75 percent of operating capacity in one location, Acme should distribute operations.
  • Remote and home-based working alternatives using Web-based applications can help improve business continuity.
  • Staff cross-training eliminates single points of knowledge.

The Acme EOC is now operating two 12-hour rotating shifts, and an 800 telephone number has been set up to accommodate the flood of calls from employees and customers. In Eureka, IT efforts to maintain business processes ultimately fail, due in large part to the loss of employees in the warehouse and other areas. Business essentially grinds to a halt. While the IT recovery plan performs well, the broader impacts are overwhelming. Some critical operations have been recovered at the Acme hot site, and the makeshift Lincoln operation is handling some warehousing and call center operations.

The crisis winds down Wednesday morning, when the California governor's office, CDC, FEMA and the FBI begin assuring the public that the infection was an isolated event and poses no immediate pandemic threat. A sense of normalcy begins to return as employees are welcomed back to work at the Eureka site on Thursday morning.

Over the next several weeks Acme works to recover its primary Eureka site. Full call center operations are quickly restored. The Lincoln emergency site is scheduled to shut down on March 2, after running partial operations for two weeks.
