7 Disaster-Dodging Dictums
Following these seven tips can help you prepare for a disaster -- or prevent one from becoming even worse.
In survey after survey, disaster recovery -- or, as it's known in politically correct circles, business continuity -- remains high on the list of things to do in 2009. Even as companies seek to contain costs, reduce IT's carbon footprint, and decide which data must be preserved and protected for regulatory compliance, continuity remains on the front burner.
It isn't that natural or man-made disasters happen more often now than ever before. In truth, there's no actuarial table for disaster. Despite 150 years of data on weather-related events such as hurricanes, we have no idea whether a hurricane will come anywhere near, say, Tampa, Fla., in 2009. A key driver of interest in disaster planning is today's economic and political uncertainty: There's a natural uptick in disaster recovery planning (DRP) during such troubling times.
Perhaps the biggest driver of DRP is compliance. The regulatory mandate differs from industry to industry, of course, but most laws and rules require that firms take some measures to identify certain data assets: SEC filing documents, patient health-care data, private data of clients and customers, and so on. These assets must be preserved for several months to several decades; protected against accidental or malicious disclosure to unauthorized parties; copied securely; stored safely against loss -- perhaps in a data-protection system, which is roughly 80 percent of a business' disaster recovery capability; and made accessible under tight time frames should they ever be required by a subpoena or summons. The consequences of a failure to comply range from public embarrassment to fines, jail time for senior management or costly requirements to notify individual customers in the wake of a loss.
To risk managers, DRP is simply one component of an organization's risk profile. Not every company wants to spend money to develop a continuity capability, which is something that may never need to be used. Some are willing to take a chance. However, where such a capability is explicitly required, as in the case of the Health Insurance Portability and Accountability Act (HIPAA), risk managers may not get a vote.
What Is a Disaster?
To my mind, a disaster is an unplanned interruption of normalized access to business-critical data for whatever constitutes an unacceptable period of time for the business. Implicit in this definition is the concept of time-to-data, also called a recovery time objective (RTO). Time-to-data is a sensible metric for assessing the soundness of a disaster recovery strategy. It embodies the time to locate backup copies of the data and to re-point users and applications to them so that work can continue. It also encompasses all of the work required to organize teams that will rebuild infrastructure to connect user work areas to the data, and the work to contact, organize and transport key personnel -- often a skeleton crew -- to their new work areas once network and system connections are restored.
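As a rough illustration of how time-to-data might be tallied, the sketch below adds up the components named above and compares the total with an RTO. The phase names and hour estimates are hypothetical, and the sketch assumes the phases run one after another, which a real plan may not require.

```python
from dataclasses import dataclass


@dataclass
class RecoveryPhase:
    name: str
    hours: float  # estimated duration; illustrative numbers only


def time_to_data(phases):
    """Total hours to restore access, assuming the phases run sequentially."""
    return sum(phase.hours for phase in phases)


def meets_rto(phases, rto_hours):
    """True if the summed estimate fits within the recovery time objective."""
    return time_to_data(phases) <= rto_hours


# Hypothetical estimates, not taken from any real plan
PHASES = [
    RecoveryPhase("locate backup copies", 2.0),
    RecoveryPhase("re-point users and applications", 3.0),
    RecoveryPhase("rebuild infrastructure", 12.0),
    RecoveryPhase("contact and transport skeleton crew", 6.0),
]

if __name__ == "__main__":
    total = time_to_data(PHASES)
    print(f"time-to-data: {total} h; meets 24 h RTO: {meets_rto(PHASES, 24)}")
```

If the total exceeds what the business considers acceptable, the estimate shows which phase to shorten first.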
After assisting in the development of more than 100 plans for both large and small organizations, I've seen best practices as well as pitfalls. These are my top seven tips for building a good plan and preventing a disaster.
1. Make sure senior management is engaged.
By its nature, the DRP process crosses lines of business boundaries, and high-profile management backing will likely be required to get time on the calendars of the experts in the different business units. Moreover, management holds the purse strings. They have the money you need to fund your carefully crafted strategy and to build an ongoing testing regimen -- which translates the paper plan into a true recovery capability -- once the plan is complete. As the old saw attests, "No bucks, no Buck Rogers."
2. Argue DRP's business-value case.
Take a page out of the Harvard Business Review, which has a penchant for triangles, and ask yourself, "Can I justify this DRP capability in all three categories of business value: cost savings, risk reduction and operational productivity (top-line growth)?" These days, IT initiatives such as DRP won't be funded if they're not argued in terms of these three components of a business-value case.
The answer, of course, is that disaster recovery has little or nothing to offer in terms of cost savings or top-line growth. It falls strictly within the boundaries of risk reduction. And even the risk-reduction value is often hard to measure, as the likelihood of a disaster is nearly impossible to quantify. Most planners instead use fear, uncertainty and doubt (FUD) to raise concerns about the consequences of a disaster.
The smartest thing you can do is to make a full business-value case by not contextualizing the work you're doing as disaster recovery or business continuity at all. Call it, instead, a data-management strategy.
Continuity planning begins with deconstruction analysis of business processes to identify the data they generate and use, along with the resources or services associated with delivering that data to the users who make it purposeful. Data inherits its criticality, like so much DNA, from the business process that it serves. It's the business process that makes data critical, important, neutral or unimportant from a recovery standpoint. The criticality of the business process determines to a significant degree the selection of an appropriate data-protection or recovery strategy.
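One way to picture this inheritance is a small lookup from business processes to the data sets they use. The sketch below is illustrative only: the process and data-set names are made up, and the rule that a shared data set takes the highest criticality among the processes using it is an assumption, not something prescribed here.

```python
from enum import IntEnum


class Criticality(IntEnum):
    UNIMPORTANT = 0
    NEUTRAL = 1
    IMPORTANT = 2
    CRITICAL = 3


# Hypothetical business processes and their assessed criticality
PROCESS_CRITICALITY = {
    "order entry": Criticality.CRITICAL,
    "billing": Criticality.IMPORTANT,
    "marketing archive": Criticality.UNIMPORTANT,
}

# Hypothetical data sets each process generates or uses
PROCESS_DATA = {
    "order entry": ["orders_db", "customer_db"],
    "billing": ["customer_db", "invoices"],
    "marketing archive": ["old_campaigns"],
}


def data_criticality(process_criticality, process_data):
    """Give each data set the highest criticality of any process that uses it."""
    result = {}
    for process, datasets in process_data.items():
        level = process_criticality[process]
        for dataset in datasets:
            # Assumed rule: a shared data set inherits the strictest level
            result[dataset] = max(result.get(dataset, Criticality.UNIMPORTANT), level)
    return result


if __name__ == "__main__":
    for name, level in data_criticality(PROCESS_CRITICALITY, PROCESS_DATA).items():
        print(f"{name}: {level.name}")
```

Here customer_db ends up critical because order entry depends on it, even though billing alone would rate it only important.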
WAN-based data replication and system failover -- near instantaneous disaster recovery -- are expensive and inappropriate for any but the most critical data sets and applications with always-on requirements. By contrast, tape-based backup and restore is appropriate for data sets that can wait for hours or days following a disruption event without jeopardizing mission-critical business operations. To determine which recovery services to apply to which applications and data, you need to understand business process criticality; hence the analysis work.
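The matching of recovery services to time-to-data requirements can be sketched as a simple tier table. Only the two approaches discussed above are included, and the one-hour cutoff between them is an arbitrary placeholder; a real plan would set its own thresholds and probably more tiers.

```python
# Hypothetical tiers: (maximum tolerable hours without data, strategy).
# The 1-hour boundary is an illustrative assumption, not an industry rule.
TIERS = [
    (1, "WAN replication with system failover"),
    (float("inf"), "tape backup and restore"),
]


def select_strategy(tolerable_hours):
    """Return the first strategy whose tier covers the tolerable outage."""
    if tolerable_hours < 0:
        raise ValueError("tolerable_hours must be non-negative")
    for max_hours, strategy in TIERS:
        if tolerable_hours <= max_hours:
            return strategy
    raise AssertionError("unreachable: the last tier is unbounded")


if __name__ == "__main__":
    for hours in (0, 8, 72):
        print(f"{hours} h tolerable -> {select_strategy(hours)}")
```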
The good news is that the very same analytical process that must precede continuity planning must also precede effective compliance or information-governance planning. It must also be undertaken before applying security functions such as data encryption. Truth be told, it should be done before IT buys any more software or hardware so that spending aligns with business requirements.
The analysis you do for DRP can be viewed as data modeling. In the process of modeling the data and understanding the services and resources that a company currently uses to support each business process, you're building a data-management model of the organization that has enormous probative value beyond mere risk reduction. In addition to setting parameters on services appropriately applied to protect data assets, the model can be consulted to see where IT resources and services might be reallocated to contain corporate IT costs or to improve efficiency and productivity.
So, a data-management initiative can deliver the full business-value case that DRP by itself cannot. Because you have to perform the analysis anyway, why not make the creation of the data model one objective of the planning effort?
3. Make prevention more important than recovery.
You can't protect what you can't see, so infrastructure-management -- especially storage-management -- technology needs to be implemented to help spot burgeoning error conditions before they cause disastrous downtime. Power protection, or the deployment of surge suppression and uninterruptible power supplies, is another priority, because the most common causes of equipment failures are utility power-related.
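As one small example of spotting a problem before it causes downtime, the sketch below projects when a storage volume will fill up from periodic usage readings. It is a deliberately crude straight-line estimate from the first and last readings; the sample figures are invented, and real storage-management tools use far richer data.

```python
def days_until_full(samples, capacity_gb):
    """Project days until a volume fills, from (day, used_gb) readings.

    Uses a straight line between the first and last readings (a
    simplifying assumption). Returns None if usage is flat or shrinking.
    """
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("need readings spanning at least two distinct days")
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - last_used) / growth_per_day


if __name__ == "__main__":
    # Hypothetical readings: 700 GB on day 0, 800 GB on day 10, 1,000 GB volume
    print(days_until_full([(0, 700), (10, 800)], capacity_gb=1000))
```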
4. Rationalize your data.
As you analyze your data assets, you're likely to find that more than 65 percent of the data on expensive disk arrays and SANs doesn't need to be there. It's never referenced and usually comprises a mixture of old data of value to the organization that belongs in an archive, orphan data whose creator or server no longer exists in the infrastructure, or data belonging to someone who thinks his next wife's last name is .JPEG and has downloaded every picture of her that he can find. Encourage the deployment of an active archive, preferably one based on green media like tape or optical. Then use your storage-management tools to spot and remove orphan and contraband data. Doing so will alleviate the strain on -- and cost of -- data-protection services like tape backup, contain costs by forestalling the need to buy more storage, and improve productivity by making useful data easier to find.
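A first pass at finding never-referenced data can be as simple as walking a directory tree and listing files that haven't been read in a long time. The sketch below does that with Python's standard library. The one-year threshold in the usage example is an arbitrary choice, and last-access times are not recorded reliably on every file system, so the output should be treated as a list of candidates for review, not a deletion list.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_files(root, max_idle_days, now=None):
    """Yield (path, idle_days) for files not accessed within max_idle_days.

    Relies on st_atime, which some file systems update lazily or not at
    all (for example, when mounted with noatime).
    """
    now = time.time() if now is None else now
    cutoff_seconds = max_idle_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                idle_seconds = now - os.stat(path).st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if idle_seconds > cutoff_seconds:
                yield path, idle_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Example: list files under the current directory idle for over a year
    for path, idle_days in stale_files(".", max_idle_days=365):
        print(f"{idle_days:8.0f} days  {path}")
```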
5. Aim to build in -- rather than bolt on -- your recovery capability.
Building in means several things. First, it means making continuity a consideration in the processes by which applications are built and infrastructure is architected and procured. Many apps are designed without any attention to how readily they can be recovered. In an actual disaster, you don't want to have to build an identical platform to host your app; you want the flexibility to re-host software on any gear available. And you should opt for message-oriented middleware rather than hard-coded, machine ID- or IP address-specific remote procedure calls. If you don't know the difference between MOM and RPCs, get some training, or you'll never be able to get the application developers to follow your guidance!
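For planners who want a feel for the MOM idea, here is a toy sketch. The sender addresses a logical queue name, never a machine ID or IP address, so the consumer can be restarted on whatever gear is available. The Broker class is a hypothetical in-process stand-in built on Python's standard queue module; it is not the API of any real middleware product.

```python
import queue


class Broker:
    """In-process stand-in for message-oriented middleware (illustrative only)."""

    def __init__(self):
        self._queues = {}

    def _queue(self, name):
        # Create the named queue on first use
        return self._queues.setdefault(name, queue.Queue())

    def send(self, queue_name, message):
        self._queue(queue_name).put(message)

    def receive(self, queue_name):
        # Raises queue.Empty when nothing is waiting
        return self._queue(queue_name).get_nowait()


def handle_orders(broker, host_label):
    """Drain the 'orders' queue; host_label only marks where this ran."""
    handled = []
    while True:
        try:
            message = broker.receive("orders")
        except queue.Empty:
            return handled
        handled.append(f"{host_label} processed {message}")


if __name__ == "__main__":
    broker = Broker()
    broker.send("orders", "order-1")  # the sender knows only the queue name
    print(handle_orders(broker, "recovery-site"))
```

With a hard-coded RPC, the sender would instead hold the address of one specific server, and that address would have to be recreated or reconfigured at the recovery site.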
You also need to be involved in infrastructure architecture and product selection. Some cool new technologies highly touted by their vendors -- server virtualization and storage thin provisioning are just two that come trippingly to the tongue -- may be potential disasters waiting to happen. It's up to you to be the "killjoy" and to identify the potential risks so they can be discussed and business advantages weighed against potential downsides. At a minimum, help procurement understand that items like redundant power supplies on mission-critical servers are not an option but a continuity requirement.
6. Wherever possible, build testing into your recovery strategy.
A number of products have recently entered the market that aggregate information from recovery services into a dashboard. You can monitor that dashboard to ensure that data-protection processes are taking place as planned, and that they're still aligned with your time-to-data objectives, even as the volume of data under protection grows. Look into recovery products that simulate or perform actual failovers whenever you want, without disrupting the business. That will let you validate your protection and recovery strategy and eliminate these tasks from your formal test program. It also means that your data-protection and failover strategy can be leveraged in non-DRP related ways. Failing over to your shadow infrastructure during maintenance on your primary gear is a great way to eliminate maintenance downtime and justify the costs of the failover capability.
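The kind of alignment check such a dashboard performs can be approximated with simple arithmetic: divide the volume under protection by the restore throughput and compare the result with the time-to-data objective. The function names and figures below are hypothetical, and the sketch assumes a constant restore rate, which real restores rarely achieve.

```python
def restore_hours(volume_gb, throughput_gb_per_hour):
    """Estimated hours to restore a data set at a constant rate."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput_gb_per_hour must be positive")
    return volume_gb / throughput_gb_per_hour


def check_alignment(volume_gb, throughput_gb_per_hour, objective_hours):
    """Report whether the estimated restore still fits the objective."""
    hours = restore_hours(volume_gb, throughput_gb_per_hour)
    return {"restore_hours": hours, "aligned": hours <= objective_hours}


if __name__ == "__main__":
    # Hypothetical growth: the same 8-hour objective as the data set triples
    for volume_gb in (2000, 6000):
        print(volume_gb, check_alignment(volume_gb, 500, objective_hours=8))
```

Running the check as data grows shows when a strategy that once met its objective has quietly stopped doing so.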
7. Get going today.
Contrary to some of the marketecture you might have encountered, you don't need any sort of specialized certification to build a business-savvy continuity capability. A lot of folks have made a lot of money making DRP seem like a Byzantine methodology understood only by a specially vetted cadre of certified practitioners. In my experience, it's a straightforward application of common sense that requires project-management skills, knowledge of IT concepts and technology, and the patience and tenacity of a Middle East peace negotiator. There are no experts, no gurus. Most who have successfully recovered their organizations from the brink are humbled by the experience: "If I had a few more disasters, I might get to be pretty good at this." Hardly guru-speak.