An Ounce of Prevention
Disaster recovery planning can be worth a lot more than a pound of cure when your network goes down.
Disaster Recovery plan: I've got one, you've got one, we've all got one. (You do have one, right? If you don't, go write one; we can wait 'til you get back.) But how often do you test how well your plan actually works?
As part of our contract with our co-location vendor, we take part in two-day drills twice a year. These are real-time drills, where we have 48 hours to re-create a fully functional network using nothing but replacement hardware and backup media. For this particular drill, we were concerned with rebuilding four mission-critical Windows 2000 servers:
- DC1: Our main domain controller (DC) that held three of the five Flexible Single Master Operations (FSMO) roles on the network
- APP1: An application server that also functioned as a DC to provide redundancy. APP1 held the other two FSMO roles on the network
- WWW1: Our corporate Web server
- MX1: Our corporate mail server, running Exchange 5.5
What follows is a diary of our 48-hour experience.
Day One, 8 a.m.: Assessment
We arrive and assess the replacement hardware provided by our co-location vendor. As a part of our contract, we were told that we would have identical replacements—we were asked to provide model numbers, serial numbers, the whole nine yards. The reality turns out to be slightly different. While our production environment is standardized on Compaq ProLiant servers, our replacement hardware is all in the Dell PowerEdge family.
This is disconcerting from a technical standpoint, but we are handed a copy of Microsoft Knowledge Base article 249694, "How to Move a Windows 2000 Installation to Different Hardware," and told it will work like a charm. On the other hand, it's a good reflection of reality. Let's face it, what's the likelihood you'll have exact duplicates of your production hardware waiting for you at a moment's notice?
So we set to work restoring our Active Directory (AD) database onto the replacement hardware for DC1. The short version of KB 249694 goes something like this:
- Install your production-level service pack.
- Perform an authoritative restore of System State data.
- Perform an in-place upgrade of Win2K.
- Re-apply any service packs and hotfixes.
AD Restore Options
If you're not conversant with performing AD restores, you may be unfamiliar with some of the terms used here. The System State data on a DC includes the following:
- AD (the NTDS files)
- Boot files
- COM+ class registration database
- The System Volume (SYSVOL)
When restoring the System State, there are a few options for how to handle the restore. In Win2K, you can mark a System State restore as either authoritative or non-authoritative. A non-authoritative restore, the default type, restores an AD object (such as a user or group account) to the AD database as it existed at backup time; any changes made since the backup are then applied on top of it through normal replication from other DCs. An authoritative restore also performs the restore, but marks the restored version of the object as definitive: later changes held by other DCs will not overwrite it, and the restored version replicates out to them instead.
For example, say you have a user object called jharrison. On Wednesday (after the Sunday backup), jharrison's "Department" attribute is changed from "Marketing" to "Communications" when the user receives a promotion. On Thursday, the user account is accidentally deleted and needs to be restored from the Sunday backup. In a non-authoritative restore, jharrison's user object will be restored with the "Marketing" department attribute, but the attribute will be updated to "Communications" by changes replicated from another DC. In an authoritative restore, the user object's department attribute will remain "Marketing," even after regular AD replication.
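The difference between the two restore types can be sketched as a toy model in Python. This is a simplification under stated assumptions, not a description of AD internals: it tracks only the Department attribute, ignores the deletion itself, assumes that replication keeps whichever copy has the higher version number, and uses a made-up constant (AUTHORITATIVE_BUMP) to stand in for whatever marks a restored copy as authoritative.

```python
from dataclasses import dataclass

# Simplified model: each attribute value carries a version number, and
# replication keeps whichever copy has the higher version. Real AD
# conflict resolution considers more than this single number.
AUTHORITATIVE_BUMP = 100_000  # illustrative constant, an assumption


@dataclass(frozen=True)
class Attr:
    value: str
    version: int


def replicate(local: Attr, partner: Attr) -> Attr:
    """Return the copy that survives replication (higher version wins)."""
    return partner if partner.version > local.version else local


def restore(backup: Attr, authoritative: bool) -> Attr:
    """Bring back the backed-up copy, optionally marking it authoritative."""
    if authoritative:
        return Attr(backup.value, backup.version + AUTHORITATIVE_BUMP)
    return backup


sunday_backup = Attr("Marketing", version=1)
partner_dc = Attr("Communications", version=2)  # Wednesday's change

non_auth = replicate(restore(sunday_backup, authoritative=False), partner_dc)
auth = replicate(restore(sunday_backup, authoritative=True), partner_dc)

print(non_auth.value)  # Communications
print(auth.value)      # Marketing
```

In the non-authoritative case the partner's newer copy wins; in the authoritative case the restored copy outranks it and would be the one to replicate outward.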
Windows Server 2003 provides a third option: A primary AD restore. Use a primary restore when restoring the first replica of your domain data to the network, as in the case of a disaster recovery scenario where you've lost all DCs. If the network in this article had been running Windows Server 2003 instead of Win2K, a primary restore would have been appropriate.
— Laura E. Hunter
The first step is pretty intuitive: The service pack on the replacement hardware needs to match the service pack level on the production machine, so that versions of DLLs and other system files won't conflict after the restore is finished. To make the restore as smooth as possible, we also create volumes and partitions on the new hardware that exactly duplicate the production configuration. Once that's done, we reboot into AD Restore Mode (formally, Directory Services Restore Mode) and perform a full restore of DC1's System State data.
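A pre-restore check along these lines can be sketched in a few lines of Python. This is only an illustration of the comparison, not a tool we used during the drill: the dictionary layout, the service pack values and the volume sizes are all hypothetical, and you would have to fill them in by hand or from your own inventory.

```python
def preflight_mismatches(production: dict, replacement: dict) -> list[str]:
    """List the settings that differ between the two machines."""
    problems = []
    for key in ("service_pack", "volumes"):
        if production.get(key) != replacement.get(key):
            problems.append(
                f"{key}: production={production.get(key)!r}, "
                f"replacement={replacement.get(key)!r}"
            )
    return problems


# Hypothetical inventory values, for illustration only.
production = {
    "service_pack": "SP4",
    "volumes": [("C:", "NTFS", 8), ("D:", "NTFS", 28)],  # (letter, fs, size in GB)
}
replacement = {
    "service_pack": "SP3",
    "volumes": [("C:", "NTFS", 8), ("D:", "NTFS", 28)],
}

for line in preflight_mismatches(production, replacement):
    print(line)
```

An empty result would mean the replacement machine matches production on the two points described above; anything printed is worth fixing before starting the restore.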
Day One, 12 p.m.: Still "Hanging" Around
Staring at the hardware differences on the restore machines, I can't shake a sinking feeling that this isn't going to go quite as easily as our co-location support rep makes it sound. Sure enough, the first attempt leaves us hanging at the "Preparing network connections…" screen on the final reboot. Because I'm occasionally impatient during processes like this, I choose that point to go to lunch, to see if the newly restored server just needs a little more time to finalize its settings. Forty-five minutes later? Still sitting on the same screen.
We spend the remainder of the afternoon retrying the AD restore with limited success. We attempt the in-place upgrade a few more times, various permutations of authoritative vs. non-authoritative restores, then a Repair Installation once or twice for good measure. But the System State information seems patently unwilling to restore onto such completely different hardware, leaving us with Blue Screens of Death or interminable hanging at various stages in the startup process before we wipe the hard drive with Fdisk and start over.
Day One, 11 p.m.: Partial Success
Because we have only a 48-hour window to test our restore procedures, we put
the AD restore aside and spend the rest of the afternoon and evening restoring
our application data, working around the lack of AD information wherever possible.
Most notably, we aren't able to do anything with Exchange without a working
domain to join the server to. By about 11 p.m., having restored most of our
application data, we declare the day at least a partial success. We decide to
tackle the AD restore with fresh eyes after a night's sleep.
Day Two, 7 a.m.: Disappearing DNS
Ramped up on about a thousand volts of Starbucks espresso, we take another
look at the AD restore. After some brainstorming, we realize that one potential
complication might be our production DNS configuration. As part of a large,
heterogeneous internetwork, our production AD infrastructure relies on a centralized
Unix BIND server for DNS; individual offices don't run Windows DNS servers within
the individual LANs. But because the drill is taking place in connectivity isolation,
so that we can bring up restored systems without bringing down their production
counterparts, our restored DCs are pointing to DNS servers that essentially don't exist.
We try installing and configuring the DNS Server service. After configuring the replacement server to point to itself for DNS queries, we perform the System State restore again. Although we finally make it to a desktop (Huzzah!), the event logs are littered with DNS errors: we've overwritten a System State that contained DNS information with one that did not. "No problem," I say, "we'll just uninstall and re-install the service and then everything will be fine." No such luck. Fdisk, try it again; 'round and 'round we go.
Day Two, 12 p.m.: A Smaller Hammer
By this point we're fairly convinced that our attempts at a full System State restore are roughly equivalent to swinging a sledgehammer at a finish nail, so we begin to look for a more finessed approach. After another few hours of trial and error, we finally devise a solution. We need DNS to be running on our restored network, but DC1 can't be the machine to run it. We install DNS on the APP1 server instead, pointing DC1 to APP1 and enabling dynamic updates. We then return to DC1 and install AD on it by running Dcpromo, creating a domain with the same name as our production domain. (Again, we are in connectivity isolation, so we know this won't interfere with name resolution on our production network.)
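With dynamic updates enabled, a newly promoted DC registers a set of SRV records in its DNS zone, and those are the records worth looking for on APP1. The Python sketch below lists a few of the well-known record names and reports which are missing from a set of names you supply. The domain name and the zone contents are placeholders, the list is deliberately not exhaustive, and the script does not query DNS itself.

```python
def expected_srv_records(domain: str) -> list[str]:
    """Return a few of the SRV record names a DC registers for its domain."""
    return [
        f"_ldap._tcp.{domain}",
        f"_kerberos._tcp.{domain}",
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
    ]


def missing_records(domain: str, zone_records: set[str]) -> list[str]:
    """Compare the expected names against the names found in the zone."""
    present = {name.lower().rstrip(".") for name in zone_records}
    return [r for r in expected_srv_records(domain) if r.lower() not in present]


# "corp.example.com" and the zone contents are placeholders.
zone = {"_ldap._tcp.corp.example.com.", "_kerberos._tcp.corp.example.com."}
print(missing_records("corp.example.com", zone))
```

With the placeholder zone shown, the two dc._msdcs names are reported as missing, which is the kind of gap that would send us back to check the dynamic update settings.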
Once we verify that AD is installed on DC1, and that the necessary DNS records have been created in the DNS zone on APP1, we reboot into AD Restore Mode and attempt the restore one final time. But instead of restoring the full System State, we restore the AD database only, without any of the associated system files, to avoid landing in the "conflicting DLL" quagmire yet again. We then use ntdsutil to mark the restore as authoritative, and restore the boot.ini file to ensure that the ARC paths, which give the location of the system and boot partitions, haven't been altered. That way, the OS will still have the correct partition locations even if the System State restore overwrites the ARC paths.
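For readers who haven't looked inside boot.ini lately, an ARC path is a compact way of naming a controller, disk and partition. The Python sketch below splits the common multi() form into its parts so two boot.ini files can be compared field by field. The sample path is illustrative rather than taken from our servers, and the scsi() and signature() forms are deliberately not handled.

```python
import re

# multi(W)disk(X)rdisk(Y)partition(Z)\path is the common ARC form on
# x86 systems; scsi(...) and signature(...) forms are not handled here.
ARC_RE = re.compile(
    r"^multi\((\d+)\)disk\((\d+)\)rdisk\((\d+)\)partition\((\d+)\)(\\.*)?$",
    re.IGNORECASE,
)


def parse_arc(path: str) -> dict:
    """Split a multi() ARC path into its numbered components."""
    match = ARC_RE.match(path.strip())
    if match is None:
        raise ValueError(f"unsupported ARC path: {path!r}")
    multi, disk, rdisk, partition, directory = match.groups()
    return {
        "multi": int(multi),
        "disk": int(disk),
        "rdisk": int(rdisk),
        "partition": int(partition),
        "directory": directory or "",
    }


# Example path of the kind found in boot.ini; the values are illustrative.
print(parse_arc(r"multi(0)disk(0)rdisk(0)partition(1)\WINNT"))
```

If the rdisk or partition numbers differ between the production boot.ini and the one on the replacement hardware, the restored entry would point the loader at the wrong place.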
Day Two, 4 p.m.: Try, Try Again
We still aren't quite out of the woods, though, since the next reboot leaves us stuck on the now all-too-familiar "Preparing network connections…" screen. Our next step is to run a repair installation.
Unlike an in-place upgrade, a repair installation re-scans the computer's Plug & Play hardware and updates the %Systemroot%\Repair directory. Before rebooting from the restore, we remove the display adapters and NICs from the Win2K Device Manager so the install will re-detect them. During a few run-throughs, we find the NIC configuration is still incorrect after the repair, requiring us to remove the NICs a second time and re-scan for new hardware in Device Manager. Once the network adapters are properly recognized, we reset the IP configuration to communicate on the appropriate subnet.
Day Two, 6 p.m.: Victory!
Finally, we have success. The server boots with minimal fuss, and a visit to Active Directory Users & Computers shows all of our Organizational Units (OUs), computer, group and user objects sitting exactly where we want them. All that's left is some cleanup. (Okay, that and letting out a few victory screams in the middle of the co-lo room. Don't ask about the strange looks that garners.)
Our final cleanup involves a quick trip back to AD Restore Mode and ntdsutil to perform a metadata cleanup of the restored AD database, which includes references to some DCs we decommissioned a year ago and had simply forgotten about. (This also serves to point out some needed maintenance on the production network, since these "ghost" entries in the AD database could lead to replication issues, and troubles during software installations or upgrades.) We also disable some extraneous services added during the restore, the software for which hadn't been installed at the disaster recovery site. We finish up with a final service pack re-install, and are finally left with a functioning DC and AD database.
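Spotting "ghost" DCs comes down to comparing what the directory lists against what is actually in service. A minimal Python sketch of that comparison follows; DC1 and APP1 are our real server names, while OLDDC1 and OLDDC2 are made-up stand-ins for the decommissioned machines, and both lists are assumed to have been gathered by hand.

```python
def ghost_dcs(directory_dcs: set[str], live_dcs: set[str]) -> list[str]:
    """Return DCs the directory still lists but that no longer exist."""
    live = {name.lower() for name in live_dcs}
    return sorted(name for name in directory_dcs if name.lower() not in live)


# DC1 and APP1 come from the article; OLDDC1 and OLDDC2 are made-up names
# standing in for the decommissioned servers.
listed_in_ad = {"DC1", "APP1", "OLDDC1", "OLDDC2"}
actually_running = {"dc1", "app1"}

print(ghost_dcs(listed_in_ad, actually_running))  # ['OLDDC1', 'OLDDC2']
```

Anything the function returns is a candidate for metadata cleanup, once you've confirmed the server really is gone.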
Day Two, 7 p.m.: Burgers and Beers
Being fortuitously close to 7 p.m. anyway, we call the drill a success and adjourn
for a few beverages and greasy bar appetizers, followed by a well-deserved night's rest before returning to the "real world" of the daily office grind.
While you can perform most Active Directory management functions from the graphical
interface, some tasks need to be (or are more easily) performed from the command
line. This is where "ntdsutil" comes in. It is the built-in command-line
utility that can assist with AD restores, Flexible Single Master Operations
(FSMO) role management and AD database maintenance. In our case, we needed to
perform a metadata cleanup to remove references to extinct DCs not cleanly removed
from AD. (In other words, the DC wasn't cleanly Dcpromo'ed out of its domain.)
To begin using this utility, simply open a Command window, type "ntdsutil"
and press Enter. This will present a simple ntdsutil: prompt. From here
you can enter any number of sub-menus, or you can enter a "?" to see
a list. The metadata cleanup involves the following command-line sequence:
ntdsutil: metadata cleanup
metadata cleanup: connections
From here, you'll connect to a server that belongs to the same domain as the server you want to delete.
server connections: connect to server dc2
server connections: quit
metadata cleanup: select operation target
select operation target: list sites
You'll see a list of sites in your forest, each with a number next to it. Select the site containing the appropriate server by typing "select site SiteNumber" and you'll be returned to the select operation target menu.
select operation target: list domains in site
A similar list of domains will be displayed. Select the domain by typing "select domain DomainNumber" and you'll be returned again to the select operation target menu.
select operation target: list servers in site
Type "select server ServerNumber" to select the server you want to remove.
Type "quit" at the select operation target prompt to return to the metadata cleanup prompt.
Finally, type "remove selected server" to remove the offending server.
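To keep the order of operations straight, it can help to write the whole sequence out before sitting down at the prompt. The Python sketch below simply builds the list of inputs described above; it does not run ntdsutil. The first listing uses "list sites", since that step displays the sites in the forest. The server name dc2 comes from the example above, and the site, domain and server numbers are hypothetical; the real ones are whatever the list commands display.

```python
def metadata_cleanup_commands(
    connect_to: str, site: int, domain: int, server: int
) -> list[str]:
    """Return the ntdsutil inputs, in order, for removing one stale DC."""
    return [
        "metadata cleanup",
        "connections",
        f"connect to server {connect_to}",
        "quit",                      # back to the metadata cleanup prompt
        "select operation target",
        "list sites",
        f"select site {site}",
        "list domains in site",
        f"select domain {domain}",
        "list servers in site",
        f"select server {server}",
        "quit",                      # back to the metadata cleanup prompt
        "remove selected server",
    ]


# The numbers 0, 0 and 1 are placeholders for the values ntdsutil displays.
for step, command in enumerate(metadata_cleanup_commands("dc2", 0, 0, 1), start=1):
    print(f"{step:2}. {command}")
```

Printing the list gives you a checklist to follow; the numbers still have to be read off the screen at each list step.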