In-Depth

Resurrection from the Blue Screen of Death

In an ideal IT world, restoring data to a new server from a dead one wouldn't be a problem: Just rebuild the data and apps from your backup. Welcome to the less-than-perfect world of IT.

During a disaster recovery test, you're almost always dealing with restoring data from tape to a platform that doesn't contain the same hardware. If you perform a full system restore and the OS won't boot successfully, your test fails miserably. You lose precious hours in these situations while administrators scurry looking for answers. In a real-world disaster recovery scenario, this could cost your company thousands of dollars in downtime and could even put you out of business altogether.

A colleague and I were attempting to restore our main application server and ran into this problem. In this case, we were attempting to restore a system originally housed on a Compaq ProLiant 6500 with a Smart 2DH controller, to a Compaq ML570 with a SmartArray 5300. Fortunately, we attempted the restore on our test network first to document any problems before the actual upgrade of the system. Our operating system was NT Server 4.0 Enterprise Edition.

One, Two, Three Strikes You're…
Our restore kept blue screening on startup, with the following error message: STOP: 0x0000007B INACCESSIBLE_BOOT_DEVICE. We tried everything to get around it. First, we tried booting into the last known good hardware profile. That would've been too easy. Strike one. Then we tried booting using an NT boot disk with a boot.ini file lifted from another Compaq server with a "HAL Recovery Option." Strike two. Finally, we replaced the controller with a spare Smart 2DH we had, created a new RAID volume, loaded NT, and ran the restore again. It worked!

Then it hit me…what if? What if this had been a real disaster, and we were sitting in some room down in Philadelphia with nothing but some foreign hardware and our tapes? Disaster recovery vendors do a good job of matching up your equipment, but they don't carry everything. What if they didn't have the controller we needed? Knowing that our CIO would be asking us the same question, we decided that we'd better find a workaround.

It turns out we'd been lucky. All the times we had done off-site disaster recovery testing, we'd never run into this problem. I kept remembering what our disaster recovery vendor had said the last time we were in Philly: Other companies typically restore only their data drives. They re-create their shares and reinstall applications from scratch. What? That sounded absurd to us. Do these guys know how customized our systems are? That's when I realized that this problem was affecting other companies as well, no matter what hardware was being used.

We knew the hardware in the new system was different and that this was causing the problem. We called Compaq and learned that the new RAID controller used a similar chipset, which led the restored OS to conclude that the driver it was configured to load was the correct one for the device physically present in the system. Anyone who has done NT builds has seen a system blue screen after new hardware (such as a video controller) is installed.

Once we realized what was going on, we attempted (through various support channels) to find a workaround for the problem, without success. We theorized that if we performed a parallel installation of NT, we should be able to mount the registry of the (original) target build and change the relevant settings. Then we found a TechNet article (Q198859, "Starting Windows NT from a Replacement SCSI Adapter of a Different Type") that outlines the steps required to start NT from a replacement SCSI adapter. That was all we needed.

You can combine the steps from the Q article and the following real-world example in a disaster recovery situation to restore systems with dissimilar RAID devices and even overcome other types of hardware conflicts.

The Registry is Key
After you've done your restore from tape, boot the system and, when the Blue Screen of Death appears, make a note of the .SYS drivers that failed to load. These will be specific to your original hardware. The common drivers for Compaq RAID controllers are outlined in Table 1 (see "Resurrection, Step-by-Step").

The driver disk supplied by your manufacturer will give you the name of this driver. The .INF file included with the driver disk can be opened in Notepad to discover the driver name related to your controller type. Having good documentation here really helps.
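If you'd rather not eyeball the .INF file, the lookup can be scripted. The following is a minimal Python sketch, not something taken from the Q article: it lists every .sys file name it finds in the text of an .INF file, skipping comments. The function name and the sample contents (newraid.sys, oldraid.sys) are made up for illustration, so check whatever it reports against your own driver disk.

```python
import re


def sys_files_in_inf(inf_text):
    """Return the distinct .sys file names found in INF text (lowercased),
    in order of first appearance. Text after ';' is treated as a comment."""
    found = []
    for line in inf_text.splitlines():
        line = line.split(";", 1)[0]  # drop the comment part of the line
        for name in re.findall(r"[\w\-]+\.sys\b", line, flags=re.IGNORECASE):
            name = name.lower()
            if name not in found:
                found.append(name)
    return found


# Hypothetical INF contents, for illustration only.
SAMPLE_INF = (
    "[Files]\n"
    "newraid.sys ; driver binary\n"
    "; oldraid.sys is not shipped on this disk\n"
    "[Config]\n"
    "ImagePath = system32\\drivers\\NEWRAID.SYS\n"
)

if __name__ == "__main__":
    print(sys_files_in_inf(SAMPLE_INF))
```

The driver name you need for the registry steps is normally the file name without the .sys extension, but confirm that against the manufacturer's documentation.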

Perform a parallel build of NT and leave the current file system intact, with no changes. Use a directory name like "sos" for the system directory. You won't need network support. Log into the "sos" build as the administrator and start regedt32.exe.

The first problem may be getting your parallel build completed. After popping in your NT Server CD, you may realize that NT Setup has also detected the RAID controller incorrectly. Setup inevitably fails with a "hard disk not found" error. (You won't be able to run Compaq's SmartStart program and load the OS with driver support because it would require a full system erase, which defeats your purpose.) You need the OS that's already on the system left intact.

Have the correct drivers for the new RAID controller on hand, and make Setup load support for the device by pressing F6 repeatedly when the first NT Setup (blue) screen appears. This disables the auto detection of devices during setup and allows you to specify the driver for your RAID controller manually. Do so by pressing "S" to specify additional SCSI devices when prompted. You'll also be prompted to overwrite the NT common files in C:\Program Files\Common Files. Choose "no to all." This prevents Setup from overwriting common files that were updated by later service packs applied to your original installation.

The next step is to save two keys to .REG files:

HKLM\SYSTEM\ControlSet001\Services\drivername

HKLM\SYSTEM\ControlSet001\Enum\Root\LEGACY_drivername

Drivername is the name of the RAID driver loaded by the "sos" build, that is, the driver for the new controller. Again, use Table 1 to figure out the key name based on your Compaq RAID controller type.
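Because the two key paths differ only in the driver name, it's easy to mistype one under pressure. Here's a small Python sketch that spells both paths out for a given driver name so you can print them as a checklist. The function name and the example driver name "newraid" are hypothetical, and ControlSet001 is assumed simply because that is the control set named above.

```python
def driver_key_paths(driver_name, control_set="ControlSet001"):
    """Build the two registry key paths that describe a driver:
    its Services key and its Enum\\Root\\LEGACY_ key."""
    base = "HKLM\\SYSTEM\\" + control_set
    return [
        base + "\\Services\\" + driver_name,
        base + "\\Enum\\Root\\LEGACY_" + driver_name,
    ]


if __name__ == "__main__":
    # "newraid" is a placeholder; substitute the driver used by the "sos" build.
    for path in driver_key_paths("newraid"):
        print(path)
```

The same helper works for the old driver name when you reach the step where the original driver is disabled.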

Then load the system hive from the original installation into regedt32. It should be in the winnt\system32\config directory and will be called simply "system." Create and restore the two keys you saved in the two .REG files to their appropriate locations in the loaded "winnt" system hive. You may have to set the security at the root of this loaded hive to "everyone" = "full control" in order to perform the restore. This will give you the keys you need to boot the system.

It's important to disable the original device drivers that are causing the blue screen. Do this by changing the "Start" value from "0x0" to "0x4" for these keys:

HKLM\SYSTEM\ControlSet001\Services\old.drivername, and

HKLM\SYSTEM\ControlSet001\Enum\Root\LEGACY_old.drivername.

Old.drivername will be the name of the RAID driver that was in use by the original "winnt" build. It's also the driver causing the blue screen.
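It helps to know what those "Start" numbers mean. The Python sketch below lists the standard start types for an NT service or driver entry and shows why moving from 0x0 to 0x4 takes the old driver out of play. The helper names are mine, chosen for illustration.

```python
# Standard "Start" values for an NT service or driver entry.
START_TYPES = {
    0x0: "Boot",       # loaded by the boot loader (disk controllers live here)
    0x1: "System",     # loaded during kernel initialization
    0x2: "Automatic",  # started by the Service Control Manager at startup
    0x3: "Manual",     # started only on demand
    0x4: "Disabled",   # never started
}


def describe_start(value):
    """Return the name of a Start value, or "Unknown" if it isn't standard."""
    return START_TYPES.get(value, "Unknown")


def is_disabled(value):
    """True when the entry will not be loaded at all."""
    return value == 0x4


if __name__ == "__main__":
    for old, new in [(0x0, 0x4)]:
        print(describe_start(old), "->", describe_start(new))
```

A boot-start driver (0x0) for a controller that isn't there is exactly what the restored build trips over; setting it to 0x4 keeps NT from trying to load it.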

Unload the "winnt" system hive. This will save the new settings. Then close regedt32. Don't forget to copy the "new.drivername.sys" file from the "sos" build to the "winnt" build. This is so the "winnt" build can find the driver specified in the keys you just added. The default location for these drivers is %systemroot%\system32\drivers\. Set the view settings to "show all files."

And that's it. Reboot and specify the "winnt" build when prompted at start-up. Your system will use the new controller drivers and should boot successfully. Once you're comfortable that the system's booting properly, you can remove the "sos" directory and any references to it in the boot.ini.
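Cleaning up boot.ini by hand is simple enough, but here's a Python sketch of the logic in case you want to script it or just double-check your edit. It assumes the usual boot.ini layout ([boot loader] and [operating systems] sections with ARC paths) and the "sos" directory name suggested earlier; the function name and the sample file contents are illustrative, not copied from a real server. Note that it matches on the ARC path to the left of the equals sign, so the unrelated /sos boot switch is left alone.

```python
def remove_boot_entries(boot_ini_text, system_dir="sos"):
    """Drop [operating systems] entries whose ARC path ends in \\<system_dir>,
    and repoint default= at the first remaining entry if it referenced them."""
    suffix = "\\" + system_dir.lower()
    in_os_section = False
    kept_paths = []
    kept_lines = []
    for line in boot_ini_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            in_os_section = stripped.lower() == "[operating systems]"
            kept_lines.append(line)
            continue
        if in_os_section and stripped:
            arc_path = stripped.split("=", 1)[0].strip()
            if arc_path.lower().endswith(suffix):
                continue  # this entry belongs to the parallel build
            kept_paths.append(arc_path)
        kept_lines.append(line)

    result = []
    for line in kept_lines:
        stripped = line.strip()
        if stripped.lower().startswith("default=") and kept_paths:
            target = stripped.split("=", 1)[1].strip()
            if target.lower().endswith(suffix):
                line = "default=" + kept_paths[0]
        result.append(line)
    return "\n".join(result)


# Hypothetical boot.ini contents, for illustration only.
SAMPLE_BOOT_INI = r"""[boot loader]
timeout=30
default=multi(0)disk(0)rdisk(0)partition(1)\sos
[operating systems]
multi(0)disk(0)rdisk(0)partition(1)\sos="Windows NT Server Version 4.00 (sos)"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00"
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00 [VGA mode]" /basevideo /sos
"""

if __name__ == "__main__":
    print(remove_boot_entries(SAMPLE_BOOT_INI))
```

Keep a copy of the original boot.ini before changing it; a bad default= line is an easy way to create a new boot problem while fixing the old one.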

Perhaps the most important thing we learned during this process was not to panic. Nothing in IT is truly new. Somebody out there has experienced what you're going through at least once before. Thorough, hard-copy documentation of your equipment resources will serve you well in a crisis. Funny how the old rules still apply.

Stan Jourdan, network administrator, also contributed to this article.

About the Author

Jim Richards, MCSE, MCP+Internet, is a network engineer in Boston, Massachusetts. He can be reached at [email protected].
