Storm Stories -- Redmondmag.com

Storm Stories

It’s a truism in IT that various parts of your network—servers, hard drives, video cards, that mission-critical software program—will grind to a halt eventually. Here we present four disaster-recovery scenarios and how to recover from each.

By Derek Melber
02/01/2004

When disaster strikes your network, will you know what to do? Remember, as consultants are always telling you, it’s not a matter of if you have a crucial outage on a critical server, but when. And when when comes, you’ll be faced with choices.

Decision-making is an important part of disaster troubleshooting and recovery; as with most things Windows-related, there’s usually more than one way to diagnose and fix a problem. Making the right decision can often mean the difference between a working server and one that requires you to update your resume.

Here we present four disaster scenarios and how to recover from each. You may have encountered one or more of these; chances are good you may need to know how to deal with each one during your career.

Disaster Scenario 1: Domain Controller Can’t Phone Home
A Windows 2000 Active Directory (AD) domain was just installed. The domain is running in Mixed mode, as there are some straggling Windows NT 4.0 backup domain controllers (BDCs) in the domain. One of the BDCs, NTBDC1, is in the process of being upgraded to a Win2K DC. The upgrade is necessary because NTBDC1 runs a home-grown application that must run on a DC. The programmer has since left the company, so it would be very difficult to get the application running on a different DC with a fresh installation.

The upgrade process consists of two steps. First, the OS is upgraded to Win2K. The second step is to run DCPromo to make the computer a Win2K DC.

NTBDC1 starts the upgrade process and maneuvers through the first step without a glitch. The second stage of the process starts, which goes great... until the reboot! When NTBDC1 comes to life after the second reboot, there seem to be some issues (“issues” being the technical term for “a horrible disaster”).

The symptoms are as follows:

NTBDC1 has the replicated copy of AD.

NTBDC1 shows up in the Domain Controllers OU.

NTBDC1 shows up in the Default-first-site-name site.

The other DCs can’t connect to NTBDC1’s copy of AD.

Any AD changes that appear on the other DCs don’t replicate to NTBDC1’s copy of AD.

After the three restarts it takes a typical Microsoft operating system to correct itself, nothing has changed. We have an orphaned DC that can’t communicate with the other DCs. However, we can’t just wipe the computer clean, since it runs our home-grown application.

Any restoration from tape might be more complex than it’s worth, as the computer has gone through such a large transformation; plus, AD’s been changed to accommodate the new Win2K DC. The restoration would make it an NT BDC, which could confuse AD even more.

At this point, an attempt is made to demote the DC back to a member server, before again attempting the promotion. However, during the demotion, NTBDC1 can’t be removed from the domain, since it can’t communicate with the other DCs at the AD level. Now we’re stuck! We have a DC that can’t be demoted, won’t communicate as a DC, and can’t be reinstalled.

There are no tools that Microsoft provides to recover from a situation like this. The Directory Services Restore Mode, Recovery Console, NTDSUTIL and other tools won’t help us.

In this case, your best option is to resort to an undocumented (in the Microsoft world anyway) and unsupported “modification” of the DC (modification is the technical term for “hack”).

It was well known—and still unsupported—in the NT days that the ProductOptions Registry value could change the behavior of a server. Basically, it could make a server out of a workstation. A similar modification can be made to the same Registry value to turn a DC into a member server. Here are the steps:

1. Start NTBDC1 in Directory Services Restore Mode.

2. Open the following Registry key:

HKLM\SYSTEM\CurrentControlSet\Control\ProductOptions

3. Change the ProductType value under that key from LanmanNT to ServerNT.

4. Stop the File Replication Service on NTBDC1. (An easy way to do this is to run “net stop ntfrs” from a command prompt.)

5. Delete the winnt\sysvol folder on NTBDC1.

6. Delete the winnt\NTDS folder on NTBDC1.

7. Restart NTBDC1.

8. Change NTBDC1 to a standalone server.

9. On a functional DC in the domain, clean up all remnants of NTBDC1. This will most likely require the use of NTDSUTIL and the Metadata cleanup option. For details on how to use NTDSUTIL to clean up orphaned objects in AD, refer to Knowledge Base article 230306. The two key components requiring cleaning are the computer object in Active Directory Users and Computers and the server listed in Active Directory Sites and Services.

10. Join NTBDC1 to the domain.

11. Run DCPROMO on NTBDC1.

This process can be used in many scenarios, and in cases like this it’s been a saving grace for many admins. There’s no need to call Microsoft PSS for help with this process, as they won’t have anything to do with it. However, when in a pinch, it can be just what you need.
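Before and after a change like the one in steps 2 and 3, it can be reassuring to confirm what role the Registry actually reports. Here's a minimal, read-only sketch in Python. It assumes the value under the ProductOptions key is named ProductType (the steps above don't spell that out), and the `describe_role` and `read_product_type` helpers are names invented for this illustration. It only reads the value; it doesn't perform the unsupported edit itself.

```python
import sys

# Key path taken from step 2; the value name is an assumption of this sketch.
KEY_PATH = r"SYSTEM\CurrentControlSet\Control\ProductOptions"
VALUE_NAME = "ProductType"

# The three strings this value is generally known to take.
ROLES = {
    "WinNT": "workstation",
    "ServerNT": "standalone or member server",
    "LanmanNT": "domain controller",
}


def describe_role(product_type):
    """Translate a ProductType string into a human-readable role."""
    return ROLES.get(product_type, "unknown")


def read_product_type():
    """Read the live value; returns None when not running on Windows."""
    if not sys.platform.startswith("win"):
        return None
    import winreg  # Windows-only module, imported lazily
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
        return value


if __name__ == "__main__":
    current = read_product_type()
    print(current, "->", describe_role(current))
```

On a healthy DC you'd expect to see LanmanNT reported; after step 3 and a restart, ServerNT.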

Disaster Scenario 2: It’s a RAID!
I walked into the server room at 6:30 one morning to find out one of my servers had a Blue Screen of Death waiting for me. After restarting the computer to see if it would come out of this coma, I was greeted with yet another BSOD. I scanned the BSOD binary puke displayed on the screen to see if my secret decoder ring could decrypt any of the information. This particular BSOD was gracious; it displayed the exact driver causing the problem: fasttrak.sys. I knew exactly what that was—my RAID controller.

I restarted the computer and, before it could boot, dove into the RAID utility. I went directly to the RAID 1 (mirror) array, only to find that it showed both drives functional and running, when I was expecting it to tell me which one was bad.

I was a bit dismayed, but had to get the server running. So, I did what any good admin would do: I unplugged one of the drives connected to this RAID array (neither was running the operating system) and restarted the computer. It started just fine, and my data was intact. I was getting errors indicating my array was errant, but at least I was running.

Now I had to find out which drive was causing the problem. When I’d unplugged the drive to get the computer running, it didn’t give me errors indicating the drive was the problem. So I swapped drives, unplugging one and plugging in the other, and restarted the computer. The computer started just fine, and again the data was present. This was perplexing, since both drives appeared to be fine.

It was curious that both drives seemed to work independently, but when started together in the original array, the computer wouldn’t start. My best guess at this point was that the RAID 1 information stored on each drive to keep the drives synchronized was corrupt in some fashion.
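The swap test above is really a small truth table: boot with drive A alone, with drive B alone, and with both, then see which combination fails. Here's a rough sketch of that reasoning in Python. The function name and the wording of the conclusions are mine, invented for illustration, and they're working hypotheses rather than a vendor diagnostic.

```python
def interpret_boot_tests(a_alone_ok, b_alone_ok, together_ok):
    """Map the three boot outcomes to a working hypothesis."""
    if together_ok:
        return "array boots; fault not reproduced"
    if a_alone_ok and b_alone_ok:
        # The case described above: each drive is fine until they're paired.
        return "both drives boot alone; suspect array metadata or controller"
    if a_alone_ok:
        return "suspect drive B"
    if b_alone_ok:
        return "suspect drive A"
    return "neither drive boots alone; suspect controller or both drives"


if __name__ == "__main__":
    print(interpret_boot_tests(a_alone_ok=True, b_alone_ok=True, together_ok=False))
```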

I now looked at the server’s RAID utility; it also confirmed that the drive was functional, with no errors. I finally resorted to the event viewer and the application log. I normally ignore this step altogether, since the event viewer usually gives inconclusive and cryptic messages. I was shocked to see a message indicating which channel on the RAID controller was causing the problem.

I removed the drive on the problem channel and restarted the computer. I then went to order another drive to replace the failed one.

In the meantime, the server and I went on our merry way. That is, until about 3 p.m., when the server presented me with yet another BSOD. This error message, however, wasn’t as descriptive as the last one, indicating nothing as to the reason for the crash. I restarted the computer and it came alive again, with what seemed to be no problem.

The server ran through the day with no problems, until about noon the next day, when I realized there was another BSOD on the server after a call from a user who couldn’t access files. I again restarted the server; it came back to life with no problems.

The replacement for the failed RAID hard drive was supposed to arrive quickly; it didn’t. Many customer support calls and much complaining later, I learned that the model I needed was on backorder by about a week. I gritted my teeth at the news, but was confident I could obtain the drive from another source with no problem. I was wrong. After several hours of calling every company that sells computer parts, none of which had my hard drive, I was forced to wait for the backordered drive. There was no other server on the network that could hold this data and no money in the budget to get another server.

I wasn't 100 percent certain that the new BSODs were related to the failed hard drive, since there was no indication from the BSOD dump or the log files that the RAID controller was causing the crashes. But over the next week, the server crashed randomly three more times.

When the hard drive arrived, it was quickly installed in the server, and the RAID array synchronized. The server was restarted, and it hasn’t failed since.

Although the users of this computer were upset and the BSODs made me even more upset, I had few options but to wait this problem out. Although RAID is a great technology, it can sometimes cause more harm than good when it comes to making data available. In this case, the RAID protected my data from disaster, but it also caused problems with accessing the data.

Disaster Scenario 3: Server In-Security
It’s 9 a.m. and the phone begins to ring. The first user complains that he’s not able to access files on Server1. The next phone call is the remote location, reporting that the backup failed on Server10. The next user reports that she’s not able to authenticate to Server5 from her NT workstation. The phone continues to ring, but you ignore it. Instead, you try to solve the problems for all of the servers.

First, you attempt to access each computer, to see if the user issues are affecting the administrator accounts, too. You find out that some, but not all, of the issues affect the administrative accounts. You also discover that you can access all the resources on all servers as the administrator, ruling out an overall communication problem. Since the users that reported problems are from all over the company, you quickly rule out a specific networking issue.

Then, you try to think about what all the servers might have in common. They all run either Win2K or Windows Server 2003. Since they have different service pack and security patch levels, you rule out a virus for now. They all have static IP addresses and common IP configurations, so you know it’s not an IP-related issue. Moving on to AD, you know the servers are all located in the same Organizational Unit (OU). However, you haven’t implemented any Group Policy Objects (GPOs) for the Servers OU yet.
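The hunt for what the affected servers have in common boils down to a set intersection: list each server's attributes and keep only the ones that are identical everywhere. Here's a minimal Python sketch of that idea. The inventory is made up for illustration (the server names come from the scenario, but the attribute values are hypothetical), and `shared_attributes` is a helper name of my own.

```python
# Hypothetical inventory; attribute values are illustrative only.
INVENTORY = {
    "Server1": {"os": "Windows 2000", "service_pack": "SP3",
                "addressing": "static", "ou": "Servers"},
    "Server5": {"os": "Windows Server 2003", "service_pack": "none",
                "addressing": "static", "ou": "Servers"},
    "Server10": {"os": "Windows 2000", "service_pack": "SP4",
                 "addressing": "static", "ou": "Servers"},
}


def shared_attributes(servers):
    """Return the attribute/value pairs that are identical on every server."""
    if not servers:
        return {}
    common = set(servers[0].items())
    for attrs in servers[1:]:
        common &= set(attrs.items())  # keep only pairs present everywhere
    return dict(common)


if __name__ == "__main__":
    print(shared_attributes(list(INVENTORY.values())))
```

With this sample data only the addressing and the OU survive the intersection. Static addressing has already been ruled out, which leaves the OU as the thing worth a closer look.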

Because this seems like the only potential problem, you investigate the OU through Active Directory Users and Computers. There, you find that someone has linked a new GPO, named WatchThis, to the Servers OU. You scan the GPO and find that almost every setting in the Computer Configuration portion of the GPO has been set to something. The Security Settings section has been significantly altered, including Local Policies, User Rights, Security options, File system permissions, Registry permissions…the works!

You also find that several OUs have servers in them. You investigate those OUs, but find that they don’t have any GPOs linked to them.

After donning your Sherlock Holmes hat and magnifying glass and doing some detective work, you track down the culprit: the junior administrator, who thought he was “just” altering a GPO not linked to anything.

After properly berating the junior admin, it’s time to clean up the mess. You need to:

Get the servers functioning again.

Allow for easy repairs, as insurance against similar future problems.

Allow for quick checking of servers to make sure they have the correct configurations.

Ensure the servers persistently receive the correct settings.

As a solution, you decide to use the Security Configuration and Analysis tool. This allows for the creation of security templates, which address all security-related concerns that caused so many problems for your users. The security templates can be quickly created, then modified for each specific set of OUs. The security templates can also be imported into the GPOs at each OU where the servers reside. This makes for fast and efficient security settings application for the servers within the OUs. The final solution looks like this:

There are four security templates created, one for each server OU.

Each server OU has a new GPO linked to it.

Each new GPO has the appropriate security template imported into it, as shown in Figure 1.

Figure 1. Importing security templates to a GPO is an efficient way to improve security.

Each server OU’s ACL has been modified to alter who can modify and link GPOs to it.

The Group Policy Creator Owners group has been modified to only allow a few IT admins the ability to create new GPOs.

All the GPOs have been configured to apply security settings, even if the GPO hasn’t changed. Use the following GPO path: Computer Configuration | Administrative Templates | Group Policy. See Figure 2.

Figure 2. This policy enforces the application of security settings, even if the settings haven’t changed.

After creating this new GPO and security template environment, you’ve established who can modify the security settings; how often the security settings are applied; where the security settings are applied; and which security settings are applied. If there are any changes in the future, you can always re-import the security template into the GPO to get back to the original configuration.
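The "quick checking" goal comes down to comparing what a server actually has against what the template says it should have. The Security Configuration and Analysis tool does that analysis for you; the Python sketch below only illustrates the idea. It assumes a simplified, INF-style template (real security templates contain more sections and entries than `configparser` will comfortably handle), and the setting values, the sample "actual" data and the helper names are all hypothetical.

```python
import configparser

# Simplified, hypothetical template fragment in INF style.
TEMPLATE_TEXT = """
[System Access]
MinimumPasswordLength = 8
PasswordComplexity = 1

[Event Audit]
AuditLogonEvents = 3
"""


def load_template(text):
    """Parse INF-style text into {section: {setting: value}} with string values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # preserve the case of setting names
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def find_drift(baseline, actual):
    """Return (section, setting, expected, found) for every mismatch."""
    drift = []
    for section, settings in baseline.items():
        for name, expected in settings.items():
            found = actual.get(section, {}).get(name)  # None if missing
            if found != expected:
                drift.append((section, name, expected, found))
    return drift


if __name__ == "__main__":
    baseline = load_template(TEMPLATE_TEXT)
    # Hypothetical settings gathered from one server.
    actual = {
        "System Access": {"MinimumPasswordLength": "0", "PasswordComplexity": "1"},
        "Event Audit": {"AuditLogonEvents": "3"},
    }
    for row in find_drift(baseline, actual):
        print(row)
```

An empty result means the server matches the baseline; anything else is a candidate for re-importing the template.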

Disaster Scenario 4: Bad Video Signal
There I was, looking at the latest and greatest video drivers available on the Dell Web site for my Inspiron laptop computer. I was running Windows XP on my laptop, which had been stable so far. The newest driver mentioned that I could use this new utility that came along for the ride at install to manage my video adapter seamlessly. I proceeded to download the new driver and install it. When the installation was complete, it gave me a test screen that was ideal. I proceeded to restart my computer.

After the computer restarted, I logged on and everything seemed to be fine. I started to look for the newly-installed utility, but the mouse began to jump across the screen, or so it seemed. Then, as I opened up new windows, the screen began to pixelize. I knew I was in trouble at this point—the new driver wasn’t compatible with my computer.

I now needed to recover from this install, but which option should I choose? The list is rather lengthy with XP, but here were the options I had to choose from: 1) Modify the display settings; 2) Last Known Good (LKG); 3) System Restore; 4) Automated System Recovery (ASR); or 5) Driver Rollback.

I first tried to modify the display settings, hoping a change in resolution would be compatible with this driver. The screen was set at a fairly high resolution; after lowering it, I found that all I got were bigger pixels when the problem occurred.

The other options seemed more daunting, since they’d actually change my system and system files. Here was my thought process for each choice:

The LKG was a pretty good option, knowing that XP now includes any recently installed drivers in the recovery process. However, I wasn’t 100 percent comfortable going this route, since it had been quite a while since I’d rebooted and I’d modified other settings before installing the new driver. I figured I’d come back to this if better options didn’t work.

System Restore was another solid choice, since it would put the system back for me; but there were a couple of troubling gotchas with this option. First, System Restore changes many files and Registry entries, which might do more harm than good, depending on what’s changed since the last restore point. The other problem is that the installed driver was signed by Microsoft. What does this mean? First, it means Microsoft’s blessed it for XP. It also means that System Restore won’t create a restore point for this driver install. So this option was ruled out, since it wouldn’t help in this situation.

ASR didn’t seem wise, since this option is used to recover from a disaster that no other tool can help recover from. I wasn’t there yet. I’d only installed a new video driver, not destroyed my system—or didn’t think I had, anyway. I figured this would be my final option, if nothing else worked.

Driver Rollback, which would allow me to put the old driver back into place with just a restart, seemed to be the best option since it had the lowest associated risk. As I read up on it, it seemed more and more like the option for me.
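The thought process above amounts to "rule out what can't help, then pick whatever changes the least." Here's a small Python sketch of that selection. The numeric `scope` ranks are my own ordinal shorthand for how much each option touches, not measurements, and the class and function names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    scope: int        # hypothetical rank: 1 = smallest change to the system
    applicable: bool  # False once an option has been tried or ruled out
    note: str


OPTIONS = [
    RecoveryOption("Modify display settings", 1, False, "tried; only made bigger pixels"),
    RecoveryOption("Driver Rollback", 2, True, "restores only the previous driver"),
    RecoveryOption("Last Known Good", 3, True, "fallback; other settings changed since last reboot"),
    RecoveryOption("System Restore", 4, False, "ruled out for this driver install"),
    RecoveryOption("Automated System Recovery", 5, True, "last resort"),
]


def next_option(options):
    """Pick the applicable option with the smallest scope, or None if none remain."""
    candidates = [o for o in options if o.applicable]
    if not candidates:
        return None
    return min(candidates, key=lambda o: o.scope)


if __name__ == "__main__":
    choice = next_option(OPTIONS)
    print(choice.name, "-", choice.note)
```

If the chosen option fails, mark it inapplicable and call `next_option` again to get the next least invasive choice.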

I went into the Device Manager and found my Display Adapter in the device list. I then found the Roll Back Driver button, as shown in Figure 3, which allowed me to go back to the previously installed driver, the one in place before the new and improved one was installed. The process was quick and clean, and upon a restart, I was back up and running with no pixelization issues.

Figure 3. Roll Back Driver should be strongly considered as a first option when troubleshooting certain scenarios, since it makes the fewest system changes.

Know Your Options
If this had failed, I would’ve had to take more drastic measures. Knowing your options at a time of crisis is extremely important. Each recovery tool seems like it’s the one for you, but there are small (or large) issues that might rule them out or make them less attractive for your situation.
