Never Again

Time Machine

An accidental power outage wreaks havoc in a data center.

A few years back we had an electrician doing some work in our data center. He accidentally disconnected power on the wrong leg of the grid and crashed 80 Windows servers and a number of other network components.

One of those racks had our Active Directory Forest Root PDC (primary domain controller) emulator, which acted as the root time server for all systems within the forest. Unbeknownst to us, the battery on that server's motherboard wasn't holding its charge. When the server came back up, it reset the system time to the date the BIOS was manufactured.

My standard procedure for restarting our data center following any sort of power outage was to start all the domain controllers (a total of nine DCs -- three for three domains) before starting any of the other servers. Once they were stabilized, we would start our storage area network, which contained our Exchange server and all remaining servers.

After this outage, I was more concerned with server data corruption than appliance-based firewalls, so I was in no hurry to get our Internet firewalls connected -- that was a big mistake.

With its dead motherboard battery, the server acting as the primary timekeeper in the AD forest was carrying a bad time and date. When internal communications came back online, that server resynchronized all our domain controllers (more than 30) across the country to its time and date. Of course, the "new" date was a few years old.

When the PDC emulator regained its Internet connection, it immediately updated its time from an Internet time source and sent the new, correct time to all the domain controllers. What I didn't realize was that AD has a built-in safeguard: if a domain controller has not replicated in 90 days, the others consider it a bad replica, orphan it out of the directory and never communicate with it again. Because the clocks had jumped back and then forward, the last-known replication appeared to have happened "years ago," and the DCs began treating one another as orphans.
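To see why the clock moving back and then forward was enough to trip that safeguard, here is a minimal sketch in Python. It is not how AD implements the check; the function name, the dates and the timeline are hypothetical and chosen only for illustration, and the 90-day limit is simply the figure quoted above.

```python
from datetime import datetime, timedelta

# Assumption: the 90-day limit quoted in the story. The real threshold is a
# forest-level setting and may differ in your environment.
STALE_AFTER = timedelta(days=90)


def is_stale(last_replication: datetime, now: datetime) -> bool:
    """Return True if the last replication is older than the allowed limit."""
    return now - last_replication > STALE_AFTER


# Hypothetical timeline; the story does not give actual dates.
real_now = datetime(2008, 6, 1, 12, 0)   # the true time of the outage
bios_date = datetime(2004, 1, 1, 0, 0)   # clock after the battery failure

# 1. The DCs sync to the bad clock and replicate, so the replication
#    timestamp they record is years in the past.
last_replication = bios_date + timedelta(hours=2)

# 2. While every DC shares the same bad clock, nothing looks wrong.
healthy_before = not is_stale(last_replication, bios_date + timedelta(hours=3))

# 3. The time source corrects itself and pushes the real time to every DC.
#    The same replication timestamp now looks years old.
stale_after = is_stale(last_replication, real_now)

print(healthy_before, stale_after)
```

The point of the sketch is that the staleness test only compares two timestamps. It has no way to tell a replica that has really been offline for years from one whose clock was briefly wrong.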

After working on the issue myself for about two hours -- trying things like authoritative restores of AD -- I found that nothing worked. I called Microsoft, and over the course of three long days with very little sleep, I escalated the matter to some of Microsoft's top-level engineers. They finally managed to get things functioning normally.

We had to do a massive amount of metadata cleanup to get AD to work. This included using tools like ADSI Edit, Ntdsutil, Dcdiag, Repadmin, Setspn, Ldifde and a host of Resource Kit tools. We began the process by performing a DCPROMO /FORCEREMOVAL on all domain controllers that were not PDC emulators. Once we had the three remaining servers communicating and functioning normally, we had the tedious task of removing all of the "forceremoval"-related FRS, DNS and AD entries from those three PDC emulators.

At this point we had a massive network being supported by only three domain controllers from separate domains in the same forest. As we all know, the FSMO roles need to be separated out when there's more than one domain. We introduced domain controllers back into the network one at a time until everything was restored to its original configuration.

Additionally, more than 300 of our nearly 700 PCs had corrupt Kerberos tickets due to the changing time stamps. They all had to be disjoined from the domain and rejoined. All of this could have been avoided had Microsoft provided a hook that lets orphaned DCs back into the forest, but we figured it out. Thank God for Remote Desktop Services!

About the Author

Jeremy Price is a senior network & systems engineer for a large construction firm. He has since forgotten more about orphaned DCs than most of us will ever know.
