Never Again

Time Machine

An accidental power outage wreaks havoc in a data center.

A few years back we had an electrician doing some work in our data center. He accidentally disconnected power on the wrong leg of the grid and crashed 80 Windows servers and a number of other network components.

One of the affected racks held our Active Directory forest root PDC (primary domain controller) emulator, which acted as the root time server for all systems within the forest. Unbeknownst to us, the battery on that server's motherboard wasn't holding its charge. When the server came back up, it reset the system time to the date the BIOS was manufactured.

My standard procedure for restarting our data center following any sort of power outage was to start all the domain controllers (nine DCs in all -- three for each of our three domains) before starting any of the other servers. Once they were stabilized, we would start our storage area network, which hosted our Exchange server and all remaining servers.

After this outage, I was more concerned with server data corruption than appliance-based firewalls, so I was in no hurry to get our Internet firewalls connected -- that was a big mistake.

With its dead motherboard battery, the server acting as the primary timekeeper in the AD forest was carrying a bad time and date. When internal communications came back online, that server resynchronized all our domain controllers (more than 30) across the country to its own time and date. Of course, the "new" date was a few years old.
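A minimal sketch, in Python, of the kind of sanity check that can flag a clock that has jumped into the past before it is trusted as a time source. It assumes the machine keeps a last-known-good timestamp somewhere durable; the function name, the five-minute allowance and the example dates are all hypothetical and not taken from the story.

```python
from datetime import datetime, timedelta, timezone


def clock_is_plausible(now, last_known_good, max_backward=timedelta(minutes=5)):
    """Return True unless `now` is earlier than the last known-good
    timestamp by more than the allowed backward drift."""
    return now >= last_known_good - max_backward


# Hypothetical values for illustration only.
last_good = datetime(2006, 3, 1, 12, 0, tzinfo=timezone.utc)   # written before the outage
bios_reset = datetime(2002, 1, 1, 0, 0, tzinfo=timezone.utc)   # clock after a dead battery
after_sync = datetime(2006, 3, 1, 14, 0, tzinfo=timezone.utc)  # clock after an external sync

print(clock_is_plausible(bios_reset, last_good))  # False
print(clock_is_plausible(after_sync, last_good))  # True
```

A check like this only says the clock is suspicious. What to do next (hold the server off the network, sync from an outside source first) is an operational decision.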

When the PDC regained its Internet connection, it immediately updated its time from an Internet time source, then sent the new, correct time to all the domain controllers. What I didn't realize was that AD has a built-in safeguard: if a domain controller has not communicated via replication in 90 days, it is considered a bad replica, orphaned out of the directory and never communicated with again. Because the time had moved back and then forward, the last-known replication appeared to have happened "years ago," and we had problems.
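The arithmetic behind that failure can be sketched in a few lines of Python. This is a simplified model, not AD's actual logic: it takes the 90-day figure from the account above, and the function name and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)  # threshold as described in the article


def looks_stale(last_replication, now, limit=STALE_AFTER):
    """Return True if the gap since the last replication exceeds the limit."""
    return now - last_replication > limit


# Hypothetical timestamps for illustration only.
real_last_sync = datetime(2006, 3, 1, 11, 55, tzinfo=timezone.utc)   # stamped with the true clock
rolled_back_sync = datetime(2002, 1, 1, 0, 10, tzinfo=timezone.utc)  # stamped with the reset clock
corrected_now = datetime(2006, 3, 1, 14, 0, tzinfo=timezone.utc)     # after the Internet time sync

print(looks_stale(real_last_sync, corrected_now))    # False
print(looks_stale(rolled_back_sync, corrected_now))  # True
```

A replication stamped while the clock was wrong is only minutes old in real terms, but once the clock jumps forward it looks years old and trips the threshold.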

After working on the issue myself for about two hours -- trying things like authoritative restores of AD -- I found nothing worked. I called Microsoft, and over the course of three long days with very little sleep, I escalated the matter to some of Microsoft's top-level engineers. They finally managed to get things functioning normally.

We had to do a massive amount of metadata cleanup to get AD to work. This included using tools like ADSI Edit, Ntdsutil, Dcdiag, Repadmin, Setspn, Ldifde and a host of Resource Kit tools. We began the process by performing a DCPROMO /FORCEREMOVAL on all domain controllers that were not PDC emulators. Once we had the three remaining servers communicating and functioning normally, we had the tedious task of removing all of the "forceremoval"-related FRS, DNS and AD entries from those three PDC emulators.

At this point we had a massive network being supported by only three domain controllers from separate domains in the same forest. As we all know, the FSMO roles need to be separated out when there's more than one domain. We introduced domain controllers back into the network one at a time until everything was restored to its original configuration.

Additionally, more than 300 of our nearly 700 PCs had corrupt Kerberos tickets due to the changing time stamps. They all had to be disjoined and rejoined to the domain. All of this could have been avoided had Microsoft had a hook that lets orphaned DCs back into the forest, but we figured it out. Thank God for Remote Desktop Services!
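As background on why shifting clocks upset Kerberos: authentication in an AD domain tolerates only a small difference between client and server clocks, five minutes by default. The Python sketch below illustrates that check in simplified form; the function name and timestamps are hypothetical, and it does not reproduce what actually went wrong on the PCs in this story.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW = timedelta(minutes=5)  # default Kerberos clock tolerance in AD domains


def within_skew(client_time, server_time, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by no more than the tolerance."""
    return abs(client_time - server_time) <= max_skew


# Hypothetical timestamps for illustration only.
server = datetime(2006, 3, 1, 12, 0, tzinfo=timezone.utc)
client_ok = datetime(2006, 3, 1, 12, 3, tzinfo=timezone.utc)
client_rolled_back = datetime(2002, 1, 1, 0, 0, tzinfo=timezone.utc)

print(within_skew(client_ok, server))           # True
print(within_skew(client_rolled_back, server))  # False
```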

About the Author

Jeremy Price is a senior network & systems engineer for a large construction firm. He has since forgotten more about orphaned DCs than most of us will ever know.

Reader Comments:

Fri, Nov 10, 2006 Anonymous Anonymous

Yes, there is a way, called "Allow Replication With Divergent and Corrupt Partner," but it comes with one caveat: you never know what is going to happen to your domain afterward. God save you if you use it.
We never know why a domain controller failed to replicate for more than 60 to 90 days. I think you first need to find out why it failed to replicate for such a long period. If you find that answer, you are good. Before adding that value to the registry, make sure your AD database is intact so that it can sync with the other domain controllers.

Fri, Jul 21, 2006 Anonymous Anonymous

"All of this could have been avoided had Microsoft had a hook that lets orphaned DCs back into the forest"

Unfortunately, your Microsoft support person didn't know (or recognize) that this called for the "Allow Replication With Divergent and Corrupt Partner" registry value. Your problem would have been solved. Sorry.

Thu, Jul 13, 2006 Anonymous Anonymous

What a true nightmare! Hope this will never happen to us!
