How NOT to Architect Active Directory
Learn how to do Active Directory design right from these real-world case studies of those who have done it wrong.
Working in customer support for HP Services, one of the world's biggest computer support companies, I've seen some pretty messed-up Active Directory (AD) designs in my time. In many cases, the design disaster was the work of the nefarious "consultant" who was conveniently unreachable once the network was broken.
But even if the AD architect was available, a perception problem exists that can hamper efforts to repair the damage: Some believe that once the AD design is complete and implemented, it's set in stone. That's false: Although it can take significant time, effort and money, implementing a new design is usually possible, and it is sometimes required if the root cause of the problem is the design itself.
What follows are actual cases I've worked on over the past few years (the names have been changed to protect the guilty). As you read, keep in mind that it's important to examine the design principles involved in each case and not get too hung up on the technical details.
Living with Limits
This customer was a retail business with more than 1,000 stores. They hired a well-known company to design the AD structure and the supporting network (DNS, DHCP and so on). Although HP didn't do the design, we had the support contract.
The design called for every store to have its own domain, complete with two domain controllers (DCs), a DNS server, a WINS server and a DHCP server. All this was designed to support about four users per store! There was no IT staff in the stores; everything was supported from company headquarters. The AD design had more than 1,000 child domains off of a single parent domain, as shown in Figure 1.
Figure 1. The original forest design featured a single parent/root domain and 1,400 child domains. Some domains had as few as four users.
It made no sense to have a domain for four users, and almost immediately they started having replication problems. I told them early on that this wasn't a good design, but they didn't care. This is what their consultant came up with, and by golly, they were going to use it.
We gradually worked through the problems, until one day it all came crashing down. The retailer had been happily cranking out domains like hotcakes when they realized things were coming unglued: replication was broken, authentication failed, business-critical apps didn't work. You get the picture.
What happened is that they hit the AD ceiling on domain creation, a hard limit of 800. The customer engaged Microsoft engineers to help them eliminate enough domains in the right sequence to get them working again, but they were stuck at the 800-domain limit.
One possible solution, shown in Figure 2, would have been to divide the domains up into four regional child domains under the parent, then divide the 1,400 store domains under the regional domains (becoming grandchild domains). It would have worked, because there would be about 350 child domains for each regional "parent."
Figure 2. Adding another level of domains in the forest would have avoided the domain limitation, but still created many more domains than necessary.
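The arithmetic behind the regional proposal can be sketched in a few lines of Python. This only reproduces the numbers quoted above (800, 1,400 and four regions) and assumes, as the reasoning in the text does, that the ceiling is measured against the children of a single parent; the function name is made up for illustration.

```python
import math

# Figures quoted in the article.
DOMAIN_LIMIT = 800      # ceiling the customer ran into
STORE_DOMAINS = 1400    # one domain per store
REGIONS = 4             # proposed regional child domains


def domains_per_parent(store_domains: int, parents: int) -> int:
    """Largest number of store domains under any one parent if spread evenly."""
    return math.ceil(store_domains / parents)


flat = domains_per_parent(STORE_DOMAINS, 1)
regional = domains_per_parent(STORE_DOMAINS, REGIONS)

print(f"flat design: {flat} child domains under the root "
      f"(over the limit: {flat > DOMAIN_LIMIT})")
print(f"regional design: {regional} grandchild domains per region "
      f"(over the limit: {regional > DOMAIN_LIMIT})")
# The regional design still leaves a very large forest to administer.
print(f"total domains in the regional design: {1 + REGIONS + STORE_DOMAINS}")
```

Even where the per-parent count fits, the total domain count barely changes, which is why the regional layout was never the recommended fix.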
But given the relatively small number of users, there's no reason they couldn't have used a single domain. They had a centralized IT administration model, high-speed links to all the stores, and no reason at all for a multiple domain model. Given those factors, a single domain design made the most sense.
This was what we suggested, and what they ultimately ended up doing (Figure 3). They built a single domain infrastructure, migrated the users, computers and groups to the new domain and tore down the old. Besides AD being happier and much easier to administer, it led to a huge reduction in hardware, going from more than 1,000 servers to a handful.
Figure 3. The solution was to create a single domain and create Organizational Units (OUs) that took the place of the 1,400 former child domains.
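To give a feel for what "one OU per store" looks like, here is a minimal Python sketch that generates distinguished names for the store OUs. The domain name (`corp.example`), the parent `Stores` OU and the `StoreNNNN` naming scheme are all assumptions made for illustration; the article does not say how the customer named anything.

```python
def store_ou_dn(store_number: int, base_dn: str = "DC=corp,DC=example") -> str:
    """Build the DN of one store's OU under a hypothetical parent 'Stores' OU."""
    return f"OU=Store{store_number:04d},OU=Stores,{base_dn}"


# One OU per former child domain, all inside a single domain.
store_ous = [store_ou_dn(n) for n in range(1, 1401)]

print(len(store_ous))
print(store_ous[0])
print(store_ous[-1])
```

Each store becomes a container in one directory rather than a separate domain with its own domain controllers, which is where the hardware savings come from.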
Moral of the story: We live in a world of limits; this is one of them. You may argue that Microsoft should eliminate the 800-domain restriction, but a forest with 1,000 domains and four users per domain makes no sense. Disaster recovery in this environment would be a nightmare. Large global corporations are getting by with four domains, and Microsoft itself is moving to a single domain for disaster recovery and security reasons. If competent architects had done a proper design in the first place, this problem would never have happened.
The Broken Tiebreaker
A global company, which we'll call the Acme Corporation, has headquarters in New York. It hired a reputable company to provide consulting advice on its migration from Windows NT to Windows Server 2003. In designing the AD site topology, Acme wanted a three-tiered structure, shown in Figure 4. In this setup, the slower links connect the lower-tier sites, while the faster links connect the second-tier sites and the two core sites at the top tier. The core sites are Chicago and Milwaukee, with Milwaukee being a disaster recovery site.
Figure 4. This three-tiered topology shows the replication path. Replication first happens at the lower-cost (better bandwidth) sites, and moves through the higher-cost (lower bandwidth) sites.
The design should be fairly simple to implement by creating site links with two sites in each link. The links at the lower level get the highest costs (representing lower-bandwidth connections), and the link containing the two core sites, which have the most bandwidth, gets the lowest cost. It would look similar to Figure 5.
Figure 5. This is the site link design the company went with, which allows for smooth replication of Active Directory data.
But rather than using solid, time-tested design principles, Acme came up with the most bizarre solution imaginable. It involved an incredibly obscure tiebreaker rule described on page 166 of the Distributed Systems Guide of the Windows 2000 Resource Kit. The rule says that if the Knowledge Consistency Checker (KCC) has two site links to choose from and both are equal cost, it will break the tie by building a connection to the site with the most domain controllers in it. If both sites have the same number of domain controllers, it selects the site based on alphabetical order.
The Knowledge Consistency Checker (KCC) helps replication topology remain stable throughout an Active Directory domain. In essence, it makes sure that changes made at one domain controller are successfully pushed to all other DCs in a domain’s sites, so that they all contain the same information. Making this replication run as smoothly as possible is critical to making sure site connections aren’t overwhelmed with replication information, which can clog up the network like hair in a drain. That’s why it’s important to determine site link costs: sites with lower-bandwidth connections can’t handle the same volume of replication information as sites with bigger pipes. Sites with smaller pipes get a higher cost, since it’s more time-consuming to push updates to them, and vice versa.
— Keith Ward
Acme put all the sites in a single site link and depended on this tie-breaker rule for the KCC to sort out all the sites and replication in a three-tier topology. In examining Figure 5, consider the case of replicating from Chicago to another site. If all sites are in one site link, they have the same cost, schedule and replication frequency, instead of the different costs shown in the figure. All sites are thus lumped together, and because they all have one DC, replication goes in alphabetical order, from the Chicago hub to Amsterdam, then Bangalore, Beijing, Berlin, Boston and so on. In addition, in its testing, the company couldn't verify that the tiebreaker system worked; it seemed to be random.
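The tiebreaker as described above amounts to a two-key sort. The Python sketch below models only that description (most domain controllers first, then alphabetical); it is not the KCC's actual algorithm, and the `Site` type and `tiebreak_order` function are invented for illustration. The site names and the one-DC-per-site figure come from the text.

```python
from typing import NamedTuple


class Site(NamedTuple):
    name: str
    dc_count: int


def tiebreak_order(sites):
    """Order equal-cost sites: most DCs first, then alphabetical by name."""
    return sorted(sites, key=lambda s: (-s.dc_count, s.name))


# Every site sits in the same site link (equal cost) and has one DC.
candidates = [
    Site("Boston", 1),
    Site("Beijing", 1),
    Site("Amsterdam", 1),
    Site("Berlin", 1),
    Site("Bangalore", 1),
]

order = [s.name for s in tiebreak_order(candidates)]
print(order)  # ['Amsterdam', 'Bangalore', 'Beijing', 'Berlin', 'Boston']
```

Nothing in that ordering knows about tiers or bandwidth, so even if the rule behaved as documented it could not reproduce a three-tier topology.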
So, the company was basing its entire replication topology on a fairly complex three-tier design, and hoping that when all the sites were lumped together, the KCC would somehow figure out how to force the replication from the hub sites to the second tier and then out to the third tier. Then, to add insult to injury, the tiebreaker system didn't even work.
I'd never even heard of this tiebreaker, so I contacted a respected AD engineer at Microsoft; he'd never heard of it either.
If they proceeded with the migration using this topology design, they would be repairing it very soon. I proposed a solution that followed some basic replication rules and solved the potential problem. Those rules are:
- Force the KCC to replicate the way you want it. The more freedom you give the KCC to figure out the topology, the less likely that replication will go the way you want it.
- Each Site Link should only have two sites, except the core (top level) link, which may have more.
- Cost should be planned to force replication into the hub, following the physical network from slowest to fastest links.
- Do not use scheduling unless absolutely necessary. It's actually possible to create a schedule that would prevent any replication. Make sure to test any schedule you create.
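The first three rules can be checked on paper before anything is built. The following Python sketch is one way to do that, under stated assumptions: only Chicago and Milwaukee as core sites come from the article, while the other site pairings, the tier assignments, the cost values and both helper functions are hypothetical. It flags any non-core site link with more or fewer than two sites and computes the cheapest route from an outlying site to the hub.

```python
import heapq

# Hypothetical site links as (name, sites, cost). Lower cost = faster link.
SITE_LINKS = [
    ("Core", {"Chicago", "Milwaukee"}, 10),
    ("Chicago-Boston", {"Chicago", "Boston"}, 50),
    ("Chicago-Berlin", {"Chicago", "Berlin"}, 50),
    ("Boston-Amsterdam", {"Boston", "Amsterdam"}, 100),
    ("Berlin-Bangalore", {"Berlin", "Bangalore"}, 100),
]
CORE_LINK = "Core"


def oversized_links(links, core_link):
    """Names of non-core links that break the two-sites-per-link rule."""
    return [name for name, sites, _ in links
            if name != core_link and len(sites) != 2]


def cheapest_cost(links, start, goal):
    """Dijkstra over site links; lowest total cost, or None if unreachable."""
    graph = {}
    for _, sites, cost in links:
        for a in sites:
            for b in sites:
                if a != b:
                    graph.setdefault(a, []).append((b, cost))
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, site = heapq.heappop(queue)
        if site == goal:
            return dist
        if dist > best.get(site, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in graph.get(site, []):
            new_dist = dist + cost
            if new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(queue, (new_dist, neighbor))
    return None


print(oversized_links(SITE_LINKS, CORE_LINK))              # []
print(cheapest_cost(SITE_LINKS, "Amsterdam", "Chicago"))   # 150
```

With two sites per link and costs that fall toward the core, each outlying site has exactly one sensible route to the hub, which is the "no wiggle room" property the rules aim for.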
This customer eventually implemented the design in Figure 5 and proceeded with a very smooth migration. The health check we did after the migration yielded no errors in the AD environment.
Morals of the Story:
- Design the site topology very narrowly; don't give the KCC any "wiggle room."
- Step back and see if it makes sense.
- Validate everything in a good test environment before putting the design into production. The fact that the design didn't work in the test environment raised a flag in this case. We were able to solve problems before they brought down the production network.
- Don't be afraid to get a second opinion.
A small non-profit organization was designing its migration to Windows 2000 from NT, and got locked in a debate over how many forests it should have. There were only a few thousand accounts, but the IT department was split between two business groups. In addition, it had a separate fund-raising arm that had to be in its own forest for legal and business reasons. That meant they would have at least two forests.
The organization was also deploying Exchange and wanted one Exchange organization so that everyone would have an @corp.org e-mail address. That would make it impossible for anyone outside to tell whether an account was in the non-profit or fund-raising segment of the organization.
The two IT groups (because they were at each other's throats, let's call them the Hatfield IT group and the McCoy IT group) had created their own autonomous NT domains. The McCoy group only had about 20 users. As we conducted the assessment, these groups put pressure on us as to the forest design. The Hatfields wanted a single forest in addition to the fund-raising forest, while the McCoys wanted their own forest in addition to the Hatfields' forest and the fund-raising forest.
The situation became quite heated, because the two IT groups didn't trust each other. Neither group wanted to give up the Enterprise Admin account, which has ultimate authority over the forest. This distrust went back more years than most of them had been employed at the organization, and we were stuck in the middle. No matter which configuration we picked, one group would end up angry with us.
We finally called a meeting with all interested parties, including administrators, lower level managers and two business division directors. I told them they couldn't leave until we resolved this matter because we couldn't move forward without a decision.
I had our Exchange consultant explain the steps necessary to get a single Exchange organization to serve three forests, including SMTP forwarding, Free-Busy synchronization, Calendar synchronization and so on. It would be enough of a challenge to do it for two forests, let alone three; there were a lot of moving parts that could fail or cause delays in updates. One administrator commented that "People would not care about a power outage if they could get their e-mail," underscoring the importance of an efficient e-mail system to their organization. Others agreed.
This problem, then, wasn't a technical issue at all: it was a management issue. The Hatfield IT group didn't want an enterprise admin from the McCoy IT group messing with their environment, and vice versa.
Thus, we dealt with the problem from a management, rather than a technical, standpoint. We recommended establishing a set of policies governing the enterprise administrator; if the admin violated those policies, that was a management issue that could easily be resolved with a reprimand or other measures, including dismissal.
Everyone agreed and we ended up with a single forest for the Hatfields and McCoys, plus the fund-raising organization's forest.
Moral of the story: Choose admins with care. Microsoft recently noted that one of the top reasons customers called for disaster recovery support was "accidental deletion of objects," in which admins accidentally (or not) whack user accounts. Its No. 1 solution for this problem was to "be careful who you give admin privileges to."
It's like being a parent. If you can't trust your teenager to drive safely, don't try to invent a car that won't ever go over the speed limit, won't start if it detects alcohol or won't run a red light. You have to train the teenager, and the same is true with domain administrators. In either case, if he can't handle the responsibility, take it away.