In-Depth

Duel to the Death

This health care organization had its migration strategy all mapped out. But then reality intervened, forcing a major change of course.


We have about 25,000 employees, served through a multi-region WAN with a large number of NT 4.0 domains and NT 4.0 servers for hundreds of different applications used in our health care system. Each region had a master account domain and many resource domains, as shown in Figure 1.

Some of the resource domains maintain trusts to several of the regional master account domains (MADs). The regions share a messaging domain that contains Exchange 5.5 servers from all regions. Unfortunately, there are California users in both the HSOR and the HSCA domains. One of the advantages of moving to AD will be the MoveTree utility, which will eventually allow us to move these California users to the CA domain after it's built.
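For reference, MoveTree ships with the Win2K Support Tools, and moving a batch of users between two domains in the same forest would look roughly like the following. The server names and distinguished names here are placeholders for illustration only, and the exact switches should be verified against the Support Tools documentation.

    rem Dry-run the move first (server names and DNs are hypothetical)
    movetree /check /s dc1.hsca.example.org /d dc1.ca.example.org /sdn "OU=CA Users,DC=hsca,DC=example,DC=org" /ddn "OU=CA Users,DC=ca,DC=example,DC=org"
    rem If the check comes back clean, run the actual move with /start
    movetree /start /s dc1.hsca.example.org /d dc1.ca.example.org /sdn "OU=CA Users,DC=hsca,DC=example,DC=org" /ddn "OU=CA Users,DC=ca,DC=example,DC=org"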

We're planning this migration for two reasons:

  • Outlook Web Access 2000. Many of our top executives have been to briefings and summits where they've seen both OWA 2000 and Exchange 2000 demonstrated at great length. They want these capabilities—especially the improvements in OWA and benefits for remote workers.
  • We have a huge user base of Windows 95 workstations that need upgrading. We never switched to Windows 98 or NT Workstation, mainly due to budget cycles. But we now have the funding to do massive upgrades. Because we'll be upgrading these machines to Win2K Pro, we'd like to take advantage of Active Directory, especially policies and software installation packages.
Domains by the dozens
Figure 1. The company's old network included a gaggle of domains, with a multitude of trust relationships to maintain. User domains are on top, with resource domains below.

Active Directory Design
Our first Win2K implementation technical meetings were our AD architecture design meetings. Hand in hand with those AD design discussions came the DNS discussions, tackling topics like: Should we use the same namespace on the inside as the outside? Should we use a "one-off" namespace or a completely separate one? Which region has control of the DNS namespace? It's amazing that it took two full days to agree on the simple design shown in Figure 2. Our company has strong regional presences in several states, with separate IT departments and NT 4.0 MADs. We used an outside consultant, which helped us cut through the political problems and keep the discussion focused on the technical reasons for eliminating some designs in favor of the one we finally agreed on.

One domain, one tree
Figure 2. Each peer domain in the new infrastructure is a separate tree in the single forest. AD is the "empty" root domain created first.

The design has these advantages:

  • Each region keeps separate domain security policies and administration.
  • It uses the empty place-holder domain as the apolitical domain root.
  • We're able to leverage the centrally applied global catalog, schema changes and Exchange 2000.
  • Peer use of the DNS namespace, with delegation of subzones to each region for its own administration.
  • Each region can make individual decisions on issues such as whether to implement an in-place domain upgrade or a migration to a pristine domain and how to structure additional forest elements below the first level.

Each of the domain names corresponds to its state abbreviation except for the AD name, which is arbitrary. Unfortunately, AD wasn't our first choice for the abbreviation. We fell victim to the well-known warning that a forest built without sufficient planning may have to be rebuilt. The original choice was the slightly more agnostic DS. We had to rebuild the root domain after we discovered that ds..org was already in widespread use by another region as a Web URL. Lesson learned: Make sure to have all the right people included in the design meetings and thoroughly check out the existing namespace prior to building domains.
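Checking a candidate name against what's already in use is quick to do before committing to a root domain. A couple of simple lookups along these lines (the names shown are placeholders) would have flagged the conflict early:

    rem Check whether the candidate root name already resolves anywhere (name is hypothetical)
    nslookup ds.example.org
    rem Also query the internal DNS servers directly for the same name
    nslookup ds.example.org internal-dns-server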

March 26, 2001: Day 1 of the AD Pilot
Our main tasks: Get a new DC for the HSOR domain physically set up and start the OS install from the new server build document, using the opportunity to double-check that document.

We met briefly after lunch with the core group to develop a last-minute plan regarding the migration/administration tool selection. Some vendors' quotes need reworking. One came in at $1.9 million for the complete tool suite! We can't present this to our management and risk the humiliating laughter.

We agreed to proceed using a NetIQ migration tool—Domain Migration Administrator—for the first part of the pilot (since we have the option to use this tool while engaged with Microsoft Consulting Services (MCS)), but we'll keep evaluating the migration and administration tools in our lab.

Broken Tools
We tried migration suites from NetIQ, Aelita and BindView. None worked adequately. The most problematic area for all the tools we tested was machine and profile migration. A high percentage of test users weren't able to use their migrated profiles, causing Windows to automatically create a new profile during the login process.

During our early migration attempts, the NetIQ machine migration wasn't consistent. We were only able to migrate one machine at a time to avoid errors. Testing also indicated that migrating machines and machine profiles using NetIQ required the use of an account—in the Local Administrators group of the workstation—that has the right to add machines to the new domain.
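The same prerequisite applies if you join a workstation to the new domain by hand with the NETDOM utility from the Win2K Support Tools. A rough sketch follows; the machine, domain and account names are placeholders, and the switches should be checked against the Support Tools documentation before use:

    rem Join workstation WS0042 to the new domain; /UserO is a local admin on the
    rem workstation, /UserD is a domain account allowed to add machines to the domain
    netdom join WS0042 /Domain:HSOR /UserO:WS0042\LocalAdmin /PasswordO:* /UserD:HSOR\migrate-svc /PasswordD:* /REBoot:30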

We encountered an additional problem with NetIQ: after we migrated accounts, the accounts in both the source and target domains were disabled. We had assumed it would disable only the source account.
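If you hit the same behavior, re-enabling an account from the command line is quick; something like this (the username is hypothetical) can be run against the domain from any member machine:

    rem Re-enable a migrated account that was left disabled in its domain
    net user jsmith /active:yes /domain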

Impact (1-10) | Risk (%) | Risk | Mitigation Plan
10 |  | Upgrade process corrupts the SAM database. | Create offline backup/dress rehearsal.
0 | 5 | SAM already corrupt. | Corrupt data wouldn't be accessible. Not different than now.
7 | 75 | Incompatibility of software currently on DCs. | Inventory. Move software off DC to another machine.
8 | 5 | Resources needed. | May need to send personnel to CA for two DCs there.
8 | 5 | SAM becoming too large. | Upgrade all DCs; problem eliminated.
7 | 90 | Unknown carryover of "garbage" accounts (users, machines, security groups that should be cleaned up). | Clean up now.
5 | 10 | Current DCs' hardware non-Win2K compliant. | Inventory/update.
9 | 10 | Upgrade impacts mission-critical processes. | Test/research/dress rehearsal.
2 | 5 | Clients have hardcoded IP addresses to DCs. | Research; may just have to accept this risk and react when (or if) things break.
10 | 90 | Political pushback. | Prepare and document sound reasons for taking this course; show how advantages outweigh risks.
10 | 100 | Login scripts and replication won't work. | Eliminate by configuring file replication service.
1 | 25 | Collapsing resource domains later. | Need to buy management tools.
 |  | Changing original naming conventions. | Need to communicate and get agreement from other regions.
 |  | Training for admins on new tools (MMC) for account management. | Introduce tools ASAP.
Table 1. Before switching gears to the in-place upgrade, the IT department prepared this risk assessment table showing the possible impact on the environment.

Aelita and BindView also had their share of problems. The support person from Aelita was reluctant to provide any help even though we had a "crippled" version that we were evaluating on a limited pilot group. We received an e-mail from Aelita suggesting that we purchase the product for support.

The migration using the tools turned out to be much, much more time-consuming and problematic than we ever imagined. We were spending all our time just tracking down small individual problems related to inconsistent results with the migration tools. At times it would be a user with an incomplete SID history; the next time it would be a machine that failed to migrate or a profile that didn't get re-permissioned properly. We were also constantly talking with the vendors and updating or patching the software. Our team's resources were consumed with these activities.

April 16: The Big Decision
After about six weeks of this situation, we made the decision to change tracks, kill the migration plan, and do an in-place upgrade instead. Our reasons for the about-face included these:

  • Fewer resources needed, both in labor and money. Migrating groups of people over time is a longer and more labor-intensive operation. After an in-place upgrade, all accounts can log into an AD domain in one fell swoop. Less coordination and project time are needed for this process. It also means we won't have to purchase migration tools, which would have cost in excess of $175,000.
  • Less impact on users: SID history doesn't need to be migrated to a new domain. We would eliminate the problems we've been facing in doing just that. Domain rights and roles wouldn't change, and the domain name would remain the same. Users would still log into HSOR as the domain name.
  • Accelerates entry into AD. Our entire domain could be in AD in six weeks as opposed to six months. We can more accurately define delivery dates and focus more energy on security, policies and other items that need our attention within AD.

Despite all the time spent on the migration, the switch to the in-place upgrade method was one of the better moves our management has ever approved. We've eliminated all the migration-related problems, accelerated our Win2K conversion and reduced the costs and resources tremendously.

Initially, we believed the biggest risks were with the in-place upgrade. Since then, though, we've found that migration has much more risk. The in-place upgrade allows users to continue to authenticate using BDCs even while we upgrade; if things don't go smoothly, we can always roll back using a BDC that we've taken offline. On the other hand, migration risks a great deal of instability and unknowns because of the inconsistent results of using migration tools on a large user database with an inordinate number of groups. Large SID histories would have to be maintained and we would encounter what's sometimes called "token bloat."

From talking to MCS, I've learned that it initially saw most customers considering migration because of the unknowns of the AD upgrade. Now it's seeing around 70 percent of its customers doing in-place upgrades, and I believe that proportion will grow even larger. A migration should only be attempted if there are absolutely overwhelming reasons for doing so. Our original reason for wanting to do a migration was that it's always better to build the "new house" and then move out of the "old house," if you can afford it.

Lessons Learned

We learned so many painful lessons. We ran into many technical snags and unadvertised features that come with any complex product and high-impact project. TechNet will become your close ally; it contains a great deal of useful and timely information online. Many of your technical problems will differ, but I believe these issues have universal application:

  • Before we started, I read recommendations in many places saying that the majority of the time should be devoted to planning in order to minimize problems. We did a great deal of planning up front, but no matter how much you plan, some things won't become obvious until you've tried some of the wrong things first. Everyone's environment is layered with complexities and a unique blend of technologies. No amount of mental preparation is going to protect you from all the snags and pitfalls, so use these as learning opportunities. For example, if you've just upgraded your DHCP server to Win2K and it's no longer working, take a few minutes to learn why before resorting to the back-out plan, even though your pager's going off.
  • Keep moving the project forward. Make dates and deadlines even if they're arbitrary. When you get hung up working on a side-issue you've uncovered, identify it as a separate project or job and move on. Keep the main priority and critical path in mind. The scope of this work will turn over many rocks and reveal some ugly things. There may be delays, but there's always some important step that can be broken into smaller tasks, leaving some parts that can be worked on without delay.
  • From the start, and many times during the project, you need to make sure management realizes the resources that need to be dedicated to this project. Much of this work is behind the curtain; depending on your management, it may think you're not doing much more than running Setup.exe. Sending regular status updates that emphasize the need for dedicated personnel is paramount.
  • To short-circuit the painful, steep learning curve, get access to help from someone who has experience with Win2K upgrades. What worked well for us was to have a consultant/advisor that came in one day a week to help address major concerns. This person kept in contact via e-mail and attended our weekly AD project meeting.
  • DNS. Always suspect DNS first when troubleshooting problems. Double-check the servers you're pointing to for DNS and WINS, and check that the records are being registered correctly. Know the new IPCONFIG switches (see the sample commands after this list).
  • Don't let your group get entangled in the creation of an OU structure. This is a never-ending labyrinth with more political than technical traps. Start simple and justify any additions. Group policies don't have to be applied using OUs; they can be applied centrally at the domain level and filtered with security groups.
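For quick DNS troubleshooting on a Win2K box, the new IPCONFIG switches mentioned above cover most of the basics; a typical sequence might look like this:

    rem Show full adapter configuration, including the DNS servers in use
    ipconfig /all
    rem Dump and then clear the local DNS resolver cache
    ipconfig /displaydns
    ipconfig /flushdns
    rem Re-register this machine's A and PTR records in dynamic DNS
    ipconfig /registerdns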

July 2001: Where We Are Now
The upgrade is a continuing project. We currently have a root place-holder domain in native mode; we're in mixed mode with our Oregon primary MAD. We're rolling out Win2K Pro very quickly now using our standard image and putting these workstations directly into our new tree. We should have the Oregon MAD switched to native mode by the end of August.

It will take more than a year to collapse the NT 4.0 resource domains. Some of the collapsing of resource domains will be done quickly after we switch to native mode, but depending on the application, some servers will remain in NT 4.0 resource domains until the vendor updates the application or we move to a replacement system.

Other regions will be adding new domains to our forest very soon. The adventure continues...

About the Author

Alan Knowles, MCSE, CNE, is a server engineer with a large health care organization, which has branches in the Pacific Northwest and Western U.S.
