If It Can Go Wrong, It Will: A Migration Case Study
A problem-by-problem breakdown of why one company's three-year Unix-to-Windows migration went so horribly, horribly wrong.
Experience is a costly teacher -- if you can learn from the mistakes of others, instead of at your own expense, you should take advantage of the opportunity. To that end, what follows is the story of one organization's recent migration and the problems they encountered. Being an image-conscious organization, they are not identified in this article, but none of the details of their experience have been changed.
Background
In late 2002, the organization had approximately 2,000 users and was still heavily reliant on legacy mainframe systems running Unix. The majority of the desktops were running Windows-based operating systems, with a handful of Macs isolated to a few departments and a smattering of Linux systems confined to Web-based operations.
After a study of several alternatives, they decided to migrate from the Unix platform and port everything to Windows Server 2003. Realizing the load the IT staff was already under, they decided to hire a number of new IT employees on short-term contracts to carry out this operation, and to allot a full year to get the task done.
Problem #1: Bigger Than Budgeted
The project was far larger than anticipated. Migrating from one distribution of Unix to another is one thing (and something they had done in years past), but migrating between two operating systems this different is more a rewrite-and-port effort than anything else. A failure to appreciate the depth of the changes tainted the undertaking from the very beginning, and the time and budget agreed upon were unrealistically small.
Instead of taking one year, the project took three. Since time and money are closely intertwined, the cost of the conversion also ended up being close to three times the original estimate.
What should have been done: Much more planning was needed. The organization needed to talk to other organizations that had carried out similar projects and come up with a realistic budget and timeframe.
Problem #2: Temporary (and Under-Skilled) Staffing
While it makes sense that an already busy IT staff needed more help to take on this additional undertaking, it does not make sense to bring in new help and hand them the project itself. Given the importance and magnitude of the change, the most senior employees should have been in charge of it. Instead, they continued with their normal operations, and the new employees had the dual responsibility of learning the organization from the ground up and carrying out the change.
Not only that, but the short-term employees were brought in at low wages so as not to upset the full-time IT staff. The low wages deterred anyone truly experienced in carrying out such a mission, and the vast majority of the applicants were recent college graduates with paper skills and no experience.
Knowing that their contracts were for limited time periods, many of those hired devoted much of their time to seeking other employment. This led to turnover and a constant "returning-to-square-one" mentality.
What should have been done: The only time you can bring new help into IT and task them with a job of this magnitude is if those coming in have great experience in the field (migration specialists/consultants, etc.). Even then, they should report to a senior IT staff member who knows the organization and fully appreciates its inner workings. Bringing unspecialized labor in at a low wage to avoid offending anyone is almost a surefire recipe for failure.
Problem #3: No Buy-In
The IT department failed to win buy-in from the rest of the organization before beginning the change. Not only did they fail to win it, they failed to even ask for it. As far as most users knew, IT was just doing something and “screwing things up again.” Because they did not understand what was being done, why it was being done or even how it was being done, the users became irate every time one of their services was affected, regardless of the duration.
The users would then run to their managers and complain that they could not get their work done. The managers would complain to the IT department as well as to their bosses. Upper management had originally approved the change, but as they began to hear complaints from the departments, they began to distance themselves and question where best to place their loyalties.
Approximately halfway through the change, there came a time when the president himself asked that the project be scrapped and the organization just keep what it had. This was a monumental moment, for not only were the users against the project, but now there was little support for it anywhere within the organization.
What should have been done: One of the first steps in any major IT project should involve communication and education. Even if the users don't understand what you are telling them, just telling them what you are going to do before you do it goes a long way toward obtaining their buy-in. Be up front and tell them there are going to be times when they cannot access a particular service, but that it is something they will have to live with because in the end the organization will be much better off. Telling them this up front helps, and reminding them of it as the project goes along helps further: let them know ahead of time when a particular service will be down, when you will be doing testing, etc.
While this is important for users, and reduces the number of complaints they pass up the chain, it is even more important for upper management. Those who agreed to the project need to know its status -- they need to know that they may hear complaints, and that you are on top of the situation and doing all you can to keep the inconvenience to a minimum.
Problem #4: Too Much, Too Soon
It may make sense to convert everything from one operating system to another, but thinking that you have to do it all within the scope of a single project is foolhardy. It can be more time-consuming, but converting one element at a time, and assuring its success before starting the next, can end up being a smoother road to completion than trying to tackle everything at once.
In this case, the organization attempted to convert everything simultaneously because they needed to justify the cost savings. If they had converted 90 percent of the services over in a designated time period, that would mean that 10 percent were still running on the mainframes and thus that cost would still be incurred. By going from 100 percent to 0 percent in one year, as the original plan specified, they could make the numbers look great on paper.
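The all-or-nothing logic behind that plan can be sketched with a few lines of arithmetic. This is only an illustration under a simplifying assumption: the mainframe cost is treated as a single fixed annual amount that disappears only when every service has been moved. The function name and the dollar figure are hypothetical and do not come from the organization's actual budget.

```python
def annual_savings(fraction_migrated, mainframe_cost):
    """Savings on the mainframe bill, assuming the cost is all-or-nothing."""
    if not 0.0 <= fraction_migrated <= 1.0:
        raise ValueError("fraction_migrated must be between 0 and 1")
    # Assumption: any service left behind keeps the whole platform running,
    # so nothing is saved until the migration is complete.
    return mainframe_cost if fraction_migrated == 1.0 else 0.0


MAINFRAME_COST = 1_000_000  # hypothetical annual figure, not from the case study

for fraction in (0.5, 0.9, 1.0):
    saved = annual_savings(fraction, MAINFRAME_COST)
    print(f"{fraction:.0%} migrated -> savings of {saved:,.0f}")
```

Under that assumption, a 90 percent migration shows the same savings on paper as no migration at all, which is why a phased plan was so hard to justify in the budget.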
Looking good on paper is important -- otherwise, you may not win approval -- but the risk of heads rolling when the numbers aren't met needs to be weighed equally. This organization turned out to be very forgiving and understanding: no one lost their job as a result of the migration. At the same time, the credibility of the IT manager was severely compromised, and it will be a long time before he recovers from it or gets another large project funded so easily.
What should have been done: A lab environment should have been created first, and sample migrations of individual services tried time and again until the comfort level was higher. Tying in with Problem #1, above, realism also needed to be added to the mix when it came to dates and budgets.
It is always better to underpromise and overperform than to do the opposite. Telling management that it will take two years to implement something and then doing it within a year and a half is much better than saying it can be done in 16 months and then taking 18.
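To put numbers on that comparison, here is a minimal sketch that expresses each outcome as a percentage of the original promise. The function name is made up for illustration; the month counts are the ones used in this article.

```python
def schedule_variance(promised_months, actual_months):
    """Return the overrun as a fraction of the promise (negative means early)."""
    return (actual_months - promised_months) / promised_months


# Both hypothetical projects finish in exactly 18 months.
underpromised = schedule_variance(promised_months=24, actual_months=18)
overpromised = schedule_variance(promised_months=16, actual_months=18)
# The migration in this case study: one year promised, three years taken.
case_study = schedule_variance(promised_months=12, actual_months=36)

print(f"Promised 24, delivered in 18: {underpromised:+.1%}")
print(f"Promised 16, delivered in 18: {overpromised:+.1%}")
print(f"Promised 12, delivered in 36: {case_study:+.1%}")
```

The first two projects take the same 18 months, yet one is reported as 25 percent ahead of schedule and the other as 12.5 percent behind. By the same measure, the migration described here ran 200 percent over.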
About the Author
Emmett Dulaney is the author of several books on Linux, Unix and certification, including the Security+ Study Guide, Fourth Edition. He can be reached at [email protected].