In-Depth
Never Again: Your IT Horror Stories
Redmond readers share their worst IT moments.
They go by many names: CLEs (Career
Limiting Events); Murphy Moments;
Blue Screen Memories; RUAs (Resume Updating Actions). What they all have
in common is disaster.
Most IT folks have at least one tale of woe, of that time when their career flashed before their eyes (those in the biz for a long time often have more than one -- sometimes many more). It often begins with the help desk phones lighting up like a Vegas casino. Users can't connect to the network or Internet. Servers aren't talking to each other or to you. Then your mouth goes dry as you realize you haven't tested your backups for -- well, you can't remember for how long. And where is that bootable CD now that you need it?
Chances are you also found a solution, recovered from your error and got things shipshape again. Otherwise, you probably wouldn't be reading this article, because your new job at the local car wash demands your total commitment. You learned a lesson, gained experience and wisdom, and have become a better IT pro as a result.
But wouldn't it be nice to learn those lessons without the near-death experience?
Our new continuing column, called Never Again, aims to do just that. Each month,
we'll present the most compelling story in print, and others will appear online.
If you have a tale of technical terror you'd like to submit for this column,
send in a 300- to 800-word, first-person write-up of your scariest IT moment
on the job to Keith Ward at [email protected].
Now, let the nightmares begin.
Out of Service
By Ron Stewart
I work at an IT services company. Recently, we moved the
servers of a rapidly growing client from their own office to a data center.
We've performed similar server moves several times in the past, and the first
few tasks went off without a hitch. We shut down the servers late on Friday
afternoon, packed them up and had a bonded carrier move them to the data center.
Once there, we racked the servers, reconnected them and booted them.
Our server technician watched the monitor as the first server booted, preparing to log on to each server and perform some basic tests. He waited patiently for the familiar Windows Server logon screen to appear.
After several minutes went by, it became clear that something was very wrong. "Applying computer settings," the screen read -- for more than two hours, before a logon dialog box finally appeared. Logon itself took an hour to complete. When the GUI appeared, it responded extremely slowly, and no network connections were listed.
"The
vendor’s support tech basically threw up his hands, telling
our guys to wipe the servers clean and rebuild them from scratch." |
|
The server and network techs double-checked all connections and settings, verifying
that they were correct. They formed a theory that the servers needed to boot
onto a network that used the IP addresses from the office LAN, with which they
were still configured. The techs reconfigured the network components and restarted
the servers. More than an hour later, as the servers took their sweet time booting
yet again, this theory was thrown overboard.
It was now well past midnight. The team phoned the servers' manufacturer for assistance. Discussion soon focused on how the servers' network cards were configured to function together as a team; the vendor's support tech suggested disabling this so the network cards could operate independently. But after doing this, the problems continued.
At this point, the vendor's support tech basically threw up his hands, telling our guys to wipe the servers clean and rebuild them from scratch.
The exhausted and bleary-eyed server tech looked out of the data center's windows, saw the dull glow of dawn on the horizon, and retained just enough good sense to inform the support tech that no, he wasn't going to do that. He hung up, and our guys called it a night (not that much was left of it). They would return to take another crack at things the next day.
The following afternoon, our CIO called me (I should never leave my cell phone on during weekends). He briefed me on what was going on. "A fresh set of eyes might help," he said. Could I get down to the data center as soon as possible? After making the usual apologies to my long-suffering wife, I went to ground zero.
Progress was slow and frustrating. Each server had numerous issues in addition to the brutally slow boot time: No network connections were listed; the GUI was sluggish; services couldn't be stopped or started.
Because the servers were able to boot into Safe Mode quickly, we figured the cause of the problem must have been one of the non-essential services. So we went about disabling all these services, then booted the servers normally (which now only took the usual couple of minutes) and gradually started only the non-essential services required for each server's functionality.
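For the curious, that first pass amounts to something like the sketch below, leaning on Windows' built-in sc.exe. The CORE_SERVICES allowlist is an invented example (the real list depends entirely on each server's role), and the script is written as a dry run by default; re-enabling a service later is the same command with start= auto.

# A rough sketch of the "disable everything non-essential" pass, using the
# built-in sc.exe. CORE_SERVICES is a made-up allowlist -- tailor it to the
# server's actual role before trusting it.
import subprocess

CORE_SERVICES = {"Dhcp", "Dnscache", "Eventlog", "RpcSs",
                 "lanmanserver", "lanmanworkstation"}
DRY_RUN = True  # flip to False only once you're happy with the list

def run(args):
    return subprocess.run(args, capture_output=True, text=True).stdout

def service_names():
    # "sc query state= all" lists every Win32 service, one SERVICE_NAME line each.
    return [line.split(":", 1)[1].strip()
            for line in run(["sc", "query", "state=", "all"]).splitlines()
            if line.strip().startswith("SERVICE_NAME")]

for name in service_names():
    if name in CORE_SERVICES:
        continue
    if "AUTO_START" in run(["sc", "qc", name]):  # only touch auto-start services
        print("Would disable:" if DRY_RUN else "Disabling:", name)
        if not DRY_RUN:
            run(["sc", "config", name, "start=", "disabled"])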
By midnight, all the servers save one were operational. Everyone else went home, leaving me to work on the last non-functioning computer -- an intranet Web server. As this server had been designated a low priority, we hadn't used Safe Mode to reconfigure its services, and as the hours passed, it had eventually become accessible.
With the pressure now gone, I finally had the time to analyze the services. I went through the list, and spotted the culprit behind our lost weekend. The APC PBE Agent service, after six hours, was "Starting." I disabled that one service, rebooted, and all the problems went away.
I'm pretty sure I screamed.
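In hindsight, a quick scan for services stuck in a start-pending state would have pointed straight at the culprit. Something along these lines (again via sc.exe) is all it takes; the actual disabling that night was done by hand, so the commented command is just a reminder of the fix.

# A rough sketch of the check that would have found the offender: list any
# service sitting in START_PENDING (the "Starting" state we stared at for hours).
import subprocess

out = subprocess.run(["sc", "query", "state=", "all"],
                     capture_output=True, text=True).stdout

current = None
for raw in out.splitlines():
    line = raw.strip()
    if line.startswith("SERVICE_NAME"):
        current = line.split(":", 1)[1].strip()
    elif line.startswith("STATE") and "START_PENDING" in line:
        print("Stuck starting:", current)
        # Once identified, the fix that night was simply:
        #   sc config <that service> start= disabled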
We made some mistakes here. First, the data center had its own huge, shared UPS, so the APC software wasn't needed and should have been removed. Second (we discovered this later), the digital certificate used to sign the APC software had expired just the week before. (To add insult to injury, a Microsoft Knowledge Base article on this very problem appeared the following week, just a few days too late to help us.) And third, we should have performed this analysis several hours earlier, but we'd been too focused on restoring functionality.
Many of the lessons here are specific to this incident, but the two reminders I took away from it are: A) When it comes to technology, no change is simple, no matter how many times you've done it before; and B) You can save time if you take the time to work the problem, rather than letting it work you.
Ron Stewart is a senior technical consultant at Syscom Consulting in Vancouver,
Canada. He has worked in IT for more than 10 years, far too much of it on evenings
and weekends.
That's a Wrap
By Ryan Williams
I'm a consultant, so I've seen a lot of issues in data centers with
my clients. One of the most memorable involved a client that had all their data
center servers go down during some renovations. Imagine the surprise of the
person sent in to check the server room when he found that the remodeling contractors
had shrink-wrapped the racks of servers to keep dust out! The contractors neglected
to mention that they would be doing this, so all the servers were on when they
wrapped them up. Naturally, the servers overheated and shut themselves down.
Luckily, none of the servers were fatally damaged.
The moral of this story: When remodeling your data center, make sure the contractors
are closely supervised.
Ryan Williams has spent more than nine years in the network integration and professional services fields. He has extensive experience implementing and supporting Active Directory, Exchange and collaboration technologies.
Disappearing DNS
By Ernest Franzen
One of my worst experiences was finding out the ramifications of deleting
our main Active Directory-integrated DNS zone.
We had to move one of our domain controllers to a new IP subnet, so I changed the IP address of the DC and rebooted. After the reboot, everything looked good -- except for DNS, which had a big red "X" through the zone.
So, knowing that DNS is replicated from the other DCs, I deleted the zone and created a new one with the same name -- my thinking was that it would populate within a few minutes from one of the other DCs.
Instead, the phone started ringing with users having all types of connectivity
problems: Web pages wouldn't load; e-mail was down; file and print services
were down. The problem was affecting the whole corporation.
"After
the reboot, everything looked good -- except for DNS, which had
a big red 'X' through the zone." |
|
Things got louder when a support tech came in while we were starting to troubleshoot
the problem. "You did what?!" he screamed. "You can't do that!
DNS is integrated within AD; that's why it's called an Active Directory-integrated
DNS zone!" That explained what was happening. By deleting DNS at the remote
site, it deleted DNS from all the sites. So when I recreated the zone, it replaced
our existing 15,000 records with a new zone -- a zone containing only the DNS
record of the DC and the file and print server at the remote site.
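The galling part is how cheap the sanity check would have been. A sketch along these lines, shelling out to the built-in dnscmd tool, would have flagged the zone as directory-integrated before I touched it; the zone name here is a placeholder, and the output parsing is approximate, since the /ZoneInfo field layout varies between Windows Server versions.

# A rough sanity check before deleting a zone: bail out if dnscmd reports it
# as directory-integrated. ZONE is a placeholder, and the "DS integrated"
# field match is approximate.
import subprocess
import sys

ZONE = "corp.example.com"  # placeholder -- not our real zone name

info = subprocess.run(["dnscmd", "/ZoneInfo", ZONE],
                      capture_output=True, text=True).stdout

for line in info.splitlines():
    if "DS integrated" in line and line.strip().endswith("1"):
        sys.exit(ZONE + " is AD-integrated; deleting it here deletes it everywhere.")

print(ZONE + " appears to be a standard (file-backed) zone on this server.")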
Luckily, we had a tape backup from another DC and were able to perform an authoritative restore and get back most of the original DNS records. Several records were still missing, though, and had to be recreated manually (let's just say that it was a very long night).
Since that experience, I've had another problem with DNS corruption on a single
DC that required a call to Microsoft support. I was dismayed during the troubleshooting
process when the technician told me to "delete the zone." Needless to say, I
argued against this course of action -- this was one lesson I learned the hard
way.
Ernest Franzen is a senior network architect for a Fortune 500 company.
He holds MCSA and MCSE certifications.
[Redmond magazine wishes to thank Thomas Haines and AOPA Pilot
magazine for allowing us to use the title of this column without getting bent
out of shape. - Ed.]
About the Author
Keith Ward is the editor in chief of Virtualization & Cloud Review. Follow him on Twitter @VirtReviewKeith.