The Process That Wouldn't Die
This IT pro should have known better than to reboot the Primary Domain Controller on Friday the 13th. Read on about his special nightmare.
As the IT manager for a 30-person architectural firm, I’m where the buck stops when it comes to network and server uptime. After a routine reboot of our network Primary Domain Controller (PDC) at the end of the workweek, the system was completely unresponsive. The ensuing troubleshooting session turned into an all-nighter that included two long waits for tech support, five hours of phone time with two different technicians, 10 reboots and a wild runaway process—wrapping up with a successful conclusion as the sun came up.
The heart of our architectural business is CAD drawings. I invest a great percentage of my time in protecting these resources. Our two-server network might be simple in topology, but it’s critical to our survival and meeting goals in a deadline-heavy industry.
We have two Dell PowerEdge servers running NT 4.0 that provide network, file, print, backup, VPN, antivirus and e-mail services.
Every server needs an occasional reboot. The question is, “Which timeframe creates the least disruption and downtime?” I’ve found—through trial and error—that the most effective time for this in my firm is Friday evening. This particular Friday, which I later noted was the 13th, I planned to reboot the PDC and shut down the Backup Domain Controller (BDC) in order to perform maintenance on its Uninterruptible Power Supply (UPS) unit. Even allowing a few extra moments to circulate around the building and get my co-workers off the network, this should have been a one-hour job.
The server in this tale is our PDC that also does duty as the CAD file server
and network print server. This Dell 1400SC with a PowerEdge RAID Controller
3/SC (PERC3/SC) card, 1GHz single processor, 512MB of RAM and DDS4 tape unit has
been a reliable workhorse since we brought it online in April 2002. Besides
these functions, the machine also has Norton AntiVirus Server installed,
which sends antivirus definition files out to workstations around the network.
Something Definitely Isn’t Right
I restarted the PDC at 6 p.m., then directed my attention to replacing the
battery in the UPS that protects the BDC. That task completed, I checked
back on the status of the PDC. Although the logon box appeared shortly and
accepted my password, it seemed to take forever to bring up the desktop.
I watched as the desktop appeared, icon-by-icon, then watched with growing
concern as the items in the system tray disappeared one after another. Although
the mouse pointer moved around the screen normally, double-clicking an item
had no effect. The icon wouldn’t even darken to indicate it had been selected.
Get Past the Denial
At times like these, I have to take a moment to get into the proper frame
of mind. The first thing is to get past the moment of denial. Even my optimistic
self could tell that this wasn’t a good way to start the weekend. OK, take
a deep breath—so my Friday evening was in jeopardy. I could deal with that.
Better not think about what could happen to Saturday—yet.
I have a few rules of thumb for dealing with server issues. First, do nothing
to make the problem worse without a clear understanding of the ramifications.
Second, read all the available material before jumping in to try
a rescue procedure. Third, swallow your pride and get as much help as you
have available or can afford.
Time to troubleshoot. Sitting at the PDC console, I tried to bring up the Event Viewer. I watched the hourglass for 10 minutes. Nothing. The three-finger salute (Ctrl-Alt-Del) brought up the logon security box, and I excruciatingly tabbed over to the Task Manager button. Ten more minutes of watching the box struggle convinced me it was time for the reset button. Unfortunately, this would be only the first of many hot restarts that evening.
I repeated the whole procedure a few times, trying to get into Task Manager
or Event Viewer for a clue to the source of the problem. Within a minute
or so, the computer was unresponsive and I had to reset it again. It’s tough
to diagnose a problem when you can’t get to your normal troubleshooting
tools. Because we normally purchase warranties and phone support with our
servers, I decided to start taking advantage of Dell’s expertise.
I waited much longer than usual in Dell’s server support queue, but because
I’ve had nothing but top-rate help and service from Dell over the years,
I waited cheerfully. Eventually, a support engineer named Jennifer verified
my identity and the server service tag and got my initial description.
Because we purchased our server sans operating system, Dell’s support (in this case) extended only to hardware issues. Of course, many situations are a blend of hardware/software issues, and each technician has to make the call on how far he or she will extend support. In this case, we carried out a complete series of diagnostic tests to verify the status of the RAID card, onboard controller slot and each of the three 18GB SCSI hard drives.
Unplugging the SCSI cable from the RAID card and plugging it directly into the onboard slot didn’t cure anything, proving that the card wasn’t the problem. The onboard diagnostics began painstakingly scanning each hard drive for errors. Jennifer showed Job-like patience as I read back the results of each scan.
Hard drives one and two were fine, but drive three had a few errors that required moving a couple of sectors’ worth of information. This wasn’t very surprising, even on a normally functioning machine. The next step was to download another drive-testing tool and run a second round of tests. At this point, I began to appreciate that I could access the Internet conveniently from the other server. With the ADSL connection to myself, I hit Dell’s download site and was soon running the prescribed disk tests. This particular tool ran quickly (the only thing that did that evening) and shortly gave all three disks a clean bill of health.
This was good news for Jennifer and bad news for me. We’d established that the motherboard, RAID card, SCSI cable and hard-drive array were all functioning perfectly. I’d been on the phone for 2 1/2 hours at this point. We discussed my options: 1) booting to the Last Known Good Configuration, 2) initiating the Emergency Repair process, or 3) as a final resort, reinstalling the operating system and server applications. Foremost in my mind were the gigabytes of CAD files on the data partition. The full backup scheduled for 1 a.m. Saturday had not, of course, had a chance to run. This meant the potential for losing all of Friday’s changes.
Earning the IT Paycheck
As there are a few specific steps for reloading the PERC drivers on a Dell
system, I took some careful notes as Jennifer spoke. Somehow, my mind kept
telling me it wouldn’t get to option two or three. I made my last call home
for the evening. I told my wife that Dell had done all it could for me,
and it was now time for some lonely hours at the server. Or, as folks around
the office remind me, time to earn my IT paycheck.
Restarting the server gave me a window of approximately 30 seconds in which to try to get a peek into Task Manager. When I managed to get the processor utilization monitor on the screen, it was pegged at 100 percent and never wavered. That told me it was a process or service using all the server’s resources—possibly multiple processes. It was a bit daunting to look at the long list of services running and realize that any one of them could be the culprit.
Sometime early Saturday, I decided I wasn’t getting anywhere and it was
time to proceed with repairs using my Emergency Repair Disk (ERD) and the
NT 4.0 CD. In the process of reviewing the repair procedure, I realized
I wasn’t completely sure which model RAID card I had in the machine. I thought
I knew everything about that server, as I did the initial installation and
setup. Suddenly, though, I wanted to be very sure about the details. I browsed
the www.premiersupport.dell.com site for the latest driver downloads for
a PowerEdge 1400 and was surprised to see that that model was sold with
at least three different brands of RAID controller. According to my files,
the card was a PERC3/SC, but it wasn’t obvious to me which brand that was.
Remembering one of my guiding principles, I decided not to proceed until
I was sure I wouldn’t make a bigger mess out of the situation. That’s the
IT version of the carpenter’s motto: “Measure twice, cut once.”
I picked up the phone and waited what seemed considerably longer this time
for a live technician. At length, I had Chris on the line and he reviewed
the notes from my call with Jennifer. One thing I’ve learned about calling
tech-support centers: You’ll generally get a different tech each time, and
each support pro has his or her own strengths. I’ve had several experiences
in which the second tech had a different insight or could suggest additional
options to try. The moral: Be persistent and glean what you can from each
one’s experience. Chris held on through a few reboot/hang sequences and
suggested rebooting NT into video graphics array (VGA) mode. By loading
the OS with less demand on resources, I might have a longer time window
in which to kill off processes. Wishing me luck, Chris turned me loose and
grabbed the next caller from what he said was a very backed-up queue.
The Scientific Method
By now, I’d lost all track of time. The tantalizing prospect that I might
be close to discovering the root of the issue spurred me on, and the adrenaline
rush kept me alert. The plan was to use either the command prompt or Services
applet to systematically disable all noncritical server functions until
the runaway processor usage returned to normal.
Within a few minutes, I could see that this procedure was helping. By disabling two or three services during each window of opportunity, the server had less to deal with during startup. It took another hour or so of watching the server moving in slow motion; but finally, the server rebooted normally, brought up the desktop, and the processor utilization counter hovered at 2 percent. What a moment of incredible relief! I hadn’t wanted to entertain the thought of rebuilding a production PDC in the middle of the night—much less watch the backup unit restore gigabytes of important company files for hours on end.
Despite the relief, I knew I still had a lot of work to do to preserve Friday’s data and isolate the renegade process. I was confident that, by selectively enabling services, I could discover which service startup hammered the processor.
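The one-at-a-time approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run that night: the service names, the `fake_start` and `fake_cpu` helpers, and the 90 percent threshold are all hypothetical stand-ins. On a real server, the start step would be something like `net start` and the reading would come from Task Manager or Performance Monitor.

```python
from typing import Callable, Iterable, Optional


def find_runaway_service(
    services: Iterable[str],
    start_service: Callable[[str], None],
    cpu_percent: Callable[[], float],
    threshold: float = 90.0,  # assumed cutoff for "pegged"
) -> Optional[str]:
    """Start services one at a time; return the first one that pegs the CPU."""
    for name in services:
        start_service(name)
        # Check the processor right after each start, before touching the next.
        if cpu_percent() >= threshold:
            return name
    return None


# Simulated server, for illustration only (all names are hypothetical).
running = []


def fake_start(name: str) -> None:
    running.append(name)


def fake_cpu() -> float:
    # Pretend the antivirus service is the one that runs away.
    return 100.0 if "Antivirus Server" in running else 2.0


candidates = ["Backup Engine", "UPS Monitor", "Antivirus Server", "Spooler"]
culprit = find_runaway_service(candidates, fake_start, fake_cpu)
print(culprit)
```

Stopping at the first service that drives the counter up keeps the later services from muddying the reading. The trade-off is that a second misbehaving service would only show up on another pass.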
With the server again functional, I started up the Veritas Backup Exec services
one by one, watching the processor counter like a hawk. Each service started
normally, so I put in a backup tape and started the normal “Full” backup
job. I wanted to make sure we didn’t lose the work that 30-plus folks had
put in on Friday. During the next hour or two, as the backup completed,
I spent the downtime researching the many error messages in the System and
Application event logs. There were errors from the UPS software, from Norton
AntiVirus Server, from the Netlogon service and the tape-drive SCSI card.
I’m sure the multiple hard resets and the services that couldn’t load accounted
for the majority of the stop events.
Hunting Down the Culprit
With the server safely backed up, I began the laborious process of pinning
down which service promptly went haywire when started. It took only a few
minutes to reveal that the Norton AntiVirus Server software was corrupted.
I was delighted to have the problem isolated and even happier that the repair
path was obviously a reinstallation. Naturally, the uninstall process from
Add/Remove Programs didn’t cooperate, so I had to completely remove all traces
of NAV from the machine, reboot, then carry out a clean install per instructions
in the Norton Knowledge Base. The server rebooted normally, and I carried
out a precautionary full scan to see if perhaps a virus had made its way
through the defenses and had sabotaged the antivirus software itself. The
scan came back clean.
Each of the workstations dependent upon this server for downloaded antivirus definitions had to be checked to verify the server-client relationship. I added the Windows 2000 clients from the server and hooked the Windows 98 clients back up from the workstation side. With all the tidying up finished, I felt ready to call it a night’s work. As I left the office and set the alarm system, I noticed it was exactly 8 a.m.
I still don’t know why the antivirus service stuck on full throttle. It’s
evidently not a common problem, as I found no mention of it on Symantec’s
support site. It goes down in the “unexplained” category, yet the general
issue of a runaway service is common in the server world.
Prepare a folder for each server with the service tag, license codes and
other documentation. The middle of the night isn’t the time to be wondering
exactly what hardware you have and which drivers are safe to load.
Document BIOS and firmware versions. You won’t always be lucky enough to
have full Internet access to be able to pull this information from a manufacturer’s
Web site.
Document the brand of RAID card and its drivers. If you don’t do this, you
may not be able to even access your hard disks in an emergency rebuild.
Make periodic System State backups and keep your ERDs up-to-date. A stale
ERD or System State backup will only complicate life by adding unknown factors
to a recovery effort. You’ll actually be reverting to an earlier configuration
of your operating system.
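For anyone who prefers a file to a paper folder, the checklist above could be kept as a small structured record. The sketch below is one possible layout, not a standard: the class and field names are invented for illustration, and the only value filled in is the PERC3/SC model mentioned earlier. Everything else is left blank so the gaps are easy to see.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class ServerRecord:
    """One server's recovery-folder contents (field names are illustrative)."""
    name: str
    service_tag: str = ""
    raid_controller: str = ""
    raid_driver_version: str = ""
    bios_version: str = ""
    firmware_versions: Dict[str, str] = field(default_factory=dict)
    license_codes: List[str] = field(default_factory=list)
    last_erd_update: str = ""  # date the Emergency Repair Disk was last refreshed

    def missing_items(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Only the RAID model is known here; the rest still needs to be documented.
pdc = ServerRecord(name="PDC", raid_controller="PERC3/SC")
print(pdc.missing_items())
```

Running `missing_items()` before a planned reboot gives a quick list of what still has to be looked up while the server is healthy.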