Hard Drive Fall Down, Go Boom!
Further investigation of a hard drive failure revealed one tiny but important tip that now lets one company's e-mail server run problem-free.
In February 2001, we brought our e-mail hosting
in-house using Exchange 2000. We installed the application on an IBM server
with dual Pentium III 800 MHz processors, 1GB RAM and four 10,000 RPM
36GB hard drives in a RAID 5 configuration. This server was more powerful
than we needed, but we wanted room for growth.
Two months later we started having some problems with our new e-mail
server. For no apparent reason it would stop responding: The monitor would
go black, and no combination of keystrokes would bring it back. The only
way we could bring the server back was to do a hard reboot. The first
time the server stopped responding, we chalked it up to a Windows 2000
glitch. However, when the server started crashing on a regular basis,
our level of concern rose sharply.
At the time we had just 125 e-mail users, so the hardware was more than
sufficient to handle the traffic. That wasn’t the problem. The server
wasn’t going into sleep mode, so that wasn’t the cause either. The event
logs were clear (we were logging not just Windows events but also Exchange
events). As for performance monitoring, all the commonly used counters
appeared to be well within acceptable ranges.
One Saturday in April our e-mail world collapsed. I got a call from the
senior IT director at around 10 a.m. He was attempting to use Outlook
Web Access from home and it wouldn’t respond. He decided to go into the
office and check out the server. He noticed the e-mail server wasn’t responding,
so he did a hard reboot. During the boot process, he received a horrible
message: Inaccessible Boot Device.
I ran over to the office. We tried another hard reboot, with no luck.
I immediately got on the phone with IBM support. Since the drives were
in a RAID 5 configuration, which tolerates the loss of a single drive,
we should have been able to get the server back up. We were able to
determine which of the hard drives was the problem. However, the IBM
technician determined that the parity stripe had become corrupt, so the
array could not be rebuilt. The only thing we could do was replace the drive,
reinstall the OS, reinstall Exchange and restore from backups. Since we had 24x7x4-hour
support, a new hard drive was in my hands in four hours. By about 4 a.m.
Sunday the server was back up, and all key employees were notified
by voice mail of the problem and told they might be missing some mail.
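To illustrate why a corrupt parity stripe rules out a simple rebuild, here is a minimal Python sketch of RAID 5-style XOR parity. The block contents and the xor_blocks helper are made up for illustration; a real controller does this on disk stripes in hardware or firmware, not on Python byte strings.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks of one stripe (hypothetical contents), as on a
# four-drive array where the fourth block holds the parity.
data = [b"MAIL", b"BOX1", b"DATA"]
parity = xor_blocks(data)

# Lose the second drive: XOR of the surviving blocks and the parity
# restores the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

# If the parity itself is corrupt, the same rebuild returns wrong data
# without any error, so the array cannot be trusted.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
rebuilt_from_bad = xor_blocks([data[0], data[2], bad_parity])
```

With good parity, the lost block comes back intact; with damaged parity, the rebuild produces garbage, which is why a reinstall and a restore from backup were the only option.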
The server was back up, but we still had no explanation as to why the
crash occurred. We needed an answer and needed it fast, in case the problem
occurred again. We were convinced it was a hardware issue, so we continued
to work with IBM support. Finally, an extremely bright IBM technician
made a discovery. Evidently, a batch of hard drives had been shipped with
bad microcode. We downloaded a tool from IBM to examine the microcode
on our hard drives, and the three “old” hard drives in the e-mail server
had the bad microcode (the new hard drive was fine). We updated the microcode
on these three hard drives, and our e-mail server has now been running
continuously for more than a year without any problems.
Christopher M. Roscoe, MCSA, CIW, is the senior network administrator
at National Packaging Solutions Group, a manufacturer of corrugated boxes.