Hard Drive Fall Down, Go Boom!

Further investigation of a hard drive failure revealed one tiny but important tip that now lets one company's e-mail server run problem-free.

In February 2001, we brought our e-mail hosting in-house using Exchange 2000. We installed the application on an IBM server with dual Pentium III 800 MHz processors, 1GB of RAM and four 10,000 RPM 36GB hard drives in a RAID 5 configuration. This server was more powerful than we needed, but we wanted room for growth.

Two months later we started having some problems with our new e-mail server. For no apparent reason it would stop responding: The monitor would go black, and no combination of keystrokes would bring it back. The only way we could bring the server back was to do a hard reboot. The first time the server stopped responding, we chalked it up to a Windows 2000 glitch. However, when the server started crashing on a regular basis, the level of concern increased exponentially.

At the time we had just 125 e-mail users, so the hardware was more than sufficient to handle the traffic. That wasn’t the problem. The server wasn’t going into sleep mode, so that wasn’t the cause either. The event logs were clear (we were logging not just Windows events but also Exchange events). As for performance monitoring, all the commonly used counters appeared to be well within acceptable ranges.

One Saturday in April our e-mail world collapsed. I got a call from the senior IT director at around 10 a.m. He was attempting to use Outlook Web Access from home and it wouldn’t respond. He decided to go into the office and check out the server. He noticed the e-mail server wasn’t responding, so he did a hard reboot. During the boot process, he received a horrible message: Inaccessible Boot Device.

I ran over to the office. We tried another hard reboot, with no luck. I immediately got on the phone with IBM support. Since the drives were in a RAID 5 configuration, we should have been able to get the server back up after losing a single drive. We were able to determine which of the hard drives was the problem. However, the IBM technician determined that the parity stripe had become corrupt. Thus, the only thing we could do was replace the drive, reinstall the OS, reinstall Exchange and restore from backups. Since we had 24x7x4-hour support, a new hard drive was in my hands in four hours. By about 4 a.m. Sunday the server was back up, and all key employees were notified by voice mail of the problem and told they might be missing some mail.
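For readers wondering why a corrupt parity stripe was fatal, here is a minimal sketch of the idea behind RAID 5 parity. It is a simplification and not how the IBM controller is implemented: the block contents and the helper name `xor_blocks` are made up for illustration, and a real array rotates the parity block among the drives from stripe to stripe. The parity block is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt from the rest, but only if the parity itself is good.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte (hypothetical helper)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a four-drive array: three data blocks plus one parity block.
# The contents are invented for this illustration.
data = [b"MAIL", b"BOX1", b"DATA"]
parity = xor_blocks(data)

# The drive holding the second block fails. XOR the surviving data blocks
# with the parity block to rebuild what was lost.
survivors = [data[0], data[2]]
rebuilt = xor_blocks(survivors + [parity])

# Now suppose the parity block was itself corrupted (first byte flipped).
# The same rebuild runs without complaint but returns the wrong data.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
rebuilt_bad = xor_blocks(survivors + [bad_parity])

print(rebuilt == data[1])      # healthy parity recovers the block
print(rebuilt_bad == data[1])  # corrupt parity does not
```

With healthy parity the lost block comes back intact. With damaged parity the array has nothing trustworthy to rebuild from, which is why we were left with a reinstall and a restore from backups.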

The server was back up, but we still had no explanation as to why the crash occurred. We needed an answer and needed it fast, in case the problem occurred again. We felt it was absolutely a hardware issue, so we continued to work with IBM support. Finally, an extremely bright IBM technician made a discovery. Evidently, a batch of hard drives had been sent out with bad microcode. We downloaded a tool from IBM to examine the microcode on our hard drives and found that the three “old” hard drives in the e-mail server had bad microcode (the new hard drive was fine). We updated the microcode on these three hard drives, and our e-mail server has now been running continuously for more than a year without any problems.

About the Author

Christopher M. Roscoe, MCSA, CIW, is the senior network administrator at National Packaging Solutions Group, a manufacturer of corrugated boxes.

Reader Comments:

Mon, Oct 14, 2002 Chuck KY

There is a known issue concerning bad microcodes on certain models of IBM SCSI drives. I had a similar problem occur with a customer. The white-box server w/IBM drives could not find two out of three drives on a reboot. The SCSI disk controller was an Adaptec PCI card. Adaptec support notified me of the problem w/IBM drives: They will erroneously report as failed to the SCSI controller or sometimes fall out of the array on a hard boot. IBM does not guarantee your data when you use its utility to update the microcodes. Scary!

Tue, Sep 24, 2002 Anonymous Anonymous

Yes, as long as you use hardware implemented RAID, the controller does all of the work and the computer thinks it only has one drive in the system.

Sun, Sep 22, 2002 Anonymous Anonymous

Déjà vu on the overnighter... good story!

Fri, Sep 20, 2002 Anonymous Anonymous

To Scott Hermans, if you use a hardware based RAID 5 array as opposed to a software based RAID 5 array, you can install the OS or anything else for that matter on the array.

Thu, Sep 19, 2002 Scott Hermans Springfield, MA

The author speaks of 'inaccessible boot device' and a bad RAID 5 array. Throughout 9 years in IT, I've never known it possible to install the OS on RAID5. The only RAID solution for the OS is RAID1 which of course has no parity stripe. The solution here is not logical to me.

Thu, Sep 19, 2002 Jibran Khizar Islamabad (Pakistan)

Well, I faced the same problem several times (RAID is not configured). About the blue screen message: I made a few changes in the BIOS (changed the FDD properties and loaded the default settings), and that helped me. This article gave me a new idea (if you have configured RAID, at least).

Wed, Sep 18, 2002 Anonymous Anonymous

very informative

Wed, Sep 18, 2002 Alexandre Paiva Deerfield Beach, FL

What if I have the same problem? How can I fix it on my own without IBM support? Is this a specific IBM hardware problem? For this specific model? Thanks in advance, bye.

Wed, Sep 18, 2002 Anonymous Anonymous

Very helpful.
