Hard Drive Fall Down, Go Boom! -- Redmondmag.com

Hard Drive Fall Down, Go Boom!

Further investigation of a hard drive failure revealed one tiny but important tip that now lets one company's e-mail server run problem-free.

By Christopher M. Roscoe
10/01/2002

In February 2001, we brought our e-mail hosting in-house using Exchange 2000. We installed the application on an IBM server with dual 800MHz Pentium III processors, 1GB of RAM and four 10,000 RPM 36GB hard drives in a RAID 5 configuration. This server was more powerful than we needed, but we wanted room for growth.

Two months later we started having some problems with our new e-mail server. For no apparent reason it would stop responding: The monitor would go black, and no combination of keystrokes would bring it back. The only way we could bring the server back was to do a hard reboot. The first time the server stopped responding, we chalked it up to a Windows 2000 glitch. However, when the server started crashing on a regular basis, the level of concern increased exponentially.

At the time we had just 125 e-mail users, so the hardware was more than sufficient to handle the traffic. That wasn’t the problem. The server wasn’t going into sleep mode, so that wasn’t the cause either. The event logs were clear (we were logging not just Windows events but also Exchange events). As for performance monitoring, all the commonly used counters appeared to be well within acceptable ranges.


One Saturday in April our e-mail world collapsed. I got a call from the senior IT director at around 10 a.m. He was attempting to use Outlook Web Access from home and it wouldn’t respond. He decided to go into the office and check out the server. He noticed the e-mail server wasn’t responding, so he did a hard reboot. During the boot process, he received a horrible message: Inaccessible Boot Device.

I ran over to the office. We tried another hard reboot, with no luck. I immediately got on the phone with IBM support. Since the drives were in a RAID 5 configuration, we should have been able to get the server back up, and we were able to determine which of the hard drives was the problem. However, the IBM technician determined that the parity stripe had become corrupt. Thus, the only thing we could do was replace the drive, reinstall the OS, reinstall Exchange and restore from backups. Since we had 24x7x4-hour support, a new hard drive was in my hands in four hours. By about 4 a.m. Sunday the server was back up, and all key employees were notified by voice mail of the problem and told they might be missing some mail.
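To see why corrupt parity ruled out a normal RAID 5 rebuild, here is a minimal Python sketch of how parity works in principle. It is a simplified illustration, not how our IBM controller actually implements it: the block contents and the `xor_blocks` helper are made up for the example, and a real array rotates parity across all the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a four-drive array: three data blocks plus one parity block.
# The block contents are invented for illustration.
data = [b"MAIL", b"BOX1", b"LOGS"]
parity = xor_blocks(data)

# The drive holding data[1] fails: rebuild its block by XORing the
# surviving data blocks with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])

# If the parity block is itself corrupt, the same rebuild yields wrong data.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]
rebuilt_bad = xor_blocks([data[0], data[2], bad_parity])

print(rebuilt == data[1])      # the healthy-parity rebuild matches the lost block
print(rebuilt_bad == data[1])  # the corrupt-parity rebuild does not
```

With good parity, any single lost block can be recomputed from the others. With bad parity, the array has nothing trustworthy to rebuild from, which is why we ended up restoring from backups.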

The server was back up, but we still had no explanation as to why the crash occurred. We needed an answer and needed it fast, in case the problem occurred again. We felt it was absolutely a hardware issue, so we continued to work with IBM support. Finally, an extremely bright IBM technician made a discovery. Evidently, a batch of hard drives had been sent out with bad microcode. We downloaded a tool from IBM to examine the microcode on our hard drives, and the three “old” hard drives in the e-mail server had bad microcode (the new hard drive was fine). We updated the microcode on these three hard drives, and our e-mail server has now been running continuously for over a year without any problems.

About the Author

Christopher M. Roscoe, MCSA, CIW, is the senior network administrator at National Packaging Solutions Group, a manufacturer of corrugated boxes.
