Never Again

Brother, Can You Spare a Drive?

On the day they call "Black Wednesday," what didn't go wrong?

Back in the mid-1990s, I was working as the help desk coordinator in a 250-person federal government office in Washington, D.C. My team was responsible for managing the network, servers, desktop PCs and applications. Most of the time it was a fairly pleasant working environment, until the day we called "Black Wednesday."

We had several servers dedicated to this office, but the Novell server functioned as the main file store. The system had plenty of memory and processing power. File storage was handled by a RAID 5 array of six 1GB drives, five active plus a hot spare, giving 4GB of usable space. All servers were backed up with a Legato tape library system.
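
For readers puzzling over the numbers: the 4GB figure works out if one of the six drives served as the hot spare mentioned below, because RAID 5 gives up one drive's worth of capacity to parity. A quick sketch of the arithmetic (the five-plus-one breakdown is an assumption; the original write-up didn't spell it out):

    # RAID 5 usable-capacity arithmetic. Assumes five active drives plus one
    # hot spare; that breakdown is an assumption, not stated in the article.
    def raid5_usable_gb(active_drives: int, drive_size_gb: int) -> int:
        # RAID 5 spends one drive's worth of space on parity.
        return (active_drives - 1) * drive_size_gb

    total_drives = 6
    hot_spares = 1                       # assumed
    active = total_drives - hot_spares   # five drives actually in the array

    print(raid5_usable_gb(active, 1))    # prints 4, matching the 4GB of usable space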

Here Comes Trouble
Black Wednesday started off innocently enough, but during the lunch hour things began to go very wrong. A drive failed in the RAID array. About five seconds later, a second drive went down, giving the hot spare almost no time to spin up and take over for the first failed drive. RAID 5 can tolerate only one failed drive at a time, so with two gone before a rebuild could even begin, the array crashed hard.

We spent the next several hours trying to recover the failed array, with no success. We decided to replace the failed drives and restore from tape, but we didn't have any spares on hand, so we had to purchase new ones.

After some significant hair-pulling, we were able to get credit card authorization to purchase the replacement drives. Unfortunately, they were not available locally, so they had to be shipped overnight from California. This meant, of course, the server remained down all day Thursday while a very nervous help desk staff awaited the arrival of the new drives.

Mail Mix-Up
After what seemed like an eternity, the new drives arrived in the building, but they were sent to the wrong office. Making matters worse, the person who had signed for them in the other office left for the day with the drives under lock and key. On Friday morning, the drives were finally delivered to our server room and installed in the array. We then started restoring the data, a procedure we expected to take a few hours.

This is where the story turns truly disastrous. As we started the restore, the tape library retrieved the first tape in the saveset and proceeded to mount it. Then, to our horror, the drive ate it. Not only was the tape severed, it had been mangled beyond recognition. Because this was a multi-tape saveset, all of the subsequent tapes were worthless as well. We tried several homemade remedies for fixing the tape, but were unsuccessful.

Eventually, we had to make the hard decision to fall back to the previous week's tape set. We brought the server back online late that weekend with week-old data. Thankfully, our government sponsors were understanding enough that no one was fired.

What's Your Worst IT Nightmare?
Write up your story in 300-600 words and e-mail it to Editor Ed Scannell at escannell@redmondmag.com. Use "Never Again" as the subject line and be sure to include your contact information for story verification.

Lesson Learned
We replaced the proprietary RAID array with a more popular -- and more expensive -- unit supported by multiple vendors. We also purchased additional spare drives, making sure they did not all come from the same lot. To shore up the tape backups, we placed a disk array between the file server and the tape library, staging backups to disk before they were written to tape. This gave us another layer of redundancy.
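
The disk layer acts as a staging area: backups land on disk first and are then copied off to tape, so a restore can come from the disk copy even if a tape turns out to be damaged. A minimal sketch of the idea follows; this is not our actual Legato/Novell configuration, and every path and device name in it is hypothetical.

    # Disk-to-disk-to-tape staging, sketched in Python. Illustrative only;
    # all paths and the tape device below are hypothetical.
    import shutil
    import subprocess
    from pathlib import Path

    SOURCE = Path("/vol/fileserver")        # hypothetical: data on the primary array
    STAGING = Path("/vol/backup_staging")   # hypothetical: the intermediate disk array

    # Stage 1: copy the live data to the staging disk; this copy is restorable on its own.
    shutil.copytree(SOURCE, STAGING / "fileserver", dirs_exist_ok=True)

    # Stage 2: write the staged copy to tape, so a damaged tape still leaves the disk copy intact.
    subprocess.run(["tar", "-cf", "/dev/st0", str(STAGING / "fileserver")], check=True)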

We learned some hard lessons by going through this nightmarish experience, and vowed we'd never put ourselves in this position again.

About the Author

Todd Miles has worked in IT for several government agencies.


Reader Comments:

Thu, Apr 10, 2008 rippleyaliens tempest

Tuomoks, you sound like you are a master, but in reality you remind me of the many consultants that walk in and say, "I have a better plan." You are talking bad about a situation that happened over 10 years ago? How many servers have you rebuilt -- 1, 10, 100? And are you saying that every one was 100% restored? Yeah, right. Even with the latest state-of-the-art system today, you will lose something, unless you are spending $30K to protect one server. Amateur is what you sound like.

Thu, Mar 27, 2008 tuomoks Seattle

The first question that comes to my mind: who designed the reliability, backup and restore/recovery? It sounds a lot like a vendor or consulting company solution. I feel for you; I've seen it and been there, but I've also designed "failsafe" systems for a long, long time. Too many companies depend on (recommended) technology alone, even today, not understanding that many things can go wrong at the same time or in sequence; over the long run, the system always does. The problem is that the solution is not so much technical, or even money, but (internal) politics -- and you know how that works.

