Never Again

Brother, Can You Spare a Drive?

On the day they call "Black Wednesday," what didn't go wrong?

Back in the mid-1990s, I was working as the help desk coordinator in a 250-person federal government office in Washington, D.C. My team was responsible for managing the network, servers, desktop PCs and applications. Most of the time it was a fairly pleasant working environment, until the day we called "Black Wednesday."

We had several servers dedicated to this office, but the Novell server functioned as the main file store. The system had plenty of memory and processing power, and file storage was handled by a RAID 5 array: six 1GB drives, one of them a hot spare, for 4GB of usable space. All servers were backed up with a Legato tape library system.
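For anyone checking the math, RAID 5 spends one drive's worth of space on parity, so usable capacity is (active drives - 1) times the drive size. Here is a minimal Python sketch of that arithmetic; the function is purely illustrative, not anything we ran back then:

def raid5_usable_gb(total_drives, drive_size_gb, hot_spares=0):
    # RAID 5 usable capacity: (active drives - 1) * drive size,
    # since one drive's worth of space goes to parity.
    active = total_drives - hot_spares
    if active < 3:
        raise ValueError("RAID 5 needs at least three active drives")
    return (active - 1) * drive_size_gb

# Our array: six 1GB drives, one held back as a hot spare -> 4GB usable.
print(raid5_usable_gb(total_drives=6, drive_size_gb=1, hot_spares=1))  # 4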

Here Comes Trouble
Black Wednesday started off innocently enough, but during the lunch hour things began to go very wrong. A drive failed in the RAID array. Then, about five seconds after the first drive failed, a second drive went down. The hot spare had almost no time to spin up and take over for the first failed drive, and the array crashed hard.

We spent the next several hours trying to recover the failed array, with no success. We decided to replace the failed drives with new ones and then restore from tape. We didn't have any spare drives on hand, so we had to purchase new ones.

After some significant hair-pulling, we got credit card authorization to purchase the spare drives. Unfortunately, the drives were not available locally, so they had to be shipped overnight from California. That meant, of course, that the server stayed down all day Thursday while a very nervous help desk staff awaited the arrival of the new drives.

Mail Mix-Up
After what seemed like an eternity, the new drives arrived in the building, but they were delivered to the wrong office. Making matters worse, the person who had signed for them in the other office left for the day with the drives under lock and key. On Friday morning, the drives were finally delivered to our server room and installed in the drive array. We then started restoring the data, a procedure we expected to take a few hours to complete.

This is where the story turns truly disastrous. As we started the tape restore, the library's loader retrieved the first tape in the saveset and proceeded to mount it. Then, to our horror, the machine ate it. Not only was the tape severed, it had been mangled beyond recognition. Since this was a multi-tape saveset, all subsequent tapes were deemed worthless as well. We tried several homemade remedies for fixing the tape, but were unsuccessful.

Eventually, we had to make the hard decision to restore from the previous week's tape set. We brought the server back online late in the weekend with the older data. Thankfully, our government sponsors were understanding enough that no one was fired.


Lesson Learned
We replaced the proprietary RAID array with a more widely used (and more expensive) unit supported by multiple vendors. We also purchased additional spare drives, making sure they did not all come from the same manufacturing lot. To address the tape problem, we placed a disk array between the RAID array and the tape library as a staging area, giving us another layer of redundancy before data ever reached tape.
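To illustrate the idea (and only the idea; the paths, device name and script below are hypothetical, not our actual Legato setup), a disk-to-disk-to-tape flow stages a copy on disk before it ever touches tape, so a chewed tape no longer takes the only copy of a saveset with it:

import shutil
import subprocess
from pathlib import Path

SOURCE = Path("/vol/fileserver")      # hypothetical: data to protect
STAGING = Path("/backup/staging")     # hypothetical: intermediate disk array
TAPE_DEVICE = "/dev/st0"              # hypothetical: first SCSI tape drive

def stage_to_disk():
    # Copy the live data to the staging disk first; this copy alone
    # is enough to restore the server if the tape step fails.
    target = STAGING / "latest"
    if target.exists():
        shutil.rmtree(target)
    shutil.copytree(SOURCE, target)
    return target

def copy_to_tape(staged):
    # Write the staged copy out to tape for offline retention.
    # If the drive eats the tape, the disk copy above still exists.
    subprocess.run(["tar", "-cf", TAPE_DEVICE, str(staged)], check=True)

if __name__ == "__main__":
    copy_to_tape(stage_to_disk())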

We learned some hard lessons by going through this nightmarish experience, and vowed we'd never put ourselves in this position again.

About the Author

Todd Miles has worked in IT for several government agencies.
