Never Again

Brother, Can You Spare a Drive?

On the day they call "Black Wednesday," what didn't go wrong?

Back in the mid-1990s, I was working as the help desk coordinator in a 250-person federal government office in Washington, D.C. My team was responsible for managing the network, servers, desktop PCs and applications. Most of the time it was a fairly pleasant working environment, until the day we called "Black Wednesday."

We had several servers dedicated to this office, but the Novell server functioned as the main file store. The system had plenty of memory and processing power. File storage was handled by a RAID 5 array of six 1GB drives, five active plus a hot spare, giving 4GB of usable space. All servers were backed up with a Legato tape library system.
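
For readers puzzling over the numbers: the 4GB figure works out if one of the six drives served as the hot spare mentioned below, because RAID 5 gives up one drive's worth of capacity to parity. A quick sketch of the arithmetic (the five-plus-one breakdown is an assumption; the original write-up didn't spell it out):

    # RAID 5 usable-capacity arithmetic. Assumes five active drives plus one
    # hot spare; that breakdown is an assumption, not stated in the article.
    def raid5_usable_gb(active_drives: int, drive_size_gb: int) -> int:
        # RAID 5 spends one drive's worth of space on parity.
        return (active_drives - 1) * drive_size_gb

    total_drives = 6
    hot_spares = 1                       # assumed
    active = total_drives - hot_spares   # five drives actually in the array

    print(raid5_usable_gb(active, 1))    # prints 4, matching the 4GB of usable space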

Here Comes Trouble
Black Wednesday started off innocently enough, but during the lunch hour things began to go very wrong. A drive failed in the RAID array. About five seconds later, a second drive went down, giving the hot spare almost no time to spin up and take over for the first failed drive. RAID 5 can tolerate only one failed drive at a time, so with two gone before a rebuild could even begin, the array crashed hard.

We spent the next several hours trying to recover the failed array, with no success. We decided to replace the failed drives and restore from tape, but we didn't have any spares on hand, so we had to purchase new ones.

After some significant hair-pulling, we were able to get credit card authorization to purchase the replacement drives. Unfortunately, they were not available locally, so they had to be shipped overnight from California. This meant, of course, the server remained down all day Thursday while a very nervous help desk staff awaited the arrival of the new drives.

Mail Mix-Up
After what seemed like an eternity, the new drives arrived in the building, but they were sent to the wrong office. Making matters worse, the person who had signed for them in the other office left for the day with the drives under lock and key. On Friday morning, the drives were finally delivered to our server room and installed in the array. We then started restoring the data, a procedure we expected to take a few hours.

This is where the story turns truly disastrous. As we started the restore, the tape library retrieved the first tape in the saveset and proceeded to mount it. Then, to our horror, the drive ate it. Not only was the tape severed, it had been mangled beyond recognition. Because this was a multi-tape saveset, all of the subsequent tapes were worthless as well. We tried several homemade remedies for fixing the tape, but were unsuccessful.

Eventually, we had to make the hard decision to fall back to the previous week's tape set. We brought the server back online late that weekend with week-old data. Thankfully, our government sponsors were understanding enough that no one was fired.

What's Your Worst IT Nightmare?
Write up your story in 300-600 words and e-mail it to Editor Ed Scannell at escannell@redmondmag.com. Use "Never Again" as the subject line and be sure to include your contact information for story verification.

Lesson Learned
We replaced the proprietary RAID array with a more popular -- and more expensive -- unit supported by multiple vendors. We also purchased additional spare drives, making sure they did not all come from the same lot. To shore up the tape backups, we placed a disk array between the file server and the tape library, staging backups to disk before they were written to tape. This gave us another layer of redundancy.
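
The disk layer acts as a staging area: backups land on disk first and are then copied off to tape, so a restore can come from the disk copy even if a tape turns out to be damaged. A minimal sketch of the idea follows; this is not our actual Legato/Novell configuration, and every path and device name in it is hypothetical.

    # Disk-to-disk-to-tape staging, sketched in Python. Illustrative only;
    # all paths and the tape device below are hypothetical.
    import shutil
    import subprocess
    from pathlib import Path

    SOURCE = Path("/vol/fileserver")        # hypothetical: data on the primary array
    STAGING = Path("/vol/backup_staging")   # hypothetical: the intermediate disk array

    # Stage 1: copy the live data to the staging disk; this copy is restorable on its own.
    shutil.copytree(SOURCE, STAGING / "fileserver", dirs_exist_ok=True)

    # Stage 2: write the staged copy to tape, so a damaged tape still leaves the disk copy intact.
    subprocess.run(["tar", "-cf", "/dev/st0", str(STAGING / "fileserver")], check=True)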

We learned some hard lessons by going through this nightmarish experience, and vowed we'd never put ourselves in this position again.

About the Author

Todd Miles has worked in IT for several government agencies.


Reader Comments:

Thu, Apr 10, 2008 rippleyaliens tempest

Tuomoks, you sound like you are a master, but in reality you remind me of the many consultants that walk in and say, "I have a better plan." You are talking bad about a situation that happened over 10 years ago? How many servers have you rebuilt -- 1, 10, 100? And are you saying that every one was 100% restored? Yeah, right. Even with the latest state-of-the-art system today, you will lose something, unless you are spending $30K to protect one server. Amateur is what you sound like.

Thu, Mar 27, 2008 tuomoks Seattle

The first question that comes to my mind: who designed the reliability, backup and restore/recovery? It sounds a lot like a vendor or consulting company solution. I feel for you; I've seen it and been there, but I've also designed "failsafe" systems for a long, long time. Too many companies depend on (recommended) technology alone, even today, not understanding that many things can go wrong at the same time or in sequence; over the long run, the system always does. The problem is that the solution is not so much technical, or even money, but (internal) politics -- and you know how that works.

