Just Say No
Drinking and disk management just don't mix.
About eight years ago, I worked as a Windows NT admin for the credit-card division of a large bank in Scotland, currently one of the largest in the world. It was a Friday around 5 p.m. when my team of three Windows admins discovered a problem on one of our main disk arrays: A nasty corruption had basically screwed up the whole drive array. For some reason we had problems with two drives on our RAID 5 array, and inexplicably the hot spare hadn't kicked in. The affected array held about 120GB of data and served as our main production file server. But the serious problem we were facing was that we had lost live production data.
Contained within the disk arrays was statistical information on credit-card user habits, demographics and other vital data used by the business analysts. The data on this partition was critical, so it was extremely important that we get the information restored as quickly as possible.
As it was already 5 p.m., and us being Scots, we all had urgent appointments in the pub that night. We decided to forgo starting the restore that night, and instead agreed to start work early the next day around 9 a.m. We all went our separate ways and had a few beers, followed by a few more and a few more for good measure. I left the pub around 1:30 a.m. thinking I had to be in reasonably good shape for the restore effort. Sadly, as it turned out, I was the responsible one.
I came in at about 8:55 a.m. to find that only one of my colleagues was in the data center. The whites of his eyes were the color of red snooker balls. Our tasks that morning included replacing the failed disk, reconfiguring a new hot spare and reformatting the partition so that it could hold the restored data. My colleague had decided, despite his raging hangover, to start the reformat and restore process on his own.
The first step after putting in new drives was to reformat the affected partition. There was just one problem: My hungover colleague formatted the wrong partition. He managed to take out yet another 120GB array.
As it turned out, the other two guys in my group, including the one who had just nuked a working partition, had plans to go out of town that weekend. So naturally, I was left to restore both partitions myself.
Unfortunately, there was a wrinkle. We had implemented a new restore procedure, but it had yet to be tested. The bank had just merged two separate credit-card companies, migrated to a new domain and moved to a different building. Complicating things further, I was relatively new to my job and had little experience with Backup Exec, the product we were using for our restore operation. To say that I was a little nervous would be an understatement.
The major issue was to restore as much as possible with as little impact as possible, including not letting the boss know just how badly we had bungled the initial recovery. It turned out to be both an extremely nerve-racking and boring weekend. We didn't have a tape robot, so I had to run the restore by hand and change tapes constantly. In my spare time, I ended up writing scripts to re-share everything that I could think of, so that after the restore was complete, the shares could be re-created. No, our backups weren't very sophisticated in those days.
I finally got everything restored by about 3 a.m. on Monday and ran my scripts. I sent out a brief e-mail to the systems team, telling them I would be slightly late coming in.
The moral of the story? Never, ever mix critical-data recovery and alcohol consumption.
Chris Murdoch, the Wintel technical lead for the data center operations team at a financial institution in the San Francisco Bay Area, is a native of Scotland. He has a post-graduate degree in Information Systems and 12 years' experience in systems administration.