Scout out potential trouble by listening to your servers’ sounds. The best plan, as always, is a current backup.
Many network administrators who have been in the game long enough tend
to develop a motherly instinct when it comes to overseeing their family of
servers and workstations. They know when their machines are too hot, hungry
(for more RAM), or sick. The sound a machine makes when it’s down and
out is distinct. Sounds emanating from the server normally lead to a straightforward
diagnosis, as there are so few moving parts: hard drive failure. The noise
sometimes reminds me of a playing card in the spokes of a bicycle or an
index finger in a moving CPU fan (though I shouldn’t admit to having done
the latter). The symptoms aren’t always severe at first. Advanced operating
systems like Windows NT/2000 can detect bad sectors on the drive and remap
the affected data to healthy ones. But when the worst-case scenario becomes reality,
you’d better count on a long night.
I wasn’t prepared for such an evening when I drove six hours to
a client site to upgrade its SQL Server from 6.5 to 7.0. In fact, because
I’d done at least 10 upgrades of the same kind, using the upgrade wizard
provided in SQL Server 7.0, I was expecting to roll out early and spend
a relaxing evening studying in the hotel room for an upcoming exam.
After arriving, I made my introductions to the IT staff members, with
whom I’d only had phone contact previously. We shared a few moments of
“Hey, that’s what you look like!” before I started getting ready for the
upgrade. I think they could perceive my confidence, as we’d been planning
this for several weeks. We chose a Thursday evening so I could be on site
Friday morning in the event of a disaster.
The Dreaded Clicking Sound
I remember seeing the entrance to the server room from about 15 feet away.
As I approached the entrance, I felt something odd, like a premonition
of a long night without the opportunity even to stop for a slice of delivery
pizza—but I shrugged it off.
The moment I placed my right foot in the doorway, however, I heard it:
The repetitive click of a small, incapacitated spindle arm banging against
the metal surface of a disk drive platter. I looked at the now inquisitive
IT staff members, who’d placed their entire trust in me to make their
jobs and lives easier. Their questioning looks said, “I wonder what he’s
going to do about this?” I smiled and said something that apparently only
I found humorous: “So, you guys did make good backups like I asked, right?”
The sickly machine, naturally, turned out to be the SQL Server system
I was there to upgrade. I pulled up to the machine in a painfully uncomfortable
rolling office chair, a place I would occupy for the next several hours.
I was miraculously able to log in and navigate the directory structure,
though the machine was crawling. The first thing I discovered was that
there was no configured RAID (Redundant Array of Inexpensive Disks), hardware,
software or otherwise. They’d gone with the antithesis of RAID—SLED (Single
Large Expensive Disk). The drive was partitioned with C: and D: drives.
The master database was set up on C: and all the user databases on D:.
SQL Server seemed to be running fine but I knew it was only a matter of
time—potentially only minutes—before we might never boot again. The more
I worked on the machine, the more noise it made and the slower it got.
I had to act fast.
If at First You Don’t Succeed…
They’d purchased a new server that was going to be the recipient
of the upgraded databases, once the conversion was finished on the now-crashing
server. Both were running NT 4.0 with Service Pack 6. They’d already installed
SQL 7.0 with SP2 on the new server.
I faced several alternatives. One option would be to attempt a machine-to-machine
upgrade by connecting, via the network with the upgrade wizard, to the
old SQL Server. Another course of action: Remove SQL 7.0 on the new server,
install SQL 6.5, copy over the database files to the new server and perform
a single-machine upgrade. Either option would require pulling very large
amounts of data—a gig and change—from the ailing system. I decided to
walk down the machine-to-machine upgrade path first. About 10 minutes
into the process, when I was just starting to believe it would work, the
old server hung. I waited for any sign of life, then decided to take the
only option left and bounce the old server. Another small miracle occurred
when it actually rebooted successfully and the SQL services started! So,
scratch plan A and move to plan B, the single-machine upgrade.
I’d learned a trick for moving a full SQL Server directly to a new machine
without having to restore databases individually, one I’d employed
several times in the past. The procedure’s simple: Install the same version
of SQL Server on another machine. Stop the SQL services on both machines.
Copy all the SQL database and log files, like master.dat, into the same
location from the source server to the destination server. If master.dat
resided on C:\MSSQL\Data, then that’s where it has to go on the new server.
The master database contains information about all the databases, users
and logins on the server. With all the files in place, take the old server
offline, give the new server the same name as the old box and change its
IP address to match. Restart the SQL services on the new machine. If everything
was done correctly, the new SQL Server would be identical to the old SQL
Server. This was my new plan of attack.
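
The file-copy step of that procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling I used that night: the function names (copy_with_hash, mirror_data_files) are hypothetical, the source and destination roots stand in for the mapped drives, and the SQL services are assumed to be stopped on both machines already. Each file lands at the same relative path on the destination, and the source is read only once so a struggling drive isn't asked to do extra work.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List

CHUNK_SIZE = 1024 * 1024  # read 1 MB at a time so large data files never sit in memory


def copy_with_hash(source: Path, target: Path) -> str:
    """Copy source to target in one pass and return the SHA-256 of the bytes read."""
    digest = hashlib.sha256()
    with source.open("rb") as src, target.open("wb") as dst:
        for chunk in iter(lambda: src.read(CHUNK_SIZE), b""):
            dst.write(chunk)
            digest.update(chunk)
    return digest.hexdigest()


def hash_file(path: Path) -> str:
    """Return the SHA-256 of a file already on disk (used on the new server only)."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_data_files(source_root: Path, dest_root: Path) -> List[Path]:
    """Copy every file under source_root to the same relative path under dest_root.

    Hypothetical helper: assumes the database services are stopped so no file is open.
    Raises IOError if a copied file does not match what was read from the source.
    """
    copied = []
    for source in sorted(p for p in source_root.rglob("*") if p.is_file()):
        target = dest_root / source.relative_to(source_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        source_hash = copy_with_hash(source, target)
        if hash_file(target) != source_hash:
            raise IOError(f"copy of {source} did not verify")
        shutil.copystat(source, target)  # carry over timestamps from the original
        copied.append(target)
    return copied
```

Calling mirror_data_files once per drive (one call for the C: data directory, one for D:) keeps the layout identical on both machines, which is the whole point of the trick: master expects to find every database file exactly where it left it.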
Crash No. 2
I uninstalled SQL 7.0 and installed SQL 6.5 on the new server.
It had been partitioned identically, so all that remained for me to do
was move the data and log files. I connected to the dying server by mapping
drives to the administrative shares on the C: and D: drives and began
copying the files. In hindsight, I could have attempted to zip the files,
but that would have required even more hard drive activity. After another 45 minutes
of copying, the old server hung again. Arggghhh!
Please Tell Me You Made the Backup
This time I opted to restore from backup tape. I’d made the network
manager at the client site promise me he’d make backups of everything,
including the raw data files, before I began the upgrade. He’d stopped
the SQL services so the files wouldn’t be open and subsequently skipped
during the backup process. I was able to pull the backed-up files from
the tape and restore them on the new SQL server, which was now a SQL 6.5
machine. I started the services after powering down the old machine for
the final time, and everything started successfully. All that remained
was to reinstall SQL 7.0 and complete the upgrade process for all the
databases. Thankfully, it went off without a hitch.
As it was nearly 3 a.m. before I finished, I headed back to the hotel.
I was so geared up, I did actually study for a few minutes. The users
showed up the next morning, rested and eager to experience the promised
performance gains. I made sure they were all content and then headed home
for a relaxing weekend, remembering to thank the real heroes who’d saved
the day with one backup tape.
MTBF Means Just That
Every hard drive comes with an MTBF value. MTBF stands for Mean
Time Between Failures and is measured in hours. Though today’s hard drives
have values in the hundreds of thousands of hours, just knowing the number
exists is food for thought.
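
To put a number on that food for thought, here is a rough sketch that turns an MTBF figure into the chance of a failure within a year. It rests on assumptions not stated by any drive maker here: a constant failure rate (the exponential model), drives running around the clock, and an illustrative MTBF of 500,000 hours that I picked only because it falls in the "hundreds of thousands" range. The function names are made up for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def any_drive_fails(mtbf_hours: float, drive_count: int) -> float:
    """Chance that at least one of drive_count independent drives fails within a year."""
    return 1.0 - math.exp(-drive_count * HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    mtbf = 500_000  # illustrative value, not taken from any specific drive
    print(f"one drive:  {annual_failure_probability(mtbf):.1%}")
    print(f"ten drives: {any_drive_fails(mtbf, 10):.1%}")
```

Under those assumptions a single drive has a bit under a 2 percent chance of failing in a given year, and a shop with ten such drives should expect roughly a one-in-six chance of losing at least one. The model is crude, but it makes the point that a large MTBF is not a promise.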
Remember, though, that there are many resources and technologies out
there to prevent hard-drive catastrophes, or at least to keep downtime to a
minimum. Thus, there are really no excuses for not protecting your data.
Technologies like IntelliMirror, clustering, Remote Installation Services,
disk imaging and single-disk recovery procedures that come with many backup
applications offer varying levels of protection.
There are also companies that provide services to yank data off drives
seemingly beyond repair. Many of these resources are expensive, however,
and not all companies see the value in investing in them. Having seen
my share of crashes, and with the plunging prices of hard drives, I’d
recommend at the very least using the software mirroring available with
Win2K in conjunction with a solid tape backup plan. In the end, the time
and cost involved with rebuilding and restoring a server, especially if
there’s significant data loss, would likely pay for the ultimate addition
to my server family: a twin pair of quad Xeon, load-balanced, RAID 10,
hot-swappable cluster servers with a solid tape backup plan.