Posey's Tips & Tricks
Server Power Outage Checklist: What To Do When the Power Returns
One of the big perks of being a freelance technology writer is that I get to work from my home. And thanks to the Internet, I don't have to be tied to a specific geographic region. I chose to live in one of the more rural parts of South Carolina. It's a great lifestyle, but one of the problems with living in the sticks is that my power is less than reliable.
Sometimes a week or two may pass between power failures, but at other times the power may go off seven or eight times in a single day. Normally, a power outage lasts anywhere from two or three seconds to a couple of minutes. Every once in a while, though, the power stays off for several hours at a time.
Recently, during a long-duration power outage, I decided that it might be useful to write a short checklist of things that need to be done after the power comes back on.
The top floor of my home is crammed with servers. The collection includes lab machines along with production servers that are used for E-mail, file storage, etc. As you would expect, all of my machines are protected by battery backup. As such, short power failures aren't really an issue for me. However, if the power stays off for an hour or more, then the batteries start to drain and servers begin to drop offline.
Before I get started on what to do after a power outage, I just want to talk a little about my power backup setup. It is common for battery backups to be linked directly to network servers through a USB cable. In that way, the servers can monitor the available battery power and can begin shutting down gracefully whenever the batteries start getting low.
Although some of my servers are set up this way, graceful shutdowns aren't always an option. Network appliances, storage arrays and servers running non-Windows operating systems aren't compatible with the software that I use to monitor available battery power. Therefore, when the batteries drain, there are always some machines on my network that immediately fail.
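To make the graceful-shutdown idea concrete, here is a minimal sketch of the decision a UPS monitoring agent has to make. It is not the software I use. The `UpsStatus` fields, the `should_shut_down` function and the threshold values are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class UpsStatus:
    """Hypothetical snapshot of what a UPS reports over its USB link."""
    on_battery: bool
    charge_percent: float
    runtime_minutes: float


def should_shut_down(status, min_charge_percent=20.0, min_runtime_minutes=5.0):
    """Return True when the server should begin a graceful shutdown.

    The thresholds are illustrative assumptions, not vendor defaults.
    """
    if not status.on_battery:
        # Utility power is present, so there is no reason to shut down.
        return False
    return (status.charge_percent <= min_charge_percent
            or status.runtime_minutes <= min_runtime_minutes)


# A short outage with plenty of battery left: keep running.
print(should_shut_down(UpsStatus(on_battery=True, charge_percent=85.0, runtime_minutes=40.0)))
# A long outage with the battery nearly drained: start shutting down.
print(should_shut_down(UpsStatus(on_battery=True, charge_percent=12.0, runtime_minutes=3.0)))
```

Devices that cannot run an agent like this never get the chance to make that decision, which is why they fail abruptly when the batteries drain.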
Three Basic Steps
So what should you do when the power comes back on after some hardware has gone down abruptly? The first thing I usually do is run a self-test on all of my battery backups.
Maybe the rest of the world has better luck than I do, but the more severe outages always seem to be preceded by a voltage spike or a brownout. As such, I tend to burn through a lot of backup batteries. By running a self-test on the backup batteries, I am confirming that each battery has survived the outage. Sure, right after a long outage the self-test will report that the battery is depleted, but the test is still valid for detecting battery faults.
Once I have checked all of my batteries and made any necessary replacements, I begin powering up servers and network appliances. The second step is to confirm that each device is displaying the correct time.
On Windows networks, there are a number of functions that are tied to the Kerberos protocol. By default, Kerberos breaks if the clocks are out of sync by more than five minutes. This situation can cause a variety of problems, ranging from login failures to E-mail problems and failed backup agents.
Last year, I had an extended power failure, and when the lights came back on, I powered up all of the servers. Everything seemed normal at first, but I soon noticed that my cell phone couldn't connect to my Exchange mailbox, even though all of the Exchange Server services were running. After a couple of frustrating hours, I noticed that the server's clock was wrong. The server's CMOS battery was dead so the server had lost track of the time during the power failure. Once I reset the server's clock, ActiveSync began synchronizing my mail once again.
Since that incident, I have made it a habit to check the time on each device as it comes back online.
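The time check boils down to comparing each device's clock against a trusted reference and flagging anything outside the five-minute Kerberos tolerance. The sketch below shows that comparison only. The function names are hypothetical, and how you obtain the reference time (a domain controller, an NTP source, a wall clock) is left as an assumption.

```python
from datetime import datetime, timedelta, timezone

# Five minutes is the default Kerberos clock-skew tolerance.
KERBEROS_DEFAULT_TOLERANCE = timedelta(minutes=5)


def clock_skew(device_time, reference_time):
    """Return the absolute difference between two timezone-aware datetimes."""
    return abs(device_time - reference_time)


def within_kerberos_tolerance(device_time, reference_time,
                              tolerance=KERBEROS_DEFAULT_TOLERANCE):
    """Return True when the device clock is close enough for Kerberos to work."""
    return clock_skew(device_time, reference_time) <= tolerance


reference = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
slightly_off = reference + timedelta(minutes=2)
badly_off = reference - timedelta(hours=3)  # e.g. a server with a dead CMOS battery

for label, device_time in [("slightly off", slightly_off), ("badly off", badly_off)]:
    ok = within_kerberos_tolerance(device_time, reference)
    print(label, clock_skew(device_time, reference), "OK" if ok else "FIX CLOCK")
```

Using timezone-aware timestamps matters here, because comparing a local time against a UTC reference would report a skew that is really just a time zone offset.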
My third step after the power comes back on is to check my RAID arrays. If a fault-tolerant array goes down abruptly, the array is usually left in an inconsistent state. As such, I want to make sure that the automated repair sequence is running and that the power failure hasn't caused any severe disk corruption. I also like to verify that all hard drives within the array are still functional.
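The RAID check is really a small triage: are any drives dead, is the array consistent, and if not, is the repair actually running? Here is a rough sketch of that logic. The `ArrayStatus` record and its fields are hypothetical and not tied to any controller or vendor tool; in practice you would fill them in from whatever your RAID management software reports.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ArrayStatus:
    """Hypothetical summary of one RAID array after power-up."""
    name: str
    consistent: bool       # True when the array reports its data as in sync
    repair_running: bool   # True when a resync or rebuild is in progress
    failed_disks: List[str] = field(default_factory=list)


def triage(array):
    """Return a short description of what, if anything, needs attention."""
    if array.failed_disks:
        # Dead drives come first; a repair cannot finish without them.
        return "replace failed disks: " + ", ".join(array.failed_disks)
    if array.consistent:
        return "healthy"
    if array.repair_running:
        return "repair in progress"
    return "inconsistent and repair not running"


arrays = [
    ArrayStatus("file-storage", consistent=True, repair_running=False),
    ArrayStatus("mail", consistent=False, repair_running=True),
    ArrayStatus("lab", consistent=False, repair_running=False, failed_disks=["disk 3"]),
]
for array in arrays:
    print(array.name + ":", triage(array))
```

The last case in `triage`, an inconsistent array with no repair underway, is the one worth worrying about, because nothing is working to fix it.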
Once I have verified these three things, the rest of the startup process is relatively routine. For example, I like to make sure that all of the server services start and that I still have Internet access.
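One simple way to spot-check services and Internet access is to try opening a TCP connection to each one. The sketch below assumes that a successful connection is a good enough sign that a service is up, which is not always true. The host names and ports in the example are placeholders, not my actual servers.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def failed_checks(checks, probe=port_is_open):
    """Return the names of checks whose host/port could not be reached.

    `checks` maps a descriptive name to a (host, port) tuple. `probe` can be
    swapped out, which makes the function easy to try without a network.
    """
    return [name for name, (host, port) in checks.items() if not probe(host, port)]


if __name__ == "__main__":
    # Placeholder targets; substitute your own servers and an outside host.
    checks = {
        "mail (SMTP)": ("mail.example.com", 25),
        "file server (SMB)": ("files.example.com", 445),
        "Internet (HTTPS)": ("www.example.com", 443),
    }
    for name in failed_checks(checks):
        print("Not reachable:", name)
```

A check like this only tells you that something is listening on the port. It is a quick first pass, not a substitute for confirming that mail actually flows or that files can be opened.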
Although these basic checks are simple to perform, they are important to keep in mind. You'll want to take care to ensure that everything works correctly after things go dark.
Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.