Troubleshooting -- the HP Way
Hardware troubleshooting can be difficult, but a methodical approach can do wonders. Here's how it's done at one of the world's largest IT services companies.
Troubleshooting hardware and software issues for one of the world's largest IT firms can be a daunting challenge. With thousands of installations of HP equipment deployed throughout the world, there is no shortage of fine-tuning that needs to happen at the computer giant. Some problems are easy to spot, especially for an experienced troubleshooter. Others can be much more difficult, caused by issues buried so deep that you need the IT equivalent of a backhoe to uncover them.
Since joining HP in 1997, I've seen the entire gamut of troubleshooting trials. I once worked on a difficult cluster configuration while the client company's CIO and an engineer from a competitor watched over
my shoulder, and I've toiled in
conditions so isolated it felt like I was in a submarine at the bottom
of the ocean.
To be successful in such diverse environments, you need a strong troubleshooting
framework. At HP, it's called the 6-Step Troubleshooting Methodology, and it
provides a proven, battle-tested system for troubleshooting almost any issue,
especially server-related problems. The steps break down as follows:
- Collect Data
- Evaluate the Information: Isolate the Mode of Failure
- Develop an Optimized Action Plan
- Execute the Action Plan
- Determine Whether the Problem is Solved
- Preventive Measures
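The six steps above can be sketched as a simple loop. This is an illustrative Python sketch of my own, not HP tooling; each step is supplied as a callable, and steps 4 and 5 repeat for each candidate action until one resolves the issue.

```python
# A minimal sketch of the 6-step loop, assuming each step is supplied as a
# callable by the caller; the names here are illustrative, not part of any
# actual HP tool.

def troubleshoot(collect, evaluate, plan, execute, is_solved, prevent):
    """Walk the six steps, repeating steps 4-5 for each candidate action."""
    data = collect()                    # Step 1: Collect Data
    failure_mode = evaluate(data)       # Step 2: Isolate the Mode of Failure
    for action in plan(failure_mode):   # Step 3: Develop an Optimized Action Plan
        execute(action)                 # Step 4: Execute the Action Plan
        if is_solved():                 # Step 5: Determine Whether the Problem is Solved
            prevent(action)            # Step 6: Preventive Measures
            return action
    return None                         # plan exhausted; go collect more data
```

The point of the structure is that step 6 only runs once a fix is confirmed, and an exhausted plan sends you back to data collection rather than to guessing.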
Tip: It's FANtastic
I had our maintenance
department run an additional Ethernet line for a newly constructed
office. They ran it parallel with a power line in the wall.
The person in the new office had sporadic problems with connectivity
to the network.
There was a large industrial fan outside the office, so I
plugged it into the wall outlet beside the Ethernet line and
turned it on. All of a sudden, the person was back to working
on the network without a problem. It seems that the large
motor on the fan did something with the electrical signal,
which allowed the Ethernet line to run parallel with the power
line. I told the user to plug the fan in whenever problems
cropped up until we could get the Ethernet cable reinstalled.
-- Douglas W. McGaha
What follows is a real-world example that required us to use all six steps,
some of them multiple times, to fix a customer's problem. This client was experiencing
issues with a ProLiant 6400 server. The server would work for a period of between
10 minutes and 36 hours, then mysteriously experience a hard lockup to the point
where the server could only be recovered by performing a power cycle. Our services
organization had made several site visits and replaced several parts, but the
problem persisted. It was time for a more rigorous approach.
Step 1: Gather Data
Tip: Perils of Peripherals
My favorite troubleshooting
tip when I have a hardware problem is to disconnect all of
the peripherals, then restart. If the problem goes away, I
then start connecting them one at a time to isolate the fault.
If the hardware fault is internal, first remove all of the
PCI cards and basically use the same procedure.
-- Bill McMillin
Society Hill, S.C.
Collecting data is the first step of any troubleshooting exercise. So I got started
with a survey of the ProLiant server, which included a complete rundown of installed
components. Among them:
- Types of installed cards and their associated slots
- Size of the hard drives and their RAID configuration
- Number of processors and their speeds
- Firmware levels
- Software driver versions (including a history with date stamps of upgrades)
- Windows Event Log information
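A survey like this is most useful when it can be compared against an earlier one, especially across a maintenance window. Here's a hedged sketch of capturing the survey as structured data and diffing two snapshots; the field names are my own illustration, not the schema of any HP survey tool.

```python
# Record a server survey as structured data so that surveys taken before and
# after a maintenance window can be diffed. Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class ServerSurvey:
    cards_by_slot: dict                 # slot number -> card model
    raid_config: str                    # e.g. "RAID 5, 4 drives"
    processors: list                    # one entry per CPU, with speed
    firmware: dict                      # component -> firmware revision
    drivers: dict                       # driver name -> version string
    event_log_errors: list = field(default_factory=list)

def diff_surveys(before, after):
    """Return {name: (old, new)} for firmware and drivers that changed."""
    changed = {}
    for source in ("firmware", "drivers"):
        old, new = getattr(before, source), getattr(after, source)
        for name, version in new.items():
            if old.get(name) != version:
                changed[name] = (old.get(name), version)
    return changed
```

With a pre-maintenance snapshot on file, a diff like this answers "what changed?" in seconds instead of an interview.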
The survey revealed that the drivers and firmware were one or two revisions
out of date, but otherwise, things looked fine. When I arrived onsite, the customer
provided a couple of new details, including one potentially very important clue:
The erratic behavior started after a scheduled maintenance window to perform
routine firmware and software updates.
Step 2: Evaluation & Analysis
The timing of the issues made it seem likely that something had occurred during
the maintenance window to cause the problem. After all, the server had been
in service for more than a year before problems cropped up.
I began to ask questions about the specific maintenance that was done to this server:
- What firmware was loaded?
- What cards were uninstalled or installed?
- What drivers were updated?
- What hard drives were replaced?
- Was the server removed from the rack?
- Were any parts other than the hard drive replaced?
- How soon after maintenance was the first failure observed?
Tip: Pull and Push
The very first thing to
do is open the case, unplug and re-plug the wires, and unseat
and re-seat all the cards. I bet half the time you'll
find something unplugged or halfway out of its socket,
loosened by heat fluctuations while the system
was on. Now make sure everything is snug and tight again,
put it all back together and plug it in again. Don't
do anything else until you've tried this!
-- Eric W. Wallace
Next, I drilled down into the software and hardware updates, both of which
can be the source of system trouble:
- Was the ROM, drive firmware or smart controller firmware updated?
- Were the updates applied one at a time, with a reboot into the OS between
each, or simply performed in sequence before the OS loaded? Loading updates
on top of other updates without testing each one for OS stability can cause
nasty problems; it didn't seem to fit this situation, however.
- Were NICs upgraded?
- Were Smart Controllers (hardware-based RAID controllers that increase data
throughput and allow "hot" replacement of a failed hard drive) replaced? This
was to determine if a driver or hardware mismatch existed.
- Driver updates: When were various Microsoft drivers, Compaq drivers and
so on installed; in what order; and were reboots done in between updates?
There are times when one vendor's drivers overwrite another's. I was exploring
a potential OS stability issue by asking whether the server had been rebooted
between driver updates.
- What hard drives were replaced (one had been replaced because of an amber
LED, indicating a potential pre-failure of a hard disk drive)? It wouldn't
be the first time that an incorrect hard drive was replaced.
- Was the server removed from the rack? When a server is removed, it's often
not handled in the gentlest manner. Jarring the server when setting it on
a floor or table can cause cards to become dislodged or a component to fail;
or, even worse, the integrity of the server can be compromised, resulting
in the sheet metal making contact with the system board.
- Were any parts replaced other than the hard drive? (A fishing expedition
question to cover all the bases.)
- How soon after the maintenance window was the first failure observed? This
is an important question: after any change you really need to observe the
behavior of the server for a few days. In this case the answer was the next
morning, approximately eight hours after the maintenance.
Tip: Cold as Ice
I've used this trick
only twice, and it worked once. One of my fellow IT coworkers
had her hard drive fail. There were some pretty important
e-mail archives on it, so it was important to get the data
off if possible. However, we were not willing to spend $2,000
to send the drive out for recovery. The drive wouldn't
spin up, so I couldn't ghost it as I would normally.
Instead, I packed the drive in a Ziplock bag and put it in
the freezer for about 90 minutes. Once it was cold, I quickly
hooked it up to the computer and was able to get the drive
up long enough to get the data off. The freezing trick worked
until the drive became hot, which was about 20 minutes.
-- Joanna Lovett
Step 3: Develop a Plan
Based on my collection of data, I believed the problem originated from one of three sources:
- Damage to the server (or one of its components), possibly during removal
from the rack, servicing, or reinsertion into the rack
- Some type of firmware or configuration error (like a ROM version in relation
to a third-party hardware card)
- Some type of OS/driver conflict. It could be that the server lock-up was
occurring before the error could be logged
Working on these three likely scenarios, I proposed that we start troubleshooting
in reverse order of what was done in the maintenance window -- start with software,
then firmware, then hardware.
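That reverse-order plan can be sketched as a small routine: undo the maintenance-window changes newest-first and retest stability after each rollback. This is an illustrative Python sketch; the change list, undo step, and stress test are stand-ins for the real maintenance log and stress utility.

```python
# Undo changes in reverse order of application, testing after each rollback.
# The callables are stand-ins for the real maintenance log and stress utility.

def rollback_in_reverse(changes, undo, stress_test):
    """changes is in the order applied; undo(change) reverts one change;
    stress_test() returns True if the server now stays up under load."""
    for change in reversed(changes):
        undo(change)
        if stress_test():
            return change       # removing this change restored stability
    return None                 # software/firmware ruled out; suspect hardware
```

The value of the ordering is that the most recent change is the most likely culprit, and a run that ends in None is itself a finding: it pushes suspicion onto the hardware.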
Tip: Known Good
We keep a "known
good" PC in our office. We can remove the hard drive
of a PC that won't boot and scan for spyware and viruses.
We also have all of the major hard drive vendor diagnostics
installed on the system, as well as several different data-recovery
utilities.
-- Kyle Beckman
Step 4: Work the Plan
We began by rolling back the driver and software changes, and within 15
minutes the server failed. We then tried back-revving the firmware to where
the server was prior to the maintenance. This essentially put the server in
a time warp, back to its pre-problem days. We observed the server for about
30 minutes, put a stress utility on it and went to lunch. When we got back,
the server was down.
Based on this, I ruled out a software- or firmware-related issue: the server
continued to fail even in the configuration under which its track record had
been exemplary. I then started working the hardware angle.
I removed the server from the rack and went over it physically, looking for problems. Nothing was apparent, so I replaced the system board and drive cage. The server then ran for about an hour (again under the influence of the stress utility), and failed.
Continuing to narrow down the list of potential causes, I started unseating
every part, and was contemplating the "shotgun" approach of replacing all the
parts inside the server at once, when I observed a little glint from the SCSI
cable that connected the drive cage to the Smart Array Controller.
Step 5: See What Happens
Tip: Lights Out?
Here is the tip I taught
at Bellevue Community College: "Do you have lights?"
In other words, is it plugged in? Turned on? Does it have
a connection? It was a way of reminding people to check the
basic functions before trying something complicated. Occasionally,
I would pull the network connection out of a student's practice
server, and it could take them 10 to 15 minutes to think about checking
the cable.
-- Matthew Damp
I saw that the cable's plastic sheath had been worn away, as if scraped off by
a knife. Based on the bend in the cable in relation to all the other parts of
the server, it looked like it would be easy to pinch the cable when closing the
server's lid. I removed the cable and replaced it with a new one.
After that, we ran the stress utility; for the next three hours the server
ran fine. We were optimistic but not quite ready to declare victory. The following
morning, the server was still running -- we'd cleared the next hurdle. We hung
around until lunch and, with no problems, were growing more confident every
hour. At 3 p.m. the server was still running. At that point, we had a review
meeting to discuss the situation.
Step 6: Make Sure It Doesn't Happen Again
I said that the issue came down to a simple and easy mistake -- one I've made
before -- of not paying enough attention to the SCSI cables when closing the
server lids and/or sliding the server's system tray. I showed them the compromised
cable, and explained that at some point the cable's exposed wiring would make
solid contact with the sheet metal of the server's case and cause a short. Result?
The server locks up.
I replaced the cable and demonstrated how to properly close the server lids
without compromising the cable. To my knowledge, that server never had another
lockup.
Tip: Linux Rx
Use Linux to help troubleshoot.
1. Boot up a system with a Knoppix CD.
2. Make the alleged "broken" device work under Linux.
3a. Explain to the user (again) why you do not allow administrative
privileges to users so they can load buggy software.
3b. Apologize to user for broken hardware (the other .0001
percent of times)
Knoppix is a bootable Linux distribution on CD, with a baseline
set of drivers and other utilities. It's always being updated,
and you can order the CD at www.knoppix.org.
-- John Slater
Sometimes It's the Simple Things
Throwing parts at a problem isn't usually the best solution. Even an expensive
piece of technology can give someone fits, and the knee-jerk reaction of ordering
replacement hardware carries real costs: the server is usually down while you wait,
the time needed to order and receive parts adds to the downtime, and there's no
guarantee the new hardware will resolve anything, which makes the whole exercise
as frustrating as it is expensive.
Taking the time up front to "work the issue" and methodically troubleshoot
will usually resolve the problem. Failure to do so, as was the case in this
situation, often leads to a delayed resolution -- not to mention the additional
expense incurred due to the replacement of perfectly good parts. In this case,
it turned out that a $4 cable was the difference between emotional anguish and
a smoothly running server.
A methodical approach and deductive reasoning will resolve most problems. By
applying the ingrained six-step method, I was able to fix the customer's
problem and prevent further expense to both the customer and HP.
You can do the same, whether you work at a huge, worldwide service vendor or
a mom-and-pop shop with one server connected to four computers.