In-Depth

Troubleshooting -- the HP Way

Hardware troubleshooting can be difficult, but a methodical approach can do wonders. Here's how it's done at one of the world's largest IT services companies.

Troubleshooting hardware and software issues for one of the world's largest IT firms can be a daunting challenge. With thousands of installations of HP equipment deployed throughout the world, there is no shortage of fine-tuning that needs to happen at the computer giant. Some problems are easy to spot, especially for an experienced troubleshooter. Others can be much more difficult, caused by issues buried so deep that you need the IT equivalent of a backhoe to uncover them.

Since joining HP in 1997, I've seen the entire gamut of troubleshooting trials. I once worked on a difficult cluster configuration while the client company's CIO and an engineer from a competitor watched over my shoulder, and I've toiled in conditions so isolated it felt like I was in a submarine at the bottom of the ocean.

To be successful in such diverse environments, you need a strong troubleshooting framework. At HP, it's called the 6-Step Troubleshooting Methodology, and it provides a proven, battle-tested system for troubleshooting almost any issue, especially server-related problems. The steps break down as follows:

  1. Collect Data
  2. Evaluate the Information: Isolate the Mode of Failure
  3. Develop an Optimized Action Plan
  4. Execute the Action Plan
  5. Determine Whether the Problem is Solved
  6. Preventive Measures

Reader Tip: It's FANtastic

I had our maintenance department run an additional Ethernet line for a newly constructed office. They ran it parallel with a power line in the wall. The person in the new office had sporadic problems with connectivity to the network.

There was a large industrial fan outside the office, so I plugged it into the wall outlet beside the Ethernet line and turned it on. All of a sudden, the person was back to working on the network without a problem. It seems the fan's large motor did something to the electrical signal that allowed the Ethernet line to work while running parallel with the power line. I told the user to plug the fan in whenever the problem returned, until we could get the Ethernet cable reinstalled.

-- Douglas W. McGaha
Easley, S.C.

What follows is a real-world example that required us to use all six steps, some of them multiple times, to fix a customer's problem. This client was experiencing issues with a ProLiant 6400 server. The server would work for a period of between 10 minutes and 36 hours, then mysteriously experience a hard lockup to the point where the server could only be recovered by performing a power cycle. Our services organization had made several site visits and replaced several parts, but the problem persisted. It was time for a more rigorous approach.

Reader Tip: Perils of Peripherals

My favorite troubleshooting tip when I have a hardware problem is to disconnect all of the peripherals, then restart. If the problem goes away, I then start connecting them one at a time to isolate the fault. If the hardware fault is internal, first remove all of the PCI cards and basically use the same procedure.

-- Bill McMillin
Society Hill, S.C.

Step 1: Gather Data
Collecting data is the first step of any troubleshooting exercise. So I got started with a survey of the ProLiant server, which included a complete rundown of installed components. Among them (a rough way to record such a survey is sketched after this list):
  • Types of installed cards and their associated slots
  • Size of the hard drives and their RAID configuration
  • Number of processors and their speeds
  • Firmware levels
  • Software driver versions (including a history with date stamps of upgrades)
  • Windows Event Log information
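
The methodology doesn't dictate any particular tooling for this step, but capturing the survey in a structured form makes it much easier to compare against a pre-maintenance baseline later. Here's a rough Python sketch of one way to do that; the field names, driver names and revision strings are invented for illustration and aren't part of any HP tool.

```python
# Minimal sketch: record a server hardware/software survey and compare it
# against a pre-maintenance baseline. All names and version strings below
# are invented for illustration.
from dataclasses import dataclass, field, asdict

@dataclass
class ServerSurvey:
    hostname: str
    cards: dict = field(default_factory=dict)      # slot -> installed card
    drives: dict = field(default_factory=dict)     # bay -> (size, RAID level)
    processors: str = ""                           # e.g. "4 x 550 MHz"
    firmware: dict = field(default_factory=dict)   # component -> revision
    drivers: dict = field(default_factory=dict)    # driver -> (version, date installed)

def diff_surveys(before: ServerSurvey, after: ServerSurvey) -> dict:
    """Return the fields whose values changed between two snapshots."""
    old, new = asdict(before), asdict(after)
    return {k: (old[k], new[k]) for k in old if old[k] != new[k]}

before = ServerSurvey("proliant-6400",
                      firmware={"system_rom": "rev A", "array_ctrl": "1.40"},
                      drivers={"cpqarray.sys": ("5.1.8", "2000-11-02")})
after = ServerSurvey("proliant-6400",
                     firmware={"system_rom": "rev B", "array_ctrl": "1.42"},
                     drivers={"cpqarray.sys": ("5.2.0", "2001-05-20")})

for name, (was, now) in diff_surveys(before, after).items():
    print(f"{name}: {was} -> {now}")
```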

The survey revealed that the drivers and firmware were one or two revisions out of date, but otherwise things looked fine. When I arrived onsite, the customer provided a couple of new details, including one potentially very important clue: the erratic behavior had started after a scheduled maintenance window for routine firmware and software updates.

Step 2: Evaluation & Analysis
The timing of the issues made it seem likely that something had occurred during the maintenance window to cause the problem. After all, the server had been in service for more than a year before problems cropped up.

I began to ask questions about the specific maintenance that was done to this server:

  • What firmware was loaded?
  • What cards were uninstalled or installed?
  • What drivers were updated?
  • What hard drives were replaced?
  • Was the server removed from the rack?
  • Were any parts other than the hard drive replaced?
  • How soon after maintenance was the first failure observed?

Reader Tip: Pull and Push

The very first thing to do is open the case, unplug and re-plug the wires, and unseat and re-seat all the cards. I bet half the time you'll find something was unplugged or halfway out of its socket, loosened by heat fluctuations while the system was running. Now make sure everything is snug and tight again, put it all back together and plug it in again. Don't do anything else until you've tried this!

-- Eric W. Wallace
Portland, Maine

Next, I drilled down into the software and hardware updates, both of which can be the source of system trouble (a rough timeline sketch follows this list):

  • Was the ROM, drive firmware or smart controller firmware updated?
  • Were the updates applied one at a time, with a reboot into the OS between each, or simply performed in sequence before the OS was loaded? Loading updates on top of other updates without testing each one for OS stability can cause nasty problems; it didn't seem to fit this situation, however.
  • Were NICs upgraded?
  • Were Smart Controllers (hardware-based RAID controllers that increase data throughput and allow "hot" replacement of a failed hard drive) replaced? This was to determine if a driver or hardware mismatch existed.
  • Driver updates: When were the various Microsoft drivers, Compaq drivers and so on installed; in what order; and were reboots done between updates? There are times when one vendor's drivers overwrite another's. I was exploring a potential OS stability issue by asking whether the server had been rebooted between driver updates.
  • What hard drives were replaced (one had been replaced because of an amber LED, indicating a potential pre-failure of a hard disk drive)? It wouldn't be the first time that an incorrect hard drive was replaced.
  • Was the server removed from the rack? When a server is removed, it's often not handled in the gentlest manner. Jarring the server when setting it on a floor or table can cause cards to become dislodged or a component to fail; or, even worse, the integrity of the server can be compromised, resulting in the sheet metal making contact with the system board.
  • Were any parts replaced other than the hard drive? (A fishing expedition question to cover all the bases.)
  • How soon after the maintenance window was the first failure observed? This is an important question: after any change you really need to observe the behavior of the server for a few days. In this case the answer was the next morning, approximately eight hours after the maintenance.
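
One way to make that last question concrete is to lay the maintenance events and the failures out on a single timeline and compute the gap between the last change and the first lockup. Here's a rough Python sketch of the idea; the timestamps and event descriptions are invented, though the roughly eight-hour gap mirrors this case.

```python
# Rough sketch: correlate maintenance events with observed failures by
# building a single sorted timeline. The events below are invented examples.
from datetime import datetime

events = [
    ("2001-06-04 22:10", "maintenance", "system ROM updated"),
    ("2001-06-04 22:40", "maintenance", "NIC driver updated"),
    ("2001-06-04 23:05", "maintenance", "server reinstalled in rack"),
    ("2001-06-05 07:12", "failure",     "hard lockup, power cycle required"),
]

timeline = sorted((datetime.strptime(ts, "%Y-%m-%d %H:%M"), kind, desc)
                  for ts, kind, desc in events)

last_change = max(t for t, kind, _ in timeline if kind == "maintenance")
first_failure = min(t for t, kind, _ in timeline if kind == "failure")

for t, kind, desc in timeline:
    print(f"{t:%Y-%m-%d %H:%M}  {kind:<12} {desc}")
print(f"Time from last change to first failure: {first_failure - last_change}")
```
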
Reader Tip: Cold as Ice

I've used this trick only twice, and it worked once. One of my fellow IT coworkers had her hard drive fail. There were some pretty important e-mail archives on it, so it was important to get the data off if possible. However, we weren't willing to spend $2,000 to send the drive out for recovery. The drive wouldn't spin up, so I couldn't ghost it as I normally would. Instead, I packed the drive in a Ziplock bag and put it in the freezer for about 90 minutes. Once it was cold, I quickly hooked it up to the computer and was able to keep the drive up long enough to get the data off. The freezing trick worked until the drive got hot again, which took about 20 minutes.

-- Joanna Lovett
Cambridge, Mass.

Step 3: Develop a Plan
Based on my collection of data, I believed the problem originated from one of three things:

  1. Damage to the server (or one of its components), possibly during removal from the rack, servicing, or reinsertion into the rack
  2. Some type of firmware or configuration error (like a ROM version in relation to a third-party hardware card)
  3. Some type of OS/driver conflict. It could be that the server lock-up was occurring before the error could be logged

Working on these three likely scenarios, I proposed that we start troubleshooting in reverse order of what was done in the maintenance window -- start with software, then firmware, then hardware.
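
The reverse-order idea is simple enough to express directly: list the changes in the order they were made during the maintenance window, then undo them last-first, checking stability after each step. A minimal sketch follows; the change names and the check_stable() stub are placeholders, not an HP utility.

```python
# Minimal sketch: build a rollback plan by reversing the maintenance steps.
# The change names and the check_stable() stub are invented for illustration.
import time

maintenance_changes = [          # in the order they were applied
    "replace suspect hard drive",
    "flash system ROM",
    "flash drive firmware",
    "update array controller driver",
    "update NIC driver",
]

def check_stable(hours: float = 1.0) -> bool:
    """Stub: run the stress utility and watch the server for `hours`."""
    time.sleep(0)                # placeholder for the real observation period
    return False                 # pessimistic default for the sketch

for change in reversed(maintenance_changes):
    print(f"Rolling back: {change}")
    if check_stable():
        print("Server stable -- the last change rolled back was the culprit.")
        break
else:
    print("Still failing after all rollbacks -- suspect hardware instead.")
```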

Reader Tip: Known Good

We keep a "known good" PC in our office. We can remove the hard drive of a PC that won't boot and scan for spyware and viruses. We also have all of the major hard drive vendor diagnostics installed on the system, as well as several different data recovery programs.

-- Kyle Beckman
Fayetteville, Ga.

Step 4: Work the Plan
We began by uninstalling the changes to drivers and software, and within 15 minutes the server failed. We then tried back-revving the firmware to where the server was prior to the maintenance. This essentially put the server in a time warp, back to its pre-problem days. We observed the server for about 30 minutes, put a stress utility on it and went to lunch. When we got back, the server was down.
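
You don't have to sit at the console to catch the failure, either. While a stress utility loads the box, a trivial watcher running on another machine can log the exact moment the server stops responding, which turns "it was down when we got back from lunch" into a precise time-to-failure. Here's a rough sketch that assumes the server answers ping; the hostname and probe interval are made up.

```python
# Rough sketch: watch a server from another machine and log the moment it
# stops answering. Hostname, interval and the use of ping are assumptions.
import subprocess
import time
from datetime import datetime

HOST = "proliant-6400.example.com"   # hypothetical name for the test server
INTERVAL = 30                        # seconds between probes

def is_up(host: str) -> bool:
    # -c 1: one echo request; -W 2: two-second timeout (Linux ping syntax)
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

last_seen = None
while True:
    if is_up(HOST):
        last_seen = datetime.now()
    elif last_seen is not None:
        print(f"{HOST} stopped responding; last seen {last_seen:%H:%M:%S}")
        break
    time.sleep(INTERVAL)
```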

Based on these failures, I ruled out a software or firmware issue: the server was now back in the configuration whose track record had been exemplary, yet it was still failing. I then started working the hardware angle.

I removed the server from the rack and went over it physically, looking for problems. Nothing was apparent, so I replaced the system board and drive cage. The server then ran for about an hour (again under the influence of the stress utility), and failed.

Continuing to narrow down the list of potential causes, I started unseating and reseating every part, and was contemplating the "shotgun" approach of replacing all the parts inside the server at once, when I observed a little glint from the SCSI cable that connected the drive cage to the Smart Array Controller.

Reader Tip: Lights Out?

Here is the tip I taught at Bellevue Community College: "Do you have lights?" In other words, is it plugged in? Turned on? Does it have a connection? It was a way of reminding people to check the basic functions before trying something complicated. Occasionally, I would pull the network connection out of a student's practice server, and it could take them 10 to 15 minutes to think of checking the cable.

-- Matthew Damp
Mukilteo, Wash.

Step 5: See What Happens
I saw that the cable's plastic sheath had been worn away, as if scraped off by a knife. Based on the bend in the cable in relation to all the other parts of the server, it looked like it would be easy to pinch the cable when closing the server's lid. I removed the cable and replaced it with a new one.

After that, we ran the stress utility; for the next three hours the server ran fine. We were optimistic but not quite ready to declare victory. The following morning, the server was still running -- we'd cleared the next hurdle. We hung around until lunch and, with no problems, were growing more confident every hour. At 3 p.m. the server was still running. At that point, we had a review meeting to discuss the situation.

Step 6: Make Sure It Doesn't Happen Again
I said that the issue came down to a simple, easy-to-make mistake -- one I've made before -- of not paying enough attention to the SCSI cables when closing the server lids and/or sliding the server's system tray. I showed them the compromised cable, and explained that at some point the cable's exposed wiring would make solid contact with the sheet metal of the server's case and cause a short. Result? The server locks up.

I replaced the cable and demonstrated how to properly close the server lids without compromising the cable. To my knowledge, that server never had another problem.

Reader Tip: Linux Rx

Use Linux to help troubleshoot.

1. Boot up a system with a Knoppix CD.

2. Make the alleged "broken" device work under Linux.

3a. Explain to the user (again) why you don't give users administrative privileges that would let them load buggy software.

3b. Apologize to the user for the broken hardware (the other .0001 percent of the time).

Knoppix is a bootable Linux distribution on CD, with a baseline set of drivers and other utilities. It's always being updated, and you can order the CD at www.knoppix.org.

-- John Slater
Madison, N.J.

Sometimes It's the Simple Things
Throwing parts at a problem isn't usually the best solution. An expensive piece of technology can give anyone fits, and the knee-jerk reaction of ordering replacement hardware carries real costs: the server is usually down, the time required to order and receive the hardware adds to that downtime, and there's no guarantee the new parts will resolve anything.

Taking the time up front to "work the issue" and methodically troubleshoot will usually resolve the problem. Failure to do so, as was the case in this situation, often leads to a delayed resolution -- not to mention the additional expense incurred due to the replacement of perfectly good parts. In this case, it turned out that a $4 cable was the difference between emotional anguish and customer satisfaction.

With a methodical approach and deductive reasoning, most problems can be resolved. By using the ingrained six-step method, I was able to resolve the customer's problem and prevent further expense to both the customer and HP. You can do the same, whether you work at a huge, worldwide service vendor or a mom-and-pop shop with one server connected to four computers.
