9 Troubleshooting Tactics

A best practices guide that'll turn you into a troubleshooting efficiency expert.

Troubleshooting comes in all forms and sizes, but it's always an important task for IT. In some cases, you may be trying to solve a driver conflict for a single user, and in others you may be trying to solve a performance issue with an enterprise-level application that affects thousands of users. For many of us, the true fun (as well as angst and frustration) of being in IT stems from the task of troubleshooting.

Many IT staffers handle troubleshooting in an ad hoc manner. That is, they may react differently to the same problem each time it occurs, and there are no established practices for handling difficult issues. Often, they'll try and retry the same solutions, making even simple tasks seem mind-numbingly difficult. If you've taken any of Microsoft's certification exams, you know that your ability to troubleshoot problems is important. It's no less important in the real world.

Though the problems can vary dramatically, some basic troubleshooting techniques will help you find the most efficient path to solving a problem. In this brief article, I'm going to give you my troubleshooting tactics. Instead of diving into technical details (things like using PING to ensure that network devices are responding), I'll focus on more general principles that can be applied to many different types of problems (technical and non-technical).

With that goal in mind, let's look at some common troubleshooting best practices.

1. Identify The Desired Solution
OK, maybe I'm starting to sound like a Windows NT 4.0 exam here, but bear with me. It's important to figure out where you're going if you plan to reach your destination. If you ask users and business managers about optimizing performance, they'll often state that the goal is to make things run "as fast as possible." Though it might sound like a worthwhile initiative, it's rarely practical. First, most systems really don't need to achieve the theoretical maximum performance. Second, the cost to achieve maximum performance would probably be prohibitive. And how will you ever know what the actual "maximum" is if things can always be improved? A better idea is often to settle for the not-so-ambitious but much-more-practical goal of "good enough." You may state, for example, that Web-based reports should be returned to the user within 30 seconds. Or, that network logons should complete within 45 seconds, regardless of network load. With these well-defined goals in mind, you can address issues completely and to everyone's satisfaction.

2. Fully Understand The Problem
All too often, I've seen IT staff jump into finding the solution for a problem without fully understanding the issue. If a person complains about a problem with a soundcard on her desktop machine, you might be inclined to reinstall various drivers. But what if the issue is that there's too much bass output when she uses headphones? Or users might complain about performance issues related to accessing a particular resource (a corporate database server, for example). Although that's their original complaint, it's quite possible that the real issue is at the network level or perhaps even on the client side. By asking simple questions, such as "Who's affected by the problem?" "Are the problems repeatable or intermittent?" and "When did the problems begin?" you can gain significant insight into potential remedies. I can't stress this one enough: Be sure you fully understand the problem before looking for solutions.

3. Define Metrics and Make Repeatable Measurements
Suppose you're troubleshooting an intermittent issue. These tend to be particularly frustrating, since it's difficult to know whether or not you've solved the problem. It's important to establish some kind of metric. For example, if a user says that logons take "forever" on Monday mornings, ask the user to quantify it. If the logon takes 30 seconds, then perhaps it's more of a perception issue. If it takes 10 minutes, you've got a major problem somewhere. The important thing is that you have a way to measure the problem. Now, when you make changes, you can go back to these original measurements to see if you've made a difference. The approach that you take to resolving the problem should be based on this information.
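To make the idea of a repeatable measurement concrete, here is a minimal Python sketch. It assumes the thing you're measuring can be wrapped in a function you can call on demand; the names measure, summarize and improved, and the 5 percent tolerance, are illustrative choices of mine rather than part of any standard tool.

```python
import statistics
import time


def measure(operation, runs=5):
    """Time `operation` several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to a median and a worst case."""
    return {"median": statistics.median(durations), "worst": max(durations)}


def improved(baseline, current, tolerance=0.05):
    """Return True if the current median beats the baseline median by more
    than `tolerance` (a fraction; 0.05 is an assumed noise margin)."""
    return current["median"] < baseline["median"] * (1 - tolerance)
```

Record a baseline with summarize(measure(...)) before you touch anything, then repeat the same measurement after each change and compare the two summaries with improved().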

4. Document Your Troubleshooting Efforts
You've heard that those who don't remember the past are condemned to repeat its mistakes. The only thing more embarrassing than spending hours troubleshooting a simple issue is doing it more than once! OK, perhaps retracing your steps (running around in circles) is just as bad. To avoid such problems, be sure to document the steps you've taken to troubleshoot an issue and make sure that they're recorded for later use.

5. Make One Change at a Time
Scientific practices dictate that you should minimize the number of affected variables when you're trying to measure the effects of a change. Imagine this: You make several changes to a system at once in an attempt to affect overall performance. You're happy to find that overall performance has improved by 25 percent. However, unknown to you, some of the changes improved performance while others decreased it (see figure). You've improved performance, overall, but clearly you could have done a better job. The ideal solution is to make each change independently and then test each one. It's definitely more time-consuming up-front, but in the end, the results can be considerably more valuable.

Performance boost?
A simple troubleshooting effort that introduces multiple changes at one time. Although the overall effect is a performance boost, some of the changes actually reduced performance.
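
The same scenario can be sketched in a few lines of Python. Everything here is hypothetical: the settings, the toy_measure scoring function and its numbers are invented purely to illustrate how a net 25-point gain can hide a change that hurts.

```python
def isolate_effects(baseline_config, changes, measure):
    """Apply each change alone to a copy of the baseline and record its effect.

    `changes` maps a label to a dict of settings; `measure` returns a score
    where higher is better. All three arguments are illustrative stand-ins.
    """
    baseline_score = measure(baseline_config)
    effects = {}
    for label, settings in changes.items():
        trial = {**baseline_config, **settings}  # baseline plus exactly one change
        effects[label] = measure(trial) - baseline_score
    return effects


def toy_measure(config):
    """Made-up scoring function standing in for a real benchmark."""
    score = 100
    score += 30 if config["cache_mb"] >= 128 else 0
    score -= 15 if config["logging"] == "verbose" else 0
    score += 10 if config["threads"] >= 8 else 0
    return score


baseline = {"cache_mb": 64, "logging": "normal", "threads": 4}
changes = {
    "bigger cache": {"cache_mb": 128},
    "verbose logging": {"logging": "verbose"},
    "more threads": {"threads": 8},
}

effects = isolate_effects(baseline, changes, toy_measure)
print(effects)  # {'bigger cache': 30, 'verbose logging': -15, 'more threads': 10}
```

Applied together, the three changes net out to +25, yet testing them one at a time shows that dropping the verbose logging change would have yielded +40.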

6. Prioritize Your Issues
Good troubleshooters are often just waiting for the next challenge to stump them. In an ideal world, we'd have to focus on only one problem at a time, but few of us actually have this luxury. More often, you face multiple problems, all of which are important. In such cases, your first step should be to prioritize them. It's difficult to work on all of the issues efficiently at once, and task-switching can take significant resources. Juggling them all is frustrating and stressful, and you'll have a hard time focusing on any one problem. Instead, make a "hit list" of items based on their importance, and start knocking them out one by one. The small victories along the way will also help you know that you're making progress!

7. Prioritize Potential Solutions
If you're trying to find your way out of the woods, you're likely to have many different paths to take. Sometimes you'll have several hunches about what will solve a problem, but you can't try them all at once (remember the previous rules). Potential solutions differ in many ways, but the most important factors to consider are the likelihood that a solution will solve the problem and the amount of effort required to implement it. In some situations, you may choose to start with the easier solutions, even if they're less likely to solve the problem. Or you may choose to bite the bullet and go for the most likely solution, regardless of the amount of effort required. Having multiple teams tackle the possibilities can also be helpful. All of this is doable only if you first organize and prioritize your potential solutions.
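
As a rough illustration, the two orderings described above can be expressed as two sort keys. This is only a sketch: the Candidate class, the example fixes and their likelihood and effort estimates are all hypothetical, and in practice the estimates would be your own educated guesses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    likelihood: float    # estimated chance (0 to 1) that this fixes the problem
    effort_hours: float  # estimated time to implement and test


def quick_wins_first(candidates):
    """Order by likelihood per hour of effort, so cheap attempts come first."""
    return sorted(candidates, key=lambda c: c.likelihood / c.effort_hours, reverse=True)


def most_likely_first(candidates):
    """Order by likelihood alone, regardless of effort."""
    return sorted(candidates, key=lambda c: c.likelihood, reverse=True)


candidates = [
    Candidate("update NIC driver", likelihood=0.5, effort_hours=2),
    Candidate("rebuild server", likelihood=0.9, effort_hours=16),
    Candidate("restart service", likelihood=0.2, effort_hours=0.25),
]

print([c.name for c in quick_wins_first(candidates)])
# ['restart service', 'update NIC driver', 'rebuild server']
print([c.name for c in most_likely_first(candidates)])
# ['rebuild server', 'update NIC driver', 'restart service']
```

Neither ordering is the right one in every case; the point is simply that writing the estimates down makes the trade-off visible before you start.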

8. Make a Business Case for Troubleshooting Each Issue
Most of us are used to justifying costs related to longer-term projects, but the same principles should apply to troubleshooting. Here's one that most of us techies probably don't like: Sometimes the best "solution" might be to throw in the towel, give up, and live with a problem. I must admit that it was frustrating, but such was the case for an irritating intermittent issue I faced a while back. The problem seemed to be related to memory leaks in a third-party application that eventually led to a corruption of the network stack on critical servers. After considerable unsuccessful troubleshooting, we determined that it wasn't worth the additional effort to continue trying to solve the problem (the problem was rare, and we had much higher priorities). It was difficult to peel the team away from the issue, but we had been neglecting other priorities while we tried to hunt down our Moby Dick. It turned out that the costs related to solving this issue couldn't justify the potential benefit. The best "solution" was none at all. This may not be the case often, but you should always keep track of the amount of effort you're exerting to solve a problem, in order to keep costs in control.
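
One simple way to keep an eye on this is to log your hours and compare the running cost against what the problem itself costs. The sketch below is a rule of thumb under stated assumptions, not a formal cost model: the one-year horizon, the hourly rate and the incident figures are all hypothetical numbers chosen for illustration.

```python
def effort_cost(hours_logged, hourly_rate):
    """Total cost of the troubleshooting time logged so far."""
    return sum(hours_logged) * hourly_rate


def annual_problem_cost(incidents_per_year, cost_per_incident):
    """Estimated yearly cost of simply living with the problem."""
    return incidents_per_year * cost_per_incident


def should_continue(hours_logged, hourly_rate, incidents_per_year, cost_per_incident):
    """Assumed rule of thumb: keep going only while the effort spent so far
    is below one year's worth of problem cost."""
    spent = effort_cost(hours_logged, hourly_rate)
    return spent < annual_problem_cost(incidents_per_year, cost_per_incident)


# Three sessions on a rare problem: 24 hours at 75 per hour = 1800 spent,
# versus 2 incidents a year at 500 each = 1000 per year.
print(should_continue([6, 8, 10], 75, 2, 500))  # False
```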

9. Follow The Rules When You Can, But Make Exceptions When You Must
I'll concede that there are certainly circumstances in which all of my guidelines might not apply. For example, suppose you find that a critical production server is misbehaving, and no backup system exists. The goal is to get the machine up and running as quickly as possible, at any cost. In this case, you might take some risks. For example, you may make several changes at the same time in the hopes that the changes are independent and that one will solve the problem. Or you might be forced to investigate an urgent issue before you have the complete details. With that said, however, remember that the practice of ignoring the rules is reserved for exceptions and should be used only in a pinch!

I trust that you'll find these nine techniques useful the next time you're trying to tackle a sticky issue. If you use an organized approach that employs best practices, even the most annoying, frustrating and complex issues can be reduced to a simple, effective process.

Good troubleshooting!
