What Were They Thinking?
An HP hardware troubleshooting guru shares some of the worst snafus he's experienced so you can learn from them.
When it comes to configuring systems and networks, there's always more than one way to skin a cat. In fact, IT managers face so many options that it's easy to get in trouble.
Because we usually learn more from what someone did wrong, rather than what they did right, I'll share a couple of the more memorable encounters I've had with customers in my years as an HP hardware troubleshooter.
What's in a Name?
Back in the mid-1990s, an international customer was having all sorts of network issues. After initial discussions failed to turn up any obvious reason for network resources intermittently dropping offline, I flew out to their office on the Gulf Coast for a closer look.
On arrival, I discovered that network resources local to the company's U.S. facilities were fairly stable, and the same was true for the European offices. But communication between the U.S. and European offices was spotty. The configuration included a mixture of Windows NT 3.51 and 4.0, with Windows-based clients using the Microsoft implementation of TCP/IP. The company clearly needed consistent name resolution between the United States and Europe to resolve the issue.
As I began my investigation, I discovered that some legacy name-resolution mechanisms, like network-based and local LMHOSTS files, remained in place. It looked like the IT department hadn't fully purged them from the network when it implemented WINS resolution for the enterprise. So I went to work cleaning out the rest of the network-based LMHOSTS files. This helped, but only marginally. Communications between network resources on each continent remained intermittent.
It was time to dig deeper. I ruled out a physical disconnect by pinging remote servers by IP addresses. Even when U.S.-based clients couldn't access the servers by name, the servers remained accessible by IP address.
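The by-name versus by-IP comparison described above is easy to script. What follows is a minimal Python sketch, not the procedure used on site: the function name, the port (139, the NetBIOS session service) and the result labels are all assumptions made for illustration, and Python's resolver goes through the operating system's lookup rather than querying WINS directly.

```python
import socket


def classify_failure(hostname, ip_address, port=139, timeout=3.0,
                     resolve=socket.gethostbyname,
                     connect=socket.create_connection):
    """Separate a connectivity problem from a name-resolution problem.

    Returns one of: 'connectivity', 'name-resolution', 'wrong-address', 'ok'.
    The resolve and connect parameters are injectable so the logic can be
    exercised without a live network (an assumption of this sketch).
    """
    # Step 1: is the server reachable by its known IP address?
    try:
        conn = connect((ip_address, port), timeout)
        conn.close()
    except OSError:
        return "connectivity"

    # Step 2: the server answers by IP, so any remaining failure is naming.
    try:
        resolved = resolve(hostname)
    except OSError:
        return "name-resolution"

    # Step 3: the name resolves, but possibly to some other machine.
    if resolved != ip_address:
        return "wrong-address"
    return "ok"
```

A result of "name-resolution" or "wrong-address" while the IP address still answers points at the naming infrastructure rather than the wire, which is the same conclusion the ping test led to here.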
I discovered that if I created an LMHOSTS file or used the IP address of the server along with the share, I could access network resources. That led me to investigate the WINS infrastructure. I soon found that the push-pull replication was fine stateside, but that the servers were replicating their information to a U.K. server I couldn't access. I also wanted to check the European configuration and determine whether its WINS hierarchy replicated to the same server.
The European configuration was similar to the one in the United States. I found that these servers, too, replicated to the same master WINS server in the U.K. IT office, with the same target IP address. But the staff didn't have access to the target server. In fact, they didn't even know it existed.
Finally, the U.K. IT team performed a network trace and located the mystery WINS server and its owner. It wasn't in the server room. Instead, the mystery box turned up in a random lab, independent of the IT organization. When the IT staff went to check on the hardware, they found a couple of printers, an old Compaq DeskPro 386/20 running Windows NT 3.51 server and a few other miscellaneous items.
The obvious question comes to mind: Why would an international company, with a global infrastructure, put the linchpin of its name resolution on a desktop (and an old one at that!), rather than a server? The answer: By accident.
As it turns out, the server that should have been the WINS master had an IP address very similar to that of the DeskPro desktop. When two numbers in the address assignment were accidentally transposed, the error got copied by everyone configuring other WINS servers. The result: A snowball effect that directed traffic to an unassuming desktop.
The finding also explained the issue of sporadic name resolution, as a system in the lab would be rebooted at will. The "owner" of the DeskPro had no idea that his test system had unintentionally been utilized as the master WINS server for the entire organization.
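A copied typo like this one can be caught by auditing each server's configured replication partner against the intended master. The Python sketch below is purely illustrative: the function names, the server names and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical, and it assumes the partner settings have already been collected into a dictionary by some other means.

```python
def is_adjacent_transposition(a, b):
    """True if b is a with exactly two neighboring elements swapped.

    Works on strings (digit swaps) and on lists (octet swaps).
    """
    if len(a) != len(b):
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and a[diffs[0]] == b[diffs[1]]
            and a[diffs[1]] == b[diffs[0]])


def audit_partners(intended_master, configured):
    """Report every server whose partner differs from the intended master.

    configured maps a (hypothetical) server name to the partner IP it uses.
    """
    findings = {}
    for server, partner in configured.items():
        if partner == intended_master:
            continue
        # Check for swapped digits, then for swapped whole octets.
        swapped = (
            is_adjacent_transposition(intended_master, partner)
            or is_adjacent_transposition(intended_master.split("."),
                                         partner.split("."))
        )
        findings[server] = "likely transposition" if swapped else "mismatch"
    return findings


# Hypothetical example: two servers copied the same swapped digits.
report = audit_partners("192.0.2.45", {
    "us-wins-1": "192.0.2.54",
    "eu-wins-1": "192.0.2.54",
    "uk-wins-1": "192.0.2.45",
})
print(report)
```

Several servers flagged with the same near-miss address is the signature of one mistake being copied, as opposed to a handful of unrelated misconfigurations.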
I still wonder how something like that could happen.
I was an IT manager with a major trucking
firm, where we used SNA Server to let Microsoft clients talk
to an AS/400 server. Every now and then somebody's connection
would lock up and our procedure was to clear the hung connection
so the user could establish a fresh one.
We started having a problem where all the AS/400 connections
would fail simultaneously. It was an intermittent thing and
we couldn't pin down the cause. Checking the server showed
no connections and, more importantly, no error codes. A sweep
for viruses turned up nothing. We were stumped!
After yet another failure, I asked the department administrative
assistant, who in the past used to handle some IT functions,
"Did you do anything with the SNA services?" And she said,
"Sure. I cleared out the table."
There was that dawning, that moment, when I realized:
Oh my god, she's been clearing it out every day. Whenever
she got a call that a connection was hung, she thought her
job was to just go into the table and clear everybody out.
She had been doing this for weeks.
You should have seen my CIO. He just looked at me, shook
his head, pointed at the door and said, "Get out."
Clap On! Clap Off!
Here's a classic from a peer of mine. He got a call from a customer, a small
business with three servers, about a hard-to-pin-down power issue at their facility.
It seemed one of the servers would intermittently lose power. The initial phone
triage -- such as plugging power cords into different outlets -- didn't work, so an
on-site visit was in order.
Suspecting a bad power supply, the consultant entered the server room. The door was wood, with a metal frame -- the kind that rattles when you close it with a little attitude. In the server room, my counterpart saw the server was plugged into a "Clapper" device, which in turn was plugged into the wall.
That's right, the very same Clapper that turns lights on and off when you clap
your hands. As it turns out, when you closed the door with enough force, the
door rattled and triggered the Clapper, which then cut off power to the server.
We've all heard the stories. There's the guy who kept floppy disks pinned to a filing cabinet with a magnet, and the woman who plugged a surge protector into itself and wondered why the system wouldn't power up. These are the tales that try the patience of any IT professional.
Still, whatever the situation, I encourage you to take the high ground. If you pull off a miracle recovery, be humble. If you encounter colossal acts of stupidity, be kind when explaining the problem. Think back to your first days with computers and you might understand why some people make mistakes that now make you cringe.
Take me, for instance -- on my first day in my first professional job, I sat
at my desk for 10 minutes trying to figure out how to turn on my computer. There's
hope for all of us.
Tom Pruett has been with HP since 1997. During his career, he has worked as an IT consultant (focused on Microsoft solutions), as a network architect and on a help desk. He currently resides in Huntsville, Ala., with his wife of 14 years and their four children.