In-Depth

The Process That Wouldn't Die

This IT pro should have known better than to reboot the Primary Domain Controller on Friday the 13th. Read on for his special nightmare.

As the IT manager for a 30-person architectural firm, I’m where the buck stops when it comes to network and server uptime. After a routine reboot of our network Primary Domain Controller (PDC) at the end of the workweek, the system was completely unresponsive. The ensuing troubleshooting session turned into an all-nighter that included two long waits for tech support, five hours of phone time with two different technicians, 10 reboots and a wild runaway process—wrapping up with a successful conclusion as the sun came up.

The heart of our architectural business is CAD drawings. I invest a great percentage of my time in protecting these resources. Our two-server network might be simple in topology, but it’s critical to our survival and meeting goals in a deadline-heavy industry.

We have two Dell PowerEdge servers running NT 4.0 that provide network, file, print, backup, VPN, antivirus and e-mail services.

Every server needs an occasional reboot. The question is, “Which timeframe creates the least disruption and downtime?” I’ve found—through trial and error—that the most effective time for this in my firm is Friday evening. This particular Friday, which I later noted was the 13th, I planned to reboot the PDC and shut down the Backup Domain Controller (BDC) in order to perform maintenance on its Uninterruptible Power Supply (UPS) unit. Even allowing a few extra moments to circulate around the building and get my co-workers off the network, this should have been a one-hour job.

The server in this tale is our PDC, which also does duty as the CAD file server and network print server. This Dell 1400SC, with a PowerEdge RAID Controller (PERC) 3/SC card, a 1GHz single processor, 512MB of RAM and a DDS4 tape unit, has been a reliable workhorse since we brought it online in April 2002. Besides these functions, the machine also has Norton AntiVirus Server installed, which sends antivirus definition files out to workstations around the network.

Something Definitely Isn’t Right
I restarted the PDC at 6 p.m., then directed my attention to replacing the battery in the UPS that protects the BDC. That task completed, I checked back on the status of the PDC. Although the logon box appeared shortly and accepted my password, it seemed to take forever to bring up the desktop. I watched as the desktop appeared, icon-by-icon, then watched with growing concern as the items in the system tray disappeared one after another. Although the mouse pointer moved around the screen normally, double-clicking an item had no effect. The icon wouldn’t even darken to indicate it had been selected.

Get Past the Denial
At times like these, I have to take a moment to get into the proper frame of mind. The first thing is to get past the moment of denial. Even my optimistic self could tell that this wasn’t a good way to start the weekend. OK, take a deep breath—so my Friday evening was in jeopardy. I could deal with that. Better not think about what could happen to Saturday—yet.

I have a few rules of thumb for dealing with server issues. First, do nothing to make the problem worse without a clear understanding of the ramifications. Second, read all the available material before jumping in to try a rescue procedure. Third, swallow your pride and get as much help as you have available or can afford.

Time to troubleshoot. Sitting at the PDC console, I tried to bring up the Event Viewer. I watched the hourglass for 10 minutes. Nothing. The three-finger salute (Ctrl-Alt-Del) brought up the logon security box, and I excruciatingly tabbed over to the Task Manager button. Ten more minutes of watching the box struggle convinced me it was time for the reset button. Unfortunately, this would be only the first of many hot restarts that evening.

I repeated the whole procedure a few times, trying to get into Task Manager or Event Viewer for a clue to the source of the problem. Within a minute or so, the computer was unresponsive and I had to reset it again. It’s tough to diagnose a problem when you can’t get to your normal troubleshooting tools. Because we normally purchase warranties and phone support with our servers, I decided to start taking advantage of Dell’s expertise.

Jennifer
I waited much longer than usual in Dell’s server support queue, but because I’ve had nothing but top-rate help and service from Dell over the years, I waited cheerfully. Eventually, a support engineer named Jennifer verified my identity and the server service tag and got my initial description.

Because we purchased our server sans operating system, Dell’s support (in this case) extended only to hardware issues. Of course, many situations are a blend of hardware/software issues, and each technician has to make the call on how far he or she will extend support. In this case, we carried out a complete series of diagnostic tests to verify the status of the RAID card, onboard controller slot and each of the three 18GB SCSI hard drives.

Unplugging the SCSI cable from the RAID card and plugging it directly into the onboard slot didn’t cure anything, proving that the card wasn’t the problem. The onboard diagnostics began painstakingly scanning each hard drive for errors. Jennifer showed Job-like patience as I read back the results of each scan.

Hard drives one and two were fine, but drive three had a few errors that required moving a couple of sectors’ worth of information. This wasn’t very surprising, even on a normally functioning machine. The next step was to download another drive-testing tool and run a second round of tests. At this point, I began to appreciate that I could access the Internet conveniently from the other server. With the ADSL connection to myself, I hit Dell’s download site and was soon running the prescribed disk tests. This particular tool ran quickly (the only thing that did that evening) and shortly gave all three disks a clean bill of health.

This was good news for Jennifer and bad news for me. We’d established that the motherboard, RAID card, SCSI cable and hard-drive array were all functioning perfectly. I’d been on the phone for 2 1/2 hours at this point. We discussed my options: 1) booting to the Last Known Good Configuration, 2) initiating the Emergency Repair process, or 3) as a final resort, reinstalling the operating system and server applications. Foremost in my mind were the gigabytes of CAD files on the data partition. The full backup scheduled for 1 a.m. Saturday had not, of course, had a chance to run. This meant the potential for losing all of Friday’s changes.

Earning the IT Paycheck
As there are a few specific steps for reloading the PERC drivers on a Dell system, I took some careful notes as Jennifer spoke. Somehow, my mind kept telling me it wouldn’t get to option two or three. I made my last call home for the evening. I told my wife that Dell had done all it could for me, and it was now time for some lonely hours at the server. Or, as folks around the office remind me, time to earn my IT paycheck.

Restarting the server gave me a window of approximately 30 seconds in which to try to get a peek into Task Manager. When I managed to get the processor utilization monitor on the screen, it was pegged at 100 percent and never wavered. That told me it was a process or service using all the server’s resources—possibly multiple processes. It was a bit daunting to look at the long list of services running and realize that any one of them could be the culprit.
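
Looking back, the squinting I did at Task Manager during that 30-second window is exactly the kind of thing a small script does better. Below is a minimal sketch of sampling per-process CPU use and printing the hungriest offenders; it assumes Python and the third-party psutil package, neither of which was an option on that NT 4.0 box, so treat it as the modern equivalent of what I was attempting by hand:

    import time
    import psutil  # third-party: pip install psutil

    # Prime the per-process CPU counters; the first cpu_percent() call
    # only establishes a baseline and returns 0.0.
    procs = list(psutil.process_iter(['pid', 'name']))
    for p in procs:
        try:
            p.cpu_percent(interval=None)
        except psutil.NoSuchProcess:
            pass

    time.sleep(1.0)  # the sampling window

    samples = []
    for p in procs:
        try:
            samples.append((p.cpu_percent(interval=None),
                            p.info['pid'], p.info['name']))
        except psutil.NoSuchProcess:
            pass  # process exited between samples

    # The five hungriest processes, highest CPU share first.
    for cpu, pid, name in sorted(samples, reverse=True)[:5]:
        print(f'{cpu:6.1f}%  pid={pid:<6}  {name}')

Run the moment the desktop appears, something like this would have pointed straight at the runaway service instead of leaving me to eyeball a frozen process list.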

Sometime early Saturday, I decided I wasn’t getting anywhere and it was time to proceed with repairs using my Emergency Repair Disk (ERD) and the NT 4.0 CD. In the process of reviewing the repair procedure, I realized I wasn’t completely sure which model RAID card I had in the machine. I thought I knew everything about that server, as I did the initial installation and setup. Suddenly, though, I wanted to be very sure about the details. I browsed the www.premiersupport.dell.com site for the latest driver downloads for a PowerEdge 1400 and was surprised to see that that model was sold with at least three different brands of RAID controller. According to my files, the card was a PERC3/SC, but it wasn’t obvious to me which brand that was. Remembering one of my guiding principles, I decided not to proceed until I was sure I wouldn’t make a bigger mess out of the situation. That’s the IT version of the carpenter’s motto: “Measure twice, cut once.”
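
Nowadays the guesswork about which controller is in a box can be avoided by asking Windows itself. Here's a sketch of pulling the controller inventory via WMI; it assumes a modern system with the wmic tool (NT 4.0 needed the separately installed WBEM core, so this is hindsight rather than what was available that night):

    import subprocess

    # List SCSI/RAID controllers and their drivers as Windows sees them.
    result = subprocess.run(
        ['wmic', 'path', 'Win32_SCSIController', 'get', 'Name,DriverName'],
        capture_output=True, text=True, check=True)
    print(result.stdout)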

Chris
I picked up the phone and waited what seemed considerably longer this time for a live technician. At length, I had Chris on the line and he reviewed the notes from my call with Jennifer. One thing I’ve learned about calling tech-support centers: You’ll generally get a different tech each time, and each support pro has his or her own strengths. I’ve had several experiences in which the second tech had a different insight or could suggest additional options to try. The moral: Be persistent and glean what you can from each one’s experience. Chris held on through a few reboot/hang sequences and suggested rebooting NT into video graphics array (VGA) mode. By loading the OS with less demand on resources, I might have a longer time window in which to kill off processes. Wishing me luck, Chris turned me loose and grabbed the next caller from what he said was a very backed-up queue.

The Scientific Method
By now, I’d lost all track of time. The tantalizing prospect that I might be close to discovering the root of the issue spurred me on, and the adrenaline rush kept me alert. The plan was to use either the command prompt or Services applet to systematically disable all noncritical server functions until the runaway processor usage returned to normal.

Within a few minutes, I could see that this procedure was helping. By disabling two or three services during each window of opportunity, I gave the server less to deal with during startup. It took another hour or so of watching the machine move in slow motion, but finally the server rebooted normally, brought up the desktop, and the processor utilization counter hovered at 2 percent. What a moment of incredible relief! I hadn't wanted to entertain the thought of rebuilding a production PDC in the middle of the night—much less watch the backup unit restore gigabytes of important company files for hours on end.

Despite the relief, I knew I still had a lot of work to do to preserve Friday’s data and isolate the renegade process. I was confident that, by selectively enabling services, I could discover which service startup hammered the processor.
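
One refinement I'd apply now: since each test costs a reboot, re-enabling services in halves finds a single culprit in roughly log2(n) reboots instead of n. A sketch of that bisection, where reboots_with() stands in for the manual enable-this-set-and-reboot step:

    def bisect_bad_service(candidates, reboots_with):
        """Binary-search a service list for the one that pegs the CPU.

        reboots_with(enabled) is the manual step: enable exactly the
        services in `enabled`, reboot, and report whether the hang
        recurred. Assumes a single culprit.
        """
        good, suspects = [], list(candidates)
        while len(suspects) > 1:
            half = suspects[:len(suspects) // 2]
            if reboots_with(good + half):
                suspects = half                 # culprit is in this half
            else:
                good += half                    # this half is clean
                suspects = suspects[len(half):]
        return suspects[0]

With a few dozen services on the list, that's five or six reboots rather than dozens.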

With the server again functional, I started up the Veritas Backup Exec services one by one, watching the processor counter like a hawk. Each service started normally, so I put in a backup tape and started the normal “Full” backup job. I wanted to make sure we didn’t lose the work that 30-plus folks had put in on Friday. During the next hour or two, as the backup completed, I spent the downtime researching the many error messages in the System and Application event logs. There were errors from the UPS software, from Norton AntiVirus Server, from the Netlogon service and the tape-drive SCSI card. I’m sure the multiple hard resets and not being able to load services accounted for the majority of the stop events.

Hunting Down the Culprit
With the server safely backed up, I began the laborious process of pinning down which service promptly went haywire when started. It took only a few minutes to reveal that the Norton AntiVirus Server software was corrupted. I was delighted to have the problem isolated and even happier that the repair path was obviously a reinstallation. Naturally, the uninstall process from Add/Remove Programs didn't cooperate, so I had to completely remove all traces of NAV from the machine, reboot, then carry out a clean install per instructions in the Norton Knowledge Base. The server rebooted normally, and I carried out a precautionary full scan to see if perhaps a virus had made its way through the defenses and sabotaged the antivirus software itself. The scan came back clean.

Each of the workstations dependent upon this server for downloaded antivirus definitions had to be checked to verify the server-client relationship. I added the Windows 2000 clients from the server and hooked the Windows 98 clients back up from the workstation side. With all the tidying up finished, I felt ready to call it a night's work. As I left the office and set the alarm system, I noticed it was exactly 8 a.m.

I still don’t know why the antivirus service stuck on full throttle. It’s evidently not a common problem, as I found no mention of it on Symantec’s support site. It goes down in the “unexplained” category, yet the general issue of a runaway service is common in the server world.

Lessons Learned
• Prepare a folder for each server with the service tag, license codes and other documentation. The middle of the night isn't the time to be wondering exactly what hardware you have and which drivers are safe to load.

• Document BIOS and firmware versions. You won't always be lucky enough to have full Internet access to pull this information from a manufacturer's site.

• Document the brand of RAID card and its drivers. If you don't, you may not even be able to access your hard disks in an emergency rebuild.

• Make periodic System State backups and keep your ERDs up to date. A stale ERD or System State backup will only complicate a recovery effort by adding unknown factors; you'll actually be reverting to an earlier configuration of your operating system.
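
On the ERD point: NT 4.0 refreshes its repair information with rdisk.exe, and the /s switch also saves the SAM and SECURITY hives. A sketch of scripting that refresh follows; Python is used here only for consistency with the earlier sketches, since on a real NT 4.0 box this would be one line in a scheduled batch file:

    import subprocess

    # Refresh %systemroot%\repair. The /s switch also saves the SAM and
    # SECURITY hives (slow on a large domain); the trailing '-' skips
    # the prompt to create an ERD floppy, for unattended runs.
    subprocess.run(['rdisk', '/s-'], check=True)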


Reader Comments:

Tue, Feb 20, 2007 Cheri Leesburg, VA

As thrilling as a fiction novel, but even scarier because it's true! Two thumbs up! Good Job! I HATE NORTON.

Sat, Jun 11, 2005 Anonymous Anonymous

I just use Linux.... kill -9 never fails... :lol:

Mon, May 30, 2005 Larry W Holberg LTC RET'D Las Vegas Nv

The comment about the backup is my feeling also. It's too late when you're in trouble.

Tue, Jul 29, 2003 Anonymous Anonymous

I enjoyed reading this article.

Sat, Jul 19, 2003 Delboy UK

I believe Bruce's issue was to do with the NAV or a process related to NAV. It's happened to me before on a different system. It makes very interesting reading. Worth following.

Mon, Jun 30, 2003 kamau Kenya

That was an excellent article that we should learn from. IT pros are known to jump into troubleshooting without knowing exactly what we are fixing. That logical approach is a good one to follow; otherwise you might worsen the situation.

Fri, Jun 27, 2003 Anonymous Anonymous

Moral of the story? Stop using legacy MS software :-)

Tue, Jun 24, 2003 Anonymous London

Your account was so clear we can all relate to similar days!!
Others may say - you could have done it a lot quicker using a different tool - but you didn't know that. The main thing is, you patiently stuck it through and you saved the data!! Well done

Mon, Jun 23, 2003 Anonymous Anonymous

Mein Gott!! one for the books.

Wed, Jun 18, 2003 Anonymous Anonymous

A google search for 100% cpu use and "processes" turns up norton as the culprit! "Google" best knowledge base in the world!

Wed, Jun 18, 2003 Greg Kansas City

Enough said about this sysadmin's ability to troubleshoot NT, although I agree he needs some serious help.

Microsoft has unknowingly slammed themselves in this article. Here is an "experienced" Microsoft sysadmin who does not know how to monitor pegged processes from another computer. This is basic functionality that should have been built into any Microsoft server OS long ago. It is a good thing Sysinternals recognized this and developed PSTools FOR FREE, because you don't get this capability with the built-in Task Manager, with any of the base tools, or (to my knowledge) with the additional expense of the Resource Kits. Well, you could upgrade to 2000, or 2003, but wait, you still won't get this basic functionality!!!

How about taking a hint from this article, you Microsoft developers working on server OSes: build into the Computer Management MMC and the command line the ability to remotely monitor and kill runaway processes. This might save us (your customers) from having to learn hard lessons like the fella in this article.

Sat, Jun 14, 2003 Mike DK

Another tale of why Norton AV should never be let near a PDC...

Fri, Jun 13, 2003 Anonymous Anonymous

HEY I JUST REALIZED, PROBABLY BECAUSE OF TOO MUCH QUAKE III, BUT THIS ARTICLE IS A FEW DAYS OLD AND HE REFERENCED FRIDAY THE 13TH. THERE HASN'T BEEN A FRI 13 FOR 6 MONTHS. ALSO I AM TRYING TO FIGURE OUT IN MY MIND SOME OF HIS TROUBLESHOOTING SKILLS. ONE, DID HE NOT HOOK THE ARRAY DIRECTLY UP TO A SCSI CONTROLLER TO CHECK THE HARD DRIVES? TWO, WHAT VERSION OF NORTON? I HAVE BEEN RUNNING NAV CORPORATE SINCE 98 ON 150 SERVERS, NOT A PROBLEM EXCEPT WHEN NIMDA HIT; THE SERVERS WERE JUST BUSY. THREE, IF I RECALL, IT IS A SINGLE PROCESSOR AND ONLY 512MB OF RAM. OBVIOUSLY NO UPGRADE IN THE LAST YEARS, SO NO MAINTENANCE.
RUNNING NORTON, WHICH I GUESS ALSO RUNS SCANS ON THE OTHER MACHINES, AND DOING BACKUP, PLUS WINS AND ANY OTHER SERVICES, ALL ON A BASIC SERVER. THE HOURS WASTED COULD HAVE BOUGHT THE LATEST UPGRADES TO SYMANTEC, VERITAS. ALSO, WITH REGARD TO THE DELL, $500 UPGRADES YOUR MACHINE TO DUAL AND 2GB OF RAM. SOMETIMES AS NETWORK ADMINISTRATORS WE MAKE THE CARDINAL SIN OF NOT TAKING CARE OF THE MACHINES THAT MEAN THE MOST TO US: THE SERVERS. CONSTANT MAINTENANCE, UPGRADES TO BOTH THE OS AND HARDWARE, COMMON SENSE. THE BEST PREPARATION FOR SITUATIONS LIKE HIS IS TO PLAN FOR THEM.

Fri, Jun 13, 2003 Christopher Ohio

What the heck are you doing running Windows NT 4.0 in the year 2003 on your main servers?

Fri, Jun 13, 2003 Fam Luanda - Angola

Lotta wasted time with the support team; I really think you should try to make an image of the servers with PowerQuest's V2i Protector. Thanx for sharing your experiences.

Thu, Jun 12, 2003 randall north carolina

In re the hypercritical comments about Bruce -- critics are like eunuchs -- they can talk about it but cannot do it... This type of situation will occur for even the best IT manager, regardless of experience level... so cut him some slack, guys (and gals!)!!!

Tue, Jun 10, 2003 Anonymous Anonymous

Hello, there you. First things first: 1. Did you at least look at the event log first? Second, why would you have so much mission-critical data on a PDC? Third, if you need Dell to go through a 2.5-hour process to tell you that you had good hardware, you are indeed not prepared.
Simple, simple: first, data on a different partition. Second, if using Dell, Symantec, and Veritas, did you purchase the IDR? With a DDS4 and Veritas with IDR, the server would have just needed to be restored from the day before. If a 4GB partition, 15 min for a server restore. But also, did you do a backup before rebooting? I am not yelling or complaining, but that stuff is common sense. In fact, servers that are properly set up and maintained shouldn't have to be rebooted at all. The biggest strike against you is the wasted time. Should have bought a ProLiant; there would be no question on hardware, because Insight Manager will page you and tell you there is a problem. Document whatever; if you were serious about your job you would know everything there is to know about your 2 servers. 3. Practice, practice, practice. Hell, I have installed new printer drivers only to have the server BSOD on me and not boot. Big deal; with proper motivation and a lot of planning, I just installed the bootable IDR disk, ran a partition and registry restore, rebooted, and from there went on to deal with the mess: service pack, new drivers, reboot, etc.... No need to waste time on this site on some trivial stuff. As a professional you wouldn't have to whine to the world about a simple task; just complete the mission.

Mon, Jun 9, 2003 glynn houston

I'll start getting my list of 'lessons learned' 1st thing tomorrow morning.

Mon, Jun 9, 2003 antonio KC US

Thanks for sharing your story, Bruce. As constructive criticism, these should be our lessons learned:

1. Document, document, document.
2. Create a disaster recovery plan. It should include what to do in cases like the one you ran into. For example, a document that includes instructions on what to do when server performance is affected. Disabling 3rd-party services and AV software should be one of the first steps in troubleshooting servers.
3. Back up everything on your servers before doing anything, including rebooting the server.
4. Don't put all your eggs in the same basket; use a different partition for your data or, even better, use a dedicated server for this role.

MCSE Server plus

Mon, Jun 9, 2003 Super-64 Chicago

Been There Done That!!

Sun, Jun 8, 2003 Maame Serwaah IPMC, Tema-Ghana

I did not read it all.

Sun, Jun 8, 2003 Georgina IPMC, Ghana

Nice story. More of those.

Sun, Jun 8, 2003 Bobbie IPMC, Ghana

That is a nice experience in "trouble" and troubleshooting.
Chao

Fri, Jun 6, 2003 lesly jean-denis NEW-YORK

Great job Bruce, and thanks for your advice. I do appreciate your patience.

Wed, Jun 4, 2003 Karlis Latvia

Good story, interesting comments. No panic, and this is the main thing if you are in trouble like this.
Learned something new about the Recovery Console on NT; I hope I will not need it, but who knows...

Tue, Jun 3, 2003 Anonymous Anonymous

Good lessons learned, but the negative feedback amazes me. People in different IT environments will have different experiences. Some ITs have blasted the author for not using the recovery console right away; they must not work with NT, because NT doesn't have a RECOVERY CONSOLE. Rather than criticize and ridicule the author for some mistakes, constructive criticism would be a lot more helpful to everyone else who reads these comments for getting their “experience”.

Tue, Jun 3, 2003 ChipChick Dayton, Ohio

Great read on "real world" learning experience. We've all been there at one point. Anyone who denies that is in denial themselves about where they started. Congrats on surviving the problem!

Tue, Jun 3, 2003 Anonymous Anonymous

Good job.

May want to pick up a W2K Server CD. I learned that the command console of W2K would boot into an errant NT4 server. Could disable, start, stop & enable services from the CC.

Sun, Jun 1, 2003 Anonymous Anonymous

This certainly provides a lot of useful knowledge to those boys out there.

Sun, Jun 1, 2003 Anonymous Anonymous

Our IT manager had me read this--perhaps sometimes he does earn his salary.

Sat, May 31, 2003 Anonymous Anonymous

erd right off the bat? come on dude...try something

Fri, May 30, 2003 jeff Chico, CA

OK procedure, I guess, but the CPU time counter in the Task Manager should have told him which proc was the issue right away..... As for Norton: one big headache. Switch to Bitdefender Server ed. and you won't have problems like this.

Fri, May 30, 2003 Anonymous Anonymous

If I were the employer, I would have gladly paid the extra labor hours to know that my tech had taken his time with my network and data!

Thu, May 29, 2003 Anonymous Anonymous

no substance!!!

Thu, May 29, 2003 Michaël Belgium

A very interesting story.
I have had the same experience a while ago.
The clue is: never do any restarts or interventions on a production machine without first having a backup and the ERD disks prepared. Antivirus software can cause a lot of problems, and a distribution server is better installed on a separate machine.

Thu, May 29, 2003 Ete London

Good read - apart from being a tested IT pro, Bruce has a great style of writing. I have really learned a lot from this piece. Well done.

Thu, May 29, 2003 Tim Anonymous

You should have had ERD Commander and shut down some services to try and isolate the problem.

Thu, May 29, 2003 Anonymous Anonymous

Been there, done that

Thu, May 29, 2003 DAVE Boston MA

this is just another reason why you need qualified people doing network administration

Thu, May 29, 2003 Anonymous Anonymous

Been there.

Thu, May 29, 2003 Anonymous Anonymous

Thank You for telling the rest of the World.

Wed, May 28, 2003 Anonymous Anonymous

Two thumbs up!

Wed, May 28, 2003 Ray Fayetteville, NC

Have been there with similar results... although I think the Bad Juju was closer to 18 hours.

Wed, May 28, 2003 David Anonymous

Seen it all before. ALWAYS backup before downing your servers!!

Wed, May 28, 2003 Paul Plymouth Ma.

Another good story from the trenches. I've been in a similar situation with a Compaq ProLiant and did a parallel install into winnt2 just to get the beast up, then swapped back and forth to find and eliminate the problems.

Wed, May 28, 2003 John Chicago

I have only seen this problem with the NAV Exchange agent. And turning off the service and then turning it back on days later resolved the problem. In this case, the problem appeared to be conflicts between the system backup process and a 'scheduled' Exchange mail store scan.

Whenever I run into a 'hung system problem' I always start by reducing the services started until I get down to the core set. This really helps in identifying problems.

Nice job.

Wed, May 28, 2003 Anne Plano, TX

I realize that with only two servers the options are rather limited, but I couldn't help experiencing ever-rising degrees of horror on reading what else was running on the PDC. And then there's the NT bit... just wondering what the plan is for funding a support call to Microsoft when the next issue looks like it's OS-related...

Wed, May 28, 2003 Jude Washington DC

Great job and thanks for reporting your experience.

Wed, May 28, 2003 Bob Pittsburgh

Very well written and suspenseful article. I read with sweat on my brow, pondering what my next move would be. It sounds like you made good choices and avoided making matters worse. I survived a similar experience with the Windows Time service and W32time.exe going head-to-head for 100% of the processor time. It was a painful experience, especially when your hands are tied with the processor at 100%.

Wed, May 28, 2003 Anonymous Anonymous

Good experience... Thanx for sharing it with the rest of us. But to be perfectly honest, that might have been a good weekend to upgrade to Windows 2000.

Wed, May 28, 2003 Warren Forks Wa

We are getting ready to deploy two 2000 DCs with Norton's antivirus software. A good heads-up to remember.

Wed, May 28, 2003 Anonymous San Jose

Just one suggestion: Use Trend Micro's products! I have never heard of their products killing a server -- can't say the same thing about Norton.

Wed, May 28, 2003 Curtis Bradley Washington DC

Excellent story. I especially like the denial phase. I've been there a couple of times.

Wed, May 28, 2003 Anonymous Anonymous

Sounds like my life

Wed, May 28, 2003 Anonymous Anonymous

Great read.
One word of advice... There is a product from Winternals called ERD Commander 2002. An excellent product that will allow you to easily disable services from an Explorer-like interface. It's a bootable CD-ROM. I've used it before to recover a few servers. When you have 250 servers you're tshooting a lot. ;-)

Wed, May 28, 2003 Nelson S.F.

One thing I do is put the documentation listed at the end on an intranet web server. That way the info is available from any workstation and the boss can see what we've got without having to write a separate document for him.

Wed, May 28, 2003 Anonymous Anonymous

Good recommendation list at the end. A little odd that someone with over 5 years' experience wouldn't be following these procedures in the first place.

Wed, May 28, 2003 Edi California

In my experience, a PDC slowdown would be handled first with a shutdown of all extra services ... I was surprised at the hardware analysis ... maybe the author had experiences that led him down that path.
Working for a small company, the author probably wears too many hats to be able to keep on top of an active network. All the best and thanks for the read.

Wed, May 28, 2003 Jack Las Vegas

Very good read - I really enjoyed it. Thanks for a well written story!

Wed, May 28, 2003 Eric Greensboro

It is good to see others following a "scientific method". We're a Dell shop as well and exact determination of the PERC in use is rarely simple...especially when it comes to driver updates.

Wed, May 28, 2003 Thomas NC

Very well written.

Wed, May 28, 2003 Jean Sim Jersey

We have had problems with Norton AntiVirus and Exchange servers in the past and had to change to different AV software.

Wed, May 28, 2003 Tom Anonymous

This sounds very familiar .. I've had a few of these situations in my 27 years in IT.

Wed, May 28, 2003 Adrian Richmond, VA

Whoa, Camel. I'm a little surprised by the circumstances of this disaster:

First, why in the world would you schedule maintenance (of any type) to occur prior to a full backup? You were begging for disaster.

Second, you mention that the Last-Known-Good option was available to you. It was not. Once you receive the "Press Control-Alt-Del" window, it's too late for LKG Configuration.

This is a great story, and I'm impressed by your candor, but honestly you should have known better.

Wed, May 28, 2003 Anonymous Anonymous

This should be a required read for all consultant customers to help them better understand how large bills can sometimes happen.

Wed, May 28, 2003 Anonymous Anonymous

The Lessons Learned list is now on my to-do list!

Wed, May 28, 2003 David Soede MCSE CCNA Central Coast, Australia

I appreciate Bruce's candour in his article but respectfully suggest downing the BDC before ensuring the PDC was operational was a BIG mistake in an environment with only 2 DC's!

Holding the "shift" key down when re-booting into VGA mode also prevents services running placed in the registry keys for HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run & HKEY_CURRENT_USER\SOFTWARE\Microsoft\Windows\CurrentVersion\Run which helps.

Although it is easy in hindsight from this article, I always turn off anti-virus when faced with these types of problems - in my experience with Symantec, McAfee and InnoculateIT, this kind of maximum-resource-consuming issue occurs way more than it should. All AV products - necessary though they are - carry a performance hit to some degree, so in any performance-related issue they should be the first to be disabled.

Lastly, I hope Bruce can convince his firm to spread the load amongst more servers; even basic low-spec'd units can easily carry PDC/BDC/DNS/DHCP etc. roles, and these are best offloaded from mission-critical servers (like your CAD server) for precisely the reasons shown in this article. The first 3 lessons learned should be part of basic documentation for ANY network, no matter how small, and any IT pro - particularly an MCSE - should be doing this as a matter of course, second nature. The last lesson learnt . . . can I suggest Bruce implement a comprehensive DRP system NOW, in consultation with an expert. He is absolutely right that he is responsible for 30 other people's work in one respect, and this needs to be shouldered before something else goes wrong - and it will.

Wed, May 28, 2003 Anonymous Michigan

Nice article. As consultants we try to sell the value of documentation, documentation, documentation. All too often customers fail to see the value. Had the documentation been completed in the first place, the second call to Dell would not have occurred.

All too often the mindset of "time to earn my IT paycheck" is due to a lack of knowledge and a lack of best practices. Nothing speaks truer to this than this article.

All too often IT gets a bad rap for excessive overtime due to negligence from the IT manager-administrator.

Good to see situations like this add to my value.

BTW- Did you get paid overtime?

Wed, May 28, 2003 E Anonymous

Gee, I guess most if not all who come to this site have lived this at least once; I could write a couple of similar stories of my own...... What's so spectacular about this????? Good writing and description, but I've seen so much worse......

E
MCP MCSA MCSE MCDBA

Wed, May 28, 2003 Bucky Dallas, TX

This was an excellent read. I particularly like the way the writer emphasizes not trying to fix things until you know you won't break them even more.

Wed, May 28, 2003 Philadmin London

Been there, done that, BUT this was Trend Antivirus for SBS and Backup Exec misbehaving on a service pack update. (This stupidity caused Trend to withdraw the product!)
My lesson learned is to set the AV and backup services to manual start before service-packing a server.
It is very unlikely to be hardware-related on a reboot. This whole lunacy is why we keep third-party SW on servers to an absolute minimum.
PS: whoever criticised Bruce for rebooting once a week, er, I don't think he actually said that, only that when he did reboot, a Friday was the best time!

Wed, May 28, 2003 Anonymous Anonymous

Great to hear; we have two VTC servers that, after being installed as Norton clients, kept crashing intermittently.

Wed, May 28, 2003 Novaro Bern, Switzerland

I'm in good company. Well written and thanks for sharing your experience.

Tue, May 27, 2003 David Anonymous

Before doing anything "significant" to a production server, PLAN (in writing) what you're going to do, and how you'll recover if and when something goes wrong. Consideration of the latter point would have led to a realisation of the point already made by a couple of readers: backup BEFORE. If you don't want to wait for a full backup, a differential will do (provided you have confidence in your previous full backup).

And review the event logs BEFORE the reboot (eg while waiting for the backup to finish).

Finally, I appreciate that with only 2 servers they'll both be DCs and will also host user data; but you should at least put it in separate partition(s).

And have a contingency plan to backup individual partitions, or the whole disc array, WITHOUT having to boot the main OS -- eg using a disc image app like Ghost.

Tue, May 27, 2003 MilesK San Diego

Man, so many memories came flooding back to me from Long Nights of the servers in my past that needed to be brought back to life. Thanks for the read. :)

Tue, May 27, 2003 Anonymous Anonymous

This problem, in my opinion, is fairly common. I am surprised it took Dell that long to help you. I guess it is true when you said each support pro has his or her own strengths.

Tue, May 27, 2003 Anonymous Anonymous

This is why we use McAfee.

Tue, May 27, 2003 Tom Los Angeles, California

Well written story, sharing an experience. We all have our own Friday the 13th stories, and by sharing our experiences and what we learned from them, those who read them become stronger and more knowledgable. There is another Friday the 13th coming in June!!!!

Tue, May 27, 2003 Anonymous Anonymous

Going into VGA mode should have been the first step.

Tue, May 27, 2003 Craig Oxnard, California, USA

Best written & most meaningful article in a long time. I too have suffered from Norton, but to to this time extent.

Tue, May 27, 2003 Theo Mills South Carolina

This experience was well worth sharing. I have never encountered this issue but, if it ever happens, I will remember these solutions.
Theo, MCSE, MCSA, MCP, A plus

Tue, May 27, 2003 Eric Golpe Philippines

I had Norton AVS also, on a Win2003 server, with a process NSCTop.exe that seemed to self-promote itself to the illustrious category of "elite and most-hyper thread" in the process model. I had to remove the NAV admin tools and reinstall NAV in a basic config to rectify the problem, which was sending discovery packets from hell all around the subnet. ROFL.. great relation...

Tue, May 27, 2003 Curtis Charlotte, NC

Your methods suck! I have never worked with a "production network" and even I know that a sluggish system indicates processor or RAM overuse. I personally would have rebooted immediately into VGA mode (or for 2K, Safe Mode) and begun analyzing the processor utilization. I guess NT doesn't show the per-process usage like 2000 (I have limited knowledge of NT). As others have mentioned, the Recovery Console is a good alternative for a situation like yours. "Earning the IT paycheck": who exactly are you referring to? The Dell techs were the only ones earning their pay that night. You should have prepared for a disaster BEFORE attempting maintenance. Every time you restart your system you run the risk of some type of failure. You state you reboot the PDC weekly; that seems sort of odd. Do you have something leaking memory? Your article is interesting, but isn't worthy of any technological merit.

Tue, May 27, 2003 Anonymous Anonymous

I thought I was the only one who ran into trouble like this. LOL. Very good story.

Tue, May 27, 2003 jeff Anonymous

So:
1. The Saturday full backup should have been run before the reboot.
2. Shoot software first if the hardware boots to GUI.
3. The Recovery Console, disabling 3rd-party services, should be the first tool used.

Did I miss anything? BTW, Bruce did fix it.
Nuff said.

Tue, May 27, 2003 Anonymous Anonymous

very informative

Tue, May 27, 2003 Tom Phoenix

Nicely written, but only fair on troubleshooting skills. If the machine boots to GUI, then goes catatonic, it's probably not a hardware issue. Disabling all noncritical services should be the first line, regardless of which tool you use to get there. Regarding the remote management of services from another box: if the system is that locked up by a runaway, it's unlikely you'll get remote tools to connect before they time out.

Tue, May 27, 2003 Ray Anonymous

Good read... although some of the user comments are quite rude. We run Symantec products on a lot of machines with no problems if installed and configured correctly. Not to say there aren't times when there are issues, but what software doesn't have issues? This article wasn't about blaming something or someone; it talked about how he got out of a situation he found himself in. It could help someone who reads it. Thanks for sharing.

Tue, May 27, 2003 Jack Rochester, NY

A good read...thanks for sharing the experience!

Tue, May 27, 2003 Greg Tampa

Good article and very enjoyable to read. These 'user' comments are also very interesting.

Tue, May 27, 2003 Tom Wilkes-barre

Have had something similar happening with the newest version of Symantec Corporate Anti-virus, but not to that extent; the server still boots. A manual uninstall is the solution.

Tue, May 27, 2003 Greg K Seattle

Gee whiz. Don't be afraid to use Recovery Console. I usually do it first thing if I can't boot a server. I use the LISTSVC and DISABLE commands to disable services and/or drivers. You soon find out whether it's hardware or software this way, and you can be just as patient and systematic but with more proactive results. (I don't think LSTSRV is a Recovery Console command, Cliff.)

Tue, May 27, 2003 Anonymous Anonymous

I have had similar problems with Symantec products, although not one like your hung process. I can especially relate to the "cannot uninstall using Add/Remove". At times, I have had to use special uninstall tools from Symantec, or clean the registry of all references to Norton, Symantec, etc.

Tue, May 27, 2003 Anonymous Anonymous

Great story. I'm sure we all have had similar experiences at some stage; we should all benefit from the knowledge of others in the "little" things we often overlook. I myself am in the middle of creating a disaster recovery manual for my employer, as practically everything about the network is in my head!!!

Tue, May 27, 2003 Anonymous Anonymous

This is an example of excellent documentation of the problem and the steps toward resolution. We back up every night, so we wouldn't have had the problem that Bruce had with the potential for losing data; but I have 130 users, and it would have been disastrous to lose that. Still, his approach was very patient and methodical. Congratulations on succeeding!

Tue, May 27, 2003 Anonymous Anonymous

Right on, Cliff! Always kill 3rd party services first, ask questions later!

Tue, May 27, 2003 TonyC Sacramento, CA

Well-written story. I have also had issues with runaway NAV.
Some of the stress could have been prevented by not storing critical data files on a DC.

Tue, May 27, 2003 Anonymous Anonymous

Instead of "The Process That Wouldn't Die" it should have been "The Sysadmin That Didn't Know Where To Start Or What To Look For Or How To Kill It But Made Some Money Writing An Article Anyway Because Some Hardware Techs Held His Hand All Night".

Tue, May 27, 2003 Cliff Tucson

As an ex-support tech for MS, when I used to get this call, the first words out of my mouth were "What do you have running on the machine?"

You would have told me what you thought I wanted to hear, and I would have asked the second question: Do you have Norton running on the machine?
I would have asked you to uninstall it. When the uninstall failed, I would have asked you to reinstall it over the top of itself to repair it. Then we would have uninstalled it one more time.
That's not because I'm sure that Norton is the culprit; it's so that I can have complete access to the registry, the System32 folder, and anything else that would have to be accessed with a service pack reinstall.
The moral of the story was learned hard. You don't turn Norton off; it turns you off.

In hindsight:
What you could have done was put in a Windows 2000 CD, any 2000 CD; select repair and then the Recovery Console. Once you were in the console you could have disabled all the services with the LSTSRV command and rebooted. Yes, this does work on NT 4.0. But hey, coulda, shoulda, woulda, who knows.

Tue, May 27, 2003 Anonymous Anonymous

Very good, and a very patient IT pro to stay calm when needed.

Tue, May 27, 2003 Anonymous Anonymous

The other Lesson Learned is don't write troubleshooting articles on NT if you don't know how to troubleshoot NT.

Tue, May 27, 2003 Anonymous Anonymous

The funny part about the Lessons Learned is that you didn't need any of this info to fix the problem...

Lesson Learned:
Stop all unnecessary services before spending an evening on the phone with Dell troubleshooting a nonexistent hardware problem.

Tue, May 27, 2003 Kim W Kahler Fremont, CA

Maybe the thing to learn from this is rather: Try to decide as early in the process as possible if you have a hardware or software failure.

NT4 and Win2K users usually know that if the machine hangs after a reboot, it is very often a service running wild.

Tue, May 27, 2003 Richard Anonymous

Poor Bruce. This article shows how NOT to troubleshoot a bad reboot. 99.9% of all problems on server-class machines are software related, so he should not have wasted time on the hardware. Also, the correct method to disable services is to use the Win2K recovery console. Yes, you can install it to hard disk or run it from CD on NT4 systems! This would have saved HUGE amounts of time.

Tue, May 27, 2003 Tim Los Angeles

I've had the same kind of experience with an Exchange server that had to be rebuilt on a Friday night because a user insisted on opening an email they were told not to open. :(

Tue, May 27, 2003 Steve Anonymous

A good administrator wouldn't have anything to worry about if he were on top of the network as he should be.

Tue, May 27, 2003 Richard Kilmartin harrogate

Very enjoyable read; been there, got the tee shirt, prefer not to go back again though! We are using Norton Corp; I might just have a look at the Task Manager!!!!

Tue, May 27, 2003 Bill Charlotte

We've had to quit using Norton AV because of similar problems seriously slowing down MS SQL apps.

Tue, May 27, 2003 Anonymous Anonymous

One thing that struck me as odd: www.premiersupport.dell.com does list the hardware a system was originally sold with. Nothing easier than that to find out what RAID controller is in the box. Short of a reboot and looking at its banner, that is.

Tue, May 27, 2003 Anonymous Anonymous

We have a lot of problems with our antivirus software as well.

Tue, May 27, 2003 Hakeem Fahm Washington, DC

Thanks for sharing your experience and making recommendations. This may happen again, but you just have to "be Prepared".

Tue, May 27, 2003 Adam Boston

You might also want to run a special full backup AHEAD of schedule before doing major work on your server.

Tue, May 27, 2003 Dave Ayers Fresno, California

Great job of reporting your experience.

Tue, May 27, 2003 Anonymous Anonymous

Been there; glad to know someone else has also.

Tue, May 27, 2003 Anonymous Anonymous

Another great tale from the trenches. I can always count on MCPmag to provide worthwhile reading.

Tue, May 27, 2003 Cliff Citrus Heights

I can relate to this story, having lived through similar horror shows trying to recover a Dell server with a RAID array. Bruce's troubleshooting approach and his preventative measures are worth following.

Tue, May 27, 2003 Anonymous Anonymous

This was an enjoyable read.
