Dumb IT Mistakes: Readers Share Their Biggest IT Goofs
Even the sharpest tools in the IT shed blunder from time to time. Here are a few of your biggies.
On April 21, the Amazon cloud service went down, with large chunks of the network staying offline for three full days. Even worse, some customers never got all their data back.
Was it a natural disaster? Hackers from Bulgaria? No, an Amazon IT worker simply misconfigured the network, moving large volumes of traffic to parts of the network designed for low-capacity work. Ouch.
While that blunder was colossal in its impact, errors like it are rather simple to make. In fact, 16 Redmond readers came clean with their biggest blunders. For the sake of job security and future career prospects, some readers preferred to remain anonymous; we identify them with pseudonyms in this story.
For some, it feels nice to get these mistakes off their chest. More importantly, these stories serve as lessons about what not to do.
Reader Drew once wrote a script designed to make a database snapshot every morning at 4:30. Drew asked a co-worker to "make sure that this script is run daily at 4:30 a.m." No problem, the co-worker said.
For three months, all was perfect. Snapshots and backups happened right on schedule. Then, one fateful Saturday, Drew awoke at 6 a.m. to a phone call: "The script hasn't run today," his co-worker said. Drew asked how he knew. "Because I haven't executed it," the co-worker explained.
Drew was confused, "but after a few minutes of explanation I was incredulous to find out that this guy was waking up daily at 4:30 a.m. and executing this script manually!"
Early Lessons Learned
Many years ago, a reader we'll call Steve took his first IT job doing hardware support when he was right out of college. Oddly enough, the job was with a small community college. One day he was sent off to clean the college president's laser printer.
"I had never seen the insides of a laser printer, but I knew how to clean things, so off I went. These were the days before self-contained toner cartridges, when toner was distributed via a magnetic roller that just carried a layer of toner on its outside surface. You see where this is going," Steve explains. "White walls, white carpet, white furniture, all now a light shade of grey. I'm pretty sure that episode eventually cost me the job, but I've gone on to bigger and better things since. Still, it was a rough way to begin a career."
It may have been all for the good. "Lesson to be learned -- early, often and all too frequently the hard way -- don't be afraid to say, 'I don't know,'" Steve advises.
More experienced workers also mess up. A.M. is an IT vet who has made his share of mistakes. He says he's a cautious sort who always has a failsafe plan in place, so even the worst mistake can be sorted out in a few hours. Others with whom he works are not always so careful. This is what happened at a past employer with offices across the United States, which were all connected through a WAN.
"We were on a Novell NetWare network, and our office was one of the 'trees' on the Novell Directory Service [NDS]," A.M. explains. One Saturday, the team was migrating from Novell 4.x to 5.x.
Unfortunately, it wasn't the same old team, as a new IT director had fired the previous network manager and hired a new one "who was completely incompetent -- just as he, the IT director, was. The new network manager, instead of announcing that on Saturday the network would be completely down all day or possibly all day, told the head of the accounting department that her team could come in that weekend and the network would only be down 'a few hours.'"
Upgrades like this are rarely as simple as they seem, and this was no exception. During the upgrade, the accounting department and the new IT boss kept asking when the network would be back up, making the upgrade team anxious.
"The senior network admin who was doing the upgrade started rushing ... and that's when things went wrong. Another admin and I were in the server room when we heard, 'Oh no, no, no, no oh [expletive]!' We looked at each other knowing it was going to be an all-nighter."
The network admin had deleted the directory tree in the NetWare NDS partition. "User accounts, gone. Network objects, gone. Printers, gone. Our network was cut off from the other offices, and there was no backup. We couldn't log on to computers; we couldn't print or access files on the file server. We were dead in the water," A.M. recalls.
The IT team worked the rest of the weekend and deep into Monday rebuilding print queues, network objects and basically bringing the network back to life.
The worst was yet to come. A.M. took the fall, and on that Monday was called into the office of the new IT director, who was sitting with the new network administrator.
"Apparently, I came 10 minutes late, and that was why the senior network admin had to rush. I was waiting for the punch line, and when nobody laughed, I asked if they were serious. That was my first lesson in office politics -- one I will never forget."
In the late 1980s, reader M.M. had to upgrade Novell NetWare at a U.S. Department of Defense office. He found the NetWare 3.x file and print server to be pretty much a mess. "Documents were mixed in the Sys volume; it had no administrator for a long time, and I found tons of porn on it as well," he says.
"I was looking at various logs and stuff and noted errors regarding the disk mirror. Stupid me, I decided I would come in early one morning and reboot the server. I noted it had been up and running for more than 300 days," M.M. remembers. "So without backing up anything -- the tape drive was broken -- I restarted."
Bad idea. "The 'out of mirror' disk rebuilt the server, so in effect it reverted back in time about eight months. The bindery, as they called it, and everything else came up as it was at one time in the days of yore. Of course, no one could log on, including the system account. How I wished I had a time machine."
Back in the days of Windows NT, Brendan Buschi (who allowed his real name to be used here) taught certification courses. All students had their own Windows NT workstations, with 20 machines in each of eight classrooms. "They were all part of a single domain that was run by a single NT 4 server," he explains.
The network admin at the polytechnic school was looking to lock down the machines and tried to do so through setting up some system policies. "The result was not what he wanted. He booted and logged onto every workstation only to find that the workstations were unable to use key Windows components. Instead of shutting down the workstations, changing the system policy and restarting the workstations, he simply deleted the system policy from the domain server. He was surprised to find out that the workstations were still operating under the now-deleted system policy and he had no idea how to correct the problem," Buschi recalls.
The admin asked Buschi for help two hours before classes were set to start for the new term. "It took me almost exactly that amount of time to correct the problem. We were literally leaving the last classroom we worked on as the students were entering it," Buschi says.
Buschi saved the day in an even bigger way when working at a research lab where he implemented a Windows 2000 domain. The network administrator worked remotely and told people what to do over the phone.
Once Buschi set up the network, he began to worry the tape backup wasn't working. "I was told that I was not to touch that -- that was under the control of the network administrator," he says.
"Out of the four lab servers, one contained 30 years of research. The data was not mirrored to any other server but was backed up every day to tape. I suspected the backups were failing and advised them to try restoring data to test them. I was told to take a hike. On my last day, I brought in an external drive and made a backup. I was not supposed to do that, but I figured that if something happened, I would at least have a backup of the data as of that date."
Two weeks later, Dell sent over a technician who wrecked the array, including all three of its drives. Buschi was asked to install new drives and "then told to go home."
The network administrator in charge of the tape backup told the Dell tech how to restore the data from tape. "There were tapes for each day, and tapes were rotated so there were 20 tapes total. Each and every one was completely blank!"
The company, now frantic and unable to restore, called Buschi back in. "They were totally nuts. The director told me that they would not be able to continue in business without the data," Buschi remembers. "Luckily, I was able to restore everything up to the point when I backed it up. They lost their last two weeks of new data before the RAID was destroyed, but they didn't care. They were ecstatic."
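Buschi's advice to test the backups by actually restoring from them can be partly automated. What follows is only a minimal Python sketch, not anything the lab used: it assumes the restored files have been copied to a scratch directory, and the names verify_restore and file_digest are invented for illustration. It compares every source file against its restored copy by SHA-256 hash and reports anything missing or different.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """List relative paths that are missing or differ in the restored copy.

    Hypothetical helper: both arguments are assumed to be local directories.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        # A file that is absent, or present but with different bytes
        # (for example, empty), counts as a failed restore.
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file came back intact. A set of blank tapes like the lab's would show up as every path being reported.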
Y2K Not OK
Reader Brooke Justice, who agreed to let us use his full name, came clean to Redmond with a story he had previously only shared with his immediate family. During Y2K, Justice worked in a shop that ran Oracle. Most of the apps in the midsize shop had no Y2K glitches, so they could simply be left alone. But Justice discovered that the Oracle Installer was problematic. Unfortunately, there was no workaround at the time.
"I was using my workstation in part of the testing, meaning the same one I performed my other admin duties on. Not real smart, but extra equipment was hard to come by since everyone was testing, and I had already identified the problem, which seemed pretty minimal to test," Justice explains.
As Justice tested, he would regularly uninstall most everything and then manually edit the registry to get rid of the leftover Oracle entries.
"During this testing, I got a call that I had to address. Somewhere in troubleshooting and resolving the problem, I connected to a key Oracle middleware server's registry. After I was done, I went back to Y2K troubleshooting. I knew I had uninstalled the Oracle products and installer, but had not removed the registry junk it left over. So, I opened the registry and removed all the Oracle entries that I could find," he says.
Justice tried to reinstall the Oracle Installer, but it didn't work. Looking at the registry, he quickly caught his mistake. "In Windows NT, if you opened a remote registry and closed regedit without closing the remote registry, it would open up the remote registry the next time you opened regedit. [This] meant that I had removed all of the Oracle registry entries on a key Oracle middleware server instead of on my workstation! That would affect 1,500 users in short order -- not good!" he explains.
Justice made for the elevators, realized they were too slow, and dashed downstairs to the basement datacenter. He quickly located the ArcServe backup tapes, and then shot next door to grab the off-site tapes. "After grabbing the tape, running back to our basement server room and putting it in the drive, I started the restore of those Oracle registry keys. While it was completing, I started getting calls in the basement: 'We've been paging you. We have a bunch of people calling in who can't connect to Oracle.' While I was restarting the services, I said, 'Yeah, not sure what the problem is. I can connect from here just fine. Have them try again.' Voilà, they could connect just fine. 'Just some sort of gremlin,' I told them. Whew, disaster averted! And 10 years of silence broken."
Greg used to work for a government agency that needed content management, so it bought a commercial product and then paid contractors to set it up, which involved quite a bit of customization. The customization brought along with it a hugely inefficient UI.
"Whenever the users added a new document to the system, they would have to enter a document profile that contained 11 required fields. Some of the fields used drop-down list boxes to help standardize the entry. The only problem was that the list boxes would only match the first character of the entry and some of the lists contained tens of thousands of entries -- no lie! In order to get the users to use the system, the agency hired more contractors to enter the profile information for the users," Greg says.
Now the content management system software has been upgraded by the vendor, but the agency's version is so customized it can't easily be upgraded. And the older version of the software is moving off into the sunset.
Active Directory Debacles
Active Directory came in with its share of confessions. Take Sue. She worked for local government and needed to delete a user who no longer worked in the city manager's office.
"I had a serious case of click-without-reading-itis and deleted the entire OU for the city manager's office -- which runs the entire city. I was new enough to AD that I didn't know I could restore the OU and frantically started recreating user accounts. It was really awful!" she says.
A similar thing happened to reader Lee.
"Several years ago, I was deleting a computer object in AD. Instead of deleting the object, I deleted the OU that contained all of our member servers. I spent an hour rejoining all of the servers to the domain and rebooting them. There were about 40 total," Lee says. "At least the domain controllers weren't in there."
Some AD mistakes were made by others and were caught by Redmond readers. Reader Gregg remembers how one of his company's technical support staffers "deployed his computers, which were Windows in AD, with domain users as part of the local administrators group. When I asked why he was doing that, he said the students couldn't install any applications and he was tired of all the phone calls. I asked if he considered that as a security risk, and he stated that I need not worry because the staff and faculty locked their doors at night."
Reader Michael Hall, who let us use his real name, had one of the more embarrassing gaffes. The Christian camp and conference center he worked for had just moved to Microsoft Small Business Server 2003, which was protected by Trend Micro server messaging, antivirus and filtering software.
The antivirus install went fine, but the filtering hadn't yet been configured. A good deal of profane and pornographic material got through in the meantime, and the camp CEO ordered Hall to configure the filtering pronto.
"I quickly worked through the setup and accepted all the defaults until I came to one check-box that said something like 'must match filter exactly,' with a preset check in it. It seemed to me that was restrictive, and I thought it would be better to uncheck it. I also set it so that the server would send notices to the sender, recipient and administrator about the offending e-mail and to delete the e-mail from the server. I finished the configuration and left for the evening," Hall explains.
That night he checked his e-mail, saw some mail about the filtering, and went to bed unconcerned. "Very early the next morning, I started getting phone calls that e-mail was slow or down. They also said that we were getting calls from staff, customers, pastors and even the bishop's office about e-mails that had been sent out in the last 12 hours. The e-mails notified the sender that the e-mail they had sent us was not delivered because it contained profanity!" he recalls.
Hall dashed to the camp and found the server horribly overloaded, with the admin's mailbox alone containing a quarter million messages. After freeing up space, Hall found the problem. "One little check-box! Without the 'must match filter exactly' box checked, the filter was hitting on words like assistant, associate and similar words. It had also gone into everyone's inbox, sent mailbox and every other Exchange folder, replacing the content of the 'offending e-mails' with a notice saying that the content was deleted, and sent notices to the sender, receiver and administrator. While that was bad enough, the notice must have contained a word that met the filtering criteria because it sent another notice to the recipient ... a loop that continued to grow and bog down the whole server."
It took a couple days to sort out, and more time for Hall to explain and apologize to the powers that be. "Everyone understood and I wasn't blamed for the incident. I also have never again been pressured to 'fast track' the installation of new software or hardware," Hall says.
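The check-box behavior Hall describes is consistent with the difference between substring matching and whole-word matching. Here is a minimal Python sketch of that difference. It assumes a hypothetical one-entry blocklist, since the story doesn't give the actual Trend Micro rules or settings, and the function names are invented for illustration.

```python
import re

# Hypothetical blocklist entry, chosen because it is a substring of the
# innocent words Hall mentions ("assistant", "associate").
BLOCKED = ["ass"]


def substring_hit(text, blocked=BLOCKED):
    """Flag text if a blocked term appears anywhere, even inside a word."""
    lowered = text.lower()
    return any(term in lowered for term in blocked)


def whole_word_hit(text, blocked=BLOCKED):
    """Flag text only if a blocked term appears as a standalone word."""
    return any(
        re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE)
        for term in blocked
    )
```

Under these assumptions, substring_hit("Please call my assistant") is true while whole_word_hit on the same sentence is false. A notice that quotes the offending message would trip the looser check again, which is one plausible way a loop like the one Hall saw could keep growing.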
While many of these stories are from the past, some are so recent that the sting hasn't yet gone away. That's how it is with the reader we'll just call R.S.
"This just happened yesterday. One of my remote technicians decided he would run a defrag on the site domain controller and file and print server in the middle of the business day. As with all our business-critical systems, we have alerting set up for system failures. We started receiving alarms of CPU and memory usage hitting 100 percent consistently," R.S. says.
Seeing performance so shockingly poor, R.S. logged into the server and saw the tech logged in and the defrag running. "I very politely sent him a system message asking him if he was truly running a defrag during the business day ... Needless to say, the process was terminated and all alerts ceased."
In the 1990s, reader Louis ran Unix and Windows NT servers at a manufacturing company. There were two main Unix boxes, one for developing a document-management program and the second for quality control.
Late one night, it was time to upgrade the OS, and the development server install came out fine. "Feeling confident that I knew what I was doing, and not wanting to waste another night in the computer room by myself, I decided to upgrade the OS of the production-quality monitoring server right there and then. I popped in the CD and started the upgrade. It complained that the partitions were a bit too small, but I ignored the message and said I was sure the sizes were OK. After a while, it stopped copying files from the CD, as the partition was full! I had a half-upgraded OS that wasn't working, as some components were upgraded and some weren't," Louis recalls.
In panic mode, Louis tried a restore from a DAT tape. Unfortunately, the new half-installed OS couldn't read the old backup tape's format. The only choice was to put the old OS back in place and restore from there.
"Morning came. The production line started and the boss was breathing down my back wanting me to restore the server to its previous evening's state ASAP, as the quality team could not verify the quality of product coming off the line -- but I couldn't make the install and the restore go faster. It took all day and evening to install and restore the data. By the next day, the server was back in operation, but it took a good 24 hours to get it back there. Some of my colleagues had to pitch in, as I was having a hard time thinking straight because of the lack of sleep and constant grilling of the boss," Louis says.
"I learned my lesson that day," he says. "Plan twice and execute once. Fortunately, I did not lose my job over that incident, but I'm sure I came very close."
The Blame Game
Many times, IT is unfairly blamed. It's bad enough when bosses do it. It's even worse when the press gets involved. That happened to one Redmond reader whose company recently upgraded its ERP software. When the software was put into action, a company employee complained about paycheck problems. The worker didn't just complain internally; he went to the local press. Then a large computer publication (not this one) picked it up as an example of one of the "worst ERP implementations."
Workers went to their union, which hired an auditor to figure out how IT had messed this all up. After three weeks of investigation, it turned out to be incompetent departmental timekeepers who were at fault. "IT paid for contractors to replace them, and voilà, problem solved. Yes, there were lessons to be learned from the implementation. But was it a blunder? By no means," the IT pro concludes.
The Right Thing Goes Wrong
Another unidentified reader was leaving his job as "a single-person IT department for a software company." In order to not leave the company in the lurch, this worker tried to finish up an important data-conversion project.
"The new network admin, who was hired without me even interviewing him, decided to replace the existing antivirus. This slowed development-build and data-conversion times from one hour to 10 hours because he didn't know how to configure it, and then he lost the admin password to change the config," the reader says. "His next project was spending $50,000 on new Cisco switches, which brought down the entire network for two days because he didn't understand VLANs, Cisco IOS or half/full duplex. Took him four weeks to go from Ron to MoRon."
As the previous anecdote demonstrates, one of the worst things you can do in IT is remove vital security protections, which is what happened to one unidentified reader.
"Our voice team was hitting all the remote user laptops, installing IP phones with software. When they got to our cubes, they requested for us to disable the antivirus, as it was annoyingly alerting when they did the IP phone software install," the IT pro explains. "Yeah. There was a good reason it was alerting -- their install files were infected. They had already infected a good number of users before getting to us. Not a good day for them."
Video Killed the Database Star
In the days of video stores, a friend of Redmond reader Harry was looking to automate. Instead of asking Harry to build the software, the store owner went with someone else, who had no IT training and wrote a dBase II app that didn't work.
"I was asked to help, and when I was working on his application I wanted to delete some test files that I created and entered 'erase *.*' But I was in the wrong folder and automatically replied 'Y' when asked to confirm. Yep -- I deleted all his source code and programs, for which there was no backup. I used the old 'un-delete' from PC Tools to get back some of the files, but could not get them all back. Starting from scratch, it took me two weeks to get a basic system back in place."