Tips for Troubleshooting Physical Hardware Problems in a Virtual Datacenter -- Redmondmag.com

Virtualization complicates the process of finding and remedying hardware problems, but a few techniques and tools can make the process more manageable.

By Brien Posey
07/06/2011

Although it's usually easy to troubleshoot hardware problems on a server or PC, the difficulty of locating faulty hardware is compounded when virtualization is involved. But there are some proven techniques you can use to isolate hardware problems on a virtualization host.

What's so difficult about tracking down hardware problems on a virtualization host server? It isn't always difficult to figure out that you have a hardware problem. Some hardware problems affect the entire system. For example, if a server's power supply goes bad, then the entire server will shut down (assuming that no redundant power supplies exist). The same can be said for hard disks. If a hard drive fails, then the failure is usually obvious.

Other types of failures aren't so easy to detect, however. One of the trickiest components to troubleshoot is memory. A few weeks ago, for example, I had a virtualization host server that started acting strangely. Although memory was ultimately to blame for the problem, the virtualization layer essentially hid the hardware problem.

I first noticed the problem when I was trying to complete an article. Any time I write an article, I have to test the procedures I'm writing about, and I usually also have to get a few screenshots along the way. As such, I have several virtualization hosts set up to run a number of different virtualized lab servers. On this particular evening, I logged on to a lab domain controller (DC) to create a user account. This particular virtual server has been around for at least a year and has always been reliable.

When I opened the Active Directory Users and Computers console, I discovered that, though the console opened, it wouldn't display any information. I tried a few fixes, but none of them worked. As I was up against a deadline, I solved the problem by bringing a new DC online. Luckily, this DC was able to read all of the Active Directory information from the DC that was giving me fits. Later, I deleted the DC that was giving me problems and didn't give the problem any more thought.

About a week later, someone asked me to write an article about SharePoint 2010. Even though I have several lab-based SharePoint deployments, I needed to set up a dedicated SharePoint server for this particular article. So I created a new virtual server and installed Windows. The installation completed without errors, but when I went to install the SharePoint prerequisites things started getting a little bit crazy. For example, when I attempted to install IIS, the Server Manager returned an error message I'd never seen before.

Because I was once again up against a deadline, I didn't have time to troubleshoot the problem. Instead, I blew away the new virtual server and tried creating it again. This time, I wasn't even able to get through the entire Windows installation. After installing about half of the Windows files, setup gave me a message telling me that some of the required files were missing or corrupt. At this point, I assumed I must have a bad Windows Server installation DVD. The type of work that I do requires me to set up lab machines constantly, and I'd been using this particular installation DVD for quite some time. It seemed completely plausible that the DVD could have somehow been damaged from all of the excessive handling. I created a new Windows Server installation DVD, and that seemed to fix the problem.

A few days went by, and I began working on an article about Exchange Server 2010. I tried to open the Exchange Management Console on one of my lab servers (which had been in place awhile) only to discover that the console displayed several error messages. This is when my mind flashed back to some of the other problems I'd been having, and I realized that all of the problems were probably connected to each other.

If these types of problems had occurred on a physical server, I might have realized more quickly that I had a memory issue. However, because I was operating in a virtual server environment, the physical memory problem was a lot less obvious.

Some of you are probably screaming, "How could you not see that?" Here's how: All of the problems I described are indicative of memory errors. However, the problems were always isolated to a single virtual machine (VM). The server as a whole seemed to be OK. Even though I was having trouble with certain VMs, there were many other VMs running on the same host server without any problems. This initially led me to assume that my problems were related to a specific VM rather than to the server hardware.

Troubleshooting Hardware Errors
Now that I've shown you what can happen when hardware problems occur on a virtualization host server, I want to outline some techniques you can use to get to the bottom of these types of problems.

The first step in diagnosing tricky hardware problems on a virtualization host is to pay attention to where the problems seem to manifest themselves. In my particular case, the problem was always isolated to a single VM. However, it wasn't always the same VM that was affected.

If you start noticing that one of your VMs is having problems that may point to a hardware issue, then you might try booting the VMs in a different order. The reason for doing so is to see if the problem moves. If the VM that was giving you fits suddenly starts working and a different VM begins having problems, you have strong evidence that the issue lies with the underlying hardware rather than with any one VM.
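To make the "does the problem move?" test concrete, here is a minimal Python sketch. It assumes you've jotted down which VMs misbehaved after each boot order; the function name and the VM names (dc01, sp01, ex01) are hypothetical and exist only for illustration.

```python
def problem_moves(failures_by_boot_order):
    """Return True if different boot orders produced different failing VMs.

    failures_by_boot_order maps a boot order (a tuple of VM names) to the
    set of VMs that showed problems after booting in that order.
    """
    # Collect each distinct, non-empty set of failing VMs.
    distinct = {frozenset(vms) for vms in failures_by_boot_order.values() if vms}
    # More than one distinct set means the symptoms followed the boot
    # order, not a particular VM.
    return len(distinct) > 1


# Hypothetical notes from two test boots of the same three lab VMs.
observations = {
    ("dc01", "sp01", "ex01"): {"dc01"},
    ("sp01", "ex01", "dc01"): {"sp01"},
}
print(problem_moves(observations))  # True
```

If the same VM fails no matter how you shuffle the boot order, the function returns False, and the VM itself (or the memory skew described next) becomes the more likely suspect.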

Of course, there are a couple of things you need to keep in mind using this technique. If memory is to blame for the problem, then the amount of memory that has been allocated to each VM can skew the results of this test. For example, let's say you're trying to solve a problem on a host server that's running three VMs. Let's also assume that one of the VMs has been allocated 4GB of memory, while the other two were only allocated 1GB of memory each. Because one VM has a disproportionately large amount of memory allocated to it (at least compared to the other ones), that VM may continue to use at least some of the same physical memory modules each time it's booted, regardless of the boot order.
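The skew is easier to see with numbers. The sketch below uses a deliberately simplified model that is my own assumption, not a description of how any real hypervisor works: each VM gets one contiguous block of physical memory, handed out in boot order. Under that assumption, it works out which range a VM occupies in every possible boot order.

```python
from itertools import permutations


def physical_ranges(boot_order, sizes_gb):
    """Give each VM a contiguous [start, end) range in GB, in boot order."""
    ranges, start = {}, 0
    for vm in boot_order:
        ranges[vm] = (start, start + sizes_gb[vm])
        start += sizes_gb[vm]
    return ranges


def always_used_range(vm, sizes_gb):
    """Return the range the VM occupies in every boot order, or None."""
    lo, hi = 0, sum(sizes_gb.values())
    for order in permutations(sizes_gb):
        start, end = physical_ranges(order, sizes_gb)[vm]
        # Narrow down to the part shared by all boot orders seen so far.
        lo, hi = max(lo, start), min(hi, end)
    return (lo, hi) if lo < hi else None


# The example from the text: one 4GB VM and two 1GB VMs (names are made up).
sizes = {"big": 4, "small1": 1, "small2": 1}
print(always_used_range("big", sizes))     # (2, 4)
print(always_used_range("small1", sizes))  # None
```

In this toy model the 4GB VM lands on the 2GB-to-4GB range no matter when it boots, so a bad module in that range would keep hitting the same VM, while the 1GB VMs have no range in common across boot orders.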

The other important factor to bear in mind is that you must take the severity of the error into account. For example, if the memory problem is so bad that it's wreaking havoc on the affected server, then you probably don't want to try to reproduce the problem within a different virtual server, especially in a production environment. After all, all data passes through memory before it's committed to disk, and you don't want to run the risk of corrupting production data.

Intermittent Memory Problems
The signs of a memory problem may not always be confined to a specific VM. Depending on how memory has been allocated, multiple VMs could be using the faulty memory module. If that happens, it should be fairly obvious that the server has a hardware problem.

What can be more difficult to diagnose, however, are situations in which memory problems seem to move from one VM to another even if the VMs haven't been rebooted in a different order. Such problems can occur as a result of using dynamic memory allocation.

In case you're not familiar with dynamic memory allocation, it's a concept that exists in both Hyper-V and VMware. The basic idea is that, rather than allocating a set amount of memory to a VM, you instead specify the maximum amount of memory a VM can use. By doing so, VMs will use only the memory they need. As additional memory is required, that memory is dynamically allocated to the VM. This helps increase VM density by eliminating memory waste. However, the use of dynamic memory allocation can also make it difficult to troubleshoot memory problems, because the affected memory range could potentially be reallocated to different VMs.
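Here's a toy Python illustration of how a bad region can change hands. It is not how Hyper-V or VMware actually manage memory; the MemoryPool class, the page numbers and the VM names are all hypothetical, and the only point is that memory released by one VM can be handed to another.

```python
class MemoryPool:
    """A toy pool of numbered pages handed out lowest-free-page first."""

    def __init__(self, page_count):
        self.free = list(range(page_count))
        self.owner = {}  # page number -> name of the VM that holds it

    def grow(self, vm, pages):
        # Give the VM the next free pages as its demand increases.
        for _ in range(pages):
            self.owner[self.free.pop(0)] = vm

    def shrink(self, vm, page):
        # The VM gives a page back; it goes to the front of the free list.
        if self.owner.get(page) == vm:
            del self.owner[page]
            self.free.insert(0, page)


BAD_PAGE = 1  # pretend this page sits on the faulty module

pool = MemoryPool(page_count=6)
pool.grow("vm_a", 2)                 # vm_a holds pages 0 and 1
pool.grow("vm_b", 2)                 # vm_b holds pages 2 and 3
first_owner = pool.owner[BAD_PAGE]

pool.shrink("vm_a", BAD_PAGE)        # vm_a's demand drops
pool.grow("vm_b", 1)                 # vm_b's demand rises
second_owner = pool.owner[BAD_PAGE]

print(first_owner, second_owner)  # vm_a vm_b
```

Neither VM was rebooted, yet the bad page moved from one to the other, which is exactly the kind of wandering symptom that makes these problems hard to pin down.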

Diagnosing Memory Problems
There is a wide variety of tools available for diagnosing memory problems. My personal favorite is a free tool called Memtest86+ (memtest.org). The nice thing about Memtest86+ is that it runs from a boot CD rather than from within Windows, which allows it to test areas of memory that are normally in use by the Windows OS.

But having a good diagnostic tool is only the first step. You still have to figure out how to use the tool to diagnose memory problems.

The best option for testing a server's memory is to shut down the entire host server and boot it from the Memtest86+ boot disk. However, there are a couple of problems with this method. For starters, it practically takes an act of Congress to shut down a host server and all of the VMs that are running on it. This is less of an issue, however, if the host server is part of a hypervisor-level cluster.

The other problem with testing an entire physical server is that servers used as virtualization hosts are typically equipped with huge amounts of memory, sometimes ranging into the hundreds of gigabytes. It takes a long time to test that much memory.
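A quick back-of-the-envelope calculation shows why. The Python sketch below is only an estimate: the throughput of 1GB per minute per pass is a placeholder I made up, not a measured Memtest86+ figure, so substitute whatever rate you observe on your own hardware.

```python
def estimated_test_hours(memory_gb, passes, gb_per_minute):
    """Rough test duration in hours for a given amount of memory."""
    minutes = memory_gb * passes / gb_per_minute
    return minutes / 60


ASSUMED_GB_PER_MINUTE = 1.0  # placeholder rate, not a benchmark

# A 256GB host versus a single 4GB VM, two passes each.
print(round(estimated_test_hours(256, 2, ASSUMED_GB_PER_MINUTE), 1))
print(round(estimated_test_hours(4, 2, ASSUMED_GB_PER_MINUTE), 1))
```

Whatever the real rate turns out to be, the estimate scales linearly with the amount of memory, so testing a small slice of the host's memory takes a small fraction of the time.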

My advice for diagnosing memory problems is to try to isolate the problems to a specific VM by turning off any dynamic memory features. Once you determine which VM is affected, you can try running a memory diagnostic utility such as Memtest86+ from within the VM.

Simply boot the VM from the Memtest86+ boot disk. Assuming the VM is using the bad memory, the memory diagnostic utility you're using should be able to confirm the existence of a problem without you having to test all of the memory in the entire physical server. Best of all, you can leave the other VMs running while you're performing the test.

Fixing the Problem
Once you determine that a memory problem exists, you still have to replace the faulty modules. You have two options for doing so. One option is to simply replace all of the memory in the entire server. Doing so minimizes the amount of time that the host server is offline, but the repair could be expensive.

The other method you can use is to remove one memory module at a time until you figure out which module is causing the problem. This technique saves money on replacement costs, but leads to more downtime than you'd experience if you elected to replace all of the server's memory. The other disadvantage is that the process of elimination doesn't work very well if multiple memory modules are damaged.
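The process of elimination, and the reason it breaks down with more than one bad module, can be sketched in a few lines of Python. This is a simulation, so the set of faulty modules is passed in; on a real server you wouldn't know it, and the check inside the loop would be replaced by actually running a memory diagnostic. The slot names are hypothetical.

```python
def find_faulty_module(modules, faulty):
    """Simulate pulling one module at a time and retesting.

    Returns the module whose removal makes the test pass, or None if no
    single removal clears the errors.
    """
    for removed in modules:
        remaining = [m for m in modules if m != removed]
        # The simulated test passes only if no faulty module is still installed.
        if not any(m in faulty for m in remaining):
            return removed
    return None


slots = ["DIMM_A1", "DIMM_A2", "DIMM_B1", "DIMM_B2"]

# One bad module: pulling it makes the errors go away.
print(find_faulty_module(slots, {"DIMM_B1"}))             # DIMM_B1

# Two bad modules: no single removal yields a clean test.
print(find_faulty_module(slots, {"DIMM_A2", "DIMM_B1"}))  # None
```

With two bad modules, every test still fails because at least one faulty module is always installed, so the procedure never singles out a culprit.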

So why should you bother running a tool like Memtest86+? Why not just replace all of the server's memory at the first sign of trouble? Replacing an entire server's memory can be expensive, so it's important to confirm that memory really is causing the problem before you incur that expense. I've seen system board problems that mimic memory problems. You don't want to replace all of a server's memory only to discover that there was nothing wrong with the memory in the first place.

About the Author

Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.
