Posey's Tips & Tricks
What Causes Hyper-V Replication Failures?
Hyper-V replication failures happen rarely, but their impact can be catastrophic when they do. Know the scenarios that are likely to trigger a replication failure.
Over the years, I have developed something of a love/hate relationship with Hyper-V replication. On one hand, replication is one of my favorite Hyper-V features. Since my own production environment is not large enough to justify building a Hyper-V cluster, I use replication as a way to protect myself against a host-level failure.
But as much as I may like the Hyper-V replication feature, it occasionally fails pretty catastrophically.
Hyper-V replication is supposed to work like this: Whenever a block changes within a virtual hard disk that is configured for replication, that block is replicated to another host during the next replication cycle. Sometimes, however, the synchronization process simply comes to a stop.
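To make the idea of block-level change tracking concrete, here is a minimal Python sketch. It is a toy model of the general concept only, not a description of how Hyper-V implements replication internally; the ReplicatedDisk class and its method names are hypothetical.

```python
class ReplicatedDisk:
    """Toy model of a virtual hard disk with change tracking.

    Hypothetical illustration only; this is not Hyper-V's actual mechanism.
    """

    def __init__(self, block_count):
        self.primary = [b""] * block_count   # blocks on the primary host
        self.replica = [b""] * block_count   # blocks on the replica host
        self.dirty = set()                   # blocks changed since the last cycle

    def write(self, index, data):
        """Write a block on the primary and remember that it changed."""
        self.primary[index] = data
        self.dirty.add(index)

    def run_replication_cycle(self):
        """Send only the changed blocks to the replica; return how many were sent."""
        sent = len(self.dirty)
        for index in self.dirty:
            self.replica[index] = self.primary[index]
        self.dirty.clear()
        return sent


disk = ReplicatedDisk(block_count=8)
disk.write(2, b"alpha")
disk.write(5, b"beta")
disk.write(2, b"gamma")   # same block rewritten; it still counts as one changed block

blocks_sent = disk.run_replication_cycle()
in_sync = disk.primary == disk.replica
```

In this model, a cycle only has to move the blocks that changed since the previous cycle, so the amount of data per cycle depends on how much writing happened in between.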
There is usually a corresponding error message that says, "Hyper-V suspended replication for virtual machine <VMName> due to a non-recoverable failure. (Virtual Machine ID <VMId>). Resume replication after correcting the failure." When this happens, Hyper-V lets you right-click on the virtual machine and choose a Resume Replication command from the shortcut menu. Sometimes this fixes the problem, but other times it is impossible to resume virtual machine replication.
When that happens, there is little that can be done besides disabling replication, deleting the replica and then setting up replication from scratch. This puts the virtual machine at risk until its entire contents can be resynchronized.
So what causes these replication failures? Unfortunately, I don't have a definitive answer. I've searched TechNet for an explanation whenever I have experienced this problem in the past, but have always come up empty-handed. Even so, I think that I might have finally figured out what is going on.
I have been having problems with periodic replication failures for as long as the replication feature has existed. The problems have become far less frequent over time, which I attribute to Hyper-V becoming more mature. Today, Hyper-V replication failures are kind of a rarity (at least in my organization), but I have noticed that there are a couple of things that seem to happen just before a replication failure.
One of the events that seems to trigger a replication failure is an outage on one of the Hyper-V hosts. Last fall, for example, I found myself in the projected path of a hurricane. In preparation for the hurricane, I shut down my replica server and stored it somewhere safe. When I eventually brought the server back online, however, I had to deal with an irrecoverable replication failure. I will concede that the timing of this failure could possibly be a coincidence -- but I don't think so.
The other thing that seems to cause replication failures is copying large amounts of data to a virtual machine. About a year ago, for example, I had to copy just under 3TB of data to a virtual file server. Shortly thereafter, the replication process stopped working.
I have long held theories as to why these two types of events might trigger replication failures, but I have always resisted the temptation to write about them because all of the evidence in support of my theories was circumstantial at best. I assumed that the replica was simply being overloaded with replication data, and that replication would fail because the replication process could not be completed within a single cycle. Again, this was just a theory.
This morning, however, I accidentally stumbled onto an interesting TechNet article while looking for something else. The article, which you can read here, lists two reasons why this type of replication failure occurs.
The first reason is insufficient storage space. However, I have always had plenty of free storage space when failures have occurred in my environment, so that explanation does not account for my problems.
The second reason is that a power failure occurred on the replica side and the replica server was restarted. The article goes on to explain that data that needs to be replicated can accumulate while the replica server is offline. When the server comes back online, replication fails because the bandwidth is inadequate to handle the volume of data that needs to be replicated.
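A bit of back-of-the-envelope arithmetic shows how quickly a backlog can outgrow a link. The Python sketch below is only an estimate under stated assumptions: the function name, the link speeds, the five-minute cycle and the per-cycle change rate are all illustrative values, not figures from the TechNet article. It also ignores compression and protocol overhead, and it does not model the point at which Hyper-V actually gives up.

```python
import math


def cycles_to_drain(backlog_gb, change_gb_per_cycle, link_mbps, cycle_seconds):
    """Estimate how many replication cycles it takes to clear a backlog.

    Hypothetical helper for illustration. Returns None when the link cannot
    even keep up with new changes, meaning the backlog never shrinks.
    """
    # Megabits per second -> megabytes per second -> gigabytes per cycle
    capacity_gb = link_mbps / 8 / 1000 * cycle_seconds
    net_drain_gb = capacity_gb - change_gb_per_cycle
    if net_drain_gb <= 0:
        return None
    return math.ceil(backlog_gb / net_drain_gb)


# Assumed scenario: 3,000 GB waiting to replicate, 1 GB of new changes per
# five-minute cycle, and a dedicated 1 Gbps link.
fast_link = cycles_to_drain(3000, 1, link_mbps=1000, cycle_seconds=300)

# Assumed scenario: the same backlog over 100 Mbps with 4 GB of new changes
# per cycle. The link moves less per cycle than the VM changes.
slow_link = cycles_to_drain(3000, 4, link_mbps=100, cycle_seconds=300)
```

Under these assumptions, the 1 Gbps case needs 83 five-minute cycles (close to seven hours) to catch up, while the 100 Mbps case returns None because the backlog keeps growing. Either way, the accumulated data cannot be replicated within a single cycle.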
This explanation seems to fit both of my suspected replication failure triggers. The TechNet article says point-blank that replication can fail if the replica server is disconnected and then restarted. However, it also seems plausible that trying to copy massive amounts of data to a virtual machine could overwhelm the available bandwidth, resulting in a replication failure. Hopefully, this explanation will shed some light on the situation for anyone else who may have been struggling with replication failures.
About the Author
Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.