What You Need To Know About Data Deduplication -- Redmondmag.com

Break out those calculators, it's time to do some math!

By Brien Posey
04/29/2022

As storage costs continue to increase, it is becoming increasingly important to perform backups in a way that minimizes those costs without sacrificing the protection of your data. Deduplication has long been one of the go-to mechanisms for reducing backup costs. As helpful as deduplication might be, however, deduplication ratios can be a little misleading.

For those who might not be familiar with data deduplication, it is based on the idea that any large dataset is likely to contain at least some redundancy. This redundancy might exist in the form of duplicate storage blocks, duplicate files, or duplication within the data itself. Deduplication works by eliminating redundant data so that only a single copy actually needs to be stored. There are several different forms of deduplication, but that's the basic idea behind how it works.
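To make the basic idea concrete, here is a minimal sketch of one simple form of deduplication: fixed-size blocks identified by a SHA-256 hash. The function names (`dedupe`, `restore`) and the 4,096-byte block size are assumptions chosen for illustration, and this is not how any particular backup product is implemented.

```python
import hashlib


def dedupe(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}    # digest -> block contents (one copy per unique block)
    recipe = []   # ordered list of digests needed to rebuild the original data
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first copy is stored
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original data from the unique blocks and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical sample data: three identical blocks followed by one different block.
data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = dedupe(data)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
# prints: 4 blocks referenced, 2 blocks stored
```

The recipe is what lets the original data be rebuilt even though each redundant block is stored only once.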

Deduplication efficiency is often expressed as a ratio of the original data size to the size actually stored (such as 10:1), with higher ratios indicating a greater level of deduplication. When it comes to deduplication, however, your mileage may vary.

There are two important things to understand about data deduplication. First, your data plays a major role in the effectiveness of the deduplication process. A deduplication engine cannot shrink the data footprint unless redundancy exists within the dataset. Some data deduplicates really well, while other datasets contain almost no redundancy at all. Compressed archives such as ZIP files, for example, usually cannot be deduplicated to any meaningful degree. Similarly, many media formats such as MPG and JPEG files are already compressed and therefore tend not to benefit from deduplication.
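The dependence on the data itself can be illustrated with a rough sketch. It assumes a simple fixed-size-block scheme and uses random bytes as a stand-in for already-compressed data (an assumption: well-compressed data tends to look close to random). The `dedup_ratio` helper is hypothetical and counts blocks rather than bytes.

```python
import hashlib
import os


def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    """Return total blocks divided by unique blocks for fixed-size blocks."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if not blocks:
        return 1.0  # nothing to deduplicate
    unique = {hashlib.sha256(block).digest() for block in blocks}
    return len(blocks) / len(unique)


redundant = (b"x" * 4096) * 100        # the same block repeated 100 times
random_like = os.urandom(4096 * 100)   # stand-in for compressed data

print(f"redundant data:   {dedup_ratio(redundant):.0f}:1")
print(f"random-like data: {dedup_ratio(random_like):.0f}:1")
```

The repeated data collapses to a single stored block, while the random-like data has essentially no duplicate blocks to eliminate.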

The other important thing to understand if you are just starting out with deduplication is that the ratios can be really misleading. In the past, vendors have often tried to wow their customers with super high deduplication ratios. While it is true that higher ratios do indicate a greater level of deduplication (and therefore less storage consumption), it is also true that higher ratios yield diminishing returns.

Imagine for a moment that your backup product of choice was able to achieve a 2:1 deduplication ratio for the data that you backed up. At 2:1, you have shrunk the data's footprint by 50 percent. If, for example, you were backing up 1 TB of data (counted here as 1,024 GB), then only 512 GB of data would be written to the backup target.

Now, imagine that the same backup product managed to deduplicate your data at a 3:1 ratio. This would mean that you are reducing the data's footprint by roughly 66.7 percent. If you were backing up 1 TB of data, it would mean that only about 341 GB of data is being written to the backup target. A 3:1 ratio represents a significant improvement over a 2:1 ratio, but gains become far less significant at higher ratios.

To show you what I mean, consider what would happen if you managed to deduplicate your backup at a ratio of 20:1. At 20:1 you would have reduced your data's footprint by 95 percent. If you were backing up 1 TB of data then such a high ratio would mean that you would only be writing about 51.2 GB to the backup target.
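The figures above can be reproduced with a few lines of arithmetic. This sketch assumes that an N:1 ratio means the stored size is the source size divided by N, and that 1 TB is counted as 1,024 GB, which matches the numbers quoted here. The `footprint` function name is made up for the example.

```python
def footprint(source_gb: float, ratio: float):
    """Return (stored_gb, percent_reduction) for a ratio expressed as ratio:1."""
    stored_gb = source_gb / ratio
    percent_reduction = (1 - 1 / ratio) * 100
    return stored_gb, percent_reduction


SOURCE_GB = 1024  # 1 TB, counted as 1,024 GB

for ratio in (2, 3, 20):
    stored, reduction = footprint(SOURCE_GB, ratio)
    print(f"{ratio}:1 -> {stored:.1f} GB stored, {reduction:.1f}% reduction")
# prints:
# 2:1 -> 512.0 GB stored, 50.0% reduction
# 3:1 -> 341.3 GB stored, 66.7% reduction
# 20:1 -> 51.2 GB stored, 95.0% reduction
```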

The thing to keep in mind is that it's tough to improve on a 95 percent reduction in the data footprint. Earlier I showed you that going from 2:1 to 3:1 reduced the data's footprint by an extra 16.7 percentage points. Such gains are impossible with a 20:1 deduplication ratio because the data has already been shrunk by 95 percent. Shrinking the data by another 5 percentage points would mean that the data is completely gone!

So let's look at what would happen if we went from 20:1 to 25:1. At 25:1 the data is reduced by 96 percent. In other words, raising the deduplication ratio from 20:1 to 25:1 only resulted in another 1 percentage point of reduction in the data's footprint. In the case of backing up 1 TB of data, a 25:1 deduplication ratio would only save you about 10 GB of space over a 20:1 deduplication ratio.
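As a quick check on the diminishing returns, the sketch below compares the extra space saved by the two ratio increases discussed in this article. It again assumes 1 TB is counted as 1,024 GB, and `extra_saving_gb` is a hypothetical helper name.

```python
def extra_saving_gb(source_gb: float, old_ratio: float, new_ratio: float) -> float:
    """Additional space saved by moving from old_ratio:1 to new_ratio:1."""
    return source_gb / old_ratio - source_gb / new_ratio


SOURCE_GB = 1024  # 1 TB, counted as 1,024 GB

for old, new in ((2, 3), (20, 25)):
    saved = extra_saving_gb(SOURCE_GB, old, new)
    print(f"{old}:1 -> {new}:1 saves an extra {saved:.1f} GB")
# prints:
# 2:1 -> 3:1 saves an extra 170.7 GB
# 20:1 -> 25:1 saves an extra 10.2 GB
```

Raising the ratio by one at the low end frees far more space than raising it by five at the high end.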

Of course, most organizations back up far more than 1 TB of data, and in the case of an extremely large dataset, even a 1 percent decrease in the storage footprint could be significant. Regardless of the data size, however, returns diminish as deduplication ratios increase.

About the Author

Brien Posey is a 22-time Microsoft MVP with decades of IT experience. As a freelance writer, Posey has written thousands of articles and contributed to several dozen books on a wide variety of IT topics. Prior to going freelance, Posey was a CIO for a national chain of hospitals and health care facilities. He has also served as a network administrator for some of the country's largest insurance companies and for the Department of Defense at Fort Knox. In addition to his continued work in IT, Posey has spent the last several years actively training as a commercial scientist-astronaut candidate in preparation to fly on a mission to study polar mesospheric clouds from space. You can follow his spaceflight training on his Web site.
