
Global Dedupe and Data Replication for Primary Storage


The concept of deduplicating replication data streams is not new; however, it has generally been applied only in backup applications. At GreenBytes, we have recently completed an important new feature in the GreenBytes File System (GBFS) kernel that optimizes replication data streams from any data set. It is now possible, for the first time, to snapshot an iSCSI LUN and send only the data that is not already present on the replication TARGET. In bandwidth-constrained networks, the value of this optimization is immense. GBFS is the only iSCSI SAN to include global snapshot deduplication, compression, and bandwidth throttling as an integral part of the snapshot engine.

Storage systems (like those based on vanilla ZFS) have had the ability to dedupe blocks within a snapshot. This has proven to be of limited utility because most duplicate blocks are created between, not within, replication jobs. Trying to use the ZFS send command to update a target after a full backup usually results in the full rehydration of deduplicated blocks prior to replication.

Consider this simple example: on Saturday, I send a full backup stream to an iSCSI LUN connected to my storage appliance. This is the first time this data has been sent. For simplicity, let’s assume it contains zero duplicate blocks and occupies the full storage requirement (1TB in my example). The following week, I do another full backup, which typically is 90% of the same data. Predictably, these new files dedupe nicely.
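The dedupe-on-write behavior in this example can be sketched with a toy content-addressed store. This is a minimal illustration, not GBFS code: the SHA-256 hash, the 128 KiB block size, and the `DedupeStore` class are all assumptions made for the sketch.

```python
import hashlib

BLOCK_SIZE = 128 * 1024  # hypothetical block size for illustration

class DedupeStore:
    """Content-addressed block store: identical blocks are kept once."""
    def __init__(self):
        self.blocks = {}        # content hash -> block data
        self.logical_bytes = 0  # bytes as written by the client

    def write(self, block: bytes) -> str:
        h = hashlib.sha256(block).hexdigest()
        self.logical_bytes += len(block)
        self.blocks.setdefault(h, block)  # store only if unseen
        return h

    @property
    def physical_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

store = DedupeStore()
# Week 1 full backup: 10 unique blocks
week1 = [bytes([i]) * BLOCK_SIZE for i in range(10)]
# Week 2 full backup: 90% identical data, one genuinely new block
week2 = week1[:9] + [bytes([99]) * BLOCK_SIZE]
for blk in week1 + week2:
    store.write(blk)

print(store.logical_bytes // BLOCK_SIZE)   # 20 logical blocks written
print(store.physical_bytes // BLOCK_SIZE)  # 11 unique blocks stored
```

On disk, the second backup dedupes nicely: only the one new block consumes additional space.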

Now the system takes a scheduled snapshot and replicates it. The differential block snapshot logic tags all of the new blocks as needing to be replicated (typical of any snapshot-based replication technology). This is not the result we were hoping for, as a second large replication job is now choking my network. The vanilla ZFS ‘zfs send -D’ option doesn’t help either, as it only dedupes duplicates within the snapshot stream. We end up wasting a lot of memory comparing blocks to no good effect.

What can be done?

Clearly, we need to have some knowledge of what is on our replication TARGET before we send a data stream. This is exactly what the GBFS storage kernel now achieves. In fact, we have added two levels of dedupe replication! The first compares blocks between snapshots and only sends the unique blocks.

Differential Snapshot Deduplication
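The first level can be sketched as follows: hash every block in the new snapshot, and transmit real data only for content that did not exist anywhere in the previous snapshot. This is an illustrative Python sketch, not the GBFS implementation; SHA-256 and the ("ref"/"data") record format are assumptions for the example.

```python
import hashlib

def snapshot_hashes(blocks):
    """Content hashes of every block in a snapshot."""
    return {hashlib.sha256(b).hexdigest() for b in blocks}

def differential_dedupe_stream(prev_snapshot, new_snapshot):
    """Build a replication stream that sends real data only for blocks
    whose content did not exist anywhere in the previous snapshot;
    duplicates are replaced by a small hash reference."""
    known = snapshot_hashes(prev_snapshot)
    stream = []
    for block in new_snapshot:
        h = hashlib.sha256(block).hexdigest()
        if h in known:
            stream.append(("ref", h))        # tiny reference, not data
        else:
            stream.append(("data", block))   # genuinely new content
            known.add(h)
    return stream

prev = [b"A" * 4096, b"B" * 4096]
new = [b"A" * 4096, b"B" * 4096, b"C" * 4096]  # one new block
stream = differential_dedupe_stream(prev, new)
print(sum(1 for kind, _ in stream if kind == "data"))  # 1
```

Only one block of real data crosses the wire; the two duplicates travel as references.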

The second has a preamble conversation with the target and only sends the globally unique blocks. This second method is the most highly optimized and is designed for situations where multiple appliances are replicating to a single target repository.

Global Deduplication
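The second level adds the preamble conversation: before sending any data, the source asks the target which block hashes it already holds, so a block that any appliance has already replicated is never sent again. The sketch below is an assumption-laden illustration of that handshake, not the GBFS wire protocol; the `Target` class, `preamble` exchange, and SHA-256 hashing are invented for the example.

```python
import hashlib

def sha(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class Target:
    """Replication target with a global dedupe table shared by all sources."""
    def __init__(self):
        self.store = {}  # content hash -> block

    def preamble(self, candidate_hashes):
        # Reply with the subset of hashes the target does NOT already hold.
        return [h for h in candidate_hashes if h not in self.store]

    def receive(self, blocks):
        for b in blocks:
            self.store[sha(b)] = b

def replicate(source_blocks, target):
    hashes = [sha(b) for b in source_blocks]
    wanted = set(target.preamble(hashes))  # preamble conversation
    payload = [b for b in source_blocks if sha(b) in wanted]
    target.receive(payload)
    return len(payload)  # blocks actually transmitted

target = Target()
appliance_a = [b"X" * 4096, b"Y" * 4096]
appliance_b = [b"Y" * 4096, b"Z" * 4096]  # shares one block with A
sent_a = replicate(appliance_a, target)
sent_b = replicate(appliance_b, target)
print(sent_a, sent_b)  # 2 1
```

Appliance B transmits only one block: its copy of the block appliance A already replicated is suppressed by the preamble, which is exactly the multi-appliance, single-repository case this mode is designed for.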

How big is the effect? Our testing shows that every block deduplicated in this manner is replaced by a small reference, shrinking that portion of the data stream by potentially more than 100X. In our example above with 90% duplicate data, the 1TB stream would now be roughly 11% of its fully hydrated size: an 89% reduction in replication time or bandwidth requirements.
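The arithmetic behind the 11% figure works out as follows, assuming (purely for illustration) that a hash reference costs about 1% of a full block:

```python
full_stream_tb = 1.0
dup_fraction = 0.90
ref_overhead = 0.01  # assumed: a hash reference is ~1% of a block's size

# Unique data travels in full; duplicates travel as tiny references.
optimized_tb = (full_stream_tb * (1 - dup_fraction)
                + full_stream_tb * dup_fraction * ref_overhead)
print(f"{optimized_tb:.2f} TB -> {optimized_tb / full_stream_tb:.0%} of original")
# 0.11 TB -> 11% of original
```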

Stay tuned for more data on GBFS performance. My next entries will cover optimizations to the replication transport and compression engine. I will also follow up with a comprehensive set of benchmarks and an in-depth look into our unique SSD Hybrid Storage Architecture (HSA). Expect a few surprises!

