
Understanding Data Deduplication

By Jaspreet on January 9th, 2009 under Data Protection,Technology & Innovation

“Data deduplication is inarguably one of the most important new technologies in storage for the past decade,” says Gartner. So let’s take a detailed look at what it actually means.

Definition

Data deduplication, or single instancing, is the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (a single instance) of the data to be stored. An index of all the data is still retained, however, so that every instance can be restored if it is ever required.

Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just a reference back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.
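The email example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SingleInstanceStore` class, its method names, and the mailbox paths are hypothetical, and SHA-256 is chosen here simply as a convenient content fingerprint.

```python
import hashlib


class SingleInstanceStore:
    """Toy content-addressed store: each unique payload is kept once."""

    def __init__(self):
        self.blobs = {}       # digest -> payload bytes (stored once)
        self.references = []  # one (name, digest) index entry per saved instance

    def save(self, name, payload):
        digest = hashlib.sha256(payload).hexdigest()
        # Store the payload only if this content has not been seen before.
        self.blobs.setdefault(digest, payload)
        # Always keep the index entry so every instance can be restored.
        self.references.append((name, digest))

    def restore(self, name):
        for ref_name, digest in self.references:
            if ref_name == name:
                return self.blobs[digest]
        raise KeyError(name)

    def stored_bytes(self):
        return sum(len(p) for p in self.blobs.values())


attachment = b"x" * (1024 * 1024)  # one 1 MB attachment
store = SingleInstanceStore()
for i in range(100):
    store.save(f"mailbox-{i}/report.pdf", attachment)

logical_mb = len(store.references) * len(attachment) // (1024 * 1024)
stored_mb = store.stored_bytes() // (1024 * 1024)
print(logical_mb, "MB referenced,", stored_mb, "MB stored")
```

The index (`references`) keeps all 100 entries, which is what allows any single mailbox's copy to be restored even though the payload exists only once.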

Technological Classification

The practical benefits of this technology depend on several factors:

  1. Point of Application – Source vs Target
  2. Time of Application – Inline vs Post-Process
  3. Granularity – File vs Sub-File level
  4. Algorithm – Fixed size blocks vs Variable length data segments

The diagram below shows how these factors relate to one another.

[Figure: Deduplication Technological Classification]

Target vs Source Based Deduplication

Target based deduplication acts on the target data storage media. In this case the client is unmodified and is not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance which acts as an intermediary between the backup server and the storage arrays. In both cases it improves only storage utilization.

[Figure: Target vs Source Deduplication]

In contrast, source based deduplication acts on the data at the source, before it is moved. A deduplication-aware backup agent installed on the client backs up only unique data. The result is improved bandwidth and storage utilization, but this imposes additional computational load on the backup client.
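A rough sketch of the source based idea follows. The `BackupServer` class, the `backup` function, and the "ask which digests are missing" exchange are assumptions made for illustration, not a description of any particular product's protocol. The byte count also ignores the small cost of sending the digests themselves.

```python
import hashlib


class BackupServer:
    """Hypothetical target that stores blocks keyed by their SHA-256 digest."""

    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        # Tell the client which digests the server has not stored yet.
        return {d for d in digests if d not in self.blocks}

    def upload(self, digest, block):
        self.blocks[digest] = block


def backup(server, blocks):
    """Send only blocks the server lacks; return the bytes actually transferred."""
    # Hashing happens on the client: this is the extra computational load.
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    sent = 0
    for digest in server.missing(by_digest.keys()):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


server = BackupServer()
monday = [b"A" * 4096, b"B" * 4096, b"C" * 4096]
tuesday = [b"A" * 4096, b"B" * 4096, b"D" * 4096]  # one block changed

sent_monday = backup(server, monday)
sent_tuesday = backup(server, tuesday)
print(sent_monday, "bytes sent on Monday,", sent_tuesday, "bytes sent on Tuesday")
```

Because duplicates are detected before anything crosses the network, both bandwidth and storage shrink, at the price of hashing work on the client.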

Inline vs Post-process Deduplication

In target based deduplication, the deduplication engine can process data for duplicates either in real time (i.e. as it is sent to the target) or after it has been stored in the target storage.

The former is called inline deduplication. Its obvious advantages are:

  1. Increase in overall efficiency, as data is passed and processed only once
  2. The processed data is immediately available for post-storage processes like recovery and replication, reducing the RPO and RTO window

The disadvantages are:

  1. Decrease in write throughput
  2. Lower extent of deduplication, since only the fixed-length block approach can be used

Inline deduplication processes only incoming raw blocks and has no knowledge of the files or file structure. This forces it to use the fixed-length block approach (discussed in detail later).

[Figure: Inline vs Post-process Deduplication]

Post-process deduplication acts asynchronously on the stored data, so its advantages and disadvantages are the exact opposite of those listed above for inline deduplication.

File vs Sub-file Level Deduplication

The duplicate removal algorithm can be applied at the full file or sub-file level. Full file level duplicates can be easily eliminated by calculating a single checksum of the complete file data and comparing it against the checksums of already backed up files. It’s simple and fast, but the extent of deduplication is low, as it does not address the problem of duplicate content found inside different files or data sets (e.g. emails).
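As a minimal sketch of full file level deduplication, assuming SHA-256 as the checksum and hypothetical file names, the comparison against already backed up checksums might look like this:

```python
import hashlib


def file_checksum(data):
    """Single checksum over the complete file contents."""
    return hashlib.sha256(data).hexdigest()


def backup_files(files, seen):
    """Return the names of files whose content is not already backed up.

    `files` maps name -> bytes; `seen` is the set of checksums stored so far.
    """
    new_files = []
    for name, data in files.items():
        checksum = file_checksum(data)
        if checksum not in seen:
            seen.add(checksum)
            new_files.append(name)
    return new_files


report = b"quarterly numbers " * 1000
files = {
    "alice/report.doc": report,
    "bob/report.doc": report,                        # exact copy: deduplicated
    "carol/report.doc": report + b"one more line",   # small edit: stored in full
}
stored = backup_files(files, set())
print(stored)
```

The exact copy is skipped, but the lightly edited file is stored again in full even though almost all of its content is already present. That is the limitation the sub-file level approaches try to address.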

The sub-file level deduplication technique breaks the file into smaller fixed or variable size blocks, and then uses a standard hash based algorithm to find duplicate blocks.

Fixed-Length Blocks vs Variable-Length Data Segments

The fixed-length block approach, as the name suggests, divides files into blocks of a fixed size and uses a simple checksum (MD5, SHA etc.) to find duplicates. Although it’s possible to find repeated blocks this way, the approach has very limited effectiveness. The reason is that the primary opportunity for data reduction lies in finding duplicate blocks in two transmitted datasets that are made up mostly, but not completely, of the same data segments.

[Figure: Data Sets and Block Alignment]

For example, similar data blocks may be present at different offsets in two different datasets. In other words, the block boundaries of similar data may differ. This is very common when some bytes are inserted into a file: when the changed file is processed again and divided into fixed-length blocks, all blocks after the insertion point appear to have changed.

Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.
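The boundary-shift problem is easy to reproduce. The sketch below is illustrative only: the 64-byte block size, the pseudo-random sample data, and the helper name `fixed_block_digests` are assumptions, and SHA-256 stands in for whichever checksum a real product uses.

```python
import hashlib
import random


def fixed_block_digests(data, size=64):
    """Return the set of SHA-256 digests of the fixed-size blocks of `data`."""
    return {
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    }


# Deterministic pseudo-random sample data (hypothetical stand-in for a file).
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))

# Case 1: one byte is overwritten in place, so block boundaries do not move.
overwritten = original[:100] + bytes([original[100] ^ 0xFF]) + original[101:]

# Case 2: one byte is inserted at the front, shifting every later byte by one.
inserted = b"!" + original

base = fixed_block_digests(original)
shared_after_overwrite = len(base & fixed_block_digests(overwritten))
shared_after_insert = len(base & fixed_block_digests(inserted))

print(len(base), "blocks in the original")
print(shared_after_overwrite, "blocks still match after an in-place overwrite")
print(shared_after_insert, "blocks still match after a one-byte insertion")
```

An in-place overwrite disturbs only the block it touches, while a single inserted byte misaligns every block that follows it.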

Variable-Length Data Segment technology divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to “float” within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.
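The post does not say how the floating boundaries are chosen, and vendors differ. One widely used general idea, shown here purely as an assumption and not as any vendor's algorithm, is content-defined chunking: a boundary is declared wherever a checksum of the last few bytes matches a fixed pattern. The window size, mask, and function names below are hypothetical.

```python
import hashlib
import random
import zlib


def content_defined_chunks(data, window=16, mask=0x3F):
    """Cut `data` wherever a checksum of the last `window` bytes matches a pattern.

    The cut decision depends only on nearby content, never on the byte offset,
    so the same content yields the same boundaries wherever it appears.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def digests(chunks):
    """Set of SHA-256 digests, one per distinct chunk."""
    return {hashlib.sha256(c).hexdigest() for c in chunks}


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
inserted = b"!" + original  # one byte inserted at the front

original_chunks = content_defined_chunks(original)
inserted_chunks = content_defined_chunks(inserted)
shared = digests(original_chunks) & digests(inserted_chunks)

print(len(original_chunks), "chunks in the original")
print(len(shared), "chunks unchanged after a one-byte insertion")
```

Only the chunk containing the inserted byte can change; every later boundary reappears one byte further along, so the remaining chunks hash to the same values. Recomputing the checksum for every window keeps the sketch short; production systems typically use a rolling hash so that each step costs constant time.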

ROI Benefits

Each organization has a capacity to generate data. The extent of savings depends upon, but is not directly proportional to, the number of applications or end users generating data. Overall, the deduplication savings depend on the following parameters:

  1. Number of applications or end users generating data
  2. Total data
  3. Daily change in data
  4. Type of data (emails/ documents/ media etc.)
  5. Backup policy (weekly-full – daily-incremental or daily-full)
  6. Retention period (90 days, 1 year etc.)
  7. Deduplication technology in place

The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for variable length data segment technology, which has a much better capability for dealing with arbitrary byte insertions.

Numbers

The deduplication ratio increases every time the same complete data set is passed through the deduplication engine.
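A back-of-the-envelope model shows why the ratio grows with each pass. It assumes, purely for illustration, that the first full backup is stored whole and that each later full backup contributes only a fixed changed fraction of new data; the 1% daily change and the function name are hypothetical, and the model ignores duplicates within a single backup.

```python
def dedup_ratio(full_backups, daily_change):
    """Logical data divided by stored data after repeated full backups.

    The first backup is stored whole; each later one adds only its changed
    fraction (`daily_change`, e.g. 0.01 for 1%).
    """
    logical = full_backups  # in units of one full data set
    stored = 1 + (full_backups - 1) * daily_change
    return logical / stored


for days in (1, 7, 30, 90):
    print(days, "full backups ->", round(dedup_ratio(days, 0.01), 1))
```

Under these assumptions the ratio keeps climbing with the retention period, which is why a figure measured against daily full backups looks far more impressive than one measured against incrementals.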

If compared against daily full backups, which I think are not widely used today, the ratios are close to 1:300. Most of the vendors use this as marketing jargon to attract customers, even though none of their customers may be doing daily full backups :)

Compared against modern day incremental backups, our customer statistics show results between 1:4 and 1:50 for source based deduplication.


22 Comments

  • 1. Mike Dutch  |  January 9th, 2009 at 6:14 pm

    Also see SNIA for dedupe information:

    White Paper – http://www.snia.org/forums/dmf/knowledge/white_papers_and_reports/

    Webcast – http://www.snia.org/forums/dmf/knowledge/webcasts/

    Tutorials – http://www.snia.org/forums/dmf/knowledge/tutorials/

  • 2. Borja  |  January 11th, 2009 at 10:45 am

    Good post, Jaspreet. It is clear and neutral, but I’m missing what is the technology used by inSync in each of the cases:
    - Source or Target ?
    - Inline or Post-Process ?
    - File or Sub-File level ?
    - Fixed size blocks or Variable length data segments ?

  • 3. Jaspreet  |  January 11th, 2009 at 12:07 pm

    Thanks Borja,

    I wanted to keep the post neutral.

    inSync is backup software, hence it uses source based deduplication. The product takes a sub-file level approach with a variable size data segment algorithm.

    This helps it find duplicate emails between large PST files. The result is up to 90% savings in bandwidth and storage.

    It’s one of the rare in-production products to do so.

    The inline and post-process approaches are only valid in the case of target dedup.

    Jaspreet

  • 5. Laxman  |  January 19th, 2009 at 2:51 am

    How is the block size determined when using variable block length? Can you shed more light on the so called “Variable-Length Data Segment technology”? Specifically, how is the block size determined? Do you have any references? Thanks.

  • 6. Jaspreet  |  January 19th, 2009 at 2:59 am

    Laxman,

    That’s the “key” to this technology. Some players use heuristics .. and some signatures or leaner checksums.

    It depends on the data type and sometimes the algorithm needs to be trained as well.

    Druvaa has filed multiple patent applications on this :)

  • 7. Mike Dutch  |  January 19th, 2009 at 5:18 pm

    The inline vs post-process discussion is not accurate. Most inline solutions (both source and target) use variable length segmentation. Inline solutions can also be content aware.

    The numbers section is also inaccurate. Very high dedupe ratios (e.g., 500:1) are common but it really depends on what exactly you are measuring. For a good discussion, see the SNIA white paper titled “Understanding Data Deduplication Ratios”.

  • 8. Jaspreet  |  January 20th, 2009 at 12:30 am

    Mike,

    Thanks for the information.

    IMO, inline always applies to target based dedup. But point me to the right sources if I am wrong.

    Any deduped NAS/SAN device cannot control what data is flushed to it. The mounted file system or storage driver flushes information which may not make any sense to the device. In such cases the variable data-segment algorithm can’t be applied.

    Yup, most vendors .. present twisted information. They compare ratios against daily full backups, which are rare in enterprises today.

    I corrected the ratios part. Thanks.

  • 16. Jered Floyd  |  March 9th, 2009 at 10:33 am

    Jaspreet,

    Mike is right; in-line deduplication is not limited to fixed-length blocking. Our Permabit Enterprise Archive product, for example, does variable-sized chunking for optimal deduplication.

    You say:

    Any deduped NAS/SAN device cannot control what data is flushed to it. The mounted file system or storage driver flushes information which may not make any sense to the device. In such cases the variable data-segment algorithm can’t be applied.

    We cannot control when data is flushed to our device, but this does not mean that we cannot inspect the structure of the file as it is being written and make intelligent choices about where to set boundaries. Data being written sequentially is generally flushed sequentially, and in the case of out-of-order writes from the block cache we are able to do reassembly in memory or make guesses about file structure based on previously seen landmarks in the file. In our experience, we nearly always get deduplication as good as post-processing the file after it has been written entirely, and we do not introduce a dangerous “dedupe window” which can lead to falling far behind in the data stream.

    This is technically more complicated to implement than post-process, but it can and has been done. There is more information about our deduplication technologies, which we call Scalable Data Reduction, on our website at http://www.permabit.com/products/sdr.asp.

    Regards,
    Jered Floyd
    CTO, Permabit

  • 17. Milind  |  March 10th, 2009 at 2:11 am

    I agree with Jered. All documents are written in their entirety, and the filesystem cache also flushes out in a sequential fashion. In case of random overwrites, you may need to read back and merge to reconstruct the block. For database files, fixed block sizes would perform better. In essence, inline variable sized chunking is possible. Most NAS servers still prefer post-processing to avoid impact on in-band performance.

    Milind Borate,
    CTO, Druvaa

  • 20. jitendra  |  April 1st, 2009 at 9:57 am

    Hi, this is good, but I am still confused.
    I just want to know how variable size blocks are managed.
    I would like to see the file format described in a specific manner.
    So please help me.

