<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Druva Blog &#187; data deduplication</title>
	<atom:link href="http://blog.druva.com/tag/data-deduplication/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.druva.com</link>
	<description>Enterprise Data Backup and Beyond</description>
	<lastBuildDate>Wed, 21 Dec 2011 23:25:01 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Motif #4.1</title>
		<link>http://blog.druva.com/2011/02/25/motif-4-1/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=motif-4-1</link>
		<comments>http://blog.druva.com/2011/02/25/motif-4-1/#comments</comments>
		<pubDate>Fri, 25 Feb 2011 17:36:00 +0000</pubDate>
		<dc:creator>Chandar</dc:creator>
				<category><![CDATA[data deduplication]]></category>
		<category><![CDATA[Druva inSync]]></category>
		<category><![CDATA[News & Events]]></category>
		<category><![CDATA[Productization]]></category>
		<category><![CDATA[Products]]></category>
		<category><![CDATA[Technology & Innovation]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[enterprise backup]]></category>
		<category><![CDATA[laptop backup]]></category>
		<category><![CDATA[Performance Improvement]]></category>
		<category><![CDATA[product design]]></category>
		<category><![CDATA[python performance optimization]]></category>
		<category><![CDATA[restore]]></category>
		<category><![CDATA[storage growth]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://blog.druva.com/?p=675</guid>
		<description><![CDATA[Every well-planned release has a motif, a term often used to describe a dominant theme in a literary, artistic, or musical work. In all the early releases, the motif for Druva has been Simplicity: simplicity for both end-users and for &#8230; <a href="http://blog.druva.com/2011/02/25/motif-4-1/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Every well-planned release has a motif, a term often used to describe a dominant theme in a literary, artistic, or musical work.</p>
<p>In all the early releases, the motif for Druva has been <em>Simplicity</em>: simplicity for both end-users and for IT administrators. For end users, Druva inSync is now so simple to use that it just works without their knowing. It’s completely non-intrusive and works on all kinds of networks such as WAN and VPN. It’s so simple to use that end users can access their data from any Web browser without having to contact IT. Likewise, for IT administrators, it’s so simple to download (~40MB), install (under 20 minutes), and manage (almost zero maintenance) that the total cost of ownership is almost negligible. The simplicity motif has been a strong differentiator for Druva’s offerings.</p>
<p>For inSync 4.0, we made <em>Storage and Bandwidth Optimization</em> the motif. To optimize storage, we introduced App-aware Dedupe, an industry-first dedupe technology that offers 90% storage savings across all user data and 100% dedupe accuracy at the source (laptops) for supported applications such as Outlook and Office. To optimize bandwidth, we introduced the Octopus WAN optimization engine, a multi-threaded client architecture that does smart bandwidth throttling to offer a 5x performance gain for every client backing up over a WAN.</p>
<p style="text-align: center">
<div id="attachment_678" class="wp-caption aligncenter" style="width: 430px"><a href="http://blog.druva.com/wp-content/uploads/2011/02/Motif-1.png"><img class="size-full wp-image-678  " src="http://blog.druva.com/wp-content/uploads/2011/02/Motif-1.png" alt="" width="420" height="300" /></a><p class="wp-caption-text">The eye-catching red shack on the wharf (Rockport, Massachusetts) is often called Motif #1, a reference to its popularity among artists. </p></div>
<p>The theme for inSync 4.1 emerged naturally as “<em>Scale</em>”, because customers were increasingly deploying Druva to more users in each of their environments. With release 4.1, we wanted to make inSync scale efficiently along several dimensions, as outlined below –</p>
<p><strong>Scale</strong> -</p>
<ul>
<li>2000 users per server</li>
<li>16TB of data per server</li>
<li>200 parallel connections per server</li>
</ul>
<p><strong>Performance</strong> -</p>
<ul>
<li>We’re excited to introduce an innovative HyperCache technology, which can improve backup performance by 6x compared to inSync 4.0. HyperCache is an in-memory cache that can be configured to hold the most useful subset of your dedupe index in memory, resulting in a high hit rate. The usual 80-20 rule applies here: with just a 30% subset of the dedupe index, HyperCache can deliver upwards of a 75% hit rate. We recommend 4GB of HyperCache for every 1TB of data to maximize performance. The admin console offers a simple way for you to configure HyperCache for optimal performance.</li>
<li>You can now configure SSD storage for your dedupe index to further enhance your server performance. Lab results show a whopping 12x performance improvement with HyperCache and SSD configurations.</li>
<li>You can now install Druva on a 64-bit system for enhanced performance.</li>
</ul>
<p><strong>Administration</strong> –</p>
<ul>
<li>4.1 now supports a new administrative role in addition to a Server Administrator. A Profile Administrator role grants permissions to manage one or more user profiles in order to edit profile settings, add users, and manage data restore for those profiles. This is a great way to scale the administration tasks across your organization between server and profile administration.</li>
<li>In light of the above role, we’ve enhanced our dashboard and reporting, so an administrator can get a customized view of their reports depending on their role.</li>
<li>You can now automate the import of users to inSync from your Active Directory. A periodic import from your AD can be set up to dynamically add users to inSync.</li>
</ul>
<p><strong>Access -</strong></p>
<ul>
<li>We’re very excited to announce mobile access of your data from iPads and iPhones. Check out our newest app at <a href="http://itunes.apple.com/app/insync/id420380654?mt=8">http://itunes.apple.com/app/insync/id420380654?mt=8</a></li>
</ul>
<p>We&#8217;re excited about the upcoming deployments of inSync 4.1 and the performance benefits to all of you. In my next blog, I’ll talk about the 2 editions of inSync 4.1 (Enterprise and Professional), how they compare, and which one is right for you. Stay tuned….</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2011/02/25/motif-4-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Say Hello to Blackbird !</title>
		<link>http://blog.druva.com/2010/09/06/say-hello-to-blackbird/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=say-hello-to-blackbird</link>
		<comments>http://blog.druva.com/2010/09/06/say-hello-to-blackbird/#comments</comments>
		<pubDate>Mon, 06 Sep 2010 19:09:46 +0000</pubDate>
		<dc:creator>Jaspreet</dc:creator>
				<category><![CDATA[data deduplication]]></category>
		<category><![CDATA[Druva inSync]]></category>
		<category><![CDATA[blackbird]]></category>

		<guid isPermaLink="false">http://blog.druva.com/?p=432</guid>
		<description><![CDATA[With inSync v4.0 going live last week, Druva showcased the new Blackbird storage engine which introduces a new concept called &#8211; &#8220;Application Aware Data Deduplication&#8221;. This new engine, although currently only available in inSync, will form the core of all &#8230; <a href="http://blog.druva.com/2010/09/06/say-hello-to-blackbird/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>With inSync v4.0 going live last week, Druva showcased the new <strong>Blackbird storage engine</strong> which introduces a new concept called &#8211; <em>&#8220;Application Aware Data Deduplication&#8221;</em>. This new engine, although currently only available in inSync, will form the core of all future product offerings.<a href="http://blog.druva.com/wp-content/uploads/2010/09/iStock_000008912797XSmall.jpg"><img align="right" class="alignright size-full wp-image-433" src="http://blog.druva.com/wp-content/uploads/2010/09/iStock_000008912797XSmall.jpg" alt="" width="300" height="200" /></a></p>
<p>
The idea of &#8220;<em>app-aware deduplication</em>&#8221; emerged from the fact that complex applications like MS Outlook or Exchange need much more intelligent duplicate removal than a simple block based approach. </p>
<p>
Each data block in a PST file is of fixed size and usually contains a header and a footer (ref: <a href="http://www.five-ten-sg.com/libpst/">libpst</a>), which makes it impossible for simple dedupe approaches to identify block boundaries and restricts their deduplication accuracy to just 30-40%.</p>
<p>Application aware data deduplication depends upon APIs exposed by the application to understand the structure of the on-disk data and to deduplicate at the logical-block or message level. This guarantees 100% deduplication accuracy and faster processing of data.</p>
<p>Another interesting change is the shift from a PostgreSQL database to a <strong>NoSQL Oracle embedded database</strong>. This small (less than 1MB in size) embedded database removes the heavy &#8220;SQL&#8221; and networking layer between the server and database, greatly improving performance and scalability. The new engine can now support 16TB of dedupe data and about 200 parallel backups.</p>
<p>In a nutshell, the Blackbird engine will have the following features -</p>
<ol>
<li>App-Aware deduplication</li>
<li>Light-weight and highly scalable</li>
<li>Simple to install and zero-maintenance</li>
<li>Near-CDP &#8211; timeline/event based near-continuous backups</li>
<li>Search enabled restores</li>
<li>Replication (to be showcased soon <img src='http://blog.druva.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />)</li>
</ol>
<p>
<strong>InSync v4.0 </strong><br />
InSync v4 is definitely a new benchmark for laptop backup. I am <em>extremely </em>confident that anyone who tries this solution will never buy anything else for laptop backup. With new storage, redesigned WAN Optimization and a new dashboard, it&#8217;s clearly leaps and bounds ahead of what&#8217;s available in the market.</p>
<p>  More about new features &#8211; <a href="http://www.druva.com/insync/version-4-0">http://www.druva.com/insync/version-4-0</a><br />
  Download inSync v4 &#8211; <a href="http://www.druva.com/download/insync">http://www.druva.com/download/insync</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2010/09/06/say-hello-to-blackbird/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Druvaa Deduplicates Its Name. Now Druva.com</title>
		<link>http://blog.druva.com/2010/02/25/druvaa-deduplicates-its-name-now-druvacom/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=druvaa-deduplicates-its-name-now-druvacom</link>
		<comments>http://blog.druva.com/2010/02/25/druvaa-deduplicates-its-name-now-druvacom/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 17:30:48 +0000</pubDate>
		<dc:creator>Jaspreet</dc:creator>
				<category><![CDATA[About Druva]]></category>
		<category><![CDATA[data deduplication]]></category>
		<category><![CDATA[druva.com]]></category>

		<guid isPermaLink="false">http://blog.druvaa.com/?p=330</guid>
		<description><![CDATA[Not so long back, a customer jokingly asked me &#8220;How are you guys selling data deduplication software, when your company has duplicates in its name&#8221; Well at that time, I did not have a good answer. But we did realize &#8230; <a href="http://blog.druva.com/2010/02/25/druvaa-deduplicates-its-name-now-druvacom/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Not so long back, a customer jokingly asked me, &#8220;How are you guys selling data deduplication software when your company has duplicates in its name?&#8221; <img src='http://blog.druva.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />   Well, at that time, I did not have a good answer. But we did realize that people were having trouble remembering the extra &#8220;<em>a</em>&#8221;.</p>
<p><a href="http://www.druva.com"><img align="right" class="alignright size-medium wp-image-331" src="http://blog.druvaa.com/wp-content/uploads/2010/02/druva_change_newsletter11-256x300.jpg" alt="Druva Name Change" width="256" height="300" /></a></p>
<p>So, to make brand recall simpler, we spent a good deal of time, effort and money to remove the duplicate <em>&#8220;A&#8221;</em>.</p>
<p>
The website has already been migrated (see <a href="http://www.druva.com">www.druva.com</a>) and now has a cool new logo as well. All email addresses will be carried forward as they are to the new domain, and older email addresses will remain valid. We request you to update your address book as well.</p>
<p>
We will soon be migrating the other sub-domains (blog, kb, forums etc.). The changes may take some time, and we request you to be patient.</p>
<p>
If you are a customer or listed somewhere in our salesforce.com, an email is probably already waiting in your Inbox to guide you through the changes <img src='http://blog.druva.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2010/02/25/druvaa-deduplicates-its-name-now-druvacom/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hello World !</title>
		<link>http://blog.druva.com/2009/12/15/hello-world-2/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hello-world-2</link>
		<comments>http://blog.druva.com/2009/12/15/hello-world-2/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 09:57:49 +0000</pubDate>
		<dc:creator>Jaspreet</dc:creator>
				<category><![CDATA[Druva Phoenix]]></category>
		<category><![CDATA[News & Events]]></category>
		<category><![CDATA[Technology & Innovation]]></category>
		<category><![CDATA[data deduplication]]></category>
		<category><![CDATA[network backup]]></category>
		<category><![CDATA[server backup]]></category>

		<guid isPermaLink="false">http://blog.druvaa.com/?p=314</guid>
		<description><![CDATA[After a long wait and about 4 months of the beta program, I am extremely excited to announce the general availability of Druvaa Phoenix v1.0. The entire team has been super busy making this happen, and I am sure that will &#8230; <a href="http://blog.druva.com/2009/12/15/hello-world-2/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>After a long wait and about 4 months of the beta program, I am extremely excited to announce the general availability of Druvaa Phoenix v1.0.</p>
<p>The entire team has been super busy making this happen, and I am sure that will be quite evident when you give it a try.</p>
<p><a href="http://www.druvaa.com/phoenix/network-backup"><img class="size-full wp-image-315 alignright" src="http://blog.druvaa.com/wp-content/uploads/2009/12/phoenix2.jpg" alt="Druvaa Phoenix" width="174" height="190" align="right" /></a></p>
<h3>Reinventing Backup</h3>
<p>Phoenix is designed from the ground up for remote backups. Here are some of the key product features which make it <em>ultra </em>special -</p>
<ol>
<li><strong>Global Source Based Data Deduplication</strong> &#8211; Over 90% reduction in backup time, bandwidth and storage.</li>
<li><strong>WAN Optimization</strong> &#8211; Understands high latency and noisy networks.</li>
<li><strong>Near Continuous Data Protection</strong> &#8211; snapshot/restore-points based point-in-time restores. No age-old full, incremental backups.</li>
<li><strong>Smart Bandwidth Scheduling </strong>- Set smart bandwidth limits for each backup schedule.</li>
</ol>
<p></p>
<h3>The Road Ahead</h3>
<p>What we currently have is just a platform which will be used to showcase some market changing features -</p>
<ol>
<li><strong>Search Based Restore &#8211; </strong>We missed this feature in v1.0, but it should be available in the upcoming v1.2 release.</li>
<li><strong>&#8220;Blackbird SR-71</strong><strong>&#8221;</strong> &#8211; A new storage engine with application aware data deduplication. It should be able to match an attachment inside an Exchange store in New Jersey to a file stored on a file server in Kent. This should set the standard for backup performance.</li>
<li><strong>Long Distance Replication</strong> &#8211; Replicate backed up data over noisy long distance IP networks.</li>
<li><strong>Advanced Dashboard</strong> &#8211; The second best reporting dashboard (after Google Analytics).</li>
</ol>
<p><strong>Application aware Agents </strong>- Phoenix currently comes only with a generic Windows agent; we plan to introduce application aware agents starting with v2.0.</p>
<p>Useful Links -</p>
<ul>
<li>Product Page - <a href="http://www.druvaa.com/phoenix/network-backup">www.druvaa.com/phoenix/network-backup</a></li>
<li>Download - <a href="http://www.druvaa.com/download/phoenix">www.druvaa.com/download/phoenix</a></li>
<li>Quick Setup Guide - <a href="http://www.druvaa.com/phoenix/quick-setup-guide">http://www.druvaa.com/phoenix/quick-setup-guide</a></li>
</ul>
<p>I welcome you guys to download a copy and share your feedback !</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2009/12/15/hello-world-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Why so much delay in inSync 3.1 and Phoenix ??</title>
		<link>http://blog.druva.com/2009/11/09/why-so-much-delay-in-insync-31-and-phoenix/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=why-so-much-delay-in-insync-31-and-phoenix</link>
		<comments>http://blog.druva.com/2009/11/09/why-so-much-delay-in-insync-31-and-phoenix/#comments</comments>
		<pubDate>Mon, 09 Nov 2009 17:55:56 +0000</pubDate>
		<dc:creator>Jaspreet</dc:creator>
				<category><![CDATA[Data Protection]]></category>
		<category><![CDATA[Technology & Innovation]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[data deduplication]]></category>

		<guid isPermaLink="false">http://blog.druvaa.com/?p=298</guid>
		<description><![CDATA[Well, first let me confess that inSync v3.1 took much more time than we planned. We had initially planned to release inSync by July 09 and Phoenix public beta by Sep 09. In Short - We are working on a &#8230; <a href="http://blog.druva.com/2009/11/09/why-so-much-delay-in-insync-31-and-phoenix/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Well, first let me confess that inSync v3.1 took much more time than we planned.<img class="alignright size-medium wp-image-300" src="http://blog.druvaa.com/wp-content/uploads/2009/11/time-300x199.jpg" alt="Time" width="220" height="140" align="right" /> We had initially planned to release inSync by July 09 and Phoenix public beta by Sep 09.</p>
<p><strong>In Short -</strong><br />
We are working on a new storage engine codenamed Blackbird (after the legendary SR-71). The new engine will use application specific deduplication technology to improve performance and bandwidth/storage savings.</p>
<p>Initially planned for inSync v3.1 and Phoenix v1.0, it will now be available in the next major releases.</p>
<p><strong>The longer version -<br />
</strong> For the past two years, we have been experimenting with various algorithms for global source based data deduplication. While releasing inSync v2.0 we settled on chunk based (variable-block) data deduplication, for the simple reason that it was tough to find similar data blocks at natural block boundaries across different users. We also worked on performance, which gradually improved over time.</p>
<p>While the approach was reasonably accurate, there was scope for significant improvement. We realized that 90% of the backup data on customer PCs comes from documents and PST files, hence something totally <span style="text-decoration: underline">focussed </span>on PST files could dramatically improve deduplication performance.</p>
<p>Also, while working on Phoenix, we came across a bigger challenge: finding duplicates across different data sources within the enterprise. We soon realized that a simple block based approach would not take us very far. We also realized that most vendors use fixed and variable block/chunk based hashing techniques. This works well for them because they treat backups as &#8220;<em>byte streams&#8221;</em>, where the only way to remove duplicates is fixed or variable size data deduplication.</p>
<p>Looking at the various data types and possible ways to improve, we could clearly see two fundamental changes in our approach which could bring a paradigm shift in data deduplication -</p>
<ol>
<li>For accuracy &#8211; Application aware data deduplication</li>
<li>For performance &#8211; Hierarchical block based deduplication</li>
</ol>
<p>Application aware deduplication can <strong>actually pinpoint duplicates across PST file attachments and normal office documents</strong>.</p>
<p>On the PC side, the majority of the data is office documents and email files. This makes it simpler to introduce the new approach, but a lot of work still needs to be done to productise it. For Phoenix, the problem is much bigger and will take some more time to solve.</p>
<p>The new engine should be ready soon. It will ship first in inSync v4.0 early next year and then in Phoenix v2.0. In the next few posts, I will try to share some benchmark data.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2009/11/09/why-so-much-delay-in-insync-31-and-phoenix/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Understanding Data Deduplication</title>
		<link>http://blog.druva.com/2009/01/09/understanding-data-deduplication/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=understanding-data-deduplication</link>
		<comments>http://blog.druva.com/2009/01/09/understanding-data-deduplication/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 07:37:04 +0000</pubDate>
		<dc:creator>Jaspreet</dc:creator>
				<category><![CDATA[Data Protection]]></category>
		<category><![CDATA[Technology & Innovation]]></category>
		<category><![CDATA[backup]]></category>
		<category><![CDATA[Business data backup]]></category>
		<category><![CDATA[data deduplication]]></category>
		<category><![CDATA[ROI]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://blog.druvaa.com/?p=74</guid>
		<description><![CDATA[&#8220;Data deduplication is inarguably one of the most important new technologies in storage for the past decade&#8221; says Gartner. So let&#8217;s take a detailed look at what it actually means. Definition Data deduplication or Single Instancing essentially refers to the &#8230; <a href="http://blog.druva.com/2009/01/09/understanding-data-deduplication/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>&#8220;Data deduplication is inarguably one of the most important new technologies in storage for the past decade&#8221; says Gartner. So let&#8217;s take a detailed look at what it actually means.</p>
<h2>Definition</h2>
<blockquote><p>Data deduplication or Single Instancing essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (single instance) of the data to be stored. However, indexing of all data is still retained should that data ever be required.</p></blockquote>
<p><strong>Example</strong><br />
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.</p>
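<p>As a rough illustration of the example above, here is a minimal Python sketch of single instancing. The <code>SingleInstanceStore</code> class and its method names are hypothetical, made up for this post and not taken from any product; the sketch simply keys each payload by its SHA-256 checksum and keeps one copy.</p>

```python
import hashlib


class SingleInstanceStore:
    """Toy single-instance store: each unique payload is kept once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}       # digest -> payload bytes (stored once)
        self.references = []  # one digest per saved item (the retained index)

    def save(self, payload: bytes) -> str:
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)  # store only if not seen before
        self.references.append(digest)
        return digest

    def stored_bytes(self) -> int:
        """Bytes physically kept after deduplication."""
        return sum(len(blob) for blob in self.blobs.values())

    def logical_bytes(self) -> int:
        """Bytes that would be kept without deduplication."""
        return sum(len(self.blobs[d]) for d in self.references)


MB = 1024 * 1024
attachment = b"x" * MB      # one 1 MB attachment
store = SingleInstanceStore()
for _ in range(100):        # the same attachment arrives in 100 mailboxes
    store.save(attachment)

print(store.logical_bytes() // MB, "MB referenced")
print(store.stored_bytes() // MB, "MB stored")
```

<p>The list of references plays the role of the retained index: each of the 100 saved items can still be restored, even though the payload itself is stored only once.</p>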
<h2>Technological Classification</h2>
<p>The practical benefits of this technology depend upon various factors like –</p>
<ol>
<li><strong>Point of Application</strong> &#8211; Source vs Target</li>
<li><strong>Time of Application</strong> &#8211; Inline vs Post-Process</li>
<li><strong>Granularity</strong> &#8211; File vs Sub-File level</li>
<li><strong>Algorithm</strong> &#8211; Fixed size blocks vs Variable length data segments</li>
</ol>
<p>A simple relation between these factors can be explained using the diagram below -</p>
<p style="text-align: center"><a href="http://blog.druvaa.com/wp-content/uploads/2009/01/dedup-tree.jpg"><img class="size-full wp-image-78 aligncenter" src="http://blog.druvaa.com/wp-content/uploads/2009/01/dedup-tree.jpg" alt="Deduplication Technological Classification" width="404" height="280" /></a></p>
<h3>Target vs Source Based Deduplication</h3>
<p><strong>Target based deduplication</strong> acts on the target data storage media. In this case the client is unmodified and not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance which acts as an intermediary between the backup server and the storage arrays. In both cases it improves only storage utilization.</p>
<p style="text-align: center"><a href="http://blog.druvaa.com/wp-content/uploads/2009/01/target-source-dedup.jpg"><img class="size-full wp-image-98 aligncenter" src="http://blog.druvaa.com/wp-content/uploads/2009/01/target-source-dedup.jpg" alt="Target Vs Source Deduplication" width="563" height="154" /></a></p>
<p>In contrast, <strong>source based deduplication</strong> acts on the data at the source before it’s moved. A deduplication aware backup agent installed on the client backs up only unique data. The result is improved bandwidth and storage utilization, but at the cost of additional computational load on the backup client.</p>
<h3>Inline vs Post-process Deduplication</h3>
<p>In target based deduplication, the deduplication engine can process data for duplicates either in real time (i.e. as and when it is sent to the target) or after it has been stored in the target storage.</p>
<p>The former is called <strong>inline deduplication</strong>. The obvious advantages are -</p>
<ol>
<li>Increase in overall efficiency as data is only passed and processed once</li>
<li>The processed data is instantaneously available for post storage processes like recovery and replication, reducing the <a title="Understanding RPO and RTO" href="http://blog.druvaa.com/2008/03/22/understanding-rpo-and-rto/" target="_blank">RPO and RTO</a> window.</li>
</ol>
<p>The disadvantages are -</p>
<ol>
<li>Decrease in write throughput</li>
<li>Extent of deduplication is less &#8211; only the fixed-length block deduplication approach can be used</li>
</ol>
<p>Inline deduplication processes only incoming raw blocks and has no knowledge of the files or file structure. This forces it to use the fixed-length block approach (discussed in detail later).</p>
<div class="mceTemp mceIEcenter">
<dl>
<dt><a href="http://blog.druvaa.com/wp-content/uploads/2009/01/inline-post-dedup.jpg"><img class="size-full wp-image-111" src="http://blog.druvaa.com/wp-content/uploads/2009/01/inline-post-dedup.jpg" alt="Inline Vs Post Process Deduplication" width="500" height="95" /></a></dt>
</dl>
</div>
<p><strong>Post-process deduplication</strong> acts asynchronously on the stored data, so its advantages and disadvantages are the exact opposite of those of <em>inline deduplication</em> listed above.</p>
<h3>File vs Sub-file Level Deduplication</h3>
<p>The duplicate removal algorithm can be applied at the full file or sub-file level. Full file level duplicates can be easily eliminated by calculating a single checksum of the complete file data and comparing it against the existing checksums of already backed up files. It’s simple and fast, but the extent of deduplication is very limited, as it does not address the problem of duplicate content found inside different files or data-sets (e.g. emails).</p>
<p>The sub-file level deduplication technique breaks the file into smaller fixed or variable size blocks, and then uses a standard hash based algorithm to find identical blocks.</p>
<h3>Fixed-Length Blocks vs Variable-Length Data Segments</h3>
<p>The fixed-length block approach, as the name suggests, divides the files into fixed-size blocks and uses a simple checksum (MD5/SHA etc.) based approach to find duplicates. Although it&#8217;s possible to look for repeated blocks, the approach provides very limited effectiveness. The reason is that the primary opportunity for data reduction is in finding duplicate blocks in two transmitted datasets that are made up mostly &#8211; but not completely &#8211; of the same data segments.</p>
<p style="text-align: center"><a href="http://blog.druvaa.com/wp-content/uploads/2009/01/file-bocks.jpg"><img class="size-full wp-image-83 aligncenter" src="http://blog.druvaa.com/wp-content/uploads/2009/01/file-bocks.jpg" alt="Data Sets and Block Allignment" width="321" height="193" /></a></p>
<p>For example, similar data blocks may be present at different offsets in two different datasets. In other words, the block boundaries of similar data may be different. This is very common when some bytes are inserted in a file: when the changed file is processed again and divided into fixed-length blocks, all blocks after the insertion point appear to have changed.</p>
<p>Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.</p>
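<p>The effect is easy to reproduce. The sketch below is only an illustration: the 4 KB block size, the SHA-1 checksum and the function name <code>fixed_block_hashes</code> are assumptions chosen for the demo. It inserts three bytes at the start of a random dataset and counts how many fixed-length blocks of the changed data were already known.</p>

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def fixed_block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and return the SHA-1 digest of each block."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


rng = random.Random(42)  # fixed seed so the demo is repeatable
original = bytes(rng.getrandbits(8) for _ in range(64 * BLOCK_SIZE))
modified = b"NEW" + original  # three bytes inserted at the front

old_blocks = set(fixed_block_hashes(original))
new_blocks = fixed_block_hashes(modified)
shared = sum(1 for digest in new_blocks if digest in old_blocks)
print(shared, "of", len(new_blocks), "blocks already known")
```

<p>Because every block boundary after the insertion shifts by three bytes, none of the re-cut blocks lines up with a block from the original data, even though almost all of the content is unchanged.</p>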
<p><strong>Variable-Length Data Segment technology</strong> divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to &#8220;float&#8221; within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.</p>
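<p>One common way to get such floating boundaries is content-defined chunking with a rolling hash: a boundary is cut wherever the hash of the last few bytes meets some condition, so the cut points depend on the content and not on the offset. The sketch below is a toy under stated assumptions, not the algorithm used in any particular product; the 16-byte window, the polynomial hash constants, the 1-in-1024 boundary rule and the function names are all hypothetical choices made for illustration.</p>

```python
import hashlib
import random

WINDOW = 16        # bytes of context that decide whether a boundary falls here
BASE = 257         # base of the polynomial rolling hash
MOD = 2 ** 31 - 1  # modulus that keeps the hash small
DIVISOR = 1024     # a boundary is cut at roughly 1 in 1024 positions


def variable_chunks(data: bytes) -> list:
    """Cut data wherever the hash of the last WINDOW bytes is divisible by DIVISOR."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the byte leaving the window
        h = (h * BASE + byte) % MOD                 # take in the new byte
        if i + 1 >= WINDOW and h % DIVISOR == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start != len(data):
        chunks.append(data[start:])                 # trailing partial chunk
    return chunks


def digests(chunks: list) -> list:
    """SHA-1 digest of every chunk, used to recognise chunks seen before."""
    return [hashlib.sha1(chunk).hexdigest() for chunk in chunks]


rng = random.Random(42)  # fixed seed so the demo is repeatable
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
modified = b"NEW" + original  # three bytes inserted at the front

known = set(digests(variable_chunks(original)))
new_chunks = digests(variable_chunks(modified))
shared = sum(1 for digest in new_chunks if digest in known)
print(shared, "of", len(new_chunks), "chunks already known")
```

<p>Since each boundary depends only on the 16 bytes before it, the insertion disturbs the chunking only near the front of the data; the boundaries further along fall at the same content positions as before, so nearly all chunks are recognised as duplicates.</p>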
<h2>ROI Benefits</h2>
<p>Each organization has a capacity to generate data. The extent of savings depends upon, but is not directly proportional to, the number of applications or end users generating data. Overall, the deduplication savings depend upon the following parameters –</p>
<ol>
<li>No. of applications or end users generating data</li>
<li>Total data</li>
<li>Daily change in data</li>
<li>Type of data (emails/ documents/ media etc.)</li>
<li>Backup policy (weekly-full – daily-incremental or daily-full)</li>
<li>Retention period (90 days, 1 year etc.)</li>
<li>Deduplication technology in place</li>
</ol>
<p>The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for <em>variable length data segment</em> technology which has a much better capability for dealing with arbitrary byte insertions.</p>
<p><strong>Numbers</strong></p>
<p>The deduplication ratio increases every time you pass the same complete data-set through the deduplication engine.</p>
<p>If compared against <em>daily full backups</em>, which I think are not widely used today, the ratios are close to 1:300.  Most of the vendors use this as marketing jargon to attract customers, even though none of their customers may actually be doing daily full backups <img src='http://blog.druva.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
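<p>To see why ratios quoted against daily full backups climb so quickly, consider a deliberately simplified model. It is an assumption made for illustration, not a measurement: the dataset size stays constant, a fixed fraction of it changes each day, and the engine stores only the changed part. The function name <code>dedupe_ratio</code> and the 1% daily change rate are hypothetical.</p>

```python
def dedupe_ratio(days: int, daily_change: float) -> float:
    """Ratio of data sent by daily full backups to unique data actually stored.

    The dataset size is normalised to 1. Day one stores everything; each
    later day adds only the changed fraction.
    """
    logical = float(days)                     # one full copy per day
    unique = 1.0 + (days - 1) * daily_change  # first copy plus daily changes
    return logical / unique


for days in (1, 7, 30, 90, 365):
    ratio = dedupe_ratio(days, daily_change=0.01)
    print(days, "days of daily fulls: about 1:" + str(round(ratio, 1)))
```

<p>In this model the ratio keeps growing with the retention period but can never exceed one divided by the daily change rate, so a very high headline ratio implies both a long retention period and a very small daily change.</p>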
<p>If compared against modern day incremental backups, our customer statistics show that the results are <strong>between 1:4 and 1:50</strong> for source based deduplication.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.druva.com/2009/01/09/understanding-data-deduplication/feed/</wfw:commentRss>
		<slash:comments>22</slash:comments>
		</item>
	</channel>
</rss>

