Oops, I Lost My Laptop!

According to a recent study by the Ponemon Institute, “Airport Insecurity: The Case of Lost Laptops”, sponsored by Dell, business travellers lose more than 12,000 laptops per week in U.S. airports.

Mobile Workforce Airports

According to the same study, which examined losses at 106 of the U.S.’s largest airports, the top 36 “Class B” airports averaged 286 lost laptops per month, which is about one laptop lost every 2.6 hours at these airports. The study also found that about 66% of the lost laptops were never recovered; only about a third were reclaimed by their owners.

Enterprise data is more dispersed and diverse than ever. And with over 30% of corporate data sitting on PCs, administrators can no longer hold the end user responsible for the protection of this critical corporate data.

The above statistics clearly state the need for the following two solutions on every single corporate PC -

  • Data Protection Solution – Designed for laptops, keeping in mind the mobile workforce
  • Disk Encryption and Data Leakage Prevention Solution

Shameless Plug: Druvaa inSync is a simple, fast and scalable solution designed especially for the mobile workforce. Learn more here – http://www.druvaa.com/insync/laptop-backup

Understanding Data Deduplication

“Data deduplication is inarguably one of the most important new technologies in storage for the past decade,” says Gartner. So let’s take a detailed look at what it actually means.

Definition

Data deduplication or Single Instancing essentially refers to the elimination of redundant data. In the deduplication process, duplicate data is deleted, leaving only one copy (single instance) of the data to be stored. However, indexing of all data is still retained should that data ever be required.

Example
A typical email system might contain 100 instances of the same 1 MB file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just a reference back to the one saved copy, reducing storage and bandwidth demand to only 1 MB.
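The attachment example can be reproduced in a few lines. This is only a minimal sketch of single instancing, assuming SHA-256 as the fingerprint and an in-memory dictionary as the store; `store_attachments` and the variable names are made up for illustration.

```python
import hashlib

def store_attachments(attachments):
    """Keep one copy of each distinct payload; every instance keeps only a reference."""
    store = {}        # digest -> payload, stored once
    references = []   # one digest per attachment instance
    for payload in attachments:
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)   # only the first copy is kept
        references.append(digest)
    return store, references

MB = 1024 * 1024
attachment = b"x" * MB          # stand-in for a 1 MB file
mailbox = [attachment] * 100    # 100 instances of the same attachment

store, references = store_attachments(mailbox)
logical_mb = sum(len(p) for p in mailbox) // MB
stored_mb = sum(len(p) for p in store.values()) // MB
print(logical_mb, stored_mb, len(references))  # 100 1 100
```

The index (here the `references` list) still records all 100 instances, which is what lets any of them be restored later.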

Technological Classification

The practical benefits of this technology depend upon various factors like –

  1. Point of Application – Source vs Target
  2. Time of Application – Inline vs Post-Process
  3. Granularity – File vs Sub-File level
  4. Algorithm – Fixed-size blocks vs Variable-length data segments

A simple relation between these factors can be explained using the diagram below -

Deduplication Technological Classification

Target Vs Source based Deduplication

Target-based deduplication acts on the target data storage media. In this case the client is unmodified and not aware of any deduplication. The deduplication engine can be embedded in the hardware array, which can then be used as a NAS/SAN device with deduplication capabilities. Alternatively, it can be offered as an independent software or hardware appliance which acts as an intermediary between the backup server and the storage arrays. In both cases it improves only storage utilization.

Target Vs Source Deduplication

In contrast, source-based deduplication acts on the data at the source, before it’s moved. A deduplication-aware backup agent is installed on the client, which backs up only unique data. The result is improved bandwidth and storage utilization. However, this imposes additional computational load on the backup client.
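To make the bandwidth saving concrete, here is a rough sketch of a source-side agent. The `BackupServer` class, its `missing`/`upload` methods and the pre-split chunks are all assumptions for illustration, not any product's actual protocol; the small cost of exchanging digests is ignored.

```python
import hashlib

class BackupServer:
    """Hypothetical target that stores chunks keyed by their SHA-256 digest."""
    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        # Tell the client which digests the server has never seen.
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, data):
        self.chunks[digest] = data

def backup(server, chunks):
    """Hash on the client, send only chunks the server lacks; return bytes sent."""
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks}
    sent = 0
    for digest in server.missing(list(by_digest)):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent

server = BackupServer()
monday = [b"A" * 1000, b"B" * 1000, b"C" * 1000]
tuesday = [b"A" * 1000, b"B" * 1000, b"D" * 1000]   # one chunk changed

first_run = backup(server, monday)
second_run = backup(server, tuesday)
print(first_run, second_run)  # 3000 1000
```

The hashing happens on the client, which is exactly the extra computational load mentioned above.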

Inline Vs Post-process Deduplication

In target-based deduplication, the deduplication engine can process data for duplicates either in real time (i.e. as and when it’s sent to the target) or after it’s been stored in the target storage.

The former is called inline deduplication. The obvious advantages are -

  1. Increase in overall efficiency as data is only passed and processed once
  2. The processed data is immediately available for post-storage processes like recovery and replication, reducing the RPO and RTO window.

The disadvantages are -

  1. Decrease in write throughput
  2. Extent of deduplication is less – only the fixed-length block deduplication approach can be used

Inline deduplication processes only incoming raw blocks and has no knowledge of the files or file structure. This forces it to use the fixed-length block approach (discussed in detail later).

Inline Vs Post Process Deduplication

Post-process deduplication acts asynchronously on the stored data, and its advantages and disadvantages are the exact opposite of those listed above for inline deduplication.

File vs Sub-file Level Deduplication

The duplicate removal algorithm can be applied at the full-file or sub-file level. Full-file duplicates can be easily eliminated by calculating a single checksum of the complete file data and comparing it against the existing checksums of already backed-up files. It’s simple and fast, but the extent of deduplication is very limited, as it does not address the problem of duplicate content found inside different files or datasets (e.g. emails).

The sub-file level deduplication technique breaks the file into smaller fixed or variable size blocks, and then uses a standard hash-based algorithm to find identical blocks.

Fixed-Length Blocks v/s Variable-Length Data Segments

The fixed-length block approach, as the name suggests, divides files into fixed-size blocks and uses a simple checksum-based approach (MD5/SHA etc.) to find duplicates. Although it’s possible to look for repeated blocks this way, the approach provides very limited effectiveness. The reason is that the primary opportunity for data reduction is in finding duplicate blocks in two transmitted datasets that are made up mostly – but not completely – of the same data segments.

Data Sets and Block Alignment

For example, similar data blocks may be present at different offsets in two different datasets. In other words, the block boundaries of similar data may differ. This is very common when some bytes are inserted into a file: when the changed file is processed again and divided into fixed-length blocks, all blocks after the insertion point appear to have changed.

Therefore, two datasets with a small amount of difference are likely to have very few identical fixed length blocks.
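The alignment problem is easy to see in a small experiment. The sketch below is only illustrative: the 64-byte block size, SHA-256 as the checksum and the seeded random test data are all arbitrary choices of mine. It compares how many block fingerprints survive an in-place overwrite versus a single inserted byte.

```python
import hashlib
import random

BLOCK_SIZE = 64   # arbitrary block size for the demo

def block_ids(data, size=BLOCK_SIZE):
    """Fingerprint every fixed-size block of the data."""
    return {hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)}

rng = random.Random(42)
original = bytes(rng.getrandbits(8) for _ in range(4096))   # 64 blocks of 64 bytes

overwritten = bytes([original[0] ^ 0xFF]) + original[1:]    # one byte changed in place
inserted = b"!" + original                                   # one byte inserted at the front

base = block_ids(original)
shared_after_overwrite = len(base & block_ids(overwritten))
shared_after_insert = len(base & block_ids(inserted))
print(len(base), shared_after_overwrite, shared_after_insert)
```

With data like this, the in-place change spoils only the block it touches, while the one-byte insert shifts every later boundary and leaves essentially no block reusable.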

Variable-Length Data Segment technology divides the data stream into variable length data segments using a methodology that can find the same block boundaries in different locations and contexts. This allows the boundaries to “float” within the data stream so that changes in one part of the dataset have little or no impact on the boundaries in other locations of the dataset.
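One common way to get such “floating” boundaries is content-defined chunking: cut wherever a rolling hash of the last few bytes matches a fixed pattern. The sketch below is a toy version of that idea, not any vendor's algorithm; the window size, mask, hash parameters and test data are assumptions chosen for readability, and real implementations also enforce minimum and maximum segment sizes.

```python
import hashlib
import random

WINDOW = 16            # bytes of context that decide a boundary
MASK = 0x3F            # cut when the low 6 bits are all 1 (about 64-byte segments)
BASE = 257
MOD = (1 << 61) - 1

def variable_chunks(data):
    """Split data wherever a rolling hash of the last WINDOW bytes hits the pattern."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == MASK:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                  # trailing segment
    return chunks

def digests(chunks):
    return {hashlib.sha256(c).hexdigest() for c in chunks}

rng = random.Random(7)
original = bytes(rng.getrandbits(8) for _ in range(4096))
modified = original[:100] + b"NEW BYTES" + original[100:]   # insert 9 bytes

orig_ids = digests(variable_chunks(original))
mod_ids = digests(variable_chunks(modified))
shared = len(orig_ids & mod_ids)
print(f"{shared} of {len(orig_ids)} original segments found again after the insertion")
```

Because each cut depends only on the preceding 16 bytes, the boundaries re-synchronise shortly after the inserted bytes, so only the segments around the edit are new.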

ROI Benefits

Each organization has a capacity to generate data. The extent of savings depends upon – but is not directly proportional to – the number of applications or end users generating data. Overall, the deduplication savings depend upon the following parameters –

  1. No. of applications or end users generating data
  2. Total data
  3. Daily change in data
  4. Type of data (emails/ documents/ media etc.)
  5. Backup policy (weekly-full – daily-incremental or daily-full)
  6. Retention period (90 days, 1 year etc.)
  7. Deduplication technology in place

The actual benefits of deduplication are realized once the same dataset is processed multiple times over a span of time for weekly/daily backups. This is especially true for variable length data segment technology which has a much better capability for dealing with arbitrary byte insertions.

Numbers

The deduplication ratio increases every time you pass the same complete dataset through the deduplication engine.

Compared against daily full backups, which I think are not widely used today, the ratios are close to 1:300. Most of the vendors use this as marketing jargon to attract customers, even though none of their customers may be doing daily full backups :)

Compared against modern-day incremental backups, our customer statistics show that the results are between 1:4 and 1:50 for source-based deduplication.
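To see why the ratio keeps climbing with repeated full backups, here is a back-of-the-envelope model. It is a simplification with made-up inputs (a 1% daily change rate, perfect detection of unchanged data), not our customer data: each new full backup adds a whole dataset of logical data but only the changed fraction of new unique data.

```python
def dedup_ratio(full_backups, daily_change):
    """Logical data backed up divided by unique data stored, in units of one dataset."""
    logical = full_backups                            # n complete copies
    stored = 1 + (full_backups - 1) * daily_change    # first copy plus daily changes
    return logical / stored

for n in (1, 7, 30, 90):
    print(n, round(dedup_ratio(n, 0.01), 1))
```

The same model also shows why the baseline matters: measured against incremental backups, which already skip most unchanged data, the logical side grows far more slowly and the ratio is correspondingly smaller.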

PC Backup – Six Must-Have Features

For any enterprise, the definition and amount of “critical data” on laptops and desktops is increasing. This is fueled by increasing security concerns, user mobility and cross-geography office expansions. While the expectations have increased, the existing backup solutions haven’t adapted well to these changes.

They still continue to depend upon large computational resources and dedicated, trusted networks/media for backups. The reason, I think, is that most PC backup solutions have been molded out of old server archival products.

In short, the key requirements for an enterprise PC backup should be -

  1. Simple and Automated
  2. Non-intrusive – Light weight and resource/power friendly
  3. WAN and bandwidth optimized
  4. Secure and Internet friendly
  5. Support for incremental sync for large files like Outlook PST
  6. On-demand restore points

Features Explained -

1. Simple and Automated

“Backing up your PC is one of those things, like eating right or changing your oil on time, that everybody knows they’re supposed to do, but too few people actually carry off well…”

Walter Mossberg, The Wall Street Journal

Surprisingly, most notebook backup solutions still have calendar schedules. IMO, this is prehistoric. The setup should take at most 5 steps, and schedules should be as simple as “Run every 4 hours”.

2. Non-intrusive – Light weight and resource/power friendly

The primary reason employees hate backup is the system/network slowdown caused by a backup that kicks in as soon as the user logs in.

Laptops are replacing desktops in most enterprises, but the software still hasn’t evolved. Backups should be resource friendly and optimized for low power consumption. Also, simple options like these can make a lot of difference:

  • Don’t back up when I am on battery
  • Consume max 10% of my CPU
  • Consume max. 20% of my bandwidth
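Options like these map naturally onto a tiny policy object. The sketch below is purely hypothetical (the class names, fields and the `plan_backup` helper are invented for illustration); it only decides whether to run and what caps to apply, and leaves measuring the battery state and link speed to the caller.

```python
from dataclasses import dataclass

@dataclass
class BackupPolicy:
    skip_on_battery: bool = True
    max_cpu_percent: float = 10.0
    max_bandwidth_percent: float = 20.0

@dataclass
class SystemState:
    on_battery: bool
    link_kbps: float   # measured capacity of the current network link

def plan_backup(policy, state):
    """Return the caps to run under, or None when the backup should wait."""
    if policy.skip_on_battery and state.on_battery:
        return None
    return {
        "cpu_cap_percent": policy.max_cpu_percent,
        "bandwidth_cap_kbps": state.link_kbps * policy.max_bandwidth_percent / 100.0,
    }

policy = BackupPolicy()
print(plan_backup(policy, SystemState(on_battery=True, link_kbps=1000)))   # None
print(plan_backup(policy, SystemState(on_battery=False, link_kbps=1000)))
```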

3. WAN and Bandwidth optimized

Every company has a reasonably large percentage of mobile workforce, and this usually includes the top-tier management (the CEO and the like). With increasing laptop thefts and data risks, backups should be WAN/Internet ready.

The user should be able to choose a bandwidth (something like use 10% of my bandwidth) and the backup solution should just do the job, even over the weakest internet links. This also greatly helps in cross-office backups and backup consolidation efforts.

4. Secure and Internet friendly

Security is very important, especially when you are over WAN/VPN. Most backup solutions are server triggered, making security policies for firewalls and monitoring very difficult (everyone is afraid when they see data flowing out of their network).

The backups should be client triggered, so that the server-side firewalls just allow and monitor inbound traffic. Also, the solution should be able to securely set up encrypted/authenticated channels for backup. (SSL channels are best when it comes to WAN/Internet.)
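As a rough illustration of a client-triggered, encrypted channel, the sketch below uses Python's standard `ssl` module (TLS, the successor to SSL). The host name in the usage comment and the function names are placeholders, and a real backup agent would layer its own authentication and protocol on top.

```python
import socket
import ssl

def make_backup_tls_context():
    """TLS settings for the client side of the backup channel."""
    ctx = ssl.create_default_context()             # verifies the server certificate and host name
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocol versions
    return ctx

def open_backup_channel(host, port=443, timeout=10.0):
    """The client dials out; the server side only has to accept inbound connections."""
    ctx = make_backup_tls_context()
    raw = socket.create_connection((host, port), timeout=timeout)
    return ctx.wrap_socket(raw, server_hostname=host)

# Hypothetical usage (needs a reachable server):
# with open_backup_channel("backup.example.com") as channel:
#     channel.sendall(b"...")
```

Since the connection always originates from the laptop, the office firewall only needs one inbound rule for the backup server.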

5. Support for incremental sync for large files like Outlook PST

With data increasing and WAN coming into the picture, it is very important that the backups are incremental in nature and only the changed bits are copied back to the server.
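For a large file that is mostly modified in place or appended to, a block-signature comparison is one simple way to find the changed parts. This is a sketch under my own assumptions (4 KB blocks, SHA-256 signatures kept from the previous backup, zero-filled stand-in data); it does not handle truncation or bytes inserted in the middle of the file.

```python
import hashlib

BLOCK = 4096   # arbitrary block size for the demo

def signatures(data):
    """One digest per fixed-size block of the file."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(previous_sigs, current):
    """Return (index, block) pairs that differ from the last backup."""
    delta = []
    for index, sig in enumerate(signatures(current)):
        if index >= len(previous_sigs) or previous_sigs[index] != sig:
            delta.append((index, current[index * BLOCK:(index + 1) * BLOCK]))
    return delta

old = bytes(10 * BLOCK)                  # stand-in for yesterday's 40 KB file
new = bytearray(old)
new[5 * BLOCK + 7] = 1                   # a small in-place change
new += b"appended mail"                  # plus some appended data

delta = changed_blocks(signatures(old), bytes(new))
print([index for index, _ in delta], sum(len(block) for _, block in delta))  # [5, 10] 4109
```

Only about 4 KB travels over the wire instead of the whole file.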

6. On-demand restore (points)

Sending an email to the admin to get the data back is surely a complete no-no, especially when the user may be off-site/traveling. The backup software should facilitate smart, remote and secure data restore (possibly browser based).

So the next time you choose backup software for your personal or enterprise needs, make sure it has evolved to have the above-mentioned features.

And remember – backup more, backup often.