Data de-duplication changes economics of backup

Address data de-duplication and you are addressing costs

The ability to de-duplicate backup data -- that is, back up or copy only unique blocks of data -- is rapidly changing the economics of data protection.

Data volumes are growing exponentially. Companies are not only generating more primary data but also are required by government regulators to back up and retain that data many times over its life cycle. With a retention period of one year for weekly full backups and 10 days for daily incremental backups, a single terabyte of data requires 53TB of storage capacity for data protection over its life cycle. Backing up, managing and storing this data is driving up labor costs as well as power, cooling and floor space costs.
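The 53TB figure can be reproduced with simple arithmetic. The sketch below is one way to arrive at it, not necessarily the author's calculation: the article gives only the retention periods and the total, so the 10 percent daily change rate and the count of 52 weekly fulls are assumptions.

```python
# Rough retention arithmetic for 1 TB of primary data.
# ASSUMPTIONS (not stated in the article): 52 weekly fulls are retained,
# and each daily incremental is 10% of the primary data set.

def protected_capacity_tb(primary_tb, weekly_fulls=52, daily_incrementals=10,
                          daily_change_rate=0.10):
    """Return the capacity needed to hold all retained backups, in TB."""
    full_tb = weekly_fulls * primary_tb
    incremental_tb = daily_incrementals * daily_change_rate * primary_tb
    return full_tb + incremental_tb


total = protected_capacity_tb(1.0)
print(f"{total:.0f} TB of backup capacity per 1 TB of primary data")
```

Nearly all of the total comes from the weekly full backups, which is why removing the duplicate data between successive fulls has such a large effect.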

That's the bad news. The good news is the cost of disk storage is decreasing, making it increasingly attractive for secondary storage.

And data de-duplication technology -- typically found on disk-based virtual tape libraries (VTL) -- can help control data growth by backing up and storing any given piece of data only one time.

VTLs are disk-based systems that emulate tape technology to enable enterprises to install them in existing environments with minimal disruption. De-duplication software (available on some VTLs) stores a baseline data set and then checks subsequent backup sets for duplicate data. When it finds a duplicate, it stores a small representation of it that enables the software to compile and restore complete files as needed.

There are two main data de-duplication methodologies: hash-based and byte-level comparison-based. The hash-based approach runs incoming data through an algorithm to create a hash, a small representation that serves as a unique identifier for the data. It then compares the hash with previous hashes stored in a look-up table. If a match is found, it replaces the redundant data with a pointer to the existing hash. If no match is found, the data is stored and its hash is added to the look-up table. But using a look-up table to identify duplicate hash strings can put a significant strain on performance and may require several weeks to achieve optimal de-duplication efficiency.
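The hash-based approach can be sketched in a few lines. This is a toy illustration, not any vendor's implementation: the fixed 4KB chunk size, the choice of SHA-256 and the class name HashDedupStore are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # ASSUMPTION: fixed-size chunks; real products may chunk differently


class HashDedupStore:
    """Toy hash-based de-duplication store (hypothetical, for illustration)."""

    def __init__(self):
        self.lookup = {}  # the look-up table: hash -> chunk bytes

    def backup(self, data: bytes) -> list:
        """Store data and return its recipe: an ordered list of chunk hashes."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.lookup:
                self.lookup[digest] = chunk  # no match: store the new chunk
            recipe.append(digest)  # match or not, the recipe keeps a pointer
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.lookup[digest] for digest in recipe)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.lookup.values())


store = HashDedupStore()
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096  # only the last chunk changed
r1 = store.backup(monday)
r2 = store.backup(tuesday)
print(len(monday) + len(tuesday), "bytes backed up,", store.stored_bytes(), "bytes stored")
```

Every incoming chunk costs one hash computation and one table look-up, whether or not it turns out to be a duplicate. That is the performance strain described above, and it grows with the size of the table.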

A more efficient method simply compares items on an object-by-object level; for example, comparing Word documents to other Word documents. Some technologies perform this comparison using a pattern-matching algorithm. However, a more efficient technology uses intelligent processes that analyze the backup files and the reference data set to identify files that are likely to be redundant before comparing the two files in more detail. By focusing its activities on suspected duplicates, it can de-duplicate more thoroughly and avoid processing new files unnecessarily.
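The two-stage idea (a cheap check for suspected duplicates, then a detailed byte-level comparison) might look like the sketch below. The article does not say how suspected duplicates are identified, so matching on file name is an assumption standing in for that analysis, and the function names are hypothetical. Python's standard difflib is used only to keep the example short; it is far too slow for real backup volumes.

```python
from difflib import SequenceMatcher
from pathlib import PurePath


def likely_match(name, reference):
    """Cheap pre-filter: return the reference object with the same file name.

    ASSUMPTION: name matching stands in for whatever metadata analysis
    a real product uses to spot suspected duplicates.
    """
    return reference.get(PurePath(name).name)


def byte_level_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus the bytes that differ."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # pointer into the reference
        else:
            delta.append(("insert", new[j1:j2]))  # only these bytes are stored
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new object from the reference and the stored delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


reference = {"report.doc": b"Quarterly report: revenue up 4 percent."}
incoming_name = "report.doc"
incoming = b"Quarterly report: revenue up 5 percent."

base = likely_match(incoming_name, reference)
if base is not None:  # suspected duplicate: compare in detail
    delta = byte_level_delta(base, incoming)
    new_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
    print(new_bytes, "new byte(s) stored out of", len(incoming))
```

A file with no likely match skips the detailed comparison entirely, which is how this approach avoids processing new files unnecessarily.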

Some technologies perform the de-duplication as the data is being backed up. This inline de-duplication slows backup performance and adds complexity to the backup. Other technologies perform out-of-band de-duplication in which they back up the data first at full wire speed and perform the de-duplication afterward.
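The difference between the two approaches is where the de-duplication work sits relative to the backup data path. In the sketch below, a simple SHA-256 index stands in for whichever de-duplication method a product actually uses, and all function names are hypothetical.

```python
import hashlib


def dedupe(block: bytes, index: dict) -> str:
    """Store a block once, keyed by its SHA-256 digest, and return the key."""
    key = hashlib.sha256(block).hexdigest()
    index.setdefault(key, block)
    return key


def inline_backup(blocks, index):
    """Inline: each block is de-duplicated before the write completes."""
    return [dedupe(block, index) for block in blocks]


def out_of_band_backup(blocks, staging):
    """Out-of-band, step 1: land the raw blocks with no extra processing."""
    staging.extend(blocks)


def post_process(staging, index):
    """Out-of-band, step 2: de-duplicate the staged blocks after the backup."""
    recipe = [dedupe(block, index) for block in staging]
    staging.clear()  # staging space is reclaimed once the blocks are indexed
    return recipe


blocks = [b"alpha", b"beta", b"alpha"]
staging, index = [], {}
out_of_band_backup(blocks, staging)    # the backup window ends here
recipe = post_process(staging, index)  # runs later, outside the data path
print(len(blocks), "blocks received,", len(index), "blocks kept")
```

Both paths end with the same stored data. The out-of-band path trades temporary staging capacity for a backup that does not wait on the de-duplication step.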

Byte-level de-duplication can provide up to 25:1 data reduction ratios. Combined with compression technology -- a typical VTL feature -- it lets enterprises store 50 times more data in the same space. This dramatic reduction enables companies to store more data online and keep it online longer, leading to labor savings and the advantages of keeping data on disk.
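The two reduction steps multiply. The article does not give the compression ratio on its own; the 2:1 figure below is inferred from the step from 25:1 to 50:1 and should be read as an assumption.

```python
def combined_ratio(dedup_ratio: float, compression_ratio: float) -> float:
    """Reduction ratios multiply when the steps are applied one after the other."""
    return dedup_ratio * compression_ratio


def physical_tb(logical_tb: float, dedup_ratio: float, compression_ratio: float) -> float:
    """Disk capacity needed to hold a given amount of backup data."""
    return logical_tb / combined_ratio(dedup_ratio, compression_ratio)


# ASSUMPTION: 2:1 compression, inferred from 25:1 becoming 50:1.
ratio = combined_ratio(25, 2)
needed = physical_tb(53, 25, 2)  # 53 TB of retained backups for 1 TB of primary data
print(f"{ratio:.0f}:1 overall, {needed:.2f} TB of disk for 53 TB of backups")
```

These are best-case figures, since the 25:1 ratio is quoted as an upper bound.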

Storing data on disk, for example, takes up less physical space than tape and significantly reduces power, cooling, security and other operating and infrastructure costs. According to a recent Gartner report, by 2008, 50 percent of current data centers will have insufficient power and cooling capacity to meet the demands of high-density equipment.

Other benefits include:

- Longer online data retention -- A 50:1 capacity reduction for a typical mix of business data (e-mail and files) means data can be maintained online longer to meet increasingly stringent business/regulatory service-level agreements.

- Decreased workload, increased reliability -- An enterprise with a 65TB data store that is growing at a typical rate of 56 percent annually and is backed up weekly would typically require two racks of disk storage using de-duplication vs. 49 racks without. By reducing the number of racks required and the number of disks spinning, the reliability of the overall system is increased; and the power, cooling and administration required is significantly reduced.

- Enable faster backups and restores -- Appliance solutions that de-duplicate outside the primary data path can deliver unimpeded wire-speed Fibre Channel backup and restore performance in the many TB/hr range.

- Eliminate physical threats to data -- Unlike physical tapes that can be lost, stolen or damaged, data on disk is maintained in a secure, highly available environment.

Data de-duplication changes the economics of data protection by making backup to a VTL significantly less expensive than plain disk-based data-protection solutions.

Data de-duplication is an important way for data center managers to address the spiraling cost of energy, labor and space, and to manage the impending shortage of power and cooling capacity.

Sandorfi is CTO of Sepaton. He can be reached at msandorfi@sepaton.com.

Miklos Sandorfi

Network World