Discussion:
[Dar-support] dar + deduplication
Christof Schulze
2012-10-07 10:08:49 UTC
Hello everyone,

this is my first post to this list. While I have searched using gmane I
am not sure whether this has been discussed already.

When saving backups off-site over a slow internet connection, the size
of the differential backup files can quickly become an issue, as dar
saves the whole file that changed instead of just the bytes that
actually changed.

Say I have a file named foo.txt in the initial backup whose content is
nothing but the string "hello". The file is then changed so that its
content reads "hello world", and a differential backup is run.
foo.txt will be saved again with the full content "hello world".
However, to completely restore the content it would be sufficient to
record that the file changed and that the missing bit is " world".

Of course it is possible to use this concept (via xdelta or bdelta) on
a complete archive; however, it gets inefficient when slices and
compression are used.

That is why I wonder whether it is planned to add this feature to dar
itself. Can it be done? Am I missing something?

Regards

Christof
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
yungchin
2012-10-07 17:51:16 UTC
Post by Christof Schulze
Of course it is possible to use this concept (via xdelta or bdelta) on
a complete archive; however, it gets inefficient when slices and
compression are used.
That is why I wonder whether it is planned to add this feature to dar
itself. Can it be done? Am I missing something?
Just in case it's useful to you, I've been using dar on top of bup for more than
a year now. You can read more about bup at
https://github.com/bup/bup/blob/master/README.md but here's a very brief
overview:

* dedup is based on matching hashes of chunks of data (it won't actually
dedup your "hello world" example, because the chunks are bigger than that,
but it dedups at relevant sizes)
* chunks are variable-sized, like in rsync (this improves real-world dedup
performance vastly)
* with remote backups, dedup is performed before network transfer
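
As a toy illustration of the chunk-level dedup (names are made up, it
assumes the default local repository in ~/.bup, and it is untested as
written):

bup init                              # create the local repository
tar -cf - /etc | bup split -n first   # first run stores all chunks
tar -cf - /etc | bup split -n second  # second run reuses existing chunks
du -sh ~/.bup                         # repo barely grows the second time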

I haven't much time right now so let me just paste a line from my backup
script (where only dar needs to run as root):

sudo dar --create - --noconf --empty-dir --nodump --cache-directory-tagging \
| bup split -r server_name: -n target_branch

This runs a full backup and stores the dar archive in a bup repository
on "server_name", in a branch called "target_branch". Note that you have
to run dar without compression, because you can't really dedup
compressed data; bup will compress the output after dedup by default.

Backing up a heavily used laptop with ~35GB of data and sending about
one full backup a week, my repository grows by ~100MB a week. This means
I've never needed to delete any dar archive so far! (which is good, too,
because bup hasn't implemented deleting revisions yet)

You can also send incremental backups during the week, and your full
backups will hardly use any extra space because bup can reuse chunks
from the incremental backups.

Currently, you need rather a lot of disk space for restores, because you
have to export the dar archive from bup before you can read it with dar.
But I'm working on a patch to bup, which should allow dar to read the
dar archive from a bup fuse-mount.
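
A restore along those lines could look something like this (just a
sketch; the names and paths are invented and the exact options may
differ on your setup):

# export the archive from bup back into a regular single-slice dar file
bup join -r server_name: target_branch > /tmp/restore/backup.1.dar
# then read it with dar as usual
sudo dar -x /tmp/restore/backup -R /mnt/restore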

YC
Niklas Hambüchen
2012-10-07 18:15:45 UTC
Yes, bup is the choice for dedup.
Post by yungchin
[...]
Christof Schulze
2012-10-07 19:24:42 UTC
Hi,
[bup explained briefly]
This is an interesting concept; however, it means that using dar's
encryption is not possible. Other than that, it does what I had hoped
for.

Unfortunately, encryption is a priority for me when storing backups
off-site. The deduplication should take place before any data is added
to a differential archive in the first place. That is where it is most
efficient, and all of dar's features remain available, except maybe
catalogue isolation, which would still be possible but pointless for the
referenced archive.

Is there anything I could do short of hacking dar to make this work?

Regards

Christof
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
Christof Schulze
2012-10-07 19:27:20 UTC
bup
...
Also, the backups can no longer be transferred to just any server - even
consumer routers now offer small FTP servers that could be used as a
cheap remote storage location. With bup, a different infrastructure is
needed.

Regards

Christof
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
Denis Corbin
2012-10-07 19:16:08 UTC
Post by Christof Schulze
Hello everyone,
Hello Christof,
Post by Christof Schulze
[...]
That is why I wonder whether it is planned to add this feature to dar
itself. Can it be done? Am I missing something?
I think this was discussed some months or years ago.

The first point to take into account is that to know that " world" has
been added to your file, you need to know what the contents of that file
were at the time of the previous backup ("hello"). If you do a
differential backup based on a full backup this is easy; however, you
will have to read both the current file and the archive of reference to
compare file contents (so expect the backup to take roughly twice as
long).

If you instead base your backup on a differential backup, and that file
was recorded as unchanged since the last backup, you cannot know
anything about its original contents. Some have proposed using checksums
or rolling checksums to detect which part of a file has changed without
needing the whole original file contents. A point to consider here is
how many bytes to cover with a single checksum, or in other words how
many checksums per megabyte to retain. The more checksums you have, the
more precisely you can detect a change in a file and the less data you
have to add to your archive, but the more space your archive of
reference needs to store the list of checksums for each file.
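
To give a rough idea of the orders of magnitude (block size and checksum
length are picked arbitrarily here, just for illustration): with one
20-byte checksum per 64 KiB block, a 1 GiB file needs 16384 checksums,
about 320 KiB of extra data in the archive of reference (roughly 0.03%
of the file size); with one checksum per 4 KiB block it needs 262144
checksums, about 5 MiB (roughly 0.5%), but changes can be located much
more precisely.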

In other words, with the way differential backups work today, a file
that has not changed at all ends up with only inode information in the
resulting archive. With this new approach you would instead have a list
of checksums proportional to the length of that file. For a file that
changed a little, you would have a list of checksums plus some part of
its data, depending on the number and distribution of the changes.
Finally, for a file that changed completely, we could simply store the
whole data, with no need to record or compute a list of checksums.

As you see, if you have very few changes and many files to back up, the
resulting archive could be bigger with this new approach than with the
way dar currently works. If you have many files that changed completely,
you will also get a bigger archive, as you still need to record the
checksum list of each file that did not change.

The gain can only be positive if you have small changes in huge files
(the threshold depends on the overall number of checksums to store in
the archive compared to the amount of data that no longer has to be
saved for the huge files that changed).

At restoration time, if you have a full backup, things are simple. If
instead you have a differential backup with only part of the data for a
given file, you cannot restore that file on its own. You must already
have the file in its former state on the filesystem in order to apply
this "patch". So if a file is partially saved over 10 consecutive
archives, you will need to recover each part in turn from each one of
them.

The last point is that relying on checksums does not give certainty that
a file has not changed. Even if the probability that two different
sequences of bytes give the same hash result is slim, it is not
impossible. In that case you would fail to save a file that changed.

This feature has already been requested
( http://sourceforge.net/p/dar/feature-requests/138/ )
but it is not at the top of the priority list compared to more
interesting features (it makes things complicated for a gain that is not
systematic, at the cost of a lot of additional CPU power).
Post by Christof Schulze
Regards
Christof
Regards,
Denis.
Christof Schulze
2012-10-07 19:48:22 UTC
Hi Denis,
Post by Denis Corbin
[...]
Thank you for taking the time to explain this. I second your point: in
the scenario you described, having dedup inside dar is insanely complex
and hardly beneficial.

This is the reason why I only do one full backup and then differential
backups against this first full backup. The next day I again take a
differential backup against this first full backup, and so on.

Day1: full backup (a)
Day2: diff to (a) creates file (b)
Day3: diff to (a) creates file (c)
Day4: diff to (a) creates file (d)
...

After a while these diffs get larger and I just create a new full
backup.
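
In dar commands the scheme looks roughly like this (a sketch only; the
archive names and the root path are invented):

dar -c day1_full -R /home               # Day1: full backup (a)
dar -C day1_cat -A day1_full            # isolate its catalogue
dar -c day2_diff -R /home -A day1_cat   # Day2: diff against (a)
dar -c day3_diff -R /home -A day1_cat   # Day3: diff against (a)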


This has a couple of advantages:
* I only need two sets of backups for a complete restore (the first full
backup and the latest diff)
* I can delete older diffs - say I want to be able to go back to daily
snapshots during the last 30 days - I only have to keep the last 30
diffs
* The data of the full backup is usually at hand when creating the diffs
(at the moment I isolate the catalogue, transfer the full backup to an
ftp site, and remove the full backup locally; with dedup working I
would probably keep the full backup locally)

The drawbacks of this approach are:

* For a certain amount of time I have to keep two full backups around,
namely until the files b/c/d are no longer needed.
* The differential backups are not as space efficient as in your
scenario, which means that transferring them over a slow uplink takes
a lot of time.

So in this scenario, the checksumming-tree complexity of dedup can be
taken out of the picture, and restore is also straightforward.
Parsing the data of the full backup again while creating the diff is
really not an issue, since creating the backup is CPU-bound (compression
& encryption), not IO-bound, here. So having dedup would probably speed
up creating the backups, as it would vastly reduce the amount of data
handled by those two CPU-intensive tasks.

Regards

Christof
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments
Denis Corbin
2012-10-09 11:53:25 UTC
Post by Christof Schulze
Hi Denis,
Hello Christof,
[...]
Post by Christof Schulze
This is the reason why I only do one full backup and then differential
backups against this first full backup. The next day I again take a
differential backup against this first full backup, and so on.
Day1: full backup (a)
Day2: diff to (a) creates file (b)
Day3: diff to (a) creates file (c)
Day4: diff to (a) creates file (d)
...
After a while these diffs get larger and I just create a new full
backup.
That's an interesting approach.
Post by Christof Schulze
[...]
Maybe it would indeed make sense to limit the "deduplication" (as you
call it) or "binary diff" (as others call it) to the creation of
differential archives based on full backups. Having this feature
available when creating archives based on differential backups would
only complicate the work inside dar and the restoration process for the
user.
Post by Christof Schulze
Regards
Christof
Regards,
Denis.
Christof Schulze
2012-10-09 16:23:38 UTC
Hi Denis,
Post by Denis Corbin
Maybe it would indeed make sense to limit the "deduplication" (as you
call it) or "binary diff" (as others call it) to the creation of
differential archives based on full backups. Having this feature
available when creating archives based on differential backups would
only complicate the work inside dar and the restoration process for the
user.
I agree.
That would certainly limit the complexity of the implementation, and it
would still be usable where it actually has the largest impact.

Christof
--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments