Post by Denis Corbin
Post by Christof Schulze
this is my first post to this list. While I have searched using gmane I
am not sure whether this has been discussed already.
When saving backups off-site over a slow internet connection, the size
of the differential backup files may quickly become an issue, as dar
saves the whole file that changed instead of just the bytes that
actually changed.
Say the initial backup contains a file named foo.txt whose content is
nothing but the string "hello". The file is then changed so that its
content reads "hello world", and a differential backup is run.
foo.txt will be saved again with the content "hello world". However, to
restore the content completely, it would be sufficient to record that
the file changed and that the missing bit is " world".
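A toy Python sketch of the concept (this is not how dar or any real
delta tool works, just an illustration of storing the difference): the
patch records the offset where the two versions diverge and the bytes
that follow.

# Toy illustration of storing only the changed bytes: the "patch"
# is the length of the common prefix plus the new tail.
old = b"hello"
new = b"hello world"

prefix = 0
while prefix < min(len(old), len(new)) and old[prefix] == new[prefix]:
    prefix += 1

patch = (prefix, new[prefix:])          # (5, b" world")

# Restoring requires the old content *and* the patch:
restored = old[:patch[0]] + patch[1]
assert restored == new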
Of course it is possible to apply this concept (via xdelta or bdelta)
to a complete archive, but that becomes inefficient when slices and
compression are used.
That is why I wonder whether it is planned to add this feature to dar
itself. Can it be done? Am I missing something?
I think this has been discussed some months or years ago.
The first point to take into account is: to know that " world" has been
added to your file, you need to know what the contents of that file
were at the time of the previous backup ("hello"). If you do a
differential backup based on a full backup this is easy, but you will
have to read both the current file and the archive of reference to
compare their contents (so expect the backup to take roughly twice as
long).
If you now base your backup on a differential backup, and that file was
recorded as unchanged since the last backup, you cannot know anything
about its original contents. Some have proposed using checksums or
rolling checksums to detect in which part a file changed, without
having to keep the whole original file contents. A point to consider
here is how many bytes to cover with a single checksum, or in other
words, how many checksums per megabyte to retain. The more checksums
you have, the more precisely you can locate a change in a file and the
less data you have to add to your archive, but the more space the
archive of reference will need to store the list of checksums for each
file.
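As a rough illustration of that trade-off, here is a minimal Python
sketch assuming fixed-size blocks and MD5 digests (real implementations
such as rsync use rolling checksums so that insertions do not shift
every following block; that refinement is omitted here):

import hashlib

def block_checksums(data: bytes, block_size: int) -> list[bytes]:
    # One digest per fixed-size block. Smaller blocks locate changes
    # more precisely, but mean more digests to store in the archive
    # of reference.
    return [hashlib.md5(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]

def changed_blocks(old_sums: list[bytes], new_data: bytes,
                   block_size: int) -> list[int]:
    # Indices of blocks whose digest differs from the reference.
    # Blocks past the old end count as changed (the file grew);
    # a shrinking file would also need its new length recorded,
    # which is omitted for brevity.
    new_sums = block_checksums(new_data, block_size)
    return [i for i, s in enumerate(new_sums)
            if i >= len(old_sums) or s != old_sums[i]]

With 64 KiB blocks and 16-byte MD5 digests, a 1 GiB file costs 16384
digests, that is 256 KiB of checksum metadata in the archive of
reference.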
In other words, when doing a differential backup today, if a file has
not changed at all, you end up with only inode information in the
resulting archive. With this new approach you would also have a list of
checksums proportional to the length of that file. For files that
changed a little, you would have a list of checksums plus some part of
their data, depending on the number and distribution of the changes.
Last, for a file that changed completely, we could simply store the
whole data; there is no need to compute or record a list of checksums.
As you see, if you have very few changes and many files to back up, the
resulting archive could be bigger this new way than with the way dar
currently works. If you have many files that changed completely, you
will also get a bigger archive, as you still need to record the
checksum list of each file that did not change.
The gain is only positive if you have small changes in huge files (the
threshold depends on the overall number of checksums to store in the
archive compared to the amount of data in the changed huge files that
no longer needs to be saved).
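To make that threshold concrete, a back-of-the-envelope check with
invented numbers (100 GiB under backup, 64 KiB blocks, 16-byte digests,
10 GiB of unchanged data inside the files that changed):

# Illustrative break-even check; all figures are made up.
checksum_size = 16            # bytes per block digest (e.g. MD5)
block_size    = 64 * 1024     # bytes covered by one digest

total_data     = 100 * 2**30  # everything under backup
data_not_saved =  10 * 2**30  # unchanged parts of files that changed

# Digests are kept for every file, changed or not.
overhead = (total_data // block_size) * checksum_size
gain     = data_not_saved - overhead

print(f"checksum overhead: {overhead / 2**20:.0f} MiB")   # 25 MiB
print(f"net gain:          {gain / 2**30:.2f} GiB")       # ~9.98 GiB

Here the gain dwarfs the overhead, but with many small files that were
completely rewritten, the overhead term wins.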
At restoration time now, if you have a full backup, things are simple.
If instead you have a differential backup holding only part of the data
for a given file, you cannot restore that file from it alone. You must
have the file in its former state on the filesystem in order to apply
this "patch". So if your file has been partially saved over 10
consecutive archives, you will need to recover each part in turn from
each one of them.
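A sketch of that chained restoration, assuming a hypothetical patch
format that maps block indices to the block contents saved by one
archive (nothing like dar's actual archive format):

def apply_patch(data: bytearray, patch: dict[int, bytes],
                block_size: int) -> bytearray:
    # Overwrite (or append) each block that this archive saved.
    # Truncation of shrunk files is ignored for brevity.
    for index, block in patch.items():
        start = index * block_size
        end = start + len(block)
        if end > len(data):
            data.extend(b"\0" * (end - len(data)))
        data[start:end] = block
    return data

def restore(full_backup: bytes, patches: list[dict[int, bytes]],
            block_size: int) -> bytes:
    # Start from the full backup, then replay every differential
    # patch in chronological order; skipping one leaves the file
    # in an inconsistent intermediate state.
    data = bytearray(full_backup)
    for patch in patches:
        data = apply_patch(data, patch, block_size)
    return bytes(data)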
The last point is that relying on checksums does not give certainty
that a file has not changed. Even if the probability that two different
sequences of bytes give the same hash result is slim, it is not
impossible. In that case you would miss saving a file that changed.
This feature has already been requested
( http://sourceforge.net/p/dar/feature-requests/138/ )
but it is not in the top-priority list, as more interesting features
come first (this one makes things complicated for a non-systematic gain
and a lot of additional CPU power).
Thank you for taking the time to explain this. I second your point: in
the scenario you described, having dedup inside dar is insanely complex
and hardly beneficial.
... differential backup only to this first full backup. The next day
again I take a differential backup towards this first full backup, and
so on.
...
backup, namely until the files b/c/d are not needed any more.
...
scenario. That means transferring the backup over a slow uplink takes a
lot of time.
...
taken out of the picture, and restore is also straightforward.
...
& encryption), not IO-bound here. So having dedup probably would speed
things up, as less data would have to be handled by those two
CPU-intensive tasks.