Moving to GitHub, slowly

The software on these pages will slowly be moved to GitHub: https://github.com/hilbix/. The CVS repository will be migrated to Git as well, so the history will be preserved, a bit. See FAQ.

Scylla and Charybdis, md5backup - Tools

The tools are developed under Linux with ESR's paradigm "release early, release often" in mind.
So you can consider this beta software, or alpha, or pre-alpha, or even worse ;)

Have a look in the download directory for all downloads.
As always here, all you get is the source. No binaries here.

md5backup 0.3.15-20061003-231645

Interim backup tool (look into latest version, download latest version 0.3.15-20061003-231645)

md5backup is an interim filesystem-to-filesystem backup tool, a step on my way to building a backup utility which suits all my needs. To use it, you need a second hard drive which serves as the backup medium. Optionally you can do a networked backup, too; however, this feature is not completely developed yet.

Currently no metadata is backed up; only the data inside files is protected. md5backup is also not yet able to back up sparse files (most often databases, which must be backed up by other means, or cache files, which can safely be ignored).

Please note that there is no real restore function yet! There now is bin/md5restore.sh, which can be used to restore a file interactively, but it is painfully slow and braindead to use, and you must be root to use it. So you definitely don't want to do a complete restore with it yet. Have a look at doc/restore.txt as well.

Sorry, there is no Wiki/FAQ/etc. yet. If I ever find some time I will prepare one.

md5backup is usable today: I back up all my production Internet servers with it, which run RedHat 9, SuSE 7.2, SuSE 9.0 and Debian Sarge. Just run bin/dobackup.sh; this should set up everything for you, too. Networked backup is possible as well, using scylla+charybdis; please look into the announcement of 0.3.10 below.

The main feature of md5backup is that it stores the files under their content's MD5 sum, so that you can easily check the files' integrity on the backup volume.
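
As a rough illustration of this idea (a sketch, not md5backup's actual code), the following Python stores a file under the hex MD5 sum of its content and later re-hashes every stored file to spot corruption. The flat directory layout and the names md5_of_file, store_file and verify_store are assumptions made for this example only. The sketch also leaves out the byte-for-byte compare that md5backup performs when an MD5 sum is already known.

```python
import hashlib
import os
import shutil


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_file(source, store_dir):
    """Copy source into store_dir under its MD5 sum (hypothetical flat layout)."""
    name = md5_of_file(source)
    target = os.path.join(store_dir, name)
    if not os.path.exists(target):
        shutil.copyfile(source, target)
    return name


def verify_store(store_dir):
    """Return the names of stored files whose content no longer matches the name."""
    return sorted(
        name
        for name in os.listdir(store_dir)
        if md5_of_file(os.path.join(store_dir, name)) != name
    )
```

Because the file name is the checksum, an integrity check needs nothing but the backup volume itself: any name returned by verify_store points at a damaged copy.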

Please note that md5backup was written such that it should work reliably under any circumstances. However, I cannot give you any guarantee that it protects your valuable data! Still, I trust it. All scripts I use can be found in the bin/ directory. For more information, have a look into the doc/ directory and read sc-backup.txt.

New for the upcoming 0.4.x:

You need SQLite to compile md5backup (local copy of SQLite source code).

History:

version 0.3.15-20061003-231645

download (212859 bytes)

This version now has two new preliminary scripts:

1. There now is a restore script bin/md5restore.sh

2. To setup networking there now is bin/sc-setup.sh

Little else changed. The scripts are not completely ready yet!

The restore script cannot restore the metadata of a file, and it is not meant to restore directories. Also be aware that sparse files are not yet backed up.

You need scylla-charybdis compiled to use sc-setup.sh.

version 0.3.14-20050306-002847

download (132389 bytes)

Some medium restructuring done in some central routines. Multiple file store added (my backup archive became full).

This new feature works, but is nearly untested (as always).

Additional read-only directories named outN, where N is a number starting from 0, are searched for existing backed-up files, too, just like the out directory. This way you can (manually) move old data from out/ into another directory to extend the hard drive space available for backup.

Files which are considered new (active data) are copied back into the main file store.
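
The lookup order described above could be sketched like this in Python. This is an illustration under assumptions, not the real implementation: the files are assumed to sit directly inside each directory under their MD5 name, and store_dirs and find_stored are made-up names. Copying active data back into out/ is not shown.

```python
import os
import re


def store_dirs(base_dir):
    """Return out/ first, then every outN directory in numeric order of N."""
    numbered = []
    for entry in os.listdir(base_dir):
        match = re.fullmatch(r"out(\d+)", entry)
        if match and os.path.isdir(os.path.join(base_dir, entry)):
            numbered.append((int(match.group(1)), entry))
    dirs = [os.path.join(base_dir, name) for _, name in sorted(numbered)]
    main = os.path.join(base_dir, "out")
    return ([main] if os.path.isdir(main) else []) + dirs


def find_stored(base_dir, md5_name):
    """Return the path of an already backed-up file, or None if it is unknown."""
    for directory in store_dirs(base_dir):
        candidate = os.path.join(directory, md5_name)
        if os.path.isfile(candidate):
            return candidate
    return None
```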

version 0.3.13-20050220-164446

download (124530 bytes)

Wildcard ignores added. As always, this new feature is not much tested.

Ignores, which are listed in a file, can now start with a ? (the ? is skipped), which enables wildcard matching. Wildcards are:

Universal quantifier (*), existential quantifier (?) and character classes ([...]):
First character ^ inverts the class
The first matching character can be anything, so []] matches ] and [[] matches [
a-b matches a to b including a and b (a<=b)
b-a matches a to b excluding a and b (a<b)
Example: []-]-] matches ] or - (this is a-b and -)
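
The bracket rules above can be sketched as a small Python function. It is only one reading of the rules, not the actual parser: class_matches is a made-up name, it receives the text between [ and ] (finding the closing bracket is left out), and a - is assumed to form a range only when a character follows it.

```python
def class_matches(body, char):
    """Test one character against the text found between '[' and ']'."""
    invert = body.startswith("^")
    if invert:
        body = body[1:]
    hit = False
    i = 0
    while i < len(body):
        low = body[i]
        if i + 2 < len(body) and body[i + 1] == "-":
            high = body[i + 2]
            if low <= high:
                # a-b: inclusive range
                hit = hit or low <= char <= high
            else:
                # b-a: exclusive range between the two characters
                hit = hit or high < char < low
            i += 3
        else:
            # single literal character, including a trailing '-'
            hit = hit or char == low
            i += 1
    return hit != invert
```

With this reading, class_matches("]-]-", "]") and class_matches("]-]-", "-") are both true, which agrees with the []-]-] example.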

Wildcard ignores are matched at backup time. Until I manage to create a better O(n) regexp parser which suits all my needs, this eats O(n*m) CPU, as all m ignores are run over each of the n file names found.
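
To make the O(n*m) cost visible, here is a minimal Python sketch. It uses Python's fnmatch module as a stand-in matcher, so its pattern syntax differs from the one described above (for example, fnmatch negates a class with ! rather than ^). Matching against the full path, and the names wildcard_ignores and filter_names, are assumptions of this example.

```python
from fnmatch import fnmatchcase


def wildcard_ignores(lines):
    """Keep only entries starting with '?', with that marker stripped."""
    patterns = []
    for line in lines:
        entry = line.rstrip("\n")
        if entry.startswith("?"):
            patterns.append(entry[1:])
    return patterns


def filter_names(names, patterns):
    """Drop every name matched by any pattern: up to n*m match attempts."""
    kept = []
    for name in names:                      # n file names
        if not any(fnmatchcase(name, pattern) for pattern in patterns):  # m ignores
            kept.append(name)
    return kept
```

Each file name is tried against each pattern until one hits, so the work grows with the product of both counts.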

This new type of ignore also solves a problem with the "normal" ignores (which are much faster), which are processed at startup time. The "normal" ignores sometimes produce "amazing" results on frequently changing files: as inode numbers are matched, the ignore can "hit" the wrong file (the file which owns the inode by the time the backup process reaches it).

version 0.3.12-20041005-050410

download (99040 bytes)

"nice", security hole fixed, new sparse file handling, bin/compare.sh

md5backup now automatically nices itself and uses file flushes. Also the backed up files are no longer readable by others. It ignores sparse files, and the source has been reorganized internally; some routines have moved into tinolib.

There is bin/compare.sh which can be used as a template for a restore script! Just copy the script and replace the MODE=compare line by MODE=restore and be sure to have understood what you do (else you will miss the second safety belt). Also bin/compare.sh can check if the backup really worked ;)

Sparse files are now skipped if they are too big (over 1 MB) and too sparse (75%). The problem with those is that they contain too little data. The drawback is that they are no longer backed up until I have added some more efficient sparse file support. Think of the following: Create a 200 TB file on a 64 bit filesystem. Add one block of data somewhere in it. Now do the backup. If you are able to process 1 GB/s (which is extremely fast) it still takes over 2 days just to hunt for this silly block of data.
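
The skip rule and the arithmetic can be sketched in Python as follows. This is an illustration, not the code md5backup uses: reading "1 MB" as 2**20 bytes, reading "75%" as "at least 75% of the file is holes", and estimating allocation from st_blocks in 512-byte units (Unix only) are all assumptions, as are the function names.

```python
import os

MIN_SIZE = 1 << 20        # "over 1 MB", assumed to mean 2**20 bytes
MAX_HOLE_RATIO = 0.75     # "too sparse (75%)", assumed to mean >= 75% holes


def is_too_sparse(size, allocated):
    """Decide from the apparent size and the allocated bytes whether to skip."""
    if size <= MIN_SIZE:
        return False
    holes = max(size - allocated, 0)
    return holes / size >= MAX_HOLE_RATIO


def file_is_too_sparse(path):
    """Apply the rule to a real file (st_blocks counts 512-byte units on Unix)."""
    info = os.lstat(path)
    return is_too_sparse(info.st_size, info.st_blocks * 512)


def scan_days(size_bytes, bytes_per_second):
    """Days needed to read size_bytes sequentially at the given rate."""
    return size_bytes / bytes_per_second / 86400
```

For the 200 TB example, scan_days(200 * 10**12, 10**9) comes to roughly 2.3 days.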

The security thingie is that the files in the out/ directory were world-readable. That really makes no sense, but I am usually alone on my machines, so this did not harm me. Be sure to do
chmod -R o-rwx /backup/md5backup

The nice is a step in my continuing effort to make md5backup less invasive for the system. A backup system should run in the background and should not use up a lot of resources while it runs; md5backup is still far from reaching this goal.

The nice seems to help when the filesystem is mounted without the option noatime and the harddrive is somewhat slow (as harddrives are, YKWIM). The frequent directory inode flushes of the backup process can keep other processes from doing IO. Without the nice, md5backup gets too high a scheduling priority, as it always runs as root.

What I would like is to scan the directory tree (in a POSIX-correct way) without 'accessing' it, and to use only "background IO" (that is, IO while the harddrive is otherwise idle). I did not find a method for this yet. AFAICS, if it is not already present, there should be a process capability for this: the process asks the kernel whether it may scan the directories without leaving a trace (which the kernel grants or not), and whether it may become a "nulltask" for a resource, that is, it only runs when the resource is not used by other processes.

(This text should go into a Wiki, but currently I do not have one.)

version 0.3.11-20040930-013306

download (81692 bytes)

Bugfix for sc-loop.sh: It simply did not work, oops ;)

sc-backup.sh is broken anyway. The following three scripts should run independently of each other:

The backup process backing up MySQL (sc-mysql.sh)
The backup process backing up files (dobackup.sh)
The network process, transporting files (sc-move.sh)

However, sc-backup.sh (and therefore sc-loop.sh) calls them one after another. This way network starvation slows down the backup cycle extremely. It is a bad design, so keep that in mind. I have now "improved" sc-backup.sh a little so that the loop does not completely stop when the network is down (but it can take ages), so the leftover data is hopefully transferred in the next cycle. Further improvements to sc-backup.sh and/or sc-loop.sh are left for the future.

However you can always invent your own scripts or run the three scripts noted above from cron, of course.

version 0.3.10-20040929-020738

(81000 bytes archive)

sc-loop.sh added and minor bugfix release in sc-backup.sh

Important: I changed the autodetection of mount points to ignore iso9660 and loopback filesystems. Please check if everything is still backed up. The following should help:
/backup/md5backup/dobackup.sh
fgrep ' start /' /backup/md5backup/log/`hostname -f`

I installed Debian Sarge with no networking. And guess what? The backup script was unable to detect the hostname ..

On Debian you should have the following packages installed: libssl-dev libgdbm-dev. In the future you will need zlib, too: zlib1g-dev (or do `apt-cache search zlib`)

version 0.3.9-20040825-030231

download (77157 bytes)

Bugfix release

Investigating the compare code because of the recent findings around MD5 (see md5crk.com), I found out two things:

The compare did not work right if one file was truncated.
md5backup is able to correctly back up two files which have the same MD5 sum but different contents (see test/README)!

If you are extremely paranoid, do a full backup again with this version! To do this, just remove the DBM file in the dbm/ directory. However, you should not need to do this (if you want to know whether everything is OK you can always do: cd /backup/md5backup; bin/check.sh).

Actually it's very unlikely that this bug ever does any harm, even if it persisted. To trigger this bug, the following must happen:

Either the file to back up must have an MD5 sum which is identical to that of an already known file AND the file to back up must be shorter than the already known file AND all bytes up to EOF of the file to back up must compare identical to the already known file. In that case the file to back up is contained in the already known file, but the short version is not retained. Also note that this case is extremely unlikely to ever be observed, as up to today, apart from deliberately constructed collisions (see the test/ directory), no two files are known which have the same MD5 sum but are not identical.

Or, the other possibility is that something strange happens on the backup archive while md5backup is active, like the temporarily written file being truncated outside the program's control, or the disk becoming full without the kernel reporting this error. Even then the MD5 sum must still match, and thus you could be pretty sure that the compare step would not be needed. However, with the latest finding of two files with identical MD5 sums (see the test/ directory), it's proven that the compare step is far from being redundant.
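
A compare that does not fall into the truncation trap has to treat a length mismatch as a difference. The following Python sketch shows one way to do that; it is not the fixed md5backup code, streams_identical is a made-up name, and it assumes regular files (or in-memory streams) where read() only returns less than requested at EOF.

```python
import io


def streams_identical(left, right, chunk_size=65536):
    """Compare two binary streams; a shorter stream never equals a longer one."""
    while True:
        a = left.read(chunk_size)
        b = right.read(chunk_size)
        if a != b:
            # Covers differing bytes and one side reaching EOF early.
            return False
        if not a:
            # Both sides reached EOF at the same position.
            return True


# A truncated copy is a prefix of the original, yet it must not compare equal.
original = io.BytesIO(b"abcdef")
truncated = io.BytesIO(b"abc")
print(streams_identical(truncated, original))
```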

The compare step was originally only done because I am a paranoid person and want to detect corruption of the backup archive, such that md5backup detects that the archived version is corrupt and a fresh copy is retained. Without the compare, the "full backup" method would not yield something like a "full backup"; instead it would merely be a heuristic (with a 2^-128 error probability *eg*).

Why do I write this text here? It's to show you that md5backup is written such that even if something breaks, it's unlikely you get garbage. And this is a dig at XP (eXtreme Programming), as you simply never write unit tests in advance which check for this type of unknown implementation bug (you would need a couple of files with identical MD5 sum and different content). Writing good unit tests takes several times as long as writing the application (for me!), so I simply have no time for this.

So it's easier to find such errors with code audits. Additionally, unit tests have the drawback that they usually don't run while the software is in production. So they are not there when the processor becomes confused and calculates 1+1=3. Therefore I like to have check statements in the software to verify proper function at runtime. Something like a built-in unit test ..

version 0.3.8-20040725-111739

download (68292 bytes)

Two bugs removed: the 'root' test in bin/dobackup.sh now works in cron, too, and a very minor and purely academic buglet in tino_file_lstat_diff() was fixed. The ignore lists were updated, too. This release was not tested much.

I realized that I will need growing file support somewhat urgently, as I have some files of several 100 MB which steadily grow a bit, and it's very inefficient to take multiple snapshots of them each day. So that's what I will do next. This will render the old data unusable. So be prepared: for version 0.4.x you must delete the database and the file store. It might take some time until this version shows up, though.

Note that with 0.4.x the current sc-*.sh scripts will become useless, too, as hardlinks are not supported by scylla-charybdis (yet). I currently don't have a good idea how to easily achieve the effect I want without tweaking too much. And I really don't want to introduce the networking option yet; I am not completely sure if it will ever be integrated into md5backup at all. However, I am in the rewriting phase of charybdis anyway, so perhaps I will hack this into scylla as well.

version 0.3.7-20040706-020801

download (61494 bytes)

First release with freshmeat announcement.

WARNING! CURRENTLY YOU ARE ON YOUR OWN WHEN IT COMES TO RESTORE!

Just unpack and run bin/dobackup.sh

If you want networked backup, please read doc/sc-backup.txt. This feature is working, but needs a lot of manual setup for now.

version 0.3.6-20040705-201547

(60935 bytes archive)

I still find little errors to correct.

This release is probably worth a public Freshmeat announcement.


License and Disclaimer

All you can see here is free software according to the GNU GPL.
Copyright (C)2000-2011 by Valentin Hilbig
Note that the software comes with absolutely no warranty of any kind.
You use the software at your own risk.
Valentin Hilbig cannot be held responsible for any unintended damage,
lost data or malfunction of the software you can find here.

Last modified: 2011-09-12 by Valentin Hilbig