$Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/results.txt,v 1.2 2004/09/28 23:58:47 tino Exp $

Some results of this proof of concept code:

MD5 directory infrastructure:

- It uses 2 hex digits of the MD5 sum to make 256 subdirectories in one
  directory.  This makes one directory fit into 4096 bytes or one kernel
  page, which is optimal for caching.  There is no real improvement in
  decreasing this to 1 hex digit or increasing it to 3 hex digits.

- We have two directory levels.  This makes 65536 directories.  3 levels
  would make 16 million directories, which really makes no sense to me.
  (A small sketch of this path layout follows after the lists below.)

- This structure is expected to stay efficient even when more than 100
  files are stored in each of these directories, so it can hold over 6
  million files.  With an average file size of, say, 100 KB this makes
  600 GB of data.  This is not much considering that hard drives today
  have 300 GB, but I will think about that when the time has come.

Program structure and speed:

- When the filesystem structure is cached in RAM on my development
  computer (Celeron 466, UDMA2 hard drive), the program scans over 3000
  files per second on an idle machine:

    arg=7 ign=1 err=0 dir=12975 file=473157 mod=14 old=2 new=12
    real    2m2.802s
    user    0m48.910s
    sys     0m53.640s

- An (incremental) backup taken without the data being cached in memory
  gave me the following figures (idle machine and local database):

    [root@firebird /backup/md5backup]# ./backup.sh
    autoignore file: /usr/local/backup/dbm/firebird.03.softkill.net
    12:26 args=7 ign=1 err=0 dir=12981 file=477569 mod=4709 old=60 new=4649 174MB

  Running it again with the filesystem cached, you see a difference (the
  difference to the first figure, I think, comes from the database: the
  internal drive is slower than the external backup hard drive):

    [root@firebird /backup/md5backup]# ./backup.sh
    autoignore file: /usr/local/backup/dbm/firebird.03.softkill.net
    03:43 args=7 ign=1 err=0 dir=12981 file=477571 mod=8 old=2 new=6 1MB

- The local GDBM database (use softlinks to put it on the local drive),
  which is scanned for modified timestamps, is what makes the program
  this fast.

- Another point is the way directories are scanned.  This is the
  approach I already found suitable on my Atari ST and on some
  problematic network filesystems: first the plain directory is read
  into memory (reading the directory inode), then a stat() pass is done
  (reading the file inodes), and then the files are processed (database
  access).  This greatly improves the locality of hard drive accesses.
  The downside shows with extremely big directories on inefficient
  filesystem designs, where the stat() pass may become very slow (this
  happens on FAT-type filesystems); there a combined directory scan with
  interleaved stat() calls might be more efficient.  But I found my way
  to be much more reliable on problematic network volumes, where
  accessing too many files while reading a directory may invalidate the
  directory handle.  (See the scan sketch below.)
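To make the path layout above concrete, here is a minimal sketch in C of
how an MD5 sum (given as its 32-character hex string) could be mapped
onto the two-level, 2-hex-digit directory structure.  The helper name
md5_store_path(), the base directory, and the choice of hex digits 1-2
and 3-4 for the two levels are assumptions for illustration only; this
is not the actual md5backup code.

/*
 * Sketch only: derive the storage path for a file from its MD5 sum.
 * Level 1 uses hex digits 1-2, level 2 uses hex digits 3-4, giving
 * 256 entries per directory and 256*256 = 65536 leaf directories.
 */
#include <stdio.h>

static void md5_store_path(char *buf, size_t len, const char *base, const char *md5hex)
{
    /* e.g. base/d4/1d/d41d8cd98f00b204e9800998ecf8427e */
    snprintf(buf, len, "%s/%.2s/%.2s/%s", base, md5hex, md5hex + 2, md5hex);
}

int main(void)
{
    char path[1024];

    md5_store_path(path, sizeof path, "/backup/md5backup/out",
                   "d41d8cd98f00b204e9800998ecf8427e");
    puts(path);
    return 0;
}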
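And here is a minimal sketch of the scan order described in the last
item: one pass that reads the directory entries into memory, one stat()
pass over all of them, and only then the per-file processing (which in
the real program would be the database lookup and hashing; here it just
prints the file size).  The struct, names, and error handling are
assumptions for illustration, not the real scanner.

/* Sketch only: three-pass directory scan (names, then stat(), then process). */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct entry {
    char        *name;
    struct stat st;
    int         ok;
};

int main(int argc, char **argv)
{
    const char    *dir = argc > 1 ? argv[1] : ".";
    DIR           *d = opendir(dir);
    struct dirent *ent;
    struct entry  *list = NULL;
    size_t        n = 0, i;

    if (!d) { perror(dir); return 1; }

    /* Pass 1: read the plain directory into memory (directory inode only) */
    while ((ent = readdir(d)) != NULL) {
        struct entry *tmp;

        if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
            continue;
        tmp = realloc(list, (n + 1) * sizeof *list);
        if (!tmp) { perror("realloc"); return 1; }
        list = tmp;
        list[n++].name = strdup(ent->d_name);
    }
    closedir(d);

    /* Pass 2: stat() run (reads the file inodes in one go) */
    for (i = 0; i < n; i++) {
        char path[4096];

        snprintf(path, sizeof path, "%s/%s", dir, list[i].name);
        list[i].ok = !lstat(path, &list[i].st);
        if (!list[i].ok)
            perror(path);
    }

    /* Pass 3: process the files (the real program would hit the database here) */
    for (i = 0; i < n; i++) {
        if (list[i].ok)
            printf("%10lld %s\n", (long long)list[i].st.st_size, list[i].name);
        free(list[i].name);
    }
    free(list);
    return 0;
}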
If you ever wonder why I do what I do the way I do it: a provider
notified me that the computers will be moved and that I have to make
sure there is a backup of them.  Nice, but for this I need to back up
around 50 GB (over the network), and I did not want to (again) move
this temporarily to another server and thus increase my steady file
chaos here.  Data IO is not the problem; the only problem is where to
keep the data.  Well, I had a system for this, but it ceased working
some months ago, and the new system will perhaps be ready next year (I
am too busy with other things).  Thus I needed a fast solution.

It went as follows: I have a lot of external hard drives
(FireWire+USB).  Therefore I took one and attached it to one of my old
but reliable Linux systems (P233 with ASUS T2P4 and only 256 MB of
RAM).  For this I needed to install a FireWire card as well.  However,
some driver irregularities in the current 2.4 Linux did some very
strange things: often an IEEE1394 timeout occurred, and there was no
retry and no error message either!  So a lot of sectors just did not
get written to the hard drive.  Even mke2fs was impossible: the
instability was so high that, with a probability of practically 101%,
even the filesystem structure could not be written to disk, without
mke2fs telling me about any error!  Well, using USB2 then made things a
lot better, but I am now left with a bad taste about this.  And
interestingly, with a Celeron 333 under RH9 here at home I do not
observe this type of error; perhaps it is the FireWire card I used in
the P233?  I don't know.  I don't even want to know.

The only thing I know for sure is the following: with md5backup *no*
trouble of this kind will go undetected, as I can do forensics at any
time and get a reliable answer as to which data is correct and which is
not.  This is how I want it to be.  I cannot trust the computer to do
anything correctly.  I cannot trust the network to transfer data
unharmed.  I cannot even trust my hard drive to tell me about errors or
to keep the data alive.  But I can trust my software: either everything
works as expected, or I will be able to tell what is still usable.

Please show me any other software out there which is able to do the
same, and I will start to use it at once.  Well, there is something
like this, it's called BitTorrent.  It can recover from any type of
local error, as it makes heavy use of cryptographic checksums - and
yes, I use it!  I have some issues with the way the BitTorrent protocol
is implemented in the software, but the protocol itself is flawless.

-Tino

$Log: results.txt,v $
Revision 1.2  2004/09/28 23:58:47  tino
slight changes to show all the new things

Revision 1.1  2004/05/09 21:12:12  tino
README.results moved here

Revision 1.4  2004/05/04 05:09:41  tino
preparing version 0.3.1

Revision 1.3  2004/01/17 14:06:37  tino
Release made ready

Revision 1.2  2004/01/11 22:04:32  tino
Some last bugs fixed

Revision 1.1  2004/01/11 20:30:10  tino
Added