$Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/results.txt,v 1.2 2004/09/28 23:58:47 tino Exp $

Some results of this proof of concept code:

MD5 directory infrastructure:

- It uses 2 hex digits of the MD5 sum to make 256 subdirectories in one
  directory.  This makes one directory fit into 4096 bytes or one kernel
  page, which is optimal for caching.  There is no real improvement in
  decreasing this to 1 hex digit or increasing it to 3 hex digits.

- We have two directory levels.  This makes 65536 directories.  3 levels
  would make 16 million directories, which really makes no sense to me.
  (A small sketch of this path layout follows after the lists below.)

- This structure is expected to stay efficient even when more than 100
  files are stored in each of these directories, so it can hold over 6
  million files.  With an average file size of, say, 100 KB this makes
  600 GB of data.  This is not much considering that hard drives today
  have 300 GB, but I will think about that when the time has come.

Program structure and speed:

- When the filesystem structure is cached in RAM on my development
  computer (Celeron 466, UDMA2 hard drive), the program scans over 3000
  files per second on an idle machine:

    arg=7 ign=1 err=0 dir=12975 file=473157 mod=14 old=2 new=12
    real    2m2.802s
    user    0m48.910s
    sys     0m53.640s

- An (incremental) backup taken without the data being cached in memory
  gave me the following figures (idle machine and local database):

    [root@firebird /backup/md5backup]# ./backup.sh
    autoignore file: /usr/local/backup/dbm/firebird.03.softkill.net
    12:26 args=7 ign=1 err=0 dir=12981 file=477569 mod=4709 old=60 new=4649 174MB

  Running it again with the filesystem cached, you see a difference (the
  difference to the first figure, I think, comes from the database: the
  internal drive is slower than the external backup hard drive):

    [root@firebird /backup/md5backup]# ./backup.sh
    autoignore file: /usr/local/backup/dbm/firebird.03.softkill.net
    03:43 args=7 ign=1 err=0 dir=12981 file=477571 mod=8 old=2 new=6 1MB

- The local GDBM database (use softlinks to put it on the local drive),
  which is scanned for modified timestamps, is what makes the program
  this fast.

- Another point is the way directories are scanned.  This is the
  approach I already found suitable on my Atari ST and on some
  problematic network filesystems: first the plain directory is read
  into memory (reading the directory inode), then a stat() pass is done
  (reading the file inodes), and then the files are processed (database
  access).  This greatly improves the locality of hard drive accesses.
  The downside shows with extremely big directories on inefficient
  filesystem designs, where the stat() pass may become very slow (this
  happens on FAT-type filesystems); there a combined directory scan with
  interleaved stat() calls might be more efficient.  But I found my way
  to be much more reliable on problematic network volumes, where
  accessing too many files while reading a directory may invalidate the
  directory handle.  (See the scan sketch below.)
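To make the path layout above concrete, here is a minimal sketch in C of
how an MD5 sum (given as its 32-character hex string) could be mapped
onto the two-level, 2-hex-digit directory structure.  The helper name
md5_store_path(), the base directory, and the choice of hex digits 1-2
and 3-4 for the two levels are assumptions for illustration only; this
is not the actual md5backup code.

/*
 * Sketch only: derive the storage path for a file from its MD5 sum.
 * Level 1 uses hex digits 1-2, level 2 uses hex digits 3-4, giving
 * 256 entries per directory and 256*256 = 65536 leaf directories.
 */
#include <stdio.h>

static void md5_store_path(char *buf, size_t len, const char *base, const char *md5hex)
{
    /* e.g. base/d4/1d/d41d8cd98f00b204e9800998ecf8427e */
    snprintf(buf, len, "%s/%.2s/%.2s/%s", base, md5hex, md5hex + 2, md5hex);
}

int main(void)
{
    char path[1024];

    md5_store_path(path, sizeof path, "/backup/md5backup/out",
                   "d41d8cd98f00b204e9800998ecf8427e");
    puts(path);
    return 0;
}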
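And here is a minimal sketch of the scan order described in the last
item: one pass that reads the directory entries into memory, one stat()
pass over all of them, and only then the per-file processing (which in
the real program would be the database lookup and hashing; here it just
prints the file size).  The struct, names, and error handling are
assumptions for illustration, not the real scanner.

/* Sketch only: three-pass directory scan (names, then stat(), then process). */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

struct entry {
    char        *name;
    struct stat st;
    int         ok;
};

int main(int argc, char **argv)
{
    const char    *dir = argc > 1 ? argv[1] : ".";
    DIR           *d = opendir(dir);
    struct dirent *ent;
    struct entry  *list = NULL;
    size_t        n = 0, i;

    if (!d) { perror(dir); return 1; }

    /* Pass 1: read the plain directory into memory (directory inode only) */
    while ((ent = readdir(d)) != NULL) {
        struct entry *tmp;

        if (!strcmp(ent->d_name, ".") || !strcmp(ent->d_name, ".."))
            continue;
        tmp = realloc(list, (n + 1) * sizeof *list);
        if (!tmp) { perror("realloc"); return 1; }
        list = tmp;
        list[n++].name = strdup(ent->d_name);
    }
    closedir(d);

    /* Pass 2: stat() run (reads the file inodes in one go) */
    for (i = 0; i < n; i++) {
        char path[4096];

        snprintf(path, sizeof path, "%s/%s", dir, list[i].name);
        list[i].ok = !lstat(path, &list[i].st);
        if (!list[i].ok)
            perror(path);
    }

    /* Pass 3: process the files (the real program would hit the database here) */
    for (i = 0; i < n; i++) {
        if (list[i].ok)
            printf("%10lld %s\n", (long long)list[i].st.st_size, list[i].name);
        free(list[i].name);
    }
    free(list);
    return 0;
}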
If you ever wonder why I do what I do the way I do it: a provider
notified me that the computers will be moved and that I have to make
sure there is a backup of them.  Nice, but for this I need to back up
around 50 GB (over the network), and I did not want to (again) move
this temporarily to another server and thus increase my steady file
chaos here.  Data IO is not the problem; the only problem is where to
keep the data.  Well, I had a system for this, but it ceased working
some months ago, and the new system will perhaps be ready next year (I
am too busy with other things).  Thus I needed a fast solution.

It went as follows: I have a lot of external hard drives
(FireWire+USB).  Therefore I took one and attached it to one of my old
but reliable Linux systems (P233 with ASUS T2P4 and only 256 MB of
RAM).  For this I needed to install a FireWire card as well.  However,
some driver irregularities in the current 2.4 Linux did some very
strange things: often an IEEE1394 timeout occurred, and there was no
retry and no error message either!  So a lot of sectors just did not
get written to the hard drive.  Even mke2fs was impossible: the
instability was so high that, with a probability of practically 101%,
even the filesystem structure could not be written to disk, without
mke2fs telling me about any error!  Well, using USB2 then made things a
lot better, but I am now left with a bad taste about this.  And
interestingly, with a Celeron 333 under RH9 here at home I do not
observe this type of error; perhaps it is the FireWire card I used in
the P233?  I don't know.  I don't even want to know.

The only thing I know for sure is the following: with md5backup *no*
trouble of this kind will go undetected, as I can do forensics at any
time and get a reliable answer as to which data is correct and which is
not.  This is how I want it to be.  I cannot trust the computer to do
anything correctly.  I cannot trust the network to transfer data
unharmed.  I cannot even trust my hard drive to tell me about errors or
to keep the data alive.  But I can trust my software: either everything
works as expected, or I will be able to tell what is still usable.

Please show me any other software out there which is able to do the
same, and I will start to use it at once.  Well, there is something
like this, it's called BitTorrent.  It can recover from any type of
local error, as it makes heavy use of cryptographic checksums - and
yes, I use it!  I have some issues with the way the BitTorrent protocol
is implemented in the software, but the protocol itself is flawless.

-Tino

$Log: results.txt,v $
Revision 1.2  2004/09/28 23:58:47  tino
slight changes to show all the new things

Revision 1.1  2004/05/09 21:12:12  tino
README.results moved here

Revision 1.4  2004/05/04 05:09:41  tino
preparing version 0.3.1

Revision 1.3  2004/01/17 14:06:37  tino
Release made ready

Revision 1.2  2004/01/11 22:04:32  tino
Some last bugs fixed

Revision 1.1  2004/01/11 20:30:10  tino
Added