$Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/archive_format.txt,v 1.1 2005/07/17 00:51:32 tino Exp $

Beginning with 0.4, md5backup introduces an intermediate "archive
format", in preparation for future versions. The archive format
separates the reading engine from the writing engine.

What is the idea behind this format?

- While md5backup runs, it overloads the machine with IO in two
  situations:

  + while scanning the directory tree for changes
  + while copying file data into the file store

  The archive format addresses the second problem: it makes it easy to
  throttle md5backup by slowing down the throughput into the archive
  stream.

- md5backup shall become multithreaded, so that several parts can run
  independently of each other:

  + The directory scanner

    This way md5backup might later do true online backups of changed
    files. All that is needed is kernel support for easier file
    activity monitoring.

  + The archiver

    The part which picks up the files and writes them into the
    archive format.

  + The file store

    The archive format is then unpacked into the file store format.

  Later on, this will allow adding:

  + The packer/unpacker

    An online compression module (zlib etc.) for the archive format.

  + The retriever

    Retrieves files from the file store and brings them back into
    archive format.

  + The restore

    Reads the archive format and writes the files, unpacked, back to
    the filesystem.

The archive format is also the way to gain truly independent networked
backup. You know what a network is: it connects computers. Where my
thinking differs from others' is in the basic assumption that the
network is down exactly when some important task (like the backup)
wants to run. So one goal is that md5backup can do the backup even if
the network is down!
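The throttling idea above can be sketched in a few lines. This is an
illustrative sketch only, not md5backup code: the ThrottledWriter name
and the bytes-per-second parameter are assumptions, chosen to show how
capping the archive stream's throughput throttles the whole backup.

```python
import io
import time


class ThrottledWriter:
    """Wrap a writable stream and cap its average throughput.

    Illustrative only: slowing down write() here would slow down
    whichever stage feeds the archive stream, without that stage
    having to know about throttling at all.
    """

    def __init__(self, stream, bytes_per_second):
        self.stream = stream
        self.rate = bytes_per_second
        self.start = time.monotonic()
        self.written = 0

    def write(self, data):
        self.stream.write(data)
        self.written += len(data)
        # Sleep until the average rate falls back under the limit.
        due = self.written / self.rate
        elapsed = time.monotonic() - self.start
        if due > elapsed:
            time.sleep(due - elapsed)


# Usage: anything that writes into the archive stream writes through
# the wrapper instead and is transparently rate-limited.
buf = io.BytesIO()
w = ThrottledWriter(buf, bytes_per_second=1_000_000)
w.write(b"x" * 4096)
```

The point of the design is that the producer side needs no changes;
the throttle lives entirely in the stream it writes to.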
Of course I cannot teach computers telepathy, but what I can achieve
is that the whole process is defined such that it runs well even if
the backup server is currently unreachable, postponing the data
transfer until the backup server is up again. This is where the
archive format comes into play.

md5backup will never define its own network transport mechanism for
the archive format. Instead there will be modules which hand this
problem to an external transport mechanism, which can be implemented
by you. The basic network transport module records the data on the
hard drive while it sends the data to this external transport
process. The process must then signal "good" when the complete file
is processed, so that the file can be deleted locally. If there is no
such "good" signal, the file simply stays on the hard drive until
some external process picks it up and sends it to the backup server.
That process must make sure the file is transferred correctly, else
the local database and the remote repository might get out of sync.
The transport to the other side can be scylla+charybdis, rsync, scp,
or anything else you might think of.

-----------------------------
Begin Vaporware Advertisement
-----------------------------

Please note that in a distant future, everything might merge into one
utility gcopy (see www.gcopy.com; yes, the domain is empty today),
which stands for GNU copy or General Copy, and which will be able to
replace the following:

- md5backup, scylla+charybdis, diskimg, ptybuffer
- cp, cmp, dd
- buffer
- ftp, scp, wget
- tar, cpio
- netcat, socklinger, tcpserver, inetd, xinetd
- and others

The idea behind gcopy is a general "swiss army knife" of 1:1 data
copying. So anything which has to do with "I have data here, I want
data there" *must* be implemented in gcopy for it to be able to
fulfill the task.
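The spool-and-forward behaviour described above can be sketched as
follows. This is a sketch under assumptions, not md5backup's actual
protocol: the spool directory layout, the push_spool name, and the
convention that the transport's exit status 0 is the "good" signal
are all illustrative.

```python
import subprocess
from pathlib import Path


def push_spool(spool_dir, transport_cmd):
    """Offer each spooled archive file to an external transport
    command; delete a file only after the transport reports "good".

    The key invariant: a file never leaves the local disk until the
    transport has confirmed the complete transfer, so an unreachable
    backup server merely postpones the transfer.
    """
    for f in sorted(Path(spool_dir).iterdir()):
        result = subprocess.run(transport_cmd + [str(f)])
        if result.returncode == 0:
            # "good" signal: the file arrived completely, so it may
            # now be removed locally.
            f.unlink()
        # else: keep the file; a later run (or some external process)
        # retries when the backup server is reachable again.
```

Any external tool that takes a filename and exits non-zero on failure
(rsync, scp, an S+C client) could play the transport role here.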
It shall also be quite efficient in the way it works, so it must
contain buffering and throttling, connected and disconnected modes,
and it must be able to connect to other computers or accept such
connections in various ways.

Stay tuned, perhaps I will ever come around to programming it. ;)

---------------------------
End Vaporware Advertisement
---------------------------

What is currently undefined is how corrupt data shall be handled. The
archive format can detect corruption, but it cannot heal it. So this
is left to manual operation. In "desperate mode" you can flag the
corruption, remove the database if this flag was set, delete the flag
and do a full backup again.

In the far distant future there might be a process which is able to
keep the local database in sync with the remote file store, providing
a "second safety net" which allows you to re-backup a file in case it
was not received correctly on the file store (or was accidentally
destroyed). This is important, as I personally want every process to
automagically recover from any type of failure on any side without
any manual intervention. Currently, however, you need to take over
control in case something breaks on the backup server or in the
transport process to it. Therefore it is so important that the
transport service works extremely reliably - which is the case with
S+C, but probably not with others.

$Log: archive_format.txt,v $
Revision 1.1  2005/07/17 00:51:32  tino
added