$Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/archive_format.txt,v 1.1 2005/07/17 00:51:32 tino Exp $

Beginning with 0.4, md5backup introduces an intermediate "archive
format", in preparation for future versions. The archive format
separates the reading engine from the writing engine.

What is the idea behind this format?

- While md5backup runs, it overloads the machine with IO in two
  situations:

  + while scanning the directory tree for changes
  + while copying file data into the file store

  The archive format addresses the second problem: it makes it easy to
  throttle md5backup by slowing down the throughput into the archive
  stream.

- md5backup shall become multithreaded, so that several parts can run
  independently of each other:

  + The directory scanner

    This way md5backup might later do true online backups of changed
    files. All that is needed is kernel support for easier file
    activity monitoring.

  + The archiver

    The part which picks up the files and writes them into the
    archive format.

  + The file store

    The archive format is then unpacked into the file store format.

  Later on, this will allow adding:

  + The packer/unpacker

    An online compression module (zlib etc.) for the archive format.

  + The retriever

    Retrieves files from the file store and brings them back into
    archive format.

  + The restore

    Reads the archive format and writes the files, unpacked, back to
    the filesystem.

The archive format is also the way to gain truly independent networked
backup. You know what a network is: it connects computers. Where my
thinking differs from others' is in the basic assumption that the
network is down exactly when some important task (like the backup)
wants to run. So one goal is that md5backup can do the backup even if
the network is down!
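The throttling idea above can be sketched in a few lines. This is an
illustrative sketch only, not md5backup code: the ThrottledWriter name
and the bytes-per-second parameter are assumptions, chosen to show how
capping the archive stream's throughput throttles the whole backup.

```python
import io
import time


class ThrottledWriter:
    """Wrap a writable stream and cap its average throughput.

    Illustrative only: slowing down write() here would slow down
    whichever stage feeds the archive stream, without that stage
    having to know about throttling at all.
    """

    def __init__(self, stream, bytes_per_second):
        self.stream = stream
        self.rate = bytes_per_second
        self.start = time.monotonic()
        self.written = 0

    def write(self, data):
        self.stream.write(data)
        self.written += len(data)
        # Sleep until the average rate falls back under the limit.
        due = self.written / self.rate
        elapsed = time.monotonic() - self.start
        if due > elapsed:
            time.sleep(due - elapsed)


# Usage: anything that writes into the archive stream writes through
# the wrapper instead and is transparently rate-limited.
buf = io.BytesIO()
w = ThrottledWriter(buf, bytes_per_second=1_000_000)
w.write(b"x" * 4096)
```

The point of the design is that the producer side needs no changes;
the throttle lives entirely in the stream it writes to.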
Of course I cannot teach computers telepathy, but what I can achieve
is that the whole process is defined such that it runs well even if
the backup server is currently unreachable, postponing the data
transfer until the backup server is up again. This is where the
archive format comes into play.

md5backup will never define its own network transport mechanism for
the archive format. Instead there will be modules which hand this
problem to an external transport mechanism, which can be implemented
by you. The basic network transport module records the data on the
hard drive while it sends the data to this external transport
process. The process must then signal "good" when the complete file
is processed, so that the file can be deleted locally. If there is no
such "good" signal, the file simply stays on the hard drive until
some external process picks it up and sends it to the backup server.
That process must make sure the file is transferred correctly, else
the local database and the remote repository might get out of sync.
The transport to the other side can be scylla+charybdis, rsync, scp,
or anything else you might think of.

-----------------------------
Begin Vaporware Advertisement
-----------------------------

Please note that in a distant future, everything might merge into one
utility gcopy (see www.gcopy.com; yes, the domain is empty today),
which stands for GNU copy or General Copy, and which will be able to
replace the following:

- md5backup, scylla+charybdis, diskimg, ptybuffer
- cp, cmp, dd
- buffer
- ftp, scp, wget
- tar, cpio
- netcat, socklinger, tcpserver, inetd, xinetd
- and others

The idea behind gcopy is a general "swiss army knife" of 1:1 data
copying. So anything which has to do with "I have data here, I want
data there" *must* be implemented in gcopy for it to be able to
fulfill the task.
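The spool-and-forward behaviour described above can be sketched as
follows. This is a sketch under assumptions, not md5backup's actual
protocol: the spool directory layout, the push_spool name, and the
convention that the transport's exit status 0 is the "good" signal
are all illustrative.

```python
import subprocess
from pathlib import Path


def push_spool(spool_dir, transport_cmd):
    """Offer each spooled archive file to an external transport
    command; delete a file only after the transport reports "good".

    The key invariant: a file never leaves the local disk until the
    transport has confirmed the complete transfer, so an unreachable
    backup server merely postpones the transfer.
    """
    for f in sorted(Path(spool_dir).iterdir()):
        result = subprocess.run(transport_cmd + [str(f)])
        if result.returncode == 0:
            # "good" signal: the file arrived completely, so it may
            # now be removed locally.
            f.unlink()
        # else: keep the file; a later run (or some external process)
        # retries when the backup server is reachable again.
```

Any external tool that takes a filename and exits non-zero on failure
(rsync, scp, an S+C client) could play the transport role here.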
It shall also be quite efficient in the way it works, so it must
contain buffering and throttling, connected and disconnected modes,
and it must be able to connect to other computers or accept such
connections in various ways.

Stay tuned, perhaps I will ever come around to programming it. ;)

---------------------------
End Vaporware Advertisement
---------------------------

What is currently undefined is how corrupt data shall be handled. The
archive format can detect corruption, but it cannot heal it. So this
is left to manual operation. In "desperate mode" you can flag the
corruption, remove the database if this flag was set, delete the flag
and do a full backup again.

In the far distant future there might be a process which is able to
keep the local database in sync with the remote file store, providing
a "second safety net" which allows you to re-backup a file in case it
was not received correctly on the file store (or was accidentally
destroyed). This is important, as I personally want every process to
automagically recover from any type of failure on any side without
any manual intervention. Currently, however, you need to take over
control in case something breaks on the backup server or in the
transport process to it. Therefore it is so important that the
transport service works extremely reliably - which is the case with
S+C, but probably not with others.

$Log: archive_format.txt,v $
Revision 1.1  2005/07/17 00:51:32  tino
added