-*-text-*- $Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/multi_filestore.txt,v 1.2 2005/03/06 00:39:37 tino Exp $

========================================================
Actually, this has been implemented, but only halfheartedly.
========================================================

md5backup now allows additional directories named outN, where N is a
number starting from 0.  These directories are searched for data which
has already been written.  Call them "backup file stores".

However, the implementation is still halfhearted today.  To be on the
safe side for now, changed files which can be found in a backup file
store are copied into the main file store again.  The idea is that new
files are always accessible through the main file store.

Now you might ask: what is the backup file store good for, then?  Well,
in the unlikely case that two files happen to have the same md5sum (see
../test/file0 and 1), md5backup numbers the files.  To keep this number
accurate, it must be able to look into the backup file stores.  Yes,
that is pure paranoia; usually you will never need this.

So having a file in the backup file store should be enough.  It would
not be easy to teach md5backup this behavior, but it is easy to remove
the duplicate files from the main file store again: just look for files
which also exist in a backup file store and remove them from the main
file store (after a compare).  Perhaps I will add a script to do this
soon (a rough sketch follows below).

Please note that there is not much overhead in moving the file into the
main file store, as the backup process always copies files into tmp/
first to get the data into a safe place, because files can change while
you read them.  This behavior is also consistent with the linked file
store, so I decided to go that way.

Please remember that md5backup will never depend on a database to
"optimize" anything.  Databases are only hints which can speed things
up, but they shall never make life more complicated in case they become
corrupt.

Where to head from here:

The main file store must always reside on a volume which allows
hardlinks, like NFS volumes or local hard drives.  The backup file
store may stay on any volume you like: volumes like CD-ROM which allow
implicit hardlinks, or volumes like SMB mounts which do not allow
hardlinks at all.

So we have two strategies when growing files come up: growing files can
be saved hardlinked in the main file store, but they cannot be saved
this way into a backup file store.  Therefore an "archive format" must
be defined to copy data from the main file store to a backup file
store.  The idea is to use "references" to other files which md5backup
understands when it sees them.  This can be encoded in the file name or
in some special metadata files.  I like the file name idea, as a
"find out? -type f" helps to identify this type of file; however, the
directory lookups must then be done by md5backup (which can be cached
in the database).

Note that the "archiving" then becomes something like a compression
task.  It utilizes the hardlink information found in the main file
store, but it must not rely on it completely.  It will also allow
working with "indexes into TAR files" and shall even support indexes
into compressed files.  The good thing is that all the complex stuff
(compression) goes into a separate utility which can be run
independently of md5backup; it can even run on another host (in the NFS
case).  So that is the definitive way to go.
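The cleanup script mentioned above does not exist yet; what follows is
only a minimal sketch (in Python) of the idea.  The layout it assumes
(main file store in out/, backup file stores in out0/, out1/, ..., and
identical relative file names in all stores) is an assumption for
illustration and may not match the real md5backup store layout.

    #!/usr/bin/env python
    # Sketch only: remove files from the main file store which also exist,
    # byte for byte identical, in one of the backup file stores.
    # The directory layout (out/, out0/, out1/, ...) and the identical
    # relative names are assumptions, not the documented md5backup layout.
    import filecmp, os, sys

    MAIN = "out"

    def backup_stores():
        """Yield every backup file store directory outN next to the main store."""
        n = 0
        while os.path.isdir(MAIN + str(n)):
            yield MAIN + str(n)
            n += 1

    def remove_duplicates(dry_run=True):
        for root, _dirs, files in os.walk(MAIN):
            for name in files:
                main_path = os.path.join(root, name)
                rel = os.path.relpath(main_path, MAIN)
                for store in backup_stores():
                    backup_path = os.path.join(store, rel)
                    # only remove after a full byte-for-byte compare
                    if (os.path.isfile(backup_path)
                            and filecmp.cmp(main_path, backup_path, shallow=False)):
                        print("duplicate:", main_path, "==", backup_path)
                        if not dry_run:
                            os.unlink(main_path)
                        break

    if __name__ == "__main__":
        # dry run by default; pass --delete to actually remove the duplicates
        remove_duplicates(dry_run="--delete" not in sys.argv[1:])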
Please note that I always want to have a compare script which is able
to handle all the special cases, such that this compare script can be
changed to also allow a restore.  This then shows the way to get
disconnected mode done: certain file stores can then be represented as
a file which only lists the files with their md5sums, but without the
file contents.

As md5backup always copies new data into the file store, even if it can
be found in a backup file store, we always have the "touched"
information, even if something goes wrong (two files which have the
same md5 sum but different data).  This, however, cannot be detected by
md5backup if the file store is disconnected; it can only be detected by
the archiving process (or, sadly, when you try to restore something).

And as soon as disconnected file stores are there, md5backup becomes a
full featured networked backup the way I want to have it.  A networked
backup must be able to back up data even while the network is down.
This works with md5backup if you back up to a local hard drive and let
the archiver move the data from the backup into some disconnected file
store.  (Well, this will not work with a file store on NFS, of course,
but that is a backup over LAN while I am thinking of backup over the
Internet.)

Yes, I think it is a *must* criterion of a networked backup to be able
to back up while the network is down.  Of course you can argue that the
data is ultimately protected only once it has reached its backup
destination, which cannot be reached while the network is down, but
this is only half of the truth.  I usually want to be able to restore a
point in time, like a file saved 5 minutes ago but then accidentally
deleted (perhaps it got deleted because the network went down).  So I
definitely do not want to hear something like "the backup has not taken
place because the network went down", I just want to have my file
restored.  It is of no interest whether the backup process has kept a
local copy or a networked copy; the important thing is that there is a
copy.

Only in the case of a catastrophic hardware failure do you need the
networked backup to restore the files.  Such problems are less
frequent, and usually, in such a case, the major harm is the downtime.
If you then cannot restore 2h of work because it was not yet saved over
the network, well, this is often of minor importance (in such a case).
Also note that you usually can recover hard drive data even from a PC
which burned down.  However, you cannot recover a backup which was
never made.

Additionally, with md5backup you can be nearly completely sure whether
you recovered a file correctly, as you have the md5sum to check after
the data was recovered (a sketch of such a check follows below).  That
is the idea behind md5backup: provide forensics even in the weirdest
cases you can imagine.  That is why I do it this way, as I do not know
how to make it reliable any other way.

Here is the old original idea:

========================================================
The following might go into the successor of md5backup,
as md5backup is only an intermediate utility!
========================================================

Some words on disconnected / multiple file stores:

The idea behind this is that at some point the backup volume fills up.
In this case you need a second volume where data is stored.  This is
important, as several times after a hard drive failure I have had
trouble getting at the readable parts of RAID or LVM volumes.  So there
should be no need to use RAID or LVM to extend the backup volume.
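Before going on with the old notes, here is a minimal sketch (in
Python) of the md5sum check after a restore mentioned above.  The
listing format it reads (one "<md5sum>  <path>" line per file, as
md5sum(1) writes it) is only an assumption for illustration, not an
md5backup file format.

    #!/usr/bin/env python
    # Sketch only: verify restored files against the md5sums recorded for
    # them in a simple listing file ("<md5sum>  <path>" per line).
    import hashlib, os, sys

    def md5_of(path, blocksize=1 << 20):
        """Compute the md5sum of a file without reading it into memory at once."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                h.update(block)
        return h.hexdigest()

    def check(listing, restored_root="."):
        """Return True if every file in the listing matches its recorded md5sum."""
        ok = True
        with open(listing) as f:
            for line in f:
                want, name = line.rstrip("\n").split("  ", 1)
                have = md5_of(os.path.join(restored_root, name))
                if have != want:
                    print("MISMATCH", name, "expected", want, "got", have)
                    ok = False
        return ok

    if __name__ == "__main__":
        # usage: check_restore.py <md5-listing> [<restored-root>]
        root = sys.argv[2] if len(sys.argv) > 2 else "."
        sys.exit(0 if check(sys.argv[1], root) else 1)

A listing like this is also roughly what a disconnected file store
could be reduced to: the md5sums and names, without the contents.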
Two strategies arise:

First, multiple volumes.  md5backup shall fill the first one, then
switch over to the second one, and so on.  There should be an easy way
to find a file again, so perhaps we need another directory layer below
out/ where you can mount more volumes.  Over time, fewer and fewer
files of the "old" volumes will be used.  It will then become feasible
to retire these old volumes once good, big-sized permanent backup
methods (like optical storage etc.) become available.  In this case you
want to disconnect these volumes from the backup system completely.

All those files kept for "history" reasons (which need not be restored
in case of a failure) shall then be archived away.  However, you do not
want to copy these files into the remaining backup storage, and you do
not want to access the archive in case you back up such a "history"
file from your hard drive.  I do not trust MD5 sums alone to be able to
do this; there must be some method such that the user can see this
happening and decide.  There is really no need to back up all old files
when you unpack some old archive and only take a look at it.

When all this is reached, something else comes to mind: indexing files
in general archives.  Often you have files within a TAR, zip or
similar.  Often you have a file both compressed and uncompressed.
There is really no need to back up both the compressed and the
uncompressed file; even having the file in a tar is enough to restore
it.  So perhaps it is feasible to index files within archives and then
not back them up individually if they are already there.  Only a
reference "can be found in archive xx under name yy" should be enough
(a small sketch of such an index follows below).  This information must
then be kept in three locations: the database, the log and the
metadata.  As always, I want the database to be deletable at any time,
as it is binary information.  To find the history of a file you have to
look into the log; to find the current version you have to look into
the metadata.

It complicates things, though.  You cannot switch an archive into
disconnected mode when it contains a file which may not be
disconnected.  Also, if the archive somehow gets lost (a simple bit
error in a .tar.gz makes everything behind it unreadable), all data
stored within is lost, too.  Perhaps this is not feasible at all.

md5backup is meant to be able to back up huge data areas.  I think of
terabytes today; this will become petabytes very soon.  Please note
that I combine this with my idea of www.nastysan.org, which shall some
day become a block device capable of storing data in such a way that
20% of the hard drives can behave irregularly without any data loss at
all.  Irregularly means not only having unreadable data.  Most times I
get a "read error" from the drive, that is true, but several times I
have observed completely erratic behavior, like presenting me complete
garbage while pretending this garbage is the correct information.
Sometimes drives just write sectors to the wrong place, thus not
storing the data and destroying other, possibly valuable, data.  I want
to have a "virtual drive" which can overcome this type of erratic
behavior with ease: a drive which detects such defects for sure and
which can repair them with extremely high probability.  If it cannot
repair a defect, it tells you about it, so that you know when data is
not correct.  I have wanted this for over 3 years now.  With hard
drives becoming bigger and bigger, such a meta-drive becomes more and
more important.  However, I have not yet come around to programming it,
sorry.
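Coming back to the archive index idea above: here is a minimal sketch
(in Python) which walks a tar archive and prints, for every regular
member, its md5sum together with the reference "can be found in archive
xx under name yy".  The one-line output format is made up for
illustration; md5backup defines no such index format yet.

    #!/usr/bin/env python
    # Sketch only: index the regular files inside a tar archive by md5sum,
    # so a backup could store a reference instead of the file contents.
    import hashlib, sys, tarfile

    def index_tar(path):
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                f = tar.extractfile(member)
                h = hashlib.md5()
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
                # reference: this content can be found in archive <path>
                # under the name <member.name>
                print(h.hexdigest(), path, member.name)

    if __name__ == "__main__":
        for archive in sys.argv[1:]:
            index_tar(archive)

Such an index would then, as noted above, have to be kept in the
database, the log and the metadata; the sketch only shows how to obtain
it.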
Please note, my complete file storage reaches the 10 TB level in 2004.
It is my own file storage.  The major problem is that I copy data all
over the place and have lost the overview.  What I need is a reliable
file storage: a file storage designed to keep the data safe for 10
years, minimum.  Then I can start to reconstruct all the data lost over
the years, to sort the data out bit by bit, and be safe against yet
another loss of data just because the sorting took 3 years and the hard
drive only lasts 2.5 years.

I tried LVM.  I tried RAID.  I tried backups.  All of those only
brought me more and more trouble and made the situation even more
chaotic.  So I now concentrate on a backup with forensics.  Then I will
build a huge file storage, an even huger backup storage and a backup of
the backup storage.  If something fails in the backup storage it can be
restored from the double backup.  If something fails in the double
backup, it is deleted and a new backup of the backup storage is made.
So the backup is safe, and I can concentrate on the file storage alone.

This - today - is only around 2 TB.  That's not much, really.  However,
I need it accessible from 3 locations: home, work and the Internet.  So
I have to build it 3 times.  That means 3 backups, tops, and keeping
everything synchronized.  Nice task, eh?  Oh well, I would like to have
the grid, today.

-Tino

$Log: multi_filestore.txt,v $
Revision 1.2  2005/03/06 00:39:37  tino
Information corrected according to version 0.3.14

Revision 1.1  2004/05/09 23:25:12  tino
copied from TODO