-*-text-*- $Header: /CVSROOT/public/scylla-charybdis/md5backup/doc/multi_filestore.txt,v 1.2 2005/03/06 00:39:37 tino Exp $

========================================================
Actually, this has been implemented, but only halfheartedly.
========================================================

md5backup now allows additional directories named outN, where N is a
number starting from 0.  These directories are searched for data which
has already been written.  Call them "backup file stores".

However, the implementation is still halfhearted today.  To be on the
safe side for now, changed files which can be found in a backup file
store are copied into the main file store again.  The idea is that new
files are always accessible through the main file store.

Now you might ask: what is the backup file store good for, then?  Well,
in the unlikely case that two files happen to have the same md5sum (see
../test/file0 and 1), md5backup numbers the files.  To keep this number
accurate, it must be able to look into the backup file stores.  Yes,
that is pure paranoia; usually you will never need this.

So having a file in the backup file store should be enough.  It would
not be easy to teach md5backup this behavior, but it is easy to remove
the duplicate files from the main file store again: just look for files
which also exist in a backup file store and remove them from the main
file store (after a compare).  Perhaps I will add a script to do this
soon (a rough sketch follows below).

Please note that there is not much overhead in moving the file into the
main file store, as the backup process always copies files into tmp/
first to get the data into a safe place, because files can change while
you read them.  This behavior is also consistent with the linked file
store, so I decided to go that way.

Please remember that md5backup will never depend on a database to
"optimize" anything.  Databases are only hints which can speed things
up, but they shall never make life more complicated in case they become
corrupt.

Where to head from here:

The main file store must always reside on a volume which allows
hardlinks, like NFS volumes or local hard drives.  The backup file
store may stay on any volume you like: volumes like CD-ROM which allow
implicit hardlinks, or volumes like SMB mounts which do not allow
hardlinks at all.

So we have two strategies when growing files come up: growing files can
be saved hardlinked in the main file store, but they cannot be saved
this way into a backup file store.  Therefore an "archive format" must
be defined to copy data from the main file store to a backup file
store.  The idea is to use "references" to other files which md5backup
understands when it sees them.  This can be encoded in the file name or
in some special metadata files.  I like the file name idea, as a
"find out? -type f" helps to identify this type of file; however, the
directory lookups must then be done by md5backup (which can be cached
in the database).

Note that the "archiving" then becomes something like a compression
task.  It utilizes the hardlink information found in the main file
store, but it must not rely on it completely.  It will also allow
working with "indexes into TAR files" and shall even support indexes
into compressed files.  The good thing is that all the complex stuff
(compression) goes into a separate utility which can be run
independently of md5backup; it can even run on another host (in the NFS
case).  So that is the definitive way to go.
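The cleanup script mentioned above does not exist yet; what follows is
only a minimal sketch (in Python) of the idea.  The layout it assumes
(main file store in out/, backup file stores in out0/, out1/, ..., and
identical relative file names in all stores) is an assumption for
illustration and may not match the real md5backup store layout.

    #!/usr/bin/env python
    # Sketch only: remove files from the main file store which also exist,
    # byte for byte identical, in one of the backup file stores.
    # The directory layout (out/, out0/, out1/, ...) and the identical
    # relative names are assumptions, not the documented md5backup layout.
    import filecmp, os, sys

    MAIN = "out"

    def backup_stores():
        """Yield every backup file store directory outN next to the main store."""
        n = 0
        while os.path.isdir(MAIN + str(n)):
            yield MAIN + str(n)
            n += 1

    def remove_duplicates(dry_run=True):
        for root, _dirs, files in os.walk(MAIN):
            for name in files:
                main_path = os.path.join(root, name)
                rel = os.path.relpath(main_path, MAIN)
                for store in backup_stores():
                    backup_path = os.path.join(store, rel)
                    # only remove after a full byte-for-byte compare
                    if (os.path.isfile(backup_path)
                            and filecmp.cmp(main_path, backup_path, shallow=False)):
                        print("duplicate:", main_path, "==", backup_path)
                        if not dry_run:
                            os.unlink(main_path)
                        break

    if __name__ == "__main__":
        # dry run by default; pass --delete to actually remove the duplicates
        remove_duplicates(dry_run="--delete" not in sys.argv[1:])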
Please note that I always want to have a compare script which is able
to handle all the special cases, such that this compare script can be
changed to also allow a restore.  This then shows the way to get
disconnected mode done: certain file stores can then be represented as
a file which only lists the files with their md5sums, but without the
file contents.

As md5backup always copies new data into the file store, even if it can
be found in a backup file store, we always have the "touched"
information, even if something goes wrong (two files which have the
same md5 sum but different data).  This, however, cannot be detected by
md5backup if the file store is disconnected; it can only be detected by
the archiving process (or, sadly, when you try to restore something).

And as soon as disconnected file stores are there, md5backup becomes a
full featured networked backup the way I want to have it.  A networked
backup must be able to back up data even while the network is down.
This works with md5backup if you back up to a local hard drive and let
the archiver move the data from the backup into some disconnected file
store.  (Well, this will not work with a file store on NFS, of course,
but that is a backup over LAN while I am thinking of backup over the
Internet.)

Yes, I think it is a *must* criterion of a networked backup to be able
to back up while the network is down.  Of course you can argue that the
data is ultimately protected only once it has reached its backup
destination, which cannot be reached while the network is down, but
this is only half of the truth.  I usually want to be able to restore a
point in time, like a file saved 5 minutes ago but then accidentally
deleted (perhaps it got deleted because the network went down).  So I
definitely do not want to hear something like "the backup has not taken
place because the network went down", I just want to have my file
restored.  It is of no interest whether the backup process has kept a
local copy or a networked copy; the important thing is that there is a
copy.

Only in the case of a catastrophic hardware failure do you need the
networked backup to restore the files.  Such problems are less
frequent, and usually, in such a case, the major harm is the downtime.
If you then cannot restore 2h of work because it was not yet saved over
the network, well, this is often of minor importance (in such a case).
Also note that you usually can recover hard drive data even from a PC
which burned down.  However, you cannot recover a backup which was
never made.

Additionally, with md5backup you can be nearly completely sure whether
you recovered a file correctly, as you have the md5sum to check after
the data was recovered (a sketch of such a check follows below).  That
is the idea behind md5backup: provide forensics even in the weirdest
cases you can imagine.  That is why I do it this way, as I do not know
how to make it reliable any other way.

Here is the old original idea:

========================================================
The following might go into the successor of md5backup,
as md5backup is only an intermediate utility!
========================================================

Some words on disconnected / multiple file stores:

The idea behind this is that at some point the backup volume fills up.
In this case you need a second volume where data is stored.  This is
important, as several times after a hard drive failure I have had
trouble getting at the readable parts of RAID or LVM volumes.  So there
should be no need to use RAID or LVM to extend the backup volume.
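Before going on with the old notes, here is a minimal sketch (in
Python) of the md5sum check after a restore mentioned above.  The
listing format it reads (one "<md5sum>  <path>" line per file, as
md5sum(1) writes it) is only an assumption for illustration, not an
md5backup file format.

    #!/usr/bin/env python
    # Sketch only: verify restored files against the md5sums recorded for
    # them in a simple listing file ("<md5sum>  <path>" per line).
    import hashlib, os, sys

    def md5_of(path, blocksize=1 << 20):
        """Compute the md5sum of a file without reading it into memory at once."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(blocksize), b""):
                h.update(block)
        return h.hexdigest()

    def check(listing, restored_root="."):
        """Return True if every file in the listing matches its recorded md5sum."""
        ok = True
        with open(listing) as f:
            for line in f:
                want, name = line.rstrip("\n").split("  ", 1)
                have = md5_of(os.path.join(restored_root, name))
                if have != want:
                    print("MISMATCH", name, "expected", want, "got", have)
                    ok = False
        return ok

    if __name__ == "__main__":
        # usage: check_restore.py <md5-listing> [<restored-root>]
        root = sys.argv[2] if len(sys.argv) > 2 else "."
        sys.exit(0 if check(sys.argv[1], root) else 1)

A listing like this is also roughly what a disconnected file store
could be reduced to: the md5sums and names, without the contents.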
Two strategies arise:

First, multiple volumes.  md5backup shall fill the first one, then
switch over to the second one, and so on.  There should be an easy way
to find a file again, so perhaps we need another directory layer below
out/ where you can mount more volumes.  Over time, fewer and fewer
files of the "old" volumes will be used.  It will then become feasible
to retire these old volumes once good, big-sized permanent backup
methods (like optical storage etc.) become available.  In this case you
want to disconnect these volumes from the backup system completely.

All those files kept for "history" reasons (which need not be restored
in case of a failure) shall then be archived away.  However, you do not
want to copy these files into the remaining backup storage, and you do
not want to access the archive in case you back up such a "history"
file from your hard drive.  I do not trust MD5 sums alone to be able to
do this; there must be some method such that the user can see this
happening and decide.  There is really no need to back up all old files
when you unpack some old archive and only take a look at it.

When all this is reached, something else comes to mind: indexing files
in general archives.  Often you have files within a TAR, zip or
similar.  Often you have a file both compressed and uncompressed.
There is really no need to back up both the compressed and the
uncompressed file; even having the file in a tar is enough to restore
it.  So perhaps it is feasible to index files within archives and then
not back them up individually if they are already there.  Only a
reference "can be found in archive xx under name yy" should be enough
(a small sketch of such an index follows below).  This information must
then be kept in three locations: the database, the log and the
metadata.  As always, I want the database to be deletable at any time,
as it is binary information.  To find the history of a file you have to
look into the log; to find the current version you have to look into
the metadata.

It complicates things, though.  You cannot switch an archive into
disconnected mode when it contains a file which may not be
disconnected.  Also, if the archive somehow gets lost (a simple bit
error in a .tar.gz makes everything behind it unreadable), all data
stored within is lost, too.  Perhaps this is not feasible at all.

md5backup is meant to be able to back up huge data areas.  I think of
terabytes today; this will become petabytes very soon.  Please note
that I combine this with my idea of www.nastysan.org, which shall some
day become a block device capable of storing data in such a way that
20% of the hard drives can behave irregularly without any data loss at
all.  Irregularly means not only having unreadable data.  Most times I
get a "read error" from the drive, that is true, but several times I
have observed completely erratic behavior, like presenting me complete
garbage while pretending this garbage is the correct information.
Sometimes drives just write sectors to the wrong place, thus not
storing the data and destroying other, possibly valuable, data.  I want
to have a "virtual drive" which can overcome this type of erratic
behavior with ease: a drive which detects such defects for sure and
which can repair them with extremely high probability.  If it cannot
repair a defect, it tells you about it, so that you know when data is
not correct.  I have wanted this for over 3 years now.  With hard
drives becoming bigger and bigger, such a meta-drive becomes more and
more important.  However, I have not yet come around to programming it,
sorry.
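Coming back to the archive index idea above: here is a minimal sketch
(in Python) which walks a tar archive and prints, for every regular
member, its md5sum together with the reference "can be found in archive
xx under name yy".  The one-line output format is made up for
illustration; md5backup defines no such index format yet.

    #!/usr/bin/env python
    # Sketch only: index the regular files inside a tar archive by md5sum,
    # so a backup could store a reference instead of the file contents.
    import hashlib, sys, tarfile

    def index_tar(path):
        with tarfile.open(path, "r:*") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                f = tar.extractfile(member)
                h = hashlib.md5()
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
                # reference: this content can be found in archive <path>
                # under the name <member.name>
                print(h.hexdigest(), path, member.name)

    if __name__ == "__main__":
        for archive in sys.argv[1:]:
            index_tar(archive)

Such an index would then, as noted above, have to be kept in the
database, the log and the metadata; the sketch only shows how to obtain
it.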
Please note, my complete file storage reaches the 10 TB level in 2004.
It is my own file storage.  The major problem is that I copy data all
over the place and have lost the overview.  What I need is a reliable
file storage: a file storage designed to keep the data safe for 10
years, minimum.  Then I can start to reconstruct all the data lost over
the years, to sort the data out bit by bit, and be safe against yet
another loss of data just because the sorting took 3 years and the hard
drive only lasts 2.5 years.

I tried LVM.  I tried RAID.  I tried backups.  All of those only
brought me more and more trouble and made the situation even more
chaotic.  So I now concentrate on a backup with forensics.  Then I will
build a huge file storage, an even huger backup storage and a backup of
the backup storage.  If something fails in the backup storage it can be
restored from the double backup.  If something fails in the double
backup, it is deleted and a new backup of the backup storage is made.
So the backup is safe, and I can concentrate on the file storage alone.

This - today - is only around 2 TB.  That's not much, really.  However,
I need it accessible from 3 locations: home, work and the Internet.  So
I have to build it 3 times.  That means 3 backups, tops, and keeping
everything synchronized.  Nice task, eh?  Oh well, I would like to have
the grid, today.

-Tino

$Log: multi_filestore.txt,v $
Revision 1.2  2005/03/06 00:39:37  tino
Information corrected according to version 0.3.14

Revision 1.1  2004/05/09 23:25:12  tino
copied from TODO