
Introduction

zbackup is a globally-deduplicating backup tool, based on the ideas found in rsync. Feed a large .tar into it, and it will store duplicate regions of it only once, then compress and optionally encrypt the result. Feed another .tar file, and it will also re-use any data found in any previous backups. This way only new changes are stored, and as long as the files are not very different, the amount of storage required is very low. Any of the backup files stored previously can be read back in full at any time. The program is format-agnostic, so you can feed virtually any files to it (any types of archives, proprietary formats, even raw disk images -- but see Caveats).

This is achieved by sliding a window with a rolling hash over the input at byte granularity and checking whether the block currently in focus has been seen before. If the rolling hash matches, an additional full cryptographic hash is calculated to ensure the block is indeed the same; only then does deduplication happen.

Features

The program has the following features:

  • Parallel LZMA or LZO compression of the stored data
  • Built-in AES encryption of the stored data
  • Possibility to delete old backup data
  • Use of a 64-bit rolling hash, keeping the number of soft collisions at zero
  • Repository consists of immutable files. No existing files are ever modified
  • Written in C++ with only modest library dependencies
  • Safe to use in production (see below)
  • Possibility to exchange data between repos without recompression

Build dependencies

  • cmake >= 2.8.3 (though it should not be too hard to compile the sources by hand if needed)
  • libssl-dev for all encryption, hashing and random numbers
  • libprotobuf-dev and protobuf-compiler for data serialization
  • liblzma-dev for compression
  • liblzo2-dev for compression (optional)
  • zlib1g-dev for adler32 calculation

Quickstart

To build and install:

cd zbackup
cmake .
make
sudo make install
# or just run as ./zbackup

zbackup is also packaged for Fedora/EPEL, Debian, Ubuntu, Arch Linux and FreeBSD.
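
For example, installing from a distribution package might look like this (the package name is assumed to be simply zbackup):

sudo apt-get install zbackup   # Debian/Ubuntu (assumed package name)
sudo dnf install zbackup       # Fedora, or EPEL on CentOS/RHEL (assumed package name)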

To use:

zbackup init --non-encrypted /my/backup/repo
tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
zbackup restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a lot of RAM to spare, you can use it to speed up the restore process -- to use 512 MB more, pass --cache-size 512mb when restoring.
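
For example (a sketch; global flags go before the command):

zbackup --cache-size 512mb restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar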

If encryption is wanted, create a file with your password:

# more secure to use an editor
echo mypassword > ~/.my_backup_password
chmod 600 ~/.my_backup_password

Then init the repo the following way:

zbackup init --password-file ~/.my_backup_password /my/backup/repo

And always pass the same argument afterwards:

tar c /my/precious/data | zbackup --password-file ~/.my_backup_password backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
zbackup --password-file ~/.my_backup_password restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /my/precious/backup-restored.tar

If you have a 32-bit system and a lot of cores, consider lowering the number of compression threads by passing --threads 4 or --threads 2 if the program runs out of address space when backing up (see item 2 under Caveats below). There should be no problem on a 64-bit system.
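
For example, on a 32-bit machine:

tar c /my/precious/data | zbackup --threads 2 backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`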

Caveats

  • While you can pipe any data into the program, the data should be uncompressed and unencrypted -- otherwise no deduplication can be performed on it. zbackup compresses and encrypts the data itself, so there's no need to do that yourself. Just run tar c and pipe it into zbackup directly. If backing up disk images that employ encryption, pipe the unencrypted version (the one you normally mount); see the sketch after this list. If you create .zip or .rar files, use no compression (-0 or -m0) and no encryption.
  • Parallel LZMA compression uses a lot of RAM (several hundreds of megabytes, depending on the number of threads used), and ten times more virtual address space. The latter is only relevant on 32-bit architectures where it's limited to 2 or 3 GB. If you hit the ceiling, lower the number of threads with --threads.
  • Since the data is deduplicated, there's naturally no redundancy in it. The loss of a single file can lead to the loss of virtually all data. Make sure you store it on redundant storage (RAID1, a cloud provider, etc).
  • The encryption key, if used, is stored in the info file in the root of the repo, encrypted with your password. This means you can technically change your password without re-encrypting any data, and as long as no one possesses the old info file and knows your old password, you are safe (note that the ability to change a repo between encrypted and non-encrypted is not implemented yet -- someone who needs this is welcome to create a pull request -- the groundwork is all there). Also note that it is crucial you don't lose your info file, as otherwise the whole backup would be lost.
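
As a sketch of the first caveat, when backing up an encrypted disk image, pipe the already-unlocked (unencrypted) view of the volume rather than the raw device -- the device path below is hypothetical:

# /dev/mapper/opened-volume is a hypothetical, already-unlocked volume
dd if=/dev/mapper/opened-volume bs=1M | zbackup backup /my/backup/repo/backups/disk-`date '+%Y-%m-%d'`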

Limitations

  • Right now the only modes supported are reading from standard input and writing to standard output. FUSE mounts and NBD servers may be added later if someone contributes the code.
  • The program keeps all known blocks in an in-RAM hash table, which may create scalability problems for very large repos (see below).
  • The only encryption mode currently implemented is AES-128 in CBC mode with PKCS#7 padding. If you believe that this is not secure enough, patches are welcome. Before you jump to conclusions however, read this article.
  • The only way to get to a particular file is to restore the whole backup; there is no option to quickly pick it out. tar would not allow that anyway, though it could have been possible for e.g. zip files. This could still be implemented, e.g. by exposing the data over a FUSE filesystem.

Most of those limitations can be lifted by implementing the respective features.

Safety

Is it safe to use zbackup for production data? Being free software, the program comes with no warranty of any kind. That said, it's perfectly safe for production, and here's why. When performing a backup, the program never modifies or deletes any existing files -- only new ones are created. It specifically checks for that, and the code paths involved are short and easy to inspect. Furthermore, each backup is protected by its SHA256 sum, which is calculated before piping the data into the deduplication logic. The code path doing that is also short and easy to inspect. When a backup is being restored, its SHA256 is calculated again and compared against the stored one. The program would fail on a mismatch. Therefore, to ensure safety it is enough to restore each backup to /dev/null immediately after creating it. If it restores fine, it will restore fine ever after.
To add some statistics: the author of the program used an older version of zbackup internally for over a year, and the SHA256 check never failed. Again, even if it does fail, you would know immediately, so no work would be lost. Therefore you are welcome to try the program in production, and if you like it, stick with it.
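
A minimal verification step after each backup could therefore look like this (add --password-file if the repo is encrypted):

zbackup restore /my/backup/repo/backups/backup-`date '+%Y-%m-%d'` > /dev/null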

Usage notes

The repository has the following directory structure:

/repo
    backups/
    bundles/
        00/
        01/
        02/
        ...
    index/
    info

  • The backups directory contains your backups. These are very small files which are needed for restoration. They are encrypted if encryption is enabled. The names can be arbitrary, and it is possible to arrange files in subdirectories. Free renaming is also allowed.
  • The bundles directory contains the bulk of data. Each bundle internally contains multiple small chunks, compressed together and encrypted. Together all those chunks account for all deduplicated data stored.
  • The index directory contains the full index of all chunks in the repository, together with their bundle names. A separate index file is created for each backup session. Technically those files are redundant: all the information is contained in the bundles themselves. However, having a separate index is nice for two reasons: 1) it's faster to read, as it incurs fewer seeks, and 2) it allows making backups while storing bundles elsewhere. Bundles are only needed when restoring -- otherwise it's sufficient to have only the index. One could then move all newly created bundles to another machine after each backup (see the sketch after this list).
  • info is a very important file which contains all global repository metadata, such as chunk and bundle sizes, as well as the encryption key encrypted with the user password. It is paramount not to lose it, so backing it up separately somewhere might be a good idea. On the other hand, if you absolutely don't trust your remote storage provider, you might consider not storing it with the rest of the data: without it, the backups cannot be decrypted at all, even if your password becomes known later.
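
A sketch of the "bundles elsewhere" workflow mentioned above (the remote host and path are hypothetical):

tar c /my/precious/data | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`
rsync -a /my/backup/repo/bundles/ remote:/bulk/zbackup-bundles/   # hypothetical destination
# backups/ and index/ stay local; the moved bundles are only needed again at restore time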

The program does not have any facilities for sending your backup over the network. You can rsync the repo to another computer or use any kind of cloud storage capable of storing files. Since zbackup never modifies any existing files, the latter is especially easy -- just tell the upload tool you use not to upload any files which already exist on the remote side (e.g. with gsutil it's gsutil cp -R -n /my/backup gs://mybackup/).
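
For instance, with rsync one might skip anything that already exists on the remote side (the destination is hypothetical):

rsync -a --ignore-existing /my/backup/repo/ remote:/backups/repo/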

To aid with creating backups, there's a utility called tartool included with zbackup. The idea is the following: one sprinkles empty files called .backup and .no-backup across the filesystem. Directories where .backup files are placed are marked for backing up; similarly, directories with .no-backup files are marked not to be backed up. Additionally, it is possible to place .backup-XYZ in the same directory as XYZ to mark XYZ for backing up, or .no-backup-XYZ to mark it not to be backed up. tartool is then run with three arguments: the root directory to start from (can be /), the output includes file, and the output excludes file. The tool traverses the given directory, noting the .backup* and .no-backup* files and creating include and exclude lists for the tar utility. tar can then be run as tar c --files-from includes --exclude-from excludes to store all chosen data.
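
A sketch of that workflow (all paths here are examples):

touch /home/user/documents/.backup            # mark /home/user/documents for backup
touch /home/user/documents/.no-backup-cache   # exclude /home/user/documents/cache
tartool / includes excludes
tar c --files-from includes --exclude-from excludes | zbackup backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`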

Scalability

This section tries to address the question of the maximum amount of data which can be held in a backup repository. What matters here is the deduplicated data: the number of bytes in all source files ever fed into the repository doesn't matter, but the total size of the resulting repository does. Internally, all input data is split into small blocks called chunks (up to 64k each by default). Chunks are collected into bundles (up to 2MB each by default), and those bundles are then compressed and encrypted.

There are then two problems with the total number of chunks in the repository:

  • Hashes of all existing chunks need to be kept in RAM while the backup is ongoing. Since the sliding window performs checking at single-byte granularity, lookups would otherwise be too slow. The amount of data to be stored is technically only 24 bytes per chunk, where the size of a chunk is up to 64k. In an example real-life 18GB repo, only 18MB is taken up by its hash index. Multiply this roughly by two to estimate the RAM needed to store the index as an in-RAM hash table (see the rough estimate after this section). However, as this size is proportional to the total size of the repo, a 2TB repo could already require 2GB of RAM. Most repos are much smaller though, and as long as the deduplication works properly, in many cases you can easily store terabytes of highly-redundant backup files in a 20GB repo.
  • We use a 64-bit rolling hash, which allows an O(1) lookup cost at each byte we process. Due to the birthday paradox, we would start having collisions when we approach 2^32 hashes. If each chunk we have is 32k on average, we would get there when our repo grows to 128TB. We would still be able to continue, but as the number of collisions grew, we would have to resort to calculating the full hash of a block at each byte more and more often, which would result in a considerable slowdown.

All in all, as long as the amount of RAM permits, one can go up to several terabytes of deduplicated data, and start seeing some slowdown only after reaching hundreds of terabytes.
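
A back-of-the-envelope estimate of the index RAM, using the figures above (a bash sketch that assumes the 64k maximum chunk size as the average, which gives a lower bound):

repo_bytes=$((2 * 1024**4))          # 2 TB of deduplicated data
chunk_bytes=$((64 * 1024))           # assumed average chunk size
entry_bytes=24                       # per-chunk index entry
chunks=$((repo_bytes / chunk_bytes))
echo "~$((chunks * entry_bytes / 1024**2)) MB raw index, roughly x2 as an in-RAM hash table"

This prints about 768 MB; real repositories tend to have smaller average chunks (the 18GB/18MB example above implies roughly 24k), which is how the 2GB figure for a 2TB repo quoted above comes about.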

Design choices

  • We use a 64-bit modified Rabin-Karp rolling hash (see rolling_hash.hh for details), while most other programs use a 32-bit one. As noted previously, one problem with the hash size is its birthday bound, which with the 32-bit hash is met after having only 2^16 hashes. The choice of a 64-bit hash allows us to scale much better while having virtually the same calculation cost on a typical 64-bit machine.
  • rsync uses MD5 as its strong hash. While MD5 is known to be fast, it is also known to be broken, allowing a malicious user to craft colliding inputs. zbackup uses SHA1 instead. The cost of SHA1 calculations on modern machines is actually less than that of MD5 (run openssl speed md5 sha1 on yours, as shown below), so it's a win-win situation. We only keep the first 128 bits of the SHA1 output, so together with the rolling hash we have a 192-bit hash for each chunk. That is a multiple of 8 bytes, which is a nice property on 64-bit machines, and it is long enough not to worry about possible collisions.
  • AES-128 in CBC mode with PKCS#7 padding is used for encryption. This seems to be a reasonably safe classic solution. Each encrypted file has a random IV as its first 16 bytes.
  • We use Google's protocol buffers to represent data structures in binary form. They are very efficient and relatively simple to use.
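
To compare MD5 and SHA1 throughput on your own machine, as suggested above:

openssl speed md5 sha1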

Compression

zbackup uses LZMA to compress stored data. It compresses very well, but it will slow down your backup (unless you have a very fast CPU).

LZO is much faster, but the files will be bigger. If you don't want your backup process to be cpu-bound, you should consider using LZO. However, there are some caveats:

  • LZO is so fast that other parts of zbackup consume significant portions of the CPU. In fact, it is only using one core on my machine because compression is the only thing that can run in parallel.
  • I've hacked the LZO support in a day. You shouldn't trust it. Please make sure that restore works before you assume that your data is safe. That may still be faster than a backup with LZMA ;-)
  • LZMA is still the default, so make sure that you use the -o bundle.compression_method=lzo argument when you init the repo or whenever you do a backup.

You can mix LZMA and LZO in a repository. Each bundle file has a field that says how it was compressed, so zbackup will use the right method to decompress it. You could take an old zbackup repository with only LZMA bundles and start using LZO. However, please think twice before you do that, because old versions of zbackup won't be able to read those bundles.
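
For example, creating an LZO repo and backing up into it might look like this (a sketch; the option is the one named above, and its exact placement may vary between versions):

zbackup init --non-encrypted -o bundle.compression_method=lzo /my/backup/repo
tar c /my/precious/data | zbackup -o bundle.compression_method=lzo backup /my/backup/repo/backups/backup-`date '+%Y-%m-%d'`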

Improvements

There's a lot to be improved in the program. It was released with the minimum amount of functionality to be useful. It is also stable. This should hopefully stimulate people to join the development and add all those other fancy features. Here's a list of ideas:

  • Ability to change bundle type (between encrypted and non-encrypted).
  • Improved garbage collection. The program should support the ability to specify a maximum index file size / maximum index file count (for better compatibility with cloud storage) or something like a retention policy.
  • A command to fsck the repo by doing something close to what garbage collection does, but also checking all hashes and so on.
  • Parallel decompression. Right now decompression is single-threaded, but it is possible to look ahead in the stream and perform prefetching.
  • Support for mounting the repo over FUSE. Random access to data would then be possible.
  • Support for exposing a backed up file over a userspace NBD server. It would then be possible to mount raw disk images without extracting them.
  • Support for other encryption types (preferably for everything openssl supports with its evp).
  • Support for other compression methods.
  • You name it!

Communication

The author is reachable over email at [email protected]. Please be constructive and don't ask for help using the program, though. In most cases it's best to stick to the forum, unless you have something to discuss with the author in private.

Similar projects

zbackup is certainly not the first project to embrace the idea of using a rolling hash for deduplication. Here's a list of other projects the author found on the web:

  • bup, based on storing data in git packs. No possibility of removing old data. This program was the initial inspiration for zbackup.
  • ddar, seems to be a little bit outdated. Contains a nice list of alternatives with comparisons.
  • rdiff-backup, based on the original rsync algorithm. Does not do global deduplication; it only works on files with the same file name.
  • duplicity, which looks similar to rdiff-backup with regards to mode of operation.
  • Some filesystems (most notably ZFS and Btrfs) provide deduplication features. They do so only at the block level though, without a sliding window, so they can not accommodate arbitrary byte insertion/deletion in the middle of data.
  • Attic, which looks very similar to zbackup.

Credits

Copyright (c) 2012-2014 Konstantin Isakov ([email protected]) and ZBackup contributors, see CONTRIBUTORS. Licensed under GNU GPLv2 or later + OpenSSL, see LICENSE.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

Contributors

4txj7f, am1go, bbbsnowball, bfontaine, dragonroot, eagafonov, frenkel, ikatson, mknjc, rutsky, sectoid, txtdawg, ulrichalt, utzig, vitalif, vlad1mir-d

zbackup's Issues

do not remove tmp dir?

I tried to restore a backup as root (the backup was done as another user), but interrupted it in the middle.
So the tmp dir wasn't removed (I suppose), and on the next backup cycle I got:
"Can't create a temporary file in dir....."
When I had a look at what was wrong with the tmp dir, it had already been removed.
Maybe it would be better to create the tmp dir once at init and never remove it?

Big CPU load without reason

I have been using a zbackup repo for a long time; it now has 336 backups.

It was working fine, and a full backup took about 2 hours.

Here is a CPU load graph from my server:
[CPU load chart]

Over the last day it showed abnormal CPU usage, and after a long time I killed the zbackup process.

I tried again, and again it ate CPU and had not finished after 4+ hours.

I have no idea where I should look for the answer.

There is a lot of disk space, everything seems correct, nothing changed.

What should I check?
Are there any internal restrictions?

Does not compile on OSX

The first issue is that it says "cannot find endian.h", which I solved by replacing "#include <endian.h>" in "endian.hh" with

#ifdef __APPLE__
#include <machine/endian.h>
#else
#include <endian.h>
#endif

But after that there's one I can't solve myself:

[  3%] Building CXX object CMakeFiles/zbackup.dir/chunk_index.cc.o
In file included from /Users/igor/projects/zbackup/chunk_index.cc:9:
In file included from /Users/igor/projects/zbackup/chunk_index.hh:13:
In file included from /usr/include/c++/4.2.1/ext/hash_map:65:
/usr/include/c++/4.2.1/ext/hashtable.h:595:16: error: type 'const hasher' (aka 'const __gnu_cxx::hash<unsigned long long>') does not provide a call operator
      { return _M_hash(__key) % __n; }
               ^~~~~~~
/usr/include/c++/4.2.1/ext/hashtable.h:587:16: note: in instantiation of member function '__gnu_cxx::hashtable<std::pair<const unsigned long long, ChunkIndex::Chain *>, unsigned long long,
      __gnu_cxx::hash<unsigned long long>, std::_Select1st<std::pair<const unsigned long long, ChunkIndex::Chain *> >, std::equal_to<unsigned long long>, std::allocator<ChunkIndex::Chain *>
      >::_M_bkt_num_key' requested here
      { return _M_bkt_num_key(__key, _M_buckets.size()); }
               ^
/usr/include/c++/4.2.1/ext/hashtable.h:510:18: note: in instantiation of member function '__gnu_cxx::hashtable<std::pair<const unsigned long long, ChunkIndex::Chain *>, unsigned long long,
      __gnu_cxx::hash<unsigned long long>, std::_Select1st<std::pair<const unsigned long long, ChunkIndex::Chain *> >, std::equal_to<unsigned long long>, std::allocator<ChunkIndex::Chain *>
      >::_M_bkt_num_key' requested here
        size_type __n = _M_bkt_num_key(__key);
                        ^
/usr/include/c++/4.2.1/ext/hash_map:215:22: note: in instantiation of member function '__gnu_cxx::hashtable<std::pair<const unsigned long long, ChunkIndex::Chain *>, unsigned long long,
      __gnu_cxx::hash<unsigned long long>, std::_Select1st<std::pair<const unsigned long long, ChunkIndex::Chain *> >, std::equal_to<unsigned long long>, std::allocator<ChunkIndex::Chain *> >::find'
      requested here
      { return _M_ht.find(__key); }
                     ^
/Users/igor/projects/zbackup/chunk_index.cc:79:37: note: in instantiation of member function '__gnu_cxx::hash_map<unsigned long long, ChunkIndex::Chain *, __gnu_cxx::hash<unsigned long long>,
      std::equal_to<unsigned long long>, std::allocator<ChunkIndex::Chain *> >::find' requested here
  HashTable::iterator i = hashTable.find( rollingHash );
                                    ^
1 error generated.
make[2]: *** [CMakeFiles/zbackup.dir/chunk_index.cc.o] Error 1
make[1]: *** [CMakeFiles/zbackup.dir/all] Error 2
make: *** [all] Error 2

tar on FreeBSD

Both FreeBSD's built-in tar (bsdtar) and GNU tar (when installed from the ports collection) have default options which include -f. (These defaults can be seen by running [g]tar --show-defaults.) Therefore the examples of running:

tar c /my/precious/data | zbackup backup ...

... fail with an error like:

tar: /dev/sa0: Cannot open: Operation not supported

To work around this, we can be more explicit that the "file" we should create is STDOUT:

tar c -f - /my/precious/data | zbackup backup ...

CentOS 7 support

Hi,

I tried installing the build dependencies and building zbackup, but it fails to find the dependencies.
Any chance you could explain how to build this for CentOS (7)?

Many Thanks

cmake minimum version

is it really necessary?

cmake_minimum_required( VERSION 2.8.9 )

on ubuntu 12.04 (last LTS release) I'm getting

$ cmake .
CMake Error at CMakeLists.txt:4 (cmake_minimum_required):
  CMake 2.8.9 or higher is required.  You are running version 2.8.7

Unable to restore an archive using v1.4.1

Hi There,

I have a backup archive created with v1.3 which consistently fails to restore with v1.4.1 at the same point in the archive. The error message is:

Can't open file /mnt/zbackup/15-Feb/bundles/6a/6a9aed9dd043212eeb0637c3111dc1e0809aceea53d8f4dd

However, when I look for the specified file, it clearly exists (and has the same permissions as all others). When restoring with v1.3 everything works like expected. Any hints on how to debug this issue?

zbackup gc keeps adding 2x file size?

[root@hw1 zbackup]# ls -lh index
total 216M
-rw------- 1 root root 132K Apr 11 02:58 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90
-rw------- 1 root root 216M Apr 10 08:35 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607

FIRST GC adds 216M index

[root@hw1 zbackup]# zbackup gc --non-encrypted $PWD
Loading index...
Loading index file 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607...
Loading index file 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90...
Index loaded.
Using up to 40 MB of RAM as cache
Using up to 16 thread(s) for compression
Checking used chunks...
Checking backup 003...
Checking backup 001...
Checking backup 002...
Checking backup 004...
Checking backup 005...
Checking bundles...
Loading index...
Loading index file 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607...
Loading index file 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90...
Index loaded.
Cleaning up...
Garbage collection complete
[root@hw1 zbackup]# ls -lh index
total 432M
-rw------- 1 root root 132K Apr 11 02:58 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90
-rw------- 1 root root 216M Apr 10 08:35 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607
-rw------- 1 root root 216M Apr 11 05:18 fa996559bbf6d44bc671908c42a3a2e1c21096e4d1c115c1

SECOND GC adds 432M index

[root@hw1 zbackup]# zbackup gc --non-encrypted $PWD
Loading index...
Loading index file fa996559bbf6d44bc671908c42a3a2e1c21096e4d1c115c1...
Loading index file 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607...
Loading index file 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90...
Index loaded.
Using up to 40 MB of RAM as cache
Using up to 16 thread(s) for compression
Checking used chunks...
Checking backup 003...
Checking backup 001...
Checking backup 002...
Checking backup 004...
Checking backup 005...
Checking bundles...
Loading index...
Loading index file fa996559bbf6d44bc671908c42a3a2e1c21096e4d1c115c1...
Loading index file 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607...
Loading index file 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90...
Index loaded.
Cleaning up...
Garbage collection complete
[root@hw1 zbackup]# ls -lh index
total 863M
-rw------- 1 root root 432M Apr 11 05:19 41f8614034da0cfaf76766ef28f590e1dc9521b3383d656b
-rw------- 1 root root 132K Apr 11 02:58 50f87e1dad04b9dff1b153b08d79b6b548cce18c24019d90
-rw------- 1 root root 216M Apr 10 08:35 878f27901bf28cad4ecae1e1d58190fe4118548b972d7607
-rw------- 1 root root 216M Apr 11 05:18 fa996559bbf6d44bc671908c42a3a2e1c21096e4d1c115c1

Every new gc run adds a new index file twice the size of the previous one.

Deduplication problem

I am testing this sophisticated tool.
After backing up the /root/ dir on my dev box and then cloning the WordPress source (21MB), I ran zbackup again. This is the diff of md5sum's output (before and after):

1a2
> efde1f546057cd375bbb47207e7634ce  backup/backups/root3
236a238
> 1340e063c5851211dd27ad57a0c07816  backup/bundles/d1/d1094a69d507380828a748bc583774c82feb98a1e3786f63
461a464
> b861fda8de69afe5c0e93acd55925758  backup/bundles/a3/a39d064190e8ea1efa75c07b73328c79f8d7810ff6367eed
499a503
> d47a6725cd3a49eb08da016a6a9c42fe  backup/bundles/c2/c2d7da30a831e9d58f2779a241949cb3ef7635f72d189d4c
519a524
> fcee729f7f3c9a86143f0a6414222d25  backup/bundles/54/54f0542f17877fb73e070f682e0c04147d18074a0224cc68
532a538
> cd5ee3186a52967b6cbe5b4e0c6ee741  backup/bundles/95/95c0e5b8c06b99551485a72ffdb860ecc0bd2058a1b6364d
606a613
> d1a2beddab738c97bd27c80467043b91  backup/bundles/da/dab0c6b5776f9e144431fbb53a5ef437525b0fa51359308d
1242a1250
> e1d1bb1f172c0d1b213044d3b76e4355  backup/bundles/63/63a4a88490e30e9e7e39d0813f4de730971c92037572582e
1327a1336
> 55fb1acb00fada6a0ac7ee954f266356  backup/bundles/cb/cbaf962b94552b9a4a92bc7d9cf9ef015a6394fcd87f53f6
1451a1461
> dbe03c9985b217fcf1ac58cb045b6350  backup/index/936f312c3add4a1c8afea3f22ab67cfdd8cb7ff68e6bef9e

Could it be that cloning 21 MB of text files turns out to be 3.6 MB in the backup repo?

diff backup2.md5 backup3.md5 |cut -d' ' -f4|grep ^b|xargs ls -l

-rw------- 1 root root     81 Dec 26 01:38 backup/backups/root3
-rw------- 1 root root 460191 Dec 26 01:37 backup/bundles/54/54f0542f17877fb73e070f682e0c04147d18074a0224cc68
-rw------- 1 root root 471995 Dec 26 01:37 backup/bundles/63/63a4a88490e30e9e7e39d0813f4de730971c92037572582e
-rw------- 1 root root 531291 Dec 26 01:37 backup/bundles/95/95c0e5b8c06b99551485a72ffdb860ecc0bd2058a1b6364d
-rw------- 1 root root 406794 Dec 26 01:37 backup/bundles/a3/a39d064190e8ea1efa75c07b73328c79f8d7810ff6367eed
-rw------- 1 root root 621043 Dec 26 01:37 backup/bundles/c2/c2d7da30a831e9d58f2779a241949cb3ef7635f72d189d4c
-rw------- 1 root root 469583 Dec 26 01:37 backup/bundles/cb/cbaf962b94552b9a4a92bc7d9cf9ef015a6394fcd87f53f6
-rw------- 1 root root 359627 Dec 26 01:37 backup/bundles/d1/d1094a69d507380828a748bc583774c82feb98a1e3786f63
-rw------- 1 root root 292876 Dec 26 01:38 backup/bundles/da/dab0c6b5776f9e144431fbb53a5ef437525b0fa51359308d
-rw------- 1 root root   7756 Dec 26 01:38 backup/index/936f312c3add4a1c8afea3f22ab67cfdd8cb7ff68e6bef9e

Integrated file list

A list of files in the backup would be a nice feature.
There could also be tar format auto-detection -> file list generation.

Currently there is no way to do this with ZBackup.
I'll probably add some additional fields to the proto in the future.
Please raise an issue if you want this feature to be implemented.

error: Can't open file /usr/home/username/a_zbackup_repo/tmp/q6HhmS

I provided a reasonable fix (commit 4c73159), via pull request.

It looks like a call to umask used the wrong permission settings. When the
temporary file was created with mkstemp, read and write permissions were masked
off for the owner. My FreeBSD, rightfully, followed the umask setting. I'm not sure
if/why zbackup was working on GNU/Linux systems, unless they were ignoring the
umask setting altogether. umask can be thought of as a way to REMOVE permissions.

I believe that setting the umask, in the context of temporary files, is done to
work around a security issue where some old (10+ years) c libraries' implementations
of mkstemp would create files with mode 666 (read and write for everyone).
Modern implementations create temporary files with 600 (only the owner can read
and write). So it is probably fairly safe to remove the umask call altogether.
But my commit provides a reasonable way to set the umask. It also
resets the umask to the previous setting, after the file is created.

An arguably better way, might be to explicitly set the temporary file's mode,
after creation (but before any potentially sensitive data is written to it):
e.g.
fchmod(fd, S_IRUSR | S_IWUSR );

I suppose there is an extremely slight chance (if an old mkstemp library is
used), that something/someone else writes to that temp file, after it is created,
but before the chmod. But it is also possible that the umask setting is ignored
during the file creation.

It took a while to figure this out, because the temporary folder and files disappear
before program exit (destructors in tmp_mgr.cc?). So I first thought they might
not have been created in the first place.

Linking zbackup as library

Hey all,

I've got an idea: what if we move all the code of zbackup into a library (let's say libzbackup) with a C interface and then make the zbackup application a thin wrapper around that library? Such an approach would make it easy to create bindings for scripting languages (i.e. Python/Perl/Ruby/etc) and would simplify the integration of zbackup into custom deployment solutions.

We can even make 2 libraries: static one (which will be linked with zbackup main() code and tool will remain as it was) and dynamic one for bindings.

Any suggestions? If I implement such thing - will the project accept it into mainline?

Need advice about speed-up restore process.

I use zbackup for mysql backup.

The database tar is about 75 GB. It is a database of logs, growing each day.
I do a backup each day:
192 backups currently.
Perfect compression:
$du -d 1 -h
784K ./backups
156M ./index
4,0K ./tmp
33G ./bundles
34G .
No encryption.
The only problem is the very slow restore speed.
Using up to 2024 MB of RAM as cache
75,4GB 1:05:18 [19,7MB/s]
It's slow.
How can I improve the speed?

  • recreate the repo and reload only the last 7-14 backups into it? I don't need all 192
  • tune some parameters?

specify zbackup tmp dir

I'm trying to write a little script which does encrypted zbackups to GCE's storage. I'm using gcsfuse to mount my bucket, but it doesn't support renames (due to renaming being non-atomic in S3-like datastores), and this causes problems with zbackup:

Can't rename file /mnt/gce/zbackup/tmp/SRqSBj to /mnt/gce/zbackup/bundles/4f/4f5538ccf49cc05c80a445a00644ba9b78b423b5087dfb37

It would be really useful to have a flag to specify the tmp dir zbackup uses, so I could specify something on my local filesystem, and then it could move from there to the bucket's FUSE mount when it's done. Is this possible?

Handling on low ulimit

OSX has a painfully low default file handles ulimit. (soft: 256, hard: 4096)
zbackup dies unexpectedly on larger backups if the limit is not increased...
Error is: "Can't open file /path/bundles/fe/f...."

Ideally zbackup should warn on low limit or display an error when file handles are exceeded.
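
A hedged workaround in the meantime is to raise the soft limit (up to the hard limit quoted above) in the shell that runs zbackup:

ulimit -n 4096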

Question: How to delete old data?

Imagine a dataset of 1GB which often has internal changes (i.e. the content of the 1GB changes, but the overall amount of data stays the same). zbackup is used to back up the data, and there is a backup policy saying that changes up to 1 month old need to be kept.

Now, when zbackup has been running for more than 1 month, there will (due to the nature of how the data is updated) likely be bundles that are no longer needed, as they are only relevant for backups older than 1 month. But how can that be detected?

Would the method be to parse the index/backup files, detect bundles which are only mentioned in older files and then delete them (alternatively, noting the full set of backups and simply parsing which are mentioned in the newest files)?

Or is there already some utility to do this?

Br
Ask

Segmentation Fault while starting zbackup since 30fe01a

Hi,

I am uncertain if this is for a github issue - feel free to close it otherwise.

Since the recent commit 30fe01a I am unable to build zbackup to work correctly on debian stable. Using a clean tree, rerunning cmake and rebuilding changes nothing. c5b821b works fine. I know that not every revision is supposed to provide a clean build, but I wanted to inform about the issue. I am not a developer but will try to provide debugging information.

I get a segmentation fault.

The following information is a first attempt in trying to help. If I can provide anything more, feel free to ask.

Platform is amd64

CMake:

     -- The C compiler identification is GNU 4.7.2
     -- The CXX compiler identification is GNU 4.7.2
     -- Check for working C compiler: /usr/bin/gcc
     -- Check for working C compiler: /usr/bin/gcc -- works
     -- Detecting C compiler ABI info
     -- Detecting C compiler ABI info - done
     -- Check for working CXX compiler: /usr/bin/c++
     -- Check for working CXX compiler: /usr/bin/c++ -- works
     -- Detecting CXX compiler ABI info
     -- Detecting CXX compiler ABI info - done
     -- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.7") 
     -- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.0.1e") 
     -- Found PROTOBUF: /usr/lib/libprotobuf.so  
     -- Looking for include file pthread.h
     -- Looking for include file pthread.h - found
     -- Looking for pthread_create
     -- Looking for pthread_create - not found.
     -- Looking for pthread_create in pthreads
     -- Looking for pthread_create in pthreads - not found
     -- Looking for pthread_create in pthread
     -- Looking for pthread_create in pthread - found
     -- Found Threads: TRUE  
     -- Looking for lzma_auto_decoder in /usr/lib/x86_64-linux-gnu/liblzma.so
     -- Looking for lzma_auto_decoder in /usr/lib/x86_64-linux-gnu/liblzma.so - found
     -- Looking for lzma_easy_encoder in /usr/lib/x86_64-linux-gnu/liblzma.so
     -- Looking for lzma_easy_encoder in /usr/lib/x86_64-linux-gnu/liblzma.so - found
     -- Looking for lzma_lzma_preset in /usr/lib/x86_64-linux-gnu/liblzma.so
     -- Looking for lzma_lzma_preset in /usr/lib/x86_64-linux-gnu/liblzma.so - found
     -- Found LibLZMA: /usr/include (found version "5.1.0") 
     -- Looking for lzo1x_decompress_safe in /usr/lib/x86_64-linux-gnu/liblzo2.so
     -- Looking for lzo1x_decompress_safe in /usr/lib/x86_64-linux-gnu/liblzo2.so - found
     -- Looking for lzo1x_1_compress in /usr/lib/x86_64-linux-gnu/liblzo2.so
     -- Looking for lzo1x_1_compress in /usr/lib/x86_64-linux-gnu/liblzo2.so - found
     -- Found LibLZO: /usr/include (found version "2.06") 
     -- Looking for unw_getcontext
     -- Looking for unw_getcontext - found
     -- Looking for unw_init_local
     -- Looking for unw_init_local - found
     -- Found LibUnwind: /usr/include (found version "0.99") 
     -- Configuring done
     -- Generating done
     -- Build files have been written to: /usr/src/zbackup_test  

Some compiler warnings while building:

    /usr/src/zbackup_test/zbackup_base.cc: In member function ‘ExtendedStorageInfo ZBackupBase::loadExtendedStorageInfo(const EncryptionKey&)’:
    /usr/src/zbackup_test/zbackup_base.cc:160:14: warning: ‘google::protobuf::uint32 StorageInfo::chunk_max_size() const’ is deprecated (declared at  /usr/src/zbackup_test/zbackup.pb.h:1710) [-Wdeprecated-declarations]
    /usr/src/zbackup_test/zbackup_base.cc:161:14: warning: ‘google::protobuf::uint32 StorageInfo::bundle_max_payload_size() const’ is deprecated (declared at  /usr/src/zbackup_test/zbackup.pb.h:1732) [-Wdeprecated-declarations]
    /usr/src/zbackup_test/zbackup_base.cc:163:14: warning: ‘const string& StorageInfo::default_compression_method() const’ is deprecated (declared at          /usr/src/zbackup_test/zbackup.pb.h:1785) [-Wdeprecated-declarations]

ldd ./zbackup

    linux-vdso.so.1 =>  (0x00007fffe57ff000)
    libprotobuf.so.7 => /usr/lib/libprotobuf.so.7 (0x00007f9c2398d000)
    libssl.so.1.0.0 => /usr/lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007f9c2372d000)
    libcrypto.so.1.0.0 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f9c23334000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9c23118000)
    libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9c22f01000)
    liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f9c22cdd000)
    liblzo2.so.2 => /usr/lib/x86_64-linux-gnu/liblzo2.so.2 (0x00007f9c22abc000)
    libunwind.so.7 => /usr/lib/libunwind.so.7 (0x00007f9c228a3000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9c2259b000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9c22319000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f9c22103000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9c21d76000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9c21b72000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f9c23c94000)

gdb output

    GNU gdb (GDB) 7.4.1-debian
    Copyright (C) 2012 Free Software Foundation, Inc.
    License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
    This is free software: you are free to change and redistribute it.
    There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
    and "show warranty" for details.
    This GDB was configured as "x86_64-linux-gnu".
    For bug reporting instructions, please see:
    <http://www.gnu.org/software/gdb/bugs/>...
    Reading symbols from /usr/src/zbackup_test/zbackup...(no debugging symbols found)...done.
    (gdb) run
    Starting program: /usr/src/zbackup_test/zbackup 
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000423c75 in _GLOBAL__sub_I__ZN12ConfigHelper13defaultConfigE ()

Wrong chunk map generated in release builds

Affects only binaries built with Release cmake build type.
Causes OOM on inspect deep (because of string concatenation if the repo contains many chunks) and a very long loop over nearly all chunks on cacheless restore.

Happens with GCC 4.4.7 20120313 (Red Hat 4.4.7-16) and works fine with clang version 3.4.2 (tags/RELEASE_34/dot2-final), looks like unspecified behaviour somewhere :\

Feature request: List the bundles referenced by a particular backup

Motivation

I'm guessing a lot of people move their bundles to remote storage, while keeping local copies of the backup/index files. If you're in this situation and you want to restore a backup, you currently have to download all bundles, since you don't know which will be needed.

Feature request

I suggest adding a command which would list the bundles referenced by a particular backup.

$ zbackup list-bundles /my/repo/backups/foo
e374b414e7aecd88a9dbba2feb77000a5edb7c9374865dab
a2feb77000a5edb7c9374865dabe374b414e7aecd88a9dbb
...

This would allow users to scp only the required bundles back into their local repo.

GC failing with 'Can't parse message FileHeader'

zbackup version 1.4.1
gentoo linux

~ $ zbackup --non-encrypted gc zrepo || echo 1
Loading index...
Loading index file 142c784cfc57a5fd6a25bc995481ba0ffc29bee6534ebd9b... 
Loading index file 53165df2656d04de2d24566659f30739cfc5f3cfe170d4ae... 
Loading index file 1fe7a0b3824433fade24ddc582e02df876ba5ed359ed567a... 
Loading index file 157eca6435071df96253d85309ab8c9070de08adf2bde9b6... 
Loading index file 0ba93a21c483b5b62265d74b8656c0dbb2887705cbd1b9b2... 
Index loaded.
Using up to 40 MB of RAM as cache
Using up to 8 thread(s) for compression
Checking used chunks...
Checking backup prod...
Can't parse message FileHeader
1

Sequential reading of the backup

Hi,

I took a look at your code, but I don't have enough time at the moment to dig deep into it, especially since I have no idea how protobufs work (although they do seem like something I'll look into in the near future). My question is:

would it be possible to read parts of the backup? For example - first kilobyte or 1 kb at 3 mb from the start and things like that?

If that is possible, then it would probably be easy to restore specific files from the tar, and not the entire backed-up tar. Or, if some manifest file were added at the beginning of the archive (zip has that, I think), then it would be possible to make the entire backed-up file tree visible using FUSE, a GUI frontend or something like that.

Error during garbage collection with subfolders in backup/ folder

When storing backups in subfolders under the backups/ directory, garbage collection fails. Let me illustrate as follows. To me - without looking at the code - it looks like the subdirectories are read as backup files. I can reproduce this with the latest git that I got to build (see the other bug).
This makes deleting backups from the repository quite a lot harder, because I cannot clean up after them. I'd appreciate it if someone could have a look at this. :) Thanks.

Some Version information:

  root@usagi:~# zbackup --help
  ZBackup, a versatile deduplicating backup tool, version 1.4
  Copyright (c) 2012-2014 Konstantin Isakov <[email protected]> and
  ZBackup contributors
  Comes with no warranty. Licensed under GNU GPLv2 or later + OpenSSL.
  Visit the project's home page at http://zbackup.org/

  Usage: zbackup [flags] <command> [command args]
    Flags: --non-encrypted|--password-file <file>
      password flag should be specified twice if import/export
      command specified
     --silent (default is verbose)
     --threads <number> (default is 2 on your system)
     --cache-size <number> MB (default is 40)
     --exchange [backups|bundles|index] (can be
      specified multiple times)
     --compression <compression> <lzma|lzo> (default is lzma)
     --help|-h show this message
    Commands:
      init <storage path> - initializes new storage;
      backup <backup file name> - performs a backup from stdin;
      restore <backup file name> - restores a backup to stdout;
      export <source storage path> <destination storage path> -
        performs export from source to destination storage;
      import <source storage path> <destination storage path> -
        performs import from source to destination storage;
      gc <storage path> - performs chunk garbage collection.
    For export/import storage path must be valid (initialized) storage.

Create Repositories, encrypted and unencrypted:

  root@usagi:~# zbackup --password-file /root/.passphrase init /tmp/zbackup_test_crypt/
  root@usagi:~# zbackup --non-encrypted init /tmp/zbackup_test_plain/

Put some content in:

  root@usagi:~# dd if=/dev/urandom bs=4M count=10 | zbackup --password-file /root/.passphrase backup /tmp/zbackup_test_crypt/backups/test1
  Loading index...
  Index loaded.
  Using up to 2 thread(s) for compression
  10+0 records in
  10+0 records out
  41943040 bytes (42 MB) copied, 12.6271 s, 3.3 MB/s

  root@usagi:~# dd if=/dev/urandom bs=4M count=10 | zbackup --non-encrypted backup /tmp/zbackup_test_plain/backups/test1
  Loading index...
  Index loaded.
  Using up to 2 thread(s) for compression
  10+0 records in
  10+0 records out
  41943040 bytes (42 MB) copied, 11.9597 s, 3.5 MB/s

Test garbage collection with a backup in, working fine:

  root@usagi:~# zbackup --password-file /root/.passphrase gc /tmp/zbackup_test_crypt/
  Loading index...
  Loading index file 6603d1fa03891cc91b0a80e02f95c11f3db412361653d294...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Checking backup test1...
  Checking bundles...
  Loading index...
  Garbage collection complete

  root@usagi:~# zbackup --non-encrypted gc /tmp/zbackup_test_plain/
  Loading index...
  Loading index file 1fe0e6eeb365b5da3bfdf1c58d27cc3bab030297d2b47f2e...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Checking backup test1...
  Checking bundles...
  Loading index...
  Loading index file 1fe0e6eeb365b5da3bfdf1c58d27cc3bab030297d2b47f2e...
  Index loaded.
  Cleaning up...
  Garbage collection complete

Create folders inside the backup directory:

  root@usagi:~# mkdir /tmp/zbackup_test_plain/backups/test_folder
  root@usagi:~# mkdir /tmp/zbackup_test_crypt/backups/test_folder

Test garbage collection again; it no longer works. It fails while trying to read the folder as a file.
The error messages differ for encrypted and unencrypted repositories:

  root@usagi:~# zbackup --password-file /root/.passphrase gc /tmp/zbackup_test_crypt/
  Loading index...
  Loading index file 999403125662263560008c14cf4f02ed77b818d7fd6eb1a4...
  Loading index file 6603d1fa03891cc91b0a80e02f95c11f3db412361653d294...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Checking backup test1...
  Checking backup test_folder...
  size of the encrypted file is incorrect

  root@usagi:~# zbackup --non-encrypted gc /tmp/zbackup_test_plain/
  Loading index...
  Loading index file dd5096abf90eb102dedabc97b52cf07c380448d7840131f4...
  Loading index file 1fe0e6eeb365b5da3bfdf1c58d27cc3bab030297d2b47f2e...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Script started on Sat 24 Jan 2015 03:19:43 PM CET
  Checking backup test1...
  Checking backup test_folder...
  Can't parse message FileHeader

Add a backup in the created folders:

  root@usagi:~# dd if=/dev/urandom bs=4M count=10 | zbackup --non-encrypted backup /tmp/zbackup_test_plain/backups/test_folder/test2
  Loading index...
  Loading index file dd5096abf90eb102dedabc97b52cf07c380448d7840131f4...
  Loading index file 1fe0e6eeb365b5da3bfdf1c58d27cc3bab030297d2b47f2e...
  Index loaded.
  Using up to 2 thread(s) for compression
  10+0 records in
  10+0 records out
  41943040 bytes (42 MB) copied, 12.3289 s, 3.4 MB/s

  root@usagi:~# dd if=/dev/urandom bs=4M count=10 | zbackup --password-file /root/.passphrase backup /tmp/zbackup_test_crypt/backups/test_folder/test2
  Loading index...
  Loading index file 999403125662263560008c14cf4f02ed77b818d7fd6eb1a4...
  Loading index file 6603d1fa03891cc91b0a80e02f95c11f3db412361653d294...
  Index loaded.
  Using up to 2 thread(s) for compression
  10+0 records in
  10+0 records out
  41943040 bytes (42 MB) copied, 13.425 s, 3.1 MB/s

Try again, but still the same error with the folders no longer empty:

  root@usagi:~# zbackup --password-file /root/.passphrase gc /tmp/zbackup_test_crypt/
  Loading index...
  Loading index file 2d0cc4325853e6b35d43d158d1a5bcece0812b205125bdb9...
  Loading index file 999403125662263560008c14cf4f02ed77b818d7fd6eb1a4...
  Loading index file 6603d1fa03891cc91b0a80e02f95c11f3db412361653d294...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Checking backup test1...
  Checking backup test_folder...
  size of the encrypted file is incorrect

  root@usagi:~# zbackup --non-encrypted gc /tmp/zbackup_test_plain/
  Loading index...
  Loading index file dd5096abf90eb102dedabc97b52cf07c380448d7840131f4...
  Loading index file c1504fca297127e0f2f6bd1ae478bbf32d1220e1b56682a9...
  Loading index file 1fe0e6eeb365b5da3bfdf1c58d27cc3bab030297d2b47f2e...
  Index loaded.
  Using up to 40 MB of RAM as cache
  Using up to 2 thread(s) for compression
  Checking used chunks...
  Checking backup test1...
  Checking backup test_folder...
  Can't parse message FileHeader

Backing up large directory structures

Hi,
I am interested in trying to make a change to zbackup to help better support my own use case.

I am trying to backup a directory containing many smaller files (ranging from 100 bytes to 50MB) that individually change rarely. The general approach of tarring the directory takes up a lot of IO, so for example backing up 100GB of files could easily take 2 hours.

If I run rsync without -c, it simply checks the size and date of the file. I was thinking of something along the same lines where each backup has a manifest of files that includes the size, date, and possibly inode number. The backup process would then only store the details of the changed files.

The simplest way I can think of to do this is to make zbackup write out the details of the backup file to stdout when backing up (instead of writing to /backup) and then restore using a handle piped through stdin (instead of reading from /backup). The manifest file would then store the handle to the backup in zbackup.

This seems like a relatively simple approach that would require only small changes to zbackup. My questions are:

  • Does this approach seem sound?
  • Would such a change be merged in?

I realise there are performance penalties involved in invoking zbackup uniquely to backup and restore every file, however since changed files are the minority and restores are infrequent, I think the penalty is worth paying. In future zbackup could be optimised to allow multiple files to be backed up and restored in one invocation.

David

Feature request: multithreaded restore (decompression)

Hi,
zbackup is almost the perfect tool for me, but being limited to a single thread for decompression caps the restore speed at about 50 MB/sec on my hardware, which is an issue for some of my use cases.
Would adding this feature be difficult?
Any hints if I decide to have a look myself, given that I'm not so familiar with C++? From looking at the code, is it all happening in chunk_storage.cc?
thanks

Feature request: disable compression

I'm also running ZBackup (I love this tool!) on a homemade NAS built on top of an HP MicroServer to back up small LVM volumes. In this case I have a lot of free disk space for backups but a not-so-powerful CPU: would it be possible to add an option to the --compression command line flag to disable compression? Something like:

backup --non-encrypted --compression no backup /path/to/dir

(I'm using 1.4.1 from Fedora 21)

Thanks

OpenSSL and GPL incompatibility

Hi there, I am trying to adopt the zbackup package in Debian. However, I have run into a problem, and without resolution, zbackup will be removed from Debian.

In short, zbackup cannot be included in Debian because the OpenSSL and GPL licences are incompatible.

The solution is fairly simple: you can add a special exemption. Other solutions include changing to a non-GPL licence or finding a way to not use OpenSSL.

Further reading:

Note that the previous maintainer had circumvented the issue by falsely identifying your licence as containing the OpenSSL exception. I do not know whether this was your previous licence and you changed at some point, or if the maintainer acted independently.

Feature request: gzip instead of lzma as an option?

zbackup is heavily CPU-bound; I assume a lot (if not most) of that comes from LZMA compression.

It would be nice to have gzip as an option, if the server is not as powerful or does not have many cores.

How hard is that to implement?

Exchange causes segfault in config destructor

(gdb) bt f
#0  0x0806906d in Config::~Config (this=0xbf9bc9d4, __in_chrg=<value optimized out>) at /home/AmiGO/soft/sources/zbackup/config.cc:140
No locals.
#1  0x08078018 in main (argc=146498600, argv=0x8bb7be0) at /home/AmiGO/soft/sources/zbackup/zbackup.cc:386
        printHelp = <value optimized out>
        args = std::vector of length 3, capacity 4 = {0xbf9bde0d "export", 0xbf9bde14 "repos/1/", 0xbf9bde2d "repos/2/"}
        passwords = std::vector of length 2, capacity 2 = {"", ""}
        config = {
          runtime = {
            threads = 4, 
            cacheSize = 41943040, 
            exchange = std::bitset = {
              [0] = 1,
              [1] = 1,
              [2] = 1
            }
          }, 
          storable = 0x8bb8338, 
          keywords = 0x8bb978c, 
          cleanup_storable = true, 
          cleanup_keywords = true
        }
        __func__ = "main"

Feature: Flush pending bundles from tmp to bundles

I'm trying to perform a backup (tar c | zbackup) without having enough local disk space to store all of the newly created bundle files. What I'd like to do is:

  • write out a "part" of the bundle files of the backup (let's say half of it)
  • (externally) upload the files in bundles/ to remote storage
  • delete the local copies in bundles/ to free disk space
  • write the next bundle part

I'm not sure if ZBackup even needs to be aware of this. I'm currently thinking of two ways to solve this:

  1. Add some sort of signal handler to ZBackup (i.e. SIGUSR2) to tell it to flush all the (finished) bundle files. I'm not sure if this is even possible (do bundle files still get appended to after some point?) and it could leave the repo with inconsistent state.
  2. Solve this externally by doing "partial backups", i.e. write a program like tar c | splitter-zbackup that reads the incoming tar stream, pipes it into zbackup, but aborts the zbackup child at some point. zbackup then just sees a smaller file and is happy. Then, start a new zbackup continuing the still-open tar stream. (Assume the tar stream cannot be rewound.)

I quite like solution (2); however, it comes with the disadvantage that at no point do I ever store the entire tar archive with zbackup. I could just re-open and pipe the tar stream into zbackup again (it should write no more data because of the redundancy). Maybe zbackup could help here by allowing one to "append to" an already-existing archive file, acting as though the entire base file's content had been piped in just before the new part.

Thoughts?

Consistency check

Repo consistency check would be a nice feature.
Now: zbackup restore /media/backup/backups/1st > /dev/null

Check failed: lzma_easy_encoder error: 5

Initially reported at https://groups.google.com/forum/#!topic/zbackup/PePEAYUkfAY

Reproducing:

  • 1.4 release
  • # dd if=/dev/urandom bs=1M count=1024 | pv | ./objdir/zbackup --threads 8 backup --password-file repos/1pass repos/1/backups/test

Result:

Loading index...
Index loaded.
Using up to 8 thread(s) for compression
Check failed: lzma_easy_encoder error: 5
At /home/AmiGO/soft/sources/zbackup/compression.cc:73
 164MiB 0:02:38 [1,04MiB/s]
Aborted (core dumped)

Expected:
Work properly

Error Message 'Can't parse message of type "FileHeader"'

I am receiving the following error message when I attempt to backup my Documents folder. I am also receiving the same error message when I attempt to backup my Dropbox folder, however the command had worked fine just minutes before.

zbackup v1.3
OSX Yosemite

tar c Documents | zbackup backup /Volumes/Data/Backup/backups/Documents-`date '+%Y-%m-%d'`
Loading index...
Loading index file .DS_Store...
[libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type "FileHeader" because it is missing required fields: version
Can't parse message FileHeader

-- EDIT --
I removed the .DS_Store file and it is working again. May need to implement a blacklist of files for the index folder.
