archive-tools's Issues

Must verify that basedir is a directory when creating an archive with relative paths

When creating an archive with relative paths, all paths must start with basedir. This is verified. But it is not verified that basedir is a directory:

>>> p = Path("msg.txt")
>>> p.is_file()
True
>>> archive = Archive("archive.tar", mode="x:", paths=[p], basedir=p)

This pathological case even produces a corrupt archive that cannot be extracted:

$ tar tvf archive.tar 
-r--r--r-- rolf/users      262 2019-02-10 21:26 msg.txt/.manifest.yaml
-rw-r--r-- rolf/users        7 2019-02-10 21:13 msg.txt
$ tar xf archive.tar 
tar: msg.txt: Cannot open: File exists
tar: Exiting with failure status due to previous errors
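A minimal sketch of the missing check, assuming a hypothetical helper check_basedir() that is not part of archive-tools. It combines the prefix check described above with a test that basedir is a directory:

```python
from pathlib import Path


def check_basedir(basedir, paths):
    """Validate basedir before creating an archive (hypothetical helper)."""
    basedir = Path(basedir)
    # The missing check: basedir must be an actual directory.
    if not basedir.is_dir():
        raise ValueError("basedir %s must be a directory" % basedir)
    # The check that is already done: all paths must start with basedir.
    for p in map(Path, paths):
        if p != basedir and basedir not in p.parents:
            raise ValueError("invalid path %s: must be a subpath of %s"
                             % (p, basedir))
```

With this check, the call shown above with basedir=Path("msg.txt") would be rejected before anything is written to the tar file.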

Consider moving backup-tool into a separate package

The backup-tool script added in #70 is built on top of the archive-tool core package. But archive-tool is useful independently of backup-tool and there are cases where it makes sense to install only the archive-tool core package without backup-tool.

archive-tool create throws an error when trying to explicitly add a symlink

When trying to create an archive adding a symbolic link as a file argument, archive-tool throws an error "invalid path ...: must be normalized":

$ ls -lad base/data base/s.dat
drwxr-x--- 1 rolf users 14  6. Aug 15:04 base/data
lrwxrwxrwx 1 rolf users 12  6. Aug 15:04 base/s.dat -> data/rnd.dat
$ archive-tool.py create archive.tar base/data base/s.dat 
archive-tool.py create: error: invalid path base/s.dat: must be normalized

Enable mounting an archive as FUSE

It should be possible to mount an archive read-only as a Filesystem in Userspace (FUSE). All the file stubs should be taken from the embedded manifest. The content of a file should be read from the tar file only when it is accessed.

Command line interface

There should be a command line script that provides a command line interface to the basic functionality. The current interim scripts are mostly used for testing and should be replaced by the final command line interface.

Make the checksum algorithm configurable

In principle, archive-tools is capable of dealing with archives using any checksum algorithm supported by hashlib.new(). It can list the content, verify the checksums and also create them. But there is no option to select the checksum algorithm to be used on create, neither in the API, in the Archive.create() call, nor as a command line option for archive-tool create.

The only way is to set a class attribute in archive.manifest.FileInfo:

from pathlib import Path
from archive.archive import Archive
import archive.manifest

archive.manifest.FileInfo.Checksums = ['md5', 'blake2b']

Archive().create(Path("data.tar.gz"), paths=[Path("data")])

There should be a corresponding option, both in the API and the command line.
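As a rough illustration of what such an option would have to pass down, the following sketch computes several checksums in one pass over a file using hashlib.new(). The function name file_checksums and its algorithms parameter are assumptions made for this example, not the archive-tools API:

```python
import hashlib
from pathlib import Path


def file_checksums(path, algorithms=("sha256",), chunksize=8192):
    """Return {algorithm: hexdigest} for the file at path.

    Hypothetical helper: the algorithm names are passed in by the caller
    instead of being taken from a class attribute.
    """
    # hashlib.new() raises ValueError for an unsupported algorithm name.
    hashers = {a: hashlib.new(a) for a in algorithms}
    with Path(path).open("rb") as f:
        # Read the file once and feed every hasher with each chunk.
        for chunk in iter(lambda: f.read(chunksize), b""):
            for h in hashers.values():
                h.update(chunk)
    return {a: h.hexdigest() for a, h in hashers.items()}
```

A call like file_checksums("data/rnd.dat", algorithms=["md5", "blake2b"]) would then replace the need to patch FileInfo.Checksums globally.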

Diff functionality

Add the ability to diff two archives. As a command line tool, this should be similar to extracting the two archives to directories, let's say arch1 and arch2, and then running diff -qr arch1 arch2. It should report all files missing in either archive and all files having different type, size, or checksum. Differences in the file system metadata, such as uid, uname, gid, gname, mode, and mtime should be ignored by default, but there might be a command line switch to also report those. The paths should be taken relative to the respective base directory of the archive when matching the files. Something like:

$ archive-tool diff arch1.tar.bz2 arch2.tar.xz
Files arch1/INDEX and arch2/INDEX differ
Only in arch1.tar.bz2: file1.txt
Only in arch2.tar.xz: file2.dat
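The comparison itself could be sketched as follows. This is only an illustration under assumptions: each archive is represented as a plain dict mapping the path relative to the base directory to a (type, size, checksum) tuple, and diff_entries is a hypothetical name. File system metadata is simply not part of the tuple and is therefore ignored:

```python
def diff_entries(entries1, entries2, name1="arch1", name2="arch2"):
    """Compare two mappings of relative path -> (type, size, checksum).

    Returns a list of report lines, ordered by path.
    """
    report = []
    for path in sorted(set(entries1) | set(entries2)):
        if path not in entries2:
            report.append("Only in %s: %s" % (name1, path))
        elif path not in entries1:
            report.append("Only in %s: %s" % (name2, path))
        elif entries1[path] != entries2[path]:
            # Type, size, or checksum differ.
            report.append("Files %s:%s and %s:%s differ"
                          % (name1, path, name2, path))
    return report
```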

Manage the version number

Use a tool like setuptools_scm to manage the version number rather than hardcoding it in a source file.

Add persistent identifiers for the messages in a mail archive

There should be a way to persistently identify messages in a mailarchive. Therefore we should add persistent identifiers to the mailindex in the archive. These identifiers should have the following properties:

  • They must be unique for any message in a mailarchive. It would be desirable if the identifiers were also unique across mailarchives, but it is open for the moment whether we should also aim to guarantee that.
  • When a new mailarchive is created from an existing one, retaining a subset of the messages of the old archive, it should be possible to keep these identifiers in the new archive.
  • When a new mailarchive is created combining messages from several existing archives, it should be possible to keep these identifiers in the new archive provided they are unique across the old archives.

Obviously, the message-id from the mail comes to mind as a candidate. But there is no way to guarantee that two different messages in the same archive do not have the same message-id.

Another candidate might be the key used in the internal Maildir structure of the archive. But in the current implementation by class mailbox.Maildir from the Python standard library, this key is generated anew each time a message is added to the Maildir and there is no way to control that in order to retain existing identifiers. It might be an option to consider if we implement our own version of the Mailbox API, ref. #31.

Archive.extract() does not preserve the file modification time for symbolic links

When an archive is extracted with Archive.extract(), the file modification time for symbolic links is not preserved:

>>> # Create an archive containing a symbolic link with a well-defined mtime
... 
>>> base = Path("base")
>>> pf = base / "msg.txt"
>>> pl = base / "s.txt"
>>> base.mkdir()
>>> with pf.open("wt") as f:
...     print("Hello!", file=f)
... 
>>> pl.symlink_to("msg.txt")
>>> mtime=1565100853
>>> for p in (base, pf, pl):
...     os.utime(p, (mtime, mtime), follow_symlinks=False)    
... 
>>> archive_path = Path("archive.tar")
>>> Archive().create(archive_path, paths=[base])
<archive.archive.Archive object at 0x7f287bb219e8>
>>> 
>>> # Extract the archive
... 
>>> outdir = Path("out")
>>> outdir.mkdir()
>>> with Archive().open(archive_path) as archive:
...     archive.extract(outdir)
... 
>>> # Check the file modification time ...
... # ... it matches for the file
... 
>>> fstat = (outdir / pf).lstat()
>>> assert fstat.st_mtime == mtime
>>> 
>>> # ... but not for the symbolic link
... 
>>> fstat = (outdir / pl).lstat()
>>> assert fstat.st_mtime == mtime
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> 
>>> str(datetime.fromtimestamp(mtime))
'2019-08-06 16:14:13'
>>> str(datetime.fromtimestamp(fstat.st_mtime))
'2021-05-29 19:35:41.400020'

The cause of the issue is that Archive.extract() relies on TarFile.extract() from the standard library. The latter does not bother to set the file modification time for symbolic links.

Note, by the way, that the GNU tar command line program does preserve the file modification time for symbolic links:

$ ls -la base
total 8
drwxr-xr-x 1 rolf rk 24 Aug  6  2019 .
drwxr-xr-x 1 rolf rk 36 May 29 19:38 ..
-rw-r--r-- 1 rolf rk  7 Aug  6  2019 msg.txt
lrwxrwxrwx 1 rolf rk  7 Aug  6  2019 s.txt -> msg.txt
$ ls -la out/base
total 8
drwxr-xr-x 1 rolf rk 24 Aug  6  2019 .
drwxr-xr-x 1 rolf rk  8 May 29 19:35 ..
-rw-r--r-- 1 rolf rk  7 Aug  6  2019 msg.txt
lrwxrwxrwx 1 rolf rk  7 May 29 19:35 s.txt -> msg.txt
$ mkdir out-tar
$ tar -x -f archive.tar --directory=out-tar
$ ls -la out-tar/base 
total 12
drwxr-xr-x 1 rolf rk  52 Aug  6  2019 .
drwxr-xr-x 1 rolf rk   8 May 29 19:39 ..
-r--r--r-- 1 rolf rk 621 May 29 19:35 .manifest.yaml
-rw-r--r-- 1 rolf rk   7 Aug  6  2019 msg.txt
lrwxrwxrwx 1 rolf rk   7 Aug  6  2019 s.txt -> msg.txt
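One conceivable workaround, sketched here with plain tarfile rather than the Archive class, is to set the modification time of symbolic links explicitly after extraction. The function name extract_with_symlink_mtime is hypothetical, and the sketch assumes a platform where os.utime() supports follow_symlinks=False (e.g. Linux):

```python
import os
import tarfile
from pathlib import Path


def extract_with_symlink_mtime(archive_path, outdir):
    """Extract a tar file and restore the mtime of symbolic links."""
    outdir = Path(outdir)
    with tarfile.open(str(archive_path), "r") as tarf:
        # Use the "data" extraction filter where the running Python has it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tarf.extractall(str(outdir), **kwargs)
        for ti in tarf.getmembers():
            if ti.issym():
                # follow_symlinks=False sets the times on the link itself,
                # not on the file it points to.
                os.utime(str(outdir / ti.name), (ti.mtime, ti.mtime),
                         follow_symlinks=False)
```

Setting the time on the link does not modify the parent directory, so the directory modification times restored by extractall() should stay intact.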

Failure from verify if the archive contains a directory with long path name

With Python 3.7 and older, an archive fails verification if it contains a directory with a long path name:

>>> import os
>>> from pathlib import Path
>>> import sys
>>> import tempfile
>>> from archive.archive import Archive
>>> 
>>> base = tempfile.mkdtemp(prefix="tarfile-test-")
>>> os.chdir(base)
>>> 
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=10, releaselevel='final', serial=0)
>>> 
>>> dirname = Path("lets_start_with_a_somewhat_long_directory_name_"
...                "because_we_need_a_very_long_overall_path")
>>> os.mkdir(dirname)
>>> 
>>> subdir1 = dirname / "sub-1"
>>> subdir2 = dirname / "sub-directory-2"
>>> os.mkdir(subdir1)
>>> os.mkdir(subdir2)
>>> len(str(subdir1))
93
>>> len(str(subdir2))
103
>>> 
>>> archive_path = Path("sample.tar")
>>> 
>>> Archive().create(archive_path, paths=[dirname])
<archive.archive.Archive object at 0x7f7aa8452150>
>>> 
>>> with Archive().open(archive_path) as archive:
...     archive.verify()
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/home/abuild/test/archive-tools-0.5.2.dev90+g8f613d1/build/lib/archive/archive.py", line 292, in verify
    self._verify_item(fileinfo)
  File "/home/abuild/test/archive-tools-0.5.2.dev90+g8f613d1/build/lib/archive/archive.py", line 304, in _verify_item
    raise ArchiveIntegrityError("%s: missing" % itemname)
archive.exception.ArchiveIntegrityError: /tmp/tarfile-test-jog7k657/sample.tar:lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2: missing

The cause of the issue becomes apparent if we look at the members of the tarfile. Note the spurious trailing forward slash in the name of subdir2:

>>> with Archive().open(archive_path) as archive:
...     for ti in archive._file.getmembers():
...         print(ti.name)
... 
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/.manifest.yaml
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-1
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2/

The problem does not occur with Python 3.8 and newer:

>>> import os
>>> from pathlib import Path
>>> import sys
>>> import tempfile
>>> from archive.archive import Archive
>>> 
>>> base = tempfile.mkdtemp(prefix="tarfile-test-")
>>> os.chdir(base)
>>> 
>>> sys.version_info
sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
>>> 
>>> dirname = Path("lets_start_with_a_somewhat_long_directory_name_"
...                "because_we_need_a_very_long_overall_path")
>>> os.mkdir(dirname)
>>> 
>>> subdir1 = dirname / "sub-1"
>>> subdir2 = dirname / "sub-directory-2"
>>> os.mkdir(subdir1)
>>> os.mkdir(subdir2)
>>> len(str(subdir1))
93
>>> len(str(subdir2))
103
>>> 
>>> archive_path = Path("sample.tar")
>>> 
>>> Archive().create(archive_path, paths=[dirname])
<archive.archive.Archive object at 0x7f4c4b3ee1f0>
>>> 
>>> with Archive().open(archive_path) as archive:
...     archive.verify()
... 
>>> with Archive().open(archive_path) as archive:
...     for ti in archive._file.getmembers():
...         print(ti.name)
... 
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/.manifest.yaml
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-1
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2

Further investigation reveals that it is the Python version that created the archive that matters: an archive created with Python 3.8 can be verified with Python 3.7 without error, but if the archive has been created with Python 3.7, verification also fails with Python 3.8. Apparently, the relevant change was bpo-36268: the switch to the POSIX.1-2001 pax standard as the default format used for writing tars with the tarfile module.
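Two possible mitigations follow from this, sketched below with plain tarfile and hypothetical helper names: request the pax format explicitly on creation so that the result does not depend on the default of the running Python version, and strip a trailing slash from member names on reading so that archives written earlier can still be matched against the manifest:

```python
import tarfile


def create_pax(archive_path, path, arcname=None):
    """Write a tar file in pax format, regardless of the Python default."""
    with tarfile.open(str(archive_path), "w",
                      format=tarfile.PAX_FORMAT) as tarf:
        tarf.add(str(path), arcname=arcname)


def member_names(archive_path):
    """Return the member names with any trailing slash removed."""
    with tarfile.open(str(archive_path), "r") as tarf:
        return [ti.name.rstrip("/") for ti in tarf.getmembers()]
```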

Add a custom extraction filter

Python 3.12 raises a warning:

DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.

The background is the tarfile extraction filters introduced with Python 3.12.
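A custom filter is a callable that takes a TarInfo object and the destination path and returns a (possibly modified) TarInfo, or None to skip the member. The sketch below is one possible shape, not the filter that archive-tools should necessarily adopt: it skips the manifest directly below the base directory (an assumption about the desired behavior) and delegates everything else to the standard tarfile.data_filter. It requires a Python version that supports extraction filters:

```python
import tarfile
from pathlib import PurePosixPath

MANIFEST_NAME = ".manifest.yaml"


def archive_filter(member, dest_path):
    """Skip <base>/.manifest.yaml, apply the data filter to the rest."""
    parts = PurePosixPath(member.name).parts
    if len(parts) == 2 and parts[1] == MANIFEST_NAME:
        # Returning None tells tarfile to skip this member.
        return None
    return tarfile.data_filter(member, dest_path)


def extract_filtered(archive_path, outdir):
    """Extract all members of a tar file using archive_filter."""
    with tarfile.open(str(archive_path), "r") as tarf:
        tarf.extractall(str(outdir), filter=archive_filter)
```

Passing the filter explicitly also silences the DeprecationWarning, since the behavior no longer depends on the default.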

TypeError from "archive-tool.py ls" if the path of a file in the archive contains two consecutive blanks

Calling archive-tool.py ls on an archive that contains a file having two consecutive blanks in the path throws a TypeError.

Steps to reproduce:

$ mkdir dir
$ touch dir/Normal-file
$ touch "dir/hello  world"
$ archive-tool.py create dir.tar dir
$ archive-tool.py ls dir.tar
drwxr-xr-x  rolf/rk  0  2024-02-23 16:46  dir
-rw-r--r--  rolf/rk  0  2024-02-23 16:46  dir/Normal-file
Traceback (most recent call last):
  File "/home/rolf/Progs/archive-tools/build/scripts-3.6/archive-tool.py", line 5, in <module>
    archive.cli.archive_tool()
  File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/__init__.py", line 45, in archive_tool
    sys.exit(args.func(args))
  File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/ls.py", line 31, in ls
    ls_ls_format(archive)
  File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/ls.py", line 20, in ls_ls_format
    print(format_str % i)
TypeError: not all arguments converted during string formatting

Must disable adding file <base>/.manifest.yaml to an archive

The manifest is stored as a special file .manifest.yaml in the archive. Clashes with actual files in the archive must be prevented, so adding a file <base>/.manifest.yaml to an archive must not be allowed:

>>> base = Path("base")
>>> (base / ".manifest.yaml").is_file()
True
>>> archive = Archive("archive.tar", mode="x:", paths=[base])

This would otherwise lead to an inconsistent archive:

$ tar tvf archive.tar 
-r--r--r-- rolf/users      965 2019-02-11 21:09 base/.manifest.yaml
drwxr-xr-x rolf/users        0 2019-02-11 21:06 base/
-r--r--r-- rolf/users      650 2019-02-10 23:50 base/.manifest.yaml
drwxr-xr-x rolf/users        0 2019-02-10 23:49 base/data/
drwxr-xr-x rolf/users        0 2019-02-10 23:49 base/data/misc/
-rw-r--r-- rolf/users      385 2019-02-10 23:49 base/data/misc/rnd.dat
drwxr-xr-x rolf/users        0 2019-02-10 23:49 base/data/other/
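A minimal sketch of such a guard, assuming a hypothetical helper check_reserved() that receives the already expanded list of paths to be added together with the base directory:

```python
from pathlib import Path

MANIFEST_NAME = ".manifest.yaml"


def check_reserved(paths, basedir):
    """Reject the path that is reserved for the embedded manifest."""
    reserved = Path(basedir) / MANIFEST_NAME
    for p in map(Path, paths):
        if p == reserved:
            raise ValueError("cannot add %s: this filename is reserved" % p)
```

Only the file directly below the base directory is rejected; a file of the same name in a subdirectory does not clash with the manifest.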

Support deduplication

Since we calculate checksums on all files added to the archive, we are in an excellent position to detect duplicate files. We should use this opportunity to add duplicate files only once in the archive and add hard links for all further instances. We could support three different modes:

  • no deduplication at all: always add all individual copies to the archive. Don't use hard links.

  • deduplication for file system hard links: if more than one path is added to the archive and these paths are hard links to the same file in the file system, hard link them also in the archive. This is the default behaviour of the classical tar command.

  • deduplication for all identical files: if any two paths to be added to the archive have identical content, regardless of whether they are hard links in the file system or just copies, always add the file content only once and add hard links for all other instances.
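The three modes could be sketched as follows. The mode names "never", "link", and "content" as well as the function link_targets are assumptions made for this example. The function only decides which paths should become hard links to which earlier path; it does not write anything to a tar file:

```python
import hashlib
import stat
from pathlib import Path


def link_targets(paths, mode="link"):
    """Map each duplicate regular file to the first path it duplicates."""
    if mode not in ("never", "link", "content"):
        raise ValueError("invalid mode %r" % mode)
    targets = {}
    if mode == "never":
        return targets
    seen = {}
    for p in map(Path, paths):
        st = p.lstat()
        if not stat.S_ISREG(st.st_mode):
            continue
        if mode == "link":
            if st.st_nlink < 2:
                continue
            # Hard links share device and inode number.
            key = (st.st_dev, st.st_ino)
        else:
            # Identical content yields an identical checksum.
            key = hashlib.sha256(p.read_bytes()).hexdigest()
        if key in seen:
            targets[p] = seen[key]
        else:
            seen[key] = p
    return targets
```

In a real implementation, the checksum would be taken from the value that is calculated for the manifest anyway rather than reading the file a second time.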

Verify fails if archive contains hard links

Surprisingly, the tarfile standard library module already does some deduplication (ref. #28): if more than one path is added to the archive and these paths are hard links to the same file in the file system, the standard lib creates hard links in the archive:

$ ls -la base
total 8
drwxr-xr-x 1 rolf users  32 May 25 15:33 .
drwxr-xr-x 1 rolf users  30 May 25 15:34 ..
-rw------- 2 rolf users 385 May 25 13:16 rnd1.dat
-rw------- 2 rolf users 385 May 25 13:16 rnd2.dat
$ archive-tool create archive.tar base
$ tar tvf archive.tar 
-r--r--r-- rolf/users      683 2019-05-25 15:34 base/.manifest.yaml
drwxr-xr-x rolf/users        0 2019-05-25 15:33 base/
-rw------- rolf/users      385 2019-05-25 13:16 base/rnd1.dat
hrw------- rolf/users        0 2019-05-25 13:16 base/rnd2.dat link to base/rnd1.dat

Unfortunately, archive-tool verify chokes on such an archive:

$ archive-tool verify archive.tar 
archive-tool verify: error: archive.tar:base/rnd2.dat: wrong type, expected regular file
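One way verification might cope with this, shown here as a sketch with plain tarfile and a hypothetical helper name, is to follow the link name of a hard link member to the member that actually carries the content before checking type and checksum:

```python
import tarfile


def resolve_hardlink(tarf, member):
    """Follow hard links until a member that is not a hard link is found."""
    seen = set()
    while member.islnk():
        if member.name in seen:
            raise ValueError("%s: hard link loop" % member.name)
        seen.add(member.name)
        # linkname of a hard link is the name of another archive member.
        member = tarf.getmember(member.linkname)
    return member
```

The resolved member is a regular file, so the expected type, size, and checksum from the manifest can be compared against it.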

archive-tool diff fails with TypeError

The command line tool archive-tool diff fails with a TypeError, if the first archive contains an additional entry at the end that is not present in the second archive. Consider two archives:

$ archive-tool ls archive.tar 
drwxr-xr-x  rolf/rk    0  2021-05-09 21:06  base
drwxr-x---  rolf/rk    0  2021-05-09 21:06  base/data
-rw-------  rolf/rk  385  2021-05-09 21:06  base/data/rnd.dat
drwxr-xr-x  rolf/rk    0  2021-05-09 21:06  base/empty
-rw-r--r--  rolf/rk    7  2021-05-09 21:06  base/msg.txt
-rw-------  rolf/rk  385  2021-05-09 21:06  base/rnd.dat
lrwxrwxrwx  rolf/rk    0  2021-05-09 21:06  base/s.dat -> data/rnd.dat

$ archive-tool ls archive-newfile.tar 
drwxr-xr-x  rolf/rk    0  2021-05-09 21:08  base
drwxr-x---  rolf/rk    0  2021-05-09 21:06  base/data
-rw-------  rolf/rk  385  2021-05-09 21:06  base/data/rnd.dat
drwxr-xr-x  rolf/rk    0  2021-05-09 21:06  base/empty
-rw-r--r--  rolf/rk    7  2021-05-09 21:06  base/msg.txt
-rw-------  rolf/rk  385  2021-05-09 21:06  base/rnd.dat
lrwxrwxrwx  rolf/rk    0  2021-05-09 21:06  base/s.dat -> data/rnd.dat
-rw-r--r--  rolf/rk    0  2021-05-09 21:08  base/zzz.dat

Note that archive-newfile.tar has an additional entry base/zzz.dat as the very last item.

Now:

$ archive-tool diff archive-newfile.tar archive.tar
Traceback (most recent call last):
  File "/usr/bin/archive-tool", line 5, in <module>
    archive.cli.archive_tool()
  File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 45, in archive_tool
    sys.exit(args.func(args))
  File "/usr/lib/python3.6/site-packages/archive/cli/diff.py", line 62, in diff
    elif path1 is None or path1 > path2:
TypeError: '>' not supported between instances of 'PosixPath' and 'NoneType'
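The traceback suggests that the two sorted lists of paths are walked in parallel and that an exhausted list is represented by None. A sketch of such a merge walk that guards both sides against None before comparing is given below. The function name diff_sorted and the result tags are hypothetical and not taken from diff.py:

```python
def diff_sorted(paths1, paths2):
    """Walk two sorted path sequences in parallel.

    Yields ("only1", path), ("only2", path), or ("both", path).
    """
    it1, it2 = iter(paths1), iter(paths2)
    p1, p2 = next(it1, None), next(it2, None)
    while p1 is not None or p2 is not None:
        if p2 is None or (p1 is not None and p1 < p2):
            # Second sequence exhausted or behind: p1 is only in the first.
            yield ("only1", p1)
            p1 = next(it1, None)
        elif p1 is None or p1 > p2:
            # Here p2 is known not to be None, so the comparison is safe.
            yield ("only2", p2)
            p2 = next(it2, None)
        else:
            yield ("both", p1)
            p1, p2 = next(it1, None), next(it2, None)
```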

archive-tool check should use the archive's basedir as default for the files argument

The archive-tool check command requires at least two positional arguments (unless the --stdin flag is used), archive and files. This should be changed such that the files argument is optional. If not provided, the base directory from the archive should be taken as the default.

E.g. assume an archive foo.tar.gz having the base directory foo. At the moment, the command

$ archive-tool check foo.tar.gz foo

will check if all files in foo are in the archive. Omitting the files argument yields an error:

$ archive-tool check foo.tar.gz 
usage: archive-tool [-h] {create,verify,ls,info,check,diff,find} ...
archive-tool: error: either --stdin or the files argument is required

This should be changed so that the files argument is optional and defaults to foo in this case.
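With argparse, this could look roughly like the following sketch. The parser shown here is a simplified stand-in for the real subcommand, and files_to_check as well as its basedir argument (the base directory read from the archive) are hypothetical:

```python
import argparse


def make_check_parser():
    """Build a simplified parser for the check subcommand."""
    parser = argparse.ArgumentParser(prog="archive-tool check")
    parser.add_argument("--stdin", action="store_true",
                        help="read files to be checked from stdin")
    parser.add_argument("archive", help="path to the archive file")
    # nargs="*" makes the positional files argument optional.
    parser.add_argument("files", nargs="*", help="files to be checked")
    return parser


def files_to_check(args, basedir, stdin_lines=()):
    """Return the list of files, defaulting to the archive's basedir."""
    if args.stdin:
        return [line.rstrip("\n") for line in stdin_lines]
    return args.files or [basedir]
```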

Mail archive should provide a Mailbox API

As a followup to #4, there should be a subclass of mailbox.Mailbox reading from a mail archive. Might be implemented as a subclass of mailbox.Maildir disabling all write access and diverting all methods accessing files to read from corresponding file objects returned from TarFile.extractfile().

Find functionality

Add the ability to search entries in archives. As a command line tool, this should be similar to the well known find command, only that it searches archives instead of directory hierarchies. Something like:

$ archive-find -type f -name '*.bib' archive-*.tar.bz2

The basedir keyword argument to Archive() should be optional when creating an archive

When creating an archive, the basedir keyword argument to Archive() should be optional. If omitted, a suitable default depending on path and paths should be chosen. But if the argument is omitted, a rather incomprehensible TypeError is raised:

>>> archive = Archive("archive.tar", mode="x:", paths=["base"])
Traceback (most recent call last):
  ...
TypeError: argument should be a path or str object, not <class 'NoneType'>

Strictly speaking, this is not an error: in the absence of any documentation, nothing states that this argument is optional in this case. Nevertheless, this behavior was not intended.

Add a user defined "Domain" field to the manifest header

Add an optional field Domain or similar to the header of the manifest. This should be user defined at the creation of the archive. The value (if provided) should be a string and it should be considered opaque by the archive-tools package.

Rationale: the user might want to store many archives from different origins at the same location. For instance it might be backups taken from different hosts or installations. In this case, it would be useful for the user to be able to mark the archives according to the context, such as the hostname that the backup has been created from.

Review class Archive API

The API of class Archive may need a review:

  • At the moment, archive objects are created by directly calling the constructor that in turn will either call a method to read the manifest from an existing archive or create a new archive, depending on the mode argument. Some of the constructor's arguments are only meaningful for creating a new archive. Consider adding two separate class methods open() and create() instead.
  • When called in the reading mode, the constructor will open the archive file to read the manifest and close it. A subsequent call of the verify() method will open the archive file again. Consider keeping the file open. In that case, consider implementing the context manager protocol.
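Both suggestions can be illustrated with a small stand-in class. ArchiveSketch below is hypothetical and deliberately ignores the manifest; it only shows the shape of separate create() and open() class methods and of a context manager that keeps the tar file open until the block is left:

```python
import tarfile
from pathlib import Path


class ArchiveSketch:
    """Hypothetical stand-in for class Archive, not the real API."""

    def __init__(self, path, tarf=None):
        self.path = Path(path)
        self._file = tarf

    @classmethod
    def create(cls, path, paths):
        """Write a new archive; mode "x" fails if the file exists."""
        with tarfile.open(str(path), "x") as tarf:
            for p in paths:
                tarf.add(str(p), arcname=Path(p).name)
        return cls(path)

    @classmethod
    def open(cls, path):
        """Open an existing archive for reading and keep the file open."""
        return cls(path, tarfile.open(str(path), "r"))

    def names(self):
        """List member names, reusing the already open tar file."""
        return [ti.name for ti in self._file.getmembers()]

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

Arguments that only make sense on creation, such as paths, then appear only in the signature of create().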

backup-tool: add a command line flag to override the schedule

backup-tool is designed to be run by a system timer and to decide which schedule to use based on the settings in the configuration file. However, there should be an option to override the schedule with a command line flag. This would be useful for running the tool manually.

archive-tool.py fails with NameError when trying to emit a warning

A warning is raised when trying to add a file to an archive having an unsupported type, ref. #34:

>>> from pathlib import Path
>>> from archive.archive import Archive
>>> archive = Archive().create(Path("archive-tmp.tar"), "", [Path("/tmp/.X11-unix")])
/usr/lib/python3.6/site-packages/archive/manifest.py:135: ArchiveWarning: /tmp/.X11-unix/X0: socket ignored
  warnings.warn(ArchiveWarning("%s ignored" % e))

But the command line script archive-tool.py fails with a NameError while trying to emit this warning:

$ archive-tool create archive-tmp.tar /tmp/.X11-unix
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 133, in iterpaths
    info = cls(path=p)
  File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 61, in __init__
    raise ArchiveInvalidTypeError(self.path, ftype)
archive.exception.ArchiveInvalidTypeError: /tmp/.X11-unix/X0: socket

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/archive-tool", line 5, in <module>
    archive.cli.archive_tool()
  File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 44, in archive_tool
    sys.exit(args.func(args))
  File "/usr/lib/python3.6/site-packages/archive/cli/create.py", line 28, in create
    tags=args.tag)
  File "/usr/lib/python3.6/site-packages/archive/archive.py", line 76, in create
    self._create(path, mode, paths, basedir, excludes, dedup, tags)
  File "/usr/lib/python3.6/site-packages/archive/archive.py", line 82, in _create
    self.manifest = Manifest(paths=paths, excludes=excludes, tags=tags)
  File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 165, in __init__
    self.fileinfos = sorted(fileinfos, key=lambda fi: fi.path)
  File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 140, in iterpaths
    yield from cls.iterpaths(p.iterdir(), excludes)
  File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 135, in iterpaths
    warnings.warn(ArchiveWarning("%s ignored" % e))
  File "/usr/lib64/python3.6/warnings.py", line 99, in _showwarnmsg
    msg.file, msg.line)
  File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 25, in showwarning
    s = "%s: %s\n" % (argparser.prog, message)
NameError: name 'argparser' is not defined

Spurious error from test test_03_cli_error.py

The test test_03_cli_error.py may occasionally yield spurious errors:

============================= test session starts ==============================
platform linux -- Python 3.4.6, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/rolf/Progs/archive-tools/tests, inifile: pytest.ini
plugins: dependency-0.3.2
collected 115 items                                                            

test_01_create.py ................................                       [ 27%]
test_02_create_errors.py .......                                         [ 33%]
test_02_create_misc.py ...                                               [ 36%]
test_02_verify_errors.py ........                                        [ 43%]
test_03_cli.py .........................................                 [ 79%]
test_03_cli_check.py ............                                        [ 89%]
test_03_cli_error.py ...........F                                        [100%]

=================================== FAILURES ===================================
_______________________ test_cli_integrity_missing_file ________________________

test_dir = PosixPath('/tmp/archive-tools-test-rxz4gr1c')
archive_name = 'archive-test_cli_integrity_missing_file.tar'
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f707140d160>

    def test_cli_integrity_missing_file(test_dir, archive_name, monkeypatch):
        monkeypatch.chdir(str(test_dir))
        base = Path("base")
        missing = base / "data" / "not-present"
        with missing.open("wt") as f:
            f.write("Hello!")
        manifest = Manifest(paths=[base])
        with open("manifest.yaml", "wb") as f:
            manifest.write(f)
        missing.unlink()
        with tarfile.open(archive_name, "w") as tarf:
            with open("manifest.yaml", "rb") as f:
                manifest_info = tarf.gettarinfo(arcname="base/.manifest.yaml",
                                                fileobj=f)
                manifest_info.mode = stat.S_IFREG | 0o444
                tarf.addfile(manifest_info, f)
            tarf.add("base")
        with TemporaryFile(mode="w+t", dir=str(test_dir)) as f:
            args = ["verify", archive_name]
            with pytest.raises(subprocess.CalledProcessError) as exc_info:
                callscript("archive-tool.py", args, stderr=f)
            assert exc_info.value.returncode == 3
            f.seek(0)
            line = f.readline()
>           assert "%s:%s: missing" % (archive_name, missing) in line
E           AssertionError: assert ('%s:%s: missing' % ('archive-test_cli_integrity_missing_file.tar', PosixPath('base/data/not-present'))) in 'archive-tool.py verify: error: archive-test_cli_integrity_missing_file.tar:base/data: wrong modification time\n'

/home/rolf/Progs/archive-tools/tests/test_03_cli_error.py:206: AssertionError
----------------------------- Captured stdout call -----------------------------

> /usr/bin/python3 /home/rolf/Progs/archive-tools/build/scripts-3.4/archive-tool.py verify archive-test_cli_integrity_missing_file.tar
==================== 1 failed, 114 passed in 10.50 seconds =====================

The reason is obvious: unlinking the file touches the modification time of the parent directory so that it doesn't match the metadata recorded in the manifest.
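One possible remedy in the test, sketched here with a hypothetical helper, is to record the timestamps of the parent directory before the file is removed and to restore them afterwards:

```python
import os
from contextlib import contextmanager


@contextmanager
def preserve_mtime(path):
    """Restore the access and modification time of path on exit."""
    st = os.stat(str(path))
    try:
        yield
    finally:
        # Use the nanosecond values to avoid rounding of float timestamps.
        os.utime(str(path), ns=(st.st_atime_ns, st.st_mtime_ns))
```

In the test above, the call missing.unlink() would then be wrapped as "with preserve_mtime(missing.parent): missing.unlink()", so that the directory still matches the manifest.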

archive-tool check should ignore metadata

The command archive-tool check takes an archive and a list of files as argument and checks if these files are in the archive. But an archive also contains a manifest file that itself is not considered to belong to the content of the archive. This has the confusing effect that any directory that results from extracting an archive contains a file that archive-tool check reports to be not in the archive:

$ tar xfj archive.tar.bz2
$ archive-tool check archive.tar.bz2 archive
archive/.manifest.yaml

This is particularly unfortunate, because the output of archive-tool check was intended to be useful to create incremental archives. But the manifest file cannot be added to an archive:

$ archive-tool create --basedir=archive archive-v2.tar.bz2 `archive-tool check archive.tar.bz2 archive`
archive-tool create: error: cannot add archive/.manifest.yaml: this filename is reserved

Define and raise custom exceptions

There are various error conditions when creating or reading archives. At the moment, standard library exceptions are raised, mostly ValueError. It would be more appropriate to define custom exceptions and raise those instead.

class FileInfo should calculate checksums lazily

A FileInfo object is created either from data or from a file path. In the latter case, the file attributes are obtained from a stat() call and checksums of the file are calculated (in the case of a regular file). The calculation of the checksums is expensive as it requires reading the complete file. This should be postponed to the moment that this information is actually needed.

This may particularly speed up archive-tool check, because the checksums need only be compared for those files that are present in the archive. Currently the checksums of all files on the command line are calculated first, before even checking if they are in the archive.
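A minimal sketch of the lazy evaluation, using a hypothetical class LazyFileInfo with a single fixed algorithm rather than the real FileInfo: the cheap stat() call is made in the constructor, while the file is read only on first access to the checksum attribute and the result is cached:

```python
import hashlib
from pathlib import Path


class LazyFileInfo:
    """Hypothetical stand-in for FileInfo with a lazily computed checksum."""

    def __init__(self, path):
        self.path = Path(path)
        # Cheap: file attributes from a single stat() call.
        self.size = self.path.stat().st_size
        self._checksum = None

    @property
    def checksum(self):
        """SHA-256 hex digest, computed on first access and then cached."""
        if self._checksum is None:
            h = hashlib.sha256()
            with self.path.open("rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
            self._checksum = h.hexdigest()
        return self._checksum
```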

Archive create should take a working directory as optional argument

When creating an archive, the Archive class should take a directory path as optional additional argument and should temporarily change to that directory if given. E.g. something like the following should work:

archive = Archive("archive.tar", mode="x:", paths=["data"], workdir="/home/bla")

In that case, both the archive path archive.tar and the paths to be added to the archive, data in this case, should be taken relative to the work dir /home/bla, regardless of the current working directory when making the call. If not given in the call, workdir should default to the current working directory.
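The temporary change of directory could be implemented with a small context manager along these lines. The name working_directory is an assumption, and how it would be wired into the Archive class is left open:

```python
import os
from contextlib import contextmanager


@contextmanager
def working_directory(workdir=None):
    """Temporarily change to workdir; do nothing if workdir is None."""
    if workdir is None:
        yield
        return
    oldcwd = os.getcwd()
    os.chdir(str(workdir))
    try:
        yield
    finally:
        # Always return to the previous directory, even on error.
        os.chdir(oldcwd)
```

Note that changing the current working directory affects the whole process, so this approach is not safe when other threads rely on relative paths at the same time.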

Add a header with version information to the mailarchive index

Class MailArchive adds an index as metadata to the archive. We may add new features to this mailindex in future versions while still maintaining backward compatibility. See #49 for an example. So we may need to distinguish versions. Therefore we should add a header to the mailindex in a similar way as in the manifest.

Improved functionality in archive-tool check

The command archive-tool check should accept additional flags:

  • --prefix <dir>: take the files to be checked relative to <dir> when looking them up in the archive.

  • --stdin: read files to be checked from stdin rather than taking them as command line arguments.

Add a command line script to manage backups

Add another command line script to manage backups. It should be designed to be called regularly as a background job (e.g. from cron or a systemd timer).

Features:

  • It must at least support creation of backups.

  • Read configuration from a file. On creation of an archive the following options should be set from the configuration file:

    • directories that should be archived,
    • exclude patterns,
    • directory to store the archive,
    • a template for the archive file name,
    • tags to be added to the archive,
    • directory to look for previous backups when creating incremental backups.
  • Support creation of incremental backups. This may be controlled by a command line flag.

Add a header section to the manifest

Add a dedicated header section with some metadata to the manifest. This could facilitate compatibility across versions. Could do this in a separate YAML document, so the result could look like:

%YAML 1.1
---
Checksums:
- sha256
Date: Sun, 10 Mar 2019 12:40:26 +0100
Generator: archive-tools 1.0
Version: '1.0'
---
- gid: 100
  gname: users
  mode: 493
  mtime: 1413139698.2000918
  path: Bilder/2014
  type: d
  uid: 1000
  uname: rolf

inconsistent result from archive-tool diff with option --skip-dir-content

The command line tool archive-tool diff features the option --skip-dir-content which, according to the documentation, is supposed to: "in the case of a subdirectory missing from one archive, only report the directory, but skip its content." The result is inconsistent in some situations. Consider two archives:

$ archive-tool ls archive-a.tar
drwxr-xr-x  rolf/rk    0  2021-05-14 15:47  base
drwxr-x---  rolf/rk    0  2021-05-14 15:47  base/data
drwxr-xr-x  rolf/rk    0  2021-05-14 14:54  base/data/aa
-rw-r--r--  rolf/rk  347  2021-05-14 14:54  base/data/aa/rnda.dat
-rw-------  rolf/rk  385  2021-05-14 14:45  base/data/rnd.dat
-rw-r--r--  rolf/rk  487  2021-04-18 23:11  base/data/rnd2.dat
drwxr-xr-x  rolf/rk    0  2021-05-14 14:45  base/empty
-rw-r--r--  rolf/rk    7  2021-05-14 14:45  base/msg.txt
-rw-------  rolf/rk  385  2021-05-14 14:45  base/rnd.dat
lrwxrwxrwx  rolf/rk    0  2021-05-14 14:45  base/s.dat -> data/rnd.dat

$ archive-tool ls archive-b.tar
drwxr-xr-x  rolf/rk    0  2021-05-14 15:47  base
drwxr-xr-x  rolf/rk    0  2021-05-14 14:54  base/data/aa
-rw-r--r--  rolf/rk  347  2021-05-14 14:54  base/data/aa/rnda.dat
-rw-r--r--  rolf/rk   42  2021-05-14 15:49  base/data/rnd2.dat
drwxr-xr-x  rolf/rk    0  2021-05-14 15:49  base/data/zz
-rw-r--r--  rolf/rk  347  2021-05-14 15:49  base/data/zz/rndz.dat
drwxr-xr-x  rolf/rk    0  2021-05-14 14:45  base/empty
-rw-r--r--  rolf/rk    7  2021-05-14 14:45  base/msg.txt
-rw-------  rolf/rk  385  2021-05-14 14:45  base/rnd.dat
lrwxrwxrwx  rolf/rk    0  2021-05-14 14:45  base/s.dat -> data/rnd.dat

Note that archive-b.tar contains some content below base/data, but not that directory itself.

The output of archive-tool diff without the --skip-dir-content option is correct:

$ archive-tool diff archive-a.tar archive-b.tar
Only in archive-a.tar: base/data
Only in archive-a.tar: base/data/rnd.dat
Files archive-a.tar:base/data/rnd2.dat and archive-b.tar:base/data/rnd2.dat differ
Only in archive-b.tar: base/data/zz
Only in archive-b.tar: base/data/zz/rndz.dat

With that option, the output is just misleading or even plain wrong in this case:

$ archive-tool diff --skip-dir-content archive-a.tar archive-b.tar
Only in archive-a.tar: base/data
Only in archive-b.tar: base/data/aa
Only in archive-b.tar: base/data/rnd2.dat
Only in archive-b.tar: base/data/zz

Note that archive-tool diff reports base/data/aa to be missing in archive-a.tar, which is false.

Mail archive

Add a special archive flavor for mail messages. It should provide a list of all messages in the archive with their basic attributes (Date, Subject, From, To) as extended metadata. The internal structure might be something like a tar archive of a Maildir mailbox.
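
A minimal sketch of how the basic attributes of one message could be extracted for such a list, using the email package from the standard library. The function name index_entry and the dict layout are assumptions, not the format used by archive-tools.

```python
from email import message_from_bytes, policy

# Headers to be recorded for each message, as listed in the issue.
INDEX_HEADERS = ("Date", "Subject", "From", "To")


def index_entry(raw):
    """Return the basic attributes of one raw mail message as a dict."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Missing headers are recorded as empty strings.
    return {key: str(msg.get(key, "")) for key in INDEX_HEADERS}
```

Applying this to every message in a Maildir mailbox would yield the list of entries to be stored as metadata.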

Basic functionality

Basic functions that should be implemented:

  • Create an archive, takes a list of files to include in the archive as input.
  • List the contents of the archive.
  • Check the integrity and consistency of an archive.
  • Display details on a file in an archive.
  • Given a list of files as input, list those files that are either not in the archive or where the file in the archive is older or differs.

archive-tool create should ignore files with invalid type

archive-tool only supports certain file types that may be added to an archive, currently regular files, directories, and symbolic links. If trying to add a file having an invalid type, such as a socket, archive-tool create throws an error:

$ archive-tool create archive-tmp.tar /tmp   
archive-tool create: error: /tmp/.X11-unix/X0: invalid file type

archive-tool create should ignore files with an invalid type instead, though it may still emit a warning.
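
A minimal sketch of the proposed behavior: walk a directory tree, yield only entries of a supported type, and emit a warning for everything else. The function names are hypothetical and the warning text is an assumption. This is not the code from archive-tools.

```python
import os
import stat
import warnings
from pathlib import Path


def supported_type(path):
    """Return True for regular files, directories, and symbolic links."""
    mode = os.lstat(path).st_mode
    return stat.S_ISREG(mode) or stat.S_ISDIR(mode) or stat.S_ISLNK(mode)


def iter_archivable(path):
    """Yield path and everything below it, skipping unsupported types."""
    path = Path(path)
    if not supported_type(path):
        # Warn and skip instead of raising an error.
        warnings.warn(f"{path}: ignoring file with invalid type")
        return
    yield path
    # Descend into real directories only, never follow symbolic links.
    if path.is_dir() and not path.is_symlink():
        for child in sorted(path.iterdir()):
            yield from iter_archivable(child)
```

With this approach, a socket such as /tmp/.X11-unix/X0 would be reported in a warning and left out, while the rest of the tree is still archived.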

Spurious FileNotFoundError from Archive.create() when passing a relative path as workdir argument

The Archive.create() method features a workdir keyword argument. If given, the method is supposed to temporarily change to that directory, see #20. This works as expected if the working directory is absolute, but fails when passing a relative path.

Consider the following setup:

$ ls -laR work
work:
total 0
drwxr-xr-x 1 rolf rk   8 May  1 12:35 .
drwxr-xr-x 1 rolf rk 278 May  1 12:35 ..
drwxr-xr-x 1 rolf rk   8 May  1 12:35 base

work/base:
total 0
drwxr-xr-x 1 rolf rk  8 May  1 12:35 .
drwxr-xr-x 1 rolf rk  8 May  1 12:35 ..
drwxr-xr-x 1 rolf rk 14 May  1 12:35 data

work/base/data:
total 4
drwxr-xr-x 1 rolf rk  14 May  1 12:35 .
drwxr-xr-x 1 rolf rk   8 May  1 12:35 ..
-rw-r--r-- 1 rolf rk 385 Apr 18 23:11 rnd.dat

Now, creating an archive passing an absolute path as workdir works as expected:

>>> workdir = Path.cwd() / "work"
>>> Archive().create(Path("archive-abs.tar"), "", [Path("base")], workdir=workdir)
>>> (workdir / "archive-abs.tar").is_file()
True

Trying the same using a relative path fails:

>>> workdir = Path("work")
>>> Archive().create(Path("archive-rel.tar"), "", [Path("base")], workdir=workdir)
Traceback (most recent call last):
  ...
FileNotFoundError: [Errno 2] No such file or directory: 'work/archive-rel.tar'
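
The path 'work/archive-rel.tar' in the error message suggests that the relative workdir is applied a second time after the directory change. This is an assumption, the report does not confirm the cause. A minimal sketch of one way to avoid that class of error is to make workdir absolute before changing into it. The names tmp_chdir and create_in_workdir are hypothetical, and the sketch creates an empty placeholder file rather than a real archive.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tmp_chdir(workdir):
    """Temporarily change the current working directory to workdir."""
    previous = Path.cwd()
    os.chdir(workdir)
    try:
        yield
    finally:
        os.chdir(previous)


def create_in_workdir(archive_path, workdir=None):
    """Create a placeholder file at archive_path, taken relative to workdir."""
    # Resolve first, so the result no longer depends on the current
    # working directory once we have changed into workdir.
    workdir = Path(workdir).resolve() if workdir else Path.cwd()
    with tmp_chdir(workdir):
        Path(archive_path).touch()
    return workdir / archive_path
```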

archive-tool create should have an option to exclude files

When creating an archive, it is currently not possible to exclude some parts of a large directory tree. archive-tool create should have an option --exclude so that it will be possible to spell something like:

$ archive-tool create --exclude=/etc/bootsplash --exclude=/etc/udev archive-etc.tar /etc
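
A minimal sketch of the path filtering such an option would need, assuming that an excluded path removes itself and everything below it. The function names are hypothetical. Pattern matching (e.g. globs) is deliberately left out.

```python
from pathlib import Path


def is_excluded(path, excludes):
    """True if path equals an excluded path or lies below one."""
    path = Path(path)
    return any(path == e or e in path.parents for e in map(Path, excludes))


def filter_excluded(paths, excludes):
    """Return the paths that are not covered by any exclude."""
    return [p for p in paths if not is_excluded(p, excludes)]
```

Comparing path components rather than string prefixes avoids excluding /etc/udev2 when only /etc/udev was given.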

Add a flag to ignore file modification time to archive-tool check

The archive-tool check subcommand checks whether some files are in an archive. A file is considered to be in the archive if the file size and checksum match and the file modification time is not newer than the modification time recorded in the archive for that entry.

Now, the modification time for a file is somewhat fragile: it may for instance be set by merely copying the file. It might be useful in some cases to be able to ignore that criterion in the checks. A command line flag like --ignore-mtime should be added to the archive-tool check subcommand for this purpose.
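
A minimal sketch of the check with an optional mtime criterion. The Entry class is a hypothetical stand-in for a manifest entry, and sha256 is assumed as the checksum algorithm. Neither is taken from the archive-tools code.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical stand-in for one manifest entry of an archive."""
    size: int
    sha256: str
    mtime: float


def file_sha256(path):
    """Compute the sha256 checksum of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def is_in_archive(path, entry, ignore_mtime=False):
    """Apply the three criteria: size, modification time, checksum."""
    st = os.stat(path)
    if st.st_size != entry.size:
        return False
    # The file must not be newer than the archived entry, unless ignored.
    if not ignore_mtime and st.st_mtime > entry.mtime:
        return False
    return file_sha256(path) == entry.sha256
```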

Should sort the members of an archive

It might be convenient to sort the members before adding them to the archive. That would make the outcome more predictable and would make it easier to find entries in the manifest. E.g. the following:

>>> archive = Archive("archive.tar", mode="x:", paths=["base/c", "base/a", "base/b"])
>>> [fi.path for fi in archive.manifest]
[PosixPath('base/c'), PosixPath('base/a'), PosixPath('base/b')]

should yield [PosixPath('base/a'), PosixPath('base/b'), PosixPath('base/c')] instead.
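
A minimal sketch of the sorting step, assuming the members are available as pathlib paths before they are added. The function name sort_members is hypothetical.

```python
from pathlib import Path


def sort_members(paths):
    """Return the given paths as a sorted list of Path objects."""
    return sorted(Path(p) for p in paths)


members = sort_members(["base/c", "base/a", "base/b"])
```

Path objects compare by path components rather than as plain strings, so the content of a directory sorts directly after the directory itself.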
