rkrahl / archive-tools
Tools for managing archives
License: Apache License 2.0
When creating an archive with relative paths, all paths must start with basedir. This is verified. But it is not verified that basedir is a directory:
>>> p = Path("msg.txt")
>>> p.is_file()
True
>>> archive = Archive("archive.tar", mode="x:", paths=[p], basedir=p)
This pathological case even produces a corrupt archive that cannot be extracted:
$ tar tvf archive.tar
-r--r--r-- rolf/users 262 2019-02-10 21:26 msg.txt/.manifest.yaml
-rw-r--r-- rolf/users 7 2019-02-10 21:13 msg.txt
$ tar xf archive.tar
tar: msg.txt: Cannot open: File exists
tar: Exiting with failure status due to previous errors
The backup-tool script added in #70 is built on top of the archive-tool core package. But archive-tool is useful independently of backup-tool, and there are cases where it makes sense to install only the archive-tool core package without backup-tool.
When trying to create an archive with a symbolic link as a file argument, archive-tool throws an error "invalid path ...: must be normalized":
$ ls -lad base/data base/s.dat
drwxr-x--- 1 rolf users 14 6. Aug 15:04 base/data
lrwxrwxrwx 1 rolf users 12 6. Aug 15:04 base/s.dat -> data/rnd.dat
$ archive-tool.py create archive.tar base/data base/s.dat
archive-tool.py create: error: invalid path base/s.dat: must be normalized
It should be possible to mount an archive read-only as a Filesystem in Userspace (FUSE). All the file stubs should be taken from the embedded manifest. The content of a file should be read from the tar file only when it is actually accessed.
There should be a command line script that provides a command line interface to the basic functionality. The current interim scripts are mostly used for testing and should be replaced by the final command line interface.
In principle, archive-tools is capable of dealing with archives using any checksum algorithm supported by hashlib.new(). It can list the content, verify the checksums, and also create them. But there is no option to select the checksum algorithm to be used on create, neither in the API, in the Archive.create() call, nor as a command line option for archive-tool create.
The only way is to set a class attribute in archive.manifest.FileInfo:
from pathlib import Path
from archive.archive import Archive
import archive.manifest
archive.manifest.FileInfo.Checksums = ['md5', 'blake2b']
Archive().create(Path("data.tar.gz"), paths=[Path("data")])
There should be a corresponding option, both in the API and the command line.
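One possible shape for such an option, as a minimal sketch only: the helper below takes the algorithm names as a parameter instead of reading them from a class attribute. The function name file_checksums() and its signature are assumptions made for illustration and are not part of archive-tools; only hashlib.new() is taken from the text above.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksums(path, algorithms=("sha256",), chunk_size=8192):
    """Return a dict mapping each algorithm name to the hex digest of path.

    Hypothetical helper for illustration, not part of the archive-tools API.
    """
    # hashlib.new() raises ValueError for an unknown algorithm name.
    hashes = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        # Read the file once and feed every hash object from the same chunk.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashes.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashes.items()}


with tempfile.TemporaryDirectory() as tmpdir:
    sample = Path(tmpdir) / "msg.txt"
    sample.write_bytes(b"Hello!\n")
    checksums = file_checksums(sample, algorithms=["sha256", "blake2b"])
```

A keyword argument of this kind could be passed down from Archive.create() and from a command line flag, which would keep the class attribute as the default only.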
Add the ability to diff two archives. As a command line tool, this should be similar to extracting the two archives to directories, let's say arch1 and arch2, and then running diff -qr arch1 arch2. It should report all files missing in either archive and all files having different type, size, or checksum. Differences in the file system metadata, such as uid, uname, gid, gname, mode, and mtime, should be ignored by default, but there might be a command line switch to report those as well. The paths should be taken relative to the respective base directory of the archive when matching the files. Something like:
$ archive-tool diff arch1.tar.bz2 arch2.tar.xz
Files arch1/INDEX and arch2/INDEX differ
Only in arch1.tar.bz2: file1.txt
Only in arch2.tar.xz: file2.dat
Use a tool like setuptools_scm to manage the version number rather than hardcoding it in a source file.
There should be a way to persistently identify messages in a mail archive. Therefore we should add persistent identifiers to the mailindex in the archive. These identifiers should have the following properties:
Obviously, the message-id from the mail comes to mind as a candidate. But there is no way to guarantee that two different messages in the same archive do not have the same message-id.
Another candidate might be the key used in the internal Maildir structure of the archive. But in the current implementation by class mailbox.Maildir from the Python standard library, this key is generated anew each time a message is added to the Maildir, and there is no way to control that in order to retain existing identifiers. It might be an option to consider if we implement our own version of the Mailbox API, ref. #31.
There should be a test suite.
When an archive is extracted with Archive.extract(), the file modification time for symbolic links is not preserved:
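>>> import os
>>> from datetime import datetime
>>> from pathlib import Path
>>> from archive.archive import Archive
>>>
>>> # Create an archive containing a symbolic link with a well defined mtime
...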
>>> base = Path("base")
>>> pf = base / "msg.txt"
>>> pl = base / "s.txt"
>>> base.mkdir()
>>> with pf.open("wt") as f:
...     print("Hello!", file=f)
...
>>> pl.symlink_to("msg.txt")
>>> mtime = 1565100853
>>> for p in (base, pf, pl):
...     os.utime(p, (mtime, mtime), follow_symlinks=False)
...
>>> archive_path = Path("archive.tar")
>>> Archive().create(archive_path, paths=[base])
<archive.archive.Archive object at 0x7f287bb219e8>
>>>
>>> # Extract the archive
...
>>> outdir = Path("out")
>>> outdir.mkdir()
>>> with Archive().open(archive_path) as archive:
...     archive.extract(outdir)
...
>>> # Check the file modification time ...
... # ... it matches for the file
...
>>> fstat = (outdir / pf).lstat()
>>> assert fstat.st_mtime == mtime
>>>
>>> # ... but not for the symbol link
...
>>> fstat = (outdir / pl).lstat()
>>> assert fstat.st_mtime == mtime
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError
>>>
>>> str(datetime.fromtimestamp(mtime))
'2019-08-06 16:14:13'
>>> str(datetime.fromtimestamp(fstat.st_mtime))
'2021-05-29 19:35:41.400020'
The cause of the issue is that Archive.extract() relies on TarFile.extract() from the standard library. The latter does not bother to set the file modification time for symbolic links.
Note, by the way, that the GNU tar command line program does preserve the file modification time for symbolic links:
$ ls -la base
total 8
drwxr-xr-x 1 rolf rk 24 Aug 6 2019 .
drwxr-xr-x 1 rolf rk 36 May 29 19:38 ..
-rw-r--r-- 1 rolf rk 7 Aug 6 2019 msg.txt
lrwxrwxrwx 1 rolf rk 7 Aug 6 2019 s.txt -> msg.txt
$ ls -la out/base
total 8
drwxr-xr-x 1 rolf rk 24 Aug 6 2019 .
drwxr-xr-x 1 rolf rk 8 May 29 19:35 ..
-rw-r--r-- 1 rolf rk 7 Aug 6 2019 msg.txt
lrwxrwxrwx 1 rolf rk 7 May 29 19:35 s.txt -> msg.txt
$ mkdir out-tar
$ tar -x -f archive.tar --directory=out-tar
$ ls -la out-tar/base
total 12
drwxr-xr-x 1 rolf rk 52 Aug 6 2019 .
drwxr-xr-x 1 rolf rk 8 May 29 19:39 ..
-r--r--r-- 1 rolf rk 621 May 29 19:35 .manifest.yaml
-rw-r--r-- 1 rolf rk 7 Aug 6 2019 msg.txt
lrwxrwxrwx 1 rolf rk 7 Aug 6 2019 s.txt -> msg.txt
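A workaround could be to set the time on the link itself after TarFile.extract() has returned. The following is a minimal sketch under the assumption that the platform supports changing the times of a symbolic link; set_symlink_mtime() is a hypothetical helper and not necessarily the fix adopted in archive-tools.

```python
import os
import tempfile
from pathlib import Path


def set_symlink_mtime(path, mtime):
    """Set the modification time of a symbolic link itself, if supported.

    Returns True if the time was set, False if the platform cannot do it.
    Hypothetical helper for illustration.
    """
    if os.utime not in os.supports_follow_symlinks:
        return False
    # follow_symlinks=False acts on the link rather than on its target.
    os.utime(path, (mtime, mtime), follow_symlinks=False)
    return True


with tempfile.TemporaryDirectory() as tmpdir:
    target = Path(tmpdir) / "msg.txt"
    target.write_text("Hello!\n")
    link = Path(tmpdir) / "s.txt"
    link.symlink_to("msg.txt")
    mtime = 1565100853
    was_set = set_symlink_mtime(link, mtime)
    link_mtime = link.lstat().st_mtime
```

In an extraction loop, the mtime would be taken from the corresponding TarInfo object or from the manifest entry.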
With Python 3.7 and older, an archive fails verification if it contains a directory with a long path name:
>>> import os
>>> from pathlib import Path
>>> import sys
>>> import tempfile
>>> from archive.archive import Archive
>>>
>>> base = tempfile.mkdtemp(prefix="tarfile-test-")
>>> os.chdir(base)
>>>
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=10, releaselevel='final', serial=0)
>>>
>>> dirname = Path("lets_start_with_a_somewhat_long_directory_name_"
... "because_we_need_a_very_long_overall_path")
>>> os.mkdir(dirname)
>>>
>>> subdir1 = dirname / "sub-1"
>>> subdir2 = dirname / "sub-directory-2"
>>> os.mkdir(subdir1)
>>> os.mkdir(subdir2)
>>> len(str(subdir1))
93
>>> len(str(subdir2))
103
>>>
>>> archive_path = Path("sample.tar")
>>>
>>> Archive().create(archive_path, paths=[dirname])
<archive.archive.Archive object at 0x7f7aa8452150>
>>>
>>> with Archive().open(archive_path) as archive:
...     archive.verify()
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/home/abuild/test/archive-tools-0.5.2.dev90+g8f613d1/build/lib/archive/archive.py", line 292, in verify
self._verify_item(fileinfo)
File "/home/abuild/test/archive-tools-0.5.2.dev90+g8f613d1/build/lib/archive/archive.py", line 304, in _verify_item
raise ArchiveIntegrityError("%s: missing" % itemname)
archive.exception.ArchiveIntegrityError: /tmp/tarfile-test-jog7k657/sample.tar:lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2: missing
The cause of the issue becomes apparent if we look at the members of the tarfile. Note the spurious trailing forward slash in the name of subdir2:
>>> with Archive().open(archive_path) as archive:
...     for ti in archive._file.getmembers():
...         print(ti.name)
...
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/.manifest.yaml
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-1
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2/
The problem does not occur with Python 3.8 and newer:
>>> import os
>>> from pathlib import Path
>>> import sys
>>> import tempfile
>>> from archive.archive import Archive
>>>
>>> base = tempfile.mkdtemp(prefix="tarfile-test-")
>>> os.chdir(base)
>>>
>>> sys.version_info
sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
>>>
>>> dirname = Path("lets_start_with_a_somewhat_long_directory_name_"
... "because_we_need_a_very_long_overall_path")
>>> os.mkdir(dirname)
>>>
>>> subdir1 = dirname / "sub-1"
>>> subdir2 = dirname / "sub-directory-2"
>>> os.mkdir(subdir1)
>>> os.mkdir(subdir2)
>>> len(str(subdir1))
93
>>> len(str(subdir2))
103
>>>
>>> archive_path = Path("sample.tar")
>>>
>>> Archive().create(archive_path, paths=[dirname])
<archive.archive.Archive object at 0x7f4c4b3ee1f0>
>>>
>>> with Archive().open(archive_path) as archive:
...     archive.verify()
...
>>> with Archive().open(archive_path) as archive:
...     for ti in archive._file.getmembers():
...         print(ti.name)
...
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/.manifest.yaml
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-1
lets_start_with_a_somewhat_long_directory_name_because_we_need_a_very_long_overall_path/sub-directory-2
Further investigation reveals that it is the Python version that created the archive that matters: an archive created with Python 3.8 can be verified with Python 3.7 without error, but if the archive has been created with Python 3.7, verification also fails with Python 3.8. Apparently, the relevant change was bpo-36268: the switch to the POSIX.1-2001 pax standard as the default format used for writing tars with the tarfile module.
PyYAML 5.1 raises the following deprecation warning: archive/manifest.py:126: YAMLLoadWarning: calling yaml.load_all() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
Should provide some documentation.
Python 3.12 raises a warning:
DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
This is due to the tarfile extraction filters introduced with Python 3.12.
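The warning can be avoided by choosing a filter explicitly. The sketch below assumes that the "data" filter is acceptable for the archives in question, which is not confirmed by the text above; extract_all() is a hypothetical helper. Checking for tarfile.data_filter keeps the code working on Python versions without extraction filters.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def extract_all(tarf, dest):
    """Extract all members of an open TarFile, using the 'data' filter if available.

    Hypothetical helper for illustration.
    """
    if hasattr(tarfile, "data_filter"):
        # Extraction filters are available: choose one explicitly.
        tarf.extractall(path=dest, filter="data")
    else:
        # Older Python without extraction filters.
        tarf.extractall(path=dest)


with tempfile.TemporaryDirectory() as tmpdir:
    archive_path = Path(tmpdir) / "sample.tar"
    payload = b"Hello!\n"
    # Build a tiny tar file in place so that the example is self-contained.
    with tarfile.open(archive_path, "w") as tarf:
        info = tarfile.TarInfo("base/msg.txt")
        info.size = len(payload)
        tarf.addfile(info, io.BytesIO(payload))
    outdir = Path(tmpdir) / "out"
    outdir.mkdir()
    with tarfile.open(archive_path, "r") as tarf:
        extract_all(tarf, outdir)
    extracted = (outdir / "base" / "msg.txt").read_bytes()
```

Note that the "data" filter may modify or reject some metadata, so whether it fits Archive.extract() would need to be checked against the metadata that archive-tools verifies.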
Calling archive-tool.py ls on an archive that contains a file having two consecutive blanks in the path throws a TypeError.
Steps to reproduce:
$ mkdir dir
$ touch dir/Normal-file
$ touch "dir/hello world"
$ archive-tool.py create dir.tar dir
$ archive-tool.py ls dir.tar
drwxr-xr-x rolf/rk 0 2024-02-23 16:46 dir
-rw-r--r-- rolf/rk 0 2024-02-23 16:46 dir/Normal-file
Traceback (most recent call last):
File "/home/rolf/Progs/archive-tools/build/scripts-3.6/archive-tool.py", line 5, in <module>
archive.cli.archive_tool()
File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/__init__.py", line 45, in archive_tool
sys.exit(args.func(args))
File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/ls.py", line 31, in ls
ls_ls_format(archive)
File "/home/rolf/Progs/archive-tools/build/lib/archive/cli/ls.py", line 20, in ls_ls_format
print(format_str % i)
TypeError: not all arguments converted during string formatting
The manifest is stored as a special file .manifest.yaml in the archive. We must prevent clashes with actual files in the archive and must not allow a file <base>/.manifest.yaml to be added to an archive:
>>> base = Path("base")
>>> (base / ".manifest.yaml").is_file()
True
>>> archive = Archive("archive.tar", mode="x:", paths=[base])
This would otherwise lead to an inconsistent archive:
$ tar tvf archive.tar
-r--r--r-- rolf/users 965 2019-02-11 21:09 base/.manifest.yaml
drwxr-xr-x rolf/users 0 2019-02-11 21:06 base/
-r--r--r-- rolf/users 650 2019-02-10 23:50 base/.manifest.yaml
drwxr-xr-x rolf/users 0 2019-02-10 23:49 base/data/
drwxr-xr-x rolf/users 0 2019-02-10 23:49 base/data/misc/
-rw-r--r-- rolf/users 385 2019-02-10 23:49 base/data/misc/rnd.dat
drwxr-xr-x rolf/users 0 2019-02-10 23:49 base/data/other/
Since we calculate checksums on all files added to the archive, we are in an excellent position to detect duplicate files. We should use this opportunity to add duplicate files only once to the archive and add hard links for all further instances. We could support three different modes:
no deduplication at all: always add all individual copies to the archive. Don't use hard links.
deduplication for file system hard links: if several paths that are hard links to the same file in the file system are added to the archive, hard link them in the archive as well. This is the default behaviour of the classical tar command.
deduplication for all identical files: if any two paths to be added to the archive have identical content, regardless of whether they are hard links in the file system or just copies, always add the file content only once and add hard links for all other instances.
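The third mode could be planned along the following lines. This is only a sketch: group_by_content() and plan_dedup() are hypothetical helpers, SHA-256 is an arbitrary choice, and the files are read in one go for brevity where a real implementation would read them in chunks and reuse the checksums already calculated for the manifest.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_by_content(paths):
    """Group regular files by the SHA-256 digest of their content."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        groups[digest].append(Path(p))
    return groups


def plan_dedup(paths):
    """Return (stored, links): files to store and {duplicate: first copy}."""
    stored, links = [], {}
    # Sorting makes the choice of the stored copy deterministic.
    for same in group_by_content(sorted(paths)).values():
        first, *others = same
        stored.append(first)
        for dup in others:
            links[dup] = first
    return stored, links


with tempfile.TemporaryDirectory() as tmpdir:
    base = Path(tmpdir)
    (base / "rnd1.dat").write_bytes(b"same content")
    (base / "rnd2.dat").write_bytes(b"same content")
    (base / "msg.txt").write_bytes(b"Hello!\n")
    stored, links = plan_dedup(base.iterdir())
    stored_names = sorted(p.name for p in stored)
    link_names = {dup.name: first.name for dup, first in links.items()}
```

In this example, rnd1.dat and msg.txt would be stored with their content, and rnd2.dat would become a hard link to rnd1.dat.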
Surprisingly, the tarfile standard library module already does some deduplication (ref. #28): if several paths that are hard links to the same file in the file system are added to the archive, the standard library creates hard links in the archive:
$ ls -la base
total 8
drwxr-xr-x 1 rolf users 32 May 25 15:33 .
drwxr-xr-x 1 rolf users 30 May 25 15:34 ..
-rw------- 2 rolf users 385 May 25 13:16 rnd1.dat
-rw------- 2 rolf users 385 May 25 13:16 rnd2.dat
$ archive-tool create archive.tar base
$ tar tvf archive.tar
-r--r--r-- rolf/users 683 2019-05-25 15:34 base/.manifest.yaml
drwxr-xr-x rolf/users 0 2019-05-25 15:33 base/
-rw------- rolf/users 385 2019-05-25 13:16 base/rnd1.dat
hrw------- rolf/users 0 2019-05-25 13:16 base/rnd2.dat link to base/rnd1.dat
Unfortunately, archive-tool verify chokes on such an archive:
$ archive-tool verify archive.tar
archive-tool verify: error: archive.tar:base/rnd2.dat: wrong type, expected regular file
The archive-tool.py script has the verify, ls, info, and check subcommands that read an archive file. Consider the option that these subcommands instead read a directory that is assumed to have been created by extracting an archive (e.g. read the manifest file therein). This idea was first formulated in #17.
The command line tool archive-tool diff fails with a TypeError if the first archive contains an additional entry at the end that is not present in the second archive. Consider two archives:
$ archive-tool ls archive.tar
drwxr-xr-x rolf/rk 0 2021-05-09 21:06 base
drwxr-x--- rolf/rk 0 2021-05-09 21:06 base/data
-rw------- rolf/rk 385 2021-05-09 21:06 base/data/rnd.dat
drwxr-xr-x rolf/rk 0 2021-05-09 21:06 base/empty
-rw-r--r-- rolf/rk 7 2021-05-09 21:06 base/msg.txt
-rw------- rolf/rk 385 2021-05-09 21:06 base/rnd.dat
lrwxrwxrwx rolf/rk 0 2021-05-09 21:06 base/s.dat -> data/rnd.dat
$ archive-tool ls archive-newfile.tar
drwxr-xr-x rolf/rk 0 2021-05-09 21:08 base
drwxr-x--- rolf/rk 0 2021-05-09 21:06 base/data
-rw------- rolf/rk 385 2021-05-09 21:06 base/data/rnd.dat
drwxr-xr-x rolf/rk 0 2021-05-09 21:06 base/empty
-rw-r--r-- rolf/rk 7 2021-05-09 21:06 base/msg.txt
-rw------- rolf/rk 385 2021-05-09 21:06 base/rnd.dat
lrwxrwxrwx rolf/rk 0 2021-05-09 21:06 base/s.dat -> data/rnd.dat
-rw-r--r-- rolf/rk 0 2021-05-09 21:08 base/zzz.dat
Note that archive-newfile.tar has an additional entry base/zzz.dat as the very last item.
Now:
$ archive-tool diff archive-newfile.tar archive.tar
Traceback (most recent call last):
File "/usr/bin/archive-tool", line 5, in <module>
archive.cli.archive_tool()
File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 45, in archive_tool
sys.exit(args.func(args))
File "/usr/lib/python3.6/site-packages/archive/cli/diff.py", line 62, in diff
elif path1 is None or path1 > path2:
TypeError: '>' not supported between instances of 'PosixPath' and 'NoneType'
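The traceback shows a path being compared with None once one of the two entry lists is exhausted. A comparison loop that checks for exhaustion on both sides before comparing paths could look like the following sketch. diff_sorted() is a hypothetical function written for illustration; it is not the code in archive/cli/diff.py, and it assumes that both inputs are sorted.

```python
from pathlib import PurePosixPath


def diff_sorted(paths1, paths2):
    """Compare two sorted path lists, yielding (status, path) tuples.

    status is '<' for entries only in paths1, '>' for entries only in
    paths2, and '=' for entries present in both.
    """
    it1, it2 = iter(paths1), iter(paths2)
    p1, p2 = next(it1, None), next(it2, None)
    while p1 is not None or p2 is not None:
        # Check for exhaustion of either side before comparing paths,
        # so that a path is never compared with None.
        if p2 is None or (p1 is not None and p1 < p2):
            yield ("<", p1)
            p1 = next(it1, None)
        elif p1 is None or p1 > p2:
            yield (">", p2)
            p2 = next(it2, None)
        else:
            yield ("=", p1)
            p1, p2 = next(it1, None), next(it2, None)


newfile = [PurePosixPath(p) for p in ("base", "base/msg.txt", "base/zzz.dat")]
plain = [PurePosixPath(p) for p in ("base", "base/msg.txt")]
result = list(diff_sorted(newfile, plain))
reverse = list(diff_sorted(plain, newfile))
```

With the additional entry at the end of the first list, the loop reports it as present only in the first list instead of failing.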
The archive-tool check command requires at least two positional arguments (unless the --stdin flag is used), archive and files. This should be changed such that the files argument is optional. If not provided, the base directory from the archive should be taken as the default.
E.g. assume an archive foo.tar.gz having the base directory foo. At the moment, the command
$ archive-tool check foo.tar.gz foo
will check if all files in foo are in the archive. Omitting the files argument yields an error:
$ archive-tool check foo.tar.gz
usage: archive-tool [-h] {create,verify,ls,info,check,diff,find} ...
archive-tool: error: either --stdin or the files argument is required
This should be changed so that the files argument is optional and defaults to foo in this case.
As a followup to #4, there should be a subclass of mailbox.Mailbox reading from a mail archive. It might be implemented as a subclass of mailbox.Maildir, disabling all write access and diverting all methods accessing files to read from corresponding file objects returned from TarFile.extractfile().
Add the ability to search entries in archives. As a command line tool, this should be similar to the well known find command, only that it searches archives instead of directory hierarchies. Something like:
$ archive-find -type f -name '*.bib' archive-*.tar.bz2
The archive-tool check command as implemented in #1 lists files not contained in the archive. It should also implement the inverse operation: given a list of files, list all that are in the archive having matching attributes and checksum.
When creating an archive, the basedir keyword argument to Archive() should be optional. If omitted, a suitable default depending on path and paths should be chosen. But if the argument is omitted, a rather incomprehensible TypeError is raised:
>>> archive = Archive("archive.tar", mode="x:", paths=["base"])
Traceback (most recent call last):
...
TypeError: argument should be a path or str object, not <class 'NoneType'>
Strictly speaking, this is not an error: in the absence of any documentation, nothing states that this argument is optional in this case. Nevertheless, this behavior was not intended.
Add an optional field Domain or similar to the header of the manifest. This should be user defined at the creation of the archive. The value (if provided) should be a string and it should be considered opaque by the archive-tools package.
Rationale: the user might want to store many archives from different origins at the same location. For instance, they might be backups taken from different hosts or installations. In this case, it would be useful for the user to be able to mark the archives according to the context, such as the hostname that the backup has been created from.
The API of class Archive may need a review:
open() and create() instead.
verify() method will open the archive file again. Consider keeping the file open. In that case, consider implementing the context manager protocol.
For the moment, it is possible to do the following:
archive = Archive("archive.tar.gz", mode="x:gz", paths=["base", "base/../../../etc/passwd"], basedir="base")
This should raise an error.
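A check of that kind might be sketched as follows. check_below_basedir() is a hypothetical helper, not the archive-tools implementation. The check is purely lexical: it normalizes ".." components without consulting the file system, so it does not take symbolic links into account.

```python
import posixpath
from pathlib import PurePosixPath


def check_below_basedir(path, basedir):
    """Raise ValueError unless path, after normalization, is basedir or below it.

    Purely lexical check; it does not touch the file system and does not
    resolve symbolic links. Hypothetical helper for illustration.
    """
    normalized = PurePosixPath(posixpath.normpath(str(path)))
    base = PurePosixPath(posixpath.normpath(str(basedir)))
    try:
        normalized.relative_to(base)
    except ValueError:
        raise ValueError("invalid path %s: must be below %s"
                         % (path, basedir)) from None
    return normalized


accepted = check_below_basedir("base/data/rnd.dat", "base")
try:
    check_below_basedir("base/../../../etc/passwd", "base")
    rejected = False
except ValueError:
    rejected = True
```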
backup-tool is designed to be run by a system timer and to decide which schedule to use based on the settings in the configuration file. However, there should be an option to override the schedule with a command line flag. This would be useful for running the tool manually.
A warning is raised when trying to add a file to an archive having an unsupported type, ref. #34:
>>> from pathlib import Path
>>> from archive.archive import Archive
>>> archive = Archive().create(Path("archive-tmp.tar"), "", [Path("/tmp/.X11-unix")])
/usr/lib/python3.6/site-packages/archive/manifest.py:135: ArchiveWarning: /tmp/.X11-unix/X0: socket ignored
warnings.warn(ArchiveWarning("%s ignored" % e))
But the command line script archive-tool.py fails with a NameError while trying to emit this warning:
$ archive-tool create archive-tmp.tar /tmp/.X11-unix
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 133, in iterpaths
info = cls(path=p)
File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 61, in __init__
raise ArchiveInvalidTypeError(self.path, ftype)
archive.exception.ArchiveInvalidTypeError: /tmp/.X11-unix/X0: socket
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/archive-tool", line 5, in <module>
archive.cli.archive_tool()
File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 44, in archive_tool
sys.exit(args.func(args))
File "/usr/lib/python3.6/site-packages/archive/cli/create.py", line 28, in create
tags=args.tag)
File "/usr/lib/python3.6/site-packages/archive/archive.py", line 76, in create
self._create(path, mode, paths, basedir, excludes, dedup, tags)
File "/usr/lib/python3.6/site-packages/archive/archive.py", line 82, in _create
self.manifest = Manifest(paths=paths, excludes=excludes, tags=tags)
File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 165, in __init__
self.fileinfos = sorted(fileinfos, key=lambda fi: fi.path)
File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 140, in iterpaths
yield from cls.iterpaths(p.iterdir(), excludes)
File "/usr/lib/python3.6/site-packages/archive/manifest.py", line 135, in iterpaths
warnings.warn(ArchiveWarning("%s ignored" % e))
File "/usr/lib64/python3.6/warnings.py", line 99, in _showwarnmsg
msg.file, msg.line)
File "/usr/lib/python3.6/site-packages/archive/cli/__init__.py", line 25, in showwarning
s = "%s: %s\n" % (argparser.prog, message)
NameError: name 'argparser' is not defined
The test test_03_cli_error.py may occasionally yield spurious errors:
============================= test session starts ==============================
platform linux -- Python 3.4.6, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: /home/rolf/Progs/archive-tools/tests, inifile: pytest.ini
plugins: dependency-0.3.2
collected 115 items
test_01_create.py ................................ [ 27%]
test_02_create_errors.py ....... [ 33%]
test_02_create_misc.py ... [ 36%]
test_02_verify_errors.py ........ [ 43%]
test_03_cli.py ......................................... [ 79%]
test_03_cli_check.py ............ [ 89%]
test_03_cli_error.py ...........F [100%]
=================================== FAILURES ===================================
_______________________ test_cli_integrity_missing_file ________________________
test_dir = PosixPath('/tmp/archive-tools-test-rxz4gr1c')
archive_name = 'archive-test_cli_integrity_missing_file.tar'
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f707140d160>
    def test_cli_integrity_missing_file(test_dir, archive_name, monkeypatch):
        monkeypatch.chdir(str(test_dir))
        base = Path("base")
        missing = base / "data" / "not-present"
        with missing.open("wt") as f:
            f.write("Hello!")
        manifest = Manifest(paths=[base])
        with open("manifest.yaml", "wb") as f:
            manifest.write(f)
        missing.unlink()
        with tarfile.open(archive_name, "w") as tarf:
            with open("manifest.yaml", "rb") as f:
                manifest_info = tarf.gettarinfo(arcname="base/.manifest.yaml",
                                                fileobj=f)
                manifest_info.mode = stat.S_IFREG | 0o444
                tarf.addfile(manifest_info, f)
            tarf.add("base")
        with TemporaryFile(mode="w+t", dir=str(test_dir)) as f:
            args = ["verify", archive_name]
            with pytest.raises(subprocess.CalledProcessError) as exc_info:
                callscript("archive-tool.py", args, stderr=f)
            assert exc_info.value.returncode == 3
            f.seek(0)
            line = f.readline()
>           assert "%s:%s: missing" % (archive_name, missing) in line
E AssertionError: assert ('%s:%s: missing' % ('archive-test_cli_integrity_missing_file.tar', PosixPath('base/data/not-present'))) in 'archive-tool.py verify: error: archive-test_cli_integrity_missing_file.tar:base/data: wrong modification time\n'
/home/rolf/Progs/archive-tools/tests/test_03_cli_error.py:206: AssertionError
----------------------------- Captured stdout call -----------------------------
> /usr/bin/python3 /home/rolf/Progs/archive-tools/build/scripts-3.4/archive-tool.py verify archive-test_cli_integrity_missing_file.tar
==================== 1 failed, 114 passed in 10.50 seconds =====================
The reason is obvious: unlinking the file touches the modification time of the parent directory so that it doesn't match the metadata recorded in the manifest.
The command archive-tool check takes an archive and a list of files as arguments and checks if these files are in the archive. But an archive also contains a manifest file that itself is not considered to belong to the content of the archive. This has the confusing effect that any directory that results from extracting an archive contains a file that archive-tool check reports as not being in the archive:
$ tar xfj archive.tar.bz2
$ archive-tool check archive.tar.bz2 archive
archive/.manifest.yaml
This is particularly unfortunate, because the output of archive-tool check was intended to be useful to create incremental archives. But the manifest file cannot be added to an archive:
$ archive-tool create --basedir=archive archive-v2.tar.bz2 `archive-tool check archive.tar.bz2 archive`
archive-tool create: error: cannot add archive/.manifest.yaml: this filename is reserved
There are various error conditions when creating or reading archives. At the moment, standard library exceptions are raised, mostly ValueError. It would be more appropriate to define custom exceptions and raise those instead.
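A small exception hierarchy for this purpose might look like the sketch below. The names ArchiveIntegrityError and ArchiveInvalidTypeError appear in tracebacks elsewhere on this page; the base class ArchiveError, the classes ArchiveCreateError and ArchiveReadError, and the helper check_reserved() are assumptions made for illustration.

```python
class ArchiveError(Exception):
    """Base class for errors raised by the archive package (assumed name)."""


class ArchiveCreateError(ArchiveError):
    """Error while creating an archive (assumed name)."""


class ArchiveReadError(ArchiveError):
    """Error while reading an archive (assumed name)."""


class ArchiveIntegrityError(ArchiveError):
    """The archive content does not match the manifest."""


class ArchiveInvalidTypeError(ArchiveError):
    """A file of an unsupported type was encountered."""


def check_reserved(name):
    """Hypothetical check, raising a custom exception instead of ValueError."""
    if name == ".manifest.yaml":
        raise ArchiveCreateError("cannot add %s: this filename is reserved"
                                 % name)


try:
    check_reserved(".manifest.yaml")
except ArchiveError as e:
    # A command line script can catch the base class in one place.
    message = str(e)
```

A common base class lets callers handle all errors from the package with a single except clause while still being able to distinguish individual conditions.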
A FileInfo object is created either from data or from a file path. In the latter case, the file attributes are obtained from a stat() call and checksums of the file are calculated (in the case of a regular file). The calculation of the checksums is expensive as it requires reading the complete file. This should be postponed to the moment that this information is actually needed.
This may particularly speed up archive-tool check, because the checksums need only be compared for those files that are present in the archive. Currently the checksums of all files on the command line are calculated first, before even checking if they are in the archive.
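Postponing the calculation could be done with a property that computes the checksum on first access and caches the result. The class below is a hypothetical stand-in written for illustration; it is not the archive-tools FileInfo, and the reads counter exists only to make the laziness visible.

```python
import hashlib
import tempfile
from pathlib import Path


class LazyFileInfo:
    """File metadata with the checksum computed on first access only.

    Hypothetical stand-in for illustration, not the archive-tools FileInfo.
    """

    def __init__(self, path, algorithm="sha256"):
        self.path = Path(path)
        self.size = self.path.stat().st_size  # cheap: a single stat() call
        self._algorithm = algorithm
        self._checksum = None
        self.reads = 0  # counts how often the file content was read

    @property
    def checksum(self):
        if self._checksum is None:
            h = hashlib.new(self._algorithm)
            with self.path.open("rb") as f:
                for chunk in iter(lambda: f.read(8192), b""):
                    h.update(chunk)
            self.reads += 1
            self._checksum = h.hexdigest()
        return self._checksum


with tempfile.TemporaryDirectory() as tmpdir:
    p = Path(tmpdir) / "msg.txt"
    p.write_bytes(b"Hello!\n")
    info = LazyFileInfo(p)
    reads_before = info.reads
    first = info.checksum
    second = info.checksum
    reads_after = info.reads
```

With this arrangement, a check that first compares the path and the size never reads the content of files that are not in the archive at all.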
When creating an archive, the Archive class should take a directory path as an optional additional argument and should temporarily change to that directory if given. E.g. something like the following should work:
archive = Archive("archive.tar", mode="x:", paths=["data"], workdir="/home/bla")
In that case, both the archive path archive.tar and the paths to be added to the archive, data in this case, should be taken relative to the work dir /home/bla, regardless of the current working directory when making the call. If not given in the call, workdir should default to the current working directory.
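Temporarily changing the directory can be expressed as a context manager that restores the previous working directory even if an error occurs. This is a minimal sketch; tmp_chdir() is a hypothetical name and the code is not taken from archive-tools.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def tmp_chdir(workdir):
    """Temporarily change the current working directory to workdir."""
    saved = Path.cwd()
    os.chdir(str(workdir))
    try:
        yield Path.cwd()
    finally:
        # Restore the previous directory even if the body raised.
        os.chdir(str(saved))


with tempfile.TemporaryDirectory() as tmpdir:
    before = Path.cwd()
    with tmp_chdir(tmpdir):
        # Relative paths are now taken relative to tmpdir.
        Path("archive.tar").touch()
        created = (Path(tmpdir) / "archive.tar").is_file()
    after = Path.cwd()
```

Once inside the context, any path that was given relative to the work directory must be used as is and must not be joined with the work directory again.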
Class MailArchive adds an index as metadata to the archive. We may add new features to this mailindex in future versions while still maintaining backward compatibility. See #49 for an example. So we may need to distinguish versions. Therefore we should add a header to the mailindex in a similar way as in the manifest.
The command archive-tool check should accept additional flags:
--prefix <dir>: take the files to be checked relative to <dir> when looking them up in the archive.
--stdin: read files to be checked from stdin rather than taking them as command line arguments.
Add another command line script to manage backups. It should be designed to be called regularly as a background job (e.g. from cron or a systemd timer).
Features:
It must at least support creation of backups.
Read configuration from a file. On creation of an archive the following options should be set from the configuration file:
Support creation of incremental backups. This may be controlled by a command line flag.
Add a dedicated header section with some metadata to the manifest. This could facilitate compatibility across versions. Could do this in a separate YAML document, so the result could look like:
%YAML 1.1
---
Checksums:
- sha256
Date: Sun, 10 Mar 2019 12:40:26 +0100
Generator: archive-tools 1.0
Version: '1.0'
---
- gid: 100
  gname: users
  mode: 493
  mtime: 1413139698.2000918
  path: Bilder/2014
  type: d
  uid: 1000
  uname: rolf
The command line tool archive-tool diff features the option --skip-dir-content that, according to the documentation, is supposed to: "in the case of a subdirectory missing from one archive, only report the directory, but skip its content." The result is inconsistent in some situations. Consider two archives:
$ archive-tool ls archive-a.tar
drwxr-xr-x rolf/rk 0 2021-05-14 15:47 base
drwxr-x--- rolf/rk 0 2021-05-14 15:47 base/data
drwxr-xr-x rolf/rk 0 2021-05-14 14:54 base/data/aa
-rw-r--r-- rolf/rk 347 2021-05-14 14:54 base/data/aa/rnda.dat
-rw------- rolf/rk 385 2021-05-14 14:45 base/data/rnd.dat
-rw-r--r-- rolf/rk 487 2021-04-18 23:11 base/data/rnd2.dat
drwxr-xr-x rolf/rk 0 2021-05-14 14:45 base/empty
-rw-r--r-- rolf/rk 7 2021-05-14 14:45 base/msg.txt
-rw------- rolf/rk 385 2021-05-14 14:45 base/rnd.dat
lrwxrwxrwx rolf/rk 0 2021-05-14 14:45 base/s.dat -> data/rnd.dat
$ archive-tool ls archive-b.tar
drwxr-xr-x rolf/rk 0 2021-05-14 15:47 base
drwxr-xr-x rolf/rk 0 2021-05-14 14:54 base/data/aa
-rw-r--r-- rolf/rk 347 2021-05-14 14:54 base/data/aa/rnda.dat
-rw-r--r-- rolf/rk 42 2021-05-14 15:49 base/data/rnd2.dat
drwxr-xr-x rolf/rk 0 2021-05-14 15:49 base/data/zz
-rw-r--r-- rolf/rk 347 2021-05-14 15:49 base/data/zz/rndz.dat
drwxr-xr-x rolf/rk 0 2021-05-14 14:45 base/empty
-rw-r--r-- rolf/rk 7 2021-05-14 14:45 base/msg.txt
-rw------- rolf/rk 385 2021-05-14 14:45 base/rnd.dat
lrwxrwxrwx rolf/rk 0 2021-05-14 14:45 base/s.dat -> data/rnd.dat
Note that archive-b.tar contains some content below base/data, but not that directory itself.
The output of archive-tool diff without the --skip-dir-content option is correct:
$ archive-tool diff archive-a.tar archive-b.tar
Only in archive-a.tar: base/data
Only in archive-a.tar: base/data/rnd.dat
Files archive-a.tar:base/data/rnd2.dat and archive-b.tar:base/data/rnd2.dat differ
Only in archive-b.tar: base/data/zz
Only in archive-b.tar: base/data/zz/rndz.dat
With that option, the output is just misleading or even plain wrong in this case:
$ archive-tool diff --skip-dir-content archive-a.tar archive-b.tar
Only in archive-a.tar: base/data
Only in archive-b.tar: base/data/aa
Only in archive-b.tar: base/data/rnd2.dat
Only in archive-b.tar: base/data/zz
Note that archive-tool diff reports base/data/aa to be missing in archive-a.tar, which is false.
Add a special archive flavor for mail messages. It should provide a list of all messages in the archive with their basic attributes (Date, Subject, From, To) as extended metadata. The internal structure might be something like a tar archive of a Maildir mailbox.
Basic functions that should be implemented:
archive-tool only supports certain file types that may be added to an archive, currently regular files, directories, and symbolic links. When trying to add a file having an invalid type, such as a socket, archive-tool create throws an error:
$ archive-tool create archive-tmp.tar /tmp
archive-tool create: error: /tmp/.X11-unix/X0: invalid file type
Should ignore files with invalid type on create instead (but may still emit a warning).
Archive.create() features a workdir keyword argument; the method is supposed to temporarily change to that directory if given, see #20. This works as expected if the working directory is absolute, but fails when passing a relative path.
Consider the following setup:
$ ls -laR work
work:
total 0
drwxr-xr-x 1 rolf rk 8 May 1 12:35 .
drwxr-xr-x 1 rolf rk 278 May 1 12:35 ..
drwxr-xr-x 1 rolf rk 8 May 1 12:35 base
work/base:
total 0
drwxr-xr-x 1 rolf rk 8 May 1 12:35 .
drwxr-xr-x 1 rolf rk 8 May 1 12:35 ..
drwxr-xr-x 1 rolf rk 14 May 1 12:35 data
work/base/data:
total 4
drwxr-xr-x 1 rolf rk 14 May 1 12:35 .
drwxr-xr-x 1 rolf rk 8 May 1 12:35 ..
-rw-r--r-- 1 rolf rk 385 Apr 18 23:11 rnd.dat
Now, creating an archive passing an absolute path as workdir works as expected:
>>> workdir = Path.cwd() / "work"
>>> Archive().create(Path("archive-abs.tar"), "", [Path("base")], workdir=workdir)
>>> (workdir / "archive-abs.tar").is_file()
True
Trying the same using a relative path fails:
>>> workdir = Path("work")
>>> Archive().create(Path("archive-rel.tar"), "", [Path("base")], workdir=workdir)
Traceback (most recent call last):
...
FileNotFoundError: [Errno 2] No such file or directory: 'work/archive-rel.tar'
When creating an archive, it is currently not possible to exclude some parts of a large directory tree. archive-tool create should have an option --exclude so that it will be possible to write something like:
$ archive-tool create --exclude=/etc/bootsplash --exclude=/etc/udev archive-etc.tar /etc
The archive-tool check subcommand checks whether some files are in an archive. A file is considered to be in the archive if the file size and checksum match and the file modification time is not newer than the modification time recorded in the archive for that entry.
Now, the modification time of a file is somewhat fragile; it may for instance be changed by merely copying the file. It might be useful in some cases to be able to ignore that criterion in the checks. A command line flag like --ignore-mtime should be added to the archive-tool check subcommand for this purpose.
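The criteria described above, together with the proposed flag, can be summarized in a small predicate. This is a sketch only: the Entry tuple and the function is_in_archive() are hypothetical and merely restate the rule from the text. They are not the archive-tools implementation.

```python
from collections import namedtuple

# Hypothetical record holding the attributes relevant for the check.
Entry = namedtuple("Entry", ["size", "checksum", "mtime"])


def is_in_archive(file_entry, archive_entry, ignore_mtime=False):
    """Apply the criteria described above to a file and an archive entry."""
    if file_entry.size != archive_entry.size:
        return False
    if file_entry.checksum != archive_entry.checksum:
        return False
    if ignore_mtime:
        return True
    # The file must not be newer than the entry recorded in the archive.
    return file_entry.mtime <= archive_entry.mtime


archived = Entry(size=7, checksum="abc", mtime=1000.0)
copied = Entry(size=7, checksum="abc", mtime=2000.0)
strict = is_in_archive(copied, archived)
relaxed = is_in_archive(copied, archived, ignore_mtime=True)
```

Here the copied file has identical content but a newer modification time, so it is only considered to be in the archive when the mtime criterion is ignored.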
Integrate Travis CI to run the test suite.
It might be convenient to sort the members before adding them to the archive. That would make the outcome more predictable and would make it easier to find entries in the manifest. E.g. the following:
>>> archive = Archive("archive.tar", mode="x:", paths=["base/c", "base/a", "base/b"])
>>> [fi.path for fi in archive.manifest]
[PosixPath('base/c'), PosixPath('base/a'), PosixPath('base/b')]
should yield [PosixPath('base/a'), PosixPath('base/b'), PosixPath('base/c')] instead.