nexb / extractcode
A mostly universal file extraction library and CLI tool to extract almost any archive in a reasonably safe way on Linux, macOS and Windows.
Home Page: https://www.aboutcode.org/
We should enable Python 3.9 testing on CI.
The link contains a typo and points to a nonexistent page:
- homepage_url: https://github.com/nexB/extractode
It is missing a c; it should be:
- homepage_url: https://github.com/nexB/extractcode
It would avoid a possible security issue with calling a subprocess and may also allow progress reporting (e.g. https://github.com/prebuilder/fetchers.py/blob/master/fetchers/unpackers/archives/tar.py#L24).
When running extractcode on gcc-4.9 (download here: https://packages.debian.org/jessie/all/gcc-4.9-source/download), extractcode fails with these error messages:
nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode samples/gcc-4.9-source_4.9.2-10_all.deb
Extracting archives...
[####################################]
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.
With the --verbose flag, the error messages look like this:
nakami@debian:~/Downloads/scancode-toolkit-developNEW$ ./extractcode --verbose samples/gcc-4.9-source_4.9.2-10_all.deb
Extracting archives...
Extracting: gcc-4.9-source_4.9.2-10_all.deb
[...]
Extracting: changelog.Debian.gz
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
ERROR extracting: writer-big.tar: 'Truncated tar archive'
ERROR extracting: test-trailing-junk.zip: 'Invalid central directory signature'
ERROR extracting: issue6550.gz: 'Error -3 while decompressing: too many length or distance symbols'
Extracting done.
Are you aware of this problem?
It seems that the problem is either the nesting depth of the archives within the initial archive, the nesting depth within those archives, or both depths summed up.
When extractcode-libarchive isn't present, we see the following on openSUSE. A more graceful error indicating what to do would be helpful.
[ 35s] src/extractcode/archive.py:29: in <module>
[ 35s] from extractcode import libarchive2
[ 35s] <frozen importlib._bootstrap>:991: in _find_and_load
[ 35s] ???
[ 35s] <frozen importlib._bootstrap>:975: in _find_and_load_unlocked
[ 35s] ???
[ 35s] <frozen importlib._bootstrap>:671: in _load_unlocked
[ 35s] ???
[ 35s] /usr/lib/python3.8/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
[ 35s] exec(co, module.__dict__)
[ 35s] src/extractcode/libarchive2.py:635: in <module>
[ 35s] archive_reader = libarchive.archive_read_new
[ 35s] /usr/lib/python3.8/ctypes/__init__.py:386: in __getattr__
[ 35s] func = self.__getitem__(name)
[ 35s] /usr/lib/python3.8/ctypes/__init__.py:391: in __getitem__
[ 35s] func = self._FuncPtr((name_or_ordinal, self))
[ 35s] E AttributeError: /usr/bin/python3.8: undefined symbol: archive_read_new
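One possible shape for a friendlier message, as a minimal sketch: the traceback above shows ctypes raising AttributeError when the loaded library lacks the symbol, so the lookup can be wrapped and re-raised with a hint. The helper name `require_symbol` and the wording of the message are assumptions for illustration, not extractcode code.

```python
def require_symbol(lib, name, hint):
    """Return the attribute `name` of a ctypes library object `lib`.

    ctypes raises AttributeError for an undefined symbol; convert that
    into an ImportError that carries an actionable hint for the user.
    """
    try:
        return getattr(lib, name)
    except AttributeError as exc:
        raise ImportError(
            f"Symbol {name!r} not found in the loaded library: {exc}. {hint}"
        ) from exc
```

A caller could then write something like `require_symbol(libarchive, "archive_read_new", "Install an extractcode-libarchive plugin or a system libarchive.")` instead of a bare attribute access.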
I have extracted a Windows Docker image using extractcode. I noticed that the Program Files directory in one of the layers had the space replaced with an underscore (Program_Files vs Program Files). I expected that the paths of files and directories would not be modified when extracted.
We have this file: https://github.com/apache/tika/raw/1.28.5/tika-parsers/src/test/resources/test-documents/droste.zip . This zip contains itself as a nested zip, so extracting it recursively never ends. See:
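A common guard against self-containing archives, sketched here under assumptions: remember the SHA-256 digest of every archive on the current nesting chain and stop when a nested archive repeats one of them, or when a depth cap is hit. The function name `list_nested`, the default cap of 10 and the skip marker are hypothetical; this handles only zip files through the standard library and is not how extractcode is implemented.

```python
import hashlib
import io
import zipfile


def list_nested(data, ancestors=frozenset(), max_depth=10, prefix=""):
    """Return member paths of a zip given as bytes, descending into nested zips.

    `ancestors` holds the digests of the archives currently being walked.
    A nested archive whose digest is already in that set contains itself.
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in ancestors or len(ancestors) >= max_depth:
        return [prefix + "<skipped: recursive or too deep>"]
    chain = ancestors | {digest}
    paths = []
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            member = zf.read(name)
            paths.append(prefix + name)
            # Only descend when the member itself looks like a zip archive.
            if zipfile.is_zipfile(io.BytesIO(member)):
                paths.extend(
                    list_nested(member, chain, max_depth, prefix + name + "/")
                )
    return paths
```

Reading every member fully into memory keeps the sketch short; a real tool would stream to disk and also cap the total extracted size.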
Hello,
I can't seem to manage to get extractcode and typecode to run the tests.
The whole issue is about this part of the README.rst:
To install this package with its full capability (where the binaries for
7zip and libarchive are installed), use the full extra option::
    pip install extractcode[full]
If you want to use the version of binaries (possibly) provided by your operating
system, use the minimal option::
    pip install extractcode
In this case, you will need to provide a working and compatible libarchive and
7zip installed and configured in one of these ways such that ExtractCode can
find them:
a typecode-libarchive and typecode-7z plugin: See the standard ones at
https://github.com/nexB/scancode-plugins/tree/main/builtins
These can either bundle a libarchive library, a 7z executable or expose a
system-installed libraries.
It does so by providing plugin entry points as scancode_location_provider
for extractcode_libarchive that should point to a LocationProviderPlugin
subclass with a get_locations() method that must return a mapping with
this key:
- 'extractcode.libarchive.dll': the absolute path to a libarchive shared object/DLL
See for example:
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/setup.py#L40
- https://github.com/nexB/scancode-plugins/blob/4da5fe8a5ab1c87b9b4af9e54d7ad60e289747f5/builtins/extractcode_libarchive-linux/src/extractcode_libarchive/__init__.py#L17
And in the same way, the scancode_location_provider for extractcode_7zip
should point to a LocationProviderPlugin subclass with a get_locations()
method that must return a mapping with this key:
- 'extractcode.sevenzip.exe': the absolute path to a 7zip executable
See for example:
use environment variables to point to installed binaries:
- EXTRACTCODE_LIBARCHIVE_PATH: the absolute path to a libarchive DLL
- EXTRACTCODE_7Z_PATH: the absolute path to a 7zip executable
a system-installed libarchive and 7zip executable available in the system PATH.
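The last two options in the quoted README (environment variables, then system-installed binaries) can be pictured with a small lookup sketch. The function name `locate` and the precedence shown are assumptions for illustration; the README excerpt lists the options without stating an order, and this is not extractcode's own lookup code.

```python
import ctypes.util
import os
import shutil


def locate(env_var, exe_names=(), lib_name=None, environ=None):
    """Return a location for a binary, or None if nothing is found.

    1. The value of the environment variable `env_var`, if set.
    2. The first of `exe_names` found on the system PATH.
    3. A shared library called `lib_name`, as resolved by ctypes, which
       returns a name or path usable with ctypes.CDLL.
    """
    environ = os.environ if environ is None else environ
    configured = environ.get(env_var)
    if configured:
        return configured
    for exe in exe_names:
        found = shutil.which(exe)
        if found:
            return found
    if lib_name:
        return ctypes.util.find_library(lib_name)
    return None
```

For example, `locate("EXTRACTCODE_7Z_PATH", exe_names=("7z", "7za"))` or `locate("EXTRACTCODE_LIBARCHIVE_PATH", lib_name="archive")`. As worded in the README excerpt, each variable holds the absolute path to the executable or library file itself, not to the directory that contains it.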
So I am on a distro with libarchive-3.7.1 and p7zip-16.02 installed. Obviously I don't want to bundle with the full option.
I set up:
export EXTRACTCODE_7Z_PATH=%{_bindir}
export EXTRACTCODE_LIBARCHIVE_PATH_ENVVAR=%{_libdir}
%pytest
It seems that libarchive is detected:
=============================== warnings summary ===============================
../../../../usr/lib/python3.12/site-packages/typecode/magic2.py:195
/usr/lib/python3.12/site-packages/typecode/magic2.py:195: UserWarning: System libmagic found in typical location is used. Install instead a typecode-libmagic plugin for best support.
warnings.warn(
src/extractcode/libarchive2.py:107: 12 warnings
/builddir/build/BUILD/extractcode-31.0.0/src/extractcode/libarchive2.py:107: UserWarning: Using "libarchive" library found in a system location. Install instead a extractcode-libarchive plugin for best support.
warnings.warn(
(same with libmagic)
However nothing works:
==================================== ERRORS ====================================
_________________ ERROR collecting src/extractcode/archive.py __________________
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_________________ ERROR collecting src/extractcode/archive.py __________________
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_________________ ERROR collecting src/extractcode/extract.py __________________
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_________________ ERROR collecting src/extractcode/extract.py __________________
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_______________ ERROR collecting src/extractcode/libarchive2.py ________________
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
_______________ ERROR collecting src/extractcode/libarchive2.py ________________
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_archive.py ____________________
tests/test_archive.py:29: in <module>
from extractcode import archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_archive.py ____________________
tests/test_archive.py:29: in <module>
from extractcode import archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_extract.py ____________________
tests/test_extract.py:23: in <module>
from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
____________________ ERROR collecting tests/test_extract.py ____________________
tests/test_extract.py:23: in <module>
from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
________________ ERROR collecting tests/test_extractcode_api.py ________________
tests/test_extractcode_api.py:16: in <module>
from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
________________ ERROR collecting tests/test_extractcode_api.py ________________
tests/test_extractcode_api.py:16: in <module>
from extractcode import extract
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/extract.py:23: in <module>
import extractcode.archive
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/archive.py:29: in <module>
from extractcode import libarchive2
<frozen importlib._bootstrap>:1266: in _find_and_load
???
<frozen importlib._bootstrap>:1237: in _find_and_load_unlocked
???
<frozen importlib._bootstrap>:841: in _load_unlocked
???
/usr/lib/python3.12/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
exec(co, module.__dict__)
src/extractcode/libarchive2.py:635: in <module>
archive_reader = libarchive.archive_read_new
/usr/lib64/python3.12/ctypes/__init__.py:392: in __getattr__
func = self.__getitem__(name)
/usr/lib64/python3.12/ctypes/__init__.py:397: in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
E AttributeError: /usr/bin/python3: undefined symbol: archive_read_new
archive_read_new is not found so it doesn't find libarchive I believe.
So I may not have understood the README.rst correctly. Are all the parts following "In this case, you will need to provide a working and compatible libarchive and 7zip installed and configured in one of these ways such that ExtractCode can find them:" mandatory, or do I have to choose among the options?
Are the plugins mandatory?
Then, if I distribute extractcode, does the user have to set up EXTRACTCODE_7Z_PATH and EXTRACTCODE_LIBARCHIVE_PATH each time? Is there a way to avoid that without bundling?
On Fedora 24, I get this:
$ extractcode test/requests-2.11.1.tar.gz
Extracting archives...
[------------------------------------]
Traceback (most recent call last):
File "scancode-toolkit/bin/extractcode", line 9, in <module>
load_entry_point('scancode-toolkit', 'console_scripts', 'extractcode')()
File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "scancode-toolkit/src/scancode/utils.py", line 64, in main
standalone_mode=standalone_mode, **extra)
File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "scancode-toolkit/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "scancode-toolkit/src/scancode/extract_cli.py", line 156, in extractcode
for xev in extraction_events:
File "scancode-toolkit/lib/python2.7/site-packages/click/_termui_impl.py", line 240, in next
rv = next(self.iter)
File "scancode-toolkit/src/scancode/api.py", line 43, in extract_archives
from extractcode.extract import extract
File "scancode-toolkit/src/extractcode/extract.py", line 37, in <module>
from extractcode import archive
File "scancode-toolkit/src/extractcode/archive.py", line 47, in <module>
from extractcode import libarchive2
File "scancode-toolkit/src/extractcode/libarchive2.py", line 91, in <module>
libarchive = load_lib()
File "scancode-toolkit/src/extractcode/libarchive2.py", line 79, in load_lib
lib = ctypes.CDLL(libarchive)
File "/usr/lib64/python2.7/ctypes/__init__.py", line 357, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libbz2.so.1.0: cannot open shared object file: No such file or directory
It starts working correctly if I run this:
sudo ln -s /usr/lib64/libbz2.so.1 /usr/lib64/libbz2.so.1.0
Hi,
I'm running into a problem with certain .lz4 and also .jar files. Example (lz4):
$:~/SCAN_IMAGES/release-1.13.zip-extract$ ~/scancode-toolkit/extractcode ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
Extracting archives...
[####################] 4
ERROR extracting: /home/joe/SCAN_IMAGES/release-1.13.zip-extract/release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4: Unrecognized archive format
Extracting done.
But the file has substance and can be decompressed using the lz4 utility:
$:~/SCAN_IMAGES/release-1.13.zip-extract$ ls -al ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
-rw-r--r-- 1 joe users 17315708 Apr 29 2023 ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -t ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
./release/deploy_art : decoded 45545571 bytes
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 --list ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
Frames Type Block Compressed Uncompressed Ratio Filename
1 LZ4Frame B4D 16.51M - - deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
$:~/SCAN_IMAGES/release-1.13.zip-extract$ lz4 -dv ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4
*** LZ4 command line interface 64-bits v1.9.3, by Yann Collet ***
Decoding file ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages
./release/deploy_art : decoded 45545571 bytes
Following is what the file header looks like:
$:~/SCAN_IMAGES/release-1.13.zip-extract$ hexdump ./release/deploy_artifacts/router.tar.gz-extract/router.tar-extract/0f6a1467d0c8a8fce8ea65eedd0d2ee6e23f979498d128a0101318c7549f90a6/layer.tar-extract/var/lib/apt/lists/deb.debian.org_debian_dists_bullseye_main_binary-amd64_Packages.lz4 | head
0000000 2204 184d 4040 cdc0 0078 f200 5003 6361
0000010 616b 6567 203a 6130 0a64 6f53 7275 0c63
0000020 f600 2008 3028 302e 322e 2e33 2d31 2935
0000030 560a 7265 6973 6e6f 203a 0015 7cf5 622b
0000040 0a31 6e49 7473 6c61 656c 2d64 6953 657a
0000050 203a 3032 3632 0a38 614d 6e69 6174 6e69
0000060 7265 203a 6544 6962 6e61 4720 6d61 7365
0000070 5420 6165 206d 703c 676b 672d 6d61 7365
0000080 642d 7665 6c65 6c40 7369 7374 612e 696c
0000090 746f 2e68 6564 6962 6e61 6f2e 6772 0a3e
The magic bytes are correct; please refer to https://github.com/lz4/lz4/blob/dev/doc/lz4_Frame_format.md
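The check can be reproduced in a few lines, as a minimal sketch. The LZ4 frame format stores the magic number 0x184D2204 in little-endian order, and plain `hexdump` prints 16-bit words, which on a little-endian host (an assumption here) shows each byte pair swapped. The helper names are hypothetical.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # stored on disk as the bytes 04 22 4D 18


def hexdump_words_to_bytes(words):
    """Convert 16-bit words as printed by plain `hexdump` back to raw bytes.

    Assumes a little-endian host, where each printed word is byte-swapped.
    """
    return b"".join(int(w, 16).to_bytes(2, "little") for w in words.split())


def has_lz4_frame_magic(header):
    """Return True if `header` starts with the LZ4 frame magic number."""
    if len(header) < 4:
        return False
    (magic,) = struct.unpack("<I", header[:4])
    return magic == LZ4_FRAME_MAGIC


# First four words of the hexdump output above.
header = hexdump_words_to_bytes("2204 184d 4040 cdc0")
print(header[:4].hex(), has_lz4_frame_magic(header))  # prints: 04224d18 True
```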
Why can lz4 decode it properly but extractcode cannot?
Regards,
Matthias
ERROR extracting: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib
/tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6qml.abi3.lib
Open ERROR: Can not open the file as [Ar] archive
ERRORS:
Is not archive
ERROR extracting: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib: /tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib
/tmp/pyside/PySide6_Essentials-6.3.0-cp36-abi3-win_amd64.whl-extract/PySide6/pyside6.abi3.lib
Open ERROR: Can not open the file as [Ar] archive
ERRORS:
Is not archive
Extracting done.
D:/apptest/nqi/nia-npa-workflow-sl_master_20200218180521_5860907_B72.n0063.tar.gz-extract/nia-npa-workflow-sl/nia-npa-workflow.tar.gz-extract/nia-npa-workflow-1.0-SNAPSHOT.jar: [(u'd
\scancode-toolkit-3.0.0\tmp\scancode-tk-3.0.0-aerdzt\scancode-extract-c__kic\com\codahale\metrics\InstrumentedScheduledExecutorService$InstrumentedCallable.class', u'D:\apptest\nqi\
workflow-sl_master_20200218180521_5860907_B72.n0063.tar.gz-extract\nia-npa-workflow-sl\nia-npa-workflow.tar.gz-extract\nia-npa-workflow-1.0-SNAPSHOT.jar-extract\com\codahale\metrics\dScheduledExecutorService$InstrumentedCallable.class', u"[Errno 2] No such file or directory:
These are not super common but they are supported by the latest libarchive
The extractcode doc at https://scancode-toolkit.readthedocs.io/en/stable/tutorials/how_to_extract_archives.html doesn't mention the "--ignore" option at all. It's quite an important option for avoiding wasted time on unnecessary files and for preventing extractcode from falling over when it encounters an invalid or corrupt archive file that isn't required.
When documenting this flag, it'd be helpful to explain the interaction between the extractcode --ignore and the scancode parameter of the same name. Specifically, having just spent several hours adding debug statements to the source code to understand why my extractcode --ignore globs weren't working, the piece of info that would really help is that the extractcode ignores do NOT apply to paths within the archives (e.g. my-archive.tar/tests/foo is extracted even if I use extractcode --ignore=*/tests/*) but only to the decision about which archives to unpack.
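The behaviour described above can be illustrated with a small sketch. It is an assumption-laden model built on the standard `fnmatch` module, not extractcode's actual matching code, and the function name `archives_to_extract` is hypothetical: the ignore globs are tested against the path of each archive, never against the members inside it.

```python
from fnmatch import fnmatchcase


def archives_to_extract(archive_paths, ignore_patterns):
    """Keep only the archives whose own path matches none of the globs."""
    return [
        path
        for path in archive_paths
        if not any(fnmatchcase(path, pattern) for pattern in ignore_patterns)
    ]


candidates = ["src/my-archive.tar", "src/tests/fixtures.zip"]
print(archives_to_extract(candidates, ["*/tests/*"]))
# prints: ['src/my-archive.tar']
```

Here src/tests/fixtures.zip is skipped because its own path matches, while src/my-archive.tar is unpacked in full, including any tests/ directory it contains.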
(Aside: I was wondering about creating an additional feature request for applying extractcode ignores to individual files. It could make extraction a LOT faster if extractcode didn't waste time writing to-be-ignored files such as /tests/ to disk only for them to be ignored later by scancode. If you think that's a good idea, we could create an issue for that too.)
Scanning UPX-compressed executables does not make sense unless they could be unpacked first.
See https://en.wikipedia.org/wiki/UPX
For instance these PostgreSQL installers take a large amount of resources and time to scan.
And there is little to squeeze out of the raw binaries.
They are not really archives but executables, which is why they are still scanned for now.
We will need to figure out a way to avoid issues when dealing with these large binaries that cannot yield much when scanned.
Both are compressed with UPX, which makes their binaries completely opaque short of decompressing them, assuming they use a standard UPX compressor.
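As a rough heuristic sketch for spotting such files before scanning: executables packed by a standard UPX build normally embed the ASCII marker "UPX!" in their pack header. The function name is hypothetical, the marker can be stripped or altered by modified packers, and an unrelated file may contain the same four bytes, so a match is only a hint.

```python
UPX_MARKER = b"UPX!"


def looks_upx_packed(data):
    """Return True if the raw bytes contain the standard UPX marker.

    Heuristic only: absence does not prove the file is unpacked, and
    presence does not prove it was produced by a standard UPX build.
    """
    return UPX_MARKER in data
```

A caller could read the first few kilobytes of an executable and skip or deprioritize the scan when the check returns True.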
There are some .pkg files that 7zip is able to extract while extractcode fails to do so.
Note that I have already run with --all-formats.
See https://github.com/wummel/patool/tree/master/tests/data for more tests cases.
A .lpkg file is a compressed file format that contains .jar files to be deployed to Liferay DXP. See also https://help.liferay.com/hc/en-us/articles/360018159991-Overriding-lpkg-files .
After scanning the folder 'moby-20.10.5' I ran into the following error:
There should be enough space left on the device.
Scan folder 'moby-20.10.5' with the following function.
Note that max_depth=0, so that there are no limitations.
For bug reports, it really helps us to know:
See http://ftp.gnu.org/gnu/gmp/gmp-5.1.3.tar.lz
These are not very frequent and in most cases there are gz or bz2 alternatives.
To consider at some point
We should be able to extract VMDK, VDI and similar qcow images, as well as ext2, ext3 and ext4 (and ideally some squashfs too?)
These are used with PHP, for instance with composer.
Extracting a patch as if it were an archive is seldom used and rarely useful.
We should drop this
This replacement is causing an issue with how Debian system package resources are found and associated in the scancode.io docker pipeline. Some Debian .list files have : in their names to separate the architecture from the package name, e.g. libc6:amd64.list. However, extractcode extracts this file as libc6_amd64.list. The code run in the docker pipeline tries to find the original, unmodified name (libc6:amd64.list), so the pipeline does not find the declared resources of a package from the .list file, because it was extracted as libc6_amd64.list.
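One possible consumer-side workaround is to look up both the original name and the renamed one. The sketch below assumes, based only on the example above, that the sole change is : becoming _; the helper name find_extracted is hypothetical and is not part of extractcode or scancode.io.

```python
import os
import tempfile


def find_extracted(directory, original_name):
    """
    Return the path of original_name inside directory, falling back to the
    name with ':' replaced by '_' (assumed renaming). Return None if neither
    exists.
    """
    for candidate in (original_name, original_name.replace(":", "_")):
        path = os.path.join(directory, candidate)
        if os.path.exists(path):
            return path
    return None


# Usage: simulate an extracted directory holding the renamed file.
info_dir = tempfile.mkdtemp()
with open(os.path.join(info_dir, "libc6_amd64.list"), "w") as f:
    f.write("/lib/x86_64-linux-gnu/libc.so.6\n")

found = find_extracted(info_dir, "libc6:amd64.list")
missing = find_extracted(info_dir, "zlib1g:amd64.list")
```

This only papers over the renaming on the lookup side; it cannot tell apart two files whose names differ only by : versus _.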
When using the new commoncode release 21.1.14, calling extractcode fails.
This was not the case with commoncode release 20.10.20.
Error message from extractcode
Traceback (most recent call last):
File "/app/venv-extractcode/bin/extractcode", line 5, in <module>
from extractcode.cli import extractcode
File "/app/venv-extractcode/lib/python3.6/site-packages/extractcode/__init__.py", line 39, in <module>
from commoncode.fileutils import fsencode
ImportError: cannot import name 'fsencode'
# Workdir is /app
pip install virtualenv \
&& virtualenv -p /usr/local/bin/python3.6 venv-extractcode \
&& . venv-extractcode/bin/activate \
&& pip install extractcode
virtualenv -p /usr/local/bin/python3.6 venv-extractcode
. venv-extractcode/bin/activate
/app/venv-extractcode/bin/extractcode
Error also occurs with python 3.9.1.
Running on Linux in python:3.6 docker container.
Release 20.10 of extractcode.
Installed with pip.
Cannot extract lz4 file... Found in nexB/scancode.io#827 (comment)
$ cat foobar
dadsasd
$ lz4 foobar
$ extractcode --all-formats foobar.lz4
Extracting archives...
[####################] 4
WARNING extracting: foobar.lz4: 'dadsasd':
Missing type keyword in mtree specification
Extracting done.
$ ll foobar.lz4-extract/
total 8
drwxrwxr-x 2 pombreda pombreda 4096 Aug 1 15:52 ./
drwxrwxr-x 16 pombreda pombreda 4096 Aug 1 15:52 ../
-rw-rw-r-- 1 pombreda pombreda 0 Jan 1 1970 dadsasd
The file is empty!
So you will be able to specify the interface and maybe reuse some code.
A metaclass can do the work of auto-registration into the registry.
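As a rough illustration of the idea, a metaclass can record each extractor class when it is defined. All names here (ExtractorRegistry, BaseExtractor, ZipExtractor, get_extractor, the extensions attribute) are hypothetical and do not reflect extractcode's actual code.

```python
import zipfile


class ExtractorRegistry(type):
    """Metaclass that registers every class declaring file extensions."""

    registry = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # Only extensions declared directly on this class are registered,
        # so the abstract base (empty tuple) adds nothing.
        for ext in namespace.get("extensions", ()):
            ExtractorRegistry.registry[ext] = cls


class BaseExtractor(metaclass=ExtractorRegistry):
    """Hypothetical common interface shared by all extractors."""

    extensions = ()

    def extract(self, location, target_dir):
        raise NotImplementedError


class ZipExtractor(BaseExtractor):
    extensions = (".zip", ".jar")

    def extract(self, location, target_dir):
        with zipfile.ZipFile(location) as zf:
            zf.extractall(target_dir)


def get_extractor(location):
    """Return an extractor instance matching the file extension, or None."""
    lowered = location.lower()
    for ext, cls in ExtractorRegistry.registry.items():
        if lowered.endswith(ext):
            return cls()
    return None
```

With this layout, adding a new format is a matter of defining one more subclass; no central list needs to be edited.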
A text file with this content followed by an LF is wrongly reported as a zip file by the latest libmagic 5.22
80de10a8b9f13365de8cc4bbf8efec5e /etc/rsyslog.d/50-default.conf
This triggers some extraction error.
extractcode does not extract links by design, but this is proving to be limiting in some cases, in particular when extracting package archives that do contain links, where the absence of such links could eventually lead to partial conclusions on the origin or license of the package.
❯ extractcode --all-formats libmediainfo-0.7.43.diff
Extracting archives...
[####################] 4
ERROR extracting: ./libmediainfo-0.7.43.diff: sequence item 0: expected str instance, bytes found
Extracting done.
./configure --dev
fails with
ERROR: Could not find a version that satisfies the requirement typecode[full]>=30.0.0; extra == "full" (from extractcode[full,patch,testing]) (from versions: 20.9, 20.9.28, 20.9.29, 20.10, 20.10.7, 20.10.12, 20.10.20, 21.1.8.1, 21.1.9.1, 21.1.21, 21.2.24, 21.5.31, 21.6.1, 30.0.0)
ERROR: No matching distribution found for typecode[full]>=30.0.0; extra == "full"
on fresh checkout of 1c64a75.
It would be useful (and a prep for nexB/scancode-toolkit#14) to be able to extract a single path or a list of paths from a given archive rather than everything every time.
In particular this would allow smarter extracts of only specific files (such as metadata from package archives) when needed and would speed up some scans (and use less disk)
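For zip archives, the standard library already allows this kind of selective extraction. The following is only a sketch of the idea using zipfile; the function name extract_paths and the sample archive layout are made up for illustration and are not an extractcode API.

```python
import os
import tempfile
import zipfile


def extract_paths(archive_location, wanted_paths, target_dir):
    """
    Extract only the zip members listed in wanted_paths into target_dir.
    Return the list of member paths that were actually extracted.
    """
    wanted = set(wanted_paths)
    extracted = []
    with zipfile.ZipFile(archive_location) as zf:
        for info in zf.infolist():
            if info.filename in wanted:
                zf.extract(info, path=target_dir)
                extracted.append(info.filename)
    return extracted


# Usage: build a small archive, then pull out a single metadata file.
work_dir = tempfile.mkdtemp()
archive = os.path.join(work_dir, "package.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    zf.writestr("lib/big-binary.bin", "x" * 1024)

target = os.path.join(work_dir, "out")
extracted = extract_paths(archive, ["META-INF/MANIFEST.MF"], target)
```

Only the requested member is written to disk, which is where the disk and time savings would come from.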
JMODs or Java Modules are zip files with a modified magic: instead of starting with the zip header 50 4B 03 04, they start with 4A 4D (i.e. JM), then 01 00, and then the zip header 50 4B 03 04.
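The byte layout described above can be checked with a few lines of Python. This is a minimal sketch, not extractcode's implementation: the function names are hypothetical, the sample file is synthetic, and it assumes that the data after the 4-byte prefix is an ordinary zip that can be read once the prefix is dropped.

```python
import io
import zipfile

JMOD_MAGIC = b"JM\x01\x00"  # 4A 4D 01 00
ZIP_MAGIC = b"PK\x03\x04"   # 50 4B 03 04


def is_jmod(data):
    """Return True if data starts with the JMOD prefix and then a zip header."""
    return data[:4] == JMOD_MAGIC and data[4:8] == ZIP_MAGIC


def open_jmod_as_zip(data):
    """Return a ZipFile over the JMOD bytes with the 4-byte prefix removed."""
    if not is_jmod(data):
        raise ValueError("not a JMOD file")
    return zipfile.ZipFile(io.BytesIO(data[4:]))


# Usage: build a synthetic JMOD by prepending the prefix to a small zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("classes/module-info.class", b"demo")
plain_zip = buf.getvalue()
fake_jmod = JMOD_MAGIC + plain_zip

names = open_jmod_as_zip(fake_jmod).namelist()
```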
See also
Environment:
Summary
FAILED tests/test_archive.py::TestGetExtractorTest::test_get_extractor_qcow2
FAILED tests/test_archive.py::TestRar::test_extract_rar_with_trailing_data - ...
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_take_an_empty_directory
FAILED tests/test_extractcode_cli.py::test_extractcode_command_does_extract_verbose
FAILED tests/test_extractcode_cli.py::test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not
FAILED tests/test_extractcode_cli.py::test_extractcode_command_works_with_relative_paths
FAILED tests/test_extractcode_cli.py::test_extractcode_command_works_with_relative_paths_verbose
FAILED tests/test_extractcode_cli.py::test_usage_and_help_return_a_correct_script_name_on_all_platforms
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_archive_with_unicode_names_verbose
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_archive_with_unicode_names
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_shallow
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_ignore - A...
FAILED tests/test_extractcode_cli.py::test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives
FAILED tests/test_extractcode_cli.py::test_extractcode_command_can_extract_nuget
Details:
=================================== FAILURES ===================================
________________ TestGetExtractorTest.test_get_extractor_qcow2 _________________
self = <test_archive.TestGetExtractorTest testMethod=test_get_extractor_qcow2>
def test_get_extractor_qcow2(self):
test_file = self.extract_test_tar('vmimage/foobar.qcow2.tar.gz')
test_file = str(Path(test_file) / 'foobar.qcow2')
expected = []
self.check_get_extractors(test_file, expected, kinds=extractcode.default_kinds)
expected = [archive.extract_vm_image]
> self.check_get_extractors(test_file, expected, kinds=())
tests/test_archive.py:217:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_archive.TestGetExtractorTest testMethod=test_get_extractor_qcow2>
test_file = '/tmp/scancode-tk-tests -uejk8_7m/u6e9wzs8/foobar.qcow2.tar.gz/foobar.qcow2'
expected = [<function extract at 0x7f702e883c40>], kinds = ()
def check_get_extractors(self, test_file, expected, kinds=()):
from extractcode import archive
test_loc = self.get_test_loc(test_file)
if kinds:
extractors = archive.get_extractors(test_loc, kinds)
else:
extractors = archive.get_extractors(test_loc)
fe = fileutils.file_extension(test_loc).lower()
em = ', '.join(e.__module__ + '.' + e.__name__ for e in extractors)
msg = ('%(expected)r == %(extractors)r for %(test_file)s\n'
'with fe:%(fe)r, em:%(em)s' % locals())
> assert expected == extractors, msg
E AssertionError: [<function extract at 0x7f702e883c40>] == [] for /tmp/scancode-tk-tests -uejk8_7m/u6e9wzs8/foobar.qcow2.tar.gz/foobar.qcow2
E with fe:'.qcow2', em:
E assert [<function ex...7f702e883c40>] == []
E Left contains one more item: <function extract at 0x7f702e883c40>
E Use -v to get more diff
tests/extractcode_assert_utils.py:166: AssertionError
_________________ TestRar.test_extract_rar_with_trailing_data __________________
self = <test_archive.TestRar testMethod=test_extract_rar_with_trailing_data>
def test_extract_rar_with_trailing_data(self):
test_file = self.get_test_loc('archive/rar/rar_trailing.rar')
test_dir = self.get_temp_dir()
expected = Exception('Unknown error')
> self.assertRaisesInstance(expected, archive.extract_rar, test_file, test_dir)
tests/test_archive.py:1693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <test_archive.TestRar testMethod=test_extract_rar_with_trailing_data>
excInstance = Exception('Unknown error')
callableObj = <function extract at 0x7f702e881620>
args = ('/builddir/build/BUILD/extractcode-31.0.0/tests/data/archive/rar/rar_trailing.rar', '/tmp/scancode-tk-tests -uejk8_7m/h4a951m9')
kwargs = {}, excClass = <class 'Exception'>, excName = 'Exception'
def assertRaisesInstance(self, excInstance, callableObj, *args, **kwargs):
"""
This assertion accepts an instance instead of a class for refined
exception testing.
"""
kwargs = kwargs or {}
excClass = excInstance.__class__
try:
callableObj(*args, **kwargs)
except excClass as e:
assert str(e).startswith(str(excInstance))
else:
if hasattr(excClass, '__name__'):
excName = excClass.__name__
else:
excName = str(excClass)
> raise self.failureException('%s not raised' % excName)
E AssertionError: Exception not raised
tests/extractcode_assert_utils.py:184: AssertionError
_____________ test_extractcode_command_can_take_an_empty_directory _____________
def test_extractcode_command_can_take_an_empty_directory():
test_dir = test_env.get_temp_dir()
> result = run_extract([test_dir], expected_rc=0)
tests/test_extractcode_cli.py:64:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['/tmp/scancode-tk-tests -uejk8_7m/fna7ftnv'], expected_rc = 0
cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
________________ test_extractcode_command_does_extract_verbose _________________
def test_extractcode_command_does_extract_verbose():
test_dir = test_env.get_test_loc('cli/extract', copy=True)
> result = run_extract(['--verbose', test_dir], expected_rc=1)
tests/test_extractcode_cli.py:72:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/bbuw3oy4/extract']
expected_rc = 1, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
_ test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not _
def test_extractcode_command_always_shows_something_if_not_using_a_tty_verbose_or_not():
test_dir = test_env.get_test_loc('cli/extract/some.tar.gz', copy=True)
> result = run_extract(options=['--verbose', test_dir], expected_rc=0)
tests/test_extractcode_cli.py:91:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/36o7wnm7/some.tar.gz']
expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
______________ test_extractcode_command_works_with_relative_paths ______________
def test_extractcode_command_works_with_relative_paths():
# The setup is complex because we want to have a relative dir to the base
# dir where we run tests from, i.e. the git checkout dir To use relative
# paths, we use our tmp dir at the root of the code tree
from os.path import join
from commoncode import fileutils
import extractcode
import tempfile
import shutil
try:
test_file = test_env.get_test_loc('cli/extract_relative_path/basic.zip')
project_tmp = join(project_root, 'tmp')
fileutils.create_dir(project_tmp)
temp_rel = tempfile.mkdtemp(dir=project_tmp)
assert os.path.exists(temp_rel)
relative_dir = temp_rel.replace(project_root, '').strip('\\/')
shutil.copy(test_file, temp_rel)
test_src_file = join(relative_dir, 'basic.zip')
test_tgt_dir = join(project_root, test_src_file) + extractcode.EXTRACT_SUFFIX
> result = run_extract([test_src_file], expected_rc=0, cwd=project_root)
tests/test_extractcode_cli.py:124:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['tmp/tmpiu9t_zt7/basic.zip'], expected_rc = 0
cwd = '/builddir/build/BUILD/extractcode-31.0.0'
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
__________ test_extractcode_command_works_with_relative_paths_verbose __________
def test_extractcode_command_works_with_relative_paths_verbose():
# The setup is a tad complex because we want to have a relative dir
# to the base dir where we run tests from, i.e. the git checkout dir
# To use relative paths, we use our tmp dir at the root of the code tree
from os.path import join
from commoncode import fileutils
import tempfile
import shutil
try:
project_tmp = join(project_root, 'tmp')
fileutils.create_dir(project_tmp)
test_src_dir = tempfile.mkdtemp(dir=project_tmp).replace(project_root, '').strip('\\/')
test_file = test_env.get_test_loc('cli/extract_relative_path/basic.zip')
shutil.copy(test_file, test_src_dir)
test_src_file = join(test_src_dir, 'basic.zip')
> result = run_extract(['--verbose', test_src_file] , expected_rc=0)
tests/test_extractcode_cli.py:158:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--verbose', 'tmp/tmpc_z4ga7z/basic.zip'], expected_rc = 0
cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
______ test_usage_and_help_return_a_correct_script_name_on_all_platforms _______
def test_usage_and_help_return_a_correct_script_name_on_all_platforms():
options = ['--help']
> result = run_extract(options , expected_rc=0)
tests/test_extractcode_cli.py:177:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--help'], expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
___ test_extractcode_command_can_extract_archive_with_unicode_names_verbose ____
def test_extractcode_command_can_extract_archive_with_unicode_names_verbose():
test_dir = test_env.get_test_loc('cli/unicodearch', copy=True)
> result = run_extract(['--verbose', test_dir] , expected_rc=0)
tests/test_extractcode_cli.py:195:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/2rcvcl3l/unicodearch']
expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
_______ test_extractcode_command_can_extract_archive_with_unicode_names ________
def test_extractcode_command_can_extract_archive_with_unicode_names():
test_dir = test_env.get_test_loc('cli/unicodearch', copy=True)
> run_extract([test_dir] , expected_rc=0)
tests/test_extractcode_cli.py:213:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['/tmp/scancode-tk-tests -uejk8_7m/3a9hxexv/unicodearch']
expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
_________________ test_extractcode_command_can_extract_shallow _________________
def test_extractcode_command_can_extract_shallow():
test_dir = test_env.get_test_loc('cli/extract_shallow', copy=True)
> run_extract(['--shallow', test_dir] , expected_rc=0)
tests/test_extractcode_cli.py:230:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--shallow', '/tmp/scancode-tk-tests -uejk8_7m/n1bkdpux/extract_shallow']
expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
_____________________ test_extractcode_command_can_ignore ______________________
def test_extractcode_command_can_ignore():
test_dir = test_env.get_test_loc('cli/extract_ignore', copy=True)
> run_extract(['--ignore', '*.tar', test_dir] , expected_rc=0)
tests/test_extractcode_cli.py:248:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--ignore', '*.tar', '/tmp/scancode-tk-tests -uejk8_7m/bs_94w2l/extract_ignore']
expected_rc = 0, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
_ test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives _
def test_extractcode_command_does_not_crash_with_replace_originals_and_corrupted_archives():
test_dir = test_env.get_test_loc('cli/replace-originals', copy=True)
> result = run_extract(['--replace-originals', '--verbose', test_dir] , expected_rc=1)
tests/test_extractcode_cli.py:266:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--replace-originals', '--verbose', '/tmp/scancode-tk-tests -uejk8_7m/ldztqpv3/replace-originals']
expected_rc = 1, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
__________________ test_extractcode_command_can_extract_nuget __________________
@pytest.mark.skipif(on_windows, reason='FIXME: this test fails on Windows until we have support for long file names.')
def test_extractcode_command_can_extract_nuget():
test_dir = test_env.get_test_loc('cli/extract_nuget', copy=True)
> result = run_extract(['--verbose', test_dir])
tests/test_extractcode_cli.py:283:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
options = ['--verbose', '/tmp/scancode-tk-tests -uejk8_7m/zcx7geku/extract_nuget']
expected_rc = None, cwd = None
def run_extract(options, expected_rc=None, cwd=None):
"""
Run extractcode as a plain subprocess. Return rc, stdout, stderr.
"""
bin_dir = 'Scripts' if on_windows else 'bin'
# note: this assumes that we are using a standard directory layout as set
# with the configure script
cmd_loc = os.path.join(project_root, 'venv', bin_dir, 'extractcode')
> assert os.path.exists(cmd_loc + ('.exe' if on_windows else ''))
E AssertionError: assert False
E + where False = <function exists at 0x7f7030703920>(('/builddir/build/BUILD/extractcode-31.0.0/venv/bin/extractcode' + ''))
E + where <function exists at 0x7f7030703920> = <module 'posixpath' (frozen)>.exists
E + where <module 'posixpath' (frozen)> = os.path
tests/test_extractcode_cli.py:38: AssertionError
This bug is tracked upstream at libarchive/libarchive#545
The failing test is:
https://github.com/nexB/scancode-toolkit/blob/develop/tests/extractcode/test_archive.py#L579
Some archives can contain very large files, e.g. https://github.com/gcc-mirror/gcc/releases/tag/releases%2Fgcc-9.4.0 (with test data), which contains tar files, two of which are 60 GB in size. Extractcode extracts them by default.
Is it possible to add a size limit for this kind of file, like an ignore option?
Or maybe a limit on the maximum uncompressed size of the whole archive?
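To make the request concrete, here is a sketch of how such a check could look for tar archives using only the standard library. The function names and the max_size parameter are hypothetical, extractcode has no such option as far as this issue states, and the sizes come from the tar headers, so they are declared sizes rather than verified ones.

```python
import io
import tarfile


def oversized_members(tar, max_size):
    """Yield (name, size) for regular files larger than max_size bytes."""
    for member in tar.getmembers():
        if member.isfile() and member.size > max_size:
            yield member.name, member.size


def total_uncompressed_size(tar):
    """Return the summed declared size of all regular files in the archive."""
    return sum(m.size for m in tar.getmembers() if m.isfile())


# Usage: build a small in-memory tar, then apply a 1 KiB per-file limit.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, size in (("small.txt", 10), ("big.bin", 5000)):
        info = tarfile.TarInfo(name)
        info.size = size
        tar.addfile(info, io.BytesIO(b"\0" * size))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    too_big = list(oversized_members(tar, max_size=1024))
    total = total_uncompressed_size(tar)
```

A per-file limit would let the rest of the archive extract normally, while a whole-archive limit would reject it up front; the two options in the request are complementary.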
If extractcode hits errors while extracting and then tries to clean up, it fails with a FileNotFoundError exception.
For example, again with the gcc issue6550.gz archive.
Here is the output without --replace-originals:
λ extractcode --verbose D:\test-extractcode\test.zip
Extracting archives...
Extracting: test.zip
Extracting: test.zip
Extracting: zweite_ebene.zip
Extracting: issue6550.gz
ERROR extracting: D:/test-extractcode/test.zip-extract/zweite_ebene.zip-extract/issue6550.gz: Error -3 while decompressing data: too many length or distance symbols
Extracting done.
And with --replace-originals:
λ extractcode --verbose --replace-originals D:\test-extractcode\test.zip
Extracting archives...
Extracting: test.zip
Extracting: test.zip
Extracting: zweite_ebene.zip
Extracting: issue6550.gz
Extracting: issue6550.gz
Traceback (most recent call last):
File "C:\WS\tools\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\WS\tools\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\WS\tools\scancode-toolkit-21.3.31\Scripts\extractcode.exe\__main__.py", line 7, in <module>
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\commoncode\cliutils.py", line 87, in main
**extra,
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 697, in main
rv = self.invoke(ctx)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\core.py", line 535, in invoke
return callback(*args, **kwargs)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\decorators.py", line 17, in new_func
return f(get_current_context(), *args, **kwargs)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\cli.py", line 184, in extractcode
for xev in extraction_events:
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\click\_termui_impl.py", line 259, in next
rv = next(self.iter)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\api.py", line 42, in extract_archives
ignore_pattern=ignore_pattern
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\extractcode\extract.py", line 142, in extract
fileutils.copytree(target, source)
File "c:\ws\tools\scancode-toolkit-21.3.31\lib\site-packages\commoncode\fileutils.py", line 403, in copytree
names = os.listdir(src)
FileNotFoundError: [WinError 3] Das System kann den angegebenen Pfad nicht finden: 'D:\\test-extractcode\\test.zip-extract\\zweite_ebene.zip-extract\\issue6550.gz-extract'
This can be fixed in extract.py, in the extract method, by skipping the file move when an event has errors.
I mean here, for example:
for event in extract_events:
yield event
if replace_originals:
processed_events_append(event)
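The suggested change could look roughly like the sketch below. It is not the actual extract.py code: the ExtractEvent tuple is a stand-in with assumed field names, and collect_replaceable is a hypothetical helper that only shows the extra condition on event.errors.

```python
from collections import namedtuple

# Stand-in for extractcode's event objects; the field names are assumptions.
ExtractEvent = namedtuple("ExtractEvent", "source target done warnings errors")


def collect_replaceable(extract_events, replace_originals, processed_events):
    """
    Yield every event unchanged. When replace_originals is set, remember only
    the events without errors, so that failed extractions are never moved.
    """
    for event in extract_events:
        yield event
        if replace_originals and not event.errors:
            processed_events.append(event)


# Usage: one successful extraction and one failing like issue6550.gz.
events = [
    ExtractEvent("ok.zip", "ok.zip-extract", True, [], []),
    ExtractEvent(
        "issue6550.gz",
        "issue6550.gz-extract",
        True,
        [],
        ["Error -3 while decompressing data"],
    ),
]
processed = []
seen = list(collect_replaceable(events, True, processed))
```

The failed archive is still reported to the caller, but it is left out of the list used later for replacing originals, so the missing -extract directory is never touched.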
At the moment the extractcode lib extracts CAB files all right, but does not understand the underlying structure of the files. You end up with a pile of files named after some hash or UUID and not the real file names as they would be installed.
The format is more or less documented here: https://msdn.microsoft.com/en-us/library/bb417343.aspx
https://github.com/n3k/PyCAB seems to implement some code to handle this in pure Python
Wine has an implementation of cabarc, which likely runs only under Wine.
cabextract is a portable, standalone extractor: http://cabextract.org.uk/