pycompression / python-isal
Faster zlib and gzip compatible compression and decompression by providing Python bindings for the ISA-L library.
License: Other
Is Z_FINISH equal to ISAL_FULL_FLUSH? Z_FULL_FLUSH != ISAL_FULL_FLUSH. Does ISAL_SYNC_FLUSH correspond to Z_SYNC_FLUSH?
Release checklist
- Set the version in CHANGELOG.rst and src/isal/__init__.py to the stable version.
- Merge into the main branch.
- Merge the main branch back into develop.
- Bump to the next development version (in setup.py and src/isal/__init__.py).
(Applies to releases such as 1.1.1 and 1.2.0, but not 1.1.0.)
@animalize made a more intelligent output buffer for Python. It uses Python lists internally.
The old buffer simply uses a PyMem_Resize call. This has the disadvantage of reallocating the memory every time it is called. When the buffer is grown this way, the data at the beginning is copied many times over, which means that using larger buffer sizes runs into a limit at some point.
The blocks_output_buffer instead creates a Python list holding all the buffer blocks, each allocated only once. At the end these blocks are joined together, requiring only one memcpy call per block. This is not only theoretically faster; it turns out to be faster in practice as well.
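A rough Python analogue of the two strategies (illustrative only; the real buffers are implemented in C, and these function names are made up):

```python
import io

def read_all_resizing(stream, chunk_size=8192):
    # Old approach: one growing buffer. Growing it can reallocate and
    # copy everything written so far, over and over.
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)
        buf += chunk  # may reallocate and copy the whole buffer

def read_all_blocks(stream, chunk_size=8192):
    # Blocks approach: collect blocks in a list and join once at the end,
    # so each block's bytes are copied exactly once (by the final join).
    blocks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return b"".join(blocks)
        blocks.append(chunk)

data = b"spam and eggs " * 4096
assert read_all_resizing(io.BytesIO(data)) == data
assert read_all_blocks(io.BytesIO(data)) == data
```

The second variant is what the blocks_output_buffer idea boils down to: growth never copies existing data, only the final join does.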
Could you upload 3.11 compatible wheels to PyPI?
The stdlib gzip and zlib tests need to be replicated to be sure this library works correctly in all use cases.
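One possible shape for such a test, sketched here with the stdlib's zlib on both sides; in the real suite isal_zlib would take one side, so that data compressed by one library is decompressed by the other:

```python
import zlib

def roundtrip_compatible(compress_mod, decompress_mod, data, level=6):
    # Compress with one module and decompress with the other; both must
    # agree for the libraries to be drop-in compatible.
    compressed = compress_mod.compress(data, level)
    return decompress_mod.decompress(compressed) == data

data = b"The quick brown fox jumps over the lazy dog. " * 100
# With python-isal installed, isal.isal_zlib would be paired with zlib
# here, in both directions.
assert roundtrip_compatible(zlib, zlib, data)
```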
When using setuptools properly, Python will segfault after decompression.
Ironically, this happens upon closing Python. The decompression itself is fine and even finishes; the md5sum of the output file checks out. Only when Python is closed does it give a segfault.
Hello @rhpvorderman,
Some months ago I reported a bug in the decompression of gzip files (#60), and today, while using cutadapt on a different computer, it happened again. I remembered the previous time and checked the compressed files, and found that "zcat" and "gzip -t" were not giving any errors, so I suspected isal.
On my personal computer I have version 0.8.1 installed, which works fine with the files. So, without changing anything else, I tried installing the subsequent isal versions one by one, and found that the files are decompressed fine with isal versions 0.9.0 and 0.10.0, but decompression breaks on the latest version, 0.11.0:
fossandon@ubuntu:~/Documents/download$ pip3 install isal==0.11.0
Defaulting to user installation because normal site-packages is not writeable
Collecting isal==0.11.0
Using cached isal-0.11.0-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Installing collected packages: isal
Attempting uninstall: isal
Found existing installation: isal 0.10.0
Uninstalling isal-0.10.0:
Successfully uninstalled isal-0.10.0
Successfully installed isal-0.11.0
fossandon@ubuntu:~/Documents/download$ cutadapt -a "AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20" -A "TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16" --pair-adapters --pair-filter any --cores 2 --output 136727_R1.fastq --paired-output 136727_R2.fastq 136727_S159_L001_R1_001.fastq.gz 136727_S159_L001_R2_001.fastq.gz
This is cutadapt 3.4 with Python 3.6.9
Command line parameters: -a AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20 -A TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16 --pair-adapters --pair-filter any --cores 2 --output 136727_R1.fastq --paired-output 136727_R2.fastq 136727_S159_L001_R1_001.fastq.gz 136727_S159_L001_R2_001.fastq.gz
Processing reads on 2 cores in paired-end mode ...
ERROR: Traceback (most recent call last):
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 556, in run
dnaio.read_paired_chunks(f, f2, self.buffer_size)):
File "/home/fossandon/Documents/Github_repos/dnaio/src/dnaio/chunks.py", line 118, in read_paired_chunks
bufend1 = f.readinto(memoryview(buf1)[start1:]) + start1 # type: ignore
File "/usr/lib/python3.6/gzip.py", line 276, in read
return self._buffer.read(size)
File "/usr/lib/python3.6/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/home/fossandon/.local/lib/python3.6/site-packages/isal/igzip.py", line 265, in read
self._read_eof()
File "/usr/lib/python3.6/gzip.py", line 501, in _read_eof
hex(self._crc)))
OSError: CRC check failed 0x8b1f001a != 0xd2f5dc20
ERROR: Traceback (most recent call last):
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 626, in run
raise e
OSError: CRC check failed 0x8b1f001a != 0xd2f5dc20
Traceback (most recent call last):
File "/home/fossandon/.local/bin/cutadapt", line 8, in <module>
sys.exit(main_cli())
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 848, in main_cli
main(sys.argv[1:])
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 913, in main
stats = r.run()
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 825, in run
raise e
OSError: CRC check failed 0x8b1f001a != 0xd2f5dc20
Inspecting the changes in the last release, I found that a couple of lines added in the 0.8.1 fix were modified:
Could it be that the modification caused a regression?
I shared the file pair that caused the error in this folder, so you can reproduce it on your end:
https://drive.google.com/drive/folders/1iOqvXbDQQd8NDtnZhzutmOxx4wUONO-k?usp=sharing
Best regards,
('python=3.8.3 (tags/v3.8.3:6f8c832, May 13 2020, 22:20:19) [MSC v.1925 32 bit '
'(Intel)]')
'os=Windows-10-10.0.19041-SP0'
'numpy=1.21.0'
Processor: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz, 2712 Mhz, 4 Core(s), 8 Logical Processor(s)
Building wheels for collected packages: isal
Building wheel for isal (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for isal (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
running bdist_wheel
running build
running build_py
running build_ext
cythoning src/isal/isal_zlib.pyx to src/isal\isal_zlib.c
cythoning src/isal/igzip_lib.pyx to src/isal\igzip_lib.c
error: [WinError 2] The system cannot find the file specified
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for isal
Failed to build isal
ERROR: Could not build wheels for isal, which is required to install pyproject.toml-based projects
Hello @rhpvorderman,
Yesterday, the program that I and other bioinformaticians were using (cutadapt) crashed unexpectedly when trying to open some gzipped files; it was the first time something like this happened: marcelm/cutadapt#520
fossandon@ubuntu:~/Documents/download$ cutadapt -a 'AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20' -A 'TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16' --pair-adapters --pair-filter any --cores 2 --output 94477_R1.fastq --paired-output 94477_R2.fastq 94477_S175_L001_R1_001.fastq.gz 94477_S175_L001_R2_001.fastq.gz
This is cutadapt 3.3 with Python 3.6.9
Command line parameters: -a AACTTTYARCAAYGGATCTC;max_error_rate=0.1;min_overlap=20 -A TGATCCYTCCGCAGGT;max_error_rate=0.5;min_overlap=16 --pair-adapters --pair-filter any --cores 2 --output 94477_R1.fastq --paired-output 94477_R2.fastq 94477_S175_L001_R1_001.fastq.gz 94477_S175_L001_R2_001.fastq.gz
Processing reads on 2 cores in paired-end mode ...
[ 8<---------] 00:00:03 88,831 reads @ 26.0 µs/read; 2.31 M reads/minuteERROR: Traceback (most recent call last):
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 556, in run
dnaio.read_paired_chunks(f, f2, self.buffer_size)):
File "/home/fossandon/.local/lib/python3.6/site-packages/dnaio/chunks.py", line 118, in read_paired_chunks
bufend1 = f.readinto(memoryview(buf1)[start1:]) + start1 # type: ignore
File "/usr/lib/python3.6/gzip.py", line 276, in read
return self._buffer.read(size)
File "/usr/lib/python3.6/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.6/gzip.py", line 454, in read
self._read_eof()
File "/usr/lib/python3.6/gzip.py", line 501, in _read_eof
hex(self._crc)))
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4
ERROR: Traceback (most recent call last):
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 626, in run
raise e
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4
Traceback (most recent call last):
File "/home/fossandon/.local/bin/cutadapt", line 8, in <module>
sys.exit(main_cli())
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 848, in main_cli
main(sys.argv[1:])
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/__main__.py", line 913, in main
stats = r.run()
File "/home/fossandon/.local/lib/python3.6/site-packages/cutadapt/pipeline.py", line 825, in run
raise e
OSError: CRC check failed 0x88b1f != 0x6fe5d9e4
But using zcat and "gzip -t" on the files does not return any error, and they can be decompressed fine with "gzip -d". Even running the same cutadapt command in different environments (Python 3.6 and 3.8 were tested too) with the same version resulted in a crash in some environments and not in others. It took a long search and tests with a colleague until we figured out that the key difference between crashing and not crashing was the installed version of the isal dependency (the latest version is used when creating a Docker image). Versions 0.8.0 and 0.7.0 generate the CRC error, but 0.6.1 and 0.5.0 do not, so it seems the bug was introduced in 0.7.0. Keeping the intermediate dependencies the same but reverting isal to 0.6.1 allows it to work:
(last rows of cutadapt's per-adapter overview table omitted)
WARNING:
One or more of your adapter sequences may be incomplete.
Please see the detailed output above.
fossandon@ubuntu:~/Documents/temp$ pip3 list | egrep "cutadapt|dnaio|isal|xopen"
cutadapt 3.3
dnaio 0.5.0 /home/fossandon/.local/lib/python3.6/site-packages
isal 0.6.1
xopen 1.1.0
In my case, I was processing a folder where all gzipped files came from a source where they were created at the same time, but only a portion consistently crashed and the others did not. To help you have a test case, I uploaded the file pair that I was using with the cutadapt example above, so you can reproduce it on your own; I couldn't find smaller files that reproduced this error.
https://drive.google.com/drive/folders/1eTmLbd9WINctLb48pzn57_Ohp1amwZah?usp=sharing
Best regards,
This is due to the upstream functions. Investigate the issue and report the bugs upstream.
I'm trying to install isal with pip on my MacBook Pro 2021 (M1 Pro chip, macOS Monterey 12.1).
The installation fails because of multiple errors. The first one being:
In file included from erasure_code/aarch64/ec_aarch64_dispatcher.c:29:
/var/folders/yh/q11mmnrx4bj7m21b0l255mg40000gn/T/tmpmnogoka1/include/aarch64_multibinary.h:34:10: fatal error: 'asm/hwcap.h' file not found
#include <asm/hwcap.h>
^~~~~~~~~~~~~
1 error generated.
After that there are a bunch of other errors. I've added the full log as an attachment.
Is there anything I can do about that?
Cheers
Marco
Once python/cpython#101251 is merged, the code can be ported to these bindings to upgrade the performance.
Yes, that is more than CPython's gzip module. No, that does not say anything about quality. But the number should be as high as possible. It would be a shame if 1.0.0 only lived for a few days until 1.0.1 had to come along.
We are copying the isal source to TMPDIR and running autogen.sh inside TMPDIR, but if TMPDIR is mounted as noexec, autogen.sh will fail with Error 13 (Permission denied). In that specific environment, as a workaround we needed to run sudo mount -o remount,exec /tmp /tmp.
If possible, please fix setup.py to avoid this problem.
ISA-L can run on ARM64 and Python can run on ARM64, so python-isal should be able to run on ARM64 as well. Currently there is no equipment and no CI environment available to properly test this.
If someone were to provide a self-hosted GitHub runner to run ARM tests on, this issue could be tackled.
As of now it is distinct from zlibmodule.c. It is not a very efficient use of the buffer; zlibmodule.c's method, however, is slower, as implemented now on the buffer branch.
Look for opportunities to streamline this process and share it across the functions that need it.
This helps with reproducible output, as the original name and timestamp are not stored.
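For comparison, the stdlib gzip module offers the same knob: with a fixed mtime (and no original filename written), the output becomes byte-for-byte reproducible:

```python
import gzip

data = b"example payload"

# mtime=0 pins the timestamp in the gzip header; gzip.compress() never
# stores an original filename, so the two calls produce identical bytes.
first = gzip.compress(data, mtime=0)
second = gzip.compress(data, mtime=0)

assert first == second
assert gzip.decompress(first) == data
```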
Publish the documentation on readthedocs, with full installation and usage guides.
IgzipDecompressor doesn't need unconsumed_tail for its own use, but for lower-level gzip module compatibility it would be good to expose it and have it always return b""?
It should be possible to support Windows. This is a work in progress.
Current difficulties: isa-l.h is not built when compiling isa-l using nmake on Windows. This file is used in python-isal.

The goal of this issue is to begin a discussion about GIL manipulation in the library.
Currently, all the code inside the library runs in a single thread, since Python requires the GIL to be held when working with Python objects. For example, this blocks using the isal library from multiple concurrent threads, each working with its own compressed file.
Have you considered releasing the lock before executing ISA-L library functions that work with internal state (like isal_deflate)? I haven't fully checked all the details of the ISA-L library and may have missed some potential problems.
This is also one of the differences from the built-in gzip implementation, since it releases the lock before the deflate call and the CRC calculation.
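For illustration, CPython's own zlib module already releases the GIL around its C compression calls, which is what lets threads run the heavy work in parallel; the request above asks for the same treatment around ISA-L calls such as isal_deflate:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Eight 1 MB buffers; zlib.compress releases the GIL while deflating,
# so these calls can overlap on multiple cores despite running in threads.
chunks = [bytes([i]) * 1_000_000 for i in range(8)]

with ThreadPoolExecutor(max_workers=4) as pool:
    compressed = list(pool.map(zlib.compress, chunks))

assert [zlib.decompress(c) for c in compressed] == chunks
```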
This file will be created automatically in the next upstream release, once it is released.
This will make the library fully typed.
The build prefix is removed, but the temporary prefix is not. This is a result of the setup.py implementation, where libisal is only built once; otherwise it would have been trivial to remove the prefix.
The temporary install prefix is 1.4 MB. This is only an issue for installations from the source distribution.
Instead of using autoconf and automake, simply running make -f Makefile.unx also creates the necessary files for static linking. This requires a change to the script.
Do you plan to add isa-l_crypto bindings? It is the crypto counterpart of isa-l.
ISA-L_crypto includes multi-buffer optimizations for crypto algorithms and some special hash algorithms.
Do you plan to test this project on the Arm64 platform?
ISA-L for Arm64 is done now; I just want to know if you plan to test it on Arm64.
Travis CI supports the Arm64 platform. You can check isa-l/.travis.yml to see how to add Arm64 support.
See the following benchmarks between igzip and python -m isal.igzip. Both are tested in the same conda environment; therefore both use the same compile settings and the same version of isa-l.
$ hyperfine -w 3 -r 10 'python -m isal.igzip -c ~/test/big2.fastq > /dev/null'
Benchmark #1: python -m isal.igzip -c ~/test/big2.fastq > /dev/null
Time (mean ± σ): 4.894 s ± 0.028 s [User: 4.758 s, System: 0.134 s]
Range (min … max): 4.856 s … 4.949 s 10 runs
$ hyperfine -w 3 -r 10 'igzip -c ~/test/big2.fastq > /dev/null'
Benchmark #1: igzip -c ~/test/big2.fastq > /dev/null
Time (mean ± σ): 4.732 s ± 0.020 s [User: 4.520 s, System: 0.211 s]
Range (min … max): 4.699 s … 4.756 s 10 runs
$ hyperfine -w 3 -r 10 'python -m isal.igzip -cd ~/test/big2.fastq.gz > /dev/null'
Benchmark #1: python -m isal.igzip -cd ~/test/big2.fastq.gz > /dev/null
Time (mean ± σ): 3.479 s ± 0.025 s [User: 3.398 s, System: 0.080 s]
Range (min … max): 3.432 s … 3.510 s 10 runs
$ hyperfine -w 3 -r 10 'igzip -cd ~/test/big2.fastq.gz > /dev/null'
Benchmark #1: igzip -cd ~/test/big2.fastq.gz > /dev/null
Time (mean ± σ): 2.872 s ± 0.029 s [User: 2.808 s, System: 0.063 s]
Range (min … max): 2.811 s … 2.914 s 10 runs
Compression: 4.894 / 4.732 = 1.034, i.e. 3.4% overhead when using Python instead of a pure C implementation. That is quite good, especially considering the portability of python-isal (it works on Windows, where igzip does not).
Decompression: 3.479 / 2.872 = 1.211, i.e. 21.1% overhead when using Python instead of the pure C implementation. Very bad! This is probably due to the overhead of juggling the unconsumed_tail, which means a lot of bytes are converted from Python to C and back again repeatedly without result.
Can only be decompressed with zlib.decompress(wbits=15), which is odd. This should be reported upstream.
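For context, the wbits parameter tells zlib which framing to expect, so a stream that only decompresses with wbits=15 is claiming a zlib (not gzip) wrapper:

```python
import zlib

data = b"wbits demo " * 50

# wbits selects the stream framing zlib expects:
#   15 -> zlib header and trailer
#  -15 -> raw deflate, no framing
#   31 -> gzip header and trailer
co = zlib.compressobj(wbits=-15)
raw_deflate = co.compress(data) + co.flush()

# A raw stream needs wbits=-15; wbits=15 would raise zlib.error because
# there is no zlib header to parse.
assert zlib.decompress(raw_deflate, wbits=-15) == data
```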
This will have:
With compressor and decompressor objects resembling those from the bz2 and lzma modules. These can be used in igzip to reduce the overhead caused by unconsumed_tail.
igzip_lib also has many more fine-grained settings for headers and trailers, which can be used to great effect (e.g. write no header, but do write a trailer). This can be used in igzip.compress and igzip.decompress.
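The bz2/lzma-style API being referred to can be sketched with the stdlib's BZ2Decompressor: the object buffers leftover input internally (needs_input, max_length) instead of returning an unconsumed_tail that the caller must keep re-feeding:

```python
import bz2

payload = b"incremental " * 1000
compressed = bz2.compress(payload)

decomp = bz2.BZ2Decompressor()
out = []
pos = 0
while not decomp.eof:
    if decomp.needs_input:
        # Feed the next input slice only when the object asks for it.
        chunk = compressed[pos:pos + 64]
        pos += 64
    else:
        chunk = b""
    # max_length caps the output; leftover input stays inside the object.
    out.append(decomp.decompress(chunk, 256))

assert b"".join(out) == payload
```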
Currently this is limited to positive values.