martinellimarco / indexed_zstd
A bridge for libzstd-seek to python. Based on mxmlnkn/indexed_bzip2
License: MIT License
Related to #2: the issue appears to affect not just macOS but anything based on clang.
I ran into it while compiling with clang 12 on Android. The compiler complained that -std=c++11 is not valid for language C.
Forcing the selection of the Darwin configuration (by simply adding "or True" to line 12, below) made compilation succeed.
Line 12 in f4caf60
Perhaps the test shouldn't be platform based but compiler based? [If compiler is clang....]
I can confirm the same error on Debian when the compiler is set to clang instead of gcc/g++ (e.g. CC="clang-11" CXX="clang++-11" python setup.py build_ext --cython --inplace).
(I haven't had a chance to test on FreeBSD, but since the default compiler there is now clang, I would imagine the same issue exists.)
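A compiler-based test, as suggested above, could look roughly like the following. This is only a sketch: the helper names looks_like_clang and compiler_is_clang are made up for illustration and are not part of this repository's setup.py, and it assumes the build honours the CC environment variable, which may not hold for every setuptools configuration.

```python
import os
import shutil
import subprocess


def looks_like_clang(version_text: str) -> bool:
    """Return True if a compiler's `--version` output mentions clang."""
    return "clang" in version_text.lower()


def compiler_is_clang(default: str = "cc") -> bool:
    """Best-effort check of the compiler named by $CC (hypothetical helper)."""
    # $CC may carry extra flags, e.g. "clang-11 -m64"; keep only the program.
    parts = os.environ.get("CC", default).split()
    command = parts[0] if parts else default
    if shutil.which(command) is None:
        return False
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return looks_like_clang(result.stdout + result.stderr)
```

A build script could then choose its compile flags based on compiler_is_clang() instead of the platform name. Compiler wrappers such as ccache would need extra handling that this sketch does not attempt.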
When using IndexedZstdFile in conjunction with numpy deserialization (numpy.load), this leads to a crash because numpy is advertised a wrong (shorter) file size.
This code triggers the issue:
import numpy as np
import zstandard as zstd
from io import BytesIO
from indexed_zstd import IndexedZstdFile
from tempfile import NamedTemporaryFile
file = NamedTemporaryFile()
A = np.random.random((100, 100))
handler = zstd.open(file.name, "wb")
np.save(handler, A)
handler.close()
handler = IndexedZstdFile(file.name)
B = np.load(handler)
assert np.array_equal(A, B)
The resulting execution leads to
Traceback (most recent call last):
File "/tmp/a/bug.py", line 15, in <module>
B = np.load(handler)
File "/usr/lib/python3.10/site-packages/numpy/lib/npyio.py", line 440, in load
return format.read_array(fid, allow_pickle=allow_pickle,
File "/usr/lib/python3.10/site-packages/numpy/lib/format.py", line 787, in read_array
array.shape = shape
ValueError: cannot reshape array of size 9381 into shape (100,100)
This error does not occur if using IndexedZstdFileRaw instead of IndexedZstdFile.
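To narrow this down, it may help to compare the size a file object advertises with the number of bytes it actually yields. The two helpers below are a minimal diagnostic sketch with made-up names; they assume the size is obtained by seeking to the end of the file, which is a guess about what happens here, not something confirmed above.

```python
import io


def advertised_size(fileobj) -> int:
    """Size the object reports after seeking to its end."""
    position = fileobj.tell()
    fileobj.seek(0, io.SEEK_END)
    size = fileobj.tell()
    fileobj.seek(position, io.SEEK_SET)  # restore the caller's position
    return size


def readable_size(fileobj, chunk_size: int = 1 << 20) -> int:
    """Number of bytes obtained by reading the object from start to EOF."""
    position = fileobj.tell()
    fileobj.seek(0, io.SEEK_SET)
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    fileobj.seek(position, io.SEEK_SET)  # restore the caller's position
    return total
```

Calling both helpers on the IndexedZstdFile handle and on the IndexedZstdFileRaw handle should show whether the two values disagree for one of them.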
I wanted to find out whether it's complicated to provide indexed_zstd as Windows wheels, and put together a working example. As a disclaimer: my C and Windows knowledge is enough to get stuff done for me, but very far from knowing anything about best practices, and/or why some things are done the way they are.
see:
It's just a hack for now, since it assumes that you install and build it in a conda environment.
But it might be a good starting point if there's interest in providing indexed_zstd for Windows too.
Here's the test script running on Windows:
(iz) C:\Users\andreas\Development\indexed_zstd\test>python test.py
Block offset completed?: False
{0: 0, 17: 4, 32: 6, 49: 10, 78: 26}
Block offset completed?: True
Test reading letters from A to Z
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Tell is 26: 26
Compressed tell is 78: 78
Seeking back to 15
Tell is 15?: 15
Compressed tell is 49: 49
Filesize is 26? 26
Compressed tell is 49: 49
Reading a byte after EOF
Compressed tell is 78: 78
Test reading again, letters from P to Z
QRSTUVWXYZ
Testing set_block_offsets
Block offset completed?: False
{0: 0, 17: 4, 32: 6, 49: 10, 78: 26}
Block offset completed?: True
To build on windows:
git clone --recurse-submodules -b windows https://github.com/ap--/indexed_zstd.git
conda create -n iz python=3.11 zstd conda-build
conda activate iz
pip install ./indexed_zstd
I guess if someone is interested and feels more at home in the windows c-extension python world, it should be easy for them to clean this up. (We could then provide windows pkgs in conda-forge too.)
Cheers,
Andreas
The code you have here seems to explicitly need a local filename to operate on. The zstandard library, however, can act on arbitrary Python file-like objects. To me, the greatest feature of this library is the possibility of reading only the required byte ranges from remote files, so it would be great if interoperability with file-likes were possible. What do you think, is this tractable?
File "/.../anaconda3/envs/test/lib/python3.10/tarfile.py", line 2465, in __iter__
tarinfo = self.next()
File "/.../anaconda3/envs/test/lib/python3.10/tarfile.py", line 2332, in next
self.fileobj.seek(self.offset - 1)
File "indexed_zstd/indexed_zstd.pyx", line 77, in indexed_zstd._IndexedZstdFile.seek
TypeError: seek() takes exactly 2 positional arguments (1 given)
Works fine if I manually specify whence as 0.
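Until seek() accepts a single argument, a thin wrapper can supply the default on the caller's side. This is a workaround sketch only; DefaultWhenceFile is a made-up name and not part of indexed_zstd.

```python
import io


class DefaultWhenceFile:
    """Hypothetical wrapper that passes whence=0 when the caller omits it."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def seek(self, offset, whence=io.SEEK_SET):
        # Always forward both arguments to the wrapped object.
        return self._fileobj.seek(offset, whence)

    def __getattr__(self, name):
        # Only called for attributes the wrapper does not define itself,
        # so read(), tell(), close() etc. go to the wrapped object.
        return getattr(self._fileobj, name)
```

The wrapped object could then be handed to tarfile via its fileobj argument, e.g. tarfile.open(fileobj=DefaultWhenceFile(IndexedZstdFile(path))), where path is whatever file you were opening before.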
Hello @martinellimarco
I am starting work on providing conda packages for ratarmount and its dependencies.
Would you be interested in providing indexed_zstd via conda?
If yes, I could offer to create the feedstock repository for you to provide packages via conda-forge. I would also be maintaining this feedstock repository, and (of course) add you as a maintainer too.
For reference, here's the discussion in the downstream repo: mxmlnkn/ratarmount#99
Let me know what you think,
Cheers,
Andreas
Your package is a dependency of https://github.com/mxmlnkn/ratarmount, which may become a dependency of https://github.com/yt-project/yt (if yt-project/yt#3443 is merged). In the process of testing yt, we discovered that indexed_zstd does not compile on macOS (nor on Windows, but that is to be expected, isn't it?).
Unfortunately, I don't have access to a MacOS machine so I cannot debug it myself :(
# On MacOSX
pip install indexed_zstd
Actual error at build stage
[...]/indexed_zstd/libzstd-seek/zstd-seek.o -std=c++11 -O3 -DNDEBUG
error: invalid argument '-std=c++11' not allowed with 'C'
error: command '/usr/bin/clang' failed with exit code 1
Not sure if I'm doing something wrong but simply opening a file fails.
IndexedZstdFile(file_name)
File "indexed_zstd/indexed_zstd.pyx", line 135, in indexed_zstd.IndexedZstdFile.__init__
File "indexed_zstd/indexed_zstd.pyx", line 112, in indexed_zstd.IndexedZstdFileRaw.__init__
File "indexed_zstd/indexed_zstd.pyx", line 48, in indexed_zstd._IndexedZstdFile.__cinit__
TypeError: an integer is required
The README.md says:
IndexedZstdFile will only speed up seeking when there are more than one block, which sadly requires a bit of care in zstd.
Could you elaborate on the specific "bit of care" that's needed? At the zstd command-line, I see options like:
--target-compressed-block-size=# : generate compressed block of approximately targeted size
-B# : cut file into independent blocks of size # (default: no block)
...will one of these options produce the desired result? If so, what would you say is a reasonable blocksize? My only reference is bgzf which uses 65536 bytes.
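For what it's worth, one approach that might work is to compress the input in fixed-size chunks and concatenate the results, so that each chunk becomes its own independent frame. The sketch below rests on two assumptions that are not confirmed in this thread: that "block" in the README refers to independent zstd frames, and that 4 MiB is a sensible chunk size (it is an arbitrary placeholder). The function names are made up, and compress_frame stands for any callable that turns bytes into one complete compressed frame.

```python
from typing import BinaryIO, Callable, Iterator


def iter_chunks(source: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive chunks of at most chunk_size bytes until EOF."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            return
        yield chunk


def write_independent_frames(
    source: BinaryIO,
    sink: BinaryIO,
    compress_frame: Callable[[bytes], bytes],
    chunk_size: int = 4 * 1024 * 1024,  # arbitrary placeholder, not a recommendation
) -> int:
    """Compress each chunk separately and append it to sink.

    Returns the number of frames written.
    """
    frames = 0
    for chunk in iter_chunks(source, chunk_size):
        sink.write(compress_frame(chunk))
        frames += 1
    return frames
```

With the zstandard package, zstandard.ZstdCompressor().compress produces one complete frame per call and could be passed as compress_frame. Smaller chunks should give finer seek granularity at the cost of compression ratio; I have not measured where the trade-off lies.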
Hi, I'm installing ratarmount, which has indexed_zstd as a dependency. It used to build and install fine on my Intel-based MacBook Pro; however, on my M2 Max (Apple Silicon) MacBook Pro it fails with this error:
Building wheel for indexed-zstd (pyproject.toml) ... error
error: subprocess-exited-with-error
× Building wheel for indexed-zstd (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
running bdist_wheel
running build
running build_py
file indexed_zstd.py (for module indexed_zstd) not found
file indexed_zstd.py (for module indexed_zstd) not found
running build_clib
building 'zstd_zeek' library
creating build
creating build/temp.macosx-13.4-arm64-cpython-310
creating build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd
creating build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd/libzstd-seek
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/redacted/usr/include -I/redacted/usr/include -I. -c indexed_zstd/libzstd-seek/zstd-seek.c -o build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd/libzstd-seek/zstd-seek.o
In file included from indexed_zstd/libzstd-seek/zstd-seek.c:18:
indexed_zstd/libzstd-seek/zstd-seek.h:20:10: fatal error: 'zstd.h' file not found
#include <zstd.h>
^~~~~~~~
1 error generated.
error: command '/usr/bin/clang' failed with exit code 1
[end of output]
Any advice is welcome. Thanks for your great work!
Hello,
Someone had problems installing on macOS, see here.
It would be nice if there were wheels for macOS to avoid system prerequisites like zstd being installed.
I extended the GitHub workflow CI of indexed_bzip2 to also create wheels for macOS and Windows. As this repo is based on indexed_bzip2, maybe these changes could be pulled into this repo. I could try to open a MR if you want, but it might take until next weekend.
When opening an .xz file with IndexedZstdFile in order to check for exceptions, no exception is thrown; however, when trying to read one byte, it segfaults! This should be detected and an exception should be thrown.
Consider this:
gdb --args python3 -c '
from indexed_zstd import IndexedZstdFile
f = IndexedZstdFile( "tests/simple.xz" )
print( "Opened File!" )
f.read( 1 )
'
I'll get this output:
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
(gdb) r
Starting program: /usr/bin/python3 -c from\ indexed_zstd\ import\ IndexedZstdFile\;\ f\ =\ IndexedZstdFile\(\ \"tests/simple.xz\"\ \)\;\ print\(\ \"Opened\ File\!\"\ \)\;\ f.read\(\ 1\ \)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Opened File!
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff73dc7d1 in ZSTDSeek_getJumpCoordinate (sctx=sctx@entry=0x9a3ef0, uncompressedPos=1048576) at indexed_zstd/libzstd-seek/zstd-seek.c:192
192 indexed_zstd/libzstd-seek/zstd-seek.c: No such file or directory.
(gdb) bt
#0 0x00007ffff73dc7d1 in ZSTDSeek_getJumpCoordinate (sctx=sctx@entry=0x9a3ef0, uncompressedPos=1048576)
at indexed_zstd/libzstd-seek/zstd-seek.c:192
#1 0x00007ffff73dcacd in ZSTDSeek_read (sctx=0x9a3ef0, outBuffSize=1048576, outBuff=0x7ffff6dc1010) at indexed_zstd/libzstd-seek/zstd-seek.c:325
#2 ZSTDSeek_read (outBuff=0x7ffff6dc1010, outBuffSize=1048576, sctx=0x9a3ef0) at indexed_zstd/libzstd-seek/zstd-seek.c:319
#3 0x00007ffff73d4bee in ZSTDReader::read (this=<optimized out>, this=<optimized out>, outputFileDescriptor=-1, nBytesToRead=<optimized out>,
outputBuffer=<optimized out>) at indexed_zstd/ZSTDReader.hpp:197
#4 __pyx_pf_12indexed_zstd_16_IndexedZstdFile_12readinto (__pyx_v_self=0x7ffff7483f90, __pyx_v_bytes_like=<optimized out>)
at indexed_zstd/indexed_zstd.cpp:2216
#5 __pyx_pw_12indexed_zstd_16_IndexedZstdFile_13readinto (__pyx_v_self=0x7ffff7483f90, __pyx_v_bytes_like=<optimized out>)
at indexed_zstd/indexed_zstd.cpp:2152
#6 0x00000000005c4b10 in ?? ()
#7 0x00000000005f4ca1 in ?? ()
#8 0x00000000005f57c0 in PyObject_CallMethodObjArgs ()
#9 0x000000000064db5d in ?? ()
#10 0x000000000064dc32 in ?? ()
#11 0x000000000064f425 in ?? ()
#12 0x0000000000504743 in ?? ()
#13 0x000000000056b399 in _PyEval_EvalFrameDefault ()
#14 0x000000000056955a in _PyEval_EvalCodeWithName ()
#15 0x000000000068c4a7 in PyEval_EvalCode ()
#16 0x000000000067bc91 in ?? ()
#17 0x000000000067bd0f in ?? ()
#18 0x000000000067bf1f in PyRun_StringFlags ()
#19 0x000000000067dc8f in PyRun_SimpleStringFlags ()
#20 0x00000000006b60ec in Py_RunMain ()
#21 0x00000000006b63bd in Py_BytesMain ()
#22 0x00007ffff7dd60b3 in __libc_start_main (main=0x4eea30 <main>, argc=3, argv=0x7fffffffd798, init=<optimized out>, fini=<optimized out>,
rtld_fini=<optimized out>, stack_end=0x7fffffffd788) at ../csu/libc-start.c:308
#23 0x00000000005fa4de in _start ()
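As a stopgap on the caller's side, the first four bytes of the file can be checked before handing it to IndexedZstdFile. The sketch below relies on the magic numbers from the zstd frame format: a regular frame starts with the bytes 28 B5 2F FD, and a skippable frame starts with a byte in the range 50 to 5F followed by 2A 4D 18. The function names are made up, and this only inspects the header; it does not validate the rest of the file and is not how the library itself checks anything.

```python
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"      # 0xFD2FB528 stored little-endian
SKIPPABLE_SUFFIX = b"\x2a\x4d\x18"    # 0x184D2A5? stored little-endian


def starts_with_zstd_frame(header: bytes) -> bool:
    """Return True if header begins with a zstd or skippable frame magic."""
    if len(header) < 4:
        return False
    if header[:4] == ZSTD_MAGIC:
        return True
    return 0x50 <= header[0] <= 0x5F and header[1:4] == SKIPPABLE_SUFFIX


def require_zstd(path: str) -> None:
    """Raise ValueError if the file at path does not look like zstd."""
    with open(path, "rb") as f:
        header = f.read(4)
    if not starts_with_zstd_frame(header):
        raise ValueError(f"{path!r} does not start with a zstd frame")
```

Calling require_zstd("tests/simple.xz") before opening the file would raise a ValueError instead of reaching the segfault, since an .xz file starts with FD 37 7A 58 5A 00.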