
indexed_zstd's People

Contributors

ap--, martinellimarco


indexed_zstd's Issues

Not compiling w/clang (on Android & Linux)

Related to #2: the issue appears to affect not just macOS but any clang-based toolchain.

I ran into it while compiling with clang 12 on Android. Clang complained that -std=c++11 is not valid for language C.

By forcing the selection of the Darwin configuration (simply adding "or True" to line 12... below), compilation succeeded.

if platform == "darwin":

Perhaps the test shouldn't be platform based but compiler based? [If compiler is clang....]

I can confirm the same error on Debian when the compiler is set to clang instead of gcc/g++ (e.g. CC="clang-11" CXX="clang++-11" python setup.py build_ext --cython --inplace).

(I haven't had a chance to test on FreeBSD but since the default compiler is now clang, I would imagine the same issue would exist.)
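One way to sidestep both the platform test and a compiler test is to choose flags per source file, so that C sources never receive a C++ standard flag. The following is only a sketch: the helper name flags_for_source and the extension set are hypothetical and are not taken from indexed_zstd's setup.py; the flag values are the ones visible in the build log quoted in the macOS issue.

```python
import os

# Hypothetical helper, not part of indexed_zstd's actual setup.py.
CXX_EXTENSIONS = {".cpp", ".cxx", ".cc"}


def flags_for_source(path, cxx_standard="c++11"):
    """Return extra compiler flags for one source file, based on its language."""
    common = ["-O3", "-DNDEBUG"]
    ext = os.path.splitext(path)[1].lower()
    if ext in CXX_EXTENSIONS:
        # Only C++ translation units get the C++ standard flag.
        return ["-std=" + cxx_standard] + common
    # Plain C sources must not receive -std=c++11; clang rejects it as an error.
    return common
```

With this split, the C library (zstd-seek.c) and the generated C++ extension source would be compiled with different flag lists regardless of which compiler or platform is in use.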

Buffered IndexedZstdFile fails with numpy deserialization

Using IndexedZstdFile together with numpy deserialization (numpy.load) crashes, because numpy is given a wrong (shorter) file size.

This code triggers the issue:

import numpy as np
import zstandard as zstd
from io import BytesIO
from indexed_zstd import IndexedZstdFile
from tempfile import NamedTemporaryFile

file = NamedTemporaryFile()
A = np.random.random((100, 100))
handler = zstd.open(file.name, "wb")
np.save(handler, A)
handler.close()
handler = IndexedZstdFile(file.name)
B = np.load(handler)
assert np.array_equal(A, B)

The resulting execution leads to

Traceback (most recent call last):
  File "/tmp/a/bug.py", line 15, in <module>
    B = np.load(handler)
  File "/usr/lib/python3.10/site-packages/numpy/lib/npyio.py", line 440, in load
    return format.read_array(fid, allow_pickle=allow_pickle,
  File "/usr/lib/python3.10/site-packages/numpy/lib/format.py", line 787, in read_array
    array.shape = shape
ValueError: cannot reshape array of size 9381 into shape (100,100)

This error does not occur if using IndexedZstdFileRaw instead of IndexedZstdFile.
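Besides switching to IndexedZstdFileRaw, a possible workaround is to drain the decompressed stream into memory and hand numpy an in-memory buffer instead. This is a sketch under two assumptions that the report does not confirm: that the decompressed data fits in memory, and that plain read() calls on the handle return correct data. The helper name drain_to_memory is hypothetical.

```python
import io


def drain_to_memory(fileobj, chunk_size=1 << 20):
    """Read `fileobj` until EOF and return the data as a rewound BytesIO."""
    buffer = io.BytesIO()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            # An empty result signals end of stream.
            break
        buffer.write(chunk)
    buffer.seek(0)
    return buffer
```

In the reproduction above this would be used as np.load(drain_to_memory(handler)). It gives up lazy, seek-based access, so it only makes sense for files small enough to hold in memory.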

windows support

Hi @martinellimarco

I wanted to find out how complicated it would be to provide indexed_zstd as Windows wheels, and put together a working example. As a disclaimer: my C and Windows knowledge is enough to get stuff done for me, but very far from knowing anything about best practices, and/or why some things are done the way they are.

see:

It's just a hack for now, since it assumes that you install and build it in a conda environment.
But it might be a good starting point if there's interest in providing indexed_zstd for windows too.

Here's running the test script on windows:

(iz) C:\Users\andreas\Development\indexed_zstd\test>python test.py
Block offset completed?:  False
{0: 0, 17: 4, 32: 6, 49: 10, 78: 26}
Block offset completed?:  True
Test reading letters from A to Z
ABCDEFGHIJKLMNOPQRSTUVWXYZ
Tell is 26:  26
Compressed tell is 78:  78
Seeking back to 15
Tell is 15?:  15
Compressed tell is 49:  49
Filesize is 26?  26
Compressed tell is 49:  49
Reading a byte after EOF
Compressed tell is 78:  78
Test reading again, letters from P to Z
QRSTUVWXYZ
Testing set_block_offsets
Block offset completed?:  False
{0: 0, 17: 4, 32: 6, 49: 10, 78: 26}
Block offset completed?:  True

To build on windows:

git clone --recurse-submodules -b windows https://github.com/ap--/indexed_zstd.git
conda create -n iz python=3.11 zstd conda-build
conda activate iz
pip install ./indexed_zstd

I guess if someone is interested and feels more at home in the windows c-extension python world, it should be easy for them to clean this up. (We could then provide windows pkgs in conda-forge too.)

Cheers,
Andreas 😃

Workflow with file-like

The code you have here seems to explicitly need a local filename to operate on. The zstandard library, however, can act on arbitrary Python file-like objects. I see the greatest feature of this library as the possibility to read only the required byte ranges from remote files, so it would be great if interoperability with file-likes were possible. What do you think, is this tractable?

Tarfile, seek required arguments.

  File "/.../anaconda3/envs/test/lib/python3.10/tarfile.py", line 2465, in __iter__
    tarinfo = self.next()
  File "/.../anaconda3/envs/test/lib/python3.10/tarfile.py", line 2332, in next
    self.fileobj.seek(self.offset - 1)
  File "indexed_zstd/indexed_zstd.pyx", line 77, in indexed_zstd._IndexedZstdFile.seek
TypeError: seek() takes exactly 2 positional arguments (1 given)

Works fine if I manually specify whence as 0.
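Until seek() accepts a default whence, a thin wrapper can supply one before the object is handed to tarfile. This is a minimal sketch; the class name SeekDefaultWhence is hypothetical, and it assumes the wrapped object otherwise behaves like a normal file object.

```python
import io


class SeekDefaultWhence:
    """Wrap a file-like object whose seek() requires an explicit whence."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def seek(self, offset, whence=io.SEEK_SET):
        # Always pass whence explicitly, defaulting to "from start of file".
        return self._fileobj.seek(offset, whence)

    def __getattr__(self, name):
        # Delegate everything else (read, tell, close, ...) to the wrapped object.
        return getattr(self._fileobj, name)
```

The wrapped object could then be passed as tarfile.open(fileobj=SeekDefaultWhence(f)), which matches the observation that the call works once whence is given as 0.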

Providing conda packages

Hello @martinellimarco

I am starting work on providing conda packages for ratarmount and its dependencies.
Would you be interested in providing indexed_zstd via conda?
If yes, I could offer to create the feedstock repository for you to provide packages via conda-forge. I would also be maintaining this feedstock repository, and (of course) add you as a maintainer too.

For reference, here's the discussion in the downstream repo: mxmlnkn/ratarmount#99

Let me know what you think,
Cheers,
Andreas 😃

Not compiling on MacOSX

Your package is a dependency of https://github.com/mxmlnkn/ratarmount, which may become a dependency of https://github.com/yt-project/yt (if yt-project/yt#3443 is merged). In the process of testing yt, we discovered that indexed_zstd does not compile on Mac-OSX (nor on Windows, but that is to be expected, isn't it?).

Unfortunately, I don't have access to a MacOS machine so I cannot debug it myself :(

Steps to reproduce

# On MacOSX
pip install indexed_zstd

Actual error at build stage

[...]/indexed_zstd/libzstd-seek/zstd-seek.o -std=c++11 -O3 -DNDEBUG
    error: invalid argument '-std=c++11' not allowed with 'C'
    error: command '/usr/bin/clang' failed with exit code 1

Basic usage fails?

Not sure if I'm doing something wrong but simply opening a file fails.

    IndexedZstdFile(file_name)
  File "indexed_zstd/indexed_zstd.pyx", line 135, in indexed_zstd.IndexedZstdFile.__init__
  File "indexed_zstd/indexed_zstd.pyx", line 112, in indexed_zstd.IndexedZstdFileRaw.__init__
  File "indexed_zstd/indexed_zstd.pyx", line 48, in indexed_zstd._IndexedZstdFile.__cinit__
TypeError: an integer is required

How to create a suitable zst file?

The README.md says:

IndexedZstdFile will only speed up seeking when there are more than one block, which sadly requires a bit of care in zstd.

Could you elaborate on the specific "bit of care" that's needed? At the zstd command-line, I see options like:

--target-compressed-block-size=# : generate compressed block of approximately targeted size
 -B#    : cut file into independent blocks of size # (default: no block)

...will one of these options produce the desired result? If so, what would you say is a reasonable blocksize? My only reference is bgzf which uses 65536 bytes.
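One approach that avoids the command-line options altogether is to compress the input in fixed-size pieces and concatenate the results, so the file contains many independent units. The sketch below assumes (the thread does not confirm this) that independently compressed frames written back to back are what the README means by multiple blocks. The helper name compress_in_frames is hypothetical, and the compressor is passed in as a callable so the chunking logic stays independent of any particular zstd binding.

```python
def compress_in_frames(data, frame_size, compress):
    """Compress `data` as a concatenation of independently compressed pieces.

    `compress` is any callable mapping bytes to one complete compressed frame.
    """
    if frame_size <= 0:
        raise ValueError("frame_size must be positive")
    frames = []
    for start in range(0, len(data), frame_size):
        # Each slice is compressed on its own, so it can be decoded on its own.
        frames.append(compress(data[start:start + frame_size]))
    return b"".join(frames)
```

With the zstandard package, passing zstandard.ZstdCompressor().compress as the compress argument yields one complete zstd frame per piece. The frame size trades compression ratio against seek granularity; no particular value is recommended here.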

Cannot compile on MacOS on Apple Silicon (M2 Max)

Hi, I'm installing ratarmount which has indexed_zstd as dependency. It used to build and install fine on my Intel-based MacBook Pro, however, on my M2 Max (Apple Silicon) MacBook Pro it fails with this error:

Building wheel for indexed-zstd (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Building wheel for indexed-zstd (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      running bdist_wheel
      running build
      running build_py
      file indexed_zstd.py (for module indexed_zstd) not found
      file indexed_zstd.py (for module indexed_zstd) not found
      running build_clib
      building 'zstd_zeek' library
      creating build
      creating build/temp.macosx-13.4-arm64-cpython-310
      creating build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd
      creating build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd/libzstd-seek
      clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/redacted/usr/include -I/redacted/usr/include -I. -c indexed_zstd/libzstd-seek/zstd-seek.c -o build/temp.macosx-13.4-arm64-cpython-310/indexed_zstd/libzstd-seek/zstd-seek.o
      In file included from indexed_zstd/libzstd-seek/zstd-seek.c:18:
      indexed_zstd/libzstd-seek/zstd-seek.h:20:10: fatal error: 'zstd.h' file not found
      #include <zstd.h>
               ^~~~~~~~
      1 error generated.
      error: command '/usr/bin/clang' failed with exit code 1
      [end of output]

Any advice is welcome. Thanks for your great work!

Wheels for macOS

Hello,

Someone had problems installing on macOS, see here.

It would be nice if there were wheels for macOS to avoid system prerequisites like zstd being installed.

I extended the GitHub Workflow CI of indexed_bzip2 to also create wheels for macOS and Windows. As this repo is based on indexed_bzip2, maybe these changes could be pulled into this repo. I could try to open an MR if you want, but it might take until next weekend.

SIGSEGV, Segmentation fault when trying to open a non-zst file

When opening an .xz file with IndexedZstdFile (to check that an exception is raised), no exception is thrown. However, trying to read one byte afterwards segfaults! This should be detected and an exception should be thrown.

Consider this:

gdb --args python3 -c '
    from indexed_zstd import IndexedZstdFile
    f = IndexedZstdFile( "tests/simple.xz" )
    print( "Opened File!" )
    f.read( 1 )
'

I'll get this output:

GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python3...
(No debugging symbols found in python3)
(gdb) r
Starting program: /usr/bin/python3 -c from\ indexed_zstd\ import\ IndexedZstdFile\;\ f\ =\ IndexedZstdFile\(\ \"tests/simple.xz\"\ \)\;\ print\(\ \"Opened\ File\!\"\ \)\;\ f.read\(\ 1\ \)
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Opened File!

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff73dc7d1 in ZSTDSeek_getJumpCoordinate (sctx=sctx@entry=0x9a3ef0, uncompressedPos=1048576) at indexed_zstd/libzstd-seek/zstd-seek.c:192
192	indexed_zstd/libzstd-seek/zstd-seek.c: No such file or directory.
(gdb) bt
#0  0x00007ffff73dc7d1 in ZSTDSeek_getJumpCoordinate (sctx=sctx@entry=0x9a3ef0, uncompressedPos=1048576)
    at indexed_zstd/libzstd-seek/zstd-seek.c:192
#1  0x00007ffff73dcacd in ZSTDSeek_read (sctx=0x9a3ef0, outBuffSize=1048576, outBuff=0x7ffff6dc1010) at indexed_zstd/libzstd-seek/zstd-seek.c:325
#2  ZSTDSeek_read (outBuff=0x7ffff6dc1010, outBuffSize=1048576, sctx=0x9a3ef0) at indexed_zstd/libzstd-seek/zstd-seek.c:319
#3  0x00007ffff73d4bee in ZSTDReader::read (this=<optimized out>, this=<optimized out>, outputFileDescriptor=-1, nBytesToRead=<optimized out>, 
    outputBuffer=<optimized out>) at indexed_zstd/ZSTDReader.hpp:197
#4  __pyx_pf_12indexed_zstd_16_IndexedZstdFile_12readinto (__pyx_v_self=0x7ffff7483f90, __pyx_v_bytes_like=<optimized out>)
    at indexed_zstd/indexed_zstd.cpp:2216
#5  __pyx_pw_12indexed_zstd_16_IndexedZstdFile_13readinto (__pyx_v_self=0x7ffff7483f90, __pyx_v_bytes_like=<optimized out>)
    at indexed_zstd/indexed_zstd.cpp:2152
#6  0x00000000005c4b10 in ?? ()
#7  0x00000000005f4ca1 in ?? ()
#8  0x00000000005f57c0 in PyObject_CallMethodObjArgs ()
#9  0x000000000064db5d in ?? ()
#10 0x000000000064dc32 in ?? ()
#11 0x000000000064f425 in ?? ()
#12 0x0000000000504743 in ?? ()
#13 0x000000000056b399 in _PyEval_EvalFrameDefault ()
#14 0x000000000056955a in _PyEval_EvalCodeWithName ()
#15 0x000000000068c4a7 in PyEval_EvalCode ()
#16 0x000000000067bc91 in ?? ()
#17 0x000000000067bd0f in ?? ()
#18 0x000000000067bf1f in PyRun_StringFlags ()
#19 0x000000000067dc8f in PyRun_SimpleStringFlags ()
#20 0x00000000006b60ec in Py_RunMain ()
#21 0x00000000006b63bd in Py_BytesMain ()
#22 0x00007ffff7dd60b3 in __libc_start_main (main=0x4eea30 <main>, argc=3, argv=0x7fffffffd798, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffd788) at ../csu/libc-start.c:308
#23 0x00000000005fa4de in _start ()
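Until the library rejects such input itself, a caller can guard against the crash by checking the file's magic number before opening it. The sketch below is a caller-side check only, not the library's behaviour; the function name looks_like_zstd is hypothetical. The magic values are those of the zstd frame format: 0xFD2FB528 for regular frames and 0x184D2A50 to 0x184D2A5F for skippable frames, both stored little-endian.

```python
ZSTD_FRAME_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528 as stored on disk


def looks_like_zstd(path):
    """Return True if the file starts with a zstd or skippable-frame magic."""
    with open(path, "rb") as f:
        head = f.read(4)
    if head == ZSTD_FRAME_MAGIC:
        return True
    # Skippable frames: 0x184D2A50..0x184D2A5F, so only the first byte varies.
    return (
        len(head) == 4
        and head[1:] == b"\x2a\x4d\x18"
        and 0x50 <= head[0] <= 0x5F
    )
```

An .xz file starts with the bytes FD 37 7A 58 5A 00, so it fails this check and the caller can raise an exception instead of passing the file on. The check only inspects the first four bytes and does not validate the rest of the stream.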
