richox / orz
a high-performance, general-purpose data compressor written in the crab-lang
License: MIT License
For some context, see https://doc.rust-lang.org/cargo/faq.html#why-do-binaries-have-cargolock-in-version-control-but-not-libraries
Not having a Cargo.lock file seems to have broken the packaging for VoidLinux. See void-linux/void-packages#15730
Hello. I ran into some trouble when running the program on Windows 10. Could you please give me some advice?
F:\orz-master\orz-master\target\debug>.\orz.exe encode 1111111111111111111111111111111111
thread 'main' panicked at 'assertion index < len failed: index out of bounds: index = 16777251, len = 16777251', C:\Users\lenovo\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src\libcore\macros\mod.rs:16:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
I recently wrote a simple Huffman compression algorithm in C++; I don't know how big the speed gap is compared with the author's. It was tested successfully on macOS and Linux, but not on Windows. Project: https://github.com/yangyongkang2000/C-Programming/tree/master/Huffman/Huffman
Speed comparisons are welcome.
I tried to test something but didn't even get to that point, as the built executable crashes with this message:
thread 'main' has overflowed its stack
gdb says this:
$ gdb --args orz__debug_w32 encode README.md README.md.orz
GNU gdb (GDB) 7.9.1
(...)
Reading symbols from orz__debug_w32...done.
(gdb) r
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 14228.0x11dc]
[New Thread 14228.0x2980]
[New Thread 14228.0x1cc4]
[New Thread 14228.0x450]
Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88 ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
(gdb) bt
#0 _alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
#1 0x0057ca13 in orz::encode::h64d6467265acc7bf (
source=<error reading variable: Cannot access memory at address 0x6a0fc30>,
target=<error reading variable: Cannot access memory at address 0x6a0fc38>, cfg=0x1a0fae4) at src/lib.rs:44
#2 0x004036d6 in orz::main::hc5aba79d15bc2c2c () at src/main.rs:94
#3 0x00407f0b in core::ops::function::FnOnce::call_once::hfde464d49ace8ae2 ()
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:248
#4 0x00402062 in std::sys_common::backtrace::__rust_begin_short_backtrace::h3b09b2cc1997b89a (
f=0x402910 <orz::main::hc5aba79d15bc2c2c>)
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src\sys_common/backtrace.rs:122
#5 0x00408a93 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::he2d87c0b87bf469b ()
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:145
#6 0x0065c340 in call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:280
#7 do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panicking.rs:492
#8 try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library\std\src/panicking.rs:456
#9 catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panic.rs:137
#10 {closure#2} () at library\std\src/rt.rs:128
#11 do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panicking.rs:492
#12 try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library\std\src/panicking.rs:456
#13 catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panic.rs:137
#14 std::rt::lang_start_internal::h71a9cc7a00235f34 () at library\std\src/rt.rs:128
#15 0x00408a70 in std::rt::lang_start::h9847c1da96d8463b (main=0x402910 <orz::main::hc5aba79d15bc2c2c>, argc=4,
argv=0x22e2df8) at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:144
#16 0x004053c3 in main ()
(gdb) l
83 in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 22192.0x4bc8]
[New Thread 22192.0x24ac]
[New Thread 22192.0x3da4]
[New Thread 22192.0xf1c]
Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88 in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) q
I used the brand-new rustc 1.62.1, host i686-pc-windows-gnu/x86_64-pc-windows-gnu, from here. I tried both with the same result.
Unlike all other Unix compressors, orz's format doesn't give any reliable way to sniff it in a maybe-compressed file. While in some contexts (private data, files with a .orz suffix) the format is already known, there are also cases where programs assume it's possible to detect the transport compression by reading the start of the header. And e.g. libarchive/bsdtar have no mode other than sniffing.
I see that you haven't committed to a stable bitstream yet -- at least, the decompressor gives a warning when trying to uncompress a file made with an earlier version. Thus, adding such a magic might still be acceptable to you.
A proper magic would be:
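For illustration of how such sniffing could then work (the magic bytes below are hypothetical placeholders, not a value the project has chosen), a detector in the style libarchive/bsdtar would need might look like:

```rust
use std::io::{self, Read};

// Hypothetical 4-byte magic; the real value would be the project's choice.
const MAGIC: [u8; 4] = *b"orz\x01";

// Sniff whether a stream starts with the magic, the way libarchive/bsdtar
// detect transport compression from the first header bytes.
fn looks_like_orz<R: Read>(reader: &mut R) -> io::Result<bool> {
    let mut head = [0u8; 4];
    match reader.read_exact(&mut head) {
        Ok(()) => Ok(head == MAGIC),
        // A stream shorter than the magic is simply not an orz file.
        Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    assert!(looks_like_orz(&mut &b"orz\x01rest-of-stream"[..])?);
    assert!(!looks_like_orz(&mut &b"PK\x03\x04extra"[..])?);
    println!("sniffing works");
    Ok(())
}
```

A decoder would consume these four bytes before the real bitstream, so existing framing only shifts by a constant offset.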
Coming from c656c07#r37659833
We should probably comment here that hash_dword
is always reduced modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219); otherwise it would not make much sense to do anything here (a u32 always fits in a usize on 32- and 64-bit platforms).
Given that log 5219 / log 2 ≈ 12.35, the largest we would want is a 16-bit hash function. A Pearson hash does not look too bad in this case:
static PEAR: [u8; 256] = /* RFC 3074 table here */;

#[inline]
fn hash_pearson(val: u32) -> u8 {
    let mut h = PEAR[(val >> 24) as usize];
    h = PEAR[(h ^ (val >> 16) as u8) as usize];
    h = PEAR[(h ^ (val >> 8) as u8) as usize];
    h = PEAR[(h ^ val as u8) as usize];
    h
}

/// Hash a u32 read at buf[pos] to a usize (always reduced modulo LZ_MF_BUCKET_ITEM_HASH_SIZE).
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
    let val = buf.read::<u32>(pos).to_be();
    ((hash_pearson(val) as usize) << 8) | hash_pearson(val ^ 0x0100_0000) as usize
}
(djb2 looks cool too, if you like the multiplication stuff.)
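For comparison, a djb2-style multiplicative hash over the same dword, reduced into the bucket range, is a small self-contained sketch (constants from the classic djb2; LZ_MF_BUCKET_ITEM_HASH_SIZE as discussed above):

```rust
const LZ_MF_BUCKET_ITEM_HASH_SIZE: usize = 5219;

// djb2-style hash: start at 5381, multiply by 33 and fold in each byte,
// then reduce into the bucket range.
fn hash_dword_djb2(val: u32) -> usize {
    let mut h: u32 = 5381;
    for byte in val.to_be_bytes() {
        h = h.wrapping_mul(33) ^ byte as u32;
    }
    h as usize % LZ_MF_BUCKET_ITEM_HASH_SIZE
}

fn main() {
    for v in [0u32, 1, 0xdead_beef, u32::MAX] {
        let h = hash_dword_djb2(v);
        assert!(h < LZ_MF_BUCKET_ITEM_HASH_SIZE);
        println!("{:#010x} -> {}", v, h);
    }
}
```

The final modulo makes the bucket-range reduction explicit instead of relying on a comment.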
It seems there is no test code in the project, so how do we ensure that the compression and decompression results are correct? The project version has reached 1.4, which suggests the functionality is stable and usable in production; in that case, adding corresponding test code is very necessary.
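A minimal round-trip property test could take this shape; here a trivial XOR codec stands in for the real orz encode/decode pair so the sketch is self-contained:

```rust
// Stand-in codec: byte-wise XOR with a constant. In the real project,
// orz's encode/decode functions would take the place of these two.
fn encode(data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ 0x5a).collect()
}
fn decode(data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ 0x5a).collect()
}

// The core property: decoding an encoded input must reproduce it exactly.
fn round_trips(data: &[u8]) -> bool {
    decode(&encode(data)) == data
}

fn main() {
    for input in [&b""[..], &b"banana"[..], &[0u8; 4096][..]] {
        assert!(round_trips(input));
    }
    println!("all round trips ok");
}
```

Feeding in empty input, repetitive input, and random blocks of various sizes would cover the cases that tend to break LZ match finders.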
Most compression software supports compressing directories, but this software currently only supports compressing a single file.
Can you support compressing directories?
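Until then, one common workaround is to let tar flatten the directory and pipe the archive through orz. Here is a sketch with cat standing in for the orz commands so the pipeline itself is runnable; with orz installed, the commands in the comments apply:

```shell
# With orz this would be:
#   pack:   tar -c somedir/ | orz encode > somedir.tar.orz
#   unpack: orz decode < somedir.tar.orz | tar -x
# Demonstrated here with `cat` standing in for orz:
mkdir -p somedir && echo hello > somedir/file.txt
tar -c somedir/ | cat > somedir.tar.orz                        # "encode"
mkdir -p restore && cat < somedir.tar.orz | tar -x -C restore  # "decode"
cmp somedir/file.txt restore/somedir/file.txt && echo "round trip ok"
```

This also sidesteps questions of file metadata and directory layout, which tar already handles.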
Steps to reproduce:
admin@ip-172-31-23-30:~/beat-orz/orz$ head -c 1000000 /dev/urandom > /tmp/a
admin@ip-172-31-23-30:~/beat-orz/orz$ cat /tmp/a | /home/admin/beat-orz/orz/target/release/orz encode > /tmp/or
[INFO] encode: 65536 bytes => 66663 bytes, 2.537MB/s
[INFO] statistics:
[INFO] size: 65536 bytes => 66681 bytes
[INFO] ratio: 101.75%
[INFO] speed: 2.482 MB/s
[INFO] time: 0.026 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ /home/admin/beat-orz/orz/target/release/orz decode < /tmp/or > /tmp/trip
[INFO] decode: 65536 bytes <= 66663 bytes, 6.351MB/s
[INFO] statistics:
[INFO] size: 65536 bytes => 66681 bytes
[INFO] ratio: 101.75%
[INFO] speed: 6.172 MB/s
[INFO] time: 0.011 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ cmp /tmp/trip /tmp/a
cmp: EOF on /tmp/trip after byte 65536, in line 284
OS: Debian Linux bookworm, rustc 1.73.0-nightly, orz 3380556
Possible reason: you probably don't check the return value of libc::read, or something similar.
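If that is the cause, the usual fix is a loop that keeps reading until the buffer is full or EOF, since read may legitimately return fewer bytes than requested; a sketch in safe Rust using the std Read trait:

```rust
use std::io::{self, Read};

// read() may return fewer bytes than requested; loop until the buffer is
// full or EOF, and propagate errors instead of ignoring the return value.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            Ok(0) => break, // EOF
            Ok(n) => total += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let data = b"hello world";
    let mut buf = [0u8; 16];
    let n = read_full(&mut &data[..], &mut buf)?;
    assert_eq!(&buf[..n], &data[..]);
    println!("read {} bytes", n);
    Ok(())
}
```

A short read from a pipe (as in the /dev/urandom repro above) would then be topped up instead of silently truncating the input to one read's worth.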
Error: "invalid level: 3"
I want to give this a try, but git checkout and even the zip download of this repo fail. I can't even copy the file contents by hand, as it is not possible to create a file named aux. The file extension doesn't even matter; it always leads to an exception.
Turns out Windows has some reserved filenames:
https://kizu514.com/blog/forbidden-file-names-on-windows-10/
error: linker `cc` not found
  |
  = note: No such file or directory (os error 2)
error: aborting due to previous error
error: could not compile `libc`
To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: linker `cc` not found
  |
  = note: No such file or directory (os error 2)
error: aborting due to previous error
error: failed to compile `orz v1.6.1 (https://github.com/richox/orz#28811d98)`, intermediate artifacts can be found at `/tmp/cargo-install9LTsSH`

Caused by:
  build failed
The readme tells me to use
cargo install --git https://github.com/richox/orz --tag v1.6.1
to install it, but that fails with the error
error: multiple packages with binaries found: benchmark-tool, orz
Should be
cargo install orz --git https://github.com/richox/orz --tag v1.6.1
Hello,
the default compression level should be 2 instead of 3.
Level 3 throws an error because I think it was removed.
I think it is at line 20 of main.rs:
#[structopt(long = "level", short = "l", default_value = "3")] /// Set compression level (0..3)
tar has an argument --use-compress-program, but I don't know how to use it with orz.
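For what it's worth, GNU tar invokes the program given to --use-compress-program with -d appended when extracting, while orz uses encode/decode subcommands, so a tiny wrapper script is needed (a sketch; assumes orz is on PATH):

```shell
# Wrapper so GNU tar can drive orz: tar runs the compress program plainly
# when creating an archive and with `-d` appended when extracting, but orz
# expects encode/decode subcommands instead of a -d flag.
cat > orz-filter <<'EOF'
#!/bin/sh
if [ "$1" = "-d" ]; then
    exec orz decode
else
    exec orz encode
fi
EOF
chmod +x orz-filter

# Usage (with orz installed):
#   tar -cf archive.tar.orz --use-compress-program=./orz-filter somedir/
#   tar -xf archive.tar.orz --use-compress-program=./orz-filter
```

The same wrapper works with tar's short -I alias for --use-compress-program.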
On FreeBSD I get (using v1.6.2):
Error: "invalid level: 3"
When using the default compression level (3) while encoding:
# orz encode /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"
Same when specifying -l 3
# orz encode -l 3 /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"
Dropping to -l 2 seems to work:
# orz encode -l 2 /COPYRIGHT /COPYRIGHT.orz
[INFO] encode: 6109 bytes => 3147 bytes, 1.861MB/s
[INFO] statistics:
[INFO] size: 6109 bytes => 3165 bytes
[INFO] ratio: 51.81%
[INFO] time: 0.016 sec
When I use orz on a Windows machine I get this error:
'''
thread 'main' has overflowed its stack
'''
and it creates an empty output file.
I checked on a Debian machine, where it worked perfectly. The problem exists only on Windows.
I tried this code (using the master branch)...
use orz::encode;
use orz::lz::LZCfg;

fn main() {
    let mut src = "Hola a todos!".as_bytes();
    let mut out: Vec<u8> = vec![];
    let cfg = LZCfg {
        match_depth: 48,
        lazy_match_depth1: 32,
        lazy_match_depth2: 16,
    };
    match encode(&mut src, &mut out, &cfg) {
        Ok(stat) => {
            println!(
                "source_size: {} -- target_size: {}",
                stat.source_size, stat.target_size
            );
        }
        Err(e) => eprintln!("Error: {:?}", e),
    }
}
It only works if I run it with the --release flag.
Hi. This is an impressive program! Unfortunately, it seems the decompression speed is not as good as zstd's. So I propose this trick:
https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
This post may be useful too: https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html
I was super excited to see this! I'm currently looking for a fast compression alternative for zstd
for compressing postgresql wal archives.
At least for this use-case, I wasn't able to reproduce the benchmarks you've provided.
(orz v1.6.2 installed using cargo install as described in the README; also tested with cargo build --release from the current HEAD):
$ zstd 00000003000025EF0000007C
00000003000025EF0000007C : 50.55% ( 16.0 MiB => 8.09 MiB, 00000003000025EF0000007C.zst)
'zstd 00000003000025EF0000007C' time: 0.064s, cpu: 104%
orz encode -l0 00000003000025EF0000007C 00000003000025EF0000007C.orz
[INFO] encode: 16777216 bytes => 8111757 bytes, 25.301MB/s
[INFO] statistics:
[INFO] size: 16777216 bytes => 8111839 bytes
[INFO] ratio: 48.35%
[INFO] time: 0.669 sec
Which is a factor of ~10 slower than zstd :(
Platform: M1 Apple Silicon macOS (native), x86_64 Linux (musl cross-compiled)
The world out there still speaks C by and large. To let more people use the library, Orz should export a C API, so that people can use it from C++, Objective-C, Nim, Python, Node.js, and everything else.
The "A little Rust with your C" chapter explains how to make public functions C-compatible and how to generate headers using cbindgen.
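As a sketch of what such an exported surface could look like (the function name and signature below are illustrative, not an existing orz API), a #[no_mangle] extern "C" function takes raw buffers and returns a status code; cbindgen can then emit the matching header:

```rust
use std::slice;

// Illustrative FFI surface (not orz's actual API): the C caller passes raw
// buffers, the Rust side reconstructs slices and returns 0 on success.
#[no_mangle]
pub unsafe extern "C" fn orz_checksum(data: *const u8, len: usize, out: *mut u32) -> i32 {
    if data.is_null() || out.is_null() {
        return -1; // invalid argument
    }
    let bytes = slice::from_raw_parts(data, len);
    let sum = bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32));
    *out = sum;
    0 // success
}

fn main() {
    // Exercise the function from Rust the way a C caller would.
    let data = b"hello";
    let mut out = 0u32;
    let rc = unsafe { orz_checksum(data.as_ptr(), data.len(), &mut out) };
    assert_eq!(rc, 0);
    assert_eq!(out, 532); // sum of the bytes of "hello"
}
```

A real wrapper would expose encode/decode in this style from a cdylib crate, with cbindgen generating the .h file from the annotated signatures.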