richox / orz
a high-performance, general-purpose data compressor written in the crab-lang
License: MIT License
For some context, see https://doc.rust-lang.org/cargo/faq.html#why-do-binaries-have-cargolock-in-version-control-but-not-libraries
Not having a Cargo.lock file seems to have broken the packaging for VoidLinux. See void-linux/void-packages#15730
Hello. I ran into some trouble when running the program on Windows 10. Could you please give me some advice?
F:\orz-master\orz-master\target\debug>.\orz.exe encode 1111111111111111111111111111111111
thread 'main' panicked at 'assertion index < len failed: index out of bounds: index = 16777251, len = 16777251', C:\Users\lenovo\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src\libcore\macros\mod.rs:16:9
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
I recently wrote a simple Huffman compression algorithm in C++; I don't know how big the speed gap is compared with the author's. It was tested successfully on macOS and Linux, but not on Windows. Project: https://github.com/yangyongkang2000/C-Programming/tree/master/Huffman/Huffman
Speed comparisons are welcome.
I tried to test something but didn't even get to that point, as the built executable crashes with this message:
thread 'main' has overflowed its stack
gdb says this:
$ gdb --args orz__debug_w32 encode README.md README.md.orz
GNU gdb (GDB) 7.9.1
(...)
Reading symbols from orz__debug_w32...done.
(gdb) r
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 14228.0x11dc]
[New Thread 14228.0x2980]
[New Thread 14228.0x1cc4]
[New Thread 14228.0x450]
Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88 ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
(gdb) bt
#0 _alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
#1 0x0057ca13 in orz::encode::h64d6467265acc7bf (
source=<error reading variable: Cannot access memory at address 0x6a0fc30>,
target=<error reading variable: Cannot access memory at address 0x6a0fc38>, cfg=0x1a0fae4) at src/lib.rs:44
#2 0x004036d6 in orz::main::hc5aba79d15bc2c2c () at src/main.rs:94
#3 0x00407f0b in core::ops::function::FnOnce::call_once::hfde464d49ace8ae2 ()
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:248
#4 0x00402062 in std::sys_common::backtrace::__rust_begin_short_backtrace::h3b09b2cc1997b89a (
f=0x402910 <orz::main::hc5aba79d15bc2c2c>)
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src\sys_common/backtrace.rs:122
#5 0x00408a93 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::he2d87c0b87bf469b ()
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:145
#6 0x0065c340 in call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:280
#7 do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panicking.rs:492
#8 try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library\std\src/panicking.rs:456
#9 catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panic.rs:137
#10 {closure#2} () at library\std\src/rt.rs:128
#11 do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panicking.rs:492
#12 try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library\std\src/panicking.rs:456
#13 catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panic.rs:137
#14 std::rt::lang_start_internal::h71a9cc7a00235f34 () at library\std\src/rt.rs:128
#15 0x00408a70 in std::rt::lang_start::h9847c1da96d8463b (main=0x402910 <orz::main::hc5aba79d15bc2c2c>, argc=4,
argv=0x22e2df8) at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:144
#16 0x004053c3 in main ()
(gdb) l
83 in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 22192.0x4bc8]
[New Thread 22192.0x24ac]
[New Thread 22192.0x3da4]
[New Thread 22192.0xf1c]
Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88 in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) q
I used the brand-new rustc 1.62.1, host i686-pc-windows-gnu/x86_64-pc-windows-gnu, from here. I tried both with the same result.
Unlike all other Unix compressors, orz's format doesn't give any reliable way to sniff it in a maybe-compressed file. While in some contexts (private data, files with a .orz suffix) the format is already known, there are also cases where programs assume it's possible to detect the transport compression by reading the start of the header. And e.g. libarchive/bsdtar have no mode other than sniffing.
I see that you haven't committed to a stable bitstream yet -- at least, the decompressor gives a warning when trying to uncompress a file made with an earlier version. Thus, adding such a magic might still be acceptable to you.
A proper magic would be:
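For illustration of how such sniffing could then work (the magic bytes below are hypothetical placeholders, not a value the project has chosen), a detector in the style libarchive/bsdtar would need might look like:

```rust
use std::io::{self, Read};

// Hypothetical 4-byte magic; the real value would be the project's choice.
const MAGIC: [u8; 4] = *b"orz\x01";

// Sniff whether a stream starts with the magic, the way libarchive/bsdtar
// detect transport compression from the first header bytes.
fn looks_like_orz<R: Read>(reader: &mut R) -> io::Result<bool> {
    let mut head = [0u8; 4];
    match reader.read_exact(&mut head) {
        Ok(()) => Ok(head == MAGIC),
        // A stream shorter than the magic is simply not an orz file.
        Err(ref e) if e.kind() == io::ErrorKind::UnexpectedEof => Ok(false),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    assert!(looks_like_orz(&mut &b"orz\x01rest-of-stream"[..])?);
    assert!(!looks_like_orz(&mut &b"PK\x03\x04extra"[..])?);
    println!("sniffing works");
    Ok(())
}
```

A decoder would consume these four bytes before the real bitstream, so existing framing only shifts by a constant offset.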
Coming from c656c07#r37659833
We should probably comment here that hash_dword
is always reduced modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219); otherwise it would not make much sense to do anything here (a u32 always fits in a usize on 32- and 64-bit platforms).
Given that log 5219 / log 2 ≈ 12.35, the largest we would want is a 16-bit hash function. A Pearson hash does not look too bad in this case:
static PEAR: [u8; 256] = /* RFC 3074 table here */;

#[inline]
fn hash_pearson(val: u32) -> u8 {
    let mut h = PEAR[(val >> 24) as usize];
    h = PEAR[(h ^ (val >> 16) as u8) as usize];
    h = PEAR[(h ^ (val >> 8) as u8) as usize];
    h = PEAR[(h ^ val as u8) as usize];
    h
}

/// Hash a u32 read at buf[pos] to a usize (always reduced modulo LZ_MF_BUCKET_ITEM_HASH_SIZE).
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
    let val = buf.read::<u32>(pos).to_be();
    ((hash_pearson(val) as usize) << 8) | hash_pearson(val ^ 0x0100_0000) as usize
}
(djb2 looks cool too, if you like the multiplication stuff.)
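For comparison, a djb2-style multiplicative hash over the same dword, reduced into the bucket range, is a small self-contained sketch (constants from the classic djb2; LZ_MF_BUCKET_ITEM_HASH_SIZE as discussed above):

```rust
const LZ_MF_BUCKET_ITEM_HASH_SIZE: usize = 5219;

// djb2-style hash: start at 5381, multiply by 33 and fold in each byte,
// then reduce into the bucket range.
fn hash_dword_djb2(val: u32) -> usize {
    let mut h: u32 = 5381;
    for byte in val.to_be_bytes() {
        h = h.wrapping_mul(33) ^ byte as u32;
    }
    h as usize % LZ_MF_BUCKET_ITEM_HASH_SIZE
}

fn main() {
    for v in [0u32, 1, 0xdead_beef, u32::MAX] {
        let h = hash_dword_djb2(v);
        assert!(h < LZ_MF_BUCKET_ITEM_HASH_SIZE);
        println!("{:#010x} -> {}", v, h);
    }
}
```

The final modulo makes the bucket-range reduction explicit instead of relying on a comment.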
It seems there is no test code in the project, so how do we ensure that the compression and decompression results are correct? The project version has reached 1.4, which suggests the functionality is stable and usable in production; in that case, adding corresponding test code is very necessary.
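A minimal round-trip property test could take this shape; here a trivial XOR codec stands in for the real orz encode/decode pair so the sketch is self-contained:

```rust
// Stand-in codec: byte-wise XOR with a constant. In the real project,
// orz's encode/decode functions would take the place of these two.
fn encode(data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ 0x5a).collect()
}
fn decode(data: &[u8]) -> Vec<u8> {
    data.iter().map(|b| b ^ 0x5a).collect()
}

// The core property: decoding an encoded input must reproduce it exactly.
fn round_trips(data: &[u8]) -> bool {
    decode(&encode(data)) == data
}

fn main() {
    for input in [&b""[..], &b"banana"[..], &[0u8; 4096][..]] {
        assert!(round_trips(input));
    }
    println!("all round trips ok");
}
```

Feeding in empty input, repetitive input, and random blocks of various sizes would cover the cases that tend to break LZ match finders.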
Most compression software supports compressing directories, but this software currently only supports compressing a single file.
Can you support compressing directories?
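Until then, one common workaround is to let tar flatten the directory and pipe the archive through orz. Here is a sketch with cat standing in for the orz commands so the pipeline itself is runnable; with orz installed, the commands in the comments apply:

```shell
# With orz this would be:
#   pack:   tar -c somedir/ | orz encode > somedir.tar.orz
#   unpack: orz decode < somedir.tar.orz | tar -x
# Demonstrated here with `cat` standing in for orz:
mkdir -p somedir && echo hello > somedir/file.txt
tar -c somedir/ | cat > somedir.tar.orz                        # "encode"
mkdir -p restore && cat < somedir.tar.orz | tar -x -C restore  # "decode"
cmp somedir/file.txt restore/somedir/file.txt && echo "round trip ok"
```

This also sidesteps questions of file metadata and directory layout, which tar already handles.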
Steps to reproduce:
admin@ip-172-31-23-30:~/beat-orz/orz$ head -c 1000000 /dev/urandom > /tmp/a
admin@ip-172-31-23-30:~/beat-orz/orz$ cat /tmp/a | /home/admin/beat-orz/orz/target/release/orz encode > /tmp/or
[INFO] encode: 65536 bytes => 66663 bytes, 2.537MB/s
[INFO] statistics:
[INFO] size: 65536 bytes => 66681 bytes
[INFO] ratio: 101.75%
[INFO] speed: 2.482 MB/s
[INFO] time: 0.026 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ /home/admin/beat-orz/orz/target/release/orz decode < /tmp/or > /tmp/trip
[INFO] decode: 65536 bytes <= 66663 bytes, 6.351MB/s
[INFO] statistics:
[INFO] size: 65536 bytes => 66681 bytes
[INFO] ratio: 101.75%
[INFO] speed: 6.172 MB/s
[INFO] time: 0.011 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ cmp /tmp/trip /tmp/a
cmp: EOF on /tmp/trip after byte 65536, in line 284
OS: Debian Linux bookworm, rustc 1.73.0-nightly, orz 3380556
Possible reason: you probably don't check the return value of libc::read, or something similar.
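If that is the cause, the usual fix is a loop that keeps reading until the buffer is full or EOF, since read may legitimately return fewer bytes than requested; a sketch in safe Rust using the std Read trait:

```rust
use std::io::{self, Read};

// read() may return fewer bytes than requested; loop until the buffer is
// full or EOF, and propagate errors instead of ignoring the return value.
fn read_full<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            Ok(0) => break, // EOF
            Ok(n) => total += n,
            Err(ref e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let data = b"hello world";
    let mut buf = [0u8; 16];
    let n = read_full(&mut &data[..], &mut buf)?;
    assert_eq!(&buf[..n], &data[..]);
    println!("read {} bytes", n);
    Ok(())
}
```

A short read from a pipe (as in the /dev/urandom repro above) would then be topped up instead of silently truncating the input to one read's worth.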
Error: "invalid level: 3"
I want to give this a try, but git checkout and even the zip download of this repo fail. I can't even copy the file contents by hand, as it is not possible to create a file named aux. The file extension doesn't even matter; it always leads to an exception.
Turns out Windows has some reserved filenames:
https://kizu514.com/blog/forbidden-file-names-on-windows-10/
error: linker `cc` not found
  |
  = note: No such file or directory (os error 2)
error: aborting due to previous error
error: could not compile `libc`
To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: linker `cc` not found
  |
  = note: No such file or directory (os error 2)
error: aborting due to previous error
error: failed to compile `orz v1.6.1 (https://github.com/richox/orz#28811d98)`, intermediate artifacts can be found at `/tmp/cargo-install9LTsSH`

Caused by:
  build failed
The readme tells me to use
cargo install --git https://github.com/richox/orz --tag v1.6.1
to install it, but that fails with the error
error: multiple packages with binaries found: benchmark-tool, orz
Should be
cargo install orz --git https://github.com/richox/orz --tag v1.6.1
Hello,
the default compression level should be 2 instead of 3.
Level 3 throws an error because I think it was removed.
I think it is at line 20 of main.rs:
#[structopt(long = "level", short = "l", default_value = "3")] /// Set compression level (0..3)
tar has an argument --use-compress-program, but I don't know how to use it with orz.
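For what it's worth, GNU tar invokes the program given to --use-compress-program with -d appended when extracting, while orz uses encode/decode subcommands, so a tiny wrapper script is needed (a sketch; assumes orz is on PATH):

```shell
# Wrapper so GNU tar can drive orz: tar runs the compress program plainly
# when creating an archive and with `-d` appended when extracting, but orz
# expects encode/decode subcommands instead of a -d flag.
cat > orz-filter <<'EOF'
#!/bin/sh
if [ "$1" = "-d" ]; then
    exec orz decode
else
    exec orz encode
fi
EOF
chmod +x orz-filter

# Usage (with orz installed):
#   tar -cf archive.tar.orz --use-compress-program=./orz-filter somedir/
#   tar -xf archive.tar.orz --use-compress-program=./orz-filter
```

The same wrapper works with tar's short -I alias for --use-compress-program.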
On FreeBSD I get (using v1.6.2):
Error: "invalid level: 3"
When using the default compression level (3) while encoding:
# orz encode /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"
Same when specifying -l 3
# orz encode -l 3 /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"
Dropping to -l 2 seems to work:
# orz encode -l 2 /COPYRIGHT /COPYRIGHT.orz
[INFO] encode: 6109 bytes => 3147 bytes, 1.861MB/s
[INFO] statistics:
[INFO] size: 6109 bytes => 3165 bytes
[INFO] ratio: 51.81%
[INFO] time: 0.016 sec
When I use orz on a Windows machine I get this error:
'''
thread 'main' has overflowed its stack
'''
and it creates an empty output file.
I checked on a Debian machine, where it worked perfectly. The problem exists only on Windows.
I tried this code (using the master branch)...
use orz::encode;
use orz::lz::LZCfg;

fn main() {
    let mut src = "Hola a todos!".as_bytes();
    let mut out: Vec<u8> = vec![];
    let cfg = LZCfg {
        match_depth: 48,
        lazy_match_depth1: 32,
        lazy_match_depth2: 16,
    };
    match encode(&mut src, &mut out, &cfg) {
        Ok(stat) => {
            println!(
                "source_size: {} -- target_size: {}",
                stat.source_size, stat.target_size
            );
        }
        Err(e) => eprintln!("Error: {:?}", e),
    }
}
It only works if I run it with the --release flag.
Hi. This is an impressive program! Unfortunately, it seems the decompression speed is not as good as zstd's. So I propose this trick:
https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
This post may be useful too: https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html
I was super excited to see this! I'm currently looking for a fast compression alternative for zstd
for compressing postgresql wal archives.
At least for this use-case, I wasn't able to reproduce the benchmarks you've provided.
(orz v1.6.2 installed using cargo install as described in the README; also tested with cargo build --release from the current HEAD):
$ zstd 00000003000025EF0000007C
00000003000025EF0000007C : 50.55% ( 16.0 MiB => 8.09 MiB, 00000003000025EF0000007C.zst)
'zstd 00000003000025EF0000007C' time: 0.064s, cpu: 104%
orz encode -l0 00000003000025EF0000007C 00000003000025EF0000007C.orz
[INFO] encode: 16777216 bytes => 8111757 bytes, 25.301MB/s
[INFO] statistics:
[INFO] size: 16777216 bytes => 8111839 bytes
[INFO] ratio: 48.35%
[INFO] time: 0.669 sec
Which is a factor of ~10 slower than zstd :(
Platform: M1 Apple Silicon macOS (native), x86_64 Linux (musl cross-compiled)
The world out there still speaks C by and large. To let more people use the library, Orz should export a C API, so that people can use it from C++, Objective-C, Nim, Python, Node.js, and everything else.
The "A little Rust with your C" chapter explains how to make public functions C-compatible and how to generate headers using cbindgen.
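As a sketch of what such an exported surface could look like (the function name and signature below are illustrative, not an existing orz API), a #[no_mangle] extern "C" function takes raw buffers and returns a status code; cbindgen can then emit the matching header:

```rust
use std::slice;

// Illustrative FFI surface (not orz's actual API): the C caller passes raw
// buffers, the Rust side reconstructs slices and returns 0 on success.
#[no_mangle]
pub unsafe extern "C" fn orz_checksum(data: *const u8, len: usize, out: *mut u32) -> i32 {
    if data.is_null() || out.is_null() {
        return -1; // invalid argument
    }
    let bytes = slice::from_raw_parts(data, len);
    let sum = bytes.iter().fold(0u32, |acc, &b| acc.wrapping_add(b as u32));
    *out = sum;
    0 // success
}

fn main() {
    // Exercise the function from Rust the way a C caller would.
    let data = b"hello";
    let mut out = 0u32;
    let rc = unsafe { orz_checksum(data.as_ptr(), data.len(), &mut out) };
    assert_eq!(rc, 0);
    assert_eq!(out, 532); // sum of the bytes of "hello"
}
```

A real wrapper would expose encode/decode in this style from a cdylib crate, with cbindgen generating the .h file from the annotated signatures.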