paq's Introduction

paq

Hash file or directory recursively.

Powered by the blake3 cryptographic hashing algorithm.

Performance

The Go programming language repository was used as a test case (478 MB / 12,540 files).

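| Command                | Mean [ms]     | Min [ms] | Max [ms] | Relative     |
|------------------------|---------------|----------|----------|--------------|
| paq ./go               | 116.4 ± 2.6   | 111.4    | 120.9    | 1.00         |
| shell b3sum            | 132.4 ± 1.5   | 129.6    | 135.9    | 1.14 ± 0.03  |
| dirhash -a sha256 ./go | 642.5 ± 5.8   | 634.7    | 649.8    | 5.52 ± 0.13  |
| shell sha256sum        | 1583.0 ± 16.3 | 1568.6   | 1606.8   | 13.60 ± 0.33 |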

The performance benchmark was run with hyperfine.

Commands prefixed with shell use the following pipeline, with <hashsum> replaced by the named implementation:

find ./go -type f -print0 | LC_ALL=C sort -z | xargs -0 <hashsum> | <hashsum>

Installation

Cargo Install

Installation requires cargo.

Install From Crates.io

cargo install paq

Install From Repository Clone

  1. Clone this repository.
  2. Run cargo install --path . from repository root.

Pre-Built Binary Package

  1. Find the Latest Release .zip archive for your operating system and architecture.
  2. Download and extract .zip.
  3. Modify permissions of the extracted paq binary to allow execution.
  4. Move paq to a system path.

Usage

paq can be used as a command line interface executable or as a crate library.

Included in this repository is an example directory containing some sample files, a subdirectory and a symlink to test paq functionality.

Executable

Run paq [src] to hash a source file or directory.

Output hash to .paq file as valid JSON.

For help, run paq --help.

Hash Example Directory

paq ./example

The path to the example directory can be relative or absolute.

Expect a different hash if the -i or --ignore-hidden flag is used.

Crate Library

Add paq to project dependencies in Cargo.toml.

Use Library

use paq;

let source = std::path::PathBuf::from("/path/to/source");
let ignore_hidden = true; // .dir or .file
let source_hash: paq::ArrayString<64> = paq::hash_source(&source, ignore_hidden);

println!("{}", source_hash);

Hash Example Directory

use paq;

let source = std::path::PathBuf::from("example");
let ignore_hidden = true;
let source_hash: paq::ArrayString<64> = paq::hash_source(&source, ignore_hidden);

assert_eq!(&source_hash[..], "a593d18de8b696c153df9079c662346fafbb555cc4b2bbf5c7e6747e23a24d74");

Expect different results if ignore_hidden is set to false.

Content Limitations

Hashes are generated using file system content as input data to the blake3 hashing algorithm.

By design, paq does NOT include file system metadata in hash input such as:

  • File modes
  • File ownership
  • File modification and access times
  • File ACLs and extended attributes
  • Hard links
  • Symlink target contents (target path is hashed)

Additionally, files and directory contents whose names start with a dot (full stop) can optionally be ignored.
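As an illustration of the hidden-path rule, here is a minimal sketch of how such a check could look. The `is_hidden` helper is hypothetical and is not taken from the paq source; it assumes a path counts as hidden when any of its named components starts with a dot, which may differ from what paq actually does.

```rust
use std::path::{Component, Path};

/// Hypothetical helper (not part of paq): returns true if any named
/// component of `path` starts with a dot. The special components `.`
/// and `..` are not `Component::Normal`, so they are never counted.
fn is_hidden(path: &Path) -> bool {
    path.components().any(|component| match component {
        Component::Normal(name) => name.to_string_lossy().starts_with('.'),
        _ => false,
    })
}

fn main() {
    for candidate in ["./example/file.txt", "example/.hidden", ".git/config"] {
        println!("{} hidden: {}", candidate, is_hidden(Path::new(candidate)));
    }
}
```

Checking every component, rather than only the file name, means that files inside a dot-prefixed directory are skipped as well, which matches the "directory contents" wording above.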

How it Works

  1. Recursively get path(s) for a given source argument.
  2. Hash each path, together with the file contents if the path points to a file.
  3. Sort the list of hashes for consistent ordering.
  4. Compute the final hash by hashing the list of hashes.
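The steps above can be sketched in a few lines. This is a minimal sketch, not paq's implementation: the standard library's `DefaultHasher` (which is not cryptographic) stands in for blake3 so the example has no dependencies, the `combine` function is a hypothetical name, and in-memory (path, contents) pairs stand in for the file system walk of step 1.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical sketch of steps 2 to 4. `DefaultHasher` is only a
/// stand-in for blake3 and must not be used where security matters.
fn combine(entries: &[(&str, &[u8])]) -> u64 {
    // Step 2: hash each path together with its contents.
    let mut hashes: Vec<u64> = entries
        .iter()
        .map(|(path, contents)| {
            let mut hasher = DefaultHasher::new();
            hasher.write(path.as_bytes());
            hasher.write(contents);
            hasher.finish()
        })
        .collect();

    // Step 3: sort so the result does not depend on traversal order.
    hashes.sort_unstable();

    // Step 4: hash the sorted list of hashes.
    let mut hasher = DefaultHasher::new();
    for hash in &hashes {
        hasher.write(&hash.to_le_bytes());
    }
    hasher.finish()
}

fn main() {
    let forward = [
        ("a.txt", "alpha".as_bytes()),
        ("dir/b.txt", "beta".as_bytes()),
    ];
    let reversed = [forward[1], forward[0]];
    // The same entries in a different order give the same final hash.
    assert_eq!(combine(&forward), combine(&reversed));
    println!("{:016x}", combine(&forward));
}
```

Sorting the per-path hashes before the final pass is what makes the result independent of the order in which the file system returns entries.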

License

MIT

paq's People

Contributors

gregl83

paq's Issues

Finalize paq hashing algorithm

Several hashing algorithms were experimented with when prototyping paq.

Ultimately, a consistent and extensible algorithm needs to be chosen and documented in the package README.

Considerations:

  • file/directory path hashes
  • file/directory metadata hashes
  • leaf hash value sorting if tree
  • merkle tree hash (bit of future proofing)
  • performance/scaling

Git has been a source of reference for determining how/what to hash.

Write integration tests asserting final hashing algorithm

The prototype of paq was built rapidly as an experiment without tests.

This package shouldn't be considered for real-world use until tests have been added verifying the documented algorithm is functioning as expected.

The tests should function as acceptance criteria / business requirements and use the test host file system (IO).

Unsupported reparse point for Windows Bazel links

Running Bazel builds on Windows produces directory links (e.g. bazel-bin) that cause failed reparse point errors in the Rust standard library's fs::read_link function.

Error:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Uncategorized, message: "Unsupported reparse point type" }', src/lib.rs:69:63
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The Bazel build was performed in a Docker container using a volume. The container was created from a Linux-based Bazel image.

More investigation required in order to support these links.

update rayon iterator for fast failures

Rayon is used to parallelize iterating over source paths and hashing file contents; however, if an operating system failure (e.g., locked file) occurs the iterator doesn't immediately fail.

Rayon provides the panic_fuse() function on its iterators, which terminates the iterator threads when a panic occurs. Using it would make paq fail fast: the attempt to produce a hash still fails, but without delay.

This may cause a near-negligible performance degradation, which MUST be tested thoroughly if the feature is implemented.

Refactor to use thread pool

Directory traversal should be refactored to use a thread pool rather than walkdir. The pool should perform traversing and hashing at the same time.

Results will be the same since hashes are sorted prior to producing a final hash.

This should improve performance substantially on large file system trees.
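The claim that results stay the same can be illustrated with a small sketch. It is not the proposed refactor: it uses scoped threads from the standard library rather than a real thread pool, it hashes in-memory strings instead of traversing a file system, and `DefaultHasher` stands in for blake3. The `digest` and `parallel_hashes` names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::thread;

/// Stand-in digest; paq itself uses blake3.
fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    hasher.write(bytes);
    hasher.finish()
}

/// Hypothetical sketch: hash `items` on up to `workers` scoped threads,
/// then sort so the output does not depend on scheduling or chunking.
fn parallel_hashes(items: &[&str], workers: usize) -> Vec<u64> {
    let workers = workers.max(1);
    // Ceiling division, with a minimum of 1 because chunks(0) panics.
    let chunk_size = ((items.len() + workers - 1) / workers).max(1);

    let mut hashes: Vec<u64> = thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = items
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|item| digest(item.as_bytes()))
                        .collect::<Vec<u64>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    });

    hashes.sort_unstable();
    hashes
}

fn main() {
    let items = ["a", "b", "c", "d", "e"];
    // One worker or four workers: the sorted hash list is identical.
    assert_eq!(parallel_hashes(&items, 1), parallel_hashes(&items, 4));
    println!("{} hashes computed", parallel_hashes(&items, 4).len());
}
```

Because the list is sorted before the final hash is computed, the number of workers and the order in which they finish do not influence the outcome.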

Add error handling with messaging

Errors aren't handled in the prototype of paq.

Statements potentially resulting in errors need to be identified and respective messaging added.
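One possible shape for this, shown as a hedged sketch rather than a proposal for the actual paq API: replace an `unwrap()` on a file read with a `Result` whose error records the offending path. The `HashError` type and `read_contents` function are hypothetical names introduced only for illustration.

```rust
use std::error::Error;
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type pairing an I/O failure with its path.
#[derive(Debug)]
struct HashError {
    path: PathBuf,
    source: io::Error,
}

impl fmt::Display for HashError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to read {}: {}", self.path.display(), self.source)
    }
}

impl Error for HashError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Reads a file, returning a descriptive error instead of panicking.
fn read_contents(path: &Path) -> Result<Vec<u8>, HashError> {
    fs::read(path).map_err(|source| HashError {
        path: path.to_path_buf(),
        source,
    })
}

fn main() {
    match read_contents(Path::new("does/not/exist")) {
        Ok(bytes) => println!("read {} bytes", bytes.len()),
        Err(err) => eprintln!("{}", err),
    }
}
```

Carrying the path in the error lets a caller report which file could not be read, for example a file locked by another process, instead of aborting with a panic.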

add support for full file tree output

Add support to output each hash in the file tree.

Decide whether to use a flag to alter the default output or to write the contents to an additional file.

Output should list one hash and path per line, tab-separated and sorted by path.
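A minimal sketch of the requested output format follows. The `format_tree` function is a hypothetical name, and the short hash strings are placeholders rather than real paq output.

```rust
/// Hypothetical formatter: one "hash<TAB>path" line per entry, sorted
/// by path. Each input entry is a (path, hash) pair.
fn format_tree(entries: &[(&str, &str)]) -> String {
    let mut sorted: Vec<(&str, &str)> = entries.to_vec();
    sorted.sort_by(|a, b| a.0.cmp(b.0));
    sorted
        .iter()
        .map(|(path, hash)| format!("{}\t{}\n", hash, path))
        .collect()
}

fn main() {
    // Placeholder hashes, not real digests.
    let entries = [("b.txt", "22"), ("a.txt", "11")];
    print!("{}", format_tree(&entries));
}
```

Sorting by path rather than by hash keeps the listing stable when file contents change, so two listings can be compared line by line.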

Windows benchmark and issues

I needed to know whether the contents of a folder had changed since the last time I checked.

The first run produced errors like the following, and it must have taken 4 minutes:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: Uncategorized, message: "The process cannot access the file because it is being used by another process." }', src/lib.rs:79:58
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: Uncategorized, message: "The process cannot access the file because it is being used by another process." }', src/lib.rs:79:58
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 32, kind: Uncategorized, message: "The process cannot access the file because it is being used by another process." }', src/lib.rs:79:58

The second run seems to have taken 1 second. It produced the same errors, and neither run returned a hash.

Update file hashes to use sort_unstable

Rust's sort_unstable on vectors can perform better than the sort method currently used.

There will not be duplicate hashes, so equal elements will not occur. Even if they did, equal hashes are interchangeable, so their relative order wouldn't affect the combined output hash.
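The following sketch shows why the switch is safe for this use case. The `sort_both` helper is hypothetical and the short strings are placeholders for real hashes.

```rust
/// Hypothetical helper: returns the input sorted with the stable sort
/// and with the unstable sort, so the two results can be compared.
fn sort_both(hashes: &[&str]) -> (Vec<String>, Vec<String>) {
    let mut stable: Vec<String> = hashes.iter().map(|h| h.to_string()).collect();
    let mut unstable = stable.clone();
    stable.sort();
    unstable.sort_unstable();
    (stable, unstable)
}

fn main() {
    // A duplicate is included on purpose: equal strings are
    // indistinguishable, so their relative order cannot be observed.
    let (stable, unstable) = sort_both(&["c3", "a1", "b2", "a1"]);
    assert_eq!(stable, unstable);
    println!("{:?}", unstable);
}
```

Stability only matters when equal elements carry extra data that distinguishes them, which is not the case for plain hash strings.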
