
dedup's Issues

benchmark performance

`$ python3 benchrunner
Downloading test corpus...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  124M  100  124M    0     0  3425k      0  0:00:37  0:00:37 --:--:-- 4143k
Unpacking test corpus...
Concatenating files in to single blob...
Building dedup...
Finished release [optimized] target(s) in 0.0 secs

awk : mean: 4.988 median: 4.803

dedup : mean: 2.337 median: 2.284

sort : mean: 17.03 median: 17.32
`

benchmark fails to run

After running `cargo install`, which places the dedup executable in ~/.cargo/bin/dedup, the benchrunner fails to set the path/current working directory correctly:

FileNotFoundError: [Errno 2] No such file or directory: 'sort -u europarl.txt'
FileNotFoundError: [Errno 2] No such file or directory: 'dedup europarl.txt'
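
In both errors the whole command string (`'sort -u europarl.txt'`) is reported as the missing file. Python's `subprocess` module behaves this way when it receives a command as a single string without `shell=True`. The benchrunner's source is not shown here, so the following is only a minimal sketch of one possible fix, assuming it launches commands through `subprocess`. The helper names `split_command` and `run_benchmark_command` are hypothetical.

```python
import shlex
import subprocess


def split_command(command):
    """Split a command string into an argument list.

    'sort -u europarl.txt' becomes ['sort', '-u', 'europarl.txt'].
    """
    return shlex.split(command)


def run_benchmark_command(command, cwd=None):
    """Run a command given as one string, without going through a shell.

    Passing the split list (not the raw string) makes subprocess look up
    the first element on PATH instead of treating the entire string as a
    file name. `cwd` is the directory that contains the corpus file.
    """
    return subprocess.run(
        split_command(command),
        cwd=cwd,
        stdout=subprocess.DEVNULL,  # discard the deduplicated output
        check=True,                 # raise if the command exits non-zero
    )
```

With this approach `dedup` still has to be discoverable on `PATH`, so `~/.cargo/bin` would need to be included there.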

How much memory does it use?

I need to dedupe line-separated records in a very large file (hundreds of gigs, each record a few KB). This looks like it could be useful for me, but I don't know how much RAM it uses. For stream mode it has to store the actual lines, which is understandable, but what about mmap mode?
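
For context on the memory question, here is a rough sketch of a general technique for bounding per-record memory: keep a fixed-size digest of each distinct line instead of the line itself. This is not a description of how dedup works internally. The function name `dedup_lines_by_digest` and the 16-byte digest size are assumptions chosen for illustration.

```python
import hashlib


def dedup_lines_by_digest(lines, digest_size=16):
    """Yield each distinct line once, in first-seen order.

    Only a fixed-size BLAKE2b digest of every distinct line is kept in
    memory, not the line itself. Assumption: digest collisions are
    treated as negligible, so two different lines that happened to share
    a digest would wrongly be considered duplicates.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(line, digest_size=digest_size).digest()
        if digest not in seen:
            seen.add(digest)
            yield line
```

A file opened in binary mode can be passed in directly, since iterating over it yields one line at a time. Memory use then grows with the number of distinct records rather than with their length, plus whatever per-object overhead the language runtime adds.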

warning on cargo build

warning: variant is never constructed: `Other`
 --> src/error.rs:9:5
  |
9 |     Other,
  |     ^^^^^
  |
  = note: #[warn(dead_code)] on by default

benchmark results

`python3 benchrunner

Downloading test corpus...

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
100  124M  100  124M    0     0  1956k      0  0:01:04  0:01:04 --:--:-- 2747k

Unpacking test corpus...

Concatenating files in to single blob...

/bin/sh: dedup: command not found

/bin/sh: dedup: command not found

/bin/sh: dedup: command not found

/bin/sh: dedup: command not found

sort | uniq: [19.610267162322998, 19.310322523117065, 19.74545121192932] mean: 19.555346965789795

dedup: [0.0023207664489746094, 0.0020852088928222656, 0.0018146038055419922] mean: 0.002073526382446289
`
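
The `dedup: command not found` lines in the output above suggest that the roughly 2 ms `dedup` timings measure a failed launch rather than real work. A hedged sketch of how a timing loop could guard against that, assuming the benchmark starts commands with `subprocess` (the actual benchrunner code is not shown here); `time_command` is a hypothetical helper name.

```python
import shutil
import subprocess
import time


def time_command(argv, runs=3):
    """Time `argv` `runs` times and return the wall-clock durations.

    Timings are never reported for a command that could not be started
    or that exited with an error.
    """
    # Fail early if the executable is not on PATH (for example when
    # ~/.cargo/bin has not been added to it).
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError(f"{argv[0]!r} not found on PATH")

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed run cannot contribute a misleading timing.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings
```

Called as `time_command(["dedup", "europarl.txt"])`, this raises immediately when `dedup` is missing instead of printing a near-zero mean.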
