
Comments (7)

zacchiro commented on June 3, 2024

Uhm, there seems to be something fishy and more general going on when reading from non-regular files.
While trying to create a minimal reproducible test case for this issue I've encountered this:

$ for i in $(seq 1 1000000) ; do echo $i ; done > foo.txt
$ wc -l foo.txt
1000000 foo.txt

$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p 2 --external -m 16 -i foo.txt -o foo.mph
{"n": "1000000", "c": "5.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "2", "num_threads": "1", "external_memory": "true", "partitioning_seconds": "0.053000", "mapping_ordering_seconds": "0.137000", "searching_seconds": "0.161000", "encoding_seconds": "0.004000", "total_seconds": "0.355000", "pt_bits_per_key": "2.458560", "mapper_bits_per_key": "0.417920", "bits_per_key": "2.876480", "nanosec_per_key": "0.000000"}
2023-11-29 10:16:54: saving data structure to disk...
2023-11-29 10:16:54: DONE

$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p 2 --external -m 16 -i <(cat foo.txt) -o foo.mph
terminate called after throwing an instance of 'std::runtime_error'
  what():  blank line detected after reading 999108 non-empty lines
[1]    272021 abort      ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p 

The only difference between the first and second run is -i foo.txt versus -i <(cat foo.txt). The first run works. The second fails, complaining about a blank line in the input stream that is definitely not there.

from pthash.

jermp commented on June 3, 2024

Hi @zacchiro,
I never tried with input redirection, e.g., -i <(cat foo.txt). Actually, I wonder whether this is legitimate in general. I mean: option -i expects to open a file, whereas with cat (or zcat, etc.) we are directly providing a stream. So I'm not sure these two things are exactly the same.

And, apparently, they aren't. Indeed, it seems that the redirection causes the stream to read an extra blank line.
Does this hold true for any size? E.g., 10 instead of 1000000?


jermp commented on June 3, 2024

Does this hold true for any size? E.g., 10 instead of 1000000?

Yes, I've tried with 100 keys and the file is not read properly if -i <(cat foo.txt) is used. Whereas, it works just fine with regular syntax -i foo.txt. I think the best thing to do would be to provide an iterator over gzipped files and/or over zstd-compressed files.
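To make the idea of an iterator over gzipped files concrete, here is a minimal Python sketch. It is only an illustration of the approach, not pthash code: `iter_gzip_lines` is a hypothetical name, and the temporary file stands in for a real compressed key file.

```python
import gzip
import os
import tempfile


def iter_gzip_lines(path):
    """Yield each line of a gzip-compressed file without its trailing newline.

    Hypothetical helper: the file is decompressed as it is read,
    front to back, so no seeking is needed.
    """
    with gzip.open(path, "rb") as stream:
        for raw in stream:
            yield raw.rstrip(b"\n")


# Build a small compressed input (keys 1..100, one per line) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.txt.gz")
    with gzip.open(path, "wb") as out:
        out.write(b"".join(b"%d\n" % i for i in range(1, 101)))
    keys = list(iter_gzip_lines(path))
```

A zstd variant would have the same shape, with the decompressing reader swapped in for `gzip.open`.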


zacchiro commented on June 3, 2024

Yes, I've tried with 100 keys and the file is not read properly if -i <(cat foo.txt) is used. Whereas, it works just fine with regular syntax -i foo.txt. I think the best thing to do would be to provide an iterator over gzipped files and/or over zstd-compressed files.

I think you're right, it's just a pity to have to explicitly support multiple compression formats.

Also, it's quite weird. For context, what process substitution does is to fork, run the process, and redirect its output to a FIFO (e.g., /dev/fd/XXX), then it passes the name of the FIFO to -i. The only particularity of that file from the point of view of build is that it's a non-seekable file, so in theory if you just read it from beginning to end it should not affect the build process. (That's why mmap() was failing in #18.) So either build is doing more than sequential reading here, or I'm missing something... Oh well.
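As a rough illustration of the non-seekable point, the Python sketch below reads keys from an anonymous pipe, which I am assuming behaves like the FIFO handed to -i for this purpose. `make_pipe_reader` is a hypothetical helper and nothing here is taken from pthash.

```python
import os
import threading


def make_pipe_reader(lines):
    """Return a binary file object that reads `lines` from an anonymous pipe."""
    read_fd, write_fd = os.pipe()

    def produce():
        # Writer side: plays the role of `cat foo.txt` in the substituted process.
        with os.fdopen(write_fd, "wb") as writer:
            for line in lines:
                writer.write(line + b"\n")

    threading.Thread(target=produce, daemon=True).start()
    return os.fdopen(read_fd, "rb")


keys = [str(i).encode() for i in range(1, 1001)]
with make_pipe_reader(keys) as reader:
    is_seekable = reader.seekable()  # a pipe cannot be repositioned
    # One forward pass, reading until the writer closes its end.
    received = [line.rstrip(b"\n") for line in reader]
```

Here a purely sequential reader sees every key exactly once and no blank line, even though the stream reports itself as non-seekable.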


zacchiro commented on June 3, 2024

Closing this in favor of #20. Sorry for the noise, it was worth a shot :-)


jermp commented on June 3, 2024

So either build is doing more than sequential reading here...

No, when using external memory, the input file is read once to calculate all the hashes of the input keys. So we just need sequential iteration. Indeed, it works when used as -i filename.
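A single sequential pass of this kind could be sketched in Python as follows. This is an assumption-laden illustration, not the actual implementation: `hash_keys_single_pass` is a hypothetical name, BLAKE2b is only a stand-in for whatever hash PTHash uses, and the error text simply mirrors the message reported earlier in this thread.

```python
import hashlib
import io


def hash_keys_single_pass(stream, num_keys):
    """Read `num_keys` keys from `stream` front to back and hash each one.

    The stream is never rewound or repositioned, so a non-seekable
    input should be acceptable for this access pattern.
    """
    hashes = []
    for raw in stream:
        key = raw.rstrip(b"\n")
        if not key:
            raise RuntimeError(
                f"blank line detected after reading {len(hashes)} non-empty lines"
            )
        # Stand-in 64-bit hash; an assumption, not PTHash's hash function.
        digest = hashlib.blake2b(key, digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "little"))
        if len(hashes) == num_keys:
            break
    return hashes


hashes = hash_keys_single_pass(io.BytesIO(b"1\n2\n3\n"), 3)

try:
    hash_keys_single_pass(io.BytesIO(b"1\n\n3\n"), 3)
    blank_line_error = None
except RuntimeError as exc:
    blank_line_error = str(exc)
```

With a loop like this, a blank-line error can only come from an empty line actually delivered by the stream, which is why the failure on `<(cat foo.txt)` is surprising.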

So there must be some subtle difference going on somewhere. I'll investigate the difference between -i filename and -i <(cat filename). This difference, whatever it is, does not seem related to PTHash or to the iterator object I wrote yesterday.
But please, correct me if you think I'm wrong!

Best,
-Giulio


