Comments (7)
Uhm, there seems to be something fishy and more general going on when reading from non-regular files.
While trying to create a minimal reproducible test case for this issue I've encountered this:
$ for i in $(seq 1 1000000) ; do echo $i ; done > foo.txt
$ wc -l foo.txt
1000000 foo.txt
$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p 2 --external -m 16 -i foo.txt -o foo.mph
{"n": "1000000", "c": "5.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "2", "num_threads": "1", "external_memory": "true", "partitioning_seconds": "0.053000", "mapping_ordering_seconds": "0.137000", "searching_seconds": "0.161000", "encoding_seconds": "0.004000", "total_seconds": "0.355000", "pt_bits_per_key": "2.458560", "mapper_bits_per_key": "0.417920", "bits_per_key": "2.876480", "nanosec_per_key": "0.000000"}
2023-11-29 10:16:54: saving data structure to disk...
2023-11-29 10:16:54: DONE
$ ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p 2 --external -m 16 -i <(cat foo.txt) -o foo.mph
terminate called after throwing an instance of 'std::runtime_error'
what(): blank line detected after reading 999108 non-empty lines
[1] 272021 abort ./pthash-build --minimal -c 5 -a 0.94 -e dictionary_dictionary -n 1000000 -p
The only difference between the first and second run is -i foo.txt versus -i <(cat foo.txt). The first run works. The second fails, complaining that there is a blank line in the input stream, a blank line which is definitely not there.
from pthash.
Hi @zacchiro,
I never tried with input redirection, e.g., -i <(cat foo.txt). Actually, I wonder if this is generally legitimate. I mean: option -i expects to open a file, whereas with cat (or zcat, etc.) we are directly providing a stream. So I'm not sure these two things are exactly the same.
And, apparently, they aren't. Indeed, it seems that the redirection causes the stream to read an extra blank line.
Does this hold true for any size? E.g., 10 instead of 1000000?
Does this hold true for any size? E.g., 10 instead of 1000000?
Yes, I've tried with 100 keys and the file is not read properly if -i <(cat foo.txt) is used, whereas it works just fine with the regular syntax -i foo.txt. I think the best thing to do would be to provide an iterator over gzipped files and/or over zstd-compressed files.
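To make the idea of an iterator over gzipped files concrete, here is a minimal sketch in Python rather than in PTHash's own C++. The name `iter_gzip_keys` is hypothetical and is not part of PTHash; the sketch only assumes one key per line, as in foo.txt above, and it does not cover zstd.

```python
import gzip


def iter_gzip_keys(path):
    """Yield one key per line from a gzip-compressed text file.

    Hypothetical helper for illustration only; it is not PTHash's API.
    The file is decompressed and read strictly sequentially.
    """
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            # Drop only the trailing newline so the key itself is untouched.
            yield line.rstrip("\n")
```

Because it is a generator, the keys are produced one at a time and the decompressed file never has to fit in memory.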
Yes, I've tried with 100 keys and the file is not read properly if -i <(cat foo.txt) is used, whereas it works just fine with the regular syntax -i foo.txt. I think the best thing to do would be to provide an iterator over gzipped files and/or over zstd-compressed files.
I think you're right, it's just a pity to have to explicitly support multiple compression formats.
Also, it's quite weird. For context, what process substitution does is fork, run the process, and redirect its output to a FIFO (e.g., /dev/fd/XXX); it then passes the name of the FIFO to -i. The only particularity of that file from the point of view of build is that it's a non-seekable file, so in theory, if you just read it from beginning to end, it should not affect the build process. (That's why mmap() was failing in #18.) So either build is doing more than sequential reading here, or I'm missing something... Oh well.
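The "non-seekable file" point can be checked directly. Below is a minimal Python sketch (not PTHash code) that classifies an already-open stream; `describe_stream` is a hypothetical name, and the sketch assumes a POSIX system where `os.fstat` reports pipes and FIFOs through `stat.S_ISFIFO`.

```python
import os
import stat


def describe_stream(stream):
    """Return (kind, seekable) for an already-open file object.

    Hypothetical helper for illustration only.
    """
    mode = os.fstat(stream.fileno()).st_mode
    if stat.S_ISREG(mode):
        kind = "regular"
    elif stat.S_ISFIFO(mode):
        # True for named FIFOs and for anonymous pipes alike.
        kind = "fifo"
    else:
        kind = "other"
    # seekable() reports False for pipes and FIFOs.
    return kind, stream.seekable()
```

Opening the path given to -i and passing the resulting file object to this function should give ("regular", True) for foo.txt and ("fifo", False) for a process substitution.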
Closing this in favor of #20. Sorry for the noise, it was worth a shot :-)
So either build is doing more than sequential reading here...
No, when using external memory, the input file is read once to calculate all the hashes of the input keys. So we just need sequential iteration. Indeed, it works when used as -i filename.
So there must be some subtle difference going on somewhere. I'll investigate the difference between -i filename and -i <(cat filename). This difference, whatever it is, does not seem related to PTHash or to the iterator object I wrote yesterday.
But please, correct me if you think I'm wrong!
Best,
-Giulio
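One way to test the "sequential iteration should be enough" expectation in isolation is to read the same keys once from the regular file and once through a pipe, then compare. The following is a minimal Python sketch, not PTHash's reader: `read_keys` and `read_keys_via_pipe` are hypothetical names, the error text is borrowed from the transcript above purely for illustration, and an anonymous pipe fed by a thread stands in for `<(cat filename)`.

```python
import os
import shutil
import threading


def read_keys(stream, n):
    """Read exactly n non-empty lines sequentially (hypothetical strict reader)."""
    keys = []
    for _ in range(n):
        line = stream.readline()
        if line.strip() == "":
            # Covers both a blank line and a premature end of input.
            raise RuntimeError(
                f"blank line detected after reading {len(keys)} non-empty lines")
        keys.append(line.rstrip("\n"))
    return keys


def read_keys_via_pipe(path, n):
    """Feed the file through an anonymous pipe and read it back with read_keys."""
    read_fd, write_fd = os.pipe()

    def feed():
        # Plays the role of `cat`: copy the file into the write end, then close it.
        with open(path, "rb") as src, os.fdopen(write_fd, "wb") as dst:
            shutil.copyfileobj(src, dst)

    writer = threading.Thread(target=feed)
    writer.start()
    try:
        with os.fdopen(read_fd, "r", encoding="utf-8") as stream:
            return read_keys(stream, n)
    finally:
        writer.join()
```

If both paths return identical lists for the same input, plain sequential reading is not what introduces the blank line, which would point the investigation at something else in how the stream is opened or consumed.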