

carlini commented on August 29, 2024

That's weird. I don't know what that error is... it looks like something to do with destroying the Pool object? If you're not tokenizing the dataset, you could comment out the line that creates the pool and also comment out this line:

text = p.map(tok,text) 
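
A minimal sketch of that workaround inside scripts/load_dataset.py (hedged: the variable names and pool setup are assumed from the snippet above, not copied from the repo):

# Before:
# p = Pool(num_workers)        # creates the worker pool (num_workers is hypothetical)
# text = p.map(tok, text)      # tokenizes every document in parallel
#
# After (workaround): both lines commented out, so `text` remains the
# raw, untokenized list of documents and is written out as-is.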


carlini commented on August 29, 2024

Ah, sorry, bad documentation. You want to run

cargo run collect_similar lm1b.test

with just the name of the dataset.


zijwang commented on August 29, 2024

Thanks! Now the output is different:

cargo run collect_similar lm1b.test
warning: function is never used: `get_example_index`
   --> src/main.rs:447:4
    |
447 | fn get_example_index(table:&[u64], position:u64) -> usize{
    |    ^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: unused `Result` that must be used
   --> src/main.rs:367:2
    |
367 |     tablestream.file.read_exact(&mut tablestream.cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unused_must_use)]` on by default
    = note: this `Result` may be an `Err` variant, which should be handled

warning: unused `Result` that must be used
   --> src/main.rs:379:2
    |
379 |     file.read_exact(&mut cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this `Result` may be an `Err` variant, which should be handled

warning: 3 warnings emitted

    Finished dev [optimized + debuginfo] target(s) in 0.02s
     Running `target/debug/dedup_dataset collect_similar lm1b.test`
Done 134
Done 293
Done 151
Done 159
Done 337
Done 420
...
Done 453696
Done 453699
Done 453697
Done 453697
Sorting.
Sorted.
out 0 43555297

Is it normal to have out 0 43555297? And where can I see the result?


carlini commented on August 29, 2024

That's a weird output. It means that it thinks every token from 0 to 43555297 is duplicated.
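
A quick, hypothetical sanity check: if literally everything is being flagged as duplicated, that end offset should be close to the dataset's size in bytes:

import os
# 43555297 should roughly match the byte length of the processed file
print(os.path.getsize("data/lm1b.test"))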

I just ran the following commands on a fresh clone on a new VM and got the expected result:

python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name lm1b --split test
cargo build
python3 scripts/make_suffix_array.py data/lm1b.test
cargo run selfsimilar_parallel data/lm1b.test
cargo run collect_similar

and the output starts

out 185290 185564
424047 424148
482724 482824
534603 534717
...

Could you try deleting any files you have in /tmp that begin with dups_? That's the intermediate state that selfsimilar_parallel spits out; after that, re-run the last two commands.
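
For example, a small sketch that clears them (a plain shell rm over /tmp/dups_* does the same thing):

import glob
import os

# Delete the stale intermediate state before re-running both commands.
for path in glob.glob("/tmp/dups_*"):
    os.remove(path)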

If you want to check the internal state, you can run

$ head /tmp/dups_lm1b.test_0-453700
7350692 6269097
12370922 40124734
30776035 10896071
16363080 41299798
8765812 11709544
2149078 36682546
...


zijwang commented on August 29, 2024

Let me try it again. Somehow downloading LM1B is extremely slow (downloading wiki40b completes in minutes while LM1B takes a few hours).

Btw have you seen this error when running load_dataset?

$ python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name wiki40b --split train
...
Exception ignored in: <function Pool.__del__ at 0x7f85a1186ca0>
...
AttributeError: 'NoneType' object has no attribute 'dumps'


zijwang commented on August 29, 2024

Apart from that, I used a fresh clone and tested it with wiki40b/test. The results stay the same (Final answer 0 when running selfsimilar_parallel) :(

Commands I ran:

cd deduplicate-text-datasets/
cargo build
python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name wiki40b --split test
python3 scripts/make_suffix_array.py data/wiki40b.test
cargo run selfsimilar_parallel data/wiki40b.test
cargo run collect_similar


carlini commented on August 29, 2024

I don't know where that "dumps" error comes from; do you have a traceback? I don't see a line that says "dumps" anywhere in the code.

"Final answer 0" is exepcted, that's not meaningful. The interesting part is the output of collect_similar. (And you should run collect_similar wiki40b.test, not just collect_similar.)


zijwang commented on August 29, 2024

Yes. Here is the full log:

 python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name wiki40b --split test
2021-09-03 19:19:10.186719: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-09-03 19:19:10.186783: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
INFO:absl:No config specified, defaulting to first: wiki40b/en
INFO:absl:Load dataset info from /home/ubuntu/tensorflow_datasets/wiki40b/en/1.3.0
INFO:absl:Field info.config_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.config_description from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.splits from disk and from code do not match. Keeping the one from code.
INFO:absl:Field info.module_name from disk and from code do not match. Keeping the one from code.
INFO:absl:Reusing dataset wiki40b (/home/ubuntu/tensorflow_datasets/wiki40b/en/1.3.0)
INFO:absl:Constructing tf.data.Dataset wiki40b for split test, from /home/ubuntu/tensorflow_datasets/wiki40b/en/1.3.0
2021-09-03 19:19:12.347217: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-09-03 19:19:12.347295: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-03 19:19:12.347306: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ip-172-31-20-98): /proc/driver/nvidia/version does not exist
2021-09-03 19:19:12.348145: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
<PrefetchDataset shapes: {text: (None,), version_id: (None,), wikidata_id: (None,)}, types: {text: tf.string, version_id: tf.string, wikidata_id: tf.string}>
2021-09-03 19:19:13.114712: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
0
1
2
Exception ignored in: <function Pool.__del__ at 0x7fb33d18bb80>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
  File "/home/ubuntu/anaconda3/lib/python3.8/multiprocessing/queues.py", line 362, in put
AttributeError: 'NoneType' object has no attribute 'dumps'


zijwang commented on August 29, 2024

I ran the code on another machine, and it seems to work for wiki40b.test, even though the AttributeError still shows up. Thanks for helping out! Could you also briefly explain how to interpret the result and get the final deduplicated dataset after running cargo run collect_similar wiki40b.test? For example, I saw

$ head /tmp/dups_wiki40b.test_0-5383036
95331546 92633740
95331527 92633721
95331516 92633710
95331592 92633786
95331567 92633761
95331578 92633772
95331555 92633749
95331536 92633730
384940843 334797560

but how can I translate these indices back into the original docs? Also, it looks like the dups_* file does not map 1-1 to the data, e.g., the first data file of wiki40b.test is wiki40b.test.0-1292928931.


carlini commented on August 29, 2024

Yeah, once you run collect_similar, the output is byte pairs a b, meaning the bytes from a to b should be deleted from the original wiki40b dataset.
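
To make that concrete, here is a minimal sketch of applying those ranges; remove_ranges is a hypothetical helper, not part of the repo, and it assumes data/wiki40b.test is the same concatenated byte stream the suffix array was built over:

def remove_ranges(data, ranges):
    # Keep every byte that falls outside the sorted (a, b) deletion ranges.
    out, prev = [], 0
    for a, b in sorted(ranges):
        a = max(a, prev)            # tolerate overlapping ranges
        out.append(data[prev:a])
        prev = max(prev, b)
    out.append(data[prev:])
    return b"".join(out)

with open("data/wiki40b.test", "rb") as f:
    data = f.read()

# (a, b) byte pairs in the format collect_similar prints:
ranges = [(185290, 185564), (424047, 424148)]
deduped = remove_ranges(data, ranges)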

The dups_ output is a list of ints a0 a1 a2 ..., where bytes a0..a0+100 are equal to bytes a1..a1+100, and so on.
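
And a hedged sanity check of one dups_ entry, assuming each line holds one pair of matching offsets as in the head output above:

# Offsets taken from the first line of the head output above.
with open("data/wiki40b.test", "rb") as f:
    data = f.read()

a0, a1 = 95331546, 92633740
# Per the description above, the 100-byte spans at both offsets match.
assert data[a0:a0+100] == data[a1:a1+100]
print(data[a0:a0+100])  # inspect the duplicated text itself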

