
deduplicate-text-datasets's Introduction

Deduplicating Training Data Makes Language Models Better

This repository contains code to deduplicate language model datasets as described in the paper "Deduplicating Training Data Makes Language Models Better" by Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch and Nicholas Carlini. We release the ExactSubstr deduplication implementation (written in Rust) along with the scripts we used in the paper to perform ExactSubstr deduplication and inspect the results (written in Python). We also release the document clusters resulting from running NearDup deduplication on C4, RealNews, LM1B, and Wiki-4B-en.

This is not an officially supported Google product.

Why deduplicate?

When datasets are created by scraping raw text from the Internet, the same sequences often end up repeated many times (e.g., we find a single 50-word sequence that is repeated in the C4 dataset 60,000 times). Training models on deduplicated datasets is faster (because they see fewer total examples) and experimentally results in models with similar or better perplexity than models trained on data that hasn't been deduplicated. Moreover, language models are less likely to exhibit memorization when their training data has been well-deduplicated.

Citing this work

If you use this repository or our deduplicated datasets, you can cite:

@inproceedings{lee2021deduplicating,
    title = "Deduplicating Training Data Makes Language Models Better",
    author = "Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison-Burch and Nicholas Carlini",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    year = "2022",
    publisher = "Association for Computational Linguistics"
}

Exact Deduplication Code

We provide an implementation of the exact deduplication technique used in the paper. This is very much research code: it works well for what we designed it to do (deduplicate text datasets), but it might not directly do what you want it to do. We did clean it up fairly significantly for a Version 1.0.0 release (see below for release history). If you want to deduplicate small (<10GB) datasets, it should work on any modern machine with ~16GB of RAM and a few CPU cores. As always, bigger machines are better. If you want to deduplicate something the size of C4 (~300GB) you will want a machine with as many cores as you can get (we used 96 cores) and >600GB of RAM. You will also need >1TB of hard drive space. If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).

We build a suffix array (based on Andrew Gallant's suffix array implementation) in src/table.rs. It has some minor changes from the original version that make it so we can't just import this library as a crate. First, we need 64-bit integers. The original implementation says that u32 works for "reasonably sized documents (~4GB)" but we're working with unreasonably sized documents. So we need u64. Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8. The main complication in the rest of src/main.rs is the fact that we want things to run in parallel, and we probably can't fit the entire suffix array into memory. And so all of our algorithms are designed around these constraints.
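If you haven't worked with suffix arrays before, the following toy Python sketch shows what the data structure is. It is not how the Rust code builds it (sorting suffixes directly is far too slow for real datasets); it's only to make the later steps easier to follow.

# Purely illustrative: a toy suffix array over a byte string. src/table.rs
# builds the real thing in linear time with 64-bit indices; sorting suffixes
# like this is only sensible for tiny inputs.
data = b"abracadabra"

# The suffix array is the list of start positions i, ordered so that the
# suffixes data[i:] are in lexicographic order.
suffix_array = sorted(range(len(data)), key=lambda i: data[i:])

for i in suffix_array:
    print(i, data[i:])

# Because the suffixes are sorted, every repeated substring shows up as a run
# of adjacent entries sharing a common prefix, and any query substring can be
# located by binary search.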

Installing

To run the rust deduplicator you will need to install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

You'll also need a C compiler; sudo apt-get install gcc will do that if you don't already have one.

If you additionally want to generate datasets to run the rust script on (and you probably do, at least to follow this demo) then you will need python dependencies:

pip3 install numpy scipy sentencepiece
pip3 install -r requirements-tf.txt

Basic Usage

This section walks through getting started with the code. Later we'll cover how to actually deduplicate a dataset; for now we'll just walk through the basics of how it works.

Start by running

cargo build

to compile the rust code, and then run

python3 scripts/load_dataset.py --data_dir $LOAD_DIR --save_dir $SAVE_DIR --name $DATASET --split $SPLIT [--tokenize]

For example, to get the Wiki40B test set (you should do this, to walk through the demo) run

python3 scripts/load_dataset.py --data_dir ~/tensorflow_datasets --save_dir data --name wiki40b --split test

This should take just a minute or so to run on the test set.

If the dataset is really big, you might want to add the --tokenize flag. This will shrink the dataset by roughly a factor of two by tokenizing it with the GPT-2 tokenizer.

This will create two files called data/wiki40b.test and data/wiki40b.test.size. The first file contains the entire Wiki40B test set smashed together, and the second file has the byte offset of where each individual training example begins, in sorted order.
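If you want to pull individual examples back out of the concatenated file, the following is a hedged sketch. It assumes the .size file stores little-endian unsigned 64-bit offsets; check scripts/load_dataset.py for the exact format your version writes.

# Hedged sketch: slice individual examples back out of the concatenated file
# using the .size offsets. Assumes little-endian uint64 offsets; check
# scripts/load_dataset.py for the exact on-disk format.
import numpy as np

data = open("data/wiki40b.test", "rb").read()
offsets = np.frombuffer(open("data/wiki40b.test.size", "rb").read(), dtype=np.uint64)

# Example i spans [offsets[i], offsets[i+1]); the final example runs to the end
# of the file (the filter drops a trailing empty slice if the last offset is len(data)).
bounds = [int(o) for o in offsets] + [len(data)]
examples = [data[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]

print(len(examples), "examples; first one starts:", examples[0][:80])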

From here we can now build a suffix array of this entire dataset that's now in a single file.

python3 scripts/make_suffix_array.py [path/to/dataset]

For example, to follow along (you should do this!), run python3 scripts/make_suffix_array.py data/wiki40b.test

This will create a file data/wiki40b.test.table.bin containing the suffix array. Again, this should be fast. The test set should process in about a minute.

(When running on larger files, if you get an error that you have too many open files, that's because this script opens lots of files. You should run ulimit -Sn 1000000 to "fix" the error. You might want to do this preemptively before hitting this crash after hour ten of the job.)

Querying a suffix array to find duplicated examples

We're not yet going to deduplicate a dataset. To start, let's just see how to count how often a particular example has been repeated. To do this, run

python3 scripts/count_occurrences.py --suffix [path/to/dataset] [--query query_string] [--query_file /path/to/query]

This should be very fast. Even when you run on a dataset that's hundreds of gigabytes, it should take a few seconds, most of which is dominated by Python starting up. The actual core lookup just requires O(log(dataset_size)) time, which is often on the order of milliseconds.

On the Wiki40B test set, running python3 scripts/count_occurrences.py --suffix data/wiki40b.test --query " on Tuesday" should return 289. If you tokenized the dataset, then you should pass --tokenize to count_occurrences.py as well, to get the same result (plus or minus tokenization differences).

If you want to confirm that this number is correct (assuming you haven't tokenized), you can run cat data/wiki40b.test | grep -ao " on Tuesday" | wc -l and get the same result (slower).
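For intuition, here is what that O(log) lookup is doing, as a small pure-Python sketch over a toy suffix array. This is not the repository's code (the real table.bin stores packed pointers), so use the Rust count-occurrences command for actual datasets.

# Toy sketch of counting occurrences with a suffix array: two binary searches
# bracket the contiguous block of suffixes that start with the query.
def count_occurrences(data: bytes, suffix_array, query: bytes) -> int:
    def search(strict: bool) -> int:
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = data[suffix_array[mid]:suffix_array[mid] + len(query)]
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return search(strict=True) - search(strict=False)

data = b"the cat sat on the mat on Tuesday, not on Wednesday"
sa = sorted(range(len(data)), key=lambda i: data[i:])  # toy suffix array
print(count_occurrences(data, sa, b" on "))  # 3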

Deduplicating a Dataset

Now let's explain how to deduplicate a dataset as we do in the paper. As a running example we'll continue with the Wiki40B test set.

Finding all repeated substrings within a document

The first step in deduplicating a dataset is identifying all substrings of a given length that are repeated more than some threshold number of times. To do this we run the self-similar command:

cargo run self-similar --data-file data/wiki40b.test --length-threshold 100 --cache-dir /tmp/cache --num-threads 8

For larger datasets, you may want to set --num-threads to as many cores as your machine has. It parallelizes perfectly, so there's no reason not to. For now though, keep it at 8 just for the sake of keeping things on track with this guide.

The output of this should be the string

Duplicates found: 3374227

This means that the deduplicator found 3,374,227 sequences of length 100 that existed somewhere else in the dataset. The length threshold here is entirely dataset-dependent. In our paper, we used 50 tokens (which is 100 bytes---so remember that if you pass --tokenize you'll need to double the number of bytes for the length threshold).

At this point the deduplicator will have dumped a bunch of files to a cache directory. There are two kinds of files here

  • /cache/dups_$DATASET_A-B
  • /cache/sizes_$DATASET_A-B

Each dups file is a list of pointers into the dataset that corresponds to sequences repeated multiple times. Each file has the duplicates that correspond to items A through B in the suffix array. There should be 28,464 total entries when added up across all of these files. The duplicates are all clustered together, so all duplicates of the same string should appear sequentially.

Each sizes file says how large the cluster sizes are. This is typically a small number.

All pointers are the same size, but the size of the pointers depends on the size of the dataset. We use the smallest pointer size that could address the entire dataset. For the LM1B test set, this is a 32-bit pointer. For the training set it would be a 40-bit pointer. For larger documents it might be 48 bits. This helps save memory on disk.
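For your own dataset, you can recover the pointer width from the size ratio of the two files, since the table stores exactly one pointer per byte of data:

# The suffix table stores one pointer per byte of data, so the pointer width
# in bytes is the size ratio of the two files.
import os

data_size = os.path.getsize("data/wiki40b.test")
table_size = os.path.getsize("data/wiki40b.test.table.bin")

assert table_size % data_size == 0, "table size should be a multiple of the data size"
print(f"{(table_size // data_size) * 8}-bit pointers")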

The above explanation might be confusing. Let's see an example. Let's find the first duplicate in the dataset:

$ xxd /tmp/cache/sizes_wiki40b.test_0-64596445 | head -n 1 
00000000: 0200 0000 0200 0000 0200 0000 0200 0000  ................
$ xxd /tmp/cache/dups_wiki40b.test_0-64596445 | head -n 1 
00000000: daa4 ae05 8c7a 8505 c7a4 ae05 797a 8505  .....z......yz

Recall these pointers are 32-bit pointers. You can determine this by checking the ratio in size between data/wiki40b.test and data/wiki40b.test.table.bin. So this says that the first cluster of duplicates is of size 2, and starts at location 0x05aea4da in the data file, with the second occurrence at location 0x05857a8c. To confirm this, you can run

$ python3
>>> open("data/wiki40b.test","rb").read()[0x05aea4da:0x05aea4da+100]
b'\n        \n          t\n          \n            0\n          \n        \n        ,\n        \n          t\n  '
>>> open("data/wiki40b.test","rb").read()[0x05857a8c:0x05857a8c+100]
b'\n        \n          t\n          \n            0\n          \n        \n        ,\n        \n          t\n  '

And we've confirmed that this sequence really does appear twice in the dataset. This is a fairly boring and benign duplicate, but it's definitely correct. (Exercise for the reader: how would you count how many times this string is repeated in the dataset? It should be twice. Can you check that?)
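If you'd rather inspect these cache files programmatically than with xxd, here is a hedged sketch. It assumes 4-byte little-endian pointers (the width for this dataset) and that each entry in the sizes_ file is the number of pointers in a cluster; adjust the width and struct format for larger datasets.

# Hedged sketch: walk the first few duplicate clusters in the cache files.
# Assumes 4-byte little-endian pointers and that each sizes_ entry is the
# number of dups_ pointers in that cluster; adjust W and the struct format
# ("<I" -> something wider) for datasets with larger pointers.
import struct

W = 4  # pointer width in bytes for this dataset
data = open("data/wiki40b.test", "rb").read()
sizes = open("/tmp/cache/sizes_wiki40b.test_0-64596445", "rb").read()
dups = open("/tmp/cache/dups_wiki40b.test_0-64596445", "rb").read()

consumed = 0  # how many dups_ pointers we've read so far
for cluster in range(4):
    (size,) = struct.unpack_from("<I", sizes, cluster * W)
    pointers = [struct.unpack_from("<I", dups, (consumed + j) * W)[0]
                for j in range(size)]
    consumed += size
    print(f"cluster {cluster}: size={size}, first occurrence:",
          data[pointers[0]:pointers[0] + 40])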

Collecting the duplicates together

The next step is to take all of the length-100 sequences we've found and collect them together to figure out what we should be removing from our dataset. To see why this is necessary, imagine that we have a length-200 sequence that's repeated more than once. The current data we have would tag this sequence as being a duplicate 101 times (200 - 100 + 1)---once for each initial byte where a length-100 match occurs.
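Conceptually this is interval merging: each duplicated length-100 window contributes an interval [i, i+100), and overlapping intervals collapse into the ranges that get written out. A minimal sketch of that idea (the real collect command streams the cache files instead of holding everything in memory):

# Minimal sketch of the merging idea behind `collect`: turn overlapping
# duplicate windows into maximal byte ranges [a, b).
def merge_windows(starts, length):
    intervals = sorted((s, s + length) for s in starts)
    merged = []
    for a, b in intervals:
        if merged and a <= merged[-1][1]:  # overlaps (or touches) the previous range
            merged[-1][1] = max(merged[-1][1], b)
        else:
            merged.append([a, b])
    return [tuple(r) for r in merged]

# A duplicated 200-byte region yields 101 window starts, which merge back
# into the single range [0, 200).
print(merge_windows(range(101), 100))  # [(0, 200)]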

This step reduces that down to just the ranges of bytes [a,b) which are duplicated more than once. To do this, run

cargo run collect --data-file data/wiki40b.test --cache-dir /tmp/cache --length-threshold 100 > /tmp/wiki40b.test.remove.byterange

The output here will be a long list of byte ranges (pairs of start and end offsets):

...
out
41887 41999
42347 42479
42507 42715
42741 42931
43101 43315
43891 43993
44021 44220
44366 44604
...

What this means is that the substring in the dataset from byte 41887 to byte 41999 is repeated more than once and should be removed, as should the data from bytes 42347 to 42479 and so on. Let's check this.

$ python3
>>> data=open("data/wiki40b.test","rb").read()
>>> data[41887:41999]
b'8\xc2\xa0km\xc2\xb2), all of it land.\n_START_SECTION_\n2010 census\n_START_PARAGRAPH_\nAs of the census of 2010, there were 2,5'
>>> data.count(data[41887:41999])
1 ## WHAT??? See below
>>> data[42347:42479]
b'% from other races, and 0.9% from two or more races. Hispanic or Latino of any race were 2.5% of the population._NEWLINE_There were '
>>> data.count(data[42347:42479])
2

Okay, so what's going on here? The first of these looks like it's repeated just once (the second looks correct). Well, what we're actually saying is the following: every byte contained in the range 41887 to 41999 is a member of at least one length-100 duplicate match. So while the whole sequence isn't repeated, its sub-sequences are. For example:

>>> data.count(data[41887:41887+100])
9
>>> data.count(data[41999-100:41999])
2

In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text; for example, if we previously had the example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model because we remove relatively little text, and so these breaks don't cause harm.

How exactly you write out a deduplicated dataset depends on the format the dataset started in. If you're just running this on wiki40b, we've provided a script that does this conversion for you and outputs another valid TensorFlow Dataset directory. But if you're using some other dataset, this is the part you'll have to write yourself.
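For a raw concatenated file (like the data/wiki40b.test file built above), the core byte-striking step looks roughly like the sketch below. It is a hedged sketch only; unlike the provided finish script it does not rebuild the original dataset format or adjust the .size example boundaries.

# Hedged sketch: strike every [a, b) range listed in the .remove.byterange
# output from a raw concatenated dataset file. Ranges from `collect` are
# sorted and non-overlapping. Note this does not update the .size offsets or
# restore the original dataset format (finish_dedup_wiki40b.py does that).
def strike_ranges(data: bytes, ranges) -> bytes:
    pieces, prev = [], 0
    for a, b in ranges:
        pieces.append(data[prev:a])
        prev = b
    pieces.append(data[prev:])
    return b"".join(pieces)

data = open("data/wiki40b.test", "rb").read()
ranges = []
for line in open("/tmp/wiki40b.test.remove.byterange"):
    parts = line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):  # skip non-range lines
        ranges.append((int(parts[0]), int(parts[1])))

open("/tmp/wiki40b.test.deduped", "wb").write(strike_ranges(data, ranges))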

To run the wiki40b script, you can just run this command

python3 scripts/finish_dedup_wiki40b.py --data_dir ~/tensorflow_datasets/ --save_dir /tmp/tfds_wiki40b --name wiki40b --split test --suffixarray_dir data --remove /tmp/wiki40b.test.remove.byterange

This will create a new directory called /tmp/tfds_wiki40b_dedup, and will take a few minutes to process completely.

You can verify the deduplication has succeeded by re-running the pipeline on the resulting output. Instead of finding 3,374,227 duplicate sequences during the deduplication phase, it should now find 374. Importantly, you can check that these 374 duplicates are not errors of the pipeline: they are new sequences that are now duplicated when previously they were not. You can check this by running count-occurrences in the original dataset for the sequences that (now) have two occurrences.

To do this, just re-run everything top-down:

python3 scripts/load_dataset.py --data_dir /tmp/tfds_wiki40b_dedup --save_dir data_dedup --name wiki40b --split test
python3 scripts/make_suffix_array.py data_dedup/wiki40b.test
cargo run self-similar --data-file data_dedup/wiki40b.test --length-threshold 100 --cache-dir /tmp/cache --num-threads 8

and observe the output

Duplicates found: 374

Why do we get new duplicates? Consider the following example where we're going to remove all sequences of 4 characters that repeat more than once: e a b c d f g h . e f a b c d g h. Initially the sequence a b c d is repeated twice. So we remove both copies, and are now left with the file e f g h . e f g h. This file still has duplicates! It's not that the first run failed; it's that by doing the first deduplication we created new duplicates.
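You can reproduce this effect in a few lines of Python (a toy simulation of the byte-striking step, not the actual pipeline; the example string is written without the spaces used above):

# Toy simulation of striking repeated length-k windows, showing how a first
# pass can create new duplicates that a second pass then removes.
from collections import Counter

def strike_repeated(data: bytes, k: int) -> bytes:
    counts = Counter(data[i:i + k] for i in range(len(data) - k + 1))
    covered = [False] * len(data)
    for i in range(len(data) - k + 1):
        if counts[data[i:i + k]] > 1:          # this window occurs elsewhere too
            for j in range(i, i + k):
                covered[j] = True
    return bytes(b for b, c in zip(data, covered) if not c)

data = b"eabcdfgh.efabcdgh"
once = strike_repeated(data, 4)
print(once)                       # b'efgh.efgh' -- new duplicates appear
print(strike_repeated(once, 4))   # b'.'         -- a second pass removes them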

To generate the results in our paper, we ran the deduplicator twice. This often cuts the number of remaining duplicates by over 100,000x, which in practice means down to roughly zero for normal datasets, or a few hundred for massive 100GB+ datasets.

A full end-to-end dataset deduplication example

Okay so maybe you don't like reading. You skipped the entire section above. (Honestly I don't blame you.) You just want it to run. Then just do this

bash scripts/run_pipeline.sh
python3 scripts/finish_dedup_wiki40b.py --data_dir ~/tensorflow_datasets/ --save_dir /tmp/dedup --name wiki40b --split test --suffixarray_dir data --remove /tmp/wiki40b.test.remove.byterange

This will run the entire deduplication pipeline top-to-bottom, starting with loading the wiki40b test set, then creating a suffix array, finding all repeated sequences, merging them together to sequence ranges, and finally spitting out a deduplicated TF Dataset that you can use exactly as normal.

Note that this finish script is often the slowest part of the pipeline, despite doing the least work. I'm sure this is something that could be parallelized or made faster, but it's not an algorithms problem, it's an engineering problem. And that's not particularly fun. If you want to do this and submit a PR we'd gladly take it.

A full end-to-end single file deduplication example

If you have a large single file and want to remove all length-N duplicates from within that file, we also provide a helper script:

bash scripts/deduplicate_single_file.sh [path/to/source] [path/to/destination] [dup_length_threshold] [num_cores]

Advanced Usage

The above scripts work by calling into the core Rust suffix array deduplicator. If you want to do each step yourself, the following options are available:

Single threaded suffix array construction

To build a suffix array for any particular file, you can run

cargo run make --data-file [file_path]

This will create a file called [file_path].table.bin which contains the suffix array for the file provided. This algorithm is linear time, but (a) only runs on a single core, and (b) needs memory several times the size of the file (the in-memory suffix array alone is 8 bytes per input byte), which is prohibitive for large files.

Parallel suffix array construction

To build a suffix array for an extremely large file (e.g., one approaching the amount of RAM available), it is better to run the script

python scripts/make_suffix_array.py [file_path]

This script will build the suffix array in parallel by splitting the single file into chunks, generating suffix arrays for each chunk, and then merging the suffix arrays together to form the full suffix array. Note that in general this algorithm is quadratic, but when the maximum substring length is short relative to the total file length (as it is, when generating suffix arrays for N independent training examples) it will never reach this worst case behavior.
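The splitting step looks roughly like the sketch below. The chunk count and the amount of overlap between neighboring chunks are assumptions for illustration (scripts/make_suffix_array.py chooses these for you); the overlap exists so that suffixes near a chunk boundary still have enough following context when the pieces are merged.

# Hedged sketch of splitting a file into overlapping chunks for parallel
# suffix-array construction. num_chunks and overlap are illustrative
# assumptions; scripts/make_suffix_array.py picks them automatically.
import os

def chunk_boundaries(path: str, num_chunks: int, overlap: int = 100_000):
    total = os.path.getsize(path)
    step = -(-total // num_chunks)  # ceiling division
    return [(max(0, i * step - overlap), min(total, (i + 1) * step))
            for i in range(num_chunks)]

for start, end in chunk_boundaries("data/wiki40b.test", 4):
    print(f"build suffix array part for bytes [{start}, {end})")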

The two steps are described below.

Building a piece of a suffix array from a piece of a file

The first step generates a suffix array from a piece of a file. This is done by running

cargo run make_part --data-file [file_path] --start-byte [byte_offset] --end-byte [byte_offset]

This builds a suffix array for the byte sequence between the given start and end offsets of the file. Multiple of these can be run in parallel to build the suffix array for a large file quickly.

Merging suffix array pieces to create a single suffix array

Given several independent suffix array pieces, merging them is now just a matter of calling

cargo run merge --suffix-path [path_to_partial_suffix_tree] [--suffix-path [another_path_to_partial] ...] --output-file [tmp_output_directory] --num-threads [number-of-machine-cores]

to generate a collection of ordered suffix array pieces in the output directory. The final step is just to concatenate these together:

cat [tmp_output_directory]/* > [file_path].table.bin

Finding Duplicates

Given a suffix array file, as generated in the previous section, it can now be queried for interesting statistics. The simplest operation, counting occurrences of particular substrings, takes O(log(N)) time and O(query_length) memory (as shown above with scripts/count_occurrences.py). To do this, run:

cargo run count-occurrences --data-file /path/to/dataset --query-file /path/to/query_file

(Indeed, the python script is just a wrapper that makes calling this nicer, with the option for tokenization.) This is useful mainly as a commandline interface to interact with the dataset to find interesting properties. To run more sophisticated analysis, use the tools described below:

Finding duplicates between two different documents

Given a document A and another document B, we can find all duplicates between the two by (1) constructing suffix arrays for both, and then (2) linearly walking the suffix arrays in order to find all duplicates of a given length.
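To make the semantics concrete, here is a brute-force sketch of what gets reported. It does not use suffix arrays at all and will not scale; it only shows which positions the real command identifies.

# Brute-force sketch (no suffix arrays): report every position in each file
# whose length-k window also occurs in the other file. Only for intuition on
# tiny inputs; the real across-similar command does this with a linear walk
# over the two suffix arrays.
def cross_duplicates(a: bytes, b: bytes, k: int):
    grams_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    grams_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    hits_in_a = [i for i in range(len(a) - k + 1) if a[i:i + k] in grams_b]
    hits_in_b = [j for j in range(len(b) - k + 1) if b[j:j + k] in grams_a]
    return hits_in_a, hits_in_b

a = b"the quick brown fox jumps over the lazy dog"
b = b"one lazy dog sleeps while the quick brown cat watches"
print(cross_duplicates(a, b, 10))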

Once the suffix array for the dataset has been constructed, this algorithm requires time O(len(dataset) + len(query)) and space O(len(dataset)). It is better to run this algorithm when the number of queries into the dataset is greater than O(len(dataset)/log(len(query))). However, note that the prior code requires disk seeks while this implementation is a linear scan through the suffix array table, so in practice there is at least a factor-of-10 speedup here. As a rough order of magnitude, for a ~100GB dataset it is faster to run across-similar (described below) when querying with more than a few megabytes of text. Otherwise it is probably faster to run count-occurrences.

Notice that this command also requires that the entire dataset fits in memory. For many datasets this is not a problem, but the C4 dataset is 350 GB and the Pile dataset is 750 GB (both even after tokenization). The machine must therefore have a lot of RAM for this to work.

cargo run across-similar --data-file-1 [dataset1] --data-file-2 [dataset2] --length-threshold [num_bytes] --cache-dir [where/to/save] --num-threads [N]

This creates files (similar to the self-similar command) containing the positions of all examples in dataset2 that are also in dataset1, and, at the same time, the positions of all examples in dataset1 that are also in dataset2. As before, the output is both dups files that have the byte offsets of where the length-threshold duplicates occur, and also sizes files that give the sizes of each cluster.

It's again possible to run

cargo run collect --data-file [dataset1 or dataset2]

This will write to stdout the byte ranges [a,b) where all tokens in this range are part of an overlap contained in the other document.

Finding duplicates within one document

To find duplicates that are contained within one document (for example, to actually deduplicate a dataset as we do in the paper) run the command

cargo run self-similar --data-file [path] --length-threshold [bytes] --cache-dir [where/to/save] --num-threads [cpu cores]

This will find all repeated substrings contained in the dataset above a given length threshold. To see how it is used, look above where it's called as part of the dataset deduplication process. Then run the collect command, as described above, to find the byte ranges of the repeated examples.

Rust Deduplicator Version History

Version 0.1.0 was an initial code release that reproduces the paper.

  • The code worked, but was rather terrible.
  • I am sorry if you had to look at it.
  • You don't want to look at this code unless you're explicitly trying to reproduce our paper.

Version 1.0.0 is a complete restructuring of the code. IT IS NOT BACKWARDS COMPATIBLE.

  • The suffix array data structure is basically the only thing that remains unchanged (thanks to Andrew Gallant who actually understood how to write code). You won't need to re-generate the suffix array tables if you upgrade from 0.1 to 1.0.
  • The rust code now uses argument parsing, instead of relying on the order arguments are passed. So the CLI interface has changed.
  • Added one-line scripts to deduplicate a single file, or a TFDS dataset.
  • The intermediate data files have changed. This shouldn't matter unless you were looking at the internals of the code. If you were, then you will need to re-generate the intermediate data files.
  • The code is not entirely terrible to read, and has comments.

Approx Deduplication Results

The following CSVs contain three columns: the document ID, a boolean indicating whether or not this document was deleted during deduplication, and a cluster ID. Documents with the same cluster ID were identified as near-duplicates. For C4 and RealNews, the document ID is the url associated with the document. For Wiki-40B, it is the wikidata_id. LM1B coming soon.

Name Link Size
C4 link 13GB
RealNews link 1.4GB
Wiki-40B link 26MB
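A hedged sketch of loading one of these CSVs and grouping documents by cluster. The filename is hypothetical, the column order is as described above, and the boolean encoding is an assumption; check the first lines of the file you download.

# Hedged sketch: group NearDup CSV rows into clusters. Assumes column order
# (document_id, deleted, cluster_id) as described above, a hypothetical local
# filename, and that the deleted flag is literally the string "True"/"False";
# inspect the downloaded file to confirm.
import csv
from collections import defaultdict

clusters = defaultdict(list)
with open("c4_near_dup_clusters.csv") as f:  # hypothetical filename
    for doc_id, deleted, cluster_id in csv.reader(f):
        clusters[cluster_id].append((doc_id, deleted == "True"))

# Print the five largest near-duplicate clusters.
for cluster_id, docs in sorted(clusters.items(), key=lambda kv: -len(kv[1]))[:5]:
    kept = sum(1 for _, deleted in docs if not deleted)
    print(cluster_id, f"{len(docs)} docs, {kept} kept")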


deduplicate-text-datasets's Issues

Error when running the code

Hi,

I'm trying to deduplicate my plain text file, but it throws some errors. I first run

python scripts/make_suffix_array.py c4-train.00000-of-01024.txt

The output is

./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 0 --end-byte 114700294
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 114600294 --end-byte 229300588
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 229200588 --end-byte 343900882
./target/debug/dedup_dataset make-part --data-file c4-train.00000-of-01024.txt --start-byte 343800882 --end-byte 458401177
Waiting for jobs to finish
Checking all wrote correctly
FACT 4.0
FACT 4.0
FACT 4.0
FACT 4.0
Rerunning 0 jobs because they failed.
Merging suffix trees
./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path c4-train.00000-of-01024.txt.part.0-114700294 --suffix-path c4-train.00000-of-01024.txt.part.114600294-229300588 --suffix-path c4-train.00000-of-01024.txt.part.229200588-343900882 --suffix-path c4-train.00000-of-01024.txt.part.343800882-458401177 --num-threads 256
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:222:77
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 24, kind: Uncategorized, message: "Too many open files" }', src/main.rs:875:125
[... the same "Too many open files" panic repeated by many more worker threads, interleaved, at src/main.rs:222:77 and src/main.rs:875:125 ...]
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Any { .. }', /home/yiming/.cargo/registry/src/github.com-1ecc6299db9ec823/crossbeam-0.3.2/src/scoped.rs:34:43
Now merging individual tables
Cleaning up

Yet, it successfully creates the suffix array files:

c4-train.00000-of-01024.txt.part.0-114700294
c4-train.00000-of-01024.txt.part.0-114700294.table.bin
c4-train.00000-of-01024.txt.part.114600294-229300588
c4-train.00000-of-01024.txt.part.114600294-229300588.table.bin
c4-train.00000-of-01024.txt.part.229200588-343900882
c4-train.00000-of-01024.txt.part.229200588-343900882.table.bin  
c4-train.00000-of-01024.txt.part.343800882-458401177           
c4-train.00000-of-01024.txt.part.343800882-458401177.table.bin  
c4-train.00000-of-01024.txt.table.bin

Then, I run

cargo run self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128

It gives me below error:

    Finished dev [optimized + debuginfo] target(s) in 5.69s
     Running `target/debug/dedup_dataset self-similar --data-file c4-train.00000-of-01024.txt --length-threshold 15 --cache-dir cache --num-threads 128`
Start load!
thread 'main' panicked at 'assertion failed: metadata.len() % (text.len() as u64) == 0', src/main.rs:479:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

May I ask how to fix this? Thank you!

Yiming

Error on self deduplication

I am planning to reproduce the self deduplication result for lm1b. I have already produced the result mentioned in the readme here.

However, when running selfsimilar_parallel, it shows Final answer 0 and when running collect_similar it throws an error of thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26. Am I missing something here?

Log:

$ python3 scripts/count_occurances.py --suffix dataset_save/lm1b.test --query " on Tuesday"
b' on Tuesday'
Number of times present: 1288


$ cargo run selfsimilar_parallel dataset_save/lm1b.test
warning: function is never used: `get_example_index`
   --> src/main.rs:447:4
    |
447 | fn get_example_index(table:&[u64], position:u64) -> usize{
    |    ^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: unused `Result` that must be used
   --> src/main.rs:367:2
    |
367 |     tablestream.file.read_exact(&mut tablestream.cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unused_must_use)]` on by default
    = note: this `Result` may be an `Err` variant, which should be handled

warning: unused `Result` that must be used
   --> src/main.rs:379:2
    |
379 |     file.read_exact(&mut cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this `Result` may be an `Err` variant, which should be handled

warning: 3 warnings emitted

    Finished dev [optimized + debuginfo] target(s) in 0.02s
     Running `target/debug/dedup_dataset selfsimilar_parallel dataset_save/lm1b.test`
Start load!
Loading ratio is 8
0 / 453700
Final answer 0


$ cargo run collect_similar dataset_save/lm1b.test
warning: function is never used: `get_example_index`
   --> src/main.rs:447:4
    |
447 | fn get_example_index(table:&[u64], position:u64) -> usize{
    |    ^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(dead_code)]` on by default

warning: unused `Result` that must be used
   --> src/main.rs:367:2
    |
367 |     tablestream.file.read_exact(&mut tablestream.cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[warn(unused_must_use)]` on by default
    = note: this `Result` may be an `Err` variant, which should be handled

warning: unused `Result` that must be used
   --> src/main.rs:379:2
    |
379 |     file.read_exact(&mut cache);
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: this `Result` may be an `Err` variant, which should be handled

warning: 3 warnings emitted

    Finished dev [optimized + debuginfo] target(s) in 0.02s
     Running `target/debug/dedup_dataset collect_similar dataset_save/lm1b.test`
Sorting.
Sorted.
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 0', src/main.rs:1244:26
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Simple test

Thanks a lot for open sourcing your amazing work!

I was just getting my hands dirty with the code and wrote a simple test:

from text_dedup.exact_dedup import GoogleSuffixArrayDeduplicator

k=3
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
texts = ['aaaaaaaaaaaabbbccccc', 'aaaaaaaaaaaaccccc']
slices = deduplicator.fit_predict(texts)

print(f"k:{k}")
print(slices)

def remove_slice_list(text, slice_list):
    offset = 0
    for s in slice_list:
        text = text[:(s.start-offset)] + text[(s.stop-offset):]
        offset += s.stop - s.start
    return text

for slice_list,text in zip(slices, texts):
    if slice_list != []:
        print(f"{text} -> {remove_slice_list(text, slice_list)}")

# which prints:
[[slice(0, 12, None)], [slice(0, 12, None)]]
aaaaaaaaaaaabbbccccc -> bbbccccc
aaaaaaaaaaaaccccc -> ccccc

Shouldn't the c's from both strings get removed as well? Maybe it's due to my unfamiliarity with the algorithm; just curious.

# expected
aaaaaaaaaaaabbbccccc -> bbb
aaaaaaaaaaaaccccc -> 

For example, I imagine in a real world scenario we would like to remove both the repeating headers and footers of a website.

Also, tried the example from the README:

k=4
deduplicator = GoogleSuffixArrayDeduplicator(k=k, google_repo_path="deduplicate-text-datasets")
# texts = ['eabcdfgh . efabcdgh']
texts = ['abcdefgh . efabcdgh']
slices = deduplicator.fit_predict(texts)

print(f"k:{k}")
print(slices)

clean_texts = []
for slice_list,text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")
        clean_texts.append(clean_text)
    else:
        clean_texts.append(text)

texts = clean_texts
slices = deduplicator.fit_predict(texts)
print(slices)
for slice_list,text in zip(slices, texts):
    if slice_list != []:
        clean_text = remove_slice_list(text, slice_list)
        print(f"{text} -> {clean_text}")

# prints
[[slice(0, 4, None), slice(13, 17, None)]]
abcdefgh . efabcdgh -> efgh . efgh

[[slice(0, 4, None)]]
efgh . efgh ->  . efgh

# expected
efgh . efgh ->  . 

Count_occurrence does not work with tokenizer?

Hello!

Sorry if this is something simple that I missed, but I tried to count occurrences of words in the 'rajpurkar/squad' dataset, and after downloading the data, tokenization, and suffix array creation, the binary search from the count_occurrences script does not return the correct number of word occurrences (it returns 0 most of the time...). I then dug deeper and found that the problem might be related to tokenization, where the same words can be tokenized differently based on context by a BPE tokenizer. This happens with GPT2Tokenizer from huggingface and also with the tiktoken tokenizer... Below is a script to showcase this difference:

from transformers import GPT2Tokenizer
# Initialize the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

a = "blah blah blah... Architecturally... blah blah"
b = "Architecturally"

# Encode both strings
encoded_a = tokenizer.encode(a)
encoded_b = tokenizer.encode(b)

print("encoded_a: ", encoded_a)
print("encoded_b: ", encoded_b)

# Check if encoded_b is a subsequence of encoded_a
def is_subsequence(sub, main):
    sub_len = len(sub)
    for i in range(len(main) - sub_len + 1):
        if main[i:i + sub_len] == sub:
            return True
    return False

print("Is encoded_b a subsequence of encoded_a?", is_subsequence(encoded_b, encoded_a))

The output on my machine is:

encoded_a:  [2436, 993, 33367, 33367, 986, 17340, 20221, 986, 33367, 33367]
encoded_b:  [19895, 5712, 20221]
Is encoded_b a subsequence of encoded_a? False

This leads to the binary search not finding the correct number of occurrences. Am I missing something? Is there a setting where I can set a BPE tokenizer to always tokenize the same text to the same integers independent of context?

Thank you!

Adjust TensorFlow version to fix cuDNN, cuFFT, cuBLAS errors.

Problem

When using pip install tensorflow, the installation defaults to the latest version, TensorFlow 2.14.0. However, this version is known to cause specific issues, as documented in this issue.

Solution

I have found a workaround that seems to be effective. By installing the following specific versions of TensorFlow and its related packages, the issue is resolved:

tensorflow==2.9.0
tensorflow-datasets==4.9.3
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.34.0
tensorflow-metadata==1.14.0

These versions work well together and avoid the problem encountered with TensorFlow 2.14.0.

According to @Romeo-CC in tensorflow/tensorflow#62002 (comment): The issue was present in versions 2.10 and 2.11, resolved in 2.12, but reemerged in 2.14.

where the data is?

Hello Team,

Where can I find the data before deduplicating? I have similar tasks to test deduplication algorithms.

Thanks,

Jianshu

Implementation of NearDup (approximate match)

In the paper, you used two methods: a suffix array for exact matches and MinHash for approximate matches.
Currently, the repo only contains the suffix-array-based one. May I kindly ask whether you will release the implementation of the MinHash-based NearDup method from your paper?

Retain one instance per duplicate

Hi!

Thanks for this incredibly fast code!

I noticed that when deduplicating, all duplicate instances are removed, instead of keeping at least one.

The byte spans that we use in the end to remove the duplicates potentially consist of combinations of smaller duplicates.

Is there any smart and efficient way to use this information to keep at least one instance for each duplicate?

Thanks!

How to dedup between two datasets?

A practical situation is that, given two datasets A and B, we want to remove the data in A that has huge overlap with B. Is there a command I could use to achieve this? The README only covers finding duplicates within a single document or between a pair of documents.

question about deduplication cluster size

As shown in the following screenshot, the cluster starting at 0x02954cb9 has a size of 3.
[screenshot omitted]
But when I count it using bytes.count(), it shows 2.
[screenshot omitted]

I tried different datasets and observed the same phenomenon.
Did I make a mistake about the size meaning?

RAM crash when using the collect method

First of all, thanks for releasing the code.

I have a dataset (mC4) of about 110 GB.

My machine specs are 96 CPU cores and 350 GB of RAM.

I've successfully created a 524 GB suffix array from that dataset.

I also managed to run the deduplicator (self-similar method with a threshold of 100) with no memory issues, creating about ~140 GB of cache files (20B examples).

But when I run the collect method, my RAM blows up after a few minutes.

I traced through the code and found that my RAM crashes when this code/step is running:

while let Some(Reverse((data_pointer, index, which_array))) = heap.pop() {

Is this expected? Do you have a workaround for the issue?

AFAIK, the collect method just merges all the duplicate sequences found in the dataset and only returns a text file with pairs of byte offsets, CMIIW.

I'm thinking maybe it could write to the text file as soon as each cache file finishes being processed/read, instead of waiting for all of them to finish (this is just an assumption, I don't know if it's possible... not an expert in Rust).

thank you

"failed to fill whole buffer" errors

Hi,

I have tried to run the code on a simple string, and count-occurrences fails with a "failed to fill whole buffer" error.

Here are steps to reproduce:

  1. run ./target/debug/dedup_dataset make --data-file dup.txt, data file dup.txt contains simple string "aaabbb"
  2. then run ./target/debug/dedup_dataset count-occurrences --data-file dup.txt --query-file query.txt, where query.txt contains
  • "bb"
    expectation: Number of times present: 2
    reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:275:31;
  • "ab"
    expectation: Number of times present: 1
    reality: thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: UnexpectedEof, message: "failed to fill whole buffer" }', src/main.rs:297:31;
  • "b"
    expectation: Number of times present: 2
    reality: Number of times present: 1;

Maybe I'm doing something wrong? Thanks.

Error with table size not being divisible by text size

Hi, I'm getting an error because the size of the suffix table is not divisible by the length of the text.

assert!(metadata.len() % (text.len() as u64) == 0);

I'm running bash scripts/deduplicate_single_file.sh test.txt test_dedup.txt 20 1 where test.txt just contains a few paragraphs from a random Wikipedia article and some duplicate text that I manually added. I'm doing this mainly for debugging purposes (I would like to later make some edits to keep the first occurrence of duplicate samples and throw away the rest). If I run the command on my actual dataset, which is roughly ~70GB, I don't encounter this issue. So I'm wondering what the problem is. Does the code not work with datasets that are too small?

Thanks!

Update: I just found out that running the command on the actual 70GB dataset also raised the same error.

[Question] An error with the same repo guideline

Hi
I'm following the repo guideline to deduplicate the wiki40b dataset. In the step of building the suffix array, with the command python3 scripts/make_suffix_array.py data/wiki40b.test (I have the .size and .test files in the data folder), the command generates 8 bin files and then throws the error below.
./target/debug/dedup_dataset make-part --data-file data/wiki40b.test --start-byte 0 --end-byte 129292893
./target/debug/dedup_dataset make-part --data-file data/wiki40b.test --start-byte 129192893 --end-byte 258485786
./target/debug/dedup_dataset make-part --data-file data/wiki40b.test --start-byte 258385786 --end-byte 387678679
./target/debug/dedup_dataset make-part --data-file data/wiki40b.test --start-byte 387578679 --end-byte 516771573
Waiting for jobs to finish
Checking all wrote correctly
FACT 4.0 517171572.0 517171572
FACT 4.0 517171572.0 517171572
FACT 4.0 517171572.0 517171572
FACT 4.0 516771576.0 516771576
Rerunning 0 jobs because they failed.
Merging suffix trees
rm: cannot remove 'tmp/out.table.bin.*': No such file or directory
./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path data/wiki40b.test.part.0-129292893 --suffix-path data/wiki40b.test.part.129192893-258485786 --suffix-path data/wiki40b.test.part.258385786-387678679 --suffix-path data/wiki40b.test.part.387578679-516771573 --num-threads 16
thread '<unnamed>' panicked at src/main.rs:1252:125: called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[... the same panic repeated by the other worker threads ...]
thread 'main' panicked at /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-0.3.2/src/scoped.rs:34:43: called `Result::unwrap()` on an `Err` value: Any { .. }
Something went wrong with merging. Please check that you ran with ulimit -Sn 100000
I tried with ulimit -Sn 100000 and I had the same error.
Can you help me figure it out?
Thanks
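
For reference, a quick way to check whether the merge step's inputs even exist on disk (the ".table.bin" companion name next to each part is an assumption on my part, not something I verified against the script):

import os

parts = [
    "data/wiki40b.test.part.0-129292893",
    "data/wiki40b.test.part.129192893-258485786",
    "data/wiki40b.test.part.258385786-387678679",
    "data/wiki40b.test.part.387578679-516771573",
]
for part in parts:
    # the ".table.bin" companion file is an assumption about what merge expects
    for path in (part, part + ".table.bin"):
        size = os.path.getsize(path) if os.path.exists(path) else "MISSING"
        print(path, size)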

Unexpected behavior with ending symbols

Hi again,

I found that count-occurrences has unexpected behavior when you try to count the last symbols in a sequence. Here are the examples, followed by a brute-force reference check:

  • sequence "aaabbb", query "b": expected 3, but the output is Number of times present: 2
  • sequence "aaabbb", query "bb": expected 2, but the actual output is Number of times present: 0
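
For reference, a brute-force overlapping count in plain Python (not the repo's tool) gives the numbers I expect:

def overlapping_count(data: bytes, query: bytes) -> int:
    # count every occurrence, including overlapping ones
    return sum(data[i:i + len(query)] == query for i in range(len(data) - len(query) + 1))

print(overlapping_count(b"aaabbb", b"b"))   # 3
print(overlapping_count(b"aaabbb", b"bb"))  # 2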

Can you fix this? Thanks!

Distributed running

Hi, I notice that the self-similar function can only be launched on one machine, which largely limits the size of the dataset, since most machines have less than 1 TB of memory. Can the self-similar function be changed to enable distributed running across different machines?

Incomplete Sentences

Hello!

I'm currently using a suffix array with Persian-language text. However, in some examples, the outcome of deduplication is not ideal after removing substrings from the text: the boundaries of the removed strings cut through words, resulting in degraded and sometimes meaningless text. How can I rectify this issue?

one example (translated to english):

ORIGINAL: According to BBC and quoted by Currency, the dollar to ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

RESULT AFTER DEDUP: o ruble rate increased by 0.32% to 55.19 rubles and the euro decreased by 0.36% to 56.09 rubles.

In the repository, you mentioned this, stating that it doesn't disrupt the language model. The reason provided is that only a relatively small amount of text is removed. However, I'm having difficulty understanding why this isn't considered harmful. Do you mean that this disruption has no effect whatsoever on the perplexity of the language model?

In our paper we suggest just taking all of these duplicate sequences that have been identified and completely striking them from the dataset. This somewhat breaks the flow of text: for example, if we previously had the example "Alice wanted to go to the store" and we deduplicated at the level of 10 characters, we might completely strike " to go to the " and be left with "Alice wantedstore". In practice we have found this doesn't break the language model, because we remove relatively little text, and so these breaks don't cause harm.
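
As a minimal illustration of that cutting step (a sketch, not the repo's actual deduplication script), removing the identified byte ranges from a document looks like this:

text = b"Alice wanted to go to the store"
remove = [(12, 26)]  # hypothetical byte range covering " to go to the "

out, prev = [], 0
for start, end in sorted(remove):
    out.append(text[prev:start])  # keep everything up to the duplicated span
    prev = end                    # then skip over it
out.append(text[prev:])

print(b"".join(out))  # b'Alice wantedstore'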

Fix to issue #17 limits cmd_merge to be single-threaded

Hi,

it looks like the fix for issue #17, which puts some limits on the number of threads in cmd_merge, is a bit too aggressive, resulting in only using a single thread even for big workloads:

// Make sure we have enough space to take strided offsets for multiple threads
// This should be an over-approximation, and starts allowing new threads at 1k of data
let num_threads = std::cmp::min(num_threads, std::cmp::max((texts.len() as i64 - 1024)/10, 1));
println!("AA {}", num_threads);

texts.len() is equal to nn (the number of input parts), so I think you want something like

    let num_threads = std::cmp::min(num_threads, std::cmp::max((texts_len.iter().sum::<usize>() as i64 - 1024)/10, 1));

instead.

customized dataset deduplication

Hi.
I downloaded some data from the Common Crawl website. Can I use this script to deduplicate substrings within one file or across multiple files?
Thanks!

Off-by-1 error in `collect`?

Hi, thanks for the great repo!

I'm using the tool to deduplicate a dataset, and I'm trying to investigate what happens in subsequent steps. I noticed that after running collect, some of the duplicate strings seem to start with control characters, e.g. after running code similar to this:

>>> data=open("data/my_data.test","rb").read()
>>> data[left:right]

where left and right are one of the pairs returned by collect, I get something like this:

b'\x00\x00Enter the username or e-mail you used in your profile. A password reset link will be sent to you by email.'

I already clean control characters out of my main text, so it looks like parts of the separator codes are being leaked. Interestingly, this doesn't happen consistently, but it does happen more often on the more frequent strings. Also, the matched documents from my original dataset don't contain control characters.

Any chance there's some sort of off-by-one error in collect? It's not a huge deal, but I'd like to understand what's happening here.
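
For context, a rough way to count how many of the returned spans begin with a control byte (the (left, right) pairs below are hypothetical placeholders for the values collect printed):

data = open("data/my_data.test", "rb").read()
pairs = [(1000, 1100), (2000, 2100)]  # hypothetical (left, right) pairs from collect

# spans whose first byte is a control character, i.e. possibly part of a separator
bad = [(l, r) for l, r in pairs if data[l:r][:1] < b"\x20"]
print(f"{len(bad)} of {len(pairs)} spans start with a control byte")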

Question: Upper Bound

If your machine is big enough, there should be no upper bound on the size of the dataset it can handle (well, 2^64-1 bytes is the limit, but I think we can all agree that's essentially unlimited).

Just out of curiosity, what was the reasoning behind this calculation?

Should newline char be removed

Hi, so I noticed that this read here ends up with a \n char at the end of the query. This then throws off the count if the newline isn't actually meant to be part of the query. Should there be a .strip() added here?

arr = open(args.query_file,"rb").read().strip()

Thanks.
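
(For anyone hitting this in the meantime, one workaround is to write the query file in binary mode so it contains no trailing newline; a minimal sketch, with a hypothetical file name:)

# write the query bytes exactly, with no trailing newline appended
with open("query.bin", "wb") as f:
    f.write("my query string".encode("utf-8"))

and then pass query.bin as --query-file.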

question about wstring_equal function

Hi,

For the wstring_equal function, you have:

fn wstring_equal(&self, stypes: &SuffixTypes, w1: u64, w2: u64) -> bool {
        ...
        for ((i1, c1), (i2, c2)) in w1chars.zip(w2chars) {
            ...
            if i1 > w1 && (stypes.is_valley(i1) || stypes.is_valley(i2)) {
                println!("stypes.is_valley(i1):{}, stypes.is_valley(i2):{}", stypes.is_valley(i1),stypes.is_valley(i2));
                return true;
            }
        }
        ...
    }

Why is it (stypes.is_valley(i1) || stypes.is_valley(i2)) and not (stypes.is_valley(i1) && stypes.is_valley(i2)), since we presumably want to make sure that both wstrings end here? If one string ends at a valley while the other does not, shouldn't they be unequal?

Thank you!

Accessing the duplicates and their counts

Hey, thanks for releasing the code!

I'm a bit confused regarding how to use the dups_ and sizes_ files.
I would like to get a mapping between all duplicate strings and their corresponding number of appearances in the data.
From my understanding, this is what the pointers in those files encode, but I don't understand how to read them.
Any explanation would be helpful (and a code snippet / reference even better)!

Thanks
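
For concreteness, this is a starting point for just dumping the raw entries, assuming they are little-endian unsigned 64-bit integers; the actual width and the meaning of each entry are exactly what I'm unsure about, so this needs to be checked against what src/main.rs writes (paths below are hypothetical):

import numpy as np

# ASSUMPTION: entries are little-endian u64; verify the width that
# self-similar actually writes in src/main.rs before trusting this.
dups = np.fromfile("/tmp/cache/dups_mydata_0-100000000", dtype="<u8")
sizes = np.fromfile("/tmp/cache/sizes_mydata_0-100000000", dtype="<u8")
print(dups[:10])
print(sizes[:10])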

Can the tool run on plain text files?

Hello,
I'm trying to deduplicate several plain text files.
If I run python scripts/make_suffix_array.py myfile.en, it correctly generates the myfile.en.table.bin file.

However, if I run cargo selfsimilar_parallel myfile.en, it reports no duplicates.

myfile.en contains the same string 10 times, so I am wondering whether I have to use the TFDS format or not.

[Paper Question] Why use w-shingles over k-shingles?

Hi – I was looking at the NearDup details in Appendix A and saw that "space tokenized consecutive 5-grams" were used for shingling.

I've found very little discussion on the internet comparing character-wise and word-wise shingling, and was wondering whether there was a particular reason or analysis that led to this choice over k-shingling?

cargo build error. Could you upload a cargo.lock file?

It seems Rust updates quite frequently, so it's necessary to use a cargo.lock file to pin dependency versions. This is the error I hit when running 'cargo build'.
rustc version: rustc 1.72.0 (5680fa18f 2023-08-23)
cargo version: cargo 1.72.0 (103a7ff2e 2023-08-15)

$ cargo build
    Updating `ustc` index
   Compiling libc v0.2.147
   Compiling version_check v0.9.4
   Compiling proc-macro2 v1.0.66
   Compiling either v1.9.0
   Compiling unicode-ident v1.0.11
   Compiling glob v0.3.1
   Compiling syn v1.0.109
   Compiling autocfg v1.1.0
   Compiling zstd-safe v2.0.6+zstd.1.4.7
   Compiling hashbrown v0.12.3
   Compiling os_str_bytes v6.5.1
   Compiling heck v0.4.1
   Compiling once_cell v1.18.0
   Compiling bitflags v1.3.2
   Compiling strsim v0.10.0
   Compiling textwrap v0.16.0
   Compiling termcolor v1.2.0
   Compiling crossbeam v0.3.2
   Compiling itertools v0.9.0
   Compiling clap_lex v0.2.4
   Compiling indexmap v1.9.3
   Compiling proc-macro-error-attr v1.0.4
   Compiling proc-macro-error v1.0.4
   Compiling jobserver v0.1.26
   Compiling atty v0.2.14
   Compiling filebuffer v0.4.0
   Compiling quote v1.0.33
   Compiling cc v1.0.83
   Compiling zstd-sys v1.4.18+zstd.1.4.7
thread 'rustc' panicked at 'MemDecoder exhausted', compiler/rustc_serialize/src/opaque.rs:352:9
stack backtrace:
   0:     0x7f8c0209fb61 - std::backtrace_rs::backtrace::libunwind::trace::he648b5c8dd376705
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f8c0209fb61 - std::backtrace_rs::backtrace::trace_unsynchronized::h5da3e203eef39e9f
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f8c0209fb61 - std::sys_common::backtrace::_print_fmt::h8d28d3f20588ae4c
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x7f8c0209fb61 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::hd9a5b0c9c6b058c0
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7f8c0210677f - core::fmt::rt::Argument::fmt::h0afc04119f252b53
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/fmt/rt.rs:138:9
   5:     0x7f8c0210677f - core::fmt::write::h50b1b3e73851a6fe
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/fmt/mod.rs:1094:21
   6:     0x7f8c020924a7 - std::io::Write::write_fmt::h184eaf275e4484f0
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/io/mod.rs:1714:15
   7:     0x7f8c0209f975 - std::sys_common::backtrace::_print::hf58c3a5a25090e71
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7f8c0209f975 - std::sys_common::backtrace::print::hb9cf0a7c7f077819
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7f8c020a2753 - std::panicking::default_hook::{{closure}}::h066adb2e3f3e2c07
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:269:22
  10:     0x7f8c020a24e4 - std::panicking::default_hook::h277fa2776900ff14
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:288:9
  11:     0x7f8c053a456b - <rustc_driver_impl[10725d833993dc31]::install_ice_hook::{closure#0} as core[f12ae36cc2e1ecf0]::ops::function::FnOnce<(&core[f12ae36cc2e1ecf0]::panic::panic_info::PanicInfo,)>>::call_once::{shim:vtable#0}
  12:     0x7f8c020a2f7e - <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call::h09cad52ea08435f2
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/boxed.rs:2007:9
  13:     0x7f8c020a2f7e - std::panicking::rust_panic_with_hook::hceaf38da6d9db792
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:709:13
  14:     0x7f8c020a2cc1 - std::panicking::begin_panic_handler::{{closure}}::h2bce3ed2516af7df
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:595:13
  15:     0x7f8c0209ffc6 - std::sys_common::backtrace::__rust_end_short_backtrace::h090f3faf8f98a395
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys_common/backtrace.rs:151:18
  16:     0x7f8c020a2a52 - rust_begin_unwind
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/panicking.rs:593:5
  17:     0x7f8c021029f3 - core::panicking::panic_fmt::h4ec8274704d163a3
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/core/src/panicking.rs:67:14
  18:     0x7f8c05de8f8a - <rustc_serialize[42a6bc6c87d9c952]::opaque::MemDecoder>::decoder_exhausted
  19:     0x7f8c0353168a - <alloc[adbc6fffff8d40a5]::string::String as rustc_serialize[42a6bc6c87d9c952]::serialize::Decodable<rustc_metadata[b86bf7e30d297efc]::rmeta::decoder::DecodeContext>>::decode
  20:     0x7f8c04750d1b - <rustc_metadata[b86bf7e30d297efc]::locator::CrateLocator>::extract_one
  21:     0x7f8c04750316 - <rustc_metadata[b86bf7e30d297efc]::locator::CrateLocator>::extract_lib
  22:     0x7f8c0474cd2b - <rustc_metadata[b86bf7e30d297efc]::locator::CrateLocator>::maybe_load_library_crate
  23:     0x7f8c0474b140 - <rustc_metadata[b86bf7e30d297efc]::creader::CrateLoader>::load
  24:     0x7f8c04746c14 - <rustc_metadata[b86bf7e30d297efc]::creader::CrateLoader>::maybe_resolve_crate
  25:     0x7f8c0444d6a2 - <rustc_resolve[ce4e00308278261e]::Resolver>::early_resolve_ident_in_lexical_scope
  26:     0x7f8c036f8a5f - <rustc_resolve[ce4e00308278261e]::Resolver>::resolve_path_with_ribs
  27:     0x7f8c03ec98b5 - <rustc_resolve[ce4e00308278261e]::Resolver as rustc_expand[2551adda1704db0c]::base::ResolverExpand>::resolve_imports
  28:     0x7f8c03d80051 - <rustc_expand[2551adda1704db0c]::expand::MacroExpander>::fully_expand_fragment
  29:     0x7f8c04478307 - <rustc_expand[2551adda1704db0c]::expand::MacroExpander>::expand_crate
  30:     0x7f8c04477730 - <rustc_session[c98e5f13e8087e38]::session::Session>::time::<rustc_ast[2602dd4bdaa32052]::ast::Crate, rustc_interface[7e6a9899d53a1fe5]::passes::configure_and_expand::{closure#1}>
  31:     0x7f8c044238d8 - rustc_interface[7e6a9899d53a1fe5]::passes::resolver_for_lowering
  32:     0x7f8c047bad2a - rustc_query_impl[7f6201a046a7a363]::plumbing::__rust_begin_short_backtrace::<rustc_query_impl[7f6201a046a7a363]::query_impl::resolver_for_lowering::dynamic_query::{closure#2}::{closure#0}, rustc_middle[4cadb439cfabc8cf]::query::erase::Erased<[u8; 8usize]>>
  33:     0x7f8c047bad19 - <rustc_query_impl[7f6201a046a7a363]::query_impl::resolver_for_lowering::dynamic_query::{closure#2} as core[f12ae36cc2e1ecf0]::ops::function::FnOnce<(rustc_middle[4cadb439cfabc8cf]::ty::context::TyCtxt, ())>>::call_once
  34:     0x7f8c04701d7c - rustc_query_system[5a9d202ec1d2890c]::query::plumbing::try_execute_query::<rustc_query_impl[7f6201a046a7a363]::DynamicConfig<rustc_query_system[5a9d202ec1d2890c]::query::caches::SingleCache<rustc_middle[4cadb439cfabc8cf]::query::erase::Erased<[u8; 8usize]>>, false, false, false>, rustc_query_impl[7f6201a046a7a363]::plumbing::QueryCtxt, false>
  35:     0x7f8c04cead67 - rustc_query_impl[7f6201a046a7a363]::query_impl::resolver_for_lowering::get_query_non_incr::__rust_end_short_backtrace
  36:     0x7f8c0486af80 - <rustc_interface[7e6a9899d53a1fe5]::queries::QueryResult<&rustc_middle[4cadb439cfabc8cf]::ty::context::GlobalCtxt>>::enter::<&rustc_data_structures[cf1e21f3f9509cab]::steal::Steal<(rustc_middle[4cadb439cfabc8cf]::ty::ResolverAstLowering, alloc[adbc6fffff8d40a5]::rc::Rc<rustc_ast[2602dd4bdaa32052]::ast::Crate>)>, rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}::{closure#2}::{closure#2}>
  37:     0x7f8c048699bb - <rustc_interface[7e6a9899d53a1fe5]::interface::Compiler>::enter::<rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}::{closure#2}, core[f12ae36cc2e1ecf0]::result::Result<core[f12ae36cc2e1ecf0]::option::Option<rustc_interface[7e6a9899d53a1fe5]::queries::Linker>, rustc_span[14af2d27fb997609]::ErrorGuaranteed>>
  38:     0x7f8c04866cc5 - rustc_span[14af2d27fb997609]::set_source_map::<core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>, rustc_interface[7e6a9899d53a1fe5]::interface::run_compiler<core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>, rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}>::{closure#0}::{closure#0}>
  39:     0x7f8c04866736 - <scoped_tls[a7f541bbfecfca9d]::ScopedKey<rustc_span[14af2d27fb997609]::SessionGlobals>>::set::<rustc_interface[7e6a9899d53a1fe5]::interface::run_compiler<core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>, rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}>::{closure#0}, core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>>
  40:     0x7f8c04865cfc - std[7c7acd4e376d60d3]::sys_common::backtrace::__rust_begin_short_backtrace::<rustc_interface[7e6a9899d53a1fe5]::util::run_in_thread_pool_with_globals<rustc_interface[7e6a9899d53a1fe5]::interface::run_compiler<core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>, rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}>::{closure#0}, core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>>::{closure#0}::{closure#0}, core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>>
  41:     0x7f8c04bde305 - <<std[7c7acd4e376d60d3]::thread::Builder>::spawn_unchecked_<rustc_interface[7e6a9899d53a1fe5]::util::run_in_thread_pool_with_globals<rustc_interface[7e6a9899d53a1fe5]::interface::run_compiler<core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>, rustc_driver_impl[10725d833993dc31]::run_compiler::{closure#1}>::{closure#0}, core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>>::{closure#0}::{closure#0}, core[f12ae36cc2e1ecf0]::result::Result<(), rustc_span[14af2d27fb997609]::ErrorGuaranteed>>::{closure#1} as core[f12ae36cc2e1ecf0]::ops::function::FnOnce<()>>::call_once::{shim:vtable#0}
  42:     0x7f8c020ad435 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hc0b1022758ecac73
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/boxed.rs:1993:9
  43:     0x7f8c020ad435 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h0c9654ebe7ad657e
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/alloc/src/boxed.rs:1993:9
  44:     0x7f8c020ad435 - std::sys::unix::thread::Thread::new::thread_start::h04c8e9c7d83d3bd5
                               at /rustc/5680fa18feaa87f3ff04063800aec256c3d4b4be/library/std/src/sys/unix/thread.rs:108:17
  45:     0x7f8c01f69609 - start_thread
  46:     0x7f8c01e8e293 - clone
  47:                0x0 - <unknown>

error: the compiler unexpectedly panicked. this is a bug.

note: we would appreciate a bug report: https://github.com/rust-lang/rust/issues/new?labels=C-bug%2C+I-ICE%2C+T-compiler&template=ice.md

note: rustc 1.72.0 (5680fa18f 2023-08-23) running on x86_64-unknown-linux-gnu

note: compiler flags: --crate-type lib -C embed-bitcode=no -C overflow-checks=off

note: some of the compiler flags provided by cargo are hidden

query stack during panic:
#0 [resolver_for_lowering] getting the resolver for lowering
end of query stack
error: could not compile `proc-macro-error` (lib)
warning: build failed, waiting for other jobs to finish...

[Bug] Out of range error when counting occurrences on a custom suffix array

Hello!

On some rare occasions, this bug can be triggered when counting occurrences of a pattern:

thread 'main' panicked at src\main.rs:284:40:
range end index 133018772 out of range for slice of length 133018768
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Below is a demo to trigger this bug. To run it, you first need a custom tokenized SQuAD dataset as your data file and suffix array. I constructed mine by concatenating the context and question pairs of SQuAD, with tiktoken.gpt2's eot_token as the separator (instead of the sep() function's b"\xff\xff" + struct.pack("<I", UID)). The concatenated dataset as well as the suffix array can be found here: https://drive.google.com/drive/folders/1-c3wc4WClZJhed3hVgadYQvlA5CEyAwE?usp=sharing. Assuming you put the files in data/squad, run the following Python code:

import numpy as np
import tiktoken
import subprocess
import platform
import struct

tokenizer = tiktoken.get_encoding('gpt2')
window_size = 10

# tokenized to: ' indis', 'put', 'ably'
# tokens are: [49919, 1996, 1346]
# byte representation in hex (with 4-byte length prefix): 06 00 00 00 FF C2 CC 07 42 05
problematic_str = " indisputably"
problematic_str_tokens = tokenizer.encode_ordinary(problematic_str)
problematic_str_bytes = np.array(problematic_str_tokens, dtype=np.uint16).view(np.uint8).tobytes()
problematic_str_bytes_with_len = struct.pack("<L", len(problematic_str_bytes)) + problematic_str_bytes

# for simplicity of demo, just delete these manually after running script
open('problematic_str_bytes', 'wb').write(problematic_str_bytes)
open('problematic_str_bytes_with_len', 'wb').write(problematic_str_bytes_with_len)

# launch count-occurrences(-multi) rust scripts
path_separator = "\\" if platform.system() == "Windows" else "/"
command = [
    f".{path_separator}target{path_separator}debug{path_separator}dedup_dataset",
    "count-occurrences",
    "--data-file", f"data{path_separator}squad{path_separator}train",
    "--query-file", "problematic_str_bytes",
    "--load-disk"
]
command2 = [
    f".{path_separator}target{path_separator}debug{path_separator}dedup_dataset",
    "count-occurrences-multi",
    "--data-file", f"data{path_separator}squad{path_separator}train",
    "--query-file", "problematic_str_bytes_with_len",
    "--load-disk"
]
result_count_occurrences = subprocess.run(command, capture_output=True, text=True, shell=(platform.system() == "Windows"))
result_count_occurrences_multi = subprocess.run(command2, capture_output=True, text=True, shell=(platform.system() == "Windows"))
print(result_count_occurrences.stdout)
if result_count_occurrences.stderr: print(result_count_occurrences.stderr)
print(result_count_occurrences_multi.stdout)
if result_count_occurrences_multi.stderr: print(result_count_occurrences_multi.stderr)

you should get this bug:

thread 'main' panicked at src\main.rs:284:40:
range end index 133018772 out of range for slice of length 133018768      
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'main' panicked at src\main.rs:278:40:
range end index 133018772 out of range for slice of length 133018768      
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I have traced the bug to the following cause: during the binary search of count_occurances and count_occurances_memory, it is possible (although very rare) to reach the end of the suffix array without finding the pattern. If that is the case, we end up with low equal to size/(size_width as u64), and that triggers an out-of-range error on the next line (let pos: usize = table_load_filebuffer(&table, low as usize, size_width);), as it tries to read the next size_width bytes past the end of the suffix array.

In this example, the token " indisputably" is tokenized (using tiktoken gpt2) into [49919, 1996, 1346], where 49919 is FF C2 in hex (little-endian). This FF causes the search to reach the end of the suffix array. I also tried to reproduce the bug with the provided load_dataset_hf.py script, which uses the sep() function instead of eot_token as the separator, and the script works fine. (Is the b"\xff\xff" from sep() an intentional design choice to specifically prevent this kind of scenario?) I also tried using openwebtext as the suffix array with eot_token as the separator, and the script also works fine. Out of a lot of queries, only "indisputably" triggered it on the SQuAD suffix array with eot_token as the separator. So it is very rare, but very annoying to debug once it happens...

I have made a pull request to fix this bug; could it be merged if it looks good? It would mean a lot to me :) Thanks!
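
For anyone curious, here is a toy Python reproduction of the boundary condition (not the repo's Rust code): when the query sorts after every suffix, the binary search lands one past the end of the table, so an explicit guard is needed before dereferencing:

data = b"aaab"
sa = sorted(range(len(data)), key=lambda i: data[i:])  # toy suffix array
query = b"c"  # sorts after every suffix in data

lo, hi = 0, len(sa)
while lo < hi:
    mid = (lo + hi) // 2
    if data[sa[mid]:sa[mid] + len(query)] < query:
        lo = mid + 1
    else:
        hi = mid

# without this guard, sa[lo] would be an out-of-range read
if lo >= len(sa) or not data[sa[lo]:].startswith(query):
    print("Number of times present: 0")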

false positives

I have a plain text file that I'd like to deduplicate. So I ran the following commands:

python3 scripts/make_suffix_array.py ../data.small.txt
cargo run selfsimilar_parallel ../data.small.txt
cargo run collect_similar data.small.txt # this command does not like the relative path

I saved the start/end indices to a file.
However, when I double-check the duplicates, I find that some of them only appear once in the corpus.

import pandas as pd

with open("../data.small.txt", "rb") as f:
    data = f.read()
    total = 0
    count = 0

    for x, y in pd.read_csv("indices.txt", sep=" ", names=["start", "end"]).values.tolist():
        total += 1
        if data.count(data[x:y]) <= 1:
            count += 1

print(total, count, count/total)
25 5 0.2

data.small.txt gs://deduplicate_sample_data/data.small.txt or https://storage.googleapis.com/deduplicate_sample_data/data.small.txt
indices.txt gs://deduplicate_sample_data/indices.txt or https://storage.googleapis.com/deduplicate_sample_data/indices.txt

Am I missing a step here?

remove_ex in finish_dedup_wiki40b

Thanks for your excellent code.

I have successfully rerun the ExactSubstr code in the repository. However, I have a question about the following code in finish_dedup_wiki40b:

remove_ex[i].append((max(int(remove[ptr][0] - byte_start - 6), 0),
                     min(int(remove[ptr][1] - byte_start), byte_end - byte_start)))

I understand what the "6" is for, but why isn't 6 also subtracted on the right-hand side?

Why not use Simhash?

Google has shown that Simhash is practically useful for identifying near-duplicates in web documents in a multi-billion-page repository ("Detecting Near-Duplicates for Web Crawling"). In your paper, you chose MinHash for approximate matching. Why not use Simhash in this scenario?

Can this tool process Chinese?

My input is (Chinese medical Q&A pairs):
0 乳腺癌骨转移全身疼痛怎么办 乳腺癌骨转移有救吗?
0 乳腺癌转移到肺能活多久 乳腺癌转移肝能活多久?
1 妊娠期高血压的概念 什么是妊娠高血压
0 骨折了又是高血压怎么办 推荐吃什么药平时饮食方面吃什么又降血压又可解决贫血的问题的
0 重型乙肝治愈率有多高 丙肝的治愈率大吗比乙肝严重吗
but the output is:
��荏�解决�丯���灭活破坰��肠,她77�� 能�经吃不下东西了���,��90150,但一般情况是80135��10度的室内泳池游泾��春亓�而且有初筛资格��方啊��些工作�作��地方厒�灸��旋澡��麦严重
1�重和��圳����辇�大谁能给我准
0 3年,今年出现淤�同一只��5·3是有任娠��6.2��是哪一类的��友。。交和��我口,�..型 ��常的一些习�?
0��到一点丿��ca�505年听朋友倍他乐�年吃�,伤处���圕�特别痒腿上有几个红疙瘈�阴然后用手洗阴郹�,��一7夤�高,平想食疗。��卜麻 胖����在��萎缩、���几款�来的肠炎胃窦�稳定就��脏内科血液内科��秈����几级�过的针扎过��� ��它与普��中期�护理�麦��惯和饮食习惯153值吃益肝灵呢�礋����吗��冒或发烧或嗓子������交未��我啊,点怕怕的,可
1105天左右��须在半年之内扯�细��到我?阿胶含不含胆骨醇
1,叚子痛
150多岁了经常��睡觉,还虚胖,不怎么?�酒炖鸽子21号晚上���放心��率���发现ED患者问��
0 平时少喝水��可乐爱���������靑14.3千多��

Obviously, the output is wrong. What should I do, or should I choose another tool? Thank you.
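
For what it's worth, the garbled output looks like UTF-8 byte sequences that were cut in the middle of multi-byte characters (the tool works on raw bytes). A minimal illustration of that effect, separate from this repo:

text = "妊娠期高血压的概念"   # each Chinese character is 3 bytes in UTF-8
data = text.encode("utf-8")
cut = data[1:13]              # byte offsets that land mid-character
print(cut.decode("utf-8", errors="replace"))  # prints replacement characters, like the output above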

how to deduplicate huggingface datasets

Hey there, excellent work on this repo and the paper.

I wanted to know how I could use this to deduplicate a custom Hugging Face dataset that I developed and cleaned myself.

It has been saved with custom_dataset.save_to_disk("dataset_path")

and can be loaded with custom_dataset = datasets.load_from_disk("dataset_path")
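
For concreteness, here is a rough sketch of one way to flatten such a dataset into a single byte file that the suffix-array scripts can consume. It assumes a "text" column, and the separator convention is borrowed from the sep() function mentioned in the out-of-range issue above; scripts/load_dataset_hf.py should be checked for the exact format the downstream scripts expect:

import struct
import datasets

ds = datasets.load_from_disk("dataset_path")

def sep(uid: int) -> bytes:
    # separator convention referenced above; verify against scripts/load_dataset_hf.py
    return b"\xff\xff" + struct.pack("<I", uid)

with open("custom_dataset.train", "wb") as out:
    for uid, example in enumerate(ds):
        out.write(sep(uid) + example["text"].encode("utf-8"))  # assumes a "text" column

# then: python scripts/make_suffix_array.py custom_dataset.train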
