Comments (7)

carlini commented on August 29, 2024

Huh. Thanks for reporting this. I'm not sure what's going on right now but I'll look into it later this week.

To help me debug, could you upload the exact file you're using, in case that makes a difference? (Assuming it's safe to do so, but since you say it's Wikipedia data I suspect it is.)

from deduplicate-text-datasets.

jinyongyoo commented on August 29, 2024

Yes, absolutely! Here is the file test.txt (note: it's very small). I just copied the first few paragraphs of random articles (hint: they were YouTube, Google, and American Revolutionary War). Then I took line 5 and line 16 and duplicated them.

I ran this command:

bash scripts/deduplicate_single_file.sh test.txt test_dedup.txt 20 1

and got the following output:

./target/debug/dedup_dataset make-part --data-file test.txt --start-byte 0 --end-byte 10490
Waiting for jobs to finish
Checking all wrote correctly
FACT 2.0
Rerunning 0 jobs because they failed.
Merging suffix trees
./target/debug/dedup_dataset merge --output-file tmp/out.table.bin --suffix-path test.txt.part.0-10490 --num-threads 40
Now merging individual tables
Cleaning up
    Finished dev [optimized + debuginfo] target(s) in 0.06s
     Running `target/debug/dedup_dataset self-similar --data-file test.txt --length-threshold 20 --cache-dir /tmp/cache --num-threads 1`
Start load!
thread 'main' panicked at 'assertion failed: metadata.len() % (text.len() as u64) == 0', src/main.rs:479:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
    Finished dev [optimized + debuginfo] target(s) in 0.03s
     Running `target/debug/dedup_dataset collect --data-file test.txt --cache-dir /tmp/cache --length-threshold 20`
thread 'main' panicked at 'assertion failed: size_table % size_text == 0', src/main.rs:1049:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
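
Both panics above come from assertions that the table's size is a whole multiple of the text's size. A minimal sketch of the same check done from outside the tool, assuming you have a text file and a table file on disk: the helper name `table_matches_text` and its arguments are hypothetical, and it only mirrors the condition printed in the panic messages, not the tool's own code.

```python
import os

def table_matches_text(text_path: str, table_path: str) -> bool:
    """Return True if the table's byte size is a whole multiple of the text's byte size."""
    size_text = os.path.getsize(text_path)
    size_table = os.path.getsize(table_path)
    if size_text == 0:
        return False  # avoid dividing by zero on an empty input file
    return size_table % size_text == 0
```

A `False` result suggests the table was not built from this exact text file, which would be worth ruling out before digging into a backtrace.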

jinyongyoo commented on August 29, 2024

I also have an unrelated question: can this program handle UTF-8 text (e.g. Korean)? The README.md says:

"Second, we don't want UTF8 strings. Everything is a [u8] byte array, because we might be working over token sequences which aren't valid UTF8."

I wasn't sure what this meant. If we pass a file with UTF-8 strings as input, does this mean we are essentially ignoring the UTF-8 encoding and comparing each byte individually for substring matches?

carlini commented on August 29, 2024

Yeah, you're right in how it handles things: it doesn't understand the idea of a "letter" or "token" or anything -- it just thinks in terms of "bytes".
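
To illustrate what "thinking in bytes" means for UTF-8 input such as Korean, here is a small sketch. It is not the tool's code; the sample strings and the helper `common_prefix_len` are made up for illustration.

```python
# One Korean syllable is one character but three bytes in UTF-8.
text = "한국어"
data = text.encode("utf-8")
print(len(text), len(data))  # 3 characters, 9 bytes

def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two different characters can still share bytes.
a = "가".encode("utf-8")  # b'\xea\xb0\x80'
b = "각".encode("utf-8")  # b'\xea\xb0\x81'
shared = common_prefix_len(a, b)
print(shared)  # 2: the first two bytes match even though the characters differ
```

So a byte-level match can begin or end in the middle of a multi-byte character, and a matched span is not guaranteed to decode as valid UTF-8 on its own.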

carlini commented on August 29, 2024

Okay this is very weird. I can't reproduce the error you see: when I run the commands everything comes out as expected.

Could you re-run the command and upload the following files:

test.txt.part.0-10490
test.txt.part.0-10490.table.bin
test.txt.table.bin

jinyongyoo commented on August 29, 2024

I think I figured out what the issue was. When I cleared the tmp folder and ran the command again, the error went away. My guess is that because the tmp folder never got cleared, I was also merging tables from previous runs. Maybe we should have this line come after L91?

#os.popen("rm tmp/out.table.bin.*").read()
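
A shell-free way to do the same cleanup could look like the sketch below. It assumes the stale files match the pattern in the line above; the function name `clear_stale_tables` is hypothetical and not part of the repository.

```python
import glob
import os
from typing import List

def clear_stale_tables(pattern: str = "tmp/out.table.bin.*") -> List[str]:
    """Delete regular files matching `pattern` and return the paths removed."""
    removed = []
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return sorted(removed)
```

Calling `clear_stale_tables()` before the merge step would keep tables from an earlier run from being picked up. Returning the removed paths makes it easy to log what was deleted.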

carlini commented on August 29, 2024

Ah -- perfect yes thanks. I will do that.
