
Comments (2)

carlini commented on August 29, 2024

Uhm. This is a good question. I do not know the answer to this right away---this code is from the following suffix array implementation

https://github.com/BurntSushi/suffix/blob/166a975bc9e85a9340567cef86c7c11c08a0860e/src/table.rs#L802

which implements the linear time suffix array construction. Do you have an example input that causes something to go wrong? I've tested this code on tens of terabytes of data and haven't seen any issues, so my default assumption would be "the code is correct but doing something clever", but there might be some kind of corner case I haven't seen before!

from deduplicate-text-datasets.

WWWonderer commented on August 29, 2024

Hi Nicholas, not really, I'm just trying to understand the code base and I find it a bit difficult... However, I tested with some strings as follows:

fn main() -> std::io::Result<()> {
    let text1 = "babcabcab";
    let utf8_text1 = Utf8(text1.as_bytes());
    let mut stypes_text1 = SuffixTypes::new(text1.len() as u64);
    stypes_text1.compute(&utf8_text1);
    println!("text1 test: {}", utf8_text1.wstring_equal(&stypes_text1, 1, 4));

    let text2 = "babcabcdab";
    let utf8_text2 = Utf8(text2.as_bytes());
    let mut stypes_text2 = SuffixTypes::new(text2.len() as u64);
    stypes_text2.compute(&utf8_text2);
    println!("text2 test: {}", utf8_text2.wstring_equal(&stypes_text2, 1, 4));
    Ok(())
}

text1 gives true while text2 gives false, so it kind of works. Maybe this is intended: when the wstrings have different lengths (one keeps descending while the other hits a valley), is_valley(i1) will differ from is_valley(i2) and the function returns false, so maybe checking only one of them for is_valley is enough when returning true? I'm not sure; it still seems a bit odd to me, but it might be the original author's optimization. Thanks for the link.
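To make that reasoning concrete, here is a minimal standalone sketch of the idea. It does not use the repository's `SuffixTypes` or `Utf8` types; `SuffixType`, `compute_types`, and this `wstring_equal` are hypothetical stand-ins written for illustration. It assumes that suffix types are one of ascending, descending, or valley, that the last suffix counts as descending, and that the type-equality check treats valley as its own distinct type. Under those assumptions, once the types at `i1` and `i2` have been found equal, either both positions are valleys or neither is, so testing a single one before returning true gives the same answer as testing both.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SuffixType {
    Ascending,
    Descending,
    Valley, // an ascending suffix whose left neighbour is descending
}

// Classify every suffix in one right-to-left scan.
// Assumption: the last suffix is treated as Descending.
fn compute_types(text: &[u8]) -> Vec<SuffixType> {
    use SuffixType::*;
    let n = text.len();
    let mut types = vec![Descending; n];
    for i in (0..n.saturating_sub(1)).rev() {
        types[i] = if text[i] < text[i + 1] {
            Ascending
        } else if text[i] > text[i + 1] {
            Descending
        } else if types[i + 1] == Descending {
            Descending // equal characters inherit the neighbour's type
        } else {
            Ascending
        };
        // A descending suffix directly followed by an ascending one
        // marks the ascending one as a valley.
        if types[i] == Descending && types[i + 1] == Ascending {
            types[i + 1] = Valley;
        }
    }
    types
}

// Compare the wstrings starting at w1 and w2 (each runs up to and
// including the next valley after its start).
fn wstring_equal(text: &[u8], types: &[SuffixType], w1: usize, w2: usize) -> bool {
    for (i1, i2) in (w1..text.len()).zip(w2..text.len()) {
        if text[i1] != text[i2] || types[i1] != types[i2] {
            return false;
        }
        // types[i1] == types[i2] holds here, so one valley test covers both.
        if i1 > w1 && types[i1] == SuffixType::Valley {
            return true;
        }
    }
    // One side ran off the end of the text before reaching a valley.
    false
}
```

Calling `wstring_equal(text, &compute_types(text), 1, 4)` on `b"babcabcab"` returns true, and on `b"babcabcdab"` it returns false, matching the two results reported above. In the second case the comparison stops at positions 3 and 6, where the characters agree but one suffix is descending and the other ascending.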

