helix-editor / nucleo

A fast and convenient fuzzy matcher library for rust

License: Mozilla Public License 2.0

Rust 99.87% Shell 0.13%

fuzzy-matching fuzzy-search performance rust text-processing

nucleo's People

Contributors

gabydd poliorcetics twolodzko hywan jessegrosjean solaeus tudyx zub a-kenji blinxen kallyaleksiev miloas truenaho jadegeek feel-ix-343 jrmoulton jvolante yonasbsd cosmikwolf

nucleo's Issues

consider using release tags and a changelog

I am considering using nucleo in gitui because it brings nice speed improvements, but I would feel better if the crate made it easier to figure out what changed from one release to another. Two things would mainly help with that:

tag releases in git
maintain a CHANGELOG.md

Add a feature flag to disable Unicode normalization support

As discussed recently in the Matrix room, it would be nice to add a feature flag to disable Unicode normalization. If the project using nucleo already has a Unicode normalization dependency in its tree, there is no need for nucleo to add more.
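A minimal sketch of how such a flag could gate the code path, assuming a Cargo feature named `unicode-normalization` (a hypothetical name used here for illustration) declared as `[features] unicode-normalization = []`. The `normalize` function and its tiny lookup table are stand-ins, not nucleo's implementation:

```rust
/// With the (hypothetical) feature enabled, fold a few accented Latin letters
/// to ASCII. The tiny table is a stand-in for a real normalization table.
#[cfg(feature = "unicode-normalization")]
fn normalize(c: char) -> char {
    match c {
        'ä' | 'á' | 'à' => 'a',
        'ë' | 'é' | 'è' => 'e',
        'ö' | 'ó' | 'ò' => 'o',
        'ü' | 'ú' | 'ù' => 'u',
        other => other,
    }
}

/// With the feature disabled, normalization is the identity function, so no
/// tables or extra dependencies need to be compiled in.
#[cfg(not(feature = "unicode-normalization"))]
fn normalize(c: char) -> char {
    c
}

fn main() {
    let folded: String = "bë".chars().map(normalize).collect();
    // Prints "be" with the feature enabled and "bë" without it.
    println!("{}", folded);
}
```

Callers use `normalize` unconditionally; only the feature selection decides which body is compiled.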

Thanks!

Where are the crate docs?

I tried looking in the README and on crates.io but didn't find them.

I want to use this crate in my project. :)

Thank you.

bench: standalone fuzzy finder for benchmarking against other implementations

See jake-stewart/jfind#19 for context.

At least in the linked asciinema it is significantly slower, but I'm not sure of the exact reasons.
Startup time should be negligible for non-tiny datasets, so a fuzzy finder CLI frontend would be the easiest way to make a fair comparison.

[Feature request] Way to get scores of many/all items

There currently seem to be only methods for getting the best match.

Many use cases require ranking many or all items. Getting back a sorted list would be nice, or at least the ability to get a score for a single needle and haystack so we can do the collecting and sorting ourselves.
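The "collect and sort ourselves" approach can be sketched as follows. `toy_score` is a deliberately simple stand-in scorer (not nucleo's algorithm), and `rank` is a hypothetical helper; the point is only the shape: score every item, drop non-matches, sort by score. The `match_list` call quoted in another issue on this page appears to return similar `(item, score)` pairs.

```rust
/// Toy scorer (NOT nucleo's algorithm): returns `None` when `needle` is not a
/// case-insensitive subsequence of `haystack`, otherwise a score that rewards
/// consecutive matched characters.
fn toy_score(haystack: &str, needle: &str) -> Option<u32> {
    let mut needle_chars = needle.chars().flat_map(char::to_lowercase).peekable();
    let mut score = 0u32;
    let mut prev_matched = false;
    for c in haystack.chars().flat_map(char::to_lowercase) {
        // Stop as soon as the whole needle has been consumed.
        let Some(&n) = needle_chars.peek() else { break };
        if n == c {
            score += if prev_matched { 2 } else { 1 };
            prev_matched = true;
            needle_chars.next();
        } else {
            prev_matched = false;
        }
    }
    // Every needle character must have been consumed for a match.
    needle_chars.peek().is_none().then_some(score)
}

/// Score all items, keep the matches, and return them best-first.
fn rank<'a>(items: &[&'a str], needle: &str) -> Vec<(&'a str, u32)> {
    let mut scored: Vec<(&'a str, u32)> = items
        .iter()
        .filter_map(|&item| toy_score(item, needle).map(|s| (item, s)))
        .collect();
    // Highest score first; the stable sort keeps input order among ties.
    scored.sort_by(|a, b| b.1.cmp(&a.1));
    scored
}

fn main() {
    let items = ["foobar", "fxxoo", "oo", "a"];
    for (item, score) in rank(&items, "fo") {
        println!("{} {}", score, item);
    }
}
```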

How should Nucleo work?

Thanks for creating this fuzzy matching library.

I encountered a weird problem with the Nucleo struct.

Consider the following code, which you can run on rust-explorer:

use std::sync::Arc;
use nucleo::Nucleo;
use nucleo::pattern::{CaseMatching, Normalization};

fn main() {
    let mut matcher = init_fuzzy_matcher();
    let inject = matcher.injector();
    let list = ["foobar", "fxxoo", "oo", "a"];
    list.iter().for_each(|s| {
        inject.push(s, |_| {});
    });
    matcher
        .pattern
        .reparse(0, "f", CaseMatching::Ignore, Normalization::Smart, false);
    let _status = matcher.tick(1000);
    dbg!(matcher.pattern.column_pattern(0));

    let mut counter = 0;
    loop {
        let _status = matcher.tick(100);
        // if status.changed {
        let snapshot = matcher.snapshot();
        let total = snapshot.item_count();
        let got = snapshot.matched_item_count();
        let res: Vec<_> = snapshot
            .matched_items(..)
            .map(|item| item.data)
            .collect();
        dbg!(total, got, res);
        // }
        // if !status.running {
        //     break;
        // }
        println!("running");
        if counter > 4 {
            break;
        }
        counter += 1;
    }
}

type Matcher = Nucleo<&'static str>;

fn init_fuzzy_matcher() -> Matcher {
    Nucleo::new(
        nucleo::Config::DEFAULT,
        Arc::new(|| println!("notified")),
        None,
        1,
    )
}

The res is always empty:

[src/main.rs:34:9] total = 4
[src/main.rs:34:9] got = 0
[src/main.rs:34:9] res = []

Using nucleo::Matcher with the same config, input, and needle string gives the desired output.

use nucleo::pattern::{Atom, AtomKind, CaseMatching, Normalization};
use nucleo::Matcher;

fn main() {
    let mut matcher = init_fuzzy_matcher();
    let list = ["foobar", "fxxoo", "oo", "a"];
    let res = Atom::new(
        "f",
        CaseMatching::Ignore,
        Normalization::Smart,
        AtomKind::Fuzzy,
        false,
    )
    .match_list(&list, &mut matcher);
    dbg!(res);
}

fn init_fuzzy_matcher() -> Matcher {
    Matcher::new(nucleo::Config::DEFAULT)
}

[src/main.rs:20:5] res = [
    (
        "foobar",
        36,
    ),
    (
        "fxxoo",
        36,
    ),
]

So the question is: how do we use Nucleo the right way? I saw an issue asking for examples, but it has no replies.
I also scanned helix's source files; although nucleo is one of its dependencies, what it actually uses is Matcher, not Nucleo.
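One possible explanation, assuming (the thread does not confirm this) that `Nucleo` matches against the column text written by the closure passed to `push` rather than against the item itself: with `|_| {}` the single column stays empty, so nothing can match. The toy model below illustrates that behavior with hypothetical types (`ToyNucleo`, `Item`) and a plain substring test; it is not nucleo's API or algorithm.

```rust
/// Toy stand-in for an injected item: the user's data plus the column text
/// that matching actually runs against.
struct Item<T> {
    data: T,
    columns: Vec<String>,
}

/// Toy stand-in for the injector/matcher pair (hypothetical, not nucleo's types).
struct ToyNucleo<T> {
    items: Vec<Item<T>>,
    num_columns: usize,
}

impl<T> ToyNucleo<T> {
    fn new(num_columns: usize) -> Self {
        Self { items: Vec::new(), num_columns }
    }

    /// Columns start empty; only `fill_columns` can give them text.
    fn push(&mut self, data: T, fill_columns: impl FnOnce(&mut [String])) {
        let mut columns = vec![String::new(); self.num_columns];
        fill_columns(&mut columns);
        self.items.push(Item { data, columns });
    }

    /// Match `needle` against one column (plain substring test for brevity).
    fn matched_items(&self, column: usize, needle: &str) -> Vec<&T> {
        self.items
            .iter()
            .filter(|item| item.columns[column].contains(needle))
            .map(|item| &item.data)
            .collect()
    }
}

fn main() {
    let list = ["foobar", "fxxoo", "oo", "a"];

    // Mirrors `inject.push(s, |_| {})`: the columns are never filled.
    let mut empty = ToyNucleo::new(1);
    for s in list {
        empty.push(s, |_| {});
    }
    println!("{:?}", empty.matched_items(0, "f")); // prints []

    // Filling column 0 with the item's text gives the matcher something to search.
    let mut filled = ToyNucleo::new(1);
    for s in list {
        filled.push(s, |cols| cols[0] = s.to_string());
    }
    println!("{:?}", filled.matched_items(0, "f")); // prints ["foobar", "fxxoo"]
}
```

If the assumption holds, the corresponding change to the failing snippet would be a closure that writes the item's text into column 0, something like `|cols| cols[0] = (*s).into()`. The exact closure signature may differ between nucleo versions, so check the `Injector::push` documentation for the version in use.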

Generate Coverage Report in CI

I am striving for a high test coverage in nucleo. The matcher crate and the pattern parsing should already hit 80% test coverage. I would like to track coverage automatically in CI (for example with coveralls). This helps with triaging (identify uncovered branches) and makes it easier to track where tests are still needed.

I would imagine that we generate test data with cargo tarpaulin in CI and upload the report to coveralls or a similar service (I would need to set up the account once there is a PR). LLVM-based instrumentation should be used.

Spurious matches with substring matching and non-ASCII

        let needle = Utf32String::from("lying");
        let haystack = Utf32String::from("Flibbertigibbet / イタズラっ子たち");
        let mut matcher = Matcher::new(Config::DEFAULT);
        assert_eq!(
            matcher.substring_match(haystack.slice(..), needle.slice(..)),
            None
        )

This should pass, but it fails with a score of 30; running with indices indicates that only the first codepoint in the haystack matches. If I get rid of the Japanese text the match goes away as expected. Fuzzy, postfix, and prefix match all indicate that there is no match; it's only substring match that breaks.

edit: If I use Utf32String::Unicode("lying".chars().collect()) there's no match, so I think the 'ascii needle, unicode haystack' codepath is the one with the problem.
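For cross-checking reports like this one, a naive reference implementation is handy as a test oracle. The sketch below is a hypothetical helper (`naive_substring`, not part of nucleo) that performs a contiguous, case-insensitive search over code points using only the standard library. It confirms that "lying" does not occur in the haystack above.

```rust
/// Naive reference check (not nucleo's implementation): does `needle` occur
/// as a contiguous, case-insensitive run of code points in `haystack`?
/// Returns the start index within the lowercased code point sequence.
fn naive_substring(haystack: &str, needle: &str) -> Option<usize> {
    let hay: Vec<char> = haystack.chars().flat_map(char::to_lowercase).collect();
    let pat: Vec<char> = needle.chars().flat_map(char::to_lowercase).collect();
    if pat.is_empty() {
        // `windows(0)` would panic, and an empty needle matches trivially.
        return Some(0);
    }
    hay.windows(pat.len()).position(|w| w == pat.as_slice())
}

fn main() {
    let haystack = "Flibbertigibbet / イタズラっ子たち";
    println!("{:?}", naive_substring(haystack, "lying")); // prints None
    println!("{:?}", naive_substring(haystack, "GIB")); // prints Some(9)
}
```

A property test could compare such an oracle against `substring_match` over random ASCII needles and mixed ASCII/non-ASCII haystacks to pin down the failing code path.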

Panic with simple pattern.

I know you're still working on this. But I might as well report it.
The example below panics with 'should have been caught by prefilter', .../git/checkouts/nucleo-fe29e1ee969779b0/9c4b710/matcher/src/fuzzy_optimal.rs:41:13

It has something to do with case, because it doesn't happen when ignore_case is false.

[dependencies]
nucleo = { version = "*", git = "https://github.com/helix-editor/nucleo" }

use nucleo::*;

fn main() {
    let conf = MatcherConfig::DEFAULT;
    let mut matcher = Matcher::new(conf);

    let needle = "aB";
    let mut buf1 = Vec::new();
    let needle = Utf32Str::new(needle, &mut buf1);

    let haystack = "aaB";
    let mut buf2 = Vec::new();
    let haystack = Utf32Str::new(haystack, &mut buf2);

    let mut indices = Vec::new();
    let result = matcher.fuzzy_indices(haystack, needle, &mut indices);

    println!("{:?} {:?}", result, indices);
}

Starter example?

A simple starter example would be a great addition.

Run typos-rs in CI

I am using typos-rs locally to automatically fix (some) typos. It would be nice to have this run in CI so typos are caught during review. I already have an ignore file set up, so it's just a matter of adding the GH action step.

Standalone CLI - toy project

Hi! I saw your Reddit post about nucleo and got curious about writing a standalone CLI version as a little "side project" (as "coding breaks" while studying for my exams). I've already started, but I didn't create this issue at the beginning because I don't know how far I'll get or whether it will turn into a mature CLI program at all.

However, you said in this answer:

So somebody else could also contribute that (although if somebody does this, please reach out first).

So I'm writing this issue, just in case you may be interested in it and want to use some of my code if you start to write the standalone cli of nucleo. Here's the link to my repo: https://github.com/TornaxO7/nucle

If you have some suggestions/hints/questions, feel free to ask.

How should/does nucleo handle umlauts?

For example, I noticed that a needle ë fails to fuzzy match bë. On the other hand, a needle e will match bë, and a needle ë will match a haystack ë.

let paths = ["be", "bë"];
let mut matcher = Matcher::new(Config::DEFAULT);
let matches = Pattern::parse("ë", CaseMatching::Ignore).match_list(paths, &mut matcher);
assert_eq!(matches.len(), 1); // fails

Is that expected or a bug? If it is expected, can you say a bit more about why and suggest workarounds? This is mostly so I can document for people using my app why it works the way it does.

Thank you.

Higher score for shorter matches?

First, thanks for posting this project!

I know you are trying to match fzf, but I'm finding some of the fzf scoring hard to make sense of. For example, consider these two cases:

Moby Dick
Though I cannot tell why it was exactly that those stage managers, the Fates, put me down for this shabby part of a whaling voyage

If I search for "md", the second example scores highest, matching "me down". This is also fzf behavior, but it doesn't seem right to me. Would it make sense to incorporate the percentage of matching indices into the score calculation somehow?
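One way to read this suggestion is as a post-processing step on top of the raw score. The sketch below is a hypothetical helper (`coverage_adjusted`, not part of fzf or nucleo) that scales a raw score by the fraction of haystack characters that were matched. The raw scores in `main` are made-up numbers for illustration only.

```rust
/// Hypothetical post-processing step (not part of fzf or nucleo): scale a raw
/// score by the fraction of the haystack covered by matched characters.
fn coverage_adjusted(raw_score: u32, matched_chars: usize, haystack: &str) -> f64 {
    let len = haystack.chars().count();
    if len == 0 {
        return 0.0;
    }
    let coverage = matched_chars as f64 / len as f64;
    f64::from(raw_score) * coverage
}

fn main() {
    let short = "Moby Dick";
    let long = "Though I cannot tell why it was exactly that those stage managers, \
                the Fates, put me down for this shabby part of a whaling voyage";

    // Made-up raw scores: the long line wins before adjustment.
    let short_adj = coverage_adjusted(40, 2, short);
    let long_adj = coverage_adjusted(50, 2, long);

    // After adjustment the short title ranks first.
    println!("short: {:.2}, long: {:.2}", short_adj, long_adj);
}
```

Plain multiplication penalizes long haystacks very heavily; a milder blend, such as a small length-based tiebreak, may be a better fit in practice.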

[Feature request] Get match indices and matched letters indices

I am creating emenu, an app like dmenu/rofi for Windows, using nucleo, and I have run into two main issues.

First, if there are multiple match candidates that are the same string, it would be nice to have a method on the snapshot that returns the items along with their global index so I can differentiate between them in the GUI, or that returns just the indices so I can get the item with get_item().

Second, as with fzf-matcher, it would be nice to have a method that returns the indices of the matched characters so I can highlight them in the GUI.
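Both requests can be sketched with the standard library alone. The helpers below (`match_indices`, `matches_with_index`, `highlight`) are hypothetical and use a greedy left-to-right subsequence match, which is a simplification and not nucleo's scoring alignment. The panic report elsewhere on this page calls `Matcher::fuzzy_indices`, which suggests the low-level matcher can already report matched positions.

```rust
/// Greedy left-to-right subsequence match (a simplification, not nucleo's
/// alignment): returns the `char` index in `haystack` of each matched needle
/// character, or `None` if the needle is not a subsequence.
fn match_indices(haystack: &str, needle: &str) -> Option<Vec<usize>> {
    let mut indices = Vec::new();
    let mut needle_chars = needle.chars().peekable();
    for (i, c) in haystack.chars().enumerate() {
        let Some(&n) = needle_chars.peek() else { break };
        if c.eq_ignore_ascii_case(&n) {
            indices.push(i);
            needle_chars.next();
        }
    }
    needle_chars.peek().is_none().then_some(indices)
}

/// Pair every matching item with its position in the input, so identical
/// strings remain distinguishable.
fn matches_with_index<'a>(
    items: &[&'a str],
    needle: &str,
) -> Vec<(usize, &'a str, Vec<usize>)> {
    items
        .iter()
        .enumerate()
        .filter_map(|(idx, &item)| match_indices(item, needle).map(|m| (idx, item, m)))
        .collect()
}

/// Wrap matched characters in brackets, the way a GUI might highlight them.
fn highlight(haystack: &str, indices: &[usize]) -> String {
    let mut out = String::new();
    for (i, c) in haystack.chars().enumerate() {
        if indices.contains(&i) {
            out.push('[');
            out.push(c);
            out.push(']');
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    let items = ["run", "ruin", "run"];
    for (idx, item, matched) in matches_with_index(&items, "rn") {
        println!("#{} {}", idx, highlight(item, &matched));
    }
}
```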