bminixhofer / nlprule Goto Github PK

A fast, low-resource Natural Language Processing and Text Correction library written in Rust.

License: Apache License 2.0

Rust 94.10% Python 5.27% Shell 0.64%

grammar natural-language-processing spellcheck proofreading style-checker grammatical-error-correction machine-learning nlp rust

nlprule's Introduction

nlprule

A fast, low-resource Natural Language Processing and Error Correction library written in Rust. nlprule implements a rule- and lookup-based approach to NLP using resources from LanguageTool.

Python Usage

Install: pip install nlprule

Use:

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

rules.correct("He wants that you send him an email.")
# returns: 'He wants you to send him an email.'

rules.correct("I can due his homework.")
# returns: 'I can do his homework.'

for s in rules.suggest("She was not been here since Monday."):
    print(s.start, s.end, s.replacements, s.source, s.message)
# prints:
# 4 16 ['was not', 'has not been'] WAS_BEEN.1 Did you mean was not or has not been?

for sentence in tokenizer.pipe("A brief example is shown."):
    for token in sentence:
        print(
            repr(token.text).ljust(10),
            repr(token.span).ljust(10),
            repr(token.tags).ljust(24),
            repr(token.lemmas).ljust(24),
            repr(token.chunks).ljust(24),
        )
# prints:
# 'A'        (0, 1)     ['DT']                   ['A', 'a']               ['B-NP-singular']       
# 'brief'    (2, 7)     ['JJ']                   ['brief']                ['I-NP-singular']       
# 'example'  (8, 15)    ['NN:UN']                ['example']              ['E-NP-singular']       
# 'is'       (16, 18)   ['VBZ']                  ['be', 'is']             ['B-VP']                
# 'shown'    (19, 24)   ['VBN']                  ['show', 'shown']        ['I-VP']                
# '.'        (24, 25)   ['.', 'PCT', 'SENT_END'] ['.']                    ['O']

Rust Usage

Recommended setup:

Cargo.toml

[dependencies]
nlprule = "<version>"

[build-dependencies]
nlprule-build = "<version>" # must be the same as the nlprule version!

build.rs

fn main() -> Result<(), nlprule_build::Error> {
    println!("cargo:rerun-if-changed=build.rs");

    nlprule_build::BinaryBuilder::new(
        &["en"],
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .build()?
    .validate()
}

src/main.rs

use nlprule::{Rules, Tokenizer, tokenizer_filename, rules_filename};

fn main() {
    let mut tokenizer_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        tokenizer_filename!("en")
    ));
    let mut rules_bytes: &'static [u8] = include_bytes!(concat!(
        env!("OUT_DIR"),
        "/",
        rules_filename!("en")
    ));

    let tokenizer = Tokenizer::from_reader(&mut tokenizer_bytes).expect("tokenizer binary is valid");
    let rules = Rules::from_reader(&mut rules_bytes).expect("rules binary is valid");

    assert_eq!(
        rules.correct("She was not been here since Monday.", &tokenizer),
        String::from("She was not here since Monday.")
    );
}

nlprule and nlprule-build versions are kept in sync.

Main features

Rule-based Grammatical Error Correction through multiple thousand rules.
A text processing pipeline doing sentence segmentation, part-of-speech tagging, lemmatization, chunking and disambiguation.
Support for English, German and Spanish.
Spellchecking. (in progress)

Goals

A single place to apply spellchecking and grammatical error correction for a downstream task.
Fast, low-resource NLP suited for running:
1. as a pre- / postprocessing step for more sophisticated (i. e. ML) approaches.
2. in the background of another application with low overhead.
3. client-side in the browser via WebAssembly.
100% Rust code and dependencies.

Comparison to LanguageTool

	\|Disambiguation rules\|	\|Grammar rules\|	LT version	nlprule time	LanguageTool time
English	843 (100%)	3725 (~ 85%)	5.2	1	1.7 - 2.0
German	486 (100%)	2970 (~ 90%)	5.2	1	2.4 - 2.8
Spanish	Experimental support. Not fully tested yet.

See the benchmark issue for details.

Projects using nlprule

prosemd: a proofreading and linting language server for markdown files with VSCode integration.
cargo-spellcheck: a tool to check all your Rust documentation for spelling and grammar mistakes.

Please submit a PR to add your project!

Acknowledgements

All credit for the resources used in nlprule goes to LanguageTool who have made a Herculean effort to create high-quality resources for Grammatical Error Correction and broader NLP.

License

nlprule is licensed under the MIT license or Apache-2.0 license, at your option.

The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

nlprule's People

Contributors

Stargazers

Watchers

nlprule's Issues

Support distinguishing between grammar and style errors

LanguageTool's web demo highlights grammar errors in yellow and style errors in blue.

It'd be nice to have access to that information in suggestions without having to take s.source and look it up in an external mapping table, similar to how, back in the GTK+ 2.x era with PyGTK and the Glade GUI builder, I had to do a second, independent parse of the XML UI definition with a normal DOM parser to recover certain bits of metadata I needed and which it stored but didn't expose through GtkBuilder API.

Roadmap

This meta-issue tracks what I plan to do with this library in the near future. I wrote this up to make it possible to comment on the direction and priorities of the project. These are the major things:

Currently I would consider the library feature complete after all of the above things are done but this will very possibly change over time. I appreciate any thoughts and discussion!

Wheels for 3.9

They'd be nice, I'm using 3.9.1 on Ubuntu 20.04 and when I try to pip install, there aren't any matching distributions.

Add a user config

Currently nlprule does not have any user configuration options. This is needed for:

enabling / disabling specific rule categories and rules
whitelists for the spellchecker (once #2 is implemented)
overriding internal options (e. g. lru cache size, backtracking limit for regexes, ..)

Support Rules written in Rust

Many useful LanguageTool rules are written in Java and not in the XML rule format (e.g. the A-Vs-AN rule which is kind of essential for me, because I always forget to write this correctly).
It would be great if nlprule would allow to rewrite these rules in Rust.

License of extracted rules

I had brief look into the licensing of language tools rules, if they are permitted to be distributed under other licenses than the language tool library itself, which is LGPLv2.1.

Mostly in relation to #12 which would render the whole idea of including it at compile time rather pointless for most applications.

Make rules more inspectable

It would be nice to return JSON (like LT HTTP server). It's frequent to not want a specific correction, some are even just suggestions.

Compile error in build.rs from README.md

Using nlprule and nlprule-build version 0.6.4 and the build.rs script from the current README.md, I get a compile error while building:

error[E0599]: no method named `validate` found for enum `Result<BinaryBuilder, nlprule_build::Error>` in the current scope
 --> build.rs:9:6
  |
9 |     .validate();
  |      ^^^^^^^^ method not found in `Result<BinaryBuilder, nlprule_build::Error>`

I have solved it by adding error propagation to builds.rs:

fn main() -> Result<(), nlprule_build::Error> {
    println!("cargo:rerun-if-changed=build.rs");
    nlprule_build::BinaryBuilder::new(
        &["en"],
        std::env::var("OUT_DIR").expect("OUT_DIR is set when build.rs is running"),
    )
    .build()?
    .validate()
}

If you want, I can make a PR.

oob access since 0.5.3

Since attempting to upgrade to 0.5.3 it consistently segfaults in https://github.com/bminixhofer/nlprule/blob/main/nlprule/src/rule/engine/composition.rs#L345-L347

See https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/26

But the bug is in line 344 - where you should push the length in chars, not in bytes.

Benchmark against LanguageTool

There is now a benchmark in bench/__init__.py. It computes suggestions from LanguageTool via language-tool-python and NLPRule on 10k sentences from Tatoeba and compares the times.

Heres's the output for German:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule/bench$ python __init__.py --lang=de
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [01:24<00:00, 118.30it/s]
LanguageTool time: 63.019s
NLPRule time: 21.348s

n LanguageTool suggestions: 368
n NLPRule suggestions: 314
n same suggestions: 304

and for English:

(base) bminixhofer@pop-os:~/Documents/Projects/nlprule/bench$ python __init__.py --lang=en
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [05:57<00:00, 27.98it/s]
LanguageTool time: 305.641s
NLPRule time: 51.267s

n LanguageTool suggestions: 282
n NLPRule suggestions: 247
n same suggestions: 235

I disabled spellchecking in LanguageTool.
LT gives more suggestions because NLPRule does not support all rules.
Not all NLPRule suggestions are the same as LT. Likely because of differences in priority but I'll look a bit closer into that.

Correcting for the Java rules in LT and that NLPRule only supports 85-90% of LT rules by dividing the NLPRule time by 0.8 and normalizing this gives the following table:

	NLPRule time	LanguageTool time
English	1	4.77
German	1	2.36

These numbers are of course not 100% accurate but should at least give a ballpark estimate of performance.
I'll keep this issue open for discussion / improving the benchmark.

How to handle sentence boundary detection

NLPRule currently defers splitting text into sentences to an external tool. In Python this is solved by having a lambda texts: sentences as argument to the Rules and Tokenizer. The Rust API still needs this functionality too via a trait.

Instead, sentence boundary detection should be part of nlprule.

Languagetool does sentence boundary detection with SRX rules: https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx. Porting this functionality would be one option.
rust-punkt does sentence splitting with the Punkt algorithm. Punkt would probably be sufficient for the splits in NLPRule but rust-punkt does currently not work properly (ferristseng/rust-punkt#16) and is not maintained. Rewriting this and maintaining it as another module is also an option.
nnsplit would yield higher quality splits than anything else but is prohibitively slow compared to NLPRule:

nnsplit                 time:   [5.9124 ms 5.9526 ms 5.9950 ms]                    
                        change: [-0.1868% +0.7157% +1.7242%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

nlprule                 time:   [825.25 us 840.18 us 855.25 us]                    
                        change: [-3.4741% -1.0004% +1.4788%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Speeding up NLPRule is another option but would probably not be easy to do.

switch regex engine from oniguruma to fancy-regex

I would like to switch from rust-onig to fancy-regex.

This would probably come with a speedup and remove the last non-Rust dependency. This is nice in general and would enable compiling to WebAssembly.

Changing this in NLPRule would be easy but it is currently blocked by fancy-regex/fancy-regex#59 and fancy-regex/fancy-regex#49.

Allow usage of external tokenizers

Extracting a Trait to be used by external tokenizers i.e. https://github.com/guillaume-be/rust-bert would be very nice!

Configurability

There should be options which the user can pass:

upon initializing the Tokenizer / Rules with an extra method (Rust) or keyword arguments (Python).
with a method set_options at any time afterwards.

This is fairly straightforward to implement. Currently I think the only use case for options is a whitelist once spellchecking (#2) is implemented.

Token as returned by pipe() is relative to the sentence boundaries

// Token<'_>
    pub char_span: (usize, usize),
    pub byte_span: (usize, usize),

using fn pipe() returns a set of tokens, that includes spans relative to the sentence, but there seems to be no trivial way of retrieving the spans from within the original text provided to pipe.

Suggestion: Use a Range<usize> instead of a tuple for the relevant range of bytes/ characters for easier usage and make that relative to the input text.

Since for single sentences, there is no change in semantics. For multi sentence ones there is.

It would also make sense to add the respective bounds in bytes and chars of the sentence (or replace the sentence entirely).

pub sentence: &'t str,

Related cargo spellcheck issue drahnr/cargo-spellcheck#162

Readme link to languagetool

The link in the first paragraph should point to https://github.com/languagetool-org/languagetool but actually points to https://github.com/bminixhofer/nlprule/blob/main/github.com/languagetool-org/languagetool which 404's because the https:// is omitted.

Support more languages

More languages were understandably already requested multiple times (e. g. #14). This issue tracks progress in this area.

The primary goal is to make it easy for contributors to add a language, not necessarily to support many languages right away.

I have decided to not add support for any new languages until Spellchecking (#2) lands and the quality issue in the core (#44) is resolved.

Theoretically someone could contribute a new language right now but since LanguageTool has language-specific code there will likely have to be some adjustments in the core for each new language which is not easy enough with the current level of documentation. So for adding new languages to scale well the quality of the core has to be improved significantly.

project dead?

Hey,
is this project still being worked on or is it dead? I am asking as I was considering moving Fidus Writer to use it instead of language tool.

Quality of the core

Currently the core is still largely in the state it was during prototyping. While abstractions are good and the code is clean documentation is missing in the internals.

So the key issue is:

Improving documentation of the internals.

Besides that, there are some cleanups needed such as:

Thinking about the distinction between byte and char indices and when should be converted between them (#21)
Better error messages for wrong binaries (#20)
Profile & reduce allocations (use more iterators...) (related #28)
Check opportunities to use Cow<str>
derive(Clone) all the (public) things.
Replace to_string with to_owned when converting from &str to String.
Making rayon optional behind a feature flag.

Add spellchecking

I love this library already, I've been looking for something like this for a project of mine for months now! However, I saw the README said this about the project:
"and without all the extra stuff LanguageTool does such as spellchecking, n-gram based error detection, etc."

It would be super nice to have the spellchecking part of LanguageTool in this library, as spellchecking is one of the most used features in many, if not all, general-purpose NLP libraries. I'm only good at Python though, so I personally can't help until I focus more on improving my Rust :(

Rule selectors

I'm currently improving rule selection along with #41.

Rules are nested in up to three layers in the LanguageTool XML:

<category ...>
    <rulegroup ...> <!-- can be omitted, there's also <rule>s at this level -->
        <rule>
        </rule>
    </rulegroup>
</category>

At the moment the id is just a string with <group_name>|<rule_name>.<number>. I want to streamline this to allow easily disabling e. g. an entire category.

API

This is the API I currently have in mind for this. There will be one struct for each rule level ID:

pub struct CategoryID;
pub struct GroupID;
pub struct IndexID;

with conversions between them:

impl CategoryID {
    pub fn new<S: Into<String>>(category: S) -> Self;
    pub fn join<S: Into<String>>(&self, group: S) -> GroupID;
}

impl GroupID {
    pub fn join(&self, index: usize) -> IndexID;
    pub fn parent(&self) -> &CategoryID;
}

impl IndexID {
    pub fn parent(&self) -> &GroupID;
}

It will only be possible to create CategoryIDs directly. It can then be joined to create IDs at different levels. The id() field of Rule will become an IndexID (currently a String).

Selector

The structures above won't do any work on their own. For that, there is a RuleSelector which, given an IndexID, determines if it matches. ~~It can also be disabled to inverse the match~~:

pub enum IDSelector {
    Category(CategoryID),
    Group(GroupID),
    Index(IndexID),
}

impl IDSelector {
    pub fn is_match(&self, spec: &IndexID) -> bool;
}

and corresponding methods to create a selector from IDs at various levels:

impl From<GroupID> for IDSelector {};

// same for others

Selectors can also be cast to / from strings with the representation <category>/<group>/<index> e. g. "typos/verb_apostrophe_s/3" or "grammar". Selectors will be case insensitive.

Usage

Rust

The new user-facing RulesOptions (passed via .new_with_options(..., options: &RulesOptions)) will initially look like this:

struct RulesOptions {
    selectors: Vec<IDSelector>,
}

where selectors is a list of selectors which are applied in order, and selectors which are disabled by default for the language are implicitly prepended to the list. E. g. to disable all rules in the typos category but verb_apostrophe_s:

let rules = Rules::new_with_options("rules.bin", RulesOptions {
    selectors: vec![
        (CategoryID::new("typos").into(), false),
        (CategoryID::new("typos").join("verb_apostrophe_s").into(), true)
    ],
})

alternatively:

let rules = Rules::new_with_options("rules.bin", RulesOptions {
    selectors: vec![
        ("typos".try_into().unwrap(), false),
        ("typos/verb_apostrophe_s".try_into().unwrap(), true),
    ],
})

or, conversely, to enable all typos and only disable verb_apostrophe_s:

let rules = Rules::new_with_options("rules.bin", RulesOptions {
    selectors: vec![
        ("typos".try_into().unwrap(), true),
        ("typos/verb_apostrophe_s".try_into().unwrap(), false),
    ],
})

There will also be a method which returns all the currently used selectors, including the ones which are implicitly prepended by default:

impl Rules {
    pub fn selectors(&self) -> &[(IDSelector, bool)];
}

Python

In Python, only the string representation of the IDs will be visible to the user:

rules = Rules.load("en", selectors=[
    ("typos", False), 
    ("typos/verb_apostrophe_s", True)
])

and there will be a .selectors() method returning a list of tuples of selectors / enabled state.

All of this will only be implemented for the Rules, not for the Tokenizer. Rules in the Tokenizer are applied hierarchically so disabling one can have an effect on the others.

Points of discussion

I am not so sure about the selectors terminology and about the names CategoryID, GroupID and IndexID.
I'm quite happy with the abstraction itself. But it does add additional complexity and it might not be as intuitive at first glance as a simple enable / disable list of IDs. Although it is much more expressive (e. g. it is impossible to express the simple scenarios from above with an enable / disable list).

As always, I appreciate discussion about this a lot :). Writing it down helped me a lot to define the API (and it actually changed significantly while writing this), more discussion helps even more.

Update: Selectors will not have an on / off state. Instead, the selectors argument will be a vector of tuples of (selector, enabled). Having an on / off state makes it unclear how finding rules by selector should work.

Better error if binary does not match version

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidBoolEncoding(2)', src/checker/nlprules.rs:22:70


static TOKENIZER_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/tokenizer.bin"));
static RULES_BYTES: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/rules.bin"));

lazy_static::lazy_static! {
    static ref TOKENIZER: Tokenizer = Tokenizer::from_reader(&mut &*TOKENIZER_BYTES).unwrap();
    static ref RULES: Rules = Rules::from_reader(&mut &*RULES_BYTES).unwrap(); /// <<< this one errors, and it's always this one
}

Downloading from Github Releases sometimes times out

tokenizer = Tokenizer.load("en")

ValueError: error decoding response body: operation timed out

Release on crates.io

I want to publish NLPRule on crates.io. A couple of things that need to be done:

Clear distinction between public / private API in Rust
Minimum acceptable documentation for the public API
Fix dependencies, specifically NLPRule depends on an unreleased version of serde-xml-rs (RReverser/serde-xml-rs#145).
Find a way to rename the rust crate to nlprule from nlprule_core without clashing with bindings/python/Cargo.toml.
~~Make rayon (and related dependencies) optional.~~ No easy way to do this, so will not be done unless needed.

API to include the correct binaries at compile time

Hey, nice library and I am currently checking what would be needed to obsolete the current LanguageTool backend in https://github.com/drahnr/cargo-spellcheck .

There are a few things which would need to be addressed, the most important is to
avoid the need for https://github.com/bminixhofer/nlprule/blob/master/scripts/build.sh .
The compile feature could gate a build.rs file which would prep the data which in turn could be included via include_bytes!.
That way, one can locally source the data at compile time and include the source files within the binary, with optional external overrides.
Another thing that would be nice, is documentation on how to obtain the referenced dumps.

Looking forward 💯

Clarify license statement

Can you clarify this phrasing?

The nlprule binaries (*.bin) are derived from LanguageTool v5.2 and licensed under the LGPLv2.1 license. nlprule statically and dynamically links to these binaries. Under LGPLv2.1 §6(a) this does not have any implications on the license of nlprule itself.

...because:

I don't see any sign of static or dynamic linking in the sense the LGPL considers... just aggregating assets, similar to how you can use runtime-loaded CC-BY-SA art assets in a game with GPLed code without there being a license conflict as long as you don't embed the art assets inside the binary or otherwise make the binary unavoidably dependent from the assets. (eg. compiling in a hash check that will fail if someone swaps in new .png files.)
When people see "statically and dynamically links" and "LGPL", they get concerned, because Rust statically links all its code so, if you statically link your LGPLed stuff into nlprule and you statically link nlprule into a Rust binary, then that Rust binary must be distributed in accordance with the LGPL's requirement that it be possible to swap out the LGPLed components with modified versions... and Rust doesn't have stable ABI to facilitate that without sharing the source.

I've actually seen people warn other people away from nlprule in favour of some more recent bindings for the LanguageTool HTTP API because "nlprule statically links to LGPLed stuff, which means your Rust binaries must be released under the LGPL, GPL, or AGPL".

Better error handling

At the moment error handling is almost non existent. It is not that bad because the core methods (.correct, .pipe, .suggest) are not fallible but still has to be fixed. Specifically:

using thiserror for a crate-level Error struct.
less unwraps
some thought on what should be expected and what should return an Error.
exposing the crate-level Error in the public API instead of e. g. returning bincode::Error if from_reader fails as is currently the case.

Usability of the rules API degraded from 0.4.6 to 0.5.1

Updating nlprule from 0.4.6 to 0.5.1 due to #53 is not as smooth as anticipated.

The whole idea of having a straight forward iterator over the individual rules, filtering and collecting them as needed is gone.

The API of select is rather awkward to use, it seems like a stripped down version of (the previously working) .into_iter().filter(|| -> bool {}) but without the possibility to specify a closure. There is no alternative based on rules() since the Rule is not Clone. Operating on the mutable slice instead is non-idiomatic.

Would you consider reverting this? It seems more hassle without any(or, I don't see it) gain.

Thanks!

Craft releases frequently

It's significantly easier for other projects to adapt on small changes, rather than rare yet large changesets.

In this case I would love to craft a cargo spellcheck release, but I'd be in need of the current master, and in a perfect world the changes of #24

Thanks

I really appreciate the work you do here!

Grammar check fails

Issue:
Grammatically incorrect english sentence is not identified

sentence = 'Make sure to include you phone number.'

Received result:
Make sure to include you phone number.

Expected result:
Make sure to include your phone number.

Version:
nlprule-0.5.3

To recreate:

from nlprule import Tokenizer, Rules

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer)

content = 'Make sure to include you phone number.'
fixed = rules.correct(content)

if fixed == content:
    print('Grammar passed')
else:
    print('Grammar failed')

# returns: Grammar passed

Result on LanguageTool

Cache the compressed artifacts

In order to be able to include the .bin artifacts in a repository and craft releases / publish with cargo the sources may not be larger than 10MB or failures like:

error: api errors (status 200 OK): max upload size is: 10485760

will pop up.

The simplest path is to cache the compressed artifacts rather than the uncompressed and decompress at runtime. An optional builder API could be used to load compressed or decompressed .bin variants.

Support for older glibc

Hi, first off thank you for this library, it's the only non-java languagetool alternative I've found.

Unfortunately, I am receiving an error when trying to use it,ImportError: /lib/x86_64-linux-gnu/libm.so.6: version 'GLIBC_2.27' not found (required by python/lib/python3.8/site-packages/nlprule.cpython-38-x86_64-linux-gnu.so)

I'm on a hosting environment where I don't have access to upgrade system libraries so i can't just upgrade glibc. The current version is glibc 2.19.

Is glibc 2.27 a hard requirement or is there a way to specify an older version of glibc?

I have a feeling this is Rust specific issue but I am new to Rust and not familiar with it's environment.

Thanks

Document how to load custom rulesets

I have a project where I'd prefer not to reinvent nlprule for applying my custom grammar rules (common validly-spelled typos I see in fanfiction), but the documentation is very unclear on how to do anything with custom rules.

In a PyQt application, how do I specify files by path like with the Rust API?
How do I go from the raw LanguageTool XML to the .bin files?
Do I need to do multiple passes with different nlprule instances if I also want to check regular grammar stuff or is there a way to merge rulesets?

Questions about adding a new language

Hi,

I would like to ask if it is easy to add a new language?

If so, are there any good reference to guess how to add it?

Best,
Chris

Some methods should return iterators

Tokenizer.pipe, Rules.suggest and maybe some other methods should return an iterator instead of a Vec<_> for more flexibility.

can't correction the text, is it my error?

from nlprule import Tokenizer, Rules, SplitOn

tokenizer = Tokenizer.load("en")
rules = Rules.load("en", tokenizer, SplitOn([".", "?", "!"]))

rules.correct("He want that you send him a email.")

returns: He want you to send him a email.

Modularizing the crate

As suggested originally by @drahnr (#2 (comment)) we should consider splitting nlprule into multiple sub-crates.

What's certain is that there will be one nlprule crate combining sub-crates into one higher level API which is equivalent to the Python API.

There are multiple reasons for doing a split:

Some users might want to only do tokenization, or only some part of the tokenization pipeline. They should not have to pull in the weight from suggestion producers.
Some users might not want to use all of the suggestion producers, but only spellchecking, only grammar rules etc.

Modularizing the crate would primarily benefit size of the binaries and needed dependencies.

There is a distinction between:

The original split into sentences and tokens.
Things that set / modify information related to the tokens (e. g. disambiguation rules, chunking, part-of-speech tagging) (functionality currently in the Tokenizer)
What I called suggestion producers above. Things that draw conclusions from the information set on the tokens (currently only the grammar rules, spellchecking will fall into this category too)

Splitting (3.) into modules is easy: there should be one module for each separate entity which operators on the tokens i. e. one for spellchecking (nlprule-spellcheck) and one for grammar rules (nlprule-grammar or something similar).

Splitting (2.) and (1.) is harder. I think having (1.) as a separate crate which only does sentence segmentation and token segmentation (nlprule-tokenize) would make sense. Then there could be multiple crates which set different information on the tokens for example nlprule-disambiguate (disambiguation rules), nlprule-multiword (multiword tagging) and nlprule-crf (chunking and possibly other CRF models behind feature flags).

There's a number of open issues like:

How to handle distribution of binaries. Distributing binaries with everything enabled and functionality to disable specific parts in nlprule-build might be an option.
How to handle invalid combinations of the modules. For example, nlprule-grammar does not make sense without nlprule-disambiguate. Panicking as early as possible is probably best.
How to handle interdependencies between modules related to (2). For example the lexical tagging is required for disambiguation, but it would be conceivable to use tagging without wanting to do disambiguation.
Whether the name nlprule still makes sense since it is possible to combine modules in a way that does not use rules at all. But that's not the biggest issue 🙂

Implementing this is not a near-term goal right now (see #42) and I am not sure whether it is worth it from a practical point of view but I wanted to open this issue for discussion. Also thanks @drahnr for the original suggestion.

Web demo

A web demo running client-side via WebAssembly would be really cool and useful. Ideally this should have:

a live text correction tool
text analysis like https://community.languagetool.org/analysis (this would be really useful for debugging as well!)
a rule explorer

The website should live in another repository and be hosted via GH pages. It should already be possible to implement with the library in its current state.

It's completely open how this is implemented (could be yew, or vuejs / react with a JS interface to nlprule made with wasm-pack, or something else).

It's quite a piece of work but it would be amazing to have. I want to focus on the other things first since they are more important for the core library.

Here I would appreciate contributions a lot!

Support python 3.11

Any plan to release a new version supporting python 3.11?

Single Or Pural

Hi! Thanks for the great projects.
I'm working with code generation, so I need further grammar corrections on the generated code. I found that this toolkit is unable to respond to such simple grammatical knowledge as whether a noun is in singular or plural form.

Improve loading speed (of regex?) - cli usecase

The biggest issue using this library currently is the fact, that on each startup a lot of regular expressions are compiled.

If regex (or whatever crate being used) implements serialization of the compiled regex this could be entirely avoided and shifted to build time / once upon a time. The current issue here is the impl of the crate regex itself which does not impl serde serialization.

In the meantime, parallel compilation of the parsed regexps would probably speed up the initial loading by a factor of $cores.

panic in `Regex::regex()`

thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /tmp/build/56ca5ece/git-pull-request-resource/../cargo/registry/src/github.com-1ecc6299db9ec823/nlprule-0.6.2/src/utils/regex.rs:78:33

thread 'stack backtrace:

<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /tmp/build/56ca5ece/git-pull-request-resource/../cargo/registry/src/github.com-1ecc6299db9ec823/nlprule-0.6.2/src/utils/regex.rs:78:33

   0:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::libunwind::trace::h5e9d00f0cdf4f57e

                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5

   1:     0x560ee78bdfd0 - std::backtrace_rs::backtrace::trace_unsynchronized::hd5302bd66215dab9

                               at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5

There is an .unwrap on the regex.borrow() call which panics.

https://ci.spearow.io/teams/main/pipelines/cargo-spellcheck/jobs/pr-validate/builds/45

Be more responsible about network requests

When I tried entering an invalid language code to confirm that there's a Python exception I need to handle if the language code selected in my existing Enchant-based infrastructure isn't supported by nlprule, I got this very surprising error message:

ValueError: HTTP status client error (404 Not Found) for url (https://github.com/bminixhofer/nlprule/releases/download/0.6.4/ef_tokenizer.bin.gz)

Personally, I consider it very irresponsible to not warn people that a dependency is going to perform network requests under some circumstances, nor to provide an obvious way to handle things offline.

I highly recommend you change this and, for my own use, since I tend to incorporate PyO3-based stuff into my PyQt apps anyway, I think I'll probably switch to writing my own nlprule wrapper so I can trust that, if no network libraries show up in the Cargo.lock, and the author isn't being actively malicious, then what I build will work on an airgapped machine or in a networkless sandbox.

(Seriously. Sandboxes like Flatpak are becoming more and more common. Just assuming applications will have network access is not cool.)

postprocess has different semantic than anticipated

So for my usecase as defined in #27 is reversed.

nlprule-data/0.4.4/en/en_tokenizer.bin
target/debug/build/cargo-spellcheck-2b832a17a2fec7ef/out/en_tokenizer.bin.brotli

What the usecase described in #27 would require, is to be able to apply compression before storing it in the cache dir and then uncompressing it for the target/debug/....

Reasoning: When uploading with cargo it picks a subset of the git tree, the size of the binary is not relevant.

I think adding a secondary fn cache_preprocess() so I can compress it there before it is stored to $cache_dir and then decompress as part of the current fn postprocess() so it ends up only as binencoded in the $OUT_DIR from where it can be included in the binary.

Tighter error bounds around `Error`

In https://github.com/bminixhofer/nlprule/blob/main/build/src/lib.rs#L23 the bounds are Box<dyn std::error::Error> where it should be Box<dyn std::error::Error + Send + Sync + 'static> for maximum compatibility with error frameworks.

Make rayon optional

rayon is not necessary and should be behind a feature flag (like in e.g. ndarray). Resolving this issue in rayon-cond would make it trivial: cuviper/rayon-cond#3

If this change does not get accepted in rayon-cond I believe the best solution is forking it, adding a rayon feature flag and also moving the abstraction currently in src/utils/parallelism.rs to the fork.

Loading rules error: invalid value: integer 7, expected variant index 0 <= i < 2

I've generated rules.bin for English (same way as it was used in CI)

And when I'm trying to do Rules::new I'm getting this error: invalid value: integer 7, expected variant index 0 <= i < 2
I've used to generate on git sources and use on 0.4.6, trying to use on git sources to see if it's versions incompatibility

Support for AnnotatedText

hey, thanks for this awesome project!
do you consider adding AnnotatedText support?

this would allow nlprule to be used to spell-check markdown/word/html/etc. documents converted to AnnotatedText format (supported by LanguageTool)

right now i'm thinking how it could be done, but i can't quite figure out how LanguageTool can spellcheck ignoring the markup but then map the ranges to original document

Use byte indices in Rust core

It would be more idiomatic to Rust to use byte indices everywhere internally (and everywhere in the public Rust API) and only convert to char indices at the boundary to Python.

Coalesced words - tokenization

I've attempted to deal abbreviated forms of type we've and I'd and it's as part of drahnr/cargo-spellcheck#186 which is a mere workaround.

Probably out of scope of nlprule, yet a pitfal for real life usage.

Since nlprule is going to support spellchecking as well it might be worth discussing / keeping in mind.