
vtext


NLP in Rust with Python bindings

This package aims to provide a high performance toolkit for ingesting textual data for machine learning applications.

Features

  • Tokenization: Regexp tokenizer, Unicode segmentation + language specific rules
  • Stemming: Snowball (in Python, 15-20x faster than NLTK); a quick example follows this list
  • Token counting: converting token counts to sparse matrices for use in machine learning libraries. Similar to CountVectorizer and HashingVectorizer in scikit-learn, but with less broad functionality.
  • Levenshtein edit distance; Sørensen-Dice, Jaro, and Jaro-Winkler string similarities
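
For illustration, a minimal Python sketch of the stemming feature (the module path, parameter name, and method name are assumptions, not a guaranteed API; see the project documentation linked below):

from vtext.stem import SnowballStemmer   # assumed module path

stemmer = SnowballStemmer(lang="english")  # assumed parameter name; full language names for now
print(stemmer.stem("continuing"))          # e.g. "continu"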

Usage

Usage in Python

vtext requires Python 3.6+ and can be installed with,

pip install vtext

Below is a simple tokenization example,

>>> from vtext.tokenize import VTextTokenizer
>>> VTextTokenizer("en").tokenize("Flights can't depart after 2:00 pm.")
["Flights", "ca", "n't", "depart" "after", "2:00", "pm", "."]

For more details see the project documentation: vtext.io/doc/latest/index.html

Usage in Rust

Add the following to Cargo.toml,

[dependencies]
vtext = "0.2.0"

For more details see the Rust documentation: docs.rs/vtext

Benchmarks

Tokenization

The following benchmarks illustrate the tokenization accuracy (F1 score) on UD treebanks,

| lang | dataset | regexp | spacy 2.1 | vtext |
|------|---------|--------|-----------|-------|
| en   | EWT     | 0.812  | 0.972     | 0.966 |
| en   | GUM     | 0.881  | 0.989     | 0.996 |
| de   | GSD     | 0.896  | 0.944     | 0.964 |
| fr   | Sequoia | 0.844  | 0.968     | 0.971 |

and the English tokenization speed,

|                      | regexp | spacy 2.1 | vtext |
|----------------------|--------|-----------|-------|
| Speed (10⁶ tokens/s) | 3.1    | 0.14      | 2.1   |

Text vectorization

Below are benchmarks for converting textual data to a sparse document-term matrix using the 20 newsgroups dataset, run on Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz,

| Speed (MB/s)                  | scikit-learn 0.20.1 | vtext (n_jobs=1) | vtext (n_jobs=4) |
|-------------------------------|---------------------|------------------|------------------|
| CountVectorizer.fit           | 14                  | 104              | 225              |
| CountVectorizer.transform     | 14                  | 82               | 303              |
| CountVectorizer.fit_transform | 14                  | 70               | NA               |
| HashingVectorizer.transform   | 19                  | 89               | 309              |

Note however that these two estimators in vtext currently support only a fraction of scikit-learn's functionality. See benchmarks/README.md for more details.
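
For reference, a rough sketch of how such a throughput figure can be measured on the 20 newsgroups data (the vtext import path is an assumption; the scikit-learn side is standard):

from time import time

from sklearn.datasets import fetch_20newsgroups
from vtext.vectorize import CountVectorizer  # assumed module path

docs = fetch_20newsgroups(subset="train").data
data_mb = sum(len(doc.encode("utf-8")) for doc in docs) / 1e6

t0 = time()
X = CountVectorizer().fit_transform(docs)
print(f"{data_mb / (time() - t0):.1f} MB/s, result shape {X.shape}")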

License

vtext is released under the Apache License, Version 2.0.


vtext's Issues

Make to_ascii_lowercase optional

Hi, thanks for the cool crate!

Could you remove to_ascii_lowercase or make it optional? I think such pre-processing should be done on the library client's side, since it is simple (.map(|doc| doc.to_ascii_lowercase())) and is not required for the main heavy tokenization fitting and transform logic; I would prefer to call it myself when needed.

General architecture feedback

@rth awesome, I'm all about collaboration; I'm going to checkout the package after work today!

Great, thank you @jbowles! Feel free to write any general comments you have about this project here (or in any of the related issues).

To give you some background, I have been working on topics related to CountVectorizer / HashingVectorizer in scikit-learn for a few years, and this project originated as an attempt at making those faster. A few things got added along the way. I'm still a fairly beginner Rust programmer, so general feedback about the architecture of this crate would be very welcome. In particular, adding more common traits per module would probably be good (I started some of this work in #48). Some of it was also limited by the fact that I wanted to make a thin PyO3 wrapper to expose the functionality in Python, which adds some constraints (e.g. #48 (comment)).

For tokenization, one thing I noticed is that the unicode-segmentation crate tokenizes text almost exactly as expected for NLP applications, with a few exceptions. The nice thing about it is that it is language-independent and based on the Unicode spec, which removes the need to maintain a large number of regexp / custom rules. To improve the tokenization F1 score on the UD treebanks, a few custom rules are applied on top.

On the other hand, we can imagine other tokenizers. In particular, the point that some tasks require custom processing is a valid one; I'm not sure how to make that easier.

I also found an implementation of the Punkt tokenizer: rust-punkt!

Yes, it looks quite good. Related issue #51

Generally, if I can do anything to make this collaboration easier, please let me know :)

Multi-OS Python wheels

Once #1 is merged, it would be good to build binary Python wheels and upload them to PyPI.

setuptools-rust sounds like the simplest solution.

  • Linux
  • MacOS
  • Windows

Support different hash functions in HashingVectorizer

Currently, we use the MurmurHash3 hash function from the rust-fasthash crate (to be more similar to the scikit-learn implementation). That crate also supports a number of other hash functions,

City Hash
Farm Hash
Metro Hash
Mum Hash
Sea Hash
Spooky Hash
T1 Hash
xx Hash

I'm not convinced hashing is currently the performance bottleneck, but in any case using a faster hash function such as xxhash would not hurt.

This would involve updating the text-vectorize crate and adding a hasher parameter to the HashingVectorizer Python estimator.

Another use case could be to use different hash functions to reduce the effect of collisions (Svenstrup et al. 2017), discussed e.g. in https://stackoverflow.com/q/53767469/1791279
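
As an illustration, the proposed parameter could look roughly like this on the Python side (the hasher argument is the proposal of this issue, not an existing option, and the import path is assumed):

from vtext.vectorize import HashingVectorizer  # assumed module path

# "hasher" is hypothetical; today MurmurHash3 is hard-coded.
vect = HashingVectorizer(hasher="xxhash")
X = vect.transform(["the quick brown fox", "jumped over the lazy dog"])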

Python wrappers

In order to use the Python package as a drop-in replacement for scikit-learn's text vectorizers, we would need to,

  • Wrap HashingVectorizer #1
  • Wrap CountVectorizer #14
  • Implement most common functionality
  • Pass scikit-learn tests (when they make sense).

Character n-grams

Allowing documents to be tokenized into character n-grams would be useful.
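
For reference, character n-gram extraction amounts to sliding a fixed-size window over the string; a plain-Python sketch of what the feature would compute (not vtext code):

def char_ngrams(text, n=3):
    # All contiguous substrings of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

char_ngrams("vtext", 3)  # ['vte', 'tex', 'ext']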

Implement IDF transforms

It would be necessary to implement IDF transforms, and possibly expose a TfidfVectorizer estimator.

This requires selecting a sparse array library. For now, we use custom CSRArray structs to represent CSR arrays. https://github.com/vbarrielle/sprs is a good candidate but this needs more investigation in any case.
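
For reference, the smoothed IDF weighting used by scikit-learn's TfidfTransformer (one natural target to match) can be computed directly from a CSR count matrix; a NumPy/SciPy sketch, not vtext code:

import numpy as np
import scipy.sparse as sp

def tfidf(X):
    # X: CSR document-term count matrix (canonical form, no duplicate indices).
    n_samples, n_features = X.shape
    df = np.bincount(X.indices, minlength=n_features)   # document frequency per term
    idf = np.log((1.0 + n_samples) / (1.0 + df)) + 1.0  # smooth_idf variant
    return sp.csr_matrix(X.multiply(idf))               # scale counts column-wise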

Build release wheels with LTO

Link-Time Optimization (LTO) would add around 10% in performance for vtext and also reduce binary sizes. However, it also increases the compilation/link time significantly, so we may want to enable it only for commits with an existing release tag.
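
For reference, enabling LTO for release builds is a one-line change in Cargo.toml (standard Cargo syntax; gating it on release tags would be handled in CI):

[profile.release]
lto = true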

ENH Avoid copying tokens in tokenizers in Python

Currently, tokenizers return Vec<String>, so each token is an owned copy of a slice of the input string. Moving to Vec<&str> would remove one memory copy and is likely to help with run time.

This should be possible with PyO3 0.7.0 (not yet released), which will allow using lifetime specifiers in pymethods.

Better support of configuration parameters in vectorizers

Currently, CountVectorizer and HashingVectorizer mostly perform BOW token counting, without the possibility of changing the tokenizer or any other parameters.

While we intentionally won't support all the parameters that the scikit-learn versions do (as those meta-estimators are doing too much), additional parametrization would be preferable.

  • parametrization of the tokenizer will be addressed in #48

Fine-tune tokenizers

It can happen that the tokenization results are unsatisfactory in some way, and the question is what the mechanism to customize/improve them should be. Either by,
a) adding options for these improvements in the tokenizer. The issue is that some of them might be relevant to multiple tokenizers;
b) adding a new step later in the pipeline. That's probably the best way to allow arbitrary customization. The issue is that some steps might be specific to the previous step, and adding them to the library might be confusing.

There is probably a balance to be found between the two.

For instance,

  1. PunctuationTokenizer,
    • currently doesn't take into account repeated punctuation
      >>> PunctuationTokenizer().tokenize("test!!!")                                                                                                     
      ['test!', '!', '!']
    • will tokenize abbreviations separated by . as separate sentences
      >>> PunctuationTokenizer().tokenize("W.T.O.")
      ['W.', 'T.', 'O.']
    both could probably be addressed by adding an option to force sentences to be longer than some minimal length (and otherwise append them to the previous token).
  2. UnicodeSentenceTokenizer,
    will not split sentences separated by punctuation without a space, e.g.,
    >>> UnicodeSentenceTokenizer().tokenize('One sentence.Another sentence.')
    ['One sentence.Another sentence.']
    That's a very common occurrence in actual text, and I think a workaround should be found (e.g. using an additional tokenization pass with a regex/punctuation tokenizer; a sketch of such a pass follows this list).

Generally, it would be good to add some evaluation benchmarks for sentence tokenization to the evaluation/ folder.

  3. UnicodeTokenizer is currently extended in VTextTokenizer (for lack of a better name) with a few additional rules. Maybe this could have been a separate token-processing step, particularly if one imagines that more rules could be added (or potentially even an ML model).
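
A sketch of the additional regex pass mentioned in point 2 (plain Python, not an existing vtext option; zero-width re.split requires Python 3.7+):

import re

def resplit(sentences):
    # Split again where a ./!/? is immediately followed by an uppercase letter.
    return [part for s in sentences
            for part in re.split(r"(?<=[.!?])(?=[A-Z])", s)]

resplit(["One sentence.Another sentence."])
# ['One sentence.', 'Another sentence.']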

Word n-grams

Currently, only bag-of-words vectorization is implemented. It would be good to extend the code to word n-grams as well.
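
For reference, word n-grams are just sliding windows over the token sequence; a plain-Python sketch of what the extension would compute (not vtext code):

def word_ngrams(tokens, n=2):
    # Contiguous n-token windows, joined with a space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

word_ngrams(["the", "quick", "brown", "fox"], 2)
# ['the quick', 'quick brown', 'brown fox']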

Rename UnicodeSegmentTokenizer to UnicodeWordTokenizer

UnicodeSegmentTokenizer was meant as a shorter version of "Unicode segmentation tokenizer", but the name is not very explicit. Besides, UnicodeSentenceTokenizer also uses the unicode-segmentation crate, which adds to the confusion. Maybe UnicodeWordTokenizer would be a better name?

Standardize language option

From #78 (comment) by @joshlk

I think it would be sensible to identify different languages throughout the package using ISO two-letter codes (e.g. en, fr, de ...).

In particular, we should implement this for the Snowball stemmer in Python, which currently uses the full language names.

I am also wondering whether, in Rust, we should use String for the language parameter or define an enum, e.g.

use vtext::lang;

let stemmer = SnowballStemmerParams::default().lang(lang::en).build();

The latter is probably simpler, but it makes it a bit harder to extend: if someone designs a custom estimator for a language not in the list (e.g. some ancient, infrequently used language), they would have to add a new enum variant.

Also just to be consistent the parameter name would be "lang" not "language", right?

Make estimators picklable

Currently, Python classes / functions generated with PyO3 are not picklable (PyO3/pyo3#100), which makes their use problematic in typical data science workflows (e.g. with joblib parallel or in scikit-learn pipelines).

Implementing the __getstate__ and __setstate__ methods is probably necessary to make this work.
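
The intended behaviour, once those methods are implemented, is a plain pickle round-trip (this currently fails; the class is the one from the usage example above):

import pickle
from vtext.tokenize import VTextTokenizer

tok = VTextTokenizer("en")
tok2 = pickle.loads(pickle.dumps(tok))  # fails today; the goal of this issue
assert tok2.tokenize("It works.") == tok.tokenize("It works.")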

NLP pipeline design

Ideally, an NLP pipeline in Rust could look something like,

let preprocessor = DefaultPreprocessor::new();
let tokenizer = RegexpTokenizer::new(r"\b\w\w+\b");
let stemmer = SnowballStemmer::new("en");
let analyzer = NgramAnalyzer::new((1, 1));

let pipe = collection
    .map(preprocessor)
    .map(tokenizer)
    .map(|tokens| tokens.map(stemmer))
    .map(analyzer);

where collection is an iterator over documents.

There are several challenges with it though,

  • It is better to avoid allocating strings for tokens at each pre-processing step and instead use slices of the original document; performance depends very strongly on this. The current implementation of e.g. RegexpTokenizer takes a reference to the document and returns an iterator of &str with the same lifetime as the input document, but then the borrow checker doesn't appear to be happy when it is used in the pipeline. This may be related to using closures (cf. next point), though.
  • Because structs are not callable, collection.map(tokenizer) doesn't work,
    nor does collection.map(tokenizer.tokenize) (i.e. using a method) for some reason. We can use collection.map(|document| tokenizer.tokenize(&document)), but then the lifetime is not properly handled between input and output (as described in the previous point).

More investigation would be necessary, and both points are likely related.

Better unicode support in tokenization rules

Currently, the VTextTokenizer first computes the Unicode segmentation (which should handle Unicode well by definition), then applies a few simple rules on top to produce a tokenization that is more standard in NLP (and possibly language dependent).

These rules might need to be generalized a bit to handle Unicode better. For instance, we currently merge tokens linked by "-", but only the ASCII hyphen, not the other Unicode variants.
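
For instance, generalizing the hyphen rule could mean testing membership in a small set of Unicode hyphen code points rather than comparing against the literal "-"; a sketch (shown in Python for brevity; the actual rule lives in the Rust tokenizer):

HYPHENS = {
    "\u002d",  # HYPHEN-MINUS (the ASCII '-')
    "\u2010",  # HYPHEN
    "\u2011",  # NON-BREAKING HYPHEN
}

def is_hyphen(token):
    return token in HYPHENS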
