ferristseng / rust-punkt
Implementation of the Punkt sentence tokenizing algorithm in Rust.
License: Apache License 2.0
How should one go about adding training data for other languages? What is the JSON data structure?
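For what it's worth, the JSON models in this repository appear to carry four top-level keys — the same four loaded into NLTK in the comparison issue further down. A trimmed sketch, with purely illustrative values (`ortho_context` maps a token to an integer of orthographic-context flags):

```json
{
  "abbrev_types": ["np", "dr", "vs"],
  "sentence_starters": ["however", "the"],
  "collocations": [["new", "york"]],
  "ortho_context": {"word": 6}
}
```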
Strings are duplicated: the textual representation of a token only needs to be stored once. To do this, research string interning libraries. The strings could probably be stored in the data object.
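A minimal std-only sketch of the interning idea — no external crate, with `Rc<str>` standing in for whatever shared handle the data object would hand out:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Tiny string interner: each distinct token string is stored once,
// and every occurrence shares the same Rc<str> allocation.
#[derive(Default)]
struct Interner {
    strings: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // HashSet::get lets us look up by &str because Rc<str>: Borrow<str>.
        if let Some(existing) = self.strings.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.strings.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("hello");
    let b = interner.intern("hello");
    // Both tokens share one allocation, and only one string is stored.
    assert!(Rc::ptr_eq(&a, &b));
    assert_eq!(interner.strings.len(), 1);
}
```

A production version would likely use a crate with a faster hash and arena storage, but the shape is the same.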
For example, if the following were tokenized:
hello, world!
Could we get tuples of (0,5), (7,12)? I'm flexible about details like whether the numbers are bytes or characters, 0- or 1-based, or inclusive/exclusive. Thanks for the cool project! 🌴
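In the meantime, if the sentences the tokenizer yields are slices borrowing from the input (as they appear to be), byte spans can be recovered with pointer arithmetic; a sketch using end-exclusive byte offsets (`span_of` is a hypothetical helper, and the slices below stand in for tokenizer output):

```rust
// Given the original text and a &str that borrows from it, recover the
// (start, end) byte offsets of the slice within the original.
fn span_of(text: &str, slice: &str) -> (usize, usize) {
    let start = slice.as_ptr() as usize - text.as_ptr() as usize;
    (start, start + slice.len())
}

fn main() {
    let text = "hello, world!";
    // Stand-ins for sentence slices a tokenizer would yield:
    let first = &text[0..6];   // "hello,"
    let second = &text[7..13]; // "world!"
    assert_eq!(span_of(text, first), (0, 6));
    assert_eq!(span_of(text, second), (7, 13));
}
```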
Right now the object receives a structure that contains the tokenizer's parameters. It would be better if the parameter configuration was type-based / trait-based.
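A sketch of what trait-based configuration could look like — associated constants on a zero-sized marker type instead of a runtime struct. The parameter names here (`ABBREV_THRESHOLD`, `INCLUDE_ALL_COLLOCATIONS`) are made up for illustration:

```rust
// Hypothetical: tokenizer parameters expressed as a trait, so swapping
// behaviour means swapping a zero-sized type, with no stored state.
trait TokenizerParameters {
    const ABBREV_THRESHOLD: f64;
    const INCLUDE_ALL_COLLOCATIONS: bool;
}

struct Standard;

impl TokenizerParameters for Standard {
    const ABBREV_THRESHOLD: f64 = 0.3;
    const INCLUDE_ALL_COLLOCATIONS: bool = false;
}

// A consumer reads configuration from the type, not from a field:
fn is_abbrev_score<P: TokenizerParameters>(score: f64) -> bool {
    score >= P::ABBREV_THRESHOLD
}

fn main() {
    assert!(is_abbrev_score::<Standard>(0.5));
    assert!(!is_abbrev_score::<Standard>(0.1));
}
```

The constants are resolved at compile time, so this also removes a field from the tokenizer object entirely.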
NLTK has a way to realign sentences ending with characters such as ), }, ], ", etc...
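A rough std-only sketch of that kind of realignment — leading closing punctuation is moved back onto the previous sentence. The punctuation set and whitespace handling are simplifying assumptions, not NLTK's exact rules:

```rust
// Rough sketch of boundary realignment: closing punctuation left at the
// start of a sentence is reattached to the previous sentence.
fn realign(sentences: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for s in sentences {
        // Count leading characters that belong to the previous sentence.
        // (All candidates are one-byte ASCII, so char count == byte count.)
        let n = s.chars().take_while(|c| ")}]\"'".contains(*c)).count();
        match out.last_mut() {
            Some(prev) if n > 0 => {
                prev.push_str(&s[..n]);
                let rest = s[n..].trim_start();
                if !rest.is_empty() {
                    out.push(rest.to_string());
                }
            }
            _ => out.push(s.to_string()),
        }
    }
    out
}

fn main() {
    let sents = ["He said \"Hi.", "\" Then he left."];
    assert_eq!(realign(&sents), ["He said \"Hi.\"", "Then he left."]);
}
```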
This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to #rust-offtopic on IRC to discuss.
You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.
TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.
Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):
// Copyright 2016 rust-punkt developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.
Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.
And don't forget to update the license metadata in your Cargo.toml to:
license = "MIT/Apache-2.0"
I'll be going through projects which agree to be relicensed and have approval by the necessary contributors and doing these changes, so feel free to leave the heavy lifting to me!
To agree to relicensing, comment with:
I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.
Rust-punkt and NLTK Punkt (with realignment off) produce different results when using exactly the same model. NLTK Punkt correctly identifies abbreviations and doesn't split on them, while rust-punkt, with the same model, splits sentences on almost every period.
To test things, I loaded the JSON model from rust-punkt:
import json
from collections import defaultdict

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# model_path and text are defined elsewhere
with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)
The output from NLTK Punkt:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.
While rust-punkt produced:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
Right now, the training data all comes as a single package. It might be better to include it as compiled code that is generated from a JSON document.
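One way to sketch that: a build step parses the JSON model and emits Rust source with static tables, so the shipped crate carries compiled data rather than files parsed at runtime. The function below shows only the code-emission half (`generate_abbrev_table` is a hypothetical name, and the input list stands in for parsed JSON):

```rust
use std::fmt::Write;

// Hypothetical codegen step: turn a list of abbreviation types (as would
// come out of the JSON model) into Rust source defining a static table.
fn generate_abbrev_table(abbrevs: &[&str]) -> String {
    let mut out = String::from("pub static ABBREV_TYPES: &[&str] = &[\n");
    for a in abbrevs {
        // escape_default keeps the generated string literal valid Rust.
        writeln!(out, "    \"{}\",", a.escape_default()).unwrap();
    }
    out.push_str("];\n");
    out
}

fn main() {
    let src = generate_abbrev_table(&["np", "dr", "prof"]);
    print!("{}", src);
}
```

In practice this would live in a build.rs that writes the generated file into OUT_DIR for the crate to include!.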
cargo --version
cargo 1.38.0-nightly (42a8c0adf 2019-08-07)
cargo build --lib
error[E0658]: use of unstable library feature 'test': `bench` is a part of custom test frameworks which are unstable
--> src/trainer.rs:774:7
|
774 | #[bench] fn $name(b: &mut ::test::Bencher) {
| ^^^^^
...
785 | / bench_trainer!(
786 | | bench_trainer_short,
787 | | include_str!("../test/raw/sigma-wiki.txt")
788 | | );
| |__- in this macro invocation
|
= note: for more information, see https://github.com/rust-lang/rust/issues/50297
= help: add `#![feature(test)]` to the crate attributes to enable
PR follows
This results in an unwrap-panic:
let doc = "I bought $5.50 worth of apples from the store. I gave them to my dog when I came home.)...";
let mut data = TrainingData::english();
let s: Vec<_> = SentenceTokenizer::<Standard>::new(&doc, &data).collect();
PR follows...
It seems that somewhere a string length is not being computed correctly (likely bytes versus characters).
Code to reproduce:
use punkt::*;
use punkt::params::*;
fn main() {
    let content = "Функция. Речи.";
    let trainer: Trainer<Standard> = Trainer::new();
    let mut data = TrainingData::new();
    trainer.train(content, &mut data);

    for s in SentenceTokenizer::<Standard>::new(content, &data) {
        println!("{:?}", s);
    }
}
RUST_BACKTRACE=1 cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/comprehensibility`
thread 'main' panicked at 'byte index 13 is not a char boundary; it is inside 'я' (bytes 12..14) of `функция`', src/libcore/str/mod.rs:2036:5
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/libunwind.rs:88
1: backtrace::backtrace::trace_unsynchronized
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:47
3: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:36
4: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:200
5: std::panicking::default_hook
at src/libstd/panicking.rs:214
6: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:477
7: std::panicking::continue_panic_fmt
at src/libstd/panicking.rs:384
8: rust_begin_unwind
at src/libstd/panicking.rs:311
9: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
10: core::str::slice_error_fail
at src/libcore/str/mod.rs:0
11: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index::{{closure}}
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
12: core::option::Option<T>::unwrap_or_else
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/option.rs:419
13: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
14: core::str::traits::<impl core::ops::index::Index<I> for str>::index
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1625
15: punkt::trainer::is_rare_abbrev_type
at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:449
16: punkt::trainer::Trainer<P>::train
at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:382
17: comprehensibility::main
at src/main.rs:10
18: std::rt::lang_start::{{closure}}
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
19: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:49
20: std::panicking::try::do_call
at src/libstd/panicking.rs:296
21: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:80
22: std::panicking::try
at src/libstd/panicking.rs:275
23: std::panic::catch_unwind
at src/libstd/panic.rs:394
24: std::rt::lang_start_internal
at src/libstd/rt.rs:48
25: std::rt::lang_start
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
26: main
27: __libc_start_main
28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
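The panic itself is a UTF-8 slicing issue: byte index 13 of "функция" falls inside the two-byte character 'я', so any byte-counted index must be backed up to a char boundary before slicing. A std-only demonstration:

```rust
fn main() {
    let s = "функция"; // 7 Cyrillic chars, 2 bytes each = 14 bytes
    assert_eq!(s.len(), 14);

    // Byte 13 is inside 'я' (bytes 12..14); &s[..13] would panic here.
    assert!(!s.is_char_boundary(13));

    // Backing up to the previous char boundary makes the slice safe:
    let mut i = 13;
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    assert_eq!(i, 12);
    assert_eq!(&s[..i], "функци");
}
```

The fix in the trainer would be to compute indices with char_indices (or clamp with is_char_boundary) instead of assuming one byte per character.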
extern crate punkt;
use punkt::trainer::{TrainingData, Trainer};
use punkt::tokenizer::{SentenceTokenizer, WordTokenizer};
fn main() {
    let training_data = TrainingData::english();
    for sent in SentenceTokenizer::new("this is a great sentence! this is a sad sentence.", &training_data) {
        println!("{:?}", sent);
    }
}
Has the result:
"this is a great sentence!"
And the sad sentence never shows up. 😢
The standard Rust prelude includes the trait Default. This makes the examples in the README confusing, as punkt::params::Default isn't mentioned until the configuration section.
It's also sad that every usage of SentenceTokenizer::new needs an explicit type parameter... but I'm not sure what to suggest there.
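One possible remedy for the turbofish: give the tokenizer a default type parameter, so a plain annotation (or any context that fixes the type) picks the standard parameters automatically. A self-contained sketch with made-up names mirroring the crate's shapes:

```rust
use std::marker::PhantomData;

trait Parameters {
    const NAME: &'static str;
}

struct Standard;
impl Parameters for Standard {
    const NAME: &'static str = "standard";
}

// With `P = Standard` as the default, callers that don't care about
// parameters never need to write the turbofish.
struct SentenceTokenizer<'a, P: Parameters = Standard> {
    text: &'a str,
    _params: PhantomData<P>,
}

impl<'a, P: Parameters> SentenceTokenizer<'a, P> {
    fn new(text: &'a str) -> Self {
        SentenceTokenizer { text, _params: PhantomData }
    }
    fn params_name(&self) -> &'static str {
        P::NAME
    }
}

fn main() {
    // A plain type annotation picks up the default; no turbofish:
    let t: SentenceTokenizer = SentenceTokenizer::new("This is a sentence.");
    assert_eq!(t.params_name(), "standard");

    // Opting into explicit parameters still works:
    let t2 = SentenceTokenizer::<Standard>::new("Another.");
    assert_eq!(t2.text, "Another.");
}
```

The trade-off is that fully unannotated `SentenceTokenizer::new(...)` still needs some inference context, since default type parameters don't act as an inference fallback in expressions.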