ferristseng / rust-punkt
Implementation of the Punkt sentence tokenizing algorithm in Rust.
License: Apache License 2.0
How should one go about adding training data for other languages? What is the JSON data structure?
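For what it's worth, the JSON models in this repository appear to carry four top-level keys — the same four loaded into NLTK in the comparison issue further down. A trimmed sketch, with purely illustrative values (`ortho_context` maps a token to an integer of orthographic-context flags):

```json
{
  "abbrev_types": ["np", "dr", "vs"],
  "sentence_starters": ["however", "the"],
  "collocations": [["new", "york"]],
  "ortho_context": {"word": 6}
}
```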
Strings are duplicated: the textual representation of a token only needs to be stored once. To do this, research string interning libraries. The strings could probably be stored in the data object.
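A minimal std-only sketch of the interning idea — no external crate, with `Rc<str>` standing in for whatever shared handle the data object would hand out:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Tiny string interner: each distinct token string is stored once,
// and every occurrence shares the same Rc<str> allocation.
#[derive(Default)]
struct Interner {
    strings: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // HashSet::get lets us look up by &str because Rc<str>: Borrow<str>.
        if let Some(existing) = self.strings.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.strings.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("hello");
    let b = interner.intern("hello");
    // Both tokens share one allocation, and only one string is stored.
    assert!(Rc::ptr_eq(&a, &b));
    assert_eq!(interner.strings.len(), 1);
}
```

A production version would likely use a crate with a faster hash and arena storage, but the shape is the same.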
For example, if the following were tokenized:
hello, world!
Could we get tuples of (0,5), (7,12)? I'm flexible about details like whether the numbers are bytes or characters, 0- or 1-based, or inclusive/exclusive. Thanks for the cool project! 🌴
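In the meantime, if the sentences the tokenizer yields are slices borrowing from the input (as they appear to be), byte spans can be recovered with pointer arithmetic; a sketch using end-exclusive byte offsets (`span_of` is a hypothetical helper, and the slices below stand in for tokenizer output):

```rust
// Given the original text and a &str that borrows from it, recover the
// (start, end) byte offsets of the slice within the original.
fn span_of(text: &str, slice: &str) -> (usize, usize) {
    let start = slice.as_ptr() as usize - text.as_ptr() as usize;
    (start, start + slice.len())
}

fn main() {
    let text = "hello, world!";
    // Stand-ins for sentence slices a tokenizer would yield:
    let first = &text[0..6];   // "hello,"
    let second = &text[7..13]; // "world!"
    assert_eq!(span_of(text, first), (0, 6));
    assert_eq!(span_of(text, second), (7, 13));
}
```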
Right now the object receives a structure that contains the tokenizer's parameters. It would be better if the parameter configuration was type-based / trait-based.
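A sketch of what trait-based configuration could look like — associated constants on a zero-sized marker type instead of a runtime struct. The parameter names here (`ABBREV_THRESHOLD`, `INCLUDE_ALL_COLLOCATIONS`) are made up for illustration:

```rust
// Hypothetical: tokenizer parameters expressed as a trait, so swapping
// behaviour means swapping a zero-sized type, with no stored state.
trait TokenizerParameters {
    const ABBREV_THRESHOLD: f64;
    const INCLUDE_ALL_COLLOCATIONS: bool;
}

struct Standard;

impl TokenizerParameters for Standard {
    const ABBREV_THRESHOLD: f64 = 0.3;
    const INCLUDE_ALL_COLLOCATIONS: bool = false;
}

// A consumer reads configuration from the type, not from a field:
fn is_abbrev_score<P: TokenizerParameters>(score: f64) -> bool {
    score >= P::ABBREV_THRESHOLD
}

fn main() {
    assert!(is_abbrev_score::<Standard>(0.5));
    assert!(!is_abbrev_score::<Standard>(0.1));
}
```

The constants are resolved at compile time, so this also removes a field from the tokenizer object entirely.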
NLTK has a way to realign sentences ending with characters such as ), }, ], ", etc...
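A rough std-only sketch of that kind of realignment — leading closing punctuation is moved back onto the previous sentence. The punctuation set and whitespace handling are simplifying assumptions, not NLTK's exact rules:

```rust
// Rough sketch of boundary realignment: closing punctuation left at the
// start of a sentence is reattached to the previous sentence.
fn realign(sentences: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();
    for s in sentences {
        // Count leading characters that belong to the previous sentence.
        // (All candidates are one-byte ASCII, so char count == byte count.)
        let n = s.chars().take_while(|c| ")}]\"'".contains(*c)).count();
        match out.last_mut() {
            Some(prev) if n > 0 => {
                prev.push_str(&s[..n]);
                let rest = s[n..].trim_start();
                if !rest.is_empty() {
                    out.push(rest.to_string());
                }
            }
            _ => out.push(s.to_string()),
        }
    }
    out
}

fn main() {
    let sents = ["He said \"Hi.", "\" Then he left."];
    assert_eq!(realign(&sents), ["He said \"Hi.\"", "Then he left."]);
}
```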
This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to #rust-offtopic on IRC to discuss.
You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.
TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.
The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.
Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.
To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:
## License
Licensed under either of
* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)
at your option.
### Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.
and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):
// Copyright 2016 rust-punkt developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.
It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.
Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these from the Rust repo for a plain-text version.
And don't forget to update the license metadata in your Cargo.toml to:
license = "MIT/Apache-2.0"
I'll be going through projects which agree to be relicensed and have approval by the necessary contributors and doing these changes, so feel free to leave the heavy lifting to me!
To agree to relicensing, comment with:
I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to chose either at their option.
Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.
Rust-punkt and NLTK Punkt (with realignment off) produce different results when using exactly the same model. NLTK Punkt correctly identifies abbreviations and doesn't split on them, while rust-punkt, with the same model, splits sentences on almost every period.
To test things, I loaded the JSON model from rust-punkt:
import json
from collections import defaultdict

from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# model_path and text are defined elsewhere
with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set(tuple(t) for t in model['collocations'])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)
The output from NLTK Punkt:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.
While rust-punkt produced:
Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.
Right now, the training data all comes as a single package. It might be better to include it as compiled code that is generated from a JSON document.
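One way to sketch that: a build step parses the JSON model and emits Rust source with static tables, so the shipped crate carries compiled data rather than files parsed at runtime. The function below shows only the code-emission half (`generate_abbrev_table` is a hypothetical name, and the input list stands in for parsed JSON):

```rust
use std::fmt::Write;

// Hypothetical codegen step: turn a list of abbreviation types (as would
// come out of the JSON model) into Rust source defining a static table.
fn generate_abbrev_table(abbrevs: &[&str]) -> String {
    let mut out = String::from("pub static ABBREV_TYPES: &[&str] = &[\n");
    for a in abbrevs {
        // escape_default keeps the generated string literal valid Rust.
        writeln!(out, "    \"{}\",", a.escape_default()).unwrap();
    }
    out.push_str("];\n");
    out
}

fn main() {
    let src = generate_abbrev_table(&["np", "dr", "prof"]);
    print!("{}", src);
}
```

In practice this would live in a build.rs that writes the generated file into OUT_DIR for the crate to include!.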
cargo --version
cargo 1.38.0-nightly (42a8c0adf 2019-08-07)
cargo build --lib
error[E0658]: use of unstable library feature 'test': `bench` is a part of custom test frameworks which are unstable
--> src/trainer.rs:774:7
|
774 | #[bench] fn $name(b: &mut ::test::Bencher) {
| ^^^^^
...
785 | / bench_trainer!(
786 | | bench_trainer_short,
787 | | include_str!("../test/raw/sigma-wiki.txt")
788 | | );
| |__- in this macro invocation
|
= note: for more information, see https://github.com/rust-lang/rust/issues/50297
= help: add `#![feature(test)]` to the crate attributes to enable
PR follows
This results in an unwrap-panic:
let doc = "I bought $5.50 worth of apples from the store. I gave them to my dog when I came home.)...";
let mut data = TrainingData::english();
let s: Vec<_> = SentenceTokenizer::<Standard>::new(&doc, &data).collect();
PR follows...
It seems that somewhere a string length is not being computed correctly (likely bytes versus characters).
Code to reproduce:
use punkt::*;
use punkt::params::*;
fn main() {
    let content = "Функция. Речи.";
    let trainer: Trainer<Standard> = Trainer::new();
    let mut data = TrainingData::new();
    trainer.train(content, &mut data);

    for s in SentenceTokenizer::<Standard>::new(content, &data) {
        println!("{:?}", s);
    }
}
RUST_BACKTRACE=1 cargo run
Finished dev [unoptimized + debuginfo] target(s) in 0.02s
Running `target/debug/comprehensibility`
thread 'main' panicked at 'byte index 13 is not a char boundary; it is inside 'я' (bytes 12..14) of `функция`', src/libcore/str/mod.rs:2036:5
stack backtrace:
0: backtrace::backtrace::libunwind::trace
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/libunwind.rs:88
1: backtrace::backtrace::trace_unsynchronized
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/mod.rs:66
2: std::sys_common::backtrace::_print
at src/libstd/sys_common/backtrace.rs:47
3: std::sys_common::backtrace::print
at src/libstd/sys_common/backtrace.rs:36
4: std::panicking::default_hook::{{closure}}
at src/libstd/panicking.rs:200
5: std::panicking::default_hook
at src/libstd/panicking.rs:214
6: std::panicking::rust_panic_with_hook
at src/libstd/panicking.rs:477
7: std::panicking::continue_panic_fmt
at src/libstd/panicking.rs:384
8: rust_begin_unwind
at src/libstd/panicking.rs:311
9: core::panicking::panic_fmt
at src/libcore/panicking.rs:85
10: core::str::slice_error_fail
at src/libcore/str/mod.rs:0
11: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index::{{closure}}
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
12: core::option::Option<T>::unwrap_or_else
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/option.rs:419
13: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
14: core::str::traits::<impl core::ops::index::Index<I> for str>::index
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1625
15: punkt::trainer::is_rare_abbrev_type
at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:449
16: punkt::trainer::Trainer<P>::train
at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:382
17: comprehensibility::main
at src/main.rs:10
18: std::rt::lang_start::{{closure}}
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
19: std::rt::lang_start_internal::{{closure}}
at src/libstd/rt.rs:49
20: std::panicking::try::do_call
at src/libstd/panicking.rs:296
21: __rust_maybe_catch_panic
at src/libpanic_unwind/lib.rs:80
22: std::panicking::try
at src/libstd/panicking.rs:275
23: std::panic::catch_unwind
at src/libstd/panic.rs:394
24: std::rt::lang_start_internal
at src/libstd/rt.rs:48
25: std::rt::lang_start
at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
26: main
27: __libc_start_main
28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
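The panic itself is a UTF-8 slicing issue: byte index 13 of "функция" falls inside the two-byte character 'я', so any byte-counted index must be backed up to a char boundary before slicing. A std-only demonstration:

```rust
fn main() {
    let s = "функция"; // 7 Cyrillic chars, 2 bytes each = 14 bytes
    assert_eq!(s.len(), 14);

    // Byte 13 is inside 'я' (bytes 12..14); &s[..13] would panic here.
    assert!(!s.is_char_boundary(13));

    // Backing up to the previous char boundary makes the slice safe:
    let mut i = 13;
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    assert_eq!(i, 12);
    assert_eq!(&s[..i], "функци");
}
```

The fix in the trainer would be to compute indices with char_indices (or clamp with is_char_boundary) instead of assuming one byte per character.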
extern crate punkt;
use punkt::trainer::{TrainingData, Trainer};
use punkt::tokenizer::{SentenceTokenizer, WordTokenizer};
fn main() {
    let training_data = TrainingData::english();
    for sent in SentenceTokenizer::new("this is a great sentence! this is a sad sentence.", &training_data) {
        println!("{:?}", sent);
    }
}
Has the result:
"this is a great sentence!"
And the sad sentence never shows up. 😢
The standard Rust prelude includes the trait Default. This makes the examples in the README confusing, as punkt::params::Default isn't mentioned until the configuration section.
It's also sad that every usage of SentenceTokenizer::new needs an explicit type parameter... but I'm not sure what to suggest there.
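One possible remedy for the turbofish: give the tokenizer a default type parameter, so a plain annotation (or any context that fixes the type) picks the standard parameters automatically. A self-contained sketch with made-up names mirroring the crate's shapes:

```rust
use std::marker::PhantomData;

trait Parameters {
    const NAME: &'static str;
}

struct Standard;
impl Parameters for Standard {
    const NAME: &'static str = "standard";
}

// With `P = Standard` as the default, callers that don't care about
// parameters never need to write the turbofish.
struct SentenceTokenizer<'a, P: Parameters = Standard> {
    text: &'a str,
    _params: PhantomData<P>,
}

impl<'a, P: Parameters> SentenceTokenizer<'a, P> {
    fn new(text: &'a str) -> Self {
        SentenceTokenizer { text, _params: PhantomData }
    }
    fn params_name(&self) -> &'static str {
        P::NAME
    }
}

fn main() {
    // A plain type annotation picks up the default; no turbofish:
    let t: SentenceTokenizer = SentenceTokenizer::new("This is a sentence.");
    assert_eq!(t.params_name(), "standard");

    // Opting into explicit parameters still works:
    let t2 = SentenceTokenizer::<Standard>::new("Another.");
    assert_eq!(t2.text, "Another.");
}
```

The trade-off is that fully unannotated `SentenceTokenizer::new(...)` still needs some inference context, since default type parameters don't act as an inference fallback in expressions.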