rust-punkt's People

Contributors

ferristseng, hirschenberger, kfreiman, shepmaster

rust-punkt's Issues

Use string interning

Strings are currently duplicated. The representation of a token only needs to be stored once; to do this, research string-interning libraries. The strings can probably be stored in the data object.
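
The idea above can be sketched with just the standard library: a minimal interner that hands out shared `Rc<str>` handles, so each distinct token string is allocated once. This is an illustrative sketch, not punkt's actual implementation; the `Interner` type and its API are hypothetical.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Hypothetical minimal string interner: each distinct string is stored
/// once in the pool, and repeated requests share the same allocation.
#[derive(Default)]
struct Interner {
    pool: HashSet<Rc<str>>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // `Rc<str>` borrows as `str`, so we can look up by `&str` directly.
        if let Some(existing) = self.pool.get(s) {
            return Rc::clone(existing);
        }
        let rc: Rc<str> = Rc::from(s);
        self.pool.insert(Rc::clone(&rc));
        rc
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("sentence");
    let b = interner.intern("sentence");
    // Both handles point at the same allocation.
    assert!(Rc::ptr_eq(&a, &b));
    println!("interned {} unique string(s)", interner.pool.len());
}
```

A dedicated crate would add details like integer symbol IDs and arena storage, but the core invariant is the same: equal strings share one allocation.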

Alter the way objects can be configured

Right now the object receives a structure that contains the tokenizer's parameters. It would be better if the parameter configuration were type-based / trait-based.
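
One way to read "type-based / trait-based" is the pattern the crate's later API uses (`SentenceTokenizer::<Standard>`): parameters become associated constants on a trait, selected by a type parameter instead of passed as a runtime struct. The sketch below is hypothetical; the trait, constant, and struct names are illustrative, not punkt's real API.

```rust
use std::marker::PhantomData;

// Hypothetical parameter trait: configuration lives in the type system.
trait TokenizerParameters {
    const ABBREV_LOWER_BOUND: f64;
    const INCLUDE_ALL_COLLOCATIONS: bool;
}

// A concrete parameter set, selected purely by naming the type.
struct Standard;

impl TokenizerParameters for Standard {
    const ABBREV_LOWER_BOUND: f64 = 0.3;
    const INCLUDE_ALL_COLLOCATIONS: bool = false;
}

struct Tokenizer<P: TokenizerParameters> {
    _params: PhantomData<P>,
}

impl<P: TokenizerParameters> Tokenizer<P> {
    fn new() -> Self {
        Tokenizer { _params: PhantomData }
    }

    fn threshold(&self) -> f64 {
        // Parameters are read from the type, not from a stored struct.
        P::ABBREV_LOWER_BOUND
    }
}

fn main() {
    let t: Tokenizer<Standard> = Tokenizer::new();
    assert_eq!(t.threshold(), 0.3);
    println!("threshold = {}", t.threshold());
}
```

The upside is zero runtime configuration state and compile-time monomorphization; the downside, noted in another issue below, is that every `new` call then needs an explicit type parameter.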

Relicense under dual MIT/Apache-2.0

This issue was automatically generated. Feel free to close without ceremony if
you do not agree with re-licensing or if it is not possible for other reasons.
Respond to @cmr with any questions or concerns, or pop over to
#rust-offtopic on IRC to discuss.

You're receiving this because someone (perhaps the project maintainer)
published a crates.io package with the license as "MIT" xor "Apache-2.0" and
the repository field pointing here.

TL;DR the Rust ecosystem is largely Apache-2.0. Being available under that
license is good for interoperation. The MIT license as an add-on can be nice
for GPLv2 projects to use your code.

Why?

The MIT license requires reproducing countless copies of the same copyright
header with different names in the copyright field, for every MIT library in
use. The Apache license does not have this drawback. However, this is not the
primary motivation for me creating these issues. The Apache license also has
protections from patent trolls and an explicit contribution licensing clause.
However, the Apache license is incompatible with GPLv2. This is why Rust is
dual-licensed as MIT/Apache (the "primary" license being Apache, MIT only for
GPLv2 compat), and doing so would be wise for this project. This also makes
this crate suitable for inclusion and unrestricted sharing in the Rust
standard distribution and other projects using dual MIT/Apache, such as my
personal ulterior motive, the Robigalia project.

Some ask, "Does this really apply to binary redistributions? Does MIT really
require reproducing the whole thing?" I'm not a lawyer, and I can't give legal
advice, but some Google Android apps include open source attributions using
this interpretation. Others also agree with
it
.
But, again, the copyright notice redistribution is not the primary motivation
for the dual-licensing. It's stronger protections to licensees and better
interoperation with the wider Rust ecosystem.

How?

To do this, get explicit approval from each contributor of copyrightable work
(as not all contributions qualify for copyright, due to not being a "creative
work", e.g. a typo fix) and then add the following to your README:

## License

Licensed under either of

 * Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
 * MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

### Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted
for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any
additional terms or conditions.

and in your license headers, if you have them, use the following boilerplate
(based on that used in Rust):

// Copyright 2016 rust-punkt developers
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

It's commonly asked whether license headers are required. I'm not comfortable
making an official recommendation either way, but the Apache license
recommends it in their appendix on how to use the license.

Be sure to add the relevant LICENSE-{MIT,APACHE} files. You can copy these
from the Rust repo for a plain-text
version.

And don't forget to update the license metadata in your Cargo.toml to:

license = "MIT/Apache-2.0"

I'll be going through projects which agree to be relicensed and have approval
from the necessary contributors and making these changes, so feel free to leave
the heavy lifting to me!

Contributor checkoff

To agree to relicensing, comment with:

I license past and future contributions under the dual MIT/Apache-2.0 license, allowing licensees to choose either at their option.

Or, if you're a contributor, you can check the box in this repo next to your
name. My scripts will pick this exact phrase up and check your checkbox, but
I'll come through and manually review this issue later as well.

Sentence splitting errors and different output compared to NLTK

Rust-punkt and NLTK Punkt (with boundary realignment off) produce different results when using exactly the same model. NLTK Punkt correctly identifies abbreviations and doesn't split on them, while rust-punkt, with the same model, splits sentences at almost every period.

To test things, I loaded the JSON model from rust-punkt:

import json
from collections import defaultdict
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

with open(model_path, mode='r', encoding='UTF8') as model_file:
    model = json.load(model_file)

params = PunktParameters()
params.sent_starters = set(model['sentence_starters'])
params.abbrev_types = set(model['abbrev_types'])
params.collocations = set([tuple(t) for t in model['collocations']])
params.ortho_context = defaultdict(int, model['ortho_context'])

punkt = PunktSentenceTokenizer(params)
punkt.tokenize(text, realign_boundaries=False)

The output from NLTK Punkt:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np. w perlu), to jednak określa pola bieżącego rekordu.

While rust-punkt produced:

Choć zapis pól ($X) może kojarzyć się z zapisem określającym zmienne (jak np.
w perlu), to jednak określa pola bieżącego rekordu.

Can't build. `bench` is a part of custom test frameworks which are unstable

cargo --version   
cargo 1.38.0-nightly (42a8c0adf 2019-08-07)

cargo build --lib
error[E0658]: use of unstable library feature 'test': `bench` is a part of custom test frameworks which are unstable
   --> src/trainer.rs:774:7
    |
774 |       #[bench] fn $name(b: &mut ::test::Bencher) {
    |         ^^^^^
...
785 | / bench_trainer!(
786 | |   bench_trainer_short,
787 | |   include_str!("../test/raw/sigma-wiki.txt")
788 | | );
    | |__- in this macro invocation
    |
    = note: for more information, see https://github.com/rust-lang/rust/issues/50297
    = help: add `#![feature(test)]` to the crate attributes to enable

PR follows
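
A common workaround for this class of error is to gate the unstable `test` crate behind a cargo feature, so a plain stable `cargo build --lib` never touches the benchmarks. The fragment below is a sketch of that pattern (the feature name `bench` is illustrative), not necessarily what the linked PR does:

```rust
// In Cargo.toml:
//   [features]
//   bench = []
//
// At the top of lib.rs: only enable the unstable `test` feature (and the
// `extern crate test` it requires) when building with `--features bench`
// on a nightly toolchain.
#![cfg_attr(feature = "bench", feature(test))]

#[cfg(feature = "bench")]
extern crate test;
```

With this gating, `cargo build` succeeds on stable, and `cargo bench --features bench` still works on nightly.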

Panic when parsing an ellipsis at end of text

This results in an unwrap-panic:

let doc = "I bought $5.50 worth of apples from the store. I gave them to my dog when I came home.)...";
let mut data = TrainingData::english();
let s: Vec<_> = SentenceTokenizer::<Standard>::new(&doc, &data).collect();

PR follows...

Panic with multibyte string

It seems that somewhere the length of the string is computed incorrectly.

Code to reproduce:

use punkt::*;
use punkt::params::*;

fn main() {
  let content = "Функция. Речи.";

  let trainer: Trainer<Standard> = Trainer::new();
  let mut data = TrainingData::new();

  trainer.train(content, &mut data);

  for s in SentenceTokenizer::<Standard>::new(content, &data) {
    println!("{:?}", s);
  }
}
RUST_BACKTRACE=1 cargo run

    Finished dev [unoptimized + debuginfo] target(s) in 0.02s
     Running `target/debug/comprehensibility`
thread 'main' panicked at 'byte index 13 is not a char boundary; it is inside 'я' (bytes 12..14) of `функция`', src/libcore/str/mod.rs:2036:5
stack backtrace:
   0: backtrace::backtrace::libunwind::trace
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/libunwind.rs:88
   1: backtrace::backtrace::trace_unsynchronized
             at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.29/src/backtrace/mod.rs:66
   2: std::sys_common::backtrace::_print
             at src/libstd/sys_common/backtrace.rs:47
   3: std::sys_common::backtrace::print
             at src/libstd/sys_common/backtrace.rs:36
   4: std::panicking::default_hook::{{closure}}
             at src/libstd/panicking.rs:200
   5: std::panicking::default_hook
             at src/libstd/panicking.rs:214
   6: std::panicking::rust_panic_with_hook
             at src/libstd/panicking.rs:477
   7: std::panicking::continue_panic_fmt
             at src/libstd/panicking.rs:384
   8: rust_begin_unwind
             at src/libstd/panicking.rs:311
   9: core::panicking::panic_fmt
             at src/libcore/panicking.rs:85
  10: core::str::slice_error_fail
             at src/libcore/str/mod.rs:0
  11: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index::{{closure}}
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
  12: core::option::Option<T>::unwrap_or_else
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/option.rs:419
  13: core::str::traits::<impl core::slice::SliceIndex<str> for core::ops::range::RangeTo<usize>>::index
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1823
  14: core::str::traits::<impl core::ops::index::Index<I> for str>::index
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libcore/str/mod.rs:1625
  15: punkt::trainer::is_rare_abbrev_type
             at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:449
  16: punkt::trainer::Trainer<P>::train
             at /home/kirill/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/trainer.rs:382
  17: comprehensibility::main
             at src/main.rs:10
  18: std::rt::lang_start::{{closure}}
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
  19: std::rt::lang_start_internal::{{closure}}
             at src/libstd/rt.rs:49
  20: std::panicking::try::do_call
             at src/libstd/panicking.rs:296
  21: __rust_maybe_catch_panic
             at src/libpanic_unwind/lib.rs:80
  22: std::panicking::try
             at src/libstd/panicking.rs:275
  23: std::panic::catch_unwind
             at src/libstd/panic.rs:394
  24: std::rt::lang_start_internal
             at src/libstd/rt.rs:48
  25: std::rt::lang_start
             at /rustc/4560cb830fce63fcffdc4558f4281aaac6a3a1ba/src/libstd/rt.rs:64
  26: main
  27: __libc_start_main
  28: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
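
The panic message points at the likely cause: `is_rare_abbrev_type` slices the string at a byte index that falls inside a multibyte character. In Rust, `&s[..n]` uses byte offsets and panics unless `n` is a char boundary. The standalone sketch below reproduces the arithmetic from the panic message (`функция` is 14 bytes but 7 chars, and byte 13 sits inside the final `я`) and shows a guarded slice; it illustrates the failure mode, not punkt's internal code.

```rust
fn main() {
    let word = "функция"; // Cyrillic: every char here is 2 bytes in UTF-8
    assert_eq!(word.len(), 14);          // byte length
    assert_eq!(word.chars().count(), 7); // character count

    // Byte index 13 lands inside 'я' (bytes 12..14), exactly as in the
    // panic message. Uncommenting the next line would panic:
    // let bad = &word[..13];

    // A safe variant checks the boundary first, or derives indices from
    // char_indices() instead of assuming one byte per character.
    let idx = 13;
    if word.is_char_boundary(idx) {
        println!("prefix: {}", &word[..idx]);
    } else {
        println!("{} is not a char boundary", idx);
    }
}
```

Any fix will need indices derived from `char_indices()` (or boundary checks) wherever the trainer slices token text.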

Second sentence seems to be missing

extern crate punkt;

use punkt::trainer::{TrainingData, Trainer};
use punkt::tokenizer::{SentenceTokenizer, WordTokenizer};

fn main() {
    let training_data = TrainingData::english();

    for sent in SentenceTokenizer::new("this is a great sentence! this is a sad sentence.", &training_data) {
        println!("{:?}", sent);
    }
}

Has the result:

"this is a great sentence!"

And the sad sentence never shows up. 😢

`Default` is an unfortunate choice of name

The standard Rust prelude includes the trait Default. This makes the examples in the README confusing, since punkt::params::Default isn't mentioned until the configuration section.

It's also sad that every usage of SentenceTokenizer::new needs to have an explicit type parameter... but I'm not sure what to suggest there.
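
The clash is easy to demonstrate with the standard library alone: an explicitly imported type named `Default` shadows the prelude's `Default` trait in that scope. The `params` module below stands in for `punkt::params` and is purely illustrative.

```rust
// Hypothetical stand-in for punkt::params, which exports a type
// named `Default` that collides with the prelude trait of the same name.
mod params {
    pub struct Default;
}

// This explicit import shadows std's `Default` trait in this scope.
use params::Default;

fn main() {
    let _marker = Default; // the unit struct, not the trait

    // The prelude trait is still reachable, but only by full path now:
    let n: i32 = std::default::Default::default();
    assert_eq!(n, 0);
    println!("n = {}", n);
}
```

Renaming the parameter type (the crate's later `Standard` naming avoids this) or importing it as `use punkt::params::Default as Params;` sidesteps the shadowing.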
