sentences's Introduction

Sentences - A command line sentence tokenizer

This command line utility will convert a blob of text into a list of sentences.

Features

  • Supports multiple languages (English, Czech, Dutch, Estonian, Finnish, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, and Turkish)
  • Zero dependencies
  • Extendable
  • Fast

Install

Arch

Available on the AUR.

Mac

brew tap neurosnap/sentences
brew install sentences

Other

Pre-built binaries are available on the GitHub releases page.

Using Go

go get github.com/neurosnap/sentences
go install github.com/neurosnap/sentences/cmd/sentences

Command line

Get it

go get github.com/neurosnap/sentences

Use it

package main

import (
    "fmt"
    "log"
    "os"

    "github.com/neurosnap/sentences"
)

func main() {
    text := `A perennial also-ran, Stallings won his seat when longtime lawmaker David Holmes
    died 11 days after the filing deadline. Suddenly, Stallings was a shoo-in, not
    the long shot. In short order, the Legislature attempted to pass a law allowing
    former U.S. Rep. Carolyn Cheeks Kilpatrick to file; Stallings challenged the
    law in court and won. Kilpatrick mounted a write-in campaign, but Stallings won.`

    // Download the training data from this repo (./data) and save it somewhere.
    b, err := os.ReadFile("./path/to/english.json")
    if err != nil {
        log.Fatal(err)
    }

    // Load the training data.
    training, err := sentences.LoadTraining(b)
    if err != nil {
        log.Fatal(err)
    }

    // Create the default sentence tokenizer and split the text.
    tokenizer := sentences.NewSentenceTokenizer(training)
    for _, s := range tokenizer.Tokenize(text) {
        fmt.Println(s.Text)
    }
}

English

The english subpackage attempts to fix some tokenization problems I noticed for English.

package main

import (
    "fmt"

    "github.com/neurosnap/sentences/english"
)

func main() {
    text := "Hi there. Does this really work?"

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }

    sentences := tokenizer.Tokenize(text)
    for _, s := range sentences {
        fmt.Println(s.Text)
    }
}

Contributing

I need help maintaining this library. If you are interested in contributing, please start by looking at the golden-rules branch, which tests the Golden Rules for English sentence tokenization created by the Pragmatic Segmenter library.

Pick a particular failing test, then open an issue or submit a PR for it.

I'm happy to help anyone willing to contribute.

Customize

sentences was built around composability; most major components of this package can be extended.

Eager to make ad-hoc changes but don't know how to start? Have a look at github.com/neurosnap/sentences/english for a solid example.
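
As a sketch of that composability, the following layers a hypothetical post-processing pass on top of the default tokenizer. It relies only on what the examples above already show (Tokenize returning a slice of *sentences.Sentence values with a Text field); mergeShortFragments is not part of this package:

package main

import (
    "fmt"
    "strings"

    "github.com/neurosnap/sentences"
    "github.com/neurosnap/sentences/english"
)

// mergeShortFragments is a hypothetical post-processing pass layered on top
// of the tokenizer's output: any "sentence" that is nothing but terminal
// punctuation (e.g. a stray "." split off an ellipsis) is glued back onto
// its predecessor.
func mergeShortFragments(sents []*sentences.Sentence) []*sentences.Sentence {
    var out []*sentences.Sentence
    for _, s := range sents {
        bare := strings.Trim(strings.TrimSpace(s.Text), ".?!")
        if len(out) > 0 && bare == "" {
            out[len(out)-1].Text += s.Text
            continue
        }
        out = append(out, s)
    }
    return out
}

func main() {
    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }
    raw := tokenizer.Tokenize("Harry Potter . . . what an honor.")
    for _, s := range mergeShortFragments(raw) {
        fmt.Println(s.Text)
    }
}

The same wrapping approach works for any other cleanup rule, since the tokenizer's output is plain data.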

Notice

I have not tested this tokenizer in any language other than English. By default the command line utility loads English. I welcome anyone willing to test the other languages and submit updates as needed.

A primary goal for this package is to be multilingual so I'm willing to help in any way possible.

This library is a port of NLTK's Punkt tokenizer.

A Punkt Tokenizer

An unsupervised multilingual sentence boundary detection library for Go. The Punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters have been determined, finding sentence boundaries becomes much easier.

Many problems arise when tokenizing text into sentences, the primary one being abbreviations. The Punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both, by training on text in the given language. It incorporates both token- and type-based analysis through two different phases of annotation.
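
To see why abbreviations dominate the problem, consider a deliberately naive splitter that breaks on every period unless the word carrying it is on an abbreviation list. This toy sketch is not the Punkt algorithm; Punkt's contribution is learning that list (plus collocations and sentence starters) from raw text rather than hard-coding it:

package main

import (
    "fmt"
    "strings"
)

// naiveSplit is a deliberately simplified illustration, not Punkt itself:
// it ends a sentence at every period unless the word carrying the period
// is a known abbreviation.
func naiveSplit(text string, abbrevs map[string]bool) []string {
    var sents []string
    var cur []string
    for _, w := range strings.Fields(text) {
        cur = append(cur, w)
        if !strings.HasSuffix(w, ".") {
            continue
        }
        if abbrevs[strings.ToLower(strings.TrimSuffix(w, "."))] {
            continue // e.g. "Mr." or "U.S.": probably not a boundary
        }
        sents = append(sents, strings.Join(cur, " "))
        cur = nil
    }
    if len(cur) > 0 {
        sents = append(sents, strings.Join(cur, " "))
    }
    return sents
}

func main() {
    abbrevs := map[string]bool{"mr": true, "rep": true, "u.s": true}
    text := "Former U.S. Rep. Smith left the bank. Mr. Jones stayed."
    for _, s := range naiveSplit(text, abbrevs) {
        fmt.Println(s)
    }
}

Words like "P.M." that end a sentence and are abbreviations at the same time are exactly where such a static list fails, and where Punkt's token-level second pass comes in.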


Performance

Using the Brown Corpus, an annotated corpus of American English text, we compare this package with other libraries across multiple programming languages.

Library      Avg Speed (s, 10 runs)   Accuracy (%)
Sentences    1.96                     98.95
NLTK         5.22                     99.21
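
The Go-side number can be reproduced with a standard benchmark along these lines (a sketch: "brown.txt" is a placeholder for a local plain-text copy of the corpus, and the function must live in a *_test.go file):

package bench

import (
    "os"
    "testing"

    "github.com/neurosnap/sentences/english"
)

// BenchmarkTokenize sketches how the speed column above can be measured.
func BenchmarkTokenize(b *testing.B) {
    data, err := os.ReadFile("brown.txt")
    if err != nil {
        b.Fatal(err)
    }
    text := string(data)

    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        b.Fatal(err)
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        tokenizer.Tokenize(text)
    }
}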

sentences's People

Contributors

anthonyfok, codelingobot, dawny33, demyan, jackcook, neurosnap, ryzheboka, wzru


sentences's Issues

Data file structure/creation

I'd like to train a punkt model on a custom corpus: in this case, a large set of tweets collected from the Twitter API. While it is technically English, I'm not having great results with any off-the-shelf tokenizer available in Go. Twitter obviously has some idiosyncrasies: unique abbreviations, the misspellings inherent to web text, URLs, emoji, and so on. I wanted to take an unsupervised approach first, which led me to punkt and this package.

I've taken a quick look at the data files provided in the repo, but it isn't completely clear what their structure is or how they were created. I'm happy to make a pull request with my work if I see some results, but if you could point me in the right direction as to how the data file for a custom corpus is generated, I'd really appreciate it. It's entirely possible that I'm missing some existing documentation; if not, I'd be happy to clean up any explanation you can give and make a pull request to include it in the docs. Thanks!
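
For reference, one way to start exploring a training file's shape without any schema documentation (a generic sketch, assuming only that the top level of the file is a JSON object):

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// Dump the top-level keys of a training file and the size of each value,
// without assuming anything about the schema.
func main() {
    b, err := os.ReadFile("./data/english.json")
    if err != nil {
        panic(err)
    }
    var top map[string]json.RawMessage
    if err := json.Unmarshal(b, &top); err != nil {
        panic(err)
    }
    for key, val := range top {
        fmt.Printf("%s: %d bytes\n", key, len(val))
    }
}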

Add support for Faroese

Hello, I was wondering how I would go about adding support for more languages. I can see that the key is to have training data, but how do I generate the required JSON file? Thank you in advance for making this package!

The demo doesn't seem to work with these two paragraphs.

An excerpt from Adventures with mmap

This week I started at the Recurse Center, a self directed program where everyone is working at becoming a better programmer. If you’ve been considering it, you should definitely do it! It’s even more awesome than you’ve heard!
The first project I’m working on is a distributed in-memory datastore. But it’s primarily an excuse to play around with stuff I’ve been reading about and haven’t gotten around to! This is the story of my adventure with mmap.

Use with Windows

I was wondering how to use this on Windows? I downloaded sentences_windows-amd64.tar.gz, extracted it, and added it to my path. I was hoping to use it like this:

type input.txt > sentences > output.txt

The extracted file does not seem to work as an executable, though. Could someone help?

Ellipses are split off into sentences

The output of this selection of text:

“Can’t, Tom, I’m on Hogwarts business,” said Hagrid, clapping his great hand on Harry’s shoulder and making Harry’s knees buckle.
“Good Lord,” said the bartender, peering at Harry, “is this — can this be — ?”
The Leaky Cauldron had suddenly gone completely still and silent.
“Bless my soul,” whispered the old bartender, “Harry Potter . . . what an honor.”
He hurried out from behind the bar, rushed toward Harry and seized his hand, tears in his eyes.
“Welcome back, Mr. Potter, welcome back.”

is:

“Can’t, Tom, I’m on Hogwarts business,” said Hagrid, clapping his great hand on Harry’s shoulder and making Harry’s knees buckle.
“Good Lord,” said the bartender, peering at Harry, “is this — can this be — ?”
The Leaky Cauldron had suddenly gone completely still and silent.
“Bless my soul,” whispered the old bartender, “Harry Potter .
.
.
what an honor.”
He hurried out from behind the bar, rushed toward Harry and seized his hand, tears in his eyes.
“Welcome back, Mr. Potter, welcome back.”

How to have all supported languages available at runtime?

I'm trying to use this library in a multilingual environment. I have a function that receives raw text and a language name as parameters, loads the right language package, and returns sentences.

As only English is loaded by default, my test fails for all other languages, which I expected. But then I tried to run "make spanish" in the project folder and hit two different errors:

  • First, a permission error, since data/spanish.json is read-only (installed with go get ...)
  • Then I ran with sudo, which worked fine. But my test fails with this error:

gopkg.in/neurosnap/sentences.v1/data

/Users/***/go/pkg/mod/gopkg.in/neurosnap/[email protected]/data/spanish.go:18:6: bindataRead redeclared in this block

Could you give me some indication on how to compile all supported language packages so they are available to choose at runtime?

Thanks!
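
For anyone hitting the same wall: instead of compiling each language in, the training JSON can be loaded at runtime with the public LoadTraining API. A minimal sketch, assuming the files from this repo's ./data directory are shipped next to the binary (the ./data path and the tokenizerFor helper are hypothetical, and the concrete return type of NewSentenceTokenizer is written here as *sentences.DefaultSentenceTokenizer):

package main

import (
    "fmt"
    "os"
    "path/filepath"

    "github.com/neurosnap/sentences"
)

// tokenizerFor loads the training JSON for the named language at runtime,
// so no language has to be compiled into the binary. dataDir is assumed to
// hold the files from this repo's ./data directory (e.g. spanish.json).
func tokenizerFor(dataDir, lang string) (*sentences.DefaultSentenceTokenizer, error) {
    b, err := os.ReadFile(filepath.Join(dataDir, lang+".json"))
    if err != nil {
        return nil, err
    }
    training, err := sentences.LoadTraining(b)
    if err != nil {
        return nil, err
    }
    return sentences.NewSentenceTokenizer(training), nil
}

func main() {
    tok, err := tokenizerFor("./data", "spanish")
    if err != nil {
        panic(err)
    }
    for _, s := range tok.Tokenize("Hola. ¿Qué tal?") {
        fmt.Println(s.Text)
    }
}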

More sentence examples

The Ruby lib pragmatic_segmenter has a list of 50+ sentence-split examples that this lib fails to parse. You can use their list to test this lib.

For example:

He left the bank at 6 P.M. Mr. Smith then went to the store.

neurosnap/sentences treats this as one sentence.
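
As a sketch of how one of these examples becomes a regression test against the public API (the test name and expected strings are illustrative; TrimSpace normalizes whitespace since exact spacing is not the point here):

package sentences_test

import (
    "strings"
    "testing"

    "github.com/neurosnap/sentences/english"
)

// TestAbbreviationAtSentenceBoundary encodes the example above: "P.M."
// both ends the first sentence and is an abbreviation.
func TestAbbreviationAtSentenceBoundary(t *testing.T) {
    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        t.Fatal(err)
    }

    text := "He left the bank at 6 P.M. Mr. Smith then went to the store."
    want := []string{
        "He left the bank at 6 P.M.",
        "Mr. Smith then went to the store.",
    }

    got := tokenizer.Tokenize(text)
    if len(got) != len(want) {
        t.Fatalf("got %d sentences, want %d", len(got), len(want))
    }
    for i, s := range got {
        if strings.TrimSpace(s.Text) != want[i] {
            t.Errorf("sentence %d: got %q, want %q", i, s.Text, want[i])
        }
    }
}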

Installation instructions broken, binary links dead

When trying to download the binaries, I'm getting something like this for all of them.

<Error> <Code>AllAccessDisabled</Code> <Message>All access to this object has been disabled</Message><RequestId>7268EB2B3DC8532F</RequestId> <HostId>i2U6tSOMH/7Kyq29rzKr/A7HubUHQRQI/01b8nsYxBshadyeuc1jwRBDtHjaGA26ivrIH9tTEHU=</HostId> </Error>

Also, the commands in the readme are incorrect: sentences/cmd/sentences no longer exists, as cmd was renamed to _cmd.

LoadTraining fails with

Instead of compiling the assets with go-bindata, I load the JSON files like this:

f, err := os.Open(initalisationfilenames[lang].segmentationfilename)
if err != nil {
	return nil, err
}
b, err := ioutil.ReadAll(f)
if err != nil {
	return nil, err
}

The results seem to be correct; can you confirm?

Allow sentence-final lower-case i

I've been trying to manually tweak the English JSON data to get the tokeniser to recognise ... i. (a word consisting of a single lower case i) as a valid end of sentence, without success. Any suggestions would be welcome.

optimization suggestion

Staring at a profile at the moment, it appears that regex compilation happens on each tokenization. It seems like caching compiled regexes would make this (awesome) library twice as fast on a large corpus?
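
The fix being suggested is the standard Go pattern: compile once, reuse. A minimal sketch in isolation (the pattern and function names are hypothetical stand-ins, not this library's actual code):

package main

import (
    "fmt"
    "regexp"
)

// Compiled once at package init and reused on every call. The pattern is a
// hypothetical stand-in, not one of this library's actual regexes.
var wordRe = regexp.MustCompile(`\w+`)

// tokenizeSlow recompiles the pattern on every call; on a large corpus the
// compilation can dominate the profile, which is what the report describes.
func tokenizeSlow(text string) []string {
    return regexp.MustCompile(`\w+`).FindAllString(text, -1)
}

// tokenizeFast reuses the package-level compiled regex.
func tokenizeFast(text string) []string {
    return wordRe.FindAllString(text, -1)
}

func main() {
    fmt.Println(tokenizeSlow("recompiling is wasteful"))
    fmt.Println(tokenizeFast("cache compiled regexes instead"))
}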


double-newlines should always start a new sentence?

I noticed this in the context of cited quotations like

I think there's a bug here.  — me

And then another paragraph.

I think that should be 3 "sentences". The double newline might be a reliable clue: continuing a sentence from one paragraph to the next is at least uncommon, if not disallowed, right? (Perhaps depending on whether you want to keep paragraphs together when one ends with an ellipsis and the next starts with one.) Another way would be to recognize this cited-quotation form, but I guess that could be risky.

diff --git a/sentences_test.go b/sentences_test.go
index e506188..d178f09 100644
--- a/sentences_test.go
+++ b/sentences_test.go
@@ -174,6 +174,19 @@ func TestSpacedPeriod(t *testing.T) {
        compareSentence(t, actualText, expected)
 }
 
+func TestQuotationSourceAndDoubleNewlines(t *testing.T) {
+       t.Log("Tokenizer should treat double-newline as end of sentence regardless of ending punctuation")
+
+       actualText := "'A witty saying proves nothing.' — Voltaire\n\nAnd yet it commands attention."
+       expected := []string{
+               "'A witty saying proves nothing.'",
+               " — Voltaire",
+               "And yet it commands attention.",
+       }
+
+       compareSentence(t, actualText, expected)
+}
+

I was poking around; I see you have token.ParaStart being set sometimes when a double-newline is detected, but treating ParaStart the same as SentBreak in Tokenize() didn't fix it.
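
Until the tokenizer handles this itself, a caller-side workaround is possible with only the public API: split on blank lines first, then tokenize each paragraph. A sketch:

package main

import (
    "fmt"
    "strings"

    "github.com/neurosnap/sentences/english"
)

// Force a sentence break at every blank line by splitting the text into
// paragraphs before handing each one to the tokenizer. A workaround
// sketch, not library behavior.
func main() {
    tokenizer, err := english.NewSentenceTokenizer(nil)
    if err != nil {
        panic(err)
    }
    text := "I think there's a bug here. — me\n\nAnd then another paragraph."
    for _, para := range strings.Split(text, "\n\n") {
        for _, s := range tokenizer.Tokenize(para) {
            fmt.Println(strings.TrimSpace(s.Text))
        }
    }
}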

spf13/cobra for command line leads to many recursive deps

When you use sentences as a module and don't need the command line utility, you're stuck vendoring a massive number of recursive dependencies stemming from spf13/cobra.

gvt fetch gopkg.in/neurosnap/sentences.v1
2017/03/05 20:23:42 Fetching: gopkg.in/neurosnap/sentences.v1
2017/03/05 20:23:47 · Fetching recursive dependency: github.com/spf13/cobra
2017/03/05 20:23:50 ·· Fetching recursive dependency: github.com/spf13/viper
2017/03/05 20:23:52 ··· Fetching recursive dependency: github.com/fsnotify/fsnotify
2017/03/05 20:23:54 ···· Skipping (existing): golang.org/x/sys/unix
2017/03/05 20:23:54 ··· Fetching recursive dependency: github.com/mitchellh/mapstructure
2017/03/05 20:23:56 ··· Fetching recursive dependency: github.com/xordataexchange/crypt/config
2017/03/05 20:23:58 ···· Fetching recursive dependency: github.com/xordataexchange/crypt/backend
2017/03/05 20:23:59 ····· Fetching recursive dependency: github.com/armon/consul-api
2017/03/05 20:24:01 ····· Fetching recursive dependency: github.com/coreos/go-etcd/etcd
2017/03/05 20:24:05 ······ Fetching recursive dependency: github.com/ugorji/go/codec
2017/03/05 20:24:08 ···· Fetching recursive dependency: github.com/xordataexchange/crypt/encoding/secconf
2017/03/05 20:24:08 ····· Fetching recursive dependency: golang.org/x/crypto/openpgp
2017/03/05 20:24:10 ······ Fetching recursive dependency: golang.org/x/crypto/cast5
2017/03/05 20:24:10 ··· Fetching recursive dependency: github.com/spf13/jwalterweatherman
2017/03/05 20:24:12 ··· Fetching recursive dependency: github.com/spf13/afero
2017/03/05 20:24:14 ···· Fetching recursive dependency: github.com/pkg/sftp
2017/03/05 20:24:17 ····· Fetching recursive dependency: github.com/pkg/errors
2017/03/05 20:24:19 ····· Fetching recursive dependency: github.com/kr/fs
2017/03/05 20:24:21 ····· Fetching recursive dependency: golang.org/x/crypto/ssh
2017/03/05 20:24:21 ····· Deleting existing subpackage to prevent overlap: golang.org/x/crypto/ssh/terminal
2017/03/05 20:24:21 ······ Fetching recursive dependency: golang.org/x/crypto/ed25519
2017/03/05 20:24:21 ······ Skipping (existing): golang.org/x/sys/unix
2017/03/05 20:24:21 ······ Fetching recursive dependency: golang.org/x/crypto/curve25519
2017/03/05 20:24:21 ···· Fetching recursive dependency: golang.org/x/text/unicode/norm
2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/transform
2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/internal/triegen
2017/03/05 20:24:23 ····· Fetching recursive dependency: golang.org/x/text/internal/ucd
2017/03/05 20:24:23 ····· Skipping (existing): golang.org/x/text/internal/gen
2017/03/05 20:24:23 ··· Fetching recursive dependency: github.com/spf13/pflag
2017/03/05 20:24:26 ··· Fetching recursive dependency: github.com/pelletier/go-toml
2017/03/05 20:24:28 ···· Fetching recursive dependency: github.com/pelletier/go-buffruneio
2017/03/05 20:24:30 ··· Fetching recursive dependency: github.com/spf13/cast
2017/03/05 20:24:32 ··· Fetching recursive dependency: github.com/magiconair/properties
2017/03/05 20:24:34 ··· Fetching recursive dependency: github.com/hashicorp/hcl
2017/03/05 20:24:36 ··· Fetching recursive dependency: gopkg.in/yaml.v2
2017/03/05 20:24:39 ·· Fetching recursive dependency: github.com/inconshreveable/mousetrap
2017/03/05 20:24:41 ·· Fetching recursive dependency: github.com/cpuguy83/go-md2man/md2man
2017/03/05 20:24:43 ··· Fetching recursive dependency: github.com/cpuguy83/go-md2man/vendor/github.com/russross/blackfriday
2017/03/05 20:24:43 ···· Fetching recursive dependency: github.com/cpuguy83/go-md2man/vendor/github.com/shurcooL/sanitized_anchor_name
2017/03/05 20:24:43 · Fetching recursive dependency: github.com/neurosnap/sentences/english
