dakrone / clojure-opennlp

Natural Language Processing in Clojure (opennlp)

License: Eclipse Public License 1.0

Clojure 100.00%

clojure-opennlp's Introduction

Clojure library interface to OpenNLP - https://opennlp.apache.org/

A library to interface with the OpenNLP (Open Natural Language Processing) library of functions. Not all functions are implemented yet.

Additional information/documentation:

Read the source from Marginalia

http://dakrone.github.com/clojure-opennlp/

Known Issues

When using the treebank-chunker on a sentence, make sure the sentence ends with a period. Without one, the chunker gets confused and drops the last word. Besides, your sentences should all be grammatically correct anyway, right?
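One way to guard against the missing-period problem is to normalize sentences before they reach the chunker. The following is a minimal sketch in plain Clojure; ensure-terminated is a hypothetical helper name and is not part of clojure-opennlp. It only appends a period when the sentence does not already end in ".", "!" or "?".

```clojure
(require '[clojure.string :as str])

(defn ensure-terminated
  "Hypothetical helper: returns the sentence with trailing whitespace
  removed and a period appended if it has no terminal punctuation."
  [sentence]
  (let [s (str/trimr sentence)]
    (if (re-find #"[.!?]$" s)
      s
      (str s "."))))

(ensure-terminated "The brake pedal is pressed")
(ensure-terminated "The brake pedal is pressed. ")
```

The result can then be passed through tokenize, pos-tag and chunker as usual.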

Usage from Leiningen:

[clojure-opennlp "0.5.0"] ;; uses OpenNLP 1.9.0

clojure-opennlp works with Clojure 1.5+.

Basic Example usage (from a REPL):

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)
(use 'opennlp.treebank) ; treebank chunking, parsing and linking lives here

You will need to make the processing functions using the model files. These assume you're running from the root project directory. You can also download the model files from the opennlp project at http://opennlp.sourceforge.net/models-1.5

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def detokenize (make-detokenizer "models/english-detokenizer.xml"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def name-find (make-name-finder "models/namefind/en-ner-person.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

The tool-creators are multimethods, so you can also create any of the tools using a model instead of a filename (you can create a model with the training tools in src/opennlp/tools/train.clj):

(def tokenize (make-tokenizer my-tokenizer-model)) ;; etc, etc
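To illustrate how a tool-creator can accept either a filename or a model, here is a minimal sketch of a multimethod that dispatches on the class of its argument. Everything in it is a stand-in: make-splitter and the map-based "model" are hypothetical and do not reflect the library's actual implementation, which works with OpenNLP model classes.

```clojure
(require '[clojure.string :as str])

;; Dispatch on the class of the single argument.
(defmulti make-splitter class)

;; Given a String, treat it as a path. A real tool would load a model
;; file here; this stand-in just builds a map and delegates.
(defmethod make-splitter String
  [path]
  (make-splitter {:source path :delimiter #"\s+"}))

;; Given a "model" (a plain map in this sketch), return the tool function.
(defmethod make-splitter clojure.lang.IPersistentMap
  [model]
  (fn [text]
    (str/split text (:delimiter model))))

((make-splitter "models/stand-in.bin") "one two three")
((make-splitter {:delimiter #","}) "one,two,three")
```

Both calls produce a function with the same interface, which is why callers do not need to care how the tool was created.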

Then, use the functions you've created to perform operations on text:

Detecting sentences:

(pprint (get-sentences "First sentence. Second sentence? Here is another one. And so on and so forth - you get the idea..."))
["First sentence. ", "Second sentence? ", "Here is another one. ",
 "And so on and so forth - you get the idea..."]

Tokenizing:

(pprint (tokenize "Mr. Smith gave a car to his son on Friday"))
["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on",
 "Friday"]

Detokenizing:

(detokenize ["Mr.", "Smith", "gave", "a", "car", "to", "his", "son", "on", "Friday"])
"Mr. Smith gave a car to his son on Friday."

Ideally, s == (detokenize (tokenize s)). The detokenization model XML file is a work in progress; please let me know if you run into something that doesn't detokenize correctly in English.
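The round-trip property can be checked mechanically over a set of sentences. This is a sketch only: round-trip-failures is a hypothetical helper, and ws-tokenize and ws-detokenize are whitespace-based stand-ins so the example runs without model files. In practice you would pass the tokenize and detokenize functions created above.

```clojure
(require '[clojure.string :as str])

(defn round-trip-failures
  "Returns the sentences s for which (detokenize (tokenize s)) differs from s."
  [tokenize detokenize sentences]
  (remove #(= % (detokenize (tokenize %))) sentences))

;; Stand-ins for the model-backed functions (assumptions for illustration).
(defn ws-tokenize [s] (str/split s #"\s+"))
(defn ws-detokenize [tokens] (str/join " " tokens))

;; The second sentence contains a double space, which the stand-ins
;; cannot reproduce, so it is reported as a failure.
(round-trip-failures ws-tokenize ws-detokenize
                     ["Mr. Smith gave a car" "Two  spaces here"])
```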

Part-of-speech tagging:

(pprint (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday.")))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["gave" "VBD"]
 ["a" "DT"]
 ["car" "NN"]
 ["to" "TO"]
 ["his" "PRP$"]
 ["son" "NN"]
 ["on" "IN"]
 ["Friday." "NNP"])

Name finding:

(name-find (tokenize "My name is Lee, not John."))
("Lee" "John")

Treebank-chunking splits and tags phrases from a pos-tagged sentence. A notable difference is that it returns a list of structs with the :phrase and :tag keys, as seen below:

(pprint (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["is" "meant" "to" "deactivate"], :tag "VP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["when"], :tag "ADVP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"}
 {:phrase ["is" "pressed"], :tag "VP"})

For just the phrases:

(phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
(["The" "override" "system"] ["is" "meant" "to" "deactivate"] ["the" "accelerator"] ["when"] ["the" "brake" "pedal"] ["is" "pressed"])

And with just strings:

(phrase-strings (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed."))))
("The override system" "is meant to deactivate" "the accelerator" "when" "the brake pedal" "is pressed")
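Because chunks are plain maps with :phrase and :tag keys, the same kind of reshaping can be done with ordinary sequence functions. The sketch below works on literal data copied from the output above; chunk->string is a hypothetical name and is not claimed to be how phrase-strings is implemented.

```clojure
(require '[clojure.string :as str])

;; Literal chunk data in the shape shown above.
(def example-chunks
  [{:phrase ["The" "override" "system"] :tag "NP"}
   {:phrase ["is" "meant" "to" "deactivate"] :tag "VP"}
   {:phrase ["the" "accelerator"] :tag "NP"}])

(defn chunk->string
  "Joins the words of one chunk into a single space-separated string."
  [chunk]
  (str/join " " (:phrase chunk)))

(map :phrase example-chunks)        ; just the word vectors
(map chunk->string example-chunks)  ; just the strings
```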

Document Categorization:

See opennlp.test.tools.train for better usage examples.

(def doccat (make-document-categorizer "my-doccat-model"))

(doccat "This is some good text")
"Happy"

Probabilities of confidence

The probabilities OpenNLP supplies for a given operation are available as metadata on the result, where applicable:

(meta (get-sentences "This is a sentence. This is also one."))
{:probabilities (0.9999054310803004 0.9941126097177366)}

(meta (tokenize "This is a sentence."))
{:probabilities (1.0 1.0 1.0 0.9956236737394807 1.0)}

(meta (pos-tag ["This" "is" "a" "sentence" "."]))
{:probabilities (0.9649410482478001 0.9982592902509803 0.9967282012835504 0.9952498677248117 0.9862225658078769)}

(meta (chunker (pos-tag ["This" "is" "a" "sentence" "."])))
{:probabilities (0.9941248001899835 0.9878092935921453 0.9986106511439116 0.9972975733070356 0.9906377695586069)}

(meta (name-find ["My" "name" "is" "John"]))
{:probabilities (0.9996272005494383 0.999999997485361 0.9999948113868132 0.9982291838206192)}
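Since the probabilities are a sequence in the same order as the result, they can be paired up with the result elements. The following sketch uses with-meta to build a stand-in result so it runs without model files; with-probabilities is a hypothetical helper and the 0.999 threshold is an arbitrary value chosen for illustration.

```clojure
(defn with-probabilities
  "Pairs each element of a result with the probability stored at the
  same position in the result's :probabilities metadata."
  [result]
  (map vector result (:probabilities (meta result))))

;; Stand-in for (tokenize "This is a sentence."), using the numbers above.
(def example-tokens
  (with-meta ["This" "is" "a" "sentence" "."]
    {:probabilities '(1.0 1.0 1.0 0.9956236737394807 1.0)}))

(with-probabilities example-tokens)

;; Keep only the elements the model was less sure about.
(filter (fn [[_ p]] (< p 0.999)) (with-probabilities example-tokens))
```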

Beam Size

You can rebind opennlp.nlp/*beam-size* (the default is 3) for the pos-tagger and treebank-parser with:

(binding [*beam-size* 1]
  (def pos-tag (make-pos-tagger "models/en-pos-maxent.bin")))

Advance Percentage

You can rebind opennlp.treebank/*advance-percentage* (the default is 0.95) for the treebank-parser with:

(binding [*advance-percentage* 0.80]
  (def parser (make-treebank-parser "parser-model/en-parser-chunking.bin")))

Treebank-parsing

Note: Treebank parsing is very memory intensive. Make sure your JVM has a sufficient amount of memory available (using something like -Xmx512m), or you will run out of heap space when using a treebank parser.

Treebank parsing gets its own section due to how complex it is.

Note that the treebank-parser model is not included in the git repo; you will have to download it separately from the OpenNLP project.

Creating it:

(def treebank-parser (make-treebank-parser "parser-model/en-parser-chunking.bin"))

To use the treebank-parser, pass an array of sentences with their tokens separated by whitespace (preferably using tokenize):

(treebank-parser ["This is a sentence ."])
["(TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))"]

To transform the treebank-parser string into something a little easier to work with in Clojure, use the (make-tree ...) function:

(make-tree (first (treebank-parser ["This is a sentence ."])))
{:chunk {:chunk ({:chunk {:chunk "This", :tag DT}, :tag NP} {:chunk ({:chunk "is", :tag VBZ} {:chunk ({:chunk "a", :tag DT} {:chunk "sentence", :tag NN}), :tag NP}), :tag VP} {:chunk ".", :tag .}), :tag S}, :tag TOP}

Here's the data structure split into a slightly more readable format:

{:tag TOP
 :chunk {:tag S
         :chunk ({:tag NP
                  :chunk {:tag DT
                          :chunk "This"}}
                 {:tag VP
                  :chunk ({:tag VBZ
                           :chunk "is"}
                          {:tag NP
                           :chunk ({:tag DT
                                    :chunk "a"}
                                   {:tag NN
                                    :chunk "sentence"})})}
                 {:tag .
                  :chunk "."})}}

Hopefully that makes it a little bit clearer: it is a nested map. If anyone has suggestions for better ways to represent this information, feel free to send me an email or a patch.
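A nested map of this shape can be walked with a short recursive function. The sketch below collects the leaf words with their tags; leaves is a hypothetical helper, and the literal tree is a hand-written copy of the structure above with the final punctuation node left out for brevity. It assumes :chunk is either a string (a leaf), a single map, or a sequence of maps, as in the output shown.

```clojure
(defn leaves
  "Returns a sequence of [word tag] pairs for the leaf nodes of a tree."
  [node]
  (let [c (:chunk node)]
    (cond
      (string? c) [[c (:tag node)]]   ; leaf: the chunk is the word itself
      (map? c)    (leaves c)          ; single child
      :else       (mapcat leaves c)))) ; several children

(def example-tree
  {:tag 'TOP
   :chunk {:tag 'S
           :chunk (list {:tag 'NP
                         :chunk {:tag 'DT :chunk "This"}}
                        {:tag 'VP
                         :chunk (list {:tag 'VBZ :chunk "is"}
                                      {:tag 'NP
                                       :chunk (list {:tag 'DT :chunk "a"}
                                                    {:tag 'NN :chunk "sentence"})})})}})

(leaves example-tree)
```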

Treebank parsing is considered beta at this point.

Filters

Filtering pos-tagged sequences

(use 'opennlp.tools.filters)

(pprint (nouns (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["Mr." "NNP"]
 ["Smith" "NNP"]
 ["car" "NN"]
 ["son" "NN"]
 ["Friday" "NNP"])

(pprint (verbs (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["gave" "VBD"])

Filtering treebank-chunks

(use 'opennlp.tools.filters)

(pprint (noun-phrases (chunker (pos-tag (tokenize "The override system is meant to deactivate the accelerator when the brake pedal is pressed")))))
({:phrase ["The" "override" "system"], :tag "NP"}
 {:phrase ["the" "accelerator"], :tag "NP"}
 {:phrase ["the" "brake" "pedal"], :tag "NP"})

Creating your own filters:

(pos-filter determiners #"^DT")
#'user/determiners
(doc determiners)
-------------------------
user/determiners
([elements__52__auto__])
  Given a list of pos-tagged elements, return only the determiners in a list.

(pprint (determiners (pos-tag (tokenize "Mr. Smith gave a car to his son on Friday."))))
(["a" "DT"])
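For a one-off filter, the same effect can be had with a plain function instead of a macro-generated one, since pos-tagged output is a sequence of [word tag] pairs. This is a sketch on literal data; tagged-with is a hypothetical name and not part of opennlp.tools.filters.

```clojure
(defn tagged-with
  "Keeps the [word tag] pairs whose tag matches the regex."
  [re tagged-tokens]
  (filter (fn [[_ tag]] (re-find re tag)) tagged-tokens))

;; Literal data in the shape returned by pos-tag.
(def example-tagged
  [["Mr." "NNP"] ["Smith" "NNP"] ["gave" "VBD"] ["a" "DT"] ["car" "NN"]])

(tagged-with #"^DT" example-tagged)
(tagged-with #"^NN" example-tagged)
```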

You can also create treebank-chunk filters using (chunk-filter ...)

(chunk-filter fragments #"^FRAG$")

(doc fragments)
-------------------------
opennlp.nlp/fragments
([elements__178__auto__])
  Given a list of treebank-chunked elements, return only the fragments in a list.

Being Lazy

There are some methods to help you be lazy when tagging. Depending on the operation desired, use the corresponding method:

#'opennlp.tools.lazy/lazy-get-sentences
#'opennlp.tools.lazy/lazy-tokenize
#'opennlp.tools.lazy/lazy-tag
#'opennlp.tools.lazy/lazy-chunk
#'opennlp.tools.lazy/sentence-seq

Here's how to use them:

(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.lazy)

(def get-sentences (make-sentence-detector "models/en-sent.bin"))
(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(lazy-get-sentences ["This body of text has three sentences. This is the first. This is the third." "This body has only two. Here's the last one."] get-sentences)
;; will lazily return:
(["This body of text has three sentences. " "This is the first. " "This is the third."] ["This body has only two. " "Here's the last one."])

(lazy-tokenize ["This is a sentence." "This is another sentence." "This is the third."] tokenize)
;; will lazily return:
(["This" "is" "a" "sentence" "."] ["This" "is" "another" "sentence" "."] ["This" "is" "the" "third" "."])

(lazy-tag ["This is a sentence." "This is another sentence."] tokenize pos-tag)
;; will lazily return:
((["This" "DT"] ["is" "VBZ"] ["a" "DT"] ["sentence" "NN"] ["." "."]) (["This" "DT"] ["is" "VBZ"] ["another" "DT"] ["sentence" "NN"] ["." "."]))

(lazy-chunk ["This is a sentence." "This is another sentence."] tokenize pos-tag chunker)
;; will lazily return:
(({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["a" "sentence"], :tag "NP"}) ({:phrase ["This"], :tag "NP"} {:phrase ["is"], :tag "VP"} {:phrase ["another" "sentence"], :tag "NP"}))

Feel free to use the lazy functions, but I'm still not 100% set on the layout, so they may change in the future. (Maybe chaining them so instead of a sequence of sentences it looks like (lazy-chunk (lazy-tag (lazy-tokenize (lazy-get-sentences ...))))).

Generating a lazy sequence of sentences from a file using opennlp.tools.lazy/sentence-seq:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))

Training

There is code to allow for training models for each of the tools. Please see the documentation in TRAINING.markdown

License

Distributed under the Eclipse Public License, the same as Clojure uses. See the file COPYING.

Contributors

Rob Zinkov - zaxtax
Alexandre Patry - apatry

TODO

~~add method to generate lazy sequence of sentences from a file~~ (done!)
~~Detokenizer~~ (still more work to do, but it works for now)
Do something with parse-num for treebank parsing
~~Split up treebank stuff into its own namespace~~ (done!)
~~Treebank chunker~~ (done!)
~~Treebank parser~~ (done!)
~~Laziness~~ (done! for now.)
Treebank linker (WIP)
~~Phrase helpers for chunker~~ (done!)
~~Figure out what license to use.~~ (done!)
Filters for treebank-parser
Return multiple probability results for treebank-parser
~~Explore including probability numbers~~ (probability numbers added as metadata)
~~Model training/trainer~~ (done!)
Revisit datastructure format for tagged sentences
~~Document beam-size functionality~~
~~Document advance-percentage functionality~~
Build a full test suite:
-- ~~core tools~~ (done)
-- ~~filters~~ (done)
-- ~~laziness~~ (done)
-- training (pretty much done except for tagging)


clojure-opennlp's Issues

could you include the models for dates, organizations, money, location, and time?

These seem easy to bring in and similar to the name recognizer, but would be super useful to people in industry trying to use some basic nlp.

How to deal with indeterminacy?

Evaluating (treebank-parser ["What can happen in a second ."]) using the set-up in the README file here, I get the following parse:

(TOP
 (SBARQ
  (WHNP (WP What))
  (SQ
   (VP (MD can) (VP (VB happen) (PP (IN in) (NP (DT a) (JJ second))))))
  (. .)))

Actually I'm pretty sure the JJ should be an NN. Is this alternative known to the OpenNLP engine at some point in its parse, and if so, can I get it to report on the known alternative(s)?

NullPointerException when chunk-filter encounters a phrase with {:tag nil}

The chunker occasionally outputs a chunk with a nil tag in cases where the chunk isn't part of a detected phrase, such as a sentence that starts with a coordinating conjunction like "And".

(use 'clojure.pprint)
(use 'opennlp.nlp)
(use 'opennlp.treebank)
(use 'opennlp.tools.filters)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))
(def chunker (make-treebank-chunker "models/en-chunker.bin"))

(pprint
  (noun-phrases
    (chunker
      (pos-tag
        (tokenize "And when the party entered the assembly room, it consisted of only five altogether; Mr. Bingley, his two sisters, the husband of the eldest, and another young man.")))))

Results in:

NullPointerException   java.util.regex.Matcher.getTextLength (:-1)

Because the first phrase has the nil tag:

(pprint (noun-phrases
          '({:phrase ["And"], :tag nil})))

For reference, bin/opennlp ChunkerME en-chunker.bin handles the same text this way, not putting the coordinating conjunction in a phrase at all:

And_CC when_WRB the_DT party_NN entered_VBD the_DT assembly_NN room,_NN it_PRP consisted_VBD of_IN five_CD altogether._.
=>
 And_CC [ADVP when_WRB ] [NP the_DT party_NN ] [VP entered_VBD ] [NP the_DT assembly_NN room,_NN ] [NP it_PRP ] [VP consisted_VBD ] [PP of_IN ] [NP five_CD ] altogether._.

The nil tag is probably a good way to represent this, except for the fact that re-find throws an exception when passed a nil string.

I've fixed this in my fork by removing nil phrases before filtering, but this has the side-effect of making it impossible to filter to select the nil phrases themselves. This may be an acceptable trade-off. I'm not sure.

(defmacro fixed-chunk-filter
  "Declare a filter for treebank-chunked lists with the given name and regex."
  [n r]
  (let [docstring (str "Given a list of treebank-chunked elements, "
                       "return only the " n " in a list.")]
    `(defn ~n
       ~docstring
       [elements#]
       (filter (fn [t#] (re-find ~r (:tag t#))) 
               (remove #(nil? (:tag %)) elements#)))))
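An alternative that avoids the trade-off is to skip nil tags inside the predicate and provide a separate selector for untagged chunks. The sketch below uses plain functions rather than the macro; chunks-tagged and untagged-chunks are hypothetical names, not part of the library or the fork mentioned above.

```clojure
(defn chunks-tagged
  "Keeps chunks whose :tag matches the regex. Chunks with a nil tag
  never match, so re-find is never called with nil."
  [re chunks]
  (filter (fn [{:keys [tag]}] (and tag (re-find re tag))) chunks))

(defn untagged-chunks
  "Keeps only the chunks whose :tag is nil."
  [chunks]
  (filter (comp nil? :tag) chunks))

(def example-chunks
  [{:phrase ["And"] :tag nil}
   {:phrase ["the" "party"] :tag "NP"}
   {:phrase ["entered"] :tag "VP"}])

(chunks-tagged #"^NP$" example-chunks)
(untagged-chunks example-chunks)
```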

CompilerException clojure.lang.ArityException

I am trying to use the library but am getting an error. I am working on OS X Yosemite version 10.10.1 and installed OpenNLP using brew install apache-opennlp.

(defproject firstattempt "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.7.0"]
                 [clojure-opennlp "0.3.3"]])

user=> (use 'clojure.pprint)
nil
user=> (use 'opennlp.nlp)
nil
user=> (use 'opennlp.treebank)

CompilerException clojure.lang.ArityException: Wrong number of args (2) passed to: StringReader, compiling:(abnf.clj:189:28)

treebank make-tree uses clojure reader, chokes on some tokens from natural language

Not a show stopper.

The parsing code here gets treebank strings from OpenNLP. The treebank strings
are very nearly s-expressions and are parsed as such. They are only "very nearly"
s-expressions, not perfectly so because of tokens that are not parsed by clojure.
The code here uses the Clojure reader, so it crashes when it sees a token it doesn't like.
The general idea of going from treebank strings into trees of clojure objects is
still worth pursuing. However, doing it perfectly will require either some pre-processing
or a modified reader.

Not everything that isn't a sequence is a symbol. Not all scalars are symbols. Numbers for example, are happily read by the reader, but are not symbols. That's OK. Some tokens from natural text are not lexed by the reader into clojure. Time values for example, like "2:30". They would appear in the natural language input without quotes. Clojure tries to make things that start with numerals into some kind of number, and the colon throws it off. Since the OpenNLP tokenizer doesn't split 2:30 into 2 : 30, but leaves it, Clojure throws.

My boss at work is a classic AI LISP hack and recommends not using the reader for things that are not Lisp s-expressions. He mentioned Lisp code he has that basically does the same thing, but can be modified to deal with this case. We work with his academic nephew, who is more familiar with the Clojure dialect. He suspects the use of Lisp features not available in Clojure. I'll check it out. Hopefully we can get the two of them in on github community fun.
(BTW the features involve macro-related changes to the reader (table?))

It's a plug-in fix. A modified reader would work with the same interface as read-string,
and quote odd stuff like 2:30 that the clojure reader doesn't like...making a string of them.
Using the existing reader for now is fine.
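As a rough illustration of the "don't use the reader" approach, here is a small hand-written parser that treats every token as a plain string, so tokens like 2:30 cannot trip it up. This is a sketch only: treebank->tree and its helpers are hypothetical names, and the output shape (string tags, children always in a vector) is an assumption that differs from what make-tree returns.

```clojure
(defn treebank-tokens
  "Splits a treebank string into parentheses and bare tokens."
  [s]
  (re-seq #"\(|\)|[^\s()]+" s))

(defn parse-node
  "Parses one node from the token sequence.
  Returns [node remaining-tokens]."
  [tokens]
  (if (= "(" (first tokens))
    (let [tag (second tokens)]
      (loop [children [] ts (drop 2 tokens)]
        (cond
          (empty? ts)
          (throw (ex-info "Unbalanced parentheses" {:tag tag}))

          (= ")" (first ts))
          [{:tag tag :chunk children} (rest ts)]

          :else
          (let [[child more] (parse-node ts)]
            (recur (conj children child) more)))))
    ;; A bare token is a leaf and is kept as a string.
    [(first tokens) (rest tokens)]))

(defn treebank->tree
  "Parses a treebank string into nested maps without the Clojure reader."
  [s]
  (first (parse-node (treebank-tokens s))))

(treebank->tree "(NP (DT the) (NN meeting) (PP (IN at) (NP (CD 2:30))))")
```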

Upgrade to OpenNLP 1.5.2

Currently there are issues with the trainer with 1.5.2.

opennlp library

the java opennlp lib is missing from the dependencies in project.clj

Tokenizing not happening perfectly

My code is taken from README :

(use 'clojure.pprint) ; just for this documentation
(use 'opennlp.nlp)

(def tokenize (make-tokenizer "models/en-token.bin"))
(def pos-tag (make-pos-tagger "models/en-pos-maxent.bin"))

(pprint (pos-tag (tokenize "john macharty quits")))
;// verb is taken as noun.
(["john" "NN"] ["macharty" "NN"] ["quits" "NNS"])

;// here verb is taken as noun.
(pprint (pos-tag (tokenize "bl joshi quits")))
(["bl" "JJ"] ["joshi" "NNP"] ["quits" "NNS"])

The verb "quits" is predicted as a noun. Please see the comments in the code.

Am I doing something wrong?
I see that we use the latest version of opennlp.

Do we have any online testing resource for opennlp, like Stanford's http://nlp.stanford.edu:8080/parser/index.jsp, to compare them?

NoClassDefFoundError for instaparse when creating uberjar

Hey,

There seems to be an error when you try to run the uberjar:

Exception in thread "main" java.lang.NoClassDefFoundError: instaparse/print$parser__GT_str (wrong name: instaparse/print$Parser__GT_str)

The issue is due to the outdated dependency on instaparse. Updating the dependency should solve the issue.

Upgrading to OpenNLP 1.6

Are there any plans to upgrade to the latest stable version?

Proposal for treebank-parser tree structure

Hey @dakrone,

I am particularly interested by the treebank-parser.

One cool representation would be a one-to-one translation from the string representation of the tree into a Clojure list, with the first element being the tag and the rest being the chunk!
This would be visually more understandable, and would stick with Lisp's common representation of data in general!
This could be done using some reader tricks:

(load-string (str "(quote "
                  (first (treebank-parser ["This is a sentence ."]))
                  ")"))
;;=> (TOP (S (NP (DT This)) (VP (VBZ is) (NP (DT a) (NN sentence))) (. .)))

But it would be better to have it generated when the parse is being done...
Whadda ya think ?

IOException Mark invalid java.io.BufferedReader.reset (BufferedReader.java:505)

I'm getting this error when I try to use the following sample code from the readme:

(with-open [rdr (clojure.java.io/reader "/tmp/bigfile")]
  (let [sentences (sentence-seq rdr get-sentences)]
    ;; process your lazy seq of sentences however you desire
    (println "first 5 sentences:")
    (clojure.pprint/pprint (take 5 sentences))))

My file exists and I'm able to do (slurp "/tmp/bigfile")
I'm new to Clojure so I'm sorry if it's a basic java interop issue. Nevertheless I successfully imported the get-sentences and sentence-seq functions and have otherwise been able to use the library without problems.

build-posdictionary is broken?

Hi,

I tried to train a new language model and found out that "build-posdictionary" is not working.

Here are the snippets of code that I'm using:

(def tagdict (build-posdictionary "jv-tagdict"))
(def pos-model (train-pos-tagger "jv" "workdir/jv-pos.train" tagdict))

I'm using opennlp-tools "1.5.3", clojure "1.5.1" and clojure-opennlp "0.3.1-SNAPSHOT". Here's the error message:

Exception in thread "main" java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream, compiling:(jv-pos-learn.clj:23:14)
...
Caused by: java.lang.ClassCastException: java.io.BufferedReader cannot be cast to java.io.InputStream
    at opennlp.tools.train$build_posdictionary.invoke(train.clj:49)
...

Does anyone have any ideas?

Thanks.
Jim
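The stack trace says that a BufferedReader was handed to code that expects an InputStream. These are unrelated types in java.io, so the cast cannot succeed. The sketch below only demonstrates that distinction using in-memory data; it does not show the library's code, and which of the two build-posdictionary should be given is not stated in the report.

```clojure
(import '(java.io BufferedReader StringReader ByteArrayInputStream InputStream))

;; A character-oriented reader over some text (stand-in data).
(def as-reader (BufferedReader. (StringReader. "word_TAG")))

;; A byte-oriented stream over the same text.
(def as-stream (ByteArrayInputStream. (.getBytes "word_TAG" "UTF-8")))

;; A Reader is never an InputStream, which is why the cast in the
;; stack trace above fails.
(instance? InputStream as-reader)
(instance? InputStream as-stream)
```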

java.io.FileNotFoundException: Could not locate opennlp/nlp__init.class or opennlp/nlp.clj on classpath

I keep getting this issue. It seems like it might be because the opennlp.jar file doesn't exist. This blog http://writequit.org/blog/?p=365 says it can be found here:
http://github.com/dakrone/clojure-opennlp/tree/master/lib/
but the directory doesn't seem to exist...

Anyone have any ideas?

The chunker needs punctuation to work properly

Using the definitions of tokenize, pos-tag, and chunker from the readme, and 1.5.1 versions of the model files, the following behaviour is observed:

 (-> "I am looking for a good way to annotate this English text."
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English" "text"])

;; cf. the same operation, when the text is not full-stop terminated:
 (-> "I am looking for a good way to annotate this English text"
    tokenize pos-tag chunker phrases)
;; => (["I"] ["am" "looking"] ["for"] ["a" "good" "way"] ["to" "annotate"] ["this" "English"])

The pos-tag output seems correct however.

Custom Feature generation impossible via 'make-name-finder'

Hi there,

It seems that 'make-name-finder' does not take into account the several constructors in NameFinderME.java. More specifically, there is no way to use the constructor that accepts a custom feature generator. I propose this, which is not a breaking change:

(defmethod make-name-finder TokenNameFinderModel
  [model & {:keys [feature-generator]}] ;;optional arg - defaults to nil
  (fn name-finder
    [tokens & contexts]
    {:pre [(seq tokens)
           (every? #(= (class %) String) tokens)]}
    (let [finder (NameFinderME. model feature-generator *beam-size*) ;can be nil - no problem
          matches (.find finder (into-array String tokens))
          probs (seq (.probs finder))]
      (with-meta
        (distinct (Span/spansToStrings matches (into-array String tokens)))
        {:probabilities probs}))))

Jim

bare clojure.java.io/readers, writers, input-streams etc etc all over tools/train.clj

All clojure.java.io readers, writers, input-streams, output-streams etc. inside train.clj have not been wrapped with the 'with-open' macro. This strikes me as very weird because they are handled correctly in all other namespaces but not in train.clj, which uses them most! Unless I'm missing something important, this should be fixed asap. It took 3 minutes to fix it in my fork.
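For reference, this is the general pattern being asked for: with-open closes the resource when the body exits, even if an exception is thrown. The sketch uses an in-memory StringReader so it runs anywhere; read-lines is a hypothetical helper and is not code from train.clj. Note the doall, which realizes the lazy line-seq before the reader is closed.

```clojure
(import '(java.io BufferedReader StringReader))

(defn read-lines
  "Reads all lines from the text, closing the reader afterwards."
  [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    ;; doall forces the lazy sequence while the reader is still open.
    (doall (line-seq rdr))))

(read-lines "first line\nsecond line")
```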
