knowitall / nlptools

A toolkit that wraps various natural language processing implementations behind a common interface.

Scala 99.75% Shell 0.25%

nlptools's Issues

Weka confidence function and training tool

We should integrate this code into nlptools. This is of the utmost importance and should be done immediately or there will be dire consequences.

NullPointerException

I got a NullPointerException. How can I solve it?

public class OpenIETest {
    public static void main(String[] args) {
        // Build the extractor from the Clear tokenizer, POS tagger, parser, and SRL components.
        OpenIE openie = new OpenIE(new ClearParser(new ClearPostagger(new ClearTokenizer())), new ClearSrl(), false, false);
        System.out.println(openie.extract("The whales will not eat the otters"));
    }
}

Factorie Support

It would be nice to add the NLP tools from Factorie:
http://factorie.cs.umass.edu/index.html

Intern postags and chunks

Interning postags and chunks would save memory and increase speed. They could be Symbols, but that would be too much work at this point.
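
As a rough illustration of the interning idea (not nlptools code; `TagInterner` is a hypothetical helper), equal tag strings can be mapped to one shared instance with `String.intern`, so repeated tags such as "NN" or "B-NP" share storage and can be compared by reference:

```scala
// Hypothetical helper, not part of nlptools: shares one String instance per distinct tag.
object TagInterner {
  // String.intern returns the canonical instance from the JVM string pool.
  def intern(tag: String): String = tag.intern()
}

object InternDemo {
  def main(args: Array[String]): Unit = {
    // Build the tags at runtime so they are distinct objects before interning.
    val a = new String("NN")
    val b = new String("NN")
    println(a eq b) // reference comparison before interning
    println(TagInterner.intern(a) eq TagInterner.intern(b)) // reference comparison after interning
  }
}
```

Symbols would give the same sharing with a distinct type, at the cost of changing every signature that currently takes a String.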

Extend Stemmer interface to take PostaggedTokens in addition to just Strings

In some cases (particularly Morpha), adding a POS tag can improve stemming accuracy. For example, "Reye's Syndrome" is incorrectly stemmed (after tokenization) to "reye ' syndrome" unless postags are included. The original TextRunner demo passed postags to Morpha roughly like so:

import java.io.StringReader

// Join the word and its POS tag in the "word_tag" form passed to Morpha.
val wordtag = word + "_" + tag
val morpha = new Morpha(new StringReader(wordtag))
morpha.yybegin(Morpha.scan)
morpha.next()
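
One way the interface extension could look, as a sketch only: the `Stemmer` and `PostaggedToken` shapes below are simplified assumptions rather than the actual nlptools types, and `ToyStemmer` is a made-up stand-in for a tag-aware stemmer such as Morpha. The tag-aware method defaults to the string-only one, so existing stemmers would not need to change.

```scala
// Hypothetical, simplified shapes for illustration; the real nlptools types may differ.
case class PostaggedToken(string: String, postag: String)

trait Stemmer {
  def stem(word: String): String

  // Default ignores the tag, so existing stemmers keep working unchanged.
  def stemToken(token: PostaggedToken): String = stem(token.string)
}

// Toy stemmer that only uses the tag to decide whether to strip a plural "s".
object ToyStemmer extends Stemmer {
  def stem(word: String): String = word.toLowerCase

  override def stemToken(token: PostaggedToken): String = {
    val lower = token.string.toLowerCase
    if (token.postag == "NNS" && lower.endsWith("s")) lower.dropRight(1) else lower
  }
}
```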

LogisticRegression should throw immediate error when there's missing feature

Otherwise, when you run it, you get this:

java.util.NoSuchElementException: key not found: which|who|that before rel
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at edu.knowitall.tool.conf.impl.LogisticRegression$$anonfun$1.apply(LogisticRegression.scala:41)
    at edu.knowitall.tool.conf.impl.LogisticRegression$$anonfun$1.apply(LogisticRegression.scala:40)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
    at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
    at scala.collection.AbstractIterator.sum(Iterator.scala:1157)
    at edu.knowitall.tool.conf.impl.LogisticRegression.getConf(LogisticRegression.scala:44)
    at edu.knowitall.tool.conf.impl.LogisticRegression.apply(LogisticRegression.scala:32)
    at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2$$anonfun$apply$4.apply(Relnoun.scala:664)
    at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2$$anonfun$apply$4.apply(Relnoun.scala:663)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2.apply(Relnoun.scala:663)
    at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2.apply(Relnoun.scala:659)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at edu.knowitall.chunkedextractor.Relnoun$.main(Relnoun.scala:659)
    at edu.knowitall.chunkedextractor.Relnoun.main(Relnoun.scala)
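
A minimal sketch of the fail-fast behavior being asked for, assuming the classifier holds a map of named feature functions and a map of named weights. `CheckedLogisticRegression` and its constructor are hypothetical and are not the actual `edu.knowitall.tool.conf.impl.LogisticRegression` API. The check runs once at construction, so a missing feature is reported before any example is scored:

```scala
// Hypothetical simplification, not the actual nlptools LogisticRegression.
class CheckedLogisticRegression[T](
    features: Map[String, T => Double],
    weights: Map[String, Double],
    intercept: Double) {

  // Fail at construction time, naming every weight that has no feature definition.
  private val missing = weights.keySet -- features.keySet
  require(missing.isEmpty,
    "No feature defined for weights: " + missing.toSeq.sorted.mkString(", "))

  // Standard logistic function applied to the weighted feature sum.
  def getConf(example: T): Double = {
    val z = intercept + weights.iterator.map { case (name, w) => w * features(name)(example) }.sum
    1.0 / (1.0 + math.exp(-z))
  }
}
```

With this check, a weight such as "which|who|that before rel" without a matching feature raises an IllegalArgumentException when the model is built instead of a NoSuchElementException deep inside scoring.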

Token/PostaggedToken/ChunkedToken has poor serialization support

There's half-completed code to serialize these as a tab-separated list of whitespace-separated token aspects. We also want a serialization that keeps all of a token's aspects together.

I@0/PRP/B-NP rode@5/VB/B-VP

vs.

I@0 rode@5  \t  PRP VB  \t  B-NP B-VP

In the tab format, should the offsets be separated from the tokens?

I rode  \t  0 5  \t  PRP VB  \t  B-NP B-VP
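
The two candidate formats can be sketched as follows. This is an illustration under assumptions: `ChunkedToken` here is a simplified stand-in for the real class, `TokenSerialization` is a hypothetical helper, and the tab format uses a bare tab between columns with offsets kept attached to the tokens.

```scala
// Hypothetical token shape for illustration; the real nlptools classes may differ.
case class ChunkedToken(string: String, offset: Int, postag: String, chunk: String)

object TokenSerialization {
  // Inline format: all aspects of each token stay together, e.g. "I@0/PRP/B-NP".
  def toInline(tokens: Seq[ChunkedToken]): String =
    tokens.map(t => s"${t.string}@${t.offset}/${t.postag}/${t.chunk}").mkString(" ")

  // Tab format: one whitespace-separated column per aspect, columns joined by a tab.
  def toTabbed(tokens: Seq[ChunkedToken]): String =
    Seq(
      tokens.map(t => s"${t.string}@${t.offset}"),
      tokens.map(_.postag),
      tokens.map(_.chunk)
    ).map(_.mkString(" ")).mkString("\t")

  // The token text is matched greedily so it may itself contain '@' or '/'.
  private val Inline = """(.*)@(\d+)/([^/]+)/([^/]+)""".r

  // Parses the inline format back; throws a MatchError on malformed input.
  def fromInline(line: String): Seq[ChunkedToken] =
    line.split(" ").toList.map {
      case Inline(s, o, p, c) => ChunkedToken(s, o.toInt, p, c)
    }
}
```

Both formats rely on whitespace as a separator, so either one only round-trips if tokens never contain whitespace.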

Grammatical simplification of sentences

Such a model could be used to increase parsing precision and performance.

srl deserialization fails

Running RemoteSRL on this sentence:

"Thus a natural hazard will not result in a natural disaster in areas without vulnerability, e. g. strong earthquakes in uninhabited areas."

fails with this exception:

scala.MatchError: Could not deserialize relation: g._17.01 (of class java.lang.String)
        at edu.knowitall.tool.srl.Relation$.deserialize(Frame.scala:51)
        at edu.knowitall.tool.srl.Frame$.deserialize(Frame.scala:19)
        at Test$RemoteSrl$$anonfun$apply$1.apply(Test.scala:14)
        at Test$RemoteSrl$$anonfun$apply$1.apply(Test.scala:14)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
...

Curling the parse for that sentence returns this:

result_6.01:[AM-DIS=Thus_0, A1=hazard_3, AM-MOD=will_4, AM-NEG=not_5, A2=in_7]
g._17.01:[A0=e._16, A1=earthquakes_19, AM-LOC=in_20]

I'm not sure what those things mean, but it looks like the second line contains an extra `.` that the deserialization matcher is not expecting.
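
A sketch of a more tolerant matcher, assuming relation strings have the form name_index.sense as in the curl output above. The actual pattern in Frame.scala is not shown in this issue, so `RelationParser` and its regular expression are hypothetical. Matching the name greedily lets it contain a `.` (as in "g.") while the index and sense are still taken from the end of the string:

```scala
object RelationParser {
  // Hypothetical pattern: greedy name (may contain '.' or '_'), token index, then sense.
  private val RelationPattern = """(.+)_(\d+)\.(\d+)""".r

  // Returns (name, token index, sense), or None if the string does not fit the format.
  def parse(s: String): Option[(String, Int, String)] = s match {
    case RelationPattern(name, index, sense) => Some((name, index.toInt, sense))
    case _ => None
  }
}
```

Returning an Option instead of throwing a MatchError also lets the caller decide how to handle an unparseable relation.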

Parsers should have method `parsePostagged` and `parseTokenized`

This is presently a problem because it's difficult to use a thread-safe instance of ClearParser, for example with OpenNlpTokenizer.

Rob, would you be interested in looking into this after the present project is over? I'd like for you and John each to have a small project that digs into nlptools a little deeper. I realize you need to move on to work with Tony, but I think this would only take a small amount of your time.

`Token` should not be able to have whitespace

Unfortunately, BreezeSentencer uses `Tokenizer.computeOffsets` to compute offsets from the resulting sentences, so simply adding `require(string.forall(!_.isWhitespace))` breaks BreezeSentencer.
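
For reference, the proposed check would look roughly like this on a simplified token class. `Token` below is a stand-in with assumed fields, not the real nlptools class:

```scala
// Simplified stand-in for illustration; the real Token has a different definition.
case class Token(string: String, offset: Int) {
  // Reject any token text that contains a whitespace character.
  require(string.forall(!_.isWhitespace),
    "Token must not contain whitespace: '" + string + "'")
}
```

Any caller that builds a Token from a multi-word string, as the sentence-offset computation described above does, would then fail with an IllegalArgumentException.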
