knowitall / nlptools
A toolkit that wraps various natural language processing implementations behind a common interface.
We should integrate this code into nlptools. This is of the utmost importance and should be done immediately or there will be dire consequences.
I got a NullPointerException. How can I solve it?
public class OpenIETest {
    public static void main(String[] args) {
        OpenIE openie = new OpenIE(new ClearParser(new ClearPostagger(new ClearTokenizer())), new ClearSrl(), false, false);
        System.out.println(openie.extract("The whales will not eat the otters"));
    }
}
It would be nice to add the NLP tools from Factorie:
http://factorie.cs.umass.edu/index.html
Save memory and increase speed. They could be Symbols, but that would be too much work for now.
In some cases (particularly Morpha), adding a POS tag can improve stemming accuracy. For example, "Reye's Syndrome" is incorrectly stemmed (after tokenization) to "reye ' syndrome" unless postags are included. The original TextRunner demo passed postags to Morpha roughly like so:
val wordtag = word + "_" + tag
val morpha = new Morpha(new StringReader(wordtag))
morpha.yybegin(Morpha.scan)
return _lexer.next()
Otherwise, when you run it, you get this:
java.util.NoSuchElementException: key not found: which|who|that before rel
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at edu.knowitall.tool.conf.impl.LogisticRegression$$anonfun$1.apply(LogisticRegression.scala:41)
at edu.knowitall.tool.conf.impl.LogisticRegression$$anonfun$1.apply(LogisticRegression.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.sum(TraversableOnce.scala:203)
at scala.collection.AbstractIterator.sum(Iterator.scala:1157)
at edu.knowitall.tool.conf.impl.LogisticRegression.getConf(LogisticRegression.scala:44)
at edu.knowitall.tool.conf.impl.LogisticRegression.apply(LogisticRegression.scala:32)
at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2$$anonfun$apply$4.apply(Relnoun.scala:664)
at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2$$anonfun$apply$4.apply(Relnoun.scala:663)
at scala.collection.immutable.List.foreach(List.scala:318)
at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2.apply(Relnoun.scala:663)
at edu.knowitall.chunkedextractor.Relnoun$$anonfun$main$2.apply(Relnoun.scala:659)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at edu.knowitall.chunkedextractor.Relnoun$.main(Relnoun.scala:659)
at edu.knowitall.chunkedextractor.Relnoun.main(Relnoun.scala)
There's half-completed code to serialize these as a tab-separated list of whitespace-separated token aspects. We also want some serialization that keeps all the aspects of a token together.
I@0/PRP/B-NP rode@5/VB/B-VP
vs.
I@0 rode@5 \t PRP VB \t B-NP B-VP
In the tab format, should the offsets be separated from the tokens?
I rode \t 0 5 \t PRP VB \t B-NP B-VP
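The two layouts above can be compared with a minimal Java sketch. The Token class and the method names together and columns are hypothetical and are not part of nlptools. The sketch also assumes a bare tab character between columns, without the surrounding spaces shown above for readability.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class TokenSerialization {
    // Hypothetical token holder: text, character offset, POS tag, chunk tag.
    static final class Token {
        final String text;
        final int offset;
        final String postag;
        final String chunk;

        Token(String text, int offset, String postag, String chunk) {
            this.text = text;
            this.offset = offset;
            this.postag = postag;
            this.chunk = chunk;
        }
    }

    // Keeps all aspects of a token together, e.g. "I@0/PRP/B-NP".
    static String together(List<Token> tokens) {
        StringJoiner out = new StringJoiner(" ");
        for (Token t : tokens) {
            out.add(t.text + "@" + t.offset + "/" + t.postag + "/" + t.chunk);
        }
        return out.toString();
    }

    // One tab-separated column per aspect; entries within a column are space-separated.
    static String columns(List<Token> tokens) {
        StringJoiner words = new StringJoiner(" ");
        StringJoiner tags = new StringJoiner(" ");
        StringJoiner chunks = new StringJoiner(" ");
        for (Token t : tokens) {
            words.add(t.text + "@" + t.offset);
            tags.add(t.postag);
            chunks.add(t.chunk);
        }
        return words + "\t" + tags + "\t" + chunks;
    }

    public static void main(String[] args) {
        // Values copied from the example above.
        List<Token> tokens = Arrays.asList(
            new Token("I", 0, "PRP", "B-NP"),
            new Token("rode", 5, "VB", "B-VP"));
        System.out.println(together(tokens));
        System.out.println(columns(tokens));
    }
}
```

Separating the offsets into their own column would only need one more StringJoiner in columns, so the choice mostly affects how readers split the first column.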
Such a model could be used to increase parsing precision and performance.
Running RemoteSRL on this sentence:
"Thus a natural hazard will not result in a natural disaster in areas without vulnerability, e. g. strong earthquakes in uninhabited areas."
fails with this exception:
scala.MatchError: Could not deserialize relation: g._17.01 (of class java.lang.String)
at edu.knowitall.tool.srl.Relation$.deserialize(Frame.scala:51)
at edu.knowitall.tool.srl.Frame$.deserialize(Frame.scala:19)
at Test$RemoteSrl$$anonfun$apply$1.apply(Test.scala:14)
at Test$RemoteSrl$$anonfun$apply$1.apply(Test.scala:14)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
...
Requesting the parse for that sentence with curl returns this:
result_6.01:[AM-DIS=Thus_0, A1=hazard_3, AM-MOD=will_4, AM-NEG=not_5, A2=in_7]
g._17.01:[A0=e._16, A1=earthquakes_19, AM-LOC=in_20]
I'm not sure what those things mean, but it looks like the second line has one more "." than the deserialization matcher expects.
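One possible way to tolerate a "." inside the relation name is sketched below in Java. It assumes the serialized form is a name, an underscore, a numeric token index, a dot, and a numeric sense, which is a guess based on the two lines above. The class RelationName and its parse method are hypothetical and are not the matcher in Frame.scala.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationName {
    // The greedy (.+) lets the name itself contain "." or "_"; the index and
    // sense are taken from the final "_<digits>.<digits>" suffix.
    private static final Pattern RELATION = Pattern.compile("(.+)_(\\d+)\\.(\\d+)");

    // Returns {name, index, sense}; throws if the string does not fit the assumed shape.
    static String[] parse(String s) {
        Matcher m = RELATION.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Could not deserialize relation: " + s);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Both relation strings are copied from the output above.
        for (String s : new String[] { "result_6.01", "g._17.01" }) {
            String[] parts = parse(s);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

With this pattern, "g._17.01" splits into the name "g.", index 17, and sense 01. Whether "g." is a sensible predicate is a separate question, since it comes from the tokenization of "e. g." in the sentence.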
This is presently a problem because it's difficult to use a thread-safe instance of ClearParser. With OpenNlpTokenizer for example.
Rob, would you be interested in looking into this after the present project is over? I'd like for you and John each to have a small project that digs into nlptools a little deeper. I realize you need to move on to work with Tony, but I think this would only take a small amount of your time.
Unfortunately, BreezeSentencer uses Tokenizer.computeOffsets to compute offsets from the resulting sentences, so simply adding require(string.forall(!_.isWhitespace)) breaks BreezeSentencer.