digitalheir / java-probabilistic-earley-parser

🎲 Efficient Java implementation of the probabilistic Earley algorithm to parse Stochastic Context Free Grammars (SCFGs)

License: MIT License

Java 100.00%

probabilistic-earley-parser earley-algorithm computational-linguistics parser parsing java cfg grammar ambiguous-sentences context-free

java-probabilistic-earley-parser's Issues

implement inside-outside algorithm for estimating rule probabilities

Given a grammar and a set of parsed sentences, use the inside-outside algorithm to estimate the most likely probability distribution over the grammar rules.

Do not allow malformed grammars

Ensure that the probabilities in a SCFG are proper and consistent as defined in Booth and Thompson (1973), and that the grammar contains no useless nonterminals (ones that can never appear in a derivation).

Also check that no rule appears twice with different probabilities (in which case we either have undefined behaviour or conflate the rules?).
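
A minimal sketch of two of these checks, assuming a hypothetical `Rule` class rather than the library's own types: it verifies that the probabilities of all rules sharing a left-hand side sum to 1 (properness) and flags rules that occur twice with different probabilities. Consistency in the sense of Booth and Thompson and the detection of useless nonterminals are not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GrammarSanityCheck {
    /** Hypothetical rule representation (not the library's class): lhs -> rhs with probability p. */
    static final class Rule {
        final String lhs;
        final List<String> rhs;
        final double p;

        Rule(String lhs, double p, String... rhs) {
            this.lhs = lhs;
            this.p = p;
            this.rhs = Arrays.asList(rhs);
        }
    }

    /** Returns a list of human-readable problems; an empty list means both checks passed. */
    static List<String> validate(List<Rule> rules) {
        List<String> problems = new ArrayList<>();
        Map<String, Double> massPerLhs = new LinkedHashMap<>();
        Map<String, Double> seen = new HashMap<>();

        for (Rule r : rules) {
            String key = r.lhs + " -> " + String.join(" ", r.rhs);
            if (!(r.p > 0.0 && r.p <= 1.0)) {
                problems.add("probability out of range (0, 1]: " + key + " (" + r.p + ")");
            }
            // Flag the same rule appearing again with a different probability.
            Double previous = seen.put(key, r.p);
            if (previous != null && Double.compare(previous, r.p) != 0) {
                problems.add("rule given twice with different probabilities: " + key);
            }
            massPerLhs.merge(r.lhs, r.p, Double::sum);
        }

        // Properness: the rules for each left-hand side must sum to 1 (within a tolerance).
        for (Map.Entry<String, Double> e : massPerLhs.entrySet()) {
            if (Math.abs(e.getValue() - 1.0) > 1e-9) {
                problems.add("probabilities for " + e.getKey() + " sum to " + e.getValue());
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<Rule> grammar = Arrays.asList(
                new Rule("S", 0.5, "a"),
                new Rule("S", 0.3, "a"));
        validate(grammar).forEach(System.out::println);
    }
}
```

Returning a list of problems instead of throwing on the first one lets a grammar author fix everything in a single pass.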

EXAMPLE PROJECT

Could you please make an example project from this code that I can run from the terminal on macOS? I am not familiar with Maven.
Thanks a lot in advance.

P.S. If you make this example project, please tell me how to run it. Thanks!

left-recursive grammar breaks the parser

I tried the following grammar:

S -> a
S -> S a

Reading it like this:

Grammar<String> grammar = Grammar.parse(
                Paths.get("/some/path/test.cfg"), Charset.forName("UTF-8"));

Results in:

java.lang.RuntimeException: Matrix is singular.

	at org.leibnizcenter.cfg.algebra.matrix.LUDecomposition.solve(LUDecomposition.java:140)
	at org.leibnizcenter.cfg.algebra.matrix.Matrix.solve(Matrix.java:346)
	at org.leibnizcenter.cfg.algebra.matrix.Matrix.inverse(Matrix.java:357)
	at org.leibnizcenter.cfg.grammar.Grammar.getReflexiveTransitiveClosure(Grammar.java:134)
	at org.leibnizcenter.cfg.grammar.Grammar.<init>(Grammar.java:102)
	at org.leibnizcenter.cfg.grammar.Grammar$Builder.build(Grammar.java:416)
	at org.leibnizcenter.cfg.grammar.Grammar.parse(Grammar.java:183)
	at org.leibnizcenter.cfg.grammar.Grammar.parse(Grammar.java:166)
	at com.vision4j.internal.cli.PlayTest.cfg(PlayTest.java:48)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
	at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
	at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:237)
	at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)

Converting the grammar to right-recursive avoids this issue:

S -> a
S -> a S

I am using the latest version in Maven: 0.9.12
Have I misunderstood something about how the grammar behaves, or is this a bug?
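
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the trace fails inside `Grammar.getReflexiveTransitiveClosure`, which inverts a matrix. In Stolcke's formulation the closure of the left-corner relation is (I - P_L)^-1, where P_L holds the probability that one nonterminal directly starts with another. For a grammar with the single nonterminal S, P_L is the 1x1 matrix [p], with p the probability of `S -> S a`. Assuming that rules without an explicit probability default to 1.0 (an assumption, not stated in this thread), I - P_L is [0] and cannot be inverted. The hypothetical helper below shows the 1x1 case.

```java
final class LeftCornerClosure {
    /**
     * Closure of a 1x1 left-corner matrix [p]: the geometric series
     * 1 + p + p^2 + ... = 1 / (1 - p), which only converges for p < 1.
     */
    static double closure(double p) {
        if (p < 0.0 || p >= 1.0) {
            throw new IllegalArgumentException(
                    "I - P_L is singular or the series diverges for p = " + p);
        }
        return 1.0 / (1.0 - p);
    }

    public static void main(String[] args) {
        // S -> a (0.5), S -> S a (0.5): the closure is finite.
        System.out.println(closure(0.5));
        // S -> a (1.0), S -> S a (1.0): the closure does not exist.
        try {
            closure(1.0);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that assumption, giving the two S rules probabilities that sum to 1 would keep the left-recursive form usable; the right-recursive rewrite avoids the problem because `S -> a S` contributes nothing to the left-corner relation of S with itself.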

investigate parallelization gains

use streams/lambdas to automatically parallelize parse functions; report results

Question regarding grammar

Is it possible to write a grammar to parse the following pattern:

...anything RULE1 anything RULE2 anything...

What I want is to match the rules that occur in the sentence and ignore the noise ("anything" may be any characters).

Writing cf-grammars without probabilities

Dear colleagues,

I'm looking for a good Earley parser for writing cf-grammars, and this one seems friendly to me. Could you please tell me whether this parser allows writing cf-grammars without setting probabilities?

P.S. I need a Java parser like Lark (Python) for writing rules directly.

Thanks,
Daria

Example of drawing a parse tree when using JPEP as a library?

Hi again,

The example of how to use JPEP as a library is for "Parser.recognize". It would be nice to add a "println" of a parse tree, just like the command-line app does.

PS: In a previous issue, I mentioned that I'm struggling with CommandLine's "argument magic". What I meant is that CommandLine draws the parse by calling "System.out.println(parse.parseTree);", where "parse" is an object of class "ParseTreeWithScore", and it builds that ParseTreeWithScore from the command-line arguments in a somewhat complex way (to me, at least). So I guess the question is how to build an object of type "ParseTreeWithScore" when using JPEP as a library, given a particular grammar and a set of tokens (as you would from the command line).

Again, thanks and regards!

allow callbacks after predicting, scanning and completing

Easy addition. User might want to mess with the chart some.

ERROR CHECKING IMPLEMENTATION

Can you implement error handling?

In case of an error, the parser could insert the correct token and use the synchronizing-token method.
7-parsing-error.pdf

Implement ε-rules (empty rules)

The parser currently can't handle rules of the form

X → ε            (p)

where ε is the empty string.

See section 4.7 Null Productions on page 19 of Stolcke's paper.

We have the choice of extending prediction and completion to work with ε-rules, but this is a bit complicated. Another possibility is to rewrite the grammar to eliminate these productions, described at the end of page 20, 4.7.4 Eliminating null productions.

Best to implement the simpler solution first, and implement the philosophically correct version later.
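
As a first step towards the grammar-rewriting approach, one needs to know which nonterminals can derive ε at all. The sketch below computes that set with a fixed-point iteration over a plain, non-probabilistic rule map. The `Map` representation is an assumption made for illustration and is not the library's `Grammar` type; the probability bookkeeping that Stolcke describes in section 4.7.4 is not shown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class NullableSymbols {
    /**
     * Returns every nonterminal that can derive the empty string.
     * rules maps a left-hand side to its right-hand sides; an empty list is an ε-rule.
     */
    static Set<String> nullable(Map<String, List<List<String>>> rules) {
        Set<String> nullable = new HashSet<>();
        boolean changed = true;
        while (changed) { // iterate until no new nullable symbol is found
            changed = false;
            for (Map.Entry<String, List<List<String>>> entry : rules.entrySet()) {
                if (nullable.contains(entry.getKey())) {
                    continue;
                }
                for (List<String> rhs : entry.getValue()) {
                    // An empty RHS trivially satisfies containsAll.
                    if (nullable.containsAll(rhs)) {
                        nullable.add(entry.getKey());
                        changed = true;
                        break;
                    }
                }
            }
        }
        return nullable;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> rules = new HashMap<>();
        rules.put("A", Arrays.asList(Collections.<String>emptyList())); // A -> ε
        rules.put("B", Arrays.asList(Arrays.asList("A", "A")));         // B -> A A
        rules.put("S", Arrays.asList(Arrays.asList("B", "c")));         // S -> B c
        System.out.println(nullable(rules)); // A and B are nullable, S is not
    }
}
```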

Allow regular expressions to describe tokens in .cfg files

E.g., N -> /(wo)?Man/i

Implementation almost done
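
To illustrate what such a terminal could mean, here is a small sketch that turns a `/pattern/flags` literal into a `java.util.regex.Pattern` and tests whole tokens against it. The class `RegexTerminal` and its methods are hypothetical and are not the library's implementation; only the `i` flag from the example above is handled.

```java
import java.util.regex.Pattern;

final class RegexTerminal {
    /** Compiles a literal of the form /body/flags; only the flag 'i' is supported here. */
    static Pattern compile(String literal) {
        int lastSlash = literal.lastIndexOf('/');
        if (!literal.startsWith("/") || lastSlash == 0) {
            throw new IllegalArgumentException("expected /pattern/flags but got: " + literal);
        }
        String body = literal.substring(1, lastSlash);
        String flags = literal.substring(lastSlash + 1);

        int mask = 0;
        for (char flag : flags.toCharArray()) {
            if (flag == 'i') {
                mask |= Pattern.CASE_INSENSITIVE;
            } else {
                throw new IllegalArgumentException("unsupported flag: " + flag);
            }
        }
        return Pattern.compile(body, mask);
    }

    /** A token matches only if the whole token is covered by the pattern. */
    static boolean matches(Pattern pattern, String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        Pattern noun = compile("/(wo)?Man/i");
        for (String token : new String[] {"woman", "MAN", "human"}) {
            System.out.println(token + " -> " + matches(noun, token));
        }
    }
}
```

Matching the whole token, rather than searching within it, keeps a regular-expression terminal consistent with how a literal terminal would be compared against a token.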

Error: "The method parse(Path, Charset) is undefined for the type Grammar"

Hello,

I'm trying to use java-probabilistic-earley-parser as a library. Following the instructions:

You can parse .cfg files as follows:

Grammar<String> g = Grammar.parse(Paths.get("path", "to", "grammar.cfg"), Charset.forName("UTF-8"));

I get (in Eclipse) the error:
Error: "The method parse(Path, Charset) is undefined for the type Grammar"

I'm not using the Maven dependency, I'm just adding the latest jar to my project.

From the command line, everything works and I get a nice parse tree based on my grammar file, but the CommandLine class does some "magic" with the arguments, and I'm struggling to figure out how to do the equivalent thing without command-line arguments.

Thanks in advance!