
grobid-dictionaries's People

Contributors

dependabot[bot], kermitt2, lfoppiano, medkhem


grobid-dictionaries's Issues

Segmentation of the morphological and grammatical information

This issue is about extracting all morphological and grammatical information at the previous level.
This information could appear directly in the <entry> under the extracted <form> block, in <sense>, and/or in <re> blocks.

  • In the case of <entry>, the following example:

[screenshot 2017-04-28 16:04:05]

[screenshot 2017-04-27 22:07:09]

becomes

[screenshot 2017-04-28 16:14:44]

  • In the case of <sense>:

[screenshot 2017-04-28 16:04:21]

[screenshot 2017-04-28 16:27:36]

becomes

[screenshot 2017-04-28 16:06:48]

  • In the case of <re>:

[screenshot 2017-04-28 16:05:16]

[screenshot 2017-04-28 16:20:57]

becomes

[screenshot 2017-04-28 16:07:35]
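The screenshots above do not survive this text export. Purely as an illustration of the three placements being discussed (the tags below follow common TEI dictionary practice and are my assumption, not actual grobid-dictionaries output), a grammatical block could sit at entry, sense, or related-entry level:

```xml
<!-- Hypothetical sketch, not actual grobid-dictionaries output:
     grammatical information (<gramGrp>) attached at three levels -->
<entry>
  <form type="lemma"><orth>findere</orth></form>
  <gramGrp><pos>verb</pos></gramGrp>                   <!-- entry-level -->
  <sense>
    <gramGrp><gram type="usage">fig.</gram></gramGrp>  <!-- sense-level -->
    <def>to split a body along its natural joints</def>
  </sense>
  <re type="relatedEntry">
    <form><orth>findens</orth></form>
    <gramGrp><pos>participle</pos></gramGrp>           <!-- re-level -->
  </re>
</entry>
```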

Process Lexical Entries: Error when attempting to parse multi-page entries

I have trained the Dictionary Segmentation, Dictionary Body Segmentation and Lexical Entry Segmentation models with my own data (see this post). When running the Lexical Entry segmentation service, I get an error when attempting to parse a PDF containing entries longer than two pages; the error does not occur when parsing a PDF without long entries:

Error encountered while requesting the server.
[GENERAL] Model file does not exists or a directory: /grobid/grobid-dictionaries/../grobid-home/models/form/model.wapiti

errorlog.txt

java.lang.NullPointerException when running createTrainingDictionarySegmentation

Running the command:

java -jar target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -verbose -gH ../grobid-home/ -gP ../grobid-home/config/grobid.properties -dIn training/lexica/in/ -dOut training/lexica/out/ -exe createTrainingDictionarySegmentation

yields a java.lang.NullPointerException:

Caused by: java.lang.NullPointerException
    at org.grobid.core.engines.DictionarySegmentationParser.copyFileUsingStream(DictionarySegmentationParser.java:1525)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingDictionary(DictionarySegmentationParser.java:907)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:825)
    ... 7 more

From what I was able to understand, the engine (DictionarySegmentationParser.java:907) expects resources/templates/dictionarySegmentation.rng to exist after (or before) pdf2xml is done:

File existingRngFile = new File("resources/templates/dictionarySegmentation.rng");
File newRngFile = new File(outputDirectory + "/" + "dictionarySegmentation.rng");
copyFileUsingStream(existingRngFile, newRngFile);
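A defensive rewrite of the copy step would turn the opaque NullPointerException into an actionable message. This is only a sketch: the method name mirrors the one in the stack trace, but the body is illustrative, not the actual grobid-dictionaries code.

```java
import java.io.*;

public class RngCopy {
    // Illustrative stand-in for copyFileUsingStream: fail fast with a clear
    // message when the template is missing, e.g. when the one-jar is launched
    // from outside the grobid-dictionaries directory, so that the relative
    // path resources/templates/ does not resolve.
    static void copyFileUsingStream(File source, File dest) throws IOException {
        if (source == null || !source.isFile()) {
            throw new FileNotFoundException(
                    "Template not found: "
                    + (source == null ? "null" : source.getAbsolutePath())
                    + " (is the working directory the grobid-dictionaries root?)");
        }
        try (InputStream in = new FileInputStream(source);
             OutputStream out = new FileOutputStream(dest)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the template with a temp file, then copy it.
        File src = File.createTempFile("dictionarySegmentation", ".rng");
        try (Writer w = new FileWriter(src)) {
            w.write("<grammar/>");
        }
        File dst = File.createTempFile("copy", ".rng");
        copyFileUsingStream(src, dst);
        System.out.println(dst.length() == src.length() ? "copied" : "size mismatch");
    }
}
```

The more robust fix is probably to resolve the template from the classpath rather than the working directory, so the one-jar works from any location.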

The full stack trace is:

10 lis 2017 21:27.08 [DEBUG] DocumentSource - pdf2xml process finished. Time to process:428ms
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occured while running Grobid training data generation for segmentation model.
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:840)
    at org.grobid.core.main.batch.DictionaryMain.main(DictionaryMain.java:202)
    ... 6 more
Caused by: java.lang.NullPointerException
    at org.grobid.core.engines.DictionarySegmentationParser.copyFileUsingStream(DictionarySegmentationParser.java:1525)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingDictionary(DictionarySegmentationParser.java:907)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:825)
    ... 7 more

Create training method not working properly

Hi @MedKhem, thanks for making grobid-dictionaries. I installed it successfully following the documentation guidelines.
But I am currently facing two errors:

(i) The models train perfectly and produce the appropriate results on the web server, but when I try to produce new training data (XMLs) using the create-training methods, I do not get the correct tags in the generated XML.
Using the createTraining methods of the respective models, I get shifted output: the dictionarySegmentation model produces no tags, the dictionaryBodySegmentation model produces the label tags of the dictionarySegmentation model, and the lexicalEntry model produces the label tags of the dictionaryBodySegmentation model.

(ii) Secondly, when I launch the web server using Maven, I sometimes get the following error: "Error encountered while requesting the server. Content is not allowed in prolog."

Thanks in advance for helping me!

Docker problem: service doesn't run for process full dictionary

Here is the error message:

java.lang.NullPointerException
    at org.grobid.service.DictionaryProcessFile.processFullDictionary(DictionaryProcessFile.java:174)
    at org.grobid.service.DictionaryRestService.processFullDictionary_post(DictionaryRestService.java:69)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:833)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
    at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:206)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:564)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
    at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
    at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:672)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
    at java.lang.Thread.run(Thread.java:748)

createAnnotatedTrainingDictionaryBodySegmentation generates linebreak in the resulting TEI file

Hi there !

Unlike the functions without the Annotated prefix, which do not create \n tags, the createAnnotatedTraining functions add line breaks (version 0.5.4, pulled this morning).

createTraining.rawtxt.txt
createTraining.xml.txt
createAnnotatedTrainign.rawtxt.txt
createAnnotatedTrainign.xml.txt

createTraining:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails. <lb/>Fimos, v. Lutum. <lb/>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine.

createAnnotated:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails</entry>
. <lb/><entry>Fimos, v. Lutum</entry>
. <lb/><entry>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine</entry>
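Until this is fixed upstream, the two outputs can be compared by normalising the <lb/> markers away. A throwaway sketch (my own workaround, not a grobid-dictionaries API):

```java
// Hypothetical post-processing helper: strip <lb/> markers and rejoin words
// hyphenated across line breaks (marked with ¬ in the pdf2xml output above),
// so the plain and annotated outputs can be diffed on equal footing.
public class LbNormalizer {
    static String normalize(String tei) {
        return tei
                .replace("¬ <lb/>", "")   // rejoin hyphenated words: scin¬ <lb/>dere -> scindere
                .replace("<lb/>", "")     // drop remaining line-break markers
                .replaceAll("\\s+", " ")  // collapse whitespace runs
                .trim();
    }

    public static void main(String[] args) {
        String s = "Findere <lb/>lignum ... scin¬ <lb/>dere";
        System.out.println(normalize(s)); // prints: Findere lignum ... scindere
    }
}
```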

Firing up Jetty App runs unit tests, despite -Dmaven.test.skip=true...!

I'm using Maven 3.6, and when I run:

 mvn -Dmaven.test.skip=true jetty:run-war

All the unit tests are run! However:

mvn -DskipTests jetty:run-war 

works as expected.

According to http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skip, maven.test.skip isn't recommended; the solution is to use the skipTests parameter (http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skipTests), i.e. -DskipTests. I think this could be just a simple documentation fix, and I'm happy to submit a PR for that.

Or I can dig more into the surefire plugin configuration...
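As a complement to the documentation fix, the behaviour could also be pinned in the build itself. This is a hypothetical addition to the project's pom.xml (whether the jetty plugin honours it would need testing):

```xml
<!-- Hypothetical pom.xml fragment: wire surefire's skipTests switch to the
     skipTests property, so -DskipTests reliably skips tests -->
<properties>
  <!-- default: run tests; override with -DskipTests -->
  <skipTests>false</skipTests>
</properties>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <skipTests>${skipTests}</skipTests>
  </configuration>
</plugin>
```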

JAXB dependency

You may want to look into adding a dependency on JAXB.
It has been removed from the default JDK distribution as of Java 9, and trying to build grobid-dictionaries with that version of Java (or newer) results in errors.
For what it's worth, adding something like

<dependency>
  <groupId>javax.xml.bind</groupId>
  <artifactId>jaxb-api</artifactId>
  <version>2.3.1</version>
</dependency>

<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>2.3.1</version>
</dependency>

to pom.xml (as per https://stackoverflow.com/questions/43574426/how-to-resolve-java-lang-noclassdeffounderror-javax-xml-bind-jaxbexception-in-j) allows grobid-dictionaries to build and run properly.

Etymology extension

Implement components for parsing and segmenting etymological information in the etymology section detected within a lexical entry.

First line of the body disappears (1st model)

corpus.zip
puhvel-h-1-3.pdf

As discussed with Mohamed a few seconds ago: when using the attached training data for the first model (corpus.zip), I get a 100% f-measure when evaluating on the training data. But when I then process the attached PDF (i.e., the first 3 pages of my PDF, which are included in the training data), the first line of the body part of each page simply disappears.

Check why pdf2xml is run 3 times for each pdf

Here is the log from the generation of training data for one PDF:

18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of Initialization of dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating names
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of names
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating country codes
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of country codes
18 May 2017 08:12.02 [INFO ] WapitiModel               - Loading model: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti (size: 155377)
[Wapiti] Loading model: "/Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti"
Model path: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti
18 May 2017 08:12.02 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:95ms
18 May 2017 08:12.03 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:47ms
1 files to be processed.
1 files processed in 454 milliseconds
Johan:grobid-dictionaries lfoppiano$ ls

Running title in pages

There is (sometimes) a problem with pages that have a running title: the line just after the page number disappears. See the example below.
1908_12_14_NOE (dragged).pdf

 <item n="">
           <desc> 0. Il expose que, vu ses ne peut
                  être celle d'un écolier commanda, chapelle envoyé à Rome, dans le recueillir la
                  succession de travaux antérieurs, sa situation à I olier. Il rappelle les travaux
                  que Lou vois de Marly, de Chambord, et qu'il a été but de se perfectionner et
                  d'être apte à Mignard. n. 1780, m. 1857. La jeune Muse, réponse à des couplets qui
                  m'ont été adressés par Mlle... âgée de douze ans, chanson autographe, 2 p.
                  in-4°</desc>
</item>

Decouple parsers and tei output

Just as a reminder: we need to move all the TEI-related logic out of the *Parser classes; ideally we would have a separate set of classes to build the TEI.
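As a sketch of the target shape (all names here are illustrative, not existing grobid-dictionaries classes): parsers would emit plain labelled spans, and a separate builder would own the TEI serialisation.

```java
import java.util.List;

public class TeiDecoupling {
    // What a parser would emit: a label (e.g. "form", "sense") plus its text,
    // with no knowledge of how it will be serialised.
    record LabeledSpan(String label, String text) {}

    // The TEI-building concern lives behind its own interface, so the
    // serialisation can evolve independently of the CRF/Wapiti side.
    interface TeiBuilder {
        String toTei(List<LabeledSpan> spans);
    }

    static class SimpleTeiBuilder implements TeiBuilder {
        public String toTei(List<LabeledSpan> spans) {
            StringBuilder sb = new StringBuilder("<entry>");
            for (LabeledSpan s : spans) {
                sb.append('<').append(s.label()).append('>')
                  .append(s.text())
                  .append("</").append(s.label()).append('>');
            }
            return sb.append("</entry>").toString();
        }
    }

    public static void main(String[] args) {
        TeiBuilder builder = new SimpleTeiBuilder();
        String tei = builder.toTei(List.of(
                new LabeledSpan("form", "Findere"),
                new LabeledSpan("sense", "diviser un corps")));
        // prints: <entry><form>Findere</form><sense>diviser un corps</sense></entry>
        System.out.println(tei);
    }
}
```

This keeps the parser classes free of markup concerns and makes it possible to swap in a different output flavour without touching the models.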

Some adjustment to the demo part

The demo part is a bit confusing IMHO:

  1. The selection Process Dictionary should be something equivalent to the grobid process fulltext: the application of the whole cascade of models to label and extract information from a dictionary.
  2. The other selections could be used for debugging, to see visually what each model is doing.

[screenshot 2017-05-08 08:33:44]

My 2 cents 🍡
