
grobid-dictionaries's People

Contributors

dependabot[bot], kermitt2, lfoppiano, medkhem


grobid-dictionaries's Issues

Segmentation of the morphological and grammatical information

This issue is about extracting all morphological and grammatical information at the previous level.
This information could appear directly in the <entry> under the extracted <form> block, in <sense>, and/or in <re> blocks.

  • In the case of <entry>, the following example:

[screenshot 2017-04-28 16:04:05]

[screenshot 2017-04-27 22:07:09]

becomes

[screenshot 2017-04-28 16:14:44]

  • In the case of <sense>:

[screenshot 2017-04-28 16:04:21]

[screenshot 2017-04-28 16:27:36]

becomes

[screenshot 2017-04-28 16:06:48]

  • In the case of <re>:

[screenshot 2017-04-28 16:05:16]

[screenshot 2017-04-28 16:20:57]

becomes

[screenshot 2017-04-28 16:07:35]
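The screenshots above do not survive this text export. Purely as an illustration of the three placements being discussed (the tags below follow common TEI dictionary practice and are my assumption, not actual grobid-dictionaries output), a grammatical block could sit at entry, sense, or related-entry level:

```xml
<!-- Hypothetical sketch, not actual grobid-dictionaries output:
     grammatical information (<gramGrp>) attached at three levels -->
<entry>
  <form type="lemma"><orth>findere</orth></form>
  <gramGrp><pos>verb</pos></gramGrp>                   <!-- entry-level -->
  <sense>
    <gramGrp><gram type="usage">fig.</gram></gramGrp>  <!-- sense-level -->
    <def>to split a body along its natural joints</def>
  </sense>
  <re type="relatedEntry">
    <form><orth>findens</orth></form>
    <gramGrp><pos>participle</pos></gramGrp>           <!-- re-level -->
  </re>
</entry>
```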

Process Lexical Entries: Error when attempting to parse multi-page entries

I have trained the Dictionary Segmentation, Dictionary Body Segmentation and Lexical Entry Segmentation models with my own data (see this post). When running the Lexical Entry segmentation service, I get an error when attempting to parse a PDF containing entries longer than two pages; the error does not occur when parsing a PDF without long entries:

Error encountered while requesting the server.
[GENERAL] Model file does not exists or a directory: /grobid/grobid-dictionaries/../grobid-home/models/form/model.wapiti

errorlog.txt

java.lang.NullPointerException when running createTrainingDictionarySegmentation

Running the command:

java -jar target/grobid-dictionaries-0.4.3-SNAPSHOT.one-jar.jar -verbose -gH ../grobid-home/ -gP ../grobid-home/config/grobid.properties -dIn training/lexica/in/ -dOut training/lexica/out/ -exe createTrainingDictionarySegmentation

yields a java.lang.NullPointerException:

Caused by: java.lang.NullPointerException
    at org.grobid.core.engines.DictionarySegmentationParser.copyFileUsingStream(DictionarySegmentationParser.java:1525)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingDictionary(DictionarySegmentationParser.java:907)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:825)
    ... 7 more

From what I was able to understand, the engine (DictionarySegmentationParser.java:907) expects resources/templates/dictionarySegmentation.rng to exist after (or before) pdf2xml is done:

File existingRngFile = new File("resources/templates/dictionarySegmentation.rng");
File newRngFile = new File(outputDirectory + "/" + "dictionarySegmentation.rng");
copyFileUsingStream(existingRngFile, newRngFile);
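A defensive rewrite of the copy step would turn the opaque NullPointerException into an actionable message. This is only a sketch: the method name mirrors the one in the stack trace, but the body is illustrative, not the actual grobid-dictionaries code.

```java
import java.io.*;

public class RngCopy {
    // Illustrative stand-in for copyFileUsingStream: fail fast with a clear
    // message when the template is missing, e.g. when the one-jar is launched
    // from outside the grobid-dictionaries directory, so that the relative
    // path resources/templates/ does not resolve.
    static void copyFileUsingStream(File source, File dest) throws IOException {
        if (source == null || !source.isFile()) {
            throw new FileNotFoundException(
                    "Template not found: "
                    + (source == null ? "null" : source.getAbsolutePath())
                    + " (is the working directory the grobid-dictionaries root?)");
        }
        try (InputStream in = new FileInputStream(source);
             OutputStream out = new FileOutputStream(dest)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) > 0) {
                out.write(buffer, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate the template with a temp file, then copy it.
        File src = File.createTempFile("dictionarySegmentation", ".rng");
        try (Writer w = new FileWriter(src)) {
            w.write("<grammar/>");
        }
        File dst = File.createTempFile("copy", ".rng");
        copyFileUsingStream(src, dst);
        System.out.println(dst.length() == src.length() ? "copied" : "size mismatch");
    }
}
```

The more robust fix is probably to resolve the template from the classpath rather than the working directory, so the one-jar works from any location.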

The full stack trace is:

10 lis 2017 21:27.08 [DEBUG] DocumentSource - pdf2xml process finished. Time to process:428ms
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: org.grobid.core.exceptions.GrobidException: [GENERAL] An exception occured while running Grobid training data generation for segmentation model.
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:840)
    at org.grobid.core.main.batch.DictionaryMain.main(DictionaryMain.java:202)
    ... 6 more
Caused by: java.lang.NullPointerException
    at org.grobid.core.engines.DictionarySegmentationParser.copyFileUsingStream(DictionarySegmentationParser.java:1525)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingDictionary(DictionarySegmentationParser.java:907)
    at org.grobid.core.engines.DictionarySegmentationParser.createTrainingBatch(DictionarySegmentationParser.java:825)
    ... 7 more

Create training method not working properly

Hi @MedKhem, thanks for making grobid-dictionaries. I installed it successfully following the documentation guidelines.
But I am currently facing two errors:

(i) The models train perfectly and produce the appropriate results on the web server, but when I try to produce new training data (XMLs) using the create-training methods, I do not get the correct tags in the generated XML.
Using the createTraining methods of the respective models, I get shifted output: the dictionarySegmentation model produces no tags, the dictionaryBodySegmentation model produces the label tags of the dictionarySegmentation model, and the lexicalEntry model produces the label tags of the dictionaryBodySegmentation model.

(ii) Secondly, when I launch the web server using Maven, I sometimes get the following error: "Error encountered while requesting the server. Content is not allowed in prolog."

Thanks in advance for helping me!

Docker problem: service doesn't run for process full dictionary

Here is the error message:

java.lang.NullPointerException
    at org.grobid.service.DictionaryProcessFile.processFullDictionary(DictionaryProcessFile.java:174)
    at org.grobid.service.DictionaryRestService.processFullDictionary_post(DictionaryRestService.java:69)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:833)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
    at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:206)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:564)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
    at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
    at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:672)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:590)
    at java.lang.Thread.run(Thread.java:748)

createAnnotatedTrainingDictionaryBodySegmentation generates linebreak in the resulting TEI file

Hi there !

Unlike the functions without the Annotated prefix, which do not create \n tags, the createAnnotatedTraining functions add line breaks (version 0.5.4, pulled this morning).

createTraining.rawtxt.txt
createTraining.xml.txt
createAnnotatedTrainign.rawtxt.txt
createAnnotatedTrainign.xml.txt

createTraining:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails. <lb/>Fimos, v. Lutum. <lb/>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine.

createAnnotated:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails</entry>
. <lb/><entry>Fimos, v. Lutum</entry>
. <lb/><entry>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine</entry>
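Until this is fixed upstream, the two outputs can be compared by normalising the <lb/> markers away. A throwaway sketch (my own workaround, not a grobid-dictionaries API):

```java
// Hypothetical post-processing helper: strip <lb/> markers and rejoin words
// hyphenated across line breaks (marked with ¬ in the pdf2xml output above),
// so the plain and annotated outputs can be diffed on equal footing.
public class LbNormalizer {
    static String normalize(String tei) {
        return tei
                .replace("¬ <lb/>", "")   // rejoin hyphenated words: scin¬ <lb/>dere -> scindere
                .replace("<lb/>", "")     // drop remaining line-break markers
                .replaceAll("\\s+", " ")  // collapse whitespace runs
                .trim();
    }

    public static void main(String[] args) {
        String s = "Findere <lb/>lignum ... scin¬ <lb/>dere";
        System.out.println(normalize(s)); // prints: Findere lignum ... scindere
    }
}
```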

Firing up Jetty App runs unit tests, despite -Dmaven.test.skip=true...!

I'm using Maven 3.6, and when I run:

 mvn -Dmaven.test.skip=true jetty:run-war

All the unit tests are run! However:

mvn -DskipTests jetty:run-war 

works as expected.

According to http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skip, maven.test.skip isn't recommended; the solution is to use the skipTests parameter (http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html#skipTests), i.e. -DskipTests. I think this could be just a simple documentation fix, and I'm happy to submit a PR for that.

Or I can dig more into the surefire plugin configuration...
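As a complement to the documentation fix, the behaviour could also be pinned in the build itself. This is a hypothetical addition to the project's pom.xml (whether the jetty plugin honours it would need testing):

```xml
<!-- Hypothetical pom.xml fragment: wire surefire's skipTests switch to the
     skipTests property, so -DskipTests reliably skips tests -->
<properties>
  <!-- default: run tests; override with -DskipTests -->
  <skipTests>false</skipTests>
</properties>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <skipTests>${skipTests}</skipTests>
  </configuration>
</plugin>
```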

JAXB dependency

You may want to look into adding a dependency on JAXB.
It has been removed from the default JDK distribution as of Java 9, and trying to build grobid-dictionaries with that version of Java (or newer) results in errors.
For what it's worth, adding something like

<dependency>
  <groupId>javax.xml.bind</groupId>
  <artifactId>jaxb-api</artifactId>
  <version>2.3.1</version>
</dependency>

<dependency>
  <groupId>org.glassfish.jaxb</groupId>
  <artifactId>jaxb-runtime</artifactId>
  <version>2.3.1</version>
</dependency>

to pom.xml (as per https://stackoverflow.com/questions/43574426/how-to-resolve-java-lang-noclassdeffounderror-javax-xml-bind-jaxbexception-in-j) allows grobid-dictionaries to build and run properly.

Etymology extension

Implement components for parsing and segmenting etymological information in the etymology section detected within a lexical entry.

First line of the body disappears (1st model)

corpus.zip
puhvel-h-1-3.pdf

As discussed with Mohamed a few seconds ago: when using the attached training data for the first model (corpus.zip), I get a 100% f-measure when evaluating on the training data. But when I then process the attached PDF (i.e., the first 3 pages of my PDF, which are included in the training data), the first line of the body part of each page simply disappears.

Check why pdf2xml is run 3 times for each pdf

Here is the log from the generation of training data for one PDF:

18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of Initialization of dictionary
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating names
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of names
18 May 2017 08:12.02 [INFO ] Lexicon                   - Initiating country codes
18 May 2017 08:12.02 [INFO ] Lexicon                   - End of initialization of country codes
18 May 2017 08:12.02 [INFO ] WapitiModel               - Loading model: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti (size: 155377)
[Wapiti] Loading model: "/Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti"
Model path: /Users/lfoppiano/development/inria/grobid/grobid-home/models/dictionary-body-segmentation/model.wapiti
18 May 2017 08:12.02 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.02 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/KyvxTsun9N.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:95ms
18 May 2017 08:12.03 [DEBUG] DocumentSource            - start pdf2xml
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  'resources/byDictionary/BasicEnglish/corpus/pdf/BasicEnglish30.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/t6C8gT7qYA.lxml]
18 May 2017 08:12.03 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:47ms
1 files to be processed.
1 files processed in 454 milliseconds
Johan:grobid-dictionaries lfoppiano$ ls

Running title in pages

There is (sometimes) a problem with pages that have a running title: the line just after the page number disappears. See the example below.
1908_12_14_NOE (dragged).pdf

 <item n="">
           <desc> 0. Il expose que, vu ses ne peut
                  être celle d'un écolier commanda, chapelle envoyé à Rome, dans le recueillir la
                  succession de travaux antérieurs, sa situation à I olier. Il rappelle les travaux
                  que Lou vois de Marly, de Chambord, et qu'il a été but de se perfectionner et
                  d'être apte à Mignard. n. 1780, m. 1857. La jeune Muse, réponse à des couplets qui
                  m'ont été adressés par Mlle... âgée de douze ans, chanson autographe, 2 p.
                  in-4°</desc>
</item>

Decouple parsers and tei output

Just as a reminder: we need to move all the TEI-related logic out of the *Parser classes; ideally we would have a separate set of classes to build the TEI.
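As a sketch of the target shape (all names here are illustrative, not existing grobid-dictionaries classes): parsers would emit plain labelled spans, and a separate builder would own the TEI serialisation.

```java
import java.util.List;

public class TeiDecoupling {
    // What a parser would emit: a label (e.g. "form", "sense") plus its text,
    // with no knowledge of how it will be serialised.
    record LabeledSpan(String label, String text) {}

    // The TEI-building concern lives behind its own interface, so the
    // serialisation can evolve independently of the CRF/Wapiti side.
    interface TeiBuilder {
        String toTei(List<LabeledSpan> spans);
    }

    static class SimpleTeiBuilder implements TeiBuilder {
        public String toTei(List<LabeledSpan> spans) {
            StringBuilder sb = new StringBuilder("<entry>");
            for (LabeledSpan s : spans) {
                sb.append('<').append(s.label()).append('>')
                  .append(s.text())
                  .append("</").append(s.label()).append('>');
            }
            return sb.append("</entry>").toString();
        }
    }

    public static void main(String[] args) {
        TeiBuilder builder = new SimpleTeiBuilder();
        String tei = builder.toTei(List.of(
                new LabeledSpan("form", "Findere"),
                new LabeledSpan("sense", "diviser un corps")));
        // prints: <entry><form>Findere</form><sense>diviser un corps</sense></entry>
        System.out.println(tei);
    }
}
```

This keeps the parser classes free of markup concerns and makes it possible to swap in a different output flavour without touching the models.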

Some adjustment to the demo part

The demo part is a bit confusing IMHO:

  1. The selection Process Dictionary should be something equivalent to the grobid process fulltext: the application of the whole cascade of models to label and extract information from a dictionary.
  2. The other selections could be used for debugging, to see visually what each model is doing.

[screenshot 2017-05-08 08:33:44]

My 2 cents 🍡
