bibliome / alvisnlp

ALvisNLP corpus processing engine

Home Page: https://bibliome.github.io/alvisnlp/

License: Apache License 2.0

Java 70.38% HTML 0.41% XSLT 2.32% CSS 1.84% Shell 0.50% JavaScript 23.18% Batchfile 0.05% PowerShell 0.05% Python 1.27%

nlp pipeline alvis corpus-processing workflow java natural-language-processing workflow-engine machine-learning

alvisnlp's People

arnaudferre jibe-b doriankodelja reloadbrain xddd-ys nkkkyyy

alvisnlp's Issues

Refactor RunResource and PubAnnotation

PubAnnotation currently delegates to RunAnnotation; both could instead inherit from a common abstract class.
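
The refactoring suggested above can be sketched as a template-method base class. This is only an illustration in Python, not the project's Java code: the base class name PlanRunResource, the method names and the returned dictionaries are all hypothetical, and only the names RunResource and PubAnnotation come from the issue.

```python
from abc import ABC, abstractmethod


class PlanRunResource(ABC):
    """Hypothetical common base holding the logic shared by both resources."""

    def handle(self, plan_name: str, text: str) -> dict:
        # Template method: the workflow is shared, only rendering differs.
        run_id = self.start_run(plan_name, text)
        return self.render(run_id)

    def start_run(self, plan_name: str, text: str) -> str:
        # Placeholder for the shared code that is reached by delegation today.
        return f"{plan_name}:{len(text)}"

    @abstractmethod
    def render(self, run_id: str) -> dict:
        """Turn a run identifier into a response body."""


class RunResource(PlanRunResource):
    def render(self, run_id: str) -> dict:
        return {"run": run_id}


class PubAnnotation(PlanRunResource):
    def render(self, run_id: str) -> dict:
        return {"run": run_id, "format": "pubannotation"}
```

With this layout neither class needs a reference to the other; each one only overrides the step that differs.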

Support OMTD-SHARE metadata directly in AlvisNLP

The Java classes generated from the OMTD-SHARE metadata are available in the registry API Maven artifact, in the package eu.openminted.registry.domain.

The dependency:

<dependency>
  <groupId>eu.openminted</groupId>
  <artifactId>omtd-registry-api</artifactId>
</dependency>

The dependency is available from these two repositories:

<repository>
  <id>omtd-releases</id>
  <layout>default</layout>
  <url>https://repo.openminted.eu/content/repositories/releases</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

<repository>
  <id>omtd-snapshots</id>
  <layout>default</layout>
  <url>https://repo.openminted.eu/content/repositories/snapshots</url>
  <releases>
    <enabled>false</enabled>
  </releases>
  <snapshots>
    <enabled>true</enabled>
  </snapshots>
</repository>

CaseInsensitive problem in Tabular Projector

The option gives "random" results.

We are trying to normalize this file:

Input file

AP1
AP2
CAL
FWA
FT
AP3
PI
AG
UFO
CO
Co
LD
GA1

Result:
<caseInsensitive> disabled:

AP1 AT1G69120
AP2 AT4G36920
CAL AT1G26310
FWA AT4G25530
FT AT1G65480
AP3 AT3G54340
PI AT5G20240
AG AT4G18960
UFO AT1G30950
CO AT5G15840
Co
LD AT4G02560
GA1 AT4G02780

<caseInsensitive> enabled:

AP1 AT1G69120
AP2 AT4G36920
CAL AT1G26310
FWA AT4G25530
FT
AP3 AT3G54340
PI AT5G20240
AG AT4G18960
UFO AT1G30950
CO
Co
LD AT4G02560
GA1 AT4G02780

To test:
/bibdev/install/alvisnlp/devel/bin/alvisnlp -inputDir /bibdev/travail/arabidopsis/alvisir2_devel -log plan/alvisnlp.log normalize_genes.plan
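
For reference, here is a minimal sketch of the behaviour one would expect from a case-insensitive lookup. It is plain Python, not the module's implementation. The function names are hypothetical, and the two dictionary entries are taken from the output listed above.

```python
def build_index(entries, case_insensitive=False):
    """Map each (optionally case-folded) key to its identifier."""
    index = {}
    for key, value in entries:
        norm = key.casefold() if case_insensitive else key
        index.setdefault(norm, value)  # first entry wins on collisions
    return index


def project(names, index, case_insensitive=False):
    """Return (name, identifier-or-None) pairs, in input order."""
    result = []
    for name in names:
        norm = name.casefold() if case_insensitive else name
        result.append((name, index.get(norm)))
    return result


# Subset of the dictionary, as seen in the output with the option disabled.
entries = [("FT", "AT1G65480"), ("CO", "AT5G15840")]
names = ["FT", "CO", "Co"]

strict = project(names, build_index(entries))
relaxed = project(names, build_index(entries, True), True)
```

In this sketch the relaxed lookup matches everything the strict one matches, plus "Co". The output reported above loses FT and CO when the option is enabled, which is the opposite.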

Problem with AlvisNLP and Stanford NER new versions (>3.4)

There is an error when trying to run AlvisNLP with new versions of Stanford NER (>3.4)
Example with stanford-ner-2016-10-31:

Loading classifier from /bibdev/sources/stanford/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz ...
Loading distsim lexicon from /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters ...
[2017-07-06 15:22:47.000][alvisnlp] SEVERE java.io.FileNotFoundException: /u/nlp/data/pos_tags_are_useless/egw4-reut.512.clusters (Aucun fichier ou dossier de ce type)

Error obtained when running: /bibdev/install/alvisnlp/devel/bin/alvisnlp -verbose -log err.log /bibdev/travail/OpenMinted/UseCases/Wheat/uc-tdm-AS-D/plans/test-stanford.plan

However, there is no problem running Stanford NER as a standalone tool:

$ java -mx600m -cp /bibdev/sources/stanford/stanford-ner-2016-10-31/stanford-ner.jar:/bibdev/sources/stanford/stanford-ner-2016-10-31/lib/* edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier /bibdev/sources/stanford/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /bibdev/sources/stanford/stanford-ner-2016-10-31/sample.txt
Invoked on Thu Jul 06 15:17:01 CEST 2017 with arguments: -loadClassifier /bibdev/sources/stanford/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /bibdev/sources/stanford/stanford-ner-2016-10-31/sample.txt
loadClassifier=/bibdev/sources/stanford/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz
textFile=/bibdev/sources/stanford/stanford-ner-2016-10-31/sample.txt
Loading classifier from /bibdev/sources/stanford/stanford-ner-2016-10-31/classifiers/english.all.3class.distsim.crf.ser.gz ... done [1,3 sec].
The/O fate/O of/O Lehman/ORGANIZATION Brothers/ORGANIZATION ,/O the/O beleaguered/O investment/O bank/O ,/O hung/O in/O the/O balance/O on/O Sunday/O as/O Federal/ORGANIZATION Reserve/ORGANIZATION officials/O and/O the/O leaders/O of/O major/O financial/O institutions/O continued/O to/O gather/O in/O emergency/O meetings/O trying/O to/O complete/O a/O plan/O to/O rescue/O the/O stricken/O bank/O ./O 
Several/O possible/O plans/O emerged/O from/O the/O talks/O ,/O held/O at/O the/O Federal/ORGANIZATION Reserve/ORGANIZATION Bank/ORGANIZATION of/ORGANIZATION New/ORGANIZATION York/ORGANIZATION and/O led/O by/O Timothy/PERSON R./PERSON Geithner/PERSON ,/O the/O president/O of/O the/O New/ORGANIZATION York/ORGANIZATION Fed/ORGANIZATION ,/O and/O Treasury/ORGANIZATION Secretary/O Henry/PERSON M./PERSON Paulson/PERSON Jr./PERSON ./O

XMI Serialization

Serialize the data structure as UIMA XMI, using a fixed typesystem.

Deploy demo version of endpoint

PubAnnotation cannot access a deployment running on a laptop.

Deployment on bibdev:
Loading the taxon dictionary ends with an OutOfMemory exception; dictionary serialization is needed.

Package rename

bibliome.org belongs to BioFoundation, so the package naming org.bibliome.… may conflict with others.

We probably don't care, as their contents date back to 2011 and 2012.

TabularExport : single option for whole corpus output

This is so common:

<files>$</files>
<fileName>"somefile.txt"</fileName>

that we could use a single parameter, like this:

<corpusFile>somefile.txt</corpusFile>

SimpleProjector2 does not project onto the first terms of an abstract

Redmine issue.

Make PubAnnotation endpoint accept POST

Currently it only accepts GET.

Export annotations as PubAnnotation JSON.

Create an AlvisNLP module that exports PubAnnotation JSON.

Problems:

In an AlvisNLP/ML workflow, not all annotations should be exposed/exported. How to select the interesting layers?
Choose the right feature in an annotation to export as obj
Choose the right feature in a tuple to export as pred
Choose the right roles in a tuple to export as subj and obj
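
The choices listed above can be sketched as parameters of an export function. This is a Python illustration, not the module itself: the input structures (dicts with id, start, end, features, args) are simplified stand-ins for the AlvisNLP data model, and the parameter names obj_feature, pred_feature, subj_role and obj_role are hypothetical. The output keys (text, denotations, span, obj, relations, pred, subj) follow the PubAnnotation JSON format. The caller is assumed to pass only annotations from the layers selected for export.

```python
import json


def to_pubannotation(text, annotations, tuples, obj_feature,
                     pred_feature, subj_role, obj_role):
    """Build a PubAnnotation-style document from simplified structures."""
    denotations = [
        {"id": a["id"],
         "span": {"begin": a["start"], "end": a["end"]},
         "obj": a["features"][obj_feature]}      # feature exported as obj
        for a in annotations
    ]
    relations = [
        {"id": t["id"],
         "pred": t["features"][pred_feature],    # feature exported as pred
         "subj": t["args"][subj_role],           # role exported as subj
         "obj": t["args"][obj_role]}             # role exported as obj
        for t in tuples
    ]
    return {"text": text, "denotations": denotations, "relations": relations}


# Made-up example data.
text = "FT promotes flowering."
annotations = [
    {"id": "T1", "start": 0, "end": 2, "features": {"type": "Gene"}},
    {"id": "T2", "start": 12, "end": 21, "features": {"type": "Process"}},
]
tuples = [
    {"id": "R1", "features": {"kind": "promotes"},
     "args": {"agent": "T1", "target": "T2"}},
]
doc = to_pubannotation(text, annotations, tuples, obj_feature="type",
                       pred_feature="kind", subj_role="agent",
                       obj_role="target")
print(json.dumps(doc, indent=2))
```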

Kernel in classifiers

TrainingElementClassifier and TaggingElementClassifier should be able to accept a Kernel instead of a RelationDefinition, if the classifying algorithm comes from LibSVM.

Add resources to PubAnnotation

Add documents to PubAnnotation: about 1 million PMIDs from PubMed to add
Add dictionary to PubDictionaries: about 4 million dictionary entries from NCBI to add

alvisnlp releases

Could you please add a release for alvisnlp at https://github.com/Bibliome/alvisnlp/releases ?

Problem with the current TreeTagger installation for French
(<treeTaggerExecutable>..../install/tree-tagger-3.2/bin/tree-tagger</treeTaggerExecutable>)
and
(<parFile>.../install/tree-tagger-3.2/lib/french-utf8.par</parFile>)

--
The problem is resolved with this executable:
<treeTaggerExecutable>tree-tagger-linux-3.2.1/bin/tree-tagger</treeTaggerExecutable>

and the french.par parameter file:
<parFile>french.par</parFile>

Organize site

From aa67372, wiki pages will be migrated to the site:

https://bibliome.github.io/alvisnlp/

[Projectors] Allow multiple features in layer subject

The subject parameter in *Projector modules:

    <subject layer="words" feature="form,lemma"/>

This would try to match either the form or the lemma feature.
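
A minimal sketch of the proposed matching rule, in Python rather than in the projector code. The helper names and the dict-based annotation and dictionary are assumptions made for the illustration; only the "form,lemma" attribute value comes from the snippet above.

```python
def parse_features(spec):
    """Split a comma-separated feature attribute such as "form,lemma"."""
    return [name.strip() for name in spec.split(",") if name.strip()]


def match_annotation(annotation, feature_names, dictionary):
    """Return the entry for the first feature whose value is in the dictionary."""
    for name in feature_names:
        value = annotation.get(name)
        if value is not None and value in dictionary:
            return dictionary[value]
    return None


features = parse_features("form,lemma")
dictionary = {"gene": "GENE"}              # made-up dictionary entry
word = {"form": "genes", "lemma": "gene"}  # made-up annotation features
entry = match_annotation(word, features, dictionary)  # matched through the lemma
```

Features are tried in the order given, so listing form first gives the surface form priority over the lemma.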

Can't use -outputDir and -inputDir with ToMap (unless Yatea file already exists)

Using -outputDir and -inputDir with ToMap generates an error: the program can't find the yatea file that was generated in the outputDir, even when the outputDir is also specified as an inputDir.

Example on Migale:
/projet/mig/work/textemig/software/install/alvisnlp/bin/alvisnlp -cleanTmp -verbose -log corpus/test/batch/0000/alvisnlp.log -inputDir corpus/test/batch/0000/ -inputDir corpus/test/batch/0000/output -outputDir corpus/test/batch/0000/output plans/test-output.plan
Generates the following error:

[2017-11-27 10:52:10.182][entities.term-extraction.yatea] done in 586 ms
[2017-11-27 10:52:10.182][entities.term-extraction] done in 616 ms
[2017-11-27 10:52:10.182][entities.tomap] processing
[2017-11-27 10:52:10.352][alvisnlp] SEVERE org.xml.sax.SAXParseException; Premature end of file.

NB: this only happens the first time you run the command. If you run it a second time (i.e. the yatea output already exists), it works.

Fill synopsis for everything

Provide a synopsis for each module, library, conversion.

Docker dist

Providing a packaged Docker image would make AlvisNLP easy to install.

Prepare the object that corresponds to the list of sentences in the Alvis corpus

Create the object that contains the list of sentences in the method createTheSentences(...) of the class Corpus2InteractionXML.java

See Example of Sentence here

TomapProjector attribution match options

lemmaKeys, caseInsensitive and ignoreDiacritics control both projection and attribution. Make separate options for attribution.
Add an option to match surface form OR lemma.
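
One way to picture separate options is to give projection and attribution each their own option set. The sketch below is Python, not the TomapProjector code: the MatchOptions class and the normalize function are hypothetical, and lemmaKeys is left out because its exact semantics are not described here.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchOptions:
    """Hypothetical option set, one instance per matching step."""
    case_insensitive: bool = False
    ignore_diacritics: bool = False


def normalize(value, options):
    """Normalize a string according to one option set."""
    if options.ignore_diacritics:
        # Decompose, then drop the combining marks (accents).
        decomposed = unicodedata.normalize("NFD", value)
        value = "".join(c for c in decomposed if not unicodedata.combining(c))
    if options.case_insensitive:
        value = value.casefold()
    return value


# Projection is lenient, attribution is strict: the two no longer share flags.
projection = MatchOptions(case_insensitive=True, ignore_diacritics=True)
attribution = MatchOptions()

projected_key = normalize("Protéine", projection)
attributed_key = normalize("Protéine", attribution)
```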

alvisnlp-rest: PubAnnotation endpoint

We need an empty end-point for AlvisNLP--PubAnnotation:

One URL for each plan?
Accept the text parameter for direct input
Accept the sourcedb and sourceid parameters; the endpoint should then download the text by calling back pubannotation.org
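
The parameter handling described above could look like the following Python sketch. The function name and return convention are hypothetical. The document URL layout (/docs/sourcedb/<db>/sourceid/<id>.json) is an assumption about the PubAnnotation site and should be checked against its API documentation.

```python
from urllib.parse import quote


def resolve_input(params, base_url="http://pubannotation.org"):
    """Decide where the text comes from, given the request parameters.

    Returns ("text", <text>) for direct input, or ("fetch", <url>) when the
    text has to be downloaded from PubAnnotation.
    """
    text = params.get("text")
    if text is not None:
        return ("text", text)
    sourcedb = params.get("sourcedb")
    sourceid = params.get("sourceid")
    if sourcedb and sourceid:
        # Assumed URL layout, see the note above.
        url = "{}/docs/sourcedb/{}/sourceid/{}.json".format(
            base_url, quote(sourcedb, safe=""), quote(sourceid, safe=""))
        return ("fetch", url)
    raise ValueError("expected 'text', or both 'sourcedb' and 'sourceid'")
```

Here direct text takes precedence when both kinds of parameters are supplied; the issue does not say which should win, so that is a design choice of the sketch.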

Implement synchronous PubAnnotation

PubAnnotation asynchronous client isn't very stable.

Use rich format for PubAnnotationExport

Use richer JSON for PubAnnotation, specified here

PatternMatcher problem: ? gives a different result from {0,1}

See issue on redmine.

See utils issue.

Prepare demo for BLAH3 wrap up

show AlvisNLP/ML service
show exposed plans
show API doc, esp PubAnnotation endpoint
choose one or two cool documents to annotate
show annotation process
what next

Reactivate non-regression tests

The tests cannot be put on GitHub because of data size and licensing.

Check new PubAnnotation annotator framework

Annotators (online automatic annotation services) now can be registered on PubAnnotation:

http://pubannotation.org/annotators

The API is documented here:

http://www.pubannotation.org/docs/annotation-server-api/

Additional SourceStream options

Some modules accept input file paths as parameters, and some of these files can be quite large (dictionaries).
These files could be compressed. AlvisNLP should be able to transparently open plain or compressed files.

See: Bibliome/bibliome-java-utils#2
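
As an illustration of "transparent" opening, the Python sketch below detects gzip by its two magic bytes instead of relying on the file extension. It is not the SourceStream implementation. The function name is hypothetical and only gzip is handled.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path, encoding="utf-8"):
    """Open a plain or gzip-compressed file as text."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

Callers use the returned object the same way in both cases, for example `with open_text("dictionary.txt.gz") as f:`, so module code does not need to know whether the dictionary was compressed.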

Test plan parameter passing to PubAnnotation

See if parameters can be passed to the plan through PubAnnotation.

XLSProjector module (was: .xls to .txt)

I'm not sure this is the right place, but anyway (cc @mandiayba)

In my preprocessing steps for AlvisNLP, I convert my resources, one of which is in .xls format, to .txt (see https://github.com/openminted/uc-tdm-AS-E/blob/master/Execution_resources.sh).

For this I use "ssconvert", the program included in gnumeric.

I don't know whether it should be integrated into AlvisNLP or not (if it could be useful). Docker images exist, but I don't know which version they contain, and I'm not really sure how to use it if we reuse a Docker image.

Data model diagram source

In [[AlvisNLP-ML-data-model]], the diagram has a small mistake. Please insert the source in the wiki repo.

@mandiayba

Full compile regression test

The test compilation in the test-alvisnlp.sh script should be done with an empty Maven local repo.

Avoid problems like #39

Normalize space in XMLReader2 XSLT functions

Add a space normalization option for concat() and inline() XSLT extension functions provided by XMLReader2.

Difficulty: keep track of character offsets.

Workaround: MergeSections
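
The offset difficulty can be handled by returning, along with the normalized string, a map from each output character back to its position in the original text. The following is a Python sketch of that idea only. It is not the XSLT extension code and the function name is hypothetical.

```python
def normalize_space(text):
    """Collapse whitespace runs to one space and trim both ends.

    Returns (normalized, offsets), where offsets[i] is the index in `text`
    of the character that produced normalized[i].
    """
    out = []
    offsets = []
    pending = None  # offset of the first whitespace of the current run
    for i, ch in enumerate(text):
        if ch.isspace():
            # Remember the run only if some content precedes it (left trim).
            if out and pending is None:
                pending = i
        else:
            if pending is not None:
                out.append(" ")
                offsets.append(pending)
                pending = None
            out.append(ch)
            offsets.append(i)
    # A trailing run is never flushed, which trims the right end.
    return "".join(out), offsets
```

An annotation found at positions [b, e) of the normalized string then covers offsets[b] to offsets[e - 1] + 1 in the original text.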

Input of PubAnnotation: do not assume parameter _read_

Currently the PubAnnotation endpoint assumes a parameter read of a TextFileReader module.

Choose a damn place for documentation

In the module documentation (e.g. TabularExport),

the links to the type documentation lead to a 404 page (served by GlassFish).

e.g.:

fileName

Mandatory
Type: Expression

Export the JSON in PubAnnotation endpoint

Create a JSONExport module (#3), and inject it in the plan.

Shell completion

The Shell and Shell2 modules should support completion for:

keywords
layer names
feature names
document ids, section names and relation names
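
At its core, completion over these name sets is a prefix lookup. The Python sketch below shows that step only. The function name and the candidate list are made up, and hooking it into the Shell modules' line reader is out of scope.

```python
import bisect


def complete(prefix, candidates):
    """Return the sorted candidates that start with `prefix`."""
    ordered = sorted(set(candidates))
    start = bisect.bisect_left(ordered, prefix)
    matches = []
    for word in ordered[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        matches.append(word)
    return matches


# Made-up pool mixing feature and layer names.
names = ["form", "lemma", "length", "words"]
suggestions = complete("le", names)
```

In practice the candidate pool would depend on the cursor context, for instance layer names only where a layer is expected.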

PubAnnotation create one file per document

PatternMatcher removeAnnotations action performance

/bibdev/travail/arabidopsis/alvisir2_devel/plan/entities-test-RB.plan

[metadata] Tees has no documentation

Tees has no documentation, or at least invoking alvisnlp -moduleDoc Tees returns the error

Exception in thread "main" org.bibliome.util.clio.CLIOException: java.lang.reflect.InvocationTargetException
at org.bibliome.util.clio.CLIOParser.processOption(CLIOParser.java:154)
at org.bibliome.util.clio.CLIOParser.parse(CLIOParser.java:116)
at alvisnlp.app.cli.AbstractAlvisNLP.run(AbstractAlvisNLP.java:1045)
at alvisnlp.app.cli.AlvisNLP.main(AlvisNLP.java:85)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.bibliome.util.clio.CLIOParser.processOption(CLIOParser.java:145)
... 3 more
Caused by: org.bibliome.util.service.UnsupportedServiceException: alias: Tees
at org.bibliome.util.service.CompositeServiceFactory.getServiceByAlias(CompositeServiceFactory.java:129)
at alvisnlp.app.cli.AbstractAlvisNLP.getModuleDocumentation(AbstractAlvisNLP.java:384)
at alvisnlp.app.cli.AbstractAlvisNLP.moduleDoc(AbstractAlvisNLP.java:447)
... 8 more

AlvisNLP Bibliome Module Factory ................... FAILURE

Hello,

While trying to follow the instructions, at the second instruction (mvn clean install), the installation fails for AlvisNLP Bibliome Module Factory. Here is what the console shows at the end:

[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building AlvisNLP Bibliome Module Factory 0.5rc-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for fr.jouy.inra.maiage.bibliome:alvisdb-core:jar:0.1-SNAPSHOT is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] AlvisNLP/ML ........................................ SUCCESS [ 0.798 s]
[INFO] AlvisNLP Core ...................................... SUCCESS [ 6.976 s]
[INFO] AlvisNLP Bibliome Module Factory ................... FAILURE [ 0.878 s]
[INFO] alvisnlp-rest ...................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.957 s
[INFO] Finished at: 2017-11-22T10:46:45+01:00
[INFO] Final Memory: 27M/78M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project alvisnlp-bibliome: Could not resolve dependencies for project fr.jouy.inra.maiage.bibliome:alvisnlp-bibliome:jar:0.5rc-SNAPSHOT: Failure to find fr.jouy.inra.maiage.bibliome:alvisdb-core:jar:0.1-SNAPSHOT in http://bibliome.jouy.inra.fr/maven-repository was cached in the local repository, resolution will not be reattempted until the update interval of bibliome has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :alvisnlp-bibliome

Can you help me solve this problem?

Install procedure for windows

Tested on Windows 10 running in a VirtualBox machine.

Download and install JDK 8 for Windows.
Set the JAVA_HOME environment variable to the JDK directory. Usually something like C:\Program Files\jdk1.8.0_XXX, where XXX is the update version of the JDK. You can set environment variables through the control panel.
Download and install git for Windows.
Download and install Maven. Installing Maven means extracting the archive in a sensible place like your home or Program Files.
Set the Path environment variable to %Path%;C:\sensibleplace\apache-maven-3.5.2\bin.
Open a Windows command line window. You may find it by searching for cmd.
Download and compile AlvisNLP/ML:

git clone https://github.com/Bibliome/alvisnlp.git
cd alvisnlp
mvn clean install

@ArnaudFerre: could you try this on a native Windows machine?

PatternMatcher actions do not perform correctly

Rename module classes

ExportCadixeJSON -> AlvisAEExport

Improve the performance of XMLReader

XMLReader2 is awfully slow. Since its usage is universal, it should be improved.

Might need to switch from Xalan to Saxon.