entity-fishing's People

Contributors

dependabot[bot], kermitt2, lfoppiano, slashdacoda, tantikristanti

entity-fishing's Issues

Remove property file

In branch 0.0.3, we now use YAML config files, so remove this last bit of old-style config.

NPE when a PDF doesn't contain text

When a PDF doesn't contain text, GROBID correctly responds with NO_BLOCKS, but an NPE is then thrown somewhere:

05 Sep 2017 14:10.24 [DEBUG] NerdRestProcessFile       - >> received query to process: {"language":{"lang":"en"},"onlyNER":false,"resultLanguages":["de","fr"],"nbest":false,"customisation":"generic"}
05 Sep 2017 14:10.24 [DEBUG] IOUtilities               - >> set origin document for stateless service'...
05 Sep 2017 14:10.24 [DEBUG] NerdRestProcessFile       - >> input PDF file saved locally...
05 Sep 2017 14:10.24 [DEBUG] NerdRestProcessFile       - >> set query object...
05 Sep 2017 14:10.24 [DEBUG] NerdRestProcessFile       - >> language provided in query: en;1.0
05 Sep 2017 14:10.24 [DEBUG] DocumentSource            - start pdf2xml
05 Sep 2017 14:10.24 [DEBUG] DocumentSource            - Executing command: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  '/Users/lfoppiano/development/inria/grobid/grobid-home/tmp/origin8744748248510258475.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/davgUBsHYD.lxml]
05 Sep 2017 14:10.24 [DEBUG] DocumentSource            - Executing: [bash, -c, ulimit -Sv 6242304 && /Users/lfoppiano/development/inria/grobid/grobid-home/pdf2xml/mac-64/pdftoxml -blocks -noImageInline -fullFontName  -noImage  -annotation  '/Users/lfoppiano/development/inria/grobid/grobid-home/tmp/origin8744748248510258475.pdf' /Users/lfoppiano/development/inria/grobid/grobid-home/tmp/davgUBsHYD.lxml]
05 Sep 2017 14:10.24 [DEBUG] DocumentSource            - pdf2xml process finished. Time to process:32ms
05 Sep 2017 14:10.24 [ERROR] NerdRestProcessFile       - Cannot process input pdf file. 
org.grobid.core.exceptions.GrobidException: [NO_BLOCKS] PDF parsing resulted in empty content
	at org.grobid.core.document.Document.addTokenizedDocument(Document.java:408)
	at org.grobid.core.engines.Segmentation.processing(Segmentation.java:94)
	at com.scienceminer.nerd.service.NerdRestProcessFile.processQueryAndPdfFile(NerdRestProcessFile.java:110)
	at com.scienceminer.nerd.service.NerdRestService.processQueryJson(NerdRestService.java:128)
	at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
	at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
	at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
	at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:833)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
	at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:206)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:564)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
	at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:673)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:591)
	at java.lang.Thread.run(Thread.java:745)
05 Sep 2017 14:10.24 [INFO ] NerdRestProcessFile       - runtime: 57
05 Sep 2017 14:10.24 [ERROR] NerdRestProcessFile       - An unexpected exception occurs. 
java.lang.NullPointerException
	at java.util.Collections.sort(Collections.java:141)
	at com.scienceminer.nerd.service.NerdRestProcessFile.processQueryAndPdfFile(NerdRestProcessFile.java:359)
	at com.scienceminer.nerd.service.NerdRestService.processQueryJson(NerdRestService.java:128)
	at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
	at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
	at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
	at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:833)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
	at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:206)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:564)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
	at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:199)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:673)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:591)
	at java.lang.Thread.run(Thread.java:745)
05 Sep 2017 14:10.24 [DEBUG] NerdRestProcessFile       - << com.scienceminer.nerd.service.NerdRestProcessFile.methodLogOut
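
The second trace suggests that Collections.sort is reached with a null list after the GROBID failure. A minimal sketch of a defensive guard, assuming the list of entities can stay null when PDF parsing fails; the class and method names below are hypothetical and are not taken from the entity-fishing code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NullSafeSort {
    // Returns a sorted copy of the input, treating a null list as "no results".
    static <T extends Comparable<? super T>> List<T> sortedOrEmpty(List<T> items) {
        if (items == null) {
            return new ArrayList<>();
        }
        List<T> copy = new ArrayList<>(items);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<Integer> missing = null; // e.g. nothing extracted from an empty PDF
        System.out.println(sortedOrEmpty(missing));
        System.out.println(sortedOrEmpty(Arrays.asList(3, 1, 2)));
    }
}
```

With such a guard, an empty PDF would lead to an empty result rather than a second, unrelated exception.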

Try some more features

From the traditional features, we still need to experiment with:

  • prob_i conditional probability of the string given the concept (i.e. reverse prob_c - this is given by db pageLabel, currently not loaded in LMDB)

  • a lexical cohesion measure, e.g. log-likelihood, Dice coefficient or PMI

Not traditional, we also need to experiment with:

  • properties available in Wikidata, in particular P31 and P279
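
As an illustration of the lexical cohesion measures mentioned above, here is a minimal sketch of the Dice coefficient and PMI computed from raw counts. The inputs (occurrence counts of two terms, their co-occurrence count and a corpus total) and the class name are assumptions for the example; how these counts would be obtained from the knowledge base is not covered here.

```java
final class LexicalCohesion {
    // Dice coefficient: 2 * f(x,y) / (f(x) + f(y)), in [0, 1].
    static double dice(long countX, long countY, long countXY) {
        if (countX + countY == 0) {
            return 0.0;
        }
        return 2.0 * countXY / (countX + countY);
    }

    // Pointwise mutual information: log( p(x,y) / (p(x) * p(y)) ),
    // with probabilities estimated as counts divided by the corpus total.
    static double pmi(long countX, long countY, long countXY, long total) {
        if (countX == 0 || countY == 0 || countXY == 0 || total == 0) {
            return Double.NEGATIVE_INFINITY;
        }
        return Math.log((double) countXY * total / ((double) countX * countY));
    }

    public static void main(String[] args) {
        System.out.println(dice(10, 10, 5));
        System.out.println(pmi(10, 10, 1, 100));
    }
}
```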

Evaluation data pre-annotation

The pre-annotation process takes as input a directory of documents (text files, or PDFs whose content is first extracted into text files) and generates the XML annotation file using the current models.

The xml annotation file is then corrected by a human to be shared as gold standard.

The process should take the input location and an optional output location as parameters. If the output location is not specified, the XML is placed inside the input directory.
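
The defaulting rule for the output location could look like the sketch below. The class and method names are hypothetical; only the rule itself (fall back to the input directory) comes from the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class PreAnnotationPaths {
    // Uses the output directory when given, otherwise falls back to the input directory.
    static Path resolveOutputDir(String inputDir, String outputDir) {
        if (outputDir == null || outputDir.trim().isEmpty()) {
            return Paths.get(inputDir);
        }
        return Paths.get(outputDir);
    }

    public static void main(String[] args) {
        System.out.println(resolveOutputDir("corpus", null));
        System.out.println(resolveOutputDir("corpus", "annotated"));
    }
}
```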

Servlet container package

Generate a build for a servlet container, for example tomcat, ideally

  • the data (directory /data) should be packaged as a zip
  • the application should be packaged as war (logs should be correctly tokenised for tomcat)
  • the grobid home

We also need to correctly tokenise grobid-home, grobid.properties and data directory from tomcat to a specific location

Check customisation

The customisation is not behaving as expected (or maybe I didn't understand it). Here is the example:

POST /customisation

value:

{
	"customisation": {
		"wikipedia": [
			105942, 1499966, 4105431
		],
		"lang": "fr",
		"texts": [
			"Place de la République, Hotel Moderne, vaste batisse où étaient logées les petites souris grises, d’autres disent « les Salamandres », jeunes allemandes en uniforme. Elles partent et elles ne pouvait emporter qu’un léger bagage à la main. Jetons aux combinards de Vichy et de Washington, en défi, une tête de traître."
		],
		"description": "customisation for the ww2 french liberation"
	}
}

name: ww2fr

But then when analysing the sentence:

Jetons aux combinards de Vichy et de Washington, en défi, une tête de traître.

Vichy is not recognised as Regime de Vichy but as Vichy the town, even though I have added the Wikipedia id of the Regime de Vichy to the customisation.

Support acronyms

Detect acronyms introduced (explicitly or not) in a document, and maintain them as possible mentions in the current document.

Examples: frequent for names of species (C. Lupus, C. n. gregoryi), or "Cigarette smoke (CS)-induced"
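
A minimal sketch of the explicit case, where an acronym is introduced in parentheses right after its long form, as in "Cigarette smoke (CS)-induced". The heuristic used here (match each acronym letter against the initial of the preceding words) is an assumption chosen for illustration and is deliberately naive; it does not handle abbreviated species names such as "C. Lupus" or implicit introductions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AcronymFinder {
    // An upper-case token of 2 to 10 letters in parentheses, e.g. "(CS)".
    private static final Pattern PAREN_ACRONYM = Pattern.compile("\\(([A-Z]{2,10})\\)");

    // Maps each detected acronym to the long form found just before it.
    static Map<String, String> find(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAREN_ACRONYM.matcher(text);
        while (m.find()) {
            String acronym = m.group(1);
            String before = text.substring(0, m.start()).trim();
            if (before.isEmpty()) {
                continue;
            }
            String[] words = before.split("\\s+");
            int n = acronym.length();
            if (words.length < n) {
                continue;
            }
            // Compare acronym letters with the initials of the n preceding words.
            boolean matches = true;
            StringBuilder longForm = new StringBuilder();
            for (int i = 0; i < n; i++) {
                String word = words[words.length - n + i];
                if (Character.toUpperCase(word.charAt(0)) != acronym.charAt(i)) {
                    matches = false;
                    break;
                }
                if (i > 0) {
                    longForm.append(' ');
                }
                longForm.append(word);
            }
            if (matches) {
                result.put(acronym, longForm.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(find("Cigarette smoke (CS)-induced inflammation"));
    }
}
```

The resulting map could then be kept for the duration of the document so that later occurrences of the acronym are treated as mentions of the long form.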

/disambiguate service with content type application/json

Currently the /disambiguate service supports only multipart/form-data as content-type, even when the query parameter is plain JSON with no PDF file.
We should also support application/json as content-type when the request has only a json parameter without PDF file.

Migration to gradle build

Since we are starting the migration with GROBID, it is good to have a task for entity-fishing as well, so that we don't forget.

Support more small morphosyntactic variations

Wikipedia redirects and anchors cover most of the frequent morphosyntactic variants (e.g. plural), but not in an exhaustive manner - we could add a process (or pre-process) to support them.

Improve error message returned in the interface

The error message currently returned is "Error encountered while requesting the server.", which is confusing (as the server is fine) when the problem is the input.

The error message should at least say whether the problem is the system, the input data or some 3rd party service (if applicable).

inconsistent model used to train ranker for french and german

There is an inconsistency between German, French and English.
It seems that the German and French rankers haven't been migrated to the GradientTreeBoost model, so at the moment they throw a CCE:

08 Dec 2017 11:53.12 [DEBUG] NerdEngine                - Fail to compute ranker score.
java.lang.ClassCastException: smile.regression.RandomForest cannot be cast to smile.regression.GradientTreeBoost
	at com.scienceminer.nerd.disambiguation.NerdRanker.getProbability(NerdRanker.java:139)
	at com.scienceminer.nerd.disambiguation.NerdEngine.rank(NerdEngine.java:964)
	at com.scienceminer.nerd.disambiguation.NerdEngine.disambiguate(NerdEngine.java:242)
	at com.scienceminer.nerd.service.NerdRestProcessQuery.processQueryText(NerdRestProcessQuery.java:154)
	at com.scienceminer.nerd.service.NerdRestProcessQuery.processQuery(NerdRestProcessQuery.java:50)
	at com.scienceminer.nerd.service.NerdRestService.processQueryJson(NerdRestService.java:129)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
	at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
	at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
	at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:833)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1650)
	at org.eclipse.jetty.websocket.server.WebSocketUpgradeFilter.doFilter(WebSocketUpgradeFilter.java:206)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1637)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:190)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:188)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:168)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:166)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:564)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:317)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:110)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.Invocable.invokePreferred(Invocable.java:128)
	at org.eclipse.jetty.util.thread.Invocable$InvocableExecutor.invoke(Invocable.java:222)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:294)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:673)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:591)
	at java.lang.Thread.run(Thread.java:745)

Implement selective sentence range for processing

To be checked whether

  1. I'm doing something wrong, or
  2. the documentation needs to be modified.

Here are the issues:

  1. Ranges do not seem to be supported:

e.g. in the following query:

{
    "onlyNER": false,
    "nbest": false,
    "text": "We are heading to Washington. The cat is on the Table in Milan.",
    "processSentence": [0-1], 
    "sentences": [
        {
            "offsetStart": 0,
            "offsetEnd": 29
        },
        {
            "offsetStart": 29,
            "offsetEnd": 63
        }
    ]
}

the "processSentence":[0-1] would result in

Caused by: com.fasterxml.jackson.databind.JsonMappingException: Unexpected character ('-' (code 45)): was expecting comma to separate Array entries

Two solutions: either we (a) allow only integers like [0,1,2,3], or (b) we change that item to strings like ['0','1-2']

  2. The processSentence parameter seems to be ignored. For example, the following query should return only Washington:
{
    "onlyNER": false,
    "nbest": false,
    "text": "We are heading to Washington. The cat is on the Table in Milan.",
    "processSentence": [0], 
    "sentences": [
        {
            "offsetStart": 0,
            "offsetEnd": 29
        },
        {
            "offsetStart": 29,
            "offsetEnd": 63
        }
    ]
}
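
If option (b) from the first point were chosen, the string items would need to be expanded into sentence indices on the server side. A minimal sketch, assuming items are either a single index ("0") or an inclusive range ("1-2"); the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class SentenceRanges {
    // Expands items such as "0" or "1-2" into a sorted set of sentence indices.
    static SortedSet<Integer> expand(List<String> specs) {
        SortedSet<Integer> indices = new TreeSet<>();
        for (String spec : specs) {
            String s = spec.trim();
            int dash = s.indexOf('-');
            if (dash < 0) {
                indices.add(Integer.parseInt(s));
            } else {
                int start = Integer.parseInt(s.substring(0, dash).trim());
                int end = Integer.parseInt(s.substring(dash + 1).trim());
                if (start > end) {
                    throw new IllegalArgumentException("Invalid range: " + spec);
                }
                for (int i = start; i <= end; i++) {
                    indices.add(i);
                }
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        System.out.println(expand(Arrays.asList("0", "1-2")));
    }
}
```

Malformed items raise an IllegalArgumentException (NumberFormatException is a subclass), which the service could translate into a client-side error rather than a JSON parsing failure.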

Experiment with an entity vector approach

Creating entity vectors is pretty straightforward, and they could be used as an additional/alternative context relevance measure. A context is modelled as the centroid of the vectors representing its words (v_context), and the relevance of a given entity e (with vector v_e) is the cosine cos(v_context, v_e). One clear advantage over the relatedness measure is that it will be much faster.

See http://www.di.unipi.it/~ottavian/files/wsdm15_fel.pdf or https://github.com/ot/entity2vec
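
The relevance measure described above can be sketched as follows, assuming word and entity vectors are already available as arrays of the same dimension; the class name is hypothetical and the way vectors are trained or stored is out of scope here.

```java
final class EntityVectorRelevance {
    // Centroid (component-wise mean) of the word vectors of a context.
    static double[] centroid(double[][] wordVectors) {
        if (wordVectors.length == 0) {
            throw new IllegalArgumentException("At least one word vector is required");
        }
        int dim = wordVectors[0].length;
        double[] c = new double[dim];
        for (double[] v : wordVectors) {
            for (int i = 0; i < dim; i++) {
                c[i] += v[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            c[i] /= wordVectors.length;
        }
        return c;
    }

    // Cosine similarity between two vectors; 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] vContext = centroid(new double[][] {{1.0, 0.0}, {0.0, 1.0}});
        double[] vEntity = {1.0, 1.0};
        System.out.println(cosine(vContext, vEntity));
    }
}
```

Each candidate entity costs one dot product against a centroid computed once per context, which is where the speed advantage would come from.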

Hyphenation in PDF

Dehyphenation in PDF is a bit more complicated to manage than in text because we have to keep track of the coordinates via multiple layout tokens for a single text token.
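
To make the bookkeeping concrete, here is a minimal sketch in which a merged text token keeps references to every layout token it was built from. The LayoutToken and TextToken classes are hypothetical stand-ins, not the GROBID classes, and the merge rule (a token ending with a hyphen followed by a token on another line) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class Dehyphenation {
    // Hypothetical stand-in for a PDF layout token: text plus page coordinates.
    static final class LayoutToken {
        final String text;
        final double x;
        final double y;

        LayoutToken(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // One logical text token that remembers all layout tokens it came from.
    static final class TextToken {
        final String text;
        final List<LayoutToken> sources;

        TextToken(String text, List<LayoutToken> sources) {
            this.text = text;
            this.sources = sources;
        }
    }

    // Joins "exam-" + "ple" across a line break while keeping both coordinates.
    static List<TextToken> merge(List<LayoutToken> tokens) {
        List<TextToken> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            LayoutToken cur = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext && cur.text.length() > 1 && cur.text.endsWith("-")
                    && tokens.get(i + 1).y != cur.y) {
                LayoutToken next = tokens.get(i + 1);
                String joined = cur.text.substring(0, cur.text.length() - 1) + next.text;
                out.add(new TextToken(joined, Arrays.asList(cur, next)));
                i += 2;
            } else {
                out.add(new TextToken(cur.text, Collections.singletonList(cur)));
                i += 1;
            }
        }
        return out;
    }
}
```

An annotation on the merged token can then be projected back onto the page as two boxes, one per source layout token.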

Not able to build nerd successfully

Hello Team,

Whenever I run mvn clean install to build the project, a BUILD FAILURE occurs with the following error:

[ERROR] Tests run: 6, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 16.975 s <<< FAILURE! - in com.scienceminer.nerd.disambiguation.TestProcessText
[ERROR] testProcess(com.scienceminer.nerd.disambiguation.TestProcessText)  Time elapsed: 4.156 s  <<< ERROR!
java.lang.NoSuchFieldError: year
	at com.scienceminer.nerd.disambiguation.TestProcessText.testProcess(TestProcessText.java:50)

[INFO] Running com.scienceminer.nerd.disambiguation.SentenceTest
[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 s - in com.scienceminer.nerd.disambiguation.SentenceTest
[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Errors: 
[ERROR]   TestProcessText.testProcess:50 » NoSuchField year
[INFO] 
[ERROR] Tests run: 30, Failures: 0, Errors: 1, Skipped: 0

What does [ERROR] TestProcessText.testProcess:50 » NoSuchField year mean?

What should I do to get BUILD SUCCESS?

Unable to train with Wikipedia

Hi @lfoppiano,

I am trying to train with Wikipedia articles.
I used the command below:

bash > mvn compile exec:exec -Ptrain_annotate_en

$ mvn compile exec:exec -Ptrain_annotate_en
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for com.scienceminer.nerd:nerd-service:war:0.0.2
[WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:exec-maven-plugin is missing. @ line 256, column 29
[WARNING] 'build.plugins.plugin.version' for org.apache.maven.plugins:maven-jar-plugin is missing. @ line 47, column 21
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/maven-metadata.xml
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/maven-metadata.xml (741 B at 0.3 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/1.6.0/exec-maven-plugin-1.6.0.pom
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/1.6.0/exec-maven-plugin-1.6.0.pom (13 KB at 11.0 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/mojo/mojo-parent/40/mojo-parent-40.pom
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/mojo/mojo-parent/40/mojo-parent-40.pom (33 KB at 20.3 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/1.6.0/exec-maven-plugin-1.6.0.jar
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/mojo/exec-maven-plugin/1.6.0/exec-maven-plugin-1.6.0.jar (57 KB at 25.0 KB/sec)
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building (N)ERD 0.0.2
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for org.grobid.ner:grobid-ner:jar:0.4.3-SNAPSHOT is missing, no dependency information available
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ nerd-service ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 5 resources
[INFO]
[INFO] --- maven-compiler-plugin:3.6.1:compile (default-compile) @ nerd-service ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-jar-plugin:3.0.2:jar (make-a-jar) @ nerd-service ---
[INFO] Building jar: /Users/kvincent1/Desktop/Factbot-simplest/nerd/target/nerd-service-0.0.2.jar
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:exec (default-cli) @ nerd-service ---
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-component-annotations/1.5.4/plexus-component-annotations-1.5.4.pom
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-component-annotations/1.5.4/plexus-component-annotations-1.5.4.pom (815 B at 1.0 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.5.4/plexus-containers-1.5.4.pom
Downloaded: https://repo.maven.apache.org/maven2/org/codehaus/plexus/plexus-containers/1.5.4/plexus-containers-1.5.4.pom (5 KB at 5.1 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.pom
Downloaded: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.pom (11 KB at 11.8 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/commons/commons-parent/35/commons-parent-35.pom
Downloaded: https://repo.maven.apache.org/maven2/org/apache/commons/commons-parent/35/commons-parent-35.pom (57 KB at 22.8 KB/sec)
Downloading: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar
Downloaded: https://repo.maven.apache.org/maven2/org/apache/commons/commons-exec/1.3/commons-exec-1.3.jar (54 KB at 15.2 KB/sec)

init upper level language independent environment
building Environment for upper knowledge base
Environment built - 9155139 concepts.
init Environment for language en
building Environment for language en
isLoaded: true
Environment built - 14651883 pages.
domains en / isLoaded: true
Warning: Orthopedic surgery is not a category found in Wikipedia.
Warning: Environment is not a category found in Wikipedia.
init Environment for language de
building Environment for language de
isLoaded: true
Environment built - 3523959 pages.
init Environment for language fr
building Environment for language fr
isLoaded: true
Environment built - 3631810 pages.

GROBID_HOME=/Users/kvincent1/Desktop/Factbot-simplest/grobid/grobid-home
building full markup database for language en
markupFull / isLoaded: false
com.scienceminer.nerd.exceptions.NerdResourceException: Markup file not found
at com.scienceminer.nerd.kb.db.MarkupDatabase.loadFromXmlFile(MarkupDatabase.java:108)
at com.scienceminer.nerd.kb.db.KBLowerEnvironment.buildFullMarkup(KBLowerEnvironment.java:291)
at com.scienceminer.nerd.kb.LowerKnowledgeBase.loadFullContentDB(LowerKnowledgeBase.java:74)
at com.scienceminer.nerd.training.WikipediaTrainer.<init>(WikipediaTrainer.java:60)
at com.scienceminer.nerd.training.WikipediaTrainer.main(WikipediaTrainer.java:130)
Create article sets...
Article sample is empty for set 0
Article sample is empty for set 1
Article sample is empty for set 2
Article sample is empty for set 3
Article sample is empty for set 4
Create Ranker arff files...
Exception in thread "main" java.lang.NullPointerException
at com.scienceminer.nerd.disambiguation.NerdRanker.train(NerdRanker.java:168)
at com.scienceminer.nerd.training.WikipediaTrainer.createRankerArffFiles(WikipediaTrainer.java:92)
at com.scienceminer.nerd.training.WikipediaTrainer.main(WikipediaTrainer.java:136)
[ERROR] Command execution failed.
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:404)
at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:166)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:804)
at org.codehaus.mojo.exec.ExecMojo.executeCommandLine(ExecMojo.java:751)
at org.codehaus.mojo.exec.ExecMojo.execute(ExecMojo.java:313)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 13:23 min
[INFO] Finished at: 2017-09-02T19:53:10+05:30
[INFO] Final Memory: 18M/178M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec (default-cli) on project nerd-service: Command execution failed. Process exited with an error: 1 (Exit value: 1) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

Entities supplied in the query should be taken into consideration even when only a Wikidata ID is present

As in the title; this is a continuation of #20.

Example (when the wikipediaRefId is removed, the entity is not taken into consideration):

   {
       "text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August.",
       "language": {
           "lang": "en"
       },
       "entities": [
            {
               "rawName": "German Army",
               "offsetStart": 1107,
               "offsetEnd": 1118,
               "wikipediaExternalRef": 11702744,
               "wikidataId": "Q701923"
            }
       ]
   }
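To reproduce the problem, the Wikidata-only variant of a query can be derived from the full one. The following is a minimal sketch, assuming the Wikipedia reference is carried by the `wikipediaExternalRef` field as in the query above; the helper name `wikidata_only` is hypothetical and not part of entity-fishing.

```python
import copy
import json


def wikidata_only(query):
    """Return a copy of the query whose entities no longer carry a Wikipedia reference."""
    stripped = copy.deepcopy(query)  # leave the caller's query untouched
    for entity in stripped.get("entities", []):
        entity.pop("wikipediaExternalRef", None)
    return stripped


# Query copied from the issue above.
query = {
    "text": "Austria invaded and fought the Serbian army at the Battle of Cer "
            "and Battle of Kolubara beginning on 12 August.",
    "language": {"lang": "en"},
    "entities": [
        {
            "rawName": "German Army",
            "offsetStart": 1107,
            "offsetEnd": 1118,
            "wikipediaExternalRef": 11702744,
            "wikidataId": "Q701923",
        }
    ],
}

print(json.dumps(wikidata_only(query), indent=2))
```

Sending both variants to the service and comparing the returned entities should show whether the Wikidata-only entity is honoured.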

Start NERD on port 8080

Hi,

How do I start NERD on a specific port?

I currently have entity-fishing running on port 8090.

Generic build deployment

  • Add maven task to generate the data zip file
  • Check whether the web.xml or configuration.properties needs to be adapted
  • Check documentation

Parameter for Cross-Origin Resource Sharing

Currently, CORS is allowed for any domain by default for all entity-fishing services.

Add and support a parameter in the YAML config file to enable or disable CORS, either for any domain (*) or for selected domains only.
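The decision logic behind such a parameter could look like the sketch below. It is only an illustration: the setting's shape (`"*"`, a list of origins, or an empty value to disable CORS) and the function name `allowed_origin` are assumptions, not an existing entity-fishing option.

```python
def allowed_origin(origin, cors_setting):
    """Return the Access-Control-Allow-Origin value to send, or None to refuse.

    cors_setting is a hypothetical config value:
      - "*"              : any domain is allowed
      - a list of origins: only those domains are allowed
      - None or empty    : CORS is disabled
    """
    if not cors_setting:
        return None
    if cors_setting == "*":
        return "*"
    if isinstance(cors_setting, str):
        # A single origin given as a plain string.
        cors_setting = [cors_setting]
    return origin if origin in cors_setting else None


print(allowed_origin("http://example.org", "*"))
print(allowed_origin("http://example.org", ["http://example.org"]))
print(allowed_origin("http://other.org", ["http://example.org"]))
```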

Move embeddings from memory to LMDB

In the current implementation, the embeddings are loaded in memory, which takes 1.5-2 GB for each language.

More details to be added:

  • input files are the .vec files, which should be referenced in the same way as the Wikipedia files
  • remove the compression, keep the quantisation; the data will be loaded directly into LMDB.
  • LMDB information will be included in the language-specific resources
  • remove loading of the embeddings/${lang}/ embeddings
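To illustrate what "keep the quantisation" could mean for values stored in LMDB, here is a minimal sketch of a generic linear 8-bit quantisation of one `.vec` line. The scheme actually used by entity-fishing is not described in this issue, so the layout below (one float32 scale followed by one signed byte per dimension) and all function names are assumptions.

```python
import struct


def parse_vec_line(line):
    """Split a '.vec' text line ('word v1 v2 ...') into the word and its float vector."""
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]


def quantise(vector):
    """Pack a vector as a float32 scale followed by one signed byte per dimension."""
    max_abs = max((abs(v) for v in vector), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in vector]
    return struct.pack("<f%db" % len(codes), scale, *codes)


def dequantise(blob):
    """Inverse of quantise(): recover an approximate float vector."""
    dims = len(blob) - 4  # the first 4 bytes hold the float32 scale
    scale, *codes = struct.unpack("<f%db" % dims, blob)
    return [c * scale for c in codes]


word, vector = parse_vec_line("paris 0.5 -1.0 0.25\n")
blob = quantise(vector)  # bytes usable as a value in a key-value store
print(word, len(blob), dequantise(blob))
```

Each vector then takes 4 bytes plus one byte per dimension instead of four bytes per dimension, at the cost of a small rounding error.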

Add multithread for data preparation

As in title, we could process each page (or each language) in parallel.

Note: This issue is not urgent as normal users will get the pre-processed data already.
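As a rough illustration of per-language parallelism, the sketch below runs one task per language in a thread pool. `prepare_language` is a hypothetical stand-in for the real preparation step, and the sketch says nothing about how the actual data preparation code is organised.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_language(lang):
    """Hypothetical stand-in for the per-language data preparation."""
    return lang, "prepared"


def prepare_all(languages, max_workers=4):
    """Run the preparation of each language in parallel and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the input order of the languages.
        return dict(pool.map(prepare_language, languages))


print(prepare_all(["en", "de", "fr"]))
```

The same pattern applies at page level, provided the work on each page is independent of the others.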

Disambiguation in French: Charles Ier (Charlemagne)

Writing this down so as not to forget. Here is an example to check, taken from the page on Charlemagne:

Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.

The mention 'Charles Ier' is disambiguated as Charles Ier (empereur d'Autriche).

When searching for it in the term lookup, there is no confidence score and the ID does not point to the right Wikipedia page (although the Wikidata ID works fine):

[screenshot, 2017-09-01 at 17 21 59]

To be checked.

Adapt IITB corpus format

Currently, in branch 0.0.3, the IITB corpus is not in exactly the same format as the other corpora, so the evaluation does not work on it. Its annotations are given without distinguishing the document in which they occur, whereas the current evaluation requires this.
A bit of XML massage on the file iitb.xml is required :)
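The kind of XML massage meant here could be sketched as follows: a flat list of annotations is regrouped under one element per document. The element names (`annotation`, `docName`, `wikiName`, `corpus`, `document`) are assumptions made for the example, and neither the real iitb.xml schema nor the target format of the other corpora is specified in this issue.

```python
import xml.etree.ElementTree as ET


def group_by_document(flat_xml):
    """Regroup flat <annotation> elements under one <document> element per docName."""
    root = ET.fromstring(flat_xml)
    grouped = {}
    for annotation in root.iter("annotation"):
        doc_name = annotation.findtext("docName")
        grouped.setdefault(doc_name, []).append(annotation)

    corpus = ET.Element("corpus")
    for doc_name, annotations in grouped.items():
        document = ET.SubElement(corpus, "document", {"docName": doc_name})
        for annotation in annotations:
            # The document name now lives on the parent element.
            annotation.remove(annotation.find("docName"))
            document.append(annotation)
    return corpus


# Hypothetical input, only for illustration.
sample = """<annotations>
  <annotation><docName>a.txt</docName><wikiName>Paris</wikiName></annotation>
  <annotation><docName>b.txt</docName><wikiName>Rome</wikiName></annotation>
  <annotation><docName>a.txt</docName><wikiName>France</wikiName></annotation>
</annotations>"""

print(ET.tostring(group_by_document(sample), encoding="unicode"))
```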

Add Italian and Spanish LowerKnowledgeBase

  • create DefaultConfigItWp and DefaultConfigEsWp for the mediawiki parser
  • generate resources (in grisp project) and loader (here)
  • make available LMDB data
  • create models

Plug into GERBIL

Branch 0.0.3 contains a corpus-based evaluation together with most of the usual NED corpora (ACE, AQUAINT, AIDA-CONLL, MSNBC, ...).

However, it would be good to plug the tool into GERBIL for third-party evaluation and comparison with other entity disambiguation tools.

http://aksw.org/Projects/GERBIL.html

(but this cannot replace the existing evaluation, as more detailed eval and intermediary results are in practice needed)

Collapsible element for list of statements

In the demo console, all the statements are listed in the infobox, which sometimes makes it ridiculously large. We need to put the statements in a collapsible element to get a cleaner infobox.

query with onlyNER=true returns an entity without type

The following query:

{"text": "Un compte rendu de Jean-Guillaume Lanuque (avec l'aide amicale de Christian Beuvain) C\u2019est un pan plut\u00f4t m\u00e9connu de l\u2019histoire de la Russie r\u00e9volutionnaire en ses premi\u00e8res ann\u00e9es que Giles Milton nous \u00e9claire, celui de l\u2019action des services secrets britanniques, le fameux MI6 (n\u00e9 juste avant la Premi\u00e8re Guerre mondiale, sous la houlette de Mansfield Cumming), en terre russe. Pour ce faire, l\u2019auteur a mis \u00e0 profit des sources g\u00e9n\u00e9ralement in\u00e9dites en langue fran\u00e7aise, m\u00e9moires \u00e9crits des agents et documents d\u2019archives. Le tout est pr\u00e9sent\u00e9 comme un roman, sans notes de bas de page (les r\u00e9f\u00e9rences sont concentr\u00e9es en fin d\u2019ouvrage), et si la lecture en est d\u2019autant plus ais\u00e9e, un sentiment g\u00eanant se d\u00e9gage assez rapidement\u00a0: Giles Milton prend en effet le plus souvent pour argent comptant tout ce que lui apprennent ses sources \u2013 alors que des t\u00e9moignages, d'agents secrets qui plus est, demandent au minimum une critique interne pouss\u00e9e \u2013 et ne fait pas l\u2019effort de les croiser de mani\u00e8re syst\u00e9matique1\u00a0; il fait \u00e9galement sienne la pr\u00e9vention des dirigeants britanniques \u00e0 l\u2019\u00e9gard des bolcheviques, syst\u00e9matiquement pr\u00e9sent\u00e9s ici sous un jour n\u00e9gatif, gestionnaires incomp\u00e9tents, violents et barbares (le terme de terrorisme revient aussi \u00e0 plusieurs reprises)2.  En dehors de ce parti pris pesant, le livre de Giles Milton se pr\u00e9sente comme une synth\u00e8se, bien qu'incompl\u00e8te. 
Elle d\u00e9bute avec la participation des services secrets britanniques \u00e0 l\u2019assassinat de Raspoutine (sans d\u2019ailleurs approfondir le sujet en d\u00e9tail), et se poursuit avec la mission de Somerset Maugham \u2013 agent secret, romancier c\u00e9l\u00e8bre, dramaturge \u2013 charg\u00e9 d\u2019apporter le soutien (y compris financier) des anglo-saxons \u00e0 Kerenski, par crainte de la d\u00e9fection russe dans la guerre. Avec l\u2019arriv\u00e9e au pouvoir des bolcheviques, le Royaume-Uni dispose d\u00e9j\u00e0 d\u2019un homme dans la place\u00a0: Arthur Ransome, journaliste occupant une position privil\u00e9gi\u00e9e car proche des nouveaux dirigeants (dont Karl Radek, qui le pr\u00e9senta \u00e0 L\u00e9nine et Trotsky), sympathisant de la r\u00e9volution d'octobre, mais qui, selon l'auteur, aurait jou\u00e9 un r\u00f4le d\u2019agent double (en renseignant le sous-secr\u00e9taire du Foreign Office, Lord Robert Cecil), tout en se mettant en couple avec Evguenia Chelepina, secr\u00e9taire de Trotsky (p. 90-93). Mansfield Cumming, dans le m\u00eame temps, d\u00e9pla\u00e7a son bureau russe \u00e0 Stockholm, et envoya en Russie m\u00eame deux agents, Sidney Reilly et George Hill. Les d\u00e9tails sur les identit\u00e9s multiples \u00e0 b\u00e2tir, les pr\u00e9cautions \u00e0 prendre ou les m\u00e9thodes \u00e0 utiliser afin de transmettre des messages sont ici dignes d\u2019un roman d'espionnage. Le r\u00e9seau mont\u00e9 par Sidney Reilly, en particulier, comprenait des individus bien introduits dans le syst\u00e8me de pouvoir, ainsi du colonel Aleksandr V. Friede, un Letton qui travaillait au Commissariat du peuple \u00e0 la guerre, sans compter Boris Bajanov, qui selon ses m\u00e9moires, sur lesquelles s'appuie l'auteur3, fut un agent double d\u00e8s son adh\u00e9sion au parti, en 1919. 
M\u00eame si Giles Milton n\u2019aborde pas les choses sous cet angle, on a l\u00e0 autant d\u2019\u00e9l\u00e9ments, impliquant les Britanniques, qui permettent de comprendre, au moins en partie, la m\u00e9fiance croissante des nouveaux dirigeants face \u00e0 un cercle d'ennemis bien r\u00e9els, et le r\u00f4le exponentiel d\u00e9volu \u00e0 la Tcheka.   Avec le d\u00e9but de l\u2019intervention \u00e9trang\u00e8re en Russie, afin d\u2019aider les forces anti-bolcheviques, Hill et Reilly basculent compl\u00e8tement dans la clandestinit\u00e9. Leur action se concentre alors sur le renseignement et le sabotage. Reilly va jusqu\u2019\u00e0 concevoir un projet de coup d\u2019\u00c9tat visant \u00e0 renverser le pouvoir bolchevique. Pour ce faire, Giles Milton pr\u00e9tend qu'il se serait attir\u00e9 la complicit\u00e9 d\u2019\u00c9douard Berzine4, commandant du premier r\u00e9giment letton de fusiliers (sous un pr\u00e9texte tellement l\u00e9ger \u2013 le souhait de rentrer au pays \u2013 qu\u2019il frise sans doute la manipulation), ainsi que des financements fran\u00e7ais et \u00e9tatsuniens\u00a0; un gouvernement provisoire avait m\u00eame \u00e9t\u00e9 \u00e9labor\u00e9, avec la participation de Ioudenitch. Toutefois, ce plan ambitieux ne rentra jamais en application, devanc\u00e9 \u00e0 quelques jours pr\u00e8s par les assassinats (ou tentatives) perp\u00e9tr\u00e9s sur Moiss\u00e9i Ouritski5 et L\u00e9nine, le 30 ao\u00fbt 1918. Il fut \u00e9galement d\u00e9nonc\u00e9 par Ren\u00e9 Marchand, correspondant du Figaro ralli\u00e9 aux bolcheviques6, ce qui entra\u00eena la prise de contr\u00f4le de l\u2019ambassade britannique \u00e0 Moscou et toute une s\u00e9rie d\u2019arrestations, parmi lesquelles celle du diplomate Robert Bruce Lockhart, plus tard \u00e9chang\u00e9 (avec George Hill) contre Litvinov, alors repr\u00e9sentant des Soviets en Grande-Bretagne, arr\u00eat\u00e9 pour espionnage\u00a0; Reilly, lui, r\u00e9ussit \u00e0 fuir la Russie. 
Cela n\u2019emp\u00eachera pas Hill comme Reilly de repartir en Russie, aupr\u00e8s de Denikine, tout comme Paul Dukes, expert en grimages et d\u00e9guisements, affect\u00e9 \u00e0 Petrograd (il y r\u00e9ussit \u00e0 adh\u00e9rer au Parti et \u00e0 devenir bri\u00e8vement d\u00e9l\u00e9gu\u00e9 au soviet). On reste toutefois quelque peu d\u00e9contenanc\u00e9 par la qualit\u00e9 des informations que ces derniers ou Arthur Ransome recueillent, tout au moins telles que Giles Milton nous les pr\u00e9sente (la volont\u00e9 d\u2019une r\u00e9volution mondiale est loin d\u2019\u00eatre un secret\u00a0!). Par contre, les pages \u00e9voquant Churchill sont nettement plus int\u00e9ressantes. On apprend en effet que le secr\u00e9taire d\u2019\u00c9tat \u00e0 la guerre, fervent anticommuniste et partisan d\u2019une lutte soutenue contre le pouvoir bolchevique, poussa \u00e0 l\u2019emploi d\u2019armes chimiques dernier cri, effectivement utilis\u00e9es \u00e0 la fin de l\u2019\u00e9t\u00e9 1919 dans le nord de la Russie\u00a0; le m\u00eame \u00e9tait d\u2019ailleurs pr\u00eat \u00e0 les employer \u00e9galement contre les Indiens en r\u00e9volte\u2026  Parall\u00e8lement \u00e0 la situation \u00e0 Moscou et Petrograd, Giles Milton \u00e9voque \u00e9galement largement l\u2019Asie centrale. Le Royaume-Uni s\u2019inqui\u00e9tait en effet d\u2019une possible contagion r\u00e9volutionnaire dans les zones \u00e0 la fronti\u00e8re nord de l'Inde, risquant de menacer le fleuron de son empire, ce qui explique l\u2019envoi d\u2019une mission de renseignements au Turkestan russe, compos\u00e9e principalement de Frederick Bailey et Stewart Blacker. 
Mais sur la situation \u00e0 Tachkent \u2013 centre administratif du Turkestan, o\u00f9 les bolcheviques, majoritaires au Soviet de cette ville de 200 000 habitants, ont pris le pouvoir le 1er novembre 1917 \u2013 le propos est laconique, \u00e9voquant surtout l\u2019isolement, la mauvaise situation \u00e9conomique et les efforts de recrutement de prisonniers autrichiens dans l\u2019Arm\u00e9e rouge\u2026 Tr\u00e8s vite menac\u00e9, Bailey change d\u2019identit\u00e9 et prend celle d\u2019un prisonnier autrichien, alors qu\u2019au printemps 1919, une r\u00e9volte secoue l\u2019Afghanistan. S\u2019engageant dans la Tcheka, Bailey parvient finalement \u00e0 regagner les terres britanniques au prix d\u2019une travers\u00e9e du d\u00e9sert de Karakoum. La main est alors reprise par le g\u00e9n\u00e9ral Wilfrid Malleson, qui, gr\u00e2ce \u00e0 tout un r\u00e9seau et \u00e0 un vaste travail de d\u00e9sinformation, parviendra \u00e0 faire se d\u00e9grader les relations russo-afghanes. Le r\u00e9volutionnaire indien M.N. Roy est \u00e9galement \u00e9voqu\u00e9, charg\u00e9 qu\u2019il fut d\u2019un projet de formation militaire, \u00e0 compter de la fin 1920, afin de pr\u00e9parer le soutien \u00e0 l\u2019insurrection en Inde, un projet finalement abandonn\u00e9 \u00e0 l\u2019occasion de l\u2019accord anglo-sovi\u00e9tique de mars 1921. Mais l\u00e0 encore, le parti pris de Giles Milton nous emp\u00eache d\u2019en apprendre davantage sur ce sujet, le parcours ult\u00e9rieur de Roy \u00e9tant plus que laconique sous sa plume7.  
Si les guerres secr\u00e8tes, l'utilisation d'agents clandestins et les entreprises de d\u00e9sinformation/manipulation ne doivent pas \u00eatre n\u00e9glig\u00e9es, surtout dans les p\u00e9riodes de r\u00e9volutions politiques et de ruptures sociales, encore faut-il que leur histoire soit abord\u00e9e d'une mani\u00e8re scientifique, encore plus rigoureusement que d'autres \u00e9v\u00e9nements, eu \u00e9gard au caract\u00e8re sp\u00e9cifique et myst\u00e9rieux de cet objet historique. L'ouvrage de Giles Milton ne r\u00e9pond gu\u00e8re \u00e0 ces crit\u00e8res. On lui pr\u00e9f\u00e9rera la biographie de Reginald Teague-Jones, un de ses hommes de l'ombre des services secrets britanniques en Russie sovi\u00e9tique, par l'historienne Taline Ter Minassian8.  1On peut ainsi citer un \u00ab\u00a0Conseil supr\u00eame militaire bolchevique\u00a0\u00bb (p. 113), pr\u00e9sent\u00e9 comme le c\u0153ur de l\u2019organisation bolchevique, ou Trotsky ayant dirig\u00e9 l\u2019assaut contre Cronstadt\u2026  2C\u2019est au point que Giles Milton accuse implicitement les bolcheviques d\u2019\u00eatre responsables du d\u00e9clenchement des hostilit\u00e9s une fois les premi\u00e8res forces britanniques d\u00e9barqu\u00e9es au nord du pays\u00a0! (p. 147).  3Boris Bajanov, Avec Staline dans le Kremlin, Paris, Les \u00c9ditions de France, 1930, 263 p. R\u00e9\u00e9dit\u00e9 en 1979 sous le titre Bajanov r\u00e9v\u00e8le Staline. Souvenirs d'un ancien secr\u00e9taire de Staline, chez Gallimard. A lire Giles Milton, Bajanov fut d\u00e8s 1920 \u00ab\u00a0secr\u00e9taire de l'appareil principal du parti\u00a0\u00bb (p. 312), alors qu'il ne devient Secr\u00e9taire du Politburo [Bureau politique] qu'\u00e0 l'\u00e9t\u00e9 1923. D'ailleurs, quel cr\u00e9dit accorder aux r\u00e9cits de transfuges, quels qu'ils soient\u00a0?  4Par la suite, \u00c9douard Berzine participa \u00e0 la cr\u00e9ation du Goulag. 
Ne pas confondre avec Ian Berzine, un des meilleurs sp\u00e9cialistes du renseignement de l'Arm\u00e9e rouge. Tous deux sont ex\u00e9cut\u00e9s lors des purges de 1937-38.  5Moiss\u00e9i/Mikha\u00efl Ouritski (1873-1918) est abattu par le socialiste-r\u00e9volutionnaire (SR) Leonid Kanegisser en tant que dirigeant de la Tcheka de Petrograd. D'abord menchevique, puis membre de la Mejra\u00efonka (un groupe internationaliste aussi nomm\u00e9 \u00ab\u00a0inter-district\u00a0\u00bb ou inter-rayons\u00a0\u00bb) avant de rejoindre les bolcheviques. La tentative d'assassinat sur L\u00e9nine est l\u2019\u0153uvre de Fanny Kaplan, ancienne anarchiste devenue membre de l'organisation de combat SR. Auparavant, elle avait pr\u00e9vu de tuer L\u00e9on Trotsky. Avant d'\u00eatre ex\u00e9cut\u00e9e, elle partagea la cellule du diplomate Robert Bruce Lockhart (Orlando Figes, La R\u00e9volution russe, Paris, Deno\u00ebl, 2007, p. 775).  6En 1919, il publie \u00e0 Petrograd Pourquoi je soutiens le bolchevisme.  7Sur M. N. Roy, on lira avec bien plus de profit l'\u00e9tude de Jean\u00a0Vigreux, \u00ab\u00a0Manabendra Nath Roy (1887-1954), \u00ab repr\u00e9sentant des Indes britanniques\u00a0\u00bb au Komintern ou la critique de l\u2019imp\u00e9rialisme britannique\u00a0\u00bb,\u00a0Cahiers d\u2019histoire. Revue d\u2019histoire critique, n\u00b0111, 2010, p. 81-95, sur https://chrhc.revues.org/2075  8Taline Ter Minassian, Reginald Teague-Jones. Au service secret de l'Empire britannique, Paris, Grasset & Fasquelle, 2012. Lire le compte rendu de cet ouvrage dans ce dossier.", "onlyNER": true}

returns a JSON response in which the entity 'socialiste-révolutionnaire' has no NER type:

        {
            "rawName": "socialiste-révolutionnaire",
            "offsetStart": 9032,
            "offsetEnd": 9034,
            "nerd_score": 0.8,
            "nerd_selection_score": 0
        },
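Such entities can be spotted automatically in a response. The sketch below is a minimal check, assuming the NER class is carried by a `type` field on each entity; the second entity in the sample is hypothetical and only there for contrast.

```python
def entities_without_type(response):
    """Return the entities of a response that carry no (or an empty) 'type' field."""
    return [e for e in response.get("entities", []) if not e.get("type")]


response = {
    "entities": [
        # Entity reported in this issue.
        {
            "rawName": "socialiste-révolutionnaire",
            "offsetStart": 9032,
            "offsetEnd": 9034,
            "nerd_score": 0.8,
            "nerd_selection_score": 0,
        },
        # Hypothetical typed entity, for contrast.
        {"rawName": "Giles Milton", "type": "PERSON"},
    ]
}

for entity in entities_without_type(response):
    print(entity["rawName"])
```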

Not able to use Editor.html

Hi Team,

I am trying to use the Editor web page, so I made a change in the web.xml file.

However, I got the following error:

http://localhost:8090/service/NERDCustomisations --404 (Not Found)

Here is the web.xml file:

NERD service - a RESTful service for the (Named) Entity Recognition and Disambiguation
<servlet>
    <servlet-name>nerd-service</servlet-name>
    <servlet-class>com.sun.jersey.spi.container.servlet.ServletContainer</servlet-class>
    <init-param>
        <param-name>com.sun.jersey.config.property.resourceConfigClass</param-name>
        <param-value>com.sun.jersey.api.core.PackagesResourceConfig</param-value>
    </init-param>
    <init-param>
        <param-name>com.sun.jersey.config.property.packages</param-name>
        <param-value>com.scienceminer.nerd.service</param-value>
    </init-param>
    <load-on-startup>1</load-on-startup>
</servlet>

<servlet>
    <servlet-name>defaultStatic</servlet-name>
    <servlet-class>org.eclipse.jetty.servlet.DefaultServlet</servlet-class>
    <load-on-startup>1</load-on-startup>
</servlet>
<servlet-mapping>
    <servlet-name>defaultStatic</servlet-name>
    <url-pattern>/editor.html</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>defaultStatic</servlet-name>
    <url-pattern>/resources/*</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>defaultStatic</servlet-name>
    <url-pattern>/nerd/editor.js</url-pattern>
</servlet-mapping>

<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/admin</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/language</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/disambiguate</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/segmentation</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/customisations</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/customisation</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/kb/concept</url-pattern>
</servlet-mapping>
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/kb/term</url-pattern>
</servlet-mapping>

<!--servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/NERDCustomisation/*</url-pattern>
</servlet-mapping-->
<!--servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/createNERDCustomisation/*</url-pattern>
</servlet-mapping-->
<!--servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/extendNERDCustomisation/*</url-pattern>
</servlet-mapping-->
<servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>/service/*</url-pattern>
</servlet-mapping>
<!--servlet-mapping>
    <servlet-name>nerd-service</servlet-name>
    <url-pattern>nerd/*</url-pattern>
</servlet-mapping-->

<welcome-file-list>
    <welcome-file>nerd/editor.html</welcome-file>
    <welcome-file>editor.html</welcome-file>
</welcome-file-list>

<!--filter>
   <filter-name>cross-origin</filter-name>
   <filter-class>org.eclipse.jetty.servlets.CrossOriginFilter</filter-class>
   <init-param>
       <param-name>allowedOrigins</param-name>
       <param-value>*</param-value>
   </init-param>
   <init-param>
       <param-name>allowedMethods</param-name>
       <param-value>*</param-value>
   </init-param>
   <init-param>
       <param-name>allowedHeaders</param-name>
       <param-value>*</param-value>
   </init-param>
 </filter>
 <filter-mapping>
     <filter-name>cross-origin</filter-name>
     <url-pattern>/*</url-pattern>
 </filter-mapping-->
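To see which servlet a request path is routed to, the active mappings above can be evaluated with a simplified version of the servlet URL-matching rules (exact match first, then longest path prefix). This sketch ignores extension mappings and the default servlet, and the helper name `match_servlet` is hypothetical.

```python
# Active (non-commented) mappings from the web.xml above.
MAPPINGS = [
    ("/editor.html", "defaultStatic"),
    ("/resources/*", "defaultStatic"),
    ("/nerd/editor.js", "defaultStatic"),
    ("/admin", "nerd-service"),
    ("/language", "nerd-service"),
    ("/disambiguate", "nerd-service"),
    ("/segmentation", "nerd-service"),
    ("/customisations", "nerd-service"),
    ("/customisation", "nerd-service"),
    ("/kb/concept", "nerd-service"),
    ("/kb/term", "nerd-service"),
    ("/service/*", "nerd-service"),
]


def match_servlet(path, mappings):
    """Return the servlet name a path maps to, or None if no mapping applies."""
    # 1. An exact match wins.
    for pattern, servlet in mappings:
        if pattern == path:
            return servlet
    # 2. Otherwise, the longest matching path prefix ('/x/*') wins.
    best = None
    for pattern, servlet in mappings:
        if pattern.endswith("/*"):
            prefix = pattern[:-2]
            if path == prefix or path.startswith(prefix + "/"):
                if best is None or len(prefix) > len(best[0]):
                    best = (prefix, servlet)
    return best[1] if best else None


print(match_servlet("/service/NERDCustomisations", MAPPINGS))
print(match_servlet("/NERDCustomisations", MAPPINGS))
```

Under these simplified rules, /service/NERDCustomisations is routed to nerd-service through the /service/* mapping. This suggests, without confirming it, that the 404 comes from the resource lookup inside the service rather than from the container mapping.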

Different domains for Italian text disambiguation between 'localhost' and 'nerd.huma-num.fr/test/'

To test the new language, Italian, the same texts were run both against 'http://nerd.huma-num.fr/test/' and against localhost, and the results were compared. In the disambiguation test, the domains assigned to certain mentions differ. For instance, the mention 'fiori' gets the domains 'Agriculture, Plants' on 'http://nerd.huma-num.fr/test/', but only the domain 'Plants' on localhost.

The way the domains are computed should be re-checked, since the two instances treat them differently.

Mismatch results

I have a text that returns different results depending on the instance on which it is run.
The text/query is the following:

{
    "text": "Suite de la tournée des relations d'avant-guerre. J'ai aperçu mon plombier — il y a une véritable joie à retrouver des relations d'autrefois, après quatre années de coupure et de se sentir à l'unisson sur Pétain. Quand il m'en a parlé j'ai hésité à répondre catégoriquement, pour ne pas les choquer et j'ai dit : c'est un pauvre homme. Quel déchaînement : elle m'a dit c'est ainsi que vous appelez un homme qui nuit à son pays, etc... etc... \n Cette femme, très simple, est vraiment épatante. \n Elle m'explique que depuis le début elle écoute les informations de la radio anglaise et les diffuse dans le quartier. Je leur demande s'ils sont affiliés à une organisation — Oui - Laquelle \"la résistance\" C'sst ici que parle le bon sens et la clairvoyance : au sommet on se bat pour des initiales, à la base on croit en la résistance.\n On y croit avec plus de lucidité que de prétendus experts. \n Cet homme était de droite autrefois ;Il m'explique que parmi les riches il y en a beaucoup qui ne sont pas avec nous, parce qu'ils craignant pour leur gros sous. Ils n'ont d'ailleurs pas renié leur origine, elle me parle de la fierté qu'elle éprouve à retrouver beaucoup de catholiques dans la résistance. \n Nous parlons d'autres voisins du quartiers que sont-ils devenus. Celui—là vous savez c'est un français... et ça veut tout dire. Elle a raison cela veut tout dire — la droits a éclaté au feu de la guerre — il y a d'un côté les Français, plombiers ou hommes de lettres, et de l'autre ceux qui pensent à leurs gros sous...",
    "entities": [],
    "sentences": [
        {
            "offsetStart": 0,
            "offsetEnd": 49
        },
        {
            "offsetStart": 49,
            "offsetEnd": 212
        },
        {
            "offsetStart": 212,
            "offsetEnd": 335
        },
        {
            "offsetStart": 335,
            "offsetEnd": 434
        },
        {
            "offsetStart": 434,
            "offsetEnd": 441
        },
        {
            "offsetStart": 441,
            "offsetEnd": 492
        },
        {
            "offsetStart": 492,
            "offsetEnd": 613
        },
        {
            "offsetStart": 613,
            "offsetEnd": 831
        },
        {
            "offsetStart": 831,
            "offsetEnd": 891
        },
        {
            "offsetStart": 891,
            "offsetEnd": 1055
        },
        {
            "offsetStart": 1055,
            "offsetEnd": 1199
        },
        {
            "offsetStart": 1199,
            "offsetEnd": 1266
        },
        {
            "offsetStart": 1266,
            "offsetEnd": 1307
        },
        {
            "offsetStart": 1307,
            "offsetEnd": 1329
        },
        {
            "offsetStart": 1329,
            "offsetEnd": 1521
        }
    ],
    "processSentence":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],
    "onlyNER": false,
    "resultLanguages": [
        "de",
        "fr"
    ],
    "nbest": false,
    "customisation": "generic"
}

On both Huma-num and science-miner, Pétain is returned only when the text is submitted without sentences and processSentence.

When sentences and processSentence are provided, Pétain is no longer recognised, even when all the sentences are processed.
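Before comparing instances, it can help to rule out inconsistencies in the query itself. The sketch below is a hypothetical client-side check (not part of entity-fishing) that verifies the sentence offsets against the text and the `processSentence` indices against the number of sentences.

```python
def check_sentences(text, sentences, process_sentence):
    """Return a list of human-readable problems found in a sentence-segmented query."""
    problems = []
    previous_end = 0
    for i, sentence in enumerate(sentences):
        start, end = sentence["offsetStart"], sentence["offsetEnd"]
        if not (0 <= start < end <= len(text)):
            problems.append("sentence %d has invalid offsets %d-%d" % (i, start, end))
        if start < previous_end:
            problems.append("sentence %d overlaps the previous one" % i)
        previous_end = end
    for index in process_sentence:
        if not (0 <= index < len(sentences)):
            problems.append("processSentence index %d has no matching sentence" % index)
    return problems


# Small hypothetical query, only for illustration.
text = "First sentence. Second one."
sentences = [
    {"offsetStart": 0, "offsetEnd": 15},
    {"offsetStart": 15, "offsetEnd": 27},
]
print(check_sentences(text, sentences, [0, 1, 2]))
```

The query above lists 15 sentences (indices 0 to 14) but 16 entries in processSentence, so a check of this kind would flag index 15. Whether that is related to the missing entity is not established here.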

Non deterministic results

To test the disambiguation process, several kinds of queries were submitted.
In the test cases on PDF files, the service behaved strangely: it gave different results for the same query.
The following test cases were run on one PDF file with the same query template:
2009.Infiniti.pdf

{
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "customisation": "generic"
}

The service gave different results: for instance, the mention 'Francesco Speranza' can be fully recognized as 'Francesco Speranza', partially recognized as 'Speranza', or not recognized at all.

Below are some screenshots of the results.

  1. 'Francesco Speranza' is fully recognized

[screenshot, 2018-01-15 at 16 41 10]

  2. 'Francesco Speranza' is partially recognized as 'Speranza'

[screenshot, 2018-01-15 at 16 51 44]

  3. 'Francesco Speranza' is not recognized at all

[screenshot, 2018-01-15 at 16 39 54]
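Such instability can be measured by repeating the same query and counting how often each mention comes back. In the sketch below, `run_query` is a hypothetical callable that wraps the actual call to the service and returns the parsed JSON response; the fake responses only mimic the three outcomes described in this issue.

```python
from collections import Counter


def mention_stability(run_query, runs=5):
    """Run the same query several times; return the mentions not returned in every run."""
    counts = Counter()
    for _ in range(runs):
        response = run_query()
        # Count each distinct mention once per run.
        counts.update({e["rawName"] for e in response.get("entities", [])})
    return {name: n for name, n in counts.items() if n < runs}


# Fake responses standing in for three calls to the service.
fake_responses = iter([
    {"entities": [{"rawName": "Francesco Speranza"}]},
    {"entities": [{"rawName": "Speranza"}]},
    {"entities": []},
])
print(mention_stability(lambda: next(fake_responses), runs=3))
```

An empty result means every mention was returned in every run, which is the expected behaviour for a deterministic service.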

NERD service no longer works on a local machine for Italian and Spanish

The NERD service on a local machine (localhost:8090) no longer works for Italian and Spanish, although it works properly for English.

The latest version of branch 0.0.3, as of 16 January 2018, rebuilt successfully.

[screenshot, 2018-01-16 at 13 32 38]

However, the service did not work for Italian and Spanish, only for English. The log showed that the language was recognized, but no mention was returned.

[screenshot, 2018-01-16 at 13 37 09]

[screenshot, 2018-01-16 at 13 38 21]

This issue is related to issue #15.

LMDBs knowledge bases are not exactly the same

Mac OSX:

wikidata: 37413613 concepts.
en: 14899737 pages.
de: 3579552 pages.
fr: 3681264 pages.
es: 3322291 pages.
it: 2291751 pages.

Linux:

wikidata: 9155139 concepts.
en: 14651883 pages.
de: 3523959 pages.
fr: 3631810 pages.
es: 3322291 pages.
it: 2291751 pages.
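The two listings can be compared mechanically. In the sketch below the counts are copied from this issue, with one assumption: the second "en" entry of the Mac OSX list is taken to be "es", since its count matches the Linux "es" entry.

```python
MAC_OSX = {
    "wikidata": 37413613,
    "en": 14899737,
    "de": 3579552,
    "fr": 3681264,
    "es": 3322291,  # assumed to be "es"; see the note above
    "it": 2291751,
}

LINUX = {
    "wikidata": 9155139,
    "en": 14651883,
    "de": 3523959,
    "fr": 3631810,
    "es": 3322291,
    "it": 2291751,
}


def count_differences(first, second):
    """Return first - second for every knowledge base whose counts differ."""
    shared = sorted(first.keys() & second.keys())
    return {kb: first[kb] - second[kb] for kb in shared if first[kb] != second[kb]}


for kb, delta in count_differences(MAC_OSX, LINUX).items():
    print(kb, delta)
```

Under that assumption, only the Wikidata, English, German and French databases differ; the Spanish and Italian counts are identical on both systems.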

Update the demo console to Bootstrap 3

... or even to Bootstrap 4 if it moves from beta to final release.

(we still use Bootstrap 1 here, oh shame! Only the anHALytics front-end has been updated to Bootstrap 3)

I think this would solve the problems indicated in #3.

Missing information in documentation

In the documentation section on "Entities", the example is missing:

{
    "text": "Austria invaded and fought the Serbian army at the Battle of Cer and Battle of Kolubara beginning on 12 August.",
    "language": {
        "lang": "en"
    },
    "entities": [
        {}
    ]
}
