Comments (8)
@kermitt2 all correct. I was thinking of something simpler (or maybe the same thing under a different name): could we extend candidate matching with the "also known as" information in Wikidata, to increase the chance of matching a Wikipedia article when a different form is used (e.g. Charles Ier -> Wikidata:Q3044 -> Wikipedia:Charlemagne)?
@tantikristanti your comment would be better placed in task #51 ;-)
from entity-fishing.
It's because it comes from a disambiguation page and does not occur anywhere else, either as an anchor or as a title, hence the missing prior probability (it's not a confidence score here).
So this was not really an error, but rather a lack of usable data in Wikipedia, which left this lexical entry unused.
This was fixed in branch 0.0.3 by setting a default prior in these disambiguation cases.
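To illustrate the idea of a default prior, here is a minimal sketch. It is not the code from the 0.0.3 branch: the function name, the count-based estimate and the default value of 0.01 are all assumptions made for the example.

```python
# Hypothetical sketch of a prior with a fallback value.
# DEFAULT_PRIOR is an assumed number, not the one used by entity-fishing.
DEFAULT_PRIOR = 0.01


def prior_probability(link_count, total_count, default=DEFAULT_PRIOR):
    """Estimate P(entity | mention) from anchor/title counts.

    When the mention was never seen as an anchor or a title (for example
    because it only comes from a disambiguation page), there is nothing
    to count, so the default prior is returned instead of no value at all.
    """
    if total_count <= 0:
        return default
    return link_count / total_count


# A mention with usage statistics gets a count-based prior.
with_stats = prior_probability(link_count=5, total_count=1000)

# A mention known only from a disambiguation page gets the default.
without_stats = prior_probability(link_count=0, total_count=0)
```

With a fallback like this the lexical entry stays usable as a candidate even though Wikipedia gives no statistics for it.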
I'm now re-checking with the new version.
I'm using this query:
{
  "text": "Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.",
  "shortText": "",
  "termVector": [],
  "language": {
    "lang": "fr"
  },
  "entities": [],
  "mentions": [
    "ner",
    "wikipedia"
  ],
  "nbest": false,
  "sentence": false,
  "customisation": "generic"
}
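For anyone who wants to reproduce this with other texts, a small sketch that builds the same query body in Python may help. The field names and values are the ones from the query above; the helper name `build_query` is made up for the example, and how the JSON is sent to the service is left out on purpose.

```python
import json


def build_query(text, lang="fr", mentions=("ner", "wikipedia")):
    """Build a disambiguation query with the same fields as the one above.

    build_query is a hypothetical helper, not part of entity-fishing.
    """
    return {
        "text": text,
        "shortText": "",
        "termVector": [],
        "language": {"lang": lang},
        "entities": [],
        "mentions": list(mentions),
        "nbest": False,
        "sentence": False,
        "customisation": "generic",
    }


query = build_query("Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand »")

# ensure_ascii=False keeps the French accents and guillemets readable.
payload = json.dumps(query, ensure_ascii=False)
```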
Now Charles Ier is disambiguated as Charles Ier (roi d'Angleterre), but the most interesting result is Carolus Magnus, which is disambiguated as the board game. That is somewhat related but bizarre, since the term lookup pulls out the right entry.
You can't use the old French model with the new version: the features are different. The new disambiguation models for French have to be created first.
With the latest model, Charles Ier is still disambiguated as Charles Ier (roi d'Angleterre), but Carolus Magnus is no longer taken into consideration (which is better, I think).
It seems that Charles Ier doesn't have a specific page or a reference to the Charlemagne page, so it's probably more difficult to find it as a candidate entity for Charlemagne.
Any thoughts?
You're mixing different things :)
Regarding Charles Ier, the problem is indeed that in the French Wikipedia it is not a mention "realizing" the entity Charlemagne. There are plenty of other kings in Wikipedia that are referred to with the mention Charles Ier. Interestingly, the variant Charles I is used as an anchor leading to the Charlemagne Wikipedia page, but the conditional probability is so low (0.005) that in practice it won't be considered anyway.
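To make the conditional probability point concrete, here is a rough sketch of how such a value can be estimated from anchor links and why a target at 0.005 drops out. The link counts are invented so that one target lands at 0.005, the second target name is a placeholder, and the 0.01 cut-off is an assumption, not the threshold entity-fishing uses.

```python
from collections import Counter


def candidate_probabilities(anchor_targets):
    """Estimate P(entity | anchor text) from the link targets of one anchor."""
    counts = Counter(anchor_targets)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}


def usable_candidates(probabilities, min_prob=0.01):
    """Keep targets at or above min_prob, most probable first.

    min_prob is an assumed cut-off chosen for the example.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [target for target, prob in ranked if prob >= min_prob]


# Invented counts: 200 links carry the same anchor text, only 1 goes to Charlemagne.
links = (
    ["Charles Ier (roi d'Angleterre)"] * 120
    + ["another Charles I (placeholder)"] * 79
    + ["Charlemagne"] * 1
)

probabilities = candidate_probabilities(links)
kept = usable_candidates(probabilities)
```

Here `probabilities["Charlemagne"]` is 1/200, so the page exists as a candidate in principle but is filtered out before disambiguation starts.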
One possible solution is to also exploit the Wikidata labels. This is not done now because we don't have statistical information about them to perform the disambiguation. For French, Charles Ier is a label introduced for Q3044. The question of how to use these labels in the disambiguation process without statistical information remains open, however! Maybe good priors? Label propagation?
Whether mentions appear or not across different queries is another issue. They are usually very close to the threshold and, I suppose, sensitive to the random seed, so we can keep track of that in issue #51.
@lfoppiano usually the problem is that there are too many entity candidates for a given mention... If we add more entity candidates for a given mention without statistical grounds, we end up on average with an ambiguity explosion, endless runtime, much lower accuracy... The labels in Wikidata are numerous, from the most common to the very, very rare, without any usage information, so we can't use such a simple approach. Currently we limit the entity candidates for a given mention to the top-5 most probable ones to manage this problem. Increasing the number of candidates results in a significant decrease in accuracy.
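As a rough illustration of the top-5 limit, the sketch below keeps only the most probable candidates for a mention. It is not the entity-fishing implementation: the function name, the (entity, prior) tuple layout and the sample priors are assumptions for the example.

```python
import heapq


def top_k_candidates(candidates, k=5):
    """Return the k candidates with the highest prior, best first.

    candidates is a list of (entity_id, prior) pairs; this layout is
    assumed for the sketch.
    """
    return heapq.nlargest(k, candidates, key=lambda candidate: candidate[1])


# Made-up priors for one mention with seven candidates.
candidates = [
    ("E1", 0.40),
    ("E2", 0.25),
    ("E3", 0.15),
    ("E4", 0.10),
    ("E5", 0.06),
    ("E6", 0.03),
    ("E7", 0.01),
]

selected = top_k_candidates(candidates)
```

A Wikidata label added without usage statistics has no prior to rank it by, so with a cut like this it either never makes the list or has to push out a candidate that does have statistical support.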