Comments (8)

lfoppiano commented on June 11, 2024

@kermitt2 all correct. I was thinking of something simpler (or maybe the same thing under a different name?): could we extend candidate matching with the "also known as" information in Wikidata, to increase the chance of matching a Wikipedia article when different forms are used (e.g. Charles Ier -> Wikidata:Q3044 -> Wikipedia:Charlemagne)?

@tantikristanti your comment would be better moved to task #51 ;-)

from entity-fishing.

kermitt2 commented on June 11, 2024

It's because it comes from a disambiguation page and does not occur anywhere else, either as an anchor or as a title, hence the lack of prior probability (it's not a confidence score here).
So this was not really an error, but rather a lack of usable data in Wikipedia, which leaves this lexical entry unused.

This was fixed in branch 0.0.3 by setting a default prior in these disambiguation cases.
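
As a rough illustration of what a default prior means here (a sketch only, not the actual entity-fishing code): the prior of an entity for a mention is estimated from how often the mention links to it as an anchor or title, with a fallback when no such usage exists. The function name, the count structure and the `DEFAULT_PRIOR` value are hypothetical.

```python
# Hypothetical fallback value; the real default used in branch 0.0.3 is not stated in this thread.
DEFAULT_PRIOR = 0.01


def prior_probability(link_counts, entity_id, default=DEFAULT_PRIOR):
    """Estimate P(entity | mention) from anchor/title counts.

    link_counts maps entity id -> number of times the mention points to
    that entity as an anchor or title. An entity known only through a
    disambiguation page has no count, so it receives the default prior.
    """
    total = sum(link_counts.values())
    count = link_counts.get(entity_id, 0)
    if total == 0 or count == 0:
        return default
    return count / total
```

With this kind of fallback, a lexical entry seen only on a disambiguation page stays usable as a candidate instead of being dropped for lack of statistics.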

lfoppiano commented on June 11, 2024

I'm now re-checking with the new version.

I'm using this query:

{
    "text": "Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand », né le 2 avril 742 (voire 747 ou 748)2, mort le 28 janvier 814 à Aix-la-Chapelle, est un roi des Francs et empereur. Il appartient à la dynastie des Carolingiens, à laquelle il a donné son nom.\nFils de Pépin le Bref, il est roi des Francs à partir de 768, devient par conquête roi des Lombards en 774 et est couronné empereur à Rome par le pape Léon III le 25 décembre 800, relevant une dignité disparue depuis la chute de l'Empire romain d'Occident en 476.\nRoi guerrier, il agrandit notablement son royaume par une série de campagnes militaires, en particulier contre les Saxons païens dont la soumission fut difficile et violente (772-804), mais aussi contre les Lombards en Italie et les musulmans d'Al-Andalus.",
    "shortText": "",
    "termVector": [],
    "language": {
        "lang": "fr"
    },
    "entities": [],
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "customisation": "generic"
}
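
For anyone who wants to reproduce this, here is a minimal sketch that builds the same query structure programmatically. The helper name `build_query` is hypothetical, and sending the payload to the service is deliberately left out, since the endpoint and transport depend on the deployment.

```python
import json


def build_query(text, lang="fr"):
    """Build a disambiguation query with the same fields as the one above.

    Hypothetical helper: only the field names and values are taken from
    the query shown in this comment.
    """
    query = {
        "text": text,
        "shortText": "",
        "termVector": [],
        "language": {"lang": lang},
        "entities": [],
        "mentions": ["ner", "wikipedia"],
        "nbest": False,
        "sentence": False,
        "customisation": "generic",
    }
    # ensure_ascii=False keeps accented French characters readable in the output
    return json.dumps(query, ensure_ascii=False)


payload = build_query("Charlemagne, du latin Carolus Magnus, ou Charles Ier dit « le Grand »")
```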

Now Charles Ier is disambiguated as Charles Ier (roi d'Angleterre), but the most interesting result is Carolus Magnus, which is disambiguated as the board game:

screen shot 2017-12-08 at 12 38 10

which is somewhat related, but bizarre, since the term lookup pulls out the right entry:

screen shot 2017-12-08 at 12 45 12

kermitt2 commented on June 11, 2024

You can't use the old French model with the new version: the features are different. The new disambiguation models for French have to be created first.

lfoppiano commented on June 11, 2024

With the latest model, Charles Ier is still disambiguated as Charles Ier (roi d'Angleterre), but Carolus Magnus is no longer taken into consideration (which is better, I think).

screen shot 2018-01-04 at 19 52 53

lfoppiano commented on June 11, 2024

It seems that Charles Ier has neither a page of its own nor a reference to the Charlemagne page, so it's probably harder to retrieve it as a candidate mention for Charlemagne.

Any thoughts?

kermitt2 commented on June 11, 2024

You're mixing different things :)

Regarding Charles Ier, the problem is indeed that in the French Wikipedia it is not a mention "realizing" the entity Charlemagne. There are plenty of other kings in Wikipedia that are referred to with the mention Charles Ier. Interestingly, the variant Charles I is used as an anchor leading to the Charlemagne Wikipedia page, but the conditional probability is so low (0.005) that in practice it won't be considered anyway.
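
To make the effect of such a low conditional probability concrete, here is a small sketch (not the actual entity-fishing logic). The function name, the `min_prob` cut-off and the anchor counts are invented for illustration; only the 0.005 figure comes from the observation above.

```python
def filter_by_conditional_probability(anchor_counts, min_prob=0.01):
    """Keep entities whose P(entity | mention) reaches min_prob.

    anchor_counts maps entity -> number of times the mention is used as
    an anchor pointing to that entity. min_prob is a hypothetical cut-off.
    """
    total = sum(anchor_counts.values())
    if total == 0:
        return {}
    return {
        entity: count / total
        for entity, count in anchor_counts.items()
        if count / total >= min_prob
    }


# Invented counts chosen so that Charlemagne gets 1 / 200 = 0.005.
counts = {"Charlemagne": 1, "Charles Ier (roi d'Angleterre)": 120, "Charles Ier d'Anjou": 79}
kept = filter_by_conditional_probability(counts)
```

Under these assumed counts, Charlemagne falls below the cut-off and never reaches the disambiguation step, which matches the behaviour described above.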

A possible solution is to also exploit the Wikidata labels. This is not done now, because we don't have statistical information about them to perform the disambiguation. For French, Charles Ier is a label introduced for Q3044. The question of how to use these labels in the disambiguation process without statistical information remains open, however! Maybe good priors? Label propagation?

Regarding mentions that appear or disappear from one query to another, that's a separate issue. They are usually very close to the threshold and, I suppose, sensitive to the random seed, so we can keep track of that in issue #51.

kermitt2 commented on June 11, 2024

@lfoppiano usually the problem is that there are too many entity candidates for a given mention... If we add more entity candidates for a given mention without statistical grounding, we end up on average with an ambiguity explosion, endless runtime, much lower accuracy... The labels in Wikidata are numerous, from the most common to the very, very rare, without any usage information, so we can't use such a simple approach. Currently we limit the entity candidates for a given mention to the top 5 most probable ones to manage this problem. Increasing the number of candidates results in a significant accuracy decrease.
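
The top-5 cut can be pictured with a short sketch (illustrative only; the function name and the data layout are assumptions, not the project's code):

```python
import heapq


def top_k_candidates(candidates, k=5):
    """Return the k most probable (entity, probability) pairs for a mention.

    candidates maps entity -> prior probability for the mention.
    """
    return heapq.nlargest(k, candidates.items(), key=lambda item: item[1])
```

A label added without any usage statistics has no probability to rank by, so it either never survives this cut or pushes out a better-supported candidate, which is the trade-off described above.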
