Comments (12)

kermitt2 commented on June 11, 2024

This has been visible for a long time. My hypothesis is that it is due to the random seed in SMILE, which leads to a different partitioning of the decision trees in the ensemble decision algorithm, and thus to different probabilities, above or below the thresholds, and finally to this non-deterministic behaviour.
If that is the cause, the usual remedy is to fix the random seed for reproducibility, as we typically do in sklearn.
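
To make the hypothesis concrete, here is a toy Python sketch. It is neither SMILE nor entity-fishing code: a probability sitting close to a decision threshold is perturbed by a random draw, which stands in for the randomness of tree construction. The names `noisy_probability` and `is_selected`, the noise width and the threshold value are all made up for illustration. With a fixed seed the decision is repeatable; without one it may flip between runs.

```python
import random

THRESHOLD = 0.5  # hypothetical selection threshold


def noisy_probability(base: float, rng: random.Random) -> float:
    """Toy stand-in for a randomized ensemble: perturb a base probability."""
    return base + rng.uniform(-0.05, 0.05)


def is_selected(base: float, seed=None) -> bool:
    """Return True if the perturbed probability reaches THRESHOLD.

    With seed=None the generator is seeded from system entropy, so the
    outcome for a borderline `base` can differ from run to run.
    """
    rng = random.Random(seed)
    return noisy_probability(base, rng) >= THRESHOLD


# A borderline candidate evaluated ten times with the same seed.
decisions = {is_selected(0.5, seed=42) for _ in range(10)}
print(len(decisions))  # one distinct outcome when the seed is fixed
```

Calling `is_selected(0.5)` without a seed is the case that would match the behaviour described here, if the hypothesis holds.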

from entity-fishing.

kermitt2 commented on June 11, 2024

Also note that this is not specific to PDF: it appears whatever input we provide to the monster :)

lfoppiano commented on June 11, 2024

I found that the HAL_1 example on the stable version (0.0.2) is pretty unstable: Himalaya is sometimes recognised, sometimes not (tested on huma-num/nerd).

lfoppiano commented on June 11, 2024

This query, on the other hand, shows a certain stability:

{
    "text": "Solo un ricorso in Cassazione nelle prossime 48 ore può salvare le 15 candidature uninominali della coalizione di centrodestra nella circoscrizione Lombardia 1 alla Camera: l’ufficio elettorale della corte d’Appello di Milano le ha infatti escluse in conseguenza della mancanza della dichiarazione di apparentamento alla coalizione da parte della lista «Noi con l’Italia». I candidati in questione sono: Michela Vittoria Brambilla (nel collegio di Abbiategrasso) , Massimo Garavaglia (Legnano), Andrea Crippa (Bollate), Paola Frassinetti (Seregno), Andrea Mandelli (Monza), Valentina Aprea (Gorgonzola), Jari Colla (Cinisello Balsamo), Luca Squeri (Cologno Monzese), Guido Della Frera (Milano - Sesto), Alessandro Morelli (Milano), Igor Iezzi (Milano), Cristina Rossello (Milano), Laura Molteni (Milano), Federica Zanella (Milano) e Graziano Musella (Rozzano). I rappresentanti della lista sostengono che in realtà quel documento sarebbe stato depositato, come avvenuto in tutta Italia, dalla lista «capofila» (Forza Italia), che aveva presentato l’intera documentazione. ",
    "language": {
        "lang": "it"
    },
    "entities": [],
    "mentions": [
        "ner",
        "wikipedia"
    ],
    "nbest": false,
    "sentence": false,
    "customisation": "generic"
}

lfoppiano commented on June 11, 2024

Here is a quick and dirty way to run some preliminary tests, counting the number of entities returned:

wifi-pro-82-244:tests lfoppiano$ for i in {1..10}; do curl --silent 'http://localhost:8090/service/disambiguate' -X POST -F "query={'language': {'lang':'it'}}, 'mentions': ['ner', 'wikipedia']}" -F"[email protected]" | jq -r '.entities' | grep 'rawName' | wc -l; done;
     142
     125
     128
     142
     132
     133
     122
     125
     133
     136
wifi-pro-82-244:tests lfoppiano$ 

PS: the PDF is in issue #48
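
For comparing runs, the counts can be summarised rather than eyeballed. The following Python sketch is only an illustration: `count_entities` roughly mirrors the `jq`/`grep`/`wc` pipeline and assumes the response is a JSON object with an `entities` list whose items carry a `rawName` field (inferred from the command, not from the API documentation), and `summarize` is a hypothetical helper. The sample data are the ten counts reported in the run above.

```python
def count_entities(response: dict) -> int:
    """Count entities carrying a 'rawName' field (assumed response shape)."""
    return sum(1 for entity in response.get("entities", []) if "rawName" in entity)


def summarize(counts: list) -> dict:
    """Summarize entity counts collected over repeated identical requests."""
    return {
        "runs": len(counts),
        "min": min(counts),
        "max": max(counts),
        "distinct": len(set(counts)),
        "stable": len(set(counts)) == 1,
    }


# Counts reported in the ten runs of the transcript.
observed = [142, 125, 128, 142, 132, 133, 122, 125, 133, 136]
summary = summarize(observed)
print(summary)
```

A deterministic service should give `"distinct": 1` and `"stable": True` for identical requests, which is not the case for these ten runs.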

lfoppiano commented on June 11, 2024

Here is an example with a text of 3-4 paragraphs:

wifi-pro-82-244:tests lfoppiano$ for i in {1..10}; do curl --silent 'http://nerd.huma-num.fr/test/service/disambiguate' -X POST -F 'query={
>   "text": "Treno fuori dai binari alla stazione di Roma-Termini: durante la manovra di trasferimento di un convoglio senza passeggeri verso il deposito, un carrello di un vagone è uscito dal binario. Il traffico ferroviario - ha reso noto Rfi sul portale d’informazioni - sta registrando dalle 10.30 di mercoledì 31, modifiche e rallentamenti fino a 30 minuti sulle linee Roma - Civitavecchia e Roma - Cassino. Sono in corso di accertamento le cause dell’episodio avvenuto intorno alle 10.30 e che ha provocato ritardi a catena su diverse linee regionali. \nPer l’anormalità tecnica in atto a Roma Termini - si legge - il traffico sulla linea Roma–Cassino registra ritardi medi di 30 minuti. Alcuni treni sulla linea Roma – Civitavecchia sono, invece, limitati nella stazione di Roma Ostiense. Per il servizio no-stop da Roma Termini a Fiumicino Aeroporto con il Leonardo Express, invece è stato riprogrammato con un treno ogni 30 minuti. I treni infatti non possono arrivare e partire dai binari 19, 20, 21. L’episodio del Frecciabianca a Roma, arriva a qualche giorno dalla tragedia avvenuta alle porte di Milano, dove un convoglio di Trenord è uscito dai binari durante la sua corsa, per andarsi a schiantare contro un palo. Terribile il bilancio delle vittime dell’incidente con tre donne morte e decine di feriti. \nTrenitalia fa sapere di aver potenziato l’assistenza ai passeggeri nelle stazioni di Ciampino, Ostiense e Termini: sono in corso le verifiche sul treno incidentato che dovrà essere rimesso in asse e poi spostato dal binario. Al momento i treni per Grosseto e Pisa partono da Termini con frequenze di un’ora; mentre il capolinea dei mezzi da Civitavecchia a Roma è stato spostato alla stazione Ostiense. I ritardi sulle linee, come ad esempio per Cassino, si attestano per ora intorno ai trenta minuti. Regolare il traffico ferroviario sulla tratta del Leonardo Express per Fiumicino.", "mentions": [ "ner", "wikipedia"]}' | jq -r '.entities' | grep 'rawName' | wc -l ; done;
      37
      37
      37
      37
      37
      37
      37
      37
      37
      37
wifi-pro-82-244:tests lfoppiano$ 

lfoppiano commented on June 11, 2024

To be checked: haifengl/smile#262

kermitt2 commented on June 11, 2024

OK, so fixing the seed for SMILE ML does not solve the problem :(
Let's explore further...

kermitt2 commented on June 11, 2024

The problem came from the way the context is built: the sorting of entities done before building a context was not deterministic, which resulted in a different selection of senses from one request to the next for the same input.
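
As a generic illustration of this kind of bug (not the actual entity-fishing code, and not necessarily what the fix looks like), here is a small Python sketch. `Candidate`, `entity_id`, `score` and both `top_k_*` functions are hypothetical names. When the sort key is the score alone, tied candidates keep whatever order they arrived in, so if the incoming order varies between requests, the selection varies too. Adding a unique tie-breaker to the key makes the result independent of the incoming order.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    entity_id: str  # hypothetical unique identifier
    score: float    # hypothetical relevance score


def top_k_by_score_only(candidates, k):
    """Sort on score alone: ties keep the incoming order (order-dependent)."""
    return sorted(candidates, key=lambda c: -c.score)[:k]


def top_k_deterministic(candidates, k):
    """Sort on score, then on the unique id: same result for any input order."""
    return sorted(candidates, key=lambda c: (-c.score, c.entity_id))[:k]


# The same three candidates, arriving in two different orders.
first_request = [Candidate("e1", 0.5), Candidate("e2", 0.5), Candidate("e3", 0.9)]
second_request = list(reversed(first_request))

# Score-only sorting selects different candidates for the two requests.
print(top_k_by_score_only(first_request, 2) == top_k_by_score_only(second_request, 2))
# With the tie-breaker, both requests select the same candidates.
print(top_k_deterministic(first_request, 2) == top_k_deterministic(second_request, 2))
```

The tie-breaker can be any field that is unique and stable across requests; the identifier is used here only because it is the simplest choice.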

kermitt2 commented on June 11, 2024

This should now be entirely fixed with commit 79e4e98.

lfoppiano commented on June 11, 2024

Looks like it's fixed:

Johan:tests lfoppiano$ for i in {1..10}; do curl --silent 'http://localhost:8091/service/disambiguate' -X POST -F "query={'language': {'lang':'it'}}, 'mentions': ['ner', 'wikipedia']}" -F"file=@Bodria_Berruto.pdf" | jq -r '.entities' | grep 'rawName' | wc -l; done;
     822
     822
     822
     822
     822
     822
     822
     822
     822
     822
Johan:tests lfoppiano$ 
Johan:tests lfoppiano$ for i in {1..10}; do curl --silent 'http://localhost:8091/service/disambiguate' -X POST -F "query={'language': {'lang':'it'}}, 'mentions': ['ner', 'wikipedia']}" -F"[email protected]" | jq -r '.entities' | grep 'rawName' | wc -l; done;
     109
     109
     109
     109
     109
     109
     109
     109
     109
     109
Johan:tests lfoppiano$ 
for i in {1..10}; do curl --silent 'http://localhost:8091/service/disambiguate' -X POST -F 'query={ "text": "WASHINGTON — President Trump sketched out an ominous view of America’s international role on Tuesday, emphasizing adversaries over allies, threats over opportunities, and a world to be pacified rather than elevated. \nBe it Iran or the Islamic State, Mr. Trump promised that the United States would vanquish rivals and stand up for those who fight for freedom. He took credit for the military campaign against the Islamic State, which he said had liberated “almost 100 percent of the territory once held by the killers in Iraq and Syria.” \nVowing to rebuild the nation’s nuclear arsenal, Mr. Trump said, “perhaps someday in the future there will be a magical moment when the countries of the world will get together to eliminate their nuclear weapons.” \n“Unfortunately, we are not there yet, sadly,” he said in his State of the Union address, his first. \nBut the president saved his longest foreign policy passage, and strongest words, for North Korea, whose “reckless pursuit of nuclear weapons,” he said, “could very soon threaten our homeland.” \n“We are waging a campaign of maximum pressure to prevent that from happening,” he said. “Past experience has taught us that complacency and concessions only invite aggression and provocation. I will not repeat the mistakes of past administrations that got us into this dangerous position.” \nMr. Trump did not, as he has before, issue specific threats of a military strike on the North. But he outlined an unrelenting case for what he called the North Korean government’s “depraved character,” echoing a speech he delivered to the South Korean National Assembly in Seoul in November. \nThe president drew on the stories of two victims of North Korean cruelty: an American college student, Otto F. 
Warmbier, who fell into an irreversible coma while in detention in Pyongyang, the capital, and later died; and a North Korean man who lost his leg while searching for food for his starving family. He later defected. \nGesturing to Mr. Warmbier’s parents, Fred and Cindy, who watched from the visitors’ gallery in the House, their eyes wet with tears, Mr. Trump said, “You are powerful witnesses to a menace that threatens our world, and your strength truly inspires us all.” \nThe defector, Ji Seong-ho, was also in the gallery and held up his wooden crutches in triumph when Mr. Trump hailed him. \nHours before the speech, the president’s Korea policy was buffeted by the administration’s decision to abandon a long-delayed plan to nominate a prominent Korea scholar, Victor D. Cha, as its ambassador to Seoul. \nMr. Cha, 57, had voiced opposition to the administration’s threat to carry out a preventive military strike against North Korea, said two people with knowledge of the decision. He had already undergone an extensive vetting process, and his name had been submitted for approval to the South Korean government — normally an indication that the background check was complete. \nOfficials in Seoul had already signed off on the ambassadorship; Mr. Cha is a Republican who identifies as a hawk on North Korea. But friends said he told Pentagon and other administration officials his concerns about ordering a pre-emptive, or preventive, military strike on North Korea before it had the capacity to fire a nuclear-armed missile at the United States. \nAdministration officials, particularly the White House national security adviser, Lt. Gen. H. R. McMaster, have raised the prospect of such a strike — sometimes called the “bloody nose” strategy — though they emphasize they would prefer to solve the confrontation with Pyongyang through diplomacy. \nMr. Cha has also publicly voiced the high cost to both Washington and Seoul of ripping up the Korea Free Trade Agreement, as Mr. 
Trump has threatened to do, unless the South Koreans agree to renegotiate the deal. \nThe White House declined to comment Tuesday on the reasons for its decision, though a senior official played down policy disagreements as the cause. The administration had not formally submitted Mr. Cha’s name to the Senate, even after he had undergone months of vetting. \nThe White House had initially hoped to have a new ambassador in place in time for the Winter Games, which begin in 10 days in the South Korean town of Pyeongchang. But as the deadline approached, Mr. Cha told friends he had heard nothing from the White House or the State Department about the status of his nomination. The Washington Post first reported that the White House was not moving forward with his nomination. \nMichael J. Green, a colleague of Mr. Cha, said the dropped ambassadorship was “discouraging in terms of what it says about the administration’s North Korea policy, but also their ability to attract qualified people to come into these kinds of jobs.” \nIn his speech, Mr. Trump made no mention of the Winter Olympic Games. Nor did he mention a budding détente between North and South Korea, which have agreed to march their teams into the opening ceremony under a single flag and to field a unified women’s ice hockey team. \nFor the president, cataloging the horrors inflicted by North Korea was part of an exercise that he called “restoring clarity about our adversaries.” He said he had stood up for antigovernment demonstrators in Iran and asked Congress to fix the flaws in the “terrible” nuclear deal that world powers brokered with the country in 2015.", "mentions": ["ner", "wikipedia"] }' | jq -r '.entities' | grep 'rawName' | wc -l ; done;
      83
      83
      83
      83
      83
      83
      83
      83
      83
      83

lfoppiano commented on June 11, 2024

Same example as #51 (comment)

Johan:tests lfoppiano$ for i in {1..10}; do curl --silent 'http://localhost:8091/service/disambiguate' -X POST -F 'query={  "text": "Treno fuori dai binari alla stazione di Roma-Termini: durante la manovra di trasferimento di un convoglio senza passeggeri verso il deposito, un carrello di un vagone è uscito dal binario. Il traffico ferroviario - ha reso noto Rfi sul portale d’informazioni - sta registrando dalle 10.30 di mercoledì 31, modifiche e rallentamenti fino a 30 minuti sulle linee Roma - Civitavecchia e Roma - Cassino. Sono in corso di accertamento le cause dell’episodio avvenuto intorno alle 10.30 e che ha provocato ritardi a catena su diverse linee regionali. \nPer l’anormalità tecnica in atto a Roma Termini - si legge - il traffico sulla linea Roma–Cassino registra ritardi medi di 30 minuti. Alcuni treni sulla linea Roma – Civitavecchia sono, invece, limitati nella stazione di Roma Ostiense. Per il servizio no-stop da Roma Termini a Fiumicino Aeroporto con il Leonardo Express, invece è stato riprogrammato con un treno ogni 30 minuti. I treni infatti non possono arrivare e partire dai binari 19, 20, 21. L’episodio del Frecciabianca a Roma, arriva a qualche giorno dalla tragedia avvenuta alle porte di Milano, dove un convoglio di Trenord è uscito dai binari durante la sua corsa, per andarsi a schiantare contro un palo. Terribile il bilancio delle vittime dell’incidente con tre donne morte e decine di feriti. \nTrenitalia fa sapere di aver potenziato l’assistenza ai passeggeri nelle stazioni di Ciampino, Ostiense e Termini: sono in corso le verifiche sul treno incidentato che dovrà essere rimesso in asse e poi spostato dal binario. Al momento i treni per Grosseto e Pisa partono da Termini con frequenze di un’ora; mentre il capolinea dei mezzi da Civitavecchia a Roma è stato spostato alla stazione Ostiense. I ritardi sulle linee, come ad esempio per Cassino, si attestano per ora intorno ai trenta minuti. 
Regolare il traffico ferroviario sulla tratta del Leonardo Express per Fiumicino.", "mentions": [ "ner", "wikipedia"]}' | jq -r '.entities' | grep 'rawName' | wc -l ; done;
      44
      44
      44
      44
      44
      44
      44
      44
      44
      44
