
Fact Extractor

Fact Extraction from Wikipedia Text

Intro

The DBpedia Extraction Framework is quite mature when dealing with Wikipedia's semi-structured content, such as infoboxes, links and categories.
However, unstructured content (typically free text) plays the most crucial role, given the amount of knowledge it can deliver, and few efforts have been made to extract structured data out of it.
For instance, given the Germany Football Team article, we want to extract a set of meaningful facts and structure them in machine-readable statements.
The following sentence:

In Euro 1992, Germany reached the final, but lost 0–2 to Denmark

would produce statements (triples) like:

<Germany, defeat, Denmark>
<defeat, score, 0–2>
<defeat, winner, Denmark>
<defeat, competition, Euro 1992>

High-level Workflow

INPUT = Wikipedia corpus

Corpus Analysis

  1. Corpus Raw Text Extraction
  2. Verb Extraction
  3. Verb Ranking

Unsupervised Fact Extraction

  1. Entity Linking
  2. Frame Classification
  3. Dataset Production

Supervised Fact Extraction

  1. Training Set Creation
  2. Classifier Training
  3. Frame Classification
  4. Dataset Production

Get Ready

  • Make sure Python, pip and Java are installed on your machine;
  • Install all the Python requirements:

$ pip install -r requirements.txt

  • Set your entity-linking service credentials in the configuration:
# For The Wiki Machine
TWM_URL = 'your service URL'
TWM_APPID = 'your app ID'
TWM_APPKEY = 'your app key'

# For Dandelion API
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1'
NEX_APPID = 'your app ID'
NEX_APPKEY = 'your app key'
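Both services are plain HTTP APIs. As a rough sketch of how the entity linking step could call Dandelion NEX (the config module name, the helper itself and the $app_id/$app_key authentication scheme are assumptions; check your service documentation):

import requests

# Assumption: the settings above live in an importable module named config
from config import NEX_URL, NEX_APPID, NEX_APPKEY

def link_entities(text, lang='it'):
    """Hypothetical helper: annotate text with Wikipedia entities via Dandelion NEX."""
    response = requests.get(NEX_URL, params={
        'text': text,
        'lang': lang,
        '$app_id': NEX_APPID,    # assumption: legacy app-id/app-key authentication
        '$app_key': NEX_APPKEY,
    })
    response.raise_for_status()
    return response.json().get('annotations', [])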

Get Started

Here is how to produce the unsupervised Italian soccer dataset:

$ wget http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
$ make extract-pages
$ make extract-soccer
$ make extract-sentences-baseline
$ make unsupervised-run

Done!

Note: Wikipedia Dump Pre-processing

Wikipedia dumps are packaged as XML documents and contain text formatted according to the MediaWiki markup syntax, with templates to be transcluded. To obtain a raw text corpus, we use the WikiExtractor, integrated here in a frozen version.

Development Policy

Contributors should follow the standard team development practices:

  1. Branch out of master;
  2. Commit frequently with clear messages;
  3. Make a pull request.

Coding Style

Pull requests not complying with these guidelines will be ignored.

  • Use 4 spaces (soft tab) for indentation;
  • Naming conventions
    • use an underscore as a word separator (files, variables, functions);
    • constants are UPPERCASE;
    • anything else is lowercase.
  • Use 2 empty lines to separate functions;
  • Write docstrings according to PEP 287, paying special attention to field lists. IDEs like PyCharm will do the job.

License

The source code is under the terms of the GNU General Public License, version 3.

Contributors

aktgth, e-dorigatti, kkasunperera, marfox, ninawan, pkuwzr

Issues

Merge all the contiguous chunks into single bigger ones

Consider the following JSON output yielded by the chunk combination script:

 {
    "chunks": [
      "FEC",
      "Levallois",
      "USL",
      "Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

It would be nice to have:

 {
    "chunks": [
      "FEC Levallois",
      "USL Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

Star is not extracted, so there is no way to obtain Red Star FC (even though it would make sense).
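A minimal sketch of the desired behavior (a hypothetical helper, assuming two chunks should be merged whenever they appear side by side in the sentence):

def merge_contiguous(chunks, sentence):
    """Merge chunks that occur next to each other in the sentence into a single chunk."""
    chunks = set(chunks)
    merged = True
    while merged:
        merged = False
        for first in list(chunks):
            for second in list(chunks):
                candidate = '%s %s' % (first, second)
                if first != second and candidate in sentence:
                    chunks -= {first, second}
                    chunks.add(candidate)
                    merged = True
                    break
            if merged:
                break
    return chunks

On the example above, this yields FEC Levallois and USL Dunkerque while leaving FC, Thépot, Brest and Red untouched.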

Prepare job data for the event

It should contain 500 sentences for each of the following frames.
The triggering LUs are listed next to each frame: those ranked as per this file are in bold, while the others come from Kicktionary or FrameNet.

Frame                        LUs
Attività (Activity)          andare, esordire, debuttare, giocare, rimanere
Partita (Match)              affrontare, giocare, incontrare
Vittoria (Victory)           battere, sconfiggere, vincere
Sconfitta (Defeat)           crollare, perdere, piegarsi
Stato (State)                rimanere
Trofeo (Finish_Competition)  vincere

If the same initial token appears in more than one annotation, training data only gets the first annotation

We are trying to build training data in an n-gram fashion, replacing single tokens with the full annotated entity.
See for instance the following sentence:

19  0   Ha  VER:pres    avere   Attività   O
19  1   giocato VER:pper    giocare Attività   B-LU
19  2   7   NUM @card@  Attività   O
19  3   partite NOM partita Attività   O
19  4   per PRE per Attività   O
19  5   la  DET:def il  Attività   O
19  6   Nazionale cipriota  ENT nazionale   Attività   B-Squadra_Attività
19  7   tra PRE tra Attività   O
19  8   il 2004 ENT il  Attività   B-Tempo_Attività
19  9   e   CON e   Attività   O
19  10  il 2004 ENT il  Attività   B-Tempo_Attività

See full sample output.
The problem arises here.

Tokens appearing in more than 1 Frame Element (FE) are wrongly transformed

This script takes as input:

  1. a CSV file containing CrowdFlower annotation results;
  2. a directory with TreeTagger output files, 1 sentence per file;

and translates it into IOB format to be fed to a classifier.
See #7 for more details on how to run the script.

You should fix a bug introduced in these lines: if the same token appears in different FEs, the algorithm only remembers 1 FE.
See for instance the following error from this sentence of the IOB output sample:

3   8   il  DET:def il  Vittoria    B-Competizione_Vittoria
3   9   Varese  NPR Varese  Vittoria    I-Avversario_Vittoria

where the B- label should be B-Avversario_Vittoria instead.
This is due to the token il appearing both in il Varese and in il Campionato Primavera 2010-2011.
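The fix boils down to keying labels by token position instead of by token string. A sketch under that assumption (the span-based input format is hypothetical; a real fix must operate on the script's own data structures):

def label_tokens(sentence_tokens, fe_spans):
    # fe_spans: list of (start, end, fe_name) token index spans, one per annotated FE chunk.
    # Keying by position prevents a repeated word like 'il' from overwriting
    # the FE already assigned to a previous occurrence.
    labels = ['O'] * len(sentence_tokens)
    for start, end, fe_name in fe_spans:
        labels[start] = 'B-' + fe_name
        for i in range(start + 1, end):
            labels[i] = 'I-' + fe_name
    return labels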

The subject should be the article URI from which the fact was extracted

Currently, we don't keep track of the sentences' provenance (i.e., the source Wikipedia article).
A sample extracted fact now looks like:

resource:SENTENCE0000 fact:Attività fact:Attività0000 .

We only keep track of the sentence ID from which the fact was extracted.
Instead, we should link to the Wikipedia article containing that sentence:

resource:${WIKIPEDIA_ARTICLE} fact:Attività fact:Attività0000 .

Automatically annotate numerical FEs

Here is a (non-exhaustive) summary of the expressions we want to annotate automatically and transform into standard XML Schema Datatypes (xsd), such as dates.
We decided that the easiest strategy is to implement a set of regexes.

Heads-up: relative expressions should be resolved against an absolute date (e.g., $YEAR), but that is the hard part. $YEAR should fall back to a default value if no absolute date is found.

Tempo (Time)

Input                 Output
il 14 settembre 2010  "2010-09-14"^^xsd:date
il 15 ottobre         "--10-15"^^xsd:gMonthDay
nel giugno 2012       "2012-06"^^xsd:gYearMonth
in maggio             "--05"^^xsd:gMonth
nel 2014              "2014"^^xsd:gYear
l'anno seguente       "$YEAR + 1" (plain literal)

Durata (Duration)

Input                 Output
stagione 1984-1985    "P1Y"^^xsd:duration
campionato 1921-1922  "P1Y"^^xsd:duration
fino al 1999          "to 1999" (plain literal), or "1999"^^xsd:gYear coupled with an end date property
per due stagioni      "P2Y"^^xsd:duration
per due anni          "P2Y"^^xsd:duration
dal 1996 al 2005      "P9Y"^^xsd:duration, plus "1996"^^xsd:gYear (start date) and "2005"^^xsd:gYear (end date)
una sola stagione     "P1Y"^^xsd:duration

Punteggio (Score)

Input  Output
2-1    "2-1" (plain literal)

Classifica (Ranking)

Input  Output
primo  "I" (Roman numeral, plain literal); marked as harmful

Merge chunks on common words

Consider the following sentence:

Ha giocato per quattro anni nella massima serie scozzese con il Dundee.

This script combines 3 chunking strategies and yields the following set of chunks for the given sentence:

set([u'quattro', u'massima serie', u'serie scozzese', u'anni', u'Dundee'])

Currently, overlap among the chunkers' output is only handled at the substring/superstring level.
We should also merge chunks on common words.
The output should be:

set([u'quattro', u'massima serie scozzese', u'anni', u'Dundee'])
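A sketch of word-level merging (a hypothetical helper): two chunks sharing a boundary word are stitched together whenever their union actually occurs in the sentence.

def merge_on_common_words(chunks, sentence):
    """Merge chunks whose word overlap stitches them into a longer span of the sentence."""
    chunks = set(chunks)
    merged = True
    while merged:
        merged = False
        for first in list(chunks):
            for second in list(chunks):
                first_words, second_words = first.split(), second.split()
                # e.g. 'massima serie' + 'serie scozzese' -> 'massima serie scozzese'
                if first != second and first_words[-1] == second_words[0]:
                    candidate = ' '.join(first_words + second_words[1:])
                    if candidate in sentence:
                        chunks -= {first, second}
                        chunks.add(candidate)
                        merged = True
                        break
            if merged:
                break
    return chunks

On the example above, 'massima serie' and 'serie scozzese' collapse into 'massima serie scozzese', as desired.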

Refactor, clean and organize the code base

Currently there are a bunch of scripts floating around; it's time to clean them up and organize them. Merge the date-normalizer and unsupervised branches into master; the no-chunker and chunk-combo branches are leftovers from failed experiments, so leave them be.

TODOs:

  • Extract relevant data from the Wikipedia dump (articles and soccer player pages)
  • Rank verbs and extract the most significant ones
  • Interface with CrowdFlower (create input sentences (aka gold) and the job interface, parse results)
  • Unsupervised training
  • Seed selection and sentence splitting
  • Launch classifier jobs (training, etc.)

Other Ideas:

Investigate available annotated data for the soccer domain

DISCLAIMER: this issue is not related to code

The cost of crowdsourcing the training set annotation can be reduced if we find already annotated sentences (with frame information, of course) in third-party resources.
You should explore the availability of such data and produce a report.

Thanks to @fsonntag for pointing this out.

When a duration FE is found, start and end date triples should be stated

For instance, stagione 2006-2007 now yields "P1Y"^^xsd:duration, but should also yield "2006"^^xsd:gYear and "2007"^^xsd:gYear, keeping track of the start and end dates.
The output triples would then look like below:

fact:Vittoria_1056741_3 fact:hasDurata "P1Y"^^xsd:duration .
fact:Vittoria_1056741_3 dbpedia-owl:startYear "2006"^^xsd:gYear .
fact:Vittoria_1056741_3 dbpedia-owl:endYear "2007"^^xsd:gYear .
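A sketch of the extraction, reusing the fact:hasDurata and dbpedia-owl start/end year properties from the example above (the season regex is an assumption):

import re

SEASON = re.compile(r'(\d{4})-(\d{4})')

def duration_triples(subject, expression):
    """Yield duration plus start/end year triples for spans like 'stagione 2006-2007'."""
    match = SEASON.search(expression)
    if not match:
        return []
    start, end = int(match.group(1)), int(match.group(2))
    return [
        '%s fact:hasDurata "P%dY"^^xsd:duration .' % (subject, end - start),
        '%s dbpedia-owl:startYear "%d"^^xsd:gYear .' % (subject, start),
        '%s dbpedia-owl:endYear "%d"^^xsd:gYear .' % (subject, end),
    ]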

Lexical Units (LUs) with more than 1 token are not handled

This script takes as input:

  1. a CSV file containing CrowdFlower annotation results;
  2. a directory with TreeTagger output files, 1 sentence per file;

and translates it into IOB format to be fed to a classifier.
See #7 for more details on how to run the script.

You should fix a bug introduced in these lines: if the LU is composed of more than one token, the remaining ones are not handled (i.e., there is no I- tag assignment).
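The expected labeling, as a sketch (a hypothetical helper; the real fix must operate on the script's own data structures): the first LU token gets B-LU, every following one gets I-LU.

def tag_lexical_unit(tokens, lexical_unit):
    """Assign B-LU to the first token of the LU and I-LU to the remaining ones."""
    lu_tokens = lexical_unit.split()
    tags = ['O'] * len(tokens)
    for i in range(len(tokens) - len(lu_tokens) + 1):
        if tokens[i:i + len(lu_tokens)] == lu_tokens:
            tags[i] = 'B-LU'
            for j in range(i + 1, i + len(lu_tokens)):
                tags[j] = 'I-LU'
            break
    return tags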

Map frames and FEs to DBPO properties

For each frame and FE in our current definitions, try to find a suitable mapping to a DBpedia ontology property.
If none exists, it may indicate that our approach has found a new property, which can be proposed for addition.
Otherwise, it would help to assess the impact of the knowledge base increase on already available properties.

Add a fact confidence score to the unsupervised results

Each FE comes with its confidence score: derive a global score from them.
Possible solutions:

  1. Arithmetic mean
  2. Weighted average
  3. F-measure (harmonic mean)

Weights come from the FE type: core ones should weigh more than extra ones.
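A sketch of option 2, the weighted average, with a hypothetical 2:1 core-to-extra weight ratio:

def fact_confidence(frame_elements, core_weight=2.0, extra_weight=1.0):
    """
    frame_elements: list of (score, is_core) pairs, one per FE.
    Returns the weighted average of the FE confidence scores;
    the default 2:1 weight ratio is an assumption to be tuned.
    """
    weights = [core_weight if is_core else extra_weight
               for _, is_core in frame_elements]
    weighted = sum(score * weight
                   for (score, _), weight in zip(frame_elements, weights))
    return weighted / sum(weights)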

Add a flexible command line to the supervised classifier run script

This script is in charge of various tasks related to the supervised classifier.
Currently, it only parses positional command line arguments, but some of them are optional.

The script should be able to perform the following tasks, separately:

  1. Train
  2. Run, interactive mode
  3. Run, batch mode
  4. Evaluate against gold

Implement a standard command line parser.
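With Python's standard argparse module, a sketch could look like this (subcommand and argument names are assumptions):

import argparse

def build_parser():
    """Hypothetical command line: one subcommand per task."""
    parser = argparse.ArgumentParser(description='Run the supervised classifier')
    subcommands = parser.add_subparsers(dest='command')

    train = subcommands.add_parser('train', help='train the classifier')
    train.add_argument('training_set', help='IOB training data')

    run = subcommands.add_parser('run', help='classify sentences')
    run.add_argument('--batch', metavar='FILE',
                     help='one sentence per line; omit for interactive mode')

    evaluate = subcommands.add_parser('evaluate', help='evaluate against gold')
    evaluate.add_argument('gold_standard')

    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()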

Map Frame Extraction results to the RDF data model

DISCLAIMER: this issue is not related to code

The final objective of the fact extractor is to produce statements for DBpedia.
Hence, we need a model that maps the results of the frame extraction step into the RDF data model.
The classification results format is the same as the training data one.

Can you think of the least verbose way to represent them as RDF statements? This would be a great addition to your GSoC proposal.
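For instance, one fairly compact option (a sketch only; the namespaces and FE property names mirror the examples elsewhere on this page and are otherwise assumptions) is to reify each classified frame as a node and attach its FEs directly as properties:

resource:${WIKIPEDIA_ARTICLE} fact:Vittoria fact:Vittoria0000 .
fact:Vittoria0000 fact:Avversario resource:Varese .
fact:Vittoria0000 fact:Punteggio "2-1" .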

Normalize Date Expressions in Training Set

Lots of FEs are dates:

  • absolute, e.g., May 2008
  • relative, e.g., the previous season
  • interval, e.g., from 2008 to 2015

A normalizer based on this context-free grammar (written as an ANTLR grammar) should be implemented at training set building time.

N.B.

Please submit your pull request to (or work on) the date-normalizer branch.

Extract soccer player articles out of your Wikipedia corpus

Run step 1.ii of the workflow, as per the README.
You should review the related script and make it more robust, e.g., add docstrings, useful comments, etc.
Some DBpedia magic!
Wondering how to get the list of soccer player Wiki IDs?
Fire the following SPARQL query at your DBpedia language endpoint:

SELECT *
WHERE {
  ?s a <http://dbpedia.org/ontology/SoccerPlayer> ;
    <http://dbpedia.org/ontology/wikiPageID> ?id .
}

Here is the list of active DBpedia language chapters.
No active chapter for your language? Consider deploying it!
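A sketch of firing the query from Python with the SPARQLWrapper package (using the package is an assumption, as is the Italian chapter endpoint; swap in your language's one):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://it.dbpedia.org/sparql')  # assumption: Italian chapter
sparql.setQuery("""
SELECT *
WHERE {
  ?s a <http://dbpedia.org/ontology/SoccerPlayer> ;
    <http://dbpedia.org/ontology/wikiPageID> ?id .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results['results']['bindings']:
    print(binding['s']['value'], binding['id']['value'])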

Keep track of the SVM probability score

You should first locate the class where libsvm is used.
Then, you should activate the flag for computing the probability score (if not already done).
Finally, you should store it at classification runtime.
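For reference, libsvm exposes probability estimates via the -b 1 flag at both training and prediction time. A sketch with its Python bindings (whichever binding the code base actually uses; the file name here is hypothetical):

from svmutil import svm_read_problem, svm_train, svm_predict

labels, features = svm_read_problem('training_set.svm')  # hypothetical path
model = svm_train(labels, features, '-b 1')              # -b 1 enables probability estimates
# p_vals holds one probability per class for every classified instance
p_labels, p_acc, p_vals = svm_predict(labels, features, model, '-b 1')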

Serialize a dataset with triple confidence scores

See #67.
Given the triple

<SUBJECT> <FRAME> <FRAME_01> .

you should produce the following triple in a separate dataset:

<FRAME_01> <http://dbpedia.org/fact-extraction/confidence> "0.587372"^^<http://www.w3.org/2001/XMLSchema#float> .
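A sketch of the serialization (the confidence property URI is taken from the triple above; the output file name is hypothetical):

CONFIDENCE = '<http://dbpedia.org/fact-extraction/confidence>'
XSD_FLOAT = '<http://www.w3.org/2001/XMLSchema#float>'

def confidence_triple(frame_uri, score):
    """Serialize one confidence statement in N-Triples syntax."""
    return '%s %s "%f"^^%s .' % (frame_uri, CONFIDENCE, score, XSD_FLOAT)

# Hypothetical usage: one triple per extracted frame, in a separate dataset
with open('confidence.nt', 'w') as dataset:
    dataset.write(confidence_triple('<FRAME_01>', 0.587372) + '\n')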

Find existing frames that fit into the soccer domain

DISCLAIMER: this issue is not related to code

Produce a list of already available frame candidates (i.e., frame + frame elements + example sentences) that would fit well into the soccer domain.
You should proceed in either of two ways, with 2 opposite paradigms in mind:

  • top-down: you look for suitable frames in a frame repository;
  • bottom-up: you start from the top-ranked lexical units as per the output of step 2 and look for suitable ones in a frame repository.

Here are 2 frame repos as a starting point:

Build a list of relation candidates that do not currently exist in DBpedia

DISCLAIMER: this issue is not related to code

We want to assess the potential of the fact extractor.
Try to address the following question:

Which relations can we extract that are not already mapped in the DBpedia ontology or do not exist in the raw infobox properties datasets?

  • Focus on the soccer domain in your mother tongue;
  • Build a list of potentially interesting relations that are not already in the corresponding DBpedia chapter;
  • Compute statistics by browsing through soccer-related DBpedia entities, i.e., instances of SoccerPlayer, SoccerClub, SoccerManager, SoccerLeague, SoccerTournament, SoccerLeagueSeason.

DBpedia magic

The following SPARQL query computes the number of raw infobox properties for a given $CLASS. You can fire it at your DBpedia language endpoint:

SELECT ?properties (COUNT(?properties) AS ?amount)
WHERE
{
  ?s a <http://dbpedia.org/ontology/$CLASS> ;
    ?properties ?o .
  FILTER (regex(str(?properties), "dbpedia.org/property"))
}
GROUP BY ?properties
ORDER BY ?amount

Bonus

Can you figure out the query for the ontology properties?
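As a hint, a sketch for the bonus: only the filter changes.

SELECT ?properties (COUNT(?properties) AS ?amount)
WHERE
{
  ?s a <http://dbpedia.org/ontology/$CLASS> ;
    ?properties ?o .
  FILTER (regex(str(?properties), "dbpedia.org/ontology"))
}
GROUP BY ?properties
ORDER BY ?amount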
