
Fact Extractor

Fact Extraction from Wikipedia Text

Intro

The DBpedia Extraction Framework is quite mature when dealing with Wikipedia's semi-structured content, such as infoboxes, links and categories.
However, unstructured content (typically free text) plays the most crucial role, given the amount of knowledge it can deliver, and few efforts have been made to extract structured data out of it.
For instance, given the Germany Football Team article, we want to extract a set of meaningful facts and structure them in machine-readable statements.
The following sentence:

In Euro 1992, Germany reached the final, but lost 0–2 to Denmark

would produce statements (triples) like:

<Germany, defeat, Denmark>
<defeat, score, 0–2>
<defeat, winner, Denmark>
<defeat, competition, Euro 1992>

High-level Workflow

INPUT = Wikipedia corpus

Corpus Analysis

  1. Corpus Raw Text Extraction
  2. Verb Extraction
  3. Verb Ranking

Unsupervised Fact Extraction

  1. Entity Linking
  2. Frame Classification
  3. Dataset Production

Supervised Fact Extraction

  1. Training Set Creation
  2. Classifier Training
  3. Frame Classification
  4. Dataset Production

Get Ready

  • Make sure Python, pip and Java are installed on your machine;
  • Install all the Python requirements:

$ pip install -r requirements.txt

  • Set your entity-linking service credentials in the configuration:
# For The Wiki Machine
TWM_URL = 'your service URL'
TWM_APPID = 'your app ID'
TWM_APPKEY = 'your app key'

# For Dandelion API
NEX_URL = 'https://api.dandelion.eu/datatxt/nex/v1'
NEX_APPID = 'your app ID'
NEX_APPKEY = 'your app key'
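Both services are plain HTTP APIs. As a rough sketch of how the entity linking step could call Dandelion NEX (the config module name, the helper itself and the $app_id/$app_key authentication scheme are assumptions; check your service documentation):

import requests

# Assumption: the settings above live in an importable module named config
from config import NEX_URL, NEX_APPID, NEX_APPKEY

def link_entities(text, lang='it'):
    """Hypothetical helper: annotate text with Wikipedia entities via Dandelion NEX."""
    response = requests.get(NEX_URL, params={
        'text': text,
        'lang': lang,
        '$app_id': NEX_APPID,    # assumption: legacy app-id/app-key authentication
        '$app_key': NEX_APPKEY,
    })
    response.raise_for_status()
    return response.json().get('annotations', [])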

Get Started

Here is how to produce the unsupervised Italian soccer dataset:

$ wget http://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
$ make extract-pages
$ make extract-soccer
$ make extract-sentences-baseline
$ make unsupervised-run

Done!

Note: Wikipedia Dump Pre-processing

Wikipedia dumps are packaged as XML documents and contain text formatted according to the MediaWiki markup syntax, with templates to be transcluded. To obtain a raw text corpus, we use the WikiExtractor, integrated here in a frozen version.

Development Policy

Contributors should follow the standard team development practices:

  1. Branch out of master;
  2. Commit frequently with clear messages;
  3. Make a pull request.

Coding Style

Pull requests not complying with these guidelines will be ignored.

  • Use 4 spaces (soft tab) for indentation;
  • Naming conventions
    • use an underscore as a word separator (files, variables, functions);
    • constants are UPPERCASE;
    • anything else is lowercase.
  • Use 2 empty lines to separate functions;
  • Write docstrings according to PEP 287, paying special attention to field lists. IDEs like PyCharm will do the job.

License

The source code is under the terms of the GNU General Public License, version 3.

Contributors

aktgth, e-dorigatti, kkasunperera, marfox, ninawan, pkuwzr

Issues

Merge all the contiguous chunks into single bigger ones

Consider the following JSON output yielded by the chunk combination script:

 {
    "chunks": [
      "FEC",
      "Levallois",
      "USL",
      "Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

It would be nice to have:

 {
    "chunks": [
      "FEC Levallois",
      "USL Dunkerque",
      "FC",
      "Thépot",
      "Brest",
      "Red"
    ],
    "id": "12",
    "sentence": "Thépot giocò per il Brest, FEC Levallois, Red Star FC, e l'USL Dunkerque."
  }

Star is not extracted, so there is no way to obtain Red Star FC (even though it would make sense).
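A minimal sketch of the desired behavior (a hypothetical helper, assuming two chunks should be merged whenever they appear side by side in the sentence):

def merge_contiguous(chunks, sentence):
    """Merge chunks that occur next to each other in the sentence into a single chunk."""
    chunks = set(chunks)
    merged = True
    while merged:
        merged = False
        for first in list(chunks):
            for second in list(chunks):
                candidate = '%s %s' % (first, second)
                if first != second and candidate in sentence:
                    chunks -= {first, second}
                    chunks.add(candidate)
                    merged = True
                    break
            if merged:
                break
    return chunks

On the example above, this yields FEC Levallois and USL Dunkerque while leaving FC, Thépot, Brest and Red untouched.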

Prepare job data for the event

It should contain 500 sentences for each of the following frames.
The triggering LUs are listed next to each frame: those ranked as per this file are in bold, while the others come from Kicktionary or FrameNet.

Frame                        LUs
Attività (Activity)          andare, esordire, debuttare, giocare, rimanere
Partita (Match)              affrontare, giocare, incontrare
Vittoria (Victory)           battere, sconfiggere, vincere
Sconfitta (Defeat)           crollare, perdere, piegarsi
Stato (State)                rimanere
Trofeo (Finish_Competition)  vincere

If the same initial token appears in more than one annotation, training data only gets the first annotation

We are trying to build training data in an n-gram fashion, replacing single tokens with the full annotated entity.
See for instance the following sentence:

19  0   Ha  VER:pres    avere   Attività   O
19  1   giocato VER:pper    giocare Attività   B-LU
19  2   7   NUM @card@  Attività   O
19  3   partite NOM partita Attività   O
19  4   per PRE per Attività   O
19  5   la  DET:def il  Attività   O
19  6   Nazionale cipriota  ENT nazionale   Attività   B-Squadra_Attività
19  7   tra PRE tra Attività   O
19  8   il 2004 ENT il  Attività   B-Tempo_Attività
19  9   e   CON e   Attività   O
19  10  il 2004 ENT il  Attività   B-Tempo_Attività

See full sample output.
The problem arises here.

Tokens appearing in more than 1 Frame Element (FE) are wrongly transformed

This script takes as input:

  1. a CSV file containing CrowdFlower annotation results;
  2. a directory with TreeTagger output files, 1 sentence per file;

and translates it into IOB format to be fed to a classifier.
See #7 for more details on how to run the script.

You should fix a bug introduced in these lines: if the same token appears in different FEs, the algorithm only remembers 1 FE.
See for instance the following error from this sentence of the IOB output sample:

3   8   il  DET:def il  Vittoria    B-Competizione_Vittoria
3   9   Varese  NPR Varese  Vittoria    I-Avversario_Vittoria

where the B- label should be B-Avversario_Vittoria instead.
This is due to the token il appearing both in il Varese and in il Campionato Primavera 2010-2011.
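The fix boils down to keying labels by token position instead of by token string. A sketch under that assumption (the span-based input format is hypothetical; a real fix must operate on the script's own data structures):

def label_tokens(sentence_tokens, fe_spans):
    # fe_spans: list of (start, end, fe_name) token index spans, one per annotated FE chunk.
    # Keying by position prevents a repeated word like 'il' from overwriting
    # the FE already assigned to a previous occurrence.
    labels = ['O'] * len(sentence_tokens)
    for start, end, fe_name in fe_spans:
        labels[start] = 'B-' + fe_name
        for i in range(start + 1, end):
            labels[i] = 'I-' + fe_name
    return labels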

The subject should be the article URI from which the fact was extracted

Currently, we don't keep track of the sentences' provenance (i.e., the source Wikipedia article).
A sample extracted fact now looks like:

resource:SENTENCE0000 fact:Attività fact:Attività0000 .

We only keep track of the sentence ID from which the fact was extracted.
Instead, we should link to the Wikipedia article containing that sentence:

resource:${WIKIPEDIA_ARTICLE} fact:Attività fact:Attività0000 .

Automatically annotate numerical FEs

Here is a (non-exhaustive) summary of the expressions we want to annotate automatically and transform into standard XML Schema Datatypes (xsd), such as dates.
We decided that the easiest strategy is to implement a set of regexes.

Heads-up: relative expressions should be resolved against an absolute date (e.g., $YEAR), but that is the hard part. $YEAR should fall back to a default value if no absolute date is found.

Tempo (Time)

Input                 Output
il 14 settembre 2010  "2010-09-14"^^xsd:date
il 15 ottobre         "--10-15"^^xsd:gMonthDay
nel giugno 2012       "2012-06"^^xsd:gYearMonth
in maggio             "--05"^^xsd:gMonth
nel 2014              "2014"^^xsd:gYear
l'anno seguente       "$YEAR + 1" (plain literal)

Durata (Duration)

Input                 Output
stagione 1984-1985    "P1Y"^^xsd:duration
campionato 1921-1922  "P1Y"^^xsd:duration
fino al 1999          "to 1999" (plain literal), or "1999"^^xsd:gYear coupled with an end date property
per due stagioni      "P2Y"^^xsd:duration
per due anni          "P2Y"^^xsd:duration
dal 1996 al 2005      "P9Y"^^xsd:duration, plus "1996"^^xsd:gYear (start date) and "2005"^^xsd:gYear (end date)
una sola stagione     "P1Y"^^xsd:duration

Punteggio (Score)

Input  Output
2-1    "2-1" (plain literal)

Classifica (Ranking)

Input  Output
primo  "I" (Roman numeral, plain literal); marked as harmful

Merge chunks on common words

Consider the following sentence:

Ha giocato per quattro anni nella massima serie scozzese con il Dundee.

This script combines 3 chunking strategies and yields the following set of chunks for the given sentence:

set([u'quattro', u'massima serie', u'serie scozzese', u'anni', u'Dundee'])

Currently, overlap among the chunkers' output is only handled at the substring/superstring level.
We should also merge chunks on common words.
The output should be:

set([u'quattro', u'massima serie scozzese', u'anni', u'Dundee'])
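A sketch of word-level merging (a hypothetical helper): two chunks sharing a boundary word are stitched together whenever their union actually occurs in the sentence.

def merge_on_common_words(chunks, sentence):
    """Merge chunks whose word overlap stitches them into a longer span of the sentence."""
    chunks = set(chunks)
    merged = True
    while merged:
        merged = False
        for first in list(chunks):
            for second in list(chunks):
                first_words, second_words = first.split(), second.split()
                # e.g. 'massima serie' + 'serie scozzese' -> 'massima serie scozzese'
                if first != second and first_words[-1] == second_words[0]:
                    candidate = ' '.join(first_words + second_words[1:])
                    if candidate in sentence:
                        chunks -= {first, second}
                        chunks.add(candidate)
                        merged = True
                        break
            if merged:
                break
    return chunks

On the example above, 'massima serie' and 'serie scozzese' collapse into 'massima serie scozzese', as desired.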

Refactor, clean and organize the code base

Currently there are a bunch of scripts floating around; it's time to clean them up and organize them. Merge the date-normalizer and unsupervised branches into master; the no-chunker and chunk-combo branches are leftovers from failed experiments, so leave them be.

TODOs:

  • Extract relevant data from the Wikipedia dump (articles and soccer player pages)
  • Rank verbs and extract the most significant ones
  • Interface with CrowdFlower (create input sentences (aka gold) and the job interface, parse results)
  • Unsupervised training
  • Seed selection and sentence splitting
  • Launch classifier jobs (training, etc.)

Other Ideas:

Investigate available annotated data for the soccer domain

DISCLAIMER: this issue is not related to code

The cost of crowdsourcing the training set annotation can be reduced if we find already annotated sentences (with frame information, of course) in third-party resources.
You should explore the availability of such data and produce a report.

Thanks to @fsonntag for pointing this out.

When a duration FE is found, start and end date triples should be stated

For instance, stagione 2006-2007 now yields "P1Y"^^xsd:duration, but should also yield "2006"^^xsd:gYear and "2007"^^xsd:gYear, keeping track of the start and end dates.
The output triples would then look like below:

fact:Vittoria_1056741_3 fact:hasDurata "P1Y"^^xsd:duration .
fact:Vittoria_1056741_3 dbpedia-owl:startYear "2006"^^xsd:gYear .
fact:Vittoria_1056741_3 dbpedia-owl:endYear "2007"^^xsd:gYear .
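A sketch of the extraction, reusing the fact:hasDurata and dbpedia-owl start/end year properties from the example above (the season regex is an assumption):

import re

SEASON = re.compile(r'(\d{4})-(\d{4})')

def duration_triples(subject, expression):
    """Yield duration plus start/end year triples for spans like 'stagione 2006-2007'."""
    match = SEASON.search(expression)
    if not match:
        return []
    start, end = int(match.group(1)), int(match.group(2))
    return [
        '%s fact:hasDurata "P%dY"^^xsd:duration .' % (subject, end - start),
        '%s dbpedia-owl:startYear "%d"^^xsd:gYear .' % (subject, start),
        '%s dbpedia-owl:endYear "%d"^^xsd:gYear .' % (subject, end),
    ]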

Lexical Units (LUs) with more than 1 token are not handled

This script takes as input:

  1. a CSV file containing CrowdFlower annotation results;
  2. a directory with TreeTagger output files, 1 sentence per file;

and translates it into IOB format to be fed to a classifier.
See #7 for more details on how to run the script.

You should fix a bug introduced in these lines: if the LU is composed of more than one token, the remaining ones are not handled (i.e., there is no I- tag assignment).
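The expected labeling, as a sketch (a hypothetical helper; the real fix must operate on the script's own data structures): the first LU token gets B-LU, every following one gets I-LU.

def tag_lexical_unit(tokens, lexical_unit):
    """Assign B-LU to the first token of the LU and I-LU to the remaining ones."""
    lu_tokens = lexical_unit.split()
    tags = ['O'] * len(tokens)
    for i in range(len(tokens) - len(lu_tokens) + 1):
        if tokens[i:i + len(lu_tokens)] == lu_tokens:
            tags[i] = 'B-LU'
            for j in range(i + 1, i + len(lu_tokens)):
                tags[j] = 'I-LU'
            break
    return tags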

Map frames and FEs to DBPO properties

For each frame and FE in our current definitions, try to find a suitable mapping to a DBpedia ontology property.
If none exists, it may indicate that our approach has found a new property, which can be proposed for addition.
Otherwise, it would help to assess the impact of the knowledge base increase on already available properties.

Add a fact confidence score to the unsupervised results

Each FE comes with its confidence score: derive a global score from them.
Possible solutions:

  1. Arithmetic mean
  2. Weighted average
  3. F-measure (harmonic mean)

Weights come from the FE type: core ones should weigh more than extra ones.
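A sketch of option 2, the weighted average, with a hypothetical 2:1 core-to-extra weight ratio:

def fact_confidence(frame_elements, core_weight=2.0, extra_weight=1.0):
    """
    frame_elements: list of (score, is_core) pairs, one per FE.
    Returns the weighted average of the FE confidence scores;
    the default 2:1 weight ratio is an assumption to be tuned.
    """
    weights = [core_weight if is_core else extra_weight
               for _, is_core in frame_elements]
    weighted = sum(score * weight
                   for (score, _), weight in zip(frame_elements, weights))
    return weighted / sum(weights)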

Add a flexible command line to the supervised classifier run script

This script is in charge of various tasks related to the supervised classifier.
Currently, it only parses positional command line arguments, but some of them are optional.

The script should be able to perform the following tasks, separately:

  1. Train
  2. Run, interactive mode
  3. Run, batch mode
  4. Evaluate against gold

Implement a standard command line parser.
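With Python's standard argparse module, a sketch could look like this (subcommand and argument names are assumptions):

import argparse

def build_parser():
    """Hypothetical command line: one subcommand per task."""
    parser = argparse.ArgumentParser(description='Run the supervised classifier')
    subcommands = parser.add_subparsers(dest='command')

    train = subcommands.add_parser('train', help='train the classifier')
    train.add_argument('training_set', help='IOB training data')

    run = subcommands.add_parser('run', help='classify sentences')
    run.add_argument('--batch', metavar='FILE',
                     help='one sentence per line; omit for interactive mode')

    evaluate = subcommands.add_parser('evaluate', help='evaluate against gold')
    evaluate.add_argument('gold_standard')

    return parser

if __name__ == '__main__':
    args = build_parser().parse_args()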

Map Frame Extraction results to the RDF data model

DISCLAIMER: this issue is not related to code

The final objective of the fact extractor is to produce statements for DBpedia.
Hence, we need a model that maps the results of the frame extraction step into the RDF data model.
The classification results format is the same as the training data one.

Can you think of the least verbose way to represent them as RDF statements? This would be a great addition to your GSoC proposal.
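For instance, one fairly compact option (a sketch only; the namespaces and FE property names mirror the examples elsewhere on this page and are otherwise assumptions) is to reify each classified frame as a node and attach its FEs directly as properties:

resource:${WIKIPEDIA_ARTICLE} fact:Vittoria fact:Vittoria0000 .
fact:Vittoria0000 fact:Avversario resource:Varese .
fact:Vittoria0000 fact:Punteggio "2-1" .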

Normalize Date Expressions in Training Set

Lots of FEs are dates:

  • absolute, e.g., May 2008
  • relative, e.g., the previous season
  • interval, e.g., from 2008 to 2015

A normalizer based on this context-free grammar (written as an ANTLR grammar) should be implemented at training set building time.

N.B.

Please submit your pull request to (or work on) the date-normalizer branch.

Extract soccer player articles out of your Wikipedia corpus

Run step 1.ii of the workflow, as per the README.
You should review the related script and make it more robust, e.g., add docstrings, useful comments, etc.
Some DBpedia magic!
Wondering how to get the list of soccer player Wiki IDs?
Fire the following SPARQL query at your DBpedia language endpoint:

SELECT *
WHERE {
  ?s a <http://dbpedia.org/ontology/SoccerPlayer> ;
    <http://dbpedia.org/ontology/wikiPageID> ?id .
}

Here is the list of active DBpedia language chapters.
No active chapter for your language? Consider deploying it!
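A sketch of firing the query from Python with the SPARQLWrapper package (using the package is an assumption, as is the Italian chapter endpoint; swap in your language's one):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('http://it.dbpedia.org/sparql')  # assumption: Italian chapter
sparql.setQuery("""
SELECT *
WHERE {
  ?s a <http://dbpedia.org/ontology/SoccerPlayer> ;
    <http://dbpedia.org/ontology/wikiPageID> ?id .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results['results']['bindings']:
    print(binding['s']['value'], binding['id']['value'])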

Keep track of the SVM probability score

You should first locate the class where libsvm is used.
Then, you should activate the flag for computing the probability score (if not already done).
Finally, you should store it at classification runtime.
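For reference, libsvm exposes probability estimates via the -b 1 flag at both training and prediction time. A sketch with its Python bindings (whichever binding the code base actually uses; the file name here is hypothetical):

from svmutil import svm_read_problem, svm_train, svm_predict

labels, features = svm_read_problem('training_set.svm')  # hypothetical path
model = svm_train(labels, features, '-b 1')              # -b 1 enables probability estimates
# p_vals holds one probability per class for every classified instance
p_labels, p_acc, p_vals = svm_predict(labels, features, model, '-b 1')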

Serialize a dataset with triple confidence scores

See #67.
Given the triple

<SUBJECT> <FRAME> <FRAME_01> .

you should produce the following triple in a separate dataset:

<FRAME_01> <http://dbpedia.org/fact-extraction/confidence> "0.587372"^^<http://www.w3.org/2001/XMLSchema#float> .
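A sketch of the serialization (the confidence property URI is taken from the triple above; the output file name is hypothetical):

CONFIDENCE = '<http://dbpedia.org/fact-extraction/confidence>'
XSD_FLOAT = '<http://www.w3.org/2001/XMLSchema#float>'

def confidence_triple(frame_uri, score):
    """Serialize one confidence statement in N-Triples syntax."""
    return '%s %s "%f"^^%s .' % (frame_uri, CONFIDENCE, score, XSD_FLOAT)

# Hypothetical usage: one triple per extracted frame, in a separate dataset
with open('confidence.nt', 'w') as dataset:
    dataset.write(confidence_triple('<FRAME_01>', 0.587372) + '\n')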

Find existing frames that fit into the soccer domain

DISCLAIMER: this issue is not related to code

Produce a list of already available frame candidates (i.e., frame + frame elements + example sentences) that would fit well into the soccer domain.
You should proceed in either of two ways, with 2 opposite paradigms in mind:

  • top-down: you look for suitable frames in a frame repository;
  • bottom-up: you start from the top-ranked lexical units as per the output of step 2 and look for suitable ones in a frame repository.

Here are 2 frame repos as a starting point:

Build a list of relation candidates that do not currently exist in DBpedia

DISCLAIMER: this issue is not related to code

We want to assess the potential of the fact extractor.
Try to address the following question:

Which relations can we extract that are not already mapped in the DBpedia ontology or do not exist in the raw infobox properties datasets?

  • Focus on the soccer domain in your mother tongue;
  • Build a list of potentially interesting relations that are not already in the corresponding DBpedia chapter;
  • Compute statistics by browsing through soccer-related DBpedia entities, i.e., instances of SoccerPlayer, SoccerClub, SoccerManager, SoccerLeague, SoccerTournament, SoccerLeagueSeason.

DBpedia magic

The following SPARQL query computes the number of raw infobox properties for a given $CLASS. You can fire it at your DBpedia language endpoint:

SELECT ?properties (COUNT(?properties) AS ?amount)
WHERE
{
  ?s a <http://dbpedia.org/ontology/$CLASS> ;
    ?properties ?o .
  FILTER (regex(str(?properties), "dbpedia.org/property"))
}
GROUP BY ?properties
ORDER BY ?amount

Bonus

Can you figure out the query for the ontology properties?
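As a hint, a sketch for the bonus: only the filter changes.

SELECT ?properties (COUNT(?properties) AS ?amount)
WHERE
{
  ?s a <http://dbpedia.org/ontology/$CLASS> ;
    ?properties ?o .
  FILTER (regex(str(?properties), "dbpedia.org/ontology"))
}
GROUP BY ?properties
ORDER BY ?amount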
