CLEF2018-factchecking

This repository contains the dataset for the CLEF2018-factchecking task.

It also contains the format checker, scorer and baselines for the task.

FCPD corpus for the CLEF-2018 LAB on "Automatic Identification and Verification of Claims in Political Debates"
Version 1.0: January 25, 2018 (TRIAL)

This file contains the basic information regarding the CLEF2018-factcheck dataset on fact checking political debates, provided for the CLEF-2018 Lab on "Automatic Identification and Verification of Claims in Political Debates". The current TRIAL version (1.0, January 25, 2018) corresponds to the release of part of the training data set. The full training sets and the test sets will be provided in future versions. All changes and updates to these data sets are reported in Section 1 of this document.

Table of contents:

List of Versions
Contents of the Distribution v1.0
Subtasks
Data Format
Results File Format
Format checkers
Scorers
- Evaluation metrics
Baselines
Notes
Licensing
Citation
Credits

List of Versions

v1.0 [2018/01/25] - TRIAL data. Partial distribution of the training data for Tasks 1 and 2, in English and Arabic. It contains examples extracted from two US Presidential debates and one Vice-Presidential debate in 2016.

Contents of the Distribution v1.0

We provide the following files:

Main folder: data
- README.md
  this file
- Subfolder /task1/English/
  Task1-English-1st-Presidential.txt
  Task1-English-2nd-Presidential.txt
  Task1-English-Vice-Presidential.txt
- Subfolder /task1/Arabic/
  same content as the previous folder but with the Arabic datasets
  
  Task1-Arabic-1st-Presidential.txt
  Task1-Arabic-2nd-Presidential.txt
  Task1-Arabic-Vice-Presidential.txt
- Subfolder /task2/English/
  Task2-English-1st-Presidential.txt
  Task2-English-2nd-Presidential.txt
  Task2-English-Vice-Presidential.txt
- Subfolder /task2/Arabic/
  same content as the previous folder but with the Arabic datasets
  
  Task2-Arabic-1st-Presidential.txt
  Task2-Arabic-2nd-Presidential.txt
  Task2-Arabic-Vice-Presidential.txt

Subtasks

For ease of explanation, here we list the tasks:

Task 1: Check-Worthiness. Predict which claims in a political debate should be prioritized for fact-checking. In particular, given a debate, the goal is to produce a ranked list of its sentences based on their worthiness for fact checking.
Task 2: Factuality. Check the factuality of the identified check-worthy claims. In particular, given a sentence that is worth checking, the goal is for the system to determine whether the claim is likely to be true or false, or that it is unsure of its factuality.

Data Format

The datasets are TAB-separated text files. The text encoding is UTF-8.

Task 1:

line_number speaker text label

Where:

line_number: the line number (starting from 1)
speaker: the person speaking (a candidate, the moderator, or "SYSTEM"; the latter is used for the audience reaction)
text: a sentence that the speaker said
label: 1 if this sentence is to be fact-checked, and 0 otherwise

Example:

...
65 TRUMP So we're losing our good jobs, so many of them. 0
66 TRUMP When you look at what's happening in Mexico, a friend of mine who builds plants said it's the eighth wonder of the world. 0
67 TRUMP They're building some of the biggest plants anywhere in the world, some of the most sophisticated, some of the best plants. 0
68 TRUMP With the United States, as he said, not so much. 0
69 TRUMP So Ford is leaving. 1
70 TRUMP You see that, their small car division leaving. 1
71 TRUMP Thousands of jobs leaving Michigan, leaving Ohio. 1
72 TRUMP They're all leaving. 0
...
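
The Task 1 format above can be read with a few lines of Python. This is a minimal sketch, not part of the distribution: the names Task1Row, parse_task1_line and read_task1 are hypothetical, and it assumes every line has exactly the four TAB-separated fields described above.

```python
from typing import List, NamedTuple


class Task1Row(NamedTuple):
    """One sentence of a debate in the Task 1 format (hypothetical helper type)."""
    line_number: int
    speaker: str
    text: str
    label: int


def parse_task1_line(line: str) -> Task1Row:
    """Split one TAB-separated Task 1 line into its four fields."""
    line_number, speaker, text, label = line.rstrip("\n").split("\t")
    return Task1Row(int(line_number), speaker, text, int(label))


def read_task1(path: str) -> List[Task1Row]:
    """Read a whole Task 1 file (UTF-8), skipping empty lines."""
    with open(path, encoding="utf-8") as f:
        return [parse_task1_line(line) for line in f if line.strip()]
```

For example, read_task1("Task1-English-1st-Presidential.txt") would return one Task1Row per sentence, assuming the file is in the current directory.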

Task 2:

line_number speaker text claim_number normalized_claim label

Where:

line_number: the line number (starting from 1)
speaker: the person speaking (a candidate, the moderator, or "SYSTEM"; the latter is used for the audience reaction)
text: a sentence that the speaker said
claim_number: the claim number if this sentence contains a claim to be fact-checked, and N/A otherwise
normalized_claim: normalized form of the claim, i.e., this is what is to be checked; present only when claim_number is not N/A
label: TRUE, HALF-TRUE or FALSE; present only when claim_number is not N/A

NOTE 1: If the line does NOT contain an interesting claim, then claim_number is N/A and the normalized_claim and label fields are omitted.

NOTE 2: The same normalized claim (with the same claim number) can be associated with more than one sentence.

Example:

...
65 TRUMP So we're losing our good jobs, so many of them. N/A
66 TRUMP When you look at what's happening in Mexico, a friend of mine who builds plants said it's the eighth wonder of the world. N/A
67 TRUMP They're building some of the biggest plants anywhere in the world, some of the most sophisticated, some of the best plants. N/A
68 TRUMP With the United States, as he said, not so much. N/A
69 TRUMP So Ford is leaving. 1 Ford Motor Company is moving their small car division out of the USA. TRUE
70 TRUMP You see that, their small car division leaving. 1 Ford Motor Company is moving their small car division out of the USA. TRUE
71 TRUMP Thousands of jobs leaving Michigan, leaving Ohio. 2 Thousands of jobs are being lost in Michigan and Ohio due to Ford Motor Company moving their small car division out of the USA. FALSE
...
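
A possible way to read the Task 2 format, including the shorter N/A lines, is sketched below. The names Task2Row and parse_task2_line are hypothetical, and the sketch assumes that lines without a claim stop after the N/A field, as in the example above.

```python
from typing import NamedTuple, Optional


class Task2Row(NamedTuple):
    """One sentence of a debate in the Task 2 format (hypothetical helper type)."""
    line_number: int
    speaker: str
    text: str
    claim_number: Optional[int]      # None when the file says N/A
    normalized_claim: Optional[str]  # None when the line has no claim
    label: Optional[str]             # TRUE, HALF-TRUE, FALSE, or None


def parse_task2_line(line: str) -> Task2Row:
    """Split one TAB-separated Task 2 line; N/A lines carry only four fields."""
    fields = line.rstrip("\n").split("\t")
    line_number, speaker, text, claim = fields[:4]
    if claim == "N/A":
        return Task2Row(int(line_number), speaker, text, None, None, None)
    return Task2Row(int(line_number), speaker, text, int(claim), fields[4], fields[5])
```

Because the same claim number can appear on several lines (NOTE 2), code that collects the claims should group rows by claim_number rather than assume one row per claim.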

Results File Format:

Task 1:

For this task, the expected results file is a list of claims with the estimated score for check-worthiness. Each line is tab-separated and contains:

line_number score

Where line_number is the number of the claim in the debate and score is a number indicating the priority of the claim for fact-checking. For example:

1 0.9056
2 0.6862
3 0.7665
4 0.9046
5 0.2598
6 0.6357
7 0.9049
8 0.8721
9 0.5729
10 0.1693
11 0.4115
...

Your result file MUST contain scores for all lines from the respective input file. Otherwise the scorer will not score this result file.
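
A Task 1 results file in this format could be produced as sketched below. The function names are hypothetical, and the four-decimal formatting only mirrors the example above; it is an assumption, not a stated requirement.

```python
from typing import Dict


def format_task1_results(scores: Dict[int, float]) -> str:
    """Return one 'line_number<TAB>score' line per sentence, ordered by line number."""
    return "".join(f"{n}\t{scores[n]:.4f}\n" for n in sorted(scores))


def write_task1_results(scores: Dict[int, float], path: str) -> None:
    """Write the formatted results to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(format_task1_results(scores))
```

The scores dictionary should have one entry for every line of the input file, since incomplete results files are not scored.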

Task 2:

For this subtask, participants should estimate the credibility of the fact-checked claims. The results file contains one tab-separated line per instance with:

claim_number label

Where claim_number is the consecutive number of a fact-checked claim (only fact-checked claims are numbered) and label is one of: TRUE, FALSE, HALF-TRUE. For example:

1 TRUE
2 FALSE
3 TRUE
4 HALF-TRUE
5 HALF-TRUE
6 HALF-TRUE
7 HALF-TRUE
8 HALF-TRUE
9 FALSE
10 TRUE
...

Your result file MUST contain predictions for all claims from the respective input file. Otherwise the scorer will not score this result file.
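
As a rough sketch, a Task 2 results file could be built from a mapping of claim numbers to labels. The names VALID_LABELS and format_task2_results are hypothetical; rejecting unknown labels early is a design choice of this sketch, not a documented behaviour of the tools in this repository.

```python
from typing import Dict

# The three labels allowed in a Task 2 results file.
VALID_LABELS = {"TRUE", "HALF-TRUE", "FALSE"}


def format_task2_results(predictions: Dict[int, str]) -> str:
    """Return one 'claim_number<TAB>label' line per claim, ordered by claim number."""
    lines = []
    for claim_number in sorted(predictions):
        label = predictions[claim_number]
        if label not in VALID_LABELS:
            raise ValueError(f"claim {claim_number}: unexpected label {label!r}")
        lines.append(f"{claim_number}\t{label}\n")
    return "".join(lines)
```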

Format checkers

The checkers for each subtask are located in the format_checker module of the project. Each format checker verifies that your generated results file complies with the expected format. To launch them run:

python3 task1.py --pred_file_path=<path_to_your_results_file>
python3 task2.py --pred_file_path=<path_to_your_results_file>

run_format_checker.sh includes examples of the output of the checkers when dealing with an ill-formed results file. Its output can be seen in run_format_checker_out.txt. Checks for completeness (whether the results files contain all lines / claims) are NOT handled by the format checkers, because they receive only the results file and not the gold one.
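
Since the format checkers do not test completeness, a quick local check for Task 1 can be sketched as follows. This is not the code used by the scorers; the function names are hypothetical, and the sketch assumes that the first TAB-separated field of both the gold file and the results file is the line number.

```python
from typing import Iterable, List


def first_column_ids(lines: Iterable[str]) -> List[int]:
    """Extract the integer in the first TAB-separated field of each non-empty line."""
    return [int(line.split("\t", 1)[0]) for line in lines if line.strip()]


def missing_ids(gold_lines: Iterable[str], pred_lines: Iterable[str]) -> List[int]:
    """Return the gold line numbers that have no entry in the results file."""
    predicted = set(first_column_ids(pred_lines))
    return [i for i in first_column_ids(gold_lines) if i not in predicted]
```

An empty list means every gold line has a score. For Task 2 the same idea applies to claim numbers, which are in the fourth field of the gold file rather than the first.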

Scorers

Launch the scorers for each task as follows:

python3 task1.py --gold_file_path="<path_to_gold_file_1, path_to_gold_file_k>" --pred_file_path="<predictions_file_1, predictions_file_k>"
python3 task2.py --gold_file_path="<path_to_gold_file_1, path_to_gold_file_k>" --pred_file_path="<predictions_file_1, predictions_file_k>"

Both --gold_file_path and --pred_file_path take a single string that contains a comma-separated list of file paths. The lists may be of arbitrary positive length (so even a single file path is OK) but their lengths must match.
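
The list handling described above can be illustrated with a short sketch. It is not taken from the scorers: split_paths and pair_files are hypothetical names, and stripping the whitespace around each path is an assumption of this sketch.

```python
from typing import List, Tuple


def split_paths(arg: str) -> List[str]:
    """Split a comma-separated argument into paths, dropping surrounding whitespace."""
    return [p.strip() for p in arg.split(",") if p.strip()]


def pair_files(gold_arg: str, pred_arg: str) -> List[Tuple[str, str]]:
    """Pair the n-th gold file with the n-th predictions file."""
    gold, pred = split_paths(gold_arg), split_paths(pred_arg)
    if not gold or len(gold) != len(pred):
        raise ValueError("both lists must be non-empty and of equal length")
    return list(zip(gold, pred))
```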

<path_to_gold_file_n> is the path to the file containing the gold annotations for debate n and <predictions_file_n> is the path to the respective file holding predicted results for debate n, which must follow the format described in the 'Results File Format' section.

The scorers call the format checkers for the corresponding task to verify the output is properly shaped. They also handle checking if the provided predictions file contains all lines / claims from the gold one.

run_scorer.sh provides examples on using the scorers and the results can be viewed in the run_scorer_out.txt file.

Evaluation metrics

For Task 1 (ranking): R-Precision, Average Precision, Reciprocal Rank, Precision@k, and the means of these over multiple debates. The official metric for Task 1, which will be used for the competition ranking, is Mean Average Precision (MAP).
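
The Task 1 metrics follow their standard textbook definitions, which can be sketched as below. This is an illustration only and not the official scorer, which may differ in details such as how tied scores are ordered. All function names are hypothetical.

```python
from typing import Dict, List


def ranked_labels(scores: Dict[int, float], gold: Dict[int, int]) -> List[int]:
    """Gold labels (1 = check-worthy) ordered by descending predicted score."""
    order = sorted(scores, key=lambda n: scores[n], reverse=True)
    return [gold[n] for n in order]


def precision_at_k(labels: List[int], k: int) -> float:
    """Fraction of check-worthy sentences among the top k."""
    return sum(labels[:k]) / k


def r_precision(labels: List[int]) -> float:
    """Precision at R, where R is the number of check-worthy sentences."""
    r = sum(labels)
    return sum(labels[:r]) / r if r else 0.0


def reciprocal_rank(labels: List[int]) -> float:
    """1 / rank of the first check-worthy sentence (0 if there is none)."""
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            return 1.0 / rank
    return 0.0


def average_precision(labels: List[int]) -> float:
    """Mean of the precision values at each check-worthy position."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(debates: List[List[int]]) -> float:
    """MAP: the mean of the average precision over several debates."""
    return sum(average_precision(d) for d in debates) / len(debates)
```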

For Task 2 (classification): Mean Absolute Error (MAE), Macro-average MAE, Accuracy, Macro-average F1, Macro-average Recall (+ confusion matrix). The official metric for Task 2, which will be used for the competition ranking, is Mean Absolute Error (MAE).
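
MAE treats the three labels as points on an ordinal scale. The sketch below assumes the mapping FALSE=0, HALF-TRUE=1, TRUE=2 and takes macro-average MAE to be the mean of the per-class MAE over the gold classes; both are assumptions of this illustration, so the official scorer should be consulted for the exact definitions.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed ordinal encoding of the labels (not confirmed by this document).
LABEL_TO_ORDINAL: Dict[str, int] = {"FALSE": 0, "HALF-TRUE": 1, "TRUE": 2}


def absolute_error(gold: str, pred: str) -> int:
    """Distance between two labels on the assumed ordinal scale."""
    return abs(LABEL_TO_ORDINAL[gold] - LABEL_TO_ORDINAL[pred])


def mean_absolute_error(gold: List[str], pred: List[str]) -> float:
    """Average distance between gold and predicted labels."""
    errors = [absolute_error(g, p) for g, p in zip(gold, pred)]
    return sum(errors) / len(errors)


def macro_average_mae(gold: List[str], pred: List[str]) -> float:
    """Mean of the per-class MAE, with classes taken from the gold labels."""
    per_class = defaultdict(list)
    for g, p in zip(gold, pred):
        per_class[g].append(absolute_error(g, p))
    return sum(sum(e) / len(e) for e in per_class.values()) / len(per_class)
```

Under this encoding, predicting TRUE for a FALSE claim costs 2, while predicting HALF-TRUE costs 1, so near misses are penalized less than opposite labels.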

Baselines

The baselines module contains a random baseline and a simple n-gram baseline for each of the tasks and each of the languages.

If you execute any of the scripts, both of the baselines will be trained on the 1st Presidential and the Vice-Presidential debates and evaluated on the 2nd Presidential debate. The performance of both baselines will be displayed.
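
To give an idea of what a random baseline for Task 1 does, here is a minimal sketch. It is not the code in the baselines module: the function random_scores and its seed parameter are hypothetical, and the sketch simply assigns each line a uniform random score.

```python
import random
from typing import Dict, Iterable


def random_scores(line_numbers: Iterable[int], seed: int = 0) -> Dict[int, float]:
    """Assign every line a uniform random check-worthiness score in [0, 1)."""
    rng = random.Random(seed)  # a fixed seed makes the run reproducible
    return {n: rng.random() for n in line_numbers}
```

Any real system should be expected to outperform such a ranking, which is why it is useful as a lower reference point.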

Notes:

This distribution is directly downloadable from the official CLEF-2018 Fact Checking Lab repository: https://github.com/clef2018-factchecking/clef2018-factchecking

Licensing

These datasets are free for general research use.

Citation

Whenever using this resource you should cite the CLEF-2018 paper by the organizers describing the Fact Checking Lab. For the moment, the paper is not available. We will update the BIB entry below in subsequent versions of this document.

@InProceedings{,
    author    = {Nakov, Preslav  and  M\`{a}rquez, Llu\'{i}s and  Barr\'{o}n-Cede\~no, Alberto and Zaghouani, Wajdi and Elsayed, Tamer and Suwaileh, Reem and Gencheva, Pepa},  
    title     = {{CLEF}-2018 Lab on Automatic Identification and Verification of Claims in Political Debates},
    booktitle = {Proceedings of the CLEF-2018}, 
    year      = {2018}, 
}

Credits

Lab Organizers:

Preslav Nakov, Qatar Computing Research Institute, HBKU
Lluís Màrquez, Amazon
Alberto Barrón-Cedeño, Qatar Computing Research Institute, HBKU
Wajdi Zaghouani, Carnegie Mellon University - Qatar
Tamer Elsayed, Qatar University
Reem Suwaileh, Qatar University
Pepa Gencheva, Sofia University
Spas Kyuchukov, Sofia University
Giovanni Da San Martino, Qatar Computing Research Institute, HBKU

Task website: http://alt.qcri.org/clef2018-factcheck/ The official rules are published on the website; please check them.

Contact: [email protected]
