
Introduction

We introduce TabFact, a large-scale dataset consisting of 118,439 manually annotated statements about 16,621 Wikipedia tables, where each statement is labeled as ENTAILED or REFUTED. The full paper is "TabFact: A Large-scale Dataset for Table-based Fact Verification".

TabFact is the first dataset for fact verification over structured data, a task that involves mixed reasoning in both symbolic and linguistic forms. We propose two models to tackle it: Table-BERT and the Latent Program Algorithm.
  • The brief architecture of the Latent Program Algorithm (LPA) is shown below:

  • The brief architecture of Table-BERT is shown below:

Explore the data

We provide an interface to browse and explore the dataset at https://tabfact.github.io/explore.html

Requirements

  • Python 3.5
  • Pytorch 1.0+
  • Ujson 1.35
  • Pytorch_Pretrained_Bert 0.6.2 (Huggingface Implementation)
  • Pandas
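
A minimal setup sketch using pip (the version pins mirror the list above; any compatible releases should work):

  pip install ujson==1.35 pandas
  pip install torch==1.0.0
  pip install pytorch-pretrained-bert==0.6.2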

Data Preprocessing

The folder "collected_data" contains the raw data collected directly from Amazon Mechanical Turk. All text is lower-cased, and some tables contain foreign characters. There are two files: the r1 file was collected in the first round (simple channel) and contains statements involving less reasoning, while the r2 file was collected in the second round (complex channel) and involves more complex multi-hop reasoning. Together, the two files contain roughly 110K statements, with positive and negative statements balanced. The data format is as follows:

Table-id: [
  [
  Statement 1,
  Statement 2,
  ...
  ],
  [
  Label 1,
  Label 2,
  ...
  ],
  Table Caption
]
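
As an illustration, the raw file can be loaded and inspected with a few lines of Python (the file name r1_training_all.json is assumed to follow the repository layout; adjust it to your copy):

  import json

  # Load the round-1 (simple channel) annotations.
  with open('collected_data/r1_training_all.json') as f:
      data = json.load(f)

  # Each entry maps a table id to [statements, labels, caption].
  table_id, (statements, labels, caption) = next(iter(data.items()))
  print(table_id, '->', caption)
  for stmt, label in zip(statements, labels):
      print(label, stmt)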
  1. General Tokenization and Entity Matching

    • tokenized_data: This folder contains the data after tokenization, produced with preprocess_data.py:
      cd code/
      python preprocess_data.py
      
      This script performs feature-based entity linking: entities in the statements are linked to the longest text span in the table cells. The result file is tokenized_data/full_cleaned.json, which has the following format:
      Table-id: [
      [
      Statement 1: xxxxx #xxx;idx1,idx2# xxx.
      Statement 2: xx xxx #xxx;idx1,idx2# xxx.
      ...
      ],
      [
      Label 1,
      Label 2,
      ...
      ],
      Table Caption
      ]
      
The enclosed snippet #xxx;idx1,idx2# denotes that the phrase "xxx" is linked to the entity in the idx1-th row and idx2-th column of the table "Table-id.csv"; if idx1=-1, the phrase links to the table caption. This entity linking step is essential for the program search algorithm that follows.
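
      As a sketch, the linked spans can be recovered with a small regular expression (this parser is illustrative and not part of the repository):

      import re

      # Matches the #text;row,col# markup produced by preprocess_data.py.
      LINK = re.compile(r'#([^;#]+);(-?\d+),(-?\d+)#')

      def extract_links(statement):
          """Return (text, row, col) triples; row == -1 means the caption."""
          return [(m.group(1), int(m.group(2)), int(m.group(3)))
                  for m in LINK.finditer(statement)]

      print(extract_links('the score was #3 - 2;1,4# in the final'))
      # -> [('3 - 2', 1, 4)]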
  2. Tokenization For Latent Program Algorithm

    • preprocessed_data_program: This folder contains preprocessed.json, which is obtained by:
      cd code/
      python run.py
      
      This script mainly performs cache (string, number) initialization; the result file looks like:
      [
        [
        Table-id,
        Statement: xxx #xxx;idx1,idx2# (after entity linking),
        Pos-Tagging information,
        Statement with place-holder,
        [linked string entity],
        [linked number entity],
        [linked string header],
        [linked number header],
        Statement-id,
        Label
        ],
      ]
      
This file is fed directly into run.py to search for program candidates using dynamic programming; the folder also contains the tsv files necessary for the program ranking algorithm.
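
      For reference, an entry can be unpacked along these lines (field order as documented above; the snippet is illustrative):

      import json

      with open('preprocessed_data_program/preprocessed.json') as f:
          entries = json.load(f)

      # Field order follows the format documented above.
      (table_id, linked_stmt, pos_tags, stmt_with_placeholder,
       str_entities, num_entities, str_headers, num_headers,
       stmt_id, label) = entries[0]
      print(table_id, stmt_with_placeholder, label)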
    • all_programs: This folder contains the intermediate search results for the different statements, saved in one file per statement. The format of the intermediate program results looks like:
      [
        csv_file,
        statement,
        placeholder-text,
        label,
        [
          program1,
          program2,
          ...
        ]
      ]
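
      A quick way to inspect the candidates for one statement (the file pattern is an assumption; adjust it to the actual layout of all_programs/):

      import glob
      import json

      # Each file under all_programs/ stores the search results for one statement.
      path = sorted(glob.glob('all_programs/*.json'))[0]
      with open(path) as f:
          csv_file, statement, placeholder_text, label, programs = json.load(f)

      print(statement, '->', label)
      for prog in programs[:5]:
          print(prog)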
      
  3. Tokenization for Table-BERT

  cd code/
  python preprocess_BERT.py --scan horizontal
  python preprocess_BERT.py --scan vertical
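
  The --scan flag controls how a table is linearized before being fed to BERT. Below is a minimal sketch of the idea (the template wording is illustrative; see preprocess_BERT.py for the exact serialization):

  def linearize(headers, rows, scan='horizontal'):
      """Turn a table into one string, row by row or column by column."""
      if scan == 'horizontal':
          # Read the table one row at a time: "col is val; ...".
          parts = ['; '.join('{} is {}'.format(h, v)
                             for h, v in zip(headers, row))
                   for row in rows]
      else:
          # Read the table one column at a time instead.
          parts = ['; '.join('{} is {}'.format(h, v) for v in col)
                   for h, col in zip(headers, zip(*rows))]
      return ' . '.join(parts)

  headers = ['team', 'wins']
  rows = [['eagles', '10'], ['hawks', '7']]
  print(linearize(headers, rows, scan='horizontal'))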

Latent Program Algorithm

  1. Downloading the preprocessed data for LPA. Here we provide the data obtained after preprocessing through the above pipeline; you can download it by running:
  sh get_data.sh
  2. Training the ranking model. Once all the training and evaluation data are in the folder "preprocessed_data_program", run the following commands to train the model and evaluate fact verification accuracy:
  cd code/
  python model.py --do_train --do_val
  3. Evaluating the ranking model. We provide a pre-trained model in code/checkpoints/, which reproduces the exact numbers reported in the paper:
  cd code/
  python model.py --do_test --resume
  python model.py --do_simple --resume
  python model.py --do_complex --resume

Table-BERT

  1. Training the verification model
  cd code/
  python run_BERT.py --do_train [--do_eval] --scan [horizontal, vertical] --fact [first/second]
  2. Evaluating the verification model
  cd code/
  python run_BERT.py --do_eval --scan [horizontal, vertical] --fact [first/second] --load_dir YOUR_TRAINED_MODEL --eval_batch_size N

Reference

Please cite the paper in the following format if you use this dataset in your research.

@inproceedings{2019TabFactA,
  title={TabFact: A Large-scale Dataset for Table-based Fact Verification},
  author={Wenhu Chen and Hongmin Wang and Jianshu Chen and Yunkai Zhang and Hong Wang and Shiyang Li and Xiyou Zhou and William Yang Wang},
  year={2019}
}

Q&A

If you encounter any problems, please contact the first author directly or open an issue in the GitHub repo.
