Science-result-extractor

Introduction

This repository contains code and a few datasets to extract TDMS (Task, Dataset, Metric, Score) tuples from scientific papers in the NLP domain. We envision three primary uses for this repository: (1) to extract table content from PDF files, (2) to replicate the paper's results or run experiments based on a textual entailment system, and (3) to train a model to extract TDM mentions. Please refer to the following papers for the full details:

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 27 July - 2 August 2019

Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly. TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics (EACL 2021), Online, 19-23 April 2021

Extract table content from PDF files

We developed a deterministic PDF table parser based on GROBID. To use our parser, follow the steps below:

  1. Fork and clone this repository, e.g.,
> git clone https://github.com/IBM/science-result-extractor.git
  2. Download and install GROBID 0.5.3, following the installation instructions, e.g.,
> wget https://github.com/kermitt2/grobid/archive/0.5.3.zip
> unzip 0.5.3.zip
> cd grobid-0.5.3/
> ./gradlew clean install

(note that gradlew must be installed beforehand)

  3. Configure pGrobidHome and pGrobidProperties in config.properties. The default configuration assumes that the GROBID directory grobid-0.5.3 is a sibling of the science-result-extractor directory.
pGrobidHome=../../grobid-0.5.3/grobid-home
pGrobidProperties=../../grobid-0.5.3/grobid-home/config/grobid.properties
  4. PdfInforExtractor provides methods to extract section content and table content from a given PDF file.
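Before running the parser, it can help to confirm that the two GROBID paths in config.properties resolve to something on disk. The following is a minimal sketch in Python, not part of this repository: the helper names `parse_properties` and `missing_grobid_paths` are hypothetical, and the directory that the relative paths are resolved against (`base_dir`) is an assumption you should adjust to wherever the Java code is launched from.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#'/'!' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_grobid_paths(props, base_dir):
    """Return the GROBID keys that are unset or whose resolved path does not exist.

    base_dir is an assumption: the directory the relative paths are resolved from.
    """
    missing = []
    for key in ("pGrobidHome", "pGrobidProperties"):
        value = props.get(key)
        if value is None or not (Path(base_dir) / value).exists():
            missing.append(key)
    return missing


# The default values shown in the step above.
example = """\
pGrobidHome=../../grobid-0.5.3/grobid-home
pGrobidProperties=../../grobid-0.5.3/grobid-home/config/grobid.properties
"""
props = parse_properties(example)
print(props["pGrobidHome"])
print(missing_grobid_paths(props, base_dir="."))
```

An empty list from `missing_grobid_paths` suggests that both paths exist relative to `base_dir`; any key it returns points at a setting worth rechecking.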

Run experiments based on textual entailment system

We release the training/testing datasets for all experiments described in the paper. You can find them under the data/exp directory. The results reported in the paper are based on the datasets under the data/exp/few-shot-setup/NLP-TDMS/paperVersion directory. We later cleaned the datasets further (e.g., removed five PDF files from the testing datasets that also appear in the training datasets under a different name); the cleaned version is under the data/exp/few-shot-setup/NLP-TDMS folder. Below we illustrate how to run experiments on the NLP-TDMS dataset in the few-shot setup to extract TDM pairs.

  1. Fork and clone this repository.

  2. Download or clone BERT.

  3. Copy run_classifier_sci.py into the BERT directory.

  4. Download BERT embeddings. We use the base uncased models.

  5. If we use BERT_DIR to point to the directory with the embeddings and DATA_DIR to point to the directory with our train and test data, we can run the textual entailment system with run_classifier_sci.py. For example:

> DATA_DIR=../data/exp/few-shot-setup/NLP-TDMS/
> BERT_DIR=./model/uncased_L-12_H-768_A-12/
> python3 run_classifier_sci.py --do_train=true --do_eval=false --do_predict=true --data_dir=${DATA_DIR} --task_name=sci --vocab_file=${BERT_DIR}/vocab.txt --bert_config_file=${BERT_DIR}/bert_config.json --init_checkpoint=${BERT_DIR}/bert_model.ckpt --output_dir=bert_tdms --max_seq_length=512 --train_batch_size=6 --predict_batch_size=6
  6. TEModelEvalOnNLPTDMS provides methods to evaluate TDMS tuple extraction.

  7. GenerateTestDataOnPDFPapers provides methods to generate a testing dataset for arbitrary PDF papers.
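The long command in step 5 can also be assembled programmatically, which makes it easier to swap data or model directories. This is a minimal sketch that reuses only the flags and values shown above; the helper name `build_command` is hypothetical, and the code only builds and prints the argument list rather than launching training.

```python
import shlex


def build_command(data_dir, bert_dir, output_dir="bert_tdms"):
    """Build the run_classifier_sci.py argument list with the flags from step 5."""
    bert_dir = bert_dir.rstrip("/")  # avoid a double slash in the joined paths
    return [
        "python3", "run_classifier_sci.py",
        "--do_train=true",
        "--do_eval=false",
        "--do_predict=true",
        f"--data_dir={data_dir}",
        "--task_name=sci",
        f"--vocab_file={bert_dir}/vocab.txt",
        f"--bert_config_file={bert_dir}/bert_config.json",
        f"--init_checkpoint={bert_dir}/bert_model.ckpt",
        f"--output_dir={output_dir}",
        "--max_seq_length=512",
        "--train_batch_size=6",
        "--predict_batch_size=6",
    ]


cmd = build_command(
    data_dir="../data/exp/few-shot-setup/NLP-TDMS/",
    bert_dir="./model/uncased_L-12_H-768_A-12/",
)
print(shlex.join(cmd))  # shell-ready form of the command
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` from inside the BERT directory, where run_classifier_sci.py was copied in step 3.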

Read NLP-TDMS and ARC-PDN corpora

  1. Follow the instructions in the README in data/NLP-TDMS/downloader/ to download the entire collection of raw PDFs of the NLP-TDMS dataset. The downloaded PDFs can then be moved to data/NLP-TDMS/pdfFile (e.g., mv *.pdf ../pdfFile/.).

  2. For the ARC-PDN corpus, the original PDF files can be downloaded from the ACL Anthology Reference Corpus (Version 20160301). We use papers from ACL (P), EMNLP (D), and NAACL (N) published between 2010 and 2015. After uncompressing the downloaded archive, put the PDF files into the corresponding directories under the /data/ARC-PDN/ folder, e.g., copy D10 to /data/ARC-PDN/D/D10.

  3. We release the parsed NLP-TDMS and ARC-PDN corpora. NlpTDMSReader and ArcPDNReader in the corpus package illustrate how to read section and table contents from PDF files in these two corpora.
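The directory layout described in step 2 can be expressed as a small mapping from an Anthology folder name to its target location. This is a minimal sketch under stated assumptions: the helper name `arc_pdn_target` is hypothetical, folder names are assumed to be a venue letter followed by a two-digit year (as in the D10 example), and the root is taken to be `data/ARC-PDN` relative to the repository.

```python
import re
from pathlib import PurePosixPath

# Venue letters named in step 2.
VENUES = {"P": "ACL", "D": "EMNLP", "N": "NAACL"}


def arc_pdn_target(folder, root="data/ARC-PDN"):
    """Map a folder such as 'D10' to its target directory, e.g. data/ARC-PDN/D/D10."""
    match = re.fullmatch(r"([PDN])(\d{2})", folder)
    if match is None:
        raise ValueError(f"unexpected folder name: {folder!r}")
    venue = match.group(1)
    year = 2000 + int(match.group(2))
    if not 2010 <= year <= 2015:
        raise ValueError(f"{folder!r} is outside the 2010-2015 range used here")
    return PurePosixPath(root) / venue / folder


for name in ("P10", "D10", "N15"):
    print(name, VENUES[name[0]], arc_pdn_target(name))
```

The function only computes paths; copying the uncompressed folders into place (for example with `shutil.copytree`) is left to the caller.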

Train a model to extract TDM mentions

We release the TDMSci corpus (under the data folder). The dataset is in the standard CoNLL format.
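As a starting point for training a tagger, a CoNLL-style file can be read into sentences of (token, tag) pairs. This is a minimal sketch, not the repository's own loader: it assumes whitespace-separated columns with the token first and the tag last, and blank lines between sentences. The sample text and its tag names are illustrative only; check the released files for the actual column layout and label set.

```python
def read_conll(text):
    """Read CoNLL-style text into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # A blank line or a -DOCSTART- marker closes the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[-1]))  # token first, tag last (assumed)
    if current:
        sentences.append(current)
    return sentences


# Illustrative sample, not taken from the corpus.
sample = """\
We O
evaluate O
on O
SQuAD B-DATASET
. O

F1 B-METRIC
improves O
"""
sentences = read_conll(sample)
for sentence in sentences:
    print(sentence)
```

The resulting lists can be fed to most sequence-labeling toolkits after mapping tags to integer ids.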

Citing science-result-extractor

Please cite the following papers when using science-result-extractor:

@inproceedings{houyufang2019acl,
  title={Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, {\em Florence, Italy, 27 July -- 2 August 2019}},
  year      = {2019}
}

@inproceedings{houyufang2021eacl,
  title={TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics},
  author={Hou, Yufang and Jochim, Charles and Gleize, Martin and Bonin, Francesca and Ganguly, Debasis},
  booktitle = {Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, {\em Online, 19--23 April 2021}},
  year      = {2021}
}
