Cross-Document Alignment (CDA)

This is the code for our EMNLP 2020 paper: Multilevel Text Alignment with Cross-Document Attention

Citation:

@inproceedings{Zhou2020Multilevel,
  author={Xuhui Zhou and Nikolaos Pappas and Noah A. Smith},
  title={Multilevel Text Alignment with Cross-Document Attention},
  booktitle={EMNLP},
  year={2020}
}

Project Website:

We also release a benchmark for Document Relation Prediction and Localization; the dataset used in our paper can be downloaded here.

Installation:

conda env create -f environment.yml
conda activate dev

Working with other versions of packages:

Our code has been run with many different versions of Python, PyTorch, and transformers (the essential packages for this project). You can probably use your own package versions and run our code without further modification. Note, however, that we have not tested every model in every environment, so your results may differ from those reported in the paper.
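Since results may depend on the environment, it can help to record which versions you are running. The following is a minimal sketch, not part of this repo: the function name `installed_versions` is hypothetical, and it assumes PyTorch and transformers were installed under the distribution names `torch` and `transformers`.

```python
import platform
from importlib import metadata


def installed_versions(packages=("torch", "transformers")):
    """Return a dict mapping 'python' and each package name to its version.

    Packages that are not installed map to None.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Keeping this output next to your results makes it easier to tell whether a difference from the paper might come from the environment.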

Example commands

Running GRU-HAN

The following command trains a GRU-based hierarchical attention network (HAN) classifier on the AAN corpus. Modify the relative paths in the .sh script as needed before running it.

./run/acl_train.sh
  • Please navigate the code starting from the .sh file, which points to the relevant files in the /run folder.

  • If the name of a .sh file starts with acl, ai2_ab, ai2_g, or pla, it corresponds to the AAN, OC, S2ORC, or PAN task described in our paper, respectively.
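The naming convention above can be written as a small lookup. This is only an illustrative sketch: the helper `task_for_script` and the dictionary `PREFIX_TO_TASK` are hypothetical names and do not exist in the repo.

```python
from pathlib import Path

# Prefix-to-task mapping as stated in the list above.
PREFIX_TO_TASK = {
    "acl": "AAN",
    "ai2_ab": "OC",
    "ai2_g": "S2ORC",
    "pla": "PAN",
}


def task_for_script(script_path):
    """Return the task name for a .sh script, or None if no prefix matches."""
    name = Path(script_path).name
    # Try longer prefixes first so a more specific prefix always wins.
    for prefix in sorted(PREFIX_TO_TASK, key=len, reverse=True):
        if name.startswith(prefix):
            return PREFIX_TO_TASK[prefix]
    return None


if __name__ == "__main__":
    print(task_for_script("./run/acl_train.sh"))
```

For `./run/acl_train.sh` the lookup yields the AAN task; a script whose name has none of the four prefixes yields `None`.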

Running BERT-HAN

The following command trains a BERT-based hierarchical attention network (HAN) classifier on the AAN corpus. Modify the relative paths in the .sh script as needed before running it.

./run/acl_train_bert.sh

Before running the code, you need the pre-trained contextualized embeddings (a .npy file together with an .index file). There are many ways to obtain them; we recommend the following command (get_rep.py is included in this repo):

# The value of TRAIN_FILE does not matter for this step.
export TRAIN_FILE='this does not matter'
# TEST_FILE points to your data.
export TEST_FILE='your data'

# Quote the variables so that values containing spaces stay a single argument.
python get_rep.py \
    --output_dir=cite_models \
    --overwrite_output_dir \
    --model_type=bert \
    --per_gpu_eval_batch_size=10 \
    --model_name_or_path=bert-base-cased \
    --line_by_line \
    --train_data_file="$TRAIN_FILE" \
    --special_eval \
    --eval_data_file="$TEST_FILE" \
    --rep_name=../HAMN/data/cite_ai2/test_ai2_ab.npy \
    --mlm

Note that get_rep.py takes a sentence-level .txt file along with an .index file as input. One example can be found in

Running finetuning BERT-HAN

Warning: Make sure you have more than 10 GB of GPU memory available before running the fine-tuning version.

The following command trains a BERT-based hierarchical attention network (HAN) classifier on the AAN corpus with BERT fine-tuning. Modify the relative paths in the .sh script as needed before running it.

./BERT-HAN/run_ex_sent.sh
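Before launching the fine-tuning script, you may want to check the memory requirement from the warning above. The sketch below is one possible way to do so and is not part of this repo: it assumes an NVIDIA GPU with the `nvidia-smi` tool on the PATH, the function names are hypothetical, and free memory at check time is only a rough proxy for whether training will fit.

```python
import shutil
import subprocess

# Threshold taken from the warning above (10 GB, expressed in MiB).
MIN_FREE_MIB = 10 * 1024


def free_gpu_memory_mib():
    """Return free memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    # With noheader,nounits each line holds a single integer (MiB).
    return [int(tok) for tok in result.stdout.split() if tok.isdigit()]


def has_enough_gpu_memory(min_free_mib=MIN_FREE_MIB):
    """True if at least one visible GPU reports more free memory than the threshold."""
    return any(free > min_free_mib for free in free_gpu_memory_mib())


if __name__ == "__main__":
    print("free MiB per GPU:", free_gpu_memory_mib())
    print("enough for fine-tuning:", has_enough_gpu_memory())
```

If no GPU is found, the check reports that there is not enough memory instead of raising an error.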

Other Friendly Reminders

  • To evaluate models, find the corresponding evaluation .sh scripts; they should be straightforward.
  • The other scripts are used for variations of our experiments, only some of which are reported in the paper. You can ignore them or read on.
  • See d_graph_hier_mat_model_g.py and graph_hier_mat_model_g.py for our implementations of Deep and Shallow CDA. This is where the magic happens.
  • For evaluation code, we recommend looking at test_pla_mask_onestep.py and test_state_dict_mask_onestep.py for the plagiarism task and the citation recommendation task. They differ subtly for the S2D task.
