
CHMM-ALT

Alternate-training for Conditional hidden Markov model and BERT-NER.


This code accompanies the paper BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition.

To view the previous version of the program used for the paper, switch to branch prev.

Conditional hidden Markov model (CHMM) is also included in the Wrench project 🔧

Check out my follow-up to this work: Sparse-CHMM

1. Dependencies

Please check requirement.txt for the package dependencies. Note that only the packages used for model training are listed in the file; those for dataset construction are omitted for the reasons described below.
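
The training environment can be set up by installing the listed packages, for example with pip:

pip install -r requirement.txt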

2. Dataset Construction

Pre-Processed Datasets

The dataset construction program depends on several external libraries such as AllenNLP, wiser, and skweak, some of which have conflicting dependencies and some of which are no longer maintained. Building the datasets from the source data can therefore be difficult, so we provide the pre-processed datasets in .json format under the data directory for reproduction. If you prefer building the datasets from source, refer to the following subsections.
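
For reproduction you normally only need to read these files. Here is a minimal sketch of peeking at one of them; the path and the top-level structure are assumptions, so adjust them to what you actually find under the data directory:

import json

# Peek at a pre-processed dataset file; the dataset name and the
# top-level structure (dict vs. list) are assumptions -- inspect the
# file before relying on a particular schema.
with open("data/NCBI-Disease/train.json", encoding="utf-8") as f:
    dataset = json.load(f)
print(type(dataset), len(dataset))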

Source Data

The dataset construction program for the NCBI-Disease, BC5CDR, and LaptopReview datasets is modified from the wiser project (paper), which consists of three repositories.

The dataset construction program for the CoNLL 2003 dataset is based on skweak.

The source data are provided in the folders data_constr/<DATASET NAME>/data. You can also download the source data from the links below:

  • BC5CDR: Download the train, development, and test BioCreative V CDR corpus data files.

  • NCBI Disease: Download the complete training, development, and testing sets.

  • LaptopReview: Download the train data V2.0 for the Laptops and Restaurants dataset and the test data - phase B.

  • CoNLL 2003: You can find a pre-processed CoNLL 2003 English dataset here.

Place the downloaded data in the corresponding folders data_constr/<DATASET NAME>/data.

External Dependencies

To build the datasets, you may need the external dictionaries and models on which skweak and wiser depend.

You can get the files from Google Drive or download them individually from here and here. Unzip them and place the extracted files in data_constr/Dependency/.

Building datasets

Run the build.sh script in the dataset folder data_constr/<DATASET NAME> with

./build.sh

You will see train.json, valid.json, test.json, and meta.json files in your target folder if the program runs successfully.

You can also customize the script with your favorite arguments.

Backward compatibility

Notice: the datasets constructed as described above are not exactly the same as the datasets used in the paper.

However, our code fully supports the previous version of the datasets. To reproduce the results in the paper, please refer to the dataset construction methods in the prev branch and point the file-location arguments to their directories.

Note: Our data format is not compatible with Wrench.

3. Run

Our program uses the argument-parsing utilities from the Huggingface transformers repo, which support both the ordinary argument parsing approach from shell inputs and parsing from json files.
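
For reference, here is a minimal sketch of how this dual parsing pattern works with Huggingface's HfArgumentParser; the dataclass fields below are illustrative placeholders, not the program's actual arguments:

import sys
from dataclasses import dataclass, field
from transformers import HfArgumentParser

@dataclass
class DemoArguments:
    # placeholder fields for illustration only
    train_path: str = field(default="", metadata={"help": "training set location"})
    lr: float = field(default=1e-3, metadata={"help": "learning rate"})

parser = HfArgumentParser(DemoArguments)
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
    # parse every field from a json configuration file
    args, = parser.parse_json_file(json_file=sys.argv[1])
else:
    # ordinary shell-style parsing, e.g. --train_path ... --lr ...
    args, = parser.parse_args_into_dataclasses()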

We have three entry files: chmm.py, bert.py, and alt.py, all stored in the run directory. Each file corresponds to a component of our alternate-training pipeline. The scripts folder contains the configuration files that define the hyper-parameters for model training. You can use either the .json or the .sh format. Please make sure you are at the project directory ([]/CHMM-ALT/).

To train and evaluate CHMM, go to ./label_model/ and run

PYTHONPATH="." CUDA_VISIBLE_DEVICES=0 python ./run/chmm.py ./scripts/config_chmm.json

Here config_chmm.json is a configuration file used only for demonstration. Another option is

sh ./scripts/run_chmm.sh
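
The .sh scripts wrap the same entry point. A plausible sketch of what such a script contains is shown below; the hyper-parameter flags are hypothetical placeholders, while the real ones are presumably defined by the dataclass arguments in ./run/chmm.py:

# hypothetical contents of scripts/run_chmm.sh; flag names are placeholders
PYTHONPATH="." CUDA_VISIBLE_DEVICES=0 python ./run/chmm.py \
    --train_path ./data/NCBI-Disease/train.json \
    --lr 1e-3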

You need to fine-tune the hyper-parameters to get better performance.

The way to run BERT-NER (./run/bert.py) or alternate training (./run/alt.py) is similar, so we do not detail it here.
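
At a high level, the alternation has the following shape. This is hedged pseudocode only: the function names are hypothetical, and the exact way the end model's predictions are fed back follows the paper rather than this sketch; see ./run/alt.py for the actual logic.

# hypothetical pseudocode for the alternate-training loop
chmm = train_chmm(weak_labels, embeddings)          # label model (chmm.py)
for _ in range(n_rounds):
    denoised = chmm.predict(train_set)              # denoised labels
    bert_ner = train_bert_ner(train_set, denoised)  # end model (bert.py)
    # treat the end model's output as an additional weak source
    chmm = train_chmm(weak_labels + [bert_ner.predict(train_set)], embeddings)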

4. Citation

If you find our work helpful, you can cite it as

@inproceedings{li-etal-2021-bertifying,
    title = "{BERT}ifying the Hidden {M}arkov Model for Multi-Source Weakly Supervised Named Entity Recognition",
    author = "Li, Yinghao  and
      Shetty, Pranav  and
      Liu, Lucas  and
      Zhang, Chao  and
      Song, Le",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.482",
    doi = "10.18653/v1/2021.acl-long.482",
    pages = "6178--6190",
}


chmm-alt's Issues

Error in data loading

Hi there,

if I try to execute your code (e.g. with the data from Laptop Review) I get the following error:

span_dict[(span[0], span[1])] = span[2]
IndexError: string index out of range

There seems to be some broken logic in these lines from dataset.py:

# get true labels
lbs = span_to_label(span_list_to_dict(data['label']), sent_tks)
lbs_list.append(lbs)
# get lf annotations (weak labels) in one-hot format tensor
w_lbs = [span_to_label(span_list_to_dict(data['weak_labels'][lf_idx]), sent_tks) for lf_idx in lf_rec_ids]

It tries to parse spans here, but in fact there are no spans, only single labels.
Maybe I am also mistaken - could you elaborate on this and provide some help?

Thank you
