MatBERT NER

A framework for materials science NER using the HuggingFace Transformers NLP Toolkit.

Installation

git clone https://github.com/walkernr/MatBERT_NER.git MatBERT_NER
cd MatBERT_NER
pip install -r requirements.txt .

Download the model weights here

Example Usage

The following command trains the MatBERT model on the solid state dataset using default parameters:

python train.py -dv gpu:0 -ds solid_state -ml matbert

Additional parameters can be specified:

usage: train.py [-h] [-dv DEVICE] [-sd SEEDS] [-ts TAG_SCHEMES] [-st SPLITS] [-ds DATASETS] [-ml MODELS] [-sl] [-bs BATCH_SIZE] [-on OPTIMIZER_NAME] [-wd WEIGHT_DECAY] [-ne N_EPOCH]
                [-eu EMBEDDING_UNFREEZE] [-tu TRANSFORMER_UNFREEZE] [-el EMBEDDING_LEARNING_RATE] [-tl TRANSFORMER_LEARNING_RATE] [-cl CLASSIFIER_LEARNING_RATE] [-sf SCHEDULING_FUNCTION]   
                [-km]

optional arguments:
  -h, --help            show this help message and exit
  -dv DEVICE, --device DEVICE
                        computation device for model (e.g. cpu, gpu:0, gpu:1)
  -sd SEEDS, --seeds SEEDS
                        comma-separated seeds for data shuffling and model initialization (e.g. 1,2,3 or 2,4,8)
  -ts TAG_SCHEMES, --tag_schemes TAG_SCHEMES
                        comma-separated tagging schemes to be considered (e.g. iob1,iob2,iobes)
  -st SPLITS, --splits SPLITS
                        comma-separated training splits to be considered, in percent (e.g. 80). test split will always be 10% and the validation split will be 1/8 of the training split   
                        unless the training split is 100%
  -ds DATASETS, --datasets DATASETS
                        comma-separated datasets to be considered (e.g. solid_state,doping)
  -ml MODELS, --models MODELS
                        comma-separated models to be considered (e.g. matbert,scibert,bert)
  -sl, --sentence_level
                        switch for sentence-level learning instead of paragraph-level
  -bs BATCH_SIZE, --batch_size BATCH_SIZE
                        number of samples in each batch
  -on OPTIMIZER_NAME, --optimizer_name OPTIMIZER_NAME
                        name of optimizer, add "_lookahead" to implement lookahead on top of optimizer (not recommended for ranger or rangerlars)
  -wd WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
                        weight decay for optimizer (excluding bias, gamma, and beta)
  -ne N_EPOCH, --n_epoch N_EPOCH
                        number of training epochs
  -eu EMBEDDING_UNFREEZE, --embedding_unfreeze EMBEDDING_UNFREEZE
                        epoch (index) at which bert embeddings are unfrozen
  -tu TRANSFORMER_UNFREEZE, --transformer_unfreeze TRANSFORMER_UNFREEZE
                        comma-separated number of transformers (encoders) to unfreeze at each epoch
  -el EMBEDDING_LEARNING_RATE, --embedding_learning_rate EMBEDDING_LEARNING_RATE
                        embedding learning rate
  -tl TRANSFORMER_LEARNING_RATE, --transformer_learning_rate TRANSFORMER_LEARNING_RATE
                        transformer learning rate
  -cl CLASSIFIER_LEARNING_RATE, --classifier_learning_rate CLASSIFIER_LEARNING_RATE
                        pooler/classifier learning rate
  -sf SCHEDULING_FUNCTION, --scheduling_function SCHEDULING_FUNCTION
                        function for learning rate scheduler (linear, exponential, or cosine)
  -km, --keep_model     switch for saving the best model parameters to disk
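Because several options accept comma-separated lists, a single invocation can cover more than one dataset, model, or seed. The sketch below assembles such a command in Python using only flags from the usage text above. The helper name `build_train_command` and the values chosen (for example, 5 epochs) are illustrative assumptions, not part of the repository.

```python
def build_train_command(device="gpu:0", datasets=("solid_state",), models=("matbert",),
                        tag_schemes=None, seeds=None, n_epoch=None, keep_model=False):
    """Assemble an argv list for train.py from flags documented in its usage text.

    Hypothetical helper for illustration; it is not part of MatBERT_NER.
    """
    cmd = ["python", "train.py", "-dv", device,
           "-ds", ",".join(datasets), "-ml", ",".join(models)]
    if tag_schemes:
        cmd += ["-ts", ",".join(tag_schemes)]          # e.g. iob1,iob2,iobes
    if seeds:
        cmd += ["-sd", ",".join(str(s) for s in seeds)]
    if n_epoch is not None:
        cmd += ["-ne", str(n_epoch)]
    if keep_model:
        cmd.append("-km")                              # save the best model parameters
    return cmd


cmd = build_train_command(
    datasets=("solid_state", "doping"),
    models=("matbert", "scibert"),
    tag_schemes=("iobes",),
    seeds=(1, 2, 3),
    n_epoch=5,            # illustrative value only
    keep_model=True,
)
print(" ".join(cmd))

# To launch the run from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Called with no arguments, the helper reproduces the default example command shown earlier.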

To train on custom annotated datasets, add them to the data_files dictionary in train.py. Similarly, alternative pre-trained models can be used by modifying the model_files dictionary in the same script.
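As a rough sketch of what registering a new dataset or model might look like, the snippet below assumes both dictionaries map a short name (the value passed to `-ds` or `-ml`) to a path or model identifier. The existing entries, the paths, the `my_dataset` and `my_model` names, and the `resolve` helper are all hypothetical; check train.py for the actual keys, values, and expected data format.

```python
# Hypothetical contents: the real entries in train.py may differ.
data_files = {
    "solid_state": "data/solid_state.json",   # placeholder path
    "doping": "data/doping.json",             # placeholder path
}
model_files = {
    "matbert": "path/to/matbert-base-uncased",  # placeholder for the downloaded weights
}

# Register a custom annotated dataset and an alternative pre-trained model.
data_files["my_dataset"] = "data/my_dataset.json"
model_files["my_model"] = "path/to/my_model"


def resolve(arg, table):
    """Map a comma-separated CLI value to its registered entries.

    Illustrative helper only; it is not taken from train.py.
    """
    names = arg.split(",")
    unknown = [name for name in names if name not in table]
    if unknown:
        raise KeyError(f"unregistered names: {unknown}")
    return {name: table[name] for name in names}


print(resolve("solid_state,my_dataset", data_files))
```

Once registered under this assumption, the new names would be selected on the command line in the same way as the built-in ones, for example `-ds my_dataset -ml my_model`.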

For prediction, use the predict function in predict.py. The predict_script.py file contains an example that was used internally, and the predict_mongo.py script contains an example that uses MongoDB. Both examples must be edited for your specific needs before they are usable.
