Scikit-learn style model finetuning for NLP
Finetune is a library that allows users to leverage state-of-the-art pretrained NLP models for a wide variety of downstream tasks.
Finetune currently supports TensorFlow implementations of the following models:
- BERT, from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- RoBERTa, from "RoBERTa: A Robustly Optimized BERT Pretraining Approach"
- GPT, from "Improving Language Understanding by Generative Pre-Training"
- GPT2, from "Language Models are Unsupervised Multitask Learners"
- TextCNN, from "Convolutional Neural Networks for Sentence Classification"
- Temporal Convolution Network, from "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling"
Huge thanks to Alec Radford and Jeff Wu for their hard work and quality research.
Section | Description |
---|---|
API Tour | Base models, configurables, and more |
Installation | How to install using pip or directly from source |
Finetune with Docker | Finetuning and inference within a Docker container |
Documentation | Full API documentation |
Finetuning the base language model is as easy as calling Classifier.fit:
from finetune import Classifier
model = Classifier() # Load base model
model.fit(trainX, trainY) # Finetune base model on custom data
model.save(path) # Serialize the model to disk
...
model = Classifier.load(path) # Reload models from disk at any time
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
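If predictions arrive as a list of per-class probability dictionaries, as the comment on the predict call above suggests, the most likely label for each input can be recovered with a small helper. This is a minimal sketch: `top_labels` is a hypothetical name, not part of the library, and the example values are illustrative only.

```python
from typing import Dict, List


def top_labels(predictions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring class name for each prediction dict."""
    return [max(scores, key=scores.get) for scores in predictions]


# Illustrative values only, shaped like the output shown in the comment above.
example_predictions = [
    {"class_1": 0.23, "class_2": 0.54, "class_3": 0.23},
    {"class_1": 0.70, "class_2": 0.20, "class_3": 0.10},
]
print(top_labels(example_predictions))
```

The helper relies only on the dictionary shape, so it should work with any model whose output follows that structure.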
Choose your desired base model from finetune.base_models:
from finetune.base_models import BERT, RoBERTa, GPT, GPT2, TextCNN, TCN
model = Classifier(base_model=BERT)
Optimize your model with a variety of configurables. A detailed list of all config items can be found in the full documentation.
model = Classifier(low_memory_mode=True, lr_schedule="warmup_linear", max_length=512, l2_reg=0.01, oversample=True, ...)
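When comparing several settings, it can help to generate the keyword arguments programmatically. The sketch below is not part of the library: `config_grid` is a hypothetical helper, the keyword names are taken from the example above, and the candidate values are arbitrary placeholders rather than recommendations.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def config_grid(options: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield one keyword-argument dict per combination of option values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Keyword names come from the example above; the values are placeholders.
search_space = {
    "lr_schedule": ["warmup_linear"],
    "max_length": [256, 512],
    "l2_reg": [0.0, 0.01],
}
candidates = list(config_grid(search_space))
print(len(candidates))
```

Each resulting dict could then be unpacked into the constructor, for example `Classifier(**candidates[0])`.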
The library supports finetuning for a number of tasks. A detailed description of all target models can be found in the full documentation.
from finetune import *
models = (Classifier, MultiLabelClassifier, MultiFieldClassifier, MultipleChoice, # Classify one or more inputs into one or more classes
Regressor, OrdinalRegressor, MultifieldRegressor, # Regress on one or more inputs
SequenceLabeler, Association, # Extract tokens from a given class, or infer relationships between them
Comparison, ComparisonRegressor, ComparisonOrdinalRegressor, # Compare two documents for a given task
LanguageModel, MultiTask, # Further pretrain your base models
DeploymentModel # Wrapper to optimize your serialized models for a production environment
)
For example usage of each of these target types, see the finetune/datasets directory. To keep the examples simple and fast to run, they use smaller versions of the published datasets.
If you have large amounts of unlabeled training data and only a small amount of labeled training data, you can finetune in two steps for best performance.
model = Classifier() # Load base model
model.fit(unlabeledX) # Finetune base model on unlabeled training data
model.fit(trainX, trainY) # Continue finetuning with a smaller amount of labeled data
predictions = model.predict(testX) # [{'class_1': 0.23, 'class_2': 0.54, ..}, ..]
model.save(path) # Serialize the model to disk
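The two-step recipe above needs the data separated into an unlabeled pool and a labeled set. As a minimal sketch, assuming a single corpus in which unlabeled examples carry `None` as their label, a helper like the following could produce the three inputs. The name `split_by_label` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_label(
    texts: Sequence[str], labels: Sequence[Optional[str]]
) -> Tuple[List[str], List[str], List[str]]:
    """Split a corpus into (unlabeledX, trainX, trainY) using None as 'no label'."""
    unlabeled_x: List[str] = []
    train_x: List[str] = []
    train_y: List[str] = []
    for text, label in zip(texts, labels):
        if label is None:
            unlabeled_x.append(text)
        else:
            train_x.append(text)
            train_y.append(label)
    return unlabeled_x, train_x, train_y


# Hypothetical sample data for illustration.
texts = ["great product", "arrived late", "no opinion recorded", "works fine"]
labels = ["positive", "negative", None, None]
unlabeledX, trainX, trainY = split_by_label(texts, labels)
print(len(unlabeledX), len(trainX), len(trainY))
```

The three returned lists correspond to the `unlabeledX`, `trainX`, and `trainY` arguments used in the two calls to `fit`.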
Finetune can be installed directly from PyPI by using pip:
pip3 install finetune
or installed directly from source:
git clone -b master https://github.com/IndicoDataSolutions/finetune && cd finetune
python3 setup.py develop # symlinks the git directory to your python path
pip3 install tensorflow-gpu --upgrade # or tensorflow-cpu
python3 -m spacy download en # download spacy tokenizer
In order to run finetune on your host, you'll need a working copy of CUDA >= 8.0, libcudnn >= 6, tensorflow-gpu >= 1.6, and up-to-date nvidia-driver versions.
You can optionally run the provided test suite to ensure installation completed successfully.
pip3 install pytest
pytest
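A lighter-weight check than the full test suite is to confirm that the expected packages can be located by the interpreter. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, and the package list is an assumption based on the installation steps above.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names assumed from the installation steps above.
print(missing_packages(["finetune", "tensorflow", "spacy"]))
```

An empty list means all three packages were found. This only checks that they are discoverable, not that the GPU drivers or CUDA libraries are working.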
If you'd prefer, you can also run finetune in a Docker container. The provided bash scripts assume you have a functional install of docker and nvidia-docker.
git clone https://github.com/IndicoDataSolutions/finetune && cd finetune
# For usage with NVIDIA GPUs
./docker/build_gpu_docker.sh # builds a docker image
./docker/start_gpu_docker.sh # starts a docker container in the background, forwards $PWD to /finetune
docker exec -it finetune bash # starts a bash session in the docker container
For CPU-only usage:
./docker/build_cpu_docker.sh
./docker/start_cpu_docker.sh
Full documentation and an API reference for finetune are available at finetune.indico.io.