Probing is a popular evaluation method for black-box language models. In the simplest case, the representation of a token or a sentence is fed to a small classifier that tries to predict some linguistic label. The setup sees only a small amount of training data, and only the classifier's parameters are trained; the black-box model's parameters are kept frozen.
This library was developed for probing contextualized language models such as BERT.
This framework has been used in several research projects.
It supports two types of evaluation.
Probe a single word in its sentence context, such as deriving the tense of the English word "cut" in these examples, where context clearly plays an important role:
I cut my hair yesterday.
Make sure you cut the edges.
Morphology probing uses TSV files as input. One line represents one sample. Each line has 4 tab-separated columns:
- the full sentence
- the target word
- the index of the target word when using space tokenization
- the label.
The above two examples would look like this:
I cut my hair yesterday. cut 1 Past
Make sure you cut the edges. cut 3 Pres
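A minimal reader for this four-column format might look like the following sketch (the class and function names are our own, not the library's internal loader):

```python
from dataclasses import dataclass

@dataclass
class ProbingSample:
    sentence: str
    target: str
    index: int
    label: str

def read_probing_tsv(lines):
    """Parse lines in the 4-column tab-separated morphology probing format."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sentence, target, index, label = line.split("\t")
        # Sanity check: the target word must sit at the given position
        # under space tokenization.
        assert sentence.split(" ")[int(index)] == target
        samples.append(ProbingSample(sentence, target, int(index), label))
    return samples
```

Feeding it the two example lines above yields two samples with labels `Past` and `Pres`.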
In tagging tasks, each line represents a word, and sentence boundaries are denoted by empty lines. Each line consists of a word and its tag or label separated by a tab.
An example for English POS tagging:
If SCONJ
you PRON
need VERB
any DET
more ADJ
info NOUN
, PUNCT
please INTJ
advise VERB
. PUNCT
Hi INTJ
all DET
. PUNCT
See examples/data for more.
Tagging assigns a label to each word in the sentence. Common examples are part-of-speech tagging and named entity recognition.
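Assuming the tagging format above, a sketch of a reader that groups word–tag pairs into sentences:

```python
def read_tagging_file(lines):
    """Group tab-separated word/tag lines into sentences.

    Sentence boundaries are marked by empty lines.
    """
    sentences = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Empty line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        word, tag = line.split("\t")
        current.append((word, tag))
    if current:  # file may not end with an empty line
        sentences.append(current)
    return sentences
```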
We support most BERT-like language models available in Hugging Face's model repository. We add a small multilayer perceptron (MLP) on top of the representation and train only its weights. The language models are not fine-tuned, and we cache their output when possible, which allows running a very large number of experiments on a single GPU. For tagging tasks, the parameters of the MLP are shared across all tokens.
Requirements:
- Python >= 3.6
- PyTorch >= 1.7 (we recommend installing PyTorch before installing this package)
Install command:
cd <PATH_TO_PROBING_REPO>
pip install .
Experiment configuration is managed through YAML config files.
We provide some examples in the examples/config directory, along with example toy datasets in the examples/data directory.
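A config file is plain YAML; a hypothetical fragment might look like the following. The key names here are illustrative only — consult the files in examples/config for the real options:

```yaml
# Illustrative fragment only; real option names live in examples/config.
model_name: bert-base-cased   # any BERT-like model from Hugging Face
train_file: examples/data/pos_tagging/english/train
dev_file: examples/data/pos_tagging/english/dev
```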
The train and dev files can be provided as command line arguments as well.
Morphology probing:
python probing/train.py \
--config examples/config/transformer_probing.yaml \
--train-file examples/data/morphology_probing/english_verb_tense/train.tsv \
--dev-file examples/data/morphology_probing/english_verb_tense/dev.tsv
POS tagging:
python probing/train.py \
--config examples/config/pos_tagging.yaml \
--train-file examples/data/pos_tagging/english/train \
--dev-file examples/data/pos_tagging/english/dev
train_many_configs.py takes a Python source file as its parameter, which must contain a function named generate_configs that returns or yields Config objects.
This makes it possible to run an arbitrary number of experiments with varying configuration.
This toy example trains two models for English POS tagging, one that uses the first subword token of each token and one that uses the last subword token.
We explore these options in detail in this paper.
python probing/train_many_configs.py \
--config examples/config/pos_tagging.yaml \
--param-generator examples/config/generate_pos_configs.py
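The toy example's generator might be sketched like this. We use plain dicts here for illustration; the real generate_configs must yield the library's Config objects, and the `subword_pooling` option name is an assumption:

```python
def generate_configs(base=None):
    """Yield one configuration per subword-pooling strategy.

    `subword_pooling` is an illustrative option name: one config
    probes the first subword of each token, the other the last.
    """
    for pooling in ("first", "last"):
        config = dict(base or {})
        config["subword_pooling"] = pooling
        yield config
```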
Inference on one experiment:
python probing/inference.py \
--experiment-dir PATH_TO_AN_EXPDIR \
--test-file PATH_TO_A_TEST_FILE \
> output
If no test file is provided, it reads from the standard input.
Inference on multiple experiments:
python probing/batch_inference.py \
EXPERIMENT_DIRS \
--run-on-dev \
--run-on-test
EXPERIMENT_DIRS is an arbitrary number of positional arguments accepted by batch_inference.py. It may be a glob such as ~/workdir/exps/*.
If --run-on-dev or --run-on-test is not provided, the dev or the test set, respectively, will not be evaluated.
batch_inference.py only evaluates directories where test.out is missing or older than the last model checkpoint.
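That staleness check amounts to comparing file modification times; a minimal sketch (test.out is the name used above, while model.pt as the checkpoint filename is an assumption):

```python
import os

def needs_evaluation(exp_dir, output_name="test.out", ckpt_name="model.pt"):
    """Return True if the experiment directory has no output file yet,
    or the output is older than the model checkpoint."""
    out_path = os.path.join(exp_dir, output_name)
    ckpt_path = os.path.join(exp_dir, ckpt_name)
    if not os.path.exists(out_path):
        return True
    return os.path.getmtime(out_path) < os.path.getmtime(ckpt_path)
```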