
picoBERT

Like picoGPT, but for BERT.

Dependencies

pip install -r requirements.txt

Tested on Python 3.9.10.

Usage

  • bert.py contains the actual BERT model code.
  • utils.py includes utility code to download, load, and tokenize stuff.
  • tokenization.py includes the BERT WordPiece tokenizer code (a sketch of the core algorithm follows this list).
  • pretrain_demo.py contains code to demo BERT on its pre-training tasks (MLM and NSP).
  • classify_demo.py contains code to demo training an SKLearn classifier using the BERT output embeddings as input. Note, this is not the same as actually fine-tuning the BERT model.
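
For reference, here is a minimal sketch of the greedy longest-match-first algorithm at the heart of WordPiece (tokenization.py implements the full version, which also handles lowercasing, punctuation splitting, and Unicode cleanup):

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword tokenization for a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces get the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no subword matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"new", "##ton", "fall", "##s"}
print(wordpiece_tokenize("newton", vocab))  # ['new', '##ton']
print(wordpiece_tokenize("falls", vocab))   # ['fall', '##s']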

To demo BERT on pre-training tasks:

python pretrain_demo.py \
    --text_a "The apple doesn't fall far from the tree." \
    --text_b "Instead, it falls on Newton's head." \
    --model_name "bert-base-uncased" \
    --mask_prob 0.20

Which outputs:

mlm_accuracy = 0.75
is_next_sentence = True
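
The is_next_sentence output comes from BERT's next-sentence prediction head, which is just a binary classifier over the final hidden state at the [CLS] position. A minimal numpy sketch (the parameter names here are illustrative, not the repo's actual ones):

import numpy as np

def is_next_sentence(cls_hidden, w_pool, b_pool, w_nsp, b_nsp):
    # illustrative parameter names, not the actual checkpoint's
    pooled = np.tanh(cls_hidden @ w_pool + b_pool)  # BERT's "pooler" layer
    logits = pooled @ w_nsp + b_nsp                 # shape (2,)
    return bool(np.argmax(logits) == 0)             # in BERT, label 0 means "is next"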

If we add the --verbose flag, we can also see where the model went wrong with masked language modeling:

input = ['[CLS]', 'the', 'apple', 'doesn', "'", '[MASK]', 'fall', 'far', 'from', 'the', 'tree', '.', '[SEP]', 'instead', ',', 'it', 'falls', 'on', '[MASK]', "'", '[MASK]', '[MASK]', '.', '[SEP]']

actual: t
pred: t

actual: newton
pred: one

actual: s
pred: s

actual: head
pred: head

Instead of predicting the word "newton", it predicted the word "one", which still gives a valid sentence: "Instead, it falls on one's head."
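
The --mask_prob flag controls how much of the input is corrupted. A minimal sketch of one plausible masking scheme (the demo's exact scheme may differ; full BERT pre-training also sometimes substitutes a random token or leaves the token unchanged instead of always using [MASK]):

import numpy as np

def mask_tokens(tokens, mask_prob=0.20, special=("[CLS]", "[SEP]"), seed=None):
    """Replace each non-special token with [MASK] with probability mask_prob."""
    rng = np.random.default_rng(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)  # the label the model is scored against
        else:
            masked.append(tok)
    return masked, targets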

For a demo of training an SKLearn classifier for the IMDB dataset, using BERT output embeddings as input to the classifier:

python classify_demo.py \
    --dataset_name "imdb" \
    --N 1000 \
    --test_ratio 0.2 \
    --model_name "bert-base-uncased" \
    --models_dir "models"

Which outputs (note: it takes a while to run the BERT model and extract all the embeddings):

              precision    recall  f1-score   support

           0       0.78      0.85      0.81       104
           1       0.82      0.74      0.78        96

    accuracy                           0.80       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.80      0.79       200

Not bad: 80% accuracy using only 800 training examples and a simple SKLearn model. Of course, fine-tuning the entire model on all the training examples would yield much better results.
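
The embeddings-as-features recipe is simple enough to sketch. Everything below is a hypothetical outline assuming the BERT features have already been cached to disk; the demo's actual classifier and file layout may differ:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# X: one fixed-size BERT embedding per review (e.g. the final hidden state
# at the [CLS] position); y: 0/1 sentiment labels. The file names are
# hypothetical placeholders.
X = np.load("embeddings.npy")  # shape (N, hidden_size)
y = np.load("labels.npy")      # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))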
