
picoBERT

Like picoGPT, but for BERT.

Dependencies

pip install -r requirements.txt

Tested on Python 3.9.10.

Usage

  • bert.py contains the actual BERT model code.
  • utils.py includes utility code to download, load, and tokenize stuff.
  • tokenization.py includes the BERT WordPiece tokenizer code (a sketch of the core algorithm follows this list).
  • pretrain_demo.py contains code to demo BERT on its pre-training tasks (MLM and NSP).
  • classify_demo.py contains code to demo training an SKLearn classifier using the BERT output embeddings as input. Note, this is not the same as actually fine-tuning the BERT model.
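
For reference, here is a minimal sketch of the greedy longest-match-first algorithm at the heart of WordPiece (tokenization.py implements the full version, which also handles lowercasing, punctuation splitting, and Unicode cleanup):

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first subword tokenization for a single word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces get the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no subword matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"new", "##ton", "fall", "##s"}
print(wordpiece_tokenize("newton", vocab))  # ['new', '##ton']
print(wordpiece_tokenize("falls", vocab))   # ['fall', '##s']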

To demo BERT on pre-training tasks:

python pretrain_demo.py \
    --text_a "The apple doesn't fall far from the tree." \
    --text_b "Instead, it falls on Newton's head." \
    --model_name "bert-base-uncased" \
    --mask_prob 0.20

Which outputs:

mlm_accuracy = 0.75
is_next_sentence = True
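
The is_next_sentence output comes from BERT's next-sentence prediction head, which is just a binary classifier over the final hidden state at the [CLS] position. A minimal numpy sketch (the parameter names here are illustrative, not the repo's actual ones):

import numpy as np

def is_next_sentence(cls_hidden, w_pool, b_pool, w_nsp, b_nsp):
    # illustrative parameter names, not the actual checkpoint's
    pooled = np.tanh(cls_hidden @ w_pool + b_pool)  # BERT's "pooler" layer
    logits = pooled @ w_nsp + b_nsp                 # shape (2,)
    return bool(np.argmax(logits) == 0)             # in BERT, label 0 means "is next"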

If we add the --verbose flag, we can also see where the model went wrong with masked language modeling:

input = ['[CLS]', 'the', 'apple', 'doesn', "'", '[MASK]', 'fall', 'far', 'from', 'the', 'tree', '.', '[SEP]', 'instead', ',', 'it', 'falls', 'on', '[MASK]', "'", '[MASK]', '[MASK]', '.', '[SEP]']

actual: t
pred: t

actual: newton
pred: one

actual: s
pred: s

actual: head
pred: head

Instead of predicting the word "newton", it predicted the word "one", which still gives a valid sentence: "Instead, it falls on one's head."
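
The --mask_prob flag controls how much of the input is corrupted. A minimal sketch of one plausible masking scheme (the demo's exact scheme may differ; full BERT pre-training also sometimes substitutes a random token or leaves the token unchanged instead of always using [MASK]):

import numpy as np

def mask_tokens(tokens, mask_prob=0.20, special=("[CLS]", "[SEP]"), seed=None):
    """Replace each non-special token with [MASK] with probability mask_prob."""
    rng = np.random.default_rng(seed)
    masked, targets = [], []
    for tok in tokens:
        if tok not in special and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)  # the label the model is scored against
        else:
            masked.append(tok)
    return masked, targets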

For a demo of training an SKLearn classifier for the IMDB dataset, using BERT output embeddings as input to the classifier:

python classify_demo.py \
    --dataset_name "imdb" \
    --N 1000 \
    --test_ratio 0.2 \
    --model_name "bert-base-uncased" \
    --models_dir "models"

Which outputs (note: it takes a while to run the BERT model and extract all the embeddings):

              precision    recall  f1-score   support

           0       0.78      0.85      0.81       104
           1       0.82      0.74      0.78        96

    accuracy                           0.80       200
   macro avg       0.80      0.79      0.79       200
weighted avg       0.80      0.80      0.79       200

Not bad: 80% accuracy using only 800 training examples and a simple SKLearn model. Of course, fine-tuning the entire model on all the training examples would yield much better results.
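
The embeddings-as-features recipe is simple enough to sketch. Everything below is a hypothetical outline assuming the BERT features have already been cached to disk; the demo's actual classifier and file layout may differ:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# X: one fixed-size BERT embedding per review (e.g. the final hidden state
# at the [CLS] position); y: 0/1 sentiment labels. The file names are
# hypothetical placeholders.
X = np.load("embeddings.npy")  # shape (N, hidden_size)
y = np.load("labels.npy")      # shape (N,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))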
