Bow to Bert

Evolution of word vectors from long, sparse, and one-hot to short, dense, and context sensitive

This is the source code to go along with the blog article "Bow to Bert".

Context sensitive embeddings with BERT

Figure 3. BERT embeddings are contextual. Each row shows three sentences that share a common word. The sentence in the middle uses that word in the same context as the sentence on its right, but in a different context from the sentence on its left. The numbers show the computed cosine similarity between the indicated word pairs. The BERT embedding for the word in the middle sentence is more similar to the same word in the sentence on the right than to the one on the left.
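
As a minimal sketch of the cosine-similarity measure used in the figure, the snippet below compares toy vectors with numpy. The function name `cosine_similarity` and all vector values are made up for illustration; they are not taken from the repository.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# One-hot vectors for two different words are orthogonal,
# so their similarity is always zero.
one_hot_play = [1.0, 0.0, 0.0]
one_hot_record = [0.0, 1.0, 0.0]
print(cosine_similarity(one_hot_play, one_hot_record))  # 0.0

# Toy dense vectors (made-up numbers) can express graded similarity.
dense_a = [0.9, 0.1, 0.3]
dense_b = [0.8, 0.2, 0.4]
print(cosine_similarity(dense_a, dense_b))
```

Because cosine similarity depends only on direction, rescaling either vector leaves the score unchanged.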

Summary

Word vectors have evolved over the years to know the difference between "record the play" vs "play the record". They have evolved from a one-hot world where every word was orthogonal to every other word, to a place where word vectors morph to suit the context. Slapping a BoW on word vectors is the usual way to build a document vector for tasks such as classification. But BERT does not need a BoW, as the vector shooting out of the top [CLS] token is already fine-tuned for the specific classification objective...

Dependencies

tensorflow
numpy

To reproduce the results in the post

Download: crawl-300d-2M-subword.vec

Download: BERT-Base, Uncased

Edit the script getBertWordVectors.sh and update the path accordingly
	BERT_BASE_DIR="$PRE_TRAINED_HOME/bert/uncased_L-12_H-768_A-12"
Edit the scripts fasttext_sentence_similarity.py & fasttext_word_similarity.py and update the path accordingly
	filename = os.environ["PRE_TRAINED_HOME"] + '/fasttext/crawl-300d-2M-subword.vec'

Get BERT word embeddings for the words/sentences in bert_sentences.txt

./getBertWordVectors.sh

Process BERT embeddings to compute cosine similarity for context sensitive words

pipenv run python bert_similarity.py

to get output like:

arms bend at the elbow  <=>  germany sells arms to saudi arabia 			<=>  0.482
arms bend at the elbow  <=>  wave your arms around 						<=>  0.615
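
The comparison above can be sketched as follows. This is not the code in bert_similarity.py. It assumes the embeddings are stored as one JSON record per sentence, in the layout written by `extract_features.py` from the BERT repository (`features` → `token` / `layers` → `index` / `values`); whether getBertWordVectors.sh produces exactly this layout is an assumption. The helper `make_toy_line` and the 3-dimensional values are hypothetical stand-ins for real 768-dimensional output.

```python
import json
import math


def token_vector(jsonl_line, token, layer_index=-1):
    """Return the vector of the first occurrence of `token` in one JSON record."""
    record = json.loads(jsonl_line)
    for feature in record["features"]:
        if feature["token"] == token:
            for layer in feature["layers"]:
                if layer["index"] == layer_index:
                    return layer["values"]
    raise KeyError(f"{token!r} with layer {layer_index} not found")


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def make_toy_line(tokens_and_values):
    """Hypothetical helper: build a record in the assumed layout from toy data."""
    features = [
        {"token": token, "layers": [{"index": -1, "values": values}]}
        for token, values in tokens_and_values
    ]
    return json.dumps({"features": features})


# Made-up vectors: the same surface word "arms" gets a different vector per sentence.
anatomy = make_toy_line([("[CLS]", [0.0, 0.0, 1.0]), ("arms", [1.0, 0.2, 0.0])])
weapons = make_toy_line([("sells", [0.0, 0.5, 0.5]), ("arms", [0.1, 1.0, 0.0])])
waving = make_toy_line([("wave", [0.2, 0.2, 0.2]), ("arms", [0.9, 0.3, 0.1])])

other_sense = cosine_similarity(token_vector(anatomy, "arms"), token_vector(weapons, "arms"))
same_sense = cosine_similarity(token_vector(anatomy, "arms"), token_vector(waving, "arms"))
print(f"different sense: {other_sense:.3f}, same sense: {same_sense:.3f}")
```

With real BERT output, each `values` list would hold the hidden-state vector of the chosen layer for that token.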

Fasttext word similarity

pipenv run python ./fasttext_word_similarity.py holiday vacation paper

to get output like:

Cosine Similarity: holiday & vacation : 0.7388389
Cosine Similarity: holiday & paper : 0.2716892
Cosine Similarity: vacation & paper : 0.27176374
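
A rough sketch of how such word similarities could be computed from a fastText `.vec` file is shown below. It relies on the text format of those files: a header line with the vocabulary size and dimension, then one word per line followed by its vector components. The function names and the 2-dimensional toy lines are illustrative assumptions, not the contents of fasttext_word_similarity.py.

```python
import numpy as np


def parse_vec_lines(lines, wanted=None):
    """Parse fastText text-format lines into a {word: vector} dict.

    `lines` is any iterable of strings (an open file works); `wanted`
    optionally restricts loading to a set of words to save memory.
    """
    lines = iter(lines)
    _count, dim = (int(x) for x in next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if wanted is not None and word not in wanted:
            continue
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = np.array([float(x) for x in values], dtype=np.float32)
    return vectors


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy stand-in for the real file (made-up 2-dimensional vectors).
toy_lines = [
    "3 2",
    "holiday 0.9 0.1",
    "vacation 0.8 0.2",
    "paper 0.1 0.9",
]
vectors = parse_vec_lines(toy_lines, wanted={"holiday", "vacation", "paper"})
for a, b in [("holiday", "vacation"), ("holiday", "paper"), ("vacation", "paper")]:
    print(f"Cosine Similarity: {a} & {b} : {cosine_similarity(vectors[a], vectors[b])}")
```

To run against the real vectors, one would pass an open file instead of `toy_lines`, for example `open(filename, encoding="utf-8")`.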

Fasttext sentence similarity

pipenv run python ./fasttext_sentence_similarity.py

to get output like:

words not found in fasttext.. 0

Cosine Similarity: enjoy your holiday & have a fun vacation : 0.72311985
Cosine Similarity: enjoy your holiday & study the paper : 0.5743288
Cosine Similarity: have a fun vacation & study the paper : 0.51478416
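
One common way to get a sentence vector from word vectors is a bag-of-words average, sketched below. Whether fasttext_sentence_similarity.py averages, sums, or weights the vectors is not stated here, so treat this as an assumption; the function names and the toy 2-dimensional vectors are made up for illustration.

```python
import numpy as np


def sentence_vector(sentence, word_vectors):
    """Average the vectors of the known words; also return the unknown words."""
    words = sentence.lower().split()
    found = [word_vectors[w] for w in words if w in word_vectors]
    missing = [w for w in words if w not in word_vectors]
    if not found:
        raise ValueError("no word in the sentence has a vector")
    return np.mean(found, axis=0), missing


def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up static word vectors standing in for fastText.
toy_vectors = {
    "record": np.array([0.9, 0.1]),
    "the": np.array([0.5, 0.5]),
    "play": np.array([0.2, 0.8]),
}

first, missing_first = sentence_vector("record the play", toy_vectors)
second, missing_second = sentence_vector("play the record", toy_vectors)
print("words not found:", len(missing_first) + len(missing_second))

# Word order is lost in a bag of words, so both sentences get the same vector.
print(cosine_similarity(first, second))
```

Summing instead of averaging would give the same cosine similarity, since cosine ignores vector length. The example also shows the limitation the summary points at: with static vectors and a bag of words, "record the play" and "play the record" are indistinguishable.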

Contributors

ashokc, nauynix
