Word2Bits - Quantized Word Vectors

Word2Bits extends the Word2Vec algorithm to output high quality quantized word vectors that take 8x-16x less storage than regular word vectors. Read the details at https://arxiv.org/abs/1803.05651.

What are Quantized Word Vectors?

Quantized word vectors are word vectors where each parameter is one of 2^bitlevel values.

For example, the 1-bit quantized vector for "king" looks something like

0.33333334 0.33333334 0.33333334 -0.33333334 -0.33333334 -0.33333334 0.33333334 0.33333334 -0.33333334 0.33333334 0.33333334 ...

Since parameters are limited to one of 2^bitlevel values, each parameter takes only bitlevel bits to represent; this drastically reduces the amount of storage that word vectors take.
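
As a rough illustration of the storage argument, the sketch below packs a 1-bit vector into one bit per parameter and unpacks it again. This is not code from Word2Bits: the names `pack_1bit`, `unpack_1bit`, and `storage_bytes` are hypothetical, and the ±1/3 values are simply taken from the "king" example above.

```python
def pack_1bit(vector):
    """Pack a 1-bit quantized vector into bytes (bit set = positive value)."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_1bit(packed, dim, magnitude=1.0 / 3.0):
    """Recover the +/-magnitude values from the packed bits."""
    return [
        magnitude if (packed[i // 8] >> (i % 8)) & 1 else -magnitude
        for i in range(dim)
    ]


def storage_bytes(vocab_size, dim, bits_per_param):
    """Raw parameter storage in bytes, ignoring the vocabulary strings."""
    return vocab_size * dim * bits_per_param // 8


if __name__ == "__main__":
    # First 11 parameters of the "king" example.
    king = [1/3, 1/3, 1/3, -1/3, -1/3, -1/3, 1/3, 1/3, -1/3, 1/3, 1/3]
    packed = pack_1bit(king)
    assert unpack_1bit(packed, len(king)) == king
    print(len(packed), "bytes packed vs", 4 * len(king), "bytes as float32")
    print(storage_bytes(400_000, 800, 1), "bytes for 400k x 800 at 1 bit")
    print(storage_bytes(400_000, 800, 32), "bytes for 400k x 800 at 32 bits")
```

A text-format vector file stores each value as decimal digits, so a saving of this kind only appears once the values are packed or the file is compressed.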

Download Pretrained Word Vectors

  • All word vectors are in Glove/Fasttext format (format details here). Files are compressed using gzip.
Bits per parameter  Dimension  Trained on              Vocabulary size  File size (compressed)  Download link
1                   800        English Wikipedia 2017  Top 400k         86M                     w2b_bitlevel1_size800_vocab400K.tar.gz
1                   1000       English Wikipedia 2017  Top 400k         106M                    w2b_bitlevel1_size1000_vocab400K.tar.gz
1                   1200       English Wikipedia 2017  Top 400k         126M                    w2b_bitlevel1_size1200_vocab400K.tar.gz
2                   400        English Wikipedia 2017  Top 400k         67M                     w2b_bitlevel2_size400_vocab400K.tar.gz
2                   800        English Wikipedia 2017  Top 400k         134M                    w2b_bitlevel2_size800_vocab400K.tar.gz
2                   1000       English Wikipedia 2017  Top 400k         168M                    w2b_bitlevel2_size1000_vocab400K.tar.gz
32                  200        English Wikipedia 2017  Top 400k         364M                    w2b_bitlevel0_size200_vocab400K.tar.gz
32                  400        English Wikipedia 2017  Top 400k         724M                    w2b_bitlevel0_size400_vocab400K.tar.gz
32                  800        English Wikipedia 2017  Top 400k         1.4G                    w2b_bitlevel0_size800_vocab400K.tar.gz
32                  1000       English Wikipedia 2017  Top 400k         1.8G                    w2b_bitlevel0_size1000_vocab400K.tar.gz
1                   800        English Wikipedia 2017  3.7M (Full)      812M                    w2b_bitlevel1_size800_vocab3.7M.tar.gz
2                   400        English Wikipedia 2017  3.7M (Full)      671M                    w2b_bitlevel2_size400_vocab3.7M.tar.gz
32                  400        English Wikipedia 2017  3.7M (Full)      6.7G                    w2b_bitlevel0_size400_vocab3.7M.tar.gz
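
Once an archive is extracted, a text-format vector file can be read with a few lines of Python. The sketch below assumes the usual Glove/Fasttext layout (one word per line followed by space-separated values) and tolerates an optional first line holding the vocabulary size and dimension. The function name `load_text_vectors` is hypothetical, and the layout of the downloaded files has not been checked here.

```python
import io


def load_text_vectors(lines):
    """Parse 'word v1 v2 ...' lines into a dict mapping word -> list of floats."""
    vectors = {}
    for line_no, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Assumption: a first line of exactly two integers is a
        # "vocab_size dimension" header rather than a word vector.
        if (line_no == 0 and len(parts) == 2
                and parts[0].isdigit() and parts[1].isdigit()):
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    # In-memory stand-in for an extracted vector file.
    sample = io.StringIO(
        "2 3\n"
        "king 0.33333334 -0.33333334 0.33333334\n"
        "queen -0.33333334 0.33333334 0.33333334\n"
    )
    for word, values in load_text_vectors(sample).items():
        print(word, values)
```

For a real file, pass an open text file object instead of the `io.StringIO` stand-in. Reading line by line keeps memory use proportional to the vectors actually stored.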

Visualizing Quantized Word Vectors

(Note: every fifth word vector is labelled; the turquoise line marks the boundary between the nearest and furthest word vectors from the target.)

Using the Code

Quickstart

Compile with

make word2bits

Run with

./word2bits -train input -bitlevel 1 -size 200 -window 10 -negative 12 -threads 2 -iter 5 -min-count 5  -output 1bit_200d_vectors -binary 0

Description of the most common flags:

-train                       Input corpus text file
-bitlevel                    Number of bits for each parameter. 0 is full precision (32 bits).
-size                        Word vector dimension
-window                      Window size
-negative                    Negative sample size
-threads                     Number of threads to use to train
-iter                        Number of epochs to train
-min-count                   Minimum count value. Words appearing fewer than this many times are removed from the corpus.
-output                      Path to write output word vectors
-binary                      0 to write in Glove format; 1 to write in binary format.
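
When training is scripted, the flags above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of Word2Bits: it only builds the argument list from the flags documented above, with defaults mirroring the Quickstart command, and leaves running the binary to the caller.

```python
import shlex


def build_word2bits_command(train, output, bitlevel=1, size=200, window=10,
                            negative=12, threads=2, iterations=5,
                            min_count=5, binary=0, executable="./word2bits"):
    """Return the argv list for a word2bits run using the documented flags."""
    options = [
        ("-train", train),
        ("-bitlevel", bitlevel),
        ("-size", size),
        ("-window", window),
        ("-negative", negative),
        ("-threads", threads),
        ("-iter", iterations),
        ("-min-count", min_count),
        ("-output", output),
        ("-binary", binary),
    ]
    argv = [executable]
    for flag, value in options:
        argv += [flag, str(value)]
    return argv


if __name__ == "__main__":
    argv = build_word2bits_command("input", "1bit_200d_vectors")
    # Print the command; pass argv to subprocess.run(argv, check=True) to run it.
    print(shlex.join(argv))
```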

Example: Word2Bits on text8

  1. Download and preprocess text8 (make sure you're in the Word2Bits base directory).

    bash data/download_text8.sh
    
  2. Compile Word2Bits and the compute_accuracy tool

    make word2bits
    
    make compute_accuracy
    
  3. Train 1-bit, 200-dimensional word vectors for 5 epochs using 4 threads (save in binary format so that compute_accuracy can read them)

    ./word2bits -bitlevel 1 -size 200 -window 8 -negative 24 -threads 4 -iter 5 -min-count 5 -train text8  -output 1b200d_vectors -binary 1
    

    (This will take several minutes. Run with more threads if you have more cores!)

  4. Evaluate vectors on Google Analogy Task

    ./compute_accuracy ./1b200d_vectors < data/google_analogies_test_set/questions-words.txt
    

    You should see output like:

    Starting eval...
    capital-common-countries:
    ACCURACY TOP1: 19.76 %  (100 / 506)
    Total accuracy: 19.76 %   Semantic accuracy: 19.76 %   Syntactic accuracy: -nan %
    capital-world:
    ACCURACY TOP1: 8.81 %  (239 / 2713)
    Total accuracy: 10.53 %   Semantic accuracy: 10.53 %   Syntactic accuracy: -nan %
    ...
    gram8-plural:
    ACCURACY TOP1: 19.92 %  (251 / 1260)
    Total accuracy: 11.48 %   Semantic accuracy: 13.27 %   Syntactic accuracy: 10.25 %
    gram9-plural-verbs:
    ACCURACY TOP1: 6.09 %  (53 / 870)
    Total accuracy: 11.20 %   Semantic accuracy: 13.27 %   Syntactic accuracy: 9.88 %
    Questions seen / total: 16284 19544   83.32 %
    

    Inspecting the vector file in hex should show something like:

    $ od --format=x1 --read-bytes=160 1b200d_vectors
    0000000 36 30 32 33 38 20 32 30 30 0a 3c 2f 73 3e 20 ab
    0000020 aa aa 3e ab aa aa 3e ab aa aa be ab aa aa be ab
    0000040 aa aa 3e ab aa aa 3e ab aa aa 3e ab aa aa be ab
    ...
    0000160 aa aa be ab aa aa 3e ab aa aa be ab aa aa 3e ab
    0000200 aa aa be ab aa aa 3e ab aa aa 3e ab aa aa 3e ab
    0000220 aa aa be ab aa aa 3e ab aa aa be ab aa aa 3e ab
    

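To compare runs, the per-category results printed by compute_accuracy in step 4 can be collected from its output. The parser below is a minimal Python sketch that relies only on the line shapes in the sample output above (a category line ending in a colon, then an `ACCURACY TOP1` line). The name `parse_accuracy` is hypothetical, and other versions of the tool may print a different layout.

```python
import re

# A category line such as "capital-world:" (one token, then a colon).
_CATEGORY = re.compile(r"^\s*(\S+):\s*$")
# A result line such as "ACCURACY TOP1: 8.81 %  (239 / 2713)".
_TOP1 = re.compile(r"ACCURACY TOP1:\s*[\d.]+\s*%\s*\((\d+)\s*/\s*(\d+)\)")


def parse_accuracy(text):
    """Return {category: (correct, total)} from compute_accuracy output."""
    results = {}
    category = None
    for line in text.splitlines():
        top1 = _TOP1.search(line)
        if top1 and category is not None:
            results[category] = (int(top1.group(1)), int(top1.group(2)))
            continue
        header = _CATEGORY.match(line)
        if header:
            category = header.group(1)
    return results


if __name__ == "__main__":
    sample = """Starting eval...
capital-common-countries:
ACCURACY TOP1: 19.76 %  (100 / 506)
Total accuracy: 19.76 %   Semantic accuracy: 19.76 %   Syntactic accuracy: -nan %
capital-world:
ACCURACY TOP1: 8.81 %  (239 / 2713)
"""
    for name, (correct, total) in parse_accuracy(sample).items():
        print(name, correct, total, round(100.0 * correct / total, 2))
```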