
Binarized Neural Machine Translation

We explore ways to reduce computation and model size for neural machine translation. Motivated by the development of binary-weight networks and XNOR networks in vision, we attempt to extend that work to machine translation. In particular, we evaluate how binary convolutions can be applied in machine translation and measure their effects.

Datasets

Although our analysis is done on the Multi30k dataset, our code supports the following datasets:

  • WMT 14 EN - FR
  • IWSLT
  • Multi30k

Models

Baseline Models

We implement four baseline models to compare our binarized models against.

Simple LSTM

[Figure: simple LSTM architecture]

An encoder-decoder model that encodes the source language with an LSTM, then presents the final hidden state to the decoder. The decoder uses the final hidden state to generate the output.

Attention RNN

[Figure: attention LSTM architecture]

An encoder-decoder model similar to the previous one, but at every decoder step it applies an attention mechanism over all the encoder outputs, conditioned on the current decoder hidden state.
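The attention step above can be sketched in a few lines. This is an illustrative numpy version of plain dot-product attention (the function name and shapes are ours, not the repo's API): score each encoder output against the current decoder hidden state, softmax the scores over time, and return the weighted sum as the context vector.

```python
import numpy as np

def attention(decoder_hidden, encoder_outputs):
    """Dot-product attention: weight each encoder output by its
    similarity to the current decoder hidden state."""
    scores = encoder_outputs @ decoder_hidden       # (T,) one score per step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over time steps
    context = weights @ encoder_outputs             # (H,) weighted sum
    return context, weights

T, H = 5, 8
rng = np.random.default_rng(0)
enc = rng.standard_normal((T, H))                   # encoder outputs
dec = rng.standard_normal(H)                        # decoder hidden state
context, weights = attention(dec, enc)
```

The context vector is then concatenated with (or added to) the decoder state before predicting the next token.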

Attention QRNN

The same model as above, but using a QRNN (Quasi-Recurrent Neural Network, developed by Salesforce Research) instead of LSTMs. The QRNN should be much faster since it relies on lower-level convolutions and can be parallelized further than the Attention RNN.
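To illustrate why the QRNN parallelizes better: the candidate values z and forget gates f are produced by convolutions over the whole sequence at once, and only a cheap elementwise recurrence ("f-pooling") remains sequential. A minimal numpy sketch of that pooling step (our own illustration, not the Salesforce implementation):

```python
import numpy as np

def qrnn_f_pool(z, f):
    """QRNN f-pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
    z (candidates) and f (forget gates) are computed for all time steps
    in parallel by convolutions; only this elementwise loop is sequential."""
    T, H = z.shape
    h = np.zeros((T, H))
    prev = np.zeros(H)
    for t in range(T):
        prev = f[t] * prev + (1.0 - f[t]) * z[t]
        h[t] = prev
    return h

T, H = 6, 4
rng = np.random.default_rng(0)
z = np.tanh(rng.standard_normal((T, H)))   # stand-in for conv output
f = np.zeros((T, H))                       # forget nothing: h should equal z
h = qrnn_f_pool(z, f)
```

Unlike an LSTM cell, the recurrence contains no matrix multiplications, so the sequential part is trivially cheap.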

ConvS2S

[Figure: ConvS2S architecture]

This model (implemented by FAIR) replaces RNNs with a series of convolutional layers used for both the encoder and the decoder, combined with attention.

Binarized Models

We implement two variants of binarized networks to compare performance.

ConvS2S Binarized Weight Networks

This model is the same as the one above, with one key difference: all the weights are represented by a binary tensor β and a scaling vector α such that W ≈ β · α. The benefit is that a convolution can then be estimated as (I · β) · α, replacing most full-precision multiplications with sign flips and a single scale.
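The binarization step can be sketched as follows. This is an illustrative numpy version (function name and shapes are ours): β is the elementwise sign of W, and the per-filter α that minimizes ||W − αβ||² is the mean absolute value of that filter's weights.

```python
import numpy as np

def binarize_weights(W):
    """Binary-weight approximation W ≈ alpha * beta:
    beta = sign(W), alpha = per-filter mean of |W| (the least-squares
    optimal scale for a fixed sign pattern)."""
    beta = np.sign(W)
    beta[beta == 0] = 1.0                    # map sign(0) -> +1 by convention
    # one alpha per output filter (axis 0), averaged over the rest
    alpha = np.abs(W).mean(axis=tuple(range(1, W.ndim)), keepdims=True)
    return alpha, beta

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3, 3))           # 4 filters, 3x3 each
alpha, beta = binarize_weights(W)
# A convolution with W is then estimated as (I * beta) * alpha:
# the inner products use only +/-1 weights, followed by one scale.
```

Since β costs one bit per weight instead of 32, this also gives roughly a 32x reduction in weight storage.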

ConvS2S XNOR network

This model extends the binarized weight network: the input is binarized as well, so the convolutions can be estimated as (sign(I) · sign(β)) · α.
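With both operands in {−1, +1}, the inner product reduces to bit operations, which is where the "XNOR" name comes from: pack the signs into bits, XOR (or XNOR) them, and count matches. A small self-contained sketch of that identity (the packing scheme and names are our illustration):

```python
import numpy as np

def xnor_dot(x_bits, y_bits, n):
    """Dot product of two {-1,+1}^n vectors packed as bits (+1 -> 1, -1 -> 0).
    Mismatching bit positions contribute -1, matching ones +1, so
    dot = (n - mismatches) - mismatches = n - 2 * popcount(x XOR y)."""
    mismatches = bin(x_bits ^ y_bits).count("1")
    return n - 2 * mismatches

def pack(v):
    """Pack a +/-1 vector into an integer bit mask."""
    return int("".join("1" if s > 0 else "0" for s in v), 2)

rng = np.random.default_rng(0)
n = 16
x = np.where(rng.standard_normal(n) >= 0, 1, -1)
y = np.where(rng.standard_normal(n) >= 0, 1, -1)
assert xnor_dot(pack(x), pack(y), n) == int(x @ y)
```

On hardware, 32 or 64 of these multiply-accumulates collapse into one XNOR plus one popcount instruction, which is the source of the claimed speedups.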

Notable Results

Translation Performance

[Figure: BLEU scores]

Other statistics can be found in this issue.

Model Size

We compare the model sizes of two different sets of models: first, the models we ran our Multi30k experiments on, and then the large models. Since our dataset is quite a bit smaller, we also measured the sizes of models used for larger translation datasets such as WMT, using the hyperparameters reported in their papers.

[Figure: model sizes (Multi30k models)]

[Figure: model sizes (large models)]

Set Up

A shortcut that performs all of the setup:

# creates a virtual environment and downloads the data
$ bash setup.sh

To set up the Python code, create a Python 3 environment with the following:

# create a virtual environment
$ python3 -m venv env

# activate environment
$ source env/bin/activate

# install all requirements
$ pip install -r requirements.txt

If you add a new package, you will have to update requirements.txt with the following command:

# add new packages
$ pip freeze > requirements.txt

And if you want to deactivate the virtual environment:

# deactivate the virtual env
$ deactivate

# if using Python 3.7.x, no official TensorFlow distribution is available; use this wheel on macOS:
$ pip install https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0-py3-none-any.whl

# use this wheel on Linux
$ pip install https://github.com/adrianodennanni/tensorflow-1.12.0-cp37-cp37m-linux_x86_64/blob/master/tensorflow-1.12.0-cp37-cp37m-linux_x86_64.whl?raw=true

References

  1. Attention and Simple LSTM Pictures

  2. FairSeq ConvS2S Gif Original

Papers

  1. XNOR-Net: Paper
  2. Multi-bit quantization networks: Paper
  3. Binarized LSTM Language Model: Paper
  4. FairSeq Convolutional Sequence to Sequence Learning: Paper
  5. Quasi-Recurrent Neural Networks: Paper
  6. WMT 14 Translation Task: Paper
  7. Attention Is All You Need: Paper
  8. Imagination improves multimodal translation: Paper
  9. Multi30k dataset: Paper
  10. IWSLT: Paper

Githubs and Links

  1. Pytorch MT Seq2Seq Tutorial
  2. XNOR-net AI2
  3. Annotated Transformer (Harvard NLP)
  4. Salesforce QRNN Pytorch
  5. FairSeq
  6. Torchtext
  7. XNOR NET Pytorch

Contributors

akshatsh, kevinb22, sarahyu17
