

Bayesian Skip-gram (BSG)

This repository contains the Theano code for the Bayesian Skip-gram model [1], presented at COLING 2018.

[1] Embedding Words as Distributions with a Bayesian Skip-gram Model, Arthur Bražinskas, Serhii Havrylov, Ivan Titov, arxiv

The model represents words as Gaussian distributions instead of point estimates and is capable of learning additional word properties, such as generality, which is encoded in the variances. The instructions below explain how to install and run the model, and how to evaluate word pairs.

Requirements

  • Python 2.7
  • Theano 0.9.0
  • numpy 1.14.2
  • nltk 3.2.2
  • scipy 0.18.1
  • Lasagne 0.2.dev1

Installation

First, install the Python dependencies, such as Theano and nltk.

pip install -r requirements.txt

Afterwards, install the necessary NLTK sub-packages.

python -m nltk.downloader wordnet

python -m nltk.downloader punkt

Running the model

To run the model, refer to run_bsg.py, which contains example code for training and evaluating the model. Once training completes, the word representations are saved to the output folder. The trained Gaussian word representations (mus and sigmas) can then be used, for example, as input to the word pairs evaluation.

Data

A small dataset of 15 million tokens is available for smoke-testing the setup. Alternatively, a dataset of approximately 1 billion tokens is also publicly available. The dataset originally used in the research is not publicly available, but can be requested at http://wacky.sslmit.unibo.it/doku.php?id=corpora.

Word pairs evaluation

The eval/word_pairs_eval.py console application serves as a playground for evaluating word pairs in terms of similarity, Kullback-Leibler divergence, and entailment directionality. It expects three paths: a word pairs file, a mu vectors file, and a sigma vectors file (the latter two being the word representations). The word pairs file should contain two words per line, separated by a space; the order of the words does not matter. The mu and sigma files are obtained from a trained BSG model. Alternatively, word representations pre-trained on the 3B tokens dataset can be used.
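The word pairs file format described above (two space-separated words per line) can be read with a few lines of Python. This is a minimal sketch, not the parser used by eval/word_pairs_eval.py; the function name read_word_pairs is hypothetical, and skipping blank lines and rejecting malformed lines are assumptions about reasonable behaviour.

```python
import io


def read_word_pairs(path):
    """Read one space-separated word pair per line (hypothetical helper).

    Blank lines are skipped; any other line that does not hold exactly
    two words raises ValueError. Both choices are assumptions.
    """
    pairs = []
    with io.open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            if len(tokens) != 2:
                raise ValueError(
                    "line %d: expected 2 words, got %d" % (line_no, len(tokens))
                )
            pairs.append((tokens[0], tokens[1]))
    return pairs
```

The sketch uses io.open so that it behaves the same under Python 2.7, which the repository targets, and Python 3.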

The example command below will evaluate pairs stored in eval/example_word_pairs.txt, and output results to the console.

python eval/word_pairs_eval.py -wpp eval/example_word_pairs.txt -mup vectors/mu.vectors -sigmap vectors/sigma.vectors
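To illustrate what two of these measures compute, the sketch below implements cosine similarity between mean vectors and the closed-form Kullback-Leibler divergence between two Gaussians. It is not taken from eval/word_pairs_eval.py: the function names are hypothetical, the covariances are assumed to be diagonal (or spherical, via a scalar sigma), and sigma is assumed to hold standard deviations rather than variances. Check these assumptions against the repository before relying on the numbers.

```python
import numpy as np


def cosine_similarity(mu_a, mu_b):
    """Cosine similarity between two mean vectors."""
    mu_a = np.asarray(mu_a, dtype=float)
    mu_b = np.asarray(mu_b, dtype=float)
    return float(np.dot(mu_a, mu_b) / (np.linalg.norm(mu_a) * np.linalg.norm(mu_b)))


def kl_divergence(mu_a, sigma_a, mu_b, sigma_b):
    """KL(N_a || N_b) for Gaussians with diagonal covariance.

    Assumption: sigma_* are standard deviations, given either per
    dimension or as a single scalar (spherical covariance).
    """
    mu_a = np.asarray(mu_a, dtype=float)
    mu_b = np.asarray(mu_b, dtype=float)
    var_a = np.broadcast_to(np.asarray(sigma_a, dtype=float) ** 2, mu_a.shape)
    var_b = np.broadcast_to(np.asarray(sigma_b, dtype=float) ** 2, mu_b.shape)
    terms = var_a / var_b + (mu_b - mu_a) ** 2 / var_b - 1.0 + np.log(var_b / var_a)
    return float(0.5 * np.sum(terms))


# Toy example with made-up values: a narrow and a broad distribution
# sharing the same mean.
narrow = (np.zeros(2), 1.0)
broad = (np.zeros(2), 2.0)
kl_narrow_to_broad = kl_divergence(narrow[0], narrow[1], broad[0], broad[1])
kl_broad_to_narrow = kl_divergence(broad[0], broad[1], narrow[0], narrow[1])
```

In this toy example the divergence from the narrow distribution to the broad one is smaller than in the opposite direction. Unlike cosine similarity, KL divergence is asymmetric, which is what makes a directional comparison between a more specific and a more general word possible in principle.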

Additional resources used in the project

The lexical substitution benchmark is a modified version of https://github.com/orenmel/lexsub

Citation

@inproceedings{brazinskas-etal-2018-embedding,
    title = "Embedding Words as Distributions with a {B}ayesian Skip-gram Model",
    author = "Bra{\v{z}}inskas, Arthur  and
      Havrylov, Serhii  and
      Titov, Ivan",
    booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
    month = aug,
    year = "2018",
    address = "Santa Fe, New Mexico, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/C18-1151",
    pages = "1775--1789",
}
