lda-bump-cpp

Latent Dirichlet allocation (LDA) with bumping variational inference.

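Implements three versions of LDA inference: coordinate ascent variational inference, stochastic variational inference (SVI), and bumping variational inference [1].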

[1] Alp Kucukelbir and David M Blei. Population Empirical Bayes. Uncertainty in Artificial Intelligence (UAI) 2015.
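As a rough illustration of the general bumping idea (fit a model on several bootstrap resamples, then keep the fit that scores best on the original data), here is a minimal generic sketch. It is not taken from the lda-bump-cpp sources: `bootstrap_indices`, `bump`, and the `fit` and `score` callables are hypothetical names, and the actual driver may organize this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draws n indices from {0, ..., n-1} uniformly with replacement,
// i.e. one bootstrap resample of a dataset of size n.
std::vector<std::size_t> bootstrap_indices(std::size_t n, std::mt19937& rng) {
    assert(n > 0);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<std::size_t> idx(n);
    for (auto& i : idx) i = pick(rng);
    return idx;
}

// Generic bumping loop (hypothetical interface).
//   fit:   maps a list of data indices to a fitted model
//   score: evaluates a model on the full original dataset (higher is better)
template <typename Fit, typename Score>
auto bump(std::size_t n, int num_bootstrap, std::mt19937& rng, Fit fit, Score score)
    -> decltype(fit(std::vector<std::size_t>{})) {
    // Start from the fit on the original sample, so the returned model
    // never scores worse than the plain, non-bootstrapped fit.
    std::vector<std::size_t> all(n);
    for (std::size_t i = 0; i < n; ++i) all[i] = i;
    auto best = fit(all);
    double best_score = score(best);

    for (int b = 0; b < num_bootstrap; ++b) {
        auto candidate = fit(bootstrap_indices(n, rng));
        const double s = score(candidate);
        if (s > best_score) {
            best_score = s;
            best = candidate;
        }
    }
    return best;
}
```

In an LDA setting, `fit` would run variational inference on the resampled documents and `score` could be a bound or predictive likelihood on the original corpus; both choices are assumptions here.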

Requirements

lda-bump-cpp is written in C++11 and requires a compiler with C++11 support. It depends on Eigen 3, Boost, and CMake. docopt is bundled with the source.

Refer to each dependency's platform-specific installation instructions. (I recommend Homebrew on Mac OS X.)

Instructions to Build and Run

The driver (main) program runs all three algorithms.

cmake .
make driver

A toy dataset of arXiv abstracts is provided.

Example

./driver --topics=5 \
         --vocabulary=data/arxiv-vocab.dat \
         --datatr=data/arxiv-train-5k.dat \
         --datatest=data/arxiv-test-1k.dat

For more help, run

./driver -h

lda-bump-cpp LDA with bumping variational inference.

Usage:
  driver --topics=NUM_TOPICS --vocabulary=VOCAB
         --datatr=TRAIN --datatest=TEST
         [--bootstrap=NUM_BOOTSTRAP] [--minibatch=MINIBATCH]
         [--alpha=ALPHA] [--eta=ETA]
         [--tau0=TAU0] [--kappa=KAPPA]
         [--fixed_step_size=STEPSIZE]
         [--max_itr=MAX_ITR]
         [--compute_elbo]
  driver (-h | --help)
  driver --version

Options:
  --topics=NUM_TOPICS        Number of topics for LDA
  --vocabulary=VOCAB         Vocabulary, one word per line
  --datatr=TRAIN             Training data in LDA-C format
  --datatest=TEST            Testing  data in LDA-C format
  --bootstrap=NUM_BOOTSTRAP  Number of bootstraps for bumping [default: 10]
  --minibatch=MINIBATCH      Number of docs in minibatch [default: 500]
  --alpha=ALPHA              Hyperparameter on topic proportions [default: 1/K]
  --eta=ETA                  Hyperparameter on topics [default: 100/V]
  --tau0=TAU0                Learning rate delay [default: 10.0]
  --kappa=KAPPA              Learning rate forgetting rate [default: 0.75]
  --fixed_step_size=STEPSIZE Fixed stepsize instead RobMonro [default: 0.0]
  --max_itr=MAX_ITR          Max number of iterations for LDA [default: 100]
  --compute_elbo             Boolean flag for computing ELBO
  -h --help                  Show this screen
  --version                  Show version
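The --tau0 and --kappa options correspond to the parameters of the decaying step size commonly used in stochastic variational inference, rho_t = (t + tau0)^(-kappa), which satisfies the Robbins-Monro conditions when kappa is in (0.5, 1]. The sketch below assumes the driver follows that standard form and that a positive --fixed_step_size replaces it; the function name `step_size` and its signature are hypothetical, not taken from the source.

```cpp
#include <cmath>
#include <stdexcept>

// Hypothetical helper, not from the lda-bump-cpp sources.
// Returns the step size for iteration t (t = 1, 2, 3, ...).
// A positive fixed_step_size overrides the decaying schedule.
double step_size(int t, double tau0, double kappa, double fixed_step_size = 0.0) {
    if (fixed_step_size > 0.0) {
        return fixed_step_size;
    }
    if (t < 1 || tau0 < 0.0 || kappa <= 0.5 || kappa > 1.0) {
        throw std::invalid_argument("expected t >= 1, tau0 >= 0, kappa in (0.5, 1]");
    }
    // Decaying schedule: rho_t = (t + tau0)^(-kappa).
    return std::pow(static_cast<double>(t) + tau0, -kappa);
}
```

With the documented defaults (tau0 = 10.0, kappa = 0.75), the first step size under this assumption would be 11^(-0.75), and later steps shrink monotonically.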

Vocabulary Data Format

A text file with each word ([term_1] through [term_N]) on a separate line.
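A minimal sketch of reading such a file follows. It assumes that a term's zero-based line number doubles as its integer id in the corpus files; that indexing convention and the name `read_vocabulary` are assumptions, not confirmed by the source.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads one term per line.
// Every line is kept (even an empty one) so that a term's position in
// the returned vector matches its line number in the file.
std::vector<std::string> read_vocabulary(std::istream& in) {
    std::vector<std::string> vocab;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();
        }
        vocab.push_back(line);
    }
    return vocab;
}
```

Passing a `std::ifstream` opened on data/arxiv-vocab.dat would load the toy vocabulary; a `std::istringstream` works the same way for quick checks.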

Corpus Data Format

A text file where each line is of the form (the LDA-C format):

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

where [M] is the number of unique terms in the document, each [term_i] is an integer that indexes the term (it is not a string), and the [count] associated with each term is how many times that term appeared in the document.
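To make the line format concrete, here is a small parsing sketch. The `Document` alias and the function `parse_ldac_line` are hypothetical names for illustration; the project's own reader may differ.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One document as sparse (term id, count) pairs.
using Document = std::vector<std::pair<int, int>>;

// Hypothetical helper: parses "[M] [term_1]:[count] ... " into a Document
// and checks that the number of pairs matches the declared [M].
Document parse_ldac_line(const std::string& line) {
    std::istringstream in(line);
    int num_unique = 0;
    if (!(in >> num_unique) || num_unique < 0) {
        throw std::runtime_error("missing or invalid unique-term count");
    }

    Document doc;
    doc.reserve(static_cast<std::size_t>(num_unique));
    std::string token;
    while (in >> token) {
        const std::size_t colon = token.find(':');
        if (colon == std::string::npos) {
            throw std::runtime_error("expected term:count, got '" + token + "'");
        }
        // std::stoi throws std::invalid_argument on non-numeric input.
        const int term = std::stoi(token.substr(0, colon));
        const int count = std::stoi(token.substr(colon + 1));
        doc.emplace_back(term, count);
    }
    if (doc.size() != static_cast<std::size_t>(num_unique)) {
        throw std::runtime_error("pair count does not match declared [M]");
    }
    return doc;
}
```

For example, the line "3 0:2 17:1 42:5" describes a document with three unique terms, in which term 0 appears twice, term 17 once, and term 42 five times.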

Visualizing the Output

A Python script visualizes the topics. (Modified from https://github.com/Blei-Lab/onlineldavb)

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_coord_ascent.dat

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_svi.dat

./printtopics.py data/arxiv-vocab.dat \
                 results/Thu_Nov_27_10-45-09_2014/lambda_bumping.dat
