cca's Introduction

Author: Karl Stratos ([email protected])

Release version: 1.0
                                      
Requirements: python (2.7), numpy, scipy, sparsesvd, Matlab

This program is an implementation of canonical correlation analysis (CCA) in 
the context of deriving word embeddings. A theoretical justification of this 
implementation is provided in: 

A spectral algorithm for learning class-based n-gram models of natural language
Karl Stratos, Do-kyum Kim, Michael Collins, and Daniel Hsu.
In Proceedings of UAI (2014).

v------------------------------------------------------------------------------v
| Setup                                                                        |
^------------------------------------------------------------------------------^
First, make sure your machine has all the required programs listed above. To
run Matlab, edit the line in the call_matlab function in src/call_matlab.py
that sets the Matlab path so that it points to the Matlab executable on your
machine. For example, on my machine it is:

matlab = '/Applications/MATLAB_R2013b.app/bin/matlab' 

The easiest way to check that everything works is to run debug.py: 

$ python debug.py

v------------------------------------------------------------------------------v
| Preparing input data                                                         |
^------------------------------------------------------------------------------^
The input is a raw (but properly tokenized) text corpus. There is no 
restriction such as 'one sentence per line', since sentence boundaries are not 
needed. They can, however, be incorporated as special tokens. For example, 
take the toy corpus input/example/example.corpus:

the dog saw the cat
the dog barked
the cat meowed

You can put boundary markers, as in:

_START_ the dog saw the cat _END_
_START_ the dog barked _END_
_START_ the cat meowed _END_
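
If your corpus happens to have one sentence per line, the markers can be added
with a few lines of Python. This is a minimal sketch and not part of this
package: the function name add_boundary_markers is made up for illustration,
and any two tokens that do not otherwise occur in the corpus would serve as
markers.

```python
def add_boundary_markers(lines, start="_START_", end="_END_"):
    """Wrap every non-empty line in start/end marker tokens."""
    marked = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        marked.append(" ".join([start] + tokens + [end]))
    return marked


# The toy corpus from above, one sentence per line.
corpus = ["the dog saw the cat", "the dog barked", "the cat meowed"]
marked = add_boundary_markers(corpus)
```

Writing the returned lines to a new file gives a corpus that can be passed to
--corpus as usual.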

v------------------------------------------------------------------------------v
| Step 1: Deriving statistics                                                  |
^------------------------------------------------------------------------------^
In step 1, we extract co-occurrence statistics. For example, running:

python cca.py --corpus input/example/example.corpus --cutoff 1

will create a directory input/example/example.cutoff1.window3/ that contains 
statistics of example.corpus. The command line arguments for step 1 are 
the following:

  --corpus CORPUS  count words from this corpus
  --cutoff CUTOFF  cut off words appearing <= this number
  --vocab VOCAB    size of the vocabulary
  --window WINDOW  size of the sliding window
  --want WANT      want words in this file
  --rewrite        rewrite the (processed) corpus, not statistics

In particular, you can set the size of the context window (window). The default 
is 3, i.e., a word together with its previous and next words. You can control 
the size of the vocabulary by discarding rare words (cutoff) or by restricting 
the vocabulary size (vocab). 

Rare words are all replaced by a special token "<?>".
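
To make the effect of the cutoff concrete, here is a rough sketch of the
replacement rule on the toy corpus. It is an illustration only, not the code in
this package: the function name replace_rare is hypothetical, and the sketch
ignores the --vocab and --want options.

```python
from collections import Counter


def replace_rare(tokens, cutoff, rare="<?>"):
    """Replace every token whose count is <= cutoff with the rare symbol."""
    counts = Counter(tokens)
    return [t if counts[t] > cutoff else rare for t in tokens]


# The toy corpus as one token stream (line breaks do not matter here).
tokens = "the dog saw the cat the dog barked the cat meowed".split()
processed = replace_rare(tokens, cutoff=1)
new_counts = Counter(processed)
```

With --cutoff 1, the three words that appear once (saw, barked, meowed) are
merged into "<?>", which is why "<?>" has frequency 3 in the Ur example in
step 2.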

v------------------------------------------------------------------------------v
| Step 2: Deriving embeddings Ur                                               |
^------------------------------------------------------------------------------^
In step 2, we run Matlab to perform SVD on the statistics from step 1. Running:

python cca.py --stat input/example/example.cutoff1.window3/ --m 2 --kappa 2

will create a directory output/example.cutoff1.window3.m2.kappa2.matlab.out/
that contains the word embedding file Ur:

4 the -2.3410244894135657e-01 -9.7221193337649348e-01
3 <?> -8.6218169891930729e-01 -5.0659916901690338e-01
2 dog -9.3955297838817597e-01 3.4240356423657153e-01
2 cat -9.6347323867084655e-01 2.6780462722871301e-01

where each line has the space-separated format <frequency> <word> <val_1> 
<val_2> ... <val_m>, and the rows are ordered by decreasing frequency. 
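
A file in this format is easy to load for downstream use. The following is a
minimal sketch, assuming only the line format described above; the function
name read_embeddings is hypothetical and not provided by this package.

```python
import numpy as np


def read_embeddings(lines):
    """Parse '<frequency> <word> <val_1> ... <val_m>' lines.

    Returns (frequencies, words, matrix), where matrix has one row per word.
    """
    freqs, words, rows = [], [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        freqs.append(int(fields[0]))
        words.append(fields[1])
        rows.append([float(x) for x in fields[2:]])
    return freqs, words, np.array(rows)


# The four example lines shown above; for a real file, pass open(path).
sample = [
    "4 the -2.3410244894135657e-01 -9.7221193337649348e-01",
    "3 <?> -8.6218169891930729e-01 -5.0659916901690338e-01",
    "2 dog -9.3955297838817597e-01 3.4240356423657153e-01",
    "2 cat -9.6347323867084655e-01 2.6780462722871301e-01",
]
freqs, words, E = read_embeddings(sample)
```

Here E is a 4 x 2 array whose i-th row is the embedding of words[i].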

The command line arguments for step 2 are the following:

  --stat STAT      directory containing statistics
  --m M            number of dimensions
  --kappa KAPPA    smoothing parameter
  --quiet          quiet mode
  --no_matlab      do not call matlab - use python sparsesvd

In particular, m is the dimensionality of CCA, and kappa is a "pseudocount". 
The value of kappa needs to be tuned for the given corpus. Try experimenting 
with 50, 100, 200, ... (or if your data is huge like Google Ngram, 1000, 2000, 
...) until the performance on your problem stops improving. Matlab's SVD is 
very fast, so you can try many parameter values with ease. 
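
One way to organize such a search is to generate one step 2 command per kappa
value and run them in turn. The sketch below only builds the command lines from
the flags documented above; the helper name kappa_sweep_commands is
hypothetical, and evaluating each resulting Ur on your problem is left to you.

```python
def kappa_sweep_commands(stat_dir, m, kappas):
    """Build one 'python cca.py --stat ... --m ... --kappa ...' command per kappa."""
    return [
        ["python", "cca.py", "--stat", stat_dir, "--m", str(m), "--kappa", str(k)]
        for k in kappas
    ]


commands = kappa_sweep_commands(
    "input/example/example.cutoff1.window3/", m=2, kappas=[50, 100, 200]
)

# To actually run them (assumes cca.py is in the current directory):
#   import subprocess
#   for cmd in commands:
#       subprocess.check_call(cmd)
```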

v------------------------------------------------------------------------------v
| Optional post processing                                                     |
^------------------------------------------------------------------------------^
Depending on your problem, it might be a good idea to use only the top subspace 
of your word embeddings. You can derive lower dimensional embeddings via 
principal component analysis (PCA), e.g.:

python src/pca.py --embedding_file output/example.cutoff1.window3.m2.kappa2.matlab.out/Ur --pca_dim 1

Now you have a file Ur.pca1 that looks like:

4 the 0.906265637029
3 <?> 0.20812022154
2 dog -0.585143449361
2 cat -0.529242409207
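
For reference, a projection of this kind can be sketched with numpy as below.
This is an assumption-laden illustration, not the code in src/pca.py: it
centers the rows and projects onto the top right singular vectors, whereas the
exact preprocessing in src/pca.py is not described here. The sign of each
principal component is arbitrary, so values may differ from a file by a sign
flip.

```python
import numpy as np


def pca_project(E, dim):
    """Project the rows of E onto the top `dim` principal components."""
    centered = E - E.mean(axis=0)  # assumption: mean-center before PCA
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered.dot(Vt[:dim].T)


# The example Ur from step 2 (rows: the, <?>, dog, cat).
E = np.array([
    [-2.3410244894135657e-01, -9.7221193337649348e-01],
    [-8.6218169891930729e-01, -5.0659916901690338e-01],
    [-9.3955297838817597e-01, 3.4240356423657153e-01],
    [-9.6347323867084655e-01, 2.6780462722871301e-01],
])
E1 = pca_project(E, 1)  # shape (4, 1)
```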

cca's Issues

initialization of deque for every new line

In the function extract_stat() in strop.py, I think the deque should be re-initialized with the buffer symbol every time we read a new line of the input corpus file. Otherwise, we might add co-occurrences between words from different, unrelated lines of text to the XY Counter.
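
The difference can be seen in a small sliding-window sketch. This is not the
code in strop.py: the function count_pairs, the buffer symbol "<!>", and the
reset_per_line switch are all made up for illustration. Note also that the
README says sentence boundaries are not needed, so counting across lines may be
intended behavior.

```python
from collections import Counter, deque

BUF = "<!>"  # hypothetical buffer symbol


def count_pairs(lines, window=3, reset_per_line=True):
    """Count (center, context) pairs using a sliding window of odd size."""
    half = window // 2
    pairs = Counter()
    q = deque([BUF] * window, maxlen=window)

    def push(token):
        q.append(token)
        center = q[half]
        if center != BUF:
            for i, context in enumerate(q):
                if i != half:
                    pairs[(center, context)] += 1

    def flush():
        # Push buffers so trailing words become centers; q ends up all-buffer.
        for _ in range(window):
            push(BUF)

    for line in lines:
        for token in line.split():
            push(token)
        if reset_per_line:
            flush()  # the window never spans two lines
    if not reset_per_line:
        flush()
    return pairs


lines = ["the dog barked", "the cat meowed"]
with_reset = count_pairs(lines, reset_per_line=True)
without_reset = count_pairs(lines, reset_per_line=False)
```

Without the reset, "barked" (end of the first line) is counted as co-occurring
with "the" (start of the second line); with the reset, it co-occurs with the
buffer symbol instead.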
