
linearcorex's Introduction

Latent Factor Models Based on Linear Total Correlation Explanation (CorEx)

Linear CorEx finds latent factors that are as informative as possible about relationships in the data. The approach is described in this paper: Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality. This is useful for covariance estimation, clustering related variables, and dimensionality reduction, especially in the high-dimensional, under-sampled regime.

To install:

pip install linearcorex

Mathematically, the objective is to find factors y, where y = W x, x in R^n is the data, and W is an m-by-n weight matrix. We minimize TC(X|Y) + TC(Y), where TC is the "total correlation" or multivariate mutual information. This objective is optimized when the X_i's are independent after conditioning on the Y's, and the Y's themselves are independent. Instead of heuristically upper bounding this objective, as we do for discrete CorEx, we are able to optimize it exactly in the linear case. While this extension requires an assumption of linearity, the advantage is that the code is quite fast since it relies only on matrix algebra. In principle it could be further accelerated using GPUs.
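
For intuition about the quantity being optimized: for Gaussian variables, the total correlation has a closed form in terms of the correlation matrix, TC(X) = -1/2 log det(R). The snippet below is a minimal illustrative sketch of that formula; it is not part of the linearcorex API.

import numpy as np

def gaussian_total_correlation(X):
    # TC(X) = sum_i H(X_i) - H(X) = -0.5 * log det(R) for Gaussian X,
    # where R is the correlation matrix. Result is in nats and is zero
    # exactly when the variables are uncorrelated.
    R = np.corrcoef(X, rowvar=False)
    sign, logdet = np.linalg.slogdet(R)
    return -0.5 * logdet

X = np.random.randn(1000, 10) @ np.random.randn(10, 10)  # correlated toy data
print(gaussian_total_correlation(X))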

Without further constraints, the optima of this objective may have an undesirable property: information about the X_i's can be stored "synergistically" in the latent factors, so that predicting a single variable requires combining information from all of the latent factors. We therefore add a constraint that solutions be non-synergistic (each latent factor is individually informative about each variable X_i). This also recovers a property of the original lower-bound formulation from AISTATS: each latent factor makes a non-negative added contribution towards TC. By default, solutions are constrained to eliminate synergy, but you can turn this off by setting eliminate_synergy=False in the Python API or passing -a on the command line. For making nice trees, it should be left on (e.g. for the personality data or ADNI data).
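
In the Python API this is just a constructor flag, for example (a minimal sketch mirroring the eliminate_synergy option described above):

import linearcorex as lc

# Default: non-synergistic solutions (good for building trees).
model_tree = lc.Corex(n_hidden=5)

# Allow synergistic solutions (equivalent to -a on the command line).
model_syn = lc.Corex(n_hidden=5, eliminate_synergy=False)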

To test the command line interface, try:

cd $INSTALL_DIRECTORY/linearcorex/
python vis_corex.py ../tests/data/test_big5.csv --layers=5,1 --verbose=1 --no_row_names -o big5
python vis_corex.py ../tests/data/adni_blood.csv --layers=30,5,1 --missing=-1e6 --verbose=1 -o adni
python vis_corex.py ../tests/data/matrix.tcga_ov.geneset1.log2.varnorm.RPKM.txt --layers=30,5,1 --delimiter=' ' --verbose=1 --gaussianize="outliers" -o gene

Each of these examples generates pairwise plots of relationships and a graph.

The Python API follows the sklearn conventions of fit/transform.

import linearcorex as lc
import numpy as np

out = lc.Corex(n_hidden=5, verbose=True)  # A Corex model with 5 factors
X = np.random.random((100, 50))  # Random data with 100 samples and 50 variables
out.fit(X)  # Fit the model on data
y = out.transform(X)  # Transform data into latent factors
print(out.clusters)  # See the clusters
cov = out.get_covariance()  # The covariance matrix

Missing values can be specified, but are just imputed in a naive way.
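
If you prefer to control the imputation yourself, one option is to fill in missing entries before calling fit. The sketch below uses naive column-mean imputation with NaN as the missing-value marker; it is an illustrative preprocessing step, not necessarily what the library does internally.

import numpy as np
import linearcorex as lc

X = np.random.random((100, 50))
X[0, 0] = np.nan                      # pretend one entry is missing

col_means = np.nanmean(X, axis=0)     # per-variable means, ignoring NaNs
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]       # naive column-mean imputation

out = lc.Corex(n_hidden=5)
out.fit(X)                            # then fit as usual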

Papers

See Sifting Common Info... and Maximally informative representations... for work building up to this method. The main paper describing the method is Low Complexity Gaussian Latent Factor Models and a Blessing of Dimensionality. The connections with the idea of "synergy" will be described in future work.

Troubleshooting visualization

For Mac users:

Getting the visualization of the hierarchy to look nice sometimes takes a little effort. To get graphs to compile correctly, install graphviz with GTS support; using brew, run "brew install gts" followed by "brew install --with-gts graphviz". The (hacky) way the visualizations are produced is as follows. The code, vis_corex.py, writes a text file called "graphs/graph.dot", which encodes the edges between nodes in dot format. The code then calls sfdp, a command-line utility that is part of graphviz:

sfdp graph.dot -Tpdf -Earrowhead=none -Nfontsize=12  -GK=2 -Gmaxiter=1000 -Goverlap=False -Gpack=True -Gpackmode=clust -Gsep=0.01 -Gsplines=False -o graph_sfdp.pdf

These dot files can also be opened with OmniGraffle if you would like to manipulate them by hand. You can recompile the graphs yourself with different options to make them look nicer, or edit the dot files to get effects like colored nodes.
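
If you want to script that recompilation, here is a small sketch of calling sfdp yourself (paths and options are illustrative; adjust the -G flags to taste):

import subprocess

# Re-render graphs/graph.dot with custom sfdp options.
subprocess.run([
    "sfdp", "graphs/graph.dot", "-Tpdf",
    "-Earrowhead=none", "-Nfontsize=12",
    "-GK=2", "-Gmaxiter=1000", "-Goverlap=False",
    "-Gsplines=False",
    "-o", "graphs/graph_sfdp.pdf",
], check=True)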

For Ubuntu users:

Credits: https://gitlab.com/graphviz/graphviz/issues/1237

  1. Remove any existing installation with conda uninstall graphviz. (If you did not install with Conda, you might need to do sudo apt purge graphviz and/or pip uninstall graphviz).

  2. Run sudo apt install libgts-dev

  3. Run sudo pkg-config --libs gts

  4. Run sudo pkg-config --cflags gts

  5. Download graphviz-2.40.1.tar.gz from here

  6. Navigate to the directory containing the download and extract it with tar -xvf graphviz-2.40.1.tar.gz (or whatever the downloaded archive is named).

  7. cd into the extracted folder (i.e. cd graphviz-2.40.1) and run sudo ./configure --with-gts

  8. Run sudo make in the folder

  9. Run sudo make install in the folder

  10. Reinstall the Python library with pip install graphviz

linearcorex's People

Contributors

eswar3, gregversteeg

linearcorex's Issues

Finding the optimal number of hidden factors

Hello, my team and I are trying to understand how to choose the optimal number of hidden factors. From what we've found, the goal is to maximize the TC (total correlation). But after some tries with different settings, the value of the tc property keeps increasing as we increase the number of hidden factors.
We also have doubts about the tcs property, since we're not sure of its meaning: in our runs, the median of the tcs decreases rapidly as the number of hidden factors increases, and we're not sure how to interpret that.

So, basically, the main problem is: how can we choose the optimal number of hidden factors?

Thank you,
Roberto

[Plot: TC vs. number of hidden factors]

[Plot: median of the TCs vs. number of hidden factors]
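
A minimal sketch of the kind of sweep described above (random data as a stand-in; tc and tcs are the model attributes mentioned in the question):

import numpy as np
import linearcorex as lc

X = np.random.random((200, 50))  # substitute your own data here

for m in (1, 2, 5, 10, 20):
    model = lc.Corex(n_hidden=m)
    model.fit(X)
    print(m, model.tc, np.median(model.tcs))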

Error at the end of fitting

File "run_corex.py", line 26, in <module>
   out.fit(mat2)
File "/home/charles/LinearCorex/linearcorex/linearcorex.py", line 134, in fit
   last_tc = self.tc  # Save this TC to compare to possible updates
 File "/home/charles/LinearCorex/linearcorex/linearcorex.py", line 176, in tc
   return self.moments["TC"]
TypeError: 'bool' object has no attribute '__getitem__'

This happened after I tried a long run: it seemed to finish fitting but then died at the end. It's hard to pin down the exact cause since it runs fine with smaller data sets (this one was 126654 x 10000 with n_hidden=500 and gpu=True).
