
mvlearn's Introduction


mvlearn is an open-source Python software package for multiview learning tools.

mvlearn aims to serve as a community-driven, open-source software package that offers reference implementations of algorithms and methods for multiview learning (machine learning in settings where each sample has multiple incommensurate views or feature sets). It brings together the most widely used tools in this setting with a standardized, scikit-learn-like API, well-tested code, and high-quality documentation. In doing so, we aim to facilitate the application, extension, and comparison of methods, and to offer a foundation for research into new multiview algorithms. We welcome new contributors and the addition of methods with proven efficacy and current use.

Citing mvlearn

If you find the package useful for your research, please cite our JMLR Paper.

Perry, Ronan, et al. "mvlearn: Multiview Machine Learning in Python." Journal of Machine Learning Research 22.109 (2021): 1-7.

BibTeX entry:

@article{perry2021mvlearn,
  title={mvlearn: Multiview Machine Learning in Python},
  author={Perry, Ronan and Mischler, Gavin and Guo, Richard and Lee, Theodore and Chang, Alexander and Koul, Arman and Franz, Cameron and Richard, Hugo and Carmichael, Iain and Ablin, Pierre and Gramfort, Alexandre and Vogelstein, Joshua T.},
  journal={Journal of Machine Learning Research},
  volume={22},
  number={109},
  pages={1-7},
  year={2021}
}


mvlearn's Issues

MVMDS Tutorial

The last step in my PR is creating a comprehensive notebook.

Remove PLS

Projection coefficients are readily available in sklearn, so a separate PLS implementation is redundant.

Update internal functions in MVMDS/clear up the proof

Move the checks and if statements out of the fit function and into a separate internal function, per Richard and Ben's requests.

Also want to make an artificial dataset whose results are more visually evident (as opposed to merely numerically significant).

MVMDS Notebook

The tests that I wrote for MVMDS aren't currently working. I would like to fix them and then work on creating a tutorial notebook that properly shows what is going on.

Implement CoTraining Classification

Co-training for classification is important because it can be used in a wide variety of semi-supervised tasks, in both multiview and single-view settings. Introduced in the paper by Blum and Mitchell, it is a widely used algorithm that should be included in the multiview package.

To be most applicable, it should work with a variety of classifier types so that the user can specify many parameters about the classification process on their own.

References
Blum, Avrim, and Tom Mitchell. "Combining labeled and unlabeled data with co-training." Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998.
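A minimal sketch of the Blum–Mitchell loop (illustrative only: Gaussian naive Bayes stands in as the per-view base classifier, but as noted above the real implementation should accept arbitrary classifiers):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
n = 200
y = rng.randint(0, 2, n)
# Two conditionally independent views of the same label, as Blum & Mitchell assume
X1 = y[:, None] + 0.5 * rng.randn(n, 2)
X2 = 2.0 * y[:, None] + 0.5 * rng.randn(n, 2)

labeled = np.zeros(n, dtype=bool)
labeled[:20] = True                      # only 20 labels are known
y_part = np.where(labeled, y, -1)

clf1, clf2 = GaussianNB(), GaussianNB()
while not labeled.all():
    clf1.fit(X1[labeled], y_part[labeled])
    clf2.fit(X2[labeled], y_part[labeled])
    # Each classifier pseudo-labels the unlabeled points it is most confident on
    for clf, X in ((clf1, X1), (clf2, X2)):
        unl = np.flatnonzero(~labeled)
        if unl.size == 0:
            break
        conf = clf.predict_proba(X[unl]).max(axis=1)
        pick = unl[np.argsort(conf)[-5:]]        # 5 most confident points
        y_part[pick] = clf.predict(X[pick])
        labeled[pick] = True

accuracy = (y_part == y).mean()
```

On this toy data the pseudo-labels recover most of the true labels from only 20 known ones, which is the behavior the algorithm promises in the semi-supervised setting.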

Doc build fails due to extensive memory usage

This comes from readthedocs when pip installing all of our dependencies.

The build will sometimes pass and sometimes fail. I'm guessing this depends on how many resources readthedocs has available at the time.

Fix and experiment with Multiview Spectral Clustering Algorithm

The current implementation of the multi-view spectral clustering algorithm produces results similar to the code provided by the authors of the original paper on one particular dataset. However, it does not replicate the results from the corresponding paper, nor does it consistently outperform single-view spectral clustering. Further experimentation with the code (and the lab's MATLAB code) is required to determine potential differences between the implementations and the potential limitations of multi-view spectral clustering. I will therefore re-validate my implementation of multi-view spectral clustering and add the updated version to the package.

Link to the paper: https://pdfs.semanticscholar.org/917e/01d4be4a3417b2941ecd548186a6bf868358.pdf

Explore Multiview Brain Data

I'd like to explore data where there are two views, and one is data from the brain. I want to think about the ideal theoretical function that could transform one view to the other.

Datasets:
MindBrainBody:
15 minute resting MRI <-----> results of various psychological tests

Simulated:
random image data <-----> some complex function of the random image data

Andreas Tolias Paper:
5100 Imagenet 32x64 images <-----> mouse response at 8000 neurons (two-photon imaging) in a couple of mice

I think the last two are the most promising for finding some relationship. I'll have to see if I can get the Tolias data, so right now the simulated data looks most promising.

For the simulated data, one end of the spectrum is cryptographic functions; at the other end of the spectrum is linear functions.

Add Deep CCA

Deep CCA is a useful technique for enhancing the results of CCA with deep networks in a two-view setting. It is analogous to training a special data-driven kernel for a modified kernel CCA. For the first sprint this semester, my ultimate goal will be to implement DCCA for this package based on the original paper:

http://proceedings.mlr.press/v28/andrew13.pdf

To do this, I will benchmark existing implementations, such as the following, in order to compare them to the results of the paper, and ultimately develop code either by drawing from these implementations or starting from scratch. I will benchmark these implementations against Table 1 of the paper, which provides correlation results on a two-view modified MNIST dataset. Then, if they seem to function well, I will port them into mvlearn. If they don't, I will work on my own implementation and benchmark it against these and the paper.

Existing DCCA implementations:

https://github.com/Michaelvll/DeepCCA
https://github.com/VahidooX/DeepCCA
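For reference, the quantity DCCA maximizes is the canonical correlation of the two views' network outputs; its linear special case can be sketched as an SVD of the whitened cross-covariance (a numpy sketch for intuition, not the benchmark code):

```python
import numpy as np

def canonical_correlations(H1, H2, n_components=2, reg=1e-4):
    # Linear special case of the DCCA objective: the singular values of
    # the whitened cross-covariance are the canonical correlations.
    # In DCCA, H1 and H2 would be the top-layer activations of the nets.
    H1 = H1 - H1.mean(axis=0)
    H2 = H2 - H2.mean(axis=0)
    n = len(H1)
    S11 = H1.T @ H1 / (n - 1) + reg * np.eye(H1.shape[1])  # regularized covariances
    S22 = H2.T @ H2 / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    svals = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22),
                          compute_uv=False)
    return svals[:n_components]

rng = np.random.RandomState(0)
z = rng.randn(500, 1)                      # signal shared across views
H1 = np.hstack([z, rng.randn(500, 2)])
H2 = np.hstack([z, rng.randn(500, 2)])
corrs = canonical_correlations(H1, H2)     # first correlation should be near 1
```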

Work with data

Create some useful insights by applying functions from the package to the Satterwhite data. Also looking to work on the paper as it shapes up.

requirements.txt is too strict

For example, there's the line numpy==1.17.2, whereas numpy>=1.17 (or even just numpy) would probably be fine and would prevent people from running into issues installing different versions of packages they already have.

We can probably fix this by figuring out which requirements actually need to be strict and relaxing the rest from exact == pins.
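A sketch of the proposed change to requirements.txt (the numpy line is from the issue; other pins would be relaxed the same way case by case):

```text
# before: every dependency pinned exactly
numpy==1.17.2

# after: only a floor where one is actually needed
numpy>=1.17
```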

Add multiview simulations

Add some simple simulations (some of these can be taken from papers or simulations that you used in tutorials).

Should be able to specify the number of dimensions, the correlation between views, etc., modeled after sklearn.datasets.

Add some tutorials for these as well.
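A minimal sketch of what such an sklearn.datasets-style generator might look like (the name `make_multiview` and its parameters are hypothetical, not an existing API):

```python
import numpy as np

def make_multiview(n_samples=100, n_features=(4, 6), n_latent=2,
                   noise=0.1, random_state=None):
    # Both views are linear maps of a shared latent signal plus noise,
    # so they are correlated through the latent factors.
    rng = np.random.RandomState(random_state)
    latent = rng.randn(n_samples, n_latent)        # shared signal
    views = []
    for p in n_features:
        W = rng.randn(n_latent, p)                 # view-specific loadings
        views.append(latent @ W + noise * rng.randn(n_samples, p))
    return views

Xs = make_multiview(n_samples=200, random_state=0)
```

Knobs for the between-view correlation could be added by scaling the shared latent against view-specific latent factors.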

Implement Multi-view Spherical KMeans

This algorithm is very similar to multi-view k-means, and much of the code has already been implemented in a Jupyter notebook. Furthermore, the implementation has been verified by replicating some results from the corresponding paper. As such, it makes sense to clean up the code, add unit tests, perform additional validation, and add it to the package.

For validating the algorithm, I will be recreating the second half of Figure 3 from the Multi-view Clustering Paper written by Bickel and Scheffer. I will also be comparing the performance of the multi-view algorithm against the single-view spherical kmeans algorithm on simulated data using similar cases as the validation performed on multi-view kmeans.

Link to the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.312&rep=rep1&type=pdf
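For context, the single-view spherical k-means baseline mentioned above fits in a few lines: data and centroids live on the unit sphere, and assignment uses cosine similarity. A numpy sketch (not the package implementation):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50):
    # Project the data onto the unit sphere; cluster by cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = [X[0]]
    for _ in range(k - 1):                          # farthest-first seeding
        sims = np.max(X @ np.stack(centers).T, axis=1)
        centers.append(X[np.argmin(sims)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)   # cosine-similarity assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)  # centroid back onto the sphere
    return labels

rng = np.random.RandomState(1)
A = np.vstack([rng.randn(50, 3) + [5, 0, 0],
               rng.randn(50, 3) + [0, 5, 0]])
labels = spherical_kmeans(A, 2)
```

The multi-view version alternates this update across views, as in Bickel and Scheffer.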

Adding DCCAE

How can we handle adding DCCAE? It involves two autoencoders but usually choosing the architecture (e.g. number of layers, convolutional or not) is a big decision. Maybe we can create two basic parameters (convolutional or not, how many layers) to cover most use cases.

Another problem is the .fit(self, X, y) method of the scikit-learn API. This is fine for data that fits into memory, but for bigger datasets the standard approach is a DataLoader, which loads the data from disk and applies any transforms on the fly.
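One way to reconcile the two is a DataLoader-style generator over aligned view batches; a numpy sketch of the idea (the real implementation would likely wrap a PyTorch DataLoader):

```python
import numpy as np

def iter_batches(X1, X2, batch_size=32, shuffle=True, seed=0):
    # Yield aligned mini-batches of the two views, so a DCCAE can be
    # trained without materializing transforms of the full dataset.
    # (Here the views are in-memory arrays, but they could just as well
    # be memory-mapped files loaded lazily from disk.)
    idx = np.arange(len(X1))
    if shuffle:
        np.random.RandomState(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        b = idx[start:start + batch_size]
        yield X1[b], X2[b]

X1, X2 = np.zeros((100, 5)), np.zeros((100, 7))
batches = list(iter_batches(X1, X2))
```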

Create web-hosted documentation

Graspy used Sphinx and Netlify to compile and host its documentation. This is currently being worked on in the "netfily" branch. I think we would just want to compile from master, but we could do a release branch and compile from there.

N-clustering

Selection of highly correlated features with GCCA, followed by clustering on the feature space (?). A potential generalization of bi-clustering, with applications to real data. May need regularization to reduce the dimension of the selected features.

Implement Generalized Multiview Analysis (supervised CCA)

500+ citations, consider investigating.

"GMA is a supervised extension of Canonical Correlational Analysis (CCA), which is useful for cross-view classification and retrieval. The proposed approach is general and has the potential to replace CCA whenever classification or retrieval is the purpose and label information is available. GMA has all the desirable properties required for cross-view classification and retrieval: it is supervised, it allows generalization to unseen classes, it is multi-view and kernelizable, it affords an efficient eigenvalue based solution and is applicable to any domain."

https://www.cs.umd.edu/~bhokaal/files/documents/GMA.pdf

Add sequential fit to GCCA

The first GCCA step is preprocessing and an SVD of each view separately. Add a multistep fit function that accepts views one at a time. This is necessary for large datasets so as not to exceed RAM.
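A sketch of what such a multistep fit could look like (the `SequentialGCCA` class and `partial_fit` name are hypothetical, not mvlearn's actual API): each call preprocesses and SVDs one view, so only that view needs to be in RAM at a time.

```python
import numpy as np

class SequentialGCCA:
    # Hypothetical API sketch: per-view preprocessing + SVD happen as each
    # view arrives; the joint GCCA step would then run on the stored bases.
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.bases_ = []

    def partial_fit(self, X):
        Xc = X - X.mean(axis=0)                       # per-view preprocessing
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        self.bases_.append(U[:, :self.n_components])  # keep top singular subspace
        return self

rng = np.random.RandomState(0)
g = SequentialGCCA()
for X in (rng.randn(50, 10), rng.randn(50, 8)):
    g.partial_fit(X)          # views passed in one at a time
```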

Fix docs

  • Can't access items in reference
  • Consider making notebook output build when the docs build

Add CoTraining Regression

Given that the package already has CoTraining Classification, it seems a natural extension to provide support for cotraining regression. While not as widely used as cotraining classifiers, the need for regression comes up in many contexts, and could potentially be a useful tool for users of the package in a variety of multiview settings.

The base algorithm is based on kNN regression, and would logically extend the Base CoTraining class in the package.

References:
https://pdfs.semanticscholar.org/437c/85ad1c05f60574544d31e96bd8e60393fc92.pdf
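A rough numpy sketch of one co-training round with kNN regressors, where each view's regressor pseudo-labels the unlabeled pool as extra training data for the other view (the data and helper are illustrative; the real version would follow the confidence criterion in the reference):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    # Plain kNN regression: average the targets of the k nearest neighbors.
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return ytr[nn].mean(axis=1)

rng = np.random.RandomState(0)
t = rng.rand(120)                                         # hidden variable
X1 = np.c_[t, t ** 2] + 0.01 * rng.randn(120, 2)          # view 1
X2 = np.c_[np.sin(t), t ** 3] + 0.01 * rng.randn(120, 2)  # view 2
y = 3.0 * t
lab, unl = np.arange(20), np.arange(20, 120)              # few labeled points

# One co-training round: each view's regressor pseudo-labels the
# unlabeled pool for the *other* view's training set.
pseudo1 = knn_predict(X1[lab], y[lab], X1[unl])
pseudo2 = knn_predict(X2[lab], y[lab], X2[unl])
pred1 = knn_predict(np.r_[X1[lab], X1[unl]], np.r_[y[lab], pseudo2], X1[unl])
pred2 = knn_predict(np.r_[X2[lab], X2[unl]], np.r_[y[lab], pseudo1], X2[unl])
err = np.abs((pred1 + pred2) / 2 - y[unl]).mean()
```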
