
mvlearn's Introduction


mvlearn is an open-source Python software package for multiview learning tools.

mvlearn aims to serve as a community-driven, open-source software package that offers reference implementations of algorithms and methods for multiview learning (machine learning in settings where each sample has multiple incommensurate views or feature sets). It brings together the most widely used tools in this setting with a standardized, scikit-learn-like API, well-tested code, and high-quality documentation. In doing so, we aim to facilitate the application, extension, and comparison of methods, and to offer a foundation for research into new multiview algorithms. We welcome new contributors and the addition of methods with proven efficacy and current use.

Citing mvlearn

If you find the package useful for your research, please cite our JMLR Paper.

Perry, Ronan, et al. "mvlearn: Multiview Machine Learning in Python." Journal of Machine Learning Research 22.109 (2021): 1-7.

BibTeX entry:

@article{perry2021mvlearn,
  title={mvlearn: Multiview Machine Learning in Python},
  author={Perry, Ronan and Mischler, Gavin and Guo, Richard and Lee, Theodore and Chang, Alexander and Koul, Arman and Franz, Cameron and Richard, Hugo and Carmichael, Iain and Ablin, Pierre and Gramfort, Alexandre and Vogelstein, Joshua T.},
  journal={Journal of Machine Learning Research},
  volume={22},
  number={109},
  pages={1-7},
  year={2021}
}


mvlearn's Issues

MVMDS Tutorial

The last step in my PR is creating a comprehensive notebook.

Remove PLS

Projection coefficients are readily available in sklearn, so a separate PLS implementation is redundant.

Update internal functions in MVMDS/clear up the proof

Move the checks and if statements out of the fit function and into a separate internal function, per Richard and Ben's requests.

Also want to make an artificial dataset whose results are more visually evident (as opposed to merely numerically significant).

MVMDS Notebook

The tests that I wrote for MVMDS aren't currently working. I would like to fix them and then work on creating a tutorial notebook that properly shows what is going on.

Implement CoTraining Classification

Co-training for classification is important because it can be used in a wide variety of semi-supervised tasks, in both multiview and single-view settings. Introduced in the paper by Blum and Mitchell, it is a widely used algorithm that should be included in the multiview package.

To be most applicable, it should work with a variety of classifier types so that the user can specify many parameters about the classification process on their own.

References
Blum, Avrim, and Tom Mitchell. "Combining labeled and unlabeled data with co-training." Proceedings of the eleventh annual conference on Computational learning theory. ACM, 1998.
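A minimal sketch of the Blum–Mitchell loop (illustrative only: Gaussian naive Bayes stands in as the per-view base classifier, but as noted above the real implementation should accept arbitrary classifiers):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
n = 200
y = rng.randint(0, 2, n)
# Two conditionally independent views of the same label, as Blum & Mitchell assume
X1 = y[:, None] + 0.5 * rng.randn(n, 2)
X2 = 2.0 * y[:, None] + 0.5 * rng.randn(n, 2)

labeled = np.zeros(n, dtype=bool)
labeled[:20] = True                      # only 20 labels are known
y_part = np.where(labeled, y, -1)

clf1, clf2 = GaussianNB(), GaussianNB()
while not labeled.all():
    clf1.fit(X1[labeled], y_part[labeled])
    clf2.fit(X2[labeled], y_part[labeled])
    # Each classifier pseudo-labels the unlabeled points it is most confident on
    for clf, X in ((clf1, X1), (clf2, X2)):
        unl = np.flatnonzero(~labeled)
        if unl.size == 0:
            break
        conf = clf.predict_proba(X[unl]).max(axis=1)
        pick = unl[np.argsort(conf)[-5:]]        # 5 most confident points
        y_part[pick] = clf.predict(X[pick])
        labeled[pick] = True

accuracy = (y_part == y).mean()
```

On this toy data the pseudo-labels recover most of the true labels from only 20 known ones, which is the behavior the algorithm promises in the semi-supervised setting.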

Doc build fails due to extensive memory usage

This comes from readthedocs when pip installing all of our dependencies.

The build will sometimes pass and sometimes fail. I'm guessing this depends on how many resources readthedocs has available at the time.

Fix and experiment with Multiview Spectral Clustering Algorithm

The current implementation of the multi-view spectral clustering algorithm produces results similar to the code provided by the authors of the original paper on one particular dataset. However, it does not replicate the results from the corresponding paper, nor does it consistently outperform single-view spectral clustering. Further experimentation with the code (and the lab's MATLAB code) is required to determine potential differences between the implementations and the potential limitations of multi-view spectral clustering. I will therefore re-validate my implementation of multi-view spectral clustering and add the updated version to the package.

Link to the paper: https://pdfs.semanticscholar.org/917e/01d4be4a3417b2941ecd548186a6bf868358.pdf

Explore Multiview Brain Data

I'd like to explore data where there are two views, and one is data from the brain. I want to think about the ideal theoretical function that could transform one view to the other.

Datasets:
MindBrainBody:
15 minute resting MRI <-----> results of various psychological tests

Simulated:
random image data <-----> some complex function of the random image data

Andreas Tolias Paper:
5100 Imagenet 32x64 images <-----> mouse response at 8000 neurons (two-photon imaging) in a couple of mice

I think the last two are the most promising for finding some relationship. I'll have to see if I can get the Tolias data, so right now the simulated data looks most promising.

For the simulated data, one end of the spectrum is cryptographic functions; at the other end of the spectrum is linear functions.

Add Deep CCA

Deep CCA is a useful technique for enhancing the results of CCA with deep networks in a two-view setting. It is analogous to training a special data-driven kernel for a modified kernel CCA. For the first sprint this semester, my ultimate goal will be to implement DCCA for this package based on the original paper:

http://proceedings.mlr.press/v28/andrew13.pdf

To do this, I will benchmark existing implementations, such as the following, in order to compare them to the results of the paper, and ultimately develop code either by drawing from these implementations or starting from scratch. I will benchmark these implementations against Table 1 of the paper, which provides correlation results on a two-view modified MNIST dataset. Then, if they seem to function well, I will port them into mvlearn. If they don't, I will work on my own implementation and benchmark it against these and the paper.

Existing DCCA implementations:

https://github.com/Michaelvll/DeepCCA
https://github.com/VahidooX/DeepCCA
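For reference, the quantity DCCA maximizes is the canonical correlation of the two views' network outputs; its linear special case can be sketched as an SVD of the whitened cross-covariance (a numpy sketch for intuition, not the benchmark code):

```python
import numpy as np

def canonical_correlations(H1, H2, n_components=2, reg=1e-4):
    # Linear special case of the DCCA objective: the singular values of
    # the whitened cross-covariance are the canonical correlations.
    # In DCCA, H1 and H2 would be the top-layer activations of the nets.
    H1 = H1 - H1.mean(axis=0)
    H2 = H2 - H2.mean(axis=0)
    n = len(H1)
    S11 = H1.T @ H1 / (n - 1) + reg * np.eye(H1.shape[1])  # regularized covariances
    S22 = H2.T @ H2 / (n - 1) + reg * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    svals = np.linalg.svd(inv_sqrt(S11) @ S12 @ inv_sqrt(S22),
                          compute_uv=False)
    return svals[:n_components]

rng = np.random.RandomState(0)
z = rng.randn(500, 1)                      # signal shared across views
H1 = np.hstack([z, rng.randn(500, 2)])
H2 = np.hstack([z, rng.randn(500, 2)])
corrs = canonical_correlations(H1, H2)     # first correlation should be near 1
```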

Work with data

Create some useful insights by applying functions from the package to the Satterwhite data. Also looking to work on the paper as it shapes up.

requirements.txt is too strict

For example, there's the line numpy==1.17.2, whereas numpy>=1.17 (or even just numpy) would probably be fine and would prevent people from running into issues installing different versions of packages they already have.

We can probably fix this by figuring out which requirements actually need to be strict and relaxing the rest from exact == pins.
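A sketch of the proposed change to requirements.txt (the numpy line is from the issue; other pins would be relaxed the same way case by case):

```text
# before: every dependency pinned exactly
numpy==1.17.2

# after: only a floor where one is actually needed
numpy>=1.17
```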

Add multiview simulations

Add some simple simulations (some of these can be taken from papers or simulations that you used in tutorials).

Should be able to specify the number of dimensions, the correlation between views, etc., modeled after sklearn.datasets.

Add some tutorials for these as well.
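A minimal sketch of what such an sklearn.datasets-style generator might look like (the name `make_multiview` and its parameters are hypothetical, not an existing API):

```python
import numpy as np

def make_multiview(n_samples=100, n_features=(4, 6), n_latent=2,
                   noise=0.1, random_state=None):
    # Both views are linear maps of a shared latent signal plus noise,
    # so they are correlated through the latent factors.
    rng = np.random.RandomState(random_state)
    latent = rng.randn(n_samples, n_latent)        # shared signal
    views = []
    for p in n_features:
        W = rng.randn(n_latent, p)                 # view-specific loadings
        views.append(latent @ W + noise * rng.randn(n_samples, p))
    return views

Xs = make_multiview(n_samples=200, random_state=0)
```

Knobs for the between-view correlation could be added by scaling the shared latent against view-specific latent factors.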

Implement Multi-view Spherical KMeans

This algorithm is very similar to multi-view k-means, and much of the code has already been implemented in a Jupyter notebook. Furthermore, the implementation has been verified by replicating some results from the corresponding paper. As such, it makes sense to clean up the code, add unit tests, perform additional validation, and add it to the package.

For validating the algorithm, I will be recreating the second half of Figure 3 from the Multi-view Clustering Paper written by Bickel and Scheffer. I will also be comparing the performance of the multi-view algorithm against the single-view spherical kmeans algorithm on simulated data using similar cases as the validation performed on multi-view kmeans.

Link to the paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.66.312&rep=rep1&type=pdf
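For context, the single-view spherical k-means baseline mentioned above fits in a few lines: data and centroids live on the unit sphere, and assignment uses cosine similarity. A numpy sketch (not the package implementation):

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=50):
    # Project the data onto the unit sphere; cluster by cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = [X[0]]
    for _ in range(k - 1):                          # farthest-first seeding
        sims = np.max(X @ np.stack(centers).T, axis=1)
        centers.append(X[np.argmin(sims)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        labels = np.argmax(X @ centers.T, axis=1)   # cosine-similarity assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)  # centroid back onto the sphere
    return labels

rng = np.random.RandomState(1)
A = np.vstack([rng.randn(50, 3) + [5, 0, 0],
               rng.randn(50, 3) + [0, 5, 0]])
labels = spherical_kmeans(A, 2)
```

The multi-view version alternates this update across views, as in Bickel and Scheffer.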

Adding DCCAE

How can we handle adding DCCAE? It involves two autoencoders but usually choosing the architecture (e.g. number of layers, convolutional or not) is a big decision. Maybe we can create two basic parameters (convolutional or not, how many layers) to cover most use cases.

Another problem is the .fit(self, X, y) method of the scikit-learn API. This is fine for data that fits into memory, but for bigger datasets the standard approach is a DataLoader, which loads the data from disk and applies any transforms on the fly.
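One way to reconcile the two is a DataLoader-style generator over aligned view batches; a numpy sketch of the idea (the real implementation would likely wrap a PyTorch DataLoader):

```python
import numpy as np

def iter_batches(X1, X2, batch_size=32, shuffle=True, seed=0):
    # Yield aligned mini-batches of the two views, so a DCCAE can be
    # trained without materializing transforms of the full dataset.
    # (Here the views are in-memory arrays, but they could just as well
    # be memory-mapped files loaded lazily from disk.)
    idx = np.arange(len(X1))
    if shuffle:
        np.random.RandomState(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        b = idx[start:start + batch_size]
        yield X1[b], X2[b]

X1, X2 = np.zeros((100, 5)), np.zeros((100, 7))
batches = list(iter_batches(X1, X2))
```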

Create web-hosted documentation

Graspy used Sphinx and Netlify to compile and host its documentation. This is currently being worked on in the "netfily" branch. I think we would just want to compile from master, but we could do a release branch and compile from there.

N-clustering

Selection of highly correlated features with GCCA, followed by clustering on the feature space (?). A potential generalization of bi-clustering, with applications to real data. May need regularization to reduce the dimension of the selected features.

Implement Generalized Multiview Analysis (supervised CCA)

500+ citations, consider investigating.

"GMA is a supervised extension of Canonical Correlational Analysis (CCA), which is useful for cross-view classification and retrieval. The proposed approach is general and has the potential to replace CCA whenever classification or retrieval is the purpose and label information is available. GMA has all the desirable properties required for cross-view classification and retrieval: it is supervised, it allows generalization to unseen classes, it is multi-view and kernelizable, it affords an efficient eigenvalue based solution and is applicable to any domain."

https://www.cs.umd.edu/~bhokaal/files/documents/GMA.pdf

Add sequential fit to GCCA

The first GCCA step is preprocessing and an SVD of each view separately. Add a multistep fit function that accepts views one at a time. This is necessary for large datasets so as not to exceed RAM.
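A sketch of what such a multistep fit could look like (the `SequentialGCCA` class and `partial_fit` name are hypothetical, not mvlearn's actual API): each call preprocesses and SVDs one view, so only that view needs to be in RAM at a time.

```python
import numpy as np

class SequentialGCCA:
    # Hypothetical API sketch: per-view preprocessing + SVD happen as each
    # view arrives; the joint GCCA step would then run on the stored bases.
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.bases_ = []

    def partial_fit(self, X):
        Xc = X - X.mean(axis=0)                       # per-view preprocessing
        U, _, _ = np.linalg.svd(Xc, full_matrices=False)
        self.bases_.append(U[:, :self.n_components])  # keep top singular subspace
        return self

rng = np.random.RandomState(0)
g = SequentialGCCA()
for X in (rng.randn(50, 10), rng.randn(50, 8)):
    g.partial_fit(X)          # views passed in one at a time
```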

Fix docs

  • Can't access items in reference
  • Consider making notebook output build when the docs build

Add CoTraining Regression

Given that the package already has CoTraining Classification, it seems a natural extension to provide support for cotraining regression. While not as widely used as cotraining classifiers, the need for regression comes up in many contexts, and could potentially be a useful tool for users of the package in a variety of multiview settings.

The base algorithm is based on kNN regression, and would logically extend the Base CoTraining class in the package.

References:
https://pdfs.semanticscholar.org/437c/85ad1c05f60574544d31e96bd8e60393fc92.pdf
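A rough numpy sketch of one co-training round with kNN regressors, where each view's regressor pseudo-labels the unlabeled pool as extra training data for the other view (the data and helper are illustrative; the real version would follow the confidence criterion in the reference):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    # Plain kNN regression: average the targets of the k nearest neighbors.
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return ytr[nn].mean(axis=1)

rng = np.random.RandomState(0)
t = rng.rand(120)                                         # hidden variable
X1 = np.c_[t, t ** 2] + 0.01 * rng.randn(120, 2)          # view 1
X2 = np.c_[np.sin(t), t ** 3] + 0.01 * rng.randn(120, 2)  # view 2
y = 3.0 * t
lab, unl = np.arange(20), np.arange(20, 120)              # few labeled points

# One co-training round: each view's regressor pseudo-labels the
# unlabeled pool for the *other* view's training set.
pseudo1 = knn_predict(X1[lab], y[lab], X1[unl])
pseudo2 = knn_predict(X2[lab], y[lab], X2[unl])
pred1 = knn_predict(np.r_[X1[lab], X1[unl]], np.r_[y[lab], pseudo2], X1[unl])
pred2 = knn_predict(np.r_[X2[lab], X2[unl]], np.r_[y[lab], pseudo1], X2[unl])
err = np.abs((pred1 + pred2) / 2 - y[unl]).mean()
```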
