
umap's Introduction

UMAP logo


UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

  1. The data is uniformly distributed on a Riemannian manifold;
  2. The Riemannian metric is locally constant (or can be approximated as such);
  3. The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

The important thing is that you don't need to worry about the details: you can use UMAP right now for dimension reduction and visualisation, as a drop-in replacement for scikit-learn's t-SNE.

Documentation is available via Read the Docs.

New: this package now also provides support for densMAP. The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure of the data. Details of this method are described in the following paper:

Narayan, A, Berger, B, Cho, H, Assessing Single-Cell Transcriptomic Variability through Density-Preserving Data Visualization, Nature Biotechnology, 2021

Installing

UMAP depends upon scikit-learn, and thus scikit-learn's dependencies such as numpy and scipy. UMAP adds a requirement for numba for performance reasons. The original version used Cython, but the improved code clarity, simplicity and performance of Numba made the transition necessary.

Requirements:

  • Python 3.6 or greater
  • numpy
  • scipy
  • scikit-learn
  • numba
  • tqdm
  • pynndescent

Recommended packages:

  • For plotting
    • matplotlib
    • datashader
    • holoviews
  • For Parametric UMAP
    • tensorflow > 2.0.0

Install Options

Conda install, via the excellent work of the conda-forge team:
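conda install -c conda-forge umap-learn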

The conda-forge packages are available for Linux, OS X, and Windows 64 bit.

PyPI install, presuming you have numba and sklearn and all its requirements (numpy and scipy) installed:
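pip install umap-learn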

If you wish to use the plotting functionality you can use
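pip install umap-learn[plot]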

to install all the plotting dependencies.

If you wish to use Parametric UMAP, you need to install Tensorflow, which can be installed either using the instructions at https://www.tensorflow.org/install (recommended) or using
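pip install umap-learn[parametric_umap]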

for a CPU-only version of Tensorflow.

If you're on an x86 processor, you can also optionally install tbb, which will provide additional CPU optimizations:
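pip install tbb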

If pip is having difficulties pulling the dependencies then we'd suggest installing the dependencies manually using anaconda followed by pulling umap from pip:
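conda install numpy scipy
conda install scikit-learn numba
pip install umap-learn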

For a manual install get this package:
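# e.g. clone the repository
git clone https://github.com/lmcinnes/umap
cd umap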

Optionally, install the requirements through Conda:
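conda install scikit-learn numba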

Then install the package
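pip install .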

How to use UMAP

The umap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.
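A minimal example, on scikit-learn's digits dataset:

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)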

There are a number of parameters that can be set for the UMAP class; the major ones are as follows:

  • n_neighbors: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.
  • min_dist: This controls how tightly the embedding is allowed to compress points together. Larger values ensure embedded points are more evenly distributed, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.
  • metric: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already coded, and a user-defined function can be passed as long as it has been JIT-compiled by numba.

An example of making use of these options:
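import umap
from sklearn.datasets import load_digits

digits = load_digits()

# fewer neighbors and correlation distance for a more locally focused embedding
embedding = umap.UMAP(n_neighbors=5,
                      min_dist=0.3,
                      metric='correlation').fit_transform(digits.data)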

UMAP also supports fitting to sparse matrix data. For more details please see the UMAP documentation.

Benefits of UMAP

UMAP has a few significant wins in its current incarnation.

First of all UMAP is fast. It can handle large datasets and high dimensional data without too much difficulty, scaling beyond what most t-SNE packages can manage. This includes very high dimensional sparse datasets. UMAP has successfully been used directly on data with over a million dimensions.

Second, UMAP scales well in embedding dimension—it isn't just for visualisation! You can use UMAP as a general purpose dimension reduction technique as a preliminary step to other machine learning tasks. With a little care it partners well with the hdbscan clustering library (for more details please see Using UMAP for Clustering).

Third, UMAP often performs better at preserving some aspects of global structure of the data than most implementations of t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.

Fourth, UMAP supports a wide variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally embed word vectors properly using cosine distance!

Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.
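For example (a minimal sketch; the random arrays stand in for real train and test data):

import numpy as np
import umap

rng = np.random.RandomState(42)
X_train = rng.rand(500, 20)
X_test = rng.rand(100, 20)

mapper = umap.UMAP().fit(X_train)           # learn the embedding on the training data
test_embedding = mapper.transform(X_test)   # map unseen points into the same space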

Sixth, UMAP supports supervised and semi-supervised dimension reduction. This means that if you have label information that you wish to use as extra information for dimension reduction (even if it is just partial labelling) you can do that—as simply as providing it as the y parameter in the fit method.
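For example (a sketch; labels of -1 mark points treated as unlabelled in the semi-supervised setting):

import umap
from sklearn.datasets import load_digits

digits = load_digits()
labels = digits.target.copy()
labels[::2] = -1  # pretend half the labels are unknown

embedding = umap.UMAP().fit_transform(digits.data, y=labels)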

Seventh, UMAP supports a variety of additional experimental features, including: an "inverse transform" that can approximate a high dimensional sample that would map to a given position in the embedding space; the ability to embed into non-Euclidean spaces, including hyperbolic embeddings and embeddings with uncertainty; and very preliminary support for embedding dataframes.

Finally, UMAP has solid theoretical foundations in manifold learning (see our paper on ArXiv). This both justifies the approach and allows for further extensions that will soon be added to the library.

Performance and Examples

UMAP is very efficient at embedding large high dimensional datasets. In particular it scales well with both input dimension and embedding dimension. For the best possible performance we recommend installing the nearest neighbor computation library pynndescent. UMAP will work without it, but if installed it will run faster, particularly on multicore machines.

For a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, UMAP can complete the embedding in under a minute (as compared with around 45 minutes for scikit-learn's t-SNE implementation). Despite this runtime efficiency, UMAP still produces high quality embeddings.

The obligatory MNIST digits dataset, embedded in 42 seconds (with pynndescent installed and after numba jit warmup) using a 3.1 GHz Intel Core i7 processor (n_neighbors=10, min_dist=0.001):

UMAP embedding of MNIST digits

The MNIST digits dataset is fairly straightforward, however. A better test is the more recent "Fashion MNIST" dataset of images of fashion items (again 70000 data samples in 784 dimensions). UMAP produced this embedding in 49 seconds (n_neighbors=5, min_dist=0.1):

UMAP embedding of "Fashion MNIST"

The UCI shuttle dataset (43500 samples in 8 dimensions) embeds well under correlation distance in 44 seconds (note the longer time required for correlation distance computations):

UMAP embedding the UCI Shuttle dataset

The following is a densMAP visualization of the MNIST digits dataset with 784 features based on the same parameters as above (n_neighbors=10, min_dist=0.001). densMAP reveals that the cluster corresponding to digit 1 is noticeably denser, suggesting that there are fewer degrees of freedom in the images of 1 compared to other digits.

densMAP embedding of the MNIST dataset

Plotting

UMAP includes a subpackage umap.plot for plotting the results of UMAP embeddings. This package needs to be imported separately since it has extra requirements (matplotlib, datashader and holoviews). It allows for fast and simple plotting and attempts to make sensible decisions to avoid overplotting and other pitfalls. An example of use:
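import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

mapper = umap.UMAP().fit(digits.data)
umap.plot.points(mapper, labels=digits.target)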

The plotting package offers basic plots, as well as interactive plots with hover tools and various diagnostic plotting options. See the documentation for more details.

Parametric UMAP

Parametric UMAP provides support for training a neural network to learn a UMAP based transformation of data. This can be used to support faster inference of new unseen data, more robust inverse transforms, autoencoder versions of UMAP and semi-supervised classification (particularly for data well separated by UMAP and very limited amounts of labelled data). See the documentation of Parametric UMAP or the example notebooks for more.

densMAP

The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure captured by UMAP. One can easily run densMAP using the umap package by setting the densmap input flag:
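embedding = umap.UMAP(densmap=True).fit_transform(data)

(where data is your input array)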

This functionality is built upon the densMAP implementation provided by the developers of densMAP, who also contributed to integrating densMAP into the umap package.

densMAP inherits all of the parameters of UMAP. The following is a list of additional parameters that can be set for densMAP:

  • dens_frac: This determines the fraction of epochs (a value between 0 and 1) that will include the density-preservation term in the optimization objective. This parameter is set to 0.3 by default. Note that densMAP switches density optimization on after an initial phase of optimizing the embedding using UMAP.
  • dens_lambda: This determines the weight of the density-preservation objective. Higher values prioritize density preservation, and lower values (closer to zero) prioritize the UMAP objective. Setting this parameter to zero reduces the algorithm to UMAP. Default value is 2.0.
  • dens_var_shift: Regularization term added to the variance of local densities in the embedding for numerical stability. We recommend setting this parameter to 0.1, which consistently works well in many settings.
  • output_dens: When this flag is True, the call to fit_transform returns, in addition to the embedding, the local radii (inverse measure of local density defined in the densMAP paper) for the original dataset and for the embedding. The output is a tuple (embedding, radii_original, radii_embedding). Note that the radii are log-transformed. If False, only the embedding is returned. This flag can also be used with UMAP to explore the local densities of UMAP embeddings. By default this flag is False.

For densMAP we recommend larger values of n_neighbors (e.g. 30) for reliable estimation of local density.

An example of making use of these options (based on a subsample of the mnist_784 dataset):
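A sketch of such a run (the fetch_openml call and the subsample size are illustrative):

import numpy as np
import umap
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)
data = np.asarray(mnist.data, dtype=np.float32)
subset = np.random.choice(data.shape[0], 10000, replace=False)

# with output_dens=True, fit_transform returns (embedding, radii_original, radii_embedding)
embedding, r_orig, r_emb = umap.UMAP(
    densmap=True,
    dens_lambda=2.0,
    n_neighbors=30,
    output_dens=True,
).fit_transform(data[subset])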

See the documentation for more details.

Help and Support

Documentation is at Read the Docs. The documentation includes a FAQ that may answer your questions. If you still have questions then please open an issue and I will try to provide any help and guidance that I can.

Citation

If you make use of this software for your work we would appreciate it if you would cite the paper from the Journal of Open Source Software:
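McInnes, L, Healy, J, Melville, J, Großberger, L, UMAP: Uniform Manifold Approximation and Projection, Journal of Open Source Software, 3(29), 861, 2018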

If you would like to cite this algorithm in your work the ArXiv paper is the current reference:
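McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018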

Additionally, if you use the densMAP algorithm in your work please cite the following reference:
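Narayan, A, Berger, B, Cho, H, Assessing Single-Cell Transcriptomic Variability through Density-Preserving Data Visualization, Nature Biotechnology, 2021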

If you use the Parametric UMAP algorithm in your work please cite the following reference:
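Sainburg, T, McInnes, L, Gentner, TQ, Parametric UMAP Embeddings for Representation and Semisupervised Learning, Neural Computation, 2021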

License

The umap package is 3-clause BSD licensed.

We would like to note that the umap package makes heavy use of NumFOCUS sponsored projects, and would not be possible without their support of those projects, so please consider contributing to NumFOCUS.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

umap's People

Contributors

adalmia96, ajtritt, bkmgit, chrismbryant, fchollet, gclen, gclendenning, gregdemand, hamelin, hhcho, hndgzkn, jc-healy, jlmelville, josephcourtney, leriomaggio, lmcinnes, markfraney, matthieuheitz, mithaler, parashardhapola, paxtonfitzpatrick, pujaltes, rocketknight1, sg-s, sleighsoft, thomasnickerson, timsainb, tomwhite, usul83, vicramr


umap's Issues

Multi CPU / GPU capabilities?

@lmcinnes
As you may have guessed I have several CPUs and GPUs at hand and I work with high-dimensional data.
Now I am benchmarking a 500k x 5k => 500k x 2 reduction vs. PCA (I need a high-level clustering to filter my data before feeding it further into the pipeline).

So a couple of questions:

  1. Any plans on multi-CPU / GPU support?
  2. Does your implementation utilize vectorized operations (not really sure how embedding methods like T-SNE and UMAP work, I believe they minimize some kind of distance in high dimension space?) If so, can I help?
  3. Did you run benchmarks (like HDBSCAN) on large and huge datasets? If so, then is it feasible to expect 500k * 5k => 500k * 2 to finish in reasonable time, or should I do PCA => UMAP?

Evaluating dimensionality reduction?

Hello Leland,

Thank you for sharing this new algorithm.
I have a question regarding evaluation measures of dimensionality reduction methods. I'm aware of trustworthiness and continuity, but I'm looking for measures that can handle large datasets.

I found the paper "Scale-independent quality criteria for dimensionality reduction" which is an alternative quality measure, but it is still for small datasets.

How are you evaluating umap against other approaches at the moment?

umap non determinism - intended?

Was testing it out and noticed that setting the random seed doesn't stop the embedding from changing upon different runs.

Is non-determinism part of the design (like t-SNE)? Is there a way to replicate prior results?

hdbscan on UMAP subspace

As the doc says, "With a little care (documentation on how to be careful is coming) it partners well with the hdbscan clustering library", I wonder whether there are any updates about the "little care", or quick answers on how to use hdbscan to perform clustering on the UMAP subspace?
Thanks in advance !

Bad argument for scipy.sparse.coo_matrix

I want to use umap.UMAP().fit_transform(X) but I got an error:
ValueError: negative column index found
from the scipy function scipy.sparse.coo_matrix.

When I investigated, I found that in umap_.py, in the function fuzzy_simplicial_set(), the variable tmp_indices contains values < 0 (-1), but scipy.sparse.coo_matrix needs values >= 0!

Thanks for answering me.

Add progress log

It would be nice to report some intermediate calculation progress as log info messages. For large data the process can be relatively long and progress info will help.

get a RecursionError

working_umap = umap.UMAP(n_neighbors=5,
                         n_components=2,
                         min_dist=0.3,
                         metric='euclidean')

input_feat = df.as_matrix([df.columns[1:1001]])
embeddings = working_umap.fit_transform(input_feat)

---------------------------error message ------------------------------

make_tree(data, indices, leaf_size)
105 rng_state)
106 left_node = make_tree(data, left_indices, leaf_size)
--> 107 right_node = make_tree(data, right_indices, leaf_size)
108 node = RandomProjectionTreeNode(indices, False, left_node, right_node)
109 else:

RecursionError: maximum recursion depth exceeded

Transform and Unsupervised Data

Hello,

maybe I'm missing it, but is there a 'transform' function, i.e. after you have trained the UMAP instance with data, can you apply the same instance to an unseen point?
If not, why? And is it foreseen?
Thank you!

Sparse matrix support

In principle a distance function could take sparse vectors and thus allow UMAP to take sparse matrices as input. This would allow for much higher dimensional data (NLP related data for example) to be handled by UMAP.

Recursion Error (Different from Previous Post)

EDIT: Is this just a version issue? RecursionError was RuntimeError before python 3.5.

I'm working with a large data set (~1,700,000 x ~400), and I'm getting the following error:

Traceback (most recent call last):
  File "umapping.py", line 25, in <module>
    u = umap.UMAP(metric="correlation").fit_transform(data)
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 1573, in fit_transform
    self.fit(X)
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 1534, in fit
    self.verbose
  File "/users/nicolerg/anaconda2/lib/python2.7/site-packages/umap/umap_.py", line 559, in rptree_leaf_array
    except RecursionError:
NameError: global name 'RecursionError' is not defined

This is the relevant bit of code:

data = ps.read_csv(args.input, compression="gzip", header=1, sep=',')
nrows = len(data)
colors = np.random.rand(nrows, 3)  # RGB colors
u = umap.UMAP(metric="correlation", n_neighbors=25).fit_transform(data)

I increased n_neighbors from the default of 15 to 25 to see if that would help, but I got the same error. I do not expect that I have equivalent rows in my data. I am trying to cluster the ~1,700,000 instances in ~400 dimensions. Any suggestions?

Custom losses, coherent embeddings

A nice property of t-SNE that is not exploited in most implementations is that it can be treated as a combination of two orthogonal components: a loss function and an optimization algorithm. For example, one may visualize a set of temporally varying vectors as a sequence of coherent embeddings by adding a loss term that penalizes unnecessary movement of each vector between those embeddings. Is it possible to provide such flexibility, i.e. additional constraints, with UMAP?

Precomputed distances

Would you be interested in / see any obstacles to implementing UMAP on a distance matrix? After a first glance this seems to be quite straightforward to include.
I'd be inclined to contribute.

does not support option: 'parallel'

Hi, thanks for the exciting work! I am playing with your algorithm and I got the following error message, when I was running your demo with digits.data. Do you have a sense of what is going on here?


KeyError Traceback (most recent call last)
/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in from_dict(self, dic)
17 try:
---> 18 ctor = self.OPTIONS[k]
19 except KeyError:

KeyError: 'parallel'

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in ()
6 embedding = umap.UMAP(n_neighbors=5,
7 min_dist=0.3,
----> 8 metric='correlation').fit_transform(digits.data)

/Users/Qihong/Dropbox/github/umap/umap/umap_.py in fit_transform(self, X, y)
790 Embedding of the training data in low-dimensional space.
791 """
--> 792 self.fit(X)
793 return self.embedding_

/Users/Qihong/Dropbox/github/umap/umap/umap_.py in fit(self, X, y)
757
758 graph = fuzzy_simplicial_set(X, self.n_neighbors,
--> 759 self._metric, self.metric_kwds)
760
761 if self.n_edge_samples is None:

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
305 argtypes.append(self.typeof_pyval(a))
306 try:
--> 307 return self.compile(tuple(argtypes))
308 except errors.TypingError as e:
309 # Intercept typing error that may be due to an argument

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in compile(self, sig)
577
578 self._cache_misses[sig] += 1
--> 579 cres = self._compiler.compile(args, return_type)
580 self.add_overload(cres)
581 self._cache.save_overload(sig, cres)

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/dispatcher.py in compile(self, args, return_type)
70 def compile(self, args, return_type):
71 flags = compiler.Flags()
---> 72 self.targetdescr.options.parse_as_flags(flags, self.targetoptions)
73 flags = self._customize_flags(flags)
74

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in parse_as_flags(cls, flags, options)
26 def parse_as_flags(cls, flags, options):
27 opt = cls()
---> 28 opt.from_dict(options)
29 opt.set_flags(flags)
30 return flags

/Users/Qihong/anaconda/envs/brainiak/lib/python3.6/site-packages/numba/targets/options.py in from_dict(self, dic)
19 except KeyError:
20 fmt = "%r does not support option: '%s'"
---> 21 raise KeyError(fmt % (self.__class__, k))
22 else:
23 self.values[k] = ctor(v)

KeyError: "<class 'numba.targets.cpu.CPUTargetOptions'> does not support option: 'parallel'"

UMAP as a dimensionality reduction (umap.transform())

Hey hi @lmcinnes

First of all, thanks for this method. It's working so well!

So I have a general question about using UMAP as a dimensionality reduction step in a prediction pipeline. We have a classification model where using UMAP as a first dimensionality reduction step seems to give really good results. It fixes a lot of regularization issues we have with this specific model. Now I guess my question is more related to manifold training in general, but I usually fit the dim reduction model first on the train data and then use the same model for inference/prediction in order to have a consistent lower-dimensional projection.

Now obviously, like t-SNE, the manifold itself is learned from the data, so it's hard to "transform" new incoming data; that's why there is no umap.transform() method, I guess. There was a closely related discussion on sklearn at some point about a possible parametric t-SNE that would make this projection easier (scikit-learn/scikit-learn#5361), but it looks like it's a non-trivial task in t-SNE. Anyway, long story short, since it's mentioned in the documentation that UMAP can be used as a "reduction technique as a preliminary step to other machine learning tasks", I was wondering what a prediction pipeline using UMAP would look like?

The method I found so far is to reduce the dimensionality of the training AND test data at the same time in a single umap.fit_transform(), then train the model on the reduced train data and predict with the reduced test data. It works well in a test scenario, but obviously in a real-world environment it means that we would have to perform the dim reduction of the incoming data alongside the entire training dataset every time.

Is there a more elegant way of doing this ?

Martin

Problem with `n_components > 2`

Hi Leland,

Thank you for all the hard work you've put in UMAP. I'm very fond of it.

I'm using UMAP for dimension reduction. I was actually wondering what happens when you stack multiple instances of UMAP with different component counts. In the process I encountered the following error, produced by the below code.

A detail worth pointing out is that I'm also having problems with the scipy.sparse.csgraph import, and I believe this is related.

import umap
import scipy.sparse.csgraph
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(
    n_components=20,
    n_neighbors=5,
    min_dist=0.3,
    metric='correlation'
).fit_transform(digits.data)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-35aa497c09c5> in <module>()
     10     min_dist=0.3,
     11     metric='correlation'
---> 12 ).fit_transform(digits.data)

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)
   1473             Embedding of the training data in low-dimensional space.
   1474         """
-> 1475         self.fit(X)
   1476         return self.embedding_

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)
   1453             self.init,
   1454             random_state,
-> 1455             self.verbose
   1456         )
   1457 

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, verbose)
   1115     elif isinstance(init, str) and init == 'spectral':
   1116         # We add a little noise to avoid local minima for optimization to come
-> 1117         initialisation = spectral_layout(graph, n_components, random_state)
   1118         expansion = 10.0 / initialisation.max()
   1119         embedding = (initialisation * expansion) + \

~/virtualenvs/work/lib/python3.6/site-packages/umap/umap_.py in spectral_layout(graph, dim, random_state)
    881             init = random_state.uniform(low=-10.0, high=10.0,
    882                                         size=(n_samples, 2))
--> 883             init[labels == largest_component] = eigenvectors[:, order]
    884             return init
    885     except scipy.sparse.linalg.ArpackError:

ValueError: shape mismatch: value array of shape (1770,20) could not be broadcast to indexing result of shape (1770,2)

Relevant version numbers:

python==3.6.1
scipy==1.0.0
numpy==1.14.0
numba==0.36.2
scikit-learn==0.19.1

Will continue to look into it in the meantime.

UMAP Roadmap

A rough roadmap of things to be done for UMAP. Some of these tasks are easy, some are hard, and some require deeper knowledge of UMAP. Short and medium term tasks should be approachable for many people. Reply to this issue if you are interested in taking up any of them.

Short term items

  • Support for sparse matrix input
  • Add random seed as a user option
  • Support for cosine distance RP-trees
  • Allow non-RP-tree initialisation of NN-descent
  • Better document (via docstrings) all the support functions
  • "Custom" initialisation with a predefined positioning.

Medium term items

  • Generate notebook for basic usage demonstration
  • Generate notebook explaining parameter options and their effects
  • Set up CI and build a basic test suite
  • Start building basic documentation and integrate with readthedocs

Longer term items

  • Generate notebook for "How UMAP works"
  • Add code (and devise API(?)) for UMAP on general pandas dataframes
  • Add support for semi-supervised dimension reduction via UMAP
  • UMAP as a generative model (code + demo)
  • UMAP for text data (similar to word2vec)
  • A transform function for new previously unseen data (see issue #40)
  • Model persistence for UMAP models

No priority

  • GPU support for UMAP
  • Conda-forge UMAP package
  • Improve numba usage (better numba expertise required)
  • Concurrency via Dask for multicore and distributed support

Converging to a single point

I'm using UMAP to embed a bunch of 128 dimensional face embeddings generated by a neural net.

As I increase the number of embeddings (I have 3M total) the output from UMAP converges to a single point in the center surrounded by a sparse cloud around it. How can I fix this? Here are some examples from fewer samples to more samples. n = 73728, 114688, 172032, 196608, 245760


ZeroDivisionError with sparse input and metric='jaccard'

Example to reproduce error:

import numpy as np
from sklearn import manifold
import umap

X = np.random.choice([0, 1], size=(1000, 50), p=[90./100, 10./100])

tsne = manifold.TSNE(metric='jaccard')
y_tsne = tsne.fit_transform(X)

um = umap.UMAP(metric='jaccard')
y_umap = um.fit_transform(X)

p=[85./100, 15./100] works.

Cosine distance RP-Trees

UMAP currently uses RP-trees to initialise the NN-descent algorithm. The current version of this uses Euclidean distance RP-trees. In principle cosine distance RP-trees are simple to implement and would be more useful for cosine and correlation distance metrics. Allowing the option would be beneficial.

This requires both implementing the trees (simply write a new splitting function), and threading the option through the code to be able to present it to the user at class instantiation time.

Support for input neighbor sets

This would allow mutual nearest neighbors, or other approaches to nearest neighbors to be used, providing greater flexibility.

Preprocessing of the features

Hello Leland,

Congrats for the work and thanks for the code and examples. Looking forward for the paper!

If we'd like to use UMAP with the features output by a CNN, for example on the MNIST dataset, do the features need to be zero-centered? Or in some range?

I get the following error when running fit_transform(data): "ZeroDivisionError: division by zero".

Thanks again,
Amelia.

AttributeError: module 'scipy.sparse' has no attribute 'csgraph'

Hello,

Thank you for the great contribution.

I can't seem to get it running. Any help is appreciated.

Here are my versions:

Requirement already satisfied: umap-learn in ./anaconda3/lib/python3.6/site-packages
Requirement already satisfied: numba>=0.34 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: scipy>=0.19 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: scikit-learn>=0.16 in ./anaconda3/lib/python3.6/site-packages (from umap-learn)
Requirement already satisfied: llvmlite in ./anaconda3/lib/python3.6/site-packages (from numba>=0.34->umap-learn)
Requirement already satisfied: numpy in ./anaconda3/lib/python3.6/site-packages (from numba>=0.34->umap-learn)

Running the example,

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

outputs:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-e5c7a5ee7150> in <module>()
      4 digits = load_digits()
      5 
----> 6 embedding = umap.UMAP().fit_transform(digits.data)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, verbose)

~/anaconda3/lib/python3.6/site-packages/umap/umap_.py in spectral_layout(graph, dim, random_state)

AttributeError: module 'scipy.sparse' has no attribute 'csgraph'

But I can import csgraph without problems from scipy:
from scipy.sparse import csgraph

RuntimeWarning: divide by zero encountered in power

Hi there, first of all, thanks loads for this exciting algorithm. I'm writing a blog post on comparing this to a couple of other dim reduction techniques. I noticed when I'm using umap on a dataset with entries such as:

array([  0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
        -1.88847315,  11.17503262,  -0.69157058,   5.85993528,
         0.98581624,  -1.14453554,   0.61075902,  -3.21815372,
         4.9411006 ,   5.51712704,  -1.7895503 ,   2.04580665,
         0.22949766,  -6.60904551,   8.11811924,   1.88291252,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ])

I get the following runtime warning

/usr/local/lib/python3.5/dist-packages/umap/umap_.py:592: RuntimeWarning: divide by zero encountered in power
  return 1.0 / (1.0 + a * x ** (2 * b))
/usr/local/lib/python3.5/dist-packages/scipy/optimize/minpack.py:779: OptimizeWarning: Covariance of the parameters could not be estimated
  category=OptimizeWarning)

And less than satisfactory results (in plots over many neighbour and distance settings).
Any thoughts?

Access to high-dimensional fuzzy simplicial set

I find for some applications (e.g. clustering) it is good to have access to the high-dimensional fuzzy simplicial set through the class UMAP. This can be easily implemented by storing graph as self.graph in the method UMAP.fit(). If you think this is a useful feature but are too busy with more urgent things, I will be happy to implement it through a pull request. Please let me know.

Sudden outlier

I've quickly tested Multicore-TSNE and umap on the telecom churn dataset.
Here is the notebook. The dataset is available in the very same repository.

Looks like umap suffered from a sudden outlier which hadn't affected t-SNE.

I haven't played with umap hyperparameters but maybe the example will be useful.

Install problems

I had some trouble getting umap installed on my system. I think the problem was with getting numba to work properly without using Anaconda.

Traceback (most recent call last):
  File "visualize.py", line 2, in <module>
    import umap
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/umap/__init__.py", line 1, in <module>
    from .umap_ import UMAP
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/umap/umap_.py", line 7, in <module>
    import numba
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/__init__.py", line 12, in <module>
    from .special import typeof, prange
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/special.py", line 3, in <module>
    from .typing.typeof import typeof
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typing/__init__.py", line 2, in <module>
    from .context import BaseContext, Context
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typing/context.py", line 10, in <module>
    from numba.typeconv import Conversion, rules
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typeconv/rules.py", line 3, in <module>
    from .typeconv import TypeManager, TypeCastingRules
  File "/home/sauln/research/graphs/venv/lib/python3.6/site-packages/numba/typeconv/typeconv.py", line 3, in <module>
    from . import _typeconv, castgraph, Conversion
ImportError: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

I was able to get everything to work without using Conda by installing the Python dev package:

sudo apt-get install libpython3.6-dev

Though my issue is already solved, other people might run into the same problems. It might be helpful to incorporate this information into the docs somewhere, or just close the issue and direct people here if it comes up again.

JavaScript implementation?

Are you aware of any JavaScript implementations?

Most probably there are none, so please ping if you'd be interested as well. There's already a powerful ML JavaScript toolkit, https://github.com/mljs/ml, so I'd love to have UMAP included there.

Performance regression in 0.2

I was pip updating from 0.1.3 to 0.2. Two sample workloads of us took a significant hit in performance: Reducing 480x13500 to 80x13500 ran 2:24 instead of 1:14 and reducing 480x6700 to 80x6700 took 1:49 instead of 0:28.

Alongside updating umap-learn, other libraries got a bump (llvmlite 0.2 to 0.21, numba 0.35.0 to 0.36.2). Neither of those affected running times. After downgrading to 0.1.3, I got the former numbers.

I saw that this commit disabled jitting for fuzzy_simplicial_set. Could this or anything else cause this regression?

Custom embedding initialization

Looking at the embedding initialization options, I see 'random' and 'spectral'. Would it be possible to initialize with a custom embedding? And if so, would this embedding be at all preserved?

In trying to compare the effect of different parameter changes, it could be helpful to use the embedding of a previous run as the initialization to a new UMAP instance with slightly different parameters. For example, these two min_dist values result in embeddings with different global orientations but similar local relationships.


Attribute Error

Hi there, these are my system specs:
macOS Sierra 10.12.3 (16D32)

I have installed umap through pip. When I try to run it, this is the error message that comes up. I'm unsure what the problem is, any ideas?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-68ef34dfa695> in <module>()
     16         umap_mfccs = get_scaled_umap_embeddings(mfcc_features,
     17                                                 neighbours,
---> 18                                                 distances)
     19         umap_embeddings_mfccs.append(umap_mfccs)
     20 

<ipython-input-10-68ef34dfa695> in get_scaled_umap_embeddings(features, neighbour, distance)
      1 def get_scaled_umap_embeddings(features, neighbour, distance):
      2 
----> 3     embedding = umap.UMAP(n_neighbors=neighbour,
      4                           min_dist = distance,
      5                           metric = 'correlation').fit_transform(features)

AttributeError: module 'umap' has no attribute 'UMAP'

Does not work well on trained doc2vec model

I trained a doc2vec model on the large movie review dataset and then tried to use UMAP to reduce the dimensions of the resulting document vectors. I had hoped that it would be possible to separate the documents by sentiment (positive and negative), but unfortunately the embedding is one big blob. A notebook can be seen here and the rest of the files for training the doc2vec model are in that repository as well.

Digits example

569 self._raise_no_convergence()
570 else:
--> 571 raise ArpackError(self.info, infodict=self.iterate_infodict)
572
573 def extract(self, return_eigenvectors):

ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.

`umap_utils` missing?

Looks like umap_utils is missing?

ImportError                               Traceback (most recent call last)
<ipython-input-2-c9a53c6b2768> in <module>()
----> 1 import umap

/Users/max/Dropbox/Dokument/Python/voteringar/umap/__init__.py in <module>()
----> 1 from .umap_ import UMAP

/Users/max/Dropbox/Dokument/Python/voteringar/umap/umap_.py in <module>()
----> 1 from .umap_utils import fuzzy_simplicial_set, simplicial_set_embedding
      2 from scipy.optimize import curve_fit
      3 from sklearn.base import BaseEstimator
      4 
      5 import numpy as np

ImportError: No module named umap_utils

ZeroDivisionError when dataset is 4096 elements or more

Hi, I've narrowed in on a ZeroDivisionError that happens 100% of the time when my dataset has >= 4096 elements (2^12), and never when below 4096.
Varying the min_dist, bandwidth, or n_neighbors parameters doesn't avoid it.
I tried with 3 datasets; the two that were 300 dims failed this way, but one with 600 dims had no error.

The trace:

File "C:/W/py_tests/dim_red/umap_reduct.py", line 9, in <module>
  embedding = umap.UMAP(n_neighbors=6, min_dist=0.002, bandwidth=0.6, metric='cosine').fit_transform(df)
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 1476, in fit_transform
  self.fit(X)
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 1434, in fit
  self.verbose
File "C:\W\py_tests\venv\lib\site-packages\umap\umap_.py", line 761, in fuzzy_simplicial_set
  verbose=verbose)
ZeroDivisionError: division by zero

This is version 0.2.1 on Windows 10.

Thanks

n_neighbors of point i includes i

In fuzzy_simplicial_set, in the small data case where X is the full distance matrix, the following code gets run:

    if metric == 'precomputed':
        # Note that this does not support sparse distance matrices yet ...
        # Compute indices of n nearest neighbors
        knn_indices = np.argsort(X)[:, :n_neighbors]
        # Compute the nearest neighbor distances
        #   (equivalent to np.sort(X)[:,:n_neighbors])
        knn_dists = X[np.arange(X.shape[0])[:, None], knn_indices].copy()

I believe that knn_indices and knn_dists contain the point itself and not just the k neighbors, i.e. it is always true that knn_indices[i][0] == i and knn_dists[i][0] == 0.

This leads to an error in simplicial_set_embedding if umap.UMAP is run with n_neighbors = 1 because graph just consists of zeros, with the following error (in PyCharm on Windows at any rate):

  File "C:\dev\python\umap\umap\umap_.py", line 1152, in simplicial_set_embedding
    graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
  File "C:\dev\python\py36\lib\site-packages\numpy\core\_methods.py", line 26, in _amax
    return umr_maximum(a, axis, None, out, keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

An additional effect is that you can never use all neighbor distances in fuzzy_simplicial_set, because n_neighbors must be smaller than the dataset size. Admittedly this isn't a very useful thing to do practically, but it seems like you ought to be able to do it.

So I think this is a bug. I haven't tried with large data and the metric_nn_descent code block.

Intermittent ZeroDivisionError: division by zero

This is a fantastic library, thanks very much for your great work. Periodically though, I'm getting a ZeroDivisionError: division by zero while building a UMAP projection. My data doesn't change, nor does the way I call the UMAP constructor:

model = umap.UMAP(n_neighbors=25, min_dist=0.00001, metric='correlation')
fit_model = model.fit_transform( np.array(image_vectors) )

Once in a while (maybe 5% of runs) this throws the following trace (umap version 0.1.5):

File "imageplot.py", line 278, in <module>
    Imageplot(image_dir=sys.argv[1], output_dir='output')
  File "imageplot.py", line 30, in __init__
    self.create_2d_projection()
  File "imageplot.py", line 148, in create_2d_projection
    model = self.build_model(image_vectors)
  File "imageplot.py", line 175, in build_model
    return model.fit_transform( np.array(image_vectors) )
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1402, in fit_transform
    self.fit(X)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1361, in fit
    self.verbose
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 385, in rptree_leaf_array
    angular=angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
    angular)
  File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 301, in make_tree
    rng_state)
ZeroDivisionError: division by zero

I took a quick look at the make_tree function but that didn't show much; the real problem seems to be swallowed in the stacktrace by the recursion. Do you have an idea what might cause this? I'll upgrade to the latest master and see if the problem continues.

Adding performance notebook

I created a small notebook for replication of UMAP performance when increasing dataset size and dimensionality.

I cannot create a pull request, though. Is this something on my end, or do I need permission for this?

How to proceed?

(Plots of performance with increasing dimensionality and increasing data size are in the linked notebook.)

[Question] What's the scaling complexity?

Looks like a great alternative to t-SNE! The readme mentions how fast it is, but I wonder what the complexity is in big-O terms, depending on the number of samples and dimensions? Impatiently waiting for the paper!

Weird results on dataset

Tried it on some Swedish parliament voting data. Did a notebook comparing it to t-SNE, which works fine, but umap just produces one big blob. Tried some different parameters without any luck, but honestly I have no clue what either of the parameters does :).

See the notebook for more info (if you run it you will have nice interactive plots, but I also added a static plot since the interactive ones are stripped from the gist):

https://gist.github.com/maxberggren/56efa53776f42755b83261c54081496e

[Question] Clustering on UMAP output

Hi,

when using tSNE, it is usually not recommended to perform clustering on the "reduced space" with algorithms such as k-means or DBSCAN (and HDBSCAN?) because the dimensionality reduction applied by tSNE doesn't keep properties like relative distance and density (see https://stats.stackexchange.com/questions/263539/k-means-clustering-on-the-output-of-t-sne).

Would it make sense to perform such clustering (with k-means, DBSCAN, HDBSCAN etc.) on the UMAP output?

Thank you very much.

Encountering numba current locale errors?

raise NotImplementedError("cannot convert native %s to Python object" % (typ,))
LoweringError: cannot convert native const('\tnn descent iteration ') to Python object
File "../../../../home/.local/lib/python2.7/site-packages/umap/umap_.py", line 663
[1] During: lowering "print($515.5)" at /home/.local/lib/python2.7/site-packages/umap/umap_.py (663)

Failed at nopython (nopython mode backend)
cannot convert native const('\tnn descent iteration ') to Python object
File "../../../../home/.local/lib/python2.7/site-packages/umap/umap_.py", line 663
[1] During: lowering "print($515.5)" at /home/.local/lib/python2.7/site-packages/umap/umap_.py (663)
Related to 1.12.3.3 here: http://numba.pydata.org/numba-doc/dev/user/faq.html

Doesn't Work!!

Thank you for providing the umap module!
I installed it via pip3 and tried the following example.

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

However, it doesn't work and error says that

 module 'scipy.sparse' has no attribute 'csgraph'

I reinstalled scipy but the error remains.
Could you look into it?

Mixed-type datasets

Thanks for sharing the great algorithm and library!

Wondering what would be the recommended way of feeding mixed-type data with some categorical features to UMAP? Binary encoding (possibly with appropriate distance metrics)?

RuntimeWarning: overflow

I get this warning on a dataset:

umap_.py:154: RuntimeWarning: overflow encountered in int_scalars
  self.init

Is it something that you can check without the dataset or do you need it?

ValueError: negative column index found

error.zip

A strange issue happens if you try to compute these simple 500 rows (file is attached).

The code:

import umap
import pandas as pd

df = pd.read_csv('error.csv', header=None)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      metric='correlation').fit_transform(df.values)

As a result, we get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-a14777825bbd> in <module>()
      1 embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
----> 2                       metric='correlation').fit_transform(df.values)

~/venv3/lib/python3.6/site-packages/umap_learn-0.1.3-py3.6.egg/umap/umap_.py in fit_transform(self, X, y)
    790             Embedding of the training data in low-dimensional space.
    791         """
--> 792         self.fit(X)
    793         return self.embedding_

~/venv3/lib/python3.6/site-packages/umap_learn-0.1.3-py3.6.egg/umap/umap_.py in fit(self, X, y)
    757 
    758         graph = fuzzy_simplicial_set(X, self.n_neighbors,
--> 759                                      self._metric, self.metric_kwds)
    760 
    761         if self.n_edge_samples is None:

~/venv3/lib/python3.6/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    189             self.data = self.data.astype(dtype, copy=False)
    190 
--> 191         self._check()
    192 
    193     def getnnz(self, axis=None):

~/venv3/lib/python3.6/site-packages/scipy/sparse/coo.py in _check(self)
    241                 raise ValueError('negative row index found')
    242             if self.col.min() < 0:
--> 243                 raise ValueError('negative column index found')
    244 
    245     def transpose(self, axes=None, copy=False):

ValueError: negative column index found

Any help is appreciated.
