yingfanwang / pacmap
PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
License: Apache License 2.0
I have an example of using the new transform() feature on iris at:
https://colab.research.google.com/drive/1T3ALLtbx8kw9NAoZzvJQSFgiWvHOvIvM?usp=sharing
It appears to only generate a constant when save_tree is set.
I had expected calling transform() on the iris data to be equivalent to the result of fit_transform()
Or perhaps I read the docstring wrong and there is a different calling convention?
This one works (100 data points):
import numpy as np
from pacmap import pacmap
X = np.random.rand(100, 50)
pacmap.PaCMAP().fit_transform(X)
This one fails (10 data points):
import numpy as np
from pacmap import pacmap
X = np.random.rand(10, 50)
pacmap.PaCMAP().fit_transform(X)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-dc1438c486bd> in <module>
3
4 X = np.random.rand(10, 50)
----> 5 pacmap.PaCMAP().fit_transform(X)
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in fit_transform(self, X, init, save_pairs)
502
503 def fit_transform(self, X, init="random", save_pairs=True):
--> 504 self.fit(X, init, save_pairs)
505 if self.intermediate:
506 return self.intermediate_states
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in fit(self, X, init, save_pairs)
463 )
464 if save_pairs:
--> 465 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
466 X,
467 self.n_dims,
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in pacmap(X, n_dims, n_neighbors, n_MN, n_FP, pair_neighbors, pair_MN, pair_FP, distance, lr, num_iters, Yinit, apply_pca, verbose, intermediate)
308 if verbose:
309 print(X)
--> 310 pair_neighbors, pair_MN, pair_FP = generate_pair(
311 X, n_neighbors, n_MN, n_FP, distance, verbose
312 )
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose)
234 for i in range(n):
235 nbrs_ = tree.get_nns_by_item(i, n_neighbors_extra+1)
--> 236 nbrs[i, :] = nbrs_[1:]
237 for j in range(n_neighbors_extra):
238 knn_distances[i, j] = tree.get_distance(i, nbrs[i, j])
ValueError: cannot copy sequence with size 9 to array axis with dimension 10
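A minimal sketch of what seems to go wrong, based on the traceback above (it assumes the default n_neighbors=10 and the n_neighbors_extra = min(n_neighbors + 50, n) logic shown there): with only n = 10 points, Annoy cannot return the n_neighbors_extra + 1 = 11 items requested, so nbrs_[1:] has 9 entries while the target row expects 10.
from annoy import AnnoyIndex
import numpy as np

n, dim = 10, 50
X = np.random.rand(n, dim).astype(np.float32)
tree = AnnoyIndex(dim, "euclidean")
for i in range(n):
    tree.add_item(i, X[i])
tree.build(20)

n_neighbors_extra = min(10 + 50, n)                     # = 10, capped at the dataset size
nbrs_ = tree.get_nns_by_item(0, n_neighbors_extra + 1)  # asks for 11 neighbours
print(len(nbrs_))                                       # 10 -> slice [1:] has 9, hence the error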
A trace from my current data set is below. I believe the fit would have been essentially identical if training had ended a hundred iterations early.
So it could be very practical to also support a stopping condition expressed as a minimum improvement (instead of just a fixed number of iterations), especially for use cases where the training function is called repeatedly for hyperparameter tuning. A sketch of the kind of criterion I mean follows after the trace.
Initial Loss: 221937.015625
Iteration: 10, Loss: 2350144.000000
Iteration: 20, Loss: 402674.156250
Iteration: 30, Loss: 240955.015625
Iteration: 40, Loss: 197510.812500
Iteration: 50, Loss: 153939.875000
Iteration: 60, Loss: 129958.703125
Iteration: 70, Loss: 117012.429688
Iteration: 80, Loss: 107633.703125
Iteration: 90, Loss: 98910.828125
Iteration: 100, Loss: 88535.101562
Iteration: 110, Loss: 160596.593750
Iteration: 120, Loss: 146496.093750
Iteration: 130, Loss: 138674.703125
Iteration: 140, Loss: 134762.906250
Iteration: 150, Loss: 132901.375000
Iteration: 160, Loss: 132175.421875
Iteration: 170, Loss: 132131.562500
Iteration: 180, Loss: 132340.203125
Iteration: 190, Loss: 132734.750000
Iteration: 200, Loss: 133220.187500
Iteration: 210, Loss: 60164.875000
Iteration: 220, Loss: 54855.210938
Iteration: 230, Loss: 53705.199219
Iteration: 240, Loss: 53232.484375
Iteration: 250, Loss: 53050.156250
Iteration: 260, Loss: 52986.171875
Iteration: 270, Loss: 52963.117188
Iteration: 280, Loss: 52954.531250
Iteration: 290, Loss: 52950.292969
Iteration: 300, Loss: 52948.816406
Iteration: 310, Loss: 52948.343750
Iteration: 320, Loss: 52947.710938
Iteration: 330, Loss: 52947.429688
Iteration: 340, Loss: 52947.203125
Iteration: 350, Loss: 52947.214844
Iteration: 360, Loss: 52947.093750
Iteration: 370, Loss: 52946.875000
Iteration: 380, Loss: 52946.687500
Iteration: 390, Loss: 52946.578125
Iteration: 400, Loss: 52946.523438
Iteration: 410, Loss: 52946.421875
Iteration: 420, Loss: 52946.429688
Iteration: 430, Loss: 52946.238281
Iteration: 440, Loss: 52946.078125
Iteration: 450, Loss: 52945.914062
Elapsed time: 447.48s
CPU times: user 10min 29s, sys: 17.6 s, total: 10min 46s
Wall time: 7min 46s
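A minimal sketch of such a stopping rule, applied to a recorded loss trace (the tolerance and patience values are illustrative, not recommendations; pacmap itself does not expose a callback, so this only demonstrates the criterion):
def should_stop(losses, rel_tol=1e-5, patience=3):
    # Stop once the relative improvement stays below rel_tol for `patience` consecutive checks.
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    rel_improvements = [
        (prev - cur) / max(abs(prev), 1e-12)
        for prev, cur in zip(recent[:-1], recent[1:])
    ]
    return all(r < rel_tol for r in rel_improvements)

# Applied to the tail of the trace above (iterations 420-450):
print(should_stop([52946.429688, 52946.238281, 52946.078125, 52945.914062]))  # True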
In the readme it says: init: the initialization of the lower dimensional embedding. One of "pca" or "random". Default to "pca".
Looking through the code, it seems init can also be a user-supplied matrix. If that is true, the readme should be corrected, and the code should be updated to check that the user-supplied matrix has an acceptable number of rows and columns, throwing a specific error if it does not.
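A hedged sketch of the kind of check meant here (check_user_init and its argument names are illustrative, not the package's actual internals):
import numpy as np

def check_user_init(init, n_samples, n_components):
    # Validate a user-supplied initialization matrix before using it as the embedding start.
    init = np.asarray(init, dtype=np.float32)
    if init.shape != (n_samples, n_components):
        raise ValueError(
            f"User-supplied init has shape {init.shape}, "
            f"expected ({n_samples}, {n_components})."
        )
    return init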
Try calling .transform() on the test set several times after fitting the dataset; the results start converging to the same value for each item in the test batch.
Hello, we are two users trying to install the library for the first time, but without success. Does anyone know how to solve the problem?
hi, do you have GPU implementation for PaCMAP?
Hi, I applied PaCMAP to pretrained knowledge graph embeddings made from the wikidata5m knowledge graph (https://graphvite.io/docs/latest/pretrained_model.html). These are quite chunky: looking up nearest neighbors just barely works within my swap limits, but it is very slow (so I know it works, and the quality of the embedding nearest neighbors is good). I tried RcppHNSW to build an index, but that did not work. So I thought I would try PaCMAP. The results were not usable. How can I learn more about using PaCMAP for this use case? How reliable are nearest neighbors with PaCMAP?
I use the same data (dim = 512), but I get a different result on each run. This is my code:
embedding_pacmap = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
X_transformed = embedding_pacmap.fit_transform(partxvec, init="pca")
At the same time I use UMAP with a random_state parameter set:
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(partxvec)
and I get the same result on each run.
So I want to know: does this algorithm have an equivalent configuration option to UMAP's random_state?
Hello, can you add this package to conda-forge or are there issues that are still being worked out?
If you need directions, you can look here: https://github.com/conda-forge/staged-recipes
Currently, it is not obvious how to apply a fitted PaCMAP model to a test dataset. Although a transform() call is available, it is not obvious how that is used, nor the applicable syntax, so some documentation on that would be appreciated.
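A hedged sketch of how a train/test workflow might look; the exact transform() signature can differ between versions (some expose a basis argument for the original training data), so treat this as illustrative rather than the documented API:
import numpy as np
import pacmap

X_train = np.random.rand(500, 20).astype(np.float32)
X_test = np.random.rand(50, 20).astype(np.float32)

reducer = pacmap.PaCMAP(n_components=2, n_neighbors=10, save_tree=True)
train_emb = reducer.fit_transform(X_train, init="pca")   # fit on the training set
test_emb = reducer.transform(X_test)                     # project the test set into the fitted space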
As for citing your brilliant work, it would really help if the paper was already published (not only as a preprint) - how far is the progress on that matter?:)
Hello,
When I try to install PaCMAP using the command pip install pacmap, I get the following error message:
ERROR: Failed building wheel for annoy
Do you have any solution for this?
I have a very large dataset (50 million rows by 768 features) I am trying to use, and a test case with 1M rows took about 35 minutes. Scaling to 50M implies it would take over 1 day to finish. This is on a machine with 160GB memory and 40 cores. Any suggestions on how to speed this up? I would prefer trying to fit the whole dataset vs fitting just a subset and applying that to the remaining data, but will do that if necessary.
There are a subset of rows that I think are linked - would specifying n_neighbors help with speeding things up?
First, I want to congratulate you on PaCMAP. I have been using it for some weeks now and it has been working really nicely with my data.
I wanted to ask: is there a way to use PaCMAP with user-input distance matrices? I found a blog post about an R wrapper for PaCMAP, where the metric keyword is provided as an argument indicating that the input matrix is a distance matrix. But I couldn't find how to do that in Python...
Thanks and best wishes
Joana
Currently working with umap, I can do the following:
reducer = umap.UMAP(n_neighbors=15,n_components=10)
then save the reducer using pickle
and later on, I can reuse it
reduce_data = reducer.transform(scaled_data)
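Spelled out, the UMAP workflow described above looks roughly like this (scaled_train_data and scaled_data are placeholders for the actual arrays):
import pickle
import numpy as np
import umap

scaled_train_data = np.random.rand(200, 50)   # placeholder for the real training data
scaled_data = np.random.rand(20, 50)          # placeholder for the new data

reducer = umap.UMAP(n_neighbors=15, n_components=10)
reducer.fit(scaled_train_data)

with open("reducer.pkl", "wb") as f:          # save the fitted reducer
    pickle.dump(reducer, f)

with open("reducer.pkl", "rb") as f:          # later, reload it
    reducer = pickle.load(f)
reduce_data = reducer.transform(scaled_data)  # apply it to new data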
PaCMAP does have fit and fit_transform methods, but I can't find a transform method, so it is of no use to me :-(
I can't apply an existing reducer to a new dataset.
Could you add it? I really want to benchmark it against UMAP (which gives me very good results).
Thanks
Hi
I was trying to run pacmap but I ran into the following error:
Python 3.7.0 (default, Oct 9 2018, 10:31:47)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
import pacmap
import numpy as np
import matplotlib.pyplot as plt
# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
Traceback (most recent call last):
File "", line 1, in
AttributeError: module 'pacmap' has no attribute 'PaCMAP'
Please advise.
Thank you.
Figures A.13, A.14, A.15, A.16, and A.17 include visualizations of the activations for all data from the ILSVRC 2012 dataset. Is that the training set or the validation set?
pipe = Pipeline([('reduce_dim', pacmap.PaCMAP()), ('kmeans', KMeans())])
AttributeError: 'PaCMAP' object has no attribute 'random_state'
It would be great if it did (i.e., if PaCMAP worked inside a sklearn Pipeline).
Dear authors, could you please provide the code to plot the rainbow figure of the four bad loss functions?
Hi there,
I am trying to run this test code:
import pacmap
print('pacmap version:', pacmap.__version__)
import numpy as np
print('numpy version:', np.__version__)
import numba as nb
print('numba version:', nb.__version__)
import matplotlib.pyplot as plt
# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)
# initializing the pacmap instance
# Setting n_neighbors to "None" leads to a default choice shown below in "parameter" section
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
# fit the data (The index of transformed data corresponds to the index of the original data)
X_transformed = embedding.fit_transform(X, init="pca")
# visualize the embedding
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], cmap="Spectral", c=y, s=0.6)
However, I get this error
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:96: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
j = np.random.randint(maximum)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:143: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
sampled = np.random.randint(0, n, 6)
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:166: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
sampled = np.random.randint(0, n, 6)
This seems to be attributed to numba, but I have tried with 0.57.0 (the one originally installed with conda) and the upgraded 0.58.0; both give the warning.
I am not sure how to fix this or which versions are needed to be compatible.
Thanks,
Morgan
I have a matrix X with 28 columns and a unit-normalized version uX of X, in which each row of X is divided by its L2 norm. I would expect the PaCMAP 3D embedding using angular distance on X to be the same as on uX, but they are not. Why?
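For reference, a hedged sketch of the setup described (X here is a random placeholder; the distance parameter name follows the package's documented options):
import numpy as np
import pacmap

X = np.random.rand(1000, 28).astype(np.float32)
uX = X / np.linalg.norm(X, axis=1, keepdims=True)   # each row scaled to unit L2 norm

emb_X = pacmap.PaCMAP(n_components=3, distance="angular").fit_transform(X)
emb_uX = pacmap.PaCMAP(n_components=3, distance="angular").fit_transform(uX)
# The question above is why these two embeddings differ.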
Hi, thanks for developing PaCMAP, lovely work!
I found that using transform after using fit_transform on the same set of features yields different results.
I ran the following example:
import pacmap
import numpy as np
np.random.seed(0)
init = "pca" # results can be reproduced also with "random"
reducer = pacmap.PaCMAP(
n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, save_tree=True
)
features = np.random.randn(100, 30)
reduced_features = reducer.fit_transform(features, init=init)
print(reduced_features[:10])
transformed_features = reducer.transform(features)
print(transformed_features[:10])
And returns
[[ 0.7728913 3.785831 ]
[-0.69379026 2.116452 ]
[-1.7770871 -0.97542125]
[ 2.5090704 1.8718773 ]
[-0.06890291 -2.2959301 ]
[ 1.9657456 1.1580495 ]
[ 1.0486693 -1.4648851 ]
[-1.4896832 1.7203271 ]
[ 0.54106015 2.38868 ]
[ 3.0175838 -1.9216222 ]]
[[-0.03516154 2.543376 ]
[-0.467008 1.6641414 ]
[-0.44973713 -1.535601 ]
[ 1.0218439 1.5691875 ]
[-0.30733356 -2.3227684 ]
[ 0.8294033 1.0432268 ]
[ 0.10503205 -0.8651409 ]
[-0.63982046 0.59202313]
[ 0.38573623 1.5135498 ]
[ 2.0508025 -1.5033388 ]]
I would expect the same results because fit_transform should be the combination of fit and transform (regardless of the implementation details). This is what PCA in sklearn and UMAP do.
Is this an intended feature? And if the answer is No, what should we do? One possible solution I found is
reducer = reducer.fit(features, init=init)
# Now the following lines return the same feature.
reduced_features = reducer.transform(features)
transformed_features = reducer.transform(features)
But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.
PS: this has nothing to do with the random seed, since I fixed the random seed, I can get the same result across runs.
In the example in README.md, the PaCMAP object is initialized with
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
This produces the error TypeError: __init__() got an unexpected keyword argument 'n_dims'.
It should probably be n_components instead of n_dims.
Hi,
when I try to run transform, it seems that the random state is required.
I solved it by changing the example embedding from:
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
to:
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state = 1)
Hi,
Many thanks for such a great package! I find the dimensionality reduction approach proposed in PaCMAP very interesting compared to other techniques, so I decided to give it a try with my data sets.
I tried different initial conditions:
In all tests, I always get a "blob".
So, I am looking for your suggestions/comments. I have provided a Python script with one of my data sets (see attached file).
Many thanks,
Ivan
Importing pacmap takes ~4.3 seconds on my machine (Core i7, 8th generation, 40 GB RAM, Python 3.8.10). Since I want to include it in my package, my package inherits those 4.3 seconds and everything is delayed by a lot.
Could you improve the import performance?
Timed with multiple runs of
python3 -m timeit -r 1 "import pacmap"
Really cool work! I look forward to exploring more.
It looks like the example in the README does not work with the pacmap installed from pip. I get the following error:
361 self.n_dims = n_dims
362 self.n_neighbors = n_neighbors
--> 363 self.n_MN = int((n_neighbors * MN_ratio) // 1)
364 self.n_FP = int((n_neighbors * FP_ratio) // 1)
365 self.pair_neighbors = pair_neighbors
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'
After fixing this by setting n_neighbors=10, I get the following error about the dimensions of the data:
257 ):
258 start_time = time.time()
--> 259 n, high_dim = X.shape
260
261 if intermediate:
ValueError: too many values to unpack (expected 2)
The shape of the data/coil_20.npy file stored on GitHub ((1440, 128, 128)) is inconsistent with what the example code expects.
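For what it's worth, flattening each 128x128 image into a single row (as the repo's own example does) resolves the shape mismatch before calling fit_transform:
import numpy as np

X = np.load("./data/coil_20.npy", allow_pickle=True)   # shape (1440, 128, 128)
X = X.reshape(X.shape[0], -1)                          # shape (1440, 16384)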
Thanks!
Hi,
I bumped into this research work https://bib.dbvis.de/uploadedFiles/155.pdf, in which the authors propose a fractional distance to deal better with the notion of near/far neighbors in high dimensional data sets. According to their results, the fractional distance can provide better outcomes compared to Euclidean (or Manhattan) distance.
So, I would like to submit this enhancement request: include fractional distance in PaCMAP package.
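A minimal sketch of the fractional (Minkowski with p < 1) distance proposed in the linked paper; this is not part of PaCMAP, just an illustration of the requested metric:
import numpy as np

def fractional_distance(x, y, p=0.5):
    # Fractional Lp "distance" with 0 < p < 1, as suggested for high-dimensional data.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = np.random.rand(768), np.random.rand(768)
print(fractional_distance(a, b, p=0.5), np.linalg.norm(a - b))   # compare with Euclidean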
Kind regards,
Ivan
Hey there,
do you have rough estimates of how much compute / RAM is necessary to scale PaCMAP to 10M, 100M, 1B and 10B rows, each row a 786-dimensional embedding? Or do you provide a multi-node solution?
Best,
Robert
When trying to install this with pip alongside holoviews (installed via conda), I get an error upon import:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/pacmap/__init__.py", line 1, in <module>
from .pacmap import *
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/pacmap/pacmap.py", line 1, in <module>
import numba
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/__init__.py", line 43, in <module>
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/np/ufunc/__init__.py", line 3, in <module>
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/np/ufunc/decorators.py", line 3, in <module>
from numba.np.ufunc import _internal
I have isolated the problem to numba. When I enforce numba>=0.57, I don't get the import error, but when I allow any version of numba, 0.53 gets installed and results in this error. I'm not sure if intermediate numba versions work, but updating the dependency to numba>=0.57 should fix it. Thanks for the great tool!
Given data X with size [m, n], the fit_transform results differ depending on whether the data is pre-normalized (so that each row is unit-norm). However, the expected behavior is that the angular metric should remain robust to the norm.
Would it be possible to add a feature that allows one to embed an unseen/unlabeled data point to an existing embedding using metric learning? This would be similar to the function currently available with UMAP:
https://umap-learn.readthedocs.io/en/latest/supervised.html
Thanks!
Correct me if I'm wrong, but it looks like users are currently locked into ANNOY for nearest neighbors. This is pretty limiting. Can you support externally supplied nearest neighbors?
While we're on the subject of nearest neighbors, could you provide some context for the n_neighbors_extra strategy? It's not obvious why a constant of 50 should be added. I recall seeing the same thing in the TriMap code. Perhaps this is noted in the paper, but you'll understand it's easier to ask about it directly. Besides an explanation here, some code comments would be good.
Further to the above, what about n_neighbors_extra+1?
An edge case related to these questions: judging by the line n_neighbors_extra = min(n_neighbors + 50, n), it seems it was anticipated that this strategy could request more nearest neighbors than there are rows in the dataset. But if that case is possible, doesn't n_neighbors_extra+1 reintroduce the problem?
Hello.
I'm trying to store the PaCMAP model in a database for further transformations. I tried to pickle it, but the tree is an annoy object.
I also tried saving the annoy object with embedding.tree.save('./annoy_object.ann'); this works, but I cannot load it back, since creating a new PaCMAP object does not initialize the annoy tree.
Is there a way to save/load the PaCMAP object or tree? My main objective is to send it to a DB so I can transform new incoming data in my clustering pipeline.
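A hedged workaround sketch, assuming (as in the report above) that the fitted model exposes the Annoy index as .tree and that the data dimension and distance metric used at fit time are known; attribute names may differ between versions, and pickling can still fail if other attributes are not picklable:
import pickle
import numpy as np
import pacmap
from annoy import AnnoyIndex

X = np.random.rand(500, 20).astype(np.float32)
embedding = pacmap.PaCMAP(n_components=2, save_tree=True)
embedding.fit_transform(X, init="pca")

# save: persist the Annoy index separately, pickle the model without it
embedding.tree.save("annoy_object.ann")
tree, embedding.tree = embedding.tree, None
with open("pacmap_model.pkl", "wb") as f:
    pickle.dump(embedding, f)
embedding.tree = tree

# load: unpickle the model, rebuild the Annoy index, and reattach it
with open("pacmap_model.pkl", "rb") as f:
    embedding = pickle.load(f)
index = AnnoyIndex(X.shape[1], "euclidean")   # metric must match the one used during fit
index.load("annoy_object.ann")
embedding.tree = index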
Thanks for your attention.
Hi, I am really enjoying your PaCMAP package. Getting some great results. I have a request for a new feature (unless this is a feature that exists and I am not aware of it).
I am running PaCMAP on large datasets and then using the model with the transform function to project new data into the existing embedded space. With models developed on large datasets, it would be great if I could save a PaCMAP model locally, then reload it and apply it to new data, rather than having to rerun PaCMAP each time I start a new session. As an analogy, I often save my XGBoost models and apply them to new data rather than rerunning the model each time I have new data.
Is this feature already available or something that can be implemented?
Thanks again.
Hi there!
I currently use UMAP a lot in my projects and am aware of some of its limitations, so it's great to see new algorithms popping up that aim to fix some of the shortcomings in the field. UMAP has been around for a while now, and as such the community has made some pretty cool modifications and enhancements to the actual algorithm. One of the best features, I think, is the ability to perform intersections and unions on the intermediate fuzzy representations: https://umap-learn.readthedocs.io/en/latest/composing_models.html
This lets the user generate a unified UMAP representation using multiple different distance metrics with ease. I believe that PaCMAP also has an intermediate manifold stage that might make it possible to implement a similar feature, but perhaps I'm wrong on this. Either way, I just wanted to bring it to your attention and perhaps get your thoughts on it.
The other request is something I'm sure is likely to come in future, but that is the ability to easily supply custom distance metrics to PaCMAP much like how the python UMAP handles this: https://umap-learn.readthedocs.io/en/latest/parameters.html?highlight=metric#metric
I realise these are likely non-trivial requests, but hearing what you think about them would be appreciated.
Cheers,
Rhys
Hi,
I have been exploring PaCMAP this past week and wanted to test it against multiple attributes. One of them is repeatability, and to do that I set random_state = 1, 10, 20. I used FMNIST for the dataset, and the code is the following:
import numpy as np
import pacmap
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import time
import seaborn as sns
import pandas as pd
import umap.plot
import sys
import umap
from io import BytesIO
from PIL import Image
import base64
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10, Category10
train = np.load("/home/icalle/Documents/umap/Data/fmnist_images.npy", allow_pickle=True)
train = train.reshape(train.shape[0], -1)
test = np.load("/home/icalle/Documents/umap/Data/fmnist_labels.npy", allow_pickle=True)
reducer = pacmap.PaCMAP(n_dims=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=20)
embedding = reducer.fit_transform(train, init="pca")
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], s= 5, c=test, cmap='Spectral')
plt.gca().set_aspect('equal', 'datalim')
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks([0,1,2,3,4,5,6,7,8,9])
cbar.set_ticklabels(["T-shirt/top","Trouser","Pullover","Dress","Coat", "Sandal","Shirt","Sneaker","Bag","Ankle boot"])
plt.title('PaCMAP Fashion-MNIST; n_neighbors=10, random_state= 20', fontsize=12);
plt.show()
After plotting, the results are the following:
To conclude, as seen in the plot, this is not the correct output, and I wanted to know where I went wrong in the code or whether this issue has been raised before. I also want to mention that I tested it on the Digits dataset, with multiple random_state values and with init="pca", but I still get the same result.
Even with CPU thread limiting applied while using the provided code, the CPU consumption remains excessively high (around 1400%). Is it possible to implement this algorithm in PyTorch?
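A hedged sketch of capping the thread count: PaCMAP's heavy loops are Numba-parallel, so limiting Numba's threads (and OpenMP/BLAS threads, for good measure) before the first call may reduce CPU usage; the effect on runtime is not guaranteed, and this is not a substitute for a PyTorch port:
import os
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["NUMBA_NUM_THREADS"] = "4"   # must be set before numba is first imported

import numba
numba.set_num_threads(4)                # available in numba >= 0.49

import pacmap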
Hi all,
I noticed that (the current version of) PaCMAP seems to give stochastic results on sparse binary data even if the random seed is set. Below is an example with randomly generated data: in the first case, the data is sparse and the result is stochastic on my machine; in the second case, the data is not sparse and the result is deterministic.
I have not given deeper thought to whether it is reasonable to run PaCMAP on sparse data, but it would still be nice if it was deterministic when random seed is set.
Could you have a look into this?
Many thanks!
import pacmap
import numpy as np
###----
### Stochastic when there's sparsity
###----
print('\n\nBinary data with high proportion of zeros')
# Create a binary dataset with low frequency of 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.01,size=(1000,100))
# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))
# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")
e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")
# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))
###----
### Deterministic when there's less sparsity
###----
print('\n\nBinary data with roughly equal proportion of zeros and ones')
# Create a binary dataset with roughly equal proportions of 0s and 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.5,size=(1000,100))
# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))
# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")
e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")
# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))
##
print('PaCMAP version: {}, numpy version: {}'.format(pacmap.__version__, np.__version__))
The output I get is:
Binary data with high proportion of zeros
Shape of data:
(1000, 100)
Number of nonzero elements in each column:
[12 16 11 11 11 14 7 6 8 11 8 10 11 9 8 13 10 11 5 10 9 11 9 9
11 21 14 11 13 12 15 8 11 7 10 11 6 8 6 10 7 9 9 12 6 15 5 9
8 7 9 8 11 9 15 5 6 8 11 14 12 13 7 9 8 6 12 8 6 12 12 14
17 7 14 12 12 8 10 13 6 13 15 12 6 7 10 13 9 13 8 7 7 17 9 12
9 10 15 7]
Snapshot of data:
[[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 0.8841729 2.0089896 ]
[-0.5350942 0.95636046]
[ 1.7903003 2.135278 ]]
[[ 0.73095715 -2.7870746 ]
[ 0.6337722 -0.6057222 ]
[ 1.952701 -2.0199978 ]]
Difference between values of two runs is less than 1e-8? False
Values between two runs are equal? False
Binary data with roughly equal proportion of zeros and ones
Shape of data:
(1000, 100)
Number of nonzero elements in each column:
[506 519 499 509 499 526 520 492 500 511 520 512 510 526 532 489 503 496
502 487 491 508 509 518 496 467 488 510 498 516 488 520 516 509 495 511
485 498 528 520 514 472 481 522 488 513 491 520 499 507 468 513 515 510
493 489 496 508 480 523 494 518 489 491 514 507 489 507 489 520 501 487
502 505 498 476 503 517 507 474 502 498 513 492 518 493 502 491 511 494
504 508 472 500 505 534 478 497 533 493]
Snapshot of data:
[[1 0 1 1 0 1 1 1 0 0]
[1 1 0 1 1 1 0 0 0 1]
[1 1 1 0 0 0 0 1 0 1]
[1 0 1 1 1 0 0 1 0 0]
[0 1 1 0 1 1 0 1 0 1]
[1 1 1 1 1 1 0 1 0 0]
[0 1 1 1 1 1 1 0 1 0]
[0 0 0 1 0 1 1 1 0 0]
[1 0 1 1 0 0 1 0 1 1]
[0 0 0 0 1 1 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 1.0503646 -2.2979746]
[ 1.6191299 2.3795164]
[-3.1472695 -1.3276936]]
[[ 1.0503646 -2.2979746]
[ 1.6191299 2.3795164]
[-3.1472695 -1.3276936]]
Difference between values of two runs is less than 1e-8? True
Values between two runs are equal? True
PaCMAP version: 0.6.3, numpy version: 1.21.1
Hello, we are working with PaCMAP and found stochasticity when testing with one of our datasets. We used init = 'PCA' and we also kept 'apply_pca=True' . When we ran the code the first time we got 5 clusters (setting a random seed of 20). We then ran the code again without changing any parameter, with the same random seed and we got 4 clusters.
Is this stochasticity something that should be expected, even with setting a specific random state?
Thanks.
Hi Team,
Thank you for creating this amazing package; it works extremely well for my use case. However, it runs very slowly when performing DR from 128D -> 2D.
I am trying to reduce 60k rows of embeddings and it takes 1 hr. Is there any faster or parallel way to do it?
Thanks
I have a pandas dataframe with mostly numerical data but very high dimensionality relative to the sample size (37 variables, 56 rows). I am struggling to convert the pandas dataframe to an ndarray and get it to work with PaCMAP. The variable of interest, 'Intact DNA/million CD4 T cells logscale binarized', provides the labels, while the rest of the variables are predictors. Also, how should I handle categorical variables?
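A hedged sketch of one way to go from the DataFrame to an ndarray for PaCMAP (df below is a stand-in with made-up columns; one-hot encoding via pd.get_dummies is just one common way to handle categorical predictors, not a PaCMAP requirement, and with only 56 rows a small n_neighbors is safer):
import numpy as np
import pandas as pd
import pacmap

# stand-in DataFrame with the label column named as in the question
df = pd.DataFrame({
    "Intact DNA/million CD4 T cells logscale binarized": np.random.randint(0, 2, 56),
    "age": np.random.rand(56) * 60,
    "site": np.random.choice(["A", "B", "C"], 56),    # hypothetical categorical predictor
})

label_col = "Intact DNA/million CD4 T cells logscale binarized"
y = df[label_col].to_numpy()                          # labels, used only for coloring plots
X = pd.get_dummies(df.drop(columns=[label_col]))      # one-hot encode categorical predictors
X = X.to_numpy(dtype=np.float32)

emb = pacmap.PaCMAP(n_components=2, n_neighbors=5).fit_transform(X)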
I wanted to run the model on several subsets in a basic loop, which causes several errors, including a segmentation fault. UMAP, for example, does not cause any problems. Any workarounds or suggestions for how to solve this?
I am currently working on a numerical dataset with 42359 rows and 12 columns and would like to apply PaCMAP to it and visualize the intermediate snapshots during training. I applied the code below, where df_scaled is my dataset after applying StandardScaler().
embedding1 = pacmap.PaCMAP(n_components=2, n_neighbors=10,MN_ratio=0.5, FP_ratio=2.0,
random_state=20, save_tree=False,
intermediate=True,
verbose=1)
X_transformed1 = embedding1.fit_transform(df_scaled, init="pca")
The output was:
X is normalized
PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=450, apply_pca=True, opt_method='adam', verbose=1, intermediate=True, seed=20)
Finding pairs
Found nearest neighbor
Calculated sigma
Found scaled dist
Pairs sampled successfully.
File c:\Users\arthu\anaconda3\envs\dr_visualization\lib\site-packages\pacmap\pacmap.py:530, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
    526 n, _ = X.shape
    528 if intermediate:
    529     intermediate_states = np.empty(
--> 530         (len(inter_snapshots), n, n_dims), dtype=np.float32)
    531 else:
    532     intermediate_states = None
TypeError: object of type 'bool' has no len()
Can you help me solve this issue? I think the problem is in len(inter_snapshots), but that value should already be set.
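A hedged workaround sketch: pass the list of snapshot iterations explicitly, assuming the installed version's constructor exposes an intermediate_snapshots keyword (check help(pacmap.PaCMAP)); the error above suggests a boolean is ending up where that list is expected. df_scaled is the user's own scaled dataset:
embedding1 = pacmap.PaCMAP(
    n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0,
    random_state=20, save_tree=False, verbose=1,
    intermediate=True,
    intermediate_snapshots=[0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450],
)
X_transformed1 = embedding1.fit_transform(df_scaled, init="pca")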
Hi
I would like to know whether there is a way to transform data back to its original space in PaCMAP, like the inverse_transform(X) method in PCA.
More specifically, I want to modify data in the embedded low-dimensional space and see what data in the original space it corresponds to.
Thank you in advance!
Hello, below I show the line of the error and the error itself:
site-packages/pacmap/pacmap.py", line 462, in generate_pair
nbrs = np.zeros((n, n_neighbors_extra), dtype=np.int32)
TypeError: 'float' object cannot be interpreted as an integer
When the number of instances I want to cluster grows too large (above 20k), n_neighbors_extra becomes a float (e.g. 67.321) and I get the error above.
Locally, I escaped this error by casting n_neighbors_extra to an int inside the function, but I am not sure whether this is a proper fix in terms of the quality of the solution.
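For reference, the local patch described above amounts to something like this inside generate_pair (a sketch only; whether this is the right long-term fix is exactly the open question):
n_neighbors_extra = int(min(n_neighbors + 50, n))            # cast to int before allocating
nbrs = np.zeros((n, n_neighbors_extra), dtype=np.int32)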
Is inverse_transform possible with PaCMAP?
I appreciate that it hasn't been implemented yet, but it can be very useful when assessing how well the model is capturing the subtleties of the input data. However, for certain techniques it's simply not possible.
If it is possible, I will open an enhancement request.
I'd like to use the current version from GitHub, but there is no package metadata / setup.py checked in to the repo :(
It is definitely in the version on PyPI, though.
Ideally, this would work:
pip install git+https://github.com/YingfanWang/PaCMAP