yingfanwang / pacmap
PaCMAP: Large-scale Dimension Reduction Technique Preserving Both Global and Local Structure
License: Apache License 2.0
I have an example of using the new transform() feature on iris at:
https://colab.research.google.com/drive/1T3ALLtbx8kw9NAoZzvJQSFgiWvHOvIvM?usp=sharing
It appears to only generate a constant when save_tree is set.
I had expected calling transform() on the iris data to be equivalent to the result of fit_transform()
Or perhaps I read the docstring wrong and there is a different calling convention?
This one works (100 data points):
import numpy as np
from pacmap import pacmap
X = np.random.rand(100, 50)
pacmap.PaCMAP().fit_transform(X)
This one fails (10 data points):
import numpy as np
from pacmap import pacmap
X = np.random.rand(10, 50)
pacmap.PaCMAP().fit_transform(X)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-dc1438c486bd> in <module>
3
4 X = np.random.rand(10, 50)
----> 5 pacmap.PaCMAP().fit_transform(X)
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in fit_transform(self, X, init, save_pairs)
502
503 def fit_transform(self, X, init="random", save_pairs=True):
--> 504 self.fit(X, init, save_pairs)
505 if self.intermediate:
506 return self.intermediate_states
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in fit(self, X, init, save_pairs)
463 )
464 if save_pairs:
--> 465 self.embedding_, self.intermediate_states, self.pair_neighbors, self.pair_MN, self.pair_FP = pacmap(
466 X,
467 self.n_dims,
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in pacmap(X, n_dims, n_neighbors, n_MN, n_FP, pair_neighbors, pair_MN, pair_FP, distance, lr, num_iters, Yinit, apply_pca, verbose, intermediate)
308 if verbose:
309 print(X)
--> 310 pair_neighbors, pair_MN, pair_FP = generate_pair(
311 X, n_neighbors, n_MN, n_FP, distance, verbose
312 )
/usr/local/anaconda3/lib/python3.8/site-packages/pacmap/pacmap.py in generate_pair(X, n_neighbors, n_MN, n_FP, distance, verbose)
234 for i in range(n):
235 nbrs_ = tree.get_nns_by_item(i, n_neighbors_extra+1)
--> 236 nbrs[i, :] = nbrs_[1:]
237 for j in range(n_neighbors_extra):
238 knn_distances[i, j] = tree.get_distance(i, nbrs[i, j])
ValueError: cannot copy sequence with size 9 to array axis with dimension 10
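A minimal sketch of what seems to go wrong, based on the traceback above (it assumes the default n_neighbors=10 and the n_neighbors_extra = min(n_neighbors + 50, n) logic shown there): with only n = 10 points, Annoy cannot return the n_neighbors_extra + 1 = 11 items requested, so nbrs_[1:] has 9 entries while the target row expects 10.
from annoy import AnnoyIndex
import numpy as np

n, dim = 10, 50
X = np.random.rand(n, dim).astype(np.float32)
tree = AnnoyIndex(dim, "euclidean")
for i in range(n):
    tree.add_item(i, X[i])
tree.build(20)

n_neighbors_extra = min(10 + 50, n)                     # = 10, capped at the dataset size
nbrs_ = tree.get_nns_by_item(0, n_neighbors_extra + 1)  # asks for 11 neighbours
print(len(nbrs_))                                       # 10 -> slice [1:] has 9, hence the error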
A trace from my current data set is below. I believe the fit would have been essentially identical if training had ended a hundred iterations early.
So it could be very practical to also support a stopping condition expressed as a minimum improvement (instead of just a fixed number of iterations), especially for use cases where the training function is called repeatedly for hyperparameter tuning. A sketch of the kind of criterion I mean follows after the trace.
Initial Loss: 221937.015625
Iteration: 10, Loss: 2350144.000000
Iteration: 20, Loss: 402674.156250
Iteration: 30, Loss: 240955.015625
Iteration: 40, Loss: 197510.812500
Iteration: 50, Loss: 153939.875000
Iteration: 60, Loss: 129958.703125
Iteration: 70, Loss: 117012.429688
Iteration: 80, Loss: 107633.703125
Iteration: 90, Loss: 98910.828125
Iteration: 100, Loss: 88535.101562
Iteration: 110, Loss: 160596.593750
Iteration: 120, Loss: 146496.093750
Iteration: 130, Loss: 138674.703125
Iteration: 140, Loss: 134762.906250
Iteration: 150, Loss: 132901.375000
Iteration: 160, Loss: 132175.421875
Iteration: 170, Loss: 132131.562500
Iteration: 180, Loss: 132340.203125
Iteration: 190, Loss: 132734.750000
Iteration: 200, Loss: 133220.187500
Iteration: 210, Loss: 60164.875000
Iteration: 220, Loss: 54855.210938
Iteration: 230, Loss: 53705.199219
Iteration: 240, Loss: 53232.484375
Iteration: 250, Loss: 53050.156250
Iteration: 260, Loss: 52986.171875
Iteration: 270, Loss: 52963.117188
Iteration: 280, Loss: 52954.531250
Iteration: 290, Loss: 52950.292969
Iteration: 300, Loss: 52948.816406
Iteration: 310, Loss: 52948.343750
Iteration: 320, Loss: 52947.710938
Iteration: 330, Loss: 52947.429688
Iteration: 340, Loss: 52947.203125
Iteration: 350, Loss: 52947.214844
Iteration: 360, Loss: 52947.093750
Iteration: 370, Loss: 52946.875000
Iteration: 380, Loss: 52946.687500
Iteration: 390, Loss: 52946.578125
Iteration: 400, Loss: 52946.523438
Iteration: 410, Loss: 52946.421875
Iteration: 420, Loss: 52946.429688
Iteration: 430, Loss: 52946.238281
Iteration: 440, Loss: 52946.078125
Iteration: 450, Loss: 52945.914062
Elapsed time: 447.48s
CPU times: user 10min 29s, sys: 17.6 s, total: 10min 46s
Wall time: 7min 46s
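A minimal sketch of such a stopping rule, applied to a recorded loss trace (the tolerance and patience values are illustrative, not recommendations; pacmap itself does not expose a callback, so this only demonstrates the criterion):
def should_stop(losses, rel_tol=1e-5, patience=3):
    # Stop once the relative improvement stays below rel_tol for `patience` consecutive checks.
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    rel_improvements = [
        (prev - cur) / max(abs(prev), 1e-12)
        for prev, cur in zip(recent[:-1], recent[1:])
    ]
    return all(r < rel_tol for r in rel_improvements)

# Applied to the tail of the trace above (iterations 420-450):
print(should_stop([52946.429688, 52946.238281, 52946.078125, 52945.914062]))  # True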
In the readme it says: init: the initialization of the lower dimensional embedding. One of "pca" or "random". Default to "pca".
Looking through the code, it seems init can also be a user-supplied matrix. If that is true, the readme should be corrected, and the code should be updated to check that the user-supplied matrix has an acceptable number of rows and columns, throwing a specific error if it does not.
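A hedged sketch of the kind of check meant here (check_user_init and its argument names are illustrative, not the package's actual internals):
import numpy as np

def check_user_init(init, n_samples, n_components):
    # Validate a user-supplied initialization matrix before using it as the embedding start.
    init = np.asarray(init, dtype=np.float32)
    if init.shape != (n_samples, n_components):
        raise ValueError(
            f"User-supplied init has shape {init.shape}, "
            f"expected ({n_samples}, {n_components})."
        )
    return init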
Try calling .transform() on the test set several times after fitting the dataset; the results start converging to the same value for each item in the test batch.
Hello, we are two users trying to install the library for the first time, but without success. Does anyone know how to solve the problem?
hi, do you have GPU implementation for PaCMAP?
Hi, I applied PaCMAP to pretrained knowledge graph embeddings made from the wikidata5m knowledge graph (https://graphvite.io/docs/latest/pretrained_model.html). These are quite chunky: looking up nearest neighbors just barely works within my swap limits, but it is very slow (so I know it works, and the quality of the embedding nearest neighbors is good). I tried RcppHNSW to build an index, but that did not work. So I thought I would try PaCMAP. The results were not usable. How can I learn more about using PaCMAP for this use case? How reliable are nearest neighbors with PaCMAP?
I use the same data (dim = 512), but I get a different result on each run. This is my code:
embedding_pacmap = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
X_transformed = embedding_pacmap.fit_transform(partxvec, init="pca")
At the same time I use UMAP with a random_state parameter set:
reducer = umap.UMAP(random_state=42)
embedding = reducer.fit_transform(partxvec)
and I get the same result on each run.
So I want to know: does this algorithm have an equivalent configuration option to UMAP's random_state?
Hello, can you add this package to conda-forge or are there issues that are still being worked out?
If you need directions, you can look here: https://github.com/conda-forge/staged-recipes
Currently, it is not obvious how to apply a fitted PaCMAP model to a test dataset. Although a transform() call is available, it is not obvious how that is used, nor the applicable syntax, so some documentation on that would be appreciated.
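A hedged sketch of how a train/test workflow might look; the exact transform() signature can differ between versions (some expose a basis argument for the original training data), so treat this as illustrative rather than the documented API:
import numpy as np
import pacmap

X_train = np.random.rand(500, 20).astype(np.float32)
X_test = np.random.rand(50, 20).astype(np.float32)

reducer = pacmap.PaCMAP(n_components=2, n_neighbors=10, save_tree=True)
train_emb = reducer.fit_transform(X_train, init="pca")   # fit on the training set
test_emb = reducer.transform(X_test)                     # project the test set into the fitted space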
As for citing your brilliant work, it would really help if the paper was already published (not only as a preprint) - how far is the progress on that matter?:)
Hello,
When I try to install PaCMAP using the command pip install pacmap, I get the following error message:
ERROR: Failed building wheel for annoy
Do you have any solution for this?
I have a very large dataset (50 million rows by 768 features) I am trying to use, and a test case with 1M rows took about 35 minutes. Scaling to 50M implies it would take over 1 day to finish. This is on a machine with 160GB memory and 40 cores. Any suggestions on how to speed this up? I would prefer trying to fit the whole dataset vs fitting just a subset and applying that to the remaining data, but will do that if necessary.
There are a subset of rows that I think are linked - would specifying n_neighbors help with speeding things up?
First, I want to congratulate you on PaCMAP. I have been using it for some weeks now and it has been working really nicely with my data.
I wanted to ask: is there a way to use PaCMAP with user-input distance matrices? I found a blog post about an R wrapper for PaCMAP, where the metric keyword is provided as an argument indicating that the input matrix is a distance matrix. But I couldn't find how to do that in Python...
Thanks and best wishes
Joana
Currently working with umap, I can do the following:
reducer = umap.UMAP(n_neighbors=15,n_components=10)
then save the reducer using pickle
and later on, I can reuse it
reduce_data = reducer.transform(scaled_data)
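Spelled out, the UMAP workflow described above looks roughly like this (scaled_train_data and scaled_data are placeholders for the actual arrays):
import pickle
import numpy as np
import umap

scaled_train_data = np.random.rand(200, 50)   # placeholder for the real training data
scaled_data = np.random.rand(20, 50)          # placeholder for the new data

reducer = umap.UMAP(n_neighbors=15, n_components=10)
reducer.fit(scaled_train_data)

with open("reducer.pkl", "wb") as f:          # save the fitted reducer
    pickle.dump(reducer, f)

with open("reducer.pkl", "rb") as f:          # later, reload it
    reducer = pickle.load(f)
reduce_data = reducer.transform(scaled_data)  # apply it to new data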
PaCMAP does have fit and fit_transform methods, but I can't find a transform method, so it is of no use to me :-(
I can't apply an existing reducer to a new dataset.
Could you add it? I really want to benchmark it against UMAP (which gives me very good results).
Thanks
Hi
I was trying to run pacmap but I ran into the following error:
Python 3.7.0 (default, Oct 9 2018, 10:31:47)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
import pacmap
import numpy as np
import matplotlib.pyplot as plt
# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
Traceback (most recent call last):
File "", line 1, in
AttributeError: module 'pacmap' has no attribute 'PaCMAP'
Please advise.
Thank you.
Figures A.13, A.14, A.15, A.16, and A.17 include visualizations of the activations for all data from the ILSVRC 2012 dataset. Is that the training set or the validation set?
pipe = Pipeline([('reduce_dim', pacmap.PaCMAP()), ('kmeans', KMeans())])
AttributeError: 'PaCMAP' object has no attribute 'random_state'
It would be great if it did (i.e., if PaCMAP worked inside a sklearn Pipeline).
Dear authors, could you please provide the code to plot the rainbow figure of the four bad loss functions?
Hi there,
I am trying to run this test code:
import pacmap
print('pacmap version:', pacmap.__version__)
import numpy as np
print('numpy version:', np.__version__)
import numba as nb
print('numba version:', nb.__version__)
import matplotlib.pyplot as plt
# loading preprocessed coil_20 dataset
# you can change it with any dataset that is in the ndarray format, with the shape (N, D)
# where N is the number of samples and D is the dimension of each sample
X = np.load("./data/coil_20.npy", allow_pickle=True)
X = X.reshape(X.shape[0], -1)
y = np.load("./data/coil_20_labels.npy", allow_pickle=True)
# initializing the pacmap instance
# Setting n_neighbors to "None" leads to a default choice shown below in "parameter" section
embedding = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
# fit the data (The index of transformed data corresponds to the index of the original data)
X_transformed = embedding.fit_transform(X, init="pca")
# visualize the embedding
fig, ax = plt.subplots(1, 1, figsize=(6, 6))
ax.scatter(X_transformed[:, 0], X_transformed[:, 1], cmap="Spectral", c=y, s=0.6)
However, I get this error
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:96: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
j = np.random.randint(maximum)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:143: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
sampled = np.random.randint(0, n, 6)
/Users/Morgan/miniconda3/envs/pacmap/lib/python3.11/site-packages/pacmap/pacmap.py:166: NumbaPendingDeprecationWarning: The 'old_style' error capturing is deprecated and will be replaced by 'new_style' in a future release.
sampled = np.random.randint(0, n, 6)
This seems to be attributed to numba, but I have tried with 0.57.0 (the one originally installed with conda) and the upgraded 0.58.0; both give the warning.
I am not sure how to fix this or which versions are needed to be compatible.
Thanks,
Morgan
I have a matrix X with 28 columns and a unit-normalized version uX of X, in which each row of X is divided by its L2 norm. I would expect the PaCMAP 3D embedding using angular distance on X to be the same as on uX, but they are not. Why?
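For reference, a hedged sketch of the setup described (X here is a random placeholder; the distance parameter name follows the package's documented options):
import numpy as np
import pacmap

X = np.random.rand(1000, 28).astype(np.float32)
uX = X / np.linalg.norm(X, axis=1, keepdims=True)   # each row scaled to unit L2 norm

emb_X = pacmap.PaCMAP(n_components=3, distance="angular").fit_transform(X)
emb_uX = pacmap.PaCMAP(n_components=3, distance="angular").fit_transform(uX)
# The question above is why these two embeddings differ.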
Hi, thanks for developing PaCMAP, lovely work!
I found that using transform after using fit_transform on the same set of features yields different results.
I ran the following example:
import pacmap
import numpy as np
np.random.seed(0)
init = "pca" # results can be reproduced also with "random"
reducer = pacmap.PaCMAP(
n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, save_tree=True
)
features = np.random.randn(100, 30)
reduced_features = reducer.fit_transform(features, init=init)
print(reduced_features[:10])
transformed_features = reducer.transform(features)
print(transformed_features[:10])
And returns
[[ 0.7728913 3.785831 ]
[-0.69379026 2.116452 ]
[-1.7770871 -0.97542125]
[ 2.5090704 1.8718773 ]
[-0.06890291 -2.2959301 ]
[ 1.9657456 1.1580495 ]
[ 1.0486693 -1.4648851 ]
[-1.4896832 1.7203271 ]
[ 0.54106015 2.38868 ]
[ 3.0175838 -1.9216222 ]]
[[-0.03516154 2.543376 ]
[-0.467008 1.6641414 ]
[-0.44973713 -1.535601 ]
[ 1.0218439 1.5691875 ]
[-0.30733356 -2.3227684 ]
[ 0.8294033 1.0432268 ]
[ 0.10503205 -0.8651409 ]
[-0.63982046 0.59202313]
[ 0.38573623 1.5135498 ]
[ 2.0508025 -1.5033388 ]]
I would expect the same results because fit_transform should be the combination of fit and transform (regardless of the implementation details). This is what PCA in sklearn and UMAP do.
Is this an intended feature? And if the answer is No, what should we do? One possible solution I found is
reducer = reducer.fit(features, init=init)
# Now the following lines return the same feature.
reduced_features = reducer.transform(features)
transformed_features = reducer.transform(features)
But this only solves the problem at the implementation level, not at the conceptual level. Since the returned values from fit_transform and transform are different, I'm not sure I can trust the output of transform.
PS: this has nothing to do with the random seed, since I fixed the random seed, I can get the same result across runs.
In the example in README.md, the PaCMAP object is initialized with
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
This produces the error TypeError: __init__() got an unexpected keyword argument 'n_dims'.
It should probably be n_components instead of n_dims.
Hi,
when I try to run transform, it seems that the random state is required.
I solved it by changing the example embedding from:
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0)
to:
embedding = pacmap.PaCMAP(n_dims=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state = 1)
Hi,
Many thanks for such a great package! I find the dimensionality reduction approach proposed in PaCMAP very interesting compared to other techniques, so I decided to give it a try with my data sets.
I tried different initial conditions:
In all tests, I always get a "blob".
So, I am looking for your suggestions/comments. I have provided a Python script with one of my data sets (see attached file).
Many thanks,
Ivan
Importing pacmap takes ~4.3 seconds on my machine (Core i7, 8th generation, 40 GB RAM, Python 3.8.10). Since I want to include it in my package, my package inherits those 4.3 seconds and everything is delayed by a lot.
Could you improve the import performance?
Timed with multiple runs of
python3 -m timeit -r 1 "import pacmap"
Really cool work! I look forward to exploring more.
It looks like the example in the README does not work with the pacmap installed from pip. I get the following error:
361 self.n_dims = n_dims
362 self.n_neighbors = n_neighbors
--> 363 self.n_MN = int((n_neighbors * MN_ratio) // 1)
364 self.n_FP = int((n_neighbors * FP_ratio) // 1)
365 self.pair_neighbors = pair_neighbors
TypeError: unsupported operand type(s) for *: 'NoneType' and 'float'
After fixing this by setting n_neighbors=10, I get the following error about the dimensions of the data:
257 ):
258 start_time = time.time()
--> 259 n, high_dim = X.shape
260
261 if intermediate:
ValueError: too many values to unpack (expected 2)
The shape of the data/coil_20.npy file stored on GitHub ((1440, 128, 128)) is inconsistent with what the example code expects.
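For what it's worth, flattening each 128x128 image into a single row (as the repo's own example does) resolves the shape mismatch before calling fit_transform:
import numpy as np

X = np.load("./data/coil_20.npy", allow_pickle=True)   # shape (1440, 128, 128)
X = X.reshape(X.shape[0], -1)                          # shape (1440, 16384)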
Thanks!
Hi,
I bumped into this research work https://bib.dbvis.de/uploadedFiles/155.pdf, in which the authors propose a fractional distance to deal better with the notion of near/far neighbors in high dimensional data sets. According to their results, the fractional distance can provide better outcomes compared to Euclidean (or Manhattan) distance.
So, I would like to submit this enhancement request: include fractional distance in PaCMAP package.
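A minimal sketch of the fractional (Minkowski with p < 1) distance proposed in the linked paper; this is not part of PaCMAP, just an illustration of the requested metric:
import numpy as np

def fractional_distance(x, y, p=0.5):
    # Fractional Lp "distance" with 0 < p < 1, as suggested for high-dimensional data.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = np.random.rand(768), np.random.rand(768)
print(fractional_distance(a, b, p=0.5), np.linalg.norm(a - b))   # compare with Euclidean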
Kind regards,
Ivan
Hey there,
do you have rough estimates of how much compute / RAM is necessary to scale PaCMAP to 10M, 100M, 1B and 10B rows, each row a 786-dimensional embedding? Or do you provide a multi-node solution?
Best,
Robert
When trying to install this with pip alongside holoviews (installed via conda), I get an error upon import:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/pacmap/__init__.py", line 1, in <module>
from .pacmap import *
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/pacmap/pacmap.py", line 1, in <module>
import numba
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/__init__.py", line 43, in <module>
from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/np/ufunc/__init__.py", line 3, in <module>
from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
File "/home/popos/mambaforge/envs/maple_test/lib/python3.9/site-packages/numba/np/ufunc/decorators.py", line 3, in <module>
from numba.np.ufunc import _internal
I have isolated the problem to numba. When I enforce numba>=0.57, I don't get the import error, but when I allow any version of numba, 0.53 gets installed and results in this error. I'm not sure if intermediate numba versions work, but updating the dependency to numba>=0.57 should fix it. Thanks for the great tool!
Given data X with size [m, n], the fit_transform results differ depending on whether the data is pre-normalized (so that each row is unit-norm). However, the expected behavior is that the angular metric should remain robust to the norm.
Would it be possible to add a feature that allows one to embed an unseen/unlabeled data point to an existing embedding using metric learning? This would be similar to the function currently available with UMAP:
https://umap-learn.readthedocs.io/en/latest/supervised.html
Thanks!
Correct me if I'm wrong, but it looks like users are currently locked into ANNOY for nearest neighbors. This is pretty limiting. Can you support externally supplied nearest neighbors?
While we're on the subject of nearest neighbors, could you provide some context for the n_neighbors_extra strategy? It's not obvious why a constant of 50 should be added. I recall seeing the same thing in the TriMap code. Perhaps this is noted in the paper, but you'll understand it's easier to ask about it directly. Besides an explanation here, some code comments would be good.
Further to the above, what about n_neighbors_extra+1?
An edge case related to these questions: judging by the line n_neighbors_extra = min(n_neighbors + 50, n), it seems it was anticipated that this strategy could request more nearest neighbors than there are rows in the dataset. But if that case is possible, doesn't n_neighbors_extra+1 reintroduce the problem?
Hello.
I'm trying to store the PaCMAP model in a database for further transformations. I tried to pickle it, but the tree is an annoy object.
I also tried saving the annoy object with embedding.tree.save('./annoy_object.ann'); this works, but I cannot load it back, since creating a new PaCMAP object does not initialize the annoy tree.
Is there a way to save/load the PaCMAP object or tree? My main objective is to send it to a DB so I can transform new incoming data in my clustering pipeline.
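A hedged workaround sketch, assuming (as in the report above) that the fitted model exposes the Annoy index as .tree and that the data dimension and distance metric used at fit time are known; attribute names may differ between versions, and pickling can still fail if other attributes are not picklable:
import pickle
import numpy as np
import pacmap
from annoy import AnnoyIndex

X = np.random.rand(500, 20).astype(np.float32)
embedding = pacmap.PaCMAP(n_components=2, save_tree=True)
embedding.fit_transform(X, init="pca")

# save: persist the Annoy index separately, pickle the model without it
embedding.tree.save("annoy_object.ann")
tree, embedding.tree = embedding.tree, None
with open("pacmap_model.pkl", "wb") as f:
    pickle.dump(embedding, f)
embedding.tree = tree

# load: unpickle the model, rebuild the Annoy index, and reattach it
with open("pacmap_model.pkl", "rb") as f:
    embedding = pickle.load(f)
index = AnnoyIndex(X.shape[1], "euclidean")   # metric must match the one used during fit
index.load("annoy_object.ann")
embedding.tree = index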
Thanks for your attention.
Hi, I am really enjoying your PaCMAP package. Getting some great results. I have a request for a new feature (unless this is a feature that exists and I am not aware of it).
I am running PaCMAP on large datasets and then using the model with the transform function to project new data into the existing embedded space. With models developed on large datasets, it would be great if I could save a PaCMAP model locally, then reload it and apply it to new data, rather than having to rerun PaCMAP each time I start a new session. As an analogy, I often save my XGBoost models and apply them to new data rather than rerunning the model each time I have new data.
Is this feature already available or something that can be implemented?
Thanks again.
Hi there!
I currently use UMAP a lot in my projects and am aware of some of its limitations, so it's great to see new algorithms popping up that aim to fix some of the shortcomings in the field. UMAP has been around for a while now, and as such the community has made some pretty cool modifications and enhancements to the actual algorithm. One of the best features, I think, is the ability to perform intersections and unions on the intermediate fuzzy representations: https://umap-learn.readthedocs.io/en/latest/composing_models.html
This lets the user generate a unified UMAP representation using multiple different distance metrics with ease. I believe that PaCMAP also has an intermediate manifold stage that might make it possible to implement a similar feature, but perhaps I'm wrong on this. Either way, I just wanted to bring it to your attention and perhaps get your thoughts on it.
The other request is something I'm sure is likely to come in future, but that is the ability to easily supply custom distance metrics to PaCMAP much like how the python UMAP handles this: https://umap-learn.readthedocs.io/en/latest/parameters.html?highlight=metric#metric
I realise these are likely non-trivial requests, but hearing what you think about them would be appreciated.
Cheers,
Rhys
Hi,
I have been exploring PaCMAP this past week and wanted to test it against multiple attributes. One of them is repeatability, and to do that I set random_state = 1, 10, 20. I used FMNIST for the dataset, and the code is the following:
import numpy as np
import pacmap
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import time
import seaborn as sns
import pandas as pd
import umap.plot
import sys
import umap
from io import BytesIO
from PIL import Image
import base64
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, CategoricalColorMapper
from bokeh.palettes import Spectral10, Category10
train = np.load("/home/icalle/Documents/umap/Data/fmnist_images.npy", allow_pickle=True)
train = train.reshape(train.shape[0], -1)
test = np.load("/home/icalle/Documents/umap/Data/fmnist_labels.npy", allow_pickle=True)
reducer = pacmap.PaCMAP(n_dims=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, random_state=20)
embedding = reducer.fit_transform(train, init="pca")
plt.scatter(reducer.embedding_[:, 0], reducer.embedding_[:, 1], s= 5, c=test, cmap='Spectral')
plt.gca().set_aspect('equal', 'datalim')
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks([0,1,2,3,4,5,6,7,8,9])
cbar.set_ticklabels(["T-shirt/top","Trouser","Pullover","Dress","Coat", "Sandal","Shirt","Sneaker","Bag","Ankle boot"])
plt.title('PaCMAP Fashion-MNIST; n_neighbors=10, random_state= 20', fontsize=12);
plt.show()
After plotting, the results are the following:
To conclude, as seen in the plot, this is not the correct output, and I wanted to know where I went wrong in the code or whether this issue has been raised before. I also want to mention that I tested it on the Digits dataset, with multiple random_state values and with init="pca", but I still get the same result.
Even with CPU thread limiting applied while using the provided code, the CPU consumption remains excessively high (around 1400%). Is it possible to implement this algorithm in PyTorch?
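A hedged sketch of capping the thread count: PaCMAP's heavy loops are Numba-parallel, so limiting Numba's threads (and OpenMP/BLAS threads, for good measure) before the first call may reduce CPU usage; the effect on runtime is not guaranteed, and this is not a substitute for a PyTorch port:
import os
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["NUMBA_NUM_THREADS"] = "4"   # must be set before numba is first imported

import numba
numba.set_num_threads(4)                # available in numba >= 0.49

import pacmap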
Hi all,
I noticed that (the current version of) PaCMAP seems to give stochastic results on sparse binary data even if the random seed is set. Below is an example with randomly generated data: in the first case, the data is sparse and the result is stochastic on my machine; in the second case, the data is not sparse and the result is deterministic.
I have not given deeper thought to whether it is reasonable to run PaCMAP on sparse data, but it would still be nice if it was deterministic when random seed is set.
Could you have a look into this?
Many thanks!
import pacmap
import numpy as np
###----
### Stochastic when there's sparsity
###----
print('\n\nBinary data with high proportion of zeros')
# Create a binary dataset with low frequency of 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.01,size=(1000,100))
# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))
# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")
e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")
# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))
###----
### Deterministic when there's less sparsity
###----
print('\n\nBinary data with roughly equal proportion of zeros and ones')
# Create a binary dataset with roughly equal proportions of 0s and 1s
rng = np.random.default_rng(seed=42)
d = rng.binomial(n=1,p=0.5,size=(1000,100))
# Just in case - if any columns are all zeros, remove these
mask = d.sum(axis=0) > 0
d = d[:,mask]
print('Shape of data: \n{}'.format(d.shape))
print('Number of nonzero elements in each column: \n{}'.format(d.sum(axis=0)))
print('Snapshot of data: \n{}'.format(d[:10,:10]))
# Pacmap
e = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t = e.fit_transform(d, init="pca")
e2 = pacmap.PaCMAP(n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, random_state=42)
t2 = e2.fit_transform(d, init="pca")
# Test
print(t[:3, :3])
print(t2[:3, :3])
test = np.sum(np.abs(t-t2)) < 1e-8
print('Difference between values of two runs is less than 1e-8? {}'.format(test))
test2 = (t == t2).all()
print('Values between two runs are equal? {}'.format(test2))
##
print('PaCMAP version: {}, numpy version: {}'.format(pacmap.__version__, np.__version__))
The output I get is:
Binary data with high proportion of zeros
Shape of data:
(1000, 100)
Number of nonzero elements in each column:
[12 16 11 11 11 14 7 6 8 11 8 10 11 9 8 13 10 11 5 10 9 11 9 9
11 21 14 11 13 12 15 8 11 7 10 11 6 8 6 10 7 9 9 12 6 15 5 9
8 7 9 8 11 9 15 5 6 8 11 14 12 13 7 9 8 6 12 8 6 12 12 14
17 7 14 12 12 8 10 13 6 13 15 12 6 7 10 13 9 13 8 7 7 17 9 12
9 10 15 7]
Snapshot of data:
[[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1]
[0 0 0 0 0 0 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 0.8841729 2.0089896 ]
[-0.5350942 0.95636046]
[ 1.7903003 2.135278 ]]
[[ 0.73095715 -2.7870746 ]
[ 0.6337722 -0.6057222 ]
[ 1.952701 -2.0199978 ]]
Difference between values of two runs is less than 1e-8? False
Values between two runs are equal? False
Binary data with roughly equal proportion of zeros and ones
Shape of data:
(1000, 100)
Number of nonzero elements in each column:
[506 519 499 509 499 526 520 492 500 511 520 512 510 526 532 489 503 496
502 487 491 508 509 518 496 467 488 510 498 516 488 520 516 509 495 511
485 498 528 520 514 472 481 522 488 513 491 520 499 507 468 513 515 510
493 489 496 508 480 523 494 518 489 491 514 507 489 507 489 520 501 487
502 505 498 476 503 517 507 474 502 498 513 492 518 493 502 491 511 494
504 508 472 500 505 534 478 497 533 493]
Snapshot of data:
[[1 0 1 1 0 1 1 1 0 0]
[1 1 0 1 1 1 0 0 0 1]
[1 1 1 0 0 0 0 1 0 1]
[1 0 1 1 1 0 0 1 0 0]
[0 1 1 0 1 1 0 1 0 1]
[1 1 1 1 1 1 0 1 0 0]
[0 1 1 1 1 1 1 0 1 0]
[0 0 0 1 0 1 1 1 0 0]
[1 0 1 1 0 0 1 0 1 1]
[0 0 0 0 1 1 0 0 0 0]]
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
C:\Users\andres.tamm\Anaconda3\lib\site-packages\pacmap\pacmap.py:774: UserWarning: Warning: random state is set to 42
warnings.warn(f'Warning: random state is set to {_RANDOM_STATE}')
[[ 1.0503646 -2.2979746]
[ 1.6191299 2.3795164]
[-3.1472695 -1.3276936]]
[[ 1.0503646 -2.2979746]
[ 1.6191299 2.3795164]
[-3.1472695 -1.3276936]]
Difference between values of two runs is less than 1e-8? True
Values between two runs are equal? True
PaCMAP version: 0.6.3, numpy version: 1.21.1
Hello, we are working with PaCMAP and found stochasticity when testing with one of our datasets. We used init = 'PCA' and we also kept 'apply_pca=True' . When we ran the code the first time we got 5 clusters (setting a random seed of 20). We then ran the code again without changing any parameter, with the same random seed and we got 4 clusters.
Is this stochasticity something that should be expected, even with setting a specific random state?
Thanks.
Hi Team,
Thank you for creating this amazing package; it works extremely well for my use case. However, it runs very slowly when performing DR from 128D -> 2D.
I am trying to reduce 60k rows of embeddings and it takes 1 hr. Is there any faster or parallel way to do it?
Thanks
I have a pandas dataframe with mostly numerical data but very high dimensionality relative to the sample size (37 variables, 56 rows). I am struggling to convert the pandas dataframe to an ndarray and get it to work with PaCMAP. The variable of interest, 'Intact DNA/million CD4 T cells logscale binarized', provides the labels, while the rest of the variables are predictors. Also, how should I handle categorical variables?
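A hedged sketch of one way to go from the DataFrame to an ndarray for PaCMAP (df below is a stand-in with made-up columns; one-hot encoding via pd.get_dummies is just one common way to handle categorical predictors, not a PaCMAP requirement, and with only 56 rows a small n_neighbors is safer):
import numpy as np
import pandas as pd
import pacmap

# stand-in DataFrame with the label column named as in the question
df = pd.DataFrame({
    "Intact DNA/million CD4 T cells logscale binarized": np.random.randint(0, 2, 56),
    "age": np.random.rand(56) * 60,
    "site": np.random.choice(["A", "B", "C"], 56),    # hypothetical categorical predictor
})

label_col = "Intact DNA/million CD4 T cells logscale binarized"
y = df[label_col].to_numpy()                          # labels, used only for coloring plots
X = pd.get_dummies(df.drop(columns=[label_col]))      # one-hot encode categorical predictors
X = X.to_numpy(dtype=np.float32)

emb = pacmap.PaCMAP(n_components=2, n_neighbors=5).fit_transform(X)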
I wanted to run the model on several subsets in a basic loop, which causes several errors, including a segmentation fault. UMAP, for example, does not cause any problems. Any workarounds or suggestions for how to solve this?
I am currently working on a numerical dataset with 42359 rows and 12 columns and would like to apply PaCMAP to it and visualize the intermediate snapshots during training. I applied the code below, where df_scaled is my dataset after applying StandardScaler().
embedding1 = pacmap.PaCMAP(n_components=2, n_neighbors=10,MN_ratio=0.5, FP_ratio=2.0,
random_state=20, save_tree=False,
intermediate=True,
verbose=1)
X_transformed1 = embedding1.fit_transform(df_scaled, init="pca")
The output was:
X is normalized
PaCMAP(n_neighbors=10, n_MN=5, n_FP=20, distance=euclidean, lr=1.0, n_iters=450, apply_pca=True, opt_method='adam', verbose=1, intermediate=True, seed=20)
Finding pairs
Found nearest neighbor
Calculated sigma
Found scaled dist
Pairs sampled successfully.
File c:\Users\arthu\anaconda3\envs\dr_visualization\lib\site-packages\pacmap\pacmap.py:530, in pacmap(X, n_dims, pair_neighbors, pair_MN, pair_FP, lr, num_iters, Yinit, verbose, intermediate, inter_snapshots, pca_solution, tsvd)
    526 n, _ = X.shape
    528 if intermediate:
    529     intermediate_states = np.empty(
--> 530         (len(inter_snapshots), n, n_dims), dtype=np.float32)
    531 else:
    532     intermediate_states = None
TypeError: object of type 'bool' has no len()
Can you help me solve this issue? I think the problem is in len(inter_snapshots), but that value should already be set.
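A hedged workaround sketch: pass the list of snapshot iterations explicitly, assuming the installed version's constructor exposes an intermediate_snapshots keyword (check help(pacmap.PaCMAP)); the error above suggests a boolean is ending up where that list is expected. df_scaled is the user's own scaled dataset:
embedding1 = pacmap.PaCMAP(
    n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0,
    random_state=20, save_tree=False, verbose=1,
    intermediate=True,
    intermediate_snapshots=[0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450],
)
X_transformed1 = embedding1.fit_transform(df_scaled, init="pca")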
Hi
I would like to know whether there is a way to transform data back to its original space in PaCMAP, like the inverse_transform(X) method in PCA.
More specifically, I want to modify data in the embedded low-dimensional space and see what data in the original space it corresponds to.
Thank you in advance!
Hello, below I show the line of the error and the error itself:
site-packages/pacmap/pacmap.py", line 462, in generate_pair
nbrs = np.zeros((n, n_neighbors_extra), dtype=np.int32)
TypeError: 'float' object cannot be interpreted as an integer
When the number of instances I want to cluster grows too large (above 20k), n_neighbors_extra becomes a float (e.g. 67.321) and I get the error above.
Locally, I escaped this error by casting n_neighbors_extra to an int inside the function, but I am not sure whether this is a proper fix in terms of the quality of the solution.
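For reference, the local patch described above amounts to something like this inside generate_pair (a sketch only; whether this is the right long-term fix is exactly the open question):
n_neighbors_extra = int(min(n_neighbors + 50, n))            # cast to int before allocating
nbrs = np.zeros((n, n_neighbors_extra), dtype=np.int32)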
Is inverse_transform possible with PaCMAP?
I appreciate that it hasn't been implemented yet, but it can be very useful when assessing how well the model is capturing the subtleties of the input data. However, for certain techniques it's simply not possible.
If it is possible, I will open an enhancement request.
I'd like to use the current version from GitHub, but there is no package metadata / setup.py checked in to the repo :(
It is definitely in the version on PyPI, though.
Ideally, this would work:
pip install git+https://github.com/YingfanWang/PaCMAP