Hi, I am very excited to try to assess the quality of my embeddings using EMBEDR.


Assessing UMAP embedding quality and sweeping across n_neighbours parameter about embedr HOT 3 OPEN

ohickl commented on July 23, 2024

Assessing UMAP embedding quality and sweeping across n_neighbours parameter


Comments (3)

ejohnson643 commented on July 23, 2024

Hi Oskar!

If you want to run the code using UMAP, it will ignore the perplexity parameter. I will make this clear in the documentation! Thanks for the question!


ohickl commented on July 23, 2024

Thanks! I was a bit confused by this:

perplexity: float
        Similar to the perplexity parameter from van der Maaten (2008); sets 
        the scale of the affinity kernel used to measure embedding quality.  
        NOTE: In the EMBEDR algorithm, this parameter is used EVEN WHEN NOT 
        USING t-SNE!  Default is 30

in the EMBEDR class.


ejohnson643 commented on July 23, 2024

Oh, of course! The perplexity parameter does double duty in that it is involved with how the embedding quality is assessed as well as in running t-SNE. That is, currently, the quality of an embedding is calculated as the similarity of two data-affinity matrices, one from the original data space and one from the embedded space. The high-dimensional affinity matrix depends on a perplexity parameter, perp_aff, which needs to be set somehow.
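To make the role of perp_aff concrete, here is a minimal sketch of how a perplexity value is usually turned into a kernel width in t-SNE-style affinities (van der Maaten, 2008): for each point, the Gaussian precision is tuned by bisection until the entropy of its affinity row equals log(perplexity). This is an illustration in plain Python, not EMBEDR's or openTSNE's actual implementation, and the function name `calibrate_affinities` is hypothetical.

```python
import math


def calibrate_affinities(sq_dists, perplexity, tol=1e-5, max_iter=200):
    """Return one row of Gaussian affinities whose perplexity matches the target.

    sq_dists: squared distances from one sample to the other samples.
    The precision beta = 1 / (2 * sigma^2) is found by bisection.
    """
    target = math.log(perplexity)      # target entropy (in nats)
    d_min = min(sq_dists)              # shift distances for numerical stability
    lo, hi, beta = 0.0, math.inf, 1.0
    p = []
    for _ in range(max_iter):
        weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
        total = sum(weights)
        p = [w / total for w in weights]
        entropy = -sum(pi * math.log(pi) for pi in p if pi > 0.0)
        if abs(entropy - target) < tol:
            break
        if entropy > target:
            # Kernel is too wide: raise the precision (narrow the kernel).
            lo = beta
            beta = beta * 2.0 if math.isinf(hi) else 0.5 * (lo + hi)
        else:
            # Kernel is too narrow: lower the precision (widen the kernel).
            hi = beta
            beta = 0.5 * (lo + hi)
    return p
```

A larger perplexity gives a flatter row that spreads affinity over more neighbors, which is the sense in which perp_aff sets the neighborhood size at which quality is measured.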

If you use the same value for perp_aff throughout a sweep of the UMAP n_neighbors parameter, you are examining the quality with which neighborhoods of a size set by perp_aff are embedded by UMAP as UMAP is allowed to use more or fewer neighbors to actually carry out the embedding. This is akin to fixing the resolution of your "quality ruler" and then examining the different conditions created by UMAP. I don't think there will be anything wrong with this.
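A sweep with a fixed "quality ruler" could be sketched roughly as follows. The EMBEDR keyword arguments are the ones used in the example further down in this reply; the helper names `sweep_n_neighbors` and `make_umap_embedr` are hypothetical, and the embedder is built through a factory so that the loop itself does not depend on EMBEDR being importable.

```python
def sweep_n_neighbors(X, n_neighbors_values, make_embedder, perp_aff=30):
    """Fit one embedder per n_neighbors value while holding perp_aff fixed.

    make_embedder(perp_aff, n_neighbors) must return an object with .fit(X).
    Returns a dict mapping each n_neighbors value to its fitted embedder.
    """
    fitted = {}
    for k in n_neighbors_values:
        embedder = make_embedder(perp_aff, k)
        embedder.fit(X)
        fitted[k] = embedder
    return fitted


def make_umap_embedr(perp_aff, n_neighbors):
    """Build an EMBEDR object that runs UMAP (assumes embedr is installed)."""
    from embedr import EMBEDR  # imported lazily so the sweep loop stays generic

    return EMBEDR(perplexity=perp_aff,  # fixed affinity perplexity
                  dimred_alg="UMAP",
                  dimred_params={'n_neighbors': n_neighbors},
                  n_data_embed=1,
                  n_null_embed=2)


# Example usage (not run here):
# results = sweep_n_neighbors(X, [5, 15, 50, 150], make_umap_embedr, perp_aff=30)
# results[15].plot()
```

Because perp_aff is the same for every run, differences between the fitted objects reflect only how UMAP's n_neighbors changes the embedding, not a change in how quality is measured.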

Alternatively, you can change perp_aff to match the neighborhood size that t-SNE/UMAP is operating at. This is easy to do with t-SNE because perp_aff can be set to the same value as the canonical perplexity. However, to do this with UMAP, we need to map perp_aff to some sort of k_effective number of nearest neighbors. I am currently working on implementing this.

However, if you're concerned after you've run your sweep that you've chosen the wrong perp_aff for some reason, you don't have to re-run everything, but you will have to hack the methods a bit. What you can do is something like the following:

from embedr import EMBEDR
import matplotlib.pyplot as plt
import numpy as np
from openTSNE.affinity import PerplexityBasedNN
import utility as utl

X = np.loadtxt("./Data/mnist2500_X.txt")

old_perp = 30
new_perp = 100
n_neighbors = 15  ## UMAP neighborhood size for this run (15 is UMAP's default)

n_jobs = -1
seed = 1
verbose = 5

n_data_embed = 1
n_null_embed = 2

fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))

## Initialize and fit the data like normal
UMAP_embed = EMBEDR(perplexity=old_perp,
                    dimred_params={'n_neighbors': n_neighbors},
                    # cache_results=False,  ## Turn off file caching.
                    dimred_alg="UMAP",
                    n_jobs=n_jobs,
                    random_state=seed,
                    verbose=verbose,
                    n_data_embed=n_data_embed,
                    n_null_embed=n_null_embed,
                    project_name='changing_perplexity_test')
UMAP_embed.fit(X)

## Let's see the results!
UMAP_embed.plot(ax=ax1, show_cbar=False)

## Calculate a new affinity matrix at the new perplexity
new_aff_mat = PerplexityBasedNN(X,
                                perplexity=new_perp,
                                n_jobs=n_jobs,
                                random_state=seed,
                                verbose=verbose)

## Calculate null affinity matrices at the new perplexity
new_null_mat = {}
for nNo in range(n_null_embed):

    null_X = utl.generate_nulls(X, seed=seed + nNo).squeeze()
    nP = PerplexityBasedNN(null_X,
                           perplexity=new_perp,
                           n_jobs=n_jobs,
                           random_state=seed,
                           verbose=verbose)

    new_null_mat[nNo] = nP

## Reset the affinity matrices in the method
UMAP_embed._affmat = new_aff_mat
UMAP_embed._null_affmat = new_null_mat

## Recalculate the p-Values and quality scores.
UMAP_embed.do_cache = False  ## Need to turn off file caching to force the
                             ## method to recalculate.
UMAP_embed._calc_EES()

## Let's see the results!
UMAP_embed.plot(ax=ax2)

ax1.set_title(f"Affinity Perplexity = {old_perp}")
ax2.set_title(f"Affinity Perplexity = {new_perp}")
ax1.set_xticklabels([])
ax1.set_yticklabels([])
ax2.set_xticklabels([])
ax2.set_yticklabels([])

fig.tight_layout()

plt.show()

I'm going to leave this whole thing open as something to prioritize in the next version because this should be easier! Also, this really underscores how these parameters should be separated semantically in the code. In my reply, I invented perp_aff, but I'll actually make this a more obvious parameter in the code!

TLDR: You can probably leave perplexity fixed, but future methods will automatically update it depending on the dimensionality-reduction algorithm (DRA).

