umato's Introduction

UMATO

Uniform Manifold Approximation with Two-phase Optimization


Uniform Manifold Approximation with Two-phase Optimization (UMATO) is a dimensionality reduction technique that preserves both the global and the local structure of high-dimensional data. Most existing dimensionality reduction algorithms focus on only one of these two aspects, which can lead to overlooking or misinterpreting important global patterns in the data. Moreover, existing algorithms suffer from instability. To address these issues, UMATO uses a two-phase optimization: global optimization followed by local optimization. First, we obtain the global structure by selecting and optimizing the hub points. Next, we initialize and optimize the other points using the nearest neighbor graph. Our experiments with one synthetic and three real-world datasets show that UMATO can outperform baseline algorithms such as PCA, t-SNE, Isomap, UMAP, LAMP, and PaCMAP in terms of accuracy, stability, and scalability.

System Requirements

  • Python 3.9 or greater
  • scikit-learn
  • numpy
  • scipy
  • numba
  • pandas (to read csv files)

Installation

UMATO is available via pip.

pip install umato

A minimal usage example:

import umato
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
emb = umato.UMATO(hub_num=50).fit_transform(X)

API

Default Class: UMATO

The umato package exposes a single class, UMATO. UMATO shares many parameters with UMAP; for more details, please refer to the UMAP API.

class UMATO(BaseEstimator):
    def __init__(
        self,
        n_neighbors=50,
        min_dist=0.1,
        n_components=2,
        hub_num=300,
        metric="euclidean",
        global_n_epochs=None,
        local_n_epochs=None,
        global_learning_rate=0.0065,
        local_learning_rate=0.01,
        spread=1.0,
        low_memory=False,
        set_op_mix_ratio=1.0,
        local_connectivity=1.0,
        gamma=0.1,
        negative_sample_rate=5,
        random_state=None,
        angular_rp_forest=False,
        init="pca",
        verbose=False
    ):

Parameters

n_neighbors = 50

The size of the local neighborhood (defined by the number of nearby sample points) used for manifold approximation. Bigger values lead to a more comprehensive view of the manifold, whereas smaller values retain more local information. Generally, values should fall within the range of 2 to 100. It must be an integer greater than 1. Same effect as it does in UMAP.

min_dist = 0.1

The minimum distance between embedded points. Same effect as it does in UMAP.

n_components = 2

The dimensionality of the output embedding space. It must be a positive integer. This defaults to 2, but can reasonably be set to any integer value below the number of the original dataset dimensions.

hub_num = 300

Number of hub points to use for the embedding. It must be a positive integer or -1 (None). A positive value must be smaller than the number of data points; otherwise a ValueError is raised.
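Since the default of 300 hub points is too large for small datasets (for example, the 150-sample iris data), it can help to pick the value from the dataset size. The following is a minimal sketch of one way to do that. The helper name `safe_hub_num` and the fallback of half the dataset are illustrative assumptions, not part of UMATO, and the sketch ignores the special value -1.

```python
def safe_hub_num(n_samples, requested=300):
    """Return a hub count strictly smaller than n_samples.

    Hypothetical helper (not part of UMATO): keeps the requested value
    when it is valid, otherwise falls back to half of the dataset.
    """
    if n_samples < 2:
        raise ValueError("need at least 2 samples to choose hub points")
    if requested < 1:
        raise ValueError("requested hub count must be a positive integer")
    if requested < n_samples:
        return requested
    # Assumption: half of the points is an arbitrary fallback, not a tuned value.
    return max(1, n_samples // 2)


print(safe_hub_num(150))    # default of 300 is too large here -> 75
print(safe_hub_num(70000))  # default fits -> 300
```

The returned value could then be passed as `hub_num` when constructing `umato.UMATO`. Whether the fallback gives a good layout on a given dataset still has to be checked empirically.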

metric = "euclidean"

The metric to use to compute distances in high dimensional space. If a string is passed it must match a valid predefined metric. If a general metric is required a function that takes two 1d arrays and returns a float can be provided. For performance purposes it is required that this be a numba jit’d function.

The default distance function is Euclidean. The list of available options can be found in the source code.

global_n_epochs = None

The number of epochs for the global optimization phase. It must be a positive integer of at least 10. If not defined, it will be set to 100.

local_n_epochs = None 

The number of epochs for the local optimization phase. It must be a positive integer of at least 10. If not defined, it will be set to 50.
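The default handling described for the two epoch parameters can be sketched as follows. This is only an illustration of the documented rules (None becomes 100 for the global phase and 50 for the local phase, and explicit values must be at least 10); the function name `resolve_epochs` is hypothetical and UMATO performs its own checks internally.

```python
def resolve_epochs(global_n_epochs=None, local_n_epochs=None):
    """Apply the documented defaults (100 global, 50 local) and the >= 10 rule.

    Hypothetical helper for illustration; it is not UMATO's own code.
    """
    resolved = {
        "global_n_epochs": 100 if global_n_epochs is None else global_n_epochs,
        "local_n_epochs": 50 if local_n_epochs is None else local_n_epochs,
    }
    for name, value in resolved.items():
        if not isinstance(value, int) or value < 10:
            raise ValueError(f"{name} must be an integer of at least 10, got {value!r}")
    return resolved["global_n_epochs"], resolved["local_n_epochs"]


print(resolve_epochs())                    # both left undefined -> (100, 50)
print(resolve_epochs(local_n_epochs=200))  # only the local phase overridden -> (100, 200)
```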

global_learning_rate = 0.0065

The learning rate for the global optimization phase. It must be positive.

local_learning_rate = 0.01

The learning rate for the local optimization phase. It must be positive.

spread = 1.0

Determines the scale at which embedded points will be spread out. Higher values will lead to more separation between points. min_dist must be less than or equal to spread.

low_memory = False

Whether to use a lower-memory, but more computationally expensive, approach to construct the k-nearest neighbor graph.

set_op_mix_ratio = 1.0

Interpolation parameter for combining the global and local structures in the fuzzy simplicial set. It must be between 0.0 and 1.0. A value of 1.0 will use a pure fuzzy union, while 0.0 will use a pure fuzzy intersection.
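To make the interpolation concrete, here is a small sketch for a single pair of membership strengths. It assumes the convention used by UMAP, where the fuzzy union of strengths a and b is a + b - a*b and the fuzzy intersection is a*b; whether UMATO applies exactly this formula internally is an assumption, and the function name `mix_memberships` is hypothetical.

```python
def mix_memberships(a, b, set_op_mix_ratio=1.0):
    """Blend fuzzy union and fuzzy intersection of two membership strengths.

    a, b: membership strengths in [0, 1] for the directed edges i->j and j->i.
    Assumption: union = a + b - a*b and intersection = a*b, as in UMAP.
    """
    if not 0.0 <= set_op_mix_ratio <= 1.0:
        raise ValueError("set_op_mix_ratio must be between 0.0 and 1.0")
    union = a + b - a * b
    intersection = a * b
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection


print(mix_memberships(0.5, 0.25, 1.0))  # pure fuzzy union -> 0.625
print(mix_memberships(0.5, 0.25, 0.0))  # pure fuzzy intersection -> 0.125
print(mix_memberships(0.5, 0.25, 0.5))  # halfway between the two -> 0.375
```

With a ratio of 1.0 an edge survives if either point considers the other a neighbor, while a ratio of 0.0 keeps only edges on which both points agree.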

local_connectivity = 1.0

The local connectivity required – i.e. the number of nearest neighbors that should be assumed to be connected at a local level. The higher this value the more connected the manifold becomes locally.

gamma = 0.1

The gamma parameter used in local optimization for adjusting the balance between attractive and repulsive forces. It must be non-negative.

negative_sample_rate = 5

The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied, greater optimization cost, but slightly more accuracy.

random_state = None

The seed of the pseudo random number generator to use when shuffling the data.

angular_rp_forest = False

A boolean flag that indicates whether to use an angular random projection forest to initialize the approximate nearest neighbor search. This approach can offer improved speed, but it is primarily beneficial for metrics employing an angular-style distance, such as cosine, correlation, and others. For these metrics, angular forests are selected automatically: the flag is set to True if self.metric is in the set {"cosine", "correlation", "dice", "jaccard", "ll_dirichlet", "hellinger"}.

init = "pca"

The initialization method to use for the embedding. It must be a string or a numpy array. If a string is passed it must match one of the following: pca, random, spectral.

verbose = False

Whether to print information about the optimization process.

These parameters, along with their conditions and constraints, control various aspects of the embedding process, including the distance metric, optimization settings, and the structure of the resulting low-dimensional space.
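Several of the parameters above carry explicit constraints. The sketch below collects those documented constraints into a pre-flight check that can be run before constructing the estimator. The function name `check_umato_params` is hypothetical and the code is not UMATO's own validation; it only mirrors the rules stated in the parameter descriptions.

```python
def check_umato_params(
    n_neighbors=50,
    min_dist=0.1,
    n_components=2,
    global_learning_rate=0.0065,
    local_learning_rate=0.01,
    spread=1.0,
    set_op_mix_ratio=1.0,
    gamma=0.1,
):
    """Return a list of violated constraints (empty when all checks pass).

    Hypothetical helper mirroring the documented constraints only.
    """
    problems = []
    if not isinstance(n_neighbors, int) or n_neighbors <= 1:
        problems.append("n_neighbors must be an integer greater than 1")
    if not isinstance(n_components, int) or n_components < 1:
        problems.append("n_components must be a positive integer")
    if global_learning_rate <= 0:
        problems.append("global_learning_rate must be positive")
    if local_learning_rate <= 0:
        problems.append("local_learning_rate must be positive")
    if min_dist > spread:
        problems.append("min_dist must be less than or equal to spread")
    if not 0.0 <= set_op_mix_ratio <= 1.0:
        problems.append("set_op_mix_ratio must be between 0.0 and 1.0")
    if gamma < 0:
        problems.append("gamma must be non-negative")
    return problems


print(check_umato_params())                          # defaults pass -> []
print(check_umato_params(min_dist=2.0, gamma=-1.0))  # two violations reported
```

Returning a list instead of raising on the first failure makes it possible to report every problem in one pass, which is convenient when parameters come from a configuration file.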

Function fit

def fit(self, X):

The fit function embeds the input data X into a lower-dimensional space. After validating the input data and parameters and setting default values for optional arguments, it checks the data sparsity, checks whether the metric is supported by PyNNDescent, and computes the nearest neighbors accordingly. It then builds the k-nearest neighbor graph structure, runs the global optimization on the hub points, initializes the remaining points using the original hub information, and runs the local optimization. Finally, it embeds outliers and returns the fitted model.

Parameters

X : array, shape (n_samples, n_features) or (n_samples, n_samples)

If the metric is 'precomputed' X must be a square distance matrix. Otherwise it contains a sample per row. If the method is 'exact', X may be a sparse matrix of type 'csr', 'csc' or 'coo'.

Function fit_transform

def fit_transform(self, X):

Fit X into an embedded space and return that transformed output.

Parameters

X : array, shape (n_samples, n_features) or (n_samples, n_samples)

If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row.

Returns

X_new : array, shape (n_samples, n_components)

Embedding of the training data in low-dimensional space.

Citation

UMATO can be cited as follows:

@inproceedings{jeon2022vis,
  title={Uniform Manifold Approximation with Two-phase Optimization},
  author={Jeon, Hyeon and Ko, Hyung-Kwon and Lee, Soohyun and Jo, Jaemin and Seo, Jinwook},
  booktitle={2022 IEEE Visualization and Visual Analytics (VIS)},
  pages={80--84},
  year={2022},
  organization={IEEE}
}

Jeon, H., Ko, H. K., Lee, S., Jo, J., & Seo, J. (2022, October). Uniform Manifold Approximation with Two-phase Optimization. In 2022 IEEE Visualization and Visual Analytics (VIS) (pp. 80-84). IEEE.

umato's People

Contributors

dependabot[bot], hj-n, hunrotation, hyungkwonko, syphonarch, taehyun2017330

umato's Issues

MNIST: ValueError

There's an issue when running umato on the MNIST dataset.

import umato 
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)
X = mnist.data

emb = umato.UMATO(hub_num=20).fit_transform(X)

results in

ValueError: total data # (70000) != total embedded # (69992)!

README iris example doesn't work

Hello, first of all, thank you for the paper and package which I have enjoyed studying.

The README example:

import umato
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
emb = umato.UMATO(hub_num=300).fit_transform(X)

unfortunately gives a ValueError with the current version on PyPI (0.1.1):

ValueError: hub_num must be less than the number of data points

Which makes sense because there are only 150 items in the iris dataset. Unfortunately, I couldn't find a setting of n_neighbors and hub_num which gave a nice layout with other default parameters. Any suggestions for good values with smaller data?

Also, there is a minor typo in the README section about the init parameter: randobm should be random.

Example not working and mismatch between embedd and real data.

Hi. After installing the package manually (it is no longer available on pip), I tried to run the example, but it doesn't work.
When I try UMATO with my data, I receive this message:

ValueError: total data # (26170) != total embedded # (26109)!

I want to compare with the example in the README to find out what the problem is.

I hope you can help me.

Bests,

UMATO not in PyPi?

Current error:

ERROR: Could not find a version that satisfies the requirement umato (from versions: none)
ERROR: No matching distribution found for umato

Test and Benchmark Scripts not working

Hi!
I downloaded the repository, installed it via install.py, and tried to run test.py, benchmark.sh, and the visual benchmarks. Unfortunately, none of them worked out of the box on my system, which runs Python 3.10 via miniconda on an M1 Mac.

test.py has a problem with numba's function caching:

Traceback (most recent call last):
  File "/Users/nf/Python/umato/test.py", line 1, in <module>
    import umato
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/umato-0.0.4-py3.10.egg/umato/__init__.py", line 1, in <module>
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/umato-0.0.4-py3.10.egg/umato/umato_.py", line 28, in <module>
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/umato-0.0.4-py3.10.egg/umato/sparse.py", line 10, in <module>
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/umato-0.0.4-py3.10.egg/umato/utils.py", line 698, in <module>
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/numba/core/decorators.py", line 212, in wrapper
    disp.enable_caching()
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/numba/core/dispatcher.py", line 863, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/numba/core/caching.py", line 601, in __init__
    self._impl = self._impl_class(py_func)
  File "/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/numba/core/caching.py", line 337, in __init__
    raise RuntimeError("cannot cache function %r: no locator available "
RuntimeError: cannot cache function 'rdist': no locator available for file '/Users/nf/miniconda3/envs/umato/lib/python3.10/site-packages/umato-0.0.4-py3.10.egg/umato/utils.py

When running run-benchmark.sh, the whole script runs but complains about missing module specifications for all the methods, and no results are shown:

/Users/nf/miniconda3/envs/umato/bin/python: Error while finding module specification for 'evaluation.models.umap' (ModuleNotFoundError: No module named 'evaluation')

When running the svelte app for visual benchmarking as instructed:

Error: Package subpath './compiler.js' is not defined by "exports" in /Users/nf/Python/umato/visualization/node_modules/svelte/package.json

Conda environment:
conda list

# packages in environment at /Users/nikofleischer/miniconda3/envs/umato:
#
# Name                    Version                   Build  Channel
anyio                     3.6.2                    pypi_0    pypi
appnope                   0.1.3                    pypi_0    pypi
argon2-cffi               21.3.0                   pypi_0    pypi
argon2-cffi-bindings      21.2.0                   pypi_0    pypi
asttokens                 2.0.8                    pypi_0    pypi
attrs                     22.1.0                   pypi_0    pypi
babel                     2.10.3                   pypi_0    pypi
backcall                  0.2.0                    pypi_0    pypi
beautifulsoup4            4.11.1                   pypi_0    pypi
bleach                    5.0.1                    pypi_0    pypi
bokeh                     2.4.3                    pypi_0    pypi
bzip2                     1.0.8                h3422bc3_4    conda-forge
ca-certificates           2022.9.24            h4653dfc_0    conda-forge
certifi                   2022.9.24                pypi_0    pypi
cffi                      1.15.1                   pypi_0    pypi
charset-normalizer        2.1.1                    pypi_0    pypi
click                     8.1.3                    pypi_0    pypi
cloudpickle               2.2.0                    pypi_0    pypi
colorcet                  3.0.1                    pypi_0    pypi
contourpy                 1.0.5                    pypi_0    pypi
cycler                    0.11.0                   pypi_0    pypi
dask                      2022.10.0                pypi_0    pypi
datashader                0.14.2                   pypi_0    pypi
datashape                 0.5.2                    pypi_0    pypi
debugpy                   1.6.3                    pypi_0    pypi
decorator                 5.1.1                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
distributed               2022.10.0                pypi_0    pypi
entrypoints               0.4                      pypi_0    pypi
executing                 1.1.1                    pypi_0    pypi
fastjsonschema            2.16.2                   pypi_0    pypi
fcsparser                 0.2.4                    pypi_0    pypi
fonttools                 4.37.4                   pypi_0    pypi
fsspec                    2022.10.0                pypi_0    pypi
heapdict                  1.0.1                    pypi_0    pypi
holoviews                 1.15.1                   pypi_0    pypi
idna                      3.4                      pypi_0    pypi
ipykernel                 6.16.1                   pypi_0    pypi
ipython                   8.5.0                    pypi_0    pypi
ipython-genutils          0.2.0                    pypi_0    pypi
jedi                      0.18.1                   pypi_0    pypi
jinja2                    3.1.2                    pypi_0    pypi
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
json5                     0.9.10                   pypi_0    pypi
jsonschema                4.16.0                   pypi_0    pypi
jupyter-client            7.4.3                    pypi_0    pypi
jupyter-core              4.11.2                   pypi_0    pypi
jupyter-server            1.21.0                   pypi_0    pypi
jupyterlab                3.4.8                    pypi_0    pypi
jupyterlab-pygments       0.2.2                    pypi_0    pypi
jupyterlab-server         2.16.1                   pypi_0    pypi
kiwisolver                1.4.4                    pypi_0    pypi
libblas                   3.9.0           16_osxarm64_openblas    conda-forge
libcblas                  3.9.0           16_osxarm64_openblas    conda-forge
libcxx                    14.0.6               h2692d47_0    conda-forge
libffi                    3.4.2                h3422bc3_5    conda-forge
libgfortran               5.0.0           11_3_0_hd922786_25    conda-forge
libgfortran5              11.3.0              hdaf2cc0_25    conda-forge
liblapack                 3.9.0           16_osxarm64_openblas    conda-forge
libllvm11                 11.1.0               hfa12f05_4    conda-forge
libopenblas               0.3.21          openmp_hc731615_3    conda-forge
libsqlite                 3.39.4               h76d750c_0    conda-forge
libzlib                   1.2.13               h03a7124_4    conda-forge
llvm-openmp               14.0.4               hd125106_0    conda-forge
llvmlite                  0.39.1          py310h1e34944_0    conda-forge
locket                    1.0.0                    pypi_0    pypi
markdown                  3.4.1                    pypi_0    pypi
markupsafe                2.1.1                    pypi_0    pypi
matplotlib                3.6.1                    pypi_0    pypi
matplotlib-inline         0.1.6                    pypi_0    pypi
mistune                   2.0.4                    pypi_0    pypi
msgpack                   1.0.4                    pypi_0    pypi
multipledispatch          0.6.0                    pypi_0    pypi
nbclassic                 0.4.5                    pypi_0    pypi
nbclient                  0.7.0                    pypi_0    pypi
nbconvert                 7.2.2                    pypi_0    pypi
nbformat                  5.7.0                    pypi_0    pypi
ncurses                   6.3                  h07bb92c_1    conda-forge
nest-asyncio              1.5.6                    pypi_0    pypi
notebook                  6.5.1                    pypi_0    pypi
notebook-shim             0.2.0                    pypi_0    pypi
npm                       0.1.1                    pypi_0    pypi
numba                     0.56.3          py310h3124f1e_0    conda-forge
numpy                     1.23.4          py310h5d7c261_0    conda-forge
openssl                   3.0.5                h03a7124_2    conda-forge
optional-django           0.1.0                    pypi_0    pypi
packaging                 21.3                     pypi_0    pypi
pandas                    1.5.1           py310h2b830bf_0    conda-forge
pandocfilters             1.5.0                    pypi_0    pypi
panel                     0.14.0                   pypi_0    pypi
param                     1.12.2                   pypi_0    pypi
parso                     0.8.3                    pypi_0    pypi
partd                     1.3.0                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pickleshare               0.7.5                    pypi_0    pypi
pillow                    9.2.0                    pypi_0    pypi
pip                       22.3               pyhd8ed1ab_0    conda-forge
prometheus-client         0.15.0                   pypi_0    pypi
prompt-toolkit            3.0.31                   pypi_0    pypi
psutil                    5.9.3                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pure-eval                 0.2.2                    pypi_0    pypi
pycparser                 2.21                     pypi_0    pypi
pyct                      0.4.8                    pypi_0    pypi
pygments                  2.13.0                   pypi_0    pypi
pynndescent               0.5.7                    pypi_0    pypi
pyparsing                 3.0.9                    pypi_0    pypi
pyrsistent                0.18.1                   pypi_0    pypi
python                    3.10.6          hae75cb6_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.10                    2_cp310    conda-forge
pytz                      2022.5             pyhd8ed1ab_0    conda-forge
pyviz-comms               2.2.1                    pypi_0    pypi
pyyaml                    6.0                      pypi_0    pypi
pyzmq                     24.0.1                   pypi_0    pypi
readline                  8.1.2                h46ed386_0    conda-forge
requests                  2.28.1                   pypi_0    pypi
scikit-learn              1.1.2           py310h3d7afdd_0    conda-forge
scipy                     1.9.2           py310ha0d8a01_0    conda-forge
send2trash                1.8.0                    pypi_0    pypi
setuptools                65.5.0             pyhd8ed1ab_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
sniffio                   1.3.0                    pypi_0    pypi
sortedcontainers          2.4.0                    pypi_0    pypi
soupsieve                 2.3.2.post1              pypi_0    pypi
stack-data                0.5.1                    pypi_0    pypi
tblib                     1.7.0                    pypi_0    pypi
terminado                 0.16.0                   pypi_0    pypi
threadpoolctl             3.1.0              pyh8a188c0_0    conda-forge
tinycss2                  1.2.1                    pypi_0    pypi
tk                        8.6.12               he1e0b03_0    conda-forge
tomli                     2.0.1                    pypi_0    pypi
toolz                     0.12.0                   pypi_0    pypi
tornado                   6.1                      pypi_0    pypi
tqdm                      4.64.1                   pypi_0    pypi
traitlets                 5.5.0                    pypi_0    pypi
typing-extensions         4.4.0                    pypi_0    pypi
tzdata                    2022e                h191b570_0    conda-forge
umap-learn                0.5.3                    pypi_0    pypi
urllib3                   1.26.12                  pypi_0    pypi
wcwidth                   0.2.5                    pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
websocket-client          1.4.1                    pypi_0    pypi
wget                      3.2                      pypi_0    pypi
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
xarray                    2022.10.0                pypi_0    pypi
xz                        5.2.6                h57fd34a_0    conda-forge
zict                      2.2.0                    pypi_0    pypi
