awslabs / amazon-denseclus

Clustering for mixed-type data

Home Page: https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/

License: MIT No Attribution

Makefile 0.25% Python 6.09% Jupyter Notebook 93.66%

clustering embedding machinelearning-python python

amazon-denseclus's Issues

DenseClus on a large data set

Hi! 😊
I would like to ask for a little help.
I have a large data set (over 6 million rows and 20 columns) with mixed data types.

The idea was to use DenseClus
(a mix of UMAP embeddings for the different data types, combined with HDBSCAN).
I ran it on a subset of 100K rows,
but an out-of-memory (OOM) error appeared almost immediately.

My core questions are:

Is there an option for fitting DenseClus on a data set of this size?
Are there any recommendations regarding parameters?
There seems to be no way to transform unseen data; how can the model be applied to the entire data set?
As I work on Databricks, are there any recommendations regarding the memory and performance of the instance?
What is the strategy for dealing with this kind of task?

Any additional advice is welcome. 🤗
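
One possible workaround, sketched here under assumptions rather than taken from the DenseClus docs: cluster a sample that fits in memory, train a surrogate classifier on the sample's cluster labels, and let that classifier label the remaining rows in chunks. The data, the labels, and the helper `predict_in_chunks` below are all hypothetical stand-ins; in practice the labels would come from the clustering fitted on the sample, and the features would need to be numeric (already encoded and scaled).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Stand-in for a large, already-preprocessed numeric feature matrix.
X_full = rng.normal(size=(5000, 4))

# 1) Draw a random sample small enough to cluster in memory.
sample_idx = rng.choice(len(X_full), size=500, replace=False)
X_sample = X_full[sample_idx]

# 2) Hypothetical cluster labels for the sample. They are faked from the sign
#    of the first feature so the sketch runs without DenseClus installed.
sample_labels = (X_sample[:, 0] > 0).astype(int)

# 3) Train a surrogate classifier that maps features to cluster labels.
surrogate = KNeighborsClassifier(n_neighbors=5).fit(X_sample, sample_labels)


def predict_in_chunks(model, X, chunk_size=1000):
    """Predict labels chunk by chunk so only one chunk is processed at a time."""
    parts = [model.predict(X[i:i + chunk_size]) for i in range(0, len(X), chunk_size)]
    return np.concatenate(parts)


# 4) Assign a cluster label to every row of the full data set.
full_labels = predict_in_chunks(surrogate, X_full)
print(np.bincount(full_labels))
```

HDBSCAN marks noise points with the label -1, so those rows may be worth dropping from the sample before training the surrogate. The quality of the propagated labels depends on how representative the sample is.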

Getting error on version 0.8.27?

yi10yosa56-algo-1-l1c8r | WARNING: Discarding https://files.pythonhosted.org/packages/32/bb/59a75bc5ac66a9b4f9b8f979e4545af0e98bb1ca4e6ae96b3b956b554223/hdbscan-0.8.27.tar.gz#sha256=e3a418d0d36874f7b6a1bf0b7461f3857fc13a525fd48ba34caed2fe8973aa26 (from https://pypi.org/simple/hdbscan/). Command errored out with exit status 1: /miniconda3/bin/python3 /miniconda3/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /home/model-server/tmp/pip-build-env-kjaud8c5/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel cython numpy Check the logs for full command output.
yi10yosa56-algo-1-l1c8r | ERROR: Could not find a version that satisfies the requirement hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | ERROR: No matching distribution found for hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | Traceback (most recent call last):
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 19, in <module>
yi10yosa56-algo-1-l1c8r |     install("hdbscan==0.8.27")
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 13, in install
yi10yosa56-algo-1-l1c8r |     stdout=open(os.devnull, "wb"),
yi10yosa56-algo-1-l1c8r |   File "/miniconda3/lib/python3.7/subprocess.py", line 363, in check_call
yi10yosa56-algo-1-l1c8r |     raise CalledProcessError(retcode, cmd)

Is this right? It looks like the latest HDBSCAN version is 0.8.26.
https://github.com/scikit-learn-contrib/hdbscan/releases
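
Reading the log above, pip did find an `hdbscan-0.8.27.tar.gz` file but discarded it after the build step (`setuptools wheel cython numpy`) exited with status 1, so the "no matching distribution" message may reflect a failed source build rather than a missing release. The traceback also shows the install helper sending stdout to `os.devnull`, which hides the details. A minimal sketch of a helper that keeps pip's output follows; `build_pip_command` and this version of `install` are illustrative names, not the code from `preprocessing.py`.

```python
import subprocess
import sys


def build_pip_command(requirement):
    """Return the pip command for one requirement, using the current interpreter."""
    return [sys.executable, "-m", "pip", "install", requirement]


def install(requirement):
    """Run pip and surface its output in the error instead of discarding it."""
    result = subprocess.run(
        build_pip_command(requirement),
        capture_output=True,  # available from Python 3.7
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pip failed for {requirement!r}:\n{result.stdout}\n{result.stderr}"
        )
    return result.stdout
```

Calling `install("hdbscan==0.8.27")` with this helper would put the full build log into the exception message, which should show why the source build failed.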

Error while importing

from denseclus import DenseClus

TypeError Traceback (most recent call last)
C:\Temp\ipykernel_11700\3655304814.py in <cell line: 1>()
----> 1 from denseclus import DenseClus

~\anaconda3\lib\site-packages\denseclus\__init__.py in
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
3
4 __version__ = "0.0.19"

~\anaconda3\lib\site-packages\denseclus\DenseClus.py in
3 import warnings
4
----> 5 import hdbscan
6 import numpy as np
7 import pandas as pd

~\anaconda3\lib\site-packages\hdbscan\__init__.py in
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage
3 from .validity import validity_index
4 from .prediction import (approximate_predict,
5 membership_vector,

~\anaconda3\lib\site-packages\hdbscan\hdbscan_.py in
507 leaf_size=40,
508 algorithm="best",
--> 509 memory=Memory(cachedir=None, verbose=0),
510 approx_min_span_tree=True,
511 gen_min_span_tree=False,

TypeError: __init__() got an unexpected keyword argument 'cachedir'

What has gone wrong here? How can I fix this? I am using anaconda3.
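
The failing line is hdbscan calling `Memory(cachedir=None, verbose=0)`, and the error says the constructor does not accept `cachedir`. Assuming that `Memory` is joblib's, one plausible cause is a mismatch between the installed hdbscan and joblib versions. A small sketch for listing what is installed in the active environment (requires Python 3.8+ for `importlib.metadata`; the package list is just a guess at what is relevant):

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that appear in the tracebacks on this page.
for name in ("hdbscan", "joblib", "numba", "numpy", "umap-learn"):
    print(f"{name}: {installed_version(name)}")
```

Posting this output with the issue makes it easier to tell whether upgrading hdbscan or pinning an older joblib is the right fix.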

Tuning for HDBSCAN parameters

It would be desirable to include an example of how to tune the HDBSCAN parameters, because little has been published on that to date.

Will there be a transform function like UMAP has?

For the full embedding and clustering job, it would be very useful to have a transform function, similar to UMAP's, so that new data can be transformed and clustered without repeating the embedding.

Question.

System error while importing

SystemError Traceback (most recent call last)
Cell In[5], line 1
----> 1 from denseclus import DenseClus

File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\__init__.py:1
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
4 __version__ = "0.0.19"

File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\DenseClus.py:8
6 import numpy as np
7 import pandas as pd
----> 8 import umap.umap_ as umap
9 from hdbscan import flat
10 from sklearn.base import BaseEstimator, ClassifierMixin

File ~\anaconda3\envs\DenClu\lib\site-packages\umap\__init__.py:2
1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
4 try:
5 with catch_warnings():

File ~\anaconda3\envs\DenClu\lib\site-packages\umap\umap_.py:28
26 from scipy.sparse import tril as sparse_tril, triu as sparse_triu
27 import scipy.sparse.csgraph
---> 28 import numba
30 import umap.distances as dist
32 import umap.sparse as sparse

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\__init__.py:43
39 from numba.core.decorators import (cfunc, generated_jit, jit, njit, stencil,
40 jit_module)
42 # Re-export vectorize decorators and the thread layer querying function
---> 43 from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
44 get_num_threads, set_num_threads)
46 # Re-export Numpy helpers
47 from numba.np.numpy_support import carray, farray, from_dtype

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\__init__.py:3
1 # -*- coding: utf-8 -*-
----> 3 from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
4 from numba.np.ufunc._internal import PyUFunc_None, PyUFunc_Zero, PyUFunc_One
5 from numba.np.ufunc import _internal, array_exprs

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\decorators.py:3
1 import inspect
----> 3 from numba.np.ufunc import _internal
4 from numba.np.ufunc.parallel import ParallelUFuncBuilder, ParallelGUFuncBuilder
6 from numba.core.registry import DelayedRegistry

SystemError: initialization of _internal failed without raising an exception

Just when I thought I had fixed this problem, I got another error while importing DenseClus. Any idea how to fix it?

Using DenseClus with numeric-only data

How can I use DenseClus with numeric-only data?

Also: in utils.py I see that all categorical data is one-hot encoded using pd.get_dummies(), which is appropriate for nominal categorical data but not for ordinal categorical data.
My question: how can I use DenseClus on ordinal categorical data?
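
One common workaround for the ordinal case, independent of DenseClus, is to map the ordered categories to integer codes before handing the frame over, so the column is no longer a string column and would not be one-hot encoded. Whether integer codes suit DenseClus's numeric preprocessing is an assumption to check against your data. The column names and category order below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [32.0, 58.5, 41.2, 77.9],
    "size": ["small", "large", "medium", "small"],  # ordinal
    "color": ["red", "blue", "red", "green"],       # nominal
})

# Ordinal column: encode the explicit order as integer codes 0, 1, 2.
size_order = ["small", "medium", "large"]
df["size"] = pd.Categorical(df["size"], categories=size_order, ordered=True).codes

# Nominal column: one-hot encoding, as pd.get_dummies() does.
nominal = pd.get_dummies(df[["color"]])

print(df["size"].tolist())
print(list(nominal.columns))
```

Integer codes treat the gaps between adjacent levels as equal, which may or may not match what the categories mean.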

GPU Support

It would be desirable to have the option to select RAPIDS and run on GPUs so that DenseClus can scale to larger datasets.

SageMaker NB Example

A notebook example that works with SageMaker

Prediction on new points

Hello,

Firstly, many thanks for the great DenseClus project.
I was wondering if it is possible to make cluster prediction on a new point, like the approximate_predict(clusterer, points_to_predict) in hdbscan.

I know DenseClus() has the boolean parameter 'prediction_data', but I didn't find how to make the predictions on new points.

Many thanks for your help.

Package doesn't work on Sagemaker

It would be nice if this Amazon package worked on Amazon SageMaker.
I can install regular hdbscan using conda from conda-forge, but both pip and conda fail to install Amazon-DenseClus. This is with the Data Science 2.0 kernel in SageMaker Studio.

Is there a reason you didn't use DensMAP (e.g. `dens_lambda=1.0`) within DenseClus?

See https://umap-learn.readthedocs.io/en/latest/densmap_demo.html
@faris-k

Dependence of clustering results on random seed

When testing the DenseClus algorithm on a mixed-type dataset, I found that the clustering results depend strongly on the random seed used.

My question: given this fact, what can I do to get reliable cluster results? My initial approach would be to aim for stable clustering results, but how can I do that?
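
A first step could be to measure the instability rather than eyeball it. The sketch below compares the label vectors from several runs with the adjusted Rand index (ARI), which ignores how clusters are numbered; a mean close to 1.0 means the runs agree. The three label lists are made-up stand-ins for the labels produced with different seeds, and `mean_pairwise_ari` is an illustrative helper name.

```python
from itertools import combinations

from sklearn.metrics import adjusted_rand_score


def mean_pairwise_ari(labelings):
    """Average ARI over all pairs of label vectors (1.0 = identical partitions)."""
    pairs = list(combinations(labelings, 2))
    if not pairs:
        raise ValueError("need at least two labelings to compare")
    return sum(adjusted_rand_score(a, b) for a, b in pairs) / len(pairs)


# Hypothetical labels from three runs with different random seeds.
runs = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same partition as the first run, labels permuted
    [0, 0, 1, 1, 1, 2],  # one point assigned differently
]

score = mean_pairwise_ari(runs)
print(f"mean pairwise ARI: {score:.3f}")
```

Tracking this score while varying parameters such as `n_neighbors` or the minimum cluster size is one way to look for settings under which the clustering is less seed-dependent.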

Suppress Numba deprecation warnings

Either show once or allow for optional suppression.
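
Until the package handles this itself, the warnings can be filtered on the caller's side. This sketch uses the warning classes that Numba exposes in `numba.core.errors`; the wrapper function `suppress_numba_deprecations` is an illustrative name, and it returns False if those classes cannot be imported.

```python
import warnings


def suppress_numba_deprecations():
    """Ignore Numba deprecation warnings; return False if Numba is unavailable."""
    try:
        from numba.core.errors import (
            NumbaDeprecationWarning,
            NumbaPendingDeprecationWarning,
        )
    except ImportError:
        return False
    warnings.simplefilter("ignore", category=NumbaDeprecationWarning)
    warnings.simplefilter("ignore", category=NumbaPendingDeprecationWarning)
    return True


# Call once, before importing code that triggers the warnings.
suppress_numba_deprecations()
```

Replacing `"ignore"` with `"once"` would show each distinct warning a single time instead of hiding it.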

Fit Predict method

Currently, no fit-predict method is supported. This behaviour is desirable and would probably require a Parametric UMAP base or some other workaround.

Labels for the "Trustworthiness at K" plot are misaligned

Trustworthiness at K

Unable to cluster sentence embeddings (vectors)?

Hi, I was wondering if you could kindly clarify whether vector data is accepted in this library, or whether it only accepts categorical and numerical data?

CD Deploys to PyPi but show as failed

It looks like the GitHub CD workflow deploys a new release to PyPI, but the run shows up as an error in Actions.

What preprocessing is required for a mixed-type dataset (continuous and categorical)?

Hello DenseClus developers: I have one question regarding preprocessing.

If my dataset is a combination of continuous (numerical) and categorical values, what are the steps to use this repo? Do I need to preprocess my data before feeding it to DenseClus?

Thank you!

DenseClus and missing values

Does DenseClus tolerate missing values in the dataset?
As far as I can see from the source code, I expect that missing values have to be imputed before the data is fed into DenseClus, but I would like that to be confirmed.
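
Pending confirmation, a conservative approach is to check for missing values and impute them before fitting. The sketch below uses a simple median/mode imputation in pandas on a made-up frame; it is one basic strategy, not a statement about what DenseClus does internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "plan": ["basic", "pro", None, "basic"],
})

# Count missing values per column before deciding how to handle them.
print(df.isna().sum())

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Numeric columns: fill with the column median.
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with the most frequent value.
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

For categorical columns, adding an explicit "missing" category instead of the mode is an alternative when the absence of a value is itself informative.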

UMAP hyperparameters question

There are three groups of UMAP hyperparameters. When should we use the categorical and numerical dicts, and when the combined dict?

default_umap_params = {
    "categorical": {
        # jaccard is an option but only takes sparse input
        "metric": "hamming",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "numerical": {
        "metric": "l2",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "combined": {
        "n_neighbors": 30,
        "min_dist": 0.0,
        "n_components": 5,
    },
}
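
Judging only by the key names (an assumption, not confirmed in this thread), each group would configure one embedding step: the categorical features, the numerical features, and the combined result. If so, a group can be overridden on its own while the other defaults stay in place. The helper `merge_params` below is hypothetical, and how the merged dict is passed to DenseClus is not shown in the issue.

```python
import copy

# Defaults as quoted in the issue above.
default_umap_params = {
    "categorical": {"metric": "hamming", "n_neighbors": 30, "n_components": 5, "min_dist": 0.0},
    "numerical": {"metric": "l2", "n_neighbors": 30, "n_components": 5, "min_dist": 0.0},
    "combined": {"n_neighbors": 30, "min_dist": 0.0, "n_components": 5},
}


def merge_params(defaults, overrides):
    """Return a copy of defaults with per-group overrides applied."""
    merged = copy.deepcopy(defaults)
    for group, values in overrides.items():
        merged.setdefault(group, {}).update(values)
    return merged


# Change only the neighbourhood size used for the numerical embedding.
custom = merge_params(default_umap_params, {"numerical": {"n_neighbors": 15}})
print(custom["numerical"])
```

The deep copy keeps `default_umap_params` unchanged, so several variants can be built from the same defaults when comparing settings.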
