awslabs / amazon-denseclus
Clustering for mixed-type data
License: MIT No Attribution
Hi, hi.
I would kindly ask for a little bit of help.
I have a large data set (over 6 million rows and 20 columns) with mixed data types.
The idea was to use DenseClus
(a mix of UMAP embeddings for different data types with HDBSCAN).
I ran it on a subset of 100K rows,
but an OOM error popped up very soon.
My core questions would be?
Any additional advice is welcome.
yi10yosa56-algo-1-l1c8r | WARNING: Discarding https://files.pythonhosted.org/packages/32/bb/59a75bc5ac66a9b4f9b8f979e4545af0e98bb1ca4e6ae96b3b956b554223/hdbscan-0.8.27.tar.gz#sha256=e3a418d0d36874f7b6a1bf0b7461f3857fc13a525fd48ba34caed2fe8973aa26 (from https://pypi.org/simple/hdbscan/). Command errored out with exit status 1: /miniconda3/bin/python3 /miniconda3/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /home/model-server/tmp/pip-build-env-kjaud8c5/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel cython numpy Check the logs for full command output.
yi10yosa56-algo-1-l1c8r | ERROR: Could not find a version that satisfies the requirement hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | ERROR: No matching distribution found for hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | Traceback (most recent call last):
yi10yosa56-algo-1-l1c8r | File "/opt/ml/processing/input/code/preprocessing.py", line 19, in <module>
yi10yosa56-algo-1-l1c8r | install("hdbscan==0.8.27")
yi10yosa56-algo-1-l1c8r | File "/opt/ml/processing/input/code/preprocessing.py", line 13, in install
yi10yosa56-algo-1-l1c8r | stdout=open(os.devnull, "wb"),
yi10yosa56-algo-1-l1c8r | File "/miniconda3/lib/python3.7/subprocess.py", line 363, in check_call
yi10yosa56-algo-1-l1c8r | raise CalledProcessError(retcode, cmd)
Is this right? It looks like the latest HDBSCAN version is 0.8.26.
https://github.com/scikit-learn-contrib/hdbscan/releases
from denseclus import DenseClus
TypeError Traceback (most recent call last)
C:\Temp\ipykernel_11700\3655304814.py in <cell line: 1>()
----> 1 from denseclus import DenseClus
~\anaconda3\lib\site-packages\denseclus\__init__.py in
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
3
4 __version__ = "0.0.19"
~\anaconda3\lib\site-packages\denseclus\DenseClus.py in
3 import warnings
4
----> 5 import hdbscan
6 import numpy as np
7 import pandas as pd
~\anaconda3\lib\site-packages\hdbscan\__init__.py in
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage
3 from .validity import validity_index
4 from .prediction import (approximate_predict,
5 membership_vector,
~\anaconda3\lib\site-packages\hdbscan\hdbscan_.py in
507 leaf_size=40,
508 algorithm="best",
--> 509 memory=Memory(cachedir=None, verbose=0),
510 approx_min_span_tree=True,
511 gen_min_span_tree=False,
TypeError: __init__() got an unexpected keyword argument 'cachedir'
What has gone wrong here? How can I fix this? I am using anaconda3.
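The traceback shows hdbscan calling `Memory(cachedir=None, verbose=0)` and the constructor rejecting `cachedir`. Assuming `Memory` here is joblib's class, this usually points to an installed joblib that no longer accepts `cachedir` (newer releases use `location`), paired with an hdbscan build that still passes it. The sketch below is one way to check which keyword an installed class accepts. The helper `accepts_keyword` and the two stand-in classes are hypothetical and only illustrate the check.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        # Positional-only parameters cannot be passed by keyword.
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Stand-ins for an older and a newer cache class (illustrative only).
class OldStyleMemory:
    def __init__(self, cachedir=None, verbose=0):
        self.cachedir = cachedir


class NewStyleMemory:
    def __init__(self, location=None, verbose=0):
        self.location = location


old_ok = accepts_keyword(OldStyleMemory, "cachedir")
new_ok = accepts_keyword(NewStyleMemory, "cachedir")
print(old_ok, new_ok)

# Against a real environment (runs only if joblib is installed):
try:
    from joblib import Memory
    print("joblib.Memory accepts cachedir:", accepts_keyword(Memory, "cachedir"))
except ImportError:
    print("joblib is not installed in this environment")
```

If the real check reports that `cachedir` is not accepted, upgrading hdbscan or pinning an older joblib in the same environment is the likely way out, though which versions pair up correctly should be verified against the release notes of both packages.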
It would be desirable to include an example of how to tune HDBSCAN, because not much has been written on that to date.
For the full embedding/clustering job, it would be really good to have a transform function so that new data can be transformed and clustered without repeating the embedding, similar to UMAP's transform function.
SystemError Traceback (most recent call last)
Cell In[5], line 1
----> 1 from denseclus import DenseClus
File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\__init__.py:1
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
4 __version__ = "0.0.19"
File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\DenseClus.py:8
6 import numpy as np
7 import pandas as pd
----> 8 import umap.umap_ as umap
9 from hdbscan import flat
10 from sklearn.base import BaseEstimator, ClassifierMixin
File ~\anaconda3\envs\DenClu\lib\site-packages\umap\__init__.py:2
1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
4 try:
5 with catch_warnings():
File ~\anaconda3\envs\DenClu\lib\site-packages\umap\umap_.py:28
26 from scipy.sparse import tril as sparse_tril, triu as sparse_triu
27 import scipy.sparse.csgraph
---> 28 import numba
30 import umap.distances as dist
32 import umap.sparse as sparse
File ~\anaconda3\envs\DenClu\lib\site-packages\numba\__init__.py:43
39 from numba.core.decorators import (cfunc, generated_jit, jit, njit, stencil,
40 jit_module)
42 # Re-export vectorize decorators and the thread layer querying function
---> 43 from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
44 get_num_threads, set_num_threads)
46 # Re-export Numpy helpers
47 from numba.np.numpy_support import carray, farray, from_dtype
File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\__init__.py:3
1 # -*- coding: utf-8 -*-
----> 3 from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
4 from numba.np.ufunc._internal import PyUFunc_None, PyUFunc_Zero, PyUFunc_One
5 from numba.np.ufunc import _internal, array_exprs
File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\decorators.py:3
1 import inspect
----> 3 from numba.np.ufunc import _internal
4 from numba.np.ufunc.parallel import ParallelUFuncBuilder, ParallelGUFuncBuilder
6 from numba.core.registry import DelayedRegistry
SystemError: initialization of _internal failed without raising an exception
Just when I thought I had fixed this problem, I got another error while importing DenseClus. Any idea how to fix it?
How can I use DenseClus with numeric-only data?
Also: in utils.py I see that all categorical data is one-hot encoded using pd.get_dummies(), which is appropriate for nominal categorical data but not for ordinal categorical data.
My question: how can I use DenseClus on ordinal categorical data?
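One possible workaround, not confirmed in this thread, is to convert ordinal columns to numeric ranks before handing the data over, so that they are no longer seen as categorical and are not one-hot encoded. The sketch below assumes the levels are evenly spaced, which is a modelling choice rather than a fact about the data. `encode_ordinal` is a hypothetical helper, not part of DenseClus.

```python
def encode_ordinal(values, ordered_levels):
    """Map ordered category labels to evenly spaced ranks in [0, 1]."""
    if len(ordered_levels) < 2:
        raise ValueError("need at least two ordered levels")
    rank = {level: i for i, level in enumerate(ordered_levels)}
    top = len(ordered_levels) - 1
    # Unknown labels raise KeyError, which surfaces typos early.
    return [rank[v] / top for v in values]


sizes = ["small", "large", "medium", "small"]
encoded = encode_ordinal(sizes, ["small", "medium", "large"])
print(encoded)  # [0.0, 1.0, 0.5, 0.0]
```

The encoded column keeps the order of the levels, which one-hot encoding discards. Whether the numerical preprocessing then treats such a column sensibly would need to be checked against the library source.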
It would be desirable to have an option to select RAPIDS and run on GPUs so that it can scale out to larger datasets.
A notebook example that works with SageMaker
Hello,
Firstly, many thanks for the great DenseClus project.
I was wondering if it is possible to make cluster prediction on a new point, like the approximate_predict(clusterer, points_to_predict) in hdbscan.
I know DenseClus() has the boolean parameter 'prediction_data', but I didn't find how to make the predictions on new points.
Many thanks for your help.
It would be nice if this Amazon package would work on Amazon Sagemaker.
I can install regular hdbscan using conda from conda-forge, but both pip and conda fail to install Amazon-DenseClus. This is with the Data Science 2.0 kernel in SageMaker Studio.
When testing the DenseClus algorithm on a mixed-type dataset, I found that the clustering results depend strongly on the random seed used.
My question: given this fact, what can I do to get reliable cluster results? My initial approach would be to target stable clustering results, but how do I do that?
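One way to make "stable" measurable is to fit the model several times with different seeds and compare the resulting label vectors pairwise with the adjusted Rand index (ARI), which ignores how clusters are numbered. The sketch below implements ARI with the standard library only and uses made-up label lists. In practice each list would come from one fit with a different seed (assuming the constructor exposes a seed argument such as `random_state`, which should be checked for the installed version). The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two points")
    # Pair counts from the contingency table and from its margins.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


def mean_pairwise_ari(runs):
    """Average ARI over all pairs of runs; values near 1 mean the runs agree."""
    scores = [adjusted_rand_index(a, b) for a, b in combinations(runs, 2)]
    return sum(scores) / len(scores)


# Made-up labels from three hypothetical runs with different seeds.
# Note: a noise label such as -1 is treated here as just another cluster.
run_seed_0 = [0, 0, 0, 1, 1, 1]
run_seed_1 = [1, 1, 1, 0, 0, 0]  # same partition, different numbering
run_seed_2 = [0, 0, 1, 1, 2, 2]  # a genuinely different partition

same = adjusted_rand_index(run_seed_0, run_seed_1)
different = adjusted_rand_index(run_seed_0, run_seed_2)
stability = mean_pairwise_ari([run_seed_0, run_seed_1, run_seed_2])
print(same, different, stability)
```

A low average suggests the settings (for example the neighbourhood size or minimum cluster size) produce clusters that are not robust, so a parameter sweep that maximises this score is one candidate tuning strategy.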
Either show once or allow for optional suppression.
Currently, fit-predict is not supported. This behaviour is desirable and would probably need a Parametric UMAP base or some other workaround.
Hi, I was wondering if you could kindly clarify whether vector data is accepted in this library, or whether it only accepts categorical and numerical data?
It looks like the GitHub CD workflow deploys a new release to PyPI but still shows up as an error in Actions.
Hello DenseClus developers: I have one question regarding the preprocessing.
If my dataset is a combination of continuous (numerical) and categorical values, what are the steps to use this repo? Do I need to preprocess my data before feeding it to the DenseClus function?
Thank you!
Does DenseClus tolerate missing values in the dataset?
As far as I can see from the source code, missing values have to be imputed before the data is fed into DenseClus, but I would like that to be confirmed.
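If imputation does turn out to be required (this thread does not confirm it), a simple baseline is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below uses plain lists with `None` as the missing marker to stay dependency-free. With a DataFrame the same idea applies per column. `impute_column` is a hypothetical helper.

```python
from statistics import median, mode


def impute_column(values, numeric):
    """Replace None with the median (numeric) or the most frequent value (categorical)."""
    present = [v for v in values if v is not None]
    if not present:
        raise ValueError("column has no observed values to impute from")
    fill = median(present) if numeric else mode(present)
    return [fill if v is None else v for v in values]


ages = [34, None, 51, 29]
colors = ["red", None, "blue", "red"]

ages_filled = impute_column(ages, numeric=True)
colors_filled = impute_column(colors, numeric=False)
print(ages_filled)    # [34, 34, 51, 29]
print(colors_filled)  # ['red', 'red', 'blue', 'red']
```

Median and mode are only a baseline. Heavy missingness can distort density-based clustering, so it may be worth comparing results with and without the affected rows.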
There are three groups of UMAP hyperparameters. When should we use the categorical and numerical dicts, and when the combined dict?
default_umap_params = {
    "categorical": {
        # jaccard is an option but only takes sparse input
        "metric": "hamming",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "numerical": {
        "metric": "l2",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "combined": {
        "n_neighbors": 30,
        "min_dist": 0.0,
        "n_components": 5,
    },
}
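Going by the key names alone, "categorical" and "numerical" appear to configure the per-type embeddings and "combined" the joint embedding. That reading is an assumption, not something this thread confirms. Under it, a common pattern is to override only the group you want to change and keep the remaining defaults. The sketch below shows such a merge. `merge_umap_params` is a hypothetical helper, not part of DenseClus, and how the merged dict is passed to the library depends on the installed version.

```python
import copy

default_umap_params = {
    "categorical": {"metric": "hamming", "n_neighbors": 30, "n_components": 5, "min_dist": 0.0},
    "numerical": {"metric": "l2", "n_neighbors": 30, "n_components": 5, "min_dist": 0.0},
    "combined": {"n_neighbors": 30, "min_dist": 0.0, "n_components": 5},
}


def merge_umap_params(defaults, overrides):
    """Return a copy of `defaults` with per-group overrides applied."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for group, params in overrides.items():
        if group not in merged:
            raise KeyError(f"unknown parameter group: {group!r}")
        merged[group].update(params)
    return merged


# Example: smaller neighbourhoods for the numerical embedding only,
# and a 2-D joint embedding for plotting.
custom = merge_umap_params(
    default_umap_params,
    {"numerical": {"n_neighbors": 15}, "combined": {"n_components": 2}},
)
print(custom["numerical"])
print(custom["combined"])
```

Keeping the overrides separate from the defaults makes it easy to see which group a given experiment actually changed.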