amazon-denseclus's People

Contributors

alexmetsai, amazon-auto, amorisot, bharven, dependabot[bot], itaiara, momonga-ml, monk1337, srushtii-aws, sunbc0120


amazon-denseclus's Issues

Denseclus on large data set

Hi, hi. 😊
I would kindly ask for a little bit of help.
I have a large data set (over 6 million rows and 20 columns) with mixed data types.

The idea was to implement DenseClus
(a mix of UMAP embeddings for different data types with HDBSCAN).
I've implemented it on a subset of 100K rows,
but the OOM error popped up really soon.

My core questions would be:

  1. Is there an option for fitting DenseClus on this kind of dataset?
  2. Recommendations related to params?
  3. There is no possibility to transform unseen data; how do I apply it to the entire data set?
  4. As I work on Databricks, are there recommendations regarding the memory/performance of the instance?
  5. What is the strategy for dealing with this kind of task?

Additional advice is welcome. 🤗
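
Since the post mentions fitting on a 100K subset, one way to draw such a subset from a table too large to hold in memory is reservoir sampling over a row stream. The following is a minimal sketch under that assumption, not a DenseClus feature; `reservoir_sample` and its parameters are hypothetical names, and whether a uniform sample preserves the cluster structure of the full data is not established here.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k rows from a stream (Algorithm R)."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Hypothetical usage: a generator stands in for rows read in chunks from storage.
subset = reservoir_sample((n for n in range(1_000_000)), k=1_000, seed=42)
```

Only `k` rows are held in memory at any time, so the sample size can be tuned to what the instance can handle.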

Getting error on version 0.8.27?

yi10yosa56-algo-1-l1c8r | WARNING: Discarding https://files.pythonhosted.org/packages/32/bb/59a75bc5ac66a9b4f9b8f979e4545af0e98bb1ca4e6ae96b3b956b554223/hdbscan-0.8.27.tar.gz#sha256=e3a418d0d36874f7b6a1bf0b7461f3857fc13a525fd48ba34caed2fe8973aa26 (from https://pypi.org/simple/hdbscan/). Command errored out with exit status 1: /miniconda3/bin/python3 /miniconda3/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /home/model-server/tmp/pip-build-env-kjaud8c5/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel cython numpy Check the logs for full command output.
yi10yosa56-algo-1-l1c8r | ERROR: Could not find a version that satisfies the requirement hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | ERROR: No matching distribution found for hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | Traceback (most recent call last):
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 19, in <module>
yi10yosa56-algo-1-l1c8r |     install("hdbscan==0.8.27")
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 13, in install
yi10yosa56-algo-1-l1c8r |     stdout=open(os.devnull, "wb"),
yi10yosa56-algo-1-l1c8r |   File "/miniconda3/lib/python3.7/subprocess.py", line 363, in check_call
yi10yosa56-algo-1-l1c8r |     raise CalledProcessError(retcode, cmd)

Is this right? It looks like the latest HDBSCAN version is 0.8.26.
https://github.com/scikit-learn-contrib/hdbscan/releases
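
The traceback shows an `install` helper that passes `stdout=open(os.devnull, "wb")`, which hides part of pip's output. The rest of that helper is not shown in the post, so the following is only a sketch of a variant that surfaces pip's error text when the install fails; `build_pip_command` is a hypothetical name.

```python
import subprocess
import sys


def build_pip_command(requirement):
    """Build the pip command for the interpreter that is currently running."""
    return [sys.executable, "-m", "pip", "install", requirement]


def install(requirement):
    """Install a requirement and raise with pip's stderr if the install fails."""
    result = subprocess.run(
        build_pip_command(requirement),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"pip could not install {requirement!r}:\n{result.stderr}"
        )


# Hypothetical usage (not executed here because it changes the environment):
# install("hdbscan==0.8.27")
```

With the error text visible, it becomes easier to tell a missing release apart from a source build that failed.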

Error while importing

from denseclus import DenseClus

TypeError Traceback (most recent call last)
C:\Temp\ipykernel_11700\3655304814.py in <cell line: 1>()
----> 1 from denseclus import DenseClus

~\anaconda3\lib\site-packages\denseclus\__init__.py in <module>
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
3
4 __version__ = "0.0.19"

~\anaconda3\lib\site-packages\denseclus\DenseClus.py in <module>
3 import warnings
4
----> 5 import hdbscan
6 import numpy as np
7 import pandas as pd

~\anaconda3\lib\site-packages\hdbscan\__init__.py in <module>
----> 1 from .hdbscan_ import HDBSCAN, hdbscan
2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage
3 from .validity import validity_index
4 from .prediction import (approximate_predict,
5 membership_vector,

~\anaconda3\lib\site-packages\hdbscan\hdbscan_.py in
507 leaf_size=40,
508 algorithm="best",
--> 509 memory=Memory(cachedir=None, verbose=0),
510 approx_min_span_tree=True,
511 gen_min_span_tree=False,

TypeError: __init__() got an unexpected keyword argument 'cachedir'

What has gone wrong here? How can I fix this? I am using anaconda3.
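
The failing call is `Memory(cachedir=None, verbose=0)` inside hdbscan. Assuming `Memory` is `joblib.Memory` (the traceback does not show the import), a plausible cause is an installed joblib release whose `Memory` takes `location` and no longer accepts `cachedir`. A first diagnostic step is to list the installed versions of the packages involved. This is a sketch; the package names in the usage lines are assumptions about what is relevant.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as published on PyPI; adjust to the environment at hand.
    names = ["Amazon-DenseClus", "hdbscan", "joblib", "umap-learn", "numba", "numpy"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared against the requirements each package declares.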

Will there be a transform function like UMAP has?

For the full embedding/clustering job, it would be really good to have a transform function so that new data can be transformed and clustered without repeating the embedding, similar to UMAP's transform function.

System error while importing

SystemError Traceback (most recent call last)
Cell In[5], line 1
----> 1 from denseclus import DenseClus

File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\__init__.py:1
----> 1 from .DenseClus import DenseClus
2 from .utils import extract_categorical, extract_numerical
4 __version__ = "0.0.19"

File ~\anaconda3\envs\DenClu\lib\site-packages\denseclus\DenseClus.py:8
6 import numpy as np
7 import pandas as pd
----> 8 import umap.umap_ as umap
9 from hdbscan import flat
10 from sklearn.base import BaseEstimator, ClassifierMixin

File ~\anaconda3\envs\DenClu\lib\site-packages\umap\__init__.py:2
1 from warnings import warn, catch_warnings, simplefilter
----> 2 from .umap_ import UMAP
4 try:
5 with catch_warnings():

File ~\anaconda3\envs\DenClu\lib\site-packages\umap\umap_.py:28
26 from scipy.sparse import tril as sparse_tril, triu as sparse_triu
27 import scipy.sparse.csgraph
---> 28 import numba
30 import umap.distances as dist
32 import umap.sparse as sparse

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\__init__.py:43
39 from numba.core.decorators import (cfunc, generated_jit, jit, njit, stencil,
40 jit_module)
42 # Re-export vectorize decorators and the thread layer querying function
---> 43 from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
44 get_num_threads, set_num_threads)
46 # Re-export Numpy helpers
47 from numba.np.numpy_support import carray, farray, from_dtype

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\__init__.py:3
1 # -*- coding: utf-8 -*-
----> 3 from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
4 from numba.np.ufunc._internal import PyUFunc_None, PyUFunc_Zero, PyUFunc_One
5 from numba.np.ufunc import _internal, array_exprs

File ~\anaconda3\envs\DenClu\lib\site-packages\numba\np\ufunc\decorators.py:3
1 import inspect
----> 3 from numba.np.ufunc import _internal
4 from numba.np.ufunc.parallel import ParallelUFuncBuilder, ParallelGUFuncBuilder
6 from numba.core.registry import DelayedRegistry

SystemError: initialization of _internal failed without raising an exception

Just when I thought I had fixed this problem, I got another error while importing DenseClus. Any idea how to fix it?

Using denseclus with numeric only data

How can I use DenseClus with numeric-only data?

Also: in utils.py I see that all categorical data is one-hot encoded using pd.get_dummies(). That is appropriate for nominal categorical data, but not for ordinal categorical data.
My question: how can I use DenseClus on ordinal categorical data?
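
One possible workaround for the ordinal case, not confirmed by the maintainers, is to replace each ordinal level with its integer rank before fitting, so that the column is no longer one-hot encoded. A minimal sketch follows; `encode_ordinal` and the example levels are hypothetical, and integer ranks implicitly assume equal spacing between levels.

```python
def encode_ordinal(values, order):
    """Replace each ordinal level with its rank in `order` (lowest level first)."""
    rank = {level: position for position, level in enumerate(order)}
    # A KeyError here means a value is not listed in `order`.
    return [rank[value] for value in values]


# Hypothetical column with three ordered levels.
sizes = ["small", "large", "medium", "small"]
encoded = encode_ordinal(sizes, order=["small", "medium", "large"])
```

The encoded column keeps the ordering information that one-hot encoding discards.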

GPU Support

It would be desirable to have the option to select RAPIDS and run on GPUs, so that DenseClus can scale to larger datasets.

Prediction on new points

Hello,

Firstly, many thanks for the great DenseClus project.
I was wondering if it is possible to predict the cluster of a new point, like approximate_predict(clusterer, points_to_predict) in hdbscan.

I know DenseClus() has the boolean parameter 'prediction_data', but I didn't find how to make the predictions on new points.

Many thanks for your help.

Package doesn't work on Sagemaker

It would be nice if this Amazon package worked on Amazon SageMaker.
I can install regular hdbscan using conda from conda-forge, but both pip and conda fail to install Amazon-DenseClus. This is with the Data Science 2.0 kernel in SageMaker Studio.

Dependence of clustering results on random seed

When testing the DenseClus algorithm on a mixed-type dataset, I discovered that the clustering results depend strongly on the random seed used.

My question: given this fact, what can I do to get reliable cluster results? My initial approach would be to target stable clustering results, but how do I do that?
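
One way to quantify the seed dependence is to fit once per seed and compare the resulting label lists pairwise. The sketch below uses the plain Rand index, the fraction of point pairs on which two labelings agree about being in the same cluster. `rand_index` is a hypothetical helper and the label lists are made up. It treats the HDBSCAN noise label -1 as an ordinary cluster, which is a simplification, and it is quadratic in the number of points, so it suits samples rather than full datasets.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree (same vs. different)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    total = 0
    agree = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    # With fewer than two points there are no pairs to disagree on.
    return agree / total if total else 1.0


# Made-up labels from two runs with different seeds; cluster ids are permuted.
run_seed_1 = [0, 0, 1, 1, -1]
run_seed_2 = [1, 1, 0, 0, -1]
score = rand_index(run_seed_1, run_seed_2)
```

A score close to 1.0 across many seed pairs would indicate that the partition itself is stable even when the cluster ids change.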

Fit Predict method

Currently, fit-predict is not supported. This behaviour is desirable and would probably need a Parametric UMAP base, or some other workaround.

DenseClus and missing values

Does DenseClus tolerate missing values in the dataset?
As far as I can see from the source code, I expect that missing values have to be imputed before the data is fed into DenseClus, but I would like that to be confirmed.
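
If imputation before fitting does turn out to be required, which this thread does not confirm, a simple baseline is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below works on plain dictionaries to stay self-contained; `impute_rows` and the column names are hypothetical.

```python
from statistics import median, mode


def impute_rows(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by the column median or mode."""
    filled = [dict(row) for row in rows]
    for col, summarize in [(c, median) for c in numeric_cols] + [
        (c, mode) for c in categorical_cols
    ]:
        present = [row[col] for row in rows if row[col] is not None]
        # Raises statistics.StatisticsError if a column has no observed values.
        fill_value = summarize(present)
        for row in filled:
            if row[col] is None:
                row[col] = fill_value
    return filled


# Made-up rows with one missing numeric and one missing categorical value.
raw = [
    {"age": 30, "color": "red"},
    {"age": None, "color": "red"},
    {"age": 50, "color": None},
]
clean = impute_rows(raw, numeric_cols=["age"], categorical_cols=["color"])
```

The input rows are left untouched, so the imputed and original data can be compared afterwards.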

UMAP hyperparameters question

There are three groups of UMAP hyperparameters. When should we use the categorical and numerical dicts, and when the combined dict?

default_umap_params = {
    "categorical": {
        # jaccard is an option but only takes sparse input
        "metric": "hamming",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "numerical": {
        "metric": "l2",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "combined": {
        "n_neighbors": 30,
        "min_dist": 0.0,
        "n_components": 5,
    },
}
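
Judging by the key names alone, `categorical` and `numerical` look like settings for the two per-type embeddings and `combined` for the step that joins them; that reading is an inference, not something the post confirms. Under that assumption, changing one group while keeping the other defaults could be sketched as below. `merge_umap_params` is a hypothetical helper, and how DenseClus accepts such a dict is not shown here.

```python
import copy

# Defaults copied from the question above.
default_umap_params = {
    "categorical": {
        "metric": "hamming",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "numerical": {
        "metric": "l2",
        "n_neighbors": 30,
        "n_components": 5,
        "min_dist": 0.0,
    },
    "combined": {
        "n_neighbors": 30,
        "min_dist": 0.0,
        "n_components": 5,
    },
}


def merge_umap_params(defaults, overrides):
    """Return a copy of defaults with per-group overrides applied."""
    merged = copy.deepcopy(defaults)
    for group, params in overrides.items():
        if group not in merged:
            raise KeyError(f"unknown parameter group: {group!r}")
        merged[group].update(params)
    return merged


# Hypothetical override: fewer neighbours for the numerical embedding only.
params = merge_umap_params(default_umap_params, {"numerical": {"n_neighbors": 15}})
```

Working on a deep copy keeps the defaults intact, so several override sets can be tried side by side.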
