rapidsai / cuml
cuML - RAPIDS Machine Learning Library
Home Page: https://docs.rapids.ai/api/cuml/stable/
License: Apache License 2.0
I'm using the HEAD of this repository. When I try to run the unit tests, I get the error attached at the end. Command to reproduce:
$ cd python/cuML/test
$ pytest test_pca.py
With cuDF adding support for python 3.7, cuML needs to follow and be tested with the corresponding cuDF build. Waiting on rapidsai/cudf#668
I pulled the new Rapids Docker container particularly to re-run a KMeans exercise on Twitter location data that I've previously run successfully in both TensorFlow and Scikit-Learn.
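from cuml import KMeans as km
import cudf
names = ['0','1']
dtypes = ['float64','float64']
filename = "/data/twitter/cluster_points.csv"
# The load step is missing from the report; reading the CSV with cudf.read_csv is assumed here.
gdf = cudf.read_csv(filename, names=names, dtype=dtypes)
clustering_cuml = km(n_clusters=100)
clustering_cuml.fit(gdf)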
The gdf looks like this:
0 1
0 392159.91 223933.2
1 434359.54 278703.86
2 436988.1599999999 335566.98
3 386173.63999999996 349452.80000000005
4 275936.06 674298.0899999999
5 432248.25 444924.63
6 458423.64999999997 304714.0300000001
7 591923.55 120227.19
8 532864.35 182221.79
9 336145.64999999997 390272.31
[10023370 more rows]
And in the Docker container it hangs. I noticed 2 things by monitoring the system in another window:
The log shows:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): trivial_device_copy D->H failed: invalid argument
I then tried using a smaller sample of just 10,000 records and the same thing happened.
I reduced the number of clusters to 10 and it worked fine. I then steadily increased the number of records back up to 10 million and found it would intermittently hang on some runs above 12 clusters, but then work other times. I then increased the number of clusters and found I could never get it to work beyond 20 clusters.
It's a difficult error to reproduce because it appears random, although it happens consistently on larger datasets. I always shut down all kernels and restart each time for a clean environment.
I've also tried it on different computers with different GPUs: GV100, Titan V and Titan Xp and experienced the same issue.
I also tried it outside Docker and the same thing happened.
cuML KMeans results in unexpected clustering behavior compared to sklearn and R's stats package. A basic reproducible example is below. The result appears to be due to fundamentally different clusters, not a mis-assignment of records to cluster IDs.
"""KMeans testing cuml vs sklearn
"""
from cuml import KMeans as cumlKMeans
from sklearn.cluster import KMeans
import cudf
import numpy as np
import pandas as pd
cdf = cudf.DataFrame()
cdf['a'] = np.array([3000, 2, 3100.], dtype=np.float32)
cdf['b'] = np.array([3000, 4, 3100.], dtype=np.float32)
cdf['c'] = np.array([4000, 3, 4100.], dtype=np.float32)
cdf['d'] = np.array([3100, 1, 4100.], dtype=np.float32)
cuml_km = cumlKMeans(n_clusters=2)
sk_km = KMeans(n_clusters=2)
# cuML Kmeans results in cluster centers of
# 0 4.0 3100.0 4000.0 3.0
# 1 3550.0 1551.0 1550.5 3550.0
cuml_km.fit(cdf)
print(cuml_km.cluster_centers_)
# sklearn Kmeans results in cluster centers of
# 0 3050.0 3050.0 4050.0 3600.0
# 1 2.0 4.0 3.0 1.0
sk_km.fit(cdf.to_pandas())
print(pd.DataFrame(sk_km.cluster_centers_))
The sklearn results are consistent with the results from R's stats package implementation of KMeans. These results are also consistent across multiple runs with different random seeds.
"""KMeans testing cuml vs sklearn
"""
from cuml import KMeans as cumlKMeans
from sklearn.cluster import KMeans
import cudf
import numpy as np
import pandas as pd
cdf = cudf.DataFrame()
cdf['a'] = np.array([3000, 2, 3100.], dtype=np.float32)
cdf['b'] = np.array([3000, 4, 3100.], dtype=np.float32)
cdf['c'] = np.array([4000, 3, 4100.], dtype=np.float32)
cdf['d'] = np.array([3100, 1, 4100.], dtype=np.float32)
km = cumlKMeans(n_clusters=2)
sk_km = KMeans(n_clusters=2)
cuml_results = []
cuml_centers = []
for i in range(100):
    km = cumlKMeans(n_clusters=2, random_state=np.random.choice(100000))
    res = km.fit_predict(cdf)
    cuml_results.append(res.to_array())
    cuml_centers.append(km.cluster_centers_.to_pandas().mean(axis=1))
print(cuml_centers)
cuml_res_df = pd.DataFrame(cuml_results, columns=['c1', 'c2', 'c3'])
# Count the runs in which records 0 and 1 ended up in the same cluster.
print(cuml_res_df[
    (cuml_res_df.c1 == 1) & (cuml_res_df.c2 == 1)
    | (cuml_res_df.c1 == 0) & (cuml_res_df.c2 == 0)
].shape)
sklearn_results = []
for i in range(1000):
    res = sk_km.fit_predict(cdf.to_pandas())
    sklearn_results.append(res)
sk_res_df = pd.DataFrame(sklearn_results, columns=['c1', 'c2', 'c3'])
# Same count for the sklearn runs.
print(sk_res_df[
    (sk_res_df.c1 == 1) & (sk_res_df.c2 == 1)
    | (sk_res_df.c1 == 0) & (sk_res_df.c2 == 0)
].shape)
About 25-40% of the cuml kmeans runs result in records 0 and 1 being in the same cluster, which should not happen or should be vanishingly unlikely (I haven't done the math to see if this is actually a possible stable outcome). None of the sklearn runs result in this pairing.
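The open question of whether records 0 and 1 can share a cluster in a stable outcome can be checked by hand. The sketch below is plain NumPy, not cuML or scikit-learn code. It enumerates the three ways to split the three records into two non-empty clusters and tests whether each split survives one Lloyd reassignment step. The helper name `lloyd_step_check` is made up for illustration.

```python
import numpy as np

# Rows are the three records from the example (columns a, b, c, d).
X = np.array([
    [3000.0, 3000.0, 4000.0, 3100.0],
    [2.0, 4.0, 3.0, 1.0],
    [3100.0, 3100.0, 4100.0, 4100.0],
])

def lloyd_step_check(X, labels):
    """Return (is_fixed_point, inertia) for a given 2-cluster labeling."""
    labels = np.asarray(labels)
    centers = np.stack([X[labels == k].mean(axis=0) for k in (0, 1)])
    # Squared distance from every record to every center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    reassigned = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return bool(np.array_equal(reassigned, labels)), float(inertia)

splits = {
    "records 0 and 1 together": [0, 0, 1],
    "records 0 and 2 together": [0, 1, 0],
    "records 1 and 2 together": [0, 1, 1],
}
results = {name: lloyd_step_check(X, labels) for name, labels in splits.items()}
for name, (stable, inertia) in results.items():
    print(f"{name}: fixed point={stable}, inertia={inertia:.1f}")
```

Only the split that pairs records 0 and 2 is a fixed point of the reassignment step, which matches the scikit-learn and R centers. A converged Lloyd iteration on this data should therefore not end with records 0 and 1 together.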
R example, that matches sklearn:
library(stats)
library(dplyr)
df <- tibble(
temp = c(3000, 2, 3100.),
temp2 = c(3000, 4, 3100.),
temp3 = c(4000, 3, 4100.),
temp4 = c(3100, 1, 4100.),
)
cl = kmeans(df, 2)
print(cl$centers)
# 1 2 4 3 1
# 2 3050 3050 4050 3600
Being developed by @oyilmaz-nvidia with assistance of @dantegd
Is your feature request related to a problem? Please describe.
Randomized option is not supported due to its high discrepancy. It was added to be compatible with SKL.
Describe the solution you'd like
Should be removed from python wrapper.
While setting up my workstation to debug through CUDA code, I found a script in [1] that has proven really useful for creating an eclipse project file. Before finding this script, I made several unsuccessful attempts at creating projects that either wouldn't build properly, wouldn't analyze/index the code properly, or wouldn't run/debug. Building the eclipse project file from the cmake command itself worked.
It would be very useful to the community if we included this command in our documentation to enable more potential contributors to cuML. NSight already comes with the CUDA toolkit, thus we can assume any developers wanting to build our repository already have it installed.
[1] https://github.com/rickyzhang82/cs344/blob/master/auto-generate-project.sh
TL;DR: 8 workers fail, 4 workers succeed on an 8-GPU server.
Exception: GDF_VALIDITY_UNSUPPORTED. The traceback refers to an error with everdf.group.max().
In reducing the file I removed all values of a given loan #, never splitting them.
5. Without checking, the XGBoost DMatrix conversion appears to run with a wall time of 6.18s.
6. But the GPU XGBoost train fails with the exception and traceback shown below.
TypeError: reraise() missing 2 required positional arguments: 'tp' and 'value'
Is your feature request related to a problem? Please describe.
Currently, we directly expose the underlying C++ implementations via Cython. Since Cython can also understand C++ interfaces, all is well for us. But if we want wider adoption, I think we should expose a true C API (e.g. by declaring symbols under "extern C"). Such an interface can then be used easily from multiple languages.
Describe the solution you'd like
As a first step, we could just start by wrapping our *_c.h files under each algo folder of cuML with "extern C" declarations.
Describe alternatives you've considered
-NA-
Additional context
None.
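To illustrate why a plain C interface is easy to consume from other languages, the sketch below calls a C function from Python through `ctypes` without any wrapper code. It uses the system C math library as a stand-in, since no cuML C symbols are assumed to exist, and it assumes a Linux-like system where `find_library("m")` resolves.

```python
import ctypes
import ctypes.util

# Locate the C math library (stand-in for a library exposing a C API).
libm_path = ctypes.util.find_library("m")
libm = ctypes.CDLL(libm_path)

# Declare the C signature: double cos(double).
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

value = libm.cos(0.0)
print(value)
```

The same lookup by symbol name would not work for C++ functions without `extern "C"`, because their exported names are mangled by the compiler.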
Convert the great job by @daxiongshu PR #83 into a unit test, at the python level.
We should drop the requirement on GCC>=5.4.0, if at all possible.
I have built a few real-world examples for a talk on cuML. The notebooks that I created for the talk will be useful to the community. I will submit a PR for this.
Currently, the kNN implementation within cuml's python layer only supports the IndexFlatL2 index. It would be ideal if cuml could somehow support the other index types without tying the user-facing API too closely to FAISS.
One way to implement this might be for the API layer to provide a pluggable strategy for "index_alg" that would call the necessary index function in FAISS when invoked.
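A minimal sketch of such a pluggable strategy is shown below. It is not the cuML API: the registry `INDEX_FACTORIES`, the decorator `register_index`, and the wrapper class are hypothetical names, and a brute-force NumPy search stands in for a FAISS index so the example runs without a GPU.

```python
import numpy as np

# Hypothetical registry: maps an "index_alg" name to a class that builds an index.
INDEX_FACTORIES = {}

def register_index(name):
    def decorator(factory):
        INDEX_FACTORIES[name] = factory
        return factory
    return decorator

@register_index("flat_l2")
class BruteForceL2Index:
    """Exact L2 search in NumPy, standing in for a FAISS flat index."""

    def __init__(self, dim):
        self.dim = dim
        self.data = np.empty((0, dim), dtype=np.float32)

    def add(self, X):
        self.data = np.vstack([self.data, np.asarray(X, dtype=np.float32)])

    def search(self, Q, k):
        Q = np.asarray(Q, dtype=np.float32)
        d2 = ((Q[:, None, :] - self.data[None, :, :]) ** 2).sum(axis=2)
        idx = np.argsort(d2, axis=1)[:, :k]
        return np.take_along_axis(d2, idx, axis=1), idx

class KNN:
    """User-facing wrapper; only the registry knows about concrete index types."""

    def __init__(self, dim, index_alg="flat_l2"):
        if index_alg not in INDEX_FACTORIES:
            raise ValueError(f"unknown index_alg: {index_alg!r}")
        self._index = INDEX_FACTORIES[index_alg](dim)

    def fit(self, X):
        self._index.add(X)
        return self

    def query(self, Q, k):
        return self._index.search(Q, k)

knn = KNN(dim=2).fit([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
distances, indices = knn.query([[0.9, 0.9]], k=2)
print(indices)
```

Adding another index type would then mean registering one more class under a new name, with no change to the user-facing wrapper.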
In the current version of kmeans, there are multiple alignment conflicts that cause warnings to be raised during compilation of the style:
warning: specified alignment (4) is different from alignment (8) specified on a previous declaration
detected during instantiation of "void kmeans::detail::matmul(const float_t *, const float_t *, float_t *, float_t, float_t, int, int, int, int) [with float_t=float]"
Besides the warnings, there is the potential that this might cause problems in the future so it is worth looking into the conflicts in the kmeans code.
Problem: The knn_demo.ipynb included in the CUDA 10 version of RAPIDS container fails on cell 9 (calling knn_cuml.fit(X)) with the following traceback:
AttributeError Traceback (most recent call last)
in
/conda/envs/rapids/lib/python3.5/site-packages/cuml-0+unknown-py3.5-linux-x86_64.egg/cuml.cpython-35m-x86_64-linux-gnu.so in cuml.KNN.fit()
AttributeError: module 'faiss' has no attribute 'StandardGpuResources'
Work-around:
Here are the steps to take inside the nvcr.io/nvidia/rapidsai/cuda10.0_ubuntu16.04 container:
as jupyter user inside container:
source activate rapids
conda uninstall -y faiss-gpu
conda install -y mkl-include=2018.0.3
conda install -y swig=3.0.12
git clone -b v1.4.0 https://github.com/facebookresearch/faiss.git
cd faiss
LDFLAGS="-L${CONDA_PREFIX}/lib" ./configure --prefix=$CONDA_PREFIX --with-python=$(which python)
sed -i 's|PYTHONCFLAGS = -I|PYTHONCFLAGS= -I/conda/envs/rapids/include/python3.5m/ -I/conda/envs/rapids/lib/python3.5/site-packages/numpy/core/include|g' ./makefile.inc
sed -i '/-gencode arch=compute_61,code="compute_61" \/a -gencode arch=compute_70,code="compute_70" \' ./makefile.inc
make install
cd gpu
make
make cpu && make gpu
Then from another shell, need to perform the following as root in the container:
docker ps <-- identify container-id
docker exec -it -u root container-id bash
source activate rapids
cd /rapids/notebooks/faiss/python
python setup.py install
I noticed this when working with a DataFrame that has columns of type np.float64. It looks like the wrappers underneath expect a single-precision float * and there is no explicit casting going on.
As a result, the calculations coming from the C code are incorrect because the pointers are treated as single precision. This isn't necessarily a bug, but it should be documented somewhere so that users know to expect this behavior. Otherwise, it could cause some headaches and slow adoption.
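The effect can be reproduced on the CPU. The sketch below uses only NumPy and is an analogy for what is described above, not the actual wrapper code: `view` reinterprets the float64 bytes as float32, while `astype` performs the explicit cast that the wrappers appear to skip.

```python
import numpy as np

values = np.array([1.5, 2.25, 3.0], dtype=np.float64)

# Reinterpret the same bytes as float32, which is roughly what happens when a
# float64 buffer is handed to code that expects `float *` without a cast.
misread = values.view(np.float32)

# Explicit conversion: allocates a new float32 buffer holding the same numbers.
converted = values.astype(np.float32)

print(misread.shape, misread)      # twice as many elements, unrelated values
print(converted.shape, converted)
```

Until the behavior is documented or handled in the wrappers, casting the columns to float32 before calling the algorithm avoids the misread.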
Building cuML from source in an environment where cuDF was also built from source works fine, and installing both with conda also works well.
This issue refers to problems building cuML from source in a conda environment where cuDF was installed using conda install, and it can happen in non-conda environments as well. In such an environment, libcuml is still installed to the environment lib folder, but if cudf was installed with conda install, cuML's setup.py will look for libcuml in site-packages instead, making the cythonization process fail like this:
$ python setup.py build_ext --inplace
cuML/cuml.pyx: cannot find cimported module 'c_tsvd'
cuML/cuml.pyx: cannot find cimported module 'c_kmeans'
cuML/cuml.pyx: cannot find cimported module 'c_pca'
cuML/cuml.pyx: cannot find cimported module 'c_dbscan'
Currently working on a solution.
New folder structure with unified build system for ml-prims and cuml.
KMeans appears to only work with floats, currently. This may be known, as the docstring explicitly calls out Kmeans for floats in the naming conventions. We should update the documentation in the short term to reflect this explicitly if it's known behavior.
The example in the docstring works:
from cuml import KMeans
import cudf
import numpy as np
import pandas as pd
def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf
a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)
print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)
But the following examples either cause core dumps or hang:
from cuml import KMeans
import cudf
import numpy as np
import pandas as pd
def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d'%i:df[:,i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c,column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf
a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],dtype=np.int32)
b = np2cudf(a)
print("input:")
print(b)
print("Calling fit")
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(b)
from cuml import KMeans
import cudf
cdf = cudf.DataFrame()
cdf['a'] = [1,2,3]
cdf['b'] = [6,1,2]
cdf['c'] = [1,2,4]
cdf['d'] = [9,2,100]
kmeans_float = KMeans(n_clusters=2, n_gpu=-1)
kmeans_float.fit(cdf)
Ubuntu 16.04, Cuda 9.2, 410.48, V100
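Until non-float input is handled, a dtype check before `fit` would turn the hang or core dump into a readable error. The following is a sketch only: `require_float_columns` is a hypothetical helper, and a dict of NumPy arrays stands in for a cuDF DataFrame so the example runs without a GPU.

```python
import numpy as np

def require_float_columns(columns):
    """Raise TypeError unless every column is float32 or float64.

    `columns` is a plain dict of name -> array, used as a stand-in for a
    DataFrame; the function name is hypothetical.
    """
    bad = {name: np.asarray(col).dtype for name, col in columns.items()
           if np.asarray(col).dtype not in (np.float32, np.float64)}
    if bad:
        details = ", ".join(f"{name} ({dtype})" for name, dtype in bad.items())
        raise TypeError(f"expected float32/float64 columns, got: {details}")

ok = {"a": np.array([1.0, 2.0], dtype=np.float32)}
ints = {"a": np.array([1, 2, 3]), "b": np.array([6, 1, 2])}

require_float_columns(ok)            # passes silently
try:
    require_float_columns(ints)
except TypeError as exc:
    message = str(exc)
    print(message)
```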
As both the code base and community in cuML (and RAPIDS AI in general) continue to grow, it would be useful to add a license header check to our build.
Most often, tools that do this will allow you to white-list a set of file extensions that will be checked.
From a cursory look through the code base, we will definitely want to check the extensions: ["py", "cu", "c", "h", "cfg", "sh"]
It would also make sense to blacklist directories (e.g. external/).
Perhaps it could be as simple as having a python script that runs (and places headers in the appropriate format with the appropriate extensions).
There are tools out there to help with this too [1]. We would need to decide whether we want our tool to change the files or just alert when files exist without license headers. If we go the latter route, we should probably add this to our travis build to make it easier for contributors (and reviewers).
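A report-only variant of such a script could look like the sketch below. The extension list and the excluded `external` directory come from the proposal above; the marker string, the function name, and the number of lines scanned are assumptions, and a real check would match the project's actual header text.

```python
import os

CHECKED_EXTENSIONS = {".py", ".cu", ".c", ".h", ".cfg", ".sh"}
EXCLUDED_DIRS = {"external", ".git"}
# Assumed marker; replace with the header text the project actually uses.
HEADER_MARKER = "Licensed under the Apache License"

def files_missing_header(root, marker=HEADER_MARKER, head_lines=30):
    """Return relative paths of checked files whose first lines lack the marker."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDED_DIRS]
        for name in sorted(filenames):
            if os.path.splitext(name)[1] not in CHECKED_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                head = "".join(line for _, line in zip(range(head_lines), fh))
            if marker not in head:
                missing.append(os.path.relpath(path, root))
    return sorted(missing)
```

In a CI job, a non-empty result could be printed and turned into a non-zero exit code so that contributors see which files need a header without the tool rewriting anything.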
GTest is a library that is available in standard repositories in common Linux package managers.
Currently, the cuml codebase ships with the gtest code included in the external/ directory, but it might be easier for users if we follow the path we took for the FAISS integration and simply make it a dependency for our build. In the case of cuML being packaged up for install with aptitude or yum, the gtest dependency would be installed automatically.
I opened this ticket to discuss our options and bring to light any possible reasons why removing gtest from external would be a bad idea for sustainability.
What is your question?
When I try a dataset with few columns (shape=[10000,1000]) on cuML PCA, it works like a charm and the GPU utilization is high.
When I try a dataset with many columns (shape=[10000,10000]) on cuML PCA, it seems like a disaster for the GPU.
The utilization is high at the beginning (it hits 100%), but after 2 seconds it drops very low (0~3%) and stays there for more than 5 minutes. It makes no difference whether svd_solver='randomized' or svd_solver='full' is used. Is this a common situation for cuML? It seems the GPU is doing nothing after 2 seconds, while the CPU (a single process at 100% utilization) is busy moving data in and out for more than 5 minutes.
(code as below)
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA as skPCA
import os
import multiprocessing
import cudf
import cuml
from cuml import PCA as cumlPCA
def load_data(nrows, ncols):
    print('use random data')
    X = np.random.rand(nrows,ncols)
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df
nrows = 10000
ncols = 10000
X = load_data(nrows,ncols)
n_components = 2
whiten = False
random_state = 42
# no performance difference between "randomized" and "full" svd_solver
svd_solver="randomized"
pca_cuml = cumlPCA(n_components=n_components,svd_solver=svd_solver,
                   whiten=whiten, random_state=random_state)
result_cuml = pca_cuml.fit_transform(X)
# this takes about 5.5 minutes on a 2080 Ti GPU.
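To find out which stage accounts for the five minutes, each step can be timed separately. The sketch below uses only the standard library. The `stage` helper and the placeholder workloads are illustrative; in the report above the stages would be the DataFrame construction, the conversion to a GPU dataframe, and `fit_transform`.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record the wall-clock duration of a named stage in `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

# Placeholder workloads standing in for the real pipeline steps.
with stage("build_input"):
    data = [[float(i * j) for j in range(200)] for i in range(200)]
with stage("compute"):
    total = sum(sum(row) for row in data)

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s")
```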
DBSCAN generates different # of clusters when using cuML compared to when using sklearn.
Dataset to reproduce:
https://github.com/PatWalters/gpu_kmeans/blob/master/fp.csv
Code to reproduce:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from cuml import DBSCAN as cumlDBSCAN
import cudf
import os
X = pd.read_csv("fp.csv")
print('data',X.shape)
eps = 3
min_samples = 2
clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)
print("# of sklearn clusters", len(set(clustering_sk.labels_)))
X = cudf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
clustering_cuml.fit(X)
print("# of cuML clusters", clustering_cuml.labels_.unique_count())
GPU used: 1x Tesla P40;
OS and version: Ubuntu 18.04;
CUDA version: 9.2;
Driver: 410.48;
gcc version: 7.3;
python version: 3.5
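One detail to keep in mind when comparing the counts above: scikit-learn's DBSCAN labels noise samples as -1, so `len(set(labels_))` counts noise as one extra cluster. Whether cuML's `labels_` follows the same convention in this version is an assumption to verify. A small NumPy sketch of a count that ignores noise, with a hypothetical helper name:

```python
import numpy as np

def count_clusters(labels):
    """Count distinct cluster labels, ignoring the noise label -1."""
    labels = np.asarray(labels)
    return int(np.unique(labels[labels != -1]).size)

labels = np.array([0, 0, 1, -1, 1, -1, 2])
naive_count = len(set(labels.tolist()))   # counts -1 as if it were a cluster
cluster_count = count_clusters(labels)
print(naive_count, cluster_count)
```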
I have a data set which contains 180,914 rows and 48 columns, most of them integers (from 0 to 10,000):
I convert the full data frame to the float64 data type.
It works well when we run it with the sklearn library (CPU).
When I tried to run it with DBSCAN in cuML, it crashed with no response at all, and I had to restart the whole kernel.
Then I reduced the rows in our data set from 180K to 10K. It works, but it is very slow: about 3s for 10K rows, 6s for 20K rows, and 9s for 30K rows. It crashes when the data grows to 70K rows.
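One possible explanation for a row-count threshold like this, offered as an assumption rather than a confirmed diagnosis, is a dense n x n adjacency matrix held in GPU memory. The arithmetic below assumes one byte per entry; the function name is illustrative.

```python
def dense_adjacency_gib(n_rows, bytes_per_entry=1):
    """Size in GiB of a dense n_rows x n_rows matrix."""
    return n_rows * n_rows * bytes_per_entry / 2**30

for n in (10_000, 30_000, 70_000, 180_914):
    print(f"{n:>7} rows -> {dense_adjacency_gib(n):6.2f} GiB")
```

Under that assumption, 70,000 rows would need several GiB for the matrix alone and the full 180,914 rows roughly 30 GiB, which would exceed the memory of many single GPUs.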
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN as skDBSCAN
from sklearn.datasets.samples_generator import make_blobs
from cuML import DBSCAN as cumlDBSCAN
import pygdf
import os
import dask
#import dask_gdf
from dask.delayed import delayed
from dask.distributed import Client, wait
import pygdf
from pygdf.dataframe import DataFrame
from collections import OrderedDict
from glob import glob
from sklearn.cluster import KMeans
import re
from itertools import cycle
from sklearn.preprocessing import StandardScaler
##LOAD DATA
X = load_data()
eps = 0.5
min_samples = 3
clustering_sk = skDBSCAN(eps = eps, min_samples = min_samples)
clustering_sk.fit(X)
Y = pygdf.DataFrame.from_pandas(X)
clustering_cuml = cumlDBSCAN(eps = eps, min_samples = min_samples)
Z = Y.head(70000)
clustering_cuml.fit(Z)
Installing cuML from rapidsai conda failed due to two packages not being found. I am able to install cuDF successfully from conda, though. If I try to do conda install faiss-gpu on its own, I am able to install it into the environment.
Ubuntu 16.04, V100, Cuda 9.2 410.48,
Full error:
(cudf) root@81afa5caf852:/# conda install -c rapidsai cuml
Solving environment: failed
PackagesNotFoundError: The following packages are not available from current channels:
- cuml
- faiss-gpu
Current channels:
- https://conda.anaconda.org/rapidsai/linux-64
- https://conda.anaconda.org/rapidsai/noarch
- https://repo.anaconda.com/pkgs/main/linux-64
- https://repo.anaconda.com/pkgs/main/noarch
- https://repo.anaconda.com/pkgs/free/linux-64
- https://repo.anaconda.com/pkgs/free/noarch
- https://repo.anaconda.com/pkgs/r/linux-64
- https://repo.anaconda.com/pkgs/r/noarch
- https://repo.anaconda.com/pkgs/pro/linux-64
- https://repo.anaconda.com/pkgs/pro/noarch
To search for alternate channels that may provide the conda package you're
looking for, navigate to
https://anaconda.org
and use the search bar at the top of the page.
@daxiongshu ran our DBSCAN & k-means implementations against [1] and found that our results do not match, even for datasets as small as size 2^10.
[1] https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
I have spoken with some of the other back-end developers of cuML about this. GLOG is Google's logging tool for C++.
Similar to other logging tools, it allows us to set a log level at run-time in order to debug cuML algorithms without having to rebuild each time.
It will also allow users of the community to drop the log level to debug when they experience problems and provide their output on github issues so we can help isolate problems more easily.
All notebooks should be aggregated here: https://github.com/rapidsai/notebooks
where they can be properly tracked and controlled, rather than in the python directory (cuml/python).
Why does cuml always fail to install when I execute the command "make -j"?
The error message is as follows:
31 errors detected in the compilation of "/tmp/tmpxft_000022ed_00000000-15_pca_test.compute_61.cpp1.ii".
CMakeFiles/ml_test.dir/build.make:62: recipe for target 'CMakeFiles/ml_test.dir/test/pca_test.cu.o' failed
make[2]: *** [CMakeFiles/ml_test.dir/test/pca_test.cu.o] Error 2
5 errors detected in the compilation of "/tmp/tmpxft_000022f0_00000000-15_dbscan_test.compute_61.cpp1.ii".
CMakeFiles/ml_test.dir/build.make:88: recipe for target 'CMakeFiles/ml_test.dir/test/dbscan_test.cu.o' failed
make[2]: *** [CMakeFiles/ml_test.dir/test/dbscan_test.cu.o] Error 2
CMakeFiles/Makefile2:73: recipe for target 'CMakeFiles/ml_test.dir/all' failed
make[1]: *** [CMakeFiles/ml_test.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2.
The last commit to CUTLASS in cuML's ml-prims/external/cutlass submodule was from June 2018.
The CUTLASS repository has commits from Dec 19, 2018.
It would be a good idea to update this.
Describe the bug
3 cuML tests seem to be sensitive to random number generation, because they fail with GCC 7.3.0:
[==========] 56 tests from 20 test cases ran. (1502 ms total)
[ PASSED ] 53 tests.
[ FAILED ] 3 tests, listed below:
[ FAILED ] KmeansTests/KmeansTestF.Fit/0, where GetParam() = 16-byte object <02-00 00-00 CD-CC 4C-3D 04-00 00-00 02-00 00-00>
[ FAILED ] KmeansTests/KmeansTestD.Fit/0, where GetParam() = 24-byte object <02-00 00-00 00-00 00-00 9A-99 99-99 99-99 A9-3F 04-00 00-00 02-00 00-00>
[ FAILED ] TsvdTests/TsvdTestDataVecF.Result/0, where GetParam() =
With GCC 7.1.1 the tests work fine.
Steps/Code to reproduce bug
Build cuML with GCC 7.3.0 and run ml_test
Expected behavior
All tests are passing
Environment details (please complete the following information):
Would you like to wrap any pointer data members with the class template “std::unique_ptr”?
Update candidates:
@daxiongshu had a great idea to start testing our algorithms against SKLearn's results [1].
A couple discrepancies turned up in DBSCAN. I believe we should be running these comparisons as part of our py.test suite.
[1] https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
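Because cluster IDs are arbitrary, a py.test comparison against scikit-learn has to treat two labelings as equal when they differ only by renaming. The sketch below shows one way to write such a check in plain NumPy. The helper `same_partition` and the test data are illustrative and do not come from the existing test suite.

```python
import numpy as np

def same_partition(labels_a, labels_b):
    """True if two labelings group the samples identically, ignoring label IDs."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    if a.shape != b.shape:
        return False
    # The distinct (a, b) pairs must form a one-to-one mapping between label sets.
    pairs = set(zip(a.tolist(), b.tolist()))
    return len(pairs) == len(set(a.tolist())) == len(set(b.tolist()))

def test_labels_match_up_to_renaming():
    reference = [0, 0, 1, 1, 2]
    renamed = [2, 2, 0, 0, 1]      # same clusters, different IDs
    different = [0, 1, 1, 1, 2]    # sample 1 moved to another cluster
    assert same_partition(reference, renamed)
    assert not same_partition(reference, different)
```

In a real test, `reference` would hold the scikit-learn labels and the second argument the cuML labels copied back to the host.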
Describe the bug
A clear and concise description of what the bug is.
Steps/Code to reproduce bug
import cudf
Expected behavior
The sample notebook should run as indicated by the docs. At least the import should work
Environment details (please complete the following information):
# packages in environment at /env/python-custom:
#
# Name Version Build Channel
arrow-cpp 0.10.0 py36h70250a7_0 conda-forge
blas 1.0 mkl
boost-cpp 1.67.0 h3a22d5f_0 conda-forge
bzip2 1.0.6 h470a237_2 conda-forge
ca-certificates 2018.11.29 ha4d7672_0 conda-forge
certifi 2018.11.29 py36_1000 conda-forge
cffi 1.11.5 py36h5e8e0c9_1 conda-forge
cudf 0.4.0 py36_0 rapidsai
cuml 0.4.0 cuda9.2_py36_0 rapidsai
cython 0.28.5 py36hfc679d8_0 conda-forge
faiss-gpu 1.4.0 py36_cuda8.0.61_1 pytorch
icu 58.2 hfc679d8_0 conda-forge
intel-openmp 2019.1 144
libcudf 0.4.0 cuda9.2_0 rapidsai
libcudf_cffi 0.4.0 cuda9.2_py36_0 rapidsai
libcuml 0.4.0 cuda9.2_0 rapidsai
libffi 3.2.1 hfc679d8_5 conda-forge
libgcc 7.2.0 h69d50b8_2 conda-forge
libgcc-ng 7.2.0 hdf63c60_3 conda-forge
libgfortran-ng 7.2.0 hdf63c60_3 conda-forge
libstdcxx-ng 7.2.0 hdf63c60_3 conda-forge
llvmlite 0.26.0 py36hd28b015_0 conda-forge
mkl 2018.0.3 1
mkl_fft 1.0.10 py36_0 conda-forge
mkl_random 1.0.2 py36_0 conda-forge
ncurses 6.1 hfc679d8_2 conda-forge
numba 0.41.0 py36hf8a1672_0 conda-forge
numpy 1.15.0 py36h1b885b7_0
numpy-base 1.15.0 py36h3dfced4_0
nvstrings 0.2.0 cuda9.2_py36_0 nvidia
openssl 1.0.2p h470a237_2 conda-forge
pandas 0.20.3 py36_1 conda-forge
parquet-cpp 1.5.0.pre h83d4a3d_0 conda-forge
pip 18.1 py36_1000 conda-forge
pyarrow 0.10.0 py36hfc679d8_0 conda-forge
pycparser 2.19 py_0 conda-forge
python 3.6.7 h5001a0f_1 conda-forge
python-dateutil 2.7.5 py_0 conda-forge
pytz 2018.9 py_0 conda-forge
readline 7.0 haf1bffa_1 conda-forge
setuptools 40.6.3 py36_0 conda-forge
six 1.12.0 py36_1000 conda-forge
sqlite 3.26.0 hb1c47c0_0 conda-forge
tk 8.6.9 ha92aebf_0 conda-forge
wheel 0.32.3 py36_0 conda-forge
xz 5.2.4 h470a237_1 conda-forge
zlib 1.2.11 h470a237_4 conda-forge
Additional context
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/env/python-custom/lib/python3.6/site-packages/cudf/__init__.py", line 2, in <module>
from cudf import dataframe # noqa: F401
File "/env/python-custom/lib/python3.6/site-packages/cudf/dataframe/__init__.py", line 1, in <module>
from cudf.dataframe import (buffer, dataframe, series, # noqa: F401
File "/env/python-custom/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 18, in <module>
from cudf import formatting, _gdf
File "/env/python-custom/lib/python3.6/site-packages/cudf/_gdf.py", line 13, in <module>
from libgdf_cffi import ffi, libgdf
File "/env/python-custom/lib/python3.6/site-packages/libgdf_cffi/__init__.py", line 30, in <module>
libgdf_api = ffi.dlopen(_get_lib_name())
OSError: cannot load library 'libcudf.so': libNVStrings.so: cannot open shared object file: No such file or directory
Should cuML/cuDF also support DLPack?
Following another thread on pytorch made me think about this. pytorch/pytorch#15601
I'm a fan of the numba array_interface, but supporting multiple integrations may make it easier to consolidate in the future.
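For readers unfamiliar with DLPack, the sketch below shows the kind of zero-copy hand-off the protocol provides, using CPU-only NumPy as a stand-in. It assumes NumPy 1.23 or newer, where `np.from_dlpack` is available, and says nothing about what cuDF or cuML currently expose.

```python
import numpy as np

source = np.arange(6, dtype=np.float32)

# np.from_dlpack consumes any object that implements __dlpack__ and
# __dlpack_device__; NumPy arrays implement both.
shared = np.from_dlpack(source)

# The result is a view of the same buffer, so a write through `source`
# is visible through `shared`.
source[0] = 42.0
print(shared[0])
```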
cuml default is 1
sklearn default is 5
IndexError Traceback (most recent call last)
in
/conda/envs/gdf/lib/python3.5/site-packages/dask_xgboost-0.1.5-py3.5.egg/dask_xgboost/core.py in train(client, params, data, labels, dmatrix_kwargs, **kwargs)
229 """
230 return client.sync(_train, client, params, data,
--> 231 labels, dmatrix_kwargs, **kwargs)
232
233
/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/client.py in sync(self, func, *args, **kwargs)
645 return future
646 else:
--> 647 return sync(self.loop, func, *args, **kwargs)
648
649 def repr(self):
/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/utils.py in sync(loop, func, *args, **kwargs)
275 e.wait(10)
276 if error[0]:
--> 277 six.reraise(*error[0])
278 else:
279 return result[0]
/conda/envs/gdf/lib/python3.5/site-packages/six.py in reraise(tp, value, tb)
691 if value.traceback is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None
/conda/envs/gdf/lib/python3.5/site-packages/distributed-1.23.3-py3.5.egg/distributed/utils.py in f()
260 if timeout is not None:
261 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262 result[0] = yield future
263 except Exception as exc:
264 error[0] = sys.exc_info()
/conda/envs/gdf/lib/python3.5/site-packages/tornado/gen.py in run(self)
1131
1132 try:
-> 1133 value = future.result()
1134 except Exception:
1135 self.had_exception = True
/conda/envs/gdf/lib/python3.5/asyncio/futures.py in result(self)
292 self._tb_logger = None
293 if self._exception is not None:
--> 294 raise self._exception
295 return self._result
296
/conda/envs/gdf/lib/python3.5/site-packages/tornado/gen.py in wrapper(*args, **kwargs)
324 try:
325 orig_stack_contexts = stack_context._state.contexts
--> 326 yielded = next(result)
327 if stack_context._state.contexts is not orig_stack_contexts:
328 yielded = _create_future()
/conda/envs/gdf/lib/python3.5/site-packages/dask_xgboost-0.1.5-py3.5.egg/dask_xgboost/core.py in _train(client, params, data, labels, dmatrix_kwargs, **kwargs)
135 label_parts = None
136 if isinstance(data, (list, tuple)):
--> 137 if isinstance(data[0], Delayed):
138 for data_part in data:
139 if not isinstance(data_part, Delayed):
IndexError: list index out of range
Describe the bug
The current DBSCAN implementation appears to have a memory leak that builds up over multiple different runs of the algorithm.
pca_demo.ipynb and tsvd_demo.ipynb
%%time
pca_sk = skPCA(n_components=n_components,svd_solver=svd_solver,
               whiten=whiten, random_state=random_state)
result_sk = pca_sk.fit_transform(X)
%%time
algorithm='arpack'
tsvd_sk = skTSVD(n_components=n_components,algorithm=algorithm,
                 random_state=random_state)
result_sk = tsvd_sk.fit_transform(X)
I found that using 'fit_transform' will not report an error, using 'fit' will not
I have an issue with DBSCAN terminating on large datasets. I'm running the latest NGC Rapids Docker container. I've seen the comments in #31.
[I 14:36:16.181 LabApp] Kernel started: d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1
[I 14:36:16.667 LabApp] Adapting to protocol v5.1 for kernel d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1
[I 14:37:01.848 LabApp] Saving file at /cuml/dbscan_twitter.ipynb
terminate called after throwing an instance of 'std::runtime_error'
what(): Exception occured! file=/rapids/cuml/cuML/src/dbscan/vertexdeg/algo5.h line=141: FAIL: call='res.result'. Reason:invalid configuration argument
[I 14:38:40.180 LabApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel d712459f-cf0a-4c0e-a6ec-ecc73b9e91f1 restarted
I'm using the same Twitter derived point dataset as in #53 and sampling it down until it works. The source data has 10.5 million points. The crashes happen above 5 million rows x 2 columns, row counts below that work fine. Minimum sample size set to 1000 eps to none.
In addition to using the Twitter file, I've also tested it with randomly generated data.
0 1
0 478401.76952889207 542950.5525448014
1 454622.9484872194 463117.5441902199
2 340651.60100943915 568573.833436874
3 60462.91779449762 248186.3741022621
4 290905.83582845033 564827.5589121555
5 305875.7357089389 187773.6372960709
6 122647.20323430178 444709.74442503956
7 11336.977964928763 382183.34051422554
8 22657.665326672173 527002.8538174401
9 106190.10338763308 428359.52118789754
[9999990 more rows]
The methods fit, query, to_cudf and to_nparray are missing docstrings in comparison to the other methods.
I'm working on a fix for this now.
I know this might be an ambitious effort, but would the RAPIDS development team be able to provide support for multi-GPU PCA? Many of the datasets I work with are 30GB+. Being able to reduce the dimensionality of these datasets to more manageable sizes would be useful not only for reducing computational expense but, perhaps more so, for sharing and collaboration.
As a test case, I am currently working with HCP (Human Connectome Project) data. Here, I use NumPy to manipulate my data because I am a decent human being. But after my data is all nice and beautiful, I am faced with "Out of Memory" when I try to convert my NumPy array to a Pandas dataframe and then to a cudf dataframe.
The shape of the data I am feeding into PCA here is (120, 63070800). The time to beat is shown below. My colleagues and I have struggled with this PCA problem on different datasets for almost two years now, so if you can do us a solid and end our misery a little bit faster, we would be extremely grateful. Thanks again RAPIDS team. You are killing it.
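For data with very few rows and very many columns, as in the (120, 63070800) case above, one known workaround is to compute the PCA scores from the small rows-by-rows Gram matrix instead of decomposing the full matrix. The NumPy sketch below illustrates that technique on a tiny random array. It is not a cuML feature or a RAPIDS recommendation, and `pca_scores_via_gram` is a hypothetical name.

```python
import numpy as np

def pca_scores_via_gram(X, n_components):
    """Project rows of X onto the top principal components using the Gram matrix."""
    Xc = X - X.mean(axis=0)                 # center each feature
    gram = Xc @ Xc.T                        # shape (n_samples, n_samples)
    evals, evecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:n_components]
    # Eigenvalues of the Gram matrix are the squared singular values of Xc.
    singular_values = np.sqrt(np.clip(evals[top], 0.0, None))
    return evecs[:, top] * singular_values  # equals U * S from the SVD, up to sign

rng = np.random.default_rng(0)
X_small = rng.standard_normal((6, 50))      # few rows, many columns
scores = pca_scores_via_gram(X_small, n_components=2)
print(scores.shape)
```

With 120 rows the Gram matrix is only 120 x 120, and it can in principle be accumulated over column chunks so that the full array never has to sit in memory at once.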
Is your feature request related to a problem? Please describe.
We currently are not exposing the following things from our C/C++ API:
The advantages of doing these are:
Describe the solution you'd like
One solution can be to:
cumlHandle_t structure (just like cudnn/cublas/cufft/cusolver).
Describe alternatives you've considered
There are no alternatives currently.
Additional context
None.
Note
Just like #77 , I'm mostly filing this issue so that it doesn't slip away. Please feel free to set the priority for this accordingly, @datametrician @dantegd .
cuml/cuML/CMakeLists.txt requires that both pthread and z are installed by the OS package manager. There are cmake modules for FindThreads and FindZLIB which can be used to discover the location of the libraries in question, so that they can be linked more generally.
Currently, if a user installed zlib via conda install -c conda-forge zlib, libcuml.so will fail to link because it expects libz.so to be in /usr/local/lib.
https://github.com/rapidsai/cuml/blob/master/cuML/CMakeLists.txt#L123-L130
target_link_libraries(cuml
OpenMP::OpenMP_CXX
${CUDA_cublas_LIBRARY}
${CUDA_curand_LIBRARY}
${CUDA_cusolver_LIBRARY}
${CUDA_CUDART_LIBRARY}
pthread
z)
Build gtests for vertex degree, adjacency graph, and labeling components within DBSCAN to aid in debugging performance and correctness problems.
There are naive kernel implementations of each of these components that should provide good baselines. This needs to be verified as well, however.