The RAPIDS Accelerator for Apache Spark ML provides a set of GPU accelerated Spark ML algorithms.
This section describes the main API changes for the GPU-accelerated algorithms, using PCA as the example.
Compare the original PCA training API:
val pca = new org.apache.spark.ml.feature.PCA()
.setInputCol("feature_vector_type")
.setOutputCol("feature_value_3d")
.setK(3)
.fit(vectorDf)
The GPU version is provided by a customized class. Apart from the package name,
the training code stays the same:
val pca = new com.nvidia.spark.ml.feature.PCA()
.setInputCol("feature_array_type") // accept ArrayType column, no need to convert it to Vector type
.setOutputCol("feature_value_3d")
.setK(3)
.fit(vectorDf)
...
Note: In the CPU version, setInputCol expects a column of Vector type for training. In the GPU version, the extra preprocessing step that converts an ArrayType column to Vector type is not needed: setInputCol accepts the raw ArrayType column directly.
- essential build tools:
- CUDA Toolkit (>= 11.0)
- conda: use miniconda to maintain header files and cmake dependencies
- cuDF:
- install cuDF shared library via conda:
conda install -c rapidsai-nightly -c nvidia -c conda-forge cudf=22.02 python=3.8 -y
- RAFT(22.02):
- raft provides only header files, so no build instructions for it.
$ git clone -b branch-21.12 https://github.com/rapidsai/raft.git
- export RAFT_PATH:
export RAFT_PATH=PATH_TO_YOUR_RAFT_FOLDER
You can build the project directly from the project root with:
mvn clean package
This generates rapids-4-spark-ml_2.12-22.02.0-SNAPSHOT.jar under the target folder.
Note: This module contains both native and Java/Scala code. The native library build steps have been added to the pom.xml file, so the Maven build command also builds the native library. Make sure all prerequisites are met; otherwise the build fails with an error message such as "cmake not found" or "ninja not found".
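Because a missing prerequisite only surfaces partway through the Maven build, it can help to check for the tools up front. The following is a minimal preflight sketch, not part of the project: the helper names missing_tools and raft_path_ok are hypothetical, and the tool list (nvcc for the CUDA Toolkit, plus conda, cmake, ninja, and mvn) is an assumption drawn from the prerequisites and error messages mentioned above.

```shell
#!/usr/bin/env bash
# Preflight sketch: report build prerequisites that appear to be missing.
# Helper names and the tool list are illustrative assumptions.

# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Return 0 if RAFT_PATH is set and points at an existing directory.
raft_path_ok() {
  [ -n "${RAFT_PATH:-}" ] && [ -d "$RAFT_PATH" ]
}

missing="$(missing_tools nvcc conda cmake ninja mvn)"
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "missing tools:" $missing
fi
raft_path_ok || echo "RAFT_PATH is unset or not a directory"
```

The script only reports problems and does not stop the build, so it can be run before mvn clean package without side effects.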
When building the jar, the cudf jar and the spark-rapids plugin jar are downloaded to your local Maven repository, usually ~/.m2/repository.
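The exact file names of the downloaded jars depend on the version that Maven resolved, so it may be convenient to look them up rather than type them out. The sketch below assumes the default repository location and the ai/rapids/cudf and com/nvidia/rapids-4-spark_2.12 group paths used elsewhere in this document. find_jar is a hypothetical helper and M2_REPO is a variable introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: locate downloaded jars under the local Maven repository.
# find_jar and M2_REPO are illustrative names, not part of the project.

# Print the first file (in sorted order) under directory $1 whose name
# matches glob $2. Prints nothing if the directory does not exist.
find_jar() {
  [ -d "$1" ] || return 0
  find "$1" -type f -name "$2" | sort | head -n 1
}

M2_REPO="${M2_REPO:-$HOME/.m2/repository}"   # assumed default location
find_jar "$M2_REPO/ai/rapids/cudf" 'cudf-*.jar'
find_jar "$M2_REPO/com/nvidia/rapids-4-spark_2.12" 'rapids-4-spark_2.12-*.jar'
```

If several versions are present in the repository, the helper returns only the first match in sorted order, so check that the printed path is the version you built against.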
Add the artifact jars to Spark, for example:
ML_JAR="target/rapids-4-spark-ml_2.12-22.02.0-SNAPSHOT.jar"
# Use $HOME here: the shell does not expand ~ inside double quotes.
CUDF_JAR="$HOME/.m2/repository/ai/rapids/cudf/22.02.0-SNAPSHOT/cudf-22.02.0-SNAPSHOT.jar"
PLUGIN_JAR="$HOME/.m2/repository/com/nvidia/rapids-4-spark_2.12/22.02.0-SNAPSHOT/rapids-4-spark_2.12-22.02.0-SNAPSHOT.jar"
$SPARK_HOME/bin/spark-shell --master $SPARK_MASTER \
--driver-memory 20G \
--executor-memory 30G \
--conf spark.driver.maxResultSize=8G \
--jars ${ML_JAR},${CUDF_JAR},${PLUGIN_JAR} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.enabled=true \
--conf spark.task.resource.gpu.amount=0.08 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
--files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
Please refer to PCA examples for more details about example code. We provide both Notebook and jar versions there. Instructions to run these examples are described in the README.