RAPIDS Accelerator for Apache Spark ML

The RAPIDS Accelerator for Apache Spark ML provides a set of GPU-accelerated Spark ML algorithms.

API change

The main API changes for the GPU-accelerated algorithms are described below:

1. PCA

For comparison, this is the original PCA training API:

val pca = new org.apache.spark.ml.feature.PCA()
  .setInputCol("feature_vector_type")
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)

The GPU version provides a customized class with the same API. Apart from switching to the com.nvidia.spark.ml.feature.PCA class, users need no code changes to get GPU acceleration:

val pca = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("feature_array_type") // accepts an ArrayType column; no conversion to Vector type needed
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)
...

Note: In the CPU version, setInputCol expects a column of Vector type for training. In the GPU version, setInputCol accepts a raw ArrayType column directly, so users can skip the extra preprocessing step that converts an ArrayType column to Vector type.

Build

Prerequisites:

  1. essential build tools:
  2. CUDA Toolkit (>=11.0)
  3. conda: use miniconda to maintain header files and cmake dependencies
  4. cuDF:
    • install cuDF shared library via conda:
      conda install -c rapidsai-nightly -c nvidia -c conda-forge cudf=22.02 python=3.8 -y
  5. RAFT(22.02):
    • RAFT provides only header files, so there is nothing to build; cloning the repository is enough.
      $ git clone -b branch-21.12 https://github.com/rapidsai/raft.git
  6. export RAFT_PATH:
    export RAFT_PATH=PATH_TO_YOUR_RAFT_FOLDER
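
Before starting the build, it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of this project: missing_tools and raft_path_ok are hypothetical helper names, and the tools worth checking (cmake and ninja) are taken from the error messages mentioned in the build note of this README.

```shell
# Hypothetical helper (not part of this project): print the names of the
# given commands that cannot be found on PATH, separated by spaces.
# Prints an empty line when every command is available.
missing_tools() {
  missing=""
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      missing="${missing:+$missing }$tool"
    fi
  done
  printf '%s\n' "$missing"
}

# Hypothetical helper: succeed only if RAFT_PATH is set and names an
# existing directory (the cloned RAFT folder).
raft_path_ok() {
  [ -n "${RAFT_PATH:-}" ] && [ -d "${RAFT_PATH}" ]
}
```

For example, missing_tools cmake ninja mvn prints an empty line when all three are on PATH, and raft_path_ok returns a non-zero status if RAFT_PATH was never exported.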

Build target jar

Build the jar directly from the project root with:

mvn clean package

The build generates rapids-4-spark-ml_2.12-22.02.0-SNAPSHOT.jar under the target folder.

Note: This module contains both native and Java/Scala code. The native library build instructions have been added to the pom.xml file, so the Maven build command also builds the native library. Make sure all prerequisites are met; otherwise the build fails with an error message such as "cmake not found" or "ninja not found".

How to use

When the jar is built, the cudf jar and the spark-rapids plugin jar are downloaded to your local Maven repository, usually ~/.m2/repository.
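
Since Spark needs all of these jars at launch, a quick existence check can catch a wrong path early. This is a minimal sketch under the assumption that you know the expected jar paths; missing_files is a hypothetical helper name, not something this project ships.

```shell
# Hypothetical helper (not part of this project): print every given path
# that is not an existing regular file, one per line.
# No output means all files were found.
missing_files() {
  for path in "$@"; do
    [ -f "$path" ] || printf '%s\n' "$path"
  done
}
```

For example, missing_files "$HOME/.m2/repository/ai/rapids/cudf/22.02.0-SNAPSHOT/cudf-22.02.0-SNAPSHOT.jar" prints the path back if the cudf jar has not been downloaded yet.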

Add the artifact jars to Spark, for example:

ML_JAR="target/rapids-4-spark-ml_2.12-22.02.0-SNAPSHOT.jar"
# Use ${HOME} rather than ~ here: the shell does not expand ~ inside double quotes.
CUDF_JAR="${HOME}/.m2/repository/ai/rapids/cudf/22.02.0-SNAPSHOT/cudf-22.02.0-SNAPSHOT.jar"
PLUGIN_JAR="${HOME}/.m2/repository/com/nvidia/rapids-4-spark_2.12/22.02.0-SNAPSHOT/rapids-4-spark_2.12-22.02.0-SNAPSHOT.jar"

$SPARK_HOME/bin/spark-shell --master $SPARK_MASTER \
 --driver-memory 20G \
 --executor-memory 30G \
 --conf spark.driver.maxResultSize=8G \
 --jars ${ML_JAR},${CUDF_JAR},${PLUGIN_JAR} \
 --conf spark.plugins=com.nvidia.spark.SQLPlugin \
 --conf spark.rapids.sql.enabled=true \
 --conf spark.task.resource.gpu.amount=0.08 \
 --conf spark.executor.resource.gpu.amount=1 \
 --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
 --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
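
The spark.task.resource.gpu.amount setting controls how many tasks share one GPU: Spark floors 1/amount to get the number of task slots per GPU. The sketch below illustrates that arithmetic for fractional amounts only; tasks_per_gpu is a hypothetical helper name, and the value 0.08 is the one used in the command above.

```shell
# Hypothetical helper (not part of this project): given a fractional
# spark.task.resource.gpu.amount, print how many tasks can share one GPU.
# Spark floors 1/amount, so 0.08 yields 12 slots rather than 12.5.
tasks_per_gpu() {
  awk -v amount="$1" 'BEGIN {
    if (amount <= 0) exit 1          # reject zero or negative amounts
    printf "%d\n", int(1 / amount)   # int() truncates, i.e. floors positives
  }'
}

tasks_per_gpu 0.08   # prints 12
```

With one GPU per executor (spark.executor.resource.gpu.amount=1), the example configuration therefore allows up to 12 concurrent tasks per executor on the GPU side.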

PCA examples

Please refer to PCA examples for more details about example code. We provide both Notebook and jar versions there. Instructions to run these examples are described in the README.
