dsc-xgboost-deloitte-pyspark's Introduction

Introduction

Now that we've run the model locally with one month of data, we'd like to build the model using multiple months. The data is about 10GB zipped, and considerably larger unzipped. We could load it into a Pandas dataframe, but depending on your machine it will most likely run out of memory. Instead, we'll write the code for one month locally using PySpark, then migrate the code to run on EMR against multiple unzipped files.

Objectives

  • Migrate the model to PySpark to fully utilize distributed computing resources

First, use the boto3 client to set up the S3 resource, then check whether the file exists in your bucket. If it doesn't exist, you may have to upload it. You can skip this step for now, but it will be helpful for the next lab, where you'll be pulling the data from the S3 bucket.
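A minimal sketch of that existence check, assuming a hypothetical bucket name and object key (replace both with your own):

import boto3

s3 = boto3.resource("s3")
bucket = "your-bucket-name"  # hypothetical bucket name
key = "2019-Oct.csv"         # hypothetical object key

# List objects under the key prefix and check for an exact match.
exists = any(obj.key == key for obj in s3.Bucket(bucket).objects.filter(Prefix=key))
if not exists:
    # Upload the local file if it isn't in the bucket yet.
    s3.Bucket(bucket).upload_file("2019-Oct.csv", key)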

#solution
from pyspark.sql import SparkSession  # SparkSession lives in pyspark.sql, not pyspark

spark = SparkSession \
    .builder \
    .appName("XGBoost") \
    .getOrCreate()

# path can be a local file path or an s3:// URI (see the boto3 check above)
path = ""
df = spark.read.csv(path=path, header=True, inferSchema=True)
df.show(5)  # on Databricks, display(df) renders a richer table
#df.cache()

How many unique customers?

# solution
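One possible answer, assuming the user_id column from the previous lab's dataset:

from pyspark.sql import functions as F

# Count distinct customers by user_id.
df.select(F.countDistinct("user_id")).show()
# Equivalently: df.select("user_id").distinct().count()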

Preprocess the data

Using the logic from the previous lab, use PySpark functions to explore the dataset.
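For example, you might start with the schema and the distribution of event types (column names assume the same dataset as the previous lab):

# Inspect the inferred schema.
df.printSchema()

# How are events distributed across view / cart / purchase?
df.groupBy("event_type").count().orderBy("count", ascending=False).show()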

Modeling: Cart Abandonment

The model will be similar: let's build out the new features, then start building the model.
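As a sketch of the feature engineering, one way to label cart abandonment is to flag each user_session that has a 'cart' event but no 'purchase' event (column names assume the previous lab's dataset):

from pyspark.sql import functions as F

# Per-session flags: did the session ever add to cart, and did it ever purchase?
session_flags = df.groupBy("user_session").agg(
    F.max((F.col("event_type") == "cart").cast("int")).alias("has_cart"),
    F.max((F.col("event_type") == "purchase").cast("int")).alias("has_purchase"),
)

# Abandoned = added to cart but never purchased within the session.
session_flags = session_flags.withColumn(
    "abandoned",
    ((F.col("has_cart") == 1) & (F.col("has_purchase") == 0)).cast("int"),
)

These per-session labels can then be joined back onto df before the split below.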

# solution for additional columns/features
# df should be the DataFrame with additional columns/features
train, test = df.randomSplit([0.7, 0.3], seed=42)
print("There are %d training examples and %d test examples." % (train.count(), test.count()))

Most MLlib algorithms require a single input column containing a vector of features and a single target column. The DataFrame currently has one column for each feature. MLlib provides functions to help you prepare the dataset in the required format.

MLlib pipelines combine multiple steps into a single workflow, making it easier to iterate as you develop the model.

In this example, you create a pipeline using the following functions:

  • VectorAssembler: Assembles the feature columns into a feature vector.
  • VectorIndexer: Identifies columns that should be treated as categorical. This is done heuristically, identifying any column with a small number of distinct values as categorical. In this example, the cart abandonment feature would be categorical (0 or 1).
  • XgboostRegressor: Uses the XgboostRegressor estimator to learn how to predict the target from the feature vectors.
  • CrossValidator: The XGBoost regression algorithm has several hyperparameters. This notebook illustrates how to use hyperparameter tuning in Spark. This capability automatically tests a grid of hyperparameters and chooses the best resulting model.
from pyspark.ml.feature import VectorAssembler, VectorIndexer
 
# Remove the target column from the input feature set.
featuresCols = df.columns
# featuresCols.remove('your target column')
 
# vectorAssembler combines all feature columns into a single feature vector column, "rawFeatures".
vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")
 
# vectorIndexer identifies categorical features and indexes them, and creates a new column "features". 
vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)
from sparkdl.xgboost import XgboostRegressor  # sparkdl ships with the Databricks ML runtime
 
# labelCol is a placeholder; set it to your actual target column.
xgb_regressor = XgboostRegressor(num_workers=3, labelCol="your_label_column", missing=0.0)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
 
# Define a grid of hyperparameters to test:
#  - max_depth: maximum depth of each decision tree
#  - n_estimators: the total number of trees
paramGrid = ParamGridBuilder()\
  .addGrid(xgb_regressor.max_depth, [2, 5])\
  .addGrid(xgb_regressor.n_estimators, [10, 100])\
  .build()
 
# Define an evaluation metric.  The CrossValidator compares the true labels with predicted values for each combination of parameters, and calculates this value to determine the best model.
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol=xgb_regressor.getLabelCol(),
                                predictionCol=xgb_regressor.getPredictionCol())
 
# Declare the CrossValidator, which performs the model tuning.
cv = CrossValidator(estimator=xgb_regressor, evaluator=evaluator, estimatorParamMaps=paramGrid)

Create the pipeline

from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vectorAssembler, vectorIndexer, cv])
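
With the pipeline defined, a natural next step (a sketch, using the train/test split and evaluator defined above) is to fit it on the training data and score the held-out set:

# Train the full pipeline: assemble, index, and cross-validate the regressor.
pipelineModel = pipeline.fit(train)

# Transform the test set and compute RMSE on the predictions.
predictions = pipelineModel.transform(test)
rmse = evaluator.evaluate(predictions)
print("RMSE on the test data: %g" % rmse)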

dsc-xgboost-deloitte-pyspark's People

Contributors

cheffrey2000, ismayc, jilliankim


dsc-xgboost-deloitte-pyspark's Issues

No curriculum or solution branch

Link to Canvas

Issue Subtype

  • Master branch code
  • Solution branch code
  • Code tests
  • Layout/rendering issue
  • Instructions unclear
  • Other (explain below)

Describe the Issue

Source


Concern

Needs a solution and curriculum branch to match curriculum standards.

(Optional) Proposed Solution

What OS Are You Using?

  • OS X
  • Windows
  • WSL
  • Linux
  • Saturn Cloud from Canvas

Any Additional Context?
