Code Monkey home page Code Monkey logo

optimizing-azure-ml-pipeline's Introduction

Optimizing an Azure ML Pipeline

Overview

Built and optimized an MS Azure machine learning pipeline using the Python SDK and a provided Scikit-learn model. This model was then compared to an Azure AutoML run.

Summary

This dataset contains data about bank marketing campaigns. Usually marketing campaigns are based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed. We seek to predict if the client will subscribe a term deposit i.e. 'y'.

The best performing model was Voting Ensemble model developed by AutoML with accuracy of 0.9177, the graph shows the top 8 features the model used in predicting whether the client will subscribe a term deposit or not.

top 8 features by AutoML

Scikit-learn Pipeline

  • Simple overview of how the pipeline architecture works:
    • The snapshot of folder is sent to experiment (Azure ML).
    • Create docker image.
    • Deploy docker and snapshot to compute.
    • Mount datastore to compute.
    • Launch script (training).
    • Stream logs, metrics, stdouts.
    • Copy output after finishing job (experiment).

For training, train.py script was written (used only for Hyperdrive). In this file we used LogisticRegression. In train.py, we load data, clean it, and then perform training using sklearn logistic regression. Hyperdrive will choose the best parameter among combinations.

We choose 'RandomParameterSampling' which uses random sampling over a hyperparameter search space. It help us in tuning discrete hyperparameter like 'max_iter' and uniform distribution like 'C' (Inverse of regularization strength).

We used 'BanditPolicy' for early stopping policy as it will help in stooping the experiment if metric falls below certain threshold (slack_factor).

AutoML

AutoML is advanced tool in Azure which let's our work much easier. AutoML search through all algorithms (unless blacklisted) and find the best one.

  • The basic configuration of AutoML that we used are:

    • timeout to 30 minutes (due to resource limitations)
    • primary metric was "accuracy"
    • cross validation set to 5 to avoid overfitting.
    • task name was "classification".
    • data in which target column name is "y".
  • "VotingEnsemble" uses few algorithms like "XGBoostClassifier", "LightGBMClassifier". Few of hyperparameters for "XGBoostClassifier" which is learnt by AutoML are:

    • "n_estimators": Number of boosting rounds.
    • "learning_rate": Rate at which classifier will learn.
    • "max_depth": Maximum tree depth for base learners.

Below is the full list of hyperparameters: "XGBoostClassifier estimator of VotingEnsemble" Also added cell in notebook where we can see all estimators used with various hyperparameters.

AutoML tries a lot of algorithm and the best model was "VotingEnsemble" having accuracy of "0.9177". It generates 21 metric values like "AUC_weighted", "weighted_accuracy", etc. It also creates confusion matix and accuracy table. We can use these metrics to gain a lot of insights. We can also see model's explanation like top "x" feature.

Pipeline comparison

AutoML performs slightly better than HyperDrive in accuracy by 0.01. The best model choosed by AutoML was VotingEnsemble.

Architecture wise both are different, but idea is same. The main advantage of AutoML is that we have many possibilities to increase search of better algorithm and hyperparameters.

In AutoML, it provides warnings for class imbalance and data checks. It also provides top features to select best model. AutoML also imputes missing feature values.

While in Hyperdrive, we only build one model which we think is best and tune it's hyperparameters (those which we think is best).

AutoML provides a wide variety of models and preprocessing steps which is not there in Hyperdrive.

Future work

In future experiments I will try other primary metrics like 'AUC', 'weighted accuracy', etc. Accuracy is basic metric and it is useful to take other metrics too, to gain more insights. "AUC" gives the arithmetic mean of the score for each class, weighted by the true number of true instances in each class.

Also we can try other hyperparameters of AutoML like increase number of cross-validation.

As increase timeout duration was capped to 30 minutes so there could only be a limited number of models that can be runned on that time period, so a longer timeout could have greater number of models to run and thus improving the performance.

optimizing-azure-ml-pipeline's People

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.