dsc-xgboost-lab's Introduction

XGBoost - Lab

Introduction

In this lab, we'll install the XGBoost library and explore how to use this popular boosting model to classify wines by quality, using the Wine Quality Dataset from the UCI Machine Learning Repository.

Objectives

You will be able to:

  • Fit, tune, and evaluate an XGBoost algorithm

Installing XGBoost

Run this lab on your local computer.

XGBoost is a separate library and is not included in scikit-learn, so we need to install it ourselves with pip.

To install XGBoost, follow these steps:

  1. Open up a new terminal window
  2. Activate your conda environment
  3. Run pip install xgboost
  4. Once the installation has completed, run the cell below to verify that everything worked

from xgboost import XGBClassifier
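If the import fails, one common cause is that the notebook kernel is running a different Python than the one pip installed into. The sketch below uses only the standard library to print which interpreter is running and whether it can find the package. The helper name `is_installed` is made up for illustration.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if the running interpreter can locate the package."""
    return importlib.util.find_spec(package_name) is not None


# The interpreter behind this notebook kernel or script
print('Python executable:', sys.executable)
print('xgboost found:', is_installed('xgboost'))
```

If the printed path does not point into the conda environment you installed into, activating that environment and starting Jupyter from the same terminal is a reasonable first thing to try.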

Run the cell below to import everything we'll need for this lab.

import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Loading the Data

The dataset we'll be using for this lab is currently stored in the file 'winequality-red.csv'.

In the cell below, use pandas to import the dataset into a dataframe, and inspect the .head() of the dataframe to ensure everything loaded correctly.

df = None

For this lab, our target column will be 'quality'. That makes this a multiclass classification problem. Given the data in the columns from 'fixed_acidity' through 'alcohol', we'll predict the quality of the wine.

This means that we need to store our target variable separately from the features, and then split both into training and test sets so that we can evaluate the model on data it has not seen.

Splitting the Data

In the cell below:

  • Assign the 'quality' column to y
  • Drop this column ('quality') and assign the resulting DataFrame to X
  • Split the data into training and test sets. Set the random_state to 42

y = None
X = None

# Replace the placeholders with the result of the train/test split
X_train, X_test, y_train, y_test = None, None, None, None
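The three steps above can be sketched on a tiny stand-in DataFrame. The values below are made up and only two feature columns are included, so this illustrates the pattern and is not the real dataset. It assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the wine data: made-up values, not the real file
toy_df = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.5, 11.2, 9.5, 12.8, 10.0, 11.0],
    'pH': [3.51, 3.20, 3.26, 3.16, 3.40, 3.35, 3.30, 3.28],
    'quality': [5, 5, 6, 6, 5, 7, 6, 5],
})

# Separate the target from the features
y = toy_df['quality']
X = toy_df.drop(columns='quality')

# With no test_size given, train_test_split holds out 25% of the rows
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so repeated runs compare models on the same rows.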

Preprocessing the Data

These are the current target values:

y_train.value_counts().sort_index()

XGBoost requires classification labels to be consecutive integers starting at 0, but the quality values here start at 3. Instantiate a LabelEncoder (see the scikit-learn documentation) and use it to convert both y_train and y_test into arrays of label-encoded values (i.e. integers that count up from 0).

# Instantiate the encoder
encoder = None

# Fit and transform the training data


# Transform the test data

Confirm that the new values start at 0 instead of 3:

# Your code here to inspect the values of y_train and y_test
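To see what label encoding does, the mapping can be reproduced in plain Python. This is a sketch on made-up labels, not the real y_train. scikit-learn's LabelEncoder likewise assigns codes following the sorted order of the classes it sees during fitting.

```python
# Made-up quality labels standing in for y_train and y_test
train_labels = [5, 6, 3, 7, 5, 8, 4, 6]
test_labels = [6, 5, 7]

# "Fit": learn the sorted unique classes from the training labels only
classes = sorted(set(train_labels))
class_to_code = {label: code for code, label in enumerate(classes)}

# "Transform": apply the same mapping to both sets
train_encoded = [class_to_code[label] for label in train_labels]
test_encoded = [class_to_code[label] for label in test_labels]

print(class_to_code)   # {3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5}
print(train_encoded)   # [2, 3, 0, 4, 2, 5, 1, 3]
print(test_encoded)    # [3, 2, 4]
```

Because the mapping is learned from the training labels only, a test label that never appears in training cannot be encoded: this sketch would raise a KeyError, and LabelEncoder raises a ValueError.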

Building an XGBoost Model

Now that you have prepared the data for modeling, you can use XGBoost to build a model that can accurately classify wine quality based on the features of the wine!

The xgboost API deliberately mirrors the structure of scikit-learn models: you instantiate the classifier, call .fit() on the training data, and call .predict() to generate predictions.

# Instantiate XGBClassifier
clf = None

# Fit XGBClassifier


# Predict on training and test sets
training_preds = None
test_preds = None

# Accuracy of training and test sets
training_accuracy = None
test_accuracy = None

print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Test Accuracy: {:.4}%'.format(test_accuracy * 100))

Tuning XGBoost

The model performed noticeably worse on the test set than on the training set, which suggests it is beginning to overfit the training data. Let's tune the model to improve its test performance and reduce overfitting.

You've already encountered a lot of parameters when working with Decision Trees, Random Forests, and Gradient Boosted Trees.

For a full list of model parameters, see the XGBoost Documentation.

Examine the tunable parameters for XGBoost, and then fill in appropriate values for the param_grid dictionary in the cell below.

NOTE: Remember, GridSearchCV finds the best combination of parameters through an exhaustive combinatorial search, fitting a model for every combination of values in the grid. If you search through too many values, the search will take a very long time to run! To ensure your code runs in a reasonable amount of time, we have restricted the number of values each parameter can take.

param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}
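To get a feel for how quickly an exhaustive search grows, the combinations in this grid can be counted with the standard library. This is a sketch: the 5-fold figure is an assumption based on the default that recent scikit-learn versions use when cv=None.

```python
from itertools import product

param_grid = {
    'learning_rate': [0.1, 0.2],
    'max_depth': [6],
    'min_child_weight': [1, 2],
    'subsample': [0.5, 0.7],
    'n_estimators': [100],
}

# Every combination the exhaustive search would try
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumption: 5-fold CV, the default in recent scikit-learn when cv=None
n_fits = len(combinations) * n_folds

print('Combinations:', len(combinations))   # 8
print('Cross-validation fits:', n_fits)     # 40
```

The count is the product of the list lengths, so adding a third learning_rate value would raise it from 8 to 12 combinations.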

Now that we have constructed our param_grid dictionary, we can use a GridSearchCV object to tune our XGBoost model.

In the cell below:

  • Create a GridSearchCV object. Pass in the following parameters:
    • clf, the classifier
    • param_grid, the dictionary of parameters we're going to grid search through
    • scoring='accuracy'
    • cv=None
    • n_jobs=1
  • Fit our grid_clf object and pass in X_train and y_train
  • Store the best parameter combination found by the grid search in best_parameters. You can find these inside the grid search object's .best_params_ attribute
  • Use grid_clf to create predictions for the training and test sets, and store them in separate variables
  • Compute the accuracy score for the training and test predictions

grid_clf = None
grid_clf.fit(None, None)

best_parameters = None

print('Grid Search found the following optimal parameters: ')
for param_name in sorted(best_parameters.keys()):
    print('%s: %r' % (param_name, best_parameters[param_name]))

training_preds = None
test_preds = None
training_accuracy = None
test_accuracy = None

print('')
print('Training Accuracy: {:.4}%'.format(training_accuracy * 100))
print('Test Accuracy: {:.4}%'.format(test_accuracy * 100))

Summary

Great! You've now successfully used one of the most powerful boosting models in data science. You've also tuned the model for better performance using the grid search methodology you learned previously. XGBoost is a powerful modeling tool to have in your arsenal. Don't be afraid to experiment with it!

dsc-xgboost-lab's People

Contributors

alexgriff, cheffrey2000, fpolchow, hoffm386, mathymitchell, mike-kane, sumedh10

dsc-xgboost-lab's Issues

Can't import XGBClassifier

Link to Canvas

https://learning.flatironschool.com/courses/5864/assignments/217615?module_item_id=507145

Issue Subtype

  • Master branch code
  • Solution branch code
  • Code tests
  • Layout/rendering issue
  • Instructions unclear
  • Other (explain below)

Describe the Issue

The first cell throws an error. I have followed the steps to run "conda install xgboost" successfully in a terminal window, but I don't understand the instruction to "activate your conda environment". It appears that I have successfully installed xgboost in terminal, but the first cell still throws an error

Source

from xgboost import XGBClassifier

Concern

(Optional) Proposed Solution

What OS Are You Using?

  • OS X
  • Windows
  • WSL
  • Linux
  • Saturn Cloud from Canvas

Any Additional Context?

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/__init__.py in <module>
     21 try:
---> 22     from . import multiarray
     23 except ImportError as exc:

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/multiarray.py in <module>
     11 
---> 12 from . import overrides
     13 from . import _multiarray_umath

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/overrides.py in <module>
      6 
----> 7 from numpy.core._multiarray_umath import (
      8     add_docstring, implement_array_function, _get_implementing_args)

ImportError: dlopen(/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-darwin.so, 0x0002): Library not loaded: '@rpath/libopenblas.dylib'
  Referenced from: '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-darwin.so'
  Reason: tried: '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/../../../../libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/../../../../libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/bin/../lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/bin/../lib/libopenblas.dylib' (no such file), '/usr/local/lib/libopenblas.dylib' (no such file), '/usr/lib/libopenblas.dylib' (no such file)

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-1-477fa34615c5> in <module>
----> 1 from xgboost import XGBClassifier

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/xgboost/__init__.py in <module>
      7 import os
      8 
----> 9 from .core import DMatrix, DeviceQuantileDMatrix, Booster
     10 from .training import train, cv
     11 from . import rabit  # noqa

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/xgboost/core.py in <module>
     14 import warnings
     15 
---> 16 import numpy as np
     17 import scipy.sparse
     18 

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/__init__.py in <module>
    138     from . import _distributor_init
    139 
--> 140     from . import core
    141     from .core import *
    142     from . import compat

~/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/__init__.py in <module>
     46 """ % (sys.version_info[0], sys.version_info[1], sys.executable,
     47         __version__, exc)
---> 48     raise ImportError(msg)
     49 finally:
     50     for envkey in env_added:

ImportError: 

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.8 from "/Users/stubbletrouble/opt/anaconda3/envs/learn-env/bin/python"
  * The NumPy version is: "1.19.1"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: dlopen(/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-darwin.so, 0x0002): Library not loaded: '@rpath/libopenblas.dylib'
  Referenced from: '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/_multiarray_umath.cpython-38-darwin.so'
  Reason: tried: '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/../../../../libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/python3.8/site-packages/numpy/core/../../../../libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/bin/../lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/lib/libopenblas.dylib' (no such file), '/Users/stubbletrouble/opt/anaconda3/envs/learn-env/bin/../lib/libopenblas.dylib' (no such file), '/usr/local/lib/libopenblas.dylib' (no such file), '/usr/lib/libopenblas.dylib' (no such file)

xgboost isn't installed in Canvas/Illumidesk

Canvas Link

https://learning.flatironschool.com/courses/4266/assignments/158609?module_item_id=338742

Concern

xgboost isn't installed in Illumidesk, therefore the students can't run this in Canvas.

Please place a warning about this, or verbiage that tells the students to clone it down locally.

Additional Context

No response

Suggested Changes

Suggested change is to add "Please clone this locally and run it from your machine." or something along those lines.
