Ridge and Lasso Regression - Lab

Introduction

In this lab, you'll practice your knowledge of ridge and lasso regression!

Objectives

In this lab you will:

  • Use lasso and ridge regression with scikit-learn
  • Compare and contrast lasso, ridge and non-regularized regression

Housing Prices Data

We'll use this version of the Ames Housing dataset:

# Run this cell without changes
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('housing_prices.csv', index_col=0)
df.info()

More information about the features is available in the data_description.txt file in this repository.

Data Preparation

The code below:

  • Separates the data into X (predictor) and y (target) variables
  • Splits the data into 75-25 training-test sets, with a random_state of 10
  • Separates each of the X values into continuous vs. categorical features
  • Fills in missing values (using different strategies for continuous vs. categorical features)
  • Scales continuous features to a range of 0 to 1
  • Dummy encodes categorical features
  • Combines the preprocessed continuous and categorical features back together
# Run this cell without changes
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Create X and y
y = df['SalePrice']
X = df.drop(columns=['SalePrice'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=10)

# Separate X data into continuous vs. categorical
X_train_cont = X_train.select_dtypes(include='number')
X_test_cont = X_test.select_dtypes(include='number')
X_train_cat = X_train.select_dtypes(exclude='number')
X_test_cat = X_test.select_dtypes(exclude='number')

# Impute missing values using SimpleImputer, median for continuous and
# filling in 'missing' for categorical
impute_cont = SimpleImputer(strategy='median')
X_train_cont = impute_cont.fit_transform(X_train_cont)
X_test_cont = impute_cont.transform(X_test_cont)
impute_cat = SimpleImputer(strategy='constant', fill_value='missing')
X_train_cat = impute_cat.fit_transform(X_train_cat)
X_test_cat = impute_cat.transform(X_test_cat)

# Scale continuous values using MinMaxScaler
scaler = MinMaxScaler()
X_train_cont = scaler.fit_transform(X_train_cont)
X_test_cont = scaler.transform(X_test_cont)

# Dummy encode categorical values using OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')
X_train_cat = ohe.fit_transform(X_train_cat)
X_test_cat = ohe.transform(X_test_cat)

# Combine everything back together
X_train_preprocessed = np.asarray(np.concatenate([X_train_cont, X_train_cat.todense()], axis=1))
X_test_preprocessed = np.asarray(np.concatenate([X_test_cont, X_test_cat.todense()], axis=1))

Linear Regression Model

Let's use this data to build a first naive linear regression model. Fit the model on the training data (X_train_preprocessed), then compute the R-Squared and the MSE for both the training and test sets.

# Replace None with appropriate code
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Fit the model
linreg = None

# Print R2 and MSE for training and test sets
None
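
If you get stuck, here is a minimal sketch of one possible solution. The print_metrics helper is not part of the lab or of scikit-learn; it is just a convenience defined here so the same four metrics can be printed for each model that follows.

# One possible solution (print_metrics is a helper defined here for
# convenience; it is not provided by the lab or by scikit-learn)
def print_metrics(model):
    print('Training r^2:', model.score(X_train_preprocessed, y_train))
    print('Test r^2:    ', model.score(X_test_preprocessed, y_test))
    print('Training MSE:', mean_squared_error(y_train, model.predict(X_train_preprocessed)))
    print('Test MSE:    ', mean_squared_error(y_test, model.predict(X_test_preprocessed)))

# Fit a plain (non-regularized) linear regression on the training data
linreg = LinearRegression()
linreg.fit(X_train_preprocessed, y_train)
print_metrics(linreg)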

Notice the severe overfitting above; our training R-Squared is very high, but the test R-Squared is negative! Similarly, the scale of the test MSE is orders of magnitude higher than that of the training MSE.

Ridge and Lasso Regression

Use all the data (scaled features and dummy categorical variables, X_train_preprocessed) to build some models with regularization - two each for lasso and ridge regression. Each time, look at R-Squared and MSE.

Remember that you can consult the scikit-learn documentation for Lasso and Ridge if you don't remember how to import or use these classes.

Lasso

With default hyperparameters (alpha = 1)

# Your code here

With a higher regularization hyperparameter (alpha = 10)

# Your code here
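
One possible sketch for both lasso models, reusing the print_metrics helper defined in the linear regression sketch above:

from sklearn.linear_model import Lasso

# Lasso with default hyperparameters (alpha=1)
lasso = Lasso()
lasso.fit(X_train_preprocessed, y_train)
print_metrics(lasso)

# Lasso with stronger regularization (alpha=10)
lasso_10 = Lasso(alpha=10)
lasso_10.fit(X_train_preprocessed, y_train)
print_metrics(lasso_10)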

Ridge

With default hyperparameters (alpha = 1)

# Your code here

With a higher regularization hyperparameter (alpha = 10)

# Your code here
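
One possible sketch for both ridge models, again reusing print_metrics:

from sklearn.linear_model import Ridge

# Ridge with default hyperparameters (alpha=1)
ridge = Ridge()
ridge.fit(X_train_preprocessed, y_train)
print_metrics(ridge)

# Ridge with stronger regularization (alpha=10)
ridge_10 = Ridge(alpha=10)
ridge_10.fit(X_train_preprocessed, y_train)
print_metrics(ridge_10)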

Comparing the Metrics

Which model seems best, based on the metrics?

# Write your conclusions here:
Answer (click to reveal)

In terms of both R-Squared and MSE, the Lasso model with alpha=10 has the best metric results.

(Remember that better R-Squared is higher, whereas better MSE is lower.)

Comparing the Parameters

Compare the number of parameter estimates that are (very close to) 0 for the Ridge and Lasso models with alpha=10.

Use 10**(-10) as the threshold for deciding whether an estimate is very close to 0.

# Number of Ridge params almost zero
# Number of Lasso params almost zero
# Compare and interpret these results
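
One possible way to do the comparison, assuming ridge_10 and lasso_10 are the alpha=10 models from the sketches above:

# Count coefficients whose absolute value falls below the 10**(-10) threshold
print('Ridge params almost zero:', sum(abs(ridge_10.coef_) < 10**(-10)))
print('Lasso params almost zero:', sum(abs(lasso_10.coef_) < 10**(-10)))
print('Total number of parameters:', len(lasso_10.coef_))
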
Answer (click to reveal)

The ridge model did not shrink any coefficients all the way to 0, while the lasso model zeroed out roughly a quarter of them. The lasso model essentially performed variable selection for us, and achieved the best metrics as a result!

Finding an Optimal Alpha

Earlier we tested two values of alpha to see how they affected our MSE and the values of our coefficients. We could continue to guess values of alpha for our ridge or lasso regression one at a time to see which value minimizes our loss, or we can test a range of values and pick the alpha that minimizes our MSE. Here is an example of how to do this:

# Run this cell without changes
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso  # imported again here so this cell runs on its own
%matplotlib inline

train_mse = []
test_mse = []
alphas = np.linspace(0, 200, num=50)

for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train_preprocessed, y_train)
    
    train_preds = lasso.predict(X_train_preprocessed)
    train_mse.append(mean_squared_error(y_train, train_preds))
    
    test_preds = lasso.predict(X_test_preprocessed)
    test_mse.append(mean_squared_error(y_test, test_preds))

fig, ax = plt.subplots()
ax.plot(alphas, train_mse, label='Train')
ax.plot(alphas, test_mse, label='Test')
ax.set_xlabel('alpha')
ax.set_ylabel('MSE')

# np.argmin() returns the index of the minimum value in a list
optimal_alpha = alphas[np.argmin(test_mse)]

# Add a vertical line where the test MSE is minimized
ax.axvline(optimal_alpha, color='black', linestyle='--')
ax.legend();

print(f'Optimal Alpha Value: {int(optimal_alpha)}')

Take a look at this graph of our training and test MSE against alpha. Try to explain to yourself why the shapes of the training and test curves are this way. Make sure to think about what alpha represents and how it relates to overfitting vs underfitting.


Answer (click to reveal)

For alpha values below 28, the model is overfitting. As alpha increases up to 28, the MSE for the training data increases and MSE for the test data decreases, indicating that we are reducing overfitting.

For alpha values above 28, the model is starting to underfit. You can tell because both the train and the test MSE values are increasing.

Summary

Well done! You now know how to build lasso and ridge regression models, use them for feature selection, and find an optimal value for alpha.
