Modeling Our Data - Lab

Introduction

In this lab we'll perform a full linear regression on our data. We'll take a stepwise approach and we'll try to improve our model as we go.

Objectives

You will be able to:

  • Remove predictors with p-values too high and refit the model
  • Examine and interpret the model results
  • Split data into training and testing sets
  • Fit a regression model to the dataset using the statsmodels library

Build single linear regression models

From the previous steps, it is pretty clear that we have quite a few predictors, but there are some issues with them: linearity with the target "Weekly_Sales" wasn't apparent. When that's the case, it's always smart to start small and build linear regression models with just one input at a time. Much like what we did in section 10, let's look at some statistics for single linear regression models for all our continuous variables with the outcome.

Note: for now, we will not use holdout validation, as we're just trying to gauge interpretation and get a sense of the predictive capacity of each candidate predictor.

Load the cleaned dataset "walmart_dataset.csv" and check its contents.

Let's pull up the info.

Note that the output for info is much smaller compared to what we usually see. Because we have so many columns, pandas is intentionally not showing the data types for each column. Let's use info() again, but now just on the first 15 columns of the data.
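
Here's a minimal sketch of these first steps; the variable name walmart is an assumption:

import pandas as pd

# Load the cleaned dataset from this lab
walmart = pd.read_csv('walmart_dataset.csv')
print(walmart.head())

# With this many columns, pandas suppresses per-column dtypes in info(),
# so inspect just the first 15 columns instead
walmart.iloc[:, :15].info()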

Remember that all the columns from store_1 onwards are actually dummies, i.e. categorical variables. Because we stored the data and loaded it back in, this information was lost. Let's make sure they become categorical again; you can write a for-loop to do this.

Let's make sure IsHoliday is a categorical variable as well.
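
A sketch of both conversions, assuming the dummy columns start at store_1 as described above:

# All columns from "store_1" onwards are dummies; re-cast them as categorical
start = walmart.columns.get_loc('store_1')
for col in walmart.columns[start:]:
    walmart[col] = walmart[col].astype('category')

# IsHoliday should be categorical as well
walmart['IsHoliday'] = walmart['IsHoliday'].astype('category')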

Let's check the info again to make sure everything is OK now.

Great! You should see that the data types have changed to category now. If you use .describe() now, you should see only the remaining continuous variables in the dataset.

Use a for-loop to look at some results for each linear regression model

Let's use ordinary least squares in statsmodels at this stage. Import statsmodels.formula.api to get started:

import statsmodels.formula.api as smf

Create a loop that, for each iteration:

  • Runs a simple OLS regression between a (continuous) independent variable and the dependent variable
  • Stores the following values in an array:
    • Predictor variable
    • R-squared
    • Intercept
    • Slope
    • p-value
  • Comments on each output
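
A minimal sketch of such a loop, assuming the continuous predictors are exactly the numeric columns that describe() still reports (besides the target):

import pandas as pd
import statsmodels.formula.api as smf

# Continuous predictors: the numeric columns, minus the target
continuous = [col for col in walmart.describe().columns if col != 'Weekly_Sales']

results = []
for predictor in continuous:
    # Simple OLS with a single predictor
    model = smf.ols(formula=f'Weekly_Sales ~ {predictor}', data=walmart).fit()
    results.append([predictor, model.rsquared, model.params['Intercept'],
                    model.params[predictor], model.pvalues[predictor]])

print(pd.DataFrame(results, columns=['predictor', 'r_squared', 'intercept', 'slope', 'p_value']))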

Think about your results.

  • What do the parameter estimates mean? Do they make sense?
  • What do the p-values tell us?
  • What does the R-squared tell us?

Our R-squared values are low; let's try to improve them

Something we haven't considered before is taking log transformations to make certain data less skewed. Let's take a quick look at our summarizing histograms.
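
For instance (assuming matplotlib is available):

import matplotlib.pyplot as plt

# Histograms for all numeric columns in one grid
walmart.hist(figsize=(16, 10), bins=50)
plt.show()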

Interestingly, the most problematic variable in terms of skewness seems to be Weekly_Sales itself. Does it make sense to log-transform this variable? It definitely doesn't hurt to try! Let's have a look below. What do you see?

That's right, we have some negative Weekly_Sales values! Let's check how many we have.

This seems negligible considering we have almost 100,000 observations. Let's remove these 224 rows so we can take the log.
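
A sketch of counting, removing, and log-transforming; note that zero values would also break the log, so the filter below keeps only strictly positive sales:

import numpy as np

# How many non-positive weekly sales do we have?
print((walmart['Weekly_Sales'] <= 0).sum())

# Keep strictly positive sales, then log-transform the target
walmart_log = walmart[walmart['Weekly_Sales'] > 0].copy()
walmart_log['Weekly_Sales'] = np.log(walmart_log['Weekly_Sales'])

walmart_log['Weekly_Sales'].hist(bins=50)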

Let's have another look at the histogram. What do you see?

Now let's repeat what we did before, but with log(Weekly_Sales) as the target.

  • Compare and contrast the results with the results obtained when we did not take the log of sales
  • Which one would you want to proceed with based on this?

Build a model with each categorical variable as a predictor

  • Use it on both the log-transformed and the regular Weekly_Sales
  • Put all categories for one categorical variable in 1 model, so we want 4 models
  • Remember that we have 4 categorical variables: Store, Dept, IsHoliday and Type (we're ignoring the binned_markdown categories for now; you can add them later on as an extension)
  • IMPORTANT: remember that we made dummies for the Type, Dept and Store columns. You'll need to drop 1 column for each of these if you want good results, because otherwise singularity occurs: the full set of dummies for one variable sums to a column of ones, which is perfectly collinear with the intercept. This is related to what we mentioned earlier on in section 11. Don't worry about the "why" for now; just make sure to drop 1 column and you should be fine! The parameter value for the dropped "base category" will be absorbed into the intercept.
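
Here is a sketch for one of the four models (Store, on the log-transformed target), assuming the dummy columns are named store_1, store_2, and so on:

import statsmodels.formula.api as smf

# All Store dummies except one base category, which the intercept absorbs
store_dummies = [col for col in walmart_log.columns if col.startswith('store_')]
formula = 'Weekly_Sales ~ ' + ' + '.join(store_dummies[1:])

store_model = smf.ols(formula=formula, data=walmart_log).fit()
print(store_model.summary())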

Let's drop a few columns in our data set based on our findings

  • Let's stick with our walmart_log data, as it generally resulted in higher R-squared values.
  • Let's drop the continuous variables whose single linear models had an R-squared value < 0.01 for the walmart_log models.
  • Let's make sure to drop 1 column for each categorical variable we end up using.
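
A hypothetical sketch; the actual columns to drop depend on your own single-model results, so every name below is a placeholder:

# Placeholders: replace with your own continuous variables with R-squared < 0.01
low_r2 = ['Fuel_Price', 'Temperature']
# Placeholders: one base category per set of dummies
base_dummies = ['store_1', 'dept_1', 'Type_A']

walmart_log = walmart_log.drop(columns=low_r2 + base_dummies)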

From here on out, use feature ranking with recursive feature elimination

Let's create a matrix X and a vector y containing the predictors and target for our model. Let's use scikit-learn's RFE class (see the scikit-learn documentation).

Let's create a for-loop using RFE where we look at the 5, 15, 25, ... up until 85 best features to be selected according to the feature-ranking algorithm. Store the R-squared and adjusted R-squared for all of these models in a list. What do you see? No need to perform a train-test split for now; that will be next!
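
A sketch of this loop, assuming walmart_log now holds the reduced data; the astype(float) call is there because scikit-learn expects numeric input:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = walmart_log.drop('Weekly_Sales', axis=1).astype(float)
y = walmart_log['Weekly_Sales']

r_squared, adj_r_squared = [], []
n = len(X)
for k in range(5, 86, 10):
    # Let RFE pick the k best features, then refit on just those
    selector = RFE(LinearRegression(), n_features_to_select=k)
    selector.fit(X, y)
    selected = X.columns[selector.support_]

    linreg = LinearRegression().fit(X[selected], y)
    r2 = linreg.score(X[selected], y)
    r_squared.append(r2)
    adj_r_squared.append(1 - (1 - r2) * (n - 1) / (n - k - 1))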

The difference between $R^2$ and adjusted $R^2$ is negligible, and both seem to keep going up as we include more features. Remember, though, that we're likely overfitting when including 85 features. In order to identify this, let's rerun a similar experiment, but using a train-test split!

Including a train-test-split

Let's create a for-loop similar to the one we used before. Except, this time:

  • Use an 80-20 train-test split (80% train, 20% test)
  • Instead of looking at $R^2$ and $R^2_{adj}$, look at the MSE for train and test
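
A sketch, reusing X and y from before; test_size=0.2 and random_state=42 are assumptions:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_mse, test_mse = [], []
for k in range(5, 86, 10):
    # Select features on the training data only
    selector = RFE(LinearRegression(), n_features_to_select=k)
    selector.fit(X_train, y_train)
    selected = X_train.columns[selector.support_]

    linreg = LinearRegression().fit(X_train[selected], y_train)
    train_mse.append(mean_squared_error(y_train, linreg.predict(X_train[selected])))
    test_mse.append(mean_squared_error(y_test, linreg.predict(X_test[selected])))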

What we see is that both the train and test MSE keep improving as we add variables. It seems like a bigger model improves our performance, and the test and train performance don't really diverge. It is important to note, however, that this is not an unusual result: the performance measures used here will typically show this type of behavior. In order to really balance the curse of dimensionality (which will become more important in machine learning), we need other information criteria such as AIC and BIC. You'll learn about them later! Now, let's perform cross-validation on our model with 85 predictors!

10-fold cross validation with the final model

Create a 10-fold cross-validation and store the (negative) MSEs
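
One way to do this with scikit-learn, reusing X and y (with all 85 predictors) from before:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# scikit-learn reports MSE as a negative score, so flip the sign
mse_scores = -cross_val_score(LinearRegression(), X, y, cv=10,
                              scoring='neg_mean_squared_error')
print(mse_scores)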

Running our 10-fold cross-validation highlights some issues for sure! Have a look at your list of 10 MSEs: where most MSEs are manageable, some are very high. The curse of dimensionality is already pretty clear here. The issue is that we have many (dummy) categorical variables that result in columns with many zeros and few ones. This means that for some folds, there is a risk of ending up with columns that almost exclusively contain 0's for prediction, which might cause weird results. Looking at this, a model with fewer predictors might make sense again. This is where we conclude for now; it's up to you to explore other model options! Additionally, you're encouraged to try some of the "Level up" exercises below. Good luck!

Level up - Optional

  • You could argue that throwing out negative sales figures is problematic, because these are probably exactly the types of observations a stakeholder would be very interested in. Repeat your analysis, but now, instead of removing the rows with negative sales, replace their sales with a slightly positive value (e.g. 1), so they have an existing and finite value after the log transformation. Does the result change?

  • Go back and log-transform CPI and Size before standardizing it (we did this a few lessons ago). Look at the histogram and see if there is an improvement.

  • You might have noticed we ignored binned_markdown throughout. Add it to the model and see how it changes the results!

  • Try other feature selection methods such as stepwise selection and forward selection seen in section 11.

Summary

Congratulations, you made it to the end of the last section in this module. Now it's time for a big project on multiple linear regression!
