In this final lab, we shall see how to apply regression analysis using CART trees, with some hyperparameter tuning as we saw in the case of classification. To compare predictive capability and computational cost, we shall work with the "Boston Housing" dataset. This will allow us to compare different regression approaches in terms of their accuracy and the cost involved.
You will be able to:
- Apply predictive regression analysis with CART trees
- Get the data ready for modeling
- Tune key hyperparameters based on the various models developed during training
- Study the impact of tree pruning on the quality of predictions
The dataset is available in the repo as `boston.csv`.
- Load the dataset and print its head and dimensions
# Your code here
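A minimal sketch of one way to do this with pandas (assuming `boston.csv` sits in the working directory); it would print the dimension summary and preview shown below:

```python
import pandas as pd

# Load the dataset from the repo
data = pd.read_csv('boston.csv')

# Print the dimensions and preview the first few rows
print("Boston housing dataset has {} data points with {} variables each."
      .format(*data.shape))
data.head()
```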
Boston housing dataset has 506 data points with 15 variables each.
| | Unnamed: 0 | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
In this lab, we shall use three features from the Boston housing dataset: `'RM'`, `'LSTAT'`, and `'PTRATIO'`. For each data point:

- `'RM'` is the average number of rooms among homes in the neighborhood.
- `'LSTAT'` is the percentage of homeowners in the neighborhood considered "lower class" (working poor).
- `'PTRATIO'` is the ratio of students to teachers in primary and secondary schools in the neighborhood.
- The target variable `'MEDV'` has been multiplicatively scaled to account for 35 years of market inflation.
- Create dataframes for features and target as shown above.
- Inspect the contents for validity
# Your code here
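One way to build the feature and target frames and sanity-check them (a sketch, assuming the `data` frame loaded earlier); it would produce the summary statistics and preview shown below:

```python
# Select the three chosen features and the target variable
features = data[['rm', 'lstat', 'ptratio']]
target = data['medv']

# Inspect summary statistics for the target and preview the features
print(target.describe())
features.head()
```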
count 506.000000
mean 22.532806
std 9.197104
min 5.000000
25% 17.025000
50% 21.200000
75% 25.000000
max 50.000000
Name: medv, dtype: float64
| | rm | lstat | ptratio |
|---|---|---|---|
| 0 | 6.575 | 4.98 | 15.3 |
| 1 | 6.421 | 9.14 | 17.8 |
| 2 | 7.185 | 4.03 | 17.8 |
| 3 | 6.998 | 2.94 | 18.7 |
| 4 | 7.147 | 5.33 | 18.7 |
- Use scatter plots to show the correlation between the chosen features and the target variable
- Comment on each scatter plot
# Your code here
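A possible plotting sketch, assuming matplotlib and the `features` and `target` frames from above:

```python
import matplotlib.pyplot as plt

# One scatter plot per feature against the target (median home value)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, features.columns):
    ax.scatter(features[col], target, alpha=0.5)
    ax.set_xlabel(col)
    ax.set_ylabel('medv')
plt.show()
```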
# Your observations here
- Create a function `performance(true, predicted)` to calculate and return the r-squared score and MSE for two equal-sized arrays of true and predicted values
- Test the function with the given data
# Evaluation Metrics
# Import metrics
from sklearn.metrics import r2_score, mean_squared_error

def performance(y_true, y_predict):
    """ Calculates and returns the r-squared score and MSE between
        true and predicted values. """
    # Coefficient of determination (r-squared)
    r2 = r2_score(y_true, y_predict)
    # Mean squared error
    mse = mean_squared_error(y_true, y_predict)
    return [r2, mse]
# Calculate the performance - TEST
score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
score
# [0.9228556485355649, 0.4719999999999998]
- For supervised learning, split the `features` and `target` datasets into training/test data (80/20).
- For reproducibility, use `random_state=42`
# Your code here
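A minimal split sketch using scikit-learn's `train_test_split` (assuming the `features` and `target` frames from earlier):

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
```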
- Run a baseline model for later comparison using the datasets created above
- Generate predictions for the test dataset and calculate the performance measures using the function created above
- Use `random_state=45` for the tree instance
- Record your observations
# Your code here
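One way to fit and score a vanilla regression tree (a sketch, assuming the split data and the `performance` function from above); it would produce roughly the scores shown below:

```python
from sklearn.tree import DecisionTreeRegressor

# Baseline: an unconstrained regression tree
regressor = DecisionTreeRegressor(random_state=45)
regressor.fit(X_train, y_train)

# Score the predictions on the held-out test set
y_pred = regressor.predict(X_test)
r2, mse = performance(y_test, y_pred)
(r2, mse)
```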
# (0.4712438851035674, 38.7756862745098) - R2, MSE
(0.47097115950374013, 38.795686274509805)
# Your observations here
- Find the best tree depth for a depth range of 1-30
- Run the regressor repeatedly in a for loop for each depth value
- Use `random_state=45` for reproducibility
- Calculate MSE and r-squared for each run
- Plot both performance measures for all runs
- Comment on the output
# Your code here
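A sketch of the depth sweep, assuming the same split data and `performance` function:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Sweep max_depth from 1 to 30 and record both metrics
depths = list(range(1, 31))
r2s, mses = [], []
for depth in depths:
    regressor = DecisionTreeRegressor(max_depth=depth, random_state=45)
    regressor.fit(X_train, y_train)
    r2, mse = performance(y_test, regressor.predict(X_test))
    r2s.append(r2)
    mses.append(mse)

# Plot both performance measures against tree depth
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(depths, r2s, 'b-o')
ax1.set_xlabel('max_depth')
ax1.set_ylabel('r-squared')
ax2.plot(depths, mses, 'r-o')
ax2.set_xlabel('max_depth')
ax2.set_ylabel('MSE')
plt.show()
```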
# Your observations here
- Repeat the above process for the `min_samples_split` parameter
- Use a range of values from 2-10 for this parameter
- Use `random_state=45` for reproducibility
- Visualize the output and comment on the results as above
# Your code here
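The same sweep pattern applied to `min_samples_split` (a sketch under the same assumptions as above):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Sweep min_samples_split from 2 to 10 and record both metrics
splits = list(range(2, 11))
r2s, mses = [], []
for split in splits:
    regressor = DecisionTreeRegressor(min_samples_split=split, random_state=45)
    regressor.fit(X_train, y_train)
    r2, mse = performance(y_test, regressor.predict(X_test))
    r2s.append(r2)
    mses.append(mse)

# Plot both performance measures against min_samples_split
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(splits, r2s, 'b-o')
ax1.set_xlabel('min_samples_split')
ax1.set_ylabel('r-squared')
ax2.plot(splits, mses, 'r-o')
ax2.set_xlabel('min_samples_split')
ax2.set_ylabel('MSE')
plt.show()
```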
# Your observations here
- Use the best values for `max_depth` and `min_samples_split` found in previous runs and run an optimized model with these values
- Calculate the performance and comment on the output
# Your code here
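A sketch of the tuned model, plugging in the best values recorded in the output below (`max_depth=6`, `min_samples_split=6`):

```python
from sklearn.tree import DecisionTreeRegressor

# Combine the best hyperparameter values found in the sweeps above
regressor = DecisionTreeRegressor(max_depth=6, min_samples_split=6,
                                  random_state=45)
regressor.fit(X_train, y_train)
r2, mse = performance(y_test, regressor.predict(X_test))
(r2, mse, regressor)
```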
(0.7510017608643338,
18.259982876077185,
DecisionTreeRegressor(criterion='mse', max_depth=6, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=6, min_weight_fraction_leaf=0.0,
presort=False, random_state=45, splitter='best'))
# Your observations here
- Visualize the trained model as we did in previous sections
- Show the labels for each variable being split in a node
- Interpret the tree
# Your code here
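One way to render the fitted tree with labeled splits (a sketch, assuming the tuned `regressor` above and that the `graphviz` package is installed):

```python
import graphviz
from sklearn.tree import export_graphviz

# Export the tree with feature names so each split node is labeled
dot_data = export_graphviz(regressor, feature_names=list(features.columns),
                           filled=True, rounded=True, out_file=None)
graphviz.Source(dot_data)
```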
# Your observations here
- How about bringing in some more features from the original dataset that may be good predictors?
- Also, tune more hyperparameters like `max_features` to find the optimal model
In this lab, we looked at applying a decision-tree-based regression analysis to the Boston Housing dataset. We saw how to train various models to find the optimal values for pruning and limiting the growth of the trees. We also looked at how to extract some rules from visualizing trees that might be used for decision making later. In the next section, we shall look at running "Grid Searches" to identify the best model while tuning all required hyperparameters at once.