This project investigates house sales in the King County area. It involves building a regression model to predict house prices and helps identify the factors that drive them.
Homeowners approach the real estate agency for help buying or selling their homes. One business problem the agency could focus on is advising homeowners on which renovations might increase the estimated value of their homes, and by how much. In this project we provide price predictions to the agency for clients who want to buy homes. The analysis focuses on the factors that raise a house's price and that could therefore be negotiated when purchasing a new home.
This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder of this project repository. Descriptions of the column names can be found in column_names.md in the same folder.
In this project we follow the OSEMN data science workflow, which consists of:
- Obtain (gathering the data)
- Scrub (extracting columns, handling missing values)
- Explore (understanding the data and creating visualizations)
- Model (building regression model)
- Interpret (communicating results)
We explored the King County house data and built several regression models, aiming to satisfy the regression assumptions and to predict house prices.
First, we created a baseline model with a few variables.
Train RMSE: 236467.87482500693
Test RMSE: 242081.30788978518
Train R2: 0.5833310228392444
Test R2: 0.5728609089897383
The baseline model didn't meet the regression assumptions, and its RMSE was high while its R2 was low.
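For readers unfamiliar with the two metrics reported for every model here, this is a minimal sketch of how RMSE and R2 are computed. It uses plain Python so it runs without extra libraries; the prices are made up for illustration and are not taken from the dataset, and the function names are our own.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up prices in USD, for illustration only (not from kc_house_data.csv).
actual = [300_000, 450_000, 520_000, 610_000]
predicted = [320_000, 430_000, 500_000, 650_000]

print(f"RMSE: {rmse(actual, predicted):,.0f}")
print(f"R2:   {r_squared(actual, predicted):.3f}")
```

Computing both metrics on the train set and on the held-out test set, as in the figures above, shows whether the model generalizes or overfits.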
Next, we selected only the variables that have a linear relationship with price and built a new model.
Train RMSE: 238114.36314309336
Test RMSE: 243234.25014980324
Train R2: 0.5775084222015455
Test R2: 0.568782614551135
This model didn't improve on the baseline, so we built Model 3.
We applied a log transformation to the target and the predictor variables.
Train RMSE: 0.3417654821862211
Test RMSE: 0.34131400565596115
Train R2: 0.5800369077132734
Test R2: 0.575725504061078
Model 3's RMSE is now on the log scale, so it cannot be compared directly with the dollar-scale RMSE of the earlier models. Its R2 changed only slightly, so we kept working on a model with a better R2.
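The sketch below shows why the RMSE of a log-transformed model looks so small, and how predictions can be mapped back to dollars for a like-for-like comparison. It assumes a natural-log transform of the target, which the notebook may or may not use; the prices and predictions are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Made-up actual prices in USD (not from the dataset).
actual_usd = [300_000, 450_000, 520_000, 610_000]

# Hypothetical predictions from a model fitted on log(price).
predicted_log = [math.log(v) for v in (320_000, 430_000, 500_000, 650_000)]

# RMSE on the log scale: a small, unit-free number.
actual_log = [math.log(v) for v in actual_usd]
rmse_log = rmse(actual_log, predicted_log)

# Back-transform with exp() to get an RMSE in USD again.
predicted_usd = [math.exp(v) for v in predicted_log]
rmse_usd = rmse(actual_usd, predicted_usd)

print(f"RMSE (log scale): {rmse_log:.4f}")
print(f"RMSE (USD):       {rmse_usd:,.0f}")
```

The same predictions give a tiny RMSE on the log scale and a five-figure RMSE in dollars, so only values on the same scale should be compared across models.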
We then checked for multicollinearity among the predictors to improve the model.
Train RMSE: 0.3513999392706709
Test RMSE: 0.35286466512965453
Train R2: 0.5560254312136382
Test R2: 0.5465232245990275
The model's R2 value dropped by a small amount.
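One common way to check for multicollinearity is to look for pairs of predictors with a high pairwise correlation. The sketch below illustrates that idea in plain Python; it is not necessarily the method used in the notebook. The 0.75 threshold, the column names and the values are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.75):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs

# Made-up feature values; the column names are assumed, not confirmed by this README.
features = {
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "sqft_above": [900, 1400, 1800, 2400, 2900],
    "bedrooms": [3, 2, 4, 2, 3],
}

flagged = correlated_pairs(features)
print(flagged)
```

Dropping one predictor from each flagged pair usually costs a little R2, as seen above, but makes the coefficients more stable and easier to interpret.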
Bathroom counts varied widely, so we kept only houses with up to 4 bathrooms and converted the bathroom count to a categorical variable to try to improve the model.
Train RMSE: 0.33420217863454416
Test RMSE: 0.34145241899517415
Train R2: 0.5668054800701774
Test R2: 0.5687840686619469
There was no big difference in R2 or RMSE, so we removed the categorical terms.
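The bathroom step described above can be sketched as a filter followed by dummy (one-hot) encoding. This is a minimal illustration in plain Python, assuming columns named "price" and "bathrooms"; the helper name encode_bathrooms and the rows are hypothetical.

```python
def encode_bathrooms(rows, max_bathrooms=4):
    """Keep rows with at most max_bathrooms and dummy-encode the bathroom count."""
    kept = [row for row in rows if row["bathrooms"] <= max_bathrooms]
    levels = sorted({row["bathrooms"] for row in kept})
    encoded = []
    for row in kept:
        new_row = {key: value for key, value in row.items() if key != "bathrooms"}
        # The lowest level is dropped and acts as the reference category,
        # which avoids perfectly collinear dummy columns.
        for level in levels[1:]:
            new_row[f"bathrooms_{level}"] = 1 if row["bathrooms"] == level else 0
        encoded.append(new_row)
    return encoded

# Made-up rows for illustration only.
houses = [
    {"price": 300_000, "bathrooms": 1.0},
    {"price": 450_000, "bathrooms": 2.5},
    {"price": 900_000, "bathrooms": 6.0},
    {"price": 520_000, "bathrooms": 2.5},
]

encoded = encode_bathrooms(houses)
for row in encoded:
    print(row)
```

Each remaining bathroom level becomes its own 0/1 column, so the model can fit a separate price effect per level instead of a single slope.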
The dataset has 'yr_built', 'yr_renovated' and 'grade' columns. We checked for interactions between them and added the interaction terms to the model.
Train RMSE: 0.3132980216188752
Test RMSE: 0.3193145255976609
Train R2: 0.647085067318093
Test R2: 0.6286563030309209
This model performed best, with the highest R2 and the lowest RMSE of the log-scale models.
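An interaction term is usually just the product of two predictors, added as an extra column. The sketch below illustrates this for the three columns named above. Which pairs the notebook actually uses is not stated here, so the pairs listed are an assumption, and the rows are made up (a yr_renovated of 0 stands for "never renovated" in this toy data).

```python
def add_interactions(rows, pairs):
    """Return copies of the rows with a product column for each (a, b) pair."""
    result = []
    for row in rows:
        new_row = dict(row)
        for a, b in pairs:
            new_row[f"{a}:{b}"] = row[a] * row[b]
        result.append(new_row)
    return result

# Made-up rows for illustration only.
houses = [
    {"yr_built": 1955, "yr_renovated": 0, "grade": 7},
    {"yr_built": 1991, "yr_renovated": 2005, "grade": 8},
]

# Assumed pairs; the notebook may use a different selection.
pairs = [
    ("yr_built", "grade"),
    ("yr_renovated", "grade"),
    ("yr_built", "yr_renovated"),
]

with_interactions = add_interactions(houses, pairs)
for row in with_interactions:
    print(row)
```

Including such a product lets the effect of one variable (for example grade) on price depend on the value of another (for example the year built).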
- The RMSE values for the train and test sets are similar, which indicates that the model is not overfitting and should perform comparably on unseen data
- The train R2 is 0.647, meaning about 65% of the variation in (log-transformed) price is explained by the model
- The train RMSE of about 0.31 is on the log-price scale, so a typical prediction is off by an amount on the order of 30% of the actual price (a rough reading that assumes a natural-log transform)
Because the target is log-transformed, the RMSE is expressed in log units rather than USD, so it is read as a relative error and not as a dollar amount.
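To make the relative reading of a log-scale RMSE concrete, the sketch below converts the final model's train RMSE into approximate percentage errors. It assumes the target was transformed with the natural log; the helper name is our own.

```python
import math

def log_rmse_to_relative_error(log_rmse):
    """Convert a natural-log RMSE into typical over- and under-prediction shares."""
    factor = math.exp(log_rmse)  # typical multiplicative error
    over = factor - 1            # share by which a prediction may overshoot
    under = 1 - 1 / factor       # share by which a prediction may undershoot
    return over, under

# Train RMSE of the final model, as reported above.
over, under = log_rmse_to_relative_error(0.3132980216188752)
print(f"Typical overshoot:  {over:.1%}")
print(f"Typical undershoot: {under:.1%}")
```

A log error is not symmetric in percentage terms, which is why the overshoot and undershoot shares differ.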
If you are looking for housing that won't strain your bank account, choose a house with fewer bathrooms, which can be shared, and with less square footage; both make the purchase more cost-effective.
The RMSE values for the train and test sets are similar, which indicates that the model should perform comparably on unseen data. The model can therefore be used by real estate agents as a price predictor when buying houses, or to analyze the factors that drive housing prices.
See the full analysis in the Jupyter Notebook.
For additional info, contact Janaki @[email protected]