- Problem Statement
- Background
- Datasets
- Data Dictionary
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Model Tuning
- Conclusions
- Recommendations
Zillow has recently hired a team of data scientists to improve its methods of home price prediction. As part of this team, we have been tasked with creating a machine learning model that predicts the sale price of a home in Ames, Iowa, based on a publicly available dataset. The Ames dataset contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.
This will be a proof-of-concept model geographically restricted to the city of Ames. Zillow would like to determine if it is worth allocating financial resources to this method of prediction in order to expand it to serve as Zillow's primary prediction engine for home sale prices across the US.
The Real Estate market has been looking to improve the ways that it predicts home prices to better help customers shop for the home that meets their needs and is within their budget. In pursuit of this goal, the top digital Real Estate companies on the market have agreed to join a competition focused on adapting machine learning techniques to the housing market. Using the Ames, Iowa housing dataset, each company will submit predictions from their two best models to determine if applying machine learning in this way is a viable strategy and to prove to the public who is the one true Real Estate company to rule them all.
In light of this, Zillow's goal is to produce more accurate predictions than its competitors (e.g. Trulia, Redfin, Realtor.com). Pre-submissions from each company are accepted until 11:59 pm on 2/16/23. 70% of the Ames testing data will be used to produce a score to judge models for all pre-submissions. Once submissions close, the remaining 30% will be used to determine the final scores.
All submitted models will be compared using the metric of Mean Squared Error (MSE). The lower the MSE score is, the better the model performed.
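For reference, MSE is the average of the squared differences between the actual and predicted sale prices. The sketch below is a minimal illustration only; the prices are made up and the helper name `mean_squared_error` is our own choice here, not part of the competition tooling.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical sale prices (not taken from the Ames data).
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

mse = mean_squared_error(actual, predicted)
print(mse)  # 200000000.0
```

Because the errors are squared, a single large miss (the 20,000 error above) contributes far more to the score than several small ones.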
To achieve Zillow's goals, our team will utilize three types of linear regression models based on the Ames Housing Dataset that was provided to all companies for this Kaggle competition. These models will be Ordinary Least Squares, Ridge Regression, and LASSO Regression. The top two predictions of the best performing models will be submitted to Kaggle for review.
Because the purpose of this model is solely based on predicting home sale pricing, performance is prioritized over creating a "white-box" model. Zillow has instructed our team that understanding the internal components of the model is not as important as accurate predictions. Proof-of-concept success will be determined by producing an
The entire data documentation provided by the Ames Assessor’s Office can be found here: https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
The data dictionary contains all features, provided and engineered, that were used in the models.
- Dummy variables are grouped by their original variable name and denoted under 'Type'.
- Ordinal features that were converted to a numeric scale are also noted under 'Type'.
- Polynomial Features are grouped by the pair of features that created them.
Feature | Type | Dataset | Description |
---|---|---|---|
Id | Discrete | Ames, IA Housing | ID number for each home |
PID | Nominal | Ames, IA Housing | Parcel identification number - can be used with city website for parcel review |
MS SubClass | Nominal | Ames, IA Housing | Identifies the type of dwelling involved in the sale |
MS Zoning | Nominal (Dummy) | Ames, IA Housing | Identifies the general zoning classification of the sale |
Lot Frontage | Continuous | Ames, IA Housing | Linear feet of street connected to property |
Lot Area | Continuous | Ames, IA Housing | Lot size in square feet |
Street | Nominal | Ames, IA Housing | Type of road access to property |
Alley | Nominal | Ames, IA Housing | Type of alley access to property |
Lot Shape | Ordinal (Dummy) | Ames, IA Housing | General shape of property |
Land Contour | Nominal (Dummy) | Ames, IA Housing | Flatness of the property |
Utilities | Ordinal | Ames, IA Housing | Type of utilities available |
Lot Config | Nominal (Dummy) | Ames, IA Housing | Lot configuration |
Land Slope | Ordinal | Ames, IA Housing | Slope of property |
Neighborhood | Nominal (Dummy) | Ames, IA Housing | Physical locations within Ames city limits |
Condition 1 | Nominal (Dummy) | Ames, IA Housing | Proximity to various conditions |
Condition 2 | Nominal | Ames, IA Housing | Proximity to various conditions (if more than one is present) |
Bldg Type | Nominal (Dummy) | Ames, IA Housing | Type of dwelling |
House Style | Nominal (Dummy) | Ames, IA Housing | Style of dwelling |
Overall Qual | Ordinal | Ames, IA Housing | Rates the overall material and finish of the house |
Overall Cond | Ordinal | Ames, IA Housing | Rates the overall condition of the house |
Year Built | Discrete | Ames, IA Housing | Original construction date |
Year Remod/Add | Discrete | Ames, IA Housing | Remodel date (same as construction date if no remodeling or additions) |
Roof Style | Nominal (Dummy) | Ames, IA Housing | Type of roof |
Roof Matl | Nominal | Ames, IA Housing | Roof material |
Exterior 1st | Nominal (Dummy) | Ames, IA Housing | Exterior covering on house |
Exterior 2nd | Nominal (Dummy) | Ames, IA Housing | Exterior covering on house (if more than one material) |
Mas Vnr Type | Nominal (Dummy) | Ames, IA Housing | Masonry veneer type |
Mas Vnr Area | Continuous | Ames, IA Housing | Masonry veneer area in square feet |
Exter Qual | Ordinal (Numeric) | Ames, IA Housing | Evaluates the quality of the material on the exterior |
Exter Cond | Ordinal (Numeric) | Ames, IA Housing | Evaluates the present condition of the material on the exterior |
Foundation | Nominal (Dummy) | Ames, IA Housing | Type of foundation |
Bsmt Qual | Ordinal (Numeric) | Ames, IA Housing | Evaluates the height of the basement |
Bsmt Cond | Ordinal (Numeric) | Ames, IA Housing | Evaluates the general condition of the basement |
Bsmt Exposure | Ordinal (Numeric) | Ames, IA Housing | Refers to walkout or garden level walls |
BsmtFin Type 1 | Ordinal (Numeric) | Ames, IA Housing | Rating of basement finished area |
BsmtFin SF 1 | Continuous | Ames, IA Housing | Type 1 finished square feet |
BsmtFin Type 2 | Ordinal (Numeric) | Ames, IA Housing | Rating of basement finished area (if multiple types) |
BsmtFin SF 2 | Continuous | Ames, IA Housing | Type 2 finished square feet |
Bsmt Unf SF | Continuous | Ames, IA Housing | Unfinished square feet of basement area |
Total Bsmt SF | Continuous | Ames, IA Housing | Total square feet of basement area |
Heating | Nominal | Ames, IA Housing | Type of heating |
Heating QC | Ordinal (Numeric) | Ames, IA Housing | Heating quality and condition |
Central Air | Nominal (Dummy) | Ames, IA Housing | Central air conditioning |
Electrical | Ordinal (Numeric) | Ames, IA Housing | Electrical system |
1st Flr SF | Continuous | Ames, IA Housing | First Floor square feet |
2nd Flr SF | Continuous | Ames, IA Housing | Second floor square feet |
Low Qual Fin SF | Continuous | Ames, IA Housing | Low quality finished square feet (all floors) |
Gr Liv Area | Continuous | Ames, IA Housing | Above grade (ground) living area square feet |
Bsmt Full Bath | Discrete | Ames, IA Housing | Basement full bathrooms |
Bsmt Half Bath | Discrete | Ames, IA Housing | Basement half bathrooms |
Full Bath | Discrete | Ames, IA Housing | Full bathrooms above grade |
Half Bath | Discrete | Ames, IA Housing | Half baths above grade |
Bedroom AbvGr | Discrete | Ames, IA Housing | Bedrooms above grade (does NOT include basement bedrooms) |
Kitchen AbvGr | Discrete | Ames, IA Housing | Kitchens above grade |
Kitchen Qual | Ordinal (Numeric) | Ames, IA Housing | Kitchen quality |
TotRms AbvGrd | Discrete | Ames, IA Housing | Total rooms above grade (does not include bathrooms) |
Functional | Ordinal | Ames, IA Housing | Home functionality (Assume typical unless deductions are warranted) |
Fireplaces | Discrete | Ames, IA Housing | Number of fireplaces |
Fireplace Qu | Ordinal (Numeric) | Ames, IA Housing | Fireplace quality |
Garage Type | Nominal (Dummy) | Ames, IA Housing | Garage location |
Garage Yr Blt | Discrete | Ames, IA Housing | Year garage was built |
Garage Finish | Ordinal (Numeric) | Ames, IA Housing | Interior finish of the garage |
Garage Cars | Discrete | Ames, IA Housing | Size of garage in car capacity |
Garage Area | Continuous | Ames, IA Housing | Size of garage in square feet |
Garage Qual | Ordinal | Ames, IA Housing | Garage quality |
Garage Cond | Ordinal | Ames, IA Housing | Garage condition |
Paved Drive | Ordinal (Numeric) | Ames, IA Housing | Paved driveway |
Wood Deck SF | Continuous | Ames, IA Housing | Wood deck area in square feet |
Open Porch SF | Continuous | Ames, IA Housing | Open porch area in square feet |
Enclosed Porch | Continuous | Ames, IA Housing | Enclosed porch area in square feet |
3Ssn Porch | Continuous | Ames, IA Housing | Three season porch area in square feet |
Screen Porch | Continuous | Ames, IA Housing | Screen porch area in square feet |
Pool Area | Continuous | Ames, IA Housing | Pool area in square feet |
Pool QC | Ordinal | Ames, IA Housing | Pool quality |
Fence | Ordinal (Numeric) | Ames, IA Housing | Fence quality |
Misc Feature | Nominal | Ames, IA Housing | Miscellaneous feature not covered in other categories |
Misc Val | Continuous | Ames, IA Housing | $ Value of miscellaneous feature |
Mo Sold | Discrete | Ames, IA Housing | Month Sold (MM) |
Yr Sold | Discrete | Ames, IA Housing | Year Sold (YYYY) |
Sale Type | Nominal | Ames, IA Housing | Type of sale |
Sale Condition | Nominal | Ames, IA Housing | Condition of sale |
SalePrice | Continuous | Ames, IA Housing | Sale price $$ |
Total Bath | Discrete | Ames, IA Housing - Engineered | Total Bathrooms |
Total Area | Continuous | Ames, IA Housing - Engineered | Basement and above grade living area combined (square feet) |
Outside Amenity Area | Continuous | Ames, IA Housing - Engineered | All outside area combined (including all porches) |
pf_overall | Continuous - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Overall Qual','Gr Liv Area') |
pf_garage | Continuous - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Garage Area', 'Garage Cars') |
pf_ext | Discrete - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Exter Qual', 'Exter Cond') |
pf_bsmt | Continuous - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Bsmt Qual', 'Total Bsmt SF', 'Bsmt Cond', 'BsmtFin SF 1', 'Bsmt Exposure') |
pf_fire | Discrete - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Fireplaces', 'Fireplace Qu') |
pf_mas | Continuous - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Mas Vnr Area', 'Mas_Vnr_Stone') |
pf_total_qual | Discrete - Polynomial | Ames, IA Housing | Polynomial features consisting of ('Overall Qual', 'Exter Qual', 'Kitchen Qual', 'Bsmt Qual', 'Heating QC') |
- Train dataset has 2051 entries with 81 features (includes SalePrice)
- Test dataset has 878 entries with 80 features (excludes SalePrice)
- 38 features are numeric
- 42 features are categorical
Some features stored as numbers should be categories:
- Id: ID number
- PID (Nominal): Parcel identification number - can be used with city website for parcel review.
- MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.
- Year Built (Discrete): Original construction date
- Year Remod/Add (Discrete): Remodel date (same as construction date if no remodeling or additions)
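One way to handle such columns, sketched here under the assumption that the data is loaded into a pandas DataFrame, is to cast them to strings so they are no longer treated as quantities. The rows below are hypothetical, and only three of the listed columns are shown; the same cast would apply to any other column in the list.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the data dictionary.
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "PID": [526301100, 526350040, 526351010],
    "MS SubClass": [20, 60, 20],
    "Lot Area": [31770, 11622, 14267],
})

# Identifier-like columns carry no numeric meaning, so cast them to strings.
id_like = ["Id", "PID", "MS SubClass"]
df[id_like] = df[id_like].astype(str)

# Only genuinely numeric columns remain in the numeric selection.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['Lot Area']
```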
Features with Missing Values (Bold = high number of NaNs) (italics = contains numeric NaNs):
- Lot Frontage (Continuous): Linear feet of street connected to property
- Alley (Nominal): Type of alley access to property
- Mas Vnr Type (Nominal): Masonry veneer type
- Mas Vnr Area (Continuous): Masonry veneer area in square feet
- Bsmt Qual (Ordinal): Evaluates the height of the basement
- Bsmt Cond (Ordinal): Evaluates the general condition of the basement
- Bsmt Exposure (Ordinal): Refers to walkout or garden level walls
- BsmtFin Type 1 (Ordinal): Rating of basement finished area
- BsmtFin Type 2 (Ordinal): Rating of basement finished area (if multiple types)
- Fireplace Qu (Ordinal): Fireplace quality
- Garage Type (Nominal): Garage location
- Garage Yr Blt (Discrete): Year garage was built
- Garage Finish (Ordinal): Interior finish of the garage
- Garage Qual (Ordinal): Garage quality
- Garage Cond (Ordinal): Garage condition
- Pool QC (Ordinal): Pool quality
- Fence (Ordinal): Fence quality
- Misc Feature (Nominal): Miscellaneous feature not covered in other categories
  - Elev - Elevator
  - Gar2 - 2nd Garage (if not described in garage section)
  - Othr - Other
  - Shed - Shed (over 100 SF)
  - TenC - Tennis Court
  - NA - None
A majority of the features with missing values can be imputed as 'NA' for none.
'Lot Frontage' and 'Mas Vnr Area' are the only two continuous variables with NaNs. These can be interpreted as 0s, as they correspond to having no street frontage and no masonry veneer, respectively.
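A minimal sketch of this imputation, assuming the data is loaded into a pandas DataFrame. The rows are hypothetical and only a few of the affected columns are shown; the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with the kinds of gaps described above.
df = pd.DataFrame({
    "Alley": ["Pave", np.nan, np.nan],
    "Pool QC": [np.nan, "Gd", np.nan],
    "Lot Frontage": [60.0, np.nan, 75.0],
    "Mas Vnr Area": [np.nan, 120.0, 0.0],
})

# Continuous columns: a missing value means the feature is absent, so use 0.
zero_fill = ["Lot Frontage", "Mas Vnr Area"]
# Categorical columns: a missing value means 'NA' (none).
none_fill = [col for col in df.columns if col not in zero_fill]

df[zero_fill] = df[zero_fill].fillna(0)
df[none_fill] = df[none_fill].fillna("NA")

print(df.isna().sum().sum())  # 0
```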
- Dropped data with mostly missing values.
  - Features with values in fewer than 10% of entries were dropped.
- Imputed missing values.
  - According to the data documentation, all categorical features with NaNs are actually 'NA', which means 'none'.
  - 'Lot Frontage' and 'Mas Vnr Area' are the only two continuous variables with NaNs. These can be interpreted as 0s, as they correspond to having no street frontage and no masonry veneer, respectively.
  - Replaced 'NA' with 'None' for 'Mas Vnr Type'.
  - Additionally cleaned up the 'CBlock' value, which does not appear in both data sets.
- Grouped neighborhoods.
  - Some neighborhoods have few values and do not appear in both data sets. These are grouped to maintain consistency between data sets.
- Created a 'Misc' category for 'MS Zoning'.
  - Some values for 'MS Zoning' have few entries and do not appear in both data sets. These are grouped to maintain consistency between data sets.
- Grouped exterior finishes into an 'other' category.
  - Some exterior finishes have few entries and do not appear in both data sets. These are grouped to maintain consistency between data sets.
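The grouping steps above could be sketched as follows. This is an illustration under assumptions, not the exact procedure used: the rows are hypothetical, and the catch-all label `'Other'` and the rule "keep only values present in both sets" are our own choices for the example.

```python
import pandas as pd

# Hypothetical neighborhood columns for the train and test sets.
train = pd.DataFrame({"Neighborhood": ["NAmes", "NAmes", "Edwards", "GrnHill"]})
test = pd.DataFrame({"Neighborhood": ["NAmes", "Edwards", "Landmrk"]})

# Values seen in both sets are kept; everything else falls into 'Other'.
shared = set(train["Neighborhood"]) & set(test["Neighborhood"])

for frame in (train, test):
    frame["Neighborhood"] = frame["Neighborhood"].where(
        frame["Neighborhood"].isin(shared), "Other"
    )

print(sorted(train["Neighborhood"].unique()))  # ['Edwards', 'NAmes', 'Other']
print(sorted(test["Neighborhood"].unique()))   # ['Edwards', 'NAmes', 'Other']
```

After grouping, both sets share the same category values, so dummy encoding produces the same columns for each.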
The distribution of SalePrice is right-skewed.
The Overall Quality rating is most closely correlated with the Sale Price. However, there are many other features that also exhibit a strong correlation.
All features with a correlation coefficient above 0.5 with Sale Price were individually plotted to identify outliers. Once outliers were identified, they were dropped; the resulting correlation plots are shown below.
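A sketch of how such a correlation filter might look with pandas. The data is hypothetical and far smaller than the real training set; the 0.5 threshold is the one stated above.

```python
import pandas as pd

# Hypothetical data; the real analysis used the full training set.
df = pd.DataFrame({
    "Overall Qual": [5, 6, 7, 8, 9],
    "Gr Liv Area": [1000, 1500, 1400, 2000, 2600],
    "Mo Sold": [3, 7, 1, 7, 3],
    "SalePrice": [120_000, 160_000, 185_000, 250_000, 330_000],
})

# Correlation of every numeric feature with the target, excluding the target itself.
corr = df.select_dtypes(include="number").corr()["SalePrice"].drop("SalePrice")

# Keep only the features whose correlation exceeds the threshold.
strong = corr[corr > 0.5].sort_values(ascending=False)
print(strong)
```

In this toy example 'Overall Qual' and 'Gr Liv Area' pass the filter, while 'Mo Sold' does not.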
These features show a reasonably clear correlation to Sale price that can be generalized with a linear model. However, due to the variations of spread in the data, new features were engineered from these features to create higher correlations to Sale Price.
To analyze categorical features that do not correlate to any ranking scale (nominal features), dummy features were created. Some features were excluded because there was not sufficient diversity among their values. The following dummy features were included in the model:
- MS Zoning
- Lot Shape
- Land Contour
- Lot Config
- Neighborhood
- Condition 1
- Bldg Type
- House Style
- Roof Style
- Exterior 1st
- Exterior 2nd
- Mas Vnr Type
- Foundation
- Central Air
- Garage Type
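Dummy encoding of this kind can be sketched with `pd.get_dummies`. The rows are hypothetical, and reindexing the test columns to match the training columns is one possible way (an assumption here) to keep both sets consistent when a category is missing from one of them.

```python
import pandas as pd

# Hypothetical rows for two of the nominal features listed above.
train = pd.DataFrame({
    "Central Air": ["Y", "N", "Y"],
    "Bldg Type": ["1Fam", "Twnhs", "1Fam"],
})
test = pd.DataFrame({
    "Central Air": ["Y", "Y"],
    "Bldg Type": ["1Fam", "Duplex"],
})

nominal = ["Central Air", "Bldg Type"]
train_d = pd.get_dummies(train, columns=nominal, dtype=int)
test_d = pd.get_dummies(test, columns=nominal, dtype=int)

# Give the test set exactly the training columns: dummies missing from the
# test set are filled with 0, and dummies unseen in training are dropped.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns) == list(train_d.columns))  # True
```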
Some categorical features, however, do map to a ranking scale (ordinal features). These features were converted to a numeric format on scales ranging from 0 to 6 for analysis. Some features were excluded because there was not sufficient diversity among their values. The following ordinal features were included in the model:
- Exter Qual
- Exter Cond
- Bsmt Qual
- Bsmt Cond
- Bsmt Exposure
- BsmtFin Type 1
- BsmtFin Type 2
- Heating QC
- Electrical
- Kitchen Qual
- Fireplace Qu
- Garage Finish
- Paved Drive
- Fence
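The ordinal conversion can be sketched as a dictionary lookup. The quality codes (Ex, Gd, TA, Fa, Po, NA) come from the Ames data documentation, but the specific numbers assigned below are an assumption for illustration; other features (for example 'BsmtFin Type 1') use different codes and would need their own mapping.

```python
import pandas as pd

# Assumed numeric scale for the standard quality codes; 'NA' means the
# feature is absent from the home.
quality_scale = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Hypothetical rows.
df = pd.DataFrame({
    "Kitchen Qual": ["TA", "Gd", "Ex"],
    "Fireplace Qu": ["NA", "Gd", "Po"],
})

for col in ["Kitchen Qual", "Fireplace Qu"]:
    df[col] = df[col].map(quality_scale)

print(df["Kitchen Qual"].tolist())  # [3, 4, 5]
print(df["Fireplace Qu"].tolist())  # [0, 4, 1]
```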
A new feature called 'Total Bath' was engineered to include all bathrooms. Bathrooms were counted as follows:
- 1 per full bathroom
- 0.5 per half bathroom
A new feature called 'Total Area' was engineered to account for the total square footage of the home. This includes basement and above ground area.
A new feature called 'Outside Amenity Area' was engineered to account for all outdoor amenity spaces. This included all types of porches as well.
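The three engineered features might be computed as below. This is a sketch: the single row is hypothetical, and the exact set of columns summed into each feature (in particular for 'Outside Amenity Area') is our assumption based on the descriptions above.

```python
import pandas as pd

# One hypothetical home.
df = pd.DataFrame({
    "Full Bath": [2], "Half Bath": [1],
    "Bsmt Full Bath": [1], "Bsmt Half Bath": [0],
    "Total Bsmt SF": [800], "Gr Liv Area": [1500],
    "Wood Deck SF": [100], "Open Porch SF": [40], "Enclosed Porch": [0],
    "3Ssn Porch": [0], "Screen Porch": [60],
})

# 1 per full bathroom, 0.5 per half bathroom (basement and above grade).
df["Total Bath"] = (
    df["Full Bath"] + df["Bsmt Full Bath"]
    + 0.5 * (df["Half Bath"] + df["Bsmt Half Bath"])
)

# Basement plus above-grade living area.
df["Total Area"] = df["Total Bsmt SF"] + df["Gr Liv Area"]

# Assumed set of outdoor columns: deck plus every porch type.
outdoor = ["Wood Deck SF", "Open Porch SF", "Enclosed Porch", "3Ssn Porch", "Screen Porch"]
df["Outside Amenity Area"] = df[outdoor].sum(axis=1)

print(df[["Total Bath", "Total Area", "Outside Amenity Area"]].iloc[0].tolist())
# [3.5, 2300.0, 200.0]
```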
Polynomial features were created to explore interactions between different features. The following polynomial features were created:
- Overall quality vs total square footage
- Garage area vs how many cars fit within the garage
- Exterior quality vs exterior condition
- Basement features - basement quality, total basement area, basement condition, basement finish, basement exposure
- Number of fireplaces vs fireplace quality
- Masonry area vs if the masonry is stone
- Year built vs remodeled
- Full bathrooms vs rooms above grade
- Total quality - overall quality, exterior quality, kitchen quality, basement quality, heating quality
Once all data had been cleaned and features engineered, the data was scaled and three separate models were created:
- Ordinary Least Squares (OLS)
- Ridge Regression
- LASSO Regression
Each model significantly outperformed the baseline, as should be expected. Both the Ridge and LASSO models utilized 10-fold cross-validation to improve their performance.
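The modeling setup could be sketched with scikit-learn as follows. Several things here are assumptions: the data is synthetic, the baseline is taken to be "always predict the training mean", and the 10-fold cross-validation is shown selecting the regularization strength via `RidgeCV` and `LassoCV`, which is one common reading of the description above.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature matrix and sale prices.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Assumed baseline: always predict the mean of the training target.
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))

# Each pipeline scales the features before fitting the regression.
models = {
    "OLS": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(
        StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13), cv=10)
    ),
    "LASSO": make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=42)),
}

mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse[name] = mean_squared_error(y_test, model.predict(X_test))

print(f"baseline: {baseline_mse:.3f}")
for name, value in mse.items():
    print(f"{name}: {value:.3f}")
```

Putting the scaler inside the pipeline means it is fit only on the training data, which avoids leaking information from the held-out set.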
The residuals for each model are shown below.
All three residual plots demonstrate a random pattern, which supports the assumption of a linear model. Additionally, the residual plots demonstrate homoscedasticity as the variance remains consistent across all prediction values.
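A rough numeric counterpart to these visual checks is sketched below on synthetic data (not the Ames data): fit a least-squares line, compute residuals, and compare their spread for low and high predictions. Splitting at the median prediction is our own simplification for the example.

```python
import numpy as np

# Synthetic data with constant noise variance (homoscedastic by construction).
rng = np.random.default_rng(0)
x = rng.uniform(500, 3000, size=300)  # hypothetical living area
y = 50_000 + 100 * x + rng.normal(scale=20_000, size=300)

# Fit y = b0 + b1 * x by ordinary least squares.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
resid = y - pred

# Residuals of a least-squares fit with an intercept average to about zero.
print(abs(resid.mean()) < 1.0)

# Compare residual spread for the lower and upper halves of the predictions;
# a ratio near 1 is consistent with homoscedasticity.
low = resid[pred <= np.median(pred)]
high = resid[pred > np.median(pred)]
ratio = high.std() / low.std()
print(round(float(ratio), 2))
```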
Our team of data scientists analyzed the Ames, IA housing dataset to determine if the data provided meaningful information about the Sale Price of each home.
We initially cleaned the data to account for abnormalities and problems. We then performed EDA on the dataset to discover meaningful correlations between features. Then we engineered features via several methods, including creating dummies, mapping ordinal categories to numbers, creating new features based upon similar features, and creating polynomial features to capture interactions between features. All features were then scaled to prepare them for modeling and regularization.
All of these engineered features were then tested on three linear regression models:
- Ordinary Least Squares (OLS)
- Ridge Regression (l2 penalty)
- LASSO Regression (l1 penalty)
From these models, it was determined that the Ridge and LASSO models performed best based on their
We have concluded that utilizing any of the three linear regression models with this data set can produce accurate predictions above the
Based on our achievement of the success metrics and conclusions, we recommend that Zillow allocate more funding to further develop this home sale price prediction technology. Resources should be distributed to collect larger and better data sets as well as continue to refine and improve the current proof-of-concept models.
We also recommend that if Zillow chooses to utilize models to predict the effect of certain aspects of a home on sale price rather than simply utilizing all information to predict the sale price, the complexity of the model be reduced by removing variables that exhibit multicollinearity. This would likely reduce the overall performance of the model, but would provide more clarity to how individual features affect sale price.
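As a starting point for that kind of simplification, highly correlated predictor pairs can be flagged with a pairwise correlation check. The sketch below uses hypothetical rows and an assumed cut-off of 0.8; neither comes from the analysis above.

```python
import pandas as pd

# Hypothetical rows; 'Garage Cars' and 'Garage Area' move together.
df = pd.DataFrame({
    "Garage Cars": [1, 2, 2, 3, 1, 2],
    "Garage Area": [280, 480, 520, 800, 300, 500],
    "Mo Sold": [6, 3, 9, 6, 3, 9],
})

corr = df.corr().abs()
threshold = 0.8  # assumed cut-off for "highly correlated"

# Collect each pair of distinct features once.
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(pairs)  # [('Garage Cars', 'Garage Area')]
```

One feature from each flagged pair could then be dropped before refitting a model intended for interpretation.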