This document describes my Advanced Data Science I (140.711.01) final project. The code and all supporting files are in my advanced_datasci GitHub repo.
For my project, I will compete for the Zillow prize and write up my results and experiences.
The data provided for the challenge are described at the Zillow Prize site.
My primary objective is to describe my efforts to put together a half-decent entry in the Zillow challenge.
The emphasis in this README file is on describing the process of designing and planning this project, while the project files focus on the exploratory data analysis itself.
Name | Type | Contains
---|---|---
`01_zillow_MWS.ipynb` | Main source | Everything
`01_zillow_MWS.py` | Python script | Text & Code
`01_zillow_MWS.html` | Report | Text & Figures
`01_zillow_MWS.md` | Report | Text & Figures
`01_zillow_MWS.pdf` | Report | Text & Figures
The code for the report is saved as the `01_zillow_MWS.ipynb` and `01_zillow_MWS.py` files. The report is saved as the `01_zillow_MWS.md`, `01_zillow_MWS.html`, and `01_zillow_MWS.pdf` files. Each time I save the notebook, the code and report files are automatically regenerated thanks to the save hook in the `jupyter_notebook_config.py` file. The config file also specifies that the report files contain no code or input/output numbering.
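A save hook like the one described above can be sketched as a `post_save_hook` in `jupyter_notebook_config.py`. This is a minimal illustration using the standard Jupyter/nbconvert interfaces, not the project's actual config; the exact export options (e.g., suppressing code and numbering) would be set via additional nbconvert flags.

```python
# Sketch of a post-save hook for jupyter_notebook_config.py.
# On every notebook save, nbconvert regenerates script and report files.
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """Convert saved notebooks to .py and .html alongside the .ipynb."""
    if model['type'] != 'notebook':
        return  # only act on notebooks, not plain files
    directory, filename = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', filename], cwd=directory)
    check_call(['jupyter', 'nbconvert', '--to', 'html', filename], cwd=directory)

c.FileContentsManager.post_save_hook = post_save
```

The `c` object is provided by Jupyter when it loads the config file, so this snippet is only runnable in that context.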
I spent a great deal of time trying to figure out how to make my data analysis report reproducible. There are three strategies to run the code:
- Install Anaconda and create a conda environment using the `env.yml` file, which lists all packages and versions.
- Use the Kaggle Docker container.
- Run the code on Kaggle.
More info on these three options can be found in the config folder!
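For the first option, recreating the conda environment comes down to two commands. This is a generic sketch; the environment name is whatever is set inside `env.yml`, and `zillow` below is just a placeholder.

```shell
# Recreate the analysis environment from the exported spec
# (assumes conda is installed and env.yml is in the working directory).
conda env create -f env.yml
conda activate zillow   # substitute the name defined in env.yml
```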
After signing up for the challenge, I decided to use only Python in the Jupyter Notebook environment. This is a personal preference: de gustibus non est disputandum. If I want to use R, I can do so with the rpy2 package, while the use of Python in an R Notebook in RStudio is extremely limited.
I was very happy to find out that Kaggle supports the Jupyter Notebook format. I can upload and download notebooks and work with them using the Kaggle Notebook interface, which is very similar to the Notebook interface with which I am deeply enamored (see previous section).
The challenge entails training a machine learning algorithm to predict the log error between Zillow's proprietary Zestimate prediction of home values and the actual home values.
The metric by which submissions are evaluated is the mean absolute error between the predicted log error and the actual log error. The log error is defined as `logerror = log(Zestimate) − log(SalePrice)`.
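The evaluation metric is simple enough to state in a few lines of code. This is a generic illustration of mean absolute error with made-up log-error values, not the competition's official scoring script.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Competition metric: mean absolute error between actual
    and predicted log errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Toy example: predicting zero log error for every property.
actual = [0.02, -0.01, 0.05]
predicted = [0.0, 0.0, 0.0]
print(mean_absolute_error(actual, predicted))
```

Note that always predicting a log error of zero (i.e., trusting the Zestimate exactly) is a natural baseline to beat.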
- Release of 2017 Training data: 10/2/2017
- Round 1 Submission Deadline: 10/16/2017 11:59 PM PT
- Create a Kaggle account Done
- Create a GitHub repo for the Advanced Data Science I (140.711.01) final project Done
- Download the data files and put them in the repo Done
- Add the data files to `.gitignore`, except for `zillow_data_dictionary.xlsx` (a useful code book that explains the data) Done
- Perform exploratory data analysis Done
- Choose and implement imputation method Done
- Split the training data into training and test sets Done
- Try different algorithms using Scikit-Learn Done
- Measure algorithm performance
- Select the top algorithm(s)
- Assess opportunities to improve performance of the top algorithm(s)
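The middle steps of the checklist above (impute, split, try algorithms, measure performance) can be sketched with scikit-learn. The data, feature weights, and model choices below are illustrative stand-ins, not the actual Zillow features or the models used in the project.

```python
# Minimal sketch: impute missing values, split into train/test sets,
# and compare two regressors on mean absolute error.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))                      # synthetic features
y = X_full @ np.array([0.5, -0.2, 0.1, 0.0, 0.3])       # synthetic target
y += rng.normal(scale=0.1, size=200)

X = X_full.copy()
X[rng.random(X.shape) < 0.1] = np.nan                   # simulate missing data

# Imputation step: fill missing values with the column median.
X_imp = SimpleImputer(strategy='median').fit_transform(X)

# Split the training data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_imp, y, test_size=0.25, random_state=0)

# Try different algorithms and measure performance.
results = {}
for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = mean_absolute_error(
        y_test, model.predict(X_test))

for name, mae in results.items():
    print(name, round(mae, 3))
```

In a real pipeline the imputer should be fit on the training split only (e.g., inside a `Pipeline`) to avoid leaking test-set information.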