This document describes my Advanced Data Science I (140.711.01) final project. The code and all supporting files are in my advanced_datasci GitHub repo.
For my project, I will compete for the Zillow prize and write up my results and experiences.
The data provided for the challenge are described at the Zillow Prize site.
My primary objective is to describe my efforts to put together a half-decent entry in the Zillow challenge.
The emphasis in this README file is on describing the process of designing and planning this project, while the project files focus on the exploratory data analysis itself.
Name | Type | Contains
---|---|---
`01_zillow_MWS.ipynb` | Main source | Everything
`01_zillow_MWS.py` | Python script | Text & Code
`01_zillow_MWS.html` | Report | Text & Figures
`01_zillow_MWS.md` | Report | Text & Figures
`01_zillow_MWS.pdf` | Report | Text & Figures
The code for the report is saved as the `01_zillow_MWS.ipynb` and `01_zillow_MWS.py` files. The report is saved as the `01_zillow_MWS.md`, `01_zillow_MWS.html`, and `01_zillow_MWS.pdf` files. Each time I save the notebook, the code and report files are automatically regenerated thanks to the save hook in the `jupyter_notebook_config.py` file. The config file also specifies that the report files contain no code or input/output numbering.
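A save hook like the one described above can be sketched as a `post_save_hook` in `jupyter_notebook_config.py`. This is a minimal illustration using the standard Jupyter/nbconvert interfaces, not the project's actual config; the exact export options (e.g., suppressing code and numbering) would be set via additional nbconvert flags.

```python
# Sketch of a post-save hook for jupyter_notebook_config.py.
# On every notebook save, nbconvert regenerates script and report files.
import os
from subprocess import check_call

def post_save(model, os_path, contents_manager):
    """Convert saved notebooks to .py and .html alongside the .ipynb."""
    if model['type'] != 'notebook':
        return  # only act on notebooks, not plain files
    directory, filename = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', filename], cwd=directory)
    check_call(['jupyter', 'nbconvert', '--to', 'html', filename], cwd=directory)

c.FileContentsManager.post_save_hook = post_save
```

The `c` object is provided by Jupyter when it loads the config file, so this snippet is only runnable in that context.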
I spent a great deal of time trying to figure out how to make my data analysis report reproducible. There are three strategies to run the code:
- Install Anaconda and create a conda environment using the `env.yml` file, which lists all packages and versions.
- Use the Kaggle Docker container.
- Run the code on Kaggle.
More info on these three options can be found in the config folder!
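For the first option, recreating the conda environment comes down to two commands. This is a generic sketch; the environment name is whatever is set inside `env.yml`, and `zillow` below is just a placeholder.

```shell
# Recreate the analysis environment from the exported spec
# (assumes conda is installed and env.yml is in the working directory).
conda env create -f env.yml
conda activate zillow   # substitute the name defined in env.yml
```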
After signing up for the challenge, I decided to use only Python in the Jupyter Notebook environment. This is a personal preference: de gustibus non est disputandum. If I want to use R, I can do so with the rpy2 package, while the use of Python in an R Notebook in RStudio is extremely limited.
I was very happy to find out that Kaggle supports the Jupyter Notebook format. I can upload and download notebooks and work with them using the Kaggle Notebook interface, which is very similar to the Notebook interface with which I am deeply enamored (see previous section).
The challenge entails training a machine learning algorithm to predict the log error between Zillow's proprietary Zestimate prediction of home values and the actual home values.
The metric by which submissions are evaluated is the mean absolute error between the predicted log error and the actual log error. The log error is defined as `logerror = log(Zestimate) − log(SalePrice)`.
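The evaluation metric is simple enough to state in a few lines of code. This is a generic illustration of mean absolute error with made-up log-error values, not the competition's official scoring script.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Competition metric: mean absolute error between actual
    and predicted log errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_true - y_pred).mean()

# Toy example: predicting zero log error for every property.
actual = [0.02, -0.01, 0.05]
predicted = [0.0, 0.0, 0.0]
print(mean_absolute_error(actual, predicted))
```

Note that always predicting a log error of zero (i.e., trusting the Zestimate exactly) is a natural baseline to beat.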
- Release of 2017 Training data: 10/2/2017
- Round 1 Submission Deadline: 10/16/2017 11:59 PM PT
- Create a Kaggle account Done
- Create a GitHub repo for the Advanced Data Science I (140.711.01) final project Done
- Download the data files and put them in the repo Done
- Add the data files to `.gitignore`, except for `zillow_data_dictionary.xlsx` (a useful code book that explains the data) Done
- Perform exploratory data analysis Done
- Choose and implement imputation method Done
- Split the training data into training and test sets Done
- Try different algorithms using Scikit-Learn Done
- Measure algorithm performance
- Select the top algorithm(s)
- Assess opportunities to improve performance of the top algorithm(s)
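The middle steps of the checklist above (impute, split, try algorithms, measure performance) can be sketched with scikit-learn. The data, feature weights, and model choices below are illustrative stand-ins, not the actual Zillow features or the models used in the project.

```python
# Minimal sketch: impute missing values, split into train/test sets,
# and compare two regressors on mean absolute error.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 5))                      # synthetic features
y = X_full @ np.array([0.5, -0.2, 0.1, 0.0, 0.3])       # synthetic target
y += rng.normal(scale=0.1, size=200)

X = X_full.copy()
X[rng.random(X.shape) < 0.1] = np.nan                   # simulate missing data

# Imputation step: fill missing values with the column median.
X_imp = SimpleImputer(strategy='median').fit_transform(X)

# Split the training data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_imp, y, test_size=0.25, random_state=0)

# Try different algorithms and measure performance.
results = {}
for model in (LinearRegression(),
              RandomForestRegressor(n_estimators=50, random_state=0)):
    model.fit(X_train, y_train)
    results[type(model).__name__] = mean_absolute_error(
        y_test, model.predict(X_test))

for name, mae in results.items():
    print(name, round(mae, 3))
```

In a real pipeline the imputer should be fit on the training split only (e.g., inside a `Pipeline`) to avoid leaking test-set information.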