advanced_datasci's Introduction

Zillow Challenge

This document describes my Advanced Data Science I (140.711.01) final project. The code and all supporting files are in my advanced_datasci GitHub repo.

For my project, I will compete for the Zillow prize and write up my results and experiences.

The data provided for the challenge are described on the Zillow Prize site.

Primary objective

My primary objective is to describe my efforts to put together a half-decent entry in the Zillow challenge.

Repository organization

This README describes how I designed and planned the project, while the project files focus on the exploratory data analysis.

Project Files

Name                 Type           Contains
01_zillow_MWS.ipynb  Main source    Everything
01_zillow_MWS.py     Python script  Text & Code
01_zillow_MWS.html   Report         Text & Figures
01_zillow_MWS.md     Report         Text & Figures
01_zillow_MWS.pdf    Report         Text & Figures

The code for the report is saved in the 01_zillow_MWS.ipynb and 01_zillow_MWS.py files. The report itself is saved in the 01_zillow_MWS.md, 01_zillow_MWS.html and 01_zillow_MWS.pdf files. Each time I save the notebook, the code and report files are generated automatically by the save hook in the jupyter_notebook_config.py file. The config file also specifies that the report files contain no code and no input/output numbering.
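The following is a minimal sketch of how such a save hook could look; it is not a copy of the jupyter_notebook_config.py in the repo. The helper name build_nbconvert_commands, the list of formats and the choice of nbconvert flags are my assumptions, based on the file types listed above.

```python
import os
import subprocess

# Report formats listed in the Project Files table (assumed mapping to
# nbconvert's "--to" values).
REPORT_FORMATS = ("html", "markdown", "pdf")

# nbconvert options that hide code cells and the In[ ]/Out[ ] numbering.
HIDE_CODE_FLAGS = [
    "--TemplateExporter.exclude_input=True",
    "--TemplateExporter.exclude_input_prompt=True",
    "--TemplateExporter.exclude_output_prompt=True",
]


def build_nbconvert_commands(notebook_name):
    """Return one nbconvert command per output file (hypothetical helper)."""
    # The .py script keeps text and code, so it gets no hiding flags.
    commands = [["jupyter", "nbconvert", "--to", "script", notebook_name]]
    for fmt in REPORT_FORMATS:
        commands.append(
            ["jupyter", "nbconvert", "--to", fmt, *HIDE_CODE_FLAGS, notebook_name]
        )
    return commands


def post_save(model, os_path, contents_manager):
    """Post-save hook: regenerate the script and the reports for notebooks."""
    if model["type"] != "notebook":
        return  # ignore plain files and directories
    directory, notebook_name = os.path.split(os_path)
    for command in build_nbconvert_commands(notebook_name):
        subprocess.check_call(command, cwd=directory or None)


# In jupyter_notebook_config.py, where Jupyter provides the config object `c`,
# the hook would be registered with:
#     c.FileContentsManager.post_save_hook = post_save
```

Keeping the command construction in its own function makes it possible to check which files would be produced without actually running nbconvert.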

Reproducibility

I spent a great deal of time trying to figure out how to make my data analysis report reproducible. There are three ways to run the code:

  • Install Anaconda and create a conda environment using the env.yml file, which lists all packages and their versions.
  • Use the Kaggle Docker container.
  • Run the code on Kaggle.

More information on these three options is in the config folder.

Environment

After signing up for the challenge, I decided to use only Python in the Jupyter Notebook environment. This is a personal preference, de gustibus non est disputandum. If I want to use R, I can do so with the rpy2 package, whereas support for Python in an R Notebook in RStudio is extremely limited.

First impressions

I was very happy to find out that Kaggle supports the Jupyter Notebook format. I can upload and download notebooks and work with them using the Kaggle Notebook interface, which is very similar to the Notebook interface with which I am deeply enamored (see previous section).

Zillow Prize Metric

The challenge entails training a machine learning algorithm to predict the log error between Zillow's proprietary Zestimate of a home's value and the home's actual sale price.

The metric by which submissions are evaluated is the Mean Absolute Error between the predicted log error and the actual log error. The log error is defined as

$\text{logerror} = \log(\text{Zestimate}) - \log(\text{SalePrice})$
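As a small illustration of the metric, the sketch below computes the log error and the Mean Absolute Error in plain Python. The function names and the dollar amounts are made up for the example, and the use of the natural logarithm is an assumption.

```python
import math


def log_error(zestimate, sale_price):
    """logerror = log(Zestimate) - log(SalePrice), natural log assumed."""
    return math.log(zestimate) - math.log(sale_price)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted log errors."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only: (Zestimate, sale price) pairs in dollars.
homes = [(310_000, 300_000), (195_000, 200_000), (450_000, 450_000)]
actual_log_errors = [log_error(z, s) for z, s in homes]

# A trivial baseline submission that predicts a log error of zero everywhere.
baseline_predictions = [0.0] * len(homes)
print(mean_absolute_error(actual_log_errors, baseline_predictions))
```

A positive log error means the Zestimate was above the sale price, and a negative one means it was below.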

Zillow Prize Timeline

  • Release of 2017 Training data: 10/2/2017
  • Round 1 Submission Deadline: 10/16/2017 11:59 PM PT

First steps

  1. Create a Kaggle account (Done)
  2. Create a GitHub repo for the Advanced Data Science I (140.711.01) final project (Done)
  3. Download the data files and put them in the repo (Done)
  4. Add the data files to .gitignore, except for zillow_data_dictionary.xlsx (a useful code book that explains the data) (Done)
  5. Perform exploratory data analysis (Done)
  6. Choose and implement an imputation method (Done)
  7. Split the training data into training and test sets (Done)
  8. Try different algorithms using scikit-learn (Done)
  9. Measure algorithm performance
  10. Select the top algorithm(s)
  11. Assess opportunities to improve performance of the top algorithm(s)
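Steps 6 through 10 can be sketched with scikit-learn roughly as follows. This is an outline under stated assumptions, not the code in the notebook: the data are synthetic stand-ins generated with NumPy, and median imputation and the two models are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the property features and the logerror target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 0.05 * X[:, 0] + rng.normal(scale=0.1, size=500)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out about 10% of the values

# Step 7: hold out part of the training data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 8: candidate algorithms (illustrative choices).
models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

# Steps 6 and 9: impute inside a pipeline so the medians are learned from the
# training split only, then score each model with the competition metric.
scores = {}
for name, model in models.items():
    pipeline = make_pipeline(SimpleImputer(strategy="median"), model)
    pipeline.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, pipeline.predict(X_test))

# Step 10: the model with the lowest Mean Absolute Error is the top candidate.
best = min(scores, key=scores.get)
print(scores, best)
```

Putting the imputer inside the pipeline keeps information from the held-out test split from leaking into the imputed training values.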

Contributors

marskar
