Code Monkey home page Code Monkey logo

kingcountysales's Introduction

King County Sales

Flatiron School Phase 2 Project

Harrison Carter

Flatiron School NYC Data Science cohort 05162022

alt text

Project Description

The goal of this project is to use the dataset provided by King County, WA to make predictions and inferences about housing prices according to potential predictors therein. The focal point is to use multiple linear regression to explore the parameters that correlate with my target variable (sale price) in order to build an acceptable model for analysis. All of the computational work here is done in Python and associated libraries (primarily sklearn). The primary predictor here for price is the Zip code for the associated residence, secondary predictors being the square footage of the associated living space, its waterfront juxtaposition or lack thereof, and ZIP code.

Methodology

First read the data. Initially, we display a heatmap to help inform choices moving forward. here we see that the sqft_living has a relatively high correlation, and we note that for later incorporation into the final model. We then prepare our first model and displays its regression results for inspection of the initial model between ZIP code and log price. By comparing ZIP code, we already account for around half of the variance in the data with the r squared value of 0.533. Rule of thumb is that the skew should be less than +/-0.5 for proper regression and the kurtosis should not exceed 6 to ensure normality. Here, skew is too high at 0.594 and kurtosis is borderline, with a value of 5.132. We can fix this going forward.

I selected the living area to incorporate into the multiple regression to boost the r squared value, indicating a much better fit and capturing much more variance. Using the log of the square footage for the living area does not change the r squared, but does eliminate some kurtosis and much of the skew. Both of these are further reduced by incorporating waterfront availability into the regression. This leaves us with a p-value of 0.833.

Performing cross validation on this model with our dataset yields a similar p-value to the test set, within 0.004. This means that our model is not overfit, provided that there is no data leakage (a problem that did occur and was fixed).

Conclusion

This model takes two major factors that alone do not determine the price of a house and combines them to better infer their additive relationship to price, and tempers the skewness and kurtosis to acceptable levels by incorporating waterfront availability. In short, we can take this model and use it to analyze real world housing data with high confidence, accounting for prospectively 83.3 % of the variance in the data.

kingcountysales's People

Contributors

krispy3 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.