Code Monkey home page Code Monkey logo

dsc-phase-2-project-v2-3's Introduction

Linear Analysis of House Prices

Authors: Alec Mills Ashli Dougherty Matt Rubic

Project Overview

For this project, we used multiple linear regression modeling to analyze house sales in a King County from 2014-2015.

Business Problem

Based on the King County Housing Authority data from 2014-2015, we seek to explain to a large realty group (e.g. Berkshire Hathaway) which variables influence the sale price of the most expensive houses on the market (top 15%, or houses over \ $800k).

The Data

This project uses the King County House Sales dataset, which can be found in kc_house_data.csv in the data folder in the project's GitHub repository. The description of the column names can be found in column_names.md in the same folder.

Methods

We cleaned data to remove all null values and convert categorical data into numeric data that could be interpretted by a linear regression model. We used correlation between our target variable (price), and all other variables to identify the most highly correlated variable (sqft_living), and made a baseline linear regression model of our target variable from that variable. We iterated upon that model, to include as much information as we could to explain price, while also trying to reduce the effect of collinearity.

To follow our process in chronological order, read through notebooks in the following order: -Data Cleaning -EDAs -Train_test model -High Value Variables

Results

Our model was able to explain 54% (r-squared of .540) of the variation in sale price of houses that sold for over \ $800,000. The two variables that explained the greatest amount of variation in sale price were sqft_living and grade of a house. Based on the coefficents produced by our model, each square foot of a house increased the sale price by \ $275.98, and each increase in building grade above the minimum obtainable grade for a house increased the sale price by about \ $4454.

The model that we used to explain the variance in sale price of high-value houses performed better with the data of all houses sold in King County for 2014-2015, where it was able to explain 69.2% (r-squared of .692) of the variance in sale price of a house. We found that much of the variation in sale price that our model did not account for could be explained by the location of a house. Though latitude and longitude are continuous values and were included in our model, their relationships with price are non-linear. We found through graphical and statistical analysis that specifc locations yielded generally higher sale prices, and transcended what was otherwise a strong relationship between sqft_living and price of a house.

https://docs.google.com/presentation/d/1_s7tqmaxKG7qEw2rG70wyL2cvsd_6_NrE7Ayt_nRYUo/edit?usp=sharing

alt text alt text alt text alt text alt text alt text alt text alt text

Conclusions

Based on our analysis, we would explain that the greatest factors for predicting which houses will sell for the greatest amounts (among high-value houses) are square footage of the house, building grade of the house, and the location of the house. We identified some premium locations that contribute to greater sale prices, such as houses in the Seattle city limits, houses that are near water, and some specific neighborhoods. We also concluded that in general, factors like the year a house was built, its condition above average, and its number of bedrooms are not predictive of its sale price (among high value houses). We suspect these factors are not predictive because houses in this subset are of a higher minimum condition than all houses sold, and that location plays a grater role in the value of a house than these specific features. A limitation of our analysis is that a linear regression will not identify factors that may explain much of the data, but in a non-linear way (such as the relationship between latitude, longitude, and price). A next step would be to use a non-linear regression (such as a Random Forest regression) to identify relationships within the data that exist, but are not linear.

dsc-phase-2-project-v2-3's People

Contributors

mattrubic avatar davidbraslow avatar hoffm386 avatar ashlidougherty avatar cheffrey2000 avatar alecmills avatar joelsewhere avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.