More on Missing Data - Lab

Introduction

In this lab, you'll continue to practice techniques for dealing with missing data. You'll also observe how each of these techniques changes the distribution of your data.

Objectives

In this lab you will:

  • Evaluate and execute the best strategy for dealing with missing, duplicate, and erroneous values for a given dataset
  • Determine how the distribution of data is affected by imputing values

Load the data

To start, load the dataset 'titanic.csv' using pandas.

# Your code here
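The lab does not include an import cell, so pandas has to be imported first (and matplotlib.pyplot later, for the histograms). The sketch below shows one possible way to load the data. Because 'titanic.csv' may not be on hand, it reads a tiny made-up stand-in from memory; the column names and values in `sample_csv` are assumptions for illustration, not real Titanic rows.

```python
import io

import pandas as pd

# In the lab itself you would read the real file:
#     df = pd.read_csv('titanic.csv')
# The stand-in below is hypothetical data so the sketch runs without the file.
sample_csv = io.StringIO(
    "PassengerId,Age,Fare\n"
    "1,22,7.25\n"
    "2,,71.28\n"
    "3,26,7.92\n"
    "4,,8.05\n"
    "5,35,53.10\n"
)

# Empty fields in the CSV are parsed as NaN (missing) by default.
df = pd.read_csv(sample_csv)
print(df.head())
```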

Use the .info() method to quickly preview which features have missing data.

# Your code here
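As a rough illustration, `.info()` reports the non-null count for each column, and `.isna().sum()` gives the number of missing entries directly. The small DataFrame here is a made-up stand-in (an assumption, not the lab's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Prints the dtype and non-null count of every column.
df.info()

# Number of missing entries per column, as a more direct check.
missing_counts = df.isna().sum()
print(missing_counts)
```

A column whose non-null count is lower than the number of rows has missing data.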

Observe the original measures of centrality

Let's look at the 'Age' feature. Calculate the mean, median, and standard deviation of this feature. Then plot a histogram of the distribution.

# Your code here
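A minimal sketch of the baseline calculation, using a made-up `ages` Series in place of the real 'Age' column (the values are an assumption for illustration). By default pandas skips missing values when computing these statistics.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries; not real Titanic data.
ages = pd.Series([22, 38, 26, np.nan, 35, np.nan, 54, 2, 27, 14], name="Age")

# mean, median, and std all ignore NaN by default (skipna=True).
baseline = {
    "mean": ages.mean(),
    "median": ages.median(),
    "std": ages.std(),  # sample standard deviation (ddof=1)
}
print(baseline)

# In a notebook with matplotlib installed, ages.hist(bins=5) would draw
# the histogram of the non-missing values.
```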

Impute missing values using the mean

Fill the missing 'Age' values using the average age. (Don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.

# Your code here

Commentary

Note that the standard deviation dropped, the median rose slightly, and the distribution now has more mass near the center. The mean itself is unchanged, since every filled-in value equals the mean.
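The effect of mean imputation can be checked on a small example. This is only a sketch on made-up ages (an assumption, not the lab's data); `fillna` returns a new Series, so the original stays intact for later comparisons.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries; not real Titanic data.
ages = pd.Series([22, 38, 26, np.nan, 35, np.nan, 54, 2, 27, 14], name="Age")

# fillna returns a copy, so `ages` still contains its NaN values.
age_mean_filled = ages.fillna(ages.mean())

comparison = pd.DataFrame({
    "original": [ages.mean(), ages.median(), ages.std()],
    "mean_imputed": [
        age_mean_filled.mean(),
        age_mean_filled.median(),
        age_mean_filled.std(),
    ],
}, index=["mean", "median", "std"])
print(comparison)
```

On this sample the mean stays the same, the standard deviation shrinks, and the median moves toward the mean.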

Impute missing values using the median

Fill the missing 'Age' values, this time using the median age. (Again, don't overwrite the original data, as we will be comparing to other methods for dealing with the missing values.) Then recalculate the mean, median, and std and replot the histogram.

# Your code here

Commentary

Imputing the median has a similar effect to imputing the mean. The variance is reduced, while the mean is slightly lowered because the median age is below the mean age. You can once again see that there is a larger mass of data near the center of the distribution.
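The same kind of check works for median imputation. Again, this is a sketch on made-up ages (an assumption for illustration), chosen so that the median sits below the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries; not real Titanic data.
ages = pd.Series([22, 38, 26, np.nan, 35, np.nan, 54, 2, 27, 14], name="Age")

# Fill the gaps with the median of the observed values; `ages` is untouched.
age_median_filled = ages.fillna(ages.median())

comparison = pd.DataFrame({
    "original": [ages.mean(), ages.median(), ages.std()],
    "median_imputed": [
        age_median_filled.mean(),
        age_median_filled.median(),
        age_median_filled.std(),
    ],
}, index=["mean", "median", "std"])
print(comparison)
```

Here the median is preserved, while the mean is pulled slightly toward it and the standard deviation shrinks.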

Dropping rows

Finally, let's observe the impact on the distribution if we were to simply drop all of the rows that are missing an age value. Then, calculate the mean, median and standard deviation of the ages along with a histogram, as before.

# Your code here

Commentary

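Dropping missing values leaves the distribution and associated measures of centrality unchanged, because pandas already skips missing values when computing these statistics. The cost is throwing away data: every other value in the dropped rows is lost as well.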
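A short sketch of dropping rows, on a made-up DataFrame (the columns and values are assumptions). `dropna(subset=["Age"])` removes only the rows where 'Age' is missing and returns a new DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; the values are made up.
df = pd.DataFrame({
    "Age": [22, 38, 26, np.nan, 35, np.nan, 54, 2, 27, 14],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46, 51.86, 21.08, 11.13, 30.07],
})

# Keep only rows with a known age; the original df is not modified.
dropped = df.dropna(subset=["Age"])

print(len(df), "rows before,", len(dropped), "rows after")
print(dropped["Age"].agg(["mean", "median", "std"]))
```

The 'Age' statistics match the originals, but the 'Fare' values from the two dropped rows are no longer available.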

Summary

In this lab, you briefly practiced some common techniques for dealing with missing data. Moreover, you observed the impact that these methods had on the distribution of the feature itself. When you begin to tune models on your data, these considerations will be an essential part of developing robust and accurate models.

dsc-more-on-missing-data-lab's Issues

missing block/instructions to import pandas

Link to Canvas

https://learning.flatironschool.com/courses/7244/assignments/272755?module_item_id=657901

Issue Subtype

  • Master branch code
  • Solution branch code
  • Code tests
  • Layout/rendering issue
  • [x] Instructions unclear
  • Other (explain below)

Describe the Issue

Source

n/a

Concern

I don't know if this is intentional, but every other lab typically starts with a code block with instructions to import any necessary modules, and that was omitted in this lab.

(Optional) Proposed Solution

I think you need to import pandas and matplotlib.pyplot.

What OS Are You Using?

  • OS X
  • [x] Windows
  • WSL
  • Linux
  • Saturn Cloud from Canvas

Any Additional Context?
