Data Cleaning in Pandas - Recap

Introduction

In this section you saw how to wrangle and clean data in Pandas! This will be a baseline skill that you will use consistently in your work whether it's doing sanity checks, cleaning messy data or transforming raw datasets into useful aggregates and views. Having an understanding of the format of your data is essential to critically thinking about how you can manipulate and shape it into new and interesting forms.

Lambda functions

We started out by introducing lambda functions. These are quick throw away functions that you can write on the fly. They're very useful for transforming a column feature. For example, you might want to extract the day from a date.

import pandas as pd
dates = pd.Series(['12-01-2017', '12-02-2017', '12-03-2017', '12-04-2017'])
dates.map(lambda x: x.split('-')[1])

0    01
1    02
2    03
3    04
dtype: object

Combining DataFrames

You can combine dataframes by merging them (joining data by a common field) or concatenating them (appending data at the beginning or end).

df1 = pd.DataFrame(dates)
df2 = pd.DataFrame(['12-05-2017', '12-06-2017', '12-07-2017'])
pd.concat([df1, df2])

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	0
0	12-01-2017
1	12-02-2017
2	12-03-2017
3	12-04-2017
0	12-05-2017
1	12-06-2017
2	12-07-2017

Grouping and aggregating

df = pd.read_csv('titanic.csv')
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1.0	0.0	3	Braund, Mr. Owen Harris	male	22.0	1.0	A/5 21171	7.2500	NaN	S
1	2.0	1.0	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1.0	PC 17599	71.2833	C85	C
2	3.0	1.0	3	Heikkinen, Miss. Laina	female	26.0	0.0	STON/O2. 3101282	7.9250	NaN	S
3	4.0	1.0	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1.0	113803	53.1000	C123	S
4	5.0	0.0	3	Allen, Mr. William Henry	male	35.0	0.0	373450	8.0500	NaN	S

grouped = df.groupby(['Pclass', 'Sex'])['Age'].mean().reset_index()
grouped.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Pclass	Sex	Age
0	1	female	34.098291
1	1	male	38.287799
2	2	female	26.338992
3	2	male	31.630340
4	3	female	25.677973

Pivot tables

pivoted = grouped.pivot(index='Pclass', columns = 'Sex', values='Age')
pivoted

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

Sex	female	male
Pclass
1	34.098291	38.287799
2	26.338992	31.630340
3	25.677973	25.757624
?	32.011356	32.862597

Graphing

import matplotlib.pyplot as plt
%matplotlib inline
pivoted.plot(kind='barh')

<matplotlib.axes._subplots.AxesSubplot at 0x121802828>

Missing data

print('Top 5 Values before:\n', df['Cabin'].value_counts(normalize=True).reset_index()[:5])
# Not a useful means of imputing in most cases, but a simple example to recap
df.Cabin = df['Cabin'].fillna(value='?')
print('Top 5 Values after:\n', df.Cabin.value_counts(normalize=True).reset_index()[:5])

Top 5 Values before:
          index     Cabin
0         E121  0.013289
1          A20  0.013289
2          B41  0.013289
3  C23 C25 C27  0.011628
4          E38  0.011628
Top 5 Values after:
   index     Cabin
0     ?  0.567218
1   B41  0.005751
2  E121  0.005751
3   A20  0.005751
4   E38  0.005032

Summary

In this lesson, you started practicing essential ETL skills that you will use throughout your data work to transform and wrangle data into useful forms.

anguyen61191 / dsc-summary-data-cleaning-pandas Goto Github PK

dsc-summary-data-cleaning-pandas's Introduction

Data Cleaning in Pandas - Recap

Introduction

Lambda functions

Combining DataFrames

Grouping and aggregating

Pivot tables

Graphing

Missing data

Summary

dsc-summary-data-cleaning-pandas's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent