
Pandas for Everyone

Repository to accompany "Pandas for Everyone".

If you have gone through the book, an Amazon review would be much appreciated! My mom would appreciate it too :)

Setup

The easiest way to get everything you need for the tutorial is to install Anaconda.

You can download and install it here: https://www.continuum.io/downloads

To download just the data, see the Data section below. Otherwise you can clone this repository, or click the "Clone or Download" link above and click "Download ZIP".

Install seaborn for plotting

conda install seaborn

Install all the packages used in the book

There is an error in the preface of the book about installing packages. I am leaving this section here in the README as an updated list of packages and installation instructions.

(Optional) Create a Virtual Environment

You can choose to create a virtual environment for the packages used in the book, so they don't clash with other packages you plan to use later on.

# create a virtual environment named "book" using python 3.6
conda create -n book python=3.6

# activate the environment
# so all installed packages will go in there and not mess up your base python environment
source activate book

Install the packages

Whether you decided to create a virtual environment or not, you can install the packages with the commands below. If you did use a virtual environment, remember to source activate book before you follow along with the book so the packages you installed can be loaded.

conda install pandas xlwt openpyxl seaborn numpy ipython jupyter statsmodels scikit-learn regex wget odo numba
conda install -c conda-forge pweave # you don't really need this package, it was used to build and create the book
conda install -c conda-forge feather-format
pip install lifelines pandas-datareader

Teaching Slides

For instructors using the teaching slide deck version of the book: each chapter is split into its own slide deck, and there are multiple versions of each chapter.

  1. Jupyter notebook (ipynb)
  2. PDF
  3. HTML

The slides are created using Damian Avila's RISE Jupyter/IPython Slideshow Extension. Thus, you can choose to install the RISE extension and live-render the Jupyter notebooks (ipynb). Since each chapter is a Jupyter notebook at heart, the conversions to PDF and HTML are performed using

jupyter nbconvert --to slides your_talk.ipynb --post serve

More about usage and converting to PDF can be found on the RISE documentation page.

No Powerpoint (.ppt/.odp)

RISE's back end uses reveal.js. Unfortunately, there is no way to go from a reveal.js presentation to PowerPoint. That said, if there's a way we can jerry-rig something together using the given capabilities of RISE and reveal.js, please let me know.

Data

You can choose to just download the datasets by using Minhas Kamal's DownGit by clicking the link here

Ongoing list of data references:

  1. Gapminder: https://github.com/jennybc/gapminder/
  2. Survey: Comes from the Software-Carpentry SQL lesson
  3. Ebola: www.github.com/cmrivers/ebola

Links to teaching sessions

I've taught out of the book while I was writing it. Here you can find the various tutorials and workshops I've taught (before and after the book was officially published). You can also check out my talks page for other things not completely on Pandas.

  • Online Live Training: https://github.com/chendaniely/2017-12-04-pandas_live, https://github.com/chendaniely/2018-05-pandas_live, https://github.com/chendaniely/2018-06-pandas_live
  • Whirlwind tour of Python: https://github.com/chendaniely/2017-10-26-python_crash_course
  • SciPy 2017 Pandas Tutorial: https://github.com/chendaniely/scipy-2017-tutorial-pandas (video: https://www.youtube.com/watch?v=oGzU688xCUs)
  • PyData Carolinas 2016 Tutorial: https://github.com/chendaniely/2016-pydata-carolinas-pandas (video: https://www.youtube.com/watch?v=dye7rDktJ2E)

Other random goodies

pandas_for_everyone's People

Contributors

chendaniely


pandas_for_everyone's Issues

data_reader example to get tesla stock broken

From #2:

The Tesla stock data isn't in the data folder and the data_reader is broken.

ImmediateDeprecationError: 
Yahoo Actions has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.

See https://github.com/pydata/pandas-datareader/issues

Logistic regression with sklearn fails in section 13.2.2

Hi! Following the example in section 13.2.2 to perform logistic regression using sklearn on the acs_ny.csv dataset results in a ConvergenceWarning and doesn't produce an intercept or coefficients that match those in the book:

import pandas as pd
acs = pd.read_csv('../data/acs_ny.csv')

acs['ge150k'] = pd.cut(acs['FamilyIncome'], [0,150000,acs['FamilyIncome'].max()], labels=[0,1])
acs['ge150k_i'] = acs['ge150k'].astype(int)

predictors = pd.get_dummies(acs[['HouseCosts', 'NumWorkers', 'OwnRent', 'NumBedrooms','FamilyType']], drop_first=True)

from sklearn import linear_model
lr = linear_model.LogisticRegression()

results = lr.fit(X=predictors, y=acs['ge150k_i'])

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(

To address this warning I increased the max number of iterations as follows:

lr.max_iter = 1000 # my default value was 100

But my intercept and coefficients still don't match those in the book.

Per the book's guidance, I ran the following commands:

import numpy as np
values = np.append(results.intercept_, results.coef_)
names = np.append('intercept', predictors.columns)
coefs = pd.DataFrame(values, index = names, columns=['coefs'])
coefs

And these are the results I get:

coefs or
intercept -5.632904 0.003578
HouseCosts 0.000726 1.000726
NumWorkers 0.581870 1.789382
NumBedrooms 0.238619 1.269495
OwnRent_Outright 0.570278 1.768759
OwnRent_Rented -0.692253 0.500447
FamilyType_Male Head -0.330524 0.718547
FamilyType_Married 1.224612 3.402845

Very different from those in the book:

coef or
intercept -5.492705 0.004117
HouseCosts 0.000710 1.000710
NumWorkers 0.559836 1.750385
NumBedrooms 0.222619 1.249345
OwnRent_Outright 1.180146 3.254851
OwnRent_Rented -0.730046 0.481887
FamilyType_Male Head 0.318643 1.375260
FamilyType_Married 1.213134 3.364012

I wouldn't be as thrown off if the differences were a few decimal points or so, but my results assign substantially less weight to OwnRent_Outright and to FamilyType_Male Head and I have no idea why...

PS - I have VERY little experience with statistics and data science; my background is computer science.
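A likely explanation (not confirmed against the book's exact setup): scikit-learn's LogisticRegression applies an L2 penalty by default (C=1.0), which shrinks coefficients, and its default solver changed from liblinear to lbfgs in version 0.22, so a newer scikit-learn can legitimately give different numbers than the book. A minimal sketch on made-up data shows how weakening the penalty changes the fit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, perfectly separable toy data (hypothetical, not the acs_ny data)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Default fit: the L2 penalty with C=1.0 shrinks the coefficient
default_fit = LogisticRegression().fit(X, y)

# A very large C effectively disables the penalty
# (newer scikit-learn versions also accept penalty=None)
weak_penalty_fit = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# The weakly penalized coefficient is much larger in magnitude
print(default_fit.coef_[0, 0], weak_penalty_fit.coef_[0, 0])
```

If the book's coefficients were produced with little or no regularization, refitting with a very large C (or penalty=None on scikit-learn >= 1.2) should land much closer to them; treat that as an assumption to verify against the text.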

Table 8.4 Incorrect Index Value/Result

Fifth row from top (as depicted):

"It's just a flesh sound!".find('u') -> Result == 7

Correct:

"It's just a flesh sound!".find('u') -> Result == 6

This may seem minor but newcomers will already be struggling with the zero-reference concept, so this will not be helpful along that journey.

I also am curious about the final line:

"9".zfill(with=5) -> result == '00009'

Having attempted this in both Python 2 and 3, it seems the correct form would simply be:

"9".zfill(5) -> result == '00009'
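Both points check out in plain Python:

```python
# 'u' first appears at index 6: I(0) t(1) '(2) s(3) space(4) j(5) u(6)
sentence = "It's just a flesh sound!"
print(sentence.find("u"))  # 6

# zfill takes the width as a plain positional argument, not a keyword
print("9".zfill(5))  # 00009
```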

Typo 'om' section 1.3.2.2 page 11

Hi, I was reading the book and I found a typo in section 1.3.2.2 page 11.

It says iloc and loc will behave om exactly ....

Isn't 'om' supposed to be 'in'? I can tell they are next to each other on the keyboard.

Cheers

Roque

matplotlib chart gives error

Greetings!

Kindly refer to chapter 11, section 11.11 and fig 11.3. If I type the exact code, the plot gives a datetime data error.

Please suggest a way around.

If I pass a column name to the ebola pandas data frame, then it plots for that column. But there are too many columns to type in, and I do not know if this is the best way around.

Regards,
Andy

Grammatical error on Page 43, section 2.6.1

The text reads:
"This is Python's way of serializing and saving data in a binary format reading pickle data is also backwards compatible."

It's supposed to read:
"This is Python's way of serializing and saving data in a binary format. Reading pickle data is also backwards compatible."

Typo section 3.2, page 53

In the second to last paragraph, you have in part
"to make sure the axes are apread apart from one another"

I think it should read:
"to make sure the axes are spread apart from one another"

More data missing from repository

The acs_ny.csv file is missing as well.

Also, has the book gone to print? There are some typos in the previous couple of chapters. I'll be happy to point them out if it is still at a stage where you're able to correct them.

timedelta64[Y] on p.49 is no longer supported in pandas 2? What alternatives are available?

The following code appears on p.49 in 2nd edition.

scientists['age_years'] = (scientists['age_days'].astype('timedelta64[Y]'))

When this is executed in pandas 2.0.3(Python 3.11.5), the following error is output.

Cannot convert from timedelta64[ns] to timedelta64[Y]. Supported resolutions are 's', 'ms', 'us', 'ns'

It seems that timedelta64[Y] is no longer supported in pandas 2.0.0 released in April. This seems to be the cause of the error.
https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html#construction-with-datetime64-or-timedelta64-dtype-with-unsupported-resolution

Can the following code be considered as an alternative to the above code?

# Calculate a year in seconds
seconds_in_a_year = 60 * 60 * 24 * 365

# convert timedelta to seconds and then to years
scientists['age_years'] = scientists['age_days'].dt.total_seconds() / seconds_in_a_year

print(scientists)

Please let us know if there is a better way.
Thanks.
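The total-seconds approach above looks sound. A variant (a sketch on hypothetical stand-in data, not the book's actual scientists frame) divides whole days by 365.25 to roughly account for leap years:

```python
import pandas as pd

# Hypothetical stand-in for the age_days column
scientists = pd.DataFrame(
    {"age_days": pd.to_timedelta([36525, 7305], unit="D")}
)

# .dt.days yields whole days; dividing by 365.25 approximates years
scientists["age_years"] = scientists["age_days"].dt.days / 365.25

print(scientists["age_years"].tolist())  # [100.0, 20.0]
```

Both variants are approximations (a "year" is not a fixed span of days), which is presumably why pandas 2 dropped the timedelta64[Y] resolution in the first place.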

Typo in "memory location"; section 1.4.1, page 19

In the last non-code paragraph, in the last sentence, you say:

"Notice that if we printed the grouped dataframe, Pandas would return only the memory location."

I believe it does not return the memory location. Rather, it returns the type.

Typo in section 2.3.2, on page 32

In the fourth-to-last line, you have "If we liked, we could manually supply...". I believe the word is supposed to be "like" and not "liked".

Improved way to load as conda virtual environment

Hi Daniel,

Nice course!

This is working well so-far
% conda env create --name pandas_for_everyone -f environment.yml

environment.yml:

name: pandas-data-analysis-chen
channels:
  - anaconda
  - defaults
  - conda-forge
dependencies:
  - python=3
  - pip
  - feather-format
  - ipython
  - jupyterlab
  - numba
  - numpy
  - odo
  - openpyxl
  - pandas
  - pandas-datareader
  - pweave
  - regex
  - scikit-learn
  - seaborn
  - statsmodels
  - wget
  - xlwt

Cheers,
--Peter G

missing data files

Can't find the tidy-data folder under the data folder, which contains all the data files for lesson 7. Where can I find these data files?

Can't find data/billboard-by_week/billboard-XX.csv

In Chapter 6 Data Assembly of the Second Edition,
6.3 Observational Units Across Multiple Tables, (pp.154-160),
the example scripts use:
data/billboard-by_week/billboard-XX.csv
but I couldn't find the folder (and csv files) in the data section of the repository.

Grammatical error on page 45, section 2.6.2

You wrote:
"The Series and DataFrame have a to_csv method to write a CSV file"

I believe it is supposed to read:
"Series and DataFrame types have a to_csv method to write to a CSV file"

[SOLVED] days not consecutive in the last two prints in section 6.4

In section 6.4 "Variables in Both Rows and Columns",
in the last two 'head' prints of the "weather" data on page 134,
the day column starts with value 'd1' on the 1st row, but
it jumps with no apparent reason to 'd10' on the 2nd row,
and then continues with 'd11', 'd12', and 'd13'.
Why the jump? A human error in the editing process?

kneo - the English to Japanese translator of the book.
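A likely explanation: the day labels are strings, and strings sort lexicographically, not numerically, so 'd10' comes before 'd2'. A quick illustration:

```python
# String day labels sort lexicographically: 'd1' < 'd10' < ... < 'd2'
days = ["d1", "d2", "d3", "d10", "d11", "d12", "d13"]
print(sorted(days))
# ['d1', 'd10', 'd11', 'd12', 'd13', 'd2', 'd3']
```

This is exactly the d1, d10, d11, ... ordering reported above, so it is plausibly a sort step in the book's pipeline rather than an editing error.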

Pandas for Everyone Section 2.4.1 pg 36

Hello, I'm a Python Baby and am going through the book line by line.
Section 2.4.1 Boolean Subsetting: DataFrames has a command line which is not performing the way the book suggests.

4 values passed as bool vector

3 rows returned

print(scientists.loc[[True, True, False, True]])

The book says that 3 rows should be returned, index 0, 1 and 3.
Which seems to make sense.

But the Return line says
IndexError: Boolean index has wrong length: 4 instead of 8
Which also seems to make sense, so I'm confused.

Can I get some clarification?
Thanks
Paul
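The error message is the clue: the frame being subset evidently has 8 rows, and .loc requires the boolean mask length to exactly match the row count (presumably the book's example uses a 4-row scientists frame). A small sketch with hypothetical data reproduces both behaviors:

```python
import pandas as pd

# 4-row frame: a 4-element mask works, and 3 rows come back (indexes 0, 1, 3)
df4 = pd.DataFrame({"name": ["a", "b", "c", "d"]})
print(df4.loc[[True, True, False, True]])

# 8-row frame: the same 4-element mask raises IndexError
df8 = pd.DataFrame({"name": list("abcdefgh")})
try:
    df8.loc[[True, True, False, True]]
except IndexError as err:
    print(err)  # Boolean index has wrong length: 4 instead of 8
```

So the fix is to check how many rows your scientists frame actually has (e.g. with len(scientists)) and supply a mask of that length.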

Margin Changes

Some of the printed output breaks the columns over multiple lines

  • 2.7.2 Directly change a column
  • 6.3.3 code snippet in comment has line break

Incorrect pew_long.tail() examples in 6.2.1

Hello,

Really appreciating this book. I thought I'd take the time to register that this doesn't add up.

In the pew dataset shown in the below picture, "Don't know/refused" is a value in the religion column:
IMG_20200227_184136__01__01

After we melt with:
pew_long = pd.melt(pew, id_vars="religion")
we should not see this value in the new variable column on page 126. (There is another example of this with different arguments right afterward that also has this value in the income column):
IMG_20200227_184144__01__01

I can't say what went wrong, and I'm on mobile at the moment, but I thought I'd let you know!
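For reference, a toy melt (with made-up counts, not the real pew data) shows the expected behavior: id_vars values such as "Don't know/refused" stay in the religion column, and the new variable column only ever contains the names of the melted income columns:

```python
import pandas as pd

# Toy stand-in for the pew data (hypothetical values)
pew = pd.DataFrame({
    "religion": ["Agnostic", "Don't know/refused"],
    "<$10k": [27, 15],
    "$10-20k": [34, 14],
})

pew_long = pd.melt(pew, id_vars="religion")

# `variable` holds only the former income column names
print(sorted(pew_long["variable"].unique()))
```

If "Don't know/refused" appears in the printed variable column, the figure was likely produced from a different frame or with different melt arguments.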
