Errata and code for Effective Pandas book
If you are interested in this book, consider purchasing a copy.
Physical version available on Amazon.
incorrect spelling, should be "Olson"
In figure 3.2 caption, page 12: "a dataframe can have on or many series." should be "one or many".
I'm really enjoying working through the Effective Pandas book! It's great. However, in the PDF version, there's a typesetting error at the end of Chapter 29. At least in the PDF I have, the summary and exercises go off the end of the page. Just wanted to let you know. Thanks!
Most of my reading is done on an iPad Pro or an iPhone 11 Pro in Kindle or iBooks. PDF-formatted books don't really work for that. When will there be .mobi and .epub versions?
In Chapter 14 on plotting you wrote "To leverage it in Jupyter, make sure you include the
following cell magic to tell Jupyter to display the plots in the browser:
%matplotlib inline
" which no longer holds (for 2+ years at least), especially if you import pyplot or pandas (which Effective Pandas is all about): https://github.com/ipython/ipython/issues/12190
I just felt that a book written in 2021 should explain that to its readers.
In chapter 26 - Reshaping DataFrames with Dummies, we wanted to turn values in the "job.role" columns into a categorical series, which we would then reshape into a dummy matrix.
This is the code from the book:
job = (jb
.filter(like=r'job.role')
.where(jb.isna(), 1)
.fillna(0)
.idxmax(axis='columns')
.str.replace('job.role.', '', regex=False))
job
However, many rows have multiple jobs, and the above code only captures the first one.
I think the following code captures all jobs and converts them into a dummy matrix.
(jb
.filter(like='job.role')
.fillna('')
.apply(lambda ser: ','.join([i for i in ser if i]), axis=1)
.str.get_dummies(sep=',')
)
In the table on operator methods, the Operator entries for s.gt(s2), s.ge(s2), s.lt(s2), and s.le(s2) all simply list '2' rather than 's2'
The code at the top of p182:
(jb2
.query("team_size.isna()")
.employment_status
.value_counts(dropna=False)
)
can fail with: "TypeError: unhashable type: 'Series'"
Running Python 3.9.7 and pandas 1.3.4 with the latest Anaconda install.
Cause: numexpr is the default query engine when it is installed, which it is with Anaconda.
See (ref).
Solution: add engine='python' to the query arguments.
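A minimal reproduction of the fix, with a made-up frame standing in for jb2:

```python
import pandas as pd

df = pd.DataFrame({"team_size": [1.0, None, 3.0],
                   "employment_status": ["full", "partial", None]})

# Forcing the Python engine avoids the numexpr code path that can raise
# "TypeError: unhashable type: 'Series'" when calling .isna() inside query().
counts = (df
    .query("team_size.isna()", engine="python")
    .employment_status
    .value_counts(dropna=False)
)
```

Only the row with a missing team_size survives the query, so the value counts cover just that row.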
I know we generally want to avoid apply(), especially for any numerical operations. I just often find myself working with and parsing a variety of text (usually coming from csv, which in turn is coming from open textbox data, aka ugly).
Just wondering if Matt or anyone here knows of good resources to dig into using apply? I can try to be more specific, but as an example, today I'm trying to run some sentiment analysis over 2 columns/series in a dataframe, and trying to turn that text into scores. In this very specific case I'm using NRCLex and getting back a dict (like this: {'fear': 2, 'positive': 1, 'negative': 4, 'anticipation': 1}). I'm in turn trying to create columns from that dict, where the column name is the key and the cell value is the dict value. So for this record, column "fear" would have 2.
Anyway, not expecting a direct answer to this specific question (though that would be fine too! ha!) just more where I can look into the apply method and trying to learn how to better work with it.
Thanks!
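One option that avoids a per-cell apply() entirely is pd.json_normalize, which expands a column of dicts into one column per key. A sketch, using hypothetical data in place of the NRCLex output (the "text"/"scores" names are made up):

```python
import pandas as pd

# Hypothetical frame: one emotion->count dict per row, as NRCLex might return.
df = pd.DataFrame({
    "text": ["doc one", "doc two"],
    "scores": [{"fear": 2, "positive": 1},
               {"negative": 4, "anticipation": 1}],
})

# json_normalize builds one column per dict key, NaN where a key is absent;
# joining back on the shared index replaces the dict column with real columns.
expanded = (df
    .join(pd.json_normalize(df["scores"].tolist()).fillna(0))
    .drop(columns="scores")
)
```

Row 0 then gets fear=2 and negative=0, without any row-wise apply().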
I have been following the code in your book in jupyter notebook.
There are several places where the code leads to errors. The errors also show up in the GitHub version of the code.
For example:
In your notebook for Chapters 16-30 (page 302 in the physical book), inputs 106 and 107 lead to errors. Have they been corrected?
(jb2
.pivot_table(index='country_live', columns='employment_status',
values='age', aggfunc='mean')
)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-c65017e0f276> in <module>
1 # run code
----> 2 (jb2
3 .pivot_table(index='country_live', columns='employment_status',
4 values='age', aggfunc='mean')
5 )
~/envs/menv/lib/python3.8/site-packages/pandas/core/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
8036 from pandas.core.reshape.pivot import pivot_table
8037
-> 8038 return pivot_table(
8039 self,
8040 values=values,
~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
93 return table.__finalize__(data, method="pivot_table")
94
---> 95 table = __internal_pivot_table(
96 data,
97 values,
~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in __internal_pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
185 # agged.columns is a MultiIndex and 'v' is indexing only
186 # on its first level.
--> 187 agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)
188
189 table = agged
~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in maybe_downcast_to_dtype(result, dtype)
275 if not isinstance(dtype, np.dtype):
276 # enforce our signature annotation
--> 277 raise TypeError(dtype) # pragma: no cover
278
279 converted = maybe_downcast_numeric(result, dtype, do_round)
TypeError: Int64
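For what it's worth, one workaround (a sketch, with a made-up frame standing in for jb2) is to cast the nullable Int64 column to a NumPy float dtype before pivoting, which sidesteps the maybe_downcast_to_dtype failure on extension dtypes:

```python
import pandas as pd

df = pd.DataFrame({
    "country_live": ["US", "US", "DE"],
    "employment_status": ["full", "part", "full"],
    "age": pd.array([30, 40, 50], dtype="Int64"),  # nullable integer dtype
})

# Casting to a plain NumPy dtype before pivot_table avoids the
# TypeError raised when downcasting back to the Int64 extension dtype.
table = (df
    .assign(age=df.age.astype("float64"))
    .pivot_table(index="country_live", columns="employment_status",
                 values="age", aggfunc="mean")
)
```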
Page 193, section 23.1: "The .sort_values method will you sort the rows" -> the "you" should not be there.
The completion example city.<TAB> doesn't match the text (the DataFrame in the text is city_mpg); the .loc attribute and ?city_mpg examples also don't match the text.
"Coming from a software backward" should probably be "Coming from a software background".
"We pull off the age colum" should be "column"
41 20.818182
42 14.636364
43 30.363636
44 15.818182
45 39.772727
should be:
40 20.818182
41 14.636364
42 30.363636
43 15.818182
44 39.772727
(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
)
results in a single col with the indexes 1...54461
(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
.fillna(0)
)
ditto
Thereafter seems to be ok
s.gt(s2)
operator should be s > s2
s.ge(s2)
operator should be s >= s2
s.lt(s2)
operator should be s < s2
s.le(s2)
operator should be s <= s2
Happy to help reviewing Edition 2 before release ;-)
It looks like the output at the top of page 195 might be wrong (or I'm confused, wouldn't be the first time!).
I think it should be sorted by the last names, which is what I get when I run the code but in the PDF it is sorted by the first name.
The cell referenced in Chapter 28, page 332, shows a zero-record DataFrame in the following cell.
(jb2
.query('~country_live.isin(@countries_to_remove)')
)
The book shows an output of a non-zero records DataFrame.
Hi Matt, I'm working through your Effective Pandas book and might have found a typo at the start of the chapter Reshaping Dataframes with Dummies.
You write:
>>> (jb
... .filter(like=r'job.role.*t')
... .where(jb.isna(), 1)
... )
but with pandas 1.4.3 that doesn't work. I can leave it as .filter(like="job.role")
and get the 13 columns as intended, or I can use .filter(regex=r"job.role.*t")
and get the 8 columns that have a "t" in the job title.
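The difference can be shown on a toy frame (column names are made up to mimic the survey): like= is a plain substring test, so regex metacharacters in it are matched literally, while regex= is a real regular expression.

```python
import pandas as pd

df = pd.DataFrame(columns=["job.role.DBA", "job.role.Architect", "other"])

# like= checks whether the literal substring 'job.role.*t' appears in
# each column name -- it never does, so no columns come back.
like_cols = list(df.filter(like="job.role.*t").columns)

# regex= treats the pattern as a regular expression (dots escaped here),
# keeping only job.role columns with a 't' after the prefix.
regex_cols = list(df.filter(regex=r"job\.role\..*t").columns)
```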
The Jetbrains Python survey used in chapter 23 and subsequent chapters is very problematic. I ran into numerous problems when trying to make the jb2 DataFrame on page 233. The first problem was that the number ranges under 'company_size' (e.g. 2-10) were not interpreted correctly by Excel. The hyphen between the two numbers was changed into very strange looking three-character symbols. I had to go into the Excel file and manually change them back into hyphens using Ctrl-H. But that made new problems.
Once the hyphens were inserted, Excel regarded some of the number ranges as dates. For example, 2-10 was turned into 10-Feb. Changing the column format had no effect. After many hours of frustration, I finally discovered that adding a leading space prevented Excel from treating the range as a date.
But then Python had trouble recognizing other number range strings. I kept getting the error "ValueError: invalid literal for int() with base 10: '51-500' ", and others like it. After more frustration I found that many of the string entries in the CSV file had extra spaces, or whitespace. I tried to remove the whitespace all in one sweep using pd.read_csv(jb, delim_whitespace=True), but I only got the following error: ParserError: Error tokenizing data. C error: Expected 194 fields in line 961, saw 215
I had to use Ctrl-H to replace each number range that contained whitespace with one without whitespace. As for the ranges that Excel thought were dates, I had to modify the Python code to account for the needed whitespace.
But that still was not the end. After fixing the strings, the code would not recognize "company_size" as an attribute. It gave me the following error: "AttributeError: 'DataFrame' object has no attribute 'company_size'". Again, it took me a few hours, but I finally figured out that the attribute 'company_size' had an extra leading and trailing space, making Python unable to recognize it, since it technically did not match the code.
Bottom line: the Jetbrains survey is not ready to use out of the box, so to speak. Translating the file into CSV creates strange symbols that must be changed internally. Additionally, there is a lot of whitespace surrounding the data entries; without knowing what the whitespace is, it is impossible to make Python read it. Finally, some of the entries need whitespace so that they are not changed into dates, and the Python code must reflect the same thing.
I am still frustrated about this because it took me approximately three days to figure out what was going on.
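A sketch of a cleanup pass that might have saved some of that time (the tiny inline CSV is made up to mimic the stray whitespace described above): strip the column names and every string cell right after reading, instead of editing the file by hand.

```python
import io
import pandas as pd

# Hypothetical CSV with stray whitespace around headers and values.
raw = io.StringIO(" company_size ,age\n 2-10 ,30\n 51-500,40\n")

jb = pd.read_csv(raw, skipinitialspace=True)

# Strip whitespace from the column names, then from every string column,
# so 'company_size' and ranges like '2-10' match the code exactly.
jb.columns = jb.columns.str.strip()
jb = jb.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
```

This leaves the number-range strings intact for later parsing, without any manual Ctrl-H passes.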
Rows with NaN/0 across all columns get "DBA"
Hi!
So I'm using the "recipe style" of working on a dataframe and assigning some new columns as part of that process (which works great).
One of the last steps I'd like to do is put all the columns in a specific order. In this case, by "all" I mean some of the original columns as well as some of the newly created columns.
I understand (or at least think I understand) that since I want access to the new columns, which are in the intermediate df, I'll need to use a lambda.
Looking through Effective Pandas
(p.229) Matt does a column rename:
.rename(columns=lambda c: c.replace('.', '_')
But this is doing the same thing to all the columns so I couldn't figure out how to apply this concept to a simple reorder. If I was doing this outside of the recipe, I can simply do:
df[cols in my order] # cols include old and new columns
But using the following inside the function/recipe
[cols in my order] # cols include old and new columns
Naturally fails since the new cols don't exist here.
It's not a huge deal to simply do the ordering after the recipe function is called, just wondering if it's something I can do as part of the recipe?
Thanks!
Dan
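One way to do the reorder inside the recipe (a sketch on a toy frame; the column names are made up) is .pipe, which hands the intermediate frame, new columns included, to a lambda, so the familiar df[cols] indexing works mid-chain:

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
cols = ["c", "a", "b"]   # desired order, mixing old and new columns

ordered = (df
    .assign(c=lambda df_: df_.a + df_.b)
    # .pipe receives the intermediate frame, so the newly assigned
    # column 'c' is visible and plain list indexing reorders everything.
    .pipe(lambda df_: df_[cols])
)
```

Equivalently, `.loc[:, cols]` at the same point in the chain does the same thing.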
Hi Matt, thanks for your book, really enjoying it. I'd suggest a couple of changes in chapter 27:
At page 322, I'd change the specification of dropna for pandas.DataFrame.groupby(). Indeed, it works differently than the same parameter in pandas.DataFrame.pivot_table() or in the function pandas.crosstab(), where it applies to values (and therefore "[...] dropna=False will keep columns that have no values"). In pandas.DataFrame.groupby(), dropna instead applies to the group keys, so the description quoted in the book is no longer valid. For this reason, the DataFrame at page 305 should have 8 columns rather than 4:
jb2.groupby('country_live')[[col for col in jb2.select_dtypes('number').columns]].agg(['min', 'max'])
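The group-keys behaviour can be seen on a toy frame (names made up): with the default dropna=True the NaN key group disappears entirely, while dropna=False keeps it as its own group.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", None], "val": [1, 2, 3]})

# Default dropna=True: the row with a missing key is silently dropped.
default_sums = df.groupby("key")["val"].sum()

# dropna=False: the NaN key becomes a group of its own.
kept = df.groupby("key", dropna=False)["val"].sum()
```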
I have a df that is mostly a bunch of columns that contain numbers (dtypes are Int). The index is a datetime type, but I don't think that's important for my question. Here is my function:
def tally_emotion_scores(input_df):
pos_e = ['anticipation', 'surprise', 'joy', 'trust']
neg_e = ['fear', 'anger', 'disgust', 'sadness']
all_e = pos_e + neg_e
return(input_df
.assign(**pd.DataFrame(input_df.total_scores.to_list()).fillna(0).astype('Int64'))
.drop(columns=['total_scores'])
.assign(pos_neg_val= lambda df_: df_['positive'] - df_['negative'])
.set_index('date')
.sort_index()
)
What I'd like to do is make changes to columns based on pos_neg_val.
I can do it on the resulting df (new2_df is what's being returned from the function above). The following is what I want to do, and it works; I'm just trying to figure out how to get this into my function.
new2_df.loc[new2_df['pos_neg_val'] > 0, neg_e] = 0
new2_df.loc[new2_df['pos_neg_val'] <= 0, all_e] = 0
I thought I could use a lambda to access the intermediate df and tried several ways (trying to remember some). When I tried (on the line right after assigning pos_neg_val):
.loc[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0
I got:
SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?
I think I tried adding another assign with versions of:
.assign(
pos_neg_val= lambda df_: df_['positive'] - df_['negative'],
[[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0] # Try one
[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0 # Try two
)
Neither of which looked right, but I tried anyway.
So I'm wondering how do I access and set the values on multiple columns based on a new created value on the intermediate df?
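One pattern that might help (a sketch, with a hypothetical small frame standing in for the intermediate df): .loc assignment is a statement, so it can't sit inside a chain, but wrapping the assignments in a function and calling it via .pipe keeps the recipe style.

```python
import pandas as pd

# Hypothetical frame standing in for the intermediate df in the chain.
df = pd.DataFrame({"fear": [5, 7], "joy": [1, 9], "pos_neg_val": [2, -3]})
neg_e = ["fear"]
all_e = ["fear", "joy"]

def zero_out(df_):
    # Work on a copy so the caller's frame is untouched, then use the
    # same .loc assignments that worked outside the chain.
    df_ = df_.copy()
    df_.loc[df_["pos_neg_val"] > 0, neg_e] = 0
    df_.loc[df_["pos_neg_val"] <= 0, all_e] = 0
    return df_

# In the real recipe this would be `.pipe(zero_out)` right after the
# .assign that creates pos_neg_val.
out = df.pipe(zero_out)
```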
Thanks!
Hi Matt! First off, thank you for your amazing book! :)
I'm going through Chapter 4. I totally understand the discussion behind the nullable integer type.
Instead, I'm wondering why this sentence from the Pandas documentation on the nullable integer data type:
"Or the string alias "Int64" (note the capital "I", to differentiate from NumPy's 'int64' dtype)"
does not find confirmation in the following (songs2.dtype is np.int64 gives False):
songs2 = pd.Series(
[145, 142, 133, 19],
name='counts'
)
print(songs2.dtype is np.int64)
What am I missing and misunderstanding?
Thank you for your help!
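In case it helps: `is` tests object identity, and the Series dtype is an np.dtype instance, not the np.int64 type object itself, so identity fails even though the dtypes agree. Equality is the check you want. A small sketch:

```python
import numpy as np
import pandas as pd

songs2 = pd.Series([145, 142, 133, 19], name="counts")

# songs2.dtype is dtype('int64'), an np.dtype instance -- a different
# object from the np.int64 type, so `is` is False while `==` is True.
identity = songs2.dtype is np.int64    # False
equality = songs2.dtype == np.int64    # True
alias = songs2.dtype == "int64"        # True (lowercase NumPy alias)

# The capital-I alias names the separate nullable extension dtype.
nullable = songs2.astype("Int64")
```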
Page 75 (The iloc Attribute): the picture shows s.loc[-2:], which gives an error on output because pandas cannot do slice indexing with -2 on an Index of type int.
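A sketch of the positional alternative on a toy series: .iloc slices by position, so negative offsets count from the end regardless of the labels.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 1, 2])

# .loc slices by *label*, and -2 is not a label in this integer index;
# .iloc slices by position, so -2 simply means "second from the end".
tail = s.iloc[-2:]
```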