Errata and code for Effective Pandas book
If you are interested in this book, consider purchasing a copy.
Physical version available on Amazon.
incorrect spelling, should be "Olson"
In figure 3.2 caption, page 12: "a dataframe can have on or many series." should be "one or many".
I'm really enjoying working through the Effective Pandas book! It's great. However, in the PDF version, there's a typesetting error at the end of Chapter 29. At least in the PDF I have, the summary and exercises go off the end of the page. Just wanted to let you know. Thanks!
Most of my reading is done on an iPad Pro or an iPhone 11 Pro in Kindle or iBooks. PDF-formatted books don't really work for that. When will there be .mobi and .epub versions?
In Chapter 14 on plotting you wrote "To leverage it in Jupyter, make sure you include the
following cell magic to tell Jupyter to display the plots in the browser:
%matplotlib inline
" which no longer holds (for 2+ years at least), especially if you import pyplot or pandas (which Effective Pandas is all about): https://github.com/ipython/ipython/issues/12190
I just felt that a book written in 2021 should explain that to its readers.
In chapter 26 - Reshaping DataFrames with Dummies, we wanted to turn values in the "job.role" columns into a categorical series, which we would then reshape into a dummy matrix.
This is the code from the book:
job = (jb
.filter(like=r'job.role')
.where(jb.isna(), 1)
.fillna(0)
.idxmax(axis='columns')
.str.replace('job.role.', '', regex=False))
job
However, many rows have multiple jobs, and the above code only captures the first one.
I think the following code captures all jobs and converts them into a dummy matrix.
(jb
.filter(like='job.role')
.fillna('')
.apply(lambda ser: ','.join([i for i in ser if i]), axis=1)
.str.get_dummies(sep=',')
)
In the table on operator methods, the Operator entries for s.gt(s2), s.ge(s2), s.lt(s2), and s.le(s2) all simply list '2' rather than 's2'
The code at the top of p182:
(jb2
.query("team_size.isna()")
.employment_status
.value_counts(dropna=False)
)
can fail with: "TypeError: unhashable type: 'Series'"
Running Python 3.9.7 and pandas 1.3.4 with the latest Anaconda install.
Cause: numexpr is the default query engine when it is installed, which it is with Anaconda.
See (ref).
Solution: add engine='python' to the query arguments.
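A minimal reproduction of the fix, with a made-up frame standing in for jb2:

```python
import pandas as pd

df = pd.DataFrame({"team_size": [1.0, None, 3.0],
                   "employment_status": ["full", "partial", None]})

# Forcing the Python engine avoids the numexpr code path that can raise
# "TypeError: unhashable type: 'Series'" when calling .isna() inside query().
counts = (df
    .query("team_size.isna()", engine="python")
    .employment_status
    .value_counts(dropna=False)
)
```

Only the row with a missing team_size survives the query, so the value counts cover just that row.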
I know we generally want to avoid apply(), especially for any numerical operations. I just often find myself working with and parsing a variety of text (usually coming from csv, which in turn is coming from open textbox data, aka ugly).
Just wondering if Matt or anyone here knows of good resources to dig into using apply? I can try to be more specific, but as an example, today I'm trying to run some sentiment analysis over 2 columns/series in a dataframe, and trying to turn that text into scores. In this very specific case I'm using NRCLex and getting back a dict (like this: {'fear': 2, 'positive': 1, 'negative': 4, 'anticipation': 1}). I'm in turn trying to create columns from that dict, where the column name is the key and the cell value is the dict value. So for this record, column "fear" would have 2.
Anyway, not expecting a direct answer to this specific question (though that would be fine too! ha!) just more where I can look into the apply method and trying to learn how to better work with it.
Thanks!
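One option that avoids a per-cell apply() entirely is pd.json_normalize, which expands a column of dicts into one column per key. A sketch, using hypothetical data in place of the NRCLex output (the "text"/"scores" names are made up):

```python
import pandas as pd

# Hypothetical frame: one emotion->count dict per row, as NRCLex might return.
df = pd.DataFrame({
    "text": ["doc one", "doc two"],
    "scores": [{"fear": 2, "positive": 1},
               {"negative": 4, "anticipation": 1}],
})

# json_normalize builds one column per dict key, NaN where a key is absent;
# joining back on the shared index replaces the dict column with real columns.
expanded = (df
    .join(pd.json_normalize(df["scores"].tolist()).fillna(0))
    .drop(columns="scores")
)
```

Row 0 then gets fear=2 and negative=0, without any row-wise apply().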
I have been following the code in your book in jupyter notebook.
There are several places where the code leads to errors. The errors also show up in the GitHub version of the code.
For example:
In your notebook for Chapters 16-30 (page 302 in the physical book), inputs 106 and 107 lead to errors. Have they been corrected?
(jb2
.pivot_table(index='country_live', columns='employment_status',
values='age', aggfunc='mean')
)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-c65017e0f276> in <module>
1 # run code
----> 2 (jb2
3 .pivot_table(index='country_live', columns='employment_status',
4 values='age', aggfunc='mean')
5 )
~/envs/menv/lib/python3.8/site-packages/pandas/core/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
8036 from pandas.core.reshape.pivot import pivot_table
8037
-> 8038 return pivot_table(
8039 self,
8040 values=values,
~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
93 return table.__finalize__(data, method="pivot_table")
94
---> 95 table = __internal_pivot_table(
96 data,
97 values,
~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in __internal_pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
185 # agged.columns is a MultiIndex and 'v' is indexing only
186 # on its first level.
--> 187 agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)
188
189 table = agged
~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in maybe_downcast_to_dtype(result, dtype)
275 if not isinstance(dtype, np.dtype):
276 # enforce our signature annotation
--> 277 raise TypeError(dtype) # pragma: no cover
278
279 converted = maybe_downcast_numeric(result, dtype, do_round)
TypeError: Int64
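For what it's worth, one workaround (a sketch, with a made-up frame standing in for jb2) is to cast the nullable Int64 column to a NumPy float dtype before pivoting, which sidesteps the maybe_downcast_to_dtype failure on extension dtypes:

```python
import pandas as pd

df = pd.DataFrame({
    "country_live": ["US", "US", "DE"],
    "employment_status": ["full", "part", "full"],
    "age": pd.array([30, 40, 50], dtype="Int64"),  # nullable integer dtype
})

# Casting to a plain NumPy dtype before pivot_table avoids the
# TypeError raised when downcasting back to the Int64 extension dtype.
table = (df
    .assign(age=df.age.astype("float64"))
    .pivot_table(index="country_live", columns="employment_status",
                 values="age", aggfunc="mean")
)
```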
Page 193, section 23.1: "The .sort_values method will you sort the rows" -> the "you" should not be there.
The completion example city.<TAB> doesn't match the text (the DataFrame in the text is city_mpg); the .loc attribute and ?city_mpg examples also don't match the text.
"Coming from a software backward" should probably be "Coming from a software background".
"We pull off the age colum" should be "column"
41 20.818182
42 14.636364
43 30.363636
44 15.818182
45 39.772727
should be:
40 20.818182
41 14.636364
42 30.363636
43 15.818182
44 39.772727
(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
)
results in a single col with the indexes 1...54461
(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
.fillna(0)
)
ditto
Thereafter seems to be ok
s.gt(s2)
operator should be s > s2
s.ge(s2)
operator should be s >= s2
s.lt(s2)
operator should be s < s2
s.le(s2)
operator should be s <= s2
Happy to help reviewing Edition 2 before release ;-)
It looks like the output at the top of page 195 might be wrong (or I'm confused, wouldn't be the first time!).
I think it should be sorted by the last names, which is what I get when I run the code but in the PDF it is sorted by the first name.
The cell referenced in Chapter 28, page 332, shows a zero-record DataFrame in the following cell.
(jb2
.query('~country_live.isin(@countries_to_remove)')
)
The book shows an output of a non-zero records DataFrame.
Hi Matt, I'm working through your Effective Pandas book and might have found a typo at the start of the chapter Reshaping Dataframes with Dummies.
You write:
>>> (jb
... .filter(like=r'job.role.*t')
... .where(jb.isna(), 1)
... )
but with pandas 1.4.3 that doesn't work. I can leave it as .filter(like="job.role")
and get the 13 columns as intended, or I can use .filter(regex=r"job.role.*t")
and get the 8 columns that have a "t" in the job title.
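The difference can be shown on a toy frame (column names are made up to mimic the survey): like= is a plain substring test, so regex metacharacters in it are matched literally, while regex= is a real regular expression.

```python
import pandas as pd

df = pd.DataFrame(columns=["job.role.DBA", "job.role.Architect", "other"])

# like= checks whether the literal substring 'job.role.*t' appears in
# each column name -- it never does, so no columns come back.
like_cols = list(df.filter(like="job.role.*t").columns)

# regex= treats the pattern as a regular expression (dots escaped here),
# keeping only job.role columns with a 't' after the prefix.
regex_cols = list(df.filter(regex=r"job\.role\..*t").columns)
```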
The Jetbrains Python survey used in chapter 23 and subsequent chapters is very problematic. I ran into numerous problems when trying to make the jb2 DataFrame on page 233. The first problem was that the number ranges under 'company_size' (e.g. 2-10) were not interpreted correctly by Excel. The hyphen between the two numbers was changed into very strange looking three-character symbols. I had to go into the Excel file and manually change them back into hyphens using Ctrl-H. But that made new problems.
Once the hyphens were inserted, Excel regarded some of the number ranges as dates. For example, 2-10 was turned into 10-Feb. Changing the column format had no effect. After many hours of frustration, I finally discovered that adding a leading space prevented Excel from treating the range as a date.
But then Python had trouble recognizing other number range strings. I kept getting the error "ValueError: invalid literal for int() with base 10: '51-500' ", and others like it. After more frustration I found that many of the string entries in the CSV file had extra spaces, or whitespace. I tried to remove the whitespace all in one sweep using pd.read_csv(jb, delim_whitespace=True), but I only got the following error: ParserError: Error tokenizing data. C error: Expected 194 fields in line 961, saw 215
I had to use Ctrl-H to replace each number range that contained whitespace with one without whitespace. As for the ranges that Excel thought were dates, I had to modify the Python code to account for the needed whitespace.
But that still was not the end. After fixing the strings, the code would not recognize "company_size" as an attribute. It gave me the following error: "AttributeError: 'DataFrame' object has no attribute 'company_size'". Again, it took me a few hours, but I finally figured out that the attribute 'company_size' had an extra leading and trailing space, making Python unable to recognize it, since it technically did not match the code.
Bottom line: the Jetbrains survey is not ready to use out of the box, so to speak. Translating the file into CSV creates strange symbols that must be changed internally. Additionally, there is a lot of whitespace surrounding the data entries; without knowing what the whitespace is, it is impossible to make Python read it. Finally, some of the entries need whitespace so that they are not changed into dates, and the Python code must reflect the same thing.
I am still frustrated about this because it took me approximately three days to figure out what was going on.
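A sketch of a cleanup pass that might have saved some of that time (the tiny inline CSV is made up to mimic the stray whitespace described above): strip the column names and every string cell right after reading, instead of editing the file by hand.

```python
import io
import pandas as pd

# Hypothetical CSV with stray whitespace around headers and values.
raw = io.StringIO(" company_size ,age\n 2-10 ,30\n 51-500,40\n")

jb = pd.read_csv(raw, skipinitialspace=True)

# Strip whitespace from the column names, then from every string column,
# so 'company_size' and ranges like '2-10' match the code exactly.
jb.columns = jb.columns.str.strip()
jb = jb.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
```

This leaves the number-range strings intact for later parsing, without any manual Ctrl-H passes.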
Rows with NaN/0 across all columns get "DBA"
Hi!
So I'm using the "recipe style" of working on a dataframe and assigning some new columns as part of that process (which works great).
One of the last steps I'd like to do is put all the columns in a specific order. In this case, by "all" I mean some of the original columns as well as some of the newly created columns.
I understand (or at least think I understand) that since I want access to the new columns, which are in the intermediate df, I'll need to use a lambda.
Looking through Effective Pandas
(p.229) Matt does a column rename:
.rename(columns=lambda c: c.replace('.', '_')
But this is doing the same thing to all the columns so I couldn't figure out how to apply this concept to a simple reorder. If I was doing this outside of the recipe, I can simply do:
df[cols in my order] # cols include old and new columns
But using the following inside the function/recipe
[cols in my order] # cols include old and new columns
Naturally fails since the new cols don't exist here.
It's not a huge deal to simply do the ordering after the recipe function is called, just wondering if it's something I can do as part of the recipe?
Thanks!
Dan
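One way to do the reorder inside the recipe (a sketch on a toy frame; the column names are made up) is .pipe, which hands the intermediate frame, new columns included, to a lambda, so the familiar df[cols] indexing works mid-chain:

```python
import pandas as pd

df = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
cols = ["c", "a", "b"]   # desired order, mixing old and new columns

ordered = (df
    .assign(c=lambda df_: df_.a + df_.b)
    # .pipe receives the intermediate frame, so the newly assigned
    # column 'c' is visible and plain list indexing reorders everything.
    .pipe(lambda df_: df_[cols])
)
```

Equivalently, `.loc[:, cols]` at the same point in the chain does the same thing.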
Hi Matt, thanks for your book, really enjoying it. I'd suggest a couple of changes in chapter 27:
At page 322, I'd change the specification of dropna for pandas.DataFrame.groupby(). Indeed, it works differently than the same parameter in pandas.DataFrame.pivot_table() or in the function pandas.crosstab(), where it applies to values (and therefore "[...] dropna=False will keep columns that have no values"). In pandas.DataFrame.groupby(), dropna instead applies to the group keys, so the description quoted in the book is no longer valid. For this reason, the DataFrame at page 305 should have 8 columns rather than 4:
jb2.groupby('country_live')[[col for col in jb2.select_dtypes('number').columns]].agg(['min', 'max'])
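The group-keys behaviour can be seen on a toy frame (names made up): with the default dropna=True the NaN key group disappears entirely, while dropna=False keeps it as its own group.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", None], "val": [1, 2, 3]})

# Default dropna=True: the row with a missing key is silently dropped.
default_sums = df.groupby("key")["val"].sum()

# dropna=False: the NaN key becomes a group of its own.
kept = df.groupby("key", dropna=False)["val"].sum()
```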
I have a df that is mostly a bunch of columns that contain numbers (dtypes are Int). The index is a datetime type, but I don't think that's important for my question. Here is my function:
def tally_emotion_scores(input_df):
pos_e = ['anticipation', 'surprise', 'joy', 'trust']
neg_e = ['fear', 'anger', 'disgust', 'sadness']
all_e = pos_e + neg_e
return(input_df
.assign(**pd.DataFrame(input_df.total_scores.to_list()).fillna(0).astype('Int64'))
.drop(columns=['total_scores'])
.assign(pos_neg_val= lambda df_: df_['positive'] - df_['negative'])
.set_index('date')
.sort_index()
)
What I'd like to do is make changes to columns based on pos_neg_val.
I can do it on the resulting df (new2_df is what's being returned from the function above). The following is what I want to do, and it works; I'm just trying to figure out how to get this into my function.
new2_df.loc[new2_df['pos_neg_val'] > 0, neg_e] = 0
new2_df.loc[new2_df['pos_neg_val'] <= 0, all_e] = 0
I thought I could use a lambda to access the intermediate df and tried several ways (trying to remember some). When I tried (on the line right after assigning pos_neg_val):
.loc[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0
I got:
SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?
I think I tried adding another assign with versions of:
.assign(
pos_neg_val= lambda df_: df_['positive'] - df_['negative'],
[[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0] # Try one
[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0 # Try two
)
Neither of which looked right, but I tried anyway.
So I'm wondering how do I access and set the values on multiple columns based on a new created value on the intermediate df?
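One pattern that might help (a sketch, with a hypothetical small frame standing in for the intermediate df): .loc assignment is a statement, so it can't sit inside a chain, but wrapping the assignments in a function and calling it via .pipe keeps the recipe style.

```python
import pandas as pd

# Hypothetical frame standing in for the intermediate df in the chain.
df = pd.DataFrame({"fear": [5, 7], "joy": [1, 9], "pos_neg_val": [2, -3]})
neg_e = ["fear"]
all_e = ["fear", "joy"]

def zero_out(df_):
    # Work on a copy so the caller's frame is untouched, then use the
    # same .loc assignments that worked outside the chain.
    df_ = df_.copy()
    df_.loc[df_["pos_neg_val"] > 0, neg_e] = 0
    df_.loc[df_["pos_neg_val"] <= 0, all_e] = 0
    return df_

# In the real recipe this would be `.pipe(zero_out)` right after the
# .assign that creates pos_neg_val.
out = df.pipe(zero_out)
```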
Thanks!
Hi Matt! First off, thank you for your amazing book! :)
I'm going through Chapter 4. I totally understand the discussion behind the nullable integer type.
Instead, I'm wondering why this sentence from the Pandas documentation on the nullable integer data type:
"Or the string alias "Int64" (note the capital "I", to differentiate from NumPy's 'int64' dtype)"
does not find confirmation in the following (songs2.dtype is np.int64 gives False):
songs2 = pd.Series(
[145, 142, 133, 19],
name='counts'
)
print(songs2.dtype is np.int64)
What am I missing and misunderstanding?
Thank you for your help!
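In case it helps: `is` tests object identity, and the Series dtype is an np.dtype instance, not the np.int64 type object itself, so identity fails even though the dtypes agree. Equality is the check you want. A small sketch:

```python
import numpy as np
import pandas as pd

songs2 = pd.Series([145, 142, 133, 19], name="counts")

# songs2.dtype is dtype('int64'), an np.dtype instance -- a different
# object from the np.int64 type, so `is` is False while `==` is True.
identity = songs2.dtype is np.int64    # False
equality = songs2.dtype == np.int64    # True
alias = songs2.dtype == "int64"        # True (lowercase NumPy alias)

# The capital-I alias names the separate nullable extension dtype.
nullable = songs2.astype("Int64")
```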
Page 75 (The iloc Attribute): the picture shows s.loc[-2:], which gives an error on output because pandas cannot do slice indexing with -2 on an Index of type int.
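A sketch of the positional alternative on a toy series: .iloc slices by position, so negative offsets count from the end regardless of the labels.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 1, 2])

# .loc slices by *label*, and -2 is not a label in this integer index;
# .iloc slices by position, so -2 simply means "second from the end".
tail = s.iloc[-2:]
```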