jakevdp / PythonDataScienceHandbook
Python Data Science Handbook: full text in Jupyter Notebooks
Home Page: http://jakevdp.github.io/PythonDataScienceHandbook
License: MIT License
Indirectly required by 05.07-Support-Vector-Machines (via scipy).
The dataset appears to have been updated?
Downloaded using: curl -o FremontBridge.csv "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"
My plot, without the blue spike:
Relevant notebook (from input cell 40).
Ch. 3, Index as ordered set.
The notebook doesn't include an example of Index difference, but from the text the reader may assume that, like Python sets, you could do indA - indB to get the difference, rather than using indA.difference(indB).
I know this is a tiny quibble, and I'm extremely grateful for the GitHub resources you've provided. The only reason I've posted this is that I've selected this textbook for two data science courses I teach, and when I work through examples I try to view them from a student's point of view. I was hesitant to post this, but since I felt I needed to add a note to my teaching references on this section, I thought I would pass it on.
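For a teaching reference, a minimal sketch of the named set methods (the index values here are made up):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Python-set-style operators don't all carry over; use the named methods:
common = indA.intersection(indB)   # elements in both
combined = indA.union(indB)        # elements in either
diff = indA.difference(indB)       # in indA but NOT indB; not indA - indB
print(list(diff))  # [1, 9]
```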
Since pandas 0.19 there is a boolean argument lines exactly for this case. Probably worth mentioning. Also, pandas can work perfectly with compressed files locally (hopefully remotely too in the next version).
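A quick sketch of what lines=True does, using an in-memory stand-in for the recipe file:

```python
import pandas as pd
from io import StringIO

# Line-delimited JSON: one object per line, as in the recipe dataset
raw = StringIO('{"name": "soup", "time": 30}\n{"name": "bread", "time": 90}\n')
df = pd.read_json(raw, lines=True)
print(df['name'].tolist())  # ['soup', 'bread']
```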
I think you forgot to add births.csv when you updated the examples in 03.09-Pivot-Tables.ipynb and 04.09-Text-and-Annotation.ipynb.
Hey Jake. I'm about to steal your max-margin example for my lecture.
It looks like there are magic numbers that show the margin of the other candidates.
Where do they come from? Eyeballing?
Also, I feel like we might put that into the sklearn example gallery?
I wasn't able to recreate a little part of this notebook's output. It seemed that dividing rainfall by 254 made the numbers too small.
May I propose rerunning it with the change below?
Line 6 in Input 1
(-) inches = rainfall / 254 # 1/10mm -> inches
(+) inches = rainfall / 25.4 # 1/10mm -> inches
This puts the numbers on a scale that shows up in the plot.
It does mean changing the number reported between cells 23 and 24. It will also affect the output of cells 23, 24, 25, and 29 once rerun, though I think it doesn't change the meaning of those examples. I'd have done this through a pull request, but notebooks on GitHub are an unfamiliar beast to me for now.
'is to use Ooe-hot encoding'
should read ' is to use One-hot encoding'
FYI there's a "the the" in the clustering notebook. Not sure if you want bugs filed for this, but thought you might want to know.
On the Vectorized String Operations page the downloaded .json
recipe book is empty.
The markdown "Multiply indexed DataFrames", should be "Multi-indexed" ?
(Thanks for uploading these notebooks!)
"# are all values in each row less than 4?
np.all(x < 8, axis=1)"
Probably should read: are all values in each row less than "8"?
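For context, a runnable version of that cell with the array from the book and the corrected comment:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# are all values in each row less than 8? (axis=1 reduces across the columns)
print(np.all(x < 8, axis=1))  # [ True False  True]
```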
A number of CSV files are missing from the repo.
IOErrorTraceback (most recent call last)
<ipython-input-11-e52a01e5b049> in <module>()
----> 1 births = pd.read_csv('births.csv')
In Chapter 3's "Pivot Tables section" and Chapter 4's "Text and Annotation" section, when computing the births by date using:
births_by_date = births.pivot_table('births', [births.index.month, births.index.day])
Should each value be multiplied by 2 since male and female births are counted on separate rows? So should it be:
births_by_date = (births.pivot_table('births', [births.index.month, births.index.day])) * 2
instead?
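A toy illustration (made-up numbers): pivot_table aggregates with 'mean' by default, so doubling the mean works only because there are exactly two rows per date; aggfunc='sum' states the intent directly:

```python
import pandas as pd

# Toy births table: one M row and one F row per day (made-up values)
births = pd.DataFrame({
    'day': [1, 1, 2, 2],
    'gender': ['M', 'F', 'M', 'F'],
    'births': [4000, 3800, 4200, 4100],
})

mean_by_day = births.pivot_table('births', index='day')                 # averages M and F
total_by_day = births.pivot_table('births', index='day', aggfunc='sum') # total per day
print(total_by_day['births'].tolist())  # [7800, 8300]
```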
There seems to be an error in the last plot (line 32) of 05.02-Introducing-Scikit-Learn.ipynb
(digits classification example). The green labels are not matching those in the figure in line 23.
The requirements.txt file has:
pandas==0.18.1
yet 03.02-Data-Indexing-and-Selection fails with:
AttributeErrorTraceback (most recent call last)
<ipython-input-5-8721e0616114> in <module>()
----> 1 list(data.items())
/opt/app-root/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2666 if (name in self._internal_names_set or name in self._metadata or
2667 name in self._accessors):
-> 2668 return object.__getattribute__(self, name)
2669 else:
2670 if name in self._info_axis:
AttributeError: 'Series' object has no attribute 'items'
So maybe it is meant to be pandas 0.19.1.
Fetching package metadata .........
Solving package specifications: ....
UnsatisfiableError: The following specifications were found to be in conflict:
Notebook 02.09-Structured-Data-NumPy is missing the numpy import.
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
In [2]:
x = np.zeros(4, dtype=int)
NameErrorTraceback (most recent call last)
<ipython-input-2-f437a7cb5a38> in <module>()
----> 1 x = np.zeros(4, dtype=int)
NameError: name 'np' is not defined
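With the import added, the cell runs; a self-contained sketch using the notebook's own structured-array example:

```python
import numpy as np  # the missing import

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Compound dtype holding all three fields in one array
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data['name'][data['age'] < 30])  # ['Alice' 'Doug']
```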
Required by 04.13-Geographic-Data-With-Basemap.
While the new image only uses 16 colors instead of 16 million, the compression factor is only 6 (24 bits -> 4 bits per pixel), no?
At least it is a bit misleading (the color space is of course compressed by a factor of a million, but that tells us nothing about the image file size).
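The arithmetic behind the factor-of-6 claim, counting uncompressed bits per pixel only:

```python
bits_per_pixel_before = 3 * 8   # 8 bits each for R, G, B
bits_per_pixel_after = 4        # 16 colors need log2(16) = 4 bits
print(bits_per_pixel_before / bits_per_pixel_after)  # 6.0
```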
A couple of things I noticed while working through your book, and which you most likely are already aware of:
In [1]: >>> def donothing(x):
...: ... return x
...:
In [2]: donothing(5)
Out[2]: 5
I understand that it may be too late to incorporate these in the print book, but perhaps you could add a note about these changes in the notebooks, and, if possible, in the electronic versions, since newcomers to Python are likely to download the latest version (i.e. 5.x) of IPython and may be confused by the example.
Your intro says you are running the notebooks in Python 3.5 with the packages in requirements.txt, which includes basemap. The problem I face is that there seems to be no Python 3.5 support for basemap.
conda info basemap
Fetching package metadata .........
Every available build of basemap 1.0.7 (defaults channel, win-64, license PSF, ~120.5 MB each, all depending on matplotlib) requires Python 2.7:
build string   numpy   python   date
np110py27_0    1.10*   2.7*     2015-10-06
np111py27_0    1.11*   2.7*     2016-04-01
np17py27_0     1.7*    2.7*     2013-12-09
np18py27_0     1.8*    2.7*     2014-01-30
np19py27_0     1.9*    2.7*     2014-09-09
In 04.13-Geographic-Data-With-Basemap you have the line:
plt.colorbar(label='temperature anomaly (°C)');
You may want to reconsider using the degrees symbol there as whether it will work may depend on unknown lang/locale settings in the environment.
UnicodeDecodeErrorTraceback (most recent call last)
<ipython-input-20-17334019f7eb> in <module>()
10
11 plt.title('January 2014 Temperature Anomaly')
---> 12 plt.colorbar(label='temperature anomaly (°C)');
13 #plt.colorbar(label='temperature anomaly (C)');
...
/opt/app-root/lib/python2.7/site-packages/matplotlib/colorbar.pyc in set_label(self, label, **kw)
455 Label the long axis of the colorbar
456 '''
--> 457 self._label = '%s' % (label, )
458 self._labelkw = kw
459 self._set_label()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 21: ordinal not in range(128)
It is actually a bit strange, as the lang/locale for the environment is UTF-8, not ASCII.
import locale
locale.getdefaultlocale()
Out[17]:
('en_US', 'UTF-8')
In [18]:
print(u'\u292e')
⤮
In [21]:
print('temperature anomaly (°C)')
temperature anomaly (°C)
Not sure why matplotlib.pyplot is using ASCII rather than the default locale.
This is still with the pinned versions of packages you had in requirements.txt, so I will try with the latest of all packages and see if that resolves anything.
Due to matplotlib/basemap#251, pip can no longer install basemap like it is specified in requirements.txt
In the chapter about handling time series with Pandas, you're analyzing bicycle traffic [nbviewer link].
In cell input 41, you compute the rolling sum, but the plot label says mean:
daily = data.resample('D').sum()
daily.rolling(30, center=True).sum().plot(style=[':', '--', '-'])
plt.ylabel('mean hourly count');
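Either the label or the statistic should change to match the other; a sketch with made-up hourly counts standing in for the bicycle data:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts in place of the Fremont Bridge data
idx = pd.date_range('2015-01-01', periods=24 * 90, freq='h')
data = pd.Series(np.random.RandomState(42).poisson(50, len(idx)), index=idx)

daily = data.resample('D').sum()
rolling_sum = daily.rolling(30, center=True).sum()    # what the code computes
rolling_mean = daily.rolling(30, center=True).mean()  # what the label claims
print(len(rolling_mean))  # 90
```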
In notebook: notebooks/03.03-Operations-in-Pandas.ipynb
"A similar type of alingment takes place" should be alignment.
There is no commented-out command to download these files, and they are not in the repository.
IOErrorTraceback (most recent call last)
<ipython-input-14-3f1b10f5cbcf> in <module>()
1 import pandas as pd
----> 2 counts = pd.read_csv('fremont_hourly.csv', index_col='Date', parse_dates=True)
3 weather = pd.read_csv('599021.csv', index_col='DATE', parse_dates=True)
Hey Jake,
I would appreciate if you could at least consider releasing the book as a .pdf, .epub or .mobi as a great tool to add to my portable/commute/trip library on the tablet or the phone. I think the pros way outweigh the cons and it would be an awesome addition to any programmer's or scientist's workshop.
Btw. Congrats on an awesome book!
I installed line_profiler using conda install instead of pip
conda install line_profiler
conda list shows that I have line_profiler 1.1
line_profiler 1.1 py35_0
Then, when I entered
%load_ext line_profiler
I received an error:
AttributeError: 'TerminalInteractiveShell' object has no attribute 'define_magic'
On MacOS(Sierra) with other Conda environments succeeding in importing numpy, PDSH environment fails, with the following error -
ImportError: dlopen(//anaconda/envs/PDSH/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-darwin.so, 2): Library not loaded: @rpath/libopenblas-r0.2.18.dylib Referenced from: //anaconda/envs/PDSH/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-darwin.so Reason: image not found
1st paragraph, 2nd sentence:
"The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including be collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else."
"be" not necessary.
The Boolean Operators section of 02.06-Boolean-Arrays-and-Masks.ipynb includes the phrase:
"the equivalence of A AND B and NOT (A OR B)".
I believe that should be: "the equivalence of A AND B and NOT (NOT A OR NOT B)". That is what is shown in the code fragment below the markdown: np.sum(~( (inches <= 0.5) | (inches >= 1) )).
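A quick numpy check of the corrected identity on synthetic rainfall values:

```python
import numpy as np

rng = np.random.RandomState(0)
inches = rng.uniform(0, 2, 365)  # synthetic stand-in for the rainfall data

A = inches > 0.5
B = inches < 1

# A AND B  ==  NOT (NOT A OR NOT B)   (De Morgan's law)
lhs = A & B
rhs = ~((inches <= 0.5) | (inches >= 1))
print(np.array_equal(lhs, rhs))  # True
```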
A number of CSV files are missing from the repo.
IOErrorTraceback (most recent call last)
<ipython-input-20-0928a7e92a5e> in <module>()
----> 1 pop = pd.read_csv('state-population.csv')
2 areas = pd.read_csv('state-areas.csv')
3 abbrevs = pd.read_csv('state-abbrevs.csv')
4
5 display('pop.head()', 'areas.head()', 'abbrevs.head()')
Hi,
I am trying to reproduce the code at page 184 of the book (Chapter 3).
The second command, to unzip the file, gives me the error:
'gunzip' is not recognized as an internal or external command,
operable program or batch file.
How can I unzip the file?
Thanks,
Marco
Required by 04.13-Geographic-Data-With-Basemap.
This is annoying because it can't be installed from PyPI using pip. You instead need to list the URL of where it is located on SourceForge.
I've followed along through the whole book so far - it has been extremely helpful for me. Ran into issue in Chapter 5: Example: Not-So-Naive Bayes. The call to grid.fit(digits.data, digits.target) gives an error as captioned below. It seems that KernelDensity is not found.
I apologize if I raised this possible issue in error.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-4d39a80964d5> in <module>()
6 bandwidths = 10 ** np.linspace(0, 2, 100)
7 grid = GridSearchCV(KDEClassifier(), {'bandwidth': bandwidths})
----> 8 grid.fit(digits.data, digits.target)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/grid_search.py in fit(self, X, y)
827
828 """
--> 829 return self._fit(X, y, ParameterGrid(self.param_grid))
830
831
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
571 self.fit_params, return_parameters=True,
572 error_score=self.error_score)
--> 573 for parameters in parameter_iterable
574 for train, test in cv)
575
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
606 return False
607 else:
--> 608 self._dispatch(tasks)
609 return True
610
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
569 dispatch_timestamp = time.time()
570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571 job = self._backend.apply_async(batch, callback=cb)
572 self._jobs.append(job)
573
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
107 def apply_async(self, func, callback=None):
108 """Schedule a func to be run"""
--> 109 result = ImmediateResult(func)
110 if callback:
111 callback(result)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
324 # Don't delay the application, to avoid keeping the input
325 # arguments in memory
--> 326 self.results = batch()
327
328 def get(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1663 estimator.fit(X_train, **fit_params)
1664 else:
-> 1665 estimator.fit(X_train, y_train, **fit_params)
1666
1667 except Exception as e:
<ipython-input-16-8d98b394e8b0> in fit(self, X, y)
21 self.models_ = [KernelDensity(bandwidth=self.bandwidth,
22 kernel=self.kernel).fit(Xi)
---> 23 for Xi in training_sets]
24 self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
25 for Xi in training_sets]
<ipython-input-16-8d98b394e8b0> in <listcomp>(.0)
21 self.models_ = [KernelDensity(bandwidth=self.bandwidth,
22 kernel=self.kernel).fit(Xi)
---> 23 for Xi in training_sets]
24 self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
25 for Xi in training_sets]
NameError: name 'KernelDensity' is not defined
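The fix is likely just the missing import; KernelDensity lives in sklearn.neighbors:

```python
from sklearn.neighbors import KernelDensity

# Once imported, the KDEClassifier's fit can build its per-class density models
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
print(type(kde).__name__)  # KernelDensity
```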
Accessing the ix indexer on a pandas Series or DataFrame now gives a DeprecationWarning.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
You're already advising to avoid ix. How about changing the existing mentions of ix to be even more discouraging (i.e. mention it's deprecated and really shouldn't be used) and dropping the one code example using ix?
I think this would be a minor change to the book: all examples would keep working with older and newer pandas versions; just the description of ix would change slightly.
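For the dropped ix example, the explicit indexers cover both of ix's behaviors; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# label-based lookup (what ix did when the key matched a label)
print(s.loc[1])   # 'a'
# position-based lookup (what ix fell back to for integer positions)
print(s.iloc[1])  # 'b'
```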
I want to try this example but couldn't find the CSV dataset. Where can I get it?
Hi Jake, awesome book, it's actually taught me a lot that I didn't know about matplotlib and numpy. I'm glad you've covered all the core technologies that we use as Python data scientists, not just machine learning.
One observation of Figure 5-32 on page 374 - it looks like the training and validation curves are the wrong way around. Is this right?
I would like to port PythonDataScienceHandbook to the Ruby programming language (at first in Japanese for the text).
At the moment the Ruby data science environment is immature and a complete port is not possible, but I would like to borrow the datasets and the text outline.
Is this attempt contrary to your PythonDataScienceHandbook license terms?
Notice that the size of points is given in pixels, the color argument is automatically mapped to a color scale (shown here by the colorbar() command), and that the size argument is given in pixels.
The phrase "given in pixels" appears twice.
Hi @jakevdp
Thank you for publishing these great notebooks.
I would like to translate your notebooks into Japanese (and publish it on GitHub).
But I am concerned that the translation is contrary to the ND (NoDerivatives).
I would like your comments on this CC-BY-NC-ND license issue.
Required by 03.12-Performance-Eval-and-Query
The requirements.txt file has:
numpy=1.11
pandas=0.18.1
scipy=0.17.1
sklearn=0.17.1
matplotlib=1.5.1
jupyter
notebook
line_profiler
memory_profiler
which will be rejected by pip:
Invalid requirement: 'numpy=1.11'
= is not a valid operator. Did you mean == ?
Need to change = to == in each instance where a version is pinned.
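With the operator fixed, the pinned block would read (same versions as the original file):

```
numpy==1.11
pandas==0.18.1
scipy==0.17.1
sklearn==0.17.1
matplotlib==1.5.1
jupyter
notebook
line_profiler
memory_profiler
```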
Needed by 05.14-Image-Features.
Some of the scikit-learn modules/classes trigger deprecation warnings: the cross_validation submodule will be moved to model_selection, and GaussianMixture will replace GMM.
Used in 05.08-Random-Forests, the code for the helpers_05_08 module needs to be included in the repo and either placed in the root directory (so it is found when the Jupyter notebook server is started out of that directory) or made part of a Python package in a subdirectory, which is then referenced by the requirements.txt file so that it will be installed by pip -r requirements.txt and made available when people run the notebooks. Preferably people use a virtual environment, given this module is specific to your examples.
The specifications were found to be in conflict:
In 03.11-Working-with-Time-Series:
ImportErrorTraceback (most recent call last)
<ipython-input-25-6ad953b03311> in <module>()
----> 1 from pandas_datareader import data
2
3 goog = data.DataReader('GOOG', start='2004', end='2016',
4 data_source='google')
5 goog.head()
ImportError: No module named pandas_datareader