jakevdp / PythonDataScienceHandbook
Python Data Science Handbook: full text in Jupyter Notebooks
Home Page: http://jakevdp.github.io/PythonDataScienceHandbook
License: MIT License
Indirectly required by 05.07-Support-Vector-Machines (via scipy).
The dataset appears to have been updated?
Downloaded using: curl -o FremontBridge.csv "https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"
My plot, without the blue spike:
Relevant notebook (from input cell 40).
Ch. 3, Index as ordered set.
The notebook doesn't include an example of Index difference, but from the text the reader may assume that, like Python sets, you could do indA - indB to get the difference, rather than using indA.difference(indB).
I know this is a tiny quibble, and I'm extremely grateful for the GitHub resources you've provided. The only reason I've posted this is that I've selected this textbook for two data science courses I teach, and when I work through examples I try to view them from a student's point of view. I was hesitant to post this, but since I felt I needed to add a note to my teaching references on this section, I thought I would pass it on.
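For a teaching reference, a minimal sketch of the named set methods (the index values here are made up):

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Python-set-style operators don't all carry over; use the named methods:
common = indA.intersection(indB)   # elements in both
combined = indA.union(indB)        # elements in either
diff = indA.difference(indB)       # in indA but NOT indB; not indA - indB
print(list(diff))  # [1, 9]
```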
Since pandas 0.19 there is a boolean argument lines exactly for this case. Probably worth mentioning. Also, pandas can work perfectly with compressed files locally (hopefully remotely too in the next version).
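A quick sketch of what lines=True does, using an in-memory stand-in for the recipe file:

```python
import pandas as pd
from io import StringIO

# Line-delimited JSON: one object per line, as in the recipe dataset
raw = StringIO('{"name": "soup", "time": 30}\n{"name": "bread", "time": 90}\n')
df = pd.read_json(raw, lines=True)
print(df['name'].tolist())  # ['soup', 'bread']
```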
I think you forgot to add births.csv when you updated the examples in 03.09-Pivot-Tables.ipynb and 04.09-Text-and-Annotation.ipynb.
Hey Jake. I'm about to steal your max-margin example for my lecture.
It looks like there are magic numbers that show the margin of the other candidates.
Where do they come from? Eyeballing?
Also, I feel like we might put that into the sklearn example gallery?
I wasn't able to recreate a little part of this notebook's output. It seemed that dividing rainfall by 254 made the numbers too small.
May I propose rerunning it with the change below?
Line 6 in Input 1
(-) inches = rainfall / 254 # 1/10mm -> inches
(+) inches = rainfall / 25.4 # 1/10mm -> inches
This puts the numbers on a scale that shows up in the plot.
It does mean changing the number reported between cells 23 and 24. It will also affect the output of cells 23, 24, 25, and 29 once rerun, though I think it doesn't change the meaning of those examples. I'd have done this through a pull request, but notebooks on GitHub are an unfamiliar beast to me for now.
'is to use Ooe-hot encoding'
should read ' is to use One-hot encoding'
FYI there's a "the the" in the clustering notebook. Not sure if you want bugs filed for this, but thought you might want to know.
On the Vectorized String Operations page the downloaded .json
recipe book is empty.
The markdown "Multiply indexed DataFrames", should be "Multi-indexed" ?
(Thanks for uploading these notebooks!)
"# are all values in each row less than 4?
np.all(x < 8, axis=1)"
Probably should read: are all values in each row less than "8"?
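For context, a runnable version of that cell with the array from the book and the corrected comment:

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# are all values in each row less than 8? (axis=1 reduces across the columns)
print(np.all(x < 8, axis=1))  # [ True False  True]
```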
A number of CSV files are missing from the repo.
IOErrorTraceback (most recent call last)
<ipython-input-11-e52a01e5b049> in <module>()
----> 1 births = pd.read_csv('births.csv')
In Chapter 3's "Pivot Tables section" and Chapter 4's "Text and Annotation" section, when computing the births by date using:
births_by_date = births.pivot_table('births', [births.index.month, births.index.day])
Should each value be multiplied by 2 since male and female births are counted on separate rows? So should it be:
births_by_date = (births.pivot_table('births', [births.index.month, births.index.day])) * 2
instead?
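A toy illustration (made-up numbers): pivot_table aggregates with 'mean' by default, so doubling the mean works only because there are exactly two rows per date; aggfunc='sum' states the intent directly:

```python
import pandas as pd

# Toy births table: one M row and one F row per day (made-up values)
births = pd.DataFrame({
    'day': [1, 1, 2, 2],
    'gender': ['M', 'F', 'M', 'F'],
    'births': [4000, 3800, 4200, 4100],
})

mean_by_day = births.pivot_table('births', index='day')                 # averages M and F
total_by_day = births.pivot_table('births', index='day', aggfunc='sum') # total per day
print(total_by_day['births'].tolist())  # [7800, 8300]
```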
There seems to be an error in the last plot (line 32) of 05.02-Introducing-Scikit-Learn.ipynb
(digits classification example). The green labels are not matching those in the figure in line 23.
The requirements.txt file has:
pandas==0.18.1
yet 03.02-Data-Indexing-and-Selection fails with:
AttributeErrorTraceback (most recent call last)
<ipython-input-5-8721e0616114> in <module>()
----> 1 list(data.items())
/opt/app-root/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2666 if (name in self._internal_names_set or name in self._metadata or
2667 name in self._accessors):
-> 2668 return object.__getattribute__(self, name)
2669 else:
2670 if name in self._info_axis:
AttributeError: 'Series' object has no attribute 'items'
So maybe it is meant to be pandas 0.19.1.
Fetching package metadata .........
Solving package specifications: ....
UnsatisfiableError: The following specifications were found to be in conflict:
Notebook 02.09-Structured-Data-NumPy is missing the numpy import.
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
In [2]:
x = np.zeros(4, dtype=int)
NameErrorTraceback (most recent call last)
<ipython-input-2-f437a7cb5a38> in <module>()
----> 1 x = np.zeros(4, dtype=int)
NameError: name 'np' is not defined
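With the import added, the cell runs; a self-contained sketch using the notebook's own structured-array example:

```python
import numpy as np  # the missing import

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Compound dtype holding all three fields in one array
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data['name'][data['age'] < 30])  # ['Alice' 'Doug']
```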
Required by 04.13-Geographic-Data-With-Basemap.
While the new image only uses 16 colors instead of 16 million, the compression factor is only 6 (24 bits -> 4 bits per pixel), no?
At least it is a bit misleading (the color space is of course compressed by a factor of a million, but that tells us nothing about the image file size).
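The arithmetic behind the factor-of-6 claim, counting uncompressed bits per pixel only:

```python
bits_per_pixel_before = 3 * 8   # 8 bits each for R, G, B
bits_per_pixel_after = 4        # 16 colors need log2(16) = 4 bits
print(bits_per_pixel_before / bits_per_pixel_after)  # 6.0
```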
A couple of things I noticed while working through your book, and which you most likely are already aware of:
In [1]: >>> def donothing(x):
...: ... return x
...:
In [2]: donothing(5)
Out[2]: 5
I understand that it may be too late to incorporate these in the print book, but perhaps you could add a note about these changes in the notebooks, and, if possible, in the electronic versions, since newcomers to Python are likely to download the latest version (i.e. 5.x) of IPython and may be confused by the example.
Your intro says you are running the notebooks in Python 3.5 with the packages in requirements.txt, which includes basemap. The problem I face is that there seems to be no Python 3.5 support for basemap.
conda info basemap
Fetching package metadata .........
Every available build of basemap 1.0.7 (defaults channel, win-64, license PSF, ~120.5 MB each, all depending on matplotlib) requires Python 2.7:
build string   numpy   python   date
np110py27_0    1.10*   2.7*     2015-10-06
np111py27_0    1.11*   2.7*     2016-04-01
np17py27_0     1.7*    2.7*     2013-12-09
np18py27_0     1.8*    2.7*     2014-01-30
np19py27_0     1.9*    2.7*     2014-09-09
In 04.13-Geographic-Data-With-Basemap you have the line:
plt.colorbar(label='temperature anomaly (°C)');
You may want to reconsider using the degrees symbol there as whether it will work may depend on unknown lang/locale settings in the environment.
UnicodeDecodeErrorTraceback (most recent call last)
<ipython-input-20-17334019f7eb> in <module>()
10
11 plt.title('January 2014 Temperature Anomaly')
---> 12 plt.colorbar(label='temperature anomaly (°C)');
13 #plt.colorbar(label='temperature anomaly (C)');
...
/opt/app-root/lib/python2.7/site-packages/matplotlib/colorbar.pyc in set_label(self, label, **kw)
455 Label the long axis of the colorbar
456 '''
--> 457 self._label = '%s' % (label, )
458 self._labelkw = kw
459 self._set_label()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 21: ordinal not in range(128)
It is actually a bit strange, as the lang/locale for the environment is UTF-8, not ASCII.
import locale
locale.getdefaultlocale()
Out[17]:
('en_US', 'UTF-8')
In [18]:
print(u'\u292e')
⤮
In [21]:
print('temperature anomaly (°C)')
temperature anomaly (°C)
Not sure why matplotlib.pyplot is using ASCII rather than the default locale.
This is still with the pinned versions of packages you had in requirements.txt, so I will try with the latest of all packages and see if that resolves anything.
Due to matplotlib/basemap#251, pip can no longer install basemap like it is specified in requirements.txt
In the chapter about handling time series with Pandas, you're analyzing bicycle traffic [nbviewer link].
In cell input 41, you compute the rolling sum, but the plot label says mean:
daily = data.resample('D').sum()
daily.rolling(30, center=True).sum().plot(style=[':', '--', '-'])
plt.ylabel('mean hourly count');
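Either the label or the statistic should change to match the other; a sketch with made-up hourly counts standing in for the bicycle data:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly counts in place of the Fremont Bridge data
idx = pd.date_range('2015-01-01', periods=24 * 90, freq='h')
data = pd.Series(np.random.RandomState(42).poisson(50, len(idx)), index=idx)

daily = data.resample('D').sum()
rolling_sum = daily.rolling(30, center=True).sum()    # what the code computes
rolling_mean = daily.rolling(30, center=True).mean()  # what the label claims
print(len(rolling_mean))  # 90
```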
In notebook: notebooks/03.03-Operations-in-Pandas.ipynb
"A similar type of alingment takes place" should be alignment.
There is no commented-out command to download these files, and they are not in the repository.
IOErrorTraceback (most recent call last)
<ipython-input-14-3f1b10f5cbcf> in <module>()
1 import pandas as pd
----> 2 counts = pd.read_csv('fremont_hourly.csv', index_col='Date', parse_dates=True)
3 weather = pd.read_csv('599021.csv', index_col='DATE', parse_dates=True)
Hey Jake,
I would appreciate if you could at least consider releasing the book as a .pdf, .epub or .mobi as a great tool to add to my portable/commute/trip library on the tablet or the phone. I think the pros way outweigh the cons and it would be an awesome addition to any programmer's or scientist's workshop.
Btw. Congrats on an awesome book!
I installed line_profiler using conda install instead of pip
conda install line_profiler
conda list shows that I have line_profiler 1.1
line_profiler 1.1 py35_0
Then, when I entered
%load_ext line_profiler
I received an error:
AttributeError: 'TerminalInteractiveShell' object has no attribute 'define_magic'
On MacOS(Sierra) with other Conda environments succeeding in importing numpy, PDSH environment fails, with the following error -
ImportError: dlopen(//anaconda/envs/PDSH/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-darwin.so, 2): Library not loaded: @rpath/libopenblas-r0.2.18.dylib Referenced from: //anaconda/envs/PDSH/lib/python3.5/site-packages/numpy/core/multiarray.cpython-35m-darwin.so Reason: image not found
1st paragraph, 2nd sentence:
"The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including be collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else."
"be" not necessary.
The Boolean Operators section of 02.06-Boolean-Arrays-and-Masks.ipynb includes the phrase:
"the equivalence of A AND B and NOT (A OR B)".
I believe that should be: "the equivalence of A AND B and NOT (NOT A OR NOT B)". That is what is shown in the code fragment below the markdown: np.sum(~( (inches <= 0.5) | (inches >= 1) )).
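A quick numpy check of the corrected identity on synthetic rainfall values:

```python
import numpy as np

rng = np.random.RandomState(0)
inches = rng.uniform(0, 2, 365)  # synthetic stand-in for the rainfall data

A = inches > 0.5
B = inches < 1

# A AND B  ==  NOT (NOT A OR NOT B)   (De Morgan's law)
lhs = A & B
rhs = ~((inches <= 0.5) | (inches >= 1))
print(np.array_equal(lhs, rhs))  # True
```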
A number of CSV files are missing from the repo.
IOErrorTraceback (most recent call last)
<ipython-input-20-0928a7e92a5e> in <module>()
----> 1 pop = pd.read_csv('state-population.csv')
2 areas = pd.read_csv('state-areas.csv')
3 abbrevs = pd.read_csv('state-abbrevs.csv')
4
5 display('pop.head()', 'areas.head()', 'abbrevs.head()')
Hi,
I am trying to reproduce the code at page 184 of the book (Chapter 3).
The second command, to unzip the file, gives me the error:
'gunzip' is not recognized as an internal or external command,
operable program or batch file.
How can I unzip the file?
Thanks,
Marco
Required by 04.13-Geographic-Data-With-Basemap.
This is annoying because it can't be installed from PyPI using pip. You instead need to list the URL of where it is located on SourceForge.
I've followed along through the whole book so far - it has been extremely helpful for me. Ran into issue in Chapter 5: Example: Not-So-Naive Bayes. The call to grid.fit(digits.data, digits.target) gives an error as captioned below. It seems that KernelDensity is not found.
I apologize if I raised this possible issue in error.
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-20-4d39a80964d5> in <module>()
6 bandwidths = 10 ** np.linspace(0, 2, 100)
7 grid = GridSearchCV(KDEClassifier(), {'bandwidth': bandwidths})
----> 8 grid.fit(digits.data, digits.target)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/grid_search.py in fit(self, X, y)
827
828 """
--> 829 return self._fit(X, y, ParameterGrid(self.param_grid))
830
831
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/grid_search.py in _fit(self, X, y, parameter_iterable)
571 self.fit_params, return_parameters=True,
572 error_score=self.error_score)
--> 573 for parameters in parameter_iterable
574 for train, test in cv)
575
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
756 # was dispatched. In particular this covers the edge
757 # case of Parallel used with an exhausted iterator.
--> 758 while self.dispatch_one_batch(iterator):
759 self._iterating = True
760 else:
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
606 return False
607 else:
--> 608 self._dispatch(tasks)
609 return True
610
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
569 dispatch_timestamp = time.time()
570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571 job = self._backend.apply_async(batch, callback=cb)
572 self._jobs.append(job)
573
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
107 def apply_async(self, func, callback=None):
108 """Schedule a func to be run"""
--> 109 result = ImmediateResult(func)
110 if callback:
111 callback(result)
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
324 # Don't delay the application, to avoid keeping the input
325 # arguments in memory
--> 326 self.results = batch()
327
328 def get(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
129
130 def __call__(self):
--> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items]
132
133 def __len__(self):
/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/sklearn/cross_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, error_score)
1663 estimator.fit(X_train, **fit_params)
1664 else:
-> 1665 estimator.fit(X_train, y_train, **fit_params)
1666
1667 except Exception as e:
<ipython-input-16-8d98b394e8b0> in fit(self, X, y)
21 self.models_ = [KernelDensity(bandwidth=self.bandwidth,
22 kernel=self.kernel).fit(Xi)
---> 23 for Xi in training_sets]
24 self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
25 for Xi in training_sets]
<ipython-input-16-8d98b394e8b0> in <listcomp>(.0)
21 self.models_ = [KernelDensity(bandwidth=self.bandwidth,
22 kernel=self.kernel).fit(Xi)
---> 23 for Xi in training_sets]
24 self.logpriors_ = [np.log(Xi.shape[0] / X.shape[0])
25 for Xi in training_sets]
NameError: name 'KernelDensity' is not defined
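The fix is likely just the missing import; KernelDensity lives in sklearn.neighbors:

```python
from sklearn.neighbors import KernelDensity

# Once imported, the KDEClassifier's fit can build its per-class density models
kde = KernelDensity(bandwidth=1.0, kernel='gaussian')
print(type(kde).__name__)  # KernelDensity
```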
Accessing the ix indexer on a pandas Series or DataFrame now gives a DeprecationWarning.
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
You're already advising to avoid ix. How about changing the existing mentions of ix to be even more discouraging (i.e. mention it's deprecated and really shouldn't be used) and dropping the one code example using ix?
I think this would be a minor change to the book: all examples would keep working with older and newer pandas versions; just the description of ix would change slightly.
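For the dropped ix example, the explicit indexers cover both of ix's behaviors; a minimal sketch:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# label-based lookup (what ix did when the key matched a label)
print(s.loc[1])   # 'a'
# position-based lookup (what ix fell back to for integer positions)
print(s.iloc[1])  # 'b'
```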
I want to try this example but couldn't find the CSV dataset. Where can I get it?
Hi Jake, awesome book, it's actually taught me a lot that I didn't know about matplotlib and numpy. I'm glad you've covered all the core technologies that we use as Python data scientists, not just machine learning.
One observation of Figure 5-32 on page 374 - it looks like the training and validation curves are the wrong way around. Is this right?
I would like to port PythonDataScienceHandbook to the Ruby programming language (at first in Japanese for the text).
At the moment the Ruby data science environment is immature and a complete port is not possible, but I would like to borrow the datasets and the text outline.
Is this attempt contrary to your PythonDataScienceHandbook license terms?
Notice that the size of points is given in pixels, the color argument is automatically mapped to a color scale (shown here by the colorbar() command), and that the size argument is given in pixels.
The phrase "given in pixels" appears twice.
Hi @jakevdp
Thank you for publishing these great notebooks.
I would like to translate your notebooks into Japanese (and publish it on GitHub).
But I am concerned that the translation is contrary to the ND (NoDerivatives).
I would like your comments on this CC-BY-NC-ND license issue.
Required by 03.12-Performance-Eval-and-Query
The requirements.txt file has:
numpy=1.11
pandas=0.18.1
scipy=0.17.1
sklearn=0.17.1
matplotlib=1.5.1
jupyter
notebook
line_profiler
memory_profiler
which will be rejected by pip:
Invalid requirement: 'numpy=1.11'
= is not a valid operator. Did you mean == ?
Need to change = to == in each instance where a version is pinned.
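With the operator fixed, the pinned block would read (same versions as the original file):

```
numpy==1.11
pandas==0.18.1
scipy==0.17.1
sklearn==0.17.1
matplotlib==1.5.1
jupyter
notebook
line_profiler
memory_profiler
```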
Needed by 05.14-Image-Features.
Some of the scikit-learn modules/classes trigger deprecation warnings: the cross_validation submodule will be moved to model_selection, and GaussianMixture will replace GMM.
Used in 05.08-Random-Forests, the code for the helpers_05_08 module needs to be included in the repo and either placed in the root directory (so it is found when the Jupyter notebook server is started out of that directory) or made part of a Python package in a subdirectory, which is then referenced by the requirements.txt file so that it will be installed by pip -r requirements.txt and made available when people run the notebooks. Preferably people use a virtual environment, given this module is specific to your examples.
The specifications were found to be in conflict:
In 03.11-Working-with-Time-Series:
ImportErrorTraceback (most recent call last)
<ipython-input-25-6ad953b03311> in <module>()
----> 1 from pandas_datareader import data
2
3 goog = data.DataReader('GOOG', start='2004', end='2016',
4 data_source='google')
5 goog.head()
ImportError: No module named pandas_datareader