palewire / first-python-notebook

A step-by-step guide to analyzing data with Python and the Jupyter notebook.

Home Page: https://palewi.re/docs/first-python-notebook/

License: MIT License

Makefile 53.44% Jupyter Notebook 46.56%

python jupyter-notebook pandas tutorial journalism data-analysis jupyter sphinx data-journalism education news altair jupyterlab

first-python-notebook's Introduction

First Python Notebook

A step-by-step guide to analyzing data with Python and the Jupyter notebook. Take the class at firstpythonnotebook.org.

first-python-notebook's Issues

Update references from CCDC to BLN

The keyword arguments note needs to move

We are dropping it in here following the use of rename. The problem: our rename call no longer uses a keyword argument, so we need to move the note elsewhere. I think it should come after our first use of a kwarg, which may not be the merge method in the next chapter.

Exercises for extra practice as a list at the bottom of chapters

Update the material so we're not using the 2016 data

Add a state total group, and then an "in_state" column creation and group

403 Forbidden error when opening CSV

pd.read_csv("http://www.firstpythonnotebook.org/_static/committees.csv")

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-4-3ea4d8833327> in <module>()
----> 1 pd.read_csv("http://www.firstpythonnotebook.org/_static/committees.csv")

~/.local/share/virtualenvs/first-python-notebook-DfG0-Xvh/lib/python3.6/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
    676                     skip_blank_lines=skip_blank_lines)
    677 
--> 678         return _read(filepath_or_buffer, kwds)
    679 
    680     parser_f.__name__ = name

~/.local/share/virtualenvs/first-python-notebook-DfG0-Xvh/lib/python3.6/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    422     compression = _infer_compression(filepath_or_buffer, compression)
    423     filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer(
--> 424         filepath_or_buffer, encoding, compression)
    425     kwds['compression'] = compression
    426 

~/.local/share/virtualenvs/first-python-notebook-DfG0-Xvh/lib/python3.6/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode)
    193 
    194     if _is_url(filepath_or_buffer):
--> 195         req = _urlopen(filepath_or_buffer)
    196         content_encoding = req.headers.get('Content-Encoding', None)
    197         if content_encoding == 'gzip':

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    221     else:
    222         opener = _opener
--> 223     return opener.open(url, data, timeout)
    224 
    225 def install_opener(opener):

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in open(self, fullurl, data, timeout)
    530         for processor in self.process_response.get(protocol, []):
    531             meth = getattr(processor, meth_name)
--> 532             response = meth(req, response)
    533 
    534         return response

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in http_response(self, request, response)
    640         if not (200 <= code < 300):
    641             response = self.parent.error(
--> 642                 'http', request, response, code, msg, hdrs)
    643 
    644         return response

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in error(self, proto, *args)
    568         if http_err:
    569             args = (dict, 'default', 'http_error_default') + orig_args
--> 570             return self._call_chain(*args)
    571 
    572 # XXX probably also want an abstract factory that knows when it makes

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    502         for handler in handlers:
    503             func = getattr(handler, meth_name)
--> 504             result = func(*args)
    505             if result is not None:
    506                 return result

~/.pyenv/versions/3.6.4/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

More explicitly spell out how to add and delete cells

Fix for committees read 403

Summary
Background
Original stacktrace
urlopen headers

Summary

This issue was previously discussed in #26

I'm seeing 403s when attempting to read committees.csv using pandas with Python 3.9 and pandas==1.2.2 (see Original stacktrace below).

Using a GitHub raw URL resolves the issue.

Background

This issue appears to stem from Read the Docs (RTD) issuing a 403 response to HTTP requests made with the default Python user agent.

pandas.read_csv uses urllib.request.urlopen under the hood to make the web request, which by default sets the User-agent header to Python-urllib/3.9 (see urlopen headers below).

## This fails

>>> import urllib.request
>>> url = "https://first-python-notebook.readthedocs.io/_static/committees.csv"
>>> urllib.request.urlopen(url)
Traceback (most recent call last):
<<< snipped >>>
HTTPError: Forbidden

Setting a realistic User-agent header fixes the issue:

>>> headers = {
...     'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:85.0) Gecko/20100101 Firefox/85.0'
... }
>>> req = urllib.request.Request(url=url, headers=headers)
>>> resp = urllib.request.urlopen(req)
>>> resp.read().decode('utf-8')[0:50]
'ocd_prop_id,calaccess_prop_id,ccdc_prop_id,prop_na'

Unfortunately, there doesn't appear to be a way to configure request headers via the pandas.read_csv interface (at least none jumped out at me from a quick review of function parameters).
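One possible workaround, sketched here under stated assumptions, is to make the request yourself with a custom User-Agent and hand the downloaded bytes to pandas as a file-like object. The helper name `fetch_csv_buffer` and its default User-Agent string are hypothetical, and whether a given server accepts a particular string is up to that server.

```python
import io
import urllib.request


# Hypothetical helper: the name and the default User-Agent value are
# illustrative assumptions, not part of pandas or the tutorial.
def fetch_csv_buffer(url, user_agent="Mozilla/5.0 (compatible; csv-reader-example)"):
    """Download a URL with a custom User-Agent and return its bytes as a buffer."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        # Wrap the raw bytes so callers can treat the result like an open file.
        return io.BytesIO(response.read())
```

Because read_csv accepts file-like objects, the buffer can be passed along as `pd.read_csv(fetch_csv_buffer(url))`. Newer pandas releases (1.3 and later) also document that `storage_options` key-value pairs are forwarded as request headers for HTTP(S) URLs, which may remove the need for a helper like this.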

Using an alternative URL such as the raw GH URL sidesteps the issue:

>>> gh_url = "https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/_static/committees.csv"

# urlopen version works
>>> resp = urllib.request.urlopen(gh_url)
>>> resp.read().decode('utf-8')[0:50]
'ocd_prop_id,calaccess_prop_id,ccdc_prop_id,prop_na'

# pandas.read_csv version works
>>> response = pd.read_csv(gh_url)
>>> response
                                          ocd_prop_id  ...  committee_position
0    ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6  ...             SUPPORT
1    ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6  ...             SUPPORT
2    ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6  ...             SUPPORT
3    ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6  ...              OPPOSE
4    ocd-contest/85990193-9d6f-4600-b8e7-bf1317841d82  ...             SUPPORT
..                                                ...  ...                 ...
97   ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95  ...             SUPPORT
98   ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95  ...             SUPPORT
99   ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95  ...             SUPPORT
100  ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95  ...             SUPPORT
101  ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95  ...             SUPPORT

Original stacktrace

import pandas as pd
committee_list = pd.read_csv("https://first-python-notebook.readthedocs.io/_static/committees.csv")


---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-35-62ab1780e12d> in <module>
----> 1 committee_list = pd.read_csv("https://first-python-notebook.readthedocs.io/_static/committees.csv")

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    608     kwds.update(kwds_defaults)
    609 
--> 610     return _read(filepath_or_buffer, kwds)
    611 
    612 

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    460 
    461     # Create the parser.
--> 462     parser = TextFileReader(filepath_or_buffer, **kwds)
    463 
    464     if chunksize or iterator:

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    817             self.options["has_index_names"] = kwds["has_index_names"]
    818 
--> 819         self._engine = self._make_engine(self.engine)
    820 
    821     def close(self):

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1048             )
   1049         # error: Too many arguments for "ParserBase"
-> 1050         return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
   1051 
   1052     def _failover_to_python(self):

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1865 
   1866         # open handles
-> 1867         self._open_handles(src, kwds)
   1868         assert self.handles is not None
   1869         for key in ("storage_options", "encoding", "memory_map", "compression"):

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/parsers.py in _open_handles(self, src, kwds)
   1360         Let the readers open IOHanldes after they are done with their potential raises.
   1361         """
-> 1362         self.handles = get_handle(
   1363             src,
   1364             "r",

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    556 
    557     # open URLs
--> 558     ioargs = _get_filepath_or_buffer(
    559         path_or_buf,
    560         encoding=encoding,

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/common.py in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    287                 "storage_options passed with file object or non-fsspec file path"
    288             )
--> 289         req = urlopen(filepath_or_buffer)
    290         content_encoding = req.headers.get("Content-Encoding", None)
    291         if content_encoding == "gzip":

~/.local/share/virtualenvs/first-python-notebook-QxiypQOy/lib/python3.9/site-packages/pandas/io/common.py in urlopen(*args, **kwargs)
    193     import urllib.request
    194 
--> 195     return urllib.request.urlopen(*args, **kwargs)
    196 
    197 

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    212     else:
    213         opener = _opener
--> 214     return opener.open(url, data, timeout)
    215 
    216 def install_opener(opener):

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in open(self, fullurl, data, timeout)
    521         for processor in self.process_response.get(protocol, []):
    522             meth = getattr(processor, meth_name)
--> 523             response = meth(req, response)
    524 
    525         return response

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in http_response(self, request, response)
    630         # request was successfully received, understood, and accepted.
    631         if not (200 <= code < 300):
--> 632             response = self.parent.error(
    633                 'http', request, response, code, msg, hdrs)
    634 

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in error(self, proto, *args)
    559         if http_err:
    560             args = (dict, 'default', 'http_error_default') + orig_args
--> 561             return self._call_chain(*args)
    562 
    563 # XXX probably also want an abstract factory that knows when it makes

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    492         for handler in handlers:
    493             func = getattr(handler, meth_name)
--> 494             result = func(*args)
    495             if result is not None:
    496                 return result

/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    639 class HTTPDefaultErrorHandler(BaseHandler):
    640     def http_error_default(self, req, fp, code, msg, hdrs):
--> 641         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    642 
    643 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

urlopen headers

Below is a dump of the headers passed by the underlying urlopen call in urllib/request.py:

{'_full_url': 'https://first-python-notebook.readthedocs.io/_static/committees.csv', 'fragment': None, 'type': 'https', 'host': 'first-python-notebook.readthedocs.io', 'selector': '/_static/committees.csv', 'headers': {}, 'unredirected_hdrs': {'Host': 'first-python-notebook.readthedocs.io', 'User-agent': 'Python-urllib/3.9'}, '_data': None, '_tunnel_host': None, 'origin_req_host': 'first-python-notebook.readthedocs.io', 'unverifiable': False, 'timeout': <object object at 0x114e47a60>}

Add `%pip` magic command instructions for when pandas and altair aren't available.

jupyter lab install

You may get this error:
Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/jupyterlab_launcher-0.11.2.dist-info'
Consider using the --user option or check the permissions.

That’s because of the permissions you have on your computer. (Note: this error will likely occur if you are on a Mac; see jupyterlab/jupyterlab#3913.)

Try using this command: pip install --user jupyterlab
Q: Should we just suggest this for everyone?

Explicitly encourage and teach Python 3

https://pythonclock.org/ says Python 2.7 will retire in 1 Month, 10 Days, 4 Hours, 56 Minutes and 49 Seconds

More to the point: this is one of the best tutorials on getting a Python development environment set up I've ever seen, but I want people to get going with Python 3 so they can use my Datasette project!

I imagine updating it to Python 3 is not an insignificant amount of work, due to the need to re-record the installation videos. But I'm optimistically filing a bug report anyway!

Get rid of committee totals. They are boring.

clean the names with a strip to avoid whitespace bugs

Once we're done with the text, reshoot all the screenshots

Tip: Make sure your virtualenv pip isn't _really_ old

Hi!
Using OS X 10.11.6 El Capitan here. For whatever reason, my virtualenv had a really old version of pip (1.4.1) and when I tried

$ pip install jupyter

I ended up with this error:

 ... [snip] ... 

Downloading/unpacking ipython (from jupyter-console->jupyter)
  Downloading ipython-5.2.2.tar.gz (4.9MB): 4.9MB downloaded
  Running setup.py egg_info for package ipython
    error in ipython setup command: Invalid environment marker: sys_platform == "win32" and python_version < "3.6"
    Complete output from command python setup.py egg_info:
    error in ipython setup command: Invalid environment marker: sys_platform == "win32" and python_version < "3.6"

----------------------------------------
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /Users/ ... 

... [snip] ...

So following the jupyter install advice, I upgraded my pip

$ pip install --upgrade pip

I went from 1.4.1 to 9.0.1 (!) and then pip install jupyter worked like a charm!
So if at NICAR you see an install bomb like the above, try upgrading the virtualenv pip.

latest jupyter didn't default to 'code' edit mode

When running through the tutorial on my own, in the Hello Notebook section "Write Python in the notebook," I had to switch the new notebook's editing mode to Code in a dropdown before math worked.

I went to make edits and rebuild them from source, but yolk was confused because both yolk and yolk3k were installed into a python3 pipenv.

I have fixes for both of these, but I'm not sure how you'd prefer to handle these, and I'm not a professional writer targeting new folks, so the tone of my PR is maybe not what you would prefer.

Tidy up the charts chapter

Add a data cleaning chapter

Describe how you would standardize the name columns
Show how you would group and sum on the new standardized column
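A minimal sketch of those two steps, assuming hypothetical column names (`contributor_lastname`, `amount`) and made-up sample rows rather than the tutorial's actual data:

```python
import pandas as pd

# Made-up sample rows; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "contributor_lastname": ["  Smith", "SMITH ", "Jones", "jones"],
        "amount": [100, 250, 50, 25],
    }
)

# Standardize the names: trim surrounding whitespace and normalize the case.
df["clean_lastname"] = df["contributor_lastname"].str.strip().str.upper()

# Group on the standardized column and sum the amounts.
totals = df.groupby("clean_lastname")["amount"].sum()
print(totals)
```

Stripping and upper-casing only catches the simplest variations; misspellings and abbreviations would need more work than this sketch covers.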

Get more basic python stuff into intro chapter

This could build up to and conclude with the moral of the story being that default Python doesn't have median
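As a rough illustration of where that lesson could land: Python has built-in functions such as sum, min and max, but no built-in median, so it either has to be written by hand or imported. The standard library's statistics module (available since Python 3.4) provides one. The sample values and the `median_by_hand` name below are hypothetical.

```python
import statistics

# Made-up sample values for illustration.
values = [3, 1, 4, 1, 5, 9]


def median_by_hand(numbers):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(numbers)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values around the midpoint.
    return (ordered[mid - 1] + ordered[mid]) / 2


by_hand = median_by_hand(values)
from_stdlib = statistics.median(values)
print(by_hand, from_stdlib)
```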

Introduce NaN values and show how to deal with them
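A short sketch of what such a section might cover, using made-up data and hypothetical column names: spotting missing values, dropping the rows that contain them, and filling them in.

```python
import pandas as pd

# Made-up sample data; None in a numeric column is stored as NaN.
df = pd.DataFrame({"name": ["A", "B", "C"], "amount": [10.0, None, 30.0]})

# Flag which rows are missing a value.
missing = df["amount"].isna()

# Option 1: drop the rows where amount is missing.
dropped = df.dropna(subset=["amount"])

# Option 2: replace missing values with a default, here zero.
filled = df["amount"].fillna(0)

print(missing.sum(), len(dropped), filled.tolist())
```

Which option is appropriate depends on the analysis; filling with zero changes averages, while dropping rows changes counts.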

Include a caveat about what to do if the notebook doesn't automatically open in your browser when you start it

Show how to print out only a subset of columns
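One way this could look, sketched with made-up data and hypothetical column names: passing a list of column names inside the square brackets returns a DataFrame with only those columns.

```python
import pandas as pd

# Made-up sample data; the column names are assumptions for illustration.
df = pd.DataFrame(
    {
        "name": ["A", "B"],
        "city": ["Los Angeles", "Sacramento"],
        "amount": [100, 250],
    }
)

# A list of column names selects just those columns, in the order given.
subset = df[["name", "amount"]]
print(subset.head())
```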

Should we break out the `to_csv` export into its own chapter?

I'm leaning towards yes on this one.

Add a timeseries example to the charts chapter

It would also teach how to parse the date column as a datetime object.
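A minimal sketch of the date-parsing step, assuming a hypothetical CSV with `date` and `amount` columns (the inline text stands in for a real file). The chart itself is left out; this only shows the parsing and a monthly rollup that a timeseries chart could be built from.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file.
csv_text = "date,amount\n2016-01-15,100\n2016-01-20,50\n2016-02-03,75\n"

# parse_dates asks read_csv to convert the named column to datetime values.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# With real datetimes, the .dt accessor can bucket rows by month.
monthly = df.groupby(df["date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```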

Integrate some more questions for students to answer into the material

Go through subchapter headings and make sure tense and stuff is consistent

port over the main footer

Add a datawrapper example (and tutorial off ramp) to the export chapter

Spin off Fast Python Notebook as a separate walkthrough

Show how to do a contains search
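A possible starting point, using made-up committee names and a hypothetical `committee_name` column: the pandas string method `str.contains` returns a True/False value per row, which can then be used as a filter.

```python
import pandas as pd

# Made-up sample data; the column name is an assumption for illustration.
df = pd.DataFrame(
    {"committee_name": ["Yes on Prop 64", "No on 64", "Californians for Clean Air"]}
)

# case=False ignores capitalization; na=False treats missing names as non-matches.
is_match = df["committee_name"].str.contains("prop", case=False, na=False)
matches = df[is_match]
print(matches)
```

By default `str.contains` treats the search term as a regular expression, so characters like parentheses or periods may need `regex=False` to be matched literally.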

Add a chapter that explains how to add Markdown

Add disclaimer about Python3?

Hey Ben,
This is great stuff. I just ran through the tutorial to see how it would fare on Python 3, and happy to report that every single command worked! Wondering if it's worth posting a little heads up in the Python prerequisites section noting as much, so folks know they can use either 2.x or 3.x (note I tested on Python 3.6.1). Totally understand if you don't want to tie yourself to maintaining both versions, btw, so no stress if you want to keep the focus on 2.7.

Cheers!

Serdar