data-8 / datascience
A Python library for introductory data science
License: BSD 3-Clause "New" or "Revised" License
From Ani:
Is it possible to have hist draw a histogram based on a distribution table?
E.g. the inputs are intervals and the proportions in each interval (adding up to 100%). Output is a histogram.
At the moment hist takes the raw data as its input. We could simply generate the right number of values at the center of each interval, and provide that as the dataset.
I'm asking because it will be very helpful when students find bad histograms in the newspaper or journal articles and try to fix them. They won't have the raw data. They'll just have the distribution, badly represented. To fix the representation, they could work with the distribution by hand as in Stat 2/20/21, but could we do better in our course?
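The midpoint idea above could be sketched like this; the function name, the (low, high) interval representation, and the assumption that proportions sum to 1 are all hypothetical choices, not part of the existing hist API:

```python
import numpy as np

def expand_distribution(intervals, proportions, total=1000):
    """Turn a distribution table into synthetic raw data by repeating
    each interval's midpoint in proportion to its share of `total`."""
    values = []
    for (low, high), p in zip(intervals, proportions):
        midpoint = (low + high) / 2
        values.extend([midpoint] * round(p * total))
    return np.array(values)

data = expand_distribution([(0, 10), (10, 20), (20, 50)], [0.2, 0.5, 0.3])
# `data` can now be handed to hist as if it were raw observations
```

One caveat of the midpoint trick: the resulting histogram only matches the original distribution when drawn with the same interval boundaries.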
See the following for an example; scroll down till you see the bar graph.
http://www.cdc.gov/mmwr/preview/mmwrhtml/rr58e0821a1.htm
Charts for columns might come out in the wrong order, which is surprising. They should be rendered in column order.
The pth percentile of a list is the smallest number that is at least as large as p% of the numbers in the list.
That means: sort the list from low to high and count up p% of the way from the bottom. If that lands exactly on an entry, take its value; otherwise take the next entry up.
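A direct translation of that definition into Python (a sketch, not the package's implementation):

```python
import math

def percentile(p, values):
    """The p-th percentile of values: the smallest element that is at
    least as large as p% of the numbers in the list."""
    ordered = sorted(values)
    # 1-based rank of the element p% of the way up the sorted list;
    # math.ceil implements "else take the next one"
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]

print(percentile(50, [1, 3, 5, 7]))  # 3
print(percentile(30, [1, 3, 5, 7]))  # 3
```

For p = 50 on a list of four numbers, 50% of the way up is exactly the second entry (3); for p = 30 the 1.2th position is rounded up to the second entry.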
When working with the datascience package, I spend a lot of time trying to figure out how the methods work since the docstrings aren't super helpful: some methods require the table to be a certain shape, others require the table values to be numbers, others strings. None of these details are mentioned in some important methods like hist and barh.
In addition, to find out whether the package has the functionality I want (e.g. whether I can group a table of years by decade), I have to browse the methods one by one, trying to keep a lot of things in my head about what methods are available.
I imagine I'm running into a majority of these issues because most of this code wasn't written by me. However, this will be the case for our students so IMO the earlier we can work on this the better.
It'd be very helpful to 1. Improve the docstrings and 2. Have easily navigable documentation (probably generated from docstrings using something like Sphinx).
A great place to start would be the plotting functions, since those seem to be the most finicky and most commonly used.
Shows that the CSS isn't loading. I suspect this is because dsten.github.io is getting automatically redirected to data8.org but the asset files aren't being redirected properly. @papajohn any thoughts?
Try this:
Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD')
Table.read_table() fails to recognize the columns; it stuffs everything into one column.
Compare to
Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv')
which does recognize that there are three columns.
Perhaps it is looking at the URL and trying to parse out the filename extension, and then using that to decide how to decode the data. If so, maybe it should be smarter about how to parse URLs (to remove fragments and parameters), or maybe it should ignore the URL/filename and have smarter format detection (e.g., auto-detect it as CSV based on the contents of the data rather than the filename).
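Both fixes suggested above can be sketched with the standard library; these helper names are hypothetical, not part of Table.read_table:

```python
import csv
import os
from urllib.parse import urlparse

def url_extension(url):
    """Extension of the URL's path component only: query strings and
    fragments are stripped before guessing the format."""
    return os.path.splitext(urlparse(url).path)[1]

def sniff_delimiter(sample):
    """Content-based detection: let csv.Sniffer inspect the data itself.
    Returns the detected delimiter, or None if none is found."""
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None

print(url_extension('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'))  # .csv
print(sniff_delimiter('a,b,c\n1,2,3\n4,5,6\n'))  # ,
```

The first helper fixes the ?accessType=DOWNLOAD case directly; the second works even when the URL carries no useful extension at all.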
At some point doctests were moved into tests, but they would be helpful in the documentation.
Here's a piece of code that triggers this issue:
lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable.csv')
lyricsTable
gives an error of
File "<string>", line 12
SyntaxError: more than 255 arguments
If I used a smaller table (with 104 columns instead of 5004), the error goes away:
lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable_part.csv')
lyricsTable
I tried to create a new release, but it appears I was unsuccessful. Any suggestions?
$ pip install datascience==0.2.0
Collecting datascience==0.2.0
Could not find a version that satisfies the requirement datascience==0.2.0 (from versions: 0.1.0, 0.1, 0.1.1)
No matching distribution found for datascience==0.2.0
I created a datascience release called 0.2.0, updated the setup.py file, and ran python setup.py sdist upload -r pypi.
Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.
from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0, numDatapoints)] for j in range(0, numFeatures)],
                    [str(j) for j in range(0, numFeatures)])
countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0, k] - countsMatrix[j, k]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s
Running this code shows that iterating over numpy.matrix is roughly 60x faster than iterating over Table.columns, which in turn is roughly 70x faster than using Table.rows.
I already ran into an example where I needed multiple rows in a pivot. I had to join and then split a column to get what I wanted, which was gross. We need to support something like
t.pivot(column, [row0, row1], value, ...)
What do you guys think about using Zenhub for our workflow? I've noticed that it's a little challenging to manage multiple issues in different states of progress because they're all in one big list. Zenhub adds a layer of functionality on top of Github while preserving existing functionality for those who don't want to use it.
Here are a couple of problems and how Zenhub handles them.
It's hard to see which issues are important (i.e., which we should work on immediately) and which we're putting off for later.
Zenhub has multiple columns (called pipelines) to file issues under, such as "Backlog", "In Progress", and "New Issues".
We have multiple repos being worked on, but each one has its own issue list.
You can use one Zenhub board to view and manage issues from multiple repos.
Personally, I've used Zenhub before for a bunch of projects and liked it a lot more than the simple Issue list Github provides. But, there's always an overhead of getting used to the new technology.
Maybe it's not deployed to Jupyter yet?
The current Table API supports take (for rows), but appears to lack the functionality to drop rows. While it is possible to work around this by constructing my own complement, it would be more elegant to directly support these operations.
This is in the context of doing cross-validation, where we typically drop a small number of rows to construct the test set.
[This issue should be labeled as enhancement, but I can't seem to figure out how to do that.]
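Until such an API exists, the "construct my own complement" workaround can be packaged into one helper; `exclude` is a hypothetical name, and the sketch only assumes the table supports take and num_rows:

```python
import numpy as np

def exclude(table, row_indices):
    """Hypothetical complement of Table.take: keep every row whose
    index is NOT in row_indices, by building the complementary index
    set and delegating to take."""
    keep = np.setdiff1d(np.arange(table.num_rows), row_indices)
    return table.take(keep)

# Cross-validation split, assuming `t` is a Table and `test_idx`
# holds the row indices of the test set:
# test = t.take(test_idx)
# train = exclude(t, test_idx)
```

A built-in version of this would keep cross-validation code entirely in Table space.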
I generated a histogram & couldn't tell if some data were occluded by the legend when the legend contained 11 items.
Prof. DeNero
I think we want something like https://github.com/pbugnion/gmaps to draw a Google map with polygons & dots on it using the Google Maps Javascript API. It's unclear whether this library has been ported to Python 3. They have a directory that gives me hope: https://github.com/pbugnion/gmaps/tree/master/examples/ipy3
It would be nice if the Map class provided a way to overlay a heat map on top of a map: given a large collection of (lat,long) points (e.g., Markers), construct a heat map making it easy to visualize where the points are most concentrated.
Maybe there's an easy way by using the options provided, but I couldn't figure out how to do that from the public documentation. Maybe provide an API for this, or document how to do it? I bet this will be a useful thing for students -- and I'd like to use it for Friday lecture too.
help(Table)
gives the following example of how to use a Table
:
| >>> letters = ['a', 'b', 'c', 'z']
| >>> counts = [9, 3, 3, 1]
| >>> points = [1, 2, 2, 10]
| >>> t = Table([('letter', letters), ('count', counts), ('points', points)])
However, copy-pasting that into Jupyter gives the following error message:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
...
/usr/local/lib/python3.4/dist-packages/datascience/table.py in __init__(self, columns, labels)
---> 46 assert labels is not None, 'Labels are required'
Large ints are formatted poorly (by {:g} instead of {:d}).
https://www.python.org/dev/peps/pep-0396/ describes a common way of setting library versions which can be integrated into the release process. There is also https://www.python.org/dev/peps/pep-0440/.
version.py's VERSION is currently 0.3.7.
We should consider adding one of these, because where is currently lacking.
.where(column_label, fn)
.where(column_label, value, compare_fn)
.where(column_label, value, not_equal=True)
Top one seems most general, so that's probably the way to go, but it requires understanding higher-order functions.
The problem is that if I want to say n != 2 and m < 4 for columns n and m, neither of the following works.
t.where(t['n'] != 2).where(t['m'] < 4)
fails because t['m'] is unfiltered and has the wrong length.
t.where(t['n'] != 2 and t['m'] < 4)
fails because and doesn't work with numpy arrays.
The only working solutions currently are both ugly:
t.where(numpy.logical_and(t['n'] != 2, t['m'] < 4))
u = t.where(t['n'] != 2)
u.where(u['m'] < 4)
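For completeness, numpy's elementwise & operator is a shorter spelling of numpy.logical_and, though it has its own trap: & binds tighter than the comparison operators, so each comparison must be parenthesized. A standalone illustration:

```python
import numpy as np

n = np.array([1, 2, 3, 4])
m = np.array([5, 3, 2, 9])

# `and` raises an error on arrays; elementwise `&` works instead,
# as long as each comparison is wrapped in parentheses
mask = (n != 2) & (m < 4)
print(mask)  # [False False  True False]
```

With a table this would read t.where((t['n'] != 2) & (t['m'] < 4)), which is only marginally less ugly than the logical_and version.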
It's hard to tell at a glance what the datascience.py file does since it's getting pretty long (~700 LOC as of this writing). I think we should consider splitting it up into separate files to help with code organization.
I tried to use the Maps demo from text/demos/MapsDemonstration.ipynb. I can't make it work for me.
When I try to create a Marker, I get an error. In particular, I can do
m = Marker(37.78, -122.42, 'San Francisco')
with no error, but then when I do
m.show()
I get a traceback:
TypeError: simple_marker() got an unexpected keyword argument 'popup_on'
Same if I type just m into an input cell.
Useful stuff since version 0.14
We currently cycle only through blue, yellow, green, and red; dark variants of these would be nicer than magenta & white. See the _visualize method in table.py.
In many cases it would be good to build up a little table and use it to augment another table. A common case is to map through a table. For example, you have a table of Parcels. You have categorized them. Now you want to map categories to colors. So you build a little table.
color_map = Table.from_rows([["Residential",'#f1eef6'],
["Commercial",'#d0d1e6'],
["Industrial",'#a6bddb'],
["Apartment",'#74a9cf'],
["Public",'#2b8cbe'],
["Other",'#045a8d']], ("Category", "Color"))
The old join did a left outer join; thus, this would just work. The new one does an inner join - you only get one row per match. You can build it up with indexed_by, but it is pretty ugly because it is dealing with the possibility of multiple matching entries. If we implement lookup we can get this case. Or do we want to offer a richer join?
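While the join question is open, the category-to-color mapping itself can be done with a plain dict outside Table space; `categories` below is a stand-in for the Parcels column, not part of any API:

```python
# One-off workaround for the left-outer-join use case: map each
# category to its color with a dict, repeating values as needed.
colors = {
    "Residential": '#f1eef6',
    "Commercial":  '#d0d1e6',
    "Industrial":  '#a6bddb',
    "Apartment":   '#74a9cf',
    "Public":      '#2b8cbe',
    "Other":       '#045a8d',
}

categories = ["Public", "Other", "Residential"]
parcel_colors = [colors[c] for c in categories]
print(parcel_colors)  # ['#2b8cbe', '#045a8d', '#f1eef6']
```

A lookup primitive on Table would essentially be this comprehension with nicer packaging.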
Conveniently, it turns out that matplotlib has boxplots, which is something @a-adhikari wanted for an upcoming lecture on visualizations.
We should write a Table.boxplot function that essentially wraps the matplotlib function ASAP.
test_sort is broken.
It would be good to have both bootstrap-based hypothesis testing (where column A is resampled n times with replacement and column B is independently sampled n times with replacement), and permutation testing (where column A and column B are put together in a single long column, a random permutation is made of that column, it is split in the middle into two new columns, and the statistic is computed on those two columns).
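Both procedures can be sketched with numpy arrays; the difference of means is used as a placeholder statistic, and the function names are hypothetical:

```python
import numpy as np

def bootstrap_diff_of_means(a, b, reps=1000):
    """Bootstrap: resample each column independently, with replacement,
    recording the difference of means each time."""
    return np.array([
        np.random.choice(a, len(a), replace=True).mean()
        - np.random.choice(b, len(b), replace=True).mean()
        for _ in range(reps)
    ])

def permutation_diff_of_means(a, b, reps=1000):
    """Permutation test: pool both columns, shuffle the pooled column,
    split it back at the original boundary, and record the statistic."""
    pooled = np.concatenate([a, b])
    n = len(a)
    diffs = []
    for _ in range(reps):
        shuffled = np.random.permutation(pooled)
        diffs.append(shuffled[:n].mean() - shuffled[n:].mean())
    return np.array(diffs)
```

A Table-level API could wrap these so students pass column labels rather than raw arrays.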
@SamLau95 I can expand sample functionality. Would the following syntax work?
def sample(self, k, with_replacement=False, columns=[], random_permutation=False):
    # columns is a list of column names - if set, with_replacement
    # must be a list of the same length OR random_permutation must
    # be True
    # if columns and with_replacement are both lists, and
    # random_permutation is True, raise an error (don't know what to do)
The summary of methods lists the first constructor as:
Table([columns, labels, formatter])
This has a typo: it should be
Table(columns, labels, formatter)
You need to pass 3 args (or 2 args), not 1 arg that's an array of length 3.
Prof. DeNero: np.append should be invoked once per column.
Minimal (non-)working example:
from datascience import *
%matplotlib inline
import numpy as np
tab = Table(labels=["money"], columns=[[1.,2.,3.]])
tab.set_format("money", CurrencyFormatter)
append doesn't work.
zip should be used to construct columns in from_rows.
group by a zipped column should do the right thing, but currently expands the contents of tuples.
group should not introduce a new column if a column is passed in.
Bokeh now supports Python 3. The charts that are rendered support zooming and look nice. (Their mapping functionality looks less useful than folium, though.)
Prof. DeNero:
When I see drop, I think it's going to change the table like del.
Goal: Given a table T and a column C, build a new table that has one row for each unique value in T.C along with a count of the number of times that value appears in T.C.
I was not able to find any clean way to do this within the Table API. Should this be doable using Tables, without leaving Table space and going back to arrays and raw Python?
Here is the solution I came up with:
from collections import Counter
c = Counter(origtbl['column_label'])
t = Table.from_rows(c.items(), ['column_label', 'count'])
Not so terrible if you know Python idioms, but also probably not so discoverable for students. Should there be an API in Table that's accessible to students that allows performing this kind of task? Or some suitable generalized primitive, which is enough to solve this problem?
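One alternative that stays closer to arrays: np.unique with return_counts=True produces the unique values and their counts in a single call (the final line mirrors the from_rows call above and is commented out because it needs the datascience package):

```python
import numpy as np

column = np.array(['a', 'b', 'a', 'c', 'a', 'b'])
values, counts = np.unique(column, return_counts=True)
print(values)  # ['a' 'b' 'c']
print(counts)  # [3 2 1]
# t = Table.from_rows(zip(values, counts), ['column_label', 'count'])
```

This is arguably even less discoverable than Counter for students, which supports the case for a dedicated Table method.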