data-8 / datascience
A Python library for introductory data science
License: BSD 3-Clause "New" or "Revised" License
From Ani:
Is it possible to have hist draw a histogram based on a distribution table?
E.g. the inputs are intervals and the proportions in each interval (adding up to 100%). Output is a histogram.
At the moment hist takes the raw data as its input. We could simply generate the right number of values at the center of each interval, and provide that as the dataset.
I'm asking because it will be very helpful when students find bad histograms in the newspaper or journal articles and try to fix them. They won't have the raw data. They'll just have the distribution, badly represented. To fix the representation, they could work with the distribution by hand as in Stat 2/20/21, but could we do better in our course?
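The midpoint idea above could be sketched like this; the function name, the (low, high) interval representation, and the assumption that proportions sum to 1 are all hypothetical choices, not part of the existing hist API:

```python
import numpy as np

def expand_distribution(intervals, proportions, total=1000):
    """Turn a distribution table into synthetic raw data by repeating
    each interval's midpoint in proportion to its share of `total`."""
    values = []
    for (low, high), p in zip(intervals, proportions):
        midpoint = (low + high) / 2
        values.extend([midpoint] * round(p * total))
    return np.array(values)

data = expand_distribution([(0, 10), (10, 20), (20, 50)], [0.2, 0.5, 0.3])
# `data` can now be handed to hist as if it were raw observations
```

One caveat of the midpoint trick: the resulting histogram only matches the original distribution when drawn with the same interval boundaries.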
See the following for an example; scroll down till you see the bar graph.
http://www.cdc.gov/mmwr/preview/mmwrhtml/rr58e0821a1.htm
Charts for columns might come out in the wrong order, which is surprising. They should be rendered in column order.
The pth percentile of a list is the smallest number that is at least as large as p% of the numbers in the list.
That means: sort the list from low to high and count up p% of the way from the bottom. If that lands exactly on an entry, take its value; otherwise take the next entry up.
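A direct translation of that definition into Python (a sketch, not the package's implementation):

```python
import math

def percentile(p, values):
    """The p-th percentile of values: the smallest element that is at
    least as large as p% of the numbers in the list."""
    ordered = sorted(values)
    # 1-based rank of the element p% of the way up the sorted list;
    # math.ceil implements "else take the next one"
    rank = max(math.ceil(p / 100 * len(ordered)), 1)
    return ordered[rank - 1]

print(percentile(50, [1, 3, 5, 7]))  # 3
print(percentile(30, [1, 3, 5, 7]))  # 3
```

For p = 50 on a list of four numbers, 50% of the way up is exactly the second entry (3); for p = 30 the 1.2th position is rounded up to the second entry.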
When working with the datascience package, I spend a lot of time trying to figure out how the methods work since the docstrings aren't super helpful: some methods require the table to be a certain shape, others require the table values to be numbers, others strings. None of these details are mentioned in some important methods like hist and barh.
In addition, to find out whether the package has the functionality I want (e.g. whether I can group a table of years by decade), I have to browse the methods one by one, trying to keep a lot of things in my head about what methods are available.
I imagine I'm running into a majority of these issues because most of this code wasn't written by me. However, this will be the case for our students so IMO the earlier we can work on this the better.
It'd be very helpful to 1. Improve the docstrings and 2. Have easily navigable documentation (probably generated from docstrings using something like Sphinx).
A great place to start would be the plotting functions, since those seem to be the most finicky and most commonly used.
Shows that the CSS isn't loading. I suspect this is because dsten.github.io is getting automatically redirected to data8.org but the asset files aren't being redirected properly. @papajohn any thoughts?
Try this:
Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD')
Table.read_table() fails to recognize the columns; it stuffs everything into one column.
Compare to
Table.read_table('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv')
which does recognize that there are three columns.
Perhaps it is looking at the URL and trying to parse out the filename extension, and then using that to decide how to decode the data. If so, maybe it should be smarter about how to parse URLs (to remove fragments and parameters), or maybe it should ignore the URL/filename and have smarter format detection (e.g., auto-detect it as CSV based on the contents of the data rather than the filename).
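Both fixes suggested above can be sketched with the standard library; these helper names are hypothetical, not part of Table.read_table:

```python
import csv
import os
from urllib.parse import urlparse

def url_extension(url):
    """Extension of the URL's path component only: query strings and
    fragments are stripped before guessing the format."""
    return os.path.splitext(urlparse(url).path)[1]

def sniff_delimiter(sample):
    """Content-based detection: let csv.Sniffer inspect the data itself.
    Returns the detected delimiter, or None if none is found."""
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None

print(url_extension('https://data.oaklandnet.com/api/views/7axi-hi5i/rows.csv?accessType=DOWNLOAD'))  # .csv
print(sniff_delimiter('a,b,c\n1,2,3\n4,5,6\n'))  # ,
```

The first helper fixes the ?accessType=DOWNLOAD case directly; the second works even when the URL carries no useful extension at all.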
At some point doctests were moved into tests, but they would be helpful in the documentation.
Here's a piece of code that triggers this issue:
lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable.csv')
lyricsTable
gives an error of
File "<string>", line 12
SyntaxError: more than 255 arguments
If I used a smaller table (with 104 columns instead of 5004), the error goes away:
lyricsTable = Table.read_table('http://eecs.berkeley.edu/~xinghao/ds10data/lyricsTable_part.csv')
lyricsTable
I tried to create a new release, but it appears I was unsuccessful. Any suggestions?
$ pip install datascience==0.2.0
Collecting datascience==0.2.0
Could not find a version that satisfies the requirement datascience==0.2.0 (from versions: 0.1.0, 0.1, 0.1.1)
No matching distribution found for datascience==0.2.0
I created a datascience release called 0.2.0, updated the setup.py file, and ran python setup.py sdist upload -r pypi.
Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.
from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0, numDatapoints)] for j in range(0, numFeatures)],
                    [str(j) for j in range(0, numFeatures)])
countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0, k] - countsMatrix[j, k]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)])
 for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1 - t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s
Running this code shows that iterating over numpy.matrix is roughly 60x faster than iterating over Table.columns, which in turn is roughly 70x faster than using Table.rows.
I already ran into an example where I needed multiple rows in a pivot. I had to join and then split a column to get what I wanted, which was gross. We need to support something like
t.pivot(column, [row0, row1], value, ...)
What do you guys think about using Zenhub for our workflow? I've noticed that it's a little challenging to manage multiple issues in different states of progress because they're all in one big list. Zenhub adds a layer of functionality on top of Github while preserving existing functionality for those who don't want to use it.
Here are a couple of problems and how Zenhub handles them.
It's hard to see which issues are important (i.e., which we should work on immediately) and which we're putting off for later.
Zenhub has multiple columns (called pipelines) to file issues under, such as "Backlog", "In Progress", and "New Issues".
We have multiple repos being worked on, but each one has its own issue list.
You can use one Zenhub board to view and manage issues from multiple repos.
Personally, I've used Zenhub before for a bunch of projects and liked it a lot more than the simple Issue list Github provides. But, there's always an overhead of getting used to the new technology.
Maybe it's not deployed to Jupyter yet?
The current Table API supports take (for rows), but appears to lack the functionality to drop rows. While it is possible to work around this by constructing my own complement, it would be more elegant to directly support these operations.
This is in the context of doing cross-validation, where we typically drop a small number of rows to construct the test set.
[This issue should be labeled as enhancement, but I can't seem to figure out how to do that.]
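Until such an API exists, the "construct my own complement" workaround can be packaged into one helper; `exclude` is a hypothetical name, and the sketch only assumes the table supports take and num_rows:

```python
import numpy as np

def exclude(table, row_indices):
    """Hypothetical complement of Table.take: keep every row whose
    index is NOT in row_indices, by building the complementary index
    set and delegating to take."""
    keep = np.setdiff1d(np.arange(table.num_rows), row_indices)
    return table.take(keep)

# Cross-validation split, assuming `t` is a Table and `test_idx`
# holds the row indices of the test set:
# test = t.take(test_idx)
# train = exclude(t, test_idx)
```

A built-in version of this would keep cross-validation code entirely in Table space.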
I generated a histogram & couldn't tell if some data were occluded by the legend when the legend contained 11 items.
Prof. DeNero
I think we want something like https://github.com/pbugnion/gmaps to draw a Google map with polygons & dots on it using the Google Maps Javascript API. It's unclear whether this library has been ported to Python 3. They have a directory that gives me hope: https://github.com/pbugnion/gmaps/tree/master/examples/ipy3
It would be nice if the Map class provided a way to overlay a heat map on top of a map: given a large collection of (lat,long) points (e.g., Markers), construct a heat map making it easy to visualize where the points are most concentrated.
Maybe there's an easy way by using the options provided, but I couldn't figure out how to do that from the public documentation. Maybe provide an API for this, or document how to do it? I bet this will be a useful thing for students -- and I'd like to use it for Friday lecture too.
help(Table)
gives the following example of how to use a Table
:
| >>> letters = ['a', 'b', 'c', 'z']
| >>> counts = [9, 3, 3, 1]
| >>> points = [1, 2, 2, 10]
| >>> t = Table([('letter', letters), ('count', counts), ('points', points)])
However, copy-pasting that into Jupyter gives the following error message:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
...
/usr/local/lib/python3.4/dist-packages/datascience/table.py in __init__(self, columns, labels)
---> 46 assert labels is not None, 'Labels are required'
Large ints are formatted poorly (by {:g} instead of {:d}).
https://www.python.org/dev/peps/pep-0396/ describes a common way of setting library versions which can be integrated into the release process. There is also https://www.python.org/dev/peps/pep-0440/.
version.py's VERSION is currently 0.3.7.
We should consider adding one of these, because where is currently lacking.
.where(column_label, fn)
.where(column_label, value, compare_fn)
.where(column_label, value, not_equal=True)
Top one seems most general, so that's probably the way to go, but it requires understanding higher-order functions.
The problem is that if I want to say n != 2 and m < 4 for columns n and m, neither of the following works.
t.where(t['n'] != 2).where(t['m'] < 4)
fails because t['m'] is unfiltered and has the wrong length.
t.where(t['n'] != 2 and t['m'] < 4)
fails because and doesn't work with numpy arrays.
The only working solutions currently are both ugly:
t.where(numpy.logical_and(t['n'] != 2, t['m'] < 4))
u = t.where(t['n'] != 2)
u.where(u['m'] < 4)
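For completeness, numpy's elementwise & operator is a shorter spelling of numpy.logical_and, though it has its own trap: & binds tighter than the comparison operators, so each comparison must be parenthesized. A standalone illustration:

```python
import numpy as np

n = np.array([1, 2, 3, 4])
m = np.array([5, 3, 2, 9])

# `and` raises an error on arrays; elementwise `&` works instead,
# as long as each comparison is wrapped in parentheses
mask = (n != 2) & (m < 4)
print(mask)  # [False False  True False]
```

With a table this would read t.where((t['n'] != 2) & (t['m'] < 4)), which is only marginally less ugly than the logical_and version.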
It's hard to tell at a glance what the datascience.py file does since it's getting pretty long (~700 LOC as of this writing). I think we should consider splitting it up into separate files to help with code organization.
I tried to use the Maps demo from text/demos/MapsDemonstration.ipynb. I can't make it work for me.
When I try to create a Marker, I get an error. In particular, I can do
m = Marker(37.78, -122.42, 'San Francisco')
with no error, but then when I do
m.show()
I get a traceback:
TypeError: simple_marker() got an unexpected keyword argument 'popup_on'
Same if I type just m into an input cell.
Useful stuff since version 0.14
We currently cycle only through blue, yellow, green, and red; dark variants of these would be nicer than magenta & white. See the _visualize method in table.py.
In many cases it would be good to build up a little table and use it to augment another table. A common case is to map through a table. For example, you have a table of Parcels. You have categorized them. Now you want to map categories to colors. So you build a little table.
color_map = Table.from_rows([["Residential",'#f1eef6'],
["Commercial",'#d0d1e6'],
["Industrial",'#a6bddb'],
["Apartment",'#74a9cf'],
["Public",'#2b8cbe'],
["Other",'#045a8d']], ("Category", "Color"))
The old join did a left outer join; thus, this would just work. The new one does an inner join - you only get one row per match. You can build it up with indexed_by, but it is pretty ugly because it is dealing with the possibility of multiple matching entries. If we implement lookup we can get this case. Or do we want to offer a richer join?
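While the join question is open, the category-to-color mapping itself can be done with a plain dict outside Table space; `categories` below is a stand-in for the Parcels column, not part of any API:

```python
# One-off workaround for the left-outer-join use case: map each
# category to its color with a dict, repeating values as needed.
colors = {
    "Residential": '#f1eef6',
    "Commercial":  '#d0d1e6',
    "Industrial":  '#a6bddb',
    "Apartment":   '#74a9cf',
    "Public":      '#2b8cbe',
    "Other":       '#045a8d',
}

categories = ["Public", "Other", "Residential"]
parcel_colors = [colors[c] for c in categories]
print(parcel_colors)  # ['#2b8cbe', '#045a8d', '#f1eef6']
```

A lookup primitive on Table would essentially be this comprehension with nicer packaging.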
Conveniently, it turns out that matplotlib has boxplots, which is something @a-adhikari wanted for an upcoming lecture on visualizations.
We should write a Table.boxplot function that essentially wraps the matplotlib function ASAP.
test_sort is broken.
It would be good to have both bootstrap-based hypothesis testing (where column A is resampled n times with replacement and column B is independently sampled n times with replacement), and permutation testing (where column A and column B are put together in a single long column, a random permutation is made of that column, it is split in the middle into two new columns, and the statistic is computed on those two columns).
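Both procedures can be sketched with numpy arrays; the difference of means is used as a placeholder statistic, and the function names are hypothetical:

```python
import numpy as np

def bootstrap_diff_of_means(a, b, reps=1000):
    """Bootstrap: resample each column independently, with replacement,
    recording the difference of means each time."""
    return np.array([
        np.random.choice(a, len(a), replace=True).mean()
        - np.random.choice(b, len(b), replace=True).mean()
        for _ in range(reps)
    ])

def permutation_diff_of_means(a, b, reps=1000):
    """Permutation test: pool both columns, shuffle the pooled column,
    split it back at the original boundary, and record the statistic."""
    pooled = np.concatenate([a, b])
    n = len(a)
    diffs = []
    for _ in range(reps):
        shuffled = np.random.permutation(pooled)
        diffs.append(shuffled[:n].mean() - shuffled[n:].mean())
    return np.array(diffs)
```

A Table-level API could wrap these so students pass column labels rather than raw arrays.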
@SamLau95 I can expand sample functionality. Would the following syntax work?
def sample(self, k, with_replacement=False, columns=[], random_permutation=False):
    # columns is a list of column names - if set, with_replacement
    # must be a list of the same length OR random_permutation must
    # be True
    # if columns and with_replacement are both lists, and
    # random_permutation is True, raise an error (don't know what to do)
The summary of methods lists the first constructor as:
Table([columns, labels, formatter])
This has a typo: it should be
Table(columns, labels, formatter)
You need to pass 3 args (or 2 args), not 1 arg that's an array of length 3.
Prof. DeNero: np.append should be invoked once per column.
Minimal (non-)working example:
from datascience import *
%matplotlib inline
import numpy as np
tab = Table(labels=["money"], columns=[[1.,2.,3.]])
tab.set_format("money", CurrencyFormatter)
append doesn't work.
zip should be used to construct columns in from_rows.
group by a zipped column should do the right thing, but currently expands the contents of tuples.
group should not introduce a new column if a column is passed in.
Bokeh now supports Python 3. The charts that are rendered support zooming and look nice. (Their mapping functionality looks less useful than folium, though.)
Prof. DeNero:
When I see drop, I think it's going to change the table like del.
Goal: Given a table T and a column C, build a new table that has one row for each unique value in T.C along with a count of the number of times that value appears in T.C.
I was not able to find any clean way to do this within the Table API. Should this be doable using Tables, without leaving Table space and going back to arrays and raw Python?
Here is the solution I came up with:
from collections import Counter
c = Counter(origtbl['column_label'])
t = Table.from_rows(c.items(), ['column_label', 'count'])
Not so terrible if you know Python idioms, but also probably not so discoverable for students. Should there be an API in Table that's accessible to students that allows performing this kind of task? Or some suitable generalized primitive, which is enough to solve this problem?
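One alternative that stays closer to arrays: np.unique with return_counts=True produces the unique values and their counts in a single call (the final line mirrors the from_rows call above and is commented out because it needs the datascience package):

```python
import numpy as np

column = np.array(['a', 'b', 'a', 'c', 'a', 'b'])
values, counts = np.unique(column, return_counts=True)
print(values)  # ['a' 'b' 'c']
print(counts)  # [3 2 1]
# t = Table.from_rows(zip(values, counts), ['column_label', 'count'])
```

This is arguably even less discoverable than Counter for students, which supports the case for a dedicated Table method.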