Comments (10)
Looking at the Table code, it seems like doing a .columns essentially does a complete dump of the Table, before [i] indexes into the dump. Would it be possible to instead have a function like .value(i,j) that returns the value of a cell without dumping all columns or even the column being accessed? Or amortize the cost by storing the dumped columns somewhere?
For example, a quick fix was to do the dump once:
```python
columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s
```
mytable[col][row] is the value.
The primary design point for tables treats each column as a vector. A row
is really more like a record. A table is definitely not a matrix. (And it
isn't C-style, i.e., row-major, although the numpy arrays down inside are
implemented as such and are fast.)
The fundamental unit to be working with is mytable[column_label]. If you
care about performance, that is what you care about. And you want to
operate on columns.
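The column-as-vector style can be sketched with plain numpy arrays (the array names here are stand-ins for table columns, not part of the library's API):

```python
import numpy as np

# Column-as-vector style: each column is a numpy array, and an
# operation applies to the whole column at once, with no Python
# loop over individual cells. (col_a / col_b are stand-ins for
# mytable['a'] / mytable['b'].)
col_a = np.array([1.0, 2.0, 3.0])
col_b = np.array([4.0, 6.0, 8.0])

diff = np.abs(col_a - col_b)  # elementwise on the whole column
print(diff.sum())             # -> 12.0
```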
I don't follow what you mean by "complete dump" in this context. All of
this is in memory. But perhaps you mean making copies. All the columns
are numpy arrays down inside. Rows are not.
It would be great if you'd do the timing for columns (not lists constructed
from cells of columns or rows), rows, and cells. Comparisons with Pandas and
NumPy would be great too.
David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley
On Tue, Aug 18, 2015 at 6:57 PM, Xinghao Pan [email protected]
wrote:
Below is a benchmark for comparing Table vs numpy.matrix on an access
pattern that I expect is very common in data science applications.
Essentially, all I'm doing is treating each row as a vector, and attempting
to compute pairwise distances between rows / vectors by iterating over all
their values.

```python
from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250

countsTable = Table([[0 for i in range(0, numDatapoints)] for j in range(0, numFeatures)],
                    [str(j) for j in range(0, numFeatures)])
countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s
```

Running this code shows that iteration over numpy.matrix is about 100x
faster than iterating over Table.columns, which in turn is 100x faster than
using Table.rows.

—
Reply to this email directly or view it on GitHub
#48.
Using `__getitem__` is as fast as using numpy.matrix:

```python
t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s
```
Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:
```python
t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s
```
I'm not sure what is meant by "cell" though.
The only (minor) issue I have with `__getitem__` is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over column_labels, but using [str][int] still feels less natural than the usual [int][int] indexing.)
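A minimal sketch of the label-lookup workaround, using a plain dict and label list as stand-ins for the table and its column_labels (the `cell` helper is hypothetical, not part of the library):

```python
labels = ['0', '1', '2']                         # stand-in for table.column_labels
table = {'0': [1, 4], '1': [2, 5], '2': [3, 6]}  # stand-in for the Table itself

def cell(table, labels, i, j):
    """Value at column number i, row j, via label lookup."""
    return table[labels[i]][j]

print(cell(table, labels, 2, 1))  # -> 6
```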
What if instead of supporting tables with many columns that are one-column-per-feature, we model this project using four columns:
- song name
- song genre
- artist name
- word counts: numpy.array valued column
The word-counts column could be represented as a 2-d array, which we know is fast.
The only headache in achieving this design is fiddling with the read_table
function so that it's possible to produce such a structure.
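A rough sketch of what that layout might look like, using plain numpy arrays (all names and shapes here are illustrative assumptions, not the library's current behavior):

```python
import numpy as np

# Three label columns plus one 2-d array holding every song's word counts.
names   = np.array(['songA', 'songB'])
genres  = np.array(['rock', 'jazz'])
artists = np.array(['X', 'Y'])
counts  = np.zeros((2, 250))   # one row of word counts per song

# Row comparisons then run at numpy speed over the 2-d array:
l1 = np.abs(counts - counts[0]).sum(axis=1)
print(l1.shape)  # -> (2,)
```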
Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on Tables this semester. The best approximation would be to convert first to a numpy.matrix.
A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert into a 5-column Table.
For example,
```python
largeTable = Table.read_table(...)
wordCountsMatrix = largeTable.drop(['UID', 'Title', 'Artist', 'Genre']).matrix()
smallTable = Table(largeTable.select(['UID', 'Title', 'Artist', 'Genre']).columns + wordCountsMatrix.tolist(),
                   ['UID', 'Title', 'Artist', 'Genre', 'Counts'])
```
This looks more relevant now as project 2 is coming up.
Turns out that the fastest way I've found is to use numpy.tile:

```python
(abs(countsMatrix - np.tile(countsMatrix[0,], [numDatapoints, 1]))).sum(1)
```

which is faster than manually computing pairwise distances, even using numpy:

```python
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
```
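For what it's worth, numpy broadcasting gives the same result without materializing the tiled copy (a sketch on a small stand-in array; `counts` here is a plain ndarray rather than a numpy.matrix):

```python
import numpy as np

counts = np.arange(12.0).reshape(3, 4)  # stand-in for countsMatrix
numDatapoints = counts.shape[0]

# np.tile version, as in the comment above:
tiled = np.abs(counts - np.tile(counts[0], [numDatapoints, 1])).sum(1)

# Broadcasting subtracts the first row from every row directly:
broadcast = np.abs(counts - counts[0]).sum(axis=1)

print(np.array_equal(tiled, broadcast))  # -> True
```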
Perhaps we just wrap this call to np.tile into a more intuitive name, such
as pairwise_distance.
For project 2, we might want students to try writing the distance-computation code themselves, so I would vote against putting it in the Table codebase, for now at least.