
Comments (10)

pxinghao commented on June 27, 2024

Looking at the Table code, it seems that calling .columns essentially makes a complete dump (copy) of the Table before [i] indexes into that dump. Would it be possible instead to have a function like .value(i, j) that returns the value of a cell without dumping all the columns, or even the column being accessed? Or to amortize the cost by storing the dumped columns somewhere?

For example, a quick fix was to do the dump once:

columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s


deculler commented on June 27, 2024

mytable[col][row] is the value.

The primary design point for tables treats each column as a vector. A row
is really more like a record. A table is definitely not a matrix. (And it
isn't C-style, i.e., row-major, although the numpy arrays down inside are
implemented as such and are fast.)

The fundamental unit to be working with is mytable[column_label]. If you
care about performance, that is what you care about. And you want to
operate on columns.
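To illustrate the column-oriented style described above, here is a small sketch in which plain numpy arrays stand in for the Table's columns (the shapes mirror the 10-row, 250-feature benchmark in this thread):

```python
import numpy as np

# A sketch of column-oriented computation: each column of a table is a
# numpy array, so arithmetic on whole columns is vectorized.
# "columns" here stands in for the columns of the Table in the thread.
columns = [np.zeros(10) for _ in range(250)]

# L1 distance of row 0 to every row, accumulated column by column:
# each term np.abs(col[0] - col) is one vectorized pass over all rows.
dists = sum(np.abs(col[0] - col) for col in columns)
```

The inner loops over individual cells disappear; the only Python-level loop left is over the 250 columns.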

I don't follow what you mean by "complete dump" in this context. All of
this is in memory. But perhaps you mean making copies. All the columns
are numpy arrays down inside. Rows are not.

It would be great if you'd do the timing for columns (not for lists
constructed from the cells of columns or rows), for rows, and for cells.
Comparisons with Pandas and NumPy would be great too.

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley

On Tue, Aug 18, 2015 at 6:57 PM, Xinghao Pan [email protected]
wrote:

Below is a benchmark for comparing Table vs numpy.matrix on an access
pattern that I expect is very common in data science applications.
Essentially, all I'm doing is treating each row as a vector, and attempting
to compute pairwise distances between rows / vectors by iterating over all
their values.

from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0,numDatapoints)] for j in range(0,numFeatures)], [str(j) for j in range(0,numFeatures)])

countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s

Running this code shows that iteration over numpy.matrix is about 60x
faster than iterating over Table.columns, which in turn is about 70x faster
than using Table.rows.




pxinghao commented on June 27, 2024

Using __getitem__ is as fast as using numpy.matrix:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s

Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s

I'm not sure what is meant by "cell" though.

The only (minor) issue I have with __getitem__ is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over column_labels, but using [str][int] still feels less natural than the usual [int][int] indexing.)
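One possible shim for positional access, sketched with a dict of arrays standing in for a Table (the `labels` list plays the role of Table.column_labels, and `column_by_index` is a hypothetical helper, not part of the library):

```python
import numpy as np

# Stand-in for a Table: an ordered list of labels plus a mapping from
# label to column array (labels mimic the str(j) labels used above).
labels = [str(j) for j in range(4)]
table = {lab: int(lab) * np.arange(5) for lab in labels}

def column_by_index(i):
    # Hypothetical helper: route [int] indexing through the label list,
    # so callers can write column_by_index(k)[row] instead of table[str(k)][row].
    return table[labels[i]]
```

This keeps the fast label-based lookup underneath while exposing the [int][int] style the comment asks for.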


papajohn commented on June 27, 2024

What if, instead of supporting tables with many columns (one column per feature), we model this project using four columns:

  • song name
  • song genre
  • artist name
  • word counts: a numpy.array-valued column

The word-counts column could be represented as a 2-d array, which we know is fast.

The only headache in achieving this design is fiddling with the read_table function so that it's possible to produce such a structure.


papajohn commented on June 27, 2024

Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on Tables this semester. The best approximation would be to convert first to a numpy.matrix.
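That convert-once approach can be sketched as follows (a plain 2-d numpy array stands in for the result of the Table-to-matrix conversion mentioned above, with the benchmark's shapes):

```python
import numpy as np

# One-time conversion: in the thread this would be countsTable.matrix();
# here a zero-filled 2-d array with the benchmark's shape stands in.
numDatapoints, numFeatures = 10, 250
counts = np.zeros((numDatapoints, numFeatures))

# After conversion, row comparisons run at numpy speed: L1 distance of
# row 0 to every row via broadcasting, with no Python-level loops.
dists = np.abs(counts - counts[0]).sum(axis=1)
```

The conversion is paid once; every subsequent row comparison is a vectorized numpy operation.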


pxinghao commented on June 27, 2024

A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert into a 5-column Table.

For example:

largeTable = Table.read_table(...)
wordCountsMatrix = largeTable.drop(['UID', 'Title', 'Artist', 'Genre']).matrix()
smallTable = Table(list(largeTable.select(['UID', 'Title', 'Artist', 'Genre']).columns) + [wordCountsMatrix.tolist()], ['UID', 'Title', 'Artist', 'Genre', 'Counts'])

(Note the fifth column is wrapped as a single list-valued column, so the five columns match the five labels.)


SamLau95 commented on June 27, 2024

This looks more relevant now as project 2 is coming up.


pxinghao commented on June 27, 2024

Turns out that the fastest way I've found is to use numpy.tile:

(abs(countsMatrix - np.tile(countsMatrix[0,],[numDatapoints, 1]))).sum(1)

which is faster than manually computing pairwise distances, even using numpy:

[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
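As an aside (this is standard numpy broadcasting, not from the thread): the np.tile step is optional, because subtracting a 1-d row from a 2-d array broadcasts that row across all rows automatically. A sketch, using a plain ndarray in place of the np.matrix above:

```python
import numpy as np

numDatapoints, numFeatures = 10, 250
countsMatrix = np.arange(numDatapoints * numFeatures, dtype=float).reshape(
    numDatapoints, numFeatures)

# Explicit tiling, as in the snippet above.
tiled = np.abs(countsMatrix - np.tile(countsMatrix[0, :], [numDatapoints, 1])).sum(1)

# Same result via broadcasting: countsMatrix[0] stretches across rows.
broadcast = np.abs(countsMatrix - countsMatrix[0]).sum(1)
```

Both produce identical distances; broadcasting simply skips materializing the tiled copy.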


papajohn commented on June 27, 2024

Perhaps we just wrap this call to np.tile into a more intuitive name, such
as pairwise_distance.

On Fri, Oct 30, 2015 at 12:52 AM, Xinghao Pan [email protected] wrote the comment quoted above.


pxinghao commented on June 27, 2024

For project 2, we might want students to write the distance-computation code themselves, so I would vote against putting it in the Table codebase, for now at least.

