
Comments (10)

pxinghao commented on June 27, 2024

Looking at the Table code, it seems that calling .columns essentially makes a complete dump (copy) of the Table before [i] indexes into that dump. Would it be possible instead to have a function like .value(i, j) that returns the value of a cell without dumping all the columns, or even the column being accessed? Or to amortize the cost by storing the dumped columns somewhere?

For example, a quick fix was to do the dump once:

columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s


deculler commented on June 27, 2024

mytable[col][row] is the value.

The primary design point for tables treats each column as a vector. A row
is really more like a record. A table is definitely not a matrix. (And it
isn't C-style, i.e., row-major, although the numpy arrays down inside are
implemented as such and are fast.)

The fundamental unit to be working with is mytable[column_label]. If you
care about performance, that is what you care about. And you want to
operate on columns.
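To illustrate the column-oriented style described above, here is a small sketch in which plain numpy arrays stand in for the Table's columns (the shapes mirror the 10-row, 250-feature benchmark in this thread):

```python
import numpy as np

# A sketch of column-oriented computation: each column of a table is a
# numpy array, so arithmetic on whole columns is vectorized.
# "columns" here stands in for the columns of the Table in the thread.
columns = [np.zeros(10) for _ in range(250)]

# L1 distance of row 0 to every row, accumulated column by column:
# each term np.abs(col[0] - col) is one vectorized pass over all rows.
dists = sum(np.abs(col[0] - col) for col in columns)
```

The inner loops over individual cells disappear; the only Python-level loop left is over the 250 columns.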

I don't follow what you mean by "complete dump" in this context. All of
this is in memory. But perhaps you mean making copies. All the columns
are numpy arrays down inside. Rows are not.

It would be great if you'd do the timing for columns (not for lists
constructed from the cells of columns or rows), for rows, and for cells.
Comparisons with Pandas and NumPy would be great too.

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley

On Tue, Aug 18, 2015 at 6:57 PM, Xinghao Pan [email protected]
wrote:

Below is a benchmark for comparing Table vs numpy.matrix on an access
pattern that I expect is very common in data science applications.
Essentially, all I'm doing is treating each row as a vector, and attempting
to compute pairwise distances between rows / vectors by iterating over all
their values.

from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0,numDatapoints)] for j in range(0,numFeatures)], [str(j) for j in range(0,numFeatures)])

countsMatrix = countsTable.matrix().transpose()

t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s

t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s

t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')

Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s

Running this code shows that iteration over numpy.matrix is about 60x
faster than iterating over Table.columns, which in turn is about 70x faster
than using Table.rows.




pxinghao commented on June 27, 2024

Using __getitem__ is as fast as using numpy.matrix:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s

Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s

I'm not sure what is meant by "cell" though.

The only (minor) issue I have with __getitem__ is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over column_labels, but using [str][int] still feels less natural than the usual [int][int] indexing.)
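One possible shim for positional access, sketched with a dict of arrays standing in for a Table (the `labels` list plays the role of Table.column_labels, and `column_by_index` is a hypothetical helper, not part of the library):

```python
import numpy as np

# Stand-in for a Table: an ordered list of labels plus a mapping from
# label to column array (labels mimic the str(j) labels used above).
labels = [str(j) for j in range(4)]
table = {lab: int(lab) * np.arange(5) for lab in labels}

def column_by_index(i):
    # Hypothetical helper: route [int] indexing through the label list,
    # so callers can write column_by_index(k)[row] instead of table[str(k)][row].
    return table[labels[i]]
```

This keeps the fast label-based lookup underneath while exposing the [int][int] style the comment asks for.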


papajohn commented on June 27, 2024

What if, instead of supporting tables with many columns (one column per feature), we model this project using four columns:

  • song name
  • song genre
  • artist name
  • word counts: a numpy.array-valued column

The word-counts column could be represented as a 2-d array, which we know is fast.

The only headache in achieving this design is fiddling with the read_table function so that it's possible to produce such a structure.


papajohn commented on June 27, 2024

Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on Tables this semester. The best approximation would be to convert first to a numpy.matrix.
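That convert-once approach can be sketched as follows (a plain 2-d numpy array stands in for the result of the Table-to-matrix conversion mentioned above, with the benchmark's shapes):

```python
import numpy as np

# One-time conversion: in the thread this would be countsTable.matrix();
# here a zero-filled 2-d array with the benchmark's shape stands in.
numDatapoints, numFeatures = 10, 250
counts = np.zeros((numDatapoints, numFeatures))

# After conversion, row comparisons run at numpy speed: L1 distance of
# row 0 to every row via broadcasting, with no Python-level loops.
dists = np.abs(counts - counts[0]).sum(axis=1)
```

The conversion is paid once; every subsequent row comparison is a vectorized numpy operation.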


pxinghao commented on June 27, 2024

A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert into a 5-column Table.

For example:

largeTable = Table.read_table(...)
wordCountsMatrix = largeTable.drop(['UID', 'Title', 'Artist', 'Genre']).matrix()
smallTable = Table(list(largeTable.select(['UID', 'Title', 'Artist', 'Genre']).columns) + [wordCountsMatrix.tolist()], ['UID', 'Title', 'Artist', 'Genre', 'Counts'])

(Note the fifth column is wrapped as a single list-valued column, so the five columns match the five labels.)


SamLau95 commented on June 27, 2024

This looks more relevant now as project 2 is coming up.


pxinghao commented on June 27, 2024

Turns out that the fastest way I've found is to use numpy.tile:

(abs(countsMatrix - np.tile(countsMatrix[0,],[numDatapoints, 1]))).sum(1)

which is faster than manually computing pairwise distances, even using numpy:

[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
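As an aside (this is standard numpy broadcasting, not from the thread): the np.tile step is optional, because subtracting a 1-d row from a 2-d array broadcasts that row across all rows automatically. A sketch, using a plain ndarray in place of the np.matrix above:

```python
import numpy as np

numDatapoints, numFeatures = 10, 250
countsMatrix = np.arange(numDatapoints * numFeatures, dtype=float).reshape(
    numDatapoints, numFeatures)

# Explicit tiling, as in the snippet above.
tiled = np.abs(countsMatrix - np.tile(countsMatrix[0, :], [numDatapoints, 1])).sum(1)

# Same result via broadcasting: countsMatrix[0] stretches across rows.
broadcast = np.abs(countsMatrix - countsMatrix[0]).sum(1)
```

Both produce identical distances; broadcasting simply skips materializing the tiled copy.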


papajohn commented on June 27, 2024

Perhaps we just wrap this call to np.tile into a more intuitive name, such
as pairwise_distance.

On Fri, Oct 30, 2015 at 12:52 AM, Xinghao Pan [email protected] wrote the comment quoted above.


pxinghao commented on June 27, 2024

For project 2, we might want students to write the distance-computation code themselves, so I would vote against putting it in the Table codebase, for now at least.

