turi-code / sframe Goto Github PK

SFrame: Scalable tabular and graph data-structures built for out-of-core data analysis and machine learning.

License: BSD 3-Clause "New" or "Revised" License

CMake 0.73% Shell 0.37% Python 15.38% C++ 80.66% C 0.99% Makefile 0.07% XSLT 0.51% Smarty 0.01% Perl 0.07% R 0.01% JavaScript 0.14% HTML 0.17% CSS 0.88% Scala 0.04%

sframe's People

Contributors

Stargazers

Watchers

Forkers

ylow jizhihang jasonyaw morning-wind liaobs giserh pthaike ericxsun xsongx hoytak xinchoubiology thirdwing nkhuyu fransal ezhangle lidawen sidsachdev raymick tacaswell tempbottle bbandaru kod3r alicezheng esamanas yonglehou csgxy123 drtobbe nagyistoce yangxingpping deegeorgie gxglhyy guoyinwang mohendra codeaudit evitself m00nd00r harshbalyan srikris gardengnomes asmalllemon cqc-sandbox eotp kakamessi99 loginsight lvchigo shyamalschandra kevgh lucentcosmos haijieg pombredanne cosmasquantum nunb ellasanders mitchellzen yuwin andreacrescini nguyenducnhaty kklarsen donggegithub hihihippp varadhanr besciak ralphmastro krishnatray birdgun datatonic wakamori bobholt quantx79 thsiung chronologic1 bin2000 pavlvstc ana2s007 chengjinwen cvml ashu11939 imjerrybao jkhoogland chinaquants npeyralans webon100 bruno78 znation andrewlu1992 jango2015 fnet123 delkyd neuroradiology wzugang bhushan1002 tobyroseman dataeinstein egodfred try4dipen anukat2015 nevermoore gianbalsamo smopart jornason

sframe's Issues

how to read csv with mutiple list fields?

my csv file, one single line with TAB sep:
--------------------test.csv-------------------------
xxx [1,2,3] [1,2,3]

import sframe as sf
sf.SFrame.read_csv("./test", delimiter="\t", column_type_hints=[str, list, list], header=False)

And I get error like this:
RuntimeError: Runtime Exception. column_type_hints has different size from actual number of columns: column_type_hints.size()=3;number of columns=4

pls help.
Thanks!

Missed optimization for SFrame column indexing

When sf['column'] is called, it spins off an SArray as a new column. As a result, it doesn't preserve caches. This causes unexpected behavior, as reported by a user on the forum:

In [7]: arr1 = array.array('d',[random.random() for item in range(4096)])
...
In [13]: sf = gl.SFrame({'data':[arr1 for item in range(10000)]})

In [14]: sa = sf['data']

In [15]: %timeit sa[1]
The slowest run took 6524.06 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 154 µs per loop

In [16]: %timeit sf['data'][1]
1 loops, best of 3: 902 ms per loop

(Note the stark differences in timing. ) The solution is to keep references to created sarrays when retrieving a column from an SFrame.

Iterate on different chunks of a sframe

Hi, I'm trying to do iterate over a large sframe to do something like example below

import sframe

def transform(x):
    print x

sf = sframe.SFrame()
sf['a'] = [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5]
sf['b'] = [1, 2, 1, 2, 3, 3, 1, 4, None, 2, 3]

for r in sf:
    for x in r:
        transform(x)

Is there a way to use multiprocessing to run something like this, when I try to chunk the sframe I get the same issue mentioned on the forums here

TimeSeries shift for multiple columns ?

Hi,

We are using a TimeSeries and shift, but we want the shift to the values for a certain group, something like that:
We got the following data :
ts, group, value
1 a q
1 b r
2 a w
2 b t
3 a e
3 b y
4 b u

We want to shift the values in column "value" for a group using timestamp:
ts, group, value
1 a r
2 a w
3 a t
1 b e
2 b y
3 b u
4 b -

Support for using temporary credentials

Sframe read_csv doesn't support using temporary credentials as it doesn't pass the token from the environment and only passes the key and secret.

http://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_use-resources.html#using-temp-creds-sdk-ec2-instances

C++ API for read_csv

I can't seem to find how to load an SFrame from a CSV using the C++ API. Can you point me in the right direction?

Problem in CSV reader

This CSV file loads fine with pandas, but seems to choke the SFrame CSV parser:

In [24]: df=gl.SFrame.read_csv("batch.csv")
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","31HQ4X3T3S9XZSVFGK2JUSI8Z2RSLJ","A3579N2TITA69M","Submitted","Fri Oct..."
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","32N49TQG3GHWV1LFDOIYW1M44UOVAB","A3BBGFC0RG39HK","Submitted","Fri Oct..."
PROGRESS: 2 lines failed to parse correctly
PROGRESS: Finished parsing file /Users/malmaud/tmp/batch.csv
PROGRESS: Parsing completed. Parsed 0 lines in 0.010217 secs.
Insufficient number of rows to perform type inference
Could not detect types. Using str for each column.
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","31HQ4X3T3S9XZSVFGK2JUSI8Z2RSLJ","A3579N2TITA69M","Submitted","Fri Oct..."
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","32N49TQG3GHWV1LFDOIYW1M44UOVAB","A3BBGFC0RG39HK","Submitted","Fri Oct..."
PROGRESS: 2 lines failed to parse correctly
PROGRESS: Finished parsing file /Users/malmaud/tmp/batch.csv
PROGRESS: Parsing completed. Parsed 0 lines in 0.01067 secs.

Configure with a gcc toolchain download from the dato(amazon) fails > ERROR 403: Forbidden

Hi,
I tried installing using the dato gcc tool chain on my CentOS 6.7 as per instructions :
./configure --toolchain=https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz

This configuration fails with the following error:
Downloading dato_deps_linux_gcc_4.9.2.tar.gz from https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz ...
--2016-01-13 18:58:04-- https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.162.24
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.162.24|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-01-13 18:58:04 ERROR 403: Forbidden.

Please help me in this regard.
Thanks!

Compile as dynamic library?

Is there a supported set of configuration options to compile dynamic instead of static libraries?

The context here is I'm trying to write an in-process wrapper for SFrames for the Julia programming language, which will require loading the graphlab libraries at runtime.

Missing value treatment in the sum operator

Something seems odd with how missing values get treated in the sum operator.

In [19]: print gl.SArray([]).sum()
None

In [21]: print gl.SArray([None]).sum()
0.0

In [20]: print sum([])

In [30]: print gl.SArray([None], array.array).sum()
array('d')
0

read_csv does not interpret na_values properly when column_type is int

In [2]: !head bad_example_read_csv.csv
k,v
a,1
b,1
c,-8
d,3


In [6]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=['-8'])
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011829 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011498 secs.

In [7]: sf
Out[7]:
Columns:
    k   str
    v   int

Rows: 4

Data:
+---+----+
| k | v  |
+---+----+
| a | 1  |
| b | 1  |
| c | -8 |
| d | 3  |
+---+----+
[4 rows x 2 columns]

In [8]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=[-8])
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011929 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011881 secs.

In [9]: sf
Out[9]:
Columns:
    k   str
    v   int

Rows: 4

Data:
+---+----+
| k | v  |
+---+----+
| a | 1  |
| b | 1  |
| c | -8 |
| d | 3  |
+---+----+
[4 rows x 2 columns]

In [13]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=['-8'], column_type_hints=str)
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.010938 secs.

In [14]: sf
Out[14]:
Columns:
    k   str
    v   str

Rows: 4

Data:
+---+------+
| k |  v   |
+---+------+
| a |  1   |
| b |  1   |
| c | None |
| d |  3   |
+---+------+
[4 rows x 2 columns]

Why unity_server start when I only import sfame

I have install both GraphLab (student license ) and sframe (installed from pypi). When I run this code below (taken from regression class on Coursera):

import sframe

sales = sframe.SFrame("kc_house_data.gl/")

# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray

# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = prices.size() # when prices is an SArray .size() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, the .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)

I notice in the console:

"C:\Users\tho\AppData\Local\Dato\Dato Launcher\python.exe" F:/_python_test/regression/week_1/_dataset/script.py
[INFO] Using MetricMock instead of real metrics, mode is: QA
[INFO] Start server at: ipc:///tmp/graphlab_server-3104 - Server binary: C:\Users\tho\AppData\Local\Dato\Dato Launcher\lib\site-packages\sframe\unity_server.exe - Server log: C:\Users\tho\AppData\Local\Temp\sframe_server_1449199266.log.0
[INFO] GraphLab Server Version: 1.6
average price via method 1: 540088.141905
average price via method 2: 540088.141905
[INFO] Stopping the server connection.

Process finished with exit code 0

Why GraphLab Server loaded and what is it's role here?

Thank you in advance.

Docstring of SFrame.print_rows evaluates sys.stdout

The docstring signature of SFrame.print_rows is supposed to look like this:

def print_rows(self, num_rows=10, num_columns=40, max_column_width=30,
               max_row_width=80, output_file=sys.stdout):

However, looking at the docstring in the python shell it looks like sys.stdout somehow got evaluated:

print_rows(self, num_rows=10, num_columns=40, max_column_width=30,
           max_row_width=80, output_file=<open file '<stdout>', mode 'w'>) unbound sframe.data_structures.sframe.SFrame method

Reduce disk utilization

I notice that the SFrame folder comes in at a sizable 5.5Gb after compiling libunity. I'm interested in extracting the minimum runtime dependencies of libunity plus the header files, something like what make install would do for many packages.

Is that (or is there a plan to) make that functionality available? Or it's not too much trouble, would you mind offering me some guidance on how to do it manually?

Thanks!

Deepcopy an sframe

import sframe
sf = sframe.SFrame()
sf['a'] = [1,1,1,1,             2,2,2,           3,     4,4,      5]
sf['b'] = [1,2,1,2,             3,3,1,           4,     None, 2,  3]
sf2= deepcopy(sf)

Error logs


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-f0d0a6912fba> in <module>()
----> 1 sf2= deepcopy(sf)

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
    188                             raise Error(
    189                                 "un(deep)copyable object of type %s" % cls)
--> 190                 y = _reconstruct(x, rv, 1, memo)
    191 
    192     memo[d] = y

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _reconstruct(x, info, deep, memo)
    332     if state:
    333         if deep:
--> 334             state = deepcopy(state, memo)
    335         if hasattr(y, '__setstate__'):
    336             y.__setstate__(state)

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
    161     copier = _deepcopy_dispatch.get(cls)
    162     if copier:
--> 163         y = copier(x, memo)
    164     else:
    165         try:

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _deepcopy_tuple(x, memo)
    235     y = []
    236     for a in x:
--> 237         y.append(deepcopy(a, memo))
    238     d = id(x)
    239     try:

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
    161     copier = _deepcopy_dispatch.get(cls)
    162     if copier:
--> 163         y = copier(x, memo)
    164     else:
    165         try:

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _deepcopy_dict(x, memo)
    255     memo[id(x)] = y
    256     for key, value in x.iteritems():
--> 257         y[deepcopy(key, memo)] = deepcopy(value, memo)
    258     return y
    259 d[dict] = _deepcopy_dict

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
    188                             raise Error(
    189                                 "un(deep)copyable object of type %s" % cls)
--> 190                 y = _reconstruct(x, rv, 1, memo)
    191 
    192     memo[d] = y

/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _reconstruct(x, info, deep, memo)
    327     if deep:
    328         args = deepcopy(args, memo)
--> 329     y = callable(*args)
    330     memo[id(x)] = y
    331 

/Users/sbajaj/miniconda/lib/python2.7/copy_reg.pyc in __newobj__(cls, *args)
     91 
     92 def __newobj__(cls, *args):
---> 93     return cls.__new__(cls, *args)
     94 
     95 def _slotnames(cls):

sframe/cython/cy_sframe.pyx in sframe.cython.cy_sframe.UnitySFrameProxy.__cinit__()

TypeError: __cinit__() takes at least 1 positional argument (0 given)

Undefined Reference Error on Windows

I followed the steps as suggested at the page "Setting up for Windows".

In my MSYS2 shell, I first perform the configuration, it goes well.
After that, I enter the directory ${SFrameRoot}/debug/oss_src/unity/python and "make",
then I get information like:

and some undefined reference errors:

Could anyone give me some help?

S3 Read Speed and Error

I encountered some error when loading a big sframe directly from s3 and do processing. This error disappear after I directly download from S3, so I guess there might be some problem in current S3 reading API.

opened a issue to track this as discussed with @ylow

Remove cppipc layer from SFrame

The cppipc layer becomes unnecessary once unity_server is "inproc". Removing the cppipc layer entirely will by pass data structure serialization, provide a tighter integration with python, and speed up all object passing between python and glc.

Allow SFrame to be installed with conda

Right now SFrame package has to be installed with pip, it would be nice if it could be installed with 'conda install'.

SFrame doesn't have a `set_credentials` function

So it currently only supports passing credentials from environment variables.

Incompatible with GraphLab package

importing both sframe and graphlab can result in a crash.

import sframe
import graphlab
g=sframe.SFrame()

New version ?

The current released version is 168 commits behind, do you have plans to release a newer version of SFrame ?

Python Lambda Execution Robustness

The current mode of python lambda parallelization involves spinning of a collection of subprocesses (pylambda_worker) each of which is dynamically linked to libpython.so. These pylambda_workers then connect back to the original process via interprocess shared memory.

This causes issues in certain situations :

There are multiple python versions, in which case we may pick up the wrong libpython. (We conflict really badly on Mac when there is brew python as well as system python)
Some self-built python installations on Linux in particular do not build libpython.so

The proposal is to flip the linking around.

instead of spinning of pylambda_workers, we spin off python processes.
These python processes then use ctypes to load a pylambda_worker shared library and call into it
This way we resolve our python symbols via python itself.
Change pylambda

run_cpp_tests only work if C++ tests already built

Building the C++ tests doesn't happen when you build everything else. If you try to run ./oss_local_scripts/run_cpp_tests.py when you have not built the C++ test, you get a super unhelpful error message.

SFrame Python Side Cache Needs to be invalidated when the SFrame column set is modified

This behavior is wrong. The python side cache "SFrame._cache" needs to be invalidated when the column set changes.

>>> g=gl.SFrame({'a':[1]})
>>> g[0]
{'a': 1}
>>> g['a'] = g['a'] + 1
>>> g[0]
{'a': 1}

Enable strict evaluation mode for debugging purposes

Sometimes I get a call stack that is very hard to debug -- resulting in an exception like:

File "graphlab/cython/cy_sarray.pyx", line 84, in graphlab.cython.cy_sarray.UnitySArrayProxy.size
  File "graphlab/cython/cy_sarray.pyx", line 85, in graphlab.cython.cy_sarray.UnitySArrayProxy.size
RuntimeError: Runtime Exception. Cannot convert python object Decimal to flexible_type.

Since the apply method is executed lazily, this error (with call stack) tends to appear far away from the code that actually causes the error. In a large program it is hard to track down where this error was introduced.

It would be nice to have a mode or flag to enable strict evaluation (across the board) for debugging purposes, so that it would be easier to track down type mismatches or exceptions raised in apply methods, without having to explicitly materialize at each point in the code.

gl.connect_odbc() fails

Repro steps:

Start Python session

import graphlab as gl
db = gl.connect_odbc(<my awesome connection string>)

Expected Behavior:

db should now be connected to database

Actual Behavior:

AttributeError: 'module' object has no attribute '_odbc_connection'

PyLambda worker testing.

The pylambda workers allow efficient parallel execution of functions over sframe and sarray type, but they have been historically plagued by library resolution issues and linking issues. The previous pull request should fix all of these issues. The primary idea is that it spawns separate processes of the python interpreter, which then loads the pylambda workers as a library and by file, instead of as a separate process linked to libpython. Both of these things are reliably determined at startup, whereas libpython is not.

To ensure that this change indeed works, we need to test that it runs on a number of different configurations and systems.

To sign off on this feature, we need to make sure that nosetests .../sframe/test/test_lambda_workers.py
runs on the following systems:

Implicit cast to bool gives unexpected behavior for SArray

When I do the following:

x = gl.SArray([1,None,0])
y = gl.SArray([0,0,1])
print x and y

I expected that (like many other operators), the and operator would be overloaded to do an element-wise and. The result I get actually has the expected type and expected number of rows, but not the result I expected. I get back y:

[0, 0, 1]

This is because apparently Python likes to implicitly-cast-to-bool for comparison purposes, and return the original type from the expression (JS does this as well, so I should've realized that's what was happening). The equivalent expression to x and y, without implicit type coercion, is actually y if bool(x) else x.

Explicit is better than implicit probably applies here. Why Python originally allowed this implicit casting is beyond me.

@hoytak suggests that we adopt Numpy behavior here and error on implicit cast to bool, and I'm inclined to agree. I assume TypeError?

SFrame self-append performance and Query optimization issues

SFrame.append() is slow when it iteratively self-append.
The following script reproduce the slow performance:

def f(n):
    sf = gl.SFrame({"h":range(0,100)})
    for i in range(0,100):
       sf.add_column(sf["h"],name=str(i))
    for i in range(0,n):
        sf = sf.append(sf)
    sf.__materialize__()
for i in range(5, 15):
    print i
    %timeit f(i)

Another related issue is that print sf in the following script collapses the query tree (calling materialize()) while it only needs to print the first 10 rows.

sf = gl.SFrame({'h':range(0,100)})
for i in range(0,100):
   sf.add_column(sf['h'],name=str(i))
for i in range(0,14):
   sf = sf.append(sf)
print sf

On windows, libpylambda_worker.dll needs manifest to load the correct libraries.

The new libpylambda_worker.dll needs to have an accompanying manifest so that it loads all dependent libraries from the local graphlab directory. The main issue encountered here is that old openssl libraries are not compatible, and many openssl installers put libraries by default into C:\System32. Thus if someone has an old version of openssl installed, then the pylambda workers won't start.

escape_char is ignored in export_csv

Hi,

When trying to use an empty string as escape_char, it ignores it.
When trying other characters, it works.
This is preventing us from disabling escapement.

SFrame throwing various errors when output-ing or retrieving a specific row

There's some problems with my SFrame that I've loaded but it just can't seem to dump into a binary or csv format. Is it a memory issue or is it how SFrame works such that NoneType appears often when iterating or saving the SFrame?

For more information, see http://stackoverflow.com/questions/34654901/how-do-i-find-specific-rows-that-throws-an-error-when-saving-in-graphlab-sframe

When inserting a list of int it is automatically converted to list of float

import sframe
sf = sframe.SFrame()
sf['a'] = [1,1,1,1,             2,2,2,           3,     4,4,      5]
sf['b'] = [1,2,1,2,             3,3,1,           4,     None, 2,  3]

af = sf.groupby("a", {'b':sframe.aggregate.CONCAT("b")})
af['c'] = af['b'].apply(lambda x: list(set(x)))

af will have the columns as float instead of preserving the int type:

+---+--------------+------------+
| a |      b       |     c      |
+---+--------------+------------+
| 3 |     [4]      |   [4.0]    |
| 1 | [2, 1, 1, 2] | [1.0, 2.0] |
| 2 |  [3, 3, 1]   | [1.0, 3.0] |
| 5 |     [3]      |   [3.0]    |
| 4 |     [2]      |   [2.0]    |
+---+--------------+------------+```

sframe Cannot create sarray of value type <type 'unicode'>

The testing code is below:

import sframe as sf
t = sf.SFrame({"id": ['a', 'b', 'c'], 'value': [1, 2, 3]})
t['str'] = 'aaa'
t['uni'] = u'aa'

Will sarray.from_const support unicode value?

Python 3 compatibility

The package currently only supports Python 2.7x.

Unable to parse line...

When I used SFrame to read_csv, I came across error below:

PROGRESS: Unable to parse line ""###0000","{}","{""-2sxY5OvAJh4UT3w"":""不vvv""}""

PROGRESS: Unable to parse line ""**666996","{}","{""uFx1XYq5tjyFt8--"":""八八\\""}""

Then What's wrong with those lines? could somebody tell me how the SFrame to parse line?

By the Way, the file containing these lines are generated from SFrame itself by save method.

Thank you!

Update Docstrings

Docstrings still reference the package name as being graphlab. There are other incorrect references to GraphLab Create.

The output of a sframe[row] is a dict, which messes up the output

Being new to the platform, I wanted to look at a row of data and did:

Which is great, except the columns are out of order from the original SFrame, which made me think I had a bug, and kept looking for a while. Then I decided to save the SFrame to disk (as CSV), and BOOM everything is as expected.

This is very annoying that the columns are not in the order displayed. Hoyt says that we could use an ordereddict and it would fix this.

Query optimizer splits binary_transform in ordering filter operation.

The SFrame query optimizer, in prioritizing a filter operation, incorrectly allows a binary_transform with multiple outputs to be split and thus executed twice. Here's the input graph:

And it produces this result:

The reason is that the filter optimization does not correctly detect if it is splitting the output of a binary transform.

graphlab installation failure, solved

Ubuntu 15.04 Python 2.7 IPython 4.0.0

Having used graphlab in the introductory section of your Coursera program I thought I would make sure I was still up-to-date per the instructions at the beginning of the Regression module

Following

sudo -H pip install --upgrade graphlab-create
sudo -H pip install -U sframe
sudo reboot

ipython
import graphlab as gl

produced the following

In [2]: import graphlab as gl
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-4a710214f762> in <module>()
----> 1 import graphlab as gl

/usr/local/lib/python2.7/dist-packages/graphlab/__init__.py in <module>()
     52 from graphlab.util import set_runtime_config
     53 
---> 54 import graphlab.connect as _mt
     55 import graphlab.connect.aws as aws
     56 import visualization

/usr/local/lib/python2.7/dist-packages/graphlab/connect/__init__.py in <module>()
     29 """ The module usage metric tracking object """
     30 from graphlab.util.config import DEFAULT_CONFIG as _default_local_conf
---> 31 from graphlab.util.metric_tracker import MetricTracker as _MetricTracker
     32 
     33 

/usr/local/lib/python2.7/dist-packages/graphlab/util/metric_tracker.py in <module>()
     21 import uuid
     22 import copy as _copy
---> 23 import requests as _requests
     24 import sys
     25 import urllib as _urllib

/usr/local/lib/python2.7/dist-packages/requests/__init__.py in <module>()
     51 # Attempt to enable urllib3's SNI support, if possible
     52 try:
---> 53     from .packages.urllib3.contrib import pyopenssl
     54     pyopenssl.inject_into_urllib3()
     55 except ImportError:

/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py in <module>()
     68 _openssl_versions = {
     69     ssl.PROTOCOL_SSLv23: OpenSSL.SSL.SSLv23_METHOD,
---> 70     ssl.PROTOCOL_SSLv3: OpenSSL.SSL.SSLv3_METHOD,
     71     ssl.PROTOCOL_TLSv1: OpenSSL.SSL.TLSv1_METHOD,
     72 }

AttributeError: 'module' object has no attribute 'PROTOCOL_SSLv3'

Looking at the error message and having noticed that your installation had moved requests back to an earlier level, I tried

sudo -H pip install -U pyopenssl
sudo -H pip install -U urllib3
sudo -H pip install -U requests

... and this seemed to do the trick.

SFrame read_csv read corrupted data when type inference is incorrect

SFrame uses maybe first 1000 rows to infer column types. When a column is inferred as 'int', and a 'str' is encountered later, the parser will read the first valid digit in the string as value, or discard the str if there is no valid digit.

For instance, create 'a.csv' like follows:

A,B
0,1
0.1
...
// repeat 100 times
...
9a,1
a,1

SFrame.read_csv('a.csv').tail()

+---+---+
| A | B |
+---+---+
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 9 | 1 |
+---+---+

Note that last row is "9, 1".

The expected behavior should be either throw that row, or lift the inferred column type to str and take it. Parse row partially corrupts the data.

Use dateutil for automatically parsing datetime formats?

https://dateutil.readthedocs.org/en/latest/_modules/dateutil/parser.html
http://cpansearch.perl.org/src/MUIR/Time-modules-2013.0912/lib/Time/ParseDate.pm

Groupby sum vs SArray sum on vector operations behave differently when using vectors of different lengths.

In groupby we get None, while in sum, we get an error.

In [49]: gl.SFrame({'c': ['a', 'b', 'b'], 'v': [[1], [1], [1, 1]]}).groupby('c', gl.aggregate.SUM('v'))
Out[49]:
Columns:
    c   str
    Vector Sum of v array

Rows: 2

Data:
+---+-----------------+
| c | Vector Sum of v |
+---+-----------------+
| a |      [1.0]      |
| b |       None      |
+---+-----------------+
[2 rows x 2 columns]

In [50]: print gl.SArray([[1], [1,1]]).sum()
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-50-db39ee6eb4ec> in <module>()
----> 1 print gl.SArray([[1], [1,1]]).sum()

/Users/srikris/miniconda/envs/graphlab/lib/python2.7/site-packages/graphlab/data_structures/sarray.pyc in sum(self)
   1970         """
   1971         with cython_context():
-> 1972             return self.__proxy__.sum()
   1973
   1974     def mean(self):

/Users/srikris/miniconda/envs/graphlab/lib/python2.7/site-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

RuntimeError: Runtime Exception. Cannot perform sum over vectors of variable length.

Show methods not defined

Just wanted to check if this is a problem, or if the show method is genuinely unimplemented in the open source tools:

In [11]: sframe.SArray([1,2]).show()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-11-246d830d08e3> in <module>()
----> 1 sframe.SArray([1,2]).show()

/Users/malmaud/tmp/SFrame/debug/oss_src/unity/python/sframe/data_structures/sarray.pyc in show(self, view)
   2566         """
   2567         from ..visualization.show import show
-> 2568         show(self, view=view)
   2569 
   2570     def item_length(self):

/Users/malmaud/tmp/SFrame/deps/conda/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
    162             self._cache[types] = func
    163         try:
--> 164             return func(*args, **kwargs)
    165 
    166         except MDNotImplementedError:

/Users/malmaud/tmp/SFrame/debug/oss_src/unity/python/sframe/visualization/show.pyc in show(obj, **kwargs)
     12 @show_dispatch(object)
     13 def show(obj, **kwargs):
---> 14     raise NotImplementedError("Show for object type " + str(type(obj)))

NotImplementedError: Show for object type <class 'sframe.data_structures.sarray.SArray'>

Remove load_model

We should not have: sframe.load_model

Memory Size Detection

Currently, we try to autotune the amount of memory we use by detecting the amount of system memory. However, this is problematic when we are run inside of docker since we detect the total amount of memory on the system (via sysinfo), rather the amount of memory allocated to us. (See http://fabiokung.com/2014/03/13/memory-inside-linux-containers/, moby/moby#12394)

As of docker 1.8 which is rather new, we should be able to look into /sys/fs/cgroups or /proc/self/cgroup or something like that. But not everyone will be on docker 1.8, so we need an API workaround as well.

Use /sys/fs/cgroups to determine our memory limit and use that
?Alternatives?
Add an sframe.set_memory_limit function that auto-configures the memory limit based on what the user tells us. This is the ultimate fallback.

__abs__ corresponds to abs(sf['x'])
__neg__ corresponds to -sf['x']
__pos__ corresponds to +sf['x']
__pow__ corresponds to sf['x'] ** 2

These are pretty low hanging fruit but quite useful for compatibility with standard Python idioms.