turi-code / sframe
SFrame: Scalable tabular and graph data-structures built for out-of-core data analysis and machine learning.
License: BSD 3-Clause "New" or "Revised" License
My CSV file is a single line with tab separators:
--------------------test.csv-------------------------
xxx [1,2,3] [1,2,3]
import sframe as sf
sf.SFrame.read_csv("./test.csv", delimiter="\t", column_type_hints=[str, list, list], header=False)
And I get this error:
RuntimeError: Runtime Exception. column_type_hints has different size from actual number of columns: column_type_hints.size()=3;number of columns=4
Please help. Thanks!
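One common cause of this mismatch (an assumption worth checking, not a confirmed diagnosis) is a trailing tab on the line, which produces an extra empty fourth field. This can be verified independently of SFrame with the standard csv module:

```python
import csv
import io

# Hypothetical reproduction of the one-line file: note the trailing tab.
line_with_trailing_tab = "xxx\t[1,2,3]\t[1,2,3]\t\n"
rows = list(csv.reader(io.StringIO(line_with_trailing_tab), delimiter="\t"))

# A trailing delimiter yields a fourth, empty field, so read_csv would
# need four entries in column_type_hints, not three.
print(len(rows[0]))
```

If the file really has four fields per line, either strip the trailing tab or pass a fourth type hint.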
When sf['column'] is called, it spins off a new SArray for the column. As a result, caches are not preserved. This causes unexpected behavior, as reported by a user on the forum:
In [7]: arr1 = array.array('d',[random.random() for item in range(4096)])
...
In [13]: sf = gl.SFrame({'data':[arr1 for item in range(10000)]})
In [14]: sa = sf['data']
In [15]: %timeit sa[1]
The slowest run took 6524.06 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 154 µs per loop
In [16]: %timeit sf['data'][1]
1 loops, best of 3: 902 ms per loop
(Note the stark difference in timing.) The solution is to keep references to the created SArrays when retrieving a column from an SFrame.
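The proposed fix can be sketched with a plain-Python stand-in (FakeSFrame and its internals are illustrative names, not the real API): cache the column object the first time it is created, so repeated lookups return the same object and its internal caches survive.

```python
class FakeSFrame(object):
    """Toy stand-in for an SFrame that caches created column objects."""

    def __init__(self, columns):
        self._columns = columns  # column name -> list of values
        self._cache = {}         # column name -> previously created column object

    def __getitem__(self, name):
        # Reuse the already-created column instead of spinning off a new one;
        # the list here stands in for a freshly constructed SArray.
        if name not in self._cache:
            self._cache[name] = list(self._columns[name])
        return self._cache[name]


sf_toy = FakeSFrame({'data': [1, 2, 3]})
assert sf_toy['data'] is sf_toy['data']  # same object, so caches are preserved
```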
Hi, I'm trying to iterate over a large SFrame to do something like the example below:
import sframe

def transform(x):
    print x

sf = sframe.SFrame()
sf['a'] = [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5]
sf['b'] = [1, 2, 1, 2, 3, 3, 1, 4, None, 2, 3]

for r in sf:
    for x in r:
        transform(x)
Is there a way to use multiprocessing to run something like this? When I try to chunk the SFrame, I get the same issue mentioned on the forums here.
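Until that is supported natively, one workaround is to pull the rows into plain Python and parallelize over chunks yourself. A sketch using a thread pool (the row dicts and transform below are placeholders for the real data and work):

```python
from multiprocessing.pool import ThreadPool

def transform(x):
    # Placeholder per-value work.
    return None if x is None else x * 2

def process_chunk(rows):
    # rows is a list of dicts, as produced by iterating over an SFrame.
    return [[transform(v) for v in r.values()] for r in rows]

rows = [{'a': 1, 'b': 1}, {'a': 2, 'b': None}, {'a': 3, 'b': 4}]
chunks = [rows[i:i + 2] for i in range(0, len(rows), 2)]

pool = ThreadPool(2)
results = pool.map(process_chunk, chunks)
pool.close()
pool.join()
```

A process pool would look the same, but threads sidestep the pickling and fork/spawn issues mentioned on the forums.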
Hi,
We are using a TimeSeries with shift, but we want to shift the values within a certain group, something like this:
We have the following data:
ts, group, value
1 a q
1 b r
2 a w
2 b t
3 a e
3 b y
4 b u
We want to shift the values in column "value" for each group, using the timestamp:
ts, group, value
1 a r
2 a w
3 a t
1 b e
2 b y
3 b u
4 b -
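Until a grouped shift exists, the same result can be approximated in plain Python: sort by timestamp within each group, then shift the value column by one step (the direction of the shift is an assumption here, since the example output in the question is ambiguous):

```python
from collections import defaultdict

rows = [(1, 'a', 'q'), (1, 'b', 'r'), (2, 'a', 'w'), (2, 'b', 't'),
        (3, 'a', 'e'), (3, 'b', 'y'), (4, 'b', 'u')]

# Bucket (ts, value) pairs by group.
groups = defaultdict(list)
for ts, group, value in rows:
    groups[group].append((ts, value))

shifted = []
for group, pairs in groups.items():
    pairs.sort()  # order by timestamp within the group
    values = [v for _, v in pairs]
    # Each row receives the next row's value; the last row gets None.
    for (ts, _), nxt in zip(pairs, values[1:] + [None]):
        shifted.append((ts, group, nxt))
```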
SFrame read_csv doesn't support temporary credentials, since it doesn't pass the session token from the environment and only passes the key and secret.
I can't seem to find how to load an SFrame from a CSV using the C++ API. Can you point me in the right direction?
This CSV file loads fine with pandas, but seems to choke the SFrame CSV parser:
In [24]: df=gl.SFrame.read_csv("batch.csv")
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","31HQ4X3T3S9XZSVFGK2JUSI8Z2RSLJ","A3579N2TITA69M","Submitted","Fri Oct..."
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","32N49TQG3GHWV1LFDOIYW1M44UOVAB","A3BBGFC0RG39HK","Submitted","Fri Oct..."
PROGRESS: 2 lines failed to parse correctly
PROGRESS: Finished parsing file /Users/malmaud/tmp/batch.csv
PROGRESS: Parsing completed. Parsed 0 lines in 0.010217 secs.
Insufficient number of rows to perform type inference
Could not detect types. Using str for each column.
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","31HQ4X3T3S9XZSVFGK2JUSI8Z2RSLJ","A3579N2TITA69M","Submitted","Fri Oct..."
PROGRESS: Unable to parse line ""3Y3N5A7N4F8BFCSGP0QMO7PDF0FYME","26PZ9RNDWIOV420VVL4TLU13ZJYUFK","x","y","z","Fri Oct 09 12:38:00 PDT 2015","20","BatchId:2118345;","2700","604800","Fri Oct 16 12:38:00 PDT 2015","","","32N49TQG3GHWV1LFDOIYW1M44UOVAB","A3BBGFC0RG39HK","Submitted","Fri Oct..."
PROGRESS: 2 lines failed to parse correctly
PROGRESS: Finished parsing file /Users/malmaud/tmp/batch.csv
PROGRESS: Parsing completed. Parsed 0 lines in 0.01067 secs.
Hi,
I tried installing using the Dato GCC toolchain on my CentOS 6.7 machine, per the instructions:
./configure --toolchain=https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz
This configuration fails with the following error:
Downloading dato_deps_linux_gcc_4.9.2.tar.gz from https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz ...
--2016-01-13 18:58:04-- https://s3-us-west-2.amazonaws.com/dato-deps/1/dato_deps_linux_gcc_4.9.2.tar.gz
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 54.231.162.24
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|54.231.162.24|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2016-01-13 18:58:04 ERROR 403: Forbidden.
Please help me in this regard.
Thanks!
Is there a supported set of configuration options to compile dynamic instead of static libraries?
The context here is I'm trying to write an in-process wrapper for SFrames for the Julia programming language, which will require loading the graphlab libraries at runtime.
Something seems odd with how missing values get treated in the sum operator.
In [19]: print gl.SArray([]).sum()
None
In [21]: print gl.SArray([None]).sum()
0.0
In [20]: print sum([])
In [30]: print gl.SArray([None], array.array).sum()
array('d')
0
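For comparison, a plain-Python sum with one plausible convention (missing values skipped, an empty array summing to 0, an all-missing array summing to None) can be sketched as:

```python
def missing_aware_sum(values):
    """Sum values, skipping None; empty input gives 0, all-missing gives None."""
    present = [v for v in values if v is not None]
    if not present:
        return None if values else 0
    return sum(present)

assert missing_aware_sum([]) == 0
assert missing_aware_sum([None]) is None
assert missing_aware_sum([1, None, 2]) == 3
```

Whatever convention is chosen, the current behavior (None for the empty array, 0.0 for [None]) looks inverted relative to it.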
In [2]: !head bad_example_read_csv.csv
k,v
a,1
b,1
c,-8
d,3
In [6]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=['-8'])
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011829 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011498 secs.
In [7]: sf
Out[7]:
Columns:
k str
v int
Rows: 4
Data:
+---+----+
| k | v |
+---+----+
| a | 1 |
| b | 1 |
| c | -8 |
| d | 3 |
+---+----+
[4 rows x 2 columns]
In [8]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=[-8])
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011929 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.011881 secs.
In [9]: sf
Out[9]:
Columns:
k str
v int
Rows: 4
Data:
+---+----+
| k | v |
+---+----+
| a | 1 |
| b | 1 |
| c | -8 |
| d | 3 |
+---+----+
[4 rows x 2 columns]
In [13]: sf = gl.SFrame.read_csv('bad_example_read_csv.csv', na_values=['-8'], column_type_hints=str)
PROGRESS: Finished parsing file /Users/charlie/bad_example_read_csv.csv
PROGRESS: Parsing completed. Parsed 4 lines in 0.010938 secs.
In [14]: sf
Out[14]:
Columns:
k str
v str
Rows: 4
Data:
+---+------+
| k | v |
+---+------+
| a | 1 |
| b | 1 |
| c | None |
| d | 3 |
+---+------+
[4 rows x 2 columns]
I have installed both GraphLab (student license) and sframe (installed from PyPI). When I run the code below (taken from the regression class on Coursera):
import sframe
sales = sframe.SFrame("kc_house_data.gl/")
# Let's compute the mean of the House Prices in King County in 2 different ways.
prices = sales['price'] # extract the price column of the sales SFrame -- this is now an SArray
# recall that the arithmetic average (the mean) is the sum of the prices divided by the total number of houses:
sum_prices = prices.sum()
num_houses = prices.size() # when prices is an SArray .size() returns its length
avg_price_1 = sum_prices/num_houses
avg_price_2 = prices.mean() # if you just want the average, the .mean() function
print "average price via method 1: " + str(avg_price_1)
print "average price via method 2: " + str(avg_price_2)
I notice in the console:
"C:\Users\tho\AppData\Local\Dato\Dato Launcher\python.exe" F:/_python_test/regression/week_1/_dataset/script.py
[INFO] Using MetricMock instead of real metrics, mode is: QA
[INFO] Start server at: ipc:///tmp/graphlab_server-3104 - Server binary: C:\Users\tho\AppData\Local\Dato\Dato Launcher\lib\site-packages\sframe\unity_server.exe - Server log: C:\Users\tho\AppData\Local\Temp\sframe_server_1449199266.log.0
[INFO] GraphLab Server Version: 1.6
average price via method 1: 540088.141905
average price via method 2: 540088.141905
[INFO] Stopping the server connection.
Process finished with exit code 0
Why is the GraphLab Server loaded, and what is its role here?
Thank you in advance.
The docstring signature of SFrame.print_rows is supposed to look like this:
def print_rows(self, num_rows=10, num_columns=40, max_column_width=30,
max_row_width=80, output_file=sys.stdout):
However, looking at the docstring in the Python shell, it looks like sys.stdout somehow got evaluated:
print_rows(self, num_rows=10, num_columns=40, max_column_width=30,
max_row_width=80, output_file=<open file '<stdout>', mode 'w'>) unbound sframe.data_structures.sframe.SFrame method
I notice that the SFrame folder comes in at a sizable 5.5 GB after compiling libunity. I'm interested in extracting the minimum runtime dependencies of libunity plus the header files, something like what make install would do for many packages.
Is that functionality available (or is there a plan to make it available)? If it's not too much trouble, would you mind offering me some guidance on how to do it manually?
Thanks!
from copy import deepcopy
import sframe

sf = sframe.SFrame()
sf['a'] = [1, 1, 1, 1, 2, 2, 2, 3, 4, 4, 5]
sf['b'] = [1, 2, 1, 2, 3, 3, 1, 4, None, 2, 3]
sf2 = deepcopy(sf)
Error logs
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-11-f0d0a6912fba> in <module>()
----> 1 sf2= deepcopy(sf)
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
188 raise Error(
189 "un(deep)copyable object of type %s" % cls)
--> 190 y = _reconstruct(x, rv, 1, memo)
191
192 memo[d] = y
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _reconstruct(x, info, deep, memo)
332 if state:
333 if deep:
--> 334 state = deepcopy(state, memo)
335 if hasattr(y, '__setstate__'):
336 y.__setstate__(state)
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
161 copier = _deepcopy_dispatch.get(cls)
162 if copier:
--> 163 y = copier(x, memo)
164 else:
165 try:
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _deepcopy_tuple(x, memo)
235 y = []
236 for a in x:
--> 237 y.append(deepcopy(a, memo))
238 d = id(x)
239 try:
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
161 copier = _deepcopy_dispatch.get(cls)
162 if copier:
--> 163 y = copier(x, memo)
164 else:
165 try:
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _deepcopy_dict(x, memo)
255 memo[id(x)] = y
256 for key, value in x.iteritems():
--> 257 y[deepcopy(key, memo)] = deepcopy(value, memo)
258 return y
259 d[dict] = _deepcopy_dict
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in deepcopy(x, memo, _nil)
188 raise Error(
189 "un(deep)copyable object of type %s" % cls)
--> 190 y = _reconstruct(x, rv, 1, memo)
191
192 memo[d] = y
/Users/sbajaj/miniconda/lib/python2.7/copy.pyc in _reconstruct(x, info, deep, memo)
327 if deep:
328 args = deepcopy(args, memo)
--> 329 y = callable(*args)
330 memo[id(x)] = y
331
/Users/sbajaj/miniconda/lib/python2.7/copy_reg.pyc in __newobj__(cls, *args)
91
92 def __newobj__(cls, *args):
---> 93 return cls.__new__(cls, *args)
94
95 def _slotnames(cls):
sframe/cython/cy_sframe.pyx in sframe.cython.cy_sframe.UnitySFrameProxy.__cinit__()
TypeError: __cinit__() takes at least 1 positional argument (0 given)
I followed the steps suggested on the "Setting up for Windows" page.
In my MSYS2 shell, I first run the configuration, which goes well.
After that, I enter the directory ${SFrameRoot}/debug/oss_src/unity/python and run "make",
then I get output like:
and some undefined reference errors:
Could anyone give me some help?
I encountered an error when loading a big SFrame directly from S3 and processing it. The error disappears after I download from S3 first, so I guess there might be a problem in the current S3 reading API.
Opened an issue to track this, as discussed with @ylow.
The cppipc layer becomes unnecessary once unity_server is "inproc". Removing the cppipc layer entirely will bypass data-structure serialization, provide tighter integration with Python, and speed up all object passing between Python and GLC.
Right now the SFrame package has to be installed with pip; it would be nice if it could be installed with 'conda install'.
So it currently only supports passing credentials from environment variables.
importing both sframe and graphlab can result in a crash.
import sframe
import graphlab
g=sframe.SFrame()
The currently released version is 168 commits behind; do you have plans to release a newer version of SFrame?
The current mode of Python lambda parallelization involves spinning off a collection of subprocesses (pylambda_worker), each of which is dynamically linked to libpython.so. These pylambda_workers then connect back to the original process via interprocess shared memory.
This causes issues in certain situations:
The proposal is to flip the linking around.
Building the C++ tests doesn't happen when you build everything else. If you try to run ./oss_local_scripts/run_cpp_tests.py when you have not built the C++ tests, you get a very unhelpful error message.
This behavior is wrong. The python side cache "SFrame._cache" needs to be invalidated when the column set changes.
>>> g=gl.SFrame({'a':[1]})
>>> g[0]
{'a': 1}
>>> g['a'] = g['a'] + 1
>>> g[0]
{'a': 1}
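The required invalidation can be sketched in plain Python (CachedFrame is a toy, not the real SFrame internals): any assignment to the column set clears the cached rows.

```python
class CachedFrame(object):
    """Toy frame demonstrating row-cache invalidation on column assignment."""

    def __init__(self):
        self._columns = {}
        self._row_cache = {}

    def __setitem__(self, name, values):
        self._columns[name] = list(values)
        self._row_cache.clear()  # the cached rows are now stale

    def row(self, i):
        if i not in self._row_cache:
            self._row_cache[i] = {k: v[i] for k, v in self._columns.items()}
        return self._row_cache[i]


g_toy = CachedFrame()
g_toy['a'] = [1]
assert g_toy.row(0) == {'a': 1}
g_toy['a'] = [v + 1 for v in g_toy._columns['a']]
assert g_toy.row(0) == {'a': 2}  # without invalidation this would still be {'a': 1}
```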
Sometimes I get a call stack that is very hard to debug -- resulting in an exception like:
File "graphlab/cython/cy_sarray.pyx", line 84, in graphlab.cython.cy_sarray.UnitySArrayProxy.size
File "graphlab/cython/cy_sarray.pyx", line 85, in graphlab.cython.cy_sarray.UnitySArrayProxy.size
RuntimeError: Runtime Exception. Cannot convert python object Decimal to flexible_type.
Since the apply method is executed lazily, this error (with call stack) tends to appear far away from the code that actually causes it. In a large program it is hard to track down where the error was introduced.
It would be nice to have a mode or flag to enable strict evaluation (across the board) for debugging purposes, so that it would be easier to track down type mismatches or exceptions raised in apply methods, without having to explicitly materialize at each point in the code.
Repro steps:
import graphlab as gl
db = gl.connect_odbc(<my awesome connection string>)
Expected Behavior:
Actual Behavior:
AttributeError: 'module' object has no attribute '_odbc_connection'
The pylambda workers allow efficient parallel execution of functions over the SFrame and SArray types, but they have historically been plagued by library resolution and linking issues. The previous pull request should fix all of these issues. The primary idea is that it spawns separate processes of the Python interpreter, which then load the pylambda worker code as a library, located by file path, instead of running a separate process linked against libpython. Both of these things can be reliably determined at startup, whereas the location of libpython cannot.
To ensure that this change indeed works, we need to test that it runs on a number of different configurations and systems.
To sign off on this feature, we need to make sure that nosetests .../sframe/test/test_lambda_workers.py runs on the following systems:
When I do the following:
x = gl.SArray([1,None,0])
y = gl.SArray([0,0,1])
print x and y
I expected that (like many other operators), the and operator would be overloaded to do an element-wise and. The result I get actually has the expected type and the expected number of rows, but not the values I expected. I get back y:
[0, 0, 1]
This is because Python implicitly casts to bool for comparison purposes and returns the original type from the expression (JS does this as well, so I should've realized that's what was happening). The equivalent expression to x and y, without the implicit type coercion, is actually y if bool(x) else x.
Explicit is better than implicit probably applies here. Why Python originally allowed this implicit casting is beyond me.
@hoytak suggests that we adopt the NumPy behavior here and error on implicit cast to bool, and I'm inclined to agree. I assume TypeError?
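The NumPy-style alternative can be sketched with a toy array type (TinyArray is illustrative): overload & for element-wise logic and make bool() of a multi-element array raise TypeError.

```python
class TinyArray(object):
    """Toy array: element-wise &, and TypeError on ambiguous bool() (NumPy-style)."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        # None is treated as falsy here purely for illustration.
        return TinyArray([int(bool(a) and bool(b))
                          for a, b in zip(self.values, other.values)])

    def __bool__(self):
        raise TypeError("truth value of an array with more than one element "
                        "is ambiguous")


x = TinyArray([1, None, 0])
y = TinyArray([0, 0, 1])
assert (x & y).values == [0, 0, 0]

# `x and y` now fails loudly instead of silently returning y.
try:
    x and y
    errored = False
except TypeError:
    errored = True
assert errored
```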
SFrame.append() is slow when it iteratively self-appends.
The following script reproduces the slow performance:
def f(n):
    sf = gl.SFrame({"h": range(0, 100)})
    for i in range(0, 100):
        sf.add_column(sf["h"], name=str(i))
    for i in range(0, n):
        sf = sf.append(sf)
    sf.__materialize__()

for i in range(5, 15):
    print i
    %timeit f(i)
Another related issue is that print sf in the following script collapses the query tree (calling materialize()) even though it only needs to print the first 10 rows.
sf = gl.SFrame({'h': range(0, 100)})
for i in range(0, 100):
    sf.add_column(sf['h'], name=str(i))
for i in range(0, 14):
    sf = sf.append(sf)
print sf
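The quadratic blow-up has a familiar plain-Python analogue; the usual fix on the caller's side is to collect the pieces and concatenate once rather than repeatedly copying the accumulated result (a workaround sketch, not the internal fix):

```python
def quadratic_build(piece, n):
    # Copies everything accumulated so far on every iteration: O(n^2) total work.
    out = []
    for _ in range(n):
        out = out + piece
    return out

def linear_build(piece, n):
    # Collect the parts, then concatenate once: O(n) total work.
    parts = [piece] * n
    return [x for part in parts for x in part]

piece = list(range(10))
assert quadratic_build(piece, 5) == linear_build(piece, 5)
```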
The new libpylambda_worker.dll needs to have an accompanying manifest so that it loads all dependent libraries from the local graphlab directory. The main issue encountered here is that old openssl libraries are not compatible, and many openssl installers put libraries by default into C:\System32. Thus if someone has an old version of openssl installed, then the pylambda workers won't start.
Hi,
When trying to use an empty string as escape_char, it is ignored.
When trying other characters, it works.
This is preventing us from disabling escaping.
There are some problems with an SFrame that I've loaded: it just can't seem to dump into a binary or CSV format. Is it a memory issue, or is it how SFrame works, such that NoneType appears often when iterating over or saving the SFrame?
For more information, see http://stackoverflow.com/questions/34654901/how-do-i-find-specific-rows-that-throws-an-error-when-saving-in-graphlab-sframe
import sframe
sf = sframe.SFrame()
sf['a'] = [1,1,1,1, 2,2,2, 3, 4,4, 5]
sf['b'] = [1,2,1,2, 3,3,1, 4, None, 2, 3]
af = sf.groupby("a", {'b':sframe.aggregate.CONCAT("b")})
af['c'] = af['b'].apply(lambda x: list(set(x)))
The resulting column c contains floats instead of preserving the int type:
+---+--------------+------------+
| a | b | c |
+---+--------------+------------+
| 3 | [4] | [4.0] |
| 1 | [2, 1, 1, 2] | [1.0, 2.0] |
| 2 | [3, 3, 1] | [1.0, 3.0] |
| 5 | [3] | [3.0] |
| 4 | [2] | [2.0] |
+---+--------------+------------+
The testing code is below:
import sframe as sf
t = sf.SFrame({"id": ['a', 'b', 'c'], 'value': [1, 2, 3]})
t['str'] = 'aaa'
t['uni'] = u'aa'
Will SArray.from_const support unicode values?
The package currently only supports Python 2.7.x.
When I used SFrame's read_csv, I came across the errors below:
PROGRESS: Unable to parse line ""###0000","{}","{""-2sxY5OvAJh4UT3w"":""不vvv""}""
PROGRESS: Unable to parse line ""**666996","{}","{""uFx1XYq5tjyFt8--"":""八八\\""}""
What's wrong with those lines? Could somebody tell me how SFrame parses them?
By the way, the file containing these lines was generated by SFrame itself via the save method.
Thank you!
Docstrings still reference the package name as being graphlab. There are other incorrect references to GraphLab Create.
Being new to the platform, I wanted to look at a row of data and did:
Which is great, except the columns are out of order compared to the original SFrame, which made me think I had a bug, and I kept looking for a while. Then I decided to save the SFrame to disk (as CSV), and BOOM, everything is as expected.
It is very annoying that the columns are not displayed in order. Hoyt says that we could use an OrderedDict and it would fix this.
The SFrame query optimizer, in prioritizing a filter operation, incorrectly allows a binary_transform with multiple outputs to be split and thus executed twice. Here's the input graph:
And it produces this result:
The reason is that the filter optimization does not correctly detect if it is splitting the output of a binary transform.
Ubuntu 15.04, Python 2.7, IPython 4.0.0
Having used graphlab in the introductory section of your Coursera program, I thought I would make sure I was still up to date per the instructions at the beginning of the Regression module.
Following
sudo -H pip install --upgrade graphlab-create
sudo -H pip install -U sframe
sudo reboot
ipython
import graphlab as gl
produced the following
In [2]: import graphlab as gl
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-4a710214f762> in <module>()
----> 1 import graphlab as gl
/usr/local/lib/python2.7/dist-packages/graphlab/__init__.py in <module>()
52 from graphlab.util import set_runtime_config
53
---> 54 import graphlab.connect as _mt
55 import graphlab.connect.aws as aws
56 import visualization
/usr/local/lib/python2.7/dist-packages/graphlab/connect/__init__.py in <module>()
29 """ The module usage metric tracking object """
30 from graphlab.util.config import DEFAULT_CONFIG as _default_local_conf
---> 31 from graphlab.util.metric_tracker import MetricTracker as _MetricTracker
32
33
/usr/local/lib/python2.7/dist-packages/graphlab/util/metric_tracker.py in <module>()
21 import uuid
22 import copy as _copy
---> 23 import requests as _requests
24 import sys
25 import urllib as _urllib
/usr/local/lib/python2.7/dist-packages/requests/__init__.py in <module>()
51 # Attempt to enable urllib3's SNI support, if possible
52 try:
---> 53 from .packages.urllib3.contrib import pyopenssl
54 pyopenssl.inject_into_urllib3()
55 except ImportError:
/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py in <module>()
68 _openssl_versions = {
69 ssl.PROTOCOL_SSLv23: OpenSSL.SSL.SSLv23_METHOD,
---> 70 ssl.PROTOCOL_SSLv3: OpenSSL.SSL.SSLv3_METHOD,
71 ssl.PROTOCOL_TLSv1: OpenSSL.SSL.TLSv1_METHOD,
72 }
AttributeError: 'module' object has no attribute 'PROTOCOL_SSLv3'
Looking at the error message and having noticed that your installation had moved requests back to an earlier level, I tried
sudo -H pip install -U pyopenssl
sudo -H pip install -U urllib3
sudo -H pip install -U requests
... and this seemed to do the trick.
SFrame uses maybe the first 1000 rows to infer column types. When a column is inferred as int and a str is encountered later, the parser will read the first valid digits in the string as the value, or discard the string if there are no valid digits.
For instance, create 'a.csv' like follows:
A,B
0,1
0,1
...
// repeat 100 times
...
9a,1
a,1
SFrame.read_csv('a.csv').tail()
+---+---+
| A | B |
+---+---+
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 0 | 1 |
| 9 | 1 |
+---+---+
Note that the last row is "9, 1".
The expected behavior should be either to throw out that row, or to lift the inferred column type to str and keep it. Partially parsing a row corrupts the data.
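The "lift to str" option can be sketched as follows: try the inferred type on the whole column and, on any failure, promote every cell to str instead of salvaging leading digits (a behavioral sketch, not the parser's actual code):

```python
def parse_column(cells, inferred=int):
    """Parse cells with the inferred type; promote the column to str on failure."""
    try:
        return [inferred(c) for c in cells]
    except ValueError:
        # Lift the column type rather than mangling '9a' into 9.
        return [str(c) for c in cells]

assert parse_column(['0', '0', '0']) == [0, 0, 0]
assert parse_column(['0', '9a', 'a']) == ['0', '9a', 'a']
```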
In groupby we get None, while in sum, we get an error.
In [49]: gl.SFrame({'c': ['a', 'b', 'b'], 'v': [[1], [1], [1, 1]]}).groupby('c', gl.aggregate.SUM('v'))
Out[49]:
Columns:
c str
Vector Sum of v array
Rows: 2
Data:
+---+-----------------+
| c | Vector Sum of v |
+---+-----------------+
| a | [1.0] |
| b | None |
+---+-----------------+
[2 rows x 2 columns]
In [50]: print gl.SArray([[1], [1,1]]).sum()
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-50-db39ee6eb4ec> in <module>()
----> 1 print gl.SArray([[1], [1,1]]).sum()
/Users/srikris/miniconda/envs/graphlab/lib/python2.7/site-packages/graphlab/data_structures/sarray.pyc in sum(self)
1970 """
1971 with cython_context():
-> 1972 return self.__proxy__.sum()
1973
1974 def mean(self):
/Users/srikris/miniconda/envs/graphlab/lib/python2.7/site-packages/graphlab/cython/context.pyc in __exit__(self, exc_type, exc_value, traceback)
47 if not self.show_cython_trace:
48 # To hide cython trace, we re-raise from here
---> 49 raise exc_type(exc_value)
50 else:
51 # To show the full trace, we do nothing and let exception propagate
RuntimeError: Runtime Exception. Cannot perform sum over vectors of variable length.
Just wanted to check if this is a problem, or if the show method is genuinely unimplemented in the open source tools:
In [11]: sframe.SArray([1,2]).show()
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-11-246d830d08e3> in <module>()
----> 1 sframe.SArray([1,2]).show()
/Users/malmaud/tmp/SFrame/debug/oss_src/unity/python/sframe/data_structures/sarray.pyc in show(self, view)
2566 """
2567 from ..visualization.show import show
-> 2568 show(self, view=view)
2569
2570 def item_length(self):
/Users/malmaud/tmp/SFrame/deps/conda/lib/python2.7/site-packages/multipledispatch/dispatcher.pyc in __call__(self, *args, **kwargs)
162 self._cache[types] = func
163 try:
--> 164 return func(*args, **kwargs)
165
166 except MDNotImplementedError:
/Users/malmaud/tmp/SFrame/debug/oss_src/unity/python/sframe/visualization/show.pyc in show(obj, **kwargs)
12 @show_dispatch(object)
13 def show(obj, **kwargs):
---> 14 raise NotImplementedError("Show for object type " + str(type(obj)))
NotImplementedError: Show for object type <class 'sframe.data_structures.sarray.SArray'>
We should not have: sframe.load_model
Currently, we try to autotune the amount of memory we use by detecting the amount of system memory. However, this is problematic when we are run inside of Docker, since we detect the total amount of memory on the system (via sysinfo) rather than the amount of memory allocated to us. (See http://fabiokung.com/2014/03/13/memory-inside-linux-containers/, moby/moby#12394)
As of Docker 1.8, which is rather new, we should be able to look into /sys/fs/cgroup or /proc/self/cgroup or something like that. But not everyone will be on Docker 1.8, so we need an API workaround as well.
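A sketch of the cgroup-based workaround (the path is the standard cgroup v1 location; the sentinel threshold for "no limit configured" is an assumption):

```python
def container_memory_limit(path="/sys/fs/cgroup/memory/memory.limit_in_bytes",
                           fallback=None):
    """Return the cgroup v1 memory limit in bytes, or fallback if unavailable."""
    try:
        with open(path) as f:
            limit = int(f.read().strip())
    except (IOError, OSError, ValueError):
        return fallback  # not in a (v1) cgroup, or file unreadable
    # cgroups report an enormous sentinel value when no limit is configured.
    return fallback if limit >= 1 << 60 else limit
```

The sysinfo-based total would then only be used as the fallback when no cgroup limit is found.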
SFrame is very powerful, and I think it will be even more powerful if SFrame.show() is supported.
Hi,
When saving to a CSV file, there is no option to choose the quote character.
In our case both systems are under our control, so we can change the quote character on the reading side; but if that weren't the case, we would need to read the file back after saving and convert it.
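For reference, Python's own csv module exposes this as the quotechar option on the writer; an analogous option on SFrame's export side would avoid the read-and-convert round trip:

```python
import csv
import io

buf = io.StringIO()
# Write with a single quote as the quote character, quoting every field.
writer = csv.writer(buf, quotechar="'", quoting=csv.QUOTE_ALL)
writer.writerow(['a', 'b,c'])
print(buf.getvalue().strip())  # 'a','b,c'
```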
I just tried playing around with SFrame, and noticed that you haven't added some useful special methods:
__abs__ corresponds to abs(sf['x'])
__neg__ corresponds to -sf['x']
__pos__ corresponds to +sf['x']
__pow__ corresponds to sf['x'] ** 2
These are pretty low-hanging fruit but quite useful for compatibility with standard Python idioms.
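Each of these reduces to one element-wise map; a minimal sketch on a toy column type (TinyColumn is illustrative, not the SArray implementation, and None-propagation is an assumed behavior):

```python
class TinyColumn(object):
    """Toy column showing the element-wise special methods requested above."""

    def __init__(self, values):
        self.values = list(values)

    def _map(self, fn):
        # Propagate missing values untouched (assumed desired behavior).
        return TinyColumn([None if v is None else fn(v) for v in self.values])

    def __abs__(self):        # abs(col)
        return self._map(abs)

    def __neg__(self):        # -col
        return self._map(lambda v: -v)

    def __pos__(self):        # +col
        return self._map(lambda v: +v)

    def __pow__(self, exp):   # col ** exp
        return self._map(lambda v: v ** exp)


col = TinyColumn([-1, 2, None])
assert abs(col).values == [1, 2, None]
assert (-col).values == [1, -2, None]
assert (col ** 2).values == [1, 4, None]
```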