blaze / datashape Goto Github PK

Language defining a data description protocol

License: BSD 2-Clause "Simplified" License

Shell 0.01% Python 99.97% Batchfile 0.01%

datashape's Introduction

Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems.

Example

We point blaze to a simple dataset in a foreign database (PostgreSQL). Instantly we see results as we would see them in a Pandas DataFrame.

>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa

These results occur immediately. Blaze does not pull data out of Postgres, instead it translates your Python commands into SQL (or others.)

>>> iris.species.distinct()
           species
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...                      largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5

This same example would have worked with a wide range of databases, on-disk text or binary files, or remote data.

What Blaze is not

Blaze does not perform computation. It relies on other systems like SQL, Spark, or Pandas to do the actual number crunching. It is not a replacement for any of these systems.

Blaze does not implement the entire NumPy/Pandas API, nor does it interact with libraries intended to work with NumPy/Pandas. This is the cost of using more and larger data systems.

Blaze is a good way to inspect data living in a large database, perform a small but powerful set of operations to query that data, and then transform your results into a format suitable for your favorite Python tools.

In the Abstract

Blaze separates the computations that we want to perform:

>>> accounts = Symbol('accounts', 'var * {id: int, name: string, amount: int}')

>>> deadbeats = accounts[accounts.amount < 0].name

From the representation of data

>>> L = [[1, 'Alice',   100],
...      [2, 'Bob',    -200],
...      [3, 'Charlie', 300],
...      [4, 'Denis',   400],
...      [5, 'Edith',  -500]]

Blaze enables users to solve data-oriented problems

>>> list(compute(deadbeats, L))
['Bob', 'Edith']

But the separation of expression from data allows us to switch between different backends.

Here we solve the same problem using Pandas instead of Pure Python.

>>> df = DataFrame(L, columns=['id', 'name', 'amount'])

>>> compute(deadbeats, df)
1      Bob
4    Edith
Name: name, dtype: object

Blaze doesn't compute these results, Blaze intelligently drives other projects to compute them instead. These projects range from simple Pure Python iterators to powerful distributed Spark clusters. Blaze is built to be extended to new systems as they evolve.

Getting Started

Blaze is available on conda or on PyPI

conda install blaze
pip install blaze

Development builds are accessible

conda install blaze -c blaze
pip install http://github.com/blaze/blaze --upgrade

You may want to view the docs, the tutorial, some blogposts, or the mailing list archives.

Development setup

The quickest way to install all Blaze dependencies with conda is as follows

conda install blaze spark -c blaze -c anaconda-cluster -y
conda remove odo blaze blaze-core datashape -y

After running these commands, clone odo, blaze, and datashape from GitHub directly. These three projects release together. Run python setup.py develop to make development installations of each.

License

Released under BSD license. See LICENSE.txt for details.

Blaze development is sponsored by Continuum Analytics.

datashape's People

Contributors

Stargazers

Watchers

Forkers

aterrel mwiebe imclab markflorisson francescalted mrocklin talumbau gdementen calvinwly srossross ellisonbg jreback cowlicks pombreda llllllllll telefunkenvf14 dan-coates quantopian insertinterestingnamehere jcrist princeofdarkness76 pombredanne ssanderson rv816 kwmsmith egqm pratapvardhan zen-li thequackdaddy brycehays mitizhi jakirkham dhirschfeld freddiev4 darthkedrik john5223 andreabonchi michelorengo saulshanabrook mingwandroid magonser vishalbelsare vikram-narayan liudengfeng alxmrs defenastrator syyunn bobquest33 j08ny studentyb kurhula koenlek sreekanth370 dthadi3 richlysakowski gururajrkatti aahmadai kolanich-libs arpitjain799 chechi

datashape's Issues

Needs a version number

Something like datashape.version

Time Interval type

Should we add a core type to represent an interval of time or timedelta?

Records allow repeated fields

In [8]: dshape('{name: string, amount: int, name: string}')
Out[8]: dshape("{ name : string, amount : int32, name : string }")

Discover doesn't simultaneously handle var and heterogeneous types

Failing case

In [6]: discover([[1, 2, 3.0], [4.0, 5], [None, 6.0, 7]])
Out[6]: dshape("((int64, int64, float64), (float64, int64), (null, float64, int64))")

Should be

3 * var * ?float64

Need to clarify details of type constructor keyword parameters

As we've written the description of keyword arguments so far, it's implicitly following Python's approach because of the current implementation in the parser. This is not the only option, and not necessarily the best way to do things for datashape.

The main purpose of having keyword arguments in the type constructors is documentation. Consider variations of a bytes type:

bytes
bytes[8]
bytes[8,4]
bytes[align=4]
bytes[8, align=4]
bytes[size=8]
bytes[size=8, align=4]
bytes[align=4, size=8]

Some things that need to be made clear:

Is it ok to provide keyword arguments an any order, or should we always require a fixed order?
Do we allow a given parameter to be provided either with a keyword or not?
If things are in a fixed order, do we allow a positional argument after a keyword argument?

An example possibility is to have dimension constructors be syntactic sugar as follows:

dim[arg1, argname2=arg2] * int32
  equivalent to
dim[arg1, argname2=arg2, int32]

My inclination is towards the following, based on wanting the representation of a given type to not have too many possibilities, and to aid parsing efficiency in static languages:

Require argument order to always match up.
Require keyword name if it is a keyword argument.
Allow non-keyword arguments after keyword arguments.

For the bytes example above, the type constructor signatures we get are:

# variable-sized bytes type
bytes
# fixed-size bytes type of a given size, alignment 1
bytes[<int>]
# variable-sized bytes type, with given alignment
bytes[align=<int>]
# fixed-size bytes type, with given alignment
bytes[<int>, align=<int>]

Change function prototype syntax

Currently function prototypes use currying syntax:

int32 -> 3, int32 -> float64

This is confusing to many and should be changed to use tuple syntax.

( int32; 3, int32 ) -> float64

test_coretypes.py misses importing raises in -0.4.1

datashape/tests/test_util.py::TestDataShapeUtil::test_has_ellipsis  *       py.test -v || die "Tests failed under ${EPYTHON}"
PASSED
datashape/tests/test_util.py::TestDataShapeUtil::test_has_var_dim PASSED * 


===================== FAILURES ==========================
____________________________________________ test_error_on_datashape_with_string_argument ____________________________________________

    def test_error_on_datashape_with_string_argument():
>       assert raises(TypeError, lambda : DataShape('5 * int32'))
E       NameError: global name 'raises' is not defined

datashape/tests/test_coretypes.py:31: NameError
======= 1 failed, 193 passed, 8 xfailed in 1.64 seconds ==================

add from datashape.internal_utils import raises to datashape/tests/test_coretypes.py sees it pass

============= 194 passed, 8 xfailed in 1.57 seconds =============
>>> Completed testing dev-python/datashape-0.4.1

Type constructor mechanism needs cleanup

Here are some of the things it allows right now:

In [21]: x = datashape.dshape('Option[float32]')

In [22]: y = datashape.Option(datashape.float32)

In [23]: x
Out[23]: Option[float32]

In [24]: y
Out[24]: Option(float32)

In [25]: y = datashape.Option[datashape.float32]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-25-1511765166f1> in <module>()
----> 1 y = datashape.Option[datashape.float32]

TypeError: 'type' object has no attribute '__getitem__'

and

In [26]: datashape.dshape('nonexistent_type_[something_arbitrary]')
Out[26]: nonexistent_type_[something_arbitrary]

In [27]: type(_)
Out[27]: datashape.coretypes.nonexistent_type_

The sdist on PyPI does not include requirements.txt, needed for build

It looks like the project needs a MANIFEST.in file to ensure that the correct files get added to the sdist.

Datashape functions should accept strings

In [5]: datashape.to_numpy('3 * 3 * int32')
NotNumpyCompatible: DataShape measure 3 * 3 * int32 is not NumPy-compatible

In [6]: datashape.to_numpy(datashape.dshape('3 * 3 * int32'))
Out[6]: ((3, 3), dtype('int32'))

Add Predicates for tabular, homogenous, etc...

Datashape is wonderfully expressive for a variety of different shapes of data. Unfortunately many backends aren't as permissive. We should establish a set of common predicates to check whether a datashape is amenable for certain backends. Some suggestions

istabular - fits nicely in a table like SQL or CSV
ishomogenous - has a single dtype
islinear -
isscalar -
isfixed -

Function to determine the number of bytes in a datashape.

It would be nice to know how large our data is without looking at the data itself

>>> nbytes('2 * 2 * int64')
32

Generally assume the minimum storage required

>>> nbytes('{x: int32, y: int32}')
8

Accept that this is a hard task to do in general

>>> nbytes('string')
TypeError(...)

setuptools should not be required for building

Use Python data structures to define datashape

We currently use lots of internal data structures like Tuple, Record and DataShape to construct datashapes. Maybe we can get away with tuple, ordereddict, and list?

`10 * var * {name: string}`

DataShape([10, var, OrderedDict([['name', 'string']]))

This might make traversing these data structures much cleaner. I think that this choice should be mostly invisible to the user.

Add documentation about typevar vs type upper/lowercase convention

Need documentation about this.

Infer datashape of slice of dataset

I want something like the following. Do we have this already somewhere?

>>> ds = dshape('var * {name: string, amount: int}')
>>> index = (slice(0, 5), 'amount')

>>> dshape_of_subset(index, ds)
'5 * int'

>>> dshape_of_subset(0, ds)
'{name: string, amount: int}'

CC @mwiebe

Consistency between datashape.int32 and datashape.dshape('int32')

The following produce different values:

In [9]: datashape.int32
Out[9]: ctype("int32")

In [10]: datashape.dshape('int32')
Out[10]: dshape("int32")

which is kind of error prone, because:

In [12]: datashape.int32 == datashape.dshape('int32')
Out[12]: False

datashape -> string conversion should wrap special field names with quotes and escape troublesome characters

Example:

ds1=dshape("{ 'Unique Field' : string}")
print(ds1)
{ Unique Field : string }

When ds1 is printed, Unique Field should be wrapped in single quotes.

Discover sequences of dicts with missing keys

Discover should merge sequences of dicts into a single dict with Option values

In [1]: from blaze import discover

In [2]: data = [{'name': 'Alice', 'amount': 100}, {'name': 'Bob'}]

In [3]: discover(data)
Out[3]: dshape("({ amount : int64, name : string }, { name : string })")

I would have preferred

In [3]: discover(data)
Out[3]: dshape("2 * { amount : ?int64, name : string }")

dshape parser error output is broken

It prints the arrow into the string, but the string itself is missing.

In [2]: datashape.dshape("3, int48")
---------------------------------------------------------------------------
DataShapeSyntaxError                      Traceback (most recent call last)
<ipython-input-2-35b42a853fef> in <module>()
----> 1 datashape.dshape("3, int48")
...

DataShapeSyntaxError: 

  File <stdin>, line 1

    ^

DataShapeSyntaxError: invalid syntax

Decimal type?

Do we want one? They're found in the wild.

Allow field names in structs to be more general

Currently, structs are always like this:

{field0: int32, field1: float64}

if we want to allow field names which are not necessarily identifier names, we should extend it to also allow:

{"Field 0": int32, "another field!": float64}

conda install datashape for Python 2.6 doesn't put datashape in the proper site-packages directory

Here's an ls from a conda 2.7 env:

╭─ ~/code/py/blaze ‹conda26› ‹master›
╰─$ ls -1 ~/miniconda3/envs/conda/lib/python2.7/site-packages/ | grep -C 5 datashape
Cython-0.20.2-py2.7.egg-info
cython.py
cython.pyc
cytoolz
cytoolz-0.7.1dev-py2.7.egg-info
datashape
DataShape-0.2.1dev-py2.7.egg-info
distribute.egg-info
dynd
easy-install.pth
easy_install.py

Same from the conda 2.6 env:

ls -1 ~/miniconda3/envs/conda26/lib/python2.6/site-packages/ | grep -C 5 datashape

nothing shows up

Incorrect datetime discovery

In [5]: discover("31-DEC-99 12.00.00.000000000 AM")
Out[5]: Date()

Should be datetime

Need tests to verify whether ds == eval(str(ds)) for all datashapes

A recent PR (#89) came about because one could make a datashape where eval(str(ds)) != ds. It would be helpful if we exhaustively tested this within datashape.

distinguishing typevars vs types

Enforce the following convention to distinguish:

typevars always have a leading capital letter
types always have a leading lowercase letter

A major reason to enforce this is so that things stay unambiguously one or the other at all times. In the future, Blaze will likely gain the ability to have dynamically registered pluggable types. If what was previously a typevar with the same name as a new type exists, it would change the meaning of a parse from typevar to that type, instead of from an error condition to that type.

The semantics of type variable matching isn't precisely specified

In our docs, we've referred to a datashape like A * A * int32 as a square matrix, but if A is permitted to match the var dimension, this is actually a ragged array.

Since 3 * int32 is syntactic sugar for fixed[3] * int32, we can specify a square matrix as the more wordy fixed[N] * fixed[N] * int32. (Note that the pattern matching implementation does not match type variables inside of type constructors presently though.).

to_numpy should have strict kwarg and warn rather than err

Converting datashapes with missing types to numpy dtypes raises an error. This is appropriate because numpy dtypes don't support missing values. It might be nice to control this behavior with a strict keyword that allows lowering instead to the non-missing equivalent, e.g.

?int32 -> np.int32

Perhaps we should raise a warning rather than an error in this case?

What should be default behavior?

Use list of lists to construct Records in dshape

We should support construction of all datashapes through python syntax and data structures. This allows for simpler programmatic construction than string manipulation.

One case where this is difficult is records, e.g.

'{name: string, amount: int}'

In datashape the order matters while in Python the order doesn't

{'name': string, 'amount': int} == {'amount': int, 'name': string}

The standard fallback is to use a sequence of items

[['name', string], ['amount', 'int']]

Maybe the utility function dshape should recognize this situation and produce the appropriate Record

TST: minor test failure on datashape master on win-64

was installed from zip, but here's the latest commit for ti:
ec4144e

The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    datashape-0.4.1            |       np19py27_1          88 KB

The following NEW packages will be INSTALLED:

    colorama:         0.3.1-py27_0
    datashape:        0.4.1-np19py27_1
    dateutil:         2.1-py27_2
    multipledispatch: 0.4.7-py27_0
    numpy:            1.9.1-py27_0
    py:               1.4.25-py27_0
    pytest:           2.6.3-py27_0
    python:           2.7.8-0
    six:              1.8.0-py27_0

Fetching packages ...
datashape-0.4. 100% |###############################| Time: 0:00:00   0.00  B/s
Extracting packages ...
[      COMPLETE      ] |#################################################| 100%
Linking packages ...
[      COMPLETE      ] |#################################################| 100%

[neat] C:\Users\rebackj\AppData\Local\Continuum\Anaconda\conda-bld\test-tmp_dir>python -c "import datashape; datashape.test()"
Datashape version: 0.4.1
args ['--doctest-modules', 'C:\\Users\\rebackj\\AppData\\Local\\Continuum\\Anaconda\\envs\\_test\\lib\\site-packages\\datashape']
============================= test session starts =============================
platform win32 -- Python 2.7.8 -- py-1.4.25 -- pytest-2.6.3
collected 203 items

../../envs/_test/lib/site-packages/datashape/coretypes.py .........
../../envs/_test/lib/site-packages/datashape/discovery.py ......
../../envs/_test/lib/site-packages/datashape/internal_utils.py .....
../../envs/_test/lib/site-packages/datashape/predicates.py ............
../../envs/_test/lib/site-packages/datashape/promotion.py .
../../envs/_test/lib/site-packages/datashape/typesets.py .
../../envs/_test/lib/site-packages/datashape/util.py .......F
../../envs/_test/lib/site-packages/datashape/validation.py .
../../envs/_test/lib/site-packages/datashape/tests/test_coercion.py ....x....xxx...
../../envs/_test/lib/site-packages/datashape/tests/test_coretypes.py ..............................
../../envs/_test/lib/site-packages/datashape/tests/test_creation.py ..x.............x.
../../envs/_test/lib/site-packages/datashape/tests/test_discovery.py .........x................
../../envs/_test/lib/site-packages/datashape/tests/test_lexer.py .....
../../envs/_test/lib/site-packages/datashape/tests/test_operations.py ...
../../envs/_test/lib/site-packages/datashape/tests/test_overloading.py x......
../../envs/_test/lib/site-packages/datashape/tests/test_parser.py .....................
../../envs/_test/lib/site-packages/datashape/tests/test_predicates.py ..
../../envs/_test/lib/site-packages/datashape/tests/test_str.py ......
../../envs/_test/lib/site-packages/datashape/tests/test_type_equation_solver.py ..............
../../envs/_test/lib/site-packages/datashape/tests/test_user.py .........
../../envs/_test/lib/site-packages/datashape/tests/test_util.py ....

================================== FAILURES ===================================
_____________________ [doctest] datashape.util.to_ctypes ______________________
247     """
248     Constructs a ctypes type from a datashape
249
250     >>> to_ctypes(coretypes.int32)
Expected:
    <class 'ctypes.c_int'>
Got:
    <class 'ctypes.c_long'>

C:\Users\rebackj\AppData\Local\Continuum\Anaconda\envs\_test\lib\site-packages\datashape\util.py:250: DocTestFailure
=============== 1 failed, 194 passed, 8 xfailed in 2.31 seconds ===============
TEST END: datashape-0.4.1-np19py27_1
# If you want to upload this package to binstar.org later, type:
#
# $ binstar upload C:\Users\rebackj\AppData\Local\Continuum\Anaconda\conda-bld\win-64\datashape-0.4.1-np19py27_1.tar.bz2
#
# To have conda build upload to binstar automatically, use
# $ conda config --set binstar_upload yes


[neat] C:\Users\rebackj\Documents\datashape-master\datashape-master>

Normalization allows ellipsis on either side of inequality

(dshape('..., 10, int32'), dshape('10, ..., int32'))

We should disallow ellipses on the LHS of the equation. This means that in signatures, the return type must be concrete, a signature like the following will be disallowed:

'a, int32 -> ..., int32'

since the result of it would be '..., int32'. After that you can end up with an ellipsis at the LHS as input to some blaze function.

test failures in latest release, all pythons

datashape.tests.test_user.test_tuples_can_be_records_too ... ok
datashape.tests.test_user.test_datetimes ... ok
datashape.tests.test_user.test_numpy ... ok
datashape.tests.test_user.test_issubschema ... ok
datashape.tests.test_user.test_integration ... ok
test_cat_dshapes (datashape.tests.test_util.TestDataShapeUtil) ... ok
test_cat_dshapes_errors (datashape.tests.test_util.TestDataShapeUtil) ... ok
test_has_ellipsis (datashape.tests.test_util.TestDataShapeUtil) ... ok
test_has_var_dim (datashape.tests.test_util.TestDataShapeUtil) ... ok

======================================================================
ERROR: test_coerce_traits (datashape.tests.test_coercion.TestCoercion)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_coercion.py", line 75, in test_coerce_traits
    '10 * X * float32')
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/util.py", line 34, in dshapes
    return [dshape(arg) for arg in args]
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/util.py", line 34, in <listcomp>
    return [dshape(arg) for arg in args]
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/util.py", line 49, in dshape
    ds = parser.parse(o, type_symbol_table.sym)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 582, in parse
    dsp.raise_error('Unexpected token in datashape')
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 57, in raise_error
    self.ds_str, errmsg)
nose.proxy.DataShapeSyntaxError: 

  File <nofile>, line 1
    10 * X * A : floating
               ^

DataShapeSyntaxError: Unexpected token in datashape

-------------------- >> begin captured stdout << ---------------------


  File <nofile>, line 1
    10 * X * A : floating
               ^

DataShapeSyntaxError: Unexpected token in datashape


--------------------- >> end captured stdout << ----------------------

======================================================================
ERROR: test_constraints_error (datashape.tests.test_creation.TestDataShapeCreation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_creation.py", line 62, in test_constraints_error
    'A : integral * B : numeric')
  File "/usr/lib64/python3.4/unittest/case.py", line 701, in assertRaises
    return context.handle('assertRaises', callableObj, args, kwargs)
  File "/usr/lib64/python3.4/unittest/case.py", line 161, in handle
    callable_obj(*args, **kwargs)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/util.py", line 49, in dshape
    ds = parser.parse(o, type_symbol_table.sym)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 582, in parse
    dsp.raise_error('Unexpected token in datashape')
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 57, in raise_error
    self.ds_str, errmsg)
nose.proxy.DataShapeSyntaxError: 

  File <nofile>, line 1
    A : integral * B : numeric
      ^

DataShapeSyntaxError: Unexpected token in datashape

-------------------- >> begin captured stdout << ---------------------


  File <nofile>, line 1
    A : integral * B : numeric
      ^

DataShapeSyntaxError: Unexpected token in datashape


--------------------- >> end captured stdout << ----------------------

======================================================================
ERROR: test_type_decl (datashape.tests.test_creation.TestDataShapeCreation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_creation.py", line 70, in test_type_decl
    self.assertRaises(error.DataShapeTypeError, dshape, 'type X T = 3, T')
  File "/usr/lib64/python3.4/unittest/case.py", line 701, in assertRaises
    return context.handle('assertRaises', callableObj, args, kwargs)
  File "/usr/lib64/python3.4/unittest/case.py", line 161, in handle
    callable_obj(*args, **kwargs)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/util.py", line 49, in dshape
    ds = parser.parse(o, type_symbol_table.sym)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 578, in parse
    dsp.raise_error('Invalid datashape')
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/parser.py", line 57, in raise_error
    self.ds_str, errmsg)
nose.proxy.DataShapeSyntaxError: 

  File <nofile>, line 1
    type X T = 3, T
    ^

DataShapeSyntaxError: Invalid datashape

-------------------- >> begin captured stdout << ---------------------


  File <nofile>, line 1
    type X T = 3, T
    ^

DataShapeSyntaxError: Invalid datashape


--------------------- >> end captured stdout << ----------------------

======================================================================
ERROR: test_best_match (datashape.tests.test_overloading.TestOverloading)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_overloading.py", line 18, in test_best_match
    match = best_match(f, coretypes.Tuple([d1, d2]))
NameError: name 'best_match' is not defined

======================================================================
FAIL: test_coerce_constrained_typevars (datashape.tests.test_coercion.TestCoercion)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_coercion.py", line 55, in test_coerce_constrained_typevars
    self.assertGreater(coercion_cost(a, b), coercion_cost(a, c))
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/common.py", line 15, in assertGreater
    self.assertTrue(a > b, msg or "%s is not greater than %s" % (a, b))
AssertionError: False is not true : 1 is not greater than 1

======================================================================
FAIL: test_coerce_src_ellipsis (datashape.tests.test_coercion.TestCoercion)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_coercion.py", line 87, in test_coerce_src_ellipsis
    self.assertGreater(coercion_cost(a, b), coercion_cost(a, c))
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/common.py", line 15, in assertGreater
    self.assertTrue(a > b, msg or "%s is not greater than %s" % (a, b))
AssertionError: False is not true : 1 is not greater than inf

======================================================================
FAIL: test_coerce_typevars (datashape.tests.test_coercion.TestCoercion)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_coercion.py", line 49, in test_coerce_typevars
    self.assertGreater(coercion_cost(a, b), coercion_cost(a, c))
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/common.py", line 15, in assertGreater
    self.assertTrue(a > b, msg or "%s is not greater than %s" % (a, b))
AssertionError: False is not true : 1 is not greater than 1

======================================================================
FAIL: datashape.tests.test_discovery.test_time_string
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib64/python3.4/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/mnt/gen2/TmpDir/portage/dev-python/datashape-0.3.0/work/datashape-0.3.0-python3_4/lib/datashape/tests/test_discovery.py", line 81, in test_time_string
    assert discover('12:00:01') == time_
AssertionError

----------------------------------------------------------------------
Ran 133 tests in 0.506s

FAILED (errors=4, failures=4)

ply; Installed versions: 3.4
numpy; Installed versions: 0.4
multipledispatch; Installed versions: 0.4.4

uder pythons 2.7 3.3 3.4. Can you replicate? Do you require anything further?

Suspicious representation of DataShape class

I don't know if this is intended:

In []: import datashape

In []: datashape.DataShape((1,2), 'int32')
Out[]: dshape("(1, 2) * int32")

But I would say no.

I looked into the source, but the fix is not trivial for me.

Add tuple type

Unfortunatly comma is used to separate data shape and measure so we need to use a semicolon in a tuple:

(3, int32; 10 float 64)

Hashing Performance

The new Blaze computation pipeline stresses the expression system much more. The first performance issue to pop up is in datashape, notably hashing a datashape calls this property in Mono quite a bit

@property
def parameters(self):
    if hasattr(self, '__slots__'):
        return tuple(getattr(self, slot) for slot in self.__slots__)
    else:
        return self._parameters

Perhaps the hasattr and getattr bits are slow? This could be resolved either in datashape or in blaze. Caching the hash or parameters locally is a thought.

Remove "import datashape" from coercion.py

Was introduced in the coercion consistency PR.

DataShape discovery

Consider this github log entry

{"created_at":"2013-10-30T19:00:38-07:00","payload":{},"public":true,"type":"ForkEvent","url":"https://github.com/RenanAguiar/option-tree","actor":"RenanAguiar","actor_attributes":{"login":"RenanAguiar","type":"User","gravatar_id":"8001db3acad4f85b5947b4e14066ab3c","name":"Renan Aguiar","email":"[email protected]"},"repository":{"id":792613,"name":"option-tree","url":"https://github.com/valendesigns/option-tree","description":"Theme Options UI Builder for WordPress. This plugin provides a simple way to create & save Theme Options, and Meta Boxes, for free or premium themes.","homepage":"","watchers":163,"stargazers":163,"forks":55,"fork":false,"size":3813,"owner":"valendesigns","private":false,"open_issues":29,"has_issues":true,"has_downloads":true,"has_wiki":true,"language":"PHP","created_at":"2010-07-22T21:24:05-07:00","pushed_at":"2013-09-09T04:34:23-07:00","master_branch":"master"}}

We have gigabytes of files like these. We can access them with Blaze and DyND tools once we can write down a datashape. Unfortunately writing down a datashape for a dataset like this is pretty daunting. This is the sort of thing that could scare away a user.

Fortunately it should be possible to get a decent first guess at the datashape using basic heuristics. Modifying it from there should be more welcoming.

Documentation needed

How the implicit coercion system works (coercion_cost, etc)
The function signature pattern matching

from_numpy() mixes the order of fields in structure arrays in Python 3.3

The next script:

import numpy as np
import datashape

dt = np.dtype("i4,i8,f8")
dshape = datashape.from_numpy((2,), dt)
print("Final dshape:", dshape)

outputs different orders for the fields in the struct type in Python 3.3:

(py3k)faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
Final dshape: 2, { f2 : float64; f1 : int64; f0 : int32 }
(py3k)faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
Final dshape: 2, { f1 : int64; f0 : int32; f2 : float64 }
(py3k)faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
Final dshape: 2, { f2 : float64; f1 : int64; f0 : int32 }

but it works just fine with Python 2.7:

faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
('Final dshape:', dshape("2, { f0 : int32; f1 : int64; f2 : float64 }"))
faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
('Final dshape:', dshape("2, { f0 : int32; f1 : int64; f2 : float64 }"))
faltet@linux-je9a:/tmp/blaze-hdf5_dd> python /tmp/bug2.py 
('Final dshape:', dshape("2, { f0 : int32; f1 : int64; f2 : float64 }"))

nose is runtime dependency

Without having nose installed, I cannot even say import datashape. The nose package should only be required for running the tests.

Allow strings with backslash characters

The regex matching strings don't accept backslashes. Places to update this include:

Grammar specification:
https://github.com/ContinuumIO/datashape/blob/master/docs/source/grammar.rst#the-datashape-grammar

Lexer:
https://github.com/ContinuumIO/datashape/blob/master/datashape/lexer.py#L41

Tests

Should subarray always return a DataShape?

In [1]: import datashape

In [2]: type(datashape.dshape('3 * int').subarray(1))
Out[2]: datashape.coretypes.CType

Bigger question: should component be valid DataShapes?

Date/Datetime not NumPy compatible

When I try to create a Blaze HDF5 data descriptor file with dates in the schema I get a datashape error, saying that date is not numpy compatible. Is there a suggested solution for this?

cc @mwiebe

coercion/overloading mechanism is not sound

I'm getting some weird results with the coercion and overloading, where coercions are flipping sometimes.

For example as tested,
int -> float64 should be preferred over int -> complex[float32]

        a, b, c = dshapes('int32', 'float64', 'complex[float32]')
        self.assertLess(coercion_cost(a, b), coercion_cost(a, c))

but, after adding another test checking a related overloading, this flipped so that the conversion to complex[float32] was preferred.

The implementation takes what is a a partial ordering of types, and a type promotion lattice defined based on that, and smushes it onto the real line. Is there good precedent for this approach to the problem?

to_ctypes produces surprising output on Windows platform

Running doctest in datashape/utils.py on Windows platform:

================================== FAILURES _____________________ [doctest] datashape.util.to_ctypes ______________________
247     """
248     Constructs a ctypes type from a datashape
249
250     >>> to_ctypes(coretypes.int32)
Expected:
    <class 'ctypes.c_int'>
Got:
    <class 'ctypes.c_long'>

This occurs at least on python 3.3 on Windows 7 64bit and Windows 7 32bit.

missed module : multipledispatch

Hi, trying to install blaze I encountered the following problem :
No module named multipledispatch

full log here :

https://gist.github.com/ccf2ba2979ef8bfb2140

how can i fix this?
i checked in the datashape src code and i can't find multipledispatch

int* is converted to type var rather than error

In [15]: datashape.dshape("3, int33")
Out[15]: dshape("3, int33")

In [16]: datashape.dshape("3, int48")
Out[16]: dshape("3, int48")

Parameters shouldn't need to be in quotes

Currently parametrized datashapes like ascii strings

string["ascii"]

Have their text parameters encased in quotes. For user convenience it'd be nice to remove this requirement

string[ascii]

Discover Python int as int32 instead of int64

Should we default to int32 for basic Python ints? This is more compatible with DyND and, I think, CPython itself.

Current Behavior

>>> discover(1)
ctype("int64")

Proposed Behavior

>>> discover(1)
ctype("int32")