
chest's Introduction


Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. It gives Python users a familiar interface for querying data that lives in other data storage systems.

Example

We point Blaze at a simple dataset in a foreign database (PostgreSQL) and immediately see results as we would in a Pandas DataFrame.

>>> import blaze as bz
>>> iris = bz.Data('postgresql://localhost::iris')
>>> iris
    sepal_length  sepal_width  petal_length  petal_width      species
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa

These results appear immediately. Blaze does not pull the data out of Postgres; instead, it translates your Python commands into SQL (or into the query language of whichever backend holds the data).

>>> iris.species.distinct()
           species
0      Iris-setosa
1  Iris-versicolor
2   Iris-virginica

>>> bz.by(iris.species, smallest=iris.petal_length.min(),
...                      largest=iris.petal_length.max())
           species  largest  smallest
0      Iris-setosa      1.9       1.0
1  Iris-versicolor      5.1       3.0
2   Iris-virginica      6.9       4.5

This same example would have worked with a wide range of databases, on-disk text or binary files, or remote data.
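The `by` expression above is ordinary split-apply-combine. A pure-Python sketch of the same aggregation, using a few made-up `(species, petal_length)` rows for illustration:

```python
from collections import defaultdict

# Hypothetical rows standing in for the iris table.
rows = [('Iris-setosa', 1.4), ('Iris-setosa', 1.0),
        ('Iris-virginica', 6.9), ('Iris-virginica', 4.5)]

# Split: group petal lengths by species.
groups = defaultdict(list)
for species, petal_length in rows:
    groups[species].append(petal_length)

# Apply + combine: compute the min and max per group.
result = {species: {'smallest': min(vals), 'largest': max(vals)}
          for species, vals in groups.items()}
```

Blaze expresses this same pattern once and pushes it down to whichever backend holds the data.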

What Blaze is not

Blaze does not perform computation. It relies on other systems like SQL, Spark, or Pandas to do the actual number crunching. It is not a replacement for any of these systems.

Blaze does not implement the entire NumPy/Pandas API, nor does it interact with libraries intended to work with NumPy/Pandas. This is the cost of using more and larger data systems.

Blaze is a good way to inspect data living in a large database, perform a small but powerful set of operations to query that data, and then transform your results into a format suitable for your favorite Python tools.

In the Abstract

Blaze separates the computations that we want to perform:

>>> from blaze import Symbol, compute
>>> accounts = Symbol('accounts', 'var * {id: int, name: string, amount: int}')

>>> deadbeats = accounts[accounts.amount < 0].name

from the representation of that data:

>>> L = [[1, 'Alice',   100],
...      [2, 'Bob',    -200],
...      [3, 'Charlie', 300],
...      [4, 'Denis',   400],
...      [5, 'Edith',  -500]]

Blaze enables users to solve data-oriented problems:

>>> list(compute(deadbeats, L))
['Bob', 'Edith']

But the separation of expression from data allows us to switch between different backends.

Here we solve the same problem using Pandas instead of pure Python.

>>> from pandas import DataFrame
>>> df = DataFrame(L, columns=['id', 'name', 'amount'])

>>> compute(deadbeats, df)
1      Bob
4    Edith
Name: name, dtype: object

Blaze doesn't compute these results itself; it intelligently drives other projects to compute them. These projects range from simple pure-Python iterators to powerful distributed Spark clusters. Blaze is built to be extended to new systems as they evolve.
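The extension mechanism can be pictured with dispatch on the data's type. Blaze itself uses multiple dispatch; the sketch below uses the stdlib's single dispatch, and all names are illustrative only:

```python
from functools import singledispatch

@singledispatch
def compute_deadbeats(data):
    """Find names with negative amounts; dispatch on the data's type."""
    raise NotImplementedError(type(data))

@compute_deadbeats.register(list)
def _(rows):
    # Pure-Python backend: rows are [id, name, amount] lists.
    return [name for _id, name, amount in rows if amount < 0]
```

Supporting a new backend is then one more `register` call for that backend's container type.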

Getting Started

Blaze is available via conda or PyPI:

conda install blaze
pip install blaze

Development builds are also available:

conda install blaze -c blaze
pip install http://github.com/blaze/blaze --upgrade

You may want to view the docs, the tutorial, some blog posts, or the mailing list archives.

Development setup

The quickest way to install all Blaze dependencies with conda is as follows:

conda install blaze spark -c blaze -c anaconda-cluster -y
conda remove odo blaze blaze-core datashape -y

After running these commands, clone odo, blaze, and datashape from GitHub directly. These three projects release together. Run python setup.py develop to make development installations of each.

License

Released under the BSD license. See LICENSE.txt for details.

Blaze development is sponsored by Continuum Analytics.

chest's People

Contributors

maxhutch, mrocklin, thyrgle


chest's Issues

Odd behavior when given initial data

I just installed Anaconda 4.3 and noticed this module, so I started playing with it and noticed this:

>>> from chest import Chest
>>> d={42:23}
>>> c=Chest(d, "my-chest")
>>> list(c)
[]
>>> 42 in c
False 
>>> c[42]
23 
>>> d
{42: 23}
>>> c.flush()
>>> d
{}
>>> list(c)
[]
>>> 42 in c
False
>>> c[42]
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    c[42]
  File "C:\Anaconda3\lib\site-packages\chest\core.py", line 182, in __getitem__
    raise KeyError("Key not found: %s" % str(key))
KeyError: 'Key not found: 42'
>>> c.get_from_disk(42)
>>> c[42]
23
>>> 42 in c
False
>>> list(c)
[]
>>> 

Why does it say that 42 is not in c?
Why does it empty my initializer dict?
Why do I get an empty list when I do list(c)?

But if I close Python, delete the folder, start again, and don't give it initial data, it works fine:

>>> from chest import Chest
>>> c =Chest(path="my-chest")
>>> c[42]=23
>>> 42 in c
True
>>> list(c)
[42]
>>> c.flush()
>>> list(c)
[42]
>>> 42 in c
True
>>> c[42]
23
>>> 

Include pickling with the highest protocol

I am trying to change the protocol used for dumping, but it always complains that it does not understand the protocol argument in dump=partial(pickle.dump, protocol=pickle.HIGHEST_PROTOCOL). Could you make this the default?
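A possible workaround in the meantime, assuming Chest calls dump(value, file) on its dump callable (a sketch, not the library's documented behavior): wrap pickle.dump in a plain function that binds the protocol, and pass that as dump=.

```python
import io
import pickle

def dump_highest(obj, file):
    # Intended for use as Chest(dump=dump_highest, load=pickle.load, ...).
    pickle.dump(obj, file, protocol=pickle.HIGHEST_PROTOCOL)

# Round-trip check with an in-memory buffer.
buf = io.BytesIO()
dump_highest({'answer': 42}, buf)
buf.seek(0)
restored = pickle.load(buf)
```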

Tag for 0.2.3

It would be nice to have a tag for version 0.2.3 so we know where the PyPI release was made.

`data` argument does not work

Hello there,

When using the data argument to initialize the chest with a dict, the in-memory dict does not seem to get correctly initialized. Here is an example:

out = Chest(data={'id': set(), 'label': set()})
out2 = Chest()
out2['id'] = set()
out2['label'] = set()
print(out.keys())
print(out2.keys())

Output:
[]
['id', 'label']

When data is used, the keys are not available and hasattr(out, 'id') returns False, but out['id'] works and returns set().

BTW, it is also not possible to update a chest with a simple dict:

out.update({'id': set([1, 2])})

Output:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-76-7a010f35ab01> in <module>()
----> 1 out.update({'id': set([1, 2])})

C:\git\centertbi-mri\libs\chest\core.py in update(self, other, overwrite)
    290         #  if already flushed, then this does nothing
    291         self.flush()
--> 292         other.flush()
    293         for key in other._keys:
    294             if key in self._keys and overwrite:

AttributeError: 'dict' object has no attribute 'flush'
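Until this is fixed, one workaround sketch is to insert the items one at a time instead of relying on data= or update(), so each key goes through __setitem__. This works with any MutableMapping; a plain dict stands in for a Chest below.

```python
def fill_mapping(mapping, items):
    # Insert items individually so each one goes through __setitem__.
    for key, value in items.items():
        mapping[key] = value
    return mapping

# A plain dict standing in for Chest() here.
store = fill_mapping({}, {'id': set(), 'label': set()})
```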

run out of inodes by using chest

Hi,

I tried chest on a large file on an Ubuntu 16.04 system. Chest created millions of files on my hard disk and I ran out of inodes. I have plenty of disk space available but cannot use it because inode usage is at its peak. What can I do about it? Can I reduce the millions of generated files to hundreds? Is there an option for this?

Thanks in advance
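Chest stores roughly one file per key, so the inode count grows with the number of keys. A generic mitigation (not built into chest; a sketch only) is to shard keys into a fixed number of bucket files, so the on-disk file count is bounded by the bucket count rather than the key count:

```python
import hashlib

def bucket_for(key, n_buckets=256):
    # Map a key deterministically to one of n_buckets shard files.
    digest = hashlib.md5(repr(key).encode()).hexdigest()
    return int(digest, 16) % n_buckets
```

Many values would then be appended to the same shard file, at the cost of reading a whole shard to fetch one key.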

Python 3.9 support

 /opt/hostedtoolcache/Python/3.8.7/x64/lib/python3.8/site-packages/chest/core.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
    from collections import MutableMapping

Would you accept a PR fix, and make a new release?
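The usual backward-compatible import would fix this (a sketch of what such a PR might do):

```python
try:
    # Python 3.3+: the ABCs live in collections.abc (required on 3.10+).
    from collections.abc import MutableMapping
except ImportError:
    # Very old Pythons kept them in collections itself.
    from collections import MutableMapping
```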

read-only mode

I use chests to distribute data. In most cases, the data consumer either doesn't want to change the contents of the chest or doesn't even have the file permissions to. Chests write the .keys file to disk at close, even if nothing has changed. __delitem__ and __setitem__ both delete files even before a flush, so it might be nice to guard them against programmer errors.

Does this seem useful enough to belong in chest itself, or should I implement this on my end in a subclass or the likes?
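Pending a built-in mode, a read-only view can live in a thin wrapper over any Mapping (a sketch, not chest's API; a plain dict stands in for a Chest below):

```python
from collections.abc import Mapping

class ReadOnlyView(Mapping):
    """Expose lookups but block all mutation."""
    def __init__(self, backing):
        self._backing = backing
    def __getitem__(self, key):
        return self._backing[key]
    def __iter__(self):
        return iter(self._backing)
    def __len__(self):
        return len(self._backing)

view = ReadOnlyView({42: 23})
```

Because Mapping defines no __setitem__ or __delitem__, attempts to assign or delete through the view raise TypeError.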

dump with another wrapper than the one used for caching

I am trying to save the resulting chest into a different file format than the one used for caching.

A practical example: I use pickle when I initialize my chest because I am using nested dicts and sets, which I guess would be mangled if stored as JSON.

But in the end, I would like to dump my chest into json for export.

Is there a way to do that?

For example, this does not work past the first level:

import ujson as json
with open('db.json', 'w') as f:
    f.write(json.dumps(out, ensure_ascii=False, indent=4, sort_keys=True))

Prefetch opens too many files

For a reasonably large chest c, c.prefetch(list(c.keys())) will cause OSError: [Errno 24] Too many open files because open_many opens everything, consumes everything, and then closes everything.

Simple solution: chunk up prefetches into open_many's of some maximum size.
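The chunking fix can be sketched in a few lines; `chunked` below is a hypothetical helper, and `c.prefetch` is the existing chest method:

```python
def chunked(seq, size):
    # Yield consecutive slices of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Usage sketch against a chest `c` (not run here), bounding
# the number of simultaneously open files at 512:
# for batch in chunked(list(c.keys()), 512):
#     c.prefetch(batch)
```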
