
pystore's Introduction

PyStore - Fast data store for Pandas timeseries data


PyStore is a simple (yet powerful) datastore for Pandas dataframes, and while it can store any Pandas object, it was designed with storing timeseries data in mind.

It's built on top of Pandas, Numpy, Dask, and Parquet (via Fastparquet) to provide an easy-to-use datastore for Python developers that can query millions of rows per second per client.

==> Check out this Blog post for the reasoning and philosophy behind PyStore, as well as a detailed tutorial with code examples.

==> Follow this PyStore tutorial in Jupyter notebook format.

Quickstart

Install PyStore

Install using `pip`:

$ pip install pystore --upgrade --no-cache-dir

Install using `conda`:

$ conda install -c ranaroussi pystore

INSTALLATION NOTE: If you don't have Snappy installed (compression/decompression library), you'll need to install it first.

Using PyStore

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pystore
import quandl

# Set storage path (optional)
# Defaults to `~/pystore` or `PYSTORE_PATH` environment variable (if set)
pystore.set_path("~/pystore")

# List stores
pystore.list_stores()

# Connect to datastore (create it if not exist)
store = pystore.store('mydatastore')

# List existing collections
store.list_collections()

# Access a collection (create it if not exist)
collection = store.collection('NASDAQ')

# List items in collection
collection.list_items()

# Load some data from Quandl
aapl = quandl.get("WIKI/AAPL", authtoken="your token here")

# Store the first 100 rows of the data in the collection under "AAPL"
collection.write('AAPL', aapl[:100], metadata={'source': 'Quandl'})

# Reading the item's data
item = collection.item('AAPL')
data = item.data  # <-- Dask dataframe (see dask.pydata.org)
metadata = item.metadata
df = item.to_pandas()

# Append the rest of the rows to the "AAPL" item
collection.append('AAPL', aapl[100:])

# Reading the item's data
item = collection.item('AAPL')
data = item.data
metadata = item.metadata
df = item.to_pandas()


# --- Query functionality ---

# Query available symbols based on metadata
collection.list_items(some_key='some_value', other_key='other_value')


# --- Snapshot functionality ---

# Snapshot a collection
# (Point-in-time named reference for all current symbols in a collection)
collection.create_snapshot('snapshot_name')

# List available snapshots
collection.list_snapshots()

# Get a version of a symbol given a snapshot name
collection.item('AAPL', snapshot='snapshot_name')

# Delete a collection snapshot
collection.delete_snapshot('snapshot_name')


# ...


# Delete the item from the current version
collection.delete_item('AAPL')

# Delete the collection
store.delete_collection('NASDAQ')

Using Dask schedulers

PyStore 0.1.18+ supports using Dask distributed.

To use a local Dask scheduler, add this to your code:

from dask.distributed import LocalCluster
pystore.set_client(LocalCluster())

To use a distributed Dask scheduler, add this to your code:

pystore.set_client("tcp://xxx.xxx.xxx.xxx:xxxx")
pystore.set_path("/path/to/shared/volume/all/workers/can/access")

Concepts

PyStore provides namespaced collections of data. These collections allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.). Each collection (or namespace) maps to a directory containing partitioned parquet files for each item (e.g. symbol).

A good practice is to create collections that may look something like this (a minimal example follows the list):

  • collection.EOD
  • collection.ONEMINUTE
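
For illustration, here is a minimal sketch of that layout, using a hypothetical store name and a tiny sample dataframe; the calls themselves (store, collection, write) are the ones documented in the Quickstart above:

import pandas as pd
import pystore

store = pystore.store('mydatastore')

# one collection (namespace) per data frequency
eod = store.collection('EOD')           # End-Of-Day bars
minute = store.collection('ONEMINUTE')  # 1-minute bars

# hypothetical sample data, just to show where the write call goes
df = pd.DataFrame({'close': [100.0, 101.5]},
                  index=pd.date_range('2019-01-01', periods=2, freq='D'))
eod.write('AAPL', df, metadata={'frequency': 'EOD'})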

Requirements

  • Python 2.7 or Python > 3.5
  • Pandas
  • Numpy
  • Dask
  • Fastparquet
  • Snappy (Google's compression/decompression library)
  • multitasking

PyStore was tested to work on *nix-like systems, including macOS.

Dependencies:

PyStore uses Snappy, a fast and efficient compression/decompression library from Google. You'll need to install Snappy on your system before installing PyStore.

* See the python-snappy GitHub repo for more information.

*nix Systems:

  • APT: sudo apt-get install libsnappy-dev
  • RPM: sudo yum install libsnappy-devel

macOS:

First, install Snappy's C library using Homebrew:

$ brew install snappy

Then, install Python's snappy using conda:

$ conda install python-snappy -c conda-forge

...or, using `pip`:

$ CPPFLAGS="-I/usr/local/include -L/usr/local/lib" pip install python-snappy

Windows:

Windows users should check out Snappy for Windows and this Stack Overflow post for help on installing Snappy and python-snappy.

Roadmap

PyStore currently offers support for local filesystem (including attached network drives). I plan on adding support for Amazon S3 (via s3fs), Google Cloud Storage (via gcsfs) and Hadoop Distributed File System (via hdfs3) in the future.

Acknowledgements

PyStore is hugely inspired by Man AHL's Arctic, which uses MongoDB for storage and allows for versioning and other features. I highly recommend you check it out.

License

PyStore is licensed under the Apache License, Version 2.0, a copy of which is included in LICENSE.txt.


I'm very interested in your experience with PyStore. Please drop me a note with any feedback you have.

Contributions welcome!

- Ran Aroussi

pystore's People

Contributors

ancher1912, dependabot-preview[bot], hugovdberg, ranaroussi


pystore's Issues

Trying to load collection on windows 10 causes TypeError

Hi Ran,
On windows 10 anaconda 3.5 install, when I tried to load a collection with the following code I got:

pystore.set_path(str('C:\\jupyter_notebooks\\zipline\\pystore'))
clc = pystore.store('data').collection('weather')

TypeError: join() argument must be str or bytes, not 'WindowsPath'

I was able to correct this by changing line 110 in utils.py from:
return Path(os.path.join(*args))
to
return Path(os.path.join(*[str(x) for x in args]))

I'm not sure but I think this change will be safe for non-windows installs.

Append doesn't work on datetimes

Hi,
First of all, thank you for this project ;-)

I've got a little problem with the append function: it doesn't do anything when the indexes are datetimes:

import datetime
import random

import pandas as pd
import pystore

pystore.set_path('/dev/shm/test/pystore')
store = pystore.store('mydatastore')
collection = store.collection('collection')


item = 'test'

it1 = pd.DataFrame({'data': {datetime.datetime(2019, 1, 1): random.random()}})
collection.write(item, it1, overwrite=True)

it2 = pd.DataFrame({'data': {datetime.datetime(2019, 1, 2): random.random()}})
collection.append(item, it2)


it = collection.item(item)
df = it.to_pandas()
print(df)

yields:

                data
index               
2019-01-01  0.450675

However, if I use a timestamp instead of a datetime (datetime.datetime(xxx).timestamp()), it works:

                         data
index                        
2018-12-31 23:00:00  0.385200
2019-01-01 23:00:00  0.639432

Thanks

Q: Fastest way to load multiple DataFrames at the same time

Awesome project.

Just a short question: I have around 2000 stored dataframes now and I would like to load 500 of them as fast as possible into one Python process. Is there a batch-load function?

I coded something with ThreadPoolExecutor and it loads 3GB on disk into around a 40GB DataFrame (which is pretty heavy) in under four minutes using 5 threads.

Does somebody see a faster variant? The SSD is relaxed; it looks like the performance limitation lies in df = item.to_pandas(), which is CPU intensive.
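
For reference, a minimal sketch of the ThreadPoolExecutor approach described above (the store name, collection name, and worker count are hypothetical) might look like this:

from concurrent.futures import ThreadPoolExecutor

import pystore

store = pystore.store('mydatastore')
collection = store.collection('NASDAQ')

symbols = collection.list_items()  # or any subset of ~500 item names

def load(symbol):
    # each call reads one item's parquet files and materializes a pandas DataFrame
    return symbol, collection.item(symbol).to_pandas()

with ThreadPoolExecutor(max_workers=5) as pool:
    frames = dict(pool.map(load, symbols))

Since the bottleneck described above is CPU-bound work inside to_pandas(), a ProcessPoolExecutor may parallelize better than threads, at the cost of pickling the resulting frames between processes.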

AWS support

Is there any plan to add AWS/GCE support? I need to use pystore for multiple users.

Thanks!

Append not working with Dask > 2.1.0

Append stopped working for me in 0.1.12

My reproducible example:

import random
import datetime as dt
import pandas as pd
import pystore as ps

#create df with random numbers and timestamp as index
n = 2
timestamps = [dt.datetime.utcnow() + dt.timedelta(seconds=x) for x in range(n)]
pb = ([{'timestamp': timestamp, 'pb': random.random()} for timestamp in timestamps])
df = pd.DataFrame.from_dict(pb).set_index('timestamp')

store = ps.store('mydatastore') 
collection = store.collection('eod')

#write to pystore
collection.write('symbol', df[:-1], overwrite=True)

#append to pystore
collection.append('symbol',  df[-1:]) 

#read 
collection.item('symbol').to_pandas()

This works in 0.1.9
Would be grateful for any hint.

Kind regards
Steffen

Nested Dataframes causes exception

import pandas as pd
import pystore
pystore.set_path("./pystore")
store = pystore.store('test')
collection = store.collection('demo collection')

df_path_and_hash = pd.DataFrame({'path': ['path1', 'path2'], 'hash': [0, 1]})
d_container = {'idx':[1], 'dfs':[df_path_and_hash]}
df_container = pd.DataFrame(d_container)

collection.write('test item', df_container)

item = collection.item('test item')

Which causes the exception:

Traceback (most recent call last):
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\api.py", line 110, in __init__
    with open_with(fn2, 'rb') as f:
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fsspec\spec.py", line 940, in open
    f = self._open(
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fsspec\implementations\local.py", line 118, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fsspec\implementations\local.py", line 200, in __init__
    self._open()
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fsspec\implementations\local.py", line 205, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: '<user>/Documents/Professional Work/Work/dev, keyword check; a tool to help find text in files/pystore/test/demo collection/test item/part.0.parquet/_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "KC_dev.py", line 2329, in <module>
    main()
  File "KC_dev.py", line 54, in main
    item = collection.item('test item')
  File "<user>\anaconda3\envs\dev1\lib\site-packages\pystore\collection.py", line 78, in item
    return Item(item, self.datastore, self.collection,
  File "<user>\anaconda3\envs\dev1\lib\site-packages\pystore\item.py", line 60, in __init__
    self.data = dd.read_parquet(
  File "<user>\anaconda3\envs\dev1\lib\site-packages\dask\dataframe\io\parquet\core.py", line 307, in read_parquet
    read_metadata_result = engine.read_metadata(
  File "<user>\anaconda3\envs\dev1\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 678, in read_metadata
    parts, pf, gather_statistics, base_path = _determine_pf_parts(
  File "<user>\pmatt\anaconda3\envs\dev1\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 159, in _determine_pf_parts
    pf = ParquetFile(paths, open_with=fs.open, **kwargs.get("file", {}))
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\api.py", line 90, in __init__
    basepath, fmd = metadata_from_many(fn, verify_schema=verify,
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\util.py", line 134, in metadata_from_many
    pfs = [api.ParquetFile(fn, open_with=open_with) for fn in file_list]
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\util.py", line 134, in <listcomp>
    pfs = [api.ParquetFile(fn, open_with=open_with) for fn in file_list]
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\api.py", line 116, in __init__
    self._parse_header(f, verify)
  File "<user>\anaconda3\envs\dev1\lib\site-packages\fastparquet\api.py", line 133, in _parse_header
    f.seek(-(head_size+8), 2)
OSError: [Errno 22] Invalid argument

An un-nested dataframe causes no issue:

import pandas as pd
import pystore
pystore.set_path("./pystore")
store = pystore.store('test')
collection = store.collection('demo collection')

df_path_and_hash = pd.DataFrame({'path': ['path1', 'path2'], 'hash': [0, 1]})
d_container = {'idx':[1], 'dfs':[df_path_and_hash]}
df_container = pd.DataFrame(d_container)

collection.write('test item', df_path_and_hash)

item = collection.item('test item')

Important issue

Just a short message - this project is amazing.
I am coming from TimescaleDB/PostgreSQL and was suffering with the loading times into dataframes. Your project increases the speed of my data flow by a factor of over 5, while reducing disk space as well. As a starting point I inserted 110,000,000 rows and read them back in great time. :)

you can close this issue now. :)

access item before it's created causes weird PermissionError

When you try to access an item before you write to it, it causes a PermissionError, at least on Windows 10. Because it needs to open the item, it creates the directory for the item but then does not create any of the metadata. This puts the item into a limbo state where it appears to exist (it is listed as an item by collection.list_items()), but it cannot be accessed.
It would be more helpful to first check if the requested item exists, and if not raise a meaningful ValueError, and not create the directory.
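
Until such a check exists in the library, a defensive pattern on the caller's side, sketched here with hypothetical store/collection/item names and using only the documented list_items() and item() calls, would be:

import pystore

store = pystore.store('mydatastore')
collection = store.collection('NASDAQ')

symbol = 'AAPL'
if symbol in collection.list_items():
    item = collection.item(symbol)
else:
    # avoid touching the item path at all, so no half-created directory is left behind
    raise ValueError("Item '%s' does not exist in collection 'NASDAQ'" % symbol)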

.to_pandas() error [can't read parquet file even though there is data in it when i look with parquet viewer]

I reinstalled my computer and now I can't read my old pystore data; I get an error when I reach the last line, df = item.to_pandas().

I even used the example pystore code, just using yfinance instead of quandl, and I still can't read the data once it's created, even though I can see it clearly with a parquet viewer. See the image here: https://i.imgur.com/ykKy5cn.png

I'm getting this error:

Traceback (most recent call last):
  File "C:\Users\tothd\anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\tothd\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\tothd\.vscode\extensions\ms-python.python-2022.2.1924087327\pythonFiles\lib\python\debugpy\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\tothd\.vscode\extensions\ms-python.python-2022.2.1924087327\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 444, in main
    run()
  File "c:\Users\tothd\.vscode\extensions\ms-python.python-2022.2.1924087327\pythonFiles\lib\python\debugpy/..\debugpy\server\cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "C:\Users\tothd\anaconda3\lib\runpy.py", line 268, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\tothd\anaconda3\lib\runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\tothd\anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "c:\Users\tothd\OneDrive\Desktop\Developers\0_CODE\1_algo_trading\active\data_storage\yfinance_test.py", line 34, in <module>
    df = item.to_pandas()
  File "C:\Users\tothd\anaconda3\lib\site-packages\pystore\item.py", line 71, in to_pandas
    elif df.index.values[0] > 1e6:
IndexError: index 0 is out of bounds for axis 0 with size 0
Press any key to continue . . .

Here is pystore's .to_pandas() function:

    def to_pandas(self, parse_dates=True):
        df = self.data.compute()

        if parse_dates and "datetime" not in str(df.index.dtype):
            df.index.name = ""
            if str(df.index.dtype) == "float64":
                df.index = pd.to_datetime(df.index, unit="s",
                                          infer_datetime_format=True)
            elif df.index.values[0] > 1e6:
                df.index = pd.to_datetime(df.index,
                                          infer_datetime_format=True)

        return df

What am I missing? Before I reinstalled my computer it worked without any problems, and I used pystore all the time. Here is the code I used as an example; I just changed 2 lines to use yfinance instead of quandl, and with quandl data it produced the exact same error.

import pystore
import yfinance as yf

# Set storage path (optional)
# Defaults to `~/pystore` or `PYSTORE_PATH` environment variable (if set)
pystore.set_path(r'C:\Users\tothd\OneDrive\Desktop')

# List stores
print(pystore.list_stores())

# Connect to datastore (create it if not exist)
store = pystore.store('mydatastore')

# List existing collections
print(store.list_collections())

# Access a collection (create it if not exist)
collection = store.collection('NASDAQ')

# List items in collection
print(collection.list_items())

# Load some data from yfinance (these 2 lines are the only ones I edited)
msft = yf.Ticker("AAPL")
aapl = msft.history(period="max")

# Store the first 100 rows of the data in the collection under "AAPL"
collection.write('AAPL', aapl[:100], metadata={'source': 'yfinance'})

# Reading the item's data
item = collection.item('AAPL')
data = item.data  # <-- Dask dataframe (see dask.pydata.org)
metadata = item.metadata
df = item.to_pandas()

# Append the rest of the rows to the "AAPL" item
collection.append('AAPL', aapl[100:])

# Reading the item's data
item = collection.item('AAPL')
data = item.data
metadata = item.metadata
df = item.to_pandas()

Integer index is converted to datetime automatically

Hi I'm experimenting with pystore and just realized that an integer index is converted automatically to epoch datetime when reading the stored data:

tmp_df.head()
Out[69]: 
      askPrice  askSize      ...        mid_price  dollar_volume
2524   65.9800      100      ...          65.9550     6,596.0000
3155   65.9800      100      ...          65.9550     6,596.0000
3786   66.0100     2500      ...          65.9400     6,596.0000
4417   66.0200     2500      ...          65.9800     6,596.0000
5048   66.0200     2500      ...          65.9800     6,596.0000
[5 rows x 19 columns]
collection.append(symbols[0], tmp_df.reset_index(drop=True))
df1 = collection.item(symbols[0]).to_pandas()
cprint(df1)
-------------------------------------------------------------------------------
dataframe information
-------------------------------------------------------------------------------
                               askPrice       ...         dollar_volume
index                                         ...                      
1970-01-01 00:00:00.000325455   67.2100       ...       16,339,954.8800
1970-01-01 00:00:00.000325456   67.2100       ...       16,339,954.8800
1970-01-01 00:00:00.000325457   67.2100       ...       16,810,389.8800
1970-01-01 00:00:00.000325458   67.2500       ...       16,962,429.2400
1970-01-01 00:00:00.000325459   67.2500       ...       16,962,429.2400

I'm not sure if this is expected behavior as it wasn't stated in the examples that pystore requires a datetime index.
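
Given the to_pandas(parse_dates=True) implementation quoted in an earlier issue on this page, the conversion is just a heuristic applied on read; a sketch of two ways around it (store, collection, and item names are hypothetical):

import pystore

collection = pystore.store('mydatastore').collection('ticks')
item = collection.item('SYMBOL')

# keep the integer index as-is by skipping the date-parsing heuristic
df = item.to_pandas(parse_dates=False)

# ...or materialize the underlying Dask dataframe directly
df = item.data.compute()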

Does append() work on OSX?

Has anyone been able to make append() work in a recent release?

Could anyone share with me a set of deps that allow collection.append() to work?

I've been through the various threads on this, but all seem to result in a silent failure.

I also get a lot of deprecation warnings of this sort...

Dask Name: read-parquet, 1 tasks
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:1929: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:975: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
collection1.append('TEST', z1)
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:1929: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:1929: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:975: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:1929: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:975: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
collection1.item('TEST').data
/Users/andrew/anaconda3/envs/feature_wrapper/lib/python3.8/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py:1929: DeprecationWarning: PY_SSIZE_T_CLEAN will be required for '#' formats
  iprot._fast_decode(self, iprot, [self.__class__, self.thrift_spec])
Dask DataFrame Structure:

I realise deprecation warnings aren't necessarily a problem, but perhaps they point to some underlying problem with deps.

I haven't included any particular config, as I've been through many with varying degrees of error, all centered around append(). So I'm hoping someone has already trodden this same path more successfully than me :-)

Append creates new parquet file

Not sure if this is the intended functionality, but I wanted to verify before reporting an issue. I am appending daily minute data files together using the default chunksize. I would have expected the parquet file to grow until the chunksize or npartition size was met before creating another parquet file. However, a new parquet file is created for each append call.

Is this the intended functionality? I can look into the code further and/or provide my code if this is not your intention.

how to read all columns but the one used for partitioning

I am storing timeseries dataframes (index=datatimeindex, multiple columns of data).
I add a column "year" with the df.index.year.
I write to the collection with collection.write(item_name, df, overwrite=True, partition_on=["year"]).
When I read it back, I use item = collection.item(item_name, filters=[("year", "==", year)]) and I would like to avoid reading (for performance) the "year" column (as it is only used for partitioning). I can read the columns in item.data.columns and remove from this Index the "year". But then, in the item.to_pandas(), I cannot specify the columns to read from.
Is there a proper way to do what I want?
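
Based on the Collection.item() signature that appears in a traceback elsewhere on this page (item, snapshot, filters, columns, ...), it may be possible to push the column selection into the read itself; the columns argument here is an assumption, and all names are hypothetical:

import pystore

collection = pystore.store('mydatastore').collection('timeseries')
item_name, year = 'MYITEM', 2020

# `columns=` is assumed from the Collection.item() signature
item = collection.item(item_name,
                       filters=[("year", "==", year)],
                       columns=["open", "high", "low", "close"])
df = item.to_pandas()

# alternative: drop the partition column on the Dask dataframe before computing
item = collection.item(item_name, filters=[("year", "==", year)])
df = item.data.drop(columns=["year"]).compute()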

how to update existing data?

the latest stock bar data may change because its timeframe has not finished yet. For example:

>>> datetime.now()
datetime.datetime(2019, 12, 5, 15, 48, 46, 595878)
>>> exchange.fetch_ohlcv(symbol='BTC/USDT', limit=3)
[[1575531960000, 7297.07, 7300.0, 7296.36, 7299.49, 15.150618], [1575532020000, 7299.16, 7304.13, 7299.16, 7303.65, 25.041672], [1575532080000, 7303.68, 7307.73, 7302.76, 7307.7, 18.090149]]
>>> datetime.now()
datetime.datetime(2019, 12, 5, 15, 48, 52, 243378)
>>> exchange.fetch_ohlcv(symbol='BTC/USDT', limit=3)
[[1575531960000, 7297.07, 7300.0, 7296.36, 7299.49, 15.150618], [1575532020000, 7299.16, 7304.13, 7299.16, 7303.65, 25.041672], [1575532080000, 7303.68, 7308.97, 7302.76, 7308.97, 19.671468]]

Notice that the bar at 1575532080000 is not finished at 15:48, so the close value keeps changing when I fetch the data between 15:48:00 and 15:48:59.

Is there any way to update data after I have written the old version into the store?
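
Since append() skips rows whose index already exists (see the collection.py excerpt quoted in a later issue, data[~data.index.isin(old_index)]), re-appending an updated bar won't change the stored value. One possible pattern, sketched with hypothetical names and using only the documented read/write calls, is to merge in memory and rewrite the item:

import pandas as pd
import pystore

collection = pystore.store('mydatastore').collection('BINANCE')
symbol = 'BTC-USDT'

# freshly fetched bars, indexed by timestamp; they may overlap the stored data
new_bars = pd.DataFrame(
    {'close': [7303.65, 7308.97]},
    index=pd.to_datetime([1575532020000, 1575532080000], unit='ms'))

old = collection.item(symbol).to_pandas()
# keep the fresh copy of any overlapping timestamps, then rewrite the item
merged = pd.concat([old, new_bars])
merged = merged[~merged.index.duplicated(keep='last')].sort_index()
collection.write(symbol, merged, overwrite=True)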

Append error: TypeError: Cannot compare tz-naive and tz-aware timestamps

Hello,

I am passing a tz-aware dataframe to pystore/append, and I get this error message.

 collection.append(item_ID, df, npartitions=item.data.npartitions)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\pystore\coll
ection.py", line 184, in append
    combined = dd.concat([current.data, new]).drop_duplicates(keep="last")
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\datafra
me\multi.py", line 1070, in concat
    for i in range(len(dfs) - 1)
  File "C:\Users\pierre.juillard\Documents\Programs\Anaconda\lib\site-packages\dask\datafra
me\multi.py", line 1070, in <genexpr>
    for i in range(len(dfs) - 1)
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 109, in pandas._libs.tslibs.c_timestamp.
_Timestamp.__richcmp__
  File "pandas\_libs\tslibs\c_timestamp.pyx", line 169, in pandas._libs.tslibs.c_timestamp.
_Timestamp._assert_tzawareness_compat
TypeError: Cannot compare tz-naive and tz-aware timestamps

[EDIT]
Here is code that can simply be copy/pasted to reproduce the error message.
Does someone see what I could possibly be doing wrong?

import os
import pandas as pd
import pystore

ts_list = ['Sun Dec 22 2019 07:40:00 GMT-0100',
           'Sun Dec 22 2019 07:45:00 GMT-0100',
           'Sun Dec 22 2019 07:50:00 GMT-0100',
           'Sun Dec 22 2019 07:55:00 GMT-0100']

op_list = [7134.0, 7134.34, 7135.03, 7131.74]

GC = pd.DataFrame(list(zip(ts_list, op_list)), columns =['date', 'open'])

# Getting timestamps back into GC, and resolving it to UTC time
GC['date'] = pd.to_datetime(GC['date'], utc=True)

# Rename columns
GC.rename(columns={'date': 'Timestamp'}, inplace=True)
    
# Set timestamp column as index
GC.set_index('Timestamp', inplace = True, verify_integrity = True)

# Connect to datastore (create it if not exist)
store = pystore.store('OHLCV')
# Access a collection (create it if not exist)
collection = store.collection('AAPL')
item_ID = 'EOD'
collection.write(item_ID, GC[:-1], overwrite=True)
item = collection.item(item_ID)
collection.append(item_ID, GC[-1:], npartitions=item.data.npartitions)

I thank you for your help.
Have a good day,
Bests,
Pierrot

Collection list_items()

Ran,

I have two proposed ideas regarding the list_items():

  1. In the collection object I notice that you currently initialize self.items = self.list_items(). Calling self.list_items() is unnecessary overhead if you have thousands of items in the collection. It seems like self.items could be updated directly without searching the directory again: can we just remove the item from the list or add it to the list?

  2. Do you see value in changing the type of self.items from a list to a set? Since the items in the collection are unique, this seems like a valid option and the search speed would be considerably faster.

If these changes are of interest I can do a PR. Let me know

Craig

Behaviour of append method

Hi everyone :)

I would like to confirm my understanding of the method for appending data to an item using collection.append(item, data).
To my understanding, this operation creates a new parquet file and modifies the metadata.
I would like to avoid having thousands of very small files and would rather include new data in the last .parquet file, creating a new one only after the last .parquet file reaches the predefined length (I'm fine with the current value of 1 million rows).
I see from the code that this goes in the end to calling Dask's dd.to_parquet() and I tried to dig deeper into it but I find the code very convoluted and difficult to read :(

Ideally my workflow would be this in pseudocode:

open last file in a pandas
append the new data
remove last parquet file
write with to_parquet() in chunks of 10**6 rows
update _metadata file to be able to still do efficient read with pystore

I didn't find a way to modify the _metadata file, any hint on this would be really appreciated :)
Also, any opinion on why I shouldn't be doing this is very welcome!
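
There doesn't seem to be a supported way to hand-edit the _metadata file, but a coarser alternative that stays within the documented PyStore API is to compact an item periodically: read it back to pandas, merge the new rows in memory, and rewrite with overwrite=True, letting write()'s default chunksize (1e6 rows) decide the partitioning. A rough sketch, with hypothetical names:

import pandas as pd
import pystore

def compact_append(collection, symbol, new_rows):
    # merge new rows in memory and rewrite the item so partitions stay ~1e6 rows each
    old = collection.item(symbol).to_pandas()
    combined = pd.concat([old, new_rows])
    combined = combined[~combined.index.duplicated(keep='last')].sort_index()
    # overwrite=True rewrites the whole parquet dataset (and its metadata) from scratch
    collection.write(symbol, combined, overwrite=True)

# usage (hypothetical names):
# compact_append(pystore.store('mydatastore').collection('ticks'), 'MYSYMBOL', new_rows_df)

This trades write amplification for fewer, larger parquet parts; running it on a schedule rather than on every append keeps the cost manageable.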

Modification of tutorial: append section: duplicates are dropped

Hello,

I haven't tested append() yet, and I was wondering if duplicates are removed when an append is performed.
I had a look in collection.py script and following pandas function are used:
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

After a look into the pandas documentation, I understand that duplicate lines are removed and only the last occurrence is kept.

Please, I think it would be relevant to simply say so in the tutorial.
You write:
Let's append the last day (row) to our item:

Wouldn't it be worth to add:
Let's append the last day (row) to our item. With the current data there are obviously no duplicate rows. If you append a dataframe that contains rows duplicating those of the existing item, these duplicates will be removed by pandas' drop_duplicates() method.

Thanks again for bringing pystore!
Bests,
Pierrot

Multiindex and/or building minute bars

Seems I cannot have multiindexes in my frame when storing data ...

how do you handle minute bars, especially for futures when the session crosses days?
I have columns like this:
['date','time','bid','ask','open','high','low','close','volume','session']

where date is the 'Trade Date' of the minute bar and time is the datetime of occurrence, using a multiindex similar to how I did it when using Arctic... but pystore gives me an error when writing this, saying that multiindexes are NOT allowed.

What can I do that gives me the flexibility I require here? Futures start at 18:00 EST the previous day and run through 17:00 the next day, which is the 'Trade Date'. Is there a better way?

Your sample code is good, but it doesn't show how to handle anything like minute bar data, which would be very helpful, as I couldn't find any code out there that would help me with this.

thanks.
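
PyStore appears to expect a single index per item (the MultiIndex error above suggests as much), so one possible workaround is to flatten the MultiIndex: keep the bar's occurrence datetime as the index and store the trade date (and session) as regular columns. A sketch with made-up sample data and hypothetical names:

import pandas as pd
import pystore

collection = pystore.store('mydatastore').collection('FUTURES')

# sample minute bars: 'time' is the bar timestamp, 'date' is the session's trade date
bars = pd.DataFrame({
    'date': ['2019-12-05', '2019-12-05'],
    'time': pd.to_datetime(['2019-12-04 18:00:00', '2019-12-04 18:01:00']),
    'close': [7299.49, 7303.65],
    'session': ['ETH', 'ETH'],
})

# flatten: single datetime index, trade date kept as a filterable column
bars = bars.set_index('time').sort_index()
collection.write('ESZ9', bars, metadata={'frequency': '1min'})

# if the data already carries a MultiIndex, reset the non-time levels into columns first:
# df = df.reset_index(level='date')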

Asyncio?

I know Dask Client can run Async but is it possible to run pystore with asyncio?

Has anyone tried to install this project on windows?

Hi, I am trying to install this project on Windows 8.1, using Python envs 3.6 or 3.7, but I keep receiving an error message:

Installing collected packages: python-snappy, multitasking, toolz, locket, partd, cloudpickle, pyyaml, dask, msgpack, click, sortedcontainers, psutil, tblib, tornado, heapdict, zict, distributed, six, python-dateutil, pytz, numpy, pandas, llvmlite, numba, thrift, fastparquet, PyStore
Running setup.py install for python-snappy: started
Running setup.py install for python-snappy: finished with status 'error'

ERROR: Command errored out with exit status 1:
 command: 'D:\WindowsFolder\Documents\VisualStudioCode\Projects\Python\stackflow_answers\env\Scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\XXXXXXXXXXXX\\AppData\\Local\\Temp\\pycharm-packaging\\python-snappy\\setup.py'"'"'; __file__='"'"'C:\\Users\\XXXXXXXXXXXX\\AppData\\Local\\Temp\\pycharm-packaging\\python-snappy\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\Arnaud\AppData\Local\Temp\pip-record-u2qg1h8c\install-record.txt' --single-version-externally-managed --compile --install-headers 'D:\WindowsFolder\Documents\VisualStudioCode\Projects\Python\stackflow_answers\env\include\site\python3.7\python-snappy'
     cwd: C:\Users\XXXXXXXXXXXX\AppData\Local\Temp\pycharm-packaging\python-snappy\
Complete output (26 lines):
C:\Python\Python370\lib\distutils\dist.py:274: UserWarning: Unknown distribution option: 'cffi_modules'
  warnings.warn(msg)
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.7
creating build\lib.win-amd64-3.7\snappy
copying snappy\hadoop_snappy.py -> build\lib.win-amd64-3.7\snappy
copying snappy\snappy.py -> build\lib.win-amd64-3.7\snappy
copying snappy\snappy_cffi.py -> build\lib.win-amd64-3.7\snappy
copying snappy\snappy_cffi_builder.py -> build\lib.win-amd64-3.7\snappy
copying snappy\snappy_formats.py -> build\lib.win-amd64-3.7\snappy
copying snappy\__init__.py -> build\lib.win-amd64-3.7\snappy
copying snappy\__main__.py -> build\lib.win-amd64-3.7\snappy
warning: build_py: byte-compiling is disabled, skipping.

running build_ext
building 'snappy._snappy' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\snappy
C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -ID:\WindowsFolder\Documents\VisualStudioCode\Projects\Python\stackflow_answers\env\include -IC:\Python\Python370\include -IC:\Python\Python370\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt" /EHsc /Tpsnappy/snappymodule.cc /Fobuild\temp.win-amd64-3.7\Release\snappy/snappymodule.obj
snappymodule.cc
snappy/snappymodule.cc(31): fatal error C1083: Cannot open include file: 'snappy-c.h': No such file or directory
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community\\VC\\Tools\\MSVC\\14.25.28610\\bin\\HostX86\\x64\\cl.exe' failed with exit status 2
----------------------------------------

ERROR: Command errored out with exit status 1: 'D:\WindowsFolder\Documents\VisualStudioCode\Projects\Python\stackflow_answers\env\Scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\XXXXXXXXXXXX\AppData\Local\Temp\pycharm-packaging\python-snappy\setup.py'"'"'; file='"'"'C:\Users\XXXXXXXXXXXX\AppData\Local\Temp\pycharm-packaging\python-snappy\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\Arnaud\AppData\Local\Temp\pip-record-u2qg1h8c\install-record.txt' --single-version-externally-managed --compile --install-headers 'D:\WindowsFolder\Documents\VisualStudioCode\Projects\Python\stackflow_answers\env\include\site\python3.7\python-snappy' Check the logs for full command output.

Writing pandas.Series fails

Does Pystore support writing pandas.Series? I'm getting an error when attempting to write a pandas.Series to a pystore.collection, because pandas.Series.memory_usage() returns an int instead of a pandas.Series, and calling .sum() on an int raises an error.

The line where this occurs is shown below:

if npartitions is None and chunksize is None:
    memusage = data.memory_usage(deep=True).sum()

Not sure whether this is an issue, but I figured I'd ask!
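
One simple workaround on the caller's side, until/unless Series support is added, is to convert the Series into a one-column DataFrame before writing; a sketch with hypothetical names:

import pandas as pd
import pystore

collection = pystore.store('mydatastore').collection('EOD')

s = pd.Series([1.0, 2.0, 3.0],
              index=pd.date_range('2019-01-01', periods=3, freq='D'),
              name='close')

# DataFrame.memory_usage() returns a Series, so .sum() works and write() succeeds
collection.write('MYSERIES', s.to_frame())

# reading it back and recovering the Series
restored = collection.item('MYSERIES').to_pandas()['close']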

Terrible performance on dask=2.2.0

Hi everyone,
I noticed that pystore unexpectedly started being orders of magnitude slower (running the same script), like minutes to get daily timestamp of a single instrument ...
It seems that dask=2.2.0 is terribly slow using snappy and engine="fastparquet" but everything is great again downgrading to dask=2.1.0.

@ranaroussi it would be great if you could update us on when things will be fin with latest Dask update :)

Best,
Davide

append error

Hello,

Do you have any recommendations regarding importing data from arctic?
I'm currently using cryptostore with arctic as a backend.
Cryptostore is by the very same author as arctic, but loading trades as a dataframe takes too much time with it.

For now, this is what I did :

import pystore
from arctic import Arctic

exchange = "BITFINEX"
datastore = "mydatastore"

arctic_store = Arctic("localhost")
arctic_lib = arctic_store[exchange]
symbols = arctic_lib.list_symbols()

store = pystore.store(datastore)
collection = store.collection(exchange)
for symbol in symbols:
    df_src = arctic_lib.read(symbol)
    if symbol in collection.list_items():
        item = collection.item(symbol)
        df_dst = item.to_pandas()
        # https://stackoverflow.com/a/44318806
        df_diff = df_src[~df_src.index.isin(df_dst.index)]
        rows, columns = df_diff.shape
        if df_diff.empty:
            print("No new row to append...")
        else:
            print(f"Appending {rows} rows to {symbol} item")
            collection.append(symbol, df_diff)
    else:
        rows, columns = df_src.shape
        print(f"Importing {symbol} for the first time w/ {rows} rows and {columns} columns")
        collection.write(symbol, df_src, metadata={'source': 'cryptostore'})

But I'm facing errors similar to #16 when the append happens, even after rolling back dask and fastparquet to previous releases.

    raise ValueError("Exactly one of npartitions and chunksize must be specified.")
ValueError: Exactly one of npartitions and chunksize must be specified.

my setup :

dask==2.6.0
fastparquet==0.3.2
numba==0.46.0

Thanks, and keep the good work!

Reading data via date range

Very nice and fast pandas dataframe database. Thank you!

Do you think it would be possible to partially read data via an applied date range, to avoid loading the whole dataframe at once?
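
One approach that avoids materializing everything is to slice the item's Dask dataframe by its datetime index before calling .compute(); when the partition divisions are known, Dask only reads the overlapping partitions. A sketch with hypothetical names and dates:

import pystore

collection = pystore.store('mydatastore').collection('NASDAQ')
item = collection.item('AAPL')

# item.data is a Dask dataframe; .loc label-slicing works on a sorted datetime index
df = item.data.loc['2018-01-01':'2018-06-30'].compute()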

python 2.7 compatibility

I am trying PyStore on Win 10 + Python 2.7. This does not work.

Lots of issues having to do with paths, as far as I can tell so far.

Any plan to port it?

Pystore Tutorial loading data problem

Hi there,
i was following pystore tutorial to check it's performance.
it seems when it want to load data; the dataframe is empty and i have this error

image

Python 3.9.5
PyStore 0.1.23

And here is the dataframe when I read it; it seems empty:

[screenshot of the empty dataframe]

Intake integration

I am pleased to see another use for dask and fastparquet, meeting the specific use of your users.

I am also the main contributor to Intake, which is a one-stop-shop for finding datasets in catalogs and loading them without the user having to know anything about the specifics of the data format or service.

Pystore already has some capability for hosting a named set-of-datasets, with metadata, and so would fit very nicely in the Intake ecosystem. Indeed, pandas (or dask) dataframes are one of the built-in container types supported by Intake. Would you be interested in writing an Intake driver interface for pystore? That way, these data could take their place among all the other datasets of various types from various services that may be available in an analyst's session.

One place where you are ahead of us is snapshotting, a topic we have discussed, but not yet designed or implemented. Comments at intake/intake#382 would be highly appreciated.

Unable to load data after update - error: bad escape \p

Recently updated a number of packages and now can't load any of my items, with the same traceback each time.

Current versions:
python 3.7.3 in conda env
pystore 0.1.13
dask 2.3.0
pandas 0.25.1
fastparquet 0.3.2
numba 0.45.1
python-snappy 0.5.4

In[24]: store.collection('Yahoo').item('adjClose')

Traceback (most recent call last):

  File "<ipython-input-24-c3f5c3720a65>", line 1, in <module>
    store.collection('Yahoo').item('adjClose')

  File "C:\Users\Tim\Anaconda3\lib\site-packages\pystore\collection.py", line 79, in item
    snapshot, filters, columns, engine=self.engine)

  File "C:\Users\Tim\Anaconda3\lib\site-packages\pystore\item.py", line 56, in __init__
    self._path, engine=self.engine, filters=filters, columns=columns)

  File "C:\Users\Tim\Anaconda3\lib\site-packages\dask\dataframe\io\parquet\core.py", line 155, in read_parquet
    **kwargs

  File "C:\Users\Tim\Anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 173, in read_metadata
    fs, paths, gather_statistics, **kwargs

  File "C:\Users\Tim\Anaconda3\lib\site-packages\dask\dataframe\io\parquet\fastparquet.py", line 114, in _determine_pf_parts
    paths = fs.glob(paths[0] + fs.sep + "*")

  File "C:\Users\Tim\Anaconda3\lib\site-packages\fsspec\implementations\local.py", line 50, in glob
    return super().glob(path)

  File "C:\Users\Tim\Anaconda3\lib\site-packages\fsspec\spec.py", line 449, in glob
    pattern = re.compile(pattern.replace("=PLACEHOLDER=", '.*'))

  File "C:\Users\Tim\Anaconda3\lib\re.py", line 234, in compile
    return _compile(pattern, flags)

  File "C:\Users\Tim\Anaconda3\lib\re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)

  File "C:\Users\Tim\Anaconda3\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "C:\Users\Tim\Anaconda3\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "C:\Users\Tim\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))

  File "C:\Users\Tim\Anaconda3\lib\sre_parse.py", line 507, in _parse
    code = _escape(source, this, state)

  File "C:\Users\Tim\Anaconda3\lib\sre_parse.py", line 402, in _escape
    raise source.error("bad escape %s" % escape, len(escape))

error: bad escape \p

Store attribute: associate store path?

Hello,

Thanks a lot for this terrific library.
I am starting using it, and when I initialize the store, as given in the tutorial, I have these 2 lines:

    # Set storage path
    pystore.set_path(my_path)
    # Connect to datastore (create it if not exist)
    store = pystore.store(my_store_name)

I would like to know how the path variable is managed: is it attached to the store?
It seems to me it is an attribute that defines the store.
Shouldn't it be given directly when creating or connecting to the store?

    # Connect to datastore (create it if not exist) with store name & its location
    store = pystore.store(my_path, my_store_name)

I thank you in advance for sharing your thoughts about it.
Have a good day,
Bests,
Pierre

Append not working

Using the latest version 0.0.12, the append function does not seem to work, nor does it raise an error.

Initial dataframe's last row:

df.tail(2)

Close High Low Open Volume 
2017-12-29 00:00:00+00:00 167.15 168.65 167.05 168.25 364619 
2018-01-02 00:00:00+00:00 167.15 167.7 165.25 167.15 587067 

Saving and appending:

store = pystore.store('mydatastore') 
collection = store.collection('eod') 
last = df[-1:] 
collection.write('symbol', df[:-1], overwrite=True) 
collection.append('symbol', last) 
df = collection.item('symbol').to_pandas() 
df.tail(1) 

yields:

Close High Low Open Volume index 
2017-12-29 167.15 168.65 167.05 168.25 364619 

At least on my end. What am I doing wrong here?

Issues with nanoseconds

I am writing the data with nanosecond precision, but upon reading it back the precision has been lost:

import pystore
import pandas as pd

pystore.set_path("/tmp")
store = pystore.store('test')
collection = store.collection('tick')

ts = '2019-06-15 00:00:12.868214001+00:00'
ts = pd.to_datetime(ts, format="%Y-%m-%d %H:%M:%S.%f")
df = pd.DataFrame({'ts': [ts],
                   'data': [100]})
df.set_index('ts', inplace=True)

name = 'ns_test'
collection.write(name, df, overwrite=True)

result = collection.item(name).to_pandas()
print(f'Before write: {df.index}')
print(f'After read: {result.index}')
print(f'Difference: {df.index.values.astype(int) - result.index.values.astype(int)}')
----
Before write: DatetimeIndex(['2019-06-15 00:00:12.868214001+00:00'], dtype='datetime64[ns, UTC]', name='ts', freq=None)
After read: DatetimeIndex(['2019-06-15 00:00:12.868214+00:00'], dtype='datetime64[ns, UTC]', name='ts', freq=None)
Difference: [1]

Parquet is preventing data frame storage

I tried out pystore, and it is almost perfect for my needs. However, using parquet in the stack causes an issue with storage. I have data frames to store, and have no control over the data. Parquet cannot figure out how to store some of my data frames due to conversion problems. Since parquet focuses on language-independent storage, the data frames cannot be stored without some conversion. I cannot remember where I read that, but it was while tracking down a problem storing directly with parquet a few days ago. If pystore is meant as a python-only solution, maybe it would be better to use something like pickle instead? Just a thought.

collection.list_items() with metadata parameter is showing "*** json.decoder.JSONDecodeError: Expecting value: line 1 column 198 (char 197)"

Collection.list_items() with metadata parameter is showing "*** json.decoder.JSONDecodeError: Expecting value: line 1 column 198 (char 197)"

Below is the code.

>>> item = collection.item("NIFTY16JAN2011700PE")
>>> item.metadata
{'': 100653, 'tradingSymbol': 'NIFTY16JAN2011700PE', 'Symbolname': 'NIFTY', 'exchange': 'NFO', 'exchangeSegment': 'nse_fo', 'symbolCode': 'NA', 'instrument': 'OPTIDX', 'lotSize': 75, 'strikePrice': 11700.0, 'expiryDate': '01/16/2020, 00:00:00', 'tickSize': 0.05, 'Created_Date': '06/27/2021, 23:38:07', 'Last_Record_Date': '01/16/2020, 15:18:00', 'Last_Updated': '06/27/2021, 23:38:07', '_updated': '2021-07-06 21:09:19.931785'}
>>> collection.list_items(tradingSymbol = 'NIFTY16JAN2011700PE')
*** json.decoder.JSONDecodeError: Expecting value: line 1 column 198 (char 197)
>>> 

.to_pandas() required in tutorial notebook when writing snap_df

snap_df is a pystore.item.Item:

snap_df = collection.item('AAPL', snapshot='snapshot_name')

but I think collection.write() takes a pandas DataFrame instead, so I get an error when I try to run this line:

collection.write('AAPL', snap_df,
                 metadata={'source': 'Quandl'},
                 overwrite=True)

Adding .to_pandas() fixes it:

collection.write('AAPL', snap_df.to_pandas(),
                 metadata={'source': 'Quandl'},
                 overwrite=True)

Save and load data inconsistent

I have saved a pandas DataFrame (the index is a datetime). When reading the data back, the content is inconsistent with the data at the time of saving.

write
open high low close dollar_volume shares
datetime
2018-09-14 14:59:00 10.26 10.26 10.26 10.26 0.00 0
2018-09-14 15:00:00 10.24 10.24 10.24 10.24 2523136.25 246400

read from file
open high low close dollar_volume shares
datetime
2018-09-14 15:00:00 10.26 10.26 10.26 10.26 0.00 0
2018-09-14 15:00:00 10.24 10.24 10.24 10.24 2523136.25 246400

Is append loading the entire data into memory just to append new data?

Based on this code, on each append we load all the data into memory to check for duplicates, and then write all of the data again to rewrite the parquet.
Doing that for items with 100k existing records, with multiple threads, the task consumes 100% of memory for each single-record append.

Why not use fastparquet's write method to append the data (with True / False / overwrite)?
https://fastparquet.readthedocs.io/en/latest/api.html#fastparquet.write

try:
    if epochdate or ("datetime" in str(data.index.dtype) and
                     any(data.index.nanosecond) > 0):
        data = utils.datetime_to_int64(data)
    old_index = dd.read_parquet(self._item_path(item, as_string=True),
                                columns=[], engine=self.engine
                                ).index.compute()
    data = data[~data.index.isin(old_index)]
except Exception:
    return

if data.empty:
    return

if data.index.name == "":
    data.index.name = "index"

# combine old dataframe with new
current = self.item(item)
new = dd.from_pandas(data, npartitions=1)
combined = dd.concat([current.data, new]).drop_duplicates(keep="last")

unable to set npartitions when writing collection

I was hoping to set the number of partitions while writing a collection, but the code currently raises an error when executing:

    def write(self, item, data, metadata={},
              npartitions=None, chunksize=1e6, overwrite=False,
              epochdate=False, compression="snappy", **kwargs):

In the following example I tried to execute the following code

collection.write(item='spx', data=df, metadata={'source': 'ivolatility'}, npartitions=5, overwrite=True)

The error I receive is:

Traceback (most recent call last):
  File "C:\Users\C\AppData\Local\Temporary Projects\pystore_Test\pystore_Test.py", line 30, in <module>
    collection.write(item='spx', data=df, metadata={'source': 'ivolatility'},  npartitions=5, overwrite=True)
  File "F:\anaconda3\envs\envTensorflow\lib\site-packages\pystore\collection.py", line 111, in write
    chunksize=int(chunksize))
  File "F:\anaconda3\envs\envTensorflow\lib\site-packages\dask\dataframe\io\io.py", line 177, in from_pandas
    raise ValueError('Exactly one of npartitions and chunksize must be specified.')
ValueError: Exactly one of npartitions and chunksize must be specified.

I tried a couple of different ways to use npartitions vs. chunksize, but because chunksize is always specified I can't seem to find an easy way to execute the code using partitions.

Any suggestions on how to utilize npartitions instead of chunksize?
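
As a stop-gap until write() forwards npartitions, one option is to translate the desired partition count into the chunksize that write() does pass to dask (from_pandas receives int(chunksize), per the traceback above); a sketch with hypothetical names and data:

import math

import pandas as pd
import pystore

collection = pystore.store('mydatastore').collection('options')

df = pd.DataFrame({'close': range(1000)},
                  index=pd.date_range('2020-01-01', periods=1000, freq='min'))

# aim for ~5 partitions by sizing the chunks accordingly
desired_partitions = 5
chunksize = math.ceil(len(df) / desired_partitions)

collection.write('spx', df, metadata={'source': 'ivolatility'},
                 chunksize=chunksize, overwrite=True)

This mirrors rather than fixes the underlying issue, but it gives approximate control over the partition count.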

Appending string Timeseries damages stored Timeseries

Hi everyone. I'm using the very useful pystore library to store, among other things, very long timeseries with string values.
I noticed a very strange behaviour when I try to append to an existing string timeseries item, as in the following example code: when appending to an item in the database, the resulting dataframe in the database loses many of its values.

#import
import pystore
import pandas as pd
import numpy as np

pystore.set_path('./pystore_test')
store = pystore.store('store')
collection = store.collection('collection')

#define first dataframe (string values)
index_1 = pd.date_range('2013-1-1 00:00', '2014-1-1  00:00', freq='1m')
data_1= pd.DataFrame(np.random.choice(list('abc'), [len(index_1),1]),index=index_1, columns=['test']) 

#write dataframe to database
collection.write('test', data_1)

#read from database
item = collection.item('test')
data_DB = item.to_pandas() # contains 12 dates

#define second dataframe (string values)
index_2 = pd.date_range('2014-1-2 00:00', '2015-1-1  00:00', freq='1m')
data_2= pd.DataFrame(np.random.choice(list('abc'), [len(index_2),1]),index=index_2, columns=['test']) 

#append to database
item = collection.item('test')
collection.append('test', data_2, npartitions=item.data.npartitions)

#read appending result from database
item = collection.item('test')
data_DB = item.to_pandas()  #<-- Contains only 3 of the 12 initial and 12 appended values

Instead it works perfectly if I replace strings with floats:

data_1['test'] = np.random.rand(len(index_1),1) 
data_2['test'] = np.random.rand(len(index_2),1)

Is the pystore library able to store string values as well?
Thank you

Append function not working

So, this issue has already been posted: I can't get "append" to update the data using the demo example.
I changed the Python version, Dask, pystore, numba, and pandas.
I tested the same code on Linux (Mint and Ubuntu) and Windows, on 2 different computers.
I tested with 4 versions of pystore: nothing, and no errors.
I'm wondering if anyone is using this package.
