otsaloma / dataiter

Python classes for data manipulation

Home Page: https://dataiter.readthedocs.io/

License: MIT License

Makefile 0.80% Python 97.22% Shell 0.92% R 1.06%

python data-frame json numpy numba

dataiter's Introduction

Python Classes for Data Manipulation

Dataiter currently includes the following classes.

DataFrame is a class for tabular data similar to R's data.frame or pandas.DataFrame. Under the hood, it is a dictionary of NumPy arrays and is thus capable of fast vectorized operations. You can consider it a lightweight alternative to Pandas with a simple and consistent API. Performance-wise, Dataiter relies on NumPy and Numba and is likely to be at best comparable to Pandas.
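
To illustrate the "dictionary of NumPy arrays" idea, here is a minimal sketch using plain NumPy. It does not use Dataiter's API; the variable `frame`, the column names and the values are made up for illustration.

```python
import numpy as np

# A table stored column-wise: each key is a column name,
# each value is a NumPy array of equal length.
frame = {
    "item": np.array(["a", "b", "c"]),
    "price": np.array([2.5, 4.0, 1.0]),
    "quantity": np.array([4, 1, 10]),
}

# Adding a column is a single vectorized operation over whole arrays.
frame["total"] = frame["price"] * frame["quantity"]

# Filtering rows applies the same boolean mask to every column.
mask = frame["total"] >= 10
filtered = {name: column[mask] for name, column in frame.items()}
print(filtered["item"])
```

Because every column is an array, row-wise Python loops are avoided, which is where the speed of this layout comes from.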

ListOfDicts is a class useful for manipulating data from JSON APIs. It provides functionality similar to libraries such as Underscore.js, with manipulation functions that iterate over the data and return a shallow modified copy of the original. attd.AttributeDict is used to provide convenient access to dictionary keys.
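
The following is a rough sketch of the "iterate and return a shallow modified copy" pattern described above. It is not Dataiter's implementation: `AttributeDict` here is a hypothetical stand-in for attd.AttributeDict, and `filter_items` and `modify_items` are made-up helper names.

```python
class AttributeDict(dict):
    """Hypothetical stand-in: dictionary keys readable as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def filter_items(items, function):
    # Build a new list; the input list is left untouched.
    return [item for item in items if function(item)]


def modify_items(items, **key_function_pairs):
    # Shallow-copy each dict before setting keys on it,
    # so the caller's dicts are never mutated.
    out = []
    for item in items:
        new = AttributeDict(item)
        for key, function in key_function_pairs.items():
            new[key] = function(new)
        out.append(new)
    return out


data = [AttributeDict(name="a", value=1), AttributeDict(name="b", value=5)]
result = modify_items(
    filter_items(data, lambda x: x.value > 2),
    double=lambda x: x.value * 2,
)
print(result)
```

The copies are shallow: nested lists or dicts inside an item would still be shared between the original and the result.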

GeoJSON is a simple wrapper class that allows reading a GeoJSON file into a DataFrame and writing a data frame to a GeoJSON file. Any operations on the data are thus done with methods provided by the data frame class. Geometry is read as-is into the "geometry" column, but no special geometric operations are currently supported.
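
As a sketch of what reading GeoJSON into columns can look like, the code below flattens a FeatureCollection into a dictionary of lists with the standard library only. The function name `geojson_to_columns` and the sample features are assumptions for illustration, not Dataiter's API.

```python
import json


def geojson_to_columns(text):
    """Flatten a GeoJSON FeatureCollection into a dict of column lists."""
    features = json.loads(text)["features"]
    # Collect property names in first-seen order.
    names = []
    for feature in features:
        for name in feature.get("properties") or {}:
            if name not in names:
                names.append(name)
    columns = {name: [] for name in names}
    columns["geometry"] = []
    for feature in features:
        properties = feature.get("properties") or {}
        for name in names:
            # Features lacking a property get None in that column.
            columns[name].append(properties.get(name))
        # Geometry is stored as-is, without any interpretation.
        columns["geometry"].append(feature["geometry"])
    return columns


text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "properties": {"name": "A"},
         "geometry": {"type": "Point", "coordinates": [24.9, 60.2]}},
        {"type": "Feature",
         "properties": {"name": "B", "rank": 2},
         "geometry": {"type": "Point", "coordinates": [25.0, 60.3]}},
    ],
})
columns = geojson_to_columns(text)
print(list(columns))
```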

Installation

# Latest stable version
pip install -U dataiter

# Latest development version
pip install -U git+https://github.com/otsaloma/dataiter

# Numba (optional)
pip install -U numba

Dataiter optionally uses Numba to speed up certain operations. If you have Numba installed and importing it succeeds, Dataiter will use it automatically. It's currently not a hard dependency, so you need to install it separately.
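
A common way to implement this kind of optional dependency is a guarded import with a pure Python fallback. The sketch below shows the general pattern only; the names `USE_NUMBA` and `total` are hypothetical and not taken from Dataiter's source.

```python
try:
    import numba
    USE_NUMBA = True
except Exception:
    # Covers both a missing package and an installation that fails to import.
    numba = None
    USE_NUMBA = False


def total(values):
    # A plain loop, which is the kind of code Numba can compile.
    result = 0.0
    for value in values:
        result += value
    return result


if USE_NUMBA:
    # Swap in a JIT-compiled version only when Numba is usable.
    total = numba.njit(total)

print(total((1.0, 2.0, 3.0)))
```

Callers use `total` the same way in both cases, so the rest of the code does not need to know whether Numba is present.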

Documentation

https://dataiter.readthedocs.io/

If you're familiar with either dplyr (R) or Pandas (Python), the comparison table in the documentation will give you a quick overview of the differences and similarities.

https://dataiter.readthedocs.io/en/latest/comparison.html

dataiter's Issues

min and max should use missing_value, not np.nan

So that min and max can be taken of datetime columns.

Maybe just use default=data[name].missing_value for all functions?

Add grouped modify

Usually used to calculate fractions.

Like in dplyr

> data.frame(x=rep(1:3, 3), y=1) %>% group_by(x) %>% mutate(z=1/sum(y))
  x y        z
1 1 1 0.333333
2 2 1 0.333333
3 3 1 0.333333
4 1 1 0.333333
5 2 1 0.333333
6 3 1 0.333333
7 1 1 0.333333
8 2 1 0.333333
9 3 1 0.333333
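
For comparison, here is a minimal pure Python sketch of the same grouped modify on a dictionary of lists. The function `grouped_modify` is hypothetical; it assumes each function returns one scalar per group, which is then repeated for every row of that group.

```python
from collections import defaultdict


def grouped_modify(columns, by, **new_columns):
    """Compute new columns group by group, keeping every original row."""
    n = len(columns[by])
    # Map each group key to the row indices that belong to it.
    groups = defaultdict(list)
    for i, key in enumerate(columns[by]):
        groups[key].append(i)
    out = {name: list(values) for name, values in columns.items()}
    for name, function in new_columns.items():
        out[name] = [None] * n
        for indices in groups.values():
            group = {k: [v[i] for i in indices] for k, v in columns.items()}
            value = function(group)
            # Write the group's result back to the group's own rows.
            for i in indices:
                out[name][i] = value
    return out


data = {"x": [1, 2, 3] * 3, "y": [1] * 9}
result = grouped_modify(data, "x", z=lambda g: 1 / sum(g["y"]))
print(result["z"])
```

Unlike an aggregate, the result has as many rows as the input, in the original order.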

Allow using strings with Numba aggregation functions

In aggregate.use_numba, if they can be made to work.

Check handling of object columns

Testing if we could use dtype object instead of string, which might be necessary at times to reduce memory use. At least np.unique can't handle that.

diff --git a/dataiter/test/__init__.py b/dataiter/test/__init__.py
index 7cf496f..3f38dda 100644
--- a/dataiter/test/__init__.py
+++ b/dataiter/test/__init__.py
@@ -30,7 +30,11 @@ def data_frame(path):
     path = get_data_path(path)
     extension = path.suffix.lstrip(".")
     read = getattr(DataFrame, f"read_{extension}")
-    return read(path)
+    data = read(path)
+    for colname, column in data.items():
+        if column.is_string():
+            data[colname] = column.as_object()
+    return data
 
 def geojson(path):
     path = get_data_path(path)

py.test --tb=no dataiter/test/test_data_frame.py
================================================= test session starts =================================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /home/osmo/Source/dataiter
collected 82 items                                                                                                    

dataiter/test/test_data_frame.py ................FF...........FFF.FFF.....................F.....F.....FF...F.F. [ 95%]
....                                                                                                            [100%]

=============================================== short test summary info ===============================================
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_aggregate - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_anti_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_from_json - assert \n   category ...905 rows total == \...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_from_pandas - assert \n      id    ...442 rows total ==...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_full_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_inner_join - TypeError: The axis argument to unique is ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_left_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_left_join_by_tuple - TypeError: The axis argument to un...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_semi_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_to_json - assert \n   category ...905 rows total == \n ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_unique_by_one - TypeError: The axis argument to unique ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_unique_by_same_dtype - TypeError: The axis argument to ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_write_csv - assert \n      id    ...442 rows total == \...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_write_json - assert \n   category ...905 rows total == ...
=========================================== 14 failed, 68 passed in 22.00s ============================================

UnicodeEncodeError with astype(bytes) in DataFrame string columns

ListOfDicts.select should return in the requested order

Rename missing to na?

e.g. Vector.is_missing to Vector.isna

Replace Pandas with Arrow

We're notably using Pandas for DataFrame.read_csv. That could probably be replaced with pyarrow.csv.read_csv, which would allow removing Pandas from the list of dependencies, leaving it as an optional dependency only needed for the from_pandas and to_pandas methods (with Pandas imported within the method body).

Arrow seems to be a lot faster at reading CSV files and we need it anyway for reading and writing Parquet files, so it would probably allow dropping something we've never liked and have sought to replace.

Round floats by significant digits, not decimals
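
One common way to round by significant digits is to derive the number of decimals from the magnitude of the value. The sketch below is an illustration with a hypothetical function name, not what Dataiter does.

```python
import math


def round_significant(x, digits=3):
    """Round x to the given number of significant digits."""
    if x == 0 or not math.isfinite(x):
        # Zero, inf and nan have no meaningful magnitude.
        return x
    # Position of the leading digit: 2 for 123.4, -3 for 0.00123.
    magnitude = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - magnitude)


print(round_significant(123456.0, 2))
print(round_significant(0.00123456, 3))
```

With a fixed number of decimals, small values would lose all their digits and large ones would keep too many; tying the decimals to the magnitude avoids both.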

Switch Vector to use pandas.Series for proper NA support

Proper NA support in NumPy doesn't look like it's happening.

https://numpy.org/neps/

Perhaps for that very reason, missing values are being pushed into the wrong part of the stack, namely Pandas. While I'd like to avoid using Pandas, if this actually gets implemented consistently across dtypes and operations, it's probably a big enough reason to switch.

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values

ListOfDicts.select returns regular dicts

Represent strings as objects?

When reading in data with arbitrarily long strings, dataiter.DataFrame will use a huge amount of memory, because NumPy Unicode strings take 4 bytes per character and use a fixed-length representation (all elements are allocated to match the longest one). We might need to (1) use object conditionally when a column has long strings, (2) always use object, if possible, or (3) use bytes instead via a custom dtype.
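
The memory effect is easy to reproduce with plain NumPy. In this small sketch the sample values are made up; note that an object array only holds references, so the Python string objects themselves still take memory elsewhere.

```python
import numpy as np

values = ["a", "bb", "ccc"]
with_outlier = values + ["x" * 1000]

fixed_short = np.array(values)
fixed_long = np.array(with_outlier)
as_object = np.array(with_outlier, dtype=object)

# Each element takes 4 bytes times the length of the longest string.
print(fixed_short.dtype, fixed_short.itemsize, fixed_short.nbytes)
print(fixed_long.dtype, fixed_long.itemsize, fixed_long.nbytes)
# An object array stores one reference per element instead.
print(as_object.dtype, as_object.itemsize, as_object.nbytes)
```

A single long value thus inflates the storage of every element in the column.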

Allow using datetimes with Numba aggregation functions

They probably should work, but there seems to be some issue with NaT, which is needed as the "default" value.

Switch method chaining syntax to parentheses

Since that allows leaving comments within the chain.
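
A short illustration with built-in string methods (the values are made up): inside parentheses, comment lines can sit between the chained calls, which is not possible with backslash line continuation.

```python
text = "  Hello, World  "

result = (
    text
    # Remove surrounding whitespace first.
    .strip()
    # Normalize case so later comparisons are simple.
    .lower()
    .replace(",", "")
)
print(result)
```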

Use keyword-only arguments where appropriate

https://sethmlarson.dev/blog/strict-python-function-parameters
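
As a reminder of the syntax, parameters after a bare `*` can only be passed by name. The function `read_table` below is a hypothetical example, not part of Dataiter.

```python
def read_table(path, *, sep=",", header=True):
    """Hypothetical reader: everything after `*` must be passed by name."""
    return {"path": path, "sep": sep, "header": header}


# Passing options by name works.
options = read_table("data.csv", sep=";")

# Passing them positionally raises TypeError.
try:
    read_table("data.csv", ";")
    rejected = False
except TypeError:
    rejected = True

print(options, rejected)
```

This keeps call sites readable and allows reordering or adding options later without breaking callers.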

Adapt to NumPy 2.0

https://github.com/numpy/numpy/releases

At least

A new variable-length string dtype, numpy.dtypes.StringDType and a new
numpy.strings namespace with performant ufuncs for string operations

Hence

Check for NumPy version >= 2.0 in __init__.py
Use the new string dtype by default?
Keep the old around as char, i.e. Vector.as_char etc.?
Check what missing value to use for strings
Use np.strings instead of np.char where applicable
Allow strings in aggregate.py function use_numba?
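
The version check in the list above could look roughly like the sketch below. The helper names are hypothetical, and the code works on a version string so that it runs without NumPy installed; in practice the string would come from `np.__version__`.

```python
def major_version(version):
    # "2.1.0rc1" -> 2; only the part before the first dot matters here.
    return int(version.split(".")[0])


def string_namespace_name(numpy_version):
    """Pick the NumPy string namespace to use for a given version."""
    # numpy.strings is new in NumPy 2.0; numpy.char is the older namespace.
    return "strings" if major_version(numpy_version) >= 2 else "char"


for version in ["1.26.4", "2.0.0", "2.1.0rc1"]:
    print(version, string_namespace_name(version))

# With NumPy imported as np, the namespace could then be looked up with:
# getattr(np, string_namespace_name(np.__version__))
```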

Print fails on None

  File "test.py", line 14, in test
    print(res.head())
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 160, in __str__
    return self.to_string()
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 740, in to_string
    columns = {colname: util.pad(
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 740, in <dictcomp>
    columns = {colname: util.pad(
  File "/usr/local/lib/python3.8/dist-packages/dataiter/deco.py", line 30, in wrapper
    return list(value)
  File "/usr/local/lib/python3.8/dist-packages/dataiter/util.py", line 67, in pad
    width = max(len(x) for x in strings)
  File "/usr/local/lib/python3.8/dist-packages/dataiter/util.py", line 67, in <genexpr>
    width = max(len(x) for x in strings)
TypeError: object of type 'NoneType' has no len()
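
The traceback shows len() being called on a None element while computing the column width. One possible fix, sketched here with a hypothetical `pad` function rather than the actual dataiter.util.pad, is to convert every value to text before measuring.

```python
def pad(strings, align="right"):
    """Hypothetical padding helper that tolerates None and non-strings."""
    # Convert first, so len() only ever sees str.
    strings = [str(x) for x in strings]
    # default=0 keeps an empty input from raising ValueError.
    width = max((len(x) for x in strings), default=0)
    if align == "right":
        return [x.rjust(width) for x in strings]
    return [x.ljust(width) for x in strings]


print(pad(["a", None, "abc"]))
```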

Shorten aggregate notation

Currently we do

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=lambda x: x.sales.sum(),
     sales_per_day=lambda x: x.sales.mean(),
 ))

With a lot of calculated columns, that gets a bit verbose with all the lambdas.

Maybe we could add helpers to shorten the lambdas in common cases? For example:

def mean(name):
    return lambda x: x[name].mean()

def sum(name):
    return lambda x: x[name].sum()

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=di.sum("sales"),
     sales_per_day=di.mean("sales"),
 ))

Or use a single lambda with a complex return value, similar to Pandas' apply? That looks nice with a lot of columns, but really bad when only one column is needed, such as .aggregate(n=di.nrow) in the current notation.

(data
 .group_by("year", "month")
 .aggregate(lambda x: {
     "sales_total": x.sales.sum(),
     "sales_per_day": x.sales.mean(),
 }))
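
The two notations do not have to exclude each other. The sketch below is a pure Python mock-up in which `aggregate`, `sum_of` and `mean_of` are hypothetical names and groups are plain dictionaries of lists; it accepts either keyword helpers or a single function that returns a dictionary.

```python
from statistics import mean as _mean


def sum_of(name):
    return lambda group: sum(group[name])


def mean_of(name):
    return lambda group: _mean(group[name])


def aggregate(groups, function=None, **key_function_pairs):
    """Return one row per group from either notation."""
    rows = []
    for key, group in groups.items():
        row = {"key": key}
        if function is not None:
            # Single function returning several columns at once.
            row.update(function(group))
        for name, f in key_function_pairs.items():
            # One function per output column.
            row[name] = f(group)
        rows.append(row)
    return rows


groups = {
    (2024, 1): {"sales": [10, 20]},
    (2024, 2): {"sales": [5, 7, 9]},
}
short = aggregate(groups,
                  sales_total=sum_of("sales"),
                  sales_per_day=mean_of("sales"))
single = aggregate(groups, lambda g: {
    "sales_total": sum(g["sales"]),
    "sales_per_day": _mean(g["sales"]),
})
print(short == single)
```

Supporting both keeps the one-column case short while still allowing the compact form when many columns are calculated.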