
dataiter's Introduction

Python Classes for Data Manipulation

Dataiter currently includes the following classes.

DataFrame is a class for tabular data similar to R's data.frame or pandas.DataFrame. Under the hood, it is a dictionary of NumPy arrays and thus capable of fast vectorized operations. You can consider it a lightweight alternative to Pandas with a simple and consistent API. Performance-wise, Dataiter relies on NumPy and Numba and is likely to be at best comparable to Pandas.

ListOfDicts is a class useful for manipulating data from JSON APIs. It provides functionality similar to libraries such as Underscore.js, with manipulation functions that iterate over the data and return a shallow modified copy of the original. attd.AttributeDict is used to provide convenient access to dictionary keys.
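
To illustrate the two ideas in the paragraph above (attribute access to dictionary keys, and functions that return a shallow copy rather than modifying in place), here is a minimal plain-Python sketch. It uses neither attd nor Dataiter; `AttrDict` and `filter_items` are hypothetical stand-ins, not the library's API.

```python
class AttrDict(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def filter_items(items, function):
    """Return a new list holding the same dict objects that pass the test.

    The list is new, but the items are not copied (a shallow copy).
    """
    return [item for item in items if function(item)]


data = [AttrDict(name="a", value=1), AttrDict(name="b", value=2)]
big = filter_items(data, lambda x: x.value > 1)

print(big)                 # the filtered list
print(len(data))           # the original list is untouched
print(big[0] is data[1])   # the surviving item is the same object
```

Because the copy is shallow, changing a dictionary in the result also changes it in the original list.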

GeoJSON is a simple wrapper class that allows reading a GeoJSON file into a DataFrame and writing a data frame to a GeoJSON file. Any operations on the data are thus done with methods provided by the data frame class. Geometry is read as-is into the "geometry" column, but no special geometric operations are currently supported.

Installation

# Latest stable version
pip install -U dataiter

# Latest development version
pip install -U git+https://github.com/otsaloma/dataiter

# Numba (optional)
pip install -U numba

Dataiter optionally uses Numba to speed up certain operations. If you have Numba installed and importing it succeeds, Dataiter will use it automatically. It's currently not a hard dependency, so you need to install it separately.
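
An optional dependency of this kind is commonly handled with a guarded import. The following is a minimal sketch of that pattern, assuming nothing about Dataiter's internals; the names `USE_NUMBA`, `maybe_jit` and `total` are hypothetical.

```python
# Sketch of an optional Numba dependency (hypothetical names, not Dataiter's code).
try:
    import numba
    USE_NUMBA = True
except Exception:
    # Covers both a missing package and an import that fails for another reason.
    numba = None
    USE_NUMBA = False


def maybe_jit(function):
    """Compile with Numba if it is available, else return the function unchanged."""
    if USE_NUMBA:
        return numba.njit(function)
    return function


@maybe_jit
def total(values):
    # A plain loop: fast when compiled, still correct when not.
    result = 0.0
    for value in values:
        result += value
    return result


print(total((1.0, 2.0, 3.0)))
```

The calling code stays the same whether or not Numba is installed; only the speed differs.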

Documentation

https://dataiter.readthedocs.io/

If you're familiar with either dplyr (R) or Pandas (Python), the comparison table in the documentation will give you a quick overview of the differences and similarities.

https://dataiter.readthedocs.io/en/latest/comparison.html

dataiter's Issues

Add grouped modify

Usually used to calculate fractions.

Like in dplyr

> data.frame(x=rep(1:3, 3), y=1) %>% group_by(x) %>% mutate(z=1/sum(y))
  x y        z
1 1 1 0.333333
2 2 1 0.333333
3 3 1 0.333333
4 1 1 0.333333
5 2 1 0.333333
6 3 1 0.333333
7 1 1 0.333333
8 2 1 0.333333
9 3 1 0.333333
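
The semantics of the dplyr example can be sketched in plain Python: group the rows, compute each new value once per group, then attach it to every row of that group without changing the row count or order. This is only an illustration of the requested behavior; `grouped_modify` is a hypothetical name and not an existing Dataiter function.

```python
from collections import defaultdict


def grouped_modify(rows, by, **columns):
    """Add columns computed per group, keeping all rows in their original order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    # Evaluate each function once per group.
    values = {
        key: {name: function(group) for name, function in columns.items()}
        for key, group in groups.items()
    }
    # Build new row dicts so that the input rows are left unmodified.
    return [{**row, **values[row[by]]} for row in rows]


# Same data as the dplyr example: x = 1, 2, 3 repeated three times, y = 1.
rows = [{"x": x, "y": 1} for _ in range(3) for x in (1, 2, 3)]
result = grouped_modify(rows, "x", z=lambda group: 1 / sum(r["y"] for r in group))
for row in result:
    print(row)
```

Unlike an aggregate, which would return one row per group, all nine rows are kept and each gets the value computed for its group.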

Check handling of object columns

Testing whether we could use dtype object instead of string, which might be necessary at times to reduce memory use. At least np.unique can't handle that when called with the axis argument, as the test failures below show.

diff --git a/dataiter/test/__init__.py b/dataiter/test/__init__.py
index 7cf496f..3f38dda 100644
--- a/dataiter/test/__init__.py
+++ b/dataiter/test/__init__.py
@@ -30,7 +30,11 @@ def data_frame(path):
     path = get_data_path(path)
     extension = path.suffix.lstrip(".")
     read = getattr(DataFrame, f"read_{extension}")
-    return read(path)
+    data = read(path)
+    for colname, column in data.items():
+        if column.is_string():
+            data[colname] = column.as_object()
+    return data
 
 def geojson(path):
     path = get_data_path(path)
py.test --tb=no dataiter/test/test_data_frame.py
================================================= test session starts =================================================
platform linux -- Python 3.9.9, pytest-6.2.5, py-1.10.0, pluggy-0.13.1
rootdir: /home/osmo/Source/dataiter
collected 82 items                                                                                                    

dataiter/test/test_data_frame.py ................FF...........FFF.FFF.....................F.....F.....FF...F.F. [ 95%]
....                                                                                                            [100%]

=============================================== short test summary info ===============================================
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_aggregate - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_anti_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_from_json - assert \n   category ...905 rows total == \...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_from_pandas - assert \n      id    ...442 rows total ==...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_full_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_inner_join - TypeError: The axis argument to unique is ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_left_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_left_join_by_tuple - TypeError: The axis argument to un...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_semi_join - TypeError: The axis argument to unique is n...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_to_json - assert \n   category ...905 rows total == \n ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_unique_by_one - TypeError: The axis argument to unique ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_unique_by_same_dtype - TypeError: The axis argument to ...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_write_csv - assert \n      id    ...442 rows total == \...
FAILED dataiter/test/test_data_frame.py::TestDataFrame::test_write_json - assert \n   category ...905 rows total == ...
=========================================== 14 failed, 68 passed in 22.00s ============================================

Replace Pandas with Arrow

We're notably using Pandas for DataFrame.read_csv. That could probably be replaced with pyarrow.csv.read_csv, which would allow removing Pandas from the list of dependencies, leaving it as an optional dependency only needed for the from_pandas and to_pandas methods (with Pandas imported within the method body).

Arrow seems to be a lot faster at reading CSV files, and we need it anyway for reading and writing Parquet files, so the switch would probably let us drop a dependency we've never liked and have sought to replace.

Switch Vector to use pandas.Series for proper NA support

Proper NA support in NumPy doesn't look like it's happening.

https://numpy.org/neps/

Perhaps for that very reason, missing values are being pushed into the wrong part of the stack: Pandas now has an experimental NA scalar. While I'd like to avoid using Pandas, if this actually gets implemented consistently across dtypes and operations, it's probably a big enough reason to switch.

https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#experimental-na-scalar-to-denote-missing-values

Represent strings as objects?

When reading in data with arbitrarily long strings, dataiter.DataFrame will use a huge amount of memory, because NumPy Unicode strings take 4 bytes per character and use a fixed-length representation (all elements are allocated to match the longest one). We might need to (1) use object conditionally when a column has long strings, (2) always use object, if possible, or (3) use bytes instead via a custom dtype.
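
A rough back-of-the-envelope calculation shows how quickly the fixed-length representation grows. The sketch below only does the arithmetic (4 bytes per character, every element as wide as the longest); it does not allocate any arrays, and the function names are hypothetical.

```python
def fixed_width_bytes(strings):
    """Estimate the size of a fixed-width Unicode array:
    4 bytes per character, every element as wide as the longest string."""
    width = max((len(s) for s in strings), default=0)
    return 4 * width * len(strings)


def payload_bytes(strings):
    """Bytes needed for just the characters actually present, at 4 bytes each."""
    return 4 * sum(len(s) for s in strings)


# 999 one-character strings and a single 10,000-character outlier.
strings = ["a"] * 999 + ["x" * 10_000]
print(fixed_width_bytes(strings))  # 4 * 10000 * 1000 = 40000000
print(payload_bytes(strings))      # 4 * (999 + 10000) = 43996
```

A single long outlier makes the whole column roughly a thousand times larger than its actual content, which is the case the object-based options are meant to address.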

Adapt to NumPy 2.0

https://github.com/numpy/numpy/releases

At least

  • A new variable-length string dtype, numpy.dtypes.StringDType and a new
    numpy.strings namespace with performant ufuncs for string operations

Hence

  • Check for NumPy version >= 2.0 in __init__.py
  • Use the new string dtype by default?
  • Keep the old around as char, i.e. Vector.as_char etc.?
  • Check what missing value to use for strings
  • Use np.strings instead of np.char where applicable
  • Allow strings in aggregate.py function use_numba?
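
The first item on the list, the version check, could look something like the sketch below. It assumes the version is available as a string such as np.__version__; the helper name `is_numpy_2` is hypothetical, and the function takes the string as an argument so that it can be tried without NumPy installed.

```python
def is_numpy_2(version):
    """Return True if a version string such as "2.0.1" has a major version >= 2."""
    major = version.split(".")[0]
    return int(major) >= 2


# In practice the string would come from np.__version__.
for version in ("1.26.4", "2.0.0", "2.1.0rc1"):
    print(version, is_numpy_2(version))
```

Only the major version is parsed, so pre-release suffixes such as "rc1" in later components do not matter.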

Print fails on None

  File "test.py", line 14, in test
    print(res.head())
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 160, in __str__
    return self.to_string()
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 740, in to_string
    columns = {colname: util.pad(
  File "/usr/local/lib/python3.8/dist-packages/dataiter/data_frame.py", line 740, in <dictcomp>
    columns = {colname: util.pad(
  File "/usr/local/lib/python3.8/dist-packages/dataiter/deco.py", line 30, in wrapper
    return list(value)
  File "/usr/local/lib/python3.8/dist-packages/dataiter/util.py", line 67, in pad
    width = max(len(x) for x in strings)
  File "/usr/local/lib/python3.8/dist-packages/dataiter/util.py", line 67, in <genexpr>
    width = max(len(x) for x in strings)
TypeError: object of type 'NoneType' has no len()
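
The traceback shows len() being called on None inside util.pad. One possible fix is to convert every value to a string before measuring. The sketch below is a hypothetical stand-in, not the actual util.pad, whose full signature is not shown here; the `align` parameter is an assumption.

```python
def pad(strings, align="right"):
    """Pad values to a common width, rendering None and other non-strings via str()."""
    strings = [x if isinstance(x, str) else str(x) for x in strings]
    width = max((len(x) for x in strings), default=0)
    if align == "right":
        return [x.rjust(width) for x in strings]
    return [x.ljust(width) for x in strings]


print(pad(["a", None, "ccc"]))
# ['   a', 'None', ' ccc']
```

Whether None should print as "None", as an empty string or as some other missing-value marker is a separate design choice.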

Shorten aggregate notation

Currently we do

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=lambda x: x.sales.sum(),
     sales_per_day=lambda x: x.sales.mean(),
 ))

With a lot of calculated columns, that gets a bit verbose with all the lambdas.

Maybe we could add helpers to shorten the lambdas in common cases, e.g.

def mean(name):
    return lambda x: x[name].mean()

def sum(name):
    return lambda x: x[name].sum()

(data
 .group_by("year", "month")
 .aggregate(
     sales_total=di.sum("sales"),
     sales_per_day=di.mean("sales"),
 ))

Or use a single lambda with a complex return value, similar to Pandas' apply? That looks nice with a lot of columns, but really bad when only one column is needed, such as in the current notation .aggregate(n=di.nrow).

(data
 .group_by("year", "month")
 .aggregate(lambda x: {
     "sales_total": x.sales.sum(),
     "sales_per_day": x.sales.mean(),
 }))
