rcalsaverini / pandas-predicates

License: MIT License

Python 100.00%

pandas-predicates's Introduction

pandas-predicates

A library for creating predicate expressions to filter pandas dataframes.

Status

This is a work in progress. Be mindful that:

there's no released version yet
the API is not stable
the documentation is sparse

Why

I often found myself writing unwieldy code to build filtering masks for pandas dataframes. Inspired by libraries like polars and pyspark, which build queries from static expressions, I decided to create a simple API for expressions that can be used to filter a pandas dataframe.

Goals

For now, the goal is to provide a simple API for creating predicates that filter a pandas dataframe. Initially I don't intend to support generic expressions, but that may change in the future.

Stated goals:

follow the pandas API as closely as possible
cover all methods that can be used to create a boolean mask
implement useful combinators to combine predicates
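
To illustrate what combinators over predicates might look like, here is a minimal sketch in plain pandas. It does not use this library's API: the helper names `all_of`, `any_of` and `negate` are hypothetical, and a "predicate" is assumed to be any function that takes a dataframe and returns a boolean mask.

```python
from functools import reduce
import operator

import pandas as pd


def all_of(*mask_fns):
    """Hypothetical combinator: True where every mask is True (needs >= 1 mask)."""
    def combined(df):
        return reduce(operator.and_, (fn(df) for fn in mask_fns))
    return combined


def any_of(*mask_fns):
    """Hypothetical combinator: True where at least one mask is True (needs >= 1 mask)."""
    def combined(df):
        return reduce(operator.or_, (fn(df) for fn in mask_fns))
    return combined


def negate(mask_fn):
    """Hypothetical combinator: invert a mask."""
    return lambda df: ~mask_fn(df)


df = pd.DataFrame({"a": [1, 4, 7], "b": ["blabla", "bleble", "marco"]})

# Each predicate is just a function from a dataframe to a boolean Series.
a_gt_1 = lambda df: df["a"] > 1
b_starts_with_bl = lambda df: df["b"].str.startswith("bl")

# Rows where a > 1 and b does not start with "bl".
selected = df[all_of(a_gt_1, negate(b_starts_with_bl))(df)]
```

Because every combinator returns another mask function, the results can be nested freely, which is the property the goal above is after.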

Progress tracking

We are implementing the methods in this order:

Syntax

For now, the expression language has the following syntax:

    import pandas as pd
    from pandas_predicates import col

    df = pd.DataFrame([
        {"a": 1, "b": "blabla", "c": 3},
        {"a": 4, "b": "bleble", "c": 6},
        {"a": 7, "b": "marco", "c": 9},
    ])

    # simple predicates

    # select rows where column a is equal to 1
    # (DataFrames are compared with .equals, since `==` is element-wise
    # and cannot be used directly in an assert)
    a_is_1 = col("a") == 1
    assert a_is_1.filter(df).equals(df[df.a == 1])

    # select rows where column a is not equal to 1
    a_is_not_1 = col("a") != 1
    assert a_is_not_1.filter(df).equals(df[df.a != 1])

    # select rows where column a is greater than 4 and b contains the regex "(bl.)+"
    a_gt_4_and_b_matches = (col("a") > 4) & col("b").str.contains(r"(bl.)+")
    assert a_gt_4_and_b_matches.filter(df).equals(df[(df.a > 4) & df.b.str.contains(r"(bl.)+")])

    # select rows where column a is greater than 4 or b contains the regex "(bl.)+"
    a_gt_4_or_b_matches = (col("a") > 4) | col("b").str.contains(r"(bl.)+")
    assert a_gt_4_or_b_matches.filter(df).equals(df[(df.a > 4) | df.b.str.contains(r"(bl.)+")])
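
For readers curious how a syntax like this can work, the sketch below shows one possible approach: deferred evaluation through operator overloading. It is not the library's actual implementation. The names `SketchPredicate`, `SketchColumn` and `sketch_col` are hypothetical, and only a few comparison operators are covered.

```python
import pandas as pd


class SketchPredicate:
    """Hypothetical: wraps a function mapping a DataFrame to a boolean mask."""

    def __init__(self, mask_fn):
        self._mask_fn = mask_fn

    def mask(self, df):
        # The mask is only computed when a dataframe is supplied.
        return self._mask_fn(df)

    def filter(self, df):
        return df[self.mask(df)]

    def __and__(self, other):
        return SketchPredicate(lambda df: self.mask(df) & other.mask(df))

    def __or__(self, other):
        return SketchPredicate(lambda df: self.mask(df) | other.mask(df))

    def __invert__(self):
        return SketchPredicate(lambda df: ~self.mask(df))


class SketchColumn:
    """Hypothetical: a column reference whose comparisons build predicates."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        return SketchPredicate(lambda df: df[self.name] == value)

    def __ne__(self, value):
        return SketchPredicate(lambda df: df[self.name] != value)

    def __gt__(self, value):
        return SketchPredicate(lambda df: df[self.name] > value)


def sketch_col(name):
    return SketchColumn(name)


df = pd.DataFrame({"a": [1, 4, 7], "c": [3, 6, 9]})

# Nothing touches the dataframe until .filter(df) is called.
pred = (sketch_col("a") > 1) & ~(sketch_col("c") == 9)
result = pred.filter(df)
```

The expression is built once and can be applied to any dataframe that has the referenced columns, which is what makes this style easier to reuse than an inline mask.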
