A library for creating predicating expressions to filter pandas dataframes.
This is a work in progress. Be mindful that:
- there's no released version yet
- the API is not stable
- the documentation is sparse
I often found myself creating really unwieldy code for creating filtering masks for pandas dataframes. Inspired by libraries like polars or pyspark, which use static expressions to build queries, I decided to create a simple API to create expressions that can be used to filter a dataframe in pandas.
For now the goal is to simply provide a simple API to create predicates to filter a pandas dataframe. Initially I don't intend to support generic expressions, but that may change in the future.
Stated goals:
- follow as much as possible the pandas API
- cover all methods that can be used to create a boolean mask
- implement useful combinators to combine predicates
We are implementing the methods in this order:
- Documentation
- API reference
- Examples
- Finish packaging and CI
- basic packaging (pypi, readthedocs, etc.)
- CI
- release installable 0.1.0 version
- All series methods that produce a boolean mask directly (like
isdigit
,str.contains
, etc.).- base API:
isna
,isnull
,isin
- str base API:
isalpha
,isalnum
,isdigit
,isspace
,islower
,isupper
,istitle
,isnumeric
,isdecimal
- str matching API:
contains
,match
,fullmatch
,startswith
- dt API:
- cat API:
- anything missing?
- base API:
- Methods that can be used for simple comparisons (like doing
my_df.my_float_column > 20
).- base API:
__eq__
,__ne__
,__gt__
,__ge__
,__lt__
,__le__
,between
- needs reserach to make a complete list
- Any other method that would be useful
- needs reserach to make a complete list
- base API:
- Boolean combinators:
- trivial:
never
,always
- basic:
__invert__
,__and__
,__or__
, ,__xor__
- reverse:
__rand__
,__ror__
, ,__rxor__
- basic aggregations:
all_of
,any_of
,none_of
- count based aggregations:
at_most_one_of
,at_least_k_of
,at_most_k_of
,exactly_k_of
- custom aggregators?
aggregate_with(agg_func, *expressions, start=True)
- anything missing?
- trivial:
- Expressions API
- Formalize the expression tree
- Implement a pretty printer
- Improved type checking?
- needs research
- Anything else?
For now the expression language have the following syntax:
import pandas as pd
from pandas_predicates import col
df = pd.DataFrame([
{"a": 1, "b": "blabla", "c": 3},
{"a": 4, "b": "bleble", "c": 6},
{"a": 7, "b": "marco", "c": 9},
])
# simple predicates
# select rows where column a is equal to 1
a_is_1 = col("a") == 1
assert a_is_1.filter(df) == df[df.a == 1]
# select rows where column a is not equal to 1
a_is_not_1 = col("a") != 1
assert a_is_not_1.filter(df) == df[df.a != 1]
# select rows where column a is greater than 4 and b contains the regex "(bl.)+"
a_gt_1_and_b_matches = (col("a") > 4) & col("b").str.contains(r"(bl.)+")
assert a_gt_1_and_b_matches.filter(df) == df[(df.a > 4) & df.b.str.contains(r"(bl.)+")]
# select rows where column a is greater than 4 or b contains the regex "(bl.)+"
a_gt_1_or_b_matches = (col("a") > 4) | col("b").str.contains(r"(bl.)+")
assert a_gt_1_or_b_matches.filter(df) == df[(df.a > 4) | df.b.str.contains(r"(bl.)+")]