pandas-predicates

A library for creating predicate expressions to filter pandas dataframes.

Status

This is a work in progress. Be mindful that:

  • there's no released version yet
  • the API is not stable
  • the documentation is sparse

Why

I often found myself writing unwieldy code to build filtering masks for pandas dataframes. Inspired by libraries like polars and pyspark, which use static expressions to build queries, I decided to create a simple API for building expressions that can be used to filter a pandas dataframe.

Goals

For now, the goal is to provide a simple API for creating predicates that filter a pandas dataframe. Initially I don't intend to support generic expressions, but that may change in the future.

Stated goals:

  • follow the pandas API as closely as possible
  • cover all methods that can be used to create a boolean mask
  • implement useful combinators to combine predicates

Progress tracking

We are implementing the methods in this order:

  • Documentation
    • API reference
    • Examples
  • Finish packaging and CI
    • basic packaging (pypi, readthedocs, etc.)
    • CI
    • release installable 0.1.0 version
  • All series methods that produce a boolean mask directly (like isdigit, str.contains, etc.).
    • base API: isna, isnull, isin
    • str base API: isalpha, isalnum, isdigit, isspace, islower, isupper, istitle, isnumeric, isdecimal
    • str matching API: contains, match, fullmatch, startswith
    • dt API:
    • cat API:
    • anything missing?
  • Methods that can be used for simple comparisons (like doing my_df.my_float_column > 20).
    • base API: __eq__, __ne__, __gt__, __ge__, __lt__, __le__, between
    • needs research to make a complete list
    • Any other method that would be useful
    • needs research to make a complete list
  • Boolean combinators:
    • trivial: never, always
    • basic: __invert__, __and__, __or__, __xor__
    • reverse: __rand__, __ror__, __rxor__
    • basic aggregations: all_of, any_of, none_of
    • count based aggregations: at_most_one_of, at_least_k_of, at_most_k_of, exactly_k_of
    • custom aggregators? aggregate_with(agg_func, *expressions, start=True)
    • anything missing?
  • Expressions API
    • Formalize the expression tree
    • Implement a pretty printer
    • Improved type checking?
    • needs research
  • Anything else?
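
The count-based aggregations in the list above are not implemented yet. One possible reading of their intended semantics, written against plain pandas boolean masks rather than this library, is sketched below. The function names are borrowed from the list, but the signatures, the bodies, and the choice to accept masks instead of predicates are assumptions made for illustration only.

```python
import pandas as pd


def count_true(*masks: pd.Series) -> pd.Series:
    """Count, row by row, how many of the given boolean masks are True."""
    # Put the masks side by side as columns, then sum the booleans per row.
    return pd.concat(masks, axis=1).sum(axis=1)


# Hypothetical signatures: the threshold k comes first, then the masks.
def at_least_k_of(k: int, *masks: pd.Series) -> pd.Series:
    return count_true(*masks) >= k


def at_most_k_of(k: int, *masks: pd.Series) -> pd.Series:
    return count_true(*masks) <= k


def exactly_k_of(k: int, *masks: pd.Series) -> pd.Series:
    return count_true(*masks) == k


df = pd.DataFrame(
    {"a": [1, 4, 7], "b": ["blabla", "bleble", "marco"], "c": [3, 6, 9]}
)
masks = [df.a > 3, df.b.str.startswith("bl"), df.c > 8]

# Rows where at least two of the three conditions hold.
two_or_more = at_least_k_of(2, *masks)
# Rows where exactly one condition holds.
only_one = exactly_k_of(1, *masks)
```

Under this reading, at_most_one_of would be at_most_k_of with k fixed to 1, and all_of, any_of and none_of are the special cases where the count equals the number of masks, is at least one, or is zero.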

Syntax

For now the expression language has the following syntax:

    import pandas as pd
    from pandas_predicates import col

    df = pd.DataFrame([
        {"a": 1, "b": "blabla", "c": 3},
        {"a": 4, "b": "bleble", "c": 6},
        {"a": 7, "b": "marco", "c": 9},
    ])

    # simple predicates

    # select rows where column a is equal to 1
    a_is_1 = col("a") == 1
    # DataFrame.equals returns a single bool; `==` would compare element-wise
    assert a_is_1.filter(df).equals(df[df.a == 1])

    # select rows where column a is not equal to 1
    a_is_not_1 = col("a") != 1
    assert a_is_not_1.filter(df).equals(df[df.a != 1])

    # select rows where column a is greater than 4 and b contains the regex "(bl.)+"
    a_gt_4_and_b_matches = (col("a") > 4) & col("b").str.contains(r"(bl.)+")
    assert a_gt_4_and_b_matches.filter(df).equals(df[(df.a > 4) & df.b.str.contains(r"(bl.)+")])

    # select rows where column a is greater than 4 or b contains the regex "(bl.)+"
    a_gt_4_or_b_matches = (col("a") > 4) | col("b").str.contains(r"(bl.)+")
    assert a_gt_4_or_b_matches.filter(df).equals(df[(df.a > 4) | df.b.str.contains(r"(bl.)+")])
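
To give a feel for how such deferred predicates can work, here is a minimal sketch of the idea. It is not this library's implementation: SketchPredicate, SketchColumn and sketch_col are hypothetical names invented for the example, and only a few operators are covered. Each predicate wraps a function from a dataframe to a boolean mask, so nothing is evaluated until filter is called.

```python
import pandas as pd


class SketchPredicate:
    """Hypothetical deferred predicate: wraps a DataFrame -> boolean Series function."""

    def __init__(self, mask_fn):
        self._mask_fn = mask_fn

    def mask(self, df: pd.DataFrame) -> pd.Series:
        # Evaluate the deferred expression against a concrete dataframe.
        return self._mask_fn(df)

    def filter(self, df: pd.DataFrame) -> pd.DataFrame:
        # Keep only the rows for which the mask is True.
        return df[self.mask(df)]

    def __and__(self, other):
        return SketchPredicate(lambda df: self.mask(df) & other.mask(df))

    def __or__(self, other):
        return SketchPredicate(lambda df: self.mask(df) | other.mask(df))

    def __invert__(self):
        return SketchPredicate(lambda df: ~self.mask(df))


class SketchColumn:
    """Hypothetical column reference; comparisons build predicates lazily."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, value):
        return SketchPredicate(lambda df: df[self.name] == value)

    def __ne__(self, value):
        return SketchPredicate(lambda df: df[self.name] != value)

    def __gt__(self, value):
        return SketchPredicate(lambda df: df[self.name] > value)


def sketch_col(name: str) -> SketchColumn:
    return SketchColumn(name)


df = pd.DataFrame([
    {"a": 1, "b": "blabla", "c": 3},
    {"a": 4, "b": "bleble", "c": 6},
    {"a": 7, "b": "marco", "c": 9},
])

# Built once, independent of any dataframe, and applied later.
pred = (sketch_col("a") > 1) & ~(sketch_col("c") == 9)
result = pred.filter(df)
```

Because the predicate holds no reference to a particular dataframe, the same object can be reused across dataframes that share the relevant column names.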
