
cuallee's People

Contributors

canimus, dcodeyl, dependabot[bot], dsaad68, herminio-iovio, maltzsama, runkelcorey, ryanjulyan, stuffbyyuki, vestalisvirginis


cuallee's Issues

Exception thrown when whole dataset fails validation using Polars

Describe the bug
When using Polars, if the whole dataset fails validation, an exception is thrown instead of a dataframe with the validation results: TypeError: '>=' not supported between instances of 'NoneType' and 'float'

The example below shows the use of the is_unique check, but this has been replicated with other checks as well.

To Reproduce
Steps to reproduce the behavior:

  1. Create a new Jupyter notebook / Python file
  2. Paste the following lines in:
import polars as pl
from cuallee import Check, CheckLevel

df = pl.DataFrame(
    {
        "id": [1, 1, 2, 3, 4],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)

id_check = Check(CheckLevel.WARNING, "ID unique")
display(id_check.is_unique("id").validate(df))

Expected behavior
Code returns a dataframe with validation results.

Actual behavior
Code throws exception below:


TypeError Traceback (most recent call last)
Cell In[28], line 13
4 df = pl.DataFrame(
5 {
6 "id": [1, 1, 2, 3, 4],
(...)
9 }
10 )
12 id_check = Check(CheckLevel.WARNING, "ID unique")
---> 13 display(id_check.is_unique("id").validate(df))

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/__init__.py:586, in Check.validate(self, dataframe)
581 self.compute_engine = importlib.import_module("cuallee.polars_validation")
583 assert self.compute_engine.validate_data_types(
584 self.rules, dataframe
585 ), "Invalid data types between rules and dataframe"
--> 586 return self.compute_engine.summary(self, dataframe)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:414, in summary(check, dataframe)
410 return "FAIL"
412 rows = len(dataframe)
--> 414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:427, in <listcomp>(.0)
410 return "FAIL"
412 rows = len(dataframe)
414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
--> 427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:407, in summary.<locals>._evaluate_status(pass_rate, pass_threshold)
405 def _evaluate_status(pass_rate, pass_threshold):
--> 407 if pass_rate >= pass_threshold:
408 return "PASS"
410 return "FAIL"

TypeError: '>=' not supported between instances of 'NoneType' and 'float'
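The comparison fails because the computed pass rate arrives as None when no rows satisfy the rule. A minimal sketch of a defensive guard, assuming the fix would live in that helper (cuallee's actual change may differ):

def _evaluate_status(pass_rate, pass_threshold):
    # Sketch only: treat a missing pass rate as a failed check instead of
    # letting the None >= float comparison raise a TypeError.
    if pass_rate is not None and pass_rate >= pass_threshold:
        return "PASS"
    return "FAIL"

print(_evaluate_status(None, 1.0))  # "FAIL" instead of raising TypeError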

Desktop (please complete the following information):

  • OS: Linux
  • cuallee = "^0.4.5"
  • ipykernel = "^6.24.0"
  • polars = "0.18.7"

Additional context
None.

Implementation of `has_workflow` on snowpark

The has_sum method collapses a column in a dataframe by adding up all its elements. It has been successfully implemented in pandas, pyspark and duckdb; the implementation on snowpark is missing.
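For reference, the user-facing call is the same across the supported engines; a short usage sketch on pandas (assuming the standard cuallee API, with 40 as an arbitrary expected sum):

import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"bar": [6, 7, 8, 9, 10]})
check = Check(CheckLevel.WARNING, "SumCheck")
check.has_sum("bar", 40)  # passes when the elements of "bar" add up to 40
print(check.validate(df))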

Number of overall violations divided by number of columns in `are_complete`

Describe the bug
The issue is probably noticeable with other are_* validations as well. It seems that when passing multiple columns to be checked, the total number of violations found is divided by the number of columns given. For example, if a check runs over 3 columns and 12 violations are found, only 4 are reported. This is fine if all violations fall on the same rows, but it underreports when the violations are spread across different rows.

To Reproduce
Run the following code:

from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel
spark = SparkSession.builder.getOrCreate()

check = Check(CheckLevel.WARNING, "Not NULL").are_complete(["col_a", "col_b", "col_c"])
# This is fine

df1 = spark.createDataFrame([
    {"col_a": 1, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df1.show(truncate=False)
check.validate(df1).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|1    |1    |1    |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |0.0       |1.0      |1.0           |PASS  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
# This is also fine

df2 = spark.createDataFrame([
    {"col_a": None, "col_b": None, "col_c": None}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df2.show(truncate=False)
check.validate(df2).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |null |null |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should show 3 violations, but it seems the number of violations reported is divided by the number of columns

df3 = spark.createDataFrame([
    {"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": None, "col_b": 2, "col_c": 2}, {"col_a": None, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df3.show(truncate=False)
check.validate(df3).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|null |2    |2    |
|null |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should also show 3 violations and a 0% pass rate

df4 = spark.createDataFrame([
    {"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": None, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": None}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df4.show(truncate=False)
check.validate(df4).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|2    |null |2    |
|3    |3    |null |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+

Expected behavior
Ideally, the number of violations should reflect the number of offending rows (unless I'm misunderstanding something).
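For comparison, this is a row-wise count of offending rows for df4, i.e. the semantics this report expects (a sketch only, not how cuallee computes violations):

from pyspark.sql import functions as F

# Rows with at least one null among the checked columns
offending_rows = df4.filter(
    F.col("col_a").isNull() | F.col("col_b").isNull() | F.col("col_c").isNull()
).count()
print(offending_rows)  # 3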

Desktop (please complete the following information):

  • OS: Linux
  • Browser: Visual Studio Code / Jupyter Notebooks
  • Version cuallee==0.4.7

Executing validation tasks through Spark Connect is failing

Issue

I've been using Spark Connect for both testing and data validation tasks. Despite following the provided documentation closely, I encountered errors with every example I attempted.

These issues occurred on Apache Spark version 3.5.1. Below, I provide detailed steps to reproduce two specific errors, along with the corresponding error messages.

Environment:

  • Python Version: 3.11.8
  • Apache Spark Versions Tested: 3.5.1
  • Scala Version: 2.12
  • Operating System: Windows

Steps to Reproduce:

Setup

  1. Single node Spark cluster initiated via Docker using the command:
docker run -ti --name spark -p 15002:15002 bitnami/spark:latest /opt/bitnami/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
  2. Spark Session creation:
spark = SparkSession.builder.appName("PySpark Test") \
            .remote("sc://localhost:15002") \
            .getOrCreate()

1. Error with Dates Example

Issue Reproduction:

Executing the Dates Example as provided in the documentation:

# Unique values on id
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql(
    """
    SELECT 
        explode(
            sequence(
                to_date('2022-01-01'), 
                to_date('2022-01-10'), 
                interval 1 day)) as date
    """)
assert (
    check.is_between("date", "2022-01-01", "2022-01-10")
    .validate(df)
    .first()
    .status == "PASS"
)

Error Message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[8], line 14
      3 df = spark.sql(
      4     """
      5     SELECT 
   (...)
     10                 interval 1 day)) as date
     11     """)
     12 df = df.toPandas()
     13 assert (
---> 14     check.is_between("date", "2022-01-01", "2022-01-10")
     15     .validate(df)
     16     .first()
     17     .status == "PASS"
     18 )

File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:403, in Check.is_between(self, column, value, pct)
    401 def is_between(self, column: str, value: Tuple[Any], pct: float = 1.0):
    402     """Validation of a column between a range"""
--> 403     Rule("is_between", column, value, CheckDataType.AGNOSTIC, pct) >> self._rule
    404     return self

File <string>:13, in __init__(self, method, column, value, data_type, coverage, options, status, violations, pass_rate, ordinal)

File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:100, in Rule.__post_init__(self)
     99 def __post_init__(self):
--> 100     if (self.coverage <= 0) or (self.coverage > 1):
    101         raise ValueError("Coverage should be between 0 and 1")
    103     if isinstance(self.column, List):

TypeError: '<=' not supported between instances of 'str' and 'int'
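As an aside, the signature shown in the traceback, is_between(self, column, value: Tuple[Any], pct: float = 1.0), takes the range as a single tuple, so in the call above the string "2022-01-10" is bound to pct and trips the coverage check. Passing the bounds as one tuple avoids this particular TypeError, though it may not address Spark Connect support itself:

# Range passed as a single tuple, per the signature in the traceback
check.is_between("date", ("2022-01-01", "2022-01-10"))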

2. Error with Completeness Check Example:

Issue Reproduction:

Executing an example that checks for null values and uniqueness:

from datetime import date
from pyspark.sql import Row

df = spark.createDataFrame([
        Row(user_id=1111, order_id=4343, preferred_store='string1', birthdate=date(1999, 1, 1), joined_date=date(2022, 6, 1)),
        Row(user_id=2222, order_id=5454, preferred_store='string2', birthdate=date(2000, 2, 1), joined_date=date(2022, 7, 2)),
        Row(user_id=3333, order_id=6565, preferred_store='string3', birthdate=date(2001, 3, 1), joined_date=date(2022, 8, 3))
        ])

# Nulls on column Id
check = Check(CheckLevel.WARNING, "Completeness")
(   check
    .is_complete("user_id")
    .is_unique("user_id")
    .validate(df)
).show()

Error Message:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[7], line 13
      7 # Nulls on column Id
      8 check = Check(CheckLevel.WARNING, "Completeness")
      9 (   check
     10     .is_complete("user_id")
     11     .is_unique("user_id")
---> 12     .validate(df)
     13 ).show()

File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:703, in Check.validate(self, dataframe)
    700     self.compute_engine = importlib.import_module("cuallee.polars_validation")
    702 else:
--> 703     raise Exception(
    704         "Cuallee is not ready for this data structure. You can log a Feature Request in Github."
    705     )
    707 assert self.compute_engine.validate_data_types(
    708     self.rules, dataframe
    709 ), "Invalid data types between rules and dataframe"
    711 return self.compute_engine.summary(self, dataframe)

Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
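A plausible explanation (an assumption on my part, not confirmed by the maintainers) is that under Spark Connect the dataframe is a different class from the classic pyspark.sql.DataFrame, so cuallee's engine dispatch falls through to this generic error:

# Under Spark Connect the dataframe type differs from pyspark.sql.DataFrame
print(type(df))  # <class 'pyspark.sql.connect.dataframe.DataFrame'>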

I would appreciate any guidance or updates on resolving these errors. Thank you for your assistance.

Show results when running 1000 rules are not in order

It appears that when running a test with 1000 rules, the displayed results are not in the order in which the rules were added.
The test scenario uses the NYC Taxi data set with 20M rows.

df = spark.read.parquet("temp/data/*.parquet")
c = Check(CheckLevel.WARNING, "NYC")
for i in range(1000):
  c.is_greater_than("fare_amount", i)
c.validate(spark, df).show(n=1000, truncate=False)

# Displayed dataframe contains rows in the wrong order
# At rule 995 there is a discrepancy, because it is certainly not true that 10% of the rows have `fare_amount > 995`
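A possible workaround sketch for the ordering part, assuming the summary dataframe keeps the rule index in its `id` column (as shown in the outputs of the other reports above), is to sort before display:

c.validate(spark, df).orderBy("id").show(n=1000, truncate=False)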
