canimus / cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
License: Apache License 2.0
Would it be possible to add an example of using cuallee with a pandas DataFrame?
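For reference, a minimal sketch of what such an example could look like, assuming the same Check API used in the Polars and PySpark snippets below (the column names and data are made up):

import pandas as pd
from cuallee import Check, CheckLevel

# Toy frame; any pandas DataFrame with the referenced columns would do.
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "amount": [10.0, None, 30.0, 40.0, 50.0]})

# Same fluent interface as in the other examples; validate() returns a
# summary frame with one row per rule.
check = Check(CheckLevel.WARNING, "PandasExample")
result = check.is_complete("id").is_unique("id").validate(df)
print(result)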
Describe the bug
When using Polars, if the given dataset fails validation, an exception is thrown instead of returning a dataframe with the validation results: TypeError: '>=' not supported between instances of 'NoneType' and 'float'
The example below shows the use of the is_unique check, but this has been replicated with other checks as well.
To Reproduce
Steps to reproduce the behavior:
import polars as pl
from cuallee import Check, CheckLevel

df = pl.DataFrame(
    {
        "id": [1, 1, 2, 3, 4],
        "bar": [6, 7, 8, 9, 10],
        "ham": ["a", "b", "c", "d", "e"],
    }
)

id_check = Check(CheckLevel.WARNING, "ID unique")
display(id_check.is_unique("id").validate(df))
Expected behavior
Code returns a dataframe with validation results.
Actual behavior
Code throws exception below:
TypeError Traceback (most recent call last)
Cell In[28], line 13
4 df = pl.DataFrame(
5 {
6 "id": [1, 1, 2, 3, 4],
(...)
9 }
10 )
12 id_check = Check(CheckLevel.WARNING, "ID unique")
---> 13 display(id_check.is_unique("id").validate(df))

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/__init__.py:586, in Check.validate(self, dataframe)
581 self.compute_engine = importlib.import_module("cuallee.polars_validation")
583 assert self.compute_engine.validate_data_types(
584 self.rules, dataframe
585 ), "Invalid data types between rules and dataframe"
--> 586 return self.compute_engine.summary(self, dataframe)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:414, in summary(check, dataframe)
410 return "FAIL"
412 rows = len(dataframe)
--> 414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:427, in <listcomp>(.0)
410 return "FAIL"
412 rows = len(dataframe)
414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
--> 427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:407, in summary.<locals>._evaluate_status(pass_rate, pass_threshold)
405 def _evaluate_status(pass_rate, pass_threshold):
--> 407 if pass_rate >= pass_threshold:
408 return "PASS"
410 return "FAIL"

TypeError: '>=' not supported between instances of 'NoneType' and 'float'
Additional context
None.
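As an aside, the uniqueness that the rule evaluates can be cross-checked directly in Polars while validate() raises; an illustrative snippet using the df defined above:

# Cross-check duplicates independently of cuallee (illustrative only).
duplicates = df.height - df["id"].n_unique()
print(duplicates)  # 1 duplicate id in the example data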
Define the has_sum check for snowpark. It collapses a column of a dataframe by adding up all of its elements.
For example, std needs to have a tolerance between the passed value and the real value (with rounding/truncate/ceil-floor) and a number of decimals for rounding.
The has_sum method collapses a column in a dataframe by adding up all the elements. It has been successfully implemented in pandas, pyspark and duckdb. The implementation is missing in snowpark.
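As a rough sketch of the intended semantics only (this is not cuallee's compute-engine interface, and the helper name is made up), the Snowpark version would collapse the column with a SUM aggregate and compare against the expected total:

import snowflake.snowpark.functions as F

# Illustrative helper, assuming `df` is a snowflake.snowpark DataFrame.
def has_sum(df, column, expected):
    # Collapse the column by adding up all elements, then compare with the expected total.
    total = df.select(F.sum(F.col(column))).collect()[0][0]
    return total == expected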
Describe the bug
The issue is probably noticeable with other are_* validations. It seems that when passing a number of columns to be checked, the total number of violations found is divided by the number of columns given. For example, if a check is done over 3 columns and 12 violations are found, only 4 are reported. This is OK if all the violations were on the same rows, but it will under-report when the violations are on different rows.
To Reproduce
Run the following code:
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel
spark = SparkSession.builder.getOrCreate()
check = Check(CheckLevel.WARNING, "Not NULL").are_complete(["col_a", "col_b", "col_c"])
# This is fine
df1 = spark.createDataFrame([
{"col_a": 1, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df1.show(truncate=False)
check.validate(df1).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|1    |1    |1    |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |0.0       |1.0      |1.0           |PASS  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
# This is also fine
df2 = spark.createDataFrame([
{"col_a": None, "col_b": None, "col_c": None}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df2.show(truncate=False)
check.validate(df2).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |null |null |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should show 3 violations, but it seems that the number of violations outputted is divided by the number of columns
df3 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": None, "col_b": 2, "col_c": 2}, {"col_a": None, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df3.show(truncate=False)
check.validate(df3).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|null |2    |2    |
|null |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should also show 3 violations and a 0% pass rate
df4 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": None, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": None}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df4.show(truncate=False)
check.validate(df4).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|2    |null |2    |
|3    |3    |null |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
Expected behavior
Ideally, the number of violations should reflect the number of offending rows (unless I'm misunderstanding something).
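A minimal sketch of the row-level count described above, assuming classic PySpark and the toy frames from the reproduction (the helper name is made up):

from pyspark.sql import functions as F

def rows_with_any_null(df, columns):
    # Count rows where at least one of the listed columns is null,
    # i.e. the row-level violation count expected above.
    predicate = None
    for c in columns:
        cond = F.col(c).isNull()
        predicate = cond if predicate is None else (predicate | cond)
    return df.filter(predicate).count()

# rows_with_any_null(df4, ["col_a", "col_b", "col_c"])  -> 3 for the last example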
Desktop (please complete the following information):
cuallee==0.4.7
I've been using Spark Connect for both testing and data validation tasks. Despite following the provided documentation closely, I encountered errors with every example I attempted.
These issues occurred on Apache Spark version 3.5.1. Below, I provide detailed steps to reproduce two specific errors, along with the corresponding error messages.
Environment:
Python: 3.11.8
Spark: 3.5.1
Scala: 2.12
docker run -ti --name spark -p 15002:15002 bitnami/spark:latest /opt/bitnami/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark Test") \
    .remote("sc://localhost:15002") \
    .getOrCreate()
Issue Reproduction:
Executing the Dates Example as provided in the documentation:
# Dates example from the documentation
from cuallee import Check, CheckLevel

check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql(
    """
    SELECT
        explode(
            sequence(
                to_date('2022-01-01'),
                to_date('2022-01-10'),
                interval 1 day)) as date
    """)
assert (
    check.is_between("date", "2022-01-01", "2022-01-10")
    .validate(df)
    .first()
    .status == "PASS"
)
Error Message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 14
3 df = spark.sql(
4 """
5 SELECT
(...)
10 interval 1 day)) as date
11 """)
12 df = df.toPandas()
13 assert (
---> 14 check.is_between("date", "2022-01-01", "2022-01-10")
15 .validate(df)
16 .first()
17 .status == "PASS"
18 )
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:403, in Check.is_between(self, column, value, pct)
401 def is_between(self, column: str, value: Tuple[Any], pct: float = 1.0):
402 """Validation of a column between a range"""
--> 403 Rule("is_between", column, value, CheckDataType.AGNOSTIC, pct) >> self._rule
404 return self
File <string>:13, in __init__(self, method, column, value, data_type, coverage, options, status, violations, pass_rate, ordinal)
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:100, in Rule.__post_init__(self)
99 def __post_init__(self):
--> 100 if (self.coverage <= 0) or (self.coverage > 1):
101 raise ValueError("Coverage should be between 0 and 1")
103 if isinstance(self.column, List):
TypeError: '<=' not supported between instances of 'str' and 'int'
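Judging only from the signature shown in the traceback (is_between(self, column, value: Tuple[Any], pct: float = 1.0)), the second date appears to land in the pct argument; passing the range as a single tuple would likely avoid that:

# Assumption based on the signature above: the range goes in one tuple,
# so "2022-01-10" is not interpreted as the coverage (pct) argument.
check.is_between("date", ("2022-01-01", "2022-01-10")).validate(df).first().status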
Issue Reproduction:
Executing an example that checks for null values and uniqueness:
from datetime import date
from pyspark.sql import Row
df = spark.createDataFrame([
    Row(user_id=1111, order_id=4343, preferred_store='string1', birthdate=date(1999, 1, 1), joined_date=date(2022, 6, 1)),
    Row(user_id=2222, order_id=5454, preferred_store='string2', birthdate=date(2000, 2, 1), joined_date=date(2022, 7, 2)),
    Row(user_id=3333, order_id=6565, preferred_store='string3', birthdate=date(2001, 3, 1), joined_date=date(2022, 8, 3))
])

# Nulls on column Id
check = Check(CheckLevel.WARNING, "Completeness")
(
    check
    .is_complete("user_id")
    .is_unique("user_id")
    .validate(df)
).show()
Error Message:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[7], line 13
7 # Nulls on column Id
8 check = Check(CheckLevel.WARNING, "Completeness")
9 ( check
10 .is_complete("user_id")
11 .is_unique("user_id")
---> 12 .validate(df)
13 ).show()
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:703, in Check.validate(self, dataframe)
700 self.compute_engine = importlib.import_module("cuallee.polars_validation")
702 else:
--> 703 raise Exception(
704 "Cuallee is not ready for this data structure. You can log a Feature Request in Github."
705 )
707 assert self.compute_engine.validate_data_types(
708 self.rules, dataframe
709 ), "Invalid data types between rules and dataframe"
711 return self.compute_engine.summary(self, dataframe)
Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
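For what it's worth, Spark Connect sessions hand back a different DataFrame class than classic PySpark, which is presumably why the engine dispatch does not recognize it; a quick way to see the difference:

# Under Spark Connect the frame is pyspark.sql.connect.dataframe.DataFrame,
# not pyspark.sql.dataframe.DataFrame as with a classic session.
print(type(df))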
I would appreciate any guidance or updates on resolving these errors. Thank you for your assistance.
It appears that the validation results come out wrong when running a test with 1000 rules.
The test scenario uses the NYC Taxi data set with 20M rows.
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("temp/data/*.parquet")

c = Check(CheckLevel.WARNING, "NYC")
for i in range(1000):
    c.is_greater_than("fare_amount", i)

c.validate(df).show(n=1000, truncate=False)
# Displayed dataframe contains wrong order of rows
# For rule i=995 there is a discrepancy, because 10% of the rows certainly do not have `fare_amount > 995`
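One way to sanity-check a single suspicious rule against the raw data, assuming the parquet files expose a numeric fare_amount column (illustrative only):

from pyspark.sql import functions as F

# Compare the raw pass rate for one threshold with what cuallee reports for that rule.
total = df.count()
passing = df.filter(F.col("fare_amount") > 995).count()
print(passing / total)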