Comments (5)
Thank you @dsaad68 for investigating the use of cuallee in the context of Spark Connect. To be honest, the fix does not look that complex. `pyspark_validation.py` already takes into consideration versions earlier than 3.3.0 that don't support `Observation`, and it falls back to a plain `SELECT` to compute the quality metrics. We will explore searching for a Spark Connect session in the globals, as we do for compatibility with Databricks, and follow your advice on using an environment variable, `CUALLEE_SPARK_REMOTE`, to retrieve the location of the endpoint as a fallback. Thanks for the very detailed root cause analysis, very much appreciated.
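
A minimal sketch of the environment-variable fallback mentioned above. Only the variable name `CUALLEE_SPARK_REMOTE` comes from this thread. The helper name `resolve_remote_endpoint` is hypothetical and not part of cuallee, and treating a blank value as unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_remote_endpoint(
    env: Mapping[str, str] = os.environ,
    variable: str = "CUALLEE_SPARK_REMOTE",
) -> Optional[str]:
    """Return the Spark Connect endpoint found in the environment, or None.

    Hypothetical helper: a blank or missing variable is treated as "not set"
    so the caller can fall back to a regular local Spark session.
    """
    value = env.get(variable, "").strip()
    return value or None
```

With PySpark 3.4 or later, a non-empty result could then be handed to `SparkSession.builder.remote(...)` to open the remote session.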
from cuallee.
Hi @dsaad68, bring your implementation in if you feel it fits your use case better. Always happy to see new contributions, and let me know if v0.9.1 worked out. The fact that you investigated all the way made the rest of the spark-connect part relatively trivial. The real value was understanding the session creation and the `Observation` limitation you highlighted. So again, thanks for pointing that out.
I have investigated the issue in more depth:
- DataFrames generated via Spark Connect are of the `pyspark.sql.connect.dataframe.DataFrame` type. Unfortunately, this type is not compatible with the `validate` function located in `__init__.py`. This can be remedied, for example:

  ```python
  try:
      from pyspark.sql.connect.dataframe import DataFrame as pyspark_connect_dataframe
  except (ModuleNotFoundError, ImportError):
      logger.debug("KO: PySpark Connect")

  def validate(self, dataframe: Any):
      ...
      # When dataframe is Spark Connect DataFrame API
      elif "pyspark_connect_dataframe" in globals() and isinstance(
          dataframe, pyspark_connect_dataframe
      ):
          self.compute_engine = importlib.import_module("cuallee.pyspark_validation ...
  ```
- Additionally, the `summary` function in `pyspark_validation.py` cannot accept the parameters needed to initiate a remote Spark session, which is essential for Spark Connect. A workaround is to set one environment variable in the code, as follows: `os.environ['SPARK_REMOTE'] = "sc://localhost:15002"`
- Currently, as of its latest version 3.5.1, Spark Connect does not support the `Observation` feature, which is crucial to this library's functionality:

  ```
  raise PySparkNotImplementedError(
  pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] Observation support for Spark Connect is not implemented.
  ```
It's unfortunate, as I was eager to integrate the cuallee package into my PySpark process. I'll keep an eye on subsequent releases of PySpark, and if they start supporting `Observation` with Spark Connect, I will submit a detailed proposal or contribute via a pull request.
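
The two constraints discussed in this thread (no `Observation` before PySpark 3.3.0, and no `Observation` over Spark Connect as of 3.5.1) could be combined into a single decision, roughly as sketched below. The function names `parse_version` and `choose_metric_strategy` and the strategy labels are hypothetical and do not reflect how cuallee is actually structured.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse up to three leading numeric components, e.g. '3.5.1' -> (3, 5, 1).

    Parsing stops at the first non-numeric piece; missing components are
    padded with zeros so that '3.3' compares equal to (3, 3, 0).
    """
    parts = []
    for piece in version.split(".")[:3]:
        if not piece.isdigit():
            break
        parts.append(int(piece))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def choose_metric_strategy(spark_version: str, is_connect: bool) -> str:
    """Return 'observation' when it should be usable, otherwise 'select'.

    Assumptions taken from this thread: Observation requires PySpark >= 3.3.0
    and is not implemented for Spark Connect sessions (as of 3.5.1).
    """
    if is_connect or parse_version(spark_version) < (3, 3, 0):
        return "select"
    return "observation"
```

If a later PySpark release adds `Observation` support to Spark Connect, the `is_connect` branch would need a version check of its own.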
Solved with #185
Thanks a lot. It was quick.
I finished my fix, but you beat me to it.
Just one suggestion: if `os.environ["SPARK_CONNECT_MODE_ENABLED"] == "1"`, the Spark session is connected through Spark Connect. It is a good indicator that Spark Connect is being used.
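
A small sketch of that check, written so a missing variable does not raise `KeyError`. The helper name `is_spark_connect_enabled` is hypothetical; only the variable name and the `"1"` value come from the suggestion above.

```python
import os
from typing import Mapping


def is_spark_connect_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when SPARK_CONNECT_MODE_ENABLED is set to "1".

    Uses .get() so that an unset variable simply yields False instead of
    raising KeyError, as a direct os.environ[...] lookup would.
    """
    return env.get("SPARK_CONNECT_MODE_ENABLED") == "1"
```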