
Comments (5)

canimus commented on May 13, 2024

Thank you @dsaad68 for investigating the use of cuallee in the context of Spark Connect. To be honest, the fix does not look that complex. pyspark_validation.py already accounts for Spark versions earlier than 3.3.0, which have no support for Observation, and falls back to a plain SELECT to compute the quality metrics. We will explore searching for a Spark Connect session in the globals, as we do for compatibility with Databricks, and follow your advice on using an environment variable, CUALLEE_SPARK_REMOTE, to retrieve the location of the endpoint as a fallback. Thanks for the very detailed root cause analysis, very much appreciated.
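
The endpoint fallback described above could be sketched roughly as follows. This is only an illustration, not cuallee's actual code: the helper name resolve_spark_remote is hypothetical, and checking CUALLEE_SPARK_REMOTE before SPARK_REMOTE is an assumed precedence.

```python
import os
from typing import Mapping, Optional


def resolve_spark_remote(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the Spark Connect endpoint from the environment, or None.

    Hypothetical helper: looks at CUALLEE_SPARK_REMOTE first and then at
    SPARK_REMOTE. That ordering is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    for name in ("CUALLEE_SPARK_REMOTE", "SPARK_REMOTE"):
        value = env.get(name)
        if value:  # skip unset or empty values
            return value
    return None


if __name__ == "__main__":
    endpoint = resolve_spark_remote()
    # With PySpark 3.4+, an endpoint such as "sc://localhost:15002" could
    # then be passed to SparkSession.builder.remote(endpoint).getOrCreate().
    print(endpoint)
```

Passing the environment as a mapping keeps the lookup easy to test without touching the real process environment.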

from cuallee.

canimus commented on May 13, 2024

Hi @dsaad68, bring your implementation in if you feel it fits your use case better. Always happy to see new contributions, and let me know whether v0.9.1 worked out. Because you investigated all the way, the rest of the spark-connect part was relatively trivial. The real value was understanding the session creation and the Observation limitation you highlighted. So thanks again for pointing that out.


dsaad68 commented on May 13, 2024

I have investigated the issue in more depth:

  • DataFrames generated via Spark Connect are of the pyspark.sql.connect.dataframe.DataFrame type. Unfortunately, this type is not handled by the validate function located in __init__.py. This can be remedied, for example:
    try:
        from pyspark.sql.connect.dataframe import DataFrame as pyspark_connect_dataframe
    except (ModuleNotFoundError, ImportError):
        logger.debug("KO: PySpark Connect")
    
    def validate(self, dataframe: Any):
        ...
        # When the dataframe comes from the Spark Connect DataFrame API
        elif "pyspark_connect_dataframe" in globals() and isinstance(
            dataframe, pyspark_connect_dataframe
        ):
            self.compute_engine = importlib.import_module("cuallee.pyspark_validation")
        ...
  • Additionally, the summary function in pyspark_validation.py does not accept the parameters needed to start a remote Spark session, which Spark Connect requires.
    A workaround is to set the following environment variable in the code:
    os.environ['SPARK_REMOTE'] = "sc://localhost:15002"
  • Currently, as of its latest version 3.5.1, Spark Connect does not support the Observation feature, which is crucial to this library's functionality.
    raise PySparkNotImplementedError(
    pyspark.errors.exceptions.base.PySparkNotImplementedError: [NOT_IMPLEMENTED] Observation support for Spark Connect is not implemented.
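
The findings above suggest two checks: whether a dataframe comes from Spark Connect, and whether Observation can be used for it. A minimal sketch under stated assumptions follows. Both function names are hypothetical, the type check uses the class's module path so that it works even when pyspark.sql.connect cannot be imported, and the rule "no Observation on Spark Connect" reflects only what this thread reports for 3.5.1.

```python
def is_spark_connect_dataframe(dataframe: object) -> bool:
    """Hypothetical check: True if the object's class is defined under
    the pyspark.sql.connect package (no pyspark import required)."""
    module = type(dataframe).__module__
    return module == "pyspark.sql.connect" or module.startswith("pyspark.sql.connect.")


def supports_observation(spark_version: str, is_connect: bool) -> bool:
    """Hypothetical rule taken from this thread: Observation needs
    Spark 3.3.0 or later and is not available through Spark Connect."""
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 3) and not is_connect


if __name__ == "__main__":
    # A plain object is not a Spark Connect dataframe.
    print(is_spark_connect_dataframe(object()))
    # Classic session on 3.5.1: Observation can be used.
    print(supports_observation("3.5.1", is_connect=False))
    # Spark Connect on 3.5.1: fall back to a plain SELECT instead.
    print(supports_observation("3.5.1", is_connect=True))
```

When supports_observation returns False, the plain SELECT fallback that already exists for versions before 3.3.0 would be the path to take.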
    

It's unfortunate, as I was eager to integrate the cuallee package into my PySpark process. I'll keep an eye on subsequent releases of PySpark, and if they start supporting Observation with Spark Connect, I will submit a detailed proposal or contribute via a pull request.


canimus commented on May 13, 2024

Solved with #185


dsaad68 commented on May 13, 2024

Thanks a lot. It was quick.
I finished my fix, but you beat me to it.

Just one suggestion: if os.environ["SPARK_CONNECT_MODE_ENABLED"] == "1", the Spark session is connected through Spark Connect. It is a good identifier that Spark Connect is being used.
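
A small sketch of that check, assuming the variable may be unset (indexing os.environ directly raises KeyError in that case, so .get is used here). The helper name spark_connect_enabled is hypothetical.

```python
import os
from typing import Mapping, Optional


def spark_connect_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: True when SPARK_CONNECT_MODE_ENABLED is "1".

    Uses .get so that a missing variable yields False instead of KeyError.
    """
    env = os.environ if env is None else env
    return env.get("SPARK_CONNECT_MODE_ENABLED") == "1"


if __name__ == "__main__":
    print(spark_connect_enabled())
```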

