Comments (18)
🤖
Hello,
Thank you for providing a detailed description of the issue you're experiencing. This seems to be related to the asynchronous nature of the SparkMagic `%%sql` command. When running in Jupyter, the command is designed to execute asynchronously, allowing the notebook to continue executing subsequent cells while the Spark job runs in the background. However, when running in Papermill, the execution model is synchronous, meaning it waits for each cell to finish executing before moving on to the next one.
To resolve this issue, you can try to modify the SparkMagic `%%sql` command to run synchronously. This can be done by modifying the `execute_sqlquery` method in the `SparkMagicBase` class in the `sparkmagicsbase.py` file. You would need to add a call to the `wait_for_idle` method after the `run_sqlquery` method call in the `execute_sqlquery` method. This will force the command to wait for the Spark job to finish before returning.
Here's how you can modify the `execute_sqlquery` method:
```python
def execute_sqlquery(
    self,
    cell,
    samplemethod,
    maxrows,
    samplefraction,
    session,
    output_var,
    quiet,
    coerce,
):
    sqlquery = self._sqlquery(cell, samplemethod, maxrows, samplefraction, coerce)
    df = self.spark_controller.run_sqlquery(sqlquery, session)
    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()
    if output_var is not None:
        self.shell.user_ns[output_var] = df
    if quiet:
        return None
    else:
        return df
```
Please note that this modification will make the `%%sql` command run synchronously in all environments, not just in Papermill. If you want to maintain the asynchronous behavior in Jupyter, you might need to add some logic to detect the execution environment and only call `wait_for_idle` when running in Papermill.
I hope this helps! Let me know if you have any other questions or if there's anything else I can do to assist you.
Sources
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from sparkmagic.
🤖
Hi there,
Thank you for the detailed report. This seems to be a complex issue related to the interaction between SparkMagic, Livy, and Papermill.
From your description, it appears that the `%%sql` magic command is executed successfully, but the output is not rendered properly in the cell, and the execution of subsequent cells is blocked. This might be due to the asynchronous nature of SparkMagic's interaction with Livy.
When you run a cell with SparkMagic in JupyterLab, it sends a request to Livy and then polls for the result. This works fine in an interactive environment like JupyterLab, but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.
You might want to look into adjusting the Livy session timeout or the polling interval in SparkMagic's configuration to see if that helps. However, please note that this is a common issue with running SparkMagic in non-interactive environments and there might not be a straightforward solution.
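As a concrete illustration, sparkmagic reads its timeout and retry settings from `~/.sparkmagic/config.json`. A minimal fragment might look like the following; the key names here are taken from the project's `example_config.json`, but you should verify them against your installed sparkmagic version:

```json
{
  "livy_session_startup_timeout_seconds": 600,
  "retry_policy": "configurable",
  "configurable_retry_policy_max_retries": 8
}
```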
As for the codebase, SparkMagic handles the execution of `%%sql` magic commands through the `execute_sqlquery` method in the `SparkMagicBase` class. This method is responsible for executing the SQL query on the remote Spark cluster. It creates an SQLQuery object using the parameters provided by the user and then calls the `run_sqlquery` method of the `spark_controller` object to execute the SQL query on the remote Spark cluster.
The SQLQuery object has an `execute` method that emits a SQL execution start event, converts the SQL query to a command based on the session kind (PySpark, Spark, or SparkR), executes the command, and then emits a SQL execution end event. If the execution is successful, it converts the result to a DataFrame and returns it. If an error occurs during the execution, it emits a SQL execution end event with the error details and raises the error.
I hope this information helps. If you have any further questions or need more clarification, please let me know.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
🤖
Hello,
Thank you for providing detailed information about the issue you're experiencing. It seems like the problem might be due to the asynchronous nature of the `execute` method in the `SQLQuery` class. This method sends a request to the Spark cluster and returns immediately, without waiting for the Spark job to complete. This can cause issues when running cells in Jupyter notebooks using Papermill, as the notebook might try to execute the next cell before the previous one has finished executing.
To resolve this issue, you can modify the `execute` method to wait for the Spark job to complete before returning. This can be done by polling the status of the Spark job in a loop until it's done.
Here's an example of how you can modify the `execute` method:
```python
import time

def execute(self, session):
    self._spark_events.emit_sql_execution_start_event(
        session.guid,
        session.kind,
        session.id,
        self.guid,
        self.samplemethod,
        self.maxrows,
        self.samplefraction,
    )
    command_guid = ""
    try:
        command = self.to_command(session.kind, session.sql_context_variable_name)
        command_guid = command.guid
        (success, records_text, mimetype) = command.execute(session)
        # Poll the status of the command until it's done
        while True:
            statement = session.http_client.get_statement(session.id, command_guid)
            if statement["state"] in ["available", "error"]:
                break
            time.sleep(1)  # Wait for a second before polling the status again
        if not success:
            raise BadUserDataException(records_text)
        result = records_to_dataframe(records_text, session.kind, self._coerce)
    except Exception as e:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            False,
            e.__class__.__name__,
            str(e),
        )
        raise
    else:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            True,
            "",
            "",
        )
        return result
```
This modification will make the `execute` method block until the Spark job is done, which should resolve the issue of cells in Jupyter notebooks not executing in the correct order when run with Papermill.
Please note that this is a simplified example and the actual implementation might need to be adjusted based on the specifics of your Spark cluster and the rest of your code. For example, you might want to add error handling for cases where the Spark job fails or takes too long to complete.
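One such adjustment would be to put a deadline on the open-ended `while True` polling loop so a stuck job fails instead of hanging forever. The sketch below is generic Python, not sparkmagic API; the `check` callable and the state names are assumptions modeled on Livy statement states:

```python
import time

def poll_until(check, timeout_s=600.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a terminal Livy-style state or the deadline passes.

    check() is assumed to return a state string such as "running",
    "available", "error", or "cancelled".  Returns the terminal state,
    or raises TimeoutError once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        state = check()
        if state in ("available", "error", "cancelled"):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"statement still {state!r} after {timeout_s}s")
        sleep(interval_s)
```

The `clock` and `sleep` parameters are injected only to make the loop easy to test; production code would use the defaults.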
I hope this helps! Let me know if you have any other questions or need further clarification.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/tests/test_sqlquery.py
🤖
Hello,
Thank you for providing a detailed description of your issue. It seems like you're experiencing a problem with the asynchronous execution of the `%%sql` magic command in SparkMagic when running with Papermill.
The `%%sql` magic command in SparkMagic is designed to execute SQL queries against the Spark context. When you run this command, it first checks if a session has been started, parses the arguments provided in the command line, and then executes the SQL query using the `execute_sqlquery` method. This method creates an SQLQuery object with the provided parameters and runs it using the `run_sqlquery` method of the `spark_controller`. The SQLQuery object then converts itself to a Command object based on the session kind and the SQL context variable name. The command is a string that will be executed in the Spark session. The SQLQuery object then executes the Command object in the Spark session. If the command execution is successful, it converts the result to a dataframe and returns it. If the command execution fails, it raises an exception.
As for the asynchronous execution in SparkMagic, it is handled by sending commands to the Livy server and waiting for the result. The actual execution of the commands is done asynchronously on the Livy server, allowing the Jupyter notebook to continue with other tasks.
Now, regarding your issue with Papermill, it seems like Papermill is not waiting for the `wait_for_idle` call to complete. This could be due to the way Papermill handles cell execution. Papermill executes each cell in the notebook individually and waits for the cell to complete before moving on to the next one. If a cell takes too long to execute, Papermill might time out and stop the execution. This could be why the `%%sql` command is hanging when running with Papermill.
One possible solution could be to increase the cell execution timeout in Papermill. You can do this by passing the `--request-timeout` option when running Papermill. For example:

```shell
papermill pm_sparkmagic_test.ipynb output1.ipynb --kernel pysparkkernel --log-level DEBUG --request-timeout 600
```
This will set the cell execution timeout to 600 seconds (10 minutes). You can adjust this value as needed.
I hope this helps! If you have any other questions or need further clarification, please let me know.
Sources
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/livyclientlib/sendtosparkcommand.py
🤖
Hello,
The `%%sql` magic command in SparkMagic and direct PySpark code have different execution patterns. The `%%sql` magic command uses the Livy REST interface to submit the SQL query to the Spark cluster and retrieve the results. This process is asynchronous, meaning the command returns immediately after submitting the query, and the actual execution happens in the Spark cluster. This could be the reason why you're experiencing the hanging issue when using Papermill.
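To make the Livy interaction concrete, here is a minimal, hypothetical sketch of that submit-then-poll lifecycle against Livy's statements REST API (`POST /sessions/{id}/statements`, then `GET` on the statement until its state is terminal). This is not sparkmagic's actual client code; the `http` parameter is an injected client for illustration, and real code would typically use `requests` directly:

```python
import time

# Livy statement states that mean "done" (see the Livy REST API docs)
LIVY_TERMINAL_STATES = {"available", "error", "cancelled"}

def run_statement_sync(base_url, session_id, code, http=None, interval_s=1.0):
    """Submit a statement to Livy and block until it reaches a terminal state."""
    if http is None:
        # Assumption for illustration: fall back to the requests library
        import requests as http
    resp = http.post(f"{base_url}/sessions/{session_id}/statements",
                     json={"code": code})
    resp.raise_for_status()
    statement_id = resp.json()["id"]
    while True:
        resp = http.get(f"{base_url}/sessions/{session_id}/statements/{statement_id}")
        resp.raise_for_status()
        statement = resp.json()
        if statement["state"] in LIVY_TERMINAL_STATES:
            return statement
        time.sleep(interval_s)  # keep polling until Livy reports a terminal state
```

The asynchrony the bot describes lives entirely on the Livy side: the `POST` returns as soon as the statement is queued, and it is the caller's polling loop that decides whether the cell behaves synchronously or not.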
On the other hand, direct PySpark code gives you more control over the execution process. You can create a SparkSession, execute transformations and actions on RDDs/DataFrames, and manage resources manually.
To modify the `%%sql` magic command to follow the same execution pattern as the direct PySpark code, you would need to modify the SparkMagic codebase. Specifically, you would need to modify the `execute` method in the `SQLQuery` class in the `sparkmagic.livyclientlib.sqlquery` module. This method is responsible for executing the SQL query and retrieving the results.
However, modifying the SparkMagic codebase might not be the best solution to resolve the hanging issue. Instead, you could try the following:
- Increase the Livy session timeout in your SparkMagic configuration. This could prevent the Livy session from timing out while waiting for the SQL query to complete.
- Use direct PySpark code instead of the `%%sql` magic command. This gives you more control over the execution process and might help to avoid the hanging issue.
- Check the resources of your Spark cluster. The hanging issue could be due to insufficient resources in the Spark cluster, especially if you are running large SQL queries.
- Check the logs of your Spark cluster and the Livy server. They might contain useful information about why the SQL query is hanging.
I hope this helps! Let me know if you have any other questions.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/tests/test_sqlquery.py
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
- sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py
Thanks Dosu for the quick reply.
I actually tried to troubleshoot by adding more logging in `command.py`. Ideally, each retry would add one line to the notebook cell output area. I noticed that sometimes the notebook cell does not have all the retry logs before the statement moves from the running to the available state. This might align with what you mentioned about the timing issue.
But when I changed `configurable_retry_policy_max_retries`, it did not help and I saw similar behavior. I do not fully understand how Papermill behaves in this case.
> but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.

It did not move to the next cell, actually, and got stuck in the `%%sql` magic cell.
Awesome! Thanks a lot.
I will give it a try and see if it could fix the issue.
Hi Dosu,
In `sparkmagicsbase.py`:

```python
def execute_sqlquery(
.....
    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()
```

Given `spark_controller` does not have `wait_for_idle`, I made a change to `command.py`. Between line 57 and 58, I added the following lines:

```python
session.wait_for_idle()
print('session.wait_for_idle...')
```

Also, I added a log for the session status:

```python
print(f'session status -> {status}')
```
I can observe the session was waiting for some time, then there was no further log. In the Papermill backend, I see more logs until it is idle again.
In the output notebook:
In the papermill log, the session reaches idle.
It seems papermill does not wait until `wait_for_idle` completes. Any suggestions?
Thanks Dosu.
I added the waiting code in `sqlquery.py`. The following is the output from JupyterLab. The new log ("waiting in sqlquery execute....") is shown in the cell output.
While executing with papermill, it does not show the log in the notebook cell output. It seems the cell output update stopped some time later, and the subsequent cell did not run either.
> It seems like the problem might be due to the asynchronous nature of the execute method in the SQLQuery class.

Can you share how the asynchronous mode works? It would be great if you could share the code that uses the asynchronous mode to call the Livy API. I am trying to understand the execution pattern difference of the `%%sql` magic.
The following piece works. I am wondering if we can use the same call pattern to run the `%%sql` logic.
```python
sqlContext = HiveContext(sqlContext)
dbs = sqlContext.sql("show databases")
dbs.show()
tables = sqlContext.sql("show tables")
tables.show()
```
Execution log:
From the log, we can see all the cell logic and sparkmagic logic are executed, but the cell output is not updated properly and the subsequent cell does not execute.
log3.log
`request-timeout` is not supported by papermill. `execution-timeout` is supported and I tried that, but it did not work.

```
--execution-timeout INTEGER  Time in seconds to wait for each cell before
                             failing execution (default: forever)
```
Can you share how the pyspark code is executed? Does that follow the async mode?
I see the following code can succeed stably. Can you suggest whether we can follow the same pattern for `%%sql`?
```python
sqlContext = HiveContext(sqlContext)
dbs = sqlContext.sql("show databases")
dbs.show()
tables = sqlContext.sql("show tables")
tables.show()
```
Hi @edwardps, glad Dosu (the bot) was able to help you narrow down the cause. I believe the bot is correct that the issue has something to do with the async execution of the cell magic.
I actually encountered this issue when implementing ipykernel >= 6 support and left a comment on the related papermill issue; however, the final implementation fixed the papermill issue for me 🤔
Maybe try changing your `ipykernel` version and see if that helps. I'd try both downgrading to v5 and upgrading to v7.
Thanks @devstein. I will give a try and get back soon.
Do you happen to know the call pattern difference between `%%sql` and running pyspark code?
BTW, Dosu works awesome!
@edwardps Sounds good, let me know.
> BTW, Dosu works awesome!
Thank you! Let me know if you have other open source projects that could benefit from it.
^I'm still teaching it when it should and shouldn't respond 😅
FYI - @devstein
```
ipykernel        5.5.6
ipython          7.33.0
ipython-genutils 0.2.0
ipywidgets       7.7.0
```

Downgraded ipykernel to 5.5.6 (ipykernel has no v7) and did not have any luck.
> Do you happen to know call pattern difference between %%sql and running pyspark code?

Is it possible to have `%%sql` follow the call pattern used by pyspark code execution? (as the pyspark code cell can work properly)
FYI - @devstein
Made some progress on the troubleshooting. I narrowed this down to the visualization.
If I comment out the following two lines, it works:
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/kernels/wrapperkernel/sparkkernelbase.py#L118-L119

```python
if conf.use_auto_viz():
    self._register_auto_viz()
```
From my observation, the Plotly-based visualization will then be disabled.
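If auto-visualization really is the trigger, it may be possible to disable it without patching the kernel code: sparkmagic reads a `use_auto_viz` flag from its config file. The key name below is taken from the project's `example_config.json` and should be verified against your installed version:

```json
{
  "use_auto_viz": false
}
```

Placing this in `~/.sparkmagic/config.json` would make `conf.use_auto_viz()` return false, skipping `_register_auto_viz()` with the same effect as commenting out those two lines.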
Thanks @edwardps, that's helpful... that means it's failing to execute this code while the kernel is being created.
I ran into this issue when I tried to make the `_register_auto_viz()` function async. It's curious that it no longer works.
Is there any difference in behavior if the first cell isn't `%%sql`?
For example:

```
print("start the spark session")
```

```
%%sql
show tables
```
> Is there any difference in behavior if the first cell isn't %%sql?

There is not much difference. The `%%sql` cell still gets stuck.
Hi @edwardps, @devstein. I am facing this problem too; is there any update on it?