Comments (18)
🤖
Hello,
Thank you for providing a detailed description of the issue you're experiencing. This seems to be related to the asynchronous nature of the SparkMagic `%%sql` command. When running in Jupyter, the command is designed to execute asynchronously, allowing the notebook to continue executing subsequent cells while the Spark job runs in the background. However, when running in Papermill, the execution model is synchronous, meaning it waits for each cell to finish executing before moving on to the next one.
To resolve this issue, you can try to modify the SparkMagic `%%sql` command to run synchronously. This can be done by modifying the `execute_sqlquery` method in the `SparkMagicBase` class in the `sparkmagicsbase.py` file. You would need to add a call to the `wait_for_idle` method after the `run_sqlquery` method call in the `execute_sqlquery` method. This will force the command to wait for the Spark job to finish before returning.
Here's how you can modify the `execute_sqlquery` method:
```python
def execute_sqlquery(
    self,
    cell,
    samplemethod,
    maxrows,
    samplefraction,
    session,
    output_var,
    quiet,
    coerce,
):
    sqlquery = self._sqlquery(cell, samplemethod, maxrows, samplefraction, coerce)
    df = self.spark_controller.run_sqlquery(sqlquery, session)
    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()
    if output_var is not None:
        self.shell.user_ns[output_var] = df
    if quiet:
        return None
    else:
        return df
```
Please note that this modification will make the `%%sql` command run synchronously in all environments, not just in Papermill. If you want to maintain the asynchronous behavior in Jupyter, you might need to add some logic to detect the execution environment and only call `wait_for_idle` when running in Papermill.
I hope this helps! Let me know if you have any other questions or if there's anything else I can do to assist you.
Sources
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
If you want to continue the conversation, start your reply with @dosu-bot.
from sparkmagic.
🤖
Hi there,
Thank you for the detailed report. This seems to be a complex issue related to the interaction between SparkMagic, Livy, and Papermill.
From your description, it appears that the `%%sql` magic command is executed successfully, but the output is not rendered properly in the cell, and the execution of subsequent cells is blocked. This might be due to the asynchronous nature of SparkMagic's interaction with Livy.
When you run a cell with SparkMagic in JupyterLab, it sends a request to Livy and then polls for the result. This works fine in an interactive environment like JupyterLab, but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.
You might want to look into adjusting the Livy session timeout or the polling interval in SparkMagic's configuration to see if that helps. However, please note that this is a common issue with running SparkMagic in non-interactive environments and there might not be a straightforward solution.
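As a concrete illustration, sparkmagic reads its timeout and retry settings from `~/.sparkmagic/config.json`. A minimal fragment might look like the following; the key names here are taken from the project's `example_config.json`, but you should verify them against your installed sparkmagic version:

```json
{
  "livy_session_startup_timeout_seconds": 600,
  "retry_policy": "configurable",
  "configurable_retry_policy_max_retries": 8
}
```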
As for the codebase, SparkMagic handles the execution of `%%sql` magic commands through the `execute_sqlquery` method in the `SparkMagicBase` class. This method is responsible for executing the SQL query on the remote Spark cluster. It creates an SQLQuery object using the parameters provided by the user and then calls the `run_sqlquery` method of the `spark_controller` object to execute the SQL query on the remote Spark cluster.
The SQLQuery object has an `execute` method that emits a SQL execution start event, converts the SQL query to a command based on the session kind (PySpark, Spark, or SparkR), executes the command, and then emits a SQL execution end event. If the execution is successful, it converts the result to a DataFrame and returns it. If an error occurs during the execution, it emits a SQL execution end event with the error details and raises the error.
I hope this information helps. If you have any further questions or need more clarification, please let me know.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
🤖
Hello,
Thank you for providing detailed information about the issue you're experiencing. It seems like the problem might be due to the asynchronous nature of the `execute` method in the `SQLQuery` class. This method sends a request to the Spark cluster and returns immediately, without waiting for the Spark job to complete. This can cause issues when running cells in Jupyter notebooks using Papermill, as the notebook might try to execute the next cell before the previous one has finished executing.
To resolve this issue, you can modify the `execute` method to wait for the Spark job to complete before returning. This can be done by polling the status of the Spark job in a loop until it's done.
Here's an example of how you can modify the `execute` method:
```python
import time

def execute(self, session):
    self._spark_events.emit_sql_execution_start_event(
        session.guid,
        session.kind,
        session.id,
        self.guid,
        self.samplemethod,
        self.maxrows,
        self.samplefraction,
    )
    command_guid = ""
    try:
        command = self.to_command(session.kind, session.sql_context_variable_name)
        command_guid = command.guid
        (success, records_text, mimetype) = command.execute(session)
        # Poll the status of the command until it's done
        while True:
            statement = session.http_client.get_statement(session.id, command_guid)
            if statement["state"] in ["available", "error"]:
                break
            time.sleep(1)  # Wait for a second before polling the status again
        if not success:
            raise BadUserDataException(records_text)
        result = records_to_dataframe(records_text, session.kind, self._coerce)
    except Exception as e:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            False,
            e.__class__.__name__,
            str(e),
        )
        raise
    else:
        self._spark_events.emit_sql_execution_end_event(
            session.guid,
            session.kind,
            session.id,
            self.guid,
            command_guid,
            True,
            "",
            "",
        )
        return result
```
This modification will make the `execute` method block until the Spark job is done, which should resolve the issue of cells in Jupyter notebooks not executing in the correct order when run with Papermill.
Please note that this is a simplified example and the actual implementation might need to be adjusted based on the specifics of your Spark cluster and the rest of your code. For example, you might want to add error handling for cases where the Spark job fails or takes too long to complete.
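One such adjustment would be to put a deadline on the open-ended `while True` polling loop so a stuck job fails instead of hanging forever. The sketch below is generic Python, not sparkmagic API; the `check` callable and the state names are assumptions modeled on Livy statement states:

```python
import time

def poll_until(check, timeout_s=600.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a terminal Livy-style state or the deadline passes.

    check() is assumed to return a state string such as "running",
    "available", "error", or "cancelled".  Returns the terminal state,
    or raises TimeoutError once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        state = check()
        if state in ("available", "error", "cancelled"):
            return state
        if clock() >= deadline:
            raise TimeoutError(f"statement still {state!r} after {timeout_s}s")
        sleep(interval_s)
```

The `clock` and `sleep` parameters are injected only to make the loop easy to test; production code would use the defaults.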
I hope this helps! Let me know if you have any other questions or need further clarification.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/tests/test_sqlquery.py
🤖
Hello,
Thank you for providing a detailed description of your issue. It seems like you're experiencing a problem with the asynchronous execution of the `%%sql` magic command in SparkMagic when running with Papermill.
The `%%sql` magic command in SparkMagic is designed to execute SQL queries against the Spark context. When you run this command, it first checks if a session has been started, parses the arguments provided in the command line, and then executes the SQL query using the `execute_sqlquery` method. This method creates an SQLQuery object with the provided parameters and runs it using the `run_sqlquery` method of the `spark_controller`. The SQLQuery object then converts itself to a Command object based on the session kind and the SQL context variable name. The command is a string that will be executed in the Spark session. The SQLQuery object then executes the Command object in the Spark session. If the command execution is successful, it converts the result to a dataframe and returns it. If the command execution fails, it raises an exception.
As for the asynchronous execution in SparkMagic, it is handled by sending commands to the Livy server and waiting for the result. The actual execution of the commands is done asynchronously on the Livy server, allowing the Jupyter notebook to continue with other tasks.
Now, regarding your issue with Papermill, it seems like Papermill is not waiting for the `wait_for_idle` call to complete. This could be due to the way Papermill handles cell execution. Papermill executes each cell in the notebook individually and waits for the cell to complete before moving on to the next one. If a cell takes too long to execute, Papermill might time out and stop the execution. This could be why the `%%sql` command is hanging when running with Papermill.
One possible solution could be to increase the cell execution timeout in Papermill. You can do this by passing the `--request-timeout` option when running Papermill. For example:

```shell
papermill pm_sparkmagic_test.ipynb output1.ipynb --kernel pysparkkernel --log-level DEBUG --request-timeout 600
```
This will set the cell execution timeout to 600 seconds (10 minutes). You can adjust this value as needed.
I hope this helps! If you have any other questions or need further clarification, please let me know.
Sources
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/livyclientlib/sendtosparkcommand.py
🤖
Hello,
The `%%sql` magic command in SparkMagic and direct PySpark code have different execution patterns. The `%%sql` magic command uses the Livy REST interface to submit the SQL query to the Spark cluster and retrieve the results. This process is asynchronous, meaning the command returns immediately after submitting the query, and the actual execution happens in the Spark cluster. This could be the reason why you're experiencing the hanging issue when using Papermill.
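To make the Livy interaction concrete, here is a minimal, hypothetical sketch of that submit-then-poll lifecycle against Livy's statements REST API (`POST /sessions/{id}/statements`, then `GET` on the statement until its state is terminal). This is not sparkmagic's actual client code; the `http` parameter is an injected client for illustration, and real code would typically use `requests` directly:

```python
import time

# Livy statement states that mean "done" (see the Livy REST API docs)
LIVY_TERMINAL_STATES = {"available", "error", "cancelled"}

def run_statement_sync(base_url, session_id, code, http=None, interval_s=1.0):
    """Submit a statement to Livy and block until it reaches a terminal state."""
    if http is None:
        # Assumption for illustration: fall back to the requests library
        import requests as http
    resp = http.post(f"{base_url}/sessions/{session_id}/statements",
                     json={"code": code})
    resp.raise_for_status()
    statement_id = resp.json()["id"]
    while True:
        resp = http.get(f"{base_url}/sessions/{session_id}/statements/{statement_id}")
        resp.raise_for_status()
        statement = resp.json()
        if statement["state"] in LIVY_TERMINAL_STATES:
            return statement
        time.sleep(interval_s)  # keep polling until Livy reports a terminal state
```

The asynchrony the bot describes lives entirely on the Livy side: the `POST` returns as soon as the statement is queued, and it is the caller's polling loop that decides whether the cell behaves synchronously or not.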
On the other hand, direct PySpark code gives you more control over the execution process. You can create a SparkSession, execute transformations and actions on RDDs/DataFrames, and manage resources manually.
To modify the `%%sql` magic command to follow the same execution pattern as the direct PySpark code, you would need to modify the SparkMagic codebase. Specifically, you would need to modify the `execute` method in the `SQLQuery` class in the `sparkmagic.livyclientlib.sqlquery` module. This method is responsible for executing the SQL query and retrieving the results.
However, modifying the SparkMagic codebase might not be the best solution to resolve the hanging issue. Instead, you could try the following:
- Increase the Livy session timeout in your SparkMagic configuration. This could prevent the Livy session from timing out while waiting for the SQL query to complete.
- Use direct PySpark code instead of the `%%sql` magic command. This gives you more control over the execution process and might help to avoid the hanging issue.
- Check the resources of your Spark cluster. The hanging issue could be due to insufficient resources in the Spark cluster, especially if you are running large SQL queries.
- Check the logs of your Spark cluster and the Livy server. They might contain useful information about why the SQL query is hanging.
I hope this helps! Let me know if you have any other questions.
Best,
Dosu
Sources
- sparkmagic/sparkmagic/tests/test_sqlquery.py
- sparkmagic/sparkmagic/kernels/kernelmagics.py
- sparkmagic/sparkmagic/magics/sparkmagicsbase.py
- sparkmagic/sparkmagic/livyclientlib/sqlquery.py
- sparkmagic/sparkmagic/magics/remotesparkmagics.py
- sparkmagic/sparkmagic/livyclientlib/sparkcontroller.py
Thanks Dosu for the quick reply.
I actually tried to troubleshoot by adding more logging in `command.py`. Ideally, each retry would add one line to the notebook cell output area. I noticed that sometimes the notebook cell does not have all the retry logs before the statement moves from the running to the available state. This might align with what you mentioned about the timing issue.
But when I changed `configurable_retry_policy_max_retries`, it did not help and I saw similar behavior. I do not fully understand how Papermill behaves in this case.
> but when you're running a notebook non-interactively with Papermill, it might not wait long enough for the result to come back from Livy before moving on to the next cell.

It did not move to the next cell, actually, and got stuck in the `%%sql` magic cell.
Awesome! Thanks a lot.
I will give it a try and see if it could fix the issue.
Hi Dosu,
In `sparkmagicsbase.py`:

```python
def execute_sqlquery(
.....
    # Add this line to wait for the Spark job to finish
    self.spark_controller.wait_for_idle()
```

Given `spark_controller` does not have `wait_for_idle`, I made a change to `command.py`. Between line 57 and 58, I added the following lines:

```python
session.wait_for_idle()
print('session.wait_for_idle...')
```

Also, I added a log for the session status:

```python
print(f'session status -> {status}')
```
I can observe the session was waiting for some time, then there was no further log. In the Papermill backend, I see more logs until it is idle again.
In the output notebook:
In the papermill log, the session reaches idle.
It seems papermill does not wait until `wait_for_idle` completes. Any suggestions?
Thanks Dosu.
I added the waiting code in `sqlquery.py`. The following is the output from JupyterLab. The new log ("waiting in sqlquery execute....") is shown in the cell output.
While executing with papermill, it does not show the log in the notebook cell output. It seems the cell output update stopped some time later, and the subsequent cell did not run either.
> It seems like the problem might be due to the asynchronous nature of the execute method in the SQLQuery class.

Can you share how the asynchronous mode works? It would be great if you could share the code that uses the asynchronous mode to call the Livy API. I am trying to understand the execution pattern difference of the `%%sql` magic.
The following piece works. I am wondering if we can use the same call pattern to run the `%%sql` logic.
```python
sqlContext = HiveContext(sqlContext)
dbs = sqlContext.sql("show databases")
dbs.show()
tables = sqlContext.sql("show tables")
tables.show()
```
Execution log:
From the log, we can see all the cell logic and sparkmagic logic are executed, but the cell output is not updated properly and the subsequent cell does not execute.
log3.log
`request-timeout` is not supported by papermill. `execution-timeout` is supported and I tried that, but it did not work.

```
--execution-timeout INTEGER  Time in seconds to wait for each cell before
                             failing execution (default: forever)
```
Can you share how the pyspark code is executed? Does that follow the async mode?
I see the following code can succeed stably. Can you suggest whether we can follow the same pattern for `%%sql`?
```python
sqlContext = HiveContext(sqlContext)
dbs = sqlContext.sql("show databases")
dbs.show()
tables = sqlContext.sql("show tables")
tables.show()
```
Hi @edwardps, glad Dosu (the bot) was able to help you narrow down the cause. I believe the bot is correct that the issue has something to do with the async execution of the cell magic.
I actually encountered this issue when implementing ipykernel >= 6 support and left a comment on the related papermill issue; however, the final implementation fixed the papermill issue for me 🤔
Maybe try changing your `ipykernel` version and see if that helps. I'd try both downgrading to v5 and upgrading to v7.
Thanks @devstein. I will give a try and get back soon.
Do you happen to know the call pattern difference between `%%sql` and running pyspark code?
BTW, Dosu works awesome!
@edwardps Sounds good, let me know.
> BTW, Dosu works awesome!
Thank you! Let me know if you have other open source projects that could benefit from it.
^I'm still teaching it when it should and shouldn't respond 😅
FYI - @devstein
```
ipykernel        5.5.6
ipython          7.33.0
ipython-genutils 0.2.0
ipywidgets       7.7.0
```

Downgraded ipykernel to 5.5.6 (ipykernel has no v7) and did not have any luck.
> Do you happen to know call pattern difference between %%sql and running pyspark code?

Is it possible to have `%%sql` follow the call pattern used by pyspark code execution? (as the pyspark code cell can work properly)
FYI - @devstein
Made some progress on the troubleshooting. I narrowed this down to the visualization.
If I comment out the following two lines, it works:
https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/kernels/wrapperkernel/sparkkernelbase.py#L118-L119

```python
if conf.use_auto_viz():
    self._register_auto_viz()
```
From my observation, the Plotly-based visualization will then be disabled.
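If auto-visualization really is the trigger, it may be possible to disable it without patching the kernel code: sparkmagic reads a `use_auto_viz` flag from its config file. The key name below is taken from the project's `example_config.json` and should be verified against your installed version:

```json
{
  "use_auto_viz": false
}
```

Placing this in `~/.sparkmagic/config.json` would make `conf.use_auto_viz()` return false, skipping `_register_auto_viz()` with the same effect as commenting out those two lines.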
Thanks @edwardps, that's helpful... that means it's failing to execute this code while the kernel is being created.
I ran into this issue when I tried to make the `_register_auto_viz()` function async. It's curious that it no longer works.
Is there any difference in behavior if the first cell isn't `%%sql`?
For example:

```
print("start the spark session")
```

```
%%sql
show tables
```
> Is there any difference in behavior if the first cell isn't %%sql?

There is not much difference. The `%%sql` cell still gets stuck.
Hi @edwardps, @devstein. I am facing this problem too; is there any update on it?