Comments (16)
Hi Jerry,
Yes, HDInsight will include sparkmagic and Livy in its Spark cluster, plugged in to work with Spark magics. The release will be out soon.
Is there any way we could talk to you about your requirements/scenario? I've contacted you on gitter: https://gitter.im
Thanks!
Alejandro
from sparkmagic.
Hi @aggFTW, good to hear that HDInsight has this built in. We are evaluating options for our data analytics needs and looking for a cloud that provides the big data infrastructure we need. Livy + Jupyter + Spark + YARN + Docker are the technologies we currently use. Unfortunately, our ETL is built with Amazon AWS in mind, so moving the data across cloud providers might be challenging and costly.
Do you know when Livy and sparkmagic will be released approximately? Thanks!
from sparkmagic.
@aggFTW I have the Spark magic running in my Jupyter notebook. It works GREAT! The only thing I would like to suggest is to allow data to be returned to the client in some way. For example:
%%spark
val NUM_SAMPLES = 100000000
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
  val x = Math.random()
  val y = Math.random()
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
4.0 * count / NUM_SAMPLES
The above returns an estimate of pi, e.g. 3.141728, as a Double in the Scala session, but currently there is no way to use the result directly in the notebook without writing it to a file system.
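For intuition, the same Monte Carlo estimate can be run locally in plain Python (a sketch for illustration, not sparkmagic code):

```python
import random

# Estimate pi by sampling random points in the unit square and
# counting how many fall inside the quarter circle of radius 1.
def estimate_pi(num_samples, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100_000))  # roughly 3.14
```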
from sparkmagic.
HDInsight will have Livy + Spark magic support in Jupyter in a few weeks.
As for getting the result back, what is the output of the cell above? If I'm not mistaken, the result should be displayed to you as a string.
Supposing that you got the result back as a double, what are you planning to do with it then? I want to understand the scenario well so that we can make the right choices.
I'm glad you found the magics useful so far :)
from sparkmagic.
@aggFTW
Currently the result comes back to me as a string in Out[3], where 3 is output number 3 in the IPython notebook. You can access the Out[3] variable directly in the notebook, but right now it returns a string:
'res4: Double = 3.141728'
It would be great if the 3.141728 value were returned as a Double in Out[3]. More generally, it could return an array of tuples that can be used directly in the notebook.
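Until such a feature exists, the string can be parsed manually in the local Python kernel. A small sketch; the regex assumes Scala REPL output of the form 'resN: Type = value':

```python
import re

# Parse a Scala REPL result line like 'res4: Double = 3.141728'
# into a Python float. Returns None if the line does not match.
def parse_scala_double(repl_output):
    match = re.match(r"res\d+:\s*Double\s*=\s*([-+0-9.eE]+)", repl_output)
    return float(match.group(1)) if match else None

pi_estimate = parse_scala_double('res4: Double = 3.141728')
print(pi_estimate)  # 3.141728
```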
from sparkmagic.
I see. What kind of operation are you going to do with Out[3] though? What do you want to do with it?
Is there a reason why you couldn't do that operation remotely on the cluster, just like you did with count and NUM_SAMPLES in your example?
from sparkmagic.
One possible use case is that the result of the computation can be an input to a third-party library that the cluster might not have installed.
from sparkmagic.
Thanks, that makes sense. I think that for now, it might be easier to install that library in the cluster and do the computation there, since creating a feature like the one being suggested would require a Scala parser, as well as having the Spark libraries installed.
The only feature that is in the roadmap is to be able to return the result of SparkSQL queries as pandas dataframes back to the user. That's detailed in #54.
from sparkmagic.
I think we should also make it possible to get data from the Python, R, or Scala code as well.
Also, it is very likely users will want to load data into Spark from the local Python kernel as well.
Does Livy support these things?
Brian E. Granger
Associate Professor of Physics and Data Science
Cal Poly State University, San Luis Obispo
@ellisonbg on Twitter and GitHub
[email protected] and [email protected]
from sparkmagic.
You send and receive text to/from Livy.
If we wanted that kind of interfacing across languages, we would need language parsers and the libraries installed in the Jupyter environment.
Some scenarios to think through:
- User wants to receive the Scala string 'res4: Double = 3.141728' and have it converted to a Python double.
- User wants to receive a Spark object from the cluster back in the IPython kernel and does not have the Spark libraries installed.
- User wants to receive a Spark object from the cluster back in the IPython kernel and has a mismatched version of the Spark libraries installed.
- User wants to receive a {Scala/R/etc} Spark object from the cluster back in the IPython kernel and has a mismatched version of the Spark libraries installed.
- User wants to receive a CustomLibraryObject from a Python library installed in the cluster back in the IPython kernel.
- User wants to receive a CustomLibraryObject from a {Scala/R/etc} library installed in the cluster back in the IPython kernel.
- User wants to send a CustomLibraryObject from a Python library installed in the Jupyter environment to the Spark context, where the library is not available.
We should discuss how to do this and whether that's the paradigm we want to support or not.
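One language-neutral way to sidestep several of these scenarios is to serialize results on the cluster to a plain-text format such as JSON and deserialize them in the kernel. A sketch of the local half, assuming (hypothetically) that the cluster-side code printed its result as a JSON string:

```python
import json

# Pretend this string came back from the cluster as cell output
# (e.g. a Scala or Python job that printed its result as JSON).
cluster_output = '[{"customer_id": 1, "age": 34}, {"customer_id": 2, "age": 29}]'

# Deserializing in the local kernel needs no Spark libraries at all:
rows = json.loads(cluster_output)
ages = [row["age"] for row in rows]
print(ages)  # [34, 29]
```

This avoids the version-mismatch and missing-library problems above, at the cost of only supporting data that survives a round trip through JSON.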
from sparkmagic.
Just wondering: how can someone visualize the data in the notebook after it has been processed by Spark? It will be difficult to ship images back from the Livy server, and it would also be quite difficult to use interactively, wouldn't it?
Hue is like a Jupyter notebook for Livy. They seem to be able to interpret the data coming back from Livy; unfortunately, I don't know the details yet.
from sparkmagic.
Data visualization is something we are currently working on. You'll be able to do %sql queries and get a nice rich visualization back. You will also be able to get pandas dataframes back from your %sql queries to do your own visualizations.
As for how Livy interprets results, that's by using the Livy repl, which is running next to Livy in the cluster and passes objects around. That is not the same architecture as the remote execution scenario intended here. We are actively talking to Cloudera to see how we can achieve some of the nice properties of running in the cluster.
from sparkmagic.
@aggFTW, being able to execute SQL and return a pandas dataframe is definitely a great feature! The only issue is that pandas is limited to Python. That said, I do most of my visualization in Python instead of Scala. :)
Have you also considered returning a pandas dataframe for a Spark DataFrame? That would be awesome actually.
For example:
%%spark
sqlContext.read.parquet("some_files").select("customer_id", "age").toPandas()
from sparkmagic.
That is not something that would be handled by this library. It's a feature request for Spark that would require Scala to talk to Python. To be honest, I doubt you'll get support for that in Scala; PySpark already has support for it.
However, you will indeed get a pandas dataframe back by using this library, using the Scala Spark context like this:
%%spark -c sql -s myscalasession
SELECT customer_id, age FROM previously_saved_table_from_rdd
from sparkmagic.
So, I agree that this problem is infeasible in practice for arbitrary output. Making output data available to kernels would require specialized parsers written for each language to capture all of the possible kinds of output, which might be ambiguous or insufficiently detailed.
The magics always print the output strings to stdout right now. One workaround would be to provide an option (-r?) to return the string instead of printing it. Then the user could capture the string, parse it themselves, and then manipulate the result however they want. This could allow you to work around the language barrier if, e.g., you're working on a Scala cluster but you need to transform some result data into a Python object and do some further processing on it locally.
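The -r flag itself is only a proposal, but the flow it would enable can be simulated in plain Python today: capture what would otherwise be printed, then parse it locally (fake_magic below is a stand-in, not a real sparkmagic function):

```python
import io
import re
from contextlib import redirect_stdout

# Simulate a magic that prints its result to stdout, as the
# sparkmagic magics currently do.
def fake_magic():
    print('res4: Double = 3.141728')

# Capture the printed string instead of letting it hit the screen...
buffer = io.StringIO()
with redirect_stdout(buffer):
    fake_magic()
captured = buffer.getvalue().strip()

# ...then parse it into a native Python value for further processing.
value = float(re.search(r"=\s*([-+0-9.eE]+)", captured).group(1))
print(value)  # 3.141728
```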
from sparkmagic.
Currently, the PySpark DataFrame API does have a function to return a pandas dataframe.
from sparkmagic.