Comments (16)

aggFTW commented on June 5, 2024

Hi Jerry,

Yes, HDInsight will include sparkmagic and Livy in its Spark clusters, plugged in so that the Spark magics work out of the box. The release will be out soon.

Is there any way we could talk to you about your requirements/scenario? I've contacted you on gitter: https://gitter.im

Thanks!
Alejandro


jlamcanopy commented on June 5, 2024

Hi @aggFTW, good to hear that HDInsight has this built in. We are evaluating options for our data analytics needs and are looking for a cloud provider that offers the big data infrastructure we need. Livy + Jupyter + Spark + YARN + Docker are the technologies we currently use. Unfortunately, our ETL is built with Amazon AWS in mind, so moving the data to another cloud provider might be challenging and costly.

Do you know when Livy and sparkmagic will be released approximately? Thanks!


jlamcanopy commented on June 5, 2024

@aggFTW I have the Spark magic running in my Jupyter notebook. It works GREAT! The only thing I would like to suggest is allowing data to be returned to the client in some way. For example:

%%spark

val NUM_SAMPLES = 100000000
val count = sc.parallelize(1 to NUM_SAMPLES).map { i =>
val x = Math.random();
val y = Math.random();
if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _);

4.0 * count / NUM_SAMPLES

The above should return roughly 3.141728 (an estimate of pi) as a double in the Scala notebook, but currently there is no way to use the result directly in the notebook without writing it to a file system.


aggFTW commented on June 5, 2024

HDInsight will have Livy + Spark magic support in Jupyter in a few weeks.

As for getting the result back, what is the output of the cell above? If I'm not mistaken, the result should be displayed to you as a string.

Supposing that you got the result back as a double, what are you planning to do with it then? I want to understand the scenario well so that we can make the right choices.

I'm glad you found the magics useful so far :)


jlamcanopy commented on June 5, 2024

@aggFTW
Currently the result comes back to me as a string in Out[3], where 3 is the output number in the IPython notebook. You can access the Out[3] variable directly in the notebook, but right now it holds a string:

'res4: Double = 3.141728'

It would be great if the 3.141728 value were returned as a double in Out[3]. More generally, it would be great if it could return an array of tuples that can be used directly in the notebook.
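
For now, the only workaround I can see is to parse that string myself on the client side. A rough sketch (assuming Out[3] holds exactly the string above; the regex and variable names are just illustrative):

import re

raw = Out[3]                               # e.g. "res4: Double = 3.141728"
# Pull the numeric literal out of the Scala REPL echo and convert it locally.
match = re.search(r"=\s*([-+0-9.eE]+)", raw)
pi_estimate = float(match.group(1)) if match else None
print(pi_estimate)                         # 3.141728 as a Python float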


aggFTW commented on June 5, 2024

I see. What kind of operation do you want to do with Out[3], though?
Is there a reason why you couldn't do that operation remotely on the cluster, just like you did with count and NUM_SAMPLES in your example?


jlamcanopy commented on June 5, 2024

One possible use case is that the result of the computation could be an input to a third-party library that is not installed on the cluster.


aggFTW commented on June 5, 2024

Thanks, that makes sense. I think that for now, it might be easier to install that library in the cluster and do the computation there, since creating a feature like the one being suggested would require a Scala parser, as well as having the Spark libraries installed on the Jupyter side.

The only feature on the roadmap is the ability to return the results of Spark SQL queries back to the user as pandas dataframes. That's detailed in #54.


ellisonbg commented on June 5, 2024

I think we should make it possible to get data back from the Python, R, or Scala code as well.

Also, it is very likely users will want to load data into Spark from the local Python kernel.

Does Livy support these things?


aggFTW commented on June 5, 2024

You send and receive text to/from Livy.

If we wanted that kind of interfacing across languages, we would need language parsers and the libraries installed in the Jupyter environment.

Some scenarios to think through:

  • User wants to receive the Scala string 'res4: Double = 3.141728' and have it converted to a Python double.
  • User wants to receive a Spark object from the cluster back in the IPython kernel and does not have the Spark libraries installed.
  • User wants to receive a Spark object from the cluster back in the IPython kernel and has a mismatched version of the Spark libraries installed.
  • User wants to receive a {Scala/R/etc} Spark object from the cluster back in the IPython kernel and has a mismatched version of the Spark libraries installed.
  • User wants to receive a CustomLibraryObject object from a Python library installed in the cluster back in the IPython kernel.
  • User wants to receive a CustomLibraryObject object from a {Scala/R/etc} library installed in the cluster back in the IPython kernel.
  • User wants to send a CustomLibraryObject object from a Python library installed in the Jupyter environment to the Spark context where the library is not available.

We should discuss how to do this and whether that's the paradigm we want to support or not.


jerrylam commented on June 5, 2024

Just wondering, how can someone visualize the data in the notebook after it has been processed by Spark? It would be difficult to ship an image back from the Livy server, and it would also be quite difficult to use interactively, wouldn't it?

Hue is, in effect, the Jupyter notebook for Livy. They seem to be able to interpret the data coming back from Livy; unfortunately, I don't know the details yet.


aggFTW commented on June 5, 2024

Data visualization is something we are currently working on. You'll be able to do %sql queries and get a nice rich visualization back. You will also be able to get pandas dataframes back from your %sql queries to do your own visualizations.
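
To give a rough idea of the shape of that feature (the -o flag and the table/column names below are placeholders, not a final design):

%%sql -o df
SELECT customer_id, age FROM some_table

# In a later cell, df would be an ordinary pandas dataframe in the local
# kernel, so you could visualize it however you like, for example:
df.plot(kind="hist", y="age")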

As for how the results coming back from Livy get interpreted, that's done through the Livy REPL, which runs next to Livy in the cluster and passes objects around. That is not the same architecture as the remote execution scenario intended here. We are actively talking to Cloudera to see how we can achieve some of the nice properties of running in the cluster.


jlamcanopy commented on June 5, 2024

@aggFTW, being able to execute SQL and get a pandas dataframe back is definitely a great feature! The only issue is that pandas is limited to Python. That said, I do most of my visualization in Python rather than Scala. :)
Have you also considered returning a pandas dataframe for a Spark DataFrame? That would actually be awesome.
For example:

%%spark
sqlContext.read.parquet("some_files").select("customer_id", "age").toPandas()


aggFTW commented on June 5, 2024

That is not something that would be handled by this library. It's a feature request for Spark itself that would require Scala to talk to Python. To be honest, I doubt you'll get support for that in Scala; PySpark already has support for it.

However, you will indeed be able to get a pandas dataframe back from this library by using the Scala Spark context like this:

%%spark -c sql -s myscalasession
SELECT customer_id, age FROM previously_saved_table_from_rdd


msftristew commented on June 5, 2024

So, I agree that this is infeasible in practice for arbitrary output. Making output data available to kernels would require specialized parsers written for each language to capture all of the possible kinds of output, which might be ambiguous or insufficiently detailed.

The magics always print the output strings to stdout right now. One workaround would be to provide an option (-r?) to return the string instead of printing it. Then the user could capture the string, parse it themselves, and then manipulate the result however they want. This could allow you to work around the language barrier if e.g. you're working on a Scala cluster but you need to transform some result data into a Python object and do some further processing on it locally.
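
To make that concrete, usage might look roughly like this (the -r flag is only a proposal here, and the parsing is entirely up to the user):

%%spark -r
4.0 * count / NUM_SAMPLES

# In the next cell, the last output (_) would be the raw result string
# instead of printed text, and the user converts it however they need:
result_str = _                             # e.g. "res4: Double = 3.141728"
pi_estimate = float(result_str.split("=")[-1])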


jlamcanopy commented on June 5, 2024

Currently, the PySpark DataFrame API does have a function to return a pandas dataframe.
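
That is, in a local PySpark session (not through the remote magics) something like this already works; the path and column names are just the ones from my example above:

from pyspark.sql import SQLContext

# sc is the SparkContext provided by the local PySpark shell.
sqlContext = SQLContext(sc)
pdf = sqlContext.read.parquet("some_files").select("customer_id", "age").toPandas()
pdf.head()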

