jupyter-incubator / sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
License: Other
See title; installation is a few steps, which should ideally all be subsumed by a single pip install.
When a user adds a new session, the user might find out that leaked/unused Livy sessions are taking up resources and might want to kill some of them.
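For reference, session listing and deletion go through Livy's REST API; here is a minimal sketch of what a cleanup helper could do, assuming an unauthenticated endpoint (the host name and session id are placeholders):

    import requests

    LIVY = "http://livy-server:8998"  # assumption: your Livy endpoint

    # List all sessions with their states so leaked ones can be spotted.
    for session in requests.get(LIVY + "/sessions").json()["sessions"]:
        print(session["id"], session["state"])

    # Kill a leaked/unused session by id (42 is just an example).
    requests.delete("{}/sessions/{}".format(LIVY, 42))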
This should display to the user:
Ran the following code:
    hvac = sc.textFile('wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv')

    from pyspark.sql import Row
    Doc = Row("TargetTemp", "ActualTemp", "System", "SystemAge", "BuildingID")

    def parseDocument(line):
        values = [str(x) for x in line.split(',')]
        return Doc(values[2], values[3], values[4], values[5], values[6])

    documents = hvac.filter(lambda s: "Date" not in s).map(parseDocument)
    df = sqlContext.createDataFrame(documents)
    df.registerTempTable('data')
and then
    %select * from data limit 100
The visualizations, at least for the pie graphs, are wrong (screenshot omitted).
Clearly there is no building where the desired target temperature is 1.
%hive SHOW TABLES crashes, explaining that it doesn't know how to convert the output into a dataframe (the output is an empty list). It definitely doesn't work when the list of tables is empty; it probably also doesn't work when that list is nonempty. This looks like a regression introduced when the improved output rendering change went in.
Right now, magics return the result from Livy without being aware of whether the string is a result or an error.
In the case of a result, magics should print the result back nicely.
In the case of an error, it should be clear to the user that an error just happened in the cluster. A stack trace should be printed if available.
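A minimal sketch of what that could look like, assuming the statement output follows the JSON shape described in Livy's REST docs (status, data, ename, evalue, traceback):

    def render_output(output):
        # `output` is the dict from a Livy statement's "output" field.
        if output.get("status") == "ok":
            # Plain results arrive under data -> text/plain.
            print(output["data"]["text/plain"])
        elif output.get("status") == "error":
            print("Error in the cluster: {}: {}".format(
                output.get("ename"), output.get("evalue")))
            for frame in output.get("traceback", []):
                print(frame)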
Give the user some kind of feedback so that the user knows the code is not frozen.
This prevents users from typing their credentials in clear text.
Users should not be allowed to manage sessions with text subcommands when they are running the notebook in the browser. It should be allowed for users in a terminal, though.
For Scala and R support, we may want to look at http://pygments.org/languages/
For SQL support, look at @alope107's solution in alope107/spark-sql-magic@34b4d86
The pandas df being constructed is not passed back to the user to play with.
So, the user does something like:

    %spark -c sql SELECT * FROM table

and even though the result is being constructed into a pandas df, the user cannot visualize it.
Instead, the user could do something like:

    %spark -c sql -v myDf SELECT * FROM table

and the result would be available in myDf.
We should discuss the syntax before implementing this. Pinging @ellisonbg
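A rough sketch of how the -v flag could work, assuming the magic is an IPython magic with access to the interactive shell (the two helper functions are hypothetical):

    from IPython.core.magic import Magics, line_magic, magics_class

    @magics_class
    class SparkMagics(Magics):
        @line_magic
        def spark(self, line):
            df = run_query_and_parse(line)      # hypothetical helper
            var_name = parse_output_var(line)   # hypothetical: reads the -v argument
            if var_name:
                # Push the dataframe into the user's namespace so that
                # `myDf` is usable in later cells.
                self.shell.user_ns[var_name] = df
            return df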
When the session goes into a final state, like error, the wait-for-state call should return immediately instead of continuing to poll.
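Something like the following, assuming Livy's terminal states are error, dead, and success (the session accessor is hypothetical):

    import time

    FINAL_STATES = {"error", "dead", "success"}

    def wait_for_state(session, desired, timeout=60, poll=1):
        # Return as soon as the desired state OR any final state is reached,
        # instead of polling until the timeout expires.
        elapsed = 0
        while elapsed < timeout:
            state = session.refresh_state()  # hypothetical accessor
            if state == desired or state in FINAL_STATES:
                return state
            time.sleep(poll)
            elapsed += poll
        raise RuntimeError("Timed out waiting for state '{}'".format(desired))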
We should create other repos for a Livy client, a configuration getter, a logger, and so on. Then we should put our Python files into folders that make actual sense.
In doing this, we might want to add docstring comments for public methods in every repo.
Consolidate magics and commands. Clean up UX.
It does not retry right now; see the TODO.
From livyclient.py:

    def execute_sql(self, command):
        return self.execute('sqlContext.sql("{}").collect()'.format(command))

    def execute_hive(self, command):
        return self.execute('hiveContext.sql("{}").collect()'.format(command))
If the SQL query the user passes in has double-quotes in it (double-quotes are a valid string delimiter in Spark SQL), then this is liable to cause an error. Moreover, if the query happens to have a stray "
in it (maybe as a result of user error), then this will cause a syntax error in the rest of the running code.
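One way to harden this, sketched under the assumption that the remote session is Python (json.dumps emits a double-quoted, escaped string literal that Python accepts; a Scala session would need its own escaping):

    import json

    def execute_sql(self, command):
        # json.dumps escapes embedded quotes, backslashes, and newlines,
        # so the user's query cannot break out of the string literal.
        return self.execute('sqlContext.sql({}).collect()'.format(json.dumps(command)))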
See title. I wonder if this change would increase the odds that someone can do a clean install of the wrapper kernels and have everything "just work" without having to mess with configurations at all.
I just ran into this when a HiveContext failed to create in Scala and I wasn't notified that any error happened. When I then tried to run %hive SHOW TABLES, an error was thrown that hiveContext wasn't defined.
This will be the API:
%spark add session_name language conn_string
%spark info
%spark config <configuration_overrides>
%spark info conn_string (returns session_id, language, state for each session)
%spark delete session_name
%spark delete conn_string session_id
%spark cleanup
%spark cleanup conn_string
This covers #56, #75, and #76 for magics in Python kernel.
We are not designing the API for the wrapper kernels here and we'll tackle that as a separate improvement.
ping @msftristew @ellisonbg to take a look when they can
A thought I just had: Currently we only return responses as dataframes if their query is a SQL query. Since we already have the code for parsing dataframes from JSON responses, we might provide an option that lets users say "I'm going to output a bunch of JSON here, please parse it and return it as a dataframe", even if their code isn't a SQL query. This could be useful and could allow users to get arbitrary semi-structured data back from the remote cluster. I'm not sure how feasible this is, or if this would be too error-prone, but I think it makes sense and could be really useful in some niche scenarios.
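A rough sketch of the client-side parsing (the helper name is hypothetical; assumes one JSON object per line of output):

    import json
    import pandas as pd

    def records_to_dataframe(text):
        # Parse newline-delimited JSON records into a pandas DataFrame.
        records = [json.loads(line) for line in text.splitlines() if line.strip()]
        return pd.DataFrame(records)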
A user should be able to specify how many rows to get back on a case-by-case basis.
Also, the user might sometimes want to do a take, sometimes a sample, and sometimes something else when getting results back, again on a case-by-case basis.
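For example, the code sent to Livy could be built from per-call options like these (the option names and helper are assumptions, not an implemented API):

    def build_fetch_code(query, method="take", rows=100, fraction=0.1):
        # Turn a SQL query plus per-call options into the code sent to Livy.
        df = 'sqlContext.sql("""{}""")'.format(query)
        if method == "take":
            return "{}.take({})".format(df, rows)
        if method == "sample":
            return "{}.sample(False, {}).collect()".format(df, fraction)
        return "{}.collect()".format(df)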
This is part of a greater architectural issue around configurations. There should be a configuration module which:
So that the user can delete individual sessions from a wrapper kernel. To be used with #75.
Is there a reason for this?
Currently, when you fail to specify credentials in the config file, the kernel crashes. The exception (that you need to provide credentials for Livy) is visible in Jupyter's logs, but it would be ideal if the kernel didn't crash and the error message were written to the screen instead.
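A minimal sketch of failing soft; the exception and helper names here are assumptions for illustration:

    try:
        credentials = read_livy_credentials(config_path)   # hypothetical helper
    except MissingCredentialsError as error:               # hypothetical exception
        # Surface the problem in the notebook instead of letting the kernel die.
        self._show_user_error(
            "Could not connect to Livy: {}. Please add your credentials "
            "to the config file.".format(error))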
Change the way that the magic is used so that it is run once to specify the livy connection and all subsequent cells are run against the remote cluster. This clears up the confusion between the local and remote namespaces without being as heavyweight of a solution as a new kernel.
@ellisonbg you might have some ideas.
DataFrameParseExceptions, so that error messages can be displayed to the user nicely.

I think the current version of ipywidgets is 4.1.1, so this is not a pressing issue (it's not deprecated yet), but for future-proofing we should consider moving the autovizwidget away from that model.
This includes running nunique on the string columns and then making them categorical values instead of just strings.
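Something along these lines; the uniqueness threshold is an assumption:

    import pandas as pd

    def coerce_string_categoricals(df, max_unique=25):
        # Convert low-cardinality string columns to categorical dtype so the
        # visualization layer can treat them as discrete values.
        for col in df.select_dtypes(include=["object"]).columns:
            if df[col].nunique() <= max_unique:
                df[col] = df[col].astype("category")
        return df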
Enable Travis support
Can all graph types handle ~2500 result rows?
Repro Steps:
For now, to delete it, you have to do so manually by SSHing into the cluster.
I was running the following Hive query when I discovered an error:

    %hive SELECT * FROM hivesampletable WHERE deviceplatform = 'Android'
With the following pandas df:

    import json
    import pandas as pd

    records_text = '{"clientid":"8","querytime":"18:54:20","market":"en-US","deviceplatform":"Android","devicemake":"Samsung","devicemodel":"SCH-i500","state":"California","country":"United States","querydwelltime":13.9204007,"sessionid":0,"sessionpagevieworder":0}\n{"clientid":"23","querytime":"19:19:44","market":"en-US","deviceplatform":"Android","devicemake":"HTC","devicemodel":"Incredible","state":"Pennsylvania","country":"United States","sessionid":0,"sessionpagevieworder":0}'
    json_array = "[{}]".format(",".join(records_text.split("\n")))
    d = json.loads(json_array)
    result = pd.DataFrame(d)
    result
the NaN for querydwelltime produces the following error:
    Javascript error adding output!
    TypeError: Cannot read property 'prop' of undefined
    See your browser Javascript console for more details.
The vegalite spec produced by Altair is:

    {'config': {'width': 600, 'gridOpacity': 0.08, 'gridColor': u'black', 'height': 400}, 'marktype': 'point', 'data': {'formatType': 'json', 'values': [{u'deviceplatform': u'Android', u'devicemodel': u'SCH-i500', u'country': u'United States', u'sessionpagevieworder': 0, u'state': u'California', u'clientid': u'8', u'sessionid': 0, u'querytime': u'18:54:20', u'devicemake': u'Samsung', u'market': u'en-US', u'querydwelltime': 13.9204007}, {u'deviceplatform': u'Android', u'devicemodel': u'Incredible', u'country': u'United States', u'sessionpagevieworder': 0, u'state': u'Pennsylvania', u'clientid': u'23', u'sessionid': 0, u'querytime': u'19:19:44', u'devicemake': u'HTC', u'market': u'en-US', u'querydwelltime': nan}]}}
For commentary and possible fixes, this issue is tracked on the lightning renderer side in:
lightning-viz/lightning-python#34
We might need to address this in Altair too.
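One possible client-side workaround (an assumption, not necessarily the fix the project will take) is to strip NaN before the spec is serialized, since JSON has no NaN literal and browsers choke on it:

    # Replace NaN with None so serialization emits null instead of NaN.
    clean = result.astype(object).where(result.notnull(), None)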
Look at TODO
When the user runs %spark? in the notebook, the docstring should include a small paragraph stating that the magic allows you to connect to a Livy endpoint by creating sessions that are tied to a particular language, and that every session can run code in that language plus Spark SQL.
Hi there,
I'm looking for a solution that works with Jupyter Notebook via Livy to use Spark. It seems that sparkmagic is a good fit for that. I wonder if Azure HDInsight has this service built in?
Best Regards,
Jerry
Steps to reproduce:
%sql SHOW TABLES
or %hive SHOW TABLES
when there are no tables (i.e. the result dataframe is empty).
The data viz widget pops up. Switch from "table" to any of the other chart styles.
You get an exception. This exception doesn't go away even if you switch back to the table graph type.
    ValueError: cannot label index with a null key
Make -e be -s
Sparkmagic currently supports only vanilla SQLContexts as first class interfaces. If a user wants to use an alternate context (like a HiveQLContext), they can do so through the pyspark or scala interfaces, but they must handle the context itself. It may be useful to allow the user to specify a type of SQLContext when using the SQL interface.
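For example, the SQL interface could map a flag to the context variable used in the generated code (the flag and mapping are assumptions for illustration):

    # Sketch: let the user pick which context the SQL interface targets,
    # e.g. %spark -c sql --context hive SELECT ...
    CONTEXT_NAMES = {"sql": "sqlContext", "hive": "hiveContext"}

    def build_sql_code(query, context="sql"):
        return '{}.sql("""{}""").collect()'.format(CONTEXT_NAMES[context], query)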