sparknotebook

This project contains sample ipython notebooks that run Spark. One notebook, 2-Text-Analytics.ipynb, is written in Python. The second, Scala-2-Text-Analytics.ipynb, is written in Scala. The dataset and the most excellent 2-Text-Analytics.ipynb are originally from https://github.com/xsankar/cloaked-ironman.

Just open each notebook to see how Spark is instantiated and used.

Notebook

Python Set-up

To run the Python notebook, you will need to:

  1. Install ipython and ipython notebook. For simplicity, I am just using the free Anaconda Python distribution from Continuum Analytics.
  2. Download and install a Spark distribution. The download includes the pyspark script that you need to launch Python with Spark.

For best results, cd into this project's root directory before starting ipython. The command to start the ipython notebook is:

IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark

NOTE: Sometimes when running Spark on Java 7 you may get a java.net.UnknownHostException. I have not yet seen this on Java 8. If this happens to you, you can resolve it by setting the SPARK_LOCAL_IP environment variable to 127.0.0.1 before launching Spark. For example:

SPARK_LOCAL_IP=127.0.0.1 IPYTHON=1 IPYTHON_OPTS="notebook --pylab inline" pyspark
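One way to check in advance whether this workaround is likely to be needed is to test whether the machine's own hostname resolves, since an unresolvable local hostname is a common cause of java.net.UnknownHostException. The following is a minimal Python sketch that assumes this is the cause in your case. The helper name local_hostname_resolves is hypothetical and is not part of Spark or ipython:

```python
import socket


def local_hostname_resolves(hostname=None):
    """Return True if the given (or the local) hostname resolves to an IP address."""
    name = hostname if hostname is not None else socket.gethostname()
    try:
        socket.gethostbyname(name)
    except OSError:
        # socket.gaierror (failed lookup) is a subclass of OSError
        return False
    return True


if __name__ == "__main__":
    if local_hostname_resolves():
        print("Local hostname resolves; SPARK_LOCAL_IP is probably not needed.")
    else:
        print("Local hostname does not resolve; try SPARK_LOCAL_IP=127.0.0.1.")
```

If the check reports that the hostname does not resolve, launching with SPARK_LOCAL_IP=127.0.0.1 as shown above is the simplest fix.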

Scala Set-up

To run the Scala notebook, you will need to:

  1. Create a Scala profile for ipython
    ipython profile create scala
    
    The output from this command will tell you the location of the ipython_config.py file. You will need to edit that file soon.
  2. Download IScala.jar. You will need to stash it somewhere. I put it in ~/.ipython/profile_scala/lib.
  3. Edit your ipython_config.py to tell ipython about IScala:
    c = get_config()
    
    c.KernelManager.kernel_cmd = ["java", "-jar",
                              "/Users/yournamehere/.ipython/profile_scala/lib/IScala.jar",
                              "--profile",
                              "{connection_file}",
                              "--parent"]

At this point you can start up IScala:

ipython notebook --profile scala
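The kernel command from step 3 hard-codes an absolute path to the jar. As a sketch, the same list could be built from the home directory instead, assuming the jar was saved to ~/.ipython/profile_scala/lib as in step 2. The helper iscala_kernel_cmd is hypothetical and is not part of ipython or IScala:

```python
import os


def iscala_kernel_cmd(jar_path="~/.ipython/profile_scala/lib/IScala.jar"):
    """Build the kernel command list from step 3, expanding ~ in the jar path."""
    return [
        "java", "-jar",
        os.path.expanduser(jar_path),
        "--profile",
        "{connection_file}",  # kept verbatim, exactly as in the config from step 3
        "--parent",
    ]


# In ipython_config.py this could be assigned as:
#   c.KernelManager.kernel_cmd = iscala_kernel_cmd()
print(iscala_kernel_cmd())
```

This avoids editing the config when the profile is copied to another account, at the cost of assuming the jar always lives under the home directory.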

If your notebook crashes with OutOfMemoryErrors, you can increase the memory available to the JVM with the -Xmx flag (e.g. -Xmx2g and -Xmx2048m both set the maximum heap size to 2 GB):

SBT_OPTS=-Xmx2048m ipython notebook --profile scala
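To see why -Xmx2g and -Xmx2048m are equivalent, the size suffixes can be converted to bytes. This is a small Python sketch for illustration only. The helper xmx_to_bytes is hypothetical, and it handles only the k, m, and g suffixes:

```python
import re

# Binary multipliers for the size suffixes handled by this sketch
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def xmx_to_bytes(flag):
    """Convert a JVM flag such as '-Xmx2g' into a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", flag)
    if match is None:
        raise ValueError("not a recognised -Xmx flag: %r" % flag)
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


# Both flags come out to 2 * 1024**3 bytes
print(xmx_to_bytes("-Xmx2g") == xmx_to_bytes("-Xmx2048m"))  # True
```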

As with the Python example, if you get a java.net.UnknownHostException when starting ipython, use the following command:

SPARK_LOCAL_IP=127.0.0.1 SBT_OPTS=-Xmx2048m ipython notebook --profile scala
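The environment variables used in the commands above can also be assembled programmatically, for example when launching the notebook from a script. This is a sketch using only the standard library. The helper scala_notebook_env is hypothetical, and the actual launch is left commented out:

```python
import os


def scala_notebook_env(heap="2048m", local_ip=None, base=None):
    """Return a copy of the environment with SBT_OPTS (and optionally SPARK_LOCAL_IP) set."""
    env = dict(os.environ if base is None else base)
    env["SBT_OPTS"] = "-Xmx" + heap
    if local_ip is not None:
        env["SPARK_LOCAL_IP"] = local_ip
    return env


# Start from an empty base here so that only the added variables are printed
env = scala_notebook_env(local_ip="127.0.0.1", base={})
print(env)  # {'SBT_OPTS': '-Xmx2048m', 'SPARK_LOCAL_IP': '127.0.0.1'}

# To actually launch the notebook (not run here):
#   import subprocess
#   subprocess.call(["ipython", "notebook", "--profile", "scala"],
#                   env=scala_notebook_env(local_ip="127.0.0.1"))
```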

NOTE: For the Scala notebook, you do not need to download and install Spark. The Spark dependencies are managed via sbt, which runs under the hood in the Spark notebook.

Plotting

As of late October 2014, IScala has added plotting and rich text output. If you build IScala from source, you can have these in your notebooks. There's a Display.ipynb in the IScala project that demonstrates this. A note on building IScala from source for Spark using sbt:

  • IScala cross builds against Scala 2.10 and 2.11. Spark is currently on Scala 2.10. To build the correct IScala jar run: sbt + assembly. The correct IScala jar will then be in IScala/target/scala-2.10/lib/IScala.jar.

Contributors

hohonuuli
