spark-livy-on-airflow-workspace

A workspace to experiment with Apache Spark, Livy, and Airflow in a containerized Docker environment.

Prerequisites

We'll be using the following tools and versions:

| Tool | Version | Notes |
| --- | --- | --- |
| Java 8 SDK | openjdk version "1.8.0_282" | Installed locally in order to run the Spark driver (see brew installation options). |
| Scala | scala-sdk-2.11.12 | Running on JetBrains IntelliJ. |
| Python | 3.6 | Installed using conda 4.9.2. |
| PySpark | 2.4.6 | Installed inside the conda virtual environment. |
| Livy | 0.7.0-incubating | See release history here. |
| Airflow | 1.10.14 | For compatibility with apache-airflow-backport-providers-apache-livy. |
| Docker | 20.10.5 | See Mac installation instructions. |
| Bitnami Spark Docker images | docker.io/bitnami/spark:2 | I use the Spark 2.4.6 images for compatibility with Apache Livy, which supports Spark 2.2.x to 2.4.x. |

Environment

The only Python packages you'll need are pyspark and requests.

You can use the provided requirements.txt file to create a virtual environment using the tool of your choice.

For example, with conda, you can create a virtual environment with:

conda create --no-default-packages -n spark-livy-on-airflow-workspace python=3.6

Then activate using:

conda activate spark-livy-on-airflow-workspace

Then install the requirements with:

pip install -r requirements.txt

Alternatively, you can use the provided environment.yml file and run:

conda env create -n spark-livy-on-airflow-workspace -f environment.yml

Contents

1. Running PySpark jobs on Bitnami Docker images

This first part will focus on executing a simple PySpark job on a Dockerized Spark cluster based on the default docker-compose.yaml provided in the bitnami-docker-spark repo. Below is a rough outline of what this setup will look like.

[Diagram: img.png — overview of the Dockerized Spark setup]

To get started, we'll need to set:

  • SPARK_MASTER_URL - The endpoint that our Spark worker and driver will use to connect to the master. Assuming you're using the docker-compose.yaml in this repo, this variable should be set to spark://spark-master:7077, where spark-master is the name of the Docker container and 7077 is the default port for the Spark master.

  • SPARK_DRIVER_HOST - The IP address of the machine running your Spark driver (in our case, your personal workstation). See this discussion and this sample configuration for more details.

Both of these settings will be passed to the Spark configuration object:

    import os
    from pyspark import SparkConf

    conf = SparkConf()
    conf.setAll(
        [
            ("spark.master", os.environ.get("SPARK_MASTER_URL", "spark://spark-master:7077")),
            # spark.driver.host must be an address the workers can reach, so the
            # fallback is "localhost" rather than a master URL like "local[*]".
            ("spark.driver.host", os.environ.get("SPARK_DRIVER_HOST", "localhost")),
            ("spark.submit.deployMode", "client"),
            ("spark.driver.bindAddress", "0.0.0.0"),
        ]
    )
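With the configuration object in place, building a session and running a quick sanity check might look like this (a minimal sketch; the job here is a placeholder, not the repo's actual script):

    from pyspark.sql import SparkSession

    # Create a session from the configuration above and run a trivial job to
    # confirm the driver can reach the Dockerized cluster.
    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    print(spark.sparkContext.parallelize(range(100)).sum())
    spark.stop()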

To spin up the containers, run:

docker-compose up -d

2. Running Scala Spark jobs on Bitnami Docker images

Same as the above, only in Scala:

    import scala.util.Properties
    import org.apache.spark.SparkConf

    val conf = new SparkConf()

    conf.set("spark.master", Properties.envOrElse("SPARK_MASTER_URL", "spark://spark-master:7077"))
    // As in the PySpark version, the fallback driver host is a reachable address.
    conf.set("spark.driver.host", Properties.envOrElse("SPARK_DRIVER_HOST", "localhost"))
    conf.set("spark.submit.deployMode", "client")
    conf.set("spark.driver.bindAddress", "0.0.0.0")

3. Extending Bitnami Spark Docker images for Apache Livy

From the Apache Livy website:

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It enables easy submission of Spark jobs or snippets of Spark code, synchronous or asynchronous result retrieval, as well as Spark Context management, all via a simple REST interface or an RPC client library.

We'll now extend the Bitnami Docker image to include Apache Livy in our container architecture, as shown below. See this repo's Dockerfile for implementation details.

[Diagram: img.png — the Spark cluster extended with a Livy container]

After instantiating the containers with docker-compose up -d, you can test your connection by creating a session:

curl -X POST -d '{"kind": "pyspark"}' \
  -H "Content-Type: application/json" \
  localhost:8998/sessions/

And executing a sample statement (the 0 in the URL below is the id of the session created above):

curl -X POST -d '{
    "kind": "pyspark",
    "code": "for i in range(1,10): print(i)"
  }' \
  -H "Content-Type: application/json" \
  localhost:8998/sessions/0/statements
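Statements execute asynchronously, so the POST above returns before the code has actually run. You can poll for the result with the requests package, for example (a sketch, assuming the session and statement ids from the calls above):

    import time

    import requests

    # Poll the statement created above (session 0, statement 0) until Livy
    # reports its output as available.
    url = "http://localhost:8998/sessions/0/statements/0"
    statement = requests.get(url).json()
    while statement["state"] != "available":
        time.sleep(1)
        statement = requests.get(url).json()
    print(statement["output"])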

4. Exploring Apache Livy sessions with PySpark

See Livy session API examples.
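As a preview of what those examples cover, here's a minimal sketch of the full session lifecycle using the requests package (the endpoints follow Livy's REST API; the submitted snippet is a placeholder):

    import time

    import requests

    LIVY_URL = "http://localhost:8998"

    # Create a PySpark session and wait for it to leave the "starting" state.
    session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
    session_url = f"{LIVY_URL}/sessions/{session['id']}"
    while requests.get(session_url).json()["state"] != "idle":
        time.sleep(1)

    # Submit a statement, then tear the session down.
    requests.post(f"{session_url}/statements", json={"code": "print(1 + 1)"})
    requests.delete(session_url)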

5. Exploring Apache Livy batches with Scala

See Livy batch API documentation.
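Although this part uses Scala, the batch endpoint itself is language-agnostic. A minimal sketch of submitting a batch over REST with requests (the jar path and class name are placeholders, not artifacts from this repo):

    import requests

    # POST /batches submits a self-contained application, which Livy runs
    # via spark-submit on the cluster.
    payload = {
        "file": "/path/to/spark-job.jar",     # placeholder: must be visible to the Livy server
        "className": "com.example.SparkJob",  # placeholder: your application's main class
    }
    response = requests.post("http://localhost:8998/batches", json=payload)
    print(response.json())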

6. Running Spark jobs with LivyOperator in Apache Airflow

#TODO
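Pending the full writeup, here's a rough sketch of what a LivyOperator task might look like, assuming the apache-airflow-backport-providers-apache-livy package is installed and a livy_default connection points at the Livy container:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.apache.livy.operators.livy import LivyOperator

    # Hypothetical DAG; the file path is a placeholder, not a script from this repo.
    with DAG(
        dag_id="livy_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        submit_job = LivyOperator(
            task_id="submit_pyspark_job",
            file="/path/to/job.py",       # must be reachable by the Livy server
            livy_conn_id="livy_default",
            polling_interval=30,          # check the batch state every 30 seconds
        )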

7. Running Spark jobs with SparkSubmitOperator in Apache Airflow

#TODO
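Likewise, a rough sketch with Airflow 1.10's built-in SparkSubmitOperator, assuming a spark_default connection pointing at spark://spark-master:7077 (the application path is a placeholder):

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator

    # Hypothetical DAG; requires spark-submit to be available on the Airflow worker.
    with DAG(
        dag_id="spark_submit_example",
        start_date=datetime(2021, 1, 1),
        schedule_interval=None,
    ) as dag:
        submit_job = SparkSubmitOperator(
            task_id="submit_pyspark_job",
            application="/path/to/job.py",  # placeholder path
            conn_id="spark_default",
        )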
