docker-spark's Introduction

Docker-Spark-Tutorial

This repo is intended as a walkthrough of how to set up a Spark cluster running inside Docker containers.

I assume some familiarity with Spark and Docker and their basic commands such as build and run. Everything else will be explained in this file.

We will build up the complexity sequentially, arriving eventually at a full architecture of a Spark cluster running inside Docker containers, so as to build understanding along the way.


Tutorial

A full walk-through of how to create a Spark cluster running on separate machines inside Docker containers is contained within TUTORIAL.md. The file builds up the complexity in a sequential fashion to help the reader's understanding. It starts by demonstrating simple Docker container networking, moves on to setting up a Spark cluster on a local machine, and finally combines the two, both locally and in a distributed fashion.

Prerequisites

I assume knowledge of basic Docker commands such as run, build, etc.

You will need to set up multiple machines with a cloud provider such as AWS or Azure.


Apache Spark

The APACHESPARKTUNING.md file explains the main terms involved in a Spark cluster, such as worker node, master node, executor, task, and job. The second section of the file describes some rough rules for setting your cluster's parameters to get optimal performance, with some demonstrations.


Examples

The examples/ directory contains some example Python scripts and Jupyter notebooks demonstrating various aspects of Spark.

Getting Started

To get started, pull the following three Docker images:

docker pull sdesilva26/spark_master:latest
docker pull sdesilva26/spark_worker:latest
docker pull sdesilva26/spark_submit:latest

Create a docker swarm using

docker swarm init

then attach the other machines you wish to include in the cluster by running, on each of them, the join command printed by the command above.
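The printed join command has roughly the following form (the token and IP here are placeholders, not real values):

docker swarm join --token SWMTKN-1-<token> <manager-ip>:2377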

Create an overlay network by running the following on one of the machines

docker network create -d overlay --attachable spark-net
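You can verify that the network was created with the standard Docker command:

docker network ls --filter name=spark-net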

On the machine you wish to be the master node of the Spark cluster, run

docker run -it --name spark-master --network spark-net -p 8080:8080 sdesilva26/spark_master:latest
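If the container starts correctly, the Spark master's web UI should be reachable on port 8080 of that machine; as a quick sanity check from the host (assuming the default port mapping above):

curl http://localhost:8080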

On the machines you wish to be workers, run

docker run -it --name spark-worker1 --network spark-net -p 8081:8081 -e MEMORY=6G -e CORES=3 sdesilva26/spark_worker:latest

substituting the values of CORES and MEMORY with the machine's number of cores minus 1 and the machine's RAM minus 1GB, respectively.
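For example, on a hypothetical worker machine with 8 cores and 16GB of RAM, you would run:

docker run -it --name spark-worker1 --network spark-net -p 8081:8081 -e MEMORY=15G -e CORES=7 sdesilva26/spark_worker:latest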

Start a driver node by running

docker run -it --name spark-submit --network spark-net -p 4040:4040 sdesilva26/spark_submit:latest bash

You can now either submit files to Spark using

$SPARK_HOME/bin/spark-submit [flags] file 
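For example, assuming the image contains a standard Spark distribution (which bundles Spark's example scripts), you could run the Pi estimator against the cluster:

$SPARK_HOME/bin/spark-submit --master spark://spark-master:7077 $SPARK_HOME/examples/src/main/python/pi.py 10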

or run a pyspark shell

$SPARK_HOME/bin/pyspark

NOTE: by default, the sdesilva26/spark_submit and sdesilva26/spark_worker images will try to connect to the cluster manager at spark://spark-master:7077. If you change the name of the container running the sdesilva26/spark_master image, you must pass this change as follows

docker run -it --name spark-submit --network spark-net -p 4040:4040 -e MASTER_CONTAINER_NAME=<your-master-container-name> sdesilva26/spark_submit:latest bash

for the submit node and

docker run -it --name spark-worker1 --network spark-net -p 8081:8081 -e MEMORY=6G -e CORES=3 -e MASTER_CONTAINER_NAME=<your-master-container-name> sdesilva26/spark_worker:latest

for the worker node.

These instructions allow you to manually set up an Apache Spark cluster running inside Docker containers located on different machines. For a full walk-through, including explanations and a docker-compose version of setting up a cluster, see TUTORIAL.md.

Authors

  • Shane de Silva

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments


docker-spark's Issues

Driver image does not crash on Java exceptions

Thanks for the helpful material.

I have managed to set up a test Spark cluster from the suggested stack file and everything worked. I had to create custom Docker images to update the Python and PySpark versions, which gave me some headaches, but it is done now (I can make a PR on GitHub if needed).

I now have one issue with the Spark driver and wanted to check whether there is a known solution. I am working with Spark Structured Streaming and thus need to deploy a persistent Spark job. What I have done is build a custom spark-driver Docker image which starts a Python script and deploys the job in "cluster" mode. This works; however, if any Java exception is triggered the job dies, but the container does not, even though exiting is the intended behavior.

The CMD I am using to start the Python script is:

CMD /spark/bin/spark-submit \
--packages org.apache.spark:spark-streaming-kafka-0-10_2.12:${SPARK_VERSION},org.apache.spark:spark-sql-kafka-0-10_2.12:${SPARK_VERSION} \
--executor-memory $SPARK_EXECUTOR_MEMORY \
--executor-cores $SPARK_EXECUTOR_CORES \
script.py

and I have tried both with and without --supervise, but without success.

Any hint as to why the container does not fail with exit code 1 when Java exceptions occur?

UnsatisfiedLinkError - libsnappyjava.so

Trying to execute a job after creating one master and three slave containers, I get the following error about a missing file:

. . .
Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-c5d5b016-d413-484a-a7ee-68ed2bc6a29a-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-c5d5b016-d413-484a-a7ee-68ed2bc6a29a-libsnappyjava.so)
. . .

Ansible for worker nodes?

Can we use Ansible for actually setting up the worker nodes? Did you consider it as an option?

Attaching to network failed

sudo docker run -it --name spark-master --network spark-net -p 8080:8080 sdesilva26/spark_master:0.0.2
docker: Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded.
ERRO[0020] error waiting for container: context canceled

Doesn't work on Apple Silicon

Hi, I tried to deploy the master node Docker container on my Mac M1 and it doesn't deploy the service. Is it possible to deploy this solution on an ARM architecture?
