This part deploys a Spark cluster using Docker Swarm. It is implemented in the following steps:
- Deploy a Spark cluster in Docker containers
- Deploy a Spark cluster in Docker Swarm (VMs)
- Use NFS to share files across all docker machines
- Submit and run tasks
- Support starting a Jupyter Notebook on the master / manager node
Code Repository: spark-in-docker
directory | description |
---|---|
bin | Scripts for container mode. All containers start on the local machine. |
sbin | Docker swarm deployment scripts. Spark nodes are deployed on docker machines. |
code | Some test code. |
docker | Files to build the docker image, plus docker-compose.yml. |
docker/config | Configuration files for image building and services. |
docker/pkg | Installation packages for image building, including Spark / Java / Hadoop. |
docker/setup.sh | Script to manually set up extra components such as jupyter-notebook. |
docker/build-image.sh | Script for building the image. |
docker/Dockerfile | Spark image, based on ubuntu:18.04. |
docker/docker-compose.yml | Spark services for docker swarm. |

Dockerfile

- The image is based on ubuntu:18.04.
- Install some software: vim, ssh, python3.6, jupyter.
- Configure SSH trust with itself (passwordless SSH to localhost).
- Install the JDK & Spark from local files (this can also be done over the internet). The installation packages should be placed in the `docker/pkg` directory according to `required_files.txt` in the same directory.
- Configure environment variables.
- Configure the Spark slaves file. (This is only necessary when starting multiple containers of this image on a single machine.)
- Install the necessary Python packages according to `docker/config/requirements.txt`.
- Configure the SSH service.
- Start SSH and wait for further instructions.
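A minimal sketch of building the image locally, assuming the packages listed in `docker/pkg/required_files.txt` have already been downloaded; the exact invocation of `build-image.sh` is an assumption, so check the script itself:

```sh
# From the repository root: place the required archives, then build the image.
cd docker
cat pkg/required_files.txt     # lists the JDK / Spark / Hadoop archives the build expects
ls pkg/                        # verify those archives are present
./build-image.sh               # builds the Spark image from this Dockerfile
```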
docker-compose.yml

- There are 2 services: master & worker. They use the same image, built from the Dockerfile above.
- Adjust the number of workers by changing the `services.worker.deploy.replicas` field (see the sketch after this list).
- There is an overlay network.
- There is a docker volume `/nfsshare:/root/nfsshare`, which maps the local `/nfsshare` folder to the container's `/root/nfsshare` folder.
- To make Spark work on top of docker swarm, there are 2 options:
  - Use Hadoop with HDFS in the docker swarm cluster.
  - Use NFS to sync files across the cluster.
- The container resources are configurable via the `cpus` & `memory` fields.
- The Spark resources are configurable by setting environment variables in the `environment` field. The variable names can be found in the official Spark documentation.
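A minimal sketch of changing the worker count, assuming the stack is named `spark` as in the walkthrough output below; the repository normally drives this through the `sbin/` scripts, so treat it as an illustration rather than the prescribed workflow:

```sh
# After editing services.worker.deploy.replicas in docker/docker-compose.yml,
# redeploy the stack so swarm reconciles the number of worker tasks.
docker stack deploy -c docker/docker-compose.yml spark

# Or scale at runtime only (the change is not persisted in the compose file):
docker service scale spark_worker=4
```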
Scripts under bin/
Scripts in this folder manage containers directly; all containers start on the local machine.
Script | Usage |
---|---|
get-master-ip.sh | Acquire the Spark master node's IP address. |
resize-cluster.sh | Change the cluster size. ALL containers need to be stopped before resizing. |
spark-container-service.sh | Entry point for service control. Commands: start, stop, status, usage. |
spark-containers-start.sh | Start the Spark cluster in container mode. |
spark-containers-status.sh | Check the status of the container-mode Spark cluster. |
spark-containers-stop.sh | Stop the container-mode Spark cluster. |
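A sketch of typical container-mode usage through the entry-point script; the exact arguments are an assumption based on the table above, so check each script's `usage` output:

```sh
cd bin
./spark-container-service.sh start     # start the local Spark containers
./spark-container-service.sh status    # check that they are running
./get-master-ip.sh                     # print the Spark master's IP address
./spark-container-service.sh stop      # shut the cluster down
```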
Scripts under sbin/
Script | Usage |
---|---|
config-swarm.sh | Configure the docker swarm environment. *1 |
spark-cluster-service.sh | Entry point for service control. Commands: deploy, remove, status, usage. *2 |
spark-cluster-deploy.sh | Deploy the Spark cluster in docker swarm mode. |
spark-cluster-status.sh | Check the status of the Spark cluster in docker swarm mode. |
spark-cluster-remove.sh | Remove the Spark cluster from docker swarm. |
*1: For now, the config-swarm script only supports creating docker machines with the VirtualBox driver. If you want to run this on a real cluster of physical machines, you should manually configure the docker machines on those hosts and adjust the IP addresses in the script.
*2: Normally, you only need the entry-point script to manage the Spark cluster service (stack).
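A sketch of the typical swarm-mode workflow through these scripts, matching the step-by-step walkthrough below:

```sh
cd sbin
./config-swarm.sh --create --num=3 --prefix=myvm   # create three VirtualBox docker machines
./spark-cluster-service.sh deploy                  # deploy the Spark stack to the swarm
./spark-cluster-service.sh status                  # check that the services are running
./spark-cluster-service.sh remove                  # tear the stack down when finished
```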
- Normally, to take advantage of Spark's computation power, you should run Spark on a real cluster. In that case, use the scripts under `sbin/` to manage your cluster in docker swarm mode.
- Run `./config-swarm.sh --create --num=3 --prefix=myvm` to create a virtual 3-machine cluster with VirtualBox as the driver.
- Configure the shared folder (NFS, if on physical machines) for ALL your docker machines.
- Run `docker-machine restart myvm1 myvm2 myvm3` to restart your docker machines so the shared folder takes effect.
- Run `docker-machine ls` to check the docker machines you created.

  ```
  ➜ sbin git:(master) docker-machine ls
  NAME    ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER     ERRORS
  myvm1   -        virtualbox   Running   tcp://192.168.99.100:2376           v18.09.6
  myvm2   -        virtualbox   Running   tcp://192.168.99.101:2376           v18.09.6
  myvm3   -        virtualbox   Running   tcp://192.168.99.102:2376           v18.09.6
  ```
- Run `eval $(docker-machine env myvm1)` to point your shell's docker CLI at a docker machine, which makes it easier to manage. Alternatively, connect to a docker machine with `docker-machine ssh myvm1`.

  ```
  ➜ sbin git:(master) docker-machine env myvm1
  export DOCKER_TLS_VERIFY="1"
  export DOCKER_HOST="tcp://192.168.99.100:2376"
  export DOCKER_CERT_PATH="/Users/Chen/.docker/machine/machines/myvm1"
  export DOCKER_MACHINE_NAME="myvm1"
  # Run this command to configure your shell:
  # eval $(docker-machine env myvm1)
  ➜ sbin git:(master) ✗ eval $(docker-machine env myvm1)
  ```
- Run `./spark-cluster-service.sh deploy` to start the service stack. The first deployment may take a while because the image has to be downloaded. The output also shows how to start the Jupyter Notebook; a sketch of those steps follows this list.

  ```
  ➜ sbin git:(master) ✗ ./spark-cluster-service.sh deploy
  deploying spark...
  Creating network spark_spark-net
  Creating service spark_master
  Creating service spark_worker
  getting information...
  spark-master    : 192.168.99.100:7077
  web UI          : 192.168.99.100:8080
  data-dir-host   : ./../nfsshare
  data-dir-master : /root/nfsshare
  to start jupyter notebook, run 'docker container exec -it spark_master.1* bash'; then run 'sh /root/install/setup.sh'
  jupyter-noteboo : 192.168.99.100:8888
  ```
- Use `./spark-cluster-service.sh status`, `docker stack ls`, `docker stack ps spark`, or `docker service ls` to check that the services have started. Note that the REPLICAS field may show 0/1 or 0/2 and the CURRENT STATE may read "Preparing n minutes ago" at first. This is because it takes some time to download the images: the image size is around 1 GB, and every docker machine has to download it separately.

  ```
  ➜ sbin git:(master) ✗ ./spark-cluster-service.sh status
  check spark status...
  ID             NAME             IMAGE                    NODE    DESIRED STATE   CURRENT STATE           ERROR   PORTS
  y98c8ywnckl9   spark_worker.1   cchencool/spark:latest   myvm3   Running         Preparing 7 minutes ago
  nf1gzikfybcw   spark_master.1   cchencool/spark:latest   myvm1   Running         Preparing 7 minutes ago
  twtjuuqdh8qj   spark_worker.2   cchencool/spark:latest   myvm2   Running         Preparing 7 minutes ago
  done
  ➜ sbin git:(master) ✗ docker stack ls
  NAME    SERVICES   ORCHESTRATOR
  spark   2          Swarm
  ➜ sbin git:(master) ✗ docker stack ps spark
  ID             NAME             IMAGE                    NODE    DESIRED STATE   CURRENT STATE           ERROR   PORTS
  y98c8ywnckl9   spark_worker.1   cchencool/spark:latest   myvm3   Running         Preparing 8 minutes ago
  nf1gzikfybcw   spark_master.1   cchencool/spark:latest   myvm1   Running         Preparing 8 minutes ago
  twtjuuqdh8qj   spark_worker.2   cchencool/spark:latest   myvm2   Running         Preparing 8 minutes ago
  ➜ sbin git:(master) ✗ docker service ls
  ID             NAME           MODE         REPLICAS   IMAGE                    PORTS
  ot7ucdk904ne   spark_master   replicated   0/1        cchencool/spark:latest   *:4040->4040/tcp, *:6066->6066/tcp, *:7077->7077/tcp, *:8080->8080/tcp, *:8888->8888/tcp
  f3k81ebll3ny   spark_worker   replicated   0/2        cchencool/spark:latest   *:8081->8081/tcp
  ```
- When the Spark cluster is successfully deployed, you should see the CURRENT STATE become "Running" and REPLICAS become 1/1 and 2/2:

  ```
  ➜ sbin git:(master) ✗ ./spark-cluster-service.sh status
  check spark status...
  ID             NAME             IMAGE                    NODE    DESIRED STATE   CURRENT STATE            ERROR   PORTS
  czie0i2ky43v   spark_worker.1   cchencool/spark:latest   myvm3   Running         Running 5 seconds ago
  khcidid7d44m   spark_master.1   cchencool/spark:latest   myvm1   Running         Running 11 seconds ago
  yw58robszmo9   spark_worker.2   cchencool/spark:latest   myvm2   Running         Running 5 seconds ago
  done
  ➜ sbin git:(master) ✗ docker service ls
  ID             NAME           MODE         REPLICAS   IMAGE                    PORTS
  n3u9w9ej1u12   spark_master   replicated   1/1        cchencool/spark:latest   *:4040->4040/tcp, *:6066->6066/tcp, *:7077->7077/tcp, *:8080->8080/tcp, *:8888->8888/tcp
  ruufeky42eqi   spark_worker   replicated   2/2        cchencool/spark:latest   *:8081->8081/tcp
  ```
- Use the information printed by the script to connect to the Spark cluster and check the web UI.
- You can connect to your container using the commands below:

  ```
  ➜ sbin git:(master) ✗ docker container ls
  CONTAINER ID   IMAGE                    COMMAND                  CREATED         STATUS         PORTS      NAMES
  4239569e16e0   cchencool/spark:latest   "sh -c 'source ~/.ba…"   3 minutes ago   Up 3 minutes   7077/tcp   spark_master.1.khcidid7d44m32kcfdo8jf7ya
  ➜ sbin git:(master) ✗ docker container exec -it spark_master.1.khcidid7d44m32kcfdo8jf7ya bash
  root@master:~/workspace# ls -al
  total 16
  drwxr-xr-x 1 root root 4096 May 27 08:59 .
  drwx------ 1 root root 4096 May 27 09:03 ..
  -rw-r--r-- 1 root root  133 May 27 08:59 log
  root@master:~/workspace# pwd
  /root/workspace
  root@master:~/workspace# cat log
  starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark--org.apache.spark.deploy.master.Master-1-master.out
  ```
- When you want to shut down the cluster, use `./spark-cluster-service.sh remove`.

  ```
  ➜ sbin git:(master) ✗ ./spark-cluster-service.sh remove
  removing spark...
  Removing service spark_master
  Removing service spark_worker
  Removing network spark_spark-net
  success
  ```
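Following the hint printed by the deploy script, here is a sketch of starting the Jupyter Notebook on the master node. The container name suffix is an example; look up the real one with `docker container ls`.

```sh
eval $(docker-machine env myvm1)                          # point the docker CLI at the node running spark_master
docker container ls                                       # note the full spark_master.1.<suffix> container name
docker container exec -it spark_master.1.<suffix> bash    # open a shell inside the master container
sh /root/install/setup.sh                                 # set up and start jupyter, then browse to <master-ip>:8888
```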
Data upload
- For this demo implementation, which uses the docker-machine VirtualBox driver, you can simply share the `/nfsshare` folder with the host in the VirtualBox configuration to simulate a properly configured NFS environment.
- In production you should use NFS, sharing `/nfsshare` across the hosts (a sketch follows this list). The folder can be changed by modifying the `volumes` of the 2 services in the `docker-compose.yml` file.
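A minimal sketch of exporting `/nfsshare` over NFS on physical hosts. The package names assume Debian/Ubuntu hosts, and the subnet merely mirrors the example docker-machine addresses above; adjust both to your environment.

```sh
# On the NFS server:
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /nfsshare
echo "/nfsshare 192.168.99.0/24(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -ra

# On every docker host (NFS client):
sudo apt-get install -y nfs-common
sudo mkdir -p /nfsshare
sudo mount -t nfs <nfs-server-ip>:/nfsshare /nfsshare
```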
Task submission
- When the `spark-cluster-service.sh deploy` command starts the Spark cluster in docker swarm, it prints the spark-master and Spark web UI addresses.
- To run a task, whether from a Jupyter Notebook or a standalone driver program, simply follow the Spark documentation to connect your driver to the Spark master.
- You should be aware of, and manage, the CPU & memory resources yourself. To ensure multiple drivers are not blocked when running at the same time, carefully configure the resource-limit parameters of your driver programs so they do not exhaust the cluster's resources (see the sketch after this list).
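A minimal sketch of submitting a job against the deployed master, assuming the example master address printed above (`192.168.99.100:7077`) and a hypothetical application file `your_app.py`. The two resource flags cap what this driver may claim so that several drivers can run side by side.

```sh
# Cap this application's share of the cluster so other drivers are not starved.
spark-submit \
  --master spark://192.168.99.100:7077 \
  --total-executor-cores 2 \
  --executor-memory 1G \
  your_app.py
```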