
docker-spark-cluster's People

Contributors

datlife, jedpittman, mvillarrealb, neelpatel21, spydernaz, zethson


docker-spark-cluster's Issues

docker-compose up fails on macOS: mounts denied

~/s/docker-spark-cluster> docker-compose up
Creating spark-master ... error

ERROR: for spark-master Cannot start service spark-master: b'Mounts denied: \r\nThe paths /mnt/spark-data and /mnt/spark-apps\r\nare not shared from OS X and are not known to Docker.\r\nYou can configure shared paths from Docker -> Preferences... -> File Sharing.\r\nSee https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.\r\n.'

ERROR: for spark-master Cannot start service spark-master: b'Mounts denied: \r\nThe paths /mnt/spark-data and /mnt/spark-apps\r\nare not shared from OS X and are not known to Docker.\r\nYou can configure shared paths from Docker -> Preferences... -> File Sharing.\r\nSee https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.\r\n.'
ERROR: Encountered errors while bringing up the project.

I guess I'll have to modify the configuration files to point to directories other than /mnt/spark-apps and /mnt/spark-data.
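A minimal sketch of that workaround, assuming you keep the bind-mount sources inside the project directory (which Docker Desktop shares by default) and adjust docker-compose.yml to match:

mkdir -p ./spark-apps ./spark-data
# In docker-compose.yml, point each service's volumes at the local folders
# instead of /mnt, e.g.:
#   volumes:
#     - ./spark-apps:/opt/spark-apps
#     - ./spark-data:/opt/spark-data
docker-compose up

Alternatively, add /mnt/spark-apps and /mnt/spark-data under Docker -> Preferences -> File Sharing, as the error message suggests.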

wget is broken (404 in base/Dockerfile)

First of all, thanks for the magic; it seems to have saved me 10 days.
In base/Dockerfile, the wget URL returns a 404.
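The mirror used in base/Dockerfile only keeps current releases, so older versions 404. A hedged fix is to download from the Apache archive instead, which keeps every release (shown here for the 2.4.3 / Hadoop 2.7 combination as an illustration):

# Same URL pattern with the Dockerfile's variables:
#   https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
wget --no-verbose "https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz"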

Not able to connect to the UIs

Everything works fine and docker-compose launches the Spark workers, but I am not able to connect to the master and worker UIs. The terminal shows that the MasterUI / WorkerUI is launched at some URL.
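A hedged first check, assuming the compose file does not publish the UI ports to the host (on macOS/Windows the container IPs are not routable from the host, so a published mapping such as 9090:8080 on the master service is needed; that mapping is illustrative, not part of the repo):

docker-compose ps              # confirm the containers are up and see any published ports
docker port spark-master       # shows which container ports are exposed on the host
curl -I http://localhost:9090/ # assumes a 9090:8080 mapping was added to the master service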

Feature request: Support for notebooks

If this cluster could support notebooks, it would be a great help for testing the many cloud data lakes that are stuck in limbo without the capability to test notebook-style deployments. It would be a huge win.
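Until notebook support lands, a hedged way to experiment is to attach a stock PySpark notebook image to the cluster's network (the network name below is borrowed from another issue on this page and may differ in your checkout; the Spark version inside the image must match the cluster's):

docker run --rm -p 8888:8888 \
  --network docker-spark-cluster_spark-network \
  jupyter/pyspark-notebook
# Inside a notebook, build the session against the standalone master:
#   SparkSession.builder.master("spark://spark-master:7077").getOrCreate()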

Hostname, XYZ resolves to a loopback address

How do I fix the issue below?

22/02/27 16:39:58 WARN Utils: Your hostname, XYZ resolves to a loopback address: 127.0.1.1; using 172.xx.xxx.xx instead (on interface eth0)
22/02/27 16:39:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1-amzn-0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
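These are warnings rather than fatal errors. The loopback one can be addressed by pinning the bind address Spark suggests, roughly like this (the address is the non-loopback interface IP from your own log, left as a placeholder here):

export SPARK_LOCAL_IP=172.xx.xxx.xx   # replace with the interface address Spark printed
/opt/spark/bin/spark-submit --master spark://spark-master:7077 ...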

The compose step is stuck

Hi,
I am using a Mac. The following step hangs. The final step to create the test cluster is to run the compose file:

docker-compose up --scale spark-worker=3

and I am unable to open the URL http://10.5.0.3:8081/.

Thanks,
Sriram
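A note in case it helps: docker-compose up stays attached to the containers' logs in the foreground, which can look like a hang. Running it detached makes the state easier to inspect; the unreachable 10.5.0.x URL on macOS is the same host-networking caveat as in the UI issue above:

docker-compose up -d --scale spark-worker=3
docker-compose ps
docker-compose logs -f spark-worker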

/mnt/spark-data: permission denied

Starting spark-master ... error

ERROR: for spark-master Cannot start service spark-master: error while creating mount source path '/mnt/spark-data': mkdir /mnt/spark-data: permission denied

ERROR: for spark-master Cannot start service spark-master: error while creating mount source path '/mnt/spark-data': mkdir /mnt/spark-data: permission denied
ERROR: Encountered errors while bringing up the project.
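A minimal sketch of a workaround, assuming you are fine creating the default mount points on the host (sudo is needed because /mnt is root-owned); the alternative is to change the volume paths in docker-compose.yml to directories your user already owns:

sudo mkdir -p /mnt/spark-apps /mnt/spark-data
sudo chown -R "$USER":"$USER" /mnt/spark-apps /mnt/spark-data
docker-compose up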

A problem doing wget for Spark

aironman@MacBook-Pro-Retina-de-Alonso ~/s/docker-spark-cluster> clear && ./build-images.sh

Sending build context to Docker daemon 3.072kB
Step 1/10 : FROM java:8-jdk-alpine
---> 3fd9dd82815c
Step 2/10 : ENV DAEMON_RUN=true
---> Using cache
---> fadbc04feea2
Step 3/10 : ENV SPARK_VERSION=2.3.1
---> Using cache
---> fd7483167fc6
Step 4/10 : ENV HADOOP_VERSION=2.7
---> Using cache
---> 8d491e9006c4
Step 5/10 : ENV SCALA_VERSION=2.12.4
---> Using cache
---> 75f3a65b4c6c
Step 6/10 : ENV SCALA_HOME=/usr/share/scala
---> Using cache
---> 993a12bb9e00
Step 7/10 : RUN apk add --no-cache --virtual=.build-dependencies wget ca-certificates && apk add --no-cache bash curl jq && cd "/tmp" && wget --no-verbose "https://downloads.typesafe.com/scala/${SCALA_VERSION}/scala-${SCALA_VERSION}.tgz" && tar xzf "scala-${SCALA_VERSION}.tgz" && mkdir "${SCALA_HOME}" && rm "/tmp/scala-${SCALA_VERSION}/bin/".bat && mv "/tmp/scala-${SCALA_VERSION}/bin" "/tmp/scala-${SCALA_VERSION}/lib" "${SCALA_HOME}" && ln -s "${SCALA_HOME}/bin/" "/usr/bin/" && apk del .build-dependencies && rm -rf "/tmp/"*
---> Using cache
---> 5306ca105f4d
Step 8/10 : RUN export PATH="/usr/local/sbt/bin:$PATH" && apk update && apk add ca-certificates wget tar && mkdir -p "/usr/local/sbt" && wget -qO - --no-check-certificate "https://github.com/sbt/sbt/releases/download/v1.2.8/sbt-1.2.8.tgz" | tar xz -C /usr/local/sbt --strip-components=1 && sbt sbtVersion
---> Using cache
---> 4b2ebf0c237a
Step 9/10 : RUN apk add --no-cache python3
---> Using cache
---> 58495a28f4e7
Step 10/10 : RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
---> Running in 8b2c8b02afb6
http://apache.mirror.iphh.net/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz:
2019-03-11 11:12:34 ERROR 404: Not Found.
The command '/bin/sh -c wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz' returned a non-zero code: 8
aironman@MacBook-Pro-Retina-de-Alonso ~/s/docker-spark-cluster>
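As with the wget issue above, the root cause is that the mirror drops old releases. A hedged fix is to point the download at the Apache archive in the base Dockerfile (the sed below uses macOS syntax and assumes the base/Dockerfile path mentioned in the earlier issue):

sed -i '' 's|apache.mirror.iphh.net/spark|archive.apache.org/dist/spark|' base/Dockerfile
./build-images.sh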

Connection refused with spark-submit

I am trying to run the example jar that comes with spark and I get connection refused to the master node:

Caused by: java.io.IOException: Failed to connect to spark-master/172.18.0.2:6066
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: spark-master/172.18.0.2:6066
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more

I first ran docker exec with bash to get a shell for running spark-submit, and here is the command I ran:

/spark/bin/spark-submit --class org.mvb.applications.example --master spark://spark-master:6066 --deploy-mode cluster /spark/examples/jars/spark-examples_2.11-2.4.3.jar
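A hedged alternative: port 6066 is the standalone master's REST submission gateway, which recent Spark releases ship disabled by default, so the connection is refused. Submitting through the regular 7077 endpoint, using the stock example class that actually lives in that jar, avoids it:

/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master:7077 \
  /spark/examples/jars/spark-examples_2.11-2.4.3.jar 100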

http://apache.mirror.iphh.net/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz has been removed.

Step 10/10 : RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz       && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark       && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
 ---> Running in f077b663eab8
http://apache.mirror.iphh.net/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz:
2019-04-06 16:22:07 ERROR 404: Not Found.
The command '/bin/sh -c wget --no-verbose

Spark 2.4.3 is installed despite changes to the base Dockerfile

In the base Dockerfile I changed the following:

ENV DAEMON_RUN=true
ENV SPARK_VERSION=3.0.0
ENV HADOOP_VERSION=3.2
ENV SCALA_VERSION=2.12.4
ENV SCALA_HOME=/usr/share/scala
ENV SPARK_HOME=/spark

# ENV DAEMON_RUN=true
# ENV SPARK_VERSION=2.4.3
# ENV HADOOP_VERSION=2.7
# ENV SCALA_VERSION=2.12.4
# ENV SCALA_HOME=/usr/share/scala
# ENV SPARK_HOME=/spark

I can see it downloading Spark 3.x, yet the master node still installs Spark 2.4.3. Where is it hard-coded?
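A hedged way to track this down: the stale version usually comes either from the Docker build cache or from another Dockerfile in the repo (master/worker/submit images) that pins its own SPARK_VERSION, so listing every pinned value and forcing a cache-free rebuild is a reasonable first step:

grep -rn "SPARK_VERSION" --include=Dockerfile .
docker builder prune -f   # or add --no-cache to the docker build calls in build-images.sh
./build-images.sh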

Job not started

Hello,
when I run a job I get:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
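A hedged first check: this message usually means either no workers are registered with the master (visible on the 8080 UI) or the job asks for more cores/memory than any single worker advertises. Capping the request below the workers' configured sizes often unblocks it (Spark home and jar path are placeholders):

/opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --total-executor-cores 1 \
  --executor-memory 512m \
  /opt/spark-apps/your-app.jar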

Zeppelin integration

Not sure if this is the best place to request a new feature, but it would be great if Zeppelin could be added, or if you could explain how to use Zeppelin with such a cluster.

Many thanks
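Until it is documented in the repo, a hedged way to try Zeppelin against this cluster without touching the compose file is to attach the official image to the same network (network name borrowed from another issue on this page; the interpreter's Spark version must match the cluster's):

docker run --rm -p 8085:8080 \
  --network docker-spark-cluster_spark-network \
  apache/zeppelin:0.9.0
# Then set "master" to spark://spark-master:7077 in Zeppelin's Spark
# interpreter settings.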

Issue while executing docker-compose

Hi all, I am facing issues while executing docker-compose up -d.

I am getting the error below:

: No such file or directorypark.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-worker-b_1 | /start-spark.sh: line 2: $'\r': command not found
spark-worker-b_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-worker-b_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
: No such file or directoryrk.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-master_1 | /start-spark.sh: line 2: $'\r': command not found
spark-master_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-master_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
: No such file or directorypark.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-worker-a_1 | /start-spark.sh: line 2: $'\r': command not found
spark-worker-a_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-worker-a_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
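The $'\r': command not found lines are the classic symptom of start-spark.sh having been checked out with Windows (CRLF) line endings. A hedged fix is to convert the scripts back to LF and rebuild the images:

dos2unix start-spark.sh            # or: sed -i 's/\r$//' start-spark.sh
git config core.autocrlf input     # keep future checkouts LF-only
docker-compose build && docker-compose up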

Failed driver state

I followed the instructions from README.md, but I ran into some issues:

  1. In docker-compose.yml, version: "3.7" is not supported by my docker-compose (I changed it to version: "3.3").

  2. Some files have changed and no longer match the instructions in README.md; for example, some versions were changed to "latest".

After all that I got the cluster started, but spark-submit does not work.
I used the script from spark-submit/crimes-app.sh, only changing it for my application.

log:

./crimes-app.sh
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/08/23 16:48:33 INFO SecurityManager: Changing view acls to: root
19/08/23 16:48:33 INFO SecurityManager: Changing modify acls to: root
19/08/23 16:48:33 INFO SecurityManager: Changing view acls groups to: 
19/08/23 16:48:33 INFO SecurityManager: Changing modify acls groups to: 
19/08/23 16:48:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/08/23 16:48:33 INFO Utils: Successfully started service 'driverClient' on port 33063.
19/08/23 16:48:33 INFO TransportClientFactory: Successfully created connection to spark-master/172.19.0.2:7077 after 22 ms (0 ms spent in bootstraps)
19/08/23 16:48:33 INFO ClientEndpoint: Driver successfully submitted as driver-20190823164833-0002
19/08/23 16:48:33 INFO ClientEndpoint: ... waiting before polling master for driver state
19/08/23 16:48:38 INFO ClientEndpoint: ... polling master for driver state
19/08/23 16:48:39 INFO ClientEndpoint: State of driver-20190823164833-0002 is FAILED
19/08/23 16:48:39 INFO ShutdownHookManager: Shutdown hook called
19/08/23 16:48:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-01a41054-33c9-49c6-9550-c1537854553a

Please help me to solve this issue.
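In cluster deploy mode the client only reports the final FAILED state; the real error is in the driver's stderr on whichever worker ran it. A hedged way to find it (container name and Spark home are illustrative, and the driver ID comes from the log above):

docker exec -it spark-worker-a ls /opt/spark/work/
docker exec -it spark-worker-a cat /opt/spark/work/driver-20190823164833-0002/stderr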

Timeout error

When accessing URLs like http://10.5.0.2:8080/ I get a timeout on macOS Catalina.
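On macOS the 10.5.0.0/16 addresses only exist inside Docker's Linux VM, so they time out from the host; the UI has to be reached through a port published in docker-compose.yml. A hedged way to confirm the master UI itself is healthy is to query it from inside the compose network (network name borrowed from another issue on this page):

docker run --rm --network docker-spark-cluster_spark-network \
  curlimages/curl -sI http://10.5.0.2:8080/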

mta_data not found

I had to run CREATE DATABASE mta_data; in the database container for the Python task to run successfully.
Maybe this statement could be run when creating the database image, or be specified as an instruction in the README.
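One hedged way to bake that in: the official database images (Postgres, MySQL, MariaDB) execute any *.sql file mounted under /docker-entrypoint-initdb.d the first time their data volume is created, so the statement from this issue can run automatically instead of by hand (directory name is illustrative):

mkdir -p init
echo "CREATE DATABASE mta_data;" > init/create-mta-data.sql
# Then mount ./init:/docker-entrypoint-initdb.d on the database service in
# docker-compose.yml and recreate its data volume.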

How can I pass arguments to the docker run command?

Hi Marcos, I'm trying to launch the task with Docker as you indicate, but I have to pass parameters to the job and, for whatever reason, Docker is not accepting them. According to the official Docker help, I have to use -e APP_ARGS, like this:

~/s/docker-spark-cluster> docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

But when I go to the driver page, I see the following error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at com.oreilly.learningsparkexamples.mini.scala.WordCount$.main(WordCount.scala:11)
at com.oreilly.learningsparkexamples.mini.scala.WordCount.main(WordCount.scala)
... 6 more

That error points directly to this line of code:

  val inputFile = args(0)

In other words, Docker is not passing the file as argument zero.
I have tried launching the job from the driver's IP, connecting to it and doing a traditional spark-submit with the same arguments, and everything works as expected.

I hope you can help me because it's driving me crazy. Can you think of what I'm doing wrong?
Thank you.
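As a hedged workaround while the APP_ARGS handling is sorted out, the manual submission the reporter already validated looks roughly like this from inside the cluster; spark-submit passes everything after the jar to main() as arguments, and the paths and class name are taken from this issue (the jar name is a placeholder):

docker exec -it spark-master /spark/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  --master spark://spark-master:7077 \
  /opt/spark-apps/<your-jar>.jar \
  /opt/spark-data/README.md /opt/spark-data/output-6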

Read/write does not work

I am properly mounting each worker to the same folder location, but I cannot perform a spark.read.parquet without an error. I would guess this is due to the fact it is not accessing the file from an HDFS. Any workarounds or thoughts on how to submit a job that can read and write? Everything else is working properly.
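For reference, in standalone mode without HDFS the driver and every executor must see the data at the same path, so the same host directory has to be bind-mounted into the master and all workers, and the job has to use the shared container path. A hedged sanity check (container names and paths reuse ones from other issues on this page):

docker exec spark-master ls /opt/spark-data
docker exec spark-worker-a ls /opt/spark-data   # container name is illustrative
# In the job, read via the shared container path, e.g.
#   spark.read.parquet("/opt/spark-data/<dataset>.parquet")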

Cannot open 10.5.0.2:8080 spark-master

Hi, I cannot validate the cluster because I could not reach the spark-master web page.
It is weird, because everything looks fine when I run the docker-compose up command.

This is the output.

How to run python app?

Thank you for this repo.

One thing that was unclear to me: once I've got the Spark cluster up and running via docker-compose, if I have a PySpark script on my computer, can I simply run it and connect to this Spark cluster? Or does the Python app have to live in the container?

Say I had this file hello-spark.py

from pyspark.sql import SparkSession


def main():
    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("HelloWorld")  \
        .getOrCreate()

    # Create an RDD containing numbers from 1 to 10
    numbers_rdd = spark.sparkContext.parallelize(range(1, 11))

    # Count the elements in the RDD
    count = numbers_rdd.count()

    print(f"Count of numbers from 1 to 10 is: {count}")

    # Stop the SparkSession
    spark.stop()


if __name__ == "__main__":
    main()
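A hedged answer sketch: the script does not have to live permanently in a container, but the submit has to happen where a matching Spark install can reach the master. Two common options, using paths that appear elsewhere in these issues (Spark home and the apps folder may differ in your build):

# (a) Copy the script into the cluster and submit from the master container:
docker cp hello-spark.py spark-master:/opt/spark-apps/hello-spark.py
docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 /opt/spark-apps/hello-spark.py

# (b) Submitting straight from the host only works if the host has the same
#     Spark and Python versions installed and can reach the master's 7077
#     port (published in docker-compose.yml, or via the container IP on Linux).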
