
docker-spark-cluster's People

Contributors

datlife, jedpittman, mvillarrealb, neelpatel21, spydernaz, zethson


docker-spark-cluster's Issues

docker-compose up fails on macOS: mounts denied

~/s/docker-spark-cluster> docker-compose up
Creating spark-master ... error

ERROR: for spark-master Cannot start service spark-master: b'Mounts denied: \r\nThe paths /mnt/spark-data and /mnt/spark-apps\r\nare not shared from OS X and are not known to Docker.\r\nYou can configure shared paths from Docker -> Preferences... -> File Sharing.\r\nSee https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.\r\n.'

ERROR: for spark-master Cannot start service spark-master: b'Mounts denied: \r\nThe paths /mnt/spark-data and /mnt/spark-apps\r\nare not shared from OS X and are not known to Docker.\r\nYou can configure shared paths from Docker -> Preferences... -> File Sharing.\r\nSee https://docs.docker.com/docker-for-mac/osxfs/#namespaces for more info.\r\n.'
ERROR: Encountered errors while bringing up the project.

I guess I'll have to modify the configuration files to point to directories other than /mnt/spark-apps and /mnt/spark-data.
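A minimal sketch of that workaround, assuming you keep the bind-mount sources inside the project directory (which Docker Desktop shares by default) and adjust docker-compose.yml to match:

mkdir -p ./spark-apps ./spark-data
# In docker-compose.yml, point each service's volumes at the local folders
# instead of /mnt, e.g.:
#   volumes:
#     - ./spark-apps:/opt/spark-apps
#     - ./spark-data:/opt/spark-data
docker-compose up

Alternatively, add /mnt/spark-apps and /mnt/spark-data under Docker -> Preferences -> File Sharing, as the error message suggests.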

wget is broken (404 in base/Dockerfile)

First of all, thanks for the magic; it seems to have saved me 10 days.
In base/Dockerfile, the wget URL returns a 404.
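The mirror used in base/Dockerfile only keeps current releases, so older versions 404. A hedged fix is to download from the Apache archive instead, which keeps every release (shown here for the 2.4.3 / Hadoop 2.7 combination as an illustration):

# Same URL pattern with the Dockerfile's variables:
#   https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
wget --no-verbose "https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz"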

Not able to connect to the UIs

Everything works fine and docker-compose launches the Spark workers, but I am not able to connect to the master and worker UIs. The terminal shows that the MasterUI / WorkerUI is launched at some URL.
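A hedged first check, assuming the compose file does not publish the UI ports to the host (on macOS/Windows the container IPs are not routable from the host, so a published mapping such as 9090:8080 on the master service is needed; that mapping is illustrative, not part of the repo):

docker-compose ps              # confirm the containers are up and see any published ports
docker port spark-master       # shows which container ports are exposed on the host
curl -I http://localhost:9090/ # assumes a 9090:8080 mapping was added to the master service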

Feature request: Support for notebooks

If this cluster could support notebooks, it would be a great help for testing the many cloud data lakes that are stuck in limbo without the capability to test notebook-style deployments. It would be a huge win.
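Until notebook support lands, a hedged way to experiment is to attach a stock PySpark notebook image to the cluster's network (the network name below is borrowed from another issue on this page and may differ in your checkout; the Spark version inside the image must match the cluster's):

docker run --rm -p 8888:8888 \
  --network docker-spark-cluster_spark-network \
  jupyter/pyspark-notebook
# Inside a notebook, build the session against the standalone master:
#   SparkSession.builder.master("spark://spark-master:7077").getOrCreate()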

Hostname, XYZ resolves to a loopback address

How do I fix the issue below?

22/02/27 16:39:58 WARN Utils: Your hostname, XYZ resolves to a loopback address: 127.0.1.1; using 172.xx.xxx.xx instead (on interface eth0)
22/02/27 16:39:58 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1-amzn-0.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
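These are warnings rather than fatal errors. The loopback one can be addressed by pinning the bind address Spark suggests, roughly like this (the address is the non-loopback interface IP from your own log, left as a placeholder here):

export SPARK_LOCAL_IP=172.xx.xxx.xx   # replace with the interface address Spark printed
/opt/spark/bin/spark-submit --master spark://spark-master:7077 ...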

The compose step is stuck

Hi,
I am using a Mac. The following step hangs. The final step to create the test cluster is to run the compose file:

docker-compose up --scale spark-worker=3

and I am unable to open the URL http://10.5.0.3:8081/.

Thanks,
Sriram
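A note in case it helps: docker-compose up stays attached to the containers' logs in the foreground, which can look like a hang. Running it detached makes the state easier to inspect; the unreachable 10.5.0.x URL on macOS is the same host-networking caveat as in the UI issue above:

docker-compose up -d --scale spark-worker=3
docker-compose ps
docker-compose logs -f spark-worker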

/mnt/spark-data: permission denied

Starting spark-master ... error

ERROR: for spark-master Cannot start service spark-master: error while creating mount source path '/mnt/spark-data': mkdir /mnt/spark-data: permission denied

ERROR: for spark-master Cannot start service spark-master: error while creating mount source path '/mnt/spark-data': mkdir /mnt/spark-data: permission denied
ERROR: Encountered errors while bringing up the project.
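A minimal sketch of a workaround, assuming you are fine creating the default mount points on the host (sudo is needed because /mnt is root-owned); the alternative is to change the volume paths in docker-compose.yml to directories your user already owns:

sudo mkdir -p /mnt/spark-apps /mnt/spark-data
sudo chown -R "$USER":"$USER" /mnt/spark-apps /mnt/spark-data
docker-compose up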

A problem doing wget for Spark

aironman@MacBook-Pro-Retina-de-Alonso ~/s/docker-spark-cluster> clear && ./build-images.sh

Sending build context to Docker daemon 3.072kB
Step 1/10 : FROM java:8-jdk-alpine
---> 3fd9dd82815c
Step 2/10 : ENV DAEMON_RUN=true
---> Using cache
---> fadbc04feea2
Step 3/10 : ENV SPARK_VERSION=2.3.1
---> Using cache
---> fd7483167fc6
Step 4/10 : ENV HADOOP_VERSION=2.7
---> Using cache
---> 8d491e9006c4
Step 5/10 : ENV SCALA_VERSION=2.12.4
---> Using cache
---> 75f3a65b4c6c
Step 6/10 : ENV SCALA_HOME=/usr/share/scala
---> Using cache
---> 993a12bb9e00
Step 7/10 : RUN apk add --no-cache --virtual=.build-dependencies wget ca-certificates && apk add --no-cache bash curl jq && cd "/tmp" && wget --no-verbose "https://downloads.typesafe.com/scala/${SCALA_VERSION}/scala-${SCALA_VERSION}.tgz" && tar xzf "scala-${SCALA_VERSION}.tgz" && mkdir "${SCALA_HOME}" && rm "/tmp/scala-${SCALA_VERSION}/bin/".bat && mv "/tmp/scala-${SCALA_VERSION}/bin" "/tmp/scala-${SCALA_VERSION}/lib" "${SCALA_HOME}" && ln -s "${SCALA_HOME}/bin/" "/usr/bin/" && apk del .build-dependencies && rm -rf "/tmp/"*
---> Using cache
---> 5306ca105f4d
Step 8/10 : RUN export PATH="/usr/local/sbt/bin:$PATH" && apk update && apk add ca-certificates wget tar && mkdir -p "/usr/local/sbt" && wget -qO - --no-check-certificate "https://github.com/sbt/sbt/releases/download/v1.2.8/sbt-1.2.8.tgz" | tar xz -C /usr/local/sbt --strip-components=1 && sbt sbtVersion
---> Using cache
---> 4b2ebf0c237a
Step 9/10 : RUN apk add --no-cache python3
---> Using cache
---> 58495a28f4e7
Step 10/10 : RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
---> Running in 8b2c8b02afb6
http://apache.mirror.iphh.net/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz:
2019-03-11 11:12:34 ERROR 404: Not Found.
The command '/bin/sh -c wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz' returned a non-zero code: 8
aironman@MacBook-Pro-Retina-de-Alonso ~/s/docker-spark-cluster>
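As with the wget issue above, the root cause is that the mirror drops old releases. A hedged fix is to point the download at the Apache archive in the base Dockerfile (the sed below uses macOS syntax and assumes the base/Dockerfile path mentioned in the earlier issue):

sed -i '' 's|apache.mirror.iphh.net/spark|archive.apache.org/dist/spark|' base/Dockerfile
./build-images.sh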

Connection refused with spark-submit

I am trying to run the example jar that comes with spark and I get connection refused to the master node:

Caused by: java.io.IOException: Failed to connect to spark-master/172.18.0.2:6066
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: spark-master/172.18.0.2:6066
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more

I first ran docker exec with bash to get a shell for running spark-submit, and here is the command I ran:

/spark/bin/spark-submit --class org.mvb.applications.example --master spark://spark-master:6066 --deploy-mode cluster /spark/examples/jars/spark-examples_2.11-2.4.3.jar
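A hedged alternative: port 6066 is the standalone master's REST submission gateway, which recent Spark releases ship disabled by default, so the connection is refused. Submitting through the regular 7077 endpoint, using the stock example class that actually lives in that jar, avoids it:

/spark/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://spark-master:7077 \
  /spark/examples/jars/spark-examples_2.11-2.4.3.jar 100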

http://apache.mirror.iphh.net/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz has been removed.

Step 10/10 : RUN wget --no-verbose http://apache.mirror.iphh.net/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz       && mv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} spark       && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
 ---> Running in f077b663eab8
http://apache.mirror.iphh.net/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz:
2019-04-06 16:22:07 ERROR 404: Not Found.
The command '/bin/sh -c wget --no-verbose

Spark 2.4.3 is installed despite changes to the base Dockerfile

In the base Dockerfile I changed the following:

ENV DAEMON_RUN=true
ENV SPARK_VERSION=3.0.0
ENV HADOOP_VERSION=3.2
ENV SCALA_VERSION=2.12.4
ENV SCALA_HOME=/usr/share/scala
ENV SPARK_HOME=/spark

# ENV DAEMON_RUN=true
# ENV SPARK_VERSION=2.4.3
# ENV HADOOP_VERSION=2.7
# ENV SCALA_VERSION=2.12.4
# ENV SCALA_HOME=/usr/share/scala
# ENV SPARK_HOME=/spark

I can see it downloading Spark 3.x, yet the master node still installs Spark 2.4.3. Where is it hard-coded?
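A hedged way to track this down: the stale version usually comes either from the Docker build cache or from another Dockerfile in the repo (master/worker/submit images) that pins its own SPARK_VERSION, so listing every pinned value and forcing a cache-free rebuild is a reasonable first step:

grep -rn "SPARK_VERSION" --include=Dockerfile .
docker builder prune -f   # or add --no-cache to the docker build calls in build-images.sh
./build-images.sh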

Job not started

Hello,
when I run a job I get:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
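A hedged first check: this message usually means either no workers are registered with the master (visible on the 8080 UI) or the job asks for more cores/memory than any single worker advertises. Capping the request below the workers' configured sizes often unblocks it (Spark home and jar path are placeholders):

/opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --total-executor-cores 1 \
  --executor-memory 512m \
  /opt/spark-apps/your-app.jar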

Zeppelin integration

Not sure if this is the best place to request a new feature, but it would be great if Zeppelin could be added, or if you could explain how to use Zeppelin with such a cluster.

Many thanks
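Until it is documented in the repo, a hedged way to try Zeppelin against this cluster without touching the compose file is to attach the official image to the same network (network name borrowed from another issue on this page; the interpreter's Spark version must match the cluster's):

docker run --rm -p 8085:8080 \
  --network docker-spark-cluster_spark-network \
  apache/zeppelin:0.9.0
# Then set "master" to spark://spark-master:7077 in Zeppelin's Spark
# interpreter settings.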

Issue while executing docker-compose

Hi all, I am facing issues while executing docker-compose up -d.

I am getting the error below:

: No such file or directorypark.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-worker-b_1 | /start-spark.sh: line 2: $'\r': command not found
spark-worker-b_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-worker-b_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
: No such file or directoryrk.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-master_1 | /start-spark.sh: line 2: $'\r': command not found
spark-master_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-master_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
: No such file or directorypark.sh: line 1: /opt/spark/bin/load-spark-env.sh
spark-worker-a_1 | /start-spark.sh: line 2: $'\r': command not found
spark-worker-a_1 | /start-spark.sh: line 10: syntax error near unexpected token `elif'
'park-worker-a_1 | /start-spark.sh: line 10: `elif [ "$SPARK_WORKLOAD" == "worker" ];
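The $'\r': command not found lines are the classic symptom of start-spark.sh having been checked out with Windows (CRLF) line endings. A hedged fix is to convert the scripts back to LF and rebuild the images:

dos2unix start-spark.sh            # or: sed -i 's/\r$//' start-spark.sh
git config core.autocrlf input     # keep future checkouts LF-only
docker-compose build && docker-compose up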

Failed driver state

I followed the instructions from README.md, but I ran into some issues:

  1. In docker-compose.yml, version: "3.7" is not supported by my docker-compose (I changed it to version: "3.3").

  2. Some files have changed and no longer match the instructions in README.md; for example, some versions were changed to "latest".

After all that I got the cluster started, but spark-submit does not work.
I used the script from spark-submit/crimes-app.sh, only changing it for my application.

log:

./crimes-app.sh
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/08/23 16:48:33 INFO SecurityManager: Changing view acls to: root
19/08/23 16:48:33 INFO SecurityManager: Changing modify acls to: root
19/08/23 16:48:33 INFO SecurityManager: Changing view acls groups to: 
19/08/23 16:48:33 INFO SecurityManager: Changing modify acls groups to: 
19/08/23 16:48:33 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/08/23 16:48:33 INFO Utils: Successfully started service 'driverClient' on port 33063.
19/08/23 16:48:33 INFO TransportClientFactory: Successfully created connection to spark-master/172.19.0.2:7077 after 22 ms (0 ms spent in bootstraps)
19/08/23 16:48:33 INFO ClientEndpoint: Driver successfully submitted as driver-20190823164833-0002
19/08/23 16:48:33 INFO ClientEndpoint: ... waiting before polling master for driver state
19/08/23 16:48:38 INFO ClientEndpoint: ... polling master for driver state
19/08/23 16:48:39 INFO ClientEndpoint: State of driver-20190823164833-0002 is FAILED
19/08/23 16:48:39 INFO ShutdownHookManager: Shutdown hook called
19/08/23 16:48:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-01a41054-33c9-49c6-9550-c1537854553a

Please help me to solve this issue.
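In cluster deploy mode the client only reports the final FAILED state; the real error is in the driver's stderr on whichever worker ran it. A hedged way to find it (container name and Spark home are illustrative, and the driver ID comes from the log above):

docker exec -it spark-worker-a ls /opt/spark/work/
docker exec -it spark-worker-a cat /opt/spark/work/driver-20190823164833-0002/stderr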

Timeout error

When accessing URLs like http://10.5.0.2:8080/ I get a timeout on macOS Catalina.
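On macOS the 10.5.0.0/16 addresses only exist inside Docker's Linux VM, so they time out from the host; the UI has to be reached through a port published in docker-compose.yml. A hedged way to confirm the master UI itself is healthy is to query it from inside the compose network (network name borrowed from another issue on this page):

docker run --rm --network docker-spark-cluster_spark-network \
  curlimages/curl -sI http://10.5.0.2:8080/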

mta_data not found

I had to run CREATE DATABASE mta_data; in the database container for the Python task to run successfully.
Maybe this statement could be run when creating the database image, or be specified as an instruction in the README.
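One hedged way to bake that in: the official database images (Postgres, MySQL, MariaDB) execute any *.sql file mounted under /docker-entrypoint-initdb.d the first time their data volume is created, so the statement from this issue can run automatically instead of by hand (directory name is illustrative):

mkdir -p init
echo "CREATE DATABASE mta_data;" > init/create-mta-data.sql
# Then mount ./init:/docker-entrypoint-initdb.d on the database service in
# docker-compose.yml and recreate its data volume.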

How can I pass arguments to the docker run command?

Hi Marcos, I'm trying to launch the task with Docker as you indicate, but I have to pass parameters to the job and, for whatever reason, Docker is not accepting them. According to the official Docker help, I have to use -e APP_ARGS, like this:

~/s/docker-spark-cluster> docker run --network docker-spark-cluster_spark-network -v /tmp/spark-apps:/opt/spark-apps --env SPARK_APPLICATION_JAR_LOCATION=$SPARK_APPLICATION_JAR_LOCATION --env SPARK_APPLICATION_MAIN_CLASS=$SPARK_APPLICATION_MAIN_CLASS -e APP_ARGS="/opt/spark-data/README.md /opt/spark-data/output-6" spark-submit:2.4.0

But when I go to the driver page, I see the following error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
at com.oreilly.learningsparkexamples.mini.scala.WordCount$.main(WordCount.scala:11)
at com.oreilly.learningsparkexamples.mini.scala.WordCount.main(WordCount.scala)
... 6 more

That error points directly to this line of code:

  val inputFile = args(0)

In other words, Docker is not passing the file as argument zero.
I have tried launching the job from the driver's IP, connecting to it and doing a traditional spark-submit with the same arguments, and everything works as expected.

I hope you can help me because it's driving me crazy. Can you think of what I'm doing wrong?
Thank you.
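As a hedged workaround while the APP_ARGS handling is sorted out, the manual submission the reporter already validated looks roughly like this from inside the cluster; spark-submit passes everything after the jar to main() as arguments, and the paths and class name are taken from this issue (the jar name is a placeholder):

docker exec -it spark-master /spark/bin/spark-submit \
  --class com.oreilly.learningsparkexamples.mini.scala.WordCount \
  --master spark://spark-master:7077 \
  /opt/spark-apps/<your-jar>.jar \
  /opt/spark-data/README.md /opt/spark-data/output-6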

Read/write does not work

I am properly mounting each worker to the same folder location, but I cannot perform a spark.read.parquet without an error. I would guess this is due to the fact it is not accessing the file from an HDFS. Any workarounds or thoughts on how to submit a job that can read and write? Everything else is working properly.
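For reference, in standalone mode without HDFS the driver and every executor must see the data at the same path, so the same host directory has to be bind-mounted into the master and all workers, and the job has to use the shared container path. A hedged sanity check (container names and paths reuse ones from other issues on this page):

docker exec spark-master ls /opt/spark-data
docker exec spark-worker-a ls /opt/spark-data   # container name is illustrative
# In the job, read via the shared container path, e.g.
#   spark.read.parquet("/opt/spark-data/<dataset>.parquet")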

Cannot open 10.5.0.2:8080 spark-master

Hi, I cannot validate the cluster because I could not reach the spark-master web page.
It is weird, because everything looks fine when I run the docker-compose up command.

This is the output.

How to run python app?

Thank you for this repo.

One thing that was unclear to me: once I've got the Spark cluster up and running via docker-compose, if I have a PySpark script on my computer, can I simply run it and connect to this Spark cluster? Or does the Python app have to live in the container?

Say I had this file hello-spark.py

from pyspark.sql import SparkSession


def main():
    # Initialize SparkSession
    spark = SparkSession.builder \
        .appName("HelloWorld")  \
        .getOrCreate()

    # Create an RDD containing numbers from 1 to 10
    numbers_rdd = spark.sparkContext.parallelize(range(1, 11))

    # Count the elements in the RDD
    count = numbers_rdd.count()

    print(f"Count of numbers from 1 to 10 is: {count}")

    # Stop the SparkSession
    spark.stop()


if __name__ == "__main__":
    main()
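A hedged answer sketch: the script does not have to live permanently in a container, but the submit has to happen where a matching Spark install can reach the master. Two common options, using paths that appear elsewhere in these issues (Spark home and the apps folder may differ in your build):

# (a) Copy the script into the cluster and submit from the master container:
docker cp hello-spark.py spark-master:/opt/spark-apps/hello-spark.py
docker exec -it spark-master /opt/spark/bin/spark-submit \
  --master spark://spark-master:7077 /opt/spark-apps/hello-spark.py

# (b) Submitting straight from the host only works if the host has the same
#     Spark and Python versions installed and can reach the master's 7077
#     port (published in docker-compose.yml, or via the container IP on Linux).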
