
docker-spark's Introduction


Spark docker

Docker images to:

  • Set up a standalone Apache Spark cluster running one Spark master and multiple Spark workers
  • Build Spark applications in Java, Scala or Python to run on a Spark cluster
Currently supported versions:
  • Spark 3.3.0 for Hadoop 3.3 with OpenJDK 8 and Scala 2.12
  • Spark 3.2.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.2.0 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.2 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.1.1 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.2 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.1 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 11 and Scala 2.12
  • Spark 3.0.0 for Hadoop 3.2 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.5 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.4 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.4.0 for Hadoop 2.8 with OpenJDK 8 and Scala 2.12
  • Spark 2.4.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.3.1 for Hadoop 2.8 with OpenJDK 8
  • Spark 2.3.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.2.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.3 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.1.0 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.2 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.1 for Hadoop 2.7+ with OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8
  • Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 7
  • Spark 1.6.2 for Hadoop 2.6 and later
  • Spark 1.5.1 for Hadoop 2.6 and later
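
The image tags on Docker Hub generally follow the pattern <spark version>-hadoop<hadoop version>. For example (a minimal illustration, matching the Compose file in the next section), pulling the master and worker images for Spark 3.3.0 / Hadoop 3.3:

docker pull bde2020/spark-master:3.3.0-hadoop3.3
docker pull bde2020/spark-worker:3.3.0-hadoop3.3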

Using Docker Compose

Add the following services to your docker-compose.yml to integrate a Spark master and Spark worker in your BDE pipeline:

version: '3'
services:
  spark-master:
    image: bde2020/spark-master:3.3.0-hadoop3.3
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark
  spark-worker-1:
    image: bde2020/spark-worker:3.3.0-hadoop3.3
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-worker-2:
    image: bde2020/spark-worker:3.3.0-hadoop3.3
    container_name: spark-worker-2
    depends_on:
      - spark-master
    ports:
      - "8082:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
  spark-history-server:
    image: bde2020/spark-history-server:3.3.0-hadoop3.3
    container_name: spark-history-server
    depends_on:
      - spark-master
    ports:
      - "18081:18081"
    volumes:
      - /tmp/spark-events-local:/tmp/spark-events

Make sure to fill in the INIT_DAEMON_STEP as configured in your pipeline.
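
To bring the cluster up, run docker-compose in the directory that holds this file, for example:

docker-compose up -d
# Master UI: http://localhost:8080, worker UIs: http://localhost:8081 and http://localhost:8082,
# history server: http://localhost:18081 (port mappings as in the Compose file above)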

Running Docker containers without the init daemon

Spark Master

To start a Spark master:

docker run --name spark-master -h spark-master -d bde2020/spark-master:3.3.0-hadoop3.3

Spark Worker

To start a Spark worker:

docker run --name spark-worker-1 --link spark-master:spark-master -d bde2020/spark-worker:3.3.0-hadoop3.3
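
Note that --link is a legacy Docker feature; an equivalent setup on a user-defined bridge network could look like this (a sketch, reusing the SPARK_MASTER variable shown in the Compose file above):

docker network create spark-net
docker run --name spark-master -h spark-master --network spark-net -d bde2020/spark-master:3.3.0-hadoop3.3
docker run --name spark-worker-1 --network spark-net \
  -e "SPARK_MASTER=spark://spark-master:7077" -d bde2020/spark-worker:3.3.0-hadoop3.3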

Launch a Spark application

Building and running your Spark application on top of the Spark cluster is as simple as extending a template Docker image. Check the template's README for further documentation.
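
As a rough illustration only (a sketch; the template image name, tag and ONBUILD behaviour are assumptions, so follow the template's README for the authoritative instructions), a derived image for a Python application might be as small as:

# Assumes a bde2020/spark-python-template image that copies your application in via ONBUILD
FROM bde2020/spark-python-template:3.3.0-hadoop3.3

# Entry point of the application inside the image (hypothetical path)
ENV SPARK_APPLICATION_PYTHON_LOCATION /app/app.py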

Kubernetes deployment

The BDE Spark images can also be used in a Kubernetes environment.

To deploy a simple Spark standalone cluster, issue:

kubectl apply -f https://raw.githubusercontent.com/big-data-europe/docker-spark/master/k8s-spark-cluster.yaml

This will set up a Spark standalone cluster with one master and a worker on every available node, using the default namespace and resources. The master is reachable in the same namespace at spark://spark-master:7077. It will also set up a headless service so that Spark clients are reachable from the workers using the hostname spark-client.
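
To verify the deployment (a generic check, not part of these manifests' documentation), list the created resources in the default namespace:

kubectl get pods,services
# expect a spark-master pod and service, plus spark-worker pods on the available nodes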

Then, to use spark-shell, issue:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.3.0-hadoop3.3 -- bash ./spark/bin/spark-shell --master spark://spark-master:7077 --conf spark.driver.host=spark-client

To use spark-submit, issue, for example:

kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:3.3.0-hadoop3.3 -- bash ./spark/bin/spark-submit --class CLASS_TO_RUN --master spark://spark-master:7077 --deploy-mode client --conf spark.driver.host=spark-client URL_TO_YOUR_APP

You can use your own image packed with Spark and your application, but when deployed it must be reachable from the workers. One way to achieve this is to create a headless service for your pod and then use --conf spark.driver.host=YOUR_HEADLESS_SERVICE whenever you submit your application.
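
A minimal sketch of such a headless service (names and labels here are placeholders; the driver pod must carry the matching label, and you would pass --conf spark.driver.host=my-spark-driver on submit):

apiVersion: v1
kind: Service
metadata:
  name: my-spark-driver          # hostname the workers will resolve
spec:
  clusterIP: None                # headless: DNS resolves directly to the pod IP
  selector:
    app: my-spark-driver         # must match the label on your driver pod
  ports:
    - port: 7078                 # placeholder; the driver picks a random port unless spark.driver.port is set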


docker-spark's Issues

About CDH and Hive support?

Will this project support a Cloudera (CDH) image in the future? I found that many big-data Docker images use Apache Hadoop instead of CDH; however, in production environments CDH is the most common choice. I want to know the differences.

I also want to ask why the Hive image is only available for the earlier Spark versions,
like 'Flink 1.8.1 and Spark 2.0.0 for Hadoop 2.7+ with Hive support and OpenJDK 8'. Why can't we have an all-in-one image like 'Spark 2.4.1 for Hadoop 2.8 with Hive, OpenJDK 8 and Scala 2.12'?

Spark error

Hi! I submitted a spark job in the docker-spark-cluster where the spark-master was bde2020/spark-master:1.5.1-hadoop2.6 and the spark-worker(s) were bde2020/spark-worker:1.5.1-hadoop2.6.
I got an error that I believe is spark-related ("You need to install libgfortran3."). Is this an error that is resolved by a newer version, or is it something new?
I attached the error so you can see it in full.
error.txt

Connection refused when attempt to connect to Spark master in Swarm mode

I want to run a python application on a Spark cluster, but my application is not necessarily inside the master container. The following is my compose yaml file:

version: '3'
services:
  spark-master:
    image: bde2020/spark-master
    deploy:
      placement:
        constraints:
          [node.hostname == master]
    environment:
      - INIT_DAEMON_STEP=setup_spark
    ports:
      - '6080:8080'
      - '6077:7077'
      - '6040:4040'
  spark-worker-1:
    image: bde2020/spark-worker
    depends_on:
      - 'spark-master'
    environment:
      - 'SPARK_MASTER=spark://spark-master:7077'
    ports:
      - '6081:8081'
      - '6041:4040'

When I create a stack, run these two containers on my Swarm cluster, and run my python application with the following SparkSession configuration, I receive a connection refused error.

spark = SparkSession.builder \
    .master("spark://PRIVATE_HOST_IP:6077") \
    .appName("Spark Swarm") \
    .getOrCreate()

On the other hand, when I run those containers in normal mode with docker-compose up, the same python application with the same SparkSession configuration works like a charm. Obviously, that is not desirable, since I want to be able to scale up and down, so I am looking for a way to run my application in Swarm mode.

The strange thing about my issue is that I am pretty sure the port mapping is done correctly, because after setting up the stack I am able to connect to the Spark UI via port 6080, which is the port that I have mapped to Spark's 8080.

Another point is that I have successfully connected to other containers, like Cassandra and Kafka, via the same approach (mapping the serving ports of those containers to host ports and connecting to those ports on the host), but the same avenue is not working for the Spark container.

Start Master with a properties-file

I would like to use ZooKeeper for High-Availability of Master.

In order to do that, I've created a ha.conf file:

spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=<zookeeper_host>:2181
spark.deploy.zookeeper.dir=/spark

The next step is to run start-master.sh with --properties-file ha.conf,
but I couldn't find any way to do that here.

Thank you in advance,
Haim
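
One possible workaround, as a sketch only (not verified against these images; it assumes the master start script passes Spark's standard SPARK_DAEMON_JAVA_OPTS variable through to the JVM), is to pass the same properties as -D system properties via the environment, e.g. in docker-compose:

  spark-master:
    image: bde2020/spark-master:3.3.0-hadoop3.3
    environment:
      # equivalent of ha.conf, passed as JVM system properties to the standalone master
      - SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=<zookeeper_host>:2181 -Dspark.deploy.zookeeper.dir=/spark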

Testing on docker swarm

Please provide docker-compose v2 definition compatible with docker swarm and instructions on how to run it inside swarm.

Python application and deploy-mode

Hi everyone,

Thanks for the effort: these docker images are really a good starting point!
I am working with python applications on standalone clusters.

I am running into a problem where I cannot run a python app if I try to launch it from a remote location (i.e. from a spark-submit (or derivative) docker container). If I try to run the app from the master or worker containers, everything runs fine.

The problem seems to be that the executors cannot be registered properly, so new executors are created endlessly and nothing is ever computed.
I saw that when apps are to be submitted remotely, the "--deploy-mode" option should be set to "cluster" to reduce network latency (source):

A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process which acts as a client to the cluster. The input and output of the application is attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Currently, the standalone mode does not support cluster mode for Python applications.

The problem is that this option is not available for python apps, which kind of leaves the implementation of python apps in a deadlock (with the proposed configuration of submitting via a remote container).

So I guess my only option is to implement a pipeline that ignores the spark-submit container and instead extends the worker container into something like a worker + launch-app container.

Does anyone have any alternatives?

Thanks
A

Pull request for 2.0.2-hadoop2.7

Hi there,

Firstly, thanks for this awesome set of projects.

I forked it and made a new branch for Spark 2.0.2-hadoop2.7 which I'd like to submit a pull request for. However, I wasn't exactly sure how to go about that whilst maintaining that:

  1. It's in a new branch: "2.0.2-hadoop2.7". I, understandably, cannot create branches here.
  2. No strange Docker automatic builds happen. I don't know the settings that are in place in that regard.

How was this handled with the 2.0.1-hadoop2.7 branch? Shall we repeat history, or do you have another idea?

Regards,
Bilal.

spark-submit 2.4.0-hadoop2.8 looks to be using spark 2.3.2

Got an error when running a spark job with the new foreachBatch feature using the new 2.4.0.

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.streaming.DataStreamWriter.foreachBatch(Lorg/apache/spark/api/java/function/VoidFunction2;)Lorg/apache/spark/sql/streaming/DataStreamWriter;

There could possibly have been a race condition when the submit image was being rebuilt with 2.4.0:

1: Base image built using 2.3.2 with 2.4.0-hadoop2.8 tag
https://hub.docker.com/r/bde2020/spark-base/builds/bmyk2bwzrf45mtnm2zwj5d7/

2a: Create submit built using current 2.4.0-hadoop2.8 tag
https://hub.docker.com/r/bde2020/spark-submit/builds/bs2wmyvasv3mys5liazbp2f/

2b: Rebuilding 2.4.0-hadoop2.8 tag with 2.4.0
https://hub.docker.com/r/bde2020/spark-base/builds/b9qiydy4a2k6xfx3mfkizdn/

Would it be possible to trigger the spark-submit image to get rebuilt?

Spark UI has strange 'band' on top

Hi!
Firstly, thanks for providing these, they are very useful for getting started quickly with Spark.

However, there is a minor issue: with any of the newer images (I have tried bde2020/spark-master:2.1.1-hadoop2.7, bde2020/spark-master:2.1.0-hadoop2.8-hive-java8, and latest) I am seeing a strange artifact at the top of the Spark UI. This happens whether I use Google Chrome or Microsoft Edge.

[screenshot "capture" attached]

This causes an obvious usability problem because I can't see the other sections (like SQL / Executors etc.) Please suggest if I'm doing something wrong or if there is a known issue around this.

Thanks!

Spark Python template

Hi,

I had the following problem when trying to use the Spark Python template to run a Spark application on top of the Spark cluster.

The submit.sh file, which is executed from template.sh, which in turn is executed from this Dockerfile, cannot find the file app/app.py, the value of the env variable SPARK_APPLICATION_PYTHON_LOCATION (see line 19 in submit.sh). The problem is solved by replacing ONBUILD COPY with COPY in the Dockerfile.

In addition, the python packages in requirements.txt were also not installed. To solve the problem I had to add the following lines before line 19 in submit.sh:

cd /app 
pip3 install -r requirements.txt 
cd ..

The Dockerfile that I used (instead of this) is given below:

FROM bde2020/spark-submit:2.3.0-hadoop2.7

COPY template.sh /
COPY requirements.txt /app/requirements.txt
COPY app.py /app/app.py
COPY submit.sh /submit.sh

ENV SPARK_APPLICATION_PYTHON_LOCATION /app/app.py

CMD ["/bin/bash", "/template.sh"]

The submit.sh file that I used is given below:

#!/bin/bash

export SPARK_MASTER_URL=spark://${SPARK_MASTER_NAME}:${SPARK_MASTER_PORT}
export SPARK_HOME=/spark

/wait-for-step.sh


/execute-step.sh


cd /app 
pip3 install -r requirements.txt
cd ..

if [ -f "${SPARK_APPLICATION_JAR_LOCATION}" ]; then
    echo "Submit application ${SPARK_APPLICATION_JAR_LOCATION} with main class ${SPARK_APPLICATION_MAIN_CLASS} to Spark master ${SPARK_MASTER_URL}"
    echo "Passing arguments ${SPARK_APPLICATION_ARGS}"
    /spark/bin/spark-submit \
        --class ${SPARK_APPLICATION_MAIN_CLASS} \
        --master ${SPARK_MASTER_URL} \
        ${SPARK_SUBMIT_ARGS} \
        ${SPARK_APPLICATION_JAR_LOCATION} ${SPARK_APPLICATION_ARGS}
else
    if [ -f "${SPARK_APPLICATION_PYTHON_LOCATION}" ]; then
        echo "Submit application ${SPARK_APPLICATION_PYTHON_LOCATION} to Spark master ${SPARK_MASTER_URL}"
        echo "Passing arguments ${SPARK_APPLICATION_ARGS}"
        PYSPARK_PYTHON=python3 /spark/bin/spark-submit \
            --master ${SPARK_MASTER_URL} \
            ${SPARK_SUBMIT_ARGS} \
            ${SPARK_APPLICATION_PYTHON_LOCATION} ${SPARK_APPLICATION_ARGS}
    else
        echo "Not recognized application."
    fi
fi
/finish-step.sh

The application app.py was launched as described in the notes (docker-spark, spark-python), but with the addition of -e ENABLE_INIT_DAEMON=false (see this issue):

docker pull bde2020/spark-master:2.3.0-hadoop2.7

docker run --name spark-master -h spark-master -e ENABLE_INIT_DAEMON=false -d bde2020/spark-master:2.3.0-hadoop2.7

docker build --rm -t bde/spark-app .

docker run --name my-spark-app --link spark-master:spark-master -e ENABLE_INIT_DAEMON=false bde/spark-app

Getting java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy

Hi, I'm getting an error when I submit a spark application from a bde2020/spark-submit:2.4.0-hadoop2.7 container to the bde2020/spark-submit:2.4.0-hadoop2.7 container:

[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 0.0 (TID 0, 172.18.0.12, executor 0): java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-791b98df-469b-42a9-922e-5cf3481bc613-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-791b98df-469b-42a9-922e-5cf3481bc613-libsnappyjava.so)
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1941)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1824)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1086)
	at org.xerial.snappy.SnappyLoader.loadNativeLibrary(SnappyLoader.java:179)
	at org.xerial.snappy.SnappyLoader.loadSnappyApi(SnappyLoader.java:154)
	at org.xerial.snappy.Snappy.<clinit>(Snappy.java:47)
	at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:435)
	at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:466)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
...
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.xerial.snappy.Snappy
	at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:435)
	at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:466)
	at java.io.DataInputStream.readByte(DataInputStream.java:265)
	at org.apache.kafka.common.utils.ByteUtils.readVarint(ByteUtils.java:168)
	at org.apache.kafka.common.record.DefaultRecord.readFrom(DefaultRecord.java:292)
	at org.apache.kafka.common.record.DefaultRecordBatch$1.readNext(DefaultRecordBatch.java:264)
	at org.apache.kafka.common.record.DefaultRecordBatch$RecordIterator.next(DefaultRecordBatch.java:563)
	at org.apache.kafka.common.record.DefaultRecordBatch$RecordIterator.next(DefaultRecordBatch.java:532)
	at org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.nextFetchedRecord(Fetcher.java:1146)
	at org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.fetchRecords(Fetcher.java:1181)
	at org.apache.kafka.clients.consumer.internals.Fetcher$PartitionRecords.access$1500(Fetcher.java:1035)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchRecords(Fetcher.java:544)
	at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:505)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1259)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1187)
...

I'm submitting the application using this command from the spark-submit container:

/spark/bin/spark-submit \
  --class "Main" \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  --driver-class-path /app/pipeline/target/scala-2.11/streaming.jar \
  /app/pipeline/target/scala-2.11/streaming.jar

I didn't have this problem a month ago when I upgraded to spark 2.4.0, but after rebuilding the images last week I haven't been able to submit the application. Any ideas about what I might be doing wrong?

Thanks for your time,
Kinzeng

question to environment configuration

How do I read this environment setting:

"constraint:node==<yourmasternode>"?

I mean, in a Docker Spark environment you would just have a master node and a set of worker nodes, right?

Add support for submitting application jar by specifying application jar as URL

At the moment the submit docker image requires the application jar to be available to the image. I would like to use the submit image for submitting an application to a remote cluster where the application jar is in S3, using the following command:

docker run --rm -it \
    -e ENABLE_INIT_DAEMON=false \
    -e SPARK_MASTER_NAME=$SPARK_MASTER \
    -e SPARK_MASTER_PORT=7077 \
    -e "SPARK_APPLICATION_JAR_LOCATION=http://s3-bucket/my.jar" \
    -e SPARK_APPLICATION_MAIN_CLASS=com.armis.aggregations.Main \
    spark-submit:spark-2.3.2-hadoop2.7

Spark 2.4.1

It would be nice to see Spark 2.4.1 images, as it contains fixes for some annoying bugs.

About spark submit

I have run the spark master container. If I need to submit an application, should I use the spark-submit script inside the spark master, or should I run the spark-submit container to submit my application?

Illegal character in hostname at index 33: hdfs://namenode.dockerhadoopspark_default:9870

My code looks like this:

        String master = "spark://spark-master:7077";

        SparkConf sparkConf = new SparkConf().setAppName("Spark WordCount Application (java)").setMaster(master);

        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);

        String hdfsBasePath = "hdfs://" + hdfsHost + ":" + hdfsPort;
        String inputPath = hdfsBasePath + "/input/" + textFileName;
        String outputPath = hdfsBasePath + "/output/"
                + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
        JavaRDD<Integer> distData = javaSparkContext.parallelize(data);

        distData.reduce((a, b) -> a + b);

        distData.saveAsTextFile(outputPath);

        javaSparkContext.close();
  1. Build the JAR and docker cp it into the spark-master container.
  2. Log in to the spark-master container and run: spark/bin/spark-submit --master spark://172.18.0.4:7077 --class com.bolingcavalry.sparkwordcount.WordCount --executor-memory 512m --total-executor-cores 2 sparkwordcount-1.0-SNAPSHOT.jar 172.18.0.2 9870 GoneWiththeWind.txt
  3. The log shows this error:
Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in hostname at index 33: hdfs://namenode.dockerhadoopspark_default:9870
        at org.apache.hadoop.net.NetUtils.getCanonicalUri(NetUtils.java:274)
        at org.apache.hadoop.hdfs.DistributedFileSystem.canonicalizeUri(DistributedFileSystem.java:1577)
        at org.apache.hadoop.fs.FileSystem.getCanonicalUri(FileSystem.java:235)
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:623)
        at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:468)
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:122)
        at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:287)
        at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
        at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
        at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)
        at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)
        at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)
        at com.bolingcavalry.sparkwordcount.WordCount.main(WordCount.java:105)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Illegal character in hostname at index 33: hdfs://namenode.dockerhadoopspark_default:9870
        at java.net.URI$Parser.fail(URI.java:2848)
        at java.net.URI$Parser.parseHostname(URI.java:3387)
        at java.net.URI$Parser.parseServer(URI.java:3236)
        at java.net.URI$Parser.parseAuthority(URI.java:3155)
        at java.net.URI$Parser.parseHierarchical(URI.java:3097)
        at java.net.URI$Parser.parse(URI.java:3053)
        at java.net.URI.<init>(URI.java:673)
        at org.apache.hadoop.net.NetUtils.getCanonicalUri(NetUtils.java:272)
        ... 50 more

my docker-compost.yml

version: "3"
services:
  namenode:
    image: bde2020/hadoop-namenode:2.0.0-hadoop3.1.1-java8
    hostname: namenode
    container_name: namenode
    ports:
      - 9870:9870
    volumes:
      - hadoop_namenode:/hadoop/dfs/name
    environment:
      CLUSTER_NAME: test
      INIT_DAEMON_STEP: setup_hdfs
      VIRTUAL_HOST: hdfs-namenode.demo.big-data-europe.local
    env_file:
      - ./hadoop.env

  datanode:
    image: bde2020/hadoop-datanode:2.0.0-hadoop3.1.1-java8
    hostname: datanode
    container_name: datanode
    volumes:
      - hadoop_datanode:/hadoop/dfs/data
    environment:
      SERVICE_PRECONDITION: "namenode:9870"
    ports:
      - 9864:9864
    env_file:
      - ./hadoop.env

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:2.0.0-hadoop3.1.1-java8
    hostname: resourcemanager
    container_name: resourcemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864"
      VIRTUAL_HOST: hdfs-resourcemanager.demo.big-data-europe.local
    ports:
      - 8088:8088
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:2.0.0-hadoop3.1.1-java8
    hostname: nodemanager
    container_name: nodemanager
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864 resourcemanager:8088"
      VIRTUAL_HOST: hdfs-nodemanager.demo.big-data-europe.local
    ports:
      - 8042:8042
    env_file:
      - ./hadoop.env

  historyserver:
    image: bde2020/hadoop-historyserver:2.0.0-hadoop3.1.1-java8
    hostname: historyserver
    container_name: historyserver
    environment:
      SERVICE_PRECONDITION: "namenode:9870 datanode:9864 resourcemanager:8088"
    volumes:
      - hadoop_historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env
  spark-master:
    image: bde2020/spark-master:2.4.1-hadoop2.7
    hostname: spark-master
    container_name: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      INIT_DAEMON_STEP: setup_spark
      VIRTUAL_HOST: spark-master.demo.big-data-europe.local
  spark-worker-1:
    image: bde2020/spark-worker:2.4.1-hadoop2.7
    container_name: spark-worker-1
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"
volumes:
  hadoop_namenode:
  hadoop_datanode:
  hadoop_historyserver:

hadoop.env :

CORE_CONF_fs_defaultFS=hdfs://namenode:9000
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle

MAPRED_CONF_mapreduce_framework_name=yarn
MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192
MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/

How can I fix this error? Please help me.

How to apply spark-cluster dynamic resource allocation

How do I apply dynamic resource allocation on the spark cluster?

I would like to set num-executors, executor-cores and executor-memory on the spark cluster,
but only executor-memory seems to have any effect...

How do I apply num-executors and executor-cores?
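
For reference, a sketch only (not specific to these images, and the application jar path below is a placeholder): on a standalone cluster spark-submit generally does not honour --num-executors (that option targets YARN), so the executor count is steered indirectly with --total-executor-cores and --executor-cores, alongside --executor-memory:

/spark/bin/spark-submit \
  --master spark://spark-master:7077 \
  --executor-memory 2g \
  --executor-cores 2 \
  --total-executor-cores 6 \
  /app/your-app.jar

With 2 cores per executor and 6 total cores, this yields roughly 3 executors across the cluster.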

error when trying to copy large data

I was able to use your docker-compose.yml to get spark up and running on my computer. I am able to copy small data to it using sparklyr. For example:

# Connect to your Spark cluster
spark_conn <- spark_connect(master ="spark://192.168.86.31:7077",
                            spark_home = "/usr/local/spark-2.4.0-bin-hadoop2.7")

# Copy iris to Spark
track_metadata_tbl <- copy_to(spark_conn, iris, overwrite = TRUE)

# List the data frames available in Spark
src_tbls(spark_conn)
[1] "iris"

Works without any problem. However, when I try to copy a larger dataset I get the following errors:

track_metadata_tbl <- copy_to(spark_conn, track_metadata, overwrite = TRUE)
Error: java.lang.ArrayIndexOutOfBoundsException: 14 at sparklyr.Utils$$anonfun$11$$anonfun$12.apply(utils.scala:337) at sparklyr.Utils$$anonfun$11$$anonfun$12.apply(utils.scala:336) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofInt.foreach(ArrayOps.scala:234) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofInt.map(ArrayOps.scala:234) at sparklyr.Utils$$anonfun$11.apply(utils.scala:336) at sparklyr.Utils$$anonfun$11.apply(utils.scala:334) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at sparklyr.Utils$.createDataFrameFromText(utils.scala:334) at sparklyr.Utils.createDataFrameFromText(utils.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at sparklyr.Invoke.invoke(invoke.scala:139) at sparklyr.StreamHandler.handleMethodCall(stream.scala:123) at sparklyr.StreamHandler.read(stream.scala:66) at sparklyr.BackendHandler.channelRead0(handler.scala:51) at sparklyr.BackendHandler.channelRead0(handler.scala:4) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) at java.lang.Thread.run(Thread.java:748)

Someone on the RStudio Community forum told me that the problem might be the driver memory. Alas, I'm not sure how to modify that. Could you point me in the right direction?

In case it helps:

[screenshots of the Spark UI attached]

I think what I need to do is increase the "Storage Memory" from 384.1 MB to something like 2 GB. Is that right? If so, how can I do that?

Thanks a lot for the help!
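
A possible direction, as a sketch only (the values below are placeholders): the "Storage Memory" shown in the UI is a fraction of the JVM heap, so it grows when spark.driver.memory (and, for the workers, spark.executor.memory) is raised. Driver memory must be set before the JVM starts, so one place to put it is $SPARK_HOME/conf/spark-defaults.conf of the Spark installation that launches the driver:

# spark-defaults.conf (illustrative values)
spark.driver.memory    4g
spark.executor.memory  2g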

Wrong driverHostname for Application UI

Hi, I am using your wonderful docker images.
When I moved up from a cluster with

spark-base:2.2.0-hadoop2.8-hive-java8

to a cluster with the most recent version:

spark-master:2.3.1-hadoop2.7

I noticed that the Web UI link is no longer a valid URL.
Before, it pointed to http://<worker ip>:4040, while now it is a strange http://a585a25be0f7:4040, which is not reachable (I have to fix the URL by hand).

Do you know what could be the cause?

Can't open /spark/sbin/xxxx.sh

I built the spark-base image, then built the master and worker images, and when running docker-compose.yaml I found this:

D:\project\idea\myproject\demo-test\demo-spark-env-docker (master)
λ docker-compose up
Creating spark-master ... done
Creating spark-worker-1 ... done
Attaching to spark-master, spark-worker-1
spark-master | sh: 0: Can't open /spark/sbin/spark-config.sh
spark-master | sh: 0: Can't open /spark/bin/load-spark-env.sh
spark-master | ln: failed to create symbolic link ‘/spark/logs/spark-master.out\r’: No such file or directory
/bin/load-spark-env.sh: No such file or directorys: line 24: /spark
/assembly/target/scala-/jars).find Spark jars directory (/spark
spark-master | You need to build Spark with the target "package" before running this program.
: No such file or directory.sh: line 9: /spark/logs/spark-master.out
spark-worker-1 | sh: 0: Can't open /spark/sbin/spark-config.sh
spark-worker-1 | sh: 0: Can't open /spark/bin/load-spark-env.sh
spark-worker-1 | ln: failed to create symbolic link ‘/spark/logs/spark-worker.out\r’: No such file or directory
/bin/load-spark-env.sh: No such file or directorys: line 24: /spark
/assembly/target/scala-/jars).find Spark jars directory (/spark
spark-worker-1 | You need to build Spark with the target "package" before running this program.
: No such file or directory.sh: line 8: /spark/logs/spark-worker.out
spark-master exited with code 1
spark-worker-1 exited with code 1

D:\project\idea\myproject\demo-test\demo-spark-env-docker (master)
λ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
lyuanx/spark-worker 2.3.0-hadoop2.7 7a9f2b339fe8 4 days ago 936MB
lyuanx/spark-master 2.3.0-hadoop2.7 56893c2b4680 4 days ago 936MB
lyuanx/spark-base 2.3.0-hadoop2.7 43ae73ee9a75 4 days ago 936MB
java 8 d23bdf5b1b1b 15 months ago 643MB

D:\project\idea\myproject\demo-test\demo-spark-env-docker (master)
λ docker run -it 56893c2b4680 /bin/bash
root@a9fd5d21e621:/# cd /spark/sbin/
root@a9fd5d21e621:/spark/sbin# cat spark-config.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at

Can you tell me why? I don't understand.

Error connecting to spark cluster from another docker service

Hey guys,
I'm trying to set up a standalone spark cluster in docker, and I found that this docker image is the most up to date and easiest to use for spinning up a spark cluster. I'm able to get the cluster running just fine, and I'm able to connect to it from inside the spark-master node using spark-shell, but I'm running into an error when I try to connect to it from another docker service.

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:293)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	... 4 more
Caused by: java.io.IOException: Failed to connect to processor:34117
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
	at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
	at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: processor
	at java.net.InetAddress.getAllByName0(InetAddress.java:1280)
	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
	at java.net.InetAddress.getByName(InetAddress.java:1076)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
	at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
	at java.security.AccessController.doPrivileged(Native Method)
	at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
	at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
	at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
	at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
	at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
	at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
	at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
	at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
	... 1 more

I think the issue here is that spark is unable to find my processor service:

Caused by: java.net.UnknownHostException: processor

Any ideas on how to solve this issue? Thanks for your time!

Update README file on master branch

The master branch README file is reflected in the Docker Hub description. It needs to be updated to include all the versions of Spark that are currently supported (see branches).

Exception when submitting to spark 2.2.2

Hi,
I think there may be a version mismatch in the 2.2.2-hadoop2.7 version of your submit container.
When I run my app on the 2.2.2-hadoop2.7 cluster I get an exception (most probably a version mismatch):

19/03/07 15:43:06 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 6892155609033431893
java.io.InvalidClassException: org.apache.spark.storage.BlockManagerId; local class incompatible: stream classdesc serialVersionUID = -3720498261147521051, local class serialVersionUID = 6155820641931972169
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1781)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)

When I looked inside a running spark-submit 2.2.2-hadoop2.7 container I noticed that spark version is actually different:
Spark 2.2.1 built for Hadoop 2.7.3

Could this be the cause of the issue? Any idea how I can make it run properly?

My pom looks OK and should not introduce any incompatibilities:

<properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

        <java.version>1.8</java.version>
        <scala.version>2.11.12</scala.version>
        <spark-version>2.2.2</spark-version>
        <spark-cassandra-connector.version>2.0.11</spark-cassandra-connector.version>
        <paho.client.version>1.2.1</paho.client.version>

    </properties>

    <dependencies>
        <!-- scala-maven-plugin determines the Scala version to use from this dependency -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark-version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark-version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark-version}</version>
            <scope>provided</scope>
        </dependency>

        <!-- https://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector -->
        <dependency>
            <groupId>com.datastax.spark</groupId>
            <artifactId>spark-cassandra-connector_2.11</artifactId>
            <version>${spark-cassandra-connector.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark-version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark-version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>log4j</groupId>
                    <artifactId>log4j</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <dependency>
            <groupId>org.eclipse.paho</groupId>
            <artifactId>org.eclipse.paho.client.mqttv3</artifactId>
            <version>${paho.client.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>spark-streaming-mqtt_2.11</artifactId>
            <version>${spark-version}</version>
        </dependency>



    </dependencies>

EDIT:
Yup, changing all versions to 2.2.1 allowed me to run my app; nevertheless, the spark-submit 2.2.2-hadoop2.7 container needs to be fixed, as it ships the wrong spark version.

Image for Spark 2.4.0 and Scala 2.12.X

Spark 2.4.0 has introduced official (and experimental) support for Scala 2.12.

It would be awesome to have an image for the many people interested in support for the current stable version of Scala.
My hunch is that this could be another incentive to adopt Spark.

Thanks in advance!

Support for Spark 2.2.2, 2.1.3, and 2.1.2

Hi all,

I would like to propose pushing some additional docker images to your Docker Hub.

My suggestions are as follows:

  • 2.2.2-hadoop2.7
  • 2.1.3-hadoop2.7
  • 2.1.2-hadoop2.7

What do you guys think about adding these versions? If so, can I contribute with a PR?

Many thanks!

entrypoint.sh not executed when starting the container

I'm using this docker image to launch a series of slaves/masters and execute workloads using HDFS as the file system. I set the required environment variables, such as CORE_CONF_fs_defaultFS or CLUSTER_NAME, but they don't seem to have any effect on the hadoop configuration files. I've realised that the entrypoint.sh script that uses these environment variables is not executed. This script is included in the docker-hadoop image, which is the base of this docker-spark image.

Now, I'm not sure whether launching a docker container means launching all of the inherited entry points of the parent images, but the HDFS configuration part is not working.

Python app does not invoke

Hi - I downloaded the code as is. In my case, the python template code does not execute at all.

What changes do I need to make to invoke the python template?

Exception: Java gateway process exited before sending its port number Error when try to run spark cluster in python app

I am trying to create a python app which uses a spark cluster made up of docker containers.

version: "3.6"
services:

  webapp:
    container_name: webapp
    image: todhm/flask_spark_webapp
    build:
      context: ./flask_with_spark
    working_dir: /app
    command: gunicorn -b 0.0.0.0:8000 --reload -w 4  wsgi:app
    networks:
        - sparknetwork
    ports:
      - "8000:8000"
    volumes:
        - ./flask_with_spark:/app
    depends_on:
      - spark-master
    environment:
     - SPARK_APPLICATION_PYTHON_LOCATION =/app/wsgi.py
     - SPARK_MASTER_NAME=spark-master
     - SPARK_MASTER_PORT=7077

  spark-master:
    image: bde2020/spark-master:2.3.0-hadoop2.7
    container_name: spark-master
    networks:
    - sparknetwork
    ports:
      - "8080:8080"
      - "7077:7077"
    environment:
      - INIT_DAEMON_STEP=setup_spark

  spark-worker-1:
    image: bde2020/spark-worker:2.3.0-hadoop2.7
    container_name: spark-worker-1
    networks:
    - sparknetwork
    depends_on:
      - spark-master
    ports:
      - "8081:8081"
    environment:
      - "SPARK_MASTER=spark://spark-master:7077"


networks:
    sparknetwork:


However, when I try to create a pyspark app in the webapp container with the following configuration, it gives me an error:

   spark = SparkSession.\
            builder.\
            master("spark://spark-master:7077").\
            config("spark.submit.deployMode", "cluster").\
            config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.3").\
            config("spark.executor.memory", self.executor_memory).\
            getOrCreate()
Traceback (most recent call last):
  File "/app/spark/tests.py", line 36, in test_write_duplicate_names
    self.sa.return_all_books()
  File "/app/spark/sparkapp.py", line 37, in return_all_books
    self.create_spark_app()
  File "/app/spark/sparkapp.py", line 33, in create_spark_app
    getOrCreate()
  File "/usr/local/lib/python3.4/dist-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/usr/local/lib/python3.4/dist-packages/pyspark/context.py", line 343, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/usr/local/lib/python3.4/dist-packages/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/usr/local/lib/python3.4/dist-packages/pyspark/context.py", line 292, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/usr/local/lib/python3.4/dist-packages/pyspark/java_gateway.py", line 93, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number

How can I solve this problem?

Connection issue between worker and master

When I start up my Docker stack for Spark, I get this exception most of the time.

I am running the Docker stack in Google Cloud on a 10-machine cluster. Right now I can't even get a single Spark worker to connect to the master. This is a random occurrence.

I have three stacks:
1 - api (3 different services with 3 replicas each, proxied through Traefik)
2 - pipeline (just to launch the pipeline in standalone client mode)
3 - spark (1 worker and 6 masters)

From the Spark worker container I can ping the master:
PING master (10.0.23.55): 56 data bytes
64 bytes from 10.0.23.55: seq=0 ttl=64 time=0.114 ms
64 bytes from 10.0.23.55: seq=1 ttl=64 time=0.120 ms
64 bytes from 10.0.23.55: seq=2 ttl=64 time=0.116 ms
64 bytes from 10.0.23.55: seq=3 ttl=64 time=0.124 ms

stack.yml

version: '3.7'

networks:
  spark:
    name: spark
    attachable: true
  statsd:
    external: true
  hyperion-api_net_0:

x-deploy-template: &deploy-template
  placement:
    constraints:
      - node.role == worker


x-service-template: &service-template
  image: accern/hyperion:pipeline_${GIT_BRANCH:-master}
  env_file: ${HOME}/.env_settings/spark.env
  networks:
    - spark
    - statsd
    - hyperion-api_net_0

  logging:
    options:
      max-size: "500k"
  deploy:
    <<: *deploy-template

services:
  master:
    <<: *service-template
    image: accern/hyperion:spark_master_master
    environment:
      - ENABLE_INIT_DAEMON=false
      - INIT_DAEMON_STEP=setup_spark
      - SPARK_MASTER_HOST=master
      - SPARK_PUBLIC_DNS=xxx
    ports:
      - 8081:8080
      - 7077:7077
      - 4040:4040

  worker:
    <<: *service-template
    image: accern/hyperion:spark_worker_master
    environment:
      - "SPARK_MASTER=spark://master:7077"
      - ENABLE_INIT_DAEMON=false
      - SPARK_WORKER_WEBUI_PORT=8082
      - SPARK_PUBLIC_DNS=xxx
    ports:
      - 8082:8082
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == worker

2019-04-05 14:47:14 INFO Worker:2566 - Started daemon with process name: 289@d2e88e21a245
2019-04-05 14:47:14 INFO SignalUtils:54 - Registered signal handler for TERM
2019-04-05 14:47:14 INFO SignalUtils:54 - Registered signal handler for HUP
2019-04-05 14:47:14 INFO SignalUtils:54 - Registered signal handler for INT
2019-04-05 14:47:14 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-04-05 14:47:15 INFO SecurityManager:54 - Changing view acls to: root
2019-04-05 14:47:15 INFO SecurityManager:54 - Changing modify acls to: root
2019-04-05 14:47:15 INFO SecurityManager:54 - Changing view acls groups to:
2019-04-05 14:47:15 INFO SecurityManager:54 - Changing modify acls groups to:
2019-04-05 14:47:15 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2019-04-05 14:47:15 INFO Utils:54 - Successfully started service 'sparkWorker' on port 34297.
2019-04-05 14:47:15 INFO Worker:54 - Starting Spark worker 10.0.23.60:34297 with 2 cores, 3.0 GB RAM
2019-04-05 14:47:15 INFO Worker:54 - Running Spark version 2.4.0
2019-04-05 14:47:15 INFO Worker:54 - Spark home: /spark
2019-04-05 14:47:15 INFO log:192 - Logging initialized @2327ms
2019-04-05 14:47:16 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-04-05 14:47:16 INFO Server:419 - Started @2398ms
2019-04-05 14:47:16 WARN Utils:66 - Service 'WorkerUI' could not bind on port 8082. Attempting port 8083.
2019-04-05 14:47:16 INFO AbstractConnector:278 - Started ServerConnector@23a72034{HTTP/1.1,[http/1.1]}{0.0.0.0:8083}
2019-04-05 14:47:16 INFO Utils:54 - Successfully started service 'WorkerUI' on port 8083.
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7e8425b2{/logPage,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@17a461a7{/logPage/json,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@204a97bf{/,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@b73e6d{/json,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@342ace85{/static,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e2b5350{/log,null,AVAILABLE,@spark}
2019-04-05 14:47:16 INFO WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://xxxx:8083
2019-04-05 14:47:16 INFO Worker:54 - Connecting to master master:7077...
2019-04-05 14:47:16 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@20c10f1d{/metrics/json,null,AVAILABLE,@spark}
2019-04-05 14:47:16 WARN Worker:87 - Failed to connect to master master:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:253)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to master/10.0.23.55:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: master/10.0.23.55:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
... 11 more

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I'm running into an issue when trying to submit a job in client mode against the standalone cluster, or even when just opening the spark-shell. I only get this warning message:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

and the job never executes, but it also never ends; it just keeps retrying.

If I set deploy-mode to cluster, the job works.

I deploy the standalone cluster locally using docker-compose and also on DC/OS (Marathon).

Any idea?
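
For what it's worth, when the cluster UI shows registered workers with free cores and memory, this warning in client mode often means the executors cannot connect back to the driver, because the driver runs outside the workers' network. A hedged sketch, assuming the hostname and ports below are reachable from the worker containers:

from pyspark.sql import SparkSession

# Pin the driver's advertised host and ports so executors can reach it; the
# values are placeholders and must match what the workers can actually
# resolve and connect to.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.driver.host", "my-driver-host")    # hypothetical hostname resolvable from the workers
    .config("spark.driver.port", "5001")              # fixed port that must be published/open
    .config("spark.blockManager.port", "5002")        # fixed port that must be published/open
    .getOrCreate()
)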

How to deploy it in a multi-node environment?

Hi, I can deploy it on one node and use docker-compose to start several containers. Now I want to deploy it in a multi-node environment. What should I modify in this project? Where should I configure the IP address of each node? Thanks in advance!

Master REST API

Is there a way to enable the REST API via an environment variable? The master image exposes port 6066, but nothing is listening on that port. It looks like the REST API is disabled by default.

Any idea?

Validating if step spark_master_init can start in pipeline

Hi everyone,
I'm just trying to use these images to run a simple application in containers.
After cloning the repository at
https://github.com/big-data-europe/docker-spark

I ran docker-compose. The containers with the master and worker were up and running, and the worker successfully registered with the master.
I then created a Python application to run on the Spark containers.
The image that submits the application is bde2020/spark-submit; I just added the ENV variables as described in the README.
When I run the container with the command:

sudo docker run --network=docker-spark_default --name my_container --link spark-master:spark-master my_image

the following appears:
Validating if step spark_master_init can start in pipeline

and nothing works. Is there something wrong?

Python 3.4 cannot run in docker-spark

Hello, I created a Spark standalone cluster from docker-spark and then used Python 3.4 to execute some code, but it raises errors.
(error screenshot omitted)
How can I use Python 3.4 to build a Spark application with docker-spark?
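
As a general note, PySpark requires the driver and the executors to run the same Python version, and a mismatch is a common cause of such errors with these images. A minimal sketch, assuming Python 3 is installed at the path below in both the driver and the worker containers:

import os

# Must be set before the SparkContext is created; the interpreter path is an
# assumption about where Python 3 lives inside the containers.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .getOrCreate()
)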

Spark JDBC MySQL connection error in executors (Communications link failure)

When I execute the Python script via the submit command:

spark-submit --master spark://localhost:7077 spark_sample.py \
--jars mysql-connector-java.jar

I get the following error:

Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: 
Communications link failure

The last packet sent successfully to the server was 0 milliseconds ago. 
The driver has not received any packets from the server.

However, this error only occurs when working with the Spark cluster (with a worker).

For the connection I used this code in Python:

dataSource = sqlContext.read.format('jdbc').options(
    url='jdbc:mysql://host:3306/database?user=root&password=xxxx',
    dbtable='table',
    driver="com.mysql.jdbc.Driver"
)

dataSource.load().show()

What could be happening?

Thank you for the help.
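
Two things commonly cause this with the dockerized cluster: spark-submit options such as --jars must come before the application script (anything after spark_sample.py is treated as an application argument, so the connector never reaches the executors), and the JDBC URL must point at a MySQL host that is reachable from inside the worker containers rather than localhost. A hedged sketch of the read path under those assumptions:

from pyspark.sql import SparkSession

# Ship the connector to driver and executors via spark.jars.packages;
# the Maven coordinate/version is an assumption, adjust to your connector.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://mysql-host:3306/database")  # hypothetical host resolvable from the workers
    .option("dbtable", "table")
    .option("user", "root")
    .option("password", "xxxx")
    .option("driver", "com.mysql.cj.jdbc.Driver")            # matches the 8.x connector
    .load()
)
df.show()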

Passing Spark configuration properties to the dockerized Spark

Hi,
I am not sure how to pass Spark properties such as
spark.executor.memory
spark.driver.memory
to docker-spark.

In the compose file I see we can pass an env file:

spark-master:
  image: 'bde2020/spark-master:2.1.0-hadoop2.8-hive-java8'
  container_name: spark-master
  ports:
    - '8080:8080'
    - '7077:7077'
  env_file:
    - ./hadoop.env

but this env file does not show how to pass Spark properties. It has prefixes like CORE_CONF, HIVE_SITE_CONF, HDFS_CONF and YARN_CONF, but nothing for Spark conf properties.
Could we just prefix them with SPARK_CONF,
e.g. SPARK_CONF_spark_driver_memory?
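
Whether the images honour a SPARK_CONF_ prefix is not documented here, but per-application properties can always be passed when the SparkSession is built (or via spark-submit --conf). A hedged sketch; note that spark.driver.memory must be set before the driver JVM starts (for example with spark-submit --driver-memory), so only executor-side settings are shown:

from pyspark.sql import SparkSession

# Per-application settings passed at session creation; the values are examples only.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "2g")
    .config("spark.executor.cores", "2")
    .getOrCreate()
)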

Plans for `master` and other branches?

Hello again,

No other branch appears to be in line with master at the moment, and I'm not sure what the deal is with the other branches like scala, python-fixes and worker-wait.

I'd be happy to help keep them up to date. However, I'd need to understand their purpose and effect first.

  1. Does the master branch reflect the latest tag in Docker Hub? Should it be updated to 2.0.1-hadoop2.7 or 2.0.2-hadoop2.7 (when it's ready)?
  2. Looking at Docker Hub, the only tag that doesn't follow the version style (i.e. 2.0.1-hadoop2.7) was poc-init-daemon, which is legacy by the sound of it. Does that imply all of those other branches need not be maintained, or do you have some other plans for them?
  3. Anything else that would be helpful for contributors to know regarding the branch and tag setup/intention in place?

Regards,
Bilal.
