Hi dataproc team.
tl;dr
I hope this title isn't too bombastic, but it seems Dataproc cannot support PySpark workloads on Python 3.3 and greater. This stems from PySpark checking for a PYTHONHASHSEED env var that, while set, is not detected during execution of Spark jobs on a Dataproc cluster. The same occurs with other environment variables in other runtimes as well (e.g., remote Spark job submission with a property setting for PYSPARK_PYTHON; see below).
0. Details
PySpark's rdd.py module requires the PYTHONHASHSEED env var to be set for Python 3.3 and higher, ensuring a common hash seed across workloads distributed over multiple Python instances. While there is discussion of whether or not PySpark should set PYTHONHASHSEED on its own, ultimately PySpark workloads and applications based on Python 3.3 and higher must have PYTHONHASHSEED set. This can be done by setting exports in global profiles or, more concisely, in the spark-env.sh config. Some examples of this follow.
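As background: since Python 3.3, str hashing is salted with a per-process random seed, so two interpreters only agree on hash values when PYTHONHASHSEED pins that seed, which is what PySpark's hash partitioning relies on. A quick standalone illustration (not Dataproc-specific):

```python
import os
import subprocess

def string_hash(seed):
    """Hash 'spark' in a fresh interpreter with a pinned hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output(
        ["python3", "-c", "print(hash('spark'))"], env=env)
    return int(out)

# With the seed pinned, every interpreter computes the same hash, so
# distributed workers can agree on which partition a key belongs to.
assert string_hash(123) == string_hash(123)

# Different (or unpinned, per-process random) seeds generally disagree,
# which is exactly what PySpark's PYTHONHASHSEED check guards against.
print(string_hash(123), string_hash(456))
```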
1. Setup
To demonstrate, I've implemented a basic shell script that installs Python 3 and sets the env vars, like this:
❯❯ wget https://gist.githubusercontent.com/nehalecky/9258c01fb2077f51545a/raw/789f08141dc681cf1ad5da05455c2cd01d1649e8/install-py3-dataproc.sh
❯❯ cat install-py3-dataproc.sh
#!/bin/bash
apt-get -y install python3
echo "export PYSPARK_PYTHON=python3" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc
echo "export PYTHONHASHSEED=123" | tee -a /etc/profile.d/spark_config.sh /etc/*bashrc /usr/lib/spark/conf/spark-env.sh
source ~/.bashrc
and reference it as an initialization action when launching a minimal Dataproc cluster:
❯❯ gcloud beta dataproc clusters create py3-test \
--initialization-actions \
gs://bombora-dev-analytics/dataproc-init-actions/install-py3-dataproc.sh # public object
Logging in, one can confirm that the PYTHONHASHSEED env var is set in both the global profiles and the spark-env.sh config, on both the master and worker nodes.
master
❯❯ gcloud compute ssh root@py3-test-m
root@py3-test-m:~# echo $PYTHONHASHSEED
123
root@py3-test-m:~# echo $PYSPARK_PYTHON
python3
worker
❯❯ gcloud compute ssh root@py3-test-w-0
root@py3-test-w-0:~# echo $PYTHONHASHSEED
123
root@py3-test-w-0:~# echo $PYSPARK_PYTHON
python3
2. Testing
To test, I have a simple Python script that prints info on PYTHONHASHSEED, raises an exception if it is not detected, and runs a PySpark job to print out the Python executable detected by the executors.
root@spark-cluster-m:~# wget https://raw.githubusercontent.com/nehalecky/dataproc-initialization-actions/feature/conda_init_action/develop/conda/get-sys-exec.py
root@spark-cluster-m:~# cat get-sys-exec.py
import pyspark
import sys
import os

# .get() avoids a KeyError here, so the explicit check below can run
pyhashseed = os.environ.get('PYTHONHASHSEED')
print(pyhashseed)
print(type(pyhashseed))
print(sys.version)
# Same check PySpark's rdd.py performs on the workers
if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
else:
    sc = pyspark.SparkContext()
    distData = sc.parallelize(range(100))
    python_distros = distData.map(lambda x: sys.executable).distinct().collect()
    print(python_distros)
Local
I run this locally on the cluster via spark-submit. While the script reads and prints the correct setting for PYTHONHASHSEED, it later hits an exception raised by the exact same check it initially passed! It seems the env var is detected fine on the master node, since the exception is raised on the worker node. Crazy.
root@spark-cluster-m:~# spark-submit get-sys-exec.py
123
<class 'str'>
3.5.1 |Continuum Analytics, Inc.| (default, Dec 7 2015, 11:16:01)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
16/01/18 01:18:03 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
16/01/18 01:18:03 INFO Remoting: Starting remoting
16/01/18 01:18:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:47010]
16/01/18 01:18:04 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 01:18:04 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:60173
16/01/18 01:18:04 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 01:18:04 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:4040
16/01/18 01:18:04 WARN org.apache.spark.metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/18 01:18:04 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at spark-cluster-m/10.240.0.2:8032
16/01/18 01:18:06 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1453072376140_0005
[Stage 0:> (0 + 2) / 2]16/01/18 01:18:16 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, spark-cluster-w-0.c.bombora-dev.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1453072376140_0005/container_1453072376140_0005_01_000002/pyspark.zip/pyspark/worker.py", line 111, in main
process()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1453072376140_0005/container_1453072376140_0005_01_000002/pyspark.zip/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1453072376140_0005/container_1453072376140_0005_01_000002/pyspark.zip/pyspark/serializers.py", line 133, in dump_stream
for obj in iterator:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1704, in add_shuffle_key
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1453072376140_0005/container_1453072376140_0005_01_000002/pyspark.zip/pyspark/rdd.py", line 74, in portable_hash
raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
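The pattern above, where the driver sees the variable but the workers don't, is consistent with YARN launching executor containers with its own constructed environment that doesn't include the node's profile exports. That's my working assumption, not something I've verified against the NodeManager source, but the isolation itself is easy to demonstrate:

```python
import subprocess

# A child process launched with an explicit env dict (roughly what a
# container manager does) sees none of the parent shell's exports:
out = subprocess.check_output(
    ["python3", "-c", "import os; print('PYTHONHASHSEED' in os.environ)"],
    env={"PATH": "/usr/bin:/bin"})
print(out.decode().strip())  # False, even if set in the calling shell
```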
Remote
Remote execution completes because, regardless of any PYSPARK_PYTHON env var setting, it uses the default Python installation (2.7.9). Again, this is likely because the globally set env vars cannot be referenced.
❯❯ gcloud beta dataproc jobs submit pyspark --cluster py3-test get-sys-exec.py
Copying file://get-sys-exec.py [Content-Type=text/x-python]...
Uploading ...33ab-c06f-4234-b495-92c3bf9ac6e0/get-sys-exec.py: 480 B/480 B
Job [6512a774-fc83-4d2c-b735-f40fa7bac534] submitted.
Waiting for job output...
123
<type 'str'>
2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2]
16/01/18 19:04:30 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
16/01/18 19:04:30 INFO Remoting: Starting remoting
16/01/18 19:04:30 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:43622]
16/01/18 19:04:30 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 19:04:30 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:52135
16/01/18 19:04:31 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 19:04:31 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:4040
16/01/18 19:04:31 WARN org.apache.spark.metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/18 19:04:31 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at py3-test-m/10.240.0.3:8032
16/01/18 19:04:33 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1453141822766_0002
['/usr/bin/python']
16/01/18 19:04:43 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/01/18 19:04:43 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
Job [6512a774-fc83-4d2c-b735-f40fa7bac534] finished successfully.
driverControlFilesUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-metainfo/44b233ab-c06f-4234-b495-92c3bf9ac6e0/jobs/6512a774-fc83-4d2c-b735-f40fa7bac534/
driverOutputResourceUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-metainfo/44b233ab-c06f-4234-b495-92c3bf9ac6e0/jobs/6512a774-fc83-4d2c-b735-f40fa7bac534/driveroutput
placement:
clusterName: py3-test
clusterUuid: 44b233ab-c06f-4234-b495-92c3bf9ac6e0
pysparkJob:
loggingConfiguration: {}
mainPythonFileUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-staging/44b233ab-c06f-4234-b495-92c3bf9ac6e0/get-sys-exec.py
reference:
jobId: 6512a774-fc83-4d2c-b735-f40fa7bac534
projectId: bombora-dev
status:
state: DONE
stateStartTime: '2016-01-18T19:04:50.779Z'
statusHistory:
- state: PENDING
stateStartTime: '2016-01-18T19:04:24.164Z'
- state: SETUP_DONE
stateStartTime: '2016-01-18T19:04:24.320Z'
- details: Agent reported job success
state: RUNNING
stateStartTime: '2016-01-18T19:04:29.613Z'
An attempt at passing in a --properties argument has no impact (note the warning that PYSPARK_PYTHON is ignored).
❯❯ gcloud beta dataproc jobs submit pyspark --cluster py3-test --properties PYSPARK_PYTHON=/usr/bin/python3 get-sys-exec.py
Copying file://get-sys-exec.py [Content-Type=text/x-python]...
Uploading ...33ab-c06f-4234-b495-92c3bf9ac6e0/get-sys-exec.py: 480 B/480 B
Job [2ba2a5b3-3e4a-41ce-b463-6970185b03bf] submitted.
Waiting for job output...
Warning: Ignoring non-spark config property: PYSPARK_PYTHON=/usr/bin/python3
123
<type 'str'>
2.7.9 (default, Mar 1 2015, 12:57:24)
[GCC 4.9.2]
16/01/18 19:33:25 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
16/01/18 19:33:25 INFO Remoting: Starting remoting
16/01/18 19:33:25 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:56989]
16/01/18 19:33:26 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 19:33:26 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:44267
16/01/18 19:33:26 INFO org.spark-project.jetty.server.Server: jetty-8.y.z-SNAPSHOT
16/01/18 19:33:26 INFO org.spark-project.jetty.server.AbstractConnector: Started [email protected]:4040
16/01/18 19:33:26 WARN org.apache.spark.metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/01/18 19:33:26 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at py3-test-m/10.240.0.3:8032
16/01/18 19:33:28 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1453141822766_0005
['/usr/bin/python']
Job [2ba2a5b3-3e4a-41ce-b463-6970185b03bf] finished successfully.
driverControlFilesUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-metainfo/44b233ab-c06f-4234-b495-92c3bf9ac6e0/jobs/2ba2a5b3-3e4a-41ce-b463-6970185b03bf/
driverOutputResourceUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-metainfo/44b233ab-c06f-4234-b495-92c3bf9ac6e0/jobs/2ba2a5b3-3e4a-41ce-b463-6970185b03bf/driveroutput
placement:
clusterName: py3-test
clusterUuid: 44b233ab-c06f-4234-b495-92c3bf9ac6e0
pysparkJob:
loggingConfiguration: {}
mainPythonFileUri: gs://dataproc-e66c6e3d-80da-4211-b66c-0109bbe4567a-us/google-cloud-dataproc-staging/44b233ab-c06f-4234-b495-92c3bf9ac6e0/get-sys-exec.py
properties:
PYSPARK_PYTHON: /usr/bin/python3
reference:
jobId: 2ba2a5b3-3e4a-41ce-b463-6970185b03bf
projectId: bombora-dev
status:
state: DONE
stateStartTime: '2016-01-18T19:33:45.155Z'
statusHistory:
- state: PENDING
stateStartTime: '2016-01-18T19:33:20.791Z'
- state: SETUP_DONE
stateStartTime: '2016-01-18T19:33:20.947Z'
- details: Agent reported job success
state: RUNNING
stateStartTime: '2016-01-18T19:33:27.724Z'
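For what it's worth, Spark's documented mechanism for executor-side environment variables is the spark.executorEnv.[EnvironmentVariableName] property (and spark.yarn.appMasterEnv.[EnvironmentVariableName] for the YARN application master), rather than the bare env-var name I passed above. I haven't verified this on Dataproc, but a submission along these lines might fare better:

```shell
# Untested sketch: pass real Spark conf keys instead of bare env names
❯❯ gcloud beta dataproc jobs submit pyspark --cluster py3-test \
    --properties spark.executorEnv.PYTHONHASHSEED=123,spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3 \
    get-sys-exec.py
```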
3. Hacking
In one last desperate attempt to get things working, I modified the /usr/lib/spark/python/pyspark/rdd.py module to statically define the env var, changing line 74 to this:
os.environ['PYTHONHASHSEED'] = '0'  # os.environ values must be strings
warnings.warn('Environment variable PYTHONHASHSEED not detected, set to 0')
#raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
This, however, had no effect, which was totally unexpected until I realized that code base wasn't being called. Instead, the traceback references a different path, /usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py, which indeed I had not modified.
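This makes sense in hindsight: executors import PySpark from the pyspark.zip archive shipped into each YARN container, and Python happily imports modules straight out of a zip archive on sys.path, so the unpacked copy I edited never gets loaded. A toy demonstration of that import behavior (module and path names here are hypothetical):

```python
import os
import sys
import tempfile
import zipfile

tmp = tempfile.mkdtemp()

# A module packed into a zip archive, standing in for pyspark.zip ...
zip_path = os.path.join(tmp, "lib.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("fakerdd.py", "SOURCE = 'zip'\n")

# ... and an edited, unpacked copy of the "same" module elsewhere.
unpacked = os.path.join(tmp, "unpacked")
os.makedirs(unpacked)
with open(os.path.join(unpacked, "fakerdd.py"), "w") as f:
    f.write("SOURCE = 'unpacked'\n")

# The zip sits earlier on sys.path, just as in the executor containers,
# so any edit to the unpacked copy is invisible to the import system:
sys.path.insert(0, unpacked)
sys.path.insert(0, zip_path)
import fakerdd
print(fakerdd.SOURCE)  # prints "zip"
```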
4. Fin
I've exhausted all leads on my side and wanted to hand this off; I hope this helps in identifying and resolving the issue. I appreciate your time, and please let me know how else I can help.
Many thanks. :)