tony's Issues

Installation of TonY in GCP. Tests under TestTonyE2E fail during build.

Tried to install TonY on Google Cloud via Dataproc, but tests failed during build.

Setup:

  • 1 Master node
  • 2 Workers

Operating System:

Debian GNU/Linux 8

Attached logs with --debug information: [build.log](https://github.com/linkedin/TonY/files/2545362/build.log)

#yarn node -list
18/11/03 21:56:40 INFO client.RMProxy: Connecting to ResourceManager at tony-m/10.138.0.4:8032
Total Nodes:2
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
tony-w-0.c.dpe.internal:33607         RUNNING tony-w-0.c.dpe.internal:8042                            11
tony-w-1.c.dpe.internal:44563         RUNNING tony-w-1.c.dpe.internal:8042                             9
#hadoop version
Hadoop 2.8.4
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-09T10:27Z
Compiled with protoc 2.5.0
From source with checksum 373fbec5524db42be27f1396ffbd2fc6
This command was run using /usr/lib/hadoop/hadoop-common-2.8.4.jar
#java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
#echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64

When running ./gradlew build

sudo ./gradlew build --stacktrace
> Task :tony-core:test
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testNullAMRpcClient FAILED
    java.lang.AssertionError at TestTonyE2E.java:268
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSSkewedWorkerTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:110
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSWorkerTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:127
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testSingleNodeTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:73
27 tests completed, 4 failed
> Task :tony-core:test FAILED
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':tony-core:test'.
> There were failing tests. See the report at: file:///usr/local/src/TonY/tony-core/build/reports/tests/test/index.html
* Try:
Run with --info or --debug option to get more log output. Run with --scan to get full insights.

Unauthorized connection for super-user: rm/[email protected]

2018-11-12 22:03:02 INFO  TonyClient:198 - Submitting YARN application
2018-11-12 22:03:03 FATAL TonyClient:776 - Failed to run TonyClient
org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1541916949981_266668 to YARN : Unauthorized connection for super-user: rm/[email protected] from IP xx.xx.xx.xx
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:272)
	at com.linkedin.tony.TonyClient.run(TonyClient.java:199)
	at com.linkedin.tony.TonyClient.start(TonyClient.java:774)
	at com.linkedin.tony.TonyClient.start(TonyClient.java:762)
	at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:76)
2018-11-12 22:03:03 ERROR TonyClient:786 - Application failed to complete successfully
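This error comes from Hadoop's proxy-user (impersonation) checks: the rm principal is acting as a super-user but is not whitelisted for impersonation. A hedged sketch of the usual remedy is to add proxyuser entries to core-site.xml on the ResourceManager/NameNode; the wildcard values below are placeholders and should be narrowed in practice.

```xml
<!-- Sketch only: core-site.xml proxyuser entries for the "rm" user.
     The "*" values are placeholder assumptions; restrict them for production. -->
<property>
  <name>hadoop.proxyuser.rm.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.rm.groups</name>
  <value>*</value>
</property>
```

After editing, `hdfs dfsadmin -refreshSuperUserGroupsConfiguration` and `yarn rmadmin -refreshSuperUserGroupsConfiguration` pick up the change without a restart.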

src_dir should be nullable

Part of making #93 easier. Users should be able to pass --resources SCHEMA://PATH/TO/RESOURCES instead of having to add resources through the local file system.

Example:


  @Test
  public void testTonyResourcesFlag() throws ParseException {
    conf.setBoolean(TonyConfigurationKeys.IS_SINGLE_NODE, false);
    client = new TonyClient(conf);
    client.init(new String[]{
        "--executes", "'/bin/cat log4j.properties'",
        "--hdfs_classpath", "/yarn/libs",
        "--container_env", Constants.SKIP_HADOOP_PATH + "=true",
        "--conf", "tony.worker.resources=/yarn/libs",
        "--conf", "tony.ps.instances=0",
    });
    int exitCode = client.start();
    Assert.assertEquals(exitCode, 0);
  }

Feature request: Add Google Cloud Bucket support

When running Dataproc on Google Cloud, it would be ideal to keep files in a company GCS bucket, whether private or public.

Support for:

  • jars: (Already included in gcloud dataproc command). To be tested.
  • python_venv
  • executes
  • conf_file
  • src_dir

Since some GCS buckets are not public, it may be necessary to pass credentials (a JSON key file) in a separate parameter.
Code sample:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/env/tf19.zip \
--src_dir=gs://tony-staging/tony/mnist/src/ \
--executes=gs://tony-staging/tony/mnist/src/mnist_distributed.py \
--conf_file=gs://tony-staging/tony/conf/tony.xml \
--python_binary_path=tf19/bin/python3.5

Related to #74

Run Play tests during build

Currently, the Play tests BrowserTest and HomeControllerTest do not run as part of the build. For example, in https://api.travis-ci.org/v3/job/448123064/log.txt, we see

> Task :tony-history-server:testClasses UP-TO-DATE
Skipping task ':tony-history-server:testClasses' as it has no actions.
:tony-history-server:testClasses (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
:tony-history-server:test (Thread[Task worker for ':',5,main]) started.

> Task :tony-history-server:test NO-SOURCE
Skipping task ':tony-history-server:test' as it has no source files and no previous output files.
:tony-history-server:test (Thread[Task worker for ':',5,main]) completed. Took 0.002 secs.

The Play tests are run as part of the testPlayBinary task, which the test task does NOT depend on.

Failed to run mnist example on Hadoop cluster

Tried to build TonY and run the mnist-tensorflow example, but got the error message "ERROR tony.TonyClient: Application failed to complete successfully". There is no clear error in the Hadoop logs. Furthermore, although the build succeeded, I could not find the tony folder or the tony.xml configuration file. Thanks in advance for any help.

My configurations are:
A Hadoop cluster (4 nodes: 1 master and 3 slaves) on Virtualbox.
TonY: 0.1.3
Hadoop: 2.9.1
Tensorflow: 1.9.0

The printed out info:
18/11/05 14:49:59 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/05 14:49:59 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/05 14:49:59 INFO tony.TonyClient: Starting client..
18/11/05 14:49:59 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:05 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/05 14:50:05 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path /home/rui/venv/bin/python --python_venv /home/rui/venv.zip --executes /home/rui/TonY/tony-examples/mnist/mnist_distributed.py --hdfs_classpath hdfs://192.168.56.100:9000/user/rui/.tony/1adf67c5-3be7-4245-8a31-3c9204ae84a8 --container_env TONY_CONF_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1541451005299 --container_env TF_ZIP_LENGTH=102664099 --container_env TF_ZIP_TIMESTAMP=1541451005184 --container_env TF_ZIP_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tf.zip --container_env TONY_CONF_LENGTH=3200 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/11/05 14:50:05 INFO tony.TonyClient: Submitting YARN application
18/11/05 14:50:05 INFO impl.YarnClientImpl: Submitted application application_1541449424539_0001
18/11/05 14:50:05 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tf-yarn-master:8088/proxy/application_1541449424539_0001/
18/11/05 14:50:05 INFO tony.TonyClient: ResourceManager web address for application: http://192.168.56.100:8088/cluster/app/application_1541449424539_0001
18/11/05 14:50:11 INFO tony.TonyClient: AM host: tf-yarn-slave3
18/11/05 14:50:11 INFO tony.TonyClient: AM RPC port: 14925
18/11/05 14:50:11 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:13 INFO tony.TonyClient: Logs for ps 0 at: http://tf-yarn-slave2:8042/node/containerlogs/container_1541449424539_0001_01_000002/rui
18/11/05 14:50:13 INFO tony.TonyClient: Logs for worker 0 at: http://tf-yarn-slave3:8042/node/containerlogs/container_1541449424539_0001_01_000003/rui
18/11/05 14:50:21 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:1
18/11/05 14:50:22 ERROR tony.TonyClient: Application failed to complete successfully

Container exited with 132 when runs example mnist-tensorflow

I am trying to follow the mnist-tensorflow example in tony-examples, but when I run the following command my containers exit with code 132 and I can't find out why, which really confuses me. Any ideas?

java version: 1.8.0_181

Hadoop version: 3.1.1

java -cp "`hadoop classpath --glob`:tony/*:tony" \
            com.linkedin.tony.cli.ClusterSubmitter \
            -executes src/models/mnist_distributed.py \
            -python_venv env.zip \
            -python_binary_path env/bin/python \
            -src_dir src \
            -shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server

tony.xml

<configuration>
  <property>
    <name>tony.application.hdfs-conf-path</name>
    <value>/home/hadoop/hadoop/etc/hadoop/hdfs-site.xml</value>
  </property>
  <property>
    <name>tony.application.yarn-conf-path</name>
    <value>/home/hadoop/hadoop/etc/hadoop/yarn-site.xml</value>
  </property>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
</configuration> 

the console:

2018-10-17 08:27:49,932 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-17 08:27:50,132 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, null/core-site.xml, null/hdfs-site.xml
2018-10-17 08:27:50,465 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-17 08:27:51,887 INFO cli.ClusterSubmitter: Copying /home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar to: hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY heartbeat interval [1000]
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
2018-10-17 08:27:53,790 INFO tony.TonyClient: Starting client..
2018-10-17 08:27:53,796 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:27:54,163 INFO conf.Configuration: resource-types.xml not found
2018-10-17 08:27:54,164 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-10-17 08:28:08,606 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path env/bin/python --python_venv env.zip --executes src/models/mnist_distributed.py --hdfs_classpath hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc --shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server --container_env TONY_CONF_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony-final.xml --container_env YARN_CONF_PATH=home/hadoop/hadoop/etc/hadoop/yarn-site.xml --container_env TONY_CONF_TIMESTAMP=1539764888557 --container_env TONY_CONF_LENGTH=3659 --container_env TONY_ZIP_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony.zip --container_env TONY_ZIP_LENGTH=154330934 --container_env TONY_ZIP_TIMESTAMP=1539764888067 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* --container_env HDFS_CONF_PATH=home/hadoop/hadoop/etc/hadoop/hdfs-site.xml 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log 
2018-10-17 08:28:08,607 INFO tony.TonyClient: Submitting YARN application
2018-10-17 08:28:08,712 INFO impl.YarnClientImpl: Submitted application application_1539761891085_0002
2018-10-17 08:28:08,718 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://HP-DL580-G7:8088/proxy/application_1539761891085_0002/
2018-10-17 08:28:08,719 INFO tony.TonyClient: ResourceManager web address for application: http://0.0.0.0:8088/cluster/app/application_1539761891085_0002
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM host: HP-DL580-G7
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM RPC port: 13923
2018-10-17 08:28:14,770 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:28:19,025 INFO tony.TonyClient: Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000002/hadoop
2018-10-17 08:28:19,026 INFO tony.TonyClient: Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000003/hadoop
2018-10-17 08:28:36,135 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:2
2018-10-17 08:28:36,200 ERROR tony.TonyClient: Application failed to complete successfully

the amstdout.log:

2018-10-17 08:40:07 INFO  TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:07 INFO  TonyApplicationMaster:909 - Successfully started container container_1539765357563_0002_01_000003
2018-10-17 08:40:08 INFO  TonyApplicationMaster:728 - Client requesting TaskUrls!
2018-10-17 08:40:09 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:14 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:18 INFO  TonyApplicationMaster:770 - Received cluster spec registration request from task ps:0 with spec: HP-DL580-G7:33439
2018-10-17 08:40:18 INFO  TonyApplicationMaster:783 - [ps:0] Received Registration for HB !!
2018-10-17 08:40:18 INFO  TonyApplicationMaster:795 - Received registrations from 1 tasks, awaiting registration from 1 tasks.
2018-10-17 08:40:18 INFO  TonyApplicationMaster:797 - Awaiting registration from task worker 0 in container_1539765357563_0002_01_000003 on host HP-DL580-G7
2018-10-17 08:40:19 INFO  TonyApplicationMaster:770 - Received cluster spec registration request from task worker:0 with spec: HP-DL580-G7:35215
2018-10-17 08:40:19 INFO  TonyApplicationMaster:783 - [worker:0] Received Registration for HB !!
2018-10-17 08:40:19 INFO  TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:19 INFO  TonyApplicationMaster:831 - Got request to update TensorBoard URL: HP-DL580-G7:45163
2018-10-17 08:40:19 WARN  TonyApplicationMaster:850 - This Hadoop version doesn't have the YARN-7974 patch, TonY won't register TensorBoard URL with application's tracking URL
2018-10-17 08:40:19 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:20 INFO  TonyApplicationMaster:811 - Received result registration request with exit code 132 from worker 0
2018-10-17 08:40:21 INFO  TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:21 INFO  TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:21 INFO  TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000003, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:21 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:20.925]Exception from container-launch.
Container id: container_1539765357563_0002_01_000003
Exit code: 132

[2018-10-17 08:40:20.934]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


[2018-10-17 08:40:20.935]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]



2018-10-17 08:40:21 INFO  TonyApplicationMaster:961 - Unregister task [worker:0] from Heartbeat monitor..
2018-10-17 08:40:21 INFO  TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000003
2018-10-17 08:40:22 INFO  TonyApplicationMaster:811 - Received result registration request with exit code 132 from ps 0
2018-10-17 08:40:23 INFO  TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:23 INFO  TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000002, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:23 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:23.168]Exception from container-launch.
Container id: container_1539765357563_0002_01_000002
Exit code: 132

[2018-10-17 08:40:23.175]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


[2018-10-17 08:40:23.177]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]



2018-10-17 08:40:23 INFO  TonyApplicationMaster:961 - Unregister task [ps:0] from Heartbeat monitor..
2018-10-17 08:40:23 INFO  TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000002
2018-10-17 08:40:24 INFO  TonyApplicationMaster:512 - Completed jobs: 1 total jobs: 1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:564 - Total completed worker tasks: 1, total worker tasks: 1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:570 - TensorFlow session failed: At least one job task exited with non-zero status, failedCnt=1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:335 - Result: false, job failed: true, retry count: 0
2018-10-17 08:40:25 INFO  TonyApplicationMaster:837 - Client signals AM to finish application.
2018-10-17 08:40:29 INFO  Utils:61 - Poll function finished within 30 seconds
2018-10-17 08:40:29 INFO  TonyApplicationMaster:145 - Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000002/hadoop
2018-10-17 08:40:29 INFO  TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:29 INFO  TonyApplicationMaster:355 - Application Master failed. exiting

and the worker container:

2018-10-17 08:40:08 INFO  TaskExecutor:86 - TaskExecutor is running..
2018-10-17 08:40:08 INFO  TaskExecutor:80 - Reserved rpcPort: 35215
2018-10-17 08:40:08 INFO  TaskExecutor:81 - Reserved tbPort: 45163
2018-10-17 08:40:08 INFO  TaskExecutor:82 - Reserved py4j gatewayServerPort: 35421
2018-10-17 08:40:08 INFO  TaskExecutor:178 - Task command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:08 INFO  Utils:109 - Unzipping tony.zip to destination ./
2018-10-17 08:40:10 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: HP-DL580-G7:11789
2018-10-17 08:40:10 INFO  TaskExecutor:96 - Unpacking Python virtual environment: env.zip
2018-10-17 08:40:10 INFO  Utils:109 - Unzipping env.zip to destination venv
2018-10-17 08:40:19 INFO  TaskExecutor:107 - Executor is running task worker 0
2018-10-17 08:40:19 INFO  TaskExecutor:190 - Application Master address : HP-DL580-G7:11789
2018-10-17 08:40:19 INFO  TaskExecutor:193 - ContainerId is: container_1539765357563_0002_01_000003 HostName is: HP-DL580-G7
2018-10-17 08:40:19 INFO  TaskExecutor:201 - Connecting to HP-DL580-G7:11789 to register worker spec: worker 0 HP-DL580-G7:35215
2018-10-17 08:40:19 INFO  Utils:82 - Poll function finished within 120 seconds
2018-10-17 08:40:19 INFO  TaskExecutor:114 - Successfully registered and got cluster spec: {"ps":["HP-DL580-G7:33439"],"worker":["HP-DL580-G7:35215"]}
2018-10-17 08:40:19 INFO  TaskExecutor:211 - TensorBoard address : HP-DL580-G7:45163
2018-10-17 08:40:19 INFO  Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:19 INFO  TaskExecutor:214 - Register TensorBoard response: SUCCEEDED
2018-10-17 08:40:19 INFO  Utils:210 - Executing command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:20 INFO  Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:20 INFO  TaskExecutor:223 - AM response for result execution run: RECEIVED
2018-10-17 08:40:20 INFO  TaskExecutor:148 - Child process exited with exit code 132
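An aside on the exit code itself (not from the logs above): 132 follows the shell convention of 128 plus the signal number, and signal 4 is SIGILL (illegal instruction). This is commonly seen when a TensorFlow binary was compiled for CPU instructions (e.g. AVX) that the host CPU does not support. A quick demonstration of the 128 + signal arithmetic:

```shell
# A process killed by SIGILL (signal 4) is reported as exit code 128 + 4 = 132.
sh -c 'kill -s ILL $$' || status=$?   # the child sends SIGILL to itself
echo "exit code: $status"             # prints: exit code: 132
```

If this is the cause here, installing a TensorFlow build matching the machine's CPU features would be worth trying.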

Refactor hdfs_classpath, src_dir, python_venv to all use resources and fix the path issue for input files

Currently, we have different handling logic for hdfs_classpath, which we add to the AM's localizable container resources and then pass again to the workers; for src_dir and python_venv, we add them to a tony.zip. One difference between the two is that for src_dir we care about the folder structure, whereas for python_venv we don't, since it is a single zip file.

All this resource handling could be unified via the new tony.container.resources flag: we localize every resource listed in that field (comma-delimited).

We pass python_venv all the way from client -> AM -> TaskExecutor as a command-line argument, which is unnecessary. We can always use a relative path and assume the top-level folder is venv.

The plan is to get rid of the tony.zip creation logic and use the resource-handling logic for all these scenarios. The -executes and -python_binary_path flags will always take a relative path inside the uploaded artifact.

ls /user/alice/tonyJob
 - venv.zip
 - src/
    - mnist.py

Inside venv.zip

venv.zip
  - bin/
  - lib/

Example:

java -cp `hadoop classpath`:/path/to/TonY/tony-cli/build/libs/tony-cli-x.x.x-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/user/alice/tonyJob/venv.zip  \
--src_dir=/user/alice/tonyJob/src  \
--executes=mnist_distributed.py \
--python_binary_path=bin/python

Under the hood, we pack src_dir into a SRC.zip, upload it to HDFS, and set tonyConf's tony.container.resources to include it; all containers will then localize the zip, and if SRC.zip exists, we unzip it.

tony-final.xml's tony.container.resources will be like:

<property>
  <name>tony.container.resources</name>
  <value>hdfs://tony_tmp/SRC.zip, hdfs://tony_tmp/venv.zip, hdfs://hdfs_classpath/tony.jar</value>
</property>

The same applies to python_venv: we upload the venv zip file to HDFS, rename it to venv.zip, and set tonyConf; common logic then localizes it to all containers, where we unzip it.

WIP branch https://github.com/linkedin/TonY/tree/refactor

Ref #74

Failed to run PyTorch mnist example in GCP

Unable to run PyTorch sample code. Task is stuck in "RUNNING"

Setup

GCP DataProc

  • 1 master node
  • 2 worker nodes

Version

Hadoop 2.9.0
Subversion https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r e8ce80c37eebb173fc688e7f5686d7df74d182aa
Compiled by bigtop on 2018-10-25T12:56Z
Compiled with protoc 2.5.0
From source with checksum 1eb388d554db8e1cadcab4c1326ee72
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

ML framework versions

PyTorch 0.4.0
Python 3.5

tony.xml

<configuration>
  <property>
    <name>tony.application.name</name>
    <value>PyTorch</value>
  </property>   
  <property>
    <name>tony.application.security.enabled</name>
   <value>false</value>
  </property>    
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>0</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.application.framework</name>
    <value>pytorch</value>
  </property>
</configuration>
#yarn application -list -appStates ALL
18/11/20 07:06:46 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/20 07:06:47 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
Total number of applications (application-types: [], states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED] and tags: []):9
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1542587994073_0009  TensorFlowApplication             TENSORFLOW          root         default                  KILLED                  KILLED                 100% http://tony-staging-m:8188/applicationhistory/app/application_1542587994073_0009
application_1542587994073_0010  TensorFlowApplication             TENSORFLOW          root         default                FINISHED                  FAILED                 100%                                 N/A
application_1542587994073_0015               PyTorch              TENSORFLOW          root         default                 RUNNING               UNDEFINED                   0%                                 N/A


Logs from:

node/containerlogs/container_1542587994073_0015_01_000002/root

Code fails in:

executor.taskIndex = Integer.parseInt(System.getenv(Constants.TASK_INDEX));

stderr

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/36/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:542)
	at java.lang.Integer.parseInt(Integer.java:615)
	at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:109)

stdout

2018-11-20 06:46:37 INFO  TaskExecutor:89 - TaskExecutor is running..
2018-11-20 06:46:37 INFO  TaskExecutor:83 - Reserved rpcPort: 43073
2018-11-20 06:46:37 INFO  TaskExecutor:84 - Reserved tbPort: 37633
2018-11-20 06:46:37 INFO  TaskExecutor:85 - Reserved py4j gatewayServerPort: 35571
2018-11-20 06:46:37 INFO  TaskExecutor:175 - Task command: venv/torch04/bin/python3.5 /usr/local/src/jobs/PTJob/src/mnist_distributed.py --root /tmp/data/
2018-11-20 06:46:37 INFO  Utils:132 - Unzipping tony.zip to destination ./
2018-11-20 06:46:39 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: tony-staging-w-0.c.dpe-cloud-mle.internal:11616
2018-11-20 06:46:39 INFO  TaskExecutor:102 - Unpacking Python virtual environment: /usr/local/src/jobs/PTJob/env/torch04.zip
2018-11-20 06:46:39 INFO  Utils:132 - Unzipping /usr/local/src/jobs/PTJob/env/torch04.zip to destination venv
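The NumberFormatException: null above means the TASK_INDEX environment variable was missing from the container environment. A minimal defensive sketch (the helper name is hypothetical, not TonY's actual API) that surfaces the real problem instead of a bare parse failure:

```java
public class TaskIndexParsing {
  // Hypothetical helper: parse a required integer environment variable,
  // failing with a clear message when it is unset instead of
  // "NumberFormatException: null".
  static int parseRequiredIntEnv(String name, String rawValue) {
    if (rawValue == null) {
      throw new IllegalStateException(
          "Required environment variable " + name + " is not set; "
          + "was this executor launched by the TonY AM?");
    }
    return Integer.parseInt(rawValue);
  }

  public static void main(String[] args) {
    // In the real executor this would be parseRequiredIntEnv(name, System.getenv(name)).
    System.out.println(parseRequiredIntEnv("TASK_INDEX", "3"));
  }
}
```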

TonY History Server should start using nohup

Currently, the startTHS.sh script uses exec to start, and so if the SSH session in which the THS is started dies, the THS process itself dies. We should use nohup so the THS will continue running even when the SSH session is closed. (Alternatively, we could run THS as a background process.)

TonY Portal should enforce retention on history files

Currently, history files are retained forever. The retention period should be configurable and TonY Portal should take care of enforcing retention.

As part of retention, we can also clean up in-progress files that are older than the retention period. (These are probably jobs that crashed or encountered other abnormal conditions.)
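A minimal sketch of the expiry test such retention enforcement could apply to each history file, including stale in-progress files (names are assumptions, not TonY Portal's actual code):

```java
import java.time.Duration;
import java.time.Instant;

public class RetentionCheck {
  // A file is expired once its last-modified time plus the configured
  // retention period is in the past.
  static boolean isExpired(Instant lastModified, Duration retention, Instant now) {
    return lastModified.plus(retention).isBefore(now);
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2018-12-01T00:00:00Z");
    Instant old = Instant.parse("2018-10-01T00:00:00Z");
    // A two-month-old history file against a 30-day retention period:
    System.out.println(isExpired(old, Duration.ofDays(30), now));
  }
}
```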

Internal Jira: LIHADOOP-43855

Running this on Cloudera Hadoop Distribution (5.13.1)

Hi,

I came across this great work just recently. I had a lot of issues using TensorflowOnSpark and TensorflowOnYARN earlier this year and had given up. I'm wondering how I can make use of this repo on top of my Cloudera Distribution of Hadoop. Any help is appreciated.

Thanks !

Mohammed Ayub

TonY Job History Server Phase I

A job history server for TonY jobs.

  • Metrics collection (CPU/MEM/GPU utilization)
  • UI for displaying counter and other information

TonY on docker

With #67, we'll support running TonY with Docker images. The prerequisite is a properly configured cluster capable of running YARN applications in Docker containers. Personal experience with that: https://medium.com/@oliver_hu/enable-hadoop-yarn-2-9-1-3-0-3-1-to-launch-application-using-docker-containers-1442a639bb64.

With this change, you should be able to launch your training jobs without zipping the Python virtual env anymore.

To enable TonY to launch your jobs in Docker, set:

<property>
  <description>Whether to use Docker containers to launch the tasks</description>
  <name>tony.docker.enabled</name>
  <value>true</value>
</property>
<property>
  <description>The Docker image used to launch the tasks</description>
  <name>tony.docker.image</name>
  <value>oliverhu/hadoop-base</value> <!-- your image -->
</property>

TonY EventHandler should use take() instead of poll()

Currently, EventHandler.run() spins inside this loop:

    while (!isStopped) {
      writeEvent(eventQueue, dataFileWriter);
    }

because writeEvent uses poll() which will immediately return null if the queue is empty.

We should update writeEvent() to use take() instead and have the stop() method interrupt the event-handler thread (writeEvent() should catch the resulting InterruptedException).
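A minimal sketch of a take()-based loop with interruption as the stop signal (class and field names are assumptions, not TonY's actual EventHandler):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class TakeBasedHandler {
  private final BlockingQueue<String> eventQueue = new ArrayBlockingQueue<>(16);
  final List<String> written = new CopyOnWriteArrayList<>();  // stands in for dataFileWriter
  private final Thread handler = new Thread(() -> {
    try {
      while (true) {
        // take() blocks until an event arrives -- no busy-wait on an empty queue.
        written.add(eventQueue.take());
      }
    } catch (InterruptedException e) {
      // stop() interrupted us; flush/close the underlying writer here.
    }
  });

  void start() { handler.start(); }
  void emit(String event) throws InterruptedException { eventQueue.put(event); }
  void stop() throws InterruptedException {
    handler.interrupt();  // interrupt the handler thread, not the caller
    handler.join();
  }

  public static void main(String[] args) throws InterruptedException {
    TakeBasedHandler h = new TakeBasedHandler();
    h.start();
    h.emit("event-1");
    h.emit("event-2");
    Thread.sleep(200);  // give the handler time to drain the queue
    h.stop();
    System.out.println(h.written);
  }
}
```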

Some problems building TonY and using GPUs

Java version: 1.8.0_18
tensorflow-gpu: 1.9
Running the following command:
./gradlew build

produces the following error:
com.linkedin.tony.TestTonyE2E.setup FAILED
java.lang.IllegalArgumentException: The value of property bind.address must not be null
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.http.HttpServer2.initializeWebServer(HttpServer2.java:585)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:537)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:117)
at org.apache.hadoop.http.HttpServer2$Builder.build(HttpServer2.java:421)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:160)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:869)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:691)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1308)
at org.apache.hadoop.hdfs.MiniDFSCluster.configureNameService(MiniDFSCluster.java:1077)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:952)
at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:884)
at org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:517)
at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:476)
at com.linkedin.minitony.cluster.MiniCluster.start(MiniCluster.java:50)
at com.linkedin.tony.TestTonyE2E.setup(TestTonyE2E.java:34)
What is the cause? I have checked all of the Hadoop configuration several times and it is all correct.

Also, when using GPUs, running the following command:
java -cp `hadoop classpath`:/TonY/tony-cli/build/libs/tony-cli-0.1.3-all.jar com.linkedin.tony.cli.LocalSubmitter
--python_venv=/venv.zip
--src_dir=/TonY/tony-examples/mnist
--executes=/TonY/tony-examples/mnist/mnist_distributed.py
--conf_file=/path/tony-test.xml
--python_binary_path=venv/bin/python
produces an error that libcublas.so.9.0 cannot be found. CUDA and cuDNN were configured previously with no problems; TensorFlow runs fine, including inside this virtual environment, but running the command above fails. Thanks.

Fast-fail on chief worker failure

If chief worker fails, we should immediately fail the application. (i.e. the underlying TF distributed training may hang, so TonY should just fail the application.)

Other workers failing is a separate issue. In theory if they fail the training can continue. But it can be configurable.
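The proposed policy could be sketched as follows (names and the configuration flag are assumptions, not TonY's actual API):

```java
public class FastFailPolicy {
  // Chief (or worker 0 in chief-less setups) failure always fails the
  // application, since distributed training may otherwise hang. Other worker
  // failures are tolerated only if configured.
  static boolean shouldFailFast(String jobName, int taskIndex, boolean tolerateWorkerFailures) {
    boolean isChief = "chief".equals(jobName)
        || ("worker".equals(jobName) && taskIndex == 0);
    if (isChief) {
      return true;
    }
    return !tolerateWorkerFailures;
  }

  public static void main(String[] args) {
    System.out.println(shouldFailFast("worker", 0, true));   // chief role
    System.out.println(shouldFailFast("worker", 1, true));   // tolerated failure
  }
}
```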

Problem launching job

Hello,
I am having the following issue when trying to launch a job:

java.lang.RuntimeException: Failed to get FS delegation token for default FS.
at com.linkedin.tony.TonyClient.getTokens(TonyClient.java:555)
at com.linkedin.tony.TonyClient.run(TonyClient.java:177)
at com.linkedin.tony.TonyClient.start(TonyClient.java:716)
at com.linkedin.tony.TonyClient.start(TonyClient.java:703)
at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:54)

I went to the code and I see a call to fs.getDelegationToken(tokenRenewer). However, I don't see such a method in the FileSystem API, so I am not sure what I should do next.
Thanks in advance for the help provided!

TonyApplicationMaster doesn't return correct status.

When running locally with a python script that exits 1 (mocking a failed worker), the TonY AM only fails when tony.application.single-node is set to true. We expect that when a non-chief worker fails, the TonY AM should continue training, but the application should still be reported as failed, regardless of training mode.

build TonY on CDH 5.15.1(Hadoop 2.6.0)

Is there some feature that is strictly not supported on lower Hadoop versions? I tried to build TonY on Hadoop 2.6.0 and got the following:

$ ./gradlew build -x test

> Task :tony-core:compileJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/xxx/TonY/tony-core/src/main/java/com/linkedin/tony/rpc/impl/ApplicationRpcClient.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Unknown file extension: tony-core/src/main/resources/META-INF/services/org.apache.hadoop.security.SecurityInfo
Unknown file extension: tony-core/src/test/resources/test.tar
Unknown file extension: tony-core/src/test/resources/test.tar.gz
Unknown file extension: tony-core/src/test/resources/test.zip

> Task :tony-history-server:compilePlayBinaryScala
Pruning sources from previous analysis, due to incompatible CompileSetup.

> Task :tony-history-server:compilePlayBinaryTests
Pruning sources from previous analysis, due to incompatible CompileSetup.
Note: /home/xxx/TonY/tony-history-server/test/utils/TestHdfsUtils.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

> Task :tony-history-server:testPlayBinary

controllers.BrowserTest > test FAILED
    java.lang.RuntimeException
        Caused by: akka.stream.impl.io.ConnectionSourceStage$$anon$2$$anon$1
            Caused by: java.net.BindException

12 tests completed, 1 failed

> Task :tony-history-server:testPlayBinary FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':tony-history-server:testPlayBinary'.
> There were failing tests. See the report at: file:///home/xxx/TonY/tony-history-server/build/playBinary/reports/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/4.10.2/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 27s
45 actionable tasks: 40 executed, 5 up-to-date

TonY uses both tony.history.location and tony.historyFolder

Currently, tony-default.xml contains tony.history.location whereas other places use tony.historyFolder. We should standardize on one. Looking at Hadoop configs, it seems all.lowercase.period.separated is the more standard naming convention, rather than camelCase in config names.

We should also avoid hardcoding tony.historyFolder in multiple places and instead define a String constant once and use it everywhere.
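A minimal sketch of the single-constant approach (the class name is hypothetical):

```java
public class TonyConfKeys {
  // Define the key once and reference it everywhere, instead of hardcoding
  // the string in multiple places. The chosen name follows the
  // all.lowercase.period.separated convention.
  public static final String TONY_HISTORY_LOCATION = "tony.history.location";

  private TonyConfKeys() { }  // constants holder, not instantiable

  public static void main(String[] args) {
    // Call sites would then read conf.get(TonyConfKeys.TONY_HISTORY_LOCATION).
    System.out.println(TONY_HISTORY_LOCATION);
  }
}
```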

TonY assumes python_venv.zip at root folder

TaskExecutor uses the -python_venv value passed by the user as the path to locate the Python virtual environment. This is wrong: after localization, the archive will always be at the container's root folder, so we should use a constant instead.

Remove registration timeout and retry logic

Currently, there are a couple timeouts involved in worker/parameter server registration:

  • tony.task.registration-timeout-sec (default 300 sec)
  • 120 sec polling until non-null in TaskExecutor.registerAndGetClusterSpec()
    return Utils.pollTillNonNull(() ->
        proxy.registerWorkerSpec(jobName + ":" + taskIndex,
            InetAddress.getLocalHost().getHostName() + ":" + rpcPort), 3, 120);

If there are large container scheduling/start-up delays, jobs can fail due to this. We should remove these timeouts entirely. We also then don't need the tony.task.registration-retry-count property either.
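Without the deadlines, registration reduces to a simple block-until-non-null loop; a sketch assuming a fixed retry interval is still wanted (names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PollForever {
  // Polls until the supplier returns non-null, with no overall deadline.
  // Slow container scheduling then delays registration instead of failing the job.
  static <T> T pollTillNonNull(Supplier<T> supplier, long intervalMillis)
      throws InterruptedException {
    T result;
    while ((result = supplier.get()) == null) {
      Thread.sleep(intervalMillis);
    }
    return result;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger calls = new AtomicInteger();
    // Simulated AM that only answers on the third attempt.
    String spec = pollTillNonNull(
        () -> calls.incrementAndGet() < 3 ? null : "cluster-spec", 10);
    System.out.println(spec + " after " + calls.get() + " calls");
  }
}
```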

hdfs_classpath should be an array

Currently we can only localize one HDFS classpath; we should be able to pull resources from multiple HDFS paths. This option should also be renamed to hdfs_resources.
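A sketch of parsing a comma-separated hdfs_resources value into an array (names are hypothetical, not TonY's actual option handling):

```java
import java.util.Arrays;

public class HdfsResources {
  // Accept a comma-separated list (e.g. --hdfs_resources path1,path2)
  // instead of a single path. Whitespace around commas is tolerated.
  static String[] parseResources(String value) {
    if (value == null || value.trim().isEmpty()) {
      return new String[0];
    }
    return value.trim().split("\\s*,\\s*");
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(
        parseResources("hdfs:///a.zip, hdfs:///b.jar")));
  }
}
```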

Failed to run tensorflow mnist example

2018-12-12 19:43:26 ERROR TonyApplicationMaster:935 - [2018-12-12 19:43:25.607]Container [pid=12069,containerID=container_1544604976318_0003_01_000003] is running 22081205248B beyond the 'VIRTUAL' memory limit. Current usage: 979.3 MB of 2 GB physical memory used; 24.8 GB of 4.2 GB virtual memory used. Killing container.

How do I configure the 'VIRTUAL' memory limit in TonY or on YARN? The virtual memory usage of my process seems very large.
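This is not a TonY setting; the check comes from the YARN NodeManager. Python/TensorFlow processes often reserve large amounts of virtual memory, so if the virtual-memory check is not meaningful for your workload, one common approach is to relax or disable it in yarn-site.xml:

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- or, instead of disabling the check, raise the virtual-to-physical ratio -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>10</value>
</property>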

TonY application hangs if it requests more GPUs per task than are available on a single node

Suppose each node in a cluster only has 4 GPUs. If a TonY application requests 5 GPUs per worker (tony.worker.gpus = 5), YARN will give TonY containers with 4 GPUs, but TonY will not start anything in those containers due to an NPE:

TonyApplicationMaster:1013 - Error java.lang.NullPointerException: Task was null! Nothing to schedule.

This comes from ContainerLauncher.run():

TFTask task = session.getMatchingTask(container.getAllocationRequestId());
Preconditions.checkNotNull(task, "Task was null! Nothing to schedule.");

Instead of hanging, TonY should probably either:

  1. Fail the application if it requests more GPUs per task than any single node has
  2. Still try and launch the tasks on the containers YARN gives TonY, even though they have fewer GPUs than requested

Internal JIRA: LIHADOOP-40976
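Option 1 could be sketched as an up-front validation (the method name is hypothetical):

```java
public class GpuRequestValidation {
  // Fail the application immediately when a task asks for more GPUs than any
  // single node can offer, instead of waiting on containers that will never
  // match the request.
  static void validateGpuRequest(int requestedGpusPerTask, int maxGpusPerNode) {
    if (requestedGpusPerTask > maxGpusPerNode) {
      throw new IllegalArgumentException(
          "Requested " + requestedGpusPerTask + " GPUs per task but the largest "
          + "node has only " + maxGpusPerNode + "; failing fast instead of hanging.");
    }
  }

  public static void main(String[] args) {
    validateGpuRequest(4, 4);  // fits on a 4-GPU node: no exception
    System.out.println("request fits");
  }
}
```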

ps process is not killed after application finishes

When I run the mnist example, the ps process is still running after the TonY application finishes.
A comment in tensorflow/python/training/server_lib.py in TensorFlow says:
This method currently blocks forever.

Should TonY do the cleanup?

Update doc for MiniTony for model dev

Currently, MiniTony is only used in our unit tests. We should provide documentation on how other folks can leverage it to iterate on their model code faster without submitting it to a remote cluster.

TonyClient and startTHS.sh should read tony-site.xml config file from TONY_CONF_DIR

TonY clients and the TonY History Server need to use the same value for the location of the history files. The client needs to tell the TonY AM to write to that location and the history server needs to read from that location.

We should define this location in a tony-site.xml file and expect it to be in the directory pointed to by the environment variable TONY_CONF_DIR, which can be set before running TonY clients or the history server.

See #81 (comment) for more context.

Failed to run TF mnist example in GCP

Unable to run mnist example in Dataproc.

sudo java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=/usr/local/src/MyJob/venv.zip --src_dir=/usr/local/src/TonY/mnist/ --executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py --conf_file=/usr/local/src/tony.xml --python_binary_path=venv/bin/python3.5
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/11 08:23:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, null/core-site.xml, null/hdfs-site.xml
Nov 11, 2018 8:23:02 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
18/11/11 08:23:03 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/11 08:23:03 INFO cli.ClusterSubmitter: Copying /usr/local/src/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/6665ca2a-fd31-4f61-a947-33f895517302
Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:60)
hadoop version
Hadoop 2.9.0
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-17T12:00Z
Compiled with protoc 2.5.0
From source with checksum f510b6e8bafb2ddfd660aeb7454e7c30
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

Java version

java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Command run:

java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/MyJob/venv.zip \
--src_dir=/usr/local/src/TonY/mnist/ \
--executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py \
--conf_file=/usr/local/src/tony.xml \
--python_binary_path=venv/bin/python3.5

Directory structure:

.
├── src
│   └── mnist_distributed.py
├── tony-cli-0.1.5-all.jar
├── tony.xml
└── venv.zip

tony.xml contents:

<configuration>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>15g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>0</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>

TonY should launch TensorBoard in separate container

Currently, TonY allocates the TensorBoard port on the chief worker (worker 0) and expects the chief worker task to read in the port from the environment and launch the TensorBoard process. However, this can often cause the chief worker to fail with OutOfMemoryExceptions. To help mitigate this, TonY should allocate a separate container just for running TensorBoard. This would be more similar to what's generally done when running TensorFlow on Kubernetes -- TensorBoard is started in a separate pod from the workers and parameter servers (see the training.yaml example here).
