tony's Issues

Installation of TonY in GCP. Tests under TestTonyE2E fail during build.

Tried to install TonY on Google Cloud via Dataproc, but tests failed during build.

Setup:

  • 1 Master node
  • 2 Workers

Operating System:

Debian GNU/Linux 8

Attached logs with --debug information: [build.log](https://github.com/linkedin/TonY/files/2545362/build.log)

#yarn node -list
18/11/03 21:56:40 INFO client.RMProxy: Connecting to ResourceManager at tony-m/10.138.0.4:8032
Total Nodes:2
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
tony-w-0.c.dpe.internal:33607         RUNNING tony-w-0.c.dpe.internal:8042                            11
tony-w-1.c.dpe.internal:44563         RUNNING tony-w-1.c.dpe.internal:8042                             9
#hadoop version
Hadoop 2.8.4
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-09T10:27Z
Compiled with protoc 2.5.0
From source with checksum 373fbec5524db42be27f1396ffbd2fc6
This command was run using /usr/lib/hadoop/hadoop-common-2.8.4.jar
#java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)
#echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64

When running ./gradlew build

sudo ./gradlew build --stacktrace
> Task :tony-core:test
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testNullAMRpcClient FAILED
    java.lang.AssertionError at TestTonyE2E.java:268
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSSkewedWorkerTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:110
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testPSWorkerTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:127
Gradle suite > Gradle test > com.linkedin.tony.TestTonyE2E.testSingleNodeTrainingShouldPass FAILED
    java.lang.AssertionError at TestTonyE2E.java:73
27 tests completed, 4 failed
> Task :tony-core:test FAILED
FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':tony-core:test'.
> There were failing tests. See the report at: file:///usr/local/src/TonY/tony-core/build/reports/tests/test/index.html
* Try:
Run with --info or --debug option to get more log output. Run with --scan to get full insights.

Unauthorized connection for super-user: rm/[email protected]

2018-11-12 22:03:02 INFO  TonyClient:198 - Submitting YARN application
2018-11-12 22:03:03 FATAL TonyClient:776 - Failed to run TonyClient
org.apache.hadoop.yarn.exceptions.YarnException: Failed to submit application_1541916949981_266668 to YARN : Unauthorized connection for super-user: rm/[email protected] from IP xx.xx.xx.xx
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:272)
	at com.linkedin.tony.TonyClient.run(TonyClient.java:199)
	at com.linkedin.tony.TonyClient.start(TonyClient.java:774)
	at com.linkedin.tony.TonyClient.start(TonyClient.java:762)
	at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:76)
2018-11-12 22:03:03 ERROR TonyClient:786 - Application failed to complete successfully
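This error comes from Hadoop's proxy-user (impersonation) checks: the rm principal is acting as a super-user but is not whitelisted for impersonation. A hedged sketch of the usual remedy is to add proxyuser entries to core-site.xml on the ResourceManager/NameNode; the wildcard values below are placeholders and should be narrowed in practice.

```xml
<!-- Sketch only: core-site.xml proxyuser entries for the "rm" user.
     The "*" values are placeholder assumptions; restrict them for production. -->
<property>
  <name>hadoop.proxyuser.rm.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.rm.groups</name>
  <value>*</value>
</property>
```

After editing, `hdfs dfsadmin -refreshSuperUserGroupsConfiguration` and `yarn rmadmin -refreshSuperUserGroupsConfiguration` pick up the change without a restart.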

src_dir should be nullable

Part of making #93 easier. Users should be able to pass --resources SCHEMA://PATH/TO/RESOURCES instead of having to add resources through the local file system.

Example:


  @Test
  public void testTonyResourcesFlag() throws ParseException {
    conf.setBoolean(TonyConfigurationKeys.IS_SINGLE_NODE, false);
    client = new TonyClient(conf);
    client.init(new String[]{
        "--executes", "'/bin/cat log4j.properties'",
        "--hdfs_classpath", "/yarn/libs",
        "--container_env", Constants.SKIP_HADOOP_PATH + "=true",
        "--conf", "tony.worker.resources=/yarn/libs",
        "--conf", "tony.ps.instances=0",
    });
    int exitCode = client.start();
    Assert.assertEquals(exitCode, 0);
  }

Feature request: Add Google Cloud Bucket support

When running Dataproc on Google Cloud, it would be ideal to keep files in a company GCS bucket, whether private or public.

Support for:

  • jars: (Already included in gcloud dataproc command). To be tested.
  • python_venv
  • executes
  • conf_file
  • src_dir

Since some GCS buckets are not public, it may be necessary to pass credentials (a JSON key file) in a separate parameter.
Code sample:

gcloud dataproc jobs submit hadoop --cluster tony-staging \
--class com.linkedin.tony.cli.ClusterSubmitter \
--jars gs://tony-staging/tony-cli-0.1.5-all.jar -- \
--python_venv=gs://tony-staging/env/tf19.zip \
--src_dir=gs://tony-staging/tony/mnist/src/ \
--executes=gs://tony-staging/tony/mnist/src/mnist_distributed.py \
--conf_file=gs://tony-staging/tony/conf/tony.xml \
--python_binary_path=tf19/bin/python3.5

Related to #74

Run Play tests during build

Currently, the Play tests BrowserTest and HomeControllerTest do not run as part of the build. For example, in https://api.travis-ci.org/v3/job/448123064/log.txt, we see

> Task :tony-history-server:testClasses UP-TO-DATE
Skipping task ':tony-history-server:testClasses' as it has no actions.
:tony-history-server:testClasses (Thread[Task worker for ':',5,main]) completed. Took 0.0 secs.
:tony-history-server:test (Thread[Task worker for ':',5,main]) started.

> Task :tony-history-server:test NO-SOURCE
Skipping task ':tony-history-server:test' as it has no source files and no previous output files.
:tony-history-server:test (Thread[Task worker for ':',5,main]) completed. Took 0.002 secs.

The Play tests are run as part of the testPlayBinary task, which the test task does NOT depend on.

Failed to run mnist example on Hadoop cluster

Tried to build TonY and run the mnist-tensorflow example, but got the error message "ERROR tony.TonyClient: Application failed to complete successfully". There is no clear error in the Hadoop logs. Furthermore, although the build succeeded, I could not find the tony folder or the tony.xml configuration file. Thanks in advance for any help.

My configurations are:
A Hadoop cluster (4 nodes: 1 master and 3 slaves) on Virtualbox.
TonY: 0.1.3
Hadoop: 2.9.1
Tensorflow: 1.9.0

The printed out info:
18/11/05 14:49:59 INFO tony.TonyClient: TonY heartbeat interval [1000]
18/11/05 14:49:59 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
18/11/05 14:49:59 INFO tony.TonyClient: Starting client..
18/11/05 14:49:59 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:05 WARN hdfs.DataStreamer: Caught exception
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Thread.join(Thread.java:1252)
at java.lang.Thread.join(Thread.java:1326)
at org.apache.hadoop.hdfs.DataStreamer.closeResponder(DataStreamer.java:980)
at org.apache.hadoop.hdfs.DataStreamer.endBlock(DataStreamer.java:630)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:807)
18/11/05 14:50:05 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path /home/rui/venv/bin/python --python_venv /home/rui/venv.zip --executes /home/rui/TonY/tony-examples/mnist/mnist_distributed.py --hdfs_classpath hdfs://192.168.56.100:9000/user/rui/.tony/1adf67c5-3be7-4245-8a31-3c9204ae84a8 --container_env TONY_CONF_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tony-final.xml --container_env TONY_CONF_TIMESTAMP=1541451005299 --container_env TF_ZIP_LENGTH=102664099 --container_env TF_ZIP_TIMESTAMP=1541451005184 --container_env TF_ZIP_PATH=hdfs://192.168.56.100:9000/user/rui/.tony/application_1541449424539_0001/tf.zip --container_env TONY_CONF_LENGTH=3200 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log
18/11/05 14:50:05 INFO tony.TonyClient: Submitting YARN application
18/11/05 14:50:05 INFO impl.YarnClientImpl: Submitted application application_1541449424539_0001
18/11/05 14:50:05 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://tf-yarn-master:8088/proxy/application_1541449424539_0001/
18/11/05 14:50:05 INFO tony.TonyClient: ResourceManager web address for application: http://192.168.56.100:8088/cluster/app/application_1541449424539_0001
18/11/05 14:50:11 INFO tony.TonyClient: AM host: tf-yarn-slave3
18/11/05 14:50:11 INFO tony.TonyClient: AM RPC port: 14925
18/11/05 14:50:11 INFO client.RMProxy: Connecting to ResourceManager at /192.168.56.100:8032
18/11/05 14:50:13 INFO tony.TonyClient: Logs for ps 0 at: http://tf-yarn-slave2:8042/node/containerlogs/container_1541449424539_0001_01_000002/rui
18/11/05 14:50:13 INFO tony.TonyClient: Logs for worker 0 at: http://tf-yarn-slave3:8042/node/containerlogs/container_1541449424539_0001_01_000003/rui
18/11/05 14:50:21 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:1
18/11/05 14:50:22 ERROR tony.TonyClient: Application failed to complete successfully

Container exited with 132 when runs example mnist-tensorflow

I am trying to follow the mnist-tensorflow example in tony-examples, but when I run the following command my containers exit with code 132 and I can't find out why, which really confuses me. Any ideas?

java version: 1.8.0_181

Hadoop version: 3.1.1

java -cp "`hadoop classpath --glob`:tony/*:tony" \
            com.linkedin.tony.cli.ClusterSubmitter \
            -executes src/models/mnist_distributed.py \
            -python_venv env.zip \
            -python_binary_path env/bin/python \
            -src_dir src \
            -shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server

tony.xml

<configuration>
  <property>
    <name>tony.application.hdfs-conf-path</name>
    <value>/home/hadoop/hadoop/etc/hadoop/hdfs-site.xml</value>
  </property>
  <property>
    <name>tony.application.yarn-conf-path</name>
    <value>/home/hadoop/hadoop/etc/hadoop/yarn-site.xml</value>
  </property>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
</configuration> 

the console:

2018-10-17 08:27:49,932 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-10-17 08:27:50,132 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, null/core-site.xml, null/hdfs-site.xml
2018-10-17 08:27:50,465 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-10-17 08:27:51,887 INFO cli.ClusterSubmitter: Copying /home/hadoop/distribute/tony/tony-cli-0.1.3-all.jar to: hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY heartbeat interval [1000]
2018-10-17 08:27:53,753 INFO tony.TonyClient: TonY max heartbeat misses allowed [25]
2018-10-17 08:27:53,790 INFO tony.TonyClient: Starting client..
2018-10-17 08:27:53,796 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:27:54,163 INFO conf.Configuration: resource-types.xml not found
2018-10-17 08:27:54,164 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2018-10-17 08:28:08,606 INFO tony.TonyClient: Completed setting up app master command {{JAVA_HOME}}/bin/java -Xmx1638m -Dyarn.app.container.log.dir=<LOG_DIR> com.linkedin.tony.TonyApplicationMaster --python_binary_path env/bin/python --python_venv env.zip --executes src/models/mnist_distributed.py --hdfs_classpath hdfs://localhost:9000/user/hadoop/.tony/ffae84e0-edd3-444a-9148-a25124a3e7bc --shell_env LD_LIBRARY_PATH=/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server --container_env TONY_CONF_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony-final.xml --container_env YARN_CONF_PATH=home/hadoop/hadoop/etc/hadoop/yarn-site.xml --container_env TONY_CONF_TIMESTAMP=1539764888557 --container_env TONY_CONF_LENGTH=3659 --container_env TONY_ZIP_PATH=hdfs://localhost:9000/user/hadoop/.tony/application_1539761891085_0002/tony.zip --container_env TONY_ZIP_LENGTH=154330934 --container_env TONY_ZIP_TIMESTAMP=1539764888067 --container_env CLASSPATH={{CLASSPATH}}<CPS>./*<CPS>{{HADOOP_CONF_DIR}}<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/*<CPS>{{HADOOP_COMMON_HOME}}/share/hadoop/common/lib/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/*<CPS>{{HADOOP_HDFS_HOME}}/share/hadoop/hdfs/lib/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/*<CPS>{{HADOOP_YARN_HOME}}/share/hadoop/yarn/lib/* --container_env HDFS_CONF_PATH=home/hadoop/hadoop/etc/hadoop/hdfs-site.xml 1><LOG_DIR>/amstdout.log 2><LOG_DIR>/amstderr.log 
2018-10-17 08:28:08,607 INFO tony.TonyClient: Submitting YARN application
2018-10-17 08:28:08,712 INFO impl.YarnClientImpl: Submitted application application_1539761891085_0002
2018-10-17 08:28:08,718 INFO tony.TonyClient: URL to track running application (will proxy to TensorBoard once it has started): http://HP-DL580-G7:8088/proxy/application_1539761891085_0002/
2018-10-17 08:28:08,719 INFO tony.TonyClient: ResourceManager web address for application: http://0.0.0.0:8088/cluster/app/application_1539761891085_0002
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM host: HP-DL580-G7
2018-10-17 08:28:14,764 INFO tony.TonyClient: AM RPC port: 13923
2018-10-17 08:28:14,770 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
2018-10-17 08:28:19,025 INFO tony.TonyClient: Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000002/hadoop
2018-10-17 08:28:19,026 INFO tony.TonyClient: Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539761891085_0002_01_000003/hadoop
2018-10-17 08:28:36,135 INFO tony.TonyClient: Application finished unsuccessfully. YarnState=FINISHED, DSFinalStatus=FAILED. Breaking monitoring loop : ApplicationId:2
2018-10-17 08:28:36,200 ERROR tony.TonyClient: Application failed to complete successfully

the amstdout.log:

2018-10-17 08:40:07 INFO  TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:07 INFO  TonyApplicationMaster:909 - Successfully started container container_1539765357563_0002_01_000003
2018-10-17 08:40:08 INFO  TonyApplicationMaster:728 - Client requesting TaskUrls!
2018-10-17 08:40:09 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:14 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:18 INFO  TonyApplicationMaster:770 - Received cluster spec registration request from task ps:0 with spec: HP-DL580-G7:33439
2018-10-17 08:40:18 INFO  TonyApplicationMaster:783 - [ps:0] Received Registration for HB !!
2018-10-17 08:40:18 INFO  TonyApplicationMaster:795 - Received registrations from 1 tasks, awaiting registration from 1 tasks.
2018-10-17 08:40:18 INFO  TonyApplicationMaster:797 - Awaiting registration from task worker 0 in container_1539765357563_0002_01_000003 on host HP-DL580-G7
2018-10-17 08:40:19 INFO  TonyApplicationMaster:770 - Received cluster spec registration request from task worker:0 with spec: HP-DL580-G7:35215
2018-10-17 08:40:19 INFO  TonyApplicationMaster:783 - [worker:0] Received Registration for HB !!
2018-10-17 08:40:19 INFO  TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:19 INFO  TonyApplicationMaster:831 - Got request to update TensorBoard URL: HP-DL580-G7:45163
2018-10-17 08:40:19 WARN  TonyApplicationMaster:850 - This Hadoop version doesn't have the YARN-7974 patch, TonY won't register TensorBoard URL with application's tracking URL
2018-10-17 08:40:19 INFO  TonyApplicationMaster:528 - Completed worker tasks: 0, total worker tasks: 1
2018-10-17 08:40:20 INFO  TonyApplicationMaster:811 - Received result registration request with exit code 132 from worker 0
2018-10-17 08:40:21 INFO  TonyApplicationMaster:789 - All 2 tasks registered.
2018-10-17 08:40:21 INFO  TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:21 INFO  TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000003, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:21 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:20.925]Exception from container-launch.
Container id: container_1539765357563_0002_01_000003
Exit code: 132

[2018-10-17 08:40:20.934]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


[2018-10-17 08:40:20.935]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]



2018-10-17 08:40:21 INFO  TonyApplicationMaster:961 - Unregister task [worker:0] from Heartbeat monitor..
2018-10-17 08:40:21 INFO  TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000003
2018-10-17 08:40:22 INFO  TonyApplicationMaster:811 - Received result registration request with exit code 132 from ps 0
2018-10-17 08:40:23 INFO  TonyApplicationMaster:944 - Completed containers: 1
2018-10-17 08:40:23 INFO  TonyApplicationMaster:947 - ContainerID = container_1539765357563_0002_01_000002, state = COMPLETE, exitStatus = 132
2018-10-17 08:40:23 ERROR TonyApplicationMaster:952 - [2018-10-17 08:40:23.168]Exception from container-launch.
Container id: container_1539765357563_0002_01_000002
Exit code: 132

[2018-10-17 08:40:23.175]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


[2018-10-17 08:40:23.177]Container exited with a non-zero exit code 132. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/tmp/nm-local-dir/usercache/hadoop/filecache/13/tony-cli-0.1.3-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]



2018-10-17 08:40:23 INFO  TonyApplicationMaster:961 - Unregister task [ps:0] from Heartbeat monitor..
2018-10-17 08:40:23 INFO  TonyApplicationMaster:966 - Container failed, id = container_1539765357563_0002_01_000002
2018-10-17 08:40:24 INFO  TonyApplicationMaster:512 - Completed jobs: 1 total jobs: 1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:564 - Total completed worker tasks: 1, total worker tasks: 1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:570 - TensorFlow session failed: At least one job task exited with non-zero status, failedCnt=1
2018-10-17 08:40:24 INFO  TonyApplicationMaster:335 - Result: false, job failed: true, retry count: 0
2018-10-17 08:40:25 INFO  TonyApplicationMaster:837 - Client signals AM to finish application.
2018-10-17 08:40:29 INFO  Utils:61 - Poll function finished within 30 seconds
2018-10-17 08:40:29 INFO  TonyApplicationMaster:145 - Logs for ps 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000002/hadoop
2018-10-17 08:40:29 INFO  TonyApplicationMaster:145 - Logs for worker 0 at: http://HP-DL580-G7:8042/node/containerlogs/container_1539765357563_0002_01_000003/hadoop
2018-10-17 08:40:29 INFO  TonyApplicationMaster:355 - Application Master failed. exiting

and the worker container:

2018-10-17 08:40:08 INFO  TaskExecutor:86 - TaskExecutor is running..
2018-10-17 08:40:08 INFO  TaskExecutor:80 - Reserved rpcPort: 35215
2018-10-17 08:40:08 INFO  TaskExecutor:81 - Reserved tbPort: 45163
2018-10-17 08:40:08 INFO  TaskExecutor:82 - Reserved py4j gatewayServerPort: 35421
2018-10-17 08:40:08 INFO  TaskExecutor:178 - Task command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:08 INFO  Utils:109 - Unzipping tony.zip to destination ./
2018-10-17 08:40:10 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: HP-DL580-G7:11789
2018-10-17 08:40:10 INFO  TaskExecutor:96 - Unpacking Python virtual environment: env.zip
2018-10-17 08:40:10 INFO  Utils:109 - Unzipping env.zip to destination venv
2018-10-17 08:40:19 INFO  TaskExecutor:107 - Executor is running task worker 0
2018-10-17 08:40:19 INFO  TaskExecutor:190 - Application Master address : HP-DL580-G7:11789
2018-10-17 08:40:19 INFO  TaskExecutor:193 - ContainerId is: container_1539765357563_0002_01_000003 HostName is: HP-DL580-G7
2018-10-17 08:40:19 INFO  TaskExecutor:201 - Connecting to HP-DL580-G7:11789 to register worker spec: worker 0 HP-DL580-G7:35215
2018-10-17 08:40:19 INFO  Utils:82 - Poll function finished within 120 seconds
2018-10-17 08:40:19 INFO  TaskExecutor:114 - Successfully registered and got cluster spec: {"ps":["HP-DL580-G7:33439"],"worker":["HP-DL580-G7:35215"]}
2018-10-17 08:40:19 INFO  TaskExecutor:211 - TensorBoard address : HP-DL580-G7:45163
2018-10-17 08:40:19 INFO  Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:19 INFO  TaskExecutor:214 - Register TensorBoard response: SUCCEEDED
2018-10-17 08:40:19 INFO  Utils:210 - Executing command: venv/env/bin/python src/models/mnist_distributed.py
2018-10-17 08:40:20 INFO  Utils:82 - Poll function finished within 60 seconds
2018-10-17 08:40:20 INFO  TaskExecutor:223 - AM response for result execution run: RECEIVED
2018-10-17 08:40:20 INFO  TaskExecutor:148 - Child process exited with exit code 132
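An aside on the exit code itself (not from the logs above): 132 follows the shell convention of 128 plus the signal number, and signal 4 is SIGILL (illegal instruction). This is commonly seen when a TensorFlow binary was compiled for CPU instructions (e.g. AVX) that the host CPU does not support. A quick demonstration of the 128 + signal arithmetic:

```shell
# A process killed by SIGILL (signal 4) is reported as exit code 128 + 4 = 132.
sh -c 'kill -s ILL $$' || status=$?   # the child sends SIGILL to itself
echo "exit code: $status"             # prints: exit code: 132
```

If this is the cause here, installing a TensorFlow build matching the machine's CPU features would be worth trying.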

Refactor hdfs_classpath, src_dir, python_venv to all use resources and fix the path issue for input files

Currently, we have different handling logic for hdfs_classpath, which we add to the AM's localizable container resources and then pass again to the workers; for src_dir and python_venv, we add them to a tony.zip. One difference between the two is that for src_dir we care about the folder structure, whereas for python_venv we don't, since it is a single zip file.

All this resource handling could be unified via the new tony.container.resources flag: we localize every resource listed in that field (comma-delimited).

We pass python_venv all the way from client -> AM -> TaskExecutor as a command-line argument, which is unnecessary. We can always use a relative path and assume the top-level folder is venv.

The plan is to get rid of the tony.zip creation logic and use the resource-handling logic for all these scenarios. The -executes and -python_binary_path flags will always take a relative path inside the uploaded artifact.

ls /user/alice/tonyJob
 - venv.zip
 - src/
    - mnist.py

Inside venv.zip

venv.zip
  - bin/
  - lib/

Example:

java -cp `hadoop classpath`:/path/to/TonY/tony-cli/build/libs/tony-cli-x.x.x-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/user/alice/tonyJob/venv.zip  \
--src_dir=/user/alice/tonyJob/src  \
--executes=mnist_distributed.py \
--python_binary_path=bin/python

Under the hood, we pack src_dir into a SRC.zip, upload it to HDFS, and set tonyConf's tony.container.resources to include it; all containers will then localize the zip, and if SRC.zip exists, we unzip it.

tony-final.xml's tony.container.resources will be like:

<property>
  <name>tony.container.resources</name>
  <value>hdfs://tony_tmp/SRC.zip, hdfs://tony_tmp/venv.zip, hdfs://hdfs_classpath/tony.jar</value>
</property>

The same applies to python_venv: we upload the venv zip file to HDFS, rename it to venv.zip, and set tonyConf; common logic then localizes it to all containers, where we unzip it.

WIP branch https://github.com/linkedin/TonY/tree/refactor

Ref #74

Failed to run PyTorch mnist example in GCP

Unable to run PyTorch sample code. Task is stuck in "RUNNING"

Setup

GCP DataProc

  • 1 master node
  • 2 worker nodes

Version

Hadoop 2.9.0
Subversion https://bigdataoss-internal.googlesource.com/third_party/apache/hadoop -r e8ce80c37eebb173fc688e7f5686d7df74d182aa
Compiled by bigtop on 2018-10-25T12:56Z
Compiled with protoc 2.5.0
From source with checksum 1eb388d554db8e1cadcab4c1326ee72
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

ML framework versions

PyTorch 0.4.0
Python 3.5

tony.xml

<configuration>
  <property>
    <name>tony.application.name</name>
    <value>PyTorch</value>
  </property>   
  <property>
    <name>tony.application.security.enabled</name>
   <value>false</value>
  </property>    
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>0</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>4g</value>
  </property>
  <property>
    <name>tony.application.framework</name>
    <value>pytorch</value>
  </property>
</configuration>
#yarn application -list -appStates ALL
18/11/20 07:06:46 INFO client.RMProxy: Connecting to ResourceManager at tony-staging-m/10.138.0.2:8032
18/11/20 07:06:47 INFO client.AHSProxy: Connecting to Application History server at tony-staging-m/10.138.0.2:10200
Total number of applications (application-types: [], states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED] and tags: []):9
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1542587994073_0009  TensorFlowApplication             TENSORFLOW          root         default                  KILLED                  KILLED                 100% http://tony-staging-m:8188/applicationhistory/app/application_1542587994073_0009
application_1542587994073_0010  TensorFlowApplication             TENSORFLOW          root         default                FINISHED                  FAILED                 100%                                 N/A
application_1542587994073_0015               PyTorch              TENSORFLOW          root         default                 RUNNING               UNDEFINED                   0%                                 N/A


Logs from:

node/containerlogs/container_1542587994073_0015_01_000002/root

Code fails in:

executor.taskIndex = Integer.parseInt(System.getenv(Constants.TASK_INDEX));

stderr

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/hadoop/yarn/nm-local-dir/usercache/root/filecache/36/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "main" java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:542)
	at java.lang.Integer.parseInt(Integer.java:615)
	at com.linkedin.tony.TaskExecutor.main(TaskExecutor.java:109)

stdout

2018-11-20 06:46:37 INFO  TaskExecutor:89 - TaskExecutor is running..
2018-11-20 06:46:37 INFO  TaskExecutor:83 - Reserved rpcPort: 43073
2018-11-20 06:46:37 INFO  TaskExecutor:84 - Reserved tbPort: 37633
2018-11-20 06:46:37 INFO  TaskExecutor:85 - Reserved py4j gatewayServerPort: 35571
2018-11-20 06:46:37 INFO  TaskExecutor:175 - Task command: venv/torch04/bin/python3.5 /usr/local/src/jobs/PTJob/src/mnist_distributed.py --root /tmp/data/
2018-11-20 06:46:37 INFO  Utils:132 - Unzipping tony.zip to destination ./
2018-11-20 06:46:39 INFO  TaskExecutor:184 - Setting up Rpc client, connecting to: tony-staging-w-0.c.dpe-cloud-mle.internal:11616
2018-11-20 06:46:39 INFO  TaskExecutor:102 - Unpacking Python virtual environment: /usr/local/src/jobs/PTJob/env/torch04.zip
2018-11-20 06:46:39 INFO  Utils:132 - Unzipping /usr/local/src/jobs/PTJob/env/torch04.zip to destination venv
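The NumberFormatException: null above means the TASK_INDEX environment variable was missing from the container environment. A minimal defensive sketch (the helper name is hypothetical, not TonY's actual API) that surfaces the real problem instead of a bare parse failure:

```java
public class TaskIndexParsing {
  // Hypothetical helper: parse a required integer environment variable,
  // failing with a clear message when it is unset instead of
  // "NumberFormatException: null".
  static int parseRequiredIntEnv(String name, String rawValue) {
    if (rawValue == null) {
      throw new IllegalStateException(
          "Required environment variable " + name + " is not set; "
          + "was this executor launched by the TonY AM?");
    }
    return Integer.parseInt(rawValue);
  }

  public static void main(String[] args) {
    // In the real executor this would be parseRequiredIntEnv(name, System.getenv(name)).
    System.out.println(parseRequiredIntEnv("TASK_INDEX", "3"));
  }
}
```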

TonY History Server should start using nohup

Currently, the startTHS.sh script uses exec to start, and so if the SSH session in which the THS is started dies, the THS process itself dies. We should use nohup so the THS will continue running even when the SSH session is closed. (Alternatively, we could run THS as a background process.)

TonY Portal should enforce retention on history files

Currently, history files are retained forever. The retention period should be configurable and TonY Portal should take care of enforcing retention.

As part of retention, we can also clean up in-progress files that are older than the retention period. (These are probably jobs that crashed or encountered other abnormal conditions.)
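A minimal sketch of the expiry test such retention enforcement could apply to each history file, including stale in-progress files (names are assumptions, not TonY Portal's actual code):

```java
import java.time.Duration;
import java.time.Instant;

public class RetentionCheck {
  // A file is expired once its last-modified time plus the configured
  // retention period is in the past.
  static boolean isExpired(Instant lastModified, Duration retention, Instant now) {
    return lastModified.plus(retention).isBefore(now);
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2018-12-01T00:00:00Z");
    Instant old = Instant.parse("2018-10-01T00:00:00Z");
    // A two-month-old history file against a 30-day retention period:
    System.out.println(isExpired(old, Duration.ofDays(30), now));
  }
}
```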

Internal Jira: LIHADOOP-43855

Running this on Cloudera Hadoop Distribution (5.13.1)

Hi,

I came across this great work just recently. I had a lot of issues using TensorflowOnSpark and TensorflowOnYARN earlier this year and had given up. I'm wondering how I can make use of this repo on top of my Cloudera Distribution of Hadoop. Any help is appreciated.

Thanks !

Mohammed Ayub

TonY Job History Server Phase I

A job history server for TonY jobs.

  • Metrics collection (CPU/MEM/GPU utilization)
  • UI for displaying counter and other information

TonY on docker

With #67, we'll support running TonY with Docker images. The prerequisite is a properly configured cluster capable of running YARN applications in Docker containers. Personal experience with that: https://medium.com/@oliver_hu/enable-hadoop-yarn-2-9-1-3-0-3-1-to-launch-application-using-docker-containers-1442a639bb64.

With this change, you should be able to launch your training jobs without zipping the Python virtual env anymore.

To enable TonY to launch your jobs in Docker, set:

<property>
  <description>Whether to use Docker containers to launch the tasks</description>
  <name>tony.docker.enabled</name>
  <value>true</value>
</property>
<property>
  <description>The Docker image used to launch the tasks</description>
  <name>tony.docker.image</name>
  <value>oliverhu/hadoop-base</value> <!-- your image -->
</property>

TonY EventHandler should use take() instead of poll()

Currently, EventHandler.run() spins inside this loop:

    while (!isStopped) {
      writeEvent(eventQueue, dataFileWriter);
    }

because writeEvent uses poll() which will immediately return null if the queue is empty.

We should update writeEvent() to use take() instead and have the stop() method interrupt the event-handler thread (writeEvent() should catch the resulting InterruptedException).
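A minimal sketch of a take()-based loop with interruption as the stop signal (class and field names are assumptions, not TonY's actual EventHandler):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class TakeBasedHandler {
  private final BlockingQueue<String> eventQueue = new ArrayBlockingQueue<>(16);
  final List<String> written = new CopyOnWriteArrayList<>();  // stands in for dataFileWriter
  private final Thread handler = new Thread(() -> {
    try {
      while (true) {
        // take() blocks until an event arrives -- no busy-wait on an empty queue.
        written.add(eventQueue.take());
      }
    } catch (InterruptedException e) {
      // stop() interrupted us; flush/close the underlying writer here.
    }
  });

  void start() { handler.start(); }
  void emit(String event) throws InterruptedException { eventQueue.put(event); }
  void stop() throws InterruptedException {
    handler.interrupt();  // interrupt the handler thread, not the caller
    handler.join();
  }

  public static void main(String[] args) throws InterruptedException {
    TakeBasedHandler h = new TakeBasedHandler();
    h.start();
    h.emit("event-1");
    h.emit("event-2");
    Thread.sleep(200);  // give the handler time to drain the queue
    h.stop();
    System.out.println(h.written);
  }
}
```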

Some problems building TonY and using GPUs

Java version: 1.8.0_18
tensorflow-gpu: 1.9
Running the following command:
./gradlew build

produces the following error:
com.linkedin.tony.TestTonyE2E.setup FAILED
java.lang.IllegalArgumentException: The value of property bind.address must not be null
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
at org.apache.hadoop.http.HttpServer2.initializeWebServer(HttpServer2.java:585)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:537)
at org.apache.hadoop.http.HttpServer2.(HttpServer2.java:117)
at org.apache.hadoop.http.HttpServer2$Builder.build(HttpServer2.java:421)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:160)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:869)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:691)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:937)
at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:910)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1308)
at org.apache.hadoop.hdfs.MiniDFSCluster.configureNameService(MiniDFSCluster.java:1077)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:952)
at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:884)
at org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:517)
at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:476)
at com.linkedin.minitony.cluster.MiniCluster.start(MiniCluster.java:50)
at com.linkedin.tony.TestTonyE2E.setup(TestTonyE2E.java:34)
What is the cause? I have checked all of the Hadoop configuration several times and it is all correct.

Also, when using GPUs, running the following command:
java -cp `hadoop classpath`:/TonY/tony-cli/build/libs/tony-cli-0.1.3-all.jar com.linkedin.tony.cli.LocalSubmitter
--python_venv=/venv.zip
--src_dir=/TonY/tony-examples/mnist
--executes=/TonY/tony-examples/mnist/mnist_distributed.py
--conf_file=/path/tony-test.xml
--python_binary_path=venv/bin/python
produces an error that libcublas.so.9.0 cannot be found. CUDA and cuDNN were configured previously with no problems; TensorFlow runs fine, including inside this virtual environment, but running the command above fails. Thanks.

Fast-fail on chief worker failure

If chief worker fails, we should immediately fail the application. (i.e. the underlying TF distributed training may hang, so TonY should just fail the application.)

Other workers failing is a separate issue. In theory if they fail the training can continue. But it can be configurable.
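The proposed policy could be sketched as follows (names and the configuration flag are assumptions, not TonY's actual API):

```java
public class FastFailPolicy {
  // Chief (or worker 0 in chief-less setups) failure always fails the
  // application, since distributed training may otherwise hang. Other worker
  // failures are tolerated only if configured.
  static boolean shouldFailFast(String jobName, int taskIndex, boolean tolerateWorkerFailures) {
    boolean isChief = "chief".equals(jobName)
        || ("worker".equals(jobName) && taskIndex == 0);
    if (isChief) {
      return true;
    }
    return !tolerateWorkerFailures;
  }

  public static void main(String[] args) {
    System.out.println(shouldFailFast("worker", 0, true));   // chief role
    System.out.println(shouldFailFast("worker", 1, true));   // tolerated failure
  }
}
```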

Problem launching job

Hello,
I am having the following issue when trying to launch a job:

java.lang.RuntimeException: Failed to get FS delegation token for default FS.
at com.linkedin.tony.TonyClient.getTokens(TonyClient.java:555)
at com.linkedin.tony.TonyClient.run(TonyClient.java:177)
at com.linkedin.tony.TonyClient.start(TonyClient.java:716)
at com.linkedin.tony.TonyClient.start(TonyClient.java:703)
at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:54)

I went to the code and I see a call to fs.getDelegationToken(tokenRenewer). However, I don't see such a method in the FileSystem API, so I am not sure what I should do next.
Thanks in advance for the help provided!

TonyApplicationMaster doesn't return correct status.

When running locally with a python script that exits 1 (mocking a failed worker), the TonY AM only fails when tony.application.single-node is set to true. We expect that when a non-chief worker fails, the TonY AM should continue training, but the application should still be reported as failed, regardless of training mode.

build TonY on CDH 5.15.1(Hadoop 2.6.0)

Is there some feature that is strictly not supported on lower Hadoop versions? I tried to build TonY on Hadoop 2.6.0 and got the following:

$ ./gradlew build -x test

> Task :tony-core:compileJava
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: /home/xxx/TonY/tony-core/src/main/java/com/linkedin/tony/rpc/impl/ApplicationRpcClient.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Unknown file extension: tony-core/src/main/resources/META-INF/services/org.apache.hadoop.security.SecurityInfo
Unknown file extension: tony-core/src/test/resources/test.tar
Unknown file extension: tony-core/src/test/resources/test.tar.gz
Unknown file extension: tony-core/src/test/resources/test.zip

> Task :tony-history-server:compilePlayBinaryScala
Pruning sources from previous analysis, due to incompatible CompileSetup.

> Task :tony-history-server:compilePlayBinaryTests
Pruning sources from previous analysis, due to incompatible CompileSetup.
Note: /home/xxx/TonY/tony-history-server/test/utils/TestHdfsUtils.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

> Task :tony-history-server:testPlayBinary

controllers.BrowserTest > test FAILED
    java.lang.RuntimeException
        Caused by: akka.stream.impl.io.ConnectionSourceStage$$anon$2$$anon$1
            Caused by: java.net.BindException

12 tests completed, 1 failed

> Task :tony-history-server:testPlayBinary FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':tony-history-server:testPlayBinary'.
> There were failing tests. See the report at: file:///home/xxx/TonY/tony-history-server/build/playBinary/reports/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

Deprecated Gradle features were used in this build, making it incompatible with Gradle 5.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/4.10.2/userguide/command_line_interface.html#sec:command_line_warnings

BUILD FAILED in 27s
45 actionable tasks: 40 executed, 5 up-to-date

TonY uses both tony.history.location and tony.historyFolder

Currently, tony-default.xml contains tony.history.location whereas other places use tony.historyFolder. We should standardize on one. Looking at Hadoop configs, it seems all.lowercase.period.separated is the more standard naming convention, rather than camelCase in config names.

We should also avoid hardcoding tony.historyFolder in multiple places and instead define a String constant once and use it everywhere.
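A minimal sketch of the single-constant approach (the class name is hypothetical):

```java
public class TonyConfKeys {
  // Define the key once and reference it everywhere, instead of hardcoding
  // the string in multiple places. The chosen name follows the
  // all.lowercase.period.separated convention.
  public static final String TONY_HISTORY_LOCATION = "tony.history.location";

  private TonyConfKeys() { }  // constants holder, not instantiable

  public static void main(String[] args) {
    // Call sites would then read conf.get(TonyConfKeys.TONY_HISTORY_LOCATION).
    System.out.println(TONY_HISTORY_LOCATION);
  }
}
```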

TonY assumes python_venv.zip at root folder

TaskExecutor uses the -python_venv value passed by the user as the path to locate the Python virtual environment. This is wrong: after localization, the archive will always be at the container's root folder, so we should use a constant instead.

Remove registration timeout and retry logic

Currently, there are a couple timeouts involved in worker/parameter server registration:

  • tony.task.registration-timeout-sec (default 300 sec)
  • 120 sec polling until non-null in TaskExecutor.registerAndGetClusterSpec()
    return Utils.pollTillNonNull(() ->
        proxy.registerWorkerSpec(jobName + ":" + taskIndex,
            InetAddress.getLocalHost().getHostName() + ":" + rpcPort), 3, 120);

If there are large container scheduling/start-up delays, jobs can fail due to this. We should remove these timeouts entirely. We also then don't need the tony.task.registration-retry-count property either.
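Without the deadlines, registration reduces to a simple block-until-non-null loop; a sketch assuming a fixed retry interval is still wanted (names are hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class PollForever {
  // Polls until the supplier returns non-null, with no overall deadline.
  // Slow container scheduling then delays registration instead of failing the job.
  static <T> T pollTillNonNull(Supplier<T> supplier, long intervalMillis)
      throws InterruptedException {
    T result;
    while ((result = supplier.get()) == null) {
      Thread.sleep(intervalMillis);
    }
    return result;
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicInteger calls = new AtomicInteger();
    // Simulated AM that only answers on the third attempt.
    String spec = pollTillNonNull(
        () -> calls.incrementAndGet() < 3 ? null : "cluster-spec", 10);
    System.out.println(spec + " after " + calls.get() + " calls");
  }
}
```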

hdfs_classpath should be an array

Currently we can only localize one HDFS classpath; we should be able to pull resources from multiple HDFS paths. This option should also be renamed to hdfs_resources.
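A sketch of parsing a comma-separated hdfs_resources value into an array (names are hypothetical, not TonY's actual option handling):

```java
import java.util.Arrays;

public class HdfsResources {
  // Accept a comma-separated list (e.g. --hdfs_resources path1,path2)
  // instead of a single path. Whitespace around commas is tolerated.
  static String[] parseResources(String value) {
    if (value == null || value.trim().isEmpty()) {
      return new String[0];
    }
    return value.trim().split("\\s*,\\s*");
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(
        parseResources("hdfs:///a.zip, hdfs:///b.jar")));
  }
}
```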

Failed to run tensorflow mnist example

2018-12-12 19:43:26 ERROR TonyApplicationMaster:935 - [2018-12-12 19:43:25.607]Container [pid=12069,containerID=container_1544604976318_0003_01_000003] is running 22081205248B beyond the 'VIRTUAL' memory limit. Current usage: 979.3 MB of 2 GB physical memory used; 24.8 GB of 4.2 GB virtual memory used. Killing container.

How do I configure the 'VIRTUAL' memory limit in TonY or on YARN? The virtual memory usage of my process seems very large.
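This is not a TonY setting; the check comes from the YARN NodeManager. Python/TensorFlow processes often reserve large amounts of virtual memory, so if the virtual-memory check is not meaningful for your workload, one common approach is to relax or disable it in yarn-site.xml:

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
<!-- or, instead of disabling the check, raise the virtual-to-physical ratio -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>10</value>
</property>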

TonY application hangs if it requests more GPUs per task than are available on a single node

Suppose each node in a cluster only has 4 GPUs. If a TonY application requests 5 GPUs per worker (tony.worker.gpus = 5), YARN will give TonY containers with 4 GPUs, but TonY will not start anything in those containers due to an NPE:

TonyApplicationMaster:1013 - Error java.lang.NullPointerException: Task was null! Nothing to schedule.

This comes from ContainerLauncher.run():

TFTask task = session.getMatchingTask(container.getAllocationRequestId());
Preconditions.checkNotNull(task, "Task was null! Nothing to schedule.");

Instead of hanging, TonY should probably either:

  1. Fail the application if it requests more GPUs per task than any single node has
  2. Still try and launch the tasks on the containers YARN gives TonY, even though they have fewer GPUs than requested

Internal JIRA: LIHADOOP-40976
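Option 1 could be sketched as an up-front validation (the method name is hypothetical):

```java
public class GpuRequestValidation {
  // Fail the application immediately when a task asks for more GPUs than any
  // single node can offer, instead of waiting on containers that will never
  // match the request.
  static void validateGpuRequest(int requestedGpusPerTask, int maxGpusPerNode) {
    if (requestedGpusPerTask > maxGpusPerNode) {
      throw new IllegalArgumentException(
          "Requested " + requestedGpusPerTask + " GPUs per task but the largest "
          + "node has only " + maxGpusPerNode + "; failing fast instead of hanging.");
    }
  }

  public static void main(String[] args) {
    validateGpuRequest(4, 4);  // fits on a 4-GPU node: no exception
    System.out.println("request fits");
  }
}
```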

ps process is not killed after application finishes

When I run the mnist example, the ps process is still running after the TonY application finishes.
A comment in tensorflow/python/training/server_lib.py in TensorFlow says:
This method currently blocks forever.

Should TonY do the cleanup?

Update doc for MiniTony for model dev

Currently, MiniTony is only used in our unit tests. We should provide documentation on how other folks can leverage it to iterate on their model code faster without submitting it to a remote cluster.

TonyClient and startTHS.sh should read tony-site.xml config file from TONY_CONF_DIR

TonY clients and the TonY History Server need to use the same value for the location of the history files. The client needs to tell the TonY AM to write to that location and the history server needs to read from that location.

We should define this location in a tony-site.xml file and expect it to be in the directory pointed to by the environment variable TONY_CONF_DIR, which can be set before running TonY clients or the history server.

See #81 (comment) for more context.

Failed to run TF mnist example in GCP

Unable to run mnist example in Dataproc.

sudo java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter --python_venv=/usr/local/src/MyJob/venv.zip --src_dir=/usr/local/src/TonY/mnist/ --executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py --conf_file=/usr/local/src/tony.xml --python_binary_path=venv/bin/python3.5
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Starting ClusterSubmitter..
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/11/11 08:23:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/11/11 08:23:02 INFO cli.ClusterSubmitter: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, null/core-site.xml, null/hdfs-site.xml
Nov 11, 2018 8:23:02 AM com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase <clinit>
INFO: GHFS version: hadoop2-1.9.8
18/11/11 08:23:03 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
18/11/11 08:23:03 INFO cli.ClusterSubmitter: Copying /usr/local/src/MyJob/tony-cli-0.1.5-all.jar to: hdfs://tony-dev-m/user/root/.tony/6665ca2a-fd31-4f61-a947-33f895517302
Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.lang.String;
        at com.linkedin.tony.cli.ClusterSubmitter.main(ClusterSubmitter.java:60)
hadoop version
Hadoop 2.9.0
Subversion Unknown -r Unknown
Compiled by bigtop on 2018-08-17T12:00Z
Compiled with protoc 2.5.0
From source with checksum f510b6e8bafb2ddfd660aeb7454e7c30
This command was run using /usr/lib/hadoop/hadoop-common-2.9.0.jar

Java version

java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-1~deb9u1-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)

Command run:

java -cp `hadoop classpath`:/usr/local/src/MyJob/tony-cli-0.1.5-all.jar com.linkedin.tony.cli.ClusterSubmitter \
--python_venv=/usr/local/src/MyJob/venv.zip \
--src_dir=/usr/local/src/TonY/mnist/ \
--executes=/usr/local/src/TonY/mnist/src/mnist_distributed.py \
--conf_file=/usr/local/src/tony.xml \
--python_binary_path=venv/bin/python3.5

Directory structure:

.
├── src
│   └── mnist_distributed.py
├── tony-cli-0.1.5-all.jar
├── tony.xml
└── venv.zip

tony.xml contents:

<configuration>
  <property>
    <name>tony.application.security.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>tony.worker.instances</name>
    <value>2</value>
  </property>
  <property>
    <name>tony.worker.memory</name>
    <value>15g</value>
  </property>
  <property>
    <name>tony.worker.gpus</name>
    <value>0</value>
  </property>
  <property>
    <name>tony.ps.memory</name>
    <value>3g</value>
  </property>
</configuration>

TonY should launch TensorBoard in separate container

Currently, TonY allocates the TensorBoard port on the chief worker (worker 0) and expects the chief worker task to read in the port from the environment and launch the TensorBoard process. However, this can often cause the chief worker to fail with OutOfMemoryExceptions. To help mitigate this, TonY should allocate a separate container just for running TensorBoard. This would be more similar to what's generally done when running TensorFlow on Kubernetes -- TensorBoard is started in a separate pod from the workers and parameter servers (see the training.yaml example here).
