
nauta's Introduction

IMPORTANT:

Intel has decided to stop further development of Nauta and will no longer be supporting the product. We appreciate your involvement with the product.

Nauta


See the docs at: https://intelai.github.io/nauta/

The Nauta software provides a multi-user, distributed computing environment for running deep learning model training experiments. Results of experiments can be viewed and monitored using a command line interface, web UI and/or TensorBoard*. You can use existing data sets, use your own data, or download data from online sources, and create public or private folders to make collaboration among teams easier.

Nauta runs on the industry-leading Kubernetes* and Docker* platforms for scalability and ease of management. Template packs for various DL frameworks and tooling are available (and customizable) on the platform to take the complexity out of creating and running single- and multi-node deep learning training experiments, without the systems overhead and scripting needed with standard container environments.

To test your model, Nauta also supports both batch and streaming inference, all in a single platform.

To build the Nauta installation package and run it smoothly on Google Cloud Platform, please follow our Nauta on Google Cloud Platform - Getting Started guide. More details on building Nauta artifacts can be found in the How to Build guide.

To get things up and running quickly, please take a look at our Getting Started guide.

For more in-depth information please refer to the following documents:

License

By contributing to the project software, you agree that your contributions will be licensed under the Apache 2.0 license that is included in the LICENSE file in the root directory of this source tree. The user materials are licensed under CC-BY-ND 4.0.

Contact

Submit a GitHub issue to ask a question, submit a request, or report a bug.

nauta's People

Contributors

adam-marek, adamtumi, ajanikow, aniejek, dancobb1, dependabot[bot], intel-rrozestw, jacob27, janekmichalik, jcchr, kamahoney1965, liuholly, mateusz-ciesielski, mgumowsk, michgorecki, mzylowski, pmilewsk, pskindel, waldekpi


nauta's Issues

Install error(failed)

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

I get the error messages below during installation:

TASK [master/etcd : [platform] Pull docker image] ******************************************************************************************************************
Friday 21 February 2020 21:36:25 +0900 (0:00:00.758) 0:02:17.605 *******
fatal: [master-0]: FAILED! => {"changed": false, "msg": "Error pulling image registry.service.lab007.nauta:5000/core/etcd:3.3.10 - 404 Client Error: Not Found ("manifest for registry.service.lab007.nauta:5000/core/etcd:3.3.10 not found")"}
......
TASK [kubectl-info : Get nodes descriptions] ***********************************************************************************************************************
Friday 21 February 2020 22:16:07 +0900 (0:00:00.505) 0:00:00.579 *******
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/home/deepcell/nauta/install/bin/Linux/amd64/kubectl", "describe", "--all-namespaces", "nodes"], "delta": "0:00:00.032504", "end": "2020-02-21 22:16:07.421973", "msg": "non-zero return code", "rc": 1, "start": "2020-02-21 22:16:07.389469", "stderr": "The connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
to retry, use: --limit @/home/deepcell/nauta/install/diagnose/diagnose.retry

PLAY RECAP *********************************************************************************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=1
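
A note on the second failure in the log above: kubectl typically falls back to localhost:8080 when it cannot find a kubeconfig, so the "connection refused" message may simply mean that no cluster configuration exists on the installer machine (which would be expected if the install stopped at the etcd image pull). The following is a minimal sketch, not part of Nauta, for checking which kubeconfig files kubectl would consult. The helper names `kubeconfig_candidates` and `existing_kubeconfigs` are hypothetical.

```python
import os
from pathlib import Path


def kubeconfig_candidates(env, home):
    """Return the kubeconfig paths kubectl would consult, in order.

    KUBECONFIG may hold several paths joined by os.pathsep; when it is
    unset or empty, the default location is ~/.kube/config.
    """
    raw = env.get("KUBECONFIG", "")
    paths = [Path(p) for p in raw.split(os.pathsep) if p]
    return paths or [Path(home) / ".kube" / "config"]


def existing_kubeconfigs(env, home):
    """Filter the candidates down to files that actually exist."""
    return [p for p in kubeconfig_candidates(env, home) if p.is_file()]


if __name__ == "__main__":
    found = existing_kubeconfigs(os.environ, Path.home())
    if found:
        print("kubeconfig found:", ", ".join(str(p) for p in found))
    else:
        print("no kubeconfig found; kubectl will likely try localhost:8080")
```

If no file is reported, the diagnose step failing is probably a consequence of the earlier install failure rather than a separate problem.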

My env :
Build : Success
Installer : Ubuntu 16.04 on Docker Container
Target : CentOS 7.6 (Kernel 4.4.214-1.el7.elrepo.x86_64)

Cluster configuration details:

  • Cloud provider or hardware configuration:
    bare metal : master : CentOS 7.6

  • Operating system: (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 4.4.214-1.el7.elrepo.x86_64 x86_64
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

    CENTOS_MANTISBT_PROJECT="CentOS-7"
    CENTOS_MANTISBT_PROJECT_VERSION="7"
    REDHAT_SUPPORT_PRODUCT="centos"
    REDHAT_SUPPORT_PRODUCT_VERSION="7"

  • Nauta version and commit: (nctl version)(git rev-parse --short HEAD):
    I cloned github yesterday.

Nauta component related with bug: (build system/installer/nctl(cli)/dashboard/documentation/k8s/any of nauta container)
install.sh

What is the current behavior?

What is the expected behavior?

Steps to reproduce:
install.sh install
*
*

Anything else do we need to know:

Dynamic IPs Google Cloud Platform

Hi,

I set up a cluster on Google Cloud following your Getting Started guide. At the moment I am facing a problem after a shutdown/restart of the created instances.
All the VMs came up again, but they can't communicate with each other. I think this happens because Google assigned the instances new IP addresses after the restart.
Is it possible to change the Terraform configuration so that all the machines are set up with static IP addresses? Where could this change be made?

BR
Timo

how to set transparent proxy

I set the http_proxy and https_proxy ENV variables in the Dockerfile. Docker can now download, but something still goes wrong while building Nauta.
I can successfully run go build in another Dockerfile, so I have no idea why this one fails. I look forward to your reply.

Step 31/33 : RUN go build -o /loader main.go
---> Running in 003b4d5cae2e

github.com/NervanaSystems/carbon/applications/loader/vendor/github.com/docker/go-connections/sockets

vendor/github.com/docker/go-connections/sockets/sockets.go:35:26: dialer.DialContext undefined (type proxy.Dialer has no field or method DialContext)
vendor/github.com/docker/go-connections/sockets/sockets_unix.go:24:28: undefined: context

Build Failed

Hello,
During the "make k8s-installer-build" command, the build failed and stopped partway through.
Can you tell me how to solve it?

My environment is:
OS
ubuntu18.04.3 LTS (Bionic Beaver)

CPU
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities

HDD
available over 100GB

Tuesday 08 October 2019 15:26:09 +0900 (0:00:00.469) 0:06:57.883 *******

container-build : Wait for task shared/centos ------------------------------------------------------------------------------------------- 256.79s
container-build : Wait for task shared/centos/build-go ---------------------------------------------------------------------------------- 106.31s
prepare : Start registry container -------------------------------------------------------------------------------------------------------- 7.22s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------- 1.58s
container-build : Build image shared/centos as 127.0.0.1:32791/shared/centos:7.6.1810 ----------------------------------------------------- 1.19s
container-build : Build image shared/build/rpm/python2-pip as 127.0.0.1:32791/shared/build/rpm/python2-pip:8.1.2 -------------------------- 1.14s
container-build : Build image shared/build/rpm/containerd-io as 127.0.0.1:32791/shared/build/rpm/containerd-io:1.2 ------------------------ 1.14s
container-build : Build image shared/build/rpm/nginx as 127.0.0.1:32791/shared/build/rpm/nginx:1.13.9 ------------------------------------- 1.13s
container-build : Build image shared/build/metrics as 127.0.0.1:32791/shared/build/metrics:0.0.1 ------------------------------------------ 1.13s
container-build : Build image shared/build/consul as 127.0.0.1:32791/shared/build/consul:v1.1.0 ------------------------------------------- 1.13s
container-build : Build image shared/build/rpm/docker-ce as 127.0.0.1:32791/shared/build/rpm/docker-ce:18.09 ------------------------------ 1.13s
container-build : Build image shared/build/rpm/docker-distribution as 127.0.0.1:32791/shared/build/rpm/docker-distribution:2.6.2 ---------- 1.13s
container-build : Build image shared/centos/build-go as 127.0.0.1:32791/shared/centos/build-go:1.10.2 ------------------------------------- 1.13s
container-build : Build image shared/run/tensorflow/py3.6 as 127.0.0.1:32791/shared/run/tensorflow/py3.6:py3 ------------------------------ 1.12s
container-build : Build image shared/run/tensorflow/py2.7 as 127.0.0.1:32791/shared/run/tensorflow/py2.7:py2 ------------------------------ 1.12s
container-build : Build image shared/build/rpm/container-selinux as 127.0.0.1:32791/shared/build/rpm/container-selinux:2.68 --------------- 1.12s
container-build : Build image shared/build/rpm/docker-ce-cli as 127.0.0.1:32791/shared/build/rpm/docker-ce-cli:18.09 ---------------------- 1.11s
Cleanup registry container ---------------------------------------------------------------------------------------------------------------- 0.47s
prepare : Create docker registry directories ---------------------------------------------------------------------------------------------- 0.40s
container-build : Build all images -------------------------------------------------------------------------------------------------------- 0.29s
Makefile:9: recipe for target 'build' failed
make[4]: *** [build] Error 2
make[4]: Leaving directory '/usr/local/src/nauta/tools/container-build'
Makefile:197: recipe for target '/usr/local/src/nauta/tools/.workspace/tools/1.0.0-oss-20191008061852/tools/shared.tar.gz' failed
make[3]: *** [/usr/local/src/nauta/tools/.workspace/tools/1.0.0-oss-20191008061852/tools/shared.tar.gz] Error 2
make[3]: Leaving directory '/usr/local/src/nauta/tools'
Makefile:71: recipe for target 'tools-release' failed
make[2]: *** [tools-release] Error 2
make[2]: Leaving directory '/usr/local/src/nauta'
Makefile:57: recipe for target 'k8s-installer-build-wrapped' failed
make[1]: *** [k8s-installer-build-wrapped] Error 2
make[1]: Leaving directory '/usr/local/src/nauta'
Makefile:63: recipe for target 'k8s-installer-build' failed
make: *** [k8s-installer-build] Error 2

Need to correct TensorFlow version in requirements to install on bare metal.

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

I think
/nauta/nauta-containers/openvino-mo/requirements.txt
needs to be modified as shown below in order to build on bare metal, because TensorFlow version 1.15 does not support Linux outside of Google Cloud.

tensorflow==1.15.2 --> tensorflow==1.14.0

Could you please check it?
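
For anyone who wants to try the pin change locally, here is a minimal sketch of rewriting an exact `==` pin in requirements text. The helper `repin` is hypothetical and not part of Nauta, and the sample contents in the usage note are made up for illustration, not the real file.

```python
import re


def repin(requirements_text, package, new_version):
    """Return requirements text with `package==<old>` set to `package==<new_version>`.

    Only exact `==` pins of the named package are touched; other lines,
    including packages whose names merely start with the same prefix
    (e.g. tensorflow-estimator), are kept unchanged.
    """
    pattern = re.compile(rf"^({re.escape(package)})==\S+", re.IGNORECASE)
    out = []
    for line in requirements_text.splitlines():
        out.append(pattern.sub(rf"\1=={new_version}", line))
    return "\n".join(out) + "\n"
```

For example, `repin("tensorflow==1.15.2\n", "tensorflow", "1.14.0")` returns `"tensorflow==1.14.0\n"`. Whether 1.14.0 is the right version for a given build is a separate question that this sketch does not answer.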

status of horovod experiment is QUEUED.

hi,

I am running the sample experiments. I can run one on a single node, but with multi-node and Horovod the status remains QUEUED.
Could you tell me how to run Horovod?

[user01@console nauta]$ nctl experiment submit -t tf-training-horovod examples/mnist_horovod.py -v
INFO:util.dependencies_checker:Detected OS: centos 7
ERROR:util.cli_state:Dependency check failed.
Traceback (most recent call last):
  File "util/cli_state.py", line 82, in verify_cli_dependencies
  File "util/dependencies_checker.py", line 157, in check_os
util.exceptions.InvalidOsError: WARNING: This OS version (centos 7) is unsupported. Check the release notes for supported operating systems and proceed at your own risk.
Dependency check failed. Use -v or -vv option for more info.
Submitting experiments.
Experiment data directory: /home/user01/nauta/config/experiments/horovod-4 already exists. It will be deleted to proceed with experiment submission. Do you want to continue? [y/N]: y
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #0
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #1
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #2
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #3
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #4
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #5
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #6
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #7
? Building experiment image...INFO:platform_resources.workflow:Waiting for workflow horovod-4-image-build-7btht to complete. Attempt #8
| Name      | Parameters       | Status   | Message   |
|-----------+------------------+----------+-----------|
| horovod-4 | mnist_horovod.py | QUEUED   |           |

[user01@console nauta]$ nctl exp list
Dependency check failed. Use -v or -vv option for more info.
| Name      | Parameters           | Metrics                     | Submission date        | Start date             | Duration     | Owner   | Status    | Template name       | Template version   |
|-----------+----------------------+-----------------------------+------------------------+------------------------+--------------+---------+-----------+---------------------+--------------------|
| horovod   | mnist_horovod.py     |                             | 2019-11-06 02:12:25 PM |                        |              | user01  | CANCELLED | tf-training-horovod | 0.2.2              |
| horovod-2 | mnist_horovod.py     |                             | 2019-11-06 02:16:13 PM |                        |              | user01  | CANCELLED | tf-training-horovod | 0.2.2              |
| horovod-3 | mnist_horovod.py     |                             | 2019-11-06 02:19:40 PM |                        |              | user01  | CANCELLED | tf-training-horovod | 0.2.2              |
| horovod-4 | mnist_horovod.py     |                             | 2019-11-06 02:21:17 PM |                        |              | user01  | QUEUED    | tf-training-horovod | 0.2.2              |
| multinode | mnist_multinode.py   |                             | 2019-11-06 02:25:31 PM |                        |              | user01  | QUEUED    | tf-training-multi   | 0.1.0              |
| single    | mnist_single_node.py | accuracy: 0.96875           | 2019-11-06 02:08:11 PM | 2019-11-06 02:08:47 PM | 0d 0h 0m 36s | user01  | COMPLETE  | tf-training-single  | 0.1.0              |
|           |                      | global_step: 499            |                        |                        |              |         |           |                     |                    |
|           |                      | loss: 0.080687426           |                        |                        |              |         |           |                     |                    |
|           |                      | validation_accuracy: 0.9814 |                        |                        |              |         |           |                     |                    |
[user01@console nauta]$

[user01@console nauta]$ kubectl get node
NAME                            STATUS   ROLES    AGE   VERSION
nauta-master.node.ai-ex.local   Ready    Master   65m   v1.15.3
worker01.node.ai-ex.local       Ready    Worker   65m   v1.15.3
worker02.node.ai-ex.local       Ready    Worker   65m   v1.15.3
worker03.node.ai-ex.local       Ready    Worker   65m   v1.15.3
worker04.node.ai-ex.local       Ready    Worker   65m   v1.15.3
[user01@console nauta]$

thanks,
yas

Unsupported parameters for (apt) module: disable_plugin, disablerepo, enablerepo

I am trying to install Nauta, but I hit the following errors.

fatal: [master-0]: FAILED! => {"ansible_facts": {"pkg_mgr": "apt"}, "changed": false, "msg": "Unsupported parameters for (apt) module: disable_plugin, disablerepo, enablerepo Supported parameters include: allow_unauthenticated, autoclean, autoremove, cache_valid_time, deb, default_release, dpkg_options, force, force_apt_get, install_recommends, only_upgrade, package, purge, state, update_cache, upgrade"}
fatal: [worker-0]: FAILED! => {"ansible_facts": {"pkg_mgr": "apt"}, "changed": false, "msg": "Unsupported parameters for (apt) module: disable_plugin, disablerepo, enablerepo Supported parameters include: allow_unauthenticated, autoclean, autoremove, cache_valid_time, deb, default_release, dpkg_options, force, force_apt_get, install_recommends, only_upgrade, package, purge, state, update_cache, upgrade"}

inventory.yaml

[master]
master-0 ansible_ssh_host=10.241.98.56 ansible_ssh_user=lisas ansible_ssh_pass=intel.123 internal_interface=eth0 external_interface=lo local_data_device=/dev/sda1

[worker]
worker-0 ansible_ssh_host=10.238.154.70 ansible_ssh_user=lisas ansible_ssh_pass=intel.123 internal_interface=eno1 external_interface=lo

config.yaml
proxy:
  http_proxy: http://10.7.211.16/911
  ftp_proxy: ftp://10.7.211.16/911
  https_proxy: https://10.7.211.16/911
  no_proxy: localhost

Looking forward to your reply

best
lisa shi
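
A note on the error above: disable_plugin, disablerepo and enablerepo are options of Ansible's yum module, while the reported fact is "pkg_mgr": "apt", so the playbook appears to be running yum-style tasks against a Debian-family host. Assuming the installer expects RHEL/CentOS targets (an assumption, not confirmed in this thread), a quick preflight check of the target's /etc/os-release could look like the sketch below. Both helper names are hypothetical.

```python
def parse_os_release(text):
    """Parse /etc/os-release content into a dict, stripping surrounding quotes."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"').strip("'")
    return info


def is_rhel_family(info):
    """True when ID or ID_LIKE names a RHEL-style (yum-based) distribution."""
    ids = {info.get("ID", "")} | set(info.get("ID_LIKE", "").split())
    return bool(ids & {"rhel", "centos", "fedora"})


if __name__ == "__main__":
    # Run this on the target host, not on the installer machine.
    with open("/etc/os-release") as fh:
        print("RHEL family:", is_rhel_family(parse_os_release(fh.read())))
```

A result of False on master-0 or worker-0 would be consistent with the "Unsupported parameters for (apt) module" failure.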

build error

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

I got the errors below while building Nauta, using a Kubernetes pod as the build system.
---------------------------------------------------------------------------------------------
fatal: [local]: FAILED! => {"ansible_job_id": "452303190776.86355", "attempts": 5, "changed": false, "finished": 1, "msg": "Error building nauta/rpm/python - code: 127, message: The command '/bin/sh -c pip install -U pip==19.0.3 virtualenv==16.0.0 setuptools==39.2.0 wheel==0.31.1' returned a non-zero code: 127, logs: ['Step 1/12 : ARG BASE_IMAGE=shared/centos/rpm-packer', '\n', 'Step 2/12 : ARG PYTHON2_PIP_RPM_IMAGE=shared/build/rpm/python2-pip', '\n', 'Step 3/12 : FROM ${PYTHON2_PIP_RPM_IMAGE} as python2_pip_rpm_image', '\n', ' ---> 03175a7225d0\n', 'Step 4/12 : FROM ${BASE_IMAGE}', '\n', ' ---> 0cf01b1e3621\n', 'Step 5/12 : ENV RPM_VERSION=2.7', '\n', ' ---> Using cache\n', ' ---> 1a58dc8d7507\n', 'Step 6/12 : ENV RPM_RELEASE=0', '\n', ' ---> Using cache\n', ' ---> 54fe0479563b\n', 'Step 7/12 : RUN yum update -y && yum install -y python-devel python libffi-devel openssl-devel gcc gcc-c++', '\n', ' ---> Using cache\n', ' ---> d08159f4c7fb\n', 'Step 8/12 : RUN curl "https://bootstrap.pypa.io/get-pip.py" | python', '\n', ' ---> Using cache\n', ' ---> 753d99098bff\n', 'Step 9/12 : RUN pip install -U pip==19.0.3 virtualenv==16.0.0 setuptools==39.2.0 wheel==0.31.1', '\n', ' ---> Running in 7ca17da74ecf\n', '\x1b[91m/bin/sh: pip: command not found\n\x1b[0m', 'Removing intermediate container 7ca17da74ecf\n']"}
---------------------------------------------------------------------------------------------
This message says that the pip command was not found, but the pip command does exist.

Cluster configuration details:
I tried on my own system.

  • Cloud provider or hardware configuration:
  • Operating system: (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    Linux 4.4.214-1.el7.elrepo.x86_64 x86_64
    NAME="Ubuntu"
    VERSION="16.04.6 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.6 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial

This system info is for the Docker container, which is actually a Kubernetes pod on Nauta.

  • Nauta version and commit: (nctl version)(git rev-parse --short HEAD):
    git cloned today(2020/03/10)

Nauta component related with bug: (build system/installer/nctl(cli)/dashboard/documentation/k8s/any of nauta container)

make k8s-installer-build

What is the current behavior?

What is the expected behavior?

Steps to reproduce:
*
*

Anything else do we need to know:

Is there any problem with building in a container (actually on Nauta)?

Nauta compilation failure: Error 2

I'm trying to install Nauta on Ubuntu 16.04 LTS.

When I run : make k8s-installer-build

I get this kind of error:
TASK [container-build : set_fact] **************************************************************************
Friday 15 February 2019 13:41:32 +0000 (0:00:03.190) 0:01:36.086 *******
ok: [local]

TASK [container-build : Wait for tasks] ********************************************************************
Friday 15 February 2019 13:41:33 +0000 (0:00:00.766) 0:01:36.853 *******
included: /nauta/tools/container-build/tasks/wait.yml for local
included: /nauta/tools/container-build/tasks/wait.yml for local
included: /nauta/tools/container-build/tasks/wait.yml for local
included: /nauta/tools/container-build/tasks/wait.yml for local
included: /nauta/tools/container-build/tasks/wait.yml for local

TASK [container-build : Wait for task shared/build/metrics] ************************************************
Friday 15 February 2019 13:41:34 +0000 (0:00:01.023) 0:01:37.876 *******
FAILED - RETRYING: Wait for task shared/build/metrics (3600 retries left).
FAILED - RETRYING: Wait for task shared/build/metrics (3599 retries left).
FAILED - RETRYING: Wait for task shared/build/metrics (3598 retries left).
FAILED - RETRYING: Wait for task shared/build/metrics (3597 retries left).
FAILED - RETRYING: Wait for task shared/build/metrics (3596 retries left).
FAILED - RETRYING: Wait for task shared/build/metrics (3595 retries left).
FAILED - RETRYING: Wait for task shared/build/met

After some retries, the script shows:

TASK [container-build : Wait for task shared/run/tensorflow/py3.6] *****************************************
Friday 15 February 2019 14:33:16 +0000 (0:04:29.250) 0:53:19.465 *******
fatal: [local]: FAILED! => {"ansible_job_id": "321866871359.23364", "attempts": 1, "changed": false, "finished": 1, "msg": "Error pulling image 127.0.0.1:32768/nauta/shared/run/tensorflow/py3.6:d8526b2b739b33e5bf1d27fb64a29a4efe0a450d - 404 Client Error: Not Found ("manifest for 127.0.0.1:32768/nauta/shared/run/tensorflow/py3.6:d8526b2b739b33e5bf1d27fb64a29a4efe0a450d not found")"}

And finally stops with Error 2 :

Makefile:9: recipe for target 'build' failed
make[3]: *** [build] Error 2
make[3]: Leaving directory '/nauta/tools/container-build'
Makefile:210: recipe for target '/nauta/tools/.workspace/tools/1.0.0-20190215133832/tools/shared.tar.gz' failed
make[2]: *** [/nauta/tools/.workspace/tools/1.0.0-20190215133832/tools/shared.tar.gz] Error 2
make[2]: Leaving directory '/nauta/tools'
Makefile:60: recipe for target 'tools-release' failed
make[1]: *** [tools-release] Error 2
make[1]: Leaving directory '/nauta'
Makefile:52: recipe for target 'k8s-installer-build' failed
make: *** [k8s-installer-build] Error 2

Can someone help me avoid this error so I can finish the installation?
Thanks in advance.
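
To narrow down the 404 "manifest ... not found" error above, it can help to ask the local build registry directly whether the tag exists. Below is a minimal sketch using the Docker Registry HTTP API V2 manifest endpoint. It assumes the registry is reachable over plain HTTP and that the image reference always includes a registry host; the helper names are hypothetical and not part of Nauta.

```python
import urllib.error
import urllib.request


def split_image_ref(ref):
    """Split 'host:port/repo/path:tag' into (registry, repository, tag)."""
    registry, _, remainder = ref.partition("/")
    repository, _, tag = remainder.rpartition(":")
    if not repository:  # no tag given, fall back to the default tag
        repository, tag = remainder, "latest"
    return registry, repository, tag


def manifest_url(ref, scheme="http"):
    """Build the Registry V2 manifest URL for an image reference."""
    registry, repository, tag = split_image_ref(ref)
    return f"{scheme}://{registry}/v2/{repository}/manifests/{tag}"


def manifest_exists(ref, timeout=5.0):
    """HEAD the manifest: True on HTTP 200, False on 404, raise otherwise."""
    req = urllib.request.Request(manifest_url(ref), method="HEAD")
    req.add_header("Accept", "application/vnd.docker.distribution.manifest.v2+json")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Calling `manifest_exists(...)` with the image reference from the error message while the build registry container is still running would show whether the image was never pushed, as opposed to being pushed under a different tag.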

Build error: pull access denied for nauta/shared/build/metrics

I'm trying to build nauta on Ubuntu 18.04
commit cca02d9

make k8s-installer-build finishes with an error:

TASK [container-build : Wait for task nauta/batch-inference] ********************************************************************************************************************************************************************************
Thursday 07 March 2019  15:27:39 +0100 (0:01:18.479)       0:27:04.823 ********
fatal: [local]: FAILED! => {"ansible_job_id": "666420320773.24313", "attempts": 1, "changed": false, "finished": 1, "msg": "Error building nauta/nauta/batch-inference - code: None, message: pull access denied for nauta/shared/build/metrics, repository does not exist or may require 'docker login', logs: ['Step 1/11 : ARG METRICS_IMAGE=metrics-image', '\\n', 'Step 2/11 : ARG BASE_IMAGE=python:3.6.8', '\\n', 'Step 3/11 : FROM ${METRICS_IMAGE} as metrics', '\\n']"}

k8s_installer_build.log

What is the development environment?

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
Just question.

Hi,
Could you tell me about your development environment?
Which IDE and testing method do you use?
I'm not accustomed to Python projects and how to test them, but I would like to use Nauta modified for my needs.
Should I develop on Linux, or can I do it on Windows?
I'm currently reading the code using PyCharm CE on Windows 10.
Do I need anything more?
Your answer would be very helpful.

Thank you,

user creation on GCP

Hi,

after setting up the cluster on Google Cloud Platform we wanted to run some of the examples provided in the repository.
To do so we have to create a user account. The user is defined by a config file. I added some questions in brackets about how to fill in such a config file:

gateway_users:[username?]
nautaoperator: [role? Which exist?]
groups:
- docker [Which exist and are necessary for a user?]
# yamllint disable-line rule:line-length
authorized_key: "ssh-rsa dummykey [email protected]"

I would appreciate it if you could give me some hints.

Thanks & BR
Timo

Upgrade to marshmallow 3

marshmallow 3.0.0rc2 is out, and a stable release is just around the corner. We don't expect any more large breaking changes, so we are recommending that new projects use marshmallow 3.

One benefit of upgrading is that you will be able to get rid of the duplicate load_from/dump_to arguments, like here:

https://github.com/IntelAI/nauta/blob/afcf19d236d25bb6157eb67b26b0984d47ff142c/applications/cli/platform_resources/custom_object_meta_model.py#L36-L39

In marshmallow 3, these parameters have been merged into a single parameter, data_key.

cluster_name = fields.String(required=False, allow_none=True, missing=None, data_key='clusterName')

There are other marshmallow 3 features that Nauta could take advantage of, like structured Dict fields:

    labels = fields.Dict(keys=fields.Str(), values=fields.Str(), required=False, allow_none=True, missing=None)

The migration should be relatively straightforward. In addition to using data_key instead of load_from/dump_to, it looks like you'll need to adjust usages like this...

https://github.com/IntelAI/nauta/blob/a65a6a573f410ca19941abdfa9c33dfcea8f3ed4/applications/cli/platform_resources/run.py#L173-L175

to marshmallow 3's stricter (and simpler) API:

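from marshmallow import ValidationError

try:
    # In marshmallow 3, load() returns the deserialized data and raises on invalid input
    created_run = schema.load(response)
except ValidationError as err:
    raise RuntimeError(f'load of RunKubernetes request object error - {err.messages}')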

EDIT: Update data_key example.

Some questions about adding nodes

Just Questions!
I want to add nodes after installing Nauta.
The only way I know is to rerun the installer after editing the inventory file.
Is this right?
I think this does not add nodes but rather re-installs the whole cluster.
Is there another way?

I don't think this is the right place to ask.
Could you tell me the right place or board to ask about Nauta, if one exists?

Thank you,

Failed to upload experiment

hi,

An error occurred when running the MNIST sample after deploying Nauta.
Could you tell me how to resolve it?

[user01@centos7 nctl]$ nctl experiment submit -t tf-training-single examples/mnist_single_node.py --name single -v
INFO:util.dependencies_checker:Detected OS: centos 7
ERROR:util.cli_state:Dependency check failed.
Traceback (most recent call last):
  File "util/cli_state.py", line 82, in verify_cli_dependencies
  File "util/dependencies_checker.py", line 157, in check_os
util.exceptions.InvalidOsError: WARNING: This OS version (centos 7) is unsupported. Check the release notes for supported operating systems and proceed at your own risk.
Dependency check failed. Use -v or -vv option for more info.
Submitting experiments.
Experiment data directory: /home/user01/nctl/config/experiments/single already exists. It will be deleted to proceed with experiment submission. Do you want to continue? [y/N]: y
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Uploading experiment...ERROR:util.system:COMMAND: ['git', 'ls-remote'] RESULT: Warning: Permanently added '[localhost]:56253' (ECDSA) to the list of known hosts.\nPermission denied (publickey).\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n
Traceback (most recent call last):
  File "util/system.py", line 182, in execute_system_command
  File "subprocess.py", line 356, in check_output
  File "subprocess.py", line 438, in run
subprocess.CalledProcessError: Command '['git', 'ls-remote']' returned non-zero exit status 128.
ERROR:util.system:Warning: Permanently added '[localhost]:56253' (ECDSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

ERROR:git_repo_manager.utils:Failed to upload experiment single to git repo manager.
Traceback (most recent call last):
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
WARNING:retry.api:Failed to execute command: ['git', 'ls-remote'], retrying in 1 seconds...
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Uploading experiment...ERROR:util.system:COMMAND: ['git', 'ls-remote'] RESULT: Warning: Permanently added '[localhost]:30372' (ECDSA) to the list of known hosts.\nPermission denied (publickey).\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n
Traceback (most recent call last):
  File "util/system.py", line 182, in execute_system_command
  File "subprocess.py", line 356, in check_output
  File "subprocess.py", line 438, in run
subprocess.CalledProcessError: Command '['git', 'ls-remote']' returned non-zero exit status 128.
ERROR:util.system:Warning: Permanently added '[localhost]:30372' (ECDSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

ERROR:git_repo_manager.utils:Failed to upload experiment single to git repo manager.
Traceback (most recent call last):
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
WARNING:retry.api:Failed to execute command: ['git', 'ls-remote'], retrying in 1 seconds...
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Uploading experiment...ERROR:util.system:COMMAND: ['git', 'ls-remote'] RESULT: Warning: Permanently added '[localhost]:25259' (ECDSA) to the list of known hosts.\nPermission denied (publickey).\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n
Traceback (most recent call last):
  File "util/system.py", line 182, in execute_system_command
  File "subprocess.py", line 356, in check_output
  File "subprocess.py", line 438, in run
subprocess.CalledProcessError: Command '['git', 'ls-remote']' returned non-zero exit status 128.
ERROR:util.system:Warning: Permanently added '[localhost]:25259' (ECDSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

ERROR:git_repo_manager.utils:Failed to upload experiment single to git repo manager.
Traceback (most recent call last):
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
WARNING:retry.api:Failed to execute command: ['git', 'ls-remote'], retrying in 1 seconds...
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Uploading experiment...ERROR:util.system:COMMAND: ['git', 'ls-remote'] RESULT: Warning: Permanently added '[localhost]:16217' (ECDSA) to the list of known hosts.\nPermission denied (publickey).\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n
Traceback (most recent call last):
  File "util/system.py", line 182, in execute_system_command
  File "subprocess.py", line 356, in check_output
  File "subprocess.py", line 438, in run
subprocess.CalledProcessError: Command '['git', 'ls-remote']' returned non-zero exit status 128.
ERROR:util.system:Warning: Permanently added '[localhost]:16217' (ECDSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

? Uploading experiment...ERROR:git_repo_manager.utils:Failed to upload experiment single to git repo manager.
Traceback (most recent call last):
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
WARNING:retry.api:Failed to execute command: ['git', 'ls-remote'], retrying in 1 seconds...
? Uploading experiment...INFO:util.kubectl:Port forwarding - proxy set up
? Uploading experiment...ERROR:util.system:COMMAND: ['git', 'ls-remote'] RESULT: Warning: Permanently added '[localhost]:51135' (ECDSA) to the list of known hosts.\nPermission denied (publickey).\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n
Traceback (most recent call last):
  File "util/system.py", line 182, in execute_system_command
  File "subprocess.py", line 356, in check_output
  File "subprocess.py", line 438, in run
subprocess.CalledProcessError: Command '['git', 'ls-remote']' returned non-zero exit status 128.
ERROR:util.system:Warning: Permanently added '[localhost]:51135' (ECDSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

ERROR:git_repo_manager.utils:Failed to upload experiment single to git repo manager.
Traceback (most recent call last):
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
? Uploading experiment...ERROR:commands.experiment.common:Failed to upload experiment.
Traceback (most recent call last):
  File "commands/experiment/common.py", line 377, in submit_experiment
  File "</home/user01/nctl/nctl-cli/decorator.pyc:decorator-gen-4>", line 2, in upload_experiment_to_git_repo_manager
  File "retry/api.py", line 74, in retry_decorator
  File "retry/api.py", line 33, in __retry_internal
  File "git_repo_manager/utils.py", line 102, in upload_experiment_to_git_repo_manager
  File "util/system.py", line 121, in __call__
RuntimeError: Failed to execute command: ['git', 'ls-remote']
Problems during submitting experiment: Failed to upload experiment.

thanks,
yas
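
Every retry in the log above fails the same way: `git ls-remote` exits with status 128 after SSH reports `Permission denied (publickey)`. The forwarded port changes on each attempt, but the key is rejected every time. Below is a minimal sketch for triaging this kind of output. The function name `classify_git_error` and its category labels are hypothetical and are not part of nctl.

```python
def classify_git_error(stderr: str) -> str:
    """Map git/SSH stderr text to a coarse failure category (hypothetical labels)."""
    text = stderr.lower()
    if "permission denied (publickey)" in text:
        return "ssh-key-rejected"
    if "could not resolve hostname" in text:
        return "dns-failure"
    if "connection refused" in text:
        return "connection-refused"
    return "unknown"


# Sample copied from the first failed attempt in the log above.
sample = (
    "Warning: Permanently added '[localhost]:56253' (ECDSA) to the list of known hosts.\n"
    "Permission denied (publickey).\n"
    "fatal: Could not read from remote repository.\n"
)
print(classify_git_error(sample))
```

A result of `ssh-key-rejected` would suggest checking which SSH key the client offers to the git repo manager, rather than the port forwarding, which the log shows being set up successfully.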

I cannot install Nauta successfully

Hello, I have successfully built Nauta, but I get errors when I try to install it.
The inventory.yaml and config.yaml files are as follows:

inventory.txt
config.txt

TASK [verification/pre : [platform] Fail if test failed] ******************************************************************************************************
Wednesday 22 May 2019 00:46:37 -0700 (0:00:00.039) 0:00:00.200 *********
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Master group is not defined in inventory"}
TASK [kubectl-info : Get nodes descriptions] ******************************************************************************************************************
Wednesday 22 May 2019 00:46:39 -0700 (0:00:00.636) 0:00:00.706 *********
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/home/lisas/nauta/bin/Linux/amd64/kubectl", "describe", "--all-namespaces", "nodes"], "delta": "0:00:00.040610", "end": "2019-05-22 00:46:39.311990", "msg": "non-zero return code", "rc": 1, "start": "2019-05-22 00:46:39.271380", "stderr": "The connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}
to retry, use: --limit @/home/lisas/nauta/diagnose/diagnose.retry

I have no idea how to solve this, and I look forward to your reply.
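
The first failure says the `master` group is missing from the inventory. The second is probably a consequence of the first: kubectl falls back to `localhost:8080` when it finds no kubeconfig, which is expected if no cluster was installed. Below is a minimal sketch of a pre-flight check on an inventory that has already been parsed into a dictionary by a YAML loader. It assumes the standard Ansible YAML inventory layout, with groups either at the top level or under `all.children`. The helper `has_group`, the host names and the addresses are hypothetical.

```python
def has_group(inventory: dict, group: str = "master") -> bool:
    """Return True if `group` is defined and lists at least one host."""

    def hosts_of(node):
        # A missing or empty group yields an empty mapping.
        return (node or {}).get("hosts") or {}

    # Group defined at the top level of the inventory.
    if hosts_of(inventory.get(group)):
        return True
    # Group nested under the conventional `all.children` mapping.
    children = (inventory.get("all") or {}).get("children") or {}
    return bool(hosts_of(children.get(group)))


# Placeholder inventories; names and addresses are illustrative only.
broken = {"worker": {"hosts": {"node-1": {"ansible_host": "192.0.2.11"}}}}
fixed = {"master": {"hosts": {"node-0": {"ansible_host": "192.0.2.10"}}}, **broken}
print(has_group(broken), has_group(fixed))
```

Running such a check against the parsed inventory.yaml before starting the installer would show whether the group is missing or only misnamed.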

PyTorch exp error

Is this a BUG REPORT or FEATURE REQUEST? (choose one):
BUG REPORT

I ran the following command:
nctl experiment submit --name pytorch --template pytorch-training examples/pytorch_mnist.py

Result: FAILED
Relevant line from the error log:
2020-02-25T08:12:56.333153646+00:00 pytorch-master-0 ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/init.py)

Cluster configuration details:

  • Cloud provider or hardware configuration:
    My own hardware (3 Intel servers: 1 master, 2 workers)

  • Operating system: (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    CentOS 7.6

  • Nauta version and commit: (nctl version)(git rev-parse --short HEAD):
    nctl version
    | Component        | Version                  |
    |------------------+--------------------------|
    | nctl application | 1.0.0-oss-20202124062154 |
    | nauta platform   | 1.0.0-oss-20200221134229 |

Full logs:
2020-02-25T08:12:56.333059496+00:00 pytorch-master-0 File "pytorch_mnist.py", line 35, in
2020-02-25T08:12:56.333084761+00:00 pytorch-master-0 from torchvision import datasets
2020-02-25T08:12:56.333076341+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/init.py", line 2, in
2020-02-25T08:12:56.333120635+00:00 pytorch-master-0 from .transforms import *
2020-02-25T08:12:56.333090521+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/init.py", line 9, in
2020-02-25T08:12:56.333097216+00:00 pytorch-master-0 from .fakedata import FakeData
2020-02-25T08:12:56.333102754+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/fakedata.py", line 3, in
2020-02-25T08:12:56.333126129+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 17, in
2020-02-25T08:12:56.333153646+00:00 pytorch-master-0 ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/init.py)
2020-02-25T08:12:56.333114393+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/init.py", line 1, in
2020-02-25T08:12:56.333132536+00:00 pytorch-master-0 from . import functional as F
2020-02-25T08:12:56.333138002+00:00 pytorch-master-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 5, in
2020-02-25T08:12:56.333147806+00:00 pytorch-master-0 from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
2020-02-25T08:12:56.333006084+00:00 pytorch-master-0 Traceback (most recent call last):
2020-02-25T08:12:56.333069922+00:00 pytorch-master-0 from torchvision import datasets, transforms
2020-02-25T08:12:56.333108868+00:00 pytorch-master-0 from .. import transforms
2020-02-25T08:13:29.401626054+00:00 pytorch-worker-0 from torchvision import datasets, transforms
2020-02-25T08:13:29.401661965+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/init.py", line 2, in
2020-02-25T08:13:29.401680845+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/init.py", line 9, in
2020-02-25T08:13:29.401687445+00:00 pytorch-worker-0 from .fakedata import FakeData
2020-02-25T08:13:29.401693211+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/fakedata.py", line 3, in
2020-02-25T08:13:29.401699300+00:00 pytorch-worker-0 from .. import transforms
2020-02-25T08:13:29.401723131+00:00 pytorch-worker-0 from . import functional as F
2020-02-25T08:13:29.401728640+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 5, in
2020-02-25T08:13:29.401738629+00:00 pytorch-worker-0 from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
2020-02-25T08:13:29.401559390+00:00 pytorch-worker-0 Traceback (most recent call last):
2020-02-25T08:13:29.401704839+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/init.py", line 1, in
2020-02-25T08:13:29.401674500+00:00 pytorch-worker-0 from torchvision import datasets
2020-02-25T08:13:29.401615053+00:00 pytorch-worker-0 File "pytorch_mnist.py", line 35, in
2020-02-25T08:13:29.401711254+00:00 pytorch-worker-0 from .transforms import *
2020-02-25T08:13:29.401716654+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 17, in
2020-02-25T08:13:29.401744333+00:00 pytorch-worker-0 ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/init.py)
2020-02-25T08:14:52.010687437+00:00 pytorch-worker-0 Traceback (most recent call last):
2020-02-25T08:14:52.010757927+00:00 pytorch-worker-0 from torchvision import datasets
2020-02-25T08:14:52.010761921+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/init.py", line 9, in
2020-02-25T08:14:52.010773894+00:00 pytorch-worker-0 from .. import transforms
2020-02-25T08:14:52.010804013+00:00 pytorch-worker-0 ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/init.py)
2020-02-25T08:14:52.010747250+00:00 pytorch-worker-0 from torchvision import datasets, transforms
2020-02-25T08:14:52.010751768+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/init.py", line 2, in
2020-02-25T08:14:52.010769767+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/fakedata.py", line 3, in
2020-02-25T08:14:52.010785577+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 17, in
2020-02-25T08:14:52.010796009+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 5, in
2020-02-25T08:14:52.010777597+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/init.py", line 1, in
2020-02-25T08:14:52.010781893+00:00 pytorch-worker-0 from .transforms import *
2020-02-25T08:14:52.010800159+00:00 pytorch-worker-0 from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
2020-02-25T08:14:52.010738929+00:00 pytorch-worker-0 File "pytorch_mnist.py", line 35, in
2020-02-25T08:14:52.010766179+00:00 pytorch-worker-0 from .fakedata import FakeData
2020-02-25T08:14:52.010792184+00:00 pytorch-worker-0 from . import functional as F
2020-02-25T08:16:22.039446452+00:00 pytorch-worker-0 Traceback (most recent call last):
2020-02-25T08:16:22.039484524+00:00 pytorch-worker-0 File "pytorch_mnist.py", line 35, in
2020-02-25T08:16:22.039508699+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/init.py", line 9, in
2020-02-25T08:16:22.039540319+00:00 pytorch-worker-0 from . import functional as F
2020-02-25T08:16:22.039578586+00:00 pytorch-worker-0 from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION
2020-02-25T08:16:22.039516896+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/datasets/fakedata.py", line 3, in
2020-02-25T08:16:22.039524169+00:00 pytorch-worker-0 from .. import transforms
2020-02-25T08:16:22.039504000+00:00 pytorch-worker-0 from torchvision import datasets
2020-02-25T08:16:22.039528168+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/init.py", line 1, in
2020-02-25T08:16:22.039532424+00:00 pytorch-worker-0 from .transforms import *
2020-02-25T08:16:22.039544224+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/functional.py", line 5, in
2020-02-25T08:16:22.039495964+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/init.py", line 2, in
2020-02-25T08:16:22.039595337+00:00 pytorch-worker-0 ImportError: cannot import name 'PILLOW_VERSION' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/init.py)
2020-02-25T08:16:22.039491749+00:00 pytorch-worker-0 from torchvision import datasets, transforms
2020-02-25T08:16:22.039513205+00:00 pytorch-worker-0 from .fakedata import FakeData
2020-02-25T08:16:22.039536067+00:00 pytorch-worker-0 File "/usr/local/lib/python3.7/site-packages/torchvision/transforms/transforms.py", line 17, in
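
The traceback ends in torchvision at `from PIL import Image, ImageOps, ImageEnhance, PILLOW_VERSION`. Pillow 7.0.0 removed the `PILLOW_VERSION` constant, so a torchvision release that still imports it fails against Pillow 7 or newer. Below is a minimal sketch of a version check. `exports_pillow_version` is a hypothetical helper, and the Pillow version installed in the template image is not shown in the logs, so treating it as 7.x is an assumption.

```python
def exports_pillow_version(pillow_version: str) -> bool:
    """Return True if this Pillow release still exposes PIL.PILLOW_VERSION.

    The constant was removed in Pillow 7.0.0, so only the major version matters.
    """
    major = int(pillow_version.split(".")[0])
    return major < 7


# Compare a 6.x release with the 7.0.0 release that dropped the constant.
for version in ("6.2.2", "7.0.0"):
    print(version, exports_pillow_version(version))
```

If the installed Pillow fails this check, the usual remedies are to pin `pillow<7` in the experiment's requirements or to move to a torchvision release that no longer imports the constant. Whether the `pytorch-training` template permits either change is not confirmed here.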

Compilation Issues with Nauta

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

Cluster configuration details:

  • Cloud provider or hardware configuration:
    Skylake servers with 512 GB of RAM and Gold 6152 22-core processors.

  • Operating system: (printf "$(uname -srm)\n$(cat /etc/os-release)\n"):
    OS: Ubuntu 16.04 LTS

  • Nauta version and commit: (nctl version)(git rev-parse --short HEAD):
    Using the latest committed version.
    k8s_installer_build.log

Nauta component related with bug: (build system/installer/nctl(cli)/dashboard/documentation/k8s/any of nauta container)

What is the current behavior?

What is the expected behavior?

Steps to reproduce:
*
*

Anything else do we need to know:

Build file attached.

I cannot build Nauta successfully

Hello,
I built Nauta from the develop branch after pulling the latest changes. My computer has 32 GB of memory and 79 GB of free space. I am sure the proxy is configured correctly, but every build fails with the same errors:

fatal: [local]: FAILED! => {"ansible_job_id": "367321430963.7555", "attempts": 71, "changed": false, "finished": 1, "msg": "Error building nauta/rpm/kubernetes - code: 35, message: The command '/bin/sh -c yes | ./kubernetes/cluster/get-kube-binaries.sh' returned a non-zero code: 35, logs: ['Step 1/7 : ARG BASE_IMAGE', '\n', 'Step 2/7 : FROM ${BASE_IMAGE}', '\n', ' ---> 71ec932e6b55\n', 'Step 3/7 : RUN curl -L https://github.com/kubernetes/kubernetes/releases/download/v1.10.13/kubernetes.tar.gz -o kubernetes.tar.gz', '\n', ' ---> Using cache\n', ' ---> a64ffaafeded\n', 'Step 4/7 : RUN tar -xvf kubernetes.tar.gz && rm kubernetes.tar.gz', '\n', ' ---> Using cache\n', ' ---> 20b5c52d9c3b\n', 'Step 5/7 : RUN yes | ./kubernetes/cluster/get-kube-binaries.sh', '\n', ' ---> Running in 909ce11c2068\n', 'Kubernetes release: v1.10.13\nServer: linux/amd64 (to override, set KUBERNETES_SERVER_ARCH)\nClient: linux/amd64 (autodetected)\n\n', 'Will download kubernetes-server-linux-amd64.tar.gz from https://dl.k8s.io/v1.10.13\n', 'Will download and extract kubernetes-client-linux-amd64.tar.gz from https://dl.k8s.io/v1.10.13\n', 'Is this ok? 
[Y]/n\n', '\x1b[91m % Total % Received % Xferd Average Speed Time T\x1b[0m', '\x1b[91mime Time Current\n \x1b[0m', '\x1b[91m \x1b[0m', '\x1b[91m \x1b[0m', '\x1b[91m \x1b[0m', '\x1b[91m \x1b[0m', '\x1b[91m D\x1b[0m', '\x1b[91mload Upload \x1b[0m', '\x1b[91m Total \x1b[0m', '\x1b[91m Spent Left Sp\x1b[0m', '\x1b[91meed\n\r\x1b[0m', '\x1b[91m 0 0 0\x1b[0m', '\x1b[91m 0 0 0 0 \x1b[0m', '\x1b[91m 0 --:--:-- --:--\x1b[0m', '\x1b[91m:-- --:--:-- \x1b[0m', '\x1b[91m0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:04 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:05 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:06 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:08 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:10 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:13 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:14 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:15 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:16 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:17 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:18 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:19 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:20 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:21 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:22 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:23 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 
--:--:-- 0:00:24 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:25 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:26 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:27 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:28 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:29 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:30 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:31 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:32 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:33 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:34 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:35 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:36 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:37 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:38 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:39 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:40 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:41 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:42 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:43 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:44 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:45 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:46 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:47 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:48 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:49 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:50 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:51 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:52 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:53 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:54 --:--:-- 0\x1b[0m', '\x1b[91m\r 
0 0 0 0 0 0 0 0 --:--:-- 0:00:55 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:56 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:57 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:58 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:00:59 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:00 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:01 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:02 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:03 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:04 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:05 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:06 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:07 --:--:-- \x1b[0m', '\x1b[91m 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:08 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:09 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- \x1b[0m', '\x1b[91m 0:01:10 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:11 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:12 --:--:--\x1b[0m', '\x1b[91m 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:13 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:14 --:--:-- 0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- \x1b[0m', '\x1b[91m0:01:15 --:--:-- \x1b[0m', '\x1b[91m0\x1b[0m', '\x1b[91m\r 0 0 0 0 0 0 0 0 --:--:-- 0:01:15\x1b[0m', '\x1b[91m --:--:-- 0\n\x1b[0m', '\x1b[91mcurl: (35) TCP connection reset by\x1b[0m', '\x1b[91m peer\n\x1b[0m', 'Removing intermediate container 909ce11c2068\n']"}

RUNNING HANDLER [Cleanup registry container] ***********************************
Wednesday 17 April 2019 18:08:48 +0800 (0:01:15.779) 0:01:44.245 *******
changed: [local]
to retry, use: --limit @/home/xin/program/nauta/tools/container-build/container.retry

PLAY RECAP *********************************************************************
local : ok=391 changed=18 unreachable=0 failed=1

Wednesday 17 April 2019 18:08:49 +0800 (0:00:00.680) 0:01:44.925 *******

container-build : Wait for task rpm/kubernetes ------------------------- 75.78s
container-build : Build image rpm/kubernetes as 127.0.0.1:32778/rpm/kubernetes:v1.10.11 --- 1.18s
prepare : Start registry container -------------------------------------- 1.16s
Cleanup registry container ---------------------------------------------- 0.68s
Gathering Facts --------------------------------------------------------- 0.43s
prepare : Create docker registry directories ---------------------------- 0.34s
container-build : Check if image exists --------------------------------- 0.27s
container-build : Register each image if defined in parent build -------- 0.22s
container-build : Register each defined image --------------------------- 0.22s
container-build : Find base image in repo ------------------------------- 0.20s
container-build : Find base image in repo ------------------------------- 0.19s
container-build : Verify presence of base images ------------------------ 0.19s
container-build : List files in directory ------------------------------- 0.19s
container-build : Calculate local repository tags for all images -------- 0.18s
container-build : Find base image in repo ------------------------------- 0.18s
container-build : Find base image in repo ------------------------------- 0.18s
container-build : Stat dockerfile --------------------------------------- 0.17s
container-build : Find base image in repo ------------------------------- 0.17s
container-build : Find base image in repo ------------------------------- 0.17s
container-build : Find base image in repo ------------------------------- 0.17s
Makefile:9: recipe for target 'build' failed
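
The actual failure in the Ansible message above is buried at the end of the `logs` list, split across color-coded fragments: `curl: (35) TCP connection reset by peer`. Exit code 35 is curl's TLS/SSL connect error, which points at the network path (often a proxy) rather than at the Dockerfile. Below is a minimal sketch for extracting that line from such fragments. The helper `find_curl_error` is hypothetical and not part of the Nauta build tooling.

```python
import re

# Matches ANSI color sequences such as "\x1b[91m" and "\x1b[0m".
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Matches curl's "curl: (<code>) <message>" error line.
CURL_ERROR = re.compile(r"curl: \((\d+)\) ([^\n]+)")


def find_curl_error(fragments):
    """Join build log fragments, strip color codes, return (code, message) or None."""
    text = ANSI_ESCAPE.sub("", "".join(fragments))
    match = CURL_ERROR.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


# Fragments copied from the tail of the log above.
fragments = [
    "\x1b[91mcurl: (35) TCP connection reset by\x1b[0m",
    "\x1b[91m peer\n\x1b[0m",
    "Removing intermediate container 909ce11c2068\n",
]
print(find_curl_error(fragments))  # (35, 'TCP connection reset by peer')
```

Joining the fragments before searching matters here because the message is split between two list entries.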

fatal: [local]: FAILED! => {"ansible_job_id": "168062879634.11491", "attempts": 80, "changed": false, "finished": 1, "msg": "Error building nauta/nauta/experiment-service - code: 2, message: The command '/bin/sh -c make build' returned a non-zero code: 2, logs: ['Step 1/29 : FROM golang:1.11.5 as build', '\n', ' ---> 1454e2b3d01f\n', 'Step 2/29 : RUN mkdir -p /build/dep', '\n', ' ---> Using cache\n', ' ---> bf768815d3a9\n', 'Step 3/29 : ENV DEP_VERSION=v0.4.1', '\n', ' ---> Using cache\n', ' ---> 44595bc0eab5\n', 'Step 4/29 : ENV DEP_ARCH=amd64', '\n', ' ---> Using cache\n', ' ---> 80e3c8ef437f\n', 'Step 5/29 : ENV DEP_OS=linux', '\n', ' ---> Using cache\n', ' ---> 5a6aa0017f8e\n', 'Step 6/29 : ENV DEP_FILE=dep-${DEP_OS}-${DEP_ARCH}', '\n', ' ---> Using cache\n', ' ---> 5ea994ae725b\n', 'Step 7/29 : RUN wget https://github.com/golang/dep/releases/download/${DEP_VERSION}/${DEP_FILE}', '\n', ' ---> Using cache\n', ' ---> f8a62f9bf99b\n', 'Step 8/29 : RUN cp ${DEP_FILE} /build/dep/dep', '\n', ' ---> Using cache\n', ' ---> f250af21a5ed\n', 'Step 9/29 : RUN chmod 0777 /build/dep/dep', '\n', ' ---> Using cache\n', ' ---> 4f2376d0501d\n', 'Step 10/29 : ENV PATH=${PATH}:/build/dep/', '\n', ' ---> Using cache\n', ' ---> 921700bdc2de\n', 'Step 11/29 : ENV APISERVER_BUILDER_VERSION=v1.9-alpha.4', '\n', ' ---> Using cache\n', ' ---> 2fe9abe88da0\n', 'Step 12/29 : ENV APISERVER_BUILDER_ARCH=amd64', '\n', ' ---> Using cache\n', ' ---> a5e29f833514\n', 'Step 13/29 : ENV APISERVER_BUILDER_OS=linux', '\n', ' ---> Using cache\n', ' ---> 94d10ece8ae5\n', 'Step 14/29 : ENV APISERVER_BUILDER_FILE=apiserver-builder-${APISERVER_BUILDER_VERSION}-${APISERVER_BUILDER_OS}-${APISERVER_BUILDER_ARCH}.tar.gz', '\n', ' ---> Using cache\n', ' ---> f9af63c2a801\n', 'Step 15/29 : RUN wget https://github.com/kubernetes-incubator/apiserver-builder/releases/download/${APISERVER_BUILDER_VERSION}/${APISERVER_BUILDER_FILE}', '\n', ' ---> Using cache\n', ' ---> 48e2fe2352a2\n', 'Step 16/29 : RUN mkdir -p 
/apiserver-builder', '\n', ' ---> Using cache\n', ' ---> f3727d8cb245\n', 'Step 17/29 : RUN cp ${APISERVER_BUILDER_FILE} /tmp/${APISERVER_BUILDER_FILE}', '\n', ' ---> Using cache\n', ' ---> 6d51a10501c1\n', 'Step 18/29 : RUN tar -zvxf /tmp/${APISERVER_BUILDER_FILE} -C /apiserver-builder && rm -rf /tmp/${APISERVER_BUILDER_FILE}', '\n', ' ---> Using cache\n', ' ---> dda2e0346e52\n', 'Step 19/29 : ENV APISERVER_BUILDER_PATH=/apiserver-builder/bin', '\n', ' ---> Using cache\n', ' ---> ee330f23c56f\n', 'Step 20/29 : ENV PATH=${PATH}:${APISERVER_BUILDER_PATH}', '\n', ' ---> Using cache\n', ' ---> ae1ed4cd10b0\n', 'Step 21/29 : RUN apt update && apt install -y make mercurial', '\n', ' ---> Using cache\n', ' ---> c9611c6211af\n', 'Step 22/29 : ENV EXP_SVC_PATH=${GOPATH}/src/github.com/nervanasystems/carbon/applications/experiment-service', '\n', ' ---> Using cache\n', ' ---> 188f2c67798f\n', 'Step 23/29 : RUN mkdir -p ${EXP_SVC_PATH}', '\n', ' ---> Using cache\n', ' ---> 9ff4311fb414\n', 'Step 24/29 : WORKDIR ${EXP_SVC_PATH}', '\n', ' ---> Using cache\n', ' ---> ec6daaf8da45\n', 'Step 25/29 : ADD ./ ./', '\n', ' ---> Using cache\n', ' ---> 63b0799c0d48\n', 'Step 26/29 : RUN make build', '\n', ' ---> Running in c20e37a5af81\n', 'go fmt ./...\n', 'dep ensure -v\n', '\x1b[91mThe following issues were found in Gopkg.toml:\n\n ✗ unable to deduce repository and source type for "k8s.io/apimachinery": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/client-go": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/apiextensions-apiserver": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/kube-aggregator": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/api": unable to read metadata: go-import metadata not found\n ✗ unable to 
deduce repository and source type for "k8s.io/code-generator": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/code-generator": unable to read metadata: go-import metadata not found\n ✗ unable to deduce repository and source type for "k8s.io/apiserver": unable to read metadata: go-import metadata not found\n\n\x1b[0m', '\x1b[91mProjectRoot name validation failed\n\x1b[0m', "Makefile:19: recipe for target 'dep_update' failed\n", '\x1b[91mmake: *** [dep_update] Error 1\n\x1b[0m', 'Removing intermediate container c20e37a5af81\n']"}

RUNNING HANDLER [Cleanup registry container] ***********************************************************************************************************************************************************************
Wednesday 17 April 2019 18:09:59 +0800 (0:01:25.409) 0:02:16.384 *******
changed: [local]
to retry, use: --limit @/home/xin/program/nauta/tools/container-build/container.retry

PLAY RECAP *********************************************************************************************************************************************************************************************************
local : ok=743 changed=47 unreachable=0 failed=1

Wednesday 17 April 2019 18:10:00 +0800 (0:00:00.730) 0:02:17.114 *******

container-build : Wait for task nauta/experiment-service --------------------------------------------------------------------------------------------------------------------------------------------------- 85.41s
Gathering Facts --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1.53s
container-build : Build image nauta/experiment-service as 127.0.0.1:32780/nauta/experiment-service:1.0.0 ---------------------------------------------------------------------------------------------------- 1.16s
container-build : Pull image nauta/pytorch-operator as 127.0.0.1:32780/nauta/pytorch-operator:v0.4.0 -------------------------------------------------------------------------------------------------------- 1.16s
container-build : Build image nauta/tf/operator as 127.0.0.1:32780/nauta/tf/operator:v2 --------------------------------------------------------------------------------------------------------------------- 1.11s
prepare : Start registry container -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.87s
container-build : Calculate remote repository tags for all images ------------------------------------------------------------------------------------------------------------------------------------------- 0.76s
Cleanup registry container ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.73s
container-build : Check if image exists --------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.38s
container-build : Register each image if defined in parent build -------------------------------------------------------------------------------------------------------------------------------------------- 0.37s
prepare : Create docker registry directories ---------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.33s
container-build : Register each defined image --------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.32s
container-build : Calculate local repository tags for all images -------------------------------------------------------------------------------------------------------------------------------------------- 0.30s
container-build : Verify presence of base images ------------------------------------------------------------------------------------------------------------------------------------------------------------ 0.29s
container-build : Build all images -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.27s
container-build : Load all images --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.22s
container-build : Add image definition ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.21s
container-build : Calculate remote repository tags for all images ------------------------------------------------------------------------------------------------------------------------------------------- 0.19s
container-build : List files in directory ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.19s
container-build : List files in directory ------------------------------------------------------------------------------------------------------------------------------------------------------------------- 0.18s
Makefile:9: recipe for target 'build' failed
make[4]: *** [build] Error 2
make[4]: Leaving directory '/home/xin/program/nauta/tools/container-build'
Makefile:159: recipe for target '/home/xin/program/nauta/tools/.workspace/tools/1.0.0-oss-20190417100509/tools/nauta.tar.gz' failed
make[3]: *** [/home/xin/program/nauta/tools/.workspace/tools/1.0.0-oss-20190417100509/tools/nauta.tar.gz] Error 2
make[3]: Leaving directory '/home/xin/program/nauta/tools'
Makefile:71: recipe for target 'tools-release' failed
make[2]: *** [tools-release] Error 2
make[2]: Leaving directory '/home/xin/program/nauta'
Makefile:57: recipe for target 'k8s-installer-build-wrapped' failed
make[1]: *** [k8s-installer-build-wrapped] Error 2
make[1]: Leaving directory '/home/xin/program/nauta'

roadmap query

Will Nauta be just for Intel Xeon CPUs? What would it take to extend it to other CPUs, GPUs, and FPGAs (BrainWave on Azure)?

Need value proposition deck with this repo!

Feature Request

I'm still trying to figure out what makes this distribution different from others. What's the rough roadmap for this product? What does this repo do that others can't? Is this just a repo with Intel's name on it?

resource setting with nctl config command

hi,

I am setting resources with nctl config, but the verify command then reports that the packs have incorrect resource settings.
Please tell me how to resolve it.

[nauta01@console ~]$ nctl config -c 4 -m 8Gi
Resources' settings have been updated with a success.
[nauta01@console ~]$ nctl verify
WARNING: This OS version (centos 7) is unsupported. Check the release notes for supported operating systems and proceed at your own risk. Use -v or -vv option for more info.
Following nauta components have failed:
kubectl verified successfully.
kubectl server verified successfully.
helm client verified successfully.
helm server verified successfully.
git verified successfully.
The following packs have incorrect resources' settings. Use the 'config' command to align those settings with resources available on a cluster.
- jupyter
- jupyter-py2
- openvino-inference-batch
- openvino-inference-stream
- pytorch-training
- pytorch-training-py2
- tf-inference-batch
- tf-inference-stream
- tf-training-horovod
- tf-training-horovod-py2
- tf-training-multi
- tf-training-multi-py2
- tf-training-single
- tf-training-single-py2
[nauta01@console ~]$

thanks,
yas

Tiller instance is not ready

Hello,
Attempting to install Nauta fails with an error.

TASK [applications : [platform] Wait for at least one tiller instances to get ready] *******************************************************************
Friday 13 September 2019 13:21:24 +0900 (0:00:00.280) 0:02:40.789 ******
FAILED - RETRYING: [platform] Wait for at least one tiller instances to get ready (60 retries left).
FAILED - RETRYING: [platform] Wait for at least one tiller instances to get ready (59 retries left).
.
.
.
FAILED - RETRYING: [platform] Wait for at least one tiller instances to get ready (3 retries left).
FAILED - RETRYING: [platform] Wait for at least one tiller instances to get ready (2 retries left).
FAILED - RETRYING: [platform] Wait for at least one tiller instances to get ready (1 retries left).
fatal: [master-0]: FAILED! => {"attempts": 60, "changed": false, "cmd": "kubectl --namespace=kube-system get ds tiller -o jsonpath='{.status.numberReady}'", "delta": "0:00:00.127966", "end": "2019-09-13 13:22:36.301711", "rc": 0, "start": "2019-09-13 13:22:36.173745", "stderr": "", "stderr_lines": [], "stdout": "0", "stdout_lines": ["0"]}

NO MORE HOSTS LEFT *************************************************************************************************************************************
to retry, use: --limit @/usr/local/src/nauta-install-2/platform/nauta.retry

I have no idea how to solve it. Can anyone tell me how to solve it?
I'm sorry if I'm clumsy; I'm using git for the first time.
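
A note for anyone reading the fatal line above: the `kubectl` command itself exited with `rc` 0 and printed `0`, so the `tiller` DaemonSet reported zero ready pods on every retry. The failure is the readiness count, not the command. Below is a minimal Python sketch for pulling those fields out of such a line. `parse_ansible_fatal` is a hypothetical helper, and the sample line is an abbreviated copy of the one in this report (timing fields dropped).

```python
import json


def parse_ansible_fatal(line: str) -> dict:
    """Return the JSON payload of an Ansible 'fatal: [host]: FAILED! => {...}' line."""
    marker = "FAILED! => "
    _, found, payload = line.partition(marker)
    if not found:
        raise ValueError("not an Ansible fatal line")
    return json.loads(payload)


# Abbreviated copy of the fatal line from the report (assumption: only the
# fields needed for the diagnosis are kept).
sample = (
    'fatal: [master-0]: FAILED! => {"attempts": 60, "changed": false, '
    '"cmd": "kubectl --namespace=kube-system get ds tiller '
    "-o jsonpath='{.status.numberReady}'\", "
    '"rc": 0, "stderr": "", "stdout": "0"}'
)

result = parse_ansible_fatal(sample)
ready_pods = int(result["stdout"])  # value of .status.numberReady
print(f"rc={result['rc']} attempts={result['attempts']} ready_pods={ready_pods}")
```

A ready count of zero after all retries suggests looking at the tiller pods themselves (their status and events) rather than at the installer task.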

failed to install helm chart

hi,

A Helm chart installation error occurred during Nauta installation.
Could you tell me how to solve it?

TASK [applications : [platform] Install helm chart] *********************************************************************************************************************************************
Friday 11 October 2019  10:56:41 +0900 (0:00:00.287)       0:03:17.477 ********
fatal: [nauta-master]: FAILED! => {"changed": true, "cmd": "helm upgrade nauta-k8s-platform --namespace kube-system -i /tmp/nauta-platform-1.0.0-oss-20191008083632.tgz --wait -f /tmp/nauta-platform-1.0.0-oss-20191008083632.values.yaml", "delta": "0:00:00.151825", "end": "2019-10-11 10:56:41.334962", "msg": "non-zero return code", "rc": 1, "start": "2019-10-11 10:56:41.183137", "stderr": "Error: UPGRADE FAILED: \"nauta-k8s-platform\" has no deployed releases", "stderr_lines": ["Error: UPGRADE FAILED: \"nauta-k8s-platform\" has no deployed releases"], "stdout": "", "stdout_lines": []}

NO MORE HOSTS LEFT ******************************************************************************************************************************************************************************
        to retry, use: --limit @/usr/local/src/platform/nauta.retry

PLAY RECAP **************************************************************************************************************************************************************************************
localhost                  : ok=4    changed=0    unreachable=0    failed=0
nauta-master               : ok=559  changed=7    unreachable=0    failed=1
worker01                   : ok=163  changed=5    unreachable=0    failed=0
worker02                   : ok=163  changed=5    unreachable=0    failed=0

thanks,
yas
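
When a run involves several hosts, the PLAY RECAP block shows which host failed. Here is a minimal Python sketch that parses recap lines like the ones above and lists the hosts with a nonzero `failed` count. `parse_recap` and `failed_hosts` are hypothetical helper names, and the regular expression assumes the plain `host : key=value ...` layout shown in this report.

```python
import re

# Matches lines such as "nauta-master : ok=559 changed=7 unreachable=0 failed=1".
RECAP_RE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counts>(?:\w+=\d+\s*)+)$")


def parse_recap(text: str) -> dict:
    """Map each host in a PLAY RECAP block to its counters."""
    hosts = {}
    for raw in text.splitlines():
        match = RECAP_RE.match(raw.strip())
        if match:
            pairs = (item.split("=") for item in match.group("counts").split())
            hosts[match.group("host")] = {key: int(value) for key, value in pairs}
    return hosts


def failed_hosts(text: str) -> list:
    """Return hosts whose 'failed' counter is greater than zero."""
    return [host for host, c in parse_recap(text).items() if c.get("failed", 0) > 0]


recap = """\
localhost                  : ok=4    changed=0    unreachable=0    failed=0
nauta-master               : ok=559  changed=7    unreachable=0    failed=1
worker01                   : ok=163  changed=5    unreachable=0    failed=0
worker02                   : ok=163  changed=5    unreachable=0    failed=0
"""

print(failed_hosts(recap))  # ['nauta-master']
```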

Need docs in pdf!

Feature Request

I think we need install and technical documents in PDF too. I really, really don't want lengthy documentation in my IDE, lol.

Question about exp submit.

Hello,
I have a question about submitting experiments.
When I submit an experiment reusing the same name, the message below appears.


Problems during submitting experiment: experiment with name: horovod-pod already exist!

Do I always have to give an experiment a new name each time I submit, even for the same training?

I always appreciate your help.
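
The message above suggests that experiment names must be unique. If that is the case, one workaround is to derive a fresh name for each submission from a fixed base name. This is a minimal Python sketch under that assumption. `unique_experiment_name` is a hypothetical helper, not part of Nauta, and the timestamp suffix format is an arbitrary choice.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_experiment_name(base: str, now: Optional[datetime] = None) -> str:
    """Append a UTC timestamp suffix so repeated submissions get distinct names."""
    now = now or datetime.now(timezone.utc)
    # Lowercase letters, digits and hyphens only, to stay within common
    # Kubernetes-style naming rules (assumption: Nauta follows them).
    return f"{base}-{now.strftime('%Y%m%d-%H%M%S')}"


print(unique_experiment_name("horovod-pod"))
```

The generated string can then be supplied as the experiment name at submission time, so that each run of the same training gets a distinct name sharing a common prefix.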

Need more validated builds besides Ubuntu!!

Feature Request

Ubuntu is great, and I've heard that other platforms are supported, but we need them validated.
We need Windows Server 2019, Red Hat and SUSE. I'd like to see Oracle Linux too, if you can do it.

Migrate from kube-batch to Volcano

Description :

Volcano is now a CNCF project, and it provides more enhancements than kube-batch, so we suggest migrating from kube-batch to Volcano.

If you have any comments or suggestions, please let me know :)
