
dlworkspace's Introduction


Deep Learning Workspace (DLWorkspace) is an open-source toolkit that lets AI scientists spin up an AI cluster in a turn-key fashion. Once set up, DLWorkspace provides a web UI and/or RESTful API that lets AI scientists run jobs (interactive exploration, training, inference, data analytics) on the cluster, with resources allocated by the DLWorkspace cluster for each job (e.g., a single-node job with a couple of GPUs connected via GPU Direct, or a distributed job with multiple GPUs per node). DLWorkspace also provides a unified job template and operating environment that lets AI scientists easily share their jobs and settings among themselves and with the outside community. Out of the box, DLWorkspace supports all major deep learning toolkits (PyTorch, TensorFlow, CNTK, Caffe, MXNet, etc.) and popular big-data analytics toolkits such as Hadoop/Spark.

Documentation

dlworkspace's People

Contributors

anbang-hu, cutieowl, debbie-alaine, deepak-ms, dependabot[bot], gerhut, hao1939, hongzhili, jinlccs, jinlmsft, junjieqian, jzhzhz, kant, leigao-ms, leigaoms, leohongyi, ningzhou, on-the-run, rajpratik71, resouer, sanjeevm0, shadensmith, shamrock-frost, shiyutong123, tjruwase, tomjaguarpaw, xericzephyr, xudifsd, yinyangofdao


dlworkspace's Issues

Comparison with Azure Batch AI

Hello, is it possible to add a section to the FAQ explaining the difference between DL Workspace and Azure Batch AI? Thanks.

Upgrade to marshmallow 3

👋 Just dropping by to let you know that marshmallow v3 has been released.

Upgrading will provide a few benefits for this project. For example, paired declarations such as

dump_to="jobId", load_from="jobId"

collapse into a single data_key="jobId".

You could even do automatic camel-casing using this snippet: https://marshmallow.readthedocs.io/en/latest/examples.html#inflection-camel-casing-keys
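The inflection example linked above boils down to converting snake_case field names into camelCase data keys. A standalone sketch of just the name conversion (pure Python; the marshmallow wiring, via a Schema hook, is in the linked docs):

```python
def camelcase(s: str) -> str:
    """Convert a snake_case name like 'job_id' to camelCase 'jobId'."""
    head, *rest = s.split("_")
    return head + "".join(part.title() for part in rest)

# camelcase("job_id") -> "jobId", camelcase("vc_name") -> "vcName"
```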

After skimming the codebase, it looks like the migration will be straightforward.

  • Replace usages of load_from and dump_to with data_key.
  • Update call sites of JobSchema.load[s] to expect the deserialized data as the return value instead of a (data, errors) tuple. Instead of reading errors out of the tuple, catch ValidationError.
from marshmallow import ValidationError

try:
    # In marshmallow 3, load() is called on a schema instance and returns
    # the deserialized data directly; validation failures raise.
    job_object = JobSchema().load(job)
except ValidationError as error:
    errors = error.messages

Deploy cluster on Azure

Hi,

I am having some trouble accessing the cluster once I have created it. I am following the instructions from https://github.com/Microsoft/DLWorkspace/blob/alpha.v1.5/docs/deployment/Azure/Readme.md. My configuration file is as follows:

  cluster_name: beowolf

  azure_cluster:
    beowolf:
      infra_node_num: 1
      infra_vm_size: Standard_D3_v2
      worker_node_num: 2
      worker_vm_size: Standard_NC6
      azure_location: westus2

  UserGroups:
    DLWSAdmins:
      Allowed: [ "@gmail.com" ]
      uid: "20000"
      gid: "20001"
    DLWSRegister:
      Allowed: [ "@gmail.com" ]
      uid: "20001-29999"
      gid: "20001"

  WebUIregisterGroups: [ "DLWSRegister" ]
  WebUIauthorizedGroups: [ "DLWSAdmins" ]
  WebUIadminGroups: [ "DLWSAdmins" ]

  DeployAuthentications: [ "Corp", "Gmail", "Live" ]

  WinbindServers: []

  webuiport: 3080

Everything installs fine as far as I can tell. I run ./deploy.py display and get the URL, but when I try to access the UI from my browser I get a 502 Bad Gateway error. I am running the setup on a remote Linux VM and accessing it from my local laptop.

I have tried a number of times and still get the same error. Evidently I am missing something, but I am not sure what. Thanks in advance.

Problem in gen_configs() in deploy.py

When calling ./deploy.py acs postdeploy I get:

File "./deploy.py", line 826, in gen_configs
config["api_servers"] = "https://"+config["kubernetes_master_node"][0]+":"+str(config["k8sAPIport"])

The reason is that config["kubernetes_master_node"] is an empty list, so indexing [0] fails. I could perhaps fill in the missing values, but I need more information on the structure of the config. What exactly is stored in
config["kubernetes_master_node"] = Nodes (line 646)?

With that information I could try to fix the connection problem. The configuration check prints:

Checking configurations 'kubernetes_master_node' = '[]'
Checking configurations 'kubernetes_master_ssh_user' = 'dlwsadmin'
Checking configurations 'api_servers' = 'https://clusterdl-clusterdlresgrp-3d8900mgmt.westeurope.cloudapp.azure.com:1443'
Checking configurations 'etcd_user' = 'dlwsadmin'
Checking configurations 'etcd_node' = '[]'
Checking configurations 'etcd_endpoints' = ''
Checking configurations 'ssh_cert' = './deploy/sshkey/id_rsa'
Checking configurations 'pod_ip_range' = '10.2.0.0/16'
Checking configurations 'kubernetes_docker_image' = 'mlcloudreg.westus.cloudapp.azure.com:5000/dlworkspace/hyperkube:v1.7.5'
Checking configurations 'service_cluster_ip_range' = '10.3.0.0/16'
./
./macvlan
./dhcp
./loopback
./ptp
./ipvlan
./bridge
./tuning
./noop
./host-local
./cnitool
./flannel
lost connection
lost connection
lost connection
lost connection
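The IndexError reported above could at least fail with a clearer message. A hypothetical guard (build_api_server_url is an illustrative name, not a function in deploy.py; the config keys are taken from the traceback):

```python
def build_api_server_url(config: dict) -> str:
    """Build the api_servers URL, failing loudly if no master node exists.

    Sketch of a defensive version of the line 826 logic from the traceback;
    not the project's actual code.
    """
    masters = config.get("kubernetes_master_node") or []
    if not masters:
        raise ValueError(
            "kubernetes_master_node is empty; was the master node "
            "provisioned before running postdeploy?")
    return "https://%s:%s" % (masters[0], config["k8sAPIport"])
```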

Multinode training (data-parallel)

Hi, is it possible to do multi-node data-parallel training in TensorFlow or other frameworks? Does DLWorkspace require any special parameters or setup? Are there any examples of how to do this?
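Conceptually, data-parallel training runs the same model on every worker, computes gradients on per-worker data shards, and averages them before each update. A framework-agnostic toy sketch of that loop (not DLWorkspace's API; real jobs would use e.g. tf.distribute or torch.distributed):

```python
def parallel_step(weights, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel update: each 'worker' computes a
    gradient on its own shard, the gradients are averaged (the all-reduce),
    and the shared weights are updated once."""
    grads = [grad_fn(weights, shard) for shard in shards]  # one per worker
    avg = [sum(g) / len(grads) for g in zip(*grads)]       # all-reduce mean
    return [w - lr * g for w, g in zip(weights, avg)]

# Toy objective: mean squared error of a single weight against targets.
def mse_grad(weights, shard):
    return [sum(2 * (weights[0] - t) for t in shard) / len(shard)]
```

For example, starting from weight 0.0 with shards [1.0] and [3.0], the averaged gradient pulls the weight toward the overall mean of the targets.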

On-prem Single-Ubuntu

Hi guys, I want to test DLWorkspace on my Ubuntu server WITHOUT connecting to any Azure cluster.

So I am following the Single-Ubuntu on-prem document. Unfortunately I am stuck at the last step:

./deploy.py --verbose scriptblocks ubuntu_uncordon

My Ubuntu server has the same setup as indicated in the document:

  1. Ubuntu 16.04 x64

Things I have done:

  1. Ran src/ClusterBootstrap/install_prerequisites.sh as mentioned in DevEnvironment/Readme.md.
  2. Manually installed docker-ce, built a new GPU-enabled Kubernetes binary, and added its folder to PATH, also per DevEnvironment/Readme.md.
  3. Created a pair of SSH keys and manually put them inside the ./src/ClusterBootstrap/deploy/sshkey folder.
  4. Set up the mssqlserver docker container on my Ubuntu server, with the port and authentication configured as shown in my config.yaml (docker-compose run db; I am assuming DLWorkspace will find the database and set up everything it needs).
  5. Set up an email account at outlook.com and put the info inside my config.yaml, following the Auth document.

Now my whole config.yaml file looks like this:
[screenshot of config.yaml]

Now I keep getting the error shown below:
[screenshot of the error]

It looks like the Kubernetes cluster is not running. What should I do to get it working on my own server?

useFetch has been updated with breaking changes

I was just taking a look at some of the people using use-http and saw this repo. Great work! I just wanted to let you know that I deprecated useGet, usePost, usePatch, etc., since all that functionality is baked into useFetch.

So, instead of

  const { data, error, get } = useGet<Job>({
    url: `/api/clusters/${clusterId}/jobs/${jobId}`,
    onMount: true
  });

just do this instead

  const { data, error, get } = useFetch<Job>({
    url: `/api/clusters/${clusterId}/jobs/${jobId}`,
    onMount: true
  });

Looks like you're only using useGet in 4 places, so it should be a quick fix.

Error when trying to run a job

I tried running a few different jobs and I get a message similar to the one below:

=========================================================
logs from pod: c394fd78-1699-4d10-92ad-922f7bda2f37
=========================================================
failed to open log file "/var/log/pods/322513bd-524a-11e8-9e0d-000d3a0621a8/c394fd78-1699-4d10-92ad-922f7bda2f37_0.log": open /var/log/pods/322513bd-524a-11e8-9e0d-000d3a0621a8/c394fd78-1699-4d10-92ad-922f7bda2f37_0.log: no such file or directory
=========================================================
end of logs from pod: c394fd78-1699-4d10-92ad-922f7bda2f37
=========================================================

I am using the alpha 1.5 release.
Is there a configuration I missed?

Hope to get a more detailed installation description

1. When I tried to set up the development environment, the installation instructions say there is a 'src/ClusterBootstrap/deploy' folder that contains important information for accessing the deployed DLWorkspace cluster, but I could not find such a folder.
2. After I ran the following commands, I could not get to the '/home/DLWorkspace/src/ClusterBootstrap' folder.

git clone https://github.com/microsoft/DLWorkspace
docker run -ti -v DLWorkspace:/home/DLWorkspace jinl/dlworkspacedevdocker /bin/bash
cd /home/DLWorkspace/src/ClusterBootstrap

These commands were run on Ubuntu 16.04.

Consider using kubeadm for Kubernetes cluster setup

The benefit of kubeadm is that it takes responsibility for Kubernetes versioning, i.e., the versions of etcd, kubelet, kube-addons, and the master pods; it also supports custom CRIs, which GPUs require.

The change to move to kubeadm is straightforward:

  1. Install kubeadm and kubelet on every node.
  2. On the master node, instead of creating the master Pods, just execute kubeadm init.
  3. On every worker node, execute kubeadm join <masterIP:port>.
  4. Install the network plugin: kubectl apply -f flannel.yaml.
  5. Install addons: DNS is included by default; the others are the same as today.

No pressure here, just bringing up a thought.

init script is hanging in NFS write

root@dltsw7be500001T:/tmp/sample_repo_files# pstree -p
bash(1)---bash(10)---bash(1054)
root@dltsw7be500001T:/tmp/sample_repo_files# ps aux | cat
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0  19040  3968 ?        Ss   01:35   0:00 bash /pod/scripts/bootstrap.sh
root         10  0.0  0.0  19040  3936 ?        S    01:35   0:00 bash /pod/scripts/init_user.sh
root       1054  0.0  0.0  19040  3560 ?        D    01:35   0:00 bash /pod/scripts/init_user.sh
root       5551  0.0  0.0  19264  4348 pts/0    Ss   08:13   0:00 bash
root       6140  0.0  0.0  34516  2956 pts/0    R+   08:16   0:00 ps aux
root       6141  0.0  0.0   4624   868 pts/0    S+   08:16   0:00 cat
root@dltsw7be500001T:/tmp/sample_repo_files# cat /proc/1054/stack
[<0>] nfs_wait_bit_killable+0x24/0xa0 [nfs]
[<0>] nfs4_wait_clnt_recover+0x67/0x80 [nfsv4]
[<0>] nfs4_client_recover_expired_lease+0x1d/0x60 [nfsv4]
[<0>] nfs4_do_open+0xf5/0x690 [nfsv4]
[<0>] nfs4_atomic_open+0xea/0x100 [nfsv4]
[<0>] nfs4_file_open+0x113/0x280 [nfsv4]
[<0>] do_dentry_open+0x1c2/0x310
[<0>] vfs_open+0x4f/0x80
[<0>] path_openat+0x676/0x1780
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bb/0x2c0
[<0>] __x64_sys_open+0x21/0x30
[<0>] do_syscall_64+0x6a/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

There should be some way to prevent the process from hanging indefinitely.
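One possible mitigation (a sketch, not the project's actual fix) is to run the user-init step under a hard timeout, so a hung NFS open fails loudly instead of leaving the pod stuck:

```python
import subprocess

def run_with_timeout(cmd, timeout_s=300):
    """Run an init command, converting a hang into an actionable error.

    Illustrative only: wraps subprocess.run's timeout handling around a
    hypothetical init command such as ["bash", "/pod/scripts/init_user.sh"].
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        raise RuntimeError("%r exceeded %ss; check NFS server health and "
                           "mount options" % (cmd, timeout_s))
```

Caveat: a process genuinely stuck in uninterruptible sleep (state D, as in the ps output above) ignores even SIGKILL, so the kill issued after the timeout may itself block; tuning the NFS mount options (e.g. soft/timeo) is likely the more robust fix.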
