
dlworkspace's Introduction


Deep Learning Workspace (DLWorkspace) is an open-source toolkit that lets AI scientists spin up an AI cluster in a turn-key fashion. Once set up, DLWorkspace provides a web UI and/or RESTful API that lets AI scientists run jobs (interactive exploration, training, inference, data analytics) on the cluster, with resources allocated by the DLWorkspace cluster for each job (e.g., a single-node job with a couple of GPUs connected via GPU Direct, or a distributed job with multiple GPUs per node). DLWorkspace also provides a unified job template and operating environment that lets AI scientists easily share their jobs and settings among themselves and with the outside community. Out of the box, DLWorkspace supports all major deep learning toolkits (PyTorch, TensorFlow, CNTK, Caffe, MXNet, etc.) and popular big-data analytics toolkits such as Hadoop/Spark.

Documentation

dlworkspace's People

Contributors

anbang-hu, cutieowl, debbie-alaine, deepak-ms, dependabot[bot], gerhut, hao1939, hongzhili, jinlccs, jinlmsft, junjieqian, jzhzhz, kant, leigao-ms, leigaoms, leohongyi, ningzhou, on-the-run, rajpratik71, resouer, sanjeevm0, shadensmith, shamrock-frost, shiyutong123, tjruwase, tomjaguarpaw, xericzephyr, xudifsd, yinyangofdao


dlworkspace's Issues

Comparison with Azure Batch AI

Hello, is it possible to add a section to the FAQ explaining the difference between DL Workspace and Azure Batch AI? Thanks.

Upgrade to marshmallow 3

👋 Just dropping by to let you know that marshmallow v3 has been released.

Upgrading will provide a few benefits for this project. For example, paired declarations such as

dump_to="jobId", load_from="jobId"

collapse into a single data_key="jobId".

You could even do automatic camel-casing using this snippet: https://marshmallow.readthedocs.io/en/latest/examples.html#inflection-camel-casing-keys
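The inflection example linked above boils down to converting snake_case field names into camelCase data keys. A standalone sketch of just the name conversion (pure Python; the marshmallow wiring, via a Schema hook, is in the linked docs):

```python
def camelcase(s: str) -> str:
    """Convert a snake_case name like 'job_id' to camelCase 'jobId'."""
    head, *rest = s.split("_")
    return head + "".join(part.title() for part in rest)

# camelcase("job_id") -> "jobId", camelcase("vc_name") -> "vcName"
```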

After skimming the codebase, it looks like the migration will be straightforward.

  • Replace usages of load_from and dump_to with data_key.
  • Update call sites of JobSchema.load[s] to expect the deserialized data as the return value instead of a (data, errors) tuple. Instead of reading errors out of the tuple, catch ValidationError.
from marshmallow import ValidationError

try:
    # In marshmallow 3, load() is called on a schema instance and returns
    # the deserialized data directly; validation failures raise.
    job_object = JobSchema().load(job)
except ValidationError as error:
    errors = error.messages

Deploy cluster on Azure

Hi,

I am having some trouble accessing the cluster once I have created it. I am following the instructions from https://github.com/Microsoft/DLWorkspace/blob/alpha.v1.5/docs/deployment/Azure/Readme.md. My configuration file is as follows:

  cluster_name: beowolf

  azure_cluster:
    beowolf:
      infra_node_num: 1
      infra_vm_size: Standard_D3_v2
      worker_node_num: 2
      worker_vm_size: Standard_NC6
      azure_location: westus2

  UserGroups:
    DLWSAdmins:
      Allowed: [ "@gmail.com" ]
      uid: "20000"
      gid: "20001"
    DLWSRegister:
      Allowed: [ "@gmail.com" ]
      uid: "20001-29999"
      gid: "20001"

  WebUIregisterGroups: [ "DLWSRegister" ]
  WebUIauthorizedGroups: [ "DLWSAdmins" ]
  WebUIadminGroups: [ "DLWSAdmins" ]

  DeployAuthentications: [ "Corp", "Gmail", "Live" ]

  WinbindServers: []

  webuiport: 3080

Everything installs fine as far as I can tell. I run ./deploy.py display and get the URL, but when I try to access the UI from my browser I get a 502 Bad Gateway error. I am running the setup on a remote Linux VM and accessing it from my local laptop.

I have tried a number of times and still get the same error. Evidently I am missing something, but I am not sure what. Thanks in advance.

Problem in gen_configs() in deploy.py

When calling ./deploy.py acs postdeploy I get:

File "./deploy.py", line 826, in gen_configs
config["api_servers"] = "https://"+config["kubernetes_master_node"][0]+":"+str(config["k8sAPIport"])

The reason is that config["kubernetes_master_node"] is an empty list, so indexing [0] fails. I could perhaps fill in the missing values, but I need more information on the structure of the config. What exactly is stored in
config["kubernetes_master_node"] = Nodes (line 646)?

With that information I could try to fix the connection problem. The configuration check prints:

Checking configurations 'kubernetes_master_node' = '[]'
Checking configurations 'kubernetes_master_ssh_user' = 'dlwsadmin'
Checking configurations 'api_servers' = 'https://clusterdl-clusterdlresgrp-3d8900mgmt.westeurope.cloudapp.azure.com:1443'
Checking configurations 'etcd_user' = 'dlwsadmin'
Checking configurations 'etcd_node' = '[]'
Checking configurations 'etcd_endpoints' = ''
Checking configurations 'ssh_cert' = './deploy/sshkey/id_rsa'
Checking configurations 'pod_ip_range' = '10.2.0.0/16'
Checking configurations 'kubernetes_docker_image' = 'mlcloudreg.westus.cloudapp.azure.com:5000/dlworkspace/hyperkube:v1.7.5'
Checking configurations 'service_cluster_ip_range' = '10.3.0.0/16'
./
./macvlan
./dhcp
./loopback
./ptp
./ipvlan
./bridge
./tuning
./noop
./host-local
./cnitool
./flannel
lost connection
lost connection
lost connection
lost connection
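The IndexError reported above could at least fail with a clearer message. A hypothetical guard (build_api_server_url is an illustrative name, not a function in deploy.py; the config keys are taken from the traceback):

```python
def build_api_server_url(config: dict) -> str:
    """Build the api_servers URL, failing loudly if no master node exists.

    Sketch of a defensive version of the line 826 logic from the traceback;
    not the project's actual code.
    """
    masters = config.get("kubernetes_master_node") or []
    if not masters:
        raise ValueError(
            "kubernetes_master_node is empty; was the master node "
            "provisioned before running postdeploy?")
    return "https://%s:%s" % (masters[0], config["k8sAPIport"])
```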

Multinode training (data-parallel)

Hi, is it possible to do multi-node data-parallel training in TensorFlow or other frameworks? Does DLWorkspace require any special parameters or setup? Are there any examples of how to do this?
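Conceptually, data-parallel training runs the same model on every worker, computes gradients on per-worker data shards, and averages them before each update. A framework-agnostic toy sketch of that loop (not DLWorkspace's API; real jobs would use e.g. tf.distribute or torch.distributed):

```python
def parallel_step(weights, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel update: each 'worker' computes a
    gradient on its own shard, the gradients are averaged (the all-reduce),
    and the shared weights are updated once."""
    grads = [grad_fn(weights, shard) for shard in shards]  # one per worker
    avg = [sum(g) / len(grads) for g in zip(*grads)]       # all-reduce mean
    return [w - lr * g for w, g in zip(weights, avg)]

# Toy objective: mean squared error of a single weight against targets.
def mse_grad(weights, shard):
    return [sum(2 * (weights[0] - t) for t in shard) / len(shard)]
```

For example, starting from weight 0.0 with shards [1.0] and [3.0], the averaged gradient pulls the weight toward the overall mean of the targets.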

On-prem Single-Ubuntu

Hi guys, I want to test DLWorkspace on my Ubuntu server WITHOUT connecting to any Azure cluster.

So I am following the Single-Ubuntu on-prem document. Unfortunately I am stuck at the last step:

./deploy.py --verbose scriptblocks ubuntu_uncordon

My Ubuntu server has the same setup as indicated in the document:

  1. Ubuntu 16.04 x64

Things I have done:

  1. Ran src/ClusterBootstrap/install_prerequisites.sh as mentioned in DevEnvironment/Readme.md.
  2. Manually installed docker-ce, built a new GPU-enabled Kubernetes binary, and added its folder to PATH, also per DevEnvironment/Readme.md.
  3. Created a pair of SSH keys and manually put them inside the ./src/ClusterBootstrap/deploy/sshkey folder.
  4. Set up the mssqlserver docker container on my Ubuntu server, with the port and authentication configured as shown in my config.yaml (docker-compose run db; I am assuming DLWorkspace will find the database and set up everything it needs).
  5. Set up an email account at outlook.com and put the info inside my config.yaml, following the Auth document.

Now my whole config.yaml file looks like this:
[screenshot of config.yaml]

Now I keep getting the error shown below:
[screenshot of the error]

It looks like the Kubernetes cluster is not running. What should I do to get it working on my own server?

useFetch has been updated with breaking changes

I was just taking a look at some of the people using use-http and saw this repo. Great work! I just wanted to let you know that I deprecated useGet, usePost, usePatch, etc., since all that functionality is baked into useFetch.

So, instead of

  const { data, error, get } = useGet<Job>({
    url: `/api/clusters/${clusterId}/jobs/${jobId}`,
    onMount: true
  });

just do this instead

  const { data, error, get } = useFetch<Job>({
    url: `/api/clusters/${clusterId}/jobs/${jobId}`,
    onMount: true
  });

Looks like you're only using useGet in 4 places, so it should be a quick fix.

Error when trying to run a job

I tried running a few different jobs and I get a message similar to the one below:

=========================================================
logs from pod: c394fd78-1699-4d10-92ad-922f7bda2f37
=========================================================
failed to open log file "/var/log/pods/322513bd-524a-11e8-9e0d-000d3a0621a8/c394fd78-1699-4d10-92ad-922f7bda2f37_0.log": open /var/log/pods/322513bd-524a-11e8-9e0d-000d3a0621a8/c394fd78-1699-4d10-92ad-922f7bda2f37_0.log: no such file or directory
=========================================================
end of logs from pod: c394fd78-1699-4d10-92ad-922f7bda2f37
=========================================================

I am using the alpha 1.5 release.
Is there a configuration I missed?

Hope to get a more detailed installation description

1. When I tried to set up the development environment, the installation instructions say there is a 'src/ClusterBootstrap/deploy' folder that contains important information for accessing the deployed DLWorkspace cluster, but I could not find such a folder.
2. After I ran the following commands, I could not get to the '/home/DLWorkspace/src/ClusterBootstrap' folder.

git clone https://github.com/microsoft/DLWorkspace
docker run -ti -v DLWorkspace:/home/DLWorkspace jinl/dlworkspacedevdocker /bin/bash
cd /home/DLWorkspace/src/ClusterBootstrap

These commands were run on Ubuntu 16.04.

Consider using kubeadm for Kubernetes cluster setup

The benefit of kubeadm is that it takes responsibility for Kubernetes versioning, i.e., the versions of etcd, kubelet, kube-addons, and the master pods; it also supports custom CRIs, which GPUs require.

The change to move to kubeadm is straightforward:

  1. Install kubeadm and kubelet on every node.
  2. On the master node, instead of creating the master Pods, just execute kubeadm init.
  3. On every worker node, execute kubeadm join <masterIP:port>.
  4. Install the network plugin: kubectl apply -f flannel.yaml.
  5. Install addons: DNS is included by default; the others are the same as today.

No pressure here, just bringing up a thought.

init script is hanging in NFS write

root@dltsw7be500001T:/tmp/sample_repo_files# pstree -p
bash(1)---bash(10)---bash(1054)
root@dltsw7be500001T:/tmp/sample_repo_files# ps aux | cat
USER        PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root          1  0.0  0.0  19040  3968 ?        Ss   01:35   0:00 bash /pod/scripts/bootstrap.sh
root         10  0.0  0.0  19040  3936 ?        S    01:35   0:00 bash /pod/scripts/init_user.sh
root       1054  0.0  0.0  19040  3560 ?        D    01:35   0:00 bash /pod/scripts/init_user.sh
root       5551  0.0  0.0  19264  4348 pts/0    Ss   08:13   0:00 bash
root       6140  0.0  0.0  34516  2956 pts/0    R+   08:16   0:00 ps aux
root       6141  0.0  0.0   4624   868 pts/0    S+   08:16   0:00 cat
root@dltsw7be500001T:/tmp/sample_repo_files# cat /proc/1054/stack
[<0>] nfs_wait_bit_killable+0x24/0xa0 [nfs]
[<0>] nfs4_wait_clnt_recover+0x67/0x80 [nfsv4]
[<0>] nfs4_client_recover_expired_lease+0x1d/0x60 [nfsv4]
[<0>] nfs4_do_open+0xf5/0x690 [nfsv4]
[<0>] nfs4_atomic_open+0xea/0x100 [nfsv4]
[<0>] nfs4_file_open+0x113/0x280 [nfsv4]
[<0>] do_dentry_open+0x1c2/0x310
[<0>] vfs_open+0x4f/0x80
[<0>] path_openat+0x676/0x1780
[<0>] do_filp_open+0x9b/0x110
[<0>] do_sys_open+0x1bb/0x2c0
[<0>] __x64_sys_open+0x21/0x30
[<0>] do_syscall_64+0x6a/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

There should be some way to prevent the process from hanging indefinitely.
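One possible mitigation (a sketch, not the project's actual fix) is to run the user-init step under a hard timeout, so a hung NFS open fails loudly instead of leaving the pod stuck:

```python
import subprocess

def run_with_timeout(cmd, timeout_s=300):
    """Run an init command, converting a hang into an actionable error.

    Illustrative only: wraps subprocess.run's timeout handling around a
    hypothetical init command such as ["bash", "/pod/scripts/init_user.sh"].
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        raise RuntimeError("%r exceeded %ss; check NFS server health and "
                           "mount options" % (cmd, timeout_s))
```

Caveat: a process genuinely stuck in uninterruptible sleep (state D, as in the ps output above) ignores even SIGKILL, so the kill issued after the timeout may itself block; tuning the NFS mount options (e.g. soft/timeo) is likely the more robust fix.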
