numericalalgorithmsgroup / mlfirststeps_azure

Tutorial demonstrating how to get started porting an existing Machine Learning application to MS Azure

Dockerfile 1.29% Shell 22.29% Python 76.42%

mlfirststeps_azure's Issues

Allow for SPOT pricing

Allow the option of Spot pricing by adding one of the following to the az vm create command:

--priority Spot

--priority Regular

Further questions:

Should the type (Regular/Spot) be set through an environment variable?
If we use Spot and the create command fails for lack of capacity, what should we do?
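
One possible answer to both questions, as a minimal sketch rather than a decided policy: read the priority from an environment variable and fall back to Regular if a Spot request fails. The variable name `VM_PRIORITY` and the function `create_vm` are hypothetical; only the `--priority` flag comes from the issue above.

```shell
# Hypothetical variable: VM_PRIORITY selects Regular or Spot (default Regular).
vm_priority="${VM_PRIORITY:-Regular}"

# create_vm RG VM_NAME [extra "az vm create" arguments...]
# Tries the requested priority first. If a Spot request fails (for example,
# through lack of capacity), it retries once at Regular priority.
create_vm() {
    rg="$1"; name="$2"; shift 2
    if az vm create --resource-group "$rg" --name "$name" \
            --priority "$vm_priority" "$@"; then
        return 0
    fi
    if [ "$vm_priority" = "Spot" ]; then
        echo "Spot request failed; retrying with --priority Regular" >&2
        az vm create --resource-group "$rg" --name "$name" \
            --priority Regular "$@"
        return $?
    fi
    return 1
}
```

The remaining arguments (image, size, ssh key and so on) are passed through unchanged, e.g. `VM_PRIORITY=Spot` followed by `create_vm "$rg_name" "$vm_name" --size "$vm_size" ...`. Silently falling back to Regular costs more, so printing the message and letting the user decide may be preferable.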

Making it easy for the copy and pasters

We currently do, for example,

az group create --name <rg_name> --location <location>

If I were working along with this tutorial and copying and pasting, it would be annoying to change <rg_name> every time.

Should we start the tutorial with creating the variables and then using them throughout?
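
A minimal sketch of what that could look like, using the default values from the proposed script header in the 'Newby-fying' issue; the function name `create_resource_group` is hypothetical and only there to keep the sketch self-contained.

```shell
# Define the names once, at the start of the tutorial.
rg_name="NCF-Tutorial"
location="southcentralus"

# Every later command reuses the variables, so it can be pasted unchanged.
create_resource_group() {
    az group create --name "$rg_name" --location "$location"
}
```

A reader then only edits the two assignments; everything after them is copy and paste.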

Doesn't work for K80s

In the text, we say that it works for any NVIDIA GPU.

I just tried K80s and it didn't work:

vm_size="Standard_NC6"

gives

Deployment failed. Correlation ID: 70fe08db-0c8b-4753-9602-6b041f35a430. {
  "error": {
    "code": "BadRequest",
    "message": "The selected VM Size 'Standard_NC6' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size.

Unless we have actually tested other NVIDIA GPUs, we should just say something like 'We used a P100 in this example. More powerful NVIDIA GPUs may work but we haven't tested them.'
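
The error is about the hypervisor generation of the image rather than the GPU itself, so a script could check the VM size before trying to create anything. This is a rough sketch, assuming that the capabilities returned by `az vm list-skus` include a `HyperVGenerations` entry with values such as `V1` or `V1,V2`; the function name `sku_supports_generation` is hypothetical.

```shell
# sku_supports_generation SIZE LOCATION GENERATION
# Succeeds if the VM size reports support for the given Hyper-V generation.
# Assumption: the SKU lists a 'HyperVGenerations' capability like "V1,V2".
sku_supports_generation() {
    size="$1"; loc="$2"; gen="$3"
    gens=$(az vm list-skus --location "$loc" --size "$size" \
        --query "[?name=='$size'].capabilities[] | [?name=='HyperVGenerations'].value | [0]" \
        --output tsv)
    # Wrap in commas so that "V2" matches a whole list item only.
    case ",$gens," in
        *",V$gen,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `sku_supports_generation "$vm_size" "$location" 2 || echo "Pick a different size or image"` would stop early instead of failing in the deployment.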

Number of GPUs

We do

python -m torch.distributed.launch \
  --nproc_per_node=<number_of_gpus> \
  --use_env ncf.py \
  --data /data/cache/ml-25m \
  --checkpoint_dir /work \
  --threshold 0.9

Given that the tutorial explicitly uses an instance with one GPU, should we set <number_of_gpus> to 1?
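
An alternative to hard-coding 1 is to count the GPUs on the VM and default to 1 when that fails. A minimal sketch, assuming `nvidia-smi` is on the PATH of the machine that runs the launcher; `count_gpus` is a hypothetical helper name.

```shell
# Print the number of GPUs that nvidia-smi lists. Falls back to 1 (the
# tutorial's single-GPU instance) if nvidia-smi is missing or lists nothing.
count_gpus() {
    n=$(nvidia-smi --list-gpus 2>/dev/null | grep -c '^GPU ' || true)
    if [ "${n:-0}" -ge 1 ]; then
        echo "$n"
    else
        echo 1
    fi
}
```

The launch line would then use `--nproc_per_node="$(count_gpus)"`, which still works unchanged on a multi-GPU instance.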

Newbie-fying the training script

If you don't edit the training script, it will fail because of the ssh keys. Shall we make the beginning of the script more newbie friendly?

Giving instructions on what must be edited, for example?

Failing without running any az commands if it is not edited?

Something like

#!/bin/bash

#Entries you must edit or this script will not work for you
sshkey="$HOME/.ssh/id_rsa.pub"
#echo "This script needs editing before you can use it."
#exit

# User configs that you can edit if you like
rg_name="NCF-Tutorial"
vm_name="NCF-Trainer"
location="southcentralus"
vm_size="Standard_NC6s_v2"
admin_user=$USER

# Not necessary to edit below this line
# Azure/docker specific config
work_mount=/work
dataset_mount=/data
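
Instead of an unconditional echo and exit, the script could test for the key and stop only when it is missing. A minimal sketch; the function name `require_ssh_key` is hypothetical.

```shell
# require_ssh_key PATH
# Fails with a hint if the public key file is missing or unreadable.
require_ssh_key() {
    if [ ! -r "$1" ]; then
        echo "SSH public key not found at '$1'." >&2
        echo "Edit the sshkey entry at the top of this script, or create a key with ssh-keygen." >&2
        return 1
    fi
}
```

Calling `require_ssh_key "$sshkey" || exit 1` straight after the config block means nothing is created on Azure unless the key exists.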

Watching for 'waiting for a lock'

In the text we say

'yum is "waiting for a lock". This can occur when azure extensions are still running in the background'

This suggests that the script will sometimes fail for this reason.
Is there any way we can either check that the Azure extensions have finished or catch the lock error and act accordingly?
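
For the second option, a generic retry wrapper is one way to act on the lock error. This is a sketch only: it retries on any failure, not specifically on the lock message, and the name `retry` is hypothetical.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry() {
    attempts="$1"; delay="$2"; shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "Attempt $i of $attempts failed: $*" >&2
        # No point sleeping after the final attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

The yum step would then be wrapped as `retry 10 30 <the install command>`, giving the background extensions up to about five minutes to release the lock.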

A check for personal ratings?

Should we add a script that allows users to run a check on their personal ratings before submitting for training?
The workflow might then be

Fork this repo to your own github account
Edit personalised ratings
Run rating_checker.py on local machine to make sure that you did it OK (Better to discover this now rather than when you have spent money on GPUs)

We might split the tutorial then into 3 sections

1 You want to benchmark and run the standard model with the minimum of effort. Just edit and run the training script.
2 You want to understand what's going on and want to run through the tutorial to do step 1 manually.
3 You want your own recommendations

Personalised recommendations

Hi Phil

While working through the tutorial, I found myself worrying a lot about the personalised recommendations section. After moving it around a lot and still being unhappy, I wonder if we should remove it from this blog post and maybe have it as a follow-up.

I have three reasons for thinking this.

While it's fun, it doesn't add anything to the main thrust of the tutorial, which is 'Take a model and get it running on Azure'.
While following the tutorial, I found myself working through the section on personalisation on the P100 VM. That is, I was paying for a GPU and yet spending my time editing a text file. Yes, I am an idiot. I doubt I'll be the only one.
If we were to do a follow-up article that includes the extra material, we could perhaps make it another teachable moment. Essentially, we could say 'You got your training pipeline working on Azure. It uses expensive GPUs and is good. But now you want to add some data to it. How best to do that?' (Would the solution perhaps involve me editing my file on my local machine, uploading to an S3 bucket and then running the model?)

Something to discuss on Friday.

Mike

Error when using Windows subsystem for Linux

On running the ./deploy_and_run_training.sh I get a bunch of errors like

./deploy_and_run_training.sh: line 2: $'\r': command not found
./deploy_and_run_training.sh: line 10: $'\r': command not found
./deploy_and_run_training.sh: line 21: $'\r': command not found
Creating VM Instance
====================

./deploy_and_run_training.sh: line 23: $'\r': command not found

Resource Group:
az group create: error: the following arguments are required: --name/--resource-group/-n/-g, --location/-l
usage: az group create [-h] [--verbose] [--debug]
                       [--output {json,jsonc,table,tsv,yaml,yamlc,none}]
                       [--query JMESPATH] [--subscription _SUBSCRIPTION]
                       --name RG_NAME --location LOCATION
                       [--tags [TAGS [TAGS ...]]] [--managed-by MANAGED_BY]

Fixed with

dos2unix deploy_and_run_training.sh

This is because of the way I have git checkout configured and is easily fixed but may be encountered by others.

Should we have a Troubleshooting section that mentions this?
How can we write the script so that in the event of this happening, it fails much sooner? Ideally I'd like it to never get to the az commands. Changing the first line to

#!/bin/bash -e

does the job. Now it fails at the first line

./deploy_and_run_training.sh
./deploy_and_run_training.sh: line 2: $'\r': command not found

but will this lead to unintended consequences? I've been bitten by the -e flag before.
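
A narrower alternative to -e is to test for carriage returns explicitly. A minimal sketch, with a hypothetical helper name `has_crlf`; the check has to live in a file that itself has Unix line endings (a small wrapper, say), because a guard inside the affected script would be broken by the same `\r` characters.

```shell
# has_crlf FILE
# Succeeds if FILE contains at least one carriage-return character.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}
```

A wrapper could then do `if has_crlf deploy_and_run_training.sh; then echo "Run dos2unix first"; exit 1; fi` before starting the real script. Adding a `.gitattributes` line such as `*.sh text eol=lf` to the repo is another option, since it asks git to check the scripts out with LF endings regardless of the user's configuration.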

Deleting the VM instance

When running the automated script, it asks my permission to delete the VM instance.

Deleting VM Instance
====================

Are you sure you want to perform this operation? (y/n):

So the script is not fire and forget.
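
The prompt can be suppressed with the `--yes` flag. A minimal sketch, assuming the script deletes the instance with `az vm delete` (if it deletes the whole resource group instead, `az group delete` accepts `--yes` as well); the function name `delete_vm` is hypothetical.

```shell
# delete_vm RG VM_NAME
# Deletes the VM without asking for confirmation.
delete_vm() {
    az vm delete --resource-group "$1" --name "$2" --yes
}
```

With this, `delete_vm "$rg_name" "$vm_name"` runs unattended. Keeping the prompt as the default and offering an opt-in variable for unattended runs would be the more cautious design.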

Making sure the NVIDIA driver extension has finished before doing anything more

In the text we recommend waiting for up to 10 minutes.

In the script we wait for 60 seconds.

Could we reach out to Xavier, for example, on how he deals with this in his end to end scripts?
Maybe there is a way to probe the VM every 30 seconds or so?

I hope for a better solution than 'wait for an indeterminate amount of time'
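
One way to probe is to poll the extension's provisioning state until it reports success. This is a sketch under assumptions: that the driver was installed as a VM extension whose name is passed in (for the NVIDIA Linux driver extension this is typically `NvidiaGpuDriverLinux`), and that its `provisioningState` becomes `Succeeded` when it has finished. The function name `wait_for_extension` is hypothetical.

```shell
# wait_for_extension RG VM EXTENSION [INTERVAL_SECONDS] [TIMEOUT_SECONDS]
# Polls the extension state until it is Succeeded, Failed, or the timeout passes.
# INTERVAL_SECONDS must be at least 1.
wait_for_extension() {
    rg="$1"; vm="$2"; ext="$3"
    interval="${4:-30}"; timeout="${5:-900}"
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        state=$(az vm extension show --resource-group "$rg" --vm-name "$vm" \
            --name "$ext" --query provisioningState --output tsv 2>/dev/null || true)
        case "$state" in
            Succeeded) return 0 ;;
            Failed) echo "Extension $ext failed to provision" >&2; return 1 ;;
        esac
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "Timed out after ${timeout}s waiting for $ext" >&2
    return 1
}
```

Replacing the fixed 60 second sleep with `wait_for_extension "$rg_name" "$vm_name" NvidiaGpuDriverLinux 30 900` would wait only as long as needed, up to 15 minutes.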

Benchmarking results

Imagine that I want to use this as a benchmark. I.e. point it at an instance type, launch the training and keep the result.

Could we return the training time to the user as well as the results of the training?
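
A minimal sketch of how the script could report wall-clock time, assuming one-second resolution is enough for a benchmark; `timed` is a hypothetical helper. The timing goes to stderr so that it does not mix with the training output on stdout.

```shell
# timed COMMAND [ARGS...]
# Runs COMMAND, reports the elapsed wall-clock seconds on stderr,
# and returns COMMAND's exit status.
timed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "Elapsed: $((end - start)) s (exit status $status)" >&2
    return "$status"
}
```

Wrapping the training step as `timed <the training command>` would give one number per instance type. Note that this includes everything the command does (for example data loading), not just the training loop.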