numericalalgorithmsgroup / mlfirststeps_azure
Tutorial demonstrating how to get started porting an existing Machine Learning application to MS Azure
Allow the option for spot pricing. Add the following to the az vm create
command
--priority Spot
or
--priority Regular
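If we document Spot, we should probably show its companion flags too. A sketch of the Spot variant (the `--max-price` and `--eviction-policy` flags are standard `az vm create` options; the `...existing arguments...` placeholder stands for whatever the tutorial's create command already passes):

```shell
# Spot pricing with its related flags:
# --max-price -1 caps the price at the current on-demand rate;
# --eviction-policy controls what happens when Azure reclaims capacity.
az vm create \
    ...existing arguments... \
    --priority Spot \
    --max-price -1 \
    --eviction-policy Deallocate
```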
Further questions:
We currently do, for example,
az group create --name <rg_name> --location <location>
If I were working along with this tutorial, copying and pasting as I go, it would be annoying to change <rg_name>
every time.
Should we start the tutorial with creating the variables and then using them throughout?
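A possible shape, using the names that already appear in the training script as placeholder values (the exact values are ours to decide):

```shell
#!/bin/bash
# Define every name once at the top of the tutorial...
rg_name="NCF-Tutorial"
location="southcentralus"
vm_name="NCF-Trainer"

# ...then every later command can be copied and pasted unchanged:
az group create --name "$rg_name" --location "$location"
```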
In the text, we say that it works for any NVIDIA GPU, but I just tried a K80 and it didn't work. Setting
vm_size="Standard_NC6"
gives
Deployment failed. Correlation ID: 70fe08db-0c8b-4753-9602-6b041f35a430. {
"error": {
"code": "BadRequest",
"message": "The selected VM Size 'Standard_NC6' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size.
Unless we have actually tested on other NVIDIA GPUs, we should just say something like 'We used a P100 in this example. More powerful NVIDIA GPUs may work, but we haven't tested them.'
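We could also point readers at a way to check size/image compatibility up front. `az vm list-skus` reports a HyperVGenerations capability per SKU; a sketch of a query (the JMESPath expression is my guess at a readable projection):

```shell
# List NC-series sizes in the region and which Hypervisor
# Generations each one supports, to avoid the Gen2 boot error above.
az vm list-skus --location southcentralus --size Standard_NC \
    --query "[].{name: name, hyperVGenerations: capabilities[?name=='HyperVGenerations'].value | [0]}" \
    --output table
```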
We do
python -m torch.distributed.launch \
--nproc_per_node=<number_of_gpus> \
--use_env ncf.py \
--data /data/cache/ml-25m \
--checkpoint_dir /work \
--threshold 0.9
Given that the tutorial explicitly uses an instance with one GPU, should we set <number_of_gpus> to 1?
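Rather than hard-coding 1, the launch line could read the GPU count off the instance itself — a sketch, assuming `nvidia-smi` is on the PATH (`nvidia-smi -L` prints one line per GPU):

```shell
# Detect the number of GPUs instead of asking the reader to edit a value
num_gpus=$(nvidia-smi -L | wc -l)

python -m torch.distributed.launch \
    --nproc_per_node="$num_gpus" \
    --use_env ncf.py \
    --data /data/cache/ml-25m \
    --checkpoint_dir /work \
    --threshold 0.9
```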
If you don't edit the training script, it will fail because of ssh keys. Shall we make the beginning of the script more newbie-friendly?
Giving instructions on what must be edited, for example?
Failing without running any az commands if it is not edited?
Something like
#!/bin/bash
# Entries you must edit or this script will not work for you
sshkey="$HOME/.ssh/id_rsa.pub"
# Delete the next two lines once you have edited the entries above
echo "This script needs editing before you can use it."
exit 1
# User configs that you can edit if you like
rg_name="NCF-Tutorial"
vm_name="NCF-Trainer"
location="southcentralus"
vm_size="Standard_NC6s_v2"
admin_user=$USER
# Not necessary to edit below this line
# Azure/docker specific config
work_mount=/work
dataset_mount=/data
In the text we say
'yum is "waiting for a lock". This can occur when azure extensions are still running in the background'
This suggests that the script will fail sometimes for this reason.
Is there any way we can either check that azure extensions are finished or catch the lock error and act accordingly?
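One option is a generic retry wrapper around the yum calls in the provisioning script — a sketch, assuming the lock does clear once the Azure extensions finish:

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: retry <attempts> <delay_seconds> <command...>
retry() {
    local attempts=$1; shift
    local delay=$1; shift
    local n=1
    until "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# e.g. keep trying for ~5 minutes while the extensions hold the yum lock:
# retry 30 10 sudo yum -y install docker
```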
Should we add a script that allows users to run a check on their personal ratings before submitting for training?
The workflow might then be
rating_checker.py
on your local machine to make sure that you did it OK (better to discover this now rather than after you have spent money on GPUs). We might then split the tutorial into 3 sections:
1 You want to benchmark and run the standard model with the minimum of effort. Just edit and run the training script.
2 You want to understand what's going on and want to run through the tutorial to do step 1 manually.
3 You want your own recommendations
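For the rating checker idea above, even a small shell function would catch the common mistakes before any GPU money is spent — a sketch, assuming the ml-25m userId,movieId,rating,timestamp CSV layout (the real checker, whatever we call it, is still to be written):

```shell
# Validate a ratings CSV locally: 4 comma-separated fields per row,
# ratings in the MovieLens 0.5-5.0 range. Returns non-zero on bad rows.
check_ratings() {
    awk -F',' '
        NR == 1 { next }   # skip the header row
        NF != 4 { print "line " NR ": expected 4 fields, got " NF; bad = 1 }
        $3 < 0.5 || $3 > 5.0 { print "line " NR ": rating " $3 " out of range"; bad = 1 }
        END { exit bad }
    ' "$1"
}
```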
Hi Phil
While working through the tutorial, I found myself worrying a lot about the personalised recommendations section. After moving it around a lot and still being unhappy, I wonder if we should remove it from this blog post and maybe have it as a follow-up.
I have three reasons for thinking this.
While it's fun, it doesn't add anything to the main thrust of the tutorial, which is 'Take a model and get it running on Azure'.
While following the tutorial, I found myself working through the section on personalisation on the P100 VM. That is, I am paying for a GPU and yet spending my time editing a text file. Yes I am an idiot. I doubt I'll be the only one.
If we were to do a follow up article that includes the extra material, we could perhaps make it another teachable moment. Essentially, we could say 'You got your training pipeline working on Azure. It uses expensive GPUs and is good.' But now you want to add some data to it. How best to do that? (Would the solution perhaps involve me editing my file on my local machine, uploading to an S3 bucket and then running the model?)
Something to discuss on Friday.
Mike
On running ./deploy_and_run_training.sh, I get a bunch of errors like
./deploy_and_run_training.sh: line 2: $'\r': command not found
./deploy_and_run_training.sh: line 10: $'\r': command not found
./deploy_and_run_training.sh: line 21: $'\r': command not found
Creating VM Instance
====================
./deploy_and_run_training.sh: line 23: $'\r': command not found
Resource Group:
az group create: error: the following arguments are required: --name/--resource-group/-n/-g, --location/-l
usage: az group create [-h] [--verbose] [--debug]
[--output {json,jsonc,table,tsv,yaml,yamlc,none}]
[--query JMESPATH] [--subscription _SUBSCRIPTION]
--name RG_NAME --location LOCATION
[--tags [TAGS [TAGS ...]]] [--managed-by MANAGED_BY]
Fixed with
dos2unix deploy_and_run_training.sh
This is because of the way I have git checkout configured; it's easily fixed but may be encountered by others.
Should we have a Troubleshooting section that mentions this?
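Better than documenting the fix might be preventing the problem: a .gitattributes rule committed to the repo forces LF endings for the scripts at checkout time, whatever the reader's core.autocrlf setting (this is a standard git mechanism, not specific to this repo):

```shell
# Force LF line endings for all shell scripts at checkout,
# regardless of each user's core.autocrlf configuration.
echo '*.sh text eol=lf' >> .gitattributes
```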
How can we write the script so that, in the event of this happening, it fails much sooner? Ideally I'd like it to never get to the az commands. Changing the first line to
#!/bin/bash -e
does the job. Now it fails at the first line
./deploy_and_run_training.sh
./deploy_and_run_training.sh: line 2: $'\r': command not found
but will this lead to unintended consequences? I've been bitten by the -e
flag before.
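An alternative that avoids -e entirely is an explicit check at the top of the script, which fails with a message a newcomer can act on — a sketch:

```shell
#!/bin/bash
# Detect CRLF line endings by grepping a file for a literal carriage return.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Fail fast, before any az command runs, if this script was
# checked out with Windows line endings.
if has_crlf "$0"; then
    echo "Error: $0 has Windows (CRLF) line endings. Fix with: dos2unix $0" >&2
    exit 1
fi
```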
When running the automated script, it asks for my permission to delete the VM instance.
Deleting VM Instance
====================
Are you sure you want to perform this operation? (y/n):
So it's not fire and forget.
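Both `az vm delete` and `az group delete` accept a `--yes` flag to skip the confirmation prompt, which would make the script genuinely fire and forget (using the variable names from the script above):

```shell
# Delete the VM without the interactive (y/n) prompt
az vm delete --resource-group "$rg_name" --name "$vm_name" --yes
```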
In the text we recommend waiting for up to 10 minutes.
In the script we wait for 60 seconds.
Could we reach out to Xavier, for example, on how he deals with this in his end to end scripts?
Maybe there is a way to probe the VM every 30 seconds or so?
I hope for a better solution than 'wait for an indeterminate amount of time'
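In the spirit of probing rather than sleeping, something like this loop could replace the fixed 60-second wait — a sketch, assuming $admin_user and $vm_ip are set by the earlier deployment steps:

```shell
# Poll the VM over ssh every 30 seconds, for up to 10 minutes
# (20 attempts x 30 s). If the loop finishes without a break,
# the VM never came up and the script should treat that as an error.
for attempt in $(seq 1 20); do
    if ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no \
           "$admin_user@$vm_ip" true 2>/dev/null; then
        echo "VM became reachable after $attempt attempt(s)"
        break
    fi
    sleep 30
done
```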
Imagine that I want to use this as a benchmark. I.e. point it at an instance type, launch the training and keep the result.
Could we return the training time to the user as well as the results of the training?
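Returning the training time could be as simple as wrapping the training command with timestamps in the script:

```shell
# Record wall-clock training time and report it alongside the results
start=$(date +%s)
# ... training command goes here ...
end=$(date +%s)
echo "Training took $((end - start)) seconds"
```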