gce-scripts

Scripts to manage VMs on Google Cloud. These scripts help you create a large number of VM instances with GPUs, automatically set up CUDA and Docker on the VMs, and run TensorFlow models.

Workflow for testing and deploying models on Google Cloud VMs

  1. Install Google Cloud SDK https://cloud.google.com/sdk/

  2. Log in to your Google Cloud account in a terminal, and correctly configure the default account, project, and zone.

    gcloud auth login [ACCOUNT]
    # list current configuration, make changes if necessary
    gcloud config list
  3. Clone this repo. Make changes to the scripts to best fit your needs. Source the functions defined in manage_instances.sh into your shell. A branch of the code, cuda-9.2, is available to install CUDA 9.2 on VMs.

    git clone https://github.com/haoyuz/gce-scripts.git
    cd gce-scripts
    . manage_instances.sh 
  4. Create your first VM instance with the correct configuration. You can set the number of vCPUs and GPUs, as well as the amount of memory in GB. By default the script installs Ubuntu 16.04 on the VMs. The VM startup script is launched automatically on the VM to install CUDA and Docker; it takes about 5 minutes to finish, after which you should restart the VM.

    # Create a VM with 1 NVIDIA V100 GPU (can have up to 12 vCPU and 78 GB memory)
    create_custom_instance <instance-name> 12 1 64
    # Create a VM with 8 NVIDIA V100 GPUs
    # create_custom_instance <instance-name> 96 8 512
    
    # Wait 5 minutes for VM startup script to finish...
    
    # Restart the instance after VM startup script finishes
    stop_instance <instance-name>
    start_instance <instance-name>
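The authoritative helper lives in manage_instances.sh; as a rough, hypothetical sketch of the kind of gcloud command create_custom_instance wraps (the `_sketch` name, the flag values, and the startup-script filename are assumptions, so check the script for the real version):

```shell
# Hypothetical dry-run sketch of the gcloud call behind create_custom_instance.
# Arguments: instance name, vCPU count, GPU count, memory in GB.
# Drop the leading "echo" to actually create the instance.
create_custom_instance_sketch() {
  local name="$1" cpus="$2" gpus="$3" mem_gb="$4"
  echo gcloud compute instances create "$name" \
    --custom-cpu="$cpus" \
    --custom-memory="${mem_gb}GB" \
    --accelerator="type=nvidia-tesla-v100,count=${gpus}" \
    --maintenance-policy=TERMINATE \
    --image-family=ubuntu-1604-lts \
    --image-project=ubuntu-os-cloud \
    --metadata-from-file="startup-script=vm_startup_script.sh"
}

# Prints the full command for a 12-vCPU, 1-GPU, 64 GB VM.
create_custom_instance_sketch my-vm 12 1 64
```

GPU instances require `--maintenance-policy=TERMINATE`, since VMs with attached accelerators cannot be live-migrated.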
  5. The first time, you want to create a separate data disk, attach it to the VM, and preprocess all data on the disk (for example, convert it to TFRecord files). You can later attach the disk to multiple VM instances, or clone the data disk (from a data disk template).

    # Create a separate data disk of desired size.
    create_data_disk <data-disk-name> <size-in-gb> $DISK_TYPE
    # Attach the data disk to VM (in READ WRITE mode)
    attach_data_disk_rw <instance-name> <data-disk-name>
    
    # Login to VM, mount data disk, and prepare data
    gcloud compute ssh <instance-name>
    
    # ------------------ BASH INSIDE THE VM --------------------------
    # sudo mkdir /data
    # sudo chmod +rw /data
    # sudo mount /dev/sdb /data
    # (download and preprocess the dataset, save to /data)
    # ------------------------- END ----------------------------------
    
    # You can re-attach data disk in READ ONLY mode to prevent later changes to
    # the data. Note that you have to re-mount the disk inside VM. A data disk
    # can be attached to multiple VMs in READ ONLY mode.
    detach_data_disk <instance-name> <data-disk-name>
    attach_data_disk <instance-name> <data-disk-name>  # READ ONLY
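The disk helpers are likewise thin wrappers around gcloud. A hypothetical dry-run sketch of the commands they likely issue (the `_sketch` function names and all example values are illustrative, not part of the repo):

```shell
# Dry-run sketches: each function prints the gcloud command it would run.
create_data_disk_sketch() {
  local disk="$1" size_gb="$2" disk_type="$3"   # disk_type: e.g. pd-standard or pd-ssd
  echo gcloud compute disks create "$disk" \
    --size="${size_gb}GB" \
    --type="$disk_type"
}

attach_data_disk_sketch() {
  local instance="$1" disk="$2" mode="$3"       # mode: rw or ro
  echo gcloud compute instances attach-disk "$instance" \
    --disk="$disk" \
    --mode="$mode"
}

create_data_disk_sketch my-data-disk 500 pd-ssd
attach_data_disk_sketch my-vm my-data-disk ro
```

Attaching with `--mode=ro` is what allows the same data disk to be shared by many training VMs at once.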
  6. You can run TensorFlow using nvidia-docker inside the VM.

    # (Bash inside VM)
    sudo chmod a+r /data
    sudo mount -o ro /dev/sdb /data
    
    nvidia-docker run -it -v $HOME:$HOME -v /data:/data tensorflow/tensorflow:nightly-gpu bash

    Example scripts to run an object detection model (SSD) are provided in vm_mlperf_ssd_code_setup.sh.

Workflow for building TensorFlow from source

Follow the same steps 1-4 from the section above.

  1. Build a Docker image from tf-cuda-10.Dockerfile. Optionally, you can upload the image to the Google Cloud container registry. You might want to clone this repo inside the VM first.

    # Build Docker image locally (inside VM)
    docker build -f tf-cuda-10.Dockerfile --tag=tensorflow/tf-cuda-10 .
    
    # Build Docker image and upload to Google Cloud container registry
    docker build -f tf-cuda-10.Dockerfile --tag=gcr.io/google.com/tensorflow-performance/tensorflow/tf-cuda-10:latest .
    gcloud docker -- push gcr.io/google.com/tensorflow-performance/tensorflow/tf-cuda-10:latest
  2. Check out TensorFlow source code.

    cd <path-to-code>
    git clone https://github.com/tensorflow/tensorflow.git
  3. Run Docker image, and build TensorFlow from source.

    nvidia-docker run -it -v <path-to-code>:<path-to-code> -v /tmp:/tmp tensorflow/tf-cuda-10 bash
    
    # (inside Docker)
    cd <path-to-code>/tensorflow
    git checkout <branch-name>
    export TF_NEED_CUDA=1
    export TF_CUDA_VERSION=10.0
    export TF_CUDNN_VERSION=7
    export TF_NCCL_VERSION=2
    yes "" | $PYTHON_BIN_PATH configure.py
    bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
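After the bazel build finishes, the standard TensorFlow packaging steps assemble and install the wheel. They are printed here as a dry run (the output directory /tmp/tensorflow_pkg is an arbitrary choice); remove the "echo"s to run them for real inside the container:

```shell
# Dry run of the standard TensorFlow packaging steps after "bazel build".
package_tf_wheel() {
  # Assemble the pip package from the bazel artifacts into /tmp/tensorflow_pkg.
  echo "bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg"
  # Install the freshly built wheel.
  echo "pip install /tmp/tensorflow_pkg/tensorflow-*.whl"
}

package_tf_wheel
```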
