GKE node autoscheduler POC

This repository contains a minimal proof-of-concept (POC) that demonstrates scheduled node-pool scaling using GKE node auto-provisioning and Kubernetes CronJobs.

Overview

This POC scales the GPU node pool down to zero nodes at 5pm (NZST) and brings it back to its normal state (2 nodes, with GPU accelerators) at 8am (NZST). The goal is to save cost by shutting down the GPU nodes after hours.

The same result can be achieved with scaling schedules [1], but this solution is more Kubernetes-friendly.

Node auto-provisioning

Node auto-provisioning is a GKE feature that provisions nodes and manages node pools automatically, scaling the nodes to meet the resource requirements of the workloads [2].

Kubernetes Cron Job

A CronJob is a Kubernetes-managed schedule that runs a Job repeatedly according to a Cron format string [3][4].

Prerequisites

  • Terragrunt (v0.35.10 and above)
  • kubectl

How to run and evaluate

Create a new GKE cluster

cd ./terragrunt-src/non-prod/dev

terragrunt run-all apply

This command does the following:

  • creates a new GKE cluster without the default node pool
  • creates a new default node pool with 3 nodes
  • creates a GPU node pool with the nvidia-tesla-p4 accelerator in zone australia-southeast1-b
    • the minimum node count is zero
    • the maximum node count is 5

Deploy a demo application

cd ./app-demo

kubectl apply -f deployment.yaml

This command creates a deployment in the cluster with zero replicas by default.
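For orientation, a deployment of this kind might look roughly like the sketch below. It is not the repository's deployment.yaml: the label, the image placeholder and the scratch file are hypothetical, and only the deployment name (api-demo-v3) and the accelerator type (nvidia-tesla-p4) come from this README. The point it illustrates is that the GPU resource limit makes each replica require a GPU node, while zero replicas means no GPU node is required at all.

```shell
set -eu

# Hypothetical scratch file; the repository's real manifest is app-demo/deployment.yaml.
deployment_manifest="$(mktemp)"

cat > "$deployment_manifest" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-demo-v3
spec:
  replicas: 0                      # no pods, so no GPU nodes are required
  selector:
    matchLabels:
      app: api-demo                # hypothetical label
  template:
    metadata:
      labels:
        app: api-demo
    spec:
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-tesla-p4
      containers:
        - name: api-demo
          image: IMAGE_PLACEHOLDER # hypothetical; replace with a real image
          resources:
            limits:
              nvidia.com/gpu: 1    # each replica needs one GPU
EOF

echo "Wrote sketch manifest to $deployment_manifest"
# Optional check, requires kubectl:
#   kubectl apply --dry-run=client -f "$deployment_manifest"
```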

Deploy CronJob

cd ./cronjob

kubectl apply -f cron-job.yaml

This command does the following:

  • creates a new service account with cluster-admin permission
  • creates two cron jobs
    • gpu-service-up-cronjob scales the deployment up to 2 replicas
    • gpu-service-down-cronjob scales the deployment down to zero replicas
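As a rough illustration of the scale-up job described above, the sketch below writes a CronJob manifest to a scratch file. It is an assumption about the general shape, not a copy of cron-job.yaml: the service account name (gpu-scaler), the image placeholder and the scratch file are hypothetical, and the schedule assumes 08:00 NZST expressed in UTC. Clusters older than Kubernetes 1.21 would need batch/v1beta1 instead of batch/v1.

```shell
set -eu

# Hypothetical scratch file; the repository's real manifest is cronjob/cron-job.yaml.
cronjob_manifest="$(mktemp)"

cat > "$cronjob_manifest" <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: gpu-service-up-cronjob
spec:
  schedule: "0 20 * * *"                   # 20:00 UTC = 08:00 NZST (UTC+12)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: gpu-scaler   # hypothetical name
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: IMAGE_WITH_KUBECTL    # hypothetical; any image that ships kubectl
              command:
                - kubectl
                - scale
                - deployment.apps/api-demo-v3
                - --replicas=2
EOF

echo "Wrote sketch manifest to $cronjob_manifest"
# Optional check, requires kubectl:
#   kubectl apply --dry-run=client -f "$cronjob_manifest"
```

The scale-down job would follow the same pattern with its own name, schedule and --replicas=0.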

Evaluate

You can manually scale the application's replica count up and down, which triggers node auto-provisioning to provision new nodes or remove them from the pool.

# scale up
kubectl scale deployment.apps/api-demo-v3 --replicas=2

# scale down
kubectl scale deployment.apps/api-demo-v3 --replicas=0

Screenshots

  1. Before scale up

before-scale-up

  2. During scale up

during-scale-up

  3. After scale up

after-scale-up

  4. Before scale down

before-scale-down

  5. During scale down

during-scale-down

  6. After scale down

after-scale-down

Limitations

  1. No Google Cloud region offers GPUs in all of its zones, so the node pool has to be created in a single zone that supports the chosen accelerator type. https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#gpu_regional_cluster

  2. To enable auto-provisioning with GPUs, NVIDIA's device drivers must be installed on the nodes. https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers

  3. The CRON_TZ=<timezone> prefix is not available until version 1.22, and at the time of writing the latest GKE version is 1.21.5. Schedules therefore have to be written in UTC on the current GKE version.
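Because the schedules have to be written in UTC, the 8am and 5pm NZST times from the overview need converting by hand. The sketch below shows the arithmetic, assuming a fixed NZST offset of UTC+12. The helper name to_utc_hour is made up for this example, and the repository's actual schedules may differ.

```shell
# Convert a local hour (0-23) to a UTC hour, given a fixed UTC offset in hours.
# Usage: to_utc_hour LOCAL_HOUR OFFSET
to_utc_hour() {
  echo $(( ($1 - $2 + 24) % 24 ))
}

NZST_OFFSET=12  # assumption: NZST is UTC+12; during daylight time (NZDT) it is UTC+13

up_hour=$(to_utc_hour 8 "$NZST_OFFSET")     # 08:00 NZST
down_hour=$(to_utc_hour 17 "$NZST_OFFSET")  # 17:00 NZST

echo "scale up:   0 ${up_hour} * * *"
echo "scale down: 0 ${down_hour} * * *"
```

With these inputs the scale-up schedule is 0 20 * * * and the scale-down schedule is 0 5 * * *. Two caveats: 08:00 NZST falls on the previous calendar day in UTC, which matters if the schedule is restricted to certain weekdays, and a fixed offset does not follow daylight saving, so the jobs run one hour later by local clock time during NZDT unless the schedules are adjusted.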

References

  1. https://cloud.google.com/compute/docs/autoscaler/scaling-schedules
  2. https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning
  3. https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
  4. https://en.wikipedia.org/wiki/Cron

License

See the License File.
