hpcpack-acm's Introduction

HPC Azure Cluster Management Service

Introduction

The service enables diagnostics scenarios for HPC clusters in Azure by providing the following features:

  • Diagnostics Jobs

    With predefined diagnostic test definitions, the cluster admin can easily validate the health of an HPC cluster.

  • Clusrun Jobs

    Select a group of nodes and run clusrun: the command is dispatched to the selected nodes, and their output is collected and shown interactively.

  • Heatmap

    The heatmap is a real-time graphical view of a specific metric value of all nodes in the cluster. It provides a vivid way to view the cluster's metrics.
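The Clusrun pattern described above (send one command to a set of nodes, then gather each node's output) can be sketched conceptually as follows. This is not the service's implementation: `run_on_node` is a hypothetical stand-in that only simulates output locally, and the node names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node: str, command: str) -> str:
    """Hypothetical stand-in for remote execution.

    A real dispatcher would contact the node; here the output is simulated.
    """
    return f"[{node}] ran: {command}"


def clusrun(nodes: list[str], command: str) -> dict[str, str]:
    """Dispatch one command to every selected node and collect the outputs."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # Submit one task per node so the nodes are handled concurrently.
        futures = {node: pool.submit(run_on_node, node, command) for node in nodes}
        # Gather each node's output, keyed by node name.
        return {node: future.result() for node, future in futures.items()}


if __name__ == "__main__":
    for output in clusrun(["node-1", "node-2"], "hostname").values():
        print(output)
```

Keying the results by node name keeps each output attributable to its node, which is what an interactive per-node view needs.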

How to deploy

There are two ways to deploy the service.

Deploy from scratch

Deploy to Azure

A cluster is deployed together with the diagnostic services, and the deployer can choose the scheduler, location, cluster size, portal name, etc. This is the easiest way to create an HPC cluster with diagnostics functionality enabled. For detailed usage of the deployment template, please refer to: Azure cluster deployment

Apply to an existing cluster

To enable the diagnostics functionality on an already deployed cluster, follow the steps below:

  1. Create the service alone

    For how to use the template, please refer to: Build HPC ACM Diagnostic service

  2. Register the cluster with the service (you can register multiple clusters with the same service by repeating this step for each cluster)

    Download the script RegisterToAcm.ps1 and run it in an elevated PowerShell window:

    .\RegisterToAcm.ps1 -resourceGroupName theResourceGroupOfYourCluster -acmRgName theResourceGroupOfAcmServices -subscriptionId theSubscriptionId
    

    After the configuration, the VMs register themselves with the HPC ACM services, and you can check the resources section in the portal to see them.
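Because the registration step is repeated once per cluster, it can help to generate the command lines up front. The sketch below only builds the argument lists, using the three parameters shown above; it does not run the script. The resource group names are placeholders, and it assumes all clusters live in the same subscription.

```python
def build_register_command(resource_group: str, acm_rg: str, subscription_id: str) -> list[str]:
    """Return the argument list for one RegisterToAcm.ps1 invocation."""
    return [
        r".\RegisterToAcm.ps1",
        "-resourceGroupName", resource_group,
        "-acmRgName", acm_rg,
        "-subscriptionId", subscription_id,
    ]


def build_commands_for_clusters(cluster_rgs: list[str], acm_rg: str, subscription_id: str) -> list[list[str]]:
    """Build one registration command per cluster resource group."""
    return [build_register_command(rg, acm_rg, subscription_id) for rg in cluster_rgs]


if __name__ == "__main__":
    # Placeholder values; substitute your own resource groups and subscription.
    commands = build_commands_for_clusters(
        ["clusterRgA", "clusterRgB"], "theResourceGroupOfAcmServices", "theSubscriptionId"
    )
    for command in commands:
        print(" ".join(command))
```

Each printed line can then be pasted into the elevated PowerShell window, one cluster at a time.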

Known issues

  • The service only supports Linux for now.
  • The service provides an HTTPS portal with a self-signed certificate, so you need to bypass certificate validation to visit the portal and use the REST API.
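For scripted access to the REST API, bypassing certificate validation might look like the minimal sketch below, which uses only the Python standard library. The URL you pass in is your own portal's address; no endpoint paths are assumed here. Disabling verification removes protection against impersonation, so this is only reasonable against a deployment you control.

```python
import json
import ssl
import urllib.request


def make_unverified_context() -> ssl.SSLContext:
    """Build an SSL context that accepts a self-signed certificate."""
    context = ssl.create_default_context()
    # check_hostname must be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def get_json(url: str, timeout: float = 10.0):
    """GET a URL without certificate validation and parse the JSON body."""
    with urllib.request.urlopen(url, context=make_unverified_context(), timeout=timeout) as response:
        return json.load(response)


# Usage (hypothetical placeholder, replace with your portal's REST URL):
# data = get_json("https://<your-portal-host>/<api-path>")
```

In a browser, the equivalent is accepting the certificate warning when first visiting the portal.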
