
Planetary Computer Hub


This repository contains the configuration and continuous deployment for the Planetary Computer's Hub, a Dask-Gateway enabled JupyterHub deployment focused on supporting scalable geospatial analysis.

For general questions or discussions about the Planetary Computer, use the microsoft/PlanetaryComputer repository.

Overview

See the user documentation for an overview of what is provided.

This deployment is relatively complex and contains a few Microsoft Planetary Computer-specific aspects. Developers or system administrators looking to deploy their own hub should consult the deployment guide; this repository can serve as a concrete example.

There are two main components to the planetary-computer-hub repository:

  1. helm: A wrapper around the daskhub helm chart.
  2. terraform: Terraform code to deploy all the necessary Azure resources and the Hub itself.

Helm

The most interesting pieces are the YAML configuration files. These are used by the Terraform helm-release provider to customize the JupyterHub and Dask Gateway charts (see hub.tf). In addition to these values files, the hub.tf Terraform module passes some Terraform variables through to the chart using set blocks, as sketched below.
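A minimal sketch of how that wiring might look (the resource names, paths, and variables here are illustrative assumptions, not the repository's exact code):

resource "helm_release" "hub" {
  name  = "hub"
  chart = "${path.module}/../../helm/chart"  # the wrapper chart in this repo

  # Static configuration lives in the YAML values files under helm/.
  values = [
    file("${path.module}/../../helm/values.yaml"),
    file("${path.module}/../../helm/profiles.yaml"),
  ]

  # Values that depend on Terraform state or variables are passed via `set`.
  set {
    name  = "daskhub.jupyterhub.proxy.secretToken"
    value = var.jupyterhub_proxy_secret_token
  }
}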

The bulk of the configuration is done in values.yaml. See the inline comments there for documentation on why those values are set.

profiles.yaml configures daskhub.jupyterhub.singleuser.ProfileList. The helm-release provider does not lend itself to setting list values, and we need to get the various image tags from the Terraform configuration, so we place this in its own file to keep things manageable.

jupyterhub_opencensus_monitor.yaml sets daskhub.jupyterhub.hub.extraFiles.jupyterhub_open_census_monitor.stringData to the jupyterhub_opencensus_monitor.py script (see below). We couldn't figure out how to get the helm-release provider working with kubectl's set-file, so we needed to inline the script. There's probably a better way to do this.
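One possible way to do that inlining (an assumption about the approach, not necessarily what hub.tf does) is to treat the values file as a Terraform template with a ${script} placeholder and splice the script in with indent():

locals {
  # Assumes jupyterhub_opencensus_monitor.yaml contains a ${script}
  # placeholder under ...extraFiles...stringData; indent() keeps the
  # spliced Python block valid YAML.
  monitor_values = templatefile("${path.module}/../../helm/jupyterhub_opencensus_monitor.yaml", {
    script = indent(8, file("${path.module}/../../helm/jupyterhub_opencensus_monitor.py"))
  })
}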

Finally, the custom UI elements used by the Hub and additional notebook server configuration are included under helm/chart/files and helm/chart/templates. These are mounted into the pods. See custom UI for more.

Terraform

The terraform directory contains all the deployment code for the Hub. It manages the Azure resources and Helm release.

The Terraform code is split into deployment-specific directories (prod, staging) and a resources directory that contains the configuration shared between the two deployments. To the extent possible, resources should be defined in resources; staging and prod should only contain configuration (e.g. the URL for the hub, or the size of the core VM), as sketched below.
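For example, a deployment directory might contain little more than a module call like this (variable names are illustrative assumptions):

module "resources" {
  source = "../resources"

  # Deployment-specific configuration only; all resources live in ../resources.
  environment  = "staging"
  hub_url      = "<staging-hub-url>"
  core_vm_size = "Standard_E4s_v3"
}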

Additionally, there's a shared directory, which contains the definition for resources that are shared between the two. Currently, this includes a Storage Account and file share for mounting data volumes onto notebook pods. Resources in the shared directory are deployed manually.

acr.tf

This module creates the Azure Container Registry used for Hub images. Its deployment is a bit strange, an artifact of the deployment history and a desire to use the same container registry for both the staging and prod deployments.

These images are available publicly through the Microsoft Container Registry. See https://github.com/microsoft/planetary-computer-containers for more.
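For reference, the core of such a registry definition is small (a sketch; resource names are assumptions, and the actual acr.tf is shaped by the history described above):

resource "azurerm_container_registry" "acr" {
  name                = var.acr_name
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location
  sku                 = "Premium"
}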

aks.tf

This module deploys the Kubernetes cluster using Azure Kubernetes Service.

Most of the configuration is around node pools. We use the default node pool for "core" JupyterHub pods (e.g. the hub pod). We add a user_pool for users, and a cpu_worker_pool for Dask workers (using preemptible nodes).
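A sketch of what those pools look like (sizes, counts, and resource names are assumptions; the labels and taints are the conventional ones that the JupyterHub and Dask Gateway charts' affinities match):

resource "azurerm_kubernetes_cluster_node_pool" "user_pool" {
  name                  = "user"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.hub.id
  vm_size               = "Standard_E8s_v3"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 100

  # Keep user pods (and only user pods) on this pool.
  node_labels = { "hub.jupyter.org/node-purpose" = "user" }
  node_taints = ["hub.jupyter.org_dedicated=user:NoSchedule"]
}

resource "azurerm_kubernetes_cluster_node_pool" "cpu_worker_pool" {
  name                  = "cpuworker"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.hub.id
  vm_size               = "Standard_E8s_v3"
  priority              = "Spot"    # preemptible nodes for Dask workers
  eviction_policy       = "Delete"
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 100

  node_labels = { "k8s.dask.org/node-purpose" = "worker" }
}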

In addition to the node pools configured here, we attach two GPU node pools; see scripts/gpu. We're following this upstream issue to deploy GPU node pools through Terraform.

hub.tf

This uses the helm_release resource to deploy the Hub using our Helm chart. See Helm above for more.

keyvault.tf

We manually place some secrets in an Azure Key Vault. These are accessed in keyvault.tf and used in the deployment. The Azure service principal used by Terraform must have permission to read these secrets.
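Reading one of those secrets looks roughly like this (the vault and secret names match the secrets reference below; the resource names and variable are illustrative):

data "azurerm_key_vault" "secrets" {
  name                = "pc-deploy-secrets"
  resource_group_name = var.secrets_resource_group
}

data "azurerm_key_vault_secret" "proxy_secret_token" {
  name         = "pcc-staging--jupyterhub-proxy-secret-token"
  key_vault_id = data.azurerm_key_vault.secrets.id
}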

logs.tf

This deploys a Log Analytics workspace, a Log Analytics solution, and an Application Insights instance.

outputs.tf

A few Terraform values are used later in the process (e.g. the Kubernetes configuration used to start tests); these are exported in outputs.tf.
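For example, exporting the cluster's kubeconfig so a later step can talk to Kubernetes might look like this (a sketch; the output and resource names are assumptions):

output "kubeconfig" {
  value     = azurerm_kubernetes_cluster.hub.kube_config_raw
  sensitive = true
}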

providers.tf

This sets the versions of the Terraform providers we use.

rg.tf

Creates a Resource Group to contain all the created Azure resources.

variables.tf

Defines the variables that can be controlled by the staging / prod deployments. See the variable descriptions for documentation on what each variable is used for.
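Each variable follows the usual pattern, e.g. (an illustrative definition, not the file's exact contents):

variable "core_vm_size" {
  type        = string
  description = "VM size for the core (default) node pool, where JupyterHub's own pods run."
  default     = "Standard_E4s_v3"
}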

vnet.tf

Creates the Azure Virtual Network used by the Kubernetes Cluster.
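A sketch of the network resources (names and address ranges are assumptions):

resource "azurerm_virtual_network" "vnet" {
  name                = "${var.environment}-vnet"
  resource_group_name = azurerm_resource_group.hub.name
  location            = azurerm_resource_group.hub.location
  address_space       = ["10.0.0.0/8"]
}

resource "azurerm_subnet" "nodes" {
  name                 = "nodes"
  resource_group_name  = azurerm_resource_group.hub.name
  virtual_network_name = azurerm_virtual_network.vnet.name
  address_prefixes     = ["10.1.0.0/16"]
}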

data-volumes.tf

Creates an Azure Storage Account, File share, and Kubernetes Secret for mounting the file share. This is used to mount read-only, static files into all the user pods (e.g. a dataset for a machine learning competition).
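Roughly, the three pieces fit together like this (a sketch with illustrative names; the Secret holds the credentials that an azureFile volume mount expects):

resource "azurerm_storage_account" "userdata" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.hub.name
  location                 = azurerm_resource_group.hub.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_storage_share" "userdata" {
  name                 = "data"
  storage_account_name = azurerm_storage_account.userdata.name
  quota                = 100  # GiB
}

resource "kubernetes_secret" "userdata" {
  metadata {
    name      = "data-volume-credentials"
    namespace = var.hub_namespace
  }

  # The azureFile volume plugin expects exactly these two keys.
  data = {
    azurestorageaccountname = azurerm_storage_account.userdata.name
    azurestorageaccountkey  = azurerm_storage_account.userdata.primary_access_key
  }
}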

Manual Resources

We rely on a few "manual" resources that are created outside of this repository. These include:

  • A storage account and container for Terraform state
  • A keyvault for secrets

The service principal used by Terraform should have access to the resource group containing these manual resources.

Keyvault secrets reference

The following documents the secrets we set in Key Vault. They can be created with:

$ az keyvault secret set --vault-name pc-deploy-secrets --name '<prefix>--<key-name>' --value '<key-value>'

  • pcc-staging--jupyterhub-proxy-secret-token: Sets daskhub.jupyterhub.proxy.secretToken for the staging JupyterHub.
  • pcc-prod--jupyterhub-proxy-secret-token: Sets daskhub.jupyterhub.proxy.secretToken for the prod JupyterHub.
  • pcc--id-client-secret: Sets daskhub.jupyterhub.hub.config.GenericOAuthenticator.client_secret, an OAuth client secret used to communicate with the pc-id OAuth provider.
  • pcc--pc-id-token: Sets daskhub.jupyterhub.hub.extraEnv.PC_ID_TOKEN, an API token for the pc-id application used to look up users, enabling the API Management integration.
  • pcc--azure-client-secret: Sets daskhub.jupyterhub.hub.extraEnv.AZURE_CLIENT_SECRET, a secret key that allows the hub to access Azure resources, enabling the API Management integration.
  • pcc-staging--kbatch-server-api-token: JupyterHub token for the kbatch application in staging.
  • pcc-prod--kbatch-server-api-token: JupyterHub token for the kbatch application in production.
  • pcc--velero-azure-subscription-id: Set in velero_credentials.tpl for backups / migrations.
  • pcc--velero-azure-tenant-id: Set in velero_credentials.tpl for backups / migrations.
  • pcc--velero-azure-client-id: Set in velero_credentials.tpl for backups / migrations.
  • pcc--velero-azure-client-secret: Set in velero_credentials.tpl for backups / migrations.

Continuous deployment

Commits to main are deployed to the staging environment, and tags are deployed to production. The deployment is done through GitHub Actions.

We created a service principal to manage the deployment.

To enable creating network security groups:

$ az role assignment create \
    --role "/subscriptions/<subscription-id>/providers/Microsoft.Authorization/roleDefinitions/4d97b98b-1d4f-4787-a291-c67834d212e7" \
    --assignee "<service-principal-id>" \
    --scope="/subscriptions/<subscription-id>/resourceGroups/MC_pcc-staging-rg_pcc-staging-cluster_westeurope/providers/Microsoft.Network/routeTables/aks-agentpool-27180469-routetable"

Likewise for production (change the resource group name in the scope).

AKS RBAC

The service principal executing Terraform must also have permissions on the Kubernetes cluster:

$ az role assignment create \
    --role "Azure Kubernetes Service RBAC Writer" \
    --scope "/subscriptions/$ARM_SUBSCRIPTION_ID/resourceGroups/pcc-staging-2-rg/providers/Microsoft.ContainerService/managedClusters/pcc-staging-2-cluster" \
    --assignee $ARM_CLIENT_ID

Velero backup configuration

The Terraform deployment also installs Velero on the cluster via Helm. See velero.tf.

This requires the manual creation of some resources.
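A sketch of what that Helm release might look like, with the credentials file rendered from velero_credentials.tpl using the pcc--velero-* secrets listed above (the chart location, template variables, and value paths are assumptions based on the upstream Velero chart, not necessarily what velero.tf does):

resource "helm_release" "velero" {
  name       = "velero"
  namespace  = "velero"
  repository = "https://vmware-tanzu.github.io/helm-charts"
  chart      = "velero"

  values = [
    yamlencode({
      credentials = {
        secretContents = {
          # Render the Azure credentials file from the .tpl using the
          # secrets read from Key Vault.
          cloud = templatefile("${path.module}/velero_credentials.tpl", {
            subscription_id = data.azurerm_key_vault_secret.velero_subscription_id.value
            tenant_id       = data.azurerm_key_vault_secret.velero_tenant_id.value
            client_id       = data.azurerm_key_vault_secret.velero_client_id.value
            client_secret   = data.azurerm_key_vault_secret.velero_client_secret.value
          })
        }
      }
    })
  ]
}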

Opencensus monitor service

The jupyterhub_opencensus_monitor.py module is deployed as a JupyterHub service. It collects usage metrics from the JupyterHub REST API. It would ideally be refactored into a standalone repository: jupyterhub/jupyterhub#3116.

API Management integration

The Planetary Computer API is deployed using API Management. The Hub includes an integration that automatically inserts the logged-in user's subscription key as an environment variable. This is used by libraries like planetary-computer to automatically sign requests. See daskhub.jupyterhub.hub.extraConfig.pre_spawn_hook in values.yaml for where this is done.

Testing

We used the JupyterHub admin panel to create a user for tests, [email protected]. The tests in tests/ start a notebook server for this user and verify that a few common operations work.

ACR Integration

A previous iteration used a common Azure Container Registry for both staging and prod. After splitting them, we need to manually grant the staging cluster access to the ACR:

$ az aks update -n pcc-staging-cluster -g pcc-staging-rg --attach-acr pcccr

Custom UI

We're able to customize the JupyterHub and jupyterlab UIs following the approach outlined in https://discourse.jupyter.org/t/customizing-jupyterhub-on-kubernetes/1769/4.

To test changes to the templates locally, install jupyterhub and run it from the root of the project directory, which includes a jupyterhub_config.py file. Changes to the template files in helm/chart/files/etc/jupyterhub/templates/ can be previewed at localhost:8000.

Additional References

Many of the concepts used here were learned in deployments at the pangeo-cloud-federation and 2i2c pilot hubs. Those might serve as additional references for how to deploy a Hub.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.


planetary-computer-hub's Issues

The Planetary Computer Hub (PyTorch mode) fails to start up

The Planetary Computer Hub (PyTorch mode) could not be opened for two consecutive days. It always reports the error: Spawn failed: Timeout

[screenshot of the spawn failure]

It says that there is not enough CPU, GPU, and memory, so I think it is a problem on the server side. My network is normal, and the hub opened normally two days ago. How can I solve this problem?
Thanks!!

Accessing kernel from vscode does not work - certificate has expired

Steps to reproduce:

  • Follow instructions as per Use VS Code
  • Try to run code on the interactive Jupyter interpreter
  • The following error occurs:
Error 14:08:50: Error loading notebook controllers [FetchError: request to https://pccompute.westeurope.cloudapp.azure.com/compute/user/<USER>/api/kernels?<KERNEL_NUM> failed, reason: certificate has expired
	at ClientRequest.<anonymous> (/home/user/.vscode/extensions/ms-toolsai.jupyter-2021.11.1001550889/out/client/extension.js:16:343322)
	at ClientRequest.emit (events.js:327:22)
	at ClientRequest.EventEmitter.emit (domain.js:467:12)
	at TLSSocket.socketErrorListener (_http_client.js:469:9)
	at TLSSocket.emit (events.js:315:20)
	at TLSSocket.EventEmitter.emit (domain.js:467:12)
	at emitErrorNT (internal/streams/destroy.js:106:8)
	at emitErrorCloseNT (internal/streams/destroy.js:74:3)
	at processTicksAndRejections (internal/process/task_queues.js:80:21)]

The same FetchError and stack trace are then repeated for the subsequent "DataScience Error" and "Failed to find & set preferred controllers" log entries.

Spawn fails after starting normal notebook


Hello,

I recently started using the Planetary Computer and loved it, until I ran into the same issue yesterday and today. I didn't do anything differently than before, but I can't seem to avoid this problem. The final output reads, "Spawn failed: Server ... didn't respond in 30 seconds."

Is there some debugging I can do on my end?

Thanks,
Alex

ERROR: IOPub data rate exceeded.

When I upload a shapefile and print its size, I run into this problem:

IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--ServerApp.iopub_data_rate_limit.
Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)

The shapefile has over 20,000 elements, which I think is beyond the computing power of the Planetary Computer. Instead of splitting the shapefile before uploading it, I tried to change the config variable mentioned in the error. However, when I run "jupyter lab --NotebookApp.iopub_data_rate_limit=1.0e10" in the terminal, I run into another problem:

[screenshot of the error]

So how can I change the config variable? Thanks a lot!
