
microk8s-core-addons's Introduction

microk8s-addons

This repository contains the core addons that ship along with MicroK8s.

Directory structure

addons.yaml         Authoritative list of addons included in this repository. See format below.
addons/
    <addon1>/
        enable      Executable script that runs when enabling the addon
        disable     Executable script that runs when disabling the addon
    <addon2>/
        enable
        disable
    ...
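
For illustration, a minimal enable script for a hypothetical addon1 could look like the sketch below. This is a hedged sketch rather than a script from this repository; the only detail taken from elsewhere on this page is the kubectl wrapper path $SNAP/microk8s-kubectl.wrapper, and the manifest file name is made up.

#!/usr/bin/env bash
# Hypothetical sketch of addons/addon1/enable (not taken from this repository).
set -eu

KUBECTL="$SNAP/microk8s-kubectl.wrapper"

echo "Enabling addon1"
# Apply the manifest shipped next to this script (hypothetical file name).
"$KUBECTL" apply -f "$(dirname "$0")/addon1.yaml"
echo "addon1 is enabled"

The matching disable script would typically run the inverse, for example "$KUBECTL" delete -f on the same manifest.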

addons.yaml format

microk8s-addons:
  # A short description for the addons in this repository.
  description: Core addons of the MicroK8s project

  # Revision number. Increment when there are important changes.
  revision: 1

  # List of addons.
  addons:
    - name: addon1
      description: My awesome addon

      # Addon version.
      version: "1.0.0"

      # Test to check that addon has been enabled. This may be:
      # - A path to a file. For example, "${SNAP_DATA}/var/lock/myaddon.enabled"
      # - A Kubernetes resource, in the form `resourceType/resourceName`, just
      #   as it would appear in the output of the `kubectl get all -A` command.
      #   For example, "deployment.apps/registry".
      #
      # The addon is assumed to be enabled when the specified file or Kubernetes
      # resource exists.
      check_status: "deployment.apps/addon1"

      # List of architectures supported by this addon.
      # MicroK8s supports "amd64", "arm64" and "s390x".
      supported_architectures:
        - amd64
        - arm64
        - s390x

    - name: addon2
      description: My second awesome addon, supported for amd64 only
      version: "1.0.0"
      check_status: "pod/addon2"
      supported_architectures:
        - amd64
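
As an illustration of the check_status semantics described above (a rough sketch, not the actual MicroK8s implementation), the status check amounts to something like:

# Hypothetical sketch: an addon counts as enabled when its lock file exists...
if [ -f "${SNAP_DATA}/var/lock/myaddon.enabled" ]; then
    echo "addon enabled (lock file exists)"
fi

# ...or when the named resource appears in the `kubectl get all -A` output:
if microk8s kubectl get all -A | grep -q "deployment.apps/addon1"; then
    echo "addon1 enabled (resource exists)"
fi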

Adding new addons

See HACKING.md for instructions on how to develop custom MicroK8s addons.

microk8s-core-addons's People

Contributors

alexsjones, balchua, berkayoz, bschimke95, damowerko, dependabot[bot], dud225, howto-kubernetes-info, jadams, ktsakalozos, lferran, mrroundrobin, neoaggelos, nobuto-m, orzelius, overtfuture, plomosits, sachinkumarsingh092, sudeephb


microk8s-core-addons's Issues

Mayastor: disable / enable / disable failure

After running enable, disable, enable, disable, the following error was listed:

❯ microk8s disable mayastor
Infer repository core for addon mayastor
Error from server (NotFound): Unable to list "openebs.io/v1alpha1, Resource=mayastorpools": the server could not find the requested resource (get mayastorpools.openebs.io)
Traceback (most recent call last):
  File "/var/snap/microk8s/common/addons/core/addons/mayastor/disable", line 48, in <module>
    main()
  File "/snap/microk8s/3065/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/3065/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/3065/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/3065/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/var/snap/microk8s/common/addons/core/addons/mayastor/disable", line 23, in main
    [KUBECTL, "get", "msp", "-n", "mayastor", "-o", "json"],
  File "/snap/microk8s/3065/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/snap/microk8s/3065/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/snap/microk8s/3065/microk8s-kubectl.wrapper', 'get', 'msp', '-n', 'mayastor', '-o', 'json']' returned non-zero exit status 1.
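
The failing call is the kubectl get msp subprocess invocation shown in the traceback; the mayastorpools CRD has already been removed, so the command exits non-zero. A possible way for the disable script to tolerate this is sketched below (a hypothetical sketch, not the repository's actual fix):

# Hypothetical sketch: tolerate a missing mayastorpools (msp) CRD on disable.
import subprocess

def get_msp_json(kubectl):
    try:
        return subprocess.check_output(
            [kubectl, "get", "msp", "-n", "mayastor", "-o", "json"],
            stderr=subprocess.DEVNULL,
        )
    except subprocess.CalledProcessError:
        # The CRD is already gone (e.g. after a previous disable); nothing to clean up.
        return None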


Observability addon configuration

Hi, I have just installed the observability addon. I have searched through the deployments and config maps and tried to find a way, but failed, so let me ask: how do I supply custom Prometheus configuration (alert.rules, prometheus.yml) and Alertmanager rules?
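
(For reference: the observability addon appears to deploy the kube-prometheus stack, so custom alert rules are normally supplied as PrometheusRule resources rather than by editing prometheus.yml directly. The sketch below is a hedged example; the namespace and the release label are assumptions and may differ on a given installation.)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-rules              # hypothetical name
  namespace: observability        # assumption: the addon's namespace
  labels:
    release: kube-prom-stack      # assumption: label matched by the Prometheus rule selector
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: NodeHighLoad     # hypothetical example alert
          expr: node_load5 > 10
          for: 5m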

Also, please note that this addon is not mentioned on https://microk8s.io/docs/addons (there is a Prometheus addon listed [obsolete], but its URL is a 404 anyway).

I wonder whether deploying my own kube-prometheus stack on top would fix it, just to override things, but that seems like overkill IMO; perhaps something in MicroK8s itself can be overridden?

Thanks.

'mayastor-pools' is not a valid MicroK8s subcommand.

Summary

When I try microk8s mayastor-pools --help, I get: 'mayastor-pools' is not a valid MicroK8s subcommand.

What Should Happen Instead?

The mayastor-pools command should be available and print its help output.

Reproduction Steps

  1. Enable mayastor: sudo microk8s enable core/mayastor --default-pool-size 20G
  2. microk8s mayastor-pools --help

Please help. I'm using Red Hat Enterprise Linux release 8.8 (Ootpa),
and snap --version reports:
snap 2.58.3-1.el8
snapd 2.58.3-1.el8
series 16
rhel 8.8
kernel 4.18.0-477.27.1.el8_8.x86_64

Problems enabling GPU - workaround included

Summary

The default GPU/NVIDIA addon does not find the correct drivers, so containers crash.

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

What Should Happen Instead?

Everything should work after enabling the GPU addon:
microk8s enable nvidia

Reproduction Steps

microk8s enable nvidia

Infer repository core for addon nvidia
Addon core/dns is already enabled
Addon core/helm3 is already enabled
WARNING: --set-as-default-runtime is deprecated, please use --gpu-operator-toolkit-version instead
"nvidia" has been added to your repositories
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈Happy Helming!⎈
Deploy NVIDIA GPU operator
Using auto GPU driver
W1222 14:39:49.104108 1716891 warnings.go:70] unknown field "spec.daemonsets.rollingUpdate"
W1222 14:39:49.104132 1716891 warnings.go:70] unknown field "spec.daemonsets.updateStrategy"
NAME: gpu-operator
LAST DEPLOYED: Fri Dec 22 14:39:47 2023
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
Deployed NVIDIA GPU operator

microk8s kubectl get pods --namespace gpu-operator-resources

NAME                                                          READY   STATUS                  RESTARTS      AGE
gpu-operator-node-feature-discovery-worker-ldvbf              1/1     Running                 0             4m42s
gpu-operator-559f7cd69b-7cqhm                                 1/1     Running                 0             4m42s
gpu-operator-node-feature-discovery-master-5bfbc54c8d-hppfr   1/1     Running                 0             4m42s
gpu-feature-discovery-zrp99                                   0/1     Init:CrashLoopBackOff   5 (91s ago)   4m21s
nvidia-operator-validator-hxfbf                               0/1     Init:CrashLoopBackOff   5 (89s ago)   4m22s
nvidia-device-plugin-daemonset-xmvvr                          0/1     Init:CrashLoopBackOff   5 (85s ago)   4m22s
nvidia-container-toolkit-daemonset-shdrn                      0/1     Init:CrashLoopBackOff   5 (80s ago)   4m22s
nvidia-dcgm-exporter-96gmz                                    0/1     Init:CrashLoopBackOff   5 (77s ago)   4m21s

microk8s kubectl describe pod nvidia-operator-validator-hxfbf -n gpu-operator-resources

Name:                 nvidia-operator-validator-hxfbf
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Runtime Class Name:   nvidia
Service Account:      nvidia-operator-validator
Node:                 gpu01/132.176.10.80
Start Time:           Fri, 22 Dec 2023 14:40:09 +0100
Labels:               app=nvidia-operator-validator
                      app.kubernetes.io/part-of=gpu-operator
                      controller-revision-hash=6bd5fd4488
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/containerID: b921d851a8c76ad40b2f18e285c2b61d7f7300fd471f8ac751ca401bf9a32ded
                      cni.projectcalico.org/podIP: 10.1.69.188/32
                      cni.projectcalico.org/podIPs: 10.1.69.188/32
Status:               Pending
IP:                   10.1.69.188
IPs:
  IP:           10.1.69.188
Controlled By:  DaemonSet/nvidia-operator-validator
Init Containers:
  driver-validation:
    Container ID:  containerd://2d9d39b1bbf489f5fc99c451a463935d8f63d5faddefac4305f7c849710eb7a5
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:18c9ea88ae06d479e6657b8a4126a8ee3f4300a40c16ddc29fb7ab3763d46005
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    StartError
      Message:   failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
      Exit Code:    128
      Started:      Thu, 01 Jan 1970 01:00:00 +0100
      Finished:     Fri, 22 Dec 2023 14:45:44 +0100
    Ready:          False
    Restart Count:  6
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /host from host-root (ro)
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
  toolkit-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:  false
      COMPONENT:  toolkit
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
  cuda-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:                    false
      COMPONENT:                    cuda
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
  plugin-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      COMPONENT:                    plugin
      WITH_WAIT:                    false
      WITH_WORKLOAD:                true
      MIG_STRATEGY:                 single
      NODE_NAME:                     (v1:spec.nodeName)
      OPERATOR_NAMESPACE:           gpu-operator-resources (v1:metadata.namespace)
      VALIDATOR_IMAGE:              nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
      VALIDATOR_IMAGE_PULL_POLICY:  IfNotPresent
      VALIDATOR_RUNTIME_CLASS:      nvidia
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
Containers:
  nvidia-operator-validator:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v22.9.1
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      echo all validations are successful; sleep infinity
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8xhcc (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  kube-api-access-8xhcc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.operator-validator=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
Type     Reason   Age                    From     Message
----     ------   ----                   ----     -------
Warning  BackOff  2m57s (x26 over 8m3s)  kubelet  Back-off restarting failed container driver-validation in pod nvidia-operator-validator-hxfbf_gpu-operator-resources(97c4f528-a16c-476b-a696-3c70cf6ed271)

nvidia-smi


Thu Dec 21 16:55:31 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               On  | 00000000:01:00.0 Off |                  Off |
| 30%   28C    P8               7W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               On  | 00000000:25:00.0 Off |                  Off |
| 30%   29C    P8               6W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A5000               On  | 00000000:41:00.0 Off |                  Off |
| 30%   28C    P8               8W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A5000               On  | 00000000:61:00.0 Off |                  Off |
| 30%   28C    P8               5W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A5000               On  | 00000000:81:00.0 Off |                  Off |
| 30%   27C    P8               9W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A5000               On  | 00000000:C1:00.0 Off |                  Off |
| 30%   27C    P8               7W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A5000               On  | 00000000:C4:00.0 Off |                  Off |
| 30%   27C    P8               2W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A5000               On  | 00000000:E1:00.0 Off |                  Off |
| 30%   27C    P8               6W / 230W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

ls -la /run/nvidia/driver

total 0
drwxr-xr-x 2 root root 40 Dez 21 17:26 .
drwxr-xr-x 4 root root 80 Dez 21 17:26 ..

cat /etc/docker/daemon.json

{
    "insecure-registries": ["localhost:32000"],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

accept-nvidia-visible-devices-as-volume-mounts = false
accept-nvidia-visible-devices-envvar-when-unprivileged = true
disable-require = false

[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"

[nvidia-container-runtime]
  log-level = "info"
  mode = "auto"
  runtimes = ["docker-runc", "runc"]

  [nvidia-container-runtime.modes]

    [nvidia-container-runtime.modes.csv]
      mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

microk8s inspect

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy asnycio usage and limits to the final report tarball
  Copy inotify max_user_instances and max_user_watches to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite


Building the report tarball
  Report tarball is at /var/snap/microk8s/6089/inspection-report-20231221_170521.tar.gz

microk8s kubectl describe clusterpolicies --all-namespaces

Name:         cluster-policy
Namespace:
Labels:       app.kubernetes.io/component=gpu-operator
              app.kubernetes.io/managed-by=Helm
Annotations:  meta.helm.sh/release-name: gpu-operator
              meta.helm.sh/release-namespace: gpu-operator-resources
API Version:  nvidia.com/v1
Kind:         ClusterPolicy
Metadata:
  Creation Timestamp:  2023-12-21T16:16:44Z
  Generation:          1
  Resource Version:    105635519
  UID:                 e20bbaad-bdaf-4c87-86dd-b2fcc3d8f88f
Spec:
  Daemonsets:
    Priority Class Name:  system-node-critical
    Tolerations:
      Effect:    NoSchedule
      Key:       nvidia.com/gpu
      Operator:  Exists
  Dcgm:
    Enabled:            false
    Host Port:          5555
    Image:              dcgm
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia/cloud-native
    Version:            3.1.3-1-ubuntu20.04
  Dcgm Exporter:
    Enabled:  true
    Env:
      Name:             DCGM_EXPORTER_LISTEN
      Value:            :9400
      Name:             DCGM_EXPORTER_KUBERNETES
      Value:            true
      Name:             DCGM_EXPORTER_COLLECTORS
      Value:            /etc/dcgm-exporter/dcp-metrics-included.csv
    Image:              dcgm-exporter
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia/k8s
    Service Monitor:
      Additional Labels:
      Enabled:       false
      Honor Labels:  false
      Interval:      15s
    Version:         3.1.3-3.1.2-ubuntu20.04
  Device Plugin:
    Enabled:  true
    Env:
      Name:             PASS_DEVICE_SPECS
      Value:            true
      Name:             FAIL_ON_INIT_ERROR
      Value:            true
      Name:             DEVICE_LIST_STRATEGY
      Value:            envvar
      Name:             DEVICE_ID_STRATEGY
      Value:            uuid
      Name:             NVIDIA_VISIBLE_DEVICES
      Value:            all
      Name:             NVIDIA_DRIVER_CAPABILITIES
      Value:            all
    Image:              k8s-device-plugin
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia
    Version:            v0.13.0-ubi8
  Driver:
    Cert Config:
      Name:
    Enabled:            false
    Image:              driver
    Image Pull Policy:  IfNotPresent
    Kernel Module Config:
      Name:
    Licensing Config:
      Config Map Name:
      Nls Enabled:      false
    Manager:
      Env:
        Name:             ENABLE_GPU_POD_EVICTION
        Value:            true
        Name:             ENABLE_AUTO_DRAIN
        Value:            true
        Name:             DRAIN_USE_FORCE
        Value:            false
        Name:             DRAIN_POD_SELECTOR_LABEL
        Value:
        Name:             DRAIN_TIMEOUT_SECONDS
        Value:            0s
        Name:             DRAIN_DELETE_EMPTYDIR_DATA
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia/cloud-native
      Version:            v0.5.1
    Rdma:
      Enabled:         false
      Use Host Mofed:  false
    Repo Config:
      Config Map Name:
    Repository:         nvcr.io/nvidia
    Version:            525.60.13
    Virtual Topology:
      Config:
  Gfd:
    Enabled:  true
    Env:
      Name:             GFD_SLEEP_INTERVAL
      Value:            60s
      Name:             GFD_FAIL_ON_INIT_ERROR
      Value:            true
    Image:              gpu-feature-discovery
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia
    Version:            v0.7.0-ubi8
  Mig:
    Strategy:  single
  Mig Manager:
    Config:
      Name:
    Enabled:  true
    Env:
      Name:   WITH_REBOOT
      Value:  false
    Gpu Clients Config:
      Name:
    Image:              k8s-mig-manager
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia/cloud-native
    Version:            v0.5.0-ubuntu20.04
  Node Status Exporter:
    Enabled:            false
    Image:              gpu-operator-validator
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia/cloud-native
    Version:            v22.9.1
  Operator:
    Default Runtime:  containerd
    Init Container:
      Image:              cuda
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia
      Version:            11.8.0-base-ubi8
    Runtime Class:        nvidia
  Psp:
    Enabled:  false
  Sandbox Device Plugin:
    Enabled:            true
    Image:              kubevirt-gpu-device-plugin
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia
    Version:            v1.2.1
  Sandbox Workloads:
    Default Workload:  container
    Enabled:           false
  Toolkit:
    Enabled:  true
    Env:
      Name:             CONTAINERD_CONFIG
      Value:            /var/snap/microk8s/current/args/containerd-template.toml
      Name:             CONTAINERD_SOCKET
      Value:            /var/snap/microk8s/common/run/containerd.sock
      Name:             CONTAINERD_SET_AS_DEFAULT
      Value:            0
    Image:              container-toolkit
    Image Pull Policy:  IfNotPresent
    Install Dir:        /usr/local/nvidia
    Repository:         nvcr.io/nvidia/k8s
    Version:            v1.11.0-ubuntu20.04
  Validator:
    Image:              gpu-operator-validator
    Image Pull Policy:  IfNotPresent
    Plugin:
      Env:
        Name:    WITH_WORKLOAD
        Value:   true
    Repository:  nvcr.io/nvidia/cloud-native
    Version:     v22.9.1
  Vfio Manager:
    Driver Manager:
      Env:
        Name:             ENABLE_AUTO_DRAIN
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia/cloud-native
      Version:            v0.5.1
    Enabled:              true
    Image:                cuda
    Image Pull Policy:    IfNotPresent
    Repository:           nvcr.io/nvidia
    Version:              11.7.1-base-ubi8
  Vgpu Device Manager:
    Config:
      Default:          default
      Name:
    Enabled:            true
    Image:              vgpu-device-manager
    Image Pull Policy:  IfNotPresent
    Repository:         nvcr.io/nvidia/cloud-native
    Version:            v0.2.0
  Vgpu Manager:
    Driver Manager:
      Env:
        Name:             ENABLE_AUTO_DRAIN
        Value:            false
      Image:              k8s-driver-manager
      Image Pull Policy:  IfNotPresent
      Repository:         nvcr.io/nvidia/cloud-native
      Version:            v0.5.1
    Enabled:              false
    Image:                vgpu-manager
    Image Pull Policy:    IfNotPresent
Events:                   <none>

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown

Can you suggest a fix?

Change the values in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml:

root = "/run/nvidia/driver"

to

root = "/"

For /usr/local/nvidia/toolkit/nvidia-container-runtime, added:

"runtimes": {
    "nvidia": {
        "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
        "runtimeArgs": []
    }
}

Added symlink:

ln -s /sbin /run/nvidia/driver/sbin

Restart MicroK8s:

microk8s stop
microk8s start

Then all containers start up correctly!

Best regards!

EDIT:
Found the following upstream issue describing the same problem:
NVIDIA/gpu-operator#511

[metallb] microk8s status shows addon as enabled even if addon fails during enable

Summary

The command microk8s enable metallb 10.20.20.1/29 failed with the error timed out waiting for ....
But microk8s status shows metallb as enabled.

Eventually the MetalLB pods become active, but the IPAddressPool is never created [1].

[1] https://github.com/canonical/microk8s-core-addons/blob/main/addons/metallb/enable#L63

What Should Happen Instead?

Expected behaviour:
a. microk8s status shows metallb as disabled, OR
b. the IPAddressPools are created before waiting for the deployment to become active

Reproduction Steps

The problem may be observed on machines where pulling the images takes a long time.

  1. microk8s enable metallb 10.20.20.1/29
  2. microk8s status

Introspection Report

Not required.

Can you suggest a fix?

See "What Should Happen Instead?" above.

Are you interested in contributing with a fix?

Yes, once the way forward is decided.

Mayastor: unable to start `mayastor` data plane

When I enabled the mayastor addon, most of the pods came up, except for the mayastor pod. It crash-loops with this error (I turned on debug logging):

[2022-04-10T05:56:58.466044674+00:00  INFO mayastor:mayastor.rs:94] free_pages: 1024 nr_pages: 1024
[2022-04-10T05:56:58.466203487+00:00  INFO mayastor:mayastor.rs:133] Starting Mayastor version: v1.0.0-119-ge5475575ea3e
[2022-04-10T05:56:58.466318231+00:00  INFO mayastor:mayastor.rs:134] kernel io_uring support: yes
[2022-04-10T05:56:58.466337877+00:00  INFO mayastor:mayastor.rs:138] kernel nvme initiator multipath support: yes
[2022-04-10T05:56:58.466380212+00:00  INFO mayastor::core::env:env.rs:600] loading mayastor config YAML file /var/local/mayastor/config.yaml
[2022-04-10T05:56:58.466398028+00:00 DEBUG mayastor::subsys::config:mod.rs:154] loading configuration file from /var/local/mayastor/config.yaml
[2022-04-10T05:56:58.466418263+00:00  INFO mayastor::subsys::config:mod.rs:168] Config file /var/local/mayastor/config.yaml is empty, reverting to default config
[2022-04-10T05:56:58.466439905+00:00  INFO mayastor::subsys::config::opts:opts.rs:155] Overriding NVMF_TCP_MAX_QUEUE_DEPTH value to '32'
[2022-04-10T05:56:58.466462485+00:00  INFO mayastor::subsys::config:mod.rs:216] Applying Mayastor configuration settings
[2022-04-10T05:56:58.466479267+00:00 DEBUG mayastor::subsys::config::opts:opts.rs:259] spdk_bdev_nvme_opts { action_on_timeout: 4, timeout_us: 5000000, timeout_admin_us: 5000000, keep_alive_timeout_ms: 1000, transport_retry_count: 0, arbitration_burst: 0, low_priority_weight: 0, medium_priority_weight: 0, high_priority_weight: 0, nvme_adminq_poll_period_us: 1000, nvme_ioq_poll_period_us: 0, io_queue_requests: 0, delay_cmd_submit: true, bdev_retry_count: 0 }
[2022-04-10T05:56:58.466507936+00:00 DEBUG mayastor::subsys::config:mod.rs:220] Config {
    source: Some(
        "/var/local/mayastor/config.yaml",
    ),
    nvmf_tcp_tgt_conf: NvmfTgtConfig {
        name: "mayastor_target",
        max_namespaces: 110,
        opts: NvmfTcpTransportOpts {
            max_queue_depth: 32,
            max_qpairs_per_ctrl: 32,
            in_capsule_data_size: 4096,
            max_io_size: 131072,
            io_unit_size: 131072,
            max_aq_depth: 32,
            num_shared_buf: 2048,
            buf_cache_size: 64,
            dif_insert_or_strip: false,
            abort_timeout_sec: 1,
            acceptor_poll_rate: 10000,
            zcopy: true,
        },
    },
    nvme_bdev_opts: NvmeBdevOpts {
        action_on_timeout: 4,
        timeout_us: 5000000,
        timeout_admin_us: 5000000,
        keep_alive_timeout_ms: 1000,
        transport_retry_count: 0,
        arbitration_burst: 0,
        low_priority_weight: 0,
        medium_priority_weight: 0,
        high_priority_weight: 0,
        nvme_adminq_poll_period_us: 1000,
        nvme_ioq_poll_period_us: 0,
        io_queue_requests: 0,
        delay_cmd_submit: true,
        bdev_retry_count: 0,
    },
    bdev_opts: BdevOpts {
        bdev_io_pool_size: 65535,
        bdev_io_cache_size: 512,
        small_buf_pool_size: 8191,
        large_buf_pool_size: 1023,
    },
    nexus_opts: NexusOpts {
        nvmf_enable: true,
        nvmf_discovery_enable: true,
        nvmf_nexus_port: 4421,
        nvmf_replica_port: 8420,
    },
}
[2022-04-10T05:56:58.466597007+00:00 DEBUG mayastor::core::env:env.rs:534] EAL arguments ["mayastor", "--no-shconf", "-m 0", "--base-virtaddr=0x200000000000", "--file-prefix=mayastor_pid1", "--huge-unlink", "--log-level=lib.eal:6", "--log-level=lib.cryptodev:5", "--log-level=user1:6", "--match-allocations", "-l 1"]
EAL: No available 1048576 kB hugepages reported
EAL: alloc_pages_on_heap(): couldn't allocate memory due to IOVA exceeding limits of current DMA mask
EAL: alloc_pages_on_heap(): Please try initializing EAL with --iova-mode=pa parameter
EAL: error allocating rte services array
EAL: FATAL: rte_service_init() failed
EAL: rte_service_init() failed
thread 'main' panicked at 'Failed to init EAL', mayastor/src/core/env.rs:543:13
stack backtrace:
   0: std::panicking::begin_panic
   1: mayastor::core::env::MayastorEnvironment::init
   2: mayastor::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I tried restarting MicroK8s as well as rebooting the host whenever I changed the hugepages settings.

Hugepages seem to be OK:

$ grep HugePages /proc/meminfo
AnonHugePages:    100352 kB
ShmemHugePages:   251904 kB
FileHugePages:         0 kB
HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
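
(For reference, the 2 MiB hugepages shown above are usually reserved with a sysctl along these lines; this is general Linux guidance, not MicroK8s- or Mayastor-specific instructions.)

# Reserve 1024 x 2 MiB hugepages for the current boot (general Linux approach).
sudo sysctl vm.nr_hugepages=1024
# Persist the setting across reboots.
echo 'vm.nr_hugepages=1024' | sudo tee -a /etc/sysctl.conf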

There is also this instruction that fails for me.

$ sudo apt install linux-modules-extra-$(uname -r)
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package linux-modules-extra-5.16.11-76051611-generic is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'linux-modules-extra-5.16.11-76051611-generic' has no installation candidate

Do you think this error is the culprit?

Thanks for your help!

Prometheus alerts regarding scheduler and controller-manager

Summary

After enabling the observability add-on, Prometheus raises an alarm regarding the scheduler and the controller-manager.

Why is this important?

The observability add-on should be operational without any post-install configuration, and Prometheus should have no active alarms.

Are you interested in contributing with a fix?

yes

Cert Manager needs to be updated. IngressClassName incompatibility.

I lost a whole day on this. Certificates do not work because the ingress auto-created by cert-manager is not functional. The old annotation kubernetes.io/ingress.class no longer works. According to the docs, the field ingressClassName was added in cert-manager 1.12, so we need to upgrade to at least 1.12 to get rid of the issue.

MicroK8S 1.28.3:
microk8s kubectl cert-manager version

Client Version: util.Version{GitVersion:"v1.12.7", GitCommit:"6d7629ba42b946978e3baaa75348c851f7ef9134", GitTreeState:"", GoVersion:"go1.20.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: &versionchecker.Version{Detected:"v1.8.0", Sources:map[string]string{"crdLabelVersion":"v1.8.0"}}

As we can see, the MicroK8s addon ships cert-manager v1.8.0, which does not have the ingressClassName property, so the HTTP01 check fails.
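
For reference, on cert-manager 1.12 or newer the HTTP01 solver can name the ingress class explicitly. The sketch below is a hedged example; the issuer name, email and ingress class are hypothetical, and the ACME server shown is the Let's Encrypt staging endpoint:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging            # hypothetical
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@example.com           # hypothetical
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx    # field available from cert-manager 1.12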

hostpath-storage can

Summary

What Should Happen Instead?

Reproduction Steps

  1. ...
  2. ...

Introspection Report

Can you suggest a fix?

Are you interested in contributing with a fix?

Mayastor: Not creating data image

After running snap remove microk8s locally and then reinstalling from the edge channel a few times, I noticed the data image was not being recreated. There were no error messages from the enable mayastor command.


No Resource requests for addons

Summary

Hey, I am new to MicroK8s. After deploying a 3-node cluster and enabling cert-manager, metrics-server, ingress and dns, I only found resource requests on CoreDNS, but not on ingress, cert-manager or metrics-server.
Is this by design, or do I have to deploy the addons with extra YAML that specifies resource requests and limits?
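
(A hedged sketch of how requests and limits could be added to an already-enabled addon workload; the kind, name, namespace and values below are placeholders and need to match the actual addon objects:)

# Sketch: set resource requests/limits on an existing addon workload
# (placeholder names; adjust kind, name, namespace and container as needed).
microk8s kubectl -n ingress set resources daemonset nginx-ingress-microk8s-controller \
  --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=512Mi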

Why is this important?

It would be important to manage the resources properly, especially when the cluster uses a lot of addons.

Are you interested in contributing to this feature?

Maybe; I have not done that before.

couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request ERROR

Summary

When metrics-server is enabled, I get the following errors on all microk8s kubectl commands:

E0201 17:31:08.903679 1454925 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0201 17:31:08.911466 1454925 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0201 17:31:08.925994 1454925 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

What Should Happen Instead?

kubectl should respond normally, without these errors.

Reproduction Steps

  1. ...
  2. ...

Introspection Report

Can you suggest a fix?

Add hostNetwork: true to the pod spec of the metrics-server deployment.
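
A hedged sketch of how that could be applied (assuming the deployment is named metrics-server in the kube-system namespace, as in this report):

# Sketch: add hostNetwork: true to the metrics-server pod spec.
microk8s kubectl -n kube-system patch deployment metrics-server \
  --type merge -p '{"spec":{"template":{"spec":{"hostNetwork":true}}}}'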

Are you interested in contributing with a fix?

microk8s enable metallb does not accept ipv6 input

Summary

Name      Version   Rev    Tracking       Publisher   Notes
core20    20230801  2015   latest/stable  canonical✓  base
microk8s  v1.28.2   6085   1.28/stable    canonical✓  classic
snapd     2.60.4    20290  latest/stable  canonical✓  snapd

Enabling the metallb addon with an IPv6 CIDR says it is invalid:
Your input value (2001:db8::252:252:252:1-2001:db8:1:252:252:252:252) is not a valid IP Range

MetalLB works with dual-stack if you supply it an IPv6 range manually:

---
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: custom-addresspool
  namespace: metallb-system
spec:
  addresses:
  - 10.0.1.220-10.0.1.229
  - 2001:db8::252:252:252:1-2001:db8:1:252:252:252:252
k get svc example-service
NAME              TYPE           CLUSTER-IP       EXTERNAL-IP                                 PORT(S)        AGE
example-service   LoadBalancer   10.252.252.160   10.0.1.220,2001:db8::1:252:252:252:1        80:32481/TCP   24m

What Should Happen Instead?

The valid IPv6 range should be accepted.

Reproduction Steps

(I replaced my real IPv6 prefix with 2001:db8)

  1. microk8s enable metallb
  2. include ipv6 in your input like so: 10.0.1.220-10.0.1.229,2001:db8::252:252:252:1-2001:db8:1:252:252:252:252
# microk8s enable metallb
Infer repository core for addon metallb
Enabling MetalLB
Enter each IP address range delimited by comma (e.g. '10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111'): 10.0.1.220-10.0.1.229,2001:db8::252:252:252:1-2001:db8:1:252:252:252:252
Your input value (2001:db8::252:252:252:1-2001:db8:1:252:252:252:252) is not a valid IP Range

Introspection Report

Can you suggest a fix?

https://github.com/canonical/microk8s-core-addons/blob/main/addons/metallb/enable#L33

These regexes only match IPv4. They should also accept IPv6.

Furthermore, I think the microk8s enable metallb:192.168.1.0/24 syntax should no longer use a :, but switch to = to make the syntax easier for IPv6 support.

Are you interested in contributing with a fix?

unfortunately no

Observability Addon Invalid IP Error

Summary

I'm trying to run microk8s enable observability but got the following error:

Error: UPGRADE FAILED: failed to create resource: Endpoints "kube-prom-stack-kube-prome-kube-controller-manager" is invalid: [subsets[0].addresses[0].ip: Invalid value: "192.168.0.11,192.168.0.22,192.168.0.33": must be a valid IP address, (e.g. 10.9.8.7 or 2001:db8::ffff), subsets[0].addresses[0].ip: Invalid value: "192.168.0.11,192.168.0.22,192.168.0.33": must be a valid IP address]

What Should Happen Instead?

Microk8s should install the observability addon without errors.

Reproduction Steps

  1. microk8s enable observability

Introspection Report

inspection-report-20220921_194101.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

Yes.

Can not disable mayastor with microk8s

Summary

I enabled Mayastor, used it with three PVs, and then wanted to disable it again. I removed the PVCs and PVs and executed microk8s disable mayastor --remove-storage. The process gets stuck at:

Infer repository core for addon mayastor
diskpool.openebs.io "microk8s-<host>-pool" deleted

It does not hit a timeout, but stays in this state for hours. Nothing happens. I can only remove Mayastor using the Helm CLI.
The behaviour is the same without --remove-storage.

What Should Happen Instead?

Mayastor should be disabled.

Reproduction Steps

  1. run microk8s enable mayastor
  2. use volumes
  3. remove volumes
  4. run microk8s disable mayastor --remove-storage

Introspection Report

Can you suggest a fix?

Are you interested in contributing with a fix?

metrics-server installed with helm detected as microk8s addon

I'm not sure if this matching is working as intended:


wait_for_pod_state("", "kube-system", "running", label="k8s-app=metrics-server")

Tried this today (1.23/stable, v1.23.5):

sudo snap install microk8s --classic
microk8s.enable helm3

Install metrics-server with helm3:

microk8s.helm3 repo add bitnami https://charts.bitnami.com/bitnami
microk8s.helm3 repo update

microk8s.helm3 install metrics-server bitnami/metrics-server -n kube-system \
  --set extraArgs.kubelet-insecure-tls="" \
  --set extraArgs.kubelet-preferred-address-types=InternalIP \
  --set apiService.create=true

Then suddenly:

microk8s.status
microk8s is running
high-availability: no
addons:
  enabled:
    ...
    metrics-server       # K8s Metrics Server for API access to service metrics

But I installed it with Helm; it is not a MicroK8s addon.

Only a single diskpool created


Similar to #133, only a single diskpool is created in my 3-node cluster (3 x Raspberry Pi 4 running Ubuntu 22.04 Server and MicroK8s 1.27).

kubectl get diskpool -n mayastor
NAME                   NODE     STATUS   CAPACITY      USED   AVAILABLE
microk8s-glados-pool   glados   Online   21449670656   0      21449670656
kubectl get all -n mayastor
NAME                                              READY   STATUS    RESTARTS   AGE
pod/mayastor-io-engine-l4lzm                      0/1     Pending   0          5m23s
pod/mayastor-io-engine-979kg                      0/1     Pending   0          5m23s
pod/mayastor-csi-node-mgtcq                       2/2     Running   0          5m23s
pod/mayastor-csi-node-f8nkw                       2/2     Running   0          5m23s
pod/etcd-operator-mayastor-8574f998bc-clbsf       1/1     Running   0          5m23s
pod/mayastor-csi-node-fzv42                       2/2     Running   0          5m23s
pod/etcd-kp7s728fz9                               1/1     Running   0          5m
pod/mayastor-agent-core-f7ccf485-zxht9            1/1     Running   0          5m23s
pod/mayastor-io-engine-h82xp                      1/1     Running   0          5m23s
pod/mayastor-operator-diskpool-5b4cfb555b-mk5j9   1/1     Running   0          5m23s
pod/etcd-7xr4c62vrp                               1/1     Running   0          4m18s
pod/etcd-p7bfjw7gq6                               1/1     Running   0          4m43s
pod/mayastor-api-rest-bcb58d479-bp5w4             1/1     Running   0          5m23s
pod/mayastor-csi-controller-6b867dd474-7mbls      3/3     Running   0          5m23s

NAME                          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)               AGE
service/mayastor-agent-core   ClusterIP   None             <none>        50051/TCP,50052/TCP   5m24s
service/mayastor-api-rest     ClusterIP   10.152.183.71    <none>        8080/TCP,8081/TCP     5m24s
service/etcd-client           ClusterIP   10.152.183.254   <none>        2379/TCP              5m
service/etcd                  ClusterIP   None             <none>        2379/TCP,2380/TCP     4m59s

NAME                                DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/mayastor-csi-node    3         3         3       3            3           <none>          5m24s
daemonset.apps/mayastor-io-engine   3         3         1       3            1           <none>          5m24s

NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/etcd-operator-mayastor       1/1     1            1           5m24s
deployment.apps/mayastor-agent-core          1/1     1            1           5m24s
deployment.apps/mayastor-operator-diskpool   1/1     1            1           5m24s
deployment.apps/mayastor-api-rest            1/1     1            1           5m24s
deployment.apps/mayastor-csi-controller      1/1     1            1           5m24s

NAME                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/etcd-operator-mayastor-8574f998bc       1         1         1       5m24s
replicaset.apps/mayastor-agent-core-f7ccf485            1         1         1       5m24s
replicaset.apps/mayastor-operator-diskpool-5b4cfb555b   1         1         1       5m24s
replicaset.apps/mayastor-api-rest-bcb58d479             1         1         1       5m23s
replicaset.apps/mayastor-csi-controller-6b867dd474      1         1         1       5m24s

I tried the patch suggested at #133 (comment) but there was no change even after disabling and re-enabling the addon

What Should Happen Instead?

As per the docs:

In a 3-node cluster, the output should look like this:

NAME               NODE   STATUS   CAPACITY      USED   AVAILABLE
microk8s-m2-pool   m2     Online   21449670656   0      21449670656
microk8s-m1-pool   m1     Online   21449670656   0      21449670656
microk8s-m3-pool   m3     Online   21449670656   0      21449670656

Reproduction Steps

Follow the instructions at https://microk8s.io/docs/addon-mayastor

Introspection Report

inspection-report-20230807_145905.tar.gz

Can you suggest a fix?

If only I had a suggestion 😞

Are you interested in contributing with a fix?

Sure, if I knew what to do!

mayastor-io-engine pods stuck on error 403 trying to create diskpools

Summary

On my setup, all 3 mayastor-io-engine pods got stuck in Init:2/3, hitting a 403 error when trying to create diskpools.
I ran the curl command from the init pod manually and got:

# curl --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" -XPOST -d "$BODY" "https://kubernetes.default.svc/apis/openebs.io/v1alpha1/namespaces/$NAMESPACE/diskpools?fieldManager=kubectl-create"
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "diskpools.openebs.io is forbidden: User \"system:serviceaccount:mayastor:default\" cannot create resource \"diskpools\" in API group \"openebs.io\" in the namespace \"mayastor\"",
  "reason": "Forbidden",
  "details": {
    "group": "openebs.io",
    "kind": "diskpools"
  },
  "code": 403
}

After modifying the clusterrolebinding like this:

kubectl edit clusterrolebindings.rbac.authorization.k8s.io mayastor-io-engine-sa-cluster-role-binding
...
# added to subjects:
...
- kind: ServiceAccount
  name: default
  namespace: mayastor

It got unblocked automatically; all 3 io-engine pods are now running and the diskpools were created correctly.

What Should Happen Instead?

It shouldn't get stuck, it should use the right service account and RBAC rules should be correct.

Reproduction Steps

  1. Deploy microk8s 1.27.4 cluster (3 nodes) using microk8s latest/edge charm rev 115.
  2. Enable addons: dns ingress rbac metallb
  3. sudo microk8s addons repo add core --force https://github.com/canonical/microk8s-core-addons --reference 1.27
  4. sudo microk8s enable core/mayastor --default-pool-size 900G
  5. Watch: sudo microk8s kubectl get pods -n mayastor and microk8s kubectl logs -n mayastor mayastor-io-engine-nwxvc initialize-pool

I also tried disabling and enabling mayastor addon but then I ran into the same issue again.

Introspection Report

inspection-report-20230809_074438.tar.gz

Can you suggest a fix?

If the io-engine pods used the mayastor-io-engine-sa service account instead of default, I believe that would fix the issue.
Alternatively, what I did with the clusterrolebinding can also work.
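
A hedged sketch of the first option (assuming the io-engine workload is the mayastor-io-engine daemonset in the mayastor namespace, as shown in the related reports above):

# Sketch: point the io-engine pods at the dedicated service account.
microk8s kubectl -n mayastor patch daemonset mayastor-io-engine \
  --type merge -p '{"spec":{"template":{"spec":{"serviceAccountName":"mayastor-io-engine-sa"}}}}'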

Are you interested in contributing with a fix?

No.

Mayastor: Error in MayastorPool

When running Mayastor in a 64-bit Ubuntu Server VM, I noticed that the MayastorPool goes into an error state during the normal creation sequence:

Installation

snap install microk8s --channel=latest/edge
# Enable various kernel modules
microk8s enable mayastor
msp-operator   Mar 25 08:25:26.877  WARN msp_operator: HTTP response error: error in response: status code '404 Not Found', content: 'RestJsonError { details: "Node 'alexvirt' not found", kind: NotFound }', retry scheduled @Fri, 25 Mar 2022 08:25:31 +0000 (5 seconds from now)
msp-operator     at control-plane/msp-operator/src/main.rs:825
msp-operator     in kube_runtime::controller::reconciling object with object.ref: MayastorPool.v1alpha1.openebs.io/microk8s-alexvirt-pool.mayastor, object.reason: error policy requested retry

msp-operator   Mar 25 08:25:31.892  WARN msp_operator: HTTP response error: error in response: status code '404 Not Found', content: 'RestJsonError { details: "Node 'alexvirt' not found", kind: NotFound }', retry scheduled @Fri, 25 Mar 2022 08:25:36 +0000 (5 seconds from now)
msp-operator     at control-plane/msp-operator/src/main.rs:825
msp-operator     in kube_runtime::controller::reconciling object with object.ref: MayastorPool.v1alpha1.openebs.io/microk8s-alexvirt-pool.mayastor, object.reason: error policy requested retry

msp-operator   Mar 25 08:25:36.933 ERROR msp_operator: status set to error, name: "microk8s-alexvirt-pool"
msp-operator     at control-plane/msp-operator/src/main.rs:395
msp-operator     in msp_operator::create_or_import with name: "microk8s-alexvirt-pool", status: Some(MayastorPoolStatus { state: Creating, capacity: 0, used: 0, available: 0 })
msp-operator     in msp_operator::reconcile with name: alexvirt, status: Some(MayastorPoolStatus { state: Creating, capacity: 0, used: 0, available: 0 })
msp-operator     in kube_runtime::controller::reconciling object with object.ref: MayastorPool.v1alpha1.openebs.io/microk8s-alexvirt-pool.mayastor, object.reason: error policy requested retry

msp-operator   Mar 25 08:25:36.936  INFO msp_operator: new resource_version inserted, name: "microk8s-alexvirt-pool"
msp-operator     at control-plane/msp-operator/src/main.rs:298
msp-operator     in msp_operator::reconcile with name: alexvirt, status: Some(MayastorPoolStatus { state: Error, capacity: 0, used: 0, available: 0 })
msp-operator     in kube_runtime::controller::reconciling object with object.ref: MayastorPool.v1alpha1.openebs.io/microk8s-alexvirt-pool.mayastor, object.reason: object updated

msp-operator   Mar 25 08:25:36.936 ERROR msp_operator: entered error as final state, pool: "microk8s-alexvirt-pool"
msp-operator     at control-plane/msp-operator/src/main.rs:877
msp-operator     in msp_operator::reconcile with name: alexvirt, status: Some(MayastorPoolStatus { state: Error, capacity: 0, used: 0, available: 0 })
msp-operator     in kube_runtime::controller::reconciling object with object.ref: MayastorPool.v1alpha1.openebs.io/microk8s-alexvirt-pool.mayastor, object.reason: object updated

Resolution

To resolve this, I deleted the MSP and recreated it; at that point it was created successfully.

[metallb] Support for adding multiple pools

Summary

Currently the MicroK8s metallb addon supports creating a single pool with multiple CIDRs.
The request is to add support for creating multiple pools.

Possible user command:
microk8s enable metallb::10.64.140.43-10.64.140.49,192.168.0.105-192.168.0.111;:10.80.100.0-10.80.100.10
(pools are delimited by ;)

Why is this important?

Certain application services (for example, OpenStack control plane services) need to be exposed via different underlying networks - internal and public. This can be achieved only by creating different pools in MetalLB, so that the ingress controller can choose to pick IPs from a specific pool.
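
As a reference for what the addon would need to create, an additional pool can already be defined manually with the MetalLB CRDs, similar to the IPAddressPool example earlier on this page (names and addresses below are hypothetical):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: public-pool                # hypothetical
  namespace: metallb-system
spec:
  addresses:
    - 10.80.100.0-10.80.100.10
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: public-pool-l2             # hypothetical
  namespace: metallb-system
spec:
  ipAddressPools:
    - public-pool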

Are you interested in contributing to this feature?

yes

Support secure mode for registry addon

Summary

The registry add-on only works in insecure mode. I would like a secure mode to be supported as well.

Why is this important?

  1. Security risks associated with the insecure registry.
  2. Users should not become accustomed to pushing insecure images around.
  3. Since image names don't support an explicit https or http protocol, libraries sometimes infer the protocol from the image name. https is the default, and if the registry uses http, there can be confusing errors that are difficult to resolve.

Are you interested in contributing to this feature?

No, sorry.

[hostpath-storage] Custom hostpath storage class provisions to wrong path.

Summary

According to the microk8s docs, one should be able to add a storage class to provision persistent volumes to a custom path if running microk8s v1.25 or newer.

I'm currently running v1.26.8 and have deployed the following storage class:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: usb-hostpath
provisioner: microk8s.io/hostpath
reclaimPolicy: Delete
parameters:
  pvDir: /mnt/usb
volumeBindingMode: WaitForFirstConsumer

The path /mnt/usb exists and is available for use.

After deploying the storage class, I tried deploying the following PVC and mounted it to a pod:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-application-storage
  labels:
    app: test-application
    heritage: "Helm"
spec:
  accessModes:
    - "ReadWriteOnce"
  resources:
    requests:
      storage: "16Gi"
  storageClassName: "usb-hostpath"

With this I expected microk8s to create a folder under /mnt/usb for this pvc, but the new directory was created under /var/snap/microk8s/common/default-storage instead.
I checked the logs, but it seems the hostpath-provisioner completely disregards the custom path:

I1109 11:26:41.403116       1 controller.go:1279] provision "default/test-application-pvc" class "usb-hostpath": started
I1109 11:26:41.412522       1 hostpath-provisioner.go:82] creating backing directory: /var/snap/microk8s/common/default-storage/default-test-application-pvc-pvc-d91c49ed-ae9f-4ff2-b076-d9fe249ab4d6
I1109 11:26:41.412637       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-application-pvc", UID:"d91c49ed-ae9f-4ff2-b076-d9fe249ab4d6", APIVersion:"v1", ResourceVersion:"3717740", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-application-pvc"
I1109 11:26:41.413055       1 controller.go:1384] provision "default/test-application-pvc" class "usb-hostpath": volume "pvc-d91c49ed-ae9f-4ff2-b076-d9fe249ab4d6" provisioned
I1109 11:26:41.413098       1 controller.go:1397] provision "default/test-application-pvc" class "usb-hostpath": succeeded

What Should Happen Instead?

The backing directory should be created at path /mnt/usb instead of /var/snap/microk8s/common/default-storage.

Reproduction Steps

On my installation this is consistently reproducible by attempting to use the "usb-hostpath" storage class, but I suspect it may be related to the fact that my microk8s installation was upgraded from v1.23 to v1.26.8. This upgrade may have left some files or configurations that cause the hostpath provisioner to ignore custom pvDir definitions.
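
A hedged way to check whether the running provisioner is new enough to honour pvDir (the deployment name and namespace below are the MicroK8s defaults; adjust if yours differ):

# Show the hostpath-provisioner image actually running; custom pvDir
# support requires the provisioner shipped with v1.25 or newer.
microk8s kubectl -n kube-system get deployment hostpath-provisioner \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'

If the image predates the upgrade, re-enabling the hostpath-storage addon should roll out the current one.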

Introspection Report

inspection-report-20231109_112913.tar.gz

[Cert-manager] Wait for cert-manager to be fully ready

Summary

Make sure that when the cert-manager addon enable script completes, cert-manager is ready to accept certificate requests. What do you think?
Waiting for the pods to be in the Ready state is not enough. We need to try creating resources to make sure cert-manager is fully ready.

Example by applying the yaml below, the user will know that it is ready to serve certificate requests.

apiVersion: v1
kind: Namespace
metadata:
  name: cert-manager-test
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: test-selfsigned
  namespace: cert-manager-test
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: selfsigned-cert
  namespace: cert-manager-test
spec:
  dnsNames:
    - example.com
  secretName: selfsigned-cert-tls
  issuerRef:
    name: test-selfsigned
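
A hedged sketch of how the enable script could block until the test Certificate above is actually issued (cert-manager-test.yaml is a hypothetical file holding the manifest shown here):

# Retry the apply until the cert-manager webhook accepts it, then wait
# for the test Certificate to reach the Ready condition.
until microk8s kubectl apply -f cert-manager-test.yaml; do
  sleep 5
done
microk8s kubectl -n cert-manager-test wait certificate/selfsigned-cert \
  --for=condition=Ready --timeout=120s
# Remove the probe resources once cert-manager has proven itself ready.
microk8s kubectl delete namespace cert-manager-test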

Why is this important?

Some addons require cert-manager to be available and ready, for example Jaeger v1.31+.

Are you interested in contributing to this feature?

yes

What do you think about this?

Enabling the Mayastor plugin somehow crashes local network

Summary

After enabling the plugin, all the devices in the local network, including the nodes, lose access to the internet. I'm not a networking guy, so to me it looks like some dark magic.

Needless to say, the plugin can't properly start up, because it's unable to even pull necessary images.

Probably relevant:

  • The cluster includes 2 nodes.
  • Those nodes live in the same physical network.
  • However, the nodes are wired together via a VPN created with Tailscale.

What Should Happen Instead?

Mayastor should properly start up.

Reproduction Steps

I checked it several times to exclude the possibility of coincidence.

  1. Prepare the nodes as explained here.
  2. microk8s enable core/mayastor --default-pool-size 20G
  3. Almost immediately, all devices in the local network lose access to the internet. (Note: those devices are not wired together via a VPN.) That includes my laptop, my phone, my sister's phone and a smart assistant. So, yeah, the network really does crash somehow.
  4. microk8s disable core/mayastor
  5. The connection gets restored.

Introspection Report

inspection-report-20230420_125254.tar.gz

Are you interested in contributing with a fix?

no

(1.28) MetalLB issues

Summary

In 1.28 we have observed that sometimes services of type LoadBalancer using MetalLB are not accessible from outside the cluster

This is a placeholder PR as this is our current hypothesis about the failures we observe, will update accordingly as we get more context.

What Should Happen Instead?

LoadBalancer services should be accessible outside of the cluster, L2 advertisements should work without issues.

Reproduction Steps

  1. Install MicroK8s 1.28
  2. Enable metallb
  3. juju bootstrap microk8s --config controller-service-type=loadbalancer

Can you suggest a fix?

WIP

Are you interested in contributing with a fix?

cc @marosg42

Mayastor not correctly provisioning requested PVC size

Available attached disk space mounted on /var/snap/microk8s
Screenshot 2022-09-12 at 18 10 27

MicroK8s.img has a 20Gig sparse file for each mayastorpool

image

I have 1 deployment consuming a 1Gi single replica.

When attempting to create a 3 replica 50Gi PVC it fails

image

image

I believe clarification is needed on where this is failing.

Mayastor documentation request: minimal realistic hardware requirements

From testing Mayastor on a kubeadm-created cluster, I have observed that by default the nodes dedicated to storage see memory usage grow by close to 5GB of extra RAM, with at least one CPU core above 90% utilisation most of the time. This is also documented by Mayastor, which even suggests dedicating 2 cores to each storage node.

As with other addons (which I am really enjoying), I would like to ask the team to do some initial testing so that suggested minimal hardware requirements for the addons can be documented for MicroK8s.

As a reminder, Mayastor recommends 4GB of RAM per storage node and at least 1 dedicated core.

Probably with some tweaks MicroK8s could offer Mayastor for smaller deployments, which would be very welcome.

The Mayastor 1.0 release notes also mention that Mayastor will not run with fewer than 3 nodes. Will the same apply to MicroK8s?

Pods using pvcs with mayastor intermittently fail to mount back their volumes

Summary

Pod gets stuck initializing with unable to mount volume

Events:
  Type     Reason              Age    From                     Message
  ----     ------              ----   ----                     -------
  Normal   Scheduled           2m26s  default-scheduler        Successfully assigned cos/loki-0 to microk8s-2
  Warning  FailedAttachVolume  2m26s  attachdetach-controller  Multi-Attach error for volume "pvc-f17ed96a-a729-420d-8586-af31dd7eb212" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedMount         23s    kubelet                  Unable to attach or mount volumes: unmounted volumes=[loki-loki-chunks-4bd159ef], unattached volumes=[loki-loki-chunks-4bd159ef], failed to process volumes=[]: timed out waiting for the condition

What Should Happen Instead?

I would expect the PVC to be moved by Mayastor successfully; other pods in this deployment and their PVCs are OK.

Reproduction Steps

This does not happen immediately after deploying the environment, but has happened a few times

  1. Bootstrap juju 2.9.42 controller to microk8s 1.26/1.27 with mayastor addon
  2. Deploy COS lite
  3. After a while, some or all pods crash this way with pvcs stuck or unstable.

Introspection Report

inspection-report-20230809_103837-sanitized.tar.gz

Can you suggest a fix?

I believe we should try to further debug and check the mayastor addon since this behavior is present in both v1 and mayastor-aio-2.0.0-microk8s-1
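
A hedged starting point for debugging the stuck attachment (pod names vary; take them from the listing below):

# Show which node Kubernetes still believes holds the attachment for
# the affected volume.
microk8s kubectl get volumeattachments | grep pvc-f17ed96a

# List the Mayastor pods, then pull logs from the CSI controller pod
# around the time of the Multi-Attach error.
microk8s kubectl -n mayastor get pods -o wide
microk8s kubectl -n mayastor logs <csi-controller-pod> --tail=200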

Are you interested in contributing with a fix?

No

observability fails to enable on HA cluster

Summary

I am trying to enable the new observability addon.

MicroK8s v1.25.0 revision 3883

microk8s status

microk8s is running
high-availability: yes
  datastore master nodes: 192.168.1.10:19001 192.168.1.11:19001 192.168.1.12:19001
  datastore standby nodes: none
addons:
  enabled:
    dashboard            # (core) The Kubernetes dashboard
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    rbac                 # (core) Role-Based Access Control for authorisation
    storage              # (core) Alias to hostpath-storage add-on, deprecated
  disabled:
    cert-manager         # (core) Cloud native certificate management
    community            # (core) The community addons repository
    gpu                  # (core) Automatic enablement of Nvidia CUDA
    host-access          # (core) Allow Pods connecting to Host services smoothly
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    registry             # (core) Private image registry exposed on localhost:32000

I have a 3 node cluster.

microk8s enable observability

Infer repository core for addon observability
Addon core/dns is already enabled
Addon core/helm3 is already enabled
Addon core/hostpath-storage is already enabled
Enabling observability
"prometheus-community" already exists with the same configuration, skipping
"grafana" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "openebs" chart repository
...Successfully got an update from the "prometheus-community" chart repository
...Successfully got an update from the "grafana" chart repository
Update Complete. ⎈Happy Helming!⎈
Release "kube-prom-stack" does not exist. Installing it now.
Error: Endpoints "kube-prom-stack-kube-prome-kube-scheduler" is invalid: [subsets[0].addresses[0].ip: Invalid value: "192.168.1.10,192.168.1.11,192.168.1.12": must be a valid IP address, (e.g. 10.9.8.7 or 2001:db8::ffff), subsets[0].addresses[0].ip: Invalid value: "192.168.1.10,192.168.1.11,192.168.1.12": must be a valid IP address]
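
For context, the error suggests the node IPs are passed to the chart as a single comma-joined string where the kube-prometheus-stack values expect a list. A hedged sketch of a manual workaround, reusing the release name and namespace from the output above (values keys follow the upstream chart):

# Pass the control-plane node IPs as a proper list instead of one
# comma-joined string; adjust the IPs to your nodes.
microk8s helm3 upgrade --install kube-prom-stack \
  prometheus-community/kube-prometheus-stack \
  --namespace observability --create-namespace \
  --set kubeScheduler.endpoints="{192.168.1.10,192.168.1.11,192.168.1.12}" \
  --set kubeControllerManager.endpoints="{192.168.1.10,192.168.1.11,192.168.1.12}"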

What Should Happen Instead?

The addon should be enabled.

Reproduction Steps

  1. I was running microk8s 1.24 and enabled the prometheus addon.
  2. I then disabled the plugin and upgraded the HA cluster to 1.25.
  3. Tried to enable the observability addon.

Introspection Report

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
 FAIL:  Service snap.microk8s.daemon-apiserver-proxy is not running
For more details look at: sudo journalctl -u snap.microk8s.daemon-apiserver-proxy
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

sudo journalctl -u snap.microk8s.daemon-apiserver-proxy -f

-- Logs begin at Tue 2022-07-19 00:32:22 UTC. --
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + ARCH=x86_64
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + export LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/3883/lib:/snap/microk8s/3883/usr/lib:/snap/microk8s/3883/lib/x86_64-linux-gnu:/snap/microk8s/3883/usr/lib/x86_64-linux-gnu
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + LD_LIBRARY_PATH=/var/lib/snapd/lib/gl:/var/lib/snapd/lib/gl32:/var/lib/snapd/void::/snap/microk8s/3883/lib:/snap/microk8s/3883/usr/lib:/snap/microk8s/3883/lib/x86_64-linux-gnu:/snap/microk8s/3883/usr/lib/x86_64-linux-gnu
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + source /snap/microk8s/3883/actions/common/utils.sh
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: ++ [[ /snap/microk8s/3883/run-apiserver-proxy-with-args == \/\s\n\a\p\/\m\i\c\r\o\k\8\s\/\3\8\8\3\/\a\c\t\i\o\n\s\/\c\o\m\m\o\n\/\u\t\i\l\s\.\s\h ]]
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + '[' -e /var/snap/microk8s/3883/var/lock/clustered.lock ']'
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + echo 'Not a worker node, exiting'
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: Not a worker node, exiting
Sep 12 15:07:17 kubey microk8s.daemon-apiserver-proxy[459835]: + exit 0
Sep 12 15:07:17 kubey systemd[1]: snap.microk8s.daemon-apiserver-proxy.service: Succeeded.

To clean up the half-enabled addon I have to run the following:
https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#configuration

microk8s.helm uninstall kube-prom-stack -n observability

kubectl delete crd alertmanagerconfigs.monitoring.coreos.com
kubectl delete crd alertmanagers.monitoring.coreos.com
kubectl delete crd podmonitors.monitoring.coreos.com
kubectl delete crd probes.monitoring.coreos.com
kubectl delete crd prometheuses.monitoring.coreos.com
kubectl delete crd prometheusrules.monitoring.coreos.com
kubectl delete crd servicemonitors.monitoring.coreos.com
kubectl delete crd thanosrulers.monitoring.coreos.com

Mayastor addon does not automatically create a MayastorPool on all nodes

Summary

I have 3 Hetzner servers on which I deployed an HA MicroK8s cluster (twice), but when I enable the addon I only get one mayastor pool, on node 3, instead of all 3. The even funnier part is that I executed it on the first node, so I would at least have expected the node 1 pool to exist.

I added the nodes by joining them over the internal Hetzner network as full nodes (not workers), so I do not see the issue there; for the rest, I stuck to the documentation.

root@k8s-1:~# microk8s.kubectl get mayastorpool -n mayastor
NAME                  NODE    STATUS   CAPACITY      USED   AVAILABLE
microk8s-k8s-3-pool   k8s-3   Online   21449670656   0      21449670656

What Should Happen Instead?

As mentioned in the documentation, a mayastor pool should be created for every node. So in the end:

root@k8s-1:~# microk8s.kubectl get mayastorpool -n mayastor
NAME                  NODE    STATUS   CAPACITY      USED   AVAILABLE
microk8s-k8s-1-pool   k8s-1   Online   21449670656   0      21449670656
microk8s-k8s-2-pool   k8s-2   Online   21449670656   0      21449670656
microk8s-k8s-3-pool   k8s-3   Online   21449670656   0      21449670656

Reproduction Steps

apt update && apt upgrade -y && apt install snapd -y
nano /etc/hosts
snap install microk8s --classic
microk8s status --wait-ready
microk8s add-node
microk8s kubectl get nodes
echo vm.nr_hugepages = 1024 | sudo tee -a /etc/sysctl.d/20-microk8s-hugepages.conf
# — restart —
apt-get install linux-modules-extra-$(uname -r)
modprobe nvme-tcp
echo 'nvme-tcp' | sudo tee -a /etc/modules-load.d/microk8s-mayastor.conf
# — restart —
microk8s status
microk8s enable dashboard dns registry istio
microk8s enable ingress
microk8s enable mayastor
microk8s dashboard-proxy

Introspection Report

Report: inspection-report-20230106_142931.tar.gz

Can you suggest a fix?

Maybe a retry command? Or a fix-health command?

microk8s addon mayastor recreate datapools
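
Until something like that exists, a hedged manual workaround is to create the missing MayastorPool resources directly. The disk entry below is illustrative; mirror whatever microk8s kubectl -n mayastor get mayastorpool microk8s-k8s-3-pool -o yaml shows for the existing pool:

# Create a MayastorPool for each node that is missing one.
microk8s kubectl apply -f - <<EOF
apiVersion: openebs.io/v1alpha1
kind: MayastorPool
metadata:
  name: microk8s-k8s-1-pool
  namespace: mayastor
spec:
  node: k8s-1
  disks:
  # Illustrative device; use the node's real backing disk or image file.
  - /dev/sdb
EOF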

Are you interested in contributing with a fix?

I think this is too complicated for me. Possibly an easier task next time 🙂

GPU addon is not available on ARM Microk8s

Summary

GPU addon is not available on ARM Microk8s

What Should Happen Instead?

Be able to install GPU add-on

Reproduction Steps

Ubuntu amd64

$ uname -p
x86_64

$ microk8s status | grep gpu
    gpu                  # (core) Automatic enablement of Nvidia CUDA

Ubuntu arm (Same result on Nvidia DPU BFB and standard Ubuntu arm)

$ uname -p
aarch64

$ microk8s enable gpu
Addon gpu was not found in any repository

$ microk8s status
microk8s is running
high-availability: no
  datastore master nodes: 127.0.0.1:19001
  datastore standby nodes: none
addons:
  enabled:
    dns                  # (core) CoreDNS
    ha-cluster           # (core) Configure high availability on the current node
    helm                 # (core) Helm - the package manager for Kubernetes
    helm3                # (core) Helm 3 - the package manager for Kubernetes
    hostpath-storage     # (core) Storage class; allocates storage from host directory
    ingress              # (core) Ingress controller for external access
    metallb              # (core) Loadbalancer for your Kubernetes cluster
    storage              # (core) Alias to hostpath-storage add-on, deprecated
  disabled:
    cert-manager         # (core) Cloud native certificate management
    community            # (core) The community addons repository
    dashboard            # (core) The Kubernetes dashboard
    host-access          # (core) Allow Pods connecting to Host services smoothly
    kube-ovn             # (core) An advanced network fabric for Kubernetes
    mayastor             # (core) OpenEBS MayaStor
    metrics-server       # (core) K8s Metrics Server for API access to service metrics
    minio                # (core) MinIO object storage
    observability        # (core) A lightweight observability stack for logs, traces and metrics
    prometheus           # (core) Prometheus operator for monitoring and logging
    rbac                 # (core) Role-Based Access Control for authorisation
    registry             # (core) Private image registry exposed on localhost:32000

Update MinIO Version in MicroK8s Core Addons

Summary

The current version of MinIO included in MicroK8s core addons is 4.5.1, which is significantly behind the latest version, 5.0.11. The newer version of MinIO includes numerous bug fixes and improvements, one of which is critical for operations like expanding storage (--namespace bug) minio/operator#1291

What Should Happen Instead?

MicroK8s should include an updated version of MinIO, preferably the latest stable version (5.0.11 as of this request), to ensure users benefit from the latest features and bug fixes.

Reproduction Steps

The issue with the --namespace flag when expanding storage can be consistently reproduced with the current version of MinIO in MicroK8s:

  1. Try expanding a MinIO tenant using the current version (4.5.1).
  2. Observe the --namespace flag issue.

Introspection Report

Updating the MinIO version in MicroK8s core addons to 5.0.11 or the latest stable version available. This update would address the known issues and improve the overall user experience with MinIO in MicroK8s.

Can you suggest a fix?

Are you interested in contributing with a fix?

GPU addon fails to install drivers due to GPG error

Summary

Installation of the GPU addon fails because the NVIDIA GPU Driver containers cannot install the required driver packages with apt due to outdated/missing GPG keys:
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC

Nvidia changed their CUDA Linux GPG repository key on April 28th but v1.8.2 of the GPU operator uses version 470.57.02 of the 'NVIDIA GPU Driver' image which only contains the old key.

What Should Happen Instead?

Nvidia driver installation should complete successfully and the GPU operator installation should progress normally.

Reproduction Steps

  1. microk8s enable gpu
  2. check output of nvidia-driver-ctr of the nvidia-driver-daemonset for GPG Errors

Introspection Report

inspection-report-20220518_161245.tar.gz

Can you suggest a fix?

Install v1.10+ of the GPU operator with the addon since it uses a 'NVIDIA GPU Driver' image which contains the new GPG key.
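
As an interim workaround, the operator can likely be upgraded in place with Helm. The release name, namespace and chart version below are assumptions; confirm the release with microk8s helm3 list -A first:

# Upgrade the installed GPU operator release to a chart version whose
# driver image ships the rotated NVIDIA GPG key.
microk8s helm3 repo add nvidia https://helm.ngc.nvidia.com/nvidia
microk8s helm3 repo update
microk8s helm3 upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator-resources \
  --version v1.10.1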

Are you interested in contributing with a fix?

Issue with enabling cis-hardening addon

Summary

I need to enable cis-hardening addon in an air-gapped microk8s cluster. As stated in the documentation, I have to disable the kube-bench download by setting the --install-kubebench flag to false. However, it doesn't seem to recognize the flag, as it keeps trying to download kube-bench from GitHub and fails.

What Should Happen Instead?

It should skip the kube-bench download, allowing me to complete the addon installation in an air-gapped environment.

Reproduction Steps

I've tried all of these flag forms, but got the same result:
microk8s enable cis-hardening --install-kubebench false
microk8s enable cis-hardening --install-kubebench False
microk8s enable cis-hardening --install-kubebench=false
microk8s enable cis-hardening --install-kubebench=False

Every time, the code goes through the DownloadKubebench function, as it prints out the "Downloading kube-bench" message and then crashes while contacting the kube-bench URL.

Can you suggest a fix?

It seems to be an issue with the Click library not correctly interpreting the flag.

Thank you!

Extending DNS addon args

Summary

Please, could you extend the DNS addon args with a third argument representing the cluster-domain?
I just used the launch configuration to set up a new MicroK8s (v1.29) cluster, when I realized there is no way to define the cluster-domain for the DNS ConfigMap. After a short glance at the code I noticed the cluster-domain value is fixed to cluster.local.

Why is this important?

It would allow setting up a MicroK8s cluster programmatically without patching the DNS ConfigMap afterwards.

Are you interested in contributing to this feature?

I couldn't test any of the following code snippets. However, maybe it helps to get it done.

Changes in file microk8s-core-addons/addons/dns/enable:

#new after line 57:
CLUSTER_DOMAIN="$3"
if [ -z "$CLUSTER_DOMAIN" ]; then
  CLUSTER_DOMAIN="cluster.local"
fi

#new after line 67:
map[\$CLUSTERDOMAIN]="$CLUSTER_DOMAIN"

# change line 75:
if ! grep -q -- "--cluster-domain=$CLUSTER_DOMAIN" "${SNAP_DATA}/args/kubelet"; then

# change line 81:
refresh_opt_in_config "cluster-domain" "$CLUSTER_DOMAIN" kubelet

And change line 30 in file microk8s-core-addons/addons/dns/coredns.yaml:

        kubernetes $CLUSTERDOMAIN in-addr.arpa ip6.arpa {

Thanks, I appreciate your effort!

Mayastor 2 installed - cannot create pools using helper scripts - error: "TypeError: fork_exec() takes exactly 21 arguments (17 given)"

Summary

Executed microk8s mayastor-pools add --node kcp01.test.local --device /dev/sdb --size 10GB

Received the following stack trace:

Traceback (most recent call last):
  File "/var/snap/microk8s/common/plugins/mayastor-pools", line 146, in <module>
    pools.main()
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/var/snap/microk8s/common/plugins/mayastor-pools", line 92, in add
    subprocess.run([KUBECTL, "apply", "-f", "-"], input=format_pool(node, dev))
  File "/snap/microk8s/5372/usr/lib/python3.8/subprocess.py", line 493, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/snap/microk8s/5372/usr/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/snap/microk8s/5372/usr/lib/python3.8/subprocess.py", line 1639, in _execute_child
    self.pid = _posixsubprocess.fork_exec(
TypeError: fork_exec() takes exactly 21 arguments (17 given)

What Should Happen Instead?

Disk pool should be created with no errors!

Reproduction Steps

As above

More information

I was running microk8s 1.26 and upgraded to 1.27; however, this upgrade does not seem to upgrade any addons, so I cloned this 1.27 core addons repo and replaced the mayastor addon scripts with the newer versions. This fixed the initial issues I was having with mayastor, but now I'm unable to use the helper scripts.
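
For what it's worth, a fork_exec() argument-count mismatch is a typical symptom of mixing Python standard-library files from different interpreter versions, which copying scripts between releases by hand can easily cause. The supported way to refresh the shipped addons is the addons repo command; a hedged sketch:

# Refresh the bundled core addons to match the installed snap instead
# of copying scripts from GitHub by hand.
microk8s addons repo update core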

Are you interested in contributing with a fix?

Sorry, I'm not a Pythonista!

nfs pv/pvc over NFS ignores ReclaimPolicy, and always clean up folder in the NFS server

Summary

NFS PV/PVC, as well as the StorageClass, ignore the ReclaimPolicy and always clean up the folder on the NFS server after the PVC is deleted.

I tried to use following:

  1. reclaimPolicy: Retain in the StorageClass, related to PVC, created in PV over NFS
  2. provisioner: nfs in the StorageClass
  3. provisioner: kubernetes.io/no-provisioner in the StorageClass
  4. Do not use StorageClass at all
  5. persistentVolumeReclaimPolicy: Recycle in the PV

In every case the same thing happened: after deleting the PVC, microk8s deletes the folder at spec.nfs.path described in the PV's YAML config.

What Should Happen Instead?

I expected that after deleting the PVC, microk8s would not delete the folder on the NFS server.

Reproduction Steps

For example, you can use configs like this:

cat pv.yml 
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs4smb
provisioner: kubernetes.io/no-provisioner
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: Immediate

---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: samba-mapping4nfs
  namespace: samba
spec:
  volumeMode: Filesystem
  storageClassName: nfs4smb
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  mountOptions:
    - rw
    - rsize=6476
    - wsize=64768
    - noatime  
    - nfsvers=4.2
  nfs:
    path: /tank/compressed/windows/
    server: 10.11.17.10
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: compressed
  namespace: samba
spec:
  storageClassName: nfs4smb
  accessModes:
  - ReadWriteMany      
  resources:
     requests:
       storage: 500Gi   
  volumeName: samba-mapping4nfs

I use a default snap package:
microk8s v1.24.3 3597 1.24/stable canonical✓ classic
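
One note on the manifest above: persistentVolumeReclaimPolicy: Recycle deliberately scrubs the volume's contents when the claim is released, and for a manually created PV the PV's own policy wins over the StorageClass. A hedged sketch of switching the existing PV to Retain:

# Switch the PV's reclaim policy so releasing the claim leaves the
# NFS directory contents untouched.
microk8s kubectl patch pv samba-mapping4nfs \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'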

hostpath-storage ignores pvDir

Summary

default clean installation via snap: microk8s v1.24.3

hostpath-storage ignores pvDir if we create a custom StorageClass. Example:

---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: nvme
provisioner: microk8s.io/hostpath
reclaimPolicy: Delete
parameters:
  pvDir: /mnt/storage/nvme/storage
volumeBindingMode: WaitForFirstConsumer
kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS   REASON   AGE
pvc-a01196ef-01ed-4055-b1b8-23c1ccb9d1bd   8Gi        RWO            Delete           Bound    default/wiki-dokuwiki   nvme                    11s

kubectl describe pv pvc-a01196ef-01ed-4055-b1b8-23c1ccb9d1bd
Name:              pvc-a01196ef-01ed-4055-b1b8-23c1ccb9d1bd
Labels:            <none>
Annotations:       hostPathProvisionerIdentity: microk8s
                   pv.kubernetes.io/provisioned-by: microk8s.io/hostpath
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      nvme
Status:            Bound
Claim:             default/wiki-dokuwiki
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          8Gi
Node Affinity:     
  Required Terms:  
    Term 0:        kubernetes.io/hostname in [microk8s]
Message:           
Source:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/default-storage/default-wiki-dokuwiki-pvc-a01196ef-01ed-4055-b1b8-23c1ccb9d1bd
    HostPathType:  DirectoryOrCreate
Events:            <none>

What Should Happen Instead?

I expected the volume to be created on the other file system.

mayastor addon does not start correctly on arm64 platform

Summary

When enabling the mayastor addon on arm64 servers, all pods reach the Running state except the rest pod, which stays in CrashLoopBackOff. If I watch kubectl get pod -A I can see an OOMKilled status from time to time.
Pools are in an error state.
It is an arm64-only issue; the same setup works fine on amd64.

What Should Happen Instead?

All pods should be in Running state and pools should be created.

Reproduction Steps

Yes, I can, but only on arm64.
Build a six node microk8s cluster, configure prerequisites as described in docs, enable mayastor addon.

Introspection Report

inspection-report-20230413_124122.tar.gz

Can you suggest a fix?

Are you interested in contributing with a fix?

[FR] Update Kube-OVN addon to the latest stable version of Kube-OVN (v1.12.x)

Hello, thank you very much for maintaining this tool. It really simplifies Kubernetes.

Summary

Would it be possible to update the Kube-OVN addon to use the latest maintained stable version of Kube-OVN?

Why is this important?

According to /var/snap/microk8s/common/addons/core/addons.yaml the current version used in the addon is 1.10.0-alpha1 (the image configuration in the yaml files in this repository also seems to point to version v1.10.0)¹.

Between v1.10.0-alpha1 and v1.12.3 a series of bug fixes were implemented, and very important features were also added (e.g. natOutgoingPolicyRules). The complete list of changes can be seen in https://github.com/kubeovn/kube-ovn/blob/master/CHANGELOG.md.

Are you interested in contributing to this feature?

Yes we are interested in contributing to this update.

However we might need some pointers and tips on how to get started on that. (We have already been through the HACKING and CONTRIBUTING documents). For example it would be super helpful to understand the process that was initially employed to create the .yaml files in the addons/kube-ovn folder.

I can see that these files contain a comment on top in the form of Sourced from: https://..., however if I inspect the corresponding directory in GitHub for the latest release I can see that it contains many other .yaml files. Are all these new files necessary? How do I select which ones are relevant or not?

I can also see that some of the files have a Changelog comment that seems to indicate which changes were done to adapt the original source to microk8s. Is there any automation for that (e.g. patches that were applied to the original source)? Or any recommendation related to these patches?

Any other recommendation/advice/tips/hints would be greatly appreciated...

/cc @Chrisys93 @JuanMaParraU @AlexsJones

Footnotes

  1. If the current version installed by microk8s is really an alpha release, it would be very compelling to update to a non-alpha release.

gpu add on failing on 1.25

Summary

GPU add on is not working as expected in kubernetes 1.25

Process


  1. Install microk8s 1.25 sudo snap install microk8s --classic --channel=1.25/stable
  2. Enable other plugins microk8s enable rbac hostpath-storage metallb ingress dns dashboard helm
  3. Enable gpu microk8s enable gpu

After some time run microk8s kubectl logs -n gpu-operator-resources -lapp=nvidia-operator-validator -c nvidia-operator-validator. Looks like nvidia-validator is not installed.

Screenshot

1.6654307708338156e+09 ERROR controller.clusterpolicy-controller Reconciler error {"name": "cluster-policy", "namespace": "", "error": "no matches for kind "RuntimeClass" in version "node.k8s.io/v1beta1""}
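
For reference, the node.k8s.io/v1beta1 RuntimeClass API was removed in Kubernetes 1.25, which is consistent with the error above. A hedged way to confirm on the cluster:

# List the node.k8s.io API versions served; on 1.25 only v1 should
# appear, so operators still requesting v1beta1 fail to reconcile.
microk8s kubectl api-versions | grep node.k8s.io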


Introspection Report

Inspecting system
Inspecting Certificates
Inspecting services
  Service snap.microk8s.daemon-cluster-agent is running
  Service snap.microk8s.daemon-containerd is running
  Service snap.microk8s.daemon-kubelite is running
  Service snap.microk8s.daemon-k8s-dqlite is running
  Service snap.microk8s.daemon-apiserver-kicker is running
  Copy service arguments to the final report tarball
Inspecting AppArmor configuration
Gathering system information
  Copy processes list to the final report tarball
  Copy disk usage information to the final report tarball
  Copy memory usage information to the final report tarball
  Copy server uptime to the final report tarball
  Copy openSSL information to the final report tarball
  Copy snap list to the final report tarball
  Copy VM name (or none) to the final report tarball
  Copy current linux distribution to the final report tarball
  Copy network configuration to the final report tarball
Inspecting kubernetes cluster
  Inspect kubernetes cluster
Inspecting dqlite
  Inspect dqlite

Can you suggest a fix?

Looks like this bug was fixed as per this PR NVIDIA/gpu-operator@6771549. Is the gpu addon picking up the changes from this PR?

Are you interested in contributing with a fix?

Enable Addons with Proxy environment

Summary

I want to run microk8s enable "addon" with a proxy environment, and I have set up /etc/environment like this:

$ cat /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin"
HTTP_PROXY="myproxy:port/"
HTTPS_PROXY="myproxy:port/"
NO_PROXY="localhost,127.0.0.1,::1"
http_proxy="myproxy:port"
https_proxy="myproxy:port"
no_proxy="localhost,127.0.0.1,::1"

Try apt update with Proxy is working:

$ sudo apt update
Hit:1 http://tw.archive.ubuntu.com/ubuntu jammy InRelease
Hit:2 http://tw.archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:3 http://tw.archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:4 http://tw.archive.ubuntu.com/ubuntu jammy-security InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
11 packages can be upgraded. Run 'apt list --upgradable' to see them.

But microk8s enable "addon" always fails. From the error message, it seems there is a problem with the proxy config.
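
For reference, containerd inside MicroK8s does not pick up /etc/environment for image pulls; the documented place for proxy settings is the containerd env file in the snap's data directory. A hedged sketch, reusing the myproxy:port placeholder from above:

# Add the proxy settings containerd should use when pulling images,
# then restart MicroK8s so they take effect.
sudo tee -a /var/snap/microk8s/current/args/containerd-env <<'EOF'
HTTPS_PROXY=http://myproxy:port
HTTP_PROXY=http://myproxy:port
NO_PROXY=10.0.0.0/8,192.168.0.0/16,127.0.0.1,localhost
EOF
microk8s stop
microk8s start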

What Should Happen Instead?

Addon is installed and working.

Reproduction Steps

Example: https://microk8s.io/docs/addon-dashboard

  1. Run microk8s enable dashboard:
$ microk8s enable dashboard
Infer repository core for addon dashboard
Enabling Kubernetes Dashboard
Infer repository core for addon metrics-server
Enabling Metrics-Server
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
clusterrolebinding.rbac.authorization.k8s.io/microk8s-admin created
Adding argument --authentication-token-webhook to nodes.
Restarting nodes.
Metrics-Server is enabled
Applying manifest
serviceaccount/kubernetes-dashboard created
service/kubernetes-dashboard created
secret/kubernetes-dashboard-certs created
secret/kubernetes-dashboard-csrf created
secret/kubernetes-dashboard-key-holder created
configmap/kubernetes-dashboard-settings created
role.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrole.rbac.authorization.k8s.io/kubernetes-dashboard created
rolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
clusterrolebinding.rbac.authorization.k8s.io/kubernetes-dashboard created
deployment.apps/kubernetes-dashboard created
service/dashboard-metrics-scraper created
deployment.apps/dashboard-metrics-scraper created
secret/microk8s-dashboard-token unchanged

If RBAC is not enabled access the dashboard using the token retrieved with:

microk8s kubectl describe secret -n kube-system microk8s-dashboard-token

Use this token in the https login UI of the kubernetes-dashboard service.

In an RBAC enabled setup (microk8s enable RBAC) you need to create a user with restricted
permissions as shown in:
https://github.com/kubernetes/dashboard/blob/master/docs/user/access-control/creating-sample-user.md
  2. Get the token:
$ microk8s kubectl describe secret -n kube-system microk8s-dashboard-token
Name:         microk8s-dashboard-token
Namespace:    kube-system
Labels:       <none>
Annotations:  kubernetes.io/service-account.name: default
              kubernetes.io/service-account.uid: d19d8c9b-b801-4f56-bb2d-f49ecc2bfa03

Type:  kubernetes.io/service-account-token

Data
====
ca.crt:     1123 bytes
namespace:  11 bytes
token:      dashbard-token
  3. Run microk8s dashboard-proxy and get a timeout message:
$ microk8s dashboard-proxy
Checking if Dashboard is running.
Infer repository core for addon dashboard
Waiting for Dashboard to come up.
error: timed out waiting for the condition on deployments/kubernetes-dashboard
Traceback (most recent call last):
  File "/snap/microk8s/4094/scripts/wrappers/dashboard_proxy.py", line 111, in <module>
    dashboard_proxy()
  File "/snap/microk8s/4094/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/snap/microk8s/4094/usr/lib/python3/dist-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/snap/microk8s/4094/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/snap/microk8s/4094/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/snap/microk8s/4094/scripts/wrappers/dashboard_proxy.py", line 79, in dashboard_proxy
    check_output(command)
  File "/snap/microk8s/4094/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/snap/microk8s/4094/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/snap/microk8s/4094/microk8s-kubectl.wrapper', '-n', 'kube-system', 'wait', '--timeout=240s', 'deployment', 'kubernetes-dashboard', '--for', 'condition=available']' returned non-zero exit status 1.

Introspection Report

None

Can you suggest a fix?

No

Are you interested in contributing with a fix?

No

[metallb] add new arg for the common use case of `prefsrc`

In local dev and CI we always[1, 2, 3] enable metallb with

IPADDR=$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc')
microk8s enable metallb:$IPADDR-$IPADDR

I was hoping this fairly common use case could have its own arg for better DevX.

For example:

microk8s enable metallb:prefsrc
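
A hedged sketch of what the enable script could do when it sees the literal prefsrc argument (the RANGE variable name is illustrative):

# Resolve the host's preferred source address for the default route and
# turn it into a single-address MetalLB range.
if [ "$RANGE" = "prefsrc" ]; then
  IPADDR="$(ip -4 -j route get 2.2.2.2 | jq -r '.[] | .prefsrc')"
  RANGE="${IPADDR}-${IPADDR}"
fi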
