
Lumigo Kubernetes Operator

The Lumigo Kubernetes operator provides a one-click solution for monitoring Kubernetes clusters with Lumigo.

Setup

Installation

Install the Lumigo Kubernetes operator in your Kubernetes cluster with Helm:

helm repo add lumigo https://lumigo-io.github.io/lumigo-kubernetes-operator
helm install lumigo lumigo/lumigo-operator --namespace lumigo-system --create-namespace --set cluster.name=<cluster_name>

Note: You can change the namespace from lumigo-system to a name of your choosing, but be aware that the commands in the steps below assume lumigo-system and will need to be adjusted accordingly.
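
For example, assuming a hypothetical namespace name such as observability, the installation command becomes:

helm install lumigo lumigo/lumigo-operator --namespace observability --create-namespace --set cluster.name=<cluster_name>

Subsequent kubectl and helm commands in this document would then use --namespace observability (or -n observability) instead of lumigo-system.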

(Setting cluster.name is optional but highly advised; see the Naming your cluster section.)

You can verify that the Lumigo Kubernetes operator is up and running with:

$ kubectl get pods -n lumigo-system
NAME                                                         READY   STATUS    RESTARTS   AGE
lumigo-kubernetes-operator-7fc8f67bcc-ffh5k   2/2     Running   0          56s

Note: While installing the Lumigo Kubernetes operator via Kustomize is generally expected to work (except for the removal of instrumentation on uninstallation), it is not actually supported1.

EKS on Fargate

On EKS, the pods of the Lumigo Kubernetes operator itself must run on nodes backed by Amazon EC2 instances. Your monitored applications, however, can run on the Fargate profile without any issues. Installing the Lumigo Kubernetes operator on an EKS cluster without EC2-backed nodegroups results in the operator pods staying in the Pending state:

$ kubectl describe pod -n lumigo-system lumigo-kubernetes-operator-5999997fb7-cvg5h

Namespace:    	lumigo-system
Priority:     	0
Service Account:  lumigo-kubernetes-operator
Node:         	<none>
Labels:       	app.kubernetes.io/instance=lumigo
              	app.kubernetes.io/name=lumigo-operator
              	control-plane=controller-manager
              	lumigo.auto-trace=false
              	lumigo.cert-digest=dJTiBDRVJUSUZJQ
              	pod-template-hash=5999997fb7
Annotations:  	kubectl.kubernetes.io/default-container: manager
              	kubernetes.io/psp: eks.privileged
Status:       	Pending

(The reason for this limitation is a long story, but it is necessary for Lumigo to figure out which EKS cluster the operator is sending data from.) If you are installing the Lumigo Kubernetes operator on an EKS cluster that has only the Fargate profile, add a managed nodegroup.
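
As a sketch, a managed nodegroup can be added with eksctl; the node type and node count below are placeholders, not recommendations:

eksctl create nodegroup --cluster <cluster_name> --name lumigo-operator-nodes --node-type t3.medium --nodes 2 --managed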

Naming your cluster

Kubernetes clusters do not have a built-in notion of their own identity1, but when running multiple Kubernetes clusters, you almost certainly have names for them. The Lumigo Kubernetes operator will automatically add to your telemetry the k8s.cluster.uid OpenTelemetry resource attribute, set to the value of the UID of the kube-system namespace, but UIDs are not meant for humans to remember and recognize easily. The Lumigo Kubernetes operator therefore allows you to set a human-readable name via the cluster.name Helm setting, which enables you to filter all your tracing data by cluster in Lumigo's Explore view.

1 Not even Amazon EKS clusters, as their ARN is not available anywhere inside the cluster itself.
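
If you initially installed without a cluster name, you can set one later with a Helm upgrade; this is a minimal sketch, assuming --reuse-values to preserve the rest of your existing configuration:

helm upgrade lumigo lumigo/lumigo-operator --namespace lumigo-system --reuse-values --set cluster.name=<cluster_name>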

Upgrading

You can check which version of the Lumigo Kubernetes operator you have deployed in your cluster as follows:

$ helm ls -A
NAME  	NAMESPACE    	REVISION	UPDATED                              	STATUS  	CHART             	APP VERSION
lumigo	lumigo-system	2       	2023-07-10 09:20:04.233825 +0200 CEST	deployed	lumigo-operator-13	13

The version of the Lumigo Kubernetes operator is reported in the APP VERSION column.
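
To see which chart versions are available before upgrading, one option is:

helm search repo lumigo/lumigo-operator --versions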

To upgrade to a newer version of the Lumigo Kubernetes operator, run:

helm repo update
helm upgrade lumigo lumigo/lumigo-operator --namespace lumigo-system

Enabling automatic tracing

Supported resource types

The Lumigo Kubernetes operator automatically adds distributed tracing to pods created via:

  • Deployments (apps/v1.Deployment)
  • DaemonSets (apps/v1.DaemonSet)
  • ReplicaSets (apps/v1.ReplicaSet)
  • StatefulSets (apps/v1.StatefulSet)
  • CronJobs (batch/v1.CronJob)
  • Jobs (batch/v1.Job)

The distributed tracing is provided by the Lumigo OpenTelemetry distribution for JS, the Lumigo OpenTelemetry distribution for Java and the Lumigo OpenTelemetry distribution for Python.

The Lumigo Kubernetes operator will automatically trace all Java, Node.js and Python processes found in the containers of pods created in the namespaces that Lumigo traces. To activate automatic tracing for resources in a namespace, create in that namespace a Kubernetes secret containing your Lumigo token, and reference it from a Lumigo (operator.lumigo.io/v1alpha1.Lumigo) custom resource. Save the following into lumigo.yml:

apiVersion: v1
kind: Secret
metadata:
  name: lumigo-credentials
stringData:
  # Kubectl won't allow you to deploy this dangling anchor.
  # Get the actual value from Lumigo following this documentation: https://docs.lumigo.io/docs/lumigo-tokens
  token: *lumigo-token #  <--- Change this! Example: t_123456789012345678901
---
apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  labels:
    app.kubernetes.io/name: lumigo
    app.kubernetes.io/instance: lumigo
    app.kubernetes.io/part-of: lumigo-operator
  name: lumigo
spec:
  lumigoToken:
    secretRef:
      name: lumigo-credentials # This must match the name of the secret; the secret must be in the same namespace as this Lumigo custom resource
      key: token # This must match the key in the Kubernetes secret (don't touch)

After preparing lumigo.yml, apply it in the desired namespace:

kubectl apply -f lumigo.yml -n <YOUR_NAMESPACE>
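
Alternatively, the secret alone can be created imperatively; this is a sketch, and the secret name and key must match those referenced by the Lumigo custom resource:

kubectl create secret generic lumigo-credentials --from-literal=token=<YOUR_LUMIGO_TOKEN> -n <YOUR_NAMESPACE>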

Each Lumigo resource keeps in its status a list of the resources it currently instruments:

$ kubectl describe lumigo -n my-namespace
Name:         lumigo
Namespace:    my-namespace
API Version:  operator.lumigo.io/v1alpha1
Kind:         Lumigo
Metadata:
  ... # Data removed for readability
Spec:
  ... # Data removed for readability
Status:
  Conditions:
  ... # Data removed for readability
  Instrumented Resources:
    API Version:       apps/v1
    Kind:              StatefulSet
    Name:              my-statefulset
    Namespace:         my-namespace
    Resource Version:  320123
    UID:               93d6d809-ac2a-43a9-bc07-f0d4e314efcc
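
To extract only that list, a minimal sketch (assuming the status field is serialized as instrumentedResources in the custom resource) is:

kubectl get lumigo lumigo -n my-namespace -o jsonpath='{.status.instrumentedResources}'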

Logging support

The Lumigo Kubernetes operator can automatically forward logs emitted by traced pods to Lumigo's log-management solution. Several logging providers are supported (currently the Python logging library for Python apps, and Winston and Bunyan for Node.js apps). Log forwarding is enabled by adding the spec.logging.enabled field to the Lumigo resource:

apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  labels:
    app.kubernetes.io/name: lumigo
    app.kubernetes.io/instance: lumigo
    app.kubernetes.io/part-of: lumigo-operator
  name: lumigo
spec:
  lumigoToken: ... # same token used for tracing
  logging:
    enabled: true # enables log forwarding for pods with tracing injected
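
Putting the pieces together, a complete Lumigo resource that references the secret from the tracing example above and enables log forwarding might look like this (a sketch, not a definitive configuration):

apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  name: lumigo
spec:
  lumigoToken:
    secretRef:
      name: lumigo-credentials # Must match the name of the secret in the same namespace
      key: token               # Must match the key in the Kubernetes secret
  logging:
    enabled: true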

Opting out for specific resources

To prevent the Lumigo Kubernetes operator from injecting tracing into pods managed by a resource in a namespace that contains a Lumigo resource, add the lumigo.auto-trace label set to false to that resource:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: hello-node
    lumigo.auto-trace: "false"  # <-- No injection will take place
  name: hello-node
  namespace: my-namespace
spec:
  selector:
    matchLabels:
      app: hello-node
  template:
    metadata:
      labels:
        app: hello-node
    spec:
      containers:
      - command:
        - /agnhost
        - netexec
        - --http-port=8080
        image: registry.k8s.io/e2e-test-images/agnhost:2.39
        name: agnhost

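The same opt-out label can be added to an existing resource with kubectl to prevent future injection (a sketch; the deployment name and namespace are examples):

kubectl label deployment hello-node -n my-namespace lumigo.auto-trace=false
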
In the logs of the Lumigo Kubernetes operator, you will see a message like the following:

1.67534267851615e+09    DEBUG   controller-runtime.webhook.webhooks   wrote response   {"webhook": "/v1alpha1/inject", "code": 200, "reason": "the resource has the 'lumigo.auto-trace' label set to 'false'; resource will not be mutated", "UID": "6d341941-c47b-4245-8814-1913cee6719f", "allowed": true}

Settings

Inject existing resources

By default, when detecting a new Lumigo resource in a namespace, the Lumigo controller will instrument existing resources of the supported types. The injection will cause new pods to be created for daemonsets, deployments, replicasets, statefulsets and jobs; cronjobs will spawn injected pods at the next iteration. To turn off the automatic injection of existing resources, create the Lumigo resource as follows:

apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  labels:
    app.kubernetes.io/name: lumigo
    app.kubernetes.io/instance: lumigo
    app.kubernetes.io/part-of: lumigo-operator
  name: lumigo
spec:
  lumigoToken: ...
  tracing:
    injection:
      injectLumigoIntoExistingResourcesOnCreation: false # Default: true

Remove injection from existing resources

By default, when detecting the deletion of the Lumigo resource in a namespace, the Lumigo controller will remove the instrumentation from existing resources of the supported types. The removal will cause new pods to be created for daemonsets, deployments, replicasets, statefulsets and jobs; cronjobs will spawn non-injected pods at the next iteration. To turn off the automatic removal of the injection from existing resources, create the Lumigo resource as follows:

apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  labels:
    app.kubernetes.io/name: lumigo
    app.kubernetes.io/instance: lumigo
    app.kubernetes.io/part-of: lumigo-operator
  name: lumigo
spec:
  lumigoToken: ...
  tracing:
    injection:
      removeLumigoFromResourcesOnDeletion: false # Default: true

Note: The removal of injection from existing resources does not occur on uninstallation of the Lumigo Kubernetes operator, as the required role-based access control has likely already been deleted.

Collection of Kubernetes objects

The Lumigo Kubernetes operator will automatically collect Kubernetes object versions in the namespaces with a Lumigo resource in active state, and send them to Lumigo for issue detection (e.g., when your pods crash). The collected object types are: corev1.Event, corev1.Pod, appsv1.Deployment, appsv1.DaemonSet, appsv1.ReplicaSet, appsv1.StatefulSet, batchv1.CronJob, and batchv1.Job. Besides events, the object versions (e.g., pods, replicasets and deployments) are needed to correlate events across the owner-reference chain, e.g., this pod belongs to that replicaset, which belongs to that deployment.

To disable the automated collection of Kubernetes events and object versions, you can configure your Lumigo resources as follows:

apiVersion: operator.lumigo.io/v1alpha1
kind: Lumigo
metadata:
  labels:
    app.kubernetes.io/name: lumigo
    app.kubernetes.io/instance: lumigo
    app.kubernetes.io/part-of: lumigo-operator
  name: lumigo
spec:
  lumigoToken: ...
  infrastructure:
    kubeEvents:
      enabled: false # Default: true

When a Lumigo resource is deleted from a namespace, the collection of Kubernetes events and object versions is automatically halted.

Modify manager log level

By default, the manager logs at INFO level and above.

The current log level can be viewed by running:

kubectl -n lumigo-system get deploy lumigo-lumigo-operator-controller-manager -o=json | jq '.spec.template.spec.containers[0].args'

With the default settings, there will be no log level explicitly set and the above command will return:

[
  "--health-probe-bind-address=:8081",
  "--metrics-bind-address=127.0.0.1:8080",
  "--leader-elect"
]

To set the log level to only show ERROR level logs, run:

kubectl -n lumigo-system patch deploy lumigo-lumigo-operator-controller-manager --type=json -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--zap-log-level=error"}]'

If a log level is already set, instead of using the add operation we use replace and change the path from /args/- to the index of the argument containing the log level setting, such as /args/3:

kubectl -n lumigo-system patch deploy lumigo-lumigo-operator-controller-manager --type=json -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/args/3", "value": "--zap-log-level=info"}]'

NOTE: The container argument array is zero indexed, so the first argument is at index 0.
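
After applying the first patch above, re-running the inspection command should show the new argument appended at the end of the array:

[
  "--health-probe-bind-address=:8081",
  "--metrics-bind-address=127.0.0.1:8080",
  "--leader-elect",
  "--zap-log-level=error"
]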

Uninstall

The removal of the Lumigo Kubernetes operator is performed by:

helm delete lumigo --namespace lumigo-system

In namespaces with the Lumigo resource having spec.tracing.injection.enabled and spec.tracing.injection.removeLumigoFromResourcesOnDeletion both set to true, supported resources that have been injected by the Lumigo Kubernetes operator will be updated to remove the injection, with the following caveat:

Note: The removal of injection from existing resources does not apply to batchv1.Job resources, as their corev1.PodSpec is immutable after the batchv1.Job resource has been created.

TLS certificates

The Lumigo Kubernetes operator injector webhook uses a self-signed certificate that is automatically generated during the installation of the Helm chart. The generated certificate has a 365-day expiration, and a new certificate is generated every time you upgrade the Lumigo Kubernetes operator's Helm chart.
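
To check when the current certificate expires, you can decode it with openssl; this is a sketch, and the name of the secret holding the webhook certificate may differ in your installation:

kubectl get secrets -n lumigo-system
kubectl get secret -n lumigo-system <webhook-certificate-secret> -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate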

Events

The Lumigo Kubernetes operator will add events to the resources it instruments, with the following reasons and under the following conditions. All of the events below are created on apps/v1.Deployment, apps/v1.DaemonSet, apps/v1.ReplicaSet, apps/v1.StatefulSet and batch/v1.CronJob resources:

LumigoAddedInstrumentation: a Lumigo resource exists in the namespace, and the resource is instrumented with Lumigo as a result.
LumigoCannotAddInstrumentation: a Lumigo resource exists in the namespace, and the resource should be instrumented with Lumigo as a result, but an error occurs.
LumigoUpdatedInstrumentation: a Lumigo resource exists in the namespace, and the resource has its Lumigo instrumentation updated as a result.
LumigoCannotUpdateInstrumentation: a Lumigo resource exists in the namespace, and the resource should have its Lumigo instrumentation updated as a result, but an error occurs.
LumigoRemovedInstrumentation: a Lumigo resource is deleted from the namespace, and the resource has its Lumigo instrumentation removed as a result.
LumigoCannotRemoveInstrumentation: a Lumigo resource is deleted from the namespace, and the resource should have its Lumigo instrumentation removed as a result, but an error occurs.
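
To see these events for a specific resource, kubectl describe lists them in its Events section; alternatively, they can be queried directly (the kind, name and namespace below are examples):

kubectl describe deployment my-deployment -n my-namespace
kubectl get events -n my-namespace --field-selector involvedObject.kind=Deployment,involvedObject.name=my-deployment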

Footnotes

  1. The user experience of having to install Cert Manager is unnecessarily complex, and Kustomize layers, while they may be fine for one's own applications, are simply unsound for a batteries-included, rapidly-evolving product like the Lumigo Kubernetes operator. Specifically, please expect your Kustomize layers to stop working with any release of the Lumigo Kubernetes operator.


lumigo-kubernetes-operator's Issues

Lumigo Injector container running as root in pod that disallows it

Warning Failed 1s (x5 over 46s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "test-app-594fc9665c-nm5rc_test-app(70735e7e-b84a-453e-af66-64a9e3cbd420)", container: lumigo-injector)

The injector container should not run as root if the hosting pod is not running as root.

ITests on EKS

Automated integration tests on EKS. The success criterion (for now) is that injection occurs.

Ensure Webhook certificate rotation

Ensure the operator rotates its webhook self-signed certificates before they expire. Since the helm chart generates a new certificate on upgrade, we could set up a CronJob running regularly (quarterly? a few days before expiration?).

Implement mutation logic in Webhook

Implement mutation logic in the Webhook to inject the latest tag of the Lumigo AutoTrace.

Besides the actual logic with the init-container, shared volume and env vars (referencing the K8s secret configured in Lumigo), add lumigo.auto-trace=true labels on both top-level entities (Deployment, etc) as well as Pod template. See if we can show Kubernetes events for the top-level resources ("Injected by Lumigo"), and also add annotations to make it as clear as possible that we touched a resource.

Implement auto-update of operator

Have the Lumigo controller spawn a cron-job in its own namespace (Owns reference in the controller) to check nightly which version of the image in the ECR repo (#4) is the latest and, if a later one is found, update the controller's deployment, triggering a rolling update. Test what happens when a bad image is uploaded, and whether we can revert the change.

Record events on injected resources

As a user, I want to have means of understanding why a resource injected by the Lumigo Operator is different than the resource I created with my IaC tooling.

When the controller injects into or removes injection from a resource, it should record an event on that resource.

Horizontal autoscaling for Operator

As a user, I want my Lumigo operator to "just work" even when it's processing data from large clusters.

I would like a default horizontal pod autoscaler based on OpenTelemetry collector metrics from the telemetry-proxy container, as that is by far the component that will serve most load on large clusters.

Manual settings for replica size should be exposed as optional in Helm, but the overall experience should be an "auto" one.

Reconcile loop cannot recover from event pointing to an object that doesn't exist

Describe the bug
the reconcile loop cannot recover from event pointing to an object that doesn't exist

To Reproduce

  1. create a namespace and add the operator to that namespace
  2. create a deployment with a bad name (e.g. use an underscore in the name) in that namespace
  3. see the deployment failing to create
  4. fix the deployment name and deploy it
  5. the operator fails to inject

Expected behavior
the operator should be able to inject once the deployment is fixed

Runtime details

  • tried on python3.9

Screenshots
kubectl logs

The Deployment "unicorn_analytics" is invalid: metadata.name: Invalid value: "unicorn_analytics": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')

operator logs

2023-07-13T10:04:04Z    ERROR   controllers.Lumigo      Cannot rebind dangling LumigoAddedInstrumentation event '0xc000748a60'  {"name": "lumigo", "namespace": "unicorns-rainbows", "error": "cannot fill out the 'InvolvedObject' reference: cannot retrieve 'unicorns-rainbows/unicorn_analytics' Deployment: deployments.apps \"unicorn_analytics\" not found"}
github.com/lumigo-io/lumigo-kubernetes-operator/controllers.(*LumigoReconciler).Reconcile
        /workspace/controllers/lumigo_controller.go:290
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:235


Add Kubernetes resource attributes to traces

Implement a way to add the Kubernetes semantic conventions to the traces we collect. Possible options are:

  1. Inject the resource attributes via the OTEL_RESOURCE_ATTRIBUTES env var (not a good option, as we might override or be overridden by other settings)
  2. Introduce a custom OpenTelemetry Collector image in the controller, which uses the k8sattributesprocessor to add the missing resource attributes. In this case, we will need to use the LUMIGO_ENDPOINT env var to redirect the exporting to the local OpenTelemetry Collector. We will also need to think about how to monitor and scale the collector.

Also, consider whether we can proxy-forward dependency data over the Operator service.

Enforce only one Lumigo instance in a given namespace

  1. Implement logic to enforce that there is only one Lumigo resource in any given namespace (as multiple would cause a conflict about where to take the Lumigo token from). When the Lumigo controller sees multiple instances in the same namespace, all the instances have to be marked disabled and have errors with the updateStatusIfNeeded API.

  2. Before injecting a resource in the Webhook, check that the Lumigo resource in the mutated resource's namespace is active.

Operator self-monitoring via OpenTelemetry

As a user, I want to automatically monitor my Lumigo Operator with Lumigo.

When performing operations in a namespace, the Lumigo Operator should send log-based telemetry to the Lumigo platform.
It should be possible to define a secret in the lumigo-system namespace that the operator can use to send all telemetry as well, including telemetry about operations outside of any specific namespace.
