k8s-monitoring-helm's Introduction

Kubernetes Monitoring Helm Charts

Maintainers

Name Email Url
petewall [email protected]
skl [email protected]

Usage

Helm must be installed to use the chart. Please refer to Helm's documentation to get started.

Once Helm is set up properly, add the repo as follows:

helm repo add grafana https://grafana.github.io/helm-charts

See the Chart Documentation for chart install instructions.

Contributing

See our Contributing Guide for more information.

k8s-monitoring-helm's People

Contributors

aptomaketil, basvdl, bytheway, caleb-devops, cedricziel, claudioscalzo, cornelius-keller, dependabot[bot], dmouse, duncan485, eamonryan, github-actions[bot], gouthamve, hbjydev, iamdmitrij, ilian, jewbetcha, leszekblazewski, mar4uk, mattsimonsen, n888, peterolivo, petewall, qclaogui, roya, seamusgrafana, selyx, skl, t00mas, vad1mo

k8s-monitoring-helm's Issues

GitHub release action is not great for collaboration

The current GitHub action where we release on commits to main is nice for automation, but it has three implications:

  1. All community PRs need to know to bump the version and re-run helm-docs.
  2. Commits without a version bump will overwrite the current release.
  3. It's cumbersome to batch commits into a single release; you need to merge to a staging or development branch, then merge to main.

I propose we turn off auto-releasing and instead add a manually triggered GitHub action to release.
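A sketch of what that manual trigger could look like, assuming a hypothetical scripts/release.sh entry point (the workflow name, input, and script are illustrative, not the repo's actual release tooling):

name: Release charts (manual)
on:
  workflow_dispatch:   # run only when a maintainer triggers it from the Actions tab
    inputs:
      dry_run:
        description: "Validate and package without publishing"
        required: false
        default: "false"

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Package and publish charts
        # hypothetical script; the real release steps would live here
        run: ./scripts/release.sh
        env:
          DRY_RUN: ${{ inputs.dry_run }}

With a workflow_dispatch trigger, several merged PRs can be batched into one release, and commits without a version bump never overwrite an existing release.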

Agent isn't setting __replica__ label by default

Having recently switched to this helm chart as we're building out a Mimir setup, I think I've run into an issue.

When setting the agent StatefulSet to run multiple replicas for HA mode, I get lots of rejected duplicate samples in the logs, and the Mimir HA dashboard shows it's not recognised as an HA agent setup.

I believe Mimir by default expects the replica label to be set on all incoming metrics but that this isn't being set by default in this chart.

I did try setting __replica__: ${POD_NAME} in the externalLabels section of the chart, but this didn't seem to register correctly (it ended up sending ${POD_NAME} as the replica name which of course didn't work).

Am I missing something in the chart values for this to be set up correctly? I think it should be the default for the agent when running in HA mode.
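For reference, the agent's flow configuration can resolve the pod name itself. A sketch like the one below would produce a real per-replica value, assuming a POD_NAME environment variable is injected into the agent pods via the downward API and a remote_write component named as in the chart's generated config:

prometheus.remote_write "grafana_cloud_prometheus" {
  endpoint {
    url = "https://<prometheus host>/api/prom/push"
  }

  // env() is evaluated by the agent at runtime, unlike a literal ${POD_NAME}
  // written into the chart's externalLabels values.
  external_labels = {
    "__replica__" = env("POD_NAME")
  }
}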

Add more flexibility in label matching

For example, Node Exporter always matches the label:
app.kubernetes.io/name=prometheus-node-exporter.*

But on OpenShift, it's just app.kubernetes.io/name=node-exporter

Perhaps, each metric source could have a labelMatcher section with defaults like:

metrics:
  node-exporter:
    labelMatchers:
    - key: app.kubernetes.io/name
      value: node-exporter

but, it would need to be overridable...

Force externalServices usernames and passwords to strings

If the username or password is numeric, YAML interprets it as a number, which then fails the templates:

Error: template: k8s-monitoring/templates/credentials.yaml:10:187: executing "k8s-monitoring/templates/credentials.yaml" at <b64enc>: wrong type for value; expected string; got float64

We should force those values to strings, no matter what.
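A sketch of the kind of template-side fix, assuming the credentials template b64-encodes the values directly (illustrative, not the exact template in the chart):

# templates/credentials.yaml (sketch): coerce to string before encoding, so a
# numeric username/password no longer trips the "expected string; got float64"
# error. Note: very large integers parsed as float64 can still render in
# scientific notation, so quoting the values in values.yaml remains safest.
username: {{ .Values.externalServices.prometheus.basicAuth.username | toString | b64enc }}
password: {{ .Values.externalServices.prometheus.basicAuth.password | toString | b64enc }}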

Add log level to pod logs

Suggestion on slack:

I was missing the colorisation of logs from my pods in Loki Explore mode. Adding the following to the Grafana Agent logs ConfigMap fixed it.
Instead of:

pipeline_stages:
            - docker: {}

I have:

pipeline_stages:
            - docker: {}
            - regex:
                expression: "(?P<level>[a-zA-Z]+):(.*)"
            - labels:
                level:

relabeler with cluster.name

I have many clusters which use the same Grafana Cloud instance, with similar things deployed to each. For instance, in my values file, I specify the following:

extraConfig: |-
  // Get all redis nodes
  prometheus.exporter.redis "redis" {
    redis_addr = "redis.redis:6379"
    namespace  = "redis"
    is_cluster = true
  }

  // Scrape redis nodes
  prometheus.scrape "redis" {
    targets    = prometheus.exporter.redis.redis.targets
    forward_to = [prometheus.relabel.redis.receiver]
  }

  // do some relabelling and forward metrics to grafana
  prometheus.relabel "redis" {
    forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

    rule {
      source_labels = ["__name__"]
      regex         = "redis_blocked_clients|redis_cluster_slots_fail|redis_cluster_slots_pfail|redis_cluster_state|redis_commands_duration_seconds_total|redis_commands_total|redis_connected_clients|redis_connected_slaves|redis_db_keys|redis_db_keys_expiring|redis_evicted_keys_total|redis_keyspace_hits_total|redis_keyspace_misses_total|redis_master_last_io_seconds_ago|redis_memory_fragmentation_ratio|redis_memory_max_bytes|redis_memory_used_bytes|redis_memory_used_rss_bytes|redis_total_system_memory_bytes|redis_up"
      action        = "keep"
    }

    rule {
      source_labels = ["instance"]
      action = "replace"
      replacement = "development"
      target_label = "instance"
    }
  }

It would be neat to have a consistent prometheus.relabel configuration which could change the instance label to the cluster.name. I'm trying to reduce duplication where applicable, so that I can have one highly re-usable configuration whilst ensuring that the data sent up to Grafana has the relevant instance label for the relevant dashboards.

Ideally, I could change the above to something like:

prometheus.relabel "redis" {
    // Changed to the "rename instance label and ship"
    forward_to = [prometheus.relabel.instance.receiver]

    rule {
      source_labels = ["__name__"]
      regex         = "redis_blocked_clients|redis_cluster_slots_fail|redis_cluster_slots_pfail|redis_cluster_state|redis_commands_duration_seconds_total|redis_commands_total|redis_connected_clients|redis_connected_slaves|redis_db_keys|redis_db_keys_expiring|redis_evicted_keys_total|redis_keyspace_hits_total|redis_keyspace_misses_total|redis_master_last_io_seconds_ago|redis_memory_fragmentation_ratio|redis_memory_max_bytes|redis_memory_used_bytes|redis_memory_used_rss_bytes|redis_total_system_memory_bytes|redis_up"
      action        = "keep"
    }
  }

I only see this working once per cluster, although it would also be useful to make the instance label more distinguishable for those who deploy multiple instances of the same thing to one cluster:

  • one instance of a thing -> instance_label: <cluster.name>
  • multiple instances of a thing -> instance_label: <cluster.name>-something <-- I have no idea what "something" would be, but perhaps the name of the thing that called the relabel receiver (I don't know the ins and outs of flow mode yet, so I'm still getting to understand it)
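A minimal sketch of the shared component the issue imagines, assuming the cluster name is available to the generated config and that extraConfig pipelines forward to it instead of straight to remote_write (component and label values are illustrative):

// One shared relabel component: rewrite the instance label to the cluster
// name, then ship to Grafana Cloud.
prometheus.relabel "instance" {
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

  rule {
    action       = "replace"
    target_label = "instance"
    // ideally templated from .Values.cluster.name rather than hard-coded
    replacement  = "my-cluster"
  }
}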

Allow basicAuth username/password to be sourced from secrets

We use Terraform to control the configuration of this helm chart, which means we have to check the raw username and password values into our Terraform source code; this is a very bad practice. It would be much safer and cleaner if we had the option of using values from regular Kubernetes Secrets, which can be managed in a much more secure way.
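A sketch of what the values interface might look like; the key names here (existingSecret, usernameKey, passwordKey) are hypothetical and only illustrate the shape of the request, not the chart's actual API:

externalServices:
  prometheus:
    host: "https://prometheus-prod-XX.grafana.net"
    basicAuth:
      # Hypothetical: read credentials from a pre-existing Kubernetes Secret
      # instead of inlining them in values (and therefore in Terraform).
      existingSecret: "grafana-cloud-metrics"
      usernameKey: "username"
      passwordKey: "password"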

Pod logs are missing certain fields

When processing logs, the static config sets these fields, while the flow config in this chart does not:

  • namespace, from __meta_kubernetes_namespace
  • pod, from __meta_kubernetes_pod_name
  • container, from __meta_kubernetes_pod_container_name

Also, the job field is set to <namespace>/<pod>, which we should match.
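A minimal sketch of the relabel rules that would add those fields, assuming the chart's pod-log pipeline relabels pod targets before tailing them (component names are illustrative):

discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  // Match the static config's job label of <namespace>/<pod>
  rule {
    source_labels = ["__meta_kubernetes_namespace", "__meta_kubernetes_pod_name"]
    separator     = "/"
    target_label  = "job"
  }
}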

Add scrape intervals to each metric?

I've heard that there was confusion over the default scrape interval being used when scraping metrics.

The default is 60s, which is set in prometheus.scrape. However, a user not familiar with flow might not know that is the default. They wouldn't necessarily know to dig down into agent documentation.

If we set:

node-exporter:
  scrapeInterval: 60s

And use that in the generated config, it would make it explicit.
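In the generated config that value would simply be passed through to the scrape component, for example (a sketch; component names assumed):

prometheus.scrape "node_exporter" {
  targets         = discovery.relabel.node_exporter.output
  forward_to      = [prometheus.relabel.node_exporter.receiver]
  // Explicit, even though 60s is already the prometheus.scrape default
  scrape_interval = "60s"
}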

HTTP status 404 Not Found: 404 page not found

Hi

I am getting a 400 error for my Pushgateway from the Grafana Agent, but when I write sample data through a curl command it works.

ts=2023-08-30T19:16:22.239618449Z component=prometheus.remote_write.grafana_cloud_prometheus subcomponent=rw level=error remote_name=8117c6 url=https://XXXXXX:XXXXXXX%[email protected]/metrics/job/some_job msg="non-recoverable error" count=564 exemplarCount=0 err="server returned HTTP status 400 Bad Request: snappy: corrupt input"

Incorrect logs parsing

Heyo,

I've been using v0.0.15 of this chart and everything was working OK, both logs and metrics.
After upgrading (I don't remember the exact version, but >v0.1.x) I started receiving strange logs.

Logs look now like that:

<time> <output stream> F <log_line>
2023-08-29T19:19:00.559181246Z stdout F {"jsonLogKey": "value"}

which breaks JSON parsing in Loki.

Tested with the latest version (0.1.15) and the problem still exists.
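For reference, the extra fields are the containerd/CRI log wire format (timestamp, output stream, flags, content). A minimal sketch of handling it in the chart's flow pipeline, assuming a loki.process component sits in front of loki.write (names illustrative):

loki.process "pod_logs" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]

  // Strip the CRI envelope ("<time> <stream> F <log line>") so only the
  // original log line reaches Loki and JSON parsing works again.
  stage.cri {}
}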

Move the cluster label to discovery.relabel

prometheus.relabel can be expensive (less so for keep|drop operations), so a call that just sets the cluster label should be moved to the discovery.relabel components for each metric source.

Things to consider:

  • Does this affect how ServiceMonitor and PodMonitor objects set the cluster label?
  • Should we have a section in the docs about best practice for adding scrape targets?
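A sketch of the shape this would take for one metric source, assuming a static cluster name (the real chart would template it in):

discovery.relabel "kubelet" {
  targets = discovery.kubernetes.nodes.targets

  // Setting the cluster label here happens once per target at discovery
  // time, instead of once per sample inside prometheus.relabel.
  rule {
    action       = "replace"
    target_label = "cluster"
    replacement  = "my-cluster"
  }
}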

Allow for pod logs to use custom discovery list

Currently, there isn't a clean way to get pod logs from a subset of pods (e.g. only from a certain set of namespaces).

We should make a way to use a different discovery component.

Perhaps by giving a custom component:

logs:
  pod_logs:
    target: "discovery.kubernetes.my_pods_for_logs.targets"

extraConfig: |-
  discovery.kubernetes "my_pods_for_logs" {
    role = "pod"
    namespaces {
      names = ["mynamespace"]
    }
  }

`loki.source.kubernetes` is not usable

There's a bug in loki.source.kubernetes where it stops tailing logs after log file rotation. It's actually an issue with the underlying Kubernetes API, with a PR in process, but that will likely take some time and will only work on new enough versions of Kubernetes.

We should move off of loki.source.kubernetes and on to loki.source.file, which requires the Grafana Agent to run in DaemonSet mode so that each node can read its own pod log files.
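A minimal sketch of the file-based pipeline, assuming the agent runs as a DaemonSet with /var/log/pods mounted from the host (component names are illustrative):

discovery.kubernetes "pods" {
  role = "pod"
}

// Map each discovered pod container to its log file path on the node
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
    separator     = "/"
    target_label  = "__path__"
    replacement   = "/var/log/pods/*$1/*.log"
  }
}

local.file_match "pod_logs" {
  path_targets = discovery.relabel.pod_logs.output
}

loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}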

Add a post-install hook to validate the configuration

This would be a Job that runs grafana-agent fmt to check that the configuration is syntactically valid. This would be useful, especially when including .extraConfig, to catch issues at helm install time, rather than waiting for grafana-agent pods to crash.
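A rough sketch of such a hook, assuming the generated config lives in a release-named ConfigMap under a config.river key (names and image tag are illustrative; post-upgrade is included on the assumption that upgrades should be validated too):

apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-validate-config
  annotations:
    # Run after install/upgrade; a non-zero exit fails the release
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: validate
          image: grafana/agent:v0.36.1   # illustrative tag
          command: ["grafana-agent", "fmt", "/etc/agent/config.river"]
          volumeMounts:
            - name: config
              mountPath: /etc/agent
      volumes:
        - name: config
          configMap:
            name: {{ .Release.Name }}-config   # illustrative ConfigMap name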

Loki TenantID should not be treated as a secret

The new tenant id field is loaded as a secret.
loki.write has tenant_id as a string type. You cannot put a secret into a string.

So, either we stop treating tenant id as a secret, or we wrap it in [nonsensitive()](https://grafana.com/docs/agent/latest/flow/reference/stdlib/nonsensitive/).

We should look at how the prometheus tenant id is set, too.
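A minimal sketch of the second option, assuming the tenant ID is read from the chart-managed secret via a remote.kubernetes.secret component (component and key names are illustrative):

loki.write "grafana_cloud_loki" {
  endpoint {
    url = "https://<loki host>/loki/api/v1/push"
    // Values from the secret are the "secret" type; nonsensitive() converts
    // back to a plain string so it can be assigned to tenant_id.
    tenant_id = nonsensitive(remote.kubernetes.secret.logs_service.data["tenantId"])
  }
}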

Need a document: "Scraping metrics from my service"

We should add a document that answers the question "I want to scrape metrics from my pod or service" and walks step-by-step through doing that. We have a section in the main README for adding custom flow config, but that covers only the technical side of the problem, and isn't really helpful if the user isn't familiar with flow components.

The structure of the doc could look like:

  • Introduction
    • If you have an application deployed on Kubernetes with Prometheus metrics...
  • Discovery
    • Is your app running as a pod?
    • Is your app running as a service?
    • Filtering to the right entity and metrics port
  • Scraping
  • Post-processing
    • Filtering the number of metrics
    • Adjusting metrics
  • Sending to prometheus
  • Adding the config to the helm chart
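The worked example in such a document might look like this end to end (a sketch only, assuming a service named my-app in namespace my-namespace exposing metrics on a port named http-metrics):

extraConfig: |-
  // Discovery: find the service's endpoints
  discovery.kubernetes "my_app" {
    role = "endpoints"
    namespaces {
      names = ["my-namespace"]
    }
  }

  // Filtering: keep only the my-app service and its metrics port
  discovery.relabel "my_app" {
    targets = discovery.kubernetes.my_app.targets
    rule {
      source_labels = ["__meta_kubernetes_service_name", "__meta_kubernetes_endpoint_port_name"]
      separator     = "/"
      regex         = "my-app/http-metrics"
      action        = "keep"
    }
  }

  // Scraping and sending: forward to the chart's existing remote_write
  prometheus.scrape "my_app" {
    targets    = discovery.relabel.my_app.output
    forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]
  }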

Inline cluster name

I admit, I was being cute when I put the cluster name in the config map and then mounted that into the agent container.

I think the original idea was to set it in one place, but the cluster name really shouldn't change much.

The cluster name being piggy-backed into the same ConfigMap as the Agent configuration means that it's harder for users to use their own ConfigMaps.

Add option to define write_relabel_config block

The feature that I am missing is to have option to define this block: https://grafana.com/docs/agent/latest/flow/reference/components/prometheus.remote_write/#write_relabel_config-block

My main reason behind this is that I want to be in full control to drop any metrics from any source or service monitor from a single place. So something like this:

writeRelabelConfig: |-
    rule {
      source_labels = ["__name__"]
      regex = "metric_name|another_metric_to_drop"
      action = "drop"
    }

To my knowledge this option should be added to this config file: https://github.com/grafana/k8s-monitoring-helm/blob/main/charts/k8s-monitoring/templates/agent_config/_prometheus.river.txt#L23 and it needs to be configurable with helm values, similar to how you added it in this PR: #92
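In the generated flow config, the value would end up inside the remote_write endpoint, roughly like this (a sketch; component name as used elsewhere in the chart's examples):

prometheus.remote_write "grafana_cloud_prometheus" {
  endpoint {
    url = "https://<prometheus host>/api/prom/push"

    // One write_relabel_config block per rule; applied to every series right
    // before it is sent, regardless of which scrape produced it.
    write_relabel_config {
      source_labels = ["__name__"]
      regex         = "metric_name|another_metric_to_drop"
      action        = "drop"
    }
  }
}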

Add sane defaults to reduce cardinality

The current config will generate close to 10,000 metrics on a nearly empty cluster.

Much of the problem is that something like the cAdvisor scrape for networking will generate a series per pod per interface per device per xxx, easily generating 1,000-2,000 unique metrics.

The reality is that by far the majority of these unique label combinations are never used.
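One possible shape for such a default, shown as a sketch only (a drop rule applied in the cAdvisor pipeline; the metric names below are illustrative examples of high-cardinality network series, not a vetted list):

prometheus.relabel "cadvisor" {
  forward_to = [prometheus.remote_write.grafana_cloud_prometheus.receiver]

  // Drop per-interface network series that are rarely used in dashboards.
  rule {
    source_labels = ["__name__"]
    regex         = "container_network_tcp_usage_total|container_network_udp_usage_total"
    action        = "drop"
  }
}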

Add support for loki.source.podlogs

Adding support for PodLogs would be another place where existing integrations would "just work" and not have to be converted to flow.

Implementation would likely follow a similar flow to the ServiceMonitor and PodMonitor paths.
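A minimal sketch of what the generated config might gain, assuming only a forward_to is required and the selectors stay at their defaults so every PodLogs resource in the cluster is honoured:

// Discover PodLogs custom resources and tail the pods they select,
// analogous to how ServiceMonitor/PodMonitor objects are handled for metrics.
loki.source.podlogs "pod_logs_objects" {
  forward_to = [loki.write.grafana_cloud_loki.receiver]
}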

Error: template: k8s-monitoring/templates/grafana-agent-logs-config.yaml:5:45: executing "k8s-monitoring/templates/grafana-agent-logs-config.yaml" at <index .Subcharts "grafana-agent-logs">: error calling index: index of untyped nil

Cannot install the chart

Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.24.3+k3s1
helm upgrade --kube-context default --install grafana-k8s-monitoring grafana/k8s-monitoring \
    --namespace "monitoring" --create-namespace --values - <<EOF
cluster:
  name: "default"

externalServices:
  prometheus:
    host: "https://prometheus-prod-05-gb-south-0.grafana.net"
    basicAuth:
      username: "1097405"
      password: "xxxxx"

metrics:
  cost: {enabled: false}

opencost: {enabled: false}
EOF
Release "grafana-k8s-monitoring" does not exist. Installing it now.
Error: template: k8s-monitoring/templates/grafana-agent-logs-config.yaml:5:45: executing "k8s-monitoring/templates/grafana-agent-logs-config.yaml" at <index .Subcharts "grafana-agent-logs">: error calling index: index of untyped nil

Kubelet and cAdvisor don't work if the default cluster domain is not used

If the cluster does not use cluster.local, then the kubelet and cAdvisor metric sources will not work correctly.

Both metric sources work by discovering all nodes, then changing the target address and path to:

  • __address__: kubernetes.default.svc.cluster.local:443 (The kubernetes service in the default namespace)
  • __path__: /api/v1/nodes/<node>/proxy/metrics (A path for that service that references the node)

So, while we discover the nodes, we hard-code the kubernetes service.

If the user's cluster's internal DNS uses something like kubernetes.default.svc.my-prod-cluster.local:443, the hard-coded service address will be wrong.
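For reference, the relabeling in question looks roughly like this (a sketch; the path is presumably set via __metrics_path__, the Prometheus path label, and the cluster.local suffix is the part that would need to be configurable):

discovery.relabel "kubelet" {
  targets = discovery.kubernetes.nodes.targets

  // Every node target is rewritten to go through the API server proxy,
  // with a hard-coded cluster.local suffix...
  rule {
    target_label = "__address__"
    replacement  = "kubernetes.default.svc.cluster.local:443"
  }

  // ...while the node name is preserved in the proxy path.
  rule {
    source_labels = ["__meta_kubernetes_node_name"]
    target_label  = "__metrics_path__"
    replacement   = "/api/v1/nodes/$1/proxy/metrics"
  }
}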

Add ability to use existing ConfigMap

If a user has a highly customized config, they may want to use this chart to deploy the tools, but not deploy the config map.

Add the ability to use an existing ConfigMap. Potential values file:

configMap:
  create: true
