linkerd / linkerd-viz

Top-line service metrics dashboard for Linkerd 1.

License: Apache License 2.0

Shell 53.38% Dockerfile 46.62%

linkerd-viz's Issues

Upgrade to 0.9.0, support new metrics labels

Once linkerd/linkerd#938 ships, we'll need to update/remove the metrics relabeling rules in our prometheus configs that we use to populate the service label. Using linkerd-provided labels will make us less reliant on regexes.

auto-gen more files at boot time

we currently do port replacements in grafana.ini and prometheus_data_source.json at boot time. it may be simpler just to generate these files from scratch.

reference:
https://github.com/BuoyantIO/linkerd-viz/blob/master/linkerd-viz#L15

grafana was not installed

In terminal:
linkerd viz install | kubectl apply -f -
then:
kubectl get pods -n linkerd-viz
NAME                            READY   STATUS    RESTARTS   AGE
metrics-api-57c76d5c5c-jztcg    2/2     Running   0          56m
prometheus-5bcd95c8fc-vsrhb     2/2     Running   0          56m
tap-6694cbcb97-lsntm            2/2     Running   0          17m
tap-injector-779b797dbf-lrtj6   2/2     Running   0          17m
web-7bff7b8d89-7d9b8            2/2     Running   0          56m

consolidate dcos- and k8s-specific files

dcos and k8s files live in root. as linkerd-viz supports more platforms, it may be better to put them into subdirectories.

improvement for documentation regarding grafana

Hi,

the parameter grafana.url is not well documented.

I installed linkerd-viz and grafana from the helm chart in the same namespace, as described.
So for grafana.url I configured http://grafana, http://grafana/, but I get only a 502 in return for the grafana links.

What happened?
Linkerd-viz does something with the url and produces error log entries like
2023/03/14 10:09:15 http: proxy error: dial tcp: lookup http://grafana: no such host
2023/03/14 10:19:15 http: proxy error: dial tcp: lookup http://grafana/: no such host
depending on the configuration.

So I ended up reading https://linkerd.io/2.12/tasks/grafana/index.html and using the configuration grafana:80 from the example.

Something is wrong in the rewriting or the documentation needs an update.
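The log lines above suggest the configured value is passed to a dialer as a host:port address rather than parsed as a URL. As a rough sketch of what normalizing such a value could look like (to_host_port is a hypothetical helper, not part of linkerd-viz, and the default port of 80 is an assumption):

```python
from urllib.parse import urlsplit


def to_host_port(value: str, default_port: int = 80) -> str:
    """Reduce a URL or a bare address to host:port form (hypothetical helper)."""
    # urlsplit only recognises a host when a "//" separator is present,
    # so bare addresses such as "grafana:80" get one prepended.
    parts = urlsplit(value if "://" in value else "//" + value)
    host = parts.hostname
    if not host:
        raise ValueError(f"no host found in {value!r}")
    return f"{host}:{parts.port or default_port}"


if __name__ == "__main__":
    for candidate in ("http://grafana", "http://grafana/", "grafana:80"):
        print(candidate, "->", to_host_port(candidate))
```

All three inputs reduce to the same address, which is the form the linked documentation example uses.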

Best...
Uwe

Limit linkerd instances search to a particular Kubernetes namespace?

Hi,

I'm trying to deploy linkerd-viz in a Kubernetes cluster where I have rights on a single namespace, and I'm getting the following errors:

Failed to list *v1.Service: User \"system:serviceaccount:my-namespace:viewer\" cannot list all services in the cluster"
Failed to list *v1.Pod: User \"system:serviceaccount:my-namespace:viewer\" cannot list all pods in the cluster" 
Failed to list *v1.Endpoints: User \"system:serviceaccount:my-namespace:viewer\" cannot list all endpoints in the cluster"

It seems that Prometheus is trying to list services / pods in the whole cluster. Is there a way to have it restrict itself to the namespace my-namespace only? I was thinking that using a __meta_kubernetes_namespace meta label could do the trick, but I'm unsure whether that will change the API call that Prometheus does, or just filter the services afterwards.

Note that linkerd-viz runs under a viewer service account that can list services / pods inside my namespace.
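For reference, Prometheus's kubernetes_sd_configs has an optional namespaces.names list that limits discovery to the named namespaces, in Prometheus versions that support that field. A minimal sketch of such a scrape config, built as a Python dict and printed as JSON (which is also valid YAML); the job name and the helper namespaced_sd_config are assumptions for illustration, not the linkerd-viz config:

```python
import json


def namespaced_sd_config(namespace: str, role: str = "endpoints") -> dict:
    """Build a scrape config whose Kubernetes discovery is limited to one namespace."""
    return {
        "job_name": "linkerd",  # assumed job name, for illustration only
        "kubernetes_sd_configs": [
            {
                "role": role,
                # Restrict service discovery to the listed namespaces.
                "namespaces": {"names": [namespace]},
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(namespaced_sd_config("my-namespace"), indent=2))
```

Whether this is enough depends on the service account having list and watch rights on the discovered resources within that namespace.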

Thanks!

How to integrate auth?

@siggy The viz looks great. I am deploying a cluster for company on GKE, and really need to be able to secure the public facing Grafana auth to use the same Google Account users.

While I can normally do this from config files and UI, I'm struggling with this setup. It logs in anonymous, but the grafana admin functions are not available. I assume this is a feature of the anonymous logins, which I haven't used before.

Do you have a reference config to get it running with google cloud IAM you could share?

support simple-proxy linkerd modes

Currently linkerd-viz requires a linker-to-linker configuration to display metrics:
https://github.com/BuoyantIO/linkerd-examples/blob/master/dcos/linker-to-linker/linkerd-config.yml

Modify linkerd-viz to display metrics when in a simple-proxy configuration:
https://github.com/BuoyantIO/linkerd-examples/blob/master/dcos/simple-proxy/linkerd-config.yml

Configurable scrape_interval is broken, breaks 1m Grafana graphs

The changes in #33 made for a surprise, when the success rate and request volume graphs became empty.

The default of 1m matches the query in Grafana, so the graphs become empty when they don't have two data points. I'd recommend a different default, like 30s.

However, this can't be overridden, because the replacement with $SCRAPE_INTERVAL fails.

The root of the issue is here:

sed -i "" "s@scrape_interval:.*@scrape_interval: $SCRAPE_INTERVAL@" $PROMETHEUS_CONF
sed -i "" "s@ evaluation_interval:.*@ evaluation_interval: $SCRAPE_INTERVAL@" $PROMETHEUS_CONF

With sed -i "", GNU sed does not take the "" as the backup suffix the way BSD sed does; it parses it as a separate argument, and the sed command fails. So the replacements never happen, and the command is only ever executed with the default "1m" in prometheus-$PLATFORM.yml (whichever file that turns out to be, "k8s" in my case).

This should be sed -i"", with no space. (verified inside the container)
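To sidestep the GNU/BSD difference in sed -i entirely, the same substitution can be sketched in Python. This swaps in a different technique from the project's shell script; set_intervals is a hypothetical helper, and unlike the sed expressions it only matches keys at the start of a line (after indentation):

```python
import re

# Matches an (optionally indented) scrape_interval or evaluation_interval key
# and everything after it on the same line.
_INTERVAL_LINE = re.compile(
    r"^([ \t]*(?:scrape|evaluation)_interval:).*$", re.MULTILINE
)


def set_intervals(config_text: str, interval: str) -> str:
    """Return config_text with both interval values replaced by `interval`."""
    return _INTERVAL_LINE.sub(lambda m: f"{m.group(1)} {interval}", config_text)


if __name__ == "__main__":
    sample = "global:\n  scrape_interval: 1m\n  evaluation_interval: 1m\n"
    print(set_intervals(sample, "30s"))
```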

It would be a two-character PR, but the default causing empty graphs is also a surprise, so the defaults should probably be adjusted down to ensure the irate() call has data.
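To make the "two data points" requirement concrete, here is a simplified model of how many scrapes land inside a query window. It assumes evenly spaced scrapes and a strict "younger than the window" cutoff; the function names are made up for this sketch and it does not reproduce Prometheus's exact boundary handling:

```python
def samples_in_window(scrape_interval_s: float, window_s: float,
                      latest_sample_age_s: float) -> int:
    """Count evenly spaced samples younger than window_s seconds."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape interval must be positive")
    count = 0
    age = latest_sample_age_s
    while age < window_s:
        count += 1
        age += scrape_interval_s  # step back to the previous scrape
    return count


def has_rate_data(scrape_interval_s: float, window_s: float,
                  latest_sample_age_s: float) -> bool:
    """A rate over a window needs at least two samples inside it."""
    return samples_in_window(scrape_interval_s, window_s, latest_sample_age_s) >= 2


if __name__ == "__main__":
    # 1m scrape interval against a 1m window: only one sample fits.
    print(has_rate_data(60, 60, 10))
    # 30s scrape interval against the same window: two samples fit.
    print(has_rate_data(30, 60, 10))
```

Under this model, any scrape interval at or above the query window leaves at most one sample in range, which matches the empty graphs described above.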

DC/OS application group support

linkerd-viz does not properly aggregate metrics from DC/OS applications deployed as part of groups.

For example, an app named my-group/webapp yields metrics like this:

rt:outgoing:dst:id:_:io_l5d_marathon:my_group:webapp:requests 11

the last two metrics_relabel steps defined at:
https://github.com/BuoyantIO/linkerd-viz/blob/master/dcos/prometheus-dcos.yml#L32

.... cause the metric to be rewritten as:

linkerd:incoming:webapp:requests{instance="10.0.2.164",job="linkerd",service="my_group"}

...when the expected metric should be:

linkerd:incoming:requests{instance="10.0.2.164",job="linkerd",service="my_group/webapp"}
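A rough sketch of the intended split, written in Python rather than as Prometheus relabel rules. It assumes the stat name is the single trailing segment and that every segment between io_l5d_marathon and the stat belongs to the app path; split_marathon_metric is a hypothetical helper, not part of linkerd-viz:

```python
import re

# router, app path (one or more ":"-separated segments), and trailing stat name
_MARATHON_METRIC = re.compile(
    r"^rt:(?P<router>[^:]+):dst:id:_:io_l5d_marathon:(?P<path>.+):(?P<stat>[^:]+)$"
)


def split_marathon_metric(name: str) -> tuple[str, str]:
    """Return (service, stat) for a marathon-namer metric name."""
    match = _MARATHON_METRIC.match(name)
    if match is None:
        raise ValueError(f"unrecognised metric name: {name!r}")
    # Group and app segments are joined with "/" to rebuild the app id.
    service = match.group("path").replace(":", "/")
    return service, match.group("stat")


if __name__ == "__main__":
    print(split_marathon_metric(
        "rt:outgoing:dst:id:_:io_l5d_marathon:my_group:webapp:requests"
    ))
```

Stats whose names span several segments would need a more specific pattern than this one.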

linkerd-viz assumes marathon master runs on localhost

In prometheus-mesos-marathon.yml line 11:

marathon_sd_configs:
  - servers:
    - 'http://localhost:8080'

should be changed to:

marathon_sd_configs:
  - servers:
    - 'http://marathon.mesos:8080'

To reflect other linkerd mesos-marathon examples.

You can also modify linkerd-viz.json under mesos-marathon to use the add-host parameter if that DNS entry will not resolve:

{
  "id": "linkerd-viz",
  "instances": 1,
  "cpus": 1.0,
  "mem": 512.0,
  "acceptedResourceRoles": ["*", "slave_public"],
  "maintainer": "[email protected]",
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "buoyantio/linkerd-viz:latest",
      "parameters": [
        {
          "key": "add-host",
          "value": "marathon.mesos:192.168.250.11"
        }
      ],
      "forcePullImage": true,
      "network": "HOST",
      "privileged": true
    }
  },
  "args": ["mesos-marathon"],
  ...

display connection and client pool metrics

Additional stats around connection counts and client pools can be helpful in diagnosing performance issues. Consider adding these to the dashboard.

relevant connection stats:

rt:client:connections
rt:client:connects
rt:server:connections
rt:server:connects

relevant client pool stats:

rt:client:pool_cached
rt:client:pool_num_too_many_waiters
rt:client:pool_num_waited
rt:client:pool_size
rt:client:pool_waiters

run prometheus as pid 1

modify linkerd-viz executable to exec prometheus, ensure it runs as pid 1.

relevant:
https://www.ctl.io/developers/blog/post/gracefully-stopping-docker-containers/
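The reason exec helps: on POSIX systems it replaces the current process image instead of starting a child, so the PID stays the same. A small self-contained demonstration in Python (an illustration of exec semantics only, not the linkerd-viz launcher; the names are made up for this sketch):

```python
import os
import subprocess
import sys

# Launcher program: print its own PID, then exec a second program
# that prints the PID it ends up running under.
_LAUNCHER = """
import os, sys
print(os.getpid(), flush=True)
os.execvp(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_before_and_after_exec() -> tuple[int, int]:
    """Run the launcher and return the PIDs reported before and after exec."""
    result = subprocess.run(
        [sys.executable, "-c", _LAUNCHER],
        capture_output=True, text=True, check=True,
    )
    before, after = result.stdout.split()
    return int(before), int(after)


if __name__ == "__main__":
    before, after = pids_before_and_after_exec()
    print("same pid:", before == after)
```

In a container entrypoint script, the equivalent is to make the final command an exec so that it takes over the script's PID.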

linkerd health metrics

The current dashboard is very top-level request volume / success rate focused. Consider displaying linkerd health metrics (gc, etc), either on the existing dashboard, or as a separate "health" dashboard.

Consul startup doc error

The Readme for the Consul Deploy says to start Consul with docker in host networking mode, then start the linkerd-viz docker container without host networking mode. This will not work because the linkerd-viz configuration for consul is set to use localhost:8500. The localhost address will not work unless the linkerd-viz container is also run in host networking mode.

DCOS 1.9 linkerd-viz never deploys

I have tried to deploy on DC/OS using the Universe package. The service never deploys or runs. I have looked on each node for the Docker image buoyantio/linkerd-viz, and no image has been pulled. I have also tried creating a service directly from the JSON, with the same result. I'm not sure where else to look to resolve this. It's not urgent, but I would certainly like to see this working in 1.9.

Prometheus does not install under minikube/podman

Problem description

As reported in Slack, prom/prometheus does not install properly if you are running minikube with podman.

Error: ImageInspectError
  Warning  InspectFailed  3m58s (x259 over 128m)  kubelet            Failed to inspect image "prom/prometheus:v2.47.0": rpc error: code = Unknown desc = short-name "prom/prometheus:v2.47.0" did not resolve to an alias and no unqualified-search registries are defined in "/etc/containers/registries.conf"

The reason for this is that, only when running minikube with podman, no unqualified-search registries are defined. If you instead run minikube with docker, this just works without any issues.

Expected behavior

I have to say I'm not entirely sure. I think I would want podman to provide these unqualified search registries out of the box. But on the other hand, maybe it would be good if linkerd was friendly enough to specify the prometheus dependency prefixed with the intended search registry?
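To illustrate what "prefixed with the intended registry" would mean for an image reference, here is a sketch that follows the usual Docker Hub naming convention. qualify_image is a hypothetical helper; neither linkerd nor podman exposes it:

```python
def qualify_image(ref: str, registry: str = "docker.io") -> str:
    """Prefix a short image name with a registry unless it already names one."""
    first, sep, _rest = ref.partition("/")
    # A first path component containing "." or ":" (or equal to "localhost")
    # is conventionally treated as a registry host.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return ref
    if not sep and registry == "docker.io":
        # Single-component names such as "nginx" live under "library/" on Docker Hub.
        return f"{registry}/library/{ref}"
    return f"{registry}/{ref}"


if __name__ == "__main__":
    print(qualify_image("prom/prometheus:v2.47.0"))
```

A fully qualified reference like this does not depend on unqualified-search registries being configured.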

Workaround

Log into minikube with minikube ssh.
Run sudo vi /etc/containers/registries.conf.
Add unqualified-search-registries = ["docker.io", "quay.io"].
Restart minikube with minikube stop && minikube start.

stop using dockerize

In order to support an Automated Build on Docker Hub, the Dockerfile must be self-contained, and not rely on dockerize as a setup step.

https://hub.docker.com/r/buoyantio/linkerd-viz/

Prometheus upgrade requires change to kubernetes_sd_configs role

I think this broke as part of 71138df. The kubernetes_sd_configs scrape config has a role called "endpoints", but linkerd-viz is using the role "endpoint". The role name must have changed when we upgraded prometheus.

consider increasing scrape_interval, or making it configurable

scrape_interval and evaluation_interval are hard-coded at 5s, which is quite frequent. consider increasing this interval, or make it configurable for the user.

Upgrade to Grafana 5, use datasources

Grafana 5 added support for data source provisioning via config file:
http://docs.grafana.org/guides/whats-new-in-v5/#data-sources

This should significantly decrease linkerd-viz's startup complexity, where we can forego hitting Grafana's API to add a data source.

DC/OS service URL

Modify the linkerd-viz DC/OS Universe package to support a DC/OS service URL:
http://<DCOS_URL>/service/linkerd-viz/

Attempted to set the server/root_url in the grafana.ini to the DCOS_URL, but got:

{"message":"Invalid Basic Auth Header"}

Linkerd2-viz chart?

Where's the chart for linkerd2? And don't point me to the docs that tell me to install a CLI on my local system and install it manually there - that's ridiculous. This should be doable via IaC.

Share dashboard configurations on Grafana.com

Grafana.com provides a central repository of popular dashboards. It would be great to see linkerd-health-dashboard.json and linkerd-viz-dashboard.json available there.

Q: communication to/from linkerd with enabled tls

I'm working through some linkerd examples from your blog.
If I use the linkerd-ingress-controller (without using tls) linkerd-viz works.

But if I switch to linkerd-tls-ingress-controller, prometheus doesn't get any data.

Is there a way to adjust the config that it also works with tls enabled communication?

Support more flexible router labels

Right now the linkerd-viz dashboard requires that the router from which it pulls stats be labeled as "incoming", which might not be the case in all setups. We should think about ways to support other labels, possibly using grafana templating.

Missing Network Policy Guidelines

We use Linkerd in a cluster that pretty much blocks every INGRESS/EGRESS not white listed with NetworkPolicies or GlobalNetworkPolicies (via Calico's CRD).

After successfully upgrading Linkerd from 2.9.4 to 2.10.1 we can't figure out what the viz plugin needs, and the fact that it's installed in its own namespace makes all our previous configuration useless...

Can anyone help with some guidelines on how to proceed? What ports are used to where? If a cluster-wide configuration is needed, what would it look like?

support for minutely/hourly graphs

The dashboard today calculates rates per second. For lower velocities, it would be useful, likely via a template variable, to support minutely and hourly rates.
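The conversion itself is a constant multiplier on the per-second rate. A tiny sketch, with the unit names chosen here only for illustration:

```python
# Seconds per display unit; a dashboard template variable could select one of these.
_SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600}


def scale_rate(per_second: float, unit: str) -> float:
    """Convert a per-second rate to a per-minute or per-hour rate."""
    return per_second * _SECONDS_PER_UNIT[unit]


if __name__ == "__main__":
    # 0.25 requests per second expressed per minute and per hour.
    print(scale_rate(0.25, "minute"), scale_rate(0.25, "hour"))
```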
