temporalio / dashboards
Temporal Dashboards
License: MIT License
I am setting up a dashboard for SDK metrics with Java and noticed that importing sdk-general.json into Grafana does not show any metrics.
The Java SDK exports counters with a `_total` suffix and histogram buckets as `${metrics}_seconds_bucket`, while the dashboard uses `${metrics}_bucket`.
Should we fix the dashboard, or create a new one for the Java SDK and later check whether the same fix works for other SDKs? I have not yet tested other SDKs, but my plan is to do so if it helps.
This should include graphs detailing utilisation for application workers, such as poller counts, slots available vs. configured, and sticky cache hit rates. It should probably also include poll latency, as this is closely connected to poller configuration rather than to system latencies caused by persistence and so forth.
This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Where possible, downstream services causing frontend latency/errors should be highlighted. It should not include a breakdown of downstream latency/errors; that should be left to the dashboard of the relevant service (or persistence).
I'm load testing Temporal to see which metrics can tell me how my self-hosted deployment behaves at scale. All of the Temporal SDK metrics are listed in Grafana's metrics browser, but when I try to visualise them by importing the SDK dashboard JSON file, most of the charts are just empty (no data).
Only after changing the names of these metrics in the expressions am I able to see the graphs on the dashboard.
The names of the metrics exported by the Temporal client's metrics handler do not match the ones used in the JSON file for the Grafana dashboard.
Importing the SDK metrics I see this
After changing the name of metrics used in expression
I'm running Temporal on my local machine using Docker; Grafana and Prometheus are also set up with Docker.
I'm using Go, and these are the Temporal SDK versions I'm using:
go.temporal.io/api v1.19.1-0.20230322213042-07fb271d475b
go.temporal.io/sdk v1.22.1
go.temporal.io/sdk/contrib/tally v0.2.0
I wanted to know if I'm doing something wrong here, or if I need to make some additional changes in order to see the metrics using the dashboard JSON file.
Attaching the replacement keys I used to get these metrics to show up in the Grafana dashboard, as it can be helpful for someone else.
{
"existing_key_in_json_file": "needs to be replaced with below values",
"temporal_request": "temporal_request_total",
"temporal_request_latency_bucket": "temporal_request_latency_seconds_bucket",
"temporal_workflow_completed": "temporal_workflow_completed_total",
"temporal_workflow_failed": "temporal_workflow_failed_total",
"temporal_workflow_endtoend_latency_bucket": "temporal_workflow_endtoend_latency_seconds_bucket",
"temporal_workflow_task_queue_poll_succeed": "temporal_workflow_task_queue_poll_succeed_total",
"temporal_workflow_task_queue_poll_empty": "temporal_workflow_task_queue_poll_empty_total",
"temporal_workflow_task_schedule_to_start_latency_bucket": "temporal_workflow_task_schedule_to_start_latency_seconds_bucket",
"temporal_workflow_task_execution_latency_bucket": "temporal_workflow_task_execution_latency_seconds_bucket",
"temporal_workflow_task_replay_latency_bucket": "temporal_workflow_task_replay_latency_seconds_bucket",
"temporal_activity_execution_latency_count": "temporal_activity_execution_latency_seconds_count",
"temporal_activity_execution_failed": "temporal_activity_execution_failed_total",
"temporal_activity_execution_latency_bucket": "temporal_activity_execution_latency_seconds_bucket",
"temporal_activity_poll_no_task": "temporal_activity_poll_no_task_total",
"temporal_activity_schedule_to_start_latency_bucket": "temporal_activity_schedule_to_start_latency_seconds_bucket"
}
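The replacements above can be applied to the dashboard JSON mechanically rather than by hand. A minimal sketch, assuming a standard Grafana dashboard layout with `targets[].expr` on each panel (the `RENAMES` table is abbreviated from the mapping above; the helper functions are hypothetical, not part of this repo):

```python
import json
import re

# Old metric name -> new metric name, abbreviated from the mapping above.
RENAMES = {
    "temporal_request": "temporal_request_total",
    "temporal_request_latency_bucket": "temporal_request_latency_seconds_bucket",
    "temporal_workflow_completed": "temporal_workflow_completed_total",
}

def rename_metrics(expr: str) -> str:
    """Replace old metric names in a PromQL expression with new ones.

    The \\b word boundaries ensure that the short name `temporal_request`
    does not match inside the longer `temporal_request_latency_bucket`;
    replacing longest names first is extra safety."""
    for old in sorted(RENAMES, key=len, reverse=True):
        expr = re.sub(rf"\b{re.escape(old)}\b", RENAMES[old], expr)
    return expr

def fix_dashboard(path: str) -> None:
    """Rewrite every panel expression in a Grafana dashboard JSON file."""
    with open(path) as f:
        dashboard = json.load(f)
    for panel in dashboard.get("panels", []):
        for target in panel.get("targets", []):
            if "expr" in target:
                target["expr"] = rename_metrics(target["expr"])
    with open(path, "w") as f:
        json.dump(dashboard, f, indent=2)
```

Running `fix_dashboard("sdk-general.json")` would rewrite the file in place; extend `RENAMES` with the remaining entries from the mapping above.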
Those dashboards use a fixed 5m rate window; we should update them to use $__rate_interval (from @dynajoe).
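That change can be scripted across a dashboard's expressions. A hedged sketch, assuming the fixed `5m` window only ever appears inside range selectors like `rate(m[5m])` (the helper is illustrative, not part of this repo):

```python
import re

def use_rate_interval(expr: str) -> str:
    """Replace fixed 5m range-vector windows with Grafana's
    $__rate_interval variable, e.g. rate(m[5m]) -> rate(m[$__rate_interval]),
    so the window adapts to the dashboard's resolution and scrape interval."""
    return re.sub(r"\[5m\]", "[$__rate_interval]", expr)
```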
From Wenquan: the dashboard repo is not really up to date due to a size-limit issue, so you may find the three Grafana JSON files useful:
https://temporalio.slack.com/archives/G01NJBKLG5V/p1614038212013700
This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Operations should be grouped to indicate likely symptoms and affected upstream services (those responsible for creating/updating workflow state, those related to task tracking, etc.).
Create a dashboard using those metrics for Temporal Cloud users.
I tried the server dashboard example and couldn't work out what the Workflow Completion Overview means from a business-operational point of view. I understood that this is because of the rate function: the dashboard displays workflow metrics as it would HTTP requests.
Removing the rate function from the workflow overviews would allow the Y axis to show the unique count of workflow executions instead of a per-second rate, which is only useful for performance visibility.
Alternatively, it could keep a (well-named) performance view plus a business-operations view.
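For such a business view, PromQL's `increase` over the dashboard's time range gives a count of completions rather than a per-second rate. A sketch of what that panel query might look like (the `namespace` grouping is illustrative):

```
# Count of workflows completed over the selected dashboard time range,
# instead of a per-second rate
sum by (namespace) (increase(temporal_workflow_completed_total[$__range]))
```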
This should include graphs with suggested alert levels for user-affecting values such as errors and latencies. It should not include application-level errors such as workflow failures.
This would mainly be used as a template to adapt for a given task queue, and would include the workflows, activities, and workers that use the queue in question. We may lack the labelling required to achieve a clear picture here; investigation is required.
Hi! Thanks for the ready-to-use dashboards!
Any guidance on important metrics for alerting, or maybe even ready-to-use Prometheus rules? :)
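No official rules ship with this repo as far as I know, but here is a minimal sketch of a Prometheus alerting rule built on metrics the dashboards already use; the 5% threshold, 10m window, and labels are illustrative assumptions, not recommendations:

```yaml
groups:
  - name: temporal-sdk-alerts
    rules:
      # Fires when the workflow failure ratio stays elevated; the threshold
      # and durations are illustrative and should be tuned per deployment.
      - alert: TemporalWorkflowFailureRatioHigh
        expr: |
          sum(rate(temporal_workflow_failed_total[5m]))
            / sum(rate(temporal_workflow_completed_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Workflow failure ratio above 5% for 10 minutes"
```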
This would mainly be used as a template to adapt for a given workflow, perhaps adding monitoring for third parties the workflow relies on, multiple relevant activity task queues, and so forth.
We are trying to integrate the Temporal server metrics Grafana dashboard on our self-managed Grafana server, and some of the charts are not displaying any data.
In the server metrics dashboard there is a section (row) with workflow completion stats, and currently none of those charts display any data, even with valid metrics. After taking a closer look at the Prometheus queries, all of them aggregate (sum) by the label name `temporal_namespace`, but the Temporal server pods publish metrics with the label name `namespace`; because of this mismatch, none of the charts under this section display anything.
Example metric from worker pod:
Temporal server version: 1.19.0
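One way to reconcile the mismatch without touching the server is to rewrite the label name in the dashboard queries. A minimal sketch, using the two label names from the report above (the helper is hypothetical, not part of this repo):

```python
import re

def fix_namespace_label(expr: str) -> str:
    """Rewrite the label name `temporal_namespace` to `namespace` so the
    dashboard queries match the labels the server pods actually publish."""
    return re.sub(r"\btemporal_namespace\b", "namespace", expr)
```

Alternatively, a `metric_relabel_configs` entry in the Prometheus scrape config could copy `namespace` into `temporal_namespace`, leaving the dashboard untouched.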
This graph https://github.com/temporalio/dashboards/blob/master/server/server-general.json#L554 has no query provided; I'm wondering what the query is meant to be?
A customer reports that after importing this dashboard they see the following message:
PromQL info: metric might not be a counter, name does not end in _total/_sum/_count/_bucket: "temporal_request"
This would include utilisation metrics from Kubernetes to help diagnose memory and CPU bottlenecks.