
dashboards's People

Contributors

alexshtin, antmendoza, ardagan, dnr, fossabot, mastermanu, mcbryde, mikejoh, prathyushpv, quinncuatro, robholland, samarabbas, timsimmons, tsurdilo, underrun, vitarb

dashboards's Issues

[Bug] Current sdk-general dashboard does not work for metrics emitted from Java SDK

What are you really trying to do?

Describe the bug

I am setting up a dashboard for SDK metrics with the Java SDK and noticed that importing sdk-general.json into Grafana does not show any metrics.

  • the counter metric names used in the dashboard do not have the _total suffix that the Prometheus exporter appends
  • the Java SDK emits histogram metrics in the format ${metric}_seconds_bucket, while the dashboard uses ${metric}_bucket
  • the dashboard has a panel querying the metric activity_endtoend_latency, which does not exist

Should we fix the existing dashboard, or create a new one for the Java SDK and later check whether the same changes work for other SDKs? I have not yet tested other SDKs, but I plan to do so if it helps.
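
For illustration, here is how one of the affected histogram panels could be adjusted for the Java SDK's naming. This is a sketch assuming the panel computes a percentile from the workflow end-to-end latency histogram; the exact expressions in sdk-general.json may differ.

    # As shipped in the dashboard (no _seconds in the metric name):
    histogram_quantile(0.95, sum(rate(temporal_workflow_endtoend_latency_bucket[5m])) by (le))

    # Adjusted to the Java SDK's Prometheus naming:
    histogram_quantile(0.95, sum(rate(temporal_workflow_endtoend_latency_seconds_bucket[5m])) by (le))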

Minimal Reproduction

Environment/Versions

  • OS and processor: [e.g. M1 Mac, x86 Windows, Linux]
  • Temporal Version: [e.g. 1.14.0?] and/or SDK version
  • Are you using Docker or Kubernetes or building Temporal from source?

Additional context

Create USE method dashboard for clients

This should include graphs detailing utilisation for application workers, such as poller counts, slots available vs. configured, and sticky cache hit rates. It should probably also include poll latency, as this is closely tied to poller configuration rather than to system latencies caused by persistence and so forth.
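
As a rough sketch of the kind of panels this could contain, assuming the SDK exposes worker slot and sticky cache metrics under these names (names and labels may vary by SDK and version):

    # Task slots currently available, to compare against the configured maximum
    sum(temporal_worker_task_slots_available) by (namespace, task_queue, worker_type)

    # Sticky cache hit rate
    sum(rate(temporal_sticky_cache_hit[5m]))
      / (sum(rate(temporal_sticky_cache_hit[5m])) + sum(rate(temporal_sticky_cache_miss[5m])))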

Create RED method dashboard for history service

This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Where possible, downstream services causing frontend latency/errors should be highlighted. It should not include a breakdown of downstream latency/errors; that should be left to the dashboard of the relevant service (or persistence).
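
As a starting point, a hedged sketch of a request-rate panel, assuming the server exports request counters as service_requests with service_name and operation labels (label names differ between server versions, as noted in a later issue):

    # Request rate per operation for the history service
    sum(rate(service_requests{service_name="history"}[1m])) by (operation)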

[Bug] Metrics exported by the Temporal client's metrics handler do not match the ones used in the SDK metrics JSON file

What are you really trying to do?

I'm load testing Temporal to find the metrics that can tell me how my self-hosted Temporal deployment behaves at scale. All of the Temporal SDK metrics are listed in Grafana's metrics browser, but when I try to visualise them by importing the SDK dashboard JSON file, most of the charts are just empty (No data).

Only after changing the metric names used in the expressions am I able to see these graphs on the dashboard.

Describe the bug

The metric names exported by the Temporal client's metrics handler do not match the ones used in the JSON file for the Grafana dashboard.

After importing the SDK dashboard I see this:

[screenshot: dashboard panels showing no data]

After changing the metric names used in the expressions:

[screenshot: dashboard panels populated with data]

Environment/Versions

I'm running Temporal on my local machine using Docker; Grafana and Prometheus have also been set up with Docker.
I'm using Go, and these are the Temporal SDK versions I'm using:

go.temporal.io/api v1.19.1-0.20230322213042-07fb271d475b
go.temporal.io/sdk v1.22.1
go.temporal.io/sdk/contrib/tally v0.2.0

Additional context

I wanted to know if I'm doing something wrong here, or if I need to make additional changes in order to see the metrics using the dashboard JSON file.

Attaching the replacement keys I used to get these metrics showing up in the Grafana dashboard, as it may be helpful for someone else.

{
    "existing_key_in_json_file": "needs to be replaced with below values",

    "temporal_request": "temporal_request_total",
    "temporal_request_latency_bucket": "temporal_request_latency_seconds_bucket",
    "temporal_workflow_completed": "temporal_workflow_completed_total",
    "temporal_workflow_failed": "temporal_workflow_failed_total",
    "temporal_workflow_endtoend_latency_bucket": "temporal_workflow_endtoend_latency_seconds_bucket",
    "temporal_workflow_task_queue_poll_succeed": "temporal_workflow_task_queue_poll_succeed_total",
    "temporal_workflow_task_queue_poll_empty": "temporal_workflow_task_queue_poll_empty_total",
    "temporal_workflow_task_schedule_to_start_latency_bucket": "temporal_workflow_task_schedule_to_start_latency_seconds_bucket",
    "temporal_workflow_task_execution_latency_bucket": "temporal_workflow_task_execution_latency_seconds_bucket",
    "temporal_workflow_task_replay_latency_bucket": "temporal_workflow_task_replay_latency_seconds_bucket",
    "temporal_activity_execution_latency_count": "temporal_activity_execution_latency_seconds_count",
    "temporal_activity_execution_failed": "temporal_activity_execution_failed_total",
    "temporal_activity_execution_latency_bucket": "temporal_activity_execution_latency_seconds_bucket",
    "temporal_activity_poll_no_task": "temporal_activity_poll_no_task_total",
    "temporal_activity_schedule_to_start_latency_bucket": "temporal_activity_schedule_to_start_latency_seconds_bucket"
}

Create RED method dashboard for persistence

This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Operations should be grouped to indicate likely symptoms and affected upstream services (those responsible for creating/updating workflow state, those related to task tracking, etc.).
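
A possible latency panel, sketched under the assumption that persistence operations are exported as a persistence_latency histogram with an operation label (the exact metric name and suffix depend on the server version and Prometheus configuration):

    # Persistence operation latency, p95 per operation
    histogram_quantile(0.95, sum(rate(persistence_latency_bucket[5m])) by (operation, le))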

Create RED method dashboard for worker service

This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Where possible, downstream services causing frontend latency/errors should be highlighted. It should not include a breakdown of downstream latency/errors; that should be left to the dashboard of the relevant service (or persistence).
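
A sketch of an error-ratio panel, assuming the server exports service_requests and an error counter such as service_error_with_type (counter names vary by server version):

    # Error ratio for the internal worker service
    sum(rate(service_error_with_type{service_name="worker"}[5m]))
      / sum(rate(service_requests{service_name="worker"}[5m]))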

[Feature Request] Workflow metrics should count each instance

Is your feature request related to a problem? Please describe.

I tried the server dashboard example and couldn't work out what the Workflow Completion Overview represents from a business operational point of view. I understand that this is because of the rate function. The dashboard displays workflow metrics the same way it would display HTTP requests.

Describe the solution you'd like

Removing the rate function from the workflow overviews would allow the Y axis to show the count of workflow executions instead of a per-second rate, which is only useful for performance visibility.
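
For example, assuming the panel currently rates a workflow completion counter (called workflow_success here for illustration; the exact metric and label names depend on the server version), the per-second rate could be replaced with a windowed count:

    # Current style: completions per second (performance view)
    sum(rate(workflow_success[5m])) by (namespace)

    # Proposed style: number of completed executions over the selected time range (business view)
    sum(increase(workflow_success[$__range])) by (namespace)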

Additional context

Alternatively, the dashboard could keep a performance view (clearly named as such) alongside the business operations view.

Create RED method dashboard for clients

This should include graphs with suggested alert levels for user-affecting values such as errors and latencies. It should not include application-level errors such as workflow failures.
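
A minimal sketch, assuming the SDK emits temporal_request, temporal_request_failure and temporal_request_latency (suffixes such as _total and _seconds depend on the SDK and exporter, as the other issues here show):

    # Client request error ratio
    sum(rate(temporal_request_failure[5m])) / sum(rate(temporal_request[5m]))

    # Client request latency, p95
    histogram_quantile(0.95, sum(rate(temporal_request_latency_bucket[5m])) by (le))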

Create RED method dashboard for task queue

This would mainly be used as a template to amend for a given task queue and would include the workflows, activities and workers that use the queue in question. We may lack the labelling required to achieve a clear picture here; investigation is required.
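
A sketch of one such templated panel, assuming the SDK metrics carry a task_queue label and the dashboard defines a $task_queue variable:

    # Workflow task schedule-to-start latency (p95) for the selected task queue
    histogram_quantile(0.95,
      sum(rate(temporal_workflow_task_schedule_to_start_latency_bucket{task_queue="$task_queue"}[5m])) by (le))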

Create RED method dashboard for workflows

This would mainly be used as a template to amend for a given workflow, perhaps adding monitoring for third parties that the workflow relies on, multiple relevant activity task queues, and so forth.
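
A sketch of a templated panel, assuming the SDK attaches a workflow_type label and the dashboard defines a $workflow_type variable:

    # Completed vs failed executions for the selected workflow type
    sum(rate(temporal_workflow_completed{workflow_type="$workflow_type"}[5m]))
    sum(rate(temporal_workflow_failed{workflow_type="$workflow_type"}[5m]))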

[Bug] Incorrect Prometheus metric label names in the Temporal server metrics Grafana dashboard

We are trying to integrate the Temporal server metrics Grafana dashboard on our self-managed Grafana server, and some of the charts are not displaying any data.

Description

In the server metrics Grafana dashboard there is a section (row) with information about workflow completion stats, and currently none of those charts display any data, even though valid metrics are being published. After taking a closer look at the Prometheus queries, all of them use sum aggregations on the label name "temporal_namespace", but the Temporal server pods publish metrics with the label name "namespace". Because of this mismatch, none of the charts under this section display anything.

Example metric from worker pod:
[screenshot: example metric exported by a worker pod]

Temporal server version: 1.19.0
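
For illustration, the kind of change that makes these panels work, sketched against an assumed completion counter named workflow_success (the same pattern applies to the other workflow completion metrics):

    # Dashboard query as shipped (the label does not exist on the server metrics):
    sum(rate(workflow_success[1m])) by (temporal_namespace)

    # Adjusted to the label actually published by the server pods:
    sum(rate(workflow_success[1m])) by (namespace)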

Create RED method dashboard for frontend service

This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Where possible, downstream services causing frontend latency/errors should be highlighted. It should not include a breakdown of downstream latency/errors; that should be left to the dashboard of the relevant service (or persistence).
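
A sketch of a latency panel, assuming the server exports a service_latency histogram with service_name and operation labels:

    # Frontend request latency, p95 per operation
    histogram_quantile(0.95, sum(rate(service_latency_bucket{service_name="frontend"}[5m])) by (operation, le))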

[Bug] Warning after importing sdk-general.json

What are you really trying to do?

A customer reports that after importing this dashboard they see the following message:

PromQL info: metric might not be a counter, name does not end in _total/_sum/_count/_bucket: "temporal_request"
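
This matches the counter-suffix mismatch described in the other SDK dashboard issues; a sketch of the corresponding fix, assuming the panel rates the client request counter:

    # Query shipped in sdk-general.json (triggers the PromQL info message):
    sum(rate(temporal_request[5m]))

    # Renamed to the counter actually exposed by the Prometheus exporter:
    sum(rate(temporal_request_total[5m]))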

Describe the bug

Minimal Reproduction

Environment/Versions

  • OS and processor: [e.g. M1 Mac, x86 Windows, Linux]
  • Temporal Version: [e.g. 1.14.0?] and/or SDK version
  • Are you using Docker or Kubernetes or building Temporal from source?

Additional context

Create RED method dashboard for matching service

This should include graphs and suggested alert levels for metrics affecting users, such as error rates and latency. Where possible, downstream services causing frontend latency/errors should be highlighted. It should not include a breakdown of downstream latency/errors; that should be left to the dashboard of the relevant service (or persistence).
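
A sketch of an error panel, assuming the server exports an error counter such as service_error_with_type with service_name and operation labels:

    # Matching service errors per operation
    sum(rate(service_error_with_type{service_name="matching"}[5m])) by (operation)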
