
amazon-managed-service-for-prometheus-roadmap's Introduction

Amazon Managed Service for Prometheus Product Feature Requests

This is the public space where you can add and influence feature requests for the Amazon Managed Service for Prometheus roadmap, and where all AWS customers can give direct feedback. Knowing about our upcoming products and priorities helps our customers plan.

Go to the feature list now »

FAQs

Q: Why did you build this?

A: We know that our customers are making decisions and plans based on what we are developing, and we want to provide our customers the insights they need to plan.

Q: Why are there no dates on the feature requests?

A: Because job zero is security and operational stability, we can't provide specific target dates for features.

Q: Is everything from the feature list on the roadmap?

A: Feature requests will be considered for the Amazon Managed Service for Prometheus roadmap.

Q: How can I provide feedback or ask for more information?

A: Please open an issue!

Q: How can I request a feature?

A: Please open an issue! You can read about how to contribute here. Community submitted issues will be tagged with proposed and will be reviewed by the service team.

Security

If you think you’ve found a potential security issue with Amazon Managed Service for Prometheus, please DO NOT create an issue for it here but rather follow the instructions in the CONTRIBUTING document or email AWS security directly.

License

The issue content is made available under the Creative Commons Attribution-ShareAlike 4.0 International License, see also the LICENSE-SUMMARY file. The sample code is made available under the MIT-0 license, see also the LICENSE-SAMPLECODE file.


amazon-managed-service-for-prometheus-roadmap's Issues

fix documentation

On the AWS docs page linked below, the sample configuration for receivers has two issues.

https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver-config.html

Firstly, the format is not

key: key, value: value

I've verified that the Prometheus docs are correct.

Secondly, the proposed indentation for sigv4: should not have a leading -; I think that would actually break it. The following is taken from a working config:

  receivers:
    - name: 'opsgenie-sns'
      sns_configs:
        - topic_arn: "replace_arn"
          send_resolved: false
          subject: '{{ template "opsgenie.default.subject" .}}'
          message: ''
          sigv4:
            region: "${region}"
          attributes:
            tags: '{{ template "__opsgenie_tags" .Alerts.Firing }}'
            teams: '{{ template "__opsgenie_teams" .Alerts.Firing }}'
            priority: '{{ template "__opsgenie_priority" .CommonLabels.SortedPairs }}'

Add support for federated querying capability to query multiple workspaces

Context
We plan to shard metrics from a group of services into a predefined workspace. This approach allows us to streamline cost attribution/chargeback to appropriate service team owners. Additionally, this strategy helps us isolate noisy services into individual workspaces for scalability.

Problem
We anticipate that metrics for large platform teams (like Service Networking and Compute Infra) will be spread across different workspaces. It's difficult to set up alerts and build accurate dashboards because AMP doesn't support federated queries that span more than one workspace.

Feature Request
Providing federated query support would allow large customers to seamlessly shard metrics across workspaces while maintaining a consistent user experience.

[Feature] Support for month-over-month and quarter-over-quarter queries

Why

To support capacity planning, SLI/SLO tracking, or long-term performance analysis, AMP needs to support long-term trending queries beyond the 32-day limit that exists today. To make those queries efficient, AMP should support downsampling, automatically reducing the resolution of data past 1 week (1 min resolution), 30 days (5 min resolution), and 6 months (1 hour resolution), and provide the option to keep or drop raw data past the first tier.
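
For illustration, here is a minimal sketch of the kind of quarter-over-quarter comparison this would unlock, written as a recording rule (the http_requests_total metric and the rule name are placeholders, not anything from the service). Today such a query fails because the 90-day offset exceeds the 32-day window.

  groups:
    - name: trend-analysis
      rules:
        - record: job:http_requests:rate5m_qoq_ratio
          # Ratio of the current request rate to the rate 90 days earlier.
          expr: |
            sum by (job) (rate(http_requests_total[5m]))
              /
            sum by (job) (rate(http_requests_total[5m] offset 90d))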

Support for native histograms

There is demand for this feature to be supported in AMP; the Grafana Agent (which we use to remote-write EC2 metrics) already supports it starting with v0.30.0.

enhanced prometheus templates

Please consider adding 'modern' Go template functions to make formatting templates more idiomatic. Functions that I'd propose are:

map, dict, split, splitList, regexMatch, lower, contains, strings, if

Also, a mechanism for testing/validating templates (i.e. triggering a specific rule via an API call to see how it formats and runs) would enhance usability.
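
As a purely hypothetical sketch of what this could look like (lower and splitList are among the proposed functions, not ones AMP supports today; the receiver name, topic ARN, and teams annotation are placeholders):

  receivers:
    - name: 'example-sns'
      sns_configs:
        - topic_arn: 'replace_arn'
          sigv4:
            region: '${region}'
          # Proposed: pipe label values through string helpers.
          subject: '{{ .CommonLabels.alertname | lower }}'
          # Proposed: split a comma-separated annotation into mentions.
          message: '{{ range splitList "," .CommonAnnotations.teams }}@{{ . }} {{ end }}'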

Workspace's service quotas accessible through UI and CLI

It could be advantageous for some customers to have visibility into what the current service quotas are, per workspace. This would especially come in handy if there are multiple workspaces with different data setups. I could imagine this being part of the workspace UI and/or accessible through the AWS CLI.

Managed Prometheus Workspaces Cost allocation based on metric labels

I would like to see provision made for working out the cost of metrics given one or more labels on metrics, much like you would do using Tags on EC2 resources. https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html

The only advice that I got from AWS Support was to create separate workspaces and track costs at that level, but I think this is counter-intuitive. We WANT to have a single workspace so it's simple for our users to create queries across accounts, environments, applications, etc. without having to worry about which workspace the data was ingested into. Given that AMP now supports 500M metrics in a single workspace, I would assume that allocating costs to different departments, teams, etc. in a company would be a common use case, as different departments often have their own budgets, while tools like Prometheus are a shared resource that is centrally managed.

I would like to be able to define a set of "cost-allocation labels" and then be able to report on costs associated with these labels. This cost data would ideally be added as new metrics in prometheus so we can visualize them easily.

This cost would ideally include (and perhaps be split by) ingestion, storage, and anything else that is charged. If that's too hard, just combining these into a single cost based on cost-allocation labels is fine with me.

Alert Manager integration with Grafana Unified Alerting

Customers have highlighted that they want to be able to use the same Grafana UI they use to build dashboards to:

  • visualize their currently firing alerts,
  • look at the rules they have installed, and
  • silence alarms.

With this feature, we consider enhancements around the integration of Amazon Managed Service for Prometheus’s Alertmanager with Grafana’s Alert Management.

Initially, we plan to enable customers to view, list, and silence alarms. Further, we are considering working towards enabling create, update, and delete workflows through the Grafana UI.

Remove 1h backfilling limitation

Summary: Remove the limitation "Metric samples older than 1 hour are refused from being ingested" as stated at https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html

There is at least one similar issue: #18.

However, our use case is different. I don't want to simply do a migration from one database to AMP (although that would also be important). We want to be able to send metrics at any time, regardless of how old they are.

We work with edge devices that, from time to time, are not connected to the internet. When an edge device is offline and therefore unable to push metrics, it keeps them in a local cache. As soon as it has an internet connection again, it should be able to push that cache (sometimes older than 1 hour, 1 day, or in rare cases 1 week) to AMP.

This 1-hour limitation appears to be specific to AMP. At the moment it is a blocker for our edge devices to use AMP, and therefore we keep using a self-managed Prometheus.

Use managed Prometheus with in-cluster Alertmanager

Prometheus (as deployed by the commonly used operator chart) is difficult to maintain and a resource hog; because of that, AMP is very attractive. Alertmanager, though, works fine for our purposes, and we have

$ kubectl get prometheusrule -A -ojson | jq -r '.items[].spec.groups[].rules[].alert' | wc -l
    4449

alerts defined by teams, each of which can access only one namespace (https://github.com/ministryofjustice/cloud-platform-environments/search?q=prometheusrule), so there is no shared visibility. Alerts go directly to e.g. Slack, with each team controlling its own channel and hooks.

It would be ideal for us to use the managed Prometheus but keep alert definitions and Alertmanager as they are right now (presumably, AMP would need a configuration option to reach the cluster's AM).

Revise the custom storage retention solution

Reviewing past issues, I can see a similar issue was already opened and closed; see reference [1]. However, the way it works seems cumbersome, as it appears to basically just open a ticket with AWS Support for the request. That doesn't seem ideal: users should be able to define the retention period when creating a workspace, and this should not require speaking with Support. I think this needs to be revisited and the implementation made more user friendly.

Additionally, the wording in reference [2] doesn't suggest that this retention period can be reduced. There are cases where users won't want metrics to be stored for 150+ days, and in those cases they should be able to decrease the retention period. If users can already request a limit decrease, please clarify the wording in reference [2] to make this clear, as it currently reads as if only increases are possible.

References:
[1]: #2
[2]: https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html

Mute alarms

Need the ability to mute alerts for servers undergoing maintenance.

Vended CloudWatch metrics

Customers have highlighted that they need visibility into their workspace usage relative to the quotas applied, so they can preemptively increase quotas before getting throttled.

With this feature, we plan to expose the following as vended metrics in Amazon CloudWatch:

  • Active time series per workspace
  • Ingestion rate per workspace
  • Workspaces per region
  • Active alerts per workspace
  • Alert aggregation size per workspace
  • Alerts per workspace
  • Inhibition rules per workspace
  • Routing Tree nodes per workspace
  • RuleGroups per workspace
  • RuleGroupNamespaces per workspace

Further, we plan to vend the following metrics as service usage metrics:

  • DiscardedSamples with a reason dimension per workspace
  • Throttled Alertmanager notifications per workspace
  • Alertmanager failed to send total per workspace
  • Alertmanager alerts received per workspace

Additional metrics and logging for query performance visibility

The current metrics and logs provide limited visibility into how queries perform in AMP. The users don't have visibility into the query duration, queries per second, failed queries, or how many QSPs were used for querying. The latter would especially provide insights into potential cost optimizations (since QSPs are used for calculating the cost of querying AMP).

Additional logs could help debug failing queries or identify which queries need optimization. Especially for companies where many teams write their own queries (e.g. in Grafana), this sort of logging output would be a great help in gaining better visibility into how the system is used. A configurable "query time threshold" could also be advantageous, so that only queries whose execution time exceeds that threshold are logged.

Additional Vended CloudWatch Usage Metrics

Customers have highlighted that they need visibility into their workspace usage relative to the quotas applied, so they can preemptively increase quotas before getting throttled.

With this feature, we plan to expose the following as vended metrics in Amazon CloudWatch.

Account Level Quotas:

  • Workspaces per region

Alert Manager Quotas:

  • Inhibition rules per workspace
  • Routing Tree nodes per workspace

Ruler Quotas:

  • RuleGroups per workspace
  • RuleGroupNamespaces per workspace

improved logging in workspace

I've reported this to AWS support as well.

As near as I can tell, 100% of the log messages in AWS's Cortex are useless. Log messages should provide a hint about the context of the error; they should help in diagnosing issues or unexpected behaviors. If log messages fail to do that, for whatever reason, they don't need to exist and are just useless noise.

Logs should provide clarity and not require the administrator to guess. We have literally dozens of routes and hundreds of rules, and troubleshooting them is a huge issue. We use pint (a Prometheus rule linter) to catch most types of errors.

I feel compelled to remind everybody that any problem on an infrastructure's monitoring platform is a P1 priority, because it means alarms can get missed.

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "MessageAttributes has been removed because of invalid key/value, numberOfRemovedAttributes=1",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • provide the key or value that was invalid, along with the regex that it expects.

{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Subject has been modified because it is empty.",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • provide some other context about the message: which group, rule ID, or another meaningful way of narrowing down the possibilities.
{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Message has been modified because the content was empty.",
        "level": "WARN"
    },
    "component": "alertmanager"
}

Suggestions:

  • I have no idea how/why this happens, but I assume it's usually a template failure. You might put the raw version of the message into the output; you could also include a dump of the template variables, or perhaps add a DEBUG mode for logs.
{
    "workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
    "message": {
        "log": "Notify for alerts failed, Invalid parameter: TopicArn",
        "level": "ERROR"
    },
    "component": "alertmanager"
}

Suggestions:

  • First off, I'm not setting this. Second, is it invalid because it's not set, or invalid because of the content of the TopicArn? Nobody at AWS seems to know; online searches suggest it has something to do with a region issue (which I assume is related to SigV4), but I don't really know or care. The error is obtuse and lacks any value.

Add ability to expose AMP endpoint *only* to private VPC

Currently, creating a new AMP workspace always results in a publicly accessible endpoint being created. The endpoint is secured with IAM, but we prefer no public endpoint at all, only a private endpoint that is still also locked down with IAM.
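
For context, an interface VPC endpoint can already provide a private path to the workspace; what it cannot do today is remove the publicly resolvable endpoint, which is the ask here. A hedged CloudFormation sketch, assuming the aps-workspaces endpoint service and using placeholder VPC, subnet, and security group IDs:

  Resources:
    AmpWorkspacesEndpoint:
      Type: AWS::EC2::VPCEndpoint
      Properties:
        VpcEndpointType: Interface
        # Data-plane (workspace) operations; IDs below are placeholders.
        ServiceName: !Sub com.amazonaws.${AWS::Region}.aps-workspaces
        VpcId: vpc-0123456789abcdef0
        SubnetIds:
          - subnet-0123456789abcdef0
        SecurityGroupIds:
          - sg-0123456789abcdef0
        PrivateDnsEnabled: true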

Custom storage retention

Currently, the maximum retention time for ingested metrics into an Amazon Managed Service for Prometheus workspace is fixed. That is, data older than 150 days is deleted from the workspace.

With custom storage retention, users would have the ability to change the metrics data retention period on a per-workspace basis. The feature would be exposed via the AWS console, API, CloudFormation, and Terraform.

Make Prometheus TSDB metrics available for faster debugging of active usage

The Prometheus API endpoint api/v1/tsdb/status exposes a number of metrics (see attached tsdb.txt) around label cardinality, active series per metric, head series, and so on. At present, Amazon Managed Prometheus exposes only a few of these metrics, and only via CloudWatch.

This adds the pain of exporting metrics from CloudWatch and putting them back into AMP, while these metrics could easily be made available in AMP itself with an amp_tsdb prefix.

Context:

Internally, we run the Prometheus Operator in our EKS cluster and push metrics to AMP via remote write. We suddenly start hitting 400 Bad Request errors when we reach limits, which leads to data loss. Presently, we don't have proper visibility into this due to the limited metric data from Amazon Managed Prometheus. These metrics would help us fix that.

How could you do it?

The Prometheus JSON Exporter can be run as a sidecar for each Cortex instance that you run. A static config can scrape these metrics and push them to AMP. They can then be aggregated via recording rules within AMP and exposed as final, workspace-wide TSDB metrics.

Hope to see this for real soon! Happy to help with the implementation details; we have done it locally and it works like a charm for a Prometheus Operator setup.

tsdb.txt

Document or raise size limit for HELP text

I'm trying to ingest metrics from cockroachdb into aws managed prometheus, and it throws a fatal error for the HELP text that I reported in cockroachdb/cockroach#87112.

I couldn't find the size limit that AMP mandates for HELP text, but more than that, I find it a bit unfortunate that there is a size limit in the first place (Prometheus has no way to truncate help metadata that it ingests, so the only way to successfully ingest metrics is to not send any metadata at all).

I would wish for one of two outcomes here:

  • You all raise the size limit for HELP text (docs are good, right?), or
  • documentation is added for the size limit, and the error message points to that documentation

Use AMP to federate data from multiple EKS clusters

Hello,
We want to use AMP to monitor multiple EKS clusters in our environment: all non-prod clusters report to a non-prod AMP workspace and all prod clusters report to a prod AMP workspace.

  1. Is this setup possible? We have set up a Prometheus server and node exporter in each EKS cluster and they all report to AMP, but we are not sure how to validate this.
  2. If possible, what do we need to change in a) the individual Prometheus server setup in each cluster, and b) the workspace configuration?
  3. If possible, how do I generate per-cluster dashboards in AMG? (One common approach is sketched below.)

Thank You
HP
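
One common approach, sketched here under assumptions (the region and workspace ID are placeholders, and the cluster/environment label values are examples), is to give each cluster's Prometheus server a distinguishing external label before remote-writing into the shared workspace; AMG dashboards can then filter on that label.

  global:
    external_labels:
      cluster: eks-nonprod-1    # unique value per cluster
      environment: nonprod
  remote_write:
    - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
      sigv4:
        region: <region>

To validate ingestion and build per-cluster dashboards, a query such as up{cluster="eks-nonprod-1"} in AMG (or a dashboard variable on the cluster label) shows exactly which clusters are reporting.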

Expose additional Cortex APIs on Managed Prometheus to allow Alertmanager integration using the Cortex type in Grafana

I want to integrate AMP with Managed Grafana to manage (create/edit/delete) alerts. However, to do this the Alertmanager datasource must be created with type=Cortex. Unfortunately, AWS Support has indicated that, despite being based on Cortex, AMP currently does not expose all the Cortex APIs needed to allow this.

This is a usability issue.

  1. While we have support for creating rules on the backend via Terraform, this makes it harder for average developers to create alerts. Yes, they could use legacy Grafana alerts, but this seems to be a less-than-recommended direction.
  2. The UI still suggests that you can create alerts, and it's not until you get a bit further on that you realize you can't (Grafana issue really).

Please expose the AMP APIs needed to allow this fuller integration with the Grafana Alertmanager datasource.

implement present_over_time() range function

https://prometheus.io/docs/prometheus/latest/querying/functions/

The present_over_time() function described in the Prometheus docs is not implemented.
For now I'm going to approximate it using "absent_over_time".

An error occurred (ValidationException) when calling the PutRuleGroupsNamespace operation: Invalid RuleGroupsNamespace data: [309:13: group "system", rule 1, "mytest_firealarm": could not parse expression: 1:1: parse error: unknown function with name "present_over_time"]
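
As a possible stand-in until the function is available (a sketch only; some_metric and the rule layout are placeholders), count_over_time combined with the bool modifier returns 1 for every series that had at least one sample in the window, which is what present_over_time would return:

  groups:
    - name: system
      rules:
        - alert: mytest_firealarm
          # 1 for every series of some_metric that reported at least one
          # sample in the last 5 minutes; "bool" forces the value to 1.
          expr: count_over_time(some_metric[5m]) > bool 0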

support alert manager "active_time_intervals"

Amazon Managed Prometheus does not appear to support the "active_time_intervals" setting/behavior.

│ Error: waiting for Prometheus Alert Manager Definition (ws-xxxxxx) update: unexpected state 'UPDATE_FAILED', wanted target 'ACTIVE'. last error: status=400, message=error validating Alertmanager config: yaml: unmarshal errors:
│   line 41: field active_time_intervals not found in type config.plain

mute_time_intervals seems to work, but generally this Cortex-based product appears to diverge in annoying and inconvenient ways from the stock (well-documented) versions of Prometheus and Mimir.

A sample config is below (mute_time_intervals is accepted, active_time_intervals is not)

  # Times when the route should be active. These must match the name of a
  # time interval defined in the time_intervals section. An empty value
  # means that the route is always active.
  # Additionally, the root node cannot have any active times.
  # The route will send notifications only when active, but otherwise
  # acts normally (including ending the route-matching process
  # if the `continue` option is not set).    
  active_time_intervals:
    - name: offhours
      time_intervals:
        - weekdays: ['Saturday','Sunday']
        - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'

  # Times when the route should be muted. These must match the name of a
  # mute time interval defined in the mute_time_intervals section.
  # Additionally, the root node cannot have any mute times.
  # When a route is muted it will not send any notifications, but
  # otherwise acts normally (including ending the route-matching process
  # if the `continue` option is not set.)
  mute_time_intervals:
    - name: offhours
      time_intervals:
        - weekdays: ['Saturday','Sunday']
        - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'

Provide ability to wipe off data for AMP workspace/metrics

Use case 1

A Prometheus counter is a cumulative metric whose value only accumulates as new data points arrive. In scenarios where we wish to reset such a counter, presently the only solution is to use a new metric altogether, which calls for changes across dashboards and alerts.

Use case 2

Due to an internal implementation error, we may end up pushing metrics with PII data in the labels. When detected, this data needs to be wiped immediately by deleting TSDB data by duration or by metric. In such scenarios, destroying the workspace and losing all metric data is not an option.

Use case 3

When we reach quota limits, we presently get 400 Bad Request errors for two hours, since (as mentioned on the AMP quota limits page) active series are counted as metrics that have reported data in the past 2 hours.

Feature Request

Provide the ability to wipe Prometheus data on demand, either by metric or as a complete wipe by age of data. This would allow us to quickly unblock metric submission instead of losing 2 hours of data altogether. Destroying and recreating the workspace requires too many changes to be practical.

Alert Manager Alerts should link back to Grafana Dashboards/Explore pages

Customers have highlighted that when they receive an alert they often want to explore the metric being alerted, or view a dashboard that gives them more context about the alerting metric. Today, in Amazon Managed Service for Prometheus there is no out-of-the-box way to link back to a Grafana Explore or Dashboard page in the alert payload.

With this feature customers should be able to:

  1. Have template variables for the alerting expression, alerting time-window, and evaluation period (for)
  2. Have the ability to customize the link-back root, so they can link to their OSS Grafana or AMG.
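
For reference, a hedged sketch of today's manual workaround: hard-coding a Grafana URL into an alerting rule's annotations (the hostname, dashboard path, metric, and label are placeholders). The requested feature would remove the need for this by templating the expression, time window, and evaluation period automatically.

  groups:
    - name: example
      rules:
        - alert: HighErrorRate
          expr: sum by (service) (rate(http_errors_total[5m])) > 0.1
          for: 10m
          annotations:
            summary: Error rate above 10% for {{ $labels.service }}
            # Static link-back, maintained by hand today.
            dashboard_url: https://grafana.example.com/d/service-overview?var-service={{ $labels.service }}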

Enable remote read api protocol

Summary: Enable remote read api protocol

Enabling the remote read API protocol would allow users to take their data out of AMP easily and quickly, skipping PromQL evaluation.

Some of the possible use cases for this are described here.

We have at least another two:

  • If (for some reason) AMP isn't the right tool for us, having this feature gives us confidence that we have an easy and friendly way to migrate the data from AMP to another infrastructure (a self-managed one, for example).
  • Export data to be processed by other means besides PromQL.
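
If the protocol were enabled, a consuming Prometheus could be pointed at the workspace with a remote_read block along these lines (a sketch only: the URL path is assumed, and AMP would still require SigV4-signed requests, for example via a signing proxy, which is not shown):

  remote_read:
    # Assumed endpoint shape; region and workspace ID are placeholders.
    - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/read
      read_recent: true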

Ability to delete individual and/or multiple metrics/time series

In addition to regular deletion after the (custom #2) retention period, the ability to delete individual and/or multiple metrics/time series is needed.

Deletion becomes necessary when

  • the data quality of the metric is not (anymore) good enough,
  • the budget threshold of Prometheus is exceeded, or
  • the data is simply no longer needed.
