aws / amazon-managed-service-for-prometheus-roadmap
Amazon Managed Service for Prometheus Public Roadmap
License: Other
Hello,
We want to use AMP to monitor multiple EKS clusters in our environment: all non-prod clusters report to a non-prod AMP workspace, and all prod clusters report to a prod AMP workspace.
Please consider adding 'modern' Go syntax behaviors to make formatting templates more idiomatic. The functions I'd propose (a hypothetical use is sketched below) are:
map, dict, split, splitList, regexMatch, lower, contains, strings, if
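For illustration, here is a hypothetical notification template assuming Sprig-style helpers (splitList, regexMatch, lower) were available; AMP's Alertmanager does not provide them today, and "my_alerts.summary" is an invented template name:

{{ define "my_alerts.summary" }}
{{- if regexMatch "^prod-" .CommonLabels.cluster -}}
[PROD] {{ index (splitList "-" .CommonLabels.cluster) 1 }}: {{ .CommonAnnotations.summary }}
{{- else -}}
[{{ .CommonLabels.cluster | lower }}] {{ .CommonAnnotations.summary }}
{{- end -}}
{{ end }}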
Also, a mechanism for testing/validating templates and rules (i.e. triggering a specific rule via an API call to see how it formats and runs) would enhance usability.
Need the ability to mute servers undergoing maintenance
Amazon Managed Prometheus does not appear to support the "active_time_intervals" setting/behavior.
│ Error: waiting for Prometheus Alert Manager Definition (ws-xxxxxx) update: unexpected state 'UPDATE_FAILED', wanted target 'ACTIVE'. last error: status=400, message=error validating Alertmanager config: yaml: unmarshal errors:
│ line 41: field active_time_intervals not found in type config.plain
mute_time_intervals seems to work, but generally this Cortex-based product appears to diverge in annoying and inconvenient ways from the stock (well-documented) versions of Prometheus and Mimir.
A sample config is below (mute_time_intervals is accepted; active_time_intervals is not).
# Times when the route should be active. These must match the name of a
# time interval defined in the time_intervals section. An empty value
# means that the route is always active.
# Additionally, the root node cannot have any active times.
# The route will send notifications only when active, but otherwise
# acts normally (including ending the route-matching process
# if the `continue` option is not set).
active_time_intervals:
  - name: offhours
    time_intervals:
      - weekdays: ['Saturday','Sunday']
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'
# Times when the route should be muted. These must match the name of a
# mute time interval defined in the mute_time_intervals section.
# Additionally, the root node cannot have any mute times.
# When a route is muted it will not send any notifications, but
# otherwise acts normally (including ending the route-matching process
# if the `continue` option is not set.)
mute_time_intervals:
  - name: offhours
    time_intervals:
      - weekdays: ['Saturday','Sunday']
      - times:
          - start_time: '00:00'
            end_time: '09:00'
          - start_time: '18:00'
            end_time: '24:00'
https://prometheus.io/docs/prometheus/latest/querying/functions/
The present_over_time() function described in the Prometheus docs is not implemented. I'm going to work around this using absent_over_time (see the sketch after the error below).
An error occurred (ValidationException) when calling the PutRuleGroupsNamespace operation: Invalid RuleGroupsNamespace data: [309:13: group "system", rule 1, "mytest_firealarm": could not parse expression: 1:1: parse error: unknown function with name "present_over_time"]
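A sketch of the workaround, assuming a hypothetical metric named mytest_metric: for a "series was present in the window" check, count_over_time behaves like present_over_time for alerting purposes, while absent_over_time covers the inverse directly.

groups:
  - name: system
    rules:
      - alert: mytest_firealarm
        # count_over_time returns a value whenever at least one sample
        # exists in the window, so > 0 acts as a presence test.
        expr: count_over_time(mytest_metric[5m]) > 0
      - alert: mytest_missing
        # absent_over_time fires only when no samples exist in the window.
        expr: absent_over_time(mytest_metric[5m])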
Prometheus (as deployed by the commonly used operator chart) is difficult to maintain and a resource hog; because of that, AMP is very attractive. Alertmanager, though, works fine for our purposes, and we have
$ kubectl get prometheusrule -A -ojson | jq -r '.items[].spec.groups[].rules[].alert' | wc -l
4449
alerts defined by teams which can each access only one namespace (https://github.com/ministryofjustice/cloud-platform-environments/search?q=prometheusrule), so there is no shared visibility. Alerts go directly to e.g. Slack, with each team controlling its own channel and hooks.
It would be ideal for us to use the managed Prometheus but keep alert definitions and Alertmanager as they are right now (presumably, AMP would need a configuration option to reach the cluster's AM).
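For context, shipping metrics from the existing in-cluster Prometheus to AMP is already just a remote_write block with SigV4 (the region and workspace ID below are placeholders); it is the alerting side that has no equivalent hook today:

remote_write:
  - url: https://aps-workspaces.eu-west-2.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    # SigV4 signing against the AMP workspace; credentials come from the
    # usual AWS default credential chain (e.g. IRSA on EKS).
    sigv4:
      region: eu-west-2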
Customers have highlighted that when they receive an alert, they often want to explore the metric being alerted on, or view a dashboard that gives them more context about it. Today, in Amazon Managed Service for Prometheus, there is no out-of-the-box way to link back to a Grafana Explore or dashboard page in the alert payload.
With this feature, customers should be able to:
Customers have highlighted that they want to be able to use the same Grafana UI they use to build dashboards to:
With this feature, we are considering enhancements around the integration of Amazon Managed Service for Prometheus's Alertmanager with Grafana's alert management.
Initially, we plan to enable customers to view, list, and silence alerts. Further, we are considering working towards enabling create, update, and delete workflows through the Grafana UI.
I'm trying to ingest metrics from CockroachDB into AWS Managed Prometheus, and it throws a fatal error for the HELP text, which I reported in cockroachdb/cockroach#87112.
I couldn't find the size limit that AMP mandates for HELP text, but more than that, I find it a bit unfortunate that there is a size limit in the first place (Prometheus has no way to truncate the help metadata it ingests, so the only way to successfully ingest metrics is to not send any metadata at all).
I would wish for one of two outcomes here:
Any plans to add the Cloud Computing Compliance Controls Catalog (C5)? We need it for a German banking organisation.
I want to integrate AMP with Amazon Managed Grafana to manage (create/edit/delete) alerts. However, in order to do this, the Alertmanager datasource must be created with type=Cortex. Unfortunately, AWS support has indicated that, despite being based on Cortex, AMP does not currently expose all of the Cortex APIs needed to allow this.
This is a usability issue.
Please expose the AMP API to allow this fuller integration with the Grafana Alertmanager datasource.
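For context, the datasource in question looks roughly like this in Grafana provisioning YAML (the workspace URL and region are placeholders); implementation: cortex is the mode that needs the missing APIs:

apiVersion: 1
datasources:
  - name: AMP Alertmanager
    type: alertmanager
    url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/alertmanager
    jsonData:
      # Alert management (create/edit/delete) requires the Cortex/Mimir
      # flavour of the Alertmanager API rather than the vanilla one.
      implementation: cortex
      sigV4Auth: true
      sigV4Region: us-east-1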
According to AWS documentation, we can use templates to produce JSON-based output from Alertmanager. But in practice, the output of templates gets overwritten by the content of the message field of the sns_configs.
We need the message field content for integrating with PagerDuty and Slack. We are currently overcoming this limitation by using the templating language to create blocks in the message field.
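A minimal sketch of that workaround, assuming a hypothetical template named "slack.json.payload" that renders the whole JSON body inside the message field:

{{ define "slack.json.payload" }}{"text": "{{ .CommonAnnotations.summary }}", "status": "{{ .Status }}"}{{ end }}

receivers:
  - name: 'slack-sns'
    sns_configs:
      - topic_arn: 'arn:aws:sns:us-east-1:111122223333:alerts'  # placeholder
        sigv4:
          region: 'us-east-1'
        # Since template output elsewhere is overwritten, the entire
        # JSON payload is rendered here in the message field itself.
        message: '{{ template "slack.json.payload" . }}'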
The current metrics and logs provide limited visibility into how queries perform in AMP. Users have no visibility into query duration, queries per second, failed queries, or how many QSPs were used for querying. The latter would especially provide insight into potential cost optimizations (since QSPs are used for calculating the cost of querying AMP).
Additional logs could help to debug failing queries or identify which queries need optimization. Especially for companies where a multitude of teams write their own queries (e.g. in Grafana), having this sort of logging output would be a great help in gaining better visibility into how the system is used. A configurable "query time threshold" could be advantageous, logging only queries whose execution time exceeds the threshold.
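Purely as a sketch of the ask, and not an existing AMP feature (every key below is hypothetical), such a per-workspace configuration might look like:

query_logging:
  enabled: true
  # Hypothetical key: log only queries slower than this threshold.
  min_duration: 2s
  # Hypothetical key: include QSP usage per query for cost attribution.
  include_qsp_usage: true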
Currently, the maximum retention time for metrics ingested into an Amazon Managed Service for Prometheus workspace is fixed. That is, data older than 150 days is deleted from the workspace.
With custom storage retention, users would have the ability to change the metrics data retention period on a per-workspace basis. The feature would be exposed via the AWS console, API, CloudFormation, and Terraform.
The AWS-managed IAM policy ReadOnlyAccess, as of version 100, does not contain the action aps:DescribeLoggingConfiguration. This seems to be an oversight, as other actions such as aps:DescribeWorkspace and aps:DescribeRuleGroupsNamespace are present.
https://docs.aws.amazon.com/aws-managed-policy/latest/reference/ReadOnlyAccess.html
Is it possible to get this added in, please?
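In the meantime, a workaround is to attach a small customer-managed policy alongside ReadOnlyAccess; a sketch in CloudFormation YAML (the resource name is invented):

AmpReadOnlyGapPolicy:
  Type: AWS::IAM::ManagedPolicy
  Properties:
    Description: Fills the aps:DescribeLoggingConfiguration gap in ReadOnlyAccess
    PolicyDocument:
      Version: '2012-10-17'
      Statement:
        - Effect: Allow
          Action: aps:DescribeLoggingConfiguration
          Resource: '*'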
Customers have highlighted that they need visibility into their workspace usage relative to the quotas applied, so they can preemptively increase quotas before getting throttled.
With this feature, we plan to expose the following as vended metrics in Amazon CloudWatch:
Further, we plan to vend the following metrics as service usage metrics:
Reviewing past issues, I can see a similar issue was already opened and closed; see reference [1]. However, the way it works is cumbersome: it appears to simply open a ticket with AWS Support. This really doesn't seem like ideal behavior. Ideally, users should be able to define the retention period when creating a workspace; this does not seem like something users should have to contact Support to do. I think there's a need to revisit the implementation and make it more user-friendly.
Additionally, the wording in reference [2] doesn't suggest that this retention period can be reduced. There are cases where users won't want data stored for 150+ days, and in those cases they should be able to decrease the retention period. If users can already request a limit decrease, please clarify the wording in reference [2], as it currently reads as if only increases are possible.
References:
[1]: #2
[2]: https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html
It could be advantageous for some customers to have visibility into what the current service quotas are, per workspace. This would especially come in handy if there are multiple workspaces with different data setups. I could imagine this being part of the workspace UI and/or accessible through the AWS CLI.
Summary: Remove the limitation "Metric samples older than 1 hour are refused from being ingested" as stated at https://docs.aws.amazon.com/prometheus/latest/userguide/AMP_quotas.html
There is at least one similar issue to this one: #18.
However, our use case is different. I don't want to simply migrate from one database to AMP (although that would also be important); we want to be able to send metrics at any time, no matter how old they are.
We work with edge devices that, from time to time, are not connected to the internet. When offline and therefore unable to push metrics, those devices keep them in a local cache. As soon as they have an internet connection again, they should be able to push that cache (sometimes older than 1 hour, 1 day, or in rare cases 1 week) to AMP.
This 1-hour limitation appears to be specific to AMP. At the moment it is a blocker for our edge devices, and therefore we keep using a self-managed Prometheus.
Customers have highlighted that they need visibility into their workspace usage relative to the quotas applied, so they can preemptively increase quotas before getting throttled.
With this feature, we plan to expose the following as vended metrics in Amazon CloudWatch.
Account Level Quotas:
Alert Manager Quotas:
Ruler Quotas:
Currently, creating a new AMP workspace always results in a publicly accessible endpoint. The endpoint is secured with IAM, but we would prefer no public endpoint at all: only a private endpoint, also locked down with IAM.
This is along the lines of a similar feature request, but I would suggest an option to backfill data older than 1 hour, to enable a simpler migration of an existing Prometheus stack.
I've reported this to AWS support as well.
As near as I can tell, 100% of the log messages in AWS Cortex are useless. Log messages should provide a hint about the context of an error; they should help in diagnosing issues or unexpected behaviors. If log messages fail to do that, they don't need to exist; they are just useless noise.
Logs should provide clarity and not require the administrator to guess. We have literally dozens of routes and hundreds of rules, and troubleshooting them is a huge issue. We use pint (a Prometheus rule linter) to catch most types of errors.
I feel compelled to remind everybody that any problem on an infrastructure monitoring platform is a P1 priority, because it means alarms can get missed.
{
"workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
"message": {
"log": "MessageAttributes has been removed because of invalid key/value, numberOfRemovedAttributes=1",
"level": "WARN"
},
"component": "alertmanager"
}
Suggestions:
{
"workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
"message": {
"log": "Subject has been modified because it is empty.",
"level": "WARN"
},
"component": "alertmanager"
}
Suggestions:
{
"workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
"message": {
"log": "Message has been modified because the content was empty.",
"level": "WARN"
},
"component": "alertmanager"
}
Suggestions:
{
"workspaceId": "ws-69b34717-4546-4e1d-a367-9f4a286a91ab",
"message": {
"log": "Notify for alerts failed, Invalid parameter: TopicArn",
"level": "ERROR"
},
"component": "alertmanager"
}
Suggestions:
The Prometheus API endpoint api/v1/tsdb/status exposes a bunch of metrics (see attached tsdb.txt) around label cardinality, active series per metric, head-series-related metrics, etc. At present, Amazon Managed Prometheus exposes only a few of these metrics, and only via CloudWatch.
This adds the pain of exporting metrics from CloudWatch and putting them back into AMP, while these metrics could easily be made available in AMP itself with an amp_tsdb prefix.
Internally, we run the Prometheus Operator in our EKS cluster and push metrics to AMP via remote write. We suddenly start hitting 400 Bad Request errors when we reach limits, which leads to data loss. Presently, we don't have proper visibility into this due to the limited metric data from Amazon Managed Prometheus. These metrics would help us fix that.
The Prometheus JSON Exporter can be run as a sidecar for each Cortex instance that you run. A static config can scrape these metrics and push them to AMP. These metrics can then be aggregated via recording rules within AMP and exposed as final, workspace-wide TSDB metrics.
Hope to see this soon! Happy to help with the implementation details; we have done it locally and it works like a charm for a Prometheus Operator setup.
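For illustration, a sidecar scrape following the json_exporter convention could look like the sketch below (the job name and module name are invented, and the ports assume defaults):

scrape_configs:
  - job_name: cortex-tsdb-status          # hypothetical job name
    metrics_path: /probe
    params:
      module: [tsdb_status]               # hypothetical json_exporter module
    static_configs:
      # The endpoint whose JSON the sidecar converts to Prometheus metrics.
      - targets: ['http://localhost:9090/api/v1/tsdb/status']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:7979       # the json_exporter sidecar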
In order to support capacity planning, SLI/SLO work, or long-term performance analysis, AMP needs to support long-term trending queries beyond the 32-day limit that exists today. To make those queries efficient, AMP should support downsampling: auto-reduce the resolution of data past 1 week (1-minute resolution), 30 days (5-minute resolution), and 6 months (1-hour resolution), and provide the option of keeping or dropping raw data past the first tier.
I would like to see provision made for working out the cost of metrics given one or more labels, much like you would do using tags on EC2 resources: https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html
The only advice I got from AWS Support was to create separate workspaces and track costs at that level, but I think this is counter-intuitive. We WANT a single workspace so it's simple for our users to create queries across accounts, environments, applications, etc. without having to worry about which workspace the data was ingested into. Given that AMP now supports 500M metrics in a single workspace, I would assume that allocating costs to different departments, teams, etc. is a common use case, as departments often have their own budgets, while tools like Prometheus are a shared, centrally managed resource.
I would like to be able to define a set of "cost-allocation labels" and then report on the costs associated with those labels. This cost data would ideally be added as new metrics in Prometheus so we can visualize it easily.
The cost would ideally include (and perhaps be split by) ingestion, storage, and anything else that is charged. If that's too hard, combining these into a single cost based on the cost-allocation labels is fine with me.
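Until something native exists, a rough proxy is a recording rule that counts active series per cost label; a sketch assuming a hypothetical team label on all series:

groups:
  - name: cost-allocation               # hypothetical group name
    rules:
      - record: team:active_series:count
        # Active series per team is a crude proxy for ingestion/storage cost.
        expr: count by (team) ({team!=""})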
Context
We plan to shard metrics from a group of services into a predefined workspace. This approach allows us to streamline cost attribution/chargeback to appropriate service team owners. Additionally, this strategy helps us isolate noisy services into individual workspaces for scalability.
Problem
We anticipate that metrics for large platform teams (like Service Networking and Compute Infra) will be spread across different workspaces. It is difficult to set up alerts and build accurate dashboards because AMP does not support federated queries that span more than one workspace.
Feature Request
Providing federated query support would allow large customers to seamlessly shard metrics across workspaces while maintaining a consistent user experience.
Summary: Enable the remote read API protocol
Enabling the remote read API protocol allows users to get their data out of AMP easily and quickly, skipping PromQL evaluation.
Some of the possible use cases for this are described here.
We have at least another two:
Hey folks, please consider adding support for importing existing metrics, especially if they are already in S3.
On the AWS docs page below, the sample configuration for receivers has two issues.
https://docs.aws.amazon.com/prometheus/latest/userguide/AMP-alertmanager-receiver-config.html
Firstly, the format is not key: key, value: value.
I've verified that the Prometheus docs are correct:
Secondly, the proposed indentation for sigv4: should not have a leading '-'; I think that would actually break it. This is taken from a working config:
receivers:
  - name: 'opsgenie-sns'
    sns_configs:
      - topic_arn: "replace_arn"
        send_resolved: false
        subject: '{{ template "opsgenie.default.subject" .}}'
        message: ''
        sigv4:
          region: "${region}"
        attributes:
          tags: '{{ template "__opsgenie_tags" .Alerts.Firing }}'
          teams: '{{ template "__opsgenie_teams" .Alerts.Firing }}'
          priority: '{{ template "__opsgenie_priority" .CommonLabels.SortedPairs }}'
A Prometheus counter is an incremental metric: each new datapoint adds to the existing counter value. In scenarios where we wish to reset such a counter, presently the only solution is to use a new metric altogether, which calls for changes across dashboards and alerts.
In scenarios where, due to an internal implementation error, we end up pushing metrics with PII data in the labels, that data, when detected, needs to be wiped immediately by deleting TSDB data by duration/metric. In such scenarios, destroying the workspace and losing all metric data is not an option.
When we reach quota limits, we presently get 400 Bad Request errors for two hours, as mentioned on the AMP quota limits page ("metrics that have reported data in the past 2 hours").
Provide the ability to wipe Prometheus data on demand, by metric or simply by age of data. This would allow us to quickly unblock metric submission instead of losing 2 hours of data altogether. Destroying and recreating the workspace leads to a lot of changes, which is not practical.
In #6, there was an ask to be able to configure Amazon Managed Service for Prometheus Alert Manager via native Kubernetes CRDs, so that users can reuse their existing CRD-based Alertmanager configuration mechanisms with Amazon Managed Service for Prometheus.
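For reference, the kind of resource users want to reuse is the Prometheus Operator's AlertmanagerConfig CRD; a minimal sketch (the names and receiver URL are placeholders):

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: team-a-alerts              # placeholder
  namespace: team-a                # placeholder
spec:
  route:
    receiver: team-a-webhook
    groupBy: ['alertname']
  receivers:
    - name: team-a-webhook
      webhookConfigs:
        - url: https://example.com/hook    # placeholder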
At this time, Amazon Managed Prometheus only sends alert traffic to SNS, while AWS Chatbot only supports traffic routed through EventBridge. Getting alerts into Slack is highly desirable as another route for notifying IT organizations that there is an issue.
In addition to regular deletion after the (custom, see #2) retention period, the ability to delete individual and/or multiple metrics/time series is needed.
Deletion becomes necessary when
There is demand for this feature on AMP; the Grafana Agent (which we use to remote-write EC2 metrics) supports it starting with version v0.30.0.