Comments (5)
from amazon-managed-service-for-prometheus-roadmap.
We are supportive of this feature request, but would like to hear more about the proposal for "service usage metrics". What is a service usage metric, and how is it different from a vended metric that lives within CloudWatch?
- DiscardedSamples with a reason dimension per workspace
- Throttled Alertmanager notifications per workspace
- Alertmanager failed to send total per workspace
- Alertmanager alerts received per workspace
We think it's useful to determine the ways that workspaces can break down, but it would be great to hear what service quotas these failure modes are connected to.
DiscardedSamples appears to correlate with active series, and active series per metric name, per the Cortex OSS implementation.
In the Amazon Managed Service for Prometheus service quotas, however, I do not see any quotas related to notifications. Is there a limit on the number of notifications a workspace can send out in a given amount of time? At what point will we be "throttled"?
How should we distinguish, conceptually, between Alertmanager "receiving", "sending", and "notifying"? Are there other parts of the pipeline we should be aware of?
Under what circumstances would Alertmanager fail to send? If an alert fails, for example, because its query reaches the 12M Query samples limit, would this constitute a "failure to send"?
Here is some additional information about DiscardedSamples with the various reasons that will be provided as a dimension:
Reason = Meaning
- greater_than_max_sample_age = Samples are discarded because they are older than the maximum allowed sample age
- new-value-for-timestamp = A sample is sent with the same timestamp as a previously recorded sample but with a different value
- per_metric_series_limit = User has hit the active series per metric limit
- per_user_series_limit = User has hit the total number of active series limit
- rate_limited = Ingestion rate limited
- sample_out_of_order = Samples are sent with out of order timestamps and cannot be processed by AMP
- label_value_too_long = Label value is longer than allowed character limit
- max_label_names_per_series = User has hit the limit on label names per series
- missing_metric_name = Metric name is not provided
- metric_name_invalid = Invalid metric name provided
- label_invalid = Invalid label provided
- duplicate_label_names = Duplicate label names provided
The idea behind discarded samples is to actually show you the amount of data that has been throttled or dropped and what reason is associated with it, so you can react with the right limit increase or configuration change.
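As a rough illustration of reacting to the reason dimension, the sketch below groups the reasons listed above into those that might call for a limit increase and those that point to a client-side or configuration fix. The grouping itself and the names `LIMIT_REASONS`, `CONFIG_REASONS`, `suggest_action`, and `summarize` are assumptions made for illustration; they are not part of AMP or CloudWatch.

```python
# Hypothetical grouping of DiscardedSamples reasons. Which reasons map to a
# limit increase versus a configuration change is an assumption for this sketch.
LIMIT_REASONS = {
    "per_metric_series_limit",
    "per_user_series_limit",
    "rate_limited",
    "max_label_names_per_series",
    "label_value_too_long",
}
CONFIG_REASONS = {
    "greater_than_max_sample_age",
    "new-value-for-timestamp",
    "sample_out_of_order",
    "missing_metric_name",
    "metric_name_invalid",
    "label_invalid",
    "duplicate_label_names",
}


def suggest_action(reason: str) -> str:
    """Map a discard reason to a coarse suggested reaction."""
    if reason in LIMIT_REASONS:
        return "consider a limit increase"
    if reason in CONFIG_REASONS:
        return "check client configuration"
    return "unknown reason"


def summarize(discarded: dict[str, int]) -> dict[str, int]:
    """Total discarded samples per suggested action.

    `discarded` maps a reason string to a sample count, for example values
    read from the DiscardedSamples metric for one workspace.
    """
    totals: dict[str, int] = {}
    for reason, count in discarded.items():
        action = suggest_action(reason)
        totals[action] = totals.get(action, 0) + count
    return totals
```

Feeding per-reason counts for one workspace into `summarize` gives a quick view of whether most drops are quota-related or caused by the data being sent.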
The Amazon Managed Service for Prometheus (AMP) Alert Manager metrics for "failed to send" and "received" are meant to help you track the performance of the AMP Alert Manager rather than to correspond to any quota. We declare a notification as failing to send if the AMP Alert Manager is unable, after all retries, to send the notification to the downstream receiver, in this case SNS. Some customers we've spoken with have highlighted that this is a useful metric for understanding whether something is misconfigured in their notification pipeline, such as an access policy for SNS. The AMP Alert Manager doesn't evaluate any queries, as that is done by the alerting rule run by the ruler, so the most common failure conditions for the AMP Alert Manager are failures that occur when resolving the alert manager template or failures that occur when trying to send the notification to downstream receivers, such as SNS.
With regard to modeling the pipeline, I'd model it as follows:
- The ruler runs your alerting rule, and evaluates the result
- If the result matches the rule's condition, the ruler sends an Alert to the AMP Alert Manager to process.
- The AMP Alert Manager modifies the incoming payload based on the configured templates and routing rules.
- Based on the routes, it sends to a downstream receiver (currently SNS).
- The downstream receiver, if it successfully receives the message, forwards the message to whatever component sits at the other end, usually PagerDuty or Slack.
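To make the "received" and "failed to send" counts concrete, here is a minimal sketch of the Alert Manager step in the pipeline above. It assumes one notification per alert and a fixed number of attempts, which simplifies the real grouping and retry behavior; `send_with_retries`, `process_alerts`, and `max_attempts` are hypothetical names, not AMP APIs.

```python
from typing import Callable, Iterable


def send_with_retries(send: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Return True if any attempt succeeds, False once every attempt has failed."""
    for _ in range(max_attempts):
        if send():
            return True
    return False


def process_alerts(
    alerts: Iterable[str],
    send_to_receiver: Callable[[str], bool],
    max_attempts: int = 3,
) -> dict[str, int]:
    """Count alerts received and notifications that failed after all retries.

    `send_to_receiver` stands in for delivery to the downstream receiver
    (SNS in the description above) and returns True on success.
    """
    counters = {"received": 0, "failed_to_send": 0}
    for alert in alerts:
        counters["received"] += 1
        # A notification only counts as failed once all attempts are exhausted.
        if not send_with_retries(lambda: send_to_receiver(alert), max_attempts):
            counters["failed_to_send"] += 1
    return counters
```

In this model a query that fails in the ruler never reaches `process_alerts`, so it would not show up as a failure to send.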
I'd like to suggest that CloudWatch receive vended metrics related to metrics dropped due to the dedupe mechanisms around the cluster and __replica__ labels. There is currently no way to validate in AMP that deduplication is happening appropriately, so seeing some sort of counter or indication that a number of metrics sources are being dropped due to the behavior listed here would be very helpful.
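The kind of counter being requested could look roughly like the sketch below. It assumes a simplified deduplication model in which the first replica seen for each cluster is treated as elected and samples from other replicas are dropped; failover and timeouts are ignored, and the function name `dedupe` is hypothetical rather than how AMP implements it.

```python
def dedupe(samples: list[dict[str, str]]) -> tuple[list[dict[str, str]], dict[str, int]]:
    """Keep samples from one replica per cluster and count the rest as dropped.

    Each sample is a label dict containing at least 'cluster' and '__replica__'.
    Returns (accepted samples, dropped-sample count per cluster).
    """
    elected: dict[str, str] = {}   # cluster -> elected replica (first one seen)
    accepted: list[dict[str, str]] = []
    dropped: dict[str, int] = {}
    for sample in samples:
        cluster = sample["cluster"]
        replica = sample["__replica__"]
        leader = elected.setdefault(cluster, replica)
        if replica == leader:
            accepted.append(sample)
        else:
            dropped[cluster] = dropped.get(cluster, 0) + 1
    return accepted, dropped
```

A per-cluster `dropped` count that stays at zero while two replicas are sending would suggest deduplication is not taking effect.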
We recently launched CloudWatch usage metrics, and you can learn more about them here. CloudWatch usage metrics were launched on 5/9.
Additional metrics such as:
- Workspaces per region
- Inhibition rules per workspace
- Routing Tree nodes per workspace
- RuleGroups per workspace
- RuleGroupNamespaces per workspace
will be available in a future update to vended metrics, and for now have been migrated to issue #12.