
cortex-tools's Introduction

Be aware of the new mimirtool

If you're using this tool with Grafana Mimir, please use the new mimirtool instead.


Cortex Tools

This repo contains tools used for interacting with Cortex.

  • benchtool: A powerful YAML-driven tool for benchmarking the Cortex write and query APIs.
  • cortextool: Interacts with user-facing Cortex APIs and backend storage components.
  • logtool: Tool which parses Cortex query-frontend logs and formats them for easy analysis.
  • e2ealerting: Tool that helps measure how long an alert takes from scrape of sample to Alertmanager notification delivery.

Installation

The various binaries are available for macOS, Windows, and Linux.

macOS

cortextool is available on macOS via Homebrew:

$ brew install grafana/grafana/cortextool

Linux, Docker and Windows

Refer to the latest release for installation instructions on these platforms.

cortextool

This tool is designed to interact with the various user-facing APIs provided by Cortex, as well as with backend storage components containing Cortex data.

Config Commands

Config commands interact with the Cortex API to read/create/update/delete user configs. Specifically, a user's Alertmanager and rule configs can be composed and updated using these commands.

Configuration

| Env Variable | Flag | Description |
| --- | --- | --- |
| CORTEX_ADDRESS | address | Address of the API of the desired Cortex cluster. |
| CORTEX_API_USER | user | In cases where the Cortex API is set behind a basic-auth gateway, a user can be set as the basic-auth user. If empty and CORTEX_API_KEY is set, CORTEX_TENANT_ID will be used instead. |
| CORTEX_API_KEY | key | In cases where the Cortex API is set behind a basic-auth gateway, a key can be set as the basic-auth password. |
| CORTEX_AUTH_TOKEN | authToken | In cases where the Cortex API is set behind a gateway authenticating by bearer token, a token can be set as the bearer-token header. |
| CORTEX_TENANT_ID | id | The tenant ID of the Cortex instance to interact with. |
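For illustration, the environment variables above can be exported once per shell session so that every cortextool invocation picks them up; the address and tenant ID here are placeholders, not real endpoints.

```shell
# Placeholder values: substitute your own cluster address and tenant ID.
export CORTEX_ADDRESS="http://cortex.example.com"
export CORTEX_TENANT_ID="example-tenant"

# With the environment set, commands need no --address/--id flags, e.g.:
# cortextool rules list
```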

Alertmanager

The following commands are used to interact with your Cortex Alertmanager configuration, as well as your alert template files.

Alertmanager Get
cortextool alertmanager get
Alertmanager Load
cortextool alertmanager load ./example_alertmanager_config.yaml

cortextool alertmanager load ./example_alertmanager_config.yaml template_file1.tmpl template_file2.tmpl

Rules

The following commands are used to interact with your Cortex ruler configuration. They can load Prometheus rule files, as well as interact with individual rule groups.

Note: If you are interacting with a Loki ruler, be sure to add the flag --backend=loki to all commands.

Rules List

This command will retrieve all of the rule groups stored in the specified Cortex instance and print each one by rule group name and namespace to the terminal.

cortextool rules list
Rules Print

This command will retrieve all of the rule groups stored in the specified Cortex instance and print them to the terminal.

cortextool rules print
Rules Get

This command will retrieve the specified rule group from Cortex and print it to the terminal.

cortextool rules get example_namespace example_rule_group
Rules Delete

This command will delete the specified rule group from the specified namespace.

cortextool rules delete example_namespace example_rule_group
Rules Load

This command reads the rule groups in the specified files and loads them into Cortex. If a rule group already exists in Cortex, it is overwritten when a diff is found.

cortextool rules load ./example_rules_one.yaml ./example_rules_two.yaml  ...
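As a sketch, a minimal file accepted by rules load could look like the following; the namespace, group, and rule names are illustrative.

```shell
# Create a minimal rule file (illustrative names) for `rules load`.
cat > ./example_rules_one.yaml << 'EOF'
namespace: example_namespace
groups:
  - name: example_rule_group
    rules:
      - record: job:up:sum
        expr: sum by (job) (up)
EOF

# Then load it against a reachable Cortex:
# cortextool rules load ./example_rules_one.yaml
```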

Rules Lint

This command lints a rules file. The linter's aim is not to verify correctness, but only the YAML and PromQL expression formatting within the rule file. The command edits files in place; use the dry-run flag (-n) to perform a trial run that does not make any changes. This command does not interact with your Cortex cluster.

cortextool rules lint -n ./example_rules_one.yaml ./example_rules_two.yaml ...

Rules Prepare

This command prepares a rules file for upload to Cortex. It lints all your PromQL expressions and adds a specific label to your PromQL query aggregations in the file. This command does not interact with your Cortex cluster.

cortextool rules prepare -i ./example_rules_one.yaml ./example_rules_two.yaml ...

There are two flags of note for this command:

  • -i which edits the file in place; otherwise a new file with a .output extension is created with the results of the run.
  • -l which lets you specify the label to add to your aggregations; it is cluster by default.
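To see the flags in context, here is a sketch of preparing a file in place and with a custom aggregation label; the file contents and the label name are illustrative.

```shell
# Illustrative input for `rules prepare`.
cat > ./prepare_input.yaml << 'EOF'
namespace: example_namespace
groups:
  - name: example_rule_group
    rules:
      - record: job:up:sum
        expr: sum by (job) (up)
EOF

# Edit in place, adding the default `cluster` label:
# cortextool rules prepare -i ./prepare_input.yaml
# Or add a different label, e.g. `datacenter`:
# cortextool rules prepare -i -l datacenter ./prepare_input.yaml
```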

At the end of the run, the command tells you whether the operation was a success, in the form:

INFO[0000] SUCESS: 194 rules found, 0 modified expressions

It is important to note that a modification can be either a PromQL expression lint or a label added to an aggregation.

Rules Check

This command checks rules against the recommended best practices for rules. This command does not interact with your Cortex cluster.

cortextool rules check ./example_rules_one.yaml

Remote Read

Cortex exposes a Remote Read API which allows access to the stored series. The remote-read subcommand of cortextool interacts with this API to find out which series are stored.

Remote Read show statistics

The remote-read stats command summarizes statistics of the stored series matching the selector.

cortextool remote-read stats --selector '{job="node"}' --address http://demo.robustperception.io:9090 --remote-read-path /api/v1/read
INFO[0000] Create remote read client using endpoint 'http://demo.robustperception.io:9090/api/v1/read'
INFO[0000] Querying time from=2020-12-30T14:00:00Z to=2020-12-30T15:00:00Z with selector={job="node"}
INFO[0000] MIN TIME                           MAX TIME                           DURATION     NUM SAMPLES  NUM SERIES   NUM STALE NAN VALUES  NUM NAN VALUES
INFO[0000] 2020-12-30 14:00:00.629 +0000 UTC  2020-12-30 14:59:59.629 +0000 UTC  59m59s       159480       425          0                     0
Remote Read dump series

The remote-read dump command prints all series and samples matching the selector.

cortextool remote-read dump --selector 'up{job="node"}' --address http://demo.robustperception.io:9090 --remote-read-path /api/v1/read
{__name__="up", instance="demo.robustperception.io:9100", job="node"} 1 1609336914711
{__name__="up", instance="demo.robustperception.io:9100", job="node"} NaN 1609336924709 # StaleNaN
[...]
Remote Read export series into local TSDB

The remote-read export command exports all series and samples matching the selector into a local TSDB. This TSDB can then be further analysed with local tooling like prometheus and promtool.

# Use Remote Read API to download all metrics with label job=name into local tsdb
cortextool remote-read export --selector '{job="node"}' --address http://demo.robustperception.io:9090 --remote-read-path /api/v1/read --tsdb-path ./local-tsdb
INFO[0000] Create remote read client using endpoint 'http://demo.robustperception.io:9090/api/v1/read'
INFO[0000] Created TSDB in path './local-tsdb'
INFO[0000] Using existing TSDB in path './local-tsdb'
INFO[0000] Querying time from=2020-12-30T13:53:59Z to=2020-12-30T14:53:59Z with selector={job="node"}
INFO[0001] Store TSDB blocks in './local-tsdb'
INFO[0001] BLOCK ULID                  MIN TIME                       MAX TIME                       DURATION     NUM SAMPLES  NUM CHUNKS   NUM SERIES   SIZE
INFO[0001] 01ETT28D6B8948J87NZXY8VYD9  2020-12-30 13:53:59 +0000 UTC  2020-12-30 13:59:59 +0000 UTC  6m0.001s     15950        429          425          105KiB867B
INFO[0001] 01ETT28D91Z9SVRYF3DY0KNV41  2020-12-30 14:00:00 +0000 UTC  2020-12-30 14:53:58 +0000 UTC  53m58.001s   143530       1325         425          509KiB679B

# Examples for using local TSDB
## Analyzing contents using promtool
promtool tsdb analyze ./local-tsdb

## Dump all values of the TSDB
promtool tsdb dump ./local-tsdb

## Run a local prometheus
prometheus --storage.tsdb.path ./local-tsdb --config.file=<(echo "")

Overrides Exporter

The Overrides Exporter lets you continuously export per-tenant configuration overrides as metrics. It can also, optionally, export a presets file (cf. example override config file and presets file).

cortextool overrides-exporter --overrides-file overrides.yaml --presets-file presets.yaml

Generate ACL Headers

This lets you generate the header which can then be used to enforce access control rules in GME / GrafanaCloud.

./cortextool acl generate-header --id=1234 --rule='{namespace="A"}'

Analyse

Run analysis against your Prometheus, Grafana, and Cortex to see which metrics are being used and exported. It can also extract metrics from dashboard JSON and rules YAML files.

analyse grafana

This command will run against your Grafana instance and will download its dashboards and then extract the Prometheus metrics used in its queries. The output is a JSON file.

Configuration
| Env Variable | Flag | Description |
| --- | --- | --- |
| GRAFANA_ADDRESS | address | Address of the Grafana instance. |
| GRAFANA_API_KEY | key | The API key for the Grafana instance. Create a key using the following instructions: https://grafana.com/docs/grafana/latest/http_api/auth/ |
| (none) | output | The output file path. metrics-in-grafana.json by default. |
Running the command
cortextool analyse grafana --address=<grafana-address> --key=<API-Key>
Sample output
{
  "metricsUsed": [
    "apiserver_request:availability30d",
    "workqueue_depth",
    "workqueue_queue_duration_seconds_bucket",
    ...
  ],
  "dashboards": [
    {
      "slug": "",
      "uid": "09ec8aa1e996d6ffcd6817bbaff4db1b",
      "title": "Kubernetes / API server",
      "metrics": [
        "apiserver_request:availability30d",
        "apiserver_request_total",
        "cluster_quantile:apiserver_request_duration_seconds:histogram_quantile",
        "workqueue_depth",
        "workqueue_queue_duration_seconds_bucket",
        ...
      ],
      "parse_errors": null
    }
  ]
}
analyse ruler

This command will run against your Grafana Cloud Prometheus instance and will fetch its rule groups. It will then extract the Prometheus metrics used in the rule queries. The output is a JSON file.

Configuration
| Env Variable | Flag | Description |
| --- | --- | --- |
| CORTEX_ADDRESS | address | Address of the Prometheus instance. |
| CORTEX_TENANT_ID | id | If you're using Grafana Cloud, this is your instance ID. |
| CORTEX_API_KEY | key | If you're using Grafana Cloud, this is your API key. |
| (none) | output | The output file path. metrics-in-ruler.json by default. |
Running the command
cortextool analyse ruler --address=https://prometheus-blocks-prod-us-central1.grafana.net --id=<1234> --key=<API-Key>
Sample output
{
  "metricsUsed": [
    "apiserver_request_duration_seconds_bucket",
    "container_cpu_usage_seconds_total",
    "scheduler_scheduling_algorithm_duration_seconds_bucket"
    ...
  ],
  "ruleGroups": [
    {
      "namspace": "prometheus_rules",
      "name": "kube-apiserver.rules",
      "metrics": [
        "apiserver_request_duration_seconds_bucket",
        "apiserver_request_duration_seconds_count",
        "apiserver_request_total"
      ],
      "parse_errors": null
    },
    ...
}
analyse prometheus

This command will run against your Prometheus / Cloud Prometheus instance. It will then use the output from analyse grafana and analyse ruler to show you how many series in the Prometheus server are actually being used in dashboards and rules. Also, it'll show which metrics exist in Grafana Cloud that are not in dashboards or rules. The output is a JSON file.

Configuration
| Env Variable | Flag | Description |
| --- | --- | --- |
| CORTEX_ADDRESS | address | Address of the Prometheus instance. |
| CORTEX_TENANT_ID | id | If you're using Grafana Cloud, this is your instance ID. |
| CORTEX_API_KEY | key | If you're using Grafana Cloud, this is your API key. |
| (none) | grafana-metrics-file | The dashboard metrics input file path. metrics-in-grafana.json by default. |
| (none) | ruler-metrics-file | The rules metrics input file path. metrics-in-ruler.json by default. |
| (none) | output | The output file path. prometheus-metrics.json by default. |
Running the command
cortextool analyse prometheus --address=https://prometheus-blocks-prod-us-central1.grafana.net --id=<1234> --key=<API-Key> --log.level=debug
Sample output
{
  "total_active_series": 38184,
  "in_use_active_series": 14047,
  "additional_active_series": 24137,
  "in_use_metric_counts": [
    {
      "metric": "apiserver_request_duration_seconds_bucket",
      "count": 11400,
      "job_counts": [
        {
          "job": "apiserver",
          "count": 11400
        }
      ]
    },
    {
      "metric": "apiserver_request_total",
      "count": 684,
      "job_counts": [
        {
          "job": "apiserver",
          "count": 684
        }
      ]
    },
    ...
  ],
  "additional_metric_counts": [
    {
      "metric": "etcd_request_duration_seconds_bucket",
      "count": 2688,
      "job_counts": [
        {
          "job": "apiserver",
          "count": 2688
        }
      ]
    },
    ...
analyse dashboard

This command accepts Grafana dashboard JSON files as input and extracts Prometheus metrics used in the queries. The output is a JSON file compatible with analyse prometheus.

Running the command
cortextool analyse dashboard ./dashboard_one.json ./dashboard_two.json ...
analyse rule-file

This command accepts Prometheus rule YAML files as input and extracts Prometheus metrics used in the queries. The output is a JSON file compatible with analyse prometheus.

Running the command
cortextool analyse rule-file ./rule_file_one.yaml ./rule_file_two.yaml ...

logtool

A CLI tool that parses Cortex query-frontend logs and formats them for easy analysis.

Options:
  -dur duration
        only show queries which took longer than this duration, e.g. -dur 10s
  -query
        show the query
  -utc
        show timestamp in UTC time

Feed logs into it using logcli from Loki, kubectl for Kubernetes, cat from a file, or any other way to get raw logs:

Loki logcli example:

$ logcli query '{cluster="us-central1", name="query-frontend", namespace="dev"}' --limit=5000 --since=3h --forward -o raw | ./logtool -dur 5s
https://logs-dev-ops-tools1.grafana.net/loki/api/v1/query_range?direction=FORWARD&end=1591119479093405000&limit=5000&query=%7Bcluster%3D%22us-central1%22%2C+name%3D%22query-frontend%22%2C+namespace%3D%22dev%22%7D&start=1591108679093405000
Common labels: {cluster="us-central1", container_name="query-frontend", job="dev/query-frontend", level="debug", name="query-frontend", namespace="dev", pod_template_hash="7cd4bf469d", stream="stderr"}

Timestamp                                TraceID           Length    Duration       Status  Path
2020-06-02 10:38:40.34205349 -0400 EDT   1f2533b40f7711d3  12h0m0s   21.92465802s   (200)   /api/prom/api/v1/query_range
2020-06-02 10:40:25.171649132 -0400 EDT  2ac59421db0000d8  168h0m0s  16.378698276s  (200)   /api/prom/api/v1/query_range
2020-06-02 10:40:29.698167258 -0400 EDT  3fd088d900160ba8  168h0m0s  20.912864541s  (200)   /api/prom/api/v1/query_range
$ cat query-frontend-logs.log | ./logtool -dur 5s
Timestamp                                TraceID           Length    Duration       Status  Path
2020-05-26 13:51:15.0577354 -0400 EDT    76b9939fd5c78b8f  6h0m0s    10.249149614s  (200)   /api/prom/api/v1/query_range
2020-05-26 13:52:15.771988849 -0400 EDT  2e7473ab10160630  10h33m0s  7.472855362s   (200)   /api/prom/api/v1/query_range
2020-05-26 13:53:46.712563497 -0400 EDT  761f3221dcdd85de  10h33m0s  11.874296689s  (200)   /api/prom/api/v1/query_range

benchtool

A tool for benchmarking a Prometheus remote-write backend and a PromQL-compatible query API. It allows metrics to be generated according to a workload file.

License

Licensed Apache 2.0, see LICENSE.

cortex-tools's People

Contributors

colega, cstyan, eamonryan, gotjosh, gouthamve, hjet, javad-hajiani, jeschkies, jtlisi, justincmoy, justintm, jutley, luna-duclos, mattmendick, moertel, mpursley, owen-d, pracucci, pstibrany, replay, rgeyer, roidelapluie, samjewell, sandeepsukhani, sh0rez, shantanualsi, simonswine, stevesg, tomwilkie, vitovitolo


cortex-tools's Issues

SIGSEGV in cortextool version command

When running cortextool version on a host that can't reach github, the program crashes.

$ cortextool version
version 0.3.2
checking latest version... panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1e491ce]

goroutine 1 [running]:
github.com/grafana/cortex-tools/pkg/version.getLatestFromGitHub(0xc0013c2d00, 0x1a)
	/build/source/pkg/version/version.go:40 +0x10e
github.com/grafana/cortex-tools/pkg/version.CheckLatest()
	/build/source/pkg/version/version.go:21 +0x49
main.main.func1(0xc00152c510, 0x40c5a3, 0x20d0500)
	/build/source/cmd/cortextool/main.go:33 +0x98
gopkg.in/alecthomas/kingpin%2ev2.(*actionMixin).applyActions(0xc0002cbd58, 0xc00152c510, 0x0, 0x0)
	/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/actions.go:28 +0x6d
gopkg.in/alecthomas/kingpin%2ev2.(*Application).applyActions(0xc00120c0f0, 0xc00152c510, 0x0, 0x0)
	/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:557 +0xdc
gopkg.in/alecthomas/kingpin%2ev2.(*Application).execute(0xc00120c0f0, 0xc00152c510, 0xc00103afe0, 0x1, 0x1, 0x0, 0x0, 0x0, 0xc001747f08)
	/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:390 +0x8f
gopkg.in/alecthomas/kingpin%2ev2.(*Application).Parse(0xc00120c0f0, 0xc00000e090, 0x1, 0x1, 0x1, 0xc000b8ac38, 0x0, 0x1)
	/build/go/pkg/mod/gopkg.in/alecthomas/[email protected]/app.go:222 +0x1fe
main.main()
	/build/source/cmd/cortextool/main.go:38 +0x1bf
[1]    142838 exit 2     cortextool version

Dereference at issue: https://github.com/grafana/cortex-tools/blob/v0.3.2/pkg/version/version.go#L40

Invalid YAML loading a rule with a multiline field

How to reproduce?

  • port-forward ruler Pod
$ kubectl -n cortex port-forward deploy/ruler 8080:80
  • create a rule group YAML file with a multiline expr field that starts with a blank line
$ cat >> test.yml << EOF
groups:
  - name: rule-group-name
    rules:
    - alert: alert-name
      expr: |

        up{jop="my-awesome-job"} == 0
EOF
  • load rule
$ cortextool rules load --address "http://localhost:8080" --id=0 --log.level="debug" test.yml
  • cortextool debug output:
INFO[0000] log level set to debug
DEBU[0000] path built to request rule group              url=/api/prom/rules/test/rule-group-name
DEBU[0000] sending request to cortex api                 method=GET url="http://localhost:8080/api/prom/rules/test/rule-group-name"
DEBU[0000] checking response                             status="404 Not Found"
DEBU[0000] resource not found                            fields.msg="request failed with response body group does not exist\n" status="404 Not Found"
DEBU[0000] sending request to cortex api                 method=POST url="http://localhost:8080/api/prom/rules/test"
DEBU[0000] checking response                             status="400 Bad Request"
ERRO[0000] requests failed                               fields.msg="request failed with response body unable to decoded rule group\n" status="400 Bad Request"
ERRO[0000] unable to load rule group                     error="failed request to the cortex api" group=rule-group-name namespace=test
cortextool: error: load operation unsuccessful, try --help

additional data

  • cortex ruler version 1.4.0

  • cortex-tools compiled with current master HEAD 432ad77

  • http request dump

POST /api/prom/rules/test HTTP/1.1
Host: localhost:8080
User-Agent: Go-http-client/1.1
Content-Length: 107
X-Scope-Orgid: 0
Accept-Encoding: gzip

name: rule-group-name
rules:
    - alert: alert-name
      expr: |4

        up{jop="my-awesome-job"} == 0

As you can see, the body content is not a valid YAML.

expected behaviour

cortex-tools should ensure it is sending valid YAML before it reaches the Cortex API.

I'm willing to help fix it.

Thanks!

Do not return fatal when no rules are loaded yet

When you have no rules loaded yet and try to run cortextool rules list, the output you'll get is the following:

$ cortextool rules list
FATA[0000] unable to read rules from cortex, requested resource not found

This is a bit misleading given that we were able to query for rules; there just aren't any yet.

Remove trailing slash (/) from Cortex Address

Using an address with a trailing slash can cause unexpected behaviour.

e.g.

$ CORTEX_ADDRESS=https://prometheus-us-central1.grafana.net/ cortextool rules load test.yml
ERRO[0000] unable to load rule group                     error="requested resource not found" group=up_job namespace=example_namespace
cortextool: error: load operation unsuccessful, try --help

In some cases, the trailing slash does not cause a problem, e.g. when using the diff command:

$ CORTEX_ADDRESS=https://prometheus-us-central1.grafana.net/ cortextool rules diff --rule-files=test.yml
Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: example_namespace
  ~ Group: up_job

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted

I think this is because the diff command does not make use of any endpoints where the trailing slashes matter (e.g. subroutes on the API).

HTTP requests are not URL encoding parameters

I tried to delete a rule group that includes a % in the name. I got an error with the following reasoning:

invalid URL escape "% n"

Command line inputs need to be sanitized before being used in a URL.

Split CORTEX_ADDRESS between AM / Ruler

You can have your Ruler and Alertmanager at separate URLs. As a result, it becomes tedious to switch addresses between commands; we should make it more explicit that these are two separate components.

cortex-tools client results in huge binary

The cortex-tools project includes a client for the cortex-ruler. This, in itself, seems pretty simple. However, including the client in a simple Go app causes that app to grow from 12 MB to 72 MB, importing Cortex, AWS CLI, and lots of other things.

Is it possible to simplify the client such that it doesn't increase binary size so much?

How to list all user rule groups

I am a Cortex administrator; at the moment it is unclear which tenants create or delete rule groups.
Is there an API that can export all users' rule groups?

I want to record all rule groups to a MySQL DB and sync them to Cortex periodically.

rules diff subcommand reports spurious changes

Loading the follow namespace/file into cortex:

groups:
- name: my_group
  rules:
  - record: value
    expr: vector(0)
    labels:
      val: '0'

And then immediately diffing with cortextool rules diff will produce a report that the group my_group will be updated.

Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: my_namespace
  ~ Group: my_group

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted

This is because the rules file is unmarshalled into a Prometheus Rule struct with Annotations map[string]string = nil, while Cortex assigns the same field an empty map. This leads the deep-equality check to report a difference.

Users can work around this bug by adding an empty annotations map to their rule files. But this is a little counterintuitive, given that the documentation doesn't show annotations as a valid field for recording rules. It might be better for cortextool to fill this field in itself if it is going to rely on reflect.DeepEqual.
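The workaround described above can be sketched as follows, reusing the rule group from this report; whether Cortex accepts the explicit empty map unchanged is an assumption.

```shell
# Workaround sketch: declare an explicit empty annotations map so the
# deep-equality check matches the empty map Cortex assigns.
cat > ./rules_with_workaround.yaml << 'EOF'
groups:
  - name: my_group
    rules:
      - record: value
        expr: vector(0)
        labels:
          val: '0'
        annotations: {}
EOF
```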

Richer diff when running `cortextool diff|sync`

Right now, when you use the diff command it will only tell you which groups are going to be changed, but does not tell you which individual rules/alerts are changing.

It would be good to have an idea of what exactly is changing when you run this or the sync command.

Homebrew formula

It would be great if this was packaged for homebrew to make it easier to update/manage via brew.

Allow cortextool rules diff to accept an allowlist of namespaces rather than only a denylist

usage: cortextool rules diff --address=ADDRESS --id=ID [<flags>]

diff a set of rules to a designated cortex endpoint

Flags:
  --help                      Show context-sensitive help (also try --help-long and --help-man).
  --log.level="info"          set level of the logger
  --push-gateway.endpoint=PUSH-GATEWAY.ENDPOINT
                              url for the push-gateway to register metrics
  --push-gateway.job=PUSH-GATEWAY.JOB
                              job name to register metrics
  --push-gateway.interval=1m  interval to forward metrics to the push gateway
  --key=""                    Api key to use when contacting cortex, alternatively set $CORTEX_API_KEY.
  --backend=cortex            Backend type to interact with: <cortex|loki>
  --address=ADDRESS           Address of the cortex cluster, alternatively set CORTEX_ADDRESS.
  --id=ID                     Cortex tenant id, alternatively set CORTEX_TENANT_ID.
  --tls-ca-path=""            TLS CA certificate to verify cortex API as part of mTLS, alternatively set CORTEX_TLS_CA_PATH.
  --tls-cert-path=""          TLS client certificate to authenticate with cortex API as part of mTLS, alternatively set CORTEX_TLS_CERT_PATH.
  --tls-key-path=""           TLS client certificate private key to authenticate with cortex API as part of mTLS, alternatively set CORTEX_TLS_KEY_PATH.
  --ignored-namespaces=IGNORED-NAMESPACES
                              comma-separated list of namespaces to ignore during a diff.
  --rule-files=RULE-FILES     The rule files to check. Flag can be reused to load multiple files.
  --rule-dirs=RULE-DIRS       Comma separated list of paths to directories containing rules yaml files. Each file in a directory with a .yml or .yaml
                              suffix will be parsed.
  --disable-color             disable colored output

Currently, as you can see above, the diff command accepts a list of namespaces to ignore, which makes it very hard to diff one particular namespace.

I'd like to suggest also adding an allowlist of namespaces, and making the two flags mutually exclusive, to make this easier.

Specifying leading directories in path to template files causes parsing errors

I created an issue on Cortex (cortexproject/cortex#3357) about weird directory-processing behavior on the alertmanager side, but @gotjosh suggested I create an issue here to address the root problem. I wouldn't assume that the path I ask cortextool to read from on the local machine would have any effect on how Cortex processes the file on the backend. I'm not sure the path should be sent to the backend at all. Even sending the filename is somewhat annoying, but I think that is necessary to match the template name up with the alert YAML.

This seems like an issue where the tool should ignore the directory specified and not send that full path to the backend.

Unify use of YAML libraries

At the moment we're using both go-yaml.v2 and go-yaml.v3; it would be good to unify on one and avoid the potential pitfalls of having two versions.

upload tsdb block to cortex

It would be nice if cortex-tools gained a block-upload feature that enables posting a TSDB block to block storage, e.g. after a Cortex downtime.

Confusing parse error messages when using loki backend

Given bug.yaml:

namespace: bug
groups:
  - name: bug
    rules:
      - alert: AlwaysFire
        expr: vector(1)

cortextool rules lint --backend=loki bug.yaml gives:

ERRO[0000] unable parse rules file                       error="could not parse expression: parse error at line 1, col 1: syntax error: unexpected IDENTIFIER" file=bug.yaml
cortextool: error: prepare operation unsuccessful, unable to parse rules files: file read error, try --help

There's nothing obviously wrong at line 1, col 1. I have an invalid LogQL expression, and cortextool should tell me that directly.

cortextool 0.3.2
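For contrast, a rule file whose expr is a valid LogQL expression should pass the Loki linter; this is a sketch with a hypothetical job label and threshold.

```shell
# Sketch of a Loki rule file with a valid LogQL alert expression
# (job label and threshold are hypothetical).
cat > ./loki_rules.yaml << 'EOF'
namespace: loki_example
groups:
  - name: loki_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate({job="my-app"} |= "error" [5m])) > 1
EOF

# cortextool rules lint --backend=loki ./loki_rules.yaml
```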

Allow send TLS client certificate to Cortex API

👋 hi!

I've been running Cortex on k8s for a while, and I've protected the API with TLS client authentication with the help of the ingress-nginx controller.

Right now I want to use cortex-tools to lint and load rules in an automated fashion from a CI pipeline. Thus I would like to authenticate the http client with a TLS client certificate.

I saw you're using the Go http client, so it shouldn't be hard to add TLS certs to the CortexClient struct:

client http.Client

It would be something like https://gist.github.com/michaljemala/d6f4e01c4834bf47a9c4

The cli flags would look like:

cortextools rules load my-rule.yml --address=ADDRESS --id=ID --cacert ca.pem --key client.key --cert client.pem

and also adding environment variables:

CORTEX_TLS_CA_CERT
CORTEX_TLS_CLIENT_KEY
CORTEX_TLS_CLIENT_CERT

I'm wondering if you consider TLS client auth useful for the project. In that case, I'm willing to send a PR.

Thanks!

`rules list` command should support json or yaml output

I'd like to be able to list rules in Cortex, then programmatically process them. The output of rules list is great for the human eye, but needlessly difficult to program around.

We should add an -o, --output flag to support YAML output. JSON output would also be appreciated, though plenty of client tools can make this conversion as necessary.

Cannot print rules since loading a recording rule

I loaded this:

namespace: test
groups:
- name: default
  rules:
    - alert: AlwaysFiring
      record: ""
      for: 0s
      expr: 1 == bool 1
    - record: "agent:custom_server_info:up"
      alert: ""
      expr: |2
          custom_server_info * 0
        unless on (agent_hostname)
          up{job="integrations/agent"}
        or on (agent_hostname)
          custom_server_info

Since then, I can't print the rules anymore.

FATA[0000] unable to read rules from cortex, yaml: line 10: did not find expected key

I can query the API and get the expected YAML.

curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/v1/rules"
bug:
    - name: default
      rules:
        - record: test:scalar:bug
          expr: vector(1)
test:
    - name: default
      rules:
        - alert: AlwaysFiring
          expr: 1 == bool 1
        - record: agent:custom_server_info:up
          expr: |4
              custom_server_info * 0
            unless on (agent_hostname)
              up{job="integrations/agent"}
            or on (agent_hostname)
              custom_server_info

To try to isolate the bug, I deleted all the rules and I tried loading this one:

namespace: bug
groups:
- name: test
  rules:
    - record: "test:scalar:bug"
      expr: |2
          vector(1)
        or
          vector(2)
cortextool rules load rules-bug.yml \
--address=$CORTEX_URL \
--id=$CORTEX_USER \
--key=$CORTEX_KEY \
--log.level=debug

INFO[0000] log level set to debug
DEBU[0000] path built to request rule group              url=/api/prom/rules/bug/test
DEBU[0000] sending request to cortex api                 method=GET url="https://prometheus-us-central1.grafana.net/api/prom/rules/bug/test"
DEBU[0000] checking response                             status="404 Not Found"
DEBU[0000] resource not found                            fields.msg="request failed with response body group does not exist\n" status="404 Not Found"
DEBU[0000] sending request to cortex api                 method=POST url="https://prometheus-us-central1.grafana.net/api/prom/rules/bug"
DEBU[0000] checking response                             status="400 Bad Request"
ERRO[0000] requests failed                               fields.msg="request failed with response body unable to decoded rule group\n" status="400 Bad Request"
ERRO[0000] unable to load rule group                     error="failed request to the cortex api" group=test namespace=bug

I was able to load this rule group using curl.

name: bug
rules:
  - record: "test:scalar:bug"
    expr: |2
        vector(1)
      or
        vector(2)
curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/prom/rules/bug" -H "Content-Type: application/yaml" --data-binary @rules-bug-api.yml -i
HTTP/2 202
content-length: 58
content-type: application/json
date: Fri, 20 Nov 2020 16:41:23 GMT
via: 1.1 google
alt-svc: clear

{"status":"success","data":null,"errorType":"","error":""}

I'm still unable to print the rules:

INFO[0000] log level set to debug
DEBU[0000] sending request to cortex api                 method=GET url="https://prometheus-us-central1.grafana.net/api/prom/rules"
DEBU[0000] checking response                             status="200 OK"
FATA[0000] unable to read rules from cortex, yaml: line 3: did not find expected key

But I can GET them from the API:

curl -u $CORTEX_USER:$CORTEX_KEY "$CORTEX_URL/api/prom/rules"
bug:
    - name: bug
      rules:
        - record: test:scalar:bug
          expr: |4
              vector(1)
            or
              vector(2)

Also, this rule does not run! I don't see the test:scalar:bug metric in my database.

If I create the same rule on a single line, then it works, so I think both Cortex and cortextool have an issue with YAML block scalars that use an indentation indicator, as described in the Prometheus docs.
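A possible workaround (my assumption, not verified against the Cortex ruler): avoid the indentation indicator entirely by indenting every line of the expression equally, since PromQL does not care about leading whitespace:

```yaml
namespace: bug
groups:
- name: test
  rules:
    - record: "test:scalar:bug"
      expr: |
        vector(1)
        or
        vector(2)
```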

New mixed-unit durations from prometheus/common v0.11.0 cannot be deserialized

Recently, github.com/prometheus/common@v0.11.0 introduced the ability to define durations using mixed units, e.g. 1h30m. This change is not backwards compatible, which creates issues for this project.

This change has been in the cortexproject/cortex master branch for a while, which some extremely notable users (Grafana Cloud!) are using. These changes are also in this project's master branch, but are unreleased. This means that there is no released version of cortextool that can properly interact with Prometheus rules stored in these Cortex instances.

All this requires is a new release of this project! Please cut a new release!

Implement a basic "linter" for rules files

Often when preparing rules (`rules prepare`) you would like to have a clear idea of what changed in the diff. Given there's no homogeneous way of formatting rules YAML (e.g. no Prometheus rules linter), a side effect of the marshalling/unmarshalling of rules files is that your expressions and the file itself end up being reformatted by either the PromQL parser or the Go YAML library.

This makes it difficult to have a consistent diff.

Given there's a promfmt for rules in the works, let's do something simple as an intermediary step: take file(s), unmarshal them into our structs, marshal them back, and format the PromQL expressions in the rules file. With this, our users can "lint" their files before running them through the prepare command and get a more consistent diff of what changed.

go.mod and repo name discrepancy

The go.mod file specifies cortextool as the name, while the repo is actually named cortex-tools.

While this is not a big problem, it is confusing when naively trying to import code from this repo:

go: github.com/sh0rez/gctl/pkg/spec imports
	github.com/grafana/cortex-tools/pkg/client: github.com/grafana/[email protected]: parsing go.mod:
	module declares its path as: github.com/grafana/cortextool
	        but was required as: github.com/grafana/cortex-tools

To avoid actually breaking Go imports, the "clean" solution would probably be to rename this repo to grafana/cortextool.

YAML format of `rules get` should match `rules sync`

It would be helpful if the output format of cortextool rules get <namespace> <group> matched the exact format expected by cortextool rules sync. This would make it easier to pull down all rules and commit them to Git, then sync them back again. Right now I have to munge the YAML slightly to make it compatible with sync.

Right now the format is this:

$ cortextool rules get somenamespace anygroup
name: anygroup
rules:
    - alert: FrontEnd Prometheus
      expr: .......

Ideally it would be this:

namespace: somenamespace
groups:
    - name: anygroup
      rules:
        - alert: FrontEnd Prometheus
          expr: ........

Improve changelogs

Right now we keep a central changelog for all the binaries. Consider giving each binary its own separate changelog.

Unify the docker images

I don't think we need three (and maybe more?) images; we could pack all the binaries into a single image, making the release process simpler.

`ruleEquals` function does not work with yaml V3

When running Cortex locally, I tried loading/diffing a local rules file against the rules endpoint multiple times. Since rulefmt started using yaml.v3, the RuleNode struct contains extra information about the formatting of the underlying YAML file. The YAML returned from the Cortex API will not have the same formatting, which can lead to diffs when none exist:

INFO[0000] updating group                                difference="rule #0 does not match {{8 0 !!str sum_up  <nil> []    3 15} {0 0    <nil> []    0 0} {8 0 !!str sum(up)  <nil> []    4 13} 0s map[] map[]} != {{8 0 !!str sum_up  <nil> []    5 13} {0 0    <nil> []    0 0} {8 0 !!str sum(up)  <nil> []    4 11} 0s map[] map[]}" group=test_rules namespace=rules

The differences in the above string are due to the YAML line and column positions, not the rules themselves.
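A sketch of the fix: compare only the semantic value of each node and ignore the position metadata. The struct below merely mimics the position-carrying fields of yaml.v3's yaml.Node; it is not the real type:

```go
package main

import "fmt"

// node stands in for yaml.v3's yaml.Node: Value is the semantic
// content, Line/Column are formatting metadata that differs between
// the local file and the YAML returned by the Cortex API.
type node struct {
	Value  string
	Line   int
	Column int
}

// nodesEqual compares only the value, so two rules that merely sit
// at different positions in their source files compare as equal.
func nodesEqual(a, b node) bool {
	return a.Value == b.Value
}

func main() {
	local := node{Value: "sum(up)", Line: 4, Column: 13}
	remote := node{Value: "sum(up)", Line: 4, Column: 11}
	fmt.Println(nodesEqual(local, remote)) // true: positions differ, rules match
}
```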

cortextool not creating or updating existing rules

It seems that cortextool rules sync/cortextool rules load aren't working in my environment. We're running Cortex v1.5.0 and cortextool v0.5.0.

I've created a new rule group in a new namespace using the below config:

namespace: collector-rules
groups:
    - name: collector-status
      rules:
        - record: ""
          alert: PrometheuServerIsDown
          expr: absent(up)
          for: 10m
          labels:
            severity: critical
          annotations:
            assignment_group: Site Reliability Engineering
            company: REDACTED
            description: Cortex has not received any metrics from the n4monitoring tenant for 10 minutes
            impact: "1"
            suggested_actions: Check if Prometheus is running in the REDACTED namespace
            summary: 'Cortex has not received metrics for REDACTED for 10minutes'
            urgency: "1"

The namespace collector-rules does not exist yet. cortextool rules load throws an error:

ryan@WINDOWS-H8Q4C40:~/rw170/Documents/cortex-config$ cortextool rules load n4monitoring/rulegroups/collector.yml
ERRO[0000] unable to load rule group                     error="requested resource not found" group=collector-status namespace=collector-rules
cortextool: error: load operation unsuccessful, try --help

cortextool rules sync --rule-dirs=<dir> also throws an error:

ryan@WINDOWS-H8Q4C40:~/rw170/Documents/cortex-config$ cortextool rules sync --rule-dirs=n4monitoring/rulegroups
INFO[0000] creating group                                group=collector-status namespace=collector-rules
cortextool: error: sync operation unsuccessful, unable to complete executing changes.: requested resource not found, try --help

I've got a feeling it's potentially related to our ingress rules, but I can't spot anything. Here is the ingress:

spec:
  rules:
  - host: REDACTED
    http:
      paths:
      - backend:
          serviceName: alertmanager
          servicePort: 80
        path: /multitenant_alertmanager/status
      - backend:
          serviceName: alertmanager
          servicePort: 80
        path: /alertmanager
      - backend:
          serviceName: alertmanager
          servicePort: 80
        path: /api/v1/alerts
      - backend:
          serviceName: ruler
          servicePort: 80
        path: /ruler/ring
      - backend:
          serviceName: ruler
          servicePort: 80
        path: /api/v1/rules
      - backend:
          serviceName: ruler
          servicePort: 80
        path: /api/prom/api/v1/alerts
      - backend:
          serviceName: ruler
          servicePort: 80
        path: /api/prom/rules
      - backend:
          serviceName: distributor
          servicePort: 80
        path: /distributor/all_user_stats

Rule groups are getting double escaped

#131 shows an example where GetRuleGroup would escape the space in escaped namespace and buildRequest would re-escape it, turning %20 into %2520 and failing the test.

rules list command returns generic 404 when no rules are configured

Running cortextool rules list when no rules are configured returns the following error message

$ cortextool rules list
time="2019-12-17T12:49:29-05:00" level=fatal msg="unable to read rules from cortex, requested resource not found"

This initially led me to believe there was an issue with my CORTEX_ADDRESS value. Should it return an empty list / fail silently instead?

What is the difference between the `load` and `sync` commands?

At the moment, it is unclear what the exact difference between the two commands is. From a quick peek at the code, it seems like load only uploads a rule group if it does not already exist under that namespace, while sync is more of a "replace everything, but tell me about it".

It would be good to make clear how we support each of the following use cases:

  • Create or update a namespace/group/rule only if it doesn't exist
  • Create or update a namespace/group/rule (regardless of its status)
  • Find the difference (at each of namespace/group/rules) between the input file and what already exists on the server. Then delete or create whatever is missing.
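Based purely on that reading of the code (so treat this as an assumption about the semantics, not documented behavior), the difference between the two commands could be sketched over maps standing in for the real rule-group types:

```go
package main

import "fmt"

// load creates a rule group only if it does not already exist;
// existing groups are left untouched.
func load(server, local map[string]string) {
	for group, rules := range local {
		if _, exists := server[group]; !exists {
			server[group] = rules
		}
	}
}

// sync makes the server match the local files exactly: groups missing
// locally are deleted (and reported), everything else is overwritten.
func sync(server, local map[string]string) {
	for group := range server {
		if _, exists := local[group]; !exists {
			fmt.Println("deleting group", group)
			delete(server, group)
		}
	}
	for group, rules := range local {
		server[group] = rules
	}
}

func main() {
	server := map[string]string{"old": "expr1", "keep": "expr2"}
	local := map[string]string{"keep": "expr2-updated", "new": "expr3"}

	load(server, local)
	fmt.Println(server["keep"]) // still "expr2": load never overwrites

	sync(server, local)
	fmt.Println(server["keep"]) // now "expr2-updated": sync replaces everything
}
```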

Add Docker Image building to the CI

With #47 we introduced a breaking change that wouldn't allow us to build Docker images. It would be good to have an image-building process as part of the pipeline to catch these a bit earlier.

Add namespace flag to configure rules commands

Currently, cortextool always sets the namespace based on the name of the file. This behavior results in rule organization that feels quite unnatural. For example, if we want to define one alert per file, we end up with an absurd number of namespaces.

Additionally, not having control over the namespace makes the new sync functionality difficult to use. It allows us to ignore namespaces, but since there are so many namespaces that are dynamically created, this flag doesn't do anything particularly useful for us. We would much rather use sync with a specific namespace, then have all the changes applied within that namespace.
