thanos-io / thanos
Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
Home Page: https://thanos.io
License: Apache License 2.0
Hey @Bplotka @fabxc,
I'm using the following to stand up a Thanos store gateway:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: thanos-store
spec:
  serviceName: "thanos-store"
  replicas: 1
  selector:
    matchLabels:
      app: thanos
      thanos-peer: "true"
  template:
    metadata:
      labels:
        app: thanos
        thanos-peer: "true"
    spec:
      containers:
      - name: thanos-store
        image: improbable/thanos:latest
        env:
        - name: SECRET_ACCESS
          valueFrom:
            secretKeyRef:
              name: secret-s3
              key: secret_access_s3
        - name: SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: secret-s3
              key: secret_key_s3
        args:
        - "store"
        - "--log.level=debug"
        - "--tsdb.path=/var/thanos/store"
        - "--cluster.peers=thanos-peers.default.svc.cluster.local:10900"
        - "--s3.bucket=<bucket_name>"
        - "--s3.endpoint=s3-<az>.amazonaws.com"
        - "--s3.access-key=$(SECRET_ACCESS)"
        - "--s3.secret-key=$(SECRET_KEY)"
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        - name: cluster
          containerPort: 10900
        volumeMounts:
        - name: data
          mountPath: /var/thanos/store
      volumes:
      - name: data
        emptyDir: {}
However the image is failing to spin up:
Error parsing commandline arguments: required flag --gcs.bucket not provided
usage: thanos store --gcs.bucket=<bucket> [<flags>]
I suspect the improbable/thanos:latest image hasn't been updated and is an older image, as I can see in the master branch that the --gcs.bucket flag has been changed so it is no longer Required():
// registerStore registers a store command.
func registerStore(m map[string]setupFunc, app *kingpin.Application, name string) {
	cmd := app.Command(name, "store node giving access to blocks in a GCS bucket")

	grpcAddr := cmd.Flag("grpc-address", "listen address for gRPC endpoints").
		Default(defaultGRPCAddr).String()

	httpAddr := cmd.Flag("http-address", "listen address for HTTP endpoints").
		Default(defaultHTTPAddr).String()

	dataDir := cmd.Flag("tsdb.path", "data directory of TSDB").
		Default("./data").String()

	gcsBucket := cmd.Flag("gcs.bucket", "Google Cloud Storage bucket name for stored blocks. If empty sidecar won't store any block inside Google Cloud Storage").
		PlaceHolder("<bucket>").String()
For instance: it looks like just a UI issue, with nothing wrong with LabelValues. I can call /api/v1/label/__name__/values?_=1513164422596 as aggressively as I want.
During debugging for downsampling last week I discovered at least one block that contained duplicated and out of order data.
For about 10% of its series there was a sequence of 3 chunks that was repeated about 180 times. At the end of that repetition were two chunks for times before that sequence. I could not come up with an explanation for how this could have happened. Generally, the original block with those 3 chunks might not have been GC'd and then accidentally been re-compacted over several iterations. But this should not get us to 180.
No data loss occurred as far as I can tell and this should be fully recoverable.
Even though the duplication is very high, the total data blow up is fairly minimal AFAICT.
As we have no explanation of what caused this, the best way to address it seems to be a
thanos bucket check
command that walks existing blocks and detects the issue. It could also be extended to re-write affected blocks properly.

From time to time I also see:
querying store failed: receive series: rpc error: code = Aborted desc = fetch series for block 01C18S2S39ZQ4VK43SKVYBRKSR: invalid remaining size 4096, expected 13485
Since the most recent data is on Prometheus servers, we must be able to fan out to those for queries. For overall simplicity of the system it is feasible to have the sidecar implement the store API on top of Prometheus's remote read API or HTTP v1 API. Probably either works fine but the HTTP v1 API has proper stability guarantees.
Currently when we upload blocks to the object storage, if we encounter an error we clean up after ourselves and delete all files that were uploaded so far. This is however not 100% safe yet.
If the uploader has a hard crash, orphaned objects are left behind. Files are uploaded in order so a block will only be validly considered if the meta.json
exists, which is uploaded last. This guards against, for example, considering a block during garbage collection even though it is only partial. This is however implicit and makes us rely on upload order, i.e. we cannot parallelise it.
We should come up with a way to ensure this. This could either be a dedicated done
file for each block or setting of a metadata/label onto the meta.json
file in the object storage. Either way, our code would do an explicit finalization after successful upload of all block data.
We should also add a background cleanup procedure that detects and purges partial blocks.
Store nodes are currently generally run as a single replica. It's not super critical to have HA in general since several hours or even days of recent data are HA via the Prometheus servers. But for some scenarios it might still be preferable.
Two could simply be deployed and the query node would take care of deduplication/merging just like for Prometheus HA pairs. But unlike Prometheus servers, the underlying data is truly the same in this case and fetching twice the amount is unnecessary overhead.
Some simple logic could be added to the query node to recognize real duplicates (Prometheus HA pairs actually differ through a replica label) and to only query one of them.
While running in AWS, you can use the instance profile to get an auth token, but currently Thanos doesn't configure the bucket if no credentials are provided.
minio supports IAM:
iam := credentials.NewIAM("")
s3Client, err := minio.NewWithCredentials("s3.amazonaws.com", iam, true, "")
Would you consider a PR?
Please consider creating a release, so that people using Thanos can reference version numbers.
Found a potential issue: the block shipper test is failing for me locally:
https://github.com/improbable-eng/promlts/blob/master/pkg/shipper/shipper_test.go#L95
We use DB.Blocks() to determine how many blocks should be there. However, that number only updates when the DB gets reloaded (an unexported method right now). Snapshots are also in another directory and thus not visible to the DB.
The number of blocks in the mem store varies in my failing tests.
The tests passing on CI before might be an indicator that the number in the mem store has been 0 before. The test would then pass as DB.Blocks() also reports zero.
Those are just guesses, as I was about to wrap it up for today.
The store node is leaking memory very heavily. For the high request frequency I currently have, each query bumps the resident memory by 200-500MB.
The memory seems to go somewhere in net/http, but is neither reused nor freed. From the heap profile I cannot tell whether it comes from our own gRPC or from requests to GCS.
We are not opening connections ourselves for either but only instantiate a constant number of clients. So possibly it's something we have to change in how we configure those.
I think scraping deduplicated data is actually the correct answer. I think we had an idea to add a deduplication batch job for old data that would reduce the data in the object store.
The querying layer is a stateless and horizontally scalable component that implements PromQL on top of the store API. It parses PromQL queries and loads relevant series and their chunks through the store API.
It is TBD whether the store nodes should proxy to the sidecars or the querying layer knows about store nodes and all Prometheus servers. Once we implement pre-filtering of blocks/Prometheus servers based on well-known label sets it might be preferable to not pull this complexity in the querying layer itself.
It should query just a single random store node and only fail over if that one does not respond. However, merging and deduplicating overlapping data should be a direct capability of the query layer.
The query layer implements the Prometheus HTTP v1 query API to be compatible with existing tools such as Grafana.
We need a well-defined gRPC API to query series data by time range and a set of label selectors. It should return compressed chunks. The compression format is known and query layer or other clients can decompress chunks as needed after retrieval.
This API will not be compatible with the Prometheus remote read API.
Basically we don't want to ask for the same data twice. Since we know that:
As we agreed, it would be nice to get as much as possible from the Source and the rest from the store.
As an example:
0───2───4───6───8───10
└───Store───┘
        └──Source───┘
    └───Query───┘
When we query for 2-8, we can use a naive algorithm that gets max(minTime) from the Source (4 here) and queries:
2-4, plus 1 as a safeguard, so 2-5 from the store;
4-8, plus 1 as a safeguard, so 3-8 from the source.
We need safeguards, since the propagation is eventually consistent.
Problems:
minTime differs between sources:
0───2───4───6───8───10
└───Store───┘
        └──Source1──┘
            └Source2┘
    └───Query───┘
And even worse in this case for the store: are we sure that 0-6 actually includes 4-6 from Source1? One option is some storageID that lets us match time ranges to sources.
As we can see, any solution requires introducing a lot of complexity (passing some IDs to match time ranges), but for what benefit? In reality, in the worst case it will be ~12h of unnecessary data for each source in the system. That does sound like a lot if you want to have the rule component and HA (so 3*12h).
Potential complexity reduction:
Since store nodes have peer info about source, they can limit returned data to the data still available from source node. But still we need some ID matching.
Somehow missing data points are ... represented differently?
(Removed image because it could contain sensitive data)
When we query the store API, we define a time window. When we retrieve chunks from block data, they do not align with that query window. For chunks crossing the query window at the beginning and end we could always decode the chunk, drop irrelevant samples, and encode new chunks.
This however can be quite a few CPU cycles and allocations. The bandwidth overhead of just sending the full chunks is minor AFAICS. It might be a good idea to just send chunks as they are and have the query layer ensure we skip over irrelevant samples.
For the current sidecar implementation, we guarantee to return exactly the relevant data since we query raw datapoints and encode them as a chunk. But this might not be the case indefinitely – we eventually might want to get a chunk-based API upstream that's similar to the store API.
Any concerns with this proposal?
The store layer is an HA pair of nodes that connect to the GCS bucket containing all block data. They load all index files to local disk and mmap them to quickly resolve queries of time range + label matchers.
They dynamically load chunks via range queries from GCS and potentially store them in an in-memory LRU cache.
Chunks for the same series are sequential within a block file. Thus multiple chunks can be loaded with a single range request. Similarly, chunks for multiple series are co-located and can be loaded in a single request as well.
Some research needs to be done on how to consolidate multiple chunk queries. From a latency perspective it can make sense to load significant amount of 'unnecessary' data (gaps between required chunks).
Sharding of store nodes is a non-goal initially. See the design doc for calculations leading to that decision.
It just occurred to me that once we have blocks from many sources, it can rather easily happen that we preload a lot of irrelevant postings.
Suppose you query up{job="A"}. The up metric will be present in virtually every single block for the given time range. However, job="A" will probably be present in just a few. Currently, if there's no postings list for job="A" in a block, we return an EmptyPostings transparently and still register the postings list for __name__="up" for preloading for all blocks.
If we were more explicit about the non-existent postings, we could skip that work. That however would go pretty deep down into index fetching internals, and we can only infer this by looking at the operations between postings (intersection vs. merge).
Probably not something to optimize right now, especially since there is ongoing upstream work in that area.
But we should keep it in mind in case we run into performance issues later.
Doing a simple query over a not really long period can OOM the pod (with its given 6GB of RAM). Obviously we can increase the box size, but can we improve that?
Initial profiling shows that postings preloading:
. 144.70GB 393: if err := indexr.preloadPostings(); err != nil {
. . 394: return nil, err
. . 395: }
is increasing really fast, with our ruler service scraping it every 15s.
In comparison with compressed series:
. 446.41MB 407: if err := indexr.preloadSeries(ps); err != nil {
. . 408: return nil, err
. . 409: }
Merging and deduplicating of data is non-trivial and has no hard truth. We need to find some heuristic we want to follow. This will be used at query time but also to consolidate data by rewriting it. So it has a significant and irreversible impact.
Got:
level=warn name=sidecar ts=2017-11-16T15:30:06.554929827Z caller=shipper.go:71 msg="open file failed" err="stat /prometheus-data/01BZ2Q6ZJYT2R3N1ZM4P54XT7T.tmp: no such file or directory"
We should not care about all the irrelevant files that Prometheus or someone else can add.
Related to #107
Right now we are asking every store API for series with given time range and labels.
Having external labels from sidecars, we can filter out sidecars that do not match the external label specified in the query (if any). E.g.:
With the given query (time range does not matter in this example):
{ __name__: "a", external: "prom1", "label1": "1"}
And having these peers with peerType=store:
{ external: "prom1" }
{ external: "prom2" }
{ } // a store node will likely not have external labels propagated
We can certainly say that there is no need to call 2 of the store peers for data. This can give some benefit when having lots of sidecars in the system.
Hey @Bplotka,
Am I missing something here? 🤔
So I am seeing errors for incorrect arguments when doing go get:
$ go get github.com/improbable-eng/thanos/...
# github.com/improbable-eng/thanos/pkg/block
go1/src/github.com/improbable-eng/thanos/pkg/block/index.go:152:31: not enough arguments in call to index.NewFileReader
have (string)
want (string, int)
# github.com/improbable-eng/thanos/pkg/objstore
go1/src/github.com/improbable-eng/thanos/pkg/objstore/objstore.go:207:32: cannot use b.opsDuration.WithLabelValues(op) (type prometheus.Observer) as type prometheus.Histogram in argument to newTimingReadCloser:
prometheus.Observer does not implement prometheus.Histogram (missing Collect method)
go1/src/github.com/improbable-eng/thanos/pkg/objstore/objstore.go:222:32: cannot use b.opsDuration.WithLabelValues(op) (type prometheus.Observer) as type prometheus.Histogram in argument to newTimingReadCloser:
prometheus.Observer does not implement prometheus.Histogram (missing Collect method)
and when doing a make I get:
$ make
>> fetching goimports
>> fetching promu
>> fetching dep
>> dep ensure
>> formatting code
make: goimports: Command not found
make: *** [format] Error 127
Go version: go1.9.4 linux/amd64
Query: topk(10, count by (__name__)({__name__=~".+"}))
Query node response:
Error executing query: fetch series: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (10277563 vs. 4194304)
Time:
Load time: 3412ms
Resolution: 14s
Total time series: 0
Doing the same on Prometheus gives proper data with time:
Load time: 651ms
Resolution: 14s
Total time series: 10
After a couple of times, this happens:
level=debug name=thanos-store ts=2017-12-08T13:36:30.559067909Z caller=bucket.go:1054 msg="preloaded range" block=01C0MSNM5GS6J537VA4K822STW file=0 numOffsets=21 length=5807 duration=4.399890956s
level=debug name=thanos-store ts=2017-12-08T13:36:30.55943186Z caller=bucket.go:400 msg="preload postings" duration=4.889970819s
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0xb6a652]
goroutine 478309 [running]:
github.com/improbable-eng/thanos/pkg/store.(*lazyPostings).Next(0xc469b264c0, 0x1e3df52f)
<autogenerated>:1 +0x32
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Next(0xc4b6652630, 0x2158fed9fd6d1ba0)
/home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:351 +0x208
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*mergedPostings).Next(0xc4b6652690, 0x2158fed9)
/home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:351 +0x208
github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index.(*intersectPostings).Next(0xc4b66526c0, 0x246dfb94a3f)
/home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/prometheus/tsdb/index/postings.go:302 +0x33
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).blockSeries(0xc420229490, 0x12c37e0, 0xc529da4b00, 0xc4bbe7a7e0, 0xc4dc2c8cc0, 0x2, 0x2, 0x15fee39c7b3, 0x16036576393, 0x0, ...)
/home/bartek/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:404 +0x619
github.com/improbable-eng/thanos/pkg/store.(*BucketStore).Series.func1(0x0, 0x0)
/home/bartek/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:511 +0xa7
github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run.func1(0xc4de432360, 0xc4d8a71920, 0xc4274cb130)
/home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:38 +0x27
created by github.com/improbable-eng/thanos/vendor/github.com/oklog/run.(*Group).Run
/home/bartek/go/src/github.com/improbable-eng/thanos/vendor/github.com/oklog/run/group.go:37 +0xa8
Current test deployments within a single cluster/region work very well. Time to give global deployment options some thought.
Generally speaking, a full setup in every region is probably desirable for reliability and data locality for almost all setups. It also provides trivial means of sharding, which isolates failures and increases scalability.
Currently there seem to be two options:
A) Expand the gossip network globally. All store node, Prometheus servers, and queriers are interconnected.
Pros:
Cons:
B) Keep Thanos cluster regional or even more fine-grained. Additional global federation layer across smaller Thanos clusters.
Query nodes would be made aware of each other through regular service discovery mechanism and act as federation proxies for the Store API.
Pros:
Cons:
go_goroutines{job=~"thanos-*"}
works for the store node, but it should not.
The proper go_goroutines{job=~"thanos-.*"}
works for both Prometheus and the store node.
The result size of a final PromQL query seems to have a very significant effect on total latency.
For example, apiserver_request_count{job="apiserver"} takes on the order of 15-20 seconds. If wrapped with a sum, like sum(apiserver_request_count{job="apiserver"}), latency sharply decreases to 2-3 seconds.
The data queried from GCS and flowing into query nodes is the same. The query engine is actually doing more work.
So the whole delta seems to come from sending the result after data fetching and processing completed successfully. The result is bigger, so more JSON encoding happens and more data is sent to the browser. The difference seems rather extreme nonetheless.
Query node is in the US, end client in EU. Still seems rather unusually high.
The query node is accessed via an SSH tunnel, which might play a role as well.
Querying go_goroutines for 2w is fine; for 4w I am getting data only from the sidecar. The request to the store node ends up with:
query ts=2017-12-14T12:50:14.379715022Z caller=querier.go:185 msg="single select failed. Ignoring this result." err="rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4635734 vs. 4194304)"
Running the default Make target invokes the install-tools target, which has the unexpected side-effect of performing a 'git pull' on repositories under the GOPATH:
https://github.com/improbable-eng/thanos/blob/b77877f42f9edf67a4413a9602054850a1a26afa/Makefile#L25-L31
We might want to consider using vg instead, which supports using dep metadata to specify which binaries are required. This would also allow for versions to be pinned, to make the builds repeatable.
The default Make target should not install binaries on a user's system.
With #92 an initial version of the rule component is largely complete.
What's still missing is sending out the alerts to Alertmanagers. Upstream uses the standard service discovery integrations and designated config section for that.
Some upstream dependency decoupling work I did last week however makes the notification sending properly pluggable.
Prometheus' service discovery is quite useful for extracting meta data and doing advanced filtering. Neither really seem to be a concern when sending alerts to a small set of Alertmanagers. DNS is basically in place everywhere and seems sufficient after all.
I'd like to consider going back to a simple --alertmanagers flag that just allows a few specifics regarding the DNS records and ports used.
1. http://alertmanager.example.org:8080 – just sends alerts to that specific address using HTTP
2. dns+https://alertmanager.example.org:8080 – sends alerts to each A record for the hostname, using port 8080 for each, over HTTPS
3. dnssrv+https://alertmanager.example.org – like 2. but queries SRV records and uses the port specified there
4. dns+https://alertmanager.example.org:8080/path/prefix/ – like 2. but uses a path prefix

The flag could be repeatable.
That solves virtually all sane use cases well enough with minimal learning curve.
The alternative is pulling in the whole dependency chain of the Prometheus discovery framework and requiring configuration files with relabeling for the rule component.
Another thing upstream supports are relabeling rules for alerts. The only reasonable use case I can think of is dropping "replica" labels from alerts. Since replicas however are a first-class concept in Thanos, we'd handle that case in a first-class manner.
Basically we need these to run on any k8s cluster.
The question is whether the store node, which is a bit stateful, is worth provisioning on k8s or better on a separate VM. I guess being able to run it on k8s as well does no harm.
I'm currently debugging a case where a downsampled series is producing a lot of small chunks rather than a few sufficiently large ones. A small fraction of series are also finalized with chunks with a timestamp range of [-1, -1], for which there's no logical reason in the base data.
This likely does not affect correctness of query results, but it was caught by the new validation procedures added in #194. It's probably really worth keeping those around forever to catch regressions into behavior like this early.
We wanted to do that as part of some UI upgrade, but it would be extremely useful now: setting up the gossip cluster and ensuring all labels are OK might be tricky.
We should keep an eye on the LabelValues RPC to store nodes. It will always hit every single block, which can get particularly costly on GCS store nodes.
This data rarely changes and is only used for things like auto-completion. But for that it is called whenever the query UI opens up. Probably the query nodes themselves should already cache it.
A deduplicated query gives 2 series for a single series (with and without the replica label).
Hey @Bplotka @fabxc
Still pretty new to the Thanos code base. Going through it, one thing I've noticed with the backup behaviour is that it seems to only upload the most recent two-hourly blocks. In my instance I've stood up Thanos against an existing Prom server. My data dir looks as follows:
data]# ls -ltr
total 108
drwxr-xr-x 3 root root 4096 Jan 28 19:00 01C4Z28176WR17K7PH37K7FG9V
drwxr-xr-x 3 root root 4096 Jan 29 13:00 01C5101JDEX1TK8CMSC4NQK8KP
drwxr-xr-x 3 root root 4096 Jan 30 07:00 01C52XV3T2ETVSG93K73HVP6D1
drwxr-xr-x 3 root root 4096 Jan 31 01:00 01C54VMN5R94ZM7N7F08J20DDA
drwxr-xr-x 3 root root 4096 Jan 31 19:00 01C56SE59DGWBV587GG9M2W99W
drwxr-xr-x 3 root root 4096 Feb 1 13:00 01C58Q7Q5G4R0DFY5HDGD3XC9Y
drwxr-xr-x 3 root root 4096 Feb 2 07:00 01C5AN1885H7B9J1W11DXB137A
drwxr-xr-x 3 root root 4096 Feb 3 01:00 01C5CJTST135PPXDYDWTKEPYTD
drwxr-xr-x 3 root root 4096 Feb 3 19:00 01C5EGMASFVT63NC0QZTJSABFJ
drwxr-xr-x 3 root root 4096 Feb 4 13:00 01C5GEDW21PTYQ8WNKYRCAVQJX
drwxr-xr-x 3 root root 4096 Feb 5 07:00 01C5JC7DGMKPJVQD2DZH0JT7QJ
drwxr-xr-x 3 root root 4096 Feb 6 01:00 01C5MA0ZC22NSD0H7S8G207JFF
-rw------- 1 root root 6 Feb 6 12:29 lock
drwxr-xr-x 3 root root 4096 Feb 6 19:00 01C5P7TGEMP97FSN779S5E5AYH
drwxr-xr-x 3 root root 4096 Feb 7 13:00 01C5R5M1ANW6RYY81S91VC0F75
drwxr-xr-x 3 root root 4096 Feb 8 07:00 01C5T3DJX67W411JZ9FP745B2Q
drwxr-xr-x 3 root root 4096 Feb 9 01:00 01C5W173WNJA1F01TVJJPT5B93
drwxr-xr-x 3 root root 4096 Feb 9 19:00 01C5XZ0MQ5BSY2W5K9KKAGF5N2
drwxr-xr-x 3 root root 4096 Feb 10 13:00 01C5ZWT693CHSK8KEKBW04SSDX
drwxr-xr-x 3 root root 4096 Feb 11 07:00 01C61TKQF3YXY1XT01JN2X0A3W
drwxr-xr-x 3 root root 4096 Feb 12 01:00 01C63RD8T9Y2Z2C7G98YX9RBXV
drwxr-xr-x 3 root root 4096 Feb 12 07:00 01C64D0D1Y2B9P3QHHH9XCF8NV
drwxr-xr-x 3 root root 4096 Feb 12 09:00 01C64KW317054P6TNH8DCNGRP9
drwxr-xr-x 3 root root 4096 Feb 12 11:00 01C64TQT974JFMQV24CH9060XW
drwxrwxrwx 2 root root 4096 Feb 12 11:56 wal
When standing up Thanos I see:
./thanos sidecar --prometheus.url http://localhost:9090 --tsdb.path /opt/prometheus/promv2/data/ --s3.bucket=thanos --s3.endpoint=xxxxxxxx --s3.access-key=xxxxxxx --s3.secret-key=xxxxxx
level=info ts=2018-02-12T12:25:49.329654785Z caller=sidecar.go:293 msg="starting sidecar" peer=01C64ZMYG5728WQFXTVCD0F70V
level=info ts=2018-02-12T12:25:49.652116167Z caller=shipper.go:179 msg="upload new block" id=01C64KW317054P6TNH8DCNGRP9
level=info ts=2018-02-12T12:25:51.747570129Z caller=shipper.go:179 msg="upload new block" id=01C64TQT974JFMQV24CH9060XW
only the last two blocks are uploaded. From the flags for thanos sidecar, I don't see a mechanism for specifying a period for backdating. Perhaps I am doing something wrong? Is this intentional for some reason (compute/performance)? Or am I simply filing a feature request here?
Thanks.
When fanning out requests to multiple Prometheus instances or store nodes, partial failures will happen and should be handled gracefully. We need a consistent way to express partial failures for internal and external error handling.
This is implemented through a sidecar, which watches Prometheus's data directory for new blocks and uploads them with the existing directory structure into a single GCS bucket.
The resolvePeers()
function in the cluster
package tries to resolve the IP addresses for the cluster peers:
https://github.com/improbable-eng/thanos/blob/19ff509a4f658887f90ab543c79cc821762f64cd/pkg/cluster/cluster.go#L423-L476
It does this using the Go DNS resolver.
The memberlist library does the same thing, but uses TCP first in a bid to retrieve as many peers as possible:
https://github.com/hashicorp/memberlist/blob/ce8abaa0c60c2d6bee7219f5ddf500e0a1457b28/memberlist.go#L303-L306
Go's resolver may try TCP if the UDP query results in a truncated response, so maybe this is not an issue. But it seems like Thanos is doing more than it needs to - there's a lot of overlap between the code in these two places.
Should Thanos delegate DNS resolution for peers to the memberlist library, and is there a compelling reason for it to do its own resolution as it is currently?
The overall system provides us with a global view of the data. We want to make use of that for recording and alerting rules as well. This requires a service which evaluates those rules against the querying layer and writes this data to disk.
The component should appear like yet another Prometheus server to the outside, i.e. it writes down data from recording rules in the Prometheus format. The sidecar is deployed along with it, uploads completed blocks, and responds to queries for recent data.
Firing alerts are sent out in the same way as from a regular Prometheus server.
Since the expensive querying is outsourced to the querying layer, we probably don't need any sharding here either and can stick with an HA pair.
To retain some reliability properties of the Prometheus model, it may be a good idea to only evaluate rules in this component that actually require a global data view. Rules that just need data scoped to a single collector Prometheus could still be evaluated there. Finding some decision rules and potentially an automated process for this would be a valuable extension.
There are caveats to this for recording rules. Maybe it is only feasible for alerts.
We basically move the source of truth for replica labels to the client.
Benefits:
Cons: