Comments (12)
Hey!
Out of curiosity: how does meta.json look for a broken block? Does it only contain the chunks that were actually uploaded?
from thanos.
Wonder why it is always "cannot populate chunk 8 from block"... We see such errors in the compactor periodically when object storage load/latency is high, e.g.:
Apr 02 05:41:54 thanos-compact1 thanos[1556]: ts=2024-04-02T05:41:54.267426403Z caller=http.go:91 level=info service=http/server component=compact msg="internal server is shutting down" err="critical error detected: compaction: group 0@5892148356452815085: compact blocks [/var/lib/thanos/compact/0@5892148356452815085/01HTDQCE6GV770YD8PSB6B9TMK /var/lib/thanos/compact/0@5892148356452815085/01HTDQCF5FPREE2EJA2ACE6F2Z]: cannot populate chunk 8 from block 01HTDQCE6GV770YD8PSB6B9TMK: segment index 0 out of range"
Last time there were even more:
journalctl --no-pager --since "24h ago" | grep "level=error" | grep "cannot populate chunk 8 from block" | awk -F"cannot populate chunk 8 from block" '{print $2}' | awk '{print $1}' | sed 's/://g' | sort | uniq | wc -l
16
16 blocks in the last 24 hours.
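The ad-hoc awk/sed pipeline above can also be sketched in Python. The log lines below are invented stand-ins shaped like the sample error line; in real use they would come from journalctl output:

```python
import re

# Hypothetical journal lines shaped like the sample error above; in real use
# these would be read from `journalctl` output instead.
lines = [
    'thanos[1556]: level=error msg="... cannot populate chunk 8 from block 01HTDQCE6GV770YD8PSB6B9TMK: segment index 0 out of range"',
    'thanos[1556]: level=error msg="... cannot populate chunk 8 from block 01HTF7EGX696B0A0K3X9JY7987: segment index 0 out of range"',
    'thanos[1556]: level=error msg="... cannot populate chunk 8 from block 01HTDQCE6GV770YD8PSB6B9TMK: segment index 0 out of range"',
]

# Same extraction as the awk/sed pipeline: grab the block ULID that follows
# the error text, then count distinct values.
pattern = re.compile(r'cannot populate chunk \d+ from block ([0-9A-Z]+)')
blocks = {m.group(1)
          for line in lines if 'level=error' in line
          if (m := pattern.search(line))}
print(len(blocks))  # number of distinct affected blocks
```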
Yet all (or at least some?) of these blocks or chunks look perfectly fine when run through
thanos tools bucket verify --id=01HTDQCE6GV770YD8PSB6B9TMK --objstore.config-file=./s3.yml --objstore-backup.config-file=./s3-backup.yml --issues=index_known_issues --log.level=debug
ts=2024-04-02T08:37:39.879537233Z caller=main.go:67 level=debug msg="maxprocs: Leaving GOMAXPROCS=[48]: CPU quota undefined"
ts=2024-04-02T08:37:39.880164853Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.880769288Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.881215344Z caller=verify.go:138 level=info verifiers=index_known_issues msg="Starting verify task"
ts=2024-04-02T08:37:39.881238107Z caller=index_issue.go:33 level=info verifiers=index_known_issues verifier=index_known_issues msg="started verifying issue" with-repair=false
ts=2024-04-02T08:37:39.881273894Z caller=fetcher.go:407 level=debug component=block.BaseFetcher msg="fetching meta data" concurrency=32
ts=2024-04-02T08:49:46.190807748Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12m6.309548732s duration_ms=726309 cached=18698 returned=11170 partial=15
ts=2024-04-02T08:49:47.291958391Z caller=index_issue.go:141 level=debug verifiers=index_known_issues verifier=index_known_issues stats="{TotalSeries:69688 OutOfOrderSeries:0 OutOfOrderChunks:0 DuplicatedChunks:0 OutsideChunks:0 CompleteOutsideChunks:0 Issue347OutsideChunks:0 OutOfOrderLabels:0 SeriesMinLifeDuration:10s SeriesAvgLifeDuration:1h58m56.572s SeriesMaxLifeDuration:1h59m22.392s SeriesMinLifeDurationWithoutSingleSampleSeries:10s SeriesAvgLifeDurationWithoutSingleSampleSeries:1h58m56.572s SeriesMaxLifeDurationWithoutSingleSampleSeries:1h59m22.392s SeriesMinChunks:1 SeriesAvgChunks:5 SeriesMaxChunks:55 TotalChunks:408483 ChunkMinDuration:10s ChunkAvgDuration:20m17.513s ChunkMaxDuration:1h59m0.001s ChunkMinSize:54 ChunkAvgSize:141 ChunkMaxSize:1265 SeriesMinSize:48 SeriesAvgSize:83 SeriesMaxSize:416 SingleSampleSeries:0 SingleSampleChunks:0 LabelNamesCount:147 MetricLabelValuesCount:970}" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.29201073Z caller=index_issue.go:57 level=debug verifiers=index_known_issues verifier=index_known_issues msg="no issue" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.292177506Z caller=index_issue.go:75 level=info verifiers=index_known_issues verifier=index_known_issues msg="verified issue" with-repair=false
ts=2024-04-02T08:49:47.293022428Z caller=verify.go:157 level=info verifiers=index_known_issues msg="verify task completed"
ts=2024-04-02T08:49:47.293174706Z caller=main.go:164 level=info msg=exiting
ts=2024-04-02T08:49:47.293203972Z caller=main.go:67 level=debug msg="maxprocs: No GOMAXPROCS change to reset%!(EXTRA []interface {}=[])"
Here is a meta.json
cat data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json
{
  "ulid": "01HTDQCE6GV770YD8PSB6B9TMK",
  "minTime": 1711994400002,
  "maxTime": 1712001600000,
  "stats": {
    "numSamples": 49015935,
    "numSeries": 69688,
    "numChunks": 408483
  },
  "compaction": {
    "level": 1,
    "sources": [
      "01HTDQCE6GV770YD8PSB6B9TMK"
    ]
  },
  "version": 1,
  "thanos": {
    "labels": {
      "monitor": "production",
      "pod": "dirpod4",
      "replica": "prometheus1"
    },
    "downsample": {
      "resolution": 0
    },
    "source": "sidecar",
    "segment_files": [
      "000001"
    ],
    "files": [
      {
        "rel_path": "chunks/000001",
        "size_bytes": 57419523
      },
      {
        "rel_path": "index",
        "size_bytes": 8268359
      },
      {
        "rel_path": "meta.json"
      }
    ],
    "index_stats": {}
  }
}
and a copy of the block on the local fs:
me@thanos-compact1:~$ ls -la data/01HTDQCE6GV770YD8PSB6B9TMK/*
-rw-rw-r-- 1 me me 8268359 Apr 1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/index
-rw-rw-r-- 1 me me 713 Apr 1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json
-rw-rw-r-- 1 me me 135 Apr 2 05:45 data/01HTDQCE6GV770YD8PSB6B9TMK/no-compact-mark.json
data/01HTDQCE6GV770YD8PSB6B9TMK/chunks:
total 56076
drwxrwxr-x 2 me me 20 Apr 2 08:24 .
drwxrwxr-x 3 me me 78 Apr 2 08:24 ..
-rw-rw-r-- 1 me me 57419523 Apr 1 21:01 000001
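As an aside, the earlier question about what meta.json contains for a broken block can be checked mechanically: the thanos.files section records each file's expected size, so a small sketch (the block path below is just an example) can compare it against what is actually on disk:

```python
import json
from pathlib import Path

def check_block(block_dir: str) -> list[str]:
    """Compare sizes listed in meta.json's thanos.files against files on disk."""
    root = Path(block_dir)
    meta = json.loads((root / "meta.json").read_text())
    problems = []
    for f in meta.get("thanos", {}).get("files", []):
        p = root / f["rel_path"]
        if not p.exists():
            problems.append(f"missing: {f['rel_path']}")
        elif "size_bytes" in f and p.stat().st_size != f["size_bytes"]:
            problems.append(f"size mismatch: {f['rel_path']}")
    return problems

# e.g. check_block("data/01HTDQCE6GV770YD8PSB6B9TMK")
```

For the block above this would report nothing, since the on-disk sizes (57419523 for chunks/000001, 8268359 for the index) match meta.json exactly.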
Looks like some issue in the S3 object storage client, or what?
from thanos.
Hey @bobykus31, are you able to share the blocks by chance?
from thanos.
I wish I could, but I don't think I am allowed to, sorry. You mean the content, right?
from thanos.
I don't even know if the block itself would be of any help, but...
Here is the output of
promtool tsdb analyze ./data 01HTDQCE6GV770YD8PSB6B9TMK
Block ID: 01HTDQCE6GV770YD8PSB6B9TMK
Duration: 1h59m59.998s
Series: 69688
Label names: 147
Postings (unique label pairs): 3188
Postings entries (total label pairs): 475088
Label pairs most involved in churning:
61 job=node
31 group=base
28 severity=medium
27 node=127.0.0.1
20 group=fluentbit
20 job=fluentbit
20 alertname=BlackboxNetworkProbeSlowOnDNSLookup
19 group=misc
14 alertstate=pending
14 __name__=ALERTS_FOR_STATE
14 __name__=ALERTS
12 node=somehost1.me
11 job=logproxy
11 node=somehost2.me
11 group=logproxy
10 pod=dirpod4
8 job=imap-starttls
8 job=script-exporter
8 group=imap-starttls
7 node=somehost3.me
Label names most involved in churning:
146 __name__
146 node
145 instance
145 job
133 group
35 name
29 alertname
29 severity
24 type
23 cpu
14 alertstate
13 device
10 plugin_id
10 worker_id
10 pod
10 hostname
5 mode
5 state
4 libs_deleted
4 execstart_binary
Most common label pairs:
43141 job=node
21086 group=base
14361 group=misc
8032 job=logproxy
8032 group=logproxy
7725 pod=dirpod4
4878 node=somehost6.me
4866 node=somehost7.me
4315 node=somehost1.me
4282 node=somehost2.me
3939 node=prometheus1.somehost.me
3862 hostname=logproxy1
3850 hostname=logproxy2
3727 job=script-exporter-self
3725 group=script-exporter-self
3714 job=script-exporter
3664 group=mailproxy-dirpod
3601 node=somehost3.me
3600 node=somehost4.me
3530 node=somehost5.me
Label names with highest cumulative label value length:
31542 __name__
9795 instance
6944 name
4755 rule_group
2499 address
1474 node
1306 zone
1190 execstart_binary
929 le
833 device
800 type
584 serial_number
460 nlri
451 serial
440 version
410 file
288 dialer_name
285 handler
269 id
265 scrape_job
Highest cardinality labels:
970 __name__
338 name
258 instance
147 address
125 le
108 id
101 rule_group
82 device
74 type
64 cpu
54 execstart_binary
46 node
43 zone
40 serial_number
40 collector
35 dialer_name
32 worker_id
32 core
32 scrape_job
31 nlri
Highest cardinality metric names:
3072 node_cpu_seconds_total
2140 node_systemd_unit_state
1600 node_cooling_device_cur_state
1600 node_softnet_processed_total
1600 node_cooling_device_max_state
1600 node_softnet_flow_limit_count_total
1600 node_softnet_cpu_collision_total
1600 node_softnet_backlog_len
1600 node_softnet_times_squeezed_total
1600 node_softnet_received_rps_total
1600 node_softnet_dropped_total
816 scripts_duration_seconds
792 node_scrape_collector_success
792 node_scrape_collector_duration_seconds
768 node_cpu_guest_seconds_total
708 node_systemd_execstart_binary_age
704 fluentd_output_status_retry_count
704 fluentd_output_status_num_errors
680 ipmi_sensor_state
680 ipmi_sensor_value
from thanos.
So I turned on S3 debug tracing with trace.enable: true.
Here is what I can see
grep 01HTF7EGX696B0A0K3X9JY7987 /srv/logs/thanos-compact1/*-20240402 | grep GET | grep -v mark.json
2024-04-02T14:03:17.147505152Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.147480003Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"
2024-04-02T14:03:17.17700096Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.176970957Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"
2024-04-02T14:03:17.297769984Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.297744857Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"
....
ts=2024-04-02T14:07:56.233285405Z caller=compact.go:527 level=error msg="critical error detected; halting" err="compaction: group 0@17648158269862193886: compact blocks [/var/lib/thanos/compact/0@17648158269862193886/01HTF7EGWYN1YWZ63WC3XDJVYB /var/lib/thanos/compact/0@17648158269862193886/01HTF7EGX696B0A0K3X9JY7987]: cannot populate chunk 8 from block 01HTF7EGX696B0A0K3X9JY7987: segment index 0 out of range"
First
GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json
Then
GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F
Then
GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index
Somehow I cannot see anything like
GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001
whereas for the other block I do see
ts=2024-04-02T14:03:17.898064859Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGWYN1YWZ63WC3XDJVYB/chunks/000001
For some reason the compactor just ignored /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001, even though later on I was able to download it manually with "s3cmd get --recursive":
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001' -> 'data/01HTF7EGX696B0A0K3X9JY7987/chunks/000001' [1 of 3]
92331737 of 92331737 100% in 3s 24.22 MB/s done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index' -> 'data/01HTF7EGX696B0A0K3X9JY7987/index' [2 of 3]
13550935 of 13550935 100% in 0s 24.30 MB/s done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json' -> 'data/01HTF7EGX696B0A0K3X9JY7987/meta.json' [3 of 3]
716 of 716 100% in 0s 12.48 kB/s done
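For context on the error text itself: in the Prometheus TSDB block format (which Thanos blocks use), a chunk reference packs the segment file sequence number into the upper 32 bits and the byte offset into the lower 32 bits. So "segment index 0 out of range" means the reader found no chunk segment files at all locally, i.e. chunks/000001 was never fetched. A minimal sketch of the decoding (the example ref value is arbitrary):

```python
def decode_block_chunk_ref(ref: int) -> tuple[int, int]:
    """Split a TSDB block chunk reference into (segment sequence, byte offset).

    Segment sequence 0 corresponds to the file chunks/000001; if the chunks
    directory was never downloaded, even sequence 0 is "out of range".
    """
    seq = ref >> 32          # upper 32 bits: segment file sequence
    offset = ref & 0xFFFFFFFF  # lower 32 bits: offset within the segment
    return seq, offset

seq, offset = decode_block_chunk_ref(54)  # -> (0, 54): chunks/000001, offset 54
```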
from thanos.
Hey @bobykus31, that's really interesting; can you retry with mc cp ... --recursive=true?
from thanos.
Yes, re-downloading seems to make the compaction go well.
groupKey=0@5322589354663189029 msg="compaction available and planned" plan="[01HTF7EGY6WQ5VSG6CER14WVJD (min time: 1712044800014, max time: 1712052000000) 01HTF7EHFG3BXV05EY2Z3E55SF (min time: 1712044800054, max time: 1712052000000)]"
from thanos.
Seems to be an S3 issue itself. I was able to reproduce it with s3cmd for some blocks. Maybe the --consistency-delay setting can help (currently it is 2h).
from thanos.
FYI, the issue still exists. Increasing --consistency-delay does not help much. Looks similar to #1199.
from thanos.
Not a fix by any means, but if we have already uploaded such blocks, then #7282 should at least make the compactor and store not crash, and instead mark the blocks as corrupted and increment the proper metrics.
from thanos.
This can happen when you use multi-site object storage and the data sync between sites is not consistent (for many reasons). Here is what the object storage provider recommends in such a case:
Note: In general, you should use the “read-after-new-write” consistency control value. If requests aren't working correctly, change the application client behavior if possible. Or, configure the client to specify the consistency control for each API request. Set the consistency control at the bucket level only as a last resort.
Request example
PUT /bucket?x-ntap-sg-consistency=strong-global HTTP/1.1
Date: date
Authorization: authorization string
Host: host
How is this achievable with Thanos?
from thanos.