
Comments (12)

MichaHoffmann commented on July 21, 2024

Hey!

Out of curiosity: what does meta.json look like for a broken block? Does it only list the chunk files that were actually uploaded?


bobykus31 commented on July 21, 2024

I wonder why it is always "cannot populate chunk 8 from block"... We see such errors in the compactor periodically when object storage load/latency is high, like:

Apr 02 05:41:54 thanos-compact1 thanos[1556]: ts=2024-04-02T05:41:54.267426403Z caller=http.go:91 level=info service=http/server component=compact msg="internal server is shutting down" err="critical error detected: compaction: group 0@5892148356452815085: compact blocks [/var/lib/thanos/compact/0@5892148356452815085/01HTDQCE6GV770YD8PSB6B9TMK /var/lib/thanos/compact/0@5892148356452815085/01HTDQCF5FPREE2EJA2ACE6F2Z]: cannot populate chunk 8 from block 01HTDQCE6GV770YD8PSB6B9TMK: segment index 0 out of range"

Last time there were quite a few:

journalctl --no-pager --since "24h ago" | grep "level=error" | grep "cannot populate chunk 8 from block" | awk -F"cannot populate chunk 8 from block" '{print $2}' | awk '{print $1}' | sed 's/://g' | sort | uniq | wc -l
16

16 blocks in the last 24 hours.

Meanwhile, all (or at least some of?) these blocks and chunks look perfectly fine when I run:

thanos tools bucket verify --id=01HTDQCE6GV770YD8PSB6B9TMK --objstore.config-file=./s3.yml --objstore-backup.config-file=./s3-backup.yml --issues=index_known_issues --log.level=debug
ts=2024-04-02T08:37:39.879537233Z caller=main.go:67 level=debug msg="maxprocs: Leaving GOMAXPROCS=[48]: CPU quota undefined"
ts=2024-04-02T08:37:39.880164853Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.880769288Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-04-02T08:37:39.881215344Z caller=verify.go:138 level=info verifiers=index_known_issues msg="Starting verify task"
ts=2024-04-02T08:37:39.881238107Z caller=index_issue.go:33 level=info verifiers=index_known_issues verifier=index_known_issues msg="started verifying issue" with-repair=false
ts=2024-04-02T08:37:39.881273894Z caller=fetcher.go:407 level=debug component=block.BaseFetcher msg="fetching meta data" concurrency=32
ts=2024-04-02T08:49:46.190807748Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=12m6.309548732s duration_ms=726309 cached=18698 returned=11170 partial=15
ts=2024-04-02T08:49:47.291958391Z caller=index_issue.go:141 level=debug verifiers=index_known_issues verifier=index_known_issues stats="{TotalSeries:69688 OutOfOrderSeries:0 OutOfOrderChunks:0 DuplicatedChunks:0 OutsideChunks:0 CompleteOutsideChunks:0 Issue347OutsideChunks:0 OutOfOrderLabels:0 SeriesMinLifeDuration:10s SeriesAvgLifeDuration:1h58m56.572s SeriesMaxLifeDuration:1h59m22.392s SeriesMinLifeDurationWithoutSingleSampleSeries:10s SeriesAvgLifeDurationWithoutSingleSampleSeries:1h58m56.572s SeriesMaxLifeDurationWithoutSingleSampleSeries:1h59m22.392s SeriesMinChunks:1 SeriesAvgChunks:5 SeriesMaxChunks:55 TotalChunks:408483 ChunkMinDuration:10s ChunkAvgDuration:20m17.513s ChunkMaxDuration:1h59m0.001s ChunkMinSize:54 ChunkAvgSize:141 ChunkMaxSize:1265 SeriesMinSize:48 SeriesAvgSize:83 SeriesMaxSize:416 SingleSampleSeries:0 SingleSampleChunks:0 LabelNamesCount:147 MetricLabelValuesCount:970}" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.29201073Z caller=index_issue.go:57 level=debug verifiers=index_known_issues verifier=index_known_issues msg="no issue" id=01HTDQCE6GV770YD8PSB6B9TMK
ts=2024-04-02T08:49:47.292177506Z caller=index_issue.go:75 level=info verifiers=index_known_issues verifier=index_known_issues msg="verified issue" with-repair=false
ts=2024-04-02T08:49:47.293022428Z caller=verify.go:157 level=info verifiers=index_known_issues msg="verify task completed"
ts=2024-04-02T08:49:47.293174706Z caller=main.go:164 level=info msg=exiting
ts=2024-04-02T08:49:47.293203972Z caller=main.go:67 level=debug msg="maxprocs: No GOMAXPROCS change to reset%!(EXTRA []interface {}=[])"

Here is the meta.json:

cat data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json 
{
	"ulid": "01HTDQCE6GV770YD8PSB6B9TMK",
	"minTime": 1711994400002,
	"maxTime": 1712001600000,
	"stats": {
		"numSamples": 49015935,
		"numSeries": 69688,
		"numChunks": 408483
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01HTDQCE6GV770YD8PSB6B9TMK"
		]
	},
	"version": 1,
	"thanos": {
		"labels": {
			"monitor": "production",
			"pod": "dirpod4",
			"replica": "prometheus1"
		},
		"downsample": {
			"resolution": 0
		},
		"source": "sidecar",
		"segment_files": [
			"000001"
		],
		"files": [
			{
				"rel_path": "chunks/000001",
				"size_bytes": 57419523
			},
			{
				"rel_path": "index",
				"size_bytes": 8268359
			},
			{
				"rel_path": "meta.json"
			}
		],
		"index_stats": {}
	}
}


and a copy of the block on the local fs:

me@thanos-compact1:~$ ls -la data/01HTDQCE6GV770YD8PSB6B9TMK/*
-rw-rw-r-- 1 me me 8268359 Apr  1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/index
-rw-rw-r-- 1 me me     713 Apr  1 21:01 data/01HTDQCE6GV770YD8PSB6B9TMK/meta.json
-rw-rw-r-- 1 me me     135 Apr  2 05:45 data/01HTDQCE6GV770YD8PSB6B9TMK/no-compact-mark.json

data/01HTDQCE6GV770YD8PSB6B9TMK/chunks:
total 56076
drwxrwxr-x 2 me me       20 Apr  2 08:24 .
drwxrwxr-x 3 me me       78 Apr  2 08:24 ..
-rw-rw-r-- 1 me me 57419523 Apr  1 21:01 000001

 

Looks like some issue in the S3 object storage client, or what?


MichaHoffmann commented on July 21, 2024

Hey @bobykus31, are you able to share the blocks by chance?


bobykus31 commented on July 21, 2024

I wish I could, but I don't think I'm allowed to, sorry. You mean the content, right?


bobykus31 commented on July 21, 2024

I don't even know if the block itself would be of any help, but...
Here is the output from

promtool tsdb analyze ./data 01HTDQCE6GV770YD8PSB6B9TMK

Block ID: 01HTDQCE6GV770YD8PSB6B9TMK
Duration: 1h59m59.998s
Series: 69688
Label names: 147
Postings (unique label pairs): 3188
Postings entries (total label pairs): 475088

Label pairs most involved in churning:
61 job=node
31 group=base
28 severity=medium
27 node=127.0.0.1
20 group=fluentbit
20 job=fluentbit
20 alertname=BlackboxNetworkProbeSlowOnDNSLookup
19 group=misc
14 alertstate=pending
14 __name__=ALERTS_FOR_STATE
14 __name__=ALERTS
12 node=somehost1.me
11 job=logproxy
11 node=somehost2.me
11 group=logproxy
10 pod=dirpod4
8 job=imap-starttls
8 job=script-exporter
8 group=imap-starttls
7 node=somehost3.me

Label names most involved in churning:
146 __name__
146 node
145 instance
145 job
133 group
35 name
29 alertname
29 severity
24 type
23 cpu
14 alertstate
13 device
10 plugin_id
10 worker_id
10 pod
10 hostname
5 mode
5 state
4 libs_deleted
4 execstart_binary

Most common label pairs:
43141 job=node
21086 group=base
14361 group=misc
8032 job=logproxy
8032 group=logproxy
7725 pod=dirpod4
4878 node=somehost6.me
4866 node=somehost7.me
4315 node=somehost1.me
4282 node=somehost2.me
3939 node=prometheus1.somehost.me
3862 hostname=logproxy1
3850 hostname=logproxy2
3727 job=script-exporter-self
3725 group=script-exporter-self
3714 job=script-exporter
3664 group=mailproxy-dirpod
3601 node=somehost3.me
3600 node=somehost4.me
3530 node=somehost5.me

Label names with highest cumulative label value length:
31542 __name__
9795 instance
6944 name
4755 rule_group
2499 address
1474 node
1306 zone
1190 execstart_binary
929 le
833 device
800 type
584 serial_number
460 nlri
451 serial
440 version
410 file
288 dialer_name
285 handler
269 id
265 scrape_job

Highest cardinality labels:
970 __name__
338 name
258 instance
147 address
125 le
108 id
101 rule_group
82 device
74 type
64 cpu
54 execstart_binary
46 node
43 zone
40 serial_number
40 collector
35 dialer_name
32 worker_id
32 core
32 scrape_job
31 nlri

Highest cardinality metric names:
3072 node_cpu_seconds_total
2140 node_systemd_unit_state
1600 node_cooling_device_cur_state
1600 node_softnet_processed_total
1600 node_cooling_device_max_state
1600 node_softnet_flow_limit_count_total
1600 node_softnet_cpu_collision_total
1600 node_softnet_backlog_len
1600 node_softnet_times_squeezed_total
1600 node_softnet_received_rps_total
1600 node_softnet_dropped_total
816 scripts_duration_seconds
792 node_scrape_collector_success
792 node_scrape_collector_duration_seconds
768 node_cpu_guest_seconds_total
708 node_systemd_execstart_binary_age
704 fluentd_output_status_retry_count
704 fluentd_output_status_num_errors
680 ipmi_sensor_state
680 ipmi_sensor_value


bobykus31 commented on July 21, 2024

So I turned on debug tracing for S3 with trace.enable: true.
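
For reference, here is a minimal sketch of where that setting lives in the S3 objstore config (the bucket name below is a placeholder, not our real one):

type: S3
config:
  bucket: "some-bucket"
  endpoint: "s3host.me"
  trace:
    enable: true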

With tracing enabled, here is what I can see:

grep 01HTF7EGX696B0A0K3X9JY7987 /srv/logs/thanos-compact1/*-20240402 | grep GET | grep -v mark.json

2024-04-02T14:03:17.147505152Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.147480003Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"

2024-04-02T14:03:17.17700096Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.176970957Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"

2024-04-02T14:03:17.297769984Z thanos-compact1 6 thanos[13909] ts=2024-04-02T14:03:17.297744857Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index HTTP/1.1\r\nHost: s3host.me\r\nUser-Agent: MinIO (linux; amd64) minio-go/v7.0.61 thanos-compact/0.34.1 (go1.21.7)\r\nAuthorization: AWS4-HMAC-SHA256 Credential=**REDACTED**/20240402/DK/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=**REDACTED**\r\nX-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\nX-Amz-Date: 20240402T140317Z\r\nAccept-Encoding: gzip\r\n\r"


....

ts=2024-04-02T14:07:56.233285405Z caller=compact.go:527 level=error msg="critical error detected; halting" err="compaction: group 0@17648158269862193886: compact blocks [/var/lib/thanos/compact/0@17648158269862193886/01HTF7EGWYN1YWZ63WC3XDJVYB /var/lib/thanos/compact/0@17648158269862193886/01HTF7EGX696B0A0K3X9JY7987]: cannot populate chunk 8 from block 01HTF7EGX696B0A0K3X9JY7987: segment index 0 out of range"

First

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json

Then

GET /ocsysinfra-prometheus-metrics/?delimiter=%2F&encoding-type=url&fetch-owner=true&list-type=2&prefix=01HTF7EGX696B0A0K3X9JY7987%2F

Then

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index

Somehow I do not see anything like

GET /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001

whereas for the other block I can clearly see

ts=2024-04-02T14:03:17.898064859Z caller=stdlib.go:105 level=debug s3TraceMsg="GET /ocsysinfra-prometheus-metrics/01HTF7EGWYN1YWZ63WC3XDJVYB/chunks/000001

For some reason the compactor just ignored /ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001, even though later on I was able to download it manually with "s3cmd get --recursive":

download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/chunks/000001' -> 'data/01HTF7EGX696B0A0K3X9JY7987/chunks/000001'  [1 of 3]
 92331737 of 92331737   100% in    3s    24.22 MB/s  done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/index' -> 'data/01HTF7EGX696B0A0K3X9JY7987/index'  [2 of 3]
 13550935 of 13550935   100% in    0s    24.30 MB/s  done
download: 's3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/meta.json' -> 'data/01HTF7EGX696B0A0K3X9JY7987/meta.json'  [3 of 3]
 716 of 716   100% in    0s    12.48 kB/s  done
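
(To double-check what a delimiter-based listing returns at that moment, one could presumably replay it by hand, e.g. with s3cmd, which lists with a delimiter by default, and see whether the chunks/ prefix shows up:)

s3cmd ls s3://ocsysinfra-prometheus-metrics/01HTF7EGX696B0A0K3X9JY7987/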


MichaHoffmann commented on July 21, 2024

Hey @bobykus31, that's really interesting; can you retry with mc cp ... --recursive=true?


bobykus31 commented on July 21, 2024

Yes, re-downloading seems to make the compaction go well.

groupKey=0@5322589354663189029 msg="compaction available and planned" plan="[01HTF7EGY6WQ5VSG6CER14WVJD (min time: 1712044800014, max time: 1712052000000) 01HTF7EHFG3BXV05EY2Z3E55SF (min time: 1712044800054, max time: 1712052000000)]"


bobykus31 commented on July 21, 2024

Seems to be an S3 issue itself. I was able to reproduce it with s3cmd for some blocks. Maybe the --consistency-delay setting can help (currently it is 2h).
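
For reference, raising it would look something like this on the compactor command line (the data dir and config file follow the earlier logs in this thread; the 4h value is just an illustration):

thanos compact \
  --data-dir=/var/lib/thanos/compact \
  --objstore.config-file=./s3.yml \
  --consistency-delay=4h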


bobykus31 commented on July 21, 2024

The issue still exists, FYI. Increasing --consistency-delay does not help much. Looks similar to #1199.


MichaHoffmann commented on July 21, 2024

Not a fix for it by any means, but if we have already uploaded such blocks, then #7282 should at least make the compactor and store not crash, but rather mark them as corrupted and increment the proper metrics.


bobykus31 commented on July 21, 2024

This can happen when you use multi-site object storage and the data sync between sites is not consistent (for many reasons). Here is what the object storage provider recommends in such a case:

Note: In general, you should use the “read-after-new-write” consistency control value. If requests aren't working correctly, change the application client behavior if possible. Or, configure the client to specify the consistency control for each API request. Set the consistency control at the bucket level only as a last resort.
Request example

PUT /bucket?x-ntap-sg-consistency=strong-global HTTP/1.1
Date: date
Authorization: authorization string
Host: host

How is this achievable with Thanos?
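
(For what it's worth, the bucket-level request from the example above could presumably be issued outside Thanos, e.g. with curl's SigV4 signing; the region, credentials, and bucket name here are placeholders:)

curl --aws-sigv4 "aws:amz:us-east-1:s3" \
  --user "ACCESS_KEY:SECRET_KEY" \
  -X PUT "https://s3host.me/some-bucket?x-ntap-sg-consistency=strong-global"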

