Comments (7)
Which options did you use to start this Compactor?
from thanos.
Options are:
/usr/bin/thanos compact --log.level=warn --wait --data-dir=/data/main/thanos/compact/cache --objstore.config-file=/etc/thanos/data-thanos.yml --consistency-delay=1h --hash-func=SHA256 --http-address=:10912 --web.external-prefix=/thanos/compact --web.route-prefix=/
But I just found --no-debug.halt-on-error, which seems as if it may produce the desired behaviour.
I wonder why that isn't the default.
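For reference, this is how the flag would be added to the invocation above. As far as I can tell, with --no-debug.halt-on-error a critical error makes the process exit non-zero instead of halting in place, so the service manager can react:

```shell
/usr/bin/thanos compact \
  --log.level=warn --wait \
  --data-dir=/data/main/thanos/compact/cache \
  --objstore.config-file=/etc/thanos/data-thanos.yml \
  --consistency-delay=1h --hash-func=SHA256 \
  --http-address=:10912 \
  --web.external-prefix=/thanos/compact --web.route-prefix=/ \
  --no-debug.halt-on-error
```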
The documentation's rationale for that default is honestly not very convincing.
Yes, it would be bad to retry over and over if it just keeps failing, but there are much better ways to handle this:
a) Not retrying endlessly is the service manager's (i.e. systemd's) responsibility anyway, and there are plenty of proper ways to handle this there; and
b) thanos could e.g. just set some flag on such an error condition that makes it fail immediately without retrying. That would still allow for the important part, namely failing - otherwise people in normal environments won't notice that anything's wrong.
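To illustrate (a): a unit drop-in along these lines (the paths and values are just examples) gives bounded retries with a delay once thanos exits non-zero, and gives up for good if it keeps failing:

```ini
# /etc/systemd/system/thanos-compact.service.d/restart.conf (example)
[Unit]
# Give up entirely if the service fails 5 times within 30 minutes.
StartLimitIntervalSec=30min
StartLimitBurst=5

[Service]
Restart=on-failure
RestartSec=5min
```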
from thanos.
I am seeing a similar behavior where it errors once (no space left on device), halts, and never compacts again.
level=error ts=2024-03-19T16:41:39.30753731Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@16225028991790494220: compact blocks [/data/compact/0@16225028991790494220/01F3EZ3ZAA58NSNX2EPRS03N5H /data/compact/0@16225028991790494220/01F3M3W8CDMFQP1ZTS72F07EGQ /data/compact/0@16225028991790494220/01F3S8RRSXCY18WNCS6NS6T1N4 /data/compact/0@16225028991790494220/01F3YDPP5RRD4FFERPTVTRY86T /data/compact/0@16225028991790494220/01F43J9M8VME00PDVN2FQ84SFH /data/compact/0@16225028991790494220/01F48Q59D5Z77DG93SGEYVZ8B1 /data/compact/0@16225028991790494220/01F4DW1XNESEQH4RBW5AZMN6P4]: 3 errors: populate block: write chunks: preallocate: no space left on device; sync /data/compact/0@16225028991790494220/01HSBRPYTTJ57GJBME0FR28SFV.tmp-for-creation/chunks/000088: file already closed; write /data/compact/0@16225028991790494220/01HSBRPYTTJ57GJBME0FR28SFV.tmp-for-creation/index: no space left on device"
level=info ts=2024-03-19T16:42:04.648266954Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=2.059553102s duration_ms=2059 cached=40630 returned=40630 partial=0
level=info ts=2024-03-19T16:42:32.306773124Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=43.087680327s duration_ms=43087 cached=40630 returned=40610 partial=0
level=info ts=2024-03-19T16:43:15.869657891Z caller=fetcher.go:470 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=43.552571092s duration_ms=43552 cached=40630 returned=40610 partial=0
level=info ts=2024-03-19T16:43:15.88204689Z caller=clean.go:34 msg="started cleaning of aborted partial uploads"
level=info ts=2024-03-19T16:43:15.88207772Z caller=clean.go:61 msg="cleaning of aborted partial uploads done"
level=info ts=2024-03-19T16:43:15.882085075Z caller=blocks_cleaner.go:44 msg="started cleaning of blocks marked for deletion"
level=info ts=2024-03-19T16:43:15.882099343Z caller=blocks_cleaner.go:58 msg="cleaning of blocks marked for deletion done"
I am running the compact component with a 100 GB persistent volume, and after the error it goes back to ~50% usage.
from thanos.
To be clear: you won't ever recover a Compactor that ran out of disk space without giving it more disk. I agree with the expected behavior suggested by @calestyo, though: the Compactor should exit when it detects that it doesn't have enough disk space available.
To fix the Compactor:
- scale Compactor to zero
- expand the PVs if you can or give it bigger ones
- scale it back up.
Finally, when the Compactor catches up, you can follow a similar procedure to shrink the PV.
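On Kubernetes that procedure might look roughly like this (the resource names and sizes are placeholders; online PVC expansion requires a StorageClass with allowVolumeExpansion: true):

```shell
# Scale the Compactor down so nothing writes to the volume
kubectl scale statefulset thanos-compact --replicas=0

# Expand the PVC (this only grows it; shrinking means migrating to a new volume later)
kubectl patch pvc data-thanos-compact-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"400Gi"}}}}'

# Scale it back up once the resize is done
kubectl scale statefulset thanos-compact --replicas=1
```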
from thanos.
you won't ever recover Compactor running out of disk space without giving it more disk.
The problem, however, is that even with more disk (or space freed up) it never recovers and never does anything useful again.
I think the default behaviour of daemons is to fail in such a situation (so that the service manager, like systemd, can react)... a --keep-going mode might be nice for some people, but only if it actually ever does something again (which seems not to be the case here).
from thanos.
The problem, however, is that even with more disk (or space freed up) it never recovers and never does anything useful again.
You will need to give it more total disk space than it had before. The amount of extra space required is proportional to the time the Compactor had issues. I've seen setups where the Compactor had initially about 200 GB of disk and we had to give it 4x more.
As I said, I agree with you that the Compactor's default behavior is not ideal. We should improve it.
from thanos.
You will need to give it more total disk space than it had before.
I don't see why that should need to be the case. The filesystem could have had some other big occupier of space, and with that gone there could be just enough room for compact, even with the total size unchanged.
The other issue you're referring to, that it needs more and more space, is what I've described in #7198, which is probably a duplicate of #3405.
IMO the proper behaviour would be as follows:
- By default, if it runs out of space and cannot finish its operation, it should exit with a non-zero status.
- If some extra option is given, it may continue to run and retry continuously (perhaps with some delay in between, and/or only once more free space is available), but without any extra conditions like the filesystem having gained more total space. That mode only makes sense, though, if some other system is in place that automatically increases or frees up the space - and if there's such a system anyway, it wouldn't cost much IMO to just restart compact.
... so in a way the whole mode of continuing to run doesn't make that much sense, IMO.
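A rough sketch of the proposed default-plus-opt-in behaviour, as a shell wrapper around the compact invocation. RETRY and RETRY_DELAY are illustrative names, not real Thanos options:

```shell
#!/bin/sh
# Sketch only: fail fast by default, optionally keep retrying with a delay.
compact_with_policy() {
    # "$@" stands in for the real `thanos compact ...` command line.
    retry="${RETRY:-0}"
    delay="${RETRY_DELAY:-300}"
    until "$@"; do
        status=$?
        if [ "$retry" != "1" ]; then
            echo "compaction failed (status $status); exiting so the service manager can react" >&2
            return "$status"
        fi
        echo "compaction failed (status $status); retrying in ${delay}s" >&2
        sleep "$delay"
    done
}
```

With the default (RETRY unset or 0), a failure propagates the non-zero status to the service manager; with RETRY=1 it loops with a delay, which is only sensible if something else frees up space in the meantime.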
from thanos.