Describe the bug Every night the IcingaDB HA crashs <h2 dir="a

IcingaDB HA crashs every night after upgrade to 1.2.0 about icingadb HOT 13 CLOSED

TheCry commented on August 28, 2024

IcingaDB HA crashs every night after upgrade to 1.2.0

from icingadb.

Comments (13)

TheCry commented on August 28, 2024 1

It seems like mysqldump is setting a lock by default. I didn't do it in my script. Now I created a dump with "--single-transaction=TRUE" and there was no outage.
Thank you for your support.

from icingadb.

oxzi commented on August 28, 2024

Thanks for coming forward with this issue. Based on your provided information and what was within the stack trace, it seems like your Redis instance is unavailable or at least unhealthy. The error "runtime-updates: Can't execute Redis pipeline" (source) indicates this.

One change between v1.1.1 and v.1.2.0 was 81085c0, effectively don't retrying a failed HA but giving up after five minutes. Thus, it might be the case that this error was already there before upgrading Icinga DB to v1.2.0, but just invisible.

Speaking of visibility, what is the other Icinga DB node doing? Could you please provide more extensive logs extending further into the past, from both Icinga DB nodes include their Redis? Please evaluate the logging level - all components included - to debug and restart the Icinga DBs before the next expected crash.

As there are additional information being logged as fields when using the systemd-journald output, you might wanna include those in your journalctl's output format, e.g., by using journalctl -o json.

What does the Bareos exactly do at this moment? Does it interact with the Redis instance or reconfigures the network? Is something else altering the system state at this moment?

from icingadb.

TheCry commented on August 28, 2024

Hello
the other node has exaxt the same error message this morning. Bareos only performs a database backup. The database is not set to "lock". For tonight I deactivated the security of the 3 servers to make sure that it has nothing to do with it.
No - Bareos does not interact with Redis. There is also no error message from icingadb-redis in the journal.
If a JSON log with this call is sufficient:
journalctl --boot --reverse --priority=emerg.. err --since=-72h --unit="icingadb.service" -o json

I'd edit the config.yml of IcingaDB to logging "debug":

logging:
  level: debug

  options:
    config-sync:
    database:
    dump-signals:
    heartbeat:
    high-availability:
    history-sync:
    overdue-sync:
    redis:
    retention:
    runtime-updates:
    telemetry:

So correct?

from icingadb.

oxzi commented on August 28, 2024

Hello the other node has exaxt the same error message this morning. Bareos only performs a database backup. The database is not set to "lock".

That's unfortunate.

As it happens on both nodes, I would think about some database-related issue.

Since v1.2.0 (779afd1) there are additional information attached to the "Handing over" message appearing in your posted log. There should also be fields present for the "runtime-updates: Can't execute Redis pipeline" line.
If your log wasn't already rotated, could you post your logs with --output json, especially regarding those lines?

There is also no error message from icingadb-redis in the journal.

Okay. How does your database setup look? Is it a MySQL/MariaDB or PostgreSQL? Is it a single node, a federation or a cluster? Are there suspicious logging entries around the same time?

If a JSON log with this call is sufficient: journalctl --boot --reverse --priority=emerg.. err --since=-72h --unit="icingadb.service" -o json

This looks good!

I'd edit the config.yml of IcingaDB to logging "debug":

Looks also good. However, you can remove the whole options block as each component is empty, thus inheriting the default config options from above.

from icingadb.

TheCry commented on August 28, 2024

[error_icingadb_Node1.log](https://github.com/Icinga/icingadb/files/15023144/error_icingadb_Node1.log)
I don't think so. If only that is to the database of IcingaDB. As written.. Everything was stable until before the upgrade
IcingaDB's database runs on IcingaWeb's server.
So a remote database for the 2 nodes in the HA cluster. It's a MariaDB (V 10.11.7).
There are no abnormalities there.
I've now adjusted the config.yml and restarted the IcingaDB daemon.

I'll attach the JSON output of both nodes of the last 72 hours to the ticket
error_icingadb_Node1.log

error_icingadb_Node2.log

from icingadb.

oxzi commented on August 28, 2024

Thanks for providing those logs. However, it seems like the logging level is too silent.

A small inspection with

jq -s 'sort_by(.__REALTIME_TIMESTAMP | tonumber) | .[] | [._HOSTNAME, (.__REALTIME_TIMESTAMP | tonumber | . / 1000000 | strftime("%x %X %Z")), .MESSAGE, .ICINGADB_ERROR]' < error_icingadb_Node?.log

only reveals "context canceled" errors, but it misses why the context was canceled. Thus, we have to wait for the next error and have a look at the debug logs then.

from icingadb.

oxzi commented on August 28, 2024

Please excuse the second post, but could you include all available log levels/priorities in your output. I misread the --priority flag in your prior post.

If a JSON log with this call is sufficient: journalctl --boot --reverse --priority=emerg.. err --since=-72h --unit="icingadb.service" -o json

Please set --priority 0..7 or --priority emerg..debug.

from icingadb.

TheCry commented on August 28, 2024

No problem...
The logs will be huge if I start the DEBUG level until tomorrow. I attach the new logs "--priority emerg..debug"
error_icingadb_Node1_and_2.zip

from icingadb.

oxzi commented on August 28, 2024

Thanks a lot. Regardless the size, I would like to inspect tomorrow's logs after the crash. Maybe you can reduce it to round about 30m before the crash.

from icingadb.

TheCry commented on August 28, 2024

Good morning. I have great news.. There was no outage tonight.
I also had the Bareos backup disabled for the 3 servers yesterday.
So I tried to reproduce the bug this morning. And it's my database backupscript that triggers the error when dumping the databases. I have attached the logs of both nodes (1 hour backwards) with debug information.
I have my backup script for the databases attached, because I just don't understand that it didn't cause any problems before the upgrade.
The icingadb database is 47GB in size. The retention is set as follows:
history-days: 90
SLA-Days: 180

Here are the logs and the script:
error_icingadb_node1_and_2-1h.zip
DatabaseBackupScript.txt

Edit:
I'd tried to backup the icingadb database with the script from https://www.netways.de/blog/2018/06/01/icinga-backup-encrypted/
The same problem.. :(

from icingadb.

oxzi commented on August 28, 2024

Thanks for your detailed report, your logs and your script. Based on your experiment, I would guess that your backup script is taking longer than the magic five minutes that Icinga DB now retires every database error since the latest v1.2.0 release. When it reaches this limit and there is still a LOCK from the mysqldump, Icinga DB exits.

Earlier you wrote that you have configured your database to "not lock". How did you do this? Could you try configuring your mysqldump command to not LOCK or use transactions as described in this StackOverflow thread? If everything else fails, you might wanna consider stopping and restarting Icinga DB during the time of your backup?

from icingadb.

oxzi commented on August 28, 2024

I am glad to hear that. Unless you have an idea what to change, please feel free to close this issue.

from icingadb.

TheCry commented on August 28, 2024

No - ICINGA is going great and I'm more than happy with it. Thanks again for your support, as it should have been in the forum. The only thing I didn't think about was the database dump at first

from icingadb.

IcingaDB HA crashs every night after upgrade to 1.2.0 about icingadb HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent