Comments (13)
It seems that mysqldump sets a lock by default; I had not configured one in my script. I have now created a dump with "--single-transaction=TRUE", and there was no outage.
Thank you for your support.
from icingadb.
Thanks for coming forward with this issue. Based on your provided information and what was within the stack trace, it seems like your Redis instance is unavailable or at least unhealthy. The error "runtime-updates: Can't execute Redis pipeline" (source) indicates this.
One change between v1.1.1 and v1.2.0 was 81085c0, which effectively stops retrying a failed HA and gives up after five minutes. Thus, it might be the case that this error was already there before upgrading Icinga DB to v1.2.0, but was just invisible.
Speaking of visibility, what is the other Icinga DB node doing? Could you please provide more extensive logs extending further into the past, from both Icinga DB nodes including their Redis? Please raise the logging level of all components to `debug` and restart the Icinga DBs before the next expected crash.
As additional information is logged as fields when using the systemd-journald output, you might want to include those in your journalctl output format, e.g., by using `journalctl -o json`.
What exactly does Bareos do at this moment? Does it interact with the Redis instance or reconfigure the network? Is something else altering the system state at this moment?
from icingadb.
Hello,
the other node had exactly the same error message this morning. Bareos only performs a database backup. The database is not set to "lock". For tonight I deactivated the backup of the 3 servers to make sure that it has nothing to do with it.
No, Bareos does not interact with Redis. There is also no error message from icingadb-redis in the journal.
Would a JSON log from this call be sufficient?
journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json
I'd edit the config.yml of IcingaDB to set logging to "debug":
logging:
  level: debug
  options:
    config-sync:
    database:
    dump-signals:
    heartbeat:
    high-availability:
    history-sync:
    overdue-sync:
    redis:
    retention:
    runtime-updates:
    telemetry:
Is that correct?
from icingadb.
> Hello the other node has exactly the same error message this morning. Bareos only performs a database backup. The database is not set to "lock".
That's unfortunate.
Since it happens on both nodes, I would suspect a database-related issue.
Since v1.2.0 (779afd1), additional information is attached to the "Handing over" message appearing in your posted log. There should also be fields present for the "runtime-updates: Can't execute Redis pipeline" line.
If your log hasn't been rotated yet, could you post your logs with `--output json`, especially regarding those lines?
> There is also no error message from icingadb-redis in the journal.
Okay. How does your database setup look? Is it a MySQL/MariaDB or PostgreSQL? Is it a single node, a federation or a cluster? Are there suspicious logging entries around the same time?
> If a JSON log with this call is sufficient:
> journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json
This looks good!
> I'd edit the config.yml of IcingaDB to logging "debug":
This looks good, too. However, you can remove the whole `options` block, as each component is empty and thus inherits the default config options from above.
from icingadb.
I don't think so. If anything, it would only concern the database of IcingaDB. As written, everything was stable before the upgrade.
IcingaDB's database runs on IcingaWeb's server, so it is a remote database for the 2 nodes in the HA cluster. It's a MariaDB (version 10.11.7).
There are no abnormalities there.
I've now adjusted the config.yml and restarted the IcingaDB daemon.
I'll attach the JSON output of both nodes for the last 72 hours to the ticket:
[error_icingadb_Node1.log](https://github.com/Icinga/icingadb/files/15023144/error_icingadb_Node1.log)
from icingadb.
Thanks for providing those logs. However, it seems like the logging level is not verbose enough.
A small inspection with
jq -s 'sort_by(.__REALTIME_TIMESTAMP | tonumber) | .[] | [._HOSTNAME, (.__REALTIME_TIMESTAMP | tonumber | . / 1000000 | strftime("%x %X %Z")), .MESSAGE, .ICINGADB_ERROR]' < error_icingadb_Node?.log
only reveals "context canceled" errors, but not why the context was canceled. Thus, we have to wait for the next error and have a look at the debug logs then.
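For readers without jq at hand, the same inspection can be sketched in Python. This is only an illustrative equivalent, assuming the log files contain one JSON object per line, as `journalctl -o json` emits them; the function name `summarize_journal` is made up for this example, and the field names are the ones used in the jq call above.

```python
import json
from datetime import datetime, timezone


def summarize_journal(lines):
    """Return (host, UTC time, message, error) tuples sorted by timestamp."""
    entries = [json.loads(line) for line in lines if line.strip()]
    # __REALTIME_TIMESTAMP is a string holding microseconds since the epoch.
    entries.sort(key=lambda e: int(e["__REALTIME_TIMESTAMP"]))
    rows = []
    for e in entries:
        seconds = int(e["__REALTIME_TIMESTAMP"]) / 1_000_000
        ts = datetime.fromtimestamp(seconds, tz=timezone.utc)
        rows.append((
            e.get("_HOSTNAME"),
            ts.strftime("%Y-%m-%d %H:%M:%S UTC"),
            e.get("MESSAGE"),
            e.get("ICINGADB_ERROR"),  # None when the entry carries no error field
        ))
    return rows
```

Passing the concatenated lines of both node logs yields one merged, time-ordered view, which makes it easier to see which node logged an error first.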
from icingadb.
Please excuse the second post, but could you include all available log levels/priorities in your output? I misread the `--priority` flag in your prior post.
> If a JSON log with this call is sufficient:
> journalctl --boot --reverse --priority=emerg..err --since=-72h --unit="icingadb.service" -o json
Please set `--priority 0..7` or `--priority emerg..debug`.
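To see why the earlier range hid the interesting entries: journalctl uses the syslog priority levels, numbered from 0 (emerg) to 7 (debug), and a FROM..TO range selects only the levels in between. The following is a small sketch of that arithmetic; the helper `levels_in_range` is hypothetical and not part of any tool.

```python
# syslog priority names in numeric order, 0 (emerg) through 7 (debug)
PRIORITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def levels_in_range(spec):
    """Expand a 'FROM..TO' priority range (names or numbers) into level names."""
    def to_number(token):
        return int(token) if token.isdigit() else PRIORITIES.index(token)

    start, end = (to_number(t) for t in spec.split(".."))
    return PRIORITIES[start:end + 1]


print(levels_in_range("emerg..err"))    # the range used at first
print(levels_in_range("emerg..debug"))  # the range requested here
print(levels_in_range("0..7"))          # numeric spelling of the same range
```

The first range stops at err, so warning, notice, info, and debug entries never reach the output, regardless of the logging level configured in Icinga DB.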
from icingadb.
No problem...
The logs will be huge if I keep the DEBUG level running until tomorrow. I have attached the new logs, created with "--priority emerg..debug":
error_icingadb_Node1_and_2.zip
from icingadb.
Thanks a lot. Regardless of the size, I would like to inspect tomorrow's logs after the crash. Maybe you can reduce them to roughly the 30 minutes before the crash.
from icingadb.
Good morning. I have great news: there was no outage last night.
I also had the Bareos backup disabled for the 3 servers yesterday.
So I tried to reproduce the bug this morning, and it is my database backup script that triggers the error when dumping the databases. I have attached the logs of both nodes (1 hour backwards) with debug information.
I have also attached my backup script for the databases, because I just don't understand why it didn't cause any problems before the upgrade.
The icingadb database is 47GB in size. The retention is set as follows:
history-days: 90
sla-days: 180
Here are the logs and the script:
error_icingadb_node1_and_2-1h.zip
DatabaseBackupScript.txt
Edit:
I also tried to back up the icingadb database with the script from https://www.netways.de/blog/2018/06/01/icinga-backup-encrypted/
The same problem occurs. :(
from icingadb.
Thanks for your detailed report, your logs and your script. Based on your experiment, I would guess that your backup script is taking longer than the magic five minutes for which Icinga DB now retries every database error since the latest v1.2.0 release. When it reaches this limit and there is still a LOCK from the mysqldump, Icinga DB exits.
Earlier you wrote that you have configured your database to "not lock". How did you do this? Could you try configuring your mysqldump command to not LOCK or to use transactions, as described in this StackOverflow thread? If everything else fails, you might want to consider stopping Icinga DB for the time of your backup and restarting it afterwards.
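The give-up behaviour described here can be illustrated with a generic retry loop. This is not Icinga DB's actual code (Icinga DB is written in Go); it is a minimal sketch of the pattern, and the names `retry_until`, `operation`, `timeout`, `delay`, `clock`, and `sleep` are invented for the example. Only the five-minute limit is taken from this thread.

```python
import time


def retry_until(operation, timeout=300.0, delay=1.0,
                clock=time.monotonic, sleep=time.sleep):
    """Call operation() until it succeeds or `timeout` seconds have elapsed.

    The last error is re-raised once the deadline has passed, which is the
    point at which a daemon following this pattern would give up and exit.
    """
    deadline = clock() + timeout
    while True:
        try:
            return operation()
        except Exception:
            if clock() >= deadline:
                raise  # deadline reached: stop retrying
            sleep(delay)
```

With a loop like this, a database lock held for longer than `timeout` makes every attempt fail until the deadline is hit, whereas a dump that avoids long locks lets one of the retries succeed well before the limit.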
from icingadb.
I am glad to hear that. Unless you have an idea what to change, please feel free to close this issue.
from icingadb.
No, Icinga is running great and I'm more than happy with it. Thanks again for your support, even though this should have been posted in the forum. The only thing I didn't think of at first was the database dump.
from icingadb.