ansible-nagios

Playbook for setting up the Nagios monitoring server and clients (CentOS/Rocky/RHEL/Fedora/FreeBSD)

What does it do?

  • Automated deployment of Nagios Server on CentOS7, Rocky 8/9 or RHEL 7/8/9
  • Automated deployment of Nagios client on CentOS6/7/8, RHEL6/7/8/9 or Rocky, Fedora and FreeBSD
    • Generates service checks and monitored hosts from Ansible inventory
    • Generates comprehensive checks for the Nagios server itself
    • Generates comprehensive checks for all hosts/services via NRPE
    • Generates most of the other configs based on jinja2 templates
    • Wraps Nagios in SSL via Apache
    • Sets up proper firewall rules (firewalld or iptables-services)
    • Supports sending alerts via email and outgoing webhooks
    • This is also available via Ansible Galaxy

How do I use it?

  • Add your nagios server under [nagios] in hosts inventory
  • Add respective services/hosts under their inventory group. A host can only belong to one group.
  • Take a look at install/group_vars/all.yml to change anything like email address, nagios user, guest user etc.
  • Run the playbook. Read below for more details if needed.
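
Because a host can only belong to one inventory group, it may help to check the hosts file before running the playbook. The following is a minimal sketch, not part of the playbook: find_duplicate_hosts is a hypothetical helper name, and it assumes a plain INI inventory like the one in this repository. Sections such as [group:vars] or [group:children] are skipped.

```shell
#!/bin/sh
# find_duplicate_hosts is a hypothetical helper (not shipped with the playbook).
# It prints every inventory hostname that appears under more than one [group].
find_duplicate_hosts() {
    awk '
        /^[ \t]*$/    { next }                   # skip blank lines
        /^[ \t]*[#;]/ { next }                   # skip comment lines
        /^\[/ {
            group = $1                           # remember the current [group]
            skip = (index($1, ":") > 0)          # ignore [x:vars] / [x:children]
            next
        }
        skip { next }
        {
            host = $1                            # first field is the inventory hostname
            if (host in seen && seen[host] != group)
                print host ": " seen[host] " " group
            else
                seen[host] = group
        }
    ' "$1"
}

# Usage: find_duplicate_hosts hosts
```

No output means no host was found in two groups.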

Requirements

  • CentOS7 or RHEL7/8/9 or Rocky 8/9 for Nagios server only (for now).
  • RHEL6/7/8/9, CentOS6/7/8/9, Fedora or FreeBSD for the NRPE Nagios client
  • If you require SuperMicro server monitoring via IPMI (optional) then do the following
    • Install the perl-IPC-Run and perl-IO-Tty RPMs on RHEL7 for optional IPMI sensor monitoring on SuperMicro.
      • I've placed them here if you can't find them, CentOS7 has them however.
    • Modify install/group_vars/all.yml to include supermicro_enable_checks: true
  • Please note I'll likely remove IPMI sensor monitoring support because it's a real pain and not that reliable. SNMP with a MIB is better.

Notes

  • Sets the nagiosadmin password to changeme. You'll want to change this.
  • Creates a read-only user, set nagios_create_guest_user: false to disable this in install/group_vars/all.yml
  • You can turn off creation/management of firewall rules via install/group_vars/all.yml
  • Adding new hosts to inventory file will just regenerate the Nagios configs
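
As an illustration of the last note, adding a host comes down to inserting a line under the right group and re-running the playbook. This is a rough sketch only: add_host_to_group is a hypothetical helper, and server02 is a made-up hostname.

```shell
#!/bin/sh
# add_host_to_group is a hypothetical helper (not shipped with the playbook).
# It prints the inventory with a new host line inserted directly under the
# named [group] header, leaving the original file untouched.
add_host_to_group() {
    inventory=$1
    group=$2
    hostline=$3
    awk -v header="[$group]" -v hostline="$hostline" '
        { print }                          # copy every existing line
        $0 == header { print hostline }    # add the new host right after the header
    ' "$inventory"
}

# Usage (commented out because it rewrites the inventory and runs the playbook):
# add_host_to_group hosts servers server02 > hosts.new && mv hosts.new hosts
# ansible-playbook -i hosts install/nagios.yml
```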

Supported Service Checks

  • Implementation is very simple, with the following resource/service checks generated:
    • Generic out-of-band interfaces (ping, ssh, http)
    • Generic Linux servers (ping, ssh, load, users, procs, uptime, disk space, swap, zombie procs)
    • Generic Linux servers with MDADM RAID (same as above)
    • ELK servers (same as servers plus elasticsearch and Kibana)
    • Elasticsearch (same as servers plus TCP/9200 for elasticsearch)
    • Webservers (same as servers plus 80/TCP for webserver)
    • Webservers with SSL certificate checking (same as webservers plus checks SSL certificate validity/expiration)
    • DNS Servers (same as servers plus UDP/53 for DNS)
    • DNS Servers with MDADM RAID (same as above)
    • DNS Service Only (DNS and ICMP check)
    • Jenkins CI (same as servers plus TCP/8080 for Jenkins and optional nginx reverse proxy with auth)
    • FreeNAS Appliances (ping, ssh, volume status, alerts, disk health)
    • Network switches (ping, ssh)
    • IoT and ping-only devices (ping)
    • Dell iDRAC server checks via @dangmocrang check_idrac
      • You can select which checks you want in install/group_vars/all.yml
        • CPU, DISK, VDISK, PS, POWER, TEMP, MEM, FAN
    • SuperMicro server checks via the IPMI interface.
      • CPU, DISK, PS, TEMP, MEM: or anything supported via freeipmi sensors.
      • Note: This is not the best way to monitor things. SNMP checks are a work in progress once we purchase licenses for them for our systems.
  • contacts.cfg notification settings are in install/group_vars/all.yml and templated for easy modification.

Nagios Server Instructions

  • Clone the repo and set up your Ansible inventory (hosts) file
git clone https://github.com/sadsfae/ansible-nagios
cd ansible-nagios
sed -i 's/host-01/yournagioshost/' hosts
  • Add any hosts for checks in the hosts inventory
  • The same host can only belong to one host inventory category
  • Note that you need to add ansible_host entries only for IP addresses for idrac, switches, out-of-band interfaces and anything that typically doesn't support Python and Ansible fact discovery.
  • Anything not an idrac, switch or oobserver should use the FQDN (or an /etc/hosts entry) for the inventory hostname or you may see this error:
    • AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'
[webservers]
webserver01

[switches]
switch01 ansible_host=192.168.0.100
switch02 ansible_host=192.168.0.102

[oobservers]
webserver01-ilo ansible_host=192.168.0.105

[servers]
server01

[servers_with_mdadm_raid]

[jenkins]
jenkins01

[dns]

[dns_with_mdadm_raid]

[idrac]
database01-idrac ansible_host=192.168.0.106

[supermicro-6048r]
web01-supermicro-ipmi ansible_host=192.168.0.108

[supermicro-6018r]

[supermicro-1028r]

  • Run the playbook
ansible-playbook -i hosts install/nagios.yml
  • Navigate to the server at https://yourhost/nagios
  • Default login is nagiosadmin / changeme unless you changed it in install/group_vars/all.yml

Known Issues

  • If you're using a non-root Ansible user you will want to edit the following setting in install/group_vars/all.yml, e.g. for AWS EC2:
ansible_system_user: ec2-user
  • SELinux doesn't always play well with Nagios, or the policies may be out of date as shipped with CentOS/RHEL.
avc: denied { create } for pid=8800 comm="nagios" name="nagios.qh
  • If you see this (or nagios doesn't start) you'll need to create an SELinux policy module.
# cat /var/log/audit/audit.log | audit2allow -M mynagios
# semodule -i mynagios.pp

Now restart Nagios and Apache and you should be good to go.

systemctl restart nagios
systemctl restart httpd

If all else fails, set SELinux to permissive until Nagios is running, then run the above commands again.

setenforce 0

Once the policy module is installed, switch back to enforcing mode.

setenforce 1
  • If you have errors on RHEL7 you may need a few Perl packages if you opted to include SuperMicro monitoring via:
supermicro_enable_checks: true
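
For the SELinux step above, feeding the whole audit log to audit2allow can pull unrelated denials into the module. One possible refinement is to keep only the denials logged for the nagios command first. This is a sketch under that assumption; nagios_denials is a hypothetical helper name.

```shell
#!/bin/sh
# nagios_denials is a hypothetical helper (not shipped with the playbook).
# It prints only the AVC denial records whose command name is "nagios".
nagios_denials() {
    grep 'avc: *denied' "$1" | grep 'comm="nagios"'
}

# Usage on the Nagios server as root (commented out because it installs a
# policy module):
# nagios_denials /var/log/audit/audit.log | audit2allow -M mynagios
# semodule -i mynagios.pp
```

Review the generated mynagios.te file before installing the module so you know what is being allowed.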

Mass-generating Ansible Inventory

If you're using something like QUADS to manage your infrastructure automation scheduling you can do the following to generate all of your out-of-band or iDRAC interfaces.

quads-cli --ls-hosts | sed -e 's/^/mgmt-/g' > /tmp/all_ipmi_2019-10-23
for ipmi in $(cat /tmp/all_ipmi_2019-10-23); do printf '%s' "$ipmi"; echo " ansible_host=$(host "$ipmi" | awk '{print $NF}')"; done > /tmp/add_oobserver

Now you can paste /tmp/add_oobserver under the [oobservers] or [idrac] Ansible inventory group respectively.
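
When a name does not resolve, the loop above still writes a line, with whatever the last word of the lookup error was as the address. A slightly more defensive variant is sketched below. build_oob_inventory and resolve_with_getent are hypothetical helper names, and the use of getent for lookups is an assumption about the system.

```shell
#!/bin/sh
# build_oob_inventory is a hypothetical helper (not shipped with the playbook).
# It reads hostnames on stdin, resolves each one with the function or command
# named in $1 (which must print an IP address or nothing), and prints
# "name ansible_host=IP" lines. Unresolvable names go to stderr instead.
build_oob_inventory() {
    resolver=$1
    while IFS= read -r name; do
        [ -n "$name" ] || continue
        ip=$("$resolver" "$name")
        if [ -n "$ip" ]; then
            printf '%s ansible_host=%s\n' "$name" "$ip"
        else
            printf 'skipping %s: no address found\n' "$name" >&2
        fi
    done
}

# One possible resolver, assuming getent is available on the system.
resolve_with_getent() {
    getent hosts "$1" | awk 'NR == 1 { print $1 }'
}

# Usage, assuming quads-cli is installed as in the commands above:
# quads-cli --ls-hosts | sed -e 's/^/mgmt-/' | build_oob_inventory resolve_with_getent > /tmp/add_oobserver
```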

Demonstration

  • You can view a video of the Ansible deployment here:


iDRAC Server Health Details

  • The iDRAC health checks are all optional; you can pick which ones you want to monitor.
  • The iDRAC health check provides exhaustive health information and alerts on it.

Files

.
├── hosts
├── install
│   ├── group_vars
│   │   └── all.yml
│   ├── nagios.yml
│   └── roles
│       ├── firewall
│       │   └── tasks
│       │       └── main.yml
│       ├── firewall_client
│       │   └── tasks
│       │       └── main.yml
│       ├── instructions
│       │   └── tasks
│       │       └── main.yml
│       ├── nagios
│       │   ├── files
│       │   │   ├── check_ipmi_sensor
│       │   │   ├── idrac_2.2rc4
│       │   │   ├── idrac-smiv2.mib
│       │   │   ├── nagios.cfg
│       │   │   └── nagios.conf
│       │   ├── handlers
│       │   │   └── main.yml
│       │   ├── tasks
│       │   │   └── main.yml
│       │   └── templates
│       │       ├── cgi.cfg.j2
│       │       ├── check_freenas.py.j2
│       │       ├── commands.cfg.j2
│       │       ├── contacts.cfg.j2
│       │       ├── devices.cfg.j2
│       │       ├── dns.cfg.j2
│       │       ├── dns_with_mdadm_raid.cfg.j2
│       │       ├── elasticsearch.cfg.j2
│       │       ├── elkservers.cfg.j2
│       │       ├── freenas.cfg.j2
│       │       ├── idrac.cfg.j2
│       │       ├── ipmi.cfg.j2
│       │       ├── jenkins.cfg.j2
│       │       ├── localhost.cfg.j2
│       │       ├── oobservers.cfg.j2
│       │       ├── servers.cfg.j2
│       │       ├── servers_with_mdadm_raid.cfg.j2
│       │       ├── services.cfg.j2
│       │       ├── supermicro_1028r.cfg.j2
│       │       ├── supermicro_6018r.cfg.j2
│       │       ├── supermicro_6048r.cfg.j2
│       │       ├── switches.cfg.j2
│       │       └── webservers.cfg.j2
│       └── nagios_client
│           ├── files
│           │   ├── bsd_check_uptime.sh
│           │   └── check_raid
│           ├── handlers
│           │   └── main.yml
│           ├── tasks
│           │   └── main.yml
│           └── templates
│               └── nrpe.cfg.j2
├── meta
│   └── main.yml
└── tests
    └── test-requirements.txt

21 directories, 43 files

Contributors

qcu87z, sadsfae

ansible-nagios's Issues

Nagios start error

fatal: [hostname]: FAILED! => {"changed": true, "cmd": ["systemctl", "restart", "nagios.service"], "delta": "0:00:00.037135", "end": "2017-01-28 17:05:23.980575", "failed": true, "rc": 1, "start": "2017-01-28 17:05:23.943440", "stderr": "Job for nagios.service failed because the control process exited with error code. See "systemctl status nagios.service" and "journalctl -xe" for details.", "stdout": "", "stdout_lines": [], "warnings": []}

[root@li849-175 ansible-nagios]# journalctl -xe
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Nagios Core 4.0.8
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Copyright (c) 1999-2009 Ethan Galstad
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Last Modified: 08-12-2014
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: License: GPL
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Website: http://www.nagios.org
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Reading configuration data...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: use_embedded_perl_implicitly is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: enable_embedded_perl is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: p1_file is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: sleep_time is deprecated and will be removed.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: external_command_buffer_slots is deprecated and will be removed. All commands are always processed upon arriva
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: command_check_interval is deprecated and will be removed. Commands are always handled on arrival
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Read main config file okay...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Warning: Duplicate definition found for host 'hostname' (config file '/etc/nagios/conf.d/webservers.cfg', starting
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Error: Could not add object property in file '/etc/nagios/conf.d/webservers.cfg' on line 9.
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Error processing object config files!
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: ***> One or more problems was encountered while processing the config files...
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: Check your configuration file(s) to ensure that they contain valid
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: directives and data defintions. If you are upgrading from a previous
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: version of Nagios, you should be aware that some variables/definitions
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: may have been removed or modified in this version. Make sure to read
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: the HTML documentation regarding the config files, as well as the
Jan 28 17:05:23 li849-175.members.linode.com nagios[23581]: 'Whats New' section to find out what has changed.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: nagios.service: control process exited, code=exited status=1
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: Failed to start Nagios Network Monitoring.
-- Subject: Unit nagios.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit nagios.service has failed.

-- The result is failed.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: Unit nagios.service entered failed state.
Jan 28 17:05:23 li849-175.members.linode.com systemd[1]: nagios.service failed.
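
The log above warns about a duplicate definition for a host. One way to look for such duplicates across the generated config files is sketched here. duplicate_host_names is a hypothetical helper, and the sketch assumes each host is declared in a "define host" block with a host_name line. The /etc/nagios/conf.d path is taken from the log.

```shell
#!/bin/sh
# duplicate_host_names is a hypothetical helper (not shipped with the playbook).
# It prints every host_name that is defined in more than one "define host"
# block across the given Nagios object config files.
duplicate_host_names() {
    awk '
        /^[ \t]*define[ \t]+host[ \t]*[{]/ { in_host = 1; next }   # start of a host block
        /^[ \t]*[}]/                       { in_host = 0; next }   # end of any block
        in_host && $1 == "host_name"       { count[$2]++ }
        END { for (h in count) if (count[h] > 1) print h }
    ' "$@"
}

# Usage on the Nagios server:
# duplicate_host_names /etc/nagios/conf.d/*.cfg
# nagios -v /etc/nagios/nagios.cfg    # Nagios' own config verification
```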

AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'

Hi,

I am facing this issue: "AnsibleUndefinedVariable: 'dict object' has no attribute 'ansible_default_ipv4'"

Here is my hosts file.

[nagios]
103.225.77.2
[servers]

mysql-conversions-1 ansible_host=192.168.139.74


[webservers]

api-docs ansible_host=192.168.152.188
[elkservers]

[elasticsearch]
[switches]

[oobservers]

[idrac]

make things more friendly for non-root users

I noticed some issues using this with Amazon EC2 when a non-root user is initially controlling systems, but that user can sudo.

We should wrap everything with become: true. It has no ill effect and helps in these cases.

Nagios 4.2.4 requires a workaround for /var/log/nagios/spool/checkresults

nagios-4.2.4-2 does not properly create the /var/log/nagios/spool/checkresults directory and the following error occurs:

Feb 24 03:26:33 example.com nagios[401]: Nagios Core 4.2.4
Feb 24 03:26:33 example.com nagios[401]: Copyright (c) 2009-present Nagios Core Development Team and Community Contributors
Feb 24 03:26:33 example.com nagios[401]: Copyright (c) 1999-2009 Ethan Galstad
Feb 24 03:26:33 example.com nagios[401]: Last Modified: 12-07-2016
Feb 24 03:26:33 example.com nagios[401]: License: GPL
Feb 24 03:26:33 example.com nagios[401]: Website: https://www.nagios.org
Feb 24 03:26:33 example.com nagios[401]: Reading configuration data...
Feb 24 03:26:33 example.com nagios[401]: Error in configuration file '/etc/nagios/nagios.cfg' - Line 454 (Check result path '/var/log/nagios/spool/checkresults' is not a valid direc
Feb 24 03:26:33 example.com nagios[401]: Error processing main config file!
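
A possible manual workaround is to create the missing directory before starting Nagios. The sketch below is not taken from the playbook: ensure_checkresults_dir is a hypothetical helper, and the nagios:nagios owner in the usage lines is an assumption about how the package runs the daemon.

```shell
#!/bin/sh
# ensure_checkresults_dir is a hypothetical helper (not shipped with the playbook).
# It creates the spool directory if it is missing and, when an owner is given,
# hands the directory to that user.
ensure_checkresults_dir() {
    dir=$1
    owner=${2:-}
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        chown "$owner" "$dir" || return 1
    fi
}

# Usage on the Nagios server as root (nagios:nagios is an assumed owner):
# ensure_checkresults_dir /var/log/nagios/spool/checkresults nagios:nagios
# systemctl restart nagios
```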

[RFE] Add alerting via webhook

This is an RFE for adding a contact definition to be a webhook URL e.g. posting alerting information to a chat platform that supports receiving webhooks like G-Chat or Slack.

[RFE] Add mdadm software raid checks

Is your feature request related to a problem? Please describe.

We need to have good mdadm raid checks.

Describe the Possible Solution
Provide ansible hostgroup entries for generic linux server and DNS server for starters to also include an option for those using mdadm raid.

Create Jenkins Server Role

We need a Jenkins server role here. It is probably best to use NRPE to monitor localhost:jenkinsport and make it configurable.

Can not update servers.cfg

Your System Details

  • Ansible version (rpm -qa | grep ansible):
  • ansible-2.9.15-1.el7.noarch
  • Operating System: (cat /etc/redhat-release)
  • CentOS Linux release 7.5.1804 (Core)

Describe the bug
failed: [localhost] (item=servers.cfg) => {"ansible_loop_var": "item", "changed": false, "item": "servers.cfg", "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'ansible_os_family'"}

Working with large number of hosts

Great work on this project. I'm trying to get this working with a largish number of servers and running into some issues. From my understanding, for this method to work it needs to gather facts for all servers, which can take quite some time and I've been finding issues if a server can't be connected to.

Are there any good ways to speed things up or work around the fact gathering?

Refresh NRPE for good measure

I have noticed that NRPE is not always started, so let's employ a service refresh. We're going to remove the nrpe_needs_restart register; using it is the correct approach, but I've noticed that it doesn't always work.

Support HTTP Auth for oobserver category

We use the [oobservers] inventory group for things like generic out-of-band interfaces and PDUs, but we need a configurable option to support the check_http Nagios plugin with authentication. This should be turned off by default and be configurable in install/group_vars/all.yml.

Don't force SuperMicro Perl Deps unless needed

Currently we require the perl-IO-Tty and perl-IPC-Run packages on EL7 for SuperMicro IPMI checks regardless of whether people are using them. This wouldn't normally be an issue, except that they don't appear in the base RHEL7 repos but do appear for CentOS7.

Let's make a change to have supermicro_enable_checks: be a configurable option with the default set to false.

Wrap SuperMicro IPMI Checks to Control Status Codes

Using IPMI as the only means of monitoring SuperMicro servers via out-of-band management isn't that ideal but it works. We should wrap the raw IPMI return values better to control the following false positives that occur.

  • Powered-off machines incorrectly alert as being down.
  • Disk checks or chassis intrusion may result in false positives.

(idrac) split out SNMP checks to scale better for lots of hosts

After implementing iDRAC checks across 200+ servers there seem to be some scalability problems. SNMP queries sometimes take 60-100 seconds on average to return, and even after increasing service_check_timeout and other parameters some checks still seem to time out.

We probably need to split out checks into individual queries per component, possibly removing ones that aren't that useful versus how long they take to return.

reloading the iptables service

Restarting the iptables service can be dangerous if you're using conntrack. Reloading is much nicer, I think.

--- a/install/roles/nagios-client/tasks/main.yml
+++ b/install/roles/nagios-client/tasks/main.yml
@@ -99,7 +99,7 @@
   register: iptables_needs_restart

 - name: Restart iptables-services for TCP/{{nrpe_tcp_port}} (iptables-services)
-  shell: systemctl restart iptables.service
+  shell: systemctl reload iptables.service
   ignore_errors: true
   when: iptables_needs_restart != 0 and firewalld_in_use.rc != 0 and firewalld_is_active.rc != 0

Split out elasticsearch and elk server templates

There is a need for separate elasticsearch and elkserver templates. If you're using the ELK Ansible playbook you'll have an all-in-one ELK with different monitoring requirements.

You'll also need to append the Kibana username/password for check_http.

If you're using a modular ELK deployment (multiple ES instances, perhaps separate kibana and several master nodes) you'll only care about monitoring elasticsearch.

fatal: [host-01]: UNREACHABLE!

Your System Details

  • Ansible version: ansible 2.9.18
  • Operating System: CentOS 8

Describe the bug

I tried to install this playbook using the README instructions, but when I run the command ansible-playbook -i hosts install/nagios.yml I get the error: fatal: [host-01]: UNREACHABLE!

I have tried using

[nagios]
host-01

[nagios]
demo

[nagios]
159.x.x.247

To Reproduce / What were you doing?
Steps to reproduce the behavior:

git clone https://github.com/sadsfae/ansible-nagios
cd ansible-nagios
sed -i 's/host-01/159.x.x.247/' hosts
time ansible-playbook -i hosts install/nagios.yml

[RFE] FreeNAS Status API Script Status needs Cleanup

The check_freenas.py script returns a dictionary when it should return a more digestible format. The entire status is present but it could be cleaned up a little bit.

{u'meta': {u'previous': None, u'total_count': 1, u'offset': 0, u'limit': 20, u'next': None}, u'objects': [{u'timestamp': 1573610556, u'message': u'Pool pool0 state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.', u'id': u'A:VolumeStatus:["Pool %(volume)s state is %(state)s: %(status)s", {"state": "DEGRADED", "status": "One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.", "volume": "pool0"}]', u'dismissed': False, u'level': u'CRITICAL'}]} 

https://github.com/sadsfae/ansible-nagios/blob/master/install/roles/nagios/templates/check_freenas.py.j2#L115

Add Same server in Multiple group

Hi,

I downloaded this module and have one question: can the same server be added to multiple groups?
For example, I added one server (10.1.2.86) to the [switches], [webservers] and [servers] groups. I expected it to get the checks defined for each of those groups, so that in Nagios the server would look like this:

10.1.2.86
[switches]: all checks
[webservers]: all checks
[servers]: all checks

How can this be done? Do you have any suggestions?

Add new hosts

Can you add a bunch of hosts to the already installed nagios?
