stackhpc / ansible-collection-cephadm Goto Github PK

License: Apache License 2.0

Jinja 2.53% Python 97.47%

ansible-collection-cephadm's Introduction

StackHPC cephadm collection

This repo contains stackhpc.cephadm Ansible Collection. The collection includes modules and plugins supported by StackHPC for cephadm based deployments.

Tested with Ansible

Tested with the current Ansible 2.9 and 2.10 releases.

Included content

Roles:

cephadm for deployment/bootstrapping
commands for running arbitrary commands
crush_rules for defining CRUSH rules
ec_profiles for defining EC profiles
keys for defining auth keys
pools for defining pools

Using this collection

Before using the collection, you need to install the collection with the ansible-galaxy CLI:

ansible-galaxy collection install stackhpc.cephadm

You can also include it in a requirements.yml file and install it via ansible-galaxy collection install -r requirements.yml using the format:

collections:
- name: stackhpc.cephadm

See Ansible Using collections for more details.

Release notes handling

See antsibull-changelog docs for instructions how to deal with release notes.

More information

Licensing

Apache License Version 2.0

ansible-collection-cephadm's People

Contributors

Stargazers

Watchers

Forkers

creesb piersharding alexhill jheikki100 cityofships drmattheath stuartc-graphcore ironashram huongbn thg-openstack

ansible-collection-cephadm's Issues

Cephadm and Podman version

Adjust the role to install supported Podman version: https://docs.ceph.com/en/latest/cephadm/compatibility/#compatibility-with-podman-versions

`cephadm_custom_repos: true` prevents installation of Reef on Ubuntu 22.04

Turns out setting 'cephadm_custom_repos' as true installs quincy... the default version on Ubuntu 22.04! But this causes the playbook to ignore cephadm_ceph_release:

cephadm_ceph_release: "reef"
cephadm_custom_repos: true  # Should be enabled for Ubuntu 22.04

Results in:

root@storage-13-09002:/# ceph version
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

On this page we're told "If using Ubuntu 22.04 this should be set to true." for cephadm_custom_repos: https://github.com/stackhpc/ansible-collection-cephadm/blob/083dcea047f477d7a44da40c34d7c216b1026a44/roles/cephadm/README.md

But why is that? Is there any problem with using the custom repos for Ubuntu 22.04?

Allow enabling dashboard and monitoring stack post-bootstrap

cephadm does not respect setting of cephadm_registry_url

If the docker registry is overriden with the cephadm_registry_url, cephadm will continue to attempt to bootstrap even if required images are not resident in the registry. In this case, the playbook will revert to quay.io for redhat systems, which can cause silent failures if the target hosts are unable to access quay or potential inconsistencies if quay.io is accessible.

Update of pg_num/pgp_num is silently ignored when autoscale_mode is on

Updated pg_num/pgp_num values for a pool are silently ignored if autoscale_mode is on:

                if details['pg_autoscale_mode'] == 'on':
                    delta.pop('pg_num', None)
                    delta.pop('pgp_num', None)

https://github.com/stackhpc/ansible-collection-cephadm/blob/1.14.0/plugins/modules/cephadm_pool.py#L557-L559

This means that autoscale needs to be disabled first, but it has to be done in a separate run:

run with pg_autoscale_mode: off if it is currently on
run again to apply updated pg_num/pgp_num values

Adding OSDs directly after bootstrap fails because cluster is not yet up

I'm running into an issue with the stackhpc.cephadm.cephadm role where adding the OSDs directly after bootstrapping the cluster fails on the mon hosts not used for bootstrapping. Re-running my playbook does succeed, so this might be fixed by adding a task that waits until all mons are up.

Here's the stderr of the "Add OSDs individually" task on a failing host for reference:

Unable to find image 'quay.io/ceph/ceph:v17' locally
v17: Pulling from ceph/ceph\n6c5de04c936d: Pulling fs layer
f1ee40d9db4a: Pulling fs layer
17facd475902: Pulling fs layer
0d557d32f54e: Pulling fs layer
a12aac7905a4: Pulling fs layer
a12aac7905a4: Waiting
0d557d32f54e: Waiting
f1ee40d9db4a: Verifying Checksum
f1ee40d9db4a: Download complete
17facd475902: Verifying Checksum
17facd475902: Download complete
0d557d32f54e: Verifying Checksum
0d557d32f54e: Download complete
6c5de04c936d: Verifying Checksum
6c5de04c936d: Download complete
6c5de04c936d: Pull complete
f1ee40d9db4a: Pull complete
17facd475902: Pull complete
0d557d32f54e: Pull complete
a12aac7905a4: Verifying Checksum
a12aac7905a4: Download complete
a12aac7905a4: Pull complete
Digest: sha256:2b73ccc9816e0a1ee1dfbe21ba9a8cc085210f1220f597b5050ebfcac4bdd346
Status: Downloaded newer image for quay.io/ceph/ceph:v17
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)',)

Add support for unmanaged: True setting

SSH key generation and bootstrap may happen on different hosts

Sometimes we see a bootstrap failure, where cephadm cannot read the previously generated SSH key. This seems to be because it generates the key on one host, then tries to bootstrap on another.

This seems to be some bug with run_once and delegate_to.

CRUSH rule changes not applied once pools exist

CRUSH rule assignments are not being applied after a pool has been created.

The bug is that updating Ceph pools doesn't evaluate changes in CRUSH rule here: https://github.com/stackhpc/ansible-collection-cephadm/blob/master/plugins/modules/cephadm_pool.py#L304-L305

Adding crush_rule to that list doesn't solve things because it compares pool ID with name, eg 1 with replicated_hdd, so now it always applies the change.

One way to add support for CRUSH rule configuration would be to replace pool IDs with pool names before the comparison is made.

Add support for deploying MDS

Destroy cluster doesn't fail gracefully.

Hello,

First off, awesome Ansible collection, thanks you for making it available!

Having a small issue when using this collection with -e "cephadm_recreate=true". As the previous run failed to add 2/3 hosts to my cluster the 'Destroy cluster' task fails like so:

TASK [stackhpc.cephadm.cephadm : Destroy cluster] ***********************************************************************************************************************************************
fatal: [index-16-09078]: FAILED! => {"changed": true, "cmd": ["cephadm", "rm-cluster", "--fsid", "53d7c6cc-2229-11ef-a94c-b1f216e39593", "--force"], "delta": "0:00:00.499584", "end": "2024-06-04 05:06:01.772498", "msg": "non-zero return code", "rc": 1, "start": "2024-06-04 05:06:01.272914", "stderr": "Traceback (most recent call last):\n  File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n  File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10700, in <module>\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10688, in main\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7989, in command_rm_cluster\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 8047, in _rm_cluster\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7979, in get_ceph_cluster_count\nFileNotFoundError: [Errno 2] No such file or directory: '/var/lib/ceph'", "stderr_lines": ["Traceback (most recent call last):", "  File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main", "    return _run_code(code, main_globals, None,", "  File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code", "    exec(code, run_globals)", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10700, in <module>", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10688, in main", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7989, in command_rm_cluster", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 8047, in _rm_cluster", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7979, in get_ceph_cluster_count", "FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/ceph'"], "stdout": "Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593", "stdout_lines": ["Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593"]}
fatal: [storage-16-09074]: FAILED! => {"changed": true, "cmd": ["cephadm", "rm-cluster", "--fsid", "53d7c6cc-2229-11ef-a94c-b1f216e39593", "--force"], "delta": "0:00:00.513107", "end": "2024-06-04 05:06:01.810504", "msg": "non-zero return code", "rc": 1, "start": "2024-06-04 05:06:01.297397", "stderr": "Traceback (most recent call last):\n  File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n    return _run_code(code, main_globals, None,\n  File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10700, in <module>\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10688, in main\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7989, in command_rm_cluster\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 8047, in _rm_cluster\n  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7979, in get_ceph_cluster_count\nFileNotFoundError: [Errno 2] No such file or directory: '/var/lib/ceph'", "stderr_lines": ["Traceback (most recent call last):", "  File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main", "    return _run_code(code, main_globals, None,", "  File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code", "    exec(code, run_globals)", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10700, in <module>", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 10688, in main", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7989, in command_rm_cluster", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 8047, in _rm_cluster", "  File \"/tmp/tmpwf3vvwn_.cephadm.build/__main__.py\", line 7979, in get_ceph_cluster_count", "FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/ceph'"], "stdout": "Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593", "stdout_lines": ["Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593"]}
changed: [storage-14-09034] => {"changed": true, "cmd": ["cephadm", "rm-cluster", "--fsid", "53d7c6cc-2229-11ef-a94c-b1f216e39593", "--force"], "delta": "0:00:07.164754", "end": "2024-06-04 05:06:08.185614", "rc": 0, "start": "2024-06-04 05:06:01.020860", "stderr": "", "stderr_lines": [], "stdout": "Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593", "stdout_lines": ["Deleting cluster with fsid: 53d7c6cc-2229-11ef-a94c-b1f216e39593"]}

TASK [stackhpc.cephadm.cephadm : Remove ssh keys] ***********************************************************************************************************************************************
changed: [storage-14-09034] => (item=/etc/ceph/cephadm.id) => {"ansible_loop_var": "item", "changed": true, "item": "/etc/ceph/cephadm.id", "path": "/etc/ceph/cephadm.id", "state": "absent"}
changed: [storage-14-09034] => (item=/etc/ceph/cephadm.pub) => {"ansible_loop_var": "item", "changed": true, "item": "/etc/ceph/cephadm.pub", "path": "/etc/ceph/cephadm.pub", "state": "absent"}

TASK [stackhpc.cephadm.cephadm : Run prechecks] *************************************************************************************************************************************************
included: /home/mcollins1/.ansible/collections/ansible_collections/stackhpc/cephadm/roles/cephadm/tasks/prechecks.yml for storage-14-09034

This causes the subsequent tasks to not be applied to those failed hosts.

Perhaps a ignore_errors: true here would be appropriate.

O

Bug: Cephadm role fails if `cephadm_host_labels` absent in group/host vars

Role's defaults are not sufficient:

TASK [stackhpc.cephadm.cephadm : Template out cluster.yml] ***********************************************************************************************************************************************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'cephadm_host_labels'
fatal: [sv-hdd-12-0]: FAILED! => {"changed": false, "msg": "AnsibleUndefinedVariable: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'cephadm_host_labels'"}

Use download.ceph.com repos on CentOS and remove hardcoded octopus release

Ensure /var/log/messages does not grow too big

Wrong cluster network is used

Since commit 16fdedc, we set the cluster network via ceph config set mon cluster_network. This appears to be ignored, resulting in Ceph using the public network for replication.

2024-03-18 22:44:15,946 7feee9be7740 INFO Mon IP `10.0.0.1` is in CIDR network `10.0.0.0/24`                                                                                                         
2024-03-18 22:44:15,947 7feee9be7740 INFO Mon IP `10.0.0.1` is in CIDR network `10.0.0.0/24`                                                                                                         
2024-03-18 22:44:15,947 7feee9be7740 DEBUG Inferred mon public CIDR from local network configuration ['10.0.0.0/24', '10.0.0.0/24']                                                                   
2024-03-18 22:44:15,947 7feee9be7740 INFO Internal network (--cluster-network) has not been provided, OSD replication will default to the public_network

I'd like to remove `ceph orch apply osd -i`

See ceph/ceph#42906 . Do you mind using ceph orch apply -i instead?

Cephadm role fails if inventory hosts have different `cephadm_admin_interface`

Occurs e.g. when ingress service is to be placed on a differently configured set of hosts:

fatal: [w04s01]: FAILED! => {
    "changed": false,
    "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'bond0.1008'"
}

Add support for initial host placement

https://docs.ceph.com/en/latest/cephadm/host-management/#setting-the-initial-crush-location-of-host

Set password for dashboard admin

Add support for managing pools

Add Reef support

cephadm_ceph_release changed but package_update unset scenario

As noted in #17 - we could end up in a bad state.

Monitoring and logging support

Enable and configure Prometheus module
Enable dashboard
Set logging to files
Set log level

Stop using root

Cephadm supports using sudo - let's move to that.

Documentation

The cephadm role needs at least a decent README

cephadm role doesn't work with interfaces with dashes in it

cephadm_keys is ignored and doesn't create user+cred

Using the example below, after a successful deployment there is no global.client user in the ceph db.

---
#- name: Install epel on mons
#  hosts: mons
#  become: true
#  tasks:
#    - package:
#        name: epel-release
#        state: present
- name: Run cephadm collection
  any_errors_fatal: True
  hosts: ceph
  become: true
  vars:
    cephadm_ceph_release: "pacific"
    cephadm_fsid: "736XXXXX3a"
    cephadm_public_interface: "{{ public_net_interface }}"
    cephadm_public_network: "{{ public_net_cidr }}"
    cephadm_cluster_interface: "{{ storage_net_interface }}"
    cephadm_cluster_network: "{{ storage_net_cidr }}"
    cephadm_enable_dashboard: True
    cephadm_enable_monitoring: True
    cephadm_install_ceph_cli: True
    cephadm_enable_firewalld: False
    cephadm_bootstrap_host: "{{ groups['mgrs'][0] }}"
    cephadm_osd_spec:
      service_type: osd
      service_id: osd_spec_default
      placement:
        host_pattern: "*cephosd*"
      data_devices:
        paths:
          - /dev/vdb
          - /dev/vdc
    cephadm_pools:
      - name: data
        application: cephfs
        state: present
        size: 3
      - name: metadata
        application: cephfs
        state: present
        size: 3
      - name: rbd-internal
        application: rbd
        state: present
        size: 3
### these keys don't seem to get applied...
    cephadm_keys:
      - name: client.global
        caps:
          mds: "allow rwp"
          mon: "allow r, profile rbd"
          osd: "allow * pool=*"
          mgr: "allow rw, profile rbd pool=rbd-internal"
        state: present
    cephadm_commands:
      - "fs new gradientdev metadata data"
      - "orch apply mds gradientdev --placement 3"
      - "auth get-or-create client.gradient"
      - "auth caps client.gradient mds 'allow rwps' mon 'allow r, profile rbd' mgr 'allow rw, profile rbd pool=rbd-internal' osd 'allow rw tag cephfs *=*, profile rbd pool=rbd-internal'"
  pre_tasks:
  - name: Recursively remove /mnt/public directory
    ansible.builtin.file:
      path: /mnt/public
      state: absent
  - name: Recursively remove /mnt/poddata directory
    ansible.builtin.file:
      path: /mnt/poddata
      state: absent
  - name: Unmount /dev/vdb /mnt if mounted
    ansible.posix.mount:
      path: /mnt
      src: /dev/vdb
      state: absent
    register: mnt_unmounted
  - name: Debug unmount /dev/vdb
    ansible.builtin.debug:
      msg: "{{ mnt_unmounted }}"
    when: false
#    become: true
  - name: reboot the machine when /mnt has been removed
    ansible.builtin.reboot:
    when: mnt_unmounted.changed == true
  - name: Create /var/lib/ceph mountpoint
    ansible.builtin.file:
      path: /var/lib/ceph
      mode: 0755
      state: directory
    when: "'cephmgr' in inventory_hostname"
  - name: Mount /dev/vdb /var/lib/ceph for mons
    ansible.posix.mount:
      path: /var/lib/ceph
      src: /dev/vdb
      state: mounted
      fstype: ext4
    when: "'cephmgr' in inventory_hostname"
  - name: Generate /etc/hosts
    blockinfile:
      path: /etc/hosts
      marker_begin: BEGIN CEPH host
      block: |
        10.12.17.96 lr17-1-cephmgr1
        10.12.17.198 lr17-1-cephmgr2
        10.12.17.216 lr17-1-cephmgr3
        10.12.17.148 lr17-1-cephosd1
        10.12.17.185 lr17-1-cephosd2
        10.12.17.128 lr17-1-cephosd3
        10.12.17.54 lr17-1-cephosd4
        10.12.17.126 lr17-1-cephosd5
        10.12.17.67 lr17-1-cephosd6
        10.12.17.130 lr17-1-cephosd7
        10.12.17.31 lr17-1-cephosd8
        10.12.17.60 lr17-1-cephosd9
        10.12.17.222 lr17-1-cephosd10
        10.12.17.171 lr17-1-cephosd11
        10.12.17.39 lr17-1-cephosd12
        10.12.17.18 lr17-1-cephosd13
    become: true
  roles:
    - role: stackhpc.cephadm.cephadm
    - role: stackhpc.cephadm.pools
    - role: stackhpc.cephadm.keys
    - role: stackhpc.cephadm.commands

Add support for custom hosts spec

cannot import name 'pre_generate_ceph_cmd'

Attempting to use this collection to deploy a Ceph cluster, pool creation fails with:

ImportError: cannot import name 'pre_generate_ceph_cmd'

plugins/modules/cephadm_pool.py has
from ansible_collections.stackhpc.cephadm.plugins.module_utils.cephadm_common \ import generate_ceph_cmd, exec_command, exit_module, pre_generate_ceph_cmd

but pre_generate_ceph_cmd doesn't exist in plugins/module_utils/cephadm_common.py

removing the failing import from cephadm_pool.py allows the creation of pools (with as far as I can tell) no adverse effects.

Docker SDK dependency

Log into Docker registry task depends on Docker SDK for Python installed on remote hosts but it's not assured by the cephadm role.

Improve collection docs

Include links to roles and plugins docs