rancherfederal / rke2-ansible

RKE2 cluster provisioning via Ansible.

License: Apache License 2.0

Python 17.99% HCL 80.76% Jinja 1.24%

rke2-ansible's Issues

Support safe RKE2 cluster upgrading

Some process to drain/cordon (or not) nodes for upgrade and then re-join.

Tarball install w/SELinux enforcing fails

Environmental Info
RKE2 Version: v1.20.10+rke2r1
[root@ip-10-11-12-13 rke2]# rke2 -v
rke2 version v1.20.10+rke2r1 (a4f6020)
go version go1.16.6b7

Node(s) CPU architecture, OS, and Version
Just one server node
[root@ip-10-11-12-13 rke2]# uname -a
Linux ip-10-11-12-13.donut.org 3.10.0-1160.42.2.el7.x86_64 #1 SMP Tue Sep 7 14:49:57 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@ip-10-11-12-13 rke2]# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)

Cluster Configuration
One server node

Describe the bug
rke2-server v1.20.10 tarball install on CentOS 7.9 w/SELinux enforcing fails to start w/flag --selinux

See issue rancher/rke2#1865 for more information including reproduction details.

Support SLES 15 SP2

SLES 15 SP2 is supported by RKE2, so this playbook should support it too.

https://docs.rke2.io/install/requirements/#operating-systems

Question: Advertise node-ip?

Hi,

First, thanks for this playbook. Since k3s has no support for GlusterFS, this is a very good alternative to k3sup.

Now to my question, which may be a stupid one: I have a special setup where all hosts are on a public network.
The requirement was therefore to use WireGuard for all traffic between the hosts. A layer 4 proxy (HAProxy) should also be used for the API server. So I've created the following test setup:

server 1

public ip: 10.10.10.140
wireguard_ip: 192.168.0.1
additional hostnames: api.rke2.lb.local loadbalancer.rke2.lb.local
os: ubuntu 20.04.2 LTS
software: haproxy

server 2

public ip: 10.10.10.141
wireguard_ip: 192.168.0.2
additional hostnames: server-01.rke2.lb.local
os: ubuntu 20.04.2 LTS

server 3

public ip: 10.10.10.142
wireguard_ip: 192.168.0.3
additional hostnames: server-02.rke2.lb.local
os: ubuntu 20.04.2 LTS

server 4

public ip: 10.10.10.143
wireguard_ip: 192.168.0.4
additional hostnames: server-03.rke2.lb.local
os: ubuntu 20.04.2 LTS

Every server knows the additional hostnames of the other servers.

The HAProxy config looks like this:

global
    log /dev/log	local0
    log /dev/log	local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

    # Default SSL material locations
    ca-base /etc/ssl/certs
    crt-base /etc/ssl/private

    # See: https://ssl-config.mozilla.org/#server=haproxy&server-version=2.0.3&config=intermediate
    ssl-default-bind-ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
    ssl-default-bind-ciphersuites TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
    ssl-default-bind-options ssl-min-ver TLSv1.2 no-tls-tickets

defaults
    log	global
    mode	tcp
    option	dontlognull
    timeout connect 5000
    timeout client  50000
    timeout server  50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend rke2_api
    bind :6443
    default_backend rke2_api_servers

backend rke2_api_servers
    server server-01 192.168.0.2:6443
    #server server-02 192.168.0.3:6443
    #server server-03 192.168.0.4:6443

frontend rke2_join
    bind :9345
    default_backend rke2_join_servers

backend rke2_join_servers
    server server-01 192.168.0.2:9345
    #server server-02 192.168.0.3:9345
    #server server-03 192.168.0.4:9345

server-02 and server-03 are disabled until the setup is done.

Now I've added the file inventory/my-cluster/group_vars/all.yml with the following content:

kubernetes_api_server_host: api.rke2.lb.local

So the registration will use the loadbalancer.

My hosts.ini has the following content:

[rke2_servers]
[email protected]
[email protected]
[email protected]

[rke2_agents]

[rke2_cluster:children]
rke2_servers
rke2_agents

After running the playbook successfully, the output of kubectl get nodes -o wide shows the following content:

NAME                STATUS   ROLES                       AGE   VERSION          INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
rke2-lb-server-01   Ready    control-plane,etcd,master   50m   v1.21.2+rke2r1   10.10.10.141   <none>        Ubuntu 20.04.2 LTS   5.4.0-77-generic   containerd://1.4.4-k3s2
rke2-lb-server-02   Ready    control-plane,etcd,master   46m   v1.21.2+rke2r1   10.10.10.142   <none>        Ubuntu 20.04.2 LTS   5.4.0-77-generic   containerd://1.4.4-k3s2
rke2-lb-server-03   Ready    control-plane,etcd,master   47m   v1.21.2+rke2r1   10.10.10.143   <none>        Ubuntu 20.04.2 LTS   5.4.0-77-generic   containerd://1.4.4-k3s2

So during registration the public IP was used instead of the WireGuard IP. The load balancer log also shows no traffic on port 6443. Is it possible to change this? Is this what the --node-ip option does at startup? Could it be made configurable in the playbook?

My current setup has no agents yet, but does the same problem apply there?
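
For reference, RKE2 reads per-node options from /etc/rancher/rke2/config.yaml, and node-ip is one of them. A minimal sketch of what the file on server-01 might look like for the setup above, assuming the WireGuard address should be used both as the node IP and as the advertised API address. The tls-san entry is an assumption, added so the API certificate also covers the load balancer name:

```yaml
# /etc/rancher/rke2/config.yaml on server-01 (illustrative values taken from the setup above)
node-ip: 192.168.0.2            # WireGuard address; should then appear as INTERNAL-IP
advertise-address: 192.168.0.2  # address the API server advertises to cluster members
tls-san:
  - api.rke2.lb.local           # assumption: load balancer hostname added to the API cert
```

The same node-ip key exists for agents, so the question would likely apply there as well.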

Add uninstall script note in README

For now show how to run a simple ansible -i inventory/my-cluster/hosts.ini -m shell [run uninstall script]

current documentation:
https://docs.rke2.io/install/uninstall/

as requested by @bgulla
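
If a playbook is preferred over an ad hoc command, something like the following might do. This is only a sketch: the file name uninstall.yml is hypothetical, and the script path assumes a tarball install under the default /usr/local prefix.

```yaml
# uninstall.yml (hypothetical file name)
- name: Uninstall RKE2 from every node in the cluster
  hosts: rke2_cluster
  become: true
  tasks:
    - name: Run the RKE2 uninstall script
      command: /usr/local/bin/rke2-uninstall.sh    # assumed location for tarball installs
      args:
        removes: /usr/local/bin/rke2-uninstall.sh  # skip hosts where the script is absent
```

It would be run the same way as the main playbook, for example with ansible-playbook -i inventory/my-cluster/hosts.ini uninstall.yml.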

Add to ansible-galaxy

Hi guys,

I've been playing around with this project a bit and am impressed with it. It works well in isolation, but because of the structure of this repository, to get the roles integrated into an existing ansible repo I had to manually copy the roles/ out and paste them into our repo.
I would like to open some discussion around restructuring the repo so that it can be easily imported into ansible-galaxy, for easier integration into customers' existing infrastructure repositories.

Would something like that be possible? Hope what I'm asking for makes sense, would be happy to clarify.

Latest refactor PR breaks Ubuntu builds

A customer reported that the latest refactor breaks Ubuntu builds due to a couple of breaking changes, and recommended the following remediation:

Move the tarball code from the RPM Ansible file to the tarball Ansible file
In the tarball Ansible file, remove all delegate_to: 127.0.0.1

Add profile cis-1.5 after initial rollout

Hi,

I set up one cluster before the major rework was done, and a second cluster after it. I didn't notice that profile: cis-1.5 was no longer the default. Is there any documentation on how to enable it after the initial setup?

Thanks!

Failed: this task 'ansible.builtin.command' has extra param

Greetings,
Testing out some of the latest changes and ran into an issue that I thought I would report.

$ git log -1
commit 1fc0e3694c3ae4749d29d0a314ce1a65507b27e6 (HEAD -> main, origin/main, origin/HEAD)
Author: Mike D'Amato <[email protected]>
Date:   Fri Aug 6 13:31:40 2021 -0400

    Tarball needs agent service installation (#66)

Deploying from a fully updated Ubuntu 20.04 system to three fully updated Rocky 8.4 VMs (one server, two agents).

$ ansible-playbook --version
ansible-playbook 2.9.6
  config file = ~/Code/Github_RancherFederal_rke2-ansible/ansible.cfg
  configured module search path = ['~/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  executable location = /usr/bin/ansible-playbook
  python version = 3.8.10 (default, Jun  2 2021, 10:49:15) [GCC 9.4.0]

$ ansible-playbook site.yml -i inventory/my-cluster/hosts.ini -K
[snip]
TASK [rke2_server : Setup initial server] **************************************************************************************************
fatal: [192.168.1.54]: FAILED! => {"reason": "this task 'ansible.builtin.command' has extra params, which is only allowed in the following modules: add_host, meta, shell, include, script, win_command, import_tasks, set_fact, include_role, win_shell, include_vars, import_role, include_tasks, command, group_by, raw\n\nThe error appears to be in '~/Code/Github_RancherFederal_rke2-ansible/roles/rke2_server/tasks/first_server.yml': line 22, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Wait for kubelet process to be present on host\n  ^ here\n"}

There is no wait that I can tell. It gets to this spot and immediately errors out.

Thoughts?
Thanks!

Non-existent shell (/bin/nologin) for etcd user

The CIS hardening tasks configure the etcd user to have the shell /bin/nologin, which does not exist on Ubuntu or CentOS systems (I'm unsure of strict Debian and RedHat, but likely they're the same). On Ubuntu, the path is /usr/sbin/nologin and on CentOS, /usr/sbin/nologin or /sbin/nologin. A similar issue was submitted and fixed in kubespray, which apparently was triggering a notification from a security scanner.
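
A minimal sketch of a task that avoids the missing path, assuming the hardening step uses the user module. /usr/sbin/nologin is chosen because the report above lists it for both Ubuntu and CentOS; the system and create_home settings are assumptions:

```yaml
- name: Create the etcd user with a nologin shell that exists on the host
  user:
    name: etcd
    comment: etcd user
    shell: /usr/sbin/nologin   # present on both Ubuntu and CentOS per the report above
    create_home: false         # assumption: no home directory is needed
    system: true               # assumption: etcd is a system account
```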

Deleted manifests should be removed from the first node

If you delete a manifest file that was previously copied to the first node, and then re-run the playbook, the manifest will not be deleted on the node and its resources will be re-created.

channel downgrade not respected on yum-based systems

I tried downgrading to channel=v1.19 and ran into a problem that I believe is as a result of this line:

rke2-ansible/roles/rke2_common/tasks/rpm_install.yml

Line 97 in 89be345

state: latest # noqa package-latest

Because I had already run the play with channel=stable, the more recent .repo files already existed in /etc/yum.repos.d. So despite Ansible reporting that the specified major/minor version would be installed, the latest was in fact installed.

After manually removing the newer .repo files, the play deployed the correct version.

"Failed: this task 'ansible.builtin.command' has extra params" on HEAD w/ansible-playbook 2.10.3

Running HEAD with ansible-playbook 2.10.3 fails with the message: "ERROR! this task 'ansible.builtin.command' has extra params". This seems to be caused by the Ansible issue "Fix missing ansible.builtin FQCNs in hardcoded action names" (ansible/ansible#71824). The following task files each currently use the problematic 'ansible.builtin.command' FQCN twice:

roles/rke2_agent/tasks/main.yml
roles/rke2_server/tasks/other_servers.yml

This regression was introduced in the recent "Begin Idempotency" commit: https://github.com/rancherfederal/rke2-ansible/pull/85A. The use of 'ansible.builtin.command' in this commit is also inconsistent with the commit associated with the recently closed issue "Failed: this task 'ansible.builtin.command' has extra param" (#69).

The simple fix for this issue is to replace the use of 'ansible.builtin.command' with 'command' in these two tasks.
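
As an illustration of the proposed fix, here is a sketch of a task written with the short module name. The affected task bodies are not shown in this issue, so the task name (borrowed from the earlier report), the pgrep command, and the retry values are all hypothetical:

```yaml
# Reported to fail on the affected Ansible versions (FQCN combined with free-form params):
# - name: Wait for kubelet process to be present on host
#   ansible.builtin.command: pgrep kubelet

# Same task using the short module name, as proposed above:
- name: Wait for kubelet process to be present on host
  command: pgrep kubelet          # hypothetical command; the real task body is not shown here
  register: kubelet_check
  until: kubelet_check.rc == 0    # retry until the process is found
  retries: 20
  delay: 10
  changed_when: false             # a read-only check never changes the host
```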

remove first_server role

Clean up the roles folder and remove any old roles that aren't being used.

Optionally use INSTALL_RKE2_ARTIFACT_PATH for rke2.linux with "Rancher RKE2 Common" yum repo for rke2-selinux

Based on the directions taken by the Rancher system-upgrade-controller and the forthcoming Rancher 2.6 capi-based system-agent-installer-rke2, it seems the Rancher preference is to manage the rke2.linux self-extracting binary via a version-specific container image rather than a "Rancher RKE2 versioned" yum repo. However, on RHEL-based systems, it still seems to make sense to manage the rke2-selinux rpm via a "Rancher RKE2 Common" yum repo.

So, consider enabling use of the "Rancher RKE2 Common" yum repo separate from the "Rancher RKE2 versioned" yum repo. Also, handle configuration of selinux including yum installation of rke2-selinux package and enabling of selinux for containerd in config.yaml.

[Channels] Do you provide other RKE2 channels?

Hello,

As mentioned in the title, do you support provisioning other RKE2 channels (v1.22, v1.21, ...) as specified in the vars file of the rke2_common role?
I would like to install the Rancher web UI to manage the cluster, but at the moment that seems impossible because the Rancher Helm chart only supports Kubernetes < v1.22.0 (as mentioned here: rancher/rancher#34060 (comment)).

So I tested v1.19, but it seems the env file is needed, and I can't find any information about it.

Another issue:
One of the tarball links in the docs is dead; maybe you should remove it.

Thanks for the reply.

Add more RKE2 config host var parameters

Add the ability to add individual hostvars for these specific rke2 config parameters.

See rke2-ansible/roles/rke2_common/tasks/config.yml for reference.

node_ip
node_name
bind_address
advertise_address
node_taints=[]
node_labels=[]
node_external_ip

Add ability to add audit logging policy configuration

RKE2 supports Audit logging:

https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#audit-policy
https://docs.rke2.io/security/hardening_guide/#control-321
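
For illustration, a minimal audit policy in the format described by the Kubernetes documentation linked above. This logs request metadata for every request; how the playbook would deliver the file to the servers is left open here:

```yaml
# audit-policy.yaml: record metadata (user, verb, resource, timestamp) for all requests
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: Metadata
```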

Support an empty rke2_agents inventory group for RKE2 standalone deployments

I have been able to deploy an RKE2 standalone server by adding a single host to [rke2_servers] and NO hosts to [rke2_agents]. Due to the nature of the Ansible inventory that I am working with, I would like to remove the assumption that the rke2_agents inventory group is defined.

In particular, wherever we have when clauses like:

- inventory_hostname in groups['rke2_agents']

I would like to replace it with

- inventory_hostname in groups['rke2_agents'] | default(false)

This would not be required if it was possible to create an empty group at runtime, similar to add_host, but that does not seem to be possible.

A PR with the proposed change is forthcoming.
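
A sketch of what such a when clause could look like. Note that this variant falls back to an empty list rather than false, on the assumption that a membership test against a boolean would raise a type error in Jinja; the debug task itself is only a placeholder:

```yaml
- name: Example task that should only run on agents
  debug:
    msg: "{{ inventory_hostname }} is an agent"
  # Falls back to an empty list when the rke2_agents group is not defined at all
  when: inventory_hostname in (groups['rke2_agents'] | default([]))
```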

Idempotency

An operation is idempotent if the result of performing it once is exactly the same as the result of performing it repeatedly without any intervening actions

One should be able to run this playbook repeatedly and if no variables or inventory changes then everything should be left as is. There should not be concern about "did I run the playbook with this host/variable yet?" because if nothing has changed then the ansible-playbook run should not change anything.

Add a good rke2_config: example to the README.md

A good example of what can be put in the rke2_config: would be a nice add to the README.md
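
As a starting point, here is a sketch built only from keys that appear in debug output elsewhere in these issues (cni, debug, node-label, node-taint, profile, selinux). The values and the label are illustrative, and the group_vars location is an assumption based on the inventory layout used above:

```yaml
# inventory/my-cluster/group_vars/all.yml
rke2_config:
  profile: cis-1.6          # CIS hardening profile
  selinux: true
  cni: cilium
  debug: false
  node-label:
    - "environment=dev"     # hypothetical label
  node-taint: []
```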

Provide ability to test locally on laptop using something like Ansible Molecule

It would be nice to just run molecule test -s rhel-7.8 or molecule test -s ubuntu-20.04 to run an end-to-end simulation of the CI/CD workflow (linting, multi-OS, converge, idempotence). That would avoid having to push changes, wait for machines to build, and find breakage only after some period of time. It would also save some $$ on AWS costs, and allow folks like me who like to code and test offline while on trains, planes and automobiles :)

Support Ubuntu 18/20

RKE2 offers support for Ubuntu 18/20, so this playbook should too.
https://docs.rke2.io/install/requirements/#operating-systems

Starting multiple server nodes at once fails

According to rancher/rke2#869, starting multiple server nodes at the same time fails; they need to be started in sequence.

URLs should be variablized for easier configuration changes during deployments

Many organizations host packages and images in internal repos and may block outside access. Making these URLs variables would allow for easier switching during deployments as URLs change, and would remove environment-specific data from the code.

Node specific config fails to be written to config.yml

Ansible version: 2.9.27
ansible.utils version: 2.4.3

Example inventory

[rke2_servers]
123.123.123.123 ansible_ssh_user="root" node_ip="10.0.0.1" bind_address="10.0.0.1" advertise_adress="10.0.0.1" node_external_ip="123.123.123.133"

[rke2_agents]

[rke2_cluster:children]
rke2_servers
rke2_agents

Running the config tasks of rke2_common results in the following debug output:

[...]
TASK [rke2_common : Debug config] *************************************************************************************
ok: [123.123.123.123] => {
    "rke2_config": {
        "cni": "cilium",
        "debug": true,
        "node-label": [],
        "node-taint": [],
        "profile": "cis-1.6",
        "selinux": true
    }
}
Friday 21 January 2022  14:08:08 +0100 (0:00:00.030)       0:00:18.701 ******** 

TASK [rke2_common : Add node-ip to rke2_config] ***********************************************************************
ok: [123.123.123.123]
Friday 21 January 2022  14:08:08 +0100 (0:00:00.032)       0:00:18.733 ******** 

TASK [rke2_common : Debug changes] ************************************************************************************
ok: [123.123.123.123] => {
    "updated_rke2_config": {
        "changed": false,
        "failed": false,
        "rke2_config": {
            "cni": "cilium",
            "debug": true,
            "node-ip": "10.0.0.4",
            "node-label": [],
            "node-taint": [],
            "profile": "cis-1.6",
            "selinux": true
        }
    }
}
Friday 21 January 2022  14:08:08 +0100 (0:00:00.029)       0:00:18.763 ******** 
Friday 21 January 2022  14:08:08 +0100 (0:00:00.027)       0:00:18.790 ******** 

TASK [rke2_common : Debug config] *************************************************************************************
ok: [123.123.123.123] => {
    "rke2_config": {
        "cni": "cilium",
        "debug": true,
        "node-label": [],
        "node-taint": [],
        "profile": "cis-1.6",
        "selinux": true
    }
}
[...]

As you can see updated_rke2_config.changed is false even though the config did actually change. If I remove changed_when: false from the Add node-ip to rke2_config task the update works as expected.
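
For comparison, a sketch of one way to merge a host variable into rke2_config so that later tasks see the updated value. This is not necessarily how the role implements it; it assumes node_ip is set per host in the inventory, as in the example above:

```yaml
- name: Add node-ip to rke2_config
  set_fact:
    # combine returns a new dict; assigning it back to rke2_config makes the change stick
    rke2_config: "{{ rke2_config | default({}) | combine({'node-ip': node_ip}) }}"
  when: node_ip is defined

- name: Debug config
  debug:
    var: rke2_config
```

A value set with set_fact takes precedence over inventory and play variables, but not over extra vars passed on the command line.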

Allow providing generic RKE2 configuration

Allow providing generic RKE2 configuration parameters without needing to write ansible logic.

Config file parameters should be used as source of truth and ansible should be able to parse for key parameters like profile: cis-1.x
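
A sketch of how a task could key off the profile found in the user-supplied configuration, assuming the configuration is held in a dict variable named rke2_config as in other issues here. The debug task stands in for whatever hardening steps would actually run:

```yaml
- name: Run CIS-specific steps only when a CIS profile is requested
  debug:
    msg: "CIS profile {{ rke2_config.profile }} requested"
  # match anchors at the start of the string, so cis-1.5 and cis-1.6 both qualify
  when: (rke2_config | default({})).profile | default('') is match('cis-')
```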

Ironbank configuration

Add ability to switch on or add configuration required to use ironbank versions of RKE2 images

NetworkManager fix

Add a task to check if NetworkManager is running and apply the fix.
https://docs.rke2.io/known_issues/#networkmanager
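
A sketch of such a task sequence, assuming the fix from the linked known-issues page (telling NetworkManager to leave the CNI interfaces unmanaged). The registered variable name is hypothetical:

```yaml
- name: Gather service facts
  service_facts:

- name: Tell NetworkManager to ignore the CNI interfaces
  copy:
    dest: /etc/NetworkManager/conf.d/rke2-canal.conf
    content: |
      [keyfile]
      unmanaged-devices=interface-name:cali*;interface-name:flannel*
    mode: "0600"
  register: nm_conf   # hypothetical variable name
  when:
    - ansible_facts.services['NetworkManager.service'] is defined
    - ansible_facts.services['NetworkManager.service'].state == 'running'

- name: Reload NetworkManager so the new configuration is picked up
  systemd:
    name: NetworkManager
    state: reloaded
  when: nm_conf is changed
```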

Add external containerd support for NVIDIA

https://gist.github.com/bgulla/3b725f0eea54fdd49f4d7066e16b1d89

Using the tarball installation method on Centos8 causes "Unable to watch for tunnel endpoints..." error

Description
I am deploying a single-node RKE2 cluster using this ansible playbook. When using the air-gapped tarball installation method, I get the following error when running kubectl get pods -A :

The connection to the server was refused - did you specify the right host or port?

and when checking the system logs for rke2-server.service, I repeatedly see:

level=warning msg="Unable to watch for tunnel endpoints: Get \"https://127.0.0.1:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&resourceVersion=0&watch=true\": dial tcp 127.0.0.1:6443: connect: connection refused"

When deploying without the tarballs, I do not see this error.

Steps to Reproduce

Clone the rke2-ansible repo onto what we'll call the "deployment VM".
Install and deploy a base Centos8 VM from any of the available mirrors: http://isoredirect.centos.org/centos/8/isos/x86_64/. We'll call this the "target VM".
Make sure the target VM is reachable at a known IP from the deployment VM.
Configure passwordless SSH from your deployment VM onto the target VM.
Configure passwordless sudo on the Centos8 VM for a user you control. We'll call them "test".
Set the following values in inventory/sample/hosts.ini:

[rke2-servers]
{Insert IP of target VM} ansible_user=test

[rke2_cluster:children]
rke2_servers

Download v1.20.7+rke2r2 tarballs (rke2-images.linux-amd64.tar.zst and rke2.linux-amd64.tar.gz) from https://github.com/rancher/rke2/releases/tag/v1.20.7%2Brke2r2 and place them in tarball_install
Run the ansible playbook:

ansible-playbook site.yaml -i inventory/sample/hosts.ini

Copy the kubeconfig file from the target VM to ~/.kube/config on your deployment VM:

ssh test@{insert target VM IP} "sudo cp /etc/rancher/rke2/rke2.yaml ~/.kube/config && sudo chown \$USER:\$USER ~/.kube/config"
scp test@{insert target VM IP}:~/.kube/config ~/.kube/config

Replace the loopback address in ~/.kube/config with the target VM IP.
Run kubectl get pods -A. You should get no response or the response shown in the description.
Access the target VM and run sudo systemctl status rke2-server.service to see the "Unable to watch..." message.

Additional Detail
RKE2 Version: 1.20.7+rke2r2
Possibly related to this issue from the RKE2 repo.

Enable kube-vip on-prem lb

https://gist.github.com/bgulla/7a6a72bdc5df6febb1e22dbc32f0ca4f

Automated testing for rke2-ansible

no long lived VMs, use github actions to spin up ec2 based cluster
test against Ubuntu, RHEL, Airgap, SLES

Support CIS 1.6

cis-1.5 is supported by the playbooks but since RKE2 supports CIS 1.6 it would be great if the ansible playbooks support this as well.

EL8 minimal install doesn't include tar

When testing on a minimal install of EL8 (RHEL and CentOS), tar isn't included which means I ended up with an error:

fatal: [192.168.1.155]: FAILED! => {"changed": false, "cmd": "tar -xf /tmp/ansible.gwik8e25rke2-install.XXXXXXXXXX/rke2.linux-amd64.tar.gz -C /usr/local", "msg": "[Errno 2] No such file or directory: b'tar': b'tar'", "rc": 2}

It would be nice if tar was on the list for ansible to check/install or have it on the pre-requirements list.

Thanks!
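
A minimal sketch of a pre-requisite task that would cover this, using the generic package module so it works across distributions:

```yaml
- name: Ensure tar is installed before unpacking the RKE2 tarball
  package:
    name: tar
    state: present
```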

Enable Rancher ACE by configuring authentication-token-webhook-config-file

Support Rancher ACE enablement as described here by optionally configuring property

kube-apiserver-arg:
  - authentication-token-webhook-config-file=/var/lib/rancher/rke2/kube-api-authn-webhook.yaml

This feature should be implemented as a task much like https://github.com/rancherfederal/rke2-ansible/blob/main/roles/rke2_common/tasks/add-audit-policy-config.yml.

Stagger start RKE2 server nodes

To avoid race conditions, and the many error logs from servers that can't start until the first server is healthy and learner promotions are complete, we should block starting or restarting the next server until the current one shows a Ready state.

A possible solution is polling something like this on the host in question until True is observed.

/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
--server https://127.0.0.1:6443 get no {{ inventory_hostname }} \
-o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
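
The polling idea above could be sketched as an Ansible task along these lines. The retry count and delay are arbitrary assumptions, and the node name is assumed to match inventory_hostname, as in the command above:

```yaml
- name: Wait for this server to report Ready
  command:
    argv:
      - /var/lib/rancher/rke2/bin/kubectl
      - "--kubeconfig"
      - /etc/rancher/rke2/rke2.yaml
      - "--server"
      - https://127.0.0.1:6443
      - get
      - node
      - "{{ inventory_hostname }}"
      - "-o"
      - 'jsonpath={.status.conditions[?(@.type=="Ready")].status}'
  register: node_ready          # hypothetical variable name
  until: node_ready.stdout == "True"
  retries: 30                   # assumption: up to 5 minutes in total
  delay: 10
  changed_when: false           # read-only check
```

Using argv avoids shell quoting problems with the jsonpath expression.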

Support SELinux configuration/toggling

Cannot set other than rke2-images.linux-amd64.tar.(gz|zst) with tarball_install method

With the tarball_install method, the container image archive can only be rke2-images.linux-amd64.tar.zst or rke2-images.linux-amd64.tar.gz.

We currently use the Cilium CNI in some of our clusters and would like to import the compressed image archives from the RKE2 project without having to use a registry, as is done with the default RKE2 images.

Proposed change:
#83

Tarball install does not update contents when install path is changed

Environmental Info:
RKE2 Version:

rke2 version v1.21.6+rke2r1 (b915fc986e84582458af7131fe7f4e686f2af493)
go version go1.16.6b7

Node(s) CPU architecture, OS, and Version:

OpenSuse 15.3 (SLES 15 SP3)

Describe the bug:

When the tarball install prefix changes to /opt/rke2, or a custom path is used instead of /usr/local, installation fails because the base directory change is not reflected in the tarball contents.

Steps To Reproduce:

Use OpenSuse/SLES servers as targets.

Expected behavior:
I expected the same behavior as https://get.rke2.io/ script. There is a section that handles this:

# unpack_tarball extracts the tarball, correcting paths and moving systemd units as necessary
unpack_tarball() {
    info "unpacking tarball file to ${INSTALL_RKE2_TAR_PREFIX}"
    mkdir -p ${INSTALL_RKE2_TAR_PREFIX}
    tar xzf "${TMP_TARBALL}" -C "${INSTALL_RKE2_TAR_PREFIX}"
    if [ "${INSTALL_RKE2_TAR_PREFIX}" != "${DEFAULT_TAR_PREFIX}" ]; then
        info "updating tarball contents to reflect install path"
        sed -i "s|${DEFAULT_TAR_PREFIX}|${INSTALL_RKE2_TAR_PREFIX}|" ${INSTALL_RKE2_TAR_PREFIX}/lib/systemd/system/rke2-*.service ${INSTALL_RKE2_TAR_PREFIX}/bin/rke2-uninstall.sh
        info "moving systemd units to /etc/systemd/system"
        mv -f ${INSTALL_RKE2_TAR_PREFIX}/lib/systemd/system/rke2-*.service /etc/systemd/system/
        info "install complete; you may want to run:  export PATH=\$PATH:${INSTALL_RKE2_TAR_PREFIX}/bin"
    fi
}

Actual behavior:
Tarball install fails in two ways:

Playbook tries to move systemd units from /usr/local/... to /etc/systemd..., but source files are in tarball_dir (/opt/bin).
After fixing previous issue, we can also check that systemctl start rke2-server fails. Original rke2-server.service file points to wrong binary path. See ExecStart=/usr/local/bin/rke2 server. Therefore, rke2 agent service and uninstall script are also affected.
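
A rough Ansible equivalent of the install script's path correction might look like the following. It assumes tarball_dir holds the custom prefix and that the tarball ships rke2-server.service and rke2-agent.service. Unlike the sed call above, the replace module rewrites every match in each file, and the units are copied here rather than moved:

```yaml
- name: Update unit files and uninstall script to reflect the install path
  replace:
    path: "{{ item }}"
    regexp: '/usr/local'
    replace: "{{ tarball_dir }}"
  loop:
    - "{{ tarball_dir }}/lib/systemd/system/rke2-server.service"
    - "{{ tarball_dir }}/lib/systemd/system/rke2-agent.service"
    - "{{ tarball_dir }}/bin/rke2-uninstall.sh"
  when: tarball_dir != '/usr/local'

- name: Copy systemd units from the custom prefix to /etc/systemd/system
  copy:
    src: "{{ tarball_dir }}/lib/systemd/system/{{ item }}"
    dest: "/etc/systemd/system/{{ item }}"
    remote_src: true    # the source files already live on the target host
    mode: "0644"
  loop:
    - rke2-server.service
    - rke2-agent.service
  when: tarball_dir != '/usr/local'
```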

Additional context / logs:

Allow generic manifest configurations

RKE2 allows the user to provide manifest files for default helm chart configs as well as manifest files to be deployed once the cluster is healthy.

User should be able to provide a directory or key/common manifest configurations for ansible to apply.
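
One possible shape for this, as a sketch: copy a local directory of manifests into the directory RKE2 watches for auto-deployed manifests on a server. The variable name manifest_config_file_path is hypothetical:

```yaml
- name: Copy user-provided manifests to the RKE2 manifests directory
  copy:
    # Trailing slash copies the directory contents rather than the directory itself
    src: "{{ manifest_config_file_path }}/"   # hypothetical variable: local manifests directory
    dest: /var/lib/rancher/rke2/server/manifests/
  when: manifest_config_file_path is defined
```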

Config file created incorrectly when using multiple values in rke2_kubelet_args

If I consume the playbook and set the following vars:

vars:
    rke2_kubelet_args:
      - "feature-gates=DynamicKubeletConfig=false"
      - "image-gc-high-threshold=100"
      - "image-gc-low-threshold=99"

Then the config file that is generated has multiple instances of the kubelet-arg: variable set, instead of 1 single instance set with an array of values.

Expected output:

kubelet-arg: 
 - feature-gates=DynamicKubeletConfig=false
 - image-gc-high-threshold=100
 - image-gc-low-threshold=99

Actual:

kubelet-arg: feature-gates=DynamicKubeletConfig=false
kubelet-arg: image-gc-high-threshold=100
kubelet-arg: image-gc-low-threshold=99

In my specific case, kubelet then fails to load as it only knows about image-gc-low-threshold and not image-gc-high-threshold so I think this could've easily been missed in other clusters.

Offending task:

rke2-ansible/roles/rke2_common/tasks/config.yml

Lines 46 to 51 in bb0e7a1

- name: Add rke2_kubelet_args
  lineinfile:
    path: /etc/rancher/rke2/config.yaml
    line: "kubelet-arg: {{ item }}"
  with_items:
    - "{{ rke2_kubelet_args | default([]) }}"

EDIT:

I also assume this is the case for the other array based vars:

kube-apiserver-arg
kube-scheduler-arg
kube-controller-manager-arg
kubelet-arg
node-label
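
A sketch of one way to get the expected output: collect the arguments under a single key in a dict and serialize the whole dict once, instead of appending one line per item. It assumes the configuration is held in an rke2_config dict; note that the second task rewrites the entire file:

```yaml
- name: Add rke2_kubelet_args to rke2_config as a single list
  set_fact:
    rke2_config: "{{ rke2_config | default({}) | combine({'kubelet-arg': rke2_kubelet_args}) }}"
  when: rke2_kubelet_args | default([]) | length > 0

- name: Write the RKE2 config file in one pass
  copy:
    content: "{{ rke2_config | default({}) | to_nice_yaml(indent=2) }}"
    dest: /etc/rancher/rke2/config.yaml
    mode: "0640"
```

The same pattern would apply to the other list-valued keys named above.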

Wrong extension for images

Dear rancherfederal team.

According to the documentation, rke2-images.linux-amd64.tar.zst should be used, but it seems a typo slipped into the Ansible tasks:

Images Install

If the rke2-images.linux-amd64.tar.zst file is found in the tarball_install/ directory then this playbook will use those images and not docker.io or a private registry.

rke2-ansible/roles/rke2_common/tasks/airgap.yml

Line 20 in 8751956

src: "{{ playbook_dir }}/tarball_install/rke2-images.linux-amd64.tar.gz"

rke2-ansible/roles/rke2_common/tasks/main.yml

Line 12 in 8751956

path: "{{ playbook_dir }}/tarball_install/rke2-images.linux-amd64.tar.gz"

Support registry configuration

User should be able to provide private registry configurations:

https://docs.rke2.io/install/containerd_registry_configuration/
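
For illustration, a sketch of the registries.yaml format described on the page linked above, which RKE2 reads from /etc/rancher/rke2/registries.yaml. The registry hostname, credentials, and CA path are placeholders:

```yaml
# /etc/rancher/rke2/registries.yaml (all values are placeholders)
mirrors:
  docker.io:
    endpoint:
      - "https://registry.example.com:5000"
configs:
  "registry.example.com:5000":
    auth:
      username: myuser
      password: mypassword
    tls:
      ca_file: /etc/ssl/certs/registry-ca.pem
```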

systemctl start fails causing fail on ansible run

In my manual attempts to deploy rke2, I often would get failure return codes from systemctl start .... Because of the way the restart logic is written the service appears to fail and restart several times during initialization. While that logic is beyond the scope of this project, I wonder if perhaps failures could be ignored during the start tasks in the role(s)?
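
Rather than ignoring failures outright, one option would be to retry the start a few times. A sketch, with arbitrary retry values and a hypothetical registered variable name:

```yaml
- name: Start rke2-server, retrying if systemctl reports a failure
  systemd:
    name: rke2-server
    state: started
    enabled: true
  register: rke2_start            # hypothetical variable name
  until: rke2_start is succeeded
  retries: 3                      # assumption: three attempts are enough
  delay: 30
```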

Remove ‘schedule:’ from GitHub actions

The CI tests are failing during scheduled runs and not when automatically run by PR and pushes.

Support cleaning up node data to remove them from cluster

Enable RPM Installs for Air-Gapped environments

Based on the feedback received regarding issue #86, I am attempting to move my air-gapped, selinux-enabled, rke2-ansible-based installation from using the tarball method to using the rpm method. In doing so, I encountered problems with task roles/rke2_common/tasks/rpm_install.yml due to its dependencies on the following internet URLs not typically available in air-gapped environments:

URL https://update.rke2.io/v1-release/channels - enables determination of the rke2 version to be installed
URL https://rpm.rancher.io/rke2 - defines yum repository rke2 baseurl prefix
URL https://rpm.rancher.io/public.key - specifies public key to access rke2 yum repositories

Proposed enhancement: Integrate additional, optional parameters rke_version and repo_baseurl_prefix into task rpm_install.yml.
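
A sketch of how the parameters might be exposed as role defaults. Only rke_version and repo_baseurl_prefix are named in the proposal; the other two variable names, the mirror hostname, and the version string format are hypothetical:

```yaml
# Role defaults (internet-connected case), using the URLs listed above
rke2_channel_url: https://update.rke2.io/v1-release/channels   # hypothetical variable name
repo_baseurl_prefix: https://rpm.rancher.io/rke2
rke2_gpg_key_url: https://rpm.rancher.io/public.key            # hypothetical variable name

# Possible group_vars override for an air-gapped site:
# rke_version: v1.20.7+rke2r2          # illustrative; set to skip the channel lookup
# repo_baseurl_prefix: https://mirror.example.internal/rke2
# rke2_gpg_key_url: https://mirror.example.internal/public.key
```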

WDYT of this proposed enhancement to the RPM installation method?

Adding Flux CD & SOPS

Heyho,

Awesome repo. I'm currently using it to set up a small cluster at work and will probably switch to it from my Terraformed RKE cluster.

I am a huge GitOps fan, so I added a Flux CD role to a fork of this repo (and I will add another one for injecting a secret for SOPS support in Flux).

I was wondering if you are interested in upstreaming this as an optional feature. Since this repo is a full playbook and not a role, this would also greatly help us/me maintain our fork :P

// Robert

Undefined object in "Add primary configuration items" task

I'm having trouble running the playbook on three Ubuntu 20.04 machines from an Ubuntu 20.04 host running Ansible 2.10.

The message I see, after running ansible-playbook -i inventory/my-cluster/hosts.ini site.yml -vvv, is:

TASK [rke2_common : Add primary configuration items] **************************************************************************************************************************************************************
task path: /home/jon/rke2-ansible/roles/rke2_common/tasks/config.yml:17
The full traceback is:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/ansible/executor/task_executor.py", line 585, in _execute
    self._task.post_validate(templar=templar)
  File "/usr/lib/python3/dist-packages/ansible/playbook/task.py", line 307, in post_validate
    super(Task, self).post_validate(templar)
  File "/usr/lib/python3/dist-packages/ansible/playbook/base.py", line 431, in post_validate
    value = templar.template(getattr(self, name))
  File "/usr/lib/python3/dist-packages/ansible/template/__init__.py", line 844, in template
    d[k] = self.template(
  File "/usr/lib/python3/dist-packages/ansible/template/__init__.py", line 798, in template
    result = self.do_template(
  File "/usr/lib/python3/dist-packages/ansible/template/__init__.py", line 1066, in do_template
    res = j2_concat(rf)
  File "<template>", line 12, in root
  File "/usr/lib/python3/dist-packages/ansible/template/__init__.py", line 264, in wrapper
    ret = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/ansible/plugins/filter/core.py", line 69, in to_nice_yaml
    transformed = yaml.dump(a, Dumper=AnsibleDumper, indent=indent, allow_unicode=True, default_flow_style=False, **kw)
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 290, in dump
    return dump_all([data], stream, Dumper=Dumper, **kwds)
  File "/usr/lib/python3/dist-packages/yaml/__init__.py", line 278, in dump_all
    dumper.represent(data)
  File "/usr/lib/python3/dist-packages/yaml/representer.py", line 27, in represent
    node = self.represent_data(data)
  File "/usr/lib/python3/dist-packages/yaml/representer.py", line 58, in represent_data
    node = self.yaml_representers[None](self, data)
  File "/usr/lib/python3/dist-packages/yaml/representer.py", line 231, in represent_undefined
    raise RepresenterError("cannot represent an object", data)
yaml.representer.RepresenterError: ('cannot represent an object', AnsibleUndefined)
fatal: [head]: FAILED! => {
    "changed": false
}

It seems to refer to line 17 of roles/rke2_common/tasks/config.yml, specifically it's looking for an rke2_config variable. A search on the repository doesn't show this defined anywhere. Am I missing something? Thanks!
