techno-tim / k3s-ansible

The easiest way to bootstrap a self-hosted High Availability Kubernetes cluster. A fully automated HA k3s etcd install with kube-vip, MetalLB, and more. Build. Destroy. Repeat.

Home Page: https://technotim.live/posts/k3s-etcd-ansible/

License: Apache License 2.0

Jinja 98.52% Shell 1.48%
k3s kubernetes metallb kube-vip etcd rancher k8s k3s-cluster high-availability

k3s-ansible's People

Contributors

acdoussan, angelgarzadev, automationxpert, balazshasprai, berendt, bmeach, chaddupuis, christhepcgeek, codyecsl, dependabot[bot], egandro, gereonvey, hdensity, ioagel, irish1986, johnnyrun, niki-on-github, pbolduc, samerbahri98, sharifm-informatica, sholdee, sleiner, slemmercs, snoopy82481, svartis, swaggaritz, thedatabaseme, theonejj, timothystewart6, vdovhanych


k3s-ansible's Issues

Error when installing on CentOS 7

The installation script stops at the step:

Init cluster inside the transient k3s-init service

with this error:

"Failed to create bus message: No such device or address"

Verify Reset Task Actually Removes the VIP

    I overlooked this change, but I think it is needed; otherwise the VIP is not removed and will cause errors if you try to run the playbook twice without rebooting. I took the easy route and just rebooted the nodes as part of the reset. You can test by pinging the VIP while resetting. https://github.com/techno-tim/k3s-ansible/pull/31

Originally posted by @timothystewart6 in #92 (comment)

Verify that all nodes actually joined - Debian

Expected Behavior

I run ansible-playbook site.yml -i inventory/k3s-cluster/hosts.ini and get a k3s-cluster deployed.

Current Behavior

Runs through the playbook until the verification step, which fails.

Steps to Reproduce

  1. Modify hosts.ini with IPs
  2. Add variable to all.yml: ansible_ssh_private_key_file: ~/.ssh/ansible
  3. Run site.yml using: ansible-playbook site.yml -i inventory/k3s-cluster/hosts.ini

Context (variables)

Operating system: Debian 11

Hardware: (CPU/RAM/Disk type)

  • Intel i5-8250U, 8GB, SSD
  • Intel i3-9100, 64GB, SSD
  • AMD Ryzen 5 PRO 1500, 24GB, SSD

Variables Used:

all.yml

k3s_version: v1.24.3+k3s1
ansible_user: ansible
ansible_ssh_private_key_file: ~/.ssh/ansible
systemd_dir: /etc/systemd/system
system_timezone: "Europe/Stockholm"
flannel_iface: "eth0"
apiserver_endpoint: "10.0.0.100"
k3s_token: "strawberrycake"
extra_server_args: "--disable servicelb --disable traefik"
extra_agent_args: ""
kube_vip_tag_version: "v0.5.0"
metal_lb_speaker_tag_version: "v0.13.4"
metal_lb_controller_tag_version: "v0.13.4"
metal_lb_ip_range: "10.0.0.80-10.0.0.90"

Hosts

host.ini

[master]
k3s-master-01
k3s-master-02
k3s-master-03

[node]
k3s-worker-01
k3s-worker-02
k3s-worker-03

[k3s_cluster:children]
master
node

Possible Solution

Switch to k3s Uninstall Script

We manage our own way of cleaning up nodes; however, we should rely on the k3s uninstall script that is placed on the node when k3s is installed.

https://rancher.com/docs/k3s/latest/en/installation/uninstall/

server nodes
/usr/local/bin/k3s-uninstall.sh

agent nodes
/usr/local/bin/k3s-agent-uninstall.sh

We will also want to be sure we can run this multiple times without failure (part of the reason it was extracted from this).

I think we can check whether the script exists before running it and assume, if it does not, that it's a clean system (see the sketch below). Not the best, but open to ideas.
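
A minimal sketch of that check-then-run flow, assuming the script paths from the k3s docs above (task names and placement are illustrative, not the playbook's actual tasks):

# Hedged sketch: only run the k3s uninstall script if it exists on the node.
- name: Check whether the k3s uninstall script exists
  ansible.builtin.stat:
    path: /usr/local/bin/k3s-uninstall.sh
  register: k3s_uninstall_script

- name: Run the k3s uninstall script (server nodes)
  ansible.builtin.command:
    cmd: /usr/local/bin/k3s-uninstall.sh
  when: k3s_uninstall_script.stat.exists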

main.yml issue: The error appears to be in '/home/xxxxx/k3s-ansible/roles/download/tasks/main.yml': line 3, column 3

The main.yml file doesn't seem to work anymore.

FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: float object has no element 1\n\nThe error appears to be in '/home/xxxx/k3s-ansible/roles/download/tasks/main.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: Download k3s binary x64\n ^ here\n"}

Appears to be looking for an undefined variable

Install was super quick

I used your cloud-init video to help create a better template on Proxmox. But if you use that image as-is, k3s runs out of space and stops without saying there is a problem. It wasn't until I SSH'd in and ran k3s check-config that I saw there was a disk space issue. Increasing the size of the disk wasn't a problem, but it was a step I had to take that wasn't listed. At that point I found that your reset was really useful.

Permissions on tmp folder

Need to verify this issue, but someone on Discord said that they had an error when running the playbook: the permissions on the folder for the first master were too strict and it could not apply the MetalLB manifest. They had to chown the directory to their ansible user.

We should also test that this folder is removed by ansible with the fix (verify it's gone)

Expected Behavior

should complete without error

Current Behavior

Errors out and you need to chown the tmp folder

Steps to Reproduce

  1. run the playbook
  2. error when deleting tmp folder.

Context (variables)

Operating system:

ubuntu 22.04.1 cloud-init image

Possible Solution
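
A minimal sketch of a possible fix, assuming the manifests are staged in a temporary directory on the first master (the path and task name here are hypothetical):

- name: Ensure the manifest staging directory is owned by the ansible user
  ansible.builtin.file:
    path: /tmp/k3s            # hypothetical staging path
    state: directory
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    mode: "0755"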

TASK [k3s/master : Copy vip manifest to first master] fails on Ubuntu 22.04/20.04

Expected Behavior

The task should proceed. I attempted this both on Ubuntu 22.04 and Ubuntu 20.04 with identical results. The task "Copy vip manifest to first master" fails on both, causing the VIP to never come up.

Current Behavior

TASK [k3s/master : Copy vip rbac manifest to first master] **************************************************************************************************************************************************************************************************
Wednesday 14 September 2022  03:57:30 +0000 (0:00:00.308)       0:00:08.720 *** 
skipping: [10.0.30.22]
skipping: [10.0.30.23]
ok: [10.0.30.21]

TASK [k3s/master : Copy vip manifest to first master] *******************************************************************************************************************************************************************************************************
Wednesday 14 September 2022  03:57:31 +0000 (0:00:00.566)       0:00:09.287 *** 
skipping: [10.0.30.22]
skipping: [10.0.30.23]
[WARNING]: an unexpected error occurred during Jinja2 environment setup: unable to locate collection ansible.utils
fatal: [10.0.30.21]: FAILED! => {"changed": false, "msg": "AnsibleError: template error while templating string: unable to locate collection ansible.utils. String: apiVersion: apps/v1\nkind: DaemonSet\nmetadata:\n  name: kube-vip-ds\n  namespace: kube-system\nspec:\n  selector:\n    matchLabels:\n      name: kube-vip-ds\n  template:\n    metadata:\n      labels:\n        name: kube-vip-ds\n    spec:\n      affinity:\n        nodeAffinity:\n          requiredDuringSchedulingIgnoredDuringExecution:\n            nodeSelectorTerms:\n            - matchExpressions:\n              - key: node-role.kubernetes.io/master\n                operator: Exists\n            - matchExpressions:\n              - key: node-role.kubernetes.io/control-plane\n                operator: Exists\n      containers:\n      - args:\n        - manager\n        env:\n        - name: vip_arp\n          value: \"true\"\n        - name: port\n          value: \"6443\"\n        - name: vip_interface\n          value: {{ flannel_iface }}\n        - name: vip_cidr\n          value: \"{{ apiserver_endpoint | ansible.utils.ipsubnet | ansible.utils.ipaddr('prefix') }}\"\n        - name: cp_enable\n          value: \"true\"\n        - name: cp_namespace\n          value: kube-system\n        - name: vip_ddns\n          value: \"false\"\n        - name: svc_enable\n          value: \"false\"\n        - name: vip_leaderelection\n          value: \"true\"\n        - name: vip_leaseduration\n          value: \"15\"\n        - name: vip_renewdeadline\n          value: \"10\"\n        - name: vip_retryperiod\n          value: \"2\"\n        - name: address\n          value: {{ apiserver_endpoint }}\n        image: ghcr.io/kube-vip/kube-vip:{{ kube_vip_tag_version }}\n        imagePullPolicy: Always\n        name: kube-vip\n        resources: {}\n        securityContext:\n          capabilities:\n            add:\n            - NET_ADMIN\n            - NET_RAW\n            - SYS_TIME\n      hostNetwork: true\n      serviceAccountName: kube-vip\n      tolerations:\n      - effect: NoSchedule\n        operator: Exists\n      - effect: NoExecute\n        operator: Exists\n  updateStrategy: {}\nstatus:\n  currentNumberScheduled: 0\n  desiredNumberScheduled: 0\n  numberMisscheduled: 0\n  numberReady: 0\n"}

TASK [k3s/master : Copy metallb namespace to first master] **************************************************************************************************************************************************************************************************
Wednesday 14 September 2022  03:57:31 +0000 (0:00:00.106)       0:00:09.393 *** 
skipping: [10.0.30.22]
skipping: [10.0.30.23]

Steps to Reproduce

  1. Deploy 5 brand new Ubuntu 22.04/Ubuntu 20.04 VMs using the Ubuntu cloud images & cloud-init.
  2. Configure the appropriate ansible options.
  3. Run the ansible playbook.

Context (variables)

Operating system: Ubuntu 22.04 & Ubuntu 20.04

Hardware: Proxmox VMs running on AMD EPYC CPUs, CPU in host passthrough.

Variables Used

all.yml

k3s_version: v1.25.0+k3s1
# this is the user that has ssh access to these machines
ansible_user: ubuntu
systemd_dir: /etc/systemd/system

system_timezone: "America/Chicago"

flannel_iface: "eth0"

apiserver_endpoint: "10.0.30.40"

k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

extra_server_args: >-
  {{ extra_args }}
  --disable servicelb
  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

kube_vip_tag_version: "v0.5.0"

metal_lb_speaker_tag_version: "v0.13.5"
metal_lb_controller_tag_version: "v0.13.5"

metal_lb_ip_range: "10.0.30.50-10.0.30.60"

Hosts

host.ini

[master]
10.0.30.21
10.0.30.22
10.0.30.23

[node]
10.0.30.31
10.0.30.32

[k3s_cluster:children]
master
node
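
The warning "unable to locate collection ansible.utils" suggests the ansible.utils collection is not installed on the Ansible control node. A hedged sketch of a Galaxy requirements file that would pull it in (the file name is an assumption; it would be installed with ansible-galaxy collection install -r requirements.yml):

# requirements.yml (assumed name)
collections:
  - name: ansible.utils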

How to include Rancher GUI and Longhorn

Hi and thanks for a great script!

I am trying to run this on a single workstation for testing purposes... and I have some questions:

In hosts.ini should I remove the node section?

[master]
localhost ansible_connection=local

[node]
localhost ansible_connection=local

[k3s_cluster:children]
master
node

I fail to install Rancher and Longhorn...

curl -#L https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add rancher-latest https://releases.rancher.com/server-charts/latest
helm repo add jetstack https://charts.jetstack.io
kubectl apply -f https://github.com/jetstack/cert-manager/releases/download/v1.6.1/cert-manager.crds.yaml

# helm install jetstack
helm upgrade -i cert-manager jetstack/cert-manager --namespace cert-manager --create-namespace

helm upgrade -i rancher rancher-latest/rancher --create-namespace --namespace cattle-system --set hostname=rancher.localhost --set bootstrapPassword=mysecretpass --set replicas=1

What am I doing wrong?
Can Rancher and Longhorn be included in the script?
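
Rancher and Longhorn are not part of this playbook, but for illustration, k3s's bundled Helm controller can install a chart from a plain manifest. A hedged sketch for Longhorn (the chart repo URL is the upstream default; namespace handling is an assumption and the target namespace may need to exist first):

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: longhorn
  namespace: kube-system
spec:
  repo: https://charts.longhorn.io
  chart: longhorn
  targetNamespace: longhorn-system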

MetalLB AddressPool and L2Advertisement are not applied

When I run the ansible-playbook, the metallb address pool and l2advertisement are not applied. I had to create these manifest files and apply them to fix my issue:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 10.10.10.10-10.10.10.20
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system

Expected Behavior

I would expect all manifests in the Ansible roles to be applied.

Current Behavior

I think all but the address pool and l2advertisement get applied; not sure if this applies to other manifests as well. It might be related to the latest change where manifests are removed as part of the cleanup. But I am new to this ansible-playbook so I cannot verify that.

Steps to Reproduce

  1. Add and configure my_cluster/hosts.ini
  2. Add and configure my_cluster/group_vars/all.yml
  3. Run bash deploy.sh to execute the ansible playbook.
  4. Wait till finished
  5. Check kubectl get IPAddressPool -A and kubectl get L2Advertisement -A to see that neither exists.

Context (variables)

Operating system:

3x Ubuntu 22.04 Server

Hardware:

3x Raspberry Pi 4B

Variables Used

all.yml

k3s_version: v1.24.3+k3s1
# this is the user that has ssh access to these machines
ansible_user: devantler
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "Europe/Copenhagen"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.10.10.0"

# k3s_token is required  masters can talk together securely
# this token should be alpha numeric only
k3s_token: "i-will-keep-this-secret"

# change these to your liking, the only required one is--disable servicelb
extra_server_args: "--disable servicelb --disable traefik"
extra_agent_args: ""

# image tag for kube-vip
kube_vip_tag_version: "v0.5.0"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.4"
metal_lb_controller_tag_version: "v0.13.4"

# metallb ip range for load balancer
metal_lb_ip_range: "10.10.10.10-10.10.10.20"

Hosts

host.ini

[master]
10.10.10.1
10.10.10.2
10.10.10.3

[k3s_cluster:children]
master

This might also be related: I want an HA setup with three nodes, so all my nodes should be masters. Is this supported? If not, can it be supported?

Purpose of ansible_user not obvious

In the all.yml file that defines all the variables to use, it wasn't immediately obvious to me what ansible_user was for. At first I thought it was the user I am running Ansible as, which didn't make sense, so I commented it out. I have my control plane and nodes set up in my ~/.ssh/config, so the playbook was able to install everything, but the kubeconfig file was being copied to nowhere. That was my clue. So maybe clarify that ansible_user is the user you SSH into the masters and nodes as; on an Ubuntu box, it may be ubuntu.

`TASK [k3s/post : Wait for MetalLB resources]` fails when updating metallb

Expected Behavior

running ansible-playbook should update the cluster and run all tasks successfully

Current Behavior

When running ansible-playbook for updating the cluster, it fails on TASK [k3s/post : Wait for MetalLB resources]

Steps to Reproduce

  1. try to update the cluster with ansible-playbook site.yml -i inventory/my-cluster/hosts.ini
  2. wait for ansible to update everything
  3. metallb leaves previous replicaset
  4. task Wait for MetalLB resources fails due to > 1 replicaset in metallb-system namespace

Context (variables)

Operating system: Ubuntu 22.04.1 LTS

Hardware: 2x Raspberry Pi 4b 8GB

Variables Used

all.yml

---
k3s_version: v1.24.6+k3s1
# this is the user that has ssh access to these machines
ansible_user: pirate
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "Europe/Prague"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.10.10"

# k3s_token is required  masters can talk together securely
# this token should be alpha numeric only
k3s_token: ''

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.5"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.6"
metal_lb_controller_tag_version: "v0.13.6"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.10.11-192.168.30.49"

Hosts

host.ini

[master]
192.168.10.8
192.168.10.9

[node]

[k3s_cluster:children]
master
node

Possible Solution

Delete the old ReplicaSets. Once there are no old ReplicaSets, the task finishes correctly.

Maybe there could be a task that deletes the old ReplicaSets after MetalLB is updated or changed. Right now they are left behind with everything scaled to 0.

Why it is happening

I have identified the issue: ReplicaSets are left behind after a MetalLB update, and the wait task expects to find only one when checking, so it fails.

Task fails with this error.

TASK [k3s/post : Wait for MetalLB resources] *********************************************************************************************************************************************************
Saturday 22 October 2022  14:45:19 +0200 (0:00:02.281)       0:02:06.666 ****** 
ok: [192.168.10.8] => (item=controller)
ok: [192.168.10.8] => (item=webhook service)
ok: [192.168.10.8] => (item=pods in replica sets)
failed: [192.168.10.8] (item=ready replicas of controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "replicaset", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for=jsonpath={.status.readyReplicas}=1", "--timeout=120s"], "delta": "0:00:00.688827", "end": "2022-10-22 14:45:26.583497", "item": {"condition": "--for=jsonpath='{.status.readyReplicas}'=1", "description": "ready replicas of controller", "resource": "replicaset", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2022-10-22 14:45:25.894670", "stderr": "readyReplicas is not found\nreadyReplicas is not found", "stderr_lines": ["readyReplicas is not found", "readyReplicas is not found"], "stdout": "replicaset.apps/controller-5888676bc9 condition met", "stdout_lines": ["replicaset.apps/controller-5888676bc9 condition met"]}
failed: [192.168.10.8] (item=fully labeled replicas of controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "replicaset", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for=jsonpath={.status.fullyLabeledReplicas}=1", "--timeout=120s"], "delta": "0:00:00.605612", "end": "2022-10-22 14:45:28.071205", "item": {"condition": "--for=jsonpath='{.status.fullyLabeledReplicas}'=1", "description": "fully labeled replicas of controller", "resource": "replicaset", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2022-10-22 14:45:27.465593", "stderr": "fullyLabeledReplicas is not found\nfullyLabeledReplicas is not found", "stderr_lines": ["fullyLabeledReplicas is not found", "fullyLabeledReplicas is not found"], "stdout": "replicaset.apps/controller-5888676bc9 condition met", "stdout_lines": ["replicaset.apps/controller-5888676bc9 condition met"]}
failed: [192.168.10.8] (item=available replicas of controller) => {"ansible_loop_var": "item", "changed": false, "cmd": ["k3s", "kubectl", "wait", "replicaset", "--namespace=metallb-system", "--selector=component=controller,app=metallb", "--for=jsonpath={.status.availableReplicas}=1", "--timeout=120s"], "delta": "0:00:01.894136", "end": "2022-10-22 14:45:30.815935", "item": {"condition": "--for=jsonpath='{.status.availableReplicas}'=1", "description": "available replicas of controller", "resource": "replicaset", "selector": "component=controller,app=metallb"}, "msg": "non-zero return code", "rc": 1, "start": "2022-10-22 14:45:28.921799", "stderr": "availableReplicas is not found\navailableReplicas is not found", "stderr_lines": ["availableReplicas is not found", "availableReplicas is not found"], "stdout": "replicaset.apps/controller-5888676bc9 condition met", "stdout_lines": ["replicaset.apps/controller-5888676bc9 condition met"]}
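
A hedged alternative to deleting the old ReplicaSets by hand would be to wait on the Deployment instead, since a Deployment's rollout status tracks only its current ReplicaSet. This sketch assumes MetalLB's controller Deployment is named controller and that kubectl is invoked as k3s kubectl; it is not the playbook's actual task:

- name: Wait for the MetalLB controller rollout to finish
  ansible.builtin.command:
    cmd: >-
      k3s kubectl rollout status deployment/controller
      --namespace=metallb-system --timeout=120s
  changed_when: false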

CI - Address Failures

Expected Behavior

CI should not fail as often

Current Behavior

CI seems to fail quite a bit, and I think this is just a side effect of how slow the VMs are in GitHub.

I see a few possibilities here:

  • More retries within tests
    • While this might solve some failures, it won't solve all of them.
  • Retries for the whole pipeline
    • This might require bringing in a retry action so we can retry the whole failed job (see the sketch below).
  • Self-hosted runners
    • This would allow us to have higher-performance nodes than what GitHub offers. I have some free compute and could host some if it comes to this.
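
A hedged sketch of a pipeline-level retry using a third-party retry action (the action, its version, and the test command are assumptions, not the repository's actual workflow):

# .github/workflows fragment, illustrative only
- name: Run the test job with retries
  uses: nick-fields/retry@v2          # assumption: community retry action
  with:
    timeout_minutes: 60
    max_attempts: 3
    command: make test                # hypothetical test entry point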

reset.yml: Task "Remove linux-modules-extra-raspi" fails on non-debian OS (missing apt-get)

Expected Behavior

Running the "reset.yml" playbook should not fail on non-debian/non-ubuntu operating systems.

Current Behavior

It fails on RHEL because it tries to run the apt-get package manager, which is not present on this OS.

TASK [reset : Remove linux-modules-extra-raspi] ********************************************************************************************************************************************************************************************
Wednesday 14 September 2022  13:59:04 +0200 (0:00:02.331)       0:01:33.070 *** 
[WARNING]: Updating cache and auto-installing missing dependency: python3-apt
fatal: [example.com]: FAILED! => {"changed": false, "cmd": "apt-get update", "msg": "[Errno 2] No such file or directory: b'apt-get': b'apt-get'", "rc": 2, "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************
example.com : ok=22   changed=9    unreachable=0    failed=1    skipped=4    rescued=0    ignored=0   

Steps to Reproduce

ansible-playbook reset.yml -i inventory/my-env/hosts.ini

Context

Operating system:
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: RedHatEnterprise
Description: Red Hat Enterprise Linux release 8.4 (Ootpa)
Release: 8.4
Codename: Ootpa
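
A hedged sketch of guarding the apt-only task so non-apt systems skip it (the task body is illustrative; the real task lives in the reset role):

- name: Remove linux-modules-extra-raspi
  ansible.builtin.apt:
    name: linux-modules-extra-raspi
    state: absent
  when: ansible_facts['pkg_mgr'] == 'apt'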

Changes for using on Raspberry Pi OS 64-bit (Bullseye)

I am trying to use this repo to create a cluster of Raspberry Pis installed with the newest Raspberry Pi OS 64-bit (Bullseye).

Expected Behavior

The first problem is that Raspbian is not detected in roles/raspberrypi/tasks/main.yml.
After fixing that, the iptables and ip6tables tasks in Raspbian.yml still do not work, because iptables is not installed on Debian 11.

Current Behavior

Steps to Reproduce

  1. Install Raspberry Pi OS 64-bit (Bullseye) on all nodes
  2. Run deploy.sh

Context (variables)

Operating system:
Raspberry Pi OS 64-bit (Bullseye)

Variables Used:

k3s_version: "v1.23.4+k3s1"
ansible_user: NA
systemd_dir: /etc/systemd/system

flannel_iface: "eth0"

apiserver_endpoint: "192.168.42.42"

k3s_token: "NA"

extra_server_args: "--no-deploy servicelb --write-kubeconfig-mode 644 --kube-apiserver-arg default-not-ready-toleration-seconds=30 --kube-apiserver-arg default-unreachable-toleration-seconds=30 --kube-controller-arg node-monitor-period=20s --kube-controller-arg node-monitor-grace-period=20s --kubelet-arg node-status-update-frequency=5s"
extra_agent_args: "--kubelet-arg node-status-update-frequency=5s"

kube_vip_tag_version: "v0.4.2"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.30.23-192.168.30.33"

Possible Solution

Insert this task in roles/raspberrypi/tasks/main.yml at ~line 40:

- name: Set detected_distribution to Raspbian (ARM64 on Debian Bullseye)
  set_fact:
    detected_distribution: Raspbian
  when:
    - ansible_facts.architecture is search("aarch64")
    - raspberry_pi|default(false)
    - ansible_facts.lsb.description|default("") is match("Debian.*bullseye")

Copy roles/raspberrypi/tasks/prereq/Raspbian.yml to roles/raspberrypi/tasks/prereq/Raspbian-11.yml and insert this task in Raspbian-11.yml in second position:

- name: Install iptables on Bullseye
  apt: name=iptables state=present

MetalLB fails to deploy

Expected Behavior

MetalLB is auto-deployed via k3s

Current Behavior

metallb.configmap.yml is never deployed, which causes the MetalLB pods to fail, and k3s tries to redeploy.

Steps to Reproduce

  1. run ansible-playbook ./playbooks/site.yml -i ./inventory/tower-of-power/hosts.ini -K
  2. k3s is deployed and accessible through kube-vip (192.168.0.90)
  3. Check the deployment states of MetalLB and see a CrashLoopBackOff.
  4. The error states there is a missing ConfigMap.
  5. Check the ConfigMaps for the namespace; config is not present.
  6. k3s terminates the namespace and retries the deployment.

Context (variables)

Operating system:
Raspberry pi OS (64-bit)

Hardware:
a mixture of Pi 3s and Pi 4s (the master node is a Pi 4 with 4 GB)

Variables Used:

all.yml

k3s_version: "v1.24.2+k3s2"
ansible_user: NA
systemd_dir: "/etc/systemd/system"

flannel_iface: "wlan0"

apiserver_endpoint: "192.168.0.190"

k3s_token: "NA"

extra_server_args: "--no-deploy servicelb --no-deploy traefik --write-kubeconfig-mode 644 --kube-apiserver-arg default-not-ready-toleration-seconds=30 --kube-apiserver-arg default-unreachable-toleration-seconds=30 --kube-controller-arg node-monitor-period=20s --kube-controller-arg node-monitor-grace-period=20s --kubelet-arg node-status-update-frequency=5s"
extra_agent_args: "--kubelet-arg node-status-update-frequency=5s"

kube_vip_tag_version: "v0.4.4"

metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

metal_lb_ip_range: "192.168.0.180-192.168.0.189"

Hosts

host.ini

[master]
192.168.0.200

[node]
192.168.0.201
192.168.0.202
192.168.0.203
192.168.0.204

[k3s_cluster:children]
master
node

Some extra logs:
Here is the metallb-system controller pod log; after I get this error, the whole namespace is terminated:

{"branch":"HEAD","caller":"level.go:63","commit":"v0.12.1","goversion":"gc / go1.16.14 / arm64","level":"info","msg":"MetalLB controller starting version 0.12.1 (commit v0.12.1, branch HEAD)","ts":"2022-07-12T21:26:20.75386972Z","version":"0.12.1"}

{"caller":"level.go:63","level":"info","msg":"secret succesfully created","op":"CreateMlSecret","ts":"2022-07-12T21:26:20.95196698Z"}

{"caller":"level.go:63","event":"stateSynced","level":"info","msg":"controller synced, can allocate IPs now","ts":"2022-07-12T21:26:21.053274581Z"}

{"caller":"level.go:63","configmap":"metallb-system/config","event":"configLoaded","level":"info","msg":"config (re)loaded","ts":"2022-07-12T21:26:27.930190506Z"}

{"caller":"level.go:63","configmap":"metallb-system/config","error":"no MetalLB configuration in cluster","level":"error","msg":"configuration is missing, MetalLB will not function","op":"setConfig","ts":"2022-07-12T21:26:28.151674635Z"} 

Stream closed EOF for metallb-system/controller-7476b58756-kfp52 (controller) 

It also seems to cause some issues with kube-vip, and I randomly lose connection through k9s and kubectl when the namespace is terminated by k3s.

I will try to get more information, but I am still a bit new to Kubernetes, so any suggestions on how to debug this would be appreciated.
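
For reference, MetalLB 0.12.x reads its configuration from a ConfigMap named config in the metallb-system namespace. A hedged sketch using the address range from the variables above (whether the playbook version in use actually templates and applies this file is the open question in this issue):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.0.180-192.168.0.189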

Unstable VIP PING & K3S too new for Rancher

This link gets a 404, but I did go through the k3s troubleshooting checklist:
#20

I was able to get this working with an older release.
With the most current I get:

unstable ping to the VIP IP
k3s needs to be < v1.24 for Rancher

Expected Behavior

The VIP endpoint ping should be stable.
Helm should be able to deploy Rancher with the documented commands.

Current Behavior

You can only ping the VIP/API IP intermittently, so get nodes and Helm deployments are hit or miss.
The Rancher deployment will fail because k3s is > 1.24.

Steps to Reproduce

  1. Deploy with the code base whose all.yml commits are from May 26, 2022: 3 etcd control-plane nodes and 5 workers as Proxmox VMs (across three physical nodes).
  2. Deploy Longhorn in the default namespace, shared with the workers (just learning the process).
  3. Deploy Minecraft; everything was stable for weeks, though it kept over-scheduling pods onto one node.
  4. Back up, take down Minecraft and Longhorn.
  5. Reset.
  6. git pull on 8/1/22 and use the latest, which has changes to all.yml, main.yml, the MetalLB configmap, MetalLB IPAddressPool, the MetalLB YAMLs, vip rbac, and the vip YAML.
  7. Add 3 VMs dedicated to Longhorn; verify Ansible can apt update, install the Proxmox guest agent, and SSH in without a password.
  8. Add the 3 extra IPs under nodes (3 control, 5 workers, 3 workers for Longhorn only).
  9. Deploy with the 8/1/22 version.

Context (variables)

Operating system: Ubuntu 22

Hardware: 2x dual-Xeon nodes (48 threads, 256 GB RAM), 1x one-litre node with a 10th-gen i5 and 64 GB RAM

Variables Used:

I didn't alter these, other than adding my own token and IPs; they are what is listed in the repo.

all.yml

k3s_version: "1.24"
ansible_user: NA
systemd_dir: ""

flannel_iface: ""

apiserver_endpoint: ""

k3s_token: "NA"

extra_server_args: ""
extra_agent_args: ""

kube_vip_tag_version: ""

metal_lb_speaker_tag_version: ""
metal_lb_controller_tag_version: ""

metal_lb_ip_range: ""

Hosts

host.ini

[master]
IP.ADDRESS.ONE
IP.ADDRESS.TWO
IP.ADDRESS.THREE

[node]
IP.ADDRESS.FOUR
IP.ADDRESS.FIVE

[k3s_cluster:children]
master
node

Possible Solution

It seems to deploy OK; something just isn't quite the same when it comes to VIP/API IP access.
From what I can tell, the version changes are very particular.

I tried taking various nodes offline one at a time; this didn't really help in any repeatable way, so I don't think it's any one node/VM or the physical networking.

I tried just going to an older k3s, which might have helped Rancher, but it didn't help kube-vip/MetalLB.
I'll try to see if I can find a kube-vip or MetalLB log file (but I don't really know kube-vip or MetalLB).

For now, I'm going to try reverting the pull request and deploying with the older stack.

update errors

Expected Behavior

Run the playbook; it either updates the cluster's k3s/MetalLB/kube-vip install or installs it fresh.

Current Behavior

Master nodes seem fine, but agents get the following error when starting:
level=fatal msg="parse "https://{{": invalid character "{" in host name"

Steps to Reproduce

  1. setup servers
  2. run script
  3. wait
  4. look at logs

Context (variables)

Operating system:

Ubuntu 22.04 Jammy

Hardware:

KVM virtual cluster

Variables Used

---
k3s_version: v1.24.4+k3s1
# this is the user that has ssh access to these machines
ansible_user: darkcloud
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/Chicago"

# interface which will be used for flannel
flannel_iface: "ens18"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.86.189"

# k3s_token is required  masters can talk together securely
# this token should be alpha numeric only
k3s_token: "mysupersecretstuff"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# change these to your liking, the only required one is--no-deploy servicelb
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik
  --kube-apiserver-arg default-not-ready-toleration-seconds=30
  --kube-apiserver-arg default-unreachable-toleration-seconds=30
  --kube-controller-arg node-monitor-period=20s

extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.0"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.5"
metal_lb_controller_tag_version: "v0.13.5"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.86.150-192.168.86.180"

Hosts

host.ini

[master]
192.168.86.190
192.168.86.191
192.168.86.192
192.168.86.197

[node]
192.168.86.193
192.168.86.194
192.168.86.195
192.168.86.196

[k3s_cluster:children]
master
node

Possible Solution

Support Adding Nodes Later?

Should we support adding nodes later using this playbook?

Considerations:

  • While it might seem convenient to add nodes later using this playbook, it complicates the playbook both in code and in testing.
  • This playbook was originally created to help bootstrap k3s clusters; everything after that was up to the operator to configure.
  • If we did do this, would we make it (kind of) idempotent, where we would target the same cluster and add nodes?
  • This might work as-is without any changes at all by just adding additional nodes to your host.ini (needs testing).

YAML Linter

Add YAML lint as a pre-commit hook to help commits fail fast, rather than in CI
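
A hedged sketch of what that hook could look like (the pinned rev is a placeholder):

# .pre-commit-config.yaml, illustrative only
repos:
  - repo: https://github.com/adrienverge/yamllint
    rev: v1.28.0            # placeholder version
    hooks:
      - id: yamllint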

Unable to **verify** masters when deploying to Ubuntu on x86

I have two physical Intel machines running Proxmox which have the following VMs:

  • NUC i7
    • plane-1 - master
    • plane-3 - master
    • worker-1 - worker
  • Intel Denverton
    • plane-2 - master
    • worker-2 - worker

All VMs are running Ubuntu and use a template modelled on TechnoTim's recent video.

Expected Behavior

I run ./deploy from the Proxmox host (as the root user) and it should set up the cluster.

Current Behavior

I run the ansible script from my Proxmox server and it fails on the Verification step (even after 20 attempts):

- name: Verification
  block:
    - name: Verify that all nodes actually joined (check k3s-init.service if this fails)
      command:
        cmd: k3s kubectl get nodes -l "node-role.kubernetes.io/master=true" -o=jsonpath="{.items[*].metadata.name}"
      register: nodes
      until: nodes.rc == 0 and (nodes.stdout.split() | length) == (groups['master'] | length)
      retries: 20
      delay: 10
      changed_when: false

I can stop the retry loop, which ensures that the transient process is not shut down by Ansible. When I go into each of the three masters I can see that:

  1. The transient process is running.

  2. Running k3s kubectl get nodes (in any variant) fails because nothing is listening on port 8080, which is why the above step fails.

  3. When you look for listeners on port 6443 on the master VMs, you'll find that there is indeed a service listening there.

  4. I have validated all k3s config files with k3s check-config and they are all fine.

  5. All files expected to be copied to /var/lib/rancher/k3s/server have indeed been copied.

  6. The filesystem has plenty of space on all VMs.

  7. I checked the /var/lib/rancher/k3s/server/red/api-server.kubeconfig file to see how it was set up, and it is indeed clear that the listening port is 6443.

Steps to Reproduce

  1. use the deploy script
    - this runs ansible-playbook site.yml -i inventory/venice/hosts.ini --private-key ~/.ssh/id_rsa -v --user ken
    - I also tried runs where I added --ask-become-pass --ask-pass

Context (variables)

OS

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

all packages are up-to-date

Virtualization

Proxmox 7.1.10

Variables Used:

k3s_version: v1.23.4+k3s1
# this is the user that has ssh access to these machines
ansible_user: ken
systemd_dir: /etc/systemd/system

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.100.200"

# k3s_token is required  masters can talk together securely
# this token should be alpha numeric only
k3s_token: "xxxxx-xxxxx-xxxxx-xxxxx"

# change these to your liking, the only required one is--no-deploy servicelb
# extra_server_args: "--no-deploy servicelb --no-deploy traefik --write-kubeconfig-mode 644 --kube-api-server-arg default-not-ready-toleration-seconds=30 --kube-apiserver-arg default-unreachable-toleration-seconds=30 --kube-controller-arg node-monitor-period=20s --kube-controller-arg node-monitor-grace-period=20s --kubelet-arg node-status-update-frequency=5s"
# extra_agent_args: "--kubelet-arg node-status-update-frequency=5s"

# image tag for kube-vip
kube_vip_tag_version: "v0.4.3"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.100.80-192.168.100.89"

All of these were run with the base configuration in the repo, but I also tried removing extra_server_args and extra_agent_args from all.yml; that seemed to have zero impact.

The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"

Hello!
First of all, thank you for this guide!
I have learned so many new things, but now I am stuck and do not know how to proceed.

I am trying to get it up and running on Raspberry Pis: 3 masters and 4 workers.

Error message after launching the playbook:

fatal: k3s-m3]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:01.384371", "end": "2022-03-31 22:13:11.580652", "msg": "non-zero return code", "rc": 1, "start": "2022-03-31 22:13:10.196281", "stderr": "The connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

fatal: [k3s-m2]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:01.599374", "end": "2022-03-31 22:13:21.948957", "msg": "non-zero return code", "rc": 1, "start": "2022-03-31 22:13:20.349583", "stderr": "The connection to the server localhost:8080 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server localhost:8080 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

fatal: [k3s-m1]: FAILED! => {"attempts": 20, "changed": false, "cmd": ["k3s", "kubectl", "get", "nodes", "-l", "node-role.kubernetes.io/master=true", "-o=jsonpath={.items[*].metadata.name}"], "delta": "0:00:01.694997", "end": "2022-03-31 22:14:06.756075", "msg": "non-zero return code", "rc": 1, "start": "2022-03-31 22:14:05.061078", "stderr": "The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?", "stderr_lines": ["The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?"], "stdout": "", "stdout_lines": []}

What can I do? Where do I have to look? What could the error be?

Thanks for any help that you can give me.

VIP address is not reachable

First, I had to add quotes to the Jinja variables:

- name: Configure kubectl cluster to https://"{{ apiserver_endpoint | ansible.utils.ipwrap }}":6443
  command: >-
    k3s kubectl config set-cluster default
      --server=https://"{{ apiserver_endpoint | ansible.utils.ipwrap }}":6443
      --kubeconfig ~"{{ ansible_user }}"/.kube/config

because the step failed with:

TASK [k3s/master : Configure kubectl cluster to https://{{ apiserver_endpoint | ansible.utils.ipwrap }}:6443] ************************************************************************************************************************************************
Wednesday 14 September 2022  19:24:28 +0300 (0:00:00.659)       0:02:17.051 *** 
fatal: [192.168.192.8]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "config", "set-cluster", "default", "--server=https://{{", "apiserver_endpoint", "|", "ansible.utils.ipwrap", "}}:6443", "--kubeconfig", "/home/{{", "ansible_user", "}}/.kube/config"], "delta": "0:00:00.196304", "end": "2022-09-14 19:24:29.395461", "msg": "non-zero return code", "rc": 1, "start": "2022-09-14 19:24:29.199157", "stderr": "error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]", "stderr_lines": ["error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]"], "stdout": "Set a cluster entry in kubeconfig.\n\n Specifying a name that already exists will merge new fields on top of existing values for those fields.\n\nExamples:\n  # Set only the server field on the e2e cluster entry without touching other values\n  kubectl config set-cluster e2e --server=https://1.2.3.4\n  \n  # Embed certificate authority data for the e2e cluster entry\n  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt\n  \n  # Disable cert checking for the e2e cluster entry\n  kubectl config set-cluster e2e --insecure-skip-tls-verify=true\n  \n  # Set custom TLS server name to use for validation for the e2e cluster entry\n  kubectl config set-cluster e2e --tls-server-name=my-cluster-name\n  \n  # Set proxy url for the e2e cluster entry\n  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4\n\nOptions:\n    --embed-certs=false:\n\tembed-certs for the cluster entry in kubeconfig\n\n    --proxy-url='':\n\tproxy-url for the cluster entry in kubeconfig\n\nUsage:\n  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]\n\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).", "stdout_lines": ["Set a cluster entry in kubeconfig.", "", " Specifying a name that already exists will merge new fields on top of existing values for those fields.", "", "Examples:", "  # Set only the server field on the e2e cluster entry without touching other values", "  kubectl config set-cluster e2e --server=https://1.2.3.4", "  ", "  # Embed certificate authority data for the e2e cluster entry", "  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt", "  ", "  # Disable cert checking for the e2e cluster entry", "  kubectl config set-cluster e2e --insecure-skip-tls-verify=true", "  ", "  # Set custom TLS server name to use for validation for the e2e cluster entry", "  kubectl config set-cluster e2e --tls-server-name=my-cluster-name", "  ", "  # Set proxy url for the e2e cluster entry", "  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4", "", "Options:", "    --embed-certs=false:", "\tembed-certs for the cluster entry in kubeconfig", "", "    --proxy-url='':", "\tproxy-url for the cluster entry in kubeconfig", "", "Usage:", "  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]", "", "Use \"kubectl options\" for a list of global command-line options (applies to all commands)."]}
fatal: [192.168.192.7]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "config", "set-cluster", "default", "--server=https://{{", "apiserver_endpoint", "|", "ansible.utils.ipwrap", "}}:6443", "--kubeconfig", "/home/{{", "ansible_user", "}}/.kube/config"], "delta": "0:00:00.164204", "end": "2022-09-14 19:24:29.422157", "msg": "non-zero return code", "rc": 1, "start": "2022-09-14 19:24:29.257953", "stderr": "error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]", "stderr_lines": ["error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]"], "stdout": "Set a cluster entry in kubeconfig.\n\n Specifying a name that already exists will merge new fields on top of existing values for those fields.\n\nExamples:\n  # Set only the server field on the e2e cluster entry without touching other values\n  kubectl config set-cluster e2e --server=https://1.2.3.4\n  \n  # Embed certificate authority data for the e2e cluster entry\n  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt\n  \n  # Disable cert checking for the e2e cluster entry\n  kubectl config set-cluster e2e --insecure-skip-tls-verify=true\n  \n  # Set custom TLS server name to use for validation for the e2e cluster entry\n  kubectl config set-cluster e2e --tls-server-name=my-cluster-name\n  \n  # Set proxy url for the e2e cluster entry\n  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4\n\nOptions:\n    --embed-certs=false:\n\tembed-certs for the cluster entry in kubeconfig\n\n    --proxy-url='':\n\tproxy-url for the cluster entry in kubeconfig\n\nUsage:\n  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]\n\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).", "stdout_lines": ["Set a cluster entry in kubeconfig.", "", " Specifying a name that already exists will merge new fields on top of existing values for those fields.", "", "Examples:", "  # Set only the server field on the e2e cluster entry without touching other values", "  kubectl config set-cluster e2e --server=https://1.2.3.4", "  ", "  # Embed certificate authority data for the e2e cluster entry", "  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt", "  ", "  # Disable cert checking for the e2e cluster entry", "  kubectl config set-cluster e2e --insecure-skip-tls-verify=true", "  ", "  # Set custom TLS server name to use for validation for the e2e cluster entry", "  kubectl config set-cluster e2e --tls-server-name=my-cluster-name", "  ", "  # Set proxy url for the e2e cluster entry", "  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4", "", "Options:", "    --embed-certs=false:", "\tembed-certs for the cluster entry in kubeconfig", "", "    --proxy-url='':", "\tproxy-url for the cluster entry in kubeconfig", "", "Usage:", "  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]", "", "Use \"kubectl options\" for a list of global command-line options (applies to all commands)."]}
fatal: [192.168.192.9]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "config", "set-cluster", "default", "--server=https://{{", "apiserver_endpoint", "|", "ansible.utils.ipwrap", "}}:6443", "--kubeconfig", "/home/{{", "ansible_user", "}}/.kube/config"], "delta": "0:00:00.240263", "end": "2022-09-14 19:24:29.459260", "msg": "non-zero return code", "rc": 1, "start": "2022-09-14 19:24:29.218997", "stderr": "error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]", "stderr_lines": ["error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443 ansible_user }}/.kube/config]"], "stdout": "Set a cluster entry in kubeconfig.\n\n Specifying a name that already exists will merge new fields on top of existing values for those fields.\n\nExamples:\n  # Set only the server field on the e2e cluster entry without touching other values\n  kubectl config set-cluster e2e --server=https://1.2.3.4\n  \n  # Embed certificate authority data for the e2e cluster entry\n  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt\n  \n  # Disable cert checking for the e2e cluster entry\n  kubectl config set-cluster e2e --insecure-skip-tls-verify=true\n  \n  # Set custom TLS server name to use for validation for the e2e cluster entry\n  kubectl config set-cluster e2e --tls-server-name=my-cluster-name\n  \n  # Set proxy url for the e2e cluster entry\n  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4\n\nOptions:\n    --embed-certs=false:\n\tembed-certs for the cluster entry in kubeconfig\n\n    --proxy-url='':\n\tproxy-url for the cluster entry in kubeconfig\n\nUsage:\n  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]\n\nUse \"kubectl options\" for a list of global command-line options (applies to all commands).", "stdout_lines": ["Set a cluster entry in kubeconfig.", "", " Specifying a name that already exists will merge new fields on top of existing values for those fields.", "", "Examples:", "  # Set only the server field on the e2e cluster entry without touching other values", "  kubectl config set-cluster e2e --server=https://1.2.3.4", "  ", "  # Embed certificate authority data for the e2e cluster entry", "  kubectl config set-cluster e2e --embed-certs --certificate-authority=~/.kube/e2e/kubernetes.ca.crt", "  ", "  # Disable cert checking for the e2e cluster entry", "  kubectl config set-cluster e2e --insecure-skip-tls-verify=true", "  ", "  # Set custom TLS server name to use for validation for the e2e cluster entry", "  kubectl config set-cluster e2e --tls-server-name=my-cluster-name", "  ", "  # Set proxy url for the e2e cluster entry", "  kubectl config set-cluster e2e --proxy-url=https://1.2.3.4", "", "Options:", "    --embed-certs=false:", "\tembed-certs for the cluster entry in kubeconfig", "", "    --proxy-url='':", "\tproxy-url for the cluster entry in kubeconfig", "", "Usage:", "  kubectl config set-cluster NAME [--server=server] [--certificate-authority=path/to/certificate/authority] [--insecure-skip-tls-verify=true] [--tls-server-name=example.com] [options]", "", "Use \"kubectl options\" for a list of global command-line options (applies to all commands)."]}

After this the role ended successfully, but I'm still not able to ping the VIP IP from any of the nodes, and I don't see it in the ip a output.

Here is my inventory hosts.ini, yes, I don't need separate worker nodes:

[master]
192.168.192.7
192.168.192.8
192.168.192.9

[node]

[k3s_cluster:children]
master
node

and all.yml:

---
k3s_version: v1.24.4+k3s1
# this is the user that has ssh access to these machines
ansible_user: les
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "Europe/Moscow"

# interface which will be used for flannel
#flannel_iface: "eth0"
flannel_iface: "ens33"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "192.168.192.222"

# k3s_token is required  masters can talk together securely
# this token should be alpha numeric only
k3s_token: "super-puper-k3s-token-that-i-should-not-show-to-anybody"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required one is --disable servicelb
extra_server_args: >-
  {{ extra_args }}
  --disable servicelb
  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.0"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.5"
metal_lb_controller_tag_version: "v0.13.5"

# metallb ip range for load balancer
metal_lb_ip_range: "192.168.192.80-192.168.192.90"

Document Playbook Upgrade Steps

There are times when we make modifications to the template which can cause previously generated variables to fail.

  • Document steps in README.md for upgrades (e.g. git pull, create new template, copy over, etc...)

Document Cluster Upgrade Steps

This playbook also works for upgrading a cluster and all of its components.

  • Document steps in README.md on how to upgrade your cluster

K3s Service stuck at activating on nodes

k3s-node.service does not start and is stuck on activating.

Expected Behavior

k3s-node.service should be running on the nodes.

Current Behavior

`TASK [k3s/master : Create crictl symlink] ******************************************************************************changed: [us-ga-cluster-02]
changed: [us-ga-cluster-01]
changed: [us-ga-cluster-03]

PLAY [node] ************************************************************************************************************
TASK [Gathering Facts] *************************************************************************************************ok: [us-ga-worker-01]
ok: [us-ga-worker-02]

TASK [k3s/node : Copy K3s service file] ********************************************************************************changed: [us-ga-worker-01]
changed: [us-ga-worker-02]

TASK [k3s/node : Enable and check K3s service] *************************************************************************`

`ubuntu@us-ga-worker-01:~$ sudo systemctl status k3s-node.service
● k3s-node.service - Lightweight Kubernetes
Loaded: loaded (/etc/systemd/system/k3s-node.service; enabled; vendor preset: enabled)
Active: activating (start) since Fri 2022-05-20 15:49:17 EDT; 53s ago
Docs: https://k3s.io
Process: 2513 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 2522 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 2527 (k3s-agent)
Tasks: 8
Memory: 31.2M
CGroup: /system.slice/k3s-node.service
└─2527 /usr/local/bin/k3s agent

May 20 15:49:17 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:17-04:00" level=info msg="Starting k3s agent v1.23.6+k3s1 (418c3fa8)"
May 20 15:49:17 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:17-04:00" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [10.20.0.46:6443]"
May 20 15:49:23 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:23-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51760->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:29 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:29-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51768->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:35 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:35-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51776->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:42 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:42-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51784->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:48 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:48-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51792->127.0.0.1:6444: read: connection reset by peer"
May 20 15:49:54 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:49:54-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51800->127.0.0.1:6444: read: connection reset by peer"
May 20 15:50:00 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:50:00-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51808->127.0.0.1:6444: read: connection reset by peer"
May 20 15:50:06 us-ga-worker-01 k3s[2527]: time="2022-05-20T15:50:06-04:00" level=error msg="failed to get CA certs: Get "https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:51816->127.0.0.1:6444: read: connection reset by peer"
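The agent's local load balancer is forwarding 127.0.0.1:6444 to 10.20.0.46:6443 (the apiserver_endpoint from the variables below), so a hedged first check is whether that VIP is actually up and answering, e.g.:

# from the worker: does the API answer on the kube-vip address?
curl -vk https://10.20.0.46:6443/cacerts
# on each master: is the VIP bound to the flannel interface?
ip addr show eth0 | grep 10.20.0.46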

Steps to Reproduce

  1. ansible-playbook playbooks/site.yml -K

Context (variables)

Operating system:
ubuntu 20.04
Hardware:
Proxmox VM

Variables Used:

all.yml

k3s_version: v1.23.6+k3s1
# this is the user that has ssh access to these machines
ansible_user: ubuntu
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "America/New_York"

# interface which will be used for flannel
flannel_iface: "eth0"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.20.0.46"

# k3s_token is required so that masters can talk together securely
# this token should be alpha numeric only
k3s_token: "xxxxxxxxxxxxxxxxx"

# change these to your liking, the only required one is --no-deploy servicelb
extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

# image tag for kube-vip
kube_vip_tag_version: "v0.4.3"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

# metallb ip range for load balancers
metal_lb_ip_range: "10.20.0.100-10.20.0.120"

Hosts

host.ini

[master]
us-ga-cluster-01
us-ga-cluster-02
us-ga-cluster-03

[node]
us-ga-worker-01
us-ga-worker-02

[k3s_cluster:children]
master
node

Possible Solution

Master apiserver_endpoint IP not assigned

Expected Behavior

The IP address provided as apiserver_endpoint should get assigned to the master node as a vIP. Nodes can then register against this IP as the Kubernetes API.

Current Behavior

The installation fails at the step TASK [k3s/master : Verify that all nodes actually joined (check k3s-init.service if this fails)] ***
I do not see the IP provided as apiserver_endpoint being assigned anywhere.
Do I need a multi-master setup to get a vIP assigned?

Steps to Reproduce

  1. Setup 3 Ubuntu 20.04 VMs
  2. Assign 1 VM as Master and 2 VMs as Nodes
  3. Assign an IP in the same subnet as the 3 VMs as apiserver_endpoint
  4. Run the playbook as documented

Context (variables)

Operating system: Ubuntu 20.04

Hardware: VMs on AMD Ryzen 7 4900G on a Proxmox hypervisor

Variables Used:

all.yml

k3s_version: "v1.23.4+k3s1"
ansible_user: NA
systemd_dir: "/etc/systemd/system"

flannel_iface: "eth0"

apiserver_endpoint: "10.0.0.5"

k3s_token: "NA"

extra_server_args: "--no-deploy servicelb --no-deploy traefik"
extra_agent_args: ""

kube_vip_tag_version: "v0.4.4"

metal_lb_speaker_tag_version: "v0.12.1"
metal_lb_controller_tag_version: "v0.12.1"

metal_lb_ip_range: "10.0.0.150-10.0.0.200"

Hosts

host.ini

[master]
master01.dc.lab.domain.tld

[node]
node01.dc.lab.domain.tld
node02.dc.lab.domain.tld

[k3s_cluster:children]
master
node

Possible Solution

Since the vIP for the master node does not get assigned, the nodes cannot register with a master, so the Kubernetes cluster stays a single-master cluster.

tls-san problem

new install ends with an error:

fatal: [k3s-master-3]: FAILED! => {"changed": true, "cmd": ["k3s", "kubectl", "config", "set-cluster", "default", "--server=https://{{", "apiserver_endpoint", "|", "ansible.utils.ipwrap", "}}:6443", "--kubeconfig", "~root/.kube/config"], "delta": "0:00:00.286011", "end": "2022-10-28 07:47:02.626092", "msg": "non-zero return code", "rc": 1, "start": "2022-10-28 07:47:02.340081", "stderr": "error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443]", "stderr_lines": ["error: Unexpected args: [default apiserver_endpoint | ansible.utils.ipwrap }}:6443]"], "stdout": "<usage/help text for 'kubectl config set-cluster' omitted>", "stdout_lines": ["<usage/help text for 'kubectl config set-cluster' omitted>"]}

Context (variables)

Operating system: ubuntu 20.04

Hardware: proxmox QEMU VM

Variables Used

all.yml

---
k3s_version: v1.24.6+k3s1
# this is the user that has ssh access to these machines
ansible_user: root
systemd_dir: /etc/systemd/system

# Set your timezone
system_timezone: "Europe/Berlin"

# interface which will be used for flannel
flannel_iface: "ens18"

# apiserver_endpoint is virtual ip-address which will be configured on each master
apiserver_endpoint: "10.1.0.200"

# k3s_token is required so that masters can talk together securely
# this token should be alpha numeric only
k3s_token: "secret"

# The IP on which the node is reachable in the cluster.
# Here, a sensible default is provided, you can still override
# it for each of your hosts, though.
k3s_node_ip: '{{ ansible_facts[flannel_iface]["ipv4"]["address"] }}'

# Disable the taint manually by setting: k3s_master_taint = false
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"

# these arguments are recommended for servers as well as agents:
extra_args: >-
  --flannel-iface={{ flannel_iface }}
  --node-ip={{ k3s_node_ip }}

# change these to your liking, the only required ones are: --disable servicelb, --tls-san {{ apiserver_endpoint }}
extra_server_args: >-
  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
#  --disable traefik
extra_agent_args: >-
  {{ extra_args }}

# image tag for kube-vip
kube_vip_tag_version: "v0.5.5"

# image tag for metal lb
metal_lb_speaker_tag_version: "v0.13.6"
metal_lb_controller_tag_version: "v0.13.6"

# metallb ip range for load balancer
metal_lb_ip_range: "10.1.0.230-10.1.0.250"

What is this new tls-san variable?

What does it do, and why has it recently become necessary?
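For context (my understanding, not an authoritative answer): --tls-san adds extra Subject Alternative Names to the K3s API server certificate, so that connections through the kube-vip virtual IP (apiserver_endpoint) pass TLS verification. A quick way to check that the SAN is present, assuming the VIP from the variables above (10.1.0.200):

# inspect the serving certificate's SANs as presented on the VIP
openssl s_client -connect 10.1.0.200:6443 </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"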

Adding support for Calico CNI

I think it could be interesting to propose an alternative CNI to flannel, like Calico.
What do you think, guys?

I did a PoC already and the implementation is not really hard.

Feature request: skip MetalLB installation if Traefik is enabled?

I modified the scripts so that MetalLB (which I don't want) won't be installed if Traefik is enabled:

diff --git a/roles/prereq/tasks/main.yml b/roles/prereq/tasks/main.yml
index dcab613..354497b 100644
--- a/roles/prereq/tasks/main.yml
+++ b/roles/prereq/tasks/main.yml
@@ -1,4 +1,9 @@
 ---
+- name: Set install_metal_lb fact to true if Traefik is disabled
+  set_fact:
+    install_metal_lb: true
+  when: extra_server_args is search('--disable traefik')
+
 - name: Set same timezone on every Server
   timezone:
     name: "{{ system_timezone }}"

Then I added the following clause to all the tasks related to metallb:

  when: install_metal_lb|default(false)

It's a dirty solution, but it works for me. I could create a pull request, but I was wondering if this is the right approach? My assumption is that MetalLB and Traefik are mutually exclusive, but maybe I'm wrong and such an approach doesn't make sense...
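For illustration, the guard on a MetalLB-related task would look roughly like this (the task name and manifest path below are placeholders, not the playbook's actual task):

- name: Deploy MetalLB manifest   # placeholder task name
  ansible.builtin.command:
    cmd: k3s kubectl apply -f /tmp/metallb.yaml   # placeholder path
  when: install_metal_lb | default(false)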

Support Single Node Installs

We should support single node installs, installing all roles and all items on one node. This gives someone the capability of running k3s on one node, with the flexibility of growing later.
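A single-node inventory would presumably just reuse the existing hosts.ini layout with one host in [master] and an empty [node] group (the IP below is illustrative):

[master]
192.168.30.38

[node]

[k3s_cluster:children]
master
node

Note that, as the taint issue further down points out, the master taint has to resolve to false in this case so workloads can actually schedule on that node.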

Modular and Iterative Installation

The K3S-Ansible setup has a very monolithic approach in its current design, making it an all-or-nothing proposition. I think it would be more usable and less error-prone if it were designed to increase deployment complexity gradually and incrementally:

  1. Install K3s cluster with a single master
  2. Install Etcd and support multiple masters
  3. Add kube-vip and Metal-LB support

The benefits of such gradual setup and modularity are:

  1. Users can choose to stop at 1, if they don't really need 2 and 3, or stop at 2, if they don't need 3
  2. The setup run would be easier to debug, because you have smaller steps, and after each step you get something working. For instance, with the current design, if the MetalLB setup or config fails, I don't even get a working K3s cluster, because the MetalLB installation happens way too early. That adds unnecessary fragility, in my opinion.

Thank you

⚖️ Maintain Lineage vs. Keeping this repo clean

I am having a hard time balancing maintaining lineage (this repo is a fork of a branch of a fork) against making contributing easy. While it is possible, it's really going to be challenging for PRs and other contributions. I am considering dropping the lineage to the other repos while still giving them credit in the README.md.

Installation Bugs

Following this guide:

https://docs.technotim.live/posts/k3s-etcd-ansible/

The following issues were noted:

First create the VMS with this doc:

https://docs.technotim.live/posts/cloud-init-cloud-image/

Issue 1:
qm create 8000 --memory 2048 --core 2 --name ubuntu-cloud --net0 virtio,bridge=vmbr0

This creates a VM that doesn't really have enough memory to do much once K3s is running. Recommend changing this to 4GB.

Issue 2:

qm importdisk 8000 focal-server-cloudimg-amd64.img local-lvm

The disk that is created is too small to install K3s. The disk size needs to be increased; I used 32GB.

Issue 3:

DO NOT START YOUR VM

Add here that you need to configure cloud-init settings in the GUI. I missed this step the first time.

Issue 4:

You may want to start your VM, as I did, to add the Proxmox VM agent. When a VM is started it gets issued a machine ID.
Prior to shutting down the VM, the machine ID needs to be reset.

BUG:

You suggest:


sudo rm -f /etc/machine-id
sudo rm -f /var/lib/dbus/machine-id

This will break your Ubuntu installation! Do not delete /etc/machine-id; truncate it with:

truncate -s 0 /etc/machine-id /var/lib/dbus/machine-id

The machine-id file does not get created if it is deleted, but it will get populated with a new value if it is empty. If the machine-id is not present, networking will fail to start.

Reference: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1508766

Returning to creating the K3s cluster with Ansible:

Issue 1:

cp -R k3s-ansible/inventory/sample inventory/my-cluster

There needs to be a step where you create a directory for your files. The inventory folder is not present until you create it.
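In other words, something like the following (paths taken from the guide's own commands):

mkdir -p inventory
cp -R k3s-ansible/inventory/sample inventory/my-cluster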

Issue 2:

Edit inventory/my-cluster/group_vars/all.yml

When editing this file, if you change to the latest version of K3s by changing this line:

k3s_version: v1.23.4+k3s1

It will break. v1.24 appears not to work; any thoughts?

Suggestion:

Prior to running:

ansible-playbook ./playbooks/reset.yml -i ./inventory/my-cluster/hosts.ini

Run:

ansible all -m ping -i ./inventory/my-cluster/hosts.ini

This is because a fresh install of Ansible appears to fail to connect until you have agreed to adding each host to known_hosts.
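If you'd rather not answer the known_hosts prompt interactively, a hedged alternative is to pre-accept the host keys or relax checking for the first run (standard OpenSSH/Ansible mechanisms, not something this playbook configures; the IP is illustrative):

# pre-populate known_hosts for each node
ssh-keyscan -H 192.168.30.38 >> ~/.ssh/known_hosts
# or, for a throwaway lab run only, skip host key checking
ANSIBLE_HOST_KEY_CHECKING=False ansible all -m ping -i ./inventory/my-cluster/hosts.ini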

The download-boxes.sh script fails in CI on new forks

Expected Behavior

I expect that the test.yaml pipeline can execute on forked repositories without error.

Current Behavior

The test.yaml pipeline fails when calling the download-boxes.sh script.

Steps to Reproduce

  1. Fork the repository.
  2. Run the pipelines

Context (variables)

Operating system: MacOS

Hardware: M1

Variables Used

None as it happens in CI.

Hosts

None as it happens during CI.

Repro

Check my fork for a failed action. There is not much info to gain from it, but it does prove that it fails on fresh forks: https://github.com/devantler/k3s-ansible/actions

Possible Solution

I think it might be related to caching: if you have not cached any Vagrant boxes, it tries to download them with the script.

Node taints enabled by default on a cluster that consists of multiple masters and zero nodes

Context

I have a cluster that consist of master nodes only:

hosts.ini

[master]
server1.mydomain.com
server2.mydomain.com
server3.mydomain.com

[node]

[k3s_cluster:children]
master
node

Expected Behavior

Node taints shouldn't be enabled if there are only master nodes. All the master nodes should be able to accept workloads.

Current Behavior

Taints are added to all master nodes, and since there are no regular nodes, it's not possible to run any workload on the cluster.

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints --no-headers
server1.mydomain.com   [map[effect:NoExecute key:CriticalAddonsOnly value:true]]
server2.mydomain.com   [map[effect:NoExecute key:CriticalAddonsOnly value:true]]
server3.mydomain.com   [map[effect:NoExecute key:CriticalAddonsOnly value:true]]

The reason lies in one of the recent commits (4acbe91b6c3a25a8f693df7c11789e1947842425) that added this variable:

k3s_single_node: "{{ 'true' if groups['k3s_cluster'] | length == 1 else 'false' }}"
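A hedged sketch of a condition that leaves masters schedulable when there are no dedicated agents; this mirrors the k3s_master_taint default shown in an all.yml earlier in this list, and is not necessarily the exact fix the maintainers shipped:

# only taint masters when at least one agent node exists
k3s_master_taint: "{{ true if groups['node'] | default([]) | length >= 1 else false }}"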

incorrectly validates setup on x86 hosts

I have two physical Intel machines running Proxmox which have the following VMs:

  • NUC i7
    • plane-1 - master
    • plane-3 - master
    • worker-1 - worker
  • Intel Denverton
    • plane-2 - master
    • worker-2 - worker

All VMs are running Ubuntu and use a template modelled on TechnoTim's recent video.

I run the ansible script from my Proxmox server and it fails on the Verification step (even after 20 attempts):

- name: Verification
  block:
    - name: Verify that all nodes actually joined (check k3s-init.service if this fails)
      command:
        cmd: k3s kubectl get nodes -l "node-role.kubernetes.io/master=true" -o=jsonpath="{.items[*].metadata.name}"
      register: nodes
      until: nodes.rc == 0 and (nodes.stdout.split() | length) == (groups['master'] | length)
      retries: 20
      delay: 10
      changed_when: false

I can stop the retry loop (which ensures that the transient process is not shut down by Ansible), and when I go into each of the three masters I can see that:

  1. the transient process is running
  2. running k3s kubectl get nodes (in any variant) will fail, as nothing is listening on port 8080, which is why the above step fails
  3. when you look for listeners on port 6443 on the master VMs, you'll find that there is indeed a service listening there
  4. I have validated all k3s config files with k3s check-config and they are all fine
  5. All files expected to be copied to /var/lib/rancher/k3s/server have indeed been copied

All of these were run with the base configuration in the repo, but I also tried removing extra_server_args and extra_agent_args from all.yml, and that seemed to have zero impact.
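When this verification loop fails, the hint in the task name is usually the fastest route: look at the transient k3s-init service on each master and run the same command the playbook runs (generic commands, not part of the playbook). The "port 8080" symptom in particular often just means kubectl was run without access to the K3s kubeconfig, hence the sudo below; that is a hunch, not a diagnosis.

# logs of the transient bootstrap service started by the playbook
sudo journalctl -u k3s-init.service -f
# the verification command itself
sudo k3s kubectl get nodes -l "node-role.kubernetes.io/master=true" \
  -o=jsonpath="{.items[*].metadata.name}"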

Better Distro Support

We claim to support CentOS in the README.md, but we're lacking the bandwidth and/or tests to fully vet it. Debian-based distros have more use and support.

If we're going to continue to support it I think the following things might help:

  • Remove any distro-specific commands (apt) in favor of the generic ansible.builtin.package module (see the sketch after this list)
  • Tests per distribution, or mixed-distro tests; the easy step is mixing distros, but that might come with its own challenges

(open to suggestions)
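As a minimal sketch of the distro-agnostic approach from the first bullet (the package name is just an example):

- name: Install prerequisites with the distro's own package manager
  ansible.builtin.package:
    name: curl          # example package
    state: present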

Reboot on Reset

Since nodes are in an odd state after a reset and the VIP stays up, we should just reboot all nodes after a reset. This will solve #17

Error during playbook execution

Ansible playbook for deploying k3s fails to complete

Expected Behavior

K3s deploys master and agent nodes based on all.yml and hosts.ini, with kube-vip, without error.

Current Behavior

During execution of the ansible-playbook run, the "Configure kubectl cluster" step fails. The error can be viewed here:
https://pastebin.com/ag3kUgWW
The kube-vip VIP also doesn't respond to ping, and kube-vip doesn't seem to deploy.

Steps to Reproduce

1. Create Ubuntu 22.04.1 servers using cloud-init on Proxmox
2. Configure all.yml and hosts.ini
3. Run the requirements playbook
4. Run the k3s playbook

Context (variables)

Operating system: Ubuntu server 22.04.1

Hardware: Proxmox running on:
MB: Supermicro x10DRL-i
CPU: 2x Intel Xeon E5-2660 v4
RAM: 126GiB ECC RAM

VMs:
SeaBIOS
i440fx machine
20GiB root disk
28 CPUs
8GiB RAM

Variables Used

all.yml

k3s_version: "v1.24.6+k3s1"
ansible_user: NA
systemd_dir: "/etc/systemd/system"

flannel_iface: "eth0"

apiserver_endpoint: "192.168.10.50"

k3s_token: "NA"

extra_server_args: "  {{ extra_args }}
  {{ '--node-taint node-role.kubernetes.io/master=true:NoSchedule' if k3s_master_taint else '' }}
  --tls-san {{ apiserver_endpoint }}
  --disable servicelb
  --disable traefik"
extra_agent_args: "  {{ extra_args }}"

kube_vip_tag_version:  "v0.5.5"

metal_lb_speaker_tag_version: "v0.13.6"
metal_lb_controller_tag_version: "v0.13.6"

metal_lb_ip_range: "192.168.10.40-192.168.10.49"

Hosts

host.ini

[master]
192.168.10.51
192.168.10.52
192.168.10.53

[node]
192.168.10.57
192.168.10.58
192.168.10.59

[k3s_cluster:children]
master
node

Possible Solution

It seems kube-vip is not deploying properly
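A couple of hedged first checks on the kube-vip side (generic commands; a plain grep is used because label selectors differ between kube-vip versions, and the VIP is the apiserver_endpoint above):

# is kube-vip actually running in kube-system?
kubectl get pods -n kube-system -o wide | grep kube-vip
# does the VIP answer at all?
ping -c 3 192.168.10.50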

Document ansible-galaxy collection install

This playbook also works for upgrading a cluster and all of its components.

  • Document the ansible-galaxy collection install ansible.utils step in README.md
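Beyond a README note, a collections requirements file is one common way to make this reproducible (the file location below is an assumption, not something the repo necessarily ships):

# collections/requirements.yml (assumed location)
collections:
  - name: ansible.utils

It can then be installed with ansible-galaxy collection install -r collections/requirements.yml.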
