Describe the bug
At seemingly random intervals the Neutron OVN metadata agent crashes; once it has crashed, it enters a crash loop and fails to return to a running state.
To Reproduce
I can't reproduce the error at this time.
Expected behavior
The Neutron OVN metadata agent shouldn't crash.
Kubernetes Version
Currently running v1.28.6. This issue has also been seen in v1.26.10.
POD Description
# kubectl --namespace openstack describe pod neutron-ovn-metadata-agent-default-nrmnk
Name: neutron-ovn-metadata-agent-default-nrmnk
Namespace: openstack
Priority: 0
Service Account: neutron-ovn-metadata-agent
Node: 935822-compute03-ospcv2-dfw.openstack.local/172.28.232.123
Start Time: Fri, 26 Jan 2024 02:43:15 +0000
Labels: application=neutron
component=ovn-metadata-agent
controller-revision-hash=6b4fc69764
pod-template-generation=14
release_group=neutron
Annotations: configmap-bin-hash: 3b17fadc4799090e9f5d65201d90080ae322cff710e2b448a8f9d2c92555a57d
configmap-etc-hash: 05ac75f51498078f0e5f26737411dca523197d952546c1c5d9926732d1d7d7a8
openstackhelm.openstack.org/release_uuid:
Status: Running
IP: 172.28.232.123
IPs:
IP: 172.28.232.123
Controlled By: DaemonSet/neutron-ovn-metadata-agent-default
Init Containers:
init:
Container ID: containerd://9e3aeb65d0f234b6e93298ec960c940a5cbe1a3b88c61ed2b5a05cf7d61b1ee7
Image: quay.io/airshipit/kubernetes-entrypoint:v1.0.0
Image ID: sha256:c092d0dada614fdae3920939c5a9683b2758288f23c2e3b425128653857d7520
Port: <none>
Host Port: <none>
Command:
kubernetes-entrypoint
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 26 Jan 2024 02:43:15 +0000
Finished: Fri, 26 Jan 2024 02:43:17 +0000
Ready: True
Restart Count: 0
Environment:
POD_NAME: neutron-ovn-metadata-agent-default-nrmnk (v1:metadata.name)
NAMESPACE: openstack (v1:metadata.namespace)
INTERFACE_NAME: eth0
PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
DEPENDENCY_SERVICE: openstack:nova-metadata,openstack:neutron-server
DEPENDENCY_DAEMONSET:
DEPENDENCY_CONTAINER:
DEPENDENCY_POD_JSON:
DEPENDENCY_CUSTOM_RESOURCE:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
neutron-metadata-agent-init:
Container ID: containerd://646c5854dc030c26473f090cc62a3b85e33badd985990b90addb67ec6098c383
Image: docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
Image ID: docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
Port: <none>
Host Port: <none>
Command:
/tmp/neutron-metadata-agent-init.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 26 Jan 2024 02:43:18 +0000
Finished: Fri, 26 Jan 2024 02:43:18 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 100m
memory: 64Mi
Environment:
NEUTRON_USER_UID: 42424
Mounts:
/etc/neutron/neutron.conf from neutron-etc (ro,path="neutron.conf")
/tmp from pod-tmp (rw)
/tmp/neutron-metadata-agent-init.sh from neutron-bin (ro,path="neutron-metadata-agent-init.sh")
/var/lib/neutron/openstack-helm from socket (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
ovn-neutron-init:
Container ID: containerd://177e528248223bd182edb2c6be832085a1164eb2f5a9ca018319a1227fa8b249
Image: docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
Image ID: docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
Port: <none>
Host Port: <none>
Command:
/tmp/neutron-ovn-init.sh
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 26 Jan 2024 02:43:19 +0000
Finished: Fri, 26 Jan 2024 02:43:19 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 100m
memory: 64Mi
Environment: <none>
Mounts:
/tmp from pod-tmp (rw)
/tmp/neutron-ovn-init.sh from neutron-bin (ro,path="neutron-ovn-init.sh")
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
Containers:
neutron-ovn-metadata-agent:
Container ID: containerd://dd4814d89b8446c913e22a8121062bae3d49206fd71fb89d6c9806c7e611140b
Image: docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
Image ID: docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
Port: <none>
Host Port: <none>
Command:
/tmp/neutron-ovn-metadata-agent.sh
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 26 Jan 2024 03:17:01 +0000
Finished: Fri, 26 Jan 2024 03:17:14 +0000
Ready: False
Restart Count: 11
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 100m
memory: 64Mi
Liveness: exec [python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/ovn_metadata_agent.ini --liveness-probe] delay=120s timeout=580s period=600s #success=1 #failure=3
Readiness: exec [python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/ovn_metadata_agent.ini] delay=30s timeout=185s period=190s #success=1 #failure=3
Environment:
RPC_PROBE_TIMEOUT: 60
RPC_PROBE_RETRIES: 2
Mounts:
/etc/neutron/logging.conf from neutron-etc (ro,path="logging.conf")
/etc/neutron/neutron.conf from neutron-etc (ro,path="neutron.conf")
/etc/neutron/ovn_metadata_agent.ini from neutron-etc (ro,path="ovn_metadata_agent.ini")
/etc/neutron/plugins/ml2/ml2_conf.ini from neutron-etc (ro,path="ml2_conf.ini")
/etc/neutron/rootwrap.conf from neutron-etc (ro,path="rootwrap.conf")
/etc/neutron/rootwrap.d/debug.filters from neutron-etc (ro,path="debug.filters")
/etc/neutron/rootwrap.d/dhcp.filters from neutron-etc (ro,path="dhcp.filters")
/etc/neutron/rootwrap.d/dibbler.filters from neutron-etc (ro,path="dibbler.filters")
/etc/neutron/rootwrap.d/ebtables.filters from neutron-etc (ro,path="ebtables.filters")
/etc/neutron/rootwrap.d/ipset-firewall.filters from neutron-etc (ro,path="ipset-firewall.filters")
/etc/neutron/rootwrap.d/iptables-firewall.filters from neutron-etc (ro,path="iptables-firewall.filters")
/etc/neutron/rootwrap.d/l3.filters from neutron-etc (ro,path="l3.filters")
/etc/neutron/rootwrap.d/linuxbridge-plugin.filters from neutron-etc (ro,path="linuxbridge-plugin.filters")
/etc/neutron/rootwrap.d/netns-cleanup.filters from neutron-etc (ro,path="netns-cleanup.filters")
/etc/neutron/rootwrap.d/openvswitch-plugin.filters from neutron-etc (ro,path="openvswitch-plugin.filters")
/etc/neutron/rootwrap.d/privsep.filters from neutron-etc (ro,path="privsep.filters")
/etc/sudoers.d/kolla_neutron_sudoers from neutron-etc (ro,path="neutron_sudoers")
/run from run (rw)
/run/netns from host-run-netns (rw)
/tmp from pod-tmp (rw)
/tmp/health-probe.py from neutron-bin (ro,path="health-probe.py")
/tmp/neutron-ovn-metadata-agent.sh from neutron-bin (ro,path="neutron-ovn-metadata-agent.sh")
/var/lib/neutron from pod-var-neutron (rw)
/var/lib/neutron/openstack-helm from socket (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
pod-tmp:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
pod-var-neutron:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
run:
Type: HostPath (bare host directory volume)
Path: /run
HostPathType:
neutron-bin:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: neutron-bin
Optional: false
neutron-etc:
Type: Secret (a volume populated by a Secret)
SecretName: neutron-ovn-metadata-agent-default
Optional: false
socket:
Type: HostPath (bare host directory volume)
Path: /var/lib/neutron/openstack-helm
HostPathType:
host-run-netns:
Type: HostPath (bare host directory volume)
Path: /run/netns
HostPathType:
kube-api-access-vqjjt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: openstack-network-node=enabled
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 35m default-scheduler Successfully assigned openstack/neutron-ovn-metadata-agent-default-nrmnk to 935822-compute03-ospcv2-dfw.openstack.local
Normal Pulled 35m kubelet Container image "quay.io/airshipit/kubernetes-entrypoint:v1.0.0" already present on machine
Normal Created 35m kubelet Created container init
Normal Started 35m kubelet Started container init
Normal Pulled 35m kubelet Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
Normal Created 35m kubelet Created container neutron-metadata-agent-init
Normal Started 35m kubelet Started container neutron-metadata-agent-init
Normal Pulled 35m kubelet Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
Normal Created 35m kubelet Created container ovn-neutron-init
Normal Started 35m kubelet Started container ovn-neutron-init
Normal Pulled 34m (x4 over 35m) kubelet Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
Normal Created 34m (x4 over 35m) kubelet Created container neutron-ovn-metadata-agent
Normal Started 34m (x4 over 35m) kubelet Started container neutron-ovn-metadata-agent
Warning BackOff 35s (x151 over 35m) kubelet Back-off restarting failed container neutron-ovn-metadata-agent in pod neutron-ovn-metadata-agent-default-nrmnk_openstack(b12830e7-4c00-4562-a035-a31aab324ea3)
POD Logs
2024-01-26 03:17:06.670 321 INFO neutron.common.config [-] Logging enabled!
2024-01-26 03:17:06.670 321 INFO neutron.common.config [-] /var/lib/openstack/bin/neutron-ovn-metadata-agent version 22.1.1.dev14
2024-01-26 03:17:06.842 321 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:07.288 330 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:07.292 329 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:13.761 321 INFO neutron.agent.ovn.metadata.agent [-] Cleaning up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 namespace which is not needed anymore
2024-01-26 03:17:14.186 321 CRITICAL neutron [-] Unhandled error: OSError: [Errno 22] failed to open netns
2024-01-26 03:17:14.186 321 ERROR neutron Traceback (most recent call last):
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/bin/neutron-ovn-metadata-agent", line 8, in <module>
2024-01-26 03:17:14.186 321 ERROR neutron sys.exit(main())
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
2024-01-26 03:17:14.186 321 ERROR neutron metadata_agent.main()
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata_agent.py", line 42, in main
2024-01-26 03:17:14.186 321 ERROR neutron agt.start()
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 334, in start
2024-01-26 03:17:14.186 321 ERROR neutron self.sync()
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 65, in wrapped
2024-01-26 03:17:14.186 321 ERROR neutron return f(*args, **kwargs)
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 407, in sync
2024-01-26 03:17:14.186 321 ERROR neutron self.teardown_datapath(self._get_datapath_name(ns))
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 454, in teardown_datapath
2024-01-26 03:17:14.186 321 ERROR neutron ip.garbage_collect_namespace()
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 267, in garbage_collect_namespace
2024-01-26 03:17:14.186 321 ERROR neutron if self.namespace_is_empty():
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 262, in namespace_is_empty
2024-01-26 03:17:14.186 321 ERROR neutron return not self.get_devices()
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 179, in get_devices
2024-01-26 03:17:14.186 321 ERROR neutron devices = privileged.get_device_names(self.namespace)
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 642, in get_device_names
2024-01-26 03:17:14.186 321 ERROR neutron in get_link_devices(namespace, **kwargs)]
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 333, in wrapped_f
2024-01-26 03:17:14.186 321 ERROR neutron return self(f, *args, **kw)
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 423, in __call__
2024-01-26 03:17:14.186 321 ERROR neutron do = self.iter(retry_state=retry_state)
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
2024-01-26 03:17:14.186 321 ERROR neutron return fut.result()
2024-01-26 03:17:14.186 321 ERROR neutron File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
2024-01-26 03:17:14.186 321 ERROR neutron return self.__get_result()
2024-01-26 03:17:14.186 321 ERROR neutron File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2024-01-26 03:17:14.186 321 ERROR neutron raise self._exception
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 426, in __call__
2024-01-26 03:17:14.186 321 ERROR neutron result = fn(*args, **kwargs)
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
2024-01-26 03:17:14.186 321 ERROR neutron return self.channel.remote_call(name, args, kwargs,
2024-01-26 03:17:14.186 321 ERROR neutron File "/var/lib/openstack/lib/python3.10/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
2024-01-26 03:17:14.186 321 ERROR neutron raise exc_type(*result[2])
2024-01-26 03:17:14.186 321 ERROR neutron OSError: [Errno 22] failed to open netns
2024-01-26 03:17:14.186 321 ERROR neutron
Additional context
The relevant log entries are:
2024-01-26 03:17:13.761 321 INFO neutron.agent.ovn.metadata.agent [-] Cleaning up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 namespace which is not needed anymore
2024-01-26 03:17:14.186 321 CRITICAL neutron [-] Unhandled error: OSError: [Errno 22] failed to open netns
It seems that the Neutron metadata agent is attempting to clean up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 and failing to do so. The problem is easy to work around: log in to the offending node 935822-compute03-ospcv2-dfw.openstack.local/172.28.232.123 and delete the broken namespace. However, the metadata agent should be able to do this automatically instead of crash-looping.
Resolving the issue with Ansible:
ansible -m shell -a 'ip netns delete ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361' 935822-compute03-ospcv2-dfw.openstack.local --inventory /etc/genestack/inventory --become
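Longer term, the traceback above shows the OSError escaping the agent's sync/teardown path unhandled. A minimal sketch of the kind of defensive handling being suggested (hypothetical names, not the actual neutron code; assumes errno 22/EINVAL is what "failed to open netns" raises, as in the log above):

```python
import errno


def teardown_with_tolerance(garbage_collect_namespace):
    """Run namespace garbage collection, tolerating namespaces that
    can no longer be opened instead of crashing the whole agent.

    Returns True if collection ran cleanly, False if the namespace was
    broken (OSError 22, 'failed to open netns') and had to be skipped.
    """
    try:
        garbage_collect_namespace()
        return True
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            # The namespace appears in the listing but cannot be
            # entered; skip it so one stale namespace cannot
            # crash-loop the agent. A real fix would also log this.
            return False
        raise
```

With handling along these lines, a single unreadable namespace would be skipped (and could be reported for manual cleanup) rather than taking down the agent on every restart.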
Host netns
When investigating the host, we find the following output.
root@935822-compute03-ospcv2-dfw:~# ip netns
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
ovnmeta-a60825e1-66ff-479c-92c4-2120b092f75c (id: 15)
ovnmeta-2887f499-3ca1-4507-8f93-7585e6a13f63 (id: 14)
cni-0adaaeb7-0666-ee24-ce5e-fb5bc70c4f21 (id: 6)
cni-46d2ca93-5b2a-1574-ecfb-1c1e03d85ed4 (id: 4)
cni-c48dde5f-15ed-c672-35f4-a7b7bd361222 (id: 3)
cni-b595cd77-4a34-89fc-d2ff-8b55e8a76b23 (id: 2)
ovnmeta-970d930a-f047-4a4b-977e-d9a2fe2fe00d (id: 5)
Error: Peer netns reference is invalid.
ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361
Error: Peer netns reference is invalid.
ovnmeta-691b7a69-1142-4d5b-8395-026a4c676aab
ovnmeta-7ceb1594-e791-4ce2-a218-761e59bb1169 (id: 11)
ovnmeta-1b686933-cc0d-4744-9642-cbbf5e50bed8 (id: 10)
ovnmeta-639ba975-890a-4bc6-a230-3b6548ce6089 (id: 9)
ovnmeta-9b291755-a9b6-4abe-a5a7-6f75a7f6db16 (id: 8)
ovnmeta-0beb4d29-1244-431f-85e8-dd0a5ef1ac88 (id: 7)
cni-08e4d4a9-723a-4959-bcf4-b73d5859acf5 (id: 0)
The namespace ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 has no id, and it is immediately preceded in the listing by Error: Peer netns reference is invalid.
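Based on the output above, the broken namespaces appear to be the ones ip netns lists without an "(id: N)" suffix. A small sketch of that heuristic (an assumption drawn from this one node's output, not a documented ip netns contract), which could feed an automated cleanup:

```python
def stale_ovnmeta_namespaces(ip_netns_output: str) -> list[str]:
    """Return ovnmeta-* namespaces that 'ip netns' lists without an
    '(id: N)' suffix -- on this node, those were exactly the namespaces
    whose peer reference was invalid."""
    stale = []
    for line in ip_netns_output.splitlines():
        line = line.strip()
        if line.startswith("ovnmeta-") and "(id:" not in line:
            stale.append(line)
    return stale
```

Each returned name could then be removed with ip netns delete, as in the Ansible one-liner above.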