Comments (8)
logs/longhorn-system/csi-attacher-57c5fd5bdf-x8wmf/csi-attacher.log
2024-04-23T19:04:17.278448755+10:00 I0423 09:04:17.278284 1 csi_handler.go:234] Error processing "csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6": failed to attach: rpc error: code = Internal desc = volume bazarr failed to attach to node k3s-worker-1 with attachmentID csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6: Waiting for volume share to be available
And also in yamls/namespaced/longhorn-system/longhorn.io/v1beta2/volumeattachments.yaml:
spec:
  attachmentTickets:
    csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
      generation: 0
      id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
      nodeID: k3s-worker-1
      parameters:
        disableFrontend: "false"
        lastAttachedBy: "null"
      type: csi-attacher
  volume: bazarr
status:
  attachmentTicketStatuses:
    csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6:
      conditions:
      - lastProbeTime: "null"
        lastTransitionTime: "2024-04-22T22:09:38Z"
        message: Waiting for volume share to be available
        reason: "null"
        status: "False"
        type: Satisfied
      generation: 0
      id: csi-8ce5e5e7551c2951b0826ae7383457b98a30e0a8251083368f71332205f7d9d6
      satisfied: false
That comes from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L741
and therefore from
https://github.com/longhorn/longhorn-manager/blob/2c27d58245028e2475b301ab14021d51a9ef73e1/controller/volume_attachment_controller.go#L914-L918
And indeed, its share manager is not running: it is stopping, as are all but two of the share managers.
yamls/namespaced/longhorn-system/longhorn.io/v1beta2/sharemanagers.yaml
  name: bazarr
  namespace: longhorn-system
  ownerReferences:
  - apiVersion: longhorn.io/v1beta2
    kind: Volume
    name: bazarr
    uid: e214dc42-a6e8-41e8-8d27-1971c32fddaf
  resourceVersion: "247397410"
  uid: 13836141-afbe-4edd-9881-6665dfb2fb9e
spec:
  image: longhornio/longhorn-share-manager:v1.6.1
status:
  endpoint: "null"
  ownerID: k3s-worker-1
  state: stopping
Looking into why that would be.
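To make the chain of cause and effect above concrete, here is a toy Python model of the gate as described in this comment: an attachment ticket for a shared (RWX) volume stays unsatisfied, with the message seen in the log, until the volume's share manager reaches the running state. This is a hypothetical simplification written for illustration, not a transcription of the linked longhorn-manager Go code, which may check more conditions (for example the share endpoint); the names `TicketStatus` and `evaluate_ticket` are made up.

```python
# Toy model (hypothetical, NOT longhorn-manager code): an RWX attachment
# ticket is only satisfied once the share manager reports "running".
from dataclasses import dataclass

WAITING_MSG = "Waiting for volume share to be available"


@dataclass
class TicketStatus:
    satisfied: bool
    message: str = ""


def evaluate_ticket(share_manager_state: str) -> TicketStatus:
    """Return the attachment ticket status for a given share manager state."""
    if share_manager_state != "running":
        # Any other state ("stopping", "starting", ...) keeps the ticket pending.
        return TicketStatus(satisfied=False, message=WAITING_MSG)
    return TicketStatus(satisfied=True)


if __name__ == "__main__":
    # bazarr's share manager is "stopping", so its ticket stays unsatisfied.
    print(evaluate_ticket("stopping"))
```

Under this model, nothing the CSI attacher does can satisfy the ticket while the share manager is stuck in stopping, so the fix has to come from whatever is holding the share manager in that state.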
from longhorn.
For the replica placement question, try setting replica-disk-soft-anti-affinity to false. That means "hard" anti-affinity at the disk level: a replica would remain unscheduled rather than share a disk with another replica of the same volume just to get scheduled.
It's interesting, because in the support bundle, replica-soft-anti-affinity (which is really "replica node soft anti-affinity") is already false, so the replicas should not have shared a node, much less a disk. But I'm not entirely sure of the order of priority when the scheduler evaluates these settings.
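One way to picture how node-level and disk-level anti-affinity could interact is a tiered search: prefer a node with no replica of the volume, fall back to another disk on an already-used node only if node anti-affinity is soft, and share a disk only if both settings are soft. The sketch below encodes exactly that order as an assumption; the real priority order in Longhorn's scheduler is not confirmed here, and `Disk` and `pick_disk` are hypothetical names.

```python
# Hypothetical replica placement with node- and disk-level anti-affinity.
# The tier order below is an ASSUMPTION, not Longhorn's confirmed algorithm.
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Disk:
    node: str
    disk: str


def pick_disk(candidates: List[Disk], existing: List[Disk],
              node_soft: bool, disk_soft: bool) -> Optional[Disk]:
    """Choose a disk for a new replica, or None to leave it unscheduled.

    existing lists the disks that already hold replicas of the same volume.
    """
    used_nodes = {d.node for d in existing}
    used_disks = set(existing)
    # Tier 1: a node that holds no replica of this volume yet.
    for c in candidates:
        if c.node not in used_nodes:
            return c
    # Tier 2: share a node but not a disk (requires soft node anti-affinity).
    if node_soft:
        for c in candidates:
            if c not in used_disks:
                return c
    # Tier 3: share a disk (requires both settings to be soft).
    if node_soft and disk_soft and candidates:
        return candidates[0]
    # Hard anti-affinity: leave the replica unscheduled.
    return None
```

In this model, `node_soft=False` (the value found in the support bundle) already rules out two replicas on one node, which is why the observed placement is surprising.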
As for the VM partition, I don't know. That's probably worth a separate ticket (or question, actually).
from longhorn.
Support bundle is empty (0 bytes). Perhaps mail it to [email protected]
from longhorn.
I think the file is too large; I'll send a link to that email address now.
from longhorn.
A few other things. There are a lot of log messages like this one:
2024-04-23T08:08:36.437737012+10:00 time="2024-04-22T22:08:36Z" level=error msg="Failed to sync Longhorn replica" func=controller.handleReconcileErrorLogging file="utils.go:67" Replica=longhorn-system/prowlarr-r-e2cd97b3 controller=longhorn-replica error="failed to sync replica for longhorn-system/prowlarr-r-e2cd97b3: instance prowlarr-r-e2cd97b3 NodeID k3s-worker-2 is not the same as the instance manager instance-manager-5b18aaee3512cb8713b524f2f78399b4 NodeID k3s-worker-3"
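The error above says that a replica's NodeID disagrees with the NodeID of the instance manager it is mapped to. As a hypothetical offline check, you could copy those values by hand from the support bundle's replica and instance manager YAML into records like the ones below and list every disagreement. The `ReplicaRecord` fields are invented for this sketch and do not mirror the Longhorn CRD schema.

```python
# Hypothetical offline check: flag replicas whose node differs from the node
# of the instance manager they are mapped to. Field names are invented.
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ReplicaRecord:
    name: str
    node_id: str           # node the replica is assigned to
    instance_manager: str  # instance manager the replica is mapped to
    im_node_id: str        # node that instance manager runs on


def node_mismatches(records: Iterable[ReplicaRecord]) -> List[str]:
    """Return one message per replica whose NodeID differs from its instance manager's."""
    return [
        f"instance {r.name} NodeID {r.node_id} is not the same as the "
        f"instance manager {r.instance_manager} NodeID {r.im_node_id}"
        for r in records
        if r.node_id != r.im_node_id
    ]


if __name__ == "__main__":
    # Values taken from the prowlarr log line above.
    rec = ReplicaRecord("prowlarr-r-e2cd97b3", "k3s-worker-2",
                        "instance-manager-5b18aaee3512cb8713b524f2f78399b4",
                        "k3s-worker-3")
    for msg in node_mismatches([rec]):
        print(msg)
```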
And when I look at replicas.yaml, only a handful of replicas (6 of 32) currently have a storageIP at all:
grep storageIP replicas.yaml | sort | uniq -c
4 storageIP: 10.42.3.71
2 storageIP: 10.42.5.177
26 storageIP: "null"
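The `grep | sort | uniq -c` pipeline above can also be reproduced in a few lines of Python, which makes it easy to count how many replicas actually have an address. This minimal sketch assumes each `storageIP:` key sits on its own line, as in the support bundle YAML, and treats `"null"` or an empty value as unassigned; it does not use a YAML parser, and `count_storage_ips` and `assigned_ratio` are hypothetical helper names.

```python
# Tally storageIP values in a YAML dump, like `grep storageIP | sort | uniq -c`.
# Assumes one "storageIP: <value>" per line; no real YAML parsing is done.
from collections import Counter
from typing import Tuple

UNASSIGNED = {'"null"', "null", ""}


def count_storage_ips(yaml_text: str) -> Counter:
    """Count occurrences of each storageIP value."""
    counts: Counter = Counter()
    for line in yaml_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "storageIP":
            counts[value.strip()] += 1
    return counts


def assigned_ratio(counts: Counter) -> Tuple[int, int]:
    """Return (replicas with an IP, total replicas)."""
    assigned = sum(n for ip, n in counts.items() if ip not in UNASSIGNED)
    return assigned, sum(counts.values())


if __name__ == "__main__":
    # Rebuild the counts reported above: 4 + 2 addressed, 26 "null".
    sample = ("storageIP: 10.42.3.71\n" * 4
              + "storageIP: 10.42.5.177\n" * 2
              + 'storageIP: "null"\n' * 26)
    print(assigned_ratio(count_storage_ips(sample)))  # (6, 32)
```

To use it on the bundle, pass the contents of replicas.yaml to `count_storage_ips` instead of the rebuilt sample.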
However, engines.yaml shows other addresses for the replicas in the replicaAddressMap fields, for instance:
replicaAddressMap:
  bazarr-r-3aa7d762: 10.42.4.16:10030
  bazarr-r-501487c6: 10.42.5.167:10002
  bazarr-r-cf42f74a: 10.42.4.16:10000
This is a good time to mention that Longhorn identifies replicas by the IP:port on which they are assigned to listen for I/O commands. (This one is also interesting because two of the three replicas, bazarr-r-3aa7d762 and bazarr-r-cf42f74a, share the address 10.42.4.16 and are therefore on the same node, which is not ideal for redundancy.)
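Since replicas are identified by IP:port, a replicaAddressMap alone is enough to spot replicas that share an instance manager IP and therefore, most likely, a node. A small sketch, assuming IPv4 `host:port` strings as shown above; `colocated_replicas` is a hypothetical helper.

```python
# Group replicas from an engine's replicaAddressMap by IP to spot co-location.
# Assumes IPv4 "host:port" strings; IPv6 addresses would need bracket handling.
from collections import defaultdict
from typing import Dict, List


def colocated_replicas(address_map: Dict[str, str]) -> Dict[str, List[str]]:
    """Return {ip: [replica names]} for every IP that hosts more than one replica."""
    by_ip: Dict[str, List[str]] = defaultdict(list)
    for replica, addr in address_map.items():
        ip, _, _port = addr.rpartition(":")
        by_ip[ip].append(replica)
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) > 1}


if __name__ == "__main__":
    bazarr = {
        "bazarr-r-3aa7d762": "10.42.4.16:10030",
        "bazarr-r-501487c6": "10.42.5.167:10002",
        "bazarr-r-cf42f74a": "10.42.4.16:10000",
    }
    print(colocated_replicas(bazarr))
    # {'10.42.4.16': ['bazarr-r-3aa7d762', 'bazarr-r-cf42f74a']}
```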
So, a couple of questions:
- Is it possible that during the reboot and DNS change some nodes were assigned new IPs?
- Can you give more detail about how the replicas were moved from one node to another?
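One way to check the IP question from the bundle itself is to compare each engine's replicaAddressMap against the storageIP currently reported by the matching replica. A hypothetical sketch follows; the inputs are dictionaries you would fill in by hand from engines.yaml and replicas.yaml, and `stale_addresses` is a made-up name. A mismatch does not by itself prove that IPs were reassigned: a replica that isn't running presumably reports `"null"` as well, which the sketch normalizes to `None`.

```python
# Compare engine-side replica IPs with the replicas' current storageIP values.
# Inputs are hand-filled dicts (hypothetical); "null"/"" are treated as unset.
from typing import Dict, Optional, Tuple


def _normalize(ip: Optional[str]) -> Optional[str]:
    """Map the various 'no address' spellings to None."""
    return None if ip in (None, "", "null", '"null"') else ip


def stale_addresses(address_map: Dict[str, str],
                    current_ips: Dict[str, Optional[str]]
                    ) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {replica: (engine_ip, replica_storage_ip)} wherever the two differ."""
    stale = {}
    for replica, addr in address_map.items():
        engine_ip = addr.rpartition(":")[0]
        current = _normalize(current_ips.get(replica))
        if current != engine_ip:
            stale[replica] = (engine_ip, current)
    return stale
```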
from longhorn.
I'm not entirely certain whether they were given new IPs, but I doubt it, since I only changed the DNS, not my DHCP server.
However, something strange happened just before that: I evicted all pods from the k3s-worker-1 VM to enlarge its root partition, and for some reason all the deployments except radarr and vaultwarden are now working. I'm not sure why.
Also, how do I force Longhorn to place only one replica per node? The replicas seem to jump around, and sometimes several of them end up on the same node.
from longhorn.
Also, another quick question (I'm not sure whether I should open a separate issue for this): do you know how to enlarge the partition of a VM that has Longhorn running on it? I get the following error, and every time I try to expand the main partition, the VM no longer boots:
GPT PMBR size mismatch (209715199 != 419430399) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
This disk is currently in use - repartitioning is probably a bad idea.
It's recommended to umount all file systems, and swapoff all swap
partitions on this disk.
from longhorn.
Yeah, I also have Default Data Locality set to best-effort, which I assumed would help, but it didn't. I can see now that the issue with radarr is that its replicas are on nodes 1 and 3, but the pod is being scheduled on node 2. I'll have to figure that out somehow. Thanks for your help!
I'll make a separate post for the VM partition. Cheers.
from longhorn.