Code Monkey home page Code Monkey logo

Comments (8)

pat-s avatar pat-s commented on August 23, 2024 1

I also use a RWX (EFS on AWS) and I don't see such long clone times. In the end, this is likely an issue of the respective RWX driver or storage backend you're using.

Here is the description of the storage class

Name:                  efs-sc-delete
IsDefaultClass:        No
Annotations:           meta.helm.sh/release-name=aws-efs-csi-driver,meta.helm.sh/release-namespace=aws-efs-csi-driver
Provisioner:           efs.csi.aws.com
Parameters:            basePath=/dynamic_provisioning,directoryPerms=700,fileSystemId=fs-05b518491300978f2,provisioningMode=efs-ap
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

from helm.

pi3ch avatar pi3ch commented on August 23, 2024

I used the OpenEBS NFS provisioner. From the logs I posted earlier, I can see volume is immediately available to the build pod.

Provisioner:           openebs.io/nfsrwx
Parameters:            <none>
AllowVolumeExpansion:  <unset>
MountOptions:          <none>
ReclaimPolicy:         Delete
VolumeBindingMode:     Immediate
Events:                <none>

I did further benchmark, using RWO volume. I use the following pipeline definition and tested using both WoodPecker:2.0.0 and Drone:2.21.0 on the same cluster. Drone completes the tasks in 23s but it takes over a minutes for Woodpecker.

steps:
- name: Init
  image: alpine
  commands:
    - echo "hello init"
    - sleep 5

- name: Fail test
  image: alpine
  commands:
  - id
  - sleep 5
  - env
  - sleep 5
  - exit 1

Woodpecker timing:
image

Drone timing:
image

So two things I noticed:

  1. Drone doesn't mount a volume to the build pod. Is there a way to disable default volume mounting in the woodpecker's build pod?
  2. Woodpecker waits longer to find the pod is in ready state before running the command. Is there a way to change the frequency of this check?

from helm.

pat-s avatar pat-s commented on August 23, 2024

Thanks for the additional tests.

To make the timings meaningful, the image pull times are needed. They need to be subtracted from the overall time.

Drone doesn't mount a volume to the build pod. Is there a way to disable default volume mounting in the woodpecker's build pod?

It doesn't seem possible at the moment. The volume mount is hard-coded here and not managed within the helm chart: https://github.com/woodpecker-ci/woodpecker/blob/f5d9c92824713f548cf5555177bf4d6b816d81bf/pipeline/backend/kubernetes/pod.go#L162-L173
I think it's required for sharing output between steps, i.e. the cloned repo from the "clone" step to all further steps. I am not sure one can turn it off. If drone does without a PV, it would be interesting to know how they do it (i.e. porting over the cloned assets to the other steps).

Woodpecker waits longer to find the pod is in ready state before running the command. Is there a way to change the frequency of this check?

The chart defines the readiness as follows

          readinessProbe:
            httpGet:
              path: /healthz
              port: 8000

By default the checks are executed every 10s but these options are not exposed within chart right now. However, I would assume that this is not the issue but some process during pod startup (maybe the mounting process) might cause some delay.

from helm.

pat-s avatar pat-s commented on August 23, 2024

FWIW, a "normal" clone step in most of my repos takes ~9s. With a RWX. I assume that the issue might be the RWX driver you're using and not in the general use of RWX (otherwise my timings would be higher as well).

Yet of course the differences to Drone are interesting. But before going further we need verification regarding the subtraction of image pull times.

from helm.

pi3ch avatar pi3ch commented on August 23, 2024

Here are RWO storage class timing. From my read, it usually takes 10-15s for volume to attach. So the additional delay in getting the commands running as compare with Drone, seems to be related to volume bounding delay.

LAST SEEN   TYPE      REASON                   OBJECT                                                          MESSAGE
15s         Warning   FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
12s         Warning   FailedScheduling         pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
10s         Normal    Scheduled                pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Successfully assigned woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone to bn-amd-8gbram-4vcpu-x4nk2
5s          Normal    SuccessfulAttachVolume   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       AttachVolume.Attach succeeded for volume "pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d"
4s          Normal    Pulled                   pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
4s          Normal    Created                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Created container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
4s          Normal    Started                  pod/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone                       Started container wp-01hk8s1qapv3vj2bn13y5dw2fg-0-clone
15s         Normal    ExternalProvisioning     persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   waiting for a volume to be created, either by external provisioner or manually created by system administrator
15s         Normal    Provisioning             persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   External provisioner is provisioning volume for claim "woodpecker/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default"
13s         Normal    ProvisioningSucceeded    persistentvolumeclaim/wp-01hk8s1qapv3vj2bn13y5dw2fg-0-default   Successfully provisioned volume pvc-4f8bd953-1b04-4fcc-bea3-5434490cb90d

image

RWX storage class timing. taking approximately 2m 2s to get the volume attached.

LAST SEEN   TYPE      REASON                   OBJECT                                                          MESSAGE
2m1s        Warning   FailedScheduling         pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
2m          Warning   FailedScheduling         pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
118s        Normal    Scheduled                pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Successfully assigned woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-clone to bn-amd-8gbram-4vcpu-x4nkp
106s        Normal    Pulled                   pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Container image "docker.io/woodpeckerci/plugin-git:2.4.0" already present on machine
106s        Normal    Created                  pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Created container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
106s        Normal    Started                  pod/wp-01hk8sx15w474tq9feet3dbv9b-0-clone                       Started container wp-01hk8sx15w474tq9feet3dbv9b-0-clone
2m2s        Normal    Provisioning             persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   External provisioner is provisioning volume for claim "woodpecker/wp-01hk8sx15w474tq9feet3dbv9b-0-default"
2m2s        Normal    ExternalProvisioning     persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   waiting for a volume to be created, either by external provisioner "openebs.io/nfsrwx" or manually created by system administrator
2m          Normal    ProvisioningSucceeded    persistentvolumeclaim/wp-01hk8sx15w474tq9feet3dbv9b-0-default   Successfully provisioned volume pvc-ad9f928a-3571-4588-a046-138f976c456d

image

Once volume is attached to the pod, other steps, running as containers in the same pod, use the already attached volume. Hence the timing is less than the clone step.

I don't know why it is necessary to have PVC for sharing data between containers of the same pod, couldn't we just use a temporary/ephemeral memory volume that is shared. Probably that makes it faster and less costly.

from helm.

pi3ch avatar pi3ch commented on August 23, 2024

FWIW, Drone creates EmptyDir/memory work volume: https://github.com/drone-runners/drone-runner-kube/blob/b57d0a9c80b0df00cfb92d59e3f965e8a29ac987/engine/compiler/compiler.go#L184

	workVolume := &engine.Volume{
		EmptyDir: &engine.VolumeEmptyDir{
			ID:   random(),
			Name: workMount.Name,
		},
	}

from helm.

pat-s avatar pat-s commented on August 23, 2024

Thanks for the stats.

As mentioned in woodpecker-ci/woodpecker#1594 (comment), every step uses its own pod. Which is why a volume is required.

Both your RWO and RWX attach times are really high 😮 My RWX attach time is lower than your RWO 🤯
I haven't done research yet what might cause this but it seems to me that something is not so good WRT to your volume drivers or other related components. Indeed, when it takes that long, WP or any other app is annoying to use.
Leaving the emptyDir/one-pod/many-pod discussion aside, I still believe you are facing a local issues in your cluster which is responsible for the high attach times. Otherwise we would have seen other requests already and I also haven't faced such high timings in different clusters I have WP running on (all AWS/Azure).

I can help you debugging it but I think we can't do anything in the chart here. If we go for the one-pod/multiple-containers approach, this needs changes in the WP k8s backend code as mentioned in woodpecker-ci/woodpecker#1594 (comment).

from helm.

pi3ch avatar pi3ch commented on August 23, 2024

Yeah make total sense. I am asking DO support to understand if this is normal or something wrong with our cluster. I did further testing using basic deployments and it took 2m for RWX to get attached.

from helm.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.