
Comments (14)

mclasmeier commented on June 26, 2024

@hairmare, I might have good news for you.

A change was merged to master which allows users of the Helm charts to override these annotations. For example:

customize:
  other:
    persistentvolumeclaim/central-db:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null

This would cause the Helm annotations for the central-db PVC to be deleted. Caveat: when using this kind of flexibility for modifying the manifests during rendering, you had better know exactly what you are doing -- particularly in terms of lifecycle management. This is a powerful mechanism, but I'm sure it also allows shooting yourself in the foot.

I have tried this with argocd on an AKS cluster (default StorageClass is using WaitForFirstConsumer) and it seems to work.

If you need any help trying it out, feel free to ping me.

The same mechanism can also be used for changing the secret annotations.
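For instance, here is a sketch of how this could look for one of the chart's secrets (I'm using central-htpasswd as the example; adjust the resource name to whichever secret you actually want to detach from the hook lifecycle):

```yaml
customize:
  other:
    secret/central-htpasswd:
      annotations:
        helm.sh/hook: null
        helm.sh/hook-delete-policy: null
        helm.sh/resource-policy: null
```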

That being said, I think we can close this issue, do you agree?

Thank you!

from stackrox.

mtesseract commented on June 26, 2024

Hello @hairmare!
Thank you for your proposal.
We will look into this and come back to you when we can provide more information on this topic.
Greetings,
Moritz


hairmare commented on June 26, 2024

I have since noticed that a bunch of other resources (mainly secrets) also have a helm-hook annotation and would like to also propose removing the annotation from those resources.

The main reason (in addition to the WaitForFirstConsumer issue) is that I'm deploying the chart using Argo CD (aka OpenShift GitOps), and Argo CD tries to remove hook resources from the previous sync before it starts the next sync. This makes it non-trivial to use the chart with Argo CD to roll out changes to an already deployed instance of central-services.


hairmare commented on June 26, 2024

Hey @mtesseract
Were you able to get any feedback from the team? Argo CD still wants to remove important parts of the deployment with every update, and I'd be happy to provide a fix.
Thanks,
Lucas


mtesseract commented on June 26, 2024

Hi Lucas,

sorry that I wasn't able to get back to you earlier. So, I've had a first look at this and would like to describe my thoughts around this. Please correct me if I got something wrong (please note that I haven't used Argo CD or Azure AKS before).

I have the feeling that we are dealing with two issues here, is this right?
First would be about the WaitForFirstConsumer setting and how the Chart could be installed when WaitForFirstConsumer is activated for the involved PVC.

Second would be the problem with deleting Helm hooks during an Argo CD sync.

Right now I have only thought about the second.

So, I checked the Argo CD documentation in order to learn something about how Argo CD deals with hooks.
I found https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies and according to this page BeforeHookCreation would be used as deletion policy by default.

When looking at https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-hooks it seems to me that the hook-delete-policy simply carries over from the Helm hook to an Argo CD hook -- and if it is not present, then this would default to BeforeHookCreation, which would explain the behaviour you are observing. Correct so far?
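To illustrate my reading of those docs, this is roughly the translation I would expect (a sketch based only on the linked documentation, not verified against the Argo CD source):

```yaml
# What the chart renders on a hook resource:
metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    # no helm.sh/hook-delete-policy present
---
# How Argo CD would effectively treat it, per the docs:
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation  # the default
```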

I checked whether we currently use hook-deletion-policies, and the only such annotations I could find are of the form https://github.com/stackrox/stackrox/blob/master/image/templates/helm/stackrox-central/templates/01-central-11-pvc.yaml#L27, which does not carry over correctly into the Argo CD world.

In general: we must be very careful about changing the semantics of the hook resources, as current users might depend on the current behaviour. For example, a PV is expected not to be destroyed when a Helm chart is uninstalled. Secrets are another example of resources which we do not want to be deleted implicitly.

Now, I wonder what would happen if we keep the Hook nature of the current Hook resources and add a deletion policy of hook-failed -- would this prevent these resources from being deleted by Argo CD on a sync?
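Concretely, the annotations on such a hook resource would then look something like this (a sketch, untested):

```yaml
metadata:
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-delete-policy: hook-failed
```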

Thanks
Moritz


porridge commented on June 26, 2024

@hairmare can you please comment on @mtesseract 's response?


mclasmeier commented on June 26, 2024

I'd like to follow up on this one.

Regarding WaitForFirstConsumer: I failed to reproduce this in my experiments -- I wonder if this problem has somehow been solved outside of StackRox?

I am connected to an AKS cluster and I have this:

❯ kc get sc | rg '(^NAME|default)'
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true  

My helm CLI is version v3.13.3.

I tried to install the stackrox-central-services Helm chart with helm install --wait. It took a while (until all pods were up and healthy), but it worked. A PVC got created:

❯ kc get pvc -A
NAMESPACE   NAME         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox    central-db   Bound    pvc-922dfbca-3273-4342-894b-5cf17d648eea   100Gi      RWO            default        6m

Just to verify, the PVC does have the Helm hook annotations attached:

❯ kc get -n stackrox pvc central-db -o yaml | yq .metadata.annotations | rg hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never

Can you confirm this or am I missing something here?

Thanks
Moritz


hairmare commented on June 26, 2024

Hi Moritz

Thanks for circling back and trying this again!

I tried again on a vanilla AKS and didn't get any further...

$ kubectl get sc | grep -P '(^NAME|default)'
NAME                    PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
default (default)       disk.csi.azure.com   Delete          WaitForFirstConsumer   true                   3d13h

$ kubectl get pvc -A | grep -P '(^NAME|stackrox)'
NAMESPACE             NAME                                                                             STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
stackrox-issue-2482   stackrox-db                                                                      Pending                                                                        default        8m30s

$ kubectl get pvc stackrox-db -oyaml | yq .metadata.annotations | grep -P hook 
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never

# in fact:
$ kubectl get-all -n stackrox-issue-2482
NAME                               NAMESPACE            AGE
configmap/kube-root-ca.crt         stackrox-issue-2482  21m  
persistentvolumeclaim/stackrox-db  stackrox-issue-2482  12m  
serviceaccount/default             stackrox-issue-2482  21m  
secret/central-htpasswd            stackrox-issue-2482  12m  
secret/central-tls                 stackrox-issue-2482  12m  
secret/scanner-db-password         stackrox-issue-2482  12m  
secret/scanner-db-tls              stackrox-issue-2482  12m  
secret/scanner-tls                 stackrox-issue-2482  12m  
secret/stackrox-generated-letxqh   stackrox-issue-2482  12m  

$ kubectl events pvc stackrox-db
LAST SEEN              TYPE     REASON                 OBJECT                              MESSAGE
2m50s (x82 over 23m)   Normal   WaitForFirstConsumer   PersistentVolumeClaim/stackrox-db   waiting for first consumer to be created before binding

I don't think I'll be able to dig much deeper today but will investigate further. Some notes for now:

  • I'm using a non-default namespace, but I didn't find anything that would warrant it breaking this early
  • Your PVC is called central-db, mine stackrox-db 👀
  • We have the same hooks

I'm still using Argo CD for the deployment (and not plain Helm). Argo CD doesn't just call the helm binary, so it might be contributing to the issue. The difference in our PVC naming leads me to think we might want to compare values and versions... below is a dump of the test I did to get the above results.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-stackrox
  namespace: infra-argocd
spec:
  destination:
    namespace: stackrox-issue-2482
    server: https://kubernetes.default.svc
  project: test
  source:
    chart: stackrox-central-services
    helm:
      releaseName: stackrox-central-services
      values: |-
        allowNonstandardNamespace: true
        system:
          enablePodSecurityPolicies: false
    repoURL: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
    targetRevision: 74.9.0

As you can see, it's quite minimal. It also fails pretty early, so I don't even get to the point where pods are scheduled and images are pulled.

When I initially reported this issue there were some bugs in Argo CD that compounded its effects. Those have since been fixed, leading to a somewhat clearer(ish) picture.

As far as I can tell, using Helm on the CLI applies the resources in the order mandated by the hooks, and Helm only waits for the API server to accept them before progressing to the next stage. OTOH Argo CD, being a fully fledged meta-operator, waits for the resources to become available before progressing to the next stage.
In the case of WaitForFirstConsumer this never happens, so Argo CD won't ever progress past the hook stage.

I'm not sure this can be fundamentally addressed in Argo CD without breaking promises it makes to users regarding how hooks work, at least when it comes to subtleties like this (which aren't as well defined in Helm as would be preferable).

I'm also failing to understand what purpose the hooks actually serve. One of the core tasks of the Kubernetes scheduler is to schedule workloads to a node, using the CSI API to figure out where and when a workload can run. The hooks currently prevent the scheduler from reconciling this as it happens, making it wait for an actual resolution strategy to become available.

That said, although I can't think of a scenario where the hook annotation isn't a subtle bug and removing it isn't the lowest-risk change, I can understand that your risk posture probably differs from mine. Since external DB support was released, being able to deploy StackRox in this way using Argo CD has basically been relegated to a lab requirement.

Anyway, our application is progressing well, and the fact that this issue is very far from my critical path leads me to consider it fixed, in the sense that there are simple and established workarounds :)

Cheers,
Lucas


mclasmeier commented on June 26, 2024

Btw, found an issue about this for ArgoCD: argoproj/argo-cd#12840


porridge commented on June 26, 2024

Hi @hairmare , I have a few comments and questions, some of which might help, but might be just related, depending on your use case. Feel free to ignore them as you see fit.

  1. Argo CD is trying to remove hook resources from the previous sync before it starts the next sync

    It seems to me that ArgoCD is not obeying the hook-delete-policy: never annotation? Perhaps this is a bug in ArgoCD?

  2. Your mention of targetRevision: 74.9.0 suggests that you're deploying an (almost?) end-of-life version of stackrox, i.e. 3.74.9. Perhaps it would make sense to look into upgrading to the 4.x version? Although I'm pretty sure this particular issue is also present there, so this is more of a general remark.

  3. Also on a more general note, but this might actually help with this issue: since you mentioned using OpenShift GitOps, have you considered using the ACS operator that we ship for OpenShift? I'd be very interested in any feedback you might have regarding the feasibility of moving from helm-based stackrox installation to an operator-based one. In case the problem is that there is no operator support for non-OpenShift platforms, would it affect your decision if we started supporting it for other platforms too?

  4. Funny (or sad) how the bug which @mclasmeier found also mentions stackrox. Note that this comment has a workaround for your late PVC binding issue @hairmare . Do you think it could solve it for you?


mclasmeier commented on June 26, 2024

As I understand it, the issue for @hairmare is somewhat solved, since he wrote:

Anyway, our application is progressing well, and the fact that this issue is very far from my critical path leads me to consider it fixed, in the sense that there are simple and established workarounds :)

But I was able to verify that the combination still breaks: AKS + ArgoCD + StackRox Helm Chart => deadlocked.

One can probably use the mentioned workaround, or even plug kustomize into ArgoCD -- I've seen that somewhere; here it is: https://github.com/argoproj/argocd-example-apps/blob/master/plugins/kustomized-helm/README.md

Furthermore, we do have some evidence that this issue is not present when using the stackrox operator in ArgoCD -- this is what we are doing internally.

@porridge, not sure what you are referring to here:

Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.


porridge commented on June 26, 2024

@porridge, not sure what you are referring to here:

Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.

I mean that the person who filed the bug you found also encountered it when trying to deploy stackrox.


mclasmeier commented on June 26, 2024

Ah, I didn't notice stackrox being mentioned in the attached screenshot.

By the way, there is even another issue for this here: #5191


hairmare commented on June 26, 2024

@mclasmeier thanks for all the pointers, I have some reading to do.

The issue isn't solved as such, but it has low priority because I can circumvent it by using an external DB or the workaround from this comment. I would appreciate it working without any of these, but my use case foresees using a DBaaS anyway.

