Comments (14)
@hairmare, I might have good news for you.
A change was merged to master, which allows the user of the Helm charts to overwrite these annoations. For example:
customize:
other:
persistentvolumeclaim/central-db:
annotations:
helm.sh/hook: null
helm.sh/hook-delete-policy: null
helm.sh/resource-policy: null
This would cause the Helm annotations for the central-db PVC to be deleted. Caveat: When using this kind of flexibility for modifying the manifests during rendering you better know exactly what you are doing -- particularly in terms of lifecycle management. This is a powerful mechanism, but I'm sure it also allows shooting yourself in the foot.
I have tried this with argocd on an AKS cluster (default StorageClass is using WaitForFirstConsumer
) and it seems to work.
In case you need some help for trying it out, feel free to ping me.
The same mechanism can also be used for changing the secret annotations.
That being said, I think we can close this issue, do you agree?
Thank you!
from stackrox.
Hello @hairmare!
Thank you for your proposal.
We will look into this and come back to you when we can provide more information on this topic.
Greetings,
Moritz
from stackrox.
I have since noticed that a bunch of other resources (mainly secrets) also have a helm-hook annotation and would like to also propose removing the annotation from those resources.
The main reason (in addition to the WaitForFirstConsumer
) is that i'm deploying the chart using Argo CD (aka OpenShift GitOps) and that Argo CD is trying to remove hook resources from the previous sync before it starts the next sync. This makes it non-trivial to use the chart with Argo CD to rollout changes to an already deployed instance of central-services.
from stackrox.
Hey @mtesseract
Were you able to get any feedback from the team? Argo CD still wants to remove important parts of the deployment with every update and i'd be happy to provide a fix.
Thanks,
Lucas
from stackrox.
Hi Lucas,
sorry that I wasn't able to get back to you earlier. So, I've had a first look at this and would like to describe my thoughts around this. Please correct me if I got something wrong (please note that I haven't used Argo CD or Azure AKS before).
I have the feeling that we are dealing with two issues here, is this right?
First would be about the WaitForFirstConsumer
setting and how the Chart could be installed when WaitForFirstConsumer
is activated for the involved PVC.
Second would be the problem with deleting Helm hooks during an Argo CD sync.
Right now I have only thought about the second.
So, I checked the Argo CD documentation in order to learn something about how Argo CD deals with hooks.
I found https://argo-cd.readthedocs.io/en/stable/user-guide/resource_hooks/#hook-deletion-policies and according to this page BeforeHookCreation
would be used as deletion policy by default.
When looking at https://argo-cd.readthedocs.io/en/stable/user-guide/helm/#helm-hooks it seems to me that the hook-delete-policy
simply carries over from the Helm hook to an Argo CD hook -- and if it is not present, then this would default to BeforeHookCreation
, which would explain the behaviour you are observing. Correct so far?
I checked if we use hook-deletion-policies currently and the only such annotations I could find are of the form https://github.com/stackrox/stackrox/blob/master/image/templates/helm/stackrox-central/templates/01-central-11-pvc.yaml#L27, which is not converted correctly into Argo CD world.
In general: We must be very careful about changing the semantics of the hook resources as current users might depend on the current behaviour. For example, a PV is expected to not be destroyed when a Helm chart is deinstalled. Secrets are another example of resources which we do not want to be deleted implicitly.
Now, I wonder what would happen if we keep the Hook nature of the current Hook resources and add a deletion policy of hook-failed
-- would this prevent these resources from being deleted by Argo CD on a sync?
Thanks
Moritz
from stackrox.
@hairmare can you please comment on @mtesseract 's response?
from stackrox.
I'd like to follow-up on this one.
Regarding the WaitForFirstConsumer
. I failed to reproduce this in my experiments -- I wonder if this problem has been solved outside of StackRox somehow?
I am connected to an AKS cluster and I have this:
❯ kc get sc | rg '(^NAME|default)'
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
default (default) disk.csi.azure.com Delete WaitForFirstConsumer true
My helm
CLI is version v3.13.3
.
I tried to install the stackrox-central-services
Helm chart with helm install --wait
. It took a while (until all pods are up and healthy) but it worked. A PVC got created:
❯ kc get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
stackrox central-db Bound pvc-922dfbca-3273-4342-894b-5cf17d648eea 100Gi RWO default 6m
Just to verify, the PVC does have the Helm hooks associated:
❯ kc get -n stackrox pvc central-db -o yaml | yq .metadata.annotations | rg hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never
Can you confirm this or am I missing something here?
Thanks
Moritz
from stackrox.
Hi Moritz
Thanks for circling back and trying this again!
I tried again on a vanilla AKS and didn't get any further...
$ kubectl get sc | grep -P '(^NAME|default)'
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
default (default) disk.csi.azure.com Delete WaitForFirstConsumer true 3d13h
$ kubectl get pvc -A | grep -P '(^NAME|stackrox)'
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
stackrox-issue-2482 stackrox-db Pending default 8m30s
$ kubectl get pvc stackrox-db -oyaml | yq .metadata.annotations | grep -P hook
helm.sh/hook: pre-install,pre-upgrade
helm.sh/hook-delete-policy: never
# in fact:
$ kubectl get-all -n stackrox-issue-2482
NAME NAMESPACE AGE
configmap/kube-root-ca.crt stackrox-issue-2482 21m
persistentvolumeclaim/stackrox-db stackrox-issue-2482 12m
serviceaccount/default stackrox-issue-2482 21m
secret/central-htpasswd stackrox-issue-2482 12m
secret/central-tls stackrox-issue-2482 12m
secret/scanner-db-password stackrox-issue-2482 12m
secret/scanner-db-tls stackrox-issue-2482 12m
secret/scanner-tls stackrox-issue-2482 12m
secret/stackrox-generated-letxqh stackrox-issue-2482 12m
$ kubectl events pvc stackrox-db
LAST SEEN TYPE REASON OBJECT MESSAGE
2m50s (x82 over 23m) Normal WaitForFirstConsumer PersistentVolumeClaim/stackrox-db waiting for first consumer to be created before binding
I don't think I'll be able to dig much deeper today but will investigate further. Some notes for now:
- I'm using a non-default Namespace, but i didn't find anything that would warrant it breaking this early
- Your PVC is called central-db, mine stackrox-db 👀
- We have the same hooks
I'm still using Argo CD for the Deployment (and not plain Helm). Argo CD doesn't just call the helm
binary so it might be contributing to the issue. The difference in our PVC naming leads me to think we might want to compare value and versions... below is a dump of the test i did to get the above results.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: app-stackrox
namespace: infra-argocd
spec:
destination:
namespace: stackrox-issue-2482
server: https://kubernetes.default.svc
project: test
source:
chart: stackrox-central-services
helm:
releaseName: stackrox-central-services
values: |-
allowNonstandardNamespace: true
system:
enablePodSecurityPolicies: false
repoURL: https://raw.githubusercontent.com/stackrox/helm-charts/main/opensource/
targetRevision: 74.9.0
As you can see, it's quote minimal. It also fails pretty early, so i don't even get to pods to get scheduled and images to get pulled.
When i initially reported this issue there where some bugs in Argo CD that compounded it's effects. Those have been fixed leading to a somewhat clearer(ish) picture.
As far as i can tell, using Helm on the CLI applies the resources in the order mandated by the hooks, and Helm waits for the api server to accept them before progessing to the next stage. OTOH Argo CD, being a fully fledged meta-operator, waits for the resources to become available before progressing to the next stage.
In the case of WaitForFirstConsumer this never happens so Argo CD won't ever progress past the hook stage.
I'm not sure this can be fundamentally addressed in Argo CD without it breaking any promises it makes to users regards how hooks work when it comes to some subtleties like this (which isn't as well defined in Helm as would be preferable).
I'm also failing to understand what purpose the hooks actually serve. One of the core tasks of Kubernetes's scheduler is to schedule workloads to a node while utilizing the CSI api to figure out where, and when, a workload can be run. The hooks currently preempt the scheduler from reconciling this as it happens and they making it wait for an actual resolution strategy to become available.
That said, although i can't think of a scenario where the hook annotation isn't a subtle bug and removing it a lowest risk change, i can understand that your risk posture probably differs from mine. Since external DB support was released being able to deploy Stackrox in this way using Argo CD has basically been relegated to a lab requirement.
Anyways, our application is progressing well and the fact that this issue is very far from my critical path, leads me to tend towards considering it fixed in a sense of there being simple and established workarounds :)
Cheers,
Lucas
from stackrox.
Btw, found an issue about this for ArgoCD: argoproj/argo-cd#12840
from stackrox.
Hi @hairmare , I have a few comments and questions, some of which might help, but might be just related, depending on your use case. Feel free to ignore them as you see fit.
-
Argo CD is trying to remove hook resources from the previous sync before it starts the next sync
It seems to me that ArgoCD is not obeying the
hook-delete-policy: never
annotation? Perhaps this is a bug in ArgoCD? -
Your mention of
targetRevision: 74.9.0
suggests that you're deploying an (almost?) end-of-life version of stackrox, i.e. 3.74.9. Perhaps it would make sense to look into upgrading to the 4.x version? Although I'm pretty sure this particular issue is also present there, so this is more of a general remark. -
Also on a more general note, but this might actually help with this issue: since you mentioned using OpenShift GitOps, have you considered using the ACS operator that we ship for OpenShift? I'd be very interested in any feedback you might have regarding the feasibility of moving from helm-based stackrox installation to an operator-based one. In case the problem is that there is no operator support for non-OpenShift platforms, would it affect your decision if we started supporting it for other platforms too?
-
Funny (or sad) how the bug which @mclasmeier found also mentions stackrox. Note that this comment has a workaround for your late PVC binding issue @hairmare . Do you think it could solve it for you?
from stackrox.
As I understand the issue for @hairmare is somewhat solved, since he wrote:
Anyways, our application is progressing well and the fact that this issue is very far from my critical path, leads me to tend towards considering it fixed in a sense of there being simple and established workarounds :)
But I was able to verify that the combination still breaks: AKS + ArgoCD + StackRox Helm Chart => deadlocked.
One can probably use the mentioned workaround or even plug in kustomize into ArgoCD, I've seen that somewhere, here it is: https://github.com/argoproj/argocd-example-apps/blob/master/plugins/kustomized-helm/README.md
Furthermore, we do have some evidence for this issue not being present when using the stackrox operator in ArgoCD -- this is what we are doing internally.
@porridge, not sure what you are referring to here:
Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.
from stackrox.
@porridge, not sure what you are referring to here:
Funny (or sad) how the bug which @mclasmeier found also mentions stackrox.
I mean that the person who file the bug you found, also encountered it when trying to deploy stackrox.
from stackrox.
Ah, I didn't notice stackrox being mentioned in the attached screenshot.
By the way, there is even another issue for this here: #5191
from stackrox.
@mclasmeier thanks for all the pointers, i have some reading to do
The issue isn't solved as is, but it has a low prio because i can circumvent it by using an external DB or using the workaround from this comment. I would appreciate it to work without any of these, but my case foresees using a DBaaS anyway.
from stackrox.
Related Issues (20)
- Delete old deferral requests HOT 5
- Scanner: no support for external DB HOT 2
- Revamp Database Migration and Management Process HOT 4
- This is a test issue to see if the token still works
- StackRox 4.4.0 collectors are crashing on COS GKE - failed to initialize collector kernel components HOT 9
- StackRox central 4.4.0 is trying to write to deprecated/'no longer needed' stackrox-db volume HOT 6
- OIDC Login redirect_uri is incorrect using Generic Kubernetes 1.28.6 cluster HOT 4
- can`t scan images HOT 7
- add integration with harbor HOT 3
- Replace sort functions with slices equivalents HOT 2
- Create CredentialExpiryBanner warnings for certificate expiration of central-db-tls certificates HOT 1
- DOCS Missing - What's the API call behind : roxctl image check HOT 1
- CVEs found in roxctl 4.4.2 HOT 2
- scanner-v4-matcher exceeds memory limit and is killed HOT 4
- Scanning images in tightly locked clusters. HOT 10
- Jira Cloud Integration fails HOT 8
- ACS cannot fetch metadata when using mirrors in a disconnected environment HOT 2
- Updating Scanner Defitions in Offline Mode HOT 4
- secured cluster services not able to authenticate with central service after restarting central db HOT 17
- UI reports Vulnerability definitions as out of date, definitions manifest in Central is correct HOT 9
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from stackrox.