We use GKE with Binary Authorization, and for system components we like to use `latest` tags from our own Artifact Registry, so we use digester.
We have a daily Terraform deployment run that applies a variety of configs to our environment, including some system components like digester itself, but also other Deployments that digester mutates.
One of these components is a Kubernetes API proxy (we use Private Service Connect), so this proxy is a system component that is delivered daily.
Now, we should have had (but until now didn't) `ignore_changes` on the image field in the Terraform run. It didn't really matter to us: we knew that when the deployment happened every day, digester would mutate the Deployment's image reference (which lacked the `@sha256` value), and since the result was in keeping with the current ReplicaSet, nothing changed.
Yesterday I discovered about 18,000 ReplicaSets for this Deployment, and new ones were being created every 2-3 minutes, rotating the proxy out of service and breaking people's connections.
Inspection revealed that the k8s API proxy Deployment had gotten past digester without being mutated and carried just the tag `latest`, which now differed from the ReplicaSet. Then the fight began: the Deployment controller, seeing the difference, created a new ReplicaSet using just the tag it had; digester would then mutate that ReplicaSet, pods would start rotating from the old ReplicaSet to the new, and the process would begin again a couple of minutes later.
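To make the mismatch concrete, here is a sketch of the two pod templates involved; the container name, image path, and digest are all illustrative, not taken from our actual config:

```yaml
# Pod template in the Deployment after the webhook was skipped
# (digester never mutated it, so it carries the bare tag):
spec:
  containers:
    - name: k8s-api-proxy  # illustrative name
      image: europe-docker.pkg.dev/example-project/registry/proxy:latest

# Pod template in the ReplicaSet the controller then creates,
# after digester successfully mutates it (illustrative digest):
spec:
  containers:
    - name: k8s-api-proxy
      image: europe-docker.pkg.dev/example-project/registry/proxy:latest@sha256:9f2a1c...
```

Because the mutated ReplicaSet template no longer matches the Deployment's template, the controller treats the ReplicaSet as out of date and creates yet another one, and the loop repeats.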
Takeaways:
- For us: use a Terraform `lifecycle` policy to ignore changes on the image field, so these runs stop touching the Deployment at all.
- A `failurePolicy` of `Fail` instead of `Ignore` would have stopped the Deployment being updated when there was a timeout reaching the webhook service.
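The first takeaway can be sketched in Terraform, assuming the proxy is managed with the `kubernetes` provider; the resource name is illustrative:

```hcl
resource "kubernetes_deployment" "k8s_api_proxy" {
  # ... metadata and spec as before ...

  lifecycle {
    # Leave the digester-pinned image alone on subsequent applies,
    # so the daily run never reverts it to the bare ':latest' tag.
    ignore_changes = [
      spec[0].template[0].spec[0].container[0].image,
    ]
  }
}
```

With this in place the daily apply no longer rewrites the image field, so there is no difference for digester to mutate and no rollout at all on routine runs.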
We're implementing these changes ourselves now, but we submit this edge case for your consideration.
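For the second takeaway, the change is a one-line edit to the webhook configuration; the object and webhook names below are illustrative and should be replaced with whatever your digester install actually uses:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: digester-mutating-webhook-configuration  # illustrative; check your install
webhooks:
  - name: digester-webhook.example.com           # illustrative
    # With Fail, a timeout calling the webhook service rejects the API
    # request instead of letting an unmutated ':latest' template through.
    failurePolicy: Fail
    # ... clientConfig, rules, sideEffects, etc. unchanged ...
```

The trade-off is that webhook outages then block updates to matching workloads, but for us that is preferable to silently admitting an unpinned image.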