from azure-service-operator.
Thanks for reporting this. As discussed on Slack we believe this is a bug.
We'll look over the logs (thank you for sharing them) and see if we can't get this fixed!
from azure-service-operator.
The key logs in question are:
I0328 11:17:21.995373 1 generic_reconciler.go:335] controllers/DatabaseAccountController "msg"="Skipping creation/update of resource due to policy" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts" "serviceoperator.azure.com/reconcile-policy"="skip"
I0328 11:17:21.995449 1 azure_generic_arm_reconciler_instance.go:414] controllers/DatabaseAccountController "msg"="Resource successfully created/updated" "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts" "resourceID"="/subscriptions/ea80ce6b-d4af-44da-b2c1-af925fd21360/resourceGroups/rg-cdoacr4-gl-accounts/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-cdoacr4-gl-accounts"
I0328 11:17:22.124739 1 kubernetes_exporter.go:41] controllers/DatabaseAccountController "msg"="Getting Kubernetes resources for export" "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts"
I0328 11:17:22.124775 1 kubernetes_exporter.go:47] controllers/DatabaseAccountController "msg"="Successfully retrieved Kubernetes resources for export" "ResourcesToWrite"=0 "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts"
So I think what is happening is:
- Cluster A (actively managing the account) creates the account.
- Cluster B (has skip set) sees the account is created and immediately calls the ListKeys API, which seems to succeed and return 0 keys.
I see two approaches to solving this problem:
- Wait for the resource to be in ProvisioningState: Succeeded before exporting keys.
- Treat "user asked us to export a key we couldn't find" as a retryable error. If we do this fix we should apply it to every resource that supports key export (there are quite a few).
We should probably do both of these, but I believe for the purposes of this bug, either fix will work.
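To make the two approaches concrete, here is a minimal Go sketch of a pre-export check that combines them. It is not ASO's actual code: `AccountState`, `CheckExport`, `IsRetryable`, and `ErrRetryable` are hypothetical names invented for illustration, and the key names are whatever the user asked to export.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRetryable is a hypothetical sentinel (not an ASO type) marking
// conditions that should be requeued instead of reported as success.
var ErrRetryable = errors.New("retryable")

// AccountState is a hypothetical, simplified view of what the operator
// knows about the Azure resource at export time.
type AccountState struct {
	ProvisioningState string            // e.g. "Creating", "Succeeded"
	Keys              map[string]string // keys returned by the list-keys call
}

// CheckExport returns nil only when it looks safe to write secrets.
func CheckExport(state AccountState, requested []string) error {
	// Approach 1: do not export until provisioning has finished.
	if state.ProvisioningState != "Succeeded" {
		return fmt.Errorf("provisioning state is %q: %w",
			state.ProvisioningState, ErrRetryable)
	}
	// Approach 2: a requested key that is missing or empty is treated
	// as retryable rather than as "nothing to write".
	for _, name := range requested {
		if state.Keys[name] == "" {
			return fmt.Errorf("requested key %q was not returned: %w",
				name, ErrRetryable)
		}
	}
	return nil
}

// IsRetryable reports whether err wraps ErrRetryable.
func IsRetryable(err error) bool {
	return errors.Is(err, ErrRetryable)
}
```

With a check like this, the "succeeds with 0 keys" case from the logs would be requeued instead of being recorded as a successful export with nothing to write.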
from azure-service-operator.
Even though the logs above show what must have happened, I wasn't able to reproduce this behavior when I tried.
Here's what I did to reproduce:
- Installed ASO v2.3.0 into a local kind cluster.
- Applied 1 rg + 2 databaseaccounts (one with skip for the reconcile-policy). Each databaseaccount writes its secrets to a different secret (keys1 for the actively managed one, keys2 for the skip one).
- Waited until the DatabaseAccount was created.
- Checked the secrets.
Here's what I saw:
k get databaseaccounts.documentdb.azure.com
NAME READY SEVERITY REASON MESSAGE
asotestcosmosdb False Info Reconciling The resource is in the process of being reconciled by the operator
asotestcosmosdb-skip False Warning Failed extension failed to produce resources for export: failed listing keys: POST https://management.azure.com/subscriptions/00000000-0000-0000-000000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb/listKeys...
After some waiting...
k get databaseaccounts.documentdb.azure.com
NAME READY SEVERITY REASON MESSAGE
asotestcosmosdb True Succeeded
asotestcosmosdb-skip True Succeeded
k get secrets
NAME TYPE DATA AGE
keys1 Opaque 3 19s
keys2 Opaque 3 13s
I also confirmed that both keys1 and keys2 had the same secret data.
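For anyone repeating this check programmatically, a comparison along these lines would do it. This is only a sketch: `SameSecretData` is a hypothetical helper, and it assumes the two secrets' payloads have already been fetched as `map[string][]byte`, which is the shape of the `Data` field on a Kubernetes Secret.

```go
package main

import "bytes"

// SameSecretData reports whether two secret payloads hold exactly the
// same keys with byte-identical values. Hypothetical helper for this
// sketch; fetching the secrets from the cluster is out of scope here.
func SameSecretData(a, b map[string][]byte) bool {
	if len(a) != len(b) {
		return false
	}
	for k, av := range a {
		bv, ok := b[k]
		if !ok || !bytes.Equal(av, bv) {
			return false
		}
	}
	return true
}
```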
The actual resources I applied were:
apiVersion: resources.azure.com/v1api20200601
kind: ResourceGroup
metadata:
name: aso-sample-rg
namespace: default
spec:
location: westcentralus
---
apiVersion: documentdb.azure.com/v1api20210515
kind: DatabaseAccount
metadata:
name: asotestcosmosdb
namespace: default
spec:
databaseAccountOfferType: Standard
kind: GlobalDocumentDB
location: westus2
locations:
- locationName: westus2
operatorSpec:
secrets:
documentEndpoint:
key: endpoint
name: keys1
primaryMasterKey:
key: primarymasterkey
name: keys1
primaryReadonlyMasterKey:
key: primaryreadonlymasterkey
name: keys1
owner:
name: aso-sample-rg
---
apiVersion: documentdb.azure.com/v1api20210515
kind: DatabaseAccount
metadata:
name: asotestcosmosdb-skip
namespace: default
annotations:
serviceoperator.azure.com/reconcile-policy: skip
spec:
azureName: asotestcosmosdb
databaseAccountOfferType: Standard
kind: GlobalDocumentDB
location: westus2
locations:
- locationName: westus2
operatorSpec:
secrets:
documentEndpoint:
key: endpoint
name: keys2
primaryMasterKey:
key: primarymasterkey
name: keys2
primaryReadonlyMasterKey:
key: primaryreadonlymasterkey
name: keys2
owner:
name: aso-sample-rg
It's possible that there is a race where this sometimes works and sometimes doesn't. I may have just gotten lucky, and if I tried it enough times it would fail.
The main reason I think this should mostly work even now is that, while the DatabaseAccount is still being created, I see this in the logs:
I0408 21:01:12.354554 1 common.go:58] controllers/DatabaseAccountController "msg"="Reconcile invoked" "annotations"={"serviceoperator.azure.com/operator-namespace":"azureserviceoperator-system","serviceoperator.azure.com/reconcile-policy":"skip","serviceoperator.azure.com/resource-id":"/subscriptions/00000000-0000-0000-00000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb"} "conditions"="[Condition [Ready], Status = "False", ObservedGeneration = 1, Severity = "Warning", Reason = "Failed", Message = "extension failed to produce resources for export: failed listing keys: POST https://management.azure.com/subscriptions/00000000-0000-0000-00000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb/listKeys\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 NotFound\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n \"code\": \"NotFound\",\n \"message\": \"Entity with the specified id does not exist in the system. 
More info: https://aka.ms/cosmosdb-tsg-not-found\\r\\nActivityId: c81fab9e-9f50-4e05-9446-2f0b43618dff, Microsoft.Azure.Documents.Common/2.14.0\"\n}\n--------------------------------------------------------------------------------\n", LastTransitionTime = "2024-04-08 21:00:40 +0000 UTC"]" "creationTimestamp"="2024-04-08T21:00:01Z" "deletionTimestamp"=null "finalizers"=["serviceoperator.azure.com/finalizer"] "generation"=1 "kind"={"kind":"DatabaseAccount","apiVersion":"documentdb.azure.com/v1api20210515storage"} "name"="asotestcosmosdb-skip" "namespace"="default" "owner"={"group":"resources.azure.com","kind":"ResourceGroup","name":"aso-sample-rg"} "ownerReferences"=[{"apiVersion":"resources.azure.com/v1api20200601storage","kind":"ResourceGroup","name":"aso-sample-rg","uid":"0e47e177-39ee-4eb8-ada1-ac5a7fa73460"}] "resourceVersion"="1348" "uid"="f139d444-dbbc-4f9d-a7a0-948472abaf84"
Which is what I expected to see when the DB didn't exist yet. So whatever is going on is less than a 100% repro (or else there's something else about your environment we're missing).
In any case, I've created #3925 which should resolve the issue if it does happen.
from azure-service-operator.
Awesome to hear about the fix! Just to check - you created two databases with one ASO/cluster? The issue here is with one database and many ASOs/clusters. We have an environment that spans multiple regions with an AKS cluster in each, but some databases are single global databases with multi-region replication.
You could probably reproduce in one cluster with the same database referenced in two namespaces... So metadata for the databases would have the same name - one with skip and one without. Let me know if this does not make sense
from azure-service-operator.
> You could probably reproduce in one cluster with the same database referenced in two namespaces... So metadata for the databases would have the same name - one with skip and one without. Let me know if this does not make sense
This is what I did (see YAML above) and it worked. I created 2 databases with the same AzureName in the same namespace (I don't think namespace should matter): one with skip and one without.
Maybe ListKeys returning 0 keys only happens if I actually replicate your calling pattern (N regions), due to some routing/caching on the Cosmos DB side? I might give that a try.
from azure-service-operator.