
Comments (10)

karljaxon avatar karljaxon commented on June 27, 2024

karl.log
karl2.log

from azure-service-operator.

matthchr avatar matthchr commented on June 27, 2024

Thanks for reporting this. As discussed on Slack we believe this is a bug.

We'll look over the logs (thank you for sharing them) and see if we can't get this fixed!


matthchr avatar matthchr commented on June 27, 2024

The key logs in question are:

I0328 11:17:21.995373       1 generic_reconciler.go:335] controllers/DatabaseAccountController "msg"="Skipping creation/update of resource due to policy" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts" "serviceoperator.azure.com/reconcile-policy"="skip"
I0328 11:17:21.995449       1 azure_generic_arm_reconciler_instance.go:414] controllers/DatabaseAccountController "msg"="Resource successfully created/updated" "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts" "resourceID"="/subscriptions/ea80ce6b-d4af-44da-b2c1-af925fd21360/resourceGroups/rg-cdoacr4-gl-accounts/providers/Microsoft.DocumentDB/databaseAccounts/cosmos-cdoacr4-gl-accounts"
I0328 11:17:22.124739       1 kubernetes_exporter.go:41] controllers/DatabaseAccountController "msg"="Getting Kubernetes resources for export" "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts"
I0328 11:17:22.124775       1 kubernetes_exporter.go:47] controllers/DatabaseAccountController "msg"="Successfully retrieved Kubernetes resources for export" "ResourcesToWrite"=0 "azureName"="cosmos-cdoacr4-gl-accounts" "name"="cosmos-cdoacr4-gl-accounts" "namespace"="platform-accounts"

So I think what is happening is:

  1. Cluster A (actively managing the account) creates the account.
  2. Cluster B (has skip set) sees the account is created and immediately calls the ListKeys API, which seems to succeed and return 0 keys.

I see two approaches to solving this problem:

  1. Wait for the resource to be in ProvisioningState: Succeeded here
  2. Treat "user asked us to export a key we couldn't find" as a retryable error here. If we make this fix, we should apply it to every resource that supports key export (there are quite a few).

We should probably do both of these, but I believe for the purposes of this bug, either fix will work.


matthchr avatar matthchr commented on June 27, 2024

The logs above show what must have happened, but when I tried to reproduce the behavior myself I wasn't able to.

Here's what I did to reproduce:

  1. Installed ASO v2.3.0 into a local kind cluster.
  2. Applied 1 ResourceGroup and 2 DatabaseAccounts (one with the reconcile-policy annotation set to skip). Each DatabaseAccount writes its secrets to a different secret (keys1 for the actively managed one, keys2 for the skip one).
  3. Waited until the DatabaseAccount was created.
  4. Checked the secrets.

Here's what I saw:

k get databaseaccounts.documentdb.azure.com 
NAME                   READY   SEVERITY   REASON        MESSAGE
asotestcosmosdb        False   Info       Reconciling   The resource is in the process of being reconciled by the operator
asotestcosmosdb-skip   False   Warning    Failed        extension failed to produce resources for export: failed listing keys: POST https://management.azure.com/subscriptions/00000000-0000-0000-000000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb/listKeys...

After some waiting...

k get databaseaccounts.documentdb.azure.com 
NAME                   READY   SEVERITY   REASON      MESSAGE
asotestcosmosdb        True               Succeeded   
asotestcosmosdb-skip   True               Succeeded   

k get secrets 
NAME    TYPE     DATA   AGE
keys1   Opaque   3      19s
keys2   Opaque   3      13s

I also confirmed that both keys1 and keys2 had the same secret data.

The actual resources I applied were:

apiVersion: resources.azure.com/v1api20200601
kind: ResourceGroup
metadata:
  name: aso-sample-rg
  namespace: default
spec:
  location: westcentralus
---
apiVersion: documentdb.azure.com/v1api20210515
kind: DatabaseAccount
metadata:
  name: asotestcosmosdb
  namespace: default
spec:
  databaseAccountOfferType: Standard
  kind: GlobalDocumentDB
  location: westus2
  locations:
  - locationName: westus2
  operatorSpec:
    secrets:
      documentEndpoint:
        key: endpoint
        name: keys1
      primaryMasterKey:
        key: primarymasterkey
        name: keys1
      primaryReadonlyMasterKey:
        key: primaryreadonlymasterkey
        name: keys1
  owner:
    name: aso-sample-rg
---
apiVersion: documentdb.azure.com/v1api20210515
kind: DatabaseAccount
metadata:
  name: asotestcosmosdb-skip
  namespace: default
  annotations:
    serviceoperator.azure.com/reconcile-policy: skip
spec:
  azureName: asotestcosmosdb
  databaseAccountOfferType: Standard
  kind: GlobalDocumentDB
  location: westus2
  locations:
  - locationName: westus2
  operatorSpec:
    secrets:
      documentEndpoint:
        key: endpoint
        name: keys2
      primaryMasterKey:
        key: primarymasterkey
        name: keys2
      primaryReadonlyMasterKey:
        key: primaryreadonlymasterkey
        name: keys2
  owner:
    name: aso-sample-rg

It's possible that there is a race here, so that this sometimes works and sometimes doesn't. I may simply have gotten lucky, and if I tried enough times it would fail.

The main reason I think this should mostly work even now is that, while the DatabaseAccount is still being created, I see this in the logs:

I0408 21:01:12.354554 1 common.go:58] controllers/DatabaseAccountController "msg"="Reconcile invoked" "annotations"={"serviceoperator.azure.com/operator-namespace":"azureserviceoperator-system","serviceoperator.azure.com/reconcile-policy":"skip","serviceoperator.azure.com/resource-id":"/subscriptions/00000000-0000-0000-00000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb"} "conditions"="[Condition [Ready], Status = "False", ObservedGeneration = 1, Severity = "Warning", Reason = "Failed", Message = "extension failed to produce resources for export: failed listing keys: POST https://management.azure.com/subscriptions/00000000-0000-0000-00000000000/resourceGroups/aso-sample-rg/providers/Microsoft.DocumentDB/databaseAccounts/asotestcosmosdb/listKeys\n--------------------------------------------------------------------------------\nRESPONSE 404: 404 NotFound\nERROR CODE: NotFound\n--------------------------------------------------------------------------------\n{\n \"code\": \"NotFound\",\n \"message\": \"Entity with the specified id does not exist in the system. 
More info: https://aka.ms/cosmosdb-tsg-not-found\\r\\nActivityId: c81fab9e-9f50-4e05-9446-2f0b43618dff, Microsoft.Azure.Documents.Common/2.14.0\"\n}\n--------------------------------------------------------------------------------\n", LastTransitionTime = "2024-04-08 21:00:40 +0000 UTC"]" "creationTimestamp"="2024-04-08T21:00:01Z" "deletionTimestamp"=null "finalizers"=["serviceoperator.azure.com/finalizer"] "generation"=1 "kind"={"kind":"DatabaseAccount","apiVersion":"documentdb.azure.com/v1api20210515storage"} "name"="asotestcosmosdb-skip" "namespace"="default" "owner"={"group":"resources.azure.com","kind":"ResourceGroup","name":"aso-sample-rg"} "ownerReferences"=[{"apiVersion":"resources.azure.com/v1api20200601storage","kind":"ResourceGroup","name":"aso-sample-rg","uid":"0e47e177-39ee-4eb8-ada1-ac5a7fa73460"}] "resourceVersion"="1348" "uid"="f139d444-dbbc-4f9d-a7a0-948472abaf84"

This is what I expected to see when the DB didn't exist yet. So whatever is going on is less than a 100% repro (or else there's something else about your environment that we're missing).

In any case, I've created #3925 which should resolve the issue if it does happen.


karljaxon avatar karljaxon commented on June 27, 2024

Awesome to hear about the fix! Just to check - you created two databases with one ASO/cluster? The issue here is with one database and many ASOs/clusters. We have an environment that spans multiple regions with an AKS cluster in each, but some databases are single global databases with multi-region replication.

You could probably reproduce in one cluster with the same database referenced in two namespaces... So metadata for the databases would have the same name - one with skip and one without. Let me know if this does not make sense


matthchr avatar matthchr commented on June 27, 2024

You could probably reproduce in one cluster with the same database referenced in two namespaces... So metadata for the databases would have the same name - one with skip and one without. Let me know if this does not make sense

This is what I did (see the YAML above) and it worked. I created 2 DatabaseAccounts with the same azureName in the same namespace (I don't think the namespace should matter), one with skip and one without.

Maybe ListKeys returning 0 keys only happens if I actually replicate your calling pattern (N regions), due to some routing/caching on the Cosmos DB side? I might give that a try.

