
aro-rp's Introduction

Azure Red Hat OpenShift Resource Provider

Welcome!

For information relating to the generally available Azure Red Hat OpenShift v4 service, please see the following links:

Quickstarts

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

Before you start development, please set up your local git hooks to conform to our development standards:

make init-contrib

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repositories using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Repository map

  • .pipelines: CI workflows using Azure pipelines.

  • cmd/aro: RP entrypoint.

  • deploy: ARM templates to deploy RP in development and production.

  • docs: Documentation.

  • hack: Build scripts and utilities.

  • pkg: RP source code:

    • pkg/api: RP internal and external API definitions.

    • pkg/backend: RP backend workers.

    • pkg/bootstraplogging: Bootstrap logging configuration.

    • pkg/client: Autogenerated ARO service Go client.

    • pkg/cluster: Cluster create/update/delete operations wrapper for OCP installer.

    • pkg/database: RP CosmosDB wrapper layer.

    • pkg/deploy: /deploy ARM template generation code.

    • pkg/env: RP environment-specific shims for running in production, development, or test.

    • pkg/frontend: RP frontend webserver.

    • pkg/metrics: Handles RP metrics via statsd.

    • pkg/mirror: OpenShift release mirror tooling.

    • pkg/monitor: Monitors running clusters.

    • pkg/operator/controllers: A list of controllers instantiated by the operator component.

      • alertwebhook: Ensures that the receiver endpoint defined in the alertmanager-main secret matches the webserver endpoint at aro-operator-master.openshift-azure-operator:8080, to avoid the AlertmanagerReceiversNotConfigured warning.

      • checker: Watches the Cluster resource for changes and updates the resource's conditions based on the checks below:

        • internetchecker: validates outbound internet connectivity from the nodes

        • serviceprincipalchecker: validates that the cluster service principal has the correct role/permissions

      • clusteroperatoraro: Ensures that the ARO cluster object is consistent and immutable.

      • dnsmasq: Ensures that a dnsmasq systemd service is defined as a machineconfig for all nodes. The dnsmasq config contains records for azure load balancers such as api, api-int and *.apps domains so they will resolve even if custom DNS on the VNET is set.

      • genevalogging: Ensures all the Geneva logging resources in the openshift-azure-logging namespace match the pre-defined specification found in pkg/operator/controllers/genevalogging/genevalogging.go.

      • imageconfig: Ensures that required registries are not blocked in image.config.

      • machine: Validates that machine objects have the correct provider spec, VM type, VM image, and disk size, that three master nodes exist, and that the number of worker nodes matches the desired worker replicas.

      • machineset: Ensures that a minimum of two worker replicas is maintained.

      • machinehealthcheck: Ensures the MachineHealthCheck resource is running as configured. See machinehealthcheck/doc.go

      • monitoring: Ensures that the OpenShift monitoring configuration in the openshift-monitoring namespace is consistent and immutable.

      • node: Force deletes pods when a node fails to drain for 1 hour. It should clear up any pods that refuse to be evicted on a drain due to violating a pod disruption budget.

      • pullsecret: Ensures that the ACR credentials in the openshift-config/pull-secret secret match those in the openshift/azure-operator/cluster secret.

      • rbac: Ensures that the aro-sre clusterrole and clusterrolebinding exist and are consistent.

      • routefix: Ensures all the routefix resources in the namespace openshift-azure-routefix remain on the cluster.

      • subnets: Ensures that the Network Security Groups (NSGs) are correct, and updates the Azure Machine Provider spec with subnet, vnet, and Network Resource Group.

      • workaround: Applies a set of temporary workarounds to the ARO cluster.

      • previewfeature: Allows toggling certain features that are not yet enabled by default.

    • pkg/portal: Portal for running promql queries against a cluster or requesting a kubeconfig for a cluster.

    • pkg/proxy: Proxy service for portal kubeconfig cluster access.

    • pkg/swagger: Swagger specification generation code.

    • pkg/util: Utility libraries.

  • python: Autogenerated ARO service Python client and az aro client extension.

  • swagger: Autogenerated ARO service Swagger specification.

  • test: End-to-end tests.

  • vendor: Vendored Go libraries.

Basic architecture

  • pkg/frontend is intended to become a spec-compliant RP web server. It is backed by CosmosDB. Incoming PUT/DELETE requests are written to the database with a non-terminal (Updating/Deleting) provisioningState.

  • pkg/backend reads documents with non-terminal provisioningStates, asynchronously updates them, and finally updates each document with a terminal provisioningState (Succeeded/Failed). The backend keeps updating the document with a heartbeat; if the heartbeat fails, the document will be picked up by a different worker.

  • As CosmosDB does not support document patch, care is taken to correctly pass through any fields in the internal model which the reader is unaware of (see github.com/ugorji/go/codec.MissingFielder). This is intended to help in upgrade cases and (in the future) with multiple microservices reading from the database in parallel.

  • Care is taken to correctly use optimistic concurrency to avoid document corruption through concurrent writes (see RetryOnPreconditionFailed).

  • The pkg/api architecture differs somewhat from github.com/openshift/openshift-azure: the intention is to fix the broken merge semantics and try pushing validation into the versioned APIs to improve error reporting.

  • Everything is intended to be crash/restart/upgrade-safe, horizontally scalable, upgradeable...



aro-rp's Issues

cosmosdb load test

The cosmosdb package does exponential backoff on requests when RUs are exceeded.

We need a reproducible load test so we can be sure that, if RUs are exceeded, the cosmos client retries until the load is evenly distributed.

condense database patches in install

Currently we patch the database, adding information all over the place during install, so a sharp-eyed user doing lots of GETs during install will see different values added gradually over the install process.

It would be better to minimise the number of patches and, where possible, add as much of the data atomically in one place.

Client refresh / credential caching issues in dev and prod

Saw an az aro create fail with the following storage deployment error message:

{
    "error": {
        "code": "AuthorizationFailed",
        "message": "The client 'b9bc1545-a684-4e50-9a4d-2e2ee90bed6a' with object id 'b9bc1545-a684-4e50-9a4d-2e2ee90bed6a' does not have authorization to perform action 'Microsoft.Storage/storageAccounts/write' over scope '/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourcegroups/aro-6hwuovvq/providers/Microsoft.Storage/storageAccounts/clusterixgbq' or the scope is invalid. If access was recently granted, please refresh your credentials."
    }
}

Need to understand how to fix this.

Error reporting in asyncoperationstatus

There is provision to report operation errors back to the user via the asyncoperationstatus API, but we are currently hard-coding any backend error to "internal server error".

There may be a few cases where we want to be clearer about what the backend error was - dns name collision is one example.

Inadequate quota is another.

[Proposal][wip]RP metrics

The RP should emit statsd metrics (for compatibility with Geneva).

We have 2 ways to achieve this goal:

  1. export all metrics using statsd
  2. export all metrics using prometheus and convert to statsd the same way as we did in V3

Prometheus pros:
1. Wider adoption
2. Easier orchestration
3. RH stack compatibility
4. Easier future integration

Prom cons:
1. Percentile computation may not give an overall picture, as percentiles are calculated at each instance level, not at the global level.
2. More memory is used, as the metrics are stored locally in application memory.

Statsd pros:
1. Aggregation at the server level
2. Easy to implement (straightforward protocol)
3. Percentiles and histograms are calculated server-side, with relatively less overhead in the client application.

Statsd cons:
1. Compatibility
2. Local testing will be harder (with Prometheus we can run an instance of Prometheus, scrape it, and test/validate, or just curl /metrics). With statsd we will need to send metrics to a "sink" for testing: `nc -u -w0 127.0.0.1 8125`

StatsD client example: https://github.com/statsd/statsd/blob/master/examples/go/statsd.go
https://godoc.org/github.com/etsy/statsd/examples/go

The proposal is to create a subpackage, pkg/metrics, with methods to record different metrics with configured dimensions/tags.

When initializing the RP, some dimensions would need to be provided, such as region, name, and location.

Metrics would be recorded via statsd, as in the examples below:

Example metrics:

Statsd:
aro_frontend_call.v20191231-preview.openShiftcluster.timers.t:84.2|ms
aro_frontend_call.{api-version}.{openshiftcluster,asyncoperation,openShiftclustercredentialssubscription}.timers.t:84.2|ms
Prometheus:
aro_frontend_call{api_name="openshiftcluster", version="v20191231-preview", method="post", quantile="0.5"}0.06
# if the statsd exporter is used, it will automatically convert into a Prometheus summary
aro_frontend_call_sum{api_name="openshiftcluster", version="v20191231-preview", method="post"}0.79
aro_frontend_call_count{api_name="openshiftcluster", version="v20191231-preview", method="post"}20

Proposed metrics:
Where we use dimensions to add metadata like: rp_name, location, region, environment

aro_frontend_call.{version}.{api}.timers:v|ms - frontend api call execution times
aro_frontend_call.{version}.{api}.errors.{code}.counters:v|c - error counts (all 4xx, 5xx codes)
aro_frontend_call.{version}.{api}.success.{code}.counters:v|c - success counts (all 2xx)
aro_frontend_call.sessions.gauges:v|g - current open sessions

aro_backend.workers.gauges:10|g - current worker count
aro_backend.errors.counters:10|c - total errors in backend

aro_cosmosdb_call.{method}.{database_name}.timers:v|ms
aro_cosmosdb_call.{method}.{database_name}.counters:v|c

@jim-minter @m1kola @asalkeld WDYT?

Make backend goroutines cancellable

jim-minter/rp#57

Backend goroutines should use a cancellable context which is owned by the corresponding heartbeat goroutine. In this way if the heartbeat realises it has lost the lease, it can cancel the work in progress.

Reduce number of install phases, but split them into smaller functions

Currently we have 3 install phases, and the functions that implement these phases are big:

case api.InstallPhaseDeployStorage:
	err := i.installStorage(ctx, installConfig, platformCreds, image)
	if err != nil {
		return err
	}
case api.InstallPhaseDeployResources:
	err := i.installResources(ctx)
	if err != nil {
		return err
	}
case api.InstallPhaseRemoveBootstrap:
	err := i.removeBootstrap(ctx)
	if err != nil {
		return err
	}
	i.doc, err = i.db.PatchWithLease(ctx, i.doc.Key, func(doc *api.OpenShiftClusterDocument) error {
		doc.OpenShiftCluster.Properties.Install = nil
		return nil
	})
	return err

Would be great to reduce the number of install phases to two:

  • everything before we delete the bootstrap
  • bootstrap deletion and everything after

But at the same time, move every step of the process into a separate, testable function.

That way we will be able to extend and more easily unit test the installation process, while keeping the two major phases separate.

Background: #85 (comment)

az aro create failed

az aro create failed with a fresh vnet:

This command group is in preview. It may be changed/removed in a future release.
Deployment failed. Correlation ID: bf750d65-99ae-4945-8327-298dc79fed44. Internal server error.

Second error:

Subnet GatewaySubnet is in use by /subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.Network/virtualNetworkGateways/dev-vpn/ipConfigurations/default and cannot be deleted. In order to delete the subnet, delete all the resources within the subnet. See aka.ms/deletesubnet.

My assumption is that when using az aro create we should add the required role binding to the resources. But, as we already know, RBAC sometimes lags in propagating due to caching in different RPs, despite the assignment call returning 200. This makes the first create fail in obscure and random ways.

scaling verification for workers

While scaling v3 to 100 nodes and back, we got this:

FATA[2019-12-18T04:26:04+11:00] containerservice.OpenShiftManagedClustersClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: error response cannot be parsed: "400 Bad Request: Failed to apply request: UpdateWorkerAgentPoolDeleteVM: Code=\"SpecifiedAllocatedOutboundPortsForOutboundRuleExceedsTotalNumberOfAvailablePorts\" Message=\"Specified Allocated Outbound Ports 1024 for Outbound Rule /subscriptions/.../resourceGroups/pnasrat-testscale/providers/Microsoft.Network/loadBalancers/kubernetes/outboundRules/outbound exceeds total number of available ports 624. Reduce allocated ports or increase number of IP addresses for outbound rule.\"\n" error: json: cannot unmarshal number into Go value of type azure.RequestError 

Need to investigate this for v4 and, if needed, make the required upstream changes.

E2E creation test should measure creation time

I would like to have an e2e test which, apart from validating the create operation itself, also makes sure that we do not accidentally make cluster creation longer by changing the install process or by updating dependencies (the installer, for example).

This might require a discussion to define a time limit which we should not exceed.

add go-cosmosdb package metrics

The github.com/jim-minter/go-cosmosdb package retries on http.StatusTooManyRequests.

We need to add metrics instrumentation to the cosmosdb package so we can export those retry metrics.

Prepare a shared RP development environment - Unique Values needs to be used for DB / KV

Going through Prepare a shared RP development environment.md

I am getting:

Deployment failed. Correlation ID: 47a29499-2716-4402-8103-7d30d48ee154. {
  "error": {
    "code": "VaultAlreadyExists",
    "message": "The name 'v4-eastus' is already in use."
  }
}
Deployment failed. Correlation ID: bdc97687-c7a9-4f8e-a009-c75f948c0cc8. {
  "code": "BadRequest",
  "message": "DatabaseAccount name 'v4-eastus' already exists.\r\nActivityId: 418845f2-6fd5-4f32-9ba3-d66c35a4949c, Microsoft.Azure.Documents.Common/2.9.2"
}

Will open a PR after a successful deployment.

Thanks

move ImagePullSecret to internal model

Preserve ImagePullSecret in the internal data model.

Open question:
Do we want customers to be able to provide their own pull secret? If it is not provided, we can use the RP's. We would need to check that the customer's secret is valid in the validation code.

Add installation error metrics

Apart from tracking general cluster creation time (#91), it would be nice to see where we fail most frequently during the cluster installation process, to identify the areas where we need to improve.

I think we should report more granularly than install phases. It might be better/easier to do this after #106.

This might also be used to observe error spikes across subscriptions.

Background: #85 (comment)

ugorji/go/codec.AddExt move to SetBytesExt

AddExt registers an encode and decode function for a reflect.Type. To deregister an Ext, call AddExt with a nil encfn and/or nil decfn.

Deprecated: Use SetBytesExt or SetInterfaceExt on the Handle instead.


PE/PLS deletion failure when cluster RG is deleted by customer

When a customer deletes the cluster resource group (aro-xxxxxx) manually via the CLI/Portal, PLS deletion will fail because the PE will still exist.

I guess there is no clean way to handle this; the customer should trigger a DELETE operation via the RP.

investigate usages of ctx over stop ch

Currently, in some places we use stop channels to terminate execution.

We should investigate replacing them with ctx so we have a unified way to control execution termination.

Unit test MissingFields

Every struct of every internal API type should include MissingFields, as this is what enables clients to safely update the struct without knowing the entire API schema. This enables both concurrent operation of RP binaries at version N and N+1 during upgrade, as well as potentially other microservices to use the same database collection.

Add a unit test that ensures that every struct of every internal API type includes MissingFields.

Check route table configuration

The 4.3 installer creates a route table resource but doesn't actually plumb it into any subnets.

Check whether the route table is actually necessary. Depending on the answer, plumb it correctly or remove it. Also fix the upstream installer.

Console branding fails

INFO[2020-01-30T11:18:12Z] pkg/install/2-removebootstrap.go:57 install.(*Installer).removeBootstrap() removing bootstrap nic  component=backend resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis
INFO[2020-01-30T11:18:13Z] pkg/install/2-removebootstrap.go:159 install.(*Installer).updateConsoleBranding() updating console branding  component=backend resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis
ERRO[2020-01-30T11:18:14Z] pkg/backend/openshiftcluster.go:94 backend.(*openShiftClusterBackend).handle() consoles.operator.openshift.io "cluster" not found  component=backend resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis
INFO[2020-01-30T11:18:14Z] pkg/backend/openshiftcluster.go:61 backend.(*openShiftClusterBackend).try.func1.1() done  component=backend duration=1389.907730343 resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis

cosmosdb metrics panic

ERRO[2020-01-27T19:10:40Z] pkg/util/recover/recover.go:18 recover.Panic() runtime error: invalid memory address or nil pointer dereference  component=backend                                                                                                                                      
INFO[2020-01-27T19:10:40Z] pkg/util/recover/recover.go:19 recover.Panic() goroutine 136 [running]:                                                                                                                                                                                                 
runtime/debug.Stack(0xc00032c380, 0x2, 0xc001a62fa8)                                                                                                                                                                                                                                               
        /home/mjudeiki/.gvm/gos/go1.13.5/src/runtime/debug/stack.go:24 +0x9d                                                                                                                                                                                                                       
github.com/Azure/ARO-RP/pkg/util/recover.Panic(0xc00032c380)                                                                                                                                                                                                                                       
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/util/recover/recover.go:19 +0x9a                                                                                                                                                                                                         
panic(0x26ccae0, 0x4863c60)                                                                                                                                                                                                                                                                        
        /home/mjudeiki/.gvm/gos/go1.13.5/src/runtime/panic.go:679 +0x1b2                                                                                                                                                                                                                           
github.com/Azure/ARO-RP/pkg/metrics/statsd/cosmosdb.(*tracerRoundTripper).RoundTrip.func1(0xc001a63260, 0xc001a63268, 0xc00065c6f0, 0xc0009d8200, 0xbf83eb448dac88cf, 0x220df00561f, 0x48fcce0)                                                                                                    
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/metrics/statsd/cosmosdb/metrics.go:40 +0x3a                                                                                                                                                                                              
github.com/Azure/ARO-RP/pkg/metrics/statsd/cosmosdb.(*tracerRoundTripper).RoundTrip(0xc00065c6f0, 0xc0009d8200, 0x0, 0x301a4c0, 0xc0002dd2b0)                                                                                                                                                      
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/metrics/statsd/cosmosdb/metrics.go:78 +0x11d                                                                                                                                                                                             
net/http.send(0xc0009d8000, 0x301a6e0, 0xc00065c6f0, 0xbf83eb4c0dac5305, 0x227db23cc57, 0x48fcce0, 0xc000120670, 0xbf83eb4c0dac5305, 0x1, 0x0)                                                                                                                                                     
        /home/mjudeiki/.gvm/gos/go1.13.5/src/net/http/client.go:250 +0x443                                                                                                                                                                                                                         
net/http.(*Client).send(0xc00065c7b0, 0xc0009d8000, 0xbf83eb4c0dac5305, 0x227db23cc57, 0x48fcce0, 0xc000120670, 0x0, 0x1, 0x48fe200)                                                                                                                                                               
        /home/mjudeiki/.gvm/gos/go1.13.5/src/net/http/client.go:174 +0xfa                                                                                                                                                                                                                          
net/http.(*Client).do(0xc00065c7b0, 0xc0009d8000, 0x0, 0x0, 0x0)                                                                                                                                                                                                                                   
        /home/mjudeiki/.gvm/gos/go1.13.5/src/net/http/client.go:641 +0x3ce                                                                                                                                                                                                                         
net/http.(*Client).Do(...)                                                                                                                                                                                                                                                                         
        /home/mjudeiki/.gvm/gos/go1.13.5/src/net/http/client.go:509                                                                                                                                                                                                                                
github.com/Azure/ARO-RP/pkg/database/cosmosdb.(*databaseClient)._do(0xc000a28d70, 0x3085080, 0xc0000f4000, 0x2b43039, 0x4, 0xc002fde7e0, 0x29, 0x2b43301, 0x4, 0xc001ac4090, ...)                                                                                                                  
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/database/cosmosdb/zz_generated_cosmosdb.go:134 +0x3d0                                                                                                                                                                                    
github.com/Azure/ARO-RP/pkg/database/cosmosdb.(*databaseClient).do(0xc000a28d70, 0x3085080, 0xc0000f4000, 0x2b43039, 0x4, 0xc002fde7e0, 0x29, 0x2b43301, 0x4, 0xc001ac4090, ...)                                                                                                                   
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/database/cosmosdb/zz_generated_cosmosdb.go:83 +0x165                                                                                                                                                                                     
github.com/Azure/ARO-RP/pkg/database/cosmosdb.(*openShiftClusterDocumentQueryIterator).NextRaw(0xc000b979c0, 0x3085080, 0xc0000f4000, 0x2430240, 0xc000120638, 0x12a7311, 0x2aa0ac0)                                                                                                               
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/database/cosmosdb/zz_generated_openshiftclusterdocument.go:280 +0x5dd                                                                                                                                                                    
github.com/Azure/ARO-RP/pkg/database/cosmosdb.(*openShiftClusterDocumentQueryIterator).Next(0xc000b979c0, 0x3085080, 0xc0000f4000, 0xc002cbef60, 0x0, 0x304ba80)                                                                                                                                   
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/database/cosmosdb/zz_generated_openshiftclusterdocument.go:253 +0x6e                                                                                                                                                                     
github.com/Azure/ARO-RP/pkg/database.(*openShiftClusters).Dequeue(0xc0003b4570, 0x3085080, 0xc0000f4000, 0xc0000f4000, 0x0, 0x0)                                                                                                                                                                   
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/database/openshiftclusters.go:248 +0x266                                                                                                                                                                                                 
github.com/Azure/ARO-RP/pkg/backend.(*openShiftClusterBackend).try(0xc0001f42d8, 0x3085080, 0xc0000f4000, 0xc000aa0300, 0x0, 0x0)                                                                                                                                                                  
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/backend/openshiftcluster.go:28 +0x6f                                                                                                                                                                                                     
github.com/Azure/ARO-RP/pkg/backend.(*backend).Run(0xc00032c3f0, 0x3085080, 0xc0000f4000, 0xc000aa0300)                                                                                                                                                                                            
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/pkg/backend/backend.go:89 +0x206                                                           
created by main.rp                                                      
        /home/mjudeiki/go/src/github.com/Azure/ARO-RP/cmd/aro/rp.go:61 +0x891  component=backend                       

Double-check SNAT/ILB architecture

Early in OpenShift 4.2, masters called back to themselves recursively via the ILB. Check that doesn't happen in the ARO architecture.

Investigate "Unable to find status link for polling."

14/1

$ az aro update -g $RESOURCEGROUP -n $CLUSTER --worker-count 4
This command group is in preview. It may be changed/removed in a future release.
Deployment failed. Correlation ID: 3e26b7c0-a05a-4782-8794-adc56fd892b6. Unable to find status link for polling.

Move code generation comments into separate files

It would be nice to have code-generation-related comments in separate files (e.g. a generate.go in each package): it makes clear that the package has something to do with go generate and where modifications to the code generation should be made. Some packages already have this file; some still need to be updated.

As an extra exercise, we could write a small tool to enforce this in CI (see the go/ast and go/parser packages for reference).

See #69 (comment) for the background.

ARM Failed - SNAT issue

Deployment failed with an SNAT error:

ERRO[2020-01-02T11:16:18Z] pkg/backend/openshiftcluster.go:82 backend.(*openShiftClusterBackend).handle() Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details." Details=[{"code":"BadRequest","message":"{\r\n  \"error\": {\r\n    \"code\": \"LoadBalancingRuleMustDisableSNATSinceSameFrontendIPConfigurationIsReferencedByOutboundRule\",\r\n    \"message\": \"Load Balancing Rules /subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/mjudeikis-cluster/providers/Microsoft.Network/loadBalancers/aro-public-lb/loadBalancingRules/api-internal must disable snat since same FrontendIPConfiguration /subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/mjudeikis-cluster/providers/Microsoft.Network/loadBalancers/aro-public-lb/frontendIPConfigurations/public-lb-ip is referenced by Outbound Rules /subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/mjudeikis-cluster/providers/Microsoft.Network/loadBalancers/aro-public-lb/outboundRules/api-internal-outboundrule\",\r\n    \"details\": []\r\n  }\r\n}"}]  component=backend resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis-cluster

api-internal must disable snat since same FrontendIPConfiguration /subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/mjudeikis-cluster/providers/Microsoft.Network/loadBalancers/aro-public-lb/frontendIPConfigurations/public-lb-ip
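The fix ARM is asking for is to set disableOutboundSnat on the load-balancing rule that shares its frontend IP configuration with an outbound rule. A hedged fragment of what that rule might look like in the generated template (names taken from the error above; ports and the surrounding template structure are assumptions):

```json
{
  "name": "api-internal",
  "properties": {
    "frontendIPConfiguration": {
      "id": "[resourceId('Microsoft.Network/loadBalancers/frontendIPConfigurations', 'aro-public-lb', 'public-lb-ip')]"
    },
    "disableOutboundSnat": true,
    "protocol": "Tcp",
    "frontendPort": 6443,
    "backendPort": 6443
  }
}
```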

ARM CreateOrUpdate failed - credential issue

ERRO[2020-01-02T11:03:49Z] pkg/backend/openshiftcluster.go:82 backend.(*openShiftClusterBackend).handle() resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=403 -- Original Error: Code="AuthorizationFailed" Message="The client 'bc46cb3a-97e9-4d05-8e76-d2659a531a33' with object id 'bc46cb3a-97e9-4d05-8e76-d2659a531a33' does not have authorization to perform action 'Microsoft.Resources/deployments/write' over scope '/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourcegroups/mjudeikis-cluster/providers/Microsoft.Resources/deployments/azuredeploy' or the scope is invalid. If access was recently granted, please refresh your credentials."  component=backend resource=/subscriptions/225e02bc-43d0-43d1-a01a-17e584a4ef69/resourceGroups/v4-eastus/providers/Microsoft.RedHatOpenShift/openShiftClusters/mjudeikis-cluster

Sometimes refreshing the authorizer does not take effect in time:

// Resolve the first-party service principal's object ID from its app ID.
res, err := d.applications.GetServicePrincipalsIDByAppID(ctx, os.Getenv("AZURE_FP_CLIENT_ID"))
if err != nil {
	return err
}

// Grant the service principal a role on the cluster resource group.
_, err = d.roleassignments.Create(ctx, "/subscriptions/"+d.SubscriptionID()+"/resourceGroups/"+oc.Properties.ResourceGroup, uuid.NewV4().String(), mgmtauthorization.RoleAssignmentCreateParameters{
	Properties: &mgmtauthorization.RoleAssignmentProperties{
		RoleDefinitionID: to.StringPtr("/subscriptions/" + d.SubscriptionID() + "/providers/Microsoft.Authorization/roleDefinitions/c95361b8-cf7c-40a1-ad0a-df9f39a30225"),
		PrincipalID:      res.Value,
	},
})
if err != nil {
	// Tolerate RoleAssignmentExists so the call is idempotent.
	var ignore bool
	if err, ok := err.(autorest.DetailedError); ok {
		if err, ok := err.Original.(*azure.RequestError); ok && err.ServiceError != nil && err.ServiceError.Code == "RoleAssignmentExists" {
			ignore = true
		}
	}
	if !ignore {
		return err
	}
}

d.log.Print("development mode: refreshing authorizer")
return fpAuthorizer.(*refreshableAuthorizer).Refresh()

We need to validate that write access is already effective before handing it over.

/kind bug

Client error count metric.

For each of the Azure, Cosmos DB and Kubernetes client libraries, emit a counter that increments by 1 whenever the API returns a non-nil error.
