gardener / gardener-extension-provider-azure Goto Github PK

View Code? Open in Web Editor NEW

9.0 15.0 77.0 39.03 MB

Gardener extension controller for the Azure cloud provider (https://azure.microsoft.com).

Home Page: https://gardener.cloud

License: Apache License 2.0

Shell 1.80% Dockerfile 0.09% Makefile 0.67% Go 93.94% Smarty 2.34% HCL 0.84% Python 0.33%

gardener-extension-provider-azure's People

Stargazers

Watchers

Forkers

guydc timebertt hardikdr ialidzhikov mvladev georgekuruvillak timuthy vlvasilev rfranzke vpnachev musse dguendisch sdojjy stoyanr swapnilgm schrodit mandelsoft kristian-zh dkistner kon-angelo danielfoehrkn prashanth26 krgostev beckermax amshuman-kr isgasho axiomsamarth martinweindel jbasement neo-liang-sap tareksha kwong-dev happytobi yuvalman voelzmo n-boshnakov dimityrmirchev plkokanov ary1992 acumino himanshu-kun shafeeqes kostov6 christianhuening rakesh-garimella friedrichwilken hendrikkahl tedteng scheererj denitiu mrbatschner istvanballok isabella232 dimitar-kostadinov gowrisankar22 rishabh-11 shreyas-s-rao sallyan elchead aleksandarsavchev docktofuture dergeberl oliver-goetz simonkienzler elankath jguipi 23technologies axel7born raphaelvogel seshachalam-yv andreasburger vogelhome hebelsan aaronfern renormalize rickardsjp lucabernstein

gardener-extension-provider-azure's Issues

Implement `Infrastructure` controller for Azure provider

Similar to how we have implemented the Infrastructure extension resource controller for the AWS provider let's please now do it for Azure.

Based on the current implementation the InfrastructureConfig should look like this:

apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureConfig
resourceGroup:
  name: mygroup
networks:
  vnet: # specify either 'name' or 'cidr'
  # name: my-vnet
    cidr: 10.250.0.0/16
  workers: 10.250.0.0/19

Based on the current implementation the InfrastructureStatus should look like this:

---
apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureStatus
resourceGroup:
  name: mygroup
networks:
  vnet:
    name: my-vnet
  subnets:
  - purpose: nodes
    name: my-subnet
availabilitySets:
- purpose: nodes
  id: av-set-id
  name: av-set-name
routeTables:
- purpose: nodes
  name: route-table-name
securityGroups:
- purpose: nodes
  name: sec-group-name

The current infrastructure creation/deletion implementation can be found here. Please try to change as little as possible (with every change the risk that we break something increases!) and just move the code over into the extensions infrastructure actuator.

SeedNetworkPoliciesTest fails always

From gardener-attic/gardener-extensions#293

What happened:
The test defined in SeedNetworkPoliciesTest.yaml fails always.
Most of the time the following 3 specs fail:

2019-07-29 11:32:33	Test Suite Failed
2019-07-29 11:32:33	Ginkgo ran 1 suite in 3m20.280138435s
2x		2019-07-29 11:32:33	
2019-07-29 11:32:32	FAIL! -- 375 Passed | 3 Failed | 0 Pending | 126 Skipped
2019-07-29 11:32:32	Ran 378 of 504 Specs in 85.218 seconds
2019-07-29 11:32:32	
2019-07-29 11:32:32	> /go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1194
2019-07-29 11:32:32	[Fail] Network Policy Testing egress for mirrored pods elasticsearch-logging [AfterEach] should block connection to "Garden Prometheus" prometheus-web.garden:80
2019-07-29 11:32:32	
2019-07-29 11:32:32	/go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1062
2019-07-29 11:32:32	[Fail] Network Policy Testing components are selected by correct policies [AfterEach] gardener-resource-manager
2019-07-29 11:32:32	
2019-07-29 11:32:32	/go/src/github.com/gardener/gardener/test/integration/seeds/networkpolicies/aws/networkpolicy_aws_test.go:1194
2019-07-29 11:32:32	[Fail] Network Policy Testing egress for mirrored pods gardener-resource-manager [AfterEach] should block connection to "External host" 8.8.8.8:53

@mvladev can you please check?

Environment:
TestMachinery on all landscapes (dev, ..., live)

NAT Gateway Support

What would you like to be added:
Azure will offer soon a NAT service (currently in a private preview) and there are some scenarios when users could need a dedicated nat service e.g. whitelisting scenarios which require a stable ip(s) for egress connections initiated within the cluster.

Currently all egress traffic from a Gardener managed Azure cluster is routed via the cluster load balancer.

As the nat gateway will come with additional costs I would recommend to integrate it optionally and make it configurable for the users.

As the nat gateway require always to have at least one public ip assigned. I would propose to make it possible for users to pass their public ip address(es) or public ip address range(s) to the extension via .spec.providerConfig.networks.natGateway.ipAddresses[] | .ipAddressRanges[]. Only if both lists are empty the Gardener extension would create one public ip and assign it to the service.

spec:
  type: azure
  ...
  providerConfig:
    apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      natGateway:
        enabled: <true|false>
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
    ...

Later on when we go on with multiple Availability Set support we will probably need to make the nat gateway required. In this case the .spec.providerConfig.networks.natGateway.enabled need to be always true.

Why is this needed:
Support scenarios which require a dedicated nat service e.g. whitelisting scenarios.

Status

Step 1 – NatGateway integration with one public ip attached by Gardener (no bring your own ip for the NatGateway in the first step due to hashicorp/terraform-provider-azurerm#6052) and only for zoned cluster → #50
Step 2 – Enable bring your own ip for zoned clusters (when hashicorp/terraform-provider-azurerm#6052 is fixed an a new version of the azurerm Terraform provider is released) → #54
Step 3 – Enable NatGateway for AvailabilitySet based/non zoned clusters, when Standard Load Balancer is integrated for non zoned clusters (larger effort, due to LoadBalancer migration etc.). This will probably will require to deploy the NatGateway mandatory/non optional in addition to the Standard LoadBalancer.

cc @vlerenc, @AndreasBurger, @MSSedusch

Minimal Permissions for user credentials

From gardener-attic/gardener-extensions#133

We have narrowed down the access permissions for AWS shoot clusters (potential remainder tracked in #178), but not yet for Azure, GCP and OpenStack, which this ticket is now about. We expect less success on these infrastructures as AWSes permision/policy options are very detailed. This may break the "shared account" idea on these infrastructures (Azure and GCP - OpenStack can be mitigated by programmatically creating tenants on the fly).

Adopt CSI for Azure extension beginning with Kubernetes 1.21

As of Kubernetes 1.18 the in-tree Azure Disk volume driver is deprecated in favour of CSI. We should adopt CSI for 1.18+ clusters (where the Azure Disk migration will be promoted to beta as well)

Specify volumeBindingMode:WaitForFirstConsumer in default storage class

/area storage
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
PVs shall be created in the zone where the pod will be scheduled to.

Why is this needed:
From https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode:

By default, the Immediate mode indicates that volume binding and dynamic provisioning occurs once the PersistentVolumeClaim is created. For storage backends that are topology-constrained and not globally accessible from all Nodes in the cluster, PersistentVolumes will be bound or provisioned without knowledge of the Pod's scheduling requirements. This may result in unschedulable Pods.

We use Immediate, but should rather use WaitForFirstConsumer, wouldn't you agree?

Implement `Worker` controller for Azure provider

Similar to how we have implemented the Worker extension resource controller for the AWS provider let's please now do it for Azure.

There is no special provider config required to be implemented, however, we should have component configuration for the controller that should look as follows:

---
apiVersion: azure.provider.extensions.config.gardener.cloud/v1alpha1
kind: ControllerConfiguration
machineImages:
- name: coreos
  version: 1967.5.0
  publisher: CoreOS
  offer: CoreOS
  sku: Stable

Clean up the handling for .spec.provider.workers[].machine.volume.type=standard

gardener-attic/gardener-extensions#401 introduces the following code fragment for backwards-compatibility reasons:

gardener-extension-provider-azure/pkg/controller/worker/machines.go

Lines 306 to 315 in b4c3648

    
           // In the past the volume type information was not passed to the machineclass. 
        
           // In consequence the Machine controller manager has created machines always 
        
           // with the default volume type of the requested machine type. Existing clusters 
        
           // respectively their worker pools could have an invalid volume configuration 
        
           // which was not applied. To do not damage existing cluster we will set for 
        
           // now the volume type only if it's a valid Azure volume type. 
        
           // Otherwise we will still use the default volume of the machine type. 
        
           if pool.Volume.Type != nil && (*pool.Volume.Type == "Standard_LRS" || *pool.Volume.Type == "StandardSSD_LRS" || *pool.Volume.Type == "Premium_LRS") { 
        
           	osDisk["type"] = *pool.Volume.Type 
        
           }

Prior to gardener-attic/gardener-extensions#401, the azure machines were always with the default os disk belonging to the requested machine type.

We already have in place the validation that prevents creation of a new cluster with volume type != ["Standard_LRS", "StandardSSD_LRS", "Premium_LRS"].

$ k apply -f ~/bar.yaml --dry-run=server
Error from server (Forbidden): error when creating "/Users/foo/bar.yaml": shoots.core.gardener.cloud "test" is forbidden: [spec.provider.workers[0].volume.type: Unsupported value: core.Volume{Name:(*string)(nil), Type:(*string)(0xc013ff5820), VolumeSize:"50Gi", Encrypted:(*bool)(nil)}: supported values: "Standard_LRS", "StandardSSD_LRS", "Premium_LRS"]

We still have a small number of legacy Shoots which still use .spec.provider.workers[].machine.volume.type=standard. We should actively ping them to migrate (a rolling update of the Nodes will be caused by change of volume.type).

After that we can clean up the above explicit check for ["Standard_LRS", "StandardSSD_LRS", "Premium_LRS"].

/kind cleanup
/platform azure
/priority normal
/cc @dkistner

New Azure clusters often fail during initial creation

New Azure clusters often fail during initial creation because the Terraform infrastructure job returns errors that are confusing for end-users. After the next reconciliation/Terraform run it succeeds, however, following errors are shown:

-> Pod 'example.infra.tf-job-v8zct' reported:
* azurerm_availability_set.workers: 1 error occurred:
	
* azurerm_availability_set.workers: compute.AvailabilitySetsClient#CreateOrUpdate: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceGroupNotFound" Message="Resource group 'shoot--project--example' could not be found."
	
* azurerm_route_table.workers: 1 error occurred:
	
* azurerm_route_table.workers: Error Creating/Updating Route Table "worker_route_table" (Resource Group "shoot--project--example"): network.RouteTablesClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ResourceGroupNotFound" Message="Resource group 'shoot--project--example' could not be found."
* azurerm_virtual_network.vnet: 1 error occurred:
	
* azurerm_virtual_network.vnet: Error Creating/Updating Virtual Network "shoot--project--example" (Resource Group "shoot--project--example"): network.VirtualNetworksClient#CreateOrUpdate: Failure sending request: StatusCode=404 -- Original Error: Code="ResourceGroupNotFound" Message="Resource group 'shoot--project--example' could not be found."]

Azure clusters in existing resource groups

Some time back we have disabled deployments into existing Azure resource groups and existing Azure VNets because the Azure cloud provider implementation did not clean up self-created resources properly (tested with version 1.7.6).

Has the Azure cloud provider been improved in that regards so that we can re-enable it again?

Status:

Enable deploying shoots into existing vNETs.
ref: gardener/gardener#1558 gardener-attic/gardener-extensions#371
Enable deploying shoots into existing resourceGroups

Update credentials during Worker deletion

From gardener-attic/gardener-extensions#523

Steps to reproduce:

Create a Shoot with valid cloud provider credentials my-secret.

Ensure that the Shoot is successfully created.

Invalidate the my-secret credentials.

Delete the Shoot.

Update my-secret credentials with valid ones.

Ensure that the Shoot deletion fails waiting the Worker to be deleted.

Currently we do no sync the cloudprovider credentials in the <Provider>MachineClass during Worker deletion. Hence machine-controller-manager fails to delete the machines because the credentials are the invalid ones.

Deletion of shoot cluster fails if service allow-udp-egress is pending

How to categorize this issue?

/area control-plane
/kind bug
/priority normal
/platform azure

What happened:
This was observed on multiple cluster on the canary landscape.
For unknown reasons the load balancer for the shoot service allow-udp-egress was not assigned and the service stays in the state pending:

> ka get svc
NAMESPACE     NAME               TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
default       kubernetes         ClusterIP      100.64.0.1      <none>        443/TCP          5h6m
kube-system   allow-udp-egress   LoadBalancer   100.71.199.73   <pending>     1234:30593/UDP   3h34m

As a consequence the managed resource extension-controlplane-shoot is not deleted

What you expected to happen:
The service should be deleted if it is pending

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

Gardener version (if relevant):
Extension version:
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Forbid replacing secret with new account for existing Shoots

What would you like to be added:
Currently we don't have a validation that would prevent user to replace its cloudprovider secret with credentials for another account. Basically we do have only a warning in the dashboard - ref gardener/dashboard#422.

Steps to reproduce:

Get an existing Shoot.
Update its secret with credentials for another account.
Ensure that on new reconciliation, new infra resources will be created in the new account. The old infra resources and machines in the old account will leak.
For me the reconciliation failed at

    lastOperation:
      description: Waiting until the Kubernetes API server can connect to the Shoot
        workers
      lastUpdateTime: "2020-02-20T14:56:43Z"
      progress: 89
      state: Processing
      type: Reconcile

wtih reason

$ k describe svc -n kube-system vpn-shoot
Events:
  Type     Reason                   Age                  From                Message
  ----     ------                   ----                 ----                -------
  Normal   EnsuringLoadBalancer     7m38s (x6 over 10m)  service-controller  Ensuring load balancer
  Warning  SyncLoadBalancerFailed   7m37s (x6 over 10m)  service-controller  Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB

Why is this needed:
Prevent users to harm themselves.

When "infrastructureConfig" has invalid format user can't see the actual error.

How to categorize this issue?

/area logging
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
When validating the infrastructureConfig the validating controller shows only not allowed to configure an unsupported infrastructureConfig.
It will be a good to add more verbose error message to show the user what and where exactly is the error.

Why is this needed:
If a client has misconfigured "infrastructureConfig" section in the shoot specification, the only message he gets is

spec.provider.infrastructureConfig: Forbidden: not allowed to configure an unsupported infrastructureConfig"

and he is unable see his misconfiguration.
The issue comes from a client which is using go library to generate and apply the shoot spec.
According to his example, the only missing thing was the apiVersion and Kind which he told me that are inserted automatically before Create request. Everything else seem to be properly configured.
There was no example of the misconfigured infrastructureConfig in the error message in the validating controller, nor a concrete error denoting the misconfiguration.
According to the code snipped:

infraConfig, err := decodeInfrastructureConfig(decoder, config)
if err != nil {
return nil, field.Forbidden(infraConfigPath, "not allowed to configure an unsupported infrastructureConfig")
}

the actual error is skiped.

Configure mcm-settings from worker to machine-deployment.

How to categorize this issue?

/area usability
/kind enhancement
/priority normal
/platform azure

What would you like to be added: Machine-controller-manager now allows configuring certain controller-settings, per machine-deployment. Currently, the following fields can be set:

drainTimeout, healthTimeout, creationTimeout, maxEvictRetries, nodeConditions.
Refer: gardener/machine-controller-manager#478

Also, with the PR gardener/gardener#2563 , these settings can be configured via shoot-resource as well.

We need to enhance the worker-extensions to read these settings from worker-object and set respectively on MachineDeployment.

Similar PR on AWS worker-extension: gardener/gardener-extension-provider-aws#148
Dependencies:

Vendor the MCM 0.33.0
gardener/gardener#2563 should be merged.
g/g with the #2563 change should be vendored.

Why is this needed:
To allow a fine configuration of MCM via worker-object.

Virtual network not found after wake up

What happened:
A hibernated cluster was woken up, but the machine deployment failed with an error:

Failed while waiting for all machine deployments to be ready: 'Machine shoot--hec-tna--vlab005-worker-h39fa-z3-54fbcb4fb4-dhlkh failed: Cloud provider message - network.SubnetsClient#Get: Failure responding to request: StatusCode=404 -- Original Error: autorest/azure: Service returned an error. Status=404 Code="ResourceNotFound" Message="The Resource 'Microsoft.Network/virtualNetworks/vnet-HEC42-XSE' under resource group 'shoot--hec-tna--vlab005' was not found."

The network vnet-HEC42-XSE was unchanged and was still existing.
The worker spec had the correct infrastructure status with the vnet rg, but the machineclass was generated without the vnet rg.
After adding the vnet resource group to the machineclass manually, the wake up succeeded finally.

What you expected to happen:
The machineclass should be generated with the vnet rg.

How to reproduce it (as minimally and precisely as possible):
Dominic Kistner tried to reproduce it, but without success.

Anything else we need to know?:

Environment:

Gardener version (if relevant):
Extension version:
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Should we switch to standard SKU load balancers in Azure?

We are using basic SKU load balancers for our Shoots in Azure, however, Azure supports also standard SKU load balancers which provide more features: https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-standard-overview

Should we switch to them?

Hint: To use this feature in Kubernetes we require version 1.11 for the Seed clusters, and must set {"loadBalancerSku": "standard"} in the Azure cloud provider config. See kubernetes/kubernetes#61884 and kubernetes/kubernetes#62707.

Enable UDP for egress traffic on Azure

Currently we have a problem that if standard loadbalancers are used, no frontend ip configuration will be created for UDP by default which in turns will result in blockage for all UDP egress requests (e.g., NTP, DNS, etc.).

To fix this we need to support UDP egress traffic. This can be done either via a real NAT GW (not using the LB as a GW), or by modifying any of the shoot services created on the kube-system namespace to also have a UDP port.

cc @dkistner

Adaptation to terraform v0.12 language changes

How to categorize this issue?

/area open-source
/kind cleanup
/priority normal
/platform azure

What would you like to be added:
provider-azure needs an adaptation of the terraform configuration to v0.12. For provider-aws this is done with this PR - gardener/gardener-extension-provider-aws#111.

Why is this needed:
Currently the terraformer run is only omitting warnings but in a future version of terraform, the warnings will be switched to errors.

NatGateway configurable idle connection timeout

How to categorize this issue?
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
Make the timeout for idle connections configurable for the NatGateway, see here: https://www.terraform.io/docs/providers/azurerm/r/nat_gateway.html#idle_timeout_in_minutes

The corresponding field in the infrastructure config could look like this:

apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
kind: InfrastructureConfig
networks:
  ...
  natGateway:
    enabled: true
    idleConnectionTimoutMinutes: 99 # keep idle connections open for 99 minutes

Why is this needed:
The default idle connection timeout of four minutes is not always sufficient.

Infra failure when fault or update domain count is changed

How to categorize this issue?

/kind bug
/priority normal
/platform azure

What happened:
The values countFaultDomains or countUpdateDomains were changed in a cloud profile. A corresponding shoot had been created successfully with the previous values but when Gardener tried to reconcile the Infrastructure of the shoot with the new values, the infra controller failed for the following reason:

* compute.AvailabilitySetsClient#Delete: Failure sending request: StatusCode=409 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Availability Set '<technical-id>' cannot be deleted. Before deleting an Availability Set please ensure that it does not contain any VM."

What you expected to happen:
The infra should reconcile successfully, probably with the values which had been used to create the Availability Set.

Define what privileges are needed for the Azure ServicePrincipal

What would you like to be added:

In the usage-as-end-user document, it states that an Azure ServicePrincipal is required for creating shoot clusters. Would it be possible to define the minimum roles needed for this user to create shoots in Azure?

Why is this needed:

We are currently using the Contributer role but this is likely over privileged for what is needed

Enable Virtual network service endpoints for Azure

What would you like to be added:
Virtual network service endpoints for Azure support.

Virtual network service endpoints for Azure requires a specific flag to be set on the subnet (not on VNET, on Subnet) of the nodes. Currently we can provide an existing VNET but Subnet is always created and managed by Gardener. In our case we want the cluster nodes to be able to use this feature and so related Subnet needs this flag set on. But the subnet is configured by Gardener, and our changes are being reverted during reconciles.

We've faced this issue on Gardener 0.18, but verified that the same issue exists in new extension implementation: https://github.com/gardener/gardener-extensions/blob/c4c502de484d9ef6333d4f15c807d96740086550/controllers/provider-azure/charts/internal/azure-infra/templates/main.tf#L28

These are the options I can see with my brief investigation:

enable passing an existing subnet along with vnet (also mind #5 since it seems related), so we can take over the the vnet and subnet management from gardener and provide them when creating the shoot,
enable providing service_endpoints which will be used in terraformer if provided (https://www.terraform.io/docs/providers/azurerm/r/subnet.html#service_endpoints), since this will also allow doing on-the-fly changes for existing clusters. We have many clusters older than 1y, we dont want to reprovision to enable this feature.

I personally go for option 2, but open to discussion.

Why is this needed:

We have users blocked on moving in because they've been using Azure PostgreSQL. When not using VNET endpoint for PostgreSQL:

there is a performance impact, on DB operations
one has to whitelist all egress IPs of the cluster on the postgres service and its not easy to set kubernetes cluster's egress IP on Azure when using multiple non-static Service[type=LoadBalancers].

https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-service-endpoints-overview#key-benefits

Add Azure remedy controller to the controlplane extension

How to categorize this issue?
/area control-plane
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
Extend the controlplane controller to install / remove the Azure remedy controller.

Why is this needed:
The Azure remedy controller takes care about certain sporadic Azure issues by trying to apply "remedies" for them.

Azure CNI Support

The Microsoft networking team has recommended that we use the Azure CNI plugin (see https://github.com/Azure/azure-container-networking/blob/master/docs/cni.md) instead of Calico (however they stated that they believe nothing is wrong with Calico).

We shall install a cluster on Azure using the Azure CNI and compare network throughput and scalability.

Multiple (Availability)Sets

Gardener currently deploy for non zoned clusters only one Availability Set for all machines in the cluster.
This approach has several drawbacks, e.g. not possible to combine different machine type skus, no option to move to newer machine type skus, stability and performance of update operations etc.

Replace deprecated terraform resources

How to categorize this issue?

/area quality
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
The deprecated terraform resources network_security_group_id and route_table_id should be replaced by azurerm_subnet_network_security_group_association and azurerm_subnet_route_table_association respectively.

Why is this needed:
When creating an azure Infrastructure terraform gives out the following warnings:

Warning: "network_security_group_id": [DEPRECATED] Use the `azurerm_subnet_network_security_group_association` resource instead.

  on tf/main.tf line 24, in resource "azurerm_subnet" "workers":
  24: resource "azurerm_subnet" "workers" {



Warning: "route_table_id": [DEPRECATED] Use the `azurerm_subnet_route_table_association` resource instead.

  on tf/main.tf line 24, in resource "azurerm_subnet" "workers":
  24: resource "azurerm_subnet" "workers" {

Azure cloud-controller-manager RBAC for serviceaccount "system:serviceaccount:kube-system:azure-cloud-provider"

Currently the azure cloud-controller-manager has multiple error logs :

'events "kube-apiserver.15e7ebd2126cb9f8" is forbidden: User "system:serviceaccount:kube-system:azure-cloud-provider" cannot patch resource "events" in API group "" in the namespace "shoot--foo--bar"' (will not retry!)

According to kubernetes/kubernetes#68212 the ClusterRole system:azure-cloud-provider and ClusterRoleBinding system:azure-cloud-provider needs to be created.

Enforce minimum TLS version 1.2 while creating storage account

How to categorize this issue?

/area backup
/area certification
/area security
/area storage
/kind enhancement
/priority critical
/platform azure

What would you like to be added:
Storage account creation (during backup bucket creation) fails if the account policy enforces TLS minimum version 1.2 with the following error.

flow "Backup Bucket Reconciliation" encountered task errors: [task
      "Waiting until backup bucket is reconciled" failed: Error while waiting for
      BackupBucket /bf754112-7079-4f76-9164-1e936ac7e675 to become ready: extension
      encountered error during reconciliation: Error reconciling backupbucket: storage.AccountsClient#Create:
      Failure sending request: StatusCode=403 -- Original Error: Code="RequestDisallowedByPolicy"
      Message="Resource ''bkp4cf2e2d424519f5'' was disallowed by policy. Policy identifiers:
      ''[{\"policyAssignment\":{\"name\":\"Storage accounts must use TLS Version 1.2
      and above\",\"id\":\"/subscriptions/6fcac074-8bf5-4307-b30f-7b84764cd5ba/providers/Microsoft.Authorization/policyAssignments/dd3024c89ba1466fbfbbcebc\"},\"policyDefinition\":{\"name\":\"Storage
      accounts must use TLS Version 1.2 and above\",\"id\":\"/providers/Microsoft.Management/managementgroups/SAP-MG-21052019-00000001/providers/Microsoft.Authorization/policyDefinitions/c6a25da3-ee91-37b2-ab5c-49939b141093\"}}]''."
      Target="bkp4cf2e2d424519f5" AdditionalInfo=[{"info":{"evaluationDetails":{"evaluatedExpressions":[{"expression":"type","expressionKind":"Field","expressionValue":"Microsoft.Storage/storageAccounts","operator":"Equals","path":"type",
"result":"True","targetValue":"Microsoft.Storage/storageAccounts"},{"expression":"Microsoft.Storage/storageAccounts/minimumTlsVersion","expressionKind":"Field","operator":"Equals","path":"properties.minimumTlsVersion","result":"False","ta
rgetValue":"TLS1_2"}]},"policyAssignmentDisplayName":"Storage
      accounts must use TLS Version 1.2 and above","policyAssignmentId":"/subscriptions/6fcac074-8bf5-4307-b30f-7b84764cd5ba/providers/Microsoft.Authorization/policyAssignments/dd3024c89ba1466fbfbbcebc","policyAssignmentName":"dd3024c89ba
1466fbfbbcebc","policyAssignmentParameters":{},"policyAssignmentScope":"/subscriptions/6fcac074-8bf5-4307-b30f-7b84764cd5ba","policyDefinitionDisplayName":"Storage
      accounts must use TLS Version 1.2 and above","policyDefinitionEffect":"Deny","policyDefinitionId":"/providers/Microsoft.Management/managementgroups/SAP-MG-21052019-00000001/providers/Microsoft.Authorization/policyDefinitions/c6a25da
3-ee91-37b2-ab5c-49939b141093","policyDefinitionName":"c6a25da3-ee91-37b2-ab5c-49939b141093"},"type":"PolicyViolation"}]]

Storage account and the backup bucket creation should succeed in such cases case.

Why is this needed:
Compliance to SCD policies.

Possible implementation:
Add MinimumTLSVersion: storage.TLS12, parameter in the storage account creation code here. Credit: @vpnachev

Run all `make` commands in container.

From https://github.com/gardener/gardener-extensions/issues/122

All make commands require various dev resources to be available:

golang

helm

various gnu utils such as tar, gzip, tr, base64, cut, find, grep, xargs.

gingo

linters.

I think that the most reasonable way to get reproducible builds and things such as make generate is to run those commands in a pre-build container image.

Get rid of github.com/Azure/go-autorest replace directive

$ go get github.com/gardener/gardener@7d1c63a764aac467b9ee57a7501881a7017f1bf2

$ make revendor

github.com/gardener/gardener-extensions/controllers/provider-azure/pkg/azure/client imports
	github.com/Azure/azure-sdk-for-go/services/resources/mgmt/2019-05-01/resources imports
	github.com/Azure/go-autorest/tracing: ambiguous import: found github.com/Azure/go-autorest/tracing in multiple modules:
	github.com/Azure/go-autorest v11.5.0+incompatible (/Users/i331370/go/pkg/mod/github.com/!azure/[email protected]+incompatible/tracing)
	github.com/Azure/go-autorest/tracing v0.5.0 (/Users/i331370/go/pkg/mod/github.com/!azure/go-autorest/[email protected])
github.com/gardener/gardener-extensions/controllers/provider-azure/pkg/azure/client imports
	github.com/Azure/azure-storage-blob-go/azblob tested by
	github.com/Azure/azure-storage-blob-go/azblob.test imports
	github.com/Azure/go-autorest/autorest/adal: ambiguous import: found github.com/Azure/go-autorest/autorest/adal in multiple modules:
	github.com/Azure/go-autorest v11.5.0+incompatible (/Users/i331370/go/pkg/mod/github.com/!azure/[email protected]+incompatible/autorest/adal)
	github.com/Azure/go-autorest/autorest/adal v0.6.0 (/Users/i331370/go/pkg/mod/github.com/!azure/go-autorest/autorest/[email protected])

Validate cloudprovider credentials

(recreating issue from the g/g repo: gardener/gardener#2293)

What would you like to be added:
Add validation for cloudprovider secret

Why is this needed:
Currently, when uploading secrets via the UI, all secret fields are required and validated. However, when creating those credentials via the cloudprovider secret, there is no validation. This results in errors such as:

Flow "Shoot cluster reconciliation" encountered task errors: [task "Waiting until shoot infrastructure has been reconciled" failed: failed to create infrastructure: retry failed with context deadline exceeded, last error: extension encountered error during reconciliation: Error reconciling infrastructure: secret shoot--xxxx--xxxx/cloudprovider doesn't have a subscription ID] Operation will be retried.

Unable to disable the remedy-controller for already existing Shoot

How to categorize this issue?

/kind bug
/priority normal
/platform azure

What you expected to happen:
remedy-controller to be removed when the Shoot is annotated with azure.provider.extensions.gardener.cloud/disable-remedy-controller=true and it is reconciled afterwards

How to reproduce it (as minimally and precisely as possible):

Create a Shoot
Ensure that the remedy-controller is present as expected

$ k -n shoot--dev--bar po -l app=remedy-controller-azure
NAME                                       READY   STATUS    RESTARTS   AGE
remedy-controller-azure-756f555b6b-g4h6s   1/1     Running   0          7h6m

Annotate the Shoot with azure.provider.extensions.gardener.cloud/disable-remedy-controller=true and then trigger a reconcile

$ k -n garden-dev annotate shoot bar azure.provider.extensions.gardener.cloud/disable-remedy-controller=true
$ k -n garden-dev annotate shoot bar gardener.cloud/operation=reconcile

Ensure that the remedy-controller is not removed

$ k -n shoot--dev--bar get po -l app=remedy-controller-azure
NAME                                       READY   STATUS    RESTARTS   AGE
remedy-controller-azure-756f555b6b-g4h6s   1/1     Running   0          7h11m

Gardener version (if relevant):
Extension version: v1.15.2
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Add integration test for Infrastructure reconciliation and deletion

How to categorize this issue?

/area testing
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
With gardener/gardener-extension-provider-aws#54 provider-aws features integration test for Infrastructure reconciliation and deletion.

Why is this needed:
I would be good if we have similar test for provider-azure as it will increase the robustness.

Install Azure kubernetes-volume-drivers

Hi, it would be nice to have these drivers available on shoots running on Azure.

Circumvent k8s 1.15.3 bug for Azure cloud provider

K8s 1.15.3 has been released recently but this version is shipped with a bug in the Azure cloud provider (Cloud-Controller-Manager) which lets it crash loop constantly.

See: kubernetes/kubernetes#81463

Nevertheless, we could circumvent this issue by using cloudProviderBackoffMode: default instead of v2.

https://github.com/kubernetes/kubernetes/blob/c7858aa97643c532ffb26a8ae9fa86611dd329a5/staging/src/k8s.io/legacy-cloud-providers/azure/azure.go#L137

@dkistner @rfranzke do you have any opinion / experience using the default mode or do you see any risk switching to it at least at the moment for 1.15.3 clusters only?

/cc @MSSedusch @ThormaehlenFred

Support additional HA concepts

What would you like to be added:
Add support for additional HA concepts. For example, on Azure, Gardener only supports deploying VMs into Availability Sets.

In addition to AVSets, it would be great if Gardener and MCM could support these concepts as well:

None
AVsets have some limitations e.g. on VMs that can be part of the same AVSet. In some cases it would be good to have a cluster without an HA concept
Availability Zones
Availability Zones offer better VM SLAs (99.99 instead of 99.95 for AVSets)

Why is this needed:
More flexibility for customers and higher SLA

Consider Azure AZs

We do not support AZs for Azure, but there seem to be some now: https://azure.microsoft.com/de-de/blog/introducing-azure-availability-zones-for-resiliency-and-high-availability/ -> https://azure.microsoft.com/en-us/global-infrastructure/availability-zones
What are the advantages of those AZs and should we support them (instead of the update and fault domains)?

Provider-specific webhooks in Garden cluster

From gardener-attic/gardener-extensions#407

With the new core.gardener.cloud/v1alpha1.Shoot API Gardener does no longer understand the provider-specifics, e.g., the infrastructure config, control plane config, worker config, etc.
This allows end-users to harm themselves and create invalid Shoot resources the Garden cluster. Errors will only become present during reconciliation part creation of the resource.

Also, it's not possible to default any of the provider specific sections. Hence, we could also think about mutating webhooks in the future.

As we are using the controller-runtime maintained by the Kubernetes SIGs it should be relatively easy to implement these webhooks as the library abstracts already most of the things.

We should have a separate, dedicated binary incorporating the webhooks for each provider, and a separate Helm chart for the deployment in the Garden cluster.

Similarly, the networking and OS extensions could have such webhooks as well to check on the providerConfig for the networking and operating system config.

Part of gardener/gardener#308

Azure loadbalancers cannot be created

Azure shoots cannot be provisioned sometimes because the cloud controller manager is unable to create the vpn and nginx loadbalancer.

The cloud controller managers states that the corresponding public IP address cannot be found

Error syncing load balancer: failed to ensure load balancer: Code="ReferencedResourceNotProvisioned" 
Message="Cannot proceed with operation because resource /subscriptions/xxx/resourceGroups/shoot-az-xxx/providers/Microsoft.Network/publicIPAddresses/shoot-az-xxx--abc used by resource /subscriptions/xxx/resourceGroups/shoot-az-xxx/providers/Microsoft.Network/loadBalancers/shoot-az-xxx is not in Succeeded state. 
Resource is in Failed state and the last operation that updated/is updating the resource is PutPublicIpAddressOperation

The issue is looks like a inconsistency in azure.

Steps to reproduce

create a azure shoot with nginx addon enabled.

The issue can be resolved by deleting both services which then triggers the deletion the loadbalancer.
After the reconcilition the loadbalancer is then created successfully.

Properly Enable Accelerated Networking for AvSet Clusters

How to categorize this issue?

/area performance
/area quality
/kind enhancement
/priority normal
/platform azure

What would you like to be added:
Recently Accelerated Networking was enabled with #65. The decision to enable it or not was based only on the VM type and OS, but this caused troubles with AvSet clusters - basically the migration to VMs with enabled Accelerated Networking is requiring all non-enabled VMs to be deleted before new ones are created. If this is not done, VM creation is failing with ExistingAvailabilitySetWasNotDeployedOnAcceleratedNetworkingEnabledCluster, thus we disabled Accelerated Networking with #100 for all AvSet based shoot clusters.

Once #61 is implemented, we can use AvSet per machine class which will allow us to properly enable Accelerated Networking for non-zoned clusters, again.

Why is this needed:
See #55

Implement `ControlPlane` controller for Azure provider

Similar to how we have implemented the ControlPlane extension resource controller for the AWS provider let's please now do it for Azure.

Based on the current implementation the ControlPlaneConfig should look like this:

apiVersion: azure.provider.extensions.gardener.cloud/v1alpha1
kind: ControlPlaneConfig
cloudControllerManager:
  featureGates:
    CustomResourceValidation: true

No ControlPlaneStatus needs to be implemented right now (not needed yet).

Accelerated Network Enablement

What would you like to be added:
Azure supports a feature called Accelerated Networking to increase network performance of virtual machines e.g. reduce latency/more packet per second, lower cpu utilization.
The usage of the feature depends on the machine type and the operating systems.
More information are available here.

Why is this needed:
This feature comes for free and should therefore be enabled by default for all machines and os images combinations which support it.

To enable the feature we just need to turn on a flag on the network interfaces attached to the vm (see here).

We would need to put the information which machine type and os image support accelerated networking into the Azure specific part of the CloudProfile. The supported machine types can be fetched via an Azure api (e.g. az vm list-skus) and therefore it can be automated.
The operating systems need to be maintained manually. A list is of supported ones is available here.
Custom images are also supported but there are some requirements, see here.

Action Items

MCM need to support creating machine with network interfaces that have accelerated networking enabled.
Gardener Azure Provider need to render and deploy proper machine classes
Gardener Azure CloudProfile need to be extended to allow to configure accelerated networking supporting machine types and operating systems

Automate CloudProfile creation wherever possible by using the Azure apis.

Add tests for alerting rules

From gardener-attic/gardener-extensions#346

The monitoring configs that have been moved to the gardener extensions no longer have alerting rules tests. These tests should be added to the extensions and executed in the ci pipeline.

See also: gardener/gardener#1466

Azure shoot nodes cannot be scaled up with standard lb sku

/kind bug

Since gardener-extensions 0.13.0 the LB are created with SKU standard and it turns out that worker nodes for such shoots cannot be scaled up.
Basically the egress traffic of the new nodes is broken and they cannot self-bootstrap, thus fail to join. This behavior is not observed with lb SKU basic.

Steps to reproduce:

Update gardener-extensions to 0.13.x
Create new shoot cluster(min=max=2) and wait the creation to succeed.
Scale up the worker pool (min=max=3) and ensure that the third newly created node cannot join the cluster

Additionally, ssh on the failing node and try to curl the shoot's kube-apiserver

$ curl https://api.foo.bar.internal -vvvvvv
* Expire in 0 ms for 6 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 0 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 1 ms for 1 (transfer 0x55dd5ef13650)
* Expire in 2 ms for 1 (transfer 0x55dd5ef13650)
*   Trying 1.2.3.4...
* TCP_NODELAY set
* Expire in 149997 ms for 3 (transfer 0x55dd5ef13650)
* Expire in 200 ms for 4 (transfer 0x55dd5ef13650)
* connect to 1.2.3.4 port 443 failed: Connection timed out
*   Trying 5.6.7.8...
* TCP_NODELAY set
* Expire in 83971 ms for 3 (transfer 0x55dd5ef13650)
* After 83971ms connect time, move on!
* connect to 5.6.7.8 port 443 failed: Connection timed out
* Failed to connect to api.foo.bar.internal port 443: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to api.foo.bar.internal port 443: Connection timed out

Terraform deprecated arguments

Currently we use the following deprecated arguments:

provider-azure:

azurerm_subnet.workers: "network_security_group_id"

Warning: azurerm_subnet.workers: "network_security_group_id": [DEPRECATED] Use the `azurerm_subnet_network_security_group_association` resource instead.

azurerm_subnet.workers: "route_table_id"

Warning: azurerm_subnet.workers: "route_table_id": [DEPRECATED] Use the `azurerm_subnet_route_table_association` resource instead.

provider-alicloud

alicloud_nat_gateway.nat_gateway: "spec"

Warning: alicloud_nat_gateway.nat_gateway: "spec": [DEPRECATED] Field 'spec' has been deprecated from provider version 1.7.1, and new field 'specification' can replace it.

Support Service.spec.loadBalancerIP with Kubernetes services of type LoadBalancer

Feature request: Applications with exposed static IP addresses, for the purpose to be white-listed by consumers, need to be able to re-use the IP addresses in case shoots are created, for example in a disaster recovery scenarios. This feature should be available with all cloud providers, i.e. be part of the homogeneity of clusters.

In the case of shoots in Azure:

Given I have an Azure resource group with name: my-resource-group
And I have a Virtual Network with name: my-virtual-network
And I have a Public IP address with:
  Name: my-public-address
  IPAddress: 51.105.156.156

When I create a Gardener shoot as:
  kind: Shoot
  apiVersion: garden.sapcloud.io/v1beta1
  metadata:
    name: my-shoot
  spec:
    cloud:
      profile: az
      azure:
        networks:
          vnet:
            resourceGroup: my-resource-group
            name: my-virtual-network
And I create a service in my-shoot shoot as:
  kind: Service
  apiVersion: v1
  metadata:
    name: my-load-balancer
  spec:
    type: LoadBalancer
    loadBalancerIP: 51.105.156.156

Then my-load-balancer service should be created successfully with:
  .status.loadBalancer.ingress[0].ip: 51.105.156.156

Currently this scenarios fails with error message:

user supplied IP Address 51.105.156.156 was not found in resource group shoot--garden--my-shoot

Blocks of type "timeouts" are not expected here

What happened:
Infrastructure fails with Blocks of type "timeouts" are not expected here.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Create Shoot.
Ensure that the Infrastructure reconciliation fails with:

$ k -n shoot--foo--bar logs bar.infra.tf-apply-nbk4c -f
Initializing the backend...

Initializing provider plugins...

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = "..." constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.azurerm: version = "~> 1.44"

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Error: Unsupported block type

  on tf/main.tf line 12, in resource "azurerm_resource_group" "rg":
  12:   timeouts {

Blocks of type "timeouts" are not expected here.


Nothing to do.
Tue May 19 09:47:47 UTC 2020 terraform.sh exited with 0.

Anything else we need to know?:

Environment:

Gardener version (if relevant):
Extension version:
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

Missing health check for remedy controller

How to categorize this issue?

/area ops-productivity
/kind bug
/priority normal
/platform azure

What happened:
When the remedy controller deployed into the shoot namespace next to the control plane of an Azure shoot fails then the conditions of the ControlPlane resource do not change.

What you expected to happen:
The condition ControlPlaneHealthy should transition to False.

How to reproduce it (as minimally and precisely as possible):
Deploy an Azure shoot and delete the CRDs used by the remedy controller. See it failing but the condition staying the same.

Anything else we need to know?:
The check should be added to https://github.com/gardener/gardener-extension-provider-azure/blob/master/pkg/controller/healthcheck/add.go

/assign @stoyanr @AndreasBurger

Environment:

Gardener version (if relevant): v1.9.0
Extension version: v1.12.0
Kubernetes version (use kubectl version): v1.18.0
Cloud provider or hardware configuration: Azure
Others: none

Investigate fine-grained cloud provider rate limits

What would you like to be added:
With 1.18 the Azure cloud provider has several more configuration options for fine-tuning individual rate limits, see kubernetes/kubernetes#86515. It makes sense to investigate which of these flags should be set.

Why is this needed:
Reducing risk to run into ARM rate limits.

/cc @dkistner @MSSedusch

	// In the past the volume type information was not passed to the machineclass.
	// In consequence the Machine controller manager has created machines always
	// with the default volume type of the requested machine type. Existing clusters
	// respectively their worker pools could have an invalid volume configuration
	// which was not applied. To do not damage existing cluster we will set for
	// now the volume type only if it's a valid Azure volume type.
	// Otherwise we will still use the default volume of the machine type.
	if pool.Volume.Type != nil && (pool.Volume.Type == "Standard_LRS" \|\| pool.Volume.Type == "StandardSSD_LRS" \|\| *pool.Volume.Type == "Premium_LRS") {
	osDisk["type"] = *pool.Volume.Type
	}

gardener / gardener-extension-provider-azure Goto Github PK

gardener-extension-provider-azure's People

Stargazers

Watchers

Forkers

gardener-extension-provider-azure's Issues

Steps to reproduce

Recommend Projects

Recommend Topics

Recommend Org