Comments (12)

dkistner commented on August 11, 2024

With the current network setup for Azure Shoots we have only one subnet, and machines distributed across several zones can be attached to it. So far this wasn't an issue, because the Standard LoadBalancer used for ingress and egress (still the default, as the NatGateway should be optional) is automatically deployed zone-redundant by Azure. An Azure subnet can currently be associated with only one NatGateway, which means that attaching multiple NatGateways in different zones to the same subnet is not possible.

So at the moment I see only two possibilities to enable zone redundant NatGateways:

  1. A dedicated subnet for each zone, similar to what we have for AWS. This is a larger effort, as we need migration logic to move machines from one subnet to another.
  2. Microsoft provides an automated zone-redundant way to deploy the NatGateway, similar to what we have for the Standard LoadBalancer. This does not exist yet (the NatGateway itself is currently not even GA), and we don't know if and when something like that will be available.
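Option 1 amounts to splitting the VNet range into one subnet per zone, so that each subnet can carry its own NatGateway. A minimal sketch of such a split, using only Python's standard `ipaddress` module. The helper name, the VNet CIDR and the prefix length are assumptions for illustration and are not taken from the Gardener Azure extension:

```python
import ipaddress


def subnets_per_zone(vnet_cidr, zones, new_prefix):
    """Assign one subnet of size /new_prefix to each zone, in order.

    Hypothetical helper for illustration only; it is not part of the
    Gardener Azure extension.
    """
    vnet = ipaddress.ip_network(vnet_cidr)
    candidates = vnet.subnets(new_prefix=new_prefix)
    layout = {}
    for zone in zones:
        try:
            layout[zone] = str(next(candidates))
        except StopIteration:
            raise ValueError("VNet range too small for one subnet per zone")
    return layout


# Example values are assumptions, not defaults of any real cluster.
layout = subnets_per_zone("10.250.0.0/16", zones=[1, 2, 3], new_prefix=19)
print(layout)
```

The sketch only covers the address planning. The migration of existing machines from the shared subnet into the per-zone subnets, which is the larger effort mentioned above, is not modelled here.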

So for now I see only option 1 as a short-term solution, though it takes more effort to enable redundant NatGateways. It is probably also the only option that Gardener fully controls. Option 2 would mean waiting and going on for now without NatGateway HA. The NatGateway will be an optional feature anyway, so we can argue that anyone who requires reliable, zone-redundant NAT should go with the Standard LoadBalancer.

But as I mentioned, AvailabilitySets are also a valid HA mechanism for machines, and in this case we will always have just one subnet and therefore only one NatGateway (except if multiple NatGateways are ever allowed to be attached to one subnet).

So how to proceed? @vlerenc, @rfranzke WDYT?

from gardener-extension-provider-azure.

dkistner commented on August 11, 2024

Just to summarise it.

The current network setup for Azure Shoot clusters consists of one subnet within a virtual network (vnet). Machines assigned to an AvailabilitySet, or machines distributed across zones, can be attached to that subnet.

This approach has some implications for the natgateway integration:

  • The natgateway needs to be deployed into one zone. This means it is not deployed zone-redundant like the Standard LoadBalancer, which makes it a potential single point of failure. This is the case for AvailabilitySet-based clusters anyway, as there we always have just one subnet.
  • Egress connections initiated from machines in a zone other than the natgateway's must first route their traffic to the zone that hosts the natgateway before it can go out to the Internet. This could potentially have a latency impact.
  • We don't know the SLA or the costs (including traffic costs) of the natgateway compared to the Standard LoadBalancer.

Btw, if we decide to go with multiple natgateways later on, i.e. one per zone, then we could easily extend the suggested structure by adding zone information to each IP address or IP address range. The Gardener Azure extension controller would then only create public IP(s) for zones where the user does not specify an IP or IP range.

    networks:
      natGateway:
        enabled: <true|false>
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
          zone: 1
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
          zone: 2
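The rule that only zones without a user-specified IP or IP range get a Gardener-created public IP can be sketched as follows. This is a minimal illustration in Python, assuming the proposed structure has already been parsed into a dictionary; the function name is hypothetical and the real extension controller may decide this differently:

```python
def zones_needing_managed_ip(nat_gateway, cluster_zones):
    """Return the zones for which no user-provided IP or IP range exists."""
    covered = set()
    for key in ("ipAddresses", "ipAddressRanges"):
        for entry in nat_gateway.get(key, []):
            covered.add(entry["zone"])
    return [zone for zone in cluster_zones if zone not in covered]


# Mirrors the proposed structure above; the cluster zones are an assumption.
nat_gateway = {
    "enabled": True,
    "ipAddresses": [
        {"name": "my-ip-resource-name",
         "resourceGroup": "my-ip-resource-group",
         "zone": 1},
    ],
    "ipAddressRanges": [
        {"name": "my-iprange-resource-name",
         "resourceGroup": "my-iprange-resource-group",
         "zone": 2},
    ],
}
missing = zones_needing_managed_ip(nat_gateway, cluster_zones=[1, 2, 3])
print(missing)
```

For a cluster spread over zones 1 to 3, only zone 3 is left without a user-provided address in this example, so only that zone would need a managed public IP.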

dkistner commented on August 11, 2024

I propose to go on with the Azure NatGateway integration in several steps.

  1. NatGateway integration with one public IP attached by Gardener (no bring-your-own IP for the NatGateway in the first step, due to hashicorp/terraform-provider-azurerm#6052) and only for zoned clusters.
  2. Enable bring-your-own IP for zoned clusters (once hashicorp/terraform-provider-azurerm#6052 is fixed and a new version of the azurerm Terraform provider is released).
  3. Enable NatGateway for AvailabilitySet-based/non-zoned clusters once the Standard LoadBalancer is integrated for non-zoned clusters (larger effort, due to LoadBalancer migration etc.). This will probably require deploying the NatGateway as mandatory (non-optional) in addition to the Standard LoadBalancer.

vlerenc commented on August 11, 2024

Thanks @dkistner.

Since users that deploy to multiple AZs want to do that for availability reasons, we should definitely have one NAT GW per zone, otherwise we introduce singletons and break the main motivation to go for AZs in the first place. We will never be able to explain/motivate this decision when push comes to shove.

Enforcing NAT GWs for AvSet-based clusters is acceptable, because we don't want to go on with AvSet-based clusters as of today for two main reasons:

  1. When an AvSet gets a nervous break-down, it's similar to losing a zone (not really, but if the cluster is set up in this way, at least kind of), and the rest of the cluster is not going down as well.
  2. When the user wants to modernise a cluster, we need another AvSet for the new hardware.
    So, if the logical chain is as follows: multiple AvSets require STD LBs that require NAT GWs, then so be it. The above goals justify this or else AvSet clusters remain very brittle/troubled. We want to change that more than anything else here (frankly, even more than stable outbound IPs).

The plan above also makes a lot of sense with regard to the TF bug. Maybe MS could help fix it, or we could fix it ourselves, or we could use the native SDKs, but "sitting it out" is absolutely fair, especially given all the work we have to do.

MSSedusch commented on August 11, 2024

Should we allow customers to define how many NAT gateways to create, and in which zone (and in which subnet)?

    networks:
      natGateways:
      - name: nat1
        zone: 1
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
        subnet: subnet1
      - name: nat2
        zone: 2
        ipAddresses:
        - name: my-ip-resource-name
          resourceGroup: my-ip-resource-group
        ipAddressRanges:
        - name: my-iprange-resource-name
          resourceGroup: my-iprange-resource-group
        subnet: subnet2
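Since a subnet can currently be associated with only one NatGateway, a user-defined list like this would need a check that no two entries point at the same subnet. A minimal sketch of such a check in Python, assuming the list has been parsed into dictionaries; the function name is hypothetical and not part of any Gardener API:

```python
from collections import defaultdict


def subnets_with_multiple_nat_gateways(nat_gateways):
    """Return {subnet: [gateway names]} for subnets referenced more than once."""
    by_subnet = defaultdict(list)
    for gateway in nat_gateways:
        by_subnet[gateway["subnet"]].append(gateway["name"])
    return {subnet: names for subnet, names in by_subnet.items() if len(names) > 1}


# Reduced to the fields relevant for the check.
nat_gateways = [
    {"name": "nat1", "zone": 1, "subnet": "subnet1"},
    {"name": "nat2", "zone": 2, "subnet": "subnet2"},
]
conflicts = subnets_with_multiple_nat_gateways(nat_gateways)
print(conflicts)
```

An empty result means every subnet is referenced by at most one gateway, as in the two-gateway example above.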

dkistner commented on August 11, 2024

Hmm, we discussed that. The main benefit is redundancy, right? That means if the NAT gateway fails in one zone, only machines in that zone have no egress, and those in the other zones are still fine.

On the other hand, this would not be possible for AvailabilitySet-based clusters (distribution across fault domains is not possible?) and would come with more costs, because in this case we need one NAT gateway per zone and not only one per cluster.

dkistner commented on August 11, 2024

Another finding: attaching a natgateway to a subnet that has a Basic LoadBalancer assigned is not possible.

That would mean that if we want to enable natgateways for AvailabilitySet-based clusters, we would need to switch to Standard LoadBalancers for this type of cluster as well (which we want to do anyway), and the natgateway would be mandatory for this type of cluster (unlike zoned clusters, which can work without a natgateway). In short, a zoned cluster would only need a Standard LoadBalancer, while an AvailabilitySet-based cluster would need a Standard LoadBalancer plus a natgateway.
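The resulting combinations can be written down as a small validation sketch. It encodes only the two rules stated in this comment; the function name and the labels "zoned", "availability-set", "basic" and "standard" are hypothetical and do not come from the Gardener API:

```python
def validate_egress_setup(cluster_type, load_balancer_sku, nat_gateway_enabled):
    """Return a list of problems; an empty list means the combination is fine.

    Rule 1: a natgateway cannot be attached to a subnet that has a
            Basic LoadBalancer assigned.
    Rule 2: an AvailabilitySet-based cluster on a Standard LoadBalancer
            needs the natgateway (mandatory, not optional).
    """
    problems = []
    if nat_gateway_enabled and load_balancer_sku == "basic":
        problems.append(
            "natgateway cannot share a subnet with a Basic LoadBalancer")
    if (cluster_type == "availability-set"
            and load_balancer_sku == "standard"
            and not nat_gateway_enabled):
        problems.append(
            "AvailabilitySet cluster with Standard LoadBalancer requires a natgateway")
    return problems


# A zoned cluster works with a Standard LoadBalancer alone.
print(validate_egress_setup("zoned", "standard", nat_gateway_enabled=False))
# An AvailabilitySet cluster on a Standard LoadBalancer needs the natgateway.
print(validate_egress_setup("availability-set", "standard", nat_gateway_enabled=False))
```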

dkistner commented on August 11, 2024

It seems there is a bug in the azurerm Terraform provider which prevents the Terraformer from detaching and deleting a Gardener-managed public IP assigned to the natgateway. Because of this issue, the Terraformer won't be able to detach and delete the public IP (created by Gardener in case the user does not provide one) when the user later wants to assign their own public IPs/IP ranges. See here: hashicorp/terraform-provider-azurerm#6052

vlerenc commented on August 11, 2024

Well, "(1) larger effort" and "I see only (1) as short term solution" doesn't fit together for me. ;-)

Considering what you wrote, I would then do (2), i.e. sit it out and hope nobody escalates before MS/Azure changes that. If MS/Azure offers cross-AZ subnets and has "zone-aware LBs" then I would expect that for all resources as well. AWS does it differently and scopes by zone.

dkistner commented on August 11, 2024

Well, "(1) larger effort" and "I see only (1) as short term solution" doesn't fit together for me. ;-)

:D Sorry for the misleading statements. I meant I see more implementation effort for (1) because of the change in the network layout and the migration logic to move machines from one subnet to another. Of course we could do that. By short term I mean maybe within weeks; for (2) I do not know. I can only estimate, and I would guess months...

I'm also for (2) in general because we still have the Standard LoadBalancer with zone redundancy which should be sufficient for most cases.

larsdannecker commented on August 11, 2024

Thanks @dkistner.

Since users that deploy to multiple AZs want to do that for availability reasons, we should definitely have one NAT GW per zone, otherwise we introduce singletons and break the main motivation to go for AZs in the first place. We will never be able to explain/motivate this decision when push comes to shove.

Enforcing NAT GWs for AvSet-based clusters is acceptable, because we don't want to go on with AvSet-based clusters as of today for two main reasons:

  1. When an AvSet gets a nervous break-down, it's similar to losing a zone (not really, but if the cluster is set up in this way, at least kind of), and the rest of the cluster is not going down as well.
  2. When the user wants to modernise a cluster, we need another AvSet for the new hardware.
    So, if the logical chain is as follows: multiple AvSets require STD LBs that require NAT GWs, then so be it. The above goals justify this or else AvSet clusters remain very brittle/troubled. We want to change that more than anything else here (frankly, even more than stable outbound IPs).

The plan above also makes a lot of sense with regard to the TF bug. Maybe MS could help fix it, or we could fix it ourselves, or we could use the native SDKs, but "sitting it out" is absolutely fair, especially given all the work we have to do.

Hi @vlerenc
What about data centers that require AvSets?

dkistner commented on August 11, 2024

Step 3, making the NatGateway usable in combination with AvailabilitySets, will probably not be implemented, as we are planning to deprecate AvailabilitySet-based clusters and replace them in the mid term with clusters based on VirtualMachineScaleSet Orchestration mode VM (VMO). Those clusters will be compatible with the NatGateway out of the box.
