infrastructure-manager's Issues

[Threat Modelling] Limit access to Gardener project kubeconfig

Reason

We're using a kubeconfig defined in gardener-kubeconfig-path. We should limit access to it to prevent unauthorized access to the Gardener project.

Acceptance criteria

  • review access rights to the gardener project kubeconfig and adjust them if needed

Multiple worker groups

Description
Enable the creation of multiple worker groups with different machine types, volume types, node labels, annotations, and taints.
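
To illustrate, a minimal sketch of what two such worker pools could look like with Gardener's v1beta1 API types; the machine types, volume type, labels, and taints are invented placeholders, not an agreed configuration:

```go
// A hedged sketch: two worker pools in a Gardener Shoot spec.
// Machine/volume types and labels are illustrative placeholders.
package worker

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

func workerPools() []gardencorev1beta1.Worker {
	return []gardencorev1beta1.Worker{
		{
			// General-purpose pool for stateless workloads.
			Name:    "general",
			Machine: gardencorev1beta1.Machine{Type: "m5.xlarge"},
			Minimum: 3,
			Maximum: 10,
		},
		{
			// Dedicated pool: bigger volumes, labeled and tainted so only
			// pods that tolerate the taint land here.
			Name:    "database",
			Machine: gardencorev1beta1.Machine{Type: "r5.2xlarge"},
			Minimum: 1,
			Maximum: 3,
			Volume:  &gardencorev1beta1.Volume{Type: ptr.To("gp3"), VolumeSize: "200Gi"},
			Labels:  map[string]string{"workload": "database"},
			Taints: []corev1.Taint{{
				Key: "workload", Value: "database", Effect: corev1.TaintEffectNoSchedule,
			}},
		},
	}
}
```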

Reasons
One size doesn't fit all. Many applications require specific nodes for particular services.

Migrate Prow jobs to Github Actions

Description

As Prow will be discontinued in 2024, we have to move the Prow jobs used for the Provisioner to an alternative CI/CD system. In our case, GitHub Actions is the preferred choice.

An overview of all existing Prow jobs is available here: https://github.com/search?q=repo%3Akyma-project%2Ftest-infra+framefrog&type=code&p=1

AC:

  • Identify which of the jobs listed in the URL above are required during the Infrastructure Manager development lifecycle and are relevant in the long term (these have to be migrated)
  • All Infrastructure Manager related Prow jobs are migrated to Github Actions

Reasons

Migrate CI/CD jobs from Prow to Github Actions as Prow will be discontinued in 2024.

Attachments

Infrastructure Manager - Perform load and stress test to verify operator's behaviour under load

Description

We should verify how the operator behaves under load. To increase the stability and reliability of the Infrastructure Manager, a performance test has to be implemented which verifies common use cases. The goal is to regularly measure our internally defined performance KPIs (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviours before the Infrastructure Manager gets deployed on a productive landscape (no memory leaks etc.).

Acceptance criteria:

  • Identify the most relevant use-cases of the Infrastructure Manager
    • define input parameters (e.g. execute the test for 100, 1000, and 5000 CRDs)
    • specify the execution context/boundaries (how often the use case will be applied in parallel, limits for CPU/RAM consumption, max. execution time per test case etc.)
    • share the collected use-cases and the defined boundaries within the team and collect feedback
  • Learn what is the recommended way of load testing Kubebuilder projects
  • Implement the use-cases in a load test using one of the mainstream load testing tools (e.g. Grafana K6). This test has to cover
    • the creation of a load test landscape (e.g. by using a local K3d cluster or provisioning a Gardener Cluster) and deployment of a particular Infrastructure Manager version
    • ensure metrics of the Infrastructure Manager are recorded during the test execution
    • visualisation of the measured metrics in a Dashboard (e.g. Plutono)
    • mocks for 3rd party systems to avoid an overload of external systems (e.g. Gardener service)
  • Run the load test and increase the number of (parallel) workers until the application becomes unstable or crashes, to determine our maximum performance capacity (see the sketch after this list).
    • Document test results
  • Integrate the load test into the release process to detect critical performance changes between releases
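
As one possible starting point, a rough Go sketch of a load generator that applies N GardenerCluster CRs and waits for the resulting kubeconfig secrets; the CR version and the "kubeconfig-<name>" secret convention are assumptions based on the logs elsewhere in this tracker:

```go
// A rough load-generation sketch, not the agreed tooling. The CR version and
// the secret naming scheme are assumptions.
package loadtest

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyLoad creates n GardenerCluster CRs and measures how long the operator
// needs until every corresponding kubeconfig secret exists.
func applyLoad(ctx context.Context, c client.Client, n int) (time.Duration, error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		gc := &unstructured.Unstructured{}
		gc.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "infrastructuremanager.kyma-project.io", Version: "v1", Kind: "GardenerCluster",
		})
		gc.SetName(fmt.Sprintf("load-test-%04d", i))
		gc.SetNamespace("kcp-system")
		if err := c.Create(ctx, gc); err != nil {
			return 0, err
		}
	}
	// Poll until every kubeconfig secret has been created by the operator.
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("kubeconfig-load-test-%04d", i)
		err := wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Minute, true,
			func(ctx context.Context) (bool, error) {
				var s corev1.Secret
				getErr := c.Get(ctx, types.NamespacedName{Namespace: "kcp-system", Name: name}, &s)
				return getErr == nil, nil
			})
		if err != nil {
			return 0, err
		}
	}
	return time.Since(start), nil
}
```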

Reasons
Before deploying the operator on production we must know its performance characteristics.

Infrastructure Manager - Add metrics, and alerts to improve observability

Description

The Infrastructure Manager should provide metrics to allow early issue detection.
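
For illustration, a minimal sketch of how a custom metric could be exposed through controller-runtime's Prometheus registry; the metric name and label are invented for the example:

```go
// A hedged sketch of exposing a custom metric via controller-runtime's
// Prometheus registry; the metric name and label are illustrative.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// kubeconfigRotations counts performed kubeconfig rotations per runtime so
// that a stalled rotation loop becomes visible on a dashboard.
var kubeconfigRotations = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "im_kubeconfig_rotations_total", // assumed name
		Help: "Total number of kubeconfig secret rotations performed.",
	},
	[]string{"runtime_id"},
)

func init() {
	// controller-runtime serves this registry on the manager's metrics endpoint.
	ctrlmetrics.Registry.MustRegister(kubeconfigRotations)
}
```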

Reasons

The Infrastructure Manager is a component that, in the long run, will be responsible for cluster creation. In case of downtime, the impact on the Kyma Control Plane would be significant. We must prevent that by improving observability.

Acceptance criteria

Ensure that relevant secret is removed when CR is deleted

Reason
If the pod is down (even for a short duration, such as 10 seconds) while the GardenerCluster CR is removed by KEB, the IM controller will not receive the deletion event and the corresponding secret will not be cleaned up.

What
A mechanism (e.g., owner references or finalizers) should be introduced to ensure that when the GardenerCluster CR is removed, the corresponding secret is also removed.
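
A minimal sketch of the owner-reference variant, assuming the project's API types (the imv1 import path and helper name are illustrative):

```go
// Sketch of the owner-reference option: once the GardenerCluster CR owns the
// secret, Kubernetes garbage collection removes the secret when the CR goes
// away, even if the controller was down at deletion time.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	imv1 "github.com/kyma-project/infrastructure-manager/api/v1" // assumed path
)

func setSecretOwner(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	cluster *imv1.GardenerCluster, secret *corev1.Secret) error {
	// Owner references only work within one namespace; here both objects
	// live in kcp-system, so that constraint is satisfied.
	if err := controllerutil.SetControllerReference(cluster, secret, scheme); err != nil {
		return err
	}
	return c.Update(ctx, secret)
}
```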

Errors are being thrown in logs when using force rotation.

Description
Errors are being thrown in logs when using force rotation.

Expected result

No errors should be thrown in logs when using force rotation.

Actual result

Errors are being thrown in logs when using force rotation.

2023-12-20T12:29:44Z    INFO    Rotation of secret kubeconfig-01568d6b-e96f-4106-b8f5-f5a745f0390d in namespace kcp-system forced. {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    ERROR   status update failed    {"error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    ERROR   Reconciler error        {"controller": "gardenercluster", "controllerGroup": "infrastructuremanager.kyma-project.io", "controllerKind": "GardenerCluster", "GardenerCluster": {"name":"01568d6b-e96f-4106-b8f5-f5a745f0390d","namespace":"kcp-system"}, "namespace": "kcp-system", "name": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "reconcileID": "f1f60c6e-15c4-45cb-bcde-a3c60b8ce864", "error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    INFO    Starting reconciliation.        {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    INFO    rotation params {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system", "lastSync": "0001-01-01 00:00:00", "requeueAfter": "6h50m24s"}
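
The "object has been modified" errors above are optimistic-concurrency conflicts: the status update was sent with a stale resourceVersion. A minimal sketch of one common remedy, assuming the project's API types and a Kubebuilder-style reconciler that embeds client.Client, is to re-read the object and retry the update on conflict:

```go
// Sketch of a conflict-tolerant status update; type names are assumptions.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	imv1 "github.com/kyma-project/infrastructure-manager/api/v1" // assumed path
)

type GardenerClusterReconciler struct {
	client.Client // embedded, as Kubebuilder scaffolds do
}

func (r *GardenerClusterReconciler) updateStatusWithRetry(ctx context.Context,
	key types.NamespacedName, mutate func(*imv1.GardenerCluster)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var cluster imv1.GardenerCluster
		// Re-read so the update carries a fresh resourceVersion.
		if err := r.Get(ctx, key, &cluster); err != nil {
			return err
		}
		mutate(&cluster)
		return r.Status().Update(ctx, &cluster)
	})
}
```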

Steps to reproduce

  1. (Probably not important) The cluster was first updated to k8s 1.27.6 and then hibernated before the rotation was forced.
  2. Force certificate rotation
  3. Check IM logs

/kind bug

Infrastructure Manager - Prepare migration script/Go program that will create GardenerCluster for each existing cluster

Description

Prepare a Go program/script that will iterate over Kyma resources. For each Kyma resource it will:

  1. Read labels from the Kyma resource
  2. Create GardenerCluster CR

The GardenerCluster CR must contain the fields defined here. The Kyma resource is created by KEB, and the labels it adds can be found here. Mind that the secret name is also defined by KEB.
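
A hedged sketch of such a program, using unstructured access to the Kyma resources; the Kyma GVK, the GardenerCluster version, and the label key used for the name are placeholders, not the actual contract linked above:

```go
// Migration sketch: iterate Kyma CRs and create a GardenerCluster CR for each.
// GVKs and label keys below are assumptions for illustration.
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	ctx := context.Background()
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// List all Kyma CRs created by KEB (GVK assumed).
	kymas := &unstructured.UnstructuredList{}
	kymas.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "KymaList",
	})
	if err := c.List(ctx, kymas, client.InNamespace("kcp-system")); err != nil {
		panic(err)
	}

	for _, kyma := range kymas.Items {
		labels := kyma.GetLabels()
		gc := &unstructured.Unstructured{}
		gc.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "infrastructuremanager.kyma-project.io", Version: "v1", Kind: "GardenerCluster",
		})
		gc.SetName(labels["kyma-project.io/runtime-id"]) // placeholder label key
		gc.SetNamespace(kyma.GetNamespace())
		gc.SetLabels(labels) // copy the KEB-provided labels onto the new CR
		if err := c.Create(ctx, gc); err != nil {
			fmt.Printf("skipping %s: %v\n", kyma.GetName(), err)
		}
	}
}
```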

Reasons
In order to migrate to the architecture in which the Infrastructure Manager is responsible for dynamic kubeconfig creation, some additional steps must be performed in the environment. When the Infrastructure Manager is deployed on the target environment, the existing Kyma clusters will need to be handled. The migration script is needed to make sure the Infrastructure Manager will control all the runtimes.

Infrastructure Manager - create initial project structure

Description

Create a minimal structure for Cluster Inventory Infrastructure Manager.

Acceptance criteria:

Stretch:

Reasons

In order to kick off the implementation, we need to define the code structure and create pipelines. We also need to define the interface for Kyma Environment Broker, which is supposed to create Cluster CRs.

Migrate logic from Provisioner into Kyma Infrastructure Manager [EPIC]

Description

The Provisioner has to be replaced by the Kyma Infrastructure Manager. The logic of the Provisioner has to be migrated into the Infrastructure Manager, while also considering already planned new features. This could require rethinking the current software architecture to ensure a flexible, extensible, and maintainable software structure for the Infrastructure Manager.

AC:

  • Review the current logical architecture of the Provisioner and its building blocks (what are the current features the Provisioner supports). Rethink how this logic could be arranged to ensure the extensibility of this code, so that new features can be added in a convenient way (e.g., think about a plugin mechanism / framework approach which allows an easy integration of new features). Just as an example: a possible option could be the introduction of a chain-of-responsibility pattern for Shoot-spec generation (see the toy sketch after this list).
    • Create an ADR for the new architecture
      • Think how local testing and debugging will be supported
      • If required: implement a tiny POC to show how the framework / plugin approach will technically work and demonstrate it to the team
    • Review and align the ADR with the team and get a common approval
  • Migrate the provisioner logic into the Infrastructure Manager by following the architectural decision defined in the previously created ADR
  • Set up a cutover plan for replacing the Provisioner with the Infrastructure Manager
    • Create a dedicated issue for the cutover process and list the required steps in chronological order, together with the owner of each step
    • Align the rollout plan with the SREs and the KEB team
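
To make the chain-of-responsibility idea mentioned above concrete, a toy sketch (every name here is invented for illustration):

```go
// A toy chain-of-responsibility sketch for Shoot-spec generation.
package shootspec

import gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"

// ShootMutator is one link in the chain; each link owns a single concern
// (networking, worker pools, DNS, ...).
type ShootMutator interface {
	Apply(shoot *gardencorev1beta1.Shoot) error
}

// MutatorFunc lets plain functions act as links.
type MutatorFunc func(*gardencorev1beta1.Shoot) error

func (f MutatorFunc) Apply(s *gardencorev1beta1.Shoot) error { return f(s) }

// BuildShoot runs the chain in order; adding a feature means adding a link,
// not touching the existing ones.
func BuildShoot(mutators ...ShootMutator) (*gardencorev1beta1.Shoot, error) {
	shoot := &gardencorev1beta1.Shoot{}
	for _, m := range mutators {
		if err := m.Apply(shoot); err != nil {
			return nil, err
		}
	}
	return shoot, nil
}
```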

Reasons

Replace the old Kyma Provisioner with the Kyma Infrastructure Manager to follow the new KCP architectural paradigm (K8s-native application).

Attachments

Improve unit testing in the main reconciliation loop

Description

While working on #95, #97, and #99, we noticed during the bigger changes in the corresponding code that the tests require improvement.

Reasons

This is a crucial part of the Infrastructure Manager that has to be tested correctly so that future enhancements or bug fixes do not cause regressions.

Attachments

  • related PR with some initial unit test improvements: #107

Setup end-2-end monitoring of KIM to detect service degradations and fire alerts

Description

As the Infrastructure Manager is a critical backend service of Kyma, monitoring its availability is essential to react in time to service degradations.

The goal is to set up an end-to-end test case for the Infrastructure Manager which verifies the correct functionality of this service on KCP. The test should be executed at intervals (e.g., hourly), create a full-fledged Gardener cluster, and destroy it afterwards.

If cluster creation was not possible, an alert should be fired (e.g., via the SRE monitoring system) to inform the Framefrog team about the service degradation.

AC:

  • Get in touch with SREs and verify how a full-fledged test case could be integrated into the existing monitoring solution in Kyma
  • Implement a test case which requests the KIM to create a Gardener cluster and finally also deletes it:
    • The test has to verify that the cluster got successfully created in Gardener
    • Check whether the cluster is accessible using the received kubeconfig from Gardener
    • Finally destroy the created Kyma cluster
  • Ensure a cleanup mechanism is in place which removes orphan clusters in case the test was not able to handle the cleanup as part of the test run.
  • Integrate the test case into the monitoring system (based on the guidance from SREs, see step 1) and ensure alerts are fired in case of KIM service degradation

Reasons

Ensure high quality and proactive service monitoring.

Attachments

[Threat Modelling] Configure audit logs to track changes applied on CRs and secrets

Reason
These important IM resources should be audit-logged.

Acceptance Criteria

Ensure following cases are recorded in the auditlog:

  • If an agent (app or a user) edits the GardenerCluster CR, we should see an audit log of that action
  • If an agent (app or a user) edits the secrets, we should see an audit log of that action
  • If an agent (app or a user) accesses the Gardener secret, we should see an audit log of that action
  • If any of the above is not recorded, consult security experts and prepare a mitigation plan

Documentation improvements

Description

Acceptance Criteria

  • Improve the section describing what has to be configured for IM to work
  • Describe the time rotation feature
  • Describe the force rotation feature

Infrastructure Manager - implement kubeconfig secret management

Description

The Infrastructure Manager must manage dynamic kubeconfigs.

Acceptance criteria:

  • Infrastructure Manager can be installed on Gardener cluster.
  • #37
  • #48
  • #39
  • Infrastructure Manager is periodically triggered to ensure secrets are rotated when needed.
  • It is possible to force a secret rotation by adding an annotation to the secret (see the sketch below).
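
A sketch of the detection side, with a hypothetical annotation key (the actual key is defined by the project):

```go
// Sketch: deciding whether rotation is due. The annotation key is hypothetical.
package controller

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const forceRotationAnnotation = "operator.kyma-project.io/force-kubeconfig-rotation" // assumed key

// secretNeedsRotation is true when the rotation period has elapsed or a user
// explicitly annotated the secret to force rotation.
func secretNeedsRotation(secret *corev1.Secret, lastSync time.Time, period time.Duration) bool {
	if secret.GetAnnotations()[forceRotationAnnotation] == "true" {
		return true
	}
	return time.Since(lastSync) >= period
}
```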

Reasons

In the long term, the Infrastructure Manager will replace the Provisioner. As a first step, it will be responsible for kubeconfig management in the Kyma Control Plane.

Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules

Description

With #11 in place, we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics/KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, triggering the required action to bring our business back on track. Finally, alerting rules have to be configured which inform us as soon as one of the thresholds is reached.

AC:

  • Think about technical and business critical metrics / KPIs which give a clear indication of the quality and health of the Infrastructure Manager
    • Define the reason why this metric is relevant and what it represents.
    • Define the thresholds (min <> max etc.) which indicate a service degradation or health issue of the Infrastructure Manager. If a metric has no threshold, verify whether it is still helpful for us to measure this value.
    • Specify the required action that has to be applied if a threshold is reached to recover the Infrastructure Manager into a productive and healthy state
    • Present the results in the team to collect the feedback of the colleagues.
  • Implement the identified business metrics in the Infrastructure Manager
  • Configure alerting rules which inform the team as soon as one of the thresholds is reached

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics/KPI measurement and alerting.

Extends #11

Attachments

RBAC kubeconfigs for Clusters

Description

There should be a possibility to issue a kubeconfig for the cluster with limited access/privileges.

Kubernetes allows for creating kubeconfigs for specific ServiceAccounts. Having such SA-based kubeconfig makes it possible to limit its use with proper Roles/ClusterRoles.
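
For instance, a minimal sketch using the Kubernetes TokenRequest API to mint a short-lived token for a ServiceAccount and wrap it into a kubeconfig; the names, token lifetime, and context layout are illustrative:

```go
// Sketch: issue an SA-scoped kubeconfig via the TokenRequest API.
package kubeconfig

import (
	"context"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// issueServiceAccountKubeconfig requests a short-lived token for the given
// ServiceAccount and wraps it into a kubeconfig that is limited by whatever
// Role/ClusterRole is bound to that ServiceAccount.
func issueServiceAccountKubeconfig(ctx context.Context, cs kubernetes.Interface,
	ns, sa, server string, caData []byte) (*clientcmdapi.Config, error) {

	expiry := int64(3600) // 1h token lifetime, illustrative
	tr, err := cs.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa,
		&authv1.TokenRequest{Spec: authv1.TokenRequestSpec{ExpirationSeconds: &expiry}},
		metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}

	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["skr"] = &clientcmdapi.Cluster{Server: server, CertificateAuthorityData: caData}
	cfg.AuthInfos[sa] = &clientcmdapi.AuthInfo{Token: tr.Status.Token}
	cfg.Contexts["default"] = &clientcmdapi.Context{Cluster: "skr", AuthInfo: sa}
	cfg.CurrentContext = "default"
	return cfg, nil
}
```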

Suggestions

this is just a proposal, feel free to refine/change/adapt it as you like

One of the options would be to have a new CRD used for issuing kubeconfigs - it could include the ServiceAccount information along with the Role/ClusterRole assigned to that ServiceAccount. Based on this, the Infrastructure Manager could create the SA and (Cluster)Role, issue a kubeconfig, and save it as a secret in the KCP.

Such a solution would require introducing a controller to handle those resources, but it would be a universal solution supporting multiple kubeconfigs issued for a single cluster (i.e. for KEB, KLM, and other KCP controllers that require cluster access).

Regarding the deletion logic: it can be solved with a finalizer that is set on all the CRs. When the deletion timestamp is picked up by the controller, the cluster resources (SAs, Roles, etc.) are dropped and the finalizer is removed (see the sketch below).
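
A sketch of that flow, with an invented finalizer name and an assumed cleanup helper:

```go
// Sketch of finalizer-based deletion; the finalizer name and the
// deleteClusterResources helper are assumptions for illustration.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const rbacFinalizer = "infrastructuremanager.kyma-project.io/rbac-cleanup" // assumed

type Reconciler struct {
	client.Client
}

func (r *Reconciler) reconcileDeletion(ctx context.Context, cr *unstructured.Unstructured) error {
	if cr.GetDeletionTimestamp().IsZero() {
		// Live object: ensure the finalizer so deletion waits for our cleanup.
		if controllerutil.AddFinalizer(cr, rbacFinalizer) {
			return r.Update(ctx, cr)
		}
		return nil
	}
	// Deletion requested: drop the SKR-side SAs, Roles, and bindings first.
	if err := r.deleteClusterResources(ctx, cr); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(cr, rbacFinalizer)
	return r.Update(ctx, cr)
}

// deleteClusterResources would remove the SKR-side SA, (Cluster)Role, and
// bindings; its implementation is out of scope for this sketch.
func (r *Reconciler) deleteClusterResources(ctx context.Context, cr *unstructured.Unstructured) error {
	return nil
}
```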

Reasons

It is generally recommended to keep the privileges required for specific roles minimal. Right now, the issued kubeconfigs are for the cluster-admin role, which allows unconstrained actions to be taken with this kubeconfig. From the security perspective, it would also be beneficial to differentiate between the entities connecting to the SKR. Separate kubeconfigs for KEB or KLM would make it transparent, from the audit-log perspective, which component took which action in the cluster.

Acceptance Criteria

this is just a proposal, feel free to refine those as you like

  • It is possible to request RBAC Kubeconfig
    • ServiceAccount spec is passed as part of the request
    • Role/ClusterRole is passed as part of the request
  • Requested resources are created in the SKR cluster
    • ServiceAccount
    • Role/ClusterRole
    • RoleBinding/ClusterRoleBinding
  • Kubeconfig is issued for the created ServiceAccount
  • Kubeconfig is saved as a K8S Secret
  • The K8s Secret containing the kubeconfig is referenced in the status of the request
  • Infrastructure Manager supports "graceful" deletion of deployed resources

Define testing concept for KIM

Description

For our release management and to fulfil SAP product standards, we have to document what our testing strategy for the KIM looks like.

Some example links to such documentation are available here: https://wiki.one.int.sap/wiki/display/kyma/Testing+Strategy+-+Link+summary

The acceptance criterion is that the testing strategy is documented.

AC:

Area
Kyma Infrastructure Manager

Reasons

Mandatory part of the delivery process and required for a fast creation of Microdeliveries.

Assignees

@kyma-project/technical-writers

Attachments

Set force-deletion flag when creating shoot-cluster

Description

Gardener now supports an option to force the deletion of a cluster (which avoids long waiting periods during de-provisioning, e.g., when the K8s cluster couldn't be gracefully stopped because of hanging finalizers).

We agreed to use this feature flag, and the Infrastructure Manager / Provisioner should set it properly.

AC:

  • The flag confirmation.gardener.cloud/force-deletion is set in the shoot specs of Gardener clusters (see the sketch below).
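
A minimal sketch of setting that annotation, assuming the Gardener v1beta1 Shoot type and the apimachinery annotation helper:

```go
// Sketch: confirm force-deletion on a Shoot before issuing the delete call.
package gardener

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func markForForceDeletion(shoot *gardencorev1beta1.Shoot) {
	// Gardener expects this confirmation annotation before it force-deletes.
	metav1.SetMetaDataAnnotation(&shoot.ObjectMeta,
		"confirmation.gardener.cloud/force-deletion", "true")
}
```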

Reasons

Enable/accept non-graceful shutdowns of Gardener clusters to avoid longer waiting periods during the de-provisioning.

Attachments

[Moved from Provisioner to KIM]

Infrastructure Manager - Dynamic kubeconfigs e2e test

Description

How it's going to be implemented is yet to be defined.

Reasons

Assure that the dynamic kubeconfigs feature is working e2e.

Acceptance criteria

  • Prepare Go code or a bash script performing the test (one possible shape is sketched below)
  • Prepare changes in the configuration/Makefile to allow running it in the CI/CD pipeline
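
One possible shape for the Go variant, sketched under assumptions: the secret is named "kubeconfig-<runtime-id>" (as seen in the logs earlier in this tracker), and the kubeconfig sits under an assumed "config" data key:

```go
// Sketch of the e2e check: read the IM-managed kubeconfig secret and prove
// the embedded kubeconfig can reach the runtime.
package e2e

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func verifyDynamicKubeconfig(ctx context.Context, kcp client.Client, runtimeID string) error {
	var secret corev1.Secret
	key := types.NamespacedName{Namespace: "kcp-system", Name: "kubeconfig-" + runtimeID}
	if err := kcp.Get(ctx, key, &secret); err != nil {
		return err
	}
	restCfg, err := clientcmd.RESTConfigFromKubeConfig(secret.Data["config"]) // assumed data key
	if err != nil {
		return err
	}
	cs, err := kubernetes.NewForConfig(restCfg)
	if err != nil {
		return err
	}
	// A cheap authenticated call proves the kubeconfig works end to end.
	_, err = cs.Discovery().ServerVersion()
	return err
}
```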

Attachments

/area control-plane
/kind feature
