infrastructure-manager's Issues

[Threat Modelling] Limit access to Gardener project kubeconfig

Reason

We're using a kubeconfig defined in gardener-kubeconfig-path. We should limit access to it to prevent unauthorized access to the Gardener project.

Acceptance criteria

  • review access rights to the gardener project kubeconfig and adjust them if needed

Multiple worker groups

Description
Enable the creation of multiple worker groups with different machine types, volume types, node labels, annotations, and taints.
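
To illustrate, a minimal sketch of what two such worker pools could look like with Gardener's v1beta1 API types; the machine types, volume type, labels, and taints are invented placeholders, not an agreed configuration:

```go
// A hedged sketch: two worker pools in a Gardener Shoot spec.
// Machine/volume types and labels are illustrative placeholders.
package worker

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/utils/ptr"
)

func workerPools() []gardencorev1beta1.Worker {
	return []gardencorev1beta1.Worker{
		{
			// General-purpose pool for stateless workloads.
			Name:    "general",
			Machine: gardencorev1beta1.Machine{Type: "m5.xlarge"},
			Minimum: 3,
			Maximum: 10,
		},
		{
			// Dedicated pool: bigger volumes, labeled and tainted so only
			// pods that tolerate the taint land here.
			Name:    "database",
			Machine: gardencorev1beta1.Machine{Type: "r5.2xlarge"},
			Minimum: 1,
			Maximum: 3,
			Volume:  &gardencorev1beta1.Volume{Type: ptr.To("gp3"), VolumeSize: "200Gi"},
			Labels:  map[string]string{"workload": "database"},
			Taints: []corev1.Taint{{
				Key: "workload", Value: "database", Effect: corev1.TaintEffectNoSchedule,
			}},
		},
	}
}
```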

Reasons
One size doesn't fit all. Many applications require specific nodes for particular services.

Migrate Prow jobs to Github Actions

Description

As Prow will be discontinued in 2024, we have to move the Prow jobs used for the Provisioner to an alternative CI/CD system. In our case, GitHub Actions is the preferred choice.

An overview of all existing Prow jobs is available here: https://github.com/search?q=repo%3Akyma-project%2Ftest-infra+framefrog&type=code&p=1

AC:

  • Identify which of the jobs listed in the URL above are required during the Infrastructure Manager development lifecycle and are relevant in the long term (these have to be migrated)
  • All Infrastructure Manager related Prow jobs are migrated to Github Actions

Reasons

Migrate CI/CD jobs from Prow to Github Actions as Prow will be discontinued in 2024.

Attachments

Infrastructure Manager - Perform load and stress test to verify operator's behaviour under load

Description

We should verify how the operator behaves under load. To increase the stability and reliability of the Infrastructure Manager, a performance test has to be implemented which verifies common use cases. The goal is to regularly measure our internally defined performance KPIs (benchmarking/load test), verify the limits of the application (stress test), and detect performance-critical behaviours before the Infrastructure Manager gets deployed on a productive landscape (no memory leaks etc.).

Acceptance criteria:

  • Identify the most relevant use-cases of the Infrastructure Manager
    • define input parameters (e.g. execute the test for 100, 1000, and 5000 CRDs)
    • specify the execution context/boundaries (how often the use case will be applied in parallel, limits for CPU/RAM consumption, max. execution time per test case etc.)
    • share the collected use-cases and the defined boundaries within the team and collect feedback
  • Learn what is the recommended way of load testing Kubebuilder projects
  • Implement the use-cases in a load test using one of the mainstream load testing tools (e.g. Grafana K6). This test has to cover
    • the creation of a load test landscape (e.g. by using a local K3d cluster or provisioning a Gardener Cluster) and deployment of a particular Infrastructure Manager version
    • ensure metrics of the Infrastructure Manager are recorded during the test execution
    • visualisation of the measured metrics in a Dashboard (e.g. Plutono)
    • mocks for 3rd party systems to avoid an overload of external systems (e.g. Gardener service)
  • Run the load test and increase the number of (parallel) workers until the application becomes unstable or crashes, to determine our maximum performance capacity (see the sketch after this list).
    • Document test results
  • Integrate the load test into the release process to detect critical performance changes between releases
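
As one possible starting point, a rough Go sketch of a load generator that applies N GardenerCluster CRs and waits for the resulting kubeconfig secrets; the CR version and the "kubeconfig-<name>" secret convention are assumptions based on the logs elsewhere in this tracker:

```go
// A rough load-generation sketch, not the agreed tooling. The CR version and
// the secret naming scheme are assumptions.
package loadtest

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyLoad creates n GardenerCluster CRs and measures how long the operator
// needs until every corresponding kubeconfig secret exists.
func applyLoad(ctx context.Context, c client.Client, n int) (time.Duration, error) {
	start := time.Now()
	for i := 0; i < n; i++ {
		gc := &unstructured.Unstructured{}
		gc.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "infrastructuremanager.kyma-project.io", Version: "v1", Kind: "GardenerCluster",
		})
		gc.SetName(fmt.Sprintf("load-test-%04d", i))
		gc.SetNamespace("kcp-system")
		if err := c.Create(ctx, gc); err != nil {
			return 0, err
		}
	}
	// Poll until every kubeconfig secret has been created by the operator.
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("kubeconfig-load-test-%04d", i)
		err := wait.PollUntilContextTimeout(ctx, time.Second, 10*time.Minute, true,
			func(ctx context.Context) (bool, error) {
				var s corev1.Secret
				getErr := c.Get(ctx, types.NamespacedName{Namespace: "kcp-system", Name: name}, &s)
				return getErr == nil, nil
			})
		if err != nil {
			return 0, err
		}
	}
	return time.Since(start), nil
}
```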

Reasons
Before deploying the operator on production we must know its performance characteristics.

Infrastructure Manager - Add metrics, and alerts to improve observability

Description

The Infrastructure Manager should provide metrics to allow early issue detection.
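
For illustration, a minimal sketch of how a custom metric could be exposed through controller-runtime's Prometheus registry; the metric name and label are invented for the example:

```go
// A hedged sketch of exposing a custom metric via controller-runtime's
// Prometheus registry; the metric name and label are illustrative.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// kubeconfigRotations counts performed kubeconfig rotations per runtime so
// that a stalled rotation loop becomes visible on a dashboard.
var kubeconfigRotations = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "im_kubeconfig_rotations_total", // assumed name
		Help: "Total number of kubeconfig secret rotations performed.",
	},
	[]string{"runtime_id"},
)

func init() {
	// controller-runtime serves this registry on the manager's metrics endpoint.
	ctrlmetrics.Registry.MustRegister(kubeconfigRotations)
}
```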

Reasons

The Infrastructure Manager is a component that, in the long run, will be responsible for cluster creation. In case of downtime, the impact on the Kyma Control Plane would be significant. We must prevent that by improving observability.

Acceptance criteria

Ensure that relevant secret is removed when CR is deleted

Reason
If the pod is down (even for a short duration, such as 10 seconds) while the GardenerCluster CR is removed by KEB, the IM controller will not receive the deletion event and the corresponding secret will not be cleaned up.

What
A mechanism (e.g., owner references or finalizers) should be introduced to ensure that when the GardenerCluster CR is removed, the corresponding secret is also removed.
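
A minimal sketch of the owner-reference variant, assuming the project's API types (the imv1 import path and helper name are illustrative):

```go
// Sketch of the owner-reference option: once the GardenerCluster CR owns the
// secret, Kubernetes garbage collection removes the secret when the CR goes
// away, even if the controller was down at deletion time.
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	imv1 "github.com/kyma-project/infrastructure-manager/api/v1" // assumed path
)

func setSecretOwner(ctx context.Context, c client.Client, scheme *runtime.Scheme,
	cluster *imv1.GardenerCluster, secret *corev1.Secret) error {
	// Owner references only work within one namespace; here both objects
	// live in kcp-system, so that constraint is satisfied.
	if err := controllerutil.SetControllerReference(cluster, secret, scheme); err != nil {
		return err
	}
	return c.Update(ctx, secret)
}
```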

Errors are being thrown in logs when using force rotation.

Description
Errors are being thrown in logs when using force rotation.

Expected result

No errors should be thrown in logs when using force rotation.

Actual result

Errors are being thrown in logs when using force rotation.

2023-12-20T12:29:44Z    INFO    Rotation of secret kubeconfig-01568d6b-e96f-4106-b8f5-f5a745f0390d in namespace kcp-system forced. {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    ERROR   status update failed    {"error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    ERROR   Reconciler error        {"controller": "gardenercluster", "controllerGroup": "infrastructuremanager.kyma-project.io", "controllerKind": "GardenerCluster", "GardenerCluster": {"name":"01568d6b-e96f-4106-b8f5-f5a745f0390d","namespace":"kcp-system"}, "namespace": "kcp-system", "name": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "reconcileID": "f1f60c6e-15c4-45cb-bcde-a3c60b8ce864", "error": "Operation cannot be fulfilled on gardenerclusters.infrastructuremanager.kyma-project.io \"01568d6b-e96f-4106-b8f5-f5a745f0390d\": the object has been modified; please apply your changes to the latest version and try again"}
2023-12-20T12:29:44Z    INFO    Starting reconciliation.        {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system"}
2023-12-20T12:29:44Z    INFO    rotation params {"GardenerCluster": "01568d6b-e96f-4106-b8f5-f5a745f0390d", "Namespace": "kcp-system", "lastSync": "0001-01-01 00:00:00", "requeueAfter": "6h50m24s"}
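
The "object has been modified" errors above are optimistic-concurrency conflicts: the status update was sent with a stale resourceVersion. A minimal sketch of one common remedy, assuming the project's API types and a Kubebuilder-style reconciler that embeds client.Client, is to re-read the object and retry the update on conflict:

```go
// Sketch of a conflict-tolerant status update; type names are assumptions.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	imv1 "github.com/kyma-project/infrastructure-manager/api/v1" // assumed path
)

type GardenerClusterReconciler struct {
	client.Client // embedded, as Kubebuilder scaffolds do
}

func (r *GardenerClusterReconciler) updateStatusWithRetry(ctx context.Context,
	key types.NamespacedName, mutate func(*imv1.GardenerCluster)) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var cluster imv1.GardenerCluster
		// Re-read so the update carries a fresh resourceVersion.
		if err := r.Get(ctx, key, &cluster); err != nil {
			return err
		}
		mutate(&cluster)
		return r.Status().Update(ctx, &cluster)
	})
}
```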

Steps to reproduce

  1. (Probably not important) The cluster was first updated to k8s 1.27.6 and then hibernated before the rotation was forced.
  2. Force certificate rotation
  3. Check IM logs

/kind bug

Infrastructure Manager - Prepare migration script/Go program that will create GardenerCluster for each existing cluster

Description

Prepare a Go program/script that will iterate over Kyma resources. For each Kyma resource it will:

  1. Read labels from the Kyma resource
  2. Create GardenerCluster CR

The GardenerCluster CR must contain the fields defined here. The Kyma resource is created by KEB, and the labels it adds can be found here. Mind that the secret name is also defined by KEB.
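
A hedged sketch of such a program, using unstructured access to the Kyma resources; the Kyma GVK, the GardenerCluster version, and the label key used for the name are placeholders, not the actual contract linked above:

```go
// Migration sketch: iterate Kyma CRs and create a GardenerCluster CR for each.
// GVKs and label keys below are assumptions for illustration.
package main

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	ctx := context.Background()
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// List all Kyma CRs created by KEB (GVK assumed).
	kymas := &unstructured.UnstructuredList{}
	kymas.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "operator.kyma-project.io", Version: "v1beta2", Kind: "KymaList",
	})
	if err := c.List(ctx, kymas, client.InNamespace("kcp-system")); err != nil {
		panic(err)
	}

	for _, kyma := range kymas.Items {
		labels := kyma.GetLabels()
		gc := &unstructured.Unstructured{}
		gc.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "infrastructuremanager.kyma-project.io", Version: "v1", Kind: "GardenerCluster",
		})
		gc.SetName(labels["kyma-project.io/runtime-id"]) // placeholder label key
		gc.SetNamespace(kyma.GetNamespace())
		gc.SetLabels(labels) // copy the KEB-provided labels onto the new CR
		if err := c.Create(ctx, gc); err != nil {
			fmt.Printf("skipping %s: %v\n", kyma.GetName(), err)
		}
	}
}
```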

Reasons
In order to migrate to the architecture in which the Infrastructure Manager is responsible for dynamic kubeconfig creation, some additional steps must be performed in the environment. When the Infrastructure Manager is deployed on the target environment, the existing Kyma clusters will need to be handled. The migration script is needed to make sure the Infrastructure Manager will control all the runtimes.

Infrastructure Manager - create initial project structure

Description

Create a minimal structure for Cluster Inventory Infrastructure Manager.

Acceptance criteria:

Stretch:

Reasons

In order to kick off the implementation, we need to define the code structure and create pipelines. We also need to define the interface for Kyma Environment Broker, which is supposed to create Cluster CRs.

Migrate logic from Provisioner into Kyma Infrastructure Manager [EPIC]

Description

The Provisioner has to be replaced by the Kyma Infrastructure Manager. The logic of the Provisioner has to be migrated into the Infrastructure Manager, while also considering already planned new features. This could require rethinking the current software architecture to ensure a flexible, extensible, and maintainable software structure for the Infrastructure Manager.

AC:

  • Review the current logical architecture of the Provisioner and its building blocks (what are the current features the Provisioner supports). Rethink how this logic could be arranged to ensure the extensibility of this code, so that new features can be added in a convenient way (e.g., think about a plugin mechanism / framework approach which allows an easy integration of new features). Just as an example: a possible option could be the introduction of a chain-of-responsibility pattern for Shoot-spec generation (see the toy sketch after this list).
    • Create an ADR for the new architecture
      • Think how local testing and debugging will be supported
      • If required: implement a tiny POC to show how the framework / plugin approach will technically work and demonstrate it to the team
    • Review and align the ADR with the team and get a common approval
  • Migrate the provisioner logic into the Infrastructure Manager by following the architectural decision defined in the previously created ADR
  • Set up a cutover plan for replacing the Provisioner with the Infrastructure Manager
    • Create a dedicated issue for the cutover process and list the required steps in chronological order, together with the owner of each step
    • Align the rollout plan with the SREs and the KEB team
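
To make the chain-of-responsibility idea mentioned above concrete, a toy sketch (every name here is invented for illustration):

```go
// A toy chain-of-responsibility sketch for Shoot-spec generation.
package shootspec

import gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"

// ShootMutator is one link in the chain; each link owns a single concern
// (networking, worker pools, DNS, ...).
type ShootMutator interface {
	Apply(shoot *gardencorev1beta1.Shoot) error
}

// MutatorFunc lets plain functions act as links.
type MutatorFunc func(*gardencorev1beta1.Shoot) error

func (f MutatorFunc) Apply(s *gardencorev1beta1.Shoot) error { return f(s) }

// BuildShoot runs the chain in order; adding a feature means adding a link,
// not touching the existing ones.
func BuildShoot(mutators ...ShootMutator) (*gardencorev1beta1.Shoot, error) {
	shoot := &gardencorev1beta1.Shoot{}
	for _, m := range mutators {
		if err := m.Apply(shoot); err != nil {
			return nil, err
		}
	}
	return shoot, nil
}
```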

Reasons

Replace the old Kyma Provisioner with the Kyma Infrastructure Manager to follow the new KCP architectural paradigm (K8s-native application).

Attachments

Improve unit testing in the main reconciliation loop

Description

While working on #95, #97, and #99, we noticed during the bigger changes in the corresponding code that the tests require improvement.

Reasons

This is a crucial part of the Infrastructure Manager that has to be tested correctly so that future enhancements or bug fixes do not cause regressions.

Attachments

  • related PR with some initial unit test improvements: #107

Setup end-2-end monitoring of KIM to detect service degradations and fire alerts

Description

As the Infrastructure Manager is a critical backend service of Kyma, monitoring its availability is essential to react in time to service degradations.

The goal is to set up an end-to-end test case for the Infrastructure Manager which verifies the correct functionality of this service on KCP. The test should be executed at intervals (e.g., hourly), create a full-fledged Gardener cluster, and destroy it afterwards.

If cluster creation was not possible, an alert should be fired (e.g., via the SRE monitoring system) to inform the Framefrog team about the service degradation.

AC:

  • Get in touch with SREs and verify how a full-fledged test case could be integrated into the existing monitoring solution in Kyma
  • Implement a test case which requests the KIM to create a Gardener cluster and finally also deletes it:
    • The test has to verify that the cluster got successfully created in Gardener
    • Check whether the cluster is accessible using the received kubeconfig from Gardener
    • Finally destroy the created Kyma cluster
  • Ensure a cleanup mechanism is in place which removes orphan clusters in case the test was not able to handle the cleanup as part of the test run.
  • Integrate the test case into the monitoring system (based on the guidance from SREs, see step 1) and ensure alerts are fired in case of KIM service degradation

Reasons

Ensure high quality and proactive service monitoring.

Attachments

[Threat Modelling] Configure audit logs to track changes applied on CRs and secrets

Reason
These important IM resources should be audit-logged.

Acceptance Criteria

Ensure following cases are recorded in the auditlog:

  • If an agent (app or a user) edits the GardenerCluster CR, we should see an audit log of that action
  • If an agent (app or a user) edits the secrets, we should see an audit log of that action
  • If an agent (app or a user) accesses the Gardener secret, we should see an audit log of that action
  • If any of the above is not recorded, consult security experts and prepare a mitigation plan

Documentation improvements

Description

Acceptance Criteria

  • Improve the section describing what has to be configured for IM to work
  • Describe the time rotation feature
  • Describe the force rotation feature

Infrastructure Manager - implement kubeconfig secret management

Description

The Infrastructure Manager must manage dynamic kubeconfigs.

Acceptance criteria:

  • Infrastructure Manager can be installed on Gardener cluster.
  • #37
  • #48
  • #39
  • Infrastructure Manager is periodically triggered to ensure secrets are rotated when needed.
  • It is possible to force a secret rotation by adding an annotation to the secret (see the sketch below).
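
A sketch of the detection side, with a hypothetical annotation key (the actual key is defined by the project):

```go
// Sketch: deciding whether rotation is due. The annotation key is hypothetical.
package controller

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

const forceRotationAnnotation = "operator.kyma-project.io/force-kubeconfig-rotation" // assumed key

// secretNeedsRotation is true when the rotation period has elapsed or a user
// explicitly annotated the secret to force rotation.
func secretNeedsRotation(secret *corev1.Secret, lastSync time.Time, period time.Duration) bool {
	if secret.GetAnnotations()[forceRotationAnnotation] == "true" {
		return true
	}
	return time.Since(lastSync) >= period
}
```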

Reasons

In the long term, the Infrastructure Manager will replace the Provisioner. As a first step, it will be responsible for kubeconfig management in the Kyma Control Plane.

Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules

Description

With #11 in place, we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics/KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, triggering the required action to bring our business back on track. Finally, alerting rules have to be configured which inform us as soon as one of the thresholds is reached.

AC:

  • Think about technical and business critical metrics / KPIs which give a clear indication of the quality and health of the Infrastructure Manager
    • Define the reason why this metric is relevant and what it represents.
    • Define the thresholds (min <> max etc.) which indicate a service degradation or health issue of the Infrastructure Manager. If a metric has no threshold, verify whether it is still helpful for us to measure this value.
    • Specify the required action that has to be applied if a threshold is reached to recover the Infrastructure Manager into a productive and healthy state
    • Present the results in the team to collect the feedback of the colleagues.
  • Implement the identified business metrics in the Infrastructure Manager
  • Configure alerting rules which inform the team as soon as one of the thresholds is reached

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics/KPI measurement and alerting.

Extends #11

Attachments

RBAC kubeconfigs for Clusters

Description

There should be a possibility to issue a kubeconfig for the cluster with limited access/privileges.

Kubernetes allows for creating kubeconfigs for specific ServiceAccounts. Having such SA-based kubeconfig makes it possible to limit its use with proper Roles/ClusterRoles.
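
For instance, a minimal sketch using the Kubernetes TokenRequest API to mint a short-lived token for a ServiceAccount and wrap it into a kubeconfig; the names, token lifetime, and context layout are illustrative:

```go
// Sketch: issue an SA-scoped kubeconfig via the TokenRequest API.
package kubeconfig

import (
	"context"

	authv1 "k8s.io/api/authentication/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	clientcmdapi "k8s.io/client-go/tools/clientcmd/api"
)

// issueServiceAccountKubeconfig requests a short-lived token for the given
// ServiceAccount and wraps it into a kubeconfig that is limited by whatever
// Role/ClusterRole is bound to that ServiceAccount.
func issueServiceAccountKubeconfig(ctx context.Context, cs kubernetes.Interface,
	ns, sa, server string, caData []byte) (*clientcmdapi.Config, error) {

	expiry := int64(3600) // 1h token lifetime, illustrative
	tr, err := cs.CoreV1().ServiceAccounts(ns).CreateToken(ctx, sa,
		&authv1.TokenRequest{Spec: authv1.TokenRequestSpec{ExpirationSeconds: &expiry}},
		metav1.CreateOptions{})
	if err != nil {
		return nil, err
	}

	cfg := clientcmdapi.NewConfig()
	cfg.Clusters["skr"] = &clientcmdapi.Cluster{Server: server, CertificateAuthorityData: caData}
	cfg.AuthInfos[sa] = &clientcmdapi.AuthInfo{Token: tr.Status.Token}
	cfg.Contexts["default"] = &clientcmdapi.Context{Cluster: "skr", AuthInfo: sa}
	cfg.CurrentContext = "default"
	return cfg, nil
}
```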

Suggestions

this is just a proposal, feel free to refine/change/adapt it as you like

One of the options would be to have a new CRD used for issuing kubeconfigs - it could include the ServiceAccount information along with the Role/ClusterRole assigned to that ServiceAccount. Based on this, the Infrastructure Manager could create the SA and (Cluster)Role, issue a kubeconfig, and save it as a secret in the KCP.

Such a solution would require introducing a controller to handle those resources, but it would be a universal solution supporting multiple kubeconfigs issued for a single cluster (i.e. for KEB, KLM, and other KCP controllers that require cluster access).

Regarding the deletion logic: it can be solved with a finalizer that is set on all the CRs. When the deletion timestamp is picked up by the controller, the cluster resources (SAs, Roles, etc.) are dropped and the finalizer is removed (see the sketch below).
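
A sketch of that flow, with an invented finalizer name and an assumed cleanup helper:

```go
// Sketch of finalizer-based deletion; the finalizer name and the
// deleteClusterResources helper are assumptions for illustration.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

const rbacFinalizer = "infrastructuremanager.kyma-project.io/rbac-cleanup" // assumed

type Reconciler struct {
	client.Client
}

func (r *Reconciler) reconcileDeletion(ctx context.Context, cr *unstructured.Unstructured) error {
	if cr.GetDeletionTimestamp().IsZero() {
		// Live object: ensure the finalizer so deletion waits for our cleanup.
		if controllerutil.AddFinalizer(cr, rbacFinalizer) {
			return r.Update(ctx, cr)
		}
		return nil
	}
	// Deletion requested: drop the SKR-side SAs, Roles, and bindings first.
	if err := r.deleteClusterResources(ctx, cr); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(cr, rbacFinalizer)
	return r.Update(ctx, cr)
}

// deleteClusterResources would remove the SKR-side SA, (Cluster)Role, and
// bindings; its implementation is out of scope for this sketch.
func (r *Reconciler) deleteClusterResources(ctx context.Context, cr *unstructured.Unstructured) error {
	return nil
}
```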

Reasons

It is generally recommended to keep the privileges required for specific roles minimal. Right now, the issued kubeconfigs are for the cluster-admin role, which allows unconstrained actions to be taken with this kubeconfig. From the security perspective, it would also be beneficial to differentiate between the entities connecting to the SKR. Separate kubeconfigs for KEB or KLM would make it transparent, from the audit-log perspective, which component took which action in the cluster.

Acceptance Criteria

this is just a proposal, feel free to refine those as you like

  • It is possible to request RBAC Kubeconfig
    • ServiceAccount spec is passed as part of the request
    • Role/ClusterRole is passed as part of the request
  • Requested resources are created in the SKR cluster
    • ServiceAccount
    • Role/ClusterRole
    • RoleBinding/ClusterRoleBinding
  • Kubeconfig is issued for the created ServiceAccount
  • Kubeconfig is saved as a K8S Secret
  • The K8s Secret containing the kubeconfig is referenced in the status of the request
  • Infrastructure Manager supports "graceful" deletion of deployed resources

Define testing concept for KIM

Description

For our release management and to fulfil SAP product standards, we have to document what our testing strategy for the KIM looks like.

Some example links to such documentation are available here: https://wiki.one.int.sap/wiki/display/kyma/Testing+Strategy+-+Link+summary

The acceptance criterion is that the testing strategy is documented.

AC:

Area
Kyma Infrastructure Manager

Reasons

Mandatory part of the delivery process and required for a fast creation of Microdeliveries.

Assignees

@kyma-project/technical-writers

Attachments

Set force-deletion flag when creating shoot-cluster

Description

Gardener now supports an option to force the deletion of a cluster (which avoids long waiting periods during de-provisioning, e.g., when the K8s cluster couldn't be gracefully stopped because of hanging finalizers).

We agreed to use this feature flag, and the Infrastructure Manager / Provisioner should set it properly.

AC:

  • The flag confirmation.gardener.cloud/force-deletion is set in the shoot specs of Gardener clusters (see the sketch below).
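
A minimal sketch of setting that annotation, assuming the Gardener v1beta1 Shoot type and the apimachinery annotation helper:

```go
// Sketch: confirm force-deletion on a Shoot before issuing the delete call.
package gardener

import (
	gardencorev1beta1 "github.com/gardener/gardener/pkg/apis/core/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func markForForceDeletion(shoot *gardencorev1beta1.Shoot) {
	// Gardener expects this confirmation annotation before it force-deletes.
	metav1.SetMetaDataAnnotation(&shoot.ObjectMeta,
		"confirmation.gardener.cloud/force-deletion", "true")
}
```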

Reasons

Enable/accept non-graceful shutdowns of Gardener clusters to avoid longer waiting periods during the de-provisioning.

Attachments

[Moved from Provisioner to KIM]

Infrastructure Manager - Dynamic kubeconfigs e2e test

Description

How it's going to be implemented is yet to be defined.

Reasons

Assure that the dynamic kubeconfigs feature is working e2e.

Acceptance criteria

  • Prepare Go code or a bash script performing the test (one possible shape is sketched below)
  • Prepare changes in the configuration/Makefile to allow running it in the CI/CD pipeline
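
One possible shape for the Go variant, sketched under assumptions: the secret is named "kubeconfig-<runtime-id>" (as seen in the logs earlier in this tracker), and the kubeconfig sits under an assumed "config" data key:

```go
// Sketch of the e2e check: read the IM-managed kubeconfig secret and prove
// the embedded kubeconfig can reach the runtime.
package e2e

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func verifyDynamicKubeconfig(ctx context.Context, kcp client.Client, runtimeID string) error {
	var secret corev1.Secret
	key := types.NamespacedName{Namespace: "kcp-system", Name: "kubeconfig-" + runtimeID}
	if err := kcp.Get(ctx, key, &secret); err != nil {
		return err
	}
	restCfg, err := clientcmd.RESTConfigFromKubeConfig(secret.Data["config"]) // assumed data key
	if err != nil {
		return err
	}
	cs, err := kubernetes.NewForConfig(restCfg)
	if err != nil {
		return err
	}
	// A cheap authenticated call proves the kubeconfig works end to end.
	_, err = cs.Discovery().ServerVersion()
	return err
}
```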

Attachments

/area control-plane
/kind feature
