ibm-roks-toolkit's Introduction

IBM-ROKS Toolkit

Overview

The IBM-ROKS toolkit is a set of tools and files that enables running OpenShift 4.x on IBM Public Cloud in a hyperscale manner, with many control planes hosted on a central management cluster. This tool was jointly developed by Red Hat and IBM.

Getting Started

Install on standalone environment

  • Run make build to build the binary
  • Construct a "cluster.yaml" to define custom parameters for the cluster. Example found here: cluster.yaml.example
  • Construct a "pull-secret.txt" to provide authentication to pull from desired docker registries. Example found here: pull-secret.txt.example
  • Construct and run the render command, with optional fields below: ./bin/ibm-roks render
    • output-dir: Specify the directory where manifest files should be output (default ./manifests)
    • config: Specify the config file for this cluster (default ./cluster.yaml)
    • pull-secret: Specify the pull secret used to pull from desired docker registries (default ./pull-secret.txt)
    • pki-dir: Specify the directory where the input PKI files have been placed (default ./pki)
    • include-secrets: If true, PKI secrets will be included in rendered manifests (default false)
    • include-etcd: If true, etcd manifests will be included in rendered manifests (default false)
    • include-autoapprover: If true, includes a simple autoapprover pod in manifests (default false)
    • include-vpn: If true, includes a VPN server, sidecar and client (default false)
    • include-registry: If true, includes a default registry config to deploy into the user cluster (default false)
  • Apply all the generated resources to the cluster: kubectl apply -f output-dir/
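
The steps above can be strung together roughly as follows. This is a minimal sketch, not the project's own tooling: it assumes each option is passed as a `--name` flag and that booleans accept `--name=true`, and the `run` and `install_standalone` helpers are hypothetical names introduced only for this illustration. It defaults to a dry run that prints the commands instead of executing them.

```shell
#!/bin/sh
# Sketch of the standalone install flow. Flag names mirror the option list above;
# passing them as --name flags (and booleans as --name=true) is an assumption.
set -eu

# run: print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# install_standalone OUTPUT_DIR: build, render, then apply the rendered manifests.
install_standalone() {
  out_dir="${1:-./manifests}"
  run make build
  run ./bin/ibm-roks render \
    --config ./cluster.yaml \
    --pull-secret ./pull-secret.txt \
    --pki-dir ./pki \
    --output-dir "$out_dir" \
    --include-etcd=true
  run kubectl apply -f "$out_dir/"
}

install_standalone "${1:-./manifests}"
```

Setting DRY_RUN=0 in the environment makes the script execute the commands instead of printing them.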

Release Process

Creating a new release

New releases for the toolkit are created via pull requests.

  1. Run hack/bump-release.sh. This will increment the date in the release-date file.
  2. Submit a pull request with this change.
  3. Once the PR is merged, a post submit job will automatically be kicked off to publish the release.
    • You can track the status of the post release jobs for all branches here.
  4. (optional) Cherry-pick the change to other branches.
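
Steps 1 and 2 might look like the following from a local checkout. This is only a sketch: the branch name, commit message, and remote name `origin` are hypothetical, and the `run` and `prepare_release` helpers are introduced here for illustration. It prints the commands by default rather than running them.

```shell
#!/bin/sh
# Sketch of preparing a release PR. Branch name, commit message, and the
# "origin" remote are assumptions, not project conventions.
set -eu

# run: print the command when DRY_RUN=1 (the default here), execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# prepare_release BRANCH: bump the release date on a new branch and push it.
prepare_release() {
  branch="${1:-bump-release}"
  run git checkout -b "$branch"
  run ./hack/bump-release.sh
  run git commit -am "Bump release date"
  run git push origin "$branch"
}

prepare_release "${1:-bump-release}"
```

The pull request itself is then opened from the pushed branch through the usual GitHub flow.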

Changing the base image

The base images specified in the Dockerfiles are for testing only. When releases are created and images published, the base images are substituted. Master branch substitution definitions can be found here, with the other release branches located in the same directory. There is a golang base, which is used for the first stage of the Dockerfiles, and a roks-toolkit-base, which is used for the second stage. The roks-toolkit-base image is further defined here. To bump the base image for all toolkit images, submit a PR to update the image in that file. Once merged, follow the steps for Creating a new release.

ibm-roks-toolkit's People

Contributors

attiss, bryan-cox, csrwng, davidhay1969, enxebre, evan-reilly, jeffnowicki, jmcmeek, jonesbr17, joseph-goergen, lucaspalm, mihivagyok, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, relyt0925, riendeau, rtheis, ryan-cradick, shalver, thrasher-redhat, wmlynch

ibm-roks-toolkit's Issues

Cluster deploy timing issue may leave role bindings missing for hours

There is a timing issue that may leave role bindings missing for hours after a cluster deployment (see the example OpenShift API server logs below). The missing shared-resource-viewers role binding causes oc new-app --name myapp https://github.com/openshift/nodejs-ex.git to fail to build with: error: build error: After retrying 2 times, Pull image still failed due to error: unauthorized: authentication required. There are likely other impacts beyond this example. The missing role bindings are eventually created hours later, after which oc new-app works.

E0413 16:44:39.080537       1 storage_rbac.go:316] unable to reconcile rolebinding.rbac.authorization.k8s.io/shared-resource-viewers in openshift: rolebindings.rbac.authorization.k8s.io "shared-resource-viewers" is forbidden: could not list rolebinding restrictions: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)
E0413 16:42:58.333934       1 storage_rbac.go:316] unable to reconcile rolebinding.rbac.authorization.k8s.io/system:node-config-reader in openshift-node: rolebindings.rbac.authorization.k8s.io "system:node-config-reader" is forbidden: could not list rolebinding restrictions: the server could not find the requested resource (get rolebindingrestrictions.authorization.openshift.io)

ROKS metrics component is missing CPU and memory requests

The new ROKS metrics component is missing CPU and memory requests, which causes the OCP conformance test [sig-arch] Managed cluster should ensure control plane pods do not run in best-effort QoS [Suite:openshift/conformance/parallel] to fail.

Use by-digest instead of by-tag pullspecs for installation

Telemetry out of ROKS includes entries like cluster_version{type="initial",image="registry.ng.bluemix.net/armada-master/ocp-release:4.6.22-x86_64",...}. Using by-tag pullspecs from trusted registries is not dire, but pivoting to by-digest pullspecs protects you from compromised registries, mutating tags, and other excitement that can happen as an image flows out of Red Hat's build pipeline (with a signature) and over to the new cluster, until the cluster eventually updates to a by-digest pullspec. See openshift/oc#390 for some details on mutable-tag concerns. Can we adjust to:

  1. Use by-digest pullspecs.
  2. Check that we have a valid Red Hat signature on releases in one of the official signature stores before using the release image to launch the cluster.
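
As a rough illustration of the first request only, a pre-flight check could reject release images that are not pinned by digest. The sketch below is an assumption about how such a guard might look, not the toolkit's behavior. The function names are hypothetical, and signature verification (the second request) is not covered.

```shell
#!/bin/sh
# is_by_digest PULLSPEC: succeed only when the pullspec ends in @sha256:<64 hex chars>.
is_by_digest() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# require_by_digest PULLSPEC: print the pullspec if it is pinned by digest,
# otherwise report the problem on stderr and fail.
require_by_digest() {
  if is_by_digest "$1"; then
    printf '%s\n' "$1"
  else
    echo "refusing by-tag pullspec: $1" >&2
    return 1
  fi
}
```

With this in place, a by-tag value such as registry.ng.bluemix.net/armada-master/ocp-release:4.6.22-x86_64 would be rejected, while a quay.io/...@sha256:... pullspec would pass through unchanged.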

control-plane-operator causing excessive load on ROKS tugboat apiserver

Copying from https://github.ibm.com/alchemy-containers/armada-update/issues/2617

From tugboat apiserver logs, counts of reads and writes by control-plane-operator for one cluster over about 2 minutes:

    161 verb="GET" URI="/apis/apps/v1/namespaces/master-c465jhg20s3mckhh6s80/deployments/openshift-apiserver" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    161 verb="PUT" URI="/apis/apps/v1/namespaces/master-c465jhg20s3mckhh6s80/deployments/openshift-apiserver" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    161 verb="PUT" URI="/api/v1/namespaces/master-c465jhg20s3mckhh6s80/configmaps/openshift-apiserver-config" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    162 verb="GET" URI="/apis/apps/v1/namespaces/master-c465jhg20s3mckhh6s80/deployments/openshift-controller-manager" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    162 verb="PUT" URI="/apis/apps/v1/namespaces/master-c465jhg20s3mckhh6s80/deployments/openshift-controller-manager" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    162 verb="PUT" URI="/api/v1/namespaces/master-c465jhg20s3mckhh6s80/configmaps/openshift-controller-manager-config" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    322 verb="GET" URI="/api/v1/namespaces/master-c465jhg20s3mckhh6s80/configmaps/openshift-apiserver-config" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
    324 verb="GET" URI="/api/v1/namespaces/master-c465jhg20s3mckhh6s80/configmaps/openshift-controller-manager-config" userAgent="control-plane-operator/v0.0.0 (linux/amd64) kubernetes/$Format" 
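
Counts like the ones above can be produced with a small pipeline. This is a sketch of one way to do it, not necessarily how the numbers in this report were gathered. It assumes each apiserver log line carries verb="...", URI="..." and userAgent="..." fields in that order, and the function name is hypothetical.

```shell
#!/bin/sh
# count_requests: read apiserver log lines on stdin, keep only requests made by
# control-plane-operator, and print "count verb URI" sorted by ascending count.
count_requests() {
  grep 'userAgent="control-plane-operator' |
    sed -E 's/.*(verb="[^"]*").*(URI="[^"]*").*/\1 \2/' |
    sort | uniq -c | sort -n
}
```

For example, the log of an apiserver pod could be piped into count_requests to see which objects the operator reads and writes most often.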

According to the performance team, the control-plane-operator logs contain many entries like these:

kubectl logs control-plane-operator-68d6d7d445-g4czf -n master-c6adpbk20mj47823vl7g --tail=50
2021-11-23T13:23:49.077Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer configmap
2021-11-23T13:23:49.116Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager deployment
2021-11-23T13:23:49.258Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer deployment
I1123 13:23:49.489453       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16ba2fabd0aa669c  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ObservedConfigChanged,Message:Writing updated observed config:   map[string]interface{}{
          "build":    map[string]interface{}{"buildDefaults": map[string]interface{}{"resources": map[string]interface{}{}}, "imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d83f2ed0b41ad5f5b08d775a61f78c2459b4938e7cc53d3fb75ec68f672e8e48")}},
          "deployer": map[string]interface{}{"imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56a377e0f3e48105f5dc0d3d70e1d821fcb5f282023f0a5410b7df9d5b617a65")}},
-         "dockerPullSecret": map[string]interface{}{
-                 "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
-         },
  }
,Source:EventSource{Component:,Host:,},FirstTimestamp:2021-11-23 13:23:49.489338012 +0000 UTC m=+527581.065721643,LastTimestamp:2021-11-23 13:23:49.489338012 +0000 UTC m=+527581.065721643,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
2021-11-23T13:23:49.545Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager configmap
2021-11-23T13:23:49.682Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager deployment
I1123 13:23:49.911168       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16ba2fabe9cd1eea  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ObservedConfigChanged,Message:Writing updated observed config:   map[string]interface{}{
-         "imagePolicyConfig": map[string]interface{}{
-                 "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
-         },
          "projectConfig": map[string]interface{}{"projectRequestMessage": string("")},
  }
,Source:EventSource{Component:,Host:,},FirstTimestamp:2021-11-23 13:23:49.911043818 +0000 UTC m=+527581.487427205,LastTimestamp:2021-11-23 13:23:49.911043818 +0000 UTC m=+527581.487427205,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
2021-11-23T13:23:49.959Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer configmap
2021-11-23T13:23:50.028Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer deployment
I1123 13:23:50.490253       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16ba2fac0c513aee  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ObservedConfigChanged,Message:Writing updated observed config:   map[string]interface{}{
          "build":    map[string]interface{}{"buildDefaults": map[string]interface{}{"resources": map[string]interface{}{}}, "imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d83f2ed0b41ad5f5b08d775a61f78c2459b4938e7cc53d3fb75ec68f672e8e48")}},
          "deployer": map[string]interface{}{"imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56a377e0f3e48105f5dc0d3d70e1d821fcb5f282023f0a5410b7df9d5b617a65")}},
-         "dockerPullSecret": map[string]interface{}{
-                 "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
-         },
  }
,Source:EventSource{Component:,Host:,},FirstTimestamp:2021-11-23 13:23:50.490127086 +0000 UTC m=+527582.066510970,LastTimestamp:2021-11-23 13:23:50.490127086 +0000 UTC m=+527582.066510970,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
2021-11-23T13:23:50.525Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager configmap
2021-11-23T13:23:50.645Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager deployment
I1123 13:23:50.911327       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16ba2fac256a3e8a  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ObservedConfigChanged,Message:Writing updated observed config:   map[string]interface{}{
-         "imagePolicyConfig": map[string]interface{}{
-                 "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
-         },
          "projectConfig": map[string]interface{}{"projectRequestMessage": string("")},
  }
,Source:EventSource{Component:,Host:,},FirstTimestamp:2021-11-23 13:23:50.91119681 +0000 UTC m=+527582.487580430,LastTimestamp:2021-11-23 13:23:50.91119681 +0000 UTC m=+527582.487580430,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
I1123 13:23:51.490272       1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.16ba2fac47ec105e  dummy    0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] []  []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ObservedConfigChanged,Message:Writing updated observed config:   map[string]interface{}{
          "build":    map[string]interface{}{"buildDefaults": map[string]interface{}{"resources": map[string]interface{}{}}, "imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d83f2ed0b41ad5f5b08d775a61f78c2459b4938e7cc53d3fb75ec68f672e8e48")}},
          "deployer": map[string]interface{}{"imageTemplateFormat": map[string]interface{}{"format": string("quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:56a377e0f3e48105f5dc0d3d70e1d821fcb5f282023f0a5410b7df9d5b617a65")}},
-         "dockerPullSecret": map[string]interface{}{
-                 "internalRegistryHostname": string("image-registry.openshift-image-registry.svc:5000"),
-         },
  }
,Source:EventSource{Component:,Host:,},FirstTimestamp:2021-11-23 13:23:51.490130014 +0000 UTC m=+527583.066514718,LastTimestamp:2021-11-23 13:23:51.490130014 +0000 UTC m=+527583.066514718,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
2021-11-23T13:23:51.795Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer configmap
2021-11-23T13:23:51.796Z        INFO        control-plane-operator.OpenShiftControllerManagerClient        Updating OpenShift Controller Manager configmap
2021-11-23T13:23:52.087Z        INFO        control-plane-operator.OpenShiftAPIServerClient        Updating OpenShift APIServer deployment

I believe the "Writing updated observed config" events (recorder_logging.go lines) are from updateObservedConfig - https://github.com/openshift/library-go/blob/release-4.9/pkg/operator/configobserver/config_observer_controller.go#L184

And that is called by a sync function - https://github.com/openshift/library-go/blob/release-4.9/pkg/operator/configobserver/config_observer_controller.go#L162

I don't see the configmaps or deployments actually changing over time. The event output suggests that the objects seen by the sync code are always missing those fields - yet I see them in the configmaps.

It seems like that logic is either not seeing the current configmaps or not comparing actual / expected properly.

Final note: In my test deployment this behavior stops after 20 minutes or so. The performance team sees that continuously for all clusters, possibly because they deploy clusters without workers?

Cannot use multi-arch images for rendering the manifest

The failure occurs when trying to get the release image info.

Rendering fails with an image from a secured registry, even when the pull secret is provided. The same image works with oc adm release info from the oc 4.12 client.

# grep releaseImage cluster.yaml
releaseImage: registry.ng.bluemix.net/armada-multi-master/ocp-release:4.11.4-multi
# ./ibm-roks render --pull-secret ~/.docker/config.json
FATA[0000] Error occurred rendering manifests            error="unable to read image registry.ng.bluemix.net/armada-multi-master/ocp-release:4.11.4-multi: Head \"https://registry.ng.bluemix.net/v2/armada-multi-master/ocp-release/manifests/4.11.4-multi\": unauthorized: The login credentials are not valid, or your IBM Cloud account is not active."

Rendering also fails with the image from quay.io; this failure is related to the manifest list.

# grep releaseImage cluster.yaml
releaseImage: quay.io/openshift-release-dev/ocp-release:4.11.4-multi
# ./ibm-roks render --pull-secret ~/.docker/config.json
FATA[0001] Error occurred rendering manifests            error="unable to parse image quay.io/openshift-release-dev/ocp-release:4.11.4-multi: unknown image manifest of type *manifestlist.DeserializedManifestList from manifest sha256:53679d92dc0aea8ff6ea4b6f0351fa09ecc14ee9eda1b560deeb0923ca2290a1"
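
The second error suggests the renderer received a manifest list where it expected a single-image manifest. As a hedged illustration of telling the two apart, the sketch below inspects a manifest body for the Docker manifest list and OCI image index media types. It is not the toolkit's code, the function name is hypothetical, and it relies on the mediaType field being present in the body, which an OCI index may omit.

```shell
#!/bin/sh
# is_manifest_list: read a registry manifest (JSON) on stdin and succeed when it
# declares a Docker manifest list or an OCI image index media type.
is_manifest_list() {
  grep -Eq '"mediaType"[[:space:]]*:[[:space:]]*"application/vnd\.(docker\.distribution\.manifest\.list\.v2|oci\.image\.index\.v1)\+json"'
}
```

A renderer that detects a manifest list would then need to select the entry for the desired platform before parsing the image, which this sketch does not attempt.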

Service Monitor Deployment Race condition

There's a race condition at cluster initialization that causes the manifest bootstrapper pod to fail continuously and crash loop until the first worker is successfully provisioned in the cluster and the monitoring operator can roll out and initialize the servicemonitors.monitoring.coreos.com CRD.

securitycontextconstraints.security.openshift.io            2020-10-24T00:01:18Z
servicecas.operator.openshift.io                            2020-10-24T00:01:38Z
servicemonitors.monitoring.coreos.com                       2020-10-24T00:15:27Z

You can see that the CRD's creation comes a couple of minutes after the first worker node comes up in my cluster:

apiVersion: v1
kind: Node
metadata:
  annotations:
    projectcalico.org/IPv4Address: 10.93.34.24/26
    projectcalico.org/IPv4IPIPTunnelAddr: 172.30.32.192
  creationTimestamp: "2020-10-24T00:13:08Z"

That then initializes the CRD and everything completes. I believe this race condition, which causes the manifest bootstrapper pod to CrashLoop until the servicemonitors CRD is initialized, can be removed if we either instantiate the CRD with the manifest bootstrapper or rework its creation.

It ultimately completes on the first iteration after a node successfully joins the OpenShift cluster and runs the cluster-monitoring-operator pod (cluster-monitoring-operator-f7b47f45-kw7c4).

For more details, see where the manifest is defined:
https://github.com/openshift/cluster-monitoring-operator/blob/170f91faabc9683a34df29d1d892027292ed0296/manifests/0000_50_cluster-monitoring-operator_00_0servicemonitor-custom-resource-definition.yaml

There are two that I see get applied:
https://github.com/openshift/ibm-roks-toolkit/blob/master/assets/roks-metrics/roks-metrics-servicemonitor.yaml
https://github.com/openshift/ibm-roks-toolkit/blob/release-4.4/assets/cluster-bootstrap/cluster-kube-apiserver-servicemonitor.yaml

This also might be fine to accept but just thought I'd point it out.
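
One possible way to rework the creation, offered only as a sketch and not as the project's chosen fix, is to poll for the CRD before applying the ServiceMonitor manifests, so that the pod waits instead of crashing. The `wait_for` helper below is a hypothetical name, and the attempt count and delay in the example comment are arbitrary.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND...: rerun COMMAND until it succeeds,
# sleeping DELAY_SECONDS between tries; fail after ATTEMPTS unsuccessful tries.
wait_for() {
  attempts="$1"
  delay="$2"
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example (not executed here): apply the ServiceMonitor only once the CRD exists.
#   wait_for 60 10 kubectl get crd servicemonitors.monitoring.coreos.com &&
#     kubectl apply -f roks-metrics-servicemonitor.yaml
```

This trades the crash loop for a bounded wait. Whether that is preferable to creating the CRD up front is a design decision for the maintainers.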
