crossplane / crossplane
The Cloud Native Control Plane
Home Page: https://crossplane.io
License: Apache License 2.0
- LastTransitionTime: 2018-11-03T23:38:54Z
  Message: "InvalidParameterValue: The parameter MasterUserPassword is not a valid
    password. Only printable ASCII characters besides '/', '@', '\"', ' ' may
    be used.\n\tstatus code: 400, request id: a21fa840-2734-4b97-9b85-444fdfd7157f"
  Reason: Failed to create DBInstance
  Status: "False"
  Type: Failed
@ichekrygin has done some good work recently in the RDS controller and those patterns should be applied to the CloudSQL controller too:
Currently we "assume" the default values for some of the spec properties.
Such values are never reflected back in the specs.
For example, if RDSInstanceSpec.ConnectionSecretRef is not provided, we will use
RDSInstance.Name and RDSInstance.Namespace.
Consider creating a first reconcile pass dedicated to "Finalizer" and "Default Values" only.
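A dedicated defaults pass could look roughly like this sketch. The types and field names below are simplified stand-ins for the real RDSInstance spec, and the pattern is simply: default, report whether anything changed, and persist before doing any provider work.

```go
package main

import "fmt"

// Simplified stand-ins for the real spec types.
type ConnectionSecretRef struct {
	Name      string
	Namespace string
}

type RDSInstanceSpec struct {
	ConnectionSecretRef ConnectionSecretRef
}

// applyDefaults fills in unset spec fields and reports whether anything
// changed, so the reconciler can update the spec and requeue before doing
// any cloud provider work.
func applyDefaults(spec *RDSInstanceSpec, name, namespace string) bool {
	changed := false
	if spec.ConnectionSecretRef.Name == "" {
		spec.ConnectionSecretRef.Name = name
		changed = true
	}
	if spec.ConnectionSecretRef.Namespace == "" {
		spec.ConnectionSecretRef.Namespace = namespace
		changed = true
	}
	return changed
}

func main() {
	spec := &RDSInstanceSpec{}
	if applyDefaults(spec, "rdssample", "conductor-system") {
		fmt.Println("spec defaulted:", spec.ConnectionSecretRef.Name)
	}
}
```

Because the defaults are written back to the spec, later passes see explicit values instead of re-deriving them.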
AWS and GCP are supported, Microsoft Azure support is still needed. From the roadmap:
TestApplyConnectionSecret
seems to fail every now and then.
The AWS controllers require running in the target environment currently, so they cannot work on Minikube. There are a number of assumptions that are not met when running outside the AWS environment:
Example error from conductor logs when running on Minikube and trying to create a RDS instance (note the blank region):
2018/09/18 01:22:05 getting ec2 client from clientset...
2018/09/18 01:22:05 found node with name 'minikube' in region ''
...
2018/09/18 01:22:28 Trying to find db instance rdssample
2018/09/18 01:22:28 wasn't able to describe the db instance with id rdssample: MissingRegion: could not find region configuration
This issue is tracking a quick fix to get AWS working on Minikube, longer term issues for targeting remote cloud providers from a Kubernetes control plane are #3 and #20.
For each PR and each commit to master, we should have a continuous integration pipeline that runs and validates the build with both unit tests and integration tests. The Rook project has a good example of this that uses a Jenkins instance for running each build. The build pipeline should also be able to release and publish a build and its artifacts, but that can be tracked in a separate issue.
Resources from Rook:
Currently generated passwords for managed resources are not recoverable and require managed resource tear down and re-creation if password information is lost.
Example:
RDSInstance/CloudSQLInstance Connection Secret - if lost, there appears to be no way to recover it and the only option is to tear down and re-create the instance.
We currently use a full ObjectReference
on abstract resources to identify the ResourceClass
to be used. For example:
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
...
spec:
  classReference:
    name: standard
    namespace: conductor-system
In this approach a developer would need to know the name of the class (standard)
and the namespace in which the conductor operator is running (conductor-system).
One approach is to simplify the classReference and just use a className,
assuming that the namespace is wherever the conductor operator is running.
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
...
spec:
  className: standard
This approach is closer to how a StorageClass is identified in the PVC design,
with the caveat that storage classes are cluster scoped while ResourceClass
is namespace scoped.
This also opens up the whole discussion of how an application developer finds out about available resource classes. Do they kubectl get rc -n conductor-system? Do they have access to list classes in that namespace? Can they look at the spec of the resource class, or would it contain sensitive information?
PostgreSQL has a lot of traction among open source software users and would make a good addition to the existing MySQL support. We need to research further the specific support that each of the cloud providers have for PostgreSQL. From the roadmap:
Consider breaking this issue down and tracking per each cloud provider.
All controllers should support the processing of their resources in parallel, i.e. a long operation to create a cloud provider resource should not block the creation of other instances of that same type of resource.
We also need to consider the allowed parallelism for managing a single resource instance. For example, if a long operation is running in a controller for a specific resource instance, and the CRD for that instance is updated, should the controller process that update at the same time? Should it block until the first update is done? All controllers should have a level triggered (as opposed to edge triggered) design, but how should new resource events be handled when an existing event is already being processed?
This is also very related to locking, resiliency and idempotence.
Currently we use a host
or url
field inside a connection secret object. We should consider using a Service
object for that instead. Possibly using an ExternalIP or something else.
Related to #48
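One possible shape, sketched with made-up names and an illustrative endpoint, would be an ExternalName Service that the controller publishes instead of a raw host/url field in the secret:

```yaml
# Illustrative only: names and the external endpoint are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: rdssample
  namespace: demo
spec:
  type: ExternalName
  externalName: rdssample.abc123xyz.us-east-1.rds.amazonaws.com
```

Applications would then connect to a stable in-cluster DNS name while the controller keeps the external endpoint up to date.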
Implement for AWS RDS Instance.
Use Go constant values for:
We should consider implementing and following the Core Infrastructure Initiative (CII) Best Practices, similar to the Rook project which is being tracked at https://bestpractices.coreinfrastructure.org/en/projects/1599#
There are a lot of great items in there that demonstrate a mature and well managed project.
However, we need to figure out the relation to CLOMonitor: https://clomonitor.io/projects/cncf/crossplane. Are both required for Graduation? Should we only focus on one of them?
Currently ResourceClass
is generic and not tied to a specific abstract resource, which means that just listing them and using names like standard
or high-performance
is not sufficient for usage. It would be easy to use the wrong resource class and have the binding fail.
One solution is to adopt a naming convention like mysqlinstance-standard
and assume that a developer can make sense of these.
Another solution is for us to go back to strongly typed resource classes like MySqlInstanceClass
, which would make it clear that these only apply to MySqlInstance. This approach has other merits: 1) we can strongly type the parameters
section; 2) it might enable us to create a set of structured properties that are relevant to app developers when selecting the class (for example, there could be a property called "iops"); 3) it enables more granular RBAC rules on resource classes.
Currently, we don't have support for finalizers. Thus, deleting a CRD instance leaves behind an "orphan" managed resource.
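A minimal sketch of the helpers such support would need; the finalizer name is illustrative, and in a real reconciler the slice would come from the object's ObjectMeta:

```go
package main

import "fmt"

const finalizerName = "finalizer.conductor.io" // illustrative name

// addFinalizer returns the slice with f present exactly once, so the delete
// of the CRD blocks until the controller cleans up the managed resource.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer returns the slice with f removed, allowing the API server
// to complete the deletion once external cleanup has succeeded.
func removeFinalizer(finalizers []string, f string) []string {
	out := finalizers[:0]
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fs := addFinalizer(nil, finalizerName)
	fmt.Println(fs)
	fs = removeFinalizer(fs, finalizerName)
	fmt.Println(fs)
}
```

The reconciler would add the finalizer on first sight of the object and remove it only after the external resource is confirmed deleted.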
When a change to a CRD spec is made in a *types.go
file, it would be great to have that change also updated to the CRD definition in cluster/charts/conductor/crds
. A common example is to add or rename a new field. Vanilla kubebuilder projects have this functionality and it looks to come from https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/crd/generator/generator.go.
We should incorporate this into our make
build process as well.
We should document the API via swagger and other conventions used by the Kubernetes project.
This issue is tracking the cleanup (deletion) of CloudSQL instances, related to #48
We should establish a formal project governance. Rook has a good example that may be too mature for our early project needs, but is a good model to follow: https://github.com/rook/rook/blob/master/GOVERNANCE.md
The Cloud SQL controller currently uses the CRD instance name as the Cloud SQL instance name. When the Cloud SQL instance is deleted, the name cannot be used again for up to a week, so re-creating the same CRD would fail with a 409 conflict. From the FAQ:
If I delete my instance, can I reuse the instance name? Yes, but not right away. The instance name is unavailable for up to a week before it can be reused.
We need to think of a better approach that will allow CRDs to be deleted and immediately recreated, but also properly links the external resource in GCP to the Kubernetes world.
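One possible approach, sketched below as an assumption rather than a decided design, is to derive the external name from the CRD name plus a stable suffix of the CRD's UID. Re-created CRDs get a fresh UID and therefore a fresh external name, while the name remains traceable to the object:

```go
package main

import (
	"fmt"
	"strings"
)

// externalName derives a cloud instance name from the CRD name and a short
// suffix of its UID. The suffix length and separator are illustrative.
func externalName(crdName, uid string) string {
	suffix := strings.ReplaceAll(uid, "-", "")
	if len(suffix) > 8 {
		suffix = suffix[:8]
	}
	return fmt.Sprintf("%s-%s", crdName, suffix)
}

func main() {
	// e.g. "cloudsql-demo-3c2c1f6e"
	fmt.Println(externalName("cloudsql-demo", "3c2c1f6e-bb56-11e8-9c9c-0800274f2bdc"))
}
```

The derived name would also need to be recorded in status so the controller can find the external resource again across restarts.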
The most robust way to provide information on the status of a CRD instance is to use the concept of "conditions". From the links below:
Conditions represent the latest available observations of an object's current state. Objects may report multiple conditions, and new types of conditions may be added in the future. Therefore, conditions are represented using a list/slice, where all have similar structure.
Background reading:
https://github.com/kubernetes/community/blob/930ce655/contributors/devel/api-conventions.md#spec-and-status
kubernetes/kubernetes#7856
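Following the conventions above, a condition list could be managed with a small helper like this sketch; real types would use metav1.Time and typed condition/status constants rather than plain strings:

```go
package main

import (
	"fmt"
	"time"
)

// Condition follows the shape described in the Kubernetes API conventions.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition replaces the condition of the same Type, updating the
// transition time only when the Status actually changes.
func setCondition(conds []Condition, c Condition) []Condition {
	c.LastTransitionTime = time.Now()
	for i, existing := range conds {
		if existing.Type == c.Type {
			if existing.Status == c.Status {
				// No transition: keep the original timestamp.
				c.LastTransitionTime = existing.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "Failed", Status: "False"})
	fmt.Println(len(conds))
}
```

Keeping one entry per condition type is what lets clients observe, for example, both a Ready and a Failed condition on the same object.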
This repository will serve as a useful set of patterns for managing cloud provider resources in kubernetes. These patterns should be brainstormed and documented in a shared document that serves to capture design elements as well as best practices for developers implementing controllers for new cloud resources.
Concrete resources like RDSInstance
and resource classes are considered cluster-wide resources that can be used by any abstract resource in any namespace. This was modeled after PVCs and PVs/StorageClasses in Kubernetes, where PVCs are namespace scoped and PVs and SCs are cluster scoped.
One problem we ran into when making concrete resources and resource classes cluster-scoped CRDs was that they required secrets (and possibly configmaps), which are unfortunately namespace scoped. So a cluster-scoped resource would still need a namespace to hold its secrets, configmaps, etc.
Also the conductor operator pod/deployment itself needs to run in a namespace.
As a result, we decided to go with a singleton namespace design, where conductor-system is the single namespace used for all concrete resources, resource classes, and their secrets. Essentially the namespace conductor-system
is like a cluster-scoped namespace. The admin can specify a different namespace than conductor-system
but currently we only support one.
We should think through whether we want to support concrete resources being created in other namespaces besides conductor-system.
In Kubernetes 1.11, server side printing for kubectl get
can be enhanced with additionalPrinterColumns
on the CRD specs to show additional useful information:
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#additional-printer-columns
We should consider adding additionalPrinterColumns to our CRDs to enhance their usability and user experience.
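For example, a CRD could surface provider state and age directly in kubectl get output; the JSONPaths below assume a .status.state field, which is an assumption about our status layout:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: rdsinstances.database.aws.conductor.io
spec:
  # group, names, and version elided
  additionalPrinterColumns:
  - name: State
    type: string
    description: Provider-reported state of the instance
    JSONPath: .status.state
  - name: Age
    type: date
    JSONPath: .metadata.creationTimestamp
```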
For the first release, we should include useful and helpful user guides and walkthroughs so that first time visitors to the project can successfully try it out and have a good experience with the software.
We should consider publishing these to a docs website, similar to https://rook.io/docs/rook/v0.8/, which could be a separate ticket. The important part of this ticket is to ensure that our user documentation is well written, easy to follow, and high quality.
We need to decide if this project will use a Developer Certificate of Origin (DCO) or a Contributor License Agreement (CLA) to ensure that contributors are allowed to make a contribution and that this project has the right to distribute it. There are pros/cons to both approaches.
As discussed in this PR, we need to figure out how are we providing cloud credentials. This could be as simple as a local reference to a Secret, or perhaps a better approach - cloud credentials provider CRD/Operator.
Currently, the controllers are using the default means of discovering cloud provider credentials, which means they have to be running in the target environment (e.g. to deploy Cloud SQL, the controller has to be running inside GCP). These controllers should be able to run from anywhere (even local minikube) and still manage all cloud provider resources as long as they have been provided valid credentials.
Decide on when to record events
// Fetch Provider Secret
secret, err := r.kubeclient.CoreV1().Secrets(request.Namespace).Get(instance.Spec.SecretKey.Name, metav1.GetOptions{})
if err != nil {
	r.recorder.Event(instance, corev1.EventTypeWarning, "Error", err.Error())
	return reconcile.Result{}, err
}

// Retrieve credentials.json
data, ok := secret.Data[instance.Spec.SecretKey.Key]
if !ok {
	setInvalid(&instance.Status, fmt.Sprintf("invalid GCP Provider secret, %s data is not found", instance.Spec.SecretKey.Key), "")
	return reconcile.Result{}, r.Update(ctx, instance)
}
In the first block, we are trying to retrieve a secret, and if the secret is not found (i.e. an unmet requirement), we simply break out from the reconciler with an error, while not changing the provider status (i.e. it is neither invalid nor valid). At the same time, we record the event, so that the user has a reference point on why this provider is not becoming valid.
In the second block, we have retrieved the secret (i.e. all requirements are satisfied), and we begin credentials validation. If the secret format is invalid (in this case we cannot find the data element with the provided key), we mark this provider as invalid and break out from the reconcile loop. Since we have updated the status, the user has a reference on the status/state of this provider, including the Reason.
The next reconcile iteration will attempt to re-process (re-validate) the secret credentials, possibly resetting the state of this provider to Valid or Invalid (for this or other reasons).
We should be generating and publishing full documentation for the codebase using godoc
, which means we need to have fully commented code in order to generate the docs. This is especially important for the library/utility functions that will help developers add support for more resources. This should be published publicly for dev reference.
Links:
https://blog.golang.org/godoc-documenting-go-code
https://godoc.org/golang.org/x/tools/cmd/godoc
To clarify semantics: we should consider removing the kubebuilder tool dependency while testing. Our test harness could start the API server and etcd in-process if we need to, without bringing in the 100MB+ kubebuilder tool set.
The default value of 10 hours appears to be too long:
// SyncPeriod determines the minimum frequency at which watched resources are
// reconciled. A lower period will correct entropy more quickly, but reduce
// responsiveness to change if there are many watched resources. Change this
// value only if you know what you are doing. Defaults to 10 hours if unset.
SyncPeriod *time.Duration
We need to decide on a more appropriate value with a much lower period; however, it should not be so low as to trigger cloud provider API rate limits.
All controllers should continually drive towards desired state, even in the face of failures. They should retry, operations should be idempotent, they should be able to handle partially applied state, etc.
From the roadmap:
The reconcile loop can pick up and process the same resource more than once. In most cases this is benign if the CR and the managed resource have a strong affinity link; for example, the managed resource's name contains the UID of the CR object. However, there are cases where we don't have such a strong affinity, for example MySQLInstance -> RDSInstance.
Create an example application to demonstrate the use of Conductor components.
---
# GCP Admin service account secret - used to provision GCP resources
apiVersion: v1
kind: Secret
metadata:
name: gcp-provider-creds
namespace: demo
type: Opaque
data:
credentials.json: base64encrypteddata
---
# GCP SQL user service account - used by application cloudsql-proxy side car
# - list cloudsql intances
# - connect to a cloudsql instance
apiVersion: v1
kind: Secret
metadata:
name: gcp-sql-creds
namespace: demo
type: Opaque
data:
credentials.json: base64encrypteddata
---
# Database credentials - used by cloudsql provider to bootstrap database instance
# and by the application to establish connection
apiVersion: v1
kind: Secret
metadata:
name: database-credentials
namespace: demo
type: Opaque
data:
username: base64encrypteddata
password: base64encrypteddata
---
apiVersion: gcp.conductor.io/v1alpha1
kind: Provider
metadata:
name: gcp-provider
spec:
credentialsSecretRef:
name: gcp-provider-creds
key: credentials.json
projectID: demo-google-project
permissions:
# blank permission means - no validation (on permissions)
---
apiVersion: database.gcp.conductor.io/v1alpha1
kind: CloudsqlInstance
metadata:
labels:
name: cloudsql-demo-20049 # TODO: is this instance the same as instance name?
spec:
# somewhere here we need gcp-provider-reference
projectID: demo-google-project # TODO: same as in Provider
tier: db-n1-standard-1
region: us-west2
databaseVersion: MYSQL_5_7
storageType: PD_SSD
databaseName: demo-database ## TODO: do we support this?
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: wordpress
labels:
app: wordpress
spec:
strategy:
type: Recreate
template:
metadata:
labels:
app: wordpress
tier: frontend
spec:
containers:
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.11
command: ["/cloud_sql_proxy",
"-instances=demo-google-project:us-west2:cloudsql-demo20049=tcp:3306",
"-credential_file=/secrets/cloudsql/credentials.json"]
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
- name: wordpress
image: wordpress:4.6.1-apache
env:
# managed database environment variables
- name: WORDPRESS_DB_HOST
value: 127.0.0.1 # cloudsql-proxy sidecar
- name: WORDPRESS_DB_NAME
value: demo-database # TODO - same as in CloudsqlInstance
- name: WORDPRESS_DB_USER
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_USER
- name: WORDPRESS_DB_PASSWORD
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_PASSWORD
ports:
- containerPort: 80
name: wordpress
volumes:
- name: cloudsql-instance-credentials
secret:
secretName: gcp-sql-creds
---
apiVersion: v1
kind: Service
metadata:
name: wordpress
labels:
app: wordpress
spec:
ports:
- port: 80
selector:
app: wordpress
tier: frontend
type: LoadBalancer
I think we should support types for a database and user and not just database instances. This would enable applications to model their databases and users independently from the service instance running them.
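Sketched out, such types might look like the following; the kinds and fields here are hypothetical, not implemented:

```yaml
# Hypothetical kinds; names and fields are illustrative.
apiVersion: database.aws.conductor.io/v1alpha1
kind: Database
metadata:
  name: wordpress-db
spec:
  instanceRef: rdssample       # the RDSInstance hosting this database
---
apiVersion: database.aws.conductor.io/v1alpha1
kind: DatabaseUser
metadata:
  name: wordpress-user
spec:
  databaseRef: wordpress-db
  passwordSecretRef:
    name: database-credentials
    key: password
```

Modeling databases and users as their own resources would let multiple applications share one instance while owning their schemas and credentials independently.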
The controller manager that hosts all the controllers should be managed by a Deployment
for resiliency and life-cycle management. For high availability, at least 2 replicas should be running and they should use leader election to determine which is active/authoritative.
The etcd-operator has an example of leader election: https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L121
For example: RDSInstance
--db-instance-identifier (string)
The DB instance identifier. This parameter is stored as a lowercase string.
Constraints:
Must contain from 1 to 63 letters, numbers, or hyphens.
First character must be a letter.
Can't end with a hyphen or contain two consecutive hyphens.
We should be able to assert these constraints via schema validation.
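For example, a CRD validation block along these lines could encode all three constraints. Note that CRD patterns are matched with Go's RE2 engine, which has no lookaheads, so the "no trailing or consecutive hyphens" rule is expressed structurally instead:

```yaml
validation:
  openAPIV3Schema:
    properties:
      spec:
        properties:
          name:
            type: string
            maxLength: 63
            # first character must be a letter; every hyphen must be followed
            # by an alphanumeric, which rules out trailing and doubled hyphens
            pattern: '^[a-zA-Z](-?[a-zA-Z0-9])*$'
```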
In general, the controller Reconcile functions should not wait for operations to finish, they should set status and exit, then check again on the next reconciliation attempt.
From @ichekrygin:
the reconcile loop execution should be quick (success/fail/otherwise), i.e. it should not block/wait for anything. This way we can address every item in the queue in a timely manner.
Discussion:
#67 (comment)
We should decide on a single logging package and be consistent with it. The logging package should support:
log.Printf
is not sufficient for these goals.
Some of the up front design work we should do would be to work backwards: start identifying some user scenarios and author the yaml that we'd like to support for those scenarios. This will inform the user experience that we want to provide and help determine the scope and priority of the various controllers and resources.
Currently classReference
is required for dynamic provisioning, and if it's omitted we are thinking of supporting a default class.
It would be more useful if instead of just a single default class, we could select the best available class that meets the requirements of the abstract resource. Abstract resources would need to specify more granular requirements for that to be useful, for example, for MySqlInstance
they might need to specify the requirements on IOPS.
The end result here is that the abstract resource would not need to know about classes at all, and instead define more granular requirements that get matched heuristically.
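As a sketch of what that could look like (the requirements block and its field names are hypothetical, not an implemented API):

```yaml
# Hypothetical requirements-based matching; fields are illustrative.
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
metadata:
  name: orders-db
spec:
  engineVersion: "5.7"
  requirements:
    minIOPS: 1000
    encrypted: true
```

The binder would then pick whichever ResourceClass satisfies the stated requirements, with no class name in the abstract resource at all.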
Across the 3 currently supported cloud providers, there is some common (duplicated) code that could be refactored to single and reusable implementations. We should take a pass through the providers and refactor to remove this duplication.
A nice demo that shows a lot of breadth of cloud resource management would be to run Sock Shop using the controllers and resources in this repo. We should be able to deploy all resources used by Sock Shop with a single yaml file.
I wanted to capture some thoughts about a pattern that I think we should use in conductor. Consider the following RDS example:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  class: db.t2.small        # type of the db instance
  engine: postgres          # which engine to use: postgres, mysql, aurora-postgresql, etc.
  name: pgsql               # name of the database at the provider
  password:                 # link to database secret
    key: rdsPassword        # the key in the secret
    name: rds-secret        # the name of the secret
  username: postgres        # database username
  size: 10                  # size in GB
  backupretentionperiod: 10 # days to keep backups, 0 means disable
  encrypted: true           # should the database be encrypted
  #iops: 1000               # number of iops
  multiaz: true             # multi AZ support
  storagetype: gp2          # type of the underlying storage
In order for us to actually create the RDS instance we need AWS credentials, AWS region, VPC information and possibly more. One option would be to explicitly add all of these into the spec of each type, something like:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  region: us-east-1
  awsCredentialsRef: mycreds
This approach is too restrictive and assumes that the author of the database instance has all the information needed. Instead I believe we should identify a set of properties that we think are "contextual" and can be specified at the namespace and/or cluster level, and potentially by another person. For example, consider a new cloud provider type:
apiVersion: aws.conductor.io/v1alpha1
kind: Provider
metadata:
  name: my-aws
  annotations:
    conductor.io/is-default-provider: true
spec:
  region: us-east-1
  awsCredentialsRef: my-aws-secret
An instance of the provider can be created at the namespace or at the cluster level. When an RDS instance is created in the same namespace, we can default to the default provider (marked by the conductor.io/is-default-provider
annotation) in the same namespace, or the default one in the cluster. The user can also specify the provider by name on the RDS type:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  providerRef: my-aws
It would also be great to show the selected provider in the status section of the instance.
This approach enables a healthy separation of concern where the creator of the RDS instance does not need to know a number of contextual properties about it. A cluster-admin could set the credentials, region, and other settings. We could have more granular RBAC policies for who has access to what.
Finally, it keeps the specs of each type more concise. The same model should apply to network isolation (VPCs etc.), and other constructs.