crossplane / crossplane
The Cloud Native Control Plane
Home Page: https://crossplane.io
License: Apache License 2.0
- LastTransitionTime: 2018-11-03T23:38:54Z
  Message: "InvalidParameterValue: The parameter MasterUserPassword is not a valid
    password. Only printable ASCII characters besides '/', '@', '\"', ' ' may
    be used.\n\tstatus code: 400, request id: a21fa840-2734-4b97-9b85-444fdfd7157f"
  Reason: Failed to create DBInstance
  Status: "False"
  Type: Failed
@ichekrygin has done some good work recently in the RDS controller and those patterns should be applied to the CloudSQL controller too:
Currently we "assume" the default values for some of the spec properties.
Such values are never reflected back in the specs.
For example, if RDSInstanceSpec.ConnectionSecretRef is not provided, we will use
RDSInstance.Name and RDSInstance.Namespace.
Consider creating a first reconcile pass dedicated to "Finalizer" and "Default Values" only.
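A dedicated defaults pass could look roughly like this sketch. The types and field names below are simplified stand-ins for the real RDSInstance spec, and the pattern is simply: default, report whether anything changed, and persist before doing any provider work.

```go
package main

import "fmt"

// Simplified stand-ins for the real spec types.
type ConnectionSecretRef struct {
	Name      string
	Namespace string
}

type RDSInstanceSpec struct {
	ConnectionSecretRef ConnectionSecretRef
}

// applyDefaults fills in unset spec fields and reports whether anything
// changed, so the reconciler can update the spec and requeue before doing
// any cloud provider work.
func applyDefaults(spec *RDSInstanceSpec, name, namespace string) bool {
	changed := false
	if spec.ConnectionSecretRef.Name == "" {
		spec.ConnectionSecretRef.Name = name
		changed = true
	}
	if spec.ConnectionSecretRef.Namespace == "" {
		spec.ConnectionSecretRef.Namespace = namespace
		changed = true
	}
	return changed
}

func main() {
	spec := &RDSInstanceSpec{}
	if applyDefaults(spec, "rdssample", "conductor-system") {
		fmt.Println("spec defaulted:", spec.ConnectionSecretRef.Name)
	}
}
```

Because the defaults are written back to the spec, later passes see explicit values instead of re-deriving them.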
AWS and GCP are supported, Microsoft Azure support is still needed. From the roadmap:
TestApplyConnectionSecret
seems to fail every now and then.
The AWS controllers require running in the target environment currently, so they cannot work on Minikube. There are a number of assumptions that are not met when running outside the AWS environment:
Example error from conductor logs when running on Minikube and trying to create a RDS instance (note the blank region):
2018/09/18 01:22:05 getting ec2 client from clientset...
2018/09/18 01:22:05 found node with name 'minikube' in region ''
...
2018/09/18 01:22:28 Trying to find db instance rdssample
2018/09/18 01:22:28 wasn't able to describe the db instance with id rdssample: MissingRegion: could not find region configuration
This issue is tracking a quick fix to get AWS working on Minikube, longer term issues for targeting remote cloud providers from a Kubernetes control plane are #3 and #20.
For each PR and each commit to master, we should have a continuous integration pipeline that runs and validates the build with both unit tests and integration tests. The Rook project has a good example of this that uses a Jenkins instance for running each build. The build pipeline should also be able to release and publish a build and its artifacts, but that can be tracked in a separate issue.
Resources from Rook:
Currently generated passwords for managed resources are not recoverable and require managed resource tear down and re-creation if password information is lost.
Example:
RDSInstance/CloudSQLInstance Connection Secret - if lost, there appears to be no way to recover it and the only option is to tear down and re-create the instance.
We currently use a full ObjectReference
on abstract resources to identify the ResourceClass
to be used. For example:
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
...
spec:
  classReference:
    name: standard
    namespace: conductor-system
In this approach a developer would need to know the name of the class (standard)
and the namespace in which the conductor operator is running (conductor-system).
One approach is to simplify the classReference and just use a className,
assuming that the namespace is wherever the conductor operator is running.
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
...
spec:
  className: standard
This approach is closer to how a StorageClass is identified in the PVC design,
with the caveat that storage classes are cluster scoped while ResourceClass
is namespace scoped.
This also opens up the whole discussion of how an application developer finds out about available resource classes. Do they kubectl get rc -n conductor-system? Do they have access to list classes in that namespace? Can they look at the spec of the resource class, or would it contain sensitive information?
PostgreSQL has a lot of traction among open source software users and would make a good addition to the existing MySQL support. We need to research further the specific support that each of the cloud providers have for PostgreSQL. From the roadmap:
Consider breaking this issue down and tracking per each cloud provider.
All controllers should support the processing of their resources in parallel, i.e. a long operation to create a cloud provider resource should not block the creation of other instances of that same type of resource.
We also need to consider the allowed parallelism for managing a single resource instance. For example, if a long operation is running in a controller for a specific resource instance, and the CRD for that instance is updated, should the controller process that update at the same time? Should it block until the first update is done? All controllers should have a level triggered (as opposed to edge triggered) design, but how should new resource events be handled when an existing event is already being processed?
This is also very related to locking, resiliency and idempotence.
Currently we use a host
or url
field inside a connection secret object. We should consider using a Service
object for that instead. Possibly using an ExternalIP or something else.
Related to #48
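One possible shape, sketched with made-up names and an illustrative endpoint, would be an ExternalName Service that the controller publishes instead of a raw host/url field in the secret:

```yaml
# Illustrative only: names and the external endpoint are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: rdssample
  namespace: demo
spec:
  type: ExternalName
  externalName: rdssample.abc123xyz.us-east-1.rds.amazonaws.com
```

Applications would then connect to a stable in-cluster DNS name while the controller keeps the external endpoint up to date.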
Implement for AWS RDS Instance.
Use Go constant values for:
We should consider implementing and following the Core Infrastructure Initiative (CII) Best Practices, similar to the Rook project which is being tracked at https://bestpractices.coreinfrastructure.org/en/projects/1599#
There are a lot of great items in there that demonstrate a mature and well managed project.
However, we need to figure out the relation to CLOMonitor: https://clomonitor.io/projects/cncf/crossplane. Are both required for Graduation? Should we only focus on one of them?
Currently ResourceClass
is generic and not tied to a specific abstract resource, which means that just listing them and using names like standard
or high-performance
is not sufficient for usage. It would be easy to use the wrong resource class and have the binding fail.
One solution is to adopt a naming convention like mysqlinstance-standard
and assume that a developer can make sense of these.
Another solution is for us to go back to strongly typed resource classes like MySqlInstanceClass
, which would make it clear that these only apply to MySqlInstance. This approach has other merits: 1) we can strongly type the parameters
section; 2) it might enable us to create a set of structured properties that are relevant to app developers when selecting the class (for example, there could be a property called "iops"); 3) it enables more granular RBAC rules on resource classes.
Currently, we don't have support for finalizers. Thus, deleting a CRD instance leaves behind an "orphan" managed resource.
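A minimal sketch of the helpers such support would need; the finalizer name is illustrative, and in a real reconciler the slice would come from the object's ObjectMeta:

```go
package main

import "fmt"

const finalizerName = "finalizer.conductor.io" // illustrative name

// addFinalizer returns the slice with f present exactly once, so the delete
// of the CRD blocks until the controller cleans up the managed resource.
func addFinalizer(finalizers []string, f string) []string {
	for _, existing := range finalizers {
		if existing == f {
			return finalizers
		}
	}
	return append(finalizers, f)
}

// removeFinalizer returns the slice with f removed, allowing the API server
// to complete the deletion once external cleanup has succeeded.
func removeFinalizer(finalizers []string, f string) []string {
	out := finalizers[:0]
	for _, existing := range finalizers {
		if existing != f {
			out = append(out, existing)
		}
	}
	return out
}

func main() {
	fs := addFinalizer(nil, finalizerName)
	fmt.Println(fs)
	fs = removeFinalizer(fs, finalizerName)
	fmt.Println(fs)
}
```

The reconciler would add the finalizer on first sight of the object and remove it only after the external resource is confirmed deleted.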
When a change to a CRD spec is made in a *types.go
file, it would be great to have that change also updated to the CRD definition in cluster/charts/conductor/crds
. A common example is to add or rename a new field. Vanilla kubebuilder projects have this functionality and it looks to come from https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/crd/generator/generator.go.
We should incorporate this into our make
build process as well.
We should document the API via swagger and other conventions used by the Kubernetes project.
This issue is tracking the cleanup (deletion) of CloudSQL instances, related to #48
We should establish a formal project governance. Rook has a good example that may be too mature for our early project needs, but is a good model to follow: https://github.com/rook/rook/blob/master/GOVERNANCE.md
The Cloud SQL controller currently uses the CRD instance name as the Cloud SQL instance name. When the Cloud SQL instance is deleted, the name cannot be used again for up to a week, so re-creating the same CRD would fail with a 409 conflict. From the FAQ:
If I delete my instance, can I reuse the instance name? Yes, but not right away. The instance name is unavailable for up to a week before it can be reused.
We need to think of a better approach that will allow CRDs to be deleted and immediately recreated, but also properly links the external resource in GCP to the Kubernetes world.
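One possible approach, sketched below as an assumption rather than a decided design, is to derive the external name from the CRD name plus a stable suffix of the CRD's UID. Re-created CRDs get a fresh UID and therefore a fresh external name, while the name remains traceable to the object:

```go
package main

import (
	"fmt"
	"strings"
)

// externalName derives a cloud instance name from the CRD name and a short
// suffix of its UID. The suffix length and separator are illustrative.
func externalName(crdName, uid string) string {
	suffix := strings.ReplaceAll(uid, "-", "")
	if len(suffix) > 8 {
		suffix = suffix[:8]
	}
	return fmt.Sprintf("%s-%s", crdName, suffix)
}

func main() {
	// e.g. "cloudsql-demo-3c2c1f6e"
	fmt.Println(externalName("cloudsql-demo", "3c2c1f6e-bb56-11e8-9c9c-0800274f2bdc"))
}
```

The derived name would also need to be recorded in status so the controller can find the external resource again across restarts.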
The most robust way to provide information on the status of a CRD instance is to use the concept of "conditions". From the links below:
Conditions represent the latest available observations of an object's current state. Objects may report multiple conditions, and new types of conditions may be added in the future. Therefore, conditions are represented using a list/slice, where all have similar structure.
Background reading:
https://github.com/kubernetes/community/blob/930ce655/contributors/devel/api-conventions.md#spec-and-status
kubernetes/kubernetes#7856
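Following the conventions above, a condition list could be managed with a small helper like this sketch; real types would use metav1.Time and typed condition/status constants rather than plain strings:

```go
package main

import (
	"fmt"
	"time"
)

// Condition follows the shape described in the Kubernetes API conventions.
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

// setCondition replaces the condition of the same Type, updating the
// transition time only when the Status actually changes.
func setCondition(conds []Condition, c Condition) []Condition {
	c.LastTransitionTime = time.Now()
	for i, existing := range conds {
		if existing.Type == c.Type {
			if existing.Status == c.Status {
				// No transition: keep the original timestamp.
				c.LastTransitionTime = existing.LastTransitionTime
			}
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "Failed", Status: "False"})
	fmt.Println(len(conds))
}
```

Keeping one entry per condition type is what lets clients observe, for example, both a Ready and a Failed condition on the same object.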
This repository will serve as a useful set of patterns for managing cloud provider resources in kubernetes. These patterns should be brainstormed and documented in a shared document that serves to capture design elements as well as best practices for developers implementing controllers for new cloud resources.
Concrete resources like RDSInstance
and resource classes are considered cluster-wide resources that can be used by any abstract resource in any namespace. This was modeled after PVCs and PVs/StorageClasses in Kubernetes, where PVCs are namespace scoped and PVs and SCs are cluster scoped.
One problem we ran into when making concrete resources and resource classes cluster-scoped CRDs was that they required secrets (and possibly configmaps), which are unfortunately namespace scoped. So a cluster-scoped resource would still need a namespace to hold its secrets, configmaps, etc.
Also the conductor operator pod/deployment itself needs to run in a namespace.
As a result, we decided to go with a singleton namespace design, where conductor-system is the single namespace used for all concrete resources, resource classes, and their secrets. Essentially the namespace conductor-system
is like a cluster-scoped namespace. The admin can specify a different namespace than conductor-system
but currently we only support one.
We should think through whether we want to support concrete resources being created in other namespaces besides conductor-system.
In Kubernetes 1.11, server side printing for kubectl get
can be enhanced with additionalPrinterColumns
on the CRD specs to show additional useful information:
https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#additional-printer-columns
We should consider adding additionalPrinterColumns to our CRDs to enhance their usability and user experience.
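For example, a CRD could surface provider state and age directly in kubectl get output; the JSONPaths below assume a .status.state field, which is an assumption about our status layout:

```yaml
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: rdsinstances.database.aws.conductor.io
spec:
  # group, names, and version elided
  additionalPrinterColumns:
  - name: State
    type: string
    description: Provider-reported state of the instance
    JSONPath: .status.state
  - name: Age
    type: date
    JSONPath: .metadata.creationTimestamp
```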
For the first release, we should include useful and helpful user guides and walkthroughs so that first time visitors to the project can successfully try it out and have a good experience with the software.
We should consider publishing these to a docs website, similar to https://rook.io/docs/rook/v0.8/, which could be a separate ticket. The important part of this ticket is to ensure that our user documentation is well written, easy to follow, and high quality.
We need to decide if this project will use a Developer Certificate of Origin (DCO) or a Contributor License Agreement (CLA) to ensure that contributors are allowed to make a contribution and that this project has the right to distribute it. There are pros/cons to both approaches.
As discussed in this PR, we need to figure out how are we providing cloud credentials. This could be as simple as a local reference to a Secret, or perhaps a better approach - cloud credentials provider CRD/Operator.
Currently, the controllers are using the default means of discovering cloud provider credentials, which means they have to be running in the target environment (e.g. to deploy Cloud SQL, the controller has to be running inside GCP). These controllers should be able to run from anywhere (even local minikube) and still manage all cloud provider resources as long as they have been provided valid credentials.
Decide on when to record events
// Fetch Provider Secret
secret, err := r.kubeclient.CoreV1().Secrets(request.Namespace).Get(instance.Spec.SecretKey.Name, metav1.GetOptions{})
if err != nil {
	r.recorder.Event(instance, corev1.EventTypeWarning, "Error", err.Error())
	return reconcile.Result{}, err
}

// Retrieve credentials.json
data, ok := secret.Data[instance.Spec.SecretKey.Key]
if !ok {
	setInvalid(&instance.Status, fmt.Sprintf("invalid GCP Provider secret, %s data is not found", instance.Spec.SecretKey.Key), "")
	return reconcile.Result{}, r.Update(ctx, instance)
}
In the first block, we are trying to retrieve a secret, and if the secret is not found (i.e. an unmet requirement), we simply break out from the reconciler with an error, while not changing the provider status (i.e. it is neither invalid nor valid). At the same time, we record the event, so that the user has a reference point on why this provider is not becoming valid.
In the second block, we have retrieved the secret (i.e. all requirements are satisfied), and we begin credentials validation. If the secret format is invalid (in this case we cannot find the data element with the provided key), we mark this provider as invalid and break out from the reconcile loop. Since we have updated the status, the user has a reference on the status/state of this provider, including the Reason.
The next reconcile iteration will attempt to re-process (re-validate) the secret credentials, possibly resetting the state of this provider to Valid or Invalid (for this or other reasons).
We should be generating and publishing full documentation for the codebase using godoc
, which means we need to have fully commented code in order to generate the docs. This is especially important for the library/utility functions that will help developers add support for more resources. This should be published publicly for dev reference.
Links:
https://blog.golang.org/godoc-documenting-go-code
https://godoc.org/golang.org/x/tools/cmd/godoc
To clarify semantics: we should consider removing the kubebuilder tool dependency while testing. Our test harness could start the API server and etcd in-process if we need to, without bringing in the 100MB+ kubebuilder tool set.
The default value of 10 hours appears to be too long:
// SyncPeriod determines the minimum frequency at which watched resources are
// reconciled. A lower period will correct entropy more quickly, but reduce
// responsiveness to change if there are many watched resources. Change this
// value only if you know what you are doing. Defaults to 10 hours if unset.
SyncPeriod *time.Duration
We need to decide on a more appropriate value with a much lower period; however, it should not be so low as to trigger cloud provider API rate limits.
All controllers should continually drive towards desired state, even in the face of failures. They should retry, operations should be idempotent, they should be able to handle partially applied state, etc.
From the roadmap:
The reconcile loop can pick up and process the same resource more than once. In most cases this is benign if the CR and the managed resource have a strong affinity link; for example, the managed resource's name contains the UID of the CR object. However, there are cases where we don't have such a strong affinity, for example MySQLInstance -> RDSInstance.
Create an example application to demonstrate the use of Conductor components.
---
# GCP Admin service account secret - used to provision GCP resources
apiVersion: v1
kind: Secret
metadata:
name: gcp-provider-creds
namespace: demo
type: Opaque
data:
credentials.json: base64encrypteddata
---
# GCP SQL user service account - used by application cloudsql-proxy side car
# - list cloudsql intances
# - connect to a cloudsql instance
apiVersion: v1
kind: Secret
metadata:
name: gcp-sql-creds
namespace: demo
type: Opaque
data:
credentials.json: base64encrypteddata
---
# Database credentials - used by cloudsql provider to bootstrap database instance
# and by the application to establish connection
apiVersion: v1
kind: Secret
metadata:
name: database-credentials
namespace: demo
type: Opaque
data:
username: base64encrypteddata
password: base64encrypteddata
---
apiVersion: gcp.conductor.io/v1alpha1
kind: Provider
metadata:
name: gcp-provider
spec:
credentialsSecretRef:
name: gcp-provider-creds
key: credentials.json
projectID: demo-google-project
permissions:
# blank permission means - no validation (on permissions)
---
apiVersion: database.gcp.conductor.io/v1alpha1
kind: CloudsqlInstance
metadata:
labels:
name: cloudsql-demo-20049 # TODO: is this instance the same as instance name?
spec:
# somewhere here we need gcp-provider-reference
projectID: demo-google-project # TODO: same as in Provider
tier: db-n1-standard-1
region: us-west2
databaseVersion: MYSQL_5_7
storageType: PD_SSD
databaseName: demo-database ## TODO: do we support this?
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: wordpress
labels:
app: wordpress
spec:
strategy:
type: Recreate
template:
metadata:
labels:
app: wordpress
tier: frontend
spec:
containers:
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.11
command: ["/cloud_sql_proxy",
"-instances=demo-google-project:us-west2:cloudsql-demo20049=tcp:3306",
"-credential_file=/secrets/cloudsql/credentials.json"]
volumeMounts:
- name: cloudsql-instance-credentials
mountPath: /secrets/cloudsql
readOnly: true
- name: wordpress
image: wordpress:4.6.1-apache
env:
# managed database environment variables
- name: WORDPRESS_DB_HOST
value: 127.0.0.1 # cloudsql-proxy sidecar
- name: WORDPRESS_DB_NAME
value: demo-database # TODO - same as in CloudsqlInstance
- name: WORDPRESS_DB_USER
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_USER
- name: WORDPRESS_DB_PASSWORD
valueFrom:
secretKeyRef:
name: database-credentials
key: DATABASE_PASSWORD
ports:
- containerPort: 80
name: wordpress
volumes:
- name: cloudsql-instance-credentials
secret:
secretName: gcp-sql-creds
---
apiVersion: v1
kind: Service
metadata:
name: wordpress
labels:
app: wordpress
spec:
ports:
- port: 80
selector:
app: wordpress
tier: frontend
type: LoadBalancer
I think we should support types for a database and user and not just database instances. This would enable applications to model their databases and users independently from the service instance running them.
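Sketched out, such types might look like the following; the kinds and fields here are hypothetical, not implemented:

```yaml
# Hypothetical kinds; names and fields are illustrative.
apiVersion: database.aws.conductor.io/v1alpha1
kind: Database
metadata:
  name: wordpress-db
spec:
  instanceRef: rdssample       # the RDSInstance hosting this database
---
apiVersion: database.aws.conductor.io/v1alpha1
kind: DatabaseUser
metadata:
  name: wordpress-user
spec:
  databaseRef: wordpress-db
  passwordSecretRef:
    name: database-credentials
    key: password
```

Modeling databases and users as their own resources would let multiple applications share one instance while owning their schemas and credentials independently.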
The controller manager that hosts all the controllers should be managed by a Deployment
for resiliency and life-cycle management. For high availability, at least 2 replicas should be running and they should use leader election to determine which is active/authoritative.
The etcd-operator has an example of leader election: https://github.com/coreos/etcd-operator/blob/master/cmd/operator/main.go#L121
For example: RDSInstance
--db-instance-identifier (string)
The DB instance identifier. This parameter is stored as a lowercase string.
Constraints:
Must contain from 1 to 63 letters, numbers, or hyphens.
First character must be a letter.
Can't end with a hyphen or contain two consecutive hyphens.
We should be able to assert these constraints via schema validation.
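For example, a CRD validation block along these lines could encode all three constraints. Note that CRD patterns are matched with Go's RE2 engine, which has no lookaheads, so the "no trailing or consecutive hyphens" rule is expressed structurally instead:

```yaml
validation:
  openAPIV3Schema:
    properties:
      spec:
        properties:
          name:
            type: string
            maxLength: 63
            # first character must be a letter; every hyphen must be followed
            # by an alphanumeric, which rules out trailing and doubled hyphens
            pattern: '^[a-zA-Z](-?[a-zA-Z0-9])*$'
```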
In general, the controller Reconcile functions should not wait for operations to finish, they should set status and exit, then check again on the next reconciliation attempt.
From @ichekrygin:
the reconcile loop execution should be quick (success/fail/otherwise), i.e. it should not block/wait for anything. This way we can address every item in the queue in a timely manner.
Discussion:
#67 (comment)
We should decide on a single logging package and be consistent with it. The logging package should support:
log.Printf
is not sufficient for these goals.
Some of the up front design work we should do would be to work backwards: start identifying some user scenarios and author the yaml that we'd like to support for those scenarios. This will inform the user experience that we want to provide and help determine the scope and priority of the various controllers and resources.
Currently classReference
is required for dynamic provisioning, and if it's omitted we are thinking of supporting a default class.
It would be more useful if instead of just a single default class, we could select the best available class that meets the requirements of the abstract resource. Abstract resources would need to specify more granular requirements for that to be useful, for example, for MySqlInstance
they might need to specify the requirements on IOPS.
The end result here is that the abstract resource would not need to know about classes at all, and instead define more granular requirements that get matched heuristically.
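As a sketch of what that could look like (the requirements block and its field names are hypothetical, not an implemented API):

```yaml
# Hypothetical requirements-based matching; fields are illustrative.
apiVersion: storage.conductor.io/v1alpha1
kind: MySQLInstance
metadata:
  name: orders-db
spec:
  engineVersion: "5.7"
  requirements:
    minIOPS: 1000
    encrypted: true
```

The binder would then pick whichever ResourceClass satisfies the stated requirements, with no class name in the abstract resource at all.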
Across the 3 currently supported cloud providers, there is some common (duplicated) code that could be refactored to single and reusable implementations. We should take a pass through the providers and refactor to remove this duplication.
A nice demo that shows a lot of breadth of cloud resource management would be to run Sock Shop using the controllers and resources in this repo. We should be able to deploy all resources used by Sock Shop with a single yaml file.
I wanted to capture some thoughts about a pattern that I think we should use in conductor. Consider the following RDS example:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  class: db.t2.small        # type of the db instance
  engine: postgres          # which engine to use: postgres, mysql, aurora-postgresql, etc.
  name: pgsql               # name of the database at the provider
  password:                 # link to database secret
    key: rdsPassword        # the key in the secret
    name: rds-secret        # the name of the secret
  username: postgres        # database username
  size: 10                  # size in GB
  backupretentionperiod: 10 # days to keep backups, 0 means disable
  encrypted: true           # should the database be encrypted
  #iops: 1000               # number of iops
  multiaz: true             # multi AZ support
  storagetype: gp2          # type of the underlying storage
In order for us to actually create the RDS instance we need AWS credentials, AWS region, VPC information and possibly more. One option would be to explicitly add all of these into the spec of each type, something like:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  region: us-east-1
  awsCredentialsRef: mycreds
This approach is too restrictive and assumes that the author of the database instance has all the information needed. Instead I believe we should identify a set of properties that we think are "contextual" and can be specified at the namespace and/or cluster level, and potentially by another person. For example, consider a new cloud provider type:
apiVersion: aws.conductor.io/v1alpha1
kind: Provider
metadata:
  name: my-aws
  annotations:
    conductor.io/is-default-provider: true
spec:
  region: us-east-1
  awsCredentialsRef: my-aws-secret
An instance of the provider can be created at the namespace or at the cluster level. When an RDS instance is created in the same namespace, we can default to the default provider (marked by the conductor.io/is-default-provider
annotation) in the same namespace, or the default one in the cluster. The user can also specify the provider by name on the RDS type:
apiVersion: database.aws.conductor.io/v1alpha1
kind: RDSInstance
metadata:
  name: rdssample
spec:
  providerRef: my-aws
It would also be great to show the selected provider in the status section of the instance.
This approach enables a healthy separation of concern where the creator of the RDS instance does not need to know a number of contextual properties about it. A cluster-admin could set the credentials, region, and other settings. We could have more granular RBAC policies for who has access to what.
Finally, it keeps the specs of each type more concise. The same model should apply to network isolation (VPCs etc.), and other constructs.