
stackstorm-ha Helm Chart

K8s Helm Chart for running a StackStorm cluster in HA mode.

It installs 2 replicas of each StackStorm microservice for redundancy, as well as the backends st2 relies on: RabbitMQ HA for the message queue, a MongoDB HA replica set for the database, and a Redis cluster for distributed coordination.

You are more than welcome to fine-tune each component's settings to fit your specific availability/scalability demands.

Requirements

Usage

  1. Edit values.yaml with configuration for the StackStorm HA K8s cluster.

NB! It's highly recommended to set your own secrets, as the file contains unsafe defaults like self-signed SSL certificates, SSH keys, StackStorm access and DB/MQ passwords!

  2. Pull 3rd-party Helm dependencies:

helm dependency update

  3. Install the chart:

helm install .

  4. Upgrade. Once you make any changes to the values, upgrade the cluster:

helm upgrade <release-name> .

Configuration

The default configuration values for this chart are described in values.yaml.

Ingress

Ingress is worth considering if you want to expose multiple services under the same IP address, and these services all use the same L7 protocol (typically HTTP). You only pay for one load balancer if you are using native cloud integration, and because Ingress is "smart", you can get a lot of features out of the box (like SSL, Auth, Routing, etc.). See the ingress section in values.yaml for configuration details.

You will first need to deploy an ingress controller of your preference. See https://kubernetes.io/docs/concepts/services-networking/ingress-controllers/#additional-controllers for more information.
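
As a rough sketch only, ingress settings are overridden under the ingress key in values.yaml; the field layout below follows a common ingress values convention and should be checked against this chart's ingress section, and the hostname and annotation are placeholders for your environment:

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: nginx     # assumes an nginx ingress controller is deployed
  hosts:
    - host: stackstorm.example.com         # placeholder hostname
      paths:
        - path: /
          serviceName: st2web              # route everything to the st2web frontend
          servicePort: 80

After changing these values, run helm upgrade to apply them.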

Components

The Community FOSS Dockerfiles used to generate the docker images for each st2 component are available at st2-dockerfiles.

st2client

A helper container to switch into and run st2 CLI commands against the deployed StackStorm cluster. All resources like credentials, configs, RBAC, packs, keys and secrets are shared with this container.

# obtain st2client pod name
ST2CLIENT=$(kubectl get pod -l app.kubernetes.io/name=st2client -o jsonpath="{.items[0].metadata.name}")

# run a single st2 client command
kubectl exec -it ${ST2CLIENT} -- st2 --version

# switch into a container shell and use st2 CLI
kubectl exec -it ${ST2CLIENT} -- /bin/bash

st2web

st2web is the StackStorm Web UI admin dashboard. By default, the st2web K8s config includes a Pod Deployment and a Service. 2 replicas (configurable) of st2web serve the web app and proxy requests to st2auth, st2api and st2stream. By default, st2web uses HTTP instead of HTTPS. We recommend relying on a LoadBalancer or Ingress to add an HTTPS layer on top of it.

Note! By default, st2web is a NodePort Service and is not exposed to the public net. If your Kubernetes cluster setup supports the LoadBalancer service type, you can edit the corresponding helm values to configure st2web as a LoadBalancer service in order to expose it and the services it proxies to the public net.
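
As an illustrative sketch (the exact key names should be verified against the st2web section of this chart's values.yaml), switching the service type might look like this:

st2web:
  service:
    # Expose st2web (and the services it proxies) via a cloud load balancer
    # instead of the default NodePort
    type: LoadBalancer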

st2auth

All authentication is managed by the st2auth service. Its K8s configuration includes a Pod Deployment backed by 2 replicas by default and a ClusterIP Service listening on port 9100. Multiple st2auth processes can run behind a load balancer in an active-active configuration, and you can increase the number of replicas at your discretion.

st2api

st2api hosts the REST API endpoints that serve requests from the WebUI, CLI, ChatOps and other st2 components. Its K8s configuration consists of a Pod Deployment with 2 replicas by default for HA and a ClusterIP Service accepting HTTP requests on port 9101. As it is one of the most important StackStorm services, with a lot of logic involved, we recommend increasing the number of replicas if you expect increased load.

st2stream

st2stream exposes a server-sent event stream, used by clients like the WebUI and ChatOps to receive updates from the st2stream server. Similar to st2auth and st2api, the st2stream K8s configuration includes a Pod Deployment with 2 replicas for HA (can be increased in values.yaml) and a ClusterIP Service listening on port 9102.

st2rulesengine

st2rulesengine evaluates rules when it sees new triggers and decides if a new action execution should be requested. Its K8s config includes a Pod Deployment with 2 (configurable) replicas by default for HA.

st2timersengine

st2timersengine is responsible for scheduling all user-specified timers, aka st2 cron. Only a single replica is created via a K8s Deployment, as the timersengine can't work in active-active mode at the moment (multiple timers would produce duplicated events); it relies on K8s failover/reschedule capabilities to address cases of process failure.

st2workflowengine

st2workflowengine drives the execution of Orquesta workflows and actually schedules actions to run by another component, st2actionrunner. Multiple st2workflowengine processes can run in active-active mode, so a minimum of 2 K8s Deployment replicas are created by default. All the workflow engine processes share the load and pick up more work if one or more of the processes become unavailable.

Note! As Mistral is deprecated and will be removed from the StackStorm platform soon, the Helm chart relies only on the Orquesta st2workflowengine as the new native workflow engine.

st2scheduler

st2scheduler is responsible for handling ingress action execution requests. 2 replicas for the K8s Deployment are configured by default to increase StackStorm scheduling throughput.

st2notifier

Multiple st2notifier processes can run in active-active mode, using connections to RabbitMQ and MongoDB, generating triggers based on action execution completion and performing action rescheduling. In an HA deployment there must be a minimum of 2 replicas of st2notifier running, which requires a coordination backend, in our case Redis.

st2sensorcontainer

st2sensorcontainer manages StackStorm sensors: it starts, stops and restarts them as subprocesses. By default, the deployment is configured with 1 replica containing all the sensors.

You can increase the number of st2sensorcontainer pods by increasing the number of deployments. The replicas count is still only 1 per deployment, but the sensors are distributed between the deployments using Sensor Hash Range Partitioning. The hash ranges are calculated automatically based on the number of deployments.

st2sensorcontainer also supports a more Docker-friendly single-sensor-per-container mode as a way of Sensor Partitioning. This distributes the computing load between many pods and relies on K8s failover/reschedule mechanisms, instead of running everything on a single instance of st2sensorcontainer. The sensor(s) must be deployed as part of the custom packs image.

As an example, override the default Helm values as follows:

st2:
  packs:
    sensors:
      - name: github
        ref: githubwebhook.GitHubWebhookSensor
      - name: circleci
        ref: circle_ci.CircleCIWebhookSensor

st2actionrunner

st2actionrunner pods are the StackStorm workers that actually execute actions. 5 replicas for the K8s Deployment are configured by default to increase StackStorm's ability to execute actions without excessive queuing. It relies on Redis for coordination. This is likely the first thing to scale up if you have a lot of actions to execute per time period in your StackStorm cluster.
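
For instance, scaling out action execution capacity is a values override; a minimal sketch, assuming the replicas key mirrors the other components' settings in values.yaml:

st2actionrunner:
  replicas: 10   # default is 5; raise this when executions start queuing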

st2garbagecollector

st2garbagecollector is the service that cleans up old executions and other operational data based on the configured settings. Having 1 st2garbagecollector replica for the K8s Deployment is enough, considering its periodic execution nature. By default this process does nothing and needs to be configured in st2.conf settings (via values.yaml). Purging stale data can significantly improve cluster performance, so it's recommended to configure st2garbagecollector in production.
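
As an illustration only, garbage collection could be enabled by passing st2.conf options through the chart's st2.config value; the option names below come from st2.conf, and the TTL values are examples rather than recommendations:

st2:
  config: |
    [garbagecollector]
    action_executions_ttl = 30   # purge execution history older than 30 days
    trigger_instances_ttl = 30   # purge trigger instances older than 30 days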

st2chatops

st2chatops is the StackStorm ChatOps service, based on the Hubot engine, a custom StackStorm integration module and a preinstalled list of chat adapters. Due to Hubot limitations, st2chatops doesn't provide mechanisms to guarantee high availability, so only a single node of st2chatops is deployed. This service is disabled by default. Please refer to the Helm values.yaml for how to enable and configure st2chatops with ENV vars for your preferred chat service.

MongoDB

StackStorm uses MongoDB as its database engine. An external Helm chart is used to configure a MongoDB ReplicaSet. By default 3 nodes (1 primary and 2 secondaries) of MongoDB are deployed via a K8s StatefulSet. For more advanced MongoDB configuration, refer to the bitnami mongodb Helm chart settings, which can be fine-tuned via values.yaml.

The deployment of MongoDB to the K8s cluster can be disabled by setting the mongodb-ha.enabled key in values.yaml to false. Note: StackStorm relies heavily on connections to a MongoDB instance. If the in-cluster deployment of MongoDB is disabled, a connection to an external instance of MongoDB must be configured. The st2.config key in values.yaml provides a way to configure StackStorm. See Configure MongoDB for configuration details.
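
A rough sketch of that combination, using the dependency key named above (the connection string and credentials are placeholders; check Configure MongoDB for the authoritative settings):

mongodb-ha:
  enabled: false          # do not deploy the in-cluster MongoDB

st2:
  config: |
    [database]
    host = mongodb://st2:CHANGE-ME@mongo.example.com:27017/st2?authSource=admin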

RabbitMQ

RabbitMQ is the message bus StackStorm relies on for inter-process communication and load distribution. An external Helm chart is used to deploy a RabbitMQ cluster in highly available mode. By default 3 nodes of RabbitMQ are deployed via a K8s StatefulSet. For more advanced RabbitMQ configuration, please refer to the bitnami rabbitmq Helm chart repository; all settings can be overridden via values.yaml.

The deployment of RabbitMQ to the K8s cluster can be disabled by setting the rabbitmq-ha.enabled key in values.yaml to false. Note: StackStorm relies heavily on connections to a RabbitMQ instance. If the in-cluster deployment of RabbitMQ is disabled, a connection to an external instance of RabbitMQ must be configured. The st2.config key in values.yaml provides a way to configure StackStorm. See Configure RabbitMQ for configuration details.
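
A similar sketch for RabbitMQ, with placeholder credentials and hostname (see Configure RabbitMQ for the authoritative settings):

rabbitmq-ha:
  enabled: false          # do not deploy the in-cluster RabbitMQ

st2:
  config: |
    [messaging]
    url = amqp://st2:CHANGE-ME@rabbitmq.example.com:5672/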

Redis

StackStorm employs Redis with Sentinel as a distributed coordination backend, required for st2 cluster components to work properly in an HA scenario. A 3-node Redis cluster with Sentinel enabled is deployed via the external bitnami redis Helm chart dependency. As with any other Helm dependency, it's possible to further configure it for specific scaling needs via values.yaml.

Install custom st2 packs in the cluster

There are two ways to install st2 packs in the k8s cluster.

  1. The st2packs method is the default. This method will work for practically all clusters, but st2 pack install does not work. The packs are injected via st2packs images instead.

  2. The other method defines shared/writable volumes. This method allows st2 pack install to work, but requires a persistent storage backend to be available in the cluster. This chart will not configure a storage backend for you.

NOTE: In general, we recommend using only one of these methods. See the NOTE under Method 2 below about how both methods can be used together with care.

Method 1: st2packs images (the default)

The st2packs method is the default. st2 pack install does not work because this chart (by default) uses read-only emptyDir volumes for /opt/stackstorm/{packs,virtualenvs}. Instead, you need to bake the packs into a custom Docker image, push it to a private or public Docker registry and reference that image in the Helm values. The Helm chart will take it from there, sharing /opt/stackstorm/{packs,virtualenvs} via a sidecar container in pods which require access to the packs (the sidecar is the only place where the volumes are writable).
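
A loose sketch of referencing a custom packs image under st2.packs.images (the list form is implied by st2.packs.images[].pullSecret mentioned later in this document; the registry, image name, tag and other field names are illustrative and should be checked against the chart's values.yaml):

st2:
  packs:
    images:
      - repository: registry.example.com    # placeholder registry
        name: my-st2packs                    # placeholder custom packs image
        tag: "1.0.0"
        pullPolicy: IfNotPresent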

Building st2packs image

For your convenience, we created a new st2-pack-install <pack1> <pack2> <pack3> utility and included it in a container; it helps to install custom packs during the Docker build process without relying on a live DB and MQ connection. Please see https://github.com/StackStorm/st2packs-dockerfiles/ for instructions on how to build your custom st2packs image.

How to provide custom pack configs

Update the st2.packs.configs section of Helm values:

For example:

  configs:
    email.yaml: |
      ---
      # example email pack config file

    vault.yaml: |
      ---
      # example vault pack config file

Don't forget to run helm upgrade to apply the new changes.

NOTE: On helm upgrade any configs in st2.packs.configs will overwrite the contents of st2.packs.volumes.configs (optional part of Method 2, described below).

Pull st2packs from a private Docker registry

If you need to pull your custom packs Docker image from a private repository, create a Kubernetes Docker registry secret and pass it to Helm values. See K8s documentation for more info.

# Create a Docker registry secret called 'st2packs-auth'
kubectl create secret docker-registry st2packs-auth --docker-server=<your-registry-server> --docker-username=<your-name> --docker-password=<your-password>

Once the secret is created, reference its name in the Helm value st2.packs.images[].pullSecret.
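
For example, a sketch of how the secret name might be referenced (the image coordinates and other field names are placeholders, as in the earlier st2packs image example):

st2:
  packs:
    images:
      - repository: registry.example.com
        name: my-st2packs
        tag: "1.0.0"
        pullSecret: st2packs-auth   # the Docker registry secret created above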

Method 2: Shared Volumes

This method requires cluster-specific storage setup and configuration. As the storage volumes are both writable and shared, st2 pack install should work like it does for standalone StackStorm installations. The volumes get mounted at /opt/stackstorm/{packs,virtualenvs} in the containers that need read or write access to those directories. With this method, /opt/stackstorm/configs can also be mounted as a writable volume (in which case the contents of st2.packs.configs take precedence on helm upgrade).

NOTE: With care, st2packs images can be used with volumes. Just make sure to keep the st2packs images up-to-date with any changes made via st2 pack install. If a pack is installed via an st2packs image and then it gets updated with st2 pack install, a subsequent helm upgrade will revert back to the version in the st2packs image.

Configure the storage volumes

Enable the st2.packs.volumes section of Helm values and add volume definitions for both packs and virtualenvs. Each of the volume definitions should be customized for your cluster and storage solution.

For example, to use persistentVolumeClaims:

  volumes:
    enabled: true
    packs:
      persistentVolumeClaim:
        claimName: pvc-st2-packs
    virtualenvs:
      persistentVolumeClaim:
        claimName: pvc-st2-virtualenvs

Or, for example, to use NFS:

  volumes:
    enabled: true
    packs:
      nfs:
        server: nfs.example.com
        path: /var/nfsshare/packs
    virtualenvs:
      nfs:
        server: nfs.example.com
        path: /var/nfsshare/virtualenvs

Please consult the documentation for your cluster's storage solution to see how to add the storage backend to your cluster and how to define volumes that use your storage backend.

How to provide custom pack configs

You may either use the st2.packs.configs section of Helm values (like Method 1, see above), or add another shared writable volume similar to packs and virtualenvs. This volume gets mounted to /opt/stackstorm/configs instead of using the st2.packs.configs values.

NOTE: If you define a configs volume and also specify st2.packs.configs, anything in st2.packs.configs takes precedence during helm upgrade, overwriting config files already in the volume.

For example, to use persistentVolumeClaims:

  volumes:
    enabled: true
    ... # define packs and virtualenvs volumes as shown above
    configs:
      persistentVolumeClaim:
        claimName: pvc-st2-pack-configs

Or, for example, to use NFS:

  volumes:
    enabled: true
    ... # define packs and virtualenvs volumes as shown above
    configs:
      nfs:
        server: nfs.example.com
        path: /var/nfsshare/configs

Caveat: Mounting and copying packs

If you use something like NFS where you can mount the shares outside of the StackStorm pods, there are a couple of things to keep in mind.

Though you could manually copy packs into the packs shared volume, be aware that StackStorm does not automatically register any changed content. So, if you manually copy a pack into the packs shared volume, you also need to trigger updating the virtualenv and registering the content, possibly using APIs like packs/install and packs/register. You will have to repeat the process each time the pack's code is modified.

Caveat: System packs

After Helm installs, upgrades, or rolls back a StackStorm install, it runs an st2-register-content batch job. This job will copy and register system packs. If you have made any changes (like disabling default aliases), those changes will be overwritten.

NOTE: Upgrades will not remove files (such as a renamed or removed action) if they were removed in newer StackStorm versions. This mirrors how pack registration works. Make sure to review any upgrade notes and manually handle any removals.

Tips & Tricks

Grab all logs for the entire StackStorm cluster, including dependent services, in a Helm release:

kubectl logs -l app.kubernetes.io/instance=<release-name>

Grab logs only for the StackStorm backend services, excluding st2web and DB/MQ/Redis:

kubectl logs -l app.kubernetes.io/instance=<release-name>,app.kubernetes.io/component=backend

Running jobs before/after install, upgrade, or rollback

WARNING: The feature described in this section is an Advanced feature that new users should not need.

It may be convenient to run one or more Job(s) in your stackstorm-ha cluster to manage your release's life cycle. As the Helm docs explain:

Helm provides a hook mechanism to allow chart developers to intervene at certain points in a release's life cycle.

The jobs.extra_hooks feature in this chart simplifies creating Jobs that Helm will run in its hooks. These jobs use the same settings as any other job defined by this chart (e.g. image, annotations, pod placement). The st2.conf files and packs volumes will be mounted in the Job and the st2 CLI will be configured. This feature is primarily useful when you need to run a StackStorm workflow (with st2 run ...) after install, before/after upgrades, or before/after rollbacks.

NOTE: The jobs.extra_hooks feature is very opinionated. If you need to apply Helm hooks to anything other than Jobs, or if these jobs do not meet your needs, then you will need to do so from a parent chart. For example, parent charts are much better suited to jobs that don't need access to the packs, configs, configmaps, and secrets that this chart provides. See "Extending this chart" below.

These extra hooks jobs can be used for st2 installation-specific jobs like:

  • running a pre-upgrade st2 workflow that notifies on various channels that the upgrade is happening,
  • running post-upgrade smoke tests to ensure st2 can connect to vital services (vault, kubernetes, aws, etc),
  • running a pre-upgrade st2 workflow that pauses long-running workflows,
  • running a post-upgrade st2 workflow that resumes long-running workflows,
  • running one-time post-install configuration (such as generating dynamic secrets in the st2kv datastore),

To use this feature, set jobs.extra_hooks in your values file. Please refer to stackstorm-ha's default values.yaml file for examples.
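
As a loose sketch only (the field names below are illustrative; the authoritative structure and supported fields are documented in the chart's default values.yaml), an extra hook job might look like this:

jobs:
  extra_hooks:
    - name: upgrade-notice                   # illustrative job name
      hook: pre-upgrade, pre-rollback        # Helm hooks this job should run in
      hook_weight: 5                         # ordering relative to other hook resources
      command: ["st2", "run", "my_pack.notify_upgrade"]   # hypothetical pack/workflow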

Extending this chart

If you have any suggestions or ideas about how to extend this chart's functionality, we welcome you to collaborate in Issues and contribute via Pull Requests. However, if you need something very custom and specific to your infra that doesn't fit the official chart plans, we strongly recommend creating a parent Helm chart with your custom K8s objects that references the stackstorm-ha chart as a child dependency. This approach not only allows extending the sub-chart with custom objects and templates within the same deployment, but also adds the flexibility to include many sub-chart dependencies, pin their versions, and keep all the sub-chart values in one single place. This approach is infra-as-code friendly and more reproducible. See the official Helm documentation about Subcharts and Dependencies.
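
For instance, a parent chart's Chart.yaml could declare this chart as a dependency (the chart name and version below are placeholders; the repository URL is the chart repo referenced elsewhere in this document):

# Chart.yaml of a hypothetical parent chart
apiVersion: v2
name: my-stackstorm-deployment   # placeholder parent chart name
version: 0.1.0
dependencies:
  - name: stackstorm-ha
    version: 1.0.0               # pin the stackstorm-ha chart version you need
    repository: https://helm.stackstorm.com/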

Releasing information

In order to create a release, the steps are as follows:

  1. Create a pull request by updating CHANGELOG.md by replacing the "In Development" heading with the new version, and Chart.yaml by replacing the version value.
  2. Once the pull request is merged, create and push the matching tag (for example, if you are creating release v1.0.0, then the tag should also be v1.0.0).
  3. After the tag is pushed, create the corresponding release.
  4. After the release is created, switch to the gh-pages branch, and generate the updated Helm index, package and provenance.
  5. After committing and pushing the changes in the previous step, verify that the new release is present on ArtifactHub.


stackstorm-k8s's Issues

Don't reconnect st2 DB/MQ on failure, exit early instead

Cloned from https://github.com/StackStorm/st2enterprise-dockerfiles/issues/57

The current/default StackStorm tries to reconnect to Mongo or RabbitMQ on failure.
This is undesired behavior in a K8s environment, where the engine handles failover by rescheduling an exited container.
So the desired configuration is for any st2 service to exit fast on failure without reconnecting, and let K8s handle it.

Luckily, these settings are configurable in st2.conf:
https://github.com/StackStorm/st2/blob/master/conf/st2.conf.sample#L179
https://github.com/StackStorm/st2/blob/master/conf/st2.conf.sample#L109

Hardcode them in the default st2.conf (during the Docker image build and/or in K8s objects) to not reconnect, but exit st2 on failure.

Partitioning StackStorm Sensors

Moved from https://github.com/StackStorm/k8s-st2/issues/12

See: https://docs.stackstorm.com/reference/sensor_partitioning.html

At the moment the st2sensorcontainer K8s configuration consists of a Deployment with a hardcoded 1 replica and works as a parent responsible for forking/starting/stopping all sensors in the system.

There are future plans to re-work this setup (discussion https://github.com/StackStorm/discussions/issues/305) and benefit from the Docker-friendly single-sensor-per-container mode (added in st2 v2.9) as a way of Sensor Partitioning, distributing the computing load between many pods and relying on K8s failover/reschedule mechanisms, instead of running everything on a single instance of st2sensorcontainer.

We need to codify and configure that mode in the Docker + K8s Helm templates so the user specifies the list of sensors they want to run in values.yaml, and Helm takes care of starting a pod for each record.

References

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/ca116b1c20b8953aebde429139022bd4141aa068/values.yaml#L279-L285

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/ca116b1c20b8953aebde429139022bd4141aa068/templates/deployments.yaml#L710-L715

Find resource Requests for all K8s pods/deployments

Moved from https://github.com/StackStorm/k8s-st2/issues/25

Find resource requests and resource limits (at least memory) for each container/pod in the StackStorm cluster, and allow modifying the values.

This is a K8s best practice that affects cluster capacity planning, and so is a must-do for prod and for all the Deployment objects we write.

Capacity planning is based on the defined K8s resource metadata, so it's an important setting to configure.
Another quick example: an app introduces a memory leak, or the rabbitmq app acquires all the memory it can see. If such pods are deployed to the cluster with no limit set, they can crash a node.

Components

  • st2client
  • st2web
  • st2auth
  • st2api
  • st2stream
  • st2rulesengine
  • st2timersengine
  • st2workflowengine
  • st2notifier
  • st2sensorcontainer
  • st2actionrunner
  • st2garbagecollector

A few references

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/9e3caf754b14f1a20efe45f43bc3037f9f09c649/templates/jobs.yaml#L95-L96

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/9e3caf754b14f1a20efe45f43bc3037f9f09c649/values.yaml#L210-L216

Use custom/external RabbitMQ and MongoDB

The use-case: following the ST2-HA guide, we are managing the infrastructure (Mongo/Rabbit/etc.) separately from StackStorm. As an example, using KubeDB Operators.

Interested in another chart or a values toggle for bring-your-own-infrastructure. Ideally I can point the stackstorm config, via values, at URLs or at a known k8s Secret or ConfigMap (open to other ideas).

Add st2scheduler

Since 3.0, StackStorm introduces a new service, st2scheduler; we'll need to codify it.

[possible] custom packs & RBAC error with 0.10.0

After the upgrade I had an issue getting both to work. This is not the case with 0.8.
Also, this happened on the development platform. I am not sure how much of it is related to my changes; will update later when I can revert to a lower version.

 resource: {
  labels: {…}   
  type:  "container"   
 }
 severity:  "ERROR"  
 textPayload:  "cp: can't stat '/opt/stackstorm/packs/.': No such file or directory
"  

seeing similar issue as #42

Issue creating installing packs

I am trying to install packs following the guide. I enabled the K8 registry and I successfully created an image locally with a couple of public packs. I am running into issues running the following command.

docker run --privileged --pid=host socat:latest nsenter -t 1 -u -n -i socat TCP-LISTEN:5000,fork TCP:docker.for.mac.localhost:5000

I get the response Unable to find image 'socat:latest' locally

How / where do I get this image from?

HA for etcd cluster

Moved from https://github.com/StackStorm/st2enterprise-dockerfiles/issues/48

Currently we install a single-node etcd instance as the st2 [coordination] backend, just to get started quickly.

Enhance the K8s manifests to configure a 3-node HA cluster by default.

In the future, further template it with Helm to optionally use 5, 7 or more nodes, if the user needs it.

Use a 3rd-party Helm chart for configuring etcd, similar to mongodb-ha and rabbitmq-ha, instead of using quickly hacked K8s objects.

See https://github.com/helm/charts/tree/master/incubator/etcd

Add LoadBalancer for st2web

While #6 is the long term goal, first step on this path is to add a LoadBalancer service type to st2web. This is all that we require to resolve immediate needs.

Pod scheduling priority: Helm release could be stuck in a restart loop due to Mongo/Rabbit clusters stuck in a "Pending" state

Moved from https://github.com/StackStorm/st2enterprise-dockerfiles/issues/80

Sometimes when deploying a StackStorm cluster in K8s (especially under resource pressure), the entire Helm deployment gets stuck in a dead dependency loop: st2 pods keep restarting, rescheduling and recreating due to no MQ/DB connection, while the Mongo and Rabbit clusters can't start because of the st2 pod reschedule spam.

Note that our HA deployment, with 3 nodes each for DB and MQ and 2 for each st2 service, creates a cluster of a minimum of 30 Pods.

The solution could be trying K8s Pod priority https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/ which is beta starting from K8s v1.11.

This way we can prioritize scheduling for MongoDB, RabbitMQ and etcd clusters before st2 services.

Sharing pack content: Bundle custom st2packs image vs Shared NFS storage?

Bundling custom st2 packs as immutable Docker image

In the current implementation, for this chart to use custom st2 packs we rely on building a dedicated Docker image with the pack content and virtualenv pre-installed and bundled beforehand, see: https://docs.stackstorm.com/latest/install/ewc_ha.html#custom-st2-packs
As a downside, this means any writes like st2 pack install or saving a workflow from st2flow won't work in the current HA env.

Share content via ReadWriteMany shared file system

There is an alternative approach: sharing pack content via a read-write-many NFS (Network File System), as the High Availability Deployment doc recommends.

Examples

For example, There is a stable Helm chart nfs-server-provisioner which codifies NFS in an automated way.

https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
lists K8s supported solutions for ReadWriteMany (RWX) volumes.

From that list of volumes, additional interest, apart from NFS, goes to CephFS and Rook (https://github.com/rook/rook), a CNCF-hosted project for storage orchestration.

Feedback Needed!

As beta is in progress and both methods have their pros and cons, we’d like to hear your feedback, ideas, experience running any of these shared file-systems and which way would work better for you.

stanley_rsa is owned by root:root instead of stanley user in k8s

In the st2client pod, sudo su - stanley, then try to log in somewhere using the stanley_rsa key. It won't work unless you are root or use sudo.

whoami
stanley
stanley@stackstorm-st2client-b96dd9f76-t9kjp:~$ ls -l ~stanley/.ssh/
total 0
lrwxrwxrwx 1 root root 18 Sep  9 20:35 stanley_rsa -> ..data/stanley_rsa
stanley@stackstorm-st2client-b96dd9f76-t9kjp:~$ ssh undercloud.admin -i ~stanley/.ssh/stanley_rsa
The authenticity of host 'undercloud.admin' (10.75.163.57)' can't be established.
ECDSA key fingerprint is SHA256:Nr3mtGvtxhNNMeCzpy642VkASYji/u0xESuTTVKe4.
Are you sure you want to continue connecting (yes/no)? yes
Failed to add the host to the list of known hosts (/home/stanley/.ssh/known_hosts).
        ******** WARNING: UNAUTHORIZED PERSONS, DO NOT PROCEED ********
This system is intended to be used solely by authorized users in the course of
legitimate corporate business.  Users are monitored to the extent necessary to
properly administer the system, to identify the unauthorized users or users
operating beyone their proper authroity, and to investigate improper access or
use.  By accessing the system, you are consenting to this monitoring.
Additionally, users accessing this system agree that they understand and will
comply with all Verizon Information  Security and Privacy policies, including
policy statements, instructions, standards and guidelines.
        ******** WARNING: UNAUTHORIZED PERSONS, DO NOT PROCEED ********

Load key "/home/stanley/.ssh/stanley_rsa": Permission denied
[email protected]'s password:

Start mongodb with --auth option

Moved from StackStorm/st2enterprise-dockerfiles#62

It is not recommended to run mongodb without authentication, for any length of time.

Some useful references:

Chart has auth section to configure: https://github.com/helm/charts/tree/master/stable/mongodb-replicaset#authentication so should be doable with Helm values.yaml.

Use K8s Liveness & Readiness Probes

Copied from https://github.com/StackStorm/k8s-st2/issues/5

Use Kubernetes Liveness and Readiness probes to check whether a pod container is ready/working or not. For example, st2 services could start, but in fact be in an unreachable or "initializing" state, meaning potential loss of requests.

This becomes important as we reach the Production Deployments stage.

There is an issue to track the implementation progress in StackStorm core: StackStorm/st2#4020 (help wanted!)

Resources

Updating stackstorm packs does not work

Hello,

I have been trying to update the packs that I have created in stackstorm-ha. I have created my custom st2 pack image by following Install custom st2 packs in the cluster. It has worked fine during installation. I have made some changes to my pack and updated the Docker image, but if I do helm upgrade <release-name> ., the st2-register-content job runs but it does not update the pack that I have installed before.

I have also tried helm upgrade <release-name> . --recreate-pods; now the rabbit-mq-ha pod does not run. Therefore, most of the StackStorm pods keep giving CrashLoopBackOff. I would really appreciate it if you can help. Otherwise I won't be able to keep the persistency, and my data will be gone. Here are the logs of Rabbit-MQ-ha:

kubectl logs stackstorm18-rabbitmq-ha-0
2019-10-21 10:41:10.112 [info] <0.266.0> 
 Starting RabbitMQ 3.7.15 on Erlang 22.0.5
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.15. Copyright (C) 2007-2019 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See https://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2019-10-21 10:41:10.113 [info] <0.266.0> 
 node           : rabbit@stackstorm18-rabbitmq-ha-0.stackstorm18-rabbitmq-ha-discovery.default.svc.cluster.local
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : HtASEQMieCqdw0KvjzlO2w==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@stackstorm18-rabbitmq-ha-0.stackstorm18-rabbitmq-ha-discovery.default.svc.cluster.local
2019-10-21 10:41:10.604 [info] <0.266.0> Running boot step pre_boot defined by app rabbit
2019-10-21 10:41:10.605 [info] <0.266.0> Running boot step rabbit_core_metrics defined by app rabbit
2019-10-21 10:41:10.605 [info] <0.266.0> Running boot step rabbit_alarm defined by app rabbit
2019-10-21 10:41:10.611 [info] <0.272.0> Memory high watermark set to 244 MiB (256000000 bytes) of 32174 MiB (33737826304 bytes) total
2019-10-21 10:41:10.617 [info] <0.274.0> Enabling free disk space monitoring
2019-10-21 10:41:10.618 [info] <0.274.0> Disk free limit set to 50MB
2019-10-21 10:41:10.621 [info] <0.266.0> Running boot step code_server_cache defined by app rabbit
2019-10-21 10:41:10.621 [info] <0.266.0> Running boot step file_handle_cache defined by app rabbit
2019-10-21 10:41:10.622 [info] <0.277.0> Limiting to approx 1048476 file handles (943626 sockets)
2019-10-21 10:41:10.622 [info] <0.278.0> FHC read buffering:  OFF
2019-10-21 10:41:10.622 [info] <0.278.0> FHC write buffering: ON
2019-10-21 10:41:10.622 [info] <0.266.0> Running boot step worker_pool defined by app rabbit
2019-10-21 10:41:10.623 [info] <0.266.0> Running boot step database defined by app rabbit
2019-10-21 10:41:10.632 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-10-21 10:41:40.633 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:41:40.633 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 8 retries left
2019-10-21 10:42:10.635 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:42:10.635 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 7 retries left
2019-10-21 10:42:40.636 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:42:40.636 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 6 retries left
2019-10-21 10:43:10.649 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:43:10.650 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 5 retries left
2019-10-21 10:43:40.651 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:43:40.651 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 4 retries left
Stopping and halting node rabbit@stackstorm18-rabbitmq-ha-0.stackstorm18-rabbitmq-ha-discovery.default.svc.cluster.local ...
2019-10-21 10:44:04.489 [info] <0.306.0> RabbitMQ hasn't finished starting yet. Waiting for startup to finish before stopping...
2019-10-21 10:44:10.654 [warning] <0.266.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2019-10-21 10:44:10.654 [info] <0.266.0> Waiting for Mnesia tables for 30000 ms, 3 retries left

Ability to import data into st2 K/V storage

Moved from https://github.com/StackStorm/st2enterprise-dockerfiles/issues/41

As part of the Helm app, add an option to import yaml-formatted data into st2 K/V storage.
So the user defines a list of required K/V pairs in Helm values.yaml.

An example from ops-infra (private repo) for st2cicd server, which uses a lot of K/V:
https://github.com/StackStorm/ops-infra/blob/master/roles/stackstorm.st2cicd/defaults/main.yml#L2
In short, we need to rely on the st2 key load <...> command to import K/V settings and do that via a one-time K8s job, similar to how register content works.

Add some config value, so we can disable nginx ssl on st2web container

When I deploy stackstorm-ha on a k8s cluster, I find that st2web accepts SSL requests on port 443.
But in my case, I just want the ingress controller to handle the SSL connection.
I'm always worried that the SSL certificate will expire, so I don't want to put the website certificate in values.yaml and write it to the st2web container.
So I'm curious, is it possible for stackstorm-ha to offer an extra config in the Helm values.yaml so that we can disable the SSL certificate on the st2web container?
Thanks anyway.

'stable/etcd-operator' is not really stable for [coordination]

Seen in the logs when the cluster couldn't start itself, or even start clean, after all etcd pods were killed:

level=warning msg="all etcd pods are dead." cluster-name=etcd-cluster cluster-namespace=default pkg=cluster

This situation is not recovered by etcd-operator.
https://github.com/coreos/etcd-operator/blob/8347d27afa18b6c76d4a8bb85ad56a2e60927018/pkg/cluster/cluster.go#L248-L252

Researching further, it looks like there are quite a lot of cases where etcd-operator can't recover itself:


Because this backend is needed just for short-lived coordination locks, consider switching to Redis or even single-instance etcd like it was before (#52)?

Switch from latest ST2 'dev' to 'stable'

Currently, to pick up the most recent st2 fixes and improvements related to HA, we rely on the latest st2 X.Ydev versions in Docker images that are built nightly.
Once our ST2 HA K8s story proceeds from the beta to the prod stage, switch from using the latest st2 dev versions to st2 stable.

TODO:

st2 pods stuck in loop

I'm deploying a stackstorm/stackstorm-ha cluster with helm and since yesterday some of the pods seem to be stuck in a loop.
"Deployment does not have minimum availability". This was previously working without any issues.

We don't have much info on the logs:

Logs: st2actionrunner
2019-03-21 08:40:38,795 DEBUG [-] Using Python: 2.7.12 (/opt/stackstorm/st2/bin/python)
2019-03-21 08:40:38,796 DEBUG [-] Using config files: /etc/st2/st2.conf,/etc/st2/st2.docker.conf,/etc/st2/st2.user.conf
2019-03-21 08:40:38,796 DEBUG [-] Using logging config: /etc/st2/logging.actionrunner.conf
2019-03-21 08:40:38,802 INFO [-] Connecting to database "st2" @ "aspiring-ferret-mongodb-ha:27017" as user "None".
2019-03-21 08:40:38,830 INFO [-] Successfully connected to database "st2" @ "aspiring-ferret-mongodb-ha:27017" as user "None".
2019-03-21 08:40:39,972 ERROR [-] (PID=1) Worker quit due to exception.
Traceback (most recent call last):
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/actionrunner.py", line 80, in main
_setup()
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/actionrunner.py", line 41, in _setup
register_signal_handlers=True, service_registry=True, capabilities=capabilities)
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 187, in setup
start_heart=True)
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 224, in register_service_in_service_registry
coordinator.create_group(group_id).get()
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/tooz/coordination.py", line 476, in create_group
raise tooz.NotImplemented
NotImplemented

=========================
Logs: st2workflowengine
2019-03-21 08:45:45,742 DEBUG [-] Using Python: 2.7.12 (/opt/stackstorm/st2/bin/python)
2019-03-21 08:45:45,742 DEBUG [-] Using config files: /etc/st2/st2.conf,/etc/st2/st2.docker.conf,/etc/st2/st2.user.conf
2019-03-21 08:45:45,743 DEBUG [-] Using logging config: /etc/st2/logging.workflowengine.conf
2019-03-21 08:45:45,748 INFO [-] Connecting to database "st2" @ "aspiring-ferret-mongodb-ha:27017" as user "None".
2019-03-21 08:45:45,774 INFO [-] Successfully connected to database "st2" @ "aspiring-ferret-mongodb-ha:27017" as user "None".
2019-03-21 08:45:46,633 ERROR [-] Traceback (most recent call last):
2019-03-21 08:45:46,633 ERROR [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/workflow_engine.py", line 94, in main
2019-03-21 08:45:46,633 ERROR [-] setup()
2019-03-21 08:45:46,633 ERROR [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/workflow_engine.py", line 65, in setup
2019-03-21 08:45:46,633 ERROR [-] capabilities=capabilities
2019-03-21 08:45:46,633 ERROR [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 187, in setup
2019-03-21 08:45:46,633 ERROR [-] start_heart=True)
2019-03-21 08:45:46,633 ERROR [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 224, in register_service_in_service_registry
2019-03-21 08:45:46,634 ERROR [-] coordinator.create_group(group_id).get()
2019-03-21 08:45:46,634 ERROR [-] File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/tooz/coordination.py", line 476, in create_group
2019-03-21 08:45:46,634 ERROR [-] raise tooz.NotImplemented
2019-03-21 08:45:46,634 ERROR [-] NotImplemented
2019-03-21 08:45:46,634 ERROR [-] (PID=1) Workflow engine quit due to exception.
Traceback (most recent call last):
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/workflow_engine.py", line 94, in main
setup()
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2actions/cmd/workflow_engine.py", line 65, in setup
capabilities=capabilities
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 187, in setup
start_heart=True)
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/service_setup.py", line 224, in register_service_in_service_registry
coordinator.create_group(group_id).get()
File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/tooz/coordination.py", line 476, in create_group
raise tooz.NotImplemented
NotImplemented

=========================

This was installed with command: helm install stackstorm/stackstorm-ha
I've used the same layout to deploy and test few days ago. There's no resources issue.

root@aspiring-ferret-st2client-849f88776f-p4ntd:/opt/stackstorm# st2 --version
st2 3.0dev (0b95862), on Python 2.7.12

Custom user-defined K8s objects?

This was initially raised in our infra, where custom environment-specific K8s resources were created as an addition to the base ST2 HA Helm chart.

More groups will need flexibility from K8s/Helm and will configure something on their own as a supplement to ST2 HA.
We'll need to allow users to include/define custom K8s objects (be it an additional CRD, Service, Deployment, Job or anything else).

  1. One solution: it looks possible to allow the user to reference the custom K8s objects they'll need via Helm values (https://helm.sh/docs/charts_tips_and_tricks/#using-the-tpl-function, helm/helm#1978 (comment)) that will be templated as part of our base stackstorm-ha Helm chart.

  2. As an alternative solution, users can create a parent Helm chart with their custom objects/resources that will include & configure stackstorm-ha as a dependency (like we do for mongo/rabbitmq/etc).

Check both solutions and decide which one looks more elegant.

P.S. We're looking for feedback on this, so any opinions/comments are very welcome!

Prometheus Metrics

A future (sometime near production) version of the chart should include an option to enable Prometheus metrics via Helm values.yaml.
Once configured/enabled, the chart should take it from there and set up everything.

An example could be found at mongodb-replicaset Helm chart:
https://github.com/helm/charts/tree/master/stable/mongodb-replicaset#promethus-metrics

Officially, Prometheus metrics are not yet supported by StackStorm (see StackStorm/st2#4341); we'll likely need a dedicated st2 exporter (see the list at https://prometheus.io/docs/instrumenting/exporters/). If not, use https://github.com/prometheus/statsd_exporter#without-statsd as a middle-man between st2 metrics and Prometheus, since st2 supports statsd metrics.

ASCII art

Replace ASCII art in Helm NOTES.txt from ST2 OK -> ST2 HA OK :)

███████╗████████╗██████╗     ██╗  ██╗ █████╗      ██████╗ ██╗  ██╗
██╔════╝╚══██╔══╝╚════██╗    ██║  ██║██╔══██╗    ██╔═══██╗██║ ██╔╝
███████╗   ██║    █████╔╝    ███████║███████║    ██║   ██║█████╔╝ 
╚════██║   ██║   ██╔═══╝     ██╔══██║██╔══██║    ██║   ██║██╔═██╗ 
███████║   ██║   ███████╗    ██║  ██║██║  ██║    ╚██████╔╝██║  ██╗
╚══════╝   ╚═╝   ╚══════╝    ╚═╝  ╚═╝╚═╝  ╚═╝     ╚═════╝ ╚═╝  ╚═╝                                                              

Because we can.

Helm v2.15.0 stopped templating MongoDB connection string

With the newly released Helm v2.15.0, the templating used to populate the MongoDB host stopped working:

[database]
- host = mongodb://test-mongodb-ha-0.test-mongodb-ha,test-mongodb-ha-1.test-mongodb-ha,test-mongodb-ha-2.test-mongodb-ha/?authSource=admin&replicaSet=rs0
+ host = mongodb:///?authSource=admin&replicaSet=rs0

resulting in failed st2 Pods and red e2e builds https://circleci.com/gh/StackStorm/stackstorm-ha/1147

Something might have changed in the Helm template engine. Needs investigation and replacing the template rule so it populates the MongoDB host again.

Basic Helm Chart end to end sanity tests

This issue brought up lack of even basic end to end sanity tests - #57.

To begin with, I think having even basic tests which spin up all the services and just run st2 run core.local cmd=date ; st2 run pack install foo or similar (aka what we do on Travis for StackStorm packages) would help a lot and catch such issues early on.

If you think this belongs to the Integration tests issue (#27) feel free to merge those two issues and close this one.

And down the road, having CICD running using Helm Charts will help a lot as well.

Host this chart at official Helm repo

Currently we're relying on our custom repo for hosting charts https://helm.stackstorm.com/.

Once the code stabilizes, aim to contribute our chart to the official Helm repo https://github.com/helm/charts so it'll be easier for users to install it, and to bring us more OSS visibility.

The only question is how that'll affect velocity: how fast changes are reviewed, which changes are accepted and how the decision is made about what's merged and what's not; check if there is a fit.

st2actionrunner reloads on adding a new pack

Currently, I have tried installing some custom st2packs by baking them into a Docker image following https://github.com/StackStorm/st2packs-dockerfiles and it worked well. But every time I add a new pack by updating the image, the st2actionrunner pod reloads. So if I want to be able to add packs on a regular basis, it will certainly affect the execution of the actions and workflows running at that moment. Is there another way to add packs without affecting the executions?

Add st2chatops

Currently the K8s/Helm setup is missing an st2chatops deployment.

Making it Docker and K8s-friendly to read ENV vars outside of st2chatops.env will take some work (see blocker StackStorm/st2chatops#50).

And so running 1 replica of st2chatops would be doable and nice to have, but having st2chatops in HA is a story for a new dedicated issue and research.

RabbitMQ custom st2packs are not working on 3.2dev version

I have created a custom Docker st2pack for RabbitMQ to install in the StackStorm-HA environment on version 3.2dev (yes, I have passed the rabbitmq.yaml file by modifying the values.yaml file and executed helm upgrade). However, this RabbitMQ is not showing on my server.

Followed the same process on version 3.1Dev and the connection status is showing as running on my RabbitMQ server console.

K8s Ingress Controller

Moved from https://github.com/StackStorm/k8s-st2/issues/31

As part of templating the Helm Charts, we'll need a K8s Ingress controller (https://kubernetes.io/docs/concepts/services-networking/ingress/) to allow the user to configure inbound access to the StackStorm Web UI/APIs from the outside world and have more granular configuration compared to the LoadBalancer type.

The Plan

Per #44 (comment) discussion we want to:

  1. Expose Ingress controller settings via Helm values.yaml to allow users to configure the SSL/TLS negotiation layer on their own (optional).
  2. Change st2web Docker image so it will respond on HTTP by default (currently HTTPS).

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/d5e34fc28e820e39097d5eaac7dcadcc4a481f2b/values.yaml#L204-L205

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/d5e34fc28e820e39097d5eaac7dcadcc4a481f2b/templates/ingress.yaml#L1

Resources

Use K8s PodPresets

Moved from https://github.com/StackStorm/st2enterprise-dockerfiles/issues/83

It turns out that instead of sharing the same volumes, files, secrets and ENV vars for each Deployment, we can just create PodPresets for each resource (like volumes, vars, secrets) with selectors that will apply resource sharing to specific containers.
This would help avoid resource/code duplication and slightly simplify the K8s objects.

See

etcd deployment is unstable

We're using external chart dependency https://github.com/helm/charts/tree/master/incubator/etcd from Helm incubator to deploy etcd cluster as part of the stackstorm-ha.

It doesn't work well, failing frequently, like https://circleci.com/gh/StackStorm/stackstorm-ha/604.

Example K8s pod list:

NAME                                                              READY   STATUS             RESTARTS   AGE
st2cicd-etcd-0                                                    0/1     CrashLoopBackOff   107        9h
st2cicd-etcd-1                                                    1/1     Running            0          11h
st2cicd-etcd-2                                                    1/1     Running            112        9h

With error log from etcd pod:

Waiting for st2cicd-etcd-2.st2cicd-etcd to come up
Re-joining etcd member
cat: can't open '/var/run/etcd/member_id': No such file or directory

Research other solutions, consider etcd operator and how to automate that via Helm deployment.

st2-register-content job should clean up packs/actions/etc that no longer exist

Right now, the functionality of the HA-implementation makes it so that custom packs are installed via Docker image. This works fine and is in line with expected results.

What doesn't work as expected is the removal of those packs and actions. Removing a pack/action/rule/etc from the docker container leaves it showing in the list even though it will not function when called. This is because no cleanup step is performed to remove those packs.

I would propose that the expected behavior be that any functionality not defined in values.yaml or by the Docker container is assumed to be overridden. This is especially confirmed by the warning message displayed when connecting to the st2client pod.

Helm Chart Integration Tests

See https://github.com/helm/helm/blob/master/docs/chart_tests.md for how to write and run Helm chart integration tests. This will run end-to-end tests on a real deployment and may be used as a way to ensure that installation went well.

helm test:

The test command runs the tests for a release.

The argument this command takes is the name of a deployed release.
The tests to be run are defined in the chart that was installed.

Usage:
  helm test [RELEASE] [flags]

Flags:
      --cleanup               delete test pods upon completion
      --timeout int           time in seconds to wait for any individual Kubernetes operation (like Jobs for hooks) (default 300)
      --tls                   enable TLS for request
      --tls-ca-cert string    path to TLS CA certificate file (default "$HELM_HOME/ca.pem")
      --tls-cert string       path to TLS certificate file (default "$HELM_HOME/cert.pem")
      --tls-hostname string   the server name used to verify the hostname on the returned certificates from the server
      --tls-key string        path to TLS key file (default "$HELM_HOME/key.pem")
      --tls-verify            enable TLS for request and verify remote

Global Flags:
      --debug                           enable verbose output
      --home string                     location of your Helm config. Overrides $HELM_HOME (default "/home/arma/.helm")
      --host string                     address of Tiller. Overrides $HELM_HOST
      --kube-context string             name of the kubeconfig context to use
      --kubeconfig string               absolute path to the kubeconfig file to use
      --tiller-connection-timeout int   the duration (in seconds) Helm will wait to establish a connection to tiller (default 300)
      --tiller-namespace string         namespace of Tiller (default "kube-system")

Release Automation, CD

The https://github.com/stackstorm/stackstorm-enterprise-ha/tree/gh-pages branch was taken to serve chart archives and index metadata via GitHub Pages. See https://github.com/helm/helm/blob/master/docs/chart_repository.md#github-pages-example for more info on how it was configured and CNAMEd as the https://helm.stackstorm.com/ repo.

Currently it's all done manually as a temporary/fast way to deliver charts via https://helm.stackstorm.com/

The task here is to automate the CI/CD process with CircleCI, so every GitHub Release (initiated by an engineer) will generate the Helm chart metadata in gh-pages.

Possibly look at https://github.com/sstarcher/helm-release for automating versioning based on git tags.

Rearrange "secrets" in Helm values

Currently in Helm values.yaml for this chart there is a dedicated secrets section holding values for data like st2 username/password, SSL certs, SSH keys and so on.

We wanted to move it to a dedicated secrets.yaml when that gets implemented in Helm (helm/helm#2196, now closed).

Update: We'll need to just obsolete the dedicated secrets block and merge it with all the other sections, as any official Helm chart does, without the dedicated sub-section weirdness.

https://github.com/StackStorm/stackstorm-enterprise-ha/blob/95fe6bb5f0d76ef772ad3f4391ff5cd3e00d1d86/values.yaml#L82-L100

Feedback needed: which approach/way looks better and friendlier to you?
