intuit / foremast

Foremast adds application resiliency to Kubernetes by leveraging machine learnt patterns of application health to keep applications healthy and stable

Home Page: https://foremast.io

License: Apache License 2.0

Shell 0.57% Dockerfile 0.47% Makefile 0.28% Go 35.54% Java 54.83% HTML 0.64% CSS 0.24% JavaScript 7.43%

kubernetes observability remediation application-monitoring machine-learning anomaly-detection

foremast's Introduction

Foremast

Foremast is a cloud native application health manager for Kubernetes. Foremast leverages observability signals from platforms such as Prometheus, Fluentd and Jaeger and provides timely application health alerts. These alerts are especially important during deployments and other changes that alter application state. This information is then leveraged by Foremast's action framework, which allows developers and operators to take actions based on the state of the application.

How can developers reliably know that changes made to running software have not degraded the application?

Foremast provides early warnings for detecting problems with the deployment of a new version of a service or application on Kubernetes.

Traditionally, production deployments have used manual canary analysis as the standard mechanism for evaluating application health. Various types of canary analysis exist, such as: A/B testing, phased rollout, or incremental rollout.

Foremast automates the analysis of application health by scoring the health of new deployments on the basis of performance, functionality, and quality. This analysis provides a comprehensive picture of an application's health and enables corrective action if a deterioration in health is detected.

It addresses the following problems in enterprise Kubernetes environments:

Detect metrics spike or drop due to a deployment
Detect impact to downstream services
Automated remediation, including alerts, rollback, etc.
Metrics anomaly aggregated at service or API level
Aggregate service health check across multiple K8s clusters

The architecture and design documentation provide a detailed overview of the system and an under the hood view of how Foremast works.

Running Foremast

Foremast can be run in multiple modes:

On Minikube
On a remote K8s cluster

Technical Requirements

Make sure you have the following prerequisites:

A local Go 1.7+ development environment
Admin access to a Kubernetes cluster - this could either be Minikube or a remote cluster.

Setup Steps

The Set Up documentation has step-by-step instructions on how to set up and run Foremast.

Running Foremast-Browser UI

Documentation for running the front-end portion of the project can be found here.

Roadmap

January 2019 v0.5
February 2019 v0.6
HPA score support Ongoing
ClusterAutoScaler prediction support

Foremast presented at Kubecon

https://kccnceu19.sched.com/event/MPaQ/ready-a-deep-dive-into-pod-readiness-gates-for-service-health-management-minhan-xia-google-ping-zou-intuit

Contributing

We welcome you to get involved with Foremast. You can contribute by using Foremast and providing feedback, contributing to our code, participating in our code and design reviews etc.

Read the contributing guidelines to learn about building the project, the project structure, and the purpose of each package.

foremast's Issues

Foremast-brain auto scale

In the current design, Foremast-brain can be scaled manually without code changes. We'd like to make it scale automatically via the Kubernetes Horizontal Pod Autoscaler.

We can provide a custom metric based on the number of open requests to scale the foremast-brain instances up and down.
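As a rough illustration of how such a metric would drive scaling, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler. The function name and the sample numbers are assumptions for illustration only; the real autoscaler also applies a tolerance band and min/max replica bounds, which are left out here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """HPA core formula: ceil(currentReplicas * currentMetric / targetMetric).

    current_value is the observed per-pod average of the custom metric
    (here: open requests per foremast-brain pod, a hypothetical metric).
    """
    return math.ceil(current_replicas * current_value / target_value)


# 2 pods averaging 30 open requests each, with a target of 10 per pod
print(desired_replicas(2, 30, 10))
# 4 pods averaging 5 open requests each, with the same target
print(desired_replicas(4, 5, 10))
```

With these sample values the first call asks for more pods and the second for fewer, which is the scale-up and scale-down behaviour the issue describes.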

Only expose limited common metrics

The Common Metrics Filter hides metrics by default and shows only the metrics that are in the whitelist.

If the metric is enabled in properties, keep it enabled.
If the metric is disabled in properties, keep it disabled.
If the metric starts with "PREFIX", enable it; otherwise, disable the metric.
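The three rules above can be read as an ordered decision, sketched below. This is only an illustration: the function name, the `properties` dictionary (metric name mapped to True or False) and the sample metric names are assumptions, and "PREFIX" stands in for whatever prefix is configured.

```python
def is_metric_enabled(name, properties, prefix="PREFIX"):
    """Decide whether a metric is exposed.

    An explicit setting in properties always wins (rules 1 and 2);
    otherwise only metrics starting with the prefix are enabled (rule 3).
    """
    if name in properties:
        return properties[name]
    return name.startswith(prefix)


# Hypothetical settings: one metric forced on, one forced off
properties = {"jvm_threads": True, "PREFIX_debug": False}

print(is_metric_enabled("jvm_threads", properties))      # explicitly enabled
print(is_metric_enabled("PREFIX_debug", properties))     # explicitly disabled
print(is_metric_enabled("PREFIX_requests", properties))  # enabled by prefix
print(is_metric_enabled("tomcat_sessions", properties))  # hidden by default
```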

Alerting from Foremast

Send alerts from Foremast to multiple destinations:

Slack
Email
Pager Duty
Service Now

kubectl foremast command line

Start to watch deployment via command line
Add command line plugin on kubectl

Website for Foremast

Create a new website for Foremast. It should be:

Accessible from foremast.io and foremast.ai
Provide a high level overview of what Foremast is

Canary Deployment Analysis

Include support for deploying canaries and analyzing the performance of a newer version of code simultaneously with a last known good version.

Terminology definition:

New Version
- Referred to as n and canary
- This is the latest version of code that is being deployed and starts to take live traffic
Last known good
- Referred to as n-1 and baseline
- This version was previously serving live traffic and is proven to be healthy

Both the n and n-1 versions will be deployed simultaneously and will start taking live traffic. This ensures an apples-to-apples comparison that rules out differences due to one-time events such as cache warmup, initialization, etc. Key capabilities that need to be enabled are:

Pair comparison of ML models against simultaneous performance data coming from the n and n-1 versions.
Integration and correlation between the metrics, logs and traces that the application is emitting.
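To make the idea of a pair comparison concrete, here is a deliberately simple, non-ML sketch: it pairs baseline (n-1) and canary (n) samples taken at the same timestamps and flags the canary when the median relative difference exceeds a tolerance. The function names, the 20% tolerance and the sample latencies are all assumptions; Foremast's actual models are not described in this issue.

```python
import statistics


def paired_relative_diffs(baseline, canary):
    """Relative difference (canary vs. baseline) at each shared timestamp."""
    shared = sorted(baseline.keys() & canary.keys())
    return [(canary[t] - baseline[t]) / baseline[t]
            for t in shared if baseline[t] != 0]


def canary_degraded(baseline, canary, tolerance=0.2):
    """True when the canary is worse than baseline by more than tolerance.

    Assumes a metric where higher is worse, such as latency or error count.
    """
    diffs = paired_relative_diffs(baseline, canary)
    if not diffs:
        return False  # nothing to compare yet
    return statistics.median(diffs) > tolerance


# Hypothetical latency samples (ms) keyed by timestamp
baseline = {1: 100, 2: 100, 3: 100}
canary = {1: 150, 2: 160, 3: 140, 4: 999}  # timestamp 4 has no baseline pair

print(canary_degraded(baseline, canary))
```

Because only timestamps present in both series are compared, one-time events that hit both versions at the same moment cancel out rather than counting against the canary.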

Metrics for Generic Applications

Foremast works best when it has a lot of data about the characteristics of the application it is managing. In order to be applicable for a wider range of applications and technologies, the following need to be developed:

Documentation for writing metrics that Foremast can leverage
Easy to use libraries for
- Go
- Python

More languages will be added based on developer demand.

AI model advisory and hyperparameter tuning component

We need a component that gives AI model advisory based on the model loss function, performance, etc.
This can be configured to run as a daily or weekly batch.

Logging integration with Foremast

Add logging connectors to the ingestion layer of Foremast. Some key logging implementations to add connectors are for:

Splunk
Loki
Kafka

This issue will track both the design and implementation of these connectors.

Model Advisory to tune the AI models based on application historical data

Here are the basic features:
Based on the registered application's historical data:
. check whether the data is stationary
. check whether it has a seasonality pattern
. auto-tune the model hyperparameters
Persist the result

Introduce Warning state for Foremast

Instead of an anomaly, we need a warning state for manual rollback.

Foremast brain model advisory

We need a module that automatically tunes the hyperparameters of the different models and periodically picks the best model.

Grafana dashboard for demo application

Create a Grafana dashboard for the demo application, so that users can import it into Grafana and make the metrics visible.

Automatic Remediation - Rollback

Foremast's action framework allows for multiple actions to be plugged into the system. These actions can be configured to trigger based on alerts coming from anomalies that get detected in the system.

One key automatic remediation is to roll back faulty code when problems are detected. Foremast needs to support automatic rollbacks of the following types:

Percentage-wise rollback (canary rollback) - If an application was released using a ladder strategy, the deployment should be allowed to roll back at any stage of that rollout. For instance, if the application rolled out to 25% -> 50%, Foremast should be able to roll back in the same way from 50% -> 25% -> 0%, watching the application health at each step. If the health recovers midway, Foremast should be able to stop the rollback at that point.
Use granular criteria for determining whether to roll back or not - the rollback decision should be based on specific metric-based criteria such as number of errors, TPS, etc.
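A minimal sketch of the stepwise rollback described above, under stated assumptions: `set_traffic` and `is_healthy` are hypothetical callbacks supplied by the caller, and `meets_criteria` with its `errors` and `tps` keys is an invented example of a metric-based health check, not Foremast's API.

```python
def meets_criteria(metrics, max_errors, min_tps):
    """Example health check on a hypothetical metric snapshot."""
    return metrics["errors"] <= max_errors and metrics["tps"] >= min_tps


def rollback_in_steps(current_percent, ladder, set_traffic, is_healthy):
    """Walk the rollout ladder downwards, stopping once health recovers.

    Returns the traffic percentage the new version is left at.
    """
    lower_steps = sorted((p for p in ladder if p < current_percent),
                         reverse=True)
    reached = current_percent
    for percent in lower_steps:
        set_traffic(percent)   # shift traffic for the new version
        reached = percent
        if is_healthy():       # health recovered: stop rolling back
            break
    return reached


# Simulated run: still unhealthy at 25%, healthy again at 0%
applied = []
health_checks = iter([False, True])
final = rollback_in_steps(50, [0, 25, 50, 100],
                          applied.append, lambda: next(health_checks))
print(applied, final)
```

In practice `is_healthy` would wrap something like `meets_criteria` over freshly queried metrics, and each step would wait for a settling period before checking.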

Continuous monitoring

Foremast only supports monitoring of Deployment changes for now.
But an incident might happen a couple of hours after that.
Continuing to watch application health is another key to keeping applications stable.

Expose configurable model parameters for different metric type via environment

Provide metrics in foremast-barrelman

Watch events
Running foremast jobs
Watching applications
Foremast TTD (Time to detect) and TTR (Time to recover)

Elasticsearch upsert instability investigation

Investigate the root cause of the unstable Elasticsearch upsert (whether it appears only with a specific Docker image version or has another cause).
After an upsert, add a search to verify that the upserted change is there. This is already implemented, but other processes still fail to find the change.
Create a new index for all completed requests. It would be good to introduce a message queue in combination with this change.

Core metrics (CPU, Memory) do not show up in Prometheus

Solution:
Add extra config in minikube.sh:

    --extra-config=kubelet.authentication-token-webhook=true

Current query gets wrong values for future data points because Prometheus returns wrong data for future timestamps

We found that Prometheus queries return wrong data for future timestamps, which caused foremast-brain to detect false anomaly data points.
Instead of using the query's configured end time, we need to replace it with the current time every time we query current or baseline Prometheus metrics.
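The fix amounts to clamping the query window so it never extends past the present. The sketch below shows the idea with Unix timestamps; the function name and the `now` parameter (useful for testing) are assumptions, and the actual Prometheus call is left out.

```python
import time


def clamp_query_range(start, end, now=None):
    """Return (start, end) with the end capped at the current time.

    Timestamps are Unix seconds. If the whole window lies in the
    future, it collapses to the current instant.
    """
    now = time.time() if now is None else now
    end = min(end, now)
    start = min(start, end)
    return start, end


# A window that reaches 200 seconds into the future is cut off at "now"
print(clamp_query_range(100, 500, now=300))
```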

Investigate why foremast-brain failed to expose metrics to Prometheus

First we need to isolate the issue by adding logging.

Add Foremast-brain request process monitoring metric

We'd like to have Prometheus metrics to trace the request process time for foremast-brain.

RCA - Provide automated root cause analysis for incidents

Enhance Foremast to understand events across the network of applications that it is managing, and be able to correlate anomalies and events to a specific root cause.

Identify each individual application as a separate entity (via an unique identifier)
Identify network and API call paths between applications (via Tracing #24)
Construct a network graph of how the applications are linked together
When more than one alert is raised, trace back to the root of the problem by correlating the alerts and other events that occurred in the ecosystem.
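One simple heuristic for the last step, given only as a sketch: treat the call paths as a directed graph and pick the alerting applications that have no alerting application downstream of them, since upstream callers often alert only as a consequence. The function name, graph format and application names are assumptions; real correlation would also need timing and event data.

```python
def root_cause_candidates(calls, alerting):
    """Return alerting apps with no alerting dependency downstream.

    calls maps each application to the applications it calls.
    alerting is the set of applications currently raising alerts.
    """
    candidates = set()
    for app in alerting:
        # Collect everything reachable downstream of this app
        seen, stack = set(), list(calls.get(app, []))
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            stack.extend(calls.get(dep, []))
        downstream_alerts = (seen & alerting) - {app}
        if not downstream_alerts:
            candidates.add(app)
    return candidates


# Hypothetical call graph: web -> api -> db
calls = {"web": ["api"], "api": ["db"], "db": []}
print(root_cause_candidates(calls, {"web", "api"}))
```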

Community Participation for Foremast

Create community participation channels and information about them:

Meeting Calendar
Monthly online meetings
Meeting notes
Youtube channel
Google groups forum
Community guidelines documentation with relevant links

Add anomaly data set and notebook for foremast-brain

An anomaly data set and notebook for foremast-brain will help users understand which model is the best fit.

Need to introduce an abort status to indicate that the user aborted the deployment

It should allow the user to abort the deployment, so that foremast-brain will ignore the request.
Need to make sure foremast-barrelman will also ignore the request.

API based anomaly detection

The granularity of anomaly detection in Foremast is currently at a service level. Metrics are aggregated over all the operations in the service.

In a situation where a service has multiple APIs, Foremast's service level granularity would render it vulnerable to obfuscated impacts - where a problem in one API is obscured by the normal performance of another API.

To mitigate this situation, Foremast needs to perform anomaly detection at a much more granular level. API-level metrics are available from a variety of sources such as the service mesh, and these can be leveraged to evaluate API-level anomalies.
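A small worked example of the obscuring effect, with made-up numbers: one API fails half of its requests, but because a much busier API is healthy, the service-level error rate stays under the threshold. The API names, counts and the 10% threshold are assumptions for illustration.

```python
def error_rate(errors, requests):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def anomalous_apis(per_api, threshold):
    """Names of APIs whose own error rate exceeds the threshold."""
    return sorted(api for api, (errors, requests) in per_api.items()
                  if error_rate(errors, requests) > threshold)


# Hypothetical counts per API: (errors, requests)
per_api = {"/checkout": (50, 100), "/search": (0, 900)}
threshold = 0.10

total_errors = sum(e for e, _ in per_api.values())
total_requests = sum(r for _, r in per_api.values())

# Service level: 50 / 1000 = 5%, below the threshold, so nothing is flagged
print(error_rate(total_errors, total_requests) > threshold)
# API level: /checkout is at 50% and is flagged
print(anomalous_apis(per_api, threshold))
```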

Foremast Dashboard

Create a dashboard that can bring together the various aspects of Foremast and provide an application-centric view, with a well rounded picture of what is going on with the application. This should include:

Metrics
Logs
Traces
Alerts
Events, such as:
- Deployment
- Config change
- etc.

A hand drawn mock-up sketch of it is attached:

Anomaly detection for logs

Foremast Brain currently has algorithms that can detect anomalies from time series data such as metrics.

Support is being added to ingest logs from various sources (#23). Once Foremast is able to ingest the logs, Brain needs to add algorithms and learning models that can detect anomalies in logs. Key capabilities of this should include:

Stream processing of logs so the detection can be in near real-time
Ability to cluster anomalies and present them visually
Ability to customize and transform logs before they are analyzed

PoC: Investigate text clustering model for logging

We'd like to see if we can use Document/text clustering for logging anomaly detection

Remediation Framework

The action framework defines how to handle actions when Foremast detects an unhealthy state.
An action could be remediation by Foremast. The user can choose among several options, such as:

Rollback
Pause: for deployments with a large number of pods, pausing the job reduces the impact.
Scaling: for specific use cases, such as connections stacking up or high CPU spikes.
Auto: Foremast decides, using the information available about the deployment to choose a suitable action.
etc.

An action could also be sending alerts to users. Since alerting is not Foremast's main focus, we will start with a webhook implementation. Foremast will send as much context information as possible, so developers can easily integrate it with their existing alerting system.

Expose metrics for prometheus on spring-boot 1.x

Micrometer only supports Spring Boot 1.5.x and later, so legacy applications on Spring Boot 1.4 or earlier have no existing component to enable metrics for Prometheus.

Updating an existing application to Spring Boot 1.5.x is a challenge. Application developers usually don't want such a non-business feature to impact their application.

So we need to backport Micrometer to support metrics on Spring Boot 1.4 or 1.3.

Add website build to Travis CI

Build the website under docs/ folder along with the site build.

Dependencies for building the website:

Jekyll
Ruby
Bundle

Persist Application Metric Configs, Model hyperparameters and Model result

We need to store Application Metric Configs, Model hyperparameters, and Model result.
Phase one will CRUD the config via restful API.
We will enter a separate enhancement request for UI in the future.

Export anomaly metric data point to Prometheus directly

We'd like foremast-brain to export anomaly data points to Prometheus directly, instead of having foremast-service send them back to foremast-barrelman, to reduce the MTTD and MTTR.

Expose Model parameters via environment

If users want to override the default parameters of foremast-brain's models, they can make the change via environment variables.

Distinct pre-build and post-build process via deployment strategy

We use different methodologies for pre-build and post-build.
Pre-build performs canary analysis and applies a pairwise algorithm between baseline and current.
Post-build requires historical data to generate the model and compares it with current.

Add simulated metric pattern for demo app

We need simulated NAB metric patterns for the demo app.

Change or add slack link to request invite on README

😄 Or send me an invite [email protected]

Overcome incorrect Prometheus metric values for future timestamps

Short-term fixes to overcome incorrect Prometheus metric values for future timestamps.

Multivariate Anomaly Detection Model Integration

We'd like to integrate multivariate anomaly detection models. We already have univariate Gaussian distribution and LSTM models.

Tracing integration with Foremast

Add connectors for distributed tracing systems to Foremast. The Foremast connector will be built to conform to the OpenTracing interfaces.

Foremast should be able to ingest tracing information from any Tracer that implements the Open Tracing specification.

Enable Travis CI

Enable Travis CI in Foremast so that any change will trigger a build and produce a Docker image.

Improve Marketing Website

I did an audit of the marketing website and found the following:

The features under Everything about Foremast should be center-aligned. Currently it's a little off balance.
The footer is left-justified. It would look a lot better in the center.
On mobile, the image overflows.
The footer is not sticky. On larger screens this means the footer will be at the end of the content rather than the bottom of the page.
On Getting Started, the highlighted header on the right skips Installing Foremast.
On Getting Started, you link to github.intuit.com.
On about and about screens, a floating footer is present on normal screens.
On the About page, it says I can contact you on Slack but doesn't link to the Slack team.
Same thing for the calendar. What are the meetup days?

Expose z-score as environment variables for different metric type

We monitor different application and system metrics. We'd like to apply a different z-score threshold to each metric type.
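A sketch of what per-metric-type thresholds could look like, assuming hypothetical environment variable names of the form `ZSCORE_THRESHOLD_<TYPE>` and a default of 3.0; neither is taken from foremast-brain. The z-score itself is the usual distance from the historical mean in units of standard deviation.

```python
import os
import statistics

DEFAULT_Z_THRESHOLD = 3.0  # assumed default, not from the project


def z_threshold(metric_type, env=os.environ):
    """Read the threshold for a metric type, e.g. ZSCORE_THRESHOLD_LATENCY."""
    key = "ZSCORE_THRESHOLD_" + metric_type.upper()
    return float(env.get(key, DEFAULT_Z_THRESHOLD))


def is_anomaly(value, history, threshold):
    """True when value lies more than threshold deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean  # flat history: any change stands out
    return abs(value - mean) / stdev > threshold


# Hypothetical environment: latency is stricter than the default
env = {"ZSCORE_THRESHOLD_LATENCY": "2.5"}
history = [10, 10, 10, 10, 12, 8]

print(z_threshold("latency", env), z_threshold("cpu", env))
print(is_anomaly(20, history, z_threshold("latency", env)))
print(is_anomaly(11, history, z_threshold("latency", env)))
```

Passing `env` explicitly keeps the lookup testable; by default it reads the process environment.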

foremast-brain travis CI to build image

We need to enhance existing Travis CI to add image building as well

Foremast-Service should store rollout or canary strategy in Elasticsearch

1. Foremast-Service should store the strategy, canary (pre-rollout) or rollout (post-rollout), in Elasticsearch.
2. Foremast-brain can decide based on the strategy instead of checking for the existence of baseline or historical metrics.
3. The strategy needs to be redefined for all use cases and kept consistent between upstream and downstream.

KNative support in Foremast

Add support in Foremast for serverless applications that are running on KNative.

These functions need to be supported:

Definition of metrics for serverless applications
Watching of the metrics - could leverage metrics coming from both Prometheus and Istio #25
Remediation and actions

Service Mesh integration with Foremast

Add connectors to ingest data from the Istio service mesh to Foremast. Istio will provide information on API calls, latencies, errors etc.

Foremast-Service proxy Prometheus Query API

We expect that foremast-service can access all the Prometheus endpoints, but not every Foremast client (such as the Foremast dashboard) can. Foremast clients can therefore invoke foremast-service to run Prometheus queries, etc.