
intuit / foremast


Foremast adds application resiliency to Kubernetes by leveraging machine learnt patterns of application health to keep applications healthy and stable

Home Page: https://foremast.io

License: Apache License 2.0

Shell 0.57% Dockerfile 0.47% Makefile 0.28% Go 35.54% Java 54.83% HTML 0.64% CSS 0.24% JavaScript 7.43%
kubernetes observability remediation application-monitoring machine-learning anomaly-detection

foremast's Introduction

Foremast


Foremast is a cloud native application health manager for Kubernetes. Foremast leverages observability signals from platforms such as Prometheus, Fluentd and Jaeger and provides timely application health alerts. These alerts are especially important during deployments and other changes that alter application state. This information is then leveraged by Foremast’s action framework - which allows developers and operators to take actions based on the state of the application.

How can developers reliably know that changes made to running software have not degraded the application?

Foremast provides early warnings for detecting problems with the deployment of a new version of a service or application on Kubernetes.

Traditionally, production deployments have used manual canary analysis as the standard mechanism for evaluating application health. Various types of canary analysis exist, such as: A/B testing, phased rollout, or incremental rollout.

Foremast automates the analysis of application health by scoring new deployments on performance, functionality, and quality. This analysis provides a comprehensive picture of an application's health and enables corrective action if a deterioration in health is detected.

It addresses the following problems in enterprise Kubernetes environments:

  • Detect metric spikes or drops caused by a deployment
  • Detect impact on downstream services
  • Automated remediation, including alerting and rollback
  • Metric anomalies aggregated at the service or API level
  • Aggregate service health checks across multiple K8s clusters
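
As an illustration of the first item, here is a minimal sketch of spike and drop detection. It assumes a plain z-score comparison between a pre-deployment and a post-deployment window; this is a stand-in technique, not Foremast's actual model, and the function names and threshold are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// meanStd returns the mean and population standard deviation of xs.
func meanStd(xs []float64) (mean, std float64) {
	if len(xs) == 0 {
		return 0, 0
	}
	for _, x := range xs {
		mean += x
	}
	mean /= float64(len(xs))
	for _, x := range xs {
		std += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(std / float64(len(xs)))
}

// deviates reports whether the post-deployment mean lies more than
// threshold baseline standard deviations away from the pre-deployment mean.
// The absolute distance is used, so both spikes and drops are caught.
func deviates(before, after []float64, threshold float64) bool {
	baseMean, baseStd := meanStd(before)
	afterMean, _ := meanStd(after)
	if baseStd == 0 {
		return afterMean != baseMean
	}
	return math.Abs(afterMean-baseMean)/baseStd > threshold
}

func main() {
	before := []float64{100, 102, 98, 101, 99}   // hypothetical latency (ms) before rollout
	after := []float64{150, 155, 149, 152, 151} // hypothetical latency (ms) after rollout
	fmt.Println("deployment changed the metric:", deviates(before, after, 3))
}
```

A real detector would also need to account for seasonality and noisy baselines, which this sketch ignores.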

The architecture and design documentation provide a detailed overview of the system and an under the hood view of how Foremast works.

Running Foremast

Foremast can be run in multiple modes:

  • On Minikube
  • On a remote K8s cluster

Technical Requirements

Make sure you have the following prerequisites:

  • A local Go 1.7+ development environment
  • Admin access to a Kubernetes cluster - this could either be Minikube or a remote cluster.

Setup Steps

The Set Up documentation has step-by-step instructions on how to set up and run Foremast.

Running Foremast-Browser UI

Documentation for running the front-end portion of the project can be found here.

Roadmap

  • January 2019 v0.5
  • February 2019 v0.6
  • HPA score support Ongoing
  • ClusterAutoScaler prediction support

Foremast presented at Kubecon

https://kccnceu19.sched.com/event/MPaQ/ready-a-deep-dive-into-pod-readiness-gates-for-service-health-management-minhan-xia-google-ping-zou-intuit

Contributing

We welcome you to get involved with Foremast. You can contribute by using Foremast and providing feedback, contributing to our code, participating in our code and design reviews etc.

Read the contributing guidelines to learn about building the project, the project structure, and the purpose of each package.

foremast's People

Contributors

davemasselink, dependabot[bot], dwding18, formuzi, jdfalko, kevinli11, kianjones4, lt-schmidt-jr, pzou1974, rociomontes, shaoxt, srivathsanvc


foremast's Issues

Foremast-brain auto scale

In the current design, Foremast-brain can be scaled manually without code changes. We'd like it to scale automatically via the Kubernetes Horizontal Pod Autoscaler.

We can provide a custom metric based on the number of open requests to scale the foremast-brain instances up and down.
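
To make the scaling idea concrete, the sketch below applies the replica formula documented for the Horizontal Pod Autoscaler, desired = ceil(current * currentMetric / targetMetric), to a per-pod open-request count. The target value, the replica bounds, and the function name are assumptions for illustration; the real autoscaler also applies a tolerance band and stabilization windows that are omitted here.

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the documented HPA scaling rule
//   desired = ceil(current * currentMetric / targetMetric)
// and clamps the result to [minReplicas, maxReplicas].
// openPerPod is the average number of open requests per foremast-brain pod;
// targetPerPod is the (hypothetical) value we want each pod to handle.
func desiredReplicas(current int, openPerPod, targetPerPod float64, minReplicas, maxReplicas int) int {
	if current <= 0 || targetPerPod <= 0 {
		return current
	}
	d := int(math.Ceil(float64(current) * openPerPod / targetPerPod))
	if d < minReplicas {
		return minReplicas
	}
	if d > maxReplicas {
		return maxReplicas
	}
	return d
}

func main() {
	// 3 pods with 45 open requests each, aiming for 20 per pod.
	fmt.Println(desiredReplicas(3, 45, 20, 2, 10))
	// 4 mostly idle pods scale down, but not below the minimum.
	fmt.Println(desiredReplicas(4, 5, 20, 2, 10))
}
```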

Only expose limited common metrics

The Common Metrics Filter hides metrics by default and shows only the metrics that are in the whitelist.

If the metric is enabled in properties, keep it enabled.
If the metric is disabled in properties, keep it disabled.
If the metric starts with "PREFIX", enable it; otherwise, disable the metric.
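
The three rules can be read as a small decision function. The sketch below assumes that an explicit setting in properties takes precedence over the prefix check, which is how the rules are ordered above; the map of overrides, the function name, and the example prefix are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// metricEnabled applies the filter rules in order:
//  1. an explicit setting in properties (overrides) always wins;
//  2. otherwise the metric is enabled only if it starts with prefix.
// overrides maps a metric name to its explicit enabled/disabled setting.
func metricEnabled(name string, overrides map[string]bool, prefix string) bool {
	if enabled, ok := overrides[name]; ok {
		return enabled
	}
	if prefix == "" {
		return false // no whitelist prefix configured: hide by default
	}
	return strings.HasPrefix(name, prefix)
}

func main() {
	overrides := map[string]bool{
		"jvm_threads": true,  // explicitly enabled in properties
		"http_debug":  false, // explicitly disabled in properties
	}
	for _, m := range []string{"jvm_threads", "http_debug", "http_requests", "tomcat_sessions"} {
		fmt.Println(m, metricEnabled(m, overrides, "http_"))
	}
}
```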

Alerting from Foremast

Send alerts from Foremast to multiple destinations:

  • Slack
  • Email
  • Pager Duty
  • Service Now

Website for Foremast

Create a new website for Foremast. It should be:

  • Accessible from foremast.io and foremast.ai
  • Provide a high level overview of what Foremast is

Canary Deployment Analysis

Include support for deploying canaries and analyzing the performance of a newer version of code simultaneously with a last known good version.

Terminology definition:

  • New Version

    • Referred to as n and canary
    • This is the latest version of code that is being deployed and starts to take live traffic
  • Last known good

    • Referred to as n-1 and baseline
    • This version was previously serving live traffic and is proven to be healthy

Both the n and n-1 versions will be deployed simultaneously and will start taking live traffic. This ensures an apples-to-apples comparison that rules out differences due to one-time events such as cache warmup and initialization. Key capabilities that need to be enabled are:

  • Pair comparison of ML models against simultaneous performance data coming from the n and n-1 versions.
  • Integration and correlation between the metrics, logs and traces that the application is emitting.
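
A minimal sketch of a pair comparison between the two versions follows. It compares only error rates over the same time window and uses a fixed tolerance, which is a deliberate simplification standing in for the ML models mentioned above; the type, function, and tolerance value are hypothetical.

```go
package main

import "fmt"

// window holds request and error counts observed over the same time span.
type window struct {
	requests, errors int
}

// errorRate returns errors/requests, or 0 when no traffic was seen.
func (w window) errorRate() float64 {
	if w.requests == 0 {
		return 0
	}
	return float64(w.errors) / float64(w.requests)
}

// canaryHealthy reports whether the canary (n) error rate stays within
// tolerance of the baseline (n-1) error rate measured simultaneously.
func canaryHealthy(baseline, canary window, tolerance float64) bool {
	return canary.errorRate() <= baseline.errorRate()+tolerance
}

func main() {
	baseline := window{requests: 1000, errors: 10} // n-1, last known good
	canary := window{requests: 1000, errors: 50}   // n, new version
	fmt.Println("canary healthy:", canaryHealthy(baseline, canary, 0.02))
}
```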

Metrics for Generic Applications

Foremast works best when it has a lot of data about the characteristics of the application it is managing. In order to be applicable for a wider range of applications and technologies, the following need to be developed:

  • Documentation for writing metrics that Foremast can leverage
  • Easy to use libraries for
    • Go
    • Python

More languages will be added based on developer demand.

Logging integration with Foremast

Add logging connectors to the ingestion layer of Foremast. Key logging systems that need connectors are:

  • Splunk
  • Loki
  • Kafka

This issue will track both the design and implementation of these connectors.

Automatic Remediation - Rollback

Foremast's action framework allows for multiple actions to be plugged into the system. These actions can be configured to trigger based on alerts coming from anomalies that get detected in the system.

A key automatic remediation is rolling back faulty code when problems are detected. Foremast needs to support automatic rollbacks of the following types:

  • Percentage-wise rollback (canary rollback) - If an application was released using a ladder strategy, the deployment should be allowed to roll back at any stage of that rollout. For instance, if the application rolled out to 25% -> 50%, Foremast should be able to roll back in the same way from 50% -> 25% -> 0%, watching the application health at each step. If the health recovers mid-way, Foremast should be able to stop the rollback at that point.

  • Granular rollback criteria - The decision to roll back should be based on specific metric-based criteria such as number of errors, TPS etc.
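
The ladder rollback described above can be sketched as a loop that steps traffic down and stops as soon as health recovers. The ladder values and the health callback are hypothetical; in practice the callback would evaluate metric-based criteria such as error count or TPS.

```go
package main

import "fmt"

// rollback walks the rollout ladder downwards from the current traffic
// percentage. After each step it asks healthy whether the application has
// recovered and stops at the first step where it has. It returns the
// percentage of traffic left on the new version.
// ladder must be sorted in ascending order, e.g. {0, 25, 50}.
func rollback(ladder []int, current int, healthy func(percent int) bool) int {
	for i := len(ladder) - 1; i >= 0; i-- {
		if ladder[i] >= current {
			continue // skip steps at or above the current position
		}
		current = ladder[i] // shift traffic down one step
		if healthy(current) {
			return current // health recovered mid-way: stop here
		}
	}
	return current
}

func main() {
	ladder := []int{0, 25, 50}
	// Hypothetical check: the application is healthy at 25% or less.
	recoversAt25 := func(p int) bool { return p <= 25 }
	fmt.Println("stopped at", rollback(ladder, 50, recoversAt25), "percent")
}
```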

Continuous monitoring

Foremast currently supports monitoring of Deployment changes only.
However, an incident can occur a couple of hours after a deployment.
Continuing to watch application health after the deployment window is equally important.

ElasticSearch Upsert unstable investigation

  1. Investigate the root cause of the unstable ElasticSearch upsert (for example, whether it appears only on a specific Docker image version).
  2. After an upsert, run a search to make sure the upserted change is there. This is already implemented, but other processes still fail to find the change.
  3. Create a new index for all completed requests. It would be good to introduce a message queue together with this change.

RCA - Provide automated root cause analysis for incidents

Enhance Foremast to understand events across the network of applications that it is managing, and be able to correlate anomalies and events to a specific root cause.

  • Identify each individual application as a separate entity (via a unique identifier)
  • Identify network and API call paths between applications (via Tracing #24)
  • Construct a network graph of how the applications are linked together
  • When more than one alert is raised, trace back to the root of the problem by correlating the alerts and other events that occurred in the ecosystem.
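
One simple way to trace alerts back through such a graph is sketched below: a service is treated as a root-cause candidate if it is alerting and none of the services it calls are alerting. This is an assumed heuristic for illustration, not a description of Foremast's design, and it returns nothing when all alerting services sit on a cycle.

```go
package main

import (
	"fmt"
	"sort"
)

// rootCauses returns the alerting services whose own dependencies are all
// quiet. calls maps a service to the services it calls; alerting marks the
// services that currently have an alert raised.
func rootCauses(calls map[string][]string, alerting map[string]bool) []string {
	var roots []string
	for svc, isAlerting := range alerting {
		if !isAlerting {
			continue
		}
		dependencyAlerting := false
		for _, dep := range calls[svc] {
			if alerting[dep] {
				dependencyAlerting = true
				break
			}
		}
		if !dependencyAlerting {
			roots = append(roots, svc)
		}
	}
	sort.Strings(roots) // map iteration order is random; sort for stable output
	return roots
}

func main() {
	// Hypothetical call graph: web -> api -> db.
	calls := map[string][]string{"web": {"api"}, "api": {"db"}}
	alerting := map[string]bool{"web": true, "api": true, "db": true}
	fmt.Println(rootCauses(calls, alerting))
}
```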

Community Participation for Foremast

Create community participation channels and information about them:

  • Meeting Calendar
  • Monthly online meetings
  • Meeting notes
  • Youtube channel
  • Google groups forum
  • Community guidelines documentation with relevant links

API based anomaly detection

The granularity of anomaly detection in Foremast is currently at a service level. Metrics are aggregated over all the operations in the service.

In a situation where a service has multiple APIs, Foremast's service level granularity would render it vulnerable to obfuscated impacts - where a problem in one API is obscured by the normal performance of another API.

To mitigate this situation, Foremast needs to perform anomaly detection at a much more granular level. API-level metrics are available from a variety of sources, such as the service mesh, and these can be leveraged to evaluate API-level anomalies.
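
The masking effect is easy to see with numbers. In the sketch below, one low-traffic API fails half of its requests, yet the service-level error rate stays under a 1% threshold. The API names, counts, and threshold are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// apiStats holds request and error counts for one API over a time window.
type apiStats struct {
	requests, errors int
}

// errorRate returns errors/requests, or 0 for an idle API.
func (s apiStats) errorRate() float64 {
	if s.requests == 0 {
		return 0
	}
	return float64(s.errors) / float64(s.requests)
}

// serviceTotal sums all APIs, which is what service-level detection sees.
func serviceTotal(apis map[string]apiStats) apiStats {
	var total apiStats
	for _, s := range apis {
		total.requests += s.requests
		total.errors += s.errors
	}
	return total
}

// anomalousAPIs returns the APIs whose own error rate exceeds threshold.
func anomalousAPIs(apis map[string]apiStats, threshold float64) []string {
	var out []string
	for name, s := range apis {
		if s.errorRate() > threshold {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	apis := map[string]apiStats{
		"/search": {requests: 9900, errors: 10},
		"/refund": {requests: 100, errors: 50},
	}
	fmt.Printf("service error rate: %.3f\n", serviceTotal(apis).errorRate())
	fmt.Println("APIs over 1%:", anomalousAPIs(apis, 0.01))
}
```

Here the aggregate is 60 errors in 10,000 requests (0.6%), so a 1% service-level check passes even though /refund is failing 50% of the time.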

Foremast Dashboard

Create a dashboard that can bring together the various aspects of Foremast and provide an application-centric view, with a well rounded picture of what is going on with the application. This should include:

  • Metrics
  • Logs
  • Traces
  • Alerts
  • Events, such as:
    • Deployment
    • Config change
    • etc.

A hand drawn mock-up sketch of it is attached:
img_1800

Anomaly detection for logs

Foremast Brain currently has algorithms that can detect anomalies from time series data such as metrics.

Support is being added to ingest logs from various sources (#23). Once Foremast is able to ingest the logs, Brain needs to add algorithms and learning models that can detect anomalies in logs. Key capabilities of this should include:

  • Stream processing of logs so the detection can be in near real-time
  • Ability to cluster anomalies and present them visually
  • Ability to customize and transform logs before they are analyzed

Remediation Framework

The action framework defines how Foremast handles an unhealthy state once it is detected.
An action can be a remediation performed by Foremast. Users can choose from options such as:

  1. Rollback
  2. Pause: for deployments with a large number of pods, pausing the rollout reduces the impact.
  3. Scaling: for specific use cases, such as connections stacking up or a CPU spike.
  4. Auto: Foremast decides, using the information it has about the deployment to pick a suitable action.
  5. etc.

An action can also be sending alerts to users. Since alerting is not Foremast's main focus, we will start with a webhook implementation. Foremast will send as much context information as possible, so that developers can easily integrate it with their existing alert system.
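
A webhook of this kind could look roughly like the sketch below: Foremast serializes the context of the unhealthy state as JSON and POSTs it to a URL configured by the user. The payload fields and function names are hypothetical, since the issue does not define a schema.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// alertContext is a hypothetical webhook payload; the field names are
// illustrative and not taken from Foremast.
type alertContext struct {
	Application string  `json:"application"`
	Namespace   string  `json:"namespace"`
	Metric      string  `json:"metric"`
	Value       float64 `json:"value"`
	Action      string  `json:"action"`
}

// encode serializes the context as the JSON body of the webhook call.
func encode(c alertContext) ([]byte, error) {
	return json.Marshal(c)
}

// send POSTs the context to the user's webhook URL and treats any
// status code of 300 or above as a failure.
func send(client *http.Client, url string, c alertContext) error {
	body, err := encode(c)
	if err != nil {
		return err
	}
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("webhook returned %s", resp.Status)
	}
	return nil
}

func main() {
	body, err := encode(alertContext{
		Application: "demo", Namespace: "default",
		Metric: "error5xx", Value: 12, Action: "rollback",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // what a receiver would get in the request body
}
```

Keeping the payload a flat JSON object makes it easy to forward to systems such as Slack or PagerDuty through a small adapter on the receiving side.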

Expose metrics for prometheus on spring-boot 1.x

Micrometer only supports Spring Boot 1.5.x and later, so legacy applications on Spring Boot 1.4 or earlier have no existing component to enable metrics for Prometheus.

Upgrading an existing application to Spring Boot 1.5.x is a challenge. Application developers usually don't want such a non-business feature to impact their application.

So we need to backport Micrometer to support metrics on Spring Boot 1.4 or 1.3.

Add website build to Travis CI

Build the website under docs/ folder along with the site build.

Dependencies for building the website:

  • Jekyll
  • Ruby
  • Bundle

Tracing integration with Foremast

Add connectors for distributed tracing systems to Foremast. The Foremast connector will be built to conform to the OpenTracing interfaces.

Foremast should be able to ingest tracing information from any Tracer that implements the Open Tracing specification.

Enable Travis CI

Enable Travis CI in Foremast so that any change triggers a build and produces a Docker image.

Improve Marketing Website

I did an audit of the marketing website and found the following:

  1. The features under Everything about Foremast should be center aligned. Currently it's a little off balance.
  2. The footer is left justified. It would look a lot better in the center.
  3. On mobile the image overflows.
  4. The footer is not sticky. On larger screens this means the footer will be at the end of the content rather than the bottom of the page.
  5. On getting started, the highlighted header on the right skips Installing Foremast.
  6. On getting started you link to github.intuit.com.
  7. On about and about screens, a floating footer is present on normal screens.
  8. The about page says I can contact you on Slack but doesn't link to the Slack team.
  9. Same thing for the calendar. What are the meetup days?

Foremast-Service should store rollout or canary strategy to elasticsearch

  1. Foremast-Service should store the strategy, canary (pre-rollout) or rollout (post-rollout), in Elasticsearch.
  2. Foremast-brain can then decide based on the strategy instead of checking for the existence of a baseline or historical metrics.
  3. The strategy needs to be redefined for all use cases and kept consistent between upstream and downstream.

KNative support in Foremast

Add support in Foremast for serverless applications that are running on KNative.

These functions need to be supported:

  • Definition of metrics for serverless applications
  • Watching of the metrics - could leverage metrics coming from both Prometheus and Istio #25
  • Remediation and actions

Foremast-Service proxy Prometheus Query API

We expect that foremast-service can access all the Prometheus endpoints, but that not every Foremast client (such as the Foremast dashboard) can. Clients can therefore invoke foremast-service to run Prometheus queries on their behalf.
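
One possible shape for such a proxy, using only the Go standard library, is sketched below. It forwards the two Prometheus HTTP query endpoints (/api/v1/query and /api/v1/query_range) and rejects everything else. Restricting the paths is an assumption made here for safety, not a stated requirement, and the function names are hypothetical. The example runs against a local stand-in server rather than a real Prometheus.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// allowedPath limits the proxy to the Prometheus HTTP query endpoints.
func allowedPath(p string) bool {
	return p == "/api/v1/query" || p == "/api/v1/query_range"
}

// newQueryProxy returns a handler that forwards allowed requests to the
// Prometheus server at target and answers 404 for any other path.
func newQueryProxy(target string) (http.Handler, error) {
	u, err := url.Parse(target)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(u)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !allowedPath(r.URL.Path) {
			http.NotFound(w, r)
			return
		}
		proxy.ServeHTTP(w, r)
	}), nil
}

func main() {
	// A stand-in for Prometheus so the sketch runs without a cluster.
	fake := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"status":"success"}`)
	}))
	defer fake.Close()

	handler, err := newQueryProxy(fake.URL)
	if err != nil {
		panic(err)
	}
	for _, path := range []string{"/api/v1/query?query=up", "/api/v1/admin/tsdb/snapshot"} {
		rec := httptest.NewRecorder()
		handler.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
		fmt.Println(path, "->", rec.Code)
	}
}
```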
