theliv's People

Contributors

24237805, brianwarner, dependabot[bot], jessieteng89, kishoregv, mason-liu, michael12312, olivia-zh-xm, padraigmc, rajarajanpsj, srikaraum, towens182, wangli1030

theliv's Issues

Copyright headers for UI files

Describe the solution you'd like:
Add copyright headers to the UI files.

Create deployment guide and Helm chart

User Story

As a user of this Kubernetes open source project, I want a deployment guide and Helm chart so I can easily install and configure the project in my Kubernetes cluster.

Description

We need to create docs and assets to make deployment of this project simpler for users.

Deployment Guide

  • Document steps for deployment
    • Prerequisites (K8s version, resources, etc)
    • Install with YAML manifests
    • Install with Helm
  • Include examples and commands for:
    • Creating namespaces/service accounts
    • Applying RBAC
    • Creating configmaps/secrets
    • Deploying applications
    • Accessing services

Helm Chart

  • Package Kubernetes manifests into a Helm chart
    • Support customizable namespace
    • Parameterize resource requests/limits
    • Parameterize replicas
    • Include CHANGELOG and README
  • Add to chart repository

Acceptance Criteria

  • Deployment guide doc added to docs/deploy.md
  • Helm chart published to [repo]
  • Doc updated with install instructions using Helm chart
  • Installation and configuration via Helm and docs steps tested and confirmed

Links

  • Chart repo: [link]
  • Docs: [link]

feat: process prometheus alerts and convert into problem structs

Theliv should integrate with Prometheus: theliv will maintain its own set of Prometheus alerts in each cluster. These alerts would be created and managed by theliv, and installing a Prometheus server in each Kubernetes cluster is a prerequisite.

  • The idea is that when a user comes to the theliv UI and searches for possible issues by supplying a cluster name and a namespace name, theliv runs logic that connects to the Prometheus API server and fetches any active alerts related to the user-supplied namespace, the management namespaces, or the cluster as a whole.
  • Once the alerts are gathered, they are converted into problem structs. The problem struct could be as simple as the one below.

This list of problems would then be run through an investigation framework and passed on to the aggregation logic, which produces the report cards and shows them to the user.

problem struct:
name: {alert_name}
description: {alert_description}
tags: {alert_tags}
affectedResources: []Resource
details: []string {alert_details}

Resource struct
object: runtime.Object
objectKind: string
ownerKind: string
owner: runtime.Object
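
The filtering and conversion described above could look roughly like the sketch below. It is only an illustration under several assumptions: the `Alert` type, the `namespace` and `severity` label names, and the `description` and `summary` annotation names are common Prometheus conventions but are not specified in this issue, and `any` stands in for `runtime.Object` so the snippet compiles without Kubernetes dependencies.

```go
package main

// Alert is a hypothetical, simplified view of an active Prometheus alert.
type Alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// Resource mirrors the Resource struct from the issue. The issue uses
// runtime.Object for Object and Owner; `any` is a stand-in here.
type Resource struct {
	Object     any
	ObjectKind string
	OwnerKind  string
	Owner      any
}

// Problem mirrors the problem struct from the issue.
type Problem struct {
	Name              string
	Description       string
	Tags              []string
	AffectedResources []Resource
	Details           []string
}

// relevant reports whether an alert is cluster-level (assumed here to mean
// "has no namespace label"), belongs to the user's namespace, or belongs to
// one of the management namespaces.
func relevant(a Alert, userNS string, mgmtNS map[string]bool) bool {
	ns := a.Labels["namespace"]
	if ns == "" {
		return true
	}
	return ns == userNS || mgmtNS[ns]
}

// toProblem maps alert fields onto the problem struct. Using the severity
// label as the only tag is an assumption made for this sketch.
func toProblem(a Alert) Problem {
	p := Problem{
		Name:        a.Labels["alertname"],
		Description: a.Annotations["description"],
	}
	if sev := a.Labels["severity"]; sev != "" {
		p.Tags = append(p.Tags, sev)
	}
	if s := a.Annotations["summary"]; s != "" {
		p.Details = append(p.Details, s)
	}
	return p
}

// ProblemsFor keeps only the relevant alerts and converts them to problems.
func ProblemsFor(alerts []Alert, userNS string, mgmtNS map[string]bool) []Problem {
	var problems []Problem
	for _, a := range alerts {
		if relevant(a, userNS, mgmtNS) {
			problems = append(problems, toProblem(a))
		}
	}
	return problems
}
```

Populating `AffectedResources` is left out because it would require looking up objects through the Kubernetes API.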

Integrate Generative AI model for trouble-shooting

1. Current solution:

Theliv currently uses a rule-based solution for issue detection and analysis.

  • Pain points:
  1. Difficult to extend: supporting new issues requires development work.
  2. Must be updated continuously to support new kinds of issues.

2. Preferable Solution:

Take advantage of a generative AI model to troubleshoot issues and provide suggestions.

3. Motivations:

  1. New Kubernetes issue-analysis products such as k8sgpt and Akuity have introduced generative AI for troubleshooting and generating suggestions.
  2. It is a more advanced and intelligent approach than the rule-based solution, and keeps pace with current and future trends.
  3. Troubleshooting and solutions become faster and more accurate, reducing MTTD and MTTR as well as time and effort.
  4. It helps to eventually achieve AIOps.

4. Work required

  1. Define a common interface for generative AI model API integration.
  2. Create client implementations for popular generative model services, such as OpenAI.
  3. Build the API request and receive the response.
  4. Before sending the request, remove internal or confidential data.
  5. Show the results in the UI.
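
As a rough illustration of items 1 to 4, the sketch below defines a common interface and a redaction step that runs before any request is sent. All names here (`Completer`, `Redact`, `Suggest`, `echoClient`) are hypothetical and not part of theliv, and the two redaction patterns are examples only; a real deployment would need its own list of what counts as internal or confidential.

```go
package main

import (
	"context"
	"regexp"
)

// Completer is a hypothetical common interface. Each provider-specific
// client (for example one for OpenAI) would implement it.
type Completer interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// sensitive holds illustrative patterns for data to strip before sending.
var sensitive = []*regexp.Regexp{
	regexp.MustCompile(`\b\d{1,3}(\.\d{1,3}){3}\b`),       // IPv4-looking addresses
	regexp.MustCompile(`(?i)(password|token|secret)=\S+`), // key=value credentials
}

// Redact replaces every match of the patterns above with a placeholder.
func Redact(s string) string {
	for _, re := range sensitive {
		s = re.ReplaceAllString(s, "[REDACTED]")
	}
	return s
}

// Suggest redacts the problem text, builds the prompt, and asks the model.
func Suggest(ctx context.Context, c Completer, problemText string) (string, error) {
	prompt := "Suggest troubleshooting steps for this Kubernetes problem:\n" + Redact(problemText)
	return c.Complete(ctx, prompt)
}

// echoClient is a stub Completer for local testing; it returns the prompt
// unchanged and never calls an external service.
type echoClient struct{}

func (echoClient) Complete(_ context.Context, prompt string) (string, error) {
	return prompt, nil
}
```

Keeping redaction inside `Suggest` rather than in each client means a new provider cannot accidentally skip it.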

feat: implement the investigator logic which will troubleshoot the alerts further

Alerts in general give you high-level information about what is going wrong. Sometimes that is enough for the user, but sometimes further troubleshooting is needed. Theliv will maintain a set of "investigator" functions, which anyone can contribute. These are single-purpose functions that are mapped to specific alerts and help troubleshoot those alerts further.

Imagine something like map[alert_name][investigator_func]. When theliv gets the list of alerts from the Prometheus API and constructs the problem structs, each problem is passed on to the matching investigator. For example, when an image pull backoff alert fires, a dedicated investigator function holds the troubleshooting logic for image pull backoff issues.

The investigator will then add more details to the details field of the problem struct. This also means anyone from the operators team should be able to raise a PR with investigator functions for specific types of alerts (e.g. pending pods, IP exhaustion).

This investigator piece is one of the main elements of theliv: instead of an operations team member analyzing and troubleshooting the issue for the application team, theliv will do it and shed more light on what is happening via report cards in the UI.

The investigator should also have access to some of the following:
cluster name, namespace, Kubernetes API client, AWS client (for ingress-related issues), etc.
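
A minimal sketch of the alert-to-investigator mapping follows. The `Input` type, the `Register` and `Investigate` helpers, the `ImagePullBackOff` alert name, and the placeholder investigator body are all assumptions made for illustration; the client fields are typed `any` so the snippet compiles without the Kubernetes or AWS SDKs.

```go
package main

// Problem is trimmed to the fields this sketch needs.
type Problem struct {
	Name    string
	Details []string
}

// Input carries what the issue says an investigator should have access to.
type Input struct {
	ClusterName string
	Namespace   string
	KubeClient  any // stand-in for a Kubernetes API client
	AWSClient   any // stand-in for an AWS client (ingress-related checks)
}

// Investigator is a single-purpose function that adds details to a problem.
type Investigator func(in Input, p *Problem)

// registry maps an alert name to its investigator function.
var registry = map[string]Investigator{}

// Register associates an investigator with an alert name. A contributor
// adding a new investigator would call this once for their function.
func Register(alertName string, fn Investigator) {
	registry[alertName] = fn
}

// Investigate runs the matching investigator for every problem, in place.
// Problems without a registered investigator are left untouched.
func Investigate(in Input, problems []Problem) {
	for i := range problems {
		if fn, ok := registry[problems[i].Name]; ok {
			fn(in, &problems[i])
		}
	}
}

// imagePullBackOff is a placeholder investigator. Real logic would query
// the Kubernetes API; this one only appends a static hint.
func imagePullBackOff(in Input, p *Problem) {
	p.Details = append(p.Details,
		"check image name, tag and pull secrets in namespace "+in.Namespace)
}
```

With this shape, a call such as `Register("ImagePullBackOff", imagePullBackOff)` at startup is all that is needed to wire in a new investigator.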

feat: enhance theliv investigators for kubernetes to analyze kubernetes events

Why do you want this feature:
Theliv investigator functions are supposed to analyze alerts in depth and provide actionable insights and next steps to users. This means investigator functions should analyze Kubernetes events in combination with the alert information and provide more information to the user.

Describe the solution you'd like:
Theliv provides an investigation framework on top of Prometheus alerts: it analyzes alerts from Prometheus and dives deeper to provide actionable insights to the user. For example, when a crash loop backoff alert is triggered, an SRE or DevOps team member would typically dig deeper to find the root cause. Often, that involves analyzing the Kubernetes events.

  1. Theliv has an investigator for crash loop backoff, which needs to be enhanced to analyze the Kubernetes events and use them to give the user more information, e.g. based on the exit code.
  2. The same goes for the other investigators.
  3. Events are usually kept in etcd for an hour, so the investigator function will work on a best-effort basis. If the user debugs with theliv within that hour, they will get the additional information. If they use the app after an hour, the investigator function will not be able to analyze the events and will do its best to add information on top of what the alert already provides.
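
The best-effort behaviour described in the list above could be sketched as follows. The `Event` type is a stand-in for the Kubernetes event fields needed here, `EnrichCrashLoop` is a hypothetical name, and the hint table is a small illustrative sample: exit codes above 128 follow the usual 128 + signal number convention, so 137 corresponds to SIGKILL and 143 to SIGTERM.

```go
package main

import (
	"fmt"
	"time"
)

// Event is a stand-in for the parts of a Kubernetes event used here.
type Event struct {
	Reason   string
	Message  string
	LastSeen time.Time
}

// exitCodeHints maps a few container exit codes to a short explanation.
var exitCodeHints = map[int]string{
	1:   "application exited with a general error; check container logs",
	137: "container received SIGKILL (128+9), often an out-of-memory kill",
	143: "container received SIGTERM (128+15)",
}

// EnrichCrashLoop appends what it can learn to the existing details.
// events may be empty when they have already expired; in that case the
// details coming from the alert are kept and a note is added.
func EnrichCrashLoop(details []string, events []Event, exitCode int) []string {
	if hint, ok := exitCodeHints[exitCode]; ok {
		details = append(details, fmt.Sprintf("exit code %d: %s", exitCode, hint))
	}
	if len(events) == 0 {
		return append(details,
			"no Kubernetes events found (they may have expired); showing alert data only")
	}
	for _, e := range events {
		details = append(details, fmt.Sprintf("event %s: %s", e.Reason, e.Message))
	}
	return details
}
```

The function never removes existing details, so a user who arrives after the events have expired still sees everything the alert provided.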

feat: changes to the aggregation logic

The existing aggregation logic needs to be modified accordingly. The logic still creates report cards for the user to view, based on applications. There will be two types of report cards. The first covers cluster-level alerts, cloud-provider-level alerts such as AZ or region failures, and alerts related to the management namespaces (where some of the critical add-ons run). The second covers only alerts pertaining to the user-supplied namespace. The alerts within each report card need to be arranged in a specific order (simple correlation logic), e.g. alerts related to nodes or the API server go at the top, since they could be the root cause of many other issues.
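
One possible shape for this two-card split and ordering is sketched below. The `ReportCard` type, the `Component` field, and the two-level priority are assumptions made for illustration; the issue does not define how an alert is recognised as node or API server related. The sketch also assumes the user-supplied namespace is never empty.

```go
package main

import "sort"

// Problem is trimmed to the fields this sketch needs.
type Problem struct {
	Name      string
	Namespace string // empty for cluster or cloud-provider level alerts
	Component string // hypothetical: "node", "apiserver", or empty
}

// ReportCard groups the problems shown together in the UI.
type ReportCard struct {
	Title    string
	Problems []Problem
}

// priority returns a sort rank; lower values are shown first. Node and
// API server problems go first as likely root causes.
func priority(p Problem) int {
	switch p.Component {
	case "node", "apiserver":
		return 0
	default:
		return 1
	}
}

// Aggregate splits problems into the two report cards and orders each one.
// Everything outside the user's namespace lands on the platform card.
func Aggregate(problems []Problem, userNS string) (platform, namespace ReportCard) {
	platform.Title = "Cluster, cloud provider and management namespaces"
	namespace.Title = "Namespace " + userNS
	for _, p := range problems {
		if p.Namespace == userNS {
			namespace.Problems = append(namespace.Problems, p)
		} else {
			platform.Problems = append(platform.Problems, p)
		}
	}
	for _, card := range []*ReportCard{&platform, &namespace} {
		ps := card.Problems
		// Stable sort keeps the original order among equal priorities.
		sort.SliceStable(ps, func(i, j int) bool {
			return priority(ps[i]) < priority(ps[j])
		})
	}
	return platform, namespace
}
```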

Create GitHub Actions to push image to ghcr.io

Describe the solution you'd like:
Create a GitHub Actions workflow that pushes the image to ghcr.io.
