fidelity / theliv


License: Apache License 2.0

Dockerfile 0.54% Go 46.54% Shell 0.33% JavaScript 0.68% TypeScript 19.05% HTML 16.70% SCSS 16.17%

theliv's Issues

Copyright headers for UI files

Describe the solution you'd like:
Add copyright headers to the UI files.


Create deployment guide and Helm chart

User Story

As a user of this Kubernetes open source project, I want a deployment guide and Helm chart so I can easily install and configure the project in my Kubernetes cluster.

Description

We need to create docs and assets to make deployment of this project simpler for users.

Deployment Guide

Document the deployment steps:
- Prerequisites (Kubernetes version, resources, etc.)
- Install with YAML manifests
- Install with Helm
Include examples and commands for:
- Creating namespaces/service accounts
- Applying RBAC
- Creating configmaps/secrets
- Deploying applications
- Accessing services

Helm Chart

Package Kubernetes manifests into a Helm chart
- Support customizable namespace
- Parameterize resource requests/limits
- Parameterize replicas
- Include CHANGELOG and README
Add to chart repository

Acceptance Criteria

Deployment guide doc added to docs/deploy.md
Helm chart published to [repo]
Doc updated with install instructions using Helm chart
Installation and configuration via Helm and docs steps tested and confirmed

Links

Chart repo: [link]
Docs: [link]

feat: process prometheus alerts and convert into problem structs

Theliv should integrate with Prometheus: theliv will maintain its own set of Prometheus alerts in each cluster. These alerts would be created and managed by theliv, and a Prometheus server installed in each Kubernetes cluster is a prerequisite.

The idea is that when a user comes to the theliv UI and searches for possible issues by giving a cluster name and a namespace name, theliv runs logic that connects to the Prometheus API server and fetches any active alerts related to the user-supplied namespace, the management namespaces, or the cluster as a whole.
Once the alerts are gathered, they are converted into problem structs. The problem struct could be as simple as the one below.

This list of problems would then be run through an investigation framework and passed on to aggregation logic, which produces the report cards shown to the user.

problem struct:
name: {alert_name}
description: {alert_description}
tags: {alert_tags}
affectedResources: []Resource
details: []string {alert_details}

Resource struct
object: runtime.Object
objectKind: string
ownerKind: string
owner: runtime.Object
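
The outline above could be expressed in Go roughly as follows. This is only a sketch under stated assumptions: `Object` is a local stand-in for `runtime.Object` so the example compiles without Kubernetes dependencies, `Alert` is a hypothetical trimmed-down view of an active Prometheus alert, and the `namespace` label and the `description` and `summary` annotations are assumed to be present on the alert rules. `ToProblems` and its parameters are illustrative names, not theliv's actual API.

```go
package main

import "fmt"

// Object is a stand-in for runtime.Object (assumption: the real code would
// import it from k8s.io/apimachinery).
type Object interface{}

// Resource mirrors the "Resource struct" outlined above.
type Resource struct {
	Object     Object
	ObjectKind string
	OwnerKind  string
	Owner      Object
}

// Problem mirrors the "problem struct" outlined above.
type Problem struct {
	Name              string
	Description       string
	Tags              map[string]string
	AffectedResources []Resource
	Details           []string
}

// Alert is a hypothetical, trimmed-down view of an active Prometheus alert.
type Alert struct {
	Labels      map[string]string
	Annotations map[string]string
}

// ToProblems keeps alerts that belong to the user namespace, to one of the
// management namespaces, or to no namespace at all (cluster level), and
// converts each kept alert into a Problem.
func ToProblems(alerts []Alert, userNS string, mgmtNS []string) []Problem {
	wanted := map[string]bool{"": true, userNS: true}
	for _, ns := range mgmtNS {
		wanted[ns] = true
	}
	var problems []Problem
	for _, a := range alerts {
		if !wanted[a.Labels["namespace"]] {
			continue
		}
		p := Problem{
			Name:        a.Labels["alertname"],
			Description: a.Annotations["description"],
			Tags:        a.Labels,
		}
		if s := a.Annotations["summary"]; s != "" {
			p.Details = append(p.Details, s)
		}
		problems = append(problems, p)
	}
	return problems
}

func main() {
	alerts := []Alert{
		{
			Labels:      map[string]string{"alertname": "KubePodCrashLooping", "namespace": "team-a"},
			Annotations: map[string]string{"description": "pod is restarting", "summary": "crash loop"},
		},
		{Labels: map[string]string{"alertname": "KubePodNotReady", "namespace": "team-b"}},
		{Labels: map[string]string{"alertname": "KubeNodeNotReady"}},
	}
	// The team-b alert is dropped; the other two become problems.
	for _, p := range ToProblems(alerts, "team-a", []string{"kube-system"}) {
		fmt.Println(p.Name, p.Details)
	}
}
```

`AffectedResources` is left empty here; filling it would need a Kubernetes API lookup, which is outside the scope of this sketch.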

[Docs] typo in readme

Documentation link: https://github.com/fidelity/theliv/blob/main/readme.md

Description:
Finally theliv aims to be an extensible framework where custom checks can be plugged in whereever required.

Change whereever to wherever

Integrate Generative AI model for trouble-shooting

1. Current solution:

Theliv currently uses a rule-based solution for issue detection and analysis.

Pain points:

Difficult to extend: supporting new issues requires development work.
Must be kept up to date continuously to support new kinds of issues.

2. Preferable Solution:

Take advantage of a Generative AI model to do troubleshooting and provide suggestions.

3. Motivations:

Newer Kubernetes issue-analysis products, such as k8sgpt and Akuity, use Generative AI for troubleshooting and for generating suggestions.
It is a more advanced and intelligent approach than a rule-based solution, and it keeps pace with current and future trends.
It can make troubleshooting and solution provision quicker and more accurate, reducing MTTD and MTTR as well as time and effort.
It helps move towards AIOps.

4. Work required:

Create a common interface for Generative AI model API integration.
Create client implementations for popular Generative AI model services, such as OpenAI.
Build the API request and receive the response.
Before sending the request, remove internal or confidential data.
Show the results in the UI.
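
A minimal sketch of the first and fourth work items, assuming a Go implementation. The `Analyzer` interface, `Redact`, `Troubleshoot`, and `echoAnalyzer` are hypothetical names chosen for illustration, and the two redaction patterns (IPv4 addresses and email addresses) are only examples of what might count as internal data; a real deployment would need a broader, reviewed list.

```go
package main

import (
	"fmt"
	"regexp"
)

// Analyzer is a hypothetical common interface that each Generative AI
// backend client would implement.
type Analyzer interface {
	Analyze(prompt string) (string, error)
}

// Example patterns treated as confidential in this sketch.
var (
	ipv4Re  = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
)

// Redact masks IPv4 addresses and email addresses in s.
func Redact(s string) string {
	s = ipv4Re.ReplaceAllString(s, "[REDACTED_IP]")
	return emailRe.ReplaceAllString(s, "[REDACTED_EMAIL]")
}

// echoAnalyzer is a fake backend used only to show the call flow.
type echoAnalyzer struct{}

func (echoAnalyzer) Analyze(prompt string) (string, error) {
	return "suggestion for: " + prompt, nil
}

// Troubleshoot redacts the problem text before handing it to the backend,
// so the raw text never leaves the process.
func Troubleshoot(a Analyzer, problem string) (string, error) {
	return a.Analyze(Redact(problem))
}

func main() {
	out, err := Troubleshoot(echoAnalyzer{}, "pod 10.0.3.7 failed, owner ops@example.com")
	if err != nil {
		fmt.Println("analysis failed:", err)
		return
	}
	fmt.Println(out)
}
```

Keeping redaction inside `Troubleshoot` rather than in each client means a new backend cannot skip it by accident.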

feat: implement the investigator logic which will troubleshoot the alerts further

Alerts in general give high-level information about what is going wrong. Sometimes that is enough for the user, but sometimes further troubleshooting is needed. Theliv will maintain a set of "investigator" functions, which anyone can contribute. These are single-purpose functions that are mapped to specific alerts and help troubleshoot those alerts further.

Imagine something like map[alert_name][investigator_func]. When theliv gets the list of alerts from the Prometheus API and constructs the problem structs, each problem is passed on to a matching investigator. For example, if an image pull backoff alert fires, a dedicated investigator function holds the troubleshooting logic for image pull backoff issues.

The investigator will then ADD more details to the []details field of the problem struct. This also means anyone from the operations team should be able to raise a PR with investigator functions for specific types of alerts (e.g. pending pods, IP exhaustion).

This investigator piece is one of the main elements of theliv: instead of an operations team member analyzing and troubleshooting the issue for the application team, theliv will do it and shed more light on what is happening via report cards in the UI.

The investigator should also have access to some of the following:
cluster name, namespace, kube API client, AWS client (for ingress-related issues), etc.
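
The registry idea above might look like this in Go. It is a sketch with assumed names: `Input`, `Investigator`, `Register`, and `Investigate` are hypothetical, `Problem` is reduced to the two fields the example needs, and the Kubernetes and AWS clients that a real investigator would receive are left out so the example stays self-contained.

```go
package main

import "fmt"

// Problem is reduced to the fields this sketch needs.
type Problem struct {
	Name    string
	Details []string
}

// Input bundles what an investigator can use. A real version would also
// carry a kube API client and cloud provider clients (omitted here).
type Input struct {
	ClusterName string
	Namespace   string
}

// Investigator inspects one problem and appends findings to its Details.
type Investigator func(in Input, p *Problem)

// registry maps an alert name to its investigator function.
var registry = map[string]Investigator{}

// Register associates an investigator with an alert name.
func Register(alertName string, fn Investigator) {
	registry[alertName] = fn
}

// Investigate runs the matching investigator, if any, for every problem.
// Problems without a registered investigator are left unchanged.
func Investigate(in Input, problems []Problem) {
	for i := range problems {
		if fn, ok := registry[problems[i].Name]; ok {
			fn(in, &problems[i])
		}
	}
}

// imagePullBackOff is an example investigator with placeholder logic.
func imagePullBackOff(in Input, p *Problem) {
	p.Details = append(p.Details,
		fmt.Sprintf("check image name, tag and pull secrets in namespace %q", in.Namespace))
}

func main() {
	Register("ImagePullBackOff", imagePullBackOff)
	problems := []Problem{{Name: "ImagePullBackOff"}, {Name: "UnknownAlert"}}
	Investigate(Input{ClusterName: "dev", Namespace: "team-a"}, problems)
	for _, p := range problems {
		fmt.Println(p.Name, p.Details)
	}
}
```

With this shape, contributing a new investigator is one function plus one `Register` call, which fits the goal of letting operators add investigators through small PRs.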

feat: enhance theliv investigators for kubernetes to analyze kubernetes events

Why do you want this feature:
Theliv investigator functions are supposed to analyze alerts in depth and provide actionable insights and next steps to users. This means investigator functions should analyze Kubernetes events in combination with the alert information and provide more information to the user.

Describe the solution you'd like:
Theliv provides an investigation framework on top of Prometheus alerts: it analyzes alerts from Prometheus and dives deeper to provide actionable insights to the user. For example, when a crash loop backoff alert is triggered, an SRE or DevOps team member would typically dive deeper to figure out the root cause. Often, that involves analyzing the Kubernetes events.

Theliv has an investigator for crash loop backoff that needs to be enhanced to analyze the Kubernetes events and use them to provide more information to the user, for example based on the exit code.
The same goes for the other investigators.
Events are usually kept in etcd for an hour, so the investigator function will work on a best-effort basis. If the user debugs with theliv within that hour, they will get the additional information. After an hour, the investigator function cannot analyze the events and will do its best to add information on top of what the alert already provides.
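
As a sketch of the best-effort behaviour described above, assuming Go: `ExitCodeHint` maps a container exit code to a short hint, using the convention that codes above 128 mean the process was terminated by signal `code - 128`. `Event` is a hypothetical trimmed-down view of a Kubernetes event, and `Enrich` is an illustrative name; the hint texts are examples rather than theliv's wording.

```go
package main

import "fmt"

// ExitCodeHint maps common container exit codes to a short hint.
func ExitCodeHint(code int) string {
	switch {
	case code == 0:
		return "container exited cleanly; check why it is being restarted"
	case code == 137:
		return "killed by SIGKILL (128+9); often an out-of-memory kill"
	case code == 139:
		return "segmentation fault, SIGSEGV (128+11)"
	case code == 143:
		return "terminated by SIGTERM (128+15)"
	case code > 128:
		return fmt.Sprintf("terminated by signal %d", code-128)
	default:
		return "application error; check the container logs"
	}
}

// Event is a hypothetical, trimmed-down view of a Kubernetes event.
type Event struct {
	Reason  string
	Message string
}

// Enrich adds an exit-code hint and, when events are still available,
// one detail line per event. With no events it says so and returns.
func Enrich(details []string, exitCode int, events []Event) []string {
	details = append(details, ExitCodeHint(exitCode))
	if len(events) == 0 {
		return append(details, "no Kubernetes events available (they may have expired)")
	}
	for _, e := range events {
		details = append(details, e.Reason+": "+e.Message)
	}
	return details
}

func main() {
	events := []Event{{Reason: "BackOff", Message: "Back-off restarting failed container"}}
	for _, d := range Enrich(nil, 137, events) {
		fmt.Println(d)
	}
	// Same problem inspected after the events have expired.
	for _, d := range Enrich(nil, 137, nil) {
		fmt.Println(d)
	}
}
```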


feat: changes to the aggregation logic

The existing aggregation logic needs to be modified accordingly. It still creates report cards for the user to view, based on applications. There will be two types of report card. The first covers cluster-level alerts, cloud-provider-level alerts (AZ or region failures, for example), and alerts from the management namespaces, where some of the critical add-ons run. The second covers only alerts pertaining to the user-supplied namespace. The alerts within each report card need to be arranged in a specific order (simple correlation logic): for example, alerts related to nodes or the API server go at the top, since they could be the root cause of many other issues.
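
One way to sketch the two report cards and the ordering rule in Go. Everything here is illustrative: the `Component` field, the `priority` table, and the names `ReportCards` and `Aggregate` are assumptions, and the input is assumed to contain only problems already filtered to the user namespace, the management namespaces, and the cluster level.

```go
package main

import (
	"fmt"
	"sort"
)

// Problem is reduced to the fields this sketch needs.
type Problem struct {
	Name      string
	Namespace string // empty for cluster-level problems
	Component string // hypothetical: "node", "apiserver", "pod", ...
}

// ReportCards holds the two cards described above.
type ReportCards struct {
	Platform      []Problem // cluster, cloud provider, management namespaces
	UserNamespace []Problem // user-supplied namespace only
}

// priority lists components that sort to the top (lower rank first).
var priority = map[string]int{"node": 0, "apiserver": 0}

func rank(p Problem) int {
	if r, ok := priority[p.Component]; ok {
		return r
	}
	return 1
}

// Aggregate splits problems into the two cards and orders each card so
// that likely root causes (nodes, API server) come first.
func Aggregate(problems []Problem, userNS string) ReportCards {
	var rc ReportCards
	for _, p := range problems {
		if p.Namespace == userNS {
			rc.UserNamespace = append(rc.UserNamespace, p)
		} else {
			rc.Platform = append(rc.Platform, p)
		}
	}
	byRank := func(ps []Problem) {
		// Stable sort keeps the original order among equal ranks.
		sort.SliceStable(ps, func(i, j int) bool { return rank(ps[i]) < rank(ps[j]) })
	}
	byRank(rc.Platform)
	byRank(rc.UserNamespace)
	return rc
}

func main() {
	rc := Aggregate([]Problem{
		{Name: "PodCrashLooping", Namespace: "team-a", Component: "pod"},
		{Name: "CoreDNSDown", Namespace: "kube-system", Component: "pod"},
		{Name: "NodeNotReady", Component: "node"},
	}, "team-a")
	fmt.Println("platform:", rc.Platform)
	fmt.Println("namespace:", rc.UserNamespace)
}
```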

Log issues that happen in goroutines and report them to users for reference

The Ingress detector currently uses goroutines to call services in parallel.

If a service call fails in one goroutine, the failure should be recorded. At the end, a summary of the failures is generated and returned to the user.

The summary serves as a notification, so the user knows to try again later.
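
A minimal sketch of this pattern in Go, using only the standard library. `CallAll`, `Summary`, and `errTimeout` are hypothetical names, and the map of named calls stands in for the real service calls made by the Ingress detector.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// errTimeout is a sample error used by the example calls.
var errTimeout = errors.New("timeout")

// CallAll runs every call in its own goroutine, waits for all of them,
// and returns one "name: error" line per failed call, sorted by name.
func CallAll(calls map[string]func() error) []string {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		failures []string
	)
	for name, call := range calls {
		wg.Add(1)
		go func(name string, call func() error) {
			defer wg.Done()
			if err := call(); err != nil {
				// The mutex guards the shared slice across goroutines.
				mu.Lock()
				failures = append(failures, name+": "+err.Error())
				mu.Unlock()
			}
		}(name, call)
	}
	wg.Wait()
	sort.Strings(failures)
	return failures
}

// Summary turns the recorded failures into one user-facing message.
// It returns an empty string when nothing failed.
func Summary(failures []string) string {
	if len(failures) == 0 {
		return ""
	}
	return fmt.Sprintf("%d check(s) could not complete, please try again later: %s",
		len(failures), strings.Join(failures, "; "))
}

func main() {
	failures := CallAll(map[string]func() error{
		"load-balancer": func() error { return nil },
		"target-groups": func() error { return errTimeout },
	})
	for _, f := range failures {
		fmt.Println("logged:", f)
	}
	fmt.Println(Summary(failures))
}
```

Collecting the failures instead of returning on the first error lets the successful calls still contribute to the report while the user is told which parts are incomplete.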

feat: UI changes based on the aggregator logic

UI needs to be updated based on the changes in the aggregation logic.

[Docs]: typo in readme

Documentation link: https://github.com/fidelity/theliv/blob/main/readme.md

Description:
While an developer can perfectly be equipped with the necessary skills to debug using the above flow, most of the time they dont want to do it since their focus is on rolling out a business feature to production asap.

Change dont to don't