keikoproj / active-monitor
Provides deep monitoring and self-healing of Kubernetes clusters
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
Currently, links to Slack in the README and issue templates are either broken or actually point to the argoproj workspace.
Describe the solution you'd like
Once a new Slack workspace is set up for orkaproj, update the assets in this repo to match.
Is your feature request related to a problem? Please describe.
Active-Monitor custom resources should have events displayed; events convey the state of each CR and its progress.
Describe the solution you'd like
We need to add events to custom resources for Active-Monitor
Have you thought about contributing yourself?
Yes
Is your feature request related to a problem? Please describe.
In Active-Monitor, the status of the workflow is polled every second, which results in too many Kubernetes API calls. As the number of monitors or self-healing use cases increases, this can result in additional cost in managed Kubernetes clusters; in non-managed clusters, the master nodes might have to be scaled up to accommodate these API calls.
Describe the solution you'd like
This can be solved by leveraging the https://github.com/keikoproj/inverse-exp-backoff library, which provides an API for inverse exponential backoff with a timeout. This will allow us to reduce the API calls made to the Kubernetes API server exponentially.
Have you thought about contributing yourself?
Yes I will implement it.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
Kubernetes 1.22 introduces breaking changes, so the deployment resources need to be updated for compatibility, such that users can run active-monitor on the latest Kubernetes versions.
Describe the bug
When a serviceAccount is passed in the HealthCheck spec, the same account should be applied to the Argo workflows which the controller creates. In this way, it is passed down to the running pod, where the account will actually be used.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
kubectl create -f <filename>
argo get <workflowName>
The default account was applied regardless of the configured serviceAccount.
Expected behavior
Configured serviceAccount should be set on the workflow
Version
all
Don't begin this work until we can confirm we want to make this change. This will likely require a design decision, the outcome of which should be attached to this ticket as a comment.
Is your feature request related to a problem? Please describe.
After this project was re-worked using kubebuilder 2.0, a prometheus style metrics server was exposed by default. This server provides data such as underlying golang cpu/memory/networking/gc details for the Active-Monitor controller. http://0.0.0.0:8080/metrics
Our application also exposes its own metrics server running on a separate port and end-pt. This server is meant to communicate details of healthcheck operation and to be consumed by entities which may need to take a remediating action. http://0.0.0.0:2112/metrics
Describe the solution you'd like
An open question is: should these 2 servers which are doing pretty much the same thing (though with respect to different data sets) be combined some way?
Should there be just a single port/path combo where ALL metrics (whether built-in or health check oriented) could be exposed?
If not a single port/path combo, how about the same port but one path for healthcheck metrics and another for internal metrics? e.g., 0.0.0.0:8080/metrics and 0.0.0.0:8080/internal, or similar.
kubectl describe healthcheck does not get expected health status
Steps to reproduce
kubectl apply -f https://raw.githubusercontent.com/orkaproj/active-monitor/master/deploy/deploy-argo.yaml
#step 6
kubectl describe healthcheck inline-hello-fbgj9 -n health
The result does not match what the README suggests:
...
Status:
  Failed Count:             0
  Finished At:              <timestamp>
  Last Successful Workflow: inline-hello-fbgj9
  Status:                   Succeeded
  Success Count:
Events:
Is your feature request related to a problem? Please describe.
Argo Workflows 3.x no longer supports ttlSecondsAfterFinished.
It needs to be replaced with ttlStrategy: secondsAfterCompletion
Describe the solution you'd like
Update the default ttlStrategy here:
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Bug Fix:
#88 - Active Monitor crashing with concurrent map updates
Currently, the success and failure counts associated with each healthcheck will start at 0 when first registered and monotonically increase over the life of the healthcheck (or active-monitor controller).
This works all right; however, one downside of this approach is that you need to know how long the healthcheck has been running in order to have any context for whether the counts are large or small.
Therefore, the aim of this ticket is to provide:
Open questions:
Is your feature request related to a problem? Please describe.
HealthCheck Custom Resource should have an ability to limit the number of times the remedy/self-healing should be run in a given interval of time.
Describe the solution you'd like
HealthCheck Custom Resource should provide parameters to limit the number of times the remedy should be run in a given interval. This would help avoid a continuous loop of running the health check and remedy in cases where the given remedy action does not work.
Have you thought about contributing yourself?
Yes. I will implement it.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Currently, Healthcheck custom resource and its child argo workflow can "detect" if there is a problem.
However, there is no place to express what action to take in case of a problem.
Therefore, healthcheck spec should support an alternative argo workflow for "remediation". If the main workflow fails, the "remediation" workflow should be run. Also, additional remediation metrics should be captured and exposed accordingly.
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
The project doesn't yet have a defined and documented release process. This task's aim is to document that in the README and get a 1.0.0 release pushed to dockerhub.
Is your feature request related to a problem? Please describe.
Update Argo Controller version to use latest features.
Describe the solution you'd like
Use latest Argo Controller version to leverage latest features.
Have you thought about contributing yourself?
Yes
Is your feature request related to a problem? Please describe
The problem is that the latest status as well as the success/fail counts related to a healthcheck can only be seen when it is kubectl describe'd.
Describe the solution you'd like
Instead, at least three columns should be added to the custom printer for healthcheck objects. Those are: status, success count, failure count.
This should be straightforward to accomplish using kubebuilder annotations regarding additional-printer-columns
Current example:
NAME AGE
foo 1h
bar 1h
Target example:
NAME LATEST STATUS SUCCESS CNT FAIL CNT AGE
foo Succeeded 4 0 1h
bar Failed 1 3 1h
Related Reading
https://book.kubebuilder.io/reference/generating-crd.html#additional-printer-columns
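For illustration, a sketch of how the kubebuilder printcolumn markers might look on the API type. The JSONPath values below are assumptions based on this ticket's target output and must be checked against the HealthCheck CRD's actual status schema:

```go
package main

import "fmt"

// HealthCheckStatus mirrors (a subset of) the status fields the
// printer columns would surface. Field names are assumptions.
type HealthCheckStatus struct {
	Status       string `json:"status"`
	SuccessCount int    `json:"successCount"`
	FailedCount  int    `json:"failedCount"`
}

// In the real API types file, markers like these above the HealthCheck
// struct tell kubebuilder to generate additionalPrinterColumns:
//
// +kubebuilder:printcolumn:name="LATEST STATUS",type="string",JSONPath=".status.status"
// +kubebuilder:printcolumn:name="SUCCESS CNT",type="integer",JSONPath=".status.successCount"
// +kubebuilder:printcolumn:name="FAIL CNT",type="integer",JSONPath=".status.failedCount"
// +kubebuilder:printcolumn:name="AGE",type="date",JSONPath=".metadata.creationTimestamp"
type HealthCheck struct {
	Status HealthCheckStatus `json:"status"`
}

func main() {
	hc := HealthCheck{Status: HealthCheckStatus{Status: "Succeeded", SuccessCount: 4}}
	fmt.Println(hc.Status.Status, hc.Status.SuccessCount, hc.Status.FailedCount)
}
```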
Is your feature request related to a problem? Please describe.
After this project was converted to kubebuilder style, custom metrics haven't been tested and confirmed to work.
Describe the solution you'd like
The aim of this ticket is to confirm this and update README accordingly
The concept of a health check succeeding or failing is related to the final return value from the nested/imported Argo workflow.
This isn't always incredibly obvious and can lead to scenarios where the workflow doesn't behave as expected yet is still marked as succeeded. Similarly, even if the workflow behaves as expected, it may indicate a failure if a non-0 return code is used.
README documentation should be improved to highlight this and provide users with patterns/strategies to ensure that healthchecks are behaving as expected and building confidence in the usage of Active-Monitor.
Is your feature request related to a problem? Please describe.
[2022-03-11T20:08:20.710Z] [2022-03-11T20:08:20Z] INFO - skipping instancegroup since addon active-monitor is missing file instancegroup.yaml
[2022-03-11T20:08:21.274Z] error: error validating "/tmp/kubectl_manifest775841473": error validating data: [ValidationError(CustomResourceDefinition.spec): unknown field "additionalPrinterColumns" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "subresources" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "validation" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "version" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec]; if you choose to ignore these errors, turn validation off with --validate=false
[2022-03-11T20:08:21.274Z] [2022-03-11T20:08:21Z] WARN - cmd.Run() failed for cmd kubectl apply --filename /tmp/kubectl_manifest775841473 --context arktika-bdd-data-usw2 with exit status 1
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
This project had previously supported healthchecks being scoped either to the cluster or to a namespace.
Describe the solution you'd like
This feature needs to be confirmed to still work. It may also require some design work to ensure it's still a necessary feature.
Describe the bug
Active-Monitor workflows run continuously if there are errors updating the custom resource (e.g., a storage error or the API server being busy).
The timers are then not stopped, causing leaks and a number of workflow pods being created.
Expected behavior
If the CR update fails, the timers should be stopped and the request requeued.
Logs
2021-02-22T13:57:55.825Z ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "monitoring/dns-healthcheck", "error": "Operation cannot be fulfilled on healthchecks.activemonitor.keikoproj.io \"dns-healthcheck\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
/workspace/controllers/healthcheck_controller.go:525
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1
/workspace/controllers/healthcheck_controller.go:391
2021-02-22T14:58:59.848Z ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "monitoring/dns-healthcheck", "error": "Operation cannot be fulfilled on healthchecks.activemonitor.keikoproj.io \"dns-healthcheck\": StorageError: invalid object, Code: 4, Key: /registry/activemonitor.keikoproj.io/healthchecks/monitoring/dns-healthcheck, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: xxxx, UID in object meta: "}
Describe the bug
Currently the Workflow's resource.serviceAccount attribute is required. However, it isn't actually necessary for all healthchecks/workflows. So, this attribute should be re-configured to be optional.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
Create a healthcheck omitting the resource.serviceAccount attribute.
https://aws.amazon.com/blogs/compute/preview-vcpu-based-instance-limits/
AWS has recently modified how it handles limits/quotas. Rather than limiting the number of EC2 instances or other direct resources, limits are now defined based on vCPU quotas.
This means that healthcheck(s) using AWS cli/APIs to determine usage vs. limits, may benefit from an update in order to use newer quota mechanisms which will be supported further into the future.
Is your feature request related to a problem? Please describe
Though it's generally rare in testing so far, there are cases in which the active-monitor-controller can crash and be forced to restart. Unless the condition is noticed and logged in some way, we may be losing information about the cause of the crash.
Describe the solution you'd like
We should follow the panic/recover pattern found in other, similar projects. ex: https://github.com/keikoproj/addon-manager/pull/28/files
Related Reading
https://blog.golang.org/defer-panic-and-recover
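A minimal sketch of the defer/panic/recover pattern described above; in the controller, the recovered value would go to the structured logger rather than a return value, and the wrapper name is illustrative:

```go
package main

import "fmt"

// safeRun executes fn, recovering from any panic and reporting it
// instead of letting it crash the whole controller process.
func safeRun(fn func()) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	fn()
	return
}

func main() {
	p, m := safeRun(func() { panic("boom") })
	fmt.Println("recovered:", p, "message:", m)
}
```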
Describe the bug
Active-Monitor is crashing intermittently with this error:
fatal error: concurrent map writes
goroutine 105843 [running]:
runtime.throw(0x1633f60, 0x15)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc00056baa8 sp=0xc00056ba78 pc=0x436532
runtime.mapassign_faststr(0x1488fa0, 0xc00048a1b0, 0xc0006e2ee0, 0x1b, 0xc0008362a0)
/usr/local/go/src/runtime/map_faststr.go:291 +0x3d8 fp=0xc00056bb10 sp=0xc00056baa8 pc=0x414538
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule(0xc0002530e0, 0x17e9da0, 0xc000126200, 0x0, 0x0, 0x0, 0x0, 0x17f1d20, 0xc0008362a0, 0xc000681d80, ...)
/workspace/controllers/healthcheck_controller.go:676 +0x12cf fp=0xc00056bf00 sp=0xc00056bb10 pc=0x132f2ef
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1()
/workspace/controllers/healthcheck_controller.go:459 +0x1b5 fp=0xc00056bfe0 sp=0xc00056bf00 pc=0x13394d5
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc00056bfe8 sp=0xc00056bfe0 pc=0x46b941
created by time.goFunc
/usr/local/go/src/time/sleep.go:167 +0x45
goroutine 1 [select, 1658 minutes]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000330a80, 0xc0000fa9c0, 0x0, 0x0)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:433 +0x1dd
main.main()
/workspace/main.go:84 +0x59c
Is your feature request related to a problem? Please describe.
Add:
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
The older Go version produces breaking changes: the CI workflow runs Go 1.18 whereas the repo uses 1.15. This breaks the workflow run because Go 1.18 uses go install instead of go get. Even kubebuilder is breaking, since we should set up the test env using make targets.
Describe the solution you'd like
Update go to 1.18 and use make targets for setting up test env.
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Describe the bug
Rarely, an error will occur if a HealthCheck resource is deleted while it has a corresponding child workflow currently running.
To Reproduce
kubectl create -f examples/inlineHello.yaml
kubectl delete healthcheck inline-hello-abc01 (while the child workflow is still running)
Expected behavior
Regardless of WHEN a healthcheck is deleted, it and any corresponding workflow resources should be successfully cleaned up. The repeating workflow executions should also stop at this time.
Error Condition
2019-08-09T15:55:18.479-0700 ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "health/url-hello-dkkxt", "error": "Operation cannot be fulfilled on healthcheck.activemonitor.orkaproj.io \"url-hello-dkkxt\": StorageError: invalid object, Code: 4, Key: /registry/activemonitor.orkaproj.io/healthcheck/health/url-hello-dkkxt, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: a7c1b9b6-3e8f-4200-b41a-d81f53447ba2, UID in object meta: "}
github.com/go-logr/zapr.(*zapLogger).Error
/Users/dmasselink/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/orkaproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
/Users/dmasselink/go/src/github.com/orkaproj/active-monitor/controllers/healthcheck_controller.go:229
github.com/orkaproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1
/Users/dmasselink/go/src/github.com/orkaproj/active-monitor/controllers/healthcheck_controller.go:147
Version
0.1.0
Is your feature request related to a problem? Please describe.
Currently, the documentation suggests that users either directly build the project OR build a docker image and use that.
Describe the solution you'd like
Update documentation to highlight an installation track which requires nothing more than applying some yaml to a cluster with kubectl.
Hi Team,
I am new to Go and Kubernetes and am exploring Kubernetes monitoring tools. I came across the active-monitor tool. I am facing a few issues while getting started with it; any help in this regard will be highly appreciated. The details are as follows:
Versions:
OS: Linux 5.11.0-25-generic, 20.04.1-Ubuntu
Go: go1.13.8 linux/amd64
Kubectl client: v1.22.0
Kubectl Server: v1.21.2
minikube: v1.22.0
argo: v3.0.10
active-monitor: 0.6.0
Also tried with kubectl client version v1.19.0 and server version v1.20.0, but still got the same warnings and errors.
Issue:
While following step 2 for both types of installation, a warning is raised regarding the CRD versions. The screenshot is attached below:
While running the main.go file, the healthcheck starts but it produces error for some go files. The error screenshot is below:
Please let me know how I can proceed further to run active-monitor.
Is your feature request related to a problem? Please describe.
Currently, the project contains both an examples/ and a sample-workflows/ directory. The directory names and corresponding documentation references don't make it clear why both directories exist or when a new workflow should be added to one rather than the other.
Describe the solution you'd like
Instead, it would likely be best to consolidate all example or sample workflows into a single directory and ensure that all documentation references are up to date. Further, the README should be extended to explain what a contributor wanting to add a new example workflow ought to do and where it should be added.
Is your feature request related to a problem? Please describe.
If a healthcheck's underlying workflow fails, a helpful error message as to why this occurred should be set as the healthcheck's error message in its status object. Currently, this error message is always hard-coded to a message indicating that the workflow couldn't start. This isn't always correct and therefore could be confusing to someone trying to deduce the reasons of a healthcheck/workflow failure.
Describe the solution you'd like
Instead, a relevant and meaningful description of an underlying reason should be set into this property in the status resource.
Describe the bug
Active-Monitor is crashing with the following error:
fatal error: concurrent map read and map write
goroutine 5518 [running]:
runtime.throw(0x2245810, 0x21)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0006e8ff0 sp=0xc0006e8fc0 pc=0x1037c12
runtime.mapaccess2(0x208cf80, 0xc00088ee70, 0xc00082edf0, 0xc00082edf0, 0xc00036f802)
/usr/local/go/src/runtime/map.go:469 +0x25b fp=0xc0006e9030 sp=0xc0006e8ff0 pc=0x101179b
reflect.mapaccess(0x208cf80, 0xc00088ee70, 0xc00082edf0, 0x2238f1d)
/usr/local/go/src/runtime/map.go:1309 +0x3f fp=0xc0006e9068 sp=0xc0006e9030 pc=0x1066aff
reflect.Value.MapIndex(0x208cf80, 0xc00088ee70, 0x15, 0x2037380, 0xc00082edf0, 0x98, 0x21f8c40, 0x208cf80, 0x208cf80)
/usr/local/go/src/reflect/value.go:1188 +0x16e fp=0xc0006e90e0 sp=0xc0006e9068 pc=0x109cbee
encoding/json.mapEncoder.encode(0x22cd1b0, 0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0x2080000)
/usr/local/go/src/encoding/json/encode.go:801 +0x30d fp=0xc0006e9258 sp=0xc0006e90e0 pc=0x111896d
encoding/json.mapEncoder.encode-fm(0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0x2cb0000)
/usr/local/go/src/encoding/json/encode.go:777 +0x65 fp=0xc0006e9298 sp=0xc0006e9258 pc=0x1124e65
encoding/json.(*encodeState).reflectValue(0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0xc0006e0000)
/usr/local/go/src/encoding/json/encode.go:358 +0x82 fp=0xc0006e92d0 sp=0xc0006e9298 pc=0x1115b02
encoding/json.(*encodeState).marshal(0xc000650080, 0x208cf80, 0xc00088ee70, 0x1f50000, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:330 +0xf4 fp=0xc0006e9330 sp=0xc0006e92d0 pc=0x11156f4
encoding/json.(*Encoder).Encode(0xc00061ec80, 0x208cf80, 0xc00088ee70, 0x30, 0x30)
/usr/local/go/src/encoding/json/stream.go:206 +0x8b fp=0xc0006e93c0 sp=0xc0006e9330 pc=0x112294b
go.uber.org/zap/zapcore.(*jsonEncoder).AddReflected(0xc0003d2e10, 0x222b748, 0x8, 0x208cf80, 0xc00088ee70, 0xc0006e94b0, 0x1085ff4)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/json_encoder.go:150 +0x65 fp=0xc0006e9440 sp=0xc0006e93c0 pc=0x1f5f385
go.uber.org/zap/zapcore.Field.AddTo(0x222b748, 0x8, 0x16, 0x0, 0x0, 0x0, 0x208cf80, 0xc00088ee70, 0x240ac20, 0xc0003d2e10)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:159 +0xb16 fp=0xc0006e9518 sp=0xc0006e9440 pc=0x1f5e3d6
go.uber.org/zap/zapcore.addFields(0x240ac20, 0xc0003d2e10, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:199 +0xcf fp=0xc0006e95c0 sp=0xc0006e9518 pc=0x1f5eaaf
go.uber.org/zap/zapcore.consoleEncoder.writeContext(0xc000260000, 0xc00023bc00, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/console_encoder.go:131 +0xcb fp=0xc0006e9660 sp=0xc0006e95c0 pc=0x1f5a88b
go.uber.org/zap/zapcore.consoleEncoder.EncodeEntry(0xc000260000, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/console_encoder.go:110 +0x3df fp=0xc0006e9718 sp=0xc0006e9660 pc=0x1f5a23f
sigs.k8s.io/controller-runtime/pkg/log/zap.(*KubeAwareEncoder).EncodeEntry(0xc000478140, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/zap/kube_helpers.go:126 +0x175 fp=0xc0006e9930 sp=0xc0006e9718 pc=0x1f78c95
go.uber.org/zap/zapcore.(*ioCore).Write(0xc000260060, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:86 +0xa9 fp=0xc0006e9a08 sp=0xc0006e9930 pc=0x1f5b0c9
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00016e6e0, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:215 +0x12d fp=0xc0006e9ba8 sp=0xc0006e9a08 pc=0x1f5cb4d
github.com/go-logr/zapr.(*infoLogger).Info(0xc0004781c8, 0x223ac34, 0x18, 0xc0004c43e0, 0x2, 0x2)
/Users/rhari/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:70 +0xdd fp=0xc0006e9c10 sp=0xc0006e9ba8 pc=0x1f776fd
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).parseWorkflowFromHealthcheck(0xc00073c780, 0x23f72a0, 0xc0004781c0, 0xc000154a00, 0xc0005262a0, 0x0, 0x0)
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:851 +0x5d0 fp=0xc0006e9de0 sp=0xc0006e9c10 pc=0x1f36010
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflow(0xc00073c780, 0x23ef140, 0xc0001341f8, 0x23f72a0, 0xc0004781c0, 0xc000154a00, 0x0, 0xc000b00000, 0xc000001e00, 0x0)
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:471 +0x8c fp=0xc0006e9f00 sp=0xc0006e9de0 pc=0x1f2f30c
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1()
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:456 +0x115 fp=0xc0006e9fe0 sp=0xc0006e9f00 pc=0x1f3b535
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0006e9fe8 sp=0xc0006e9fe0 pc=0x106d0a1
created by time.goFunc
/usr/local/go/src/time/sleep.go:167 +0x45
To Reproduce
Run multiple workflows in parallel.
Expected behavior
Active-Monitor should continue to work without issues.
Version
latest changes.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
#64 - Exponentially reduce Kubernetes API calls.
#65 - Limit number of times the Self-Healing/Remedy should be run
#67 - Enable default PodGC strategy as OnPodCompletion in workflow
#70 - Add Events to Active-Monitor Custom Resources
Is your feature request related to a problem? Please describe.
Having an Enable/Disable flag in HealthCheckSpec will provide the flexibility to stop monitoring on a given cluster. Sometimes we might encounter a problematic situation on a cluster; with this flag in place, we can instruct the controller not to process the health check on that cluster until the problem is addressed.
Describe the solution you'd like
Under the HealthCheckSpec struct, we should add a field called EnableHealthCheck, set to true by default; if it is set to false, the controller reconciler shouldn't process the HealthCheck. This field should be read dynamically.
We will have to handle this under the process workflow method.
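A sketch of the proposed flag semantics (field and function names follow this ticket's proposal but are not shipped API). Treating an unset flag as enabled keeps existing HealthChecks working without modification:

```go
package main

import "fmt"

// HealthCheckSpec shows only the proposed field; the real spec has
// many more fields.
type HealthCheckSpec struct {
	// EnableHealthCheck defaults to true when unset.
	EnableHealthCheck *bool `json:"enableHealthCheck,omitempty"`
}

// shouldProcess is the check the reconciler would perform before
// processing a HealthCheck: nil (unset) counts as enabled.
func shouldProcess(spec HealthCheckSpec) bool {
	return spec.EnableHealthCheck == nil || *spec.EnableHealthCheck
}

func main() {
	off := false
	fmt.Println(shouldProcess(HealthCheckSpec{}), shouldProcess(HealthCheckSpec{EnableHealthCheck: &off}))
}
```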
Have you thought about contributing yourself?
Yes, I would like to work on this solution.
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Bug Fix:
#82 - active-monitor running workflows more frequently than the configuration.
Describe the bug
The active monitor controller ignores the metadata information while submitting the monitoring workflow. The metadata information has the controller instanceID details which are needed for the workflow controller to pick and execute the workflow.
The controller should pick and parse the metadata information and should include it while the workflow is submitted.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
If the workflow controller is started with specific InstanceID details, it will not pick up the active-monitor workflow for execution; the workflow stays in Pending status.
Expected behavior
Start the workflow controller on a specific InstanceID and make the changes in the active monitor controller to parse the metadata information while submitting the workflow.
Have you thought about contributing a fix yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
The Keiko project components are awesome... but it's sometimes hard for people (who aren't deeply familiar with k8s) to understand how awesome they are. However, once the Keiko components have a user interface or dashboard, they will be immediately understandable/accessible to a much wider audience.
The aim of this ticket is to design one or more views for Active-Monitor data in a web interface.
Each component will likely have a similar task to design their respective views.
One possibility is build plug-ins for the Octant UI project, sponsored by VMWare - https://github.com/vmware-tanzu/octant
Is your feature request related to a problem? Please describe.
first discussed in this slack thread: https://intuit-teams.slack.com/archives/GBLA5J9DH/p1579889119031100
Describe the solution you'd like
Controller should expose the start and end times of latest run for each healthcheck as a metric. This would assist in cluster issue debugging/diagnosis since it will be more obvious when work/traffic/etc. is happening due to a healthcheck rather than organic work/traffic.
Currently this can be determined only by looking at the healthcheck status and, even then, only completion times are tracked
Yet to discuss
Are timestamps the best piece of data to track? Or would an "ongoing" boolean and a "lastRunDuration" float be easier to make sense of?
Is your feature request related to a problem? Please describe.
Currently there are no unit tests for the bulk of the logic involved in this project, the controller.
Describe the solution you'd like
Build out tests to get controller coverage > 66%. This may be tricky since it isn't always straightforward to mock out kube-api related interactions.
Describe the bug
A workflow can fail for some reason (such as scheduling) and never run, but the state in the HealthCheck is not changed to reflect this error.
Have you thought about contributing a fix yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
RepeatAfterSec mechanism
Is your feature request related to a problem? Please describe.
Active-Monitor should process custom resources in parallel.
Describe the solution you'd like
Kubebuilder supports a MaxParallel option to run multiple goroutines. We should pass this option into the reconciler to process multiple CRs.
Have you thought about contributing yourself?
Yes, I will add this feature.
Is your feature request related to a problem? Please describe.
The existing example workflows are all quite simple; they don't represent a real-world workflow well.
Describe the solution you'd like
There should be a new workflow example in the examples/
directory which carries out the following steps:
At this point, no record of the deployment nor namespace should exist
Based on the message at: https://travis-ci.org/
Please be aware travis-ci.org will be shutting down in several weeks, with all accounts migrating to travis-ci.com. Please stay tuned here for more information.
We need to migrate the CI pipeline to GitHub Actions.
Currently, active-monitor repeats the workflow submissions based on the repeatAfterSec spec parameter.
However, it would be more flexible if we also allowed users to specify how often they want the workflow to run using a cron-like expression.
@pzou1974 had the great suggestion to look at this library to help with parsing cron expressions: https://godoc.org/gopkg.in/robfig/cron.v2
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
As the use cases for Monitor/Remedy grow, the number of pods created grows as well. If the TTL for pod deletion is not aggressive, there can be many pods in Completed state which still consume resources such as IPs (in EKS).
Describe the solution you'd like
Enable default PodGC strategy as OnPodCompletion in Active-Monitor workflow. https://argoproj.github.io/argo/fields/#podgc.
This will help clean up pods immediately after execution. The status of the pod execution is updated in the Argo workflow, and since the Active-Monitor controller reads the status from the Argo workflow, we do not need the pod itself once it has executed. This will save resources.
Have you thought about contributing yourself?
Yes.