keikoproj / active-monitor
Provides deep monitoring and self-healing of Kubernetes clusters
License: Apache License 2.0
Is your feature request related to a problem? Please describe.
Currently, links to Slack in the README and issue templates are either broken or actually point to the argoproj workspace.
Describe the solution you'd like
Once a new Slack workspace is set up for orkaproj, update the assets in this repo to match.
Is your feature request related to a problem? Please describe.
Active-Monitor custom resources should have events displayed; events convey the state of each CR and its progress.
Describe the solution you'd like
We need to add events to custom resources for Active-Monitor
Have you thought about contributing yourself?
Yes
Is your feature request related to a problem? Please describe.
In Active-Monitor, the status of the workflow is polled every second, which results in too many Kubernetes API calls. As the number of monitors or self-healing use cases increases, this can result in additional cost in managed Kubernetes clusters; in non-managed clusters, the master nodes might have to be scaled up to accommodate these API calls.
Describe the solution you'd like
This can be solved by leveraging the https://github.com/keikoproj/inverse-exp-backoff library, which provides an API for inverse exponential backoff with a timeout. This will allow us to reduce the API calls made to the Kubernetes API server exponentially.
Have you thought about contributing yourself?
Yes I will implement it.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
Kubernetes 1.22 introduces breaking changes, so the deployment resources need to be updated for compatibility, such that users can run active-monitor on the latest Kubernetes versions.
Describe the bug
When a serviceAccount is passed in the HealthCheck spec, the same account should be applied to the Argo workflows which the controller creates. In this way, it is passed down to the running pod, where the account will actually be used.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
kubectl create -f <filename>
argo get <workflowName>
The default account was applied regardless of the configured serviceAccount.
Expected behavior
Configured serviceAccount should be set on the workflow
Version
all
Don't begin this work until we can confirm we want to make this change. This will likely require a design decision, the outcome of which should be attached to this ticket as a comment.
Is your feature request related to a problem? Please describe.
After this project was re-worked using kubebuilder 2.0, a prometheus style metrics server was exposed by default. This server provides data such as underlying golang cpu/memory/networking/gc details for the Active-Monitor controller. http://0.0.0.0:8080/metrics
Our application also exposes its own metrics server running on a separate port and end-pt. This server is meant to communicate details of healthcheck operation and to be consumed by entities which may need to take a remediating action. http://0.0.0.0:2112/metrics
Describe the solution you'd like
An open question is: should these 2 servers which are doing pretty much the same thing (though with respect to different data sets) be combined some way?
Should there be just a single port/path combo where ALL metrics (whether built-in or health check oriented) could be exposed?
If not a single port/path combo, how about the same port but one path for healthcheck metrics and another for internal metrics? e.g., 0.0.0.0:8080/metrics and 0.0.0.0:8080/internal, or similar.
kubectl describe healthcheck does not get expected health status
Steps to reproduce
kubectl apply -f https://raw.githubusercontent.com/orkaproj/active-monitor/master/deploy/deploy-argo.yaml
#step 6
kubectl describe healthcheck inline-hello-fbgj9 -n health
The result does not match what the README suggests:
...
Status:
  Failed Count:             0
  Finished At:              <timestamp>
  Last Successful Workflow: inline-hello-fbgj9
  Status:                   Succeeded
  Success Count:
Events:
Is your feature request related to a problem? Please describe.
Argo Workflows 3.x no longer supports ttlSecondsAfterFinished.
It needs to be replaced with ttlStrategy: secondsAfterCompletion
Describe the solution you'd like
Update the default ttlStrategy here:
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Bug Fix:
#88 - Active Monitor crashing with concurrent map updates
Currently, the success and failure counts associated with each healthcheck will start at 0 when first registered and monotonically increase over the life of the healthcheck (or active-monitor controller).
This works all right; however, one downside of this approach is that you need to know how long the healthcheck has been running in order to have any context for whether the counts are large or small.
Therefore, the aim of this ticket is to provide:
Open questions:
Is your feature request related to a problem? Please describe.
HealthCheck Custom Resource should have an ability to limit the number of times the remedy/self-healing should be run in a given interval of time.
Describe the solution you'd like
HealthCheck Custom Resource should provide parameters to limit the number of times the remedy should be run in a given interval. This would help avoid a continuous loop of running the health check and remedy in cases where the given remedy action does not work.
Have you thought about contributing yourself?
Yes. I will implement it.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Currently, Healthcheck custom resource and its child argo workflow can "detect" if there is a problem.
However, there is no place to express what action to take in case of a problem.
Therefore, healthcheck spec should support an alternative argo workflow for "remediation". If the main workflow fails, the "remediation" workflow should be run. Also, additional remediation metrics should be captured and exposed accordingly.
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
The project doesn't yet have a defined and documented release process. This task's aim is to document that in the README and get a 1.0.0 release pushed to dockerhub.
Is your feature request related to a problem? Please describe.
Update Argo Controller version to use latest features.
Describe the solution you'd like
Use latest Argo Controller version to leverage latest features.
Have you thought about contributing yourself?
Yes
Is your feature request related to a problem? Please describe
The problem is that the latest status as well as the success/fail counts related to a healthcheck can only be seen when it is kubectl describe'd.
Describe the solution you'd like
Instead, at least three columns should be added to the custom printer for healthcheck objects. Those are: status, success count, failure count.
This should be straightforward to accomplish using kubebuilder annotations regarding additional-printer-columns
Current example:
NAME AGE
foo 1h
bar 1h
Target example:
NAME LATEST STATUS SUCCESS CNT FAIL CNT AGE
foo Succeeded 4 0 1h
bar Failed 1 3 1h
Related Reading
https://book.kubebuilder.io/reference/generating-crd.html#additional-printer-columns
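For illustration, a sketch of how the kubebuilder printcolumn markers might look on the API type. The JSONPath values below are assumptions based on this ticket's target output and must be checked against the HealthCheck CRD's actual status schema:

```go
package main

import "fmt"

// HealthCheckStatus mirrors (a subset of) the status fields the
// printer columns would surface. Field names are assumptions.
type HealthCheckStatus struct {
	Status       string `json:"status"`
	SuccessCount int    `json:"successCount"`
	FailedCount  int    `json:"failedCount"`
}

// In the real API types file, markers like these above the HealthCheck
// struct tell kubebuilder to generate additionalPrinterColumns:
//
// +kubebuilder:printcolumn:name="LATEST STATUS",type="string",JSONPath=".status.status"
// +kubebuilder:printcolumn:name="SUCCESS CNT",type="integer",JSONPath=".status.successCount"
// +kubebuilder:printcolumn:name="FAIL CNT",type="integer",JSONPath=".status.failedCount"
// +kubebuilder:printcolumn:name="AGE",type="date",JSONPath=".metadata.creationTimestamp"
type HealthCheck struct {
	Status HealthCheckStatus `json:"status"`
}

func main() {
	hc := HealthCheck{Status: HealthCheckStatus{Status: "Succeeded", SuccessCount: 4}}
	fmt.Println(hc.Status.Status, hc.Status.SuccessCount, hc.Status.FailedCount)
}
```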
Is your feature request related to a problem? Please describe.
After this project was converted to kubebuilder style, custom metrics haven't been tested and confirmed to work.
Describe the solution you'd like
The aim of this ticket is to confirm this and update README accordingly
The concept of a health check succeeding or failing is related to the final return value from the nested/imported Argo workflow.
This isn't always incredibly obvious and can lead to scenarios where the workflow doesn't behave as expected yet is still marked as succeeded. Similarly, even if the workflow behaves as expected, it may indicate a failure if a non-0 return code is used.
README documentation should be improved to highlight this and provide users with patterns/strategies to ensure that healthchecks are behaving as expected and building confidence in the usage of Active-Monitor.
Is your feature request related to a problem? Please describe.
[2022-03-11T20:08:20.710Z] [2022-03-11T20:08:20Z] INFO - skipping instancegroup since addon active-monitor is missing file instancegroup.yaml
[2022-03-11T20:08:21.274Z] error: error validating "/tmp/kubectl_manifest775841473": error validating data: [ValidationError(CustomResourceDefinition.spec): unknown field "additionalPrinterColumns" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "subresources" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "validation" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "version" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec]; if you choose to ignore these errors, turn validation off with --validate=false
[2022-03-11T20:08:21.274Z] [2022-03-11T20:08:21Z] WARN - cmd.Run() failed for cmd kubectl apply --filename /tmp/kubectl_manifest775841473 --context arktika-bdd-data-usw2 with exit status 1
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
This project had previously supported healthchecks being scoped either to the cluster or to a namespace.
Describe the solution you'd like
This feature needs to be confirmed to still work. It may also require some design work to ensure it's still a necessary feature.
Describe the bug
Active-Monitor workflows run continuously if there are errors updating the custom resource (e.g., a storage error or the API server being busy).
The timers are then not stopped, causing leaks and a number of workflow pods being created.
Expected behavior
If the CR update fails, the timers should be stopped and the request requeued.
Logs
2021-02-22T13:57:55.825Z ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "monitoring/dns-healthcheck", "error": "Operation cannot be fulfilled on healthchecks.activemonitor.keikoproj.io \"dns-healthcheck\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
/workspace/controllers/healthcheck_controller.go:525
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1
/workspace/controllers/healthcheck_controller.go:391
2021-02-22T14:58:59.848Z ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "monitoring/dns-healthcheck", "error": "Operation cannot be fulfilled on healthchecks.activemonitor.keikoproj.io \"dns-healthcheck\": StorageError: invalid object, Code: 4, Key: /registry/activemonitor.keikoproj.io/healthchecks/monitoring/dns-healthcheck, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: xxxx, UID in object meta: "}
Describe the bug
Currently the Workflow's resource.serviceAccount attribute is required. However, it isn't actually necessary for all healthchecks/workflows. So, this attribute should be re-configured to be optional.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
Create a healthcheck omitting the resource.serviceAccount attribute.
https://aws.amazon.com/blogs/compute/preview-vcpu-based-instance-limits/
AWS has recently modified how it handles limits/quotas. Rather than limiting the number of EC2 instances or other direct resources, limits are now defined based on vCPU quotas.
This means that healthcheck(s) using AWS cli/APIs to determine usage vs. limits, may benefit from an update in order to use newer quota mechanisms which will be supported further into the future.
Is your feature request related to a problem? Please describe
Though it's generally rare in testing so far, there are cases in which the active-monitor-controller can crash and be forced to restart. Unless the condition is noticed and logged in some way, we may be losing information about the cause of the crash.
Describe the solution you'd like
We should follow the panic/recover pattern found in other, similar projects. ex: https://github.com/keikoproj/addon-manager/pull/28/files
Related Reading
https://blog.golang.org/defer-panic-and-recover
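A minimal sketch of the defer/panic/recover pattern described above; in the controller, the recovered value would go to the structured logger rather than a return value, and the wrapper name is illustrative:

```go
package main

import "fmt"

// safeRun executes fn, recovering from any panic and reporting it
// instead of letting it crash the whole controller process.
func safeRun(fn func()) (panicked bool, msg string) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			msg = fmt.Sprint(r)
		}
	}()
	fn()
	return
}

func main() {
	p, m := safeRun(func() { panic("boom") })
	fmt.Println("recovered:", p, "message:", m)
}
```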
Describe the bug
Active-Monitor is crashing intermittently with this error:
fatal error: concurrent map writes
goroutine 105843 [running]:
runtime.throw(0x1633f60, 0x15)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc00056baa8 sp=0xc00056ba78 pc=0x436532
runtime.mapassign_faststr(0x1488fa0, 0xc00048a1b0, 0xc0006e2ee0, 0x1b, 0xc0008362a0)
/usr/local/go/src/runtime/map_faststr.go:291 +0x3d8 fp=0xc00056bb10 sp=0xc00056baa8 pc=0x414538
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule(0xc0002530e0, 0x17e9da0, 0xc000126200, 0x0, 0x0, 0x0, 0x0, 0x17f1d20, 0xc0008362a0, 0xc000681d80, ...)
/workspace/controllers/healthcheck_controller.go:676 +0x12cf fp=0xc00056bf00 sp=0xc00056bb10 pc=0x132f2ef
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1()
/workspace/controllers/healthcheck_controller.go:459 +0x1b5 fp=0xc00056bfe0 sp=0xc00056bf00 pc=0x13394d5
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc00056bfe8 sp=0xc00056bfe0 pc=0x46b941
created by time.goFunc
/usr/local/go/src/time/sleep.go:167 +0x45
goroutine 1 [select, 1658 minutes]:
sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).Start(0xc000330a80, 0xc0000fa9c0, 0x0, 0x0)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:433 +0x1dd
main.main()
/workspace/main.go:84 +0x59c
Is your feature request related to a problem? Please describe.
Add:
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
The older Go version produces breaking changes: the CI workflow runs Go 1.18 whereas the repo uses 1.15. This breaks the workflow run because Go 1.18 uses go install instead of go get. Even kubebuilder is breaking, since we should set up the test env using make targets.
Describe the solution you'd like
Update go to 1.18 and use make targets for setting up test env.
Have you thought about contributing yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Describe the bug
Rarely, an error will occur if a HealthCheck resource is deleted while it has a corresponding child workflow currently running.
To Reproduce
kubectl create -f examples/inlineHello.yaml
kubectl delete healthcheck inline-hello-abc01 (while the child workflow is still running)
Expected behavior
Regardless of WHEN a healthcheck is deleted, it and any corresponding workflow resources should be successfully cleaned up. The repeating workflow executions should also stop at this time.
Error Condition
2019-08-09T15:55:18.479-0700 ERROR controllers.HealthCheck Error updating healthcheck resource {"HealthCheck": "health/url-hello-dkkxt", "error": "Operation cannot be fulfilled on healthcheck.activemonitor.orkaproj.io \"url-hello-dkkxt\": StorageError: invalid object, Code: 4, Key: /registry/activemonitor.orkaproj.io/healthcheck/health/url-hello-dkkxt, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: a7c1b9b6-3e8f-4200-b41a-d81f53447ba2, UID in object meta: "}
github.com/go-logr/zapr.(*zapLogger).Error
/Users/dmasselink/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
github.com/orkaproj/active-monitor/controllers.(*HealthCheckReconciler).watchWorkflowReschedule
/Users/dmasselink/go/src/github.com/orkaproj/active-monitor/controllers/healthcheck_controller.go:229
github.com/orkaproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1
/Users/dmasselink/go/src/github.com/orkaproj/active-monitor/controllers/healthcheck_controller.go:147
Version
0.1.0
Is your feature request related to a problem? Please describe.
Currently, the documentation suggests that users either directly build the project OR build a docker image and use that.
Describe the solution you'd like
Update documentation to highlight an installation track which requires nothing more than applying some yaml to a cluster with kubectl.
Hi Team,
I am new to Go and Kubernetes and am exploring Kubernetes monitoring tools. I came across the active-monitor tool. I am facing a few issues while getting started with it; any help in this regard will be highly appreciated. The details are as follows:
Versions:
OS: Linux 5.11.0-25-generic, 20.04.1-Ubuntu
Go: go1.13.8 linux/amd64
Kubectl client: v1.22.0
Kubectl Server: v1.21.2
minikube: v1.22.0
argo: v3.0.10
active-monitor: 0.6.0
Also tried with kubectl client version v1.19.0 and server version v1.20.0, but still got the same warnings and errors.
Issue:
While following step 2 for both types of installation, a warning is raised regarding the CRD versions. The screenshot is attached below:
While running the main.go file, the healthcheck starts but it produces error for some go files. The error screenshot is below:
Please let me know how I can proceed further to run active-monitor.
Is your feature request related to a problem? Please describe.
Currently, the project contains both an examples/ and a sample-workflows/ directory. The directory names and corresponding documentation references don't make it clear why both directories exist or when a new workflow should be added to one rather than the other.
Describe the solution you'd like
Instead, it would likely be best to consolidate all example or sample workflows into a single directory and ensure that all documentation references are up to date. Further, the README should be extended to explain what a contributor wanting to add a new example workflow ought to do and where it should be added.
Is your feature request related to a problem? Please describe.
If a healthcheck's underlying workflow fails, a helpful error message as to why this occurred should be set as the healthcheck's error message in its status object. Currently, this error message is always hard-coded to a message indicating that the workflow couldn't start. This isn't always correct and therefore could be confusing to someone trying to deduce the reasons of a healthcheck/workflow failure.
Describe the solution you'd like
Instead, a relevant and meaningful description of an underlying reason should be set into this property in the status resource.
Describe the bug
Active-Monitor is crashing with the following error:
fatal error: concurrent map read and map write
goroutine 5518 [running]:
runtime.throw(0x2245810, 0x21)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0006e8ff0 sp=0xc0006e8fc0 pc=0x1037c12
runtime.mapaccess2(0x208cf80, 0xc00088ee70, 0xc00082edf0, 0xc00082edf0, 0xc00036f802)
/usr/local/go/src/runtime/map.go:469 +0x25b fp=0xc0006e9030 sp=0xc0006e8ff0 pc=0x101179b
reflect.mapaccess(0x208cf80, 0xc00088ee70, 0xc00082edf0, 0x2238f1d)
/usr/local/go/src/runtime/map.go:1309 +0x3f fp=0xc0006e9068 sp=0xc0006e9030 pc=0x1066aff
reflect.Value.MapIndex(0x208cf80, 0xc00088ee70, 0x15, 0x2037380, 0xc00082edf0, 0x98, 0x21f8c40, 0x208cf80, 0x208cf80)
/usr/local/go/src/reflect/value.go:1188 +0x16e fp=0xc0006e90e0 sp=0xc0006e9068 pc=0x109cbee
encoding/json.mapEncoder.encode(0x22cd1b0, 0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0x2080000)
/usr/local/go/src/encoding/json/encode.go:801 +0x30d fp=0xc0006e9258 sp=0xc0006e90e0 pc=0x111896d
encoding/json.mapEncoder.encode-fm(0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0x2cb0000)
/usr/local/go/src/encoding/json/encode.go:777 +0x65 fp=0xc0006e9298 sp=0xc0006e9258 pc=0x1124e65
encoding/json.(*encodeState).reflectValue(0xc000650080, 0x208cf80, 0xc00088ee70, 0x15, 0xc0006e0000)
/usr/local/go/src/encoding/json/encode.go:358 +0x82 fp=0xc0006e92d0 sp=0xc0006e9298 pc=0x1115b02
encoding/json.(*encodeState).marshal(0xc000650080, 0x208cf80, 0xc00088ee70, 0x1f50000, 0x0, 0x0)
/usr/local/go/src/encoding/json/encode.go:330 +0xf4 fp=0xc0006e9330 sp=0xc0006e92d0 pc=0x11156f4
encoding/json.(*Encoder).Encode(0xc00061ec80, 0x208cf80, 0xc00088ee70, 0x30, 0x30)
/usr/local/go/src/encoding/json/stream.go:206 +0x8b fp=0xc0006e93c0 sp=0xc0006e9330 pc=0x112294b
go.uber.org/zap/zapcore.(*jsonEncoder).AddReflected(0xc0003d2e10, 0x222b748, 0x8, 0x208cf80, 0xc00088ee70, 0xc0006e94b0, 0x1085ff4)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/json_encoder.go:150 +0x65 fp=0xc0006e9440 sp=0xc0006e93c0 pc=0x1f5f385
go.uber.org/zap/zapcore.Field.AddTo(0x222b748, 0x8, 0x16, 0x0, 0x0, 0x0, 0x208cf80, 0xc00088ee70, 0x240ac20, 0xc0003d2e10)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:159 +0xb16 fp=0xc0006e9518 sp=0xc0006e9440 pc=0x1f5e3d6
go.uber.org/zap/zapcore.addFields(0x240ac20, 0xc0003d2e10, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/field.go:199 +0xcf fp=0xc0006e95c0 sp=0xc0006e9518 pc=0x1f5eaaf
go.uber.org/zap/zapcore.consoleEncoder.writeContext(0xc000260000, 0xc00023bc00, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/console_encoder.go:131 +0xcb fp=0xc0006e9660 sp=0xc0006e95c0 pc=0x1f5a88b
go.uber.org/zap/zapcore.consoleEncoder.EncodeEntry(0xc000260000, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/console_encoder.go:110 +0x3df fp=0xc0006e9718 sp=0xc0006e9660 pc=0x1f5a23f
sigs.k8s.io/controller-runtime/pkg/log/zap.(*KubeAwareEncoder).EncodeEntry(0xc000478140, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/zap/kube_helpers.go:126 +0x175 fp=0xc0006e9930 sp=0xc0006e9718 pc=0x1f78c95
go.uber.org/zap/zapcore.(*ioCore).Write(0xc000260060, 0x0, 0xc027abae3294edc8, 0x5acbf2bbb7f, 0x2cb8ce0, 0xc000041680, 0x17, 0x223ac34, 0x18, 0x0, ...)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/core.go:86 +0xa9 fp=0xc0006e9a08 sp=0xc0006e9930 pc=0x1f5b0c9
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00016e6e0, 0xc00084c800, 0x1, 0x1)
/Users/rhari/go/pkg/mod/go.uber.org/[email protected]/zapcore/entry.go:215 +0x12d fp=0xc0006e9ba8 sp=0xc0006e9a08 pc=0x1f5cb4d
github.com/go-logr/zapr.(*infoLogger).Info(0xc0004781c8, 0x223ac34, 0x18, 0xc0004c43e0, 0x2, 0x2)
/Users/rhari/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:70 +0xdd fp=0xc0006e9c10 sp=0xc0006e9ba8 pc=0x1f776fd
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).parseWorkflowFromHealthcheck(0xc00073c780, 0x23f72a0, 0xc0004781c0, 0xc000154a00, 0xc0005262a0, 0x0, 0x0)
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:851 +0x5d0 fp=0xc0006e9de0 sp=0xc0006e9c10 pc=0x1f36010
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflow(0xc00073c780, 0x23ef140, 0xc0001341f8, 0x23f72a0, 0xc0004781c0, 0xc000154a00, 0x0, 0xc000b00000, 0xc000001e00, 0x0)
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:471 +0x8c fp=0xc0006e9f00 sp=0xc0006e9de0 pc=0x1f2f30c
github.com/keikoproj/active-monitor/controllers.(*HealthCheckReconciler).createSubmitWorkflowHelper.func1()
/Users/rhari/go/src/github.com/keikoproj/active-monitor/controllers/healthcheck_controller.go:456 +0x115 fp=0xc0006e9fe0 sp=0xc0006e9f00 pc=0x1f3b535
runtime.goexit()
/usr/local/go/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc0006e9fe8 sp=0xc0006e9fe0 pc=0x106d0a1
created by time.goFunc
/usr/local/go/src/time/sleep.go:167 +0x45
To Reproduce
Run multiple workflows in parallel.
Expected behavior
Active-Monitor should continue to work without issues.
Version
latest changes.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
#64 - Exponentially reduce Kubernetes API calls.
#65 - Limit number of times the Self-Healing/Remedy should be run
#67 - Enable default PodGC strategy as OnPodCompletion in workflow
#70 - Add Events to Active-Monitor Custom Resources
Is your feature request related to a problem? Please describe.
Having an Enable/Disable flag in HealthCheckSpec will provide the flexibility to stop monitoring on a given cluster. Sometimes we might encounter a problematic situation on a cluster; with this flag in place, we can instruct the controller not to process the health check on that cluster until the problem is addressed.
Describe the solution you'd like
Under the HealthCheckSpec struct, we should add a field called EnableHealthCheck, set to true by default; if it is set to false, the controller reconciler shouldn't process the HealthCheck. This field should be read dynamically.
We will have to handle this under the process workflow method.
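A sketch of the proposed flag semantics (field and function names follow this ticket's proposal but are not shipped API). Treating an unset flag as enabled keeps existing HealthChecks working without modification:

```go
package main

import "fmt"

// HealthCheckSpec shows only the proposed field; the real spec has
// many more fields.
type HealthCheckSpec struct {
	// EnableHealthCheck defaults to true when unset.
	EnableHealthCheck *bool `json:"enableHealthCheck,omitempty"`
}

// shouldProcess is the check the reconciler would perform before
// processing a HealthCheck: nil (unset) counts as enabled.
func shouldProcess(spec HealthCheckSpec) bool {
	return spec.EnableHealthCheck == nil || *spec.EnableHealthCheck
}

func main() {
	off := false
	fmt.Println(shouldProcess(HealthCheckSpec{}), shouldProcess(HealthCheckSpec{EnableHealthCheck: &off}))
}
```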
Have you thought about contributing yourself?
Yes, I would like to work on this solution.
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
Bug Fix:
#82 - active-monitor running workflows more frequently than the configuration.
Describe the bug
The active monitor controller ignores the metadata information while submitting the monitoring workflow. The metadata information has the controller instanceID details which are needed for the workflow controller to pick and execute the workflow.
The controller should pick and parse the metadata information and should include it while the workflow is submitted.
To Reproduce
If we cannot reproduce, we cannot fix! Steps to reproduce the behavior:
If the workflow controller is started with specific InstanceID details, it will not pick up the active-monitor workflow for execution; the workflow stays in Pending status.
Expected behavior
Start the workflow controller on a specific InstanceID and make the changes in the active monitor controller to parse the metadata information while submitting the workflow.
Have you thought about contributing a fix yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
The Keiko project components are awesome... but it's sometimes hard for people (who aren't deeply familiar with k8s) to understand how awesome they are. However, once the Keiko components have a user interface or dashboard, they will be immediately understandable/accessible to a much wider audience.
The aim of this ticket is to design one or more views for Active-Monitor data in a web interface.
Each component will likely have a similar task to design their respective views.
One possibility is build plug-ins for the Octant UI project, sponsored by VMWare - https://github.com/vmware-tanzu/octant
Is your feature request related to a problem? Please describe.
first discussed in this slack thread: https://intuit-teams.slack.com/archives/GBLA5J9DH/p1579889119031100
Describe the solution you'd like
Controller should expose the start and end times of latest run for each healthcheck as a metric. This would assist in cluster issue debugging/diagnosis since it will be more obvious when work/traffic/etc. is happening due to a healthcheck rather than organic work/traffic.
Currently this can be determined only by looking at the healthcheck status and, even then, only completion times are tracked
Yet to discuss
Are timestamps the best piece of data to track? Or would an "ongoing" boolean and a "lastRunDuration" float be easier to make sense of?
Is your feature request related to a problem? Please describe.
Currently there are no unit tests for the bulk of the logic involved in this project, the controller.
Describe the solution you'd like
Build out tests to get controller coverage > 66%. This may be tricky since it isn't always straightforward to mock out kube-api related interactions.
Describe the bug
A workflow can fail for some reason (such as scheduling) and never run, but the state in the HealthCheck is not changed to reflect this error.
Have you thought about contributing a fix yourself?
Open Source software thrives with your contribution. It not only gives skills you might not be able to get in your day job, it also looks amazing on your resume.
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Release issue, predominantly for visibility purposes.
DRAFT CHANGELOG:
RepeatAfterSec mechanism
Is your feature request related to a problem? Please describe.
Active-Monitor should process custom resources in parallel.
Describe the solution you'd like
Kubebuilder supports a MaxParallel option to run multiple goroutines. We should pass this option into the reconciler to process multiple CRs.
Have you thought about contributing yourself?
Yes, I will add this feature.
Is your feature request related to a problem? Please describe.
The existing example workflows are all quite simple; they don't represent a real-world workflow well.
Describe the solution you'd like
There should be a new workflow example in the examples/
directory which carries out the following steps:
At this point, no record of the deployment nor namespace should exist
Based on the message at: https://travis-ci.org/
Please be aware travis-ci.org will be shutting down in several weeks, with all accounts migrating to travis-ci.com. Please stay tuned here for more information.
We need to migrate the CI pipeline to GitHub Actions.
Currently, active-monitor repeats the workflow submissions based on the repeatAfterSec spec parameter.
However, it would be more flexible if we also allowed users to specify how often they want the workflow to run using a cron-like expression.
@pzou1974 had the great suggestion to look at this library to help with parsing cron expressions: https://godoc.org/gopkg.in/robfig/cron.v2
If you want to get involved, check out the
contributing guide, then reach out to us on Slack so we can see how to get you started.
Is your feature request related to a problem? Please describe.
As the use cases for Monitor/Remedy grow, the number of pods created grows as well. If the TTL for pod deletion is not aggressive, there can be many pods in Completed state which still consume resources such as IPs (in EKS).
Describe the solution you'd like
Enable default PodGC strategy as OnPodCompletion in Active-Monitor workflow. https://argoproj.github.io/argo/fields/#podgc.
This will help clean up pods immediately after execution. The status of the pod execution is updated in the Argo workflow, and since the Active-Monitor controller reads the status from the Argo workflow, we do not need the pod itself once it has executed. This will save resources.
Have you thought about contributing yourself?
Yes.