
opni's Introduction

Multi Cluster Observability with AIOps


Observability data comes in the form of logs, metrics, and traces. Observability backends and agents handle the collection and storage of this data, while AIOps helps make sense of it. Opni ships with all of these nuts and bolts and can be used to self-monitor a single cluster or to serve as a centralized observability data sink for multiple clusters.

You can easily create the following with Opni:

  • Backends

    • Opni Logging - extends OpenSearch to make it easy to search, visualize, and analyze logs, traces, and Kubernetes events
    • Opni Monitoring - extends Cortex to enable multi-cluster, long-term storage for Prometheus metrics
  • Opni Agent

    • Collects logs, Kubernetes events, OpenTelemetry traces and Prometheus metrics with the click of a button
  • AIOps

  • Alerting and SLOs

Check out the docs page to get started!



License

Copyright (c) 2020-2022 SUSE, LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

opni's People

Contributors

alexandrelamarre, amartc, amsuggs37, codyrancher, dbason, dependabot[bot], jaehnri, jameson-mcghee, joshmeranda, jvanz, kralicky, pjbgf, sanjay920, tybalex


opni's Issues

Add logic to DataPrepper to parse out filename

Currently, within control plane logs, Data Prepper does not do any processing of the filename or any other attribute obtained from a log message. Logic should be added to extract the filename from log messages of this format.

Downstream Service Discovery API

We want to detect downstream services to monitor with SLO/SLI rules.

They should show up as a list in the SLO input when creating SLOs.

When the Nulog model crashes during training, the GPU is blocked from being used for inferencing.

An issue was observed where, if the Nulog model crashes during training, it never sends a signal to the GPU service that it has completed its job, so the GPU is indefinitely blocked from being used for inferencing. This was seen in cases where the DRAIN service provided a normal interval whose starting and ending timestamps were exactly the same; no logs were retrieved, and the Nulog model subsequently crashed.

Unable to ship Rancher logs to Opni

Hello,

We are trying Opni in a Rancher 2.6.2/RKE1 environment.
The Opni operator and the Opni cluster are deployed using this procedure.

We have configured the log shipping using this procedure, but there is no index in OpenSearch.

We also see that the type of the cluster output should be ElasticSearch, but in the UI we see Unknown.

The API used for Opni is logging.opni.io. We configured the cluster output with the logging.opni.io API and created a second one with the logging.banzaicloud.io API, but both of them are "Unknown".

I attached the log of the pods opni-svc-payload-receiver, opni-controller-manager-manager, opni-controller-manager-kube-rbac-proxy and the rancher-logging-fluentd.

Thank you for your help.

opni-controller-manager-manager.log
rancher-logging-fluentd.log
opni-controller-manager-kube-rbac-proxy.log
payload-receiver.log

Develop service for Opensearch Updating

Currently, the preprocessing, DRAIN, and inferencing services all update OpenSearch, which can become very expensive and inefficient over time. Thus, a single service dedicated to updating OpenSearch is being designed and developed right now.

/amazon/opensearch-dashboards:1.1.0 not found: manifest unknown: manifest unknown

I am trying to deploy Opni in a private Kubernetes environment behind a proxy. I was able to apply all the YAML files with no problem, but while trying to deploy the elastic image I got the following message: /amazon/opensearch-dashboards:1.1.0 not found: manifest unknown: manifest unknown

As I did for all the other images, I added the proxy before the image link, but it didn't work.

My question is: can we change the source of the elastic image, and how can we do this?
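As a workaround, if the image reference lives in a Deployment, you can point it at an internal mirror. This is only a sketch: the namespace, deployment name, and mirror host below are assumptions and must be adjusted for your cluster.

```shell
# Hypothetical example: override the dashboards image so it pulls from an
# internal mirror instead of the default registry.
kubectl -n opni get deployments   # find the deployment that runs the dashboards image
kubectl -n opni set image deployment/opni-es-kibana \
  opni-es-kibana=my-mirror.internal/amazon/opensearch-dashboards:1.1.0
```

If the image is instead managed by the operator's custom resource, the override belongs in that resource rather than the Deployment, which the operator would otherwise revert.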

preprocessing-service crashing

I installed Opni on my EKS cluster. After a few minutes, preprocessing-service starts crashing. I can see it is receiving a 400 from opendistro-es-client-service and is not able to handle that.

2021-08-11 05:58:41,927 - INFO - POST https://opendistro-es-client-service.opni-demo.svc.cluster.local:9200/_bulk [status:200 request:0.180s]
2021-08-11 05:58:42,538 - WARNING - POST https://opendistro-es-client-service.opni-demo.svc.cluster.local:9200/_bulk [status:400 request:0.035s]
Traceback (most recent call last):
  File "./preprocess.py", line 183, in <module>
    loop.run_until_complete(
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "./preprocess.py", line 158, in mask_logs
    async for ok, result in async_streaming_bulk(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 177, in async_streaming_bulk
    async for data, (ok, info) in azip(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 107, in azip
    yield tuple([await x.__anext__() for x in aiters])
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 107, in <listcomp>
    yield tuple([await x.__anext__() for x in aiters])
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 82, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 193, in _process_bulk_chunk_error
    raise error
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 70, in _process_bulk_chunk
    resp = await client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/client/__init__.py", line 457, in bulk
    return await self.transport.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/transport.py", line 329, in perform_request
    raise e
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/transport.py", line 296, in perform_request
    status, headers, data = await connection.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/http_aiohttp.py", line 329, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/base.py", line 322, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'For input string: ""')

I checked opendistro-es-client-service pod but no relevant logs are available there.

[2021-08-11T05:59:37,214][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,046][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,046][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,785][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,786][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T06:00:33,158][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T06:01:33,185][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default

Failed to install with namespaces "opni-system" not found error

Failed to install:

INFO[0000] Starting installer
W0519 20:22:19.350824   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0519 20:22:19.556324   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0000] Creating CRD helmcharts.helm.cattle.io
W0519 20:22:19.655385   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0000] Creating CRD helmchartconfigs.helm.cattle.io
W0519 20:22:19.741663   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0519 20:22:19.873611   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0001] Deploying infrastructure resources
W0519 20:22:20.519758   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.603145   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.694389   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.841523   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0519 20:22:20.945856   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0519 20:22:21.046222   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
FATA[0003] failed to create opni-system/infra-stack /v1, Kind=ConfigMap for infra-stack opni-system/infra-stack: namespaces "opni-system" not found
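The fatal error indicates the installer tried to create resources in a namespace that does not exist yet. A minimal workaround (a sketch, assuming nothing else in the install failed) is to create the namespace manually before re-running the installer:

```shell
# Create the namespace the installer expects, then re-run the install.
kubectl create namespace opni-system
kubectl get namespace opni-system   # should report STATUS Active
```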

Nulog model not always training when training signal sent.

A new Nulog model should be trained any time the GPU service sends a signal over NATS. However, it has been observed that even when that signal is sent and there is new training data, the model sometimes is not trained.

How to use LogAdapter on a custom Kubernetes cluster

Currently I set up the Kubernetes cluster myself (kubeadm init) without using RKE/RKE2/AKS/EKS, so I'm not sure how to use LogAdapter, as the provider should be one of rke/rke2/aks/eks. Could you give some suggestions?

Look into problem concerning Elasticsearch memory issue.

When running Opni for over 30 days on an EKS cluster, we suddenly came across this error:

TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [516681156/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [516680744/492.7mb], new bytes reserved: [412/412b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=412/412b, accounting=0/0b]')

This is showing up any time a query is made to Elasticsearch from any of the services.
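The 429 circuit_breaking_exception means the Elasticsearch parent circuit breaker tripped because the JVM heap is nearly exhausted. Two common mitigations, sketched below; the host, credentials, and heap size are placeholders, not Opni defaults:

```shell
# Option 1 (preferred): give the Elasticsearch JVM more heap, e.g. via the
# ES_JAVA_OPTS environment variable on the Elasticsearch pods:
#   ES_JAVA_OPTS="-Xms2g -Xmx2g"
#
# Option 2 (temporary): raise the parent breaker limit via the cluster
# settings API. Host and credentials below are assumptions.
curl -k -u admin:admin -X PUT \
  "https://opendistro-es-client-service:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.breaker.total.limit": "85%"}}'
```

Raising the breaker limit only hides the symptom; if the heap keeps filling, the sustainable fix is more heap or fewer/lighter queries.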

Opnictl delete needs to be run twice

When trying to delete Opni, opnictl delete needs to be run twice, as the first run yields this message:

INFO[0000] Deleting Services stack
FATA[0001] the server could not find the requested resource, the server could not find the requested resourc

Error when no prometheus operator and GPU is enabled

[21:25:16] ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu", "namespace": "", "error": "no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""}
Fri, Mar 11 2022 10:25:16 am | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
Fri, Mar 11 2022 10:25:16 am | /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
Fri, Mar 11 2022 10:25:16 am | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
Fri, Mar 11 2022 10:25:16 am | /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

This prevents other reconcilers from continuing.
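The "no matches for kind ServiceMonitor" error means the ServiceMonitor CRD, normally provided by the Prometheus Operator, is not installed. One way to unblock the reconciler without deploying a full monitoring stack is to install just that CRD; the URL and version below are assumptions taken from the upstream prometheus-operator repository:

```shell
# Install only the ServiceMonitor CRD so the reconciler can proceed.
# The version in the path is an assumption -- check the upstream releases.
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl get crd servicemonitors.monitoring.coreos.com
```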

Port forward command in quickstart failing

In the quickstart docs at https://opni.io/deployment/quickstart/, the second command to set up port forwarding gives an error:

# kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
>     -n opni-cluster \
>     port-forward --address 0.0.0.0 svc/opni-es-kibana 5601:5601
Error from server (NotFound): namespaces "opni-cluster" not found

FYI - seeing these namespaces in my cluster:

# kubectl get ns
NAME              STATUS   AGE
cert-manager      Active   6m4s
default           Active   6m16s
kube-node-lease   Active   6m18s
kube-public       Active   6m18s
kube-system       Active   6m18s
opni-system       Active   4m2s
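Given the namespace list above, the Opni components appear to live in opni-system rather than opni-cluster, so pointing the port-forward there may work. This is a sketch: verify the service name and namespace first, since they may differ between releases.

```shell
# Find which namespace actually holds the Kibana service, then forward it.
kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get svc -A | grep kibana
kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
    -n opni-system \
    port-forward --address 0.0.0.0 svc/opni-es-kibana 5601:5601
```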

install_opni.sh getting wait_for_logging: command not found error

Getting this with the first step of the quickstart:

ip-10-0-98-10:~ # curl -sfL https://raw.githubusercontent.com/rancher/opni-docs/main/quickstart_files/install_opni.sh | sh -
[INFO]  finding release for channel stable
[INFO]  using v1.22.3+rke2r1 as release
[INFO]  downloading checksums at https://github.com/rancher/rke2/releases/download/v1.22.3+rke2r1/sha256sum-amd64.txt
[INFO]  downloading tarball at https://github.com/rancher/rke2/releases/download/v1.22.3+rke2r1/rke2.linux-amd64.tar.gz
[INFO]  verifying tarball
[INFO]  unpacking tarball file to /usr/local
[INFO]  Installing Cert Manager
deployment.apps/cert-manager condition met
deployment.apps/cert-manager-cainjector condition met
[INFO]  Installing Opni
deployment.apps/opni-controller-manager condition met
deployment.apps/opni-es-kibana condition met
sh: line 118: wait_for_logging: command not found

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

Currently I have set up a single-node Kubernetes cluster on an EC2 instance (type g4dn.xlarge) using kubeadm. I can install Opni with the basic installation successfully, but when I try to set up the Opni GPU controller, I get this issue:

Warning  FailedScheduling  40s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

Could you give some suggestions? Thanks!
Following is a snippet from kubectl describe node:

Capacity:
  cpu:                4
  ephemeral-storage:  101583780Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16085440Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  93619611493
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15983040Ki
  pods:               110
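The node's Capacity above lists no nvidia.com/gpu resource, which usually means the NVIDIA device plugin is not running on the node. A sketch of the usual fix follows; the manifest URL and version are assumptions from the upstream k8s-device-plugin repository, and the node must already have the NVIDIA driver and container toolkit installed:

```shell
# Deploy the NVIDIA device plugin so the kubelet advertises nvidia.com/gpu.
# The version in the URL is an assumption -- check the upstream releases.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Once the daemonset pod is running, the GPU should appear in node Capacity:
kubectl describe node | grep nvidia.com/gpu
```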

How to set up a proxy when installing opni-inference-control-plane

I am currently facing an issue when installing opni-inference-control-plane. I checked the logs:

Error from server (BadRequest): container "inference-service" in pod "opni-inference-control-plane-586d857f5b-kgn6d" is waiting to start: PodInitializing

I suspect it cannot download https://opni-public.s3.us-east-2.amazonaws.com/pretrain-models/control-plane-model-v0.1.2.zip and may need a proxy, but I don't know how to add one. Could you give some suggestions? Thanks!
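One generic way to route the model download through a proxy is to set the standard proxy environment variables on the workload. This is a sketch: the deployment name and proxy address are assumptions, and if the model is fetched by an init container, that container needs the same variables.

```shell
# Set standard proxy variables on the inference deployment (names assumed).
kubectl set env deployment/opni-inference-control-plane \
  HTTP_PROXY=http://proxy.internal:3128 \
  HTTPS_PROXY=http://proxy.internal:3128 \
  NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local
```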

Anomaly detection using Drain/Nulog

Hello,

I want to thank you for this nice job.

We deployed Opni on a private RKE1 cluster and connected it to a data stream from another RKE1 cluster. We then triggered an incident by deploying a bad hello-world with 50 failed pods, but all generated logs were detected as normal.

I'm trying to find out whether this is expected or whether we need to re-train the model. Also, what exactly is the model used to classify logs as normal, suspicious, or anomalous?

thank you in advance

Address situation where Seaweed runs out of memory.

When running Opni log anomaly detection, we noticed that S3 was no longer storing any additional files. Upon viewing the logs of the Seaweed pod, we noticed these error logs:

1 volume_layout.go:391] Volume 7 becomes writable
I0615 19:43:42     1 volume_growth.go:235] Created Volume 7 on topo:DefaultDataCenter:DefaultRack:10.42.6.21:8080
E0617 00:22:31     1 filer_server_handlers_write.go:45] failing to assign a file id: rpc error: code = Unknown desc = no free volumes left for {"collection":"opni-nulog-models","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":536870912}
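The "no free volumes left" error means the SeaweedFS volume servers have exhausted their configured volume count or disk space. Two common remedies, sketched with illustrative flag values that are not Opni defaults:

```shell
# Raise the maximum number of volumes a volume server may create:
weed volume -max=100 -dir=/data -mserver=master:9333
# Or shrink the per-volume size on the master so more volumes fit on disk:
weed master -volumeSizeLimitMB=1024
```

In an Opni deployment these flags would be set on the Seaweed pod's command/args rather than run by hand, and adding disk capacity is the more durable fix.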
