
opni's Introduction

Multi Cluster Observability with AIOps


Observability data comes in the form of logs, metrics, and traces. Observability backends and agents handle the collection and storage of this data, while AIOps helps make sense of it. Opni ships with all of these nuts and bolts and can be used to self-monitor a single cluster or to serve as a centralized observability data sink for multiple clusters.

You can easily create the following with Opni:

  • Backends

    • Opni Logging - extends OpenSearch to make it easy to search, visualize, and analyze logs, traces, and Kubernetes events
    • Opni Monitoring - extends Cortex to enable multi-cluster, long-term storage for Prometheus metrics
  • Opni Agent

    • Collects logs, Kubernetes events, OpenTelemetry traces and Prometheus metrics with the click of a button
  • AIOps

  • Alerting and SLOs

Check out the docs page to get started!



License

Copyright (c) 2020-2022 SUSE, LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

opni's People

Contributors

alexandrelamarre, amartc, amsuggs37, codyrancher, dbason, dependabot[bot], jaehnri, jameson-mcghee, joshmeranda, jvanz, kralicky, pjbgf, sanjay920, tybalex


opni's Issues

Add logic to DataPrepper to parse out filename

Currently, within control plane logs, Data Prepper does not do any processing of the filename or any other attribute obtained from a log message. Logic should be added to extract the filename from log messages of this format.

Downstream Service Discovery API

We want to detect downstream services to monitor with SLO/SLI rules.

They should show up as a list in the SLO input when creating SLOs.

When the Nulog model crashes during training, the GPU is blocked from being used for inferencing.

An issue was observed where, if the Nulog model crashes during training, it never sends a signal to the GPU service that it has completed its job, so the GPU is indefinitely blocked from being used for inferencing. This was seen in cases where the DRAIN service provided a normal interval whose starting and ending timestamps were exactly the same; no logs were retrieved, and the Nulog model subsequently crashed.

Unable to ship Rancher logs to Opni

Hello,

We are trying Opni in a Rancher 2.6.2/RKE1 environment.
The Opni operator and the Opni cluster are deployed using this procedure.

We have configured the log shipping using this procedure, but there is no index in OpenSearch.

We also see that the type of the cluster output should be ElasticSearch, but in the UI we see Unknown.

The API used for Opni is logging.opni.io. We configured the cluster output with the logging.opni.io API and created a second one with the logging.banzaicloud.io API, but both of them are "Unknown".

I attached the log of the pods opni-svc-payload-receiver, opni-controller-manager-manager, opni-controller-manager-kube-rbac-proxy and the rancher-logging-fluentd.

Thank you for your help.

opni-controller-manager-manager.log
rancher-logging-fluentd.log
opni-controller-manager-kube-rbac-proxy.log
payload-receiver.log

Develop service for Opensearch Updating

Currently, the preprocessing, DRAIN, and inferencing services all update OpenSearch, which can become very expensive and inefficient over time. Thus, a single service dedicated to updating OpenSearch is being designed and developed right now.

/amazon/opensearch-dashboards:1.1.0 not found: manifest unknown: manifest unknown

I am trying to deploy Opni in a private Kubernetes environment behind a proxy. I was able to apply all the YAML files with no problem, but while trying to deploy the elastic image I got the following message: /amazon/opensearch-dashboards:1.1.0 not found: manifest unknown: manifest unknown

As I did for all the other images, I added the proxy before the image link, but it didn't work.

My question is: can we change the source of the elastic image, and how can we do this?
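As a workaround, if the image reference lives in a Deployment, you can point it at an internal mirror. This is only a sketch: the namespace, deployment name, and mirror host below are assumptions and must be adjusted for your cluster.

```shell
# Hypothetical example: override the dashboards image so it pulls from an
# internal mirror instead of the default registry.
kubectl -n opni get deployments   # find the deployment that runs the dashboards image
kubectl -n opni set image deployment/opni-es-kibana \
  opni-es-kibana=my-mirror.internal/amazon/opensearch-dashboards:1.1.0
```

If the image is instead managed by the operator's custom resource, the override belongs in that resource rather than the Deployment, which the operator would otherwise revert.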

preprocessing-service crashing

I installed Opni on my EKS cluster. After a few minutes, preprocessing-service starts crashing. I can see it is receiving a 400 from opendistro-es-client-service and is not able to handle that.

2021-08-11 05:58:41,927 - INFO - POST https://opendistro-es-client-service.opni-demo.svc.cluster.local:9200/_bulk [status:200 request:0.180s]
2021-08-11 05:58:42,538 - WARNING - POST https://opendistro-es-client-service.opni-demo.svc.cluster.local:9200/_bulk [status:400 request:0.035s]
Traceback (most recent call last):
  File "./preprocess.py", line 183, in <module>
    loop.run_until_complete(
  File "/usr/local/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "./preprocess.py", line 158, in mask_logs
    async for ok, result in async_streaming_bulk(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 177, in async_streaming_bulk
    async for data, (ok, info) in azip(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 107, in azip
    yield tuple([await x.__anext__() for x in aiters])
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 107, in <listcomp>
    yield tuple([await x.__anext__() for x in aiters])
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 82, in _process_bulk_chunk
    for item in gen:
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/helpers/actions.py", line 193, in _process_bulk_chunk_error
    raise error
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/helpers.py", line 70, in _process_bulk_chunk
    resp = await client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/client/__init__.py", line 457, in bulk
    return await self.transport.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/transport.py", line 329, in perform_request
    raise e
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/transport.py", line 296, in perform_request
    status, headers, data = await connection.perform_request(
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/_async/http_aiohttp.py", line 329, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.8/site-packages/elasticsearch/connection/base.py", line 322, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.RequestError: RequestError(400, 'illegal_argument_exception', 'For input string: ""')

I checked opendistro-es-client-service pod but no relevant logs are available there.

[2021-08-11T05:59:37,214][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,046][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,046][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,785][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T05:59:38,786][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T06:00:33,158][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default
[2021-08-11T06:01:33,185][DEPRECATION][o.e.d.c.m.IndexNameExpressionResolver] [opendistro-es-client-55d4c4489f-78f96] this request accesses system indices: [.kibana_1, .kibana_92668751_admin_1], but in a future major version, direct access to system indices will be prevented by default

Failed to install with namespaces "opni-system" not found error

Failed to install:

INFO[0000] Starting installer
W0519 20:22:19.350824   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0519 20:22:19.556324   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0000] Creating CRD helmcharts.helm.cattle.io
W0519 20:22:19.655385   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0000] Creating CRD helmchartconfigs.helm.cattle.io
W0519 20:22:19.741663   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
W0519 20:22:19.873611   49098 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
INFO[0001] Deploying infrastructure resources
W0519 20:22:20.519758   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.603145   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.694389   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
W0519 20:22:20.841523   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0519 20:22:20.945856   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
W0519 20:22:21.046222   49098 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
FATA[0003] failed to create opni-system/infra-stack /v1, Kind=ConfigMap for infra-stack opni-system/infra-stack: namespaces "opni-system" not found
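The fatal error indicates the installer tried to create resources in a namespace that does not exist yet. A minimal workaround (a sketch, assuming nothing else in the install failed) is to create the namespace manually before re-running the installer:

```shell
# Create the namespace the installer expects, then re-run the install.
kubectl create namespace opni-system
kubectl get namespace opni-system   # should report STATUS Active
```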

Nulog model not always training when training signal sent.

A new Nulog model should be trained any time the GPU service sends a signal over NATS. However, it has been observed that even when that signal is sent and there is new training data, the model sometimes is not trained.

How to use LogAdapter on a custom Kubernetes cluster

Currently I set up the Kubernetes cluster myself (kubeadm init) without using RKE/RKE2/AKS/EKS, so I'm not sure how to use LogAdapter, as the provider should be one of rke/rke2/aks/eks. Could you give some suggestions?

Look into problem concerning Elasticsearch memory issue.

When running Opni for over 30 days on an EKS cluster, we suddenly came across this error:

TransportError(429, 'circuit_breaking_exception', '[parent] Data too large, data for [<http_request>] would be [516681156/492.7mb], which is larger than the limit of [510027366/486.3mb], real usage: [516680744/492.7mb], new bytes reserved: [412/412b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=412/412b, accounting=0/0b]')

This is showing up any time a query is made to Elasticsearch from any of the services.
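The 429 circuit_breaking_exception means the Elasticsearch parent circuit breaker tripped because the JVM heap is nearly exhausted. Two common mitigations, sketched below; the host, credentials, and heap size are placeholders, not Opni defaults:

```shell
# Option 1 (preferred): give the Elasticsearch JVM more heap, e.g. via the
# ES_JAVA_OPTS environment variable on the Elasticsearch pods:
#   ES_JAVA_OPTS="-Xms2g -Xmx2g"
#
# Option 2 (temporary): raise the parent breaker limit via the cluster
# settings API. Host and credentials below are assumptions.
curl -k -u admin:admin -X PUT \
  "https://opendistro-es-client-service:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"indices.breaker.total.limit": "85%"}}'
```

Raising the breaker limit only hides the symptom; if the heap keeps filling, the sustainable fix is more heap or fewer/lighter queries.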

Opnictl delete needs to be run twice

When trying to delete Opni, opnictl delete needs to be run twice, as the first run yields this message:

INFO[0000] Deleting Services stack
FATA[0001] the server could not find the requested resource, the server could not find the requested resourc

Error when no prometheus operator and GPU is enabled

[21:25:16] ERROR controller.clusterpolicy-controller Reconciler error {"name": "gpu", "namespace": "", "error": "no matches for kind \"ServiceMonitor\" in version \"monitoring.coreos.com/v1\""}
Fri, Mar 11 2022 10:25:16 am | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
Fri, Mar 11 2022 10:25:16 am | /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266
Fri, Mar 11 2022 10:25:16 am | sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
Fri, Mar 11 2022 10:25:16 am | /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227

This prevents other reconcilers from continuing.
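The "no matches for kind ServiceMonitor" error means the ServiceMonitor CRD, normally provided by the Prometheus Operator, is not installed. One way to unblock the reconciler without deploying a full monitoring stack is to install just that CRD; the URL and version below are assumptions taken from the upstream prometheus-operator repository:

```shell
# Install only the ServiceMonitor CRD so the reconciler can proceed.
# The version in the path is an assumption -- check the upstream releases.
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/v0.60.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml
kubectl get crd servicemonitors.monitoring.coreos.com
```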

Port forward command in quickstart failing

In the quickstart docs at https://opni.io/deployment/quickstart/, the second command to set up port forwarding gives an error:

# kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
>     -n opni-cluster \
>     port-forward --address 0.0.0.0 svc/opni-es-kibana 5601:5601
Error from server (NotFound): namespaces "opni-cluster" not found

FYI - seeing these namespaces in my cluster:

# kubectl get ns
NAME              STATUS   AGE
cert-manager      Active   6m4s
default           Active   6m16s
kube-node-lease   Active   6m18s
kube-public       Active   6m18s
kube-system       Active   6m18s
opni-system       Active   4m2s
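Given the namespace list above, the Opni components appear to live in opni-system rather than opni-cluster, so pointing the port-forward there may work. This is a sketch: verify the service name and namespace first, since they may differ between releases.

```shell
# Find which namespace actually holds the Kibana service, then forward it.
kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get svc -A | grep kibana
kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
    -n opni-system \
    port-forward --address 0.0.0.0 svc/opni-es-kibana 5601:5601
```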

install_opni.sh getting wait_for_logging: command not found error

Getting this with the first step of the quickstart:

ip-10-0-98-10:~ # curl -sfL https://raw.githubusercontent.com/rancher/opni-docs/main/quickstart_files/install_opni.sh | sh -
[INFO]  finding release for channel stable
[INFO]  using v1.22.3+rke2r1 as release
[INFO]  downloading checksums at https://github.com/rancher/rke2/releases/download/v1.22.3+rke2r1/sha256sum-amd64.txt
[INFO]  downloading tarball at https://github.com/rancher/rke2/releases/download/v1.22.3+rke2r1/rke2.linux-amd64.tar.gz
[INFO]  verifying tarball
[INFO]  unpacking tarball file to /usr/local
[INFO]  Installing Cert Manager
deployment.apps/cert-manager condition met
deployment.apps/cert-manager-cainjector condition met
[INFO]  Installing Opni
deployment.apps/opni-controller-manager condition met
deployment.apps/opni-es-kibana condition met
sh: line 118: wait_for_logging: command not found

0/1 nodes are available: 1 Insufficient nvidia.com/gpu

Currently I have set up a single-node Kubernetes cluster on an EC2 instance (type g4dn.xlarge) using kubeadm. I can install Opni with the basic installation successfully, but when I try to set up the Opni GPU controller, I get this issue:

Warning  FailedScheduling  40s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

Could you give some suggestions? Thanks!
Following is a snippet from kubectl describe node:

Capacity:
  cpu:                4
  ephemeral-storage:  101583780Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16085440Ki
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  93619611493
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15983040Ki
  pods:               110
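The node's Capacity above lists no nvidia.com/gpu resource, which usually means the NVIDIA device plugin is not running on the node. A sketch of the usual fix follows; the manifest URL and version are assumptions from the upstream k8s-device-plugin repository, and the node must already have the NVIDIA driver and container toolkit installed:

```shell
# Deploy the NVIDIA device plugin so the kubelet advertises nvidia.com/gpu.
# The version in the URL is an assumption -- check the upstream releases.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Once the daemonset pod is running, the GPU should appear in node Capacity:
kubectl describe node | grep nvidia.com/gpu
```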

How to set up a proxy when installing opni-inference-control-plane

I am currently facing an issue when installing opni-inference-control-plane. I checked the logs:

Error from server (BadRequest): container "inference-service" in pod "opni-inference-control-plane-586d857f5b-kgn6d" is waiting to start: PodInitializing

I suspect it cannot download https://opni-public.s3.us-east-2.amazonaws.com/pretrain-models/control-plane-model-v0.1.2.zip and may need a proxy, but I don't know how to add one. Could you give some suggestions? Thanks!
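One generic way to route the model download through a proxy is to set the standard proxy environment variables on the workload. This is a sketch: the deployment name and proxy address are assumptions, and if the model is fetched by an init container, that container needs the same variables.

```shell
# Set standard proxy variables on the inference deployment (names assumed).
kubectl set env deployment/opni-inference-control-plane \
  HTTP_PROXY=http://proxy.internal:3128 \
  HTTPS_PROXY=http://proxy.internal:3128 \
  NO_PROXY=localhost,127.0.0.1,.svc,.cluster.local
```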

Anomaly detection using Drain/Nulog

Hello,

I want to thank you for this nice job.

We deployed Opni on a private RKE1 cluster and connected it to a data stream from another RKE1 cluster. We then triggered an incident by deploying a bad hello-world with 50 failed pods, but all generated logs were detected as normal.

I'm trying to find out whether this is expected or whether we need to re-train the model. Also, what exactly is the model used to classify logs as normal, suspicious, or anomalous?

thank you in advance

Address situation where Seaweed runs out of memory.

When running Opni log anomaly detection, we noticed that S3 was no longer storing any additional files. Upon viewing the logs of the Seaweed pod, we noticed these error logs:

1 volume_layout.go:391] Volume 7 becomes writable
I0615 19:43:42     1 volume_growth.go:235] Created Volume 7 on topo:DefaultDataCenter:DefaultRack:10.42.6.21:8080
E0617 00:22:31     1 filer_server_handlers_write.go:45] failing to assign a file id: rpc error: code = Unknown desc = no free volumes left for {"collection":"opni-nulog-models","replication":{},"ttl":{"Count":0,"Unit":0},"preallocate":536870912}
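The "no free volumes left" error means the SeaweedFS volume servers have exhausted their configured volume count or disk space. Two common remedies, sketched with illustrative flag values that are not Opni defaults:

```shell
# Raise the maximum number of volumes a volume server may create:
weed volume -max=100 -dir=/data -mserver=master:9333
# Or shrink the per-volume size on the master so more volumes fit on disk:
weed master -volumeSizeLimitMB=1024
```

In an Opni deployment these flags would be set on the Seaweed pod's command/args rather than run by hand, and adding disk capacity is the more durable fix.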
