
zmon-data-service's Introduction

ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests.

ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the OpenTelemetry initiative and recommend that others starting their journey begin there.

If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the license before you do.

ZMON Data Service

OpenTracing enabled

The ZMON worker sends its data to the zmon-data-service, which is responsible for:

  • storing results in Redis for the frontend
  • storing results in KairosDB for charting
  • tracking size/rate by team
  • handling notifications (if we cannot do this in a distributed fashion (SMS vs. email))

Input object:

{
    "account": "",
    "team": "",
    "results": [
        {
            "time": ...,
            "check_id": 1234,
            "check_result": ...,
            "run_time": ...,
            "exception": 0/1,
            "entity_id": "",
            "alerts" : {
                1 : { "state": 0/1, "captures": {}}, ...
            }
        }
    ]
}
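For illustration, a minimal plain-Java model of this input object might look as follows (class and field names here are hypothetical and simply mirror the JSON keys; the service's actual classes live elsewhere):

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of the input object above; field names mirror the JSON keys.
public class WorkerResult {
    public String account;
    public String team;
    public List<CheckData> results;

    public static class CheckData {
        public String time;           // check execution timestamp
        public int check_id;
        public Object check_result;   // arbitrary check payload
        public double run_time;       // execution duration in seconds
        public boolean exception;     // whether the check raised an exception
        public String entity_id;
        public Map<Integer, AlertState> alerts;  // keyed by alert id
    }

    public static class AlertState {
        public boolean state;                 // whether the alert is raised
        public Map<String, Object> captures;  // captured values for the alert
    }
}
```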

Building

$ ./mvnw clean package
$ docker build -t zmon-data-service .

Running

$ export TOKENINFO_URL=...
$ java -jar target/zmon-data-service-1.0-SNAPSHOT.jar

zmon-data-service's People

Contributors

a1exsh, alexkorotkikh, bocytko, elgris, hjacobs, jan-m, jbellmann, lmineiro, mohabusama, pitr, rajatparida86, vetinari, vibhory2j


zmon-data-service's Issues

Skip storage of empty WorkerResult

If the WorkerResult doesn't have any results, we log the exception and fail the request:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
#011at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[na:1.8.0_121]
#011at java.util.ArrayList.get(ArrayList.java:429) ~[na:1.8.0_121]
#011at de.zalando.zmon.dataservice.data.KairosDBStore.store(KairosDBStore.java:232) ~[classes!/:na]
#011at de.zalando.zmon.dataservice.data.KairosDbWorkResultWriter.write(KairosDbWorkResultWriter.java:35) ~[classes!/:na]
#011at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) ~[na:na]
#011at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_121]
#011at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_121]
#011at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:115) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
#011at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]  
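The fix can be as simple as a guard before the KairosDB write; a minimal sketch (class and method names are hypothetical):

```java
import java.util.Collections;
import java.util.List;

// Sketch of the proposed guard: skip storage entirely when the WorkerResult
// carries no results, instead of indexing into an empty list.
public class WorkerResultGuard {
    public static boolean shouldStore(List<?> results) {
        return results != null && !results.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldStore(Collections.emptyList())); // false
        System.out.println(shouldStore(List.of("check-result"))); // true
    }
}
```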

Support preshared tokens

All other components support preshared tokens. The Data Service not supporting them blocks a simple multi-region setup.

NPE in PyString date extraction

start_time is sometimes null:

[2016-05-31 10:02:55.680] boot - 14 ERROR [zmon-async-51] --- RedisWorkerResultWriter: failed redis write check=1765 data={"account": "dc:123", "results": [{"check_result": {"td": 0.179116, ..., "captures": {"collector": "PR24", "type": "plot"}, "start_time": null, "changed": false, "exception": false, "in_period": true}}, "exception": false, "entity": {"id": "foo"}, "run_time": 0.179116, "time": "2016-05-31 12:02:52.709910+02:00"}], "team": ""}
May 31 10:02:55 ip-172-31-149-94 docker/a7edc08e2065[890]: java.lang.NullPointerException
May 31 10:02:55 ip-172-31-149-94 docker/a7edc08e2065[890]: #011at de.zalando.zmon.dataservice.data.PyString.extractDate(PyString.java:18)
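A null-safe variant of the extraction could return an empty Optional for a null start_time; a sketch (PyString's real parsing logic may differ):

```java
import java.time.OffsetDateTime;
import java.util.Optional;

// Sketch: guard against null/empty input before parsing. Python-formatted
// timestamps like "2016-05-31 12:02:52.709910+02:00" become ISO-8601 by
// replacing the space with 'T'.
public class SafeDateExtractor {
    public static Optional<OffsetDateTime> extractDate(String pyDate) {
        if (pyDate == null || pyDate.isEmpty()) {
            return Optional.empty(); // null start_time: no date, no NPE
        }
        return Optional.of(OffsetDateTime.parse(pyDate.replace(' ', 'T')));
    }
}
```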

Alerts TimePeriod is not honored

The ZMON worker will remove alerts from "active" alerts (e.g. "zmon:alerts" Redis key) if the "time period" does not match:

From worker logs:

notify - Removed alert with id 5363 on entity abc-live-slave-standby from active alerts due to time period: hr { 16 - 23 }

This behavior is not honored in the data service.

To be changed: the data service needs to consider the in_period property of the AlertData object and remove alerts from the alert state if in_period == false.
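The required behavior could be sketched like this (method and key names are hypothetical; the real state lives in Redis under keys like "zmon:alerts"):

```java
import java.util.Set;

// Sketch: mirror the worker's time-period handling in the data service by
// removing an alert from the active set whenever in_period is false.
public class AlertStateUpdater {
    public static void apply(Set<String> activeAlerts, String alertEntityKey, boolean inPeriod) {
        if (inPeriod) {
            activeAlerts.add(alertEntityKey);
        } else {
            activeAlerts.remove(alertEntityKey); // outside its time period: drop it
        }
    }
}
```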

Implement limit on tag cardinality of check results

The underlying time series backend comes under pressure due to the high cardinality of the tags (metadata that characterises the metrics) associated with check results. The current implementation processes an unlimited number of tags and forwards them all to the time series backend for storage.

Check results are provided to the data-service per entity, containing tags produced as per the check definition. Some checks produce a large number of tags (key results), sometimes with a unique name on every execution, resulting in an explosion of tag cardinality.

This issue is created to validate the hypothesis that tag cardinality (and consequently pressure on the time series backend) can be reduced significantly by rate-limiting the tags processed in the data-service layer, and to implement that limit.
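One way to realize the proposed limit is to cap the number of tags per check result before they are forwarded; a sketch (the cap value and the keep-first policy are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep at most maxTags tags per check result, dropping the rest
// before they reach the time series backend.
public class TagLimiter {
    public static Map<String, String> limitTags(Map<String, String> tags, int maxTags) {
        Map<String, String> limited = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (limited.size() >= maxTags) {
                break; // cardinality cap reached: drop remaining tags
            }
            limited.put(e.getKey(), e.getValue());
        }
        return limited;
    }
}
```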

NPE when running without OpenTracing tracer

When no tracer implementation is configured, we get an NPE:

java.lang.NullPointerException: null
at de.zalando.zmon.dataservice.data.RedisDataPointsQueryStore.getSpanContext(RedisDataPointsQueryStore.java:119) ~[classes!/:na]
...
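A minimal sketch of the null guard (no real OpenTracing types; the interface below only models the pattern):

```java
import java.util.Optional;

// Sketch: extract a span context only when a tracer is configured, returning
// empty instead of throwing an NPE.
public class SpanContextGuard {
    public interface Tracer {
        String activeSpanContext();
    }

    public static Optional<String> getSpanContext(Tracer tracer) {
        if (tracer == null) {
            return Optional.empty(); // no tracer configured: skip span extraction
        }
        return Optional.ofNullable(tracer.activeSpanContext());
    }
}
```

Alternatively, registering a no-op tracer at startup avoids scattering null checks; OpenTracing's GlobalTracer, for example, falls back to a no-op tracer until a real one is registered.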

Generic time series data model and TSDB writers

ZMON generic time series data model

Impact

Provides a framework to optionally write time series metrics from ZMON to time series databases other than KairosDB.

Deliverables

  • Implement a generic time series data model for ZMON time series metrics
  • Move ZMON time series metrics data model out of KairosDB writer and store code
  • Implement TSDB writers (M3DB, IronDB, InfluxDB) and store

Fix propagation of tracing span through async operation

As a developer
I want the tracing span of an outgoing request (executed asynchronously) and the tracing span of the incoming request that triggered it to belong to the same trace
So that I get a complete tracing picture

Explanation

I'm going to explain it with a single example:

data-service receives requests in the DataServiceController#putData handler. There it starts a new span (possibly belonging to an existing trace if the request contains tracing information). Then it calls:
de.zalando.zmon.dataservice.data.WorkResultWriter#write (implemented by de.zalando.zmon.dataservice.data.ApplicationMetricsWriter#write) and then
de.zalando.zmon.dataservice.data.AppMetricsClient#receiveData.
AppMetricsClient#receiveData sends outgoing requests to the metric-cache asynchronously with org.apache.http.client.fluent.Async. The problem is that the span context is not propagated to the async processor. Therefore the outgoing request triggers the generation of a new tracing span that is not bound to the original trace; it becomes a separate one.

We have not solved this problem for async operations (yet). For the data-service the problem exists everywhere async processing is used. It is also relevant to zmon-controller and other Java components that execute async operations without trace propagation.
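The usual fix is to capture the active span on the submitting thread and re-activate it inside the async task. A minimal sketch, with a ThreadLocal standing in for the tracer's scope manager (the real OpenTracing API would use tracer.activeSpan() and scopeManager().activate()):

```java
// Sketch: wrap the async task so the caller's span travels with it.
public class SpanPropagation {
    // Stand-in for the tracer's thread-bound active span.
    static final ThreadLocal<String> ACTIVE_SPAN = new ThreadLocal<>();

    public static Runnable withSpan(Runnable task) {
        String captured = ACTIVE_SPAN.get(); // captured on the submitting thread
        return () -> {
            ACTIVE_SPAN.set(captured);       // re-activated on the worker thread
            try {
                task.run();
            } finally {
                ACTIVE_SPAN.remove();
            }
        };
    }
}
```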

Surefire plugin fails

on any mvn build with the newest OpenJDK which involves a test run:

9:37:46 AM -------------------------------------------------------
9:37:46 AM T E S T S
9:37:46 AM -------------------------------------------------------
9:37:46 AM Error: Could not find or load main class org.apache.maven.surefire.booter.ForkedBooter
9:37:46 AM 
9:37:46 AM Results :
9:37:46 AM 
9:37:46 AM Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
9:37:46 AM 
9:37:46 AM [INFO] ------------------------------------------------------------------------
9:37:46 AM [INFO] BUILD FAILURE
9:37:46 AM [INFO] ------------------------------------------------------------------------
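This matches the widely reported ForkedBooter problem with newer OpenJDK builds. Two commonly suggested remedies (not verified against this repository) are upgrading maven-surefire-plugin to 2.22.x or disabling the system class loader for the forked test JVM, e.g.:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- workaround for "Could not find or load main class ...ForkedBooter" -->
    <useSystemClassLoader>false</useSystemClassLoader>
  </configuration>
</plugin>
```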

Refactor KairosDB storage configuration

It's currently possible to configure KairosDB storage using a list of lists, as found in DataServiceConfigProperties.java.

It's not immediately clear what each level of those lists stands for, which can lead to misuse (partitions instead of replicas).

I propose changing this configuration so that the KairosDB storage is configured by explicitly stating which partition(s) and replica(s) to use.

For ex.:

{
    "dataservice": {
        "storage": {
            "replicas": [
                {
                    "url": "http://endpoint-for-replica-1"
                },
                {
                    "partitions": [
                        "http://endpoint-for-replica-2-partition-1",
                        "http://endpoint-for-replica-2-partition-2",
                        "http://endpoint-for-replica-2-partition-n"
                    ]
                },
                {
                    "url": "http://endpoint-for-replica-n"
                }
            ]
        }
    }
}

Each element of the outermost replicas list has either a single url attribute, representing the single endpoint for that replica, or a partitions attribute that accepts a list with each partition's endpoint for that replica.
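A possible shape for the corresponding configuration properties (class and field names hypothetical):

```java
import java.util.List;

// Sketch of the refactored properties: each replica carries either a single
// url or an explicit list of partition endpoints.
public class StorageProperties {
    public static class Replica {
        public String url;               // single endpoint for this replica, or null
        public List<String> partitions;  // per-partition endpoints, or null

        // Resolve all endpoints for this replica.
        public List<String> endpoints() {
            return (partitions != null && !partitions.isEmpty())
                    ? partitions
                    : List.of(url);
        }
    }

    public List<Replica> replicas;
}
```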

Add batch API options

Some API calls could be batched, like:

  • Deleting multiple entities
  • Adding entities

Both are done by the agent after each cycle.

Missing error processing in ControllerProxy

How to reproduce

  1. Install whole ZMON stack
  2. Break ZMON Controller API so it returns errors (e.g. because of broken auth configuration, so requests from data-service are not authorised)
  3. Observe Index Out of Bounds Exception coming from de.zalando.zmon.dataservice.proxies.ControllerProxy#proxyForLastModifiedHeader

Explanation

When the ZMON Controller API returns error responses, data-service still tries to get the "Last-Modified" header from them and fails. The data service should check whether the header is present in the response before trying to extract it.

Acceptance criteria

  • Instead of an IndexOutOfBoundsException, the data-service must report that the "Last-Modified" header was not found and provide the response data, so it's easier for developers to investigate
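The guard could look like this (a sketch; the real code reads Apache HttpClient headers rather than a Map):

```java
import java.util.Map;
import java.util.Optional;

// Sketch: look the header up safely and let the caller decide how to report
// its absence, instead of indexing into an empty header array.
public class LastModifiedExtractor {
    public static Optional<String> lastModified(Map<String, String> headers) {
        return Optional.ofNullable(headers.get("Last-Modified"));
    }

    // Fail with a descriptive message that includes the response data.
    public static String lastModifiedOrReport(Map<String, String> headers, String responseBody) {
        return lastModified(headers).orElseThrow(() ->
                new IllegalStateException(
                        "Last-Modified header not found; response was: " + responseBody));
    }
}
```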

KairosDB Proxy: bad request should not produce stack traces

May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: [2016-05-31 12:04:33.916] boot - 11 ERROR [http-nio-8080-exec-1] --- [dispatcherServlet]: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: org.apache.http.client.HttpResponseException: Bad Request
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:47)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:40)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.Response.handleResponse(Response.java:90)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.Response.returnContent(Response.java:97)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at de.zalando.zmon.dataservice.proxies.kairosdb.KairosdbProxy.kairosDBPost(KairosdbProxy.java:102)
