
zmon-data-service's Introduction

ZMON source code on GitHub is no longer in active development. Zalando will no longer actively review issues or merge pull-requests.

ZMON is still being used at Zalando and serves us well for many purposes. We are now deeper into our observability journey and understand better that we need other telemetry sources and tools to elevate our understanding of the systems we operate. We support the OpenTelemetry initiative and recommend that others starting their journey begin there.

If members of the community are interested in continuing to develop ZMON, consider forking it. Please review the license before you do.

ZMON Data Service

OpenTracing enabled

The ZMON worker sends its data to the zmon-data-service, which is responsible for:

  • storing results in Redis for the frontend
  • storing results in KairosDB for charting
  • tracking size/rate by team
  • handling notifications (if we cannot do this in a distributed fashion (SMS vs. email))

Input object:

{
    "account": "",
    "team": "",
    "results": [
        {
            "time": ...,
            "check_id": 1234,
            "check_result": ...,
            "run_time": ...,
            "exception": 0/1,
            "entity_id": "",
            "alerts" : {
                1 : { "state": 0/1, "captures": {}}, ...
            }
        }
    ]
}
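For illustration, a minimal plain-Java model of this input object might look as follows (class and field names here are hypothetical and simply mirror the JSON keys; the service's actual classes live elsewhere):

```java
import java.util.List;
import java.util.Map;

// Hypothetical model of the input object above; field names mirror the JSON keys.
public class WorkerResult {
    public String account;
    public String team;
    public List<CheckData> results;

    public static class CheckData {
        public String time;           // check execution timestamp
        public int check_id;
        public Object check_result;   // arbitrary check payload
        public double run_time;       // execution duration in seconds
        public boolean exception;     // whether the check raised an exception
        public String entity_id;
        public Map<Integer, AlertState> alerts;  // keyed by alert id
    }

    public static class AlertState {
        public boolean state;                 // whether the alert is raised
        public Map<String, Object> captures;  // captured values for the alert
    }
}
```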

Building

$ ./mvnw clean package
$ docker build -t zmon-data-service .

Running

$ export TOKENINFO_URL=...
$ java -jar target/zmon-data-service-1.0-SNAPSHOT.jar

zmon-data-service's People

Contributors

a1exsh, alexkorotkikh, bocytko, elgris, hjacobs, jan-m, jbellmann, lmineiro, mohabusama, pitr, rajatparida86, vetinari, vibhory2j


zmon-data-service's Issues

Skip storage of empty WorkerResult

If the WorkerResult doesn't have any results, we log the exception and fail the request:

java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
#011at java.util.ArrayList.rangeCheck(ArrayList.java:653) ~[na:1.8.0_121]
#011at java.util.ArrayList.get(ArrayList.java:429) ~[na:1.8.0_121]
#011at de.zalando.zmon.dataservice.data.KairosDBStore.store(KairosDBStore.java:232) ~[classes!/:na]
#011at de.zalando.zmon.dataservice.data.KairosDbWorkResultWriter.write(KairosDbWorkResultWriter.java:35) ~[classes!/:na]
#011at sun.reflect.GeneratedMethodAccessor73.invoke(Unknown Source) ~[na:na]
#011at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_121]
#011at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_121]
#011at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:115) [spring-aop-4.3.6.RELEASE.jar!/:4.3.6.RELEASE]
#011at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
#011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
#011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
#011at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]  
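The fix can be as simple as a guard before the KairosDB write; a minimal sketch (class and method names are hypothetical):

```java
import java.util.Collections;
import java.util.List;

// Sketch of the proposed guard: skip storage entirely when the WorkerResult
// carries no results, instead of indexing into an empty list.
public class WorkerResultGuard {
    public static boolean shouldStore(List<?> results) {
        return results != null && !results.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(shouldStore(Collections.emptyList())); // false
        System.out.println(shouldStore(List.of("check-result"))); // true
    }
}
```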

Support preshared tokens

All other components support preshared tokens. The Data Service not supporting them blocks a simple multi-region setup.

NPE in PyString date extraction

start_time is sometimes null:

[2016-05-31 10:02:55.680] boot - 14 ERROR [zmon-async-51] --- RedisWorkerResultWriter: failed redis write check=1765 data={"account": "dc:123", "results": [{"check_result": {"td": 0.179116, ..., "captures": {"collector": "PR24", "type": "plot"}, "start_time": null, "changed": false, "exception": false, "in_period": true}}, "exception": false, "entity": {"id": "foo"}, "run_time": 0.179116, "time": "2016-05-31 12:02:52.709910+02:00"}], "team": ""}
May 31 10:02:55 ip-172-31-149-94 docker/a7edc08e2065[890]: java.lang.NullPointerException
May 31 10:02:55 ip-172-31-149-94 docker/a7edc08e2065[890]: #011at de.zalando.zmon.dataservice.data.PyString.extractDate(PyString.java:18)
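A null-safe variant of the extraction could return an empty Optional for a null start_time; a sketch (PyString's real parsing logic may differ):

```java
import java.time.OffsetDateTime;
import java.util.Optional;

// Sketch: guard against null/empty input before parsing. Python-formatted
// timestamps like "2016-05-31 12:02:52.709910+02:00" become ISO-8601 by
// replacing the space with 'T'.
public class SafeDateExtractor {
    public static Optional<OffsetDateTime> extractDate(String pyDate) {
        if (pyDate == null || pyDate.isEmpty()) {
            return Optional.empty(); // null start_time: no date, no NPE
        }
        return Optional.of(OffsetDateTime.parse(pyDate.replace(' ', 'T')));
    }
}
```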

Alerts TimePeriod is not honored

The ZMON worker will remove alerts from "active" alerts (e.g. "zmon:alerts" Redis key) if the "time period" does not match:

From worker logs:

notify - Removed alert with id 5363 on entity abc-live-slave-standby from active alerts due to time period: hr { 16 - 23 }

This behavior is not honored in the data service.

To be changed: the data service needs to consider the in_period property of the AlertData object and remove alerts from the alert state if in_period == false.
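The required behavior could be sketched like this (method and key names are hypothetical; the real state lives in Redis under keys like "zmon:alerts"):

```java
import java.util.Set;

// Sketch: mirror the worker's time-period handling in the data service by
// removing an alert from the active set whenever in_period is false.
public class AlertStateUpdater {
    public static void apply(Set<String> activeAlerts, String alertEntityKey, boolean inPeriod) {
        if (inPeriod) {
            activeAlerts.add(alertEntityKey);
        } else {
            activeAlerts.remove(alertEntityKey); // outside its time period: drop it
        }
    }
}
```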

Implement limit on tag cardinality of check results

The underlying time series backend comes under pressure due to the high cardinality of the tags (metadata that characterises the metrics) associated with check results. The current implementation processes an unlimited number of tags and forwards them all to the time series backend for storage.

Check results are provided to the data-service per entity, containing tags produced as per the check definition. Some checks produce a large number of tags (key results), sometimes with a unique name on every execution, resulting in an explosion of tag cardinality.

This issue is created to validate the hypothesis that tag cardinality (and consequently pressure on the time series backend) can be reduced significantly by rate-limiting the tags processed in the data-service layer, and to implement that limit.
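One way to realize the proposed limit is to cap the number of tags per check result before they are forwarded; a sketch (the cap value and the keep-first policy are assumptions):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: keep at most maxTags tags per check result, dropping the rest
// before they reach the time series backend.
public class TagLimiter {
    public static Map<String, String> limitTags(Map<String, String> tags, int maxTags) {
        Map<String, String> limited = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (limited.size() >= maxTags) {
                break; // cardinality cap reached: drop remaining tags
            }
            limited.put(e.getKey(), e.getValue());
        }
        return limited;
    }
}
```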

NPE when running without OpenTracing tracer

When no tracer implementation is configured, we get an NPE:

java.lang.NullPointerException: null
at de.zalando.zmon.dataservice.data.RedisDataPointsQueryStore.getSpanContext(RedisDataPointsQueryStore.java:119) ~[classes!/:na]
...
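A minimal sketch of the null guard (no real OpenTracing types; the interface below only models the pattern):

```java
import java.util.Optional;

// Sketch: extract a span context only when a tracer is configured, returning
// empty instead of throwing an NPE.
public class SpanContextGuard {
    public interface Tracer {
        String activeSpanContext();
    }

    public static Optional<String> getSpanContext(Tracer tracer) {
        if (tracer == null) {
            return Optional.empty(); // no tracer configured: skip span extraction
        }
        return Optional.ofNullable(tracer.activeSpanContext());
    }
}
```

Alternatively, registering a no-op tracer at startup avoids scattering null checks; OpenTracing's GlobalTracer, for example, falls back to a no-op tracer until a real one is registered.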

Generic time series data model and TSDB writers

ZMON generic time series data model

Impact

Provides a framework to optionally write time series metrics from ZMON to time series databases other than KairosDB.

Deliverables

  • Implement a generic time series data model for ZMON time series metrics
  • Move ZMON time series metrics data model out of KairosDB writer and store code
  • Implement TSDB writers (M3DB, IronDB, InfluxDB) and store

Fix propagation of tracing span through async operation

As a developer
I want the tracing span of an outgoing request (executed asynchronously) and the tracing span of the incoming request that triggered it to belong to the same trace
So that I get a complete tracing picture

Explanation

I'm going to explain it with a single example:

data-service receives requests in the DataServiceController#putData handler. There it starts a new span (possibly belonging to an existing trace if the request contains tracing information). Then it calls:
de.zalando.zmon.dataservice.data.WorkResultWriter#write (implemented by de.zalando.zmon.dataservice.data.ApplicationMetricsWriter#write) and then
de.zalando.zmon.dataservice.data.AppMetricsClient#receiveData.
AppMetricsClient#receiveData sends outgoing requests to the metric-cache asynchronously with org.apache.http.client.fluent.Async. The problem is that the span context is not propagated to the async processor. Therefore the outgoing request triggers the generation of a new tracing span that is not bound to the original trace; it becomes a separate one.

We have not solved this problem for async operations (yet). For the data-service the problem exists everywhere async processing is used. It is also relevant to zmon-controller and other Java components that execute async operations without trace propagation.
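The usual fix is to capture the active span on the submitting thread and re-activate it inside the async task. A minimal sketch, with a ThreadLocal standing in for the tracer's scope manager (the real OpenTracing API would use tracer.activeSpan() and scopeManager().activate()):

```java
// Sketch: wrap the async task so the caller's span travels with it.
public class SpanPropagation {
    // Stand-in for the tracer's thread-bound active span.
    static final ThreadLocal<String> ACTIVE_SPAN = new ThreadLocal<>();

    public static Runnable withSpan(Runnable task) {
        String captured = ACTIVE_SPAN.get(); // captured on the submitting thread
        return () -> {
            ACTIVE_SPAN.set(captured);       // re-activated on the worker thread
            try {
                task.run();
            } finally {
                ACTIVE_SPAN.remove();
            }
        };
    }
}
```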

Surefire plugin fails

on any mvn build with the newest OpenJDK which involves a test run:

9:37:46 AM -------------------------------------------------------
9:37:46 AM T E S T S
9:37:46 AM -------------------------------------------------------
9:37:46 AM Error: Could not find or load main class org.apache.maven.surefire.booter.ForkedBooter
9:37:46 AM 
9:37:46 AM Results :
9:37:46 AM 
9:37:46 AM Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
9:37:46 AM 
9:37:46 AM [INFO] ------------------------------------------------------------------------
9:37:46 AM [INFO] BUILD FAILURE
9:37:46 AM [INFO] ------------------------------------------------------------------------
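This matches the widely reported ForkedBooter problem with newer OpenJDK builds. Two commonly suggested remedies (not verified against this repository) are upgrading maven-surefire-plugin to 2.22.x or disabling the system class loader for the forked test JVM, e.g.:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- workaround for "Could not find or load main class ...ForkedBooter" -->
    <useSystemClassLoader>false</useSystemClassLoader>
  </configuration>
</plugin>
```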

Refactor KairosDB storage configuration

It's currently possible to configure KairosDB storage using a list of lists, as found in DataServiceConfigProperties.java.

It's not immediately clear what each level of those lists stands for, which can lead to misuse (partitions instead of replicas).

I propose changing this configuration so that the KairosDB storage is configured by explicitly stating which partition(s) and replica(s) to use.

For ex.:

{
    "dataservice": {
        "storage": {
            "replicas": [
                {
                    "url": "http://endpoint-for-replica-1"
                },
                {
                    "partitions": [
                        "http://endpoint-for-replica-2-partition-1",
                        "http://endpoint-for-replica-2-partition-2",
                        "http://endpoint-for-replica-2-partition-n"
                    ]
                },
                {
                    "url": "http://endpoint-for-replica-n"
                }
            ]
        }
    }
}

Each element of the outermost replicas list has either a single url attribute, representing the single endpoint for that replica, or a partitions attribute that accepts a list with each partition's endpoint for that replica.
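A possible shape for the corresponding configuration properties (class and field names hypothetical):

```java
import java.util.List;

// Sketch of the refactored properties: each replica carries either a single
// url or an explicit list of partition endpoints.
public class StorageProperties {
    public static class Replica {
        public String url;               // single endpoint for this replica, or null
        public List<String> partitions;  // per-partition endpoints, or null

        // Resolve all endpoints for this replica.
        public List<String> endpoints() {
            return (partitions != null && !partitions.isEmpty())
                    ? partitions
                    : List.of(url);
        }
    }

    public List<Replica> replicas;
}
```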

Add batch API options

Some API calls could be batched, like:

  • Deleting multiple entities
  • Adding entities

Both are done by the agent after each cycle.

Missing error processing in ControllerProxy

How to reproduce

  1. Install whole ZMON stack
  2. Break ZMON Controller API so it returns errors (e.g. because of broken auth configuration, so requests from data-service are not authorised)
  3. Observe Index Out of Bounds Exception coming from de.zalando.zmon.dataservice.proxies.ControllerProxy#proxyForLastModifiedHeader

Explanation

When the ZMON Controller API returns error responses, data-service still tries to get the "Last-Modified" header from them and fails. The data service should check whether the header is present in the response before trying to extract it.

Acceptance criteria

  • Instead of an IndexOutOfBoundsException, the data-service must report that the "Last-Modified" header was not found and provide the response data, so it's easier for developers to investigate
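The guard could look like this (a sketch; the real code reads Apache HttpClient headers rather than a Map):

```java
import java.util.Map;
import java.util.Optional;

// Sketch: look the header up safely and let the caller decide how to report
// its absence, instead of indexing into an empty header array.
public class LastModifiedExtractor {
    public static Optional<String> lastModified(Map<String, String> headers) {
        return Optional.ofNullable(headers.get("Last-Modified"));
    }

    // Fail with a descriptive message that includes the response data.
    public static String lastModifiedOrReport(Map<String, String> headers, String responseBody) {
        return lastModified(headers).orElseThrow(() ->
                new IllegalStateException(
                        "Last-Modified header not found; response was: " + responseBody));
    }
}
```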

KairosDB Proxy: bad request should not produce stack traces

May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: [2016-05-31 12:04:33.916] boot - 11 ERROR [http-nio-8080-exec-1] --- [dispatcherServlet]: Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: org.apache.http.client.HttpResponseException: Bad Request
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:47)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:40)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.Response.handleResponse(Response.java:90)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at org.apache.http.client.fluent.Response.returnContent(Response.java:97)
 May 31 12:04:33 ip-172-31-145-144 docker/c63c2ddc1ae4[877]: #011at de.zalando.zmon.dataservice.proxies.kairosdb.KairosdbProxy.kairosDBPost(KairosdbProxy.java:102)
