
opentsp's Introduction

Documentation

See the wiki.

Issues

Use GitHub Issues to report problems or to get help.

opentsp's People

Contributors

andredasilvapinto, j-brooks, masiulaniec, pharaujo, the-cloud-source

opentsp's Issues

Show the failing tests if something breaks during the rpmbuild process

Our Jenkins job runs make rpm to build the project and create the rpm for the repositories.

Because rpmbuild is called with the --quiet option
https://github.com/betfair/opentsp/blob/master/cmd/tsp-controller/dist/Makefile#L10

all the output of the build job is "swallowed", which means that when a test fails we will only get output such as this:

rpmbuild -bb --quiet --define 'version 21' service.spec
make[2]: *** [build] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.54232 (%build)
    Bad exit status from /var/tmp/rpm-tmp.54232 
make[1]: *** [rpm] Error 1
make[1]: Leaving directory `/var/jenkins/workspace/TSDB/OpenTSP/src/opentsp.org/cmd/tsp-controller/dist'
make: *** [rpm] Error 2

This gives no clue about which test actually failed. Of course we can (and should) run the tests locally, but I think seeing failures should also be possible when building the rpm on CI tools.

Maybe including the --quiet option conditionally, based on an environment variable (e.g. a hypothetical RPMBUILD_FLAGS ?= --quiet default in the Makefile that a CI job could override), would offer enough flexibility for both cases?

cmd/tsp-aggregator: self-stats absent in opentsdb

The aggregator's self-stats tsp.aggregator.* are only available in the aggregator's own real-time stream; direct subscribers like OpenTSDB never observe the aggregator's metrics. One solution would be to make all programs in the tsp-* family publish their stats in a conventional place (e.g. /var/run/tsp-aggregator.sock in the aggregator's case), which tsp-forwarder could detect and scrape automatically.
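
A minimal sketch of that convention, assuming the socket path above (the stats payload here is a placeholder for the live tsp.aggregator.* series):

    package main

    import (
        "fmt"
        "log"
        "net"
        "os"
        "time"
    )

    const statsSocket = "/var/run/tsp-aggregator.sock"

    var start = time.Now()

    func serveSelfStats() {
        os.Remove(statsSocket) // clear a stale socket left by a previous run
        l, err := net.Listen("unix", statsSocket)
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := l.Accept()
            if err != nil {
                log.Print(err)
                continue
            }
            go func(c net.Conn) {
                defer c.Close()
                // A real implementation would dump the live tsp.aggregator.*
                // series here, in the usual line-based format.
                fmt.Fprintf(c, "tsp.aggregator.uptime %d %d\n",
                    time.Now().Unix(), int64(time.Since(start).Seconds()))
            }(conn)
        }
    }

    func main() { serveSelfStats() }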

internal/tsdb/filter: add support for white-list filtering

At the moment filtering works only in "black list" mode: once you specify a filtering rule, everything else is accepted. There are cases where it is useful to operate in "white list" mode instead; in other words, to allow one to configure rules describing which points are allowed to pass, and to block everything else.

E.g. a sample configuration that accepts only metric points whose "path" attribute starts with "/app"; any other metric point is blocked/dropped:

"Filter": [
{ "Match": ["", "path", "^/app"], "Block": false },
{ "Block": true }
]

I propose to enhance the meaning of the Block attribute: true means "block", false means "pass", and when not specified the rule does not stop evaluation of the following rules (e.g. the case of Set rules).
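
A sketch of these semantics (illustrative types, not the current internal/tsdb/filter API; a predicate stands in for the Match triple):

    package filter

    type Point struct {
        Metric string
        Tags   map[string]string
    }

    type Rule struct {
        Match func(Point) bool // stand-in for the ["metric", "tag", "regexp"] triple
        Block *bool            // true: block, false: pass, nil: keep evaluating
    }

    // blocked reports whether the rules reject the point. If no terminal
    // (Block-carrying) rule matches, the point passes, preserving the
    // existing black-list default.
    func blocked(rules []Rule, p Point) bool {
        for _, r := range rules {
            if r.Match != nil && !r.Match(p) {
                continue
            }
            if r.Block != nil {
                return *r.Block
            }
            // Block unset (e.g. a Set rule): apply side effects, keep going.
        }
        return false
    }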

internal/tsdb: make decoderMaxSeries configurable?

Our aggregator is constantly hitting the "too many time series" limit.

$ sudo tail -30000 /var/log/tsp/aggregator.log | uniq -c
    466 2014/12/17 17:13:34 collect.d/site.sh: too many time series (>1000000)
   3757 2014/12/17 17:13:37 collect.d/site.sh: too many time series (>1000000)
   2028 2014/12/17 17:13:39 collect.d/site.sh: too many time series (>1000000)
  11029 2014/12/17 17:13:52 collect.d/site.sh: too many time series (>1000000)
   1372 2014/12/17 17:13:53 collect.d/site.sh: too many time series (>1000000)
    863 2014/12/17 17:13:59 collect.d/site.sh: too many time series (>1000000)
      2 2014/12/17 17:14:00 collect.d/site.sh: too many time series (>1000000)
   1491 2014/12/17 17:14:04 collect.d/site.sh: too many time series (>1000000)
   6059 2014/12/17 17:14:05 collect.d/site.sh: too many time series (>1000000)
   2932 2014/12/17 17:14:07 collect.d/site.sh: too many time series (>1000000)
      1 2014/12/17 17:14:08 collect.d/site.sh: too many time series (>1000000)

Is there a reason why it's hardcoded to 1M? Can we make it configurable?
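
As a sketch, the knob could look like this (field and constant names are assumptions, not the current opentsp config schema):

    package tsdb

    const defaultMaxSeries = 1000000

    type Config struct {
        // MaxSeries bounds the number of distinct time series the decoder
        // tracks; 0 keeps the historical 1M default.
        MaxSeries int
    }

    func (c *Config) maxSeries() int {
        if c.MaxSeries > 0 {
            return c.MaxSeries
        }
        return defaultMaxSeries
    }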

Possible to have duplicated objects

Even though the configuration doesn't allow duplicated targets, it is possible to provoke duplicated objects if, for example, we erroneously search for a process that is actually a cluster id: view.Objects simply appends the results of the host and process queries without checking whether they include duplicates.

https://github.com/betfair/opentsp/blob/master/cmd/tsp-controller/control/collect-jmx/control.go#L319

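A minimal sketch of a fix: dedupe after appending, assuming objects can be keyed by some identifier (Object here is a stand-in for the actual control-jmx type):

    package control

    // Object stands in for the real type; the point is only the dedup step.
    type Object struct {
        ID string
    }

    // dedup drops repeated objects while preserving order, so that appending
    // the host and process query results can no longer yield duplicates.
    func dedup(objects []Object) []Object {
        seen := make(map[string]bool, len(objects))
        out := objects[:0]
        for _, o := range objects {
            if seen[o.ID] {
                continue
            }
            seen[o.ID] = true
            out = append(out, o)
        }
        return out
    }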

cmd/tsp-forwarder: add ListenAddr (obsolete collect-site)

Add a ListenAddr setting that specifies the listen address for the built-in data point listener (this listener is currently implemented in cmd/collect-site).

The missing ListenAddr setting is the last thing preventing tsp-controller from managing all details of the network topology. Currently, the operator must pass the listen port as a flag to collect-site, even though this port is already declared in /etc/tsp-controller/network.

This is particularly relevant on machines that try to run both opentsdb and tsp-aggregator: each tries to listen on :4242, but only one can.
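
A rough sketch of the wiring (the field name comes from this issue; the rest is assumed):

    package forwarder

    import (
        "log"
        "net"
    )

    // Config sketches only the new field; the rest of the forwarder
    // configuration is omitted.
    type Config struct {
        ListenAddr string // e.g. ":4242"; empty disables the built-in listener
    }

    func maybeListen(cfg *Config) net.Listener {
        if cfg.ListenAddr == "" {
            return nil // setting absent: nothing to bind
        }
        l, err := net.Listen("tcp", cfg.ListenAddr)
        if err != nil {
            log.Fatalf("ListenAddr: %v", err)
        }
        return l
    }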

internal/relay: support filtering at relay level

The current implementation assumes all relays are interested in receiving the same stream of metric points. This is true when running in forwarder mode, but not always true when running in aggregator mode. Supporting filter configuration at the relay level would allow one to specify which points should be forwarded or blocked for each relay.

E.g. assume an aggregator receives 100K points/s and is configured to forward to 4 relays. If we know that one of the relays is not interested in receiving metric X, why not filter it out before sending the metric over the network? This can save bandwidth at the expense of additional CPU usage. The bandwidth saving is most obvious when one of the relays is interested only in metric X (which might represent a tiny fraction of the entire stream).
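
Sketch of the idea, with plain predicates standing in for the real Filter rule format:

    package relay

    type Point struct {
        Metric string
        Tags   map[string]string
    }

    // Relay carries an optional per-relay filter chain; an empty chain keeps
    // today's behavior of sending everything.
    type Relay struct {
        Host  string
        Block []func(Point) bool // per-relay block predicates (illustrative)
    }

    func (r *Relay) wants(p Point) bool {
        for _, blocked := range r.Block {
            if blocked(p) {
                return false // drop before the point hits the network
            }
        }
        return true
    }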

contrib/collect-netscaler: nitro authentication fails on NetScaler 10.5

2015/09/09 06:19:11 nitro: Config: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)
2015/09/09 06:19:11 nitro: Stat: request error: cookie refresh error: got http code 404 (Not Found)

I traced it to a difference in the Nitro API: requesting the /nitro/v1/ endpoint no longer prompts for authentication but instead serves a 404. Both the config and stats endpoints under that base still request HTTP basic auth.

The documentation for 10.5 is down at the time of writing (Google cache), but it states that authentication should use the /nitro/v1/config/login endpoint, which would also work for older 10.1 and 9.3 NetScalers.

I made a quick fix for it, sketched below. I'll send a pull request for review promptly.
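
The fix, roughly sketched (payload shape as I read the Nitro docs; error handling trimmed):

    package nitro

    import (
        "bytes"
        "fmt"
        "net/http"
    )

    // nitroLogin authenticates against the explicit login endpoint instead
    // of relying on a 401 challenge from /nitro/v1/. The session cookie is
    // retained by client.Jar for the subsequent config and stat requests.
    func nitroLogin(client *http.Client, base, user, pass string) error {
        body := fmt.Sprintf(`{"login":{"username":%q,"password":%q}}`, user, pass)
        resp, err := client.Post(base+"/nitro/v1/config/login",
            "application/json", bytes.NewBufferString(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusCreated {
            return fmt.Errorf("nitro login: got http code %d", resp.StatusCode)
        }
        return nil
    }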

cmd/tsp-forwarder: prevent overconsumption of CPU

The GOMAXPROCS default is changing; see the proposal at https://golang.org/s/go15gomaxprocs

In response to this, I think cmd/tsp-forwarder should call runtime.GOMAXPROCS(1). It should be enough for everybody, because the heavy lifting is done by cmd/tsp-poller and cmd/tsp-aggregator.

This is partially a security feature: CPU consumption is affected by tsp-controller (think Filter rules), which could therefore be compromised and used as a DoS attack vector.
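
The whole change would amount to something like:

    package main

    import "runtime"

    // Pin the forwarder to one CPU regardless of the new Go 1.5 default.
    func init() {
        runtime.GOMAXPROCS(1)
    }

    func main() { /* the rest of tsp-forwarder's main */ }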

internal/tsdb/filter: add support for exact match filtering (without using Regexp)

Oftentimes one knows the exact metric names or tag values that need to be accepted or blocked. In these cases it is more efficient to skip regular expressions and simply look up the metric name and tag values of the evaluated point in a map of preconfigured values.

Here is an example of a configuration I have already been using (I named the filter OneOf, though I am not 100% pleased with this name):

"Filter": [
    {
     "OneOf": {
       "Metrics": ["metricA", "metricB"],
       "Tags": { "cluster": ["x", "y", "z"], "anotherTag": ["value1", "value2"] }
     },
     "Block": true
    }
 ]
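
Evaluation then becomes plain map lookups; a sketch with illustrative types (not the current filter internals):

    package filter

    type Point struct {
        Metric string
        Tags   map[string]string
    }

    // OneOf matches by O(1) set lookups instead of regular expressions.
    type OneOf struct {
        Metrics map[string]bool            // set of exact metric names
        Tags    map[string]map[string]bool // tag name -> set of exact values
    }

    func (o *OneOf) matches(p Point) bool {
        if len(o.Metrics) > 0 && !o.Metrics[p.Metric] {
            return false
        }
        for tag, values := range o.Tags {
            if !values[p.Tags[tag]] {
                return false
            }
        }
        return true
    }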

cmd/tsp-controller: Allow multiple poller instances

The controller's network file and related data structures only allow for one poller instance. Some reasons we could use more than one poller:

  • Active-Passive HA setup;
  • Sharding pollers across multiple hosts for performance reasons.

cmd/tsp-controller: support multi-address host attributes

Quoting tsp-forwarder(1):

          Host (string)
                 Server address in host:port format. If port is  not  pro-
                 vided,  it  defaults  to  4242.  The protocol used is the
                 OpenTSDB line-based telnet protocol. For load  balancing,
                 multiple  servers  may be defined using comma to separate
                 server addresses. The traffic will be sent to all  listed
                 servers,  partitioned using a hash of time series identi-
                 fier.

The "comma-separated multiple servers" case is currently broken when used with in /etc/tsp-controller/network, for example:

<aggregator host="foo.example.com:4242,bar.example.com:4242"/>

is intended to allow deploying tsp-aggregator to multiple machines for scalability reasons but is currently unsupported: tsp-controller rejects control requests with the "not an aggregator" error message.
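
For reference, the partitioning the man page describes behaves roughly like this sketch (fnv stands in for whatever hash tsp-forwarder actually uses):

    package relay

    import (
        "hash/fnv"
        "strings"
    )

    // pickServer partitions traffic across the comma-separated servers by a
    // hash of the time series identifier, per the documented behavior.
    func pickServer(hostSetting, seriesID string) string {
        servers := strings.Split(hostSetting, ",")
        h := fnv.New32a()
        h.Write([]byte(seriesID))
        return servers[int(h.Sum32())%len(servers)]
    }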

cmd/tsp-controller: obsolete the 'dedup' attribute

As is, when registering a subscriber in /etc/tsp-controller/network, the operator is burdened with having to know the correct value of the "dedup" attribute. This "dedup" attribute translates to the aggregator's "DropRepeats" setting.

However, practice suggests that "DropRepeats" is a subscriber's internal detail. It would be better if the subscriber had a way of communicating its value back to the aggregator without any extra human involvement.

One way of doing it would be to provide better specification for the "version" command. For example:

  1. If the response to the "version" command begins with an open curly brace "{", then it is assumed to be a JSON object that contains overrides for the "Relay" settings, for example:

     >>> version
     <<< {"DropRepeats": false}

  2. If the "version" command produces any other response, then the subscriber is assumed to be OpenTSDB, and DropRepeats is set to true.

This approach has two nice benefits down the road. Firstly, it finally provides clear semantics for the "version" command, which currently has unacceptably ad hoc handling in the source code. Secondly, it paves the way for potential subscriber-supplied aggregator-side filtering rules, e.g.:

>>> version
<<< {"Filter": [{"Match": "foo", "Block": true}]}

But one must be careful to allow the subscriber to override only some of its Relay settings. All other settings should be presumed immutable for reliability/security reasons.
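
On the aggregator side, the handshake could be as simple as this sketch (assuming the rules above; only DropRepeats is honored, keeping the other Relay settings immutable):

    package relay

    import (
        "encoding/json"
        "strings"
    )

    // Overrides whitelists the only Relay setting a subscriber may change.
    type Overrides struct {
        DropRepeats *bool `json:"DropRepeats"`
    }

    // dropRepeats interprets the subscriber's response to the "version"
    // command under the proposed rules.
    func dropRepeats(response string) bool {
        if strings.HasPrefix(strings.TrimSpace(response), "{") {
            var o Overrides
            if err := json.Unmarshal([]byte(response), &o); err == nil && o.DropRepeats != nil {
                return *o.DropRepeats
            }
        }
        return true // any other response: assume OpenTSDB, which wants DropRepeats
    }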
