
Refinery - the Honeycomb Sampling Proxy

Release Information

For a detailed list of linked pull requests merged in each release, see CHANGELOG.md. For more readable information about recent changes, please see RELEASE_NOTES.md.

Purpose

Refinery is a tail-based sampling proxy and operates at the level of an entire trace. Refinery examines whole traces and intelligently applies sampling decisions to each trace. These decisions determine whether to keep or drop the trace data in the sampled data forwarded to Honeycomb.

A tail-based sampling model allows you to inspect an entire trace at one time and make a decision to sample based on its contents. For example, your data may have a root span that contains the HTTP status code to serve for a request, and another span that contains information on whether the data was served from a cache. Using Refinery, you can choose to keep only traces that had a 500 status code and were also served from a cache.

Refinery's tail sampling capabilities

Refinery supports several kinds of tail sampling:

  • Dynamic sampling - This sampling type configures a key based on a trace's set of fields and automatically increases or decreases the sampling rate based on how frequently each unique value of that key occurs. For example, using a key based on http.status_code, you can include in your sampled data:
    • one out of every 1,000 traces for requests that return 2xx
    • one out of every 10 traces for requests that return 4xx
    • every request that returns 5xx
  • Rules-based sampling - This sampling type enables you to define sampling rates for well-known conditions. For example, you can keep 100% of traces with an error and then apply dynamic sampling to all other traffic.
  • Throughput-based sampling - This sampling type enables you to sample traces based on a fixed upper-bound for the number of spans per second. The sampler will dynamically sample traces with a goal of keeping the throughput below the specified limit.
  • Deterministic probability sampling - This sampling type consistently applies sampling decisions without considering the contents of the trace other than its trace ID. For example, you can include 1 out of every 12 traces in the sampled data sent to Honeycomb. This kind of sampling can also be done using head sampling, and if you use both, Refinery takes that into account.

Refinery lets you combine all of the above techniques to achieve your desired sampling behavior.
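
For example, the following rules.yaml sketch (illustrative only, written against the Refinery 2.x rules format; verify the field and sampler names against the Refinery sampling methods documentation for your version) keeps every trace with a 5xx status and applies a dynamic sampler keyed on http.status_code to everything else:

RulesVersion: 2
Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        # Rules-based: always keep traces that contain an error response
        - Name: keep 5xx responses
          SampleRate: 1
          Conditions:
            - Field: http.status_code
              Operator: ">="
              Value: 500
              Datatype: int
        # Fallback: dynamic sampling keyed on status code for all other traffic
        - Name: dynamically sample everything else
          Sampler:
            EMADynamicSampler:
              GoalSampleRate: 10
              FieldList:
                - http.status_code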

Setting up Refinery

Refinery is designed to sit within your infrastructure where all traces can reach it. Refinery can run standalone or be deployed in a cluster of two or more Refinery processes accessible via a separate load balancer.

Refinery processes must be able to communicate with each other to concentrate traces on single servers.

Within your application (or other Honeycomb event sources), you would configure the API host to be http(s)://load-balancer/. Everything else, such as the API key and dataset name, remains the same, since all of that lives with the originating client.

Minimum Configuration

Every Refinery instance should have a minimum of:

  • a linux/amd64 or linux/arm64 operating system
  • 2GB RAM for each server used
  • Access to 2 cores for each server used

In many cases, Refinery only needs one node. If you are experiencing a large volume of traffic, you may need to scale out to multiple nodes, and you will likely need a small Redis instance to coordinate them.

We recommend increasing the amount of RAM and the number of cores after your initial set-up. Additional RAM and CPU can be used by increasing configuration values; in particular, CacheCapacity is an important configuration value. Refinery's Stress Relief system provides a good indication of how hard Refinery is working, and when invoked, logs (as reason) the name of the Refinery configuration value that should be increased to reduce stress. Use our scaling and troubleshooting documentation to learn more.

Setting up Refinery in Kubernetes

Refinery is available as a Helm chart in the Honeycomb Helm repository.

You can install Refinery with the following command, which uses the default values file:

helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm install refinery honeycomb/refinery

Alternatively, supply your own custom values file:

helm install refinery honeycomb/refinery --values /path/to/refinery-values.yaml

where /path/to/refinery-values.yaml is the path to your custom values file.

Peer Management

When operating in a cluster, Refinery expects to gather all of the spans in a trace onto a single instance so that it can make a trace decision. Since each span arrives independently, each Refinery instance needs to be able to communicate with all of its peers in order to distribute traces to the correct instance.

This communication can be managed in two ways: via an explicit list of peers in the configuration file, or by using self-registration via a shared Redis cache. Installations should generally prefer to use Redis. Even in large installations, the load on the Redis server is quite light, with each instance only making a few requests per minute. A single Redis instance with fractional CPU is usually sufficient.
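
A minimal config.yaml sketch for Redis-based peer management might look like the following; the field names match the environment variable table later in this document, but the exact schema differs between Refinery versions, so check the configuration documentation (the hostname here is a placeholder):

PeerManagement:
  Type: redis
  RedisHost: refinery-redis.internal.example.com:6379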

Configuration

Configuration is controlled by Refinery's two configuration files, which are generally referred to as config.yaml for general configuration and rules.yaml for sampling configuration. These files can be loaded from an accessible filesystem, or loaded with an unauthenticated GET request from a URL.

Learn more about config.yaml and all the parameters that control Refinery's operation in our Refinery configuration documentation.

Learn more about rules.yaml and sampler configuration in our Refinery sampling methods documentation.

It is valid to specify more than one configuration source. For example, it would be possible to have a common configuration file, plus a separate file containing only keys. On the command line, specify multiple files by repeating the command line switch. In environment variables, separate multiple config locations with commas.

Running Refinery

Refinery is a typical linux-style command line application, and supports several command line switches.

refinery -h will print an extended help text listing all command line options and supported environment variables.

Environment Variables

Refinery supports the following key environment variables; please see the command line help or the online documentation for the full list. Command line switches take precedence over file configuration, and environment variables take precedence over both.

Environment Variable                    Configuration Field
REFINERY_GRPC_LISTEN_ADDRESS            GRPCListenAddr
REFINERY_REDIS_HOST                     PeerManagement.RedisHost
REFINERY_REDIS_USERNAME                 PeerManagement.RedisUsername
REFINERY_REDIS_PASSWORD                 PeerManagement.RedisPassword
REFINERY_HONEYCOMB_API_KEY              HoneycombLogger.LoggerAPIKey
REFINERY_HONEYCOMB_METRICS_API_KEY      LegacyMetrics.APIKey
REFINERY_HONEYCOMB_API_KEY              LegacyMetrics.APIKey
REFINERY_QUERY_AUTH_TOKEN               QueryAuthToken

Note: REFINERY_HONEYCOMB_METRICS_API_KEY takes precedence over REFINERY_HONEYCOMB_API_KEY for the LegacyMetrics.APIKey configuration.
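
As an illustration (not from the Refinery documentation), a containerized deployment might inject these variables through its orchestration layer; the Docker Compose sketch below is hypothetical and uses placeholder values:

services:
  refinery:
    image: honeycombio/refinery:latest
    environment:
      REFINERY_REDIS_HOST: "redis:6379"
      REFINERY_HONEYCOMB_API_KEY: "${HONEYCOMB_API_KEY}"
      REFINERY_QUERY_AUTH_TOKEN: "my-local-token"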

Managing Keys

Sending data to Honeycomb requires attaching an API key to telemetry. To make managing telemetry easier, Refinery supports the ReceiveKeys and SendKey config options, along with AcceptOnlyListedKeys and SendKeyMode. In various combinations, they have a lot of expressive power. Please see the configuration documentation for details on how to set these parameters.

A quick start for specific scenarios is below:

A small number of services

  • Set keys in your applications the way you normally would, and leave Refinery set to the defaults.

Large number of services, central key preferred

  • Do not set keys in your applications
  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to all

Applications must set a key, but control the actual key at Refinery

  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to nonblank

Replace most keys but permit exceptions

  • Set ReceiveKeys to the list of exceptions
  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to unlisted

Some applications have custom keys, but others should use central key

  • Set custom keys in your applications as needed, leave others blank
  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to missingonly

Only applications knowing a specific secret should be able to send telemetry, but a central key is preferred

  • Choose an internal secret key (any arbitrary string)
  • Add that secret to ReceiveKeys
  • Set AcceptOnlyListedKeys to true
  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to listedonly
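
As a sketch, this scenario might look like the following in config.yaml; the AccessKeys grouping is an assumption based on current Refinery releases, and the key values are placeholders, so confirm the exact schema against the configuration documentation:

AccessKeys:
  ReceiveKeys:
    - my-internal-secret          # the arbitrary shared secret your applications send
  AcceptOnlyListedKeys: true
  SendKey: REPLACE_WITH_VALID_HONEYCOMB_KEY
  SendKeyMode: listedonly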

Replace specific keys used by certain applications with the central key

  • Set AcceptOnlyListedKeys to false
  • Set ReceiveKeys to the keys that should be replaced
  • Set SendKey to a valid Honeycomb Key
  • Set SendKeyMode to listedonly

Dry Run Mode

When getting started with Refinery or when updating sampling rules, it may be helpful to verify that the rules are working as expected before you start dropping traffic. To do so, use Dry Run Mode in Refinery.

Enable Dry Run Mode by setting DryRun to true in your configuration file (config.yaml). Then, use Query Builder in the Honeycomb UI to run queries to check your results and verify that the rules are working as intended.

When Dry Run Mode is enabled, the metric trace_send_kept will increment for each trace, and the metric for trace_send_dropped will remain 0, reflecting that we are sending all traces to Honeycomb.
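
A minimal sketch of enabling it (the placement of DryRun in config.yaml varies between Refinery versions; recent releases group it under Debugging, so verify against the configuration documentation):

Debugging:
  DryRun: true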

Scaling Up

Refinery uses bounded queues and circular buffers to manage allocating traces, so even under high volume, memory use shouldn't expand dramatically. However, given that traces are stored in a circular buffer, when the throughput of traces exceeds the size of the buffer, things will start to go wrong. If you have statistics configured, a counter named collect_cache_buffer_overrun will be incremented each time this happens. The symptom of this is that traces will stop getting accumulated together, and instead spans that should be part of the same trace will be treated as two separate traces. All traces will continue to be sent (and sampled), but some sampling decisions will be made on incomplete data. The size of the circular buffer is a configuration option named CacheCapacity. To choose a good value, consider the throughput of traces (for example, traces / second started), multiply that by the maximum duration of a trace (such as 3 seconds), and then multiply that by some large buffer factor (maybe 10x). This estimate will give you good headroom.
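
For example, with roughly 1,000 traces started per second, a 3-second maximum trace duration, and a 10x buffer factor, a sketch of the resulting setting would be (whether CacheCapacity is top-level or nested under a group depends on your Refinery version):

# 1,000 traces/sec started x 3 sec max trace duration x 10 headroom = 30,000
CacheCapacity: 30000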

Determining the number of machines necessary in the cluster is not an exact science, and is best influenced by watching for buffer overruns. But for a rough heuristic, count on a single machine using about 2GB of memory to handle 5,000 incoming events and tracking 500 sub-second traces per second (for each full trace lasting less than a second and an average size of 10 spans per trace).

Stress Relief

Refinery offers a mechanism called Stress Relief that improves stability under heavy load. The stress_level metric is a synthetic metric on a scale from 0 to 100 that is constructed from several Refinery metrics relating to queue sizes and memory usage. Under normal operation, its value should usually be in the single digits. During bursts of high traffic, the stress levels might creep up and then drop again as the volume drops. As it approaches 100, it is more and more likely that Refinery will start to fail and possibly crash.

Stress Relief is a system that can monitor the stress_level metric and shed load when stress becomes a danger to stability. Once the ActivationLevel is reached, Stress Relief mode becomes active. In this state, Refinery deterministically samples each span based on TraceID without having to store the rest of the trace or evaluate rule conditions. Stress Relief remains active until stress falls below the DeactivationLevel specified in the config.

The stress relief settings are:

  • Mode - Setting to indicate how Stress Relief is used. never indicates that Stress Relief will not activate. monitor means Stress Relief will activate when the ActivationLevel is reached and deactivate when the stress level falls below the DeactivationLevel. always means that Stress Relief mode will continuously be engaged. The always mode is intended for use in emergency situations.
  • ActivationLevel - When the stress level rises above this threshold, Refinery will activate Stress Relief.
  • DeactivationLevel - When the stress level falls below this threshold, Refinery will deactivate Stress Relief.
  • SamplingRate - The rate at which Refinery samples while Stress Relief is active.
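
A hedged config.yaml sketch using these settings (the threshold values are illustrative, not recommendations; verify the StressRelief group name against the configuration documentation):

StressRelief:
  Mode: monitor          # activate automatically based on stress_level
  ActivationLevel: 90    # engage Stress Relief above this stress level
  DeactivationLevel: 75  # disengage once stress falls below this level
  SamplingRate: 100      # deterministic rate used while Stress Relief is active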

The stress_level is currently the best proxy for the overall load on Refinery. Even if Stress Relief is not active, if stress_level is frequently above 50, it is a good indicator that Refinery needs more resources -- more CPUs, more memory, or more nodes. On the other hand, if stress_level never goes into double digits it is likely that Refinery is overprovisioned.

Understanding Regular Operation

Refinery emits a number of metrics to give some indication about the health of the process. These metrics should be sent to Honeycomb, typically via OpenTelemetry, and can also be exposed to Prometheus. The interesting ones to watch are:

  • Sample rates: how many traces are kept / dropped, and what does the sample rate distribution look like?
  • [incoming|peer]_router_*: how many events (no trace info) vs. spans (have trace info) have been accepted, and how many sent on to peers?
  • collect_cache_buffer_overrun: this should remain zero; a positive value indicates the need to grow the size of Refinery's circular trace buffer (via configuration CacheCapacity).
  • process_uptime_seconds: records the uptime of each process; look for unexpected restarts as a clue that the process may be memory constrained.

Troubleshooting

Logging

The default logging level of warn is fairly quiet. The debug level emits too much data to be used in production, but contains excellent information in a pre-production environment, including trace decision information. The info level is somewhere in between. Setting the logging level to debug during initial configuration will help you understand what's working and what's not, but when traffic volumes increase it should be set to warn or even error. Logs may be sent to stdout or to Honeycomb.
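
For example, a pre-production instance might run with the sketch below and drop back to warn before taking production traffic; the Logger group and its fields are assumptions based on current releases, so verify them against the configuration documentation:

Logger:
  Type: stdout   # or honeycomb, to send Refinery's own logs to Honeycomb
  Level: debug   # use warn or error once traffic volumes increase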

Configuration Validation

Refinery validates its configuration on startup and whenever the configuration is reloaded, and it emits diagnostics for any problems. If validation fails on startup, Refinery will refuse to start; if it fails on reload, Refinery will keep the existing configuration.

Configuration Query

Check the loaded configuration by using one of the /query endpoints from the command line on a server that can access a Refinery host.

The /query endpoints are protected and can be enabled by specifying QueryAuthToken in the configuration file or specifying REFINERY_QUERY_AUTH_TOKEN in the environment. All requests to any /query endpoint must include the header X-Honeycomb-Refinery-Query set to the value of the specified token.
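
A minimal sketch of enabling the endpoints in config.yaml (the exact placement of QueryAuthToken may differ between Refinery versions); the token value here matches the one used in the curl examples below:

QueryAuthToken: my-local-token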

For file-based configurations (the only type currently supported), the hash value is identical to the value generated by the md5sum command for the given configuration file.

For all of these commands:

  • $REFINERY_HOST should be the URL of your refinery.
  • $FORMAT can be one of yaml, toml, or json.
  • $DATASET is the name of the dataset you want to check.

To retrieve the entire Rules configuration:

curl --include --get $REFINERY_HOST/query/allrules/$FORMAT --header "x-honeycomb-refinery-query: my-local-token"

To retrieve the rule set that Refinery uses for the specified dataset, which will be returned as a map of the sampler type to its rule set:

curl --include --get $REFINERY_HOST/query/rules/$FORMAT/$DATASET --header "x-honeycomb-refinery-query: my-local-token"

To retrieve information about the configurations currently in use, including the timestamp when the configuration was last loaded:

curl --include --get $REFINERY_HOST/query/configmetadata --header "x-honeycomb-refinery-query: my-local-token"

Sampling

Refinery can send telemetry that includes information to help debug the sampling decisions that it makes. To enable it, set AddRuleReasonToTrace to true in the configuration file. This will cause traces that are sent to Honeycomb to include a field meta.refinery.reason, which will contain text indicating which rule was evaluated that caused the trace to be included.
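
A config.yaml sketch (the RefineryTelemetry grouping is an assumption based on current releases; older versions set this at the top level, so check the configuration documentation):

RefineryTelemetry:
  AddRuleReasonToTrace: true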

Restarts

Refinery does not yet buffer traces or sampling decisions to disk. When you restart the process, all in-flight traces will be flushed (sent upstream to Honeycomb), but you will lose the record of past trace decisions. When started back up, it will begin with a clean slate.

Architecture of Refinery itself (for contributors)

Within each directory, the interface the dependency exports is in the file with the same name as the directory, and (for the most part) each of the other files is an alternative implementation of that interface. For example, in logger, logger/logger.go contains the interface definition and logger/honeycomb.go contains the implementation of the logger interface that sends logs to Honeycomb.

main.go sets up the app and makes choices about which versions of dependency implementations to use (e.g., which logger, which sampler, etc.). It starts up everything and then launches App.

app/app.go is the main control point. When its Start function ends, the program shuts down. It launches two Routers which listen for incoming events.

route/route.go listens on the network for incoming traffic. There are two routers running and they handle different types of incoming traffic: events coming from the outside world (the incoming router) and events coming from another member of the Refinery cluster (peer traffic). Once it gets an event, it decides where it should go next: is this incoming request an event (or batch of events), and if so, does it have a trace ID? Everything that is not an event or an event that does not have a trace ID is immediately handed to transmission to be forwarded on to Honeycomb. If it is an event with a trace ID, the router extracts the trace ID and then uses the sharder to decide which member of the Refinery cluster should handle this trace. If it's a peer, the event will be forwarded to that peer. If it's us, the event will be transformed into an internal representation and handed to the collector to bundle spans into traces.

collect/collect.go contains the Collector, which is responsible for bundling spans together into traces and deciding when to send them to Honeycomb or whether they should be dropped. The first time a trace ID is seen, the Collector starts a timer. If the root span, which is a span with a trace ID and no parent ID, arrives before the timer expires, then the trace is considered complete. The trace is sent and the timer is canceled. If the timer expires before the root span arrives, the trace will be sent whether or not it is complete. Just before sending, the Collector asks the sampler for a sample rate and whether or not to keep the trace. The Collector obeys this sampling decision and records it (the record is applied to any spans that may come in as part of the trace after the decision has been made). After making the sampling decision, if the trace is to be kept, it is passed along to the transmission for actual sending.

transmit/transmit.go is a wrapper around the HTTP interactions with the Honeycomb API. It handles batching events together and sending them upstream.

logger and metrics are for managing the logs and metrics that Refinery itself produces.

sampler contains algorithms to compute sample rates based on the traces provided.

sharder determines which peer in a clustered Refinery configuration is supposed to handle an individual trace.

types contains a few type definitions that are used to hand data in between packages.

refinery's Issues

Default OTLP port conflicts with Prometheus

The OTLP export endpoint and Prometheus metrics endpoint both try to bind to port 9090 by default.

The OTel Collector switched its default from 9090 to 4317, likely because of this conflict.

Sample config is missing some quotes

Inside the sample rules.toml there are sample values (commented out) for setting these:

DryRun = true
...
DryRunFieldName = refinery_kept

If you uncomment them, the second needs quotes:

% docker run --rm -it --volume $(pwd):/etc/refinery --expose 8080 --publish 44444:8080 honeycombio/refinery:latest /usr/bin/refinery
unable to load config: While parsing config: (10, 19): no value can start with r

Using refinery and otel-collector to route traces based on content (add dataset to inmemory collector cache key)

Started the discussion on the pollinators slack, here.

I'm trying to use refinery as one of the tools to achieve "trace routing" or "trace multiplexing" capabilities. It is an unusual use case, but here is the context of the ask.

This might be an unconventional use case, and it seems like it runs into some problems here, because Refinery uses only the trace ID to group spans.

Details πŸ‘‡
Context:
Using istio and a shared ingress gateway for multiple applications/environments/teams (each mapping to its own dataset)
This means I have to split the destination of the traces based on the content they carry (not doable in istio alone, or any app).

What am I trying to do?
Get those istio traces to the correct dataset, by means of:
Shared istio gateway -> otel-collector (fan out to 2 exporters, one for each dataset) -> refinery (rulesbasedSampler - drop traces of other environments/namespaces) -> HC
This effectively duplicates all traces generated by istio, and adds the meta of different datasets to each.
After that, the thought was to drop, based on Refinery rules, the ones going to the wrong dataset.

What ends up happening?
For a single exporter, it works great. When multiple exporters are enabled, traces don't get evaluated properly, with some being kept/dropped, seemingly in a random fashion.

What I think the problem is:
(No batching is enabled in otel)
When refinery collects the spans in memory and collates them to form a trace here, only the TraceID is taken into account, so this (sometimes) combines both exporters' data, with different datasets as upstream, into one single trace, making the sampling decision look wrong.

The actual questions
Does this sound right, or did I get it all wrong? 😃
Was this a conscious decision, or am I trying to use the tool for an unexpected use case?

Possible solutions

  • Adding sp.Dataset to the cache object key as a prefix should fix it.
  • Using multiple refinery deployments, one for each exporter

I would prefer to do 1, but wanted to hear your thoughts before working on a PR 😃

Docker hub image at version 0.7.0, tag new release?

I was looking at the image available on Docker Hub and I see it is a bit old. Would you recommend using the 0.7.0 image that is on Docker Hub?

Changes since last release:
ba8a5b7...main

If using version 0.7.0 is not recommended, would it be possible to tag a more recent commit as a release so docker hub is updated?

Redis In-Transit Encryption (TLS) and password support

When using Redis for peer discovery, if Redis has in-transit encryption enabled, Samproxy will fail with an i/o timeout:

time="..." level=error msg="registration failed" err="read tcp 172.16.145.223:44362->172.16.4.75:6379: i/o timeout" name="http://ip-000-00-000-000.eu-west-1.compute.internal:8081" timeoutSec=10
time="..." level=error msg="failed to register self with peer store" error="read tcp 172.16.145.223:44362->172.16.4.75:6379: i/o timeout"

This is due to the Redis client used not having the option enabled:
https://github.com/honeycombio/samproxy/blob/7ad0d4372207be9e09d43678d5a7549a76e0ea8b/internal/peer/redis.go#L60-L65

As per Redislabs' "Redis Enterprise and Go" which also uses github.com/garyburd/redigo/redis, this requires specific parameters to be sent when initializing the client:

conn, err := redis.Dial("tcp","<endpoint>:<port>",
    redis.DialPassword("<password>"), // password support would be nice too!
    redis.DialTLSConfig(&tlsConfig),
    redis.DialTLSSkipVerify(true),
    redis.DialUseTLS(true))

Ensure all sample rule types are documented

From a hnycon/o11ycon talk on Refinery:

"There are more sample rule types in Refinery than are documented. Which feels like a bad thing to complain about. Like, why complain about more features? But we did have to read the source code to find these. And I might not have thought to look for them if I hadn’t used Legacy Refinery previously and known that some of these were exposed in the UI."

samproxy ignores proxy settings in the environment

If the environment variable HTTPS_PROXY is set, samproxy should use that proxy to send traffic. In the current state, traffic is sent directly to Honeycomb regardless of the setting of the environment.

Feature request: Time-limited sampling changes

Something that we've struggled with since implementing samproxy/refinery is being able to surface infrequently occurring and non-erroneous traces, such as canary deployments or services that just don't receive much traffic.

We've tried experimenting with using additional fields for the sampling key, such as including the service name or a "boost" attribute that can be controlled by the service, but it's mostly resulted in an unpredictable cardinality and throughput of all spans.

Some colleagues (@emauton and @conormcd) had an idea that was inspired by Fred Hebert's Recon to provide a mechanism of forcing the capture of all interesting spans for a limited time period. For example, with a hypothetical CLI and API, you'd be able to:

$ capture-traces 30 service:circle-www-api-canary1
or
$ capture-traces 15 service:circle-www-api-v1 name:circle.permissions/user-can-view-builds

There's some similarity with the LaunchDarkly proposal that was discussed in Slack.

Logging of stacktraces

What do you think about logging stacktraces when they are caught by the middleware?

#61 took a bit more effort to debug because I had to rebuild and ship a copy of the Docker image that called debug.PrintStack() to figure out where the panic was coming from. Dumping straight to STDERR amongst structured logging is pretty messy though. Should each of the frames be formatted as a logging field?

Should this behaviour always be enabled, or could it generate too many logs and would stopping-the-world be too detrimental to non-failing handlers?

I'd be happy to raise a PR once we've figured out the implementation details.

config validation fails if Peers list is empty and Type set to redis

config:

[PeerManagement]
Peers = []
Type = "redis"
RedisHost = "example.com"

This fails with:

unable to load config: Key: 'configContents.PeerManagement.Peers' Error:Field validation for 'Peers' failed on the 'required' tag

If set to the default value (["http://127.0.0.1:8081"]), it will be overwritten by the peer list from redis (as expected).

Add tests for int64 sample rates

Handling of int64 sample rates was added in #209. We should extend the tests in route_test.go to test int64 sample rates are handled as expected and are not accidentally broken in the future.

Add test for the following cases:

  • int64 value lower/equal to than int32 max - uses sample rate value
  • int64 value greater than int32 max - uses int32 max value

Redis peer configuration option to use FQDN instead of Hostname?

In GKE, pods can be resolved using their FQDN (if a headless service is set up) but unless I'm missing something I think the self-reported hostnames don't resolve?

I've forked the project to use FQDN in my own deployment and everything works perfectly. Is it worth introducing this as a configurable option?

Trace timeout metric

I would like to suggest adding a counter metric that is incremented when TraceTimeout is hit.

We are trying to tune SendDelay to avoid spans arriving after the trace has already been flushed from the Samproxy instance. We know that we have at least some spans arriving late because we see trace_sent_cache_hit being incremented. What we don't know is how often spans arrive after the trace has been evicted from the sent trace cache.

We know in this case that the span will be treated by Samproxy as a new trace. We also know that the root span for the trace has already come and gone, so the "new" trace will only be flushed when TraceTimeout is hit. If Samproxy provided a trace timeout metric then we'd have a proxy metric for late arriving spans.

Note, this metric would also be valuable for us in setting CacheCapacity as we don't know what portion of the cache is currently going towards caching these late arriving spans.

Make grpc ServerParameters configurable

gRPC has a set of ServerParameters that we should make configurable.

Specifically, the MaxConnectionAge and MaxConnectionAgeGrace are of interest to help keep gRPC connections balanced across a pool of refinery hosts.

An issue can arise when a host dies that all the connections will failover to the other hosts and remain there due to the long-lived nature of gRPC connections. The new host that spins up to replace the dead host will never receive any incoming connections and is then underutilized while the other hosts are overburdened.

cannot find package "github.com/vmihailenco/msgpack/v4"

My docker build fails because it cannot find package "github.com/vmihailenco/msgpack/v4". Have you experienced this issue? Any advice on how to resolve it?

docker build -t samproxy .
Sending build context to Docker daemon  9.204MB
Step 1/6 : FROM golang:alpine
alpine: Pulling from library/golang
df20fa9351a1: Already exists 
ed8968b2872e: Pull complete 
a92cc7c5fd73: Pull complete 
00abe362642d: Pull complete 
e0491c15e88f: Pull complete 
Digest: sha256:d6deb50437547fd7226890aa19b075fbf7f5063e3c0d7a2eaf7fb41d11069013
Status: Downloaded newer image for golang:alpine
 ---> 30df784d6206
Step 2/6 : RUN apk add --update --no-cache git
 ---> Running in 74496b0f0285
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.12/community/x86_64/APKINDEX.tar.gz
(1/5) Installing nghttp2-libs (1.41.0-r0)
(2/5) Installing libcurl (7.69.1-r0)
(3/5) Installing expat (2.2.9-r1)
(4/5) Installing pcre2 (10.35-r0)
(5/5) Installing git (2.26.2-r0)
Executing busybox-1.31.1-r16.trigger
OK: 22 MiB in 20 packages
Removing intermediate container 74496b0f0285
 ---> 4e7c00c4eeb8
Step 3/6 : RUN go get github.com/honeycombio/samproxy/cmd/samproxy
 ---> Running in 7dc3b072ccf1
package github.com/vmihailenco/msgpack/v4: cannot find package "github.com/vmihailenco/msgpack/v4" in any of:
        /usr/local/go/src/github.com/vmihailenco/msgpack/v4 (from $GOROOT)
        /go/src/github.com/vmihailenco/msgpack/v4 (from $GOPATH)
The command '/bin/sh -c go get github.com/honeycombio/samproxy/cmd/samproxy' returned a non-zero code: 1

Msgpack support

Libhoney-go supports output in msgpack format. We'll want to support receipt of this in samproxy. In general, this should give us higher throughput.

How to make TotalThroughputSampler in terms of events per second

We have traces that vary in length substantially, and want to use TotalThroughputSampler to help us stay within rate limits, but having it sample in terms of traces per second results in too much variability in the output events per second.

Calls to the underlying dynsampler are done once per trace, which to my understanding increments the internal counter once per trace:

https://github.com/honeycombio/refinery/blob/v1.4.0/sample/totalthroughput.go#L58

To make this in terms of events per second, I see the only way to do this would be to call that line above in a loop, where the iterations == the number of spans in the trace, but that would be a pretty inefficient way of making this work. What approach would you recommend?

Remove traces from cache when sent

We have a theory that the memory problems that we're seeing in #94 may be a result of larger traces (by quantity and size of spans) being cached for longer during our low traffic periods. It occurred to us that traces (and their spans) aren't currently removed from the cache and eligible for GC until their position in the ring buffer is used again, regardless of whether the trace has already been sent.

Could traces be removed from the cache at the time of sending? Or, if the concurrent access is likely to cause race conditions, maybe zero out the references to spans, as they aren't size bound by the CacheCapacity?

Refinery Metrics are three different event goroutines instead of a single, aggregated event

Refinery Metric Events are running in three separate goroutines (metrics, upstreamMetrics, and peerMetrics), which can lead to instances where a single millisecond has three different metrics events published (each containing their own data along with the dynamic fields).

Ideally, these three different events would get pre-aggregated into a single coherent event.

Originally opened as #293, which was closed after further investigation.

Cache capacity and usage metrics

It'd be great to have some better metrics about how the cache is being used, both to use when sizing the cache and investigating memory usage. We'd particularly like to know how many traces in the cache are sent/free vs unsent/used and how many spans they contain. This might only be possible by periodically iterating through the cache - would you foresee this causing any performance problems?

Strange increases in memory usage

We've seen some strange increases in memory usage that seem to correspond with … πŸ€” … less traffic. We're running 10 nodes with a CacheCapacity of 150k each. They normally consume 6.5-7G of memory each, so we've set a limit of 8G. On two occasions now they've ramped up to the limit and have been OOM killed. This appears to repeat until the traffic levels return to normal.

We could increase the memory limit, but it's not clear how much would be required and we haven't been able to reproduce it within our control. We've thought about capturing pprof data from #82 but we're not sure that it will be possible in the small timeframe before the OOM. Is there anything else we should be looking at?

Metrics from during an incident:

Metrics from over a weekend (#89 was causing some additional restarts):

Support for Redis with Authentication/Password

My org only deploys PaaS-y instances of Redis, specifically Azure Cache for Redis and I'd like to be able to use an instance from that service to support Refinery.

This would require the ability to specify a password and to require TLS/SSL. I'll poke around to see if I can assist but it felt useful to open an issue first.

API errors are incorrectly reported as peer errors

It seems refinery checks the API host option in outgoing events to see if the message is destined for a peer, or the honeycomb API:

if honeycombAPI == apiHost {
    // if the API host matches the configured honeycomb API,
    // count it as an API error
    d.Metrics.Increment(d.Name + counterResponseErrorsAPI)
} else {
    // otherwise, it's probably a peer error
    d.Metrics.Increment(d.Name + counterResponseErrorsPeer)
}

During a conversation in the pollinators slack we noticed that log lines emitted by refinery for some messages showed that the event's api_host option was empty, even though the event was being sent to honeycomb:

Apr 01 12:13:01 production-refinery-i-00c73eea958d0b46c refinery[11758]: time="2021-04-01T12:13:01Z" level=error msg="non-20x response when sending event" api_host= dataset= error="Post \"https://api.honeycomb.io/1/batch/production.traces\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" event_type= status_code=0 target=

This caused timeout errors in the upstream honeycomb API to be treated as peer errors:

This is a bit misleading as it causes the operator to believe there's a problem with their cluster, when the issue is really coming from the honeycomb API.

Given metrics are already prefixed with upstream_ or peer_, I'm wondering whether we really need the _api and _peer suffixes, as it's a bit confusing. e.g. How should upstream_response_errors_peer differ from peer_response_errors_peer?

Would it be possible to remove this suffix so that operators can determine whether a metric is related to the honeycomb API, or the peer, by its prefix?

InsecureSkipVerify does not work. Request to make it a config

Right now there is no option to use InsecureSkipVerify. It would be nice if this could be a config option, like UseTLS.

The addition of SkipVerify in the options here is a no-op, as InsecureSkipVerify is only set if tlsConfig is nil here.

I proposed a solution in #249, in this commit e1620f0. @vreynolds requested our use case for InsecureSkipVerify. Heroku Redis uses self-signed certs, so in order to run refinery in a Heroku private/shield space along with our application we need to set InsecureSkipVerify=true. This will also apply to Heroku's customers.

Feature request: Option to enable/disable GZIP compression to peers

We've been monitoring our AWS bill as we've adopted refinery and one thing we've been surprised by is the jump in our AWS inter-az data transfer costs, which are more significant than we expected. We're currently running 3 c5.2xlarge instances of refinery, in separate AZs, and each one seems to be sending ~7MB/s of data to its other 2 peers. Some back of the envelope math suggests this could generate a bill of (7MB/s * 3600 * 24 * 30)/1000MB per GB * $0.02/GB = $362.88 for 1 node to talk to one peer. Given each node has 2 peers, and there are three nodes, it seems plausible that the bandwidth alone could cost in the ballpark of $2,172 - about 3 times the cost of running the cluster, without even considering the cost of transmitting the sampled-in data to Honeycomb.

While looking through the source code I noticed that refinery explicitly opts out of using gzip compression when communicating with peers. Do you have any context for why that decision was made (e.g. how much overhead did it create), and whether it would be possible to make this value configurable?

// gzip compression is expensive, and peers are most likely close to each other
// so we can turn off gzip when forwarding to peers
DisableGzipCompression: true,
EnableMsgpackEncoding: true,

Our cluster is currently running at about 30-40% CPU utilization, so I'd much rather use a few more CPU cycles if it could save us a fair chunk on our bandwidth bill.

Tag each event that flows through

To make querying easier, it would be good if we tagged each event that flows through refinery with a specific field.

We have derived columns that attempt to work it out using the user agent. These would also need to be updated to look for the event field.

Clarification around expected timeouts/backpressure when honeycomb API slows down?

The other day in the pollinators slack several folks noticed that their refinery clusters all had buffer overruns at the same time. After a bit of discussion it came to light that the honeycomb API had been having a bit of a slowdown around that time, which caused requests to the honeycomb API to stall and put backpressure on refinery's cache.

It seems like the metrics that honeycomb emits don't include statistics about how long upstream API requests are taking - would it be possible to add support for these metrics so that operators can understand if performance/capacity issues are related to the honeycomb API?

On a related note, it appears that libhoney configures a 60s timeout on all upstream requests. Given the decision to change the timeout from 10 seconds to 60 seconds was made 2 years ago I was wondering if that's still an appropriate timeout, or if a lower value would be more suitable?

For reference, here're some charts of what our refinery cluster looked like during the incident. You can see how our refinery instance started dropping incoming spans almost as soon as the API slowdown occurred (the upstream_response_errors_peer is actually for the honeycomb API, not refinery peers - see #244). The point where memory usage drops is where the instance OOMs and restarts:

As an aside, it's not really clear to me why the collect_cache_entries_max metric just disappears as soon as the system starts experiencing backpressure - is that something you would expect to happen?

Security: produce signed SHA256SUMS file

Right now there's no way of verifying that our binaries have not been tampered with after leaving circleci (e.g. by someone malicious who has access to our github account)

Add the calculated dynamic sampler key to outgoing traces

In the honeycomb-hosted refinery product there was a meta.dynamic_sampler_key field that was added to the root span of every trace that went through the dynamic sampler, and contained the sampling key calculated from all spans in the trace. This field was really helpful for evaluating different sampling configurations, and without it, it's really hard for us to understand the decisions the sampler is making.

Is it possible to have this field added to the open source version of refinery?

Segmentation fault upon accessing a potentially undefined status field of a span

Expected Behavior

A Refinery process is able to receive an OTLP span, encode it into a Span event with a trace ID, and use the sharder to route it to the correct Refinery member.

Current Behavior

The Refinery process segfaults and is killed when, as part of creating the Event, it tries to access an uninitialized Status field of a span:

refinery-7666b8c94d-k2hkp refinery panic: runtime error: invalid memory address or nil pointer dereference
refinery-7666b8c94d-k2hkp refinery [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xbc6844]
refinery-7666b8c94d-k2hkp refinery
refinery-7666b8c94d-k2hkp refinery goroutine 53981 [running]:
refinery-7666b8c94d-k2hkp refinery github.com/honeycombio/refinery/route.(*Router).Export(0xc000332020, 0xed4d78, 0xc00095bfb0, 0xc0003ac9c0, 0xc000332020, 0xc00095bfb0, 0xc00024bba0)
refinery-7666b8c94d-k2hkp refinery 	/app/route/route.go:452 +0x944
refinery-7666b8c94d-k2hkp refinery github.com/honeycombio/refinery/internal/opentelemetry-proto-gen/collector/trace/v1._TraceService_Export_Handler(0xd901a0, 0xc000332020, 0xed4d78, 0xc00095bfb0, 0xc0000b2ae0, 0x0, 0xed4d78, 0xc00095bfb0, 0xc000224000, 0x10db)
refinery-7666b8c94d-k2hkp refinery 	/app/internal/opentelemetry-proto-gen/collector/trace/v1/trace_service.pb.go:190 +0x214
refinery-7666b8c94d-k2hkp refinery google.golang.org/grpc.(*Server).processUnaryRPC(0xc000430700, 0xedbb58, 0xc000b56000, 0xc0009a3000, 0xc0004dea20, 0x1399250, 0x0, 0x0, 0x0)
refinery-7666b8c94d-k2hkp refinery 	/go/pkg/mod/google.golang.org/[email protected]/server.go:1194 +0x52b
refinery-7666b8c94d-k2hkp refinery google.golang.org/grpc.(*Server).handleStream(0xc000430700, 0xedbb58, 0xc000b56000, 0xc0009a3000, 0x0)
refinery-7666b8c94d-k2hkp refinery 	/go/pkg/mod/google.golang.org/[email protected]/server.go:1517 +0xd05
refinery-7666b8c94d-k2hkp refinery google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc000b966f0, 0xc000430700, 0xedbb58, 0xc000b56000, 0xc0009a3000)
refinery-7666b8c94d-k2hkp refinery 	/go/pkg/mod/google.golang.org/[email protected]/server.go:859 +0xab
refinery-7666b8c94d-k2hkp refinery created by google.golang.org/grpc.(*Server).serveStreams.func1
refinery-7666b8c94d-k2hkp refinery 	/go/pkg/mod/google.golang.org/[email protected]/server.go:857 +0x1fd

Steps to Reproduce

  1. An otel-collector configured with the OTLP exporter to forward spans to Refinery:
    exporters:
      otlp/refinery:
        endpoint: "refinery.tracing.svc.cluster.local:9090"
        insecure: true
        headers:
          x-honeycomb-team: some-key
          x-honeycomb-dataset: some-dataset
  2. A Refinery service running on the endpoint configured above.

Detailed Description

The OpenTelemetry spec defines the Status attribute of a span as optional.

The PR referenced above guards against a potential undefined memory access by defensively checking for a null pointer dereference.

File-based peer management with multi-instance failing because peers not found

Hello Refinery team! πŸ‘‹

When deploying refinery on Kubernetes with multiple instances and file-based peer management, during the initialization phase where refinery tries to look up its own hostname in the peer list, the lookup fails with a "no such host" error (because the first address in the peer list is not this specific instance, and the chosen address is not initialized yet).
When this occurs, the instance is then terminated.

I can see that when the lookup fails for an address, no retry mechanism is implemented here: https://github.com/honeycombio/refinery/blob/main/sharder/deterministic.go#L133

To add info, I used the official helm chart provided https://github.com/honeycombio/helm-charts/tree/main/charts/refinery, but changed the deployment type to a StatefulSet with a headless service (to have stable hostnames).

Please tell me if I am missing something here, maybe it's due to a lack of understanding on my part.

If the issue is real, however, it would be my pleasure to contribute!

Configuration through environment variables

Hej,

tldr; make all configuration properties available via environment variables to provide the means to dynamically adjust Refinery

we are introducing Refinery into our setup and stumbled upon the problem that the environment variables are not evaluated within the configuration file.

For example, a configuration like the following

[PeerManagement]
Type = "redis"
RedisHost = "refinery-leader-redis-${ENVIRONMENT}.mydomain:6379"

will not work. (I don't know if this is the behaviour of viper).

Looking at the implementation of Refinery we can see that only four environment variables are exposed for usage

c.BindEnv("PeerManagement.RedisHost", "REFINERY_REDIS_HOST")

Taking the 12-factor principle into account, I can't think of a design reason why not all properties are exposed through environment variables. It seems to be possible with viper by defining a SetEnvPrefix and running AutomaticEnv.

A use case in our scenario is the following:

  • we are running different environments (e.g. dev, stage, prod)
  • each of them uses a dedicated Refinery instance (plus redis)
  • we want to send the metrics of Refinery to our analysis backend, but into different datasets
    --> at this moment it is not possible to do this without baking in the different dataset names

Thanks for any input and maybe I am missing something simple here :)

Feature: Support "does-not-contain" as operator

Hello,

Seems like not all ui/api operators from the platform are supported on the RuleBasedSampler, here.

When using one of the unsupported operators, there is no indication of an error; it just doesn't match the desired trace/span.
It is also not documented anywhere (that I could find).

I'm looking specifically for that one, but it would be nice to have them all =)

Regards,

Missing Indication that sample decision was made on incomplete trace

When using the RulesBasedSampler, some long-running traces can be evaluated while they are still incomplete (missing the root span); this ends up creating results that could be unexpected by users.

A workaround is to have a longer TraceTimeout, but when those traces are not the norm, this could be detrimental to the size of the cache and to the inner workings of sample/transmit on other traces that might be broken for other reasons.

Since the information that the trace is incomplete already exists on the trace object, one could just add a log indication, or a field (like refinery_incomplete or refinery_evaluated) to the spans that the evaluation was executed on.

Let me know if you have any concerns or preferences on the implementation.

PS. There are metrics for these, but those don't help users know the why/what; they only show that it happens, sometime, somewhere.

Document metrics for tuning EMADSampler

A common question from customers is how the sampling goal works with the EMADSampler. This ticket is to document that relationship, along with metrics to look at to understand it.
