teslamotors / kafka-helmsman Goto Github PK

kafka-helmsman is a repository of tools that focus on automating a Kafka deployment

License: MIT License

Python 7.87% Shell 0.67% Java 85.10% Starlark 6.37%

kafka-helmsman's Introduction

kafka-helmsman

kafka-helmsman is a repository of tools that focus on automating a Kafka deployment. These tools were developed by data platform engineers at Tesla, they add value to the open-source Kafka ecosystem in a couple of ways:

The tasks covered by these tools are infamous for adding toil on engineers, for us these tools save engineering weeks each quarter
They have been battle-tested internally, high quality and user-friendly

The tools are

Follow the links for more detail.

Development

The tools are written in Java & Python. Refer to language specific section for development instructions.

Java

Kafka Helmsman is built and tested using Java 17. Bazel hermetically fetches a Java 17 distribution and uses it to compile the code and run tests. User environments only require the minimum Java version required for running Bazel itself (currently Java 8).

Dependencies

Java code uses bazel as the build tool, installation of which is managed via bazelisk. Bazelisk is a version manager for Bazel. It takes care of downloading and installing Bazel itself, so you don’t have to worry about using the correct version of Bazel.

Bazelisk can be installed in different ways, see here for details.

Test

bazel test //...

Build

bazel build //...

Note: In bazel, target spec //... indicates all targets in all packages beneath workspace, thus bazel build //... implies build everything.

Intellij

If you are using Intellij + Bazel, you can import the project directly with:

File -> Import Bazel Project
Select kafka-helmsman from where you cloned it locally
Select "Import project view file" and select the ij.bazelproject file
Select "Finish" with Infer from: Workspace (the default selection)

We recommend using the latest version of Intellij with the latest version of Bazel plugin.

Python

Dependencies

Python code uses tox to run tests. If you don't have tox installed, here is a quick primer of tox.

Tox Primer (optional, skip if you have a working tox setup)

Install pyenv

brew install pyenv
brew install pyenv-virtualenv

Install python versions

pyenv install 3.5.7
pyenv install 3.6.9
pyenv install 3.7.4

Install tox

pyenv virtualenv 3.7.4 tox
pyenv activate tox
pip install tox
pyenv deactivate

Setup python path

pyenv local 3.5.7 3.6.9 3.7.4 tox

Test

./build_python.sh

Build

./build_python.sh package

kafka-helmsman's People

Contributors

Stargazers

Watchers

kafka-helmsman's Issues

Freshness tracker verbosity could be improved for errors and debugging

For example, when a consumer is not available from burrow we dump a large error message in the logs

2022-06-29 09:03:52 ERROR [main] c.t.d.c.f.ConsumerFreshness:312 - Failed to read Burrow status for consumer example.missing.consumer. Skipping
java.io.IOException: Response was not successful: Response{protocol=http/1.1, code=404, message=Not Found, url=http://my.burrow/v3/kafka/my-cluster/consumer/example.missing.consumer/lag}
        at com.tesla.data.consumer.freshness.Burrow.request(Burrow.java:95)
        at com.tesla.data.consumer.freshness.Burrow.getConsumerGroupStatus(Burrow.java:111)
        at com.tesla.data.consumer.freshness.Burrow$ClusterClient.getConsumerGroupStatus(Burrow.java:144)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureConsumer(ConsumerFreshness.java:307)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureCluster(ConsumerFreshness.java:271)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)

But these consumers can be missing lag information if burrow has include/exclusions, making these error messages just clog the logs.

Conversely, its hard to diagnose a bug for a consumer if you don't know what freshness tracker is seeing. For example, a consumer is showing as having increasing lag but burrow & kafka both say that it is up-to-date on the latest commit (this occurred recently). If this persists past a freshness-tracker restart, something is wonky in the tracker and you would want to turn on some debug logging (even if it is verbose) to see what is going on.

freshness-tracker should be able to read cluster bootstrap servers from burrow

Burrow already exposes an endpoint to get cluster information (https://github.com/linkedin/Burrow/wiki/http-request-kafka-cluster-detail). We should able to use that to get the default bootstrap.servers for the cluster, rather than having to provide it as a config property. This only marginally helps us get around the rest of the config for things like SSL, but at least would help keep things consistent.

Freshness tracker should fail a cluster iteration if all partitions for all consumers fails

Currently, we are very generous with the failure constraints for a cluster, from ConsumerFreshness (ln 281-293):

    // if all the consumer measurements succeed, then we return the cluster name
    // otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
    // failure to successfully monitor the cluster).
    return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
  }

  /**
   * Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
   * the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
   *  - burrow group status lookup fails
   *  - execution is interrupted
   * Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
   * than ideal for many cases, but it will be addressed later.

However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid burrow lookup for the cluster (b/c burrow is configured correctly) but freshness fails for each consumer because the tracker misconfigured. You would never know though (from the kafka_consumer_freshness_last_success_run_timestamp metric) since that will not get incremented for the failures.

Use KafkaAdminClient for quota enforcement + upgrade Kafka libs to 2.6

With the current Kafka version (2.4.1), quota enforcement was implemented through the use of a Zookeeper admin client, as using the KafkaAdminClient only support quota configuration with Kafka >= 2.6, client and server-side.

With the introduction of quota enforcement functionality in this project, we had to add in the Kafka server library (which contains the ZK admin client code), which in turn required complicating the dependency environment with various Scala libraries, and scala bazel_rules.

When we are ready to upgrade Kafka to >= 2.6, it would make sense to bring remove these Scala dependencies and go back to a light-weight dependencies.yaml with just Java libraries. This entails

removing Kafka server library
upgrading kafka-client library
remove unneeded dependencies that were added in the below PRs
use KafkaAdminClient to for quota configuration

#54
#53

build do not produce anything

Hi, wi a clone of the repo on master branch

XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel build //...:all
INFO: Analyzed 79 targets (0 packages loaded, 0 targets configured).
INFO: Found 79 targets...
INFO: Elapsed time: 0.070s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel version
Bazelisk version: v1.7.5
Build label: 3.4.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 14 06:27:53 2020 (1594708073)
Build timestamp: 1594708073
Build timestamp as int: 1594708073
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ javac -version
javac 1.8.0_282

build command do not produce anything , no jar

Thank you.

Topic enforcer should indicate replication factor config drift

Topic enforcer can not alter replication factor once a topic has been created (Kafka doesn't allow it). Currently, enforcer run finishes silently with no indication of replication factor drift even if it detects one. A better ux would be to log the drift and inform the user that its non enforceable.

kafka_roller cannot be used for a broker upgrade

pre-stop command is very helpful for heathcheck like lag on some topics, but there is no way to do a 'pre-start' command, which could be used for upgrading.

Docker example of ConsumerFreshness_deploy

There go a very lazy docker example of ConsumerFreshness_deploy

FROM openjdk:11-jre-slim
ADD ConsumerFreshness_deploy.jar ConsumerFreshness_deploy.jar
ADD conf.yaml conf.yaml
CMD java -jar ConsumerFreshness_deploy.jar --conf conf.yaml

version: "3"
services:
  burrow:
    build:
      context: ./burrow/
      dockerfile: Dockerfile
    volumes:
      - ./burrow/burrow.toml:/etc/burrow/burrow.toml
    ports:
      - 8000:8000
    depends_on:
      - zookeeper
      - kafka

  time_lag:
    build:
      context: ./tesla/
      dockerfile: Dockerfile
    volumes:
      - ./tesla/conf.yaml:/conf.yaml
      - ./tesla/ConsumerFreshness_deploy.jar:/ConsumerFreshness_deploy.jar
    ports:
      - 8099:8081
    depends_on:
      - burrow
      - kafka
...

any suggestion is super welcome 👍

Strange freshness calculation for some topic partitions

I've started relying on the freshness tracker for kafka consumer health alerting. Recently some of the freshness tracker metrics seem to be unreliable. I have a topic with 900 partitions. Checking offset lag via the kafka API, I see per partition offset lags oscillating between 0 and 1k. In the attached graphs, I'm singling out a single partition. The first graph shows the freshness-derived lag, the second shows burrow's reported offset lag. I can't figure out why the freshness lag is so far off and oscillates between zero and many hours.

Have you encountered something like this before?

teslamotors / kafka-helmsman Goto Github PK

kafka-helmsman's Introduction

kafka-helmsman

Development

Java

Dependencies

Test

Build

Intellij

Python

Dependencies

Tox Primer (optional, skip if you have a working tox setup)

Install pyenv

Install python versions

Install tox

Setup python path

Test

Build

kafka-helmsman's People

Contributors

Stargazers

Watchers

Forkers

kafka-helmsman's Issues

Recommend Projects

Recommend Topics

Recommend Org