

kafka-helmsman


kafka-helmsman is a repository of tools that focus on automating a Kafka deployment. These tools were developed by data platform engineers at Tesla; they add value to the open-source Kafka ecosystem in a couple of ways:

  1. The tasks covered by these tools are infamous for adding toil to engineers; for us, these tools save weeks of engineering time each quarter
  2. They have been battle-tested internally, and are high quality and user-friendly

The tools are

Follow the links for more detail.

Development

The tools are written in Java & Python. Refer to the language-specific sections below for development instructions.

Java

Kafka Helmsman is built and tested using Java 17. Bazel hermetically fetches a Java 17 distribution and uses it to compile the code and run tests. User environments only require the minimum Java version required for running Bazel itself (currently Java 8).

Dependencies

Java code uses bazel as the build tool, the installation of which is managed via bazelisk. Bazelisk is a version manager for Bazel: it takes care of downloading and installing Bazel itself, so you don't have to worry about using the correct version.

Bazelisk can be installed in different ways, see here for details.

Test

bazel test //...

Build

bazel build //...

Note: In bazel, the target spec //... matches all targets in all packages beneath the workspace root, so bazel build //... means "build everything".

IntelliJ

If you are using IntelliJ with the Bazel plugin, you can import the project directly:

  • File -> Import Bazel Project
  • Select kafka-helmsman from where you cloned it locally
  • Select "Import project view file" and select the ij.bazelproject file
  • Select "Finish" with Infer from: Workspace (the default selection)

We recommend using the latest version of IntelliJ together with the latest version of the Bazel plugin.

Python

Dependencies

Python code uses tox to run tests. If you don't have tox installed, here is a quick primer.

Tox Primer (optional, skip if you have a working tox setup)

Install pyenv:

pyenv install 3.5.7
pyenv install 3.6.9
pyenv install 3.7.4

Install tox:

pyenv virtualenv 3.7.4 tox
pyenv activate tox
pip install tox
pyenv deactivate

Set up the python path:

pyenv local 3.5.7 3.6.9 3.7.4 tox
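For orientation, a tox configuration matching the interpreters installed above might look like the following sketch (the tox.ini in the repository is authoritative; the envlist, deps, and commands here are assumptions for illustration):

```ini
[tox]
; one environment per interpreter installed via pyenv above (assumed)
envlist = py35,py36,py37

[testenv]
; hypothetical test dependencies and runner
deps = pytest
commands = pytest
```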

Test

./build_python.sh

Build

./build_python.sh package

kafka-helmsman's People

Contributors

danielfritsch, dependabot[bot], jyates, mattbonnell, matteobaccan, rohagarwal, rohanag12, scottcrossen, shk3, shrijeet-tesla


kafka-helmsman's Issues

Topic enforcer should indicate replication factor config drift

Topic enforcer cannot alter the replication factor once a topic has been created (Kafka doesn't allow it). Currently, an enforcer run finishes silently with no indication of replication factor drift even when it detects one. A better UX would be to log the drift and inform the user that it is non-enforceable.
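A minimal sketch of the requested behavior (the function and log message are hypothetical, not the enforcer's actual API): compare the desired and actual replication factor and surface a warning instead of finishing silently.

```python
import logging

logger = logging.getLogger("topic_enforcer")

def check_replication_drift(topic: str, desired: int, actual: int) -> bool:
    """Return True if the replication factor has drifted.

    Kafka does not allow altering the replication factor of an existing
    topic, so the drift is reported as non-enforceable rather than fixed.
    """
    if desired != actual:
        logger.warning(
            "Topic %s replication factor drift: desired=%d actual=%d "
            "(non-enforceable; requires manual partition reassignment)",
            topic, desired, actual)
        return True
    return False
```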

Freshness tracker verbosity could be improved for errors and debugging

For example, when a consumer is not available from burrow we dump a large error message in the logs

2022-06-29 09:03:52 ERROR [main] c.t.d.c.f.ConsumerFreshness:312 - Failed to read Burrow status for consumer example.missing.consumer. Skipping
java.io.IOException: Response was not successful: Response{protocol=http/1.1, code=404, message=Not Found, url=http://my.burrow/v3/kafka/my-cluster/consumer/example.missing.consumer/lag}
        at com.tesla.data.consumer.freshness.Burrow.request(Burrow.java:95)
        at com.tesla.data.consumer.freshness.Burrow.getConsumerGroupStatus(Burrow.java:111)
        at com.tesla.data.consumer.freshness.Burrow$ClusterClient.getConsumerGroupStatus(Burrow.java:144)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureConsumer(ConsumerFreshness.java:307)
        at com.tesla.data.consumer.freshness.ConsumerFreshness.measureCluster(ConsumerFreshness.java:271)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.stream.ReferencePipeline$11$1.accept(ReferencePipeline.java:440)
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:290)
        at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
        at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)

But these consumers can legitimately be missing lag information if burrow has include/exclude rules configured, so these error messages just clog the logs.

Conversely, it's hard to diagnose a bug for a consumer if you don't know what the freshness tracker is seeing. For example, a consumer showed increasing lag while burrow & kafka both said it was up-to-date on the latest commit (this occurred recently). If this persists past a freshness-tracker restart, something is wonky in the tracker, and you would want to turn on some debug logging (even if it is verbose) to see what is going on.
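One way to address both halves of this issue, sketched below with hypothetical names (the real logic lives in the Java ConsumerFreshness class): demote the expected "consumer missing from Burrow" case to DEBUG while keeping genuine failures at ERROR.

```python
import logging

logger = logging.getLogger("consumer_freshness")

def log_burrow_failure(consumer: str, status_code: int, err: Exception) -> str:
    """Pick a log level for a failed Burrow status lookup.

    A 404 usually just means the consumer is excluded by Burrow's
    include/exclude rules, so it is logged at DEBUG; any other failure
    stays at ERROR. Returns the chosen level for illustration.
    """
    if status_code == 404:
        logger.debug("Consumer %s not found in Burrow, skipping: %s", consumer, err)
        return "debug"
    logger.error("Failed to read Burrow status for consumer %s: %s", consumer, err)
    return "error"
```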

build does not produce anything

Hi, with a clone of the repo on the master branch:

XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel build //...:all
INFO: Analyzed 79 targets (0 packages loaded, 0 targets configured).
INFO: Found 79 targets...
INFO: Elapsed time: 0.070s, Critical Path: 0.00s
INFO: 0 processes.
INFO: Build completed successfully, 1 total action
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ bazel version
Bazelisk version: v1.7.5
Build label: 3.4.1
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Tue Jul 14 06:27:53 2020 (1594708073)
Build timestamp: 1594708073
Build timestamp as int: 1594708073
XXXXX@YYYY:~/REPO/test/kafka-helmsman$ javac -version
javac 1.8.0_282

The build command does not produce anything, no jar.

Thank you.

Use KafkaAdminClient for quota enforcement + upgrade Kafka libs to 2.6

With the current Kafka version (2.4.1), quota enforcement was implemented through a Zookeeper admin client, as the KafkaAdminClient only supports quota configuration with Kafka >= 2.6, both client- and server-side.

With the introduction of quota enforcement functionality in this project, we had to add the Kafka server library (which contains the ZK admin client code), which in turn complicated the dependency environment with various Scala libraries and scala bazel_rules.

When we are ready to upgrade Kafka to >= 2.6, it would make sense to remove these Scala dependencies and go back to a lightweight dependencies.yaml with just Java libraries. This entails:

  • removing the Kafka server library
  • upgrading the kafka-client library
  • removing unneeded dependencies that were added in the below PRs
  • using KafkaAdminClient for quota configuration

#54
#53

Strange freshness calculation for some topic partitions

I've started relying on the freshness tracker for kafka consumer health alerting. Recently some of the freshness tracker metrics seem to be unreliable. I have a topic with 900 partitions. Checking offset lag via the kafka API, I see per partition offset lags oscillating between 0 and 1k. In the attached graphs, I'm singling out a single partition. The first graph shows the freshness-derived lag, the second shows burrow's reported offset lag. I can't figure out why the freshness lag is so far off and oscillates between zero and many hours.

Have you encountered something like this before?

freshness lag graph

offset lag graph

Freshness tracker should fail a cluster iteration if all partitions for all consumers fail

Currently, we are very generous with the failure constraints for a cluster; from ConsumerFreshness (ln 281-293):

    // if all the consumer measurements succeed, then we return the cluster name
    // otherwise, Future.get will throw an exception representing the failure to measure a consumer (and thus the
    // failure to successfully monitor the cluster).
    return Futures.whenAllSucceed(completedConsumers).call(client::getCluster, this.executor);
  }

  /**
   * Measure the freshness for all the topic/partitions currently consumed by the given consumer group. To maintain
   * the existing contract, a consumer measurement fails ({@link Future#get()} throws an exception) only if:
   *  - burrow group status lookup fails
   *  - execution is interrupted
   * Failure to actually measure the consumer is swallowed into a log message & metric update; obviously, this is less
   * than ideal for many cases, but it will be addressed later.

However, SSL connection issues (i.e. a misconfiguration) only show up when querying the consumers. So you can have a valid burrow lookup for the cluster (b/c burrow is configured correctly) while freshness fails for every consumer because the tracker is misconfigured. You would never know, though, from the kafka_consumer_freshness_last_success_run_timestamp metric, since it will still be updated despite the failures.
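The proposed rule can be sketched as follows (hypothetical function, not the current ConsumerFreshness behavior): a cluster iteration counts as successful only if at least one consumer measurement succeeded, which catches cluster-wide problems like an SSL misconfiguration while still tolerating isolated per-consumer failures.

```python
def cluster_iteration_ok(consumer_results):
    """Decide whether a cluster iteration should count as successful.

    consumer_results is a list of booleans, one per consumer, where True
    means that consumer's freshness was measured successfully. The
    iteration fails when every measurement failed (or nothing was
    measured at all), so the last-success-timestamp metric would not be
    updated in that case.
    """
    if not consumer_results:
        return False  # measuring nothing is also a failure
    return any(consumer_results)
```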

Docker example of ConsumerFreshness_deploy

Here is a very lazy docker example of ConsumerFreshness_deploy.

Dockerfile:

FROM openjdk:11-jre-slim
ADD ConsumerFreshness_deploy.jar ConsumerFreshness_deploy.jar
ADD conf.yaml conf.yaml
CMD java -jar ConsumerFreshness_deploy.jar --conf conf.yaml
docker-compose.yml:

version: "3"
services:
  burrow:
    build:
      context: ./burrow/
      dockerfile: Dockerfile
    volumes:
      - ./burrow/burrow.toml:/etc/burrow/burrow.toml
    ports:
      - 8000:8000
    depends_on:
      - zookeeper
      - kafka

  time_lag:
    build:
      context: ./tesla/
      dockerfile: Dockerfile
    volumes:
      - ./tesla/conf.yaml:/conf.yaml
      - ./tesla/ConsumerFreshness_deploy.jar:/ConsumerFreshness_deploy.jar
    ports:
      - 8099:8081
    depends_on:
      - burrow
      - kafka
...      

Any suggestion is super welcome πŸ‘
