Comments (10) from kubernetes-kafka

BenjaminDavison commented on August 27, 2024

We are thinking about using terraform for this: https://github.com/packetloop/terraform-provider-kafka

DanielFerreiraJorge commented on August 27, 2024

Maybe this: https://github.com/nbogojevic/kafka-operator

solsson commented on August 27, 2024

My current best bet is to rely on auto.create.topics.enable=true + default.replication.factor and a naming convention (some ideas here).

I found that during stream processing it's quite a legitimate use case to want to add new topics dynamically. For example, in log processing you could split on Kubernetes namespace, and new namespaces may pop up while your streams app is running.
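
To make the dynamic-topic case concrete, here is a hypothetical sketch (not from this repo) of a streams app routing records to per-namespace topics; the topic names, bootstrap address and application id are made up for illustration, and the lambda form assumes Kafka Streams 2.0+ where `to()` accepts a `TopicNameExtractor`:

```java
// Hypothetical sketch: route log records to per-namespace topics. Destination
// names are only known at runtime, which is why topics either need to be
// auto-created or provisioned on the fly.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class NamespaceRouter {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "namespace-router"); // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092"); // placeholder
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> logs = builder.stream("logs-raw"); // placeholder source topic
    // The lambda is a TopicNameExtractor: it picks the destination topic per
    // record, here assuming the record key carries the namespace.
    logs.to((key, value, recordContext) -> "logs-" + key);

    new KafkaStreams(builder.build(), props).start();
  }
}
```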

solsson commented on August 27, 2024

I've tried to summarize my confusion by closing various open investigations. What surprised me the most was the insight(?) that Schema Registry doesn't help.

At Yolean we store all our persistent data in Kafka. Some of it has the integrity requirements you'd normally entrust to a relational database. If validation is supported at the API level when producing, we'll make design choices based on the type of data (rather than on time constraints / laziness).

Now I suspect we won't get that far without in-house libs - a maintenance effort we can't depend on, given the rapid evolution of Kafka APIs.

Validation can still be simple enough, if schemas are available as files. We already use json-schema in some core producers using Node.js. Files have the added benefit that they get the same versioning and release cycle as the services whose repository they share.
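
On the JVM side, a hedged sketch of what file-based validation can look like; the library (everit-json-schema) and the schema file name are assumptions for illustration, not something this thread prescribes (the Node.js producers use ajv for the same purpose):

```java
// Hypothetical sketch: validate a candidate record against a json-schema file
// before producing. Library choice and file name are assumptions.
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;
import org.json.JSONTokener;

public class ProducerSideValidation {
  public static void main(String[] args) throws Exception {
    try (InputStream in = ProducerSideValidation.class
        .getResourceAsStream("/orders-value.schema.json")) { // copied into the service's source tree
      Schema schema = SchemaLoader.load(
          new JSONObject(new JSONTokener(new InputStreamReader(in, StandardCharsets.UTF_8))));
      String candidate = "{\"orderId\":\"42\",\"amount\":100}";
      try {
        schema.validate(new JSONObject(candidate)); // throws on violation
        // ...produce the record
      } catch (ValidationException e) {
        // reject instead of producing (or count it, see monitoring below)
        System.err.println("Invalid record: " + e.getMessage());
      }
    }
  }
}
```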

We do need a little bit of tooling to copy schema files, be they Avro or JSON or Protobuf, to the source folders of the services that depend on them. Due to Docker build contexts we can't pull them from a parent folder.

We still don't enforce anything, but luckily there's a compelling alternative: monitoring (explored in #49 and #93). Consumer services can export counters on the number of messages they had to drop due to invalidity, or, even if they don't, we can look at generic failure rates. We can catch errors that schemas can't: the other day we had misbehaving client-side code that increased the messages/second to a topic by a factor of 100.
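
As a concrete sketch of that kind of counter (assuming the Prometheus Java simpleclient; any metrics library works the same way, and the metric name is made up):

```java
// Hypothetical sketch: count records a consumer drops as invalid, per topic,
// so alerting can catch schema problems without broker-side enforcement.
import io.prometheus.client.Counter;

public class InvalidRecordMetrics {
  static final Counter DROPPED = Counter.build()
      .name("consumer_invalid_records_total")
      .help("Records dropped because they failed validation")
      .labelNames("topic")
      .register();

  // Call this wherever deserialization or validation fails.
  static void recordDropped(String topic) {
    DROPPED.labels(topic).inc();
  }
}
```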

We can also write dedicated validation services, whose sole purpose is to expose metrics. This is where naming conventions might matter. Let's say that topic names ending with a dash followed by digits indicate a version number. It'd be quite easy to spot, based on a topic listing and the bytes or message rates reported by broker JMX, that we have some service producing to an old version of a topic.
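
A minimal sketch of that convention, with made-up topic names:

```java
// Topic names ending in "-<digits>" are treated as versioned; a topic listing
// can then be split into base name and version to spot producers stuck on old ones.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionedTopicNames {
  static final Pattern VERSIONED = Pattern.compile("^(.+)-(\\d+)$");

  public static void main(String[] args) {
    for (String topic : new String[] {"orders-1", "orders-2", "logs-raw"}) {
      Matcher m = VERSIONED.matcher(topic);
      System.out.println(m.matches()
          ? topic + " -> base=" + m.group(1) + ", version=" + m.group(2)
          : topic + " -> unversioned");
    }
  }
}
```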

solsson commented on August 27, 2024

We're now phasing in auto.topic.create (#107) in production. The gotchas that made me disable it from the start are certainly there: run kafkacat with a typo in -t and a new topic gets created. Who would configure their database to create a table on INSERT INTO xyz?

However, Kafka Streams in essence requires auto creation. We just have to live with the gotchas.

Regarding topic and schema management, with the arguments I tried to summarize above, I ended up writing a Gradle build that generates Java POJO-typed Serde implementations from json-schema files. We can also use these schema files directly from libs like ajv in Node.
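
Not the actual generated code, but a sketch of the shape such a Serde can take, assuming kafka-clients 2.x (where Serializer/Deserializer can be written as lambdas) and Jackson; the `Order` type in the usage note is made up:

```java
// Hypothetical sketch: a Jackson-backed, POJO-typed Serde. The json-schema file
// remains the source of truth for the POJO's shape; this only wires (de)serialization.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;

public class JsonSerde {
  public static <T> Serde<T> of(Class<T> type) {
    ObjectMapper mapper = new ObjectMapper();
    Serializer<T> serializer = (topic, data) -> {
      try {
        return data == null ? null : mapper.writeValueAsBytes(data);
      } catch (Exception e) {
        throw new RuntimeException("Serialization failed for topic " + topic, e);
      }
    };
    Deserializer<T> deserializer = (topic, bytes) -> {
      try {
        return bytes == null ? null : mapper.readValue(bytes, type);
      } catch (Exception e) {
        throw new RuntimeException("Deserialization failed for topic " + topic, e);
      }
    };
    return Serdes.serdeFrom(serializer, deserializer);
  }
}
```

Usage would then look like `Serde<Order> orderSerde = JsonSerde.of(Order.class)`, where `Order` is whatever POJO the schema maps to.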

To do this we had to establish a list of topics, or actually topic name regexes, where we configure the model (i.e. schema) that is used for values. With this list we can unit test the whole thing: "deserialize this byte stream for topic x" etc, which is what I argued for earlier.
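
A sketch of what that list can look like in code, with made-up topic patterns; a unit test can then resolve a serde by topic name and round-trip bytes without a broker:

```java
// Hypothetical sketch: map topic-name regexes to the value serde (i.e. model)
// configured for matching topics.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serde;

public class TopicModelRegistry {
  private final Map<Pattern, Serde<?>> byPattern = new LinkedHashMap<>();

  public TopicModelRegistry register(String topicRegex, Serde<?> serde) {
    byPattern.put(Pattern.compile(topicRegex), serde);
    return this;
  }

  /** The value serde configured for a topic name, or an error if none matches. */
  public Serde<?> serdeFor(String topic) {
    return byPattern.entrySet().stream()
        .filter(e -> e.getKey().matcher(topic).matches())
        .map(Map.Entry::getValue)
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No model for topic " + topic));
  }

  /** The configured patterns, e.g. for comparison with a live topic listing. */
  public Iterable<Pattern> patterns() {
    return byPattern.keySet();
  }
}
```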

Regarding topic management, with or without auto create, we could compare that list to the existing topics in Kafka, which together with bytes in/out rates could indicate whether there are bogus topics lying around or we're writing to unexpected topics.
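
A sketch of that comparison using the Java AdminClient; the bootstrap address and expected patterns are placeholders:

```java
// Hypothetical sketch: list the topics that actually exist and flag any that
// don't match the expected topic-name patterns.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TopicAudit {
  public static void main(String[] args) throws Exception {
    List<Pattern> expected = Arrays.asList(
        Pattern.compile("orders-\\d+"),
        Pattern.compile("logs-.+"));

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      Set<String> actual = admin.listTopics().names().get();
      for (String topic : actual) {
        boolean known = expected.stream().anyMatch(p -> p.matcher(topic).matches());
        if (!known) {
          System.out.println("Unexpected topic: " + topic);
        }
      }
    }
  }
}
```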

solsson commented on August 27, 2024

Good points here: https://www.confluent.io/blog/put-several-event-types-kafka-topic/. I agree with a lot of what Kleppmann says, and conclude that my design above with one schema per topic is too restrictive. Schema evolution support basically only caters for new versions adding new fields. A single service version may need to produce to the same topic using multiple schemas, due to ordering requirements.

I also happened to read http://www.dwmkerr.com/the-death-of-microservice-madness-in-2018/ today, and it nicely captures some versioning aspects of using topics as API contracts between services. As that post puts it:

"When a microservice system uses message queues for intra-service communication, you essentially have a large database (the message queue or broker) glueing the services together. Again, although it might not seem like a challenge at first, schema will come back to bite you."

I wanted to have as much of this complexity as possible known at build time, rather than at provisioning/orchestration time (Kube manifests) or runtime (schema registry etc). Am I mistaken here? As a consumer, with Kleppmann's patch to Confluent's Avro serdes, you'll basically branch on the deserialized type and do your processing from there (see the sketch below). This could be used to try to support different versions of the upstream service's messages, without schema evolution. I guess that temptation should be resisted. At Yolean we've tried topic generation numbers for that, with mixed results.
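
For what "branch on the deserialized type" means in practice, a tiny hypothetical sketch (the event classes are made up):

```java
// Hypothetical sketch: one topic, several event types; the consumer dispatches
// on the concrete deserialized type.
public class EventDispatch {
  static class OrderCreated {}
  static class OrderShipped {}

  static void handle(Object event) {
    if (event instanceof OrderCreated) {
      // process the new order
    } else if (event instanceof OrderShipped) {
      // process the shipment
    } else {
      // unknown type: drop and count it rather than crash the consumer
    }
  }
}
```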

Kleppmann writes:

If you are using a data encoding such as JSON, without a statically defined schema, you can easily put many different event types in the same topic. However, if you are using a schema-based encoding such as Avro, a bit more thought is needed to handle multiple event types in a single topic.

That you can deserialize JSON without a schema doesn't mean you always should. For domain entities or "facts" - not operational data - I'd like all our records to have schemas. If Schema Registry is evolving, it's a pity confluentinc/schema-registry#220 gets so little attention.

solsson commented on August 27, 2024

Regarding naming conventions for topics, I just spotted the broker config property create.topic.policy.class.name. https://cwiki.apache.org/confluence/display/KAFKA/KIP-108%3A+Create+Topic+Policy looks useful.
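
A sketch of what such a policy could look like; the class name and the allowed pattern are made up. The broker loads it via create.topic.policy.class.name, so the jar has to be on the broker classpath:

```java
// Hypothetical sketch of a KIP-108 create-topic policy enforcing a naming convention.
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

public class NamingConventionPolicy implements CreateTopicPolicy {
  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void validate(RequestMetadata requestMetadata) throws PolicyViolationException {
    String topic = requestMetadata.topic();
    // Require lowercase dotted names, optionally ending in a dash plus version digits.
    if (!topic.matches("[a-z][a-z0-9.]*(-\\d+)?")) {
      throw new PolicyViolationException("Topic name does not follow convention: " + topic);
    }
  }

  @Override
  public void close() {}
}
```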

joewood commented on August 27, 2024

Randomly reading through this issue, I noticed you mentioned auto.create.topics.enable=true is required for streams. I don't believe this is the case, as Kafka Streams uses the admin messages for topic creation and not the metadata request message (which is what this config relates to). Streams should create the derived topics based on the same configuration as the source topic (number of partitions etc.), with compaction for store-backing topics.
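
A hypothetical sketch of the point above, with placeholder application id and bootstrap address: the replication factor for the repartition/changelog topics that Streams itself creates (via the admin client, independent of broker auto-creation) is set on the application.

```java
// Hypothetical sketch: configuring the internal topics Kafka Streams creates.
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTopicConfig {
  public static Properties streamsProps() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092"); // placeholder
    // Applied to internal topics created by Streams; changelog topics backing
    // state stores are additionally created with cleanup.policy=compact.
    props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
    return props;
  }
}
```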

solsson commented on August 27, 2024

@joewood Thanks a lot for correcting my misunderstanding here. We (Yolean) must have drawn some premature conclusions when evaluating streams.

Actually my disappointment summarized in #101 (comment) remains unchanged, and with your insight there's really no argument left for auto.create.topics.enable=true IMO, except that it is the default. I'd really like to do #148.

You don't happen to have experience with Create Topic Policy?

solsson commented on August 27, 2024

Reading about the new Knative - apparently awesome enough to need another landing page :) - it could be that CloudEvents matches what I've been looking for. It seems to go further than mapping a schema to topic messages, as it discusses feeds and event types and their relation to an "action". Types are managed in Kubernetes using CRDs.

They say that "A producer can generate events before a consumer is listening, and a consumer can express an interest in an event or class of events that is not yet being produced." That's in contrast to auto.create.topics.

Knative supports a "user-provided" Kafka event "Bus".

There's an event source for Kubernetes events, interesting as an alternative to #186 and #145.

I haven't tested any of the above yet. I'll be particularly interested in how it relates to Kleppmann's post that I referred to earlier. My first goal would be to see if I could implement event sources for our git hosting (Gogs, for which we already publish events to Kafka using webhooks + #183) and the Registry, which didn't work with Pixy out of the box.
