Comments (10) from kubernetes-kafka

BenjaminDavison commented on August 27, 2024

We are thinking about using terraform for this: https://github.com/packetloop/terraform-provider-kafka

DanielFerreiraJorge commented on August 27, 2024

Maybe this: https://github.com/nbogojevic/kafka-operator

solsson commented on August 27, 2024

My current best bet is to rely on auto.create.topics.enable=true + default.replication.factor and a naming convention (some ideas here).

I found that during stream processing it's quite a legitimate use case to want to add new topics dynamically. For example, in log processing you could split on Kubernetes namespace, and new namespaces may pop up while your streams app is running.
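
To make the dynamic-topic case concrete, here is a hypothetical sketch (not from this repo) of a streams app routing records to per-namespace topics; the topic names, bootstrap address and application id are made up for illustration, and the lambda form assumes Kafka Streams 2.0+ where `to()` accepts a `TopicNameExtractor`:

```java
// Hypothetical sketch: route log records to per-namespace topics. Destination
// names are only known at runtime, which is why topics either need to be
// auto-created or provisioned on the fly.
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class NamespaceRouter {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "namespace-router"); // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092"); // placeholder
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> logs = builder.stream("logs-raw"); // placeholder source topic
    // The lambda is a TopicNameExtractor: it picks the destination topic per
    // record, here assuming the record key carries the namespace.
    logs.to((key, value, recordContext) -> "logs-" + key);

    new KafkaStreams(builder.build(), props).start();
  }
}
```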

solsson commented on August 27, 2024

I've tried to summarize my confusion by closing various open investigations. What surprised me the most was the insight(?) that Schema Registry doesn't help.

At Yolean we store all our persistent data in Kafka. Some of it has the integrity requirements you'd normally entrust to a relational database. If validation is supported at the API level when producing, we'll make design choices based on the type of data (rather than on time constraints / laziness).

Now I suspect we won't get that far without in-house libs - a maintenance effort we can't depend on, given the rapid evolution of Kafka APIs.

Validation can still be simple enough, if schemas are available as files. We already use json-schema in some core producers using Node.js. Files have the added benefit that they get the same versioning and release cycle as the services whose repository they share.
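
On the JVM side, a hedged sketch of what file-based validation can look like; the library (everit-json-schema) and the schema file name are assumptions for illustration, not something this thread prescribes (the Node.js producers use ajv for the same purpose):

```java
// Hypothetical sketch: validate a candidate record against a json-schema file
// before producing. Library choice and file name are assumptions.
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;
import org.json.JSONTokener;

public class ProducerSideValidation {
  public static void main(String[] args) throws Exception {
    try (InputStream in = ProducerSideValidation.class
        .getResourceAsStream("/orders-value.schema.json")) { // copied into the service's source tree
      Schema schema = SchemaLoader.load(
          new JSONObject(new JSONTokener(new InputStreamReader(in, StandardCharsets.UTF_8))));
      String candidate = "{\"orderId\":\"42\",\"amount\":100}";
      try {
        schema.validate(new JSONObject(candidate)); // throws on violation
        // ...produce the record
      } catch (ValidationException e) {
        // reject instead of producing (or count it, see monitoring below)
        System.err.println("Invalid record: " + e.getMessage());
      }
    }
  }
}
```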

We do need a little bit of tooling to copy schema files, be they Avro or JSON or Protobuf, to the source folders of the services that depend on them. Due to Docker build contexts we can't pull them from a parent folder.

We still don't enforce anything, but luckily there's a compelling alternative: monitoring (explored in #49 and #93). Consumer services can export counters on the number of messages they had to drop due to invalidity, or, even if they don't, we can look at generic failure rates. We can catch errors that schemas can't: the other day we had misbehaving client-side code that increased the messages/second to a topic by a factor of 100.
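
As a concrete sketch of that kind of counter (assuming the Prometheus Java simpleclient; any metrics library works the same way, and the metric name is made up):

```java
// Hypothetical sketch: count records a consumer drops as invalid, per topic,
// so alerting can catch schema problems without broker-side enforcement.
import io.prometheus.client.Counter;

public class InvalidRecordMetrics {
  static final Counter DROPPED = Counter.build()
      .name("consumer_invalid_records_total")
      .help("Records dropped because they failed validation")
      .labelNames("topic")
      .register();

  // Call this wherever deserialization or validation fails.
  static void recordDropped(String topic) {
    DROPPED.labels(topic).inc();
  }
}
```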

We can also write dedicated validation services, whose sole purpose is to expose metrics. This is where naming conventions might matter. Let's say that topic names ending with a dash followed by digits indicate a version number. It'd be quite easy to spot, based on a topic listing and the bytes or message rates reported by broker JMX, that we have some service producing to an old version of a topic.
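
A minimal sketch of that convention, with made-up topic names:

```java
// Topic names ending in "-<digits>" are treated as versioned; a topic listing
// can then be split into base name and version to spot producers stuck on old ones.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionedTopicNames {
  static final Pattern VERSIONED = Pattern.compile("^(.+)-(\\d+)$");

  public static void main(String[] args) {
    for (String topic : new String[] {"orders-1", "orders-2", "logs-raw"}) {
      Matcher m = VERSIONED.matcher(topic);
      System.out.println(m.matches()
          ? topic + " -> base=" + m.group(1) + ", version=" + m.group(2)
          : topic + " -> unversioned");
    }
  }
}
```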

solsson commented on August 27, 2024

We're now phasing in auto.topic.create (#107) in production. The gotchas that made me disable it from the start are certainly there: run kafkacat with a typo in -t and a new topic gets created. Who would configure their database to create a table on INSERT INTO xyz?

However, Kafka Streams in essence requires auto creation. We just have to live with the gotchas.

Regarding topic and schema management, with the arguments I tried to summarize above, I ended up writing a Gradle build that generates Java POJO-typed Serde implementations from json-schema files. We can also use these schema files directly from libs like ajv in Node.
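
Not the actual generated code, but a sketch of the shape such a Serde can take, assuming kafka-clients 2.x (where Serializer/Deserializer can be written as lambdas) and Jackson; the `Order` type in the usage note is made up:

```java
// Hypothetical sketch: a Jackson-backed, POJO-typed Serde. The json-schema file
// remains the source of truth for the POJO's shape; this only wires (de)serialization.
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.Serializer;

public class JsonSerde {
  public static <T> Serde<T> of(Class<T> type) {
    ObjectMapper mapper = new ObjectMapper();
    Serializer<T> serializer = (topic, data) -> {
      try {
        return data == null ? null : mapper.writeValueAsBytes(data);
      } catch (Exception e) {
        throw new RuntimeException("Serialization failed for topic " + topic, e);
      }
    };
    Deserializer<T> deserializer = (topic, bytes) -> {
      try {
        return bytes == null ? null : mapper.readValue(bytes, type);
      } catch (Exception e) {
        throw new RuntimeException("Deserialization failed for topic " + topic, e);
      }
    };
    return Serdes.serdeFrom(serializer, deserializer);
  }
}
```

Usage would then look like `Serde<Order> orderSerde = JsonSerde.of(Order.class)`, where `Order` is whatever POJO the schema maps to.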

To do this we had to establish a list of topics, or actually topic name regexes, where we configure the model (i.e. schema) that is used for values. With this list we can unit test the whole thing: "deserialize this byte stream for topic x" etc, which is what I argued for earlier.
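
A sketch of what that list can look like in code, with made-up topic patterns; a unit test can then resolve a serde by topic name and round-trip bytes without a broker:

```java
// Hypothetical sketch: map topic-name regexes to the value serde (i.e. model)
// configured for matching topics.
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;
import org.apache.kafka.common.serialization.Serde;

public class TopicModelRegistry {
  private final Map<Pattern, Serde<?>> byPattern = new LinkedHashMap<>();

  public TopicModelRegistry register(String topicRegex, Serde<?> serde) {
    byPattern.put(Pattern.compile(topicRegex), serde);
    return this;
  }

  /** The value serde configured for a topic name, or an error if none matches. */
  public Serde<?> serdeFor(String topic) {
    return byPattern.entrySet().stream()
        .filter(e -> e.getKey().matcher(topic).matches())
        .map(Map.Entry::getValue)
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("No model for topic " + topic));
  }

  /** The configured patterns, e.g. for comparison with a live topic listing. */
  public Iterable<Pattern> patterns() {
    return byPattern.keySet();
  }
}
```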

Regarding topic management, with or without auto create, we could compare that list to the existing topics in Kafka, which together with bytes in/out rates could indicate whether there are bogus topics lying around or we're writing to unexpected topics.
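
A sketch of that comparison using the Java AdminClient; the bootstrap address and expected patterns are placeholders:

```java
// Hypothetical sketch: list the topics that actually exist and flag any that
// don't match the expected topic-name patterns.
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Pattern;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class TopicAudit {
  public static void main(String[] args) throws Exception {
    List<Pattern> expected = Arrays.asList(
        Pattern.compile("orders-\\d+"),
        Pattern.compile("logs-.+"));

    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      Set<String> actual = admin.listTopics().names().get();
      for (String topic : actual) {
        boolean known = expected.stream().anyMatch(p -> p.matcher(topic).matches());
        if (!known) {
          System.out.println("Unexpected topic: " + topic);
        }
      }
    }
  }
}
```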

solsson commented on August 27, 2024

Good points here: https://www.confluent.io/blog/put-several-event-types-kafka-topic/. I agree with a lot of what Kleppmann says, and conclude that my design above with one schema per topic is too restrictive. Schema evolution support basically only caters for new versions adding new fields. A single service version may need to produce to the same topic using multiple schemas, due to ordering requirements.

I also happened to read http://www.dwmkerr.com/the-death-of-microservice-madness-in-2018/ today, and it nicely captures some versioning aspects of using topics as API contracts between services. As that post puts it:

"When a microservice system uses message queues for intra-service communication, you essentially have a large database (the message queue or broker) glueing the services together. Again, although it might not seem like a challenge at first, schema will come back to bite you."

I wanted to have as much of this complexity as possible known at build time, rather than at provisioning/orchestration time (Kube manifests) or runtime (schema registry etc). Am I mistaken here? As a consumer, with Kleppmann's patch to Confluent's Avro serdes, you'll basically branch on the deserialized type and do your processing from there (see the sketch below). This could be used to try to support different versions of the upstream service's messages, without schema evolution. I guess that temptation should be resisted. At Yolean we've tried topic generation numbers for that, with mixed results.
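
For what "branch on the deserialized type" means in practice, a tiny hypothetical sketch (the event classes are made up):

```java
// Hypothetical sketch: one topic, several event types; the consumer dispatches
// on the concrete deserialized type.
public class EventDispatch {
  static class OrderCreated {}
  static class OrderShipped {}

  static void handle(Object event) {
    if (event instanceof OrderCreated) {
      // process the new order
    } else if (event instanceof OrderShipped) {
      // process the shipment
    } else {
      // unknown type: drop and count it rather than crash the consumer
    }
  }
}
```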

Kleppmann writes:

If you are using a data encoding such as JSON, without a statically defined schema, you can easily put many different event types in the same topic. However, if you are using a schema-based encoding such as Avro, a bit more thought is needed to handle multiple event types in a single topic.

That you can deserialize JSON without a schema doesn't mean you always should. For domain entities or "facts" - not operational data - I'd like all our records to have schemas. If Schema Registry is evolving, it's a pity confluentinc/schema-registry#220 gets so little attention.

solsson commented on August 27, 2024

Regarding naming conventions for topics, I just spotted the broker config property create.topic.policy.class.name. https://cwiki.apache.org/confluence/display/KAFKA/KIP-108%3A+Create+Topic+Policy looks useful.
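
A sketch of what such a policy could look like; the class name and the allowed pattern are made up. The broker loads it via create.topic.policy.class.name, so the jar has to be on the broker classpath:

```java
// Hypothetical sketch of a KIP-108 create-topic policy enforcing a naming convention.
import java.util.Map;
import org.apache.kafka.common.errors.PolicyViolationException;
import org.apache.kafka.server.policy.CreateTopicPolicy;

public class NamingConventionPolicy implements CreateTopicPolicy {
  @Override
  public void configure(Map<String, ?> configs) {}

  @Override
  public void validate(RequestMetadata requestMetadata) throws PolicyViolationException {
    String topic = requestMetadata.topic();
    // Require lowercase dotted names, optionally ending in a dash plus version digits.
    if (!topic.matches("[a-z][a-z0-9.]*(-\\d+)?")) {
      throw new PolicyViolationException("Topic name does not follow convention: " + topic);
    }
  }

  @Override
  public void close() {}
}
```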

joewood commented on August 27, 2024

Randomly reading through this issue, I noticed you mentioned auto.create.topics.enable=true is required for streams. I don't believe this is the case, as Kafka Streams uses the admin messages for topic creation and not the metadata request message (which is what this config relates to). Streams should create the derived topics based on the same configuration as the source topic (number of partitions etc.), with compaction for store-backing topics.
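
A hypothetical sketch of the point above, with placeholder application id and bootstrap address: the replication factor for the repartition/changelog topics that Streams itself creates (via the admin client, independent of broker auto-creation) is set on the application.

```java
// Hypothetical sketch: configuring the internal topics Kafka Streams creates.
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsTopicConfig {
  public static Properties streamsProps() {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app"); // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "bootstrap.kafka:9092"); // placeholder
    // Applied to internal topics created by Streams; changelog topics backing
    // state stores are additionally created with cleanup.policy=compact.
    props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);
    return props;
  }
}
```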

solsson commented on August 27, 2024

@joewood Thanks a lot for correcting my misunderstanding here. We (Yolean) must have drawn some premature conclusions when evaluating streams.

Actually my disappointment summarized in #101 (comment) remains unchanged, and with your insight there's really no argument left for auto.create.topics.enable=true IMO, except that it is the default. I'd really like to do #148.

You don't happen to have experience with Create Topic Policy?

solsson commented on August 27, 2024

Reading about the new Knative - apparently awesome enough to need another landing page :) - it could be that CloudEvents matches what I've been looking for. It seems to go further than mapping a schema to topic messages, as it discusses feeds and event types and their relation to an "action". Types are managed in Kubernetes using CRDs.

They say that "A producer can generate events before a consumer is listening, and a consumer can express an interest in an event or class of events that is not yet being produced." That's in contrast to auto.create.topics.

Knative supports a "user-provided" Kafka event "Bus".

There's an event source for Kubernetes events, interesting as an alternative to #186 and #145.

I haven't tested any of the above yet. I'll be particularly interested in how it relates to Kleppmann's post that I referred to earlier. My first goal would be to see if I could implement event sources for our git hosting (Gogs, for which we already publish events to Kafka using webhooks + #183) and the Registry, which didn't work with Pixy out of the box.
