Comments (20)

ne1r0n commented on August 12, 2024

I also had rare gaps in data after errors related to offset commits in Kafka that occurred when running backups on the Kafka cluster, which caused the consumer to start reading messages from an old offset again. Since the size of received message batches is not deterministic, this led to the case mentioned in #291 (comment). I have had to use a custom build with a fix for more than half a year.

from clickhouse-kafka-connect.

abhishekgahlot2 commented on August 12, 2024

It looks like it happened with 1 node too; it took millions of events, but it did happen after about 3.5 million events.

So it could be Keeper lag, the issue @ne1r0n mentioned, or a combination of the two.

@ne1r0n did it happen to you without any broker restart, or while Kafka was running? I am looking to apply a solution because we are running in production, so missing events is a big deal.

abhishekgahlot2 commented on August 12, 2024

Another reason I suspect is having tasks.max = 2.

Also, in KeeperStateProvider.java this could become a problem in the future:

 String key = String.format("%s-%d", topic, partition);
 String selectStr = String.format("SELECT * FROM `%s` WHERE `key`= '%s'", csc.getZkDatabase(), key);

In the new code, where virtual topics are allowed, the same key can be stored in the connect_state table with different min and max offsets.

For example, if I have the same table in two different databases, the keeper state would look like
tablename-0 | minOffset | maxOffset
I think the key should include the database too, because with virtual topics only the last part is taken as the table key.
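A minimal sketch of the suggestion above, assuming a hypothetical extra database parameter (this is not the current KeeperStateProvider code): qualifying the state key with the database name keeps same-named tables in different databases from colliding.

```java
public class StateKeySketch {
    // Hypothetical key builder: prefixing the database name makes the key
    // unique per (database, topic, partition) instead of per (topic, partition).
    static String stateKey(String database, String topic, int partition) {
        return String.format("%s-%s-%d", database, topic, partition);
    }

    public static void main(String[] args) {
        // The same table name in two databases now yields two distinct keys.
        System.out.println(stateKey("db1", "tablename", 0)); // db1-tablename-0
        System.out.println(stateKey("db2", "tablename", 0)); // db2-tablename-0
    }
}
```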

abhishekgahlot2 commented on August 12, 2024

I see, thanks; in that case it shouldn't happen. I have now deployed the new main branch to production to see whether I still get overlapping issues with the new code.

I am using this PR: #389

Once I verify that it's working, I can add multiple nodes so that the Keeper lag can be investigated too. :)

abhishekgahlot2 commented on August 12, 2024

An update: we still got the missing-event issue after adding multiple nodes to the load balancer. It looks like Keeper lag is still a problem.

We hadn't had any cases for 4-5 days with a single node, but as soon as we added multiple nodes the events started to go missing. They are less frequent since the main branch was deployed, but still there.

cc @Paultagoras

mzitnik commented on August 12, 2024

@abhishekgahlot2, if the issue still happens, can you ping us on the ClickHouse community Slack so we can discuss?

abhishekgahlot2 commented on August 12, 2024

I believe this could be happening due to a mismatch between the minOffset and maxOffset read from the replicated connect state table, plus Keeper lag, because only the last few events out of a couple of million are missed. I am still not sure, but this seems the most probable reason. Looking forward to your ideas :)

By last or first I mean that from a payload of [1,2,3,4,5] we lose either the 1 or the 5.

#380

@Paultagoras

abhishekgahlot2 commented on August 12, 2024

I have rerun the script and am processing again with just one node to confirm.

For now I have removed the extra node from the load balancer and am trying to figure out whether this could be a problem with the replicated table and offsets; since I am now pushing to one node, Keeper lag can be ruled out.

@ne1r0n I also believe the offset handling seems to be causing a few events to be dropped.

Paultagoras commented on August 12, 2024

Fascinating. So to summarize (please correct me if I'm wrong):

  • A payload is sent with some data (e.g. min-offset: 1 and max-offset: 10)
  • A hiccup happens in Kafka (as they sometimes do)
  • Due to replicating the connect-state across nodes, the min-offset and max-offset may not be 100% accurate (from either a prior or a new call), so chopping happens, dropping some records

Is this happening in a self-hosted instance, or a cloud instance?

abhishekgahlot2 commented on August 12, 2024

Yes, this is correct. It's been 2 days and more than 3 million records, and I haven't seen any missing records yet, because I removed 2 nodes from the 3-node cluster, so data is going to only 1 node now. We are processing millions of records every few hours.

Will update if I see the problem happening on even 1 node. Trying to figure out a solution where reading from the connect state table is consistent.

Yes, the infra is AWS Managed Kafka, the ClickHouse Kafka Connector, and a ClickHouse 3-node cluster behind an AWS load balancer, with Keeper running for coordination.

Paultagoras commented on August 12, 2024

Do you happen to have the config for the self-hosted setup, the connector, and some sample data? I'd be curious to know whether I can a) replicate this locally and b) replicate this in the cloud.

abhishekgahlot2 commented on August 12, 2024

Sure, it's pretty standard, but here are the files:

Python script to generate sample data
https://gist.github.com/abhishekgahlot2/c43d0d4509f6f4fbb536ae910d9b6f9a

Sample data (JSON)
https://gist.github.com/abhishekgahlot2/01555c6586510bbca90acf9538624487

docker-compose for ClickHouse
https://gist.github.com/abhishekgahlot2/568adcafbb04be4ef95425213990a757

config.xml
https://gist.github.com/abhishekgahlot2/879d6d40028b38487018f0af7e48cd57

Please let me know if you need anything else.

It would be wise to log records from the ProxySinkTask.java file to eliminate other potential problems; I will do that.

abhishekgahlot2 commented on August 12, 2024

I believe this newly merged code

        long actualMinOffset = rangeContainer.getMinOffset();
        long actualMaxOffset = rangeContainer.getMaxOffset();

        // SAME State [0,10] Actual [0,10]
        if (maxOffset == actualMaxOffset && minOffset == actualMinOffset)
            return RangeState.SAME;
        // NEW State [0,10] Actual [11,20]
        if (actualMinOffset > maxOffset)
            return RangeState.NEW;
        // CONTAINS [0,10] Actual [1, 10]
        if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset)
            return RangeState.CONTAINS;
        // ZEROED [10, 20] Actual [0, 10]
        if (actualMinOffset < minOffset && actualMinOffset == 0)
            return RangeState.ZERO;
        // ERROR [10,20] Actual [8, X]
        if (actualMinOffset < minOffset)
            return RangeState.ERROR;
        // OVER_LAPPING
        return RangeState.OVER_LAPPING;
    }

seems to address the problem of missing events in the overlap case. I am deploying the new code to see if this resolves the issue.
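As a self-contained illustration of the logic quoted above (the class and method names here are invented for the sketch), the check classifies a newly received batch's offset range against the stored state:

```java
public class RangeStateSketch {
    enum RangeState { SAME, NEW, CONTAINS, ZERO, ERROR, OVER_LAPPING }

    // minOffset/maxOffset: range recorded in the state store;
    // actualMinOffset/actualMaxOffset: range of the batch just received.
    static RangeState compare(long minOffset, long maxOffset,
                              long actualMinOffset, long actualMaxOffset) {
        if (maxOffset == actualMaxOffset && minOffset == actualMinOffset)
            return RangeState.SAME;        // identical batch, already inserted
        if (actualMinOffset > maxOffset)
            return RangeState.NEW;         // strictly newer range, safe to insert
        if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset)
            return RangeState.CONTAINS;    // fully covered by stored state
        if (actualMinOffset < minOffset && actualMinOffset == 0)
            return RangeState.ZERO;        // offsets restarted from zero
        if (actualMinOffset < minOffset)
            return RangeState.ERROR;       // inconsistent, range went backwards
        return RangeState.OVER_LAPPING;    // partial overlap, needs trimming
    }

    public static void main(String[] args) {
        System.out.println(compare(0, 10, 5, 15)); // OVER_LAPPING
    }
}
```

The OVER_LAPPING case is the one relevant to this thread: only the records beyond the stored maxOffset should be inserted, so trimming against a stale state would drop events.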

Paultagoras commented on August 12, 2024

The latest code merged was a fix provided by @ne1r0n, if it turns out to solve the issue 🙂

abhishekgahlot2 commented on August 12, 2024

Please let me know if I am wrong here: can the connector have a race condition when using tasks.max?

Paultagoras commented on August 12, 2024

Please let me know if I am wrong here: can the connector have a race condition when using tasks.max?

It shouldn't get into a race condition, because of the way the Connect framework handles tasks: essentially, tasks map 1:1 to partitions (unless a task fails, I believe).
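A toy model of that 1:1 behavior (illustrative only; Connect's real assignment comes from the consumer group protocol): each partition ends up owned by exactly one task, and the effective task count never exceeds the partition count.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignmentSketch {
    // Round-robin stand-in for the consumer group assignor: every partition
    // goes to exactly one task, so two tasks never consume the same partition.
    static List<List<Integer>> assign(int partitions, int maxTasks) {
        int tasks = Math.min(partitions, maxTasks); // never more tasks than partitions
        List<List<Integer>> owned = new ArrayList<>();
        for (int t = 0; t < tasks; t++) owned.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) owned.get(p % tasks).add(p);
        return owned;
    }

    public static void main(String[] args) {
        // tasks.max = 2 over a 4-partition topic: disjoint ownership.
        System.out.println(assign(4, 2)); // [[0, 2], [1, 3]]
    }
}
```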

Paultagoras commented on August 12, 2024

We hadn't had any cases for 4-5 days with a single node, but as soon as we added multiple nodes the events started to go missing. They are less frequent since the main branch was deployed, but still there.

Hmm, interesting. So just to be clear: are you adding Keeper nodes or ClickHouse nodes?

Paultagoras commented on August 12, 2024

And are you using something like https://clickhouse.com/docs/en/sql-reference/table-functions/cluster with your SELECT statement to make sure you're getting all the entries across replicas? (It's eventual consistency, I believe, not immediate. There are also more performance details on that doc page around Distributed tables that could help.)
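For illustration, a hedged sketch of what adapting the state lookup quoted earlier in this thread to the cluster() table function might look like; the cluster name, database, and table below are placeholders, not the connector's actual configuration.

```java
public class ClusterQuerySketch {
    // cluster('name', database, table) fans the SELECT out to every replica,
    // so the lookup is not limited to whichever node the load balancer picked.
    static String buildQuery(String cluster, String db, String table,
                             String topic, int partition) {
        String key = String.format("%s-%d", topic, partition);
        return String.format(
            "SELECT * FROM cluster('%s', '%s', '%s') WHERE `key` = '%s'",
            cluster, db, table, key);
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("my_cluster", "default", "connect_state",
                                      "events", 0));
    }
}
```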

Paultagoras commented on August 12, 2024

I'm going to close this since we haven't heard anything, but if it's continuing to happen please let us know!

abhishekgahlot2 commented on August 12, 2024

Thanks @Paultagoras @mzitnik, I will let you guys know on Slack. As of now I haven't had the issue; one instance was probably due to a timeout, but apart from that nothing.

Thanks again for helping here :)
