Comments (20)
I also had rare gaps in data after errors related to offset commits in Kafka, which occurred when running backups on the Kafka cluster and caused the consumer to start reading messages from an old offset again. Since the size of received message batches is not deterministic, this led to the case mentioned in #291 (comment). I have had to use a custom build with a fix for more than half a year.
from clickhouse-kafka-connect.
It looks like it happened with 1 node too; it took millions of events, but it did happen after 3.5 million events.
So it could be a combination of Keeper lag and the issue @ne1r0n mentioned.
@ne1r0n, did it happen for you without any broker restart, i.e. while Kafka was running? I am looking to apply a solution because we are running in production, so missing events is a big deal.
Another possible cause I am suspecting is having tasks.max = 2.
Also, in KeeperStateProvider.java this could happen in the future:

```java
String key = String.format("%s-%d", topic, partition);
String selectStr = String.format("SELECT * FROM `%s` WHERE `key`= '%s'", csc.getZkDatabase(), key);
```

With the new code where virtual topics are allowed, the same key can be stored in the connect_state table with different min and max offsets. For example, if I have two tables with the same name in different databases, the keeper state would look like:

tablename-0 | minOffset | maxOffset

I think it should be changed to account for multiple databases too, because with virtual topics only the last part (the table name) is used for the key.
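To make the collision concrete, here is a minimal sketch (not the connector's actual code; `qualifiedKey` and its format are assumptions) showing how a key built only from table name and partition collides across databases, and how qualifying it with the database would keep the state rows distinct:

```java
// Hypothetical sketch of the key-collision risk described above.
// KeeperStateProvider builds the state key as "<table>-<partition>",
// so two tables with the same name in different databases collide.
public class KeyCollisionSketch {
    // Mirrors the key format from the snippet above.
    static String currentKey(String table, int partition) {
        return String.format("%s-%d", table, partition);
    }

    // A possible fix (assumption, not the project's API): include the database.
    static String qualifiedKey(String database, String table, int partition) {
        return String.format("%s.%s-%d", database, table, partition);
    }

    public static void main(String[] args) {
        // db1.events and db2.events produce the identical key "events-0" today:
        System.out.println(currentKey("events", 0));
        // Database-qualified keys stay distinct:
        System.out.println(qualifiedKey("db1", "events", 0));
        System.out.println(qualifiedKey("db2", "events", 0));
    }
}
```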
I see, thanks - in that case it shouldn't happen. I have now deployed the new main branch to production to see if I still get overlapping issues with the new code.
I am using this PR: #389
Once I verify that it's working, I can add multiple nodes so that the ZooKeeper lag can be investigated too. :)
An update: we still got the missing-event issue after adding multiple nodes to the load balancer, so it looks like ZooKeeper lag is still a problem.
We hadn't had any case for 4-5 days with a single node, but as soon as we added multiple nodes, events started to go missing. They are lower in frequency since the main branch was deployed, but still there.
cc @Paultagoras
@abhishekgahlot2, if the issue still happens, can you ping us on the ClickHouse community Slack so we can discuss?
I believe this could be happening due to mismatched minOffset and maxOffset from the replicated connect-state table combined with ZooKeeper lag, because only the last few events out of a couple of million are missed. I am still not sure, but I find this the most probable reason. Looking forward to your ideas :)
By last or first I mean that from a payload [1,2,3,4,5] we lose either 1 or 5 from the array.
I have run the script and am processing again with just one node to confirm.
For now I have removed the extra node from the load balancer and am trying to figure out whether this could be a problem with the replicated table and offsets; since I am now pushing to one node, ZooKeeper lag can be eliminated.
@ne1r0n I also believe the offset handling seems to be causing a few events to be removed.
Fascinating - so to summarize (please correct me if I'm wrong):
- A payload is sent with some data (e.g. min-offset 1 and max-offset 10)
- A hiccup happens in Kafka (as they sometimes do)
- Because the connect-state is replicated across nodes, the min-offset and max-offset may not be 100% accurate (from either a prior or a new call), so chopping happens, dropping some records
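The "chopping" in the last step can be illustrated with a small sketch (hypothetical, not the connector's actual code): if the replicated state table returns a stale maxOffset, records at or below it are skipped even though they were never actually inserted.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class ChoppingSketch {
    // Drop every record whose offset is <= the stored max offset,
    // on the assumption that those records were already inserted.
    static List<Long> dedupe(List<Long> incomingOffsets, long storedMaxOffset) {
        return incomingOffsets.stream()
                .filter(o -> o > storedMaxOffset)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Long> batch = LongStream.rangeClosed(1, 10).boxed().collect(Collectors.toList());
        // Accurate state (nothing inserted yet): nothing is dropped.
        System.out.println(dedupe(batch, 0));  // [1, 2, ..., 10]
        // Stale replica wrongly claims offsets up to 5 were inserted:
        System.out.println(dedupe(batch, 5));  // [6, 7, 8, 9, 10] -> 1..5 silently lost
    }
}
```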
Is this happening in a self-hosted instance, or a cloud instance?
Yes, this is correct. It's been 2 days and more than 3 million records, and I haven't seen any missing records yet because I removed 2 nodes from the 3-node cluster, so data is going to only 1 node now. We are processing millions of records every few hours.
I will update if I see the problem happening on even 1 node. I am trying to figure out a solution where reading from the connect-state table is consistent.
Yes, the infra is AWS Managed Kafka, the ClickHouse Kafka Connector, and a 3-node ClickHouse cluster behind an AWS load balancer, with Keeper running for coordination.
Do you happen to have the config for the self-hosted setup, the connector config, and some sample data? I'd be curious to know whether I can a) replicate this locally and b) replicate this in the cloud.
Sure, it's pretty standard, but here are the files:

- Python script to generate sample data: https://gist.github.com/abhishekgahlot2/c43d0d4509f6f4fbb536ae910d9b6f9a
- Sample data (JSON): https://gist.github.com/abhishekgahlot2/01555c6586510bbca90acf9538624487
- docker-compose for ClickHouse: https://gist.github.com/abhishekgahlot2/568adcafbb04be4ef95425213990a757
- config.xml: https://gist.github.com/abhishekgahlot2/879d6d40028b38487018f0af7e48cd57

Please let me know if you need anything else.
It would be wise to log records from the ProxySinkTask.java file to eliminate other potential problems; I will do that.
I believe this newly merged code:

```java
long actualMinOffset = rangeContainer.getMinOffset();
long actualMaxOffset = rangeContainer.getMaxOffset();
// SAME State [0,10] Actual [0,10]
if (maxOffset == actualMaxOffset && minOffset == actualMinOffset)
    return RangeState.SAME;
// NEW State [0,10] Actual [11,20]
if (actualMinOffset > maxOffset)
    return RangeState.NEW;
// CONTAINS [0,10] Actual [1, 10]
if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset)
    return RangeState.CONTAINS;
// ZEROED [10, 20] Actual [0, 10]
if (actualMinOffset < minOffset && actualMinOffset == 0)
    return RangeState.ZERO;
// ERROR [10,20] Actual [8, X]
if (actualMinOffset < minOffset)
    return RangeState.ERROR;
// OVER_LAPPING
return RangeState.OVER_LAPPING;
```

seems to address the problem of missing events in the overlap case. I am deploying the new code to see if it resolves the issue.
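The branch logic above can be exercised standalone. This is a reduced sketch (the `RangeContainer` object is replaced with plain `long` parameters, and the enum is reconstructed from the state names in the snippet) to check which state each offset range falls into:

```java
public class RangeStateSketch {
    enum RangeState { SAME, NEW, CONTAINS, ZERO, ERROR, OVER_LAPPING }

    // Compare the stored [minOffset, maxOffset] range against the
    // actual [actualMinOffset, actualMaxOffset] range of the current batch,
    // using the same branch order as the merged code.
    static RangeState compare(long minOffset, long maxOffset,
                              long actualMinOffset, long actualMaxOffset) {
        if (maxOffset == actualMaxOffset && minOffset == actualMinOffset)
            return RangeState.SAME;
        if (actualMinOffset > maxOffset)
            return RangeState.NEW;
        if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset)
            return RangeState.CONTAINS;
        if (actualMinOffset < minOffset && actualMinOffset == 0)
            return RangeState.ZERO;
        if (actualMinOffset < minOffset)
            return RangeState.ERROR;
        return RangeState.OVER_LAPPING;
    }

    public static void main(String[] args) {
        System.out.println(compare(0, 10, 0, 10));   // SAME
        System.out.println(compare(0, 10, 11, 20));  // NEW
        // Partial overlap - state [0,10], actual [5,15]:
        System.out.println(compare(0, 10, 5, 15));   // OVER_LAPPING
    }
}
```

The OVER_LAPPING case is the interesting one for this issue: it is what the batch looks like after a Kafka hiccup replays some already-inserted offsets together with new ones.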
The latest code merged was a fix provided by @ne1r0n - hopefully it turns out to solve the issue 🙂
Please let me know if I am wrong here: can the connector have a race condition when using tasks.max?
> Please let me know if I am wrong here: can the connector have a race condition when using tasks.max?

It shouldn't get into a race condition, because of the way the Connect framework actually handles tasks - essentially, it's 1:1 with partitions (unless a task fails, I believe).
> We hadn't had any case for 4-5 days with a single node, but as soon as we added multiple nodes, events started to go missing.

Hmm, interesting - so just to be clear, are you adding Keeper nodes or ClickHouse nodes?
And are you using something like https://clickhouse.com/docs/en/sql-reference/table-functions/cluster with your SELECT statement to make sure you're getting all the entries across replicas? (It's eventual consistency, I believe, not immediate. There are also more performance details on that doc page around Distributed tables that could help.)
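For illustration, the state SELECT could be built to go through the `clusterAllReplicas` table function so it reads entries from every replica rather than whichever node the load balancer picked. This is a hedged sketch in the style of the earlier KeeperStateProvider snippet; the cluster name, database, and `connect_state` table name here are assumptions, not the connector's actual configuration:

```java
public class ClusterSelectSketch {
    // Build a SELECT that queries the state table across all replicas
    // of the named cluster, instead of only the local node.
    static String stateSelect(String cluster, String database, String key) {
        return String.format(
            "SELECT * FROM clusterAllReplicas('%s', '%s', 'connect_state') WHERE `key` = '%s'",
            cluster, database, key);
    }

    public static void main(String[] args) {
        System.out.println(stateSelect("default", "system_kafka", "events-0"));
    }
}
```

Note this only widens what the SELECT can see; it does not by itself remove the replication lag window, since a replica may still answer before the latest state row has replicated to it.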
I'm going to close this since we haven't heard anything, but if it's continuing to happen please let us know!
Thanks @Paultagoras @mzitnik, I will let you guys know on Slack. As of now I haven't had the issue; one instance was probably due to a timeout, but apart from that nothing.
Thanks again for helping here :)