Comments (13)
/assign @weiliu1031
/unassign
from milvus.
/assign @bigsheeper
please help on it
from milvus.
segment 450758977781515919
's data is lost from pulsar.
after DN recovering and seek to 17:31:42, the msg in between 17:31:42 are lost cpTime=2024/06/27 17:46:30.521 are lost.
so the flush is timeout.
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:53] ["datanode seek successfully when register to msgDispatcher"] [nodeID=13] [collectionID=450758977781515635] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [msgID="\u0008\n\u0010\ufffd\u0006\u0018\u0000 \u0000"] [tsTime=2024/06/27 17:31:42.121 +00:00] [tsLag=14m39.081598088s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/data_sync_service.go:93] ["dataSyncService starting flow graph"] [collectionID=450758977781515635] [vChanName=by-dev-rootcoord-dml_10_450758977781515635v1]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:379] ["Stop timer for ToWatch operation succeeded"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [opID=450758977782726625] [timeout=5m0s]
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:194] ["Success to watch"] [opID=450758977782726625] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [State=WatchSuccess]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/dispatcher.go:149] ["add new target"] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [isMain=true]
[2024/06/27 17:46:21.373 +00:00] [INFO] [msgdispatcher/manager.go:218] ["merge done"] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"]
[2024/06/27 17:46:21.784 +00:00] [INFO] [datanode/services.go:381] ["DataNode receives CheckChannelOperationProgress"] [traceID=c4cc98513456be9c1824b08d196f446a] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [operation=ToWatch]
[2024/06/27 17:47:00.720 +00:00] [DEBUG] [writebuffer/write_buffer.go:291] ["checkpoint from latest consumed msg"] [collectionID=450758977781515635] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTimestamp=450759331812737025]
[2024/06/27 17:47:02.392 +00:00] [DEBUG] [datanode/flow_graph_time_tick_node.go:122] ["UpdateChannelCheckpoint success"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTs=450759331812737025] [cpTime=2024/06/27 17:46:30.521 +00:00]
5977 [2024/06/27 17:55:51.734 +00:00] [INFO] [datacoord/services.go:1241] ["DataCoord receive GetFlushState request, Flushed is false"] [traceID=9cc685a95c1cfb1ffd54f70ba951b85b] [collection=450758977781515635] [flushTs=2024/06/27 17:55:42.671 +00:00] [unflushed="[450758977781515919]"] [len=1]
from milvus.
segment
450758977781515919
's data is lost from pulsar.after DN recovering and seek to 17:31:42, the msg in between 17:31:42 are lost cpTime=2024/06/27 17:46:30.521 are lost.
so the flush is timeout.
[2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/flow_graph_dmstream_input_node.go:53] ["datanode seek successfully when register to msgDispatcher"] [nodeID=13] [collectionID=450758977781515635] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [msgID="\u0008\n\u0010\ufffd\u0006\u0018\u0000 \u0000"] [tsTime=2024/06/27 17:31:42.121 +00:00] [tsLag=14m39.081598088s] [2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/data_sync_service.go:93] ["dataSyncService starting flow graph"] [collectionID=450758977781515635] [vChanName=by-dev-rootcoord-dml_10_450758977781515635v1] [2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:379] ["Stop timer for ToWatch operation succeeded"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [opID=450758977782726625] [timeout=5m0s] [2024/06/27 17:46:21.202 +00:00] [INFO] [datanode/channel_manager.go:194] ["Success to watch"] [opID=450758977782726625] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [State=WatchSuccess] [2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/manager.go:198] ["start merging..."] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"] [2024/06/27 17:46:21.344 +00:00] [INFO] [msgdispatcher/dispatcher.go:149] ["add new target"] [vchannel=by-dev-rootcoord-dml_10_450758977781515635v1] [isMain=true] [2024/06/27 17:46:21.373 +00:00] [INFO] [msgdispatcher/manager.go:218] ["merge done"] [role=datanode] [nodeID=13] [vchannel="{\"by-dev-rootcoord-dml_10_450758977781515635v1\":{}}"] [2024/06/27 17:46:21.784 +00:00] [INFO] [datanode/services.go:381] ["DataNode receives CheckChannelOperationProgress"] [traceID=c4cc98513456be9c1824b08d196f446a] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [operation=ToWatch] [2024/06/27 17:47:00.720 +00:00] [DEBUG] [writebuffer/write_buffer.go:291] ["checkpoint from latest consumed msg"] [collectionID=450758977781515635] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTimestamp=450759331812737025] [2024/06/27 17:47:02.392 +00:00] [DEBUG] [datanode/flow_graph_time_tick_node.go:122] ["UpdateChannelCheckpoint success"] [channel=by-dev-rootcoord-dml_10_450758977781515635v1] [cpTs=450759331812737025] [cpTime=2024/06/27 17:46:30.521 +00:00]
5977 [2024/06/27 17:55:51.734 +00:00] [INFO] [datacoord/services.go:1241] ["DataCoord receive GetFlushState request, Flushed is false"] [traceID=9cc685a95c1cfb1ffd54f70ba951b85b] [collection=450758977781515635] [flushTs=2024/06/27 17:55:42.671 +00:00] [unflushed="[450758977781515919]"] [len=1]
Yes, after the datanode restart, the flowgraph did not receive insert data of segment 450758977781515919. There are several possible reasons for this:
- The channel checkpoint was incorrectly advanced past segment 450758977781515919 (without flush/sync).
- The insert data consumed was incorrectly skipped.
- Data was lost in Pulsar (although data was present in Pulsar before the datanode restart).
from milvus.
it is not a stable reproduced issue. so it is a bit hard to verify and reproduce
from milvus.
it is not a stable reproduced issue. so it is a bit hard to verify and reproduce
Try to add some key logs; we'll look into it if it happens again.
from milvus.
There is a reproduced case in 2.4-20240708-f6cd8416-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-for-release-cron/detail/chaos-test-for-release-cron/14352/pipeline
log:
artifacts-pulsar-pod-kill-14352-server-logs.tar.gz
[2024-07-08T21:33:29.882Z] <name>: SearchChecker__GGsK0j2n
[2024-07-08T21:33:29.882Z] <description>:
[2024-07-08T21:33:29.882Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT...... (api_request.py:37)
[2024-07-08T21:33:29.882Z] [2024-07-08 21:30:21 - DEBUG - ci_test]: (api_request) : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - WARNING - pymilvus.decorators]: Retry timeout: 180s (decorators.py:106)
[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)>, <Time:{'RPC start': '2024-07-08 21:30:21.415697', 'RPC error': '2024-07-08 21:33:21.580120'}> (decorators.py:146)
[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - ci_test]: Traceback (most recent call last):
[2024-07-08T21:33:29.882Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
[2024-07-08T21:33:29.882Z] res = func(*args, **_kwargs)
[2024-07-08T21:33:29.882Z] File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request
[2024-07-08T21:33:29.882Z] return func(*arg, **kwargs)
[2024-07-08T21:33:29.882Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 319, in flush
[2024-07-08T21:33:29.882Z] conn.flush([self.name], timeout=timeout, **kwargs)
[2024-07-08T21:33:29.882Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 147, in handler
[2024-07-08T21:33:29.882Z] raise e from e
[2024-07-08T21:33:29.882Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 143, in handler
[2024-07-08T21:33:29.882Z] return func(*args, **kwargs)
[2024-07-08T21:33:29.882Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 182, in handler
[2024-07-08T21:33:29.882Z] return func(self, *args, **kwargs)
[2024-07-08T21:33:29.882Z] File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 107, in handler
[2024-07-08T21:33:29.882Z] raise MilvusException(
[2024-07-08T21:33:29.882Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)>
[2024-07-08T21:33:29.882Z] (api_request.py:45)
[2024-07-08T21:33:29.882Z] [2024-07-08 21:33:21 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Retry timeout: 180s, message=wait for flush timeout, collection: SearchChecker__GGsK0j2n, flusht_ts: 451011994230456324)> (api_request.py:46)
from milvus.
reproduced in master-20240709-eeb03a0d-amd64
datanode chaos test
log:artifacts-datanode-pod-kill-15588-server-logs.tar.gz
from milvus.
The cause of this issue is different; it's due to tt delay:
from milvus.
reproduced in
master-20240709-eeb03a0d-amd64
datanode chaos test
dispatcher merged by mistake, the insert data consumed was incorrectly skipped:
from milvus.
/assign @zhuwenxing
should be fixed
from milvus.
/unassign
from milvus.
verified and not reproduced in master-20240713-2c462d38-amd64
from milvus.
Related Issues (20)
- [Bug]: is too hard to make the project,i try my best. HOT 11
- [Bug]: Memory size and delta count in SegmentInfo may be inaccurate
- [Bug]: Unable to Delete Data from Collection After Field Modification in Milvus HOT 3
- [Bug]: "print: command not found" when compile HOT 3
- [Bug]: The `authorizationEnabled` flag reverts to its original state in Milvus standalone mode when using Docker. HOT 2
- [Bug]: checkResultTicker leak HOT 3
- [Bug]: `BloomFilterSet.BatchPkExist` may return false negative when `K` values differ among candidates HOT 2
- [Enhancement]: Check for channel cp lag can be removed
- [Bug]: Strong consistency , delete action, doesnt actually deletes immediately, i could still see the search results HOT 4
- [Feature]: Support to replicate the rbac operation HOT 1
- [Bug]: watch channel stuck forever HOT 2
- [Bug]: milvus crash after dropping a collection if compaction disabled HOT 1
- [Bug]: The decribe index api returns an incoherent response structure when using orm create index and milvusclient create index respectively HOT 1
- [Bug]: indices is empty when maping sparse float vector HOT 1
- [Bug]: pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=unrecoverable error)> HOT 2
- [Bug]: Before exiting, make sure the goroutine has exited HOT 1
- [Bug]: err has degenerated into a new variable, which cannot be captured in defer. HOT 1
- [Bug]: Vector Search() bug HOT 9
- [Feature]: Partial load collection at field/column level
- [Bug]: querynode got restarted during test after indexcoord pod kill chaos test HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from milvus.