I found that max_copy_rate_megabytes_per_disk causes group check timeouts, leading to ballot growth on primary replica shards. Some nodes generate core dumps, and in severe cases nodes become unresponsive and appear dead.
This issue can be reliably reproduced.
What did you do?
a. Execute the command: restore -c c4tst-function3 -a lpf_test -i 27 -t 1713285672087 -b hdfs_zjy -r /user/s_pegasus/lpfbackup
b. Execute multiple times: server-config replica max_copy_rate_megabytes_per_disk set NUM, with NUM = [10, 50, 10, 50, 10].
Phenomenon
Many group check timeouts occur and nodes core dump. The core dump content is the same as in the core dump file above.
log.2.txt:E2024-04-23 19:08:33.428 (1713870513428314325 9853) replica.replica11.0404000b000003d1: replica_failover.cpp:68:handle_remote_failure(): 31.42@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:09:06.711 (1713870546711168146 9853) replica.replica11.0404000b000003f2: replica_failover.cpp:68:handle_remote_failure(): 31.42@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:10:10.238 (1713870610238377803 9853) replica.replica11.0404000b00000411: replica_failover.cpp:68:handle_remote_failure(): 31.42@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:10:24.219 (1713870624219417888 9845) replica.replica3.04040003000002da: replica_failover.cpp:68:handle_remote_failure(): 31.82@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:10:47.670 (1713870647670827213 9845) replica.replica3.04040003000002e5: replica_failover.cpp:68:handle_remote_failure(): 31.82@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:10:59.988 (1713870659988397675 9847) replica.replica5.0404000500000214: replica_failover.cpp:68:handle_remote_failure(): 31.12@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
log.2.txt:E2024-04-23 19:10:59.988 (1713870659988935233 9861) replica.replica19.04040013000007cd: replica_failover.cpp:68:handle_remote_failure(): 29.0@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
Case 1
RPC_GROUP_CHECK_ACK is sent with a delay of 9 seconds.
Primary replica sends:
D2024-04-23 19:10:02.688 (1713870602688110786 9853) replica.replica11.0404000b00000410: replica_check.cpp:191:call(): rpc_name = RPC_GROUP_CHECK, remote_addr = x.x.x.x:45101, header_format = RPC_GROUP_CHECK, task_id = 289356323303195665, seq_id = 14982, trace_id = 0eb2a9e82c2277de
log.2.txt:D2024-04-23 19:10:02.688 (1713870602688121529 9853) replica.replica11.0404000b00000410: replica_check.cpp:140:broadcast_group_check(): 31.42@x.x.x.x:45101:
send group check to x.x.x.x:45101 with state replication::partition_status::PS_SECONDARY, task_id 289356323303195665, ballot 7
log.2.txt:E2024-04-23 19:10:10.238 (1713870610238377803 9853) replica.replica11.0404000b00000411: replica_failover.cpp:68:handle_remote_failure(): 31.42@x.x.x.x:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101
Secondary replica replies:
log.3.txt:I2024-04-23 19:10:02.688 (1713870602688201671 84230) replica.io-thrd.84230: network.cpp:415:on_recv_message(): LPF rpc_name = RPC_GROUP_CHECK, this_recv_addr = x.x.x.x:64140, header_format = NET_HDR_DSN, seq_id = 14982, trace_id = 0eb2a9e82c2277de
D2024-04-23 19:10:02.688 (1713870602688281016 84284) replica.replica11.04004906000108f0: replica_stub.cpp:1215:on_group_check(): 31.42@x.x.x.x:45101: received group check, primary = x.x.x.x:45101, ballot = 7, status = replication::partition_status::PS_SECONDARY, last_committed_decree = 200457
D2024-04-23 19:10:02.688 (1713870602688323625 84284) replica.replica11.04004906000108f0: replica_check.cpp:164:on_group_check(): [31.42@x.x.x.x:45101] process group check, primary = x.x.x.x:45101, ballot = 7, status = replication::partition_status::PS_SECONDARY, last_committed_decree = 200457, confirmed_decree = -1
log.3.txt:I2024-04-23 19:10:11.198 (1713870611198691041 84284) replica.replica11.04004906000108f0: rpc_engine.cpp:742:reply(): LPF response rpc_name = RPC_GROUP_CHECK_ACK, from_addr = x.x.x.x:45101, to_addr = x.x.x.x:45101, header_format = NET_HDR_DSN, seq_id = 14982, trace_id = 0eb2a9e82c2277de
Reason
I suspect that sleep_for causes the replica.default thread pool to sleep excessively, which delays the sending of RPC_GROUP_CHECK_ACK. However, it is still unclear exactly how this happens.
bool consumeWithBorrowAndWait(double toConsume,
                              double rate,
                              double burstSize,
                              double nowInSeconds = defaultClockNow())
{
    auto res = consumeWithBorrowNonBlocking(toConsume, rate, burstSize, nowInSeconds);
    if (res.get_value_or(0) > 0) {
        // The rate limiter over-drew the bucket; nap until the borrowed
        // tokens have actually been earned. The nap runs on the calling
        // thread, i.e. a replica.default worker.
        int64_t napUSec = res.get() * 1000000;
        std::this_thread::sleep_for(std::chrono::microseconds(napUSec));
    }
    return res.is_initialized();
}
When max_copy_rate_megabytes_per_disk is set to 10, seconds-level sleeps cause the replica.default thread pool to sleep excessively:
D2024-04-23 19:10:02.341 84238) replica.default1.040100070000298f: sleep,microseconds is 3199715,
D2024-04-23 19:10:02.343 84239) replica.default2.0401000700002993: sleep,microseconds is 3598415,
D2024-04-23 19:10:02.440 84241) replica.default4.0401000700002999: sleep,microseconds is 3901018,
D2024-04-23 19:10:02.712 84243) replica.default6.040100080000296b: sleep,microseconds is 4029154,
D2024-04-23 19:10:02.714 84245) replica.default8.0401000700002995: sleep,microseconds is 4427141,
D2024-04-23 19:10:02.744 84242) replica.default5.0401000700002997: sleep,microseconds is 4797563,
D2024-04-23 19:10:03.144 84237) replica.default0.0401000800002969: sleep,microseconds is 4797490,
D2024-04-23 19:10:03.544 84240) replica.default3.0401000200001e72: sleep,microseconds is 4797061,
D2024-04-23 19:10:03.943 84247) replica.default10.0401000a00002990: sleep,microseconds is 4798536,
D2024-04-23 19:10:04.343 84244) replica.default7.040100070000299b: sleep,microseconds is 4798140,
D2024-04-23 19:10:04.746 84246) replica.default9.04010001000024e5: sleep,microseconds is 4795031,
D2024-04-23 19:10:05.142 84248) replica.default11.0401000600001f65: sleep,microseconds is 372727,
D2024-04-23 19:10:05.515 84248) replica.default11.0401000600001f65: sleep,microseconds is 377530,
D2024-04-23 19:10:05.542 84238) replica.default1.0401000200001e74: sleep,microseconds is 4399154,
D2024-04-23 19:10:05.904 84248) replica.default11.0401000b00002296: sleep,microseconds is 4093229,
D2024-04-23 19:10:06.75 84239) replica.default2.040100040000260e: sleep,microseconds is 217558,
D2024-04-23 19:10:06.294 84239) replica.default2.0401000900002654: sleep,microseconds is 4103047,
D2024-04-23 19:10:06.358 84241) replica.default4.0401000600001f73: sleep,microseconds is 334393,
D2024-04-23 19:10:06.693 84241) replica.default4.040100080000296f: sleep,microseconds is 399535,
D2024-04-23 19:10:06.745 84243) replica.default6.040100080000296d: sleep,microseconds is 747141,
D2024-04-23 19:10:07.93 84241) replica.default4.040100050000229c: sleep,microseconds is 799505,
D2024-04-23 19:10:07.142 84245) replica.default8.040100050000229a: sleep,microseconds is 3655019,
D2024-04-23 19:10:07.493 84243) replica.default6.04010000000026db: sleep,microseconds is 799331,
D2024-04-23 19:10:07.542 84242) replica.default5.04010000000026d9: sleep,microseconds is 3255109,
D2024-04-23 19:10:07.895 84241) replica.default4.0401000300002787: sleep,microseconds is 2901920,
D2024-04-23 19:10:07.944 84237) replica.default0.0401000300002789: sleep,microseconds is 748869,
D2024-04-23 19:10:08.294 84243) replica.default6.040100030000278b: sleep,microseconds is 798618,
D2024-04-23 19:10:08.344 84240) replica.default3.0401000a00002992: sleep,microseconds is 2853312,
D2024-04-23 19:10:08.695 84237) replica.default0.0401000a00002994: sleep,microseconds is 797321,
D2024-04-23 19:10:08.743 84247) replica.default10.040100070000299d: sleep,microseconds is 3254699,
D2024-04-23 19:10:09.94 84243) replica.default6.0401000900002660: sleep,microseconds is 3702864,
D2024-04-23 19:10:09.142 84244) replica.default7.0401000900002662: sleep,microseconds is 1150250,
D2024-04-23 19:10:09.493 84237) replica.default0.0401000b0000229a: sleep,microseconds is 799877,
D2024-04-23 19:10:09.541 84246) replica.default9.0401000b0000229c: sleep,microseconds is 751294,
D2024-04-23 19:10:09.944 84238) replica.default1.0401000b0000229e: sleep,microseconds is 348727,
from incubator-pegasus.
max_copy_rate_megabytes_per_disk causes a large number of threads in the replica.default thread pool to sleep. nfs_service_impl::on_copy, which also runs on replica.default, can trigger the same issue. In this thread pool:
- Timely processing of remote_command is affected. For example:
>>> server_info -r
COMMAND: server-info
CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:10:53
CALL [meta-server] [xxxx:45001] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:01
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:15
CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-04-15 17:50:28
CALL [replica-server] [xxxx:45101] failed: ERR_TIMEOUT
CALL [replica-server] [xxxx:45101] succeed: Pegasus Server 2.4.3-without-slog (84632317ee48b2c6c013f41d0e6f73ad4955b6bb) Release, Started at 2024-03-22 14:11:22
Succeed count: 5
Failed count: 2
- Some RPCs are delayed in transmission, for example: RPC_CM_CONFIG_SYNC, RPC_CM_DUPLICATION_SYNC, RPC_NFS_COPY. May be related: #1840
- The group check in replica.replica indirectly hits ERR_TIMEOUT, causing ballot increases, as the table below shows:
[general]
app_name : lpf_test
app_id : 46
partition_count : 100
max_replica_count : 3
[replicas]
pidx ballot replica_count primary secondaries
0 6 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
1 7 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
2 5 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
3 9 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
4 5 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
5 6 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
6 7 3/3 x.x.x.x:55101 [x.x.x.x:55101,x.x.x.x:55101]
Solution
- If max_copy_rate_megabytes_per_disk is not set too low and is not adjusted frequently, the problem is unlikely to be triggered.
- Alternatively, replacing the default RPC thread pool for the NFS-related operations in nfs_server_impl.cpp resolves the issue. For example:
-DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
DEFINE_TASK_CODE_RPC(RPC_NFS_GET_FILE_SIZE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
// test timer task code
DEFINE_TASK_CODE(LPC_NFS_REQUEST_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
-DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
Changing the thread pool solves the problem above, but it is unclear whether it introduces other risks.