Comments (54)

activenodes commented on August 10, 2024

@tony-iqlusion Like clockwork, the process froze again 3 days and 15 hours after the last restart.

Here's the new attachment :-)

[New LWP 2631574]
[New LWP 2631575]
[New LWP 2631576]
[New LWP 2631577]
[New LWP 2631578]
[New LWP 2631579]
[New LWP 2631580]
[New LWP 2631581]
[New LWP 2631582]
[New LWP 2631583]
[New LWP 2631584]
[New LWP 2631585]
[New LWP 2631586]
[New LWP 2631587]
[New LWP 2631588]
[New LWP 2631589]
[New LWP 2631590]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=140269092235008, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145     pthread_join_common.c: No such file or directory.
(gdb) set logging on
Copying output to gdb.txt.
Copying debug output to gdb.txt.
(gdb) thread apply all bt

Thread 18 (Thread 0x7f92f049d700 (LWP 2631590)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 17 (Thread 0x7f92f069e700 (LWP 2631589)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 16 (Thread 0x7f92f089f700 (LWP 2631588)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 15 (Thread 0x7f92f0aa0700 (LWP 2631587)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697ef9 in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 14 (Thread 0x7f92f0ca1700 (LWP 2631586)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 13 (Thread 0x7f92f0ea2700 (LWP 2631585)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 12 (Thread 0x7f92f10a3700 (LWP 2631584)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 11 (Thread 0x7f92f12a4700 (LWP 2631583)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 10 (Thread 0x7f92f14a5700 (LWP 2631582)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e8e88 in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 9 (Thread 0x7f92f16a6700 (LWP 2631581)):
#0  __lll_lock_wait (futex=futex@entry=0x555700c9ce50, private=0) at lowlevellock.c:52
#1  0x00007f92f3c6c0a3 in __GI___pthread_mutex_lock (mutex=0x555700c9ce50) at ../nptl/pthread_mutex_lock.c:80
#2  0x00005557006940e5 in yubihsm::client::Client::session ()
#3  0x0000555700697e6d in yubihsm::client::Client::sign_ed25519 ()
#4  0x000055570068db50 in <yubihsm::ed25519::signer::Signer as signature::signer::Signer<ed25519::Signature>>::try_sign ()
#5  0x000055570060c799 in tmkms::keyring::ed25519::Signer::sign ()
#6  0x00005557005cfb76 in tmkms::keyring::KeyRing::sign_ed25519 ()
#7  0x00005557005e9b4d in tmkms::session::Session::handle_request ()
#8  0x0000555700629278 in std::panicking::try ()
#9  0x0000555700634811 in tmkms::client::main_loop ()
#10 0x0000555700638719 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#11 0x00005557005af66a in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#12 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#13 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#14 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#15 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#16 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 8 (Thread 0x7f92f2c2c700 (LWP 2631580)):
#0  0x00007f92f3a2eaff in __GI___poll (fds=0x7f92f2c2bcc0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f92f3caf1a2 in ?? () from /lib/x86_64-linux-gnu/libusb-1.0.so.0
#2  0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#3  0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 7 (Thread 0x7f92f2e2d700 (LWP 2631579)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f92e4000c78) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f92e4000b60, cond=0x7f92e4000c50) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f92e4000c50, mutex=0x7f92e4000b60) at pthread_cond_wait.c:638
#3  0x00005557006b9388 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x00005557006c446f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#6  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#7  std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#8  0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 6 (Thread 0x7f92f302e700 (LWP 2631578)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f92e4000c78) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f92e4000b60, cond=0x7f92e4000c50) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f92e4000c50, mutex=0x7f92e4000b60) at pthread_cond_wait.c:638
#3  0x00005557006b9388 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x00005557006c446f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#6  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#7  std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#8  0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 5 (Thread 0x7f92f322f700 (LWP 2631577)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f92e4000c78) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f92e4000b60, cond=0x7f92e4000c50) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f92e4000c50, mutex=0x7f92e4000b60) at pthread_cond_wait.c:638
#3  0x00005557006b9388 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x00005557006c446f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#6  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#7  std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#8  0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 4 (Thread 0x7f92f3430700 (LWP 2631576)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f92e4000c78) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f92e4000b60, cond=0x7f92e4000c50) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f92e4000c50, mutex=0x7f92e4000b60) at pthread_cond_wait.c:638
#3  0x00005557006b9388 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#4  0x00005557006c446f in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#5  0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#6  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#7  std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#8  0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#9  0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f92f3631700 (LWP 2631575)):
#0  0x00007f92f3a3cc90 in accept4 (fd=3, addr=..., addr_len=0x7f92f363088c, flags=524288) at ../sysdeps/unix/sysv/linux/accept4.c:32
#1  0x000055570086f362 in std::sys::unix::net::Socket::accept::{{closure}} () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/net.rs:206
#2  std::sys::unix::cvt_r () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/mod.rs:218
#3  std::sys::unix::net::Socket::accept () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/net.rs:205
#4  std::sys_common::net::TcpListener::accept () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys_common/net.rs:405
#5  std::net::tcp::TcpListener::accept () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/net/tcp.rs:757
#6  0x00005557006b8ba9 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#7  0x00005557006c45df in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#8  0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#9  <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#10 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#11 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f92f3832700 (LWP 2631574)):
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7f92ec001888) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f92ec000b60, cond=0x7f92ec001860) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7f92ec001860, mutex=0x7f92ec000b60) at pthread_cond_wait.c:638
#3  0x00005557006bc8bb in tiny_http::util::messages_queue::MessagesQueue<T>::pop ()
#4  0x00005557006aeabf in tiny_http::Server::recv ()
#5  0x000055570068bc46 in yubihsm::connector::http::server::Server::handle_request ()
#6  0x000055570068bc06 in yubihsm::connector::http::server::Server::run ()
#7  0x000055570063e3de in tmkms::yubihsm::run_connnector_server::{{closure}} ()
#8  0x00005557006386be in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#9  0x00005557005af4b1 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#10 0x000055570087d0ca in <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#11 <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/alloc/src/boxed.rs:1521
#12 std::sys::unix::thread::Thread::new::thread_start () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:71
#13 0x00007f92f3c69609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x00007f92f3a3b293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f92f38e7940 (LWP 2631564)):
#0  __pthread_clockjoin_ex (threadid=140269092235008, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x000055570087d1ad in std::sys::unix::thread::Thread::join () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:177
#2  0x00005557005ae716 in std::thread::JoinHandle<T>::join ()
#3  0x00005557005fcd78 in <tmkms::commands::start::StartCommand as abscissa_core::runnable::Runnable>::run ()
#4  0x00005557005804ba in abscissa_core::application::Application::run ()
#5  0x000055570059489b in abscissa_core::application::boot ()
#6  0x000055570057c91d in tmkms::main ()
#7  0x000055570057a253 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#8  0x000055570057a269 in std::rt::lang_start::{{closure}} ()
#9  0x0000555700875ce7 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/ops/function.rs:259
#10 std::panicking::try::do_call () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panicking.rs:379
#11 std::panicking::try () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panicking.rs:343
#12 std::panic::catch_unwind () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panic.rs:431
#13 std::rt::lang_start_internal () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/rt.rs:51
#14 0x000055570057c942 in main ()
(gdb)

from tmkms.

tony-iqlusion commented on August 10, 2024

I've released v0.8.0-alpha1, which corrects a few minor things that might be going amiss, but also adds considerably more debug logging:

https://forum.cosmos.network/t/ann-tendermint-kms-v0-8-tendermint-v0-33-support/3704

If you can reproduce this issue while running in verbose mode (i.e. with the -v command line argument), it'd be very helpful in nailing this down.

We're presently running this release in production.

gaia commented on August 10, 2024

Just FYI, we also experienced the same issue once, a couple of weeks ago, while running 0.7.2. It didn't occur again, and as of 4 days ago we are on 0.8.0.

tony-iqlusion commented on August 10, 2024

We haven't had any reports of this re-occurring since the v0.8 release, so I'm going to go ahead and close this. Please let me know if that changes.

tony-iqlusion commented on August 10, 2024

For anyone encountering this issue in production (we have never encountered it before, so that makes debugging that much harder), it would be extremely helpful if someone would attach to the process with something like gdb or lldb and obtain a backtrace (e.g. with the bt command).

That said, Tendermint v0.35 will include a gRPC-based privval connection between TMKMS and a validator, which will completely change the networking stack on both the TMKMS and Tendermint sides. Perhaps that will indirectly fix the issue; if not, it will at least help isolate where it's occurring.
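For reference, a one-shot capture along these lines avoids an interactive gdb session entirely. This is only a sketch: `backtrace_cmd` and `capture_backtrace` are hypothetical helper names, and it assumes gdb is installed with permission to ptrace the target process.

```python
import subprocess

def backtrace_cmd(pid: int) -> list[str]:
    """Build a non-interactive gdb invocation that dumps every thread's stack."""
    return [
        "gdb", "-p", str(pid), "-batch",
        "-ex", "set pagination off",   # suppress --Type <RET>-- pager prompts
        "-ex", "thread apply all bt",
    ]

def capture_backtrace(pid: int, out_path: str = "tmkms-backtrace.txt") -> None:
    """Attach to the process, dump all stacks to out_path, then detach."""
    with open(out_path, "w") as out:
        subprocess.run(backtrace_cmd(pid), stdout=out, stderr=subprocess.STDOUT)

# Usage (needs root or ptrace permission for the target process):
# pid = int(subprocess.check_output(["pidof", "tmkms"]).split()[0])
# capture_backtrace(pid)
```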

tony-iqlusion commented on August 10, 2024

If you can build TMKMS in debug mode (i.e. omit the --release flag when running cargo build), attach with gdb or lldb, and run bt to get a backtrace of its current state when the problem occurs, that would be helpful too.

tony-iqlusion commented on August 10, 2024

As a general statement: TMKMS only works when literally everything else is working. A "TMKMS freeze" means something anywhere in your stack has gone wrong.

All of the debugging steps I've suggested are ones which would actually isolate TMKMS as the culprit, if it is the culprit. So far, in all of the debugging info I've ever been supplied, we haven't seen any telltale signs that TMKMS was ever the issue.

If there really is an issue with TMKMS itself, it would be extremely helpful to have some evidence it's actually the source of the problem, and not a symptom.

activenodes commented on August 10, 2024

.. the TMKMS VM is on the same physical host as the validator, ..

As for my system: it's a physical server, not a virtual server/VM.

@AC-Validator which method are you using to detect logging stopped?

It's part of my monitoring suite, but it's something like this (very simple; it was a recent "add-on" for this issue):

import socket
import subprocess

process = "tmkms"

def get_journalctl(process):
  try:
    journalctl = subprocess.check_output("journalctl -u " + process + ".service --since \"60 sec ago\"", shell=True)
[..]

log = get_journalctl(process)
log_split = log.splitlines()
[..]

if len(log_split) < 5 and socket.gethostname() == master:
  # NO NEW LOG
  subprocess.call("systemctl restart " + process, shell=True)

Remove "and socket.gethostname() == master" if you only have 1 TMKMS
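A self-contained version of that watchdog might look like the sketch below. This is not the poster's actual script: the unit name, the MIN_LINES threshold, and the MASTER hostname are all assumptions to adapt to your own setup.

```python
import socket
import subprocess

PROCESS = "tmkms"
MASTER = "kms-active-1"   # hypothetical hostname of the active KMS node
MIN_LINES = 5             # fewer journal lines than this in 60 s => "frozen"

def get_journal(process: str) -> str:
    """Return the last 60 seconds of journal output for a systemd unit."""
    return subprocess.check_output(
        ["journalctl", "-u", f"{process}.service", "--since", "60 sec ago"],
        text=True,
    )

def looks_frozen(log: str, min_lines: int = MIN_LINES) -> bool:
    """Heuristic: a healthy signer logs every block, so a near-empty window is suspicious."""
    return len(log.splitlines()) < min_lines

def check_and_restart() -> None:
    # Only the active node of an active/passive pair should self-restart.
    if looks_frozen(get_journal(PROCESS)) and socket.gethostname() == MASTER:
        subprocess.call(["systemctl", "restart", PROCESS])
```

Run check_and_restart() from cron or a systemd timer every minute; drop the hostname check if you run a single instance.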

tony-iqlusion commented on August 10, 2024

Let me just reiterate: this problem has been impossible for us to debug at iqlusion because we have never experienced it a single time, and the people who have are not providing us with anything to go on in order to debug it.

tony-iqlusion commented on August 10, 2024

Use gdb -p <pid> or lldb -p <pid> to attach to a running process, then run bt to see a backtrace.

activenodes commented on August 10, 2024

OK, that makes sense. We run a max of 2 validators per YubiHSM.

Analyzing the response latencies, it seems that the YubiHSM 2 has no particular problems (it also depends on the blocking time of each chain), so I prefer active/passive clusters with this number of connections (~10).

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024 1

Not quite yet... there's still some more work on the yubihsm crate I want to do, then I'll cut another release and update tmkms

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024 1

No, using cargo install that way will install the latest version from crates.io.

You need to do:

cargo install --path . --features=yubihsm-server,softsign

activenodes avatar activenodes commented on August 10, 2024 1

A quick update:
Active: active (running) since Thu 2021-12-09 21:26:45 CET; 6 days ago
Usually the crash happened sooner than this.
We'll keep waiting, hoping not to have to restart the service.
In the meantime .. thanks ;)

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024 1

Going to go ahead and close this.

If anyone experiences another freeze with tmkms v0.11.0-pre.2 or subsequent releases, please leave a comment and I'll reopen this issue.

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

There have been reports of similar issues in the past (e.g. tendermint/tmkms#389) however I have never been able to reproduce them.

Can you see if you can reproduce the issue while running tmkms under strace?

mattharrop avatar mattharrop commented on August 10, 2024

I've started strace watching the tmkms process. It generates enormous log files! I'll keep an eye on it to avoid running out of disk space, and keep it running until the next halt. I'm not aware of any way to trigger the failure, so it may take a while.

tarcieri avatar tarcieri commented on August 10, 2024

@mattharrop I've just released v0.7.3 which includes a number of bugfixes. Not sure any of them will help this issue per se, but there were some related to YubiHSM device (re-)enumeration which could potentially be the source of this problem

mattharrop avatar mattharrop commented on August 10, 2024

And just now, after three weeks of running under strace, I finally have a freeze. I have 28G of logs. How many lines or how much time do you want?

tarcieri avatar tarcieri commented on August 10, 2024

I only need a little bit of recent context, maybe the last 10,000 lines or so?

greg-szabo avatar greg-szabo commented on August 10, 2024

Hi! We just ran into the same issue. It's our second time; the first time, I think we just upgraded tmkms.

@mattharrop which version did you run for the strace?

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

We managed to isolate a probable location where the problem is occurring. I'll add some more debug logging around this area then cut a v0.7.4 release.

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

Looks like #60 triggered auto-close. It (likely) doesn't actually fix the issue, but it should help debug it.

gaia avatar gaia commented on August 10, 2024

Again last night :(

Dec 20 02:08:30 hsm tmkms[25469]: Dec 20 02:08:30.390  INFO tmkms::session: [cosmoshub-3@tcp://a.b.c.e:26658] signed PreCommit:4676990D86 at h/r/s 4493122/0/2 (126 ms)
Dec 20 02:08:31 hsm tmkms[25469]: Dec 20 02:08:31.636  INFO tmkms::session: [columbus-4@tcp://a.b.c.d:26658] signed PreVote:1423FF2F98 at h/r/s 1045895/0/1 (127 ms)
Dec 20 02:17:24 hsm systemd[1]: Stopping TMKMS...
Dec 20 02:17:24 hsm systemd[1]: Stopped TMKMS.
Dec 20 02:17:24 hsm systemd[1]: Started TMKMS.
Dec 20 02:17:24 hsm tmkms[7905]: Dec 20 02:17:24.920  INFO tmkms::commands::start: tmkms 0.9.0 starting up...

activenodes avatar activenodes commented on August 10, 2024

The same for me after a few weeks of uptime:

8:09:55 tmkms[372518]: #033[...[0m tmkms::session: [xxx@tcp://xx.0.4.11:26658] signed PreCommit:xxx at h/r/s 1023094/0/2 (126 ms)
8:09:57 tmkms[372518]: #033[...[0m tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreVote:yyy at h/r/s 1245586/0/1 (125 ms)
8:09:57 tmkms[372518]: #033[...[0m tmkms::session: [yyy@tcp://xx.0.3.11:26658] signed PreCommit:yyy at h/r/s 1245586/0/2 (125 ms)
8:09:57 tmkms[372518]: #033[...[0m tmkms::session: [zzz@tcp://xx.0.2.11:26658] signed PreVote:xxx at h/r/s 3964819/0/1 (156 ms)
8:11:02 systemd[1]: Stopping Cosmos tmkms...
8:11:02 systemd[1]: tmkms.service: Succeeded.
8:11:02 systemd[1]: Stopped Cosmos tmkms.
8:11:02 systemd[1]: Started Cosmos tmkms.
8:11:02 tmkms[458276]: #033[...[0m tmkms::commands::start: tmkms 0.10.0 starting up...
8:11:02 tmkms[458276]: #033[...[0m yubihsm::connector::http::server: yubihsm::http-server[127.0.0.1:12345]: listening for connections
8:11:02 kernel: [3222633.441759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 kernel: [3222633.717759] usb 1-1: reset full-speed USB device number 2 using xhci_hcd
8:11:02 tmkms[458276]: #033[...[0m tmkms::keyring: [keyring:yubihsm] added consensus Ed25519 key: ...

The server is dedicated to TMKMS

# free -h

              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       219Mi       287Mi       2.0Mi       7.2Gi       7.2Gi
Swap:         4.0Gi       1.0Mi       4.0Gi

$ tmkms version

tmkms 0.10.0

Right now I have automated restarting the process after 60 seconds without logs.
I could try to automate an "strace --attach=PID" (for 30 seconds) before restarting. Would that help?
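A sketch of how the restart hook described earlier could grab a short strace sample before restarting. The `find_pid`/`capture_strace` helpers, the 30-second window, and the output paths are illustrative assumptions, not part of the original script:

```python
import os
import subprocess

def find_pid(name):
    # Find the PID of a process by executable name by scanning /proc.
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/comm" % entry) as f:
                if f.read().strip() == name:
                    return int(entry)
        except IOError:
            continue
    return None

def capture_strace(name="tmkms", seconds=30):
    # Attach strace to the (hung) process for a short window, then detach.
    pid = find_pid(name)
    if pid is None:
        return None  # process not running
    outfile = "/tmp/%s-strace-%d.txt" % (name, pid)
    # -f follows all threads; `timeout` detaches strace after `seconds`.
    subprocess.call(
        "timeout %d strace -f -tt -p %d -o %s" % (seconds, pid, outfile),
        shell=True)
    return outfile
```

Called just before the `systemctl restart` line, this would leave a syscall trace of the stalled process on disk for later analysis (strace must be installed and the hook must run as root).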

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

If anyone is experiencing something like this, especially frequently, it'd be very helpful to get a traffic capture with a tool like tcpdump.

There are a number of things that could potentially cause this which are external to TMKMS, including it never receiving an incoming request. If you can use a tool like tcpdump to rule that out and show a case where TMKMS is actually receiving incoming network packets but never processing them, then that would be a TMKMS bug.

Based on some other anecdotal reports from people who were experiencing issues like these, they were ultimately a network problem.
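As a sketch of how such a capture could be wired into the same kind of restart hook shown earlier in this thread (the `capture_traffic` helper, port 26658 taken from the logs above, and the 30-second window are all assumptions):

```python
import subprocess

def capture_traffic(port=26658, seconds=30, outfile="/tmp/tmkms.pcap"):
    # Capture privval traffic for a short window so the pcap can later show
    # whether requests were still arriving while tmkms was silent.
    # Requires root and tcpdump to be installed.
    subprocess.call(
        "timeout %d tcpdump -i any -w %s 'tcp port %d'"
        % (seconds, outfile, port),
        shell=True)
    return outfile
```

Inspecting the resulting pcap (e.g. with Wireshark) should distinguish "TMKMS receives packets but never processes them" (a TMKMS bug) from "no requests ever arrive" (a network problem).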

activenodes avatar activenodes commented on August 10, 2024

During the issue I was able to verify that the network connections worked (the destination IPs are all different, all reachable) and that a simple restart of the TMKMS service fixed the problem.
I will try to capture traffic to see if there is incoming (and outgoing) traffic to/from the tmkms service.

gaia avatar gaia commented on August 10, 2024

Same here, a simple restart fixes it. I've set tmkms to restart every two weeks regardless; the issue seems to happen around every 3 weeks or so.

Also, IMHO it is unlikely to be a network problem for us: the TMKMS VM is on the same physical host as the validator; they are VMs under the same hypervisor.

@AC-Validator which method are you using to detect that logging has stopped?

gaia avatar gaia commented on August 10, 2024

Happened again just now; the last time was months ago (which points towards something in tmkms itself?).
I lowered the crontab entry that restarts the service from 14 days to 10 days. Let's see how long it goes now!

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

@gaia can you try any of the steps I've outlined before to collect some additional debugging information next time?

Information from a debugger or something like tcpdump would be very helpful in nailing down if TMKMS is actually the culprit.

gaia avatar gaia commented on August 10, 2024

debug mode (i.e. omit the --release when doing cargo build) and attach with gdb or lldb and run bt to get a backtrace of its current state when the problem occurs

Now running the debug build. How do I attach with gdb or lldb and run bt when the problem occurs? I don't want to guess how to do this from my own googling and produce useless output...

PS: I can rule out network issues in our situation with 100% certainty. These (validator and TMKMS) are VMs on the same physical machine, and if the virtual LAN (managed by the hypervisor) failed, all VMs would generate alerts, TMKMS or not.

activenodes avatar activenodes commented on August 10, 2024

use gdb -p <pid> or lldb -p <pid> to attach to the running process, then run bt to see a backtrace

...even with scheduled weekly reboots, tmkms sometimes freezes (always at night :()
I have deactivated my detection/restart system: #37 (comment)
and recompiled without --release.

I'll be back here at the next freeze ;)

activenodes avatar activenodes commented on August 10, 2024

Done! @tony-iqlusion

# /home/tmkms/.cargo/bin/tmkms version
tmkms 0.10.1

# gdb -p 2525291
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 2525291
[New LWP 2525303]
[New LWP 2525304]
[New LWP 2525305]
[New LWP 2525306]
[New LWP 2525307]
[New LWP 2525308]
[New LWP 2525309]
[New LWP 2525312]
[New LWP 2525313]
[New LWP 2525314]
[New LWP 2525315]
[New LWP 2525316]
[New LWP 2525317]
[New LWP 2525318]
[New LWP 2525319]
[New LWP 2525320]
[New LWP 2525321]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__pthread_clockjoin_ex (threadid=139626410120960, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
145     pthread_join_common.c: No such file or directory.
(gdb) bt
#0  __pthread_clockjoin_ex (threadid=139626410120960, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
#1  0x000055eb7e8a91ad in std::sys::unix::thread::Thread::join () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/sys/unix/thread.rs:177
#2  0x000055eb7e5da716 in std::thread::JoinHandle<T>::join ()
#3  0x000055eb7e628d78 in <tmkms::commands::start::StartCommand as abscissa_core::runnable::Runnable>::run ()
#4  0x000055eb7e5ac4ba in abscissa_core::application::Application::run ()
#5  0x000055eb7e5c089b in abscissa_core::application::boot ()
#6  0x000055eb7e5a891d in tmkms::main ()
#7  0x000055eb7e5a6253 in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#8  0x000055eb7e5a6269 in std::rt::lang_start::{{closure}} ()
#9  0x000055eb7e8a1ce7 in core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0/library/core/src/ops/function.rs:259
#10 std::panicking::try::do_call () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panicking.rs:379
#11 std::panicking::try () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panicking.rs:343
#12 std::panic::catch_unwind () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/panic.rs:431
#13 std::rt::lang_start_internal () at /rustc/2fd73fabe469357a12c2c974c140f67e7cdd76d0//library/std/src/rt.rs:51
#14 0x000055eb7e5a8942 in main ()
(gdb) Quit
(gdb) quit
A debugging session is active.

        Inferior 1 [process 2525291] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/tmkms/.cargo/bin/tmkms, process 2525291
[Inferior 1 (process 2525291) detached]

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

@AC-Validator thanks! Seems I gave you an unhelpful command though. That's just showing the main thread waiting on the others, which is expected.

I believe on gdb you'll need to run:

thread apply all bt

...to get the backtraces of all threads, rather than just the main one.
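For anyone who wants to capture this automatically (for example from a monitoring hook, just before restarting the service), here is a sketch of a non-interactive equivalent; the `capture_backtraces` helper and the output path are illustrative assumptions:

```python
import subprocess

def capture_backtraces(pid, outfile="/tmp/tmkms-gdb.txt"):
    # Run gdb non-interactively against the given PID and dump a backtrace
    # of every thread (the batch equivalent of `thread apply all bt`),
    # then detach. Requires gdb and permission to ptrace the process.
    subprocess.call(
        "gdb -p %d -batch -ex 'set pagination off' "
        "-ex 'thread apply all bt' > %s 2>&1" % (pid, outfile),
        shell=True)
    return outfile
```

The resulting file is exactly the kind of all-threads backtrace shown later in this thread, without needing someone at the keyboard when the freeze happens.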

activenodes avatar activenodes commented on August 10, 2024

Thanks @tony-iqlusion
Ok, I'll update you to the next freeze... ;)
Thanks for the support

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

Thank you! This has by far been the most helpful debugging information anyone has gotten yet.

It looks like there are 9 threads backed up attempting to acquire a mutex on the YubiHSM2 client. So it seems like something may be causing it to get wedged somehow.

One thing that would be very helpful for debugging, if anyone is interested in attempting it, is using the yubihsm-connector service from the Yubico SDK to talk to the YubiHSM2, and having TMKMS connect to it via HTTP.

Other than that, this report does give me some ideas of where to add more instrumentation to debug the issue.

gaia avatar gaia commented on August 10, 2024

@AC-Validator interesting, on the interval between failures: we haven't had a failure since lowering the cron restart schedule from 14 days to 10 days, which we did 77 days ago.

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

If there really is a specific cadence to this, it could have something to do with the session lifetime, which is also something else to investigate. I'll try to add some better instrumentation about that.

activenodes avatar activenodes commented on August 10, 2024

Thank you! This has by far been the most helpful debugging information anyone has gotten yet.

Thanks, I'm glad; I'd like to help you solve this issue.

One thing that would be very helpful for debugging, if anyone is interested in attempting it, is using the yubihsm-connector service from the Yubico SDK to talk to the YubiHSM2, and having TMKMS connect to it via HTTP.

My current build is:
cargo install tmkms --features=yubihsm-server,softsign
with this too:
git clone https://github.com/Yubico/yubihsm-shell
and setting:
connector_server = { laddr = "tcp://127.0.0.1:12345", cli = { auth_key = 2 } }

I don't know if that's what you mean.

activenodes avatar activenodes commented on August 10, 2024

@AC-Validator interesting on the interval between failures: we haven't had a failure since lowering the cron restart schedule from 14 days to 10 days, done 77 days ago.

Initially it froze after ~14 days, then ~7, and now finally after "exactly" 3 days and 15 hours.

activenodes avatar activenodes commented on August 10, 2024

If there really is a specific cadence to this, it could have something to do with the session lifetime, which is also something else to investigate. I'll try to add some better instrumentation about that.

It might depend on the number of active [[validators]] connections (currently 10) or on the number of blocks processed/signed.
Given the "exact" timing, I lean more towards the first option.

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

That would make sense. Every session has a limited lifetime in terms of the number of messages sent to the YubiHSM2, so signing on more chains would trigger the end of a session lifetime more quickly by virtue of higher message volume.
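To see why a volume-based session limit could produce a fixed wall-clock cadence, here is a back-of-envelope sketch; the chain count, messages per block, and block time are assumptions for illustration, not measured values:

```python
# ~10 chains, each signing a PreVote + PreCommit roughly every 6 seconds.
chains = 10
msgs_per_block = 2
block_time_s = 6.0

msgs_per_day = chains * msgs_per_block * (86400 / block_time_s)
print(int(msgs_per_day))  # 288000
```

At a roughly constant aggregate signing rate like this, any fixed per-session message budget is exhausted after a fixed amount of wall-clock time, which would be consistent with the "exactly 3 days and 15 hours" observation above.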

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

I don't know if you mean this

yubihsm-connector is a separate Golang executable distributed with the YubiHSM2 SDK:

https://developers.yubico.com/YubiHSM2/Component_Reference/yubihsm-connector/

It talks to the YubiHSM2 over USB and exposes it as an HTTP service.

It would be useful to help narrow down if the problem is resulting from the USB stack or not.

gaia avatar gaia commented on August 10, 2024

That would make sense. Every session has a limited lifetime in terms of the number of messages sent to the YubiHSM2, so signing on more chains would trigger the end of a session lifetime more quickly by virtue of higher message volume.

OK, that makes sense. We run a max of 2 validators per YubiHSM.

activenodes avatar activenodes commented on August 10, 2024

yubihsm-connector is a separate Golang executable distributed with the YubiHSM2 SDK:
It would be useful to help narrow down if the problem is resulting from the USB stack or not.

Mmm... that is quite intrusive for a production setup...
Can we do something about the "higher message volume" case?
At this point it is also easy to reproduce the issue (every ~3.5 days).

activenodes avatar activenodes commented on August 10, 2024

@tony-iqlusion are you ready to let me do some tests?

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.
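The suspected failure mode can be shown in miniature. This Python sketch is purely illustrative (it is not the tmkms code, and the `sign`/`rekey_needed` names are invented), but it demonstrates why re-acquiring a non-reentrant lock from the thread that already holds it hangs:

```python
import threading

lock = threading.Lock()  # non-reentrant, like the mutex in the backtraces above

def sign(rekey_needed=True):
    with lock:  # session lock held while handling a signing request
        if rekey_needed:
            # Rekeying from inside the signing path tries to take the same
            # lock again; a non-reentrant lock deadlocks its own holder.
            # A timeout is used here so the demo returns instead of hanging.
            return lock.acquire(timeout=0.2)
        return True

print(sign())  # False: the re-acquisition never succeeds
```

In the real process there is no timeout, so the thread simply blocks forever, which matches the `__lll_lock_wait` frames in the gdb output.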

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

gaia avatar gaia commented on August 10, 2024

I think I have a good guess as to what the core problem is here, and a prospective fix if my guess is right:

iqlusioninc/yubihsm.rs#273

I suspect it's a mutex reentrancy bug, where the same thread blocks itself attempting to acquire an already-held mutex. It occurs in the code which handles message volume limits and rekeying which seemed to be implicated based on the above comments.

If anyone is interested in extremely-alpha testing this, I can make a git branch you can build from. But that will be a little bit of work.

I'm curious why your setup never experienced this.

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

I've released TMKMS v0.11.0-pre.2, which includes the prospective fix for this issue. We're now running it in production.

I'm curious why your setup never experienced this.

We just experienced it for the first time recently. As to why we didn't experience it before I'm uncertain, but it's possible my guess as to the cause is wrong.

activenodes avatar activenodes commented on August 10, 2024

I've released TMKMS v0.11.0-pre.2, which includes the prospective fix for this issue. We're now running it in production.

Happy to try it!
I will be AFK over the weekend. Before putting it into production I want to work out exactly when the next crash is due (about 3.5 days out), but definitely next week.

Thanks Tony!

activenodes avatar activenodes commented on August 10, 2024

@tony-iqlusion we are in production; could you give some guidance so we can do this safely?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign

   [..]
   Compiling bip32 v0.2.2
   Compiling yubihsm v0.40.0-pre
error: edition 2021 is unstable and only available with -Z unstable-options.

error: could not compile `yubihsm`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

activenodes avatar activenodes commented on August 10, 2024

The Minimum Supported Rust Version is now 1.56. You need to upgrade your rustc version

Done, thanks.

The reported version is still v0.10.1. Is that right?

$ git fetch
$ git checkout v0.11.0-pre.2
$ cargo build --features=yubihsm-server,softsign
$ cargo install tmkms --features=yubihsm-server,softsign

   [..]
   Replacing /home/tmkms/.cargo/bin/tmkms
    Replaced package `tmkms v0.10.1` with `tmkms v0.10.1` (executable `tmkms`)


$ tmkms
tmkms 0.10.1

tony-iqlusion avatar tony-iqlusion commented on August 10, 2024

FYI, we've released v0.11.0 containing a fix for this issue #479
