Hi!
Today I found that the data_logger process had crashed with an unhandled exception.
The last lines from stderr:
2013-11-15 18:19:16,352:11890(0x7f0071d24700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [127.0.0.1:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2013-11-15 18:19:16,362:11890(0x7f0071d24700):ZOO_ERROR@handle_socket_error_msg@1739: Socket [127.0.0.1:2181] zk retcode=-112, errno=116(Stale NFS file handle): sessionId=0x2424c7324bc0097 has expired.
terminate called after throwing an instance of 'ML::Exception'
what(): can't connect; handle already exists
Backtraces of two threads from the core dump:
(gdb) thread 1
[Switching to thread 1 (Thread 0x7f0071523700 (LWP 11902))]
#0 0x00007f007e860425 in __GI_raise (sig=)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) bt
#0 0x00007f007e860425 in __GI_raise (sig=)
at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f007e863b8b in __GI_abort () at abort.c:91
#2 0x00007f007f15bb05 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f007f159c76 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f007f159ca3 in std::terminate() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f007f159ece in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f0080bb7b33 in Datacratic::ZookeeperConnection::connectImpl (
this=this@entry=0x2547150, host=..., timeoutInSeconds=5,
timeoutInSeconds@entry=<error reading variable: Could not find type for DW_OP_GNU_const_type>, clientId=clientId@entry=0x0) at ./soa/service/zookeeper.cc:183
#7 0x00007f0080bb7b5c in Datacratic::ZookeeperConnection::connect (
this=this@entry=0x2547150, host=...,
timeoutInSeconds=<error reading variable: Could not find type for DW_OP_GNU_const_type>) at ./soa/service/zookeeper.cc:217
#8 0x00007f0080bb87a2 in Datacratic::ZookeeperConnection::reconnect (
this=this@entry=0x2547150) at ./soa/service/zookeeper.cc:247
#9 0x00007f0080bb89ac in Datacratic::ZookeeperConnection::checkRes (
this=this@entry=0x2547150, returnCode=-112, retries=@0x7f007152285c: 0,
operation=operation@entry=0x7f0080bc9ce9 "zoo_get_children",
path=0x7f00640164b8 "/tddsp-prod/serviceClass/monitor")
at ./soa/service/zookeeper.cc:283
#10 0x00007f0080bb90b9 in Datacratic::ZookeeperConnection::getChildren (this=0x2547150,
path_=..., failIfNodeMissing=true,
watcher=0x7f0080b8f1b0 <Datacratic::watcherFn(int, int, std::string const&, void*)>, watcherData=0x7f0064015bb0) at ./soa/service/zookeeper.cc:575
#11 0x00007f0080b8ee95 in Datacratic::ZookeeperConfigurationService::getChildren (
this=this@entry=0x2547120, key=..., watch=...)
at ./soa/service/zookeeper_configuration_service.cc:217
#12 0x00007f0080ba3cb4 in Datacratic::MultiRestProxy::onServiceProvidersChanged (
this=0x7fffbd6ed520, path=..., local=local@entry=true)
at ./soa/service/rest_proxy.cc:340
#13 0x00007f0080ba48cb in operator() (__closure=0x2547900)
at ./soa/service/rest_proxy.cc:309
#14 std::_Function_handler<void(std::basic_string<char, std::char_traits<char>, std::allocator<char> >, Datacratic::ConfigurationService::ChangeType), Datacratic::MultiRestProxy::connectAllServiceProviders(const string&, const string&, bool)::<lambda(const string&, Datacratic::ConfigurationService::ChangeType)> >::_M_invoke(const std::_Any_data &, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, Datacratic::ConfigurationService::ChangeType) (__functor=..., __args#0=..., __args#1=)
at /usr/include/c++/4.7/functional:1925
#15 0x00007f0080b8f240 in operator() (
__args#1=Datacratic::ConfigurationService::CREATED, __args#0=...,
this=<optimized out>) at /usr/include/c++/4.7/functional:2310
#16 Datacratic::watcherFn (type=, state=, path=...,
watcherCtx=<optimized out>) at ./soa/service/zookeeper_configuration_service.cc:102
#17 0x00007f0080bb7fe6 in call (state=1, type=-1, this=0x7f0064000a70)
at ./soa/service/zookeeper.h:97
#18 Datacratic::(anonymous namespace)::zk_callback (ah=, type=-1,
state=1, path=<optimized out>, user=<optimized out>)
at ./soa/service/zookeeper.cc:35
#19 0x00007f007c064d91 in do_foreach_watcher (state=1, type=-1, path=0x7f0064000920 "",
zh=0x2552ea0, wo=0x7f006c000920) at src/zk_hashtable.c:279
#20 deliverWatchers (zh=0x2552ea0, type=-1, state=1, path=0x7f0064000920 "",
list=0x7f006c002de0) at src/zk_hashtable.c:321
#21 0x00007f007c05a50d in process_completions (zh=0x2552ea0) at src/zookeeper.c:2108
#22 0x00007f007c065231 in do_completion (v=0x2552ea0) at src/mt_adaptor.c:466
#23 0x00007f007e614e9a in start_thread (arg=0x7f0071523700) at pthread_create.c:308
#24 0x00007f007e91dccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#25 0x0000000000000000 in ?? ()
(gdb) thread 9
[Switching to thread 9 (Thread 0x7f00488f7700 (LWP 11930))]
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
215 ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.
(gdb) bt
#0 pthread_cond_timedwait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:215
#1 0x00007f0080bb77d4 in __gthread_cond_timedwait (__abs_timeout=0x7f00488f65b0,
__mutex=0x2547150, __cond=0x2547178)
at /usr/include/x86_64-linux-gnu/c++/4.7/./bits/gthr-default.h:886
#2 __wait_until_impl<std::chrono::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000l> > > (__atime=..., __lock=, this=0x2547178)
at /usr/include/c++/4.7/condition_variable:164
#3 wait_until<std::chrono::duration<long, std::ratio<1l, 1000000l> > > (__atime=...,
__lock=<synthetic pointer>, this=0x2547178)
at /usr/include/c++/4.7/condition_variable:100
#4 wait_for<long, std::ratio<1l, 1000l> > (__rtime=..., __lock=,
this=0x2547178) at /usr/include/c++/4.7/condition_variable:132
#5 Datacratic::ZookeeperConnection::connectImpl (this=this@entry=0x2547150, host=...,
timeoutInSeconds=5,
timeoutInSeconds@entry=<error reading variable: Could not find type for DW_OP_GNU_const_type>, clientId=clientId@entry=0x0) at ./soa/service/zookeeper.cc:197
#6 0x00007f0080bb7b5c in Datacratic::ZookeeperConnection::connect (
this=this@entry=0x2547150, host=...,
timeoutInSeconds=<error reading variable: Could not find type for DW_OP_GNU_const_type>) at ./soa/service/zookeeper.cc:217
#7 0x00007f0080bb87a2 in Datacratic::ZookeeperConnection::reconnect (
this=this@entry=0x2547150) at ./soa/service/zookeeper.cc:247
#8 0x00007f0080bb89ac in Datacratic::ZookeeperConnection::checkRes (
this=this@entry=0x2547150, returnCode=-112, retries=@0x7f00488f66fc: 0,
operation=operation@entry=0x7f0080bc9ce9 "zoo_get_children",
path=0x7f0020014f98 "/tddsp-prod/serviceClass/rtbRequestRouter")
at ./soa/service/zookeeper.cc:283
#9 0x00007f0080bb90b9 in Datacratic::ZookeeperConnection::getChildren (this=0x2547150,
path_=..., failIfNodeMissing=true,
watcher=0x7f0080b8f1b0 <Datacratic::watcherFn(int, int, std::string const&, void*)>, watcherData=0x7f0020000980) at ./soa/service/zookeeper.cc:575
#10 0x00007f0080b8ee95 in Datacratic::ZookeeperConfigurationService::getChildren (
this=this@entry=0x2547120, key=..., watch=...)
at ./soa/service/zookeeper_configuration_service.cc:217
#11 0x00007f0080e19e25 in Datacratic::ServiceProviderWatcher::handleServiceClassChange (
this=<optimized out>, serviceClass=...) at ./soa/service/zmq_named_pub_sub.h:858
#12 0x00007f0080e1b264 in Datacratic::TypedMessageSink<std::string>::processOne (
this=0x7fffbd6ed418) at ./soa/service/typed_message_channel.h:73
#13 0x00007f0080b84808 in Datacratic::MessageLoop::handleEpollEvent (
this=<optimized out>, event=...) at ./soa/service/message_loop.cc:346
#14 0x00007f0080b73d4a in operator() (__args#0=..., this=0x7fffbd6ed150)
at /usr/include/c++/4.7/functional:2310
#15 Datacratic::Epoller::handleEvents(int, int, std::function<bool (epoll_event&)> const&, std::function<void ()> const&, std::function<void ()> const&) (this=0x7fffbd6ed130,
usToWait=0, nEvents=1, handleEvent_=..., beforeSleep_=..., afterSleep_=...)
at ./soa/service/epoller.cc:180
#16 0x00007f0080b5890b in Datacratic::Epoller::processOne (this=0x7fffbd6ed130)
at ./soa/service/epoller.h:85
#17 0x00007f0080b86a33 in Datacratic::MessageLoop::processOne (this=0x7fffbd6eceb8)
at ./soa/service/message_loop.cc:385
#18 0x00007f0080b84af9 in Datacratic::MessageLoop::runWorkerThread (this=0x7fffbd6eceb8)
at ./soa/service/message_loop.cc:293
#19 0x00007f0080b850b8 in operator() (__closure=0x25581d8)
at ./soa/service/message_loop.cc:84
#20 boost::detail::thread_data<Datacratic::MessageLoop::start(std::function<void()>)::<lambda()> >::run(void) (this=0x2558020)
at /home/dsp/local/include/boost/thread/detail/thread.hpp:117
#21 0x00007f007dc726da in thread_proxy ()
from /home/dsp/local/lib/libboost_thread.so.1.53.0
#22 0x00007f007e614e9a in start_thread (arg=0x7f00488f7700) at pthread_create.c:308
#23 0x00007f007e91dccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#24 0x0000000000000000 in ?? ()
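If I understand correctly, two threads started reconnecting to ZooKeeper at the same time after the session expired (both got returnCode=-112 in checkRes). One of them (thread 9) created a new handle and is waiting for the connection in connectImpl at zookeeper.cc:197; the other (thread 1) then entered connectImpl and threw the "handle already exists" exception at zookeeper.cc:183. Since thread 1 is a ZooKeeper completion thread with nothing to catch the exception, the process was terminated.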
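
One possible way to avoid this kind of race is to serialize reconnects and let only the first thread that notices the expired session replace the handle. Below is a minimal sketch of that idea. It does not use the real ZooKeeper client or the actual ZookeeperConnection code: GuardedConnection, FakeHandle, generation() and reconnectIfStale() are all hypothetical names made up for illustration, and I am assuming a per-connection "generation" counter is acceptable.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for a ZooKeeper session handle (not the real zhandle_t).
struct FakeHandle {
    int id;
};

// Hypothetical connection wrapper; NOT the actual ZookeeperConnection class.
class GuardedConnection {
public:
    // A caller records the generation before issuing a request.
    unsigned generation() const {
        std::lock_guard<std::mutex> guard(lock_);
        return generation_;
    }

    // Called by any thread whose request failed with "session expired".
    // `seenGeneration` is the generation that request was issued under.
    // Returns true if this call actually replaced the handle.
    bool reconnectIfStale(unsigned seenGeneration) {
        std::lock_guard<std::mutex> guard(lock_);
        if (seenGeneration != generation_)
            return false;   // another thread has already reconnected
        handle_.reset();    // drop the expired handle first
        connectLocked();
        assert(handle_ != nullptr);
        return true;
    }

    // Number of handles created so far (for demonstration only).
    int handlesCreated() const {
        std::lock_guard<std::mutex> guard(lock_);
        return created_;
    }

private:
    // Must be called with lock_ held. Mirrors the check that threw in the
    // crash above, but can no longer be reached with a live handle.
    void connectLocked() {
        if (handle_)
            throw std::runtime_error("can't connect; handle already exists");
        handle_.reset(new FakeHandle{++created_});
        ++generation_;
    }

    mutable std::mutex lock_;
    std::unique_ptr<FakeHandle> handle_;
    unsigned generation_ = 0;
    int created_ = 0;
};
```

With this scheme, two callers that both observed the same expired session would call reconnectIfStale() with the same generation; the first one creates the new handle and the second one sees that the generation has moved on and simply retries its request. Whether waiting for the connection while holding the lock is acceptable in the real code is a separate question that this sketch does not address.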