I get sporadic failures with ShellUTest, but have only noticed them with a Debug build

I have so far observed 3 types of failures From <code class="n

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

ShellUTest errors sometimes with Debug build about cogserver HOT 22 OPEN

opencog commented on June 11, 2024

ShellUTest errors sometimes with Debug build

from cogserver.

Comments (22)

linas commented on June 11, 2024

That backtrace is a typical "double free or malloc pool corruption" bug. It happens because the dependencies between the various different shared libs in the atomspace (and/or in opencog) are messed up. what's happening is that various shlib dtors are run twice (instead of once) because ... something somewhere gets confused. For example, deleting the cogserver might sometimes cause this. Deleting the attentionbank sometimes causes this.

The shared library dtors that run twice are usually the ones that are creating the atom types .. something causes the atom types to be deleted twice. You can see this directly, by putting a printf in the dtor, and watch it get printed twice. or a gdb breakpoint. you can view both stack traces, and say to yourself that they both look perfectly valid, perfectly legit ways to get to the dtor, and so you have to poke around to see why ... and its usually something goofy like deleteing cogserver causes atomspace to be deleted causes attentionbank to be deleted causes the atomspace to be deleted a second time ... stuff like that. Its done by the dtors, because there are global singleton shared pointers that hold these objects, which means they cannot get deleted until the shared library is unloaded .. but by the time the shared library is unloaded, the methods on the C++ objects are already gone (because some other needed shared lib has been unloaded...) .. circular dependency hell.

Its kind of tedious to track these down, and the (ugly) work-around is to just not delete things, and let glibc do the cleanup itself on process exit. In short, problems with the shared lib plumbing. Best handled by someone with a lot of spare time on their hands, and lots of linker/loader experience (or who wants to learn how the linker/loader works, which is actually a rather large hairy OS subsystem -- so not for the weak of heart).

from cogserver.

linas commented on June 11, 2024

Why it's intermittent, I'm not sure. its possible (for example) that the cogserver gets one more command over the network, and so it thinks it has one more thing to do, but the atomspace is already gone. I thought I'd fixed all of these, by making sure that all network connections are closed, all command buffers flushed, and all threads are done and exited, before allowing the cogserver to shut down. But I dunno. Maybe I missed a spot. Again -- its really hairy code.

from cogserver.

linas commented on June 11, 2024

Things like reworking the agent-runner/system activity table could trigger this, e.g. because you are holding a pointer to a system activity table when the singleton has already been deleted. Stuff like that. Did you ever see this bug, before reworking the system activity table code?

from cogserver.

ngeiswei commented on June 11, 2024

I have so far observed 3 types of failures

From testMultiStream

terminate called after throwing an instance of 'concurrent_queue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*>::Canceled'
  what():  std::exception

From testDrain

ShellUTest.cxxtest:78: Error: Expected (281 <= reso.size()), found (281 > 280)

From testControlC

ShellUTest.cxxtest:265: Error: Expected (end - start < 2), found (42 >= 2)

I've only observed that last one from CircleCI, never locally, but it's the most frequent cause of failure there unfortunately.

from cogserver.

linas commented on June 11, 2024

Error 1 happens if some thread attempts to use the queue, after it has been canceled. The queue is canceled when the network connection is being closed. Guess: maybe there is some race, with more traffic coming across the network, even though its being shut down.

Error 2: given the sizes, it would seem that the final carriage-return was not delivered. Why? Dunno. Another race.

Error 3: The control-C message should be two bytes long. Why is it 42 bytes long? Dunno .. that's a lot of bytes that should not be there. Maybe they are an error message? Bytes that should have gone to someone else? Printing them out would provide a clue.

from cogserver.

linas commented on June 11, 2024

You might think that someone out there would have written a generic shell server, that would be fast, bullet-proof, support login passwords and ssh, etc. Then we could replace the cogserver by that. But I was never able to find such a thing.

from cogserver.

ngeiswei commented on June 11, 2024

Not sure it's related but I noticed that ServerSocket::SetCloseAndDelete() doesn't always have time to complete because the CogServer shuts down (leading the unit test to halt) before that.

from cogserver.

ngeiswei commented on June 11, 2024

OK, I understand why ServerSocket::SetCloseAndDelete() doesn't always complete, it is detached from the NetworkServer at its creation, see NetworkServer::listen()

        std::thread(&ConsoleSocket::handle_connection, ss).detach();

then when the connection stops, the while loop in ServerSocket::handle_connection() breaks, and it eventually calls its own destruction

    delete this;

So if the program ends before that line is completed it never gets fully closed.

I have no idea however if that could be a problem, and no idea whether it is related to that issue.

I've added a sleep(1) in ShellUTest::tearDown() and obviously now ServerSocket::SetCloseAndDelete() always completes, however I don't know if that is doing anything to fix the bug (ran 27 times without crash, but it could be just luck since it occurs rarely).

from cogserver.

linas commented on June 11, 2024

Thanks @ngeiswei !! That's a very plausible explanation. The code was carefully crafted to try to avoid exactly this kind of race, but it is sufficiently complicated that perhaps a spot was missed. There locks and flags and semaphores that get set and double-checked, to prevent the out-of-order shutdown.

The only way I know of debugging this is decidedly old-school: put prints or loggers or traces at all the places that should be passed through, during orderly shutdown, and then decode what went out-of-order when it doesn't.

from cogserver.

linas commented on June 11, 2024

The cogserver destructor is supposed to wait for the network server destructor to finish. The network server destructor is supposed to wait for all threads to join. The threads are not supposed to join, until they are all done. So, in principle, everything should have waited for ServerSocket to finish. ...

from cogserver.

linas commented on June 11, 2024

Testing, I've got zero failures on 30 repeats over 90 minutes of testing. This is with boost-1.62

from cogserver.

linas commented on June 11, 2024

I'm on Debian stable, gcc --version gcc (Debian 8.3.0-6) 8.3.0 and ldd --version ldd (Debian GLIBC 2.28-10) 2.28 and bost version 1.62

I'm wondering if it's a glibc bug of a boost:asio bug.

from cogserver.

linas commented on June 11, 2024

5 failures out of 22 repeats with boost-1.67 -- so I guess its a boost bug. will double-check for boost-1.62 ... crap. Debian no longer has boost 1.62 ... Can't test.

from cogserver.

linas commented on June 11, 2024

the boost-1.67 crashes include at least 4 different kinds: corrupted size vs. prev_size and Segmentation fault and double free detected in tcache and corrupted size vs. prev_size

Seems that Debian ONLY has boost-1.67 and without some easy way of getting back to a working version of boost, I'm not sure what to do ...

from cogserver.

linas commented on June 11, 2024

Presumably, its a boost::asio bug .. boost::asio has been a headache for 5+ years ... the easiest solution would be to re-write the code to just not use it ... which maybe isn't hard ...?

from cogserver.

linas commented on June 11, 2024

OK, as a boost asio replacement, there is just plain-old asio: in Debian, libasio-dev otherwise https://think-async.com/Asio/ .. but it seems like boost asio is just a copy/fork of this. Given how buggy and unstable asio has been over the 5+ years, this feels like the wrong direction to go in...

from cogserver.

linas commented on June 11, 2024

If we instead assume that asio is not buggy, and that, instead, it is throwing some new kind of error that we are not yet handling, then network/ServerSocket.cc is the place to look. It already handles nine different kinds of asio errors, maybe there's a new tenth kind of error ...

from cogserver.

vsbogd commented on June 11, 2024

Is this this issue still actual?

from cogserver.

linas commented on June 11, 2024

Well, boost asio is certainly still as buggy as ever. Debian-stable is still on boost-1.67 .... I'd like to see all uses of boost in opencog removed, but its a long, hard task...

from cogserver.

vsbogd commented on June 11, 2024

I mean does this issue repeat? I don't see it for a long time.

from cogserver.

linas commented on June 11, 2024

Well, in comment #5 (comment) I put it into a loop and saw 5 crashes out of 22 runs. Nothing has changed since then ... I can try the test again. Here's my shell script:

#! /bin/bash

date
cnt=0
for i in `seq 1 1000000`;
do
    ./tests/shell/ShellUTest   >>/dev/null
    # ./tests/scm/MultiThreadUTest      >> /dev/null
    # ./tests/rule-engine/backwardchainer/BackwardChainerUTest  >> /dev/null
    rc=$?
    if [ $rc != 0 ]; then
        cnt=$((cnt+1))
        date
        # Halt on the first crash, save the logfile
        # exit 1;
    fi
    #if [ $(($i % 10)) == 0 ]; then
       echo duude iter $i result ---------  rc = $rc  ncrashes = $cnt
    #fi
done

I re-ran this, and am getting 8 crashes out of 48 tests, for an average of 1 out of 6 failure rate. This is debian-stable, boost 1.67, gcc 8.3.0 and glibc 2.28

from cogserver.

linas commented on June 11, 2024

Other intermittent test failures include the pattern-miner SurprisingnessUTest .. between that and the rule-engine, I have to hit the circle-ci "re-run unit tests" button 2-3-4 times before I can get a run that passes all tests. The intermittent failures are very annoying ...

from cogserver.

ShellUTest errors sometimes with Debug build about cogserver HOT 22 OPEN

Comments (22)

Related Issues (16)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent