Comments (5)

jwr commented on August 13, 2024

I found the root cause of this problem and thought I'd leave a post-mortem; perhaps it will help others avoid falling into a similar trap.

As it turned out, the problem was a deadlock caused by insufficient threads in the core.async thread pool. A careful audit of the code showed that in many places functions called from go blocks eventually performed blocking I/O, which should be avoided at all costs. go blocks are intended for short-running code only and should delegate all extensive work to separate threads.
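One way to delegate blocking work out of a go block can be sketched as follows. This is a minimal illustration under assumptions, not the code from the application discussed here: `blocking-fetch` is a hypothetical stand-in for any blocking call, and core.async is assumed to be on the classpath. The `thread` macro runs its body outside the go dispatch pool and returns a channel, so the go block can park on the result instead of holding a pool thread.

```clojure
(require '[clojure.core.async :as a])

;; Hypothetical stand-in for a blocking call (database query, HTTP request,
;; configuration lookup). The name and the 50 ms delay are illustrative only.
(defn blocking-fetch [id]
  (Thread/sleep 50)
  {:id id :status :ok})

;; Problematic: the blocking call runs directly on a go-pool thread and
;; occupies it for the whole duration of the I/O.
(defn fetch-in-go [id]
  (a/go (blocking-fetch id)))

;; Safer: a/thread runs the body on a thread outside the go pool and returns
;; a channel. The go block parks on that channel, which releases its pool
;; thread until the result arrives.
(defn fetch-delegated [id]
  (a/go (a/<! (a/thread (blocking-fetch id)))))

;; Usage from the REPL or other non-go code, where a blocking take is fine:
(a/<!! (fetch-delegated 42))
```

Both functions return the same value in isolation; the difference only shows under load, when enough concurrent `fetch-in-go` calls can tie up every thread in the pool.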

I am unclear on why the problem manifested itself in this particular way. I also thought that manifold does not use the same thread pool as core.async. Perhaps it was waiting for something that code written using core.async was supposed to provide, and that is why the stack traces pointed to manifold.

An immediate emergency remedy is to increase the size of the core.async thread pool by adding a -Dclojure.core.async.pool-size=N parameter to the command line. The correct fix is to audit the code, find all places where blocking operations are initiated from go blocks, and eliminate those. As it turns out, these aren't always immediately obvious: for example, some code several calls down the stack might need to get configuration data and thus perform blocking I/O.
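When applying the emergency remedy, it can help to confirm at startup that the property actually reached the JVM. The following is a small sketch; the launch command in the comment, including `app.jar` and the value 16, is a placeholder, and `configured-go-pool-size` is a hypothetical helper name.

```clojure
;; Assumed launch command (app.jar and the value 16 are placeholders):
;;   java -Dclojure.core.async.pool-size=16 -jar app.jar

(defn configured-go-pool-size
  "Returns the clojure.core.async.pool-size system property as a long,
  or nil when the property was not set."
  []
  (some-> (System/getProperty "clojure.core.async.pool-size")
          Long/parseLong))

;; Log the effective setting once at startup.
(println "core.async pool-size property:"
         (or (configured-go-pool-size) "not set (library default applies)"))
```

This only reports what was requested on the command line. It does not replace the audit, since a larger pool merely raises the number of simultaneous blocking calls needed to exhaust it.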

TL;DR takeaway: do not do any blocking or CPU-bound work in go blocks.

from clj-rethinkdb.

danielcompton commented on August 13, 2024

I haven’t experienced what you’re describing. What’s the network connection between the two servers? Which version of the library are you using? Are the queries or the results that are hanging unusual or overly large?

If it were me, my next step would be to raise the logging level around connection handling and the parsing of query responses.

jwr commented on August 13, 2024

Both the app and RethinkDB run on a single host. RethinkDB has a (non-voting) replica elsewhere, but the app does not access that. Until recently, I was able to run the same setup for months without problems.

I am using 0.15.26 and I could not identify any specific change that triggered the problem. The application might hang after running for days, hours, or right after restarting. The only potentially relevant change that I could find is the number of changefeeds: it usually hovered around 200, but recently grew to about 300. And restarting the application causes (most) changefeeds to be re-established.

Unfortunately, since this is non-reproducible and non-deterministic, all I have to go on is the jstack dump. There is nothing in RethinkDB logs, nothing in application logs, no exceptions, errors, no visible failures. To make sure, I implemented a default global uncaught exception handler, but it doesn't get called. And RethinkDB seems to respond in the admin interface, too.
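A default global uncaught exception handler of the kind mentioned above might look like the sketch below. It is an illustration using the standard JVM API, not the handler from this application, and the log format is an assumption.

```clojure
;; Install a JVM-wide handler for exceptions that escape any thread.
(Thread/setDefaultUncaughtExceptionHandler
  (reify Thread$UncaughtExceptionHandler
    (uncaughtException [_ thread ex]
      ;; Write to stderr so the message is visible even without a logging setup.
      (binding [*out* *err*]
        (println "Uncaught exception on thread" (.getName thread)
                 ":" (.getMessage ex))))))
```

A handler like this only fires when a thread dies from an exception. Threads that are blocked or waiting throw nothing, so a silent handler does not rule out a hang caused by blocked threads.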

This kind of bug is, unfortunately, catastrophic. I don't know if that's a bug in RethinkDB or clj-rethinkdb, but this might finally force me to move to FoundationDB.

danielcompton commented on August 13, 2024

I don't have any good suggestions on fixes, sorry. I'd maybe recommend forking this library and adding logging statements throughout the networking/protocol code to see if you can narrow down where it's getting stuck.

jwr commented on August 13, 2024

Yes, these kinds of issues are the worst to debug. In the meantime I've come to think that this might be a RethinkDB problem, because if I restart the app after a hang, I usually get a similar hang shortly after it starts up. Rebooting the entire machine or restarting RethinkDB and then the application does not seem to cause another hang.

The BLOCKED state of the threads might indicate a locking issue with manifold, too, but it would need to be something triggered by RethinkDB itself.

I don't think debugging the whole stack is realistic for me.
