Comments (5)
I found the root cause of this problem and thought I'd leave a post-mortem; perhaps it will help others avoid falling into a similar trap.
As it turned out, the problem was a deadlock caused by insufficient threads in the core.async thread pool. A careful audit of the code showed that in many places functions called from `go` blocks eventually performed blocking I/O, which should be avoided at all costs. `go` blocks are intended for short-running code only and should delegate all extensive work to separate threads.
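As a minimal sketch of that delegation (not the code from my application): `fetch-config-blocking` below is a hypothetical stand-in for any blocking call, and the function names are made up for illustration. The only core.async pieces used are `go`, `thread`, `<!` and `<!!`.

```clojure
(require '[clojure.core.async :refer [go thread <! <!!]])

;; Hypothetical stand-in for a blocking operation (database query, file read, ...).
(defn fetch-config-blocking []
  (Thread/sleep 100)
  {:timeout-ms 500})

;; Problematic: the blocking call occupies one of the few go-block pool threads
;; for its whole duration.
(defn load-config-bad []
  (go (fetch-config-blocking)))

;; Better: `thread` runs its body on a separate thread and returns a channel,
;; so the go block only parks on that channel and frees its pool thread.
(defn load-config-good []
  (go (<! (thread (fetch-config-blocking)))))

;; Blocking take from outside any go block, e.g. at the REPL.
(<!! (load-config-good))
```

Both versions return the same value; the difference is only in which thread sits waiting while the blocking call runs.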
I am unclear on why the problem manifested itself in this particular way; I had also thought that manifold does not use the same thread pool as core.async. But perhaps it was waiting for something that code written using core.async was supposed to provide, and that's why the stack traces pointed to manifold.
An immediate emergency remedy is to increase the size of the core.async thread pool by adding a `-Dclojure.core.async.pool-size=N` parameter to the command line. The correct fix is to audit the code, find all places where blocking operations are initiated from `go` blocks, and eliminate those. As it turns out, these aren't always immediately obvious: for example, some code several calls down the stack might need to get configuration data and thus perform blocking I/O.
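To confirm that the emergency remedy actually took effect, the system property can be read back at the REPL. This is a rough sketch: `configured-pool-size` is a hypothetical helper name, and the fallback of 8 assumes the default pool size documented for the core.async versions I am aware of, so check it against the version you run.

```clojure
;; Returns the go-block pool size requested via
;; -Dclojure.core.async.pool-size=N, or the assumed default of 8
;; when the property is not set.
(defn configured-pool-size []
  (if-let [v (System/getProperty "clojure.core.async.pool-size")]
    (Long/parseLong v)
    8))

(configured-pool-size)
```

This only reports the configured value; it does not show how many of those threads are currently stuck in blocking calls, which is what a jstack dump is for.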
TL;DR takeaway: do not do any blocking or CPU-bound work in `go` blocks.
from clj-rethinkdb.
I haven’t experienced what you’re describing. What’s the network connection between the two servers? Which version of the library are you using? Are the queries or the results that are hanging unusual or overly large?
If this were me, my next step would be to raise the log level around connection handling and the parsing of query responses.
from clj-rethinkdb.
Both the app and RethinkDB run on a single host. RethinkDB has a (non-voting) replica elsewhere, but the app does not access that. Until recently, I was able to run the same setup for months without problems.
I am using 0.15.26 and I could not identify any specific change that triggered the problem. The application might hang after running for days, hours, or right after restarting. The only potentially relevant change that I could find is the number of changefeeds: it usually hovered around 200, but recently grew to about 300. And restarting the application causes (most) changefeeds to be re-established.
Unfortunately, since this is non-reproducible and non-deterministic, all I have to go on is the jstack dump. There is nothing in RethinkDB logs, nothing in application logs, no exceptions, errors, no visible failures. To make sure, I implemented a default global uncaught exception handler, but it doesn't get called. And RethinkDB seems to respond in the admin interface, too.
This kind of bug is, unfortunately, catastrophic. I don't know if that's a bug in RethinkDB or clj-rethinkdb, but this might finally force me to move to FoundationDB.
from clj-rethinkdb.
I don't have any good suggestions for fixes, sorry. I'd maybe recommend forking this library and adding logging statements throughout the networking/protocol code to see if you can narrow down where it's getting stuck.
from clj-rethinkdb.
Yes, these kinds of issues are the worst to debug. In the meantime I've come to think that this might be a RethinkDB problem, because if I restart the app after a hang, I usually get a similar hang shortly after it starts up. Rebooting the entire machine or restarting RethinkDB and then the application does not seem to cause another hang.
The `BLOCKED` state of the threads might indicate a locking issue with manifold, too, but it would need to be something triggered by RethinkDB itself.
I don't think debugging the whole stack is realistic for me.
from clj-rethinkdb.