Comments (15)
My project setup is very simple. The unusual part is the REPL/IDE setup. I don't use cider-jack-in. I have a persistent nREPL server running remotely on an AWS EC2 instance, and REPL clients on my laptops connect to it. Those client sessions sometimes lose their connections as I move from one place to another, but the nREPL server keeps running (sometimes even with unfinished training). Then, when I connect a new REPL client to it, the nREPL server process crashes (about 1 time in 10). I know all this is a little ad hoc (but please don't worry too much about leaving your happy users to deal with rare hiccups, I don't mind :-)).
The good news is that I currently have my longest-surviving nREPL server ever after applying the PR. I'll report back in a few days if it crashes again 🤞
from libpython-clj.
You are welcome! Figuring out the beast that is libpython with all its C glory is quite tough as you can see :-).
I would first put all the code into a (dotimes [iter 20] ...) and see if that crashes reliably. I imagine it will.
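A minimal sketch of that repro loop, in plain Clojure. The function run-once is a hypothetical placeholder (it is not part of libpython-clj); its body stands in for whatever code is crashing for you.

```clojure
(defn run-once
  "Hypothetical placeholder: replace the body with the code that crashes."
  [iter]
  (reduce + (range iter)))

(defn run-repeatedly
  "Calls run-once n times on the current thread and returns n once all
  iterations have completed."
  [n]
  (dotimes [iter n]
    ;; Printing the index shows which iteration was running if the JVM dies.
    (println "iteration" iter)
    (run-once iter))
  n)

(run-repeatedly 20)
```

If this crashes from a plain `lein run` as well as from the REPL, the REPL threading is probably not the cause.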
If it does not, then this is potentially a problem with the REPL setup, meaning it is outside of your control: Clojure uses multiple threads for your REPL, and the libpython stuff may not be hardened appropriately for that.
To work around this, and for production scenarios, I would do this:
- Require what you need and then immediately call libpython-clj.python/gc!. If it doesn't crash right away then I think you are fine, and if it does then we have a repro.
- Wrap the code you care about in a with-gil-stack-rc-context, passing in and returning only JVM data structures. Be careful to avoid leaking Python objects outside this scope.
I think this may reduce or completely mitigate the crashing and it will have slightly faster single threaded performance as you aren't capturing the gil over and over again.
I was careful to do this (and perhaps I should say more about that) in my facial rec example.
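Roughly, the two steps above could look like the sketch below. It assumes the 1.x namespace libpython-clj.python aliased as py; the namespace example.safe-call and the function python-version are made up for illustration, and reading sys.version is just a stand-in for the real work.

```clojure
(ns example.safe-call
  (:require [libpython-clj.python :as py]))

(py/initialize!)

;; Step 1: run the library's gc pass right after loading. A crash here
;; would already be a useful repro.
(py/gc!)

(defn python-version
  "Returns the interpreter version as a plain JVM String."
  []
  ;; Step 2: everything that touches Python stays inside this scope.
  (py/with-gil-stack-rc-context
    (let [sys (py/import-module "sys")]
      ;; str makes sure a JVM String, not a Python object, leaves the scope.
      (str (py/get-attr sys "version")))))
```

The point of the design is that the function's return value is ordinary JVM data, so nothing that holds a Python reference outlives the context.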
I am really glad you are digging the library! It is fun to work with, that is for sure.
from libpython-clj.
I can confirm the crashing. @jdkealy, would you mind trying the crashing branch from my fork https://github.com/FongHou/libpython-clj.git
@cnuernber, I sent you a PR, though I couldn't provide a reliably reproducible test to prove that's the root cause.
from libpython-clj.
@jdkealy @fonghou can you give a little more info on the setup before you got this?
- What's the project.clj and/or deps.edn?
- What system are you on?
- How did you install Python?
- What code did you run to cause the error?
Thanks so much :) Very committed to increasing stability so if you can give us these we can find a solution.
from libpython-clj.
@jdkealy - Can you try latest master and see if this is an issue?
from libpython-clj.
I got a crash with a core dump again. Here are some interesting events from the Java hs_err.log and the gdb core backtrace. It seems some kind of null pointer is being passed across the Java/Python bridge.
Event: 9181.812 Thread 0x00007f5f9869e000 Uncommon trap: reason=unstable_if action=reinterpret pc=0x00007f625879cbb4 method=**java.lang.ref.SoftReference.get()**Ljava/lang/Object; @ 6
Event: 9181.812 Thread 0x00007f5f9869e000 Implicit null exception at 0x00007f625879c7dd to 0x00007f625879cb99
Event: 9181.812 Thread 0x00007f5f9869e000 Uncommon trap: trap_request=0xffffff65 fr.pc=0x00007f625879cbb4
Event: 9181.812 Thread 0x00007f5f9869e000 DEOPT PACKING pc=0x00007f625879cbb4 sp=0x00007f615c00e2a0
Event: 9181.812 Thread 0x00007f5f9869e000 DEOPT UNPACKING pc=0x00007f62581bc47a sp=0x00007f615c00e190 mode 2
Here are the detailed logs for the thread that crashed the JVM. Unfortunately, there is not much Clojure source-level information, but it points to the code path from libpython_clj.python.bridge$generic_python_as_jvm to libpython_clj.jna.DirectMapped.PyObject_CallObject; the Python C code then attempts to create a new var for a tuple value, but gets a null pointer from somewhere.
Current thread (0x00007f5f9869e000): JavaThread "nRepl-session-00409c05-7362-4966-9293-fe512f748c3b" daemon [_thread_in_native, id=23291, stack(0x00007f615bf10000,0x00007f615c011000)]
Stack: [0x00007f615bf10000,0x00007f615c011000], sp=0x00007f615c00cc68, free space=1011k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libpython3.7m.so+0x13c874]
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 5471 libpython_clj.jna.DirectMapped.PyObject_CallObject(Lcom/sun/jna/Pointer;Lcom/sun/jna/Pointer;)Lcom/sun/jna/Pointer; (0 bytes) @ 0x00007f62597a3ae3 [0x00007f62597a3a80+0x63]
J 6171 C2 libpython_clj.jna.protocols.object$PyObject_CallObject.invoke(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (10 bytes) @ 0x00007f6258d7fe74 [0x00007f6258d7fd80+0xf4]
j libpython_clj.python.object$eval40850$fn__40851.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+226
J 6178 C2 libpython_clj.python.protocols$eval38584$fn__38585$G__38575__38594.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (112 bytes) @ 0x00007f6258a1d390 [0x00007f6258a1d1e0+0x1b0]
j libpython_clj.python.bridge$generic_python_as_jvm$reify__41735.do_call_fn(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+202
J 6471 C1 libpython_clj.python.protocols$call_attr_kw.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (116 bytes) @ 0x00007f62586fb814 [0x00007f62586fa980+0xe94]
J 6446 C1 libpython_clj.python.protocols$call_attr_kw.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (18 bytes) @ 0x00007f62586ebfc4 [0x00007f62586ebf40+0x84]
J 5182 C2 clojure.lang.AFn.applyToHelper(Lclojure/lang/IFn;Lclojure/lang/ISeq;)Ljava/lang/Object; (3238 bytes) @ 0x00007f625971bdc0 [0x00007f625971aaa0+0x1320]
J 2885 C2 clojure.lang.AFn.applyTo(Lclojure/lang/ISeq;)Ljava/lang/Object; (12 bytes) @ 0x00007f6258d5e5d4 [0x00007f6258d5e5a0+0x34]
J 3816 C1 clojure.lang.Compiler$InvokeExpr.eval()Ljava/lang/Object; (116 bytes) @ 0x00007f625899fec4 [0x00007f625899f2a0+0xc24]
J 3222 C1 clojure.lang.Compiler$DefExpr.eval()Ljava/lang/Object; (110 bytes) @ 0x00007f6258288124 [0x00007f6258287fc0+0x164]
J 4640 C2 clojure.lang.Compiler.eval(Ljava/lang/Object;Z)Ljava/lang/Object; (445 bytes) @ 0x00007f6258a7dc0c [0x00007f6258a7c740+0x14cc]
j clojure.lang.Compiler.load(Ljava/io/Reader;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/Object;+354
j cljdata.core$eval46246.invokeStatic()Ljava/lang/Object;+65
j cljdata.core$eval46246.invoke()Ljava/lang/Object;+0
J 4640 C2 clojure.lang.Compiler.eval(Ljava/lang/Object;Z)Ljava/lang/Object; (445 bytes) @ 0x00007f6258a7d3ac [0x00007f6258a7c740+0xc6c]
j clojure.lang.Compiler.eval(Ljava/lang/Object;)Ljava/lang/Object;+2
j clojure.core$eval.invokeStatic(Ljava/lang/Object;)Ljava/lang/Object;+3
j clojure.core$eval.invoke(Ljava/lang/Object;)Ljava/lang/Object;+3
j clojure.main$repl$read_eval_print__9086$fn__9089.invoke()Ljava/lang/Object;+11
j clojure.main$repl$read_eval_print__9086.invoke()Ljava/lang/Object;+127
j clojure.main$repl$fn__9095.invoke()Ljava/lang/Object;+7
j clojure.main$repl.invokeStatic(Lclojure/lang/ISeq;)Ljava/lang/Object;+639
j clojure.main$repl.doInvoke(Ljava/lang/Object;)Ljava/lang/Object;+6
j clojure.lang.RestFn.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+259
j nrepl.middleware.interruptible_eval$evaluate.invokeStatic(Ljava/lang/Object;)Ljava/lang/Object;+867
j nrepl.middleware.interruptible_eval$evaluate.invoke(Ljava/lang/Object;)Ljava/lang/Object;+3
j nrepl.middleware.interruptible_eval$interruptible_eval$fn__1031$fn__1035.invoke()Ljava/lang/Object;+49
j clojure.lang.AFn.run()V+1
j nrepl.middleware.session$session_exec$main_loop__1132$fn__1136.invoke()Ljava/lang/Object;+12
j nrepl.middleware.session$session_exec$main_loop__1132.invoke()Ljava/lang/Object;+86
j clojure.lang.AFn.run()V+1
j java.lang.Thread.run()V+11
v ~StubRoutines::call_stub
Thread 1 (Thread 0x7f615c010700 (LWP 23291)):
#6 <signal handler called>
#7 0x00007f61e463f874 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#8 0x00007f61e473f3a2 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#9 0x00007f61e464d275 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#10 0x00007f61e464dff8 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#11 0x00007f61e464e07a in _PyObject_GC_NewVar () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#12 0x00007f61e472246f in PyTuple_New () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#13 0x00007f61e475af77 in PyList_AsTuple () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#14 0x00007f61e4790d35 in PySequence_Tuple () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#15 0x00007f61e4726520 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#16 0x00007f61e4721465 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#17 0x00007f61e4778253 in _PyObject_FastCallKeywords () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#18 0x00007f61e4575b6b in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#19 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#20 0x00007f61e4777333 in _PyFunction_FastCallKeywords () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#21 0x00007f61e4576598 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#22 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#23 0x00007f61e4777633 in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#24 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#25 0x00007f61e4717cd5 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#26 0x00007f61e47214b2 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#27 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#28 0x00007f61e4573659 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#29 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#30 0x00007f61e477751e in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#31 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#32 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#33 0x00007f61e4573659 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#34 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#35 0x00007f61e477751e in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#36 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#37 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
from libpython-clj.
@fonghou steps to reproduce?
from libpython-clj.
@jjtolton - That sounds like your nil pointer issue with the sudoku solver to me.
from libpython-clj.
@fonghou - That is really helpful. If we run the process out of available memory then malloc returns 0. That would create a nil in an unexpected place. So the question I would have is: is the garbage collection system working, or is there an edge case where it fails completely, so that it is just a matter of time until something calls malloc and gets a zero back?
from libpython-clj.
The process wasn't in an out-of-memory condition: its resident size is around 9GB, the machine has 16GB of RAM, and there was no sign of swapping at the time of the crash. All top-level defns are carefully wrapped in py/with-gil-stack-rc-context, and I ran py/gc! a few times in the REPL, which worked fine.
Looking into the _PyObject_GC_Malloc source code here https://github.com/python/cpython/blob/00e45877e33d32bb61aa13a2033e3bba370bda4d/Modules/gcmodule.c#L1713, it seems to be crashing in one of the function calls inside _PyObject_GC_Alloc (it shouldn't be one of the PyErr_NoMemory() calls, I guess). Unfortunately, the gdb backtrace doesn't show the last three function names. There is a large chunk of code dealing with refcount gc logic here https://github.com/python/cpython/blob/00e45877e33d32bb61aa13a2033e3bba370bda4d/Modules/gcmodule.c#L1696. I wonder whether that interacts with the libpython-clj reference tracking logic in any way.
Is there a way to log libpython-clj reference tracking?
from libpython-clj.
If you want to run for a long time you will have to add those to a file. Another option is to completely disable releasing the refcounts and see if you get it. We have seen it happen before with other pathways when releasing the reference tracking is completely turned off.
from libpython-clj.
With logging enabled, it seems to crash more frequently, 1 out of 3 times :(
Here is the repo https://github.com/FongHou/pyclj.git to reproduce, just run "lein run" a few times.
My env -
Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
Python 3.7.6
numpy 1.18.1
torch 1.4.0
from libpython-clj.
@fonghou - this is super helpful.
I think from here I might try a few things.
- If all decref calls are disabled, does that change the problem?
- (defn -main [& args] (py/initialize!) (py/import-module "numpy") (py/import-module "torch")) - Don't use the refer-python pathway. I think we can keep trimming the problem by reducing the executing code in binary steps. This will also change the timing of things.
- Potentially the struct sizes for the various structs used by the JNA system are too small. If they are just too small then the resulting memory overwrites may be mostly harmless except in rare situations. I check a lot of these in a cpp program under questions, but I may have missed something important.
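One way to do the trimming in binary steps, as a rough sketch in plain Clojure: keep the program as an ordered list of named steps and run only a prefix of it. The step names and their no-op bodies here are hypothetical stand-ins; in the real repro each function would wrap one piece of the crashing program.

```clojure
;; Hypothetical steps, listed in the order the real program runs them.
(def steps
  [["initialize"   (fn [] :ok)]
   ["import numpy" (fn [] :ok)]
   ["import torch" (fn [] :ok)]
   ["build model"  (fn [] :ok)]])

(defn run-prefix
  "Runs the first n steps in order and returns the names of the steps that ran."
  [n]
  (mapv (fn [[step-name step-fn]]
          ;; The last name printed before a crash points at the suspect step.
          (println "running" step-name)
          (step-fn)
          step-name)
        (take n steps)))
```

Running (run-prefix 2) a few times, then (run-prefix 4), and halving the interval between a prefix that survives and one that crashes narrows down the offending step.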
from libpython-clj.
@fonghou, @jdkealy - If you are up for it, the new version 2.00-beta-1 has a much simpler GIL mechanism and is also far more careful about how it releases refcounts, especially to Python atomic objects. I would be interested to know if this crash is still an issue.
from libpython-clj.
Final comment on this issue. You can embed Clojure in Python, which is a workaround for crashes like these. Embedded Clojure in Python is likely to be more stable in general than embedded Python in Clojure due to Python's closer reliance on the hardware of the system.
from libpython-clj.