Comments (15)

fonghou commented on July 19, 2024

My project setup is very simple. The unusual part is the REPL/IDE setup. I don't use cider-jack-in. I have a persistent nREPL server running remotely on an AWS EC2 instance, and REPL clients on my laptops connect to it. Those client sessions sometimes lose their connections as I move from one place to another, but the nREPL server keeps running (sometimes even with unfinished training). Then, when I connect a new REPL client to it, the nREPL server process crashes about 1 time in 10. I know all this is a little ad hoc (but please don't worry too much about letting your happy users deal with rare hiccups, I don't mind :-)

The good news is that I currently have my longest-surviving nREPL server ever after applying the PR. I'll report back in a few days if it crashes again 🤞

from libpython-clj.

cnuernber commented on July 19, 2024

You are welcome! Figuring out the beast that is libpython with all its C glory is quite tough as you can see :-).

I would first put all the code into a (dotimes [iter 20] ...) and see if that crashes reliably. I imagine it will.

If it does not, then this is potentially a problem with the REPL setup: outside of your control, Clojure uses multiple threads for your REPL, and the libpython stuff may not be hardened appropriately for that.

To work around this, and for production scenarios, I would do this:

  1. require what you need and then immediately call libpython-clj.python/gc!. If it doesn't crash right away then I think you are fine and if it does then we have a repro.
  2. Wrap the code you care about with a with-gil-stack-rc-context passing and returning only jvm datastructures. Be careful to avoid leaking python objects outside this scope.

I think this may reduce or completely mitigate the crashing and it will have slightly faster single threaded performance as you aren't capturing the gil over and over again.

I was careful to do this (and perhaps I should say more about that) in my facial rec example.

I am really glad you are digging the library! It is fun to work with, that is for sure.

fonghou commented on July 19, 2024

I can confirm the crash. @jdkealy, would you mind trying the crashing branch from my fork https://github.com/FongHou/libpython-clj.git

@cnuernber, I sent you a PR, though I couldn't provide a reliably reproducible test to prove that's the root cause.

jjtolton commented on July 19, 2024

@jdkealy @fonghou can you give a little more info on the setup before you got this?

  1. What's the project.clj and/or deps.edn?
  2. What system are you on?
  3. How did you install Python?
  4. What code did you run to cause the error?

Thanks so much :) Very committed to increasing stability so if you can give us these we can find a solution.

cnuernber commented on July 19, 2024

@jdkealy - Can you try latest master and see if this is an issue?

fonghou commented on July 19, 2024

I got a crash with a core dump again. Here are some interesting events from the Java hs_err.log and the gdb core backtrace. It looks like some kind of null pointer is being passed across the Java/Python bridge.

Event: 9181.812 Thread 0x00007f5f9869e000 Uncommon trap: reason=unstable_if action=reinterpret pc=0x00007f625879cbb4 method=java.lang.ref.SoftReference.get()Ljava/lang/Object; @ 6
Event: 9181.812 Thread 0x00007f5f9869e000 Implicit null exception at 0x00007f625879c7dd to 0x00007f625879cb99
Event: 9181.812 Thread 0x00007f5f9869e000 Uncommon trap: trap_request=0xffffff65 fr.pc=0x00007f625879cbb4
Event: 9181.812 Thread 0x00007f5f9869e000 DEOPT PACKING pc=0x00007f625879cbb4 sp=0x00007f615c00e2a0
Event: 9181.812 Thread 0x00007f5f9869e000 DEOPT UNPACKING pc=0x00007f62581bc47a sp=0x00007f615c00e190 mode 2

Here are the detailed logs for the thread that crashed the JVM. Unfortunately, there is not much Clojure source-level information, but they point to the code path from libpython_clj.python.bridge$generic_python_as_jvm to libpython_clj.jna.DirectMapped.PyObject_CallObject. The Python C code then attempts to create a new var for a tuple value, but gets a null pointer from somewhere.

Current thread (0x00007f5f9869e000):  JavaThread "nRepl-session-00409c05-7362-4966-9293-fe512f748c3b" daemon [_thread_in_native, id=23291, stack(0x00007f615bf10000,0x00007f615c011000)]
Stack: [0x00007f615bf10000,0x00007f615c011000],  sp=0x00007f615c00cc68,  free space=1011k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libpython3.7m.so+0x13c874]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 5471  libpython_clj.jna.DirectMapped.PyObject_CallObject(Lcom/sun/jna/Pointer;Lcom/sun/jna/Pointer;)Lcom/sun/jna/Pointer; (0 bytes) @ 0x00007f62597a3ae3 [0x00007f62597a3a80+0x63]
J 6171 C2 libpython_clj.jna.protocols.object$PyObject_CallObject.invoke(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (10 bytes) @ 0x00007f6258d7fe74 [0x00007f6258d7fd80+0xf4]
j  libpython_clj.python.object$eval40850$fn__40851.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+226
J 6178 C2 libpython_clj.python.protocols$eval38584$fn__38585$G__38575__38594.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (112 bytes) @ 0x00007f6258a1d390 [0x00007f6258a1d1e0+0x1b0]
j  libpython_clj.python.bridge$generic_python_as_jvm$reify__41735.do_call_fn(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+202
J 6471 C1 libpython_clj.python.protocols$call_attr_kw.invokeStatic(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (116 bytes) @ 0x00007f62586fb814 [0x00007f62586fa980+0xe94]
J 6446 C1 libpython_clj.python.protocols$call_attr_kw.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; (18 bytes) @ 0x00007f62586ebfc4 [0x00007f62586ebf40+0x84]
J 5182 C2 clojure.lang.AFn.applyToHelper(Lclojure/lang/IFn;Lclojure/lang/ISeq;)Ljava/lang/Object; (3238 bytes) @ 0x00007f625971bdc0 [0x00007f625971aaa0+0x1320]
J 2885 C2 clojure.lang.AFn.applyTo(Lclojure/lang/ISeq;)Ljava/lang/Object; (12 bytes) @ 0x00007f6258d5e5d4 [0x00007f6258d5e5a0+0x34]
J 3816 C1 clojure.lang.Compiler$InvokeExpr.eval()Ljava/lang/Object; (116 bytes) @ 0x00007f625899fec4 [0x00007f625899f2a0+0xc24]
J 3222 C1 clojure.lang.Compiler$DefExpr.eval()Ljava/lang/Object; (110 bytes) @ 0x00007f6258288124 [0x00007f6258287fc0+0x164]
J 4640 C2 clojure.lang.Compiler.eval(Ljava/lang/Object;Z)Ljava/lang/Object; (445 bytes) @ 0x00007f6258a7dc0c [0x00007f6258a7c740+0x14cc]
j  clojure.lang.Compiler.load(Ljava/io/Reader;Ljava/lang/String;Ljava/lang/String;)Ljava/lang/Object;+354
j  cljdata.core$eval46246.invokeStatic()Ljava/lang/Object;+65
j  cljdata.core$eval46246.invoke()Ljava/lang/Object;+0
J 4640 C2 clojure.lang.Compiler.eval(Ljava/lang/Object;Z)Ljava/lang/Object; (445 bytes) @ 0x00007f6258a7d3ac [0x00007f6258a7c740+0xc6c]
j  clojure.lang.Compiler.eval(Ljava/lang/Object;)Ljava/lang/Object;+2
j  clojure.core$eval.invokeStatic(Ljava/lang/Object;)Ljava/lang/Object;+3
j  clojure.core$eval.invoke(Ljava/lang/Object;)Ljava/lang/Object;+3
j  clojure.main$repl$read_eval_print__9086$fn__9089.invoke()Ljava/lang/Object;+11
j  clojure.main$repl$read_eval_print__9086.invoke()Ljava/lang/Object;+127
j  clojure.main$repl$fn__9095.invoke()Ljava/lang/Object;+7
j  clojure.main$repl.invokeStatic(Lclojure/lang/ISeq;)Ljava/lang/Object;+639
j  clojure.main$repl.doInvoke(Ljava/lang/Object;)Ljava/lang/Object;+6
j  clojure.lang.RestFn.invoke(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;+259
j  nrepl.middleware.interruptible_eval$evaluate.invokeStatic(Ljava/lang/Object;)Ljava/lang/Object;+867
j  nrepl.middleware.interruptible_eval$evaluate.invoke(Ljava/lang/Object;)Ljava/lang/Object;+3
j  nrepl.middleware.interruptible_eval$interruptible_eval$fn__1031$fn__1035.invoke()Ljava/lang/Object;+49
j  clojure.lang.AFn.run()V+1
j  nrepl.middleware.session$session_exec$main_loop__1132$fn__1136.invoke()Ljava/lang/Object;+12
j  nrepl.middleware.session$session_exec$main_loop__1132.invoke()Ljava/lang/Object;+86
j  clojure.lang.AFn.run()V+1
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

Thread 1 (Thread 0x7f615c010700 (LWP 23291)):
#6  <signal handler called>
#7  0x00007f61e463f874 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#8  0x00007f61e473f3a2 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#9  0x00007f61e464d275 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#10 0x00007f61e464dff8 in _PyObject_GC_Malloc () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#11 0x00007f61e464e07a in _PyObject_GC_NewVar () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#12 0x00007f61e472246f in PyTuple_New () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#13 0x00007f61e475af77 in PyList_AsTuple () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#14 0x00007f61e4790d35 in PySequence_Tuple () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#15 0x00007f61e4726520 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#16 0x00007f61e4721465 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#17 0x00007f61e4778253 in _PyObject_FastCallKeywords () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#18 0x00007f61e4575b6b in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#19 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#20 0x00007f61e4777333 in _PyFunction_FastCallKeywords () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#21 0x00007f61e4576598 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#22 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#23 0x00007f61e4777633 in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#24 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#25 0x00007f61e4717cd5 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#26 0x00007f61e47214b2 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#27 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#28 0x00007f61e4573659 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#29 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#30 0x00007f61e477751e in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#31 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#32 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#33 0x00007f61e4573659 in _PyEval_EvalFrameDefault () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#34 0x00007f61e46a0356 in _PyEval_EvalCodeWithName () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#35 0x00007f61e477751e in _PyFunction_FastCallDict () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#36 0x00007f61e477a41d in _PyObject_Call_Prepend () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#37 0x00007f61e4777e85 in PyObject_Call () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so

jjtolton commented on July 19, 2024

@fonghou steps to reproduce?

cnuernber commented on July 19, 2024

@jjtolton - That sounds like your nil pointer issue with the sudoku solver to me.

cnuernber commented on July 19, 2024

@fonghou - That is really helpful. If we run the process out of available memory, then malloc returns 0, which would create a nil in an unexpected place. So the question I would have is: is the garbage collection system working, or is there an edge case where it is completely failing, so that it is just a matter of time until something calls malloc and gets back a zero?

fonghou commented on July 19, 2024

The process wasn't in an out-of-memory condition: its resident size was around 9GB, the machine has 16GB of RAM, and there was no sign of swapping at the time of the crash. All top-level defns are carefully wrapped in py/with-gil-stack-rc-context, and I ran py/gc! a few times in the REPL, which worked fine.

Looking at the _PyObject_GC_Malloc source code here https://github.com/python/cpython/blob/00e45877e33d32bb61aa13a2033e3bba370bda4d/Modules/gcmodule.c#L1713, the crash seems to happen in one of the function calls inside _PyObject_GC_Alloc (it shouldn't be one of the PyErr_NoMemory() calls, I guess). Unfortunately, the gdb backtrace doesn't show the names of the last three functions. There is a large chunk of code dealing with the refcount GC logic here https://github.com/python/cpython/blob/00e45877e33d32bb61aa13a2033e3bba370bda4d/Modules/gcmodule.c#L1696. I wonder whether that interacts with libpython-clj's reference tracking logic in any way.
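For context on why an allocation frame can end up in GC code: in the CPython 3.7 source linked above, allocating a GC-tracked container bumps the generation-0 counter, and once that counter passes its threshold a cyclic collection can run from inside the allocation itself. So the unnamed frames under _PyObject_GC_Malloc might be a collection walking older objects rather than the allocator running out of memory. Here is a minimal pure-Python sketch of that mechanism, using only the standard gc module. The helper names are made up for illustration, and it does not reproduce the crash or involve libpython-clj at all.

```python
import gc


def young_count():
    """Return the generation-0 counter (first element of gc.get_count())."""
    return gc.get_count()[0]


def allocations_bump_counter(n=10_000):
    """Allocate n GC-tracked lists with automatic collection switched off.

    Returns (counter change, number of lists kept alive). With collection
    disabled nothing resets the counter, so the change reflects the
    allocations themselves.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        before = young_count()
        keep = [[] for _ in range(n)]  # each list is a GC-tracked container
        after = young_count()
        return after - before, len(keep)
    finally:
        if was_enabled:
            gc.enable()


def collections_during_allocation(n=100_000):
    """Count automatic collections that start while n lists are allocated."""
    starts = []

    def on_gc(phase, info):
        # gc.callbacks invokes this with phase "start" or "stop".
        if phase == "start":
            starts.append(info["generation"])

    was_enabled = gc.isenabled()
    gc.callbacks.append(on_gc)
    gc.enable()
    try:
        keep = [[] for _ in range(n)]  # kept alive so the counter keeps growing
    finally:
        gc.callbacks.remove(on_gc)
        if not was_enabled:
            gc.disable()
    return len(starts)


if __name__ == "__main__":
    delta, kept = allocations_bump_counter()
    print("generation-0 counter grew by", delta, "while keeping", kept, "lists")
    print("automatic collections during allocation:", collections_during_allocation())
```

If the second number is non-zero, collections were triggered purely by allocating containers, which is the same situation as a PyTuple_New call landing in the collector.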

Is there a way to log libpython-clj reference tracking?

cnuernber commented on July 19, 2024

If you want to run for a long time, you will have to add those to a file. Another option is to completely disable releasing the refcounts and see if you still get the crash. We have seen it happen before with other pathways when releasing through the reference tracking is completely turned off.

fonghou commented on July 19, 2024

With logging enabled, it seems to crash more frequently, about 1 run out of 3 :(

Here is the repo https://github.com/FongHou/pyclj.git to reproduce it; just run "lein run" a few times.

My env -

Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64

java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)

Python 3.7.6

numpy 1.18.1
torch 1.4.0

cnuernber commented on July 19, 2024

@fonghou - this is super helpful.

I think from here I might try a few things.

  1. If all decref calls are disabled does that change the problem?

  2. (defn -main [& args] (py/initialize!) (py/import-module "numpy") (py/import-module "torch")) - Don't use the refer-python pathway. We can keep trimming the problem I think by reducing the code executing by binary steps. This will also change timings of things.

  3. Potentially the struct sizes for the various structs used by the JNA system are too small. If they are just too small then potentially overwrites to memory are mostly harmless except in rare situations. I check a lot of these in a cpp program under questions but I may have missed something important.

cnuernber commented on July 19, 2024

@fonghou, @jdkealy - If you are up for it, the new version 2.00-beta-1 has a much simpler GIL mechanism and is also far more careful about how it releases refcounts, especially for Python atomic objects. I would be interested to know if this crash is still an issue.

cnuernber commented on July 19, 2024

Final comment on this issue. You can embed Clojure in Python which is a work-around for crashes like these. Embedded Clojure in Python is likely to be more stable in general than embedded Python in Clojure due to Python's closer reliance on the hardware of the system.
