cnuernber commented on July 19, 2024

The above is totally sufficient for me at this time. If anything I can just put things into a loop; I bet it will eventually hang. If it doesn't hang in a loop then I can potentially run it in a thread pool or something so the same code is called from multiple threads. Will get to this soon; it is high priority.

cnuernber commented on July 19, 2024

Yes, for sure this will hang. Python is single threaded and controls this via a global mutex (the GIL). So if, in one thread, the python interpreter itself hangs on a function call, then no other thread can access the python interpreter until that call returns.

Not sure what you are looking for here - we can't change the above aspect of the underlying python system.
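
To make the blocking behavior concrete, here is a minimal sketch (assuming libpython-clj 1.x and its initialize! / run-simple-string functions; the sleep stands in for any Python call that never returns):

(require '[libpython-clj.python :as py])
(py/initialize!)

;; Thread A enters Python and stays there, holding the interpreter lock/GIL
;; for the whole duration of the call:
(future (py/run-simple-string "import time; time.sleep(3600)"))

;; Thread B's call now blocks until A's call returns:
(future (py/run-simple-string "1 + 1"))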

fonghou commented on July 19, 2024

I understood that. I'm just going through this tutorial: https://gigasquidsoftware.com/blog/2020/01/10/hugging-face-gpt-with-clojure/. However, I don't see any code there that could end up blocking indefinitely on the Python side.

fonghou commented on July 19, 2024

I don't know how GIL scope is managed between libpython-clj and other native code. In the example code I ran, torch native code also tried to lock the GIL. What would happen if this native stack was on top of a Java stack (the first one above)?

Thread 48 (Thread 0x7faa765fd700 (LWP 8797)):
#0  0x00007fab1ddaaf85 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007faa55481e91 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#2  0x00007faa5548233f in PyEval_RestoreThread ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#3  0x00007faa55459a76 in PyGILState_Ensure () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#4  0x00007faa3975a64c in torch::autograd::PyFunctionPreHook::~PyFunctionPreHook() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007fa9f520646a in torch::autograd::Variable::AutogradMeta::~AutogradMeta() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch.so
#6  0x00007fa9f102fee0 in c10::TensorImpl::release_resources() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libc10.so
#7  0x00007faa39765bb2 in THPVariable_clear(THPVariable*) ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8  0x00007faa39765bf6 in THPVariable_dealloc(THPVariable*) ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9  0x00007faa554fdeec in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#10 0x00007faa55520dbb in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#11 0x00007faa555292c4 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#12 0x00007faa5550f607 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#13 0x00007faa5554d5d7 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#14 0x00007faa554fe087 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#15 0x00007faa55542867 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#16 0x00007faa5554d3ae in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#17 0x00007faa5542e84e in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#18 0x00007faa5542eff8 in _PyObject_GC_Malloc ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#19 0x00007faa5542f07a in _PyObject_GC_NewVar ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so

cnuernber commented on July 19, 2024

OK, catching up with you now. The way the GIL works is somewhat documented here.

That isn't to say there isn't a bug in there somewhere, but the documentation for the python GIL states that you can't capture the gil in a reentrant fashion. When any python call happens we capture the gil if we haven't already. The current interpreter is a thread-local variable that is captured along with a process-wide java lock, so in your above example, assuming you called it via libpython-clj, you would be fine.

I think your question is more detailed, however: you are asking what happens if you have unrelated code that captures the gil, code you aren't calling through libpython-clj. In this case, libpython-clj would attempt to capture the gil, potentially leading to a deadlock even if everything is in the same thread. What would be needed would be some way to say:

I have already captured the gil, so set up the thread-local variable and such, but don't push/pop the thread context.

We don't currently have that pathway implemented.
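
A hypothetical sketch of what that pathway could look like (every name here is made up for illustration; this is not the library's API): a thread-local flag that makes GIL acquisition a no-op when the current thread has already captured it:

(def ^:dynamic *gil-held?* false)

(defmacro with-gil [& body]
  `(if *gil-held?*
     (do ~@body)                         ;; already captured: don't push/pop
     (binding [*gil-held?* true]
       (acquire-gil!)                    ;; e.g. PyGILState_Ensure via JNA
       (try ~@body
            (finally (release-gil!)))))) ;; e.g. PyGILState_Release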

fonghou commented on July 19, 2024

Thank you for the explanation.

Just to log some diagnostics: I was able to reproduce the problem. This time the two stacks correlate more clearly, as nid=0x2ace is the same thread as LWP 10958.

"nRepl-session-09d1a23e-5d14-427c-9723-a3b836bbc6d1" #19 daemon prio=5 os_prio=0 cpu=3207.94ms elapsed=2609.55s tid=0x00007f7754004000 **nid=0x2ace** runnable  [0x00007f779bbd4000]
   java.lang.Thread.State: RUNNABLE
        at com.sun.jna.Native.invokePointer(Native Method)
        at com.sun.jna.Function.invokePointer(Function.java:496)
        at com.sun.jna.Function.invoke(Function.java:440)
        at com.sun.jna.Function.invoke(Function.java:360)
        at com.sun.jna.Function.invoke(Function.java:314)
        at com.sun.jna.Function.invoke(Function.java:305)
        at libpython_clj.jna.protocols.object$PyObject_Call.invokeStatic(object.clj:277)
        at libpython_clj.jna.protocols.object$PyObject_Call.invoke(object.clj:277)
        at clojure.lang.Var.invoke(Var.java:393)
        at libpython_clj.python.object$eval41112$fn__41113$fn__41114.invoke(object.clj:828)
>>>>at libpython_clj.python.interpreter$with_gil_fn.invokeStatic(interpreter.clj:377)
        at libpython_clj.python.interpreter$with_gil_fn.invoke(interpreter.clj:346)
        at libpython_clj.python.object$eval41112$fn__41113.invoke(object.clj:825)
        at libpython_clj.python.protocols$eval38986$fn__38987$G__38977__38996.invoke(protocols.clj:105)
        at libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__41714$fn__41751.invoke(bridge.clj:633)
        at libpython_clj.python.interpreter$with_gil_fn.invokeStatic(interpreter.clj:377)
        at libpython_clj.python.interpreter$with_gil_fn.invoke(interpreter.clj:346)
        at libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__41714.do_call_fn(bridge.clj:633)
        at libpython_clj.python.protocols$call_kw.invokeStatic(protocols.clj:118)
        at libpython_clj.python.protocols$call_kw.invoke(protocols.clj:115)
        at libpython_clj.python.bridge$cfn.invokeStatic(bridge.clj:595)
        at libpython_clj.python.bridge$cfn.doInvoke(bridge.clj:586)
        at clojure.lang.RestFn.invoke(RestFn.java:460)
        at libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__41714$fn__41743.invoke(bridge.clj:661)
        at libpython_clj.python.interpreter$with_gil_fn$fn__38564$fn__38565.invoke(interpreter.clj:360)
        at clojure.lang.AFn.applyToHelper(AFn.java:152)
        at clojure.lang.AFn.applyTo(AFn.java:144)
        at clojure.core$apply.invokeStatic(core.clj:665)
        at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973)
        at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973)
        at clojure.lang.RestFn.invoke(RestFn.java:425)
        at libpython_clj.python.interpreter$with_gil_fn$fn__38564.invoke(interpreter.clj:358)
        at clojure.lang.AFn.applyToHelper(AFn.java:152)
        at clojure.lang.AFn.applyTo(AFn.java:144)
        at clojure.core$apply.invokeStatic(core.clj:665)
        at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973)
        at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973)
        at clojure.lang.RestFn.invoke(RestFn.java:425)
        at libpython_clj.python.interpreter$with_gil_fn.invokeStatic(interpreter.clj:356)
        - locked <0x00000007170fe3a0> (a libpython_clj.python.interpreter.Interpreter)
        at libpython_clj.python.interpreter$with_gil_fn.invoke(interpreter.clj:346)
        at libpython_clj.python.bridge$generic_python_as_jvm$fn$reify__41714.invoke(bridge.clj:660)
        at gpt2$generate_sequence_step.invokeStatic(gpt2.clj:32)
        at gpt2$generate_sequence_step.invoke(gpt2.clj:31)
        at gpt2$generate_text$fn__43325.invoke(gpt2.clj:48)

Thread 33 (Thread 0x7f779bbd7700 (**LWP 10958**)):
#0  0x00007f7800478f85 in pthread_cond_timedwait@@GLIBC_2.3.2 ()
   from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f772f4a2e91 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#2  0x00007f772f4a333f in PyEval_RestoreThread ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
>>>>> #3  0x00007f772f47aa76 in PyGILState_Ensure () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#4  0x00007f7715e2764c in torch::autograd::PyFunctionPreHook::~PyFunctionPreHook() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007f76d1e7446a in torch::autograd::Variable::AutogradMeta::~AutogradMeta() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch.so
#6  0x00007f76cdc9dee0 in c10::TensorImpl::release_resources() ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libc10.so
#7  0x00007f7715e32bb2 in THPVariable_clear(THPVariable*) ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#8  0x00007f7715e32bf6 in THPVariable_dealloc(THPVariable*) ()
   from /home/ubuntu/.local/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#9  0x00007f772f51eeec in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#10 0x00007f772f541dbb in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#11 0x00007f772f54a2c4 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#12 0x00007f772f530607 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#13 0x00007f772f56e5d7 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#14 0x00007f772f51f087 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#15 0x00007f772f563867 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#16 0x00007f772f56e3ae in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#17 0x00007f772f44f84e in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#18 0x00007f772f44fff8 in **_PyObject_GC_Malloc ()**
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#19 0x00007f772f4500cd in _PyObject_GC_New () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#20 0x00007f772f539b3f in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#21 0x00007f772f590936 in PyObject_GetIter () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#22 0x00007f772f374c35 in _PyEval_EvalFrameDefault ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#23 0x00007f772f4a2356 in _PyEval_EvalCodeWithName ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#24 0x00007f772f57951e in _PyFunction_FastCallDict ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#25 0x00007f772f57c41d in _PyObject_Call_Prepend ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#26 0x00007f772f519e15 in ?? () from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#27 0x00007f772f57a253 in _PyObject_FastCallKeywords ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#28 0x00007f772f377b6b in _PyEval_EvalFrameDefault ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#29 0x00007f772f4a2356 in _PyEval_EvalCodeWithName ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so
#30 0x00007f772f579333 in _PyFunction_FastCallKeywords ()
   from /usr/lib/x86_64-linux-gnu/libpython3.7m.so

cnuernber commented on July 19, 2024

Is this an issue though? Python isn't multithreaded, and it appears you are attempting to capture the gil from two threads. Or is it that this hangs forever, in which case I think you have found something. Can you give us access to the pathway through which this happens? A full repro as a github repo would be ideal, as then I can attempt this on my machine.

fonghou commented on July 19, 2024

I believe deadlock can still happen in single-threaded libpython-clj interpreter/with-gil usage. The above stack traces attempt to show that, as the native stack and the java stack are on the same thread. Does this mean we can't invoke python code that also locks the GIL (e.g. torch) within the scope of libpython-clj's interpreter/with-gil, even on the same thread?
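
Schematically, the shape in question (interpreter/with-gil is the library macro referenced above; run-model-forward is a made-up stand-in for the torch call):

(libpython-clj.python.interpreter/with-gil
  ;; this JVM frame holds the GIL for the whole scope ...
  (run-model-forward))
;; ... while torch C++ code deep inside that call does its own
;; PyGILState_Ensure on the very same thread (frame #3 in the trace above)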

cnuernber commented on July 19, 2024

No, if you aren't in a libpython-clj function call then it should release the gil, assuming the garbage collector isn't attempting to release a resource. libpython-clj doesn't keep the gil; when you aren't using the library the gil should be completely released. How can you produce this example in the same thread?

I am really appreciating this thread, btw. I think you are on to something, and I think there are things I can do to make the library better here; I am just trying to narrow it down into a distinct reproducible use case.

fonghou commented on July 19, 2024

I couldn't reproduce the problem deterministically just now; it only happens while doing lots of interactive evals in IDE and REPL sessions with the code below. However, as soon as deadlock occurs, it's a single thread that first holds the GIL in interpreter/with-gil, then enters a native code path that also attempts PyGILState_Ensure (via _PyObject_GC_Malloc). As you mentioned, the GIL may not be re-entrant. That native code would block in the PyGILState_Ensure call while a previous JVM stack frame on the same thread still holds the GIL (interpreter/with-gil won't release the GIL until it returns, correct?). I highlighted a few key points in the previous stack traces; hopefully that makes it clear.

(ns gpt2
  (:require [libpython-clj.python :as py]
            [libpython-clj.require :refer [require-python]]))

(require-python 'transformers)
(require-python 'torch)

(def tokenizer (py/$a transformers/GPT2Tokenizer from_pretrained "gpt2"))

(def model (py/$a transformers/GPT2LMHeadModel from_pretrained "gpt2"))

(py/$a model eval)

(defn generate-sequence-step [{:keys [generated-tokens context past]}]
  (let [[output past] (model context :past past)
        token (-> (torch/argmax (first output)))
        new-generated (conj generated-tokens (py/$a token tolist))]
    {:generated-tokens new-generated
     :context (py/$a token unsqueeze 0)
     :past past
     :token token}))

(defn decode-sequence [{:keys [generated-tokens]}]
  (py/$a tokenizer decode generated-tokens))

(defn generate-text [starting-text num-of-words-to-predict]
  (py/with-gil-stack-rc-context
    (let [tokens (into [] (py/$a tokenizer encode starting-text))
          context (torch/tensor [tokens])
          result (reduce
                   (fn [r _i]
                     (generate-sequence-step r))
                   {:generated-tokens tokens
                    :context context
                    :past nil}
                   (range num-of-words-to-predict))]
      (decode-sequence result))))

(comment

  (def text "how was Jim Henson? Jim Henson was a")

  (def indexed-tokens (py/$a tokenizer encode text))

  (def tokens-tensor (torch/tensor [indexed-tokens]))

  (def predictions (py/with [r (torch/no_grad)]
                            (first (model tokens-tensor))))

  (def predicted-index (let [last-word-predictions (-> predictions first last)
                             arg-max (torch/argmax last-word-predictions)]
                         (py/$a arg-max item)))


  (py/$a tokenizer decode (-> (into [] indexed-tokens)
                              (conj predicted-index)))

  (generate-text "Clojure is a dynamic, general purpose programming language, combining the approachability and interactive" 20)

  )

cnuernber commented on July 19, 2024

Looking at this now and collecting some thoughts.

If there was a serious problem with reentrant locking it should happen every time; it could never work, even in a single-threaded case. This has to be some interaction between multiple eval/repl threads and python. So for production use, and for a lot of repl uses, this isn't super serious. That said, it is still serious enough that I at least want to repro it here.

Wow, this managed to hard-crash my laptop a couple of times. pytorch is doing something pretty different here; potentially it is attempting to use an opencl context or something like that. Other examples (notably the facial-rec example) do not exhibit this behavior.

cnuernber commented on July 19, 2024

I think most of this is garbage collection related. Torch may not be designed to be used from multiple threads like this.
Changing generate-text2 to this:

(defn generate-text2 [starting-text num-of-words-to-predict temp]
  (py/with-gil-stack-rc-context
    (let [tokens (into [] (py/$a tokenizer encode starting-text))
          context (torch/tensor [tokens])
          result (reduce
                  (fn [r i]
                    (sample-sequence-step (assoc r :temp temp)))
                  {:generated-tokens tokens
                   :context context
                   :past nil}
                  (range num-of-words-to-predict))]
      (decode-sequence result))))

Note the use of py/with-gil-stack-rc-context.

That allowed a simple loop:

  (dotimes [iter 100]
    (println "generating" iter)
    (generate-text2 "Rich Hickey developed Clojure because he wanted a modern Lisp for functional programming, symbiotic with the established Java platform"
                    100
                    0.8))

To complete. Without that, the loop crashed my entire machine. So there is one clue.

cnuernber commented on July 19, 2024

This also worked:

  (do
    (System/gc)
    (->> (range 100)
         (pmap (fn [iter]
                 (println "generating" iter)
                 (generate-text2 "Rich Hickey developed Clojure because he wanted a modern Lisp for functional programming, symbiotic with the established Java platform"
                                 100
                                 0.8)))
         (dorun)))

Note the System/gc call before going into the pmap. It crashed once without that. For sure, there is some interaction with the garbage collector that is causing this behavior, but it can be stabilized by controlling when and how it runs. If I were to offer what I think is probably good practice here, I would say: do a System/gc call after you have defined your namespace variables, and if you are exposing this stuff to anyone, expose functions that take and return pure JVM data, with their bodies wrapped in (py/with-gil-stack-rc-context).
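
A minimal sketch of that practice, reusing the tokenizer/model vars and the single-prediction steps from earlier in this thread (predict-next-word is an illustrative name):

(System/gc)  ;; once, after the namespace-level Python vars are defined

(defn predict-next-word [^String text]
  ;; pure JVM String in, JVM String out; Python objects stay inside the scope
  (py/with-gil-stack-rc-context
    (let [tokens (vec (py/$a tokenizer encode text))
          preds  (first (model (torch/tensor [tokens])))
          token  (py/$a (torch/argmax (-> preds first last)) item)]
      (py/$a tokenizer decode [token]))))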

fonghou commented on July 19, 2024

Thank you for looking in this!

Your analysis and the native stack trace I collected look very close to pytorch/pytorch#29065. The root cause may not be on the libpython-clj side.

Just FYI, https://bugs.python.org/issue37186 claims lots of python native extensions have experienced deadlock issues, in particular when run from an embedded interpreter.

fonghou commented on July 19, 2024

I think this issue raises a fundamental question about how a host embedding the python interpreter should manage the GIL when it potentially calls into python native C extensions within the same thread's call stack.

Either:

  1. It's not deadlock-safe for the hosting app (i.e. the libpython-clj JVM) to lock the GIL (especially over a large dynamic scope) and then eval/call arbitrary python code, because that python code can execute native extensions which also use the PyEval_AcquireThread and PyEval_RestoreThread C APIs (both of whose docs say "If the lock has been created, the current thread must not have acquired it, otherwise deadlock ensues.").

Or:

  2. All python native extensions must guard the two deadlock-prone APIs above with a check like:

if (PyGILState_Check()) {
    // this thread already holds the GIL: not safe to call
    // PyEval_AcquireThread / PyEval_RestoreThread;
    // use PyGILState_GetThisThreadState instead?
} else {
    // safe to call PyEval_AcquireThread / PyEval_RestoreThread
}

I did some github code searches. Sadly, almost none of the popular python native extensions (numpy, torch) follow this pattern; I'm not sure why. It seems these extensions assume the GIL has been released by the "main interpreter" before calling into C-API extension code (I'm not sure whether the python cli does this by default or not).

cnuernber commented on July 19, 2024

It is interesting that they release the gil. That only has an effect if you have multiple interpreters running, correct? If you have one interpreter, then it is something deep in the stack trace that is releasing/grabbing the gil, and using that same interpreter to execute anything else while some C code is doing something seemed, to me, like betting on at least some of python being re-entrant. This is actually why I use java locking on top of the GIL: to avoid the same interpreter being used from multiple threads if one of them calls into a C thing that releases the gil while it does its work.
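
Schematically, "java locking on top of the GIL" looks something like this (hypothetical names, not the library's actual internals):

(defonce interpreter-monitor (Object.))

(defmacro with-interpreter [& body]
  ;; the JVM monitor keeps every other JVM thread away from the interpreter,
  ;; even if C extension code releases and re-takes the GIL inside `body`
  `(locking interpreter-monitor
     (acquire-gil!)                      ;; hypothetical JNA binding
     (try ~@body
          (finally (release-gil!)))))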

There is an issue/story about multiple interpreters that we have talked about a lot but every time we go down that road we stop because it doesn't seem like a great idea. The gains are small and getting, for instance, graalvm working so you have low startup times and just managing multiple clojure/python processes seems like a much bigger win than multiple interpreters. Plus things like require-python would need to be re-engineered if you have multiple interpreters.

Some systems ensure that a given interpreter is only ever accessed from one thread, and for production I think this is wise, but for REPL-driven development it is unnecessary in most cases, I think. That is a fallback, though: providing a way for everything to work in one thread. Currently garbage collection puts things into a queue that is emptied in another thread, but it could put things into a queue that is checked, for instance, at the beginning or end of every python call (or every n calls or something).
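
A schematic of that cooperative alternative (hypothetical names throughout; Py-DecRef stands in for whatever binding drops a Python reference):

(def gc-queue (java.util.concurrent.ConcurrentLinkedQueue.))

;; the JVM garbage collector enqueues dead Python handles from any thread ...
(defn on-jvm-gc [pyobj]
  (.offer gc-queue pyobj))

;; ... and the thread already holding the GIL drains the queue at the
;; beginning or end of every python call
(defn drain-gc-queue! []
  (loop []
    (when-let [pyobj (.poll gc-queue)]
      (Py-DecRef pyobj)                  ;; hypothetical binding
      (recur))))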

fonghou commented on July 19, 2024

I think the strategy of embedding only one python interpreter in the jvm and using java locking on top of the GIL is a sensible choice. I only wish PyEval_AcquireThread and PyEval_RestoreThread allowed nested re-entrant calls within the same thread, rather than bluntly deadlocking even if the GIL has already been acquired higher up in the call stack (which is a typical case for embedded interpreter usage). These C APIs could set up some kind of thread-local flag and allow re-entrant calls when any parent frame in the same thread already holds the GIL.

cnuernber commented on July 19, 2024

cpython could also allocate reentrant mutexes. There is probably a very good reason (like it isn't possible on all the platforms they want to run on) for them not doing this. But we used to use reentrant mutexes out of habit, precisely to avoid situations like this.
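
For comparison, the JVM's built-in ReentrantLock shows exactly the behavior being wished for here:

(def lock (java.util.concurrent.locks.ReentrantLock.))

(.lock lock)
(.lock lock)    ;; same thread may re-acquire without deadlocking
(.unlock lock)
(.unlock lock)  ;; actually released once the hold count reaches zero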

cnuernber commented on July 19, 2024

Closing this issue. You can now use the library completely from one thread (the garbage-collection queue is now cooperatively cleared from the thread holding the GIL). So with 1.32 you get both major perf and stability enhancements. I think this issue will still happen, as it is a base issue with some torch C extensions, but it should be quieted as much as possible and we have a single-thread pathway for production use. I believe this is the best we can do at this time.

behrica commented on July 19, 2024

I just wanted to comment here that I had a similar issue with libpython-clj (1.38) + pytorch + notespace.

I converted this code to libpython-clj:
https://github.com/ThilinaRajapakse/simpletransformers#minimal-start-for-binary-classification

And "playing" in the repl and executing and re-executing bits of the clojure code, especially the line which runs the training:

(py. model train_model train-df)

at certain moments it got "stuck" and I could not make any libpython-clj calls anymore; they were hanging. Normal clojure commands still worked.

I tried the same with py/with-gil-stack-rc-context and it behaved far better:

(py/with-gil-stack-rc-context
  (py. model train_model train-df))

cnuernber commented on July 19, 2024

@behrica - Thanks, more info is always very helpful. Torch specifically (I haven't heard anything about tensorflow or mxnet) and libpython-clj have some subtle incompatibility. No one has been able to reproduce this aside from using various stochastic methods. This is literally the most frustrating outstanding issue.

behrica commented on July 19, 2024

@cnuernber
I have the "minimal clojure" code and a Dockercontainer which does work now.
It is "slightly" different them my environment where it fails.

Failing          Working
GPU              CPU
Cuda 10.2        -
python 3.6.10    Python 3.8.2

The "venvs" diifer slighly as well:

3,5c3,6
< boto3               1.12.38            
< botocore            1.15.38            
< certifi             2020.4.5.1         
---
> boto3               1.12.36            
> botocore            1.15.36            
> certifi             2019.11.28         
> cffi                1.14.0             
7a9
> dataclasses         0.7                
20a23
> olefile             0.46               
22c25
< Pillow              7.1.1              
---
> Pillow              7.0.0              
24a28
> pycparser           2.20               
44c48
< transformers        2.8.0              
---
> transformers        2.7.0              

Especially: transformers 2.8.0 is in the "working" environment while transformers 2.7.0 is in the failing one.

behrica commented on July 19, 2024

Should we discuss this more in another issue?

Maybe "libpython-clj using simpletransformers sometimes crashing"
