Comments (15)
There may be a gross inefficiency with -X jit-enable-jit-list-wildcards.
With that option removed, the overhead is only 0.5% (still high, but 4x better).
Yikes. We don't use that option in production thus far, but that's useful to know. I've gone ahead and opened a new issue for that: #30. I see you've opened #29 for this; thank you for digging in.
from cinder.
While I agree it's not intuitive, this behavior is actually intentional. The setup we have internally is that our web server goes through a warm-up process where it tries to import all the known hot Python modules. These may not necessarily be executed at this stage, but when we hit cinderjit.disable() it goes and compiles all the loaded functions which are on our JIT-list. We execute cinderjit.disable() right before we start forking worker processes because we don't want JIT to happen in the workers, as there is currently no way to share further JIT work between processes. Letting the workers independently JIT would cause excessive memory usage and wasted CPU cycles.
Note that cinderjit.disable() only inhibits further JIT compilation; functions which have already been JIT-compiled can still be executed in their compiled form. Also, JIT compilation happens lazily as allowed functions are executed (not loaded), or on cinderjit.disable().
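As a rough sketch of that warm-up-then-disable flow (the module list and helper name here are made up for illustration; cinderjit only exists on a Cinder build, so this falls back to a no-op elsewhere):

```python
import importlib

# Stand-ins for the app's known hot modules (hypothetical list).
HOT_MODULES = ["json", "re"]

def warm_up():
    """Import (but don't necessarily execute) the known hot modules."""
    loaded = []
    for name in HOT_MODULES:
        importlib.import_module(name)
        loaded.append(name)
    return loaded

loaded = warm_up()

try:
    import cinderjit
    # Compiles every loaded function that is on the JIT-list, then
    # inhibits any further JIT compilation before workers fork.
    cinderjit.disable()
except ImportError:
    pass  # plain CPython: nothing to disable

# ... fork worker processes here; they reuse already-compiled functions ...
```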
While I don't think there's a way to do what you want right now, I have some changes in flight which may help. I'll see if I can get those released in the next week.
from cinder.
Thank you. Would it not just be a matter of having an alternate flag (JIT_START_DISABLED), in which case _PyJIT_Initialize() would leave jit_config.is_enabled at 0?
If so I'm happy to try making a PR.
from cinder.
Hey, sorry it's taking me a while to get back to you. Like I said, I have something unpublished that I think will help, but the sands have shifted under me and it's proving harder to get out than I'd thought.
What you're suggesting may work (if it's that easy, maybe try it out for yourself and see how it does?). My concern is that right now "disabling" the JIT is more nuanced than it appears. The only two things we've really tried are never enabling the JIT, or enabling it from start-up and then "disabling" it once (which, as my earlier comment explains, is not quite what it sounds like). I'm not sure how well enabling it part way through would work, which I think is what your suggestion would require. At the very least you'd need to go through and review all the call-sites that check to see if the JIT is enabled and make sure they support this new operation.
The work I mentioned I have in mind allows you to add entries to the JIT list at runtime from managed code. This way you could start with JIT enabled but an empty JIT list so nothing actually happens, and then you could lazily or otherwise use decorators to dynamically add things to the list and force them to compile.
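Sketching that decorator idea, assuming a cinderjit.force_compile() entry point like the one discussed elsewhere in this thread (the decorator name is made up; off Cinder it degrades to a no-op):

```python
def jit_compile(fn):
    """Hypothetical decorator: opt one function into the JIT at runtime."""
    try:
        import cinderjit
        cinderjit.force_compile(fn)  # assumed API; compiles fn immediately
    except (ImportError, AttributeError, RuntimeError):
        pass  # no JIT available, or fn not compilable: run it interpreted
    return fn

@jit_compile
def hot_path(a, b):
    return a * b
```

hot_path(6, 7) returns 42 either way; only the entry point changes.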
If you do decide to play with this yourself you'll probably find the -X jit-debug option useful, as this will let you see what functions are getting compiled (along with lots of other noise). You'll probably also want to use some of the functions in the cinderjit module (search for jit_methods in pyjit.cpp).
FWIW, the reason we use an externally specified list is that we populate it with functions we know are hot, based on data collected from execution in production. In 0764b2e I added some features you could use to do the same thing (although it's a bit of a nuisance to use right now, which is what the other things I mentioned would help fix).
from cinder.
At the very least you'd need to go through and review all the call-sites that check to see if the JIT is enabled and make sure they support this new operation.
This part confused me; maybe I still don't understand what "enable JIT" implies. cinderjit.enable() and disable() set jit_config.is_enabled internally. Isn't that flag already honored by the call sites? I was thinking that init could start with jit_config.is_enabled = 0, and then rely on force_compile() to select functions at runtime.
(Lines 964 to 970 in 8664502.)
Background on where I'm at with Cinder: out of the box our app is seeing a nice speedup of 12% vs. Python 3.7. (I haven't checked whether any of that is attributable to Python 3.8 or to building with gcc 8.x vs. 6.x.) The app is heavily async, so I suspect a lot comes from the optimization to avoid StopIteration that has been upstreamed to future Python. (On the other hand, per Trio policy, we always have one yield in async functions, so the other optimization to elide coroutines doesn't help us.)
Beyond that though, I haven't been able to produce any real gains from the JIT. For sure keeping JIT enabled is not viable, because it gets bogged down with the numerous lambdas and closures. And it's not clear which functions would give us a win if jitted (a sampling profiler wouldn't help much here).
The instruction counter feature sounds promising, I'll check it out.
from cinder.
@jbower-fb from my attempts, -X jit-capture-interp-cost hasn't worked out too well:
The counts seem to include the cost of children in the call stack, which isn't what you'd want for identifying functions to JIT. Notably, it seems to include everything from a yield; for example, @contextmanager functions with simply a yield in the body have high cost. And since async relies heavily on yield, the top items by instruction count are mostly meaningless in my heavily async app.
Even when discounting the above issue, and locating functions with high count that don't yield or explicitly call other functions, I have not been able to come up with a win from the JIT. After enabling a small number of functions via jit-list, my program always ends up being a few percent slower (even though the jitted functions themselves might run 25 - 50% faster). It may be that cinderjit just being enabled has some overhead that I haven't been able to overcome.
from cinder.
The counts seem to include the cost of children in the call stack,
Interesting, that's not what should happen and I wonder how you're seeing that? I wrote a small example and it seems to give the results I would expect - counts are attributed non-cumulatively to functions which perform execution:
```
$ cat simple.py
import cinder
import contextlib
import pprint

async def y():
    class DummyGenerator:
        def __await__(self): return iter([])
    for _ in range(100): pass
    await DummyGenerator()

async def x(): await y()

# Clear out existing data for a clean base-line
cinder.get_and_clear_code_interp_cost()
try:
    x().send(None)
except StopIteration: pass
pprint.pprint(cinder.get_and_clear_code_interp_cost())
@contextlib.contextmanager
def a(): yield 2


def use_cms():
    with a():
        for _ in range(100): pass

cinder.get_and_clear_code_interp_cost()
use_cms()
pprint.pprint(cinder.get_and_clear_code_interp_cost())

$ ./python -X jit-capture-interp-cost ./simple.py
{'DummyGenerator@./simple.py:6': 10,
 'x@./simple.py:11': 5,
 'y.<locals>.DummyGenerator.__await__@./simple.py:7': 4,
 'y@./simple.py:5': 320}
{'_GeneratorContextManager.__enter__@/data/users/jbower/cinder2/Lib/contextlib.py:108': 13,
 '_GeneratorContextManager.__exit__@/data/users/jbower/cinder2/Lib/contextlib.py:117': 19,
 '_GeneratorContextManagerBase.__init__@/data/users/jbower/cinder2/Lib/contextlib.py:82': 37,
 'a@./simple.py:20': 5,
 'contextmanager.<locals>.helper@/data/users/jbower/cinder2/Lib/contextlib.py:238': 6,
 'use_cms@./simple.py:23': 316}
```
As I would expect, most of the cost in the above is in y() and use_cms(), and not attributed to x() or a() as I think you're finding. Can you provide a repro of some kind?
After enabling a small number of functions via jit-list, my program always ends up being a few percent slower (even though the jitted functions themselves might run 25 - 50% faster). It may be that cinderjit just being enabled has some overhead that I haven't been able to overcome.
The cost of compiling JIT functions is really quite high (indeed, excruciating for a debug build). If the overall time spent executing is not very significant, e.g. just a few tens of seconds, I would not be surprised to see a net loss. Is your test target long-running, or can its runtime be extended (e.g. by running its core in an artificial loop)? If it's still running slower after that, it may be that you're hitting some poorly optimized JIT situations. For now you'd need to prune the JIT-list more, although this should get better over the next couple of months. So far this year we focused on JIT "coverage", which means making as many things work at any cost; now we're focused on tightening those up.
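The artificial-loop suggestion might look like this (bench is a made-up helper, not a Cinder API):

```python
import time

def bench(fn, *args, seconds=30.0):
    """Call fn in a loop until `seconds` of wall time have passed, so the
    one-time JIT compilation cost is amortized over many executions."""
    calls = 0
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        fn(*args)
        calls += 1
    return calls

# e.g. compare calls-per-interval with and without the JIT enabled:
n = bench(lambda: sum(range(100)), seconds=0.05)
```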
from cinder.
The accounting problem may be specific to the async library we're using (Trio); I'll try to give a specific repro.
The cost of compiling JIT functions is really quite high (indeed, excruciating for a debug build).
I mean there seems to be overhead independent of JIT compiling. For example, with an empty jit-list, our app still seems to slow down by 1 - 2%.
from cinder.
update:
- after looking a little closer, I couldn't find a problem with -X jit-capture-interp-cost accounting. I attempted enabling various combinations of the top 20 or 30 items in the interp cost list, and couldn't achieve an overall CPU gain over non-jit
- as mentioned, I see a 1 - 2% slowdown in our app even with an empty jit list. I imagine cinderjit has to intercept every call and compare it against the list, and that may be a lot of overhead for an application with many short calls.
- I'm exploring --static, as that seems like the path to accumulate enough speedup to offset the overhead of jit being enabled. I have a single-module test with some numeric functions like bezier and inverse-bezier, suitable for experimenting with --static
- however, lack of primitive double support (#26 and #27) is blocking me from getting the expected 4x or 8x gains from static + jit (based on similar tests I've done with integer functions). Without using primitives in these functions, --static + jit is not giving a speedup vs. plain jit.
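For reference, the kind of numeric kernel in question, written in plain Python (a hypothetical cubic bezier; the idea would be to annotate it with primitive doubles once #26/#27 land):

```python
def bezier(t, p0, p1, p2, p3):
    """Evaluate a cubic Bezier curve at t in [0, 1] (plain-Python sketch)."""
    u = 1.0 - t
    return (u * u * u * p0
            + 3.0 * u * u * t * p1
            + 3.0 * u * t * t * p2
            + t * t * t * p3)
```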
from cinder.
as mentioned, I see a 1 - 2% slowdown in our app even with an empty jit list. I imagine cinderjit has to intercept every call and compare it against the list, and that may be a lot of overhead for an application with many short calls.
The overhead should only come on the first call to a given function (see PyEntry_LazyInit() and friends, which swap around the value of func->vectorcall in ceval.c). So, this shouldn't be very much unless you're invoking a large number of new functions throughout the program's execution.
Shooting in the dark a bit, does -X jit-no-type-slots help? If not, perhaps a perf run could shed some light on things.
from cinder.
I didn't notice a change from -X jit-no-type-slots.
this shouldn't be very much unless you're invoking a large number of new functions throughout the program's execution.
I would say our app is heavy on nested functions, closures, and lambdas.
from cinder.
I would say our app is heavy on nested functions, closures, and lambdas.
Hmm, that may actually have an impact. After my comment yesterday I spoke to a colleague and we discussed that nested functions etc. go down the PyEntry_LazyInit() path every time the containing function is entered. Again, I think perf may help confirm that theory.
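That theory is visible at the Python level: each entry to the containing function builds a fresh function object (sharing one code object), so each fresh object starts with an uninitialized entry point:

```python
def make_handler(x):
    def handler():   # a new function object on every call to make_handler
        return x
    return handler

f1 = make_handler(1)
f2 = make_handler(2)

# Distinct function objects sharing a single code object; each new object
# would take the lazy entry-point initialization path on its first call.
assert f1 is not f2
assert f1.__code__ is f2.__code__
```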
from cinder.
I don't know how to use perf offhand, but I fumbled around with it using perf top -g -p ... and then filtering on PyEntry. It seems to confirm the 2% overhead when running our production application with an empty jit list.
from cinder.
There may be a gross inefficiency with -X jit-enable-jit-list-wildcards. With that option removed, the overhead is only 0.5% (still high, but 4x better).
empty jit list (wildcards disabled):
empty jit list, -X jit-enable-jit-list-wildcards:
from cinder.
For what it's worth, we now have -X jit-auto=N, where the JIT will automatically compile functions after N calls. It must be used in conjunction with -X jit.
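The threshold behavior behind -X jit-auto=N can be modeled in pure Python (auto_compile and compile_fn are illustrative stand-ins, not Cinder APIs):

```python
import functools

def auto_compile(threshold, compile_fn):
    """Hand compile_fn a function once it has been called `threshold`
    times; a pure-Python model of the -X jit-auto=N idea."""
    def decorator(fn):
        state = {"calls": 0, "compiled": False}
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            state["calls"] += 1
            if not state["compiled"] and state["calls"] >= threshold:
                compile_fn(fn)          # stand-in for JIT compilation
                state["compiled"] = True
            return fn(*args, **kwargs)
        return wrapper
    return decorator

compiled = []

@auto_compile(3, compiled.append)
def work(n):
    return n + 1

for i in range(5):
    work(i)
# `compiled` now holds `work` exactly once (triggered on the 3rd call).
```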
from cinder.