Code Monkey home page Code Monkey logo

threaded-code-benchmark's People

Contributors

shadowofneptune avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

threaded-code-benchmark's Issues

It would be nice to benchmark more bytecode sequences with different properties

The current bytecode sequence has an important property: no instruction appears twice. Hence for each instruction handler the next instruction handler to dispatch to is perfectly predictable. It amplifies the advantage of implementations that replicate instruction dispatch.

In most if not all VMs the next instruction handler to dispatch to is seldom perfectly predictable. It would be nice to benchmark a more complex instruction sequence to get a more realistic estimate.

The reasons continuation-passing fared worse, analysed

When I saw this on HN, I got immediately hooked as I am a big proponent of implementing interpreters in CP-style. For such a trivial interpreter, there must not be any difference, yet CP-style performed significantly worse.

I've compiled the sources with clang 10.0.0 at -O3. Got similar results.

Looking at the disassembly of computed_goto.c, some interesting quirks are immediately apparent:

print_i:
    if (i % PRINT_INTERVAL == 0) {
        printf("* ");
        fflush(stdout);
    }
    goto *dispatcher[prog[++ip]];
    testl   %r13d, %r13d
    je  .LBB0_4
# %bb.5:                                #   in Loop: Header=BB0_3 Depth=1
    addl    $1, %ebx
    leaq    (%r14,%rbx,4), %rax
    movl    (%rax), %eax
    jmpq    *run_program.dispatcher(,%rax,8)

The compiler was able to figure out that the condition is most likely false. Also modulo is computed somewhere else, presumably reducing latency. r13d is zero if computing modulo would yield 0.

branch:
    if (i == MAX_ITERATIONS)
        ip += 1;
    goto *dispatcher[prog[++ip]];
    addl    %r12d, %ebx
    addl    $1, %ebx
    leaq    (%r14,%rbx,4), %rax
    movl    (%rax), %eax
    jmpq    *run_program.dispatcher(,%rax,8)

Interestingly, there are no branches whatsoever! It's easy to infer that %ebx holds ip. So it doesip += %r12d + 1. Presumably, %r12d holds either 0 or 1.

To summarise, it looks like the compiler introduced two temporary variables

%r13d = i % PRINT_INTERVAL == 0
%r12d = i == MAX_ITERATIONS

The following excerpt proves our findings:

incr_i:
    i += 1;
    goto *dispatcher[prog[++ip]];
    addl    $1, %r15d   # <-------------- ++i
    addl    $1, %ebx
    movl    %ebx, %eax
    leaq    (%r14,%rax,4), %rax
    xorl    %r12d, %r12d
    cmpl    $2000000000, %r15d      # imm = 0x77359400
    sete    %r12b    # <-------------------------- %r12d = i == MAX_ITERATIONS
    movl    %r15d, %ecx
    imulq   $1441151881, %rcx, %rcx # imm = 0x55E63B89
    shrq    $58, %rcx   # <---------- %rcx = i / PRINT_INTERVAL
    imull   $200000000, %ecx, %ecx  # imm = 0xBEBC200
    movl    %r15d, %r13d
    subl    %ecx, %r13d  # <---------- %r13d = i - i / PRINT_INTERVAL
    movl    (%rax), %eax
    jmpq    *run_program.dispatcher(,%rax,8)

We can see that the compiler is quite sophisticated. The reasons this transformation is beneficial are quite apparent: fewer branches, latency reduction. I must mention that this optimisation opportunity is specific to the toy interpreter under the benchmark and is typically not applicable to more complex interpreters.

CP-style implementation has branches in print_i and branch_i as expected. Even though they should be accurately predicted, for some reasons they are not. Providing hints to the compiler makes both implementation exhibit the same performance on my compiler/CPU:

diff --git a/continuation_passing.c b/continuation_passing.c
index 2272860..647cdf9 100644
--- a/continuation_passing.c
+++ b/continuation_passing.c
@@ -26,7 +26,7 @@ void incr_i(unsigned ip, unsigned i, program prog)

 void print_i(unsigned ip, unsigned i, program prog)
 {
-       if (i % PRINT_INTERVAL == 0) {
+       if (__builtin_expect(i % PRINT_INTERVAL == 0, 0)) {
                printf("* ");
                fflush(stdout);
        }
@@ -35,7 +35,7 @@ void print_i(unsigned ip, unsigned i, program prog)

 void branch(unsigned ip, unsigned i, program prog)
 {
-       if (i == MAX_ITERATIONS)
+       if (__builtin_expect(i == MAX_ITERATIONS, 0))
                ip += 1;
        next(ip, i, prog);
 }

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.