shadowofneptune / threaded-code-benchmark
A demonstration of ways to implement 'threaded code,' a technique used in Forth and in virtual machines like the JVM.
License: MIT License
The benchmark uses time() from time.h to measure elapsed time, which has only one-second resolution. As certain runs complete in six seconds or less, this coarse time source introduces significant inaccuracy.
The current bytecode sequence has an important property: no instruction appears twice. Hence, for each instruction handler, the next handler to dispatch to is perfectly predictable. This amplifies the advantage of implementations that replicate instruction dispatch.
In most, if not all, real VMs the next instruction handler to dispatch to is seldom perfectly predictable. It would be nice to benchmark a more complex instruction sequence to get a more realistic estimate.
When I saw this on HN, I was immediately hooked, as I am a big proponent of implementing interpreters in CP-style. For such a trivial interpreter there should be no difference, yet CP-style performed significantly worse.
I've compiled the sources with clang 10.0.0 at -O3 and got similar results.
Looking at the disassembly of computed_goto.c, some interesting quirks are immediately apparent:
print_i:
if (i % PRINT_INTERVAL == 0) {
printf("* ");
fflush(stdout);
}
goto *dispatcher[prog[++ip]];
testl %r13d, %r13d
je .LBB0_4
# %bb.5: # in Loop: Header=BB0_3 Depth=1
addl $1, %ebx
leaq (%r14,%rbx,4), %rax
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
The compiler was able to figure out that the condition is most likely false. The modulo is also computed elsewhere, presumably reducing latency: %r13d is zero exactly when the modulo would yield 0.
branch:
if (i == MAX_ITERATIONS)
ip += 1;
goto *dispatcher[prog[++ip]];
addl %r12d, %ebx
addl $1, %ebx
leaq (%r14,%rbx,4), %rax
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
Interestingly, there are no branches whatsoever! It's easy to infer that %ebx holds ip, so this does ip += %r12d + 1. Presumably, %r12d holds either 0 or 1.
To summarise, it looks like the compiler introduced two temporary variables:
%r13d = i % PRINT_INTERVAL
%r12d = i == MAX_ITERATIONS
The following excerpt confirms our findings:
incr_i:
i += 1;
goto *dispatcher[prog[++ip]];
addl $1, %r15d # <-------------- ++i
addl $1, %ebx
movl %ebx, %eax
leaq (%r14,%rax,4), %rax
xorl %r12d, %r12d
cmpl $2000000000, %r15d # imm = 0x77359400
sete %r12b # <-------------------------- %r12d = i == MAX_ITERATIONS
movl %r15d, %ecx
imulq $1441151881, %rcx, %rcx # imm = 0x55E63B89
shrq $58, %rcx # <---------- %rcx = i / PRINT_INTERVAL
imull $200000000, %ecx, %ecx # imm = 0xBEBC200
movl %r15d, %r13d
subl %ecx, %r13d # <---------- %r13d = i - (i / PRINT_INTERVAL) * PRINT_INTERVAL = i % PRINT_INTERVAL
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
We can see that the compiler is quite sophisticated. The reasons this transformation is beneficial are apparent: fewer branches and reduced latency. I must mention that this optimisation opportunity is specific to the toy interpreter under benchmark and typically does not apply to more complex interpreters.
The CP-style implementation has branches in print_i and branch, as expected. Even though they should be accurately predicted, for some reason they are not. Providing hints to the compiler makes both implementations exhibit the same performance on my compiler/CPU:
diff --git a/continuation_passing.c b/continuation_passing.c
index 2272860..647cdf9 100644
--- a/continuation_passing.c
+++ b/continuation_passing.c
@@ -26,7 +26,7 @@ void incr_i(unsigned ip, unsigned i, program prog)
void print_i(unsigned ip, unsigned i, program prog)
{
- if (i % PRINT_INTERVAL == 0) {
+ if (__builtin_expect(i % PRINT_INTERVAL == 0, 0)) {
printf("* ");
fflush(stdout);
}
@@ -35,7 +35,7 @@ void print_i(unsigned ip, unsigned i, program prog)
void branch(unsigned ip, unsigned i, program prog)
{
- if (i == MAX_ITERATIONS)
+ if (__builtin_expect(i == MAX_ITERATIONS, 0))
ip += 1;
next(ip, i, prog);
}