shadowofneptune / threaded-code-benchmark
A demonstration of ways to implement 'threaded code,' a technique used in Forth and in virtual machines like the JVM.
License: MIT License
The benchmark uses time() from time.h to measure elapsed time, which has only one-second resolution. As certain runs complete in six seconds or less, this coarse time source introduces significant inaccuracy.
The current bytecode sequence has an important property: no instruction appears twice. Hence, for each instruction handler, the next handler to dispatch to is perfectly predictable. This amplifies the advantage of implementations that replicate instruction dispatch.
In most, if not all, real VMs the next instruction handler to dispatch to is seldom perfectly predictable. It would be nice to benchmark a more complex instruction sequence to get a more realistic estimate.
When I saw this on HN, I was immediately hooked, as I am a big proponent of implementing interpreters in CP-style. For such a trivial interpreter there should be no difference, yet CP-style performed significantly worse.
I've compiled the sources with clang 10.0.0 at -O3 and got similar results.
Looking at the disassembly of computed_goto.c, some interesting quirks are immediately apparent:
print_i:
if (i % PRINT_INTERVAL == 0) {
printf("* ");
fflush(stdout);
}
goto *dispatcher[prog[++ip]];
testl %r13d, %r13d
je .LBB0_4
# %bb.5: # in Loop: Header=BB0_3 Depth=1
addl $1, %ebx
leaq (%r14,%rbx,4), %rax
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
The compiler was able to figure out that the condition is most likely false. The modulo is also computed elsewhere, presumably reducing latency: %r13d is zero exactly when the modulo would yield 0.
branch:
if (i == MAX_ITERATIONS)
ip += 1;
goto *dispatcher[prog[++ip]];
addl %r12d, %ebx
addl $1, %ebx
leaq (%r14,%rbx,4), %rax
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
Interestingly, there are no branches whatsoever! It's easy to infer that %ebx holds ip, so this does ip += %r12d + 1. Presumably, %r12d holds either 0 or 1.
To summarise, it looks like the compiler introduced two temporary variables:
%r13d = i % PRINT_INTERVAL
%r12d = i == MAX_ITERATIONS
The following excerpt confirms our findings:
incr_i:
i += 1;
goto *dispatcher[prog[++ip]];
addl $1, %r15d # <-------------- ++i
addl $1, %ebx
movl %ebx, %eax
leaq (%r14,%rax,4), %rax
xorl %r12d, %r12d
cmpl $2000000000, %r15d # imm = 0x77359400
sete %r12b # <-------------------------- %r12d = i == MAX_ITERATIONS
movl %r15d, %ecx
imulq $1441151881, %rcx, %rcx # imm = 0x55E63B89
shrq $58, %rcx # <---------- %rcx = i / PRINT_INTERVAL
imull $200000000, %ecx, %ecx # imm = 0xBEBC200
movl %r15d, %r13d
subl %ecx, %r13d # <---------- %r13d = i - (i / PRINT_INTERVAL) * PRINT_INTERVAL = i % PRINT_INTERVAL
movl (%rax), %eax
jmpq *run_program.dispatcher(,%rax,8)
We can see that the compiler is quite sophisticated. The reasons this transformation is beneficial are apparent: fewer branches and reduced latency. I must mention that this optimisation opportunity is specific to the toy interpreter under benchmark and typically does not apply to more complex interpreters.
The CP-style implementation has branches in print_i and branch, as expected. Even though they should be accurately predicted, for some reason they are not. Providing hints to the compiler makes both implementations exhibit the same performance on my compiler/CPU:
diff --git a/continuation_passing.c b/continuation_passing.c
index 2272860..647cdf9 100644
--- a/continuation_passing.c
+++ b/continuation_passing.c
@@ -26,7 +26,7 @@ void incr_i(unsigned ip, unsigned i, program prog)
void print_i(unsigned ip, unsigned i, program prog)
{
- if (i % PRINT_INTERVAL == 0) {
+ if (__builtin_expect(i % PRINT_INTERVAL == 0, 0)) {
printf("* ");
fflush(stdout);
}
@@ -35,7 +35,7 @@ void print_i(unsigned ip, unsigned i, program prog)
void branch(unsigned ip, unsigned i, program prog)
{
- if (i == MAX_ITERATIONS)
+ if (__builtin_expect(i == MAX_ITERATIONS, 0))
ip += 1;
next(ip, i, prog);
}