
README for LuaJIT 2.1
---------------------

LuaJIT is a Just-In-Time (JIT) compiler for the Lua programming language.

Project Homepage: https://luajit.org/

LuaJIT is Copyright (C) 2005-2023 Mike Pall.
LuaJIT is free software, released under the MIT license.
See full Copyright Notice in the COPYRIGHT file or in luajit.h.

Documentation for LuaJIT is available in HTML format.
Please point your favorite browser to:

 doc/luajit.html

luajit's Issues

Parse binary numbers 0bxxx

Very useful for bit operations. The enhancement to the number parsing code should not be that hard.
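A minimal sketch of what the number-parsing extension might look like (a hypothetical helper, not LuaJIT code): scan the digits after a `0b` prefix into an integer. Overflow handling and error reporting are omitted.

```c
#include <stdint.h>

/* Hypothetical lexer helper: scan binary digits after a consumed "0b"
** prefix. Advances *p past the digits and returns the accumulated value.
** Overflow checks are omitted in this sketch.
*/
static uint64_t scan_binary(const char **p)
{
  uint64_t n = 0;
  while (**p == '0' || **p == '1') {
    n = (n << 1) | (uint64_t)(**p - '0');
    (*p)++;
  }
  return n;
}
```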

But this should be synced up to a release, since it breaks source code compatibility. It should either make it into the code base for 2.1.0-final or not into 2.1 at all. Remove the label if that's the decision.

Parse unicode escapes \uxxxx \Uxxxxxxxx

The escape sequence for a 16 bit or 32 bit code point should be converted to UTF-8 in the lexer. Lowercase \u accepts exactly 4 hex digits (in either case), uppercase \U accepts exactly 8 hex digits.
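For illustration, here is a self-contained UTF-8 encoder of the kind the lexer would need (a sketch, not LuaJIT's actual lexer code). Code points above U+10FFFF are clamped to the replacement character for simplicity:

```c
#include <stdint.h>

/* Sketch: convert a code point parsed from \uXXXX or \UXXXXXXXX to
** UTF-8. Returns the number of bytes written to buf (1..4). Invalid
** code points > 0x10FFFF are mapped to U+FFFD here for simplicity.
*/
static int utf8_encode(uint32_t cp, char *buf)
{
  if (cp > 0x10FFFF) cp = 0xFFFD;  /* U+FFFD REPLACEMENT CHARACTER. */
  if (cp < 0x80) {
    buf[0] = (char)cp;
    return 1;
  } else if (cp < 0x800) {
    buf[0] = (char)(0xC0 | (cp >> 6));
    buf[1] = (char)(0x80 | (cp & 0x3F));
    return 2;
  } else if (cp < 0x10000) {
    buf[0] = (char)(0xE0 | (cp >> 12));
    buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
  } else {
    buf[0] = (char)(0xF0 | (cp >> 18));
    buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
  }
}
```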

This should be synced up to a release, since it breaks source code compatibility. It should either make it into the code base for 2.1.0-final or not into 2.1 at all. Remove the label if that's the decision.

Improve lexer scanning speed

Yes, the lexer is quite fast, but it could be improved even more. The lexer currently (slowly) copies all source characters before interning tokens, scanning numbers, etc. The inner loop for that, as emitted by the C compiler, looks ghastly.

But most of the time it could just pass a pointer to the input buffer. At least if the look-ahead is big enough. There are some difficulties with escapes et al, but one could just fall back to the current copy strategy.

The conditions for this tuning would change, if there was a more efficient approach to string interning: write directly to a buffer and either use that for the new string or recycle it. But that probably depends on the new GC. And it might not be that helpful here, since in practice most of the tokens are already interned.

Add MIPS soft-float and dual-number port

Many embedded devices use MIPS CPUs without a hardware FPU. A soft-float and dual-number port would allow them to use LuaJIT, e.g. for routers running OpenWrt.

A soft-float and dual-number port affects the MIPS interpreter core, the MIPS JIT compiler backend and some soft-float-related support code.

Add Value-Range Propagation (VRP)

Value-Range Propagation allows optimizing operations depending on the range of previous values or guarded assertions in the IR.

E.g. the value of BAND(x, +255) is in the range [0,+255]. This has some useful implications for following operations: e.g. the value is non-negative, it fits into a byte, it can be implicitly zero-extended instead of sign-extended to 64 bits, etc.

A similar example is LT(x, +10). This asserts that the value x is in the range [MININT,+9] for all code that follows. That would allow to eliminate (say) an NE(x, +20). This often happens in elseif-style case comparisons.
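The two examples above can be illustrated with a toy interval data structure (hypothetical, not the LuaJIT IR):

```c
#include <stdint.h>

/* Toy value-range sketch (hypothetical, not LuaJIT IR code): track the
** interval known for a value after BAND with a constant mask and after
** a guarded LT, then use it to decide a following comparison.
*/
typedef struct { int64_t lo, hi; } Range;

/* BAND(x, mask) with a non-negative constant mask is in [0, mask]. */
static Range vrp_band_const(uint32_t mask)
{
  Range r = { 0, (int64_t)mask };
  return r;
}

/* After guard LT(x, k) holds, the range narrows to [lo, k-1]. */
static Range vrp_assert_lt(Range r, int64_t k)
{
  if (r.hi > k - 1) r.hi = k - 1;
  return r;
}

/* NE(x, k) is trivially true if k lies outside the known range. */
static int vrp_ne_always_true(Range r, int64_t k)
{
  return k < r.lo || k > r.hi;
}
```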

[There are also options for back-propagating knowledge about value ranges to earlier IR instructions, but this is hard. Have a look at the DSE optimization for ideas.]

An implementation of VRP for LuaJIT would probably best use an adjunct data structure to the IR, e.g. a simple fixed-size cache for VRP results. For efficiency, this cache should be filled on-demand and have a hint system. E.g. a guard would set a bit in a bloom filter for the non-constant operand and only then would a following VRP invocation for a use of the same operand need to do an exhaustive search.

Note that the actual benefit of VRP is limited. Guards that never fail are very cheap on out-of-order CPUs. Avoiding conversions is probably more rewarding. Cost vs. benefit for this optimization should be carefully evaluated. It certainly doesn't pay off to do a deeper analysis (multiple chained ops) or a finer-grained analysis (multiple intervals), unless it comes naturally from the way the compiler pipeline works.

This would be a nice, isolated project for someone who's interested in compiler internals.

Tracing state machine does not account for OOM in trace_save

trace_stop first patches code (bytecode or machine code) and then commits executable code (lj_mcode_commit) and allocates a new trace object (trace_save).

Errors inside lj_mcode_commit just cause panic.

However trace_save calls lj_mem_newt (aka lj_mem_realloc) which can throw lj_err_mem.

Tracer catches this OOM and aborts tracing, but trace_abort will simply drop trace to the floor without trying to undo trace_stop side effects.

This can lead to a variety of problems, most notably "dangling" references to this trace from patched bytecode (or patched machine code) which now point to a dropped trace.

I don't have a really good standalone repro for this bug, but it is easy to reproduce: just add an unconditional throw inside trace_save to emulate OOM.

diff --git a/src/lj_trace.c b/src/lj_trace.c
index 79c50b0..473b7c2 100644
--- a/src/lj_trace.c
+++ b/src/lj_trace.c
@@ -125,6 +125,7 @@ static void trace_save(jit_State *J)
   size_t sz = sztr + szins +
              J->cur.nsnap*sizeof(SnapShot) +
              J->cur.nsnapmap*sizeof(SnapEntry);
+  lj_err_mem(J->L);  /* Always OOM. */
   GCtrace *T = lj_mem_newt(J->L, (MSize)sz, GCtrace);
   char *p = (char *)T + sztr;
   memcpy(T, &J->cur, sizeof(GCtrace));
$ gdb --args src/luajit
(gdb) r
Starting program: /usr/local/google/home/vegorov/src/third_party/LuaJIT-2.1/src/luajit 
LuaJIT 2.1.0-alpha -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
JIT: ON SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
> for i = 0, 10000 do end

Program received signal SIGSEGV, Segmentation fault.
0x000000000041d54e in ?? ()
(gdb) bt
#0  0x000000000041d54e in ?? ()
#1  0x000000000040d2bd in lua_pcall ()
#2  0x000000000040485c in _start ()
(gdb) x/i $pc
=> 0x41d54e:  mov    rax,QWORD PTR [rax+0x40]  ;; this is mov RDa, TRACE:RD->mcode

Better integration of FFI number types

Plenty of builtins cannot deal with FFI number types, in particular int64_t/uint64_t. But this would be quite useful; good examples are math.min() and math.max().

A general solution isn't easy, because most of the builtins need to invoke code that's specialized to the number type to do their job. And the compiler frontend/backend needs to support these, too.

IMHO there's no easy solution for 3rd party code in general or the classic Lua/C API in particular. Auto-coercions can be quite dangerous for unsuspecting code.

Whenever this is tackled, one should think of the implications for other core functionality, like the numeric for i=start,stop,step do loop. Better split this off to a different issue.

Add ARM64 JIT compiler backend

It should not be difficult to find sponsors for this feature, since ARM Holdings plc and their hardware partners want to invade the server world with this architecture. Android is heading for ARM64, too (not in user mode, yet).

IMHO ARM64 is the cleanest 64 bit architecture so far and writing the backend should be a joy. Lots of low-hanging fruit for on-the-fly optimizations in the backend.

However, there's the ugly issue #25, on which this depends.

New Garbage Collector

The garbage collector used by LuaJIT 2.x is essentially the same as the Lua 5.1 GC. The current garbage collector is relatively slow compared to implementations for other language runtimes. It's not competitive with top-of-the-line GCs, especially for large workloads.

A suitable design for a new garbage collector can be found here: http://wiki.luajit.org/New-Garbage-Collector

This is an arena-based, quad-color incremental, generational, non-copying, high-speed, cache-optimized garbage collector.

This is a major change and probably demands a major version bump, whenever it materializes.

Link with pthread

This seems to be a Linux issue, mainly. But it might affect *BSD, too.

Not linking against pthread causes problems if some other library that's loaded later on needs it. It also causes trouble for GDB. There's no conclusive info out there; some reports claim this has been fixed at the OS level, others explain it cannot possibly be fixed.

Unfortunately, always linking against pthread has various side-effects:

  • It's tricky to get it right on all POSIX variants and in particular on all Linux distros.
  • It increases startup time, which is kind of important for a few users (and for silly benchmarks).
  • It may cause existing workarounds to break.

FFI: Correctly handle redeclarations

Redeclarations are currently silently ignored, most of the time.

This should be changed to ignore identical redeclarations, wherever the C standards say so, and to reject other redeclarations. Be careful with pre-declared internal types that might be innocently redeclared in cut'n'pasted code -- these should probably not be rejected.

You'll probably be confronted with a storm of bug reports from users who've relied on the silent ignore 'feature'. Very sorry for that. Plan some extra time for this.

String buffer API

A string buffer API allows efficient incremental construction of a string. The crucial detail is that this allows postponing the decision when to intern the string, if at all. E.g. some io.* builtins could be extended to output the string buffer data without an intermediate string interning.

The 2.1 JIT compiler already has some optimizations for typical incremental string construction. But these cannot eliminate all intermediate string allocations, especially for the typical head + loop + tail construction.
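The head + loop + tail pattern can be sketched with a plain growable buffer (a hypothetical helper, not the proposed API): appending never interns anything; only the final contents would be turned into a string, once.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the postponed-interning idea (hypothetical, not LuaJIT's
** buffer code): pieces accumulate in one growable buffer; no
** intermediate string is ever created.
*/
typedef struct { char *p; size_t len, cap; } StrBuf;

static void sbuf_put(StrBuf *sb, const char *s, size_t n)
{
  if (sb->len + n > sb->cap) {           /* Grow geometrically. */
    size_t cap = sb->cap ? sb->cap * 2 : 16;
    while (cap < sb->len + n) cap *= 2;
    sb->p = (char *)realloc(sb->p, cap); /* Error check omitted. */
    sb->cap = cap;
  }
  memcpy(sb->p + sb->len, s, n);
  sb->len += n;
}
```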

The first implementation should probably use a special userdata type (see UDTYPE_* in lj_obj.h) and only allow a very limited number of operations. One could use the same approach as require("table.new") for gentle introduction of a new string.buffer() constructor in the core string.* library namespace.

At a later stage, the string buffer could be made an alias for the core string type. This would implicitly intern the string for any function that expects a plain string, but would open optimization opportunities for many other builtins like string.sub(). Since this requires changes to the core of the type system, this should be discussed separately and in context, e.g. for the new GC.

[Please do not infest this issue with a discussion of ropes (string views with links to parent strings). The latter are only useful for very specific scenarios (like an editor), but harmful when used for a generic string type. E.g. the JVM has recently overhauled their rope-based string representation and replaced it with a much simpler co-located string type (same as used in Lua/LuaJIT). That decision was backed by extensive benchmarking, since it wasn't a simple change, for sure.]

Optimizations for power operator a^b

The current narrowing and strength reduction for a^b aka math.pow(a, b) is problematic.

The a^i integer narrowing is definitely useful:

  • It opens other optimization opportunities (unrolling).
  • It gives more precise results than the exp/log method in some important cases, like 3^i.
  • But it may give different results than calling the libm pow() function. Whichever is more accurate is a question of the libm implementation and some debate. Especially when some intermediate step and/or the final result overflows FP precision.
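The integer narrowing amounts to exponentiation by squaring with double multiplies, which stays exact for cases like 3^5 as long as nothing overflows FP precision (a sketch of the idea, not LuaJIT's implementation):

```c
/* Sketch: compute a^i for a non-negative integer exponent by repeated
** squaring with double multiplies. Exact for e.g. 3^5 = 243, whereas
** exp(i*log(a)) may be off by an ulp depending on the libm.
*/
static double pow_narrowed(double a, unsigned int i)
{
  double r = 1.0;
  while (i) {
    if (i & 1) r *= a;  /* Multiply in the current power of a. */
    a *= a;             /* Square for the next exponent bit. */
    i >>= 1;
  }
  return r;
}
```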

The split into exp2(b*log2(a)) is of questionable utility and more troublesome:

  • It mainly benefits some silly benchmarks, where log2(a) can be hoisted out.
  • It doesn't give proper results for NaN, Inf and other special cases. Whether the libm implementation does that is often questionable, though.
  • It adds a dependency on C99 exp2() and log2(), which (still) aren't part of some libm implementations.
  • And/or the libm workaround implementations (and ours) for exp2() and log2() suck: computing them via exp(x*ln 2) and log(x)/ln 2 is hopelessly imprecise.
  • All JIT backends have to do some weird dance to join these back together to pow() in the hope that this fixes some of these issues.

Definitely needs further discussion and analysis.

FFI: Redesign handling of uint32_t

Currently, uint32_t is implicitly converted to a double in the IR. Some simple FOLD optimizations can eliminate the CONV. But most optimizations cannot work on this weird representation.

Also, outside of the IR, dual-number mode is an int32_t/double combo. That doesn't leave much choice other than storing it as a double, which is of course inefficient. Always boxing uint32_t would be slow. And these semantics would be really bad for most FFI glue code, since you can't just pass on a uint32_t result to a function that expects a plain number.

The JavaScript engines use a triple-number mode with an int32_t/uint32_t/double combo, due to the semantics of their bit operators. Lua BitOp functions always yield int32_t (for various reasons), but it's tough to avoid uint32_t when dealing with the FFI.

I don't think there's a simple solution within these constraints.

Create official bindings to other languages.

Now that LuaJIT has its own GitHub organisation, it would be nice to provide "officially backed" binding projects for some popular languages like:

  • Go
  • Rust
  • Ruby
  • Python

Each one of them could have its own repo, so as not to interfere with the core one. This could help spread LuaJIT usage around the world.

PS
Sorry for polluting the issue area, but I couldn't find a better way to propose this idea. I hope you understand.

Finish LJ_GC64 mode

What's already done:

  • Core infrastructure.
  • x64 interpreter. Currently needs to be explicitly enabled with -DLUAJIT_ENABLE_GC64.
  • ARM64 interpreter.

What's missing:

  • Changes to the compiler data structures. In particular, most IR constants are 32 bit right now.
  • Changes to the compiler frontend. E.g. frames in snapshots.
  • Changes to the compiler backends. x64 should be first, since the support is mostly there.

Note that LJ_GC64 implies LJ_FR2 (2-slot frame info), which causes its own complications for the compiler.

There are quite a few asserts for both LJ_GC64 and LJ_FR2 all over the code, wherever I knew support was missing. But this is incomplete, so don't believe it'll just work as soon as it builds cleanly.

Related discussion: http://www.freelists.org/post/luajit/Status-of-LJ-GC64-on-x64,1 http://www.freelists.org/post/luajit/Status-of-LJ-GC64-on-x64,3

[Considering the scope and the difficulty of the changes, only the current incomplete LJ_GC64 mode support will make it into 2.1.]

Reorganize simple builtins

There should be an easier way to handle simple builtins that just call a libc/libm function. Examples are math.sin() etc. or some os.* functions.

It's tedious to add assembler code and/or a dedicated trace recording function for each of them. Which is the main reason this hasn't been done for the less performance-sensitive functions (e.g. os.*).

Note that e.g. math.sqrt() is not a candidate, since it does have a much more efficient machine code equivalent on some architectures (one main difference vs. libm sqrt() is not having to set errno!).

Before this is tackled, one should think about how this can be reused for some composite builtins that could be better represented with bytecode, which simplifies tracing.

FFI: Add vector/SIMD operations

Currently, vector data types may be defined with the FFI, but you really can't do much with them. The goal of this project is to add full support for vector data types to the JIT compiler and the CPU-specific backends (if the target CPU has a vector extension).

A new ffi.vec module should declare standard vector types and attach the machine-specific SIMD intrinsics as (meta)methods.

Prerequisites for this project are user-definable intrinsics #39 and the new garbage collector #38 .

FFI: Redesign handling of float type

The C float type is a bit of a stepchild, right now. It implicitly converts to double on accesses, there's no constant representation in the IR, no float literals, etc. As a corollary, you can't really do arithmetic with floats, except via doubles.

On a modern desktop or server CPU, it's not faster to do scalar float computations vs. scalar double computations (with the exception of divisions). But the trade-off is very different for embedded targets.

Not converting to doubles is bad for the usability of the FFI. Explicit boxing of floats is tedious and slow. Adding a core float type everywhere seems overkill.

Note this is unrelated to SIMD float operations, since the vectors wouldn't need to be unpacked (whenever SIMD support would be implemented).

I don't think there's a simple solution for that within the current type system of LuaJIT. See issue #31, too.

FFI: Initialize nested structs/unions

Some patches have been posted to the mailing list, but they do not cover the general case.

This is related to a more general issue: all of the FFI initialization code that does this for the interpreter has some big overlap with similar, but not quite identical code in the trace recorder. There are subtle differences and careful omissions of complicated cases. One of the bigger warts of the code base.

Support yielding over pcall from C API (lua_pcallk)

LuaJIT currently supports yielding over pcall from Lua code:

LuaJIT 2.0.4 -- Copyright (C) 2005-2015 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
> local co = coroutine.create(function() print("A"); print(pcall(function() coroutine.yield() print("B") end)); print("C"); end); coroutine.resume(co); coroutine.resume(co);
A
B
true
C

But LuaJIT does not have a way to perform the above from C code.

Lua 5.2 and 5.3 extend lua_pcall via lua_pcallk which takes an additional 'continuation' argument.

FFI: Add internal pre-processor

There are existing external solutions, like http://lcpp.schmoock.net/, but they have disadvantages wrt. performance and integration. Really, one should be able to treat #define FOOBAR 42 just like enum { FOOBAR = 42 }, i.e. print(ffi.C.FOOBAR + 1) ought to work the same for both. But this requires deep integration.

The data structures for the builtin C parser of the FFI have already been designed to be extended with an internal pre-processor. E.g. one could add a CT_DEFINE and re-use most of the token interning logic.

Writing a standards-conformant C pre-processor is not as simple as it sounds. This requires lazy evaluation of a stored token stream. And there are weird interactions with comments and whitespace.

Before anyone tackles this, please read the standards and study a couple of conforming implementations, first.

47 bit address space restriction on ARM64

Hi, I encountered a problem running luajit-2.1 on an ARM64 platform, as follows:
My kernel enables the "AArch64 Linux memory layout with 64KB pages + 3 levels" shown at https://www.kernel.org/doc/Documentation/arm64/memory.txt, so I have 48 bit virtual addresses.
But lj_obj.h says "64 bit platform, 47 bit pointers", and when the code uses LJ_GCVMASK to extract the real pointer, the returned value is wrong: the 48 bit VA is truncated to 47 bits, so the luajit program gets a SIGSEGV signal.
We could use the "AArch64 Linux memory layout with 64KB pages + 2 levels" kernel configuration, which has a 42 bit VA and would avoid the problem, but because of some limitations we can't use that mode.
So how can I resolve this problem?
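The failure mode can be demonstrated directly. The mask below assumes the (1 << 47) - 1 definition of LJ_GCVMASK from lj_obj.h; any pointer with bit 47 set loses that bit on extraction:

```c
#include <stdint.h>

/* Illustration of the reported bug: LuaJIT 2.1 keeps GC object
** pointers in the low 47 bits of a tagged 64 bit value. GCVMASK below
** assumes the (1 << 47) - 1 definition of LJ_GCVMASK in lj_obj.h.
** A 48 bit virtual address loses its top bit when masked, yielding
** a bogus pointer and, eventually, a SIGSEGV.
*/
#define GCVMASK (((uint64_t)1 << 47) - 1)

static uint64_t extract_ptr(uint64_t tagged)
{
  return tagged & GCVMASK;
}
```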

Tune loop unrolling heuristics

The loop unrolling heuristics are not ideal for all situations. Some dependent projects like OpenResty change the loop unrolling limits.

However the trace heuristics are a very, very delicate matter. They interact with each other in many ways and it'll only show in very specific code. This is hard to tune unless one is willing to do lots of benchmarking and stare at generated code for hours. That's how the current heuristics have been derived.

A major project for someone with enough time at their hands.

Rework iOS build instructions

The iOS build instructions need to be reworked. Changing the iOS SDK version in the docs every now and then is not a maintainable solution.

Also, for the 2.1 branch, build instructions for iOS ARM64 need to be added before the final release.

Replace package.* library with pure Lua code

The library does lots of string manipulations and simple decisions. It's not performance critical, either. It's much more suitable for a pure Lua implementation than the existing C code.

This should be compiled to embedded bytecode during the build phase.

I have a sketch of that, but it suffers from the same problem as other builtins: it relies on features that cannot be represented in bytecode, yet. And it really needs to call into C code for low-level tasks.

See issue #15.

OSX default compiler

The Makefile relies on GCC and a few GCC-specific features (like -fpmath=sse).

We can probably safely switch to Clang on OSX by default. However, this needs some restructuring to avoid burying the compiler override in the OS selection jungle further down src/Makefile.

Use LEA 32/64 bit fusion, but only in specific cases

The x86/x64 LEA instruction is used as a 3/4-operand add. This often saves a register move.

However, this optimization can only be used under specific conditions. Right now, the conditions are too strict for 64 bit operands: it's disabled. This should be evaluated carefully before changing anything.

This was formerly rare, but now it's quite common with FFI int64_t/uint64_t or 64 bit pointer arithmetic.

Internal string formatting for doubles

String scanning uses builtin code since 2.0. And it simplified many things, like weirdness with locale settings, dependency on buggy strtod() etc.

String formatting has been rewritten for 2.1. But a dependency on sprintf() remains for formatting doubles. This should be replaced with builtin code.

Formatting floating point numbers is surprisingly hard, if one wants to do it both correct, fast and with constant memory.

TODO: Add links to papers and find a suitable algorithm that covers the cases that LuaJIT needs: tostring() and the specific subset of options for string.format("%...[fg]").

Fix or ditch trace stitching

Trace stitching is currently broken and disabled.

It was quite useful for some code bases with many C calls that would otherwise cause an NYI. It was not so useful for other code bases where enough effort has been put into converting the code to do C calls with the FFI (e.g. OpenResty).

Previous discussion: http://www.freelists.org/post/luajit/small-script-to-reproduce-bogus-trace-stitch-errors-at-line-0-with-coroutines-in-latest-21,1
(please read the whole thread)

If this is to be fixed and re-enabled, please note the stitching heuristics haven't been successfully tuned at all. The trace length turned out not to be a useful criterion, when evaluated in isolation. The -Ominstitch limit was effectively used as an on/off switch.

Add Hyperblock Scheduling

Producing good code for unbiased branches is a key problem for trace compilers. This is the main cause for "trace explosion" and bad performance with certain types of branchy code.

Hyperblock scheduling promises to solve this nicely at the price of a major redesign of the compiler: selected traces are woven together to a single hyper-trace. This would also pave the way for emitting predicated instructions, which benefits some CPUs (e.g. ARM) and is a prerequisite for efficient vectorization.

Be warned: this is heavy stuff! Something for a thesis, not for an afternoon.

Probably needs a major reorganization of the compiler. Requires an (adjunct) predicated IR. Selecting a profitable region for this optimization is the first hurdle. The merge phase for the predicated IR is tough.

Ask me, if you need a motivating example of how a predicated IR would look like and what this would enable. But don't hold your breath, this requires a long explanation.

Add ARM Thumb2 port

This is relevant for the low-end Cortex-M CPUs that only support Thumb2 code. They simply can't run ARM32 machine code.

Even though mobile ARM CPUs are going the ARM64 route (which explicitly includes ARM32), the Cortex-M SoCs are not going away. Most of them have enough internal memory to run LuaJIT and they're quite popular due to their low cost.

This needs a DynASM Thumb2 port, a port of the interpreter and a new JIT compiler backend.

Beginnings of this can be found at: https://github.com/tcr/luajit
But this work has apparently been abandoned and never submitted upstream. It's interwoven with unrelated support code for their JS to Lua transpiler. It looks incomplete, too. It needs significant cleanup and/or rewriting.

Better pointer hash

The pointer hash leaves something to be desired. However the constraints on acceptable hash functions are very strict. Not just for performance, but also for instruction space, since this is inlined in lots of places, including JIT-compiled code.

Changing the pointer hash affects: performance (hah!), inlined hash constants in C switch statements, all JIT backends.

There have been previous discussions on the mailing list with (so-so) benchmarks. TODO: gather all links.
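For reference, one common candidate shape is a multiplicative (Fibonacci) hash of the pointer bits -- cheap enough to inline, though whether it actually beats the current hash is exactly what the benchmarking above would have to show. This is a sketch, not LuaJIT's hash:

```c
#include <stdint.h>

/* Candidate sketch (not LuaJIT's current pointer hash): multiplicative
** "Fibonacci" hashing. The constant is 2^64 divided by the golden
** ratio; the multiply mixes the always-zero alignment bits of the
** pointer into the top bits, from which the table index is taken.
** Small instruction footprint: one multiply, one shift.
*/
static uint32_t ptrhash(const void *p, int hbits)
{
  uint64_t u = (uint64_t)(uintptr_t)p;
  return (uint32_t)((u * 0x9E3779B97F4A7C15ull) >> (64 - hbits));
}
```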

FFI: Compile bit field access

Handling bit fields in C structs is notoriously complicated. The memory layout is target-specific and compiler-specific. Most of that has been taken care of, but it hasn't been thoroughly tested for compatibility with the system C compiler.

Bit field access is currently NYI in the compiler frontend. There are some complications with unaligned accesses or split accesses. But one could implement the easier parts of that first.

Also, currently only fields of max. 32 bits are handled. Larger fields are not mandated by the standard, but this is a common C compiler extension.
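What a compiled bit field read within a single aligned word boils down to can be sketched as a shift-and-mask (assuming the common little-endian System V layout; the actual layout is compiler-specific, as noted above):

```c
#include <stdint.h>

/* Sketch of compiled bit field access for a field of `width` bits at
** bit position `pos` inside an aligned 32 bit word (little-endian
** System V layout assumed). The hard NYI cases -- unaligned and split
** accesses -- are exactly what this sketch avoids.
*/
static uint32_t bf_get_u(uint32_t word, int pos, int width)
{
  uint32_t mask = (width == 32) ? ~0u : ((1u << width) - 1);
  return (word >> pos) & mask;
}

/* Signed fields additionally need sign extension: shift the field up
** to the top of the word, then shift back down arithmetically.
*/
static int32_t bf_get_s(uint32_t word, int pos, int width)
{
  return (int32_t)(word << (32 - pos - width)) >> (32 - width);
}
```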

Change docs for minimum compiler requirements

E.g. GCC 4.2 is the minimum to build cleanly right now. This is a side-effect of the __USE_ISOC99 guard for exp2() and log2() in the header files of newer Linux distros. Ancient GCC versions use gnu89 mode by default.

Note that explicitly setting gnu99 or c99 mode is not desired, since this causes collateral damage for some older distros.

Metatable/__index specialization

Accesses to metatables and __index tables with constant keys are already specialized by the JIT compiler to use optimized hash lookups (HREFK). This is based on the assumption that individual objects don't change their metatable (once assigned) and that neither the metatable nor the __index table are modified. This turns out to be true in practice, but those assumptions still need to be checked at runtime, which can become costly for OO-heavy programming.

Further specialization can be obtained by strictly relying on these assumptions and omitting the related checks in the generated code. In case any of the assumptions are broken (e.g. a metatable is written to), the previously generated code must be invalidated or flushed.

Different mechanisms for detecting broken assumptions and for invalidating the generated code should be evaluated.

This optimization works at the lowest implementation level for metatables in the VM. It should equally benefit any code that uses metatables, not just the typical frameworks that implement a class-based system on top of it.

FFI: Add user-definable intrinsics

This is a low-level equivalent to GCC inline assembler: given a C function declaration and a machine code template, an intrinsic function (builtin) can be constructed and later called. This allows generating and executing arbitrary instructions supported by the target CPU. The JIT compiler inlines the intrinsic into the generated machine code for maximum performance.

Developers usually shouldn't need to write machine code templates themselves. Common libraries of intrinsics for different purposes should be provided or contributed by experts.

TODO: Add previous discussions from the mailing list.

FFI: Handle all constant declarations

The C parser and the internal constant expression evaluator are quite limited, right now. Basically to max. 32 bit integers.

Handling 64 bit integer constants is probably the most important issue. Doubles and floats are less important. There are some missing cases for string constants, too.

Drop tuning for ancient x86 Intel Atom CPUs

This is mainly about the JIT_F_LEA_AGU flag. This involves some tuning for deficiencies of these CPUs wrt. AGU <-> ALU forwards.

Actually, this only applies to really old Intel Atom CPUs, that were soldered into netbooks around 2008 (remember those?). AFAIK the newer CPUs under the Atom brand have very different code tuning requirements, which are more in line with the desktop CPUs (i.e. don't worry, just emit plain and simple machine code).

One should evaluate whether the AGU/ALU issue is truly gone (delve into Agner docs) and then ruthlessly rip out all of the special cases.

Replace more builtins with bytecode

More builtins should be replaced with embedded bytecode. Embedded bytecode can be traced easily, which would eliminate most remaining NYI cases. It has other advantages: more inlining opportunities, code compactness, allows yielding, etc.

The main obstacle is that some of these builtins use features that cannot be represented in the current bytecode. There are already some internal bytecodes for type checking, like BC_ISTYPE or optimized variants of bytecode, like BC_TGETR. Adding more bytecodes requires some careful design considerations -- it doesn't make sense to add a complicated bytecode just for a single builtin.

This issue should be used as the master issue for listing and tracking individual builtins. A list and a quick evaluation of the feasibility would be a first step.

Reorganize internal buffer management

There are some infelicities with the current internal buffer management in lj_buf.[hc] and the compiler frontend.

One would like to reuse the one, global temporary buffer G(L)->tmpbuf for almost everything. But there are situations where one could really use secondary buffers. The compiler carefully avoids these cases right now, but that misses some optimization opportunities.

An internal acquire/release-style API might work. But it needs to be efficient for the typical case, where only a single buffer is needed.

Cannot submit PR from "legacy" Github forks

I discovered that I cannot send Pull Requests to the LuaJIT/luajit repository because my existing Github forks of LuaJIT come from a different "fork network" i.e. the LuaDist/luajit mirror repo.

Looks like 68 people have forked the LuaDist/luajit repository and everybody is going to have the same problem.

Seems like a general problem on Github when there is an unofficial-but-popular mirror repository and then the canonical repository later joins under a new account.

Github support feedback:

The only solution here would be for people to re-fork from the new canonical repository. It's unfortunate that the new canonical repository wasn't created in the same fork network as the original, which would have meant that we could change the root repository in that fork network.

The downside of that is that I need to destroy the local history of my Github forks in the process i.e. issues and pull requests. That is not the end of the world but it is a nasty surprise that others may also want to be aware of and deal with sooner rather than later.

I am looking for creative solutions... ideas welcome.
