
Investigate interpreter dispatch methods #97

Closed

jserv opened this issue Dec 18, 2022 · 19 comments

@jserv commented Dec 18, 2022

It would still make sense to consolidate the existing interpreter as the foundation of tiered compilation before we actually develop a JIT compiler (#81). See A look at the internals of 'Tiered JIT Compilation' in .NET Core for context. Although #95 uses tail-call optimization (TCO) to reduce interpreter dispatch cost, we still need to investigate several interpreter dispatch techniques before deciding how to move forward with further performance improvements and code maintenance.

The author of wasm3 provides an interesting project, interp, which implements the methods listed below.

Preliminary experiments on an Intel Xeon CPU E5-2650 v4 @ 2.20GHz with bench:

[ Calls Loop ]

time                 2.782 s    (1.765 s .. 3.482 s)
                     0.985 R²   (0.949 R² .. 1.000 R²)
mean                 2.743 s    (2.623 s .. 2.903 s)
std dev              167.7 ms   (43.46 ms .. 225.4 ms)
variance introduced by outliers: 19% (moderately inflated)

[ Switching ]

time                 2.430 s    (2.135 s .. 2.684 s)
                     0.998 R²   (0.994 R² .. 1.000 R²)
mean                 2.550 s    (2.461 s .. 2.682 s)
std dev              135.7 ms   (23.52 ms .. 176.6 ms)
variance introduced by outliers: 19% (moderately inflated)

[ Direct Threaded Code ]

time                 2.058 s    (1.242 s .. 2.725 s)
                     0.974 R²   (0.964 R² .. 1.000 R²)
mean                 1.756 s    (1.571 s .. 1.920 s)
std dev              191.4 ms   (108.1 ms .. 268.4 ms)
variance introduced by outliers: 23% (moderately inflated)

[ Token (Indirect) Threaded Code ]

time                 1.912 s    (1.376 s .. 3.088 s)
                     0.957 R²   (0.931 R² .. 1.000 R²)
mean                 1.564 s    (1.456 s .. 1.762 s)
std dev              193.0 ms   (12.64 ms .. 237.9 ms)
variance introduced by outliers: 23% (moderately inflated)

[ Tail Calls ]

time                 1.414 s    (1.027 s .. 1.736 s)
                     0.987 R²   (0.985 R² .. 1.000 R²)
mean                 1.131 s    (1.020 s .. 1.239 s)
std dev              139.4 ms   (2.226 ms .. 168.8 ms)
variance introduced by outliers: 23% (moderately inflated)

[ Machine Code Inlining ]

time                 344.6 ms   (57.24 ms .. 478.0 ms)
                     0.923 R²   (NaN R² .. 1.000 R²)
mean                 383.3 ms   (342.6 ms .. 412.6 ms)
std dev              42.76 ms   (23.80 ms .. 53.86 ms)
variance introduced by outliers: 23% (moderately inflated)
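
For reference, the [ Switching ] variant above corresponds to the classic C dispatch loop. A minimal sketch with a toy opcode set (the opcodes are invented for illustration):

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_DEC, OP_PRINT, OP_HALT };

/* Classic switch-based dispatch: every instruction funnels through
 * a single shared load and indirect branch at the top of the loop. */
static void run(const uint8_t *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:   acc++; break;
        case OP_DEC:   acc--; break;
        case OP_PRINT: printf("%d\n", acc); break;
        case OP_HALT:  return;
        }
    }
}

int main(void)
{
    const uint8_t prog[] = {OP_INC, OP_INC, OP_PRINT, OP_HALT};
    run(prog); /* prints 2 */
    return 0;
}

The single shared dispatch point is exactly the diamond-shaped control flow analyzed in the next comment.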

After #95 is merged, our concerns are:

  • efficient interpretation;
  • the flexibility to switch between JIT compilation and interpretation;
  • minimal impact on the current codebase.


@jserv commented Dec 21, 2022

Mike Pall, creator of LuaJIT, explained how to write a fast interpreter by examining the control-flow graph each dispatch technique presents to the compiler.

The control-flow graph of an interpreter with C switch-based dispatch looks like this:

      .------.
      V      |
      |      |
      L      |  L = instruction load
      D      |  D = instruction dispatch
   / /|\ \   |
  / / | \ \  |
  C C C C C  |  C = operand decode
  X X X X X  |  X = instruction execution
  \ \ | / /  |
   \ \|/ /   |
      |      |
      V      |
      `------'

Each individual instruction execution looks like this:

  ......V......
  :X    |     :
  :     |\ \  :
  :     F S S :  F = fast path
  :     |/ /  :  S = slow paths
  :     |     :
  :.....V.....:

We're talking here about dozens of instructions and hundreds of slow paths. The compiler has to deal with the whole mess and gets into trouble:

  • Diamond-shaped control-flow is known to be the worst-case scenario for most optimizations and for register allocation.
    Nested diamond-shaped control-flow is even worse.
  • The compiler doesn't have enough hints to tell which are the fast paths and which are the slow paths. Even if you could tell it, it still sees this as a single giant control-flow graph.
    Anything in this loop could potentially influence anything else, so almost nothing can be hoisted or eliminated. The slow paths kill the opportunities for the fast paths, and the complex instructions kill the opportunities for the simpler instructions.
  • The standard register allocation heuristics fail at this scale, so the compiler has trouble keeping important variables in
    registers.

We can use a direct or indirect-threaded interpreter even in C, e.g. with the computed 'goto &' feature of GCC:

  * * * * *
  | | | | |
  C C C C C    C = operand decode
  X X X X X    X = instruction execution
  L L L L L    L = next instruction load
  D D D D D    D = next instruction dispatch
  | | | | |
  V V V V V

This effectively replicates the load and the dispatch, which helps the CPU branch predictors. But it has its own share of problems:

  • There's no loop the compiler could recognize. And all of those gotos can jump to pretty much anywhere in the code. Therefore the compiler is unable to hoist anything, because there will be a slow path where an aliasing store kills all opportunities.
  • The register allocator can only treat each of these segments separately and will do a really bad job. There's just no way to give it a goal function like "I want the same register assignment before each goto".
  • Tail-merging and CSE will happily join all these common tails of each instruction and generate a single dispatch point. Ick. You can try to disable some optimizations for this piece of code, but this will negatively impact all paths.
  • There's a point where you start to fight the compiler and this is a game you cannot win.

If you write an interpreter loop in assembler, you can do much
better:

  • Keep a fixed register assignment for all instructions.
  • Keep everything in registers for the fast paths. Spill/reload only in the slow paths.
  • Move the slow paths elsewhere, to help with I-Cache density.
  • Pre-load instructions and pre-decode operands.

Here's how this would look:

  *  *  *  *  *
  |  |  |  |  |
  C  C  C  C  C    C = partial operand decode for this instruction
  F> F> F> F> F>   F = fast path, > = exit to slow path
  L  L  L  L  L    L = next instruction load
  C  C  C  C  C    C = partial operand decode for the next instruction
  D  D  D  D  D    D = next instruction dispatch
  |  |  |  |  |
  V  V  V  V  V

You can get this down to just a few machine code instructions.
LuaJIT's interpreter is fast because it is hand-written in assembler along exactly these lines.

@jserv commented Dec 21, 2022

Fast VMs without assembly - speeding up the interpreter loop: threaded interpreter, Duff's device, JIT, Nostradamus distributor, by the author of the Bochs x86 emulator.

@jserv commented Dec 24, 2022

Virtual Machine Dispatch Experiments in Rust

Computed gotos or tail calls may give a worthwhile advantage on older or low-power architectures when implementing an FSM or a VM dispatch loop. There are a lot of these around, ARM processors being ubiquitous. The performance improvement over a single match statement could be up to 20%.
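
A C counterpart of the tail-call approach can be sketched as follows (toy opcodes; the musttail attribute is clang-specific, and without it you rely on the optimizer's sibling-call optimization at -O2):

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_PRINT, OP_HALT };

typedef void handler_fn(const uint8_t *code, int acc);
static handler_fn op_inc, op_print, op_halt;
static handler_fn *const table[] = {op_inc, op_print, op_halt};

/* Each handler tail-calls the next one through the table, so
 * dispatch becomes a single indirect jump and the hot state
 * stays in argument registers. */
#define NEXT(code, acc) \
    __attribute__((musttail)) return table[*(code)]((code) + 1, (acc))

static void op_inc(const uint8_t *code, int acc) { NEXT(code, acc + 1); }

static void op_print(const uint8_t *code, int acc)
{
    printf("%d\n", acc);
    NEXT(code, acc);
}

static void op_halt(const uint8_t *code, int acc)
{
    (void) code;
    (void) acc; /* execution ends here */
}

int main(void)
{
    const uint8_t prog[] = {OP_INC, OP_INC, OP_PRINT, OP_HALT};
    table[prog[0]](prog + 1, 0); /* prints 2 */
    return 0;
}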

@jserv commented Jan 6, 2023

JamVM was an efficient interpreter-only Java virtual machine that used a code-copying technique.

@qwe661234 commented Jan 7, 2023

Our experimental results show that the greater the number of instructions in a basic block, the greater the impact of TCO. We also discovered that when any branch instruction ends a basic block, blocks contain only a few instructions. To enlarge the number of instructions per block, we relax the strict definition of a basic block and end a block only at a jump or call instruction rather than at any branch; a sketch of this rule follows. The related implementation is in the branch wip/enlarge_insn_in_block.
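
A minimal sketch of the relaxed rule, assuming uncompressed RV32 encodings (the helpers are illustrative, not the actual rv32emu decoder):

#include <stdbool.h>
#include <stdint.h>

/* Strict rule: any branch (opcode 0x63) ends a basic block.
 * Relaxed rule: decode across conditional branches and end the
 * block only at an unconditional jump or call. */
static bool insn_ends_block(uint32_t insn)
{
    uint32_t opcode = insn & 0x7f;
    return opcode == 0x6f || /* JAL  */
           opcode == 0x67;   /* JALR */
}

/* Scan forward from pc and return the address of the last
 * instruction of the enlarged block. */
static uint32_t find_block_end(const uint32_t *mem, uint32_t pc)
{
    while (!insn_ends_block(mem[pc / 4]))
        pc += 4;
    return pc;
}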

CoreMark results (iterations per second)

Model         Compiler   TCO       Enlarged basic block   Speedup
Core i7-8700  clang-15   971.951   1035.826899            +6.6%
Core i7-8700  gcc-12     963.336   1123.895132            +16.6%
eMAG 8180     clang-15   335.396   383.819427             +14.4%
eMAG 8180     gcc-12     332.561   374.303071             +12.5%

Comparison of the number of instructions per basic block

Model: Core i7-8700, Compiler: clang-15

  • Before
Instruction |        Count        |       Invoked Times  
          1 |       257 [ 14.91 %]|        67811665 [  0.00 %]
          2 |       326 [ 18.91 %]|       128953458 [  0.00 %]
          3 |       287 [ 16.65 %]|       113370334 [  0.00 %]
          4 |       150 [  8.70 %]|        28792048 [  0.00 %]
          5 |       178 [ 10.32 %]|        87276451 [  0.00 %]
          6 |       102 [  5.92 %]|        42771905 [  0.00 %]
          7 |        74 [  4.29 %]|  94368206793558 [100.00 %]
          .
          .
          .
  • After
Instruction |        Count        |       Invoked Times 
          0 |         0 [  0.00 %]|               0 [  0.00 %]
          1 |        43 [  3.48 %]|          221728 [  0.02 %]
          2 |        76 [  6.15 %]|        46721545 [  4.35 %]
          3 |       109 [  8.83 %]|         7935500 [  0.74 %]
          4 |        78 [  6.32 %]|        29186819 [  2.72 %]
          5 |        79 [  6.40 %]|        24730008 [  2.30 %]
          6 |        66 [  5.34 %]|        98268483 [  9.15 %]
          7 |        54 [  4.37 %]|        93258283 [  8.68 %]
          8 |        36 [  2.91 %]|        15350793 [  1.43 %]
          9 |        42 [  3.40 %]|         1424689 [  0.13 %]
         10 |        32 [  2.59 %]|         9619665 [  0.90 %]
         11 |        41 [  3.32 %]|         1017886 [  0.09 %]
         12 |        32 [  2.59 %]|        63336493 [  5.90 %]
         13 |        36 [  2.91 %]|        27690567 [  2.58 %]
         14 |        39 [  3.16 %]|        87729161 [  8.17 %]
         15 |        27 [  2.19 %]|        16175213 [  1.51 %]
         16 |        32 [  2.59 %]|         2152239 [  0.20 %]
         17 |        23 [  1.86 %]|        41594781 [  3.87 %]
         18 |        31 [  2.51 %]|        23869232 [  2.22 %]
         19 |        25 [  2.02 %]|        91090239 [  8.48 %]
         20 |        21 [  1.70 %]|         1145062 [  0.11 %]
         21 |        23 [  1.86 %]|          331961 [  0.03 %]
         22 |        13 [  1.05 %]|        10018980 [  0.93 %]
         23 |        13 [  1.05 %]|        10111048 [  0.94 %]
         24 |        12 [  0.97 %]|            2596 [  0.00 %]
         25 |        12 [  0.97 %]|        19443780 [  1.81 %]
         26 |         5 [  0.40 %]|              17 [  0.00 %]
         27 |        16 [  1.30 %]|        28597552 [  2.66 %]
         28 |        12 [  0.97 %]|        28894686 [  2.69 %]
         29 |         6 [  0.49 %]|         2834075 [  0.26 %]
         30 |        11 [  0.89 %]|          124481 [  0.01 %]
         31 |         3 [  0.24 %]|              10 [  0.00 %]
         32 |        13 [  1.05 %]|           49620 [  0.00 %]
         33 |         7 [  0.57 %]|          935059 [  0.09 %]
         34 |         4 [  0.32 %]|         7939518 [  0.74 %]
         35 |         5 [  0.40 %]|         4060710 [  0.38 %]
         36 |         6 [  0.49 %]|         3733205 [  0.35 %]
         37 |         2 [  0.16 %]|              12 [  0.00 %]
         38 |         6 [  0.49 %]|              75 [  0.00 %]
         39 |         4 [  0.32 %]|           93365 [  0.01 %]
         40 |         3 [  0.24 %]|            4242 [  0.00 %]
         41 |         8 [  0.65 %]|         6495489 [  0.60 %]
         42 |        15 [  1.21 %]|             494 [  0.00 %]
         43 |         6 [  0.49 %]|       182707629 [ 17.01 %]
         .
         .
         .
        

@jserv commented Jan 8, 2023

> JamVM was an efficient interpreter-only Java virtual machine that used a code-copying technique.

I reworked JamVM to make it work with OpenJDK 8. See the revised jamvm, which supports both x86-64 and AArch64 on GNU/Linux.

@jserv commented Jan 10, 2023

Web49 claims to be a faster WebAssembly interpreter, using "one big function" with computed goto. See the HackerNews discussion.

@qwe661234 commented:

This discussion of Web49 describes some problems with threaded-code jumps and their solutions. The first problem is poor register allocation, which is also mentioned in Re: Suggestions on implementing an efficient instruction set simulator in LuaJIT2:

> The register allocator can only treat each of these segments separately and will do a really bad job. There's just no way to give it a goal function like "I want the same register assignment before each goto".

The first solution mentioned in the discussion is to group instructions that use the same registers into their own functions (arithmetic expressions tend to generate such sequences).

The second limitation of computed gotos is the inability to derive the address of a label from outside the function. You always end up with some superfluous conditional code for selecting the address inside the function, or with indexing through a table.

One solution from the discussion is to export goto labels directly using inline assembly. Furthermore, inline assembly can now represent control flow, so you can define the labels in inline assembly and the computed jump at the end of an opcode. That is fairly robust against compiler transforms.
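
A sketch of the table-indexing workaround in GNU C (toy opcodes; not Web49's actual code): the interpreter exposes its label table through a query mode, and a translator then builds threaded code out of real label addresses.

#include <stddef.h>
#include <stdint.h>

enum { OP_NOP, OP_HALT, OP_COUNT };

/* Labels are invisible outside the function, so calling the
 * interpreter with code == NULL returns the label table instead
 * of executing anything. */
static const void **interp(const void **code)
{
    static const void *labels[OP_COUNT] = {
        [OP_NOP] = &&do_nop,
        [OP_HALT] = &&do_halt,
    };
    if (!code)
        return labels; /* query mode */
    goto *code[0];
do_nop:
    code++;
    goto *code[0];
do_halt:
    return NULL;
}

/* Translate opcodes into directly threaded code. */
static void translate(const uint8_t *ops, const void **out, size_t n)
{
    const void **labels = interp(NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = labels[ops[i]];
}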

@qwe661234 commented:

> JamVM was an efficient interpreter-only Java virtual machine that used a code-copying technique.

To investigate the machine-code-inlining dispatch method, I intend to follow the code-copying technique used in JamVM and rewrite the dispatch function of rv32emu.
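
The core of the technique, sketched under stated assumptions (this is not JamVM's actual code): each handler's machine code, delimited by two label addresses taken inside the interpreter with GCC's &&label extension, is copied into an executable buffer, and the handlers of a basic block are concatenated so the block runs with no dispatch in between. This assumes the handler bodies are position-independent and kept contiguous by the compiler; both assumptions are fragile, which is exactly why reusing copied pages is tricky.

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* One handler's machine code, delimited by two label addresses. */
typedef struct {
    void *start, *end;
} handler_range_t;

/* Concatenate the machine code of a block's handlers into a
 * freshly mapped executable buffer; the caller jumps to the
 * returned address to execute the whole block. */
static void *copy_block(const handler_range_t *h, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uintptr_t) h[i].end - (uintptr_t) h[i].start;

    uint8_t *buf = mmap(NULL, total, PROT_READ | PROT_WRITE | PROT_EXEC,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    uint8_t *p = buf;
    for (size_t i = 0; i < n; i++) {
        size_t len = (uintptr_t) h[i].end - (uintptr_t) h[i].start;
        memcpy(p, h[i].start, len);
        p += len;
    }
    return buf;
}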

@jserv commented Jan 11, 2023

Superinstructions are a well-known technique for improving interpreter performance: they eliminate jumps between virtual machine operations (interpreter dispatch) and enable more optimizations in the merged code. Quoting Towards Superinstructions for Java Interpreters:

  • combining the source code for instructions together exposes a larger “window” of code to the C compiler, which
    allows greater opportunities for optimisation.
  • adding quick instructions to the Kaffe interpreter could speed it up by almost a factor of three.
  • A problem with quick instructions is that they make it difficult to replace sequences of instructions with superinstructions. No instruction that will be replaced with another instruction at run time can be placed in a superinstruction, since that would involve replacing the entire superinstruction.
  • some superinstructions are being introduced into these commonly used methods, giving a significant performance boost. The benchmark that benefits most from superinstructions is mpegaudio, with a maximum speedup of about 1.56. It is interesting to note that this benefit is not substantial until 256 superinstructions are introduced.
  • The most important future development will be a scheme to allow “quickable” instructions to participate in superinstructions. Many of the most frequently executed Java instructions such as field access (16.4% of executed instructions in SPECjvm98) and method invokes (5.7%) are “quickable”.
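
The idea in miniature, reusing a toy opcode set (OP_INC_PRINT is an invented superinstruction): a loader pass rewrites each adjacent OP_INC, OP_PRINT pair into the fused opcode, removing one dispatch and letting the compiler optimize across the merged bodies.

#include <stdint.h>
#include <stdio.h>

enum { OP_INC, OP_PRINT, OP_INC_PRINT, OP_HALT };

static void run(const uint8_t *code)
{
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:       acc++; break;
        case OP_PRINT:     printf("%d\n", acc); break;
        case OP_INC_PRINT: acc++; printf("%d\n", acc); break; /* fused */
        case OP_HALT:      return;
        }
    }
}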

See also: Threaded code and quick instructions for Kaffe. Benchmark using the SPEC Client98 benchmark suite; configuration: Pentium 130 (no L2 cache); seconds needed for 10 runs (real):

        check   mtrt    jess    compress   db     mpegaudio   jack    javac
intrp   0.65    529.9   192.5   1035.1     85.8   803.4       442.1   141.5
quick   0.35    163.6   51.4    282.5      26.7   358.5       156.8   42.9

US Patent USRE36204E, "Method and apparatus for resolving data references in generated code", filed by Sun Microsystems, has expired.

@qwe661234 commented:

This branch is for investigating code-copying dispatch. However, there are still some issues: we cannot yet reuse a copied page during emulation. Without reusing copied pages, it passes some fundamental tests such as hello.elf, coremark.elf, and puzzle.elf.

@jserv commented Jan 14, 2023

> This branch is for investigating code-copying dispatch. However, there are still some issues: we cannot yet reuse a copied page during emulation. Without reusing copied pages, it passes some fundamental tests such as hello.elf, coremark.elf, and puzzle.elf.

Can you use GCC's __attribute__((always_inline)) to forcibly inline functions rather than introducing function-like macros? The former is better for type checking.
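
For example (a sketch; the actual insn_is_misaligned in rv32emu may differ):

#include <stdbool.h>
#include <stdint.h>

/* always_inline forces the body into every caller while keeping
 * the type checking that a function-like macro would lose. */
static inline __attribute__((always_inline))
bool insn_is_misaligned(uint32_t pc)
{
    return pc & 0x3; /* 4-byte alignment; ignores the C extension */
}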

@qwe661234 commented:

> This branch is for investigating code-copying dispatch. However, there are still some issues: we cannot yet reuse a copied page during emulation. Without reusing copied pages, it passes some fundamental tests such as hello.elf, coremark.elf, and puzzle.elf.

> Can you use GCC's __attribute__((always_inline)) to forcibly inline functions rather than introducing function-like macros? The former is better for type checking.

After testing, only the function insn_is_misaligned needs to become a function-like macro. I attempted to forcibly inline insn_is_misaligned with GCC's __attribute__((always_inline)); however, it fails because calling a function from within the copied code can push an incorrect return address.

Using inline assembly to push the right return address and then jump to insn_is_misaligned is another option, but I don't think it's a good one.

@qwe661234 commented:

We can reuse the copied page when emulating some fundamental tests with the most recent commit on this branch, but some tests still fail.

There is an important issue with the memory page size: some basic blocks in arch-test are so large that they require approximately 95537 bytes of memory. The problem could be solved by increasing the page size beyond 95537 bytes, but that wastes memory, since most basic blocks need less than 8152 bytes.

@qwe661234 commented:

> We can reuse the copied page when emulating some fundamental tests with the most recent commit on this branch, but some tests still fail.

> There is an important issue with the memory page size: some basic blocks in arch-test are so large that they require approximately 95537 bytes of memory. The problem could be solved by increasing the page size beyond 95537 bytes, but that wastes memory, since most basic blocks need less than 8152 bytes.

The problem was resolved by roughly estimating how many memory pages a basic block needs before allocating them.
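
A sketch of such an estimate (PAGE_SIZE and the per-instruction bound are illustrative values):

#include <stddef.h>

#define PAGE_SIZE          4096
#define MAX_INSN_COPY_SIZE 64 /* assumed upper bound per instruction */

/* Overestimate the copied-code footprint of a block from its
 * instruction count, then round up to whole pages, so the block
 * never outgrows its allocation. */
static size_t pages_needed(size_t n_insn)
{
    size_t bytes = n_insn * MAX_INSN_COPY_SIZE;
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}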

@qwe661234 commented Feb 10, 2023

This branch is used to investigate code-copying dispatch; with the latest commit, builds compiled with clang-15 and gcc-12 now pass all arch-tests on an x86 machine. However, it still has some issues on an aarch64 machine. Worse, code-copying dispatch performs worse than TCO, as the comparison below shows.

Model         Compiler   a304446 (TCO)   0e1333a (code-copying)   Speedup
Core i7-8700  clang-15   971.951         715.869                  -26.3%
Core i7-8700  gcc-12     963.336         672.198                  -30.0%

In the latest commit, I converted all functions called from the copied code into function pointers stored in the rv structure, so that the copied code computes relative addresses correctly.
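
Roughly like this (a sketch; field and function names are illustrative, not the actual rv32emu layout):

#include <stdbool.h>
#include <stdint.h>

typedef struct riscv riscv_t;

/* Direct calls embedded in copied code would need relocation,
 * because call instructions encode PC-relative offsets.  Routing
 * helpers through function pointers stored in the VM structure
 * turns them into indirect calls, which copy verbatim. */
struct riscv {
    uint32_t PC;
    uint32_t X[32];
    /* helpers reachable from copied code */
    uint32_t (*mem_read_w)(riscv_t *rv, uint32_t addr);
    void (*mem_write_w)(riscv_t *rv, uint32_t addr, uint32_t data);
    bool (*insn_is_misaligned)(riscv_t *rv, uint32_t pc);
};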

@jserv commented Feb 10, 2023

> This branch is used to investigate code-copying dispatch; with the latest commit, builds compiled with clang-15 and gcc-12 now pass all arch-tests on an x86 machine. However, it still has some issues on an aarch64 machine. Worse, code-copying dispatch performs worse than TCO, as the comparison below shows.

It looks reasonably fine, given that there are many small blocks. Next, extend the scope of a block by changing how blocks are separated.

@jserv commented Feb 26, 2023

The Cacao interpreter comes with several novel research papers in addition to the open-source work.
See also: Interpreter Research

@jserv commented May 27, 2023

Closing in favor of the baseline JIT compiler.

jserv closed this as completed May 27, 2023