Use portable JIT compilation for accelerating RISC-V emulation #81

Open
jserv opened this issue Nov 4, 2022 · 25 comments
Labels: research (Study certain topics)

jserv commented Nov 4, 2022

rv8 demonstrates how RISC-V instruction emulation can benefit from JIT compilation and aggressive optimizations. However, it is dedicated to x86-64, which makes it hard to support other host architectures, such as Apple M1 (AArch64). SFUZZ is a high-performance fuzzer that uses RISC-V to x86 binary translation together with modern fuzzing techniques. RVVM is another example, implementing a tracing JIT.

The goal of this task is to utilize an existing JIT framework as a new abstraction layer while accelerating RISC-V instruction execution. In particular, we would:

  1. Avoid direct machine code generation. Instead, perform most operations at the intermediate representation (IR) level.
  2. Perform common optimization techniques such as peephole optimization. ria-jit does excellent work with regard to such optimizations. See src/gen/instr/patterns.c and the MEMZERO and MEMCOPY instructions proposal.
  3. Use a high-level but still efficient IR. MIR is an interesting implementation, which allows using a subset of C11 as the IR. SFUZZ's note Code Generation is worth reading. ETISS (Extendable Translating Instruction Set Simulator) translates binary instructions into C code and appends the translated code to a block, which is compiled and executed at runtime. As its name suggests, it is extendable, supporting many levels of customization through a plug-in mechanism.
  4. Ensure shorter startup time. This can be achieved by means of a lightweight JIT framework and AOT compilation.

The high-level operation of JIT compilation can be summed up as follows:

  • Look in a hash map for a code block matching the current PC
  • if a block is found
    • execute this block
  • if a block is not found
    • allocate a new block
    • invoke the translator for this block
    • insert it into the hash map
    • execute this block
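The lookup-translate-execute loop above can be sketched in C. The block structure and the direct-mapped cache standing in for the hash map are assumptions for illustration, not rv32emu's actual design:

```c
/* Sketch of the dispatch loop: look up the block for the current PC,
 * translating it on a miss, then execute it. */
#include <assert.h>
#include <stdint.h>

#define CACHE_BITS 10
#define CACHE_SIZE (1u << CACHE_BITS)

typedef struct {
    uint32_t pc;      /* start address of the translated basic block */
    int valid;
    int (*run)(void); /* entry point of the translated code */
} block_t;

static block_t cache[CACHE_SIZE];
static int translations; /* counts translator invocations, for illustration */

static int stub_run(void) { return 0; }

/* Stand-in for the translator: a real JIT would emit code here. */
static void translate(block_t *b, uint32_t pc)
{
    b->pc = pc;
    b->valid = 1;
    b->run = stub_run;
    translations++;
}

static block_t *block_find_or_translate(uint32_t pc)
{
    block_t *b = &cache[(pc >> 2) & (CACHE_SIZE - 1)];
    if (!b->valid || b->pc != pc)
        translate(b, pc); /* miss: translate and insert */
    return b;             /* hit: reuse the cached block */
}
```

Looking up the same PC twice translates only once; the second hit goes straight to the cached block, which is where the speedup comes from.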

Every block comes to an end once a branch instruction has been translated, since translation occurs at the basic-block level. This leaves room for further optimization passes over the generated code.

We gain speed from this technique for the reasons listed below:

  • No instruction fetch
  • No instruction decode
  • Immediate values are baked into translated instructions
    • Values of 0 can be optimized
  • register x0 can be optimized
    • No lookup required
    • Writes are discarded
  • Reduced emulation loop overhead
  • Blocks can be chained based on previous branch pattern for faster lookup
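The x0 special-casing can be illustrated with register-access helpers (a simplified model, not actual emulator code). In an interpreter these checks run on every access; in a JIT the same decision is made once, at translation time, so the generated code never touches x0 at all:

```c
/* Model of the x0 optimization: reads of x0 fold to the constant 0,
 * and writes to x0 are discarded. */
#include <assert.h>
#include <stdint.h>

static uint64_t regs[32];

static inline uint64_t read_reg(int r)
{
    return r ? regs[r] : 0; /* no lookup needed for x0 */
}

static inline void write_reg(int r, uint64_t v)
{
    if (r)
        regs[r] = v;        /* writes to x0 are discarded */
}
```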

Reference:


jserv commented Nov 17, 2022

Think about whether we want to load a 64-bit immediate under the present RISC-V specification. Although memory access is substantially slower, it is simpler to just load the constant from memory, since it requires fewer instructions.

In RISC-V:

    lui  t0, 0x01234
    lui  t1, 0x89ABC
    addi t0, t0, 0x567
    addi t1, t1, 0xDEF
    slli t0, t0, 32
    or   t0, t0, t1

24 bytes, 6 instructions, and 2 registers. (Note that addi sign-extends a 12-bit immediate, so 0xDEF is out of range as written and the neighboring values would need adjustment.)

In x86-64:

    movabs rax, 0x0123456789ABCDEF

10 bytes, 1 instruction, and 1 register.
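Because addi sign-extends its 12-bit immediate, splitting a constant into lui/addi parts needs a rounding adjustment whenever bit 11 of the constant is set. A sketch of that split (function names are made up for illustration):

```c
/* Split a 32-bit constant into the upper-20 part for lui and the
 * sign-extended lower-12 part for addi. When bit 11 is set, addi's
 * sign extension subtracts 0x1000, so lui must load one unit more. */
#include <assert.h>
#include <stdint.h>

static void split_imm32(uint32_t v, uint32_t *hi20, int32_t *lo12)
{
    int32_t lo = v & 0xFFF;
    if (lo >= 0x800)
        lo -= 0x1000;                       /* addi sign-extends bit 11 */
    *lo12 = lo;
    *hi20 = ((v - (uint32_t)lo) >> 12) & 0xFFFFF;
}

/* Reassemble as the hardware would: lui then addi (mod 2^32). */
static uint32_t materialize(uint32_t hi20, int32_t lo12)
{
    return (hi20 << 12) + (uint32_t)lo12;
}
```

For 0x89ABCDEF this yields lui 0x89ABD and addi -0x211, which is why the literal lui 0x89ABC / addi 0xDEF pair above cannot be encoded as written.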


Ma-Y commented Nov 17, 2022

It's not so apparent where the start and end of a particular code block are. Does this mean hard-coding pseudo patterns (common code-block patterns) for the binary to match on?


jserv commented Nov 23, 2022

blink is a virtual machine for running x86-64-linux programs on different operating systems and hardware architectures. Recently, it gained a JIT. Quoting from blink/jit.c:

This file implements an abstraction for assembling executable code at runtime. This is intended to be used in cases where it's desirable to have fast "threaded" pathways, between existing functions, which were compiled statically. We need this because virtual machine dispatching isn't very fast, when it's implemented by loops or indirect branches.
Modern CPUs go much faster if glue code without branches is outputted to memory at runtime, i.e. a small function that calls the functions.


jserv commented Nov 23, 2022

wasm3 is a fast WebAssembly interpreter without a JIT. In general, its strategy executes code only around 4-15x slower than compiled code on a modern x86 processor.

  • Reduce bytecode decoding overhead
    • Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
    • wasm3 trades some space for time. Opcodes map to up to 3 different operations, depending on the number of source operands and commutativity.
    • Commonly occurring sequences of operations can also be optimized into a "fused" operation. This sometimes results in improved performance.
    • In wasm3, the stack machine model is translated into a more direct and efficient "register file" approach.
  • Tightly Chained Operations

Tails is a minimal, fast Forth-like interpreter core. It uses no assembly code, only C++, but an elegant tail-recursion technique inspired by Wasm3 makes it nearly as efficient as hand-written assembly.
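The tail-call dispatch technique shared by wasm3 and Tails can be sketched in C with explicit continuation calls: each operation ends by invoking the next one through a function pointer, so an optimizing compiler can turn the chain into direct jumps with no central dispatch loop. This toy accumulator machine is an illustration, not wasm3's actual operation set:

```c
/* Tail-call threaded dispatch: each op does its work, then calls the
 * next op in the stream as its final action. */
#include <assert.h>
#include <stdint.h>

typedef struct op op_t;
typedef int64_t (*op_fn)(const op_t *pc, int64_t acc);

struct op {
    op_fn fn;     /* operation implementation */
    int64_t arg;  /* immediate operand baked into the stream */
};

/* Continue with the next operation in the stream. */
#define NEXT(pc, acc) ((pc)[1].fn(&(pc)[1], (acc)))

static int64_t op_add(const op_t *pc, int64_t acc)
{
    return NEXT(pc, acc + pc->arg);
}

static int64_t op_mul(const op_t *pc, int64_t acc)
{
    return NEXT(pc, acc * pc->arg);
}

static int64_t op_halt(const op_t *pc, int64_t acc)
{
    (void)pc;
    return acc; /* end of the chain: unwind with the result */
}

static int64_t run(const op_t *program, int64_t acc)
{
    return program->fn(program, acc);
}
```

With sibling-call optimization enabled, each `NEXT` compiles to a jump rather than a call, which is the heart of the "tightly chained operations" item above.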


jserv commented Nov 28, 2022

Benchmark results of rv8 (RV32 only, smaller is better):

Benchmark | Interpreter | JIT  | qemu-user | native-x86
--------- | ----------- | ---- | --------- | ----------
primes    | 34.33       | 1.75 | 1.55      | 1.00
miniz     | 40.19       | 1.95 | 2.68      | 1.00
SHA-512   | 76.23       | 3.40 | 3.78      | 1.00
AES       | 88.02       | 2.57 | 4.08      | 1.00
qsort     | 28.82       | 1.86 | 7.50      | 1.00
dhrystone | 128.97      | 2.72 | 12.04     | 1.00

@qwe661234 (Collaborator)

Our strategy for developing the JIT is to utilize Clang to generate optimized target code. The JIT compiler design is shown in Figure 1.
When a block's usage frequency exceeds a predetermined threshold, we trace its taken and untaken branches and use a code generator to convert the instruction sequence into C code. We then compile the C code with Clang and store the resulting machine code as a function in the code cache for future use.
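The code-generation step can be sketched as follows. The structures and label scheme are hypothetical stand-ins, not rv32emu's actual data structures; the point is the shape of the output, one C label per instruction with explicit gotos between them:

```c
/* Sketch: turn a decoded instruction sequence into C source text with
 * one label per instruction, suitable for handing to a C compiler. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct insn {
    uint32_t pc;      /* guest address of this instruction */
    const char *body; /* C statement(s) emulating it */
};

static void gen_block(char *out, size_t cap, const struct insn *seq, int n)
{
    size_t off = 0;
    for (int i = 0; i < n && off < cap; i++) {
        off += (size_t)snprintf(out + off, cap - off, "insn_%x:\n  %s\n",
                                seq[i].pc, seq[i].body);
        /* Fall through to the next instruction via an explicit goto. */
        if (i + 1 < n && off < cap)
            off += (size_t)snprintf(out + off, cap - off,
                                    "  goto insn_%x;\n", seq[i + 1].pc);
    }
}
```

The emitted text has the same shape as the generated code shown below for the Mandelbrot EBB; a real generator would also emit the function prologue, register-file accesses, and exit stubs.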

An example instruction sequence that is a hot spot in Mandelbrot is shown in Figure 2, along with the corresponding EBB (extended basic block). The generated code for this EBB is shown below.

  • Figure 1
  • Figure 2
  • Instruction sequence in Mandelbrot
10750: lw  a5,128(sp)
10754: beq a5,s10,0x10760
10758: add s1,s1,a0
1075c: j   0x10728
10760: mv  s8,a0
10764: sub s9,s1,s7
10768: bne s1,s7,0x10a0c
...
  • Generated Code
insn_10750:
  ...
  goto insn_10754;
insn_10754:
  ...
  if (...)
    goto insn_10760;
  goto insn_10758;
...

@qwe661234 (Collaborator)

Commit 36f304c implements the JIT strategy described above. The benchmark results below show that the JIT has a positive effect on benchmarks with long execution times. However, in the case of Mandelbrot, its short execution time means that the overhead of the JIT outweighs its benefits.

Test       | rv32emu (interpreter)   | rv32emu (JIT)           | Speedup
---------- | ----------------------- | ----------------------- | --------
CoreMark   | 1155.174 iterations/sec | 1796.483 iterations/sec | +55.5%
dhrystone  | 1282 DMIPS              | 2521 DMIPS              | +96.64%
nqueens    | 7766.80 msec            | 3893.81 msec            | +99.47%
mandelbrot | 32.28 msec              | 77.34 msec              | -139.59%

@qwe661234 (Collaborator)

We experimented with the same JIT strategy on top of two different compilers: Clang and MIR.

clang

Metric    | Interpreter-Only        | Just-in-Time Compiler   | Speedup
--------- | ----------------------- | ----------------------- | -------
CoreMark  | 1155.174 iterations/sec | 1796.483 iterations/sec | +55.5%
Dhrystone | 1282 DMIPS              | 2521 DMIPS              | +96.64%

mir

Metric    | Interpreter-Only        | Just-in-Time Compiler   | Speedup
--------- | ----------------------- | ----------------------- | -------
CoreMark  | 1155.174 iterations/sec | 2194.907 iterations/sec | +90%
Dhrystone | 1282 DMIPS              | 2522 DMIPS              | +96.7%

The issue with Clang is that we need to fork a Clang process, which incurs significant overhead; however, its ability to optimize code is strong. In contrast, launching MIR has relatively small overhead, but its optimization ability is weaker, and we cannot determine the size of the machine code compiled by MIR in advance. This limitation prevents us from using a code cache to manage machine code effectively.


jserv commented Jun 15, 2023

The preliminary baseline JIT compiler has landed in the wip/jit branch. It comes with some known issues, including Arm64 breakage. The design principle of the baseline JIT is to stick to the c2mir approach, where we can improve macro-operation fusion and/or RISC-V oriented optimizations.


qwe661234 commented Jul 4, 2023

Ongoing


jserv commented Jul 5, 2023

You should describe the details in #142 rather than here. In #142, we care about the feasibility of improving block-based execution by introducing a dominator tree.


jserv commented Jul 10, 2023

The author of RVVM discussed the design choices where it differs substantially from QEMU.

Performance-wise:

  • Instead of a static translate-and-run flow as in QEMU, RVVM has an interpret-trace-run execution loop, remotely similar to the JVM, which allows collecting data such as branch probabilities and hot loops, and optimizing better
  • Using the hardware host FPU instead of soft-float emulation; this is roughly 10x faster in some synthetic FPU benchmarks
  • Conscious decisions for beneficial trade-offs, like a fast-path JIT trace cache, a JIT IR streamlined for the "big ISA triad" (RISC-V, Arm64, x86-64), etc.

Infrastructure-wise:

  • A public library API for a lot of things: machine management (construct and run them from any program), device integration, registering new CPU instructions, userspace emulation
  • Subjectively, a leaner and cleaner codebase, except where that was traded away by 1) performance decisions to copy-paste or restructure things, or 2) the complexity of related things like JIT backend architectures
  • Portability. RVVM officially runs in WASM, and runs on Haiku, SerenityOS, KolibriOS, even DOS!


jserv commented Aug 2, 2023

lightrec is a MIPS-to-everything dynamic recompiler (aka JIT compiler or dynarec) for PlayStation emulators, using GNU Lightning as the code emitter. Features:

  • High-level optimizations. The MIPS code is first pre-compiled into a form of intermediate representation (IR): basically, just a singly-linked list of structures representing the instructions. Several optimization steps are performed on that list: instructions are modified, reordered, and tagged; new meta-instructions can also be added.
  • Lazy compilation. If lightrec detects a block of code that would be very hard to compile properly (e.g. a branch with another branch in its delay slot), the block is marked as not compilable and will always be emulated with the built-in interpreter. This keeps the code emitter simple and easy to understand.
  • Run-time profiling. The generated code gathers run-time information about I/O accesses (whether they hit RAM or hardware registers). The code generator then uses this information to generate direct reads/writes to the emulated memories, instead of jumping to C for every access.
  • Threaded compilation. When entering a loading zone, where a lot of code has to be compiled, we don't want the compilation process to slow down the pace of emulation. To avoid that, the code compiler can optionally run on a separate thread, and the main loop emulates not-yet-compiled blocks with the interpreter. This drastically reduces the stutter that typically happens when a lot of new code is run.
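The run-time profiling point above can be sketched in plain C. The memory map and function names are made up for illustration: the generic handler models a call out of JITed code into C, and the direct variant models what the recompiler can emit once profiling proves an access always hits RAM:

```c
/* Sketch: generic (profiled) memory access vs. the direct fast path a
 * recompiler can emit once an access site is known to always hit RAM. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_BASE 0x00000000u
#define RAM_SIZE 0x200000u

static uint8_t ram[RAM_SIZE];
static int slow_path_calls; /* instrumentation, for illustration */

/* Generic handler: range-checks every access; models a callback to C. */
static uint32_t io_read32(uint32_t addr)
{
    slow_path_calls++;
    if (addr - RAM_BASE < RAM_SIZE) {
        uint32_t v;
        memcpy(&v, &ram[addr - RAM_BASE], 4);
        return v;
    }
    return 0; /* hardware registers would be dispatched here */
}

/* What gets emitted after profiling: no range check, no callback. */
static inline uint32_t ram_read32_direct(uint32_t addr)
{
    uint32_t v;
    memcpy(&v, &ram[addr - RAM_BASE], 4);
    return v;
}
```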

Check optimizer.c, blockcache.c, and TLSF for the implementation.

Test hardware: desktop PC with Core i7-7700K, Windows 10

Game          | Interpreter (no dithering) | Interpreter (dithering) | Dynarec (no dithering) | Dynarec (dithering)
------------- | -------------------------- | ----------------------- | ---------------------- | -------------------
Final Doom    | 246 fps                    | 245 fps                 | 621 fps                | 616 fps
Resident Evil | 250 fps                    | 248 fps                 | 642 fps                | 639 fps
Tekken 3      | 190 fps                    | 175 fps                 | 279 fps                | 250 fps


jserv commented Aug 3, 2023

copyjit draws inspiration from the paper "Copy-and-Patch Compilation". However, what if patching could be entirely eliminated?

The core concept revolves around having the compiler generate 'templates' that can be copied directly into place. This approach relies heavily on continuation passing, which means that all operations defined by the JIT library must allow continuation-passing optimizations. In copy-and-patch, the templates are filled in at runtime with user-selected values. Unfortunately, that method relies on parsing ELF relocations, which necessitates porting the library to different platforms. While not a major issue, avoiding runtime patching of relocations could enable a JIT library that is architecture-agnostic and offers very low latency.

bcgen generates a number of files in a directory called gen in the working directory. These generated files are included by bcode.c, which you can compile into an object file providing an interface for compiling and running bytecode.

See also: A Template-Based Code Generation Approach for MLIR


jserv commented Aug 16, 2023

QEMU employs a two-step process for executing binaries, involving an intermediate representation known as tiny code. This tiny code is interpreted in two ways: first through emulation and second via compilation into native code using a JIT compiler, often leading to enhanced speed.

However, the use of a JIT demands the allocation of executable memory to house the compiled code, which is not permitted on iOS. To circumvent this restriction, a technique is employed that reuses portions of code that are already in executable memory. This concept goes by various names, such as code reuse, ROP (return-oriented programming), and ret2code. It is formalized as "weird machines" due to the differing semantics between the original code and the final execution.

This process involves the creation of code gadgets, such as ldr x0, [sp], #16; ret (AArch64), which perform an action and then return to their caller. These gadgets are chained together, forming sequences that achieve specific objectives. The qemu-tcg-tcti project has developed a script that generates pre-compiled code snippets (gadgets) with different functionalities. These gadgets are compiled during the iOS app's build process and are mapped to executable memory, eliminating the need for runtime code generation. The final step is a JIT that constructs a memory segment containing values for use by these gadgets before transitioning to the next one.

This inventive approach allows the creation of complete programs by reusing existing code, a technique historically employed for creating exploits. It provides a creative solution for implementing JIT compilers in architectures that disallow the allocation of executable memory. See commit 4de86e.
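The gadget-chaining idea can be sketched without any runtime code generation. Here the "gadgets" are ordinary pre-compiled C functions, and the "JIT" merely builds a stream of gadget pointers interleaved with operand values in non-executable memory; a trampoline walks the stream. All names are illustrative and unrelated to the actual TCTI implementation:

```c
/* Sketch: pre-compiled gadgets driven by a data stream, so nothing is
 * ever written to executable memory at runtime. */
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word_t;
typedef const word_t *(*gadget_fn)(const word_t *stream, int64_t *acc);

static const word_t *gadget_load(const word_t *s, int64_t *acc)
{
    *acc = (int64_t)*s++; /* operand value embedded in the stream */
    return s;
}

static const word_t *gadget_add(const word_t *s, int64_t *acc)
{
    *acc += (int64_t)*s++;
    return s;
}

static const word_t *gadget_halt(const word_t *s, int64_t *acc)
{
    (void)s;
    (void)acc;
    return 0;             /* end of the chain */
}

/* Trampoline: fetch the next gadget pointer, call it, repeat. */
static int64_t run_stream(const word_t *s)
{
    int64_t acc = 0;
    while (s)
        s = ((gadget_fn)*s)(s + 1, &acc);
    return acc;
}
```

The real thing chains machine-code gadgets directly rather than returning to a C loop, but the data layout — code addresses interleaved with their operands — is the same.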

UTM already merges the above qemu-tcg-tcti effort (see patches directory):

UTM SE ("slow edition") uses a threaded interpreter which performs better than a traditional interpreter but still slower than JIT. This technique is similar to what iSH does for dynamic execution. As a result, UTM SE does not require jailbreaking or any JIT workarounds and can be sideloaded as a regular app.

Report on Apr 1 2021:

  • TCI (normal interpreter) boot this VM in 130 seconds
  • TCTI (threaded interpreter) boots it in only 22 seconds

Reference:


jserv commented Sep 6, 2023

pylbbv is a lazy basic block versioning + copy and patch JIT interpreter for CPython.

The copy-and-patch JIT compiler uses a stencil compiler.

  • At runtime, each basic block, except for the branches (exits), is compiled to machine code.
  • If compilation is successful, execution jumps into the machine code rather than the interpreter bytecode.
  • The branches remain as CPython interpreter bytecode, to facilitate easy branching.
  • Upon encountering a branch, the interpreter leaves the machine code and goes back into bytecode.
  • Execution thus interleaves between machine code and the interpreter.


jserv commented Sep 10, 2023

luajit-remake transforms an LLVM function to make it suitable for compilation and back-parsing into a copy-and-patch stencil.

The transformation process involves the following steps:

  1. The function is split into two parts: the fast path and the slow path. The slow-path logic is identified using BlockFrequencyInfo, and proper annotations are added to the LLVM IR so that the slow path can be identified and separated during assembly generation.
  2. A simple heuristic is applied to modify the assembly code. This modification ensures that the function falls through to the next bytecode when doing so might be beneficial.

The pass comprises two phases: one at the LLVM IR level (IR-to-IR transformation) and another at the ASM (.s file) level (ASM-to-ASM transformation).

It is important to note that the IR-level rewrite pass should be executed immediately before the LLVM module is compiled to assembly. Once this pass is applied, no further transformations to the LLVM IR are allowed.


jserv commented Sep 21, 2023

Jonathan Müller gave an excellent talk, A deep dive into dispatching techniques. He compared a manual jump table with the one generated by an optimizing compiler.

  • Manual jump table
movzx eax, byte ptr [rbx]           ; rax := ip->op
jmp qword ptr [r13 + 8*rax]         ; goto *execute_table[rax]
  • Switch jump table
movzx eax, byte ptr [rbx]           ; rax := ip->op
movsxd rax, dword ptr [r13 + 4*rax] ; rax := execute_table[rax]
add rax, r13                        ; rax := rax + &execute_table
jmp rax                             ; goto *rax

The compiler generates a jump table with 4-byte relative offsets instead of 8-byte absolute pointers, resulting in faster execution on an Intel Core i5-1145G7.
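The same trick can be illustrated in C by storing each handler as a 32-bit offset from a base address rather than as a 64-bit absolute pointer, halving the table. The casts through uintptr_t are not strictly portable C, but mirror what compilers emit in their own jump tables; all names here are made up:

```c
/* Sketch: compact jump table holding 4-byte offsets relative to a base
 * address, reconstructed into function pointers at dispatch time. */
#include <assert.h>
#include <stdint.h>

typedef int (*handler_fn)(int);

static int op_inc(int v) { return v + 1; }
static int op_neg(int v) { return -v; }

static int32_t table[2]; /* 4 bytes per entry instead of 8 */
static uintptr_t base;

static void table_init(void)
{
    base = (uintptr_t)op_inc; /* any nearby code address works */
    table[0] = (int32_t)((uintptr_t)op_inc - base);
    table[1] = (int32_t)((uintptr_t)op_neg - base);
}

static int dispatch(int opcode, int v)
{
    /* Sign-extend the offset and add it back to the base. */
    handler_fn fn = (handler_fn)(base + (intptr_t)table[opcode]);
    return fn(v);
}
```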


jserv commented Oct 12, 2023

WebAssembly Micro Runtime (WAMR) is a lightweight standalone WebAssembly (Wasm) runtime with a small footprint, high performance, and highly configurable features for applications ranging from embedded devices upward.


jserv commented Nov 11, 2023

Possible lightweight JIT frameworks:


jserv commented Nov 30, 2023

The core of the security concern lies in the inherent complexity of the system. Even extensively used and battle-tested tools like wasmtime have experienced severe vulnerabilities, such as the recent critical bug that could potentially lead to remote code execution (as seen in Guest-controlled out-of-bounds read/write on x86_64 · bytecodealliance/wasmtime).

The strategy employed here, assuming it progresses beyond the experimental phase, comprises three key elements to ensure robust security:

  1. Simplicity: A commitment to keeping all components as straightforward as possible. Simplicity enhances security by reducing the attack surface.
  2. Thorough Testing: The application of exhaustive and comprehensive testing procedures across all aspects of the system, leaving no stone unturned.
  3. Rigorous Sandboxing: Implementing stringent sandboxing mechanisms akin to a highly secure facility, to ensure that even if an attacker were to break out of the virtual machine and achieve remote execution, their capabilities would be severely restricted, potentially limited to consuming the host's CPU resources at most.


jserv commented Nov 30, 2023

An experimental JIT for PHP, built upon the dstogov/ir project, has been developed and can be found in the master branch of the php-src repository.

By following the provided build instructions, we can build a development version of PHP, which reports the following:

$ sapi/cli/php --version
PHP 8.4.0-dev (cli) (built: Dec  1 2023 01:59:26) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.0-dev, Copyright (c) Zend Technologies

Check if opcache is loaded:

$ sapi/cli/php -v | grep -i opcache

  • Disable the opcache JIT:

$ sapi/cli/php -d opcache.jit=off Zend/bench.php
...
Total              0.310

(unit: seconds)

  • Enable the opcache JIT:

$ sapi/cli/php -d opcache.jit=tracing Zend/bench.php
...
Total              0.089

@jserv jserv added the research Study certain topics label Dec 25, 2023

jserv commented Dec 25, 2023

Whose baseline compiler is it anyway? by Ben L. Titzer

We show the design of a new single-pass compiler for a research Wasm engine that integrates with an in-place interpreter and host garbage collector using value tags, while also supporting flexible instrumentation. In experiments, we measure the effectiveness of optimizations targeting value tags and find, somewhat surprisingly, that the runtime overhead can be reduced to near zero. We also assess the relative compile speed and execution time of six baseline compilers and place these baseline compilers in a two-dimensional tradeoff space with other execution tiers for Wasm.


jserv commented Dec 30, 2023

The concept of the delay slot in MIPS was initially a straightforward solution to manage pipeline hazards in five-stage pipelines. However, it became a challenge for processors with longer pipelines and the ability to issue multiple instructions per clock cycle. From a software perspective, the delay slot has drawbacks, making programs harder to read and often less efficient, due to the frequent insertion of nop (no operation) instructions in the delay slot.

Historically, in the 1980s, the idea of the branch delay slot made sense for pipelines of 5 or 6 stages, as it helped to mitigate the one-cycle branch penalty inherent in these systems. But as processor architectures evolved, this approach became outdated. For instance, in modern Pentium microarchitectures, the branch penalty can range from 15 to 25 cycles, rendering a single-instruction delay slot ineffective. Implementing a delay slot that could accommodate a 15-instruction delay would be impractical and would break instruction-set compatibility.

Advancements in technology have introduced more efficient solutions. Branch prediction, now a mature technology, has proven more efficient: the misprediction rate of current branch predictors is significantly lower than the frequency of branches carrying a nop in the delay slot. This holds even in systems with a relatively short 6-cycle penalty, like the Nios II architecture.

Given these considerations, in terms of both hardware and software efficiency, delay slots are no longer advantageous. Modern architectures such as RISC-V have therefore omitted the delay slot, in line with current technological capabilities and requirements.

lightrec, the MIPS recompiler mentioned earlier that employs GNU Lightning for code emission, must handle the delay slot characteristic of MIPS. This feature is absent from RISC-V and other recent RISC designs, reflecting a broader trend in newer RISC architectures away from this once-common design element.
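What delay-slot handling means for an emulator can be sketched with a hypothetical mini-ISA: when a branch is taken, the instruction immediately after the branch must still execute before control transfers. None of this corresponds to lightrec's actual code; it only illustrates the semantics a MIPS recompiler has to preserve:

```c
/* Sketch: interpreter for a toy ISA with MIPS-style delay slots. */
#include <assert.h>

enum { OP_ADDI, OP_BEQZ, OP_HALT };

struct insn {
    int op;
    int imm; /* operand for ADDI, branch target for BEQZ */
};

static int run(const struct insn *prog, int r0)
{
    int pc = 0;
    for (;;) {
        struct insn i = prog[pc];
        switch (i.op) {
        case OP_ADDI:
            r0 += i.imm;
            pc++;
            break;
        case OP_BEQZ:
            if (r0 == 0) {
                /* Taken branch: execute the delay-slot instruction
                 * first, then redirect the PC. */
                struct insn d = prog[pc + 1];
                if (d.op == OP_ADDI)
                    r0 += d.imm;
                pc = i.imm;
            } else {
                pc++; /* not taken: fall into the slot normally */
            }
            break;
        case OP_HALT:
            return r0;
        }
    }
}
```

A RISC-V translator never needs this extra bookkeeping, which is part of why newer ISAs dropped the feature.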


jserv commented Jan 8, 2024

rv64_emulator is a RISC-V ISA emulation suite containing a full-system emulator and an ELF instruction-frequency analyzer, with a JIT compiler for Arm64.
