jit: Transition from linear to more effective form #238

qwe661234 · 2023-10-04T08:55:18Z

The chained block structure used by both interpreter and tier-1 compiler is linear, with each block pointing only to the subsequent block. Enhancing a block to reference its previous block brings significant value, especially for hot spot profiling. This advancement paves the way for developing a graph-based intermediate representation (IR). In this IR, graph edges symbolize use-define chains. Rather than working on a two-tiered Control-Flow Graph (CFG) comprising basic blocks (tier 1) and instructions (tier 2), analyses and transformations will directly interact with and modify this use-def information in a streamlined, single-tiered graph structure.
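A minimal sketch of what back-references between chained blocks could look like. The `Block` class, its fields, and both helpers are hypothetical names used only for illustration; they are not rv32emu's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, so cyclic graphs stay cheap to compare
class Block:
    pc: int                                                # start address of the block
    hits: int = 0                                          # execution counter used for profiling
    succs: list = field(default_factory=list, repr=False)  # blocks this block can jump to
    preds: list = field(default_factory=list, repr=False)  # blocks that can jump to this one


def chain(src, dst):
    """Record the edge src -> dst in both directions."""
    if dst not in src.succs:
        src.succs.append(dst)
        dst.preds.append(src)


def hot_predecessors(block, min_hits):
    """Walk the back-references to find frequently executed predecessors."""
    return [p for p in block.preds if p.hits >= min_hits]
```

With only forward links, answering "which hot blocks lead into this one?" requires scanning every block; with `preds` it is a direct lookup.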

The sfuzz project employs a custom intermediate representation. The first step of code generation lifts the entire function into this intermediate representation. During initialization, when the target is first loaded, the size of each function is determined by parsing the ELF metadata and creating a hashmap that maps function start addresses to their respective sizes.
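The address-to-size map can be sketched as follows. The sketch assumes the ELF symbol table has already been parsed into `(name, start_addr, size, kind)` tuples; that tuple layout and the function name are assumptions made for illustration, not sfuzz's API.

```python
def build_function_sizes(symbols):
    """Map each function's start address to its size in bytes.

    `symbols` is an iterable of (name, start_addr, size, kind) tuples,
    assumed to come from an already-parsed ELF symbol table.
    """
    return {
        start_addr: size
        for _name, start_addr, size, kind in symbols
        if kind == "FUNC" and size > 0  # skip objects and zero-sized entries
    }
```

A lifter can then look up `sizes[start_addr]` to know how many bytes to decode before stopping.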

The IR-lifting process iterates through the original instructions and generates an IR instruction for each original instruction using a large switch statement. The following example illustrates what the intermediate representation might resemble for a very minimal function that essentially performs a branch operation based on a comparison in the first block.
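A minimal sketch of switch-style lifting, written as an `if`/`elif` chain. The input tuple format `(pc, mnemonic, operands)` and the IR opcode names are made up for illustration and are not sfuzz's actual IR; the sketch also assumes 4-byte (uncompressed) instructions when computing the fall-through address.

```python
def lift(instructions):
    """Emit one IR tuple per decoded instruction.

    `instructions` is a list of (pc, mnemonic, operands) tuples.
    """
    ir = []
    for pc, mnemonic, operands in instructions:
        if mnemonic == "addi":
            rd, rs1, imm = operands
            ir.append(("add_imm", rd, rs1, imm))
        elif mnemonic == "beq":
            rs1, rs2, offset = operands
            # A conditional branch ends the block: record both successors.
            ir.append(("branch_eq", rs1, rs2, pc + offset, pc + 4))
        elif mnemonic == "jal":
            rd, offset = operands
            ir.append(("jump", rd, pc + offset))
        else:
            raise NotImplementedError(mnemonic)
    return ir
```

Resolving branch targets to absolute addresses at lift time means later passes can split the function into blocks without re-reading the original instruction stream.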

Reference: A Simple Graph-Based Intermediate Representation

When the execution frequency of a block exceeds a predetermined threshold, the tier-1 JIT compiler traces the chained block and generates corresponding low-quality machine code. The resulting target machine code is stored in the code cache for later reuse. The primary objective of introducing the tier-1 JIT compiler is to speed up the execution of RISC-V instructions. This implementation requires two additional components: a tier-1 machine code generator and a code cache. Furthermore, the tier-1 JIT compiler serves as the foundation for future improvements.

In addition, we have developed a Python script that traces code templates and automatically generates JIT code templates. This approach eliminates the need to write duplicated code by hand.

As shown in the performance analysis below, the tier-1 JIT compiler's performance closely parallels that of QEMU in benchmarks with a constrained dynamic instruction count. However, for benchmarks featuring a substantial dynamic instruction count or lacking specific hotspots (examples include pi and STRINGSORT), the tier-1 JIT compiler is noticeably slower than QEMU. Hence, a robust tier-2 JIT compiler is essential to generate optimized machine code across diverse execution paths, coupled with a runtime profiler for detecting hotspots.

* Performance

| Benchmark | rv32emu-T1C | qemu |
|---|---|---|
| aes | 0.02 | 0.031 |
| mandelbrot | 0.029 | 0.0115 |
| puzzle | 0.0115 | 0.009 |
| pi | 0.0413 | 0.0177 |
| dhrystone | 0.331 | 0.393 |
| Nqueens | 0.854 | 0.749 |
| qsort-O2 | 2.384 | 2.16 |
| miniz-O2 | 1.33 | 1.01 |
| primes-O2 | 2.93 | 1.069 |
| sha512-O2 | 2.057 | 0.939 |
| stream | 12.747 | 10.36 |
| STRINGSORT | 89.012 | 11.496 |

As demonstrated in the memory usage analysis below, the tier-1 JIT compiler uses less memory than QEMU across all benchmarks.

* Memory usage

| Benchmark | rv32emu-T1C | qemu |
|---|---|---|
| aes | 186,228 | 1,343,012 |
| mandelbrot | 152,203 | 841,841 |
| puzzle | 153,423 | 890,225 |
| pi | 152,923 | 879,957 |
| dhrystone | 154,466 | 856,404 |
| Nqueens | 154,880 | 858,618 |
| qsort-O2 | 155,091 | 933,506 |
| miniz-O2 | 165,627 | 1,076,682 |
| primes-O2 | 150,540 | 928,446 |
| sha512-O2 | 153,553 | 978,177 |
| stream | 165,911 | 957,845 |
| STRINGSORT | 167,871 | 1,104,702 |

Related: sysprog21#238
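The promotion policy described in the commit message above (count executions, compile once the threshold is exceeded, then reuse the cached code) can be sketched roughly as follows. The class name, the callback parameters, and the use of Python callables to stand in for machine code are all assumptions for illustration; this is not the rv32emu implementation.

```python
class Tier1Jit:
    """Toy model of threshold-based promotion with a code cache."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}      # block pc -> number of interpreted executions
        self.code_cache = {}  # block pc -> compiled code (a callable here)

    def execute(self, pc, interpret, compile_block):
        code = self.code_cache.get(pc)
        if code is not None:
            return code()  # fast path: reuse previously generated code
        self.counts[pc] = self.counts.get(pc, 0) + 1
        if self.counts[pc] > self.threshold:
            # The block is hot: generate code once and cache it.
            code = compile_block(pc)
            self.code_cache[pc] = code
            return code()
        return interpret(pc)  # cold path: keep interpreting
```

Blocks that never cross the threshold pay only for a counter update, which is why workloads without clear hotspots gain little from this tier.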

We follow the x64 template and API to implement the A64 tier-1 JIT compiler.

* Performance

| Benchmark | rv32emu-T1C | qemu |
|---|---|---|
| aes | 0.034 | 0.045 |
| puzzle | 0.0115 | 0.0169 |
| pi | 0.035 | 0.032 |
| dhrystone | 1.914 | 2.005 |
| Nqueens | 3.87 | 2.898 |
| qsort-O2 | 7.819 | 11.614 |
| miniz-O2 | 7.604 | 3.803 |
| primes-O2 | 10.551 | 5.986 |
| sha512-O2 | 6.497 | 2.853 |
| stream | 52.25 | 45.776 |

As demonstrated in the memory usage analysis below, the tier-1 JIT compiler uses less memory than QEMU across all benchmarks.

* Memory usage

| Benchmark | rv32emu-T1C | qemu |
|---|---|---|
| aes | 183,212 | 1,265,962 |
| puzzle | 145,239 | 891,357 |
| pi | 144,739 | 872,525 |
| dhrystone | 146,282 | 853,256 |
| Nqueens | 146,696 | 854,174 |
| qsort-O2 | 146,907 | 856,721 |
| miniz-O2 | 157,475 | 999,897 |
| primes-O2 | 142,356 | 851,661 |
| sha512-O2 | 145,369 | 901,136 |
| stream | 157,975 | 955,809 |

Related: sysprog21#238
Close: sysprog21#296

jserv · 2023-12-25T09:47:07Z

After the merge of the tier-1 JIT compiler, it is time to revisit our IR.

jserv · 2023-12-28T14:31:19Z

Modern CPUs invest substantial effort in predicting indirect branches, but the Branch Target Buffer (BTB) is limited in size. Eliminating any form of indirect call or jump, including those through dispatch tables, is greatly beneficial, because contemporary CPUs have large reorder buffers that can process long stretches of code efficiently as long as branch prediction is effective. In larger programs that use indirect jumps widely, however, accurate branch prediction becomes increasingly difficult.

jserv · 2024-03-02T05:08:54Z

FEX is an advanced x86 emulation frontend, crafted to facilitate the running of x86 and x86-64 binaries on Arm64 platforms, comparable to qemu-user. At the heart of FEX's emulation capability is the FEXCore, which employs an SSA-based Intermediate Representation (IR) crafted from the input x86-64 assembly. Working with SSA is particularly advantageous during the translation of x86-64 code to IR, throughout the optimization stages with custom passes, and when transferring the IR to FEX's CPU backends.

Key aspects of FEX's emulation IR include:

* Precisely Defined IR Variable Sizes: It accommodates standard element sizes (1, 2, 4, 8 bytes, and certain 16-byte operations), as well as a flexible number of vector elements, distinguishing between float and integer operations based on the operation type.
* Distinct Scalar and Vector IR Operations: Operations are clearly differentiated, such as scalar multiplication (MUL) vs. vector multiplication (VMUL).
* Dedicated Load/Store Context IR Operations: These operations facilitate a clear distinction between guest memory and the monitored x86-64 state.
* Specific CPUID IR Operation: Enables the return of complex data (data across four registers) and simplifies optimization for constant CPUID functions, allowing for further constant propagation.
* Explicit Syscall Operation: Similar to the CPUID operation, this feature allows for efficient direct calls to the syscall handler by enabling constant propagation, reducing call overheads.
* Branching Support within the IR: Includes conditional branching that either proceeds to the targeted branch or continues to the next block, and unconditional branching to jump directly to a specified block, aiming to align with LLVM semantics for block limitations without strict enforcement.
* Debug Print Operation: For outputting values during debugging sessions.
* Explicit Memory Access IR Operations: Designed for guest memory access, performing address translation into the virtual machine's memory space by adding the VM memory base to the 64-bit address. This approach allows for potential escape from the VM and is not deemed safe without JIT validation of the memory region for access correctness.

These features underscore FEX's design philosophy, emphasizing precise control, optimization flexibility, and efficient translation mechanisms within its emulation environment.
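To make the last item in the list concrete, here is a small sketch of base-plus-offset address translation together with the kind of bounds validation the item says is needed for safety. The function names and parameters are hypothetical; this models the idea only and is not FEX's code.

```python
def in_bounds(guest_addr, vm_size, access_size):
    """True if [guest_addr, guest_addr + access_size) lies inside guest memory."""
    return (
        access_size > 0
        and guest_addr >= 0
        and guest_addr + access_size <= vm_size
    )


def guest_to_host(guest_addr, vm_base, vm_size, access_size):
    """Translate a guest address by adding the VM memory base.

    Without the bounds check, an out-of-range guest address would simply
    yield a host address outside the VM's memory region.
    """
    if not in_bounds(guest_addr, vm_size, access_size):
        raise ValueError(f"guest access out of range: {guest_addr:#x}")
    return vm_base + guest_addr
```

The translation itself is a single addition, so the cost of safety lies entirely in the check; a JIT that can prove an access is in range at compile time could omit it.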

Reference: FEXCore IR

jserv · 2024-04-18T18:54:58Z

The Java HotSpot Server Compiler (C2) utilizes a Sea-of-Nodes IR form designed for high performance with minimal overhead, similar to LLVM's approach with its control flow graph (CFG). However, in textual IR presentations, the CFG is not depicted as a traditional 'graph' but rather through labels and jumps that effectively outline the graph's edges. Like C2’s IR, the Sea-of-Nodes IR can be described in a linear textual format and only visually represented as a "graph" when loaded into memory. This allows for flexibility in handling nodes without control dependencies, known as "floating nodes," which can be placed in any basic block in the textual format and reassigned in memory to maintain their floating characteristic.
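A toy model of the "floating node" idea, under simplifying assumptions: every node lists its data inputs (use-def edges), only some nodes carry a control input, and a node is treated as floating when it is not itself a control node and has no control input. The `Node` class and helper names are illustrative and do not correspond to C2's classes.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality and hashing
class Node:
    op: str
    inputs: list = field(default_factory=list, repr=False)  # data (use-def) edges
    control: object = None     # control input; None means "not pinned"
    is_control: bool = False   # True for nodes such as start/if/region


def floating_nodes(nodes):
    """Nodes that no control edge pins to a particular block."""
    return [n for n in nodes if not n.is_control and n.control is None]


def users_of(nodes):
    """Invert the use-def edges to get a def-use map."""
    uses = {}
    for n in nodes:
        uses.setdefault(n, [])
        for inp in n.inputs:
            uses.setdefault(inp, []).append(n)
    return uses
```

A scheduler is free to place each floating node in any block that dominates its users, which is what gives the representation its flexibility.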

While the current tier-2 JIT compiler, built with LLVM, offers aggressive optimizations, it is also resource-intensive, consuming considerable memory and prolonging compilation times. An alternative, the IR Framework, emerges as a viable option that enhances performance while minimizing memory usage. This framework not only defines an IR but also offers a streamlined API for IR construction, coupled with algorithms for optimization, scheduling, register allocation, and code generation. The code generated in-memory can be executed directly, potentially increasing efficiency.

The Ideal Graph Visualizer (IGV) is a tool designed for developers to analyze and troubleshoot performance issues by examining compilation graphs. It specifically focuses on IR graphs, which serve as a language-independent bridge between the source code and the machine code generated by compilers.

jserv · 2024-08-14T02:26:18Z

Inspired by rvdbt, we may adopt its QuickIR, a lightweight non-SSA internal representation used by the QMC compiler. QuickIR interacts with both local and global states; the former represents optimized temporaries, while the latter includes the emulated CPU state and any internal data structures attached to CPUState, a concept common to many emulators. The terms local and global also extend to control flow, where global branch instructions gbr and gbrind manage branches that escape the current translation region. If a particular instruction or its slowpath cannot be represented in QuickIR, a special hcall might be used to invoke a pre-registered guest runtime stub. These stubs are also generated from interpreter handlers, making it straightforward to extend the translated ISA without mandatory frontend support for new instructions.

QuickIR sample (1) - single basic block

00018c40:  slli   s2, s2, 8
00018c44:  or     s2, s2, s3
00018c48:  addi   s3, zr, 61
00018c4c:  jal    zr, 12

bb.0: succs[ ] preds[ ]
        #0 sll [@s2|i32] [@s2|i32] [$8|i32]
        #1 or  [@s2|i32] [@s2|i32] [@s3|i32]
        #3 mov [@s3|i32] [$3d|i32]
        #4 gbr [$18c58|i32]

QuickIR sample (2) - conditional branch representation

00018fc0:  lw     a5, a1, 0
00018fc4:  sw     zr, a1, 4
00018fc8:  addi   a6, a1, 0
00018fcc:  beq    a5, zr, 132

bb.0: succs[ 2 1 ] preds[ ]
        #0 mov [g:80|i32] [$18fc0|i32]
        #1 vmload:i32:s [@a5|i32] [@a1|i32]
        #2 mov [g:80|i32] [$18fc4|i32]
        #3 add [%32|i32] [@a1|i32] [$4|i32]
        #4 vmstore:i32:u [%32|i32] [$0|i32]
        #6 mov [@a6|i32] [@a1|i32]
        #9 brcc:eq [@a5|i32] [$0|i32]
bb.1: succs[ ] preds[ 0 ]
        #7 gbr [$18fd0|i32]
bb.2: succs[ ] preds[ 0 ]
        #8 gbr [$19050|i32]

g:80 - program counter location in global CPUState, manually flushed by frontend before translating "unsafe" vmload instruction
%32 - temporary local register, frontend may emit an arbitrary number of locals in a single region
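For experimenting with dumps like the two samples above, a small parser sketch follows. It is based only on the textual layout visible in the samples. The operand kinds are assumptions: `$` values are read as hexadecimal immediates (consistent with `addi s3, zr, 61` appearing as `$3d`), `@` is read as a guest register, and `g:`/`%` values are kept as strings because their numeric base is not stated. None of this is rvdbt's own API.

```python
import re

LINE_RE = re.compile(r"#(\d+)\s+(\S+)\s*(.*)")
OPERAND_RE = re.compile(r"\[([^|\]]+)\|([^\]]+)\]")


def parse_operand(text, ty):
    """Classify one `[value|type]` operand by its prefix."""
    if text.startswith("$"):
        return ("imm", int(text[1:], 16), ty)  # hexadecimal immediate
    if text.startswith("@"):
        return ("greg", text[1:], ty)          # guest register (assumed meaning)
    if text.startswith("%"):
        return ("local", text[1:], ty)         # temporary local
    if text.startswith("g:"):
        return ("global", text[2:], ty)        # location in global CPUState
    raise ValueError(f"unknown operand: {text}")


def parse_line(line):
    """Return (index, opcode, operands), or None for non-instruction lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None  # e.g. a "bb.N: succs[...] preds[...]" header
    index, opcode, rest = m.groups()
    operands = [parse_operand(t, ty) for t, ty in OPERAND_RE.findall(rest)]
    return (int(index), opcode, operands)
```

The parsed immediates line up with the guest code: the `gbr` target `$18c58` in sample (1) equals `0x18c4c + 12`, and the two targets in sample (2) are `0x18fcc + 4` and `0x18fcc + 132`.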

jserv · 2024-09-23T07:39:53Z

sovietov_graph_irs_2023.pdf
Slides from a talk "Graph-Based Intermediate Representations: An Overview and Perspectives". Great information starting from linear code to dataflow IR.

qwe661234 changed the title ~~JIT: translate RISC-V into low-level code generators' IR~~ jit: translate RISC-V into low-level code generators' IR Oct 4, 2023

jserv changed the title ~~jit: translate RISC-V into low-level code generators' IR~~ jit: Translate RISC-V into low-level code generators' IR Nov 30, 2023

jserv mentioned this issue Dec 12, 2023

Introduce a tier-1 JIT compiler based on x86-64 architecture #289

Merged

qwe661234 mentioned this issue Dec 22, 2023

Introduce a tier-1 JIT compiler based on aarch64 architecture #304

Merged

jserv assigned qwe661234 Dec 25, 2023

jserv added the enhancement New feature or request label Dec 26, 2023

jserv changed the title ~~jit: Translate RISC-V into low-level code generators' IR~~ jit: Transition from Linear to Graph-Based IR Dec 28, 2023

jserv mentioned this issue Dec 30, 2023

Implement fast tier-1 JIT compiler #283

Closed

jserv assigned vacantron Feb 19, 2024

jserv changed the title ~~jit: Transition from Linear to Graph-Based IR~~ jit: Transition from linear to more effective form Sep 9, 2024

jserv added this to the release-2024.2 milestone Oct 19, 2024
