
jit: Transition from linear to more effective form #238

Open
qwe661234 opened this issue Oct 4, 2023 · 6 comments
Labels: enhancement (New feature or request)


qwe661234 commented Oct 4, 2023

The chained block structure used by both the interpreter and the tier-1 compiler is linear: each block points only to the subsequent block. Extending a block to also reference its previous block adds significant value, especially for hotspot profiling, and paves the way for a graph-based intermediate representation (IR). In such an IR, graph edges represent use-define chains. Rather than working on a two-tiered control-flow graph (CFG) of basic blocks (tier 1) and instructions (tier 2), analyses and transformations directly inspect and modify this use-def information in a streamlined, single-tiered graph structure.
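The proposed backward link can be sketched as a small extension of the block structure; the field and type names below are illustrative stand-ins, not rv32emu's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: extending a chained block with a back pointer.
 * Field names are illustrative, not rv32emu's real struct layout. */
typedef struct block {
    unsigned pc_start;  /* guest address of the block's first instruction */
    struct block *next; /* existing forward link */
    struct block *prev; /* proposed backward link */
} block_t;

/* Link b after a, maintaining both directions. */
static void block_link(block_t *a, block_t *b)
{
    a->next = b;
    b->prev = a;
}

/* With prev links, a hot block can be traced backward to the start
 * of its chain, which is what hotspot profiling needs. */
static block_t *chain_head(block_t *b)
{
    while (b->prev)
        b = b->prev;
    return b;
}
```
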

The sfuzz project employs a custom intermediate representation. The first step of its code generation is to lift an entire function into this IR. During the initialization phase, when the target is first loaded, the size of each function is determined by parsing the ELF metadata and building a hashmap that maps function start addresses to their sizes.
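The start-address-to-size map might look like the sketch below; the entries and names are hypothetical, and a real implementation would populate the table from the STT_FUNC symbols (st_value, st_size) in the ELF symbol table rather than hard-coding it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical load-time map from function start address to size.
 * Real code would fill this from the ELF symbol table; a linear
 * scan stands in for the hashmap to keep the sketch short. */
typedef struct {
    uint32_t start; /* st_value of an STT_FUNC symbol */
    uint32_t size;  /* st_size of that symbol */
} func_entry_t;

static const func_entry_t func_map[] = {
    {0x10074, 0x40}, /* hypothetical function at 0x10074, 64 bytes */
    {0x100b4, 0x9c},
};

static uint32_t func_size(uint32_t start)
{
    for (size_t i = 0; i < sizeof(func_map) / sizeof(func_map[0]); i++)
        if (func_map[i].start == start)
            return func_map[i].size;
    return 0; /* unknown function */
}
```
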

The IR-lifting pass iterates through the original instructions and emits an IR instruction for each one via a large switch statement. The following example illustrates what the intermediate representation might look like for a minimal function that performs a branch based on a comparison in its first block.
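A minimal sketch of such switch-based lifting, keyed on the RISC-V major opcode field; the IR opcodes here are invented for illustration and do not reflect sfuzz's actual IR:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative IR opcodes: one IR instruction per decoded guest
 * instruction, selected by a large switch on the opcode field. */
typedef enum { IR_ADD, IR_LOAD, IR_BRANCH, IR_UNKNOWN } ir_op_t;

static ir_op_t lift_one(uint32_t insn)
{
    switch (insn & 0x7f) {       /* RISC-V major opcode (bits 6:0) */
    case 0x33: return IR_ADD;    /* OP: register-register ALU */
    case 0x03: return IR_LOAD;   /* LOAD */
    case 0x63: return IR_BRANCH; /* BRANCH */
    default:   return IR_UNKNOWN;
    }
}
```

A real lifter would also decode register operands and immediates; this sketch only shows the dispatch shape the text describes.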

Reference: A Simple Graph-Based Intermediate Representation

@qwe661234 qwe661234 changed the title JIT: translate RISC-V into low-level code generators' IR jit: translate RISC-V into low-level code generators' IR Oct 4, 2023
@jserv jserv changed the title jit: translate RISC-V into low-level code generators' IR jit: Translate RISC-V into low-level code generators' IR Nov 30, 2023
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 13, 2023
When the usage frequency of a block exceeds a predetermined threshold,
the baseline tier-1 JIT compiler traces the chained block and generates
corresponding low-quality machine code. The resulting target machine
code is stored in the code cache for future use.

The primary objective of introducing the baseline JIT compiler is to
improve the execution speed of RISC-V instructions. The implementation
requires two additional components: a tier-1 machine code generator
and a code cache. This baseline JIT compiler also serves as the
foundation for future improvements.

In addition, we have developed a Python script that traces code
templates and automatically generates JIT code templates, eliminating
the need to write duplicated code by hand.

Related: sysprog21#238
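The threshold-triggered flow described in the commit message can be sketched as follows; `HOT_THRESHOLD`, the struct, and the function names are illustrative stand-ins, not rv32emu's actual profiling code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hot-threshold dispatch: a block is interpreted until
 * its execution count crosses the threshold, compiled once, and run
 * from the code cache thereafter. The threshold value is invented. */
#define HOT_THRESHOLD 4096

typedef struct {
    uint32_t exec_count;
    bool compiled; /* stands in for a code-cache entry */
} block_profile_t;

/* Returns true exactly when the block should be handed to the JIT. */
static bool block_tick(block_profile_t *p)
{
    if (p->compiled)
        return false; /* already in the code cache */
    if (++p->exec_count >= HOT_THRESHOLD) {
        p->compiled = true; /* compile + insert into the code cache */
        return true;
    }
    return false;
}
```
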
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 14, 2023
When the usage frequency of a block exceeds a predetermined threshold,
the tier-1 JIT compiler traces the chained block and generates
corresponding low-quality machine code. The resulting target machine
code is stored in the code cache for future use.

The primary objective of introducing the tier-1 JIT compiler is to
improve the execution speed of RISC-V instructions. The implementation
requires two additional components: a tier-1 machine code generator
and a code cache. This tier-1 JIT compiler also serves as the
foundation for future improvements.

In addition, we have developed a Python script that traces code
templates and automatically generates JIT code templates, eliminating
the need to write duplicated code by hand.

As the performance analysis below shows, the tier-1 JIT compiler
performs close to QEMU on benchmarks with a constrained dynamic
instruction count. However, on benchmarks with a substantial dynamic
instruction count or without distinct hotspots, such as pi and
STRINGSORT, the tier-1 JIT compiler is noticeably slower than QEMU.

Hence, a robust tier-2 compiler that generates optimized machine code
across diverse execution paths is essential, coupled with a runtime
profiler for detecting hotspots.

* Performance
| Benchmark  | rv32emu (JIT-T1) |   qemu |
|------------+------------------+--------|
| aes        |           0.02   |  0.031 |
| mandelbrot |           0.029  | 0.0115 |
| puzzle     |           0.0115 |  0.009 |
| pi         |           0.0413 | 0.0177 |
| dhrystone  |           0.331  |  0.393 |
| nqueens    |           0.854  |  0.749 |
| qsort-O2   |           2.384  |   2.16 |
| miniz-O2   |           1.33   |   1.01 |
| primes-O2  |           2.93   |  1.069 |
| sha512-O2  |           2.057  |  0.939 |
| stream     |          12.747  |  10.36 |
| STRINGSORT |          89.012  | 11.496 |

As the memory usage analysis below demonstrates, the tier-1 JIT
compiler uses less memory than QEMU on every benchmark.

* Memory usage
| Benchmark  | rv32emu (JIT-T1) |      qemu |
|------------+------------------+-----------|
| aes        |          186,228 | 1,343,012 |
| mandelbrot |          152,203 |   841,841 |
| puzzle     |          153,423 |   890,225 |
| pi         |          152,923 |   879,957 |
| dhrystone  |          154,466 |   856,404 |
| nqueens    |          154,880 |   858,618 |
| qsort-O2   |          155,091 |   933,506 |
| miniz-O2   |          165,627 | 1,076,682 |
| primes-O2  |          150,540 |   928,446 |
| sha512-O2  |          153,553 |   978,177 |
| stream     |          165,911 |   957,845 |
| STRINGSORT |          167,871 | 1,104,702 |

Related: sysprog21#238
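The code cache referred to above can be sketched as a guest-PC-to-host-code map; this fixed-size open-addressing table is a hypothetical illustration (no eviction, opaque code pointers), not rv32emu's implementation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical code cache: maps a guest PC to generated host code.
 * A real JIT would store pointers into executable memory and handle
 * a full table; this sketch assumes the table never fills up. */
#define CACHE_SLOTS 64 /* power of two, so masking works as modulo */

typedef struct {
    uint32_t pc;
    void *code;
} cache_entry_t;

static cache_entry_t cache[CACHE_SLOTS];

static void cache_put(uint32_t pc, void *code)
{
    uint32_t i = (pc >> 2) & (CACHE_SLOTS - 1);
    while (cache[i].code && cache[i].pc != pc)
        i = (i + 1) & (CACHE_SLOTS - 1); /* linear probing */
    cache[i].pc = pc;
    cache[i].code = code;
}

static void *cache_get(uint32_t pc)
{
    uint32_t i = (pc >> 2) & (CACHE_SLOTS - 1);
    while (cache[i].code) {
        if (cache[i].pc == pc)
            return cache[i].code;
        i = (i + 1) & (CACHE_SLOTS - 1);
    }
    return NULL; /* miss: fall back to the interpreter */
}
```

On a miss the dispatcher would interpret (and profile) the block; on a hit it jumps straight to the cached host code.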
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 15, 2023
When the using frequency of a block exceeds a predetermined threshold,
the tier-1 JIT compiler traces the chained block and generate
corresponding low quailty machine code. The resulting target machine
code is stored in the code cache for future utilization.

The primary objective of introducing the tier-1 JIT compiler is to
enhance the execution speed of RISC-V instructions. This implementation
requires two additional components: a tier-1 machine code generator,
and code cache. Furthermore, this tier-1 JIT compiler serves as the
foundational target for future improvements.

In addition, we have developed a Python script that effectively traces
code templates and automatically generates JIT code templates. This
approach eliminates the need for manually writing duplicated code.

As shown in the performance analysis below, the tier-1 JIT compiler's
performance closely parallels that of QEMU in benchmarks with a
constrained dynamic instruction count. However, for benchmarks
featuring a substantial dynamic instruction count or lacking specific
hotspots—examples include pi and STRINGSORT—the tier-1 JIT compiler
demonstrates noticeably slower execution compared to QEMU.

Hence, a robust tier-2 JIT compiler is essential to generate optimized
machine code across diverse execution paths, coupled with a runtime
profiler for detecting hotspots.

* Perfromance
| Metric   | rv32emu-T1C | qemu  |
|----------+-------------+-------|
|aes	   |         0.02|  0.031|
|mandelbrot|	    0.029| 0.0115|
|puzzle	   |       0.0115|  0.009|
|pi        |       0.0413| 0.0177|
|dhrystone |	    0.331|  0.393|
|Nqeueens  |	    0.854|  0.749|
|qsort-O2  |	    2.384|   2.16|
|miniz-O2  |	     1.33|   1.01|
|primes-O2 |	     2.93|  1.069|
|sha512-O2 |	    2.057|  0.939|
|stream	   |       12.747|  10.36|
|STRINGSORT|       89.012| 11.496|

As demonstrated in the memory usage analysis below, the tier-1 JIT
compiler utilizes less memory than QEMU across all benchmarks.

* Memory usage
| Metric   | rv32emu-T1C |   qemu  |
|----------+-------------+---------|
|aes	   |      186,228|1,343,012|
|mandelbrot|	  152,203|  841,841|
|puzzle	   |      153,423|  890,225|
|pi        |      152,923|  879,957|
|dhrystone |	  154,466|  856,404|
|Nqeueens  |	  154,880|  858,618|
|qsort-O2  |	  155,091|  933,506|
|miniz-O2  |	  165,627|1,076,682|
|primes-O2 |	  150,540|  928,446|
|sha512-O2 |	  153,553|  978,177|
|stream	   |      165,911|  957,845|
|STRINGSORT|      167,871|1,104,702|

Related: sysprog21#238
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 15, 2023
When the using frequency of a block exceeds a predetermined threshold,
the tier-1 JIT compiler traces the chained block and generate
corresponding low quailty machine code. The resulting target machine
code is stored in the code cache for future utilization.

The primary objective of introducing the tier-1 JIT compiler is to
enhance the execution speed of RISC-V instructions. This implementation
requires two additional components: a tier-1 machine code generator,
and code cache. Furthermore, this tier-1 JIT compiler serves as the
foundational target for future improvements.

In addition, we have developed a Python script that effectively traces
code templates and automatically generates JIT code templates. This
approach eliminates the need for manually writing duplicated code.

As shown in the performance analysis below, the tier-1 JIT compiler's
performance closely parallels that of QEMU in benchmarks with a
constrained dynamic instruction count. However, for benchmarks
featuring a substantial dynamic instruction count or lacking specific
hotspots—examples include pi and STRINGSORT—the tier-1 JIT compiler
demonstrates noticeably slower execution compared to QEMU.

Hence, a robust tier-2 JIT compiler is essential to generate optimized
machine code across diverse execution paths, coupled with a runtime
profiler for detecting hotspots.

* Perfromance
| Metric   | rv32emu-T1C | qemu  |
|----------+-------------+-------|
|aes	   |         0.02|  0.031|
|mandelbrot|	    0.029| 0.0115|
|puzzle	   |       0.0115|  0.009|
|pi        |       0.0413| 0.0177|
|dhrystone |	    0.331|  0.393|
|Nqeueens  |	    0.854|  0.749|
|qsort-O2  |	    2.384|   2.16|
|miniz-O2  |	     1.33|   1.01|
|primes-O2 |	     2.93|  1.069|
|sha512-O2 |	    2.057|  0.939|
|stream	   |       12.747|  10.36|
|STRINGSORT|       89.012| 11.496|

As demonstrated in the memory usage analysis below, the tier-1 JIT
compiler utilizes less memory than QEMU across all benchmarks.

* Memory usage
| Metric   | rv32emu-T1C |   qemu  |
|----------+-------------+---------|
|aes	   |      186,228|1,343,012|
|mandelbrot|	  152,203|  841,841|
|puzzle	   |      153,423|  890,225|
|pi        |      152,923|  879,957|
|dhrystone |	  154,466|  856,404|
|Nqeueens  |	  154,880|  858,618|
|qsort-O2  |	  155,091|  933,506|
|miniz-O2  |	  165,627|1,076,682|
|primes-O2 |	  150,540|  928,446|
|sha512-O2 |	  153,553|  978,177|
|stream	   |      165,911|  957,845|
|STRINGSORT|      167,871|1,104,702|

Related: sysprog21#238
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 15, 2023
When the using frequency of a block exceeds a predetermined threshold,
the tier-1 JIT compiler traces the chained block and generate
corresponding low quailty machine code. The resulting target machine
code is stored in the code cache for future utilization.

The primary objective of introducing the tier-1 JIT compiler is to
enhance the execution speed of RISC-V instructions. This implementation
requires two additional components: a tier-1 machine code generator,
and code cache. Furthermore, this tier-1 JIT compiler serves as the
foundational target for future improvements.

In addition, we have developed a Python script that effectively traces
code templates and automatically generates JIT code templates. This
approach eliminates the need for manually writing duplicated code.

As shown in the performance analysis below, the tier-1 JIT compiler's
performance closely parallels that of QEMU in benchmarks with a
constrained dynamic instruction count. However, for benchmarks
featuring a substantial dynamic instruction count or lacking specific
hotspots—examples include pi and STRINGSORT—the tier-1 JIT compiler
demonstrates noticeably slower execution compared to QEMU.

Hence, a robust tier-2 JIT compiler is essential to generate optimized
machine code across diverse execution paths, coupled with a runtime
profiler for detecting hotspots.

* Perfromance
| Metric   | rv32emu-T1C | qemu  |
|----------+-------------+-------|
|aes	   |         0.02|  0.031|
|mandelbrot|	    0.029| 0.0115|
|puzzle	   |       0.0115|  0.009|
|pi        |       0.0413| 0.0177|
|dhrystone |	    0.331|  0.393|
|Nqeueens  |	    0.854|  0.749|
|qsort-O2  |	    2.384|   2.16|
|miniz-O2  |	     1.33|   1.01|
|primes-O2 |	     2.93|  1.069|
|sha512-O2 |	    2.057|  0.939|
|stream	   |       12.747|  10.36|
|STRINGSORT|       89.012| 11.496|

As demonstrated in the memory usage analysis below, the tier-1 JIT
compiler utilizes less memory than QEMU across all benchmarks.

* Memory usage
| Metric   | rv32emu-T1C |   qemu  |
|----------+-------------+---------|
|aes	   |      186,228|1,343,012|
|mandelbrot|	  152,203|  841,841|
|puzzle	   |      153,423|  890,225|
|pi        |      152,923|  879,957|
|dhrystone |	  154,466|  856,404|
|Nqeueens  |	  154,880|  858,618|
|qsort-O2  |	  155,091|  933,506|
|miniz-O2  |	  165,627|1,076,682|
|primes-O2 |	  150,540|  928,446|
|sha512-O2 |	  153,553|  978,177|
|stream	   |      165,911|  957,845|
|STRINGSORT|      167,871|1,104,702|

Related: sysprog21#238
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 15, 2023
When the using frequency of a block exceeds a predetermined threshold,
the tier-1 JIT compiler traces the chained block and generate
corresponding low quailty machine code. The resulting target machine
code is stored in the code cache for future utilization.

The primary objective of introducing the tier-1 JIT compiler is to
enhance the execution speed of RISC-V instructions. This implementation
requires two additional components: a tier-1 machine code generator,
and code cache. Furthermore, this tier-1 JIT compiler serves as the
foundational target for future improvements.

In addition, we have developed a Python script that effectively traces
code templates and automatically generates JIT code templates. This
approach eliminates the need for manually writing duplicated code.

As shown in the performance analysis below, the tier-1 JIT compiler's
performance closely parallels that of QEMU in benchmarks with a
constrained dynamic instruction count. However, for benchmarks
featuring a substantial dynamic instruction count or lacking specific
hotspots—examples include pi and STRINGSORT—the tier-1 JIT compiler
demonstrates noticeably slower execution compared to QEMU.

Hence, a robust tier-2 JIT compiler is essential to generate optimized
machine code across diverse execution paths, coupled with a runtime
profiler for detecting hotspots.

* Perfromance
| Metric   | rv32emu-T1C | qemu  |
|----------+-------------+-------|
|aes	   |         0.02|  0.031|
|mandelbrot|	    0.029| 0.0115|
|puzzle	   |       0.0115|  0.009|
|pi        |       0.0413| 0.0177|
|dhrystone |	    0.331|  0.393|
|Nqeueens  |	    0.854|  0.749|
|qsort-O2  |	    2.384|   2.16|
|miniz-O2  |	     1.33|   1.01|
|primes-O2 |	     2.93|  1.069|
|sha512-O2 |	    2.057|  0.939|
|stream	   |       12.747|  10.36|
|STRINGSORT|       89.012| 11.496|

As demonstrated in the memory usage analysis below, the tier-1 JIT
compiler utilizes less memory than QEMU across all benchmarks.

* Memory usage
| Metric   | rv32emu-T1C |   qemu  |
|----------+-------------+---------|
|aes	   |      186,228|1,343,012|
|mandelbrot|	  152,203|  841,841|
|puzzle	   |      153,423|  890,225|
|pi        |      152,923|  879,957|
|dhrystone |	  154,466|  856,404|
|Nqeueens  |	  154,880|  858,618|
|qsort-O2  |	  155,091|  933,506|
|miniz-O2  |	  165,627|1,076,682|
|primes-O2 |	  150,540|  928,446|
|sha512-O2 |	  153,553|  978,177|
|stream	   |      165,911|  957,845|
|STRINGSORT|      167,871|1,104,702|

Related: sysprog21#238
qwe661234 added a commit to qwe661234/rv32emu that referenced this issue Dec 22, 2023
We follow the template and API of the x64 backend to implement the A64
tier-1 JIT compiler.

* Performance
| Metric   | rv32emu-T1C |  qemu |
|----------+-------------+-------|
|aes       |        0.034|  0.045|
|puzzle    |       0.0115| 0.0169|
|pi        |        0.035|  0.032|
|dhrystone |        1.914|  2.005|
|nqueens   |         3.87|  2.898|
|qsort-O2  |        7.819| 11.614|
|miniz-O2  |        7.604|  3.803|
|primes-O2 |       10.551|  5.986|
|sha512-O2 |        6.497|  2.853|
|stream    |        52.25| 45.776|

As demonstrated in the memory usage analysis below, the tier-1 JIT
compiler utilizes less memory than QEMU across all benchmarks.

* Memory usage
| Metric   | rv32emu-T1C |   qemu  |
|----------+-------------+---------|
|aes       |      183,212|1,265,962|
|puzzle    |      145,239|  891,357|
|pi        |      144,739|  872,525|
|dhrystone |      146,282|  853,256|
|nqueens   |      146,696|  854,174|
|qsort-O2  |      146,907|  856,721|
|miniz-O2  |      157,475|  999,897|
|primes-O2 |      142,356|  851,661|
|sha512-O2 |      145,369|  901,136|
|stream    |      157,975|  955,809|

Related: sysprog21#238
jserv commented Dec 25, 2023

After the merge of the tier-1 JIT compiler, it is time to revisit our IR.

@jserv jserv added the enhancement New feature or request label Dec 26, 2023
@jserv jserv changed the title jit: Translate RISC-V into low-level code generators' IR jit: Transition from Linear to Graph-Based IR Dec 28, 2023
jserv commented Dec 28, 2023

Modern CPUs invest substantial effort in predicting indirect branches, but the Branch Target Buffer (BTB) is limited in size. Eliminating any form of indirect call or jump, including those through dispatch tables, is therefore greatly beneficial: contemporary CPUs are equipped with large reorder buffers that can process extensive code efficiently, provided branch prediction is effective. In larger programs with widespread use of indirect jumps, however, accurate branch prediction becomes increasingly difficult.

jserv commented Mar 2, 2024

FEX is an advanced x86 emulation frontend designed to run x86 and x86-64 binaries on Arm64 platforms, comparable to qemu-user. At the heart of FEX's emulation capability is FEXCore, which employs an SSA-based intermediate representation (IR) built from the input x86-64 assembly. SSA is particularly advantageous when translating x86-64 code to IR, during the optimization stages with custom passes, and when lowering the IR to FEX's CPU backends.

Key aspects of FEX's emulation IR include:

  • Precisely Defined IR Variable Sizes: It accommodates standard element sizes (1, 2, 4, 8 bytes, and certain 16-byte operations), as well as a flexible number of vector elements, distinguishing between float and integer operations based on the operation type.
  • Distinct Scalar and Vector IR Operations: Operations are clearly differentiated, such as scalar multiplication (MUL) vs. vector multiplication (VMUL).
  • Dedicated Load/Store Context IR Operations: These operations facilitate a clear distinction between guest memory and the monitored x86-64 state.
  • Specific CPUID IR Operation: Enables the return of complex data (data across four registers) and simplifies optimization for constant CPUID functions, allowing for further constant propagation.
  • Explicit Syscall Operation: Similar to the CPUID operation, this feature allows for efficient direct calls to the syscall handler by enabling constant propagation, reducing call overheads.
  • Branching Support within the IR: Includes conditional branching that either proceeds to the targeted branch or continues to the next block, and unconditional branching to jump directly to a specified block, aiming to align with LLVM semantics for block limitations without strict enforcement.
  • Debug Print Operation: For outputting values during debugging sessions.
  • Explicit Memory Access IR Operations: Designed for guest memory access, performing address translation into the virtual machine's memory space by adding the VM memory base to the 64-bit address. This approach allows for potential escape from the VM and is not deemed safe without JIT validation of the memory region for access correctness.

These features underscore FEX's design philosophy, emphasizing precise control, optimization flexibility, and efficient translation mechanisms within its emulation environment.

Reference: FEXCore IR

jserv commented Apr 18, 2024

The Java HotSpot Server Compiler (C2) utilizes a Sea-of-Nodes IR form designed for high performance with minimal overhead. LLVM takes a comparable approach with its control-flow graph (CFG): in textual IR presentations the CFG is not depicted as a literal 'graph' but through labels and jumps that outline the graph's edges. Likewise, a Sea-of-Nodes IR can be described in a linear textual format and only becomes a "graph" when loaded into memory. This allows flexibility in handling nodes without control dependencies, known as "floating nodes," which can be placed in any basic block in the textual format and reassigned in memory to retain their floating characteristic.

While the current tier-2 JIT compiler, built with LLVM, offers aggressive optimizations, it is also resource-intensive, consuming considerable memory and prolonging compilation times. An alternative, the IR Framework, emerges as a viable option that enhances performance while minimizing memory usage. This framework not only defines an IR but also offers a streamlined API for IR construction, coupled with algorithms for optimization, scheduling, register allocation, and code generation. The code generated in-memory can be executed directly, potentially increasing efficiency.

The Ideal Graph Visualizer (IGV) is a tool designed for developers to analyze and troubleshoot performance issues by examining compilation graphs. It specifically focuses on IR graphs, which serve as a language-independent bridge between the source code and the machine code generated by compilers.

@jserv
Contributor

jserv commented Aug 14, 2024

Inspired by rvdbt, we may adopt its QuickIR, a lightweight non-SSA internal representation used by the QMC compiler. QuickIR interacts with both local and global states; the former represents optimized temporaries, while the latter includes the emulated CPU state and any internal data structures attached to CPUState, a concept common to many emulators. The terms local and global also extend to control flow, where global branch instructions gbr and gbrind manage branches that escape the current translation region. If a particular instruction or its slowpath cannot be represented in QuickIR, a special hcall might be used to invoke a pre-registered guest runtime stub. These stubs are also generated from interpreter handlers, making it straightforward to extend the translated ISA without mandatory frontend support for new instructions.

QuickIR sample (1) - single basic block

00018c40:  slli   s2, s2, 8
00018c44:  or     s2, s2, s3
00018c48:  addi   s3, zr, 61
00018c4c:  jal    zr, 12
bb.0: succs[ ] preds[ ]
        #0 sll [@s2|i32] [@s2|i32] [$8|i32]
        #1 or  [@s2|i32] [@s2|i32] [@s3|i32]
        #3 mov [@s3|i32] [$3d|i32]
        #4 gbr [$18c58|i32]

QuickIR sample (2) - conditional branch representation

00018fc0:  lw     a5, a1, 0
00018fc4:  sw     zr, a1, 4
00018fc8:  addi   a6, a1, 0
00018fcc:  beq    a5, zr, 132
bb.0: succs[ 2 1 ] preds[ ]
        #0 mov [g:80|i32] [$18fc0|i32]
        #1 vmload:i32:s [@a5|i32] [@a1|i32]
        #2 mov [g:80|i32] [$18fc4|i32]
        #3 add [%32|i32] [@a1|i32] [$4|i32]
        #4 vmstore:i32:u [%32|i32] [$0|i32]
        #6 mov [@a6|i32] [@a1|i32]
        #9 brcc:eq [@a5|i32] [$0|i32]
bb.1: succs[ ] preds[ 0 ]
        #7 gbr [$18fd0|i32]
bb.2: succs[ ] preds[ 0 ]
        #8 gbr [$19050|i32]
  • g:80 - program counter location in the global CPUState, manually flushed by the frontend before translating the "unsafe" vmload instruction
  • %32 - temporary local register; the frontend may emit an arbitrary number of locals in a single region

@jserv jserv changed the title jit: Transition from Linear to Graph-Based IR jit: Transition from linear to more effective form Sep 9, 2024
@jserv
Contributor

jserv commented Sep 23, 2024

sovietov_graph_irs_2023.pdf
Slides from the talk "Graph-Based Intermediate Representations: An Overview and Perspectives" — a useful overview of the progression from linear code to dataflow IR.

@jserv jserv added this to the release-2024.2 milestone Oct 19, 2024