This PR optimizes our RISC-V instruction decoder by using big jump tables, via token threading, to the point that decoding takes just 2 instructions for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction takes 12 host instructions with GCC on AMD64 and with Clang on ARM64.
Here is the GCC x86_64 trace as proof:
And the Clang ARM64 trace:
In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.
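In case the technique is unfamiliar: token threading means every guest instruction is pre-decoded into a "token" that refers directly to its handler, so the execute loop reduces to loading the token for the current PC plus an indirect jump. Below is only a minimal, self-contained sketch of that idea; the names (`CPU`, `DecodedInstr`, `do_fence`, `run`) are invented for illustration and are not the emulator's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct CPU;
struct DecodedInstr;

// The "token" is a direct pointer to the instruction's handler.
using Handler = void (*)(CPU&, const DecodedInstr&);

struct DecodedInstr {
    Handler  handler;  // where execution jumps for this instruction
    uint32_t bits;     // raw instruction word (or pre-extracted operands)
};

struct CPU {
    uint64_t pc = 0;
    uint64_t regs[32] {};
};

// Stand-in handlers; a real emulator has one per instruction.
static void do_fence(CPU& cpu, const DecodedInstr&) { cpu.pc += 4; }
static void do_other(CPU& cpu, const DecodedInstr&) { cpu.pc += 4; }

// Decode once, ahead of execution: every instruction word becomes a token.
// A real decoder uses a large lookup table; here only the FENCE opcode
// (0b0001111) is distinguished, for brevity.
static DecodedInstr decode(uint32_t bits) {
    if ((bits & 0x7f) == 0x0f) return { do_fence, bits };
    return { do_other, bits };
}

// The hot loop: per executed instruction it is just "load the token for the
// current PC, indirect call". The handler itself advances PC. A real decoder
// cache would use 2-byte granularity to cover compressed instructions.
static void run(CPU& cpu, const std::vector<DecodedInstr>& cache,
                uint64_t base, size_t steps) {
    for (size_t i = 0; i < steps; ++i) {
        const DecodedInstr& di = cache[(cpu.pc - base) >> 2];
        di.handler(cpu, di);
    }
}

int main() {
    // A tiny pre-decoded program: two FENCE instructions.
    std::vector<DecodedInstr> cache = { decode(0x0000000f), decode(0x0000000f) };
    CPU cpu;
    run(cpu, cache, /*base=*/0, /*steps=*/2);
    return 0;
}
```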
Overall, the speedup varies between 1.2x and 2x across many benchmarks relative to emulator v0.18.1. Here are the results for a number of `stress-ng` benchmarks:

Benchmarks
You can see a 1.94x speedup for integer operations. Notably, I am able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.
The benchmark table was created by running `hyperfine` and `stress-ng`, for example:

This PR is an evolution of #226.