
Big jump tables #288

Merged
merged 0 commits into from
Dec 7, 2024

Conversation

@edubart (Contributor) commented Oct 29, 2024

This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction takes 12 host instructions with both GCC on AMD64 and Clang on ARM64.

Here is the GCC x86_64 trace as proof:

//// FENCE GCC x86_64 (2/12 instructions)
// increment mcycle (3 instructions)
=> 0x7ffff7a2e98c <loop+28108>:   add    $0x1,%r15                     // mcycle += 1
=> 0x7ffff7a2e990 <loop+28112>:   cmp    %r13,%r15                     // mcycle < mcycle_tick_end
=> 0x7ffff7a2e993 <loop+28115>:   jae    0x7ffff7a2f230 <loop+30320>   // -> break loop
// fetch (5 instructions)
=> 0x7ffff7a2e999 <loop+28121>:   mov    %r10,%rbx                     // pc
=> 0x7ffff7a2e99c <loop+28124>:   xor    %rbp,%rbx                     // pc ^ fetch_vaddr_page
=> 0x7ffff7a2e99f <loop+28127>:   cmp    $0xffd,%rbx                   // check fetch page
=> 0x7ffff7a2e9a6 <loop+28134>:   ja     0x7ffff7a27d00 <loop+320>     // -> miss fetch
=> 0x7ffff7a2e9ac <loop+28140>:   mov    (%r14,%rbp,1),%ebx            // insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode (2 instructions)
=> 0x7ffff7a2e9b0 <loop+28144>:   movzwl %bx,%ecx                      // insn & 0b1111111111111111
=> 0x7ffff7a2e9b3 <loop+28147>:   jmp    *(%r11,%rcx,8)                // -> jump to instruction
// execute (2 instructions)
=> 0x7ffff7a2ea3b <loop+28283>:   add    $0x4,%rbp                     // pc += 4
=> 0x7ffff7a2ea3f <loop+28287>:   jmp    0x7ffff7a2e98c <loop+28108>   // -> jump to loop begin

And here is the Clang arm64 trace:

//// FENCE Clang arm64 (2/12 instructions)
// increment mcycle
=> 0xfffff7b8a328 <loop+4568>:    add x25, x25, $0x1
=> 0xfffff7b8a32c <loop+4572>:    cmp x25, x27
=> 0xfffff7b8a330 <loop+4576>:    b.cs    0xfffff7b8e7a8 <loop+22104>
// fetch
=> 0xfffff7b8a334 <loop+4580>:    eor x19, x20, x28
=> 0xfffff7b8a338 <loop+4584>:    cmp x19, $0xffd
=> 0xfffff7b8a33c <loop+4588>:    b.hi    0xfffff7b89264 <loop+276>
=> 0xfffff7b8a340 <loop+4592>:    ldr w19, [x20, x22]
// decode
=> 0xfffff7b8a344 <loop+4596>:    and w10, w19, $0xffff
=> 0xfffff7b8a348 <loop+4600>:    ldr x16, [x24, x10, lsl $3]
=> 0xfffff7b8a34c <loop+4604>:    br  x16
// execute
=> 0xfffff7b8dde8 <loop+19608>:   add x20, x20, $0x4
=> 0xfffff7b8ddec <loop+19612>:   b   0xfffff7b8a328 <loop+4568>

In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.

Overall, the speedup varies between 1.2x and 2x across many benchmarks relative to emulator v0.18.1. Here are results for many benchmarks with stress-ng:

Benchmarks

| Times faster | Benchmark |
| --- | --- |
| 2.56 ± 0.03 | stress-ng --no-rand-seed --syscall 1 --syscall-ops 4000 |
| 2.15 ± 0.02 | stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 |
| 1.95 ± 0.00 | stress-ng --no-rand-seed --cpu 1 --cpu-method fibonacci --cpu-ops 400 |
| 1.94 ± 0.01 | stress-ng --no-rand-seed --cpu 1 --cpu-method int64 --cpu-ops 400 |
| 1.90 ± 0.01 | stress-ng --no-rand-seed --memcpy 1 --memcpy-ops 50 |
| 1.88 ± 0.02 | stress-ng --no-rand-seed --crypt 1 --crypt-method SHA-256 --crypt-ops 400000 |
| 1.87 ± 0.01 | stress-ng --no-rand-seed --qsort 1 --qsort-ops 5 |
| 1.83 ± 0.01 | stress-ng --no-rand-seed --memrate 1 --memrate-bytes 2M --memrate-ops 200 |
| 1.82 ± 0.03 | stress-ng --no-rand-seed --hash 1 --hash-ops 40000 |
| 1.75 ± 0.00 | stress-ng --no-rand-seed --heapsort 1 --heapsort-ops 3 |
| 1.72 ± 0.01 | stress-ng --no-rand-seed --zlib 1 --zlib-ops 20 |
| 1.66 ± 0.00 | stress-ng --no-rand-seed --matrix 1 --matrix-method mult --matrix-ops 20000 |
| 1.49 ± 0.02 | stress-ng --no-rand-seed --hdd 1 --hdd-ops 2000 |
| 1.41 ± 0.00 | stress-ng --no-rand-seed --fp 1 --fp-method floatadd --fp-ops 1000 |
| 1.33 ± 0.01 | stress-ng --no-rand-seed --fma 1 --fma-ops 40000 |
| 1.24 ± 0.01 | stress-ng --no-rand-seed --trig 1 --trig-ops 50 |
| 1.16 ± 0.01 | stress-ng --no-rand-seed --fork 1 --fork-ops 1000 |
| 1.14 ± 0.01 | stress-ng --no-rand-seed --malloc 1 --malloc-ops 40000 |

You can see a 1.94x speedup for integer operations. Notably, I am able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.

The table of benchmarks was created by running hyperfine with stress-ng, for example:

$ hyperfine -w 1 'cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400' '/usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400'
Benchmark 1: cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      2.225 s ±  0.021 s    [User: 2.213 s, System: 0.010 s]
  Range (min … max):    2.197 s …  2.257 s    10 runs
 
Benchmark 2: /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      4.615 s ±  0.041 s    [User: 4.602 s, System: 0.009 s]
  Range (min … max):    4.561 s …  4.682 s    10 runs
 
Summary
  cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 ran
    2.07 ± 0.03 times faster than /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400

This PR is an evolution of #226.

@edubart edubart added the enhancement New feature or request label Oct 29, 2024
@edubart edubart self-assigned this Oct 29, 2024
@edubart edubart force-pushed the feature/big-jump-tables branch 2 times, most recently from 1e261ce to 67f62d4 Compare November 2, 2024 21:31
@edubart edubart force-pushed the feature/big-jump-tables branch from c390506 to 2e1555a Compare December 7, 2024 19:31
@edubart edubart merged commit 2e1555a into edubart Dec 7, 2024
7 checks passed
@edubart edubart deleted the feature/big-jump-tables branch December 7, 2024 19:39