euphoric-hardware/riscv-functional-sim
Generated Functional Simulation (SAIL-Spike)

Background and Goals

We want to build a RISC-V instruction set simulator (ISS) from first principles.

Support Many Modes of Operation

We want to support each of the following modes, both with a standalone top and as a library.

Master

  • ISS executes a binary directly
  • As a library: still master, but can be controlled by custom top
    • Can dump traces into buffer for top to analyze
    • Can checkpoint / restore / rewind
    • Can emulate time from external source (e.g. sampled RTL simulation, uArch perf model)

Ganged / CoSim

  • ISS executes a binary and emulates all arch state, but not as strict master. It receives arch events from RTL simulation and determines whether the next instruction group to commit in RTL sim is legal and matches the expected commit from the ISS.
  • How much of the RTL is verified in ganged simulation varies widely, depending on which SoC components are simulated exactly in the ISS vs. simply 'believed' from RTL simulation.
    • For instance, a DMA engine can be modeled exactly in the ISS, or the transactions to/from the DMA engine in RTL can be simply replayed in the ISS (eliding verification of the DMA engine's behavior itself).

Slave

  • ISS acts as a trace ingester from RTL sim / trace of another execution
  • All SoC components and arch state are still modeled. The trace can contain partial information about the SoC (e.g. only the core / DRAM state can be reconstructed).
  • In this mode, the ISS is used as a library and the top-level peeks the reconstructed arch state as needed (e.g. for trace-driven profiling / flamegraph construction)
  • We can use this mode for replay-based single-stepping of the SoC, one instruction at a time

Symbolic execution

  • The modeled arch state is a mix of concrete and symbolic state
  • This works similar to the slave mode, except the state update rules are computed symbolically
  • This is useful for information flow tracking and memory trace reconstruction, among other things

Other things

There are a bunch of other use-cases and features we wish to support that are poorly supported in the current spike + Chipyard world.

  • Exact SoC modeling: all undefined / vague behaviors pinned down. All SoC components and their arch state are modeled.
    • An identical setup in the ISS that matches the SoC exactly
    • RTL that's generated should be driving the parameterization of the functional sim (not the other way around)
    • First-class support for passing a dts and bootrom into the functional sim from the RTL generator
  • Checkpoint / restore: serialization/deserialization of arch state + testbench component / IO model state. No loss of information.
  • Trace analysis: writing generic analysis passes using a generic ISA IR. Ability to dump execution traces into a trace buffer controlled and drained by a custom top.
  • Sampled simulation: a custom top that leverages the above for sampled RTL simulation for accurate performance trace estimation.
  • Instruction generation: for DV or fuzzing a RISC-V DUT.
  • Formal equivalence checking: similar to riscv-formal.
  • RTL generation: targeting a simple single-cycle core model.
  • Coverage analysis: given a trace, track code path coverage within the ISS + instruction-level coverage (see RISC-V ISAC)
  • High performance disassembler
    • Disassembles execution traces into Rust-native structures either based on instruction encoding type (R, I, ...) or semantic instruction type (arithmetic, memory access, control flow, etc.)
    • Leverage the host's SIMD ISA for high performance decoding
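The encoding-type / semantic-type split mentioned above can be sketched as a small decoder. This is a minimal sketch: the bit positions follow the RV32/RV64 base encoding (major opcode in bits [6:0]), but the `Category` enum and function names are illustrative, not part of any existing codebase, and a real disassembler would decode far more fields.

```rust
// Classify a 32-bit RISC-V instruction word by its major opcode (bits [6:0])
// into a coarse semantic category.
#[derive(Debug, PartialEq)]
enum Category {
    Arithmetic,  // OP (0b0110011) and OP-IMM (0b0010011)
    Memory,      // LOAD (0b0000011) and STORE (0b0100011)
    ControlFlow, // BRANCH (0b1100011), JAL (0b1101111), JALR (0b1100111)
    Other,
}

fn classify(insn: u32) -> Category {
    match insn & 0x7f {
        0b0110011 | 0b0010011 => Category::Arithmetic,
        0b0000011 | 0b0100011 => Category::Memory,
        0b1100011 | 0b1101111 | 0b1100111 => Category::ControlFlow,
        _ => Category::Other,
    }
}
```

SIMD acceleration would vectorize exactly this kind of mask-and-compare over a batch of instruction words.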

Unification of Testbench/IO Models

We should unify models between all simulation backends (ISS, RTL simulation, FPGA prototyping, FPGA-based emulation / Firesim, ASIC-based emulation) + reality (testchip bringup). This includes: fesvr + IO models + everything on the edge of the RISC-V target. The current state of fesvr + IO models is quite unified, but not sufficient since we need exact state checkpointing and restore (and ideally no more C++).

There are some challenges like accurate checkpointing + restore for stateful non-DUT components, especially if we use Rust's coroutines. On the Chipyard RTL simulation side, we also need to make top-level ports explicit (no internal DPIs).

EZ Custom Tops

We should have a basic top that works like Spike, but we should support library usage where users can write their own top.

Custom tops with spike are a pain. We want to simplify it. Dromajo does a better job, but we can do even better.

High Performance

We're sure there are many tricks here (faster instruction decoding, caching, basic block-granularity execution) that are played by NEMU but not spike. We anticipate we can build an ISS that can run at 500+ MIPS, which could obviate DBT.

Principled Discrete Event Simulation

Spike uses an ad-hoc mechanism of multiple host threads and switch_to() calls to emulate parallel simulation threads (e.g. switching between fesvr, IO models, and the target RISC-V core, each with its own context and stack). Ideally, we can leverage an actual discrete event simulation framework (like DAM) and remove these host thread switching hacks (or build something on top of tokio).

One challenge is to integrate this with the Chipyard RTL simulation environment and Firesim. This also needs to play nicely with serialization of IO/testbench model state, which seems very tricky, if not impossible. Perhaps the only way to make this easy is to force all state to be in the RTL abstraction or serializable software datastructures and keep all the instant update rules as regular arbitrary Rust code. This implies that state machines must be explicitly constructed however, which is a big annoyance.

Generated ISS

Ideally we don't want to build a point implementation, but rather an ISS generator that consumes a formal spec of the ISA. We would like to (eventually) avoid a hand-written ISA implementation (like in spike or qemu). This is very idealistic and many prior attempts have been made (e.g. riscv-sail, pydrofoil), but none can achieve high performance and ease of integration with custom tops.

Dynamic Binary Translation (DBT) Mode

For maximum performance, there is no substitute for host-ISA codegen (dynamic binary translation). Since we don't need to support multiple ISAs, we can avoid an intermediary layer like in qemu (i.e. TCG-IR). Since we're using Rust, it would be great to use the Cranelift IR and JIT!

Prior Work

ISS

DBT

Architectural Description Languages / Generated ISS

Background:

Existing tools and languages:


Development plans

Functional simulator steps (Safin, Ansh, Pramath)

  • Spike
    • Make sure you can run baremetal binaries in Spike (riscv-tests are a good start)
  • Define bare-minimum architectural state and simple execution logic of each processor
    • RISC-V user mode specification
    • For the architectural state, start off with: program counter (PC), register file
      • In Spike, this is defined in state_t in riscv/processor.h
      • The RISC-V specification supports two datapath widths: 32 bits vs 64 bits
        • The register file (and other following architectural states) must be parameterized to support both bitwidths
        • However, let's not worry too much about 32 bits at the moment. Get started with the 64-bit architecture first
      • The architectural state has to be trivially serializable
        • Various use cases: sampled simulation, verification and debug
        • Should use Rust's type class derivation
    • Write interpretation logic for a processor that can execute RV32G/RV64G instructions (look in the RISC-V specification)
      • RV32I
      • RV64I
      • RV32/64G
      • In Spike, this is defined in processor_t::step in riscv/execute.cc
      • Types of instructions
        • Integer computational instructions: add, sub, shift left/right
        • Control flow instructions: branch, jump
        • Hint instructions (ignore in this step)
        • Load and store instructions (ignore in this step)
        • Fence (ignore in this step)
      • Details
        • Fetch instruction at PC
        • Decode the fetched instruction
        • Interpret the instruction and update the register file state accordingly
        • Update PC
      • The input format can be a file containing instructions
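The architectural state and the fetch/decode/execute step described above might look roughly like this. A minimal sketch under heavy assumptions: RV64 only, PC + register file only, ADD/SUB only, fetch stubbed out (the instruction word is passed in directly), and all struct/method names are illustrative rather than from any existing codebase.

```rust
// Bare-minimum architectural state: program counter + integer register file.
// A real implementation would derive a serialization trait here as well.
#[derive(Clone, Debug)]
struct State {
    pc: u64,
    regs: [u64; 32], // x0..x31; x0 is hardwired to zero
}

impl State {
    fn new() -> Self {
        State { pc: 0, regs: [0; 32] }
    }

    fn write_reg(&mut self, rd: usize, val: u64) {
        if rd != 0 {
            self.regs[rd] = val; // writes to x0 are discarded
        }
    }

    // One interpretation step. Fetch is stubbed: `insn` is the raw word.
    fn step(&mut self, insn: u32) {
        let rd = ((insn >> 7) & 0x1f) as usize;
        let rs1 = ((insn >> 15) & 0x1f) as usize;
        let rs2 = ((insn >> 20) & 0x1f) as usize;
        match insn & 0x7f {
            0b0110011 => {
                // R-type OP: only ADD/SUB here, selected by funct7 bit 30
                let (a, b) = (self.regs[rs1], self.regs[rs2]);
                let v = if (insn >> 30) & 1 == 1 {
                    a.wrapping_sub(b)
                } else {
                    a.wrapping_add(b)
                };
                self.write_reg(rd, v);
                self.pc = self.pc.wrapping_add(4);
            }
            _ => self.pc = self.pc.wrapping_add(4), // unimplemented: skip
        }
    }
}
```

Parameterizing the state over a 32/64-bit width (e.g. via a generic XLEN type) would slot in where `u64` appears here.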
  • Support memory instructions
    • Define memory and connect it with the above processor to support load and store instructions
    • The host memory backing the simulation shouldn't be as large as the physical memory that we are emulating
      • Programs normally don't use all available physical memory
      • In Spike, the physical memory is defined as sparse_memory_map in devices.h
      • Similarly, we can use a dictionary to implement physical memory
    • The memory state is also a part of the architectural state. Hence, this should be easily serializable as well
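The sparse dictionary-backed memory suggested above can be sketched like this: backing storage is allocated page-by-page on first touch, so the host only pays for pages the program actually uses. A minimal sketch; the page size, struct names, and read-unbacked-as-zero policy are illustrative assumptions.

```rust
use std::collections::HashMap;

const PAGE_SIZE: u64 = 4096;

// Sparse physical memory: a dictionary from page number to page contents,
// analogous in spirit to Spike's sparse memory map.
struct SparseMem {
    pages: HashMap<u64, Box<[u8; PAGE_SIZE as usize]>>,
}

impl SparseMem {
    fn new() -> Self {
        SparseMem { pages: HashMap::new() }
    }

    fn write_u8(&mut self, addr: u64, val: u8) {
        // Allocate the page lazily on first write.
        let page = self
            .pages
            .entry(addr / PAGE_SIZE)
            .or_insert_with(|| Box::new([0u8; PAGE_SIZE as usize]));
        page[(addr % PAGE_SIZE) as usize] = val;
    }

    fn read_u8(&self, addr: u64) -> u8 {
        self.pages
            .get(&(addr / PAGE_SIZE))
            .map_or(0, |p| p[(addr % PAGE_SIZE) as usize]) // unbacked reads as 0
    }
}
```

Because the state is just a map of byte arrays, serializing it for checkpoints is straightforward.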
  • Move on to the supervisor mode instructions
  • Support interrupts
    • Interrupts and exceptions in RISC-V are handled by writing to control and status registers (CSRs)
    • CSRs are also a part of the architectural state (and hence should be serializable)
    • There are two types of interrupts: core local interrupts and platform level interrupts
      • Core local interrupts are handled by a device called the CLINT in the SoC. Examples are software interrupts and timer interrupts
      • Platform level interrupts are handled by a device called the PLIC in the SoC. It is used by the cores to interact with IO devices
      • When an interrupt signal coming into the core goes high, the corresponding bit in the mip CSR is set and the core handles the interrupt
      • A detailed description of the CSRs related to interrupts and exceptions is on page 6 of the "SiFive Interrupt Cookbook"
    • Example use case of CLINT: target boot process. A typical target boot process looks like this:
      • The host machine loads the binary into the target system using FESVR. While this is happening the cores in the target are spinning, waiting for an interrupt
      • Once this is done, the host machine sends a message to the target where the endpoint address is the CLINT
      • The CLINT receives the message, raises the interrupt signal, the core PC jumps to the starting address of the program
    • Implement a CLINT
      • In Spike, this is defined in riscv/clint.cc
      • The state of CLINT is also a part of the architectural state!
    • The CLINT also has a range of addresses that can be used to send messages to it
      • A typical address of the CLINT is 0x2000000
      • The address of the CLINT should also be included in the DTS!
    • At this point, we can try hooking the emulation framework w/ FESVR to run simple binaries
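The mip-based mechanism above can be sketched as a small pending-interrupt check: an incoming line sets a bit in mip, and the core takes the interrupt only if the matching mie bit and the global MIE bit in mstatus are set. The bit positions (MSIP=3, MTIP=7) and the software-before-timer priority follow the privileged spec; the struct and method names are illustrative.

```rust
const MSIP: u64 = 1 << 3; // machine software interrupt pending (from CLINT)
const MTIP: u64 = 1 << 7; // machine timer interrupt pending (from CLINT)
const MSTATUS_MIE: u64 = 1 << 3; // global machine interrupt enable

struct Csrs {
    mip: u64,
    mie: u64,
    mstatus: u64,
}

impl Csrs {
    // Called when a device (e.g. the CLINT) raises an interrupt line.
    fn raise(&mut self, bit: u64) {
        self.mip |= bit;
    }

    // Returns the highest-priority pending-and-enabled interrupt bit, if any.
    fn pending(&self) -> Option<u64> {
        if self.mstatus & MSTATUS_MIE == 0 {
            return None; // interrupts globally disabled
        }
        let p = self.mip & self.mie;
        [MSIP, MTIP].into_iter().find(|b| p & b != 0)
    }
}
```

Since the CSRs are plain integers, this state serializes as trivially as the register file.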
  • Support exception handling
    • On an exception, these CSRs must be written
      • mepc: PC of the instruction that caused the exception
      • mcause: trap cause
      • mtval: trap value
    • Add these CSRs and check if exception handling works correctly
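The machine-mode trap entry implied by the CSR list above can be sketched in a few lines: save the faulting PC into mepc, record the cause and trap value, and redirect the PC to the handler base held in mtvec (direct mode only; vectored mode is omitted). The struct and function names are illustrative.

```rust
struct TrapCsrs {
    mepc: u64,
    mcause: u64,
    mtval: u64,
    mtvec: u64,
}

fn take_trap(pc: &mut u64, csrs: &mut TrapCsrs, cause: u64, tval: u64) {
    csrs.mepc = *pc;         // PC of the instruction that trapped
    csrs.mcause = cause;     // e.g. 2 = illegal instruction
    csrs.mtval = tval;       // e.g. the offending instruction bits
    *pc = csrs.mtvec & !0x3; // direct mode: jump to the handler base
}
```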
  • Support virtual memory
    • To support virtual memory, we need to implement TLBs
    • satp
    • pmp
    • More on this later...
  • Critical extensions

Rearchitecting FESVR (Safin)

  • Front end server (FESVR) background
    • Chipyard FESVR documentation
    • FESVR acts as a bridge between the host system (which is running the simulation) and the target system (simulated RISC-V SoC)
    • FESVR performs tasks such as loading the binary into the target system, handling target system calls, and exchanging data between the host and the target (e.g., print messages)
      • Program loading
        • Loads the RISC-V binary into the simulated system
        • Provides the simulation w/ necessary arguments
      • System call proxying
        • riscv pk
        • There are cases when the program running inside the simulator has to perform syscalls related to IO (e.g. prints, opening network sockets)
        • This program emulates these syscalls in the host machine
    • FESVR is shared across a wide range of simulation frameworks: Spike (functional sim), Chipyard sims (RTL sim), FireSim (FPGA Sim)
    • FESVR architecture
      • There are two threads when running simulations: the host thread and the target thread
      • The host thread performs the FESVR functionalities mentioned above: Program loading and syscall proxying
      • The target thread is responsible for executing the target binary in the simulated system
      • There is a buffer where each thread reads/writes messages
      • Example 1: let's say the host wants to write the binary into the target system's memory
        • Host thread instructs FESVR to write the binary to the target's DRAM address
        • Once this is done, the host thread instructs FESVR to write to the CLINT to indicate that the binary loading is finished
        • Halt the host thread and switch to the target thread
        • Target thread starts executing the simulation
      • Example 2: let's say the target wants to print a message
        • Target thread performs a printf
        • The printf contains instructions that write to a "magic address" that is connected to a target -> host buffer
        • Target thread halts and switches to the host thread
        • Host thread reads the message, emulates the printf behavior, and returns the control back to the target thread
  • It would be nice if we could replace the host/target threads with coroutines
    • Write a custom sim_t in riscv/sim.cc so that it doesn't inherit htif_t, but uses processor_t, mems, clint, plic bus, etc for instruction execution
    • Rewrite FESVR in Rust using async libraries such as tokio
      • We can use elf reading libraries such as rust-elf or elfio
      • Don't have to think about DTM based HTIF for now. Can just implement the TSI protocol based interface
    • Write rust bindings between the rewritten FESVR & the custom sim_t and see if we can run RISC-V binaries
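The host/target message buffer described in the examples above can be sketched with plain std threads and channels (tokio tasks would replace the threads in the actual rewrite). The `ToHost` message type, the `run_handshake` function, and the modeling of the magic-address store as a channel send are all illustrative assumptions.

```rust
use std::sync::mpsc;
use std::thread;

// Messages flowing through the target -> host buffer.
enum ToHost {
    Print(String),
    Exit(u32),
}

// Run the handshake and return everything the "host" side printed.
fn run_handshake() -> Vec<String> {
    let (tx, rx) = mpsc::channel::<ToHost>();

    // "Target thread": the simulated core's store to the magic address is
    // modeled here as sending a message to the host.
    let target = thread::spawn(move || {
        tx.send(ToHost::Print("hello from target".into())).unwrap();
        tx.send(ToHost::Exit(0)).unwrap();
    });

    // "Host thread" (FESVR side): drain the buffer and emulate the syscalls.
    let mut printed = Vec::new();
    for msg in rx {
        match msg {
            ToHost::Print(s) => printed.push(s),
            ToHost::Exit(_) => break,
        }
    }
    target.join().unwrap();
    printed
}
```

With coroutines/async tasks instead of OS threads, the same handshake becomes cheaply checkpointable, which is the point of the rewrite.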

Architectural Definition Language

  • Generating rust code: we can use the syn library to represent arbitrary Rust ASTs for code generation
  • Scala embedded DSL for the specification language
    • Need a way of interpreting the language to generate code for functional simulation
      • An add instruction cannot be represented as a Scala add, as we must interpret the operation for code gen
    • Need a clear separation of architectural state and update rules
      • For unprivileged instructions, this is straightforward
      • Privileged instructions are where this might become challenging
        • Virtual memory: this seems quite doable. Need to write rules for SATP & TLB updates
        • Interrupt & exception handling
        • Expressing the behaviors of PLIC & CLINT

Approach 1: Use Chisel as the frontend, but build a custom interpreter

  • Can reuse a lot of the Chisel constructs like Vec, Bundle, UInt
  • Bridge the in-memory representation of CIRCT into FIRRTL2 and write FIRRTL passes that will emit components for the functional simulation
    • The compiler can take in the DTS of the SoC that you want to model, and compose the architectural states accordingly
    • What about undefined behaviors? If we bridge to FIRRTL2 in high FIRRTL, DontCares aren't blasted into zeros so we can reinterpret this
  • Potential downsides
    • Difficult (or impossible) to utilize the host type system which naturally leads to lower ergonomics
    • This also means that we have to describe every instruction by hand (as we cannot use type class derivation)
    • Also, if we were to just use Chisel, we would need to use stuff like annotations which will make the codebase messy and confusing
  • One benefit of this approach is that it may be easier to generate performance models along with the functional model in the future
    • As we can describe branch predictor structures using existing Chisel, perhaps we can add additional passes in the interpreter to add these models as part of the functional simulator
    • Can be used for generating embeddings for sampled simulation or high level workload analysis based on traces
  • Logistical thoughts
    • To get started initially, this might be easier as we can just simply write Chisel in a specific way
    • Bridging into FIRRTL2 is not fun, but very doable
    • Getting started on the interpreter is quick (probably less guidance required), but subsequent progress may not be as fast as with approach 2

Approach 2: DSL for defining the architectural spec

  • The interpreter implementation may be a bit cleaner as we can use type class derivation
  • A high level sketch may look something like this:
```scala
case class ProcessorState(
    pc: UInt,
    rf: Vec[UInt],
    ...)

// Type class: the update rule for an instruction is derived from its
// product-type structure (e.g. `Add`) by the Scala compiler
trait Instruction[T] {
    def updateRule(insn: T, p: ProcessorState): ProcessorState
}

case class Add(rs1: UInt, rs2: UInt, rd: UInt, op: (UInt, UInt) => UInt) derives Instruction
```
  • However, we must redefine basic datatypes such as UInt, Vec, Bundle
    • I don't think this is such a big deal. This shouldn't take too much time as we only want a limited set of primitives
    • Defining aggregate types may be a pain. But, do we really need aggregate types like in Chisel? For this particular DSL, I think it is perfectly fine to define aggregate types as a product type (for a HDL, I do think this is problematic due to ergonomics, but it really doesn't matter here)
    • It is also beneficial in that it is easier to enforce "correct behavior" to the spec writers by using the host language's type system
  • Logistical thoughts
    • In the long term, I do prefer this to the above approach. Approach 1 just feels like more of a hack on top of Chisel rather than a clean implementation
    • However, this might feel quite difficult and they can lose motivation unless we guide them aggressively. We may even have to work on the initial implementation to get them going

Configurability in functional simulation

  • For cases when the functional simulation runs in ganged mode, trace-driven mode, or runahead mode (for sampling), we want the functional simulator device tree configuration to look identical to the actual SoC configuration
  • Spike is problematic in that aligning the functional/RTL simulation configuration requires various hacks/modifications
    • Spike has its own bootrom which is the first source of divergence between RTL & functional simulation
    • For baremetal binaries, Spike boots without FESVR having to send an interrupt to the CLINT. This is another source of divergence that we would like to eliminate
    • The API to add IO devices is not clean, as device models are loaded into the simulation as dynamically linked libraries. However, there is no need for device models to be dynamically loaded in the first place

How I would like the configuration system to look

  • There should be two modes

DTS generated mode

  • The functional simulation configuration is "generated" from the device tree source (DTS) of the SoC that you want to model
  • As mentioned above, this is useful for ganged simulation, trace-driven simulation, and runahead mode
  • Parse the DTS file: fdt
  • Generate a bus hierarchy according to the DTS

Default mode

  • The functional simulation uses a pre-determined DTS configuration with minimal IO device models (UART)
  • The device models should not be dynamically linked libraries. Device models should be statically compiled into the binary and the top level should expose runtime flags to add/change device configurations
  • Use case of this mode is for simple software verification/debugging
  • We can think of the default mode as where there is no DTS provided and we are using a preconfigured DTS

Implementation

  • Need a way of registering all the possible device models and searching for matching ones from the DTS
  • Need to implement a "bus" struct that has APIs for
    • Adding new devices on the bus and registering their address ranges
    • Receiving load/store requests and routing them to the correct device
    • Returning a response indicating an access fault when a request targets an invalid address
  • Possible devices include: cores, DRAM, CLINT, PLIC, NIC, block device, UART, bootrom ...
  • Future work: I would also like this "bus" struct to be able to defer certain transactions until there is a hint from the top level. This can be useful for ganged simulation
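The "bus" struct APIs listed above can be sketched as follows: devices register an address range, loads/stores are routed to the owning device, and accesses that no device claims come back as an access fault. The trait and struct names (and the toy `Ram` device) are illustrative, not from any existing codebase.

```rust
// A device receives accesses at offsets relative to its base address.
trait Device {
    fn load(&mut self, offset: u64) -> u64;
    fn store(&mut self, offset: u64, val: u64);
}

#[derive(Debug, PartialEq)]
enum BusError {
    AccessFault(u64), // no device claims this address
}

// (base, size, device) triples; kept unsorted for simplicity.
struct Bus {
    devices: Vec<(u64, u64, Box<dyn Device>)>,
}

impl Bus {
    fn new() -> Self {
        Bus { devices: Vec::new() }
    }

    fn attach(&mut self, base: u64, size: u64, dev: Box<dyn Device>) {
        self.devices.push((base, size, dev));
    }

    // Find the device owning `addr` and the offset into it.
    fn route(&mut self, addr: u64) -> Option<(&mut Box<dyn Device>, u64)> {
        self.devices
            .iter_mut()
            .find(|(base, size, _)| addr >= *base && addr < base + size)
            .map(|(base, _, dev)| (dev, addr - *base))
    }

    fn load(&mut self, addr: u64) -> Result<u64, BusError> {
        self.route(addr)
            .map(|(d, off)| d.load(off))
            .ok_or(BusError::AccessFault(addr))
    }

    fn store(&mut self, addr: u64, val: u64) -> Result<(), BusError> {
        self.route(addr)
            .map(|(d, off)| d.store(off, val))
            .ok_or(BusError::AccessFault(addr))
    }
}

// Toy word-addressable RAM device for illustration.
struct Ram(Vec<u64>);
impl Device for Ram {
    fn load(&mut self, offset: u64) -> u64 {
        self.0[(offset / 8) as usize]
    }
    fn store(&mut self, offset: u64, val: u64) {
        self.0[(offset / 8) as usize] = val;
    }
}
```

The deferred-transaction hook for ganged simulation could sit inside `route`, queuing the access instead of dispatching it when the top level asks for it.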
