Gem Forge Framework
Gem-Forge was originally developed to accelerate architectural design space exploration with both trace-based and execution-based simulation. Unlike other trace-based simulators, which work on traces of assembly instructions, Gem-Forge is built on traces of LLVM IR instructions. This lets users directly leverage various compiler analyses and access high-level information that is lost when compiling to low-level assembly code. Users can manipulate LLVM IR traces to explore architectural changes, e.g. replacing an instruction sequence with a new complex instruction, vectorizing, etc. Both the vanilla trace and the modified one can be simulated by a modified CPU model in gem5, which reports the performance results. The overall workflow can be summarized as follows:
- Collect Traces
  Gem-Forge provides an LLVM pass that instruments the program with calls to a runtime library. When the instrumented program runs, it reads some environment variables and dumps a trace of LLVM IR instructions in compressed Protobuf format. The runtime tracer can be configured in two modes: profiling mode, which dumps only the instructions to keep the trace small, and detailed mode, which records both the instructions and all their operands. Typically, users first run the program in profiling mode to identify hot regions (e.g., using SimPoint), and then collect detailed traces for those hot regions. Detailed mode records the actual values of instructions' operands and results, which many analyses depend on, e.g. recognizing memory access patterns. The tracer is implemented in transform/src/trace.
- Transform Traces
  After collecting a trace, you can transform it to reflect the architectural changes you want to explore. For example, suppose your new CPU has a special functional unit for multiply-and-accumulate, and you want to evaluate how much a benchmark would benefit from this feature without actually implementing the compiler analysis and backend support. In such cases, Gem-Forge is handy: you can simply create a "fake" mul-acc instruction and use it to replace the matching instruction sequences in the trace. The transformed trace can then be simulated in Gem5 to validate your assumption. We already provide many transformations, e.g. stream analysis, vectorization, etc. They are located in transform/src, with a separate folder for each transformation.
- Simulate Traces
  Finally, traces are simulated in Gem5. We add a new GemForgeCPU model in gem5/src/cpu/gem_forge, which takes in an LLVM IR trace and feeds it into an out-of-order pipeline. This CPU is divided into five stages (fetch, decode, rename, IEW (issue/execute/writeback), and commit), similar to Gem5's O3 CPU.
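Detailed mode matters because pattern analyses need the recorded values, not just the opcodes. As an illustration, the Python sketch below recovers a constant-stride access pattern from the addresses each static load touched in a loop. The record layout (a static instruction ID plus the accessed address) is a hypothetical simplification, not Gem-Forge's actual Protobuf trace schema, and real stream analysis is considerably more involved.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical, simplified detailed-trace record: (static instruction id, accessed address).
# The real Gem-Forge trace is a compressed Protobuf stream with a richer schema.
TraceRecord = Tuple[str, int]


def detect_strides(trace: List[TraceRecord]) -> Dict[str, Optional[int]]:
    """For each static memory instruction, return its constant stride in bytes,
    or None if its consecutive dynamic accesses do not share a single stride."""
    addrs: Dict[str, List[int]] = defaultdict(list)
    for inst_id, addr in trace:
        addrs[inst_id].append(addr)

    result: Dict[str, Optional[int]] = {}
    for inst_id, seq in addrs.items():
        deltas = {b - a for a, b in zip(seq, seq[1:])}
        # Exactly one distinct delta means a constant-stride (affine) pattern.
        result[inst_id] = deltas.pop() if len(deltas) == 1 else None
    return result


if __name__ == "__main__":
    # Dynamic trace of a loop that loads a[i] (4-byte ints) and b[idx[i]] (irregular).
    trace = [
        ("load_a", 0x1000), ("load_b", 0x8040),
        ("load_a", 0x1004), ("load_b", 0x8008),
        ("load_a", 0x1008), ("load_b", 0x80F0),
    ]
    print(detect_strides(trace))  # {'load_a': 4, 'load_b': None}
```

A profiling-mode trace without addresses could not distinguish these two loads, which is why detailed traces are collected for the hot regions.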
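The mul-acc example from the workflow can be made concrete with a small Python sketch: it scans a straight-line instruction trace for a multiply whose only reader is the immediately following add, replaces the pair with a single fake `muladd` instruction, and compares a crude cycle count before and after. The `Inst` format, the `muladd` opcode, the latency table, and the serial cost model are all illustrative assumptions. A real Gem-Forge transformation operates on LLVM IR traces, and a real performance estimate would come from simulating the trace in the GemForgeCPU model.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Inst:
    op: str                # opcode, e.g. "load", "mul", "add"
    dest: str              # name of the value this instruction defines
    srcs: Tuple[str, ...]  # names of the values it reads


# Illustrative latencies in cycles; not taken from Gem-Forge or any real CPU.
LATENCY: Dict[str, int] = {"load": 4, "mul": 3, "add": 1, "muladd": 3}


def fuse_mul_add(trace: List[Inst]) -> List[Inst]:
    """Replace `t = mul a, b; d = add t, c` with a fake `d = muladd a, b, c`
    when the add is the only reader of t. Assumes each name is defined once."""
    uses: Dict[str, int] = {}
    for inst in trace:
        for s in inst.srcs:
            uses[s] = uses.get(s, 0) + 1

    out: List[Inst] = []
    i = 0
    while i < len(trace):
        cur = trace[i]
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if (cur.op == "mul" and nxt is not None and nxt.op == "add"
                and cur.dest in nxt.srcs and uses.get(cur.dest, 0) == 1):
            addend = tuple(s for s in nxt.srcs if s != cur.dest)
            out.append(Inst("muladd", nxt.dest, cur.srcs + addend))
            i += 2  # both the mul and the add were consumed
        else:
            out.append(cur)
            i += 1
    return out


def serial_cycles(trace: List[Inst]) -> int:
    """Crude cost model: one instruction at a time, latencies simply summed."""
    return sum(LATENCY[inst.op] for inst in trace)


if __name__ == "__main__":
    # Two unrolled iterations of a dot product: s += a[k] * b[k].
    trace = [
        Inst("load", "a0", ("pa0",)), Inst("load", "b0", ("pb0",)),
        Inst("mul", "t0", ("a0", "b0")), Inst("add", "s1", ("t0", "s0")),
        Inst("load", "a1", ("pa1",)), Inst("load", "b1", ("pb1",)),
        Inst("mul", "t1", ("a1", "b1")), Inst("add", "s2", ("t1", "s1")),
    ]
    fused = fuse_mul_add(trace)
    print(serial_cycles(trace), serial_cycles(fused))  # 24 22
```

The use-count check mirrors the constraint a real transformation must respect: fusing is only safe when no other instruction still needs the intermediate product.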
Although trace-based simulation lets you quickly verify an idea and its potential benefits, it is less accurate than execution-based simulation and serves only as a first-order approximation. This is especially true for LLVM IR traces: LLVM IR assumes an unlimited number of registers, so register spills and fills never appear in the trace. In addition, misspeculated paths are not in the trace, which further reduces accuracy. Therefore, after validating an idea with trace-based simulation, users may want to switch to the execution-based workflow to fully examine its benefits and overheads.
Gem-Forge also supports execution-based simulation. Execution-based simulation is supported for stream specialization: we can generate a binary with stream instructions and simulate it in Gem5's original Minor/O3 CPU models with Ruby coherence support.
Stream specialization is perhaps the most important result of the Gem-Forge framework. Here we provide high-level documentation of its implementation.
Stream specialization is implemented in transform/src/stream. These are the important files:
- StaticStream.cpp
  This defines the root class for streams, from which StaticMemStream and StaticIndVarStream are derived. They represent memory access streams (load, store, atomics) and induction variable streams, respectively.
- StaticStreamRegionAnalyzer.cpp
  This is the key analysis that recognizes streams. It first instantiates a stream for every phi node in the loop header block and for every memory access in the loop. It then analyzes the stream patterns and chooses the qualified streams.
- execution/StreamExecutionTransformer.cpp
  After streams are chosen, StreamExecutionTransformer transforms the program to remove redundant address generation and insert new stream instructions. Stream instructions are implemented as intrinsics in LLVM IR (llvm/llvm/include/llvm/IR/Intrinsics.td), and the real x86 instructions are defined in llvm/llvm/lib/Target/X86/X86InstrSSP.td.
- StreamPassOptions.cpp
  This file defines the key options that enable or disable stream analysis features, e.g. configuring streams at outer loops.
- StreamMessage.proto
  This Protobuf file defines the stream configuration, which is used to configure the stream engine during simulation.
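To give a feel for what "analyzing the stream pattern and choosing qualified streams" can mean, the following Python sketch classifies candidate streams in a toy loop model. In this model, an induction-variable stream is a loop-header phi updated by a constant step; a memory stream is affine if its address is base + scale * (an induction variable), and indirect if its address is computed from the value loaded by another qualified memory stream. The classes and rules are simplified assumptions for illustration only; they are not the actual interfaces or criteria of StaticStream, StaticIndVarStream, StaticMemStream, or StaticStreamRegionAnalyzer.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Union


@dataclass
class IndVar:
    """Toy loop-header phi node. `step` is the constant added each iteration,
    or None if the update is not a constant (e.g. it depends on loaded data)."""
    name: str
    step: Optional[int]


@dataclass
class AffineAddr:
    """Address = base + scale * (value of induction variable `iv`)."""
    base: int
    scale: int
    iv: str


@dataclass
class IndirectAddr:
    """Address = base + scale * (value loaded by memory stream `src_stream`)."""
    base: int
    scale: int
    src_stream: str


@dataclass
class MemAccess:
    name: str
    addr: Union[AffineAddr, IndirectAddr, None]  # None: pattern not recognized


def classify(ivs: Dict[str, IndVar], mems: Dict[str, MemAccess]) -> Dict[str, str]:
    """Label each candidate stream as indvar/affine/indirect/unqualified.
    Assumes induction-variable and memory-access names do not collide."""
    kinds: Dict[str, str] = {
        name: ("indvar" if iv.step is not None else "unqualified")
        for name, iv in ivs.items()
    }

    def mem_kind(name: str, visiting: Set[str]) -> str:
        if name in kinds:
            return kinds[name]
        if name not in mems or name in visiting:
            return "unqualified"  # unknown source or cyclic dependency
        visiting.add(name)
        addr = mems[name].addr
        if isinstance(addr, AffineAddr):
            ok = addr.iv in ivs and ivs[addr.iv].step is not None
            kind = "affine" if ok else "unqualified"
        elif isinstance(addr, IndirectAddr):
            # An indirect stream qualifies only if its source memory stream does.
            base_kind = mem_kind(addr.src_stream, visiting)
            kind = "indirect" if base_kind in ("affine", "indirect") else "unqualified"
        else:
            kind = "unqualified"
        kinds[name] = kind
        return kind

    for name in mems:
        mem_kind(name, set())
    return kinds


if __name__ == "__main__":
    # Toy loop: for (i = 0; i < n; ++i) sum += b[a[i]];
    # plus one access whose address pattern the toy model cannot describe.
    ivs = {"i": IndVar("i", step=1)}
    mems = {
        "load_a": MemAccess("load_a", AffineAddr(base=0x1000, scale=4, iv="i")),
        "load_b": MemAccess("load_b", IndirectAddr(base=0x8000, scale=4, src_stream="load_a")),
        "load_x": MemAccess("load_x", None),
    }
    print(classify(ivs, mems))
    # {'i': 'indvar', 'load_a': 'affine', 'load_b': 'indirect', 'load_x': 'unqualified'}
```

Resolving indirect streams recursively means a chain such as b[a[i]] qualifies only when every link back to the induction variable qualifies, and a dependency cycle is rejected rather than looping forever.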
We also implement stream specialization in Gem5. The implementation is mostly located in the following folders (omitting the gem5/src prefix):
- cpu/gem_forge/accelerator/stream
  This folder contains most of the stream implementation. The most important file is stream_engine.cc, which implements the core stream engine.
- cpu/gem_forge/accelerator/stream/cache
  This folder contains the implementation of the stream engines in the cache hierarchy, e.g. the L2 and L3 stream engines. These are used for stream floating.
- mem/protocol/stream
  This folder contains the MESI coherence protocol with stream support.
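The central idea behind a stream engine is decoupling: once a stream is configured, the engine generates addresses and fetches data ahead of the core, and the core simply consumes elements in order. The Python sketch below models this for a single affine stream with a bounded FIFO. The class name, its methods, and the `fifo_depth` parameter are hypothetical and do not mirror the interface of stream_engine.cc; memory is modelled as a plain dict, and timing, coherence, and stream floating are ignored.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class ToyStreamEngine:
    """Hypothetical single-stream engine for an affine stream:
    element k lives at address start + k * stride, for k in [0, length)."""

    def __init__(self, memory: Dict[int, int], fifo_depth: int = 4) -> None:
        self.memory = memory
        self.fifo_depth = fifo_depth
        self.fifo: Deque[Tuple[int, int]] = deque()  # (address, value) fetched ahead
        self.start = self.stride = self.length = 0
        self.next_idx = 0            # index of the next element to fetch
        self.issued: List[int] = []  # every address fetched so far, in order

    def configure(self, start: int, stride: int, length: int) -> None:
        """Configure the stream; the engine immediately starts running ahead."""
        self.start, self.stride, self.length = start, stride, length
        self.next_idx = 0
        self.fifo.clear()
        self.issued.clear()
        self._run_ahead()

    def _run_ahead(self) -> None:
        # Fetch elements until the FIFO is full or the stream ends.
        while len(self.fifo) < self.fifo_depth and self.next_idx < self.length:
            addr = self.start + self.next_idx * self.stride
            self.fifo.append((addr, self.memory.get(addr, 0)))  # unmapped reads as 0
            self.issued.append(addr)
            self.next_idx += 1

    def step(self) -> int:
        """The core consumes the oldest element; the engine then refills the FIFO."""
        if not self.fifo:
            raise IndexError("stream exhausted")
        _, value = self.fifo.popleft()
        self._run_ahead()
        return value


if __name__ == "__main__":
    mem = {0x1000 + 4 * k: k * k for k in range(6)}
    se = ToyStreamEngine(mem, fifo_depth=2)
    se.configure(start=0x1000, stride=4, length=6)
    print(len(se.issued))                    # 2: fetched before the core asks
    print(sum(se.step() for _ in range(6)))  # 55 = 0 + 1 + 4 + 9 + 16 + 25
```

Because the core never computes addresses itself, the address-generation instructions that StreamExecutionTransformer removes from the program have no counterpart on the core side of this model.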