CGRA Design Flow
Our CGRA design flow generates both the CGRA-based hardware accelerator and the application compiler. The main feature of the AHA project is the co-design of accelerators and compilers: the compiler updates automatically as the accelerator evolves.
CGRAs generally consist of three kinds of components: PEs, memories, and an interconnect. Accordingly, we use three high-level domain-specific hardware specification languages (DSLs), one per component. Each DSL generates both the RTL for the CGRA and the collateral for generating the new compilers. The three DSLs we use are:
- PEak for PEs
- Lake for memories
- Canal for interconnects
All of these DSLs are embedded in Python and are built on a low-level hardware description DSL called magma, which is also embedded in Python.
Inside the Docker container, we generate the Verilog for the CGRA using the following command. The `--width` and `--height` flags set the size of the array we want to generate. Although the size needed may vary across applications, it is generally better to generate a larger array so that more tiles are available. Note that every flag needs to be included in the command.

```
aha garnet --width 32 --height 16 --verilog --use_sim_sram --rv --sparse-cgra --sparse-cgra-combined
```

When generating the Verilog for physical design, remove the `--use_sim_sram` flag.

After running the command, the Verilog is written to `/aha/garnet/garnet.v`.
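To make the physical-design note above concrete, this is the same generation command with the simulation-SRAM flag dropped (a sketch; all other flags are kept as shown above):

```shell
aha garnet --width 32 --height 16 --verilog --rv --sparse-cgra --sparse-cgra-combined
```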
We will now go through the steps to compile an application onto the CGRA hardware and simulate it.
In most of our experiments, we target applications in the dense linear algebra domain. Applications are written in a high-level DSL called Halide, which is embedded in C++. A Halide application goes through several stages before we can simulate it.
We will use Gaussian as an example to go through the three steps of the compiler: map, pnr, and test.
```
aha map apps/gaussian
```

This command first compiles the Halide application using the Halide-to-Hardware compiler. It takes in the `gaussian_generator.cpp` and `process.cpp` files described at https://stanfordaha.github.io/aha-wiki-page/pages/03_h2h_files/.
Next, `aha map` uses a tool called MetaMapper to map the compute of the application (described in the CoreIR file `bin/gaussian_compute.json`) onto PEs. It produces a CoreIR file called `bin/gaussian_compute_mapped.json`.
Finally, it uses the clockwork tool to schedule the application and map its storage and buffers onto memory tiles. This step takes in `bin/gaussian_compute_mapped.json` and `bin/gaussian_memory.cpp` and produces a CoreIR file called `bin/design_top.json` (described in detail here: https://stanfordaha.github.io/aha-wiki-page/pages/08_design_files/).
Now we have both the Verilog and a fully mapped CoreIR file. The `aha pnr apps/gaussian --width 32 --height 16` command places and routes the application, performs pipelining, and generates a bitstream used to configure the CGRA to execute your application. The `--width` and `--height` flags specify what portion of the array to map onto. This can be at most as large as the Verilog you generated with `aha garnet`, but may be smaller if desired.

After running the command, the bitstream is saved to `./bin/gaussian.bs`. Several design files (described here: https://stanfordaha.github.io/aha-wiki-page/pages/08_design_files/) are generated as well.
Then we can run `aha test apps/gaussian`, which runs a VCS functional simulation of your application on your CGRA Verilog. On the kiwi server, use `module load base vcs` to load VCS before running `aha test`.

You can optionally generate an FSDB waveform using the `--waveform` flag. This requires Verdi to be loaded: `module load verdi`. The waveform will be named `cgra.fsdb` and placed in `/aha/garnet/tests/test_app`.
To run your mapped application through a critical-path timing model, use `aha sta apps/gaussian`. This prints the maximum frequency at which you will be able to run the CGRA array when executing the application. The `aha sta` command can also generate a visualization of the application's critical path via the `--visualize` (or `-v`) flag; it produces `./bin/pnr_result_{width}.png`. See `aha/aha/util/sta.py` for more details.
When testing architectural or compiler changes, it's often useful to automate the compilation and testing of many applications. `aha regress` is the command we use for this purpose. It iterates through a list of applications and runs `aha map`, `aha pnr`, and `aha test` on each. There are several regression application suites, each with different uses:

- `aha regress fast` runs the smallest sparse and smallest dense image processing applications and is intended to finish in minutes.
- `aha regress daily` tests more complex applications including gaussian, harris, camera pipeline, unsharp, and resnet. It generally takes hours.
- `aha regress full` tests every application and takes several hours to complete.

To see a list of all regress suites and the applications they include, see `aha/util/regress.py`.
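The regression loop can be sketched as follows. This is an illustrative shell sketch, not the real implementation: the app list here is hypothetical, and the command strings are collected into a variable rather than invoking the real CLI.

```shell
# Sketch of a regression pass: for each app in a (hypothetical) suite,
# run the map, pnr, and test steps in order. We only build up the list
# of commands that would be run.
cmds=""
for app in apps/gaussian apps/harris; do
  for step in map pnr test; do
    cmds="$cmds
aha $step $app"
  done
done
echo "$cmds"
```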
To modify the compilation, scheduling, place-and-route, and pipelining of our applications, we use environment variables. Here is a brief description of the important ones:
- `HALIDE_GEN_ARGS` determines scheduling parameters used in the `{app}_generator.cpp` described here: https://stanfordaha.github.io/aha-wiki-page/pages/03_h2h_files. A valid assignment to this variable is a space-separated list of the `GeneratorParam`s listed at the top of `{app}_generator.cpp`. For example, to run gaussian you could set `HALIDE_GEN_ARGS="mywidth=62 myunroll=2 schedule=3"`.
- `PIPELINED` determines whether compute pipelining is turned on. `PIPELINED=1` (default) turns compute pipelining on; `PIPELINED=0` turns it off.
- `DISABLE_GP` determines whether the global placement stage of place-and-route is run. `DISABLE_GP=1` (default) turns off global placement; `DISABLE_GP=0` turns it on.
- `HL_TARGET` has two valid values: `HL_TARGET=host-x86-64` or `HL_TARGET=host-x86-64-enable_ponds`. Use `HL_TARGET=host-x86-64-enable_ponds` to enable the memory mapper to use the register files (ponds) present in our PE tiles. By default, this is set to `HL_TARGET=host-x86-64` for image processing applications and `HL_TARGET=host-x86-64-enable_ponds` for machine learning applications.
- `PNR_PLACER_EXP` is a tuning knob for the placement stage of place-and-route. Generally, a higher value results in a placement and routing with a shorter critical path, which may reduce the runtime of applications running on the array. By default, this variable is not set, and the place-and-route tool chooses the first value that routes successfully.
- `SWEEP_PNR_PLACER_EXP` tells the place-and-route tool to try every value of `PNR_PLACER_EXP` from 1 to 30 and choose the result with the shortest critical path. By default, it is not set.
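As a hedged example, the variables above are ordinary shell environment variables, so a run can be configured by exporting them before invoking the flow. The specific values here are illustrative, not recommended settings:

```shell
# Export the knobs described above; a subsequent `aha` invocation in this
# shell would pick them up from the environment.
export HALIDE_GEN_ARGS="mywidth=62 myunroll=2 schedule=3"
export PIPELINED=1               # compute pipelining on (the default)
export DISABLE_GP=1              # global placement off (the default)
export HL_TARGET=host-x86-64     # image processing default
echo "HALIDE_GEN_ARGS=$HALIDE_GEN_ARGS PIPELINED=$PIPELINED HL_TARGET=$HL_TARGET"
```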
To avoid setting each of these environment variables every time we run an application, we include default values in `aha/util/application_parameters.json`. Any AHA flow command uses the `default` entry in the JSON file for the application you are compiling. To run the fastest (shortest critical path) version of the application, you can run:

```
aha halide apps/gaussian --env-parameters fastest
```
Some machine learning applications, including ResNet, have significantly different schedules depending on the layer in the network. In the AHA flow, we use one application with different environment parameters to choose which layer and which schedule to use. Default parameters for each layer are saved in `aha/util/application_parameters.json` and can be selected using the `--layer` flag. For example:

```
aha map apps/resnet_output_stationary --layer conv1
aha pnr apps/resnet_output_stationary --width 32 --height 16 --layer conv1
```
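Compiling a whole network then amounts to repeating these two commands per layer. The sketch below is illustrative only: `conv1` is the layer documented above, while the other layer name is a hypothetical placeholder, and the commands are collected into a variable rather than run.

```shell
# Sketch: build the map/pnr command for each ResNet layer. Layer names
# other than conv1 are placeholders for whatever keys exist in
# aha/util/application_parameters.json.
out=""
for layer in conv1 conv2_x; do
  out="$out
aha map apps/resnet_output_stationary --layer $layer
aha pnr apps/resnet_output_stationary --width 32 --height 16 --layer $layer"
done
echo "$out"
```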