The DMC-1 design is in the file `dmc-1.si` (it contains many comments). In the same directory there are a number of helper units which are not strictly part of the DMC-1 but are provided to help assemble a SOC: a FIFO command queue to store draw calls before sending them to the GPU, and screen drivers for SPI and parallel interfaces. These are used to assemble a complete SOC, including a RISC-V processor, in `soc-ice40-dmc-1-risc_v.si`.
What's a SOC? It stands for System On a Chip: basically, the assembled system that forms the hardware running our demos.
Let's start from the top and look at the main components of the SOC running the demos. The SOC instantiates six major components:
- The CPU RAM, using the UP5K on-chip 128KB SPRAM and a small BRAM for booting,
- The CPU (my ice-v-dual CPU),
- The GPU,
- The GPU texture memory,
- The GPU command queue,
- The screen controller and screen driver.

Here is an abridged view of these instantiations in the SOC source:

```
// ==============================
// CPU
uint32       user_data(0);
bram_port_io mem_io_cpu;
rv32i_cpu cpu(
  mem <:> mem_io_cpu,
  // ...
);
// ==============================
// GPU
texmem_io txm_io;
DMC_1_gpu gpu(
  txm <:> txm_io,
  // ...
);
// ==============================
// texture memory
$if MCH2022 then
qpsram_ram txm(
  ram_io2 <:> ram_io2, ram_io3 <:> ram_io3,
  // ...
);
$else -- not MCH2022
spiflash_rom_core txm<@clock2x,!rst2x,reginputs> (
  sf_clk :> sf_clk, sf_csn :> sf_csn,
  sf_io0 <:> sf_io0, sf_io1 <:> sf_io1,
  sf_io2 <:> sf_io2, sf_io3 <:> sf_io3,
  // ...
);
$end
// ==============================
// screen display
uint1 screen_valid(0);   uint1 screen_ready(0);
uint1 screen_send_dc(0); uint8 screen_send_byte(0);
screen_controller screen_ctrl(
  screen_ready <: screen_ready,
  send_dc      :> screen_send_dc,
  send_byte    :> screen_send_byte,
  // ...
);
// ...
$if MCH2022 then
screen_driver screen(
  ready     :> screen_ready,
  screen_d  :> lcd_d,
  screen_dc :> lcd_rs,
  // ...
);
$end
```
Overall, the CPU is the sole owner of the RAM, while the GPU is the sole owner of the texture memory and the screen controller. However, the CPU has bypass access to both the screen and the texture memory for initialization. The CPU should not use this access while the GPU is drawing (this would not work properly).
The CPU is directly plugged to the RAM through a binding to the group `mem`: see `mem <:> mem_io_cpu` on both the RAM and the CPU above. The group `mem_io_cpu` is declared as `bram_port_io mem_io_cpu`, a group of variables for interfacing with BRAM-type memories.
The `bram_spram_32bits` unit combines a BRAM (that can be initialized, but of which we have little) with the larger 128 KB SPRAM. The unit defines a single address space, with the BRAM located after the SPRAM.
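As a rough picture, and assuming the SPRAM is mapped at address 0 (the boot code below jumps to address 4, just after the start of SPRAM), the address space looks like this; only the 128 KB SPRAM size is given in the text, the bases are assumptions:

```c
// Illustrative memory map for bram_spram_32bits: SPRAM first, BRAM after.
#define SPRAM_BASE 0x00000000u                // 128 KB SPRAM, holds the demo code
#define SPRAM_SIZE 0x00020000u                // 128 KB
#define BRAM_BASE  (SPRAM_BASE + SPRAM_SIZE)  // boot BRAM sits right after SPRAM
```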
Since the SPRAM cannot be initialized at startup, a small boot program resides in BRAM; it copies the code to run from flash memory (more on that later) into the larger SPRAM, and the CPU then jumps to this code.
This is done in boot.c:

```c
// copy to the start of the memory segment
volatile int *code = (volatile int *)0x0000004;
spiflash_copy(PAYLOAD_OFFSET/*offset*/,code,131064/*SPRAM size*/);
//                                          ^^^^^^ max code image size
// jump!
*LEDS = 15;
synch = 1; // sync with other core
asm volatile ("li t0,4; jalr x0,0(t0);");
```
This boot code is compiled and stored in the initialized BRAM; that is how our CPU is able to start, using the start of the BRAM in `bram_spram_32bits` as its boot address. The final `jalr` jumps to address 4, the start of the code that was just copied into SPRAM.
Now, what's this flash memory we mentioned earlier? In most cases it is the on-board SPIflash chip that stores the FPGA configuration. This is a relatively large memory (a few MB) that is very slow to write to, but relatively fast to read from (using quad SPI at 100 MHz one can reach 50 MB/sec in ideal contiguous burst reads). The DMC-1 design uses this memory as our texture memory, but we also store the CPU code to be loaded at boot there. In practice, the code is stored at offset 1MB and the textures at offset 2MB in the flash memory.
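In firmware terms this layout boils down to two constants; `PAYLOAD_OFFSET` appears in boot.c above, while `TEXTURES_OFFSET` is a name used here purely for illustration:

```c
// Flash layout as described above (offsets in bytes into SPIflash).
#define PAYLOAD_OFFSET  (1u << 20)  // CPU code image at 1MB
#define TEXTURES_OFFSET (2u << 20)  // texture data at 2MB (illustrative name)
```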
On the MCH2022 badge there is no flash memory, but instead a volatile PSRAM chip. The protocol is nearly identical (quad SPI), but the PSRAM has to be pre-loaded with code and texture data before the design starts. This is achieved by first using another design to load the PSRAM from UART, then reconfiguring the FPGA with our hardware (as long as power is maintained, the PSRAM retains its content).
Of course, since both the CPU and GPU may access flash memory, we have to be a bit careful. The SOC implements a mechanism so that the CPU can request memory transfers from SPIflash to its RAM, but the CPU has to ensure there is no ongoing GPU activity before using it. This is true at boot time, or when the GPU reports an empty command queue, as indicated by bit 2 of `user_data`.
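On the firmware side this check can be a simple spin loop. A minimal sketch, assuming `user_data` is exposed to the CPU through a memory mapped register; the `USER_DATA` name and address below are illustrative, not the actual firmware API:

```c
// Illustrative: USER_DATA stands for the memory mapped register exposing
// user_data; per the text, bit 2 is set when the GPU command queue is empty.
volatile unsigned int *const USER_DATA = (volatile unsigned int *)0x90000000;

static inline void wait_gpu_idle(void)
{
  while (((*USER_DATA) & (1u << 2)) == 0) { /* GPU still has queued commands */ }
}
```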
First, note that the GPU is bound to a `texmem_io` group called `txm_io`:

```
texmem_io txm_io;
DMC_1_gpu gpu(
  txm <:> txm_io,
  // ...
);
```
The variables of this group are updated in the always block:

```
txm_io.busy = txm.busy;
$if MCH2022 then
txm.wenable = 0; // we never write to PSRAM
$end
```
As you can see, this connects `txm_io` to `txm`, which is the flash memory unit itself: `qpsram_ram` (on the mch2022) or `spiflash_rom_core` (on the icebreaker).
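One detail before we look at the mechanism: the `prev_mem_*` signals appearing in the snippets below are registered copies of the CPU memory interface, latched in the always block (hence the `prev_` prefix, as they hold the access issued on the previous cycle):

```
prev_mem_addr    = mem_io_cpu.addr;
prev_mem_wenable = mem_io_cpu.wenable;
prev_mem_wdata   = mem_io_cpu.wdata;
```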
To access flash the CPU first uses memory mapping to set an address and initiate a transfer:
```
// and should wait on user_data, txm.busy bit
// NOTE3: destination addresses have to be 32-bits aligned
uint1 len_else_dst    = prev_mem_wdata[31,1];
//    ^^^ bit 31 indicates whether we are setting the length (bytes)
//        or the destination address (in ram)
uint1 src_and_trigger = prev_mem_wdata[30,1];
//    ^^^ bit 30 indicates whether we are setting the source (in tex
//        memory) or the destination address (in ram)
// Set size, then destination, then source (triggers the burst)
if (len_else_dst) {
  txm_burst_len        = prev_mem_wdata[2,22];
} else {
  if (~src_and_trigger) {
    txm_burst_read_dst = prev_mem_wdata[2,22];
  }
}
txm.addr              = prev_mem_wdata[0,24];
txm_burst_read_active = src_and_trigger; // start
txm_burst_data_cursor = 5b10000;
$if SIMULATION then
//if (src_and_trigger) {
//  __display("[cycle %d] BURST start, from @%x to @%x, len %d",iter,txm.addr,txm_burst_read_dst<<2,txm_burst_len);
//}
$end
```
This immediately initiates a transfer. During a transfer the CPU is stalled: its RAM is being written every cycle, and it can no longer fetch instructions.
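Putting the write protocol together, a helper such as `spiflash_copy` in boot.c might drive this interface roughly as follows. This is only a sketch, under the assumption that the interface is reached through a single memory mapped register; `TXM_MAP` is an illustrative name and address:

```c
// Sketch of the CPU side of the burst protocol described above.
// Bit 31 selects "set length", bit 30 selects "set source and trigger",
// neither bit set means "set destination".
volatile unsigned int *const TXM_MAP = (volatile unsigned int *)0x90000004;

static inline void flash_burst_copy(unsigned int src, volatile int *dst,
                                    unsigned int len_bytes)
{
  *TXM_MAP = (1u << 31) | len_bytes;   // 1) set length in bytes
  *TXM_MAP = (unsigned int)dst;        // 2) set destination (32-bit aligned)
  *TXM_MAP = (1u << 30) | src;         // 3) set source, starts the burst
  // the CPU is stalled while the burst writes its RAM, nothing to wait on here
}
```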
The rest of the SOC is dedicated to the command queue and the screen. Let's first take a look at the command queue. Whenever the CPU wants to draw, it sends a command, or draw call, to the GPU. This is done through memory mapping, intercepting memory addresses that are outside the range of the actually available memory. Because each command is 64 bits wide, this takes two memory writes:
```
    // received second half
    cmdq.in_command[ 0,32] = prev_mem_wdata[0,32];
    // store in fifo
    cmdq.in_add = 1;
  }
}
case 4b1000: { // GPU commands in part
  switch (prev_mem_addr[6,4]) {
    // tex0 PARAMETER_PLANE_A (2<<30) | ((ny) & 1023) | (((uy) & 1023)<<10) | (((vy) & 1023)<<20)
    case 0: { // PARAMETER_PLANE_A_ny
      cmdq.in_command[$32+ 0$,10] = prev_mem_wdata[0,10];
      // ...
```
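From the CPU's point of view, issuing a draw call is then just two 32-bit stores. A minimal sketch, where the half ordering follows the snippet above; `GPU_CMD` is an illustrative mapped address, and sending both halves to the same address is a simplification:

```c
// Illustrative: GPU_CMD stands for a memory mapped address intercepted by
// the SOC (well outside the actually available memory).
volatile unsigned int *const GPU_CMD = (volatile unsigned int *)0x90000008;

static inline void gpu_send(unsigned long long cmd)
{
  *GPU_CMD = (unsigned int)(cmd >> 32); // first half (bits 63..32)
  *GPU_CMD = (unsigned int)(cmd);       // second half; pushes into the FIFO
}
```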
These `case` entries are part of a switch performed on the accessed memory address, entered when a high bit of the address is set (well outside the truly available memory space). Other cases of the same switch map further peripherals, for instance the UART output:

```
uo.data_in       = prev_mem_wdata[ 0,8];
uo.data_in_ready = ~uo.busy;
```
The command queue is a FIFO data structure that records the commands. Each command is added to the FIFO by pulsing `cmdq.in_add`.
The GPU then attempts to process them as fast as possible. Of course, the command queue might fill up, a situation the CPU has to know about. This again uses the `user_data` register:

```
user_data[0,24] = { ... , ~cmdq.full, ... };
```
The `cmdq.full` signal goes high some time before the FIFO is catastrophically full, so that the CPU can still send a few commands. But then it should stop, or the FIFO will overflow and bad things will happen (typically, the GPU hangs).
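Combining this flag with the send routine sketched earlier gives a safely throttled draw call. This reuses `USER_DATA` and `gpu_send` from the sketches above; `QUEUE_NOT_FULL` is an assumed mask, since the exact bit position of `~cmdq.full` within `user_data` is not spelled out here:

```c
// Illustrative throttling: wait until the queue reports room before sending.
#define QUEUE_NOT_FULL (1u << 1) // assumed bit position of ~cmdq.full

static inline void gpu_send_checked(unsigned long long cmd)
{
  while (((*USER_DATA) & QUEUE_NOT_FULL) == 0) { /* queue almost full, wait */ }
  gpu_send(cmd);
}
```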
We send the next command to the GPU by pulsing two signals together: `gpu.valid`, which tells the GPU to process the command, and `cmdq.in_next`, which tells the command queue to move on to the next command. The command itself is given to the GPU through a binding between the FIFO output of `cmdq` and the command input of `gpu`.
The GPU drives the screen directly, with the CPU only having access during initialization. This can be seen in particular in this part of the SOC:

```
screen.data_or_command = screen_send_dc;
screen.byte            = screen_send_byte;
// uart: maintain low, pulse on send
uo.data_in_ready       = 0;
```
Note that `screen_ctrl.in_data` receives data either from the GPU or from the CPU. The CPU can trigger `screen_ctrl.valid` through a memory mapped address; the screen reset pin is driven from a memory mapped write in the same way:
```
  oled_resn = ~ prev_mem_wdata[0,1];
$elseif not MCH2022 then
  o_resn    = ~ prev_mem_wdata[0,1];
```
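On the firmware side, screen initialization can then be a handful of memory mapped writes through the CPU's bypass access. A minimal hypothetical sketch; `SCREEN_RST` and `SCREEN_CTRL` are illustrative names and addresses, not the actual firmware API:

```c
// Illustrative screen bring-up through the CPU's bypass access.
volatile unsigned int *const SCREEN_RST  = (volatile unsigned int *)0x90000010;
volatile unsigned int *const SCREEN_CTRL = (volatile unsigned int *)0x90000014;

static void screen_init(void)
{
  *SCREEN_RST  = 1;    // assert reset (the SOC inverts the bit onto the resn pin)
  *SCREEN_RST  = 0;    // release reset
  *SCREEN_CTRL = 0x01; // send an init byte; pulses screen_ctrl.valid
}
```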
The screen controller unit `screen_ctrl` then drives a lower level screen driver, `screen`, that talks to the actual screen through either a parallel protocol (e.g. the mch2022 badge) or SPI (e.g. the icebreaker).
Alright, enough of the SOC, let's dissect the GPU!
There are some other details in the SOC, in particular to set up for simulation and to introduce latencies that relax the pressure on the maximum frequency. All the lines starting with `$$` are pre-processor lines, selecting between various options and configuring some parameters for the CPU and other units.
To be written