JCAP Log #9: Video Part 4

Connor Spangler edited this page Apr 30, 2018 · 35 revisions

NES CPU-PPU Setup

Video System Implementation

We finally have all the information we need to fully implement a VGA arcade graphics system, minus one final critical consideration. The main and cog RAM sizes, and the constraints they place on how graphics are represented, have already been addressed. However, the relationship between the Propeller 1 core clock and the pixel clock is one last technical hurdle to overcome, and it will ultimately define the high-level architecture of the system.

Decisions

Thanks to the robust community behind the Propeller 1 microcontroller, there is a vast amount of shared knowledge to draw on for our own design decisions. Nowhere is this better showcased than in video display. Dozens of developers have created hundreds of solutions covering a wide variety of video types, resolutions, and refresh rates. What these implementations teach is that complex, high-quality video simply cannot be produced by a single cog. This is largely a constraint imposed by the generation of the pixel data itself. Some solutions split the scanlines into groups that are assigned to different cogs, while others interlace individual scanlines generated by individual cogs.

In our case, with two layers of indirection and sprite effects to implement, we'll be forced to use a different paradigm altogether: a scanline driver. With this method, one cog is the "display" cog. Its sole job is to take pixel data from main RAM and display it via the video generator circuit. N cogs are then spun up as "render" cogs. Their job is to generate interleaved scanlines of pixels and write them to main RAM, where the display cog requests them sequentially. The choice of this methodology is a direct result of simply doing the math...
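To make the interleaving concrete, here is a minimal Python sketch of how scanlines could be dealt out to render cogs round-robin. It is a model for illustration only, not the project's PASM: the function names are hypothetical, the cog count of 5 is the minimum quoted later in this log, and the 800 pixel clocks per line is the standard 640x480 @ 60 Hz VGA line length including blanking, which this page does not state.

```python
# Illustrative model only: round-robin assignment of scanlines to render cogs.
NUM_RENDER_COGS = 5          # minimum render cog count quoted later in this log
VISIBLE_LINES = 480          # visible scanlines at 640x480
PIXEL_CLOCK_MHZ = 25.175     # VGA pixel clock used throughout this log
CLOCKS_PER_LINE = 800        # assumption: standard VGA line length incl. blanking


def render_cog_for_line(line, num_cogs=NUM_RENDER_COGS):
    """Cog k renders lines k, k+N, k+2N, ... (interleaved scanlines)."""
    return line % num_cogs


def lines_for_cog(cog, num_cogs=NUM_RENDER_COGS, total_lines=VISIBLE_LINES):
    """All visible scanlines that a given render cog is responsible for."""
    return list(range(cog, total_lines, num_cogs))


def render_budget_us(num_cogs=NUM_RENDER_COGS):
    """Time one cog has to render a line before its next turn comes around."""
    line_time_us = CLOCKS_PER_LINE / PIXEL_CLOCK_MHZ
    return num_cogs * line_time_us


if __name__ == "__main__":
    for line in range(7):
        print(f"scanline {line} -> render cog {render_cog_for_line(line)}")
    print(f"per-line render budget: {render_budget_us():.1f} us")
```

The point of the scheme is the last function: with N render cogs, each one gets N line periods to finish a scanline while the display cog only ever streams finished data.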

Colors


6-bit Color Palette

A critical constraint posed by the "indirect" method of using waitvid discussed in Video Part 2 is that each series of 16 pixels can only have 4 colors: 2 bits per pixel addressing one of the four color bytes. We need 16 colors per 8x8 pixel tile, and even if we only push out 8 pixels per waitvid, we're still restricted to a 4-color palette. The solution to this problem is novel: simply swap the roles of the color palette and the pixel data. By populating the color palette with the colors of the next four pixels, we can display them directly by having a fixed pixel pattern select each color in sequence, i.e. waitvid pixels, #%%3210. This new paradigm works perfectly at giving us "full color", but requires more waitvids per screen, an issue that will need to be addressed.
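The trick is easier to see in a small model. The Python sketch below is not PASM and the helper names are hypothetical; it assumes the video generator consumes the pixel pattern two bits at a time starting from the least significant bits, with each 2-bit value selecting one byte of the color long (value 0 selecting the lowest byte).

```python
# Illustrative model of the "direct color" waitvid trick (not real PASM).
PIXEL_PATTERN = 0b11_10_01_00   # the PASM literal %%3210: pixel i selects color byte i


def pack_colors(c0, c1, c2, c3):
    """Pack four color bytes into one 32-bit long, first pixel in the lowest byte."""
    for c in (c0, c1, c2, c3):
        if not 0 <= c <= 0xFF:
            raise ValueError("each color must fit in one byte")
    return c0 | (c1 << 8) | (c2 << 16) | (c3 << 24)


def simulate_waitvid(colors, pixels, count=4):
    """Assumed 4-color behaviour: each 2-bit field of `pixels`, taken from the
    least significant end first, picks one byte out of `colors`."""
    out = []
    for i in range(count):
        index = (pixels >> (2 * i)) & 0b11
        out.append((colors >> (8 * index)) & 0xFF)
    return out


if __name__ == "__main__":
    long_value = pack_colors(0x11, 0x22, 0x33, 0x44)
    shown = simulate_waitvid(long_value, PIXEL_PATTERN)
    print([hex(c) for c in shown])
```

Because the pattern is constant, the only thing that changes between waitvids is the color long, which is why each waitvid now carries exactly four freely colored pixels.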

Nanoseconds


Nanoseconds (xkcd)

There's a Relevant xkcd for Everything

It is in no way, shape, or form an exaggeration to say that the timing of this video system on the Propeller 1 comes down to single nanoseconds. Let's look at the numbers to find out why...

Our 640x480 @ 60 Hz VGA pixel clock is 25.175 MHz, which means we're displaying a pixel roughly every 40 nanoseconds, or a group of 4 every 160 nanoseconds. Using the "direct" method of pixel output discussed above, displaying 4 pixels at a time, a waitvid must execute every 40*4=160 nanoseconds. Between each waitvid, we also need to perform a rdlong to retrieve the next 4 pixels from main RAM. We're not using a djnz to loop through the instructions; instead we generate all of the instructions into a monolithic region of scancode because, as you're about to see, we don't have time to perform a jump, and we do have the space in cog RAM to hold the scancode. A worst-case waitvid takes 7 clock cycles from execution to pixels being pushed out of the video generator. A worst-case rdlong takes 23 cycles; however, because the intermediate waitvids are only 7 cycles, we always hit the best case of 8 cycles.

Assuming an 80 MHz core clock, where each clock cycle is 12.5 ns, our read-and-display routine takes (8+7) x 12.5 = 187.5 nanoseconds. That means we're blowing our 160 ns deadline! Our 2-instruction routine cannot be made any more efficient, at least not without resorting to some nasty hacks that are difficult to understand and implement (a no-go for a project intended to be easy for others to build on). So instead, we can simply increase the core clock! Some simple math, as well as community recommendations on overclock stability, leads us to an ideal clock speed of 104 MHz. At this speed each clock cycle is about 9.6 ns, so the same 15 cycles take roughly 144 ns, which fits inside the 160 ns window with about 15 ns to spare. The only consideration left is storing our data.
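The arithmetic above can be checked with a few lines of Python. This sketch only uses the cycle counts and clock speeds quoted in this section; the function names are hypothetical, and it uses the exact 25.175 MHz pixel period rather than the rounded 40 ns, so the window comes out slightly under 160 ns.

```python
# Timing budget for one rdlong + waitvid pair, using the figures from this log.
PIXEL_CLOCK_HZ = 25_175_000   # 640x480 @ 60 Hz VGA pixel clock
PIXELS_PER_WAITVID = 4        # "direct" method: four pixels per waitvid
RDLONG_CYCLES = 8             # best-case hub read, as argued above
WAITVID_CYCLES = 7            # worst-case waitvid, as quoted above


def window_ns():
    """Time available to fetch and display one group of four pixels."""
    return PIXELS_PER_WAITVID / PIXEL_CLOCK_HZ * 1e9


def routine_ns(core_clock_hz):
    """Time the two-instruction routine takes at a given core clock."""
    return (RDLONG_CYCLES + WAITVID_CYCLES) / core_clock_hz * 1e9


def min_core_clock_hz():
    """Slowest core clock at which the routine still fits in the window."""
    cycles = RDLONG_CYCLES + WAITVID_CYCLES
    return cycles * PIXEL_CLOCK_HZ / PIXELS_PER_WAITVID


if __name__ == "__main__":
    print(f"window: {window_ns():.1f} ns")
    for clock in (80e6, 104e6):
        fits = routine_ns(clock) <= window_ns()
        print(f"{clock / 1e6:.0f} MHz: {routine_ns(clock):.1f} ns, fits={fits}")
    print(f"minimum clock: {min_core_clock_hz() / 1e6:.2f} MHz")
```

Under these cycle counts the break-even clock sits between 80 MHz and 104 MHz, which is consistent with picking 104 MHz as a stable overclock with some margin.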

Bytes

Concerns about the size of our data manifest in a few places: the scanline buffer in each render cog, the main RAM scanline buffer they write to and the display cog reads from, the tile map which represents the screen area, and the scancode buffer which the display cog uses to read the main RAM scanline buffer longs and display them. As discussed in Video Part 3, given a 640x480 screen with a 1:1 tile map, we'd need over 9.5 kB of main RAM to store a single tile map. Taking up nearly a third of our 32 kB of main RAM for a single map is not optimal. In the render cogs, we need to generate 640/4=160 longs, so we'd have to allocate nearly a third of the 512-long cog RAM to that buffer. This is mirrored in main RAM, where it's less of a problem. As for the scancode, we would need 160*2=320 longs of buffer, well over half of cog RAM. That is pretty untenable.
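For a rough feel of these numbers, here is a small Python tally of the full-resolution footprint. The 2 bytes per tile map entry is an assumption inferred from the "over 9.5 kB" figure (80 x 60 tiles x 2 bytes = 9600 bytes), not something stated on this page, and the function name is hypothetical.

```python
# Rough memory tally for a given render resolution (illustrative only).
MAIN_RAM_BYTES = 32 * 1024    # Propeller 1 main (hub) RAM
COG_RAM_LONGS = 512           # Propeller 1 cog RAM, in 32-bit longs
TILE_SIZE = 8                 # 8x8 pixel tiles
BYTES_PER_TILE_ENTRY = 2      # ASSUMPTION: inferred from the ~9.5 kB figure
PIXELS_PER_LONG = 4           # one color byte per pixel, four per long


def memory_budget(width, height):
    """Return tile map size and per-scanline buffer sizes for a resolution."""
    tiles = (width // TILE_SIZE) * (height // TILE_SIZE)
    scanline_longs = width // PIXELS_PER_LONG
    return {
        "tile_map_bytes": tiles * BYTES_PER_TILE_ENTRY,
        "scanline_buffer_longs": scanline_longs,
        "scancode_longs": scanline_longs * 2,   # one rdlong + one waitvid per long
    }


if __name__ == "__main__":
    budget = memory_budget(640, 480)
    print(budget)
    print(f"tile map: {budget['tile_map_bytes'] / MAIN_RAM_BYTES:.0%} of main RAM")
    print(f"scancode: {budget['scancode_longs'] / COG_RAM_LONGS:.0%} of cog RAM")
```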

A solution to this problem arises from the fact that classic arcade games were far lower resolution than anything we see today. Horizontal and vertical resolutions of less than 400 and 300 were the norm, viewable on CGA-compatible 15 kHz monitors. What this means for us is that we can both drastically reduce our memory footprint and develop a more faithful classic arcade graphics system by utilizing upscaling. Upscaling is exactly what it sounds like: scaling an image up from a lower resolution to a higher one. This involves duplicating or stretching pixels to fill a larger visible area with the same amount of unique data. By leveraging upscaling, we can render data at 320x240 (1/2 resolution) while displaying it at full 640x480. What's more, achieving this is as simple as changing vscl in the visible screen area to display each pixel twice, and then displaying each line twice. Just like that we have an image upscaled 2x to 640x480, while cutting our render cog buffers in half, our display cog scancode buffer in half, and our tile map footprint by a factor of 4.
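As a software analogy for what the vscl change and the repeated lines accomplish, the following Python sketch doubles every pixel horizontally and every line vertically. It is only a model of the resulting image: on the Propeller the doubling happens in the video hardware and display loop rather than by copying data, and the function name is hypothetical.

```python
# Illustrative 2x upscale: each pixel shown twice, each line shown twice.
def upscale_2x(frame):
    """frame is a list of rows, each row a list of pixel values."""
    out = []
    for row in frame:
        doubled = [pixel for pixel in row for _ in range(2)]  # pixel doubling
        out.append(doubled)
        out.append(list(doubled))                             # line doubling
    return out


if __name__ == "__main__":
    small = [[1, 2],
             [3, 4]]
    for row in upscale_2x(small):
        print(row)

    # A 320x240 render fills a 640x480 display with a quarter of the unique data.
    rendered = [[0] * 320 for _ in range(240)]
    shown = upscale_2x(rendered)
    print(len(shown[0]), "x", len(shown))
```

Halving both dimensions is where the savings come from: buffers that scale with line width are halved, while the tile map, which scales with area, shrinks to a quarter.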

Propellers


Multi-Prop Setup

Example Multi-Prop Setup

The final question remains: how do we set up our display and render cogs? A lot of number crunching, optimization, trial, and error went into answering this question. Suffice it to say that, in order to implement our tile and sprite based video system, we need at least 5 render cogs alongside our one display cog, leaving only two of the Propeller's eight cogs free for everything else. Obviously, this is not practical in a system which also requires input, sound, and game code to be running. This means it's time to take another leaf out of the NES's book, and implement a CPU-GPU dual-processor system.

The idea is simple: a primary CPU runs our game, input, and sound code while a secondary GPU runs our graphics code. The video data required for each frame is sent from the CPU to the GPU during the vertical sync period, and is used in the rendering and display of that frame. By offloading the graphics work to the GPU, we can also add another cog to the render pool to increase the amount of processing we can do. On each end of the wire linking the two microcontrollers, a single cog handles the data transmission or reception.
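To get a sense of how much time that transfer window offers, here is a hedged back-of-the-envelope sketch in Python. The line counts are the standard 640x480 @ 60 Hz VGA vertical timings (10 front porch, 2 sync, 33 back porch lines, 800 pixel clocks per line), which this page does not state. Whether the transfer uses only the sync pulse or the whole blanking interval is not specified here, so both are computed. The 2400-byte payload is purely hypothetical, chosen to match a 40x30 tile map at an assumed 2 bytes per entry.

```python
# Back-of-the-envelope transfer window, assuming standard VGA 640x480@60 timing.
PIXEL_CLOCK_HZ = 25_175_000
CLOCKS_PER_LINE = 800                 # assumption: standard total line length
V_FRONT, V_SYNC, V_BACK = 10, 2, 33   # assumption: standard vertical blanking lines


def line_time_s():
    return CLOCKS_PER_LINE / PIXEL_CLOCK_HZ


def vsync_pulse_s():
    """Duration of the vertical sync pulse alone."""
    return V_SYNC * line_time_s()


def vblank_s():
    """Duration of the whole vertical blanking interval."""
    return (V_FRONT + V_SYNC + V_BACK) * line_time_s()


def required_rate_bytes_per_s(payload_bytes, window_s):
    """Link throughput needed to move a payload within a time window."""
    return payload_bytes / window_s


if __name__ == "__main__":
    payload = 2400  # HYPOTHETICAL payload size, for illustration only
    print(f"sync pulse: {vsync_pulse_s() * 1e6:.1f} us")
    print(f"full blanking: {vblank_s() * 1e3:.2f} ms")
    rate = required_rate_bytes_per_s(payload, vblank_s())
    print(f"rate for {payload} bytes in blanking: {rate / 1e6:.2f} MB/s")
```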