
Conditional Branches, Branch Prediction, and Minimizing Branch Penalties

amirroth edited this page Sep 11, 2021 · 16 revisions

There are a number of reasons why dynamic instruction count (i.e., the number of instructions executed) does not correlate perfectly with observed runtime. One reason is that not all instructions are created equal in terms of latency: an integer ADD instruction takes one clock cycle to execute, while a DIV takes around 24 cycles (here is an interesting document that lists the latencies of x86_64 instructions on different processors).

A second reason is that memory access instructions--these are usually called LOAD and STORE, but in x86_64 they are both called MOV--can take a variable number of cycles depending on whether the data they are accessing is found in the first level data cache, the L2 or L3 cache, or even off-chip in main memory (here is a tutorial about the memory hierarchy, spatial locality, and how to design data structures that exploit it).

The third reason is conditional branches, i.e., the instructions that implement conditional control flow in if statements and for loops. Yes, every for loop contains at least one conditional branch at the end of the body that checks whether the termination condition has been met and determines whether to jump back to the beginning of the body or fall through to post-loop code. What is it about conditional branches that slows down execution?

Pipelining, Branch Prediction, and Branch Mispredictions

The external semantics of processors are such that instructions execute sequentially and atomically--an instruction finishes executing and updating processor state before the next instruction is fetched from the instruction cache. If processors actually worked this way, then conditional branches would not present performance problems. However, only the very earliest processors actually did work this way. For the last fifty years, processors have made heavy use of a technique called pipelining. The processor cuts up instruction execution into stages and pushes instructions through those stages in overlapping fashion. The canonical (i.e., textbook) pipeline has five stages called:

  • fetch: load instruction from instruction cache
  • decode: unpack instruction and read registers
  • execute: do any arithmetic operations
  • memory: loads and stores access the data cache
  • writeback: write register values

A canonical pipeline is essentially working on five instructions simultaneously, one in each stage. Modern processors have much deeper pipelines, with 20 stages or more, and can also process multiple instructions in each stage simultaneously. A modern core can often be working on 80 instructions simultaneously! Incidentally, pipelining was a major contributor to the steady increase of clock frequencies between the 1970s and early 2000s, and diminishing returns on pipelining have contributed to the plateauing of frequencies since then.

A branch does not access memory or write a register value, so it completes execution in the execute stage, at which point its direction and target are known. However, by that time a five-stage pipeline has already started working on the next two instructions, one in fetch and one in decode, and a more realistic pipeline may have started working on the next 20 instructions. How does the processor know whether those instructions are from the "true" or "taken" side of the branch or the "false"/"not taken" side of the branch? How does a processor even know an instruction is a branch before it has decoded it?

It basically guesses based on historical behavior. The processor has a hardware table--essentially a cache--called a branch predictor that is indexed by instruction address and contains information about past branching behavior including direction (taken or not) and target. The predictor is updated when a branch executes and is consulted when the pipeline fetches the same instruction again. Because only branches update the branch predictor, if the processor does not find an entry in the predictor it essentially assumes the instruction is not a branch.

When the processor guesses the branch direction and target correctly, everything is cool. However, when it guesses incorrectly--this is called a branch mis-prediction--it has to flush all of the instructions after the branch from the pipeline and restart. In a modern processor, this means flushing anywhere between 20 and 40 instructions. This is where the performance penalty associated with branches comes from.

Modern processors are actually pretty good at predicting branches, with accuracies of 95% and above. However, those infrequent mis-predictions do hurt--a mis-predicted branch costs 20-40 times the average instruction. Part of high-performance programming is knowing how to minimize branch mis-predictions. And although that sounds impossible--how can you write a program with no if statements or for loops?--the fact is that there are idioms that result in branches that are more difficult to predict and you can reduce branch mis-predictions by avoiding those idioms.

Data Dependent Branches

Consider the following loop:

for (int SurfNum = 1; SurfNum <= state.dataSurfaces->TotSurfaces; ++SurfNum) {
   SurfaceData &surf = state.dataSurfaces->Surface(SurfNum);
   if (surf.Class != DataSurfaces::SurfaceClass::Window) continue;
   // Do some window related stuff
}

Here are the corresponding pseudo-instructions; assume SurfNum is in register R1, &state.dataSurfaces->Surface is in register R2, and state.dataSurfaces->TotSurfaces is in R3.

LOOP-BODY:
   MUL R1, 200 -> R4 // 200 is the size of the Surface class in bytes (not really, but let's just say it is)
   ADD R2, R4 -> R4
   LOAD R4, 16 -> R4 // 16 is the offset of the Class field inside the Surface class (not really, but let's say it is)
   COMPARE-EQ R4, 7 -> R4 // 7 is the integer value of DataSurfaces::SurfaceClass::Window
   BRANCH-IF-FALSE R4, LOOP-TEST // surface class test branch (the "continue")

   ... // some window related stuff

LOOP-TEST:
   ADD R1, 1 -> R1 // ++SurfNum
   COMPARE-LEQ R1, R3 -> R4
   BRANCH-IF-TRUE R4, LOOP-BODY // loop continuation/termination branch
POST-LOOP-CODE:

There are two branches in this loop: the surface class test branch and the loop continuation/termination branch. They look similar, but from the point of view of branch mis-predictions they could not be more different. The loop continuation/termination branch will be mis-predicted at most twice regardless of how long the loop is (on the first iteration and the last), and most processors have a trick for recognizing loop branches and mis-predicting them only once per loop (that trick is essentially a hysteresis that makes the branch have to go a certain way twice in a row before the branch predictor "changes its mind"). On the other hand, the surface class test branch will be mis-predicted on every iteration in which the surface class changes from window to non-window or vice versa relative to the previous iteration--and that could be many iterations! The moral of the story is that there is no need to worry about loop continuation/termination branches, but we should try to eliminate the mis-prediction pathologies associated with data dependent idioms like the surface class test.

One way to do this is to convert data dependent branches into loop continuation/termination branches and take advantage of the loop branch trick that processors use to reduce the number of mis-predictions by half. How do we do this? One way is to sort surfaces so that all window surfaces are adjacent:

for (int ZoneNum = 1; ZoneNum <= state.dataGlobals->NumOfZones; ++ZoneNum) {
   
   for (int SurfNum = state.dataGlobals->Zone(ZoneNum).SurfWinFirst; SurfNum <= state.dataGlobals->Zone(ZoneNum).SurfWinLast; ++SurfNum) {
       // Do window-related stuff
   }
}