-
Notifications
You must be signed in to change notification settings - Fork 392
Conditional Branches, Branch Prediction, and Minimizing Branch Penalties
There are a number of reasons why dynamic instruction count (i.e., number of instructions executed) does not correlate perfectly with observed runtime. One reason is that not all instructions are created equal in terms of latency. An integer ADD
instruction takes one clock cycle to execute, a DIV
takes 24 cycles (here is an interesting reference that tells you the latencies of all x86_64 instructions on different processors).
A second reason is that memory access instructions--these are usually called LOAD
and STORE
but in x86_64 they are both called MOV
--can take a variable number of clock cycles depending on whether the data they are accessing is found in the first level data cache, the L2 or L3 cache or even off-chip in main memory (here is a page about the memory hierarchy, spatial locality, and how to design data structures that exploit it).
The third reason is conditional branches, i.e., the instructions that implement conditional control flow in if
statements and for
loops. Yes, every for
loop contains at least one conditional branch at the end of the body that checks whether the termination condition has been met and determines whether to jump back to the beginning of the body or fall through to post-loop code. What is it about conditional branches that slows down execution?
The external semantics of processors are such that instructions execute sequentially and atomically--an instruction finishes executing and updating processor state before the next instruction is fetched from the instruction cache. If processors actually worked this way, then conditional branches would not present performance problems. However, only the very earliest processors actually did work this way. For the last fifty years, processors have made heavy use of a technique called pipelining. The processor cuts up instruction execution into stages and pushes instructions through those stages in overlapping fashion. The canonical (i.e., textbook) pipeline has five stages called:
-
fetch
: load instruction from instruction cache -
decode
: unpack instruction and read registers -
execute
: do any arithmetic operations -
memory
: loads and stores access the data cache -
writeback
: write register values
A canonical pipeline is essentially working on five instructions simultaneously, one in each stage. Modern processors have much deeper pipelines, with 20 stages or more, and can also process multiple instructions in each stage simultaneously. A modern core can often be working on 80 instructions simultaneously! Incidentally, pipelining was a major contributor to the steady increase of clock frequencies between the 1970s and early 2000s, and diminishing returns on pipelining have contributed to the plateauing of frequencies since then.
A branch does not access memory or write a register value and so it completes execution in the execute
stage, at which point its target is known. However, by that time five-stage pipeline has already started working on the next two instructions, one in fetch
and one in decode
. And a more realistic pipeline may have started working on the next 20 instructions? How does the processor know whether those instructions are from the "true" or "taken" side of the branch or the "false"/"not taken" side of the branch? How does a processor even know an instruction is a branch before it has decoded it?
It basically guesses based on historical behavior. The processor has a hardware table--essentially a cache--called a branch predictor that is indexed by instruction address and contains information about past branching behavior including direction (taken or not) and target. The predictor is updated when a branch executes and is consulted when the pipeline fetches the same instruction again. Because only branches update the branch predictor, if the processor does not find an entry in the predictor it essentially assumes the instruction is not a branch.
When the processor guesses the branch direction and target correctly, everything is cool. However, when it guesses incorrectly--this is called a branch mis-prediction--it has to flush all of the instructions after the branch from the pipeline and restart. In a modern processor, this means flushing anywhere between 20 and 40 instructions. This is where the performance penalty associated with branches comes from.
Modern processors are actually pretty good at predicting branches, with accuracies of 95% and above. However, those infrequent mis-predictions do hurt--a mis-predicted branch costs 20-40 times the average instruction. Part of high-performance programming is knowing how to minimize branch mis-predictions. And although that sounds impossible--how can you write a program with no if statements or for loops?--the fact is that there are idioms that result in branches that are more difficult to predict and you can reduce branch mis-predictions by avoiding those idioms.
Consider the following loop:
for (int SurfNum = 1; SurfNum <= state.dataSurfaces->TotSurfaces; ++SurfNum) {
SurfaceData &surf(state.dataSurfaces->Surface(SurfNum);
if (surf.Class != DataSurfaces::SurfaceClass::Window) continue;
// Do some window related stuff
}
Here are the corresponding pseudo-instructions, assume SurfNum
is in register R1
and &state.dataSurfaces->Surface
is in register R2
and state.dataSurfaces->TotSurfaces
is in R3
.
LOOP-BODY:
MUL R1, 200 -> R4 // 200 is the size of the Surface class in bytes (not really, but let's just say it is)
ADD R2, R4 -> R4
LOAD R4, 16 -> R4 // 16 is the offset of the Class field inside the Surface class (not really, but let's say it is)
COMPARE-EQ R4, 7 -> R4 // 7 is the integer value of DataSurfaces::SurfaceClass::Window)
BRANCH-IF-FALSE R4, LOOP-BODY: // surface class test branch
... // some window related stuff
ADD R1, 1 -> R1 // ++SurfNum
COMPARE-LEQ R1, R3 -> R4
BRANCH-IF-TRUE R4, LOOP-BODY: // loop continuation/termination branch
POST-LOOP-CODE:
There are two branches in this loop: the surface class test branch and the loop continuation/termination branch. They look similar but from the point of view of branch mis-predictions they could not be more different. The loop continuation/termination branch will be mis-predicted at most twice regardless of how long the loop is (on the first iteration and the last) and most processors have a trick for recognizing loop branches and mis-predicting them only once per loop (that trick is essentially a hysteresis that makes the branch have to go a certain way twice in a row before the branch predictor "changes its mind"). On the other hand, the surface class test branch will be mis-predicted every iteration in which surface class changes from window to non-window or vice versa relative to the previous iteration--this could be on many iterations! The moral of the story is there is no need to reduce loops and loop continuation/termination branches, but we should try to eliminate the pathologies that are associated with idioms like the surface class test.
One way to do this is to sort surfaces so that all window surfaces are adjacent and so window vs. non-window switches from one iteration to the next less frequently. Now, if you've done that and you know where the window surfaces are you may as well just iterate over that subset explicitly and not even do the check.
for (int ZoneNum = 1; ZoneNum <= state.dataGlobals->NumOfZones; ++ZoneNum) {
ZoneData &zone(state.dataGlobals->Zone(ZoneNum);
for (int SurfNum = zone.SurfWinFirst; SurfNum <= zone.SurfWinLast; ++SurfNum) {
// Do window-related stuff
}
}
This implementation reduces the number of window-related mis-predictions to one per Zone if that Zone actually has windows at all. It also reduces the total number of loop iterations and instructions, which is also excellent! This pattern has been implemented in EnergyPlus for the Surface array (here is an example) and should be propagated.
In addition to reducing branch mis-predictions, this idiom has a second benefit--by removing branches from loop bodies it facilitates compiler optimizations including vectorization (see this page about making code more vectorization friendly).
There are other idioms that result in data-driven branches that can be modified to reduce or even eliminate their occurrence. One of these is the compound conditional.
enum WinShadingFlag shading = state.dataSurfaces->SurfWinShadingFlag(SurfNum);
if (shading == WinShadingFlag::ExtBlind ||
shading == WinShadingFlag::IntBlind ||
shading == WinShadingFlag::BGBlind) {
// Do something specific to window blinds
}
Using the data dependent branch-to-loop continuation branch conversion trick is not going to work because window shading status is dynamic and windows cannot be statically sorted and grouped as such. Some kind of data-dependent branch may be necessary here, but the specific point is that this particular idiom generates not one by three data dependent branches because the C-language defines "short circuit" semantics for compound conditions, i.e., the conditions are evaluated left-to-right to the extent that it is needed to determine the overall outcome. In this specific case if shading
is equal to WinShadingFlag::ExtBlind
, then the condition evaluates to true and the second and third clauses are not evaluated at all. The result is that this idiom evaluates to three separate comparisons and branches, which can lead to three separate mis-predictions and also take up more space in the branch predictor, potentially evicting prediction information for other branches (remember the branch prediction is essentially a cache). There are several ways of converting these idioms into ones that have only a single (data-dependent) branch rather than three. The use of the switch
/case
construct is one. A switch
/case
statement is also a type of data dependent branch, the processor doesn't have to predict whether the switch will be "taken" or not, it always will be, but it does have to predict the address of the case
that it will jump to which can be just as difficult. However, switch
/case
allows the three possibilities WinShadingFlag::IntBlind
, ExtBlind
, and BGBlind
to be treated as a single case rather than three sequential cases. The use of bitfields is another possibility. These are discussed on this page.
One common code idiom in EnergyPlus--and in general--is conditional assignment of boolean, i.e, bool
, variables. Here is a random example from EnergyPlus. This nested if-else generates two data-dependent branches.
if (state.dataWindowAC->WindAC(WindACNum).DXCoilType_Num == CoilDX_CoolingHXAssisted) {
// Check for a setpoint at the HX outlet node, if it doesn't exist always run the HX
if (state.dataLoopNodes->Node(state.dataWindowAC->WindAC(WindACNum).CoilOutletNodeNum).HumRatMax == SensedNodeFlagValue) {
HXUnitOn = true;
} else {
HXUnitOn = false;
}
} else {
HXUnitOn = false;
}
However, when conditionally assigning a bool
the conditions inside the if
statements that lead to assignments of true
are themselves the value of the bool
being assigned. Instead of using the conditions to conditionally assign the bool
variable to true
, just directly assign the bool
variable to the conditions, using &&
to implement nesting.
auto &windAC(state.dataWindowAC->WindAC(WindACNum));
HXUnitOn = (windAC.DXCoilType_Num == CoilDX_CoolingHXAssisted &&
state.dataLoopNodes->Node(windAC.CoilOutletNodeNum).HumRatMax == SensedNodeFlagValue);
The compiler can often perform this optimization automatically, but in some situations it may not be able to. Either way, this simplifies the compilers job and is somewhat cleaner code too (in my opinion).
In addition to conditional control flow constructs, C also has a conditional value operator: <condition> ? <true-value> : <false-value>
. This construct is not a general replacement for if-else control flow because both <true-value>
and <false-value>
have to evaluate to values, but it can be useful shorthand if that is what you are doing. Here is another example from EnergyPlus. Original code:
if (_Qactual >= 0.0) {
_OutletWaterTemp = _InletWaterTemp - _Qactual / MdotCpWater;
} else {
_OutletWaterTemp = _InletWaterTemp;
}
And conditional value operator form:
_OutletWaterTemp = (_Qactual >= 0.0) ? (_InletWaterTemp - _Qactual / MdotCpWater) : _InletWaterTemp;
What is the difference between these two in terms of instructions generated? It turns out that x86_64 has an instruction call conditional move (CMOV
) which has similar semantics to the conditional value operator. Specifically, the semantics of CMOV R1, R2 -> R3
are "if R1 is true
, copy the value of R2
to R3
". The CMOV
instruction allows the compiler to convert data-dependent control flow which is likely to generate a branch mis-prediction to data-flow which is not. The trade-off is that the processor has to wait for R2
to be computed even if R1
is false and in some cases that could be a long time. In this specific case, the calculation of R2
includes a DIV
which takes 24 cycles, which is roughly the penalty associated with a branch mis-prediction. The rule of thumb is that computation latencies are preferred to branch mis-predictions because they can be overlapped with other computation while recovery from a branch mis-prediction cannot, and so as long as the conditional computation is not too expensive (e.g., 2X the cost of a branch mis-prediction), the CMOV
implementation is preferred.
Again, the compiler can often perform the conditional value operator/CMOV
optimization automatically, but sometimes it cannot and sometimes it may choose not to.
The most effective code transformation eliminates conditional code in favor of data lookup. This is not always possible, but when it is ... magic! Here is a some less-than-optimal code from EnergyPlus:
constexpr const char *DayTypes(int &d)
{
switch (d) {
case 1:
return "Sunday";
case 2:
return "Monday";
case 3:
return "Tuesday";
case 4:
return "Wednesday";
...
}
Remember, a switch
statement is a data-driven branch. This type of code should be replaced with:
constexpr std::array<std::string_view, 15> DayOfWeekNames = {"", "Sunday", "Monday", "Tuesday", "Wednesday", ... };
And references to the DayOfWeekNames
array. This idiom has many advantages over the switch
implementation, elimination of the data-driven branch mis-prediction is just one of them. These cases are the most obvious but there are other more subtle situations that you should keep an eye out for. Here is one from the ConvectionCoefficients module.