A Practical GCC Trick To Use During Optimization
Mike Acton
April 15, 2006
April 15, 2006
Splitting a basic block (by force)
Warning: This is a trick to use during optimization. It is not documented nor gauranteed to work across multiple platforms or different revisions of the compiler. Many programmers will say that this non-portable code should not be used in production, but there is such a huge practical benefit to using it has to be mentioned.
One of the first things that the compiler's scheduler does is organize the code into blocks of code without branches, out-of-line function calls or other optimization barriers. These basic blocks will then be scheduled among eachother based on their dependencies. Toward the end of scheduling individual instructions will be scheduled within basic blocks. Finally, instructions may be moved across block boundaries.
Inline assembly is most definately an optimization barrier and every block of inline assembly is treated as an independent basic block. This is why assembly should be inlined one instruction at a time - to give the compiler the maximum number of options for scheduling.
Why do I mention this?
Well, it turns out that if you have an empty inline assembly statement you can rely on the side-effect of splitting the basic block witout actually adding any instructions.
#define GCC_SPLIT_BLOCK __asm__ ("");
So in this case we want to split the block after the loads but before the calculations. The compiler will then schedule these two blocks separately, then try to merge them - but there will be nothing to merge since it will meet all the contraints.
I'll also split the comparisons and branches section of the code at the end so it's easier to see what's happening (so easier to optimize). There's no problem with letting the compiler mix the instructions between the FPU calculations and the comparisons and branches, though.
So the new version:
bool overlaps(const Box& b) const
{
//
// LOADS
//
const float a_c0 = m_center[0];
const float a_c1 = m_center[1];
const float a_c2 = m_center[2];
const float a_e0 = m_extent[0];
const float a_e1 = m_extent[1];
const float a_e2 = m_extent[2];
const float b_c0 = b.m_center[0];
const float b_c1 = b.m_center[1];
const float b_c2 = b.m_center[2];
const float b_e0 = b.m_extent[0];
const float b_e1 = b.m_extent[1];
const float b_e2 = b.m_extent[2];
GCC_SPLIT_BLOCK
//
// CALCULATIONS
//
const float delta_c0 = a_c0 - b_c0;
const float delta_c1 = a_c1 - b_c1;
const float delta_c2 = a_c2 - b_c2;
const float abs_delta_c0 = ::fabs( delta_c0 );
const float abs_delta_c1 = ::fabs( delta_c1 );
const float abs_delta_c2 = ::fabs( delta_c2 );
const float sum_e0 = a_e0 + b_e0;
const float sum_e1 = a_e1 + b_e1;
const float sum_e2 = a_e2 + b_e2;
GCC_SPLIT_BLOCK
//
// COMPARES AND BRANCHES
//
const bool in_0 = abs_delta_c0 <= sum_e0;
const bool in_1 = abs_delta_c1 <= sum_e1;
const bool in_2 = abs_delta_c2 <= sum_e2;
const bool result = in_0 && in_1 && in_2;
return (result);
}
The final output is exactly what we were expecting. It's very obvious from output that the code was scheduled on either side of the splits. The new output:
_Z12test_overlapRK3BoxS1_:
//
// PUSH STACK
//
stwu 1,-16(1)
//
// LOADS
//
lfs 4,20(3)
lfs 3,20(4)
lfs 1,0(3)
lfs 13,4(3)
lfs 12,8(3)
lfs 11,12(3)
lfs 10,16(3)
lfs 9,0(4)
lfs 8,4(4)
lfs 7,8(4)
lfs 6,12(4)
lfs 5,16(4)
//
// CALCULATIONS
//
fsubs 0,1,9
fsubs 2,13,8
fsubs 1,12,7
fadds 11,11,6
fadds 10,10,5
fadds 4,4,3
fabs 0,0
fabs 13,2
fabs 12,1
//
// COMPARES AND BRANCHES
//
fcmpu 7,13,10
li 3,0
crnot 30,29
fcmpu 1,0,11
mfcr 0
rlwinm 0,0,31,1
fcmpu 6,12,4
crnot 26,25
cmpwi 7,0,0
mfcr 0
rlwinm 0,0,27,1
bgt- 1,.L14
cmpwi 6,0,0
beq- 7,.L14
beq- 6,.L14
li 3,1
//
// POP STACK AND RETURN
//
.L14:
addi 1,1,16
blr
ABOUT THE AUTHOR
Mike Acton is the director and adminstrator of CellPerformance,
is dedicated to helping the Cell community as much as possible
and plies his trade developing technology for PS3 games. Mike has made regular appearances as a speaker at SCEA develpment conferences and other events. Mike Acton is not a framework-happy C++ programmer. He actually likes C. And assembly. In his spare time he develops hardware on FPGAs in VHDL.
He prefers vi.
Also check out the CellPerformance main site, which has additional articles written by Mike Acton.