Sparky's Guide to Assembly Speedups on the EE Core

Editor's note: This is a modified and corrected version of Sparky's original post on playstation2-linux.com.

Dunch's post has inspired me to supply you with a little info on speeding up code for the main CPU. In many ways it's actually more difficult than speeding up VU code, but it's also very interesting and educational.

In various places in the manual you'll see something referred to as throughput/latency. "Throughput" is how soon you can issue another instruction of the same type that doesn't use the result of the current one, and "latency" is how soon you can issue an instruction that reads the result of the current one.

For instance, most of the FPU instructions are 1/4 instructions, meaning throughput is 1 and latency is 4.

A little example of this:

mul.s $f3, $f0, $f0   # 1. cycle
add.s $f4, $f0, $f1   # 2. cycle
sub.s $f5, $f1, $f2   # 3. cycle
abs.s $f6, $f0        # 4. cycle
mul.s $f7, $f3, $f0   # 5. cycle

So the result in $f3, launched in the 1. cycle, is ready for reading in the 5. cycle. If you try to read it sooner, the CPU will stall until the result is ready, so you might as well find something useful to do in the gap in between.
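To make the arithmetic concrete, here is a minimal sketch in C of the stall rule described above. It is only a model, not anything from the manual: the Insn struct, the register numbering and the function name are made up for illustration, and it assumes a single issue slot where every instruction has throughput 1 and latency 4.

```c
#include <assert.h>

/* Hypothetical description of one instruction: destination register and
   up to two source registers (-1 = unused). Not a real encoding. */
typedef struct { int dst, src0, src1; } Insn;

enum { NUM_REGS = 32, FPU_LATENCY = 4 };

/* Returns the 1-based cycle in which the last instruction issues, assuming
   throughput 1 and a latency of FPU_LATENCY for every instruction. */
int last_issue_cycle(const Insn *code, int n)
{
    int ready[NUM_REGS] = {0};  /* first cycle in which each register may be read */
    int cycle = 0;

    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int issue = cycle + 1;  /* throughput of 1: at most one issue per cycle */

        assert(code[i].dst >= 0 && code[i].dst < NUM_REGS);
        /* Stall until every source register is readable. */
        if (code[i].src0 >= 0 && ready[code[i].src0] > issue)
            issue = ready[code[i].src0];
        if (code[i].src1 >= 0 && ready[code[i].src1] > issue)
            issue = ready[code[i].src1];

        ready[code[i].dst] = issue + FPU_LATENCY;  /* readable 4 cycles after issue */
        cycle = issue;
    }
    return cycle;
}
```

Fed the five instructions from the example, the model puts the second mul.s in cycle 5. Fed only the two mul.s instructions back to back, it also puts the second one in cycle 5, which is the point: the three instructions in between cost nothing.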

The accumulator instructions are nice because there's no penalty for adding to the intermediate result, which is always in the acc register.

mula.s $f0, $f0       # 0. cycle, acc = f0*f0
madda.s $f1, $f1      # 1. cycle, acc += f1*f1
madda.s $f2, $f2      # 2. cycle, acc += f2*f2
...
madd.s %0, $fn, $fn   # n. cycle, %0 = acc + fn*fn
abs.s tmp, tmp        # n+1. cycle
abs.s tmp2, tmp2      # n+2. cycle
abs.s tmp3, tmp3      # n+3. cycle
add.s %1, %0, %0      # n+4. cycle (%0 ready)

So %0 is ready for use 4 cycles after the final madd.s instruction. Again, you can try to use it right away, but the CPU will just stall. By the way, GCC for ps2/ps2linux doesn't know how to use the accumulator instructions.
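For reference, the accumulator chain above computes a sum of squares. A plain C version of the same computation might look like the sketch below. The function name is made up for illustration, and this is only a reference to check a hand-written assembly version against, not what the compiler will turn into mula.s/madda.s.

```c
#include <assert.h>
#include <stddef.h>

/* Reference for the mula.s / madda.s / madd.s chain: v[0]*v[0] + ... + v[n-1]*v[n-1].
   The local variable acc plays the role of the acc register. */
float sum_of_squares(const float *v, size_t n)
{
    float acc = 0.0f;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        acc += v[i] * v[i];
    return acc;
}
```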

Integer operations are 1/1 instructions, except for the mul and div instructions. Be careful with these, since there are macros that expand what you might think is one instruction into several. One example is "li $t0, 0xdeadbeef", which expands into "lui $t0, 0xdead" and "ori $t0, $t0, 0xbeef". There's also "rol", which expands into two shifts and an or, and many, many other pseudo instructions.

Using -S will not give you the final assembler code, so if you want to know for sure that you got the code you were expecting, you should use some kind of disassembler, perhaps gdb or objdump --disassemble mycode.elf. I also suggest you insert .set noreorder at the top of your loop and .set reorder at the bottom, basically telling the assembler to butt out. It often inserts more into your code than you bargained for.
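The li expansion mentioned above is just a split of the 32-bit constant into two 16-bit immediates. Here is a small sketch in C of that split. The LiParts struct and the split_li name are made up for illustration, and it only covers the general lui + ori case, not whatever shorter forms the assembler may pick for small constants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two immediates of the lui/ori pair. */
typedef struct { uint16_t upper, lower; } LiParts;

/* Splits a 32-bit constant into the lui immediate (upper half)
   and the ori immediate (lower half). */
LiParts split_li(uint32_t value)
{
    LiParts p;

    p.upper = (uint16_t)(value >> 16);      /* immediate for lui */
    p.lower = (uint16_t)(value & 0xFFFFu);  /* immediate for ori */

    /* lui followed by ori must rebuild the original constant. */
    assert((((uint32_t)p.upper << 16) | p.lower) == value);
    return p;
}
```

For 0xdeadbeef this gives 0xdead and 0xbeef, matching the two instructions quoted above.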

Another interesting aspect is instruction pairing. Many instructions can execute two per cycle if you set up your code with care. Have a look at the EE Core User's Manual and find the table called "Categories of instructions and Routing to Physical pipes". I'm looking at an older version of the manual just now, but I think someone told me it's on page 21; anyway, it's easy enough to find. Two * (stars) mean the instruction needs both pipes. Two O's mean it can use either of the two pipes, so one of them must be available. A single O means it must execute in that pipe. The pairing strategy is simple: couple instructions that don't collide on pipes.

For instance, you can do two integer operations in one cycle but you can NOT do two FPU operations. However, you can couple one FPU instruction with an integer, load/store, branch or a few other instructions. MM instructions can't couple with anything that uses either of the integer pipes (i0 and i1), but they can be coupled with a load/store, COP1 move, COP2 move or branch. If you play around with this, I do suggest putting .set noreorder at the top and .set reorder at the bottom and double-checking your code with objdump, since it's extremely important that you get the instruction sequence you were expecting.

MM instructions can't pair with a nop, so you will have to design a nop for them if you want to maintain pairing alignment. I suggest using something like "cfc2 $0, $vi00" or "qmfc2 $0, $vf00" as a 'nop partner' for MM instructions.

I would also like to add that you can't couple two integer operations if the first produces a result that the second reads.

This example executes in two cycles (no pairing):

addiu $t0, $0, 7
add $t1, $t0, $t0

This example executes as a pair in one cycle:

addiu $t0, $0, 7
addiu $t1, $0, 7+7
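The dependency rule from the two examples above can be written down as a tiny check. This is a sketch in C under my own assumptions: the IntOp struct and can_pair name are made up, registers are plain numbers ($t0 = 8, $t1 = 9 in the usual MIPS numbering), and it models ONLY the "second reads what the first writes" rule, not the pipe collisions from the manual's table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of an integer op: destination register and
   up to two source registers (-1 = unused). */
typedef struct { int dst, src0, src1; } IntOp;

/* Returns true if the second op does not read the result of the first,
   i.e. the two ops pass the dependency rule for pairing. */
bool can_pair(IntOp first, IntOp second)
{
    assert(first.dst >= 0 && first.dst < 32);
    return second.src0 != first.dst && second.src1 != first.dst;
}
```

With first = addiu $t0, $0, 7, the check fails for add $t1, $t0, $t0 (it reads $t0) and passes for addiu $t1, $0, 7+7.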

Instructions are always fetched in pairs at 8-byte alignment, but execution pairing can occur at any alignment. If you designed a perfect 4-cycle loop (4 pairs, 8 instructions) and left it at an address that is not 8-byte aligned, the fetcher would need to fetch 5 times instead of 4, making it a 5-cycle loop even though you have 4 perfect pairs. In general, you make things much easier for yourself if you write .align 3 immediately above your loop (above the loop's label), which makes it 8-byte aligned.
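The fetch count is easy to work out by hand, but here is a minimal sketch in C of the arithmetic, assuming 4-byte instructions and 8-byte aligned fetches as described above. The function name and the example addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { INSN_BYTES = 4, FETCH_BYTES = 8 };

/* Number of 8-byte aligned fetches needed to cover n_insns instructions
   starting at loop_addr. */
unsigned fetches_per_iteration(uint32_t loop_addr, unsigned n_insns)
{
    assert(n_insns > 0);
    assert(loop_addr % INSN_BYTES == 0);  /* instructions are word aligned */

    uint32_t first = loop_addr / FETCH_BYTES;                             /* fetch block of the first instruction */
    uint32_t last = (loop_addr + n_insns * INSN_BYTES - 1) / FETCH_BYTES; /* fetch block of the last byte */
    return (unsigned)(last - first + 1);
}
```

For the 8-instruction loop this gives 4 fetches at address 0x1000 and 5 fetches at 0x1004, which is the extra cycle .align 3 saves you.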

Another thing you might want to know is that loading from memory is the same speed as loading from the scratchpad if it's a cache hit. A load is a 1/2 instruction, meaning you can do another bus access in the next cycle but you should not use the loaded register until the cycle after that. Example:

a1: lw $t0, 0(%0)       # 1. cycle
a2: nop
b1: lw $t1, 4(%0)       # 2. cycle
b2: nop
c1: add $t2, $t0, $t0   # 3. cycle
c2: nop

So $t0 is ready for use in the 3. cycle, assuming it's a cache hit or fetched from the scratchpad.

If you get a miss, the stall will be longer than 30 cycles (use DMA for intensive data reading).

The scratchpad is divided into 4 banks of 4 KB each. If you access a different bank than the one your last load/store used, there will be a penalty of 1 cycle, and you will not be able to fill the stall with another instruction. You will be stalled unconditionally for 1 cycle, so don't switch banks too often. Another weird thing: you also lose 1 cycle when the sign of the offset changes from one access to the next, regardless of the base register. In this sample I'll assume %0 and %1 point to somewhere on the scratchpad.

a1: lw $t0, 4(%0)    # 1. cycle
a2: nop
b1: lw $t1, -4(%1)   # 2. cycle
b2: nop
c1: nop              # 4. cycle, because of the stall
c2: nop

Again we get an unconditional stall, no chance to fill the gap with other code.

So make sure you fiddle with the base pointers before entering a loop, so that all your base registers (%0 and %1 in this case) give you either negative offsets all the way through the loop or positive offsets all the way. This is ONLY an issue when dealing with the scratchpad.
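If you want to count the damage for a planned access pattern, the two scratchpad rules above can be modelled as below. This is a sketch in C with several assumptions of mine that are NOT confirmed by the text: the banks are contiguous 4 KB blocks, base addresses are relative to the start of the scratchpad, an offset of 0 counts as positive, and the bank penalty and the sign penalty simply add up. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one scratchpad access: base register value
   (relative to the start of the scratchpad) and the signed offset. */
typedef struct { uint32_t base; int32_t offset; } SprAccess;

enum { BANK_BYTES = 4096, NUM_BANKS = 4 };

/* ASSUMPTION: banks are contiguous 4 KB blocks. */
static unsigned bank_of(uint32_t addr)
{
    return (addr / BANK_BYTES) % NUM_BANKS;
}

/* Counts unconditional stall cycles over a sequence of accesses:
   1 for each bank switch and 1 for each change of offset sign. */
unsigned stall_cycles(const SprAccess *a, size_t n)
{
    unsigned stalls = 0;

    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        unsigned prev_bank = bank_of(a[i - 1].base + (uint32_t)a[i - 1].offset);
        unsigned bank = bank_of(a[i].base + (uint32_t)a[i].offset);

        if (bank != prev_bank)
            stalls++;                                  /* bank switch */
        if ((a[i - 1].offset < 0) != (a[i].offset < 0))
            stalls++;                                  /* offset sign changed */
    }
    return stalls;
}
```

In this model the sample above (offset 4 followed by offset -4 in the same bank) costs 1 stall cycle, and the same two loads with offsets 4 and 8 cost none.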

If you think any of this was too strange to understand or badly described, don't hesitate to ask.

Sparky.