Over the past few weeks I’ve been examining the performance of various algorithms on XScale and OMAP to try to come up with some coding rules for squeezing the most performance out of the cache. What I’ve observed is that the STM instruction is treated very differently on the OMAP SmartPhones I have at my disposal: it performs much faster than discrete STRs (about 3x in my observations), probably because it doesn’t read any data to fill the untouched portion of the cache line. If that’s right, compiled C code will never get this benefit. Someone correct me if I’m wrong, but I’ve never seen a case where a C compiler knows to use an STM instruction.
I offer this code as an example. It converts a scanline of 8-bit pixels into 16-bit pixels through a color lookup table. The loop is unrolled to gain a bit of advantage.
for (i=0;i<iWidth; i+=8)
{
ul = *(unsigned long *)&pSrc[0];
pulDest[0] = usPalConvert[ul & 0xff] | (usPalConvert[(ul >> 8 ) & 0xff] << 16); // unrolling the loop speeds it up considerably
pulDest[1] = usPalConvert[(ul >> 16) & 0xff] | (usPalConvert[ul >> 24] << 16);
ul = *(unsigned long *)&pSrc[4];
pulDest[2] = usPalConvert[ul & 0xff] | (usPalConvert[(ul >> 8 ) & 0xff] << 16); // unrolling the loop speeds it up considerably
pulDest[3] = usPalConvert[(ul >> 16) & 0xff] | (usPalConvert[ul >> 24] << 16);
pulDest += 4;
pSrc += 8;
}
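For anyone who wants to try the loop on a desktop machine, here is a minimal self-contained sketch of the same conversion. The function name `convert_scanline_8to16` and its parameter names are hypothetical, not taken from the original project. It assumes a little-endian CPU and a width that is a multiple of 8. It uses `uint32_t` and `memcpy` in place of the `unsigned long` pointer cast, so it doesn't depend on `long` being 32 bits or on `pSrc` being word aligned, and it casts each table entry before shifting so the value is never shifted into the sign bit of an `int`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper around the unrolled loop.
 * Assumptions: little-endian CPU, width is a multiple of 8,
 * dest has room for width/2 32-bit words (two 16-bit pixels each). */
static void convert_scanline_8to16(const uint8_t *src, uint32_t *dest,
                                   const uint16_t pal[256], size_t width)
{
    for (size_t i = 0; i < width; i += 8) {
        uint32_t ul;

        /* First four source pixels; memcpy avoids alignment problems. */
        memcpy(&ul, src, sizeof ul);
        dest[0] = pal[ul & 0xff] | ((uint32_t)pal[(ul >> 8) & 0xff] << 16);
        dest[1] = pal[(ul >> 16) & 0xff] | ((uint32_t)pal[ul >> 24] << 16);

        /* Next four source pixels. */
        memcpy(&ul, src + 4, sizeof ul);
        dest[2] = pal[ul & 0xff] | ((uint32_t)pal[(ul >> 8) & 0xff] << 16);
        dest[3] = pal[(ul >> 16) & 0xff] | ((uint32_t)pal[ul >> 24] << 16);

        dest += 4;
        src += 8;
    }
}
```

Each destination word should come out equal to `pal[src[2*k]] | (pal[src[2*k+1]] << 16)`, which is an easy property to check against a plain per-pixel loop. The compiler listing that follows is from the original fragment, not from this sketch.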
With full optimization, the compiler’s output is basically this. The listing shows just one pixel pair being converted and stored; the full 8-pixel loop is much longer:
ldr lr, [r5]
mov r3, lr, lsl #16
mov r3, r3, lsr #24
and r2, lr, #0xFF
add r3, r4, r3, lsl #1
ldrh r1, [r3]
add r2, r4, r2, lsl #1
ldrh r2, [r2]
mov r3, lr, lsl #8
mov r0, r3, lsr #24
orr r3, r2, r1, lsl #16
mov r2, lr, lsr #24
ldr lr, [r5, #4]
str r3, [r6]
It does a good job of register usage and avoiding pipeline stalls, but it just doesn’t take advantage of what the ARM is good at. My optimized routine (shown in its entirety) takes full advantage of the ARM barrel shifter and the STM instruction, and runs significantly faster thanks to the reduced instruction count and good cache usage:
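; In: r0 = source (8-bit pixels), r1 = dest (16-bit pixels), r2 = lookup table, r3 = number of 8-pixel groups (must be nonzero)
ARM816FAST proc
stmfd sp!,{r4-r11,lr}
mov r11,#0xff ; we need this constant for masking off pixels
mov r11,r11,LSL #1 ; r11 = 0x1FE; it can't be loaded directly because ARM immediates only allow even rotations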
arm816f_top
ldmia r0!,{r4-r5} ; get 8 source pixels
and r10,r11,r4,LSL #1
ldrh r6,[r2,r10] ; convert pixel 1
and r10,r11,r4,LSR #7
ldrh r8,[r2,r10] ; convert pixel 2
and r10,r11,r4,LSR #15
and r4,r11,r4,LSR #23
orr r6,r6,r8,LSL #16 ; combine pixels 1 and 2
ldrh r7,[r2,r10] ; convert pixel 3
ldrh r8,[r2,r4] ; convert pixel 4
and r10,r11,r5,LSL #1
orr r7,r7,r8,LSL #16 ; combine pixels 3 and 4
ldrh r8,[r2,r10] ; convert pixel 5
and r10,r11,r5,LSR #7
ldrh r9,[r2,r10] ; convert pixel 6
and r10,r11,r5,LSR #15
orr r8,r8,r9,LSL #16 ; combine pixels 5 and 6
ldrh r9,[r2,r10] ; convert pixel 7
and r5,r11,r5,LSR #23
ldrh r10,[r2,r5] ; convert pixel 8
subs r3,r3,#1 ; decrement count
orr r9,r9,r10,LSL #16 ; combine pixels 7 and 8
stmia r1!,{r6-r9} ; store the 8 pixels to dest
bne arm816f_top ; loop through entire line
ldmia sp!,{r4-r11,pc}
endp
If anyone can improve upon this code, I’d certainly like to see it.
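One part of the assembly routine that is easy to misread is the masking with 0x1FE. Each `and` folds the byte extraction and the multiply-by-two for the 16-bit table into a single instruction, so the result is already a byte offset that `ldrh` can add to the table base. Here is a rough C model of that step for checking the arithmetic. The helper names `pal_offset` and `pal_load` are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Byte offset into a table of 16-bit entries for pixel n (0..3) of word w.
 * Models: and r10,r11,r4,LSL #1 for n == 0,
 *         and r10,r11,r4,LSR #7 / #15 / #23 for n == 1, 2, 3. */
static uint32_t pal_offset(uint32_t w, unsigned n)
{
    const uint32_t mask = 0xffu << 1;     /* 0x1FE, the value held in r11 */

    if (n == 0)
        return (w << 1) & mask;           /* pixel 0 is shifted up by one */
    return (w >> (8 * n - 1)) & mask;     /* the others shift down by 8n - 1 */
}

/* Models ldrh rX,[r2,r10]: load a halfword at base + byte offset.
 * The offset is always even, so the access stays halfword aligned. */
static uint16_t pal_load(const uint16_t *pal, uint32_t byte_off)
{
    return *(const uint16_t *)((const unsigned char *)pal + byte_off);
}
```

For any word `w`, `pal_offset(w, n)` should equal `2 * ((w >> (8 * n)) & 0xff)`, so `pal_load(pal, pal_offset(w, n))` returns the same entry as `pal[(w >> (8 * n)) & 0xff]`.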