Larry’s Personal & Tech ramblings

Coding for Speed (Q&D advice for targeting ARM)

Anyone who knows me is familiar with my Q&D (quick & dirty) approach to just about everything.  My feeling is that if you can get 95% of it done in a short time, but 100% takes a long time, be happy with 95% :) .

I thought it might be useful to discuss some simple ideas for making your applications run faster.  These ideas are directed at ARM-based devices, but the same concepts apply to pretty much any computing platform.  Here is a quick list of do’s and don’ts for the C and ASM programmer:

1) Avoid using divide and modulus (%) (C/C++)

99% of all ARM embedded and portable devices in use today support the ARMv5 instruction set, which does not include a divide instruction.  Doing a divide calls a long and ugly library function.  Along the same line of thought, I always get a good laugh when I see someone do this:

i = j % 8;

For those of you who don’t get the joke, using MOD to get the remainder from a power-of-2 constant can force a call to the divide function.  The correct code (which becomes a single instruction) is:

i = j & 7;

Perhaps a smart compiler converts it when it sees nice constants (the two forms give the same result only when j is unsigned or known to be non-negative), but replacing the 8 with a variable will force the compiler to use divide.  X86 systems do have a divide instruction, but it’s still a good idea to avoid using modulus when it’s not necessary.
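Here is a minimal sketch of the variable case in C, assuming unsigned operands and a divisor that the caller knows is a power of 2.  The helper names (is_pow2, remainder_pow2) are hypothetical and exist only for this illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if n is a power of 2 (1, 2, 4, 8, ...). */
static int is_pow2(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Hypothetical helper: computes j % n with a mask instead of a divide.
   Assumption: j and n are unsigned and n is a power of 2. */
static uint32_t remainder_pow2(uint32_t j, uint32_t n)
{
    assert(is_pow2(n));     /* debug-only check of the assumption */
    return j & (n - 1);     /* subtract + AND, no divide routine */
}
```

With a constant such as 8, most compilers make this substitution on their own for unsigned operands; a helper like this matters when the power-of-2 divisor is only known at run time.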

2) Be aware of the pipelined nature of the beast (ASM)

The ARM uses a pipelined architecture which usually goes something like this:

Fetch -> Decode -> Execute

The worst case is reading data from memory.  It takes the full 3 clocks to have the data ready in a register, so avoid depending on that register being ready to use immediately after a read.  Here are some examples:

Bad:

ldr r0,[r1],#4
add r2,r2,r0      ; pipe stall here will eat 2 clocks waiting for r0
subs r3,r3,#1

Better:

ldr r0,[r1],#4
subs r3,r3,#1   ; we’ve salvaged 1 wasted clock, but still have another
add r2,r2,r0

Best:

ldr r0,[r1],#4
subs r3,r3,#1
<another instruction which doesn’t touch r0>
add r2,r2,r0

I hope this makes it clear.  The other thing to avoid is branches, which take 3 clocks and flush the instruction queue.  Conditional execution of every instruction on the ARM makes branches far less necessary.
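The load scheduling idea can also be expressed in C, although the compiler has the final say over instruction order, so treat this as a rough sketch and not a guarantee.  The hypothetical sum_words function below starts each load one pass before its value is used, so the add never depends on the load issued just before it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum 'count' words, issuing each load one
   loop pass before the loaded value is consumed. */
static uint32_t sum_words(const uint32_t *data, size_t count)
{
    uint32_t sum = 0;
    uint32_t next;
    size_t i;

    assert(data != NULL || count == 0);
    if (count == 0)
        return 0;

    next = data[0];             /* first load, started early */
    for (i = 1; i < count; i++) {
        uint32_t cur = next;    /* value loaded on the previous pass */
        next = data[i];         /* start the next load now */
        sum += cur;             /* does not depend on the load just issued */
    }
    return sum + next;          /* add the last loaded value */
}
```

Whether this actually helps depends on the compiler and the core; checking the generated assembly is the only way to know if the load and its use ended up separated.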

3) Understand what the machine is actually doing (C/C++)

The most disturbing thing for me is when I find someone who has a total non-understanding of how computers work and writes code like this:

#include <string.h>

void really_bad_program(void)
{
    unsigned char uc1, uc2;

    uc1 = 1;
    memcpy(&uc2, &uc1, 1);
}

Believe it or not, I’ve actually seen this in a shipping product.  I’m not going to give a long explanation here; I’ll leave it as an exercise for those who don’t see what the problem is to find out on their own.  My point is to be aware of what high-level code gets turned into by the compiler.  Know the difference between an operator and a function.
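For comparison, here is a small sketch of an operator next to a function, using hypothetical names that are not from the product mentioned above.  Whether the library call is really emitted depends on the compiler and its settings; some compilers replace a fixed 1-byte memcpy with a plain move, but that is an optimization and not something to count on:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical: copies one byte through the library function. */
static unsigned char copy_with_function(unsigned char src)
{
    unsigned char dst;
    memcpy(&dst, &src, 1);      /* may become a real function call */
    return dst;
}

/* Hypothetical: copies one byte with the assignment operator. */
static unsigned char copy_with_operator(unsigned char src)
{
    unsigned char dst = src;    /* typically a single register move */
    return dst;
}

/* Both produce the same value; only the cost can differ. */
void check_copies(void)
{
    assert(copy_with_function(1) == copy_with_operator(1));
}
```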

That’s all for now.  If anyone finds this useful, I’ll continue this thread.

March 16, 2007 - Posted by bitbank | arm, asm, assembly language, benchmark, optimization, optimização, pocket pc, smartphone, tech, wince, xscale

3 Comments

  1. nice work. Thanks.

    You could also add something here about using shift operations to replace multiplication.

    Comment by Naresh | April 11, 2007

  2. Can you point me to any document/PDF that has some optimization techniques for C/C++ on ARM9E? I do not want to jump to assembly directly.
    Thanks.

    Comment by Sachin | May 16, 2007

  3. Please have more discussion of how to make C/C++ code work better on ARM. Could you please tell me where I can find a programming manual for the TI OMAP 850 or similar programming information on the web? Since the OMAP 850 is a dual core chip, how do I maximize the use of both cores in my code? Your help is greatly appreciated.

    Thanks,

    Bill

    Comment by Bill | June 4, 2007
