Larry’s Personal & Tech ramblings


ARM Multiply performance pt. 2

I wanted to revisit the multiply test because I hadn’t tested the difference between 32×32 and 16×16 multiplies. On the XScale PXA255 and above, both 32×32 and 16×16 multiplies take 1 clock cycle. On the OMAP 850 (and probably other OMAPs based on the ARM9 core), the 16×16 multiply takes 1 clock cycle and the 32×32 takes 2. That is useful to know if your code will be running on an OMAP and you really only need a 16×16 multiply.
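As a rough illustration of code that only needs a 16×16 multiply, here is a minimal C sketch of a dot product over 16-bit samples. The function name `dot16` is made up for this example, and whether a compiler actually emits a 16×16 multiply instruction for it depends on the compiler and target flags, so treat it as a sketch rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: dot product of two arrays of 16-bit samples.
   Both operands are int16_t, so every product fits in 32 bits and a
   compiler targeting an ARM core with 16x16 multiplies is free to use one.
   The caller must keep n small enough that the 32-bit sum does not overflow. */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Widen each operand before multiplying: a 16x16 -> 32 product. */
        sum += (int32_t)a[i] * (int32_t)b[i];
    }
    return sum;
}
```

Keeping the data in `int16_t` arrays is what tells the compiler the operands are only 16 bits wide; if the values are stored in 32-bit variables, it generally has to assume a full 32×32 multiply.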

L.B.

May 18, 2007 - Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, performance, pocket pc, smartphone, tech, xscale | 2 Comments

2 Comments »

  1. Some additions to the above:
    If you use the ARM11 as well, use smulbb instead of mul for a 16×16 multiply: mul takes 2 cycles to execute, smulbb takes 1 cycle. Similarly, use smlabb instead of mla for a 16×16 multiply-accumulate. If you need a 32×16 multiply, use smulwb instead of mul.

    Comment by prasad | May 19, 2007
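For readers who don't have the instruction set reference handy, the arithmetic performed by the three instructions named in the comment above can be modelled in portable C roughly as follows. The `model_*` and `sext16` names are invented for this sketch; it only models the result value, not the Q flag or any cycle timing.

```c
#include <stdint.h>

/* Sign-extend the bottom 16 bits of a 32-bit register value. */
static int32_t sext16(int32_t x)
{
    uint32_t u = (uint32_t)x & 0xFFFFu;
    return (int32_t)(u ^ 0x8000u) - 0x8000;
}

/* smulbb: signed 16x16 multiply of the bottom halfwords, 32-bit result. */
int32_t model_smulbb(int32_t rm, int32_t rs)
{
    return sext16(rm) * sext16(rs);
}

/* smlabb: the same 16x16 product added to a 32-bit accumulator.
   Assumes the sum does not overflow; the Q flag is not modelled. */
int32_t model_smlabb(int32_t rm, int32_t rs, int32_t acc)
{
    return acc + sext16(rm) * sext16(rs);
}

/* smulwb: 32-bit value times bottom halfword, keeping the top 32 bits
   of the 48-bit product. Assumes >> on a negative value is an
   arithmetic shift, which holds on the usual ARM compilers. */
int32_t model_smulwb(int32_t rm, int32_t rs)
{
    int64_t product = (int64_t)rm * sext16(rs);
    return (int32_t)(product >> 16);
}
```

Note that the "bb" variants ignore the top halves of both registers, so two 16-bit values packed into one register do not need to be unpacked first.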

  2. Hi Larry,

    You seem to have good knowledge of ARM assembler and I was wondering if you could help me out.

    I’ve left a message at the following link (it’s to a developers’ forum for the GP2X console) which I thought might be of interest to you. It would be great if you had a few moments to perhaps come up with a solution for me.

    http://www.gp32x.com/board/index.php?showtopic=37569&hl=

    How does a standard ‘C’ library handle a mathematical divide function if the native CPU does not have one? Obviously by a workaround, but I wonder how efficient that is. Surely a specifically written ARM assembler divide function must be quicker than a cobbled-together ‘C’ equivalent?

    Anyway, if you have a little spare time, I’d really appreciate any thought you could put into this.

    Thanks!

    Comment by Slaanesh | July 5, 2007
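On the division question: when the CPU has no divide instruction, the compiler typically emits a call to a runtime helper (on ARM EABI toolchains, for example, `__aeabi_uidiv`), and that helper is often hand-tuned already. The basic idea behind such a routine can be sketched as a shift-and-subtract loop. The code below is only an illustration of that technique, not the actual code of any particular library; the name `udiv32` and the divide-by-zero behaviour are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical shift-and-subtract unsigned division (one quotient bit
   per iteration). Returns n / d and stores n % d in *rem if rem is
   not NULL. Dividing by zero returns 0 with remainder n, which is an
   arbitrary choice for this sketch. */
uint32_t udiv32(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint32_t r = 0;

    if (d == 0) {
        if (rem) *rem = n;
        return 0;
    }

    for (int i = 31; i >= 0; i--) {
        /* Bring down the next bit of the dividend, MSB first. */
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;                  /* divisor fits: subtract it...   */
            q |= (uint32_t)1 << i;   /* ...and set this quotient bit.  */
        }
    }

    if (rem) *rem = r;
    return q;
}
```

A loop like this costs a fixed 32 iterations per divide. Tuned assembler versions usually skip leading zero bits or unroll the loop, which is where most of the speed difference comes from.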
