ARM Multiply performance pt. 2
I wanted to revisit the multiply test because I hadn’t tested the difference between 32×32 and 16×16 multiplies. On the XScale PXA255 and above, both 32×32 and 16×16 multiplies take 1 clock cycle. On the OMAP 850 (and probably other OMAPs based on the ARM9 core), the 16×16 multiply takes 1 clock and the 32×32 takes 2. That is useful to know if your code will be running on an OMAP and you really only need a 16×16 multiply.
L.B.
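As a rough illustration of the point above, here is a minimal C sketch of a 16×16 multiply. The names mul16x16 and dot16 are made up for this example. Narrowing both operands to int16_t tells the compiler that only the low halfwords matter, which leaves it free to pick a halfword multiply on cores that have one. Whether it actually does so depends on the compiler, its flags and the target CPU, so check the generated assembly before relying on it.

```c
#include <stdint.h>

/* 16x16 -> 32 signed multiply (hypothetical helper name).
   Only the low 16 bits of each operand are used. Converting an
   out-of-range value to int16_t is implementation-defined in C;
   common compilers simply keep the low halfword. */
static inline int32_t mul16x16(int32_t a, int32_t b)
{
    return (int32_t)(int16_t)a * (int32_t)(int16_t)b;
}

/* Dot product of two int16_t arrays, the kind of loop where a
   1-cycle vs 2-cycle multiply can make a visible difference. */
int32_t dot16(const int16_t *x, const int16_t *y, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += mul16x16(x[i], y[i]);
    return acc;
}
```

If the inputs are already stored as 16-bit values, as in dot16, the casts cost nothing and simply document the intent.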
Some additions to the above:
If you are also using ARM11, use smulbb instead of mul for a 16×16 multiply: mul takes 2 cycles to execute, smulbb takes 1 cycle. Similarly, use smlabb instead of mla for a 16×16 multiply-accumulate. If you need 32×16, use smulwb instead of mul.
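For readers unsure what the 32×16 case in the comment above computes: smulwb multiplies a 32-bit value by the signed bottom halfword of the second operand and keeps the top 32 bits of the 48-bit product. Below is a portable C model of that arithmetic, not the instruction itself. The name mul32x16_hi is invented for this sketch, and whether a compiler turns it into smulwb is not guaranteed.

```c
#include <stdint.h>

/* Portable model of the result of SMULWB (hypothetical helper name):
   a * (signed low 16 bits of b), then drop the bottom 16 bits.
   Assumes >> on a negative int64_t is an arithmetic shift, which is
   implementation-defined in C but true of the usual ARM compilers. */
static inline int32_t mul32x16_hi(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * (int16_t)b;  /* fits in 48 bits */
    return (int32_t)(product >> 16);
}
```

This is handy for fixed-point code: with a in 16.16 format and b a plain 16-bit integer or Q15 coefficient, the result stays in a 32-bit register without a separate shift.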
Hi Larry,
You seem to have good knowledge of ARM assembler and I was wondering if you could help me out.
If you have a look at the following link (it’s to a developers’ forum for the GP2X console), I’ve left a message which I thought might be of interest to you, and it would be great if you had a few moments to perhaps come up with a solution for me.
http://www.gp32x.com/board/index.php?showtopic=37569&hl=
How does a standard ‘C’ library handle a mathematical divide function if the native CPU does not have one? Obviously with a workaround, but I wonder how efficient that is. Surely a specifically written ARM assembler divide function must be quicker than a cobbled-together ‘C’ equivalent?
Anyway, if you have a little spare time, I’d really appreciate any thought you could put into this.
Thanks!
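As background to the question above: on a CPU without a divide instruction, the compiler replaces each division with a call to a runtime helper (on ARM EABI toolchains the unsigned 32-bit one is named __aeabi_uidiv). The sketch below shows the classic shift-and-subtract method in plain C. It is an illustration of the general technique only, not the code any particular library ships. The name udiv32 and the choice to return 0 on division by zero are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned 32-bit shift-and-subtract (restoring) division.
   Returns the quotient and, if rem is not NULL, stores the remainder.
   Division by zero returns 0 with the remainder set to num; that
   policy is an arbitrary choice for this sketch. */
uint32_t udiv32(uint32_t num, uint32_t den, uint32_t *rem)
{
    uint32_t quot = 0;
    uint32_t r = 0;

    if (den == 0) {
        if (rem != NULL)
            *rem = num;
        return 0;
    }

    /* Bring down one bit of the numerator at a time, from the top. */
    for (int bit = 31; bit >= 0; bit--) {
        r = (r << 1) | ((num >> bit) & 1u);
        if (r >= den) {          /* divisor fits: subtract, set bit */
            r -= den;
            quot |= (uint32_t)1 << bit;
        }
    }

    if (rem != NULL)
        *rem = r;
    return quot;
}
```

This version always runs 32 iterations. Tuned routines usually skip leading zero bits first and unroll the loop, and when the divisor is a known constant, multiplying by a precomputed reciprocal is typically faster still.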