Larry’s Personal & Tech ramblings

Just another WordPress.com weblog

ARM JPEG Benchmarks part 2

I thought it would be useful to re-run the tests with the C version of my JPEG code. From the results it appears that memory bandwidth is the real limiting factor to the speed and the pixel colorspace conversion gets the most benefit from my optimized ARM assembly language. Also it appears that the OMAP gains more from optimized ASM than the XScale does. Here are the numbers:

C-Code:

PPC: thumbnail: 10.7 milliseconds, DC only: 968 milliseconds, full res: 3734 milliseconds.
SP: thumbnail: 25.1 milliseconds

Mixed C and ASM

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The load times for the “DC only” and “full res” tests include the time taken to read 4.3MB of data from RAM through the WinCE file system.

These results make sense in that the real benefit of optimization comes from fixing the algorithms and reducing memory usage. The optimized ARM assembly code is certainly helpful in speeding things up, but won’t offer an order of magnitude improvement over what the compiler generates.
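To make the comparison easier to read, the timings above can be turned into speedup ratios. This is only a small sketch of the arithmetic; the helper names `speedup` and `print_speedups` are made up for illustration, and the numbers are copied from the two result lists in this post.

```c
#include <stdio.h>

/* Speedup of the mixed C/ASM build over the pure C build (times in ms). */
double speedup(double c_ms, double asm_ms)
{
    return c_ms / asm_ms;
}

/* Print the ratio for each test listed in this post. */
void print_speedups(void)
{
    printf("PPC thumbnail: %.2fx\n", speedup(10.7, 8.8));
    printf("PPC DC only:   %.2fx\n", speedup(968.0, 830.0));
    printf("PPC full res:  %.2fx\n", speedup(3734.0, 2700.0));
    printf("SP thumbnail:  %.2fx\n", speedup(25.1, 15.1));
}
```

On the Pocket PC the full res test, which is the one that converts every pixel, gains the most, and the OMAP SmartPhone gains more on the thumbnail test than the XScale does. Every ratio is well under 2x, nowhere near an order of magnitude.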

July 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM JPEG Benchmarks

The great thing about the ARM architecture is that the more I look at a piece of code, the more ways I find to optimize it. The conditional execution, barrel shifter and optional setting of the processor flags create many opportunities for optimization. I’ve spent some more time optimizing my ARM asm JPEG code and now have some hard numbers to publish. I used an HP iPAQ h2210 Pocket PC (400MHz PXA255) and an HTC Hurricane SmartPhone (195MHz OMAP 850) to do the testing. I was able to load the file from RAM on the Pocket PC (to reduce file I/O delays), but not on the SmartPhone, because the SmartPhone file system does not use RAM for file storage. The slow speed of reading from the miniSD card dominates the processing time in the tests, so the only test that was run on the SmartPhone was decompressing a 160×120 thumbnail image in RAM.

All tests decompress the image to an RGB565 bitmap. The thumbnail test decompresses the 160×120 EXIF thumbnail image. The “DC only” test creates a single pixel from each MCU (the 3072 x 2304 image is loaded as 384×288). The “Full res” test decompresses every pixel of the image.

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The speed difference between the two devices is to be expected considering the different processor and memory bus speeds. The “DC only” test is useful because it shows the relative speed of Huffman decoding. The file size is 4.3MB, so in 830 milliseconds the code was able to decode all of the MCUs (roughly 5MB of compressed data per second) and produce a single pixel from each one.

I’ve uploaded the sample image to my web server here: CIMG2209.JPG

The image was taken with a Casio EX-Z750 and depicts a relatively complex scene with many fine details. Like most cameras, the Exilim series saves JPEG images with 2:1 horizontal color subsampling (when set to maximum quality). It’s not unreasonable for a point-and-shoot camera like the Z750 to save images with subsampled color because the image coming off the CCD isn’t that great to begin with. What irks me is that cameras like the Canon 20D do the same thing. With a good SLR lens and imager, the Canon should allow you to save full res color JPEG images.

Comments?

July 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, photo, pocket pc, smartphone, viewer, wince, xscale | | 4 Comments

Having fun with JPEG decompression

An odd title considering that JPEG is a cryptic image compression standard.  My idea of fun is optimizing code until there’s nothing left to improve.  I decided a few weeks ago to take the plunge and rewrite the 3 core JPEG decode routines to speed up my imaging code.  One reason was that the great majority of cell phones today are based around the TI OMAP architecture, typically running at around 200MHz.  These devices seem slow at working with images, so I thought I could help that situation by speeding things up to improve both battery usage and the user experience.

The important, “inner loop” routines of JPEG image decoding are the Huffman decoding of the MCU (minimum coded unit), the IDCT (inverse discrete cosine transform), and the output stage (turning the YCbCr pixels into RGB pixels).  All 3 routines together turned out to be only a couple hundred lines of ARM code, but the result of rewriting them from C was quite dramatic.  The original C code has been optimized and tested over a long period of time and was in good shape to begin with, but C isn’t so great at bit manipulation and squeezing the most use out of register variables.  It took several iterations to get down to the bare minimum of code, but I’m quite happy with the results.  I used ARMv5 instructions, but made sure that the code performs well on both OMAP and XScale CPUs (unlike Intel’s Integrated Performance Primitives).  Luckily my previous performance testing of the multiply instructions helped guide me to shave a few clock cycles off several routines.  The purpose of this work is threefold:
1) I’m readying a new version of my imaging application (PQV - Pocket QuickView) for Windows Mobile and need it to be competitive with other products.  I pride myself on having the fastest viewer available.
2) I have been staring at the C code for a long time and wondering how much better it could perform if written in optimized ARM asm.
3)  I believe this code has value to anyone doing imaging or video on ARM based devices.  Web browsers, image viewers, camera applications, video players can all benefit from this code.
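For readers who haven’t written a JPEG output stage, here is a rough C sketch of what “turning the YCbCr pixels into RGB pixels” involves when the destination is RGB565. This is not the code from PQV. It uses the standard JFIF conversion coefficients scaled to 16.16 fixed point; the function names `clamp255` and `ycbcr_to_rgb565` are hypothetical, and it assumes a 32-bit int and that right-shifting a negative int is an arithmetic shift.

```c
/* Clamp an intermediate value to the 0..255 range of an 8-bit channel. */
static int clamp255(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return v;
}

/* Convert one YCbCr pixel to RGB565 with 16.16 fixed-point coefficients.
   Assumption: >> on a negative int is an arithmetic shift. */
unsigned short ycbcr_to_rgb565(int y, int cb, int cr)
{
    int r, g, b;

    cb -= 128;  /* chroma is stored with a +128 bias */
    cr -= 128;
    r = clamp255(y + ((91881 * cr + 32768) >> 16));               /* 1.402 * Cr */
    g = clamp255(y - ((22554 * cb + 46802 * cr + 32768) >> 16));  /* 0.344 * Cb + 0.714 * Cr */
    b = clamp255(y + ((116130 * cb + 32768) >> 16));              /* 1.772 * Cb */

    /* Pack as 5 bits red, 6 bits green, 5 bits blue. */
    return (unsigned short)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Three multiply-accumulates, three clamps and a pack per pixel is the sort of bit manipulation where hand-written ARM code has room to beat the compiler.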

I’ve been searching for the past week or so for customers of this code, but the typical response is the “not invented here” attitude standing in the way of improving products.

I will post some sample images and benchmarks shortly to back up my claims of fast JPEG decoding.

Anyone interested in licensing object or source code should contact me directly (bitbank@pobox.com).

June 21, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM Multiply performance pt. 2

I wanted to revisit the multiply test because I hadn’t tested the difference between 32×32 and 16×16 multiplies.  On the XScale PXA255 and above, both 32×32 and 16×16 multiplies take 1 clock cycle.  On the OMAP 850 (and probably other OMAPs based on the ARM9 core), the 16×16 multiply takes 1 clock and the 32×32 takes 2.  This is useful to know if your code will be running on the OMAP and you really only need a 16×16 multiply.
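From C, the usual way to ask for a 16×16 multiply is to keep the operands as 16-bit types so the compiler can see that only the low halves matter. The sketch below is a hypothetical multiply-accumulate loop (the name `dot16` is invented for this example). Whether the compiler actually emits the ARMv5TE 16×16 instructions (SMULBB/SMLABB) depends on the compiler and the target architecture flags, so treat that as an assumption and check the generated code.

```c
/* Sum of products of two arrays of signed 16-bit samples.
   Both operands are short, so a 16x16 multiply is sufficient. */
long dot16(const short *a, const short *b, int n)
{
    long sum = 0;
    int i;

    for (i = 0; i < n; i++)
        sum += (long)a[i] * b[i];  /* each product fits in 32 bits */
    return sum;
}
```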

L.B.

May 18, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, performance, pocket pc, smartphone, tech, xscale | | 2 Comments

C versus ARM ASM - a better example

Over the past few weeks I’ve been examining the performance of various algorithms on XScale and OMAP to try to come up with some coding rules for squeezing the most performance out of the cache. What I’ve observed is that the STM instruction is treated very differently on the OMAP SmartPhones I have at my disposal; it performs much faster than discrete STRs (3x in my observations), probably because it doesn’t read any data to fill the untouched portion of the cache line. Compiled C code will never take advantage of this. Someone correct me if I’m wrong, but I’ve never seen a case where a C compiler will know to use an STM instruction. I offer this code as an example. It converts a scanline of 8-bit pixels into 16-bit pixels through a color lookup table. The loop is unrolled to gain a bit of advantage.

for (i = 0; i < iWidth; i += 8) // 8 pixels per pass; unrolling the loop speeds it up considerably
{
    ul = *(unsigned long *)&pSrc[0]; // read 4 source pixels (unsigned long is 32 bits here)
    pulDest[0] = usPalConvert[ul & 0xff] | ((unsigned long)usPalConvert[(ul >> 8) & 0xff] << 16);
    pulDest[1] = usPalConvert[(ul >> 16) & 0xff] | ((unsigned long)usPalConvert[ul >> 24] << 16);
    ul = *(unsigned long *)&pSrc[4]; // read the next 4 source pixels
    pulDest[2] = usPalConvert[ul & 0xff] | ((unsigned long)usPalConvert[(ul >> 8) & 0xff] << 16);
    pulDest[3] = usPalConvert[(ul >> 16) & 0xff] | ((unsigned long)usPalConvert[ul >> 24] << 16);
    pulDest += 4;
    pSrc += 8;
}

The output of the compiler with full optimization is basically this (this code shows just a pixel pair being converted and stored, the full 8-pixel loop is much longer):

ldr lr, [r5]
mov r3, lr, lsl #16
mov r3, r3, lsr #24
and r2, lr, #0xFF
add r3, r4, r3, lsl #1
ldrh r1, [r3]
add r2, r4, r2, lsl #1
ldrh r2, [r2]
mov r3, lr, lsl #8
mov r0, r3, lsr #24
orr r3, r2, r1, lsl #16
mov r2, lr, lsr #24
ldr lr, [r5, #4]
str r3, [r6]

It does a good job of register usage and avoiding pipeline stalls, but it just doesn’t take advantage of what the ARM is good at. My optimized routine (shown in its entirety) takes full advantage of the ARM barrel shifter and the STM instruction to produce significantly faster output due to reduced instruction count and good cache usage:

ARM816FAST proc
stmfd sp!,{r4-r11,lr}
mov r11,#0xff ; we need this constant for masking off pixels
mov r11,r11,LSL #1 ; odd shifted constants are not allowed
arm816f_top
ldmia r0!,{r4-r5} ; get 8 source pixels
and r10,r11,r4,LSL #1
ldrh r6,[r2,r10] ; convert pixel 1
and r10,r11,r4,LSR #7
ldrh r8,[r2,r10] ; convert pixel 2
and r10,r11,r4,LSR #15
and r4,r11,r4,LSR #23
orr r6,r6,r8,LSL #16 ; combine pixels 1 and 2
ldrh r7,[r2,r10] ; convert pixel 3
ldrh r8,[r2,r4] ; convert pixel 4
and r10,r11,r5,LSL #1
orr r7,r7,r8,LSL #16 ; combine pixels 3 and 4
ldrh r8,[r2,r10] ; convert pixel 5
and r10,r11,r5,LSR #7
ldrh r9,[r2,r10] ; convert pixel 6
and r10,r11,r5,LSR #15
orr r8,r8,r9,LSL #16; combine pixels 5 and 6
ldrh r9,[r2,r10] ; convert pixel 7
and r5,r11,r5,LSR #23
ldrh r10,[r2,r5] ; convert pixel 8
subs r3,r3,#1 ; decrement count
orr r9,r9,r10,LSL #16 ; combine pixels 7 and 8
stmia r1!,{r6-r9} ; store the 8 pixels to dest
bne arm816f_top ; loop through entire line
ldmia sp!,{r4-r11,pc}
endp

If anyone can improve upon this code, I’d certainly like to see it.

April 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optmization, performance, pocket pc, smartphone, tech, wince | | 3 Comments

Working with the cache on ARM/OMAP

I decided to revisit my performance testing taking into account how the data cache affects performance. I wanted to see about speeding up JPEG decoding and how writing “the wrong way” into memory affected its performance. The JPEG test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a line of 8×8 macroblocks (8 scanlines, or 25,600 bytes) won’t fit entirely in the 16k data cache. What I found was surprising; I got results similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing individual 16-bit pixels (8 per line) takes 3 times as long as writing them in one shot with an STMIA of 4 registers. I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply write through. On the XScale, the STM was actually a bit slower. This implies that the OMAP is smart about what type of write occurs and the XScale isn’t, and also that the OMAP doesn’t try to fill the rest of the cache line when doing a partial write if you use STM. Can anyone verify this theory?

April 3, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, optimization, performance, tech, xscale | | 3 Comments

ARM Multiply Performance

Someone asked me to do some testing of the performance of the ARM multiply instruction.  I hadn’t included it in my previous performance tests because it didn’t occur to me; I don’t use it in the inner loops of game emulators.

I decided to see if there was a difference in performance when working with different data values (e.g. multiplying by zero) and on the XScale vs. OMAP CPUs.  The first test showed that there is no difference in the performance when working with zero and non-zero data.  The second test showed that the XScale has a much faster implementation of multiply than the OMAP.  On my 400MHz PXA255 handheld, my tests showed that the unsigned multiply instruction (MUL) takes just 1 clock cycle, but on the OMAP 850 (used in many SmartPhones) it takes 2 clocks.  I haven’t tested the 32×32 multiply because it’s in the ARM5 instruction set and the VS2005 C compiler generates ARM4 compatible code.

March 26, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, optimização, performance, pocket pc, smartphone, tech | | 2 Comments

Coding for Speed (Q&D advice for targeting ARM)

Anyone who knows me is familiar with my Q&D (quick & dirty) approach to just about everything.  My feeling is that if you can get 95% of it done in a short time, but 100% takes a long time, be happy with 95% :) .

I thought it might be useful to discuss some simple ideas for making your applications run faster.  These ideas are directed at ARM based devices, but the same concepts apply to pretty much any computing platform.  Here is a quick list of dos and don’ts for the C and ASM programmer:

1) Avoid using divide and modulus (%) (C/C++)

99% of all ARM embedded and portable devices in use today support the ARM5 instruction set which does not include divide.  Doing a divide calls a long and ugly function.  Along the same line of thought, I always get a good laugh when I see someone do this:

i = j % 8;

For those of you who don’t get the joke, using MOD to get the remainder from a power of 2 constant forces the use of a divide instruction.  The correct code (which becomes a single instruction) is:

i = j & 7;

Perhaps a smart compiler converts it when it sees nice constants, but replacing the 8 with a variable will force the compiler to use divide.  X86 systems do have a divide instruction, but it’s still a good idea to avoid using modulus when it’s not necessary.
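Here is a small sketch of the mask trick, including the case where the divisor is a variable. The function names are invented for this example. One caveat worth stating: the two forms only agree for unsigned (or non-negative) values, and the variable version assumes the divisor really is a power of 2.

```c
/* Remainder of j divided by 8, as a single AND. */
unsigned mod8(unsigned j)
{
    return j & 7u;
}

/* Remainder of j divided by n, where n MUST be a power of 2.
   This avoids the divide even though n is not a compile-time constant. */
unsigned mod_pow2(unsigned j, unsigned n)
{
    return j & (n - 1u);
}

/* Caveat for signed values: in C99, -3 % 8 is -3, but -3 & 7 is 5
   on a two's complement machine, so don't mix the two blindly. */
```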

2) Be aware of the pipelined nature of the beast (ASM)

The ARM uses a pipelined architecture which usually goes something like this:

Fetch -> Decode -> Execute

The worst case is reading data from memory.  It takes the full 3 clocks to have the data ready in a register, so avoid depending on that register being ready to use immediately after a read.  Here are some examples:

Bad:

ldr r0,[r1],#4
add r2,r2,r0      ; pipe stall here will eat 2 clocks waiting for r0
subs r3,r3,#1

Better:

ldr r0,[r1],#4
subs r3,r3,#1   ; we’ve salvaged 1 wasted clock, but still have another
add r2,r2,r0

Best:

ldr r0,[r1],#4
subs r3,r3,#1
<another instruction which doesn’t touch r0>
add r2,r2,r0

I hope this makes it clear.  The other thing to avoid is branches, which take 3 clocks and flush the instruction queue.  Conditional execution of every instruction on the ARM makes branches far less necessary.
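The same load scheduling idea can be expressed in C by starting the next load before using the value from the previous one. This is a minimal sketch under the assumption that the compiler keeps roughly this ordering; an optimizing compiler is free to reschedule the loads itself, so check the generated asm. The function name `sum_words` is hypothetical.

```c
#include <stddef.h>

/* Sum n words, issuing each load one iteration before its value is used. */
unsigned long sum_words(const unsigned long *p, size_t n)
{
    unsigned long sum = 0, cur, next;

    if (n == 0)
        return 0;
    cur = *p++;            /* first load, before the loop starts */
    while (--n) {
        next = *p++;       /* start the next load early */
        sum += cur;        /* use the value loaded on the previous pass */
        cur = next;
    }
    return sum + cur;      /* add the final loaded value */
}
```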

3) Understand what the machine is actually doing (C/C++)

The most disturbing thing for me is when I find someone who has a total non-understanding of how computers work and writes code like this:

void really_bad_program(void)
{
unsigned char uc1, uc2;

uc1 = 1;
memcpy(&uc2, &uc1, 1);

}

Believe it or not, I’ve actually seen this in a shipping product.  I’m not going to give a long explanation here; I’ll leave it as an exercise for those who don’t see what the problem is to find out on their own.  My point is to be aware of what high level code gets turned into by the compiler. Know the difference between an operator and a function.

That’s all for now.  If anyone finds this useful, I’ll continue this thread.

March 16, 2007 Posted by bitbank | arm, asm, assembly language, benchmark, optimization, optimização, pocket pc, smartphone, tech, wince, xscale | | 3 Comments