Larry’s Personal & Tech ramblings

Just another WordPress.com weblog

Fast ARM JPEG code licensing

WordPress (the company which hosts this site) collects some interesting statistical data on the people who visit the blog.  To me, the most interesting data is a list of the search words which direct people to this site.  Since I started including the JPEG and ARM keywords in my posts, I’ve seen a steady stream of people searching for basically the same thing: Free optimized source code for decoding JPEG/MPEG images on ARM devices.  I’ve done such searches myself and have come to the conclusion that it’s not available.  For anyone who has done research and invested tons of time and energy into writing optimized code, it is unlikely that they will be willing to give it away for free.  There are plenty of open-source and free projects on the internet that are valuable and professionally done, but there usually comes a point in a project’s lifetime when the author commercializes it to get compensated for the time invested.

I try to share my knowledge and experience with the developer community; I understand the frustration of wasting precious time locating resources or coming up with workarounds for problems outside of (or within) my code.  I also make a living writing software, and so I must write code which is worth compensation from my customers and maintain innovative solutions which compare well with my competition.  The geek in me would love to have an open discussion about the fastest way to decode Huffman encoded data or minimize the calculations in the IDCT, but as a consultant, that would be self-defeating.

The “trade secrets” are visible in the source code, but hidden in the object code, so licensing object code will incur less risk to me and therefore cost considerably less.  I’ve licensed my code to various companies for values ranging from several hundred dollars to tens of thousands.  The price varies according to the risk and time required.  Companies needing help with ARM optimization issues are encouraged to contact me.  The amount I charge for my time or code is usually far more economical than having other programmers spend time trying to invent what I’ve already got working.

July 31, 2007 Posted by bitbank | arm, arm9, asm, assembly language, jpeg, omap, optimization, pocket pc, smartphone, xscale | | 5 Comments

ARM JPEG Benchmarks part 2

I thought it would be useful to re-run the tests with the C version of my JPEG code. From the results it appears that memory bandwidth is the real limiting factor to the speed and the pixel colorspace conversion gets the most benefit from my optimized ARM assembly language. Also it appears that the OMAP gains more from optimized ASM than the XScale does. Here are the numbers:

C-Code:

PPC: thumbnail: 10.7 milliseconds, DC only: 968 milliseconds, full res: 3734 milliseconds.
SP: thumbnail: 25.1 milliseconds

Mixed C and ASM

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The load times for the “DC only” and “full res” tests include the time taken to read 4.3MB of data from RAM through the WinCE file system.

These results make sense in that the real benefit of optimization comes from fixing the algorithms and reducing memory usage. The optimized ARM assembly code is certainly helpful in speeding things up, but won’t offer an order of magnitude improvement over what the compiler generates.
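
To put the two sets of numbers side by side, the speedup of the mixed C/ASM build over the pure C build can be computed directly from the timings listed above. This is only a small sketch; the function names `speedup` and `print_speedups` are made up for illustration and are not part of the JPEG code.

```c
#include <stdio.h>

/* Ratio of the C-only time to the mixed C/ASM time.
   Values above 1.0 mean the ASM build is faster.
   (Illustrative helper, not part of the JPEG library.) */
double speedup(double c_ms, double asm_ms)
{
    return c_ms / asm_ms;
}

/* Print the speedups for the timings quoted in this post. */
void print_speedups(void)
{
    printf("PPC thumbnail: %.2fx\n", speedup(10.7, 8.8));
    printf("PPC DC only:   %.2fx\n", speedup(968.0, 830.0));
    printf("PPC full res:  %.2fx\n", speedup(3734.0, 2700.0));
    printf("SP thumbnail:  %.2fx\n", speedup(25.1, 15.1));
}
```

Worked out by hand, that is roughly 1.22x, 1.17x and 1.38x on the Pocket PC and roughly 1.66x on the SmartPhone, which is where the observation that the OMAP gains more from the ASM comes from.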

July 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM JPEG Benchmarks

The great thing about the ARM architecture is that the more I look at a piece of code, the more ways I find to optimize it. The conditional execution, barrel shifter and optional setting of the processor flags create many opportunities for optimization. I’ve spent some more time optimizing my ARM asm JPEG code and now have some hard numbers to publish. I used an HP iPAQ h2210 Pocket PC (400MHz PXA255) and an HTC Hurricane SmartPhone (195MHz OMAP 850) to do the testing. I was able to load the file from RAM on the Pocket PC (to reduce file I/O delays), but not on the SmartPhone. The SmartPhone file system does not use RAM for file storage. The slow speed of reading from the miniSD card dwarfs the processing time in the tests, so the only test that was run on the SmartPhone was decompressing a 160×120 thumbnail image in RAM. All tests decompress the image to an RGB565 bitmap. The thumbnail test decompresses the 160×120 EXIF thumbnail image. The “DC only” test creates a single pixel from each MCU (the 3072×2304 image is loaded as 384×288). The “Full res” test decompresses every pixel of the image.

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The speed difference between the two devices is to be expected considering the different processor and memory bus speeds. The “DC only” test is useful because it shows the relative speed of Huffman decoding. The file size is 4.3MB, so in 830 milliseconds the code was able to decode all of the MCUs and produce a single pixel from each one.
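
The arithmetic behind the “DC only” numbers is easy to check. The sketch below assumes that the 384×288 result corresponds to one output pixel per 8 pixels in each direction (3072/8 and 2304/8), and it treats 4.3MB in 830 milliseconds as a rough throughput figure. The helper names are hypothetical.

```c
#include <stddef.h>

/* Size of one image dimension after keeping a single pixel per 8 source
   pixels, rounding up for sizes that are not a multiple of 8.
   (Illustrative helper, not part of the JPEG library.) */
size_t dc_only_size(size_t full_px)
{
    return (full_px + 7) / 8;
}

/* Rough throughput in megabytes per second for a file of the given size
   decoded in the given number of milliseconds. */
double mb_per_second(double megabytes, double milliseconds)
{
    return megabytes / (milliseconds / 1000.0);
}
```

Here `dc_only_size(3072)` gives 384 and `dc_only_size(2304)` gives 288, and `mb_per_second(4.3, 830.0)` comes out a little above 5 MB of compressed data per second on the PXA255.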

I’ve uploaded the sample image to my web server here: CIMG2209.JPG

The image was taken with a Casio EX-Z750 and depicts a relatively complex scene with many fine details. Like most cameras, the Exilim series saves JPEG images with 2:1 horizontal color subsampling (when set to maximum quality). It’s not unreasonable for a point-and-shoot camera like the Z750 to save images at a less than optimal compression because the image coming off the CCD isn’t that great to begin with. What irks me is that cameras like the Canon 20D do the same thing. With a good SLR lens and imager, the Canon should allow you to save full res color JPEG images.

Comments?

July 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, photo, pocket pc, smartphone, viewer, wince, xscale | | 4 Comments

Having fun with JPEG decompression

An odd title considering that JPEG is a cryptic image compression standard.  My idea of fun is optimizing code until there’s nothing left to improve.  I decided a few weeks ago to take the plunge and rewrite the 3 core JPEG decode routines to speed up my imaging code.  One reason was that the great majority of cell phones today are based around the TI OMAP architecture, typically running at around 200MHz.  These devices seem slow at working with images, so I thought I could help that situation by speeding things up to improve both battery usage and the user experience.

The important, “inner loop” routines of JPEG image decoding are the Huffman decoding of the MCU (minimum coded unit), the IDCT (inverse discrete cosine transform), and the output stage (turning the YCbCr pixels into RGB pixels).  All 3 routines together turned out to be only a couple hundred lines of ARM code, but the result of rewriting them from C was quite dramatic.  The original C code had been optimized and tested over a long period of time and was in good shape to begin with, but C isn’t so great at bit manipulation and squeezing the most use out of register variables.  It took several iterations to get down to the bare minimum of code, but I’m quite happy with the results.  I used ARMv5 instructions, but made sure that the code performs well on both OMAP and XScale CPUs (unlike Intel’s Integrated Performance Primitives).  Luckily my previous performance testing of the multiply instructions helped guide me to shave a few clock cycles off of several routines.  The purpose of this work is threefold:
1) I’m readying a new version of my imaging application (PQV - Pocket QuickView) for Windows Mobile and need it to be competitive with other products.  I pride myself on having the fastest viewer available.
2) I have been staring at the C code for a long time and wondering how much better it could perform if written in optimized ARM asm.
3)  I believe this code has value to anyone doing imaging or video on ARM based devices.  Web browsers, image viewers, camera applications, video players can all benefit from this code.
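
For readers unfamiliar with the output stage mentioned above, here is a plain C sketch of converting one pixel from the JPEG color space to RGB565. It is not the optimized routine discussed in this post; it uses the standard JFIF conversion coefficients in 16.16 fixed point, and the function names are made up for illustration.

```c
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clamp8(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Convert one Y/Cb/Cr sample triple to a 16-bit RGB565 pixel.
   Coefficients are the JFIF ones scaled by 65536:
   1.402, 0.344136, 0.714136 and 1.772.
   (Illustrative reference code, not the ARM asm routine.) */
uint16_t ycbcr_to_rgb565(uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t c_b = (int32_t)cb - 128;
    int32_t c_r = (int32_t)cr - 128;
    int32_t y16 = (int32_t)y * 65536 + 32768;   /* +0.5 for rounding */

    uint8_t r = clamp8((y16 + 91881 * c_r) / 65536);
    uint8_t g = clamp8((y16 - 22553 * c_b - 46802 * c_r) / 65536);
    uint8_t b = clamp8((y16 + 116130 * c_b) / 65536);

    /* Pack as 5 bits red, 6 bits green, 5 bits blue. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

A neutral chroma value of 128 leaves the pixel gray, so `ycbcr_to_rgb565(255, 128, 128)` is white (0xFFFF) and `ycbcr_to_rgb565(0, 128, 128)` is black. An optimized version would typically process a whole MCU at a time and share the chroma terms between neighboring pixels when the color is subsampled.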

I’ve been searching for the past week or so for customers of this code, but the typical response is the “not invented here” attitude standing in the way of improving products.

I will post some sample images and benchmarks shortly to back up my claims of fast JPEG decoding.

Anyone interested in licensing object or source code should contact me directly (bitbank@pobox.com).

June 21, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM Multiply performance pt. 2

I wanted to revisit the multiply test because I hadn’t tested the difference between 32×32 and 16×16 multiplies.  On the XScale PXA255 and above, both 32×32 and 16×16 multiplies take 1 clock cycle.  On the OMAP 850 (and probably other OMAPs based on the ARM9 core), the 16×16 multiply takes 1 clock and the 32×32 takes 2.  Useful to know if your code will be running on the OMAP and you really only need a 16×16 multiply.
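
In C, the usual way to give the compiler a chance at the 16×16 multiply is to keep the operands in 16-bit types. The sketch below is only an illustration; whether a given compiler actually emits a 16×16 instruction such as SMULBB or SMLABB when targeting ARMv5TE is an assumption that should be checked in the generated assembly.

```c
#include <stdint.h>
#include <stddef.h>

/* Dot product of two arrays of signed 16-bit values.
   Because both operands are int16_t, each product fits in 32 bits and
   a compiler targeting ARMv5TE is free to use a 16x16 multiply.
   (Illustrative function, not taken from the test program.) */
int32_t dot16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

If the same loop is written with 32-bit operands, the compiler has to assume a full 32×32 multiply, which is the slower case on the OMAP according to the numbers above.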

L.B.

May 18, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, performance, pocket pc, smartphone, tech, xscale | | 2 Comments

A Clever Idea for the 6502 Stack

I realize that this is only relevant to about five people in the world, but it may help get others thinking along similar lines.  I’ve come up with a clever idea for those emulating the 6502 on ARM and I’ve decided to share it with my fellow emulator authors. One of the tricky parts of emulating the stack pointer on the 6502 is that it points to 0x100-0x1ff, but it’s an 8-bit register. Modifying its value usually involves masking the 8 bits and then ORing in the 0x100. The ARM barrel shifter allows for a more elegant solution to the problem. By using the upper 8 bits of a register to hold the stack value, and then setting bit 0, its value can be modified without having to worry about the 0x100 part. Here’s an example:

Increment the stack: add r0,r0,#0x1000000 ; this doesn’t affect the LSB

Write to the stack: strb r1,[r2,r0,ROR #24] ; r2=ZP/Stack memory, R0 = SP

With the rotated register, bit 0 shifts into position as bit 8 and keeps the pointer in the 0x100 to 0x1ff range.
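
The same trick can be modeled in C to see why it works. This is just a sketch of the bit manipulation, with hypothetical helper names; the ARM version gets the rotate for free as part of the addressing mode.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (n in 1..31). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32u - n));
}

/* Pack an 8-bit 6502 stack pointer into the top byte of a 32-bit
   register image, with bit 0 set to supply the 0x100 page bit. */
uint32_t sp_pack(uint8_t sp)
{
    return ((uint32_t)sp << 24) | 1u;
}

/* Increment / decrement the packed stack pointer. The 8-bit wraparound
   falls out of 32-bit overflow, and bit 0 is never disturbed. */
uint32_t sp_inc(uint32_t reg) { return reg + 0x01000000u; }
uint32_t sp_dec(uint32_t reg) { return reg - 0x01000000u; }

/* Offset into zero page/stack memory, always in 0x100..0x1ff.
   Equivalent to the "r0, ROR #24" operand in the strb above. */
uint32_t sp_address(uint32_t reg)
{
    return ror32(reg, 24);
}
```

For example, `sp_address(sp_pack(0xFF))` is 0x1FF, and after `sp_inc` the same register maps to 0x100 with no masking step.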

Enjoy,
L.B.

April 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, emulation, optimization, performance, tech | | 1 Comment

C versus ARM ASM - a better example

Over the past few weeks I’ve been examining the performance of various algorithms on XScale and OMAP to try to come up with some coding rules for squeezing the most performance out of the cache. What I’ve observed is that the STM instruction is treated very differently on the OMAP SmartPhones I have at my disposal. It performs much faster than discrete STRs (3x in my observations), probably because it doesn’t read any data to fill the untouched portion of the cache line. This means that compiled C code will not benefit from it. Someone correct me if I’m wrong, but I’ve never seen a case where a C compiler will know to use an STM instruction. I offer this code as an example. It converts a scanline of 8-bit pixels into 16-bit pixels through a color lookup table. The loop is unrolled to gain a bit of advantage.

for (i=0; i<iWidth; i+=8)
{
    ul = *(unsigned long *)&pSrc[0];
    pulDest[0] = usPalConvert[ul & 0xff] | (usPalConvert[(ul >> 8) & 0xff] << 16); // unrolling the loop speeds it up considerably
    pulDest[1] = usPalConvert[(ul >> 16) & 0xff] | (usPalConvert[ul >> 24] << 16);
    ul = *(unsigned long *)&pSrc[4];
    pulDest[2] = usPalConvert[ul & 0xff] | (usPalConvert[(ul >> 8) & 0xff] << 16);
    pulDest[3] = usPalConvert[(ul >> 16) & 0xff] | (usPalConvert[ul >> 24] << 16);
    pulDest += 4;
    pSrc += 8;
}

The output of the compiler with full optimization is basically this (this code shows just a pixel pair being converted and stored, the full 8-pixel loop is much longer):

ldr lr, [r5]
mov r3, lr, lsl #16
mov r3, r3, lsr #24
and r2, lr, #0xFF
add r3, r4, r3, lsl #1
ldrh r1, [r3]
add r2, r4, r2, lsl #1
ldrh r2, [r2]
mov r3, lr, lsl #8
mov r0, r3, lsr #24
orr r3, r2, r1, lsl #16
mov r2, lr, lsr #24
ldr lr, [r5, #4]
str r3, [r6]

It does a good job of register usage and avoiding pipeline stalls, but it just doesn’t take advantage of what the ARM is good at. My optimized routine (shown in its entirety) takes full advantage of the ARM barrel shifter and the STM instruction to produce significantly faster output due to reduced instruction count and good cache usage:

ARM816FAST proc
stmfd sp!,{r4-r11,lr}
mov r11,#0xff ; we need this constant for masking off pixels
mov r11,r11,LSL #1 ; odd shifted constants are not allowed
arm816f_top
ldmia r0!,{r4-r5} ; get 8 source pixels
and r10,r11,r4,LSL #1
ldrh r6,[r2,r10] ; convert pixel 1
and r10,r11,r4,LSR #7
ldrh r8,[r2,r10] ; convert pixel 2
and r10,r11,r4,LSR #15
and r4,r11,r4,LSR #23
orr r6,r6,r8,LSL #16 ; combine pixels 1 and 2
ldrh r7,[r2,r10] ; convert pixel 3
ldrh r8,[r2,r4] ; convert pixel 4
and r10,r11,r5,LSL #1
orr r7,r7,r8,LSL #16 ; combine pixels 3 and 4
ldrh r8,[r2,r10] ; convert pixel 5
and r10,r11,r5,LSR #7
ldrh r9,[r2,r10] ; convert pixel 6
and r10,r11,r5,LSR #15
orr r8,r8,r9,LSL #16; combine pixels 5 and 6
ldrh r9,[r2,r10] ; convert pixel 7
and r5,r11,r5,LSR #23
ldrh r10,[r2,r5] ; convert pixel 8
subs r3,r3,#1 ; decrement count
orr r9,r9,r10,LSL #16 ; combine pixels 7 and 8
stmia r1!,{r6-r9} ; store the 8 pixels to dest
bne arm816f_top ; loop through entire line
ldmia sp!,{r4-r11,pc}
endp

If anyone can improve upon this code, I’d certainly like to see it.
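
For anyone who wants a reference to test an improved version against, here is a portable C sketch of the same scanline conversion. It assumes the layout used in the code above (two 16-bit pixels per 32-bit word, first pixel in the low half) and a width that is a multiple of 8. The function name is made up, and it reads the source one byte at a time so it does not depend on alignment or on `unsigned long` being 32 bits.

```c
#include <stdint.h>
#include <stddef.h>

/* Convert `width` 8-bit palettized pixels to 16-bit pixels through a
   256-entry lookup table, writing two pixels per 32-bit word.
   `width` is assumed to be a multiple of 8.
   (Illustrative reference code, not the optimized routine.) */
void pal8_to_rgb565_line(const uint8_t *src, uint32_t *dest,
                         const uint16_t pal[256], size_t width)
{
    for (size_t i = 0; i < width; i += 8) {
        uint32_t out[4];

        /* Build all four output words first, mirroring the asm version
           that gathers r6-r9 before a single 4-register store. */
        for (int k = 0; k < 4; k++) {
            uint32_t lo = pal[src[2 * k]];
            uint32_t hi = pal[src[2 * k + 1]];
            out[k] = lo | (hi << 16);
        }

        dest[0] = out[0];
        dest[1] = out[1];
        dest[2] = out[2];
        dest[3] = out[3];

        src  += 8;
        dest += 4;
    }
}
```

Running both this and a candidate routine over the same scanline and comparing the output words is a quick way to check correctness before measuring speed.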

April 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optmization, performance, pocket pc, smartphone, tech, wince | | 3 Comments

Working with the cache on ARM/OMAP

I decided to revisit my performance testing taking into account how the data cache affects performance. I wanted to see about speeding up JPEG decoding and how writing “the wrong way” into memory affected its performance. The JPEG test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a line of 8×8 macroblocks won’t fit entirely in the 16k data cache. What I found was surprising; I got results similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing individual 16-bit pixels (8 per line) takes 3 times as long as writing them in one shot with an STMIA of 4 registers. I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply write through. On the XScale, the STM was actually a bit slower. This implies that the OMAP is smart about what type of write occurs and the XScale isn’t. It also suggests that the OMAP doesn’t try to fill the rest of the cache line when doing a partial write if you use STM. Can anyone verify this theory?
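
The cache footprint claim above is simple arithmetic: a row of 8×8 blocks touches 8 scanlines of the output, and 8 × 3200 bytes is more than 16k. A small sketch, with hypothetical helper names and ignoring everything else competing for the cache:

```c
#include <stddef.h>

/* Bytes of output touched when writing `rows` scanlines of an image
   that is `width_px` pixels wide at `bytes_per_px` bytes per pixel. */
size_t band_bytes(size_t width_px, size_t bytes_per_px, size_t rows)
{
    return width_px * bytes_per_px * rows;
}

/* Non-zero if the band fits in a data cache of the given size.
   This ignores the source data, tables and stack that share the cache. */
int fits_in_cache(size_t bytes, size_t cache_bytes)
{
    return bytes <= cache_bytes;
}
```

For the test image, `band_bytes(1600, 2, 8)` is 25600 bytes against a 16384-byte data cache, so the output of one row of blocks cannot all stay resident.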

April 3, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, optimization, performance, tech, xscale | | 3 Comments

ARM Multiply Performance

Someone asked me to do some testing of the performance of the ARM multiply instruction.  I hadn’t included it in my previous performance tests because it didn’t occur to me; I don’t use it in the inner loops of game emulators.

I decided to see if there was a difference in performance when working with different data values (e.g. multiplying by zero) and on the XScale vs. OMAP CPUs.  The first test showed that there is no difference in the performance when working with zero and non-zero data.  The second test showed that the XScale has a much faster implementation of multiply than the OMAP.  On my 400MHz PXA255 handheld, my tests showed that the unsigned multiply instruction (MUL) takes just 1 clock cycle, but on the OMAP 850 (used in many SmartPhones) it takes 2 clocks.  I haven’t tested the 32×32 multiply because it’s in the ARM5 instruction set and the VS2005 C compiler generates ARM4 compatible code.
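
A test along these lines can be approximated in C, although it is not the test program used for the numbers above. The sketch below assumes that a dependent chain of multiplies, with the factor read through a `volatile` so the compiler cannot fold it away, is a fair proxy. The loop overhead is included in the timing, so only differences between runs are meaningful. The function names are made up for illustration.

```c
#include <stdint.h>
#include <time.h>

/* Run a chain of dependent multiplies: acc = acc * factor + 1.
   The factor is read through a volatile so the multiply stays in the
   loop even when the factor is zero. */
uint32_t multiply_chain(uint32_t seed, uint32_t factor, uint32_t iterations)
{
    volatile uint32_t f = factor;
    uint32_t acc = seed;
    for (uint32_t i = 0; i < iterations; i++)
        acc = acc * f + 1u;
    return acc;
}

/* Time one run in seconds using the standard clock() function.
   The result is stored through `result` so the work is not discarded. */
double time_multiply_chain(uint32_t factor, uint32_t iterations,
                           uint32_t *result)
{
    clock_t start = clock();
    *result = multiply_chain(1u, factor, iterations);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_multiply_chain(0, n, &r)` with a run using a non-zero factor, for a large `n`, gives the zero vs. non-zero comparison; a hand-written asm loop gives much tighter control over what is actually being measured.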

March 26, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, optimização, performance, pocket pc, smartphone, tech | | 2 Comments

Sample App which calls ARM ASM from C

I’ve seen many people seeking help on how to write ARM assembly language on Windows CE (Pocket PC and SmartPhone). I’ve created a small sample application which shows how to call an ASM function from C, pass parameters, use a global variable defined in C from ASM and return a value. The sole purpose of this application is to show you how to use the tools and settings of Microsoft’s WinCE development environment to accomplish mixing C and ASM. I used Visual Studio 2005 to create the project and it uses the included ARM assembler as a custom build tool. You can download the project files here:

http://www.bitbanksoftware.com/c_asm_sample.zip

This same code can easily be built in EVC3/4.

Enjoy,
L.B.

March 25, 2007 Posted by bitbank | arm, asm, assembly language, pocket pc, smartphone, tech | | 4 Comments