Larry’s Personal & Tech ramblings

Just another WordPress.com weblog

Fast ARM JPEG code licensing

WordPress (the company which hosts this site) collects some interesting statistical data on the people who visit the blog.  To me, the most interesting data is a list of the search words which direct people to this site.  Since I started including the JPEG and ARM keywords in my posts, I’ve seen a steady stream of people searching for basically the same thing: Free optimized source code for decoding JPEG/MPEG images on ARM devices.  I’ve done such searches myself and have come to the conclusion that it’s not available.  For anyone who has done research and invested tons of time and energy into writing optimized code, it is unlikely that they will be willing to give it away for free.  There are plenty of open-source and free projects on the internet that are valuable and professionally done, but there usually comes a point in a project’s lifetime when the author commercializes it to get compensated for the time invested.

I try to share my knowledge and experience with the developer community; I understand the frustration of wasting precious time locating resources or coming up with workarounds for problems outside of (or within) my code.  I also make a living writing software, and so I must write code which is worth compensation from my customers and maintain innovative solutions which compare well with my competition.  The geek in me would love to have an open discussion about the fastest way to decode Huffman encoded data or minimize the calculations in the IDCT, but as a consultant, that would be self-defeating.

The “trade secrets” are visible in the source code, but hidden in the object code, so licensing object code incurs less risk to me and therefore costs considerably less.  I’ve licensed my code to various companies for amounts ranging from several hundred dollars to tens of thousands.  The price varies according to the risk and time required.  Companies needing help with ARM optimization issues are encouraged to contact me.  The amount I charge for my time or code is usually far more economical than having other programmers spend time trying to invent what I’ve already got working.

July 31, 2007 Posted by bitbank | arm, arm9, asm, assembly language, jpeg, omap, optimization, pocket pc, smartphone, xscale | | 5 Comments

ARM JPEG Benchmarks part 2

I thought it would be useful to re-run the tests with the C version of my JPEG code. From the results it appears that memory bandwidth is the real limiting factor to the speed and the pixel colorspace conversion gets the most benefit from my optimized ARM assembly language. Also it appears that the OMAP gains more from optimized ASM than the XScale does. Here are the numbers:

C-Code:

PPC: thumbnail: 10.7 milliseconds, DC only: 968 milliseconds, full res: 3734 milliseconds.
SP: thumbnail: 25.1 milliseconds

Mixed C and ASM

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The load times for the “DC only” and “full res” tests include the time taken to read 4.3MB of data from RAM through the WinCE file system.

These results make sense in that the real benefit of optimization comes from fixing the algorithms and reducing memory usage. The optimized ARM assembly code is certainly helpful in speeding things up, but won’t offer an order of magnitude improvement over what the compiler generates.

July 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM JPEG Benchmarks

The great thing about the ARM architecture is that the more I look at a piece of code, the more ways I find to optimize it. The conditional execution, barrel shifter and optional setting of the processor flags create many opportunities for optimization. I’ve spent some more time optimizing my ARM asm JPEG code and now have some hard numbers to publish. I used an HP iPAQ h2210 Pocket PC (400MHz PXA255) and an HTC Hurricane SmartPhone (195MHz OMAP 850) to do the testing. I was able to load the file from RAM on the Pocket PC (to reduce file I/O delays), but not on the SmartPhone. The SmartPhone file system does not use RAM for file storage. The slow speed of reading from the miniSD card overwhelms the processing time in the tests, so the only test that was run on the SmartPhone was decompressing a 160×120 thumbnail image in RAM. All tests decompressed the image to an RGB565 bitmap. The thumbnail test decompresses the 160×120 EXIF thumbnail image. The “DC only” test creates a single pixel from each MCU (the 3072×2304 image is loaded as 384×288). The “Full res” test decompresses every pixel of the image.

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The speed difference between the two devices is to be expected considering the different processor and memory bus speeds. The “DC only” test is useful because it shows the relative speed of Huffman decoding. The file size is 4.3MB, so in 830 milliseconds the code was able to decode all of the MCUs and produce a single pixel from each one.
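To make the “DC only” idea concrete, here is a minimal C sketch (not the benchmarked code). It assumes the usual JPEG DCT scaling, where the dequantized DC coefficient divided by 8 is the average of the 64 samples in a block, and it assumes the compiler uses an arithmetic right shift for negative values. The names dc_only_sample, dc_quantized and dc_quant are hypothetical. The AC coefficients still have to be Huffman decoded to advance through the bitstream, but the dequantization and IDCT for them are skipped entirely.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit sample. */
static uint8_t clamp_u8(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/*
 * Hypothetical helper: turn the DC coefficient of one 8x8 block into the
 * single sample that stands in for the whole block.
 *   dc_quantized: DC value after Huffman decoding and DC prediction
 *   dc_quant:     entry 0 of the quantization table for this component
 */
uint8_t dc_only_sample(int dc_quantized, int dc_quant)
{
    int dc = dc_quantized * dc_quant;        /* dequantize */

    /* DC/8 is the block average; +4 rounds to nearest; +128 undoes the
       JPEG level shift. Assumes >> on a negative int is arithmetic. */
    return clamp_u8(((dc + 4) >> 3) + 128);
}
```

A DC value of 0 maps to mid-gray (128), and large positive or negative values saturate at 255 or 0.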

I’ve uploaded the sample image to my web server here: CIMG2209.JPG

The image was taken with a Casio EX-Z750 and depicts a relatively complex scene with many fine details. Like most cameras, the Exilim series saves JPEG images with 2:1 horizontal color subsampling (when set to maximum quality). It’s not unreasonable for a point-and-shoot camera like the Z750 to save images at a less than optimal compression because the image coming off the CCD isn’t that great to begin with. What irks me is that cameras like the Canon 20D do the same thing. With a good SLR lens and imager, the Canon should allow you to save full-resolution color JPEG images.

Comments?

July 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, photo, pocket pc, smartphone, viewer, wince, xscale | | 4 Comments

Having fun with JPEG decompression

An odd title considering that JPEG is a cryptic image compression standard.  My idea of fun is optimizing code until there’s nothing left to improve.  I decided a few weeks ago to take the plunge and rewrite the 3 core JPEG decode routines to speed up my imaging code.  One reason was that the great majority of cell phones today are based around the TI OMAP architecture, typically running at around 200MHz.  These devices seem slow at working with images, so I thought I could help that situation by speeding things up to improve both battery usage and the user experience.

The important “inner loop” routines of JPEG image decoding are the Huffman decoding of the MCU (minimum coded unit), the IDCT (inverse discrete cosine transform), and the output stage (turning the YCrCb pixels into RGB pixels).  All 3 routines together turned out to be only a couple hundred lines of ARM code, but the result of rewriting them from C was quite dramatic.  The original C code had been optimized and tested over a long period of time and was in good shape to begin with, but C isn’t so great at bit manipulation and squeezing the most use out of register variables.  It took several iterations to get down to the bare minimum of code, but I’m quite happy with the results.  I used ARMv5 instructions, but made sure that the code performs well on both OMAP and XScale CPUs (unlike Intel’s integrated performance primitives).  Luckily my previous performance testing of the multiply instructions helped guide me to shave a few clock cycles off several routines.  The purpose of this work is threefold:
1) I’m readying a new version of my imaging application (PQV - Pocket QuickView) for Windows Mobile and need it to be competitive with other products.  I pride myself on having the fastest viewer available.
2) I have been staring at the C code for a long time and wondering how much better it could perform if written in optimized ARM asm.
3)  I believe this code has value to anyone doing imaging or video on ARM based devices.  Web browsers, image viewers, camera applications, video players can all benefit from this code.
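For readers unfamiliar with the output stage mentioned above, here is a minimal C sketch of a YCbCr to RGB565 conversion. It is not the licensed code: it uses the standard JFIF conversion constants scaled to 16-bit fixed point, the function name ycbcr_to_rgb565 is hypothetical, and it assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Clamp to the 0..255 range of an 8-bit color channel. */
static int clamp255(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/*
 * Convert one YCbCr pixel (each component 0..255) to RGB565.
 * Constants are the JFIF factors times 65536:
 *   1.402 -> 91881, 0.344136 -> 22554, 0.714136 -> 46802, 1.772 -> 116130
 */
uint16_t ycbcr_to_rgb565(int y, int cb, int cr)
{
    int r, g, b;

    cb -= 128;                       /* chroma is stored with a +128 bias */
    cr -= 128;

    r = y + ((91881 * cr + 32768) >> 16);
    g = y - ((22554 * cb + 46802 * cr + 32768) >> 16);
    b = y + ((116130 * cb + 32768) >> 16);

    r = clamp255(r);
    g = clamp255(g);
    b = clamp255(b);

    /* Keep the top 5, 6 and 5 bits of red, green and blue. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

Every output pixel needs three multiplies, several adds, three clamps and the final packing, which is why this stage rewards careful register use and hand scheduling.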

I’ve been searching for the past week or so for customers of this code, but the typical response is the “not invented here” attitude standing in the way of improving products.

I will post some sample images and benchmarks shortly to back up my claims of fast JPEG decoding.

Anyone interested in licensing object or source code should contact me directly (bitbank@pobox.com).

June 21, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM Multiply performance pt. 2

I wanted to revisit the multiply test because I hadn’t tested the difference between 32×32 and 16×16 multiplies.  On the XScale PXA255 and above, both 32×32 and 16×16 multiplies take 1 clock cycle.  On the OMAP 850 (and probably other OMAPs based on the ARM9 core), the 16×16 multiply takes 1 clock and the 32×32 takes 2.  This is useful to know if your code will be running on the OMAP and you really only need a 16×16 multiply.
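In C, one way to give the compiler a chance to pick the 16×16 form is to keep the operands in 16-bit types and widen them only at the multiply. This is a sketch under assumptions: whether a given compiler actually emits the ARMv5TE 16×16 instructions (SMULBB, SMLABB and friends) depends on the compiler and its target settings, and the function names are hypothetical.

```c
#include <stdint.h>

/* 16x16 -> 32 signed multiply. The operand types tell the compiler that
   only the low 16 bits of each input are significant. */
int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Multiply-accumulate over 16-bit data, the typical shape of a filter or
   IDCT inner loop. The caller must ensure the sum fits in 32 bits. */
int32_t dot16(const int16_t *x, const int16_t *c, int n)
{
    int32_t acc = 0;
    int i;

    for (i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)c[i];
    return acc;
}
```

Checking the generated assembly is the only way to be sure which multiply was chosen.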

L.B.

May 18, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, performance, pocket pc, smartphone, tech, xscale | | 2 Comments

A Clever Idea for the 6502 Stack

I realize that this is only relevant to about five people in the world, but it may help get others thinking along similar lines.  I’ve come up with a clever idea for those emulating the 6502 on ARM and I’ve decided to share it with my fellow emulator authors. One of the tricky parts of emulating the stack pointer on the 6502 is that it points to 0x100-0x1ff, but it’s an 8-bit register. Modifying its value usually involves masking it to 8 bits and then ORing in the 0x100. The ARM barrel shifter allows for a more elegant solution to the problem. By using the upper 8 bits of a register to hold the stack value, and then setting bit 0, its value can be modified without having to worry about the 0x100 part. Here’s an example:

Increment the stack: add r0,r0,#0x1000000 ; this doesn’t affect the LSB

Write to the stack: strb r1,[r2,r0,ROR #24] ; r2=ZP/Stack memory, R0 = SP

With the rotated register, bit 0 shifts into position as bit 8 and keeps the pointer in the 0x100 to 0x1ff range.
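The same arithmetic can be modeled in C to check that it behaves as described. This is only an illustration of the bit layout, with hypothetical function names; in C the rotate costs extra instructions, whereas on ARM it comes for free in the addressing mode.

```c
#include <stdint.h>

/* Rotate a 32-bit value right by n bits (0 < n < 32). */
static uint32_t ror32(uint32_t v, unsigned n)
{
    return (v >> n) | (v << (32 - n));
}

/* The 6502 SP lives in bits 31..24; bit 0 is permanently set. */
uint32_t sp_pack(uint8_t sp)
{
    return ((uint32_t)sp << 24) | 1u;
}

/* Increment and decrement wrap within 8 bits automatically because the
   carry or borrow falls off the top of the 32-bit register. */
uint32_t sp_inc(uint32_t packed) { return packed + 0x1000000u; }
uint32_t sp_dec(uint32_t packed) { return packed - 0x1000000u; }

/* Equivalent of the "r0,ROR #24" operand: bit 0 lands in bit 8, so the
   result is always in the range 0x100..0x1FF. */
uint32_t sp_address(uint32_t packed)
{
    return ror32(packed, 24);
}
```

Incrementing from 0xFF wraps to address 0x100 and decrementing from 0x00 wraps to 0x1FF, with no masking or ORing.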

Enjoy,
L.B.

April 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, emulation, optimization, performance, tech | | 1 Comment

Working with the cache on ARM/OMAP

I decided to revisit my performance testing taking into account how the data cache affects performance. I wanted to see about speeding up JPEG decoding and how writing “the wrong way” into memory affected its performance. The JPEG test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a line of 8×8 macroblocks won’t fit entirely in the 16k data cache. What I found was surprising; I got results similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing individual 16-bit pixels (8 per line) takes 3 times as long as writing them in one shot with a STMIA of 4 registers. I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply write through. On the XScale, the STM was actually a bit slower. This implies that the OMAP is smart about what type of write occurs and the XScale isn’t. It also suggests that the OMAP doesn’t try to fill the rest of the cache line when doing a partial write if you use STM. Can anyone verify this theory?
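For anyone who wants to try the comparison from C, the two write patterns look roughly like this. It is a sketch only: the function names are hypothetical, the packing assumes a little-endian target (as on Windows CE), the destination of the word version must be 4-byte aligned, and nothing guarantees that a compiler turns the four word stores into a single STMIA. The timings above came from hand-written assembly.

```c
#include <stdint.h>

/* Pack two RGB565 pixels into one 32-bit word (first pixel in the low
   half, matching little-endian memory order). */
static uint32_t pack2(uint16_t first, uint16_t second)
{
    return (uint32_t)first | ((uint32_t)second << 16);
}

/* Pattern 1: eight separate 16-bit stores for one row of an 8x8 block. */
void store_row_halfwords(uint16_t *dst, const uint16_t *px)
{
    int i;

    for (i = 0; i < 8; i++)
        dst[i] = px[i];
}

/* Pattern 2: build four words in registers, then store them back to back.
   This is the C-level analogue of loading 4 registers and using STMIA. */
void store_row_words(uint32_t *dst, const uint16_t *px)
{
    uint32_t w0 = pack2(px[0], px[1]);
    uint32_t w1 = pack2(px[2], px[3]);
    uint32_t w2 = pack2(px[4], px[5]);
    uint32_t w3 = pack2(px[6], px[7]);

    dst[0] = w0;
    dst[1] = w1;
    dst[2] = w2;
    dst[3] = w3;
}
```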

April 3, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, optimization, performance, tech, xscale | | 3 Comments

ARM Multiply Performance

Someone asked me to do some testing of the performance of the ARM multiply instruction.  I hadn’t included it in my previous performance tests because it didn’t occur to me; I don’t use it in the inner loops of game emulators.

I decided to see if there was a difference in performance when working with different data values (e.g. multiplying by zero) and on the XScale vs. OMAP CPUs.  The first test showed that there is no difference in the performance when working with zero and non-zero data.  The second test showed that the XScale has a much faster implementation of multiply than the OMAP.  On my 400MHz PXA255 handheld, my tests showed that the unsigned multiply instruction (MUL) takes just 1 clock cycle, but on the OMAP 850 (used in many SmartPhones) it takes 2 clocks.  I haven’t tested the 16×16 multiply because it’s in the ARM5 instruction set and the VS2005 C compiler generates ARM4 compatible code.

March 26, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, optimization, optimização, performance, pocket pc, smartphone, tech | | 2 Comments

Optimizing the 68K

I started writing an emulator for the 68000 several years ago for a Capcom CPS1 project.  At the time I was targeting the desktop PC, so writing it in C wasn’t a problem.  Lately I’ve had to create an ARM version for portable and embedded systems and speed is definitely an issue.  My first pass at a purely ARM 68k emulator didn’t take too long to get working, but was not terribly fast.  The complexity of the 68k kept me from taking too many chances with some of my optimization ideas.  I’ve spent much time since then thinking about ways of speeding it up without coming up with anything useful.  Starting last week I took another look at the code and I found myself with 5 new ideas for further optimization.  I’ve implemented 4 of them so far and I’ve sped up the code a good amount and reduced the size by 15%.  The last idea will end up growing the code size, but at this point that won’t have much effect on the speed since it doesn’t fit in the code cache anyway.  Without revealing too many details of what I did, here are my ideas which worked:

1) Delayed flag calculation - processors which affect the flags on almost every instruction waste lots of time manipulating them.

2) When possible, use the native ARM flags to calculate things such as OVERFLOW.

3) For Read-Modify-Write functions, try to re-use the effective address efficiently.

4) Use better register management and instruction ordering to reduce pipe stalls and memory accesses.

5) Work with the statistics of the opcodes and focus energy on the most used instructions.
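Since the details are deliberately left out above, here is only the generic form of idea 1 as it is commonly done in emulators, not the approach used in this 68k core. The sketch records the operands and result of a 32-bit ADD and derives each flag only when something asks for it. The LazyFlags type and all function names are hypothetical.

```c
#include <stdint.h>

/* Everything needed to reconstruct the flags of the last 32-bit ADD. */
typedef struct {
    uint32_t src;
    uint32_t dst;
    uint32_t res;
} LazyFlags;

/* Emulated ADD: do the arithmetic and remember the inputs, but compute
   no flags. Most results are overwritten before any flag is read. */
uint32_t lazy_add32(LazyFlags *f, uint32_t dst, uint32_t src)
{
    f->src = src;
    f->dst = dst;
    f->res = dst + src;
    return f->res;
}

/* Flags are derived on demand, e.g. by a conditional branch handler. */
int flag_z(const LazyFlags *f) { return f->res == 0; }
int flag_n(const LazyFlags *f) { return (int)(f->res >> 31); }

/* Carry: the unsigned sum wrapped around. */
int flag_c(const LazyFlags *f) { return f->res < f->dst; }

/* Overflow: both operands differ in sign from the result. */
int flag_v(const LazyFlags *f)
{
    return (int)(((f->src ^ f->res) & (f->dst ^ f->res)) >> 31);
}
```

A full emulator would also record which operation and operand size produced the result, since subtract and compare derive carry and overflow differently.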

The worst part of embarking on a project like this is that you start with working code and end up breaking it while trying to improve it.  Much time is lost figuring out what broke, but in the end it’s all worth it :).

March 22, 2007 Posted by bitbank | arm, asm, assembly language, emulation, optimization, performance, tech | | 3 Comments

Coding for Speed (Q&D advice for targeting ARM)

Anyone who knows me is familiar with my Q&D (quick & dirty) approach to just about everything.  My feeling is that if you can get 95% of it done in a short time, but 100% takes a long time, be happy with 95% :) .

I thought it might be useful to discuss some simple ideas for making your applications run faster.  These ideas are directed at ARM based devices, but the same concepts apply to pretty much any computing platform.  Here is a quick list of dos and don’ts for the C and ASM programmer:

1) Avoid using divide and modulus (%) (C/C++)

99% of all ARM embedded and portable devices in use today support the ARM5 instruction set which does not include divide.  Doing a divide calls a long and ugly function.  Along the same line of thought, I always get a good laugh when I see someone do this:

i = j % 8;

For those of you who don’t get the joke, using MOD to get the remainder from a power of 2 constant can force a call to the divide routine.  For unsigned or non-negative values, the correct code (which becomes a single instruction) is:

i = j & 7;

Perhaps a smart compiler converts it when it sees nice constants, but replacing the 8 with a variable will force the compiler to use divide.  X86 systems do have a divide instruction, but it’s still a good idea to avoid using modulus when it’s not necessary.
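One caveat worth spelling out: the AND trick matches % only for unsigned or non-negative values. In C99 and later the remainder takes the sign of the dividend, so -1 % 8 is -1, while -1 & 7 is 7 on a two’s complement machine. The following sketch uses hypothetical function names and shows the safe unsigned forms, including the case where the divisor is a variable that is known to be a power of 2.

```c
/* Remainder by the constant 8 for unsigned values: identical to j % 8. */
unsigned mod8_unsigned(unsigned j)
{
    return j & 7u;
}

/* Remainder by a variable n. Only valid when n is a power of 2; the
   caller is responsible for that guarantee. */
unsigned mod_pow2(unsigned j, unsigned n)
{
    return j & (n - 1u);
}

/* Signed remainder, kept as % because the AND form gives a different
   answer for negative j. */
int mod8_signed(int j)
{
    return j % 8;
}
```

Declaring loop counters and indices as unsigned where that is what they are also lets the compiler make this substitution on its own for constant divisors.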

2) Be aware of the pipelined nature of the beast (ASM)

The ARM uses a pipelined architecture which usually goes something like this:

Fetch -> Decode -> Execute

The worst case is reading data from memory.  It takes the full 3 clocks to have the data ready in a register, so avoid depending on that register being ready to use immediately after a read.  Here are some examples:

Bad:

ldr r0,[r1],#4
add r2,r2,r0      ; pipe stall here will eat 2 clocks waiting for r0
subs r3,r3,#1

Better:

ldr r0,[r1],#4
subs r3,r3,#1   ; we’ve salvaged 1 wasted clock, but still have another
add r2,r2,r0

Best:

ldr r0,[r1],#4
subs r3,r3,#1
<another instruction which doesn’t touch r0>
add r2,r2,r0

I hope this makes it clear.  The other thing to avoid is branches, which take 3 clocks and flush the instruction queue.  Conditional execution of every instruction on the ARM makes branches far less necessary.

3) Understand what the machine is actually doing (C/C++)

The most disturbing thing for me is when I find someone who has a total non-understanding of how computers work and writes code like this:

#include <string.h>

void really_bad_program(void)
{
    unsigned char uc1, uc2;

    uc1 = 1;
    memcpy(&uc2, &uc1, 1);
}

Believe it or not, I’ve actually seen this in a shipping product.  I’m not going to give a long explanation here; I’ll leave it as an exercise for those who don’t see what the problem is to find out on their own.  My point is to be aware of what high level code gets turned into by the compiler. Know the difference between an operator and a function.

That’s all for now.  If anyone finds this useful, I’ll continue this thread.

March 16, 2007 Posted by bitbank | arm, asm, assembly language, benchmark, optimization, optimização, pocket pc, smartphone, tech, wince, xscale | | 3 Comments