Larry’s Personal & Tech ramblings

Just another WordPress.com weblog

Micro View

“Micro_View” is a product I created several years ago for a client.  It’s a simple imaging library for Win32 and WinCE that loads BMP, GIF and JPEG images into an HBITMAP or displays them in a window.  The code is fast and small (the Win32 lib file is 96K).  I created a stand-alone, command-line-driven executable that displays an image in a borderless window, and a link library that defines 3 functions:

int APIENTRY MicroView(TCHAR *filename, int iOptions);
int APIENTRY MVLoadBitmap(TCHAR *filename, HBITMAP *, HPALETTE *);
int APIENTRY MVLoadResource(HINSTANCE hInst, TCHAR *rname, HBITMAP *pBitmap, HPALETTE *pPalette);

If you need to add simple image handling to your application, it doesn’t get much easier than this.  This is something that’s been collecting dust on my hard drive for quite a while, but it would probably make a pretty good retail product.  I will see if I can package it up into a reasonably priced product in the next few days.  Please email me (bitbank@pobox.com) if you’re in need of such a library.

September 19, 2007 Posted by bitbank | arm, arm9, jpeg, omap, photo, tech, viewer, wince | | No Comments

PQV 4.0 is ready

It’s taken a lot longer than expected to release the latest version of my image viewer. Many other projects have gotten in the way, but I’ve finally wrapped up version 4 of PQV. The sales pages on Handango and PocketGear are not quite working yet, but the PayPal link works. The $14.99 license is per user, not device. A single license allows you to use the viewer on all your Windows Mobile devices and Windows Desktop. Here’s the new product page:

PQV 4.0

When you install the Pocket PC or SmartPhone versions, the Desktop PC version also gets installed.

September 5, 2007 Posted by bitbank | jpeg, photo, pocket pc, smartphone, viewer | | 5 Comments

Fast ARM JPEG code licensing

WordPress (the company which hosts this site) collects some interesting statistical data on the people who visit the blog.  To me, the most interesting data is a list of the search words which direct people to this site.  Since I started including the JPEG and ARM keywords in my posts, I’ve seen a steady stream of people searching for basically the same thing: Free optimized source code for decoding JPEG/MPEG images on ARM devices.  I’ve done such searches myself and have come to the conclusion that it’s not available.  For anyone who has done research and invested tons of time and energy into writing optimized code, it is unlikely that they will be willing to give it away for free.  There are plenty of open-source and free projects on the internet that are valuable and professionally done, but there usually comes a point in a project’s lifetime when the author commercializes it to get compensated for the time invested.

I try to share my knowledge and experience with the developer community; I understand the frustration of wasting precious time locating resources or coming up with workarounds for problems outside of (or within) my code.  I also make a living writing software, and so I must write code which is worth compensation from my customers and maintain innovative solutions which compare well with my competition.  The geek in me would love to have an open discussion about the fastest way to decode Huffman encoded data or minimize the calculations in the IDCT, but as a consultant, that would be self-defeating.

The “trade secrets” are visible in the source code, but hidden in the object code, so licensing object code incurs less risk to me and therefore costs considerably less.  I’ve licensed my code to various companies for amounts ranging from several hundred dollars to tens of thousands.  The price varies according to the risk and time required.  Companies needing help with ARM optimization issues are encouraged to contact me.  The amount I charge for my time or code is usually far more economical than having other programmers spend time trying to invent what I’ve already got working.

July 31, 2007 Posted by bitbank | arm, arm9, asm, assembly language, jpeg, omap, optimization, pocket pc, smartphone, xscale | | 5 Comments

ARM JPEG Benchmarks part 2

I thought it would be useful to re-run the tests with the C version of my JPEG code. From the results, it appears that memory bandwidth is the real limiting factor on speed, and that the pixel colorspace conversion gets the most benefit from my optimized ARM assembly language. It also appears that the OMAP gains more from optimized ASM than the XScale does. Here are the numbers:

C-Code:

PPC: thumbnail: 10.7 milliseconds, DC only: 968 milliseconds, full res: 3734 milliseconds.
SP: thumbnail: 25.1 milliseconds

Mixed C and ASM:

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The load times for the “DC only” and “full res” tests include the time taken to read 4.3MB of data from RAM through the WinCE file system.

These results make sense in that the real benefit of optimization comes from fixing the algorithms and reducing memory usage. The optimized ARM assembly code is certainly helpful in speeding things up, but won’t offer an order of magnitude improvement over what the compiler generates.
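To put numbers on that, the ratios can be worked out directly from the timings above. This is only a small sketch for checking the arithmetic; the function names are made up for illustration and are not part of any library.

```c
#include <stdio.h>

/* Speedup factor of the mixed C/ASM build over the pure C build. */
double speedup(double c_ms, double asm_ms)
{
    return c_ms / asm_ms;
}

/* Prints one line per test, using the timings quoted in this post. */
void print_speedups(void)
{
    printf("PPC thumbnail: %.2fx\n", speedup(10.7, 8.8));
    printf("PPC DC only:   %.2fx\n", speedup(968.0, 830.0));
    printf("PPC full res:  %.2fx\n", speedup(3734.0, 2700.0));
    printf("SP thumbnail:  %.2fx\n", speedup(25.1, 15.1));
}
```

The factors come out to roughly 1.22x, 1.17x and 1.38x on the Pocket PC and 1.66x on the SmartPhone: a useful gain, but nowhere near an order of magnitude.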

July 11, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

ARM JPEG Benchmarks

The great thing about the ARM architecture is that the more I look at a piece of code, the more ways I find to optimize it. The conditional execution, barrel shifter and optional setting of the processor flags create many opportunities for optimization. I’ve spent some more time optimizing my ARM asm JPEG code and now have some hard numbers to publish. I used an HP iPAQ h2210 Pocket PC (400MHz PXA255) and an HTC Hurricane SmartPhone (195MHz OMAP 850) to do the testing. I was able to load the file from RAM on the Pocket PC (to reduce file I/O delays), but not on the SmartPhone, because the SmartPhone file system does not use RAM for file storage. The time spent reading from the slow miniSD card swamps the processing time in the tests, so the only test that was run on the SmartPhone was decompressing a 160×120 thumbnail image in RAM. All tests decompress the image to an RGB565 bitmap. The thumbnail test decompresses the 160×120 EXIF thumbnail image. The “DC only” test creates a single pixel from each MCU (the 3072×2304 image is loaded as 384×288). The “Full res” test decompresses every pixel of the image.

PPC: thumbnail: 8.8 milliseconds, DC only: 830 milliseconds, full res: 2700 milliseconds.
SP: thumbnail: 15.1 milliseconds

The speed difference between the two devices is to be expected considering the different processor and memory bus speeds. The “DC only” test is useful because it shows the relative speed of Huffman decoding. The file size is 4.3MB, so in 830 milliseconds the code was able to decode all of the MCUs and produce a single pixel from each one.
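For a rough sense of scale, the timings can be turned into throughput figures. The sketch below is just arithmetic on the numbers quoted in this post; the function names are hypothetical, and the results should be read as approximate because the measured times are end-to-end load times rather than pure decode times.

```c
#include <stdint.h>

/* Megabytes of compressed data consumed per second. */
double mb_per_second(double megabytes, double elapsed_ms)
{
    return megabytes * 1000.0 / elapsed_ms;
}

/* Millions of output pixels produced per second. */
double mpixels_per_second(uint32_t width, uint32_t height, double elapsed_ms)
{
    double pixels = (double)width * (double)height;
    return pixels / 1.0e6 * 1000.0 / elapsed_ms;
}

/* Output dimension of a "DC only" decode: 1/8 of the full size, rounded up. */
uint32_t dc_only_dimension(uint32_t full_dimension)
{
    return (full_dimension + 7u) / 8u;
}
```

With the Pocket PC figures, `mb_per_second(4.3, 830.0)` is a little under 5.2 MB of compressed data per second for the “DC only” pass, `mpixels_per_second(3072, 2304, 2700.0)` is about 2.6 million pixels per second at full resolution, and `dc_only_dimension` reproduces the 384×288 size from 3072×2304.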

I’ve uploaded the sample image to my web server here: CIMG2209.JPG

The image was taken with a Casio EX-Z750 and depicts a relatively complex scene with many fine details. Like most cameras, the Exilim series saves JPEG images with 2:1 horizontal color subsampling (even when set to maximum quality). It’s not unreasonable for a point-and-shoot camera like the Z750 to save images at less than optimal quality because the image coming off the CCD isn’t that great to begin with. What irks me is that cameras like the Canon 20D do the same thing. With a good SLR lens and imager, the Canon should allow you to save full res color JPEG images.

Comments?

July 7, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, photo, pocket pc, smartphone, viewer, wince, xscale | | 4 Comments

Having fun with JPEG decompression

An odd title considering that JPEG is a cryptic image compression standard.  My idea of fun is optimizing code until there’s nothing left to improve.  I decided a few weeks ago to take the plunge and rewrite the 3 core JPEG decode routines to speed up my imaging code.  One reason was that the great majority of cell phones today are based around the TI OMAP architecture, typically running at around 200MHz.  These devices seem slow at working with images, so I thought I could help that situation by speeding things up to improve both battery usage and the user experience.

The important “inner loop” routines of JPEG image decoding are the Huffman decoding of the MCU (minimum coded unit), the IDCT (inverse discrete cosine transform), and the output stage (turning the YCbCr pixels into RGB pixels).  All 3 routines together turned out to be only a couple hundred lines of ARM code, but the result of rewriting them from C was quite dramatic.  The original C code had been optimized and tested over a long period of time and was in good shape to begin with, but C isn’t so great at bit manipulation and squeezing the most use out of register variables.  It took several iterations to get down to the bare minimum of code, but I’m quite happy with the results.  I used ARMv5 instructions, but made sure that the code performs well on both OMAP and XScale CPUs (unlike Intel’s Integrated Performance Primitives).  Luckily my previous performance testing of the multiply instructions helped guide me to shave a few clock cycles off several routines.  The purpose of this work is threefold:
1) I’m readying a new version of my imaging application (PQV - Pocket QuickView) for Windows Mobile and need it to be competitive with other products.  I pride myself on having the fastest viewer available.
2) I have been staring at the C code for a long time and wondering how much better it could perform if written in optimized ARM asm.
3) I believe this code has value to anyone doing imaging or video on ARM based devices.  Web browsers, image viewers, camera applications and video players can all benefit from this code.
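For readers unfamiliar with the output stage mentioned above, here is a minimal C sketch of converting one pixel from JPEG’s luma/chroma representation to RGB565. It is not my optimized code. It uses the standard JFIF conversion coefficients in 16.16 fixed point, the function names are made up for illustration, and it assumes the compiler implements right shift of negative integers as an arithmetic shift (which common ARM and x86 compilers do).

```c
#include <stdint.h>

/* Clamp an intermediate value to the 0..255 range of an 8-bit channel. */
static int32_t clamp_u8(int32_t v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Convert one full-range (JFIF) YCbCr pixel to a packed RGB565 value. */
uint16_t ycbcr_to_rgb565(uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t cbx = (int32_t)cb - 128;   /* center chroma around zero */
    int32_t crx = (int32_t)cr - 128;

    /* Coefficients 1.402, 0.344136, 0.714136 and 1.772 scaled by 65536;
       adding 32768 before the shift rounds to nearest.
       Assumes arithmetic right shift for negative values. */
    int32_t r = y + ((91881 * crx + 32768) >> 16);
    int32_t g = y - ((22554 * cbx + 46802 * crx + 32768) >> 16);
    int32_t b = y + ((116130 * cbx + 32768) >> 16);

    r = clamp_u8(r);
    g = clamp_u8(g);
    b = clamp_u8(b);

    /* Keep the top 5 bits of red, 6 of green and 5 of blue. */
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

A mid-gray input such as `ycbcr_to_rgb565(128, 128, 128)` yields 0x8410. A real decoder would process whole rows at a time and share the chroma terms between neighbouring pixels when the color is subsampled, which is where most of the savings come from.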

I’ve been searching for the past week or so for customers for this code, but the typical response is a “not invented here” attitude that stands in the way of improving products.

I will post some sample images and benchmarks shortly to back up my claims of fast JPEG decoding.

Anyone interested in licensing object or source code should contact me directly (bitbank@pobox.com).

June 21, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, omap, optimization, performance, pocket pc, smartphone, wince, xscale | | No Comments

Working with the cache on ARM/OMAP

I decided to revisit my performance testing, taking into account how the data cache affects performance. I wanted to see about speeding up JPEG decoding and how writing “the wrong way” into memory affected its performance. The JPEG test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a line of 8×8 macroblocks (8 scanlines, or 25,600 bytes) won’t fit entirely in the 16K data cache. What I found was surprising; I got results similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing individual 16-bit pixels (8 per line) takes 3 times as long as writing them in one shot with a STMIA of 4 registers. I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply write through. On the XScale, the STM was actually a bit slower. This implies that the OMAP is smart about what type of write occurs and the XScale isn’t, and also that the OMAP doesn’t try to fill the rest of the cache line when doing a partial write if you use STM. Can anyone verify this theory?
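The two write patterns can be sketched in C as follows. This is only an illustration with hypothetical function names: my test used hand-written ARM assembly, and a C compiler is free to emit something other than STM for the word version. The packing also assumes a little-endian target, which is what these devices use.

```c
#include <stdint.h>
#include <string.h>

/* Bytes touched when writing `rows` scanlines of a `width`-pixel image. */
uint32_t band_bytes(uint32_t width, uint32_t bytes_per_pixel, uint32_t rows)
{
    return width * bytes_per_pixel * rows;
}

/* Eight separate 16-bit stores, one per pixel (the slow case on the OMAP). */
void store_row_halfwords(uint16_t *dst, const uint16_t px[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = px[i];
}

/* The same eight pixels combined into four 32-bit words and written as one
   16-byte block, similar in spirit to a single STMIA of 4 registers.
   Assumes little-endian byte order. */
void store_row_words(uint16_t *dst, const uint16_t px[8])
{
    uint32_t w[4];
    for (int i = 0; i < 4; i++)
        w[i] = (uint32_t)px[2 * i] | ((uint32_t)px[2 * i + 1] << 16);
    memcpy(dst, w, sizeof w);
}
```

For the test image, `band_bytes(1600, 2, 8)` is 25,600 bytes, which is larger than a 16K (16,384 byte) data cache, so each band of macroblocks evicts part of itself as it is written.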

April 3, 2007 Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, optimization, performance, tech, xscale | | 3 Comments