Larry’s Personal & Tech ramblings

Working with the cache on ARM/OMAP

I decided to revisit my performance testing, this time taking into account how the data cache affects performance. I wanted to see about speeding up JPEG decoding, and how writing “the wrong way” into memory affects its performance. The JPEG test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a row of 8×8 macroblocks spans 8 scanlines (25,600 bytes) and won’t fit entirely in the 16K data cache.

What I found was surprising; the results were similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing individual 16-bit pixels (8 per line) takes 3 times as long as writing them in one shot with an STMIA of 4 registers. On the XScale, the STM was actually a bit slower.

I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply to write through. This implies that the OMAP is smart about what type of write occurs and the XScale isn’t. It also implies that the OMAP doesn’t try to fill the rest of the cache line on a partial write if you use STM. Can anyone verify this theory?
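To make the two store patterns concrete, here is a minimal C sketch. It is not the assembly used in the test: the function names and constants are hypothetical, the packing assumes a little-endian target, and whether a compiler turns the four-word copy into a single STMIA depends on the compiler and its flags. It also works out the size of one row of 8×8 blocks from the figures above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Figures taken from the post; the names are hypothetical. */
enum {
    IMAGE_WIDTH     = 1600,      /* pixels per scanline */
    BYTES_PER_PIXEL = 2,         /* 16bpp */
    BLOCK_SIZE      = 8,         /* 8x8 blocks */
    DCACHE_BYTES    = 16 * 1024  /* data cache size quoted in the post */
};

/* Bytes touched while decoding one row of 8x8 blocks across the image:
 * 8 scanlines of 3200 bytes each. */
unsigned block_row_bytes(void)
{
    return (unsigned)(IMAGE_WIDTH * BYTES_PER_PIXEL * BLOCK_SIZE);
}

/* Variant A: eight separate 16-bit stores, one per pixel. */
void store_row_halfwords(uint16_t *dst, const uint16_t px[8])
{
    assert(dst != NULL);
    for (int i = 0; i < 8; i++)
        dst[i] = px[i];
}

/* Variant B: pack the eight pixels into four 32-bit words and write
 * them together. Assumes little-endian layout (first pixel in the low
 * half of each word). A compiler may or may not emit STMIA for this. */
void store_row_words(uint16_t *dst, const uint16_t px[8])
{
    uint32_t w[4];

    assert(dst != NULL);
    for (int i = 0; i < 4; i++)
        w[i] = (uint32_t)px[2 * i] | ((uint32_t)px[2 * i + 1] << 16);
    memcpy(dst, w, sizeof w);  /* one 16-byte write, no alignment assumed */
}
```

Both variants leave the same 16 bytes in the destination on a little-endian machine; only the store pattern differs, which is the thing being timed. To guarantee the STMIA form, the store would have to be written in assembly.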

April 3, 2007 - Posted by bitbank | arm, arm9, asm, assembly language, benchmark, jpeg, optimization, performance, tech, xscale | 3 Comments

3 Comments

  1. Did you consider the possibility that the OMAP has a write buffer and the XScale does not?

    Comment by Naresh | April 9, 2007

  2. Dear Sir,

    I am looking for assembly language code for a JPEG decoder on an ARM processor. Can you help me?

    Thank you.

    Comment by Shomik Chakraborty | April 19, 2007

  3. I also saw performance improvements using the STMIA instruction. However, it was on an ARM7-based platform with no cache, and the writes went to external memory. ARM7 supports burst writes of up to 4×32-bit words, and fortunately my memory interface controller supported them too. The same holds for the LDM instruction on this platform.

    Comment by Pradeep | August 18, 2007
