Working with the cache on ARM/OMAP
I decided to revisit my performance testing, this time taking into account how the data cache affects performance. I wanted to see whether I could speed up JPEG decoding, and how writing “the wrong way” into memory affects its performance. The test I used was to decode a fake 1600×1200 16bpp image. Since each scanline is 3200 bytes, a row of 8×8 macroblocks spans 8 × 3200 = 25,600 bytes and won’t fit entirely in the 16K data cache.

What I found was surprising, and similar to the surprise I got with STM vs. STR on the OMAP and XScale. On the OMAP, writing the 8 pixels of a macroblock line as individual 16-bit stores takes 3 times as long as writing them in one shot with an STMIA of 4 registers. I can only imagine that on the OMAP, the STM instruction tells the memory controller not to read the cache line, but simply to write through. On the XScale, the STM was actually a bit slower. This suggests that the OMAP is smart about what type of write occurs and the XScale isn’t, and that the OMAP doesn’t try to fill the rest of the cache line on a partial write if you use STM. Can anyone verify this theory?
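To make the two write patterns concrete, here is a rough C sketch of what I am comparing. It is not the decoder's actual code: the function names (`mcu_row_bytes`, `store_row_halfwords`, `store_row_words`) and the constants are made up for illustration, and the packing assumes a little-endian target with a 4-byte-aligned destination. Whether a compiler really emits eight STRH instructions for the first routine and a single STM for the second depends on the compiler and flags, so for real timing you would check the generated assembly or write the stores by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants taken from the test described above. */
enum {
    IMAGE_WIDTH     = 1600,      /* pixels per scanline */
    BYTES_PER_PIXEL = 2,         /* 16bpp */
    MCU_SIZE        = 8,         /* 8x8 macroblock */
    DCACHE_BYTES    = 16 * 1024  /* 16K data cache */
};

/* Bytes spanned by one row of macroblocks: 8 scanlines of 3200 bytes each. */
static size_t mcu_row_bytes(void)
{
    return (size_t)IMAGE_WIDTH * BYTES_PER_PIXEL * MCU_SIZE;
}

/* Pattern A: eight separate 16-bit stores, one per pixel. */
static void store_row_halfwords(uint16_t *dst, const uint16_t px[8])
{
    for (int i = 0; i < 8; i++)
        dst[i] = px[i];
}

/* Pattern B: pack the eight pixels into four 32-bit words first, then
 * write all 16 bytes in one go (the shape an STMIA of 4 registers has).
 * Assumes little-endian byte order and a 4-byte-aligned destination. */
static void store_row_words(uint16_t *dst, const uint16_t px[8])
{
    uint32_t w[4];

    assert(((uintptr_t)dst & 3u) == 0);
    for (int i = 0; i < 4; i++)
        w[i] = (uint32_t)px[2 * i] | ((uint32_t)px[2 * i + 1] << 16);
    memcpy(dst, w, sizeof w);
}
```

Both routines leave the same 16 bytes in memory on a little-endian machine; the only difference is how many store operations reach the cache and write buffer, which is exactly the thing the timings above seem to be sensitive to.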