Sunday, 20 June 2010

3 hrs of sleep after optimizing the Geode's memcpy performance

Since memcpy seems to be a real bottleneck for video processing on our Naos, we investigated further how to optimize its performance (for larger chunks that need to be copied). First of all, we measured the shipped memcpy function using the Geode's Time Stamp Counter (read with the RDTSC instruction) to obtain some transfer rates. Results:

nao@nao [74] [~]$ ./memcpy 500 45 65536
Memory to memory copy rate = 154.037506 MBytes / sec. Block size = 65536.
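
For readers who want to reproduce this kind of measurement, here is a minimal sketch in C of timing memcpy with the Time Stamp Counter. It is not the benchmark program used above; the helper names (read_ticks, time_memcpy, mbytes_per_sec) are made up for illustration, 1 MByte is assumed to be 2^20 bytes, and the tick frequency has to be supplied by the caller. On non-x86 machines it falls back to clock_gettime so it still builds.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Read an increasing tick counter.
 * On x86 this is the Time Stamp Counter, read with RDTSC.
 * Elsewhere we fall back to nanoseconds from clock_gettime(). */
static uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Copy `block` bytes `reps` times with the libc memcpy.
 * Returns the number of ticks the whole loop took. */
static uint64_t time_memcpy(void *dst, const void *src, size_t block, unsigned reps)
{
    uint64_t start = read_ticks();
    for (unsigned i = 0; i < reps; i++) {
        memcpy(dst, src, block);
        /* Compiler barrier so the repeated copies are not optimized away. */
        __asm__ __volatile__("" : : "r"(dst) : "memory");
    }
    return read_ticks() - start;
}

/* Convert bytes copied and elapsed ticks into MBytes/sec.
 * ticks_per_sec is the counter frequency (assumption: supplied by the caller).
 * 1 MByte = 2^20 bytes here, which is also an assumption. */
static double mbytes_per_sec(uint64_t bytes, uint64_t ticks, double ticks_per_sec)
{
    assert(ticks > 0 && ticks_per_sec > 0.0);
    double seconds = (double)ticks / ticks_per_sec;
    return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}
```

Usage would look like mbytes_per_sec((uint64_t)reps * block, time_memcpy(dst, src, block, reps), 500e6), where 500e6 assumes a 500 MHz core clock; with the clock_gettime fallback the frequency is 1e9 instead.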

Well, even -O3 and -march=geode didn't help much (approximately the same rates; compiled with gcc (Debian 4.3.2-1.1) 4.3.2).

Since our Geode has built-in MMX, we wrote our own userspace memcpy using the MMX instruction set last night. What came out is some real fancy assembler code and cache-aligned data structures. Well, benchmarking this made us grin broadly - this is what we'd like to see:

nao@nao [77] [~]$ ./memcpy 500 45 65536
Memory to memory copy rate = 674.978516 MBytes / sec. Block size = 65536.
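
To give an idea of what an MMX-based copy can look like, here is a rough sketch in C with GCC inline assembly. It is not our actual routine: mmx_copy and alloc_line_aligned are hypothetical names, the copy only handles sizes that are a multiple of 64 bytes, and the 32-byte alignment is simply taken from the cache line size quoted below. It moves 64 bytes per iteration through the eight MMX registers and issues emms at the end so the x87 state is usable again. On non-x86 targets it just calls memcpy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocate n bytes aligned to a 32-byte cache line (hypothetical helper).
 * Returns NULL on failure; release with free(). */
static void *alloc_line_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 32, n) != 0)
        return NULL;
    return p;
}

/* Copy n bytes from src to dst, 64 bytes per iteration via mm0..mm7.
 * Sketch only: n must be a multiple of 64 and the buffers must not overlap. */
static void mmx_copy(void *dst, const void *src, size_t n)
{
    assert(n % 64 == 0);
#if defined(__i386__) || defined(__x86_64__)
    const unsigned char *s = src;
    unsigned char *d = dst;
    for (; n > 0; n -= 64, s += 64, d += 64) {
        __asm__ __volatile__(
            "movq   (%0), %%mm0\n\t"
            "movq  8(%0), %%mm1\n\t"
            "movq 16(%0), %%mm2\n\t"
            "movq 24(%0), %%mm3\n\t"
            "movq 32(%0), %%mm4\n\t"
            "movq 40(%0), %%mm5\n\t"
            "movq 48(%0), %%mm6\n\t"
            "movq 56(%0), %%mm7\n\t"
            "movq %%mm0,   (%1)\n\t"
            "movq %%mm1,  8(%1)\n\t"
            "movq %%mm2, 16(%1)\n\t"
            "movq %%mm3, 24(%1)\n\t"
            "movq %%mm4, 32(%1)\n\t"
            "movq %%mm5, 40(%1)\n\t"
            "movq %%mm6, 48(%1)\n\t"
            "movq %%mm7, 56(%1)\n\t"
            :
            : "r"(s), "r"(d)
            : "memory", "mm0", "mm1", "mm2", "mm3",
              "mm4", "mm5", "mm6", "mm7");
    }
    /* Leave MMX state so later floating point code works. */
    __asm__ __volatile__("emms" : : : "memory");
#else
    memcpy(dst, src, n); /* portable fallback, no MMX available */
#endif
}
```

A call would be mmx_copy(dst, src, 65536) with both buffers coming from alloc_line_aligned(65536). Whether a loop like this alone reaches the rate shown above depends on details such as prefetching and loop unrolling, so treat it as a starting point.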

Some facts about the processor from the OLPC pages, which seem to be nearly the only source besides AMD with some interesting details:

The instruction and data L1 caches are 64 KB each, 16-way set associative, with a 32-byte line size. Data is write-back. The L1 cache miss latency is ~10-12 clocks (*).

The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache. The measured line size is 32 bytes (*), the L2 cache miss latency is ~28-35 clocks (*).

The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative. Miss latency ~50+ clocks (*).

MSRs are provided for observing cache line age and TLB entry age.

The LX has the limitation that it can only execute 3 concurrent prefetches (see "This reflects the internal limit of the LX of 3 outstanding prefetch transactions").

The DDR memory as in the XO (DDR333, 166 MHz) has an achievable read speed of 600 MB/sec in optimal conditions: sequential access, 8-byte reads (MMX), prefetch 64 bytes ahead (2x cache line size). This means that on average the Geode LX can read a new cache line every ~20+ cycles. See test5.
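
The access pattern and the arithmetic behind that last figure can be sketched in C as follows. This is an illustration, not the OLPC test code: sum_with_prefetch and cycles_per_line are hypothetical names, __builtin_prefetch is the GCC/Clang builtin and is only a hint to the compiler and CPU, and the core clock passed to cycles_per_line is an assumption (433 MHz is used for the XO in the example).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    LINE_SIZE = 32,                 /* cache line size quoted above */
    PREFETCH_AHEAD = 2 * LINE_SIZE  /* 64 bytes, i.e. two lines ahead */
};

/* Read a buffer sequentially in 8-byte words and return the sum.
 * Once per cache line, hint a prefetch for the data 64 bytes ahead. */
static uint64_t sum_with_prefetch(const uint64_t *buf, size_t n_words)
{
    const size_t words_per_line = LINE_SIZE / sizeof(uint64_t);
    const size_t words_ahead = PREFETCH_AHEAD / sizeof(uint64_t);
    uint64_t sum = 0;

    for (size_t i = 0; i < n_words; i++) {
        /* Only prefetch addresses that are still inside the buffer. */
        if (i % words_per_line == 0 && i + words_ahead < n_words)
            __builtin_prefetch(&buf[i + words_ahead], 0, 0);
        sum += buf[i];
    }
    return sum;
}

/* Average core cycles available per cache line at a given read bandwidth.
 * cpu_hz is an assumption supplied by the caller. */
static double cycles_per_line(double cpu_hz, double bytes_per_sec, unsigned line_size)
{
    assert(bytes_per_sec > 0.0 && line_size > 0);
    double lines_per_sec = bytes_per_sec / (double)line_size;
    return cpu_hz / lines_per_sec;
}
```

With 600e6 bytes/sec and 32-byte lines this is 18.75 million lines per second. Assuming a 433 MHz core, cycles_per_line(433e6, 600e6, 32) comes out at about 23 cycles, which fits the "~20+" above; at an assumed 500 MHz it would be about 27.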

(*): measured with: The Calibrator

