nao@nao  [~]$ ./memcpy 500 45 65536
Memory to memory copy rate = 154.037506 MBytes / sec. Block size = 65536.
Well, even -O3 and -march=geode didn't help much: the rates stayed approximately the same (compiled with gcc (Debian 4.3.2-1.1) 4.3.2).
Since our Geode has built-in MMX, we wrote our own userspace memcpy using the MMX instruction set last night. What came out is some real fancy assembler code and cache-aligned data structures. Well, benchmarking this made us grin broadly - this is what we'd like to see:
nao@nao  [~]$ ./memcpy 500 45 65536
Memory to memory copy rate = 674.978516 MBytes / sec. Block size = 65536.
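For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a copy-rate benchmark in C. It is not our ./memcpy tool: the function name copy_rate_mb_per_sec is made up for illustration, it assumes 1 MByte = 1,000,000 bytes, and it uses a GCC/Clang inline-asm barrier so the compiler keeps every copy. It times the libc memcpy; swap in your own copy routine to compare.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `block` bytes `iterations` times and returns
   the rate in MBytes/sec (1 MByte = 1e6 bytes, an assumption).
   Returns -1.0 on allocation failure or if the copy did not verify,
   and 0.0 if the elapsed time was too short to measure. */
static double copy_rate_mb_per_sec(size_t block, int iterations)
{
    assert(block > 0 && iterations > 0);

    unsigned char *src = malloc(block);
    unsigned char *dst = malloc(block);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, block);   /* touch both buffers before timing */
    memset(dst, 0, block);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iterations; i++) {
        memcpy(dst, src, block);
        /* Compiler barrier (GCC/Clang): keeps each copy from being elided. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int ok = (memcmp(dst, src, block) == 0);
    free(src);
    free(dst);
    if (!ok)
        return -1.0;

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (secs <= 0.0)
        return 0.0;
    return ((double)block * (double)iterations) / 1e6 / secs;
}
```

Calling copy_rate_mb_per_sec(65536, 2000) gives a figure for the same 64 KB block size as in the runs above; the numbers will of course depend on the machine.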
Some facts about the processor from the OLPC pages, which seem to be nearly the only source besides AMD with interesting details (http://wiki.laptop.org/go/Geode_LX):
The instruction and data L1 caches are each 64 KB, 16-way set associative, with a 32-byte line size. Data is write-back. The L1 cache miss latency is ~10-12 clocks (*).
The L2 cache is 128 KB, 4-way set associative, with an undocumented line size. It can be configured to be data-only, instruction-only, or combined. It is described as a unified L2 victim cache. The measured line size is 32 bytes (*), the L2 cache miss latency is ~28-35 clocks (*).
The instruction and data L1 TLBs are both 16-entry, fully associative. The L2 TLB is 64-entry, 2-way set associative. Miss latency ~50+ clocks (*).
MSRs are provided for observing cache line age and TLB entry age.
The LX is limited to 3 concurrent prefetches ("This reflects the internal limit of the LX of 3 outstanding prefetch transactions").
The DDR memory used in the XO (DDR333, 166 MHz) has an achievable read speed of 600 MB/sec under optimal conditions: sequential access, 8-byte reads (MMX), prefetch 64 bytes ahead (2x cache line size). This means that on average the Geode LX can read a new cache line every ~20+ cycles. See test5.
(*): measured with the Calibrator.
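The "optimal conditions" listed above (sequential access, 8-byte reads, prefetch 64 bytes ahead) can be sketched in portable C. This is not our MMX assembler routine: it is a stand-in that uses uint64_t in place of the MMX registers and the GCC/Clang builtin __builtin_prefetch in place of a hand-written prefetch instruction. The name copy_prefetch64 and the PREFETCH_AHEAD constant are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PREFETCH_AHEAD 64   /* bytes: 2 x the 32-byte cache line size */

/* Hypothetical sketch: copies n bytes from src to dst, 8 bytes per step,
   hinting the cache to fetch data 64 bytes ahead of the read position.
   The two regions must not overlap. */
static void copy_prefetch64(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + 8 <= n; i += 8) {
        /* One prefetch hint per 32 bytes copied, only while the target
           address is still inside the source buffer. */
        if ((i & 31) == 0 && i + PREFETCH_AHEAD < n)
            __builtin_prefetch(s + i + PREFETCH_AHEAD, 0, 0);

        uint64_t w;
        memcpy(&w, s + i, 8);   /* 8-byte load (stands in for an MMX load) */
        memcpy(d + i, &w, 8);   /* 8-byte store */
    }

    /* Copy the remaining 0-7 bytes one at a time. */
    for (; i < n; i++)
        d[i] = s[i];
}
```

Issuing only one hint per 32 bytes keeps the number of outstanding prefetches low, which matters given the LX limit of 3 mentioned above. Whether this comes anywhere near the hand-written MMX version on a Geode would have to be measured.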