Instruction pipelining is a technique used in every modern CPU. It means splitting the task of executing an instruction into several subtasks, the pipeline stages. Every stage of the pipeline has a specific task. Let's take a look at a very simplified version of the 68040/060 instruction pipeline:

[Figure: simplified 68040/060 instruction pipeline]
For us the last stage (Write Back) is the most interesting one. This stage is only used by instructions that actually write to memory. A memory write can be held in this stage while multiple other instructions are executed in the other stages of the pipeline. (The 060 can even hold up to 4 writes in a special write buffer.) There is one exception: when a memory read (a non-cached read) occurs, the execution of that instruction is suspended until the pending writes have finished. The 68030 behaves very similarly due to its Write Pending Buffer.

Chipram writes usually take a very long time. Knowing how the pipeline works, we can use a very simple trick to speed up the c2p process: we use the time needed for the chipram writes to perform the c2p conversion. We just have to take care not to read from memory until the write has finished, otherwise CPU time is wasted. Thanks to this simple but very effective trick, a chunky to planar conversion can be as fast as a fastmem to chipmem copy.

Depending on the CPU speed, more or fewer c2p passes can be performed during the chipmem writes. On fast 68040s and on 68060s a full c2p can be done at copyspeed. On a 68030/50 it is possible to execute about 3 c2p passes during the chipwrites. So a good 68040/68060 CPU-only c2p is organized like this:

        rept    4
        move.l  (fast)+,reg1
        move.l  (fast)+,reg2
        move.l  (fast)+,reg3
        move.l  (fast)+,reg4
        move.l  (fast)+,reg5
        move.l  (fast)+,reg6
        move.l  (fast)+,reg7
        move.l  (fast)+,reg8

        move.l  tmp1,(chip)+
        (c2p pass..)
        move.l  tmp2,(chip)
        (c2p pass..)
        move.l  tmp3,(chip)
        (c2p pass..)
        move.l  tmp4,(chip)
        (c2p pass..)
        move.l  tmp5,(chip)
        (c2p pass..)
        move.l  tmp6,(chip)
        (c2p pass..)
        move.l  tmp7,(chip)
        (c2p pass..)
        move.l  tmp8,(chip)
        (c2p pass..)
        endr

Using this loop, copying (c2ping) a 320x256x8 screen takes about 220 rasterlines on my system (A4060/50) with no DMA switched on. This is what is usually referred to as "Copyspeed".
Does the c2p loop look hard to optimize? Yes, but there is a way. Taking a closer look at this loop, we quickly recognize that not all instructions are executed during the chipmem writes: the moves reading from fastmem are not, because they are stalled until the chipmem writes have finished. This can be avoided: as mentioned above, reads that hit the data cache can be executed even while a memory write is pending. So all we have to do is make sure that the source data is already in the data cache. This is quite easy. Each cacheline holds 16 bytes of data, and every read access to cacheable memory loads an entire cacheline into the data cache. So we only have to touch the source memory in 16 byte steps. A possible loop utilizing this trick might look like this:

        tst.w   0*16(fast)
        tst.w   1*16(fast)
        tst.w   2*16(fast)
        tst.w   3*16(fast)
        tst.w   4*16(fast)
        tst.w   5*16(fast)
        tst.w   6*16(fast)
        tst.w   7*16(fast)      ;Preload 8 cachelines = 8*16 = 128 bytes

        rept    4
        move.l  (fast)+,reg1
        move.l  (fast)+,reg2
        move.l  (fast)+,reg3
        move.l  (fast)+,reg4
        move.l  (fast)+,reg5
        move.l  (fast)+,reg6
        move.l  (fast)+,reg7
        move.l  (fast)+,reg8

        move.l  tmp1,(chip)+
        (c2p pass..)
        move.l  tmp2,(chip)
        (c2p pass..)
        move.l  tmp3,(chip)
        (c2p pass..)
        move.l  tmp4,(chip)
        (c2p pass..)
        move.l  tmp5,(chip)
        (c2p pass..)
        move.l  tmp6,(chip)
        (c2p pass..)
        move.l  tmp7,(chip)
        (c2p pass..)
        move.l  tmp8,(chip)
        (c2p pass..)
        endr                    ;Convert 4*32 = 128 bytes of data

With this loop, copying a 320x256x8 screen (no DMA) takes about 216 rasterlines. So we gained 4 rasterlines! This is not very much, but at least it beats the copyspeed barrier :) On the 060/50, 5 cycles are gained for each c2p loop (converting 32 bytes).

There is yet another trick to speed up the data preloading: it is possible to load two cachelines at once with a single instruction.
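A 16 byte aligned buffer is needed though:

        tst.w   0*16+15(fast)
        tst.w   2*16+15(fast)
        tst.w   4*16+15(fast)
        tst.w   6*16+15(fast)   ;Load 8 cachelines

Each word read at offset 15 of a cacheline spans the last byte of that line and the first byte of the next one, so it touches both lines.

This trick might also speed up fastmem->fastmem operations in some cases. Here is a full 040/060 c2p utilizing cache preloading.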
Last change: 16.01.2001