Optimizing Write Pipelining

Passing the old "Copyspeed" barrier ..


by Tim Böscke (Azure)

This document is about the basics of "write pipelining" and describes a new approach to squeeze even more speed out of so-called "copyspeed" c2ps on the 040/060.

Instruction pipelining is a technique used in every modern CPU. It means splitting the task of executing an instruction into several subtasks, called pipeline stages. Every stage of the pipeline has a specific job. Let's take a look at a very simplified version of the 68040/060 instruction pipeline:


  1. Instruction Fetch
  2. Instruction Decode
  3. Data Fetch
  4. Instruction Execute
  5. Write Back


The idea behind this is that a new instruction can enter the pipeline as soon as the previous one moves on to stage two, and so on. This way many instructions are executed at the same time in different stages, so that no part of the CPU is idle.
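As a simplified illustration (abbreviating the five stages above as IF, ID, DF, EX and WB), three consecutive instructions overlap like this:

	clock:   1    2    3    4    5    6    7
	insn 1:  IF   ID   DF   EX   WB
	insn 2:       IF   ID   DF   EX   WB
	insn 3:            IF   ID   DF   EX   WB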

For us the most interesting stage is the last one (write back). This stage is only used by instructions that actually write to memory. A memory write can be held in this stage while several other instructions execute in the remaining stages of the pipeline (the 060 can even hold up to 4 writes in a special write buffer). There is one exception: when a memory read (a non-cached read) occurs, execution of that instruction is stalled until the pending writes have finished.
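A minimal sketch of what this means in practice (the registers and instructions here are my own arbitrary choice, not taken from any particular c2p):

	move.l	d0,(a0)		;write is held in the write-back stage
	ror.l	#8,d1		;register-only instructions keep
	and.l	d7,d2		;executing while the write completes
	move.l	(a1)+,d3	;a non-cached read stalls here until
				;the pending write has finished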

The 68030 has a very similar behaviour due to its Write Pending Buffer.

Chipram writes usually take a very long time. With this knowledge about the pipeline we can use a very simple trick to speed up the c2p process: we simply use the time needed for the chipram writes to perform the c2p conversion. We just have to take care not to read from memory until the write has finished, otherwise CPU time is wasted.

Thanks to this simple but very effective trick it is possible to do a chunky-to-planar conversion as fast as a fastmem-to-chipmem copy. Depending on the CPU speed, more or fewer c2p passes can be performed during the chipmem writes. On fast 68040s and on 68060s a full c2p can be done in copyspeed. On a 68030/50 about 3 c2p passes can be executed during the chipwrites.

So a good CPU-only c2p for the 68040/68060 is organized like this:


	rept	4

	move.l	(fast)+,reg1	;read 8 longwords (32 bytes) of
	move.l	(fast)+,reg2	;chunky data from fastmem
	move.l	(fast)+,reg3
	move.l	(fast)+,reg4
	move.l	(fast)+,reg5
	move.l	(fast)+,reg6
	move.l	(fast)+,reg7
	move.l	(fast)+,reg8
	move.l	tmp1,(chip)+	;write the results of the previous
	(c2p pass..)		;iteration to chipmem, interleaved
	move.l	tmp2,(chip)+	;with the c2p passes that convert
	(c2p pass..)		;reg1-reg8 into tmp1-tmp8
	move.l	tmp3,(chip)+
	(c2p pass..)
	move.l	tmp4,(chip)+
	(c2p pass..)
	move.l	tmp5,(chip)+
	(c2p pass..)
	move.l	tmp6,(chip)+
	(c2p pass..)
	move.l	tmp7,(chip)+
	(c2p pass..)
	move.l	tmp8,(chip)+
	(c2p pass..)

	endr


Using this loop, copying (c2p'ing) a 320x256x8 screen takes about 220 rasterlines on my system (A4060/50) with all DMA switched off. This is what is usually referred to as "Copyspeed".
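As a back-of-the-envelope figure (my own arithmetic, not from the original measurement): a 320x256x8 screen is 81920 bytes, and 220 PAL rasterlines are 220 * 64 us = 14.08 ms, so this corresponds to roughly 5.8 MB/s of sustained chipmem write bandwidth.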

Looks hard to optimize? Yes, but there is a way. Taking a closer look at this loop, we quickly notice that not all instructions are executed during the chipmem writes: the fastmem-reading moves are not, because they are stalled until the chipmem writes are finished.

This can be avoided: as mentioned above, data cache reads can be executed even while a memory write is pending. So all we have to do is make sure that the source data is already in the data cache.

This is quite easy. Each cacheline holds 16 bytes of data. Every read access to cacheable memory that misses the cache loads an entire cacheline into the data cache. So we only have to touch the source memory once every 16 bytes to preload it.

A possible loop utilizing this trick might look like this:

	tst.w		0*16(fast)
	tst.w		1*16(fast)
	tst.w		2*16(fast)
	tst.w		3*16(fast)
	tst.w		4*16(fast)
	tst.w		5*16(fast)
	tst.w		6*16(fast)
	tst.w		7*16(fast)	;Preload 8 cachelines = 8*16 = 128 bytes

	rept	4
	move.l	(fast)+,reg1	;these reads now hit the data cache
	move.l	(fast)+,reg2	;and no longer stall behind the
	move.l	(fast)+,reg3	;pending chipmem writes
	move.l	(fast)+,reg4
	move.l	(fast)+,reg5
	move.l	(fast)+,reg6
	move.l	(fast)+,reg7
	move.l	(fast)+,reg8
	move.l	tmp1,(chip)+
	(c2p pass..)
	move.l	tmp2,(chip)+
	(c2p pass..)
	move.l	tmp3,(chip)+
	(c2p pass..)
	move.l	tmp4,(chip)+
	(c2p pass..)
	move.l	tmp5,(chip)+
	(c2p pass..)
	move.l	tmp6,(chip)+
	(c2p pass..)
	move.l	tmp7,(chip)+
	(c2p pass..)
	move.l	tmp8,(chip)+
	(c2p pass..)
	endr
				;Convert 4*32 = 128 bytes of data


With this loop, copying a 320x256x8 screen (no DMA) takes about 216 rasterlines. So we gained 4 rasterlines! That is not very much, but at least it passes the barrier :)

On a 68060/50, 5 cycles are gained for each c2p loop (converting 32 bytes).
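A quick sanity check (again my own arithmetic): 81920 bytes / 32 bytes per loop = 2560 loops, and 2560 * 5 cycles = 12800 cycles. At 50 MHz this is 256 us, or exactly 4 PAL rasterlines of 64 us each, which matches the 4 rasterlines measured above.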

There is still another trick to speed up the data preloading. It's possible to load two cachelines at once with a single instruction: a word read at offset 15 touches bytes 15 and 16, which lie in two adjacent cachelines, so both lines are fetched. A 16-byte-aligned buffer is needed though:

	tst.w		0*16+15(fast)
	tst.w		2*16+15(fast)
	tst.w		4*16+15(fast)
	tst.w		6*16+15(fast)	;Load 8 Cachelines


This trick might also speed up fastmem->fastmem operations in some cases.
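A hedged sketch of how that could look (the registers a0/a1 and the 64-byte block size are my own choice, and the source is assumed to be 16-byte aligned):

	tst.w	0*16+15(a0)	;pull in cachelines 0 and 1
	tst.w	2*16+15(a0)	;pull in cachelines 2 and 3

	rept	16
	move.l	(a0)+,(a1)+	;copy 64 bytes; the reads hit the
	endr			;data cache while earlier writes drain

Whether this actually wins anything depends on the board's memory and write-buffer behaviour, so it should be measured case by case.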

Here is a full 040/060 c2p utilizing Cache Preloading.




Last change: 16.01.2001