A brief discussion about texturemapper innerloops using ADDX 
HTML version by Azure

Note: This is no basic introduction to texturemapping; it's just an article on optimizing innerloops.

If you have U and V (the two texture coordinates) as 8.8 fixed point, then you can have:

    ; d0  U      UUuu   (UU = integer bits, uu = fractional bits)
    ; d1  V      VVvv
    ; d2  dU/dx  UUuu   "horizontal slope of U"
    ; d3  dV/dx  VVvv

    .pixel  move.w  d1,d4           ; d4 (offset reg) = VVxx
            move.w  d0,d5           ; Temporarily...
            lsr.w   #8,d5           ; ...to get UU in lower byte of a reg
            move.b  d5,d4           ; d4 = VVUU = correct offset
            move.b  (a0,d4.w),(a1)+
            add.w   d2,d0
            add.w   d3,d1
            dbf     d7,.pixel

There is however a very nifty instruction which we can make use of here: addx.
Addx is intended to be used for adding arbitrarily large integers.
It adds together the two operands, and if the X flag is
set then it also adds 1 to the result. Imagine that you want to add the 96-bit number in d2:d1:d0 to another 96-bit number in d5:d4:d3.
Then you would proceed like this:

        add.l   d0,d3           ; First add lowermost part
        addx.l  d1,d4           ; .. then the next, with X flag
        addx.l  d2,d5           ; .. and finally the highest (with X)

[There is a special form of addx: "addx -(am),-(an)". It is intended for use when the two numbers are lying around in memory. Quite useful for handling arbitrarily large numbers, or just numbers of different sizes. (There's no use for this in realtime applications though :))]

This can be very valuable when dealing with fixed point too, though; since you can split

        add.w   d1,d0

into

        add.b   d1,d0
        addx.b  d3,d2

(with the fractional bytes in d0/d1 and the integer bytes in d2/d3) ... then you would be able to get at the upper 8 bits of an 8.8 fixed point number without shifting!

There's however one more thing to realize before we try implementing this. One can perform two word additions using only one add.l by having the values packed into two registers; there will sometimes be an overflow from the lower part to the upper part of the answer, but that overflow is always 1, so in some applications it is negligible.
Consider this:

    ; d0  00aa00bb
    ; d1  00cc00dd

        add.l   d1,d0

... then d0 will be = (aa + cc) shifted up 16 + (bb + dd); the add has to be repeated at least 256 times before any overflow from the lower sum reaches the (aa + cc) calculation.
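To make the flag handling concrete, here is a minimal Python sketch of the operations discussed so far. It is only a model of the arithmetic, not 68k code: the helper names (add, addx, add96, step_8_8_split, packed_add) are made up for this illustration, and X is modelled simply as the carry out of the addition.

```python
def add(a, b, bits):
    """Model of ADD: returns (result, X), where X is the carry out."""
    total = a + b
    return total & ((1 << bits) - 1), total >> bits


def addx(a, b, x, bits):
    """Model of ADDX: like ADD, but the incoming X flag is added too."""
    total = a + b + x
    return total & ((1 << bits) - 1), total >> bits


def add96(src, dst):
    """Add two 96-bit numbers given as (low, mid, high) 32-bit parts."""
    low, x = add(dst[0], src[0], 32)        # add.l  d0,d3
    mid, x = addx(dst[1], src[1], x, 32)    # addx.l d1,d4
    high, x = addx(dst[2], src[2], x, 32)   # addx.l d2,d5
    return low, mid, high


def step_8_8_split(frac, integer, dfrac, dinteger):
    """Step an 8.8 value kept as separate fraction and integer bytes."""
    frac, x = add(frac, dfrac, 8)                  # add.b  d1,d0
    integer, _ = addx(integer, dinteger, x, 8)     # addx.b d3,d2
    return frac, integer


def packed_add(d0, d1):
    """add.l d1,d0 on registers laid out as 00aa00bb and 00cc00dd."""
    result, _ = add(d0, d1, 32)
    return result
```

For example, step_8_8_split(0xC0, 0x12, 0x80, 0x01) gives (0x40, 0x14), which is the same as the word add 0x12C0 + 0x0180 = 0x1440, and packed_add(0x00120034, 0x00560078) gives 0x006800AC, i.e. both byte sums in one go.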
Now we look at our tmapper's stepping:

        add.w   d1,d0           ; Interpolate V
        add.b   d3,d2           ; Interpolate U fraction
        addx.b  d5,d4           ; Interpolate U integer

... and here we [after some thinking :)] realize that we can make the "add.b d3,d2" in the uppermost bytes of d0 and d1 instead:

    ; d0  V        **VVvv   (this is how I denote different contents
    ; d0  U      uu****      in the same register)
    ; d1  dV/dx    **VVvv
    ; d1  dU/dx  uu****
    ; d2  U            UU
    ; d3  dU/dx        UU

    .pixel  move.w  d0,d4           ; d4 = VVvv
            move.b  d2,d4           ; d4 = VVUU
            move.b  (a0,d4.w),(a5)+
            add.l   d1,d0
            addx.b  d3,d2
            dbf     d7,.pixel

Now that's pretty short. :) In fact this is what is commonly referred to as "a 5-inst tmapper" (5 instructions not counting the dbf). There is no obvious way of speeding this up; it has looked like this since 1994 at least.
Notice what the add/addx thing can be thought of to look like:

    d2:d0 + d3:d1 = UU:uu**VVvv + UU:uu**VVvv

This shows that the add/addx is nothing but two adds packed into one (but the "one" add happens to be executed through two instructions), so we are not doing anything excessively weird yet.
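As a cross-check, the packed stepping can be compared against the plain 8.8 loop in a small Python model. This is only a sketch under assumptions: the function names are invented here, u, v, du and dv are 16-bit 8.8 values, and the spare byte between uu and VVvv starts out as 00 in both d0 and d1.

```python
def tmap_offsets_naive(u, v, du, dv, count):
    """Reference: U and V as plain 8.8 words, texture offset = VVUU."""
    offsets = []
    for _ in range(count):
        offsets.append((v & 0xFF00) | (u >> 8))
        u = (u + du) & 0xFFFF            # add.w d2,d0
        v = (v + dv) & 0xFFFF            # add.w d3,d1
    return offsets


def tmap_offsets_addx(u, v, du, dv, count):
    """Packed loop: d0 = uu:00:VVvv, d2 = UU, stepped by add.l + addx.b."""
    d0 = ((u & 0xFF) << 24) | v          # uu in the top byte, spare byte = 00
    d1 = ((du & 0xFF) << 24) | dv
    d2, d3 = u >> 8, du >> 8             # integer parts of U and dU/dx
    offsets = []
    for _ in range(count):
        offsets.append((d0 & 0xFF00) | d2)    # move.w d0,d4 / move.b d2,d4
        total = d0 + d1                       # add.l  d1,d0
        d0, x = total & 0xFFFFFFFF, total >> 32
        d2 = (d2 + d3 + x) & 0xFF             # addx.b d3,d2
    return offsets
```

In this model the two functions return identical offsets for the first 256 pixels of a span, also with negative slopes such as dv = 0xFF40, because the spare byte can absorb at most one carry per pixel before anything spills into uu.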
The error that gets carried from the VVvv parts into the uu part is, if you
initialize the '**' parts to 00, at most 1 u per 256 iterations = 1 U per 65536
iterations. That error is highly negligible!

But perhaps you want more accuracy? 16 fractional bits? It is usually not necessary on the Amiga in 320x256 resolution, but it is good to know that it is possible.
Let us first begin with the problem of only interpolating one value:

    ; d0  C      CCcccc   (C for colour, might be for a gouraud rout)
    ; d1  dC/dx  CCcccc

Usually one would do:

        add.l   d1,d0
        move.l  d0,d2
        swap    d2

... and then use d2.
That can be changed into:

        add.w   d1,d0
        addx.b  d3,d2

... however, look at this code if we unroll (repeat) it several times:

        add.w   d1,d0
        addx.b  d3,d2
        add.w   d1,d0
        addx.b  d3,d2
        add.w   d1,d0
        addx.b  d3,d2

Notice that after the first "addx.b d3,d2", the subsequent "add.w d1,d0" could be done at the top of d2/d3 in the addx! (Remember that an addx is just like an add, plus the X flag.)
Therefore, our code could also look like this:

    ; d0  C      cccc
    ; d1  dC/dx  cccc
    ; d2  C      ccccCC
    ; d3  dC/dx  ccccCC

        add.w   d1,d0
        addx.l  d3,d2
        addx.l  d3,d2
        addx.l  d3,d2

and just to get rid of the need for d0/d1 at the beginning:

        move.l  d3,d4
        clr.w   d4
        add.l   d4,d2           ; Step only fractional part, "init"
        addx.l  d3,d2
        addx.l  d3,d2
        addx.l  d3,d2

(Actually one should end this chain with "addx.w", but that only matters if one needs the result value at the end of the operation; normally one doesn't, at least not in tmappers.)

This is a very interesting approach because it creates something which I like to call a "cyclic add", which so to say never ends.
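The cyclic add for a single value can be modelled in a few lines of Python and compared with the usual add.l/swap version. This is a sketch under assumptions: the function names are invented, c and dc are 24-bit CCcccc values, and the registers are assumed to hold the fraction in the upper word and the integer in the low byte with a 00 byte in between.

```python
def cyclic_add_integers(c, dc, count):
    """Integer part CC after each addx.l of the cyclic add chain."""
    mask = 0xFFFFFFFF
    d2 = ((c & 0xFFFF) << 16) | (c >> 16)      # fraction on top, CC at the bottom
    d3 = ((dc & 0xFFFF) << 16) | (dc >> 16)
    total = d2 + (d3 & 0xFFFF0000)             # move.l/clr.w/add.l: fraction only
    d2, x = total & mask, total >> 32
    results = []
    for _ in range(count):
        total = d2 + d3 + x                    # addx.l d3,d2
        d2, x = total & mask, total >> 32
        results.append(d2 & 0xFF)
    return results


def plain_add_integers(c, dc, count):
    """Reference: add.l d1,d0 / move.l d0,d2 / swap d2 on CCcccc values."""
    results = []
    for _ in range(count):
        c = (c + dc) & 0xFFFFFF
        results.append(c >> 16)
    return results
```

With c = 0x108000 and dc = 0x01C000 (a step of 1.75), both functions start with 0x12, 0x14 and keep agreeing; the extra fractional add in the "init" is exactly what supplies the carry that the first addx.l consumes.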
Building this with two values (16.16) could look like this:

    ; d0  bbbbAAAA
    ; d1  aaaaBBBB
    ; d2 & d3 same, but d?/dx

        move.l  d3,d4
        clr.w   d4
        add.l   d4,d1           ; start the chain (adding last fractional
                                ; part)
        addx.l  d2,d0
        addx.l  d3,d1           ; iteration 1 done
        addx.l  d2,d0
        addx.l  d3,d1           ; iteration 2 done
        ...

What kind of setup gives us the look of d1:d0?
Check them out as "normal" and as "finished" in 64-bit format:

    d1:d0 "normal" = BBBBbbbb:AAAAaaaa
    d1:d0 "2addx"  = aaaaBBBB:bbbbAAAA

... which means that the setup operation was (in theory, of course) "ror.q #16,d1:d0". This shows us why we should add the uppermost word of the 64-bit value first (the move.l/clr.w/add.l init): because that's where the lowest bits of the original value are.
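The rotated layout and the two-register chain can be sketched in Python as follows. The names pack_ror16, addx_l and cyclic_pair are hypothetical helpers for this illustration, and a, b, da, db are assumed to be plain 16.16 values.

```python
def pack_ror16(a, b):
    """Rotate the 64-bit pair BBBBbbbb:AAAAaaaa right by 16, return (d0, d1)."""
    q = (b << 32) | a
    q = ((q >> 16) | (q << 48)) & 0xFFFFFFFFFFFFFFFF
    return q & 0xFFFFFFFF, q >> 32       # d0 = bbbbAAAA, d1 = aaaaBBBB


def addx_l(dst, src, x):
    """Model of addx.l: 32-bit add plus incoming X, returns (result, X)."""
    total = dst + src + x
    return total & 0xFFFFFFFF, total >> 32


def cyclic_pair(a, b, da, db, count):
    """Integer parts (AAAA, BBBB) after each iteration of the chain."""
    d0, d1 = pack_ror16(a, b)
    d2, d3 = pack_ror16(da, db)
    d1, x = addx_l(d1, d3 & 0xFFFF0000, 0)   # move.l/clr.w/add.l: start the chain
    out = []
    for _ in range(count):
        d0, x = addx_l(d0, d2, x)            # addx.l d2,d0
        d1, x = addx_l(d1, d3, x)            # addx.l d3,d1
        out.append((d0 & 0xFFFF, d1 & 0xFFFF))
    return out
```

For instance pack_ror16(0x11112222, 0x33334444) returns (0x44441111, 0x22223333), and as long as the integer words do not wrap, cyclic_pair tracks the integer parts of a + k*da and b + k*db exactly.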
Now let us finally implement this into a texturemapper:

    ; d0  U          **UU
    ; d0  V      vv**
    ; d1  dU/dx      **UU
    ; d1  dV/dx  vv**
    ; d2  U      uuuu****
    ; d2  V      ****VVvv
    ; d3  dU/dx  uuuu****
    ; d3  dV/dx  ****VVvv

... and the code:

            move.l  d3,d4
            clr.w   d4
            add.l   d4,d2           ; Start the X flag in the "cyclic add"
    .pixel  move.w  d2,d4           ; d4 = VVvv
            move.b  d0,d4           ; d4 = VVUU
            move.b  (a0,d4.w),(a1)+
            addx.l  d1,d0
            addx.l  d3,d2
            dbf     d7,.pixel

These loops are nice and fast on 020/030, sure, but how about 040/060? There the instructions should be reordered a bit to remove stalls. (The move.w/move.b causes a 0.5 cycle stall on 060, and the closeness between move.b and the pixel-copying move.b causes a 1 cycle stall on 040 and 060.)
Reorder the 8.8 loop to this:

    .pixel  move.w  d0,d4           ; d4 = VVvv
            add.l   d1,d0
            move.b  d2,d4           ; d4 = VVUU
            addx.b  d3,d2
            move.b  (a0,d4.w),(a5)+
            dbf     d7,.pixel

And the 16.16 loop to this:

    .pixel  move.w  d2,d4           ; d4 = VVvv
            addx.l  d1,d0
            move.b  d0,d4           ; d4 = VVUU
            addx.l  d3,d2
            move.b  (a0,d4.w),(a1)+
            dbf     d7,.pixel

(Note that in this ordering "move.b d0,d4" picks up UU after "addx.l d1,d0" has already stepped it, so U runs one pixel ahead of V; start U one step earlier in the setup to compensate.)

The above loops can be sped up a tiny bit more, but any more optimization is left as an exercise for the readers. ;)

Oh, and you might want to keep the upper word of d4 cleared (through a "moveq #0,d4" before the pixel loop), since you then can use d4.l as offset into the texture and a0 thus doesn't need to point to the middle of the texture (if you have size 256x256 textures). It all depends on the circumstances though...
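The 16.16 loops above can be cross-checked with a small Python model. This is a sketch under assumptions, not the author's setup code: u, v, du, dv are taken to be 24-bit values (8 integer bits, 16 fractional bits), the two spare bytes of d0/d1 start as 00, and the names pack, addx_l, offsets_16_16 and offsets_reference are invented for the illustration. The reordered flag models the 040/060 ordering, where UU is read after its addx.l.

```python
MASK32 = 0xFFFFFFFF


def addx_l(dst, src, x):
    """Model of addx.l: 32-bit add plus incoming X, returns (result, X)."""
    total = dst + src + x
    return total & MASK32, total >> 32


def pack(u, v):
    """Split 8.16 values U and V into the d0/d2 (or d1/d3) layout."""
    low = ((v & 0xFF) << 24) | (u >> 16)       # vv:00:00:UU
    high = ((u & 0xFFFF) << 16) | (v >> 8)     # uuuu:VVvv
    return low, high


def offsets_16_16(u, v, du, dv, count, reordered=False):
    """Texture offsets (VVUU) produced by the cyclic add loop."""
    d0, d2 = pack(u, v)
    d1, d3 = pack(du, dv)
    d2, x = addx_l(d2, d3 & 0xFFFF0000, 0)     # move.l/clr.w/add.l init
    result = []
    for _ in range(count):
        vv = d2 & 0xFF00                       # move.w d2,d4
        if reordered:
            d0, x = addx_l(d0, d1, x)          # addx.l d1,d0 comes first
            uu = d0 & 0xFF                     # move.b d0,d4
        else:
            uu = d0 & 0xFF                     # move.b d0,d4
            d0, x = addx_l(d0, d1, x)          # addx.l d1,d0
        d2, x = addx_l(d2, d3, x)              # addx.l d3,d2
        result.append(vv | uu)
    return result


def offsets_reference(u, v, du, dv, count):
    """Reference: step U and V as plain 24-bit values."""
    result = []
    for _ in range(count):
        result.append(((v >> 8) & 0xFF00) | (u >> 16))
        u = (u + du) & 0xFFFFFF
        v = (v + dv) & 0xFFFFFF
    return result
```

With u = 0xF08000, du = 0x02C000, v = 0x101234 and dv = 0x012345, the plain ordering starts with the offsets 0x10F0, 0x11F3 and matches the reference for a 64-pixel span, while the reordered variant yields the same V bytes but the U byte of the following pixel.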
