|Re: New 12c loop count speed test|
Message #40 Posted by Andrés C. Rodríguez (Argentina) on 30 Apr 2009, 6:21 p.m.,
in response to message #39 by cyrille de Brébisson
Cyrille: with some guessing (I apologize), I tried to put the code
you shared in a format compatible with the MoHPC Forum, which at
times is rather unfriendly with respect to long posts formatting.
Please correct any mistake.
But my code works for 64 bits :-) thanks to assembly coding and
the power of ARM assembly, adding the last nibble only required
one extra instruction...
// Used to do an addition on two DCB numbers with a result in DCB
// for example, dcbAdjust(a, b), when a and b are dcb representation
// of numbers, will return the dcb representation of the number r=a+b...
// Note that the addition is done prior to the call...
// int64 dcbAddAdjust(int64 r, int64 b); // r=a+b
// a += 0x0666666666666666 ULL;
// preadjust as if carries occur
// u64 s = a + b;
// compute the sum
// b = a ^ b ^ s;
// find the carries
// b = ~b & 0x1111111111111111 ULL;
// compute mask for non-carries
// return s - ((b >> 4)*6);
// subtract out 6 * non-carries
ldr r12, cte66666666 // load 66666666 in lr to perform the a+666666666666666
/* 10hex = 16decimal, so the difference between the BCD
representation of 10 and the hex value of 10 is 6.
Adding 6 to each digit of one of the 2 numbers correspond to
assuming that the addition will generate a carry for each digit
and preemptively adds the 6 to each digit.
This need to be done so that we can easily detect which digit
addition really had a carry. Then the algorithm will remove the
extra "6" from these digits.
The carry for each digit does appear as an extra '1' added to the
last bit of each digit. This means that the parity of each digit
is equal to parity of digit in an exclusive-or parity of digit in
b exclusive-or carry from previous digit.
This means that calculating a xor b xor (a+b) and only looking at
the first bit of each digit will tell us for each digit if there
is a carry or not. Note that the fact that we added 6 to each
digit of a does not affect the calculation of parity for a as 6 is
an even number */
adds r0, r0, r12 // a+=6666666666666666
adc r1, r1, r12
ldr r12, cte88888888 // preload 888888888888; preloading it earlier than
// needed will reduce 1 wait state later on...
/* Step 2 of the algorithm. We now have pre adjusted a, we
calculate the sum of a+b. however, here we pay attention to
keeping the carry out of that calculation in the CPU carry flag
(this is why we use the adcs instruction as in add r1 and r3 and
the current carry, AND keep the carry out in the flag register),
so that we can later remove the 6 from the last digit of the
result if that carry is clear or, in our case, clear the bit in
the bitfield representing the digits where we need to remove 6 for
digit 16 if the carry is set.*/
adds r4, r0, r2 // s= a+b
adcs r5, r1, r3 // keep carry!!!
eor r2, r0, r2 // b=a^b.
/* This calculates the combined parity of each digits of a and b. */
eor r3, r1, r3 // Note that only the first bit of each digit has any interest.
// The others will be removed later.
eor r2, r2, r4 // b=a^b^s. 'Removing' the combined parity of each
eor r3, r3, r5 // digit of a and b from the parity of the sum of a and b
// gives the carry bit for each digit.
/* The next 3 instructions perform 3 64-bit operations at once (a
normal non-ARM CPU would need 6 instructions to do so!), so please follow!
We now have a set of 64 bits, 16 of them are of interest to us
(the least significant bit of each digit) as it indicates the carry. So, we need to:
1. Clear all the other bits (the b&1111111111111111 in the C code)
2. If the carry bit for digit 'n' is NOT set (i.e., no carry),
it means that we need to remove 6 from digit n-1. So we
need to invert each carry bit
3. We need to shift our bitfield to get the bit indicating the carry
caused by digit n located in the same region of the
registers as digit n (for the moment, the bit for digit n
is the least significant bit of digit n+1) */
/* Thanks to the ARM CPU instruction set, we can do this in only 3
instructions. - as far as 1 and 2 are concerned we can use the bic
instruction (Bit Clear) which performs a "a and ~b" operation so
we can combine them - a shift operation normally takes 3 operations,
but if we decide to shift the bitfield by only 1 bit (which
would place the carry for digit n in the bit 3 of digit n), then
we can use the shift ability of the ARM to perform in 1 operation
the 88888888 and (~b shift 1) for the lower 32 bits of b, then use
1 instruction (the sub) to handle the carry for digit 7 (which is
now held as the least significant bit of the register holding the
upper 32 bits of b) and finish the work by handling the upper 32
bits of b. because the shift is done 'on the fly'; we now need to
and b - not by 11111..., but by 1111... shifted by 1 or 8888......
Note that if the ARM had 64 bit registers, we could do the whole
thing in only 1 instruction */
bic r2, r12, r2
lsr #1 // (~b>>1) & 0x888888888888888
sub r2, r2, r3
bic r3, r12, r3
/* Handles carry that we lost from the register due to 64 bit
limitation during the addition of a+b but that was kept in the
flag register of the ARM CPU and never modified since, if the
carry is SET (i.e., there is a carry on the last digit), then we
do NOT need to remove 6 from the last digit, and the bit
corresponding to the need to remove that 6 is cleared from the
subcs r3, r3, #0x80000000
/* The last step of the algorithm is to remove 6 (or remove 2 and
remove 4; as 2+4=6) from each digit where there were no carry. For
each digit, we have a bit (bit 3 to be precise) in variable b that
is set if 6 needs to be removed from this digit. So, we need to
remove b>>1 (correspond to carry bit*4) and b>>2 (correspond to
carry bit *2) from the sum to get our result. Note that because we
know that only 1 bit in each group of 4 bit is potentially set,
there is no need to handle bit movement from higher 32 bits to
lower 32 bits of b. If the ARM had 64 bit registers, this would be
a non issue.*/
subs r4, r4, r2
lsr #2 // s-(b>>2)*3 = s-b<<2-b<<3
sbc r5, r5, r3
subs r0, r4, r2
sbc r1, r5, r3
END of Function
cte66666666: DC32 0x66666666
cte88888888: DC32 0x88888888
cte11111111: DC32 0x11111111