If you want to add up many floating-point numbers, these rounding errors can accumulate and render the result unusable.
Luckily, help is on the way, and it is called "Kahan summation".
The basic idea is to keep an extra error variable that remembers the rounding error made in each step, so that it can be subtracted again before it accumulates. Useful advice and other methods are described here.
The common C++ code for this operation looks something like this:

double sum(const double* a, int length)
{
    if (length == 0)
        return 0.0;
    double result = a[0];
    double error = 0.0;
    for (int i = 1; i < length; i++)
    {
        double tmpA = a[i] - error;        // feed the previous rounding error back in
        double tmpSum = result + tmpA;     // this addition rounds...
        error = (tmpSum - result) - tmpA;  // ...and this recovers exactly what was rounded away
        result = tmpSum;
    }
    return result;
}
Compared to normal summation with a simple for loop, this implementation is a few times slower. Even worse, aggressive compiler optimizations (fast-math style reassociation) may destroy the whole function, because "error = (tmpSum - result) - tmpA" can be algebraically reduced to "error = 0". So we need a way to stop the compiler from performing certain optimizations while still allowing others.
That is hard.
So I decided to write it in assembly. With SSE2 I can process two double values at a time, a pretty good optimization that even good compilers may not find on their own. And because my solution uses inline assembly, the compiler will not optimize it any further, so the functionality remains intact.
Here it comes:
double sum(const double* a, int length)
{
    if (length == 0)
        return 0.0;
    else if (length == 1)
        return a[0];
    double result1 = 0.0;
    double result2 = 0.0;
    double error1 = 0.0;
    double error2 = 0.0;
    int i = length - 2;
    _asm {
        xorps xmm6, xmm6               ; initialize error register (xmm6) to zero
        mov eax, DWORD PTR a           ; initialize result register (xmm4) with a[0], a[1]
        movhpd xmm4, QWORD PTR [eax]
        movlpd xmm4, QWORD PTR [eax+8]
        mov ecx, DWORD PTR i           ; ecx = loop counter
    loopStart:
        sub ecx, 2                     ; ecx -= 2
        cmp ecx, 0                     ; if (ecx < 0) goto saveResults
        jl saveResults
        add eax, 16                    ; advance to the next two array elements
        movhpd xmm0, QWORD PTR [eax]   ; load the next pair into xmm0
        movlpd xmm0, QWORD PTR [eax+8]
        subpd xmm0, xmm6               ; tmpA = a[i] - error
        movapd xmm1, xmm4              ; tmpSum = result + tmpA
        addpd xmm1, xmm0
        movapd xmm6, xmm1              ; error = (tmpSum - result) - tmpA
        subpd xmm6, xmm4
        subpd xmm6, xmm0
        movapd xmm4, xmm1              ; result = tmpSum
        jmp loopStart
    saveResults:
        movhpd result1, xmm4           ; save the two partial sums
        movlpd result2, xmm4
        movhpd error1, xmm6            ; save the two accumulated errors
        movlpd error2, xmm6
        mov DWORD PTR i, ecx
    }
    if (i == -1)                       // odd length: one element left over
    {
        double tmpA = a[length-1] - error1;
        double tmpSum = result1 + tmpA;
        error1 = (tmpSum - result1) - tmpA;
        result1 = tmpSum;
    }
    return (result1 + result2) - (error1 + error2);
}
The performance on my Intel(R) Core(TM)2 Duo T8300 was as follows (runtime relative to a plain for loop over doubles):

Standard sum          | 1.0x |
Standard sum (floats) | 1.6x |
Kahan                 | 3.2x |
Kahan SSE2            | 0.9x |
As can be seen from the numbers, on my CPU the optimized code is more than three times faster than the plain C++ Kahan version and even slightly faster than a normal unoptimized for loop. I am happy to discuss any requests and complaints. Simple summing with SSE would have been possible as well, but it has the accuracy problems mentioned earlier.
4 comments:
Out of interest, what sort of error would you be looking at when, say, adding 128 floats ranging from -2f to +2f using naive summing?
Generally that depends on the distribution of the numbers you use. In the ideal case the errors cancel, but that is not the worst case.
A float has just under 7 decimal digits of precision, so in float arithmetic (2 + 1.19E-7) = 2. This can introduce an error of up to 127 * 1.19E-7 = 1.5E-5, which may seem low (0.0005%). But in IIR filters, for example, these results feed into the following calculations and the errors can propagate.
This is an amazing post. Will you post more computer science here? I am extremely interested in these kinds of optimizations, thank you.
Sure, why not? It won't always be about optimization but yes - Computer Science is what I do :-)