Montag, 1. November 2010

Fast Kahan summation implementation in Assembler

Summing floating-point numbers is trickier than most people think. For example ((x+y)+z) does not necessarily equal (x+(y+z)) with FP numbers. Also, it is easy to loose precision if two numbers of different magnitude are summed.

If you want to add several floating-point numbers, these rounding errors may sum up and make the result unusable.

Luckily, there is help on the way and it is called "Kahan Summation".
The basic idea is to introduce an extra error variable that remembers the errors made, so that at a later point, when the error has accumulated, it can be subtracted. Useful advice and other methods are described in here.

The common C++-code for this operation looks something like this:

double sum(const double* a, int length)
if(length == 0)
return 0.0;

double result = a[0];
double error = 0.0;
for(int i=1;i<length;i++)
double tmpA = a[i] - error;
double tmpSum = result + tmpA;
error = (tmpSum - result) - tmpA;
result = tmpSum;

return result;

Compared to normal summing with a simple for loop, this implementation is a few times slower. Even worse, turning on compiler optimizations may destroy the whole function because "error = (tmpSum - result) - tmpA" could be reduced to "error = 0". So there must be a way to stop the compiler from doing certain optimizations but not doing others.

That is hard.

So I decided to write it in Assembler. With SSE2 I could process two double values at a time which is a pretty good optimization which even good compilers may not find. Because my solution uses inline assembler, it will not be optimized any further and the functionality remains intact.

Here it comes:

double sum(const double* a, int length)
if(length == 0)
return 0.0;
else if(length == 1)
return a[0];

double result1 = 0.0;
double result2 = 0.0;
double error1 = 0.0;
double error2 = 0.0;

int i = length - 2;

_asm {
xorps xmm6, xmm6 ; Initialize error register (xmm6)

mov eax, DWORD PTR a ; Initialize result register (xmm4)
movhpd xmm4, QWORD PTR [eax]
movlpd xmm4, QWORD PTR [eax+8]

mov ecx, dword ptr i ; ecx = i save loop counter

sub ecx, 2 ; ecx -= 2
cmp ecx,0 ; if(ecx <= 0) goto saveResults
jl saveResults

add eax, 16 ; go to next two array elements
movhpd xmm0, QWORD PTR [eax]
movlpd xmm0, QWORD PTR [eax+8]; Save a[2*i], a[2*i+1] in xmm0

subpd xmm0, xmm6 ; a = a[i] - error

movapd xmm1, xmm4 ; ts = result + a
addpd xmm1, xmm0

movapd xmm6, xmm1 ; error = (ts - result) - a
subpd xmm6, xmm4
subpd xmm6, xmm0

movapd xmm4, xmm1 ; result = ts

jmp loopStart

movhpd result1, xmm4 ; Save result into result variables
movlpd result2, xmm4
mov dword ptr i, ecx

if(i == -1)
double tmpA = a[length-1] - error1;
double tmpSum = result1 + tmpA;
error1 = (tmpSum - result1) - tmpA;
result1 = tmpSum;

return (result1+result2)-(error1+error2);

The performance on my Intel(R) Core(TM)2 Duo T8300 was as follows:

Standard sum1x
Standard sum (floats)1.6x
Kahan SSE20.9x

As can be seen from the numbers, on my cpu the optimized code is three times faster and even faster than a normal unoptimized for loop. I am happy do discuss any requests and complaints. Simple summing with SSE would have been possible as well but that has the accuracy problems mentioned earlier.


Anonym hat gesagt…

Out of interest, what sort of error would you be looking at with, say, adding 128 floats ranging from -2f to +2f using naive summing?

BaumBlogger hat gesagt…

Generally that depends on the distribution of the numbers that you use. In the ideal case, the error might cancel but this is not the worst case.

A float has just below 7 digits precision. So in float it is (2 + 1.19E-7) = 2. This may introduce an error of 127 * 1.19E-7 = 1.5E-5 which may seem low (0.0005%). In IIR filters for example, these results are taken for the following calculations and the errors may propagate.

Maethalion hat gesagt…

This is an amazing post. Will you post more computer science here? I am extremely interested in these kinds of optimizations, thank you.

Anonym hat gesagt…


BaumBlogger hat gesagt…

Sure, why not? It won't always be about optimization but yes - Computer Science is what I do :-)