Baum Dev Blog: Fast Kahan summation implementation in Assembler

Summing floating-point numbers is trickier than most people think. For example ((x+y)+z) does not necessarily equal (x+(y+z)) with FP numbers. Also, it is easy to loose precision if two numbers of different magnitude are summed.

If you want to add several floating-point numbers, these rounding errors may sum up and make the result unusable.

Luckily, there is help on the way and it is called "Kahan Summation".
The basic idea is to introduce an extra error variable that remembers the errors made, so that at a later point, when the error has accumulated, it can be subtracted. Useful advice and other methods are described in here.

The common C++-code for this operation looks something like this:


double sum(const double* a, int length)
{
    if(length == 0)
        return 0.0;
    
    double result = a[0];
    double error = 0.0;
    for(int i=1;i<length;i++)
    {
        double tmpA = a[i] - error;
        double tmpSum = result + tmpA;
        error = (tmpSum - result) - tmpA;
        result = tmpSum;
    }
    
    return result;
}

Compared to normal summing with a simple for loop, this implementation is a few times slower. Even worse, turning on compiler optimizations may destroy the whole function because "error = (tmpSum - result) - tmpA" could be reduced to "error = 0". So there must be a way to stop the compiler from doing certain optimizations but not doing others.

That is hard.

So I decided to write it in Assembler. With SSE2 I could process two double values at a time which is a pretty good optimization which even good compilers may not find. Because my solution uses inline assembler, it will not be optimized any further and the functionality remains intact.

Here it comes:


double sum(const double* a, int length)
{
    if(length == 0)
        return 0.0;
    else if(length == 1)
        return a[0];
    
    double result1 = 0.0;
    double result2 = 0.0;
    double error1 = 0.0;
    double error2 = 0.0;

    int i = length - 2;

    _asm {
        xorps     xmm6, xmm6              ; Initialize error register (xmm6)

        mov     eax, DWORD PTR a            ; Initialize result register (xmm4)
        movhpd     xmm4, QWORD PTR [eax]
        movlpd     xmm4, QWORD PTR [eax+8]

        mov      ecx, dword ptr i        ; ecx = i save loop counter

loopStart:
        sub        ecx, 2                    ; ecx -= 2
        cmp     ecx,0                ; if(ecx <= 0) goto saveResults
        jl     saveResults

        add        eax, 16                    ; go to next two array elements
        movhpd     xmm0, QWORD PTR [eax]
        movlpd     xmm0, QWORD PTR [eax+8]; Save a[2*i], a[2*i+1] in xmm0

        subpd    xmm0, xmm6                ; a = a[i] - error

        movapd    xmm1, xmm4                ; ts = result + a
        addpd    xmm1, xmm0                

        movapd    xmm6, xmm1                ; error = (ts - result) - a
        subpd    xmm6, xmm4
        subpd    xmm6, xmm0                

        movapd    xmm4, xmm1                ; result = ts
        
        jmp loopStart

saveResults:
        movhpd     result1, xmm4 ; Save result into result variables
        movlpd     result2, xmm4
        mov      dword ptr i, ecx
    }

    if(i == -1)
    {
        double tmpA = a[length-1] - error1;
        double tmpSum = result1 + tmpA;
        error1 = (tmpSum - result1) - tmpA;
        result1 = tmpSum;
    }
    
    return (result1+result2)-(error1+error2);
}

The performance on my Intel(R) Core(TM)2 Duo T8300 was as follows:

Standard sum	1x
Standard sum (floats)	1.6x
Kahan	3.2x
Kahan SSE2	0.9x

As can be seen from the numbers, on my cpu the optimized code is three times faster and even faster than a normal unoptimized for loop. I am happy do discuss any requests and complaints. Simple summing with SSE would have been possible as well but that has the accuracy problems mentioned earlier.

4 Kommentare:

Anonym hat gesagt…: Out of interest, what sort of error would you be looking at with, say, adding 128 floats ranging from -2f to +2f using naive summing?; 2. November 2010 um 00:07
BaumBlogger hat gesagt…: Generally that depends on the distribution of the numbers that you use. In the ideal case, the error might cancel but this is not the worst case.

A float has just below 7 digits precision. So in float it is (2 + 1.19E-7) = 2. This may introduce an error of 127 * 1.19E-7 = 1.5E-5 which may seem low (0.0005%). In IIR filters for example, these results are taken for the following calculations and the errors may propagate.; 4. November 2010 um 15:19
Maethalion hat gesagt…: This is an amazing post. Will you post more computer science here? I am extremely interested in these kinds of optimizations, thank you.; 12. November 2010 um 12:02
BaumBlogger hat gesagt…: Sure, why not? It won't always be about optimization but yes - Computer Science is what I do :-); 12. November 2010 um 14:18

Kommentar veröffentlichen

Baum Dev Blog

Montag, 1. November 2010

Fast Kahan summation implementation in Assembler

4 Kommentare:

Blog-Archiv