Baum Dev Blog: High precision floating-point arithmetic in MS Visual C++

In my article about Kahan floating-point summation I showed how high precision summing can be realized with high speed. There is another efficient way to do high precision floating-point arithmetic but it isn't supported by the common SIMD (Single Instruction Multiple Data) instruction sets. Nevertheless, it's accuracy is significantly higher!

This posting is structured as follows: First I talk about "extended precision" in general, giving some historical information. In the second part I show how and when extended precision can be used in VC++ without assembler. In the third part I show some assembler code and do the accuracy and speed evaluation.

It all starts with the old x87 floating point (co-)processor. It's architecture contained the nowadays somewhat exotic datatype "extended precision" with 80bit size (64bit mantissa and 16bit exponent) and I guess it was created to have temporary variables with more-than-double precision in order to counteract too much rounding errors in double variables.

It is this datatape that is commonly associated with "long double" variables but in MSVC++, long double is the same as double. Extended precision is not supported by the current implementation. There is a compiler switch to enable high precision math (-fp:precise) but this will not enable 80 bit math as one would like, sometimes. However, the Microsoft compiler can calculate in 80bit but not store results in 80 bits!

Let's take a simple for loop as example:

double sum = 0.0;
for(int i=0;i<1000;i++)
    sum += 0.1;

Ideally, sum would be held in 80 bits until the for-loop terminates and only then the results were written back into the double variable.

Unfortunately, with the default MSVC++ compiler, each assignment results in a down-cast to double. This is to ensure portability because doing these kinds of optimization can easily change the behavior of some applications.

So when does MSVC++ use 80bits precision? The answer in this case is simple. If the code was

double sum = 0.1 + 0.1 /* + .... (1000 times!)*/;

the calculations would have been done in 80 bits, if the 80bit-precision control bit of the processor was set. The number 0.1 as a literal however, would have been represented in 64 bits and then converted before processing. This is because there is no 80bit datatype at all neither for constants nor for variables. Generally speaking, all intermediate expressions are evaluated at register precision which means that all results without assignments and casts will be done in the fpu of the cpu and whatever precision is set on that will be used. However, it is guaranteed that this precision is at least double when you work with doubles.

So when exactly is this useful?
Take for example the implementation of a musical IIR filter (acts like an equalizer on a stereo):

void iir(double* signal, int samples) {
    double x0, x1, x2; // memorizes recent signal values
    double y0, y1, y2; // memorizes recent output values
    for(int i=0;i&samples;i++){
        y2 = y1; y1 = y0; // shifts memory forward
        x2 = x1; x1 = x0; // ..
        x0 = samples[i];
        y0 = x0 + 2 * x1 + x2          // <-- filter operation
             - (1.5 * y1 - 1.45 * y2); // <-- 
        samples[i] = y0; // store result in signal array
    }
}

This code is accuracy critical and contains a cumbersome statement with floating-point calculations. Some people might prefer to split this to two statements for cosmetic reasons, but this will remove accuracy because results are rounded to double three times instead of one!

y0 = x0 + 2 * x1 + x2;
y0 -= 1.5 * y1 - 1.45 * y2;
// this is the same as
y0 = (double)(x0 + 2 * x1 + x2) 
   - (double)(1.5 * y1 - 1.45 * y2);
// and  NOT
y0 = (x0 + 2 * x1 + x2)
   - (1.5 * y1 - 1.45 * y2);

But wait again! Does this do the trick already? The answer is no, it doesn't. At least not if you want to be sure that it works. To guarantee accurate results, the processor precision bit has to be set and the compiler has to be advised not to optimize so much that the code doesn't work anymore.

First, the compiler pragma "#pragma fenv_access( [ on | off ] )" has to tell the compiler that we will change processor flags and he should not optimize this away. Secondly, we need to use "_controlfp_s" to enable the high-precision bit of the processor:

#include <float.h>
#pragma fenv_access( on )
void iir(double* signal, int samples) {

    // enable 64 mantissa and store old settings
    unsigned int control_word;
    _controlfp_s(&control_word, _PC_64, MCW_PC);

    // do calculations ...

    // restores old processor settings
    _controlfp_s(&control_word, _CW_DEFAULT, MCW_PC);
}

This is a nice way to avoid assembler in your code, but sometimes it just doesn't work. For example when you want to sum the elements of an array and you can not put it all in one statement. In this case it would be nice to use intrinsics but the necessary floating point operations are not (yet) available as intrinsics.

So here is the assembler routine to sum array elements with extended precision:

#include <float.h>
#pragma fenv_access( on  )
double sumEP(const double* a, int length)
{
 if(length == 0)return 0.0;else if(length == 1)return a[0];
 
 double result;
 int i = length - 1;

 unsigned int control_word;
 _controlfp_s(&control_word, _PC_64, MCW_PC);

 _asm {
  lea  eax, DWORD PTR a      ; load adress of a
  mov  eax, [eax]
  fld  QWORD ptr [eax]       ; load first value of a

  mov  ecx, dword ptr i      ; load loop counter i

loopStart:
  sub  ecx, 1                ; i--
  cmp  ecx, 0                ; if(i < 0) goto saveResults
  jl   saveResults

  add  eax, 8                ; go to next array element
  fld  QWORD ptr [eax]       ; push a[i]
  fadd                       ; add top two stack numbers
  
  jmp loopStart

saveResults:
  lea  eax, DWORD PTR result ; load address if result
  fstp QWORD PTR [eax]       ; store result
 }

 _controlfp_s(&control_word,_CW_DEFAULT, 0xfffff);

 return result;
}

Calculating 10000000 times + 0.1 gave the following time and precision values:

	error	time
standard	-1.6E-4	1
SSE	-8.9E-5	0.97
extended	8.7E-8	0.97
Kahan/SSE	0	1.66

Calculating the sum over the first 10000000 elements in a geometric progression with q=0.9999:

	error	time
standard	1.4E-9	1
SSE	-1.7E-9	0.48
extended	1.5E-9	0.73
Kahan/SSE	1.5E-9	4.8

The tables show two things. In the first example, the precision depends highly on the calculation precision. The performance however is nearly the same for all approaches.
The second example shows little to no difference in precision but in speed. I think the reason is, that in the second example the double precision summands are limiting the quality, but since the cpu can't optimize a lot, the runtime in this table is more relevant. In the first example the worst case accuracy is captured better but the runtime is not very helpful since all numbers equal 0.1.

Notice that in contrast to my earlier posting, kahan is slower than standard summation. This is due to better compiler optimization in this post. Notice, that SSE Kahan is still faster than non-SSE kahan.

The nice thing about VC++ is that if you only want higher precision for intermediate results, you can easily remove casts and set the processor accuracy bit.

Hope you enjoyed this posting. Let me say that it's a pity that this is not usable with SSE1-4 so far. You'd have to compromise speed against precision if you use it, which is perfectly fine if double is accurate enough.

Regards!

4 Kommentare:

Anonym hat gesagt…: I’d like to visit your blog more usually but lately it appears to be taking without end to
come back up. I visit from work, and our connection there may be pretty good.

Do you assume the problem might be in your finish?

my blog ... having trouble getting pregnant; 20. Mai 2013 um 04:06
Anonym hat gesagt…: Nice post. I learn something totally new and challenging on sites I stumbleupon everyday.
It will always be helpful to read articles from other writers
and use a little something from other sites.

Also visit my weblog - anti cellulite treatment; 8. Juni 2013 um 08:23
Anonym hat gesagt…: I do not even know how I ended up here, but I thought this post was good.
I don't know who you are but certainly you are going to a famous blogger if you aren't already
;) Cheers!

Here is my homepage :: bet angel; 15. Juni 2013 um 13:22
Anonym hat gesagt…: smokeless cigarettes, electronic cigarette, e cigarette reviews, electronic cigarette brands, e cigarette health, electronic cigarette starter kit; 28. November 2014 um 03:00

Kommentar veröffentlichen

Baum Dev Blog

Montag, 15. November 2010

High precision floating-point arithmetic in MS Visual C++

4 Kommentare:

Blog-Archiv