This posting is structured as follows: First I talk about "extended precision" in general, giving some historical information. In the second part I show how and when extended precision can be used in VC++ without assembler. In the third part I show some assembler code and do the accuracy and speed evaluation.

It all starts with the old x87 floating point (co-)processor. It's architecture contained the nowadays somewhat exotic datatype "extended precision" with 80bit size (64bit mantissa and 16bit exponent) and I guess it was created to have temporary variables with more-than-double precision in order to counteract too much rounding errors in double variables.

It is this datatape that is commonly associated with "long double" variables but in MSVC++, long double is the same as double. Extended precision is not supported by the current implementation. There is a compiler switch to enable high precision math (-fp:precise) but this will not enable 80 bit math as one would like, sometimes. However, the Microsoft compiler can

*calculate*in 80bit but

*not store*results in 80 bits!

Let's take a simple for loop as example:

double sum = 0.0; for(int i=0;i<1000;i++) sum += 0.1;

Ideally, sum would be held in 80 bits until the for-loop terminates and only then the results were written back into the double variable.

Unfortunately, with the default MSVC++ compiler, each assignment results in a down-cast to double. This is to ensure portability because doing these kinds of optimization can easily change the behavior of some applications.

So when does MSVC++ use 80bits precision? The answer in this case is simple. If the code was

double sum = 0.1 + 0.1 /* + .... (1000 times!)*/;

the calculations would have been done in 80 bits,

**if**the

*80bit-precision control bit of the processor*was set. The number 0.1 as a literal however, would have been represented in 64 bits and then converted before processing. This is because there is no 80bit datatype at all neither for constants nor for variables. Generally speaking, all

*intermediate expressions*are evaluated at

*register precision*which means that all results without assignments and casts will be done in the fpu of the cpu and whatever precision is set on that will be used. However, it is guaranteed that this precision is

*at least*double when you work with doubles.

So when exactly is this useful?

Take for example the implementation of a musical IIR filter (acts like an equalizer on a stereo):

void iir(double* signal, int samples) { double x0, x1, x2; // memorizes recent signal values double y0, y1, y2; // memorizes recent output values for(int i=0;i&samples;i++){ y2 = y1; y1 = y0; // shifts memory forward x2 = x1; x1 = x0; // .. x0 = samples[i]; y0 = x0 + 2 * x1 + x2 // <-- filter operation - (1.5 * y1 - 1.45 * y2); // <-- samples[i] = y0; // store result in signal array } }This code is accuracy critical and contains a cumbersome statement with floating-point calculations. Some people might prefer to split this to two statements for cosmetic reasons, but this will remove accuracy because results are rounded to double three times instead of one!

y0 = x0 + 2 * x1 + x2; y0 -= 1.5 * y1 - 1.45 * y2; // this is the same as y0 = (double)(x0 + 2 * x1 + x2) - (double)(1.5 * y1 - 1.45 * y2); // and NOT y0 = (x0 + 2 * x1 + x2) - (1.5 * y1 - 1.45 * y2);

But wait again! Does this do the trick already? The answer is no, it doesn't. At least not if you want to be sure that it works. To guarantee accurate results, the processor precision bit has to be set and the compiler has to be advised not to optimize so much that the code doesn't work anymore.

First, the compiler pragma "#pragma fenv_access( [ on | off ] )" has to tell the compiler that we will change processor flags and he should not optimize this away. Secondly, we need to use "_controlfp_s" to enable the high-precision bit of the processor:

#include <float.h> #pragma fenv_access( on ) void iir(double* signal, int samples) { // enable 64 mantissa and store old settings unsigned int control_word; _controlfp_s(&control_word, _PC_64, MCW_PC); // do calculations ... // restores old processor settings _controlfp_s(&control_word, _CW_DEFAULT, MCW_PC); }

This is a nice way to avoid assembler in your code, but sometimes it just doesn't work. For example when you want to sum the elements of an array and you can not put it all in one statement. In this case it would be nice to use intrinsics but the necessary floating point operations are not (yet) available as intrinsics.

So here is the assembler routine to sum array elements with extended precision:

#include <float.h> #pragma fenv_access( on ) double sumEP(const double* a, int length) { if(length == 0)return 0.0;else if(length == 1)return a[0]; double result; int i = length - 1; unsigned int control_word; _controlfp_s(&control_word, _PC_64, MCW_PC); _asm { lea eax, DWORD PTR a ; load adress of a mov eax, [eax] fld QWORD ptr [eax] ; load first value of a mov ecx, dword ptr i ; load loop counter i loopStart: sub ecx, 1 ; i-- cmp ecx, 0 ; if(i < 0) goto saveResults jl saveResults add eax, 8 ; go to next array element fld QWORD ptr [eax] ; push a[i] fadd ; add top two stack numbers jmp loopStart saveResults: lea eax, DWORD PTR result ; load address if result fstp QWORD PTR [eax] ; store result } _controlfp_s(&control_word,_CW_DEFAULT, 0xfffff); return result; }Calculating 10000000 times + 0.1 gave the following time and precision values:

error | time | |

standard | -1.6E-4 | 1 |

SSE | -8.9E-5 | 0.97 |

extended | 8.7E-8 | 0.97 |

Kahan/SSE | 0 | 1.66 |

Calculating the sum over the first 10000000 elements in a geometric progression with q=0.9999:

error | time | |

standard | 1.4E-9 | 1 |

SSE | -1.7E-9 | 0.48 |

extended | 1.5E-9 | 0.73 |

Kahan/SSE | 1.5E-9 | 4.8 |

The tables show two things. In the first example, the precision depends highly on the calculation precision. The performance however is nearly the same for all approaches.

The second example shows little to no difference in precision but in speed. I think the reason is, that in the second example the double precision summands are limiting the quality, but since the cpu can't optimize a lot, the runtime in this table is more relevant. In the first example the worst case accuracy is captured better but the runtime is not very helpful since all numbers equal 0.1.

Notice that in contrast to my earlier posting, kahan is slower than standard summation. This is due to better compiler optimization in this post. Notice, that SSE Kahan is still faster than non-SSE kahan.

The nice thing about VC++ is that if you only want higher precision for intermediate results, you can easily remove casts and set the processor accuracy bit.

Hope you enjoyed this posting. Let me say that it's a pity that this is not usable with SSE1-4 so far. You'd have to compromise speed against precision if you use it, which is perfectly fine if double is accurate enough.

Regards!