Montag, 15. November 2010

High precision floating-point arithmetic in MS Visual C++

In my article about Kahan floating-point summation I showed how high precision summing can be realized with high speed. There is another efficient way to do high precision floating-point arithmetic but it isn't supported by the common SIMD (Single Instruction Multiple Data) instruction sets. Nevertheless, it's accuracy is significantly higher!

This posting is structured as follows: First I talk about "extended precision" in general, giving some historical information. In the second part I show how and when extended precision can be used in VC++ without assembler. In the third part I show some assembler code and do the accuracy and speed evaluation.

It all starts with the old x87 floating point (co-)processor. It's architecture contained the nowadays somewhat exotic datatype "extended precision" with 80bit size (64bit mantissa and 16bit exponent) and I guess it was created to have temporary variables with more-than-double precision in order to counteract too much rounding errors in double variables.

It is this datatape that is commonly associated with "long double" variables but in MSVC++, long double is the same as double. Extended precision is not supported by the current implementation. There is a compiler switch to enable high precision math (-fp:precise) but this will not enable 80 bit math as one would like, sometimes. However, the Microsoft compiler can calculate in 80bit but not store results in 80 bits!

Let's take a simple for loop as example:
double sum = 0.0;
for(int i=0;i<1000;i++)
    sum += 0.1;

Ideally, sum would be held in 80 bits until the for-loop terminates and only then the results were written back into the double variable.

Unfortunately, with the default MSVC++ compiler, each assignment results in a down-cast to double. This is to ensure portability because doing these kinds of optimization can easily change the behavior of some applications.

So when does MSVC++ use 80bits precision? The answer in this case is simple. If the code was
double sum = 0.1 + 0.1 /* + .... (1000 times!)*/;

the calculations would have been done in 80 bits, if the 80bit-precision control bit of the processor was set. The number 0.1 as a literal however, would have been represented in 64 bits and then converted before processing. This is because there is no 80bit datatype at all neither for constants nor for variables. Generally speaking, all intermediate expressions are evaluated at register precision which means that all results without assignments and casts will be done in the fpu of the cpu and whatever precision is set on that will be used. However, it is guaranteed that this precision is at least double when you work with doubles.

So when exactly is this useful?
Take for example the implementation of a musical IIR filter (acts like an equalizer on a stereo):

void iir(double* signal, int samples) {
    double x0, x1, x2; // memorizes recent signal values
    double y0, y1, y2; // memorizes recent output values
    for(int i=0;i&samples;i++){
        y2 = y1; y1 = y0; // shifts memory forward
        x2 = x1; x1 = x0; // ..
        x0 = samples[i];
        y0 = x0 + 2 * x1 + x2          // <-- filter operation
             - (1.5 * y1 - 1.45 * y2); // <-- 
        samples[i] = y0; // store result in signal array
This code is accuracy critical and contains a cumbersome statement with floating-point calculations. Some people might prefer to split this to two statements for cosmetic reasons, but this will remove accuracy because results are rounded to double three times instead of one!

y0 = x0 + 2 * x1 + x2;
y0 -= 1.5 * y1 - 1.45 * y2;
// this is the same as
y0 = (double)(x0 + 2 * x1 + x2) 
   - (double)(1.5 * y1 - 1.45 * y2);
// and  NOT
y0 = (x0 + 2 * x1 + x2)
   - (1.5 * y1 - 1.45 * y2);

But wait again! Does this do the trick already? The answer is no, it doesn't. At least not if you want to be sure that it works. To guarantee accurate results, the processor precision bit has to be set and the compiler has to be advised not to optimize so much that the code doesn't work anymore.

First, the compiler pragma "#pragma fenv_access( [ on | off ] )" has to tell the compiler that we will change processor flags and he should not optimize this away. Secondly, we need to use "_controlfp_s" to enable the high-precision bit of the processor:
#include <float.h>
#pragma fenv_access( on )
void iir(double* signal, int samples) {

    // enable 64 mantissa and store old settings
    unsigned int control_word;
    _controlfp_s(&control_word, _PC_64, MCW_PC);

    // do calculations ...

    // restores old processor settings
    _controlfp_s(&control_word, _CW_DEFAULT, MCW_PC);

This is a nice way to avoid assembler in your code, but sometimes it just doesn't work. For example when you want to sum the elements of an array and you can not put it all in one statement. In this case it would be nice to use intrinsics but the necessary floating point operations are not (yet) available as intrinsics.

So here is the assembler routine to sum array elements with extended precision:

#include <float.h>
#pragma fenv_access( on  )
double sumEP(const double* a, int length)
 if(length == 0)return 0.0;else if(length == 1)return a[0];
 double result;
 int i = length - 1;

 unsigned int control_word;
 _controlfp_s(&control_word, _PC_64, MCW_PC);

 _asm {
  lea  eax, DWORD PTR a      ; load adress of a
  mov  eax, [eax]
  fld  QWORD ptr [eax]       ; load first value of a

  mov  ecx, dword ptr i      ; load loop counter i

  sub  ecx, 1                ; i--
  cmp  ecx, 0                ; if(i < 0) goto saveResults
  jl   saveResults

  add  eax, 8                ; go to next array element
  fld  QWORD ptr [eax]       ; push a[i]
  fadd                       ; add top two stack numbers
  jmp loopStart

  lea  eax, DWORD PTR result ; load address if result
  fstp QWORD PTR [eax]       ; store result

 _controlfp_s(&control_word,_CW_DEFAULT, 0xfffff);

 return result;
Calculating 10000000 times + 0.1 gave the following time and precision values:

Calculating the sum over the first 10000000 elements in a geometric progression with q=0.9999:

The tables show two things. In the first example, the precision depends highly on the calculation precision. The performance however is nearly the same for all approaches.
The second example shows little to no difference in precision but in speed. I think the reason is, that in the second example the double precision summands are limiting the quality, but since the cpu can't optimize a lot, the runtime in this table is more relevant. In the first example the worst case accuracy is captured better but the runtime is not very helpful since all numbers equal 0.1.

Notice that in contrast to my earlier posting, kahan is slower than standard summation. This is due to better compiler optimization in this post. Notice, that SSE Kahan is still faster than non-SSE kahan.

The nice thing about VC++ is that if you only want higher precision for intermediate results, you can easily remove casts and set the processor accuracy bit.

Hope you enjoyed this posting. Let me say that it's a pity that this is not usable with SSE1-4 so far. You'd have to compromise speed against precision if you use it, which is perfectly fine if double is accurate enough.


Donnerstag, 4. November 2010

Butterworth Lowpass Filter Coefficients in C++

When I searched the web it wasn't easy to find a simple function to calculate Butterworth lowpass filter coefficients, so I looked up the theory and did things on my own.

If you just want coefficients for a single filter specification, you can use the code generated by this nice online code generator. It will produce C-code that you can use for a specified samplerate and cutoff-frequency. It will also do Chebyshev and Bessel filters as well as lowpass, highpass, bandpass and bandstop filters for you.

However, if your program needs adjustable cutoff or samplerate, you can not use this script.

The first part of this post will take a glance at the required analytical steps and the second part will simply give you the code.

One of the standard techniques is to use the analytical frequency response of the filter and then use bilinear transformation to obtain IIR filter coefficients.

The required steps require some math for which I found this lecture very helpful:

1) Take analytical transfer function: H(s)
2) Do bilinear transform: H( (z-1)/(z+1) )
3) "Warp" cutoff frequency to find cutoff in the "bilinear domain"
4) Express function as
5) Use these coefficients for time-domain filtering with the linear difference equation:

Notice that in order to simplify the derivation of the filter, you can safely set the sample time T to 2.

Doing this for a 2 pole Butterworth filter gives the following code in C++:

void getLPCoefficientsButterworth2Pole(const int samplerate, const double cutoff, double* const ax, double* const by)
double PI = 3.1415926535897932385;
double sqrt2 = 1.4142135623730950488;

double QcRaw = (2 * PI * cutoff) / samplerate; // Find cutoff frequency in [0..PI]
double QcWarp = tan(QcRaw); // Warp cutoff frequency

double gain = 1 / (1+sqrt2/QcWarp + 2/(QcWarp*QcWarp));
by[2] = (1 - sqrt2/QcWarp + 2/(QcWarp*QcWarp)) * gain;
by[1] = (2 - 2 * 2/(QcWarp*QcWarp)) * gain;
by[0] = 1;
ax[0] = 1 * gain;
ax[1] = 2 * gain;
ax[2] = 1 * gain;

which then can be used in a filter like this:

double xv[3];
double yv[3];

void filter(double* samples, int count)
double ax[3];
double by[3];

getLPCoefficientsButterworth2Pole(44100, 5000, ax, by);

for (int i=0;i<count;i++)
xv[2] = xv[1]; xv[1] = xv[0];
xv[0] = samples[i];
yv[2] = yv[1]; yv[1] = yv[0];

yv[0] = (ax[0] * xv[0] + ax[1] * xv[1] + ax[2] * xv[2]
- by[1] * yv[0]
- by[2] * yv[1]);

samples[i] = yv[0];

And that is it. To reduce summation error you may want to use Kahan summation!

Montag, 1. November 2010

Fast Kahan summation implementation in Assembler

Summing floating-point numbers is trickier than most people think. For example ((x+y)+z) does not necessarily equal (x+(y+z)) with FP numbers. Also, it is easy to loose precision if two numbers of different magnitude are summed.

If you want to add several floating-point numbers, these rounding errors may sum up and make the result unusable.

Luckily, there is help on the way and it is called "Kahan Summation".
The basic idea is to introduce an extra error variable that remembers the errors made, so that at a later point, when the error has accumulated, it can be subtracted. Useful advice and other methods are described in here.

The common C++-code for this operation looks something like this:

double sum(const double* a, int length)
if(length == 0)
return 0.0;

double result = a[0];
double error = 0.0;
for(int i=1;i<length;i++)
double tmpA = a[i] - error;
double tmpSum = result + tmpA;
error = (tmpSum - result) - tmpA;
result = tmpSum;

return result;

Compared to normal summing with a simple for loop, this implementation is a few times slower. Even worse, turning on compiler optimizations may destroy the whole function because "error = (tmpSum - result) - tmpA" could be reduced to "error = 0". So there must be a way to stop the compiler from doing certain optimizations but not doing others.

That is hard.

So I decided to write it in Assembler. With SSE2 I could process two double values at a time which is a pretty good optimization which even good compilers may not find. Because my solution uses inline assembler, it will not be optimized any further and the functionality remains intact.

Here it comes:

double sum(const double* a, int length)
if(length == 0)
return 0.0;
else if(length == 1)
return a[0];

double result1 = 0.0;
double result2 = 0.0;
double error1 = 0.0;
double error2 = 0.0;

int i = length - 2;

_asm {
xorps xmm6, xmm6 ; Initialize error register (xmm6)

mov eax, DWORD PTR a ; Initialize result register (xmm4)
movhpd xmm4, QWORD PTR [eax]
movlpd xmm4, QWORD PTR [eax+8]

mov ecx, dword ptr i ; ecx = i save loop counter

sub ecx, 2 ; ecx -= 2
cmp ecx,0 ; if(ecx <= 0) goto saveResults
jl saveResults

add eax, 16 ; go to next two array elements
movhpd xmm0, QWORD PTR [eax]
movlpd xmm0, QWORD PTR [eax+8]; Save a[2*i], a[2*i+1] in xmm0

subpd xmm0, xmm6 ; a = a[i] - error

movapd xmm1, xmm4 ; ts = result + a
addpd xmm1, xmm0

movapd xmm6, xmm1 ; error = (ts - result) - a
subpd xmm6, xmm4
subpd xmm6, xmm0

movapd xmm4, xmm1 ; result = ts

jmp loopStart

movhpd result1, xmm4 ; Save result into result variables
movlpd result2, xmm4
mov dword ptr i, ecx

if(i == -1)
double tmpA = a[length-1] - error1;
double tmpSum = result1 + tmpA;
error1 = (tmpSum - result1) - tmpA;
result1 = tmpSum;

return (result1+result2)-(error1+error2);

The performance on my Intel(R) Core(TM)2 Duo T8300 was as follows:

Standard sum1x
Standard sum (floats)1.6x
Kahan SSE20.9x

As can be seen from the numbers, on my cpu the optimized code is three times faster and even faster than a normal unoptimized for loop. I am happy do discuss any requests and complaints. Simple summing with SSE would have been possible as well but that has the accuracy problems mentioned earlier.