tag:blogger.com,1999:blog-92185556047073361072016-08-29T03:35:15.068-07:00Baum Dev BlogComputer Science, (Web-)Development and DSPBaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.comBlogger5125tag:blogger.com,1999:blog-9218555604707336107.post-47918657368906946912010-11-15T09:03:00.000-08:002010-11-15T10:07:48.539-08:00High precision floating-point arithmetic in MS Visual C++In <a href="http://baumdevblog.blogspot.com/2010/11/fast-kahan-summation-implementation-in.html">my article about Kahan floating-point summation</a> I showed how high precision summing can be realized with high speed. There is another efficient way to do high precision floating-point arithmetic but it isn't supported by the common SIMD (Single Instruction Multiple Data) instruction sets. Nevertheless, it's accuracy is significantly higher!<br /><br />This posting is structured as follows: First I talk about "extended precision" in general, giving some historical information. In the second part I show how and when extended precision can be used in VC++ without assembler. In the third part I show some assembler code and do the accuracy and speed evaluation.<br /><br />It all starts with the old x87 floating point (co-)processor. It's architecture contained the nowadays somewhat exotic datatype "extended precision" with 80bit size (64bit mantissa and 16bit exponent) and I guess it was created to have temporary variables with more-than-double precision in order to counteract too much rounding errors in double variables.<br /><br />It is this datatape that is commonly associated with "long double" variables but in MSVC++, long double is the same as double. Extended precision is not supported by the current implementation. There is a compiler switch to enable high precision math (<a href="http://msdn.microsoft.com/en-us/library/aa289157%28VS.71%29.aspx#floapoint_topic2">-fp:precise</a>) but this will not enable 80 bit math as one would like, sometimes. However, the Microsoft compiler can <i>calculate</i> in 80bit but <i>not store</i> results in 80 bits!<br /><br />Let's take a simple for loop as example:<br /><pre>double sum = 0.0;<br />for(int i=0;i<1000;i++)<br /> sum += 0.1;</pre><br />Ideally, sum would be held in 80 bits until the for-loop terminates and only then the results were written back into the double variable.<br /><br />Unfortunately, with the default MSVC++ compiler, each assignment results in a down-cast to double. This is to ensure portability because doing these kinds of optimization can easily change the behavior of some applications.<br /><br />So when does MSVC++ use 80bits precision? The answer in this case is simple. If the code was<br /><pre>double sum = 0.1 + 0.1 /* + .... (1000 times!)*/;</pre><br />the calculations would have been done in 80 bits, <b>if</b> the <i>80bit-precision control bit of the processor</i> was set. The number 0.1 as a literal however, would have been represented in 64 bits and then converted before processing. This is because there is no 80bit datatype at all neither for constants nor for variables. Generally speaking, all <i>intermediate expressions</i> are evaluated at <i>register precision</i> which means that all results without assignments and casts will be done in the fpu of the cpu and whatever precision is set on that will be used. However, it is guaranteed that this precision is <i>at least</i> double when you work with doubles.<br /><br />So when exactly is this useful?<br />Take for example the implementation of a musical IIR filter (acts like an equalizer on a stereo):<br /><br /><pre>void iir(double* signal, int samples) {<br /> double x0, x1, x2; // memorizes recent signal values<br /> double y0, y1, y2; // memorizes recent output values<br /> for(int i=0;i&samples;i++){<br /> y2 = y1; y1 = y0; // shifts memory forward<br /> x2 = x1; x1 = x0; // ..<br /> x0 = samples[i];<br /> y0 = x0 + 2 * x1 + x2 // <-- filter operation<br /> - (1.5 * y1 - 1.45 * y2); // <-- <br /> samples[i] = y0; // store result in signal array<br /> }<br />}</pre>This code is accuracy critical and contains a cumbersome statement with floating-point calculations. Some people might prefer to split this to two statements for cosmetic reasons, but this will remove accuracy because results are rounded to double three times instead of one!<br /><br /><pre>y0 = x0 + 2 * x1 + x2;<br />y0 -= 1.5 * y1 - 1.45 * y2;<br />// this is the same as<br />y0 = (double)(x0 + 2 * x1 + x2) <br /> - (double)(1.5 * y1 - 1.45 * y2);<br />// and NOT<br />y0 = (x0 + 2 * x1 + x2)<br /> - (1.5 * y1 - 1.45 * y2);<br /></pre><br />But wait again! Does this do the trick already? The answer is no, it doesn't. At least not if you want to be sure that it works. To guarantee accurate results, the processor precision bit has to be set and the compiler has to be advised not to optimize so much that the code doesn't work anymore.<br /><br />First, the compiler pragma "#pragma fenv_access( [ on | off ] )" has to tell the compiler that we will change processor flags and he should not optimize this away. Secondly, we need to use "_controlfp_s" to enable the high-precision bit of the processor:<br /><pre>#include <float.h><br />#pragma fenv_access( on )<br />void iir(double* signal, int samples) {<br /><br /> // enable 64 mantissa and store old settings<br /> unsigned int control_word;<br /> _controlfp_s(&control_word, _PC_64, MCW_PC);<br /><br /> // do calculations ...<br /><br /> // restores old processor settings<br /> _controlfp_s(&control_word, _CW_DEFAULT, MCW_PC);<br />}</pre><br />This is a nice way to avoid assembler in your code, but sometimes it just doesn't work. For example when you want to sum the elements of an array and you can not put it all in one statement. In this case it would be nice to use intrinsics but the necessary floating point operations are not (yet) available as intrinsics.<br /><br />So here is the assembler routine to sum array elements with extended precision:<br /><br /><pre>#include <float.h><br />#pragma fenv_access( on )<br />double sumEP(const double* a, int length)<br />{<br /> if(length == 0)return 0.0;else if(length == 1)return a[0];<br /> <br /> double result;<br /> int i = length - 1;<br /><br /> unsigned int control_word;<br /> _controlfp_s(&control_word, _PC_64, MCW_PC);<br /><br /> _asm {<br /> lea eax, DWORD PTR a ; load adress of a<br /> mov eax, [eax]<br /> fld QWORD ptr [eax] ; load first value of a<br /><br /> mov ecx, dword ptr i ; load loop counter i<br /><br />loopStart:<br /> sub ecx, 1 ; i--<br /> cmp ecx, 0 ; if(i < 0) goto saveResults<br /> jl saveResults<br /><br /> add eax, 8 ; go to next array element<br /> fld QWORD ptr [eax] ; push a[i]<br /> fadd ; add top two stack numbers<br /> <br /> jmp loopStart<br /><br />saveResults:<br /> lea eax, DWORD PTR result ; load address if result<br /> fstp QWORD PTR [eax] ; store result<br /> }<br /><br /> _controlfp_s(&control_word,_CW_DEFAULT, 0xfffff);<br /><br /> return result;<br />}</pre>Calculating 10000000 times + 0.1 gave the following time and precision values: <br /><table><tbody><tr><td></td><td>error</td><td>time</td></tr><tr><td>standard</td><td>-1.6E-4</td><td>1</td></tr><tr><td>SSE</td><td>-8.9E-5</td><td>0.97</td></tr><tr><td>extended</td><td>8.7E-8</td><td>0.97</td></tr><tr><td>Kahan/SSE</td><td>0</td><td>1.66</td></tr></tbody></table><br />Calculating the sum over the first 10000000 elements in a geometric progression with q=0.9999: <br /><table><tbody><tr><td></td><td>error</td><td>time</td></tr><tr><td>standard</td><td>1.4E-9</td><td>1</td></tr><tr><td>SSE</td><td>-1.7E-9</td><td>0.48</td></tr><tr><td>extended</td><td>1.5E-9</td><td>0.73</td></tr><tr><td>Kahan/SSE</td><td>1.5E-9</td><td>4.8</td></tr></tbody></table><br />The tables show two things. In the first example, the precision depends highly on the calculation precision. The performance however is nearly the same for all approaches.<br />The second example shows little to no difference in precision but in speed. I think the reason is, that in the second example the double precision summands are limiting the quality, but since the cpu can't optimize a lot, the runtime in this table is more relevant. In the first example the worst case accuracy is captured better but the runtime is not very helpful since all numbers equal 0.1.<br /><br />Notice that in contrast to my earlier posting, kahan is slower than standard summation. This is due to better compiler optimization in this post. Notice, that SSE Kahan is still faster than non-SSE kahan.<br /><br />The nice thing about VC++ is that if you only want higher precision for intermediate results, you can easily remove casts and set the processor accuracy bit.<br /><br />Hope you enjoyed this posting. Let me say that it's a pity that this is not usable with SSE1-4 so far. You'd have to compromise speed against precision if you use it, which is perfectly fine if double is accurate enough. <br /><br />Regards!<div class="blogger-post-footer"><script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1434537-1";
urchinTracker();
</script></div>BaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.com5tag:blogger.com,1999:blog-9218555604707336107.post-10141721093443738012010-11-04T11:27:00.000-07:002010-11-04T15:52:14.269-07:00Butterworth Lowpass Filter Coefficients in C++When I searched the web it wasn't easy to find a simple function to calculate Butterworth lowpass filter coefficients, so I looked up the theory and did things on my own.<br /><br />If you just want coefficients for a single filter specification, you can use the code generated by <a href="http://www-users.cs.york.ac.uk/%7Efisher/mkfilter/">this nice online code generator</a>. It will produce C-code that you can use for a specified samplerate and cutoff-frequency. It will also do Chebyshev and Bessel filters as well as lowpass, highpass, bandpass and bandstop filters for you.<br /><br />However, if your program needs adjustable cutoff or samplerate, you can not use this script.<br /><br />The <span style="font-style:italic;">first part</span> of this post will take a glance at the required analytical steps and the <span style="font-style:italic;">second part</span> will simply give you the code.<br /><br />One of the standard techniques is to use the analytical frequency response of the filter and then use <a href="http://en.wikipedia.org/wiki/Bilinear_transform">bilinear transformation</a> to obtain IIR filter coefficients.<br /><br />The required steps require some math for which I found <a href="http://il.youtube.com/watch?v=5RoPKr94_-w">this lecture</a> very helpful:<br /><br />1) Take analytical transfer function: H(s)<br />2) Do bilinear transform: H( (z-1)/(z+1) )<br />3) "Warp" cutoff frequency to find cutoff in the "bilinear domain"<br />4) Express function as <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://upload.wikimedia.org/math/c/e/a/cea9aeb4bbf5f4312211e65473147c11.png"><img style="cursor: pointer; width: 427px; height: 49px;" src="http://upload.wikimedia.org/math/c/e/a/cea9aeb4bbf5f4312211e65473147c11.png" alt="" border="0" /></a><br />5) Use these coefficients for time-domain filtering with the <a href="http://en.wikipedia.org/wiki/Digital_filter#Difference_equation">linear difference equation</a>: <a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://upload.wikimedia.org/math/7/5/5/7550fd7e8b5e1aa43d587ff986ef5238.png"><img style="cursor:pointer; cursor:hand;width: 330px; height: 52px;" src="http://upload.wikimedia.org/math/7/5/5/7550fd7e8b5e1aa43d587ff986ef5238.png" border="0" alt="" /></a><br /><br />Notice that in order to simplify the derivation of the filter, you can safely set the sample time T to 2.<br /><br /><br />Doing this for a 2 pole Butterworth filter gives the following code in C++:<br /><br /><pre style="font-size: small;"><br />void getLPCoefficientsButterworth2Pole(const int samplerate, const double cutoff, double* const ax, double* const by)<br />{<br /> double PI = 3.1415926535897932385;<br /> double sqrt2 = 1.4142135623730950488;<br /><br /> double QcRaw = (2 * PI * cutoff) / samplerate; // Find cutoff frequency in [0..PI]<br /> double QcWarp = tan(QcRaw); // Warp cutoff frequency<br /><br /> double gain = 1 / (1+sqrt2/QcWarp + 2/(QcWarp*QcWarp));<br /> by[2] = (1 - sqrt2/QcWarp + 2/(QcWarp*QcWarp)) * gain;<br /> by[1] = (2 - 2 * 2/(QcWarp*QcWarp)) * gain;<br /> by[0] = 1;<br /> ax[0] = 1 * gain;<br /> ax[1] = 2 * gain;<br /> ax[2] = 1 * gain;<br />}<br /></pre><br /><br />which then can be used in a filter like this:<br /><br /><pre style="font-size: small;"><br />double xv[3];<br />double yv[3];<br /><br />void filter(double* samples, int count)<br />{<br /> double ax[3];<br /> double by[3];<br /><br /> getLPCoefficientsButterworth2Pole(44100, 5000, ax, by);<br /><br /> for (int i=0;i<count;i++)<br /> {<br /> xv[2] = xv[1]; xv[1] = xv[0];<br /> xv[0] = samples[i];<br /> yv[2] = yv[1]; yv[1] = yv[0];<br /><br /> yv[0] = (ax[0] * xv[0] + ax[1] * xv[1] + ax[2] * xv[2]<br /> - by[1] * yv[0]<br /> - by[2] * yv[1]);<br /><br /> samples[i] = yv[0];<br /> }<br />}<br /></pre><br /><br />And that is it. To reduce summation error you may want to use <a href="http://baumdevblog.blogspot.com/2010/11/fast-kahan-summation-implementation-in.html">Kahan summation</a>!<div class="blogger-post-footer"><script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1434537-1";
urchinTracker();
</script></div>BaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.com40tag:blogger.com,1999:blog-9218555604707336107.post-46397871220263710542010-11-01T11:03:00.000-07:002010-11-01T12:50:55.153-07:00Fast Kahan summation implementation in AssemblerSumming floating-point numbers is trickier than most people think. For example ((x+y)+z) does not necessarily equal (x+(y+z)) with FP numbers. Also, it is easy to loose precision if two numbers of different magnitude are summed.<br /><br />If you want to add several floating-point numbers, these rounding errors may sum up and make the result unusable.<br /><br />Luckily, there is help on the way and it is called <a href="http://en.wikipedia.org/wiki/Kahan_summation_algorithm">"Kahan Summation"</a>.<br />The basic idea is to introduce an extra error variable that remembers the errors made, so that at a later point, when the error has accumulated, it can be subtracted. Useful advice and other methods are described in <a href="http://www.drdobbs.com/184403224">here</a>.<br /><br />The common C++-code for this operation looks something like this:<br /><br /><pre style="font-size: small;"><br />double sum(const double* a, int length)<br />{<br /> if(length == 0)<br /> return 0.0;<br /> <br /> double result = a[0];<br /> double error = 0.0;<br /> for(int i=1;i<length;i++)<br /> {<br /> double tmpA = a[i] - error;<br /> double tmpSum = result + tmpA;<br /> error = (tmpSum - result) - tmpA;<br /> result = tmpSum;<br /> }<br /> <br /> return result;<br />}<br /></pre><br /><br />Compared to normal summing with a simple for loop, this implementation is a few times slower. Even worse, turning on compiler optimizations may destroy the whole function because "error = (tmpSum - result) - tmpA" could be reduced to "error = 0". So there must be a way to stop the compiler from doing <span style="font-style: italic;">certain</span> optimizations but not doing others.<br /><br />That is hard.<br /><br />So I decided to write it in Assembler. With SSE2 I could process two double values at a time which is a pretty good optimization which even good compilers may not find. Because my solution uses inline assembler, it will not be optimized any further and the functionality remains intact.<br /><br />Here it comes:<br /><br /><pre style="font-size: small;"><br />double sum(const double* a, int length)<br />{<br /> if(length == 0)<br /> return 0.0;<br /> else if(length == 1)<br /> return a[0];<br /> <br /> double result1 = 0.0;<br /> double result2 = 0.0;<br /> double error1 = 0.0;<br /> double error2 = 0.0;<br /><br /> int i = length - 2;<br /><br /> _asm {<br /> xorps xmm6, xmm6 ; Initialize error register (xmm6)<br /><br /> mov eax, DWORD PTR a ; Initialize result register (xmm4)<br /> movhpd xmm4, QWORD PTR [eax]<br /> movlpd xmm4, QWORD PTR [eax+8]<br /><br /> mov ecx, dword ptr i ; ecx = i save loop counter<br /><br />loopStart:<br /> sub ecx, 2 ; ecx -= 2<br /> cmp ecx,0 ; if(ecx <= 0) goto saveResults<br /> jl saveResults<br /><br /> add eax, 16 ; go to next two array elements<br /> movhpd xmm0, QWORD PTR [eax]<br /> movlpd xmm0, QWORD PTR [eax+8]; Save a[2*i], a[2*i+1] in xmm0<br /><br /> subpd xmm0, xmm6 ; a = a[i] - error<br /><br /> movapd xmm1, xmm4 ; ts = result + a<br /> addpd xmm1, xmm0 <br /><br /> movapd xmm6, xmm1 ; error = (ts - result) - a<br /> subpd xmm6, xmm4<br /> subpd xmm6, xmm0 <br /><br /> movapd xmm4, xmm1 ; result = ts<br /> <br /> jmp loopStart<br /><br />saveResults:<br /> movhpd result1, xmm4 ; Save result into result variables<br /> movlpd result2, xmm4<br /> mov dword ptr i, ecx<br /> }<br /><br /> if(i == -1)<br /> {<br /> double tmpA = a[length-1] - error1;<br /> double tmpSum = result1 + tmpA;<br /> error1 = (tmpSum - result1) - tmpA;<br /> result1 = tmpSum;<br /> }<br /> <br /> return (result1+result2)-(error1+error2);<br />}<br /></pre><br /><br />The performance on my Intel(R) Core(TM)2 Duo T8300 was as follows:<br /><table><tr><td>Standard sum</td><td>1x</td></tr><br /><tr><td>Standard sum (floats)</td><td>1.6x</td></tr><br /><tr><td>Kahan</td><td>3.2x</td></tr><br /><tr><td>Kahan SSE2</td><td>0.9x</td></tr></table><br /><br />As can be seen from the numbers, on my cpu the optimized code is three times faster and even faster than a normal unoptimized for loop. I am happy do discuss any requests and complaints. Simple summing with SSE would have been possible as well but that has the accuracy problems mentioned earlier.<div class="blogger-post-footer"><script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1434537-1";
urchinTracker();
</script></div>BaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.com5tag:blogger.com,1999:blog-9218555604707336107.post-32044019743783435262007-04-26T16:04:00.000-07:002007-04-26T16:10:09.460-07:00Python design by contractToday I've implemented some decorators for Eiffel-like design py contract.<br />It's very nice to use I think:<br /><br /><blockquote>@require('s1 != 0')<br />def inv(s1): return 1/s1</blockquote><br />or<br /><blockquote>@require('x != 0')<br />@enshure('result > 0')<br />def quad(x): return x*x*x / x</blockquote><br /><br />For more information just ask me. I'll put the code somewhere, soon.<br /><br />- stanz<div class="blogger-post-footer"><script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1434537-1";
urchinTracker();
</script></div>BaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.com0tag:blogger.com,1999:blog-9218555604707336107.post-86617533551277693392007-02-28T14:29:00.000-08:002007-03-01T05:30:44.357-08:00Profiling Django: hotshotToday I've discovered an article about <a href="http://www.rkblog.rk.edu.pl/w/p/django-profiling-hotshot-and-kcachegrind/">profiling Django</a> with hotshot. After some minutes of googling 'hotspot' I recognized my fault and found the right python <a href="http://docs.python.org/lib/module-hotshot.html">doc</a>.<br />As I didn't want to setup my apache just for a little profiling session, I did the following:<br />to settings.py I've added<br /><blockquote>import hotshot<br />PROF = hotshot.Profile("plik.prof")</blockquote>every view-method that should be profiled got:<br /><blockquote>from settings import PROF<br />PROF.start()<br />(...)<br />PROF.stop()</blockquote><br />and finally this came to my 'viewstat'-view:<br /><blockquote>import hotshot.stats<br />stats = hotshot.stats.load("plik.prof")<br />stats.strip_dirs()<br />stats.sort_stats('time', 'calls')<br />stats.print_stats(20)</blockquote><br />And that's how I've become happy today. Wish me good luck for my cold.<br />- baum<br /><br />It's fair to say that this probably won't run in a production-setup because of multithreading.<br /><br /><blockquote style="font-family: times new roman; color: rgb(0, 0, 0);"></blockquote><br /><br /><blockquote style="font-family: times new roman; color: rgb(0, 0, 0);"></blockquote><span class="" style="display: block; color: rgb(0, 0, 0);font-family:times new roman;" id="formatbar_CreateLink" title="" link="" onmouseover="ButtonHoverOn(this);" onmouseout="ButtonHoverOff(this);" onmouseup="" onmousedown="CheckFormatting(event);FormatbarButton('richeditorframe', this, 8);ButtonMouseDown(this);" ></span><div class="blogger-post-footer"><script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-1434537-1";
urchinTracker();
</script></div>BaumBloggerhttp://www.blogger.com/profile/12668078702880913731noreply@blogger.com0