Deterministic cross-platform floating point arithmetic

The scope of this document is to provide information on the implementation of floating point arithmetic in the C language on different platforms, with details on different compilers. It will explain why it is problematic to simply use floating point math without considering the implications. It will also provide a set of preprocessor macros that allow the consistent usage of floating point arithmetic across as many platforms and compilers as possible.

This document is far from complete. Please feel encouraged to contact me if you have any further information on this topic whatsoever. I'm especially interested in floating point arithmetic on non-x86 platforms, for example SPARC, IA64, ARM or Alpha. I'd also be interested in hearing from someone with knowledge of 64-bit Windows, or at least the possibility to test some things with different compilers there (especially Visual Studio 9, MinGW gcc and Cygwin gcc).

This document currently only deals with floating point precision. Two other topics will be discussed in future versions of this document: floating point exceptions and internal rounding control.

Please note that some understanding of assembly language and the inner workings of microprocessors is extremely helpful for understanding this document completely. Nevertheless, the conclusions should be accessible without it.

General overview of floating point number types

For a general explanation of floating point numbers, please refer to the Wikipedia article on IEEE 754. The following is only a small overview.

There are four main types of floating point numbers that are widely implemented today:

Single precision
Stored in 32 bits: 1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction.
Double precision
Stored in 64 bits: 1 bit for the sign, 11 bits for the exponent and 52 bits for the fraction.
Double extended precision
This is not exactly defined by IEEE 754, but the most common version is the x87 implementation, which has 1 bit for the sign, 15 bits for the exponent, 63 bits for the fraction and 1 bit for the integer part before the fraction (which is nearly always 1, except for denormals).
Quad precision
This is defined in the revised IEEE 754 standard (IEEE 754-2008). Stored in 128 bits: 1 bit for the sign, 15 bits for the exponent and 112 bits for the fraction. Several non-x86 64-bit processors support this data type.
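
In C, these typically map to the following types. This is a rough guide only; the meaning of long double varies by platform and compiler, as the sections below show:

 float       f;  /* single precision on all platforms discussed here */
 double      d;  /* double precision on all platforms discussed here */
 long double l;  /* double-extended with gcc on x86, double with Visual C++
                    and on ARM, quad (software-emulated) on PPC64 and SPARC */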

Problems with internal precision

Some processors (most notably the x87 FPU) have a so-called internal precision, i.e. instead of having different registers and instructions for different floating point types (as is the case with integers), they have a single internal floating point register type (in the x87 case, double-extended precision) and a single set of operations for doing math.

Many floating point experts state the case that using a higher internal precision for floating point calculations is better in order to ensure algorithms remain stable. While I basically agree with this argument, the problem with x87 FPUs is that they hide the usage of this internal precision from the programmer. This causes two different kinds of headaches:

  1. First of all, simple C code becomes highly platform- and compiler-dependent. With integers, this is not so much the case, since the C standard guarantees that an integer has at least a certain size; as long as calculations stay within that size, the code remains completely portable and produces the same results on every platform (i.e. it is deterministic). With floating point calculations, this is no longer the case: a change in internal precision can produce different results on different platforms, thus the code is no longer deterministic.
  2. Also, one would expect the floating point constant 0.002877 and the expression 2877.0 / 1000000.0 to be represented by the same floating point number (since both 2877 and 1000000 are exactly representable). If the conversion and the calculation are done at the same precision, this is actually the case. However, if a higher internal precision is used, it may not be: say 0.002877 is converted to the closest double representation of that number, while 2877.0 / 1000000.0 is calculated in double-extended precision and only truncated to double precision afterwards. This actually yields a slightly different number. (A small demonstration follows below.)
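
The following small program is a sketch of the second problem. It assumes a platform where long double is the x87 double-extended type (e.g. gcc on x86); on such a platform the two printed values may differ in the last bit, while on platforms where long double equals double they will always be equal:

 #include <stdio.h>
 
 int main(void)
 {
     /* Literal converted directly to the nearest double. */
     double direct = 0.002877;
     /* Division carried out in double-extended precision (assuming
        long double is the x87 80-bit type), then truncated to double. */
     double truncated = (double) (2877.0L / 1000000.0L);
 
     printf("direct:    %.20g\n", direct);
     printf("truncated: %.20g\n", truncated);
     printf("equal:     %s\n", direct == truncated ? "yes" : "no");
     return 0;
 }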

Note that the second problem apparently does not occur for single-precision floating point numbers: the number of fraction bits is so small that there appears to be no short decimal number that maps to a floating point number with these characteristics. Thus, for single precision, calculating in a higher precision and then truncating appears to have the same effect as calculating directly in single precision.

Hardware implementations

This section will take a look at different hardware implementations and analyze them.

x86 platforms

x86 platforms nowadays support two types of floating point instruction sets: the classical x87 floating point unit and the newer MMX/SSE instructions, which can even perform parallel calculations.

Traditional x87 FPU

The characteristics of the x87 FPU are: a single set of 80-bit registers that hold values in double-extended precision regardless of the C operand types; an internal precision setting (the precision control field of the FPU control word) that governs the precision of calculations; and rounding to single or double precision only when a value is stored to memory.

MMX/SSE instructions

The characteristics of the MMX/SSE instructions are: separate instructions for single precision and double precision operations, so the precision of a calculation is determined by the operand types; and no higher internal precision, so results never need a separate truncation step.

PowerPC

According to the documentation (see: Book I: PowerPC User Instruction Set Architecture) and some tests done by Scott MacVicar, the following holds for PPC: the processor has separate instructions for single and double precision, so the precision of a calculation is determined by the operand types; there is no double-extended precision, and quad precision (long double) is emulated in software (see the Mac OS X PPC64 section below).

The PPC processor also seems to support a non-IEEE mode for the FPU. The documentation does not clearly state how this mode is defined, and for portability reasons it should clearly be avoided.

SPARC

Johannes Schlüter did some tests on an UltraSPARC IIIi platform which yielded the following results: the processor has separate instructions for single, double and quad precision, so the precision of a calculation is determined by the operand types; there is no double-extended precision, and quad precision calculations are emulated in software (see the Solaris SPARC section below).

Other

I'd love to have some details on how other platforms implement floating point operations. The following is a non-exhaustive list of platforms I don't have any information on:

Some notes:

Software implementation

This section focuses on the implementation of floating point operations in compilers and system libraries, i.e. how C code is transformed into assembly instructions and how it interacts with the internal processor state.

Calling conventions

Calling conventions define how function parameters and return values are passed. This is platform-, operating-system- and compiler-dependent. Please note that quite a few systems pass floating point parameters and return values directly in FPU registers and not in memory. This can have interesting side effects. Consider, for example, a function that returns a float (single precision): if, instead of storing the return value, the caller passes it directly to another function, certain calling conventions will pass the FP register straight through - without ever storing the value in memory and thus without ever truncating it from internal precision. Other systems may have calling conventions that pass values on the stack, so the value is stored in memory and truncated. This also causes non-deterministic behaviour in the sense that different systems react differently to the same piece of code. I will discuss this issue separately for each compiler where applicable.
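
A sketch of the effect (the function names are illustrative):

 float divf (float a, float b)
 {
     return a / b; /* quotient may remain in an FPU register at internal precision */
 }
 
 double g (float a, float b)
 {
     /* If the calling convention returns floats in an FPU register, the
        result of divf() may be consumed here at full internal precision,
        without ever being truncated to single precision in memory. */
     return divf(a, b) * 3.0;
 }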

Microsoft Windows

I did extensive tests with cl.exe from Microsoft Visual C++ 2005 Express Edition under Windows 2000 and with MinGW (gcc 3.4.5 mingw-special). I could only test on 32-bit Windows; I'd be happy to hear about 64-bit Windows.

Microsoft Visual C++ Compiler

Most notably, the Visual C++ compiler does not support double-extended precision; it only supports single precision and double precision. ([1], [2]) It does, however, know a long double type, which is treated the same way as double for arithmetic purposes but is a distinct type with regard to type coercion. It also switches the processor's internal precision to double precision, so all calculations are executed in double precision.
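
A small illustration: with Visual C++ both values printed by the following sketch are typically 8, while with MinGW gcc the long double value is larger.

 #include <stdio.h>
 
 int main(void)
 {
     /* With Visual C++, long double has the same representation as double. */
     printf("sizeof(double) = %u, sizeof(long double) = %u\n",
            (unsigned int) sizeof(double), (unsigned int) sizeof(long double));
     return 0;
 }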

TODO: It would be nice to know exactly when this happens. Is it the kernel or some part of the runtime initialization?

The Microsoft Visual C++ runtime also provides the functions _controlfp and _controlfp_s to switch the x87 FP mode. They are documented on MSDN: _controlfp, _controlfp_s. The latter is available since Visual C++ 2005. In order to set the current precision, use:

 // Both are defined here
 #include <float.h>
 
 // old version
 unsigned int fpu_oldcw;
 fpu_oldcw = _controlfp(0, 0); // store old value
 _controlfp(precision, _MCW_PC);
 // calculation here
 _controlfp(fpu_oldcw, _MCW_PC);
 
 // new version
 unsigned int fpu_oldcw, fpu_cw;
 _controlfp_s(&fpu_cw, 0, 0);
 fpu_oldcw = fpu_cw;
 _controlfp_s(&fpu_cw, precision, _MCW_PC);
 // calculation here
 _controlfp_s(&fpu_cw, fpu_oldcw, _MCW_PC);

Since the new version only does additional range checking on the values, it is ok to use the old version as long as the supplied values are valid.

The precision may be one of:

_PC_24
Single precision
_PC_53
Double precision
_PC_64
Double-extended precision

Please note that the double-extended precision setting is for internal precision only. Values will always be truncated to double precision with Visual C++.

In order to ensure that the optimizer does not make wrong assumptions, add #pragma fenv_access(on) to the source code. This tells the compiler that the floating point mode may be changed by the source.
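
For example (a minimal sketch; scaled_div is a hypothetical function, and _PC_53 selects double precision as listed above):

 #include <float.h>
 
 // Inform the compiler that this code may modify the FP environment.
 #pragma fenv_access(on)
 
 double scaled_div (double a, double b)
 {
     unsigned int fpu_cw, fpu_oldcw;
     double res;
     _controlfp_s(&fpu_cw, 0, 0);               // save current control word
     fpu_oldcw = fpu_cw;
     _controlfp_s(&fpu_cw, _PC_53, _MCW_PC);    // switch to double precision
     res = a / b;
     _controlfp_s(&fpu_cw, fpu_oldcw, _MCW_PC); // restore previous precision
     return res;
 }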

Calling conventions: Windows does use FPU registers for return values, but it uses a small trick to ensure that values behave deterministically. For single precision values, the compiler temporarily stores the value in a variable and loads it again - the value is thereby truncated to single precision. (Double precision values don't matter, since the compiler assumes the processor is always in double precision mode.) Consider the following source code:

float divf (float a, float b) {
  return a / b;
}

This is transformed to:

_divf   PROC                        ; COMDAT
; Line 36
    fld DWORD PTR _a$[esp-4]
    fdiv    DWORD PTR _b$[esp-4]
    fstp    DWORD PTR tv130[esp-4]
    fld DWORD PTR tv130[esp-4]
; Line 37
    ret 0
_divf   ENDP

Note: Is it possible to have cl.exe generate SSE code?

Conclusion:

The internal precision is double by default. If it needs to be changed, use _controlfp / _controlfp_s; note that double-extended precision remains unavailable, as values are always truncated to double precision.

MinGW: gcc 3.4.5

This compiler supports double-extended precision (datatype long double). It sets (or leaves) the processor's internal precision to double-extended.

MinGW offers access to _controlfp but not _controlfp_s. The semantics are the same as with Visual C++, except that MinGW can actually use double-extended precision.

Calling conventions: Please note that MinGW never truncates to either double or float before returning a value. You can, however, force it to do so by either using the compiler option -ffloat-store or by storing the result in an explicit variable that is declared volatile and then returning that variable. This will result in the desired code, as the following sketch shows.
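
A minimal sketch of the volatile variant:

 float divf (float a, float b)
 {
     /* The volatile store forces a round trip through memory, which
        truncates the quotient to single precision before returning. */
     volatile float res = a / b;
     return res;
 }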

Conclusion:

Use _controlfp to switch the internal precision to double, and force return values to be truncated via -ffloat-store or a volatile variable.

Other compilers

Not tested, feedback welcome.

Linux x86

GCC 4.1.2

Like the GCC that comes with MinGW, this compiler supports double-extended precision and also uses it for the internal processor precision.

glibc provides a header file fpu_control.h which defines several useful macros to set the precision of the FPU. These macros evaluate to inline assembler code that directly sets the FPU state. Example usage:

 #include <fpu_control.h>
 
 fpu_control_t fpu_oldcw, fpu_cw;
 _FPU_GETCW(fpu_oldcw); // store old cw
 fpu_cw = (fpu_oldcw & ~_FPU_EXTENDED & ~_FPU_DOUBLE & ~_FPU_SINGLE) | precision;
 _FPU_SETCW(fpu_cw);
 // calculations here
 _FPU_SETCW(fpu_oldcw); // restore old cw

Precision may be one of:

_FPU_SINGLE
Single precision
_FPU_DOUBLE
Double precision
_FPU_EXTENDED
Double-extended precision

Calling conventions: In order for GCC to truncate values to the correct precision before returning them, use the -ffloat-store compiler option or declare an explicit return value variable volatile (as with MinGW). Please note that this alone will not solve the problem described in this article, since the above-mentioned example 0.002877 will result in different doubles even after truncation.

Caution: Since the FPU control macros only expand to inline assembler, the optimizer will sometimes move the actual calculation to after the reset of the FPU control word. Example:

double divcw (double a, double b) {
  double res;
  fpu_control_t _oldcw, _cw;
  _FPU_GETCW(_oldcw);
  _cw = (_oldcw & ~_FPU_EXTENDED & ~_FPU_SINGLE) | _FPU_DOUBLE;
  _FPU_SETCW(_cw);
  res = a / b;
  _FPU_SETCW(_oldcw);
  return res;
}

Without optimization, this will be expanded to:

divcw:
    # stuff left out for brevity
#APP
    fldcw -12(%ebp)
#NO_APP
    fldl    -24(%ebp)
    fdivl   -32(%ebp)
    fstpl   -8(%ebp)
#APP
    fldcw -10(%ebp)
#NO_APP
    fldl    -8(%ebp)
    leave
    ret
    .size   divcw, .-divcw

With optimization (-O2; -O1 has the same problem), however, this results in:

divcw:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $16, %esp
    fldl    8(%ebp)
#APP
    fnstcw -2(%ebp)
#NO_APP
    movzwl  -2(%ebp), %eax
    andl    $-769, %eax
    orl $512, %eax
    movw    %ax, -4(%ebp)
#APP
    fldcw -4(%ebp)
    fldcw -2(%ebp)
#NO_APP
    fdivl   16(%ebp)
    leave
    ret
    .size   divcw, .-divcw

As one can see, the optimizer reordered the instructions. I consider this a bug in GCC. GCC 3.3.6 and 3.4.6 were not affected according to my tests (but they only honour -ffloat-store for explicit variables). I have not yet tested newer versions.

This reordering can be avoided by two means: either use the -ffloat-store compiler option or declare the return variable (here: res) volatile.
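
For example, a sketch of the volatile variant of divcw:

#include <fpu_control.h>

double divcw (double a, double b) {
  /* volatile forces res into memory and keeps the optimizer from
     moving the division past the restoring _FPU_SETCW. */
  volatile double res;
  fpu_control_t _oldcw, _cw;
  _FPU_GETCW(_oldcw);
  _cw = (_oldcw & ~_FPU_EXTENDED & ~_FPU_SINGLE) | _FPU_DOUBLE;
  _FPU_SETCW(_cw);
  res = a / b;
  _FPU_SETCW(_oldcw);
  return res;
}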

Using MMX/SSE: GCC also supports the compiler option -mfpmath=sse. In order for it to take effect, at least the additional options -msse and -march=XXX (where XXX is e.g. pentium4 or any other arch value that supports SSE) must be supplied. In that case, the MMX/SSE extensions will be used instead of the FPU for single and double precision floating point calculations, which eliminates the need to change the FPU precision because the MMX/SSE extensions always operate at the precision of their operands.
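
For example (the file name calc.c is illustrative):

 gcc -O2 -msse -march=pentium4 -mfpmath=sse -o calc calc.c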

Conclusion:

Use the _FPU_GETCW / _FPU_SETCW macros from fpu_control.h to switch the internal precision to double, combined with -ffloat-store or a volatile return variable to guard against the reordering problem - or compile with -mfpmath=sse on processors that support SSE.

Other compilers

Not tested, feedback welcome.

FreeBSD x86

GCC 3.4.6

On FreeBSD, the x87 FPU is in double precision mode by default. It does, however, support the double-extended data type (long double); calculations with that data type will nevertheless be executed with double precision only by default.

FreeBSD also supports several macros to change the processor's float mode:

 #include <machine/ieeefp.h>
 
 fp_prec_t fpu_oldprec;
 fpu_oldprec = fpgetprec();
 fpsetprec(precision);
 // calculations here
 fpsetprec(fpu_oldprec);

Valid precision values are:

FP_PS
Single precision
FP_PD
Double precision
FP_PE
Double-extended precision

Calling conventions: The compiler does not generally store the result of an operation to memory and back into the registers to truncate it. -ffloat-store itself is only honoured for actual variables, not for expressions (i.e. double result = a / b; return result; and return a / b; will result in different code with -ffloat-store active - even with optimization enabled).

Conclusion:

The FPU is already in double precision mode by default; use fpgetprec / fpsetprec to change or restore the mode where necessary.

Other compilers

Not tested, feedback welcome.

OpenSolaris 2008.05 x86

It does not appear to be possible to switch to double precision on OpenSolaris on Intel platforms without resorting to inline assembler or relying on GCC and SSE extensions (and OpenSolaris runs on non-SSE-x86 processors, so that's not portable).

GCC 3.4.3

The compiler supports double-extended precision and uses it as the default floating point precision on this platform. There seems to be no possibility to explicitly change the FP precision setting without resorting to inline assembler. (There are some macros / functions defined in sys/ieeefp.h, but the interesting part is enclosed in an #ifdef _KERNEL.) The only possibilities appear to be to use MMX/SSE instructions (-mfpmath=sse -msse -march=pentium4 or similar) or to use inline assembly to set the internal precision.
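
A minimal inline assembler sketch of the latter (assuming GCC assembler syntax; the precision control field occupies bits 8 and 9 of the x87 control word, and 0x0200 selects double precision):

 unsigned short cw;
 __asm__ __volatile__ ("fnstcw %0" : "=m" (cw)); /* store the current control word */
 cw = (cw & ~0x0300) | 0x0200;                   /* set the PC field to double precision */
 __asm__ __volatile__ ("fldcw %0" : : "m" (cw)); /* load the modified control word */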

Calling conventions: The compiler does not generally store the result of an operation to memory and back into the registers to truncate it. -ffloat-store itself is only honoured for actual variables, not for expressions (i.e. double result = a / b; return result; and return a / b; will result in different code with -ffloat-store active - even with optimization enabled).

Conclusion:

There is no C interface to change the internal precision; either compile with -mfpmath=sse (not possible on x86 processors without SSE) or set the x87 control word with inline assembler as sketched above.

SUN C 5.9 Patch 124868-01 2007/07/12

The compiler behaves more or less identically to GCC on OpenSolaris; there are only two differences:

  1. Single-precision values are always truncated via the store-and-reload trick before being returned.
  2. There appears to be no option to enable SSE for the compiler.

Conclusion:

As with GCC, only inline assembler can change the internal precision; the SSE alternative is not available with this compiler.

Other compilers

Not tested, feedback welcome.

Mac OS X 10.5 x86

This section is based on tests that other people did for me.

GCC 4.0.1 (Apple build 5465)

The FPU mode is double-extended by default and that datatype is also supported. But since Mac OS X runs only on x86 processors that support MMX/SSE, the compiler will always use MMX/SSE for single and double precision floating point calculations. Only double-extended precision calculations will be done with the x87 FPU.

This is actually the most desirable behaviour, since the precision is always determined by the largest operand type in C.

Because of this, Mac OS X does not provide any C wrapper macros to change the internal precision setting of the x87 FPU. It is simply not necessary. Should this really be wanted, inline assembler would probably work; I have not tested this, however.

Not tested: See what happens if -mfpmath=i387 is used as a compiler option.

Conclusion:

Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for double-extended precision.

Other compilers

Not tested, feedback welcome.

Linux AMD64/EM64T (NOT IA64!)

GCC 4.1.2

GCC on 64-bit x86 Linux behaves the same as GCC on Mac OS X: the FPU mode is double-extended by default, but since SSE is supported by all x86_64 processors, SSE is always used for single and double precision floating point calculations, and thus the largest operand type in C defines the calculation precision.

However, it is possible to use the -mfpmath=i387 switch. In that case, glibc also provides fpu_control.h with the same macros as in the x86 case.

Conclusion:

Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for double-extended precision.

Mac OS X PPC64 (Power4)

Scott MacVicar did some tests on the Power4 platform with GCC 4.0.1, which yielded the expected results: the compiler generates different instructions for different datatypes, thus using the correct precision. An interesting detail: quad precision calculations are done via software emulation by default, perhaps for compatibility with older processor revisions.

Conclusion:

Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for quad precision.

Solaris SPARC (UltraSPARC IIIi)

Johannes Schlüter did some tests on the UltraSPARC IIIi platform with both GCC 3.4.3 and Sun's CC, which yielded the expected results: both compilers generate different instructions for different datatypes, thus using the correct precision. An interesting detail: quad precision calculations are done via software emulation by default (by both (!) compilers), perhaps for compatibility with older processor revisions.

Conclusion:

Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision, double for double precision and long double for quad precision.

Linux ARM (Nokia N810)

A friend of mine ran the test suite below on a Nokia N810 with Linux installed. Since the N810 does not possess an FPU, all FPU operations are emulated via function calls.

GCC 3.4.4

The emulation uses the same precision as the largest operand type. long double is treated the same as double, i.e. there is no extended precision.

Conclusion:

Simply use the correct datatype and the operations performed will have the correct semantics: float for single precision and double for double precision.

Conclusion

The best solution would probably be to use the MPFR library (or something similar). That has its drawbacks, though: it is slower than native FPU arithmetic and it requires memory allocation on the heap. Also, having such a big library as a requirement for one's own software may not always be the optimal choice. Thus, a solution using the native FP unit will be developed here.

Using single or double precision for floating point calculations seems to be the only portable alternative short of using an entire library: double-extended precision is not supported on PPC and SPARC, quad precision is not supported on x86, and double-extended precision is not supported by Microsoft's Visual C++ compiler either.

Since there is probably no good reason to restrict calculations to single precision (lower precision means less accuracy), this document will focus on using double precision portably.

In order to achieve this, the following steps should be taken:

  1. Consistently use the double datatype for values and calculations (not float and not long double).
  2. On platforms where the FPU calculates with a higher internal precision, switch the FPU to double precision for the duration of the calculations and restore the previous setting afterwards.
  3. Ensure that return values are actually truncated to double precision (see the calling convention issues above).

Macros for saving, switching and restoring FPU precision

The following macros provide platform-independent support for saving, switching and restoring FPU precision. If a platform does not support this or it is not necessary, they will simply expand to no-ops. XPFPA stands for "cross platform floating point arithmetic".

XPFPA_DECLARE()
Declare all necessary variables to save the current FP precision setting.
XPFPA_SWITCH_DOUBLE()
Save the current precision and switch to double precision
XPFPA_RESTORE()
Restore the previous precision
XPFPA_RETURN_DOUBLE(val)
Restore the previous precision setting and return a value. Handles the compiler-misoptimization discussed above.
XPFPA_SWITCH_SINGLE()
Save the current precision and switch to single precision. This macro is defined for completeness purposes only.
XPFPA_RETURN_SINGLE(val)
Restore the previous precision setting and return a value. Handles the compiler-misoptimization discussed above.
XPFPA_SWITCH_DOUBLE_EXTENDED()
Save the current precision and switch to double-extended precision. May not work at all (e.g. PPC) or not as expected (MS Visual C++). This macro is defined for completeness purposes only.

Example usage:

 #include "xpfpa.h"

 double somefn (double a) {
   XPFPA_DECLARE()
   double localvar1;
   
   XPFPA_SWITCH_DOUBLE()
   
   // do some calculation
   localvar1 = a * a;
   
   // restore state temporarily, call an external library function
   XPFPA_RESTORE()
   localvar1 = external_lib_solve_linear_equation(localvar1);
   XPFPA_SWITCH_DOUBLE()
   
   // do some more calculations
   localvar1 = localvar1 / 2.0;
   
   // return the value
   XPFPA_RETURN_DOUBLE(localvar1);
 }

Please be sure to always use the correct datatype for the calculations, since several architectures do not possess an internal precision but rather determine the precision according to the datatype of the operands!

The header file xpfpa.h defines the above macros on systems where the internal precision can be changed. On other systems, the macros expand to nothing and we assume that the correct precision is chosen by the compiler based on the operand types. The file can also be downloaded in an archive: xpfpa.tar.gz
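
To give an idea of what the header does, the following is a hypothetical sketch (not the actual file contents) of how the macros might be defined on a glibc/x86 system, using the HAVE__FPU_SETCW define from the configuration checks below:

 #ifdef HAVE__FPU_SETCW /* glibc */
 # include <fpu_control.h>
 # define XPFPA_DECLARE() \
     fpu_control_t _xpfpa_fpu_oldcw, _xpfpa_fpu_cw;
 # define XPFPA_SWITCH_DOUBLE() \
     _FPU_GETCW(_xpfpa_fpu_oldcw); \
     _xpfpa_fpu_cw = (_xpfpa_fpu_oldcw & ~_FPU_EXTENDED & ~_FPU_SINGLE) | _FPU_DOUBLE; \
     _FPU_SETCW(_xpfpa_fpu_cw);
 # define XPFPA_RESTORE() \
     _FPU_SETCW(_xpfpa_fpu_oldcw);
 # define XPFPA_RETURN_DOUBLE(val) \
     { volatile double _xpfpa_result = (val); XPFPA_RESTORE(); return _xpfpa_result; }
 #else /* no internal precision to change */
 # define XPFPA_DECLARE()
 # define XPFPA_SWITCH_DOUBLE()
 # define XPFPA_RESTORE()
 # define XPFPA_RETURN_DOUBLE(val) return (val);
 #endif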

Checks for autoconf

In order to set the correct defines for the macros, I have written an m4 file that may be included in your configure.in. Its name is float_precision.m4 and it is available in the archive autoconf-macros.tar.gz. Usage:

dnl ... top of configure.in or acinclude.m4 or whatever:
sinclude(float_precision.m4)
dnl ... below:
CHECK_FLOAT_PRECISION

It will set one or more of the following defines; please make sure your config.h.in contains them:

HAVE__CONTROLFP
_controlfp is available (Windows)
HAVE__CONTROLFP_S
_controlfp_s is available (Windows, Visual C++ 2005 and later)
HAVE__FPU_SETCW
_FPU_GETCW / _FPU_SETCW are available (glibc)
HAVE_FPSETPREC
fpgetprec / fpsetprec are available (FreeBSD)

Checks for CMake

I have also provided a macro for CMake that may be included in your CMakeLists.txt. Its name is CheckFloatPrecision.cmake and it is available in the archive cmake-macros.tar.gz. Usage:

include(CheckFloatPrecision.cmake)
# or, if in module path
include(CheckFloatPrecision)
check_float_precision()

It will set one or more of the same defines as the autoconf check above (HAVE__CONTROLFP, HAVE__CONTROLFP_S, HAVE__FPU_SETCW, HAVE_FPSETPREC); please make sure your config.h.in contains them.

Tests for other platforms / compilers

I have started to create a simple test suite that does some checks on the default floating point behaviour. It will only run on POSIX-like platforms with a C compiler that supports GCC-style options (e.g. GCC itself, ICC, the Sun C compiler, ...).

It can be downloaded here: fp-testsuite.tar.bz2 / fp-testsuite.tar.gz.

If you have a new platform / new compiler that I haven't discussed yet, please run the following (in the directory where you unpacked the archive):

CC=/path/to/your/compiler ./fptest.sh | tee messages.txt

Then please send me messages.txt and results.tar.gz.


Author: Christian Seiler, Last change: 2008-10-28