3 Numerical Methods

§3.1 Arithmetics and Error Measures

Contents
  1. §3.1(i) Floating-Point Arithmetic
  2. §3.1(ii) Interval Arithmetic
  3. §3.1(iii) Rational Arithmetics
  4. §3.1(iv) Level-Index Arithmetic
  5. §3.1(v) Error Measures

§3.1(i) Floating-Point Arithmetic

Computer arithmetic is described here for the binary system, with base 2; another system that has been used is the hexadecimal system, with base 16.

A nonzero normalized binary floating-point machine number x is represented as

3.1.1 $x = (-1)^s (b_0.b_1 b_2 \dots b_{p-1})\, 2^E$,
$b_0 = 1$,

where $s$ is equal to $1$ or $0$, each $b_j$, $j \ge 1$, is either $0$ or $1$, $b_1$ is the most significant bit, $p$ ($\in \mathbb{N}$) is the number of significant bits $b_j$, $b_{p-1}$ is the least significant bit, $E$ is an integer called the exponent, $b_0.b_1 b_2 \dots b_{p-1}$ is the significand, and $f = .b_1 b_2 \dots b_{p-1}$ is the fractional part.

The set of machine numbers $\mathbb{R}_{\mathrm{fl}}$ is the union of $0$ and the set

3.1.2 $(-1)^s\, 2^E \sum_{j=0}^{p-1} b_j 2^{-j}$,

with $b_0 = 1$ and all allowable choices of $E$, $p$, $s$, and $b_j$.

Let $E_{\min} \le E \le E_{\max}$ with $E_{\min} < 0$ and $E_{\max} > 0$. For given values of $E_{\min}$, $E_{\max}$, and $p$, the format width in bits $N$ of a computer word is the total number of bits: the sign (one bit), the significant bits $b_1, b_2, \dots, b_{p-1}$ ($p-1$ bits), and the bits allocated to the exponent (the remaining $N - p$ bits). The integers $p$, $E_{\min}$, and $E_{\max}$ are characteristics of the machine. The machine epsilon $\epsilon_M$, that is, the distance between $1$ and the next larger machine number with $E = 0$, is given by $\epsilon_M = 2^{-p+1}$. The machine precision is $\frac{1}{2}\epsilon_M = 2^{-p}$. The lower and upper bounds for the absolute values of the nonzero machine numbers are given by

3.1.3 $N_{\min} \equiv 2^{E_{\min}} \le |x| \le 2^{E_{\max}+1}(1 - 2^{-p}) \equiv N_{\max}$.

Underflow (overflow) after computing $x \ne 0$ occurs when $|x|$ is smaller (larger) than $N_{\min}$ ($N_{\max}$).
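
For illustration, the following Python sketch checks the formulas for $\epsilon_M$, $N_{\min}$, and $N_{\max}$, assuming the IEEE binary64 parameters $p = 53$, $E_{\min} = -1022$, $E_{\max} = 1023$ (see also §3.1(i), IEEE Standard, below).

    # Informal check of eps_M, N_min, and N_max for IEEE binary64,
    # assuming p = 53, E_min = -1022, E_max = 1023.
    import sys

    p, E_min, E_max = 53, -1022, 1023

    eps_M = 2.0**(1 - p)                       # distance from 1 to the next machine number
    N_min = 2.0**E_min                         # smallest positive normalized number
    N_max = 2.0**E_max * (2.0 - 2.0**(1 - p))  # equals 2**(E_max+1) * (1 - 2**(-p))

    assert eps_M == sys.float_info.epsilon
    assert N_min == sys.float_info.min
    assert N_max == sys.float_info.max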

IEEE Standard

The current floating-point arithmetic standard is IEEE 754-2019 IEEE (2019), a minor technical revision of IEEE 754-2008 IEEE (2008), which was adopted in 2011 by the International Organization for Standardization as ISO/IEC/IEEE 60559. For the normalized binary interchange formats, the representation of data for binary32 (previously single precision) ($N = 32$, $p = 24$, $E_{\min} = -126$, $E_{\max} = 127$), binary64 (previously double precision) ($N = 64$, $p = 53$, $E_{\min} = -1022$, $E_{\max} = 1023$), and binary128 (previously quad precision) ($N = 128$, $p = 113$, $E_{\min} = -16382$, $E_{\max} = 16383$) is as in Figure 3.1.1. The respective machine precisions are $\frac{1}{2}\epsilon_M = 0.596 \times 10^{-7}$, $\frac{1}{2}\epsilon_M = 0.111 \times 10^{-15}$, and $\frac{1}{2}\epsilon_M = 0.963 \times 10^{-34}$.

Figure 3.1.1: Floating-point arithmetic. Representation of data in the binary interchange formats for binary32, binary64, and binary128 (previously single, double, and quad precision): a sign bit $s$ (1 bit), the exponent $E$ (8, 11, or 15 bits), and the fractional part $f$ (23, 52, or 112 bits), for $N = 32$, $p = 24$; $N = 64$, $p = 53$; and $N = 128$, $p = 113$, respectively.
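
As a concrete illustration of the binary64 layout in Figure 3.1.1, the fields $s$, $E$, and $f$ of a machine number can be unpacked as in the following Python sketch (the function name is chosen here for illustration; the exponent is stored with bias $1023$).

    # Unpack the fields s (1 bit), E (11 bits), and f (52 bits) of an IEEE
    # binary64 number; valid for normalized numbers (exponent bias 1023).
    import struct

    def binary64_fields(x):
        (bits,) = struct.unpack(">Q", struct.pack(">d", x))
        s = bits >> 63                     # sign bit
        E = ((bits >> 52) & 0x7FF) - 1023  # unbiased exponent
        f = bits & ((1 << 52) - 1)         # fractional part of the significand
        return s, E, f

    print(binary64_fields(-1.5))           # (1, 0, 2**51), since -1.5 = -(1.1)_2 * 2**0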

Rounding

Let x be any positive number with

3.1.4 $x = (1.b_1 b_2 \dots b_{p-1} b_p b_{p+1} \dots)\, 2^E$,

$N_{\min} \le x \le N_{\max}$, and

3.1.5 $x_- = (1.b_1 b_2 \dots b_{p-1})\, 2^E$,
$x_+ = \bigl((1.b_1 b_2 \dots b_{p-1}) + \epsilon_M\bigr)\, 2^E$.

Then rounding by chopping or rounding down of $x$ gives $x_-$, with maximum relative error $\epsilon_M$. Symmetric rounding or rounding to nearest of $x$ gives $x_-$ or $x_+$, whichever is nearer to $x$, with maximum relative error equal to the machine precision $\frac{1}{2}\epsilon_M = 2^{-p}$.

Negative numbers $x$ are rounded in the same way as $-x$.
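
The following Python sketch illustrates the bound for rounding to nearest, assuming the binary64 format ($p = 53$): the nearest machine number to $1/10$ has relative error below $\frac{1}{2}\epsilon_M = 2^{-53}$.

    # Relative error of rounding 1/10 to the nearest binary64 machine number;
    # exact rational arithmetic avoids further rounding in the check itself.
    from fractions import Fraction

    x_exact = Fraction(1, 10)              # the real number 1/10
    x_near  = Fraction(0.1)                # exact value of the nearest machine number
    rel_err = abs((x_near - x_exact) / x_exact)

    assert rel_err <= Fraction(1, 2**53)   # machine precision (1/2) eps_M = 2**(-p)
    print(float(rel_err))                  # about 5.55e-17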

For further information see Goldberg (1991) and Overton (2001).

§3.1(ii) Interval Arithmetic

Interval arithmetic is intended for bounding the total effect of rounding errors of calculations with machine numbers. With this arithmetic the computed result can be proved to lie in a certain interval, which leads to validated computing with guaranteed and rigorous inclusion regions for the results.

Let G be the set of closed intervals {[a,b]}. The elementary arithmetical operations on intervals are defined as follows:

3.1.6 $I * J = \{\, x * y \mid x \in I,\ y \in J \,\}$,
$I, J \in G$,

where $* \in \{+, -, \cdot, /\}$, with appropriate roundings of the end points of $I * J$ when machine numbers are being used. Division is possible only if the divisor interval does not contain zero.
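
A minimal Python sketch of (3.1.6) for addition and multiplication is given below; the class and its outward rounding via math.nextafter (which widens each end point by one machine number) are illustrative choices, not a standard interval library.

    # Closed intervals [a, b] with end points widened outward after each
    # operation, a conservative substitute for true directed rounding.
    import math

    class Interval:
        def __init__(self, a, b):
            self.a, self.b = a, b

        @staticmethod
        def _outward(lo, hi):
            return Interval(math.nextafter(lo, -math.inf),
                            math.nextafter(hi, math.inf))

        def __add__(self, other):
            return self._outward(self.a + other.a, self.b + other.b)

        def __mul__(self, other):
            p = [self.a * other.a, self.a * other.b,
                 self.b * other.a, self.b * other.b]
            return self._outward(min(p), max(p))

    x = Interval(1.0, 1.0) + Interval(0.1, 0.1)   # encloses the exact sum 1.1
    print(x.a, x.b)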

A basic text on interval arithmetic and analysis is Alefeld and Herzberger (1983), and for applications and further information see Moore (1979) and Petković and Petković (1998). The last reference includes analogs for arithmetic in the complex plane $\mathbb{C}$. See also the IEEE Standards for Interval Arithmetic, IEEE (2015, 2018).

§3.1(iii) Rational Arithmetics

Computer algebra systems use exact rational arithmetic with rational numbers p/q, where p and q are multi-length integers. During the calculations common divisors are removed from the rational numbers, and the final results can be converted to decimal representations of arbitrary length. For further information see Matula and Kornerup (1980).
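
For illustration, Python's fractions module provides this kind of exact rational arithmetic, with common divisors removed automatically and conversion to decimal representations of any desired length.

    # Exact rational arithmetic: results are kept in lowest terms, and the
    # final value can be converted to a decimal representation of any length.
    from decimal import Decimal, getcontext
    from fractions import Fraction

    r = Fraction(10, 4) + Fraction(1, 6)   # 5/2 + 1/6, reduced to 8/3
    print(r.numerator, r.denominator)      # 8 3

    getcontext().prec = 50                 # 50 significant decimal digits
    print(Decimal(r.numerator) / Decimal(r.denominator))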

§3.1(iv) Level-Index Arithmetic

To eliminate overflow or underflow in finite-precision arithmetic, numbers are represented by using generalized logarithms $\ln_\ell(x)$ given by

3.1.7 $\ln_0(x) = x$,
$\ln_\ell(x) = \ln(\ln_{\ell-1}(x))$,
$\ell = 1, 2, \dots$,

with $x \ge 0$ and $\ell$ the unique nonnegative integer such that $a \equiv \ln_\ell(x) \in [0, 1)$. In level-index arithmetic $x$ is represented by $\ell + a$ (or $-(\ell + a)$ for negative numbers). Also in this arithmetic generalized precision can be defined, which includes absolute error and relative precision (§3.1(v)) as special cases.
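
A Python sketch of this representation is given below, assuming $x \ge 0$ and computing the level $\ell$ and fraction $a = \ln_\ell(x) \in [0, 1)$ by repeated application of the natural logarithm; the function name is illustrative.

    # Level-index representation l + a of x >= 0: apply ln repeatedly until
    # the value falls into [0, 1); the number of applications is the level l.
    import math

    def level_index(x):
        level = 0
        while x >= 1.0:
            x = math.log(x)
            level += 1
        return level + x             # l + a

    print(level_index(0.5))          # 0.5 (level 0)
    print(level_index(1.0e10))       # about 3.134 (level 3)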

For further information see Clenshaw and Olver (1984) and Clenshaw et al. (1989). For applications see Lozier (1993).

For further references on level-index arithmetic (and also other arithmetics) see Anuta et al. (1996). See also Hayes (2009).

§3.1(v) Error Measures

If $x^*$ is an approximation to a real or complex number $x$, then the absolute error is

3.1.8 $\epsilon_a = |x^* - x|$.

If $x \ne 0$, the relative error is

3.1.9 $\epsilon_r = \left|\dfrac{x^* - x}{x}\right| = \dfrac{\epsilon_a}{|x|}$.

The relative precision is

3.1.10 $\epsilon_{rp} = |\ln(x^*/x)|$,

where $x x^* > 0$ for real variables, and $x x^* \ne 0$ for complex variables (with the principal value of the logarithm).

The mollified error is

3.1.11 $\epsilon_m = \dfrac{|x^* - x|}{\max(|x|, 1)}$.
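
For a concrete comparison, the following Python sketch evaluates the four error measures (3.1.8)–(3.1.11) for a sample approximation $x^*$ to $x$; the values are illustrative.

    # Absolute error, relative error, relative precision, and mollified error
    # of an approximation x_star to a nonzero real number x.
    import math

    x, x_star = 3.141592653589793, 3.1416

    eps_a  = abs(x_star - x)                      # (3.1.8)  absolute error
    eps_r  = abs((x_star - x) / x)                # (3.1.9)  relative error, x != 0
    eps_rp = abs(math.log(x_star / x))            # (3.1.10) relative precision, x*x_star > 0
    eps_m  = abs(x_star - x) / max(abs(x), 1.0)   # (3.1.11) mollified error

    print(eps_a, eps_r, eps_rp, eps_m)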

For error measures for complex arithmetic see Olver (1983).