NIST Digital Library of Mathematical Functions
Chapter 3 Numerical Methods

§3.1 Arithmetics and Error Measures


§3.1(i) Floating-Point Arithmetic

Computer arithmetic is described here for the binary system, with base 2; another frequently used system is the hexadecimal system, with base 16.

A nonzero normalized binary floating-point machine number x is represented as

3.1.1  $x=(-1)^s(b_0.b_1b_2\dots b_{p-1})\,2^E$,  $b_0=1$,

where $s$ is equal to $1$ or $0$, each $b_j$, $j\ge 1$, is either $0$ or $1$, $b_1$ is the most significant bit, $p$ is the number of significant bits $b_j$, $b_{p-1}$ is the least significant bit, $E$ is an integer called the exponent, $b_0.b_1b_2\dots b_{p-1}$ is the significand, and $f=.b_1b_2\dots b_{p-1}$ is the fractional part.

The set of machine numbers $\mathrm{fl}$ is the union of $0$ and the set

3.1.2  $(-1)^s 2^E \sum_{j=0}^{p-1} b_j 2^{-j}$,

with $b_0=1$ and all allowable choices of $E$, $p$, $s$, and $b_j$.

Let $E_{\min}\le E\le E_{\max}$ with $E_{\min}<0$ and $E_{\max}>0$. For given values of $E_{\min}$, $E_{\max}$, and $p$, the format width in bits $N$ of a computer word is the total number of bits: the sign (one bit), the significant bits $b_1,b_2,\dots,b_{p-1}$ ($p-1$ bits), and the bits allocated to the exponent (the remaining $N-p$ bits). The integers $p$, $E_{\min}$, and $E_{\max}$ are characteristics of the machine. The machine epsilon $\epsilon_M$, that is, the distance between $1$ and the next larger machine number with $E=0$, is given by $\epsilon_M=2^{-p+1}$. The machine precision is $\frac{1}{2}\epsilon_M=2^{-p}$. The lower and upper bounds for the absolute values of the nonzero machine numbers are given by

3.1.3  $N_{\min}\equiv 2^{E_{\min}}\le|x|\le 2^{E_{\max}+1}(1-2^{-p})\equiv N_{\max}$.

Underflow (overflow) after computing $x\ne 0$ occurs when $|x|$ is smaller (larger) than $N_{\min}$ ($N_{\max}$).

IEEE Standard

The current standard is the ANSI/IEEE Standard 754; see IEEE (1985, §§1–4). In the case of normalized binary representation the memory positions for single precision ($N=32$, $p=24$, $E_{\min}=-126$, $E_{\max}=127$) and double precision ($N=64$, $p=53$, $E_{\min}=-1022$, $E_{\max}=1023$) are as in Figure 3.1.1. The respective machine precisions are $\frac{1}{2}\epsilon_M=0.596\times 10^{-7}$ and $\frac{1}{2}\epsilon_M=0.111\times 10^{-15}$.
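These double-precision parameters can be checked directly in an IEEE 754 environment. The following sketch assumes CPython, whose `float` type is an IEEE 754 double, and verifies $p$, $\epsilon_M$, $N_{\min}$, and $N_{\max}$ against `sys.float_info`:

```python
import sys

# CPython floats are IEEE 754 doubles: N = 64, p = 53,
# E_min = -1022, E_max = 1023.
assert sys.float_info.mant_dig == 53              # p
assert sys.float_info.epsilon == 2.0 ** -52       # eps_M = 2**(-p+1)

# N_min = 2**E_min, the smallest positive normalized double.
assert sys.float_info.min == 2.0 ** -1022

# N_max = 2**(E_max+1) * (1 - 2**(-p)), rewritten as
# 2**1023 * (2 - 2**(-52)) to avoid overflowing 2**1024.
assert sys.float_info.max == 2.0 ** 1023 * (2.0 - 2.0 ** -52)

print(sys.float_info.epsilon / 2)  # machine precision, about 1.11e-16
```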

[Figure 3.1.1 layout: single precision — sign $s$ (1 bit), exponent $E$ (8 bits), fraction $f$ (23 bits), $N=32$, $p=24$; double precision — sign $s$ (1 bit), exponent $E$ (11 bits), fraction $f$ (52 bits), $N=64$, $p=53$.]

Figure 3.1.1: Floating-point arithmetic. Memory positions in single and double precision, in the case of binary representation.

Rounding

Let x be any positive number with

3.1.4  $x=(1.b_1b_2\dots b_{p-1}b_pb_{p+1}\dots)\,2^E$,

$N_{\min}\le x\le N_{\max}$, and

3.1.5  $x_-=(1.b_1b_2\dots b_{p-1})\,2^E$,
$x_+=\left((1.b_1b_2\dots b_{p-1})+\epsilon_M\right)2^E$.

Then rounding by chopping or rounding down of $x$ gives $x_-$, with maximum relative error $\epsilon_M$. Symmetric rounding or rounding to nearest of $x$ gives $x_-$ or $x_+$, whichever is nearer to $x$, with maximum relative error equal to the machine precision $\frac{1}{2}\epsilon_M=2^{-p}$.

Negative numbers x are rounded in the same way as -x.
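Round-to-nearest behaviour can be observed directly in Python (an illustrative sketch; `math.nextafter` requires Python 3.9+). The exact value $1+2^{-53}$ falls halfway between the machine numbers $x_-=1$ and $x_+=1+\epsilon_M$, and rounding to nearest (with ties to even) returns $x_-$:

```python
import math

# x = 1 + 2**-53 lies exactly halfway between the machine numbers
# x_minus = 1.0 and x_plus = 1.0 + 2**-52; round-to-nearest (ties to
# even) chooses 1.0, so the relative error is 2**-53, the machine
# precision.
x = 1.0 + 2.0 ** -53
assert x == 1.0

# x_plus is the next machine number above 1.0, at distance eps_M.
x_plus = math.nextafter(1.0, math.inf)
assert x_plus == 1.0 + 2.0 ** -52
```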

For further information see Goldberg (1991) and Overton (2001).

§3.1(ii) Interval Arithmetic

Interval arithmetic is intended for bounding the total effect of rounding errors of calculations with machine numbers. With this arithmetic the computed result can be proved to lie in a certain interval, which leads to validated computing with guaranteed and rigorous inclusion regions for the results.

Let $G$ be the set of closed intervals $\{[a,b]\}$. The elementary arithmetical operations on intervals are defined as follows:

3.1.6  $I*J=\{x*y \;|\; x\in I,\, y\in J\}$,  $I,J\in G$,

where $*\in\{+,-,\cdot,/\}$, with appropriate roundings of the end points of $I*J$ when machine numbers are being used. Division is possible only if the divisor interval does not contain zero.
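A minimal sketch of two of these operations in Python, for illustration only: a real interval library would additionally round each lower end point down and each upper end point up (directed rounding), which this sketch omits.

```python
def interval_mul(I, J):
    """Product of intervals I = [a, b] and J = [c, d]:
    the tightest interval containing {x*y : x in I, y in J}."""
    a, b = I
    c, d = J
    p = (a * c, a * d, b * c, b * d)
    return (min(p), max(p))

def interval_div(I, J):
    """Quotient I / J; defined only if J does not contain zero."""
    c, d = J
    if c <= 0.0 <= d:
        raise ZeroDivisionError("divisor interval contains zero")
    return interval_mul(I, (1.0 / d, 1.0 / c))

print(interval_mul((-1.0, 2.0), (3.0, 4.0)))   # (-4.0, 8.0)
print(interval_div((1.0, 2.0), (2.0, 4.0)))    # (0.25, 1.0)
```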

A basic text on interval arithmetic and analysis is Alefeld and Herzberger (1983), and for applications and further information see Moore (1979) and Petković and Petković (1998). The last reference includes analogs for arithmetic in the complex plane $\mathbb{C}$.

§3.1(iii) Rational Arithmetics

Computer algebra systems use exact rational arithmetic with rational numbers p/q, where p and q are multi-length integers. During the calculations common divisors are removed from the rational numbers, and the final results can be converted to decimal representations of arbitrary length. For further information see Matula and Kornerup (1980).
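Python's standard `fractions` module illustrates this exact arithmetic; common divisors are removed automatically after every operation:

```python
from fractions import Fraction

# Exact rational arithmetic: 1/3 + 1/6 = 3/6, automatically
# reduced to lowest terms.
x = Fraction(1, 3) + Fraction(1, 6)
assert x == Fraction(1, 2)
assert (x.numerator, x.denominator) == (1, 2)

# Multi-length integers: no overflow, however large the operands.
y = Fraction(10 ** 50 + 1, 10 ** 50)
print(y - 1)   # equals Fraction(1, 10**50)
```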

§3.1(iv) Level-Index Arithmetic

To eliminate overflow or underflow in finite-precision arithmetic, numbers are represented by using generalized logarithms $\ln_\ell(x)$ given by

3.1.7  $\ln_0(x)=x$,
$\ln_\ell(x)=\ln(\ln_{\ell-1}(x))$,  $\ell=1,2,\dots$,

with $x\ge 0$ and $\ell$ the unique nonnegative integer such that $a\equiv\ln_\ell(x)\in[0,1)$. In level-index arithmetic $x$ is represented by $\ell+a$ (or $-(\ell+a)$ for negative numbers). Also in this arithmetic generalized precision can be defined, which includes absolute error and relative precision (§3.1(v)) as special cases.
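The pair $(\ell,a)$ can be found by taking logarithms repeatedly until the result drops below $1$. A hypothetical helper, sketched in Python:

```python
import math

def level_index(x):
    """Return (l, a) with a = ln_l(x) in [0, 1), for x >= 0.
    l is the unique nonnegative integer with that property."""
    l = 0
    while x >= 1.0:
        x = math.log(x)
        l += 1
    return l, x

# x in [0, 1) is its own index: ln_0(x) = x, so l = 0.
assert level_index(0.5) == (0, 0.5)

# For x = 1: ln_1(1) = ln 1 = 0 lies in [0, 1), so l = 1.
assert level_index(1.0) == (1, 0.0)

print(level_index(1.0e10))  # a large x still gives a small level
```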

For further information see Clenshaw and Olver (1984) and Clenshaw et al. (1989). For applications see Lozier (1993).

For further references on level-index arithmetic (and also other arithmetics) see Anuta et al. (1996). See also Hayes (2009).

§3.1(v) Error Measures

If $x^*$ is an approximation to a real or complex number $x$, then the absolute error is

3.1.8  $\epsilon_a=|x^*-x|$.

If $x\ne 0$, the relative error is

3.1.9  $\epsilon_r=\left|\dfrac{x^*-x}{x}\right|=\dfrac{\epsilon_a}{|x|}$.

The relative precision is

3.1.10  $\epsilon_{rp}=|\ln(x^*/x)|$,

where $xx^*>0$ for real variables, and $xx^*\ne 0$ for complex variables (with the principal value of the logarithm).

The mollified error is

3.1.11  $\epsilon_m=\dfrac{|x^*-x|}{\max(|x|,1)}$.
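For real $x$ and $x^*$ with $xx^*>0$, the four measures can be computed side by side; a small Python sketch (the function name is illustrative):

```python
import math

def error_measures(x_star, x):
    """Absolute, relative, relative-precision, and mollified errors
    of the approximation x_star to x (real case, x * x_star > 0)."""
    eps_a = abs(x_star - x)
    eps_r = eps_a / abs(x)
    eps_rp = abs(math.log(x_star / x))
    eps_m = eps_a / max(abs(x), 1.0)
    return eps_a, eps_r, eps_rp, eps_m

# Approximating pi by 22/7:
ea, er, erp, em = error_measures(22.0 / 7.0, math.pi)
assert ea == abs(22.0 / 7.0 - math.pi)
# For small errors, relative error and relative precision agree
# closely, since |ln(x*/x)| ~ |x* - x| / |x|.
assert abs(er - erp) < 1e-3
```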

For error measures for complex arithmetic see Olver (1983).