- §3.1(i) Floating-Point Arithmetic
- §3.1(ii) Interval Arithmetic
- §3.1(iii) Rational Arithmetics
- §3.1(iv) Level-Index Arithmetic
- §3.1(v) Error Measures

Computer arithmetic is described for the binary based system with base 2; another frequently used system is the hexadecimal system with base 16.

A nonzero normalized binary floating-point machine number $x$ is represented as

3.1.1 | $$x={(-1)}^{s}\cdot ({b}_{0}\mathrm{.}{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1})\cdot {2}^{E},$$ | ||

${b}_{0}=1$, | |||

where $s$ is equal to $1$ or $0$, each ${b}_{j}$, $j\ge 1$, is either $0$ or $1$, ${b}_{1}$ is the most significant bit, $p$ ($\in \mathbb{N}$) is the number of significant bits ${b}_{j}$, ${b}_{p-1}$ is the least significant bit, $E$ is an integer called the exponent, ${b}_{0}.{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1}$ is the significand, and $f=\mathrm{.}{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1}$ is the fractional part.

The set of machine numbers ${\mathbb{R}}_{\mathrm{fl}}$ is the union of $0$ and the set

3.1.2 | $${(-1)}^{s}{2}^{E}\sum _{j=0}^{p-1}{b}_{j}{2}^{-j},$$ | ||

with ${b}_{0}=1$ and all allowable choices of $E$, $p$, $s$, and ${b}_{j}$.

Let ${E}_{\mathrm{min}}\le E\le {E}_{\mathrm{max}}$ with $$ and ${E}_{\mathrm{max}}>0$. For given values of ${E}_{\mathrm{min}}$, ${E}_{\mathrm{max}}$, and $p$, the format width in bits $N$ of a computer word is the total number of bits: the sign (one bit), the significant bits ${b}_{1},{b}_{2},\mathrm{\dots},{b}_{p-1}$ ($p-1$ bits), and the bits allocated to the exponent (the remaining $N-p$ bits). The integers $p$, ${E}_{\mathrm{min}}$, and ${E}_{\mathrm{max}}$ are characteristics of the machine. The machine epsilon ${\u03f5}_{M}$, that is, the distance between $1$ and the next larger machine number with $E=0$ is given by ${\u03f5}_{M}={2}^{-p+1}$. The machine precision is $\frac{1}{2}{\u03f5}_{M}={2}^{-p}$. The lower and upper bounds for the absolute values of the nonzero machine numbers are given by

3.1.3 | $${N}_{\mathrm{min}}\equiv {2}^{{E}_{\mathrm{min}}}\le |x|\le {2}^{{E}_{\mathrm{max}}+1}\left(1-{2}^{-p}\right)\equiv {N}_{\mathrm{max}}.$$ | ||

Underflow (overflow) after computing $x\ne 0$ occurs when $|x|$ is smaller (larger) than ${N}_{\mathrm{min}}$ (${N}_{\mathrm{max}}$).

The current standard is the ANSI/IEEE Standard 754; see IEEE (1985, §§1–4). In the case of normalized binary representation the memory positions for single precision ($N=32$, $p=24$, ${E}_{\mathrm{min}}=-126$, ${E}_{\mathrm{max}}=127$) and double precision ($N=64$, $p=53$, ${E}_{\mathrm{min}}=-1022$, ${E}_{\mathrm{max}}=1023$) are as in Figure 3.1.1. The respective machine precisions are $\frac{1}{2}{\u03f5}_{M}=0.596\times {10}^{-7}$ and $\frac{1}{2}{\u03f5}_{M}=0.111\times {10}^{-15}$.

Let $x$ be any positive number with

3.1.4 | $$x=(1.{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1}{b}_{p}{b}_{p+1}\mathrm{\dots})\cdot {2}^{E},$$ | ||

${N}_{\mathrm{min}}\le x\le {N}_{\mathrm{max}}$, and

3.1.5 | ${x}_{-}$ | $=(1.{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1})\cdot {2}^{E},$ | ||

${x}_{+}$ | $=((1.{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1})+{\u03f5}_{M})\cdot {2}^{E}.$ | |||

Then rounding by chopping or rounding down of $x$ gives ${x}_{-}$, with maximum relative error ${\u03f5}_{M}$. Symmetric rounding or rounding to nearest of $x$ gives ${x}_{-}$ or ${x}_{+}$, whichever is nearer to $x$, with maximum relative error equal to the machine precision $\frac{1}{2}{\u03f5}_{M}={2}^{-p}$.

Negative numbers $x$ are rounded in the same way as $-x$.

Interval arithmetic is intended for bounding the total effect of rounding errors of calculations with machine numbers. With this arithmetic the computed result can be proved to lie in a certain interval, which leads to validated computing with guaranteed and rigorous inclusion regions for the results.

Let $G$ be the set of closed intervals $\{[a,b]\}$. The elementary arithmetical operations on intervals are defined as follows:

3.1.6 | $$I*J=\{x*y|x\in I,y\in J\},$$ | ||

$I,J\in G$, | |||

where $\mathrm{*}\in \{\mathrm{+},\mathrm{-},\mathrm{\cdot},\mathrm{/}\}$, with appropriate roundings of the end points of $I*J$ when machine numbers are being used. Division is possible only if the divisor interval does not contain zero.

Computer algebra systems use exact rational arithmetic with rational numbers $p/q$, where $p$ and $q$ are multi-length integers. During the calculations common divisors are removed from the rational numbers, and the final results can be converted to decimal representations of arbitrary length. For further information see Matula and Kornerup (1980).

To eliminate overflow or underflow in finite-precision arithmetic numbers are represented by using generalized logarithms ${\mathrm{ln}}_{\mathrm{\ell}}(x)$ given by

3.1.7 | ${\mathrm{ln}}_{0}(x)$ | $=x,$ | ||

${\mathrm{ln}}_{\mathrm{\ell}}(x)$ | $=\mathrm{ln}\left({\mathrm{ln}}_{\mathrm{\ell}-1}(x)\right),$ | |||

$\mathrm{\ell}=1,2,\mathrm{\dots}$, | ||||

with $x\ge 0$ and $\mathrm{\ell}$ the unique nonnegative integer such that $a\equiv {\mathrm{ln}}_{\mathrm{\ell}}(x)\in [0,1)$. In level-index arithmetic $x$ is represented by $\mathrm{\ell}+a$ (or $-(\mathrm{\ell}+a)$ for negative numbers). Also in this arithmetic generalized precision can be defined, which includes absolute error and relative precision (§3.1(v)) as special cases.

If ${x}^{*}$ is an approximation to a real or complex number $x$, then the absolute error is

3.1.8 | $${\u03f5}_{a}=\left|{x}^{*}-x\right|.$$ | ||

If $x\ne 0$, the relative error is

3.1.9 | $${\u03f5}_{r}=\left|\frac{{x}^{*}-x}{x}\right|=\frac{{\u03f5}_{a}}{\left|x\right|}.$$ | ||

The relative precision is

3.1.10 | $${\u03f5}_{\mathit{rp}}=\left|\mathrm{ln}\left({x}^{*}/x\right)\right|,$$ | ||

where $x{x}^{*}>0$ for real variables, and $x{x}^{*}\ne 0$ for complex variables (with the principal value of the logarithm).

The mollified error is

3.1.11 | $${\u03f5}_{m}=\frac{\left|{x}^{*}-x\right|}{\mathrm{max}(\left|x\right|,1)}.$$ | ||

For error measures for complex arithmetic see Olver (1983).