- §3.1(i) Floating-Point Arithmetic
- §3.1(ii) Interval Arithmetic
- §3.1(iii) Rational Arithmetics
- §3.1(iv) Level-Index Arithmetic
- §3.1(v) Error Measures

Computer arithmetic is described for the *binary* based system with base
2; another system that has been used is the *hexadecimal* system with base
16.

A nonzero *normalized binary floating-point machine number* $x$ is
represented as

$$x=(-1)^{s}\cdot(b_{0}.b_{1}b_{2}\dots b_{p-1})\cdot 2^{E},\qquad b_{0}=1, \tag{3.1.1}$$

where $s$ is equal to $1$ or $0$, each ${b}_{j}$, $j\ge 1$, is either $0$ or $1$,
${b}_{1}$ is the *most significant bit*, $p$ ($\in \mathbb{N}$) is the number of
significant bits ${b}_{j}$, ${b}_{p-1}$ is the *least significant bit*, $E$ is
an integer called the *exponent*, ${b}_{0}.{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1}$ is the
*significand*, and $f=\mathrm{.}{b}_{1}{b}_{2}\mathrm{\dots}{b}_{p-1}$ is the *fractional
part*.
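
The representation (3.1.1) can be inspected directly for a concrete number. The following is a minimal sketch, assuming Python's `float` is IEEE binary64 ($p=53$); the helper name `decompose` is hypothetical and is valid only for normalized numbers.

```python
import math

def decompose(x: float, p: int = 53):
    """Return (s, bits, E) with x = (-1)**s * (b0.b1...b_{p-1}) * 2**E, b0 = 1.

    Hypothetical helper; assumes x is a normalized binary64 number.
    """
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("x must be a nonzero finite number")
    s = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))        # abs(x) = m * 2**e with 0.5 <= m < 1
    E = e - 1                        # rescale so the significand lies in [1, 2)
    n = int(m * 2 ** p)              # exact: the p significand bits as an integer
    bits = format(n, "b").zfill(p)   # the string b0 b1 ... b_{p-1}
    return s, bits, E

s, bits, E = decompose(-0.75)
print(s, E, bits[0] + "." + bits[1:])
```

For $x=-0.75$ the sketch yields $s=1$, $E=-1$, and a significand beginning $1.1000\dots$, since $0.75 = 1.1_2 \cdot 2^{-1}$.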

The set of *machine numbers* ${\mathbb{R}}_{\mathrm{fl}}$ is the union of $0$ and
the set

$$(-1)^{s}2^{E}\sum_{j=0}^{p-1}b_{j}2^{-j}, \tag{3.1.2}$$

with ${b}_{0}=1$ and all allowable choices of $E$, $p$, $s$, and ${b}_{j}$.

Let ${E}_{\mathrm{min}}\le E\le {E}_{\mathrm{max}}$ with ${E}_{\mathrm{min}}<0$ and
${E}_{\mathrm{max}}>0$. For given values of ${E}_{\mathrm{min}}$, ${E}_{\mathrm{max}}$, and
$p$, the *format width in bits* $N$ of a computer word is the total number
of bits:
the sign (one bit), the significant bits ${b}_{1},{b}_{2},\mathrm{\dots},{b}_{p-1}$ ($p-1$
bits), and the bits allocated to the exponent (the remaining $N-p$ bits). The
integers $p$, ${E}_{\mathrm{min}}$, and ${E}_{\mathrm{max}}$ are characteristics of the
machine. The *machine epsilon* $\epsilon_{M}$, that is, the distance
between $1$ and the next larger machine number with $E=0$, is given by
$\epsilon_{M}={2}^{-p+1}$. The *machine precision* is
$\frac{1}{2}\epsilon_{M}={2}^{-p}$. The lower and upper bounds for the absolute
values of the nonzero machine numbers are given by

$$N_{\mathrm{min}}\equiv 2^{E_{\mathrm{min}}}\le |x|\le 2^{E_{\mathrm{max}}+1}\left(1-2^{-p}\right)\equiv N_{\mathrm{max}}. \tag{3.1.3}$$

*Underflow (overflow)* after computing $x\ne 0$ occurs when $|x|$ is
smaller (larger) than ${N}_{\mathrm{min}}$ (${N}_{\mathrm{max}}$).

The current floating-point arithmetic standard is IEEE 754-2019 (IEEE, 2019),
a minor technical revision of IEEE 754-2008 (IEEE, 2008), which was adopted
in 2011 by the International Organization for Standardization as ISO/IEC/IEEE 60559.
In the case of the normalized binary interchange formats, the representations of data
for *binary32* (previously *single precision*)
($N=32$, $p=24$, ${E}_{\mathrm{min}}=-126$, ${E}_{\mathrm{max}}=127$),
*binary64* (previously *double precision*)
($N=64$, $p=53$, ${E}_{\mathrm{min}}=-1022$,
${E}_{\mathrm{max}}=1023$) and
*binary128* (previously *quad precision*)
($N=128$, $p=113$, ${E}_{\mathrm{min}}=-16382$,
${E}_{\mathrm{max}}=16383$) are as in Figure 3.1.1. The
respective machine precisions are
$\frac{1}{2}\epsilon_{M}=0.596\times {10}^{-7}$,
$\frac{1}{2}\epsilon_{M}=0.111\times {10}^{-15}$ and
$\frac{1}{2}\epsilon_{M}=0.963\times {10}^{-34}$.
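
The binary64 characteristics can be checked numerically. The sketch below assumes that Python's `float` is binary64, which holds for CPython on common platforms but is an assumption here. Note that `sys.float_info` follows the C convention of a significand in $[\tfrac12,1)$, so its `min_exp` and `max_exp` are each one larger than ${E}_{\mathrm{min}}$ and ${E}_{\mathrm{max}}$ as defined in this section.

```python
import sys

fi = sys.float_info
p = fi.mant_dig              # number of significant bits (53 for binary64)
E_min = fi.min_exp - 1       # shift from the C convention to significands in [1, 2)
E_max = fi.max_exp - 1

eps_M = 2.0 ** (-p + 1)      # machine epsilon
precision = eps_M / 2        # machine precision, 2**(-p)
N_min = 2.0 ** E_min         # smallest positive normalized number
# N_max = 2**(E_max+1) * (1 - 2**(-p)), rearranged because 2.0**(E_max+1) overflows
N_max = 2.0 ** E_max * (2.0 - 2.0 ** (1 - p))

print(p, E_min, E_max)
print(eps_M, precision, N_min, N_max)
```

With round-to-nearest, `1.0 + eps_M` is the next machine number after `1.0`, whereas `1.0 + eps_M / 2` rounds back to `1.0`.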

Let $x$ be any positive number with

$$x=(1.b_{1}b_{2}\dots b_{p-1}b_{p}b_{p+1}\dots)\cdot 2^{E}, \tag{3.1.4}$$

${N}_{\mathrm{min}}\le x\le {N}_{\mathrm{max}}$, and

$$\begin{aligned}x_{-}&=(1.b_{1}b_{2}\dots b_{p-1})\cdot 2^{E},\\ x_{+}&=((1.b_{1}b_{2}\dots b_{p-1})+\epsilon_{M})\cdot 2^{E}.\end{aligned} \tag{3.1.5}$$

Then *rounding by chopping* or *rounding down* of $x$ gives ${x}_{-}$,
with maximum relative error $\epsilon_{M}$. *Symmetric rounding* or
*rounding to nearest* of $x$ gives ${x}_{-}$ or ${x}_{+}$, whichever is nearer
to $x$, with maximum relative error equal to the machine precision
$\frac{1}{2}\epsilon_{M}={2}^{-p}$.

Negative numbers $x$ are rounded in the same way as $-x$.
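
The two rounding rules can be illustrated with exact rational arithmetic, which avoids any rounding in the illustration itself. This is a sketch under stated assumptions: the names `neighbours`, `chop`, and `round_nearest` are hypothetical, exponent limits are ignored, and a tie is resolved toward the neighbour with an even last bit, a choice the text above does not specify.

```python
from fractions import Fraction

def neighbours(x: Fraction, p: int):
    """Return (x_minus, x_plus) as in (3.1.5) for x > 0 and p significant bits."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Find E with 2**E <= x < 2**(E+1) using integer bit lengths only.
    E = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** E > x:
        E -= 1
    ulp = Fraction(2) ** (E - p + 1)   # eps_M * 2**E
    x_minus = (x // ulp) * ulp         # drop the bits b_p, b_{p+1}, ...
    return x_minus, x_minus + ulp

def chop(x: Fraction, p: int) -> Fraction:
    """Rounding by chopping (rounding down)."""
    return neighbours(x, p)[0]

def round_nearest(x: Fraction, p: int) -> Fraction:
    """Rounding to nearest; ties go to the even neighbour (an assumption)."""
    lo, hi = neighbours(x, p)
    if x - lo != hi - x:
        return lo if x - lo < hi - x else hi
    return lo if (lo / (hi - lo)) % 2 == 0 else hi

x = Fraction(1, 10)
print(float((x - chop(x, 53)) / x), float(abs(round_nearest(x, 53) - x) / x))
```

For $x=1/10$ and $p=53$ the nearest-rounded value coincides with the binary64 number that Python stores for the literal `0.1`, and the two relative errors stay below $\epsilon_{M}$ and $\frac{1}{2}\epsilon_{M}$ respectively.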

*Interval arithmetic* is intended for bounding the total effect of
rounding errors of calculations with machine numbers. With this arithmetic the
computed result can be proved to lie in a certain interval, which leads to
*validated computing* with guaranteed and rigorous inclusion regions for
the results.

Let $G$ be the set of closed intervals $\{[a,b]\}$. The elementary arithmetical operations on intervals are defined as follows:

$$I\ast J=\{x\ast y \mid x\in I,\ y\in J\},\qquad I,J\in G, \tag{3.1.6}$$

where $\ast\in \{+,-,\cdot,/\}$, with appropriate roundings of the endpoints of $I\ast J$ when machine numbers are being used. Division is possible only if the divisor interval does not contain zero.

A basic text on interval arithmetic and analysis is Alefeld and Herzberger (1983), and for applications and further information see Moore (1979) and Petković and Petković (1998). The last reference includes analogs for arithmetic in the complex plane $\mathbb{C}$. For standardized interval arithmetic, see the IEEE Standards for Interval Arithmetic (IEEE, 2015, 2018).
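
The operations (3.1.6) with outward rounding can be sketched as follows. The `Interval` class is hypothetical and is not the API of any IEEE 1788 implementation. Because Python does not expose directed rounding modes, each computed endpoint is widened by one machine number with `math.nextafter` (Python 3.9 or later). This is cruder than true directed rounding but still encloses the exact result, since a round-to-nearest value differs from the exact one by less than the gap to its neighbour.

```python
import math
from dataclasses import dataclass
from fractions import Fraction

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    @staticmethod
    def _outward(lo: float, hi: float) -> "Interval":
        # Widen each endpoint by one machine number (stand-in for directed rounding).
        return Interval(math.nextafter(lo, -math.inf), math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._outward(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval._outward(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        c = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval._outward(min(c), max(c))

    def __truediv__(self, other):
        if other.lo <= 0.0 <= other.hi:
            raise ZeroDivisionError("divisor interval contains zero")
        c = [a / b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval._outward(min(c), max(c))

    def __contains__(self, x) -> bool:
        return self.lo <= x <= self.hi

third = Interval(1.0, 1.0) / Interval(3.0, 3.0)
print(third, Fraction(1, 3) in third)
```

The exact value $1/3$ is not a machine number, yet it is guaranteed to lie in the computed interval `third`.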

Computer algebra systems use *exact rational arithmetic* with rational
numbers $p/q$, where $p$ and $q$ are multi-length integers. During the
calculations common divisors are removed from the rational numbers, and the
final results can be converted to decimal representations of arbitrary length.
For further information see Matula and Kornerup (1980).
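
As an illustration, Python's standard `fractions.Fraction` type behaves in this way: numerators and denominators are arbitrary-length integers and common divisors are removed automatically. The helper `to_decimal` below is hypothetical; it is a sketch that truncates (rather than rounds) the decimal expansion, using integer arithmetic only.

```python
from fractions import Fraction

def to_decimal(q: Fraction, digits: int) -> str:
    """Decimal expansion of q, truncated to `digits` >= 1 places."""
    sign = "-" if q < 0 else ""
    n, d = abs(q.numerator), q.denominator
    whole, rem = divmod(n, d)
    frac = rem * 10 ** digits // d       # integer arithmetic, no rounding error
    return f"{sign}{whole}.{frac:0{digits}d}"

# Common divisors are removed: 1/3 + 1/6 is stored as 1/2, not 3/6.
half = Fraction(1, 3) + Fraction(1, 6)

# A partial harmonic sum stays exact however many terms are added.
h10 = sum(Fraction(1, k) for k in range(1, 11))

print(half, h10, to_decimal(Fraction(1, 7), 30))
```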

To eliminate overflow or underflow in finite-precision arithmetic, numbers are
represented by using *generalized logarithms* ${\mathrm{ln}}_{\mathrm{\ell}}(x)$ given by

$$\begin{aligned}\ln_{0}(x)&=x,\\ \ln_{\ell}(x)&=\ln\left(\ln_{\ell-1}(x)\right),\qquad \ell=1,2,\dots,\end{aligned} \tag{3.1.7}$$

with $x\ge 0$ and $\mathrm{\ell}$ the unique nonnegative integer such that
$a\equiv {\mathrm{ln}}_{\mathrm{\ell}}(x)\in [0,1)$. In *level-index arithmetic* $x$ is
represented by $\mathrm{\ell}+a$ (or $-(\mathrm{\ell}+a)$ for negative numbers).
Also in this arithmetic *generalized precision* can be defined, which includes
absolute error and relative precision (§3.1(v)) as special cases.
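
The mapping $x\mapsto \ell+a$ and its inverse can be sketched in ordinary floating-point arithmetic. The function names below are hypothetical, and the sketch only converts to and from the level-index image; it does not implement arithmetic operations carried out in level-index form.

```python
import math

def to_level_index(x: float) -> float:
    """Return l + a (negated for x < 0), where a = ln_l(|x|) lies in [0, 1)."""
    sign = -1.0 if x < 0 else 1.0
    a, level = abs(x), 0
    while a >= 1.0:          # take logarithms until the value drops into [0, 1)
        a = math.log(a)
        level += 1
    return sign * (level + a)

def from_level_index(y: float) -> float:
    """Invert to_level_index by applying exp `level` times to the index a."""
    sign = -1.0 if y < 0 else 1.0
    level = math.floor(abs(y))
    a = abs(y) - level
    for _ in range(level):
        a = math.exp(a)
    return sign * a

for x in (0.5, 1.0, 10.0, 1e100):
    print(x, to_level_index(x))
```

Even $10^{100}$ has level $\ell=4$, which indicates how slowly the level grows with the magnitude of $x$.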

If ${x}^{\ast}$ is an approximation to a real or complex number $x$, then the
*absolute error* is

$$\epsilon_{a}=\left|x^{\ast}-x\right|. \tag{3.1.8}$$

If $x\ne 0$, the *relative error* is

$$\epsilon_{r}=\left|\frac{x^{\ast}-x}{x}\right|=\frac{\epsilon_{a}}{\left|x\right|}. \tag{3.1.9}$$

The *relative precision* is

$$\epsilon_{\mathit{rp}}=\left|\ln\left(x^{\ast}/x\right)\right|, \tag{3.1.10}$$

where $x{x}^{\ast}>0$ for real variables, and $x{x}^{\ast}\ne 0$ for complex variables (with the principal value of the logarithm).

The *mollified error* is

$$\epsilon_{m}=\frac{\left|x^{\ast}-x\right|}{\max(\left|x\right|,1)}. \tag{3.1.11}$$

For error measures for complex arithmetic see Olver (1983).
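
The four measures (3.1.8)–(3.1.11) translate directly into code. The function names in this sketch are hypothetical; `cmath.log` supplies the principal value of the logarithm, so the same functions accept real or complex arguments.

```python
import cmath

def absolute_error(x_star, x):
    return abs(x_star - x)

def relative_error(x_star, x):
    if x == 0:
        raise ValueError("relative error requires x != 0")
    return abs((x_star - x) / x)

def relative_precision(x_star, x):
    is_real = not isinstance(x, complex) and not isinstance(x_star, complex)
    if (is_real and x * x_star <= 0) or x * x_star == 0:
        raise ValueError("requires x*x_star > 0 (real) or != 0 (complex)")
    return abs(cmath.log(x_star / x))   # principal value of the logarithm

def mollified_error(x_star, x):
    return abs(x_star - x) / max(abs(x), 1)

x, x_star = 100.0, 101.0
print(absolute_error(x_star, x), relative_error(x_star, x),
      relative_precision(x_star, x), mollified_error(x_star, x))
```

For small errors the relative precision is close to the relative error, because $\ln(1+\delta)\approx\delta$. The mollified error behaves like the relative error when $|x|\ge 1$ and like the absolute error when $|x|<1$.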