next up previous
Next: Rounding Up: (DRAFT) Previous: (DRAFT)

COMPUTER ARITHMETIC

This section covers material that some years ago was extremely crucial...this is in the days of less powerful computers. It is also true, however, that the power of a computer is relative to the problems that need to be solved, and as computers become faster and more efficient, the problems we give them to solve become more demanding.

But perhaps the most important lesson to get out of this section is that computers are finite, discrete, and have engineered in them ways in which numbers are expressed or approximated. There are two important sources of errors in computation: ``loss of precision'' and ``discretization''. A little bit of knowledge about computer arithmetic and the representation of numbers on a machine is crucial to understanding the distinction between these two sources of errors, and how and when they relate or interact with each other.

In the days of simple machine calculators, small-word computer architectures, calculating with numbers was a tricky business since only a small and finite range of numbers could be represented on a small and finite-word machine. Moreover, the resolution of the number system was quite poor (i.e. what distinguished one number from a neighboring one as being different). Added to that was the problem of there being many ways of approximating numbers on the limited range. People devised very clever ways of calculating with acceptable precision, something that was especially important and motivated by the war effort...before the second world war, the calculations concerned projectile trajectories, for example: structural calculations in the large engineering projects, such as the construction of Hoover Dam, and other public works, as well as aeronautical and maritime engineering design calculations. During the second world war there was a tremendous jump in the use of technology and thus in the need for large scale calculations. Check out the history of Los Alamos National Lab, for some historical details. These days we still use finite-sized-word machines, however, the size of the word has grown to one that is ``adequate'' in most applications: the result is a bigger range of numbers and a finer resolution. It has made being especially careful in arithmetic calculations far less crucial...that is, in many instances. We still need to devise ways of approximating numbers (that are in the range of numbers of a machine) that are not part of the set representable on a particular finite-size-word machine, and the techniques to be discussed are quite old (there are very intricate techniques these days).

The goal of this section is to teach you the basics of how numbers are represented in machines; what approximation techniques can be used; some surprising facts about machine number representation (such as the fact that they may not be evenly distributed along their range); but most importantly, you will learn about the concept of ``loss of precision'' and how you should always have a mindful eye for how it could affect a calculation....after all: a computer will always give you an answer (this is a variation of garbage in/garbage out).

Reals, integers, complex, are all represented with finite precision on computers/calculators.

Floating Point Numbers: representation of numbers on a computer. Take

\begin{eqnarray*}
2347 = 2.347 \cdot 10^3\\
0.0007396 = 7.396\cdot 10^{-4}
\end{eqnarray*}



the decimal point moves or floats as the powers of 10 change.

A floating point number representation system for numbers consists of 4 integers:

\begin{eqnarray*}
\beta \mbox{ base or radix}\\
t \mbox{ Precision}\\
{[L,U]}\quad \mbox{exponent range}
\end{eqnarray*}



Base $\beta$ representation:

\begin{displaymath}
x=\pm\left(\frac{d_1}{\beta}+\frac{d_2}{\beta^2}\cdots
\fr...
...dots d_{t} \times
\beta^e}_{\mbox
{normalized representation}}
\end{displaymath}

\begin{eqnarray*}
\beta \mbox{ base or radix}\\
t \mbox{ Precision}\\
{[L,U]}\quad \mbox{exponent range}
\end{eqnarray*}




\begin{displaymath}
\mbox {where } 0\le d_i\le\beta-1, \quad i=0, \ldots t
\end{displaymath}


\begin{displaymath}
L\le e\le U
\end{displaymath}

So then the string of base $\beta$ represented by $\overbrace{d_1,
d_2\cdots d_{t}} \equiv \mbox { mantissa or significand}$ and $e \equiv \mbox { exponent or characteristic}$ The normalized representation refers to the fact that the first entry to the left on the mantissa is a $0$. As we will see shortly, there are two equally acceptable choices for normalization when dealing with base 2 arithmetic, or ``binary'' arithmetic.

In a computer: sign, exponent, mantissa stored in separate ``fields'' each field has fixed width. These fields are measured in words, which in turn have a prescribed number of bytes. Hence, a floating-point number system is finite and discrete. Schematically, a number is stored in a computer, using one bit for the sign, $N$ bits for the exponent, $M$ bits for the normalized mantissa, so that a number will have a length of $1+N+M$ bits on a particular machine.

Exercise: Show that the total number of machine numbers, for a machine operating in base $\beta$ with a mantissa of $t$ digits is $2 (\beta-1) \beta^{t-1} (U+L+1) + 1$. $\Box$

We'll use $fp(x)$ to mean $x$ in the floating point representation.

Binary Representation: here $\beta=2$ thus $0\le d_i\le 2-1$, or either 0 or 1. Example $-8_{10}=-(2\cdot 2\cdot 2)_{10}=-1.00\times 2^3$

$\Box$ Hexadecimal Representation: $\beta=16\qquad 0\le d_i\le 15$

\begin{displaymath}
d_i= 0, 1, 2, 3, \cdots 9, A, B, C, D, E, F
\end{displaymath}

Example

\begin{eqnarray*}
(23A.D)_{16} &=& 2 \times 16^2+3 \times 16^1
+10 \times 16^{0}+13\times 16^{-1}=
512+48+10+0.8125\\
&=& 570.8125
\end{eqnarray*}



$\Box$ Decimal Representation: $\beta = 10\quad 0\le d_i\le 9$

Example $-(256)_{10} = -(2 \times 10^2+
5 \times 10^2+6\times 10^0)=-256$

Example $\left(\displaystyle \frac{1}{3}\right)_{10} = (0.3333\cdots)_{10} = 0\cdot
10^0+3\cdot 10^{-1}+3\cdot 10^{-2}+ 3\cdot 10^{-3}\cdots$ $\Box$

Example A reminder of basic conversion between a binary and decimal number.

\begin{eqnarray*}
(1101.0011)_2 &=& 1\cdot 2^3+1\cdot 2^2+0\cdot 2^1+1\cdot 2^0...
...+1\cdot 2^{-4}\\
&=& 8+4+0+1+0+0+0.125+ 0.0625\\
&=& 13.1875
\end{eqnarray*}



The normalized representation of the number in binary and in decimal are, respectively, $(0.11010011)_2 \times 2^4$, and $0.131875 \times
10^2$.

Example A reminder of how to convert from decimal to binary:

What is the binary form of $x=\displaystyle \frac23$?

\begin{eqnarray*}
% latex2html id marker 97&&\frac23=(0. a_1 a_2 a_3\cdots)_2...
... a_5 =1\\
&&\frac23=(0.10101\cdots)=(1.0101\cdots)\cdot 2^{-1}
\end{eqnarray*}



How big are the fields used to store a floating point number? On every computer, it's different. Some examples,

\begin{displaymath}
\mbox{Standard for vendors}
\left\{\begin{array}{lllll}
\mb...
...99\\
\mbox {IBM 3000} & 16 & 6 & -64 & 63
\end{array}\right.
\end{displaymath}

here SP refers to ``single precision'' and DP ``double precision''. The IEEE standard is the most commonly standard, and it is used in just about every high level compiler and machine on the market. Otherwise, it can be adopted as a compiler option.

Note: double precision is usually about twice as slow as single precision calculations because it is done by software, rather than in hardware: more calculations and data movement is involved.

Some important values in a floating point scheme:

The smallest positive normalized $fp$ number is called the ``underflow level'' (UFL) and it equals $\beta^L$.
The largest positive normalized $fp$ number is the ``overflow level'', OFL= $\beta^{U+1}(1-\beta^{-t})$.

In binary, a floating point number may be represented as

\begin{displaymath}
x = (-1)^s q \times 2^m,
\end{displaymath}

Suppose we have a hypothetical computer (32-bit word length). The Figure 1 illustrates schematically how a floating point number is stored.
Figure 1: Schematic representation of the storage on a machine that uses 32-bit word length. Here the mantissa is 24-bit in length, where the first bit $=1$ (is implied, and we're going to use the convention that the first bit is an implied 1). Each section is called a field -sometimes the sign bit is assigned to the exponent field-.
\includegraphics[totalheight=1in]{num.eps}
Computers use normalized binary representation (normalized, because this way the non-fraction part is always 0, or 1, therefore we not to have to store it...it's implied).

We decide to normalize with an implied $1$ to the left of the radix point. Hence, $q=(1.f)_2$, $s=0 \rightarrow negative$, or $s=1 \rightarrow
positive$. $m=e-127$ which is known as ``8-bit biased exponent''. The range of $0 < e < (11 \quad 111 \quad 111)_2=2^8-1=255$; the number $0$ is reserved for $\pm 0$, and $255$ is reserved for $\pm \infty$, and/or NaN, ``not a number''. Since $m=e-127$, then $-126 \le m \le
127$. The smallest positive number is $2^{-126} \approx 1.2 \times
10^{-38}$ and the largest is $(2-2^{-23}) \times 2^{127} \approx 3.4
\times 10^{38}$. The least significant bit is $2^{-23}$. Notice that the number set would not be evenly distributed on this machine. There are more clustered around $0$, since gaps of powers of $2$ are smaller near $0$ and larger closer to $2$. And of course, the number of floating points available on this machine is finite!...

Suppose we want to express $1/100$, say, in...is this number in the set available to the machine? (no) If not, what do we do?

$\Box$

First some definitions:

Def: Machine Number a number that is exactly representable (identical) as one belonging to the set of numbers available on that particular machine. Numbers that do not belong to this set must be approximated (processed) and inevitably, loss of information occurs.

Suppose we are comparing two numbers $x$ and $\hat x$. Typically, we think of $x$ as the ``exact'' number and $\hat x$ as an approximation. Then, the

Def: Absolute Error is the absolute difference between $x$ and $\hat x$.

Def: Relative Error is the absolute error divided by the magnitude of $x$.



Subsections
next up previous
Next: Rounding Up: (DRAFT) Previous: (DRAFT)
Juan Restrepo 2003-04-12