This section covers material that some years ago was extremely crucial...this is in the days of less powerful computers. It is also true, however, that the power of a computer is relative to the problems that need to be solved, and as computers become faster and more efficient, the problems we give them to solve become more demanding.
But perhaps the most important lesson to get out of this section is that computers are finite, discrete, and have engineered in them ways in which numbers are expressed or approximated. There are two important sources of errors in computation: ``loss of precision'' and ``discretization''. A little bit of knowledge about computer arithmetic and the representation of numbers on a machine is crucial to understanding the distinction between these two sources of errors, and how and when they relate or interact with each other.
In the days of simple machine calculators, small-word computer architectures, calculating with numbers was a tricky business since only a small and finite range of numbers could be represented on a small and finite-word machine. Moreover, the resolution of the number system was quite poor (i.e. what distinguished one number from a neighboring one as being different). Added to that was the problem of there being many ways of approximating numbers on the limited range. People devised very clever ways of calculating with acceptable precision, something that was especially important and motivated by the war effort...before the second world war, the calculations concerned projectile trajectories, for example: structural calculations in the large engineering projects, such as the construction of Hoover Dam, and other public works, as well as aeronautical and maritime engineering design calculations. During the second world war there was a tremendous jump in the use of technology and thus in the need for large scale calculations. Check out the history of Los Alamos National Lab, for some historical details. These days we still use finite-sized-word machines, however, the size of the word has grown to one that is ``adequate'' in most applications: the result is a bigger range of numbers and a finer resolution. It has made being especially careful in arithmetic calculations far less crucial...that is, in many instances. We still need to devise ways of approximating numbers (that are in the range of numbers of a machine) that are not part of the set representable on a particular finite-size-word machine, and the techniques to be discussed are quite old (there are very intricate techniques these days).
The goal of this section is to teach you the basics of how numbers are represented in machines; what approximation techniques can be used; some surprising facts about machine number representation (such as the fact that they may not be evenly distributed along their range); but most importantly, you will learn about the concept of ``loss of precision'' and how you should always have a mindful eye for how it could affect a calculation....after all: a computer will always give you an answer (this is a variation of garbage in/garbage out).
Reals, integers, complex, are all represented with finite precision on computers/calculators.
Floating Point Numbers: representation of numbers on a computer. Take
the decimal point moves or floats as the powers of 10 change.
A floating point number representation system for numbers consists of 4 integers:
Base
representation:
and
In a computer: sign, exponent, mantissa stored in separate
``fields'' each field has fixed width. These fields are measured in
words, which in turn have a prescribed number of bytes. Hence,
a floating-point number system is
finite and discrete.
Schematically, a number is stored in a computer, using one bit for the
sign,
bits for the exponent,
bits for the normalized mantissa, so
that a number will have a length of
bits on a particular machine.
Exercise: Show that the total number of machine numbers, for a
machine operating in base
with a mantissa of
digits is
.
We'll use
to mean
in the floating point representation.
Binary Representation: here
thus
, or
either 0 or 1.
Example
Hexadecimal Representation:
Example
Example
Example A reminder of basic conversion between a binary and decimal number.
Example A reminder of how to convert from decimal to binary:
What is the binary form of
?
How big are the fields used to store a floating point number?
On every computer, it's different. Some examples,
Note: double precision is usually about twice as slow as single precision calculations because it is done by software, rather than in hardware: more calculations and data movement is involved.
Some important values in a floating point scheme:
The smallest positive normalized
number is called the
``underflow level'' (UFL) and it equals
.
The largest positive normalized
number is the ``overflow level'',
OFL=
.
In binary, a floating point number may be represented as
|
We decide to normalize with an implied
to the left of the radix point.
Hence,
,
, or
.
which is known as ``8-bit biased exponent''.
The range of
; the number
is reserved for
, and
is reserved for
,
and/or NaN, ``not a number''. Since
, then
. The smallest positive number is
and the largest is
. The least significant bit is
.
Notice that the number set would not be
evenly distributed on this machine.
There are more clustered around
, since gaps of powers of
are smaller near
and larger closer to
. And of course, the
number of floating points available on this machine is finite!...
Suppose we want to express
, say, in...is this number
in the set available to the machine? (no) If not, what do we do?
First some definitions:
Def: Machine Number a number that is exactly representable (identical) as one belonging to the set of numbers available on that particular machine. Numbers that do not belong to this set must be approximated (processed) and inevitably, loss of information occurs.
Suppose we are comparing two numbers
and
. Typically, we
think of
as the ``exact'' number and
as an
approximation. Then, the
Def: Absolute Error is the absolute difference between
and
.
Def: Relative Error is the absolute error divided by the magnitude of
.