**USE OF GIS IN ANALYZING
ENVIRONMENTAL CANCER RISKS AS A FUNCTION of GEOGRAPHIC SCALE**

Zheng Cai
under the supervision of Prof. D. Myers and Seumas
Rogan

Spring 2003

__Introduction__

The Atlas of Cancer Mortality in

In general, the goal of this three-year
research project is to examine the geographic variation in the relationship
between cancer risk and arsenic in

This report presents research on that
geographic variation, using GIS software to analyze data for

__Methods applied__

A geographic information system (GIS) is a
type of mapping software that links data on real-world objects with an onscreen
map. It has four main uses: data creation, data display, analysis, and
output. In practice, it is used to connect multiple sources
of georeferenced health statistics data. Insights concerning diverse
health-environment-behavior interactions can be derived by identifying
clusters of cancer incidence and then comparing them with cluster locations in
the mapped distribution of arsenic. The “Geostatistical Analyst” feature in
ArcMap version 8.1, built by ESRI Inc., was used for the geostatistical analyses
reported below.

Interpolation techniques are mainly
categorized as deterministic or stochastic (geostatistical). Deterministic
interpolation creates new surfaces from the measured points, based on
either the extent of similarity or the degree of smoothing. It can be divided
into two subgroups: global and local. Geostatistical interpolation instead
utilizes the statistical properties of the measured points.

**1. Inverse Distance Weighted (IDW)**

Inverse Distance Weighted (IDW) interpolation assumes that values at locations
that are close to one another are more alike than those that are farther apart.
It uses the measured values surrounding the prediction
location to predict a value for any unmeasured location. In IDW, measured
values closer to the prediction location have more influence on the predicted
value than those farther away from it.
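The weighting idea can be sketched in a few lines of Python. This is a minimal illustration, not the ArcMap implementation: the function name `idw_predict`, the sample wells, and the plain nearest-neighbor search (rather than an elliptical quadrant search window) are all assumptions made for the example.

```python
import numpy as np

def idw_predict(xy, values, target, power=2.0, n_neighbors=15):
    """Predict a value at `target` from the nearest measured points (hypothetical helper)."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    # Distance from every measured point to the prediction location
    dist = np.linalg.norm(xy - np.asarray(target, dtype=float), axis=1)
    # Keep only the closest points (simple search, no quadrants)
    nearest = np.argsort(dist)[:n_neighbors]
    d, v = dist[nearest], values[nearest]
    # A prediction exactly on a measured point returns that measurement
    if d[0] == 0.0:
        return float(v[0])
    # Weights fall off with distance raised to `power`
    w = 1.0 / d ** power
    return float(np.sum(w * v) / np.sum(w))

# Hypothetical sample: four wells at the corners of a unit square
wells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
arsenic = [2.0, 4.0, 6.0, 8.0]
center = idw_predict(wells, arsenic, (0.5, 0.5))
```

At the center of the square all four wells are equally distant, so the prediction is their plain average; moving the target toward one well pulls the prediction toward that well's value.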

In this experiment, IDW was
conducted on the 170 points in the training set using a power of 2 and the
neighborhood method. The neighborhood included 15 neighbors (at least 10),
using the elliptical (quadrant) search window. An example
of this method is shown below:

**2. Ordinary Kriging**

Ordinary Kriging
assumes the model Z(s) = μ + ε(s), where μ is an unknown
constant and the ε(s) are random fluctuations. It allows for local
influences due to nearby neighborhood values. It produces prediction, quantile,
probability, or standard error maps from data points that are continuous in
space. Because the mean is unknown, few assumptions need to be made for
Ordinary Kriging, which makes this method particularly flexible.

In this analysis,
Ordinary Kriging was conducted on the 170 points in
the training set using a Spherical model variogram, an
automatic lag size of 4364.1, and 12 lags. We also used the neighborhood method,
including 5 neighbors (at least 2), using shape type . The equation for the variogram
is

γ(h) = 161.77*Spherical(20473) + 32.728*Nugget,

where the Nugget effect
is the sum of measurement error and small-scale irregularities (microscale variation). Because either component can be
zero, the Nugget effect can consist wholly of one or the other.
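To make the fitted model concrete, the sketch below evaluates a spherical variogram with the three numbers quoted above. It assumes the usual reading of the software notation (161.77 as the partial sill, 20473 as the range in map units, 32.728 as the nugget) and the standard spherical formula; the function name and the sample lags are illustrative only.

```python
import numpy as np

NUGGET = 32.728        # from the fitted model in the text
PARTIAL_SILL = 161.77  # from the fitted model in the text (assumed to be the partial sill)
RANGE = 20473.0        # from the fitted model in the text (assumed to be the range)

def spherical_variogram(h, nugget=NUGGET, partial_sill=PARTIAL_SILL, a=RANGE):
    """Semivariance at lag distance h for a spherical model with a nugget."""
    h = np.asarray(h, dtype=float)
    # Beyond the range the model stays flat at nugget + partial sill
    ratio = np.minimum(h / a, 1.0)
    gamma = nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)
    # By convention the semivariance is exactly zero at zero distance
    return np.where(h == 0.0, 0.0, gamma)

# Illustrative lags: zero, one lag size, the range, and a distance past the range
lags = np.array([0.0, 4364.1, 20473.0, 40000.0])
gamma = spherical_variogram(lags)
```

Under these assumptions the curve jumps from zero to the nugget just off the origin, rises until the range, and then levels off at the sill (nugget plus partial sill).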

The plot of the
experimental variogram and the fitted model is shown
in the following figure.

The figure above shows
a typical search neighborhood for Ordinary Kriging.

**3. Local Polynomial Interpolation (mean value)**

The conceptual basis
for Local Polynomial interpolation is to fit many smaller overlapping planes,
and then use the center of each plane as the prediction for each location in
the study area. The resulting surface will be more flexible, and perhaps more
accurate, than a single plane fitted to all points. This interpolation fits
many polynomials, each within a specified overlapping neighborhood. Local
Polynomial Interpolation is sensitive to the neighborhood distance.
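A rough sketch of the idea follows: fit a distance-weighted plane around the prediction location and read off its value at the center. The Gaussian weighting and the `bandwidth` parameter are hypothetical choices for illustration and are not claimed to match the weighting used by the software.

```python
import numpy as np

def local_linear_predict(xy, values, target, bandwidth):
    """Fit a weighted plane centered on `target` and return its value there."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    # Center coordinates on the target so the intercept is the prediction
    offsets = xy - np.asarray(target, dtype=float)
    dist = np.linalg.norm(offsets, axis=1)
    # Hypothetical Gaussian weights: nearby points count more
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)
    design = np.column_stack([np.ones(len(xy)), offsets])
    # Weighted least squares via square-root weights
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], values * sw, rcond=None)
    return float(coef[0])

# Illustrative data lying exactly on the plane z = 1 + 2x + 3y
gx, gy = np.meshgrid(np.arange(4.0), np.arange(4.0))
grid = np.column_stack([gx.ravel(), gy.ravel()])
plane = 1.0 + 2.0 * grid[:, 0] + 3.0 * grid[:, 1]
pred = local_linear_predict(grid, plane, (0.3, 0.7), bandwidth=1.0)
```

Changing `bandwidth` changes how quickly distant points lose influence, which is one way to see why the method is sensitive to the neighborhood distance.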

In the experiment,
Local Polynomial Interpolation was conducted on the 170 points in the training
set using a weight of 125644.96 and a power of 1. It also used the neighborhood
method, including 165 neighbors (at least 10). An example neighborhood is shown
in the following figure.

**4. Global Polynomial Interpolation (mean value)**

The Global Polynomial
Interpolation method fits a plane to the sample points based on the
overriding trend. A plane is a special case of a family of mathematical
formulas called polynomials. The goal of interpolation is to minimize error.
One can measure the error by subtracting each measured value from its predicted
value on the plane, squaring the difference, and adding the squares up. Choosing
the plane that minimizes this sum is referred to as a
“least squares” fit. This process is the theoretical basis for first-order
Global Polynomial interpolation. More generally, Global Polynomial interpolation
fits a smooth surface that is defined by a mathematical function (a polynomial)
to the input sample points. The Global Polynomial surface changes gradually and
captures coarse-scale patterns in the data.
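The least-squares fit of a first-order surface can be sketched as follows. The helper names and the five sample points are hypothetical; the example only illustrates that the fitted plane has a smaller sum of squared errors than nearby alternative planes.

```python
import numpy as np

def fit_plane(xy, values):
    """Least-squares plane z = c0 + c1*x + c2*y through all sample points."""
    xy = np.asarray(xy, dtype=float)
    design = np.column_stack([np.ones(len(xy)), xy])
    coef, *_ = np.linalg.lstsq(design, np.asarray(values, dtype=float), rcond=None)
    return coef

def sum_squared_error(coef, xy, values):
    """Sum of squared differences between plane predictions and measurements."""
    xy = np.asarray(xy, dtype=float)
    predicted = coef[0] + xy @ coef[1:]
    return float(np.sum((predicted - np.asarray(values, dtype=float)) ** 2))

# Hypothetical sample points and measured values
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
vals = [1.0, 2.1, 2.9, 4.2, 5.0]
coef = fit_plane(pts, vals)
best = sum_squared_error(coef, pts, vals)
# Nudging any coefficient away from the fit can only increase the error
worse = sum_squared_error(coef + np.array([0.1, 0.0, 0.0]), pts, vals)
```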

__Results__

Before starting the geostatistical analysis, we randomly divided the data into
two parts, with 170 points in the training set and 64 points in the validation
set. The following diagrams show the frequency distributions of the all-wells
data set, the training data set, and the validation data set.

It appears that all
three diagrams have the same general shape.
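A random division of this kind can be sketched as below. The function name and the seed are assumptions for the example; the report does not say how the random split was actually drawn.

```python
import numpy as np

def split_indices(n_total, n_train, seed=0):
    """Randomly partition row indices into training and validation sets."""
    rng = np.random.default_rng(seed)  # hypothetical seed, for repeatability only
    order = rng.permutation(n_total)
    return order[:n_train], order[n_train:]

# 170 training points + 64 validation points = 234 wells in total
train_idx, valid_idx = split_indices(234, 170)
```

Comparing histograms of the values selected by `train_idx` and `valid_idx` against the full data set is one way to check that the split preserved the overall distribution.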

**1. Inverse Distance Weighted (IDW)**

Figure 1 illustrates
the modeling results using IDW. IDW results in a pattern with many local
“hot-spots” and “cold-spots”. There appears to be a characteristic trend in
high arsenic from south-central to north-central in
Figure 2, with a table,
summarizes the descriptive statistics for the difference (error) between the
validation set (N = 64 points) and the modeled arsenic concentrations using IDW.
The mean and standard deviation of the error are 1.977741 mg/L and 9.241922 mg/L
respectively. The frequency distribution shows that most errors are between
-4.9 and 10.5 mg/L.

Figure 3 shows that IDW
is a conservative method, since it underestimates the points compared to the
measured points. The regression function for the blue line below is 0.057*x +
4.249. The mean and root-mean-square errors are -0.2329 and 12.92 mg/L.

Figure 4 shows that the
error of the predicted map is small between 0.01 and 0.19, but a
few points above 0.19 have quite large errors. Its regression function is
-0.943*x + 4.249.
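The two IDW regression lines reported above are consistent with each other. Assuming the error is defined as predicted minus measured (an assumption; the text does not state the sign convention), and writing x for the measured value, the fitted line for the predictions implies the fitted line for the errors. The symbols below are introduced only for this illustration.

```latex
\begin{aligned}
\hat{z}(x) &\approx 0.057\,x + 4.249 \\
e(x) &= \hat{z}(x) - x \approx (0.057 - 1)\,x + 4.249 = -0.943\,x + 4.249
\end{aligned}
```

A prediction slope far below 1 means the predictions vary much less than the measurements, which is why the error slope is close to -1.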

**2. Ordinary Kriging**

Figure 1 shows the
results using Ordinary Kriging. Ordinary Kriging results in a simple pattern with
many local “hot-spots”. There appears to be a characteristic trend in high arsenic
from south-central to north-central Cochise County.

Figure 2, with a table,
summarizes the descriptive statistics for the difference (error) between the
validation set (N = 64 points) and the modeled arsenic concentrations using
Ordinary Kriging. The mean and standard deviation of the error are 1.8047777
mg/L and 9.605348 mg/L respectively. The frequency distribution shows that most
errors are between -6.5 and 6.9 mg/L.

Figure 3 shows the
scatterplot of measured versus predicted points. From it, we can see
that Ordinary Kriging underestimates the points relative to the training
set points. Thus, Ordinary Kriging is also a conservative method. Its
regression function was 0.062*x + 4.761. The mean error was 0.07891
mg/L, the root-mean-square error was 12.54 mg/L, and the average standard error
was 12.51 mg/L. The mean standardized error was 0.005696, and the
root-mean-square standardized error was 1.002.
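For reference, summary statistics of this kind can be computed as in the sketch below. It assumes that the error is predicted minus measured, that the standardized error is the error divided by the kriging standard error at that point, and that the average standard error is the square root of the mean prediction variance; the function name and the four sample points are hypothetical.

```python
import numpy as np

def kriging_error_summary(measured, predicted, std_error):
    """Cross-validation style summary statistics for kriging predictions."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    std_error = np.asarray(std_error, dtype=float)
    error = predicted - measured          # assumed sign convention
    standardized = error / std_error      # dimensionless
    return {
        "mean_error": float(error.mean()),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        # Assumed definition: root of the mean prediction variance
        "average_standard_error": float(np.sqrt(np.mean(std_error ** 2))),
        "mean_standardized_error": float(standardized.mean()),
        "rms_standardized_error": float(np.sqrt(np.mean(standardized ** 2))),
    }

# Hypothetical values for four validation points
summary = kriging_error_summary(
    measured=[4.0, 10.0, 6.0, 8.0],
    predicted=[5.0, 8.0, 6.0, 9.0],
    std_error=[2.0, 2.0, 1.0, 1.0],
)
```

A root-mean-square standardized error near 1, together with a root-mean-square error close to the average standard error, is usually read as a sign that the kriging standard errors describe the actual prediction variability reasonably well.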

Figure 4 shows the
errors between the predicted points and the measured points. In general the
error was small, since most points lie near zero. Its
regression function was -0.938*x + 4.761.

Figure 5 shows the QQ plot (quantile-quantile plot) of the standardized
prediction error. We note that the errors are not normally distributed.

The following map is the
Prediction Standard Error Map obtained from the Ordinary Kriging method. It
shows that the prediction is generally good in areas with many
points, but worse at the edges, where points are sparse. On the
other hand, the two points with the largest errors lie inside areas with
quite a lot of points. The reason may be that these two points have much higher or
lower arsenic concentrations than the points around them, so the predicted
values differ considerably from the measured ones.

**3. Local Polynomial Interpolation (mean value)**

Figure 1 shows the
modeling results using Local Polynomial Interpolation by mean value. Local
Polynomial Interpolation results in a simple pattern without spots. There
appears to be a characteristic trend in which the arsenic concentration
decreases from north-central to south-central.

Figure 2, with a table,
summarizes the descriptive statistics for the difference (error) between the
validation set (N = 64 points) and the modeled arsenic concentrations using
Local Polynomial Interpolation. The mean and standard deviation of the error
are 0.995629 mg/L and 9.556649 mg/L respectively. The frequency distribution
shows that most errors are greater than 0.2.

Figure 3 shows the scatterplot of predicted versus actual measured values.
From it we can see that the method underestimates the values compared to the
measured ones. Hence, Local Polynomial Interpolation is also a conservative
method. Its regression function is 0.055*x + 5.167. The mean and
root-mean-square errors are -0.2033 and 12.52 mg/L.

Figure 4 shows a scatterplot of prediction error versus actual measured
values. With this method, the error was small below 0.19, and few points lie
away from zero. Its regression function is -0.945*x + 5.167.

**4. Global Polynomial Interpolation (mean value)**

Figure 1 shows the
results using Global Polynomial Interpolation (GPI) by mean value. GPI was
conducted on the 170 points in the training set. It results in a decreasing
trend from north to south. Global Polynomial interpolation shows
the same pattern as Local Polynomial interpolation, except that it consists of
straight lines rather than the curves seen in Local Polynomial interpolation.

Figure 2, with a table,
summarizes the descriptive statistics for the difference (error) between the
validation set (N = 64 points) and the modeled arsenic concentrations using
Global Polynomial interpolation. The mean and standard deviation of the error
are 1.421281 mg/L and 9.956629 mg/L respectively. The frequency distribution
shows that most errors are greater than -1.9.

Figure 3 shows a scatterplot of predicted versus actual measured values.
Since the method underestimates the points compared to the actual values,
it is also a conservative method. Its regression function is
0.046*x + 5.866. The mean and root-mean-square errors are 0.008116 and 12.57
mg/L.

Figure 4 shows a scatterplot of prediction error versus actual measured
values. We see that most errors are near zero below 0.19, but a few fall
below zero beyond that. Its regression function is -0.954*x + 5.866.

__Data File Generation and Modification__

The vast majority of
effort on any GIS project generally involves data acquisition, or preparing the
data for analysis in the GIS. We have put a lot of effort into preparing water
quality data from a variety of sources for geostatistical and geospatial
analysis. These data required transformation from the raw data (read with
Notepad) into an Excel file, and then conversion
into an SPSS file (read with the SPSS 11.0.1 Data Editor).

There were two major
steps involved in the geostatistical analysis: exploratory spatial data
analysis and geostatistical modeling. Exploratory spatial data analysis involves tidying up
the summary measures of central tendency and dispersion for each county in
__Discussion and Conclusions__

This investigation has
identified geographic clustering of arsenic cases throughout the

Mathematically, the
best of the four methods is Ordinary Kriging. Kriging fits a
mathematical function to a specified number of points, or all points within a
specified radius, to determine the output value for each location. Kriging is a
multiple-step process; it includes exploratory statistical analysis of the
data, variogram modeling, creating the surface,
and exploring a variance surface. This method is most appropriate when
there is known to be a spatially correlated distance or directional bias in the
data. Kriging has several advantages over deterministic interpolation
methods. While kriging has a tendency to smooth distributions, these
simulations maintain closer resemblance to the true 'shape' of the data. On the
other hand, the other three methods are not derived methods. For example,
Inverse Distance Weighting considers only the distance and not the direction,
which is intuitive.

Yet Ordinary Kriging did not give the best error
results in this experiment. Several factors may affect the output of
Ordinary Kriging:

1. Different divisions of

2. The choice of the exponential power factor can be
very important. It needs to take into account the characteristics of the
arsenic values;

3. Ordinary Kriging
is very sensitive to the search neighborhood. In the experiment, the default
value given by the software was used; it may or may not be the most appropriate
value;

4. The biggest assumption of the Kriging method is a constant mean. The Local
and Global polynomial interpolation maps suggest that the mean was not
constant, since the color indicating arsenic values becomes lighter and lighter
from north to south. Perhaps the mean in this Ordinary Kriging was not constant.

The above factors could
lead to more in-depth research on the data. We could further explore the few
large-error observations that occurred in the map, and discuss whether the
results would differ under other conditions.

__Advisors__

D. E. Myers, Department of Mathematics,

Seumas Rogan, Epidemiology Program,

__References__

1. R. Harris, M. K. O’Rourke and D. E. Myers. USE OF GIS IN ANALYZING
ENVIRONMENTAL CANCER RISKS AS A FUNCTION of GEOGRAPHIC SCALE,
http://math.arizona.edu/~ura/ideas.html#project_myers_gis.

2. ArcGIS Desktop Help, ArcMap 8.1, ESRI Inc.

3. ESRI online courses.

4. Donald E. Myers, Interpolation and estimation with spatially located data,
Chemometrics and Intelligent Laboratory Systems, 11 (1991) 209-228.