USE OF GIS IN ANALYZING ENVIRONMENTAL CANCER RISKS AS A FUNCTION of GEOGRAPHIC SCALE

Zheng Cai under the supervision of Prof. D. Myers and Seumas Rogan

University of Arizona

Spring 2003

v     Introduction

The Atlas of Cancer Mortality in the United States (1950-1994) tabulates the distribution of cancer in the United States by county. Its construction and use rest on several assumptions. In particular, it implies that cancer risk is equal across each county and that the reported risk represents all residents of that county. This assumption would be reasonable if each county were fairly homogeneous in both the characteristics of the underlying population and its exposure risks. However, the assumption may not be appropriate for many states in the western United States. These states are usually divided into only a few counties, each of which covers a large geographic area with an unevenly distributed population. It is therefore unlikely that county-level statistics adequately represent the range of actual experiences within a county.

The overall goal of this three-year research project is to examine geographic variation in the relationship between cancer risk and arsenic in the United States. Exposure to arsenic may contribute to the development of bladder, lung, kidney, and skin cancers. Arsenic concentrations also vary between geographic locations. The state of Arizona is divided into several counties, each covering a large geographic area with an unevenly distributed population. To represent the range of actual experiences within a county, we need to study this geographic variation further using Geographic Information Systems (GIS) software.

This report presents research on that geographic variation, using GIS software to analyze data for Cochise County, Arizona. The main goal of the work reported here was to model groundwater arsenic concentrations.

v     Methods applied

A geographic information system (GIS) is a type of mapping software that links data about real-world objects to an onscreen map. It has four main uses: data creation, data display, analysis, and output. In practice, it is used to connect multiple sources of georeferenced health statistics data. Insights into diverse health-environment-behavior interactions can be derived by identifying clusters of cancer incidence and then comparing them with cluster locations in the mapped distribution of arsenic. The “Geostatistical Analyst” feature in ArcMap version 8.1, built by ESRI Inc., was used for the geostatistical analyses reported below.

Interpolation techniques fall into two main categories: deterministic and geostatistical (stochastic). Deterministic interpolation creates surfaces from the measured points, based on either the extent of similarity or the degree of smoothing. It can be divided into two subgroups: global and local. Geostatistical interpolation makes use of the statistical properties of the measured points.

1.      Inverse Distance Weighted (IDW)

Inverse Distance Weighted (IDW) interpolation assumes that values at locations close to one another are more alike than values at locations farther apart. To predict a value at any unmeasured location, IDW uses the measured values surrounding the prediction location. Measured values closer to the prediction location have more influence on the predicted value than those farther away.
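The weighting idea described above can be sketched in a few lines of Python. This is only an illustration, not the ArcMap implementation: the function name `idw_predict`, the well coordinates, and the arsenic values are hypothetical, and a simple nearest-neighbor search stands in for the elliptical (quadrant) search window used in the study.

```python
import numpy as np

def idw_predict(xy, values, target, power=2.0, n_neighbors=15):
    """Predict a value at `target` by inverse distance weighting."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    d = np.linalg.norm(xy - np.asarray(target, dtype=float), axis=1)
    # Keep only the nearest measured points (a plain circular search,
    # assumed here in place of the quadrant search used in the study).
    nearest = np.argsort(d)[:n_neighbors]
    d, v = d[nearest], values[nearest]
    # A target that coincides with a measured point returns that measurement.
    if d[0] == 0.0:
        return float(v[0])
    w = 1.0 / d ** power          # closer points receive larger weights
    return float(np.sum(w * v) / np.sum(w))

# Hypothetical wells: coordinates in map units, arsenic values made up.
wells = [(0.0, 0.0), (1000.0, 0.0), (0.0, 2000.0)]
arsenic = [4.0, 10.0, 2.0]
estimate = idw_predict(wells, arsenic, target=(500.0, 0.0))
```

Because the prediction is a weighted average with positive weights, it always lies between the smallest and largest neighboring measurements, which is one reason IDW surfaces show local "bull's-eye" patterns around extreme points.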


In this experiment, IDW was conducted on the 170 points in the training set using a power of 2 and the neighborhood method. The search included 15 neighbors (at least 10) within the elliptical (quadrant) search window. An example of this method is shown below:

2.      Ordinary Kriging

Ordinary Kriging assumes the model Z(s) = μ + ε(s), where μ is an unknown constant and the ε(s) are random fluctuations. It allows for local influences from nearby neighborhood values. It produces prediction, quantile, probability, or standard error maps from data points that are continuous in space. Because the mean is unknown, ordinary kriging requires few assumptions, which makes the method particularly flexible.
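For readers who want to see how the unknown constant μ is handled, the following is a minimal sketch of the ordinary kriging system for a single prediction location. It is not the Geostatistical Analyst implementation: the function names, the toy exponential variogram, and the well data are all hypothetical, and every measured point is used rather than a search neighborhood.

```python
import numpy as np

def ordinary_kriging(xy, values, target, variogram):
    """Return (prediction, kriging variance, weights) at `target`."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Pairwise distances between the measured points.
    dists = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    # Variogram matrix bordered by the unbiasedness constraint (weights sum to 1).
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = variogram(dists)
    a[n, n] = 0.0
    b = np.ones(n + 1)
    b[:n] = variogram(np.linalg.norm(xy - np.asarray(target, dtype=float), axis=1))
    solution = np.linalg.solve(a, b)
    weights = solution[:n]                 # last entry is the Lagrange multiplier
    prediction = float(weights @ values)
    variance = float(solution @ b)         # sum(w * gamma0) + multiplier
    return prediction, variance, weights

def toy_variogram(h):
    # Hypothetical exponential variogram with no nugget, for illustration only.
    return 1.0 - np.exp(-np.asarray(h, dtype=float) / 1500.0)

wells = [(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0), (1000.0, 1000.0)]
arsenic = [4.0, 10.0, 2.0, 6.0]
prediction, variance, weights = ordinary_kriging(
    wells, arsenic, (500.0, 500.0), toy_variogram
)
```

The constraint row forces the weights to sum to one, which is what lets the method avoid estimating μ explicitly. The square root of the returned variance is the quantity mapped in a prediction standard error map.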

In this analysis, Ordinary Kriging was conducted on the 170 points in the training set using a Spherical variogram model, an automatic lag size of 4364.1, and 12 lags. We also used the neighborhood method, including 5 neighbors (at least 2) using shape type .  The equation for the variogram is

γ(h) = 161.77*Spherical(20473) + 32.728*Nugget ,

where the Nugget effect is the sum of measurement error and small-scale irregularities (microscale variation). Because either component can be zero, the Nugget effect can consist wholly of one or the other.
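To make the fitted model concrete, the sketch below evaluates a spherical semivariogram with the reported coefficients. It assumes the usual reading of the software's notation, namely that 161.77 is the partial sill, 20473 is the range in map units, and 32.728 is the nugget. The function and constant names are hypothetical.

```python
import numpy as np

PARTIAL_SILL = 161.77   # coefficient of the Spherical term (assumed partial sill)
RANGE = 20473.0         # argument of Spherical(...), assumed range in map units
NUGGET = 32.728         # coefficient of the Nugget term

def spherical_variogram(h):
    """Spherical semivariogram with a nugget, evaluated at distances h."""
    h = np.asarray(h, dtype=float)
    ratio = np.minimum(h / RANGE, 1.0)     # the model is flat beyond the range
    gamma = NUGGET + PARTIAL_SILL * (1.5 * ratio - 0.5 * ratio ** 3)
    # By convention the semivariogram is exactly zero at zero separation.
    return np.where(h > 0.0, gamma, 0.0)

# Zero distance, one lag (4364.1), the range, and a distance beyond the range.
lags = np.array([0.0, 4364.1, 20473.0, 40000.0])
gamma_values = spherical_variogram(lags)
```

Under this reading the curve jumps from 0 to the nugget just above zero distance, rises to the total sill of 161.77 + 32.728 = 194.498 at the range, and stays there for larger distances.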

The plot of the experimental variogram and the fitted model is shown in the following figure.


The figure above shows a typical search neighborhood for ordinary kriging.

3.      Local Polynomial Interpolation (mean value)

The conceptual basis for Local Polynomial interpolation is to fit many smaller overlapping planes and then use the center of each plane as the prediction for each location in the study area. The resulting surface is more flexible, and perhaps more accurate, than a single plane. This interpolation fits many polynomials, each within a specified overlapping neighborhood. Local Polynomial Interpolation is sensitive to the neighborhood distance.
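The idea of fitting a separate plane around each prediction location can be sketched as a distance-weighted least-squares fit. This is an illustration under stated assumptions rather than the software's algorithm: the exponential weight kernel, the `bandwidth` parameter, the function name, and the data are all hypothetical, and every point is used instead of a limited neighborhood.

```python
import numpy as np

def local_polynomial_predict(xy, values, target, bandwidth):
    """First-order local polynomial prediction at `target`."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    offsets = xy - np.asarray(target, dtype=float)
    d = np.linalg.norm(offsets, axis=1)
    # Assumed kernel: weights decay exponentially with distance from the target.
    w = np.exp(-3.0 * d / bandwidth)
    # Plane centred on the target, so its intercept is the prediction there.
    design = np.column_stack([np.ones(len(d)), offsets])
    sw = np.sqrt(w)                         # weighted least squares via scaling
    coef, *_ = np.linalg.lstsq(design * sw[:, None], values * sw, rcond=None)
    return float(coef[0])

# Hypothetical wells lying exactly on the plane z = 2 + 0.001*x - 0.002*y.
wells = np.array([(0.0, 0.0), (1000.0, 0.0), (0.0, 1000.0),
                  (1000.0, 1000.0), (500.0, 200.0)])
arsenic = 2.0 + 0.001 * wells[:, 0] - 0.002 * wells[:, 1]
estimate = local_polynomial_predict(wells, arsenic, (300.0, 700.0), bandwidth=5000.0)
```

A small bandwidth makes each plane follow only its nearest points, while a very large one gives nearly equal weights and approaches a single global plane, which illustrates the sensitivity to neighborhood distance noted above.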

In the experiment, Local Polynomial Interpolation was conducted on the 170 points in the training set using a weight of 125644.96 and a power of 1. It also used the neighborhood method, including 165 neighbors (at least 10). An example neighborhood is shown in the following figure.

4.      Global Polynomial Interpolation (mean value)

The Global Polynomial Interpolation method fits a plane through the sample points based on the overriding trend. A plane is a special case of a family of mathematical formulas called polynomials. The goal of interpolation is to minimize error. The error can be measured by subtracting each measured value from its predicted value on the plane, squaring the difference, and summing the results. Choosing the plane that minimizes this sum is referred to as a “least squares” fit. This process is the theoretical basis for first-order Global Polynomial interpolation. Global Polynomial interpolation fits a smooth surface, defined by a mathematical function (a polynomial), to the input sample points. The Global Polynomial surface changes gradually and captures the coarse-scale pattern in the data.
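The least-squares plane described above can be computed directly. The following is a minimal sketch with made-up sample points; the function names `fit_plane` and `predict_plane` are hypothetical.

```python
import numpy as np

def fit_plane(xy, values):
    """Least-squares plane z = c0 + c1*x + c2*y through the sample points."""
    xy = np.asarray(xy, dtype=float)
    design = np.column_stack([np.ones(len(xy)), xy])
    coef, *_ = np.linalg.lstsq(design, np.asarray(values, dtype=float), rcond=None)
    return coef                              # intercept, x slope, y slope

def predict_plane(coef, xy):
    """Evaluate the fitted plane at the given locations."""
    xy = np.asarray(xy, dtype=float)
    return coef[0] + xy @ coef[1:]

# Hypothetical sample points and values.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = np.array([1.0, 2.0, 2.0, 4.0])
coef = fit_plane(points, values)
errors = predict_plane(coef, points) - values
sum_of_squares = float(np.sum(errors ** 2))  # the quantity being minimized
```

For these four points the fitted plane is 0.75 + 1.5*x + 1.5*y, and no other plane gives a smaller sum of squared errors than 0.25.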

v     Results

Before starting the geostatistical analysis, we randomly divided the data into two parts, with 170 points in the training set and 64 points in the validation set. The following diagrams show the frequency distributions of the all-wells data set, the training data set, and the validation data set.

 

It appears that all three diagrams have the same general shape.
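A random division of this kind can be reproduced with a shuffled index, as in the sketch below. The seed and variable names are hypothetical; only the set sizes (170 and 64) come from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # hypothetical seed, for reproducibility
n_train, n_valid = 170, 64            # set sizes reported in the text
shuffled = rng.permutation(n_train + n_valid)
train_idx = shuffled[:n_train]        # row indices of the training wells
valid_idx = shuffled[n_train:]        # row indices of the validation wells
```

Every well lands in exactly one of the two sets, so the validation points play no part in fitting the interpolation models.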

 

1.      Inverse Distance Weighted (IDW)

Figure 1 illustrates the modeling results using IDW. IDW produces a pattern with many local “hot-spots” and “cold-spots”. There appears to be a characteristic trend of high arsenic concentrations running from south-central to north-central Cochise County.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using IDW. The mean and standard deviation of the error are 1.977741 mg/L and 9.241922 mg/L respectively. The frequency distribution shows that most errors are between -4.9 and 10.5 mg/L.

Figure 3 shows that IDW is a conservative method, since it underestimates values compared to the measured points. The regression function for the blue line in the figure is 0.057*x + 4.249. The mean and root-mean-square errors are -0.2329 and 12.92 mg/L.

Figure 4 shows that the error of the predicted map is small between 0.01 and 0.19, but a few points above 0.19 have quite large errors. Its regression function is -0.943*x + 4.249, which is the Figure 3 regression line minus x, since the error is the predicted value minus the measured value.
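The summary statistics and regression lines quoted for each method can be computed as in the sketch below. The sign convention (error = predicted minus measured) is an assumption, chosen because it is consistent with the reported lines: the error slope equals the prediction slope minus one and the intercepts match. The function name and the sample numbers are hypothetical.

```python
import numpy as np

def validation_summary(measured, predicted):
    """Mean error, RMSE, and regression lines against the measured values."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = predicted - measured             # assumed sign convention
    slope, intercept = np.polyfit(measured, predicted, deg=1)
    err_slope, err_intercept = np.polyfit(measured, error, deg=1)
    return {
        "mean_error": float(error.mean()),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "slope": float(slope),
        "intercept": float(intercept),
        "error_slope": float(err_slope),
        "error_intercept": float(err_intercept),
    }

# Made-up measured and predicted values, for illustration only.
measured = [1.0, 5.0, 10.0, 20.0, 40.0]
predicted = [4.5, 4.4, 5.0, 5.2, 6.6]
summary = validation_summary(measured, predicted)
```

A prediction slope far below 1, as reported for all four methods, means high measured values are pulled strongly toward the overall average, which is what the text describes as a conservative method.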

 

 

 

2.      Ordinary Kriging

Figure 1 shows the results using Ordinary Kriging. Ordinary Kriging produces a pattern with many local “hot-spots”, and there appears to be a characteristic trend of high arsenic from south-central to north-central Cochise County.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using Ordinary Kriging. The mean and standard deviation of the error are 1.8047777 mg/L and 9.605348 mg/L respectively. The frequency distribution shows that most errors are between -6.5 and 6.9 mg/L.

Figure 3 plots the predicted points against the measured points. From it, we can see that Ordinary Kriging underestimates values compared to the measured training-set points. Thus, Ordinary Kriging is also a conservative method. The regression function was 0.062*x + 4.761. The mean error was 0.07891 mg/L, the root-mean-square error was 12.54 mg/L, and the average standard error was 12.51 mg/L. The mean standardized error was 0.005696 and the root-mean-square standardized error was 1.002; standardized errors are dimensionless.
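The five kriging diagnostics listed above can be computed from the measured values, the predictions, and the prediction standard errors. The sketch below assumes the usual definitions (errors taken as predicted minus measured, average standard error as the root of the mean kriging variance). The function name and the toy inputs are hypothetical.

```python
import numpy as np

def kriging_diagnostics(measured, predicted, std_error):
    """Cross-validation diagnostics for a kriging model (assumed definitions)."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    std_error = np.asarray(std_error, dtype=float)
    error = predicted - measured
    standardized = error / std_error         # dimensionless
    return {
        "mean_error": float(error.mean()),
        "rmse": float(np.sqrt(np.mean(error ** 2))),
        "average_standard_error": float(np.sqrt(np.mean(std_error ** 2))),
        "mean_standardized_error": float(standardized.mean()),
        "rms_standardized_error": float(np.sqrt(np.mean(standardized ** 2))),
    }

# Toy inputs, for illustration only.
diagnostics = kriging_diagnostics(
    measured=[1.0, 2.0, 3.0], predicted=[2.0, 2.0, 2.0], std_error=[1.0, 1.0, 1.0]
)
```

When the root-mean-square error is close to the average standard error and the root-mean-square standardized error is close to 1, the model's stated uncertainty matches the errors actually observed. The reported values (12.54 against 12.51, and 1.002) fit that pattern.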

Figure 4 shows the errors between the predicted points and the measured points. In general the error was small, since most points lie near zero. The regression function was -0.938*x + 4.761.

Figure 5 shows the QQ plot (quantile-quantile plot) of the standardized prediction errors. We note that the errors are not normally distributed.

The following map is the Prediction Standard Error Map obtained from the Ordinary Kriging method. It shows that the prediction is generally good in areas with many points but worse near the edges, where points are sparse. On the other hand, the two points with the largest errors lie inside areas with many points. The reason may be that these two points have much higher or lower arsenic concentrations than the points around them, so the predicted values differ considerably from the measured ones.

3.      Local Polynomial Interpolation (mean value)

Figure 1 shows the modeling results using Local Polynomial Interpolation by mean value. Local Polynomial Interpolation produces a simple pattern without spots. There appears to be a characteristic trend in which the arsenic concentration decreases from north-central to south-central.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using Local Polynomial Interpolation. The mean and standard deviation of the error are 0.995629 and 9.556649 mg/L. The frequency distribution shows that most errors are greater than 0.2.

Figure 3 shows the scatterplot of predicted versus actual measured values. From it we can see that the method underestimates values compared to the measured ones. Hence, Local Polynomial Interpolation is also a conservative method. Its regression function is 0.055*x + 5.167. The mean and root-mean-square errors are -0.2033 and 12.52 mg/L.

Figure 4 shows a scatterplot of prediction error versus actual measured values. With this method, the error is small below 0.19, and few points lie far from zero. Its regression function is -0.945*x + 5.167.

4.      Global Polynomial Interpolation (mean value)

Figure 1 shows the results using Global Polynomial Interpolation (GPI) by mean value. GPI was conducted on the 170 points in the training set. It produces a decreasing trend from north to south. The Global Polynomial interpolation shows the same pattern as the Local Polynomial interpolation, except that its contours are straight lines rather than the curves seen in the Local Polynomial interpolation.

Figure 2, with its table, summarizes the descriptive statistics for the difference (error) between the validation set (N = 64 points) and the modeled arsenic concentrations using Global Polynomial interpolation. The mean and standard deviation of the error are 1.421281 mg/L and 9.956629 mg/L respectively. The frequency distribution shows that most errors are greater than -1.9.

Figure 3 shows a scatterplot of predicted versus actual measured values. Since the method underestimates values compared to the actual ones, it is also a conservative method. Its regression function is 0.046*x + 5.866. The mean and root-mean-square errors are 0.008116 and 12.57 mg/L.

Figure 4 shows a scatterplot of prediction error versus actual measured values. We see that most errors are near zero below 0.19, but a few fall below zero beyond that. Its regression function is -0.954*x + 5.866.

v     Data File Generation and Modification

The vast majority of the effort on any GIS project generally goes into data acquisition and preparing the data for analysis in the GIS. We put a great deal of effort into preparing water quality data from a variety of sources for geostatistical and geospatial analysis. These data required conversion from raw text files (readable in NotePad) into Excel files, and then into SPSS files (readable in the SPSS 11.0.1 Data Editor).

There were two major steps in the geostatistical analysis: exploratory spatial data analysis and geostatistical modeling. Exploratory spatial data analysis involves compiling summary measures of central tendency and dispersion for each county in Arizona. These analyses focus on arsenic, along with other variables that may correlate with arsenic, such as well depth and other contaminants. Geostatistical modeling assumes that the distance or direction between sample points reflects a spatial correlation that can be used to explain variation in the surface. It was used to interpolate arsenic concentrations at unmeasured locations in ArcGIS. We interpolated arsenic concentrations using Inverse Distance Weighting, Ordinary Kriging, and Local and Global Polynomial Interpolation.

v     Discussion and Conclusions

This investigation has identified geographic clustering of arsenic concentrations throughout Cochise County based on two assumptions: that the whole of Cochise County is flat, so that water can flow freely in any direction, and that people living in the county drink only groundwater, not city water. Under these assumptions, the arsenic data collected in the county would be relevant to residents' risk of developing cancer.

Mathematically, the best of the four methods is Ordinary Kriging. Kriging fits a mathematical function to a specified number of points, or to all points within a specified radius, to determine the output value for each location. Kriging is a multistep process: it includes exploratory statistical analysis of the data, variogram modeling, creating the surface, and exploring a variance surface. It is most appropriate when there is known to be a spatially correlated distance or directional bias in the data. Kriging has several advantages over deterministic interpolation methods. Although kriging tends to smooth distributions, its results maintain a closer resemblance to the true 'shape' of the data. The other three methods, by contrast, are not derived from a statistical model. For example, Inverse Distance Weighting considers only distance, not direction, and rests on intuition.

Yet Ordinary Kriging did not give the best error results in this experiment. Several factors may affect the output of Ordinary Kriging:

1.      Different divisions of the Cochise County data into training and validation sets may lead to different outcomes, one of which will be the best approximation to the actual result;

2.      The choice of the exponential power factor can be very important. It needs to take into account the characteristics of the arsenic values;

3.      Ordinary Kriging is very sensitive to the search neighborhood. In this experiment, the default value given by the software was used, which may or may not be the most appropriate value;

4.      The biggest assumption of the Ordinary Kriging method is a constant mean. The Local and Global Polynomial interpolation maps show that the mean was not constant, since the color indicating arsenic values becomes lighter and lighter from north to south. The mean in this Ordinary Kriging analysis may therefore not have been constant.

These factors could motivate more in-depth study of the data. We could further explore the few observations with large errors on the map and examine whether different choices would change the results.

v     Advisors

D. E. Myers, Department of Mathematics, University of Arizona

Seumas Rogan, Epidemiology Program, University of Arizona

v     References

1.      R. Harris, M. K. O’Rourke and D. E. Myers, USE OF GIS IN ANALYZING ENVIRONMENTAL CANCER RISKS AS A FUNCTION of GEOGRAPHIC SCALE, http://math.arizona.edu/~ura/ideas.html#project_myers_gis .

2.      ArcGIS Desktop Help, ArcMap 8.1, ESRI Inc.

3.      ESRI online courses.

4.      Donald E. Myers, Interpolation and estimation with spatially located data, Chemometrics and Intelligent Laboratory Systems, 11 (1991) 209 -228.