Investigation of Geographical Disparities: The Use of An Interpolation Method For Cancer Registry Data.

The American Cancer Society estimated 1.9 million diagnosed cancer cases and 608,570 cancer deaths in 2021 in the US; for Oklahoma, they estimated 22,820 cases and 8,610 deaths. This project aimed to demonstrate a method to systematically describe cancer in an accurate and visually attractive, yet simple to make, interpolated map using inverse distance weighting and ZIP Code level registry data, the smallest area unit available with high accuracy. We describe a process of creating smoothed maps with an appropriate, well-described, simple, replicable method. These smoothed maps display low (cold) or high (hot) areas of incidence rates of (a) all cancers combined, (b) colorectal cancer and lung cancer by gender, (c) female breast cancer, and (d) prostate cancer, by ZIP Code for Oklahoma from 2013-2017. The methods we present in this paper provide an effective visualization to pinpoint low (cold) or high (hot) areas of cancer incidence.

characteristics with geography, such as breast cancer clusters in the high-income areas of Nassau and Suffolk counties in New York 35 and Marin County in California. 36 There have been several high-profile suspected, but unproven, geographic clusters, such as Camp Lejeune in North Carolina. [37][38][39] Most high-profile geographic clusters of cancer have been occupational, such as vermiculite mining in Libby, Montana. 40,41 Clusters of cancer have been represented with maps that show differing incidence or mortality rates based on administrative districts, such as state, county, or census tract, in the form of choropleth maps.
However, choropleth maps can be misleading, with potentially serious consequences, since they use political and administrative boundaries that may not represent true risk. [42][43][44] The creation of artificially imposed boundaries for administrative purposes can exclude geographic neighbors from analyses because they depend on the values that exclude neighbors. 44 Population size and land area may vary within the geographic units; 44 thus, choropleth maps are subject to small number problems, particularly in rural areas, and these maps typically do not include error estimates. 45 Moreover, choropleth map classification systems, such as natural breaks, equal intervals, or quantiles, can relay differing messages depending on the system used. [46][47][48] Smoothed maps created with interpolation methods, however, maintain the accuracy of significant high and low cluster locations better than choropleth maps while allowing clusters to be displayed visually. To our knowledge, well-described, easily accessible methods for displaying geographic disparities in cancer have not been published.
This study aimed to demonstrate an accurate and visually attractive, yet straightforward, GIS-based method to systematically illustrate cancer incidence using ZIP Code level cancer registry data for all cancers combined and four major types of cancer (colorectal, lung, female breast, and prostate). This analysis will better inform public health researchers by enabling them to empirically assess and describe geographic disparities in cancer for their own areas.

Study Population and Data Sources
Cancer incidence data were obtained from the Oklahoma Central Cancer Registry (OCCR), Oklahoma's statewide cancer registry system, through a data-sharing agreement. Cancers were grouped as all cancers combined, colorectal cancer (ICD-O-3 C18.0-C18.9, C19.9, C20.9), lung cancer (C34.0-C34.9), female breast cancer (C50.0-C50.9), and prostate cancer (C61.9). Cases whose histology codes specified mesotheliomas, Kaposi sarcomas, or lymphomas (9050-9055, 9140, 9590, and 9989), and males with breast cancer, were excluded. For each cancer type and all cancers combined, we received the number of cases for each diagnosis, the number and percentage of cases diagnosed at a late stage, age group (0-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, and 85 or older), sex, and ZIP Code of residence at diagnosis for the latest five years available (2013-2017).

Geographic Unit of Analysis
For this study, we aimed to use data from the smallest geographical unit available, which is the USPS ZIP Code level. Limitations related to the use of USPS ZIP Code data in public health research, compared to census blocks or tracts, are well known. 49,50 In response to this concern, the US Census Bureau created ZIP Code Tabulation Areas (ZCTAs). While ZIP Codes represent a collection of mail delivery routes established for use by the US Postal Service, ZCTAs are generalized areal representations. [50][51][52] It is important to note that there are strengths and limitations to both of these geographic areas. 49,53,54 For purposes of this paper, ZCTAs were considered adequate and are easily attainable; therefore, we used 2015 ZCTAs for mapping purposes. 55 ZCTAs may cross county lines and sometimes also cross state lines. For the purposes of this study of interpolation methods, ZCTAs are sufficient to illustrate generalized areas.

Interpolation Methods
Using inverse distance weighting (IDW), a method in which values at unknown points are calculated as the weighted average of the values of nearby known points, we created maps displaying the incidence rates of colorectal (male and female), female breast, lung (male and female), prostate, and all cancers combined in Oklahoma for 2013-2017. We divided cancer incidence into 13 classes, which were determined using geometrical interval breaks. This large number of classes was used to present a smooth transition between categories. 56 The geometrical interval classification method (or smart quantiles) is particularly good for visualizing continuous data; it lessens within-class variance and works well with "heavily skewed and duplicate values" introduced by the use of a Standardized Incidence Ratio (SIR). 57 An SIR is the observed number of cases (in the present study, of cancer) divided by the expected number of cases. The expected number of cases is the number of cases that would have occurred if a standard rate were applied throughout the area; in the present study, the incidence rate of the state of Oklahoma from 2013-2017 was used as the standard. The SIR is typically used when the occurrence of cancer in a relatively small population is disparate or a small number of observed cases occur, such as in ZIP Codes. The SIRs in this dataset were heavily skewed, with a skewness of, for example, 24.16 for all males and 3.10 for all females.
We used ZCTA polygon data from the US Census Bureau. ZIP Codes were matched to their respective ZCTA; however, nine (n = 648) cancer case ZIP Codes did not match a ZCTA. For ZIP Codes that did not match, the ZIP Code was geocoded (using the ESRI® ArcGIS ready-to-use geocoding tool), and the resulting location was used to place the ZIP Code data within an appropriate ZCTA. There were 90 ZIP Codes that were either recently created or merged with another ZIP Code. These were placed on the map in the corrected area. We then used an incorporated places shapefile from the US Census Bureau for Oklahoma that included county, ZCTA, and incorporated places. We joined the population count to the incorporated places shapefile to determine the largest population center in the ZCTA. For those ZCTAs without incorporated places, the ZCTA centroid was used. We used a color scheme that transitioned from red to blue to indicate hot (red) and cold (blue) spots, which experienced higher and lower rates than the statewide rate, respectively.

Statistical Analysis
Figure 1 shows the final data workflow. We calculated SIRs for each ZCTA for each cancer type using SAS 9.4 (SAS Institute Inc. 2013, Cary, NC). For the IDW, we used ArcGIS 10.8.1 to create smoothed maps of Oklahoma ZCTA SIRs.
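The point-placement rule described above (use the largest population center among the incorporated places in a ZCTA, and fall back to the ZCTA centroid where there are none) can be illustrated with a short Python sketch. This is not the authors' ArcGIS workflow; the function name `representative_points`, the record layout, and the ZCTA labels "A" and "B" are hypothetical and chosen only for illustration.

```python
def representative_points(zcta_centroids, places):
    """Pick one point per ZCTA for interpolation.

    zcta_centroids: dict mapping ZCTA id -> (x, y) centroid (hypothetical layout).
    places: list of dicts with keys "zcta", "population", "x", "y"
            describing incorporated places (hypothetical layout).
    Returns a dict mapping ZCTA id -> (x, y).
    """
    # Keep only the most populous incorporated place seen for each ZCTA.
    largest = {}
    for place in places:
        current = largest.get(place["zcta"])
        if current is None or place["population"] > current["population"]:
            largest[place["zcta"]] = place

    points = {}
    for zcta, centroid in zcta_centroids.items():
        if zcta in largest:
            # Largest population center inside the ZCTA.
            points[zcta] = (largest[zcta]["x"], largest[zcta]["y"])
        else:
            # No incorporated place: fall back to the ZCTA centroid.
            points[zcta] = centroid
    return points


# Made-up example data: ZCTA "A" has two places, ZCTA "B" has none.
example_centroids = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
example_places = [
    {"zcta": "A", "population": 1200, "x": 1.0, "y": 2.0},
    {"zcta": "A", "population": 5400, "x": 3.0, "y": 4.0},
]
example_points = representative_points(example_centroids, example_places)
```

In this made-up example, ZCTA "A" is represented by its more populous place at (3.0, 4.0), and ZCTA "B" by its centroid at (10.0, 10.0).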

Indirect Age-Sex Standardization
The number of expected cancer cases for each ZIP Code was determined by using indirect age-sex standardization and Oklahoma ZCTA population data with the following age groups, by years: 0-39, 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80-84, and 85 or older. Indirect standardization was used rather than direct standardization because it applies the stable statewide rate to local populations, instead of applying local disease rates, which are unstable for small areas, to standard population weights. 56 The SIR was used to determine whether the occurrence of cancer in a relatively small population was high or low.
For each ZCTA, to calculate the SIR for incidence (or proportion of late-stage diagnosis cases), the expected number of cancer cases (or the number of late-stage diagnosis cases) of each cancer was computed by applying the statewide rates to the numbers of people (or cases) in each age-sex group in the ZIP Code.
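The indirect standardization described above can be sketched in a few lines of Python. This is an illustrative sketch, not the SAS code used in the study: the function names, the dictionary layout keyed by (sex, age group), and all numbers are hypothetical. Expected cases are the sum over age-sex strata of the statewide rate multiplied by the local population, and the SIR is observed divided by expected.

```python
def expected_cases(state_rates, local_population):
    """Expected cases = sum over strata of statewide rate * local population.

    state_rates: dict mapping (sex, age_group) -> statewide incidence rate.
    local_population: dict mapping (sex, age_group) -> people in the ZCTA.
    """
    return sum(state_rates[stratum] * count
               for stratum, count in local_population.items())


def standardized_incidence_ratio(observed, state_rates, local_population):
    """SIR = observed cases / expected cases (NaN if nothing is expected)."""
    expected = expected_cases(state_rates, local_population)
    return observed / expected if expected > 0 else float("nan")


# Hypothetical statewide counts for two strata (made-up numbers).
state_cases = {("M", "65-69"): 500, ("F", "65-69"): 400}
state_pop = {("M", "65-69"): 100_000, ("F", "65-69"): 100_000}
state_rates = {s: state_cases[s] / state_pop[s] for s in state_cases}

# Hypothetical ZCTA: 1,000 men and 2,000 women aged 65-69, 26 observed cases.
zcta_pop = {("M", "65-69"): 1_000, ("F", "65-69"): 2_000}
example_sir = standardized_incidence_ratio(26, state_rates, zcta_pop)
```

With these made-up inputs the expected count is 0.005 × 1,000 + 0.004 × 2,000 = 13 cases, so 26 observed cases give an SIR of about 2.0, that is, twice the number expected under the statewide rates.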

Hot Spot Analysis
To confirm that the interpolations were reasonable, we also created a choropleth map of the ZIP Code data to determine areas of high rates as a validation step. We then performed a Getis-Ord Gi* analysis to determine low (cold) or high (hot) areas or spots. Hot spot and cold spot analysis using the Getis-Ord Gi* statistic uses a fixed distance band in ArcGIS software. The resulting Z score identified ZIP Code centroids with spatial clustering of high or low values. Positive Z scores indicate the clustering of high values, or hot spots. Negative Z scores indicate clustering of low values, or cold spots. A Z score near zero indicates no apparent spatial clustering. The Getis-Ord Gi* statistic works by examining each feature within the context of adjacent features. 58

Smoothed Maps Using Inverse Distance Weighting
Interpolation is used to estimate the values of intermediate and extended pixels (or points on a map) by applying a mathematical function to available data. Inverse distance weighted (IDW) interpolation determines pixel values using a linearly weighted combination of sample points, with the weight a function of inverse distance. The interpolated surface should be of a geographically dependent variable, such as cancer incidence or late-stage cancer. IDW is often used to interpolate rainfall or elevation. IDW is represented by the following formula, where $Z_i$ is the value of known point $i$, $d_{ij}$ is the distance from known point $i$ to the unknown point $j$, $Z_j$ is the estimated value at the unknown point, and $n$ is a user-selected exponent. 59

$$Z_j = \frac{\sum_i Z_i / d_{ij}^{\,n}}{\sum_i 1 / d_{ij}^{\,n}}$$
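As an illustration of inverse distance weighting as described above, the following minimal Python sketch estimates the value at an unsampled location as a weighted average of known values, with each weight equal to 1/d^n. It is a sketch under simplifying assumptions, not the ArcGIS implementation: it uses every known point (no search radius or neighbor limit), planar Euclidean distance, and a hypothetical function name `idw_estimate`; the default exponent of 2 is a common choice, not one stated in the text.

```python
import math


def idw_estimate(known_points, target, power=2.0):
    """Inverse distance weighted estimate at `target`.

    known_points: list of (x, y, value) tuples, e.g. ZCTA points with SIRs.
    target: (x, y) location to estimate.
    power: the user-selected exponent n applied to distance.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known_points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            # The target coincides with a known point: return its value.
            return value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two made-up known points with values 1.0 and 3.0.
example_known = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
midpoint_value = idw_estimate(example_known, (1.0, 0.0))   # equidistant
near_first_value = idw_estimate(example_known, (0.5, 0.0))  # closer to first
```

At the midpoint both points are equally distant, so the estimate is the plain average, 2.0. At (0.5, 0.0) the nearer point dominates and the estimate is about 1.2; raising `power` pulls the estimate further toward the nearest known value.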

Map Production
The USA Contiguous Albers Equal Area Conic projection was used for all maps. This study was approved by the IRB at the University of Oklahoma Health Sciences Center and the Oklahoma State

Overall Cancer
When reviewing the three maps together (Fig. 2a-c and Fig. 3a-c), we see that a standard ZCTA choropleth map using five classifications with natural breaks does not depict a clear smoothed picture of cancer patterns (Fig. 2a and Fig. 3a). While the hot spot analysis (Getis-Ord Gi*) clearly shows areas of hot spots, this analysis leaves the impression that these spots are very precisely located (Fig. 2b and Fig. 3b). The IDW maps (Fig. 2c and Fig. 3c) show clearer and more intuitive results. Hot spots for overall males diagnosed with cancer include areas throughout central Oklahoma, one high SIR (hot spot) in southwestern Oklahoma, and a few scattered hot spots in northern and northeastern parts of Oklahoma (Fig. 2c). Hot spots for overall females diagnosed with cancer were mainly in the northeastern portion of Oklahoma (Fig. 3c). No cold spots were observed.

Lung Cancer
Lung cancer in Oklahoma is pervasive (Fig. 3a-b). 60 The SIR revealed that for males, there are large hot spots in the eastern and southern parts of the state, with a larger cold spot area in northwestern Oklahoma (Fig. 3a), compared with the high rates in Oklahoma overall. For males, the Oklahoma City Metropolitan Area (Central Oklahoma), particularly its eastern part, has higher rates than the northern, western, and even southern areas of Central Oklahoma. For males, the Tulsa area shows small hot spots in the northwestern part of the county (Fig. 3a). Throughout Oklahoma, there were small hot spots for females (Fig. 3b). For females, large areas of northwest and western Oklahoma showed cold spots (Fig. 3b) compared with the overall Oklahoma rate. Finally, there was a large swath of cold spots from the southeastern to the northeastern parts of the state (Fig. 3b).

Colorectal Cancer
Hot spots for male colorectal cancer were located primarily in southeastern, northwestern, and southwestern Oklahoma, with a large hot spot in central Oklahoma County and northern Oklahoma (Fig. 4a). For females, the SIR showed only one hot spot in southeastern Oklahoma County (Fig. 4b).
For late-stage colorectal cancer, there are hot spots throughout the state, primarily in rural areas. There are hot and cold spots in both urban areas (Tulsa and Oklahoma counties); the hot spots are located in southwestern and northwestern Oklahoma County and northwestern, central, and southern Tulsa County. These geographically smaller, but highly populated, areas are not as visually obvious (Fig. 4c) as the rural areas in Oklahoma.

Female Breast Cancer
Hot spots for female breast cancer include urban areas with a higher SIR, from southwest Oklahoma to northeast Oklahoma (Fig. 5a). There are also hot spots in southern Oklahoma and the panhandle (Fig. 5a). Late-stage breast cancer mapping suggests that rural areas have a higher concentration of hot spots than urban areas, although there are two large urban hot spots in the southern and northwestern Oklahoma City Metropolitan Area and the northwest Tulsa Metropolitan Area (Fig. 5b).

Prostate Cancer
For men in Oklahoma diagnosed with prostate cancer, the SIR showed hot spots in southwestern Oklahoma, in south central, north and south Tulsa, and in central Oklahoma (Fig. 6).

Discussion
GIS can play a major role in epidemiology, helping researchers understand the spatial distribution of diseases and thus informing the allocation of resources. However, map-making methodologies directly affect the resulting visual output. The output can be misleading and thus lead to potentially serious consequences. Currently, there is a lack of well-described, easily accessible methods for displaying geographic disparities in cancer. This study described and demonstrated a GIS-based method, using state cancer registry data, to systematically describe cancer incidence that is accurate and visually attractive, yet simple to implement.
Choropleth maps (maps made from administrative districts such as counties) are the mainstay of spatial epidemiology. Choropleth maps, however, are often difficult to interpret. Interpolated maps are much easier for resource planners and the public to interpret. These smoothed maps allow researchers and community members to understand the geographic areas of interest for future resource planning. Working with the community, researchers and planners can then identify why some of these areas have high (hot) or low (cold) SIRs.
The present study used data from a high-quality data set (the OCCR) and strong methods to create tools for understanding geographic cancer disparities in Oklahoma. Moreover, this study used an accepted method to show the hot and cold spots using an indirect age standardization method. Because we know that cancer rates do not change at administrative borders, interpolation provides a more realistic picture of cancer rates across a geographic area. Finally, this methodology is achievable using a well-documented, industry-standard, simple software program (ArcGIS), but it can also be completed in other packages (QGIS or R).
Despite the strengths of this study, there are still limitations. First, spatial resolution may not be consistent since the maps are based on points with different densities (based typically on population). 56,61 Another limitation may be the small sample size, particularly in the rural areas. Even combining five years of Oklahoma data, there were geographic areas based on small numbers. Also, aggregate estimates of cancer incidence across large geographic areas often mask differences within the area. While ZIP Codes are not typically large geographic areas, they can still mask differences, particularly in geographically large ZIP Codes, such as those in rural areas. Besides the overall ZIP Code size issue, ZCTAs were used to represent ZIP Codes; thus, there are likely areas of geographic inconsistency. Although IDW does not smooth as well as some other methods (e.g., Empirical Bayesian Kriging), this project had the goals of producing maps that are accurate, smoothed, and easy to understand. While we considered adaptive spatial filters to create high-quality accurate maps, 62 for geographic areas with widely varying population density, we determined that IDW was as effective at producing maps that were accurate and legible. Moreover, IDW does not require specialized software, and there are many options, including ArcGIS, QGIS, and R, the latter two being open-source, no-cost software. We believe that the methods we present in this paper provide an effective compromise that allows the pragmatic pinpointing of hot and cold spots.
Understanding the relationships between health and place is foundational in epidemiology and public health. Maps of geographical areas can show areas in need of screening or preventive services. They can also show areas that are doing well in screening or prevention efforts, leading to improved public health activities. With the emergence of COVID-19 and the efforts of the Johns Hopkins University Coronavirus Resource Center maps (https://coronavirus.jhu.edu/us-map), the significance of GIS in public health has become even more apparent. This study demonstrates a method that public health practitioners can duplicate with minimal skills, no or low-cost applications, and limited data to assist health care professionals and the community in interpreting cancer in their state.

All cancers standardized incidence ratio for males: a) by ZIP Code Tabulation Area, b) Getis-Ord Gi* hot and cold spots, c) inverse distance weighting (geometrical intervals) interpolated map; for females: d) by ZIP Code Tabulation Area, e) Getis-Ord Gi* hot and cold spots, f) inverse distance weighting (geometrical intervals) interpolated map, by ZIP Code, Oklahoma 2013-2017. Note: The designations employed and the presentation of the material on this map do not imply the expression of any opinion whatsoever on the part of Research Square concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. This map has been provided by the authors.

Colorectal cancers standardized incidence ratio, inverse distance weighting (geometrical intervals) interpolated map: a) for males, b) for females, and c) male and female late stage, Oklahoma 2013-2017.
Female breast cancers standardized incidence ratio Inverse distance weighting (geometrical intervals) a) interpolated map and b) percent of late-stage standardized incidence ratio Inverse distance weighting (geometrical intervals) interpolated map by zip code, Oklahoma 2013-2017.

Figure 6
Prostate cancers standardized incidence ratio Inverse distance weighting (geometrical intervals) interpolated map for males by zip code, Oklahoma 2013-2017.