Effectiveness of predicting the spatial distributions of target contaminants of a coking plant based on their related pollutants

The prediction accuracy of the spatial distribution of soil pollutants at a site is relatively low. Related pollutants can be used as auxiliary variables to improve the prediction accuracy. However, little relevant research has been conducted on site soil pollution. To analyze the prediction accuracy of target pollutants combined with auxiliary pollutants, Cu, toluene, and phenanthrene were selected as the target pollutants for this study. Based on geostatistical analysis and spatial analysis, the following results were obtained. (1) The reduction in the root mean square errors (RMSEs) for Cu, toluene, and phenanthrene with multivariable cokriging was 68.4%, 81.6%, and 81.2%, respectively, which are proportional to the correlation coefficient of the relationship between the auxiliary pollutants and the target pollutants. (2) The RMSEs calculated for the multivariable cokriging were lower than those obtained by only combining one related pollutants, and two co-variables should be better. (3) The predicted results for Cu, phenanthrene, and toluene and their corresponding related pollutants are more accurate than the results obtained not using the related pollutants. (4) In the interpolation process, the RMSEs for Cu, toluene, and phenanthrene with multivariable cokriging basically increase as the neighborhood sample data increases, and then they become stable. (5) When 84, 61, and 34 sample points were removed, the RMSEs for Cu, toluene, and phenanthrene, respectively, with multivariable cokriging were close to the RMSEs of the target pollutants based on the total samples. The results are of great significance to improving the prediction accuracy of the spatial distribution of soil pollutants at coking plant sites.


Introduction
Contaminated sites have received worldwide attention, and China is facing serious problems from polluted sites Roslund et al. 2018;Shi et al. 2017;Cui et al. 2017). According to the national soil pollution survey bulletin of China, the soil environmental problems in industrial and mining wastelands are prominent. The over-standard rate of the soil pollutant content in heavily polluting enterprises, contaminated sites, industrial parks, centralized solid waste disposal sites, oil production areas, and mining areas is as high as 20-40%. Contaminated soil has significant adverse effects on human health and the environment (Gou et al. 2019;Yang et al. 2018). Therefore, remediation and treatment are needed for those sites (Liu et al. 2015;Fang et al. 2017).
Accurate knowledge of the spatial distribution of pollutants is key to delineating the scope of remediation efforts . Overestimating the extent of the remediation increases the cost of the remediation, while underestimating the scope fails to eliminate the risks to human health caused by contaminated sites in a comprehensive way (Li and Heap 2014). The restoration boundary is delimited based on a contour line, which is obtained using spatial interpolation methods and pollutant concentration data from a limited number of soil samples collected throughout the site (Ren et al. 2016;Ma et al. 2016). Commonly used interpolation Responsible editor: Philippe Garrigues methods include the ordinary kriging (OK), inverse distance weighted (IDW), and radial basis function (RBF) methods (Dong et al. 2011;Chen et al. 2016;Wu et al. 2008Wu et al. , 2013Gutierrez et al. 2015;Goovaerts et al. 2008;Carlon et al. 2001;Saito and Goovaerts 2000;Qiao et al. 2018). Their prediction accuracy has been relatively low for sites compared to that for large-scale farmland (Santos-Francés et al. 2017;Weindorf et al. 2013;Paulette et al. 2015). This is due to the strong spatial variability of soil pollutants at sites, which causes uncertainty (Saito and Goovaerts 2002;Hofmann et al. 2010).
The strong spatial variability and localization of the characteristics of the soil pollution of a site often occur in response to the production and pipeline layout, the pollution source distribution, the pollutant properties, and the soil properties Wu et al. 2011a;Liu et al. 2013). There are three main types of soil pollution, each with local site characteristics that are noticeable, such as (Armiento et al. 2011;Girault et al. 2016;Monaco et al. 2015): (1) a small number of high peak values exist in isolation; (2) the concentration of the pollutants decreases sharply during the transition from extremely high values to low values in a neighborhood; and (3) the change in the pollutant concentration in other regions is relatively small. The smoothing effect and the global single spatial gradient expression of the commonly used interpolation models blur the local features and reduce the accuracy of the prediction, which is not conducive to the delimitation of the remediation boundary (Xie et al. 2011;Huo et al. 2010;Robinson and Metternicht 2006;Ding et al. 2017).
In order to improve the prediction accuracy, normal transformation was applied to the highly skewed data for a site (Juang et al. 2001;Wu et al. 2006), multiple interpolation models were jointly used in the different subregions (Wu et al. 2011a), and auxiliary factors (e.g., topographic features, pH values, and organic matter contents) were combined using the kriging method (Fu et al. 2018;Vyas et al. 2004;Schnabel and Tietje 2003). However, the presence of several high peak values caused the difficulties in the normal transformation (Wu et al. 2011a), and the limited sampling points decreased the accuracy of the interpolation in the subregions (Modis et al. 2008). A cokriging method with auxiliary variables can be used to obtain relatively accurate spatial distribution information for pollutants (Le et al. 2019;Tziachris et al. 2017). This is because cokriging method can incorporate both spatial and intervariable correlations into the spatial interpolation (Juang et al. 1996). Auxiliary factors can be used to calibrate the prediction results of the target pollutants to a certain extent. Nevertheless, the acquisition of auxiliary variables increases the sampling and analysis workload.
The pollutants related to the target pollutant are an easily achieved auxiliary factor. According to the sampling and measurement processes, after the soil samples are obtained, the procedures for analyzing various pollutants of the same type, e.g., heavy metal pollutants (HMs), semivolatile organic pollutants (SVOCs), and volatile organic pollutants (VOCs), are basically the same, and their data can be obtained almost simultaneously (USEPA 2014(USEPA , 2018(USEPA , 2017. Moreover, the sources of multiple pollutants at the same site are roughly the same, which leads to correlations among the various pollutants (Li et al. 2013a, b;Rashed 2010). Therefore, the related pollutants have the potential to be a good source of auxiliary data for use in the cokriging method (Yang et al. 2016). However, it is unclear how much influence cokriging using relevant pollutants has on the accuracy of the prediction since little relevant research has been done to analyze this problem in detail (Zhang and Yang 2017;Juang and Lee 1998).
In this study, we take three types of pollutants (HMs, SVOCs, and VOCs) in the soils of an abandoned coking plant in the central part of Hebei Province, China, as the target pollutants, and the prediction effects of cokriging using related pollutants of the same type were analyzed. This study was conducted (1) to evaluate the extent of the accuracy improvement in the accuracy of the cokriging analysis by using related pollutants compared to the ordinary kriging and (2) to analyze the uncertainty in the results of the application of cokriging using related pollutants. The results of this study provide a useful method for determining the spatial distribution prediction of pollutants in industrial sites.

Sample collection and analysis
This study focuses on an abandoned coking plant with 0.175 km 2 in Hebei province, China, which mainly produced coke, coal gas, coal tar, ammonia, and some other coal chemical products. The abandoned manufactured gas plant began operating started in 1988 and closed in 2017. In 2018, an environmental investigation of the site was conducted, and 300 soil samples were collected from 126 sample points using direct-push drilling methods (Fig. 1).
For each sample, about 5 g of soil was collected in a non-perturbed way and was placed in a 40-mL headspace bottle with 5 mL of methanol in preparation to analyze the VOCs. Approximately 300 g of soil was placed in a 250-mL glass bottle for the heavy metal and semi-volatile organic pollutant analyses. Subsequently, all of the samples were placed in a sample storage box at 4 °C, and then, they were immediately sent to the laboratory for monitoring and analysis. The HMs, SVOCs, and VOCs in the soil samples were determined respectively following the US Environmental Protection Agency Methods: 6020B (ERF) (USEPA 2014), 8270E (ERF) (USEPA 2018), 8260D (ERF) (USEPA 2017), respectively.

The cokriging principle
Cokriging is a geostatistical method used to estimate the target variables using measurement data for several auxiliary variables. It not only combines the autocorrelation of the target variable and the auxiliary variable, but it also determines the relationships between multiple auxiliary variables and the target variable. If Z 1 , ..., Z m is the value of Z 1 (x,…,Z m (x)) at point x, and Z(x) = Z 1 (x), ..., Z m (x) , then the formula of cokriging is where Γ i is the weight vector. In order to determine Γ i , it is necessary to consider the variable Z j (j = 1, 2, ..., m) at position x as a random variable. It is also required that the cokriging estimation formula provides an unbiased estimation, and the sum of the estimation variance of the corresponding variables is the minimum.
The fitting of the cross semi-variogram is used to describe the spatial continuity of the crossover between the different variables. The formula for calculating the cross-semivariogram is shown as Eq. 2 (Li et al. 2013a, b;Zhou et al. 2016;Lu et al. 2012).
where h is the separation distance or lag distance, N(h) is the number of data pairs for the separation distance h, and a i and b i indicate the primary and the target pollutant and the related pollutant, respectively.
The interpolation formula for predicting the spatial distribution of the target pollutant based on the related pollutant estimates the value at location x 0 as shown in Eq. 3.
where Z * (x 0 ) is the predicted value, N is the number of samples, a i and b j are the cokriging weights, Z(x i ) is the measured value of the property of the target pollutant, and Z(x j ) is the value of the related pollutant, respectively.
Equation 3 shows that Z * (x 0 ) is obtained by weighted averaging of the observations of Z(x i ) and Z(x j ) at their spatial locations. The weighting coefficients (a i and b j ) are determined based on the distance between the measurement points of the two variables and location x 0 , the distance between the measurement points, the semi-variograms of the two variables and their cross-semi-variograms. Thus, in the weighted averaging of the two measurements, the auxiliary information provided by the related pollutants can be

The principle for determination of target pollutants and their related analysis
In order to better analyze the prediction effect of the spatial distribution of the target pollutants based on the related pollutants, in this study, we chose the target pollutants based on the following principles.
Abnormal values have a relatively high deviation, which reflects the characteristics of the soil pollution of the industrial site. In this section, the level of deviation was characterized by the ratio of the highest concentration to the sum of the mean plus three times the standard deviation (abbreviated as R h/sum ) (Eq. 4).
where R h/sum is the level of deviation. A high R h/sum value indicates that the abnormal value deviates far from the norm. The Mean is the mean values of all of the concentration data, and SD is the standard deviation of all of the concentration data. Values of greater than (Mean + 3 × SD) are defined as extreme outliers.
The variability in the pollution data is stronger for the respective types, which is characterized by the CV. This is another characteristic of soil pollution at industrial sites. The higher deviation in the abnormal values and a stronger variability in the pollution data reflect the characteristics of the soil pollutants of the industrial site.
There are several related pollutants, which are strongly correlated with the target pollutants. These related pollutants can be used as the auxiliary pollutants in the spatial interpolation of the target pollutants.

Data analysis method
The correlation analysis was performed using SPSS 16.0 (IBM, Armonk, New York, USA). The semi-variogram model was fitted using the GS + 10.0 (Gamma Design Software LLC., Plainwell, MI, USA) and 'gstat' package. The spatial interpolation and cross-validation were performed using ArcGIS 10.1 (Esri, Redlands, CA, USA). The root mean square error (RMSE) was used as the index of the cross validation (Eq. 5). The accuracy of spatial interpolation is higher when the value of the RMSE is lower. The variability is characterized by a coefficient of variance (CV). When CV < 0.1, the spatial variability is weak, and when 0.1 < CV < 0.9, the spatial variability is moderate (Gao and Shao 2012).
where Z(x i ) is the measured concentration at location x i ; Z*(x i ) is the predicted concentration based on the interpolation method at location x i , and n is the number of samples.
The cross-validation conducted in this research refers to the "leave one out" method, which involves the following steps. (1) Temporarily remove a measurement value Z(x 1 ) from the dataset, and the values of the related pollutants are also kept out. (2)

Descriptive statistics of target pollutants
According to the principles in "The principle for determination of target pollutants and their related analysis", Cu, phenanthrene, and toluene, which are HMs, SVOC, and VOC pollutants, respectively, were selected as the target pollutants. The auxiliary pollutants were determined based on the results of the correlation analysis, which revealed a significant correlation with the target pollutants at the 0.01 level. The auxiliary pollutants for Cu are Cd, As, Pb, and Ni; for phenanthrene are fluoranthene, pyrene, chrysene, and benzaanthracene; for toluene are benzene, styrene, and m/p-xylene. The descriptive statistics of these target pollutants, such as their minimum values (Min), maximum values (Max), CVs, and R h/sum values, are shown in Fig. 2.

Degree of improvement in the prediction accuracy of the spatial distribution trend of the target pollutants in combining with their various related pollutants
The prediction accuracies of the three target pollutants in combination with their various related pollutants are shown as Fig. 3. When combined with the related pollutants, the prediction error which is indicated by RMSE of the spatial distribution of the target pollutants obtained using singlevariable cokriging is lower. For Cu combined with As, the reduction in the RMSE was approximately 37%. For toluene combined with benzene, the reduction in the RMSE was 76.8%. For phenanthrene combined with four related pollutants, the reduction in the RMSE was approximately 79%. However, the reduction in the RMSE for Cu combined with Ni and Pb, toluene combined with M/p-xylene, and phenanthrene combined with benzaanthracene were lower. Therefore, Ni and Pb were not used as auxiliary pollutants for Cu in the multivariable cokriging; M/p-xylene was not used for toluene, and benzaanthracene was not used for phenanthrene. The reductions in RMSEs of Cu, toluene, and phenanthrene using multivariable cokriging were 68.4%, 81.6%, and 81.2%, respectively. Moreover, the RMSEs calculated for the multivariable cokriging were lower than those obtained by only combining one related pollutant.

General distribution trend of the interpolation results
The contour lines in Fig. 4 were extracted from the raster layer predicted using the OK method. The region in Fig. 4  with the dense contour distribution is the area with the high concentration data predicted using the OK method. The contour interval is the same in Fig. 4(I-VI) was same. This is also true for Fig. 4(a-f) and Fig. 4(i-v).
Overall, the spatial distributions of Cu, toluene, and phenanthrene predicted using the OK interpolation and using the related pollutants are similar. The correlation coefficients between the raster layers interpolated using the OK method for Cu with and without the related pollutants are all greater than 0.7, while the correlation coefficients of the raster layers for phenanthrene and toluene are all greater than 0.9. Among them, the spatial transition in the Cu content predicted using the related pollutants ( Fig. 4(II-VI)) is relatively gentle compared with that predicted without the related pollutants ( Fig. 4(I)). Furthermore, the long axis direction of the Cu and phenanthrene distributions is both northwest-southeast, while the higher concentration of toluene is located in the northern part of the study area.

Prediction accuracy of the high value area
In the process of the investigation and remediation of a contaminated site, the recognition accuracy of the high pollution area is the main concern. In this study, the phenanthrene contents of 5.08% of the sample points exceeded the screening level for the soil environmental risk assessment of sites (DB11/T 811 2011) (40 mg/kg) (DB11/T 811 2011), while none of the Cu and toluene content of the samples sites exceed their corresponding screening values. In order to analyze the differences in the high value region prediction results of the interpolation methods with different combinations of the related pollutants, the 93rd percentile (80 mg/ kg) of the Cu content and the 75th percentile (9 mg/kg) of the toluene content, which make the recognition area easy to identify and analyze, were taken as the boundary of the high value region. The boundaries of the high value region are denoted by the labeled red lines in Fig. 4.
According to Fig. 4, the positions exceeding 40 mg/kg, 80 mg/kg, and 9 mg/kg of Cu, toluene, and phenanthrene, respectively, predicted with and without using the related pollutants are the same, but the area is different. The area enclosed by the Cu contour without related pollutants is the smallest, and the area for the combination of Cu and Pb is the largest. In addition, the area for a combination of Cu with Cd and As falls between those for only Cu and for Cu and Pb. For phenanthrene, the area for the combination of fluoranthene, pyrene, and chrysene is the smallest compared with Fig. 4(a-e). For toluene, the area for the combination of toluene and styrene is the smallest, and the area for the combination of toluene and benzene is the largest. In addition, the area for the combination of toluene with benzene and styrene falls between the areas of the other two combinations.
The prediction accuracy of the area of the different combinations, i.e., the prediction accuracy of the high value area, was judged based on the number of sampling points with values greater than the standard pollutant content values inside and outside the high value area. According to the statistical results, of the 118 sampling points, 9 samples had Cu contents greater than 80 mg/kg, 6 samples had phenanthrene  (1) for Cu predicted using the OK interpolation method and (2-6) for Cu combined with its related pollutants using the CK interpolation method. The contour lines for a phenanthrene predicted using the OK interpolation method and b-f for phen-anthrene combined with its related pollutants using the CK interpolation method. The contour lines i for toluene predicted using the OK interpolation method and ii-v for toluene combined with its related pollutants using the CK interpolation method contents greater than 40 mg/kg, and 29 samples had toluene contents greater than 9 mg/kg. According to Table 1, all 9 of the samples with Cu contents of greater than 80 mg/kg were identified using cokriging and a combination of Cu and Pb, and using a combination of Cu and both Cd and As. However, the high pollution area predicted using the combination of Cu and both Cd and As was smaller than that predicted using the combination of Cu and Pb (Fig. 4(III-VI)). Therefore, the locations with Cu and both Cd and As and the prediction area of the high value region are small, which lower the remediation. Thus, the prediction results are ideal.
For phenanthrene, although all of the points with phenanthrene contents of greater than 40 mg/kg were accurately identified, the points not exceeding 40 mg/kg were also included in the high value area. The error recognition rate of the prediction results of the combination of phenanthrene with benzaanthracene, chrysene, pyrene, and fluoranthene was approximately 50%, while that of the combination of phenanthrene and fluoranthene, pyrene, and chrysene was only about 33% (Table 1). In addition, the prediction area of the high value region is the smallest (Fig. 4(a-f)). Therefore, the prediction result for the combination of phenanthrene with fluoranthene, pyrene, and chrysene is more accurate, and the repair cost is the lowest.
The error recognition rates for toluene were all 60%, except for the combination of toluene with benzene and styrene (67%) (Table 1). However, the sample points with toluene contents of greater than 9 mg/kg were distributed in the high value area predicted in the central region as shown in Fig. 4(i-v), which is the region that was predicted accurately. The areas of the regions that were predicted accurately in Fig. 4(i-v) are 23,485 m 2 , 22,488 m 2 , 21,278 m 2 , 23,680 m 2 , and 20,876 m 2 , respectively. Therefore, the prediction result of toluene combined with benzene and styrene was better than the prediction result of toluene combined with other related pollutants.

Discussion
The effect of the correlation between the content of the related pollutants and the contents of the target pollutants In this study, there is a significant positive correlation between the contents of the related pollutants and the contents of target pollutants, which can be used to calibrate the descriptive results of the spatial distribution characteristics of the target pollutants to some extent. The reduction in the RMSE of the spatial distribution of the target pollutants combined with the related pollutants increases as the correlation between the related pollutants and the target pollutants increases (Fig. 5). This result is consistent with the results of previous studies (Khosravi and Balyani 2019). The influence of the number of neighboring samples and the total number of sample points on the prediction accuracy

The influence of the number of neighboring samples on the prediction accuracy
Based on the predicted results, for combinations of auxiliary pollutants, the prediction error of the target pollutants is reduced, i.e., RMSE of Cu > RMSE of Cu-auxiliary pollutants, RMSE of toluene > RMSE of toluene-auxiliary pollutants, RMSE of phenanthrene > RMSE of phenanthreneauxiliary pollutants. The advantages of combining the target pollutant with auxiliary pollutants are demonstrated by the results of this study (Zhen et al. 2019;Wu et al. 2006). In addition, the prediction error of a target pollutant combined with multiple correlation auxiliary pollutants is smaller than that of combination with a single correlation auxiliary pollutant, i.e., RMSE of Cu-Cd-As < RMSE of Cu-Cd and RMSE of Cu-As, RMSE of toluene-benzenestyrene < RMSE of toluene-benzene and RMSE of toluene-styrene, RMSE of phenanthrene-fluoranthene-pyrenechrysene < RMSE of phenanthrene-fluoranthene, RMSE of phenanthrene-pyrene, and RMSE of phenanthrene-chrysene. This phenomenon is consistent with the results of previous studies (Zhang and Yang 2017).
Furthermore, the RMSEs shown in Fig. 6 basically increase as the number of neighborhood sample points increases, and then, they become stable. This is dependent on the soil pollution characteristics of the site and the RMSE principle. According to the computational formula for the RMSE, the value of the RMSE is greatly affected by the maximum difference between the measured value and the predicted value. For a contaminated site, the maximum error is mainly located in the extremely high value position (underestimated).
A small number of extremely high values in the site deviate from the overall trend of pollutant content. When the number of neighboring samples increases, the prediction result of the high values is greatly affected by the low values in the neighborhood. Therefore, the underestimation of the extremely high values is more significant for higher RMSEs. In addition, because the variability in the low value area, except for the high values, is relatively small, when the number of neighboring points increases to a certain number, the impact on the prediction results of the high values is small. This result is contrary to the results of previous studies (Zhang and Yang 2017). This is due to the differences in the spatial variation characteristics of the research objects. Soil organic matter is a type of non-point source pollution, with a strong spatial continuity. Increasing the number of neighboring sample points results in the better characterization of the pollutant content of the point to be determined using the distribution characteristics of the neighboring sample points. However, the spatial heterogeneity of the pollutants at the site is strong, and the local characteristics are clear (Wu et al.

The relationships between the number of sample points and the accuracy of prediction combination with auxiliary pollutants
Reducing the number of investigation sample points reduces the accuracy of the prediction, while adding auxiliary pollutants improves the accuracy of the prediction. Therefore, theoretically, after adding auxiliary pollutants, the number of investigation sample points can be reduced to a certain extent, and the cost of the investigation can be decreased. The relationship between the number of samples removed and the accuracy of the prediction is analyzed in this section.
Because of the strong correlation between a target pollutant and its auxiliary pollutant(s), the sample points were removed based on the spatial distribution characteristics of the target pollutant. The principle of sample removal is as follows.
(1) Maintain the local characteristics of the site, i.e., keep the maximum and minimum values in the data. (2) Keep the low value near the high value point to maintain the local characteristics of the site. (3) Using the group analysis method in the ArcGIS software, divide the pollutant data into 20 groups, and evenly remove points from the level with the highest number of points.
The RMSE increases as the number of sample points removed increases. When 84 sample points are removed, RMSE of toluene-benzene-styrene)≈RMSE of toluene based on the total number of samples. This is due to the significant clustering distribution of the toluene concentration. After removing points from the categories containing more sample points, the detail representation weakens and the RMSE gradually increases.
The cross distribution of the high and low concentrations of Cu and phenanthrene is significant. RMSE of Cu-Cd-As and RMSE of phenanthrene-fluoranthene-pyrene-chrysene initially increased and then decreased as the number of sample points removed increases. This is because at the beginning of the sample removal, the number of sample points decreased, and the spatial distribution details of the Cu and phenanthrene concentrations could not be characterized, so the RMSE increased. As the number of sample points used continued to decrease (i.e., increased removal of points), the local variation characteristics of the site were lost, resulting in a weak overall variability, easy fitting, and reduced RMSE. When the number of sample points removed was 61 and 34 for Cu and phenanthrene, respectively, RMSE of Cu-Cd-As and RMSE of phenanthrene-fluoranthenepyrene-chrysene were close to RMSE of Cu and RMSE of phenanthrene based on the total number of samples.

Conclusions
Based on the related pollutants, the reduction in RMSEs with cokriging was higher than 68.4%, which is larger than the reduction in their RMSEs when they are combined with a single-auxiliary variable. In addition, the reduction in RMSEs of the spatial distributions of the target pollutants combined with their auxiliary pollutants increases with Fig. 6 RMSEs of a Cu, b toluene, and c phenanthrene with different numbers of neighboring points and different combinations of auxiliary pollutants increasing correlation between the auxiliary pollutants and the target pollutants. Furthermore, for the target pollutants combined with their related pollutants, the prediction accuracies of the high value regions are improved. In addition, with the same prediction accuracy, the number of samples required by Co-Kriging is less than ordinary kriging. The research results of this study provide important information for improving the prediction accuracy of the spatial distributions of soil pollutants and reducing remediation cost.