Identification of the High-Risk Area for Schistosomiasis Transmission in China Based on Information Value and Machine Learning—A Newly Data-Driven Modeling Try


 Objective

Information value (IV) and machine learning models were used to analyze and predict the high-risk distribution of schistosomiasis, in order to provide scientific evidence for disease surveillance and control in China.
Methods

The local case distribution from schistosomiasis surveillance data in China between 2005 and 2019 was assessed based on 19 variables including climate, geography, and social economy. Seven models were built in three categories including IV, three machine learning models (logistic regression, LR; random forest, RF; generalized boosted model, GBM), and three coupled models (coupled model of information value and logistic regression, IV + LR; coupled model of information value and random forest, IV + RF; coupled model of information value and generalized boosted model, IV + GBM). Accuracy, AUC (area under the curve), and F1-score were used to evaluate the prediction performance of the models. The best model was selected to predict the risk distribution for schistosomiasis.
Results

IV + GBM had the highest prediction effect (accuracy = 0.878, AUC = 0.902, F1 = 0.920). The results of IV + GBM showed that the risk area for transmission comprised 4.66% of China, mainly distributed in the coastal regions of the middle and lower reaches of the Yangtze River, the Poyang Lake region, and the Dongting Lake region. Risk areas can be divided into low-risk (2.47%), medium-risk (1.35%), and high-risk (0.84%). High-risk areas are primarily distributed in eastern Changde, western Yueyang, northeastern Yiyang, middle Changsha of the Hunan Province, southern Jiujiang, northern Nanchang, northeastern Shangrao, eastern Yichun in Jiangxi Province, southern Jingzhou, southern Xiantao, middle Wuhan in Hubei Province, southern Anqing, northwestern Guichi, eastern Wuhu in Anhui Province, middle Meishan, northern Leshan, and the middle of Liangshan in Sichuan Province.
Conclusions

The risk of schistosomiasis transmission in China still exists, with high-risk areas relatively concentrated within regions. Coupled models of IV and machine learning provide for effective analysis and prediction, forming a scientific basis for surveillance and control within key areas.

As one of 20 neglected tropical diseases, schistosomiasis is a typical zoonotic parasitic disease that remains a major public health problem worldwide [1]. In the 1950s, schistosomiasis was endemic in 12 southern Chinese provinces in close proximity to the Yangtze River. China was one of the countries with the heaviest schistosomiasis burden, with more than 10 million patients. Over the past 70 years of active control, China's schistosomiasis control program has achieved remarkable success [2]. By the end of 2020, 337 (74.89%) of the 450 schistosomiasis endemic counties in China had achieved the elimination standard, 97 (21.56%) had achieved the transmission blocking standard, and 16 (3.55%) had achieved transmission control [3]. China's "13th Five-Year Plan" for national schistosomiasis control identifies risk monitoring and early warning as essential to reducing potential transmission risk. Prediction model design is an effective means by which to achieve accurate monitoring and precise control of schistosomiasis [4].
There are two development stages or methods for infectious disease risk prediction: a knowledge-driven method (qualitative method), and a data-driven method (quantitative method) [5]. There are four components to the process of development: epidemic data processing, environmental factor selection, model construction, and model evaluation. In particular, the application of GIS (geographic information system), RS (remote sensing), and GPS (global positioning system) in infectious disease research accelerates the development of quantitative risk prediction [6]. Commonly used qualitative methods are the analytic hierarchy process (AHP) and the Delphi method. For example, Ajakaye et al. [7] used AHP to evaluate the transmission risk of schistosomiasis in Nigeria. Yang et al. [8] used the Delphi method to establish a schistosomiasis early warning index in the middle and lower reaches of the Yangtze River. The results for early warning were consistent with epidemic levels based on a recent epidemiological survey.
A single quantitative method or a combination of multiple quantitative methods is frequently used. Solano-Villarreal et al. [9] used a boosted regression tree to study the transmission risk of malaria in the Loreto area. Xia et al. [10] combined a variety of classification algorithms, including random forest (RF) and a generalized boosted model (GBM) in BioMod2, to construct a combined model that predicted the potential distribution of Oncomelania hupensis (O. hupensis) in the Dongting Lake region. The combined model had greater prediction accuracy.
Information value (IV) is derived by statistical quantitative analysis of data based on information theory. An IV model relates the factors influencing an epidemic to an evaluation of risk for the region [11]. As an example, Rai [12] used IV to establish a malaria susceptibility index. IV has high modeling efficiency and can judge the weight of the different levels within each influencing factor. Classification algorithms such as logistic regression (LR), RF, and GBM determine the weight of each influencing factor [5]. IV and classification algorithms can predict vector infectious diseases during the initial stage. For example, Chen et al. [13] used a coupled IV and LR model (IV + LR) to predict hot spots of HFRS (hemorrhagic fever with renal syndrome) in Hunan Province. The coupled model takes into account the merits of LR and IV, resulting in more reliable and practical prediction. Based on epidemic data and related environmental factors, we used IV combined with LR, RF, and GBM to evaluate and predict the risk for schistosomiasis transmission. The purpose of this study was to provide a methodological basis for monitoring and control of epidemic schistosomiasis and to contribute to the theoretical knowledge of infectious disease prediction.

Data collection
Case and non-case data
Schistosomiasis data were derived from the national schistosomiasis survey of 2005 to 2019 [14]. Villages with local infection cases were selected as distribution points (Fig. 1). Longitude and latitude coordinates of the distribution points were identified with the Baidu map coordinate picking system (http://api.map.baidu.com/lbsapi/getpoint/index.html). Because no data were available for non-case (absence) points, and in order to increase the discrimination of environmental factors, this study randomly selected coordinate points as non-case points in non-endemic counties adjacent to schistosomiasis endemic counties at a ratio of 1:2.

Environmental data
Environmental variables related to schistosomiasis and its vector snail distribution were collected: ten climate variables, six geographical variables, and three socio-economic variables (Table 1). Among the climate-related variables, four types of background meteorological data were derived from the Resource and Environmental Science and Data Center of the Chinese Academy of Sciences (http://www.resdc.cn/) and represent conventional climate conditions. The other six bioclimatic variables were obtained from the high-resolution climate data website WorldClim (https://www.worldclim.org/): mean diurnal temperature range (BIO2), temperature annual range (BIO7), mean temperature of the warmest quarter (BIO10), mean temperature of the coldest quarter (BIO11), precipitation of the wettest quarter (BIO16), and precipitation of the driest quarter (BIO17). These represent extreme climatic conditions and limit the distribution range of S. japonicum and O. hupensis. Among the geographic variables, elevation and the annual normalized vegetation index were from the Resource and Environment Science and Data Center of the Chinese Academy of Sciences (http://www.resdc.cn/). Landform types and land use types were from the National Earth System Science Data Sharing Platform (http://www.geodata.cn). Distance to waterways was obtained from WorldPop (https://www.worldpop.org/). Socio-economic variables including gross domestic product, population density, and night light were derived from a map of China. All environmental variables were clipped to the same spatial extent in ArcGIS 10.2 and then resampled to a spatial resolution of 1 km × 1 km.

Information value
IV [13] uses the frequency or density of schistosomiasis occurrence to reflect the risk effect of different influencing factors and their sub-intervals. An IV is calculated that represents the contribution of each influencing factor to the occurrence of schistosomiasis. A regional risk assessment for schistosomiasis transmission is realized through the spatial superposition of multi-factor information [13]. The formula is as follows: When I is positive, the combination of multiple factors increases the risk of schistosomiasis in a grid cell; otherwise, the combination is not conducive to the occurrence of schistosomiasis. The IV model was implemented in the R language (version 4.0.0; R Core Team 2020) using the "scorecard" package.
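The formula for I is not reproduced in this text. As an illustration only, the sketch below assumes the formulation commonly used in information value models, in which the information value of one level of one factor is the natural log of the share of cases falling in that level divided by the share of grid cells falling in that level, and the value for a grid cell is the sum over its factors. The function names and all counts are hypothetical, and the definition used by the authors (or by the "scorecard" package) may differ.

```python
import math

def information_value(cases_in_level, total_cases, cells_in_level, total_cells):
    """Assumed form: I = ln((N_ij / N) / (S_ij / S)) for one level of one factor.

    A level with zero cases would need special handling (log of zero).
    """
    case_share = cases_in_level / total_cases
    area_share = cells_in_level / total_cells
    return math.log(case_share / area_share)

def total_information(levels):
    """Sum the information values of the factor levels that apply to one grid cell."""
    return sum(information_value(*level) for level in levels)

# Hypothetical counts: (cases in level, all cases, cells in level, all cells)
near_water = (80, 100, 200, 1000)   # 80% of cases on 20% of the area
high_slope = (5, 100, 300, 1000)    # 5% of cases on 30% of the area

i_near_water = information_value(*near_water)   # positive: level favours occurrence
i_high_slope = information_value(*high_slope)   # negative: level disfavours occurrence
i_cell = total_information([near_water, high_slope])
```

Under this assumed form, a level that holds a larger share of cases than of area gets a positive value, which matches the interpretation of the sign of I given above.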

Machine learning
A logistic regression model (LR) [15] is a statistical nonlinear classification method based on the logit transformation, which is widely used in classification and prediction tasks due to its simplicity, rapidity, and relative accuracy. LR takes the selected factors as independent variables and the occurrence of the outcome (occurrence is 1, non-occurrence is 0) as the dependent variable. LR was implemented with the "h2o" package in the R language (version 4.0.0; R Core Team 2020).
A random forest model (RF) [16] is a predictive model based on statistical analysis principles formed by the combination of multiple decision trees. The basic random forest concept is to use the bootstrap to extract k samples from the original training set, with each sample the same size as the original training set. Next, k decision tree models are established for the k samples to obtain k classification results. Finally, a vote on each record is used to determine a final classification according to the k classification results. RF was implemented with the "randomForest" package in the R language (version 4.0.0; R Core Team 2020).
A generalized boosted model or gradient boosting machine (GBM) [17] is based on two algorithms: regression trees and gradient boosting. It builds multiple regression trees on the basis of self-learning and multiple random selections. The process of multiple fittings can gradually reduce the error of model fitting, and in turn, the simulation accuracy of the regression tree is stably improved. For classification problems, the GBM algorithm only needs to limit the base classifier in the AdaBoost algorithm to a classification tree. The GBM model was implemented with the "h2o" package in the R language.
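The bootstrap-and-vote procedure described for RF can be sketched in a few lines. This is a toy illustration, not the "randomForest" implementation: each decision tree is replaced by a hypothetical one-threshold "stump" on a single feature, the records are invented, and a real random forest also samples candidate features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records with replacement from the training set."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Hypothetical stand-in for a decision tree.

    Chooses the threshold (among observed x values) with the fewest errors
    in the sample when predicting 1 for x <= threshold and 0 otherwise.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x <= threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def forest_predict(thresholds, x):
    """Majority vote over the k stumps."""
    votes = Counter(int(x <= t) for t in thresholds)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
# Invented records: (distance to waterway in km, case label)
data = [(0.3, 1), (0.8, 1), (1.5, 1), (2.0, 1),
        (3.5, 0), (4.0, 0), (5.5, 0), (7.0, 0)]

k = 25  # odd, so a binary vote cannot tie
thresholds = [fit_stump(bootstrap_sample(data, rng)) for _ in range(k)]

near = forest_predict(thresholds, 0.5)  # point close to a waterway
far = forest_predict(thresholds, 6.0)   # point far from a waterway
```

Individual stumps can be poor when a bootstrap sample happens to miss most cases, but the vote over k stumps is stable, which is the motivation for the ensemble.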

Model coupling
The coupled models (IV + LR, IV + RF, and IV + GBM) were obtained by using the calculated information value I to replace the corresponding frequency ratio in LR and the sample variable values in RF and GBM.
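One way to read this coupling step is as a re-encoding of the inputs: each raw variable value is replaced by the information value of the bin it falls into, and the encoded rows are then passed to the classifier. The sketch below illustrates that reading; the bin edges, information values, and names are hypothetical and are not taken from the study's tables.

```python
from bisect import bisect_right

def make_iv_encoder(bin_edges, bin_ivs):
    """Return a function mapping a raw value to the IV of its bin.

    bin_edges are interior cut points in ascending order; bins are
    left-closed, so a value equal to an edge falls in the upper bin.
    """
    if len(bin_ivs) != len(bin_edges) + 1:
        raise ValueError("need exactly one IV per bin")

    def encode(value):
        return bin_ivs[bisect_right(bin_edges, value)]

    return encode

# Hypothetical bins and information values for two variables
encoders = {
    "distance_km": make_iv_encoder([1.0, 2.5], [1.2, 0.4, -0.9]),
    "elevation_m": make_iv_encoder([100.0], [0.8, -0.6]),
}

def encode_row(row):
    """Replace each raw variable value with the IV of its bin (fixed column order)."""
    return [encoders[name](row[name]) for name in sorted(encoders)]

row = {"distance_km": 0.5, "elevation_m": 250.0}
encoded = encode_row(row)  # input vector for LR, RF, or GBM
```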

Model evaluation
The sample data were randomly divided into two parts: 75% as training samples for model construction, and 25% as test samples to evaluate accuracy. A confusion matrix was used to reflect the comprehensive performance of the models. The accuracy, AUC (area under the curve), and F1-score derived from the confusion matrix were used to evaluate the prediction effect. Accuracy = (a + d) / (a + b + c + d) and F1 = 2PR / (P + R), where P = a / (a + b) is the precision, R = a / (a + c) is the recall, and a, b, c, and d are the numbers of true positives, false positives, false negatives, and true negatives, respectively. The higher the accuracy and F1, the better the prediction effect of the model [18]. The AUC is derived from the receiver operating characteristic curve, which takes the true positive rate, a / (a + c), as the ordinate and the false positive rate, b / (b + d), as the abscissa across a series of classification thresholds. The AUC ranges from 0.5 to 1, where 0.5 represents a completely random classification and 1 represents a completely correct classification, so the larger the AUC value, the better the performance of the model [19].
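As a worked example of these indicators, the sketch below computes accuracy, F1, and AUC for a small invented test set. The labels, scores, and the 0.5 cut-off are assumptions for illustration. The AUC is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen non-case), which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (a, b, c, d) = (TP, FP, FN, TN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    a = sum(1 for t, p in pairs if t == 1 and p == 1)
    b = sum(1 for t, p in pairs if t == 0 and p == 1)
    c = sum(1 for t, p in pairs if t == 1 and p == 0)
    d = sum(1 for t, p in pairs if t == 0 and p == 0)
    return a, b, c, d

def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)

def f1_score(a, b, c):
    precision = a / (a + b)
    recall = a / (a + c)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Share of (case, non-case) pairs ranked correctly; ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented test set: true labels and predicted risk scores
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed cut-off of 0.5

a, b, c, d = confusion_counts(y_true, y_pred)
acc = accuracy(a, b, c, d)
f1 = f1_score(a, b, c)
auc_value = auc(y_true, scores)
```

Accuracy and F1 depend on the chosen cut-off, whereas AUC summarizes ranking quality over all cut-offs, which is why the three are reported together.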

Risk visualization analysis
We selected the optimal model based on the evaluation indicator and calculated the transmission risk index for the study area. Then, the area was divided into four levels: no-risk area (0.00 to 0.40), low-risk area (0.41 to 0.60), medium-risk area (0.61 to 0.80), and high-risk area (0.81 to 1.00) [20].
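The four-level classification can be expressed as a simple threshold rule. In the sketch below, the published intervals are read as upper-bound inclusive (for example, an index of exactly 0.40 is no-risk); that reading is an assumption, since the text gives the bounds only to two decimals. The function names and the example grid values are hypothetical.

```python
from collections import Counter

def risk_level(index):
    """Map a transmission risk index in [0, 1] to one of the four levels."""
    if not 0.0 <= index <= 1.0:
        raise ValueError("risk index must lie in [0, 1]")
    if index > 0.80:
        return "high"
    if index > 0.60:
        return "medium"
    if index > 0.40:
        return "low"
    return "none"

def level_shares(indices):
    """Percentage of grid cells falling in each risk level."""
    counts = Counter(risk_level(v) for v in indices)
    return {level: 100.0 * counts[level] / len(indices)
            for level in ("none", "low", "medium", "high")}

# Hypothetical risk indices for eight grid cells
grid = [0.10, 0.40, 0.41, 0.60, 0.61, 0.80, 0.81, 1.00]
levels = [risk_level(v) for v in grid]
shares = level_shares(grid)
```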

Correlation analysis among schistosomiasis and environmental factors
Based on the principle of chi-square binning, the upper limit on the number of bins was set to 8, and the IV of each level of the influencing factors was calculated according to the binning (Table 3). Schistosomiasis is more likely to occur when the annual average temperature is 11.5-19.0°C, the annual average rainfall is 1000-1550 mm, the dryness is 66%-92%, and the wetness index is 45%-70%. In terms of the geographic environment, the risk of schistosomiasis transmission is higher when the distance from waterways is less than 2.5 km, the altitude is less than 100 m, the land use is paddy field, grassland, or water area, and the landform type is plain. Among the socio-economic variables, schistosomiasis is more likely to be epidemic when population density is above 200, GDP is over 100, and night-time lights are above 0.12. Extreme climate and geographic conditions are not conducive to the spread of schistosomiasis: for example, annual rainfall of less than 1000 mm or more than 1550 mm, annual average temperature of less than 11.5°C or more than 19°C, average temperature during the hottest season of less than 27°C, rainfall in the wettest season of less than 500 mm, distance to the waterway of more than 3 km, and a slope greater than six (Table 4).
Prediction results for the three machine learning models had similarities and differences. All three machine learning models concentrated the possibility of schistosomiasis transmission mainly in the middle and lower reaches of the Yangtze River. LR indicated that risk was also distributed in northern Xinjiang and southwestern Tibet. RF showed a lower risk in southern Guangzhou. GBM showed a lower risk in northern Xinjiang. Prediction results for the three coupled models were better than those for the single models. North of the Yangtze River there was no obvious abnormal risk, although small differences in the detail of risk areas were observed. For example, IV + RF showed no obvious risk area in central Sichuan or northwestern Yunnan, as opposed to IV + GBM.
Accuracy, AUC, and F1 were calculated for each of the seven models to assess how well they predicted transmission risk (Table 5). Sorted by AUC, the models ranked as follows: IV + GBM > IV + RF > GBM > IV + LR > IV > RF > LR. Overall, the coupled models had the best results, followed by the three machine learning models, and then the IV model. The best of the three machine learning models was GBM, and the best of the three coupled models was IV + GBM (accuracy = 0.878, AUC = 0.902, F1 = 0.920).

Risk prediction of schistosomiasis transmission in China based on the optimal coupled model
Prediction results for IV + GBM showed the risk of schistosomiasis in China to be scattered through a large spatial range, although clusters appeared in southeastern Hubei Province, northeastern Hunan Province, northern Jiangxi Province, central Anhui Province, central Sichuan Province, northwestern Yunnan Province, and southern Jiangsu Province. Superimposed on the national river map, risk areas were concentrated in the coastal areas of the middle and lower reaches of the Yangtze River, the Poyang Lake region, and the Dongting Lake region.
Classification of transmission risk shows that 4.66% of China is in an at-risk area and 95.34% is not. Risk areas can be divided into low-risk (2.47%), medium-risk (1.35%), and high-risk areas (0.84%). High-risk areas are primarily distributed in eastern Changde, western Yueyang, northeastern Yiyang, middle Changsha of the Hunan Province, southern Jiujiang, northern Nanchang, northeastern Shangrao, eastern Yichun in Jiangxi Province, southern Jingzhou, southern Xiantao, middle Wuhan in Hubei Province, southern Anqing, northwestern Guichi, eastern Wuhu in Anhui Province, middle Meishan, northern Leshan, and the middle of Liangshan in Sichuan Province (Fig. 3). Medium-risk areas and low-risk areas are distributed in areas adjacent to high-risk areas, as well as southern Jiangsu and northwestern Yunnan.

Discussion
Due to the unique life history of S. japonicum and O. hupensis, as well as the numerous terminal hosts of S. japonicum, the epidemic process for schistosomiasis is exceedingly complex. Geographic, climatic, socio-economic, and other factors affect the scope and degree of schistosomiasis [21]. In this study, coupled models for IV and machine learning were used to evaluate factors that interfere with schistosomiasis transmission. A spatial distribution pattern of potential risks provided a support tool for the formulation of macroscopic schistosomiasis control strategies and the development of a quantitative risk assessment model for communicable diseases.
To the best of our knowledge, this is the first time that coupled models of IV and machine learning have been applied to schistosomiasis transmission risk. Coupled models were used to establish statistical relationships among case distribution and environmental factors, providing a new method for analysis and prediction of hot spots of schistosomiasis transmission. By comparing the seven model indicators, we found that coupled models have better prediction accuracy than IV and machine learning models alone. The prediction results more accurately reflected the spatial distribution of risk for schistosomiasis. Differences in prediction results and goodness of fit were found for the seven models, reflecting model uncertainty. A final, optimal model, IV + GBM, was selected to predict the risk for schistosomiasis transmission. That model reduced the errors associated with the other models. Machine learning algorithms cannot express the relationships among the internal levels of an influencing factor and the occurrence of schistosomiasis. IV does not consider differences in the weight contribution of influencing factors [22]. The coupled model performs better because it considers both the internal levels of the influencing factors and the weight of the different influencing factors in relationship to schistosomiasis [23]. Therefore, its risk prediction results are more scientific and reasonable.
Predicted medium-risk and high-risk areas based on the optimal coupled model were consistent with the areas of schistosomiasis transmission control and blocking in China [24]. Combined with the distribution of water areas in China, the coastal areas of the middle and lower reaches of the Yangtze River, the Poyang Lake region, and the Dongting Lake region are the high-risk areas for schistosomiasis spread. This is likely due to the wide distribution and high density of O. hupensis in those areas [25]. Further, there are numerous water conservancy projects, frequent population flow, developed animal husbandry industries, and increased opportunities for human and animal contact, placing these regions at risk for schistosomiasis rebound [26][27]. Comprehensive control strategies have focused on infection control, the distribution pattern of intermediate schistosomiasis hosts, and the composition and distribution of infection sources. However, the pattern of population activities has undergone significant changes due to floods, disasters [28], wetland construction [29], and global warming [30], which have increased the risk of snail spreading. Hence, there is a greater risk for infection in the areas described above. In the epidemic risk areas, we recommend O. hupensis monitoring, strengthened infection control of domestic and wild animals, and timely assessment of epidemic schistosomiasis. In this manner, the goal of schistosomiasis elimination by 2030 can be achieved [31].
The relationships among the spatial change of schistosomiasis risk and environmental factors can be explained by biological knowledge of S. japonicum and snails [32]. Suitable climatic conditions, small slopes, and proximity to rivers are conducive to the growth and reproduction of S. japonicum and snails [33], which in turn leads to the prevalence of schistosomiasis. This study demonstrates that temperature, rainfall, and altitude are closely related to the risk of schistosomiasis transmission. Abnormal climatic conditions have a negative impact on an epidemic, which confirms previous studies using different methods [34]. Certainly, environmental factors determine the transmission dynamics of schistosomiasis. Previous studies [35] have shown that land use greatly affects the distribution and density of snails in rice fields. When water is high and in proximity to a river, there is an increased risk for infection. This may be due to the increased risk from swimming, fishing, and agricultural activities in contact with water bodies containing cercariae [36]. This study did not find a high risk for schistosomiasis transmission in economically underdeveloped areas, which may be due to the large scope of the study and the lack of conditions for schistosomiasis prevalence in such areas, for example Xinjiang and Tibet. Further, results were based on surveillance data from 2005 to 2019 in China, which are accurate and reliable. However, there may be errors in the analysis of relationships among influencing factors and transmission risk due to insufficient case numbers.
This study has limitations. Firstly, although IV + GBM provided high goodness of fit, the potential risk for schistosomiasis remains uncertain because of other associated factors such as snail control, cattle grazing, water conservancy construction, and behaviors [37][38][39]. Secondly, risk prediction based on IV + GBM identified sporadic high risk in northern Zhejiang, which is inconsistent with the known elimination of schistosomiasis in Zhejiang. The environment in that area is very similar to that of the case distribution in this study, but schistosomiasis is no longer endemic there due to human intervention, which resulted in a false positive. In the future, more variables related to disease transmission should be collected, which would enrich the data set. Further, IV combined with more classification algorithms would improve assessment. These approaches would result in better predictive model performance and provide guidance for monitoring and early warning of disease in key areas.

Conclusions
This study confirmed that a model that combines IV and machine learning is better than a single model. Among the models, the optimal coupling model had a better predictive performance for schistosomiasis risk assessment, roughly consistent with the actual situation. These results can guide monitoring and control of schistosomiasis and serve as a reference for predicting the risk of other vector-mediated infectious diseases.

Study area and case distribution
Current risk prediction for schistosomiasis in China based on the optimal coupled model
Note: The designations employed and the presentation of the material on these maps do not imply the expression of any opinion whatsoever on the part of Research Square concerning the legal status of any country, territory, city or area or of its authorities, or concerning the delimitation of its frontiers or boundaries. These maps have been provided by the authors.