Study outline
Fig. 3 shows an overview of the study. Details of the various steps will be described in the following sections.
Fig. 3 Study outline
Data sources
Country boundaries
The shapefile of country boundaries (ne_50m_admin_0_countries.shp, version 4.1.0) was downloaded from "www.naturalearthdata.com" (Made with Natural Earth. Free vector and raster map data @ naturalearthdata.com).
Bioclimatic variables
The GeoTiff files of the 19 standard bioclimatic variables (22), with a spatial resolution of 2.5 minutes, for WorldClim version 2.1 were downloaded from "www.worldclim.org/data/worldclim21.html". They are derived from monthly temperature and rainfall values, averaged over the years 1970-2000, to generate more biologically meaningful variables. They are coded as shown in Table 4.
Table 4
Meteorological variables
High-resolution gridded data of meteorological variables from the CRU TS 4.05 dataset for the 2001-2010 and 2011-2020 periods were utilized (33). A list of the variables, along with their codes and units, is shown in Table 5.
Table 5
Global animal distribution
The Gridded Livestock of the World (GLW 3) database was used to obtain GeoTiff files containing global population density data for sheep, goats, cattle, and buffaloes with a spatial resolution of 0.083333 decimal degrees (34).
Elevation
The elevation GeoTiff data file (22) with a spatial resolution of 2.5 minutes was obtained from "www.worldclim.org/data/worldclim21.html."
Land cover
The dominant land cover GeoTiff file was obtained from the Global Land Cover Share (GLC-SHARE) database (35). The data has a resolution of 30 arc-seconds (~1 km²) and is categorized as 01-Artificial Surfaces, 02-CropLand, 03-Grassland, 04-Tree Covered Area, 05-Shrubs Covered Area, 06-Herbaceous vegetation, aquatic or regularly flooded, 07-Mangroves, 08-Sparse vegetation, 09-BareSoil, 10-Snow and glaciers, and 11-Waterbodies.
Bluetongue virus infection occurrence data
The occurrence of BTV infections, along with the dates and geographical coordinates, was downloaded as a .csv file from FAO's Global Animal Disease Information System (Food and Agriculture Organization, https://empres-i.review.fao.org//) for the period from January 1, 1970 to January 6, 2022.
Data preprocessing
Because the global animal distribution data did not cover Antarctica and Greenland, the corresponding areas were removed from all other data sources, using the shapefile of country boundaries as a template.
The gdal2xyz function of the Geospatial Data Abstraction Library (GDAL) tools plugin of QGIS was used to convert all GeoTiff image files into CSV files. The CRU TS 4.05 datasets for 2001-2010 and 2011-2020 were merged using the merge method of the pandas library.
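The merge of the two CRU TS periods can be sketched as follows; the column names and toy coordinates are illustrative assumptions, not the actual CRU TS variable names:

```python
import pandas as pd

# Hypothetical CSVs produced by gdal2xyz: one row per grid cell, with
# x/y coordinates and one value column per meteorological variable/period.
d1 = pd.DataFrame({"x": [10.0, 10.5], "y": [45.0, 45.0], "tmp_2001_2010": [8.1, 8.4]})
d2 = pd.DataFrame({"x": [10.0, 10.5], "y": [45.0, 45.0], "tmp_2011_2020": [8.6, 8.9]})

# Merge the two periods on the shared grid coordinates.
merged = d2.merge(d1, on=["x", "y"], how="inner")
print(merged.columns.tolist())
```

An inner join on the grid coordinates keeps only cells present in both periods.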
The prepared CSV vector layer files were spatially joined using the "Join Attributes by Nearest" tool of QGIS. The final CSV file had 72385 rows, including 13530 rows with BTV infection occurrence and 58855 rows without a disease outbreak.
The dominant land cover variable was converted (one-hot encoded) into a binary vector representation using the get_dummies method of the pandas library.
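A minimal sketch of this encoding step, using a hypothetical column name and a few of the GLC-SHARE categories:

```python
import pandas as pd

# Toy sample of the dominant land cover variable (column name assumed).
df = pd.DataFrame({"land_cover": ["02-CropLand", "03-Grassland", "02-CropLand"]})

# One-hot encode the categorical variable into binary indicator columns.
encoded = pd.get_dummies(df, columns=["land_cover"])
print(encoded.columns.tolist())
# → ['land_cover_02-CropLand', 'land_cover_03-Grassland']
```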
The whole dataset was split into training (80%) and testing (20%) samples using the train_test_split function of the scikit-learn model_selection module.
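The 80/20 split can be sketched on toy data (the random seed is an assumption added for reproducibility, not stated in the study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0, 1] * 5)           # toy binary labels (BTV occurrence yes/no)

# 80% training / 20% testing split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```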
All predictive features were standardized using the StandardScaler object of the scikit-learn library, so that the mean of the observed values was 0 and the standard deviation was 1. The scaler was first fitted on the training data; the resulting scale and offset were then applied to the test dataset. For validation, both the training dataset used during model development and the test set not seen by the model were used. During the training phase, the ML models were validated using repeated stratified K-fold cross-validation with 3 splits and 2 repeats.
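The scaling and cross-validation procedure can be sketched as follows (toy data; LogisticRegression stands in for any of the study's classifiers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
X_test = rng.normal(size=(10, 3))
y_train = np.array([0, 1] * 20)

# Fit the scaler on the training data only, then apply the same
# mean/std (scale and offset) to the test set to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Repeated stratified K-fold CV with 3 splits and 2 repeats, as in the study.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=1)
scores = cross_val_score(LogisticRegression(), X_train_s, y_train, cv=cv)
print(len(scores))  # 6 scores: 3 splits × 2 repeats
```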
Feature selection
The ExtraTreesClassifier and SelectFromModel classes from the scikit-learn module were used to choose the features most important for prediction. By fitting a number of randomized decision trees (extra-trees) on different sub-samples of the dataset, the ExtraTreesClassifier class creates a meta-estimator that uses averaging to control over-fitting (36). The SelectFromModel meta-transformer allows choosing features based on importance weights: it takes a threshold parameter and selects the features whose relevance (as determined by the importance weights) exceeds it. The underlying estimator must expose a coef_ attribute or a feature_importances_ attribute, which in this case was provided by the ExtraTreesClassifier class. The net effect of combining these two classes is the selection of the most important predictive variables among all predictive variables.
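A minimal sketch of this selection step on synthetic data, in which only the first feature is informative by construction (all parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

# SelectFromModel fits the extra-trees ensemble, reads its
# feature_importances_ attribute, and keeps the features whose
# importance exceeds the (default, mean-importance) threshold.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
selector.fit(X, y)
X_selected = selector.transform(X)
print(selector.get_support(), X_selected.shape)
```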
Designing different models
In the present study, four models were designed based on the types of predictive variables used. In model 1, all predictor variables were used to predict BTV infection occurrence. In model 2, all predictor variables except the meteorological variables (CRU TS 4.05 dataset) were included; in model 3, all variables except the bioclimatic variables; and in model 4, only the important variables identified by the feature-selection algorithm applied to the first model.
Hyperparameter tuning
The RandomizedSearchCV method from the scikit-learn library was used to select a set of optimal hyperparameters for each ML methodology. This method evaluates a specified number of candidates sampled from a list of parameter settings.
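The search can be sketched as follows; the candidate values are illustrative and not the grids actually used in the study:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.array([0, 1] * 30)

# Hypothetical candidate hyperparameter lists to sample from.
param_dist = {"n_estimators": [10, 25, 50], "max_depth": [2, 4, None]}

# n_iter controls how many random candidates are evaluated via CV.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=4, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```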
Performance metrics
Accuracy, precision, recall, F1 score, and AUC were employed as performance metrics to assess the ability of the different classifiers to predict unknown data (test set) (37). The accuracy score, one of the most common performance metrics, is the number of correct predictions divided by the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Precision, or the accuracy of positive predictions, is another valuable metric:

Precision = TP / (TP + FP)

The ratio of positive cases accurately detected by the classifier is called recall:

Recall = TP / (TP + FN)

The F1 score is the harmonic mean of precision and recall, which gives much more weight to low values:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
On the ROC curve, the true positive rate (recall) is plotted against the false positive rate. The AUC of the ROC curve was also used as a summary measure of a classifier's ability to discriminate between the classes.
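All five metrics are available in scikit-learn; a small worked example on toy labels (here TP = 2, FP = 1, FN = 1, TN = 2):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]  # predicted probabilities, for AUC

print(accuracy_score(y_true, y_pred))   # 4 correct of 6 → 0.666...
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two → 2/3
print(roc_auc_score(y_true, y_prob))    # 8/9 ≈ 0.888...
```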
Machine learning algorithms
Logistic Regression (LR)
This ML classification approach is used to predict the likelihood of a categorical dependent variable. The dependent variable in LR is a binary variable coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). The probability of occurrence of the dependent variable is predicted by the LR model as a function of the values of the predictive features. Finally, it provides probabilistic values for the dependent variable that range from 0 to 1 (38).
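A minimal sketch on toy one-dimensional data, showing the probabilistic output in [0, 1]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # toy predictor
y = np.array([0, 0, 1, 1])                  # binary outcome

model = LogisticRegression().fit(X, y)
# predict_proba returns P(class 0) and P(class 1) per sample; the
# probability of occurrence rises monotonically with this predictor.
probs = model.predict_proba(X)[:, 1]
print(probs.round(2))
```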
Support Vector Machines (SVM)
SVMs are a common supervised ML method for classification, regression, and other learning problems. An SVM training algorithm builds a model that assigns new examples to one of two categories, based on a collection of training examples labelled as belonging to one of the two categories. SVM maps the training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into the same space and classified according to which side of the gap they fall on. The C-Support Vector Classification algorithm from the scikit-learn toolkit was used in this study; the implementation is based on LIBSVM, a library for SVMs (39). The C parameter is the penalty parameter of the error term in SVM. It can be thought of as the degree of correct classification required by the algorithm, or the degree of optimization required of the SVM.
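A minimal sketch of C-Support Vector Classification (scikit-learn's SVC) on two toy clusters; the kernel and C value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# C trades off margin width against training misclassifications:
# a larger C penalizes errors more heavily.
clf = SVC(C=1.0, kernel="rbf").fit(X, y)
pred = clf.predict([[1.5, 1.5], [8.5, 8.5]])
print(pred)  # → [0 1]
```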
Decision Tree (DT)
DT is a supervised learning technique that can be applied to classification and regression problems. However, it is most commonly employed to solve classification problems. Internal nodes represent dataset attributes, branches represent decision rules, and each leaf node provides the outcome in this tree-structured classifier (40).
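The tree structure described above (internal nodes test attributes, leaves hold outcomes) can be sketched on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# The fitted tree learns a threshold split on the single attribute;
# each leaf node carries the predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict([[2.5], [10.5]])
print(pred)  # → [0 1]
```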
Random Forest (RF)
RF is a classification and regression method that builds a large number of decision trees and combines them. It is an ensemble of trees built from a training data set and internally validated to produce a response prediction given the predictors for future observations (41).
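A minimal sketch of the ensemble idea (the number of trees is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# Many decision trees are fitted on bootstrap samples of the training
# data; the predicted class is the majority vote across the trees.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = forest.predict([[2.0], [11.0]])
print(pred)  # → [0 1]
```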
XGBoost
The XGBoost (Extreme Gradient Boosting) package implements the gradient boosting decision tree approach (42). Boosting is an ensemble strategy in which new models are added to correct the errors made by existing models; models are added one by one until no further improvement is possible. Gradient boosting creates new models that predict the residuals, or errors, of the previous models, which are then combined to form the final prediction. It gets its name from the fact that it uses a gradient descent approach to minimize the loss when adding new models. Both regression and classification predictive modeling problems are supported by this technique.
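The study used the xgboost package itself; as a stand-in sketch of the same sequential-boosting idea, scikit-learn's GradientBoostingClassifier (not XGBoost) can illustrate it on toy data, with illustrative parameter values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# Trees are added one by one; each new tree is fitted to the gradient
# of the loss (the residual errors) of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=20, learning_rate=0.2,
                                 random_state=0).fit(X, y)
pred = gbm.predict([[2.0], [11.0]])
print(pred)  # → [0 1]
```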
Artificial Neural Networks (ANN)
ANN are at the heart of deep learning, a more advanced variant of ML. ANN are built from the following concepts: the input and output layers, hidden layers, neurons within the hidden layers, and forward and backward propagation. In summary, the input layer consists of the predictive variables, the output layer produces the final output (the dependent variable), and the hidden layers consist of neurons in which equations are constructed and activation functions are applied. Forward propagation describes how these equations are composed to obtain the final result, whereas backward propagation computes the gradients used to update the weights accordingly (28).
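The forward/backward propagation loop can be illustrated with a minimal NumPy sketch (a single hidden layer fitted to a toy OR-style target; the study itself used Keras, and the layer sizes, learning rate, and iteration count here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])  # toy OR target

# One hidden layer of 4 neurons with sigmoid activations.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward propagation: input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward propagation: gradients of the squared error via the chain rule.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent update of the weights and biases.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(0)

print(np.round(out.ravel(), 2))
```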
Analysis tools
To analyze and edit spatial data files, QGIS software (version 3.16 – Hannover) was used. The Python programming language (version 3.6) and the Anaconda navigator platform (as a package manager; version 1.10.0) were used to implement ML techniques. The LR, SVM, DT Classifier and RF Classifier were implemented using Scikit-learn 0.23.2 (43). The XGBoost algorithm was implemented using the XGBoost library (42). Keras API (28), which runs as an abstraction layer on top of the TensorFlow framework (version 2.1.0) (44), was used to construct ANN.