Study outline
Fig. 3 shows an overview of the study. Details of the various steps will be described in the following sections.
Fig. 3 Study outline
Data sources
Country boundaries
The shapefile of country boundaries (ne_50m_admin_0_countries.shp, version 4.1.0) was downloaded from "www.naturalearthdata.com" (Made with Natural Earth. Free vector and raster map data @ naturalearthdata.com).
Bioclimatic variables
The GeoTiff files of the 19 standard bioclimatic variables (22), with a spatial resolution of 2.5 minutes, for WorldClim version 2.1 were downloaded from "www.worldclim.org/data/worldclim21.html". They are derived from monthly temperature and rainfall values, averaged over the years 1970-2000, to generate more biologically meaningful variables. They are coded as shown in Table 4.
Table 4
Meteorological variables
High-resolution gridded data of meteorological variables from the CRU TS 4.05 dataset for the 2001-2010 and 2011-2020 periods were utilized (33). A list of the variables, along with their codes and units, is shown in Table 5.
Table 5
Global animal distribution
The Gridded Livestock of the World (GLW 3) database was used to obtain GeoTiff files containing global population density data for sheep, goats, cattle, and buffaloes with a spatial resolution of 0.083333 decimal degrees (34).
Elevation
The elevation GeoTiff data file (22) with a spatial resolution of 2.5 minutes was obtained from "www.worldclim.org/data/worldclim21.html."
Land cover
The dominant land cover GeoTiff file was obtained from the Global Land Cover Share (GLC-SHARE) database (35). The data has a resolution of 30 arc-seconds (~1 km²) and is categorized as 01-Artificial Surfaces, 02-CropLand, 03-Grassland, 04-Tree Covered Area, 05-Shrubs Covered Area, 06-Herbaceous vegetation, aquatic or regularly flooded, 07-Mangroves, 08-Sparse vegetation, 09-BareSoil, 10-Snow and glaciers, and 11-Waterbodies.
Bluetongue virus infection occurrence data
The occurrence of BTV infections, along with the dates and geographical coordinates, was downloaded as a .csv file from FAO's Global Animal Disease Information System (Food and Agriculture Organization, https://empres-i.review.fao.org//) for the period from January 1, 1970 to January 6, 2022.
Data preprocessing
Because the global animal distribution data did not cover Antarctica and Greenland, the corresponding areas were removed from all other data sources, using the shapefile of country boundaries as a template.
The gdal2xyz function of the Geospatial Data Abstraction Library (GDAL) tools plugin of QGIS was used to convert all GeoTiff image files into CSV files. The CRU TS 4.05 datasets for 2001-2010 and 2011-2020 were merged using the merge method of the pandas library.
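The merge of the two CRU TS periods can be sketched as follows; the column names and toy coordinates are illustrative assumptions, not the actual CRU TS variable names:

```python
import pandas as pd

# Hypothetical CSVs produced by gdal2xyz: one row per grid cell, with
# x/y coordinates and one value column per meteorological variable/period.
d1 = pd.DataFrame({"x": [10.0, 10.5], "y": [45.0, 45.0], "tmp_2001_2010": [8.1, 8.4]})
d2 = pd.DataFrame({"x": [10.0, 10.5], "y": [45.0, 45.0], "tmp_2011_2020": [8.6, 8.9]})

# Merge the two periods on the shared grid coordinates.
merged = d2.merge(d1, on=["x", "y"], how="inner")
print(merged.columns.tolist())
```

An inner join on the grid coordinates keeps only cells present in both periods.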
The prepared CSV vector layer files were spatially joined using the "Join Attributes by Nearest" tool of QGIS. The final CSV file had 72385 rows, including 13530 rows with BTV infection occurrence and 58855 rows without a disease outbreak.
The dominant land cover variable was converted (one-hot encoded) into a binary vector representation using the get_dummies method of the pandas library.
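A minimal sketch of this encoding step, using a hypothetical column name and a few of the GLC-SHARE categories:

```python
import pandas as pd

# Toy sample of the dominant land cover variable (column name assumed).
df = pd.DataFrame({"land_cover": ["02-CropLand", "03-Grassland", "02-CropLand"]})

# One-hot encode the categorical variable into binary indicator columns.
encoded = pd.get_dummies(df, columns=["land_cover"])
print(encoded.columns.tolist())
# → ['land_cover_02-CropLand', 'land_cover_03-Grassland']
```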
The whole dataset was split into training (80%) and testing (20%) samples using the train_test_split function of the scikit-learn model_selection module.
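The 80/20 split can be sketched on toy data (the random seed is an assumption added for reproducibility, not stated in the study):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy feature matrix
y = np.array([0, 1] * 5)           # toy binary labels (BTV occurrence yes/no)

# 80% training / 20% testing split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```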
All predictive features were standardized using the StandardScaler object of the scikit-learn library, so that the mean of the observed values was 0 and the standard deviation was 1. The scaler was first fitted on the training data; the resulting scale and offset were then applied to the test dataset. For validation, both the training dataset used during model development and the test set not seen by the model were used. During the training phase, the ML models were validated using repeated stratified K-fold cross-validation with 3 splits and 2 repeats.
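The scaling and cross-validation procedure can be sketched as follows (toy data; LogisticRegression stands in for any of the study's classifiers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 3))
X_test = rng.normal(size=(10, 3))
y_train = np.array([0, 1] * 20)

# Fit the scaler on the training data only, then apply the same
# mean/std (scale and offset) to the test set to avoid leakage.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Repeated stratified K-fold CV with 3 splits and 2 repeats, as in the study.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=1)
scores = cross_val_score(LogisticRegression(), X_train_s, y_train, cv=cv)
print(len(scores))  # 6 scores: 3 splits × 2 repeats
```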
Feature selection
The ExtraTreesClassifier and SelectFromModel classes from the scikit-learn module were used to choose the features most important for prediction. By fitting a number of randomized decision trees (extra-trees) on different sub-samples of the dataset, the ExtraTreesClassifier class creates a meta-estimator that uses averaging to control over-fitting (36). The SelectFromModel meta-transformer allows choosing features based on importance weights: it takes a threshold parameter and selects the features whose relevance (as determined by the importance weights) exceeds it. The underlying estimator must expose a coef_ attribute or a feature_importances_ attribute, which in this case was provided by the ExtraTreesClassifier class. The net effect of combining these two classes is the selection of the most important predictive variables among all predictive variables.
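A minimal sketch of this selection step on synthetic data, in which only the first feature is informative by construction (all parameter values are illustrative):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)  # only feature 0 determines the label

# SelectFromModel fits the extra-trees ensemble, reads its
# feature_importances_ attribute, and keeps the features whose
# importance exceeds the (default, mean-importance) threshold.
selector = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
selector.fit(X, y)
X_selected = selector.transform(X)
print(selector.get_support(), X_selected.shape)
```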
Designing different models
In the present study, four models were designed based on the types of predictive variables used. In model 1, all predictor variables were used to predict BTV infection occurrence. In model 2, all predictor variables except the meteorological variables (CRU TS 4.05 dataset) were included; in model 3, all variables except the bioclimatic variables; and in model 4, only the important variables identified by the feature-selection algorithm applied to the first model.
Hyperparameter tuning
The RandomizedSearchCV method from the scikit-learn library was used to select a set of optimal hyperparameters for each ML methodology. This method evaluates a specified number of candidates sampled from a list of parameter settings.
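The search can be sketched as follows; the candidate values are illustrative and not the grids actually used in the study:

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = np.array([0, 1] * 30)

# Hypothetical candidate hyperparameter lists to sample from.
param_dist = {"n_estimators": [10, 25, 50], "max_depth": [2, 4, None]}

# n_iter controls how many random candidates are evaluated via CV.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=4, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_)
```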
Performance metrics
Accuracy, precision, recall, F1 score, and AUC were employed as performance metrics to assess the ability of the different classifiers to predict unknown data (test set) (37). The accuracy score, one of the most common performance metrics, is the number of correct predictions divided by the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively. Precision, or the accuracy of positive predictions, is another valuable metric:

Precision = TP / (TP + FP)

The ratio of positive cases accurately detected by the classifier is called recall:

Recall = TP / (TP + FN)

The F1 score is the harmonic mean of precision and recall, which gives much more weight to low values:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
On the ROC curve, the true positive rate (recall) is plotted against the false positive rate. The AUC of the ROC curve was also used as a summary measure of a classifier's ability to discriminate between the classes.
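All five metrics are available in scikit-learn; a small worked example on toy labels (here TP = 2, FP = 1, FN = 1, TN = 2):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]  # predicted probabilities, for AUC

print(accuracy_score(y_true, y_pred))   # 4 correct of 6 → 0.666...
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 2/3
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 2/3
print(f1_score(y_true, y_pred))         # harmonic mean of the two → 2/3
print(roc_auc_score(y_true, y_prob))    # 8/9 ≈ 0.888...
```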
Machine learning algorithms
Logistic Regression (LR)
This ML classification approach is used to predict the likelihood of a categorical dependent variable. The dependent variable in LR is a binary variable coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). The probability of occurrence of the dependent variable is predicted by the LR model as a function of the values of the predictive features. Finally, it provides probabilistic values for the dependent variable that range from 0 to 1 (38).
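A minimal sketch on toy one-dimensional data, showing the probabilistic output in [0, 1]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])  # toy predictor
y = np.array([0, 0, 1, 1])                  # binary outcome

model = LogisticRegression().fit(X, y)
# predict_proba returns P(class 0) and P(class 1) per sample; the
# probability of occurrence rises monotonically with this predictor.
probs = model.predict_proba(X)[:, 1]
print(probs.round(2))
```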
Support Vector Machines (SVM)
SVMs are a common supervised ML method for classification, regression, and other learning problems. An SVM training algorithm builds a model that assigns new examples to one of two categories, based on a collection of training examples labelled as belonging to one of the two categories. SVM maps the training examples to points in space so as to maximize the width of the gap between the two categories. New examples are then mapped into the same space and classified according to which side of the gap they fall on. The C-Support Vector Classification algorithm from the scikit-learn toolkit was used in this study; the implementation is based on LIBSVM, a library for SVMs (39). The C parameter is the penalty parameter of the error term in SVM. It can be thought of as the degree of correct classification required by the algorithm, or the degree of optimization required of the SVM.
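A minimal sketch of C-Support Vector Classification (scikit-learn's SVC) on two toy clusters; the kernel and C value are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# C trades off margin width against training misclassifications:
# a larger C penalizes errors more heavily.
clf = SVC(C=1.0, kernel="rbf").fit(X, y)
pred = clf.predict([[1.5, 1.5], [8.5, 8.5]])
print(pred)  # → [0 1]
```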
Decision Tree (DT)
DT is a supervised learning technique that can be applied to classification and regression problems. However, it is most commonly employed to solve classification problems. Internal nodes represent dataset attributes, branches represent decision rules, and each leaf node provides the outcome in this tree-structured classifier (40).
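The tree structure described above (internal nodes test attributes, leaves hold outcomes) can be sketched on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# The fitted tree learns a threshold split on the single attribute;
# each leaf node carries the predicted class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
pred = tree.predict([[2.5], [10.5]])
print(pred)  # → [0 1]
```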
Random Forest (RF)
RF is a classification and regression method that builds a large number of decision trees and combines them. It is an ensemble of trees built from a training data set and internally validated to produce a response prediction given the predictors for future observations (41).
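A minimal sketch of the ensemble idea (the number of trees is illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# Many decision trees are fitted on bootstrap samples of the training
# data; the predicted class is the majority vote across the trees.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
pred = forest.predict([[2.0], [11.0]])
print(pred)  # → [0 1]
```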
XGBoost
The XGBoost (Extreme Gradient Boosting) package implements the gradient boosting decision tree approach (42). Boosting is an ensemble strategy in which new models are added to correct the errors made by existing models; models are added one by one until no further improvement is possible. Gradient boosting creates new models that predict the residuals, or errors, of the previous models, which are then combined to form the final prediction. It gets its name from the fact that it uses a gradient descent approach to minimize the loss when adding new models. Both regression and classification predictive modeling problems are supported by this technique.
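The study used the xgboost package itself; as a stand-in sketch of the same sequential-boosting idea, scikit-learn's GradientBoostingClassifier (not XGBoost) can illustrate it on toy data, with illustrative parameter values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X = np.array([[1], [2], [3], [10], [11], [12]])
y = np.array([0, 0, 0, 1, 1, 1])

# Trees are added one by one; each new tree is fitted to the gradient
# of the loss (the residual errors) of the current ensemble.
gbm = GradientBoostingClassifier(n_estimators=20, learning_rate=0.2,
                                 random_state=0).fit(X, y)
pred = gbm.predict([[2.0], [11.0]])
print(pred)  # → [0 1]
```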
Artificial Neural Networks (ANN)
ANN are at the heart of deep learning, a more advanced variant of ML. ANN are built from the following concepts: the input and output layers, hidden layers, neurons within the hidden layers, and forward and backward propagation. In summary, the input layer consists of the predictive variables, the output layer produces the final output (the dependent variable), and the hidden layers consist of neurons in which equations are constructed and activation functions are applied. Forward propagation describes how these equations are composed to obtain the final result, whereas backward propagation computes the gradients used to update the weights accordingly (28).
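The forward/backward propagation loop can be illustrated with a minimal NumPy sketch (a single hidden layer fitted to a toy OR-style target; the study itself used Keras, and the layer sizes, learning rate, and iteration count here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])  # toy OR target

# One hidden layer of 4 neurons with sigmoid activations.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for _ in range(5000):
    # Forward propagation: input layer -> hidden layer -> output layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward propagation: gradients of the squared error via the chain rule.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent update of the weights and biases.
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(0)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(0)

print(np.round(out.ravel(), 2))
```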
Analysis tools
To analyze and edit spatial data files, QGIS software (version 3.16 – Hannover) was used. The Python programming language (version 3.6) and the Anaconda navigator platform (as a package manager; version 1.10.0) were used to implement ML techniques. The LR, SVM, DT Classifier and RF Classifier were implemented using Scikit-learn 0.23.2 (43). The XGBoost algorithm was implemented using the XGBoost library (42). Keras API (28), which runs as an abstraction layer on top of the TensorFlow framework (version 2.1.0) (44), was used to construct ANN.