Prediction of disease controllability in complex networks using machine learning algorithms

Abstract—The application of machine learning (ML) techniques spans a vast spectrum, ranging from speech, face and character recognition, medical diagnosis and anomaly detection in data to general classification, prediction and regression problems. In the present work, we solve the problem of predicting R0 for disease spreading on complex networks using regression-based state-of-the-art ML techniques. R0 is a metric that determines whether the disease-free state or an endemic state is asymptotically stable and hence indicates the controllability of the disease spread. We predict R0 by training ML models on structural properties of complex networks, irrespective of the network type. The prediction is possible because: (a) the structure of complex networks plays an essential role in spreading processes on networks; (b) regression techniques such as Support Vector Regression and Artificial Neural Networks can be used very efficiently for prediction problems, even for non-linear data. Using these techniques, we obtained good prediction accuracy for R0 on simulated as well as real-world networks. Moreover, ML model training is a one-time investment in training time and memory, and the trained model can then be used to predict R0 on unseen/new network examples.


INTRODUCTION
The problem of disease spreading has been studied using systems of Ordinary Differential Equations (ODEs) [Anderson and May(1992)] to predict the endemic disease state and devise effective control strategies. Earlier studies did not employ any spatial structure, and the dynamics generally depended on the population size and the probabilities of transitions between disease states. More recently, however, the use of a spatial structure to determine the relative positions and interactions of individuals has come to the fore. The complex network framework [Barabási and Albert(1999)], where nodes represent individuals and links govern the interactions between them, is thus very useful. Disease spreading on networks has been studied using various compartmental epidemiology models such as SI (Susceptible-Infected), SIR (Susceptible-Infected-Recovered), SIRS (Susceptible-Infected-Recovered-Susceptible), etc. [Hethcote(2000)].
The impact of a disease on the population is measured using the basic reproduction number, R0 [Lotka(1956)]. R0 is the average number of individuals an infected person infects over its period of activity, such that if R0 < 1 the disease will die out in the long run, and if R0 > 1 the disease-free stationary state is asymptotically unstable [Stewart et al.(2005)Stewart, Logsdon, and Kelley]. For a 100% effective vaccine, the fraction of individuals that must be vaccinated to prevent persistent disease spread is 1 - 1/R0, so a higher R0 means that a larger share of the population needs to be vaccinated. Several works [Shirley and Rushton(2005)], [Schimit and Pereira(2018)] have determined the dynamical relationship between network and disease parameters for epidemics spreading on networks.
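The vaccination threshold above follows directly from requiring the effective reproduction number to drop below 1. A minimal sketch of the arithmetic (ours, purely illustrative):

```python
def critical_vaccination_fraction(r0: float) -> float:
    """Fraction of the population that must be vaccinated with a 100%
    effective vaccine to push the effective reproduction number below 1."""
    if r0 <= 1.0:
        return 0.0  # disease dies out on its own; no vaccination needed
    return 1.0 - 1.0 / r0

# A disease with R0 = 4 requires vaccinating 75% of the population.
print(critical_vaccination_fraction(4.0))  # -> 0.75
print(critical_vaccination_fraction(0.8))  # -> 0.0
```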
Machine learning (ML) models based on supervised and unsupervised learning algorithms have found important applications in the area of complex networks: for example, optimal graph partitioning into community structure [MacQueen et al.(1967)], classification of diseased networks against control networks using brain-imaging data [Chaplot et al.(2006)Chaplot, Patnaik, and Jagannathan], and classification of networks into various model network types [Xin et al.(2018)Xin, Zhang, and Shao]. Recently, a study [Schimit and Pereira(2018)] reported the relative relevance of network topological and disease parameters for disease spreading on complex networks, irrespective of the model network type. They found that, apart from the disease parameters, the topology of the population (or network) affects the disease spreading process. In essence, for given initial conditions of the epidemic spread, the topological properties of the network govern the asymptotic disease state. Our work builds on this idea and explores whether the topological properties alone can predict R0. Accurate determination or prediction of R0 is of paramount importance for analyzing the stability of the disease state with respect to an infection outbreak. To this end, we train ML regression models using a large number of networks of various model types. We used six structural properties of networks drawn from five model network types as input features and, as output labels, the corresponding R0 evaluated by simulating SIR dynamics on them. During training, the model fits these inputs to the outputs and learns a set of weights. Using the trained models (i.e., the weights), we predict the R0 value of a test network from its own structural metrics.
In particular, we trained three models: linear regression, support vector regression (non-linear regression) with different kernels, and a neural network model. We optimized these models for correct prediction of R0 on test examples and evaluated two accuracy metrics for each: mean squared error and coefficient of determination [Pedregosa et al.(2011)Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay]. We present the parameter list and the ranges used for fine-tuning each of these models. Support Vector Regression (SVR) with the radial basis function (RBF) kernel and the artificial neural network (ANN) model achieved substantially higher prediction accuracy than the linear models, for both real and simulated networks. Moreover, the excellent overlap between the expected and predicted values of R0 for these non-linear ML models points to a non-linear relationship between the model input and output variables. Hence, we show that simple ML techniques can predict R0 with high precision from the structural properties of the network. We also find that different network metrics have different predictive power; based on this result, only the four most important of the six measures need be used for model training and prediction. However, ML model performance was marginally better with all six features.

PROBLEM STATEMENT
As explained earlier, the R0 of a disease spreading process is an estimate of the disease's impact on the population and hence of its controllability. For a dynamical process occurring on a network, the structure of the network plays a key role in determining the next stage of the process, apart from the inherent parameters of the process. Similarly, for a disease spreading on a network, the dynamics, and hence the disease stages, result from the interplay between the network's structural features and the epidemic model parameters, such as the probability of infection, the probability of recovery from infection, etc. For the R0 calculation in the present work, we update all the disease-related rate constants at each time step, depending on the instantaneous values of, and the respective changes in, the populations of susceptible, infected and recovered individuals, according to equations 2-5 in the supplementary information (SI). Since interactions in networks can occur only through a node's neighbours, the instantaneous values and changes of the S, I and R populations over the whole network are indirectly determined by the local and global structure of the network. Hence, for networks, the value of R0, which is a dynamical descriptor of the disease and depends on the disease-related rate constants, is affected by the network structure.
Given that R0 is the average number of susceptibles an infected individual infects over its period in the infected state, different structural metrics of a network have specific effects on its value. For example, for an Erdős-Rényi (ER) network, a higher clustering coefficient means a higher number of connections and hence a higher R0. A similar trend holds for the network density, the average degree and the maximum degree. On the other hand, a higher average shortest path length means a greater average distance between two nodes and hence a smaller R0, and vice versa. For a Small-World (SW) network, owing to the regularity of its structure, even a lower density of connections than in an ER network shows a similar potential for disease spread; hence, other topological features are also important in determining R0. In the same manner, the effect of topological parameters on the R0 value can be intuitively understood for other model networks.
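To make this interplay concrete, the sketch below simulates a generic discrete-time SIR process on a small network in pure Python. It is only illustrative: the paper's actual update rules are Eqs. 2-5 of the SI, and the per-contact transmission probability `p_inf` and recovery probability `p_rec` used here are stand-ins loosely matching the k = 0.1 and p_ir = 60% values quoted in the Methodology section.

```python
import random

def sir_step(adj, state, p_inf, p_rec, rng):
    """One synchronous update of a discrete-time SIR process on a network.
    state[i] is 'S', 'I' or 'R'; adj[i] is the list of neighbours of node i."""
    new_state = dict(state)
    for node, s in state.items():
        if s == 'S':
            # each infected neighbour independently transmits with prob p_inf
            for nb in adj[node]:
                if state[nb] == 'I' and rng.random() < p_inf:
                    new_state[node] = 'I'
                    break
        elif s == 'I' and rng.random() < p_rec:
            new_state[node] = 'R'
    return new_state

def simulate_sir(adj, seeds, p_inf=0.1, p_rec=0.6, steps=100, rng_seed=0):
    rng = random.Random(rng_seed)
    state = {n: ('I' if n in seeds else 'S') for n in adj}
    for _ in range(steps):
        state = sir_step(adj, state, p_inf, p_rec, rng)
    return state

# Ring of 50 nodes: each node is linked to its two nearest neighbours.
ring = {i: [(i - 1) % 50, (i + 1) % 50] for i in range(50)}
final = simulate_sir(ring, seeds={0})
print(sum(1 for s in final.values() if s == 'R'))  # recovered nodes at the end
```

Swapping the ring for a denser or more clustered adjacency structure changes the final outbreak size even with identical disease parameters, which is exactly the structure dependence the paper exploits.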
In this work, we seek to predict R0 for any example network, given that we know its structural properties a priori. Since R0 is an important measure of the effect of a disease on the population, such a prediction would serve as a warning for a presently unaffected population and help devise better vaccination strategies for disease control. In this pursuit, we trained and optimized ML models with state-of-the-art techniques and tested them on unseen artificial and real-world networks. The results for SVR with the RBF kernel and for the ANN show that R0 was accurately predicted for these networks based on their known network properties. Hence, we obtain an estimate of disease controllability beforehand, without the need to simulate the SIR model on the test network.

METHODOLOGY
In this section we describe the procedure for generating the simulated data set. We also briefly describe k-fold validation, a standard procedure used in ML model training and testing; further details are given in the SI.

Generation of simulated data set
For simulation of the SIR model on networks, the disease-related parameters were fixed at k = 0.1, p_ir = 60%, p_id = 30% and p_rs = 10%. The starting populations in each state were fixed at S_o = 99.5%, I_o = 0.5% and R_o = 0%. Each network had 1000 nodes, and the network structure remained fixed throughout the simulation. The simulations were run for 100 time steps, and the parameters a, b, c and e were determined using equations 2-5 in the SI, respectively. R0 was calculated from these parameters and averaged over the last 20 time steps, where the system reaches a stable regime (Fig. 1 of SI). The networks were generated using the Python library NetworkX [Hagberg et al.(2008)Hagberg, Swart, and S Chult], which returns a network for the supplied input parameter(s) governing the connectivity pattern. To obtain n networks of a given model type, n values of these parameters were chosen from a range. This range was chosen carefully so that the network properties fall in roughly the same range for all the models. Around 500 networks of each model kind (exact numbers in the description below), each of size N = 1000, were generated. We use five model networks in our work: Erdős-Rényi (ER) random networks [ERDdS and R&WI(1959)], Watts-Strogatz small-world (SW) networks [Watts and Strogatz(1998)], Scale-Free (SF) networks [Bollobás et al.(2003)Bollobás, Borgs, Chayes, and Riordan], Barabási-Albert (BA) networks [Barabási and Albert(1999)] and Stochastic Block Model (SBM) networks [Tcoyze(2016)]. The parameters and their range of variation for each model network are described in the Generation of model network examples section of the SI. The six network structural properties used as input features for training were Average Degree (avgdeg), Average Shortest Path Length (spl), Clustering Coefficient (cc), Network Density (den), Network Diameter (dia) and Maximum Degree (maxdeg). The definitions of these network metrics are presented in the Complex Networks: Types and properties section of the SI.

Training data set and k-fold validation

For training the model, the k-fold cross-validation routine of the sklearn library [Pedregosa et al.(2011)Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay] was used. The k-fold cross-validation (CV) procedure avoids over-fitting by holding back a part of the data from training, so that model performance is evaluated and reported on unseen data. The full data set is split into k folds, of which k - 1 folds are used for training the model and the remaining fold is used for testing its performance. The model performance score is recorded each time, and the model is then discarded. This procedure is carried out in a loop (k times), with a different fold held out for testing and the remaining k - 1 folds used for training each time; each fold is thus used once for testing and k - 1 times for training, and the model accuracy is averaged over all iterations. In the present work, all ML model performances are evaluated using 10-fold cross-validation. Note also that the rows of the data matrix were always permuted before splitting it into training and testing parts, so that networks of the same model are not stacked together.
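Assuming the networks are handled with NetworkX (as in this work), the six input features can be computed as below. The NetworkX function names are standard API; the wrapper itself is our illustrative sketch and assumes a connected graph, since the diameter and average shortest path length are undefined otherwise.

```python
import networkx as nx

def structural_features(G):
    """The six features used as ML inputs here, computed with NetworkX.
    Assumes G is connected."""
    degrees = [d for _, d in G.degree()]
    return {
        "avgdeg": sum(degrees) / G.number_of_nodes(),
        "maxdeg": max(degrees),
        "cc":     nx.average_clustering(G),
        "den":    nx.density(G),
        "dia":    nx.diameter(G),
        "spl":    nx.average_shortest_path_length(G),
    }

# Example: a Barabasi-Albert network (connected by construction).
G = nx.barabasi_albert_graph(200, 3, seed=42)
feats = structural_features(G)
print(sorted(feats))
```

Repeating this over a few hundred graphs per model type, with R0 from the SIR simulation as the label, yields the training matrix described above.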

MODEL PERFORMANCE
The linear regression model gave a good fit when networks from the same model were used for training and testing. The (MSE, R2) scores for the ER, SF, SW, BA and SBM networks were (0.01, 0.99), (0.14, 0.87), (0.02, 0.99), (0.0, 0.99) and (0.1, 0.99) respectively, indicating a good fit (corresponding plots shown in SI2). However, linear regression lost accuracy significantly when networks from all the models were used for training and testing (see Figure SI2), with (MSE, R2) of (4.99, 0.69). This shows that linear regression is not a reliable model for predicting R0 irrespective of the network type. The failure of linear regression also confirms the absence of a linear relation between the input and target variables and calls for testing non-linear regression techniques.

Support Vector Regression
Owing to the failure of linear regression to accurately predict R0, we explored the performance of SVR with the RBF kernel on the data set. Performance for the other two kernels (linear, polynomial) is also reported (see Table 1 for results). The parameters for the polynomial and RBF kernel functions are γ = 0.1 with degree d = 2, and γ = 1/(number of features) (refer to Eqns. 13-14 in the SI), respectively. SVR with the linear, polynomial and RBF kernels yields (MSE, R2) of (3.67, 0.73), (11.30, 0.16) and (0.01, 1.00) respectively.
Increasing the degree of the polynomial kernel to 3 improves the accuracy scores to (3.02, 0.81); increasing the degree beyond 3 resulted in arbitrarily high errors and required much longer model training times than degree 2. For the RBF kernel, the previously mentioned parameters were optimal with respect to accuracy and required training time. Comparing the accuracy scores, we found that the RBF kernel outperforms the other two kernels by a substantial margin, and hence RBF can be used to predict R0 with good precision. Please refer to Figure 1, top panel and bottom panel (left figure), for the match of the predicted output with the expected values for all three kernels in SVR. The plots show predictions on only the first hundred data samples for better visualization.
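A minimal scikit-learn sketch of the evaluation pipeline described above: SVR with the RBF kernel and γ = 1/(number of features) (scikit-learn's `gamma="auto"`), scored by 10-fold cross-validation with shuffled rows. The synthetic data below merely stands in for the network-feature matrix; `C` and `epsilon` are our illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 6))    # stand-in for the six network features
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2]        # smooth non-linear toy target

# gamma="auto" is 1 / n_features, the value quoted in the text for the RBF kernel.
model = SVR(kernel="rbf", gamma="auto", C=10.0, epsilon=0.01)

# shuffle=True mirrors the row permutation applied before splitting the data.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(round(scores.mean(), 3))
```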

Neural Network Model
We also use a NN architecture (see Figure 2, left panel) that is optimized iteratively to attain maximum regression accuracy. The NN model used here consists of three layers. The number of neurons in the input layer is conventionally fixed to equal the number of features of the data matrix.
The output layer has one neuron, as the model performs regression to predict a single number (R0) as output. The hidden layer has 23 neurons that gather information from each neuron in the input layer and transmit it to the output layer. The number of neurons in the hidden layer was determined according to an empirical rule of thumb [hobs (https://stats.stackexchange.com/users/15974/hobs)()] that puts an upper limit on the total number of neurons without incurring over-fitting. The rule is N_h = N_s / (α(N_i + N_o)), where N_i is the number of input neurons, N_o is the number of output neurons, N_s is the number of samples in the training data set, and α is an arbitrary scaling factor between 2 and 10. For α = 2 we get the maximum number of neurons according to this formula, and any number of neurons greater than this value risks over-fitting. For our case, we chose α = 10, to avoid over-fitting and to reduce the number of free parameters (weights) in the model.
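The sizing rule is a one-line computation; the sketch below (assuming integer truncation) reproduces the bound used in this work for the data-set size quoted in the next paragraph.

```python
def max_hidden_neurons(n_samples: int, n_in: int, n_out: int, alpha: float) -> int:
    """Upper bound on hidden-layer size: N_h = N_s / (alpha * (N_i + N_o))."""
    return int(n_samples / (alpha * (n_in + n_out)))

print(max_hidden_neurons(2552, 6, 1, alpha=10))  # -> 36
print(max_hidden_neurons(2552, 6, 1, alpha=2))   # -> 182 (loosest bound, risks over-fitting)
```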
Substituting N_i = 6, N_o = 1 and N_s = 2552, we obtained N_h = 36. The optimal value of N_h was then determined numerically by varying N_h up to 36. The accuracy metrics were evaluated for different numbers of neurons in the hidden layer, and this exercise was repeated for ten trials on a randomly permuted data set. The optimum number of neurons in the hidden layer was found to be 23, as can be seen from the variation of the MSE and the R2 coefficient in Figure 2, right panel. We used Keras (a deep learning library for Python) [Chollet et al.(2015)] for model construction and simulation. Other specifics of the model, in terms of its hyper-parameters and parameters, are described in Table 1 of the SI. The weights of the edges in the NN were chosen from a normal distribution using the kernel initializer function. The neuron activations were governed by the rectified linear unit ("relu"), i.e., the activation equals the input when the input is positive and is zero otherwise. Adaptive Moment Estimation ("Adam") was used as the optimizer; it is based on an adaptive learning scheme for updating the network weights, iteratively adjusting the learning rates using moments of the gradient of the objective function. The objective (loss) function was the accuracy measured in terms of MSE. With the number of epochs ("Epochs") set to 50, a batch size of 5, and the other parameters set as specified above, the mean accuracy in terms of (MSE, R2) for 10-fold cross-validation was (0.093, 0.99).
We show the predicted and true R0 for all examples using SVR (with the RBF kernel) and the ANN model in Figure 4. We also trained all the ML models using only four of the six features (avgdeg, maxdeg, dia, spl), selected based on their contribution indices; these four features had the highest contribution indices (refer to the subsection Ranking of the features in the SI). The relative contribution indices for all six features are shown in Figure 3. We observe that the model performance remains fairly accurate in predicting the correct R0. Table 2 shows the model performance metrics for SVR with RBF and for the ANN when trained with six and with four features, respectively. We infer from the results that there is a trade-off between the accuracy of model predictions and the number of features used for training. The accuracy of the predicted value is better with all six features in the data set than when only four feature vectors are considered. The precision of prediction with the top four features is slightly reduced but remains within an acceptable range (refer to the MSE and variance scores in Table 2). Therefore, one can remove cc and den from the feature set for training without compromising much on prediction accuracy.
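The contribution indices used here are defined in the SI; as a hedged stand-in, scikit-learn's permutation importance gives a comparable feature ranking. Everything below (the data, the target and the parameters) is illustrative, not the paper's actual setup.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(400, 6))                  # toy stand-in for the six features
y = 3 * X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * X[:, 2]     # features 0 and 1 dominate

model = SVR(kernel="rbf", gamma="auto").fit(X, y)
imp = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Features sorted from most to least informative.
ranking = np.argsort(imp.importances_mean)[::-1]
print(ranking[:4])  # indices of the four most informative features
```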

Performance on Real-world Networks
We tested the ML models for R0 prediction on four real-world network data sets: infect-dublin [Isella et al.(2011)Isella, Stehlé, Barrat, Cattuto, Pinton, and Van den Broeck], infect-hyper [Isella et al.(2011)Isella, Stehlé, Barrat, Cattuto, Pinton, and Van den Broeck], crime-moreno and email-univ ([Rossi and Ahmed(2015)], [Guimera et al.(2003)Guimera, Danon, Diaz-Guilera, Giralt, and Arenas]). infect-dublin and infect-hyper are categorized as proximity networks based on face-to-face interactions between people, with numbers of nodes (N) and edges (E) of (410, 2765) and (113, 2196) respectively. crime-moreno is categorized as an interaction network with (N, E) = (829, 1474). email-univ is an email network with (N, E) = (1133, 5451). Note that we chose networks with a single giant component only. The accuracy metrics for the ML models on the real-world networks are tabulated in Table 3. The ML models perform as accurately for these networks as for the artificial ones. In particular, for the infect-dublin and crime-moreno networks, both the SVR and ANN models predict an R0 that almost matches the true value. Furthermore, these real-world test networks serve as unseen test examples for the ML models, and the accurate prediction of R0 further validates them.

CONCLUSION AND FUTURE PROSPECTS
The present work explored the applicability of ML regression techniques to predicting the basic reproduction number, R0, a factor indicative of the effect of a disease and of its controllability, on complex networks. R0 in general depends on many factors: the duration of disease persistence in the population, the vulnerability of an individual to infection, the number of infected neighbours of a susceptible individual, etc. On the other hand, if all these parameters are fixed at reasonable values, how the social structure (a complex network in our case) on which the disease spreads affects the spreading remains an open question. To explore this, we examined whether R0 can be predicted from global properties of the network at hand, irrespective of the model network type it belongs to. A large number of networks were generated, disease-spreading dynamics were simulated on them, and the corresponding R0 was recorded along with the network properties. Using the recorded data, three ML regression models were trained to predict R0 values on test data. These models were tuned via their parameters to obtain good predictions. The results using the RBF kernel in SVR and the ANN model showed high accuracy of R0 prediction, suggesting a significant correlation between network properties and disease controllability. The generalizability of the trained models is convincing because testing was always performed on unseen data using the k-fold validation technique. One improvement to the present work would be to train the models on a larger number of networks of different sizes, so that a wider range of network properties, such as shortest path length, clustering coefficient, etc., is spanned. These models would then yield correct predictions for any given test network; as one can see, this is just a scalability issue. Moreover, the good predictions of these models on real-world networks are an exciting result. Our work reports two significant findings: (a) the disease controllability on a network can be predicted using global network properties; (b) standard ML techniques can be applied to processes on complex networks, in our case predicting disease controllability on a network. The tunability of ML models offers immense power to forecast or predict processes on complex network systems.
The computational cost of some of the features is high (especially the clustering coefficient (cc) and the shortest path length (spl)), so the time complexity of obtaining the features for the training data set is high. This is one of the constraints of our approach to ML model training. However, for predicting the value of R0 for an arbitrary test network from known network features, the prediction time is almost negligible. This implies that we can predict the value of R0 as soon as we obtain the test network, without waiting until the epidemic outbreak has completed or reached a stable state. Of course, the underlying assumption is that the values of these features are known beforehand at the time of testing.
Future prospects of this work include using deep learning approaches for unsupervised learning of features. Since the numerical calculation of network properties for training and testing the model is a time-consuming step, it would be valuable if the network itself could be used to train the model. Another prospect is to explore whether the network adjacency matrix can be used to train a deep learning CNN architecture. Also, if network embedding algorithms can be used to learn the features that are instrumental in disease spreading, it would be a significant leap forward from this work.

Fig. 1. The figure shows the difference between predicted and true R0, using vertical lines at the data points, for the linear, polynomial and RBF kernels in SVR and for the ANN, respectively. For better visualization of the results, only 100 randomly selected data points (true and corresponding predicted) are shown for each model.

Fig. 2. The left panel shows a schematic of the NN model used in the present work, with three layers and the number of neurons in each layer specified. The right panel shows the R2 coefficient (top) and MSE (bottom) for R0 prediction, averaged over 10 realizations, for different numbers of neurons in the hidden layer; for 23 neurons the MSE touches zero and R2 reaches one, implying that 23 neurons in the hidden layer gives the maximum accuracy.

Fig. 3. Relative importance of the network features based on their contribution index. The contribution indices are normalized between 0 and 1.

TABLE 2. ML model performance results on simulated networks.

TABLE 3. Model performance results on real-world networks.