A Machine Learning Algorithm to Predict Hyperglycemic Cases Induced by PD-1/PD-L1 Inhibitors in the Real World

Background: Diabetes mellitus and cancer are among the leading causes of death worldwide, and hyperglycemia contributes substantially to the risk of neoplastic transformation. Using adverse events of PD-1 or PD-L1 (programmed death 1 or programmed death ligand 1) inhibitors reported in post-marketing monitoring, we aimed to construct an effective machine learning algorithm that efficiently and rapidly predicts the probability of a hyperglycemic adverse reaction in patients treated with PD-1/PD-L1 inhibitors.
Methods: Raw data were downloaded from the US Food and Drug Administration Adverse Event Reporting System (FAERS). Signals of the relationship between drug and adverse reaction were detected by disproportionality analysis and Bayesian analysis. A multivariate pattern classification method, the Support Vector Machine (SVM), was used to construct a classifier that separates patients with an adverse hyperglycemic reaction. Ten-fold cross validation repeated three times within the training data (80% of the data) yielded the best SVM parameter values in R software. The model was validated on each testing data set (20% of the data) and on the total data of two drugs, with gamma and nu as the tuned parameters.
Results: A total of 95918 case files were downloaded for 7 relevant drugs (cemiplimab, avelumab, durvalumab, atezolizumab, pembrolizumab, ipilimumab, nivolumab). The number-type/number-optimization method was selected to optimize the model. Both gamma and nu values, regressed on case number, showed high adjusted r² in curve regressions (both r² > 0.95). Accuracy, F1 score, kappa, and sensitivity were greatly improved by the prediction model in the training data and the total data of two drugs.
Conclusions: The SVM prediction model established here can non-invasively and precisely predict the occurrence of hyperglycemic adverse drug reactions (ADRs) in patients treated with PD-1/PD-L1 inhibitors. Such information is vital for overcoming ADRs and improving outcomes by identifying patients at high risk of hyperglycemia, and this machine learning algorithm may eventually add value to clinical decision making.


Background
Diabetes mellitus and cancer are among the leading causes of death worldwide. Hyperglycemia (high blood glucose), a major contributor to the risk of neoplastic transformation, is also influenced by cancer treatment. Higher blood glucose levels, whether or not treated with insulin, correlate with greater cancer risk, progression, and mortality. 1,2 Immunotherapy has recently brought a vast improvement over previous anti-cancer therapies. 3 In 2018 and 2019, two reviews in JAMA reported treatment-related adverse events of programmed death 1 (PD-1) or programmed death ligand 1 (PD-L1) inhibitors, or immune checkpoint inhibitor regimens, in clinical trials. Among endocrine dysfunctions, hyperglycemia ranked third, after hypothyroidism and hyperthyroidism, among all-grade immune-related adverse events (irAEs), and ranked first among irAEs of grade 3 or higher. 4,5 Prolonged exposure to hyperglycemia can epigenetically modify gene expression profiles in human cells, and this effect persists even after hyperglycemic control is therapeutically achieved. Cancer cells exposed to hyperglycemic conditions can develop persistently aggressive growth, even after returning to euglycemic conditions. This phenomenon is called hyperglycemic memory. This metabolic memory effect contributes substantially to the pathology of various diabetic complications. 6,7

Worldwide public databases of adverse events can provide extensive drug usage information 8,9 and have become a new information source in the post-marketing phase. The Food and Drug Administration Adverse Event Reporting System (FAERS) is a database that contains adverse event reports, medication error reports, and product quality complaints resulting in adverse events. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products.
The informatic structure of the FAERS database adheres to the international safety reporting guidance issued by the International Conference on Harmonisation. 10

Machine learning is the scientific discipline that focuses on how computers learn from data, with an emphasis on efficient computing algorithms. 11 Machine learning techniques have become a focus of data mining on health record data, and modeling such big data requires managing overfitting, model interpretability, and computational cost. 12 The Support Vector Machine (SVM) is a supervised learning method that analyzes data and recognizes patterns; it is mainly used for statistical classification and regression, especially nonlinear classification. 13,14 SVM-based approaches handle sparse, high-dimensional data well. SVM is a multivariate pattern classification method that divides samples into two classes (a binary outcome) by a line or plane in a multidimensional feature space, called the maximum-margin hyperplane. 15 SVM has been reported to outperform other state-of-the-art competitors, providing the best compromise between predictive performance and computation time. 12 It has been applied to seizure prediction, detection, and classification. 16 One advantage of SVM is that it can classify with a small number of training samples; another is that it can solve both linear and nonlinear regression problems. 14 In 2019, Bernardini et al. successfully predicted type 2 diabetes from electronic health records with a cross-validated sparse balanced SVM model, demonstrating the capability of SVM models to predict clinical outcomes. 12 However, to date, neither the association between hyperglycemic incidence and personal-related features, nor the specific features that affect hyperglycemia, has been reported. Predicting hyperglycemic incidence could better guide clinical decision-making and help avoid severe adverse drug reactions (ADRs).
In this study, we aimed to construct an effective SVM-based machine learning model to predict hyperglycemic reactions in patients treated with PD-1/PD-L1 inhibitors.

Inclusion criteria:
- Search by reaction term: Diabetes Mellitus, Hyperglycemia (Hyperglycaemia), Type 1 Diabetes Mellitus, Type 2 Diabetes Mellitus, and Blood Glucose Increased.

Exclusion criteria:
- Other products.
- No information on blood glucose in reactions.
- Cases with missing items.

Experimental Design
The SVM machine learning procedure comprised: raw data download from FAERS, algorithm selection, model setup (parameter optimization and determination), parameter curve regressions, and data prediction (Fig. 1).
Step 1: Data form selection. Raw data were arranged into two forms: original-data (with missing values) and complete-data (without missing values; rows containing any missing value were deleted).
Step 1.2. Data mining (complete data): Within the original data, algorithms for detecting a signal between a drug and a specific adverse reaction are usually based on disproportionality analysis and Bayesian analysis, comprising four statistical procedures: proportional reporting ratio (PRR), reporting odds ratio (ROR), information component (IC), and empirical Bayes geometric mean (EBGM). 18,19 The computation and criteria of the four algorithms follow reference 20.
Step 2: SVM optimization methods. As a binary outcome, the target reaction could be coded as factor-type ("Yes" and "No") or treated as number-type ("1" and "0") in SVM model setup. Parameter optimization could likewise be accomplished by two methods: number-optimization and R-function-optimization.
In number-optimization, the three parameters were tested separately to select a range covering the best value. After the best-performing combination of the three parameters was identified, the best values were used for prediction. In R-function-optimization, the parameter range was supplied to a built-in function (tune.svm), which returned the best values. If a best value lay close to a range boundary, an adjusted range was supplied and the optimization was run again. The factor-type data were optimized by both the number and function methods.
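The range-adjustment loop described above can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the R code used in the study: `score` is a hypothetical stand-in for cross-validated model performance, and the grid size and the half-span shift rule are choices made only for this example.

```python
def tune_with_expansion(score, low, high, steps=5, max_rounds=10):
    """Grid-search one parameter and shift the range while the best
    value sits on a range boundary (hypothetical helper)."""
    for _ in range(max_rounds):
        width = (high - low) / (steps - 1)
        grid = [low + i * width for i in range(steps)]
        best_idx = max(range(steps), key=lambda i: score(grid[i]))
        if 0 < best_idx < steps - 1:
            return grid[best_idx]  # interior optimum: accept it
        # Best value is on a boundary: move the range toward that side.
        span = high - low
        shift = -span / 2 if best_idx == 0 else span / 2
        low, high = low + shift, high + shift
    return grid[best_idx]  # no interior optimum found within max_rounds


# Toy score with a peak at 2.3; a real score would come from cross validation.
def toy_score(value):
    return -(value - 2.3) ** 2


best_gamma = tune_with_expansion(toy_score, low=0.0, high=1.0)
```

The sketch ignores parameter constraints (for example, gamma must stay positive), which a real search would have to enforce.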
Step 3: SVM model. After rows with missing values were deleted from the original data to generate the complete data, disproportionality and Bayesian analyses were used to quantify the signal, that is, the association between the reported features and the ADR. For each drug, the modelling set was divided by a stratified random split into training data (80%) and testing data (20%), each containing positive and negative cases in proportion. Once the algorithm had been optimized on the training data, no further changes were made, and it was evaluated on the testing data.
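A stratified 80/20 split of this kind can be illustrated as follows. This is a minimal Python sketch and not the R routine used in the study; the function name, the seed, and the toy labels are assumptions for illustration only.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) with the class ratio kept in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 10 positive (1) and 90 negative (0) cases.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps rare positive cases represented in both parts, which matters when the adverse reaction occurs in only a small fraction of reports.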
Step 3.1. Model variable selection. The key to constructing an SVM model that accurately screens the active markers is selecting appropriate variable indexes. 14 Variables were selected by two methods: the Near Zero Variance Method (R function: nearZeroVar) and Model Assessment (R function: varImp).
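The idea behind a near-zero-variance filter can be sketched as below. This Python function is an illustration only, not the caret implementation; the two thresholds (frequency ratio above 95/5 and at most 10% unique values) are assumed here to mirror commonly cited caret defaults, and the toy columns are invented.

```python
from collections import Counter


def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a variable whose values are dominated by one level (sketch)."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single distinct value
    # Ratio of the most common value's count to the second most common.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of all observations.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# Toy columns: one almost constant, one balanced.
serious = ["Yes"] * 98 + ["No"] * 2
sex = ["F"] * 45 + ["M"] * 55
flags = {"Serious": near_zero_variance(serious), "Sex": near_zero_variance(sex)}
```

A variable flagged `True` carries almost no information for separating cases and is a candidate for removal.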
Separate optimization of parameters (e.g., gamma, nu, cost, degree, coef0): their value ranges were determined by the best outputs.
Joint optimization of parameters (e.g., gamma, nu, cost): the complete-data were divided into training data and testing data according to a random seed; the best parameter values were set from the training data through 10-fold cross validation repeated 3 times.
Determination of parameters (e.g., gamma, nu): accuracy (overall correct rate), F1 score, sensitivity (correct rate among positive cases), and kappa (agreement) were chosen as evaluation indicators. The confusion matrix was calculated according to Appendix-table 1.
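The four evaluation indicators can be computed from a binary confusion matrix as in the following Python sketch. It assumes predictions have already been thresholded to 0/1 (the thresholding rule is not specified here), uses the standard definitions of F1 and Cohen's kappa, and the example labels are invented.

```python
def binary_metrics(actual, predicted):
    """Accuracy, sensitivity, F1 score and Cohen's kappa for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity, "f1": f1, "kappa": kappa}


# Toy example: 4 positive and 6 negative cases.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = binary_metrics(actual, predicted)
```

Kappa and F1 are informative here because accuracy alone can look high when negative cases dominate.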
Step 4: Curve regression. The parameters (e.g., gamma, nu) and the corresponding case numbers from SVM model setup were selected and tested in curve regression. The fitted formulas are given beside the curves.
Step 5: Data prediction. The prediction model was applied to the testing data and to the total data of a sixth drug (ipi) and of the exceptional drug (dur). Four indexes (accuracy, F1 score, sensitivity, and kappa) were checked for this model.

Statistical analysis
Descriptive analysis was used to summarize patient demographic characteristics, with means and ranges for continuous variables and ratios for categorical variables.
To explain the factors affecting hyperglycemia, Reaction was defined as the dependent variable and the others as explanatory (independent) variables. For the classification algorithm, we used SVM in the statistical software R, version 4.0.0 for Windows. The t-test was used to compare normally distributed data and to define 95% confidence intervals, and the Wilcoxon rank test was used to compare data with other or unknown distributions.

Results
From the FAERS database, a total of 95918 case files were downloaded for 7 relevant drugs (cem: 376, avel: 1116, dur: 2980, atez: 7061, pem: 20507, ipi: 20862, nivo: 43016). Hyperglycemic signal detection for these drugs is listed in Table 1: the Bayesian confidence propagation neural network (BCPNN) indicated that all 6 PD-1/PD-L1 inhibitors, and ROR indicated that 4 of them, would induce hyperglycemia. Because no hyperglycemic ADR was reported for cem and its original-data had too many missing items (only 2 cases in complete-data), cem was not included in the following analysis. Cases with unavailable items were removed; the demographic summary of the remaining cases is listed in Appendix-table 2, and the complete-data used below are based on these cases.
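Two of the disproportionality measures used for signal detection can be illustrated from a 2 × 2 report table. The Python sketch below uses the standard textbook definitions of PRR and ROR with a 95% confidence interval, plus a simplified point estimate of IC without the Bayesian priors used by BCPNN; the counts are hypothetical and the signal thresholds of the cited reference are not reproduced.

```python
import math


def disproportionality(a, b, c, d):
    """a: target drug with target reaction, b: target drug with other reactions,
    c: other drugs with target reaction, d: other drugs with other reactions."""
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    # 95% confidence interval of ROR on the log scale.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ror_low = math.exp(math.log(ror) - 1.96 * se)
    ror_high = math.exp(math.log(ror) + 1.96 * se)
    # Simplified IC: log2 of observed over expected count (no Bayesian prior).
    total = a + b + c + d
    ic = math.log2(a * total / ((a + b) * (a + c)))
    return {"prr": prr, "ror": ror, "ror_ci": (ror_low, ror_high), "ic": ic}


# Hypothetical counts, for illustration only.
signal = disproportionality(a=20, b=980, c=100, d=98900)
```

A lower confidence limit of ROR above 1 is a common indication that the reaction is reported disproportionately often for the drug.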
The model for original-data was set up and optimized with number-type and number-optimization, while the model for complete-data used both methods (Table 2). Accuracy and sensitivity results showed that number-optimization was more precise than R-function-optimization (avel, pem, nivo) and that number-type was slightly higher than factor-type (avel, dur, atez) (Fig. 2). The following SVM model setup therefore used the number-type and number-optimization method within complete-data. The two variable selection methods (Near-Zero-Variance-Method and Overall-Value-of-Model-Assessment) agreed on one point: the variables "Reporter" (consumer, health professional, etc.), "Serious", and "Source" were not important in the model (Table 3). These three variables and "ID" were not included in the following algorithm. Clinically, "Source" and "Reporter" describe where the ADR report came from and carry no further information on disease control. The other 8 variables ("Reactions", "Reason", "Country", "Weight", "Year", "Age", "Priority", and "Sex") were used for model setup. In SVM algorithms, the SVM kernel includes "l", "p", "r", and "s" (linear, polynomial, radial, and sigmoid), and the SVM type includes "C", "one", "eps", and "nu". The five PD-1/PD-L1 drugs were tested with the 4 × 4 cross combinations. Of the 16 combinations, "rbf" with "nu-regression" (rnu) gave good prediction, especially in F1 score and kappa (Fig. 3).
The main SVM parameters are degree, cost, gamma, nu, and coef0. In the parameter tests described in the Methods, all results showed the best degree and coef0 values to be "1" (the function default number), so the following model setup considered the other three parameters (nu, cost, and gamma). Based on the best model and to avoid overfitting, the cost value was set to "1" (the function default number), and the nu and gamma values were set to the mean of the best prediction range. After gamma and nu were adjusted on the training data, model performance was checked on the testing data; the four indexes (accuracy, F1 score, sensitivity, and kappa) are shown in Fig. 4.
As the four-index trends for dur were exceptional (excessively high sensitivity in Fig. 4), the curve regression did not include the parameters from dur (i.e., the curve regression was based on avel, atez, pem, and nivo). From model optimization, we found that the gamma values were quadratic in case number and the nu values were exponential in negative case number (Fig. 5; the fitted formulas are given beside the curves).
New predictions from the optimized model (type: nu-regression; kernel: rbf; parameters: gamma and nu) were made on the four testing data sets and the total data of two drugs (dur and ipi). The four indexes (accuracy, F1 score, sensitivity, and kappa) from the curve model were greatly improved over the initial prediction (Fig. 6, Appendix-table 3).
Receiver operating characteristic (ROC) analysis is a tool used to describe the discrimination accuracy of a diagnostic test or prediction model. 21 The diagnostic value of the model prediction and of the single items ("Reactions", "Reason", "Country", "Weight", "Year", "Age", "Priority", and "Sex") on the testing parts and the two drugs was evaluated by ROC curves; the predictive values from the model were much higher than those from the single FAERS items (Fig. 7).

Discussion
In this study, we developed a machine learning algorithm that predicted hyperglycemic cases in patients treated with PD-1/PD-L1 inhibitors, based on the US FDA Adverse Event Reporting System (FAERS) data for 5 drugs. To the best of our knowledge, this is the first prediction of hyperglycemic incidence from real-world clinical practice via machine learning.
Hyperglycemia is an important ADR in cancer treatment 22, as it influences the outcome of cancer therapy through various mechanisms, such as enhanced chemoresistance 23, metabolic reprogramming and molecular alterations 24, deactivation of antineoplastic agents 25, effects on pharmacokinetics, pharmacodynamics and ADRs 26, promotion of inflammation 1, and immune destruction. 27,28 One emphasis of pharmacovigilance is ADRs in the post-marketing phase. On the one hand, it is challenging to find effective and fast approaches amid the inter- and intra-patient heterogeneity of clinical results. On the other hand, prediction is important because the occurrence of an ADR is usually unknown during clinical treatment, as real-time monitoring for ADRs is inconvenient and expensive. This algorithm, built from features clinically available at the time of presentation, proved robust and generalizable in later testing. Such non-invasive and precise prediction could greatly help clinical practitioners identify high-risk patients. This study therefore provides a direction for predicting the occurrence of hyperglycemic ADRs of these drugs.
The FDA is responsible for protecting public health by ensuring the safety, efficacy, and security of human and veterinary drugs, biological products, and medical devices. The reports in FAERS are evaluated by clinical reviewers in the Center for Drug Evaluation and Research (CDER) and the Center for Biologics Evaluation and Research (CBER) to monitor the safety of products after they are approved by the FDA. 10 SVM is a structural dependence model that finds the maximum-margin hyperplane between the ADR reaction and the reported features. In the prediction step, new cases are projected into the same feature space to test which side of the hyperplane they fall on. 16 Each drug database was divided into training and testing parts proportionally and randomly, in order to set up and check the model on real data. As the adverse events may have occurred in only a small fraction of patients, we accounted for class imbalance by dividing the data with the R function createDataPartition.
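How a fitted kernel SVM assigns a new case to one side of the hyperplane can be sketched as follows. This Python example shows the generic decision value of an RBF-kernel SVM; the support vectors, coefficients, bias, and gamma are hypothetical values chosen for illustration and are not taken from the fitted model.

```python
import math


def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def decision_value(x, support_vectors, coefficients, bias, gamma):
    """Weighted sum of kernel similarities to the support vectors, plus bias."""
    total = sum(c * rbf_kernel(sv, x, gamma)
                for sv, c in zip(support_vectors, coefficients))
    return total + bias


# Hypothetical fitted model with two support vectors (one per class).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
coefficients = [-1.0, 1.0]
value_near_positive = decision_value((1.8, 2.1), support_vectors, coefficients, 0.0, 0.5)
value_near_negative = decision_value((0.1, -0.2), support_vectors, coefficients, 0.0, 0.5)
```

The sign of the decision value gives the predicted side; a larger gamma makes each support vector's influence more local, which is why gamma needs tuning.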
Because of missing values and duplicated cases, variables must be selected before machine learning. Variable selection was defined by two methods, both performed with the caret package in R. In the Near Zero Variance Method, variables displayed as "TRUE" should be deleted. In the Overall value of Model Assessment, variables displayed as "0" should be deleted. Comparing the two results with clinical analysis, the variables "reactions", "reason", "country", "age", "weight", and "year" were included in model setup. "pem" and "nivo" indicated the importance of "priority" in the "overall" method, as did "dur" in the "nzv" method, so "priority" was included. Although "sex" showed no importance in the "overall" method, all five drugs indicated its importance in the "nzv" method, so "sex" was included. For clinical and algorithmic reasons, "Serious", "Source", and "Reporter" were not included in model setup (Table 3).
The SVM model algorithm is defined by its kernel and type. The "r-nu" combination from the 4 × 4 cross combinations showed above-average accuracy and sensitivity and the top F1 score and kappa (Fig. 3). Hence, this combination was selected as the main algorithm in the SVM model.
Five parameters (degree, cost, gamma, nu, and coef0) were tested in the parameter selection step. As changing degree and coef0 did not alter the results, both were set to "1". Cost controls the level of overfitting: a higher cost value means a higher probability of overfitting. The SVM model indicated that cost could be set to "1", with adjustment of the other two parameters (gamma and nu) still giving the best results in 3 × 10 (10 folds, 3 times) cross validation within the training data. The other two parameters were then determined as the mean values from cross validation, and the testing data were predicted with these values. In Fig. 4, the sensitivity result for dur (highest) did not match the other four results (lowest), and its gamma value (gamma = 5) was excessive within the gamma parameter group, so the curve regression did not include the parameters from dur.
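The 3 × 10 cross validation scheme can be sketched by generating the fold indices. This Python generator is illustrative only, not the R routine used in the study; the seed and sample size are assumptions, and the step of fitting an SVM on each split is left out.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=3, seed=0):
    """Yield (train_idx, valid_idx) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for each repeat
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            valid = folds[held_out]
            train = [idx for j, fold in enumerate(folds) if j != held_out
                     for idx in fold]
            yield train, valid


# 100 toy samples give 30 train/validation splits (10 folds, 3 repeats).
splits = list(repeated_kfold(100))
```

Averaging the best gamma and nu over all splits, as described above, reduces the dependence of the chosen values on any single random partition.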
In the exponential curve (Formula-II), the constant (2650) was determined by the adjusted r² of the linear regression between the nu value and e^(−case number/constant). We tried many constant values, of which 2630, 2640, 2650, and 2660 all gave the same highest r² value (0.9567), so we selected a middle value (2650) as the formula constant.
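The constant search described above can be sketched as a scan over candidate constants, each scored by the adjusted r² of a linear regression of nu on e^(−case number/constant). The Python example below is a minimal illustration; the case numbers, nu values, and candidate constants are synthetic and are not the study's data or fitted coefficients.

```python
import math


def adjusted_r2(x, y):
    """Adjusted r² of the simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor


def best_constant(case_numbers, nu_values, candidates):
    """Pick the constant whose transformed predictor fits nu best."""
    def score(constant):
        x = [math.exp(-n / constant) for n in case_numbers]
        return adjusted_r2(x, nu_values)
    return max(candidates, key=score)  # the first candidate wins a tie


# Synthetic data generated with constant 2000, for illustration only.
case_numbers = [500, 2000, 5000, 12000]
nu_values = [0.1 + 0.5 * math.exp(-n / 2000) for n in case_numbers]
chosen = best_constant(case_numbers, nu_values, candidates=[1000, 2000, 4000])
```

When several neighbouring constants give the same adjusted r², as reported above, the fit alone cannot distinguish them and any of them is an equally supported choice.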
Although SVM machine learning has many advantages, it does not solve all the problems of spontaneous reporting systems. FAERS data themselves have limitations: no certainty that the reported event was due to the product, insufficient detail, incomplete reports, and duplicate reports. 10 In this study, only 5 ADR terms (see the inclusion criteria) could be selected because of the website's restriction on downloading larger data volumes. Other drug factors (e.g., dose, frequency) and biological factors (e.g., genomic data) were not included in the development of the present algorithm, which kept accuracy and F1 score from reaching 100%.
The graphical ROC curve is produced by plotting sensitivity (true positive rate) on the y-axis against 1 − specificity (false positive rate) on the x-axis for the various values tabulated. 29 The areas under the ROC curves (AUC) of the eight single items were far smaller than that of the model prediction (the uppermost red line in Fig. 7). This indicates that the combined prediction from our model is much more powerful than any single factor, none of which was meaningfully predictive on its own.
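The construction of an ROC curve and its AUC can be illustrated with a short Python sketch. It sweeps each observed score as a threshold and integrates with the trapezoidal rule; the labels and scores are invented for the example and do not come from the study.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) for each score threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 1)
        fp = sum(1 for l, s in zip(labels, scores) if s >= threshold and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: 3 positive and 3 negative cases with model scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = area_under_curve(roc_points(labels, scores))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation, which is the scale on which the model and the single items are compared.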
Despite the limitations described above, the algorithm may provide information that supports identifying signals between the hyperglycemic ADR and PD-1/PD-L1 regimens, adjusting care goals for these patients, or indicating directions for further well-organized clinical studies.

Conclusion
In summary, the SVM model established here can non-invasively and precisely predict the occurrence of hyperglycemic ADRs in patients treated with PD-1/PD-L1 inhibitors from given personal-related variables and case number. The SVM model showed good prediction performance on the testing data, indicating that the model is robust and generalizable. We also believe that the availability of drug regimen and dosage information will further improve this model's prediction of hyperglycemic ADR occurrence. Such information is vital for overcoming ADRs and improving patient outcomes. The prediction algorithm may eventually add value to clinical decision making.

Figure caption: Predictions from Optimized Parameters Compared with Initial SVM Results. Note: a = accuracy, F1 = F1 score, k = kappa, s = sensitivity.
Figure caption: Predictive Evaluations on ROC Curves.

Supplementary Files
This is a list of supplementary files associated with this preprint: tables13appendix.docx