2.2 Experimental Design
The SVM machine-learning procedure comprised: raw data download from FAERS, algorithm selection, model setup (parameter optimization and determination), parameter curve regression, and data prediction (Fig. 1).
Step 1: Data form selection
Raw data were organized into two forms: original data (with missing values) and complete data (without missing values; rows containing any missing value were deleted).
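The derivation of complete data from original data can be illustrated with a minimal Python sketch. It is not the code used in this study; the field names, the example rows, and the convention that None, empty strings, and "NA" count as missing are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    """Treat None, empty strings, and 'NA' as missing (assumed convention)."""
    return value is None or value.strip() in ("", "NA")


def complete_cases(rows: List[Row]) -> List[Row]:
    """Keep only the rows in which no field is missing."""
    return [row for row in rows if not any(is_missing(v) for v in row.values())]


# Hypothetical rows standing in for the original data.
original_data: List[Row] = [
    {"Case ID": "1", "Sex": "F", "Patient Age": "63"},
    {"Case ID": "2", "Sex": "", "Patient Age": "71"},
    {"Case ID": "3", "Sex": "M", "Patient Age": None},
]
complete_data = complete_cases(original_data)
```

Only the first row survives here, because each of the other two rows has one missing field.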
Step 1.1. Data extraction (original data):
The following items were obtained from the downloaded Excel files (raw data): Case ID, Reactions, Serious, Sex, Event Date, Case Priority, Patient Age, Patient Weight, Reporter Type, Report Source, and Country where Event occurred.
Excluded items were: Suspect Product Names, Suspect Product Active Ingredients, Reason for Use, Outcomes, Latest FDA Received Date, Sender, Concomitant Product Names, Latest Manufacturer Received Date, Initial FDA Received Date, Reported to Manufacturer, Manufacturer Control Number, Literature Reference, Compounded Flag.
Step 1.2. Data mining (complete data):
Within the original data, algorithms for detecting a signal of association between a drug and a specific adverse reaction are usually based on disproportionality analysis and Bayesian analysis, comprising four statistical procedures: proportional reporting ratio (PRR), reporting odds ratio (ROR), information component (IC), and empirical Bayes geometric mean (EBGM).18,19 The computation and criteria of the four algorithms followed the cited reference.20
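For orientation, the textbook point estimates of three of these measures can be computed from a 2x2 contingency table, as in the Python sketch below. This is an illustration under stated assumptions, not the computation of the cited reference: the cell labels a, b, c, d and the example counts are hypothetical, the IC shown is the plain observed-to-expected log ratio without Bayesian shrinkage, and EBGM is omitted because it requires fitting an empirical Bayes prior.

```python
import math
from typing import Dict


def disproportionality(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Point estimates from a 2x2 table of report counts.

                     target reaction   other reactions
    target drug            a                 b
    other drugs            c                 d
    """
    n = a + b + c + d
    prr = (a / (a + b)) / (c / (c + d))   # proportional reporting ratio
    ror = (a * d) / (b * c)               # reporting odds ratio
    expected = (a + b) * (a + c) / n      # expected count under independence
    ic = math.log2(a / expected)          # unshrunk information component
    return {"PRR": prr, "ROR": ror, "IC": ic}


# Hypothetical counts, for illustration only.
signal = disproportionality(a=20, b=80, c=100, d=9800)
```

With these counts the PRR is 19.8 and the ROR is 24.5. Signal criteria (thresholds and confidence limits) are deliberately left out, since they depend on the reference followed.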
Step 2: SVM optimization methods
As a binary outcome, the target reaction could be coded as a factor (“Yes” and “No”) or treated as a number (“1” and “0”) in SVM model setup. Parameter optimization could likewise be accomplished via two methods: number-optimization and R-function-optimization.
In number-optimization, the three parameters were tested separately to select a range covering the best value. After the best proportion was defined by the combination of the three parameters, the best values were set for prediction. In R-function-optimization, the parameter range was input to a built-in function (tune.svm), which output the best values. If a best value lay close to a range boundary, an adjusted range was input and the optimization was repeated. The factorized data were optimized via both the number and function methods.
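The boundary rule described above (repeat the search with an adjusted range when the best value falls on the edge of the range) can be sketched generically in Python. This is not tune.svm and not the procedure code of this study: the function name, the power-of-two candidate grid, and the toy scoring function are assumptions chosen for illustration, and the scoring function stands in for a cross-validated model score.

```python
import math
from typing import Callable, Tuple


def search_with_range_adjustment(
    score: Callable[[float], float],
    low_exp: int,
    high_exp: int,
    max_rounds: int = 5,
) -> Tuple[float, float]:
    """Search candidates 2**k for k in [low_exp, high_exp] (low_exp < high_exp).

    If the best candidate sits on a range boundary, shift the range past
    that boundary and search again, up to max_rounds times.
    """
    best_exp = low_exp
    for _ in range(max_rounds):
        exponents = range(low_exp, high_exp + 1)
        best_exp = max(exponents, key=lambda k: score(2.0 ** k))
        width = high_exp - low_exp
        if best_exp == high_exp:      # best on upper boundary: move range up
            low_exp, high_exp = best_exp - 1, best_exp + width
        elif best_exp == low_exp:     # best on lower boundary: move range down
            low_exp, high_exp = best_exp - width, best_exp + 1
        else:                         # best is interior: stop
            break
    best_value = 2.0 ** best_exp
    return best_value, score(best_value)


def toy_score(value: float) -> float:
    """Hypothetical score that peaks at value = 2**7."""
    return -((math.log2(value) - 7.0) ** 2)


best_value, best_score = search_with_range_adjustment(toy_score, -2, 2)
```

Starting from the range 2^-2 to 2^2, the search moves upward twice before the optimum becomes an interior point, and it then returns 128.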
Step 3: SVM model
After rows with missing values were deleted from the original data to generate the complete data, disproportionality and Bayesian analyses were adopted to quantify the signal, i.e., the association between the reported features and the ADR. The general modelling set was obtained by stratified random-split cross-validation into training data (80% of the data) and testing data (20% of the data) for each drug, each containing proportional numbers of positive and negative cases. Once the algorithm was optimized on the training data, no further changes were made, and it was evaluated on the testing data.
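A stratified 80/20 split of the kind described above can be sketched in Python as follows. The function name, the seed, and the example labels are hypothetical; the sketch only shows how splitting within each class keeps the proportion of positive and negative cases the same in both parts.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str], train_fraction: float = 0.8, seed: int = 1
) -> Tuple[List[int], List[int]]:
    """Split row indices into training and testing sets, class by class."""
    rng = random.Random(seed)
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)                        # random order within the class
        cut = round(len(indices) * train_fraction)  # 80% of this class
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical labels: 20 positive and 80 negative cases for one drug.
labels = ["Yes"] * 20 + ["No"] * 80
train_idx, test_idx = stratified_split(labels)
```

In this example the training set holds 16 positive and 64 negative cases, and the testing set holds 4 positive and 16 negative cases.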
Step 3.1. Model variable selection
The key to constructing an SVM model that can accurately screen the active markers is selecting appropriate variable indexes.14 Variable selection followed two methods: the near-zero-variance method (R function: nearZeroVar) and model assessment (R function: varImp).
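The idea behind a near-zero-variance filter can be sketched in Python. This is a simplified stand-in, not the R function itself: the rule below (flag a variable when the ratio of its most common to its second most common value exceeds 95/5 and its percentage of distinct values is at most 10) and its thresholds are assumptions for illustration, and the example columns are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def near_zero_variance(
    values: Sequence[str], freq_cut: float = 95 / 5, unique_cut: float = 10.0
) -> bool:
    """Return True if the variable carries almost no variation."""
    counts = Counter(values).most_common()
    if len(counts) <= 1:
        return True  # constant (or empty) column: zero variance
    freq_ratio = counts[0][1] / counts[1][1]            # most vs. second most common
    percent_unique = 100.0 * len(counts) / len(values)  # share of distinct values
    return freq_ratio > freq_cut and percent_unique <= unique_cut


# Hypothetical columns, for illustration only.
columns: Dict[str, List[str]] = {
    "Country": ["US"] * 98 + ["JP"] * 2,  # dominated by one value
    "Sex": ["F"] * 55 + ["M"] * 45,       # balanced
}
kept = [name for name, values in columns.items() if not near_zero_variance(values)]
```

Here "Country" is flagged (frequency ratio 49, 2% distinct values) and only "Sex" is kept.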
Step 3.2. Model setup
Parameter algorithm selection: SVM-Type (C-classification, "C"; one-classification, "one"; eps-regression, "eps"; nu-regression, "nu") and SVM-Kernel (linear, "l"; poly, "p"; rbf, "r"; sigmoid, "s").
Separate optimization of parameters (e.g., gamma, nu, cost, degree, coef0): the value ranges were determined by the best outputs.
Mutual optimization of parameters (e.g., gamma, nu, cost): the complete data were divided into training data and testing data according to a random seed; the best parameter values were set on the training data through 10-fold cross-validation repeated three times.
Determination of parameters (e.g., gamma, nu): accuracy (overall correct rate), F1 score, sensitivity (correct rate among positive cases), and kappa (consistency) were chosen as evaluation indicators. The confusion matrix was calculated according to Appendix Table 1.
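The four evaluation indicators can be derived from the cells of a binary confusion matrix, as in the Python sketch below. The standard definitions are used (Cohen's kappa for consistency); the cell names tp, fp, fn, tn and the example counts are assumptions for illustration and may be laid out differently from Appendix Table 1.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, sensitivity, F1 score, and Cohen's kappa from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # correctly predicted share of actual positives
    precision = tp / (tp + fp)     # correct share of predicted positives
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

For these counts the accuracy is 0.85 and kappa is 0.70; the sketch does not guard against empty rows or columns, which would cause a division by zero.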
Step 4: Curve regression
The variables (e.g., gamma, nu) and the corresponding case numbers from SVM model setup were selected and tested in curve regression. The fitted formulas are given beside each curve.
Step 5: Data prediction
The prediction model was run on the testing data and included the sixth drug (ipi) as well as the exceptional drug (dur). Four indexes (accuracy, F1 score, sensitivity, and kappa) were evaluated for this model.