Preoperative prediction of complicated appendicitis using machine learning methods

BACKGROUND Accurate preoperative prediction of complicated appendicitis (CA) could help clinicians select the optimal treatment and reduce the risk of postoperative complications. This study aimed to develop a machine learning model based on clinical symptoms and laboratory data to predict CA preoperatively. METHODS A total of 136 patients with a clinicopathological diagnosis of acute appendicitis were retrospectively included. The dataset was randomly divided into a training set (n = 94) and a testing set (n = 42). Predictive models using individual selected clinical and laboratory features, and using their combination, were built separately. Three combined models were constructed using logistic regression (LR), support vector machine (SVM) and random forest (RF) algorithms. CA prediction performance was evaluated with receiver operating characteristic (ROC) analysis using the area under the curve (AUC), sensitivity, specificity and accuracy. RESULTS Abdominal pain time, nausea and vomiting, highest temperature, high-sensitivity C-reactive protein (hs-CRP) and procalcitonin (PCT) differed significantly between patients with and without CA (P < 0.001). The ability of each individual feature to predict CA was low (AUC < 0.8), whereas prediction with combined features was significantly improved. The AUCs of the LR, SVM and RF models were 0.805, 0.888 and 0.908 in the training set and 0.794, 0.895 and 0.761 in the testing set, respectively. The SVM-based model performed best for CA prediction. RF had the highest AUC in the training set, but its poor performance in the testing set indicated poor generalization. CONCLUSIONS An SVM model using clinical and laboratory data can predict CA preoperatively and could assist diagnosis in resource-limited settings.

can help optimize operative treatment, including the surgical approach, the administration of preoperative antibiotics and the duration of postoperative antibiotics [5]. Improving the differential diagnosis of simple and complicated appendicitis can help reduce the risk of postoperative complications, shorten recovery time and hospital stay, and reduce the medical burden on hospitals and patients [6,7,8].
In the preoperative diagnosis of suspected acute appendicitis, history, clinical manifestations, laboratory examinations and medical imaging provide the main diagnostic basis. Among the widely used imaging tools for acute appendicitis, the reported sensitivity and specificity exceed 90% for CT and 80% for ultrasound [9,10]. The sensitivity of non-contrast-enhanced MRI can approach 77% [11]. With technological innovation, researchers and clinicians have paid increasing attention to identifying the pathological type of appendicitis promptly and accurately before treatment. Both CT and ultrasound have been applied to identify complicated appendicitis or to predict its pathological severity [12][13][14]. CT features including appendix diameter, dependent fluid and appendolithiasis have been reported to be associated with pathological severity [15], and the concordance of ultrasound with pathology has been reported to be higher for perforated appendicitis [16].
However, apart from the factors that limit the accuracy of imaging-based pathology prediction (such as localized inflammation, uncertain appendix position or difficulties in margin measurement), there are also safety concerns about radiation-based imaging, and clinicians skilled in ultrasound are not available at all times. Therefore, timely diagnosis of acute appendicitis, and even of its pathological type, from clinical manifestations and laboratory examinations is challenging but necessary. Several scoring systems have been developed to help clinicians diagnose acute appendicitis, including the Alvarado, the Raja Isteri Pengiran Anak Saleha Appendicitis (RIPASA) and the Appendicitis Inflammatory Response (AIR) scoring systems [17][18][19][20][21]. The AIR score, an index of the acute inflammatory response based on the Alvarado score, improves the accuracy of the diagnosis of acute appendicitis. It has also been reported that combined interpretation of abnormal white cell count (WCC) or C-reactive protein (CRP) results could yield a sensitivity comparable to that of CT in diagnosing acute appendicitis [22]. However, recent studies show that these scoring systems still face great challenges in discriminating between UA and CA [17,21].
Encouragingly, studies have begun to use laboratory factors such as CRP and neutrophils to differentiate simple, cellulitis and gangrenous appendicitis [23]. In addition, several studies have shown that artificial intelligence (AI) can significantly improve the accuracy and efficiency of prediction and initial diagnosis of acute appendicitis by combining factors derived from physical signs, symptoms and laboratory tests [24][25][26][27][28]. However, only a few studies have presented AI-based predictive models for the pathological types of acute appendicitis [29]. The pathological type of acute appendicitis is of great value for clinical treatment decision-making, and AI-assisted tools may enhance diagnostic confidence in resource-limited settings where only conventional clinical methods are accessible. Therefore, building on our previous work [30], we aimed to use machine learning methods to establish an optimized model that predicts the UA and CA pathological types preoperatively, and to enrich the preliminary data for AI-assisted appendicitis diagnosis.

Patients
This retrospective study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of Aerospace Center Hospital, with the requirement for informed consent waived.
From June 2015 to November 2017, 138 patients with acute appendicitis who had undergone surgery and had clinical and laboratory data available, including inflammatory response markers and pathological diagnosis, were initially included.
The inclusion criteria were as follows: 1) histologically confirmed acute appendicitis, either UA or CA; 2) available records of clinical manifestations (RIPASA scoring) and preoperative laboratory tests (full blood cell counts and inflammatory factors), as listed in Table 1 and Table 2; 3) agreement to participate in the study with signed informed consent; 4) no surgical contraindications on preoperative examination; and 5) age between 18 and 81 years. The exclusion criteria were: 1) failure to meet the inclusion criteria; and 2) ileocecal neoplasms. Finally, 2 patients with mucinous adenocarcinoma were excluded, and 136 patients were enrolled in this study. The patient selection and exclusion process is shown in Figure 1. Basic information, including age and sex, physical symptoms, clinicopathological data and preoperative blood assay results, such as the inflammatory factors high-sensitivity C-reactive protein (hs-CRP) and procalcitonin (PCT) and the lymphocyte subpopulations, was retrospectively extracted from electronic medical records.

Histopathology
All patients underwent surgical treatment, and all surgical specimens were examined by two pathologists. The pathological types of the 136 cases of acute appendicitis were as follows: acute simple appendicitis (n = 9), acute purulent appendicitis (n = 103), acute gangrenous or perforated appendicitis (n = 24), and periappendiceal abscess (n = 0). Accordingly, there were 112 UA cases (simple and purulent appendicitis) and 24 CA cases (gangrenous or perforated appendicitis).

Feature selection and CA predictive modeling
In this study, patients were divided into UA and CA groups according to histopathology. Univariate analysis was used to select, from the clinical and laboratory data, the features that differed significantly between the UA and CA groups. CA predictive models based on individual clinical and laboratory features, and models combining clinical and laboratory features, were built separately. The CA prediction probability of each individual clinical and laboratory feature was estimated by univariate logistic regression analysis. To construct predictive models based on combined features, three machine learning algorithms with high stability were investigated: logistic regression (LR), support vector machine (SVM) and random forest (RF). The models were trained and assessed with repeated ten-fold cross-validation in the training set, and their discriminative performance was evaluated in the testing set.
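The repeated ten-fold cross-validation described above can be sketched as follows. This is a minimal, standard-library Python illustration, not the authors' R code: it assumes stratified folds (so that each fold keeps roughly the UA/CA ratio of the training set) and an arbitrary number of repeats, neither of which is specified in the text, and `stratified_k_fold` and `repeated_cv_splits` are hypothetical helper names.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_k_fold(labels: Sequence[int], k: int = 10, seed: int = 0) -> List[List[int]]:
    """Partition sample indices into k folds that keep roughly the UA/CA ratio.

    Indices of each class are shuffled and dealt round-robin across folds;
    dealing continues where the previous class stopped so fold sizes stay even.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        for j, idx in enumerate(idxs):
            folds[(offset + j) % k].append(idx)
        offset += len(idxs)
    return folds


def repeated_cv_splits(
    labels: Sequence[int], k: int = 10, repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for every fold of every repeat.

    Each repeat reshuffles with a different seed, so a model's cross-validated
    estimate is the average of its metric over k * repeats held-out folds.
    """
    for r in range(repeats):
        folds = stratified_k_fold(labels, k, seed + r)
        for i in range(k):
            test = sorted(folds[i])
            train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
            yield train, test
```

For example, with a hypothetical training-set label vector of 17 CA (1) and 77 UA (0) cases, `list(repeated_cv_splits(labels, k=10, repeats=3))` yields 30 train/test index pairs, and every test fold holds one or two CA cases. Stratification matters here because CA is the minority class; a purely random split could leave some folds without any CA case.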

Validation of the prediction model
Univariate logistic regression analysis was used to assess the clinical and laboratory features in predicting CA. The diagnostic ability of the single-feature and combined models was studied with receiver operating characteristic (ROC) analysis. CA prediction performance was assessed using the area under the ROC curve (AUC), sensitivity, specificity and accuracy (ACC). In addition, a nomogram was plotted to present the predictive effect of the logistic regression model. Differences in AUC among the three machine learning models were tested for statistical significance. Decision curve analysis (DCA) was conducted in the testing set to evaluate the clinical usefulness of the best preoperative prediction model by quantifying the net benefit at different threshold probabilities [31].
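The AUC, sensitivity, specificity and accuracy named above can be computed as in the following minimal sketch. It uses the rank (Mann-Whitney) form of the empirical AUC, which equals the area under the empirical ROC curve, and it assumes a Youden-index cutoff (maximizing sensitivity + specificity - 1) to dichotomize predicted probabilities; the text does not state which cutoff was used, and the function names are illustrative only.

```python
from typing import Optional, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Empirical AUC: probability that a CA case (1) outscores a UA case (0).

    Ties count one half, so this equals U / (n_CA * n_UA) from the
    Mann-Whitney U statistic and the trapezoidal area under the empirical ROC.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def confusion_metrics(
    y_true: Sequence[int], scores: Sequence[float], cutoff: float
) -> Tuple[float, float, float]:
    """Sensitivity, specificity and accuracy when score >= cutoff is called CA."""
    tp = fn = tn = fp = 0
    for s, y in zip(scores, y_true):
        predicted_ca = s >= cutoff
        if y == 1 and predicted_ca:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_ca:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


def youden_cutoff(y_true: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Observed score that maximizes Youden's J = sensitivity + specificity - 1."""
    best_cut: Optional[float] = None
    best_j = float("-inf")
    for c in sorted(set(scores)):
        sens, spec, _ = confusion_metrics(y_true, scores, c)
        if sens + spec - 1.0 > best_j:
            best_cut, best_j = c, sens + spec - 1.0
    return best_cut
```

Because the cutoff is chosen on the same data it is evaluated on, sensitivity and specificity at a Youden cutoff tend to be optimistic in the training set; fixing the cutoff in the training set and applying it unchanged to the testing set avoids this.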

Statistical analysis
Proportions and ranks of variables were compared between the training and testing sets, and between the UA and CA groups, using the chi-square test, Fisher's exact test, Kruskal-Wallis H test, Student's t-test or Mann-Whitney U test, as appropriate. Specifically, the clinical and laboratory characteristics were compared with the chi-square test or Fisher's exact test for nominal variables, the Kruskal-Wallis H test for ordinal variables and the Mann-Whitney U test for non-normally distributed continuous variables. Univariate logistic regression analysis was used to present the prediction performance of each individual clinical or laboratory feature. ROC curve analyses were performed to determine the AUC, ACC, sensitivity and specificity of each predictive model. The statistical difference in AUC between each pair of machine learning models was analyzed with DeLong's test. DCA describes the clinical benefit of a predictive model as the proportion of true positives minus the proportion of false positives, the latter weighted by the odds of the selected threshold probability.
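In the standard DCA formulation, the net benefit at a threshold probability pt is TP/n - (FP/n) * pt/(1 - pt), where TP and FP count the patients classified as positive (predicted probability >= pt) who do and do not have CA, and n is the total number of patients. The sketch below implements this standard definition together with the "treat all" reference curve; it is an illustration, not the 'rmda' implementation, and the function names are hypothetical.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], probs: Sequence[float], pt: float) -> float:
    """Net benefit of treating patients whose predicted CA probability is >= pt.

    NB = TP/n - FP/n * pt / (1 - pt): false positives are penalized by the odds
    of the threshold, i.e. the harm-to-benefit ratio that the threshold implies.
    """
    if not 0.0 < pt < 1.0:
        raise ValueError("threshold probability must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for p, y in zip(probs, y_true) if p >= pt and y == 1)
    fp = sum(1 for p, y in zip(probs, y_true) if p >= pt and y == 0)
    return tp / n - (fp / n) * (pt / (1.0 - pt))


def net_benefit_treat_all(y_true: Sequence[int], pt: float) -> float:
    """Reference strategy that treats every patient as CA."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * (pt / (1.0 - pt))
```

A decision curve is obtained by evaluating `net_benefit` over a grid of thresholds (for example 0.01 to 0.99) and plotting it against the treat-all curve and the treat-none line (net benefit 0); a model is clinically useful over the threshold range where its curve lies above both references.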
Statistical analysis was conducted with R software (version 3.6.0, https://www.r-project.org). All reported P values were two-sided, and statistical significance was set at P < 0.05. Multivariable logistic regression and ROC analysis were performed with the 'stats', 'glmnet' and 'pROC' packages. Nomograms and decision curves were constructed with the 'rms' and 'rmda' packages, respectively.

Statistical analysis and feature selection
The clinical and laboratory characteristics of patients in the training and testing sets are shown in Table 1. Demographic, clinical and laboratory features did not differ significantly between the training and testing sets. The comparison of clinical and laboratory data, including the inflammatory response, between the UA and CA groups of the 136 patients is presented in Table 2. Univariate analysis showed that nausea and vomiting, abdominal pain time, highest temperature, hs-CRP and PCT differed significantly between the UA and CA groups (P < 0.001), indicating an association with the pathological type of appendicitis. The performance of these selected clinical and laboratory features in diagnosing CA is shown in Table 3. Although hs-CRP predicted CA pathology preoperatively better than the other features, as shown in the nomogram in Fig. 2, the AUC of every single-feature prediction of CA was below 0.80.
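Each single-feature model in Table 3 corresponds to a univariate logistic regression, logit P(CA) = b0 + b1 * x. A minimal Newton-Raphson fit is sketched below for illustration only (the study used R); it assumes a single numeric feature such as hs-CRP and data that are not perfectly separable, since otherwise the maximum-likelihood estimate does not exist. The function name is hypothetical, and exp(b1) gives the odds ratio per unit increase of the feature.

```python
import math
from typing import Sequence, Tuple


def fit_univariate_logistic(
    x: Sequence[float], y: Sequence[int], max_iter: int = 50, tol: float = 1e-10
) -> Tuple[float, float]:
    """Maximum-likelihood fit of logit P(CA) = b0 + b1 * x by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0          # score vector X'(y - p)
        h00 = h01 = h11 = 0.0  # information matrix X'WX with W = p(1 - p)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: (b0, b1) += (X'WX)^-1 X'(y - p), via the explicit 2x2 inverse.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1
```

With a binary feature the fit reproduces the observed log-odds: if 1 of 4 patients without the finding and 3 of 4 patients with it have CA, b0 = log(1/3), b1 = 2 log 3 and the odds ratio is 9. The fitted probabilities can then be passed to an ROC routine to obtain the single-feature AUC.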

Prediction model assessment
By combining clinical and laboratory features, the CA predictive performance of the three machine learning models (LR, SVM and RF) is summarized in Table 4. For the logistic regression model, the nomogram in Fig. 2 also indicated that hs-CRP contributed more to CA prediction than the other features, and the predicted probability of CA reached 80% at a total score of 200 points. Although the pairwise differences in AUC among the three models were not statistically significant (Table 5, P > 0.05), the SVM model showed the best overall balance of accuracy, sensitivity, specificity and AUC (Table 4).
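The pairwise AUC comparisons in Table 5 rely on DeLong's test for correlated ROC curves, because all models score the same patients. A minimal, standard-library sketch of the paired test is shown below; it follows DeLong's structural-component formulation but is not the 'pROC' implementation, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _psi(x: float, y: float) -> float:
    """Mann-Whitney kernel: 1 if x > y, 0.5 if tied, otherwise 0."""
    return 1.0 if x > y else 0.5 if x == y else 0.0


def _components(pos: Sequence[float], neg: Sequence[float]) -> Tuple[List[float], List[float]]:
    """DeLong structural components V10 (one per CA case) and V01 (one per UA case)."""
    v10 = [sum(_psi(p, n) for n in neg) / len(neg) for p in pos]
    v01 = [sum(_psi(p, n) for p in pos) / len(pos) for n in neg]
    return v10, v01


def _cov(a: Sequence[float], b: Sequence[float]) -> float:
    """Sample covariance with n - 1 in the denominator."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def delong_paired_test(
    y_true: Sequence[int], scores_a: Sequence[float], scores_b: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Two-sided DeLong test for two correlated AUCs; returns (auc_a, auc_b, z, p)."""
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    a10, a01 = _components([scores_a[i] for i in pos], [scores_a[i] for i in neg])
    b10, b01 = _components([scores_b[i] for i in pos], [scores_b[i] for i in neg])
    auc_a, auc_b = sum(a10) / len(pos), sum(b10) / len(pos)
    # Var(AUC_a - AUC_b) = (S10_aa + S10_bb - 2 S10_ab) / m + (S01_aa + S01_bb - 2 S01_ab) / n
    var = (_cov(a10, a10) + _cov(b10, b10) - 2 * _cov(a10, b10)) / len(pos) + (
        _cov(a01, a01) + _cov(b01, b01) - 2 * _cov(a01, b01)
    ) / len(neg)
    if var <= 0:
        raise ValueError("variance of the AUC difference is zero")
    z = (auc_a - auc_b) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return auc_a, auc_b, z, p_value
```

Because the covariance terms account for both models being evaluated on the same patients, the test is more appropriate here than comparing two independent AUC confidence intervals; with a testing set of 42 patients, however, even sizeable AUC differences can yield P > 0.05.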
The DCA of the selected SVM model is shown in Fig. 3D. It showed that the SVM model provided a higher overall net benefit when the threshold probability for a patient ranged from 0.20 to 0.85. Overall, the SVM model performed best in predicting CA pathology among the three combined machine learning models.

Discussion
Diagnosis of acute appendicitis is a dynamic process closely correlated with the pathological changes of the disease. Under resource-limited conditions, careful and comprehensive consideration of clinical symptoms, signs and blood biomarkers could help medical practitioners make decisions for patients with acute appendicitis [32]. Blood inflammation biomarkers are of great value in the diagnosis and pathological classification of acute appendicitis [20,21,33]. AI-assisted diagnosis of acute appendicitis may help clinicians make better decisions. However, AI-based predictive models for the pathological types of acute appendicitis constructed from general clinical examinations are still lacking [29]. In the current study, combined clinical symptoms, signs and laboratory data were used, and three models were constructed with the LR, SVM and RF algorithms to discriminate CA from UA preoperatively.
In this study, univariate difference analysis was used to select the features most correlated with the pathological classification of acute appendicitis from a data pool including clinical signs and blood indexes, and the effect of each index on the diagnostic result was considered. hs-CRP predicted CA pathology preoperatively better than the other features, as shown in the nomogram, with an AUC of 0.744 (0.606-0.883) in the training set and 0.702 (0.542-0.862) in the testing set. This finding is in accordance with a previous report in which C-reactive protein was a good predictor of complicated appendicitis [23]. However, the sensitivity and specificity of hs-CRP in the testing set of our study were lower than those reported in that study (73.1% sensitivity and 89.5% specificity). The combined prediction models were then built only from these selected features, which to some extent limits the influence of weakly correlated features and possible overfitting on the performance of the joint diagnostic model. Among the three combined predictive models constructed with LR, SVM and RF, the SVM model showed the best diagnostic performance. Admittedly, the CA predictive performance of LR and RF did not differ significantly from that of SVM. The SVM model may nevertheless be preferable, probably for the following reasons.
First, multivariable logistic regression is a classical linear modeling method for estimating the probability of an outcome, and its fit is influenced by all data points; it may therefore fail to capture complex nonlinear relationships and may be biased by weakly correlated features. Compared with logistic regression, an SVM with a nonlinear kernel function can overcome these restrictions [28]. The SVM classifier is determined by the data points that matter most for the classification (the support vectors near the decision boundary) and tends to generalize better to unseen data.
Second, SVM may achieve better classification performance for problems with small samples and high dimensionality [34], and CA cases were the minority in our study (24 of 136 patients). The RF model performed well in the training set (AUC 0.908), but its AUC dropped sharply to 0.761 in the testing set, which might be caused by the small sample size and the large depth of the decision trees [35]. Nevertheless, ensemble learning in the RF model, which is based on the accuracy and importance of variables, may help maintain model accuracy.
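The point in the first reason above, that an SVM is determined by a subset of training points, can be made concrete with the kernel decision function f(x) = sum_i alpha_i * y_i * K(x_i, x) + b, in which only support vectors (alpha_i > 0) contribute. The sketch below assumes a radial basis function (RBF) kernel; the kernel, gamma and coefficients are illustrative assumptions, since the text does not report them, and in practice they come from training the model.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float) -> float:
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def svm_decision(
    x: Sequence[float],
    support_vectors: Sequence[Sequence[float]],
    dual_coefs: Sequence[float],
    intercept: float,
    gamma: float,
) -> float:
    """Kernel SVM decision value f(x) = sum_i alpha_i * y_i * K(x_i, x) + b.

    dual_coefs holds alpha_i * y_i (y_i = +1 for CA, -1 for UA). Only training
    points with alpha_i > 0 (support vectors) need to be stored, because every
    other point contributes zero to the sum.
    """
    return sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, dual_coefs)
    ) + intercept
```

A new patient would be classified as CA when f(x) > 0. Training points with alpha_i = 0 can be removed without changing f, which is why the boundary is not pulled by the many UA cases far from it, unlike the logistic likelihood, to which every patient contributes.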
Visualizations derived from AI-based analysis make it easy for clinicians to reach quick decisions in combination with their clinical experience. However, the current study has several limitations. First, fewer than 100 patients (the 94-patient training set) were used to build the machine learning models; because such a small sample may not represent the entire population, the performance of the classifiers may not generalize. In addition, the performance of each machine learning method depends closely on the properties of the dataset, so a selected algorithm may not always be the best, and the choice of algorithm and the prediction procedure should be well interpretable. Because the distribution of UA and CA over time was imbalanced in the current study, the data could not be split by period for further model validation. Extending the research to multiple centers might improve the generalization and accuracy of the prediction model. As the sample size increases, more features, including those that are not specific to or strongly related to conventional acute appendicitis diagnosis, should also be considered when optimizing the CA prediction model, so that the joint diagnostic model does not omit potentially useful but nonspecific features. Models built on period-based data splits should also be considered to validate the feasibility and stability of AI-based prediction methods.
The current research developed a method for preoperative prediction of complicated appendicitis using machine learning. The prediction model constructed with the SVM algorithm performed better than those constructed with the LR and RF algorithms, and the decision curve analysis suggests good clinical practicability.

Ethics approval and consent to participate
This study was reviewed and approved by the Ethics Committee of Aerospace Center Hospital. All patients signed informed consent before the operation.

Consent for publication
All patients or their caregivers signed a consent form giving permission to use their clinical data for research.

Availability of data and materials
The core data have been included in the manuscript. The datasets generated for this study are available from the corresponding author on request.

Competing interests statement
The authors declare that they have no competing interests.

Funding
The study was supported by the Funding Project of Aerospace Center Hospital (grant number YN201429).

Acknowledgments
We thank the General Hospital of Aeronautics and Astronautics and Aerospace Center Hospital for their strong support in technology, funding and services.

Abbreviations: hs-CRP, high-sensitivity C-reactive protein; PCT, procalcitonin; AUC, area under the curve.

Figure 1 Flow chart of the patient selection and exclusion process