Group Penalized Logistic Regressions Predict Ovarian Cancer

Objectives: Ovarian cancer ranks first among gynecological cancers in terms of mortality rate. Accurately distinguishing benign from malignant ovarian tumors is of immense importance. The goal of this paper is to combine group LASSO/SCAD/MCP penalized logistic regression with machine learning procedures to further improve the accuracy of predicting whether an ovarian tumor is benign or malignant.

Methods: We combine the group LASSO/SCAD/MCP penalty with logistic regression and propose group LASSO/SCAD/MCP penalized logistic regression to predict benign and malignant ovarian tumors. First, we select data from 349 ovarian cancer patients and divide them into two sets, a training set for learning and a testing set for checking; we then choose 46 explanatory variables and divide them into 11 groups. Second, we apply the group coordinate descent algorithm to the training set to obtain the group LASSO/SCAD/MCP estimators, and use the testing set to compute the confusion matrix, accuracy, sensitivity and specificity. Finally, we compare the prediction performance of group LASSO/SCAD/MCP penalized logistic regression with that of artificial neural network (ANN) and support vector machine (SVM).

Results: Group LASSO/SCAD/MCP penalized logistic regression selects 6/4/1 groups, respectively. The prediction accuracy and AUC for group MCP/SCAD/LASSO penalized logistic regression, SVM and ANN are 93.33%/85.71%/82.26%/74.29%/72.38% and 0.892/0.852/0.823/0.639/0.789, respectively.

Conclusions: Group MCP/SCAD/LASSO penalized logistic regression performs better than SVM and ANN in terms of prediction accuracy and AUC. In particular, group MCP penalized logistic regression predicts best. Therefore, we suggest using group MCP penalized logistic regression to predict ovarian tumors.


Ovarian cancer is a malignant tumor growing on the ovary. Its incidence rate is lower than that of cervical and endometrial cancer, whereas its mortality rate is higher than that of cervical and endometrial cancer combined and ranks first among gynecologic cancers. According to the global cancer data released by the World Health Organization International Agency in 2020, [...] CT and PET, and help better diagnosis. Advanced ovarian cancer is the leading cause of cancer death in women. Even after chemotherapy, the recurrence rate is still over 70%. The high degree of malignancy, high recurrence rate and poor prognosis of advanced ovarian cancer have become prominent factors affecting the survival of ovarian cancer patients. Therefore, it is crucial to accurately distinguish benign from malignant ovarian tumors.

There are many studies related to ovarian cancer, including analyses of influencing factors, screening methods, tumor markers, treatment and prognosis. Kikkawa et al. (1998) assessed the value of tumor markers and clinical characteristics in making a differential diagnosis between MCT and squamous cell carcinoma arising from MCT. They demonstrated significant differences in age, tumor size, and levels of squamous cell carcinoma antigen (SCC), CA125, and CEA, as well as a significant difference in the CA19-9 level, between MCT and squamous cell carcinoma arising from MCT. They found that (1) age and tumor size are important factors in making a differential diagnosis, with optimal cutoff values of 45 years and 99 mm, respectively; (2) CEA was the best screening marker for squamous cell carcinoma arising from MCT, whereas age and tumor size were better markers than CA125 or CA19-9; and (3) SCC and CEA levels should be measured in patients aged 45 years or older who have an MCT-like ovarian tumor larger than 99 mm in greatest dimension [1].
Robbins [...] after diagnosis of ovarian cancer, and found that high lifetime ovulatory cycles (LOC) and early age at menarche were associated with decreased survival after ovarian cancer [2]. Díaz-Padilla [...] sparsely, are contaminated with noise and the error process is highly non-stationary [14].

In this paper, we combine group LASSO/SCAD/MCP with logistic regression and propose a group LASSO/SCAD/MCP penalized logistic regression classifier to distinguish benign from malignant ovarian tumors. First, we select 46 predictor variables, including blood routine, general chemical detection, tumor markers and basic information, and divide the selected 349 ovarian cancer patients into two sets: a training set for learning and a testing set for predicting. We apply the group coordinate descent algorithm to the training samples to obtain the group LASSO/SCAD/MCP estimators, and use the testing samples to establish the two-class confusion matrix, prediction accuracy, sensitivity and specificity, draw the ROC curve, and use the area under the ROC curve (AUC) to assess the prediction performance. Finally, we compare group LASSO/SCAD/MCP penalized logistic regressions with SVM and ANN, and find that the prediction [...]

The rest of the paper is arranged as follows: Section 2 specifies the data source, the 49 features and their grouping. Section 3 constructs three group penalized methods. Section 4 reports the model estimators and the prediction performance of the five methods. Section 5 concludes.

[...] ovarian-cancer). The data set is divided into two parts: a training set composed of 70% of the ovarian cancer patients and a testing set composed of the remaining 30%, see Table 1.

Carbohydrate antigen 19-9: 0 ∼ 37 (U/mL)

The main function of platelets is to accelerate coagulation, promote hemostasis and repair damaged blood vessels.
Group 2. White blood cell (X5, X6, X7, X8, X9, X10, X11, X12, X13): White blood cells can phagocytose foreign materials, produce antibodies, heal body damage, resist pathogen invasion and provide immunity against disease.
Group 3. Red blood cell (X14, X15, X16, X17, [...]): The main work of red blood cells is to transport oxygen and carbon dioxide; they can also enhance phagocytosis and immune adhesion.
Group 4. Chemical element (X20, X21, X22, X23, [...]): Ions are used to measure the body's electrolytes. An imbalance between the numbers of cations and anions causes electrolyte disorders, which lead to various kinds of body damage.
Group 5. Liver function (X26, X28, X30, X31, X32, X33, X34, X35): Liver function examination generally includes protein metabolism function, bilirubin and bile acid metabolism function and serum enzyme indexes.
Tumor markers can be used for early detection, screening and differential diagnosis of tumors, and can also be used to monitor treatment efficacy, recurrence and prognosis.
The main function of the kidney is to secrete and excrete urine and toxins, regulate body fluid volume and water, and maintain the balance of the body's internal environment.
The pH value of a healthy person's blood is always maintained at a certain level. Once the acid-base balance is disturbed, acidosis or alkalosis will occur.
Group 9. Blood sugar (X47): The glucose in the blood is called blood sugar. The production and utilization of blood sugar are in a state of dynamic balance to meet the needs of the various organs and tissues in the body.
Group 10. Age (X48): Ovarian cancer has a certain relationship with age. It is most common among middle-aged and elderly women, but many young women may also suffer from ovarian cancer.

[...] [16] proposed group [...] where β = (β^(1), ..., β^(J)) is the whole coefficient vector, β^(j) is the j-th group coefficient vector, and d_j = dim(β^(j)) is the length of the j-th group. The j-th group LASSO estimator is given by [...] where [...] LASSO penalized logistic regression (GLASSO) to investigate ovarian cancer [17]. In this paper we introduce group logistic regression to study the relation between Y and X = (X^(1), ..., X^(11))^⊤, [...] where [...] is the conditional probability of ovarian benign tumors, β_0 is the intercept, β^(j) is the j-th group parameter vector and β = (β^(1), ..., β^(11)) is the whole unknown parameter vector. Then, the negative group log-likelihood function for group logistic regression is [...] The group LASSO penalized logistic log-likelihood is [...] where the tuning parameter λ ≥ 0 controls the penalty size. For a univariate argument Z, the univariate soft-thresholding operator is

$$S(Z,\lambda) = \begin{cases} Z-\lambda, & Z > \lambda, \\ 0, & |Z| \le \lambda, \\ Z+\lambda, & Z < -\lambda, \end{cases}$$

and for a vector-valued argument Z, the multivariate soft-thresholding operator is

$$S(Z,\lambda) = S(\|Z\|,\lambda)\,\frac{Z}{\|Z\|},$$

where Z/‖Z‖ is the unit vector in the direction of Z. In other words, S(Z, λ) acts on the vector Z by shortening it towards 0, and if the length of Z is less than λ, the vector is shortened all the way to 0.
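The behaviour of the two soft-thresholding operators described above can be illustrated with a minimal Python sketch. It assumes the standard definitions (shrink the magnitude by λ, floor it at zero, keep the sign or direction) and the Euclidean norm; the function names `soft_threshold` and `group_soft_threshold` are hypothetical and are not taken from the paper.

```python
import math


def soft_threshold(z, lam):
    """Univariate soft-thresholding: move z toward 0 by lam, stopping at 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def group_soft_threshold(z, lam):
    """Multivariate soft-thresholding (assumed Euclidean norm).

    Shrinks the length of the vector z by lam while keeping its direction;
    returns the zero vector when the length of z does not exceed lam.
    """
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= lam:  # also covers the all-zero vector when lam >= 0
        return [0.0] * len(z)
    scale = soft_threshold(norm, lam) / norm
    return [scale * v for v in z]


if __name__ == "__main__":
    print(soft_threshold(3.0, 1.0))                # scalar shrunk by 1
    print(group_soft_threshold([3.0, 4.0], 1.0))   # length 5 shrunk to length 4
    print(group_soft_threshold([0.3, 0.4], 1.0))   # length 0.5 < 1: whole group set to 0
```

The last call shows why the operator performs group selection: a whole group of coefficients is set to zero at once when its length falls below λ.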
[...], j = 1, ..., 11.

3.2 Group SCAD penalized logistic regression

Fan & Li (2001) [20] proposed the following SCAD penalty [...] where λ > 0 and γ > 2. Its first derivative with respect to the parameter vector β is [...] The group penalized log-likelihood for group SCAD penalized logistic regression (GSCAD-PLR) is [...] Similar to Algorithm 1, we apply the group coordinate descent algorithm for GSCAD-PLR and obtain the j-th group SCAD estimator [...]
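As an illustration of how the SCAD penalty behaves, the sketch below evaluates the penalty and its derivative in the commonly used piecewise form with λ > 0 and γ > 2. This form is an assumption based on the usual statement of the Fan and Li penalty, not a transcription of the paper's own equations, and the names `scad_penalty` and `scad_derivative` are hypothetical. In the group setting the argument would presumably be the norm of a group coefficient vector, which is also an assumption.

```python
def scad_penalty(theta, lam, gamma):
    """SCAD penalty at |theta| (assumed standard piecewise form, gamma > 2)."""
    t = abs(theta)
    if t <= lam:
        # LASSO-like linear part near zero
        return lam * t
    if t <= gamma * lam:
        # quadratic transition region
        return (2 * gamma * lam * t - t * t - lam * lam) / (2 * (gamma - 1))
    # constant beyond gamma * lam: large coefficients get no extra penalty
    return lam * lam * (gamma + 1) / 2


def scad_derivative(theta, lam, gamma):
    """Derivative of the penalty with respect to |theta| (for theta != 0)."""
    t = abs(theta)
    if t <= lam:
        return lam
    if t <= gamma * lam:
        return (gamma * lam - t) / (gamma - 1)
    return 0.0


if __name__ == "__main__":
    lam, gamma = 1.0, 4.0  # gamma = 4 is the group SCAD default quoted in the text
    for t in (0.5, 2.0, 4.0, 10.0):
        print(t, scad_penalty(t, lam, gamma), scad_derivative(t, lam, gamma))
```

The derivative falls from λ to 0 as the argument grows, which is why SCAD shrinks small coefficients like LASSO but leaves large ones nearly unpenalized.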

Two-class prediction accuracy evaluation
For a two-class problem, prediction performance is summarized by a two-class confusion matrix and the measures derived from it (accuracy, sensitivity, and specificity), [...]

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FN + FP}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

[...] The ROC curve can be drawn by plotting sensitivity against (1 − specificity) at different thresholds [...]

Then, we apply ten-fold cross-validation to select the optimal λ, and obtain the estimate of β at the optimal λ. The default γ is 4 for group SCAD and 3 for group MCP. The algorithm starts at λ_max and proceeds toward λ_min. When the objective function is strictly convex, the estimated coefficients vary continuously over λ ∈ [λ_min, λ_max] and produce a regularized solution path. [...] λ is selected by ten-fold cross-validation, and the optimal λ, the default γ and formulas (9), (13) and (17) are used to compute the group LASSO/SCAD/MCP estimators. Then, the test set is used to compute the confusion matrix, accuracy, sensitivity and specificity and to draw the ROC curves, so that the prediction accuracy can be compared. Fig. 1 shows the coefficient path diagrams selected by [...]

The optimal λ values selected by ten-fold cross-validation are shown in Fig. 2. The ordinate represents the cross-validation error, the abscissa represents log(λ), and the numbers above indicate the number of variables entered into the model at the corresponding λ value. Table 5 lists the optimal λ selected by ten-fold cross-validation on the ovarian tumor data set for group LASSO/SCAD/MCP [...]

From Table 5, we find that the optimal λ of group MCP penalized logistic regression is larger than those of group LASSO/SCAD penalized logistic regressions. Therefore, the penalization of group MCP is stronger: more coefficients are shrunk to 0 and fewer variable groups are selected. After determining the optimal λ by ten-fold cross-validation, we apply the group coordinate descent algorithm to obtain the group estimators for group LASSO [...] Table 6.
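The confusion-matrix measures used in this section can be computed as in the following minimal sketch. Accuracy and precision follow the formulas given in the text; sensitivity and specificity use their standard definitions, TP/(TP + FN) and TN/(TN + FP), which is an assumption because those formulas are not legible here. The function names and the toy labels are hypothetical and are not the paper's data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a two-class problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluation_measures(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity and specificity from the four counts.

    Assumes every denominator is non-zero (both classes present and at
    least one positive prediction).
    """
    return {
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print(counts)
    print(evaluation_measures(*counts))
```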

where c is a given threshold. For balanced data, c is generally taken as 0.5. For unbalanced data, the Youden index is widely used to select the optimal threshold (Raghavan, Ashour & Bailey, [...]).

To evaluate the prediction performance, we compare group LASSO/SCAD/MCP penalized logistic regressions with SVM and ANN. The confusion matrices and the prediction performance of the five methods are listed in Table 7 and [...]

Fig. 3. The ROC curves for the five methods.
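The threshold rule, the Youden index and the AUC discussed above can be illustrated with a short sketch. It assumes that an observation is assigned to class 1 when its predicted probability is at least c, that the Youden index is J = sensitivity + specificity − 1, and that the AUC is obtained from the empirical ROC points by the trapezoidal rule. The function names and the toy scores are hypothetical and do not reproduce the paper's results.

```python
def roc_points(y_true, scores):
    """Return (threshold, 1 - specificity, sensitivity) for each distinct score.

    An observation is predicted as class 1 when its score is >= threshold.
    """
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    points = []
    for c in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= c and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= c and t == 0)
        points.append((c, fp / neg, tp / pos))
    return points


def youden_threshold(y_true, scores):
    """Threshold maximizing J = sensitivity + specificity - 1 (first one on ties)."""
    best = max(roc_points(y_true, scores), key=lambda p: p[2] - p[1])
    return best[0]


def auc_trapezoid(y_true, scores):
    """Area under the empirical ROC curve, starting from the point (0, 0)."""
    curve = [(0.0, 0.0)] + [(fpr, tpr) for _, fpr, tpr in roc_points(y_true, scores)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Toy predicted probabilities for illustration only.
    y_true = [1, 1, 1, 0, 0, 1, 0, 0]
    scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.1]
    print(youden_threshold(y_true, scores))
    print(auc_trapezoid(y_true, scores))
```

Using every distinct score as a candidate threshold keeps the sketch simple; with tied scores a dedicated ROC routine from a statistics library would be a more careful choice.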
As shown in Fig. 3, the sensitivity and specificity of several models are consistent with [...]

Conflict of interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethics approval: This is an observational study and does not require ethics approval.