Establishment of prediction models for lung cancer NOG/PDX models: A guideline for machine learning in small biomedical datasets

Background: Targeted therapy and immune checkpoint inhibitors are the most promising treatments for lung cancers but still facing multiple challenges, including resistance and individual difference. Therefore, patient-derived tumor xenografts (PDX) models are developed for drug discovery and screening. NOG mice is under the destruction of the interleukin-2 (IL-2) receptor common gamma chain, which is appropriate for building PDX models to test immunotherapies. However, current studies have little understanding of the causes of genotype mismatches in PDX or NOG/PDX models, which leads to a massive economic and time loss. Methods: Lung cancer tissues from 53 patients were obtained and engrafted into NOG mice. All of the patients' tumors and NOG/PDX models were detected for common gene mutations. Seventeen clinicopathological features were organized and input to stepwise logistic regression based on the lowest Akaike information criterion (AIC), least absolute shrinkage and selection operator (LASSO)-logistic regression, support vector machine recursive feature elimination (SVM-RFE), eXtreme Gradient Boosting (XGBoost), Gradient Boosting & Categorical Features (CatBoost), and synthetic minority over-sampling technique (SMOTE). Finally, the performance of all models was evaluated by the accuracy, area under the receiver operating characteristic curve (AUC), and F1 score in 100 testing groups. Results: Fifty-three lung cancer NOG/PDX models were successfully established, with a genotype matching rate of 79.2% (42/53). Two multivariable logistic regressions revealed that age, the number of driver mutations, epidermal growth factor receptor (EGFR) gene mutations, the type of prior chemotherapy, prior tyrosine kinase inhibitors (TKIs) therapy, and the source were potent predictors. Moreover, CatBoost (mean accuracy=0.960; mean F1 and 8-feature SVM (mean accuracy=0.950; mean AUC=0.934; mean F1 score=0.903) showed the best performance compared with the other algorithms. Moreover, the combination of SMOTE with SVM signicantly improved the predictive


Background
Lung cancer is the one that causes most deaths in humans, the number of which exceeds one million each year worldwide [1]. Around 85% of lung cancers are non-small cell lung cancer (NSCLC), and small cell lung cancer (SCLC) accounts for 15% of lung cancers [2]. Recently, the chemotherapy-based paradigm in lung cancer patients has been shifted with the introduction of the driver genes and the advance of molecular detection technology [3,4], especially in those with epidermal growth factor receptor (EGFR) positive mutant [5,6] and rearrangements of the anaplastic large-cell lymphoma kinase (ALK) genes [7].
Nevertheless, targeted therapy is facing a series of di culties, including different individual responses and frequently acquired resistance [8,9]. Subsequently, immune checkpoint inhibitors (ICIs) are recommended as substitutional drugs of targeted therapy [10]. Therefore, the screening of immunotherapy for patients without driver gene mutations or response to targeted drugs is pivotal to refractory lung cancers.
To overcome these challenges, pre-clinical animal models are crucial for drug screening and e cacy evaluation to achieve precision medicine. Patient-derived tumor xenografts (PDX) have emerged as an accurate pre-clinical system capable of maintaining the molecular, genetic, and histopathologic heterogeneity of the parental tumors [11,12]. Moreover, a new generation of super immunode cient mice named NOG was considered as an excellent choice to build PDX models for cancer immunotherapy, with the destruction of the interleukin-2 (IL-2) receptor common gamma chain and functional de ciency of multiple immune cells such as T cells, B cells, natural killer cells (NK cells), macrophages, and dendritic cells [13]. Compared with nude mice and traditional severe combined immunode cient (SCID) mice, NOG mice have demonstrated great potential for researches on ICIs, adoptive T-cell therapy (ACT), and other immunotherapy since tumor-in ltrating T lymphocytes (TILs) can be serially transplanted into them after xenografts develop [14,15]. Nonetheless, the low success rate of PDX establishment (20-40%) [16][17][18] with the notably inconsistent rate of diver gene-mutations between established PDX and original tumors (10-20%) [19,20] is quite a worrisome problem, but not well studied. Since protocols of PDX models are time-consuming, laborintensive, and costly [21], the presence of inconsistency of driver gene mutations imparts a vast toll on researchers, physicians, and patients. Although multiple factors including gender, smoking history, pathology, TNM-stage, tumor grade, the quality of tumor samples, and EGFR gene mutations have been assumed to correlate with the success rate of the tumor engraftment [17,18,22], whether these factors contribute to the consistency of driver gene mutations PDX models, especially those established in NOG mice have not been validated. This study is aimed to perform machine learning (ML) algorithms including multivariable logistic regression (LR), support vector machine recursive feature elimination (SVM-RFE), gradient boosting decision tree (GBDT), and synthetic minority over-sampling technique (SMOTE) to establish a potent tool for predicting inconsistency of driver gene mutations between NOG/PDX models and patients' tumors ( Fig. 1).

Patient samples
Lung cancer tissues or cells from 53 patients were obtained from computed tomography (CT) guided percutaneous lung biopsy (CT-PLB), lymph node biopsy (LNB) or thoracentesis at Shanghai Pulmonary Hospital (Shanghai, China), between August 2018 and October 2019.

Preparations for Tissues samples
Harvested tissues from TBB, CT-PLB, and LNB were divided into three parts. The rst part was minced into fragments of 50-100 mm 3 and was immersed in frozen medium Bambanker (Nippon Genetics, Cat. No. BBH01), and then was kept in liquid nitrogen until implantation into immunode cient mice. The second part was immediately frozen in liquid nitrogen for DNA/RNA extraction. The third part was made into formalin-xed para n-embedded (FFPE) slides for pathologic assessment.

Preparations for the malignant pleural effusion (MPE)
The preparations and culture of MPE were conducted as previously described. Approximately 200-1000 ml of pleural effusion was extracted each time by thoracentesis. The samples were centrifuged at 450 G for 10 min and then were resuspended in PBS. The tumor cells were isolated from the interphase layer of samples by density gradient centrifugation using Ficoll-PaqueTM PLUS (GE Healthcare Bio-Sciences, Uppsala, Sweden). After being washed with PBS, the tumor cells were cultured in RPMI-1640 containing 10% fetal bovine serum (FBS) and 10 ng/ml epidermal growth factor (EGF) at a density of 1 to 2 × 10 6 cells per plate.

NOG/PDX establishment
All animal experiments in this study are following the guidelines of the Institutional Animal Care and Use Committee (IACUC). The PDX models were established in 6-8-week-old female NOG mice (NOD.Cg-Prkdc scid IL2rg tm1Sug /JicCrl) (Charles River, Beijing, China).
Frozen tissues were thawed at 37 °C and directly implanted into the sterilized skin on NOG mice subcutaneously (n = 4-5 for each tumor sample). Simultaneously, tumor cells isolated from MPE mentioned above were washed once in PBS and then injected 510 6 cells into the right ank of each NOG mice (n = 4-5 for each MPE sample).
The initial tumor-implanted NOG mice were maintained for 120 days and were measured once a week.
The tumor volumes (TV) were measured by the formula: TV= (length × width 2 )/2 (length was the longest diameter, while the width was the shortest diameter). The xenografted tumor was passaged when the tumor size reached around 700-800 mm 3 , and the PDX models utilized in this study were from the third passage (P3) to the fth passage (P5). The PDX tumors of each passage were again separated into three parts. The rst part was implanted into another NOG mouse for passaging. The second part was immediately frozen in liquid nitrogen for DNA/RNA extraction. The third part was made into FFPE slides for pathologic assessment.

DNA and RNA extractions
Lung cancer tissues and PDX tissues were pathologically reviewed to ensure that tumor cells were more than 80% and that no signi cant tumor necrosis had occurred before the DNA extraction. Genomic DNA was extracted from each tissue sample using QIAamp DNA Mini Kit (Qiagen-51306, Germany). The quantity and purity of DNA samples were measured using Nanodrop ND-1000 UV/VIS Spectrophotometer (Therm Scienti c, USA). DNA fragment integrity was con rmed by electrophoresis using 1% agarose gel.
The concentration of DNA samples was normalized to 20 ng/µL and stored at − 20 °C until use. Both 'Hot spot' mutations in EGFR (exon 18,19,20,21) and ALK fusions (EML4-ALK) were screened by ampli cation refractory mutation system (ARMS) and mutant-enriched liquid chip polymerase chain reaction (PCR) method.

Models' performance evaluation
To evaluate the performance of all prediction models in this study, we calculated the following indexes:

Statistical analysis
Stepwise LR based on the lowest AIC, logistic-LASSO regression, SVM-RFE, eXtreme Gradient Boosting (XGBoost), and Gradient Boosting & Categorical Features (CatBoost) were utilized for developing multivariate prediction models (detailed information was presented in the Supplementary Methods). Strati ed random sampling was performed to generate 100 training groups (containing seven nonmatching samples and 28 matching samples) and 100 testing groups (containing four non-matching samples and 14 matching samples). The pair-sample t-test was used to compare the performance among different models in testing groups. All of the data analysis in this study was performed by SPSS
All of the PDX models included in this study were all con rmed as the successful establishment (size reached around 700-800 mm 3 ) by pathologists, and the representative sections of PDX models with haemotoxylin and eosin (H&E) staining were shown in Fig. 1. The overall rate of the driver genes matching was 84% (42/50).

Model 1: Logistic Regression 1 Univariable analysis of factors associated with genotype mismatches
Logistic univariate regression analysis indicated that the risk factors for the inconsistency of driver genes mutations between PDX models and parental tumors were female, younger age, smoking history, acquisition from LNB or thoracentesis, NSCLC except for squamous cell carcinoma (SCC), EGFR mutations, more driver genes mutations, without prior chemotherapy, with prior chemotherapy of pemetrexed plus carboplatin, and prior TKIs therapy (Table 1).  To balance the prediction model's performance and complexity, we performed stepwise model selections by calculating the AIC. According to the univariable analysis, there are ten potential predictive features. Figure 2A showed the AIC values of each step in the backward stepwise LR, where ten predictive features were deleted one by one until AIC could not decrease any more. Generally, models excluding the number of driver genes presented the worst AIC, which indicated that the number of driver genes was a weighty predictor. Moreover, the best multivariable models selected by AIC was a ve-variable logistic regression including age, the number of driver mutations, type of prior chemotherapy, prior TKIs therapy, and the source.

Least absolute shrinkage and selection operator (LASSO)-logistic regression
We performed the LASSO regularization on the LR to improve the prediction accuracy and interpretability. Two features out of the ten features were screened out via a LASSO logistic regression combined with ten-fold cross-validation, where the optimal penalization coe cient λ valued as one standard error: EGFR mutations and the number of driver genes (Figs. 2B and 2C).
Model 2: Support vector machine recursive feature elimination (SVM-RFE) SVM-RFE begins with a complete feature set and eliminates the least important feature for classi cation in each iteration, according to the weight vector of dimension length. According to the rank of feature importance, which was visualized in Fig. 3, we rstly deleted the least important seven variables and then eliminated the remaining ten variables one by one to optimize the prediction accuracy. According to the mean predictive accuracy and F1 score in 100 testing groups by strati ed random sampling, the SVM model, including eight variables, maintained the best performance with the least complexity. As a result, the 8-feature SVM was the most optimal model among all SVM classi ers.

Model 3: Gradient Boosting Decision Tree (GBDT)
To implement GBDT, we performed two commonly used algorithms, eXtreme Gradient Boosting (XGBoost) and Gradient Boosting & Categorical Features (CatBoost). A large number of experiments indicated that multicollinearity among features did not hinder decision trees' predictive classi cations [23]. Therefore, we input all of the 17 features to XGBoost and CatBoost in this study. The rank of features based on XGBoost and CatBoost classi cation algorithms was also presented in Fig. 3. The representative structure of a decision tree generated by XGBoost and CatBoost was shown in Fig. 4B.
(2) The improvement upon the synthetic minority oversampling technique (SMOTE) application Synthetic minority oversampling technique (SMOTE) is a kind of oversampling method that increases the number of positive classes through randomly data replication to achieve an equal number with the negative class [24]. Herein we exerted SMOTE to add ten more positive samples into training groups to establish every model and then tested them in the original 100 testing groups (Table S2) (Fig. 5B). Among all the other algorithms, XGBoost was the only model that acquired an obvious advantage in AUC (0.935 vs. 0.908, P = 0.004; Fig. 5C), while the other models did not present any improvement (Table S3). Therefore, we exhibited an approach that can improve the performance of SVM in small uneven samples.

Discussion
Overall, this study initially developed a prediction model for the inconsistency of driver gene mutations between NOG/PDX models and patients' samples. A total of 53 lung cancer NOG/PDX models were successfully engrafted and excised, including 42 NOG-PDX models with matching driver gene mutations from parents' tumors and 11 NOG-PDX models with non-matching ones. To analyze this small unbalanced database, we performed three models of ve algorithms, including LR based on AIC, LASSO-LR, SVM, XGBoost, and CatBoost, which all present with an excellent predictive capability. According to the evaluation indexes in testing groups, CatBoost and SVM demonstrated the best performance of and modeling. Moreover, the application of SMOTE generally improved the performance of SVM based on the fundamental level.
LR illustrated what exactly determined the genotypes of NOG/PDX models LR is a widely ML technique in biomedical data analysis since it is reasonably easy to interpret with a clear demonstration of the positive or negative association of the variables with the predicted probability [25]. Therefore, the two multiple LR models revealed the critical predictor variables.
The formation and passaging of PDX models are dynamic events, where clonal and subclonal alternations frequently occur, especially when the development of P1 PDX models is slow, which gives adequate time for tumor cells to mutate for adaptation a new environment [26,27]. In addition to these cell-autonomous heterogeneities, the stromal heterogeneity in the tumor microenvironment (TME) is another critical reason for different driver genotype of PDX from parental tumors [12]. As expected, most of the predictive features from univariable analysis consist of the factors associated with xenografts engraftment, except for the pathology. It was reported that SCC was much more prone to be tumorigenic in nude mice compared to adenocarcinoma ADC [18], which is contrary to our conclusion that SCC is the most challenging type to establish NOG/PDX models of genetic matching. More CD8 + TILs were detected in the cancer nests of SCC than in non-SCC [28], which reveals that PDX models of SCC may lose more tumor stroma during the xenograft engraftment. Moreover, SCC was prone to carry signi cantly more clonal mutations than AC [29], which may also contribute to more clonal selections.
As for multivariable analysis, age, the number of driver mutations, EGFR mutations, the type of prior chemotherapy, prior TKIs therapy, and the source shown a signi cant role in the inconsistency of driver gene mutations. Although age accounted for a small weight in multivariable LR, we have not found an appropriate therapy to illustrate a younger age rather than an elder age is a risk factor for the inconsistency of driver gene matching [30]. However, this surprising factor suggests that the age of implemented mice might play a critical role in the establishment of PDX models. Most of the PDX models in current researches use 8-week-old mice rather than aged mice (> 8 months). Recent studies found that aging could dramatically alter the components of the tumor microenvironment [31], thereby the inconsistent age of mice from patients could be the potential reason why age becomes a predictive feature here. Another feature, source, also played a negative role in matching the genotype, different from that in tumor engraftments. Although uids source, including MPE and lymph, are proved to have a higher engraftment rate than the solid tumor tissues [32], we found that uid-derived tumor xenografts were more challenging to maintain driver genotypes from parental tumors.
The number of driver gene mutations, including clonal and subclonal mutations, are associated with intra-tumor heterogeneity, genomic instability, or chromosomal instability [33]. The largest coe cient of the number of the driver gene in the multivariable LR model also illustrates its absolute importance in developing non-patient-matched genotypes. Secondly, PDX models from EGFR mutant lung cancers were reported with poor histological differentiation, and frequent loss of EGFR mutations [34], which supported the high inconsistent risk of EGFR mutant NOG/PDX models in this study. Thirdly, the evidence that pemetrexed increased the number of TILs, and upregulated immune-related genes related to antigen presentation might support the conclusion that PDX models from patients receiving pemetrexed are less likely to maintain the original genotypes [35]. TKIs have been proved with the capability to alter the pulmonary TME, including increased CD8 + T cells and mononuclear myeloid-derived suppressor cells (M-MDSCs) (CD11b + Ly6 − G − Ly6C high ), and fewer Foxp3 + T regulatory cells (Tregs) and M2-like macrophages (CD206 + ) [36]. Also, the clonal selection is a frequent occurrence during TKIs therapy, resulting in TKI resistance [37]. Interestingly, we found that the factors promoting TILs were conducive to the stability of genotypes during the NOG/PDX models establishment, which needs further veri cation (Fig. 6).

SVM-RFE and GBDT provide a robust and straightforward classi er
Unlike LR, both SVM and GBDT are similar to "black boxes," which only shows the inputs and outputs without internal workings [38]. SVM and GBDT are considered with the reliable power for classi cation, less concern for over tting, and the ability to handle unbalanced data, which has been validated in this study. Thereby, when there is no need to explain the model in detail with an immediate requirement of building an accurate classi er, CatBoost, SVM, or SVM-SMOTE become a better choice for predicting the inconsistency of driver gene mutations with a signi cantly better performance.

ML for small biomedical unbalanced datasets
Recently, ML is a promising topic for predictive modeling in numerous areas, which enables prediction models to "learn" information systematically from initial data and adapt to each new data environment [39]. However, ML has not been widely performed in small sample databases (less than ten frequencies per predictor variable), which is a common characteristic in biomedical animal models with expensive costs and complicated techniques [40]. Ultimately, the ML algorithms we attempted to establish predictive tools for lung cancer NOG/PDX models demonstrated excellent performance, which not only provides a predictive tool to screen lung cancer patients for NOG/PDX models of precisive immunotherapy but also offers a general approach for building prediction models in small biomedical samples: (1) Select features to develop a multivariable model in all samples with standard ML algorithms, including stepwise LR based on AIC, LASSO-LR, SVM (or SVM-SMOTE), XGBoost, CatBoost, and et al.
(2) Perform strati ed random sampling to generate 100 training groups and testing groups to achieve stable performance.
(3) Formulate the predictive score or establish the predictive classi er in training groups.
(4) Evaluate the predictive model based on ROC, accuracy, and F1 score, in the corresponding testing groups to determine an optimal algorithm or modeling.
(5) Interpret the critical predictors for positive class by LR, and apply the optimal algorithm for the nal prediction.

Conclusions
In conclusion, we established a predictive model for the inconsistency of driver gene mutations between NOG/PDX models and patients' samples based on machine learning, which promises to improve the success rate of PDX establishment and reduce the massive economic loss. Further, the NOG mice we used in this study were considered an excellent choice to build PDX models for cancer immunotherapy, but not well studied. Therefore, the model we established has the potential for immunotherapy screening and development. What is more, machine learning has not been widely performed in small sample databases. Ultimately, the algorithms we explored in this study also offer a general approach for building prediction models in small biomedical samples.  Figure 1 The study design and protocol for establishing lung cancer NOG/PDX models and machine learning. In this study, we initially obtained lung cancer tissues via percutaneous lung biopsy (CT-PLB), lymph node biopsy (LNB) or thoracentesis, and then implanted all tissues into NOG mice. After successfully establishing 53 NOG/ patient-derived tumor xenografts (PDX) models, we took all PDX tissues and then performed haemotoxylin and eosin (H&E) staining and gene sequencing to con rm whether the genotypes of PDX models matched that of patients' tumors. Further, we input 17 clinicopathological Guo, Haoyue created this gure, and she approved it to be published in this paper.

Figure 3
Page 25/28 The rank of importance top ten variables based on support vector machine (SVM), eXtreme Gradient Boosting (XGBoost), and Gradient Boosting & Categorical Features (CatBoost). The chart showed the most critical ten variables according to the above three algorithms, where the same color represented the same ranking. Note: Dr. Guo, Haoyue created this gure, and she approved it to be published in this paper.

Figure 4
The modeling precess of support vector machine (SVM), and gradient boosting decision tree (GBDT). (A) The mean predictive accuracy of SMV based on the different number of variables in 100 testing groups. After inputting the test sample into each CART tree, we can get the predicted score of each sample at the leaf node. After weighing the total score of 100 trees, we can get the overall score of each sample and the corresponding classi cation. Note: Dr. Guo, Haoyue created this gure, and she approved it to be published in this paper. P<0.05, **: P<0.01, ***: P<0.001, based on the paired-sample t-test. Note: Dr. Guo, Haoyue created this gure, and she approved it to be published in this paper.

Figure 6
The association between the factor contributing to the inconsistency of driver gene mutation and the tumor microenvironment (TME). According to the univariable and multivariable logistic regression, squamous cell carcinoma, application of pemetrexed, and prior tyrosine kinase inhibitors therapy were risk factors for the non-matching genotypes between patient-derived tumor xenografts (PDX) models and parental tumors. All of the above three factors raised tumor-in ltrating T lymphocytes. Moreover, tyrosine kinase inhibitors (TKIs) could also decrease FoxP3+ T regulatory cells (Tregs), mononuclear myeloidderived suppressor cells (CD11b+Ly6-G-Ly6Chigh), and M2-like tumor-associated macrophages (CD206+) in the TME. Note: Dr. Guo, Haoyue created this gure, and she approved it to be published in this paper.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. SupplementaryMaterial.docx