Estimation of 18F-FDG PET Image Texture Features for Metastasis Prediction in Non-Small Cell Lung Cancer Using Epithelial Mesenchymal Transition-Related Genes

Purpose: The aim of this study was to identify an image factor for predicting metastasis in non-small cell lung cancer (NSCLC) by correlating gene expression levels from next-generation sequencing with fluorine-18-2-fluoro-2-deoxy-D-glucose positron emission tomography (18F-FDG PET) image features. Methods: RNA-sequencing data and 18F-FDG PET images of 63 patients with NSCLC (29 with metastasis and 34 without) from The Cancer Imaging Archive and The Cancer Genome Atlas Program databases were used in a combined analysis. Weighted correlation network analysis (WGCNA) was performed to identify gene groups related to metastasis. The module with the highest module significance was selected. Genes were selected on the basis of metastasis-related function and a high AUC (AUC > 0.6). A total of 47 image features were extracted from the PET images as radiomics. The relationship between gene expression and image features was calculated using a hypergeometric distribution test with the Pearson correlation method. The metastasis prediction model was validated with a random forest algorithm using image texture features related to gene expression. Results: WGCNA identified 36 modules by gene expression pattern, and the module with the highest module significance was selected. Seven genes from the selected module were found to be involved in the epithelial mesenchymal transition (EMT) pathway, which has an important role in cancer metastasis, and had high AUC values. The expression of these genes was also related to a quantitative image feature (GLCM_contrast, -log10 P-value: 2.45~3.89). The model combining EMT-related genes and GLCM_contrast achieved an accuracy of 0.856 ± 0.06 and an AUC of 0.868 ± 0.05, and the model using the GLCM_contrast image texture feature alone achieved an accuracy of 0.842 ± 0.06 and an AUC of 0.838 ± 0.09. Conclusion: The GLCM_contrast image texture feature is related to EMT-related gene expression. We developed a model for predicting metastasis of non-small cell lung cancer using 18F-FDG PET image features and evaluated its accuracy.


Introduction
Non-small cell lung cancer (NSCLC) is among the most frequently occurring cancers and shows large molecular heterogeneity in tissues [1,2]. This molecular heterogeneity differs between patients and between intratumor and intertumor regions [3]. Intratumor heterogeneity is known to be linked to the development of primary tumors and metastases [4]. Cancer can be diagnosed by analyzing intracellular gene expression events, and a suitable treatment method can be found for each cancer [5]. Many studies have searched for methods to diagnose cancers with different genotypes and to find a treatment for each cancer: image features that analyze phenotypes based on genotype, next-generation sequencing (NGS) for large-scale gene analysis, and radiogenomics, which uses fluorine-18-2-fluoro-2-deoxy-D-glucose positron emission tomography (18F-FDG PET) image features and NGS in combination.
NGS is a high-throughput sequencing method that can accurately quantify large amounts of gene information compared to conventional gene analysis methods [6]. In the past, gene expression was characterized one gene at a time by electrophoresis after PCR, a time-consuming and expensive procedure that was limited by sample amount. Recently, advances in NGS technology have made it possible to analyze total RNA in single cells. Studies of genes involved in NSCLC metastasis have also been conducted using NGS, and genes that play important roles in metastasis, such as EGFR, have been identified [7]. However, this method has some disadvantages: sequencing is time-consuming, biopsies are invasive and painful, and genes are identified from the sampled tissue but not necessarily from the entire tissue [8]. Classical imaging techniques use radiation to image the affected area without causing pain to the patient, capture the overall characteristics of the affected area, and allow quick analysis [9], but they show only the cancer phenotype. Radiogenomics combines image feature analysis with NGS-based large-scale gene analysis to reveal the relationship between the expression of specific cancer-related genes and image features. By combining the two analysis methods, diagnosis and prediction of cancer without any invasive method becomes possible [10]. 18F-FDG PET/CT has the advantage of evaluating metabolic processes in cancer. It is an advanced technique compared to CT, a traditional imaging technique: 18F-FDG is taken up during glucose metabolism, so glucose metabolism can be estimated by imaging the FDG remaining in the cell.
Glucose uptake and the FDG concentration remaining in the cells differ depending on the degree of cancer progression. Because the residual FDG concentration is low in early cancer and increases as the cancer progresses, the degree of cancer progression can be evaluated through FDG imaging. This method is also suitable for evaluating cancer metastasis [11]. 18F-FDG PET/CT imaging has been used to predict the chemotherapy response after treatment with an anticancer drug in NSCLC [12]. In other studies, 18F-FDG PET/CT imaging has been used to predict survival in NSCLC by analyzing image features [13].
Radiogenomics has recently been applied to NSCLC. Research on gene expression specific to NSCLC has already been conducted, and it is well known that the EGFR gene plays an important role in metastasis when it is mutated [14]. A recent study has shown that 18F-FDG PET/CT image features are correlated with EGFR mutation status in NSCLC [15]. In that study, patient DNA was collected to distinguish patients with EGFR mutations, and image features of CT images were analyzed to determine whether the features (SUVmax, SUVmean, and SUVpeak) were related to the EGFR mutation. A metastasis prediction model was estimated from these results. In another study, mRNA extracted from NSCLC tissues was analyzed by NGS to find metagenes, and correlations between the NGS results and CT image features were examined [16].
The relationship between the expressed metagenes and image features and their roles in cancer cell proliferation were studied. Epithelial mesenchymal transition (EMT) plays a central role in cancer metastasis. In NSCLC cells, activation of EMT induces cell migration, proliferation, and invasion [17].
In this study, we estimated the correlation between the expression of genes involved in NSCLC metastasis and quantitative 18F-FDG PET image texture features. An NSCLC metastasis prediction model was then developed using image texture features related to gene expression.

Materials and Methods
NSCLC NGS data processing
RNA-sequencing data, clinical data of patients, and 18F-FDG PET images were downloaded from the TCIA/TCGA database (NGS data accession number: GSE103584, PET image data: http://doi.org/10.7937/K9/TCIA.2017.7hs46erv). Patient data were classified in a binary manner into metastasis (n = 29) and non-metastasis (n = 34) groups based on clinical data and 18F-FDG PET images. The classification was performed with reference to clinical data from TCGA. Patients in the N1 and N2 stages were placed in the metastasis group, and those in the N0 stage were placed in the non-metastasis group. Patient information is summarized in Table 1. Downloaded data were normalized as FPKM. Genes with zero FPKM values in all samples were removed to speed up the analysis [18]. For differentially expressed gene (DEG) analysis, the DESeq2 package in R was used [19]. The metastasis and non-metastasis groups were used as the input groups.
DEG analysis results were visualized as volcano plots with ggplot in R [20].

Weighted gene co-expression networks and modules associated with clinical traits
To analyze the correlation between expressed genes and features extracted from images, gene selection was conducted first. A total of 22,125 genes were analyzed by DEG analysis, and only those genes with significant differences were selected [21]. To obtain the gene module with the greatest influence on metastasis, WGCNA was performed [22]. The genes were separated into several modules using the WGCNA package in R. A soft threshold for network construction was selected for gene clustering. With a soft threshold, the adjacency matrix takes a continuous range of values between 0 and 1. The constructed network conforms to a power-law distribution and is closer to a real biological network state. A scale-free network was constructed using the blockwiseModules function, followed by module partition analysis to identify gene co-expression modules, which group genes with similar expression patterns. The modules were defined by cutting the clustering tree into branches using a dynamic tree cutting algorithm and were assigned different colors for visualization [23]. The module eigengene (ME) of each module was calculated. The ME represents the expression level of each module. The correlation between the ME and clinical traits was calculated for each module. Finally, the gene significance (GS), which represents the correlation between genes and samples, was calculated. Genes from the selected modules with a GS value of 0.8 or more and a P-value of 0.05 or less were selected [24]. The AUC value of each gene was calculated, and genes with high AUC values (AUC > 0.6) were selected for the correlation assays.
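The soft-threshold step can be illustrated with a minimal sketch. It assumes the common unsigned form, in which the adjacency between two genes is the absolute Pearson correlation raised to a power beta. The power used in this study is not stated, so `beta=6` and the toy expression values below are hypothetical, and the function names are illustrative rather than part of the WGCNA package.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(expr, beta):
    """Unsigned adjacency a_ij = |cor(gene_i, gene_j)| ** beta.

    expr: dict mapping gene name -> list of expression values (one per sample).
    beta: soft-thresholding power (an assumption; chosen by the caller).
    """
    genes = list(expr)
    adj = {g: {} for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i:]:
            a = 1.0 if gi == gj else abs(pearson(expr[gi], expr[gj])) ** beta
            adj[gi][gj] = adj[gj][gi] = a
    return adj

# Toy expression profiles (made-up numbers, four samples per gene).
toy_expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],  # nearly proportional to geneA
    "geneC": [3.0, 1.0, 4.0, 1.5],  # weakly related to geneA
}
adjacency = soft_threshold_adjacency(toy_expr, beta=6)
```

Raising the correlation to a power keeps every value between 0 and 1 while pushing weak correlations toward zero. This is why a soft threshold yields a continuous adjacency matrix instead of a hard present/absent edge.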

Functional and pathway enrichment analyses of selected modules
Genes from the selected modules were used for functional analysis. DAVID 6.8 [25] was used for Gene Ontology (GO) term analysis across the biological process (BP), molecular function (MF), and cellular component (CC) categories [26] in each module. A P-value < 0.05 was used as the threshold for identifying significant GO terms and pathways. GO terms were visualized using the REVIGO web tool [27].

18F-FDG PET imaging
Tumor volumes were segmented, and radiomics features in the defined tumors were subsequently extracted using the Local Image Features Extraction (LIFEx) version 4.0 software package [28]. The tumor region was drawn in three-dimensional (3D) images using a semi-automated segmentation method with a threshold SUV of 2.0, based on our previous report [29]. In the segmented tumors, SUVmax, SUVmean, SUVpeak, metabolic tumor volume (MTV), total lesion glycolysis (TLG), and shape and histogram features were calculated as the first-order features. For texture feature calculation, the intensity levels were resampled to 64 discrete values between 0 and 20 SUV, corresponding to a sampling bin width of 0.3125 SUV [30,31]. Spatial resampling was 4.1 mm (X-direction), 4.1 mm (Y-direction), and 2.5 mm (Z-direction) in Cartesian coordinates [14]. Texture features were assessed using four texture matrices: the co-occurrence matrix (CM), gray-level run length matrix (GRLM), gray-level zone length matrix (GZLM), and neighborhood gray-level difference matrix (NGLDM). The CM was calculated in 13 directions with a distance of one voxel between neighboring voxels, and each texture feature calculated from this matrix was the average of the feature over the 13 directions in space (X, Y, Z). The GRLM was also calculated for 13 directions via a similar method, whereas the GZLM was computed directly in 3D. The NGLDM was computed from the difference in gray levels between one voxel and its 26 neighbors in 3D, and each texture feature was calculated from this matrix [32]. A total of 47 features were extracted from the PET image data.
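The intensity resampling and the co-occurrence contrast feature can be sketched as follows. This is an illustration under stated assumptions, not the LIFEx implementation: the binning rule (floor division with clamping), the sparse voxel dictionary, and the choice to skip directions that contain no voxel pairs are all assumptions, and the toy volumes are made up. Contrast is taken in its usual form, the sum over gray-level pairs (i, j) of (i - j)^2 weighted by the normalized co-occurrence frequency.

```python
from itertools import product

N_BINS = 64
SUV_MAX = 20.0
BIN_WIDTH = SUV_MAX / N_BINS  # 0.3125 SUV, as stated in the text

def discretize(suv):
    """Map an SUV to one of 64 gray levels (0..63), clamping out-of-range values."""
    level = int(suv // BIN_WIDTH)
    return max(0, min(N_BINS - 1, level))

# The 13 unique directions of a 26-neighborhood: one offset from each +/- pair.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]

def glcm_contrast(voxels):
    """Average co-occurrence contrast over 13 directions at one-voxel distance.

    voxels: dict mapping (x, y, z) -> gray level for voxels inside the tumor mask.
    For one direction, contrast = sum_{i,j} (i - j)^2 * p(i, j), which equals the
    mean squared gray-level difference over all voxel pairs in that direction.
    """
    per_direction = []
    for dx, dy, dz in DIRECTIONS:
        total_pairs = 0
        weighted = 0
        for (x, y, z), gi in voxels.items():
            gj = voxels.get((x + dx, y + dy, z + dz))
            if gj is not None:
                total_pairs += 1
                weighted += (gi - gj) ** 2
        if total_pairs:  # assumption: directions without pairs are skipped
            per_direction.append(weighted / total_pairs)
    return sum(per_direction) / len(per_direction) if per_direction else 0.0

# Toy single-slice volumes: a uniform patch and a checkerboard patch.
flat = {(x, y, 0): discretize(5.0) for x in range(2) for y in range(2)}
checker = {(0, 0, 0): 0, (1, 0, 0): 2, (0, 1, 0): 2, (1, 1, 0): 0}
flat_contrast = glcm_contrast(flat)
checker_contrast = glcm_contrast(checker)
```

A uniform patch gives zero contrast, whereas the checkerboard gives a positive value. This matches the intuition that contrast grows with local gray-level variation inside the tumor.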

Hub gene and image feature correlation
A total of 47 image features and 145 genes were used to estimate the relationship between all pairs of factors, which was calculated using a hypergeometric distribution test with the Pearson correlation method. The hypergeometric P-value was calculated using the equation $p = \binom{k}{x}\binom{N-k}{n-x} / \binom{N}{n}$, where N is the number of total genes in the genome, k is the number of expression values identified in gene expression, n is the expression value of features identified in the images, x is the number of overlapping genes, and $\binom{k}{x}$ is the number of possible combinations of genes and image features [33]. The image features and genes used to estimate the metastasis prediction model were selected by the P-value of the correlation (P-value < 0.05). The selected image features were compared with image values that are generally used for validation of radiogenomics.
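As a worked illustration of the hypergeometric calculation, the sketch below evaluates the standard hypergeometric probability for the symbols N, k, n, and x defined above. The upper-tail helper is an added assumption about how an overlap at least as large as the observed one would be scored, and the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(X = x) = C(k, x) * C(N - k, n - x) / C(N, n)."""
    if x < 0 or x > k or n - x < 0 or n - x > N - k:
        return 0.0  # overlap of this size is impossible
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def hypergeom_upper_tail(x, N, k, n):
    """P(X >= x): probability of an overlap at least as large as observed."""
    return sum(hypergeom_pmf(i, N, k, n) for i in range(x, min(k, n) + 1))

# Small made-up example: N = 10 items, k = 3 marked, n = 2 drawn, x = 1 overlap.
p_exact = hypergeom_pmf(1, 10, 3, 2)         # 3 * 7 / 45
p_tail = hypergeom_upper_tail(1, 10, 3, 2)   # (21 + 3) / 45
```

With these toy numbers, 21 of the 45 possible draws contain exactly one marked item and 24 contain at least one. The same functions apply unchanged to genome-scale values of N because `math.comb` uses exact integer arithmetic.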

Evaluation of the metastasis prediction model
To predict each patient's outcome in terms of metastasis, we used a machine learning approach [34] called random forest (RF) [35]. The prediction model was evaluated on test data using the accuracy, precision, and recall scores. Prediction was performed 10 times to obtain an average value [36]. Six models were evaluated with the random forest algorithm: a radiomics-only model (47 features), an EMT-related gene model (145 genes), a first-order histogram model (15 features), a texture model (32 features), a combined EMT-related gene (145) and radiomics (47) model, and a GLCM_contrast model.
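The repeated evaluation can be sketched as a loop that splits the data, fits a model, scores it on the held-out part, and averages the scores over 10 runs. To stay dependency-free, the sketch uses a single-feature threshold rule as a stand-in for the random forest. The split procedure, the 30% test fraction, the seed, and the toy data are all assumptions and do not come from the study.

```python
import random
from statistics import mean, stdev

def scores(y_true, y_pred):
    """Accuracy, precision, and recall for binary labels (1 = metastasis)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

def fit_threshold(train):
    """Placeholder model (NOT a random forest): midpoint of the two class means."""
    pos = [v for v, label in train if label == 1]
    neg = [v for v, label in train if label == 0]
    return (mean(pos) + mean(neg)) / 2

def repeated_holdout(data, n_repeats=10, test_fraction=0.3, seed=0):
    """Repeat a random train/test split and report (mean, SD) of each score."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        threshold = fit_threshold(train)
        y_true = [label for _, label in test]
        y_pred = [1 if v > threshold else 0 for v, _ in test]
        runs.append(scores(y_true, y_pred))
    names = ("accuracy", "precision", "recall")
    return {name: (mean(col), stdev(col)) for name, col in zip(names, zip(*runs))}

# Made-up single-feature data: (feature value, label), with overlapping classes.
toy_data = [(float(v), 1) for v in range(6, 16)] + [(float(v), 0) for v in range(1, 11)]
summary = repeated_holdout(toy_data)
```

Reporting the mean together with the standard deviation over the repeats follows the "value ± spread" form in which the accuracy and AUC figures are given in the abstract.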

Results
In this study, 18F-FDG PET data and RNA-sequencing data from 63 patients with NSCLC were used for analysis. The average age of the patients was 67.5 years, and the ratio of men to women was approximately 8:2 (Table 1). The workflow for relating the RNA-sequencing data to the 18F-FDG PET image features is shown schematically in Fig. 1.

Gene modulation and hub gene assay
To search for hub genes with an important role in metastasis, WGCNA was first used to construct gene modules with similar expression patterns, and a network analysis was then performed to search for hub genes. A total of 36 gene modules were obtained (Fig. 2). The module with the highest significance in the metastasis group was selected. To confirm the function of the gene module, GO term analysis was performed. A total of 145 genes were selected as EMT-related genes with high GS scores (GS > 0.8) and high AUC values (AUC > 0.6).
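The GS > 0.8 and AUC > 0.6 filter can be sketched as below. The per-gene AUC is computed with the Mann-Whitney formulation, that is, the fraction of metastasis/non-metastasis sample pairs in which the metastasis sample has the higher expression. How the study computed AUC is not stated, so this is one plausible reading. The gene names, expression values, and GS values are invented for illustration.

```python
def auc(values, labels):
    """AUC via the Mann-Whitney statistic; tied pairs count as one half."""
    pos = [v for v, lab in zip(values, labels) if lab == 1]
    neg = [v for v, lab in zip(values, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def select_genes(expr, labels, gs, gs_min=0.8, auc_min=0.6):
    """Keep genes whose gene significance and AUC both exceed the thresholds."""
    return [
        gene
        for gene, values in expr.items()
        if gs[gene] > gs_min and auc(values, labels) > auc_min
    ]

# Made-up data: 1 = metastasis, 0 = non-metastasis.
labels = [1, 1, 1, 0, 0, 0]
toy_expr = {
    "geneX": [5, 6, 7, 1, 2, 3],  # separates the groups perfectly
    "geneY": [1, 5, 3, 2, 4, 6],  # poor separation
    "geneZ": [4, 5, 6, 1, 2, 3],  # separates well but has a low GS below
}
toy_gs = {"geneX": 0.85, "geneY": 0.90, "geneZ": 0.50}  # hypothetical GS values
selected = select_genes(toy_expr, labels, toy_gs)
```

In this toy case only geneX passes both filters: geneY fails the AUC threshold and geneZ fails the GS threshold.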

Hub gene and image feature associations
To determine the relationship between hub genes in the gene modules and the feature values extracted from the images, a correlation analysis was performed using the rcorr function of the Hmisc package in R. The analysis was performed using 47 radiomics features and 145 EMT-related genes, and the relationships between image feature values and gene expression levels were obtained. Among these relationships, the 50 genes with the strongest overall relationships were selected and visualized as a heatmap (Fig. 3). The results show that one image feature (GLCM_contrast) was significantly related (P-value < 0.05) to the expression of seven genes (NME1.NME2, LST1, KAT7, BMX, CLIC1, KANSL2, and UFL1) (Table 2).

Discussion
EMT is an evolutionarily conserved process in which epithelial cells are converted to mesenchymal cells. EMT was found in a study on the development of embryonic stem cells. EMT is a major activity during embryonic stem cell development, gastrulation, neural crest formation, and development of the heart and other tissues and organs [37]. Recent studies have shown that EMT is also implicated in cancer progression and metastasis. Studies on breast cancer metastasis suggest that EMT is also involved in the acquisition of characteristics of cancer stem-like cells (CSCs) [38]. CSCs are cancer cells that share the embryonic stem cell characteristics of self-renewal, regeneration, and differentiation into diverse types of cancer cells. CSCs are thought to be crucial for the initiation and maintenance of tumors as well as their metastasis [39]. Many studies using NGS for NSCLC have been performed because NGS can determine the molecular characteristics of the cancer state for diagnosis or treatment [40]. NGS can analyze gene expression levels quickly and at a large scale compared to conventional gene analysis methods. However, one limitation is that biopsies are needed for sampling, which is not possible in all cancer cases because of the cancer location [41]. Another limitation is representativeness [42]. Cancer tissues are highly heterogeneous, so biopsy samples cannot represent all cancer regions. To overcome this limitation, image features had to be introduced into the analysis.
PET/CT images have recently become a popular research topic for the diagnosis of NSCLC. Features extracted from the images are used for analysis. Each feature represents a cell status such as cell shape, cell surface texture, or cell density. These features are digitized for cancer analysis using a mathematical method [43]. Many studies have been published on the possibility of tumor classification by analysis of 18F-FDG PET/CT texture features. With the development of 18F-FDG PET/CT imaging technology and of techniques for analyzing digitized image features, information on cell activity can be obtained [31]. A limitation of the PET/CT imaging method is the limited information that image analysis provides. Imaging factors of cells or tissues can only provide information on cell morphology and the texture of the cell surface. Some cancers with a unique phenotype can be diagnosed, but accurate diagnosis from the phenotype is not possible for most cancers because the phenotype cannot represent the genotype [44].
Recently, a combination of two analysis methods, NGS and PET/CT imaging, has been studied to overcome the limitations of each. The prediction and diagnosis of lung cancer metastasis are critical for patients because lung cancer shows no symptoms or pain until the late stages, when it has spread to other organs, so it has a high probability of being at a late stage when diagnosed [45]. A composite diagnosis method based on genes and images has the advantage of being noninvasive [46] and fast compared to existing diagnostic methods, and it can also assess the cancer as a whole. In terms of genetic analysis, two methods were used to reduce the number of genes used for analysis. The first was to select genes with significant differences between the two groups using a t-test [47], and the second was to use the hub gene assay to select genes with the desired functions. The t-test made the analysis more efficient by removing genes that did not differ significantly between the groups [47]. Genes were divided into modules according to their expression pattern through WGCNA, and each module was assigned a significance value according to its contribution. The module with the highest gene significance was selected. A total of 145 genes were identified as EMT-related genes from the selected module (GS > 0.8 and AUC > 0.6). The hypergeometric distribution method [48] was used to identify which EMT-related genes are associated with image features extracted from the images. The relevance of image features and genes was calculated as a P-value and listed in ascending order. P-values greater than 0.05 were excluded. The expression level of each gene was compared between patients with and without metastasis to identify differences between the two conditions. A total of seven genes were identified as having a strong relationship with one radiomics feature: GLCM_contrast.
The seven identified genes, NME1.NME2, LST1, KAT7, BMX, CLIC1, TAP2, and PSMB9, are known to be involved in EMT. Bone marrow X-linked kinase (BMX) has been reported to be involved in EMT-related processes such as cell growth, transformation, migration, survival, apoptosis, and tumorigenicity [49-52]. Nucleoside diphosphate kinase A (NME1) and nucleoside diphosphate kinase B (NME2) form the complex unit NM23 (NME1.NME2) and have nucleoside diphosphate kinase activity, which catalyzes the phosphorylation of nucleoside diphosphates to the corresponding nucleoside triphosphates. NME1.NME2 is the first metastasis suppressor in lung cancer. A decrease in NME1.NME2 increases cancer metastasis [53]. The function and mechanism of leukocyte-specific transcript 1 protein (LST1) have not been well studied, but high expression of LST1 in metastasized lung cancer has been reported [54]. Chloride intracellular channel 1 (CLIC1) is associated with the activity of the antiangiogenic peptide CLT1 on proliferating endothelial cells [55]. CLIC1 is mainly overexpressed in the tumor vasculature, and overexpression has been observed in breast, lung, and liver cancer patients [56,57]. CLIC1 has been shown to promote invasion and proliferation of tumor and endothelial cells, but the underlying mechanism is unclear [58]. Transporter associated with antigen processing 1 (TAP1) regulates WISP2, which can affect TGF-β signaling. TGF-β signaling plays one of the most important roles in EMT in breast cancer [59]. Proteasome subunit beta type-9 (PSMB9) is co-expressed with RARRES3 and is a well-known metastasis suppressor in breast cancer cells [60].
GLCM_contrast is a texture feature calculated by the LIFEx image analysis tool. In general, features such as SUVmax, SUVpeak, TLG, and ENTROPY have been used in radiogenomics analysis for cancer prediction or cancer metastasis prediction [61]. However, in this study, the correlation (P-value) of SUVmax, SUVpeak, TLG, and ENTROPY was lower than that of GLCM_contrast. This result shows that new factors such as GLCM_contrast can be used to develop a model for predicting metastasis of NSCLC using radiogenomics. One limitation of our study is that, although we provide evidence that EMT-related genes are related to GLCM_contrast in NSCLC, we do not provide mechanistic studies. While this was not the goal of this study, future investigations could be directed toward uncovering the mechanisms of the genes that play an important role in NSCLC metastasis and toward elucidating their correlation with imaging features. Large-scale follow-up studies of the molecular mechanism of metastasis in NSCLC could strengthen the study and further confirm and extend our findings. In addition, because this study was able to identify radiomics features related to EMT genes, it should be possible to search for imaging biomarkers for diagnosis and prognosis by analyzing genetic functions related to other cancers or diseases.

Conclusion
In this study, we confirmed through RNA-sequencing analysis that the gene group involved in NSCLC metastasis was related to EMT function. The expression of this gene group was related to an image texture feature, GLCM_contrast. The random forest prediction models built from the EMT-related gene group combined with GLCM_contrast, and from GLCM_contrast alone, both showed high accuracy. These results reveal the possibility of a prediction model for NSCLC metastasis that uses image texture features related to gene expression.