Data source and downloads
We obtained whole slide histopathological images of colon adenocarcinoma (COAD) patients from The Cancer Imaging Archive (TCIA, http://cancerimagingarchive.net/), a public cancer image archive repository. TCIA database collects, provides and manages affluent oncology image data supported by 28 agencies, and can provide researchers with publicly available cancer imaging data and unique imaging resources.[21, 22]
218 whole slide histopathological images were downloaded from TCIA. The histopathological tissue slides are all formalin-fixed and paraffin-embedded to preserve cell morphology as much as possible so that they are suitable for image feature recognition.
The data of mRNA expression levels and clinical information for these COAD patients were downloaded from The Cancer Genome Atlas database (TCGA, https://portal.gdc.cancer.gov/). The Cancer Genome Atlas (TCGA) database is one of the largest and richest public funded projects designed to build a comprehensive genetic map of cancer genome.[23]
For mRNA sequencing data, we totally obtained 478 COAD samples from TCGA. And plenty of clinical information for corresponding patients has been downloaded at the same time, which ultimately contains 459 COAD samples. We applied R package DESeq2 to normalize the mRNA sequencing data.
Extraction of histopathological imaging features
In order to extract imaging features from whole slide histopathological images obtained, our process for dealing with these images consists of three steps. Firstly, since the size of each pathological image is so large that it is inconvenient to use them directly for analysis and feature extraction, we cropped each image evenly into several sub-images of 1000 × 1000 pixels and saved them in tiff image format using the Openslide Python library.[24] In this process, sub-images containing more than 50% white margin are get rid of. To eliminate sample selection bias and reduce computing amount, we randomly select 20 representative image files from the remaining sub-images for the next step. Cropping and randomly selecting images are widely used methods in studies with whole slide image processing.[9, 16, 19]
Secondly, we applied CellProfiler[25] to extract features from each sub-image. CellProfiler is an open-source modular analysis software that can process cell images. It can measure a number of features, including size, shape, intensity, and texture, for each identified cell or subcellular region. The results of hematoxylin-eosin staining make the histopathological images appear different colors. By converting image color into grayscale, cell and tissue features can be extracted.
A total of 656 features for each sub-image were output during the preliminary extraction process. These features are different from well-known classic pathological characteristics such as cellular basophilic, eosinophilic, nuclear atypia and mitotic counts and not able to be recognized by pathologists through the naked eyes. After further removing irrelevant features such as file sizes and execution information, there are 590 features of each sub-image left for the following workflow.
Thirdly, Calculate the average value of 590 features extracted from the representative sub-images in the above steps and regard it as the average feature value of each corresponding slide. When there are more than one slides of a subject, the mean values over those slides are further figured up.
It should be emphasized that the purpose of our study is not to make specific interpretations of the relationship between these imaging features and COAD, but to quest the optimal combination of features to establish an independent prognostic model of COAD. Therefore, the lack of definite biological interpretations does not prevent us from making further reasonable analysis.
Acquisition of prognosis-related features
The least absolute shrinkage and selection operator (LASSO) gets a relatively refined model by constructing a penalty function, and compresses the insignificant variable coefficient to 0 to achieve the effect of variable selection. By customizing the value of the parameter lambda (λ), the user controls the balance between the sparsity (how many features are produced) and high prediction accuracy. Support vector machine recursive feature elimination (SVM-RFE) is another machine learning method based on support vector machine, which is devoted to find the best variables group by sorting of SVM-generated eigenvectors and iteratively eliminating of the minimum features until all features are removed. In order to figure out the relationship between imaging features and prognosis of COAD, SVM-RFE and LASSO logistic regression were applied to filtrate the pathological prognostic features over imaging features obtained from CellProfiler. SVM-RFE and LASSO analysis was realized by using R version 3.6.3 software. When running 5-fold cross-validation SVM-RFE, feature selecting was performed by defining high risk (patients survival time less than 12 months) and low risk (patients surviving for more than 60 months) as training samples. In SVM-RFE model, the maximal cross-validated accuracy is adopted as the evaluation index to confirm the optimal feature subset related with prognosis. The optimal subset of features obtained by SVM-RFE was intersects with the results of LASSO regression to obtain the pathological features most relevant to prognosis.
Co-expression gene module analysis
WGCNA,[26] fully called Weighted gene co-expression network analysis, can identify the set of co-expressed genes, which is called module. Moreover, correlation analysis can be conducted between modules and phenotype data to explore potential Mark genes. The WGCNA method was applied to construct the co-expression gene network in the samples of COAD and explore the co-expressed gene module most associated with the pathological prognostic features defined by the previous step. Calculating the interaction coefficient between genes and then computing the topological overlap measure (TOM) using the adjacency matrix. The co-expression network was constructed based on the W matrix to determine co-expression gene modules. Where, modules with module significance < 0.05 were regarded as prognostic related modules.
Establishment of integrated prognostic model
Aimed at establishing a prognostic prediction model based on histopathological features and genetic expression of COAD patients, we applied the random forest to construct the integrated prognostic prediction model using R randomForestSRC package. Random forest (RF) is a classifier containing multiple decision trees and each tree is built on an independent bootstrap training set. The output category is determined by the mode of the output category of individual trees. RF has great advantages over other algorithms in high-dimensional data processing. It can process high-dimensional data without deleting variables, and can evaluate the predictive ability of each feature. Meanwhile, the unbiased estimation of generalization error generated by internal cross validation guarantees high accuracy. The intervention of two randomness avoids overfitting. The samples were divided into 10 parts, including seven parts (140 samples) of training set and three parts (59 samples) of independent testing set. The 10-fold cross-validation was applied to construct prognostic model, plotting time-dependent receiver operating curve (ROC) using the average of 10 accuracies in order to calculate area under the curve (AUC). The RF model estimates the survival risk of each patient. In other words, the above steps produce a risk score of survival for each patient which we named it histopathologic-genomic prognosis factor (HGPF). By taking the median of risk score, the training set and test set can be divided into a high-risk group and a low-risk group respectively.
After single-factor cox regression, we incorporated meaningful results (p < 0.05) into multivariate cox regression analysis. The nomogram was drawn to verify the predictive ability of the prediction model. Two predictive factors, HGPF obtained from RF model and tumor stage of patients, were used for evaluation. Scores were assigned for each predictive factor according to their degree of influence (the value of regression coefficient) on survival outcome in cox regression model. After adding the scores of the two factors as total score of each patient, the function conversion was performed between the two factors and the probability of different survival time of patients, so as to accurately predict the survival of patients.