With the advancement of high-throughput DNA sequencing technologies, transcriptome sequencing has been used increasingly in clinical studies. The rapid accumulation of large transcriptome datasets presents a great opportunity to apply machine learning algorithms to clinical questions. In this study, we applied the CNN algorithm to breast cancer RNA sequencing data and developed models to classify two commonly used biomarkers. Our goals were two-fold: first, to evaluate the application of the AIO technique to RNA sequencing data; and second, to compare the performance of our CNN models with that of other methods currently used for the classification of breast cancer biomarkers. We focused on Ki67 and NHG because the assessment of these markers remained a challenge, and improving its accuracy was of high clinical value.
We designed two sets of experiments to evaluate how the combination of the AIO technique and the CNN algorithm performed on RNA sequencing data and biomarker classification. In the first experiment, we used cross validation to assess the models built with the AIO technique and the CNN algorithm. Here we combined the GSE81538 and GSE96058 datasets and performed 5-fold cross validation for both the Ki67 and NHG markers. For Ki67, a binary classification, we achieved an accuracy of 0.797 ± 0.034 and an AUC of 0.820 ± 0.064 (Table 1). The precision, recall and F1 score were 0.797 ± 0.033, 0.795 ± 0.035, and 0.797 ± 0.030, respectively. For NHG, a multi-class classification, the weighted average of categorical accuracy and the weighted average of class-specific AUC were 0.726 ± 0.018 and 0.848 ± 0.019 (Table 2). The class-specific accuracy varied substantially, with a Grade 1 accuracy of 0.316 ± 0.017. This could be due to the low proportion of Grade 1 subjects (14.8%) in the datasets.
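The cross-validation protocol above can be sketched as follows. This is a minimal illustration, not our actual pipeline: a logistic-regression classifier stands in for the CNN, and the data are synthetic rather than the combined GSE81538/GSE96058 expression matrix; only the 5-fold stratified splitting and the mean ± SD reporting of accuracy and AUC mirror the experiment.

```python
# Sketch of 5-fold stratified cross validation with mean ± SD of
# accuracy and AUC, as reported for the Ki67 binary classification.
# Assumptions: synthetic data and a logistic-regression stand-in
# for the CNN model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

accs, aucs = [], []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    aucs.append(roc_auc_score(y[test_idx],
                              clf.predict_proba(X[test_idx])[:, 1]))

print(f"accuracy {np.mean(accs):.3f} ± {np.std(accs):.3f}")
print(f"AUC      {np.mean(aucs):.3f} ± {np.std(aucs):.3f}")
```

Stratified splitting keeps the class proportions similar across folds, which matters here given the low prevalence of some classes (e.g., NHG Grade 1 at 14.8%).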
In the second experiment, independent-sample testing, we used GSE81538 as the training dataset to build the model and tested its performance on the GSE96058 dataset. We chose GSE81538 as the training data for practical reasons. One was to compare our model with the original study that first reported the two datasets [14]; in that study, GSE81538 was also used as the training dataset. As ours was the first study to apply the AIO technique in combination with the CNN algorithm to classify breast cancer biomarkers, we wanted a well-designed and representative study as a reference. Another reason was that GSE96058 had clinical data on survival and on responses to chemotherapy and endocrine therapy. We were interested not just in model performance, but also in whether the use of a large number of genes in the model could provide new information on the heterogeneity of breast cancer. For both markers, although the performance was slightly worse than in the cross validation, it was comparable to or better than that of the multi-gene models of the original study. For Ki67, the multi-gene model reported a concordance rate, or accuracy, of 0.663 against the consensus calls from trained pathologists, whereas our model achieved an accuracy of 0.772 (Table 1). For NHG, the accuracies were 0.677 for the multi-gene model and 0.682 for ours (Table 2). These comparisons indicated that by transforming gene expression data into AIOs, we could apply mature algorithms such as CNN to effectively classify biomarkers and achieve comparable or better accuracy than other modeling methods. If we followed the data science tradition of using the larger dataset for training, i.e., using GSE96058 as training data and GSE81538 as testing data, the performance was comparable to that of the cross validation experiment (data not shown).
We also evaluated the performance of our models in terms of their predictive power. Compared to the consensus calls of trained pathologists, the calls from our models had similar predictive power in survival analyses (Figure 5). This suggested that we could use these models to classify Ki67 status and NHG grades and then use the model-predicted status and grades to predict survival. Once implemented, these models would improve productivity and consistency in clinical applications. Intriguingly, when we compared the calls from the pathologists with those from our models, the discordant, or misclassified, subjects showed some interesting properties. For the discordant subjects from the Ki67 model, the biomarkers ER, NHG, PAM50 and PGR could not predict their survival rates, and neither chemotherapy nor endocrine therapy improved their survival (supplementary Figure S1). These results suggested that the discordant subjects were a unique group of patients with a distinct prognostic outlook and treatment response. The discordant subjects from the NHG model were more complex, because this was a multi-class classification in which a Grade 1 subject could be misclassified as Grade 2 or Grade 3. While ER and PAM50 could predict their survival rates just as in the concordant subjects (supplementary Figure S2), PGR could not predict survival, and endocrine therapy did not improve it. The implications of the differences between the concordant and discordant subjects were not clear at this time; follow-up studies would be required to understand the distinctions.
It remained a challenge to explain how the CNN algorithm identified and classified image objects. In principle, the algorithm extracted, or learned, the feature maps or patterns characteristic of the labels from the training dataset, applied the same procedures to extract patterns from testing objects, and compared and matched those patterns with the patterns of the training labels. These patterns could be geometric, statistical or both. To help understand the classifications made by our models, we applied the saliency gradient method [21] to visualize the correctly classified AIOs (Figures 3 and 4). At the level of individual subjects, we could see clear differences between individuals belonging to different classes or groups (Figures 3A, 3B, 4A, 4B and 4C). At the group level, features specific to a single group were uncommon; instead, many features were shared between groups (compare Figures 3C and 3D, and Figures 4D, 4E and 4F). The differences between the groups were largely quantitative, i.e., changes in intensities (Figure 3E and Figures 4G, 4H and 4I). What we observed was typical of image classification, in which the same class of objects can display multiple different patterns. Since our AIOs were created from gene expression data where each pixel represented a single gene, this diversity of patterns within a class implied that different genes contributed to the patterns of the class and that the class was heterogeneous. This was consistent with many studies showing that the major subtypes of breast cancer were heterogeneous [22–24]. If we traced the patterns between classes or within a class back to their respective genes, it could provide useful insights into the underlying biology of these subtypes or subgroups.
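The idea behind gradient saliency is that a pixel (here, a gene) is important when a small change in its value changes the class score. The following is a minimal sketch of that idea only: a fixed two-layer NumPy network stands in for the trained CNN, and the gradient is approximated by finite differences rather than backpropagation, which is what the saliency gradient method [21] actually uses.

```python
# Sketch of saliency for an AIO: saliency(pixel) = |d score / d pixel|.
# Assumptions: a toy fixed-weight network replaces the trained CNN, and
# finite differences replace backpropagated gradients.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 64))   # toy first-layer weights
W2 = rng.normal(size=16)         # toy output weights

def class_score(aio_flat):
    """Toy class score: linear -> ReLU -> linear."""
    return W2 @ np.maximum(W1 @ aio_flat, 0.0)

def saliency(aio_flat, eps=1e-4):
    """Central finite-difference estimate of |d score / d pixel|."""
    sal = np.zeros_like(aio_flat)
    for i in range(aio_flat.size):
        up, down = aio_flat.copy(), aio_flat.copy()
        up[i] += eps
        down[i] -= eps
        sal[i] = abs(class_score(up) - class_score(down)) / (2 * eps)
    return sal / sal.max()       # normalize to [0, 1] for visualization

aio = rng.uniform(size=64)       # a flattened 8x8 AIO; each pixel = one gene
sal_map = saliency(aio).reshape(8, 8)
```

Because each pixel maps to one gene, a saliency map like `sal_map` can be read back gene by gene, which is what allows the within-class pattern differences discussed above to be traced to specific genes.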
In this article, we reported the development of a new approach to transform genomic data into AIOs and applied the CNN algorithm for their classification. Using transcriptome sequencing data as a case study, we demonstrated that once transformed into AIOs, gene expression data could be used to classify biomarkers and achieve similar or better performance than other multi-gene prediction models. Compared to other methods, our approach had several advantages. First, the AIO transformation could handle a very large number of variables and did not require special procedures to select them. Once the variables were transformed into pixels in an AIO, they became components of a structural pattern, which the CNN algorithm was designed to learn. Collinearity amongst the variables would not affect model performance, because even perfect pixel correlation would not impair object recognition in image classification. Because the AIO transformation needed no variable selection and could handle a large number of variables, it could be easily implemented for any tabulated data, which covers most types of omics data such as single nucleotide polymorphisms, gene expression, methylation, proteomics and metabolomics. Second, because we could trace the pixels in a given pattern back to the genes they represented, we could identify which genes were necessary to recognize the pattern, leading to a better understanding of how these genes worked coordinately and contributed to the phenotype, i.e., the label. This ability to track the genes in a spatial pattern provided a new approach to discovering multi-gene interactions and networks. Although we did not pursue further analyses in this direction, it would be an interesting area for future studies. Overall, the method reported here should have broad applications that utilize omics data to promote and improve personalized medicine.
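The core of the approach, mapping tabulated values to fixed pixel positions, can be sketched as below. The helper `to_aio`, the gene ordering, and the 0–255 min–max scaling are illustrative assumptions, not the exact layout or normalization used in our study; the sketch only shows why any tabulated omics data can be converted this way.

```python
# Sketch of an AIO transformation: each variable (gene) is assigned a
# fixed pixel position, so a sample's expression vector becomes a
# grayscale image. Gene order and 0-255 scaling are illustrative
# assumptions.
import numpy as np

def to_aio(expr, gene_order, side):
    """Place expression values into a side x side image by fixed gene order."""
    img = np.zeros(side * side)
    vals = np.array([expr[g] for g in gene_order], dtype=float)
    # min-max scale to 0-255 so the matrix can be stored as an 8-bit image
    span = (vals.max() - vals.min()) or 1.0
    img[:len(vals)] = 255.0 * (vals - vals.min()) / span
    return img.reshape(side, side)   # unused trailing pixels stay 0

# Toy sample: 100 hypothetical genes -> a 10x10 AIO
genes = [f"GENE{i}" for i in range(100)]
sample = {g: float(i % 17) for i, g in enumerate(genes)}
aio = to_aio(sample, genes, side=10)
```

Because the gene-to-pixel assignment is fixed across samples, the same position always represents the same gene, which is what makes both CNN training and the gene-level trace-back described above possible.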