Background: Breast cancer (BC) is a malignant neoplasm that arises from uncontrolled cell growth and proliferation in breast tissue. BC is classified into subtypes defined by their molecular markers, such as estrogen receptor positive (ER+), progesterone receptor positive (PR+), human epidermal growth factor receptor 2 positive (HER2+), and triple-negative breast cancer (TNBC). Timely diagnosis of the specific subtype is crucial for choosing an appropriate treatment strategy. Here we report key genes that distinguish TNBC from ER+ tumors and the use of a machine learning (ML) approach to classify TNBC and ER+ patients from gene expression data.
Method: RNA sequencing data from TNBC and ER+ tumor samples, obtained from the European Nucleotide Archive (ENA), were analyzed to identify differentially expressed genes (DEGs). Pathway enrichment analysis was conducted with the DAVID database and showed considerable enrichment of the DEGs in cancer-related functions and pathways. A gene interaction network was then constructed using the STRING database. Lastly, we evaluated three classification models, support vector machine (SVM), k-nearest neighbor (kNN), and Naïve Bayes, trained at different threshold levels to classify the two types of breast cancer.
Results: The analysis yielded DEGs that differentiate between the ER+ and TNBC types. Using the cytoHubba plug-in, we identified 10 hub genes: CDC20, CDK1, BUB1, AURKA, CDCA8, RRM2, TTK, CENPF, CEP55, and NDC80. These genes can be used for prognosis and for developing therapeutic alternatives. Among the three ML algorithms, kNN classified the two types most accurately.
Conclusion: Ten hub genes were identified that can be used to study the clinical and molecular behavior of breast cancer and to develop therapeutic alternatives that increase the survival rate. The predictions of the ML algorithms can be used to classify breast cancer types.
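
The classifier comparison described under Method can be illustrated with a minimal sketch. It is not the pipeline used in the study. It runs on synthetic data standing in for expression values of selected DEGs and compares only kNN and Gaussian Naïve Bayes by leave-one-out accuracy, using the Python standard library alone. The SVM is omitted because it would normally come from a library such as scikit-learn. The function names, the number of samples and genes, the class shift, and k = 3 are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def make_synthetic(n_per_class=20, n_genes=5, shift=3.0, seed=0):
    """Synthetic stand-in for expression values of selected DEGs (not real data)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in (("ER+", 0.0), ("TNBC", shift)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0) for _ in range(n_genes)])
            y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    dists = sorted((math.dist(row, x), lab) for row, lab in zip(train_X, train_y))
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]


def gnb_predict(train_X, train_y, x, eps=1e-9):
    """Gaussian Naive Bayes: pick the class with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        score = math.log(len(rows) / len(train_X))  # log prior
        for j, value in enumerate(x):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            var = sum((v - mu) ** 2 for v in col) / len(col) + eps
            # log of the Gaussian density for feature j
            score += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def loo_accuracy(predict, X, y):
    """Leave-one-out accuracy: train on all samples but one, test on the held-out one."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)


X, y = make_synthetic()
results = {
    "kNN": loo_accuracy(knn_predict, X, y),
    "NaiveBayes": loo_accuracy(gnb_predict, X, y),
}
for name, acc in results.items():
    print(f"{name}: leave-one-out accuracy = {acc:.2f}")
```

On real expression data, the feature columns would be the DEGs retained at a chosen threshold, and accuracy would be expected to vary with that threshold and with the classifier.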