Methodology.
The literature review for this article involved searching the PubMed database for publications up to November 2016. Three keywords were used to retrieve publications on the applications of bioinformatics in cancer diagnosis. Using the advanced search option on PubMed, the keywords “cancer”, “diagnosis”, and “bioinformatics” were first combined using the AND Boolean operator. The search term was therefore “cancer” AND “diagnosis” AND “bioinformatics”, with no field restriction added to the query, which returned 19,798 publications. The search was then modified by restricting the keywords to the title and abstract using the title/abstract query field. This modification reduced the search results to 1,080 publications.
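The two query strings described above can be sketched as follows. This is only an illustration of how the Boolean search terms are assembled: the helper name build_query is hypothetical, and the bracketed field tag follows PubMed's documented search-field syntax rather than anything stated in the original search log.

```python
def build_query(keywords, field=None):
    """Join keywords with AND, optionally restricting each one to a PubMed field tag."""
    if field is None:
        # No field restriction: PubMed searches all fields.
        terms = [f'"{kw}"' for kw in keywords]
    else:
        # Append the field tag in square brackets to every keyword.
        terms = [f'"{kw}"[{field}]' for kw in keywords]
    return " AND ".join(terms)


keywords = ["cancer", "diagnosis", "bioinformatics"]

# First search: all fields.
all_fields_query = build_query(keywords)

# Second search: keywords restricted to the title and abstract.
title_abstract_query = build_query(keywords, "Title/Abstract")

print(all_fields_query)
print(title_abstract_query)
```

Passing a different field name to the same helper would produce a correspondingly narrower query.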
The publications were then further filtered to include only review and systematic review articles, which reduced the number of publications to 75. Publications were then selected and their abstracts reviewed to assess their relevance. Publications that did not describe how bioinformatics tools and databases have been used in the diagnosis of cancer were excluded. Articles that were not available in full text and free of cost were also excluded. For the publications that were labeled as relevant, additional information and publications were obtained from the reference sections of the articles. Papers covering applications of bioinformatics in the treatment of cancer or in drug discovery against cancer were also excluded, since they were beyond the scope of the topic of study.
After the papers of relevance and interest were identified, the search results were further filtered by searching for the keywords in the title only. This resulted in 7 publications, all of which were relevant to the topic of study. For purposes of contrast, publications on the use of non-bioinformatics methods of cancer diagnosis were also obtained. To obtain these publications, both the PubMed database and Google Scholar were used. The search for these articles was performed by combining the keywords “cancer” and “diagnosis” with the AND Boolean operator, first in the title and abstract of the articles and then in the title only. This resulted in 11,752 and 1,071 publications respectively. Articles describing relevant information on how cancer is diagnosed using non-bioinformatics tools and databases were then included in the list of articles to be reviewed. The papers were then distributed among the 12 authors, who reviewed the articles and extracted key information from them.
From each of the papers, information such as the findings, the methods used, and the key discoveries made was collected and analyzed. The information collected from the articles on the application of bioinformatics in cancer diagnosis was divided into the bioinformatics databases used for cancer diagnosis, the bioinformatics tools and software used to analyze the data obtained from the different databases, and the key findings. For the papers on cancer diagnosis using non-bioinformatics methods, information on the methods of cancer diagnosis and their level of accuracy was identified.
Bioinformatics tools and databases used in cancer diagnosis
To apply bioinformatics tools and analysis techniques, it is important to first obtain relevant data in line with the area of study. The process of obtaining this data is known as data mining. Data mining involves the use of sophisticated data analysis tools to find previously unknown patterns and relationships in large data sets. This plays an important role in processes such as gene finding, protein function domain detection, function motif detection, protein function inference, disease diagnosis, disease prognosis, protein and gene interaction network reconstruction, data cleansing, and protein subcellular location prediction (14). One platform used for data mining is Oncomine. Oncomine is a cancer microarray database and integrated mining platform that systematically curates, analyzes, and makes available all public cancer microarray data (15).
Bioinformatics databases containing data on cancer
The Gene Expression Omnibus (GEO).
The GEO database is a public repository that archives and distributes high-throughput gene expression and other functional genomics data internationally, free of charge. As technologies change rapidly, GEO has also evolved to include other data applications, such as the examination of chromatin structure and genome-protein interactions, rather than gene expression studies only (16). The database provides access to data from tens of thousands of studies, together with various web-based tools to analyze the data. GEO enables users to visualize and analyze data within their specific interests while providing detailed descriptions (17). The GEO homepage is at http://www.ncbi.nlm.nih.gov/geo/
The Cancer Genome Atlas (TCGA).
The Cancer Genome Atlas is one of the most ambitious and successful cancer genomics programs. The database program has generated, analyzed, and made available genomic sequence, expression, methylation, and copy number variation data on over 11,000 individuals representing over 30 different cancer types (18). The Cancer Genome Atlas (TCGA) was a joint effort of the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), which are both part of the National Institutes of Health, U.S. Department of Health and Human Services (18). The Cancer Genome Atlas, therefore, is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.
The Human Protein Atlas
The Human Protein Atlas is a Swedish-based program that started in 2003 to map all the human proteins in cells, tissues, and organs using an integration of various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics, and systems biology. The Human Protein Atlas is divided into three sub-atlases: the Tissue Atlas, the Cell Atlas, and the Pathology Atlas (19). The program relies on sensitive and highly specific antibodies to provide an accurate estimation of protein expression. All antibodies used by the Human Protein Atlas undergo a rigorous validation process that includes Western blot analysis, immunohistochemical staining, and immunofluorescence evaluation against carefully selected sample materials.
The Tissue Atlas contains information regarding the expression profiles of human genes within different tissues at both the protein and mRNA levels. The protein expression data relies on immunohistochemical analysis of 76 different cell types, corresponding to 44 normal human tissue types. The mRNA expression data on the other hand is derived from deep sequencing of RNA from 37 normal tissue types (19).
The Cell Atlas provides information about the spatial distribution of proteins within a panel of 64 cell lines, selected to represent various cell populations in different organs of the human body. The protein expression data is generated by immunofluorescence microscopy, while the mRNA expression data is derived from the deep sequencing of RNA. A subset of these cell lines is subjected to a deeper investigation, within which subcellular protein distribution is classified into 33 different organelles and fine cellular structures.
The Pathology Atlas provides information on how protein expression differs between normal and cancer tissues. It contains protein expression and mRNA data for the most common forms of human cancer, and the Human Protein Atlas has correlated mRNA expression levels of human genes in cancer tissue with clinical outcome. The Pathology Atlas therefore enables researchers to study protein expression levels for individual tumors of each cancer type. For example, PAX8 staining is elevated in thyroid, ovarian, and endometrial cancer tissues.
Once data has been mined from the databases described in the previous section, bioinformatics tools and techniques are required to analyze the collected data. It is during the analysis process that new components such as biomarkers and biochemical pathways can be discovered to improve the process of cancer diagnosis. Some of the bioinformatics tools used for data analysis are described in the following section on bioinformatics tools.
Bioinformatics Tools
The Database for Annotation, Visualization and Integrated Discovery (DAVID)
DAVID is a functional annotation tool that uses a collection of algorithms to condense a large group of genes with associated biological terms into related, well-ordered families known as biological modules (20). The database is widely used for complex biological tasks such as, but not limited to, identifying enriched biological themes, particularly GO terms; discovering enriched functionally related gene groups; clustering redundant annotation terms; displaying many-genes-to-many-terms relationships in a 2-D view; listing interacting proteins; linking gene-disease associations; and visualizing genes on BioCarta and KEGG pathway maps.
It mainly uses four data analysis modules: the annotation tool, which automatically annotates gene lists; GO Charts, which give visual representations of genes according to biological process, molecular function, and cellular component; Domain Charts, which show the distribution of differentially expressed genes among protein family domains; and KEGG Charts, which display the distribution of differentially expressed genes among KEGG biochemical pathways using KEGG's DBGET, an integrated data retrieval system (21). A PubMed search for the use of the DAVID tool in bioinformatics papers returned over 4,000 journal articles, of which over 2,000 were cancer-related, showing how the tool has revolutionized the cancer field through the assessment of microarray data.
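The idea of an "enriched" term can be illustrated with a simple over-representation test. The sketch below is not DAVID's implementation: DAVID reports a modified Fisher exact statistic (the EASE score), whereas this example uses the plain hypergeometric upper tail. The function name and all counts are hypothetical values chosen only for illustration.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """Probability of seeing at least k annotated genes in a list of n genes,
    when K of the N background genes carry the annotation term."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Hypothetical toy numbers: a background of 10 genes, 4 of which carry the term,
# and a gene list of 3 genes, 2 of which carry the term.
p_value = hypergeom_tail(k=2, n=3, K=4, N=10)
print(p_value)
```

A small p-value indicates that the term appears in the gene list more often than expected by chance; in practice a multiple-testing correction would also be applied across all terms tested.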
Surveillance, Epidemiology, and End Results Program (SEER)
This program, launched on January 1, 1973, aims to gather information on the diagnosis, treatment, and trends of cancer for about 30% of the United States (U.S.) population (22). The program keeps track of the various types of cancer and of survival variance by age, ethnicity, and stage at the time of diagnosis. The program has turned cancer data into discoveries, with over a thousand researchers, clinicians, and legislators using it to analyze and interpret the variations and evolution of cancer in the U.S. (NCI, 2018). The SEER program has proven to be a valuable tool for observing histopathologic cancer subtypes and data by molecular subtype. Over 13,068 studies used data from the SEER database between 2000 and 2021 with the help of the SEER*Stat software, which is used to query the SEER data. Bioinformatics studies have used the SEER database and software to analyze and assess early deaths, survival rates, and prognostic factors for survival, to observe cancer patterns, and to improve overall outcomes.
Gene Ontology (GO)
This is a comprehensive bioinformatics resource that provides information about functional genomics to represent biological knowledge. It is a community-based project that is available at http://www.geneontology.org. The biological knowledge is described in three aspects: molecular function, cellular component, and biological process. Molecular function describes the activities performed by gene products that occur at the molecular level, such as transport and catalysis. Rather than describing the complex structures where these activities take place, this aspect of GO provides information on the activities of the gene products themselves.
Cellular component provides information on the locations where the gene products perform their activities, which may be either cellular compartments or stable macromolecular complexes. This is, in other words, the cellular anatomy. Biological processes are the larger processes accomplished by several molecular activities, for example, glucose transmembrane transport (23). A ‘GO annotation’ defines the connection between a class from the ontology and a gene product, with references to the evidence supporting the connection. The Gene Ontology Consortium (GOC) is responsible for ensuring that gene products and gene functions have consistent descriptions across biological databases and across all organisms. GO can also be used in conjunction with KEGG pathway analysis, as in (24), which explored this combination to analyze cancer-related long non-coding RNAs.
Gene Expression Profiling Interactive Analysis (GEPIA).
This is a web server for profiling and analyzing gene expression in cancer and normal tissues (25). It allows biologists and clinicians to perform comprehensive and complex data mining tasks with simple clicks, thus facilitating data mining in research, scientific discussion, and therapeutic discovery in cancer. In other words, GEPIA provides a tool for resolving bulk RNA datasets from the TCGA and Genotype-Tissue Expression (GTEx) projects to investigate expression profiles across cancer and healthy patient groups. This is done using different techniques, such as studying cell types and interrogating the characteristics of different cell types in cancer (26). It provides a deeper understanding of gene functions and creates new opportunities for data mining in the cancer field of study. For example, (27) used GEPIA to study the expression and prognostic value of sirtuins, enzymes that have distinct roles in ovarian cancer, and (28) used it to analyze prognostic biomarkers of cervical cancer. The web server is available at http://gepia.cancer-pku.cn/.
University of Alabama at Birmingham Cancer Data Analysis Portal (UALCAN)
The UALCAN database is a comprehensive, user-friendly, and interactive web resource for analyzing cancer omics data. It is an integrated data-mining platform that facilitates the comprehensive analysis of the cancer transcriptome. UALCAN uses TCGA RNA-sequencing and patient clinical data from 33 different cancer types and also includes several metastatic tumors. Through its user-friendly web interface, UALCAN facilitates relative expression analysis of one or more query genes across tumor and normal samples. It also allows identification of the top over- and under-expressed genes in individual cancer types (29).
UALCAN makes it possible to explore or validate the pan-cancer expression pattern of hundreds of user-defined genes. It is designed to provide easy access to publicly available cancer omics data, to allow users to identify biomarkers or perform in silico validation of potential genes of interest, and to provide graphs and plots depicting expression profiles and patient survival information. It also serves as a one-stop shop for cancer gene expression analysis by linking the selected gene targets to external resources such as GeneCards, the Human Protein Reference Database (HPRD), PubMed, TargetScan, and the Human Protein Atlas, which are used to investigate protein expression in various cancers (29).
Case studies on the application of bioinformatics tools and databases in the diagnosis of cancer
The bioinformatics tools and databases described above have been key in improving the diagnosis of different cancer types. To understand how they have been used to improve cancer diagnosis, three case studies were examined. The case studies cover three of the most common cancer types: cervical cancer, breast cancer, and pancreatic cancer.
Cervical cancer: Cervical cancer is one of the most dangerous diseases affecting women of all ages. As of 2018, there were approximately 569,000 new cases of cervical cancer worldwide, with one of the highest numbers of cases occurring in Uganda (30), and about 311,000 deaths associated with the disease. About 84-90% of these deaths occurred in low and middle-income countries such as South Africa (31). However, by combining high-throughput sequencing technology with bioinformatics tools to analyze the data generated, several new gene characteristics and signal pathways that can be used to diagnose cervical cancer have been discovered. These gene characteristics and signal pathways can be used to detect cell pathologies at a very early stage, thus improving the assessment of disease diagnosis, prognosis, and recurrence (32). A study was therefore conducted by Hua-ju Yang et al. on how bioinformatics tools and databases can be used to identify key genes and pathways for diagnosis and prognosis in cervical cancer (33).
In the study, three different gene expression profiles consisting of both normal and tumor samples of the same gene were obtained from the Gene Expression Omnibus database. Genes that were differentially expressed between the normal and tumor cells were then identified using the GEO2R web tool. This step provides the starting point of the analysis, since only the differentially expressed genes are carried forward. The DAVID tool was then used to identify the functional genes and biological pathways from the differentially expressed genes. The Gene Ontology and KEGG tools were then used to identify the gene functions and to understand the biological processes and metabolic pathways of the genes (34).
After the significant genes and their functions were identified, the genes were visualized as box plots using GEPIA, an interactive web application, to obtain in-depth information on them. Protein-protein interaction analysis was also performed to analyze the genes commonly expressed among the three gene profiles and the proteins common to two or more genes. Based on the analyses performed with the different bioinformatics tools, 12 key differentially expressed genes were identified from a total of 57 differentially expressed genes that were involved in processes such as cell division and epithelial cell differentiation. All the genes identified had a higher level of expression in cervical cancer tissues than in normal tissues (33).
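The screening logic described above, selecting differentially expressed genes in each profile and then keeping those shared by all three, can be sketched as follows. This is a minimal illustration and not the procedure used in (33): the gene names and values are placeholders, and the thresholds (an absolute log2 fold change of at least 1 and an adjusted p-value below 0.05) are assumed cut-offs that the study does not necessarily share.

```python
def select_degs(rows, min_abs_log2fc=1.0, max_adj_p=0.05):
    """Return the set of genes passing the fold-change and adjusted p-value cut-offs.

    Each row is a (gene, log2_fold_change, adjusted_p_value) tuple, similar to
    what a tool such as GEO2R would export for one expression profile.
    """
    return {
        gene
        for gene, log2fc, adj_p in rows
        if abs(log2fc) >= min_abs_log2fc and adj_p < max_adj_p
    }


# Placeholder results for three hypothetical expression profiles.
profile_1 = [("GENE_A", 2.1, 0.001), ("GENE_B", -1.4, 0.020), ("GENE_C", 0.3, 0.400)]
profile_2 = [("GENE_A", 1.8, 0.003), ("GENE_B", -0.2, 0.700), ("GENE_D", 1.2, 0.010)]
profile_3 = [("GENE_A", 2.5, 0.0005), ("GENE_D", 1.6, 0.030), ("GENE_B", -1.1, 0.040)]

# Keep only the genes that are differentially expressed in every profile.
per_profile = [select_degs(p) for p in (profile_1, profile_2, profile_3)]
common_degs = set.intersection(*per_profile)

print(sorted(common_degs))
```

Genes that pass the cut-offs in only one or two profiles are dropped, which is what makes the shared set a more conservative starting point for downstream enrichment analysis.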
A key gene, CXCL8, was common to all cervical cancer tissues and was associated with poor prognosis in patients with high expression of the gene. This was also verified with the UALCAN online tool using data from the Oncomine database. The gene was also associated with other cancer types such as pancreatic cancer, head and neck tumors, breast cancer, and many others. The analysis also found that the gene plays a key role in apoptosis resistance and tumorigenesis. Cervical cancer patients with decreased levels of CXCL8 have also been shown to have a better survival rate. Other key genes identified included MCM2, TOP2A, TYMS, and HELLS. These genes were identified as being responsible for the occurrence of cervical cancer and were thus considered to be vital diagnostic markers (33). The bioinformatics tools and databases have thus been used to identify key differentially expressed genes that can be used as biomarkers to detect cancer at a very early stage and improve patients' chances of survival.
Breast cancer: Breast cancer is a top biomedical research issue, since it is the most frequently diagnosed cancer in females and its mortality rates have been increasing in past years. Early diagnosis and treatment of breast cancer are of paramount importance. Breast cancer is diagnosed using magnetic resonance imaging, ultrasound, mammography, positron emission tomography, and biopsy, but these methods are expensive, lengthy, and have low sensitivity. Bioinformatics presents a quicker and more efficient approach by identifying breast cancer biomarkers for early detection and hence diagnosis of breast cancer (35).
Differentially expressed genes (DEGs) in breast cancer were established using 3 datasets from the GEO database. Analysis of genome pathways was used to show the functional roles of the differentially expressed genes. Reverse transcription-quantitative polymerase chain reaction (RT-qPCR) on Chinese breast cancer tissues was used to authenticate the expression of the novel DEGs, with a total of 46. Two novel biomarkers, ADH1A and IGSF10, and 4 other genes (APOD, KIT, RBP4, and SFRP1) were seen as causes of breast cancer since they were expressed in breast cancer tissues. In addition, 14 out of 25 microRNAs targeting the 6 genes were seen to be associated with breast cancer and hence are potential biomarkers for the diagnosis of breast cancer (32).
Weighted gene co-expression analysis was done with Gene Set Enrichment Analysis (GSEA) for genome-wide RNA expression. For functional enrichment analysis, GO was used to describe the function of genes and gene products, and KEGG analysis was used to annotate genes with pathway and functional information. Breast cancer cell lines were obtained from the Pathology Laboratory of the Cancer Institute of the Fourth Hospital of Hebei Medical University. Cell lines were maintained in DMEM-H medium supplemented with fetal bovine serum (35).
Pancreatic cancer: Pancreatic cancer is a common cancer of the digestive tract with a poor prognosis. Early detection of pancreatic cancer biomarkers supports timely detection and management, which can improve prognosis and lower the mortality rates attributed to the disease. In this study, the role of DLGAP5 expression in tumorigenesis and tumor growth in pancreatic cancer was explored. Differentially expressed genes were screened using the GEO data set GSE16515. GO-based functional analysis and KEGG pathway enrichment analysis were performed on the corresponding proteins of the genes using DAVID. The Kaplan–Meier Plotter database was used to establish the relationship between the differentially expressed genes and pancreatic cancer prognosis. The DLGAP5 gene was singled out, and its expression in pancreatic cancer and other cancer tissues was profiled using the Oncomine and GEPIA databases (4).
Overall survival in relation to DLGAP5 was analyzed using the TCGA database. Thereafter, the molecular mechanisms of pancreatic cancer were analyzed by GSEA. Finally, a cell function experiment was done to characterize the biological behavior of DLGAP5. In this study, 201 upregulated and 79 downregulated differentially expressed genes were used. Tumor-related signaling pathways were observed, with emphasis on the cancer pathways, the extracellular matrix-receptor interaction pathway, and the p53 signaling pathway. It was found that DLGAP5 was highly expressed in pancreatic cancer cells and that the higher its level, the worse the prognosis. The gene was also significantly enriched in cell signaling pathways such as the cell cycle, p53, and oocyte meiosis (4).
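The relationship between expression level and prognosis in analyses such as the Kaplan–Meier Plotter step mentioned above rests on the Kaplan–Meier survival estimate. The sketch below shows that estimator in its simplest form. It is an illustration only: the function name is hypothetical, the follow-up times and outcomes are invented toy values, and it does not reproduce the study's data or the database's own implementation.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time where a death occurred.

    times  : follow-up time for each patient
    events : 1 if the patient died at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk patients surviving this time.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= removed
    return curve


# Toy follow-up data (months) for a hypothetical high-expression group.
times = [5, 8, 8, 12, 20]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

for t, s in curve:
    print(t, round(s, 3))
```

Computing one such curve for a high-expression group and one for a low-expression group, and then comparing them (typically with a log-rank test), is the usual way a statement such as "higher expression, worse prognosis" is supported.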