A transcriptomic analysis identifies the association of MIR31HG and EPB41L4A-AS2 lncRNAs in Oral Cancer

doi:10.21203/rs.3.rs-1603295/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Long non-coding RNAs (lncRNAs) are non-protein-coding transcripts longer than 200 nucleotides, which distinguishes them from small regulatory RNAs such as microRNAs (miRNAs), short interfering RNAs (siRNAs), and other short RNAs. LncRNAs have gained widespread attention in recent years; as a result, identifying them and determining their functions has become a priority. Over the years, some lncRNAs have been shown to be involved in numerous fundamental processes of gene regulation, such as chromatin modification and DNA methylation. Given this variety of functions, lncRNAs are also expected to be involved in the development of diseases. LncRNAs are often found to be differentially expressed, and this dysregulation plays a significant role in carcinogenesis. Oral cancers are presently considered among the most prevalent cancers affecting the human population. Studying the mechanisms and functions of the lncRNAs involved in cancer is therefore a significant priority, as is the identification of lncRNAs speculated to play a role in the formation of various forms of cancer. A comparison between the expression of lncRNAs in normal and affected tissues can indicate whether or not they are involved in the disease. In this article, we perform a transcriptomic analysis to assess the association of the lncRNAs EPB41L4A-AS2 and MIR31HG with oral cancer.

Oral cancer

lncRNA

Transcriptomics

NGS data analysis

Cancer genetics

Long non-coding RNAs play a significant role in cancer. LncRNAs are essential transcriptional and post-transcriptional regulators that control gene expression. Cancers often originate from genetic alterations that cause aberrant gene expression. Human papillomaviruses (HPV) are small DNA viruses that infect epithelial tissues at various sites in the human body. HPV is considered a major cause of oral carcinogenesis. In their review, Langsfeld and Laimins [1] highlighted potential research priorities in HPV-mediated oral cancer. The paper emphasizes the importance of understanding the mechanisms by which productive HPV infections can progress to cancer, especially the interaction of the virus with the host's chromatin remodeling, DNA repair, and differentiation pathways. Oral cancer is a subtype of head and neck cancer, and 90% of cases are oral squamous cell carcinoma. A study by Irimie et al. [2] provides a perspective on ncRNAs and their derivatives in oral cancer. Another paper, by Gibb et al. [3], documents the first evaluation of the lncRNA expression profile of the oral mucosa.

This brings us to gingivobuccal complex cancer, which is predominant in countries where tobacco use is rampant. Very little literature addresses lncRNA expression profiles in gingivobuccal cancer, and most of it is outdated. There is thus a need to properly identify and validate the expression of lncRNAs in the gingivobuccal complex.

Despite clear evidence that lncRNAs play a critical role in the biological processes of human diseases, few efforts have been made to identify the major lncRNAs associated with a disease and to assess their prognostic value. In 2014, Cogill and Wang [4] published an article describing gene co-expression analysis for the identification and annotation of lncRNAs, and thereby for verifying their relation to cancer. To achieve this, they used weighted gene co-expression network analysis (WGCNA), which yielded hub lncRNA genes and enriched functional annotation terms within the modules. More recently, Alaei et al. [5] used similar network construction and module detection with WGCNA to identify novel key regulators in esophageal squamous cell carcinoma. They also performed gene ontology and pathway enrichment analysis, which proved useful for estimating the biological processes and pathways in which the co-expressed lncRNAs take part.

Long intergenic noncoding RNAs (lincRNAs) are long RNA transcripts identified in mammalian genomes through the analysis of transcriptomic data. They share features with other long noncoding RNA transcripts; however, lncRNAs are typically described as molecules, whereas lincRNAs are described as transcripts. Over the years, it has become clear that lincRNAs account for most lncRNAs. A paper by Cabili et al. [6] confirmed that lincRNAs are the largest subtype of non-coding RNA molecules in the human genome. Because lincRNAs make up most lncRNAs, they may also be involved in the prognosis of various cancers. Memon et al. [7] published an article in 2019 on the implications of a lincRNA for lung cancer cell survival. This paves the way for future work, as the effects of lncRNAs on other cancers, and the roles of other lincRNAs, remain relatively unknown.

A review article by Jean-Quartier et al. [8] highlights in silico methods used in cancer research. The article explains the importance of the TCGA database and of methods for computational validation, classification, and prediction using mathematical and statistical analysis.

Cancer recurrence is a major problem for patients who have been affected by the disease, and oral cancer is no exception. Surgery has been the preferred treatment in both cases. Even with advances in chemotherapy and radiotherapy, treatment outcomes remain poor because of local invasion and metastasis, which lead to recurrence. With a 30% survival rate among patients with recurrent oral cancer, identifying the factors responsible for recurrence becomes crucial. This study aims to identify biomarkers that would help improve prognosis prediction in specific types of oral cancer [9], [10].

Transcriptomics in cancer study

The advent of next-generation sequencing (NGS) brought with it the means to analyze gene expression and correlate it with several diseases, most notably cancer. Basic gene expression analysis techniques existed before NGS, but they served only rudimentary purposes. Alizadeh et al. [11] documented, in 2000, one of the first major studies to use gene expression analysis in the diagnosis of diffuse large B-cell lymphoma, making it one of the first studies to use transcriptomics to diagnose cancer.

Subsequent studies have profiled gene expression for protein-coding as well as non-protein-coding genes. For years, researchers considered protein-coding genes to be the useful data and treated non-protein-coding genes as junk because they do not code for proteins. It was later discovered, however, that non-coding genes contain invaluable information about the regulation and pathways involved in a variety of diseases, one of the most notorious being cancer. In oral cancer specifically, numerous studies have examined noncoding genes and their implications for the disease. NGS methodologies have made such analyses feasible without significant problems. The first generation of sequencing technology was Sanger sequencing, developed in 1977 and based on the chain termination (dideoxynucleotide) method. The second generation, together with microarrays, made RNA sequencing possible; these techniques were significant advances over their predecessor and slowly rendered Sanger sequencing obsolete, barring a few specific exceptions. We are now entering the age of third-generation sequencing, which is still in its infancy but is on its way to displacing its predecessors [12].

Returning to second-generation methodologies, RNA sequencing is commonly performed with Illumina/Solexa sequencing, which relies on a sequencing-by-synthesis approach and is currently one of the most widely used NGS platforms. Sequencing generates a large amount of raw data containing valuable information about the genome or transcriptome. These data are then analyzed with bioinformatic tools [12].

Data Collection

For this report, we searched the NCBI database for gingivobuccal cancer datasets and selected those that met our criterion of including both normal and tumor tissues. We used the data set GSE101547, available in the GEO database, which consists of gingivobuccal cancer tissues from 24 patients [13]. In addition, Dr. Sadasivam from DeepSeeq Bioinformatics provided us with 12 samples of oral squamous cell carcinoma.

Data Processing

The FASTQ files were processed using a standard transcriptomic bioinformatics analysis pipeline to obtain dysregulated genes from the data. The dysregulated genes allowed us to further examine the relationship between the genes and the disease. The raw sequence reads were first checked for quality using FastQC.

Alignment

The reads were then aligned to the hg38 reference genome with the splice-aware aligner HISAT2 [14]. We achieved an alignment rate of >90% for all samples. Of the approximately 53 million paired-end reads per sample, approximately 50 million aligned successfully to the hg38 reference genome. Low-quality and inconsistent reads were subsequently removed.
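As a quick check on the figures above, the per-sample alignment rate can be computed from the approximate read counts quoted in the text. This is a minimal sketch; the function name `alignment_rate` is illustrative and is not part of HISAT2.

```python
def alignment_rate(aligned_reads: int, total_reads: int) -> float:
    """Return the percentage of reads that aligned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * aligned_reads / total_reads


# Approximate per-sample figures quoted in the text (not exact counts).
rate = alignment_rate(50_000_000, 53_000_000)
print(f"{rate:.1f}%")  # about 94.3%, consistent with the >90% reported
```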

Count data generation and lncRNA identification

To generate count data, we used the HTSeq-count tool [15], which gave us the counts for the overlapping exons of each gene. To select the lncRNAs, we set stringent parameters. First, transcripts had to be longer than 200 bp. Second, the FPKM had to be greater than 0.5. Finally, we used the Coding Potential Calculator (CPC2) [16] to separate coding genes from non-coding genes. Transcripts with CPC scores ≤ -1 were removed from our list, leaving our final set of lncRNAs.
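The three filters described above can be sketched as a simple predicate over per-transcript values. This is only an illustration under stated assumptions: the `Transcript` record, the toy transcript names, and the example values are hypothetical, the FPKM helper uses the standard definition (fragments per kilobase of transcript per million mapped fragments), and the CPC cutoff direction is taken literally from the text (scores ≤ -1 are removed).

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 200   # transcripts must be longer than this
MIN_FPKM = 0.5        # expression must exceed this
CPC_CUTOFF = -1.0     # scores <= this are removed, as stated in the text


@dataclass
class Transcript:
    """Hypothetical record holding the values each filter needs."""
    name: str
    length_bp: int
    fpkm: float
    cpc_score: float


def fpkm(fragments: int, length_bp: int, total_mapped_fragments: int) -> float:
    """Standard FPKM: fragments per kilobase per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def passes_lncrna_filters(t: Transcript) -> bool:
    """Apply the length, expression, and CPC score filters in turn."""
    return (
        t.length_bp > MIN_LENGTH_BP
        and t.fpkm > MIN_FPKM
        and t.cpc_score > CPC_CUTOFF
    )


# Toy candidates with made-up values, one failing each filter.
candidates = [
    Transcript("toy_lnc_1", 1500, 2.3, 0.2),
    Transcript("toy_short", 150, 4.0, 0.1),
    Transcript("toy_low_expr", 900, 0.1, 0.3),
    Transcript("toy_cpc_removed", 1200, 3.0, -1.5),
]
kept = [t for t in candidates if passes_lncrna_filters(t)]
print([t.name for t in kept])
```

Keeping the thresholds as named constants makes it easy to see, and to change, exactly which cutoffs were applied.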

Data Analysis

The strandedness of the paired-end raw reads of GSE101547 was not stated in the public database. We therefore used the Salmon quantification tool [17], which identified the data set as unstranded. The data set provided by Dr. Sadasivam was already specified as unstranded, so we did not need to run this check on it. Differential gene expression analysis was then performed with the DESeq2 tool [18], covering 48,256 genes from the gingivobuccal cancer datasets. Outliers and low counts were excluded from the results, and applying an adjusted p-value threshold of < 0.001 left 2369 significantly dysregulated lncRNAs.
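To illustrate what an adjusted p-value threshold does, the sketch below applies the Benjamini-Hochberg procedure, which is the default adjustment method in DESeq2, to a handful of made-up p-values and keeps those with an adjusted value below 0.001. It is a simplified stand-in written in plain Python, not the DESeq2 implementation (which also performs independent filtering), and the input values are hypothetical.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


PADJ_THRESHOLD = 0.001  # threshold stated in the text

# Hypothetical raw p-values for four genes.
raw = [0.00001, 0.0002, 0.03, 0.5]
padj = benjamini_hochberg(raw)
significant = [i for i, p in enumerate(padj) if p < PADJ_THRESHOLD]
print(padj, significant)
```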

GEPIA2

GEPIA2 is an updated version of the GEPIA (Gene Expression Profiling Interactive Analysis) [19] web server and is used mainly for analyzing gene expression between cancerous and normal tissues fetched from the TCGA and the GTEx databases.

Here we used the GEPIA2 database and analyzed the head and neck squamous cell carcinoma entries present there. We specifically isolated the MIR31HG and EPB41L4A-AS2 lncRNAs because of their association with cancer hallmarks.

From a data set of 34 samples (17 normal and 17 tumor replicates), 48,256 genes were analyzed. After removing outliers and low counts and applying stringent filtering, 5226 genes were considered for examination, of which 2369 were dysregulated based on their adjusted p-values, as seen in Fig 1 and Fig 2.

The dysregulated genes obtained by comparing the normal data set with the tumor data set can be used to further identify and explore the relationship between the genes and the disease. Dysregulated genes indicate a disruption of normal genomic processes (Fig 3 and Fig 4). This disruption can be traced back to the source of the disease and can thus be analyzed further and used as a biomarker or as a target for drug identification. Comparing expression in normal and affected tissues can indicate whether or not these genes are involved in the disease. Similarly, the dysregulated genes will help us understand their relevance to tumor growth and maintenance.

The 2369 dysregulated lncRNA genes obtained after applying stringent parameters suggest that these lncRNAs are not performing their normal functions in the disease state and are, as a result, contributing to genetic aberration. To analyze the lncRNAs further, we used the LncSEA database [20] and found that the lncRNAs C5orf66-AS1, EPB41L4A-AS2, CYTOR, FAM3D-AS1, MIR31HG, and HCG22 had all previously been associated with oral squamous cell carcinoma. Furthermore, EPB41L4A-AS2 and MIR31HG are associated with cancer hallmarks and were both significantly upregulated in oral cancer. EPB41L4A-AS2 is involved in proliferation and prognosis, while MIR31HG is associated with apoptosis, migration, invasion, prognosis, and proliferation.

GEPIA2 analysis

After isolating the EPB41L4A-AS2 and MIR31HG lncRNAs from our results, we ran a gene expression analysis using the GEPIA2 database. We compared the overall survival of the patients with the expression of the two lncRNAs. Both EPB41L4A-AS2 and MIR31HG showed a significant correlation with patient survival (Fig 5 and Fig 6). Based on the median expression of MIR31HG and EPB41L4A-AS2, 2077 patients were divided into two groups: a high-expression group and a low-expression group. Our analysis is further supported by the GEPIA2 results, which show that higher expression of both lncRNAs was associated with significantly poorer survival outcomes.
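The median-based grouping described above can be sketched as follows. This is an illustrative example only: the patient identifiers and expression values are made up, the function name `split_by_median` is hypothetical, and assigning values equal to the median to the low-expression group is an assumption rather than a documented GEPIA2 behavior.

```python
from statistics import median


def split_by_median(expression: dict[str, float]) -> tuple[float, list[str], list[str]]:
    """Split patients into high and low groups around the median expression."""
    cutoff = median(expression.values())
    # Assumption: values equal to the cutoff fall in the low group.
    high = sorted(p for p, value in expression.items() if value > cutoff)
    low = sorted(p for p, value in expression.items() if value <= cutoff)
    return cutoff, high, low


# Hypothetical expression values for one lncRNA in four patients.
toy_expression = {"P1": 2.0, "P2": 5.5, "P3": 3.1, "P4": 8.4}
cutoff, high_group, low_group = split_by_median(toy_expression)
print(cutoff, high_group, low_group)
```

The two groups would then be compared with a survival analysis, such as the overall survival comparison reported in the text.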

Various noncoding RNAs (ncRNAs) that negatively regulate gene expression, such as microRNAs and long ncRNAs (lncRNAs), have been associated with cell invasiveness and dissemination, tumor recurrence, and metastasis. A comparison between the expression of lncRNAs in normal and affected tissues can therefore indicate whether or not lncRNAs are involved in the disease. Increasing evidence points to the need to explore the genome-scale expression of lncRNAs in cancer. It would also be beneficial to learn more about their potential biological functions, as information in this area is severely lacking.

The two lncRNAs identified as having cancer hallmark associations, MIR31HG and EPB41L4A-AS2, can significantly benefit gene target identification for oral cancer biomarker prediction. An overall survival analysis of the two lncRNAs showed that higher expression of both was associated with significantly poorer survival outcomes. Since these two lncRNAs show a variety of cancer hallmark properties, they are viable candidates for further study.

Recurrence is one of the most important aspects of oral cancer. Identifying the factors that affect recurrence, in order to reduce postoperative recurrence, is an emerging issue in clinics. Since lncRNAs have been linked with recurrence, a detailed analysis that could lead to the identification of prognostic biomarkers related to recurrence is of great importance.

Acknowledgement

We would like to thank Dr. Shubashini Sadasivam from Deep Seq Bioinformatics for providing us with the BAM files of oral cancer samples used in this article.

Funding:

Not applicable.

Conflicts of interest/Competing interests:

The authors declare no competing interests.

Ethics approval:

Not applicable.

Informed consent:

Not applicable.

Author Declaration:

Ajay Kumar Singh and Agnik Haldar have contributed equally to this manuscript.

[1] Langsfeld, E., & Laimins, L. A. (2016). Human papillomaviruses: research priorities for the next decade. Trends in cancer, 2(5), 234–240. https://doi.org/10.1016/j.trecan.2016.04.001
[2] Irimie, A. I., Braicu, C., Sonea, L., Zimta, A. A., Cojocneanu-Petric, R., Tonchev, K., Mehterov, N., Diudea, D., Buduru, S., & Berindan-Neagoe, I. (2017). A Looking-Glass of Non-coding RNAs in oral cancer. International journal of molecular sciences, 18(12), 2620. https://doi.org/10.3390/ijms18122620
[3] Gibb, E. A., Enfield, K. S., Stewart, G. L., Lonergan, K. M., Chari, R., Ng, R. T., Zhang, L., MacAulay, C. E., Rosin, M. P., & Lam, W. L. (2011). Long non-coding RNAs are expressed in oral mucosa and altered in oral premalignant lesions. Oral oncology, 47(11), 1055–1061. https://doi.org/10.1016/j.oraloncology.2011.07.008
[4] Cogill, S. B., & Wang, L. (2014). Co-expression Network Analysis of Human lncRNAs and Cancer Genes. Cancer informatics, 13(Suppl 5), 49–59. https://doi.org/10.4137/CIN.S14070
[5] Alaei, S., Sadeghi, B., Najafi, A., & Masoudi-Nejad, A. (2019). LncRNA and mRNA integration network reconstruction reveals novel key regulators in esophageal squamous-cell carcinoma. Genomics, 111(1), 76–89. https://doi.org/10.1016/j.ygeno.2018.01.003
[6] Cabili, M. N., Trapnell, C., Goff, L., Koziol, M., Tazon-Vega, B., Regev, A., & Rinn, J. L. (2011). Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes & development, 25(18), 1915–1927. https://doi.org/10.1101/gad.17446611
[7] Memon, D., Bi, J. & Miller, C.J. In silico prediction of housekeeping long intergenic non-coding RNAs reveals HKlincR1 as an essential player in lung cancer cell survival. Sci Rep 9, 7372 (2019). https://doi.org/10.1038/s41598-019-43758-7
[8] Jean-Quartier, C., Jeanquartier, F., Jurisica, I. et al. In silico cancer research towards 3R. BMC Cancer 18, 408 (2018). https://doi.org/10.1186/s12885-018-4302-0
[9] Zhou, M., Hu, L., Zhang, Z., Wu, N., Sun, J., & Su, J. (2018). Recurrence-Associated Long Non-coding RNA Signature for Determining the Risk of Recurrence in Patients with Colon Cancer. Molecular therapy. Nucleic acids, 12, 518–529. https://doi.org/10.1016/j.omtn.2018.06.007
[10] Wang, B., Zhang, S., Yue, K., & Wang, X. D. (2013). The recurrence and survival of oral squamous cell carcinoma: a report of 275 cases. Chinese journal of cancer, 32(11), 614–618. https://doi.org/10.5732/cjc.012.10219
[11] Alizadeh, A., Eisen, M., Davis, R. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503–511 (2000). https://doi.org/10.1038/35000501
[12] Behjati, S., & Tarpey, P. S. (2013). What is next generation sequencing?. Archives of disease in childhood. Education and practice edition, 98(6), 236–238. https://doi.org/10.1136/archdischild-2013-304340
[13] Singh, R., De Sarkar, N., Sarkar, S., Roy, R., Chattopadhyay, E., Ray, A., Biswas, N. K., Maitra, A., & Roy, B. (2017). Analysis of the whole transcriptome from gingivo-buccal squamous cell carcinoma reveals deregulated immune landscape and suggests targets for immunotherapy. PloS one, 12(9), e0183606. https://doi.org/10.1371/journal.pone.0183606
[14] Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357–360. https://doi.org/10.1038/nmeth.3317
[15] Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics (Oxford, England), 31(2), 166–169. https://doi.org/10.1093/bioinformatics/btu638
[16] Kang Y. J., Yang D. C., Kong L., Hou M., Meng Y. Q., Wei L., Gao G. 2017. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Research 45(Web Server issue): W12–W16.
[17] Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature methods, 14(4), 417–419. https://doi.org/10.1038/nmeth.4197
[18] Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome biology, 15(12), 550. https://doi.org/10.1186/s13059-014-0550-8
[19] Tang, Z., Kang, B., Li, C., Chen, T., & Zhang, Z. (2019). GEPIA2: An enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Research, 47(W1), W556–W560. https://doi.org/10.1093/nar/gkz430
[20] Chen, J., Zhang, J., Gao, Y., Li, Y., Feng, C., Song, C., Ning, Z., Zhou, X., Zhao, J., Feng, M., Zhang, Y., Wei, L., Pan, Q., Jiang, Y., Qian, F., Han, J., Yang, Y., Wang, Q., & Li, C. (2021). LncSEA: a platform for long non-coding RNA related sets and enrichment analysis. Nucleic acids research, 49(D1), D969–D980. https://doi.org/10.1093/nar/gkaa806
