Genome Based Search for Reference Genes for Gene Expression Analysis in Oral Cancer: a Data Science Driven Approach

Quantitative real time PCR (qPCR) remains by far the most cost-effective, fast yet sensitive technique to check the gene expression levels in various systems. The traditionally used reference genes over the years were found to be regulated heavily based on sample sources and/or experimental conditions. This paper therefore presents a data science driven -omic approach for selection of reference genes from ~60,000 candidates from The Cancer Genome Atlas (TCGA) and Broad Institute Cancer Cell Line Encyclopaedia (CCLE) for gene expression studies in head and neck squamous cell carcinoma (HNSCC). mRNA-sequencing data of 500 patient samples and 33 cell lines from publicly available databases were analysed to assess stability of genes in terms of multiple statistical measures. A final set of 12 candidate genes were studied in the Indian set of data in Gene Expression Omnibus (GEO) and validated experimentally using qPCR in 35 different types of samples from platforms like drug sensitive and resistant cell lines, normal and tumor samples, fibroblast and epithelial primary culture derived from HNSCC patients from India. is a stably gene in of cell


Abstract Background
Quantitative real time PCR (qPCR) remains by far the most cost-effective, fast yet sensitive technique to check the gene expression levels in various systems. The traditionally used reference genes over the years were found to be regulated heavily based on sample sources and/or experimental conditions. This paper therefore presents a data science driven -omic approach for selection of reference genes from ~60,000

Result
The study lead to the choice of five most stable reference genes -TYW5, RIC8B, PLEKHA3, CEP57L1 and GPR89B across three experimental platforms.

Conclusion
In addition to providing a set of five most stable reference genes for future gene expression analysis experiments across different types of samples in HNSCC, the study also presents an objective framework for assessing reference genes for other disease areas as well. 3 Background Gene expression profiling by qPCR is the most cost-effective and reliable technique for targeted profiling in in vitro, ex vivo and in vivo systems. However, quantification with reference to normalization controls (a.k.a. reference genes) to negate the interexperimental variability caused by differences in RNA concentration or variable sample handling processes is paramount [1].
A good reference gene should have uncontrolled expression in various conditions [2,3].
Previously used qPCR "gold standards" have been shown to change during various cellular processes such as cell cycle, differentiation, cancer progression or by various environmental conditions such as drug exposure, hormonal therapies and chemo or radio therapies [4]. In various disease conditions expression levels of reference genes vary depending on the location [5][6][7][8][9], experimental conditions [10][11][12], the tissue type under the study [13,14] and the tumor grade [15,16].
Ever since genome-wide expression data was available through high throughput experiments like microarray, there have been many efforts to identify stable genes that could be used for normalization in qPCR experiments [2,3]. Most of the initial work [17] focussed on evaluating a set of known reference gene candidates for stability of expression using several normalization algorithms -geNorm [18], NormFinder [19] and bestKeeper [20]. However, researchers also tried assessing gene stability using bioinformatics approach [21], statistical measures like the CV [22], and the difference in DNA entropies in promoters driving the expression of specific genes [23].
Availability of RNA sequencing data, with a wider dynamic range compared to microarray enable researchers to try out different statistical methods to evaluate gene stability. Yim et.al attempted to discover reference genes for expression studies in Soybean using two measures -(i) CV and (ii) p-value from a normality test assuming that the true reference genes should follow Normal distribution across samples [24]. An automated workflow called findRG [25] was proposed to find reference genes in different plant species and human cancers using CV as the primary measure. Hoang et.al used geNorm and Normfinder algorithms to identify reference genes for non-melanoma skin cancers from RNA-sequencing data [26]. Even for using these stability finding softwares, PCR efficiency should be taken into consideration as it can lead to bias of stable genes, if ignored [27].
Though all the efforts focused on identification of genes with least variation across samples, a systematic data science-based approach was not found in human disease areas, particularly in cancer. The present study focuses on validating reference genes in HNSCC, using an unbiased -omics approach in all the three major model systems of research (cell lines, patient samples and primary cultures) considering difference in origin and drug resistance. Common list of reference genes across multiple platforms will help researchers to reduce the inter-sample variability and thus arrive at an unambiguous data interpretation.

Results
Statistical Analysis of RNA-sequencing data

Selection of Primers
From the top 20 selected candidates from publicly available data (TCGA and CCLE), melt curve analysis ( Supplementary Fig 2) gave 11 genes with a single amplicon among which 8 genes had primer efficiencies ranging from 90-110% [48]. Among the 5 commonly used reference genes 4 had acceptable range of primer efficiency (Table 1) thus making the total number of selected candidate reference genes to 12.

Expression Behaviour of Candidate Genes in Cell Lines
Candidate reference genes when analysed in CCLE dataset (Fig 4a)  Expression patterns were checked in the characterized primary cultures ( Supplementary   Fig 4a and b). Passage numbers did not have any effect on genes like CEP57L1 and TYW5 whereas some genes like VTI1A showed huge variation (Fig 4c). Epithelial and fibroblast cells from the same patient samples expressed CEP57L1 and TYW5 at equal levels whereas VTI1A was regulated (Fig 4d).

Behaviour of Candidate Genes in Patient Samples
Analysis of effect of tumor location in 500 TCGA dataset did not reveal any variation for all RefFinder tool [49]. Geometric means of ranks obtained from both algorithms was used to rank the top 5 most stable genes -TYW5, PLEKHA3, RIC8B, CEP57L1 and GPR89B (Fig 6).
TYW5 functions in iron binding and the biosynthesis of a hydroxywybutosine (a hypermodified nucleoside) in tRNA by catalysing hydroxylation [50]. RIC8B guanine nucleotide exchange factor can activate some G-alpha proteins by changing bound GDP to free form GTP [51]. PLEKHA3 has several biochemical functions and is involved in golgi apparatus to cell surface trafficking of protein cargo [52]. CEP57L1 has been seen to be required for microtubule attachment to centrosomes [53][54]. GPR89B lastly is required for proper functioning of Golgi apparatus by maintaining the voltage dependent anion channel [55].

Discussion
The choice of reference genes becomes crucial for expression analysis of gene of interest.
Levels of reference genes therefore, should remain unaltered in any given condition. The genes required for regular operations of a cell i.e. the house keeping genes were the obvious choice of reference genes. However, their expression has been shown to alter depending on different conditions. Considering only the functional aspect of the reference genes have led to erroneous picks. The functions of the chosen genes from the study also vary differently. TYW5 functions in iron binding and the biosynthesis of a hydroxywybutosine (a hyper-modified nucleoside) in tRNA by catalysing hydroxylation [50]. RIC8B guanine nucleotide exchange factor can activate some G-alpha proteins by changing bound GDP to free form GTP [51]. PLEKHA3 has several biochemical functions and is involved in golgi apparatus to cell surface trafficking of protein cargo [52]. CEP57L1 has been seen to be required for microtubule attachment to centrosomes [53][54]. GPR89B lastly is required for proper functioning of Golgi apparatus by maintaining the voltage dependent anion channel [55]. Therefore, the choice of reference genes based only on their function is not sufficient. Current study thus offers an unbiased data-science based approach to shortlist reference genes. Most reliable reference genes should not be The study reported here is a major improvement over similar approaches found in literature, even co-authored by two authors (AS and MAK) from this study [35]. Some of the improvements include (i) starting with an -omics pool of unbiased genes (ii) using a median-based variation parameter (MAD) in addition to the standard deviation based variation to make the analysis less susceptible to outliers often seen with patient samples

Conclusion
The present study offers an unbiased; -omics based approach to arrive to a set of candidate genes which can be used as reference genes not only across different conditions, but also across three major platforms of research -cell lines, primary cultures and patient samples. Candidate genes, 12 in number, when checked by qPCR in 35 different systems (Fig 6 a) and subjected to RefFinder, chose 5 genes to be the most stable (Fig 6b): TYW5, RIC8B, PLEKHA3, CEP57L1 and GPR89B. Although, a robust method for the choice of reference genes has been developed, still sequence data from Indian samples are required to come to an unbiased set of reference genes. The study therefore calls for an Indian Cancer Genome Atlas.

Gene Expression Data Acquisition
As represented in To assess stability of a gene, two measures were adopted -(i) CV = x ̂ σ x where x ̂ and σ x are mean and standard deviation of a variable x respectively and (ii) the normality pvalue measured by the Shapiro-Wilks Test (p-value < 0.05 indicates the distribution to be away from Normal) [24]. CV, albeit most frequently used, is affected by outliers, and hence fails to be a robust measure. To address this, a third parameter -MAD = m e d i a n x -x ̂ (where x ̂ is the median of x) [42] was used after normalization with median. MAD is a measure of the spread of the distribution and being based on medians, is less susceptible to deviations by outliers.
However, for the patient dataset, the Normality p-value calculated by Shapiro-Wilks Test is <10 -4 for most genes, indicating that expression of none of the genes deviate from Normal distribution. Hence only CV and MAD were used as the two parameters for the study.
An ideal set of reference genes should have low or similar statistical variation (e.g. CV and MAD) across samples. Therefore, genes were clustered based on their CV and MAD values (normalized to respective z-scores) using the PAM algorithm originally proposed by Kaufman and Rousseeuw [43]. Optimal number of clusters required is calculated using the Silhouette graphical method of Rousseeuw [44]. For patient and cell line dataset, the cluster having the lowest medoid value for CV and MAD z-scores was selected, and the intersection between the two clusters was identified containing the genes having least CV

Selection of Commonly Used Reference Genes
Most commonly used reference genes were shortlisted by literature based on their frequency of usage in published papers. No unique keywords were used by researchers to report studies on reference genes. Many such articles are not indexed with MeSH terms so that the subheadings can be used for disease-based search. Hence a very broad methodology was adopted in which all articles in Pubmed were searched for occurrence of any of the terms "reference gene" or "control gene" or "housekeeping gene" along with co-occurrence of the term "head and neck" or "oral" anywhere in the article. Obtained abstracts were manually curated by authors ND and SKD independently to find the relevant articles that described studies on reference genes specifically in the context of oral or head and neck cancer. The shortlisted 28 genes were run on CCLE and TCGA database for expression analysis for their segregation among four standard quartiles.

Design of Primers
Primers were designed (Table 1; supplementary table 2 qPCR [3] qPCR was done on Roche LightCycler 480 II instrument using KAPA SyBr green Universal

Availability of data and material
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
All data generated or analysed during this study are included in this published article [and its supplementary information files].

Competing interests
The authors declare that they have no competing interests Funding Current study was funded by MSMF. The funding agency did not have any role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author's contributions SKD contributed in conceptualization, study design, data acquisition (algorithms), data interpretation and manuscript preparation. ND contributed in doing qPCR for all samples, cell culture, data analysis, data interpretation and manuscript preparation. CG contributed in the establishment of primary cultures from patients. MAK, oral oncology surgeon contributed in getting the patient consent and samples. MD contributed in study design, conceptualization and manuscript editing. AS contributed in reviewing the manuscript.  Figure 1 Work flow of the study.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.
Supplementary data.pdf