Whole transcriptome analyses identify pairwise gene circuit motif in serous ovarian cancer

Background: Ovarian cancer is the most lethal gynaecological malignancy, resulting in approximately 14,000 deaths annually in the United States. Transcriptome data are emerging as an effective tool possibly leading to clinical applications for various cancers. Methods: We collected eight serous ovarian carcinomas (SOCs) and eight normal ovary samples, and generated a whole transcriptome profile of human ovarian cancer using microarrays, including mRNAs, lncRNAs, and circRNAs. We constructed a competing endogenous RNA (ceRNA) network involving these three types of RNAs and identified immune-related circRNAs from the network. Moreover, we proposed a gene-pair filtering method to identify significant expression reversals from integrated multi-cohorts, which mitigates the technical variation and improves the statistical power. Results: Three pairs of mRNAs (BIRC5:PRKCQ, PTK2B:OGN, and S100A14:NR2F1) were identified as promising biomarkers and were fused as an indicator (SOC index) for diagnostic prediction. Validation in three independent cohorts demonstrated that the SOC index carries a very high predictive capacity (average ROC, 0.99; sensitivity, 0.98; specificity, 1.00). Additionally, the SOC index exhibited its prognostic potential to discriminate SOC patients between early and late stage disease. Conclusions: Our findings elucidate the repertoire of RNA expressions in SOC and identified three gene pairs for the primary screening of SOC. Further biological experiments of the three gene pairs are warranted in order to investigate the underlying function mechanisms involved in ovarian cancer development.

regulatory network in prostate and breast adenocarcinomas [13]. Liang et al. constructed a ceRNA regulatory network for mesenchymal ovarian cancer and identified the downregulation of lncRNA protransition associated R (PTAR) s potentially inhibiting cancer metastasis by sponging miR-101 [14].
PTAF is a pivotal regulator of the epithelial-to-mesenchymal transition promoting the invasionmetastasis cascade of ovarian cancer. Furthermore, they demonstrated that the overexpression of PTAF can upregulate snail family zinc finger 2 (SNAI2) by directly sponging miR-25, leading to the promotion of ovarian cancer epithelial-to-mesenchymal transition and invasion [14]. Nevertheless, current research on the ceRNA regulatory network in ovarian cancer merely focuses on the competing relationship between lncRNAs and mRNAs [15][16][17][18][19], given that no circular RNA (circRNA) expression data are available for ovarian cancer. As a layer of the gene regulatory network, circRNAs feature a variety of biological processes, including tumour cell proliferation, migration, and invasion [20].
Investigating circRNA expression and correlation between circRNAs and other types of RNAs will shed light on the underlying molecular mechanisms of ovarian cancer.
Although several previous works emphasised the capability of RNA as cancer biomarkers, these transcriptome markers are not considered sufficient for clinical utility. The primary reason stems from the lack of stability or consistency across different cohorts or measurement platforms. Specifically, the limitations include (1) a low statistical significance and an instability of prediction models resulting from the small size of normal ovarian samples; (2) inconsistent transcriptome size, the amount of total RNA per sample from the tumour and from normal tissues; and (3) the datasets highly susceptible to technical variations such as batch effects. To overcome these challenges, several recent studies took advantage of the relative rank of genes and developed gene panels for cancer diagnosis and prognostics [21,22]. This strategy concentrates on the relative expression ordering of genes, rather than the gene expression abundance, which is not comparable across samples in many cases. However, the relative expression ordering ignores the size effect difference between the expressions of two genes. A large expression difference of two genes contribute equally to a small one, probably generating false positive discoveries.
In this study, we generated expression profiles from eight SOCs and eight normal ovary samples to characterise the full RNA expression pattern of human SOC. To comprehensively understand the alteration of RNAs, we first assessed the differentially expressed mRNAs, long non-coding RNAs (lncRNAs), and circular RNAs (circRNAs), which are immune-related. Based upon these, we then constructed a competing endogenous RNA (ceRNA) network affecting the expression of the proteincoding genes and selected them as candidate biomarkers due to their topological importance. We proposed a novel gene-pair filtering method to draw out significant expression reversals from multiple integrated cohorts, thereby mitigating the technical variation and improving the statistical power. Next, we developed a diagnostic indicator, called the SOC index using a machine-learning method, based on the gene pairs for the accurate discrimination between SOCs and normal controls. SOC index training on the collected cohort and two large cohorts achieved a high sensitivity and specificity across three independent cohorts. The SOC index also shows a significant correlation with tumour infiltration (CD4+, CD8+) and associated with clinical stage and prognosis. These findings suggest that the SOC index has the potential to detect SOC and monitor disease severity of patients. University. Both cancerous and normal ovarian tissues were quickly excised and snap-frozen in a liquid nitrogen tank at −80°C until further use.

RNA extraction
Tissues were homogenised in the TRIZOL reagent (Invitrogen, USA) using a Qiagen Tissuelyser. Total RNA was extracted in accordance with the manufacturer's protocol and then quantified using a NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). The RNA integrity of each sample was assessed by denaturing agarose gel electrophoresis.

Microarray experiments
To identify deregulated RNAs associated with SOC patient outcomes, we conducted a microarray study and profiled mRNAs, lncRNAs, and circRNAs, respectively. We performed Arraystar Human LncRNA Microarray V2.0 and Arraystar Human circRNA Array V2.0 analyses on all 16 samples. The expressions of lncRNAs and mRNA were quantified using the first platform, while the expressions of circRNAs were measured using the second. Total RNA from each sample was measured using NanoDrop ND-1000. Sample preparation and microarray hybridisation were performed based on the standard protocols of Arraystar (Agilent Technology, USA). Processing RNA was different between the two platforms. For the lncRNA platform, rRNA was removed from total RNA using the mRNA-ONLY™ Eukaryotic mRNA Isolation Kit (Epicentre Biotechnologies, USA). For the circRNA platform, total RNAs were digested with Rnase R (Epicentre, Inc.) to remove linear RNAs and enrich circular RNAs. Then, the two platforms followed the same steps below.
Each sample was amplified and transcribed into fluorescent cRNA utilising a random priming method (Arraystar Super RNA Labeling Kit; Arraystar). The labelled cRNAs were purified using the RNeasy Mini Kit (Qiagen, Germany). The concentration and specific activity of the labelled cRNAs (pmol Cy3/μg cRNA) were measured using NanoDrop ND-1000. Here, 1 μg of each labelled cRNA was fragmented by adding 5-μl 10 × blocking agent and 1-μl 25 × fragmentation buffer, heating the mixture to 60°C for 30 min, and then adding 25-μl 2 × hybridization buffer to dilute the labelled cRNA. Next, 50-μl hybridisation solution was dispensed into a gasket slide and assembled on the circRNA expression microarray slide. The slides were incubated for 17 hr at 65°C in an Agilent Hybridisation Oven. The hybridised arrays were washed, fixed, and scanned using the Agilent Scanner G2505C. We used the Agilent Feature Extraction software (version 11.0.1.1) to analyse the acquired array images.

Differential and functional analysis
The expression profiles of SOC patients and normal controls were analysed to identify differentially expressed mRNAs, lncRNAs, and circRNAs, respectively. Quantile normalisation was used to normalise the profiles to render the data comparable across samples. The significance levels were estimated using the Student's t-test and, then, adjusted using the Benjamini and Hochberg (BH) multiple testing correction method. The effect size was also taken into account using the twofold change [23,24]. In total, we obtained 1881 downregulated mRNAs and 1995 upregulated mRNAs, 1849 downregulated lncRNAs and 718 upregulated lncRNAs, as well as 122 downregulated circRNAs and 69 upregulated circRNAs. The differentially expressed mRNAs were subjected to Gene Set Enrichment Analysis (GSEA) [25] for functional enrichment analysis. We performed all analyses using R (R x64, version 3.5.2).

ceRNA network construction
We used the Pearson's correlation test to calculate the expression correlation among mRNAs, lncRNAs, and circRNAs. Only RNAs positively correlated with each other (Pearson correlation coefficient (PCC) > 0.9 and P < 0.01) were considered for the construction of the competing endogenous regulatory network [10,26]. The final network was illustrated using Cytoscape [27].

Denoising individualised pairwise analysis
The abundance of genes may vary across different detection platforms or preprocessing methods, but the relative ranking is stable in a pair of genes. Herein, we proposed denoising individualised pairwise analysis to identify differentially expressed gene pairs (DEPs). The gene expression profiles from different discovery cohorts were concatenated directly as a training set. Then, we performed pairwise subtraction for all genes to form gene pairs ( − ) in a single sample (Figure 2A).
However, the gene expression value may vary due to technical noise and can be expressed as = ′ + , where ′ is the true value of the gene expression and ε ∈ (−∞, +∞) represents the error of measurement. When subtraction is performed, it becomes ( Following the intrasample analysis, we conducted a population analysis to obtain the significance level of the gene pairs. For each gene pair, a contingency table was calculated without considering the samples within the difference threshold, given that those samples are regarded as noise ( Figure 2C).
We then calculated the Fisher's exact test, and the significantly different pairs (P < 0.001) were preserved as potential biomarkers.

SOC index generation
To further select fewer genes capable of achieving the optimal discrimination for SOC, we utilised the forward selection on gene pairs with AUC for performance evaluation. All possible combinations from one to ten pairs of genes were evaluated and fivefold cross-validation was applied to avoid bias within the data. When using more than three pairs, the best combinatorial pairs could reach an AUC of 1.00 for all five folds. Therefore, the three pairs (BIRC5:PRKCQ, PTK2B:OGN, and S100A14:NR2F1) were identified as biomarkers for SOC.
The SOC index was calculated using the least absolute shrinkage and selection operator (LASSO) [28][29][30], a modified linear regression model with regularisation that can avoid overfitting and improve the generalisability of a model. The three pairs (BIRC5:PRKCQ, PTK2B:OGN, and S100A14:NR2F1) were extracted from the training set. The linear model was optimised using the following loss function with an L1 penalty: cross-validation and calculated as 4.2 × 10 −4 , which had the best performance ( Figure 3E). Based on this, we trained and finalised the SOC index R = 0.154 * (BIRC5, PRKCQ) + 0.179 * (PTK2B, OGN) + 0.248 * (S100A14, NR2F1) + 0.645.

Correlation analysis with tumour infiltration
To explore the association with immune cell infiltration, Spearman's rank correlation coefficient was adopted to estimate the correlation between the SOC index and immune cells in cancer (|R| > 0.3 and P < 0.05). For each sample, we calculated the levels of immune cell infiltration using the Tumour Immune Estimation Resource (TIMER) [31]. with an absolute fold change >2 and a false discovery rate (FDR) adjusted P-value from a Student's ttest <0.10 ( Figure 1A). Using GSEA [25] Kyoto Encyclopedia of Genes and Genomes (KEGG) [32] functional enrichment analyses showed that DEGs are remarkably involved in the pathways related to cancer immunity, including T-cell activation, cytokine production, immune regulation, and tumourinfiltrating lymphocyte differentiation and migration ( Figure 1B and Figure S1). According to ImmPort [33] and ImmLnc [34], 270 mRNAs and 426 lncRNAs are immune-related (Figure 1C and   1D).

Identification of mRNAs that participate as ceRNAs in SOC
The competitive endogenous RNA (ceRNA) hypothesis revealed an intrinsic mechanism within RNAs that regulate biological processes. LncRNAs, circRNAs, and mRNAs act as miRNA sponges or ceRNAs by competing for the shared microRNAs (miRNAs) (Figure 1E). For instance, in the mRNA-miRNA-lncRNA interaction, changes in the expression of lncRNA alter the number of unbound miRNAs, thereby affecting the expression abundance of the target mRNA. To determine if RNAs participate as ceRNAs, we identified 255 mRNAs, 362 lncRNAs, and 91 circRNAs positively correlated with each other (Pearson's correlation coefficient (PCC) >0.9, P < 0.01) and established the competing endogenous regulatory network among these immune-related RNAs ( Figure 1F). Notably, 91 circRNAs in the network were identified as immune-related circRNAs in SOC, providing an opportunity for immunology studies of ovarian cancers involving circRNAs (Supplementary Table   S1; Figures S2 and S3).
Following topological network analysis [17,[35][36][37][38], we found that mRNAs have a significantly higher network degree and betweenness, while a lower transitivity than non-coding RNAs (Figure 1G), indicating that mRNAs tend to lie at the centre in star-like structures. Hence, they exert a greater interactive influence in the regulatory network. Beyond this, given that no circRNA and few lncRNA expression datasets are available for independent validation, we merely selected these mRNAs for further analysis and potential biomarker identification.

Identification of gene-pair markers
In addition to our cohort, we collected all publicly available tissue expression cohorts for SOC from The Cancer Genome Atlas (TCGA) [39] and the Gene Expression Omnibus (GEO) [40], including TCGA-Affy, TCGA-Agilent, GSE18520, GSE6008, and GSE40595 ( Table 2). mRNA from the six cohorts were quantified using three platforms: the Affymetrix HG-U133A Array, the Affymetrix HG-U133 Plus 2.0 Array, and the Agilent G4502A. In total, 255 genes in the ceRNA regulatory network were extracted from our cohort, from which 219 genes existed in all five public cohorts. We combined the two largest cohorts-TCGA-Affy and GSE18520-with our cohort to comprise the discovery set, consisting of 646 patients and 26 normal controls ( Table 2). The remaining cohorts served as independent validation sets.
A preliminary study showed that the six cohorts shared few differentially expressed mRNAs (Supplementary Figure S4), and using absolute gene abundance was not suitable to identify stable biomarkers across different platforms. Specifically, only one gene was commonly selected as differentially expressed across the six cohorts. Therefore, we proposed a denoising individualised pairwise analysis framework (Figure 2; see the Methods section) that selects differentially expressed gene pairs (DEPs) and constructed a SOC index by summarising these DEPs using the least absolute shrinkage and selection operator (LASSO). First, we performed subtraction between every possible pair of genes within each sample (Figure 2A), converting the difference between a pair into a 'greater' signal (1) or 'smaller' signal (0) with a noise interval of 0.5 expected to filter out most of the false discoveries ( Figure 2B). The gene expression value may include technical variation and is expressed by g + ε, where g is the true gene expression and ε ∈ (−∞, +∞) represents the error of measurement.
Therefore, we used the noise interval to filter out the difference induced by technical variation. If the difference within a pair did not exceed 0.5, the pairing signal was assigned 0. The signals for all possible gene pairs were defined as the relative expression level of the sample, taking into account the technical variation bias caused by data integration. We derived a pairwise spectrum of 23,871 gene pairs. For each pair in the pairwise spectrum, we obtained a contingency table across the population ( Figure 2C) and 52 DEPs composed of 52 genes were identified (P < 0.001, FDR corrected Fisher's exact test). A heat map displays the effect size (difference between two genes in a pair) of each gene pair ( Figure   3A). Notably, the denoised relative expression shows a clearer distinction between SOC patients and normal controls ( Figure 3B).
Next, we applied sequential forward selection on the 52 gene pairs and obtained three pairs (BIRC5:PRKCQ, PTK2B:OGN, and S100A14:NR2F1; Table 3). Together, these pairs exhibit an optimal classification performance in a fivefold cross-validation (accuracy = 1.00; Figure 3C). The genome position of these genes in different chromosomes appear in Figure 3D.

Validation of independent cohorts
To determine if the SOC index obtained from the discovery set is reproducible in other SOC cohorts, we applied the same locked model to three independent validation cohorts (GSE6008, GSE40595, and TCGA-Agilent) measured using three different platforms (Affymetrix HG-U133A Array, Affymetrix HG-U133 Plus 2.0 Array, and Agilent G4502A). Three-dimensional (3D) plots of the three pairs and the box plots of the SOC index in both GSE6008 and GSE40595 (Affymetrix HG-U133A Array platform) indicated that our denoising relative expression model established from the training set carried a sensitivity of 100%, a specificity of 100%, an AUROC of 1.00, and AUPRC of 1.00 (Figure 4: GSE6008 and GSE40595; Figure 5: GSE6008 and GSE40595). In TCGA-Agilent, we assessed the performance of the three pairs and the SOC index to again distinguish SOC from normal using the Agilent platform (Figure 4: TCGA-Agilent). We found that the SOC index resulted in a 93.8% sensitivity and 100% specificity with an AUROC of 0.98 and an AUPRC of 0.99 (Figure 4: TCGA-Agilent; Figure 5: TCGA-Agilent). Taken together, we observed a highly accurate discrimination in the three independent cohorts, indicating that the SOC index carries a very high predictive value in assisting SOC diagnosis.

SOC index indicates tumour progression
Given that the SOC index of the three identified gene pairs potentially discriminated between SOC and normal controls, we questioned whether the score correlated with the tumour status. To address this, we performed a tumour-infiltrating immune cell analysis using TIMER2.0 [41], and decomposed the bulk mRNA expression into cell-type proportion for our cohort and the other four cohorts (GSE18520, GSE40595, GSE 6008, and TCGA-Affy). We evaluated the correlation between the SOC index and the population of different cell types on five cohorts ( Figure 6A). For GSE18520, the SOC index has a significantly positive correlation with CD4+ ( Figure 6A (Figures 6B-E).
Furthermore, the SOC index associated with the clinical stage of SOC. Patients with stages IB, IC, and II disease tended to have a high SOC index reaching 1.10, while the SOC index for stages III and IV disease were substantially lower in the TCGA-Affy cohort ( Figure 6F). When stratifying patients into early and late stage disease, the SOC index for stage III and IV was significantly lower than that for stage I and II (Figure 6G; P < 0.005, Student's t-test). Moreover, patients were stratified into high-risk (n = 455) and low-risk (n=128) groups according to the median SOC index. We found that patients with a higher SOC index (>1.10) tended to experience better clinical outcomes with a longer survival when compared to those with a lower SOC index (Figure 6H; P < 0.004, log-rank test). Thus, our findings demonstrated that a higher SOC index corresponded to better overall survival and early stage SOC, illustrating the prognostic power of our gene-pair markers.

DISCUSSION
To obtain comprehensive RNA profiles for serous ovarian carcinomas (SOC), we used microarrays to measure mRNAs, lncRNAs, and circRNAs from eight SOC samples and eight normal ovarian samples.
We constructed a competing endogenous RNA regulatory network on the basis of the immune-related and differentially expressed RNAs. Owing to the topological importance and sufficient resources, mRNAs in the competing endogenous regulatory network from our cohort and the two largest public SOC cohorts were extracted to construct a diagnostic model. We identified three gene pairs based on their relative expression ordering and, then, combined them to form a SOC index using LASSO.
Validation in three independent cohorts reveals that the model is strongly reproducible and accurate (average AUC close to 1).  [43].
Additionally, the proposed SOC index reflects tumour infiltration and appears effective for determining SOC prognosis and stage stratification. The SOC index inversely correlated with CD8+/CD4+, which is lower in SOC patients and higher among normal controls. A higher SOC index score implies early stage SOC and an improved prognosis compared to a lower score (Figures 6G and 6H), which seems like a paradox but not rare in tumour biomarkers. For instance, the isocitrate dehydrogenase (IDH) mutation is oncogenic in gliomas, but is prevalent in lower-grade gliomas and the presence of the IDH mutation indicates a longer overall survival in patients with gliomas [44][45][46].
Normal ovarian samples are crucial for expression profiling studies relying on comparisons with malignant ovarian tissues. However, tissues from normal donors are rare for ovarian cancer because of the invasive procedure. In total, only 33 normal ovarian samples for gene expression were publicly available among TCGA and GEO before this study. We profiled the expression of eight normal ovarian samples from cervical cancer patients not metastasised to the ovaries, representing important support and complements to ovarian research.
In addition to the mRNA and lncRNA expressions, we measured genome-wide circRNAs using microarray previously unexamined, thereby providing an opportunity to investigate the regulatory role of non-coding RNAs in ovarian cancer. This research may also facilitate studies related to the mechanism and therapeutics involving circRNAs in ovarian cancer.
Given the rarity of ovarian samples, data integration is necessary for a large-scale study capable of obtaining accurate biomarkers [47]. The primary challenge lies in integrating cohorts, related to the technical variation between platforms and the batch effect from different experiments. To address this issue, the relative expression of gene pairs in each sample rather than the absolute expression value of a single gene was taken into account. Although this might lose some quantitative information from the expression data, datasets from different resources were integrated for model training, thereby substantially increasing the sample size and improving the statistical power of detecting reversal gene pairs. More importantly, the reversal gene pairs can be easily applied to independent individuals, since they do not require any extra preprocessing of population samples. Notably, neglecting the size effect difference of two genes for relative expression ordering result in a number of false positives among significant results, given that a large expression difference of two genes contribute equally to a small one. To address this, we used a parameter, difference threshold ∆, to reduce the rate of false positive discoveries.

CONCLUSION
In this study, we generated a whole transcriptome profile of human ovarian cancer by analysing eight serous ovarian carcinomas (SOCs) and eight normal ovary samples using microarrays, including mRNAs, lncRNAs, and circRNAs. We constructed a competing endogenous RNA (ceRNA) network involving these three types of RNAs and identified immune-related circRNAs from the network. Based upon that, we proposed filtering gene-pair method to integrate multiple cohorts and obtained SOC index. Our

DATA AVAILABILITY
Data are available upon request.

CONFLICTS OF INTEREST
The authors declare no competing interests.

Table and Figure Legends
Tables Table 1. Clinical and laboratory characteristics of patients with ovarian cancer and control samples. Table 2. Gene expression datasets used in this study. Table 3. Genome characteristics of the three gene pairs. Figure 1. Construction of the competitive endogenous RNA regulatory network.