Gene datasets and clinical information
A series of matrix files for the GSE4122 dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4122) and the GSE47841 dataset [20] were obtained from NCBI-GEO. The platform used in the GSE4122 dataset was the GPL201 (HG-Focus) Affymetrix Human HG-Focus Target Array. A total of 67 samples were included in the GSE4122 dataset and divided into 4 groups (benign, malignant, borderline malignant and normal human ovaries). However, the sample size of the borderline malignant group which has only 3 samples is too small. Considering the bias of WGCNA algorithm, borderline malignant group was wiped out in the further analysis [21]. Therefore, 64 samples were divided into three groups based on pathology (benign, malignant and normal human ovaries) for the further analysis. Two miRNA datasets were included in the present study and the data search stage was shown in Figure 1. The platform used in the GSE47841 dataset was the GPL14613 (miRNA-2) Affymetrix Multispecies miRNA-2 Array. As described in the original study of the GSE47841 dataset, 30 samples were divided into three groups: high-grade serous ovarian carcinoma (HGSC), ovarian surface epithelium (OSE) and clear cell ovarian carcinoma. In the present study, only 12 HGSC and 9 OSE samples were included. The platform used in the GSE131790 dataset was the GPL8786 (miRNA-1) Affymetrix Multispecies miRNA-1 Array. As described in the original study of the GSE131790 dataset, 25 samples were divided into two groups: 19 HGSC samples and 6 normal samples.
Weighted gene corepression analysis
Sample clustering was performed to demonstrate the relationship between clinical traits and the expression profile. After the raw data of the GSE4122 dataset were preprocessed, weighted gene corepression analyses (WGCNA) were performed according to the previously described algorithm to identify significant gene modules [22]. The probe set was first filtered based on the variance of expression value across all samples. Probe sets with duplicated gene symbols were also removed based on expression variance. The R package WGCNA [23] was applied for this analysis. Briefly, Pearson’s correlation coefficients were calculated for selected genes in a pairwise manner yielding a similarity matrix (Sij). The soft threshold (power) was set as 7. The matrix was transformed into an adjacency matrix (aij) using a power function using the formula aij = Power (Sij, β) ≡ |Sij| β. Average linkage hierarchical clustering was then performed to identify modules of densely interconnected genes. Network interconnectedness was measured by calculating the topological overlap using the TOMdist function with a signed TOM-Type. Average hierarchical clustering was performed to group the genes based on the topological overlap dissimilarity measure (1-TOM) of their connection strengths. Then, the module that was most related to the disease phenotype was selected for further analysis. The coexpression with a weight score > 0.05 in the module was filtered. Finally, the coexpression network was visualized by using Cytoscape (www.cytoscape.org; version 3.4.2) and analyzed by computing the topological properties of the network, including the network distribution [24]. The subnetwork was then extracted from the entire PPI network by using the MCODE app according to the previously described algorithm [25].
Gene function analysis
GO enrichment analysis and KEGG pathway enrichment analysis of mRNAs was implemented with DAVID (https://david.ncifcrf.gov/). In short, the gene identifiers were first converted into their H. sapiens Entrez gene IDs based on the latest database. If multiple identifiers corresponded to the same Entrez gene ID, they were treated as a single Entrez gene ID in downstream analyses. For each given list of genes, pathway and process enrichment analysis was performed with the following ontology sources: KEGG pathway and GO biological processes (BP), cellular component (CC) and molecular function (MF). The species was limited to H. sapiens, and all genes in the genome were used as the enrichment backgrounds. More specifically, the p-value was calculated based on the accumulative hypergeometric distribution, and the q-value was calculated using the Benjamini-Hochberg procedure to account for multiple testing [26]. The Kappa score was used as a measure of similarity when performing hierarchical clustering on the enriched terms, and then subtrees with similarity > 0.3 were considered a cluster. The most statistically significant term within a cluster was selected as the one representing the cluster.
Validation in TCGA datasets
The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov) is a platform for researchers to download free public datasets [27]. In the present study, the expression of hub genes selected from the GSE4122 dataset was verified in TCGA using the online tool GEPIA (http://gepia.cancer-pku.cn) to improve the reliability of the analysis. A total of 426 normal samples and 88 tumor samples of ovarian serous cystadenocarcinoma (OV) in TCGA datasets were included in the present study. Box and whisker plots were used to show the expression levels of the hub genes. Fold change (FC) > 2 and p-value < 0.05 were used as cut-offs. The prognostic value of the hub genes was further confirmed by Kaplan–Meier survival analysis based on the clinical information from TCGA datasets using GEPIA. Based on gene expression values, the group cut-off was the quartile.
Identification of differentially expressed genes
In the present study, GEO2R was used to filter differentially expressed miRNAs between HGSC samples and OSE samples in the GSE47841 dataset. A FC greater than 2 and a P-value less than 0.05 were considered as the cut-off criteria, and probes were filtered without a corresponding gene symbol.
Construction of the TF-miRNA-mRNA network
The TFCheckpoint database [28] was used to find the transcription factors in differentially expressed genes identified in the GSE4122 dataset. The interactions between TFs and miRNAs were predicted by using the miRGen v.3 database [29]. The interactions between the differentially expressed miRNAs and differentially expressed mRNAs were predicted using miRWalk, TargetScan [30] and miRDB [31], and a score ≥ 0.95 was considered the cut-off criterion. Cytoscape software (version 3.40) was used to visualize the regulatory network.
Patient samples
A total of 5 ovarian serous cystadenocarcinoma specimens and 5 nontumor ovarian cancer epithelial tissues were collected from the First Affiliated Hospital of Soochow University (Suzhou, China). The average age of the patients was 45±5.5 years. Biopsy samples were collected from May 2017 to May 2018. Written informed consent was obtained from each patient, and the experimental protocols were approved by the Institutional Review Board of Soochow University. Each biopsy sample was divided into two sections. One section was submitted for routine histological diagnosis.
RNA extraction and reverse transcription‑quantitative polymerase chain reaction (RT‑qPCR) analysis
First, total RNA of tissues was extracted by fully grinding the sample with liquid nitrogen. Then, 1 ml of TRIzol (Zymo Research) solution was added, mixed, and placed at room temperature for 5 min to fully extract the total RNA. Then, samples were centrifuged at 14,000 g at 4℃ for 15 min, and the RNA, which was separated into the upper water phase layer, was transferred to another new RNase-free EP tube. Later, isopropanol was added to the samples, gently and thoroughly mixed, and allowed to stand at room temperature for 10 min. Next, the samples were centrifuged at 14,000 g for 10 min at 4℃, and the RNA precipitate was collected and washed twice with 75% ethanol. According to the amount of precipitate, an appropriate amount of DEPC water was added to dissolve the precipitate.
RNA samples were reverse transcribed into cDNA using TransScript One Step gDNA Removal and cDNA Synthesis SuperMix. RT qPCR was performed using TB Green Premix Ex Taq (Tli RNaseH PlTs, Takara, RR420A). The PCR conditions included an initial step at 95℃ for 10 min, followed by 40 cycles of amplification and quantification (95℃ for 15 sec, 60℃ for 1 min, and 60℃ for 1 min). GAPDH was used as an endogenous control for normalization. The sequences of the primers used for RT qPCR were as follows:
hsa-mir-193b forward, 5' GCAACTGGCCCTCAAAGT 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'; hsa-mir-296 forward, 5' TAGGGCCCCCCCTCAAT 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'; hsa-mir-29b-2 forward, 5' CGCTGGTTTCACATGGTG 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'; hsa-mir-29c forward, 5' GCTAGCACCATTTGAAAT 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'; hsa-mir-34a forward, 5' GCTGGCAGTGTCTTAGCT 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'; hsa-mir-421 forward, 5' CGATCAACAGACATTAATT 3' and reverse, 5' GTGCAGGGTCCGAGGT 3'.