Annotation of Gene Alteration in Amniotic Fluid Cell Free RNA Transcriptome via Weighted Gene Co-expression Network Analysis

Background Human amniotic uid (AF) cells are commonly used in prenatal diagnosis. The AF cell-free mRNA (cfRNA) derived from necrotic or apoptotic cells of fetus may provide feasible marker of organ development and fetal malformation. Analysis of gene co-expression network will help to annotate AF cfRNA gene variations in fetal diseases. Methods Datasets of amniotic uid free RNA were downloaded from the Gene Expression Omnibus database. Co-expressed modules based on normal fetus AF cfRNA transcriptome were established via Weighted Gene Co-expression Network Analysis. The relationship between modules and tissues were set up via tissue-specic gene expression analysis of genes in modules. Differential expressed genes in Down syndrome (DS), Edwards syndrome (ES), and Turner syndrome (TS) were analyzed via linear models implemented in the limma package. Gene Ontology (GO) analysis of modular specic differential expressed genes for these three syndrome fetus was performed separately. Based on relationship of GO terms, GO graph of all enriched GO terms were constructed. The associated enriched GO terms in GO graph were considered as the same subset. Numbers of GO terms of the same subset with diseases and modules were calculated. A total of 22 co-expressed were Most eigengenes showed no correlation with Probe sets with higher values showed signicant clustering Dominant tissues and modules were different for different modules and tissues respectively through analysis of tissue-specic genes distribution in The total numbers of enriched GO terms in six major related to for DS, and TS. Disease-specic modules for DS, and were relation to biological through GO analysis. Differential genes in disease-specic were identied ES, and TS.

Tissue-speci c gene expression analysis and gene set enrichment analysis are the major annotation methods for the AF cfRNA research. Tissue-speci c gene analysis is based on public gene expression data in human tissues, such as BioGPS [19], Genomics Institute of the Novartis Research Foundation [20], DFLAT [21], and Gene Atlas [20]. According to the results of tissue-speci c gene expression analysis [1,2,6,7,14,22,23], the genes derived multiple tissues (brain, liver, kidney, lung, blood, ovary, pancreas, tongue, adrenal gland, thyroid, etc) are detected in AF cfRNA. Enrichment analyses of different expressed genes in abnormal fetus AF cfRNA include Gene Ontology analysis [13,23], Kyoto Encyclopedia of Genes and Genomes pathway analysis [17], and QIAGEN's Ingenuity Pathway Analysis (IPA) [23]. Gene Ontology analysis was the most frequent method used.
Disease-associated pathways and -biological processes are enriched via enrichment analysis. Hematologic/immune category functions and gastrointestinal/metabolic functions are signi cantly enriched in Turner syndrome (TS) [10]. The neurological disease is identi ed as the most highly enriched disease pathway in Down syndrome (DS) and Edwards syndrome (ES) [1].
Embryonic development comprises the dynamic complex processes, including multiple coordinated development and gene co-expression intra-tissues or inter-tissues. The above relationships are not included in tissue-speci c gene analysis and gene function enrichment analysis. We assumed that gene co-expression networks in AF cfRNA may re ect these coordinated development and gene co-expression.
To verify this hypothesis, AF cfRNA array data from GEO database were analyzed for weighted coexpression networks analysis. Enrichment analysis of different expressed genes and tissue-speci c gene analysis were also performed.
Table S1 containing 26 normal samples with gestation age information were used in co-expression networks analysis. Samples in Table S2 (7 DS versus 7 normal fetus), Table S3 (5 ES versus 6 normal  fetus), and Table S4 (5 TS versus 5 normal fetus) were used in differential genes analysis for DS, ES and TS groups. Microarray data were preprocessed (background correction, normalization, and summarization) using the Robust Multi-array Average approach [25] implemented in the oligo package [26]. Probe sets were converted to gene symbol based on Affymetrix Human Genome U133 Plus 2.0 Array annotation data ("hgu133plus2.db" package).

Co-expression net works analysis
An unsigned co-expression net works was performed with Weighted Gene Co-expression Network Analysis (WGCNA) package [27]. Normal second trimester samples (Table S1) with available gestational ages were included in WGCNA analysis. Before WGCNA analysis, raw expression values of probe sets were preprocessed, and the batch effects were removed with ComBat function in the SVA package [28].
Soft-thresholding power was set to 15, which was calculated by the pickSoftThreshold algorithm. The automatic block-wise network construction and module detection method was employed. Module assignments were made using dynamic tree cutting method by setting mergeCutHeight to 0.2. Module eigengene was calculated via moduleEigengenes algorithm. To generate co-expressed modular genes, probe sets in clustered modules were annotated to genes via "hgu133plus2.db" package.
Tissue-speci c gene analysis Gene expression data in human tissues were downloaded from tissue-based map of the human proteome [29] in the Human Protein Altas database (http://www.proteinatlas.org). The Human Protein Atlas aims to map all the human proteins in cells, tissues and organs using the integration of various omics technologies. Tissue-speci c genes in a speci c tissue were de ned as expression value in particular tissue higher than cut-off value times the median expression value of all other tissues. Cut-off value was set as 5, 10, 15, 20, 25, and 30 respectively. Tissue-speci c gene names were transformed to probe sets for further analysis via "hgu133plus2.db" package. Tissue-speci c genes that were not tested by array were discarded. Distribution of Tissue-speci c genes (probes) at a different cut-off values in coexpression modules were analyzed via R script.
Gene differential expression analysis Before gene differential expression analysis, probe sets expression values were preprocessed (background correction, normalization, and summarization) and log2-transformed. Differential expressed probe sets were analyzed via linear models implemented in the limma [30] package in Bioconductor [31]. Differential expressed probe sets (DEPs) were deemed signi cant based on p-values < 0.05. Differential expressed genes (DEGs) were obtained after transforming differential expressed probe sets into genes via "hgu133plus2.db" package.

Gene ontology analysis
According to the results of WGCNA analysis, modular distribution of DEPs from ES, DS, and TS groups were analyzed. After transforming probe sets into gene names, modular speci c DEGs (DEGs in the certain module) from ES, DS, and TS groups were obtained. Gene ontology analysis of modular speci c DEGs from ES, DS, and TS groups was performed separately via the clusterPro ler package [32]. P-values were adjusted using the Benjamini-Hochberg method. Adjusted p-values < 0.05 were considered statistically signi cant. Based on the relationship of GO terms (is_a, part_of, regulates, positively_regulates, etc), GO graph of all enriched GO terms from ES, DS and TS groups were constructed. The associated enriched GO terms in the GO graph were considered as a same subset.
Numbers GO terms of the same subset concerning diseases and modules were calculated.

Statistical Analysis
All other data statistical analyses and plots were created using R language (version 3.5.3), ggplot2 (version 3.3.1) and VennDiagram (version 1.6.20).

Summary of gene co-expression network analysis
A total of 26 second term normal AF cfRNA chip results (15 males and 11 females) were selected for WGCNA. The mean gestation ages of male and female fetus were 15 and 19 weeks respectively (Table  S1). Raw data from each microarray were pre-processed for background correction, normalization and batch correction ( Fig 1A). All the 54675 probe sets on the chip were analyzed by WGCNA. For block-wise network construction, a computationally inexpensive and relatively crude clustering method was adopted to classify pre-cluster probe sets into three blocks. A full network analysis was performed in each block separately. A total of 22 distinct probe set modules were generated (Fig 1 B-D). These 22 modules were shown in different colors and module names were labeled as colors, i.e., a color represented a group of co-expressed genes. The size of modules ranged from 20 to 1198 probe sets (Table S6). Probe sets without obvious co-expressions relationship were labeled as grey. After probe sets were converted to gene symbol, co-expressed genes in 22 modules were obtained and summarized in Table S5.
Most of the co-expressed modules showed no correlation with gestation weeks The correlation between MEs and gestation age was evaluated by Spearman correlation analysis. As shown in Table S6, most modules showed no signi cant correlation with gestation age except for green and turquoise modules. Modules were clustered using hierarchical clustering. Green and turquoise modules showing signi cant correlation with gestation age were classi ed into different clusters.
Modules with similar correlation coe cient were divided into different clusters (Fig 2A). MEs of major modules (turquoise, blue, brown, yellow, green and red) were smoothed by locally weighted regression and were shown in Fig 2B. A downward trend was observed for MEs of green and turquoise modules with the increase of gestation age. No obvious trend was found for MEs of other modules concerning gestation age. Expression of genes in green and turquoise modules reduced with augment of gestation age in second term.
High expression probe sets showing signi cant clustering tendencies Probe sets with higher expression values are more inclined to be used as a marker for fetal organ development. To investigate the expression character of clustered probe sets, mean values and coe cient of variation for all probes in the normal 26 samples employed in WGCNA analysis were calculated. A scatter plot (mean expression value VS coe cient of variation) was drawn to show clustered probe sets distribution in all probe sets, and the clustered probe sets were labeled as modular colors ( Fig 3A). Compared to probe sets with low expression values, the higher portions of the probe sets with higher expression values were clustered.

Relationship between modules and tissues via dominant tissues and modules
To establish the relationship between fetal tissues and modules, the distribution of tissue-speci c genes in modules were analyzed. Tissue-speci c genes were obtained at a different cut-off value conditions. After tissue-speci c genes were converted to probe sets, numbers of tissue-speci c genes (probe sets) under different cut-off value (5, 10, 15, 20, 25 and 30) were counted (Fig 3B). The number of tissuespeci c genes (probe sets) decreased with higher cut-off value. The distribution of tissue-speci c probe sets in all 22 modules, was analyzed and summarized in Table S6. As shown in Fig 3C, tissue-speci c probe sets were mainly distributed in turquoise, blue, brown, yellow, green, and red modules, which were called major modules.
Considering the difference of gene expression in fetus and adult tissues, the cut-off value was set as 5 to get more gene speci c genes. The numbers of tissue-speci c probe sets derived from different tissues in major modules were shown in Fig 3D, while cut-off value was set as 5 (detailed data see Table S7). Tissue-speci c genes from skeleton, liver, and testis accounted for the largest speci c genes in turquoise module. The above three tissues were de ned as dominant tissues in turquoise module. Similarly, dominant tissues were calculated in blue module (placenta, skeletal muscle, and testis), brown module (testis, cerebral cortex, and skeletal muscle), yellow module (testis, cerebral cortex, and cerebellum), green module (small intestine, liver, and colon) and red module (esophagus, tongue, and tonsil). Tissue-speci c probe sets derived from liver mostly distributed in green and turquoise modules. Green and turquoise modules were de ned as dominant modules for liver. Dominant modules for cerebral cortex included yellow and blue modules. Dominant tissues and dominant modules were different for different modules and tissues respectively.
Blue, brown, and yellow modules included the largest neural-speci c genes (cerebral cortex, cerebellum, basal ganglia, etc). Green modules contained more digest system speci c genes.
Differential expressed genes in DS, ES, and TS Samples used for DEGs analysis were listed in Table S2 (DS), Table S3 (ES), and Table S4 (

Disease-speci c modules via GO analysis
Numbers of DEPs of DS, ES, and TS in major clustered modules were counted (Fig 5A). Blue module was the most abundant distributed module in TS and DS groups. However, the brown module had the largest number of DEPs of ES. To analyze the function of clustered genes, DEPs were converted to gene symbols. The modular distribution of DEGs in three groups was summarized in Table S8.
Modular speci c DEGs, which mean DEGs of every group (DS, ES, and TS) in a certain module, were extracted from Table S8. Modular speci c DEGs in major modules (turquoise, blue, brown, yellow, green, and red) were performed functional enrichment analysis (Table S10). The numbers of enriched GO terms were shown in Fig 5B. The yellow, blue, and red modules were speci c modules for DS, ES, and TS groups respectively. DEGs of DS in the yellow module included SORBS1 and FFAR4. DEGs of ES in blue module included AKNAD1, SOX9, ZNF395, PID1, PRPF38A, FAM220A, ATP1B1, NME7, CD9, EPC1, and MAML2. DEGs of TS in the red module included S100A8 and IVL. The total number of enriched GO terms in ES, DS, and TS group ranked the rst, the second, and the third respectively. In green and turquoise modules, similar numeric distribution of enriched GO term was shown for DS and ES, rather than TS. In green and turquoise modules, a similar number of GO term numbers were enriched in DS and ES group.
To get an overall view of all enriched GO terms, the GO graph was established based on the relationship of GO terms. Go maps of all enriched GO terms were list in Table S9. Based on the GO graph, interrelated GO Terms were classi ed into the same subsets. As shown in Table S10, subsets contained a certain number of enriched GO terms from different modules and abnormal fetus (DE, ES, and TS). A total of 184 subsets were established. The numbers of GO terms in a subset ranged from 1 to 332. The largest nine subsets were summarized in Table 1. Functions of these subsets included basic physiological processes, absorption and transport of nutrients, response to external stimuli, multi-organ (kidney, lung, and heart) development, protein synthesis process, protein catabolic process and proteolysis, thermoregulation, signal transduction of TGF, and bone development.  Function: summarized function according to in corresponding biological GO term subsets.
To analyze the relationship between subset function and disease in different modular condition, enriched GO terms numbers in the nine largest subsets (Table 1) in relation to modules and fetus disease were shown in Fig 6. Besides that yellow, blue, and red modules were speci c modules for DS, ES, and TS groups, more speci c modules were shown in certain subsets. Green, turquoise, and brown modules were speci c for ES group in absorption and transport of nutrients subset. The green module was speci c to the DS group in multi-organ (kidney, lung, and heart) development subset, protein synthesis process subset, thermoregulation subset and signal transduction of TGF subset.
Compared to single genes, the advantages of gene sets contains noise and dimension reduction, as well as greater biological interpretability [33]. A total of 22 co-expressed gene sets (modules) were identi ed via WGCNA [27]. As shown in Fig. 3A, probe sets with high expression values show obvious clustering tendency. Analogous to dimension reduction, expression values of clustered probes can be simpli ed as module eigengenes. Correlation between gene expression and gestation ages is easier to analyze via module eigengenes. As shown in Fig. 2B, only genes in green modules showed signi cant correlation with gestation ages. The above results indicated that correlation between gene expression and gestation ages was not an appropriate way to discover the biomarker of the abnormal fetal development.
To explore the tissue origins of genes in AF cfRNA, different gene expression atlas and cut-off values were used to de ne tissue-speci c genes in previous studies [1,5,10,34,35]. However, there was no fully comparison to get con rmed standard. Consequently, a series of cut-off values were used to de ne tissue-speci c genes expression in our research. Based on tissue-speci c gene expression, the relationships between modules and tissues were constructed. Genes from different organs showed similar co-expression patterns and were classi ed as the same module. Inter-organ synergetic development was manifested as a co-ordinated expression of genes intra modules. Due to the complexity of gene expression in the same organs, genes from the same organ showed different co-expression patterns. Accordingly, tissue-speci c expressed genes from the same organ were divided into different modules. Thus, synergetic developments inter-and intra-organs were shown as gene co-expression networks intra-and inter-modules via gene tissue-speci c analysis and weighted correlation network analysis.
DS, ES, and TS were the most frequent chromosomal abnormalities in prenatal diagnosis [36], which were caused by trisomy of chromosome 21, trisomy of chromosome 18, and sex chromosome aneuploidy. The development of multiple tissues and organs was affected in these three symptoms. ES is the most serious defects and most ES cause spontaneous abortion. The most frequent clinical features of ES consist of neurological ndings, growth disturbances, malformations of the skull, face, thorax, abdomen, limbs, genitals, skin, skin annexes, and internal organs [37]. Cardiac, airway, pulmonary, hearing, growth, hematologic, oncologic, autoimmune, musculoskeletal, and neurodevelopmental disorders are the most frequent clinical symptoms in DS [38]. Endocrinal, gastrointestinal, hepatic, phenotypic, neurocognitive, and psychosocial disorders [39] were reported in TS patient. Compared with the normal fetus, the differential expressed genes of DS [8], ES [9], and TS [10] were analyzed. Different from common affected organs or closely related clinical symptoms, different expressed genes from these syndrome share few common differential expressed genes [4].
To explain this contradictory phenomenon, DEGs from DS, ES, and TS compared to normal fetuses were extracted. Similar to Zwemer's report, few common misregulated genes were detected in three groups [4] ( Fig. 4D). Interestingly, the rank of the total number of enriched GO terms in ES, DS, and TS group was correlated with disease severity (Fig. 5B). Biological functions of enriched GO terms were simpli ed via classifying interrelated GO terms into the same subsets. A subset contains function-related GO terms performing associated embryonic development process. In most subsets, the number of enriched GO terms was positively related with disease severity (Fig. 6D). Another tendency, shown in the modular distribution of GO term subsets was modular heterogeneity, which suggested disease-speci c modules were detected in functional enrichment analysis of modules related DEPs. Pathogenesis-related different expressed genes were classed into different modules in different diseases. In absorption and transport of nutrients subsets, DEGs in DS and ES were associated with yellow and red modules respectively.
Consequently, Preprocessing via co-expression analysis could help constructed the relationship between DEGs and disease phenotype.
Large numbers of GO terms were enriched in most modules with a small amount of DEGs, as shown in Fig. 5B. To explain this phenomenon, DEGs in disease-speci c modules were extracted. SORBS1 and FFAR4 were two of DEGs from the yellow module in the DS group. SORBS1 was associated with insulin signaling pathway [40], glucose homeostasis [41,42], cancer growth and migration [43,44], etc. FFAR4 have similar functions including glucose and fatty acid metabolism [45,46], glucose-dependent insulinotropic polypeptide secretion [47], cell cancelation [48,49], etc. This is accounted for enriched results of yellow modules in DS group. In subsets of a basic physiological process, absorption and transport of nutrients, response to external stimuli, protein synthesis process and thermoregulation were enriched. Taken together, genes without co-expression relationships were ltered via WGCNA. The remaining co-expressed genes were easier to be enriched via functional enrichment analysis.

Conclusion
Co-expression modules based on samples from the whole second gestation (from 15 to 22 weeks) provided gene classi cation in this study. Via tissue-speci c genes analysis, the relationships between modules and tissues were established. Disease-speci c modules can help to annotate AF cfRNA variation in fetal diseases. Pre-grouping via co-expression before functional enrichment analysis could help to build a relationship between DEGs and disease phenotype. With the accumulation of datasets of AF cfRNAs, gestation week speci c co-expression modules could be constructed, in which more genes in AF cfRNA would be included. With the ner co-expression modules for different gestation age, comprehensive knowledge for fetal coordinated development and gene co-expression will be elucidated.      red modules were speci c modules for DS, ES and TS groups, more speci c modules were shown in certain subsets. Green, turquoise and brown module was speci c for ES group in absorption and transport of nutrients subset. Green module was speci c to DS group in multi-organ (Kidney, lung and heart) development subset, protein synthesis process subset, thermoregulation subset and signal transduction of TGF subset.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.