Integrative transcriptomic, proteomic, and machine learning approach to identifying feature genes of atrial brillation

Yaozhong Liu Department of Cardiovascular Medicine, Second Xiangya Hospital, Central South University, Hunan Province, China. Fan Bai Department of Cardiovascular Medicine, Second Xiangya Hospital, Central South University, Hunan Province, China. Zhenwei Tang Department of Dermatology, Xiangya Hospital, Central South University, Hunan Province, China. Na Liu Department of Cardiovascular Medicine, Second Xiangya Hospital, Central South University, Hunan Province, China. Qiming Liu (  qimingliu@csu.edu.cn ) Department of Cardiovascular Medicine, Second Xiangya Hospital, Central South University, Hunan Province, China.


Background
Atrial brillation (AF) is the most common cardiac arrhythmia and is a leading cause of stroke, heart failure, and dementia [1]. AF currently affects over 30 million individuals worldwide [2], and this number is projected to grow dramatically over the next 20 years [3]. Despite > 100 years of basic and clinical research, the fundamental mechanisms of AF remain poorly understood.
Microarray expression analysis of atrial tissues can provide a global unbiased framework to characterize the transcriptional changes associated with AF. Advancement of high-throughput microarray technology is producing a large number of gene expression data, which are powerful tools for discovering and studying novel biomarkers for AF. However, analysis based on high throughput data may face the 'curse of dimensionality'. This refers to the phenomena that the amount of dependent variables increases greatly while the amount of samples is relatively small, increasing statistical errors [4].
Recently, Integrated transcriptomic and quantitative proteomic analyses have been widely used to promote a better understanding of the molecular mechanisms driving biological processes in cells and tissues [5]. Advances in mass-spectrometry (MS) provide an unprecedented opportunity for antibodyindependent proteome pro ling with approximately 80% of all proteins in major human tissues quanti able by this technique [6]. By integrating the transcriptomic and proteomic data, the 'curse of dimensionality' can be solved through cross-validation in the two levels. Besides, combining datasets from different origins by meta-analysis to increase the power of identi cation and using some machine learning algorithms can also improve the problems caused by this 'curse' [7].
Here, our objective was to elucidate a more complete understanding of molecular mechanisms underlying AF and to nd potential diagnostic and therapeutic targets. The integration of multi-omics data, along with the application of the machine learning approach, vouched for the identi cation of key pathways and feature genes in AF, which may help to investigate the underlying mechanism of AF and to discover potential diagnostic and therapeutic targets.

Microarray data collection and preprocessing
For the meta-analysis, AF microarray expression data sets were collected from NCBI Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/). Only microarray data that met the following criteria were included: (1) Datasets were produced by Genome-wide mRNA expression pro ling by microarray; (2) The experimental platform was GPL570 (Affymetrix Human Genome U133 Plus 2.0 microarray); (3) Data sets should be gene expression pro les of human atria tissues between AF and sinus rhythm (SR); (4) The minimum number of cases and controls was three. Then, the raw CEL les were downloaded and preprocessed using robust multi array average (RMA) algorithm with 'affy' package [8] implemented in R software. The quality of individual samples was assessed using the 'arrayQualitymetrics' packages [9]. The outlier samples were excluded if it was detected by array intensity distribution criteria. After that, raw CEL les of the rest samples were preprocessed again using RMA algorithm for background correction, quantile normalization, and summarization.
We then reannotated the probes of GPL570 as it improves accuracy and makes it possible to identify new transcripts. In brief, the probe sequences were downloaded from Affymetrix (affymetrix.com) and were remapped to the human genome (GRCh38 release 99 primary assembly) using the R package 'Rsubread' [10]. Then, the chromosomal positions of these probes were matched to the corresponding genome annotation database in Ensembl using the R package 'GenomicRanges' [11]. Probe sets that were mapped to >1 gene were removed to ensure the reliability of the reannotation. The median expression values among all multiple probe IDs were selected to represent the corresponding gene symbol. After that, 19557 unique genes were retained. The normalized and annotated datasets containing 19557 rows and 130 columns were used for further meta-analysis. GSE2240, which contained microarray expression pro les from 10 AF and 20 SR atrial samples, were preprocessed using RMA algorithm and annotated using 'annotate' and 'hgu133a.db' packages. The median expression values among multiple probe IDs were selected to represent the corresponding gene symbol.

Microarray meta-analysis using GeneMeta
'GeneMeta' Bioconductor package [12] in R was used to perform a microarray meta-analysis of data sets from different 'origins'. This package is based on the meta-analysis method proposed by Choi et al. [12] in which an overall ranked gene list is produced based on the false discovery rate (FDR) of each gene. In this study, samples regarded as the same 'origin' must come from the same tissue (left atria, right atria, etc) and the same microarray study. The Random effect model (REM) was used [13]. The false discovery rate (FDR) for each gene was obtained with the function "ZscoreFDR" using 1000 permutations. Genes with FDR<0.05 were considered as differentially expressed genes (DEGs).

Proteomics study
18 left atrial appendage (LAA) tissue samples were obtained as surgical specimens from patients with mitral stenosis undergoing cardiac surgery at the Second Xiangya Hospital of Central South University, including 9 with chronic AF and 9 with SR. The characteristics of all patients are presented in Table 2. For each clinical group, three samples were mixed into one pooled sample. Qualitative and quantitative proteomic analysis was performed using dimethyl label-coupled high performance liquid chromatography-tandem mass spectrometry (HPLC-MS/MS) and MaxQuant software [14]. Benjamini-Hochberg's method was used to calculate the FDR. DEPs were identi ed using a criterion of FDR<0.1 and fold change >1.2. The detailed procedure for proteomic study was described in Supplementary Material 1.

Pathway enrichment analysis
Metascape (https://Metascape.org/) is a web-based portal designed to provide a comprehensive gene list annotation and analysis resource for biologists [15]. It is one of the most effective tools to conducted muti-omics level enrichment analysis. To gain more insights of biological roles of identi ed DEGs and DEPs, we conducted pathway enrichment analysis of Gene Ontology biological process (GO BP), Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and Canonical pathway in Metascape tools. By inputting the lists of DEGs and DEPs simultaneously, Metascape online tools can identify commonlyenriched and selectively-enriched pathways from two levels, which enables a comprehensive assessment of the molecular features of the biological process.

Cross-validation between the transcriptomic and proteomic study
The DEGs and DEPs were further analyzed using VennDiagram to compare and identify the shared genes. To make the selected biomarkers more signi cant, we only select genes that have consistent expression trends (upregulated or downregulated) between the transcriptomic and proteomic levels for further analysis.

Feature selection and classi cation algorithm
The 130 samples involved in the meta-analysis were selected as the training set. The correlation-based selection (CFS) method [16] implemented in WEKA solfware [17] was used using the training set to select feature genes. Three popular state-of-the-art supervised classi cation methods (NB, Naive Bayes; SMO, sequential minimal optimization; and RF, random forest) were used for generating the classi cation models using WEKA with the default parameter settings. The three supervised machine learning algorithms were trained with the training set and were further validated by 6 fold cross-validation. The generated training models were compared based on their accuracy. The best trained classi er generated in the training set was further validated on the independent test set GSE2240. The performance of the classi er was evaluated using criteria including precision, recall, F-measure, Matthews correlation coe cient (MCC), AUC (area under receiver operating curve), and auPRC (area under precision-recall curve), true positive rate, false positive rate, and Kappa statistic.

Microarray data description and preprocessing
In the transcriptomic meta-analysis study, four microarray data sets were included containing a total of 54 SR and 79 AF atrial samples ( Table 1). The included raw CEL les were pre-processed and quality control analysis of the data sets (after normalization) led to the removal of 3 samples including GSM1005420, GSM3182694, and GSM3182707. After removing the outliers and reprocessing, the normalized data sets consisting of 130 samples were taken for further meta-analysis approach.

Identi cation of DEGs
As shown in Table 1, we only considered samples from the same study and the same tissue as the same 'origin', which led to a total of 7 different origins. We then performed a meta-analysis by using the R package 'GeneMeta' and DEGs were detected by comparing the differential expression levels between the AF and SR group. The results identi ed 863 genes as DEGs (FDR < 0.05; 485 up-regulated: z-score > 0; 378 down-regulated: z-score < 0) (Supplementary Table SI).

Results of proteomic study
The characteristics of the patients included in the proteomic study were balanced between the two groups, except for the left atrial (LA) size (Table 2). Figure 1A shows the procedure of the proteomic study. Pearson's correlation analysis indicated good repeatability between the samples ( Figure 1B). The mass accuracy of the MS data met the requirement ( Figure 1C) and the distribution of peptides' length agreed with the properties of tryptic peptides ( Figure 1D). In total, we identi ed 4489 proteins including 3606 quanti able proteins( Figure 1E)

Pathway enrichment analysis and visualization
Pathway enrichment analysis helps researchers gain mechanistic insight into gene lists generated from genome-scale (omics) experiments. This method identi es biological pathways that are enriched in a gene list more than would be expected by chance. Metascape helps to integrate different omics data such as genomics, transcriptomics, and proteomics, which enables a comprehensive understanding of a biological process. Unlike other methods, Metascape clusters enriched terms into non-redundant groups that will be critical for informing future studies. We visualized the top 20 clusters and chose the most signi cant (lowest p-value) term within each of the 20 clusters to represent the cluster. For the upregulated proteins and mRNAs, most of the top 20 clusters (19) were enriched in both protein and mRNA levels, which highly suggested the importance of these pathways in AF pathogenesis (Figure 2A).
While for the down-regulated ones, the top 20 clusters were mainly involved in energy metabolism-related pathways, and these pathways were only enriched in the protein level ( Figure 2B). To further capture the relationships between the terms, we selected a subset of representative terms from each of the 20 clusters (up to the 10 best scoring terms) and convert them into a network layout which was visualized within Cytospace (Figure 2, right part).

Cross-validation
To make the selected biomarkers more signi cant, we only select genes that have consistent expression trends (upregulated or downregulated) between the transcriptomic and proteomic levels for further analysis. As VeneDiagram showed (Figure 3), 23 up-regulated genes/proteins, and 7 down-regulated genes/proteins were identi ed to have consistent trends from two-level. These 30 genes/proteins were considered as important biomarkers for AF.

Performance evaluation of AF classi er
After feature selection using training set, the number of features reduced from 30 to 10 including CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, TNNII, and TRDN. After removing the bath effect using 'sva' packages in the R solfware, the expression values of these 10 features were used to develop the classi cation models using three supervised machine learning algorithms-NB, SMO, and RF by 6-fold cross-validation to classify AF and SR samples on training set. All classi ers performed well with a precision of 86.9% for NB, 86.3% for SMO, and 76.8% for RF (Table 3). Base on a comprehensive evaluation of precision and other popular measures, the NB classi er performed best and the constructed NB classi er using the training set was further evaluated in the independent test set. Among the 30 atrial samples, 24 of them (80%) were correctly classi ed. The performance criteria including precision, recall, F-measure, MCC, AUC, auPRC, true positive rate, false positive rate, and Kappa statistic were 87.5%, 0.8, 0.805, 0.661, 0.995, 0.995, 0.8, 0.1, and 0.609, respectively. Therefore, the overall measures of high accuracy con rmed the e cacy of the classi er to distinguish AF from SR samples, which further proved that the 10 gene feature are important biomarkers for AF.

Discussion
To our knowledge, this is the rst integrated transcriptomic and proteomic analysis of human AF atrial tissue, and the rst to identify feature genes of AF using machine learningPrevious transcriptomic studies have provided insights into the pathogenesis of AF [18,19]. However, these experiments are generally analyzed through a single data source or restricted to a fewer sample which can lead to biological and technical biases. Thus, the microarray meta-analysis was used in this study to integrate four microarray data sets of AF from GEO which led to the identi cation of 863 DEGs. To elucidate a more complete understanding of AF pathogenesis, we also conducted a proteomic study of local atrial tissue which identi ed 482 DEPs.
Pathway enrichment analysis can help to characterize physiological and functional changes associated with the changes in mRNA and protein expression in AF atrial tissues. For the upregulated mRNAs or proteins, the top 19 scoring items were enriched in both transcriptomic and proteomic levels, which vouched for the importance and signi cance of these pathways. Some of the items, such as 'PDGFRB PATHWAY', 'activation of immune response', 'muscle structure development', 'regulation of actin cytoskeleton', and 'leukocyte degranulation', have been proved to play key roles in AF progression [3,20]. For the downregulated mRNAs or proteins, the top 19 scoring items were only enriched in the proteomic level, and these pathways were mainly involved in metabolism regulation, such as 'mitochondrial respiratory chain complex assembly', 'TP53 regulates metabolic genes', and 'response to oxidative stress'. Besides, the 'Metabolism of lipids' pathway was enriched in two levels. These are in accord with the recent studies which highlighted the role of metabolic remodeling in AF [21][22][23]. The reason why these pathways are only identi ed in the protein level may be caused by some post-transcriptional and translational regulations.
After cross-validation between the two omics data. We identi ed 30 genes or proteins with the same trends between two levels. To make the selected features more signi cant and informative, the machine learning CFS feature selection method was adopted in the training set which led to the nal 10 features, wherein 8 are upregulated (CD44, CHGB, FHL2, GGT5, IGFBP2, NRAP, SEPTIN6, YWHAQ) and 2 are downregulated (TNNI1, TRDN). The NB classi er base on the expression values of these features in the training set can classify AF and SR samples with a precision of 87.5% and AUC of 0.995 in the independent test set.
Some of these feature genes have been reported to be associated with AF or its related pathogenesis. The CD44 related pathways including CD44/STAT3 and CD44/NOX4 signaling pathways can lead to atrial brosis [24] and Ca 2+ -handling abnormalities [25] during AF. Secretogranin-1 (CHGB) presents in the secretory granules in atrial myoendocrine cells and is co-localized with atrial natriuretic peptide (ANP) while CHGB genetic variation results in oxidative stress [26] and hypertension [27]. The four and a half LIM domains protein 2 (FHL2) is a component of the hypertrophic response and is found to be protective in cardiac hypertrophic through inhibiting MAPK/ERK signaling [28]. MAPK has been proved to function in AF context by mediating oxidative stress [29,30], epicardial adipose tissue remodeling [31], atrial brosis [32], load-induced hypertrophic response [33], and ionic channel remodeling [34]. Gammaglutamyltransferase-5 (GGT5) is con rmed to be closely associated with immune cell activation [35] and oxidative stress [36,37] and can be a potential biomarker of myocardial infarction [38]. Insulin-like growth factor-binding protein 2 (IGFBP2) belongs to the insulin-like growth factor-binding protein (IGFBP) family. Two recent studies observed a higher hazard of incident AF associated with higher mean levels of plasma IGFBP1 protein [39] and IGFBP3 protein [40]. Nebulin related anchoring protein (NRAP) is present in myo bril precursors during myo brillogenesis and thought to be involved in myo bril assembly [41], and its genetic variance is associated with cariomyopathy [42]. Septin-6 (SEPTIN6) is invovled in extracellular matrix remodeling [43]. 14-3-3 protein theta (YWHAQ) is a gene in the P53 network and has been shown to promote apoptosis directly upon genotoxic stress [44]. Another proteomic also identi ed YWHAQ as an important biomarker in AF [44]. TNNI1 encodes a troponin-I protein that is the dominant form of troponin-I expressed in the fetal/neonatal/infant heart, and its participants in AF remains unknown. Triadin (TRDN) is a stable subunit of the ryanodine receptor 2 (RyR2) and is involved in the regulation of Ca 2+ release [45]. The loss or dysfunction of RyR2 stable subunits was demonstrated to cause the occurrence of spontaneous calcium elevation in AF atrial cells [46]. Our present study further proved and emphasized the importance of these markers.
There are some limitations to the current study. Firstly, the number of samples included in the microarray meta-analysis remains relatively small (n=130), which is caused by the limited number of available studies in the GEO database. Secondly, there is no corresponding clinical information of the samples, we were not able to make a prognostic analysis of these biomarkers. Third, the samples used in the proteomic study came from patients with mitral stenosis. The psychophysiology of AF in mitral stenosis may have some differences from those with non-valvular AF. Finally, the transcriptomic and proteomic can only indicate the potential causes for a phenotypic response, but they cannot predict what will happen at the next level. Thus, one should consider the metabolomic that provides a functional view of an organism as determined by the sum of its genes, RNA, proteins, and environmental factors [47]. Nonetheless, the integrated analysis of multi-omics data along with the machine learning method makes sure the selected genes as important features for AF. Further studies are needed to clarify their functions in AF pathogenesis.

Conclusions
In conclusion, the current study identi ed a list of signi cantly dysregulated feature genes associated with AF using a multi-omics analysis. The machine learning feature selection identi ed 10 feature genes. Naive Bayes prediction model built in the training set using the expression pro les of 10 features performed accurately and reliably classi ed AF from SR samples in the independent test set. These ndings could provide novel insight into the pathogenesis of AF and suggested that the feature genes might be diagnostic and therapeutic targets for AF. Abbreviations AF, atrial brillation; GEO, Gene Expression Omnibus; MS: mass spectrometry; DEGs, and differentially expressed proteins; SR, sinus rhythm; RMA, robust multiarray average; REM, random effect model; FDR, false discovery rate; DEGs: differentially expressed genes; LAA, left atrial appendage; HPLC-MS/MS: highperformance liquid chromatography-tandem mass spectrometry; GO BP, Gene Ontology biological process; KEGG, Kyoto Encyclopedia of Genes and Genomes; CSF, correlation-based selection; NB, Naive Bayes; SMO, sequential minimal optimization; RF, random forest; AUC, area under receiver operating curve; MCC, Matthews correlation coe cient; auPRC, area under precision-recall curve.

Declarations
Ethics approval and consent to participate: The proteomic study was approved by the Ethics Committee of the Second Xiangya Hospital of Central South University. The research was carried out in accordance with the World Medical Association Declaration of Helsinki. Informed written consent was obtained from all patients. Consent for publication: Not applicable.
Availability of data and materials: The microarray datasets analyzed during the present study are available from the Gene Expression Omnibus repository (https://www.ncbi.nlm.nih.gov/geo). The results of proteomic study were submitted as supplementary material.
Competing interests: The authors declare that they have no competing interests.
Funding: This work was supported by grants from the National Natural Science Foundation of China (no.81770337). They had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Authors' contributions: YL and FB performed the bioinformatic analysis and was a major contributor in writing the manuscript. ZT and NL made important modi cations to the manuscript. YL and QL designed the research project and created the nal revision of the manuscript. All authors read and approved the nal version of the manuscript.