Construction of diagnostic markers for hub lncRNAs in Parkinson’s disease based on chip re-annotation


 BACKGROUND

Parkinson’s disease (PD) is a progressive neurodegenerative disease that is also the most common motor disorder and is accompanied by the loss of DA neurons in the brain. Long non-coding RNAs (lncRNAs) have recently been identified as new genetic entities that regulate cellular processes. One of the main functions of lncRNAs is the regulation of the expression of specific genes in multiple steps, including the regulation of transcriptional and post-transcriptional mechanisms and epigenetics.
MATERIAL AND METHODS

Here we downloaded three sets of expression-spectrum data for PD from the GEO database. The data were re-annotated with R package, which were integrated into a set of expression profiles for the analysis of differentially expressed lncRNAs. Subsequently, lncRNA/mRNA co-expression modules were identified through a weighted co-expression analysis and lncRNAs were expressed based on binding differences. The diagnostic tags of PD were filtered with key modules and were finally used to build the PD diagnostic prediction model.
RESULTS

Based on lncRNA re-annotation, a total of 1931 lncRNA expression values in the three sets of data were obtained and significant differences in expression (P < 0.05) for a total of 162 lncRNAs. A total of 21 modules were identified through WGCNA and five modules were selected. We screened 12 lncRNAs (AUC > 0.6) as PD diagnostic markers and as features to construct a SVM classification model. The model had good predictive ability in the training set and verification set (AUC of 0.9928 and 0.464, respectively), which illustrated their potential as diagnostic markers of PD.
CONCLUSIONS

This study provided new molecular entities for the diagnosis of PD, which may promote the early detection of this disease and the development of personalized therapies.


Background
Parkinson's disease (PD) is a progressive disorder and one of the most common degenerative neurological diseases worldwide, after Alzheimer's disease (AD) [1]. Because of the diversity of the initial symptoms, PD diagnosis is highly di cult, resulting in confusion and delays in diagnosis and directly affecting the post-treatment stage. Therefore, there is an urgent need to identify useful biological markers of early-stage PD [2,3]. The continued research on this subject included the exploration of body-uid and imaging markers; however, because of the drastic heterogeneity of this disease, no reliable biomarkers are available currently [4,5].
The term non-coding RNA refers to functional RNA molecules that cannot be translated into proteins.
Among them, common regulatory non-coding RNAs include small interfering RNAs (siRNAs), micro RNAs (miRNAs), Piwi-interacting RNAs (piRNAs), and long non-coding RNAs (lncRNAs). In particular, lncRNAs have become key regulators of different genetic regulatory layers. LncRNAs are typically expressed in more cell types and tissue-speci c terms than are mRNAs or miRNAs; thus, they exhibit great advantages and are being prioritized as diagnostic and prognostic markers [6]. Increasing evidence suggests that lncRNAs have key biological functions in the brain, as lncRNAs have been associated with neurodegenerative diseases, such as AD and PD [7].
This research compared lncRNA data pertaining to a PD group from a database with that from the brain tissues (substantia nigra) of normal individuals using bioinformatics. The differential expression of lncRNAs was screened through the Gene Co-Expression Network (via WGCNA) and the SVM pattern was validated, thus laying a solid foundation for the identi cation of biological markers of PD.

LncRNA expression pro les in PD
Three sets of data were obtained from the HG-U133_Plus2 platform of the Gene Expression Omnibus (GEO) database, numbered GSE49036 [8], GSE20141 [9], and GSE7621 [10]. The date of download was 2019.1.5. The GSE49036 set contains 20 disease samples and eight control samples, GSE20141 contains 10 disease samples and eight control samples, and GSE7621 contains nine normal samples and 16 disease samples. We downloaded the original cell data of the three sets of data separately.
1) The robust multichip average (RMA) method of affy [11] was used to standardize the three sets of data; 2) The batch function was removed using the SVA package combat function in the R language; 3) Probes were mapped to genes/lncRNAs, with multiple probes corresponding to the median of a gene and one probe corresponding to the elimination of multiple genes. The data analysis process used here is shown in Supplementary Fig. S1.

LncRNA re-annotation
Based on the NetAffx annotation of the probes and the Refseq and Ensembl annotations of lncRNAs, we identi ed 2448 probes (1970 lncRNAs) that were represented on the Affymetrix HG-U133 Plus 2.0 arrays (S1.xls). Of these, 725 probes (510 genes) were annotated as lncRNAs by both the Refseq and the Ensembl databases; 512 probes (379 genes) were annotated only by the Refseq database, and 1211 probes (1081 genes) were annotated only by the Ensembl database. The probes that were annotated by both databases but had controversial de nitions were excluded from our study.

Identi cation of differences between lncRNAs and mRNAs
To screen for genes and lncRNAs that were greatly changed in the PD samples, we used the limma package of the R software [12] to perform differential gene screening using a fold change < − 1.2 or > 1.2 and a P value < 0.05 as the threshold.

Identi cation of lncRNA/mRNA co-expression modules
To identify the PD-related lncRNA and mRNA co-expression modules, we rst combined the lncRNA and mRNA expression pro les to further remove outlier samples and mRNA/lncRNA modules with a variance < 0.5, to obtain large-variation mRNA/lncRNA modules (DVGLs). The weighted co-expression network analysis nds a module that has a co-expression relationship with a lncRNA and uses Fisher's exact test to screen the modules that are signi cantly enriched in that lncRNA. Speci cally, we used the WGCNA [13] package in R to construct a scale-free co-expression network for the DVGLs. First, Pearson's correlation matrices and an average linkage method were both applied to all pairwise DVGLs. Subsequently, a weighted adjacency matrix was constructed using a power function, A = |C | (C = Pearson's correlation between DVGL m and DVGL n; A = adjacency between DVGL m and DVGL n). β is a softthresholding parameter that emphasizes strong correlations and penalizes weak correlations between DVGLs. After choosing the power of β, the adjacency was transformed into a topological overlap matrix (TOM), which measures the network connectivity of a DVGL de ned as the sum of its adjacency with all other DVGLs for the network DVGL ratio, and the corresponding dissimilarity (1-TOM) was calculated. To classify DVGLs with similar expression pro les into DVGL modules, average linkage hierarchical clustering was conducted according to the TOM-based dissimilarity measure with a minimum size (DVGL group) of 30 for the DVGL dendrogram. To analyze the module further, we calculated the dissimilarity of module eigen DVGLs, chose a cut-off value for the module dendrogram, and merged several modules. Finally, the number of lncRNAs and mRNAs in each module was counted. Fisher's exact test was used to identify modules with signi cant enrichment of lncRNAs.

Functional enrichment analyses
Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed using clusterPro ler of the R package [14] for genes associated with lncRNAs in modules with signi cant enrichment, to identify over-represented KEGG pathways. In both analyses, statistical signi cance was set at P < 0.05.

Identi cation of co-expression modules related to
Parkinson's disease DisGeNET (http://www.disgenet.org) is a discovery platform that contains one of the largest publicly available collections of genes and variants associated with human diseases. Genes related to PD were screened in the DisGeNET database, and genes related to PD, Parkinsonian tremor, Parkinsonian Disorders, and Parkinsonism were searched based on keywords; a total of 1598 genes were collected (S6.xlsx). After further ltering, 1168 unique IDs were retained, including 1040 genes that appeared in our data set. We counted the number of disease genes in each module, as shown in Table 1, where the P value represents the degree of signi cant aggregation of disease genes in this module. Subsequently, we determined the mRNA and Parkinson gene correlation of each module by Fisher's test.

Identi cation of lncRNA diagnostic markers
First, we selected the intersection of lncRNA and differential lncRNA genes in the disease-related and lncRNA-enriched modules and calculated the area under the receiver operating characteristic (ROC) curve (AUC) of each lncRNA. We then screened lncRNAs with an AUC > 0.6 as candidate diagnostic markers for PD and used a literature-mining approach to analyze the candidate diagnostic markers regarding their relevance to PD.

Construction and validation of a PD lncRNA diagnostic model
We used the PD-related candidate lncRNAs to build a diagnostic prediction model based on support vector machine (SVM) [15] classi cation to predict PD and normal healthy samples. The SVM classi cation is a supervised learning model in machine learning algorithms that can be used to analyze recognition patterns in data. An SVM constructs a hyperplane that can be used for classi cation and regression in high or in nite dimensional space. Given a set of training samples, each label belongs to two categories, and an SVM training algorithm establishes a model by assigning a new instance to one class or another, thus rendering it a non-probabilistic binary linear classi cation. We randomly divided all samples into a training data set and a veri cation data set. The model was built using the training data set based on the 10-fold cross-validation veri cation model classi cation ability. Subsequently, we used the established model to predict the samples in the validation data set. The model's predictive ability was evaluated using the AUC, and the model's predictive sensitivity and speci city for PD were analyzed. The 71 samples included in the set were randomly divided into two sets, the training data set and the veri cation data set, while ensuring that the ratio of PD to normal control samples was similar between the two data sets. The training data set contained 35 samples (23 PD samples and 12 normal control samples) and the validation data set contained 36 samples (23 PD samples and 13 normal control samples).

KEGG pathway enrichment analysis of lncRNAs
To assess further the function of the lncRNAs identi ed in the previous analyses, we used the singlesample gene set enrichment analysis (ssGSEA) [16] method of the GSVA package of the R software to perform a KEGG pathway enrichment analysis for each sample through gene expression pro ling. We obtained an enrichment score for each pathway and further calculated the relevance of the lncRNA expression. The 20 most relevant KEGG pathways were selected.

Statistical analysis
All analyses that do not specify parameters used the default parameters of the software. Signi cance was set at P < 0.05. All statistical analyses were performed in R 3.4.3.

Data processing
Each set of data obtained from the GEO database was pre-normalized and post-normalized using RMA. The results are shown in Supplementary Fig. S2. The results of lncRNA re-annotation are listed in S1.xlsx.
Next, we extracted the three sets of data and nally obtained a combined data set.

Differential lncRNA and differential gene analysis
After removing the batch effect on the three sets of data, we extracted lncRNA probes according to the annotation information and transformed the probe ID into Gene-Symbol. Finally, the differences of 162 lncRNAs was identi ed, 91 of them were upregulated and 71 were downregulated in PD samples. Using the pheatmap R package to draw a heat map (Fig. 1), we can see from the diagram that these lncRNAs differ in PD from normal control samples.

Co-expression analysis of lncRNAs and genes
A sample clustering analysis identi ed two abnormal samples, as shown in Fig. 2A. After removing the outlier samples (69 samples), 9755 genes/lncRNAs were nally obtained. Pearson's correlation coe cient was then used to calculate the distance between each gene and the lncRNA, and the WGCNA package of the R software was used to construct the weighted co-expression network and to select the soft threshold, as depicted in (Fig. 2B and C). Finally, 21 co-expression modules were identi ed (Fig. 2D). The number of differential lncRNAs in each module is shown in Table 2, where P values represent the degree of signi cant aggregation of the differential lncRNAs in this module. We found that ve modules (black, blue, midnight-blue, tan, and turquoise) were signi cantly related to the differential lncRNAs.
The ve modules were enriched in multiple KEGG pathways, with little intersection between these pathways, as shown in Fig. 3A. This suggests that different modules may perform different functions. The turquoise, blue, and black modules all contained the classic PD pathways. In addition, the rst two Neuroactive ligands with the highest gene ratio among the 23 pathways that were enriched in black − receptor interaction, Dopaminergic synapse pathway. There are also reports in the literature [17,18] were the most prominent pathways enriched in quantiles with PD miRNA patterns (e.g., Fig. 3E). The main pathways in another enrichment module (tan module) (e.g., Fig. 3C) were Alcoholism and PD, Drug addictions, etc., whereas Dopamine neurotransmission impairment underlies a wide range of disorders with motor control de ciencies [19].

Mining of disease-related modules
We obtained a total of 1168 genes from the DisGeNET database, among which 1040 genes underwent expression analysis. Moreover, the correlation between the expression of these 1040 genes and coexpression modules was assessed. We found that nine modules were signi cantly related to these genes ( Supplementary Fig. 3 black and blue). In these three modules other with turquoise, the lncRNAs and differential lncRNAs were also signi cantly correlated. This module contained lncRNAs/mRNAs (GeneModuleClass.xlsx).

Screening of key lncRNAs
Using SVM to analyze the differential lncRNAs in the disease-related modules, 12 candidate lncRNAs were nally identi ed, as shown in Table 3. These lncRNAs exhibited high classi cation performance, with an average AUC > 0.6; thus, they were deemed potential lncRNA diagnostic markers of PD.

Construction and testing of a PD lncRNA diagnostic model
The 71 samples were randomly divided into two groups, the training data set (n = 35, 23 PD samples and 12 normal samples) and the validation dataset (n = 36, 23 PD samples and 13 normal samples). We used the 12 lncRNAs identi ed above as features in the training data set, to obtain their corresponding expression pro les and build an SVM classi cation model. The model was tested using a 10-fold crossvalidation method. The classi cation accuracy rate was 94.28% (Fig. 4A), and 33 out of 35 samples were classi ed correctly. The sensitivity of the model regarding the identi cation of PD samples was 100%, with a speci city of 83.33% and an AUC of 0.9928 (Fig. 4C). Furthermore, the established model was used to predict the samples in the veri cation data set and test the predictive ability of the model. Thirty out of 36 samples were classi ed correctly, with a classi cation accuracy of 83.3%. The sensitivity of the model to PD was 86.95%, the speci city was 76.92% (Fig. 4B), and the AUC was 0.9464 (Fig. 4D). These results indicate that the diagnostic prediction model constructed in this study can effectively distinguish patients with PD from normal control populations, and that the 12 lncRNAs identi ed here can be used as reliable biomarkers for PD diagnosis.

KEGG pathway analysis of the 12 lncRNAs
To assess the function of each of the lncRNAs identi ed here, we analyzed the 12 most relevant pathways of lncRNA expression and found that seven lncRNAs were related to the Parkinson's disease pathway. Among them, AC093323.3 and COPG2IT1 showed a positive correlation, whereas AC120114.3, LOC153684, NCRNA00107, RP11-10O22.2, RP11-417J1.4, etc. were negatively correlated ( Supplementary  Fig. S4).

Discussion
The data pertaining to brain tissues (substantia nigra) of patients with PD were speci cally selected here for data mining. We obtained the gene expression data of PD from the GEO database. Compared with the TCGA database, the GEO data are scattered; therefore, we were only able to collect the data manually. Via chip re-annotation, 1970 lncRNA probes were obtained to study the regulatory mechanism of the mRNA/lncRNA co-expression network in PD and the possible regulatory mechanisms of disease pathways. Such a large sample in the study of PD is unique and will help improve the reliability of the research results [20].
The co-expression network analysis performed here identi ed ve WGCNA modules, among which the midnight-blue module was most signi cantly enriched in neurodegenerative diseases. Furthermore, an enrichment analysis showed that the Parkinson's disease (PD) pathway (hsa05012) was one of the representative pathways related to this module. In addition, the remaining modules exhibited low intersection and were enriched in different pathways with different functions, such as the turquoise module, which was enriched for the Helicobacter pylori infection (hsa05120) and epithelial signal pathways; previous studies have reported that Helicobacter pylori infection is associated with PD [21] or that the ubiquitin-mediated proteolytic pathway (hsa04120) in astrocytic glutamine metabolism is associated with PD [22]. Thus, we inferred that this is a pathogenic entity. Because of the complexity of its underlying mechanisms, PD is a complex disease that cannot be attributed to the dysfunction of a single pathway. Therefore, further data mining was performed on the disease-related modules and 12 key lncRNAs were selected using the ROCR package in the R language. A PubMed literature search system was used for literature Dig and to explore the relationship between these lncRNAs and PD. According to previous reports, AC093323.3 exhibits differential expression in the midbrain of cocaine abusers [23], which shows that these 12 genes can be used as potential lncRNA diagnostic markers of PD. Subsequently, we used the SVM model to perform a disease prediction analysis of the 12 candidate lncRNAs. The 10-fold cross-validation of the PD dataset showed that our model had 12 lncRNAs. The sensitivity of RNA veri cation was 86.95% and the speci city was 76.92%, which further showed that these 12 lncRNAs can be used as reliable biomarkers for PD diagnosis. Finally, the 12 selected lncRNAs were re-analyzed through the KEGG pathway database. We identi ed 10 positive correlations for AC093323.3, including cancer-related pathways such as PD, and 10 negative correlations, mainly related to the JAK/STAT signaling pathway; several negative correlations for AC120114.3, including Huntington's disease, AD, PD, and other related pathways; and a negative correlation between LOC153684 and the AD pathway.
We analyzed the lncRNA/mRNA network and related pathways in PD using bioinformatics techniques. These results can help understand the occurrence and development of PD. However, our research also had some limitations. First, we used probes to re-annotate the pipeline and identify functional lncRNAs related to PD; although this approach has been widely used in many bioinformatics studies, we admit that this pipeline lters out many lncRNAs that do not match the probe sequence. Second, in addition to gene expression, epigenetic-and protein-level information also plays a very important role in the drugresponse mechanism; therefore, this information should be included in the pre-expression model. Third, in the eld of bioinformatics, the validity of the results is often assessed based on statistical signi cance and literature veri cation, which were used here to validate the accuracy and reliability of the lncRNA/mRNA network, lncRNA-related functional modules, or the diagnostic potential of the lncRNA biomarkers.
In this study, we used the GEO database to analyze systematically potential lncRNA molecular markers in PD based on lncRNA re-annotation. We screened out 12 lncRNA molecules and veri ed them through the SVM model, to obtain satisfactory results. It is concluded that the expression of these 12 lncRNAs may be related to the occurrence and development of PD. This study provided new molecular entities for the diagnosis of PD, which may promote the early detection of this disease and the development of personalized therapies.

Conclusions
This research provides new molecular features for Parkinson's diagnosis through database screening and machine learning veri cation model, which is helpful for early Parkinson's diagnosis and personalized treatment.

Declarations
Ethics approval and consent to participate.

Not applicable
Availability of data and materials The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Con ict of Interest
The authors declare that the research was conducted in the absence of any commercial or nancial relationships that could be construed as a potential con ict of interest.

Author Contributions
Yi Quan had full access to all of the study data and takes responsibility for the integrity and accuracy of the data analysis; Study concept and design: Yi Quan, Jia Wang; Critical revision of the manuscript for important intellectual content: All authors; Statistical analysis: Yi Quan, Jia Wang; Administrative, technical, and study supervision: Shuo Wang, Jizong Zhao Funding Tables   Table 3 Candidate lncRNAs which using SVM to analyze.  Figure 1 A) In Volcano Plot, the blue part represents downregulated expression and the red part represents upregulated expression. B) Heat map, on the top part, red represents PD samples, blue part which below red represents normal control samples.