Transcriptome Research of Key Modules and Hub Genes Associated with Primary Myelobrosis by Weighted Co-expression Network Analysis.

Background: This study aimed to identify novel targets of diagnosis, therapy as well as prognosis for primary myelobrosis (PMF). Methods: The gene expression proles of GSE26049 was obtained from GEO dataset, weighted gene coexpression network analysis (WGCNA) was then performed to identify the most related modules with PMF. Subsequently, GO (Gene Ontology), KEGG (Kyoto Encyclopedia Genes and Genomes), GSEA (Gene Set Enrichment Analysis) and PPI (Protein-Protein Interaction) network were conducted to fully understand the detailed information of the green module. Results: Green module was strongly correlated with PMF disease after WGCNA analysis. 20 genes in green module were identied as hub genes responsible for the progression of PMF. Functional annotation and pathway analysis revealed that these hub genes were primarily enriched in erythrocyte differentiation, transcription factor binding, hemoglobin complex, transcription factor complex and cell cycle et al. Of which, EPB42, CALR, SLC4A1 and MPL had the most correlations with PMF. Conclusions: This study elucidated that genes EPB42, CALR, SLC4A1 and MPL were signicantly more highly expressed in PMF samples than in normal samples. These four genes may be considered candidate prognostic biomarkers and potential therapeutic targets for early stage of PMF. Meanwhile, EPB42 and SLC4A1 were rstly found to be highly correlated with the progression of PMF. get comprehensive information, a crucial provides a set of functional annotation tools for researchers to explore the real biological signicance underlying different genes. GO (Gene Ontology) is a powerful tool to analyze the molecular function, cellular component, and biological process of genes. KEGG (Kyoto Encyclopedia Genes and Genomes) pathway enrichment analysis was performed to understand links among different genes and signaling pathways. This study performed GO and KEGG in DAVID database and P < 0.05 set as the cutoff to nd out statistically signicant information. Additionally, GSEA (Gene set enrichment analysis) to determine which set of showed statistical signicance. the a for each and


Background
The World Health Organization (WHO) revised and updated the hematopoietic tumor classi cation system in 2016, which con rmed seven major categories of myeloid malignancies including acute myeloid leukemia (AML) and related neoplasms, myeloproliferative neoplasms (MPN), myelodysplastic syndromes (MDS), mastocytosis, eosinophilia-associated myeloid/lymphoid neoplasms with speci c mutations, MDS/MPN overlap and myeloid neoplasms with germline predisposition [1]. Of which MPNs are clonal hematopoietic disorders characterized by excessive production of differentiated hematopoietic cells in chronic phase [2]. The Philadelphia-negative MPNs contained 3 major diseases including polycythemia vera (PV), essential thrombocythemia (ET) and primary myelo brosis (PMF). Among MPNs, PMF is the most essential neoplasms which could evolve from the other MPNs such as PV and ET. Later in their courses, both PV and ET disorders may evolve into myelo brotic phases termed "post polycythemia vera myelo brosis" or "post-essential thrombocythemia myelo brosis", respectively [3,4]. In this situation, PMF is referred to as post-PV MF or post-ET MF.
Primary myelo brosis (PMF) is a nascent, myeloproliferative, neoplastic disorder characterized by clonal proliferation of myeloid cells in the bone marrow resulting in brosis and leading to the devastation of healthy marrow [3,5,6]. Clinical manifestations often include marked splenomegaly, anemia as well as constitutional symptoms such as fatigue, fever, weight loss and night sweats. Patients with severe symptoms of PMF may show upper abdomen atulence feeling, bone pain, bleeding and cachexia et al [7][8][9]. Besides, the overall treatment and prognosis of PMF is generally poor. Epidemiology ndings showed that the estimated prevalence of PMF is between 4 to 6 per 100000 people, with a median survival time of 15 years for patients younger than 60 and 6 years for patients older than 60, respectively [10,11]. Meanwhile, chemotherapy, including JAK2 inhibitors does not provide a promising prospective view [12,13], which may result from inadequate exploration of genes responsible for the occurrence and development of PMF. Therefore, there is an urgent need to discover novel chemotherapy as well as to explore original biomarkers to assess their malignant potential and prognosis for a better understanding in genetic and molecular biology signi cance behind the PMF.
Recent decades, bioinformatics combined with microarray technology have displayed prospective view for analyzing molecular and genetic mechanisms of malignant neoplasms [14], which prompted study on the initiation, development and metastasis of tumor at the molecular level, making it possible to analyze the genetic alteration and molecular mechanisms in the development of neoplasm. Weighted Gene Coexpression Network Analysis (WGCNA) is an effective and systematic biological analysis method among bioinformatics, which is used to perform weighted correlation network analysis for exploring the correlation between genes and a given feature [15,16]. The mathematical principle of WGCNA was rstly proposed in 2005 [17], and the algorithm was implemented in R environment in 2008, which was named as WGCNA package [16]. Recent 5 years, WGCNA has been witnessed for extensive application in bioinformatics to identify the candidate biomarkers through a variety of tumors such as uveal melanoma, breast cancer and adrenocortical carcinoma et al [18][19][20][21][22]. With the assistance of WGCNA, researchers could explore potential mechanisms among highly co-related genes and discover novel diagnostic biomarkers or therapeutic targets from gene cluster associated with disease. To the best knowledge, the application of WGCNA in the identi cation of biomarkers of PMF has not been reported up to now.
In the current study, microarray dataset GSE26049 has been analyzed to construct a weighted coexpression network as well as to nd a module with speci c interest to the development of PMF to identify potential therapeutic targets and key signaling pathways. Subsequently, functional enrichment analysis including GO (Gene Ontology) and KEGG (Kyoto encyclopedia genes and genomes) were carried out to discover molecular function and aberrant regulated pathways through displaying their cellular component, biological process, and molecular functions. Then protein-protein interaction (PPI) network analysis of interested modules responsible for the progression of PMF was performed to visualize the connections between highly co-related genes. The results in this study provided a comprehensive understanding in the oncogenesis of PMF under transcriptome level, which may contribute to assess the prognosis of PMF and ultimately offer new ideas into the treatment of PMF.

Microarray Data
Gene expression pro le of GSE26049 was downloaded from the National Center for Biotechnology Information Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/), which is a functional public genomics repository including high throughout gene expression data, chips and microarray. The pro le totally contained 90 samples, of which 9 PMF (primary myelo brosis) samples, 19 ET (essential thrombocythemia) samples, 41 PV (polycythemia vera) samples and 21 normal samples.

Gene Expression Pro les' Preprocessing
The method of quality control was performed by R ("affy", "affyPLM" package) and standard samples were identi ed through calculating their normalized unscaled standard error. Data preprocessing procedure was used to handle the raw data (".CEL" le format) using robust multi-array average (RMA) background correction and normalization (RMA function, "affy" and "affyPLM" package in R environment). The Affymetrix annotation les from GPL570 platform were applied to annotate corresponding probes, and probes without annotation were removed.
Weighted Gene Co-expression Network Analysis (WGCNA) WGCNA analysis was performed to identify corresponding expression modules. In this part, the steps of WGCNA analysis as well as the way to identify signi cant modules associated with clinical features were clearly explained, together with how to screen the most representative genes in the interested module.
Firstly, hierarchical clustering analysis was applied to check the heterogeneity of samples to detect and eliminate the outliers. After excluding outliers, clustering was performed according to gene expression levels in each sample to construct the network connection to get access to the real biological network state (scale-free network). The soft threshold power was determined to select a value by computing a correlation raised to a power between every pair of genes, which could amplify disparity among strong and weak correlations so that the network state could approach scale-free network topology. The proper soft threshold power was selected when the scale-free topology index (R^2) reached 0.90 and mean connectivity decreased to 0.
Next, weighted gene co-expression network was constructed based on the correlations of inter-gene expression levels, and co-expression modules were identi ed by dynamic pruning method. The minimum number of genes in module was set as 30. Eigengenes adjacency was calculated to evaluate interaction of various gene modules. The expression pattern of the gene modules in different features was shown by the heat map (the X axis was the features and Y axis was the module), module-trait relationships were estimated to identify the correlation between modules and PMF.
The degree of correlation between genes was calculated by WGCNA using the Topological overlap measure (TOM) [23,24], and Pearson's correlation coe cient was also calculated to evaluate correlation between module eigengene and phenotype. P < 0.01 was considered to be statistically signi cant. Then relationships between Gene Signi cance (GS) and Module Membership (MM) was plotted. Ultimately, the module with the highest weighted correlation coe cient among all clustered modules was chosen as the interested module and was pooled for further analysis.

Functional and Pathway Enrichment Analysis on Interested Module
Functional annotations and interpretations were applied to discover their biological meaning behind many genes. This study used DAVID database (Database for Annotation, Visualization and Integrated Discovery, http://david.abcc.ncifcrf.gov/), to get comprehensive information, which is a crucial repository that provides a set of functional annotation tools for researchers to explore the real biological signi cance underlying different genes. GO (Gene Ontology) is a powerful tool to analyze the molecular function, cellular component, and biological process of genes. KEGG (Kyoto Encyclopedia Genes and Genomes) pathway enrichment analysis was performed to understand links among different genes and signaling pathways. This study performed GO and KEGG in DAVID database and P < 0.05 was set as the cutoff to nd out statistically signi cant information. Additionally, GSEA (Gene set enrichment analysis) was further conducted to determine which set of genes showed statistical signi cance. Based on the size of set, a normal enrichment score (NES) was assessed for each gene set, FDR and p value were calculated as the cutoff criteria.

PPI (Protein-Protein Interaction) Network Construction and Hub Genes Selection
PPI analysis was then conducted by STRING online tool, (Search tool for retrieval of interacting genes, https://string-db.org/), an online database that provided PPI analysis for bioinformatics. Subsequently, we used Cytoscape software (version 3.7.0) to screen hub genes and modules through Molecular Complex Detection (MCODE) plug-in, which was used for clustering given network links based on topology to identify densely connected regions. Each node in the network represented a gene, and edges represented regulatory relationships between genes. MCODE recognized the most signi cant modules through mathematical algorithm that number of nodes > 7 and MCODE scores > 8. P < 0.05 was considered to be signi cant difference and hub driver genes were selected as degree ≥ 36.

Pro le's Quality Control
The microarray dataset GSE26049 were applied into this study, gene expression pro les of normal samples as well as different tumor samples (n = 90) were normalized and generated. Firstly, we lter and identify the samples which we could use according to the quality control of these gene chips. Standard samples used in this study were selected by calculating their normalized unscaled standard errors (NUSE) and NUSE plot was also generated for visualization (Fig. 1A). Results illustrated that these 90 samples all had a good chip quality and could be used in our further research. Subsequent to data preprocessing using RMA algorithm, the boxplot of normalized and background corrected gene expression level was plotted (Fig. 1B), results showed that the median amount of gene expression in each sample was on a straight line, indicating that the preprocessed data was served as standard data and could be analyzed for the following study.

Construction of Weighted Gene Co-expression Networks
Firstly, hierarchical clustering analysis for each sample was applied to check the heterogeneity of samples to detect and remove the outliers ( Fig. 2A), cutoff threshold height was set as 85 to lter outliers.
Finally, one sample GSM639674 was excluded and the gene expression matrix contained 20482 genes in the rest of 89 samples were used for WGCNA analysis. Next, the optimal soft threshold power for WGCNA was calculated. As shown in Fig. 2B, when the soft threshold power β was equal to 16, the scale free topology model t index (R^2) was higher than 0.90 and mean connectivity was in nitely approaching zero. As a result, power 16 was selected as the optimal soft threshold power value.
Based on the co-expression relationships, hierarchical clustering analysis was then performed to obtain the weighted co-expression network (Fig. 3A), results displayed that after generating the clustered dendrogram, 14 distinct gene co-expression modules, characterized by their unique module color, were identi ed and clustered. The module colors were salmon, tan, blue, purple, yellow, brown, pink, magenta, black, red, turquoise, green, green-yellow, grey, respectively. Interaction relationships of the 14 coexpression modules in all of the genes was shown in Fig. 3B, topological overlap heatmap revealed that each co-expression module could independently validate each other in the network, and dendrogram branches indicated that genes in each module were highly heterogenous. Consequently, it was necessary to further identify the interested modules in each subgroup.

Key Module Identi cation of PMF
After obtaining WGCNA network data, the module-trait relationships heat map was also performed. The interaction relationships between each module and each different feature including control, PV, ET, and PMF subgroups were fully assessed (Fig. 4A, 4B). From this heat map and histogram, we observed that the modules signi cantly associated with PV was black module, no co-expression module was signi cantly related to ET. As for PMF, results visualized that correlations and p values of PMF samples in the modules green and red had strong positive correlation compared to normal samples, while purple and yellow modules had a strong negative correlation compared to normal samples, indicating that modules green, red, purple and yellow could signi cantly differentiate the PMF samples from the normal samples and genes in these modules could be up-regulated or down-regulated in the progression of PMF. Of which, green module had the strongest correlation (r = 0.46) and the lowest p value (p = 8e-5), elucidating that the green module was highly correlated with PMF, the genes in green module played a crucial roles in the pathogenesis and oncogenesis of PMF. Besides, Module Membership (MM) in PMF versus Gene Signi cance (GS) scatter plot of green module was also generated (Fig. 4C), results illustrated that MM was highly correlated with GS in green module (cor = 0.48, p = 3.6e-5).
Module eigengene adjacency was subsequently calculated to cluster and assess the identi ed modules with feature PMF, and heatmap was plotted to depict their interactions within modules and PMF subgroup (Fig. 5A, 5B). Based on the dendrogram, result illustrated that the PMF subgroup clustered with green module, implying that these two eigenvalues had a highly correlations with each other. As for heat map, each module exhibited their independent validation to the other. Blue colors and red colors represented the different strength of co-expression interactions, and results displayed that the PMF subgroup had the highly correlations with green module.
Then we performed the hierarchical clustering of the gene expression pro les in green module between four subgroups: control, PV, ET and PMF in green module. As shown in Fig. 5C, hierarchical clustering analysis displayed that the genes in green module could signi cantly differentiate PMF group from other three groups, and PMF samples clustered tightly with each other compared to other groups. From the results above, green module was identi ed as key module accounting for the progression of PMF and subsequent research was pooled based on the green module genes.

Functional and Pathway Enrichment Analysis of Genes in Green Module
Totally the blue module contained 217 genes (as shown in supplementary table 1), to obtain a comprehensive understanding of the biological functions and signaling pathway relevance of the genes in green module, GO, KEGG and GSEA researches were conducted. These genes were uploaded into DAVID database for further investigation, the detailed results of GO, KEGG was shown in Table 1 and signaling pathways were identi ed to be signi cantly changed, such as cell cycle, protein processing in endoplasmic reticulum, bile secretion and porphyrin metabolism. Then GSEA analysis was also conducted to screen the potential biological function and aberrant regulated pathways in PMF patients. As shown in Fig. 7, results indicated that the mutual genes in green module were commonly evolved in critical signaling pathways that were correlated with carcinogenesis of tumor, including cell cycle, hematopoietic cell lineage, JAK-STAT signaling pathway, oocyte meiosis, P53 signaling pathway and tolllike receptor signaling pathway et al.

Identi cation of Hub Genes and PPI (Protein-Protein Network) Construction
After screening the interested module, the PPI network was then analyzed into STRING database, and the weighted co-expression network was constructed and uploaded into Cytoscape software.
Co-related modules in these genes were also identi ed by the Molecular Complex Detection (MCODE) plug-in in Cytoscape, totally 155 nodes and 863 edges were acquired, together with the top 2 signi cant modules, as shown in Fig. 8A. Module 1 contained 10 nodes and 55 edges, module 2 contained 7 nodes and 30 edges. The functional annotation and pathway enrichment of screened modules were also conducted using DAVID database, as shown in Table 2. Results showed that genes in module 1 mainly played roles in hemoglobin metabolic process, erythrocyte differentiation, iron ion binding and hemoglobin complex. Genes in module 2 were primarily enriched in protein polyubiquitination, protein binding and cytoplasm et al.  Table 3. Meanwhile, the gene expression levels of the hub genes between PMF and normal samples were tested. The results of gene expression levels were shown in heat map (Fig. 8B). Illustration showed that these hub genes expression levels in PMF samples were much higher compared to normal samples, indicating that these genes was responsible for the development of PMF. Additionally, EPB42, CALR, SLC4A1 and MPL associated genes including HBM, KLF1, GYPB, MYL4, AHSP, RHAG were also expressed abnormally, which demonstrated that EPB42, CALR, SLC4A1 and MPL associated pathways were upregulated aberrantly.

Discussion
Primary myelo brosis (PMF) is one of the most complex and common malignant tumors in myeloproliferative neoplasm (MPN), which is frequently but not always accompanied by JAK2, CALR mutation, aberrant cytokine expression, bone marrow brosis and anemia et al [25][26][27]. Despite the fact that great progress has been made in the diagnosis of PMF during the last decades, the overall treatment and prognosis of PMF is still unsatisfactory, followed by a short median survival time as well as an inadequate exploration of chemotherapy [10][11][12][13]. Consequently, to better discover novel and precise molecular biomarkers that could effectively and accurately predict the progression and prognosis of PMF patients, this study extracted gene expression pro le of GSE26049 from GEO database to investigate and validate potential key modules and hub genes responsible for the pathogenesis and oncogenesis of PMF through bioinformatics analysis with WGCNA.
WGCNA is a powerful global research tool for data mining to analyze multiple genes in large-scale datasets, which is developed in recent years characterized by ltering signi cant modules and gene signatures associated with phenotype features or clinical traits. WGCNA has already been used in the past ve years in some kinds of neoplasms and has made a great progress whether in the identi cation of biomarkers in disease or treatment of targeted genes [28][29][30][31][32]. To our knowledge, the application of WGCNA methods in the identi cation of potential targeted genes in PMF has never been reported so far. Therefore, a comprehensive understanding in the pathogenesis of PMF under transcriptome level is imperative for its treatment and prognosis in order that researchers could provide effective therapeutic strategies as well as make an overall prognostic assessment at the early stage of PMF.
In the present study, WGCNA was conducted and 14 co-expression modules were generated based on 20482 genes from the 89 samples in GEO database. The aim of this study was to elucidate the molecular pattern involved in PMF and the association of key modules with PMF features and excavate the real biological meaning behind hub genes. Firstly, hierarchical clustering analysis was generated to detect outliers and cutoff threshold was set as 85 to remove the outliers. Next, the optimal soft threshold power was picked by "pickSoftThreshold" function in R to construct a scale-free topology network in order to get access to the real biological network state. Soft threshold power was a key value applied to amplify the disparity among strong and weak correlation genes, which could affect the mean connectivity and the scale independence of co-expression modules. Ultimately, soft threshold power β was determined as 16.
The weighted co-expression network was constructed after soft threshold power was selected, then hierarchical clustering was performed to visualize the weighted network. Altogether, 14 distinct coexpression modules were generated, and each module contained the genes that had the most correlations with each other. After acquiring WGCNA network data, the module-trait heat map was displayed, WGCNA tool was conducted because of its merit in revealing co-expression modules and their correlation with different groups. from the heat map, we observed that green module had the strongest positive correlation and the lowest p value to PMF groups, illustrating that genes in green module could signi cantly differentiate the PMF groups from the normal groups. Meanwhile, purple and yellow modules had a highly negative correlation and low p value in PMF groups whereas these modules had a highly positive correlation and low p value in normal groups, genes in purple and yellow modules were highly expressed in normal samples while poorly expressed in PMF samples, indicating that genes in these modules may bene t for the normal organism development. Additionally, the correlation and p value in PV and ET groups were not as signi cant as PMF group (Fig. 4A). Meanwhile, hierarchical clustering analysis in green module visualized that genes expression level could signi cantly differentiate PMF groups from other three groups (Fig. 5C). These ndings implied, consistent with previous studies, that PMF is the most major neoplasms in MPNs, later in their courses, both PV and ET disorders may be evolved into post-PV MF and post-ET MF due to gene mutation or over-expression [3,4].
To further study the function and pathway regulation mechanisms of tumorigenesis, GO, KEGG and GSEA researches were carried out to identify signi cant biological functions and abnormal signaling pathways related to PMF. Functional annotation revealed that genes in green module were mainly enriched in erythrocyte differentiation, hemoglobin complex, transcription factor complex and protein binding, which may explain why the fast multiplication of cancer cells as well as the generation of tumor cells. Our results implied that abnormal changes such as erythrocyte differentiation may occur in hemoglobin complex, which agreed with previous researches that myeloproliferative differentiation disorders may lead into aplastic anemia in the development of PMF [33]. This analysis also found that molecular functions were primarily enriched in transcription factor binding in transcription factor complex, which was consistent with recent studies that cytokines mediated corresponding receptors into the nucleus to participate in transcriptional regulation, and nally lead into tumor growth [5]. Furthermore, the analysis of KEGG and GSEA revealed that these genes were primarily enriched in cell cycle, hematopoietic cell lineage, JAK-STAT signaling pathway, oocyte meiosis, P53 signaling pathway. The oocyte meiosis was rstly contacted to PMF, and the mechanism was presumed to be related to progesterone-mediated oocyte maturation, which was reported to be related with many neoplasms [34]. Next, toll-like receptor signal pathway was found to be highly activated in normal samples, indicating that toll-like receptor behaved as role in resisting tumor in the normal individual. Besides, protein processing in endoplasmic reticulum, bile secretion and porphyrin metabolism were also aberrantly activated in PMF, the reason caused bile secretion may due to hepatomegaly and hyperfunction caused by gene over expression. Advanced research reported that porphyria was a group of porphyrin metabolic disorders caused by the de ciency of special enzyme or the decrease of enzyme in the pathway of heme synthesis. Based on the ndings in this study, genes responsible for the progression of PMF in green module may play roles in heme binding, porphyrin metabolism et al, we have reason to deduce that the development of PMF may also lead to the occurrence of porphyria, the links between these two diseases had not been reported so far. Further research needs to be conducted to nd out the correlations between these two disorders.
Additionally, studies also reported that dysregulation of cell cycle played a crucial role in the oncogenesis of tumor [35,36], which was agreed with our ndings that cell cycle was aberrantly activated in PMF.
With the aim of screening hub genes among green module identi ed in our previous work, the 217 mutual genes were analyzed with construction of PPI network based on the STRING database. Altogether, 20 genes were selected as hub genes with high degrees. Heat map of these gene expression between PMF samples and normal samples showed that these hub genes could signi cantly distinguish these two groups (Fig. 8B). Particularly, EPB42, CALR, SLC4A1 and MPL ranked highest in their degree value, indicating that these genes had the most correlations with the progression of PMF.
CALR, located on chromosome 19, whose full name was calreticulin, was a multifunctional protein that acted as a major ca 2+ binding protein in the lumen of the endoplasmic reticulum. Advanced research proved that it also existed in nucleus, suggesting that in may play a role in transcription regulation. Many of existed studies had reported their highly correlation with the development of PMF [5,37,38], driver mutations involving CALR in 90% of patients mediated constitutive JAK-STAT signaling, thereby leading to changes in the cytokine and growth factor milieu and accordingly potentiated brosis and nally played a fundamental role in disease pathogenesis. MPL, located on chromosome 1, mediated the expression of thrombopoietin and its receptor, which played an important role in cellular signaling pathway, they were essential to the expansion and regulation of megakaryocytes as well as self-renewal of hematopoietic stem cells. Previous researches showed that mutations in both MPL and CALR could increase the JAK-STAT activation, of which CALR may induce JAK-STAT signaling pathway by increasing recruitment of proteins to the MPL promoter site [39]. EPB42, erythrocyte membrane protein band 4.2, was an ATP-binding protein which may regulate the correlation of protein 3 with ankyrin. It was reported that EPB42 could manipulate the shape and mechanical property of erythrocyte and have a high association with hereditary spherocytosis [40,41]. SLC4A1, the protein encoded by this gene was part of the anion exchanger family and was expressed in the erythrocyte plasma membrane. The protein comprised two domains which were structurally and functionally distinct, of which the N-terminal 40 kDa domain was in the cytoplasm and acted as an attachment site for erythrocyte skeleton by binding with ankyrin. Many mutations in this gene had been known in human, the over expression of SLC4A1 could lead to hereditary spherocytosis caused by destabilization of cell membrane [42]. It was well known that the clinical manifestation of primary myelo brosis was obvious enlargement of spleen (mostly splenomegaly) and extramedullary hematopoiesis of each organ. Hematological features were the presence of large amount of erythrocyte in the peripheral blood cell smear. Based on the ndings above, this study rstly found that EPB42 and SLC4A1 had a strong correlation with PMF, consequently, we hypothesize that patients with PMF may have a complication of hereditary spherocytosis due to the over expression of EPB42 and SLC4A1 in the progression of the disorder. Due to the fact that there were no reports of the association between these two disorders, further research needs to be conducted to validate our hypothesis.
Overall, this study used WGCNA analysis for a whole situation to conduct an elaborate, precise, and systematic view to analyze the differentially expressed genes from PMF patients. However, considering all the results above, there were still some limitations, on the one hand, small size of samples may limit the statistical power to identify the hub genes; on the other hand, molecular biological experiments needs to be applied to further validate these hub genes to determine whether they may be bene cial in the diagnosis or treatment of PMF.

Conclusions
In conclusion, this study attempted to explore the potential molecular regulatory mechanism of PMF based on WGCNA analysis. 14 distinct co-expression modules were identi ed from GEO dataset GSE26049. Of which, green module was most signi cantly correlated with PMF. EPB42, CALR, SLC4A1 and MPL in green module were recognized as key therapeutic targets for PMF, EPB42 and SLC4A1 were rstly found to be highly correlated with PMF.

Consent to publish
All contributing authors agree to the publication of this article.

Availability of data and materials
The data used and analyzed during the current study are available upon reasonable request.

Competing Interests
All authors declare no con icts of interest related to this manuscript, and all authors have approved the publication of this work.

Funding
Not applicable.

Authors' Contribution
This study was completed with teamwork. Every author had made corresponding contribution to the study: Conceived the idea: WL, MY, GG.
Wrote the main manuscript: WL, BY.     Functional and pathway enrichment analysis of genes in green module.
Page 30/31 Figure 7 (A-F), Gene set enrichment analysis of KEGG pathways in green module.