MIgene: an Evidence-based Database of Genes and Phenotypes of Male Infertility

Male infertility (MI) is a highly heterogeneous disease. Its direct cause is spermatogenic failure, which leads to sperm abnormalities and deterioration of semen quality, and ultimately to infertility; genetic factors account for about 30 percent of cases. Although some knowledge bases have established correlations between genetic factors and MI or spermatogenesis, they do not highlight the relationship between gene mutations and MI phenotypes, and some are no longer updated. Meanwhile, with the wide application of high-throughput sequencing technology in MI research, a large amount of genotype and phenotype data has accumulated but remains scattered and is not effectively utilized. To address these issues, the MIgene database (http://midb.geneworks.cn) was built by integrating existing gene-phenotype data on MI from 989 publications in the PubMed and Medline databases. A total of 22 information entries, comprising 18 direct entries and 4 extended entries, were extracted, filtered, and curated using bioinformatic and manual methods, resulting in 664 genes, 68 (standard) phenotypes (classified into 37 categories), 3606 mutants and 7985 studies in MIgene. MIgene is a web-accessible data repository that offers four search modules and six modules covering genes, phenotypes, proteins, functions, homologs, cases, etc. Interestingly, many non-exonic variants can cause MI, and the same mutation can increase or decrease the risk of MI in different phenotypes and ethnic groups. The degree of gene-phenotype association is presented by enrichment analyses based on the hypergeometric distribution. In general, MIgene not only has a user-friendly interface for concise search, convenient browsing, and customized downloads, but can also provide early warning of disease risk and assist clinicians in timely diagnosis.


Introduction
Infertility is a global health problem, defined as the failure of a couple to conceive after one year of unprotected intercourse 1. It affects approximately 15% of couples, and about half of these cases can be attributed to male infertility (MI), which poses a major threat to family harmony, mostly in developing countries 2,3. The direct cause of male infertility is spermatogenic failure, which leads to sperm abnormalities and deterioration of semen quality, and ultimately to infertility; genetic factors account for about 30% of cases 4-6.
Spermatogenesis is a sophisticated biological process responsible for the development of spermatozoa from spermatogonial stem cells, and it is elaborately regulated by multiple genes. Spermatogenic failure usually presents as quantitative defects (azoospermia, oligozoospermia) or qualitative defects (globozoospermia, macrozoospermia) 5,7. Currently, known genetic abnormalities, including chromosome aberrations, Y chromosome microdeletions, epigenetic and post-transcriptional modifications, sperm DNA damage, mitochondrial DNA (mtDNA) mutations, and genetic variants, have been associated with idiopathic MI 8-11. This manuscript mainly focuses on the relationship between MI and variants of mtDNA and chromosomal DNA.
To date, researchers have made many efforts to build databases or knowledge bases covering various aspects of MI. GermOnline 4.0 is a cross-species database gateway focusing on high-throughput gene expression data related to germline development and the meiotic and mitotic cell cycle in normal or malignant cells 12. SpermBase, a sperm RNA database, includes large and small sperm-borne RNA expression data for M. musculus, H. sapiens, etc. 13. GermlncRNA, a catalog of germ cell long non-coding RNAs, systematically annotates lncRNAs for each specialized germ cell stage using public annotations and a Hybrid Transcriptome Assembly (HTA) approach 14. SpermatogenesisOnline 1.0 was built by manual curation of 30,233 articles published before 1 May 2012 and contains 1666 core genes and 762 extended genes that participate in spermatogenesis in 37 organisms 15. ReproGenomics Viewer is a cross-species and cross-technology web-based resource of manually curated sequencing datasets related to reproduction 16,17. There are also other databases relevant to MI, such as GED 18, GermSAGE 19 and CFTR2 20. However, these databases and knowledge bases do not highlight the relationship between gene mutations and male infertility phenotypes, and some are no longer updated. In recent years, with the wide application of high-throughput sequencing technology in MI research, more genes have been identified and a large amount of clinical phenotype data has been obtained. However, these data are relatively scattered and have not been effectively utilized. Therefore, it is important to integrate the existing data into a comprehensive, informative, and updatable database of genetic predispositions to MI, which could greatly facilitate counseling, diagnosis, and therapy for MI.
In this study, MIgene, an evidence-based database of genes and phenotypes related to MI, is presented to fulfill the increasingly urgent need for data integration and resources. In total, 664 genes (515 genes from non-GWAS and 179 genes from GWAS), 3606 mutations including SNPs (Single Nucleotide Polymorphisms) and VNTRs (Variable Number of Tandem Repeats), and 68 phenotypes (37 categories) were collected from 989 articles. To the best of our knowledge, MIgene is the first genetic database for MI that can be conveniently browsed, searched and downloaded. It can facilitate functional studies of MI for researchers and provide reference information for clinicians in prenatal diagnosis.

Literature search.
MIgene integrates genetic variants of MI from publications. Besides SNPs, other variants such as VNTRs were also included. More than 200 combinations of different keywords were searched in the PubMed and Medline databases (Supplementary Table S1), such as 'aspermia AND mutation', 'spermatogenesis failure AND genomic alteration', 'severe oligozoospermia AND gene defects', 'spermatogenesis impairment AND various', 'oligozoospermia AND polymorphism', 'non-obstructive aspermia AND mutant', 'male sterile AND mutant', 'male infertility AND various', 'male infertility AND copy number', 'infertile men AND genetic alteration', etc. Papers containing these keywords in their titles or abstracts were identified and then fetched through the NCBI E-utilities API. During manual screening, reviews and research articles about pharmacology, sociology, electrophysiology, behavioral research, neurophysiology, chromosome aberrations, Y-chromosome microdeletions and cancer/tumor were excluded. In addition, papers on non-human species and meta-analyses were excluded from this version. The remaining publications constitute the MIgene data set.
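As an illustration of the keyword search described above, the following minimal sketch builds title/abstract queries from keyword combinations and the corresponding NCBI E-utilities ESearch URLs. The term lists are small illustrative subsets (the full list is in Supplementary Table S1), and the function names, the `retmax` value and the choice of JSON output are assumptions made for this example, not the authors' actual pipeline. No request is sent.

```python
from itertools import product
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Illustrative subsets only; not the complete keyword list used for MIgene.
PHENOTYPE_TERMS = ["male infertility", "oligozoospermia", "spermatogenesis failure"]
VARIANT_TERMS = ["mutation", "polymorphism", "copy number"]


def build_queries(phenotype_terms, variant_terms):
    """Pair every phenotype term with every variant term, limited to title/abstract."""
    return [
        f'"{p}"[Title/Abstract] AND "{v}"[Title/Abstract]'
        for p, v in product(phenotype_terms, variant_terms)
    ]


def esearch_url(query, retmax=100):
    """Return the ESearch URL for one query (hypothetical helper; nothing is fetched)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


queries = build_queries(PHENOTYPE_TERMS, VARIANT_TERMS)
urls = [esearch_url(q) for q in queries]
```

Each URL could then be requested with any HTTP client, and the returned PubMed IDs de-duplicated across queries before manual screening.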
Data extraction, integration, and curation.
The full text of each eligible publication was downloaded and read carefully. The detailed information of each study was extracted manually by two or three researchers. Genetic, proteomic, clinical and demographic contents, comprising 18 direct entries collected from the papers and 4 extended entries (Supplementary Table S2), such as gene name, mutant, study type, clinical significance, phenotypes, and author comments, were then proofread, standardized, supplemented, and verified through bioinformatic and manual methods according to GenBank 21, HGNC (HUGO Gene Nomenclature Committee) 22, dbSNP 23, GeneCards 24 and UniProt 25. Because not all mutations have rs_IDs, variants without rs_IDs were assigned identifiers of the form Jrs000001, Jrs000002… (Supplementary Table S3). For phenotypes, many papers reported only 'spermatogenic failure' rather than 'azoospermia', 'oligozoospermia', or other specific terms; in such cases, MIgene shows only the raw data. The collected phenotypes, including different spellings and synonyms, were classified into 68 (standard) phenotypes and 37 categories according to the reference criteria of the 5th edition of the WHO manual (Table 1 and Supplementary Table S4-S6) 26. Phenotype definitions were taken from HPO 27, OMIM 28, Wikipedia (https://en.wikipedia.org) and the 5th edition of the WHO Laboratory Manual 26. To illustrate the relationship between candidate genes and MI, statistical results were classified as 'related-damage', 'related-protection', 'unrelated' or 'unknown' according to the statistical evidence in the original publications. Results with p < 0.05 for non-GWAS or p < 1 × 10⁻⁸ for GWAS were usually defined as 'related', unless the authors or the associated studies suggested other values 29. However, many mutants produced opposite consequences. For instance, the catalase C262T polymorphism indicated that the CAT-262T/T genotype conferred less susceptibility to MI 30.
Hence, 'related' was divided into two classes: 'related-damage' denotes results associated with an increased risk of MI, and 'related-protection' denotes results associated with a decreased risk of MI. 'Unrelated' denotes results with p > 0.05 for non-GWAS or p > 1 × 10⁻⁵ for GWAS. Results were labeled 'unknown' when the GWAS p-value fell between these two thresholds or when the original paper did not provide the clinical significance. If other statistical values were used, the criteria of the statistical method in the original paper were followed. All clinical results were checked by at least two researchers, and conflicting results were resolved by discussion.
To further understand the functions of the genes associated with MI, extensive functional knowledge and data were gathered from online databases. Protein expression levels by subcellular location were retrieved from Compartment. The positions and functions of peptides and proteins were annotated using UniProt 25. In addition, other information, such as coexpressed proteins, protein-protein interactions, and enriched functional pathways, was obtained from GeneCards 24, String 31, GO (Gene Ontology) 32 and KEGG 33.
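The thresholds above can be summarized as a small decision rule. The sketch below is one possible reading, not the curators' actual procedure: it assumes that the direction of effect is given by an odds ratio (greater than 1 meaning increased risk), that a significant result without a reported direction is left as 'unknown', that a non-GWAS p-value of exactly 0.05 counts as 'unrelated', and that GWAS p-values between the two thresholds are 'unknown'. The function and parameter names are hypothetical.

```python
from typing import Optional

NON_GWAS_ALPHA = 0.05    # non-GWAS: p below this is 'related'
GWAS_RELATED = 1e-8      # GWAS: p below this is 'related'
GWAS_UNRELATED = 1e-5    # GWAS: p above this is 'unrelated'


def classify(p_value: Optional[float], is_gwas: bool,
             odds_ratio: Optional[float] = None) -> str:
    """Map one statistical result to a clinical-significance label."""
    if p_value is None:
        return "unknown"              # no clinical significance reported
    if is_gwas:
        if p_value > GWAS_UNRELATED:
            return "unrelated"
        if p_value >= GWAS_RELATED:
            return "unknown"          # between the two GWAS thresholds
    elif p_value >= NON_GWAS_ALPHA:
        return "unrelated"            # assumption: p == 0.05 is not significant
    # Significant result: split by direction of effect (assumed to be an odds ratio).
    if odds_ratio is None:
        return "unknown"              # assumption: direction not reported
    return "related-damage" if odds_ratio > 1 else "related-protection"
```

For example, a non-GWAS result with p = 0.01 and an odds ratio below 1 would be labeled 'related-protection' under these assumptions, which matches the spirit of the catalase example in the text.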

Enrichment analysis of genes and phenotypes.
For the enrichment analysis of gene-phenotype associations, an algorithm based on the hypergeometric distribution 34,35 was applied. The results are referred to as the enrichment scores of the gene or phenotype enrichment analysis (Supplementary Table S7-S8). The -lg (p-value) was calculated by adding 0.0001 to the enrichment score (a small, arbitrarily chosen offset), taking the base-10 logarithm, and negating the result, i.e., -lg (p-value) = -lg (enrichment score + 0.0001). After searching for a gene, the phenotype enrichment results can be obtained (Supplementary Table S7). Using the same strategy, after searching for a phenotype, the gene enrichment results can be obtained (Supplementary Table S8).
Web interface configuration.
MIgene was established as an integrated information resource in which all data are stored and managed in a MySQL relational database, and the website is implemented using node.js, JavaScript, vue and egg.js. These are platform-independent, free and open-source software that support browsing by multiple users. The web interface is available online at http://midb.geneworks.cn/introduce.
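To make the calculation concrete, here is a minimal sketch of a hypergeometric enrichment test followed by the -lg transform. It assumes that the enrichment score is the upper-tail hypergeometric p-value and that the counts are defined as in the comments; the text does not specify either point, so both are assumptions. The numbers in the example are toy values, not MIgene data.

```python
from math import comb, log10

OFFSET = 0.0001  # offset given in the text; keeps the logarithm finite for a score of 0


def hypergeom_tail(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for a hypergeometric draw.

    N: total number of records (assumed: all studies in the database)
    K: records annotated with the phenotype of interest
    n: records involving the gene of interest
    k: records involving both the gene and the phenotype
    """
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


def neg_lg(score: float) -> float:
    """-lg (p-value) as defined in the text: -lg (enrichment score + 0.0001)."""
    return -log10(score + OFFSET)


# Toy example: 20 records, 5 with the phenotype, 4 with the gene, 3 with both.
p = hypergeom_tail(k=3, N=20, K=5, n=4)
score = neg_lg(p)
```

One consequence of the fixed offset is that -lg (p-value) cannot exceed 4, so very strongly enriched items are capped at the same value.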

Results
Data collection and curation.
Following the workflow described in "Materials and methods" (Figure 1), we screened 25,312 papers and selected 989 publications related to MI, then used 22 different entries to describe the patients. After processing these data, 664 genes (515 genes from non-GWAS and 179 genes from GWAS), 68 phenotypes, 3606 mutants and 7985 studies were included in MIgene (Figure 1 and Supplementary Table S9-S10).
Spermatogenesis is a highly organized process of cell proliferation in seminiferous tubules and terminal differentiation for the development of mature spermatozoa. If spermatogenesis is disturbed, it will cause azoospermia, oligozoospermia and other defects of sperm count, motility, and morphology 10,36. Among all phenotypes, studies of azoospermia, spermatogenic failure, and oligozoospermia account for 46.47%, 45.37%, and 34.64%, respectively.

Analysis of genes, molecular consequences, and variant types for clinical significance.
From the 989 papers about MI, a total of 664 genes were obtained and classified by four types of clinical significance and two study types (Figure 2A, Table 2 and Supplementary Table S11-S12). Among these genes, 515 were from non-GWAS and 179 from GWAS, with 30 genes present in both. In addition, 280 genes were associated with more than two types of clinical significance. For example, the c.2039A>G mutant of the FSHR gene showed four types of clinical significance under different conditions, including phenotype, zygosity and ethnicity (Table 3 and Supplementary Table S13), which suggests that MI is a multifactorial disease. Notably, 103 genes (non-GWAS: 85 genes, GWAS: 19 genes) were found exclusively in the "related-damage" group and corresponded to 38 phenotypes. Counting the genes for each phenotype showed that the top three phenotypes were spermatogenic failure (59 genes), azoospermia (47 genes), and asthenospermia (20 genes) (Figure 2B and Supplementary Table S14).
Furthermore, missense (37.5%), intron (19.1%) and synonymous (10.2%) variants were the top three molecular consequences in MIgene (Figure 2C). In the related-damage group, the top three were missense (44.3%), intron (10.1%) and splice site (9.8%) variants (Figure 2D). Notably, intronic mutations could affect MI, in accordance with the reported extent and functional significance of intron retention 39.
The comprehensive collection in MIgene allowed us to obtain an overview of related-damage genes across chromosomes. The gene ontology analysis revealed that every chromosome except chromosome 21 carried a certain number of genes (Figure 2E and Supplementary Table S15). Importantly, many mtDNA genes participate in MI. For example, the mtDNA 4977-bp deletion was found to be related to MI 40,41.
Enrichment analysis of genes and phenotypes.
To find further evidence for the association between genes and phenotypes, an enrichment analysis was performed based on the hypergeometric distribution. The results indicated that the larger the number of samples for an enriched item in the database, the more stable the enrichment results (Figure 3). Taking oligoteratozoospermia as an example, the prioritization of MI candidate genes in the related-damage group is presented in Figure 3A. The gene most relevant to oligoteratozoospermia was PLOG.
It is well known that one gene can give rise to different phenotypes, so the phenotype enrichment ranking for a gene was further explored. Using PLOG as a training gene, the phenotype enrichment results were ranked and plotted (Figure 3B). The top-ranked phenotype was oligoteratozoospermia, in accordance with Figure 3A.
Data search and navigation.
MIgene provides users with a powerful, multi-faceted search engine and a user-friendly interface to access, browse and retrieve different data types and analysis results. The website interface comprises seven sections: "Home", "Browser", "Submit case", "Download", "Tutorial", "Contact" and "Analysis" (Figure 4). The "Home" page provides a brief introduction to MI, an overview of the information accessible in the database, and gene or phenotype search. There are four search modules: "Gene Symbol", "Phenotype", "rs_ID" and "Mutant". Search terms are auto-completed after a few letters are typed in the corresponding search box, and results are cross-linked between modules. After selecting "Browser" in the navigation bar, users can freely browse the complete list of MI entries, including genes, related phenotypes, clinical significance, and supporting evidence. On the "Submit case" page, users can submit new genes, mutants, and phenotypes to the database. These data will be stored, curated and then entered into the database. In addition, MIgene will be updated periodically according to the latest publications. The "Tutorial" page presents the database's guidelines.
MIgene provides a detailed report for each gene. First, taking the gene FSHR as an example, MIgene shows basic information on the gene and protein, including protein sequence annotations, functional analysis and links to related external databases such as OMIM 28, InterPro 42, KEGG 33, GO 32, String 31, Compartment, etc. In addition, homology, enriched phenotypes, and coexpressed proteins are also provided. For the variant information of FSHR, users can not only view the variant types and their statistical results but also download the filtered contents at any time. Clicking the "view" button displays the full details of the selected genomic mutation. The numbers of phenotypes and clinical significance types associated with the gene are also counted. For example, oligozoospermia is one of the phenotypes related to FSHR. There are 109 studies about it, divided into four groups: 5 related-damage, 5 related-protection, 97 unrelated and 2 unknown. Second, for each phenotype, MIgene provides its definition, the number of studies, the enriched genes and other contents, including SNPs, indels, deletions, duplications, insertions, and related clinical significance. Third, in the rs_ID module, MIgene displays the basic information of the rs_ID, the number of studies and the clinical significance statistics. Finally, the database provides a powerful and convenient way to search for mutants of genes and phenotypes for MI.

Discussion
As the first comprehensive genetic database of MI, MIgene includes many genes, clinical phenotypes, and basic information on the patients. The authors did their best to extract this information through in-depth manual reading. Nevertheless, several issues should be considered for MIgene. First, there is currently no universal, rapid and efficient system that can screen full texts in place of manual screening 46,47. This seriously slows down updates, especially as massive amounts of gene and mutant data are being produced with the development of new technologies. Therefore, automatic mining methods should be developed and used to update male infertility-related data in the future. Second, the curation of clinical significance was based only on the publications; the supporting data in the publications were not validated by us, which may lead to some false-positive results. These issues are expected to be addressed in future versions over the next few years.
In summary, MI is a complicated disease that can be influenced by multiple factors, including genes, phenotypes, mutation types, genetic background, ethnicity, environment and even zygosity. The MIgene database has been established and is publicly available to researchers and clinicians. Furthermore, it elucidates the association between genotypes and phenotypes (especially spermatogenic failure). This study could help users understand the complex biological processes and mechanisms of MI, and provides a reference for prenatal diagnosis.

Figure 1 Process of data collection and curation in MIgene. The workflow of this project is divided into three sections: literature collection, data extraction and manual curation. MI: male infertility.
The -lg (p-value) equals -lg (enrichment score + 0.0001). The complete data sets of -lg (p-value) are displayed in Supplementary Table S7-S8.