Identification of SMNDC1 genes in animals and construction of a phylogenetic tree
To explore the functional differentiation of SMNDC1 family genes, 72 SMNDC1 protein sequences from 66 animal species were subjected to protein domain alignment analysis using the online tool SMART. Among them, 23 primates, 19 rodents and lagomorphs, 9 fish, 2 birds and reptiles (anole lizard and Chinese softshell turtle), and one other animal (lamprey) were identified. Of the 66 species, 60, including humans and all fish, have a single SMNDC1 gene, whereas 6 species have two copies: Angola colobus, mouse lemur, pig-tailed macaque, rabbit, golden hamster and pig. No animal species has three or more copies of the SMNDC1 gene, in contrast to the situation in plants. For example, Kalanchoe laxiflora has four SPF30 (SMNDC1) genes and Triticum aestivum (wheat) has three copies [28].
Furthermore, to understand the evolutionary history and phylogenetic relationships among the identified SMNDC1 genes, a phylogenetic tree was constructed using the Bayesian method based on the amino acid sequences of the 72 SMNDC1 members from 66 animal species (Fig. 1). For genes with multiple transcript isoforms, the isoform with the longest protein-coding sequence was selected as the representative. Bootstrap values are presented as a color gradient on the branches. Species from different taxonomic groups are marked with different colors. The tree resolved into four major clades: primates, rodents and lagomorphs (purple), birds and reptiles (pink), fish (light blue) and other mammals (green). As expected, genes from phylogenetically related animal species tend to cluster together in the tree. For example, the SMNDC1 genes from primates, including Homo sapiens and its close relatives, form a monophyletic group (Fig. 1). Taken together, the four main clades reflect the general animal phylogeny.
Furthermore, the branch lengths indicate the evolutionary distances between organisms, while the clear topology supports the validity of the phylogenetic reconstruction of the SMNDC1 gene family in animals. The high-precision phylogenetic tree constructed in the current study provides a basis for the subsequent bioinformatics analyses.
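The isoform selection rule described above (one representative per gene, chosen by the longest protein-coding sequence) can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the Isoform record type, the function name and the toy sequences are hypothetical, and ties are resolved by keeping the first isoform encountered, which is an assumption.

```python
from collections import namedtuple

# Hypothetical record type: one translated isoform of one gene.
Isoform = namedtuple("Isoform", ["gene_id", "transcript_id", "protein"])


def longest_isoform_per_gene(isoforms):
    """Return {gene_id: Isoform}, keeping the isoform with the longest protein."""
    best = {}
    for iso in isoforms:
        current = best.get(iso.gene_id)
        # Strict '>' keeps the first isoform seen when lengths tie (assumption).
        if current is None or len(iso.protein) > len(current.protein):
            best[iso.gene_id] = iso
    return best


# Toy input: two isoforms of one gene and a single isoform of another.
example = [
    Isoform("geneA", "txA1", "MKT"),
    Isoform("geneA", "txA2", "MKTLLV"),
    Isoform("geneB", "txB1", "MSD"),
]
representatives = longest_isoform_per_gene(example)
```

In this toy input, txA2 would be retained for geneA because its protein is longer than that of txA1.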
Analysis of protein domain/motif
To further study the conservation of the animal SMNDC1 gene, a detailed analysis of its protein domains and conserved motifs was performed. The SMNDC1 proteins of 66 representative animal species were aligned and used to construct a phylogenetic tree (Figs. 1, 2). The identified SMNDC1 proteins ranged from 178 to 288 amino acids in length, and most are approximately 238 amino acids long (Table S2). All SMNDC1 proteins have a characteristic central SMN domain, whose size is strictly conserved at 60 amino acids. The conserved motifs of animal SMNDC1 proteins were then predicted with the MEME online tool. The top ten conserved motifs are illustrated as colored boxes and cover most of the protein (Fig. 2, right panel). The vast majority of animal SMNDC1 sequences, including that of humans, contain all 10 conserved motifs. Furthermore, the SMN domain is mainly located within the middle five motifs (Fig. 2, right panel). Interestingly, in species with two SMNDC1 copies, the copies differ in their motifs: one has 10 motifs, while the other has fewer than 10 or a different motif composition. For instance, ENSCANT00000039560.1 in Angola colobus has 10 motifs, while ENSCANT00000045349.1 has only 8 motifs, which implies potential functional diversification.
Interaction Networks of SMNDC1
The crystal structure of human SMNDC1 is shown in Fig. 3. The aromatic cage in the Tudor domain of SMNDC1 mediates dimethylarginine recognition through cation-π interactions involving five important residues (Trp83, Tyr90, Phe108, Tyr111 and Asn113), as shown in Fig. 3. In detail, Tyr90 and Asn113 were highly conserved (ConSurf grade 9), and Trp83 was conserved at ConSurf grade 8. Phe108 (98.571%) and Tyr111 (97.143%) were conserved at ConSurf grades 6 and 4, respectively.
The protein interaction network of SMNDC1 may further reveal its involvement in various biological processes. To investigate the functional relationships between SMNDC1 and other proteins, the web tool STRING was used to construct protein interaction networks of animal SMNDC1. Based on experimental and database evidence, three representative SMNDC1 protein sequences from human, mouse and Schizosaccharomyces pombe (yeast) were selected to generate interaction networks (Fig. 4). The resulting human, mouse and yeast SMNDC1 networks contained 10, 10 and 5 functional partners, respectively. In detail, the interacting proteins of human SMNDC1 can be divided into three categories: small nuclear ribonucleoproteins (SNRNP200 and SNRPB), splicing factors (SF3A3, SF3A2, SF3B2, SF3B4, SF3B5 and SF3B6) and pre-mRNA processing factors (PRPF6 and PRPF3). In addition to these three categories of interacting proteins, mouse SMNDC1 also interacts with the U6 small nuclear RNA and mRNA degradation-associated proteins Lsm5 and Lsm6 and with the RNA-binding motif protein Rbmx. Interestingly, the interacting proteins of yeast SPF30 are quite different from those of human and mouse, and mainly include prp1 (U4/U6 x U5 tri-snRNP complex subunit Prp1), sap62 (zinc finger protein Sap62), itr2 (MFS myo-inositol transporter), dis3 (putative 3'-5' exoribonuclease subunit Dis3) and swi6 (chromodomain protein Swi6). In addition, we found that many interacting proteins of mammalian SMNDC1 have no apparent homologue in Schizosaccharomyces pombe, for example, the splicing factors (SF3A3, SF3A2, SF3B2, SF3B4, SF3B5 and SF3B6) and the pre-mRNA processing factors (PRPF6 and PRPF3). Taken together, specific interaction studies and further functional verification of SMNDC1 may reveal its involvement in various biological processes.
Analysis of gene structure and conserved motifs
To further explore the conservation of gene structure and motif composition at the genome level, the transcript with the longest coding sequence (CDS) of each SMNDC1 gene was chosen for analysis (Fig. 5). Different genomic structures were observed, with the total number of exons ranging from two to seven. In most primates, the number of exons remains at five; however, ENSMLET00000055620.1 has seven exons, ENSMNET00000044298.1 and ENSMICT00000046893.2 have three exons, and ENSCANT00000045349.1 has only two exons. In fish, the number of exons remained stable at five or six. In summary, SMNDC1 genes with five exons in the CDS account for approximately 89% of the total (Fig. 5 and Table S2), including the SMNDC1 genes of representative species such as human, rodents and rabbit. Among the 72 SMNDC1 family genes, 43 sequences have a 5-exon, 4-intron gene structure, accounting for 59.7% of all members. Twenty-two members have a 6-exon, 5-intron gene structure, accounting for 30.6% of all members. Additionally, ENSOCUT00000012273.2 and ENSMLET00000055620.1 possess the most exons (seven), while ENSCANT00000045349.1 and ENSSSCT00000039370.2 have the fewest (two). Furthermore, ENSMNET00000044298.1, ENSMICT00000046893.2 and ENSMAUT00000006691.1 have three exons. Among the members with six exons and five introns, all except one armadillo SMNDC1 (ENSDNOT00000016385.2) have an additional exon that is non-coding. Notably, in the six species with two SMNDC1 sequences, the two copies have different gene structures; for example, of the two sequences from Sus scrofa, one has 2 exons (ENSSSCT00000039370.2) and the other contains 5 exons (ENSSHAP00000001057.1).
Collectively, the differences in the exon-intron distribution patterns of SMNDC1 among the above animal species indicate that structural changes of genes may be involved in the evolution of this gene family across the general animal phylogeny. Furthermore, SMNDC1 genes in the same branch show obvious similarities in gene structure, indicating a close evolutionary relationship. Given the differences in gene structure between SMNDC1 genes, we further used MEME to determine whether their cDNA sequences differ in motif composition. The 10 most conserved motifs were identified from the cDNA sequences of SMNDC1 (supplementary Fig. 6, right panel). Overall, over half of the SMNDC1 sequences contained all 10 conserved motifs. The position and number of motifs in the SMNDC1 genes of most animals showed little difference among primates, rodents and lagomorphs, other mammals (purple) and other vertebrates (pink). Interestingly, there were few differences between the motifs of SMNDC1 sequences with two different gene structures in one species. For example, two sequences from Oryctolagus cuniculus were found, one containing 9 motifs (ENSMICT00000041763.2) and the other containing 8 motifs (ENSMICT00000046893.2). In conclusion, comparison of the conserved motifs at the cDNA and protein levels showed no marked differences in codon usage, motif number or sequence similarity among these homologues. The locations of these motifs indicate that animal SMNDC1 is conserved at both the protein and cDNA levels. In addition, the cDNA comparison showed that no conserved motif was found in the untranslated regions, and these regions were enriched in regulatory elements, which provides additional information on the conserved regulatory mechanisms among these SMNDC1 genes.
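The exon-count distribution reported in this section can be checked with a short calculation. The sketch below uses only the counts stated in the text (two sequences with 2 exons, three with 3, 43 with 5, 22 with 6 and two with 7). The reading of the approximately 89% figure as "five coding exons", that is, the 5-exon members plus the 6-exon members whose extra exon is non-coding, is an interpretation assumed here, not a statement from the study.

```python
# Exon-count classes as stated in the text: number of exons -> number of sequences.
exon_classes = {2: 2, 3: 3, 5: 43, 6: 22, 7: 2}

total = sum(exon_classes.values())  # should equal the 72 family members


def share(count, total):
    """Percentage of the total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {n_exons: share(n, total) for n_exons, n in exon_classes.items()}

# Assumed interpretation: 6-exon members with a non-coding extra exon
# (all but the single armadillo sequence) still have five coding exons.
five_coding_exons = exon_classes[5] + (exon_classes[6] - 1)
five_coding_share = share(five_coding_exons, total)
```

Under this reading, the five classes sum to 72 sequences, and the 5-exon and 6-exon classes make up 59.7% and 30.6% of the family, respectively.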
Transcript Isoforms and Conserved Splice Site Analysis
To investigate the splicing patterns and conserved splice sites of the animal SMNDC1 family genes, we performed an alternative splicing (AS) analysis of the animal SMNDC1 genes. A total of 36 transcript isoforms from 15 animal SMNDC1 genes were retrieved from the Ensembl database and linked to the phylogenetic relationships among the selected species (Fig. 6). In particular, SMNDC1 in Rattus norvegicus and Mus musculus has the most isoforms, with five transcript isoforms each, whereas SMNDC1 in each of the other 13 animals has two transcripts. In addition, conserved protein motifs were identified from the potential protein products of these transcript isoforms using MEME (Fig. 6, right panel). The splicing events are mainly located within the SMN protein domain. The primary transcript has the longest peptide sequence and the most conserved motifs, while the alternatively spliced transcripts encode shorter proteins with fewer motifs. The alternative splicing types of SMNDC1 in the 15 animal species are mainly alternative 3′ splice site and alternative 5′ splice site selection, and exon skipping was also detected in Rattus norvegicus and Mus musculus. Furthermore, conserved splice sites and conserved sequences were identified. Flanking sequences (31 bp in total) of the animal SMNDC1 genes were analyzed with WebLogo and multiple alignment to show their consensus. Five representative splice sites were identified (Fig. 7A, B).
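The consensus analysis of splice-site flanking sequences can be illustrated with a small calculation of per-position base counts, consensus bases and information content in bits, which is the quantity a sequence logo displays. This is a simplified sketch and not the WebLogo implementation: the function names are hypothetical, the toy windows are 9 bp invented sequences rather than the 31 bp windows used in the study, and no small-sample correction is applied.

```python
import math
from collections import Counter


def position_counts(seqs):
    """Count bases in each column of equal-length, pre-aligned sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i].upper() for s in seqs) for i in range(length)]


def information_content(counts):
    """Information in bits for one DNA column: 2 minus the Shannon entropy."""
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - entropy


def consensus(seqs):
    """Most frequent base per column, joined into one string."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(seqs))


# Invented toy windows around a 5' splice site (illustration only).
toy_windows = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
toy_consensus = consensus(toy_windows)
toy_bits = [information_content(c) for c in position_counts(toy_windows)]
```

Columns in which every window carries the same base reach the maximum of 2 bits, while variable columns score lower, which is how conserved positions stand out in a logo.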
Expression profile analysis of animal SMNDC1s
To further investigate the potential functions of animal SMNDC1 in response to developmental cues or in relation to disease, we analyzed the expression patterns of SMNDC1 genes from Homo sapiens and Mus musculus. We reconstructed the expression profiles of SMNDC1 across developmental stages, tissues and cell types, and disease conditions using the BAR Heat Mapper Plus tool (Supplementary Figures S1–S6). The Homo sapiens disease proteomics data showed that the SMNDC1 protein was highly abundant in multiple cancer types, including breast cancer (breast tumor luminal, HER2 positive breast carcinoma and triple-negative breast cancer), colon cancer (colon adenocarcinoma and colon mucinous adenocarcinoma) and rectal cancer (rectal cell carcinoma and rectal mucinous adenocarcinoma) (Figure S3). Moreover, the human SMNDC1 transcript is widely expressed across body tissues, including skeletal muscle, adult brain, spinal cord, testis, liver, ovary and lung (Fig. 8, Figure S2), while mouse SMNDC1 is highly expressed in brain tissue (Fig. 8, Figure S2). Furthermore, cell type expression analysis showed that human SMNDC1 was highly expressed in granulocyte monocyte progenitor cells, hematopoietic multipotent progenitor cells and hematopoietic stem cells, while mouse SMNDC1 was expressed in naive thymus-derived CD4-positive alpha-beta T cells and embryonic stem cells, and accumulated in induced T-regulatory cells and T-helper 17 cells (Fig. 8, Figure S4). In terms of developmental stage, human SMNDC1 was highly expressed in the fetal period and downregulated in the juvenile period (Figure S1), while mouse SMNDC1 was highly expressed in the embryonic period but did not accumulate abundantly in the fetus (Fig. 8, Figure S4). In addition, we paid particular attention to the expression of the SMNDC1 gene in cancer and other diseases.
Specifically, transcriptome data revealed that human SMNDC1 was expressed at higher levels in cancer tissues than in normal paracarcinoma tissues and normal tissues (Figure S3). Human SMNDC1 had the highest expression level in ovarian adenocarcinoma, followed by esophageal adenocarcinoma (Figure S3), while at the protein level it was enriched in breast, colon and rectal cancer. In summary, we found that SMNDC1 is highly expressed in ovarian adenocarcinoma and digestive system diseases, and it may serve as a valuable diagnostic marker or therapeutic target in clinical treatment.