Exclusive amino acid changes in the DNA binding domain of mCdx2
The house mouse (Mus musculus) and the common rat (Rattus norvegicus) are two of the most widely used model organisms in biomedical research. As TF-DNA binding is crucial in gene regulation, we were interested in identifying TFs which might differ in this important aspect between mouse and rat, which might have undergone adaptive evolution in either lineage but are conserved with human in the other. To find such a candidate TF we selected all 1194 TFs with one-to-one orthologs among mouse, rat and human. Systematic analysis of the DNA binding domain (DBD) sequences of these TFs between human, mouse and rat showed the DBD sequence of Cdx2 was identical between human and rat, but possessed three amino acid changes in mouse. Moreover, by analyzing the protein sequences of Cdx2 among four additional model species, surprisingly, we found that these three specific amino acid changes are exclusively present in mouse (Fig 1A). Further comparison of Cdx2 DBD sequences in 56 vertebrate species showed that the DBD is highly conserved in general (Fig S1A) whereas the three amino acid changes were present in all mouse strains with available genome sequences (Fig S1B). Based on this observation, we inferred that the three amino acid changes have occurred on the Mus lineage shortly after its divergence from the Rattus lineage at about 12.5 to 5 MY (million years) ago (Fig 1B) [35]. Taken together, our evolutionary analyses indicated that mouse Cdx2 had three specific amino acid changes in the DBD within a short evolutionary time frame, and the three changes were then fixed in all mouse species, potentially under positive selection. Therefore, we went on to examine whether these three amino acid changes caused any functional alteration of Cdx2.
Cdx2 functions as a pioneer factor and induces trophoblast-like epigenomic and transcriptomic landscape
Cdx2 is a homeobox transcription factor essential for the development of the intestinal epithelium and the placenta [32, 36]. Serving as the first lineage specification marker [30-32, 37], previous studies have shown that ectopic overexpression of Cdx2 could efficiently promote the differentiation of ES cells to trophoblast stem cells [30, 31, 33]. Therefore, to investigate the function of Cdx2 during this process, we applied a doxycycline (DOX) inducible Tet-On system to induce Flag-tagged Cdx2 expression in mESCs and analyzed its function by measuring transcriptome and epigenome changes (Fig 2A). To check the suitability of this system, we first investigated the molecular function of mouse Cdx2 (mCdx2). As shown in Fig 2B, after successful induction of mCdx2 (Fig 2C, S2A and S2B), typical round shape colonies of undifferentiated cells were disappearing, while differentiated epithelial-like cells with flat or square shape started to appear.
We then compared the gene expression of the mESCs before and after DOX induction using RNA-seq (Fig 2A). As shown in Fig 2D, a total of 2603 genes exhibited significantly differential expression (|GFOLD|>1, TPM>3), with 1463 genes (56.2%) upregulated and 1140 (43.8%) genes downregulated after DOX induction. Notably, trophoblast stem cell (TSC) related markers such as Cdh2, Ascl2, Gata6 and Gata3 showed moderately increased expression while the expression of ESC markers such as Sox2 and Oct4 decreased, in consistence with the incipient trajectories of ES differentiation into TSCs (S2C Fig).
The massive differences in gene expression after mCdx2 induction suggested that dramatic differences in the epigenetic landscapes could be induced after mCdx2 overexpression (mCdx2-OE). To address this, ATAC-seq was carried out on cells with and without mCdx2-OE, respectively. A total of 39,369 peaks showed significantly different openness after mCdx2 induction (|log2 (fold change)|>1, total read count>=40, Fig 2E). To gain a deeper insight into the mCdx2 mediated epigenomic changes, we divided all the ATAC-peaks into three categories, namely UP (newly opened chromatin regions in mCdx2-OE cells), DOWN (chromatin accessibility lost in mCdx2-OE cells) and COMMON (regions without openness change) (Fig 2F), and then performed sequence motif analysis in these three categories separately. As shown in Fig 2F, UP regions were mostly enriched for DNA binding motifs of Cdx and Hox family transcription factors which are important for lineage specialization, while the DOWN motifs were mainly enriched for Oct4, Sox and Klf family genes which are known pluripotency regulators, and the motifs of the common peaks were enriched for house-keeping chromatin regulators, such as Ctcf (Fig 2F).
Last, to characterize mCdx2 binding sites on a genome-wide scale, we performed ChIP-seq in mCdx2-OE cells. A total of 93,591 peaks were identified, with 6.47%, 46.27% and 47.26% located at promoter, gene body, and distal intergenic regions, respectively (Fig S2D). Importantly, mCdx2 binding peaks were differentially distributed between the three categories of ATAC peaks, with predominant binding at the UP group (Fig 2F). Taken together, the results above revealed that mCdx2 likely acted as a pioneer transcription factor, instructing cell fate specification by binding, then opening chromatin and finally initiating the expression of important downstream target genes.
Evolutionarily conserved function of mCdx2 and rCdx2 in transcriptional regulation
Then, to assess whether and how mouse Cdx2 might function differently as rat Cdx2 (rCdx2), we compared the effects of overexpressing mCdx2 and rCdx2. Like for mCdx2, we firstly established a Dox-inducible rCdx2 expression in mESCs. After 48h DOX treatment and successful rCdx2 induction, we observed similar morphological changes (Fig S3A) as with mCdx2 (Figs S3B-D). To explore the functional similarities and differences between rCdx2 and mCdx2 at the molecular level, we firstly carried out RNA-seq experiments on rCdx2-OE cells. At the transcriptome level, mCdx2-OE and rCdx2-OE cells showed high correlation (correlation = 0.992; p-value < 2.2 x 10-16, Fig 3A), suggesting their conserved effects on downstream gene expression. Then we conducted ChIP-seq to measure the rCdx2 DNA binding in rCdx2-OE mES cells. A total of 78,765 peaks were identified, of which 6.09%, 46.32% and 47.6% located at promoter, gene body, and distal intergenic regions, respectively Fig S3E). The peak distribution pattern and the global binding signal showed nearly no differences between mCdx2 and rCdx2 (correlation=0.977, Figs 3B and 3C). As shown in Fig 3D, Hoxa family genes, well-known direct targets of Cdx2 exhibited the same DNA binding and gene expression change after mCdx2 and rCdx2 overexpression. Finally, we performed de novo motif analysis in mCdx2 and rCdx2 ChIP peaks, separately, and found an identical binding motif for mCdx2 and rCdx2, the same as previously reported for mCdx2 (Fig 3E). Collectively, these results indicated mCdx2 and rCdx2 are functionally conserved at the molecular level.
Mouse-specific amino acid changes in DBD, neither alone nor in combination have any effects on Cdx2 function
Based on the results above, the three amino acid changes in the DBD combined together did not induce any functional changes in our DOX-induced overexpression ES cells. Interestingly, whereas the three changes were observed after mouse-rat divergence, no other murine species were found to contain only one or two of the three changes. One possible evolutionary scenario was that one or two of the changes decreased DNA binding and were compensated by (a) later change(s) (i.e., the DBD found in mouse with all three amino acid changes in combination had the same binding affinity as that found in rat or human but changing only one or two of the amino acids could affect Cdx2 binding and function). To test this hypothesis, we constructed a panel of seven mutated plasmids consisting of all possible single-site changes, two-site changes and three-site changes in rCdx2 (Fig 4A and S4A). Then we established stable cell lines transfected with these inducible mutated Cdx2s. After DOX treatment, all the cells tended to differentiate, manifested typical morphology changes (Fig S4B), and no obvious differences were found among all the mutated Cdx2 overexpressing cells. At the molecular level, we performed RNA-seq on cells overexpressing all different mutants (Fig S4C). As shown in Fig 4B, correlation analysis showed all the mutated Cdx2s had similar effects on the transcriptome compared to mCdx2 and rCdx2. Together, these results suggested no functional differences among the different mutants in our DOX-induced overexpression mESCs.
Cis-driven species-specific DNA binding of Cdx2
Transcriptional regulation of gene expression is mediated by both cis and trans components. The results shown above indicate that the amino acid differences in the Cdx2 DBD did not lead to divergent trans-regulatory effects between rat and mouse. We then turned to the cis-regulatory part of Cdx2-mediated regulation. One common strategy to study cis-effects is to compare the gene regulation between two alleles in F1 hybrid [3, 5, 12]. In F1 hybrids, both parental alleles are subject to the same trans-regulatory environments; thus, observed differences in allele-specific regulatory patterns should reflect only the impact of cis regulatory divergence. However, this approach cannot be conducted between mammalian species with long evolutionary distance, such as mouse and rat, due to reproductive isolation. To circumvent this limitation, we took advantage of the previously established RMES cells which contain a haploid mouse genome and a haploid rat genome [34]. Using this system, we then explored the cis divergence of Cdx2 mediated regulation between mouse and rat in an unbiased manner. Again, we established stable Cdx2-overexpressing RMES cell lines by transfecting Flag-tagged mCdx2 (mCdx2-OE RMESCs) and Flag-tagged rCdx2 (rCdx2-OE RMESCs), separately. Typical differentiation characteristics were observed after 48 hours DOX induction (Fig S5A). As before, RNA-seq and ChIP-seq were then conducted. Consistent with the observation in mESCs, mCdx2 and rCdx2 had similar DNA binding patterns in either mouse or rat genome, and induced similar gene expression changes on either rat or mouse allele in RMES cells (Figs S5B-E). In the following analysis, we combined the RNA-seq and ChIP-seq datasets from mCdx2-OE and rCdx2-OE RMESCs as experimental replicates.
Then we compared the Cdx2 binding sites between the mouse and rat genomes. In order to check how these binding sites evolved, we classified the binding sites determined by ChIP-seq based on whether the peaks in one species could be aligned to the other species and if yes, whether the aligned sites were also bound there (Fig 5A). The first two categories, including conserved and loss peaks represented the alignable sites, whereas the third one were those that could not be aligned to the other species. As shown in Fig 5B, conserved peaks, where the aligned binding sites were also bound, accounted for 23.9% (mouse to rat direction: 17,321 peaks) and 22.4% (rat to mouse direction: 17,317 peaks), respectively; loss peaks, referring to no binding in the orthologous regions, occupied 57.3% (mouse to rat direction: 41,430 peaks) and 58.5% (rat to mouse direction: 45,105 peaks), respectively; unaligned peaks constituted 18.8% (13,580 peaks, mouse to rat) and 19.1% (14,715 peaks, rat to mouse), respectively. The distributions of the three groups were similar between the binding sites at proximal and those at distal regions (Tables S1 and S2).
Then we compared the signal intensities among these three peak classes. As expected, the binding affinities were the highest for conserved peaks and there were no significant differences between these conserved sites bound at the rat and mouse genomes (Figs 5C, 5D, S6A and S6B). Consistent with this, a gradual decrease in the frequency of Cdx2 binding motifs near binding sites was observed from conserved to loss peaks (Figs 5E and S6C). To determine whether binding differences arise from sequence changes at potential binding sites, we checked the mutation rate among different peak categories. As expected, higher proportions of matched bases occurred in the conserved peaks relative to loss peaks (Fig 5F and S6D). Therefore, the cis-driven DNA binding differences can be attributed to the sequence differences, particularly those affecting binding motifs.
Cis-regulatory divergence of Cdx2 binding causes a fraction of species-specific gene expression differences
To further explore whether the divergent DNA binding could cause the species-biased gene expression, we compared Cdx2 induced transcriptional changes between mouse and rat alleles (Methods). As shown in Fig 6A, a total of 7,372 candidate genes were divided into five categories based on their expression changes after Cdx2 induction: type 0: no change, type 1: change only in the mouse allele, type 2: opposite changes between mouse and rat, type 3: change only in the rat allele, type 4: similar changes in both mouse and rat. A majority of genes (5592 genes, type 0) were not affected by Cdx2 induction for either the mouse or the rat allele; 496 genes (type 1) and 425 genes (type 3) with mouse-only and rat-only changes, respectively, reflected species-specific regulation; the type 2 category contained only 1 gene, indicating that opposite changes between mouse and rat after the same TF stimulus were extremely rare; a significant proportion of genes (858 genes, type 4) with similar expression changes after Cdx2 induction between mouse and rat suggested an evolutionarily conserved regulatory pattern (Fig 6B). Based on the GO analysis, we found that type 4 genes were highly enriched in functions related to “organism development and cell differentiation, consistent with the known function of Cdx2 in early development (Fig 6C). In addition, the magnitude of gene expression changes was highest for type 4 genes, further suggesting their important functions (Fig 6D).
Finally, we analyzed the relationship between differential binding and gene expression. For this purpose, we checked the distribution of peaks located in different groups of genes. As expected, the proportion of conserved peaks in type 4 was moderately higher than in that of the other types (Figs 6E and S6E). Similarly, the relatively higher percentage of loss peaks in type 1 and unaligned peaks in type 3 suggested the role of these species-specific binding in species-specific regulation. Nevertheless, among type 4 genes with conserved expression changes, more than 70% had no Cdx2 binding sites conserved between the two species. Therefore, at the cis regulatory level, whereas the Cdx2 binding was largely divergent between mouse and rat, the transcriptional effect induced by Cdx2 was conserved to a much larger extent.