The A and B homeologs were identified using a strict syntenic approach. A and B calls were expanded using an in silico genome painting technique. All calls were checked against each other and using kS rates to determine if A and B calls behaved as expected. Genome guided phylogeny was used to examine the relationship of A and B homeologs to closely related Eleusine species. Finally, we demonstrate evidence and examples of sub-genome bias occurring within Eleusine coracana. The B subgenome shows a higher repeat content, rate of gene deletion within cytoplasmic interacting genes , and lower expression profiles within the Circadian Rhythm Pathway.
Identification of A and B Homeologs Employing Synteny
Synteny is the identification of homologous blocks of genes via conserved collinear patterns. The use of this method provides strong evidence for homology between genomes, but it provides limited coverage of the genomic assembly. To identify A and B genes a 2 step method was employed. First A and B calls were made using the direct gene to gene comparison. Gene relationships were determined using CoGe SynFinder to align the E. coracana genome to the E. indica genome with a syntenic depth of 2 to 1 respectively. This comparison created triads of genes, in which two E. coracana genes putatively from the A and B subgenome were linked by one E. indica gene. The E. coracana gene most similar to the E. indica gene was annotated as A while the other was annotated B. A total of 12,296 direct calls were made, 500 triads were uncalled because both E. coracana copies were equidistant from the E. indica copy. Direct calls were used to infer if the syntenic block they occurred in was A or B. If a block had significantly more A or B calls under chi square (p=0.05), it was called A or B respectively; 704 blocks were called (351 A, 353 B) and 116 blocks had no call, 76% (88/116) of uncalled blocks contained fewer than 8 genes, in these cases 1 conflicting call was enough to keep the block from being characterized as A or B. Syntenic block calls were used to obtain contig calls. Contig calls were divided into 3 categories, strong, provisional and ambiguous. Strong contained no uncalled blocks. Provisional contained a single uncalled block, and ambiguous contained multiple uncalled blocks. A contig could be called as A , B or a crossover, if it contained both A and B calls. We identified 12 strong and 16 provisional crossovers, 68 strong and 29 provisional A contigs, and 82 strong and 23 provisional B contigs. The subgenome identity of 33% of the 62,347 predicted genes in E. coracana (9,985 A genes and 10,349 B gene) was inferred usings strong syntenic region and surrounding sequence d .
In Silico Genome Painting
A genome painting scheme using repetitive elements was employed to expand the number of A and B homeologs identified because a strictly syntenic approach limited the scope homeolog identification. Repetitive elements were identified for the entire E. coracana genome using RepeatScout with default settings and its output was used to create a custom database for RepeatMasker for annotation. A subset of A contigs (61) and B contigs (73) were chosen to identify A and B repeat elements. The B genome had a higher density of repetitive elements per sliding 100 kbp window (Fig.1). A total of 50,416, repetitive elements longer than 200 bp were identified, 21,481 A and 28,935 B, spanning 81 mpb (A 32 mbp and B 49 mpb). The longest repetitive elements spanned several thousand base pairs (A 12,578 bp) and (B 15,440). For all retro-element families 898 were represented in both subgenomes by 84,965 entries (A 30,896 and B 54,069), while 180 families were uniquely predicted for one subgenome (A 50 and B 130) with 9,281 entries (A 1,175 and B 8,106). Reads and their pairs that mapped to either A or B were extracted, and mapped against the entire genome. High quality coverage mapping was calculated for A and B read mappings where, both reads were mapped to the same contig and their insert was less than 1000bp, with mapping score of greater than or equal to 30. A sliding window was used to sum all A and B reads mapped to a region of the genome, and region calls were made and aggregated using a custom python script (abPainting.ipynb). Using this method we called 31,543 A genes and 28,483 B genes. When added to existing cala total 59,377 unambiguous calls accounting for 81% of the gene annotations were made (Additional File 1).
The Identification of the A Sub-Genome Donor
The A sub-genome donor was identified using a unified genomics approach, and the absence of the B sub-genome donor was identified in the largest set of species to date. We produced the largest Eleusine super-matrix compiled to date containing 455 genes to resolve A and B genome relationships within Eleusine coracana (Fig. 2abc). We tested the effects of targeted analysis on the Eleusine indica genome to determine if using a targeted assembly produced false B transcripts from a completely A genome. The results show that some, 31, putative B transcripts assembled, but that they were in the same clade as E. indica and the A genome (Fig. 2b). This demonstrates that the process of targeted assembly does not create B genomic transcripts as an artifact, or unduly bias the transcriptomic assembly process. The successful assembly and inclusion of 31 putative B transcripts suggests that our syntenic approach was not sensitive to cases of gene conversion, which would be expected when designating A and B genes by region. Phylogenomic analysis reveals that E. indica is indeed sister to the A genome and that Eleusine tristachya is sister to the E. indica - A genome clade while the B genome is sister the E. tristachya - A genome clade, further confirming that the B genome did not arise from E. floccifolia[10]. More complete species level sampling is required to determine the precise relationship of E. indica to the Eleusine africana A genome and the E. coracana A genome. According to the study of Zhang et al. (2019) phylogenetic analysis supports E. indica as the maternal parent of E. coracana and E. africana, in addition to a close relationship between E. indica and E. tristachya, and between E. floccifolia and E. multiflora, and E. intermedia as a separate clade. So a rethinking of the labeling of ancestral genomes of E. floccifolia, E. multiflora, and E. intermedia maybe in order.
Genome guided phylogenomic approach established an expected pattern of divergence among subgenomes for kS (Fig. 3). kS patterns confirm genome calls made by genome painting methods by the production expected profiles between Eleusine indica and A , and E. indica and B.The kS values suggest that there are a small number of mis-characterized genes or instances of gene conversion which would be expected given the size of the sliding window used to characterize A and B blocks. The comparison between E. indica and the A genome shows a peak at approximately 1.1 mya when using the standard conversion rate of 6.5 e-9 substitutions per year [46].
Analysis of Repetitive Content of the A and B Sub-Genomes
Analysis of repetitive DNA occurring on high confidence called contigs using RepeatMasker and custom repeat library indicated that the B region contains more repetitive elements per base pair than the A genome. Analysis of repeat family density within a sliding window of 100 kbp was applied to A and B homeologs using syntenic blocks, and it revealed a significant difference in repeat count density for only one family, LTR_Copia, out of 30 families. Most families exist in clusters 1-20 per 100,000 bp, and DNA_Mule_MudDR is a striking example of a repeat occurring at dense clusters in the A genome, with a maximum of 114 in the sliding window, compared to 13 for the B genome. Several of the TE Line_L1 elements show an elevation in the density of single count insertions in the B genome which contained 1080 windows containing a single Line_L1 compared to the A genome which had 823 windows containing a single Line_L1 (Additional File 2). When all repeat counts and coverage are taken together they show that B contains more variation in repeat counts per sliding window while it also shows markedly higher coverage. This suggests that the repetitive elements in B are less well controlled and may be undergoing an expansion relative to A repeats (Figure 1), 17.4% of windows from the B contigs had coverage of 75% or greater compared to 9.64% in A contigs. This wider coverage of repeats is likely to have impacts on expression levels because of TE induced methylation (Fig. 1).
A Case of Expression Bias in the Circadian Rhythm Pathway
Expression of A and B genes within the circadian rhythm pathway (ko04712) [47] visibly demonstrated homeolog expression bias with a subgenome preference of A, in line with expression patterns suggested by repeat coverage. PHYA (K12120), LHY, PRR5 (K12130), CSNK2A (K03097), FKF1 (K12116), and COP1 all showed a strong A bias in their expression, while CHE (K16221) was the only gene to show a strong B preference caused by the loss of the A homeolog (Additional File 3). This was confirmed by examination of the region with Gevo using CoGe (https://genomevolution.org/coge last accessed: 3/18/2020). While the rest of the genes exhibited a mixture of A and B homeolog expression. There are a couple interesting cases, first floral transcriptome sets showed a marked preference in A homeolog expression. Second, comparison of pooled drought and control leaf tissue samples from a single study [16] showed notable shifts from A to B homeolog expression for key genes: ZTL, ELF3 (K12125), CRY1, and LHY further examination of individual readsets confirmed A to B switching when expression was detected, for LHY. The circadian rhythm pathway shows a visible preference for A homeolog expression but does exhibit a case of putative A homeolog loss.
B Biased Loss of Cytoplasmic Interacting Genes
When the patterns of homeolog loss were examined at a global scale a strong pattern of biased fractionation favoring A genome homeolog retention was detected. A comparison of syntenic dyads created between hard masked assemblies of Setaria italica and E. coracana using CoGe, show that 61% singletons are A, that 38% are B, and 1% are unassigned (A: 1506, B:925, Unassigned: 21) (Additional File 4,5,6).
To determine if cytoplasmic genes showed appreciable subgenomic bias in their retention we started with a list of 4,042 genes that were determined to be lost from either the A or B subgenome during an unmasked syntenic analysis. These genes occur in E. indica to E. coracana dyads not the expected E. coracana to E. indica to E. coracana triads. In the initial blast, 111 single copy genes that likely have potential cytonuclear interaction were identified. These were further examined in the E. coracana genome using CoGe (https://genomevolution.org/coge/ last accessed: 3/18/2020) blast which found that 26 of them had a single hit while eight of them had two hits but the second hit had either partial or poor alignment. The remaining 77 genes had either two or more hits. Out of the 34 genes, 24 were on the A subgenome and 10 were on the B subgenome (Additional File 7). These results suggest that the rate of gene loss is generally slow, but that genes on the A subgenome are favored for retention.
Thirty four genes were identified as single copy and involved several important pathways: defense signaling pathways, synthesis of indole-3-acetic acid, RNA interference, heat shock proteins, seed germination, plant growth formation and repair of photosystem II super complex, protein kinase, and flowering time. Some of the interesting genes that are important for normal functioning of chloroplast and mitochondria included Met1, Ribose-5-phosphate isomerase 3 (RPI3), Brassinazole insensitive pale gene-2 (BPG2) 3-hydroxyisobutyrate dehydrogenase (HIBADH), Translocase of outer membrane 34 kDa (TOM34).