Flow sorting and sequencing of chromosome arm 6VS of H. villosa
Flow cytometric analysis of chromosomes isolated from T. aestivum-H. villosa 6VS ditelosomic addition line resulted in bivariate flow karyotypes FITC (log scale) vs. DAPI (linear scale) fluorescence on which a number of populations could be resolved (Figure 1). The population representing 6VS telosome was identified after screening all populations with lower DAPI fluorescence, which were expected to correspond to smaller chromosomes. Microscopic analysis of flow-sorted particles after FISH with probes for
pSc119.2 and Afa family repeats enabled unambiguous identification of the population representing 6VS telosomes (Figure 1). A detailed microscopic analysis showed that 6VS telosome could be sorted at an average purity of 89.41%. The sorted DNA was amplified by multiple displacement amplification (MDA) reactions before Illumina sequencing. Sequencing of DNA amplified from flow-sorted chromosome 6VS in Illumina MiSeq system generated 47.7 Gb high-quality paired-end reads from two libraries, with insert sizes of 500 bp and 1,000 bp, respectively.
After assembly using Hecate software, a total of 230.39 Mb draft sequences was obtained. The sequences consist of 153,177 scaffolds, with the maximum and minimum lengths of the scaffold of 138,620 bp and 100 bp, respectively, and contig N50 length of and mean length of 9,788 bp and 1,464 bp, respectively.
Identification of repetitive DNA elements
Using RepeatMasker software, a total of 181.29 Mb out of 230.39Mb 6VS assembly was identified as repetitive sequences which accounted for 74.91% (Table. 2). Among the repeat elements, the most abundant were LTR retrotransposons comprising about 88%, out of which Gypsy superfamily repeats accounted for about 69.3%, followed by Copia superfamily, which comprised 14.43%. DNA transposons were mainly represented by TIR family, which made up about 8.72% of all repeats. After masking all repetitive DNA elements, the remaining non-repetitive sequence reads from 6VS equaled 49.1 Mb, which was used for the following gene prediction and sequence comparisons.
Gene content of chromosome 6VS
Ab initio gene prediction using AUGUSTUS software preliminarily identified 5,973 predicted coding genes from repeat-masked scaffold of 6VS. After using transcriptome data of H. villosa (data not show) as the evidence of coding loci, 3,276 genes on 2,871 scaffolds of 6VS were retained and deemed as high-confidence. The gene length distribution is shown on Fig. S1A. The genic sequences represented a total length of 5,278,412 bp, which accounted for 2.3% of the 6VS assembly. Totally, 1,672 genes were classified to one or more Gene Ontology (GO) terms (Fig. S1B). To summarize, the number of genes which was annotated into biological process, molecular function and cellular component were 1,432, 1,150 and 1,441, respectively.
In order to test the annotation quality of the 6VS, we used genes NLR-V [32], STPK-V [33] and NAM-V1 [9] cloned from 6VS to perform BLASTn search. We found sequences homologous at 99.93%, 100.00% and 99.93%, respectively, implying a high quality of H. villosa 6VS annotation. Thus, the 6VS draft sequence obtained in this work will facilitate extensive mining of 6VS genes in wheat breeding.
Comparative analysis of 6VS sequence composition
The genomic reference of common wheat cv. Chinese Spring released by IWGSC [34] was used to identify the syntenic regions of 6VS on wheat chromosomes 6A, 6B and 6D, using the high-confidence genes. We have also identified 6VS syntenic regions on chromosomes 6A and 6B of tetraploid T. dicoccoides, 6D of Ae. tauschii, 6A of T. urartu and 6H of H. vulgare. After filtering, 2,867 6VS genes had 1,499, 1,577 and 1,430 blastn hits with homologous genes in wheat chromosomes 6A, 6B and 6D, respectively; the number of hits with homologous genes in T. dicoccoides chromosomes 6A and 6B was 1,323 and 1,374, respectively; in Ae. tauschii chromosome 6D it was 1,301, in T. urartu 6A it was 1,424 and in H. vulgare 6H it was 1,307. Moreover, 634 out of 2,867 genes were shared among all eight genomes. The syntenic genes on T. aestivum chromosomes 6A, 6B and 6D, T. dicoccoides chromosomes 6A and 6B, Ae. tauschii chromosome 6D, T. urartu chromosome 6A and H. vulgare chromosome 6H were plotted on chromosomes to highlight the syntenic regions, according to their physical position (Fig. 2). As expected, the syntenic regions with high gene density were observed on chromosomes 6AS, 6BS and 6DS of T. aestivum, 6AS and 6BS of T. dicoccoides, 6DS of Ae. tauschii, 6AS of T. urartu and 6HS of H. vulgare.
NB-ARC domain proteins are commonly known as disease resistance genes. In the 6VS assembly, a total of 45 genes were predicted to encode NB-ARC domain proteins using HMMER model [35]. In a separate project, we analyzed transcriptome of the wheat-H. villosa T6VS/6AL translocation line after the treatment with two Blumeria graminerum f.sp tritici (Bgt) isolates E26 and E31 (data not shown). We found that 28 genes were expressed after inoculation of both isolates within 24 hours, with 15 genes up-regulated two-fold or more when compared to the control (Figure 3). As 6VS chromatin introduced to wheat showed the main contributor of the resistance to various Bgt isolates, 6VS genes might be involved in the innate immunity of H. villosa to powdery mildew.
Wheat cultivars with 6VS/6AL translocation have been used extensively in wheat production with more than four million hectares in China [32], not only due to broad-spectrum resistance to powdery mildew, but also due to their contributions to higher 1000-grain weight (TGW) [36]. In a previous study, TaGW2-6A was described as a negative regulator of grain-width and grain-weight [37-39]. Four SNPs that occurred in the promotor region of TaGW2-6A were reported to be associated with TGW at positions -998bp, -739bp, -593bp and -494bp, in which SNP at -494 bp showing significant association with TGW and was located in the ‘CGCG’ motif [37]. SNP-494 has most effect on TaGW2-6A expression level and TGW, with haplotypes of A allele having significantly lower TaGW2-6A expression and higher TGW compared with those with G allele. To figure out if the increased TGW of 6VS/6AL translocations was due to the substitution of 6AS with 6VS, the TaGW2-6A gene homologue HvGW2-6V was identified in the 6VS assembly. The HvGW2-6V in H. villosa belongs to G allele at SNP-494, which was associated with low TGW (Figure 4). We speculate that higher TGW of 6VS/6AL translocations might be affected by other genes rather than GW2-6V, or that the expression of alien gene is suppressed due to genomic shock in wheat background although the genotype at position -494 was the same with low TGW.
Development of intron targeted (IT) markers
Zhang et al. and Wang et al. developed IT markers for all chromosomes except for 6VS of H. villosa [22, 23]. With the shotgun sequences of 6VS, IT markers for this short arm could now be developed. All 2,063 annotated genes from Ae. tauschii 6DS were aligned against the wheat genome reference and 6VS assembly sequences to determine exon-exon junction lengths on chromosomes 6AS, 6BS, 6DS of T. aestivum as well as on 6VS. A total of 222 genes had the intron length in H. villosa differing by at least 10% as compared to those in wheat subgenomes A, B and D. Then, we designed PCR primers in the conserved exon regions which flanking the targeted introns using Primer 3 software (http:// frodo.wi.mit.edu/primer3/).
In order to test the 222 IT primers on 6VS of H. villosa, we perform PCR analysis for the DNA from T. aestivum cv. Chinese Spring (AABBDD), H. villosa (VV) and T. aestivum-H. villosa T6AL·6VS translocation line. If the primer pair amplify a distinct PCR product visualized only in H. villosa, and T. aestivum-H. villosa T6AL·6VS translocation line while not in common wheat, it was considered 6VS-specific marker. In total, 120 6VS-specical markers were obtained with a success rate of 54.05% (Table S1). All IT markers were tested on three different translocation lines, NAU418, NAU419 and NAU1203, involving 6VS with different introgressed segments. The chromosome arm could be dissected into four bins: bin1 to bin4 (Fig 5), which contained 34, 11, 46 and 29 markers, respectively. Given that all three translocation lines are resistant, the resistant gene Pm21 was mapped within bin3. The 40 markers within this physical bin are suitable for marker-assisted breeding.