Inheritance analysis of watermelon stripe patterns
In our previous work, the lobed leaf gene ClLL1 has been fine mapped of using a F3:4 population (Wei et al. 2017), in which the rind stripe phenotype is also segregated (Fig. 1). During the early developmental stages of ovary, penciled stripes were regularly distributed over the pericarp covering by trichome. However, at approximately 6 days after pollination (DAP, D6 in Fig. 1), the stripe pattern could be visually characterized and distinctly distinguished at 8 DAP (D8), which was subsequently defined as standard green rind with stripes or without stripes at mature fruit period (D23) according to the descriptions of similar rind phenotypes published recently (HeeBum et al. 2015; Park et al. 2016). In a 1286 F3:4 segregated population, there were 987 individuals with stripes and 299 without stripes (Table 1), fitting a 3:1 Mendelian ratio (c2 = 2.01, p = 0.15), which inferred that the stripe phenotype is controlled by a single dominant gene.
To confirm this assumption, an inbred line M08 with standard green rind color and black stripes was used to cross with a dark green rind color and non-stripes line N21 to generate a new F2 population (Fig. S1). The fruit phenotypes of F2 offspring varied dramatically, including fruit shape, background color of rind, and stripe patterns. However, the latter could be clearly classified as stripes and non-stripes at mature fruit period (Fig. S1). Phenotypic data revealed that 358 individuals from F2 population contained 281 stripe and 77 non-stripe plants (Table 1), also consistent with the Mendelian ratio of 3:1 (c2 = 2.15, p = 0.13). Taken together, all these data suggested that the stripe phenotype in watermelon is controlled by a single dominant gene and designated as ClSP (Citrullus lanatusStripe Pattern) hereinafter.
Potential location of ClSP locus by BSR-seq analysis
As shown by morphological observations (Fig. 1), the stripe patterns could be visually distinguished at 8 DAP (D8) in the F3:4 population. To preliminarily located locus ClSP on chromosome, the method combining BSA and RNA-seq was used in this study. We separately selected 34 stripe and non-stripe pattern young fruits 8 days after pollination from the F3:4 population. Then, total RNA were independently extracted using approximately one square centimeter of fruit peel (Fig. 1), and subsequently bulked as stripe pool and non-stripe pool for sequencing. A total of 43.9 (6.59 Gb) and 51.1 million clean reads (7.67 Gb) were generated from stripe and non-stripe bulks, with Q30 values more than 91.00% (Table 2). After mapping to watermelon reference genome 97013 (V2), single nucleotide polymorphisms (SNPs) were called, and subsequently the allele frequencies (index) for each SNP were calculated for each bulk. According to the average values of delta SNP index plotting on chromosomes, we obtained a prominent peak at the end part of chromosome 6 (Fig. 2a), suggesting the possible presence of the ClSP in this corresponding region.
Genetic mapping of ClSP locus
To verify the potential location of ClSP locus, three polymorphic markers (W07302, W07094, and W07096) were developed to screen a small F3:4 segregating population with 119 individuals. Linkage analysis indicated that the ClSP locus was delimited in a 1.48 Mb region between markers W07302 and W07094 with thirteen and two recombinants, respectively (Fig. 2b). Another two new markers FD01071 and W07092 were designed and confirmed to be co-segregate with the object phenotype. To further narrow down this target region, a larger F3:4 segregating population consisting of 1167 individuals was utilized to genotype with the primary flanking markers W07302 and W07094, and 132 and 27 new recombinants were identified, respectively. Combined with the recombinants obtained in the initial small population, a total of 172 were used for further analysis. Consequently, seven new markers were designed in the primary region, and were used to genotype the 172 recombinants. Finally, gene ClSP was finally delimited in a 794.03 Kb region between markers W07061 and W07134, with 38 and 26 recombinants respectively (Fig. 2c), inferring suppression of recombination present in this interval. Due to lack of effective recombinants, further narrowing down of this mapping region was unfeasible and the five markers (W07132, W07163, W07062, FD01071, and W07092) were co-segregated with stripe patterns, which can be used for marker selection breeding program.
To further delineate this target region, a new F2 segregating population was generated by crossing inbred lines M08 (stripe pattern) with N21 (non-stripe pattern). The polymorphisms of 12 makers used in the F3:4 population were firstly checked between parental lines M08 and N21, and four of them (W05111, FD01071, W07092, and W07094) were effective (Fig. 2c). Then, marker FD01071 was used for linkage analysis, confirming the presence of underlying locus at chromosome 6. Subsequently, the two flanking markers W05111 and W07094 were subjected to screen a 1594 M08 x N21 F2 segregating population (containing the 358 individuals used for inheritance analysis), obtaining a total of 44 recombinants. Basing on the high-confident SNPs identified between M08 and N21 (Wei et al. 2019), three new polymorphic markers (W05112, W11041, and W07093), as well as the four effective markers (W05111, FD01071, W07092, and W07094 from F3:4 segregating population), were used to genotype the 44 recombinants. Finally, the ClSP was delimited in a 611.78 Kb region between markers W11041 and W07093 (Fig. 2c), with 14 and 2 recombinants respectively, further validating the presence of recombination suppression in this locus.
Gene annotation and haplotype block analysis of ClSP locus
According to the watermelon reference genome (97103, V2), there are 74 predicted genes in the 611.78 Kb target region. To obtain all possible annotated genes in this region, the syntenic regions of the ClSP locus in reference genome of 97103 (V1, 616.82 Kb) and another watermelon reference genome Charleston Gray (V1, 656.65 Kb) were comprehensively analyzed by BLAST searches, harboring 63 predicted genes for each genome (Fig. 3; Table S2). After visually removing the redundant genes, a total of 81 predicted genes were finally obtained, including three Myb transcription factors, two HD transcription factors, one WD40 transcription factor, and one NAC transcription factor, namely as candidate No1 to No7 based on their chromosome locations (Fig. 3; Table S2). Among them, the WD40 transcription factor gene Cla97C06G126710 (candidate No2) has been recently identified as the best candidate gene for stripe pattern trait in watermelon via GWAS approach (Guo et al. 2019). Furthermore, it is worth to note that gene Cla97C06G126560 encoding a polygalacturonase-1 noncatalytic subunit beta protein was referred as candidate No8 in the target region, because it is the ortholog of cucumber ist (Table S3), which is reported to regulate the irregularly striped rind pattern (Song et al. 2019).
Interestingly, according to nucleotide polymorphisms (SNPs and indels) identified in the BSR-seq data, at least four discrete haplotype blocks (haploblocks) were characterized in ClSP locus (Fig. S2). The first one covering about 77 Kb, exhibited consistent allelic structures among reference genome, stripe and non-stripe pools. The second haploblock spanning approximately 65 Kb and harboring candidate No8, was heterozygous in stripe pool but homozygous in reference 97103 and non-stripe pool. Similarly, the third block (~ 312 Kb) containing five candidates (No1-No5) was largest one, which was heterozygous in stripe pool and consisted of two distinct haploblocks from homozygous reference 97103 and non-stripe pool (Fig. S2). The last haploblock harboring candidates No6 and No7 covered approximately 122 Kb genomic region. Interestingly, we found this haplotype block was identical in two pools but distant from that in reference genome. In addition, an ambiguous haplotype block (~22 Kb) was also defined due to no transcriptional reads detected in this region. Nevertheless, two distinct haplotypes were distinguished between M08 and N21 based on the DNA-seq data, with stripe line M08 identical to reference 97103 (Fig. S2). Collectively, the sequence variations of haploblock among genotypes provide valuable information for candidate gene filtering.
Gene structure analysis of candidates
Intriguingly, gene structures of candidates No1-5 were completely different from their genome annotations according to our BSR-seq data (Fig. S3). For instance, candidate No1 (Cla97C06G126680) was predicted to contain three exons in reference genome 97103 (V2); however, the second as well as parts of the first exon was alternatively spliced in both stripe and non-stripe pools. Candidates No2 (Cla97C06G126710, two exons) and No3 (Cla97C06G126720, two exons) were adjacently distributed on chromosome 6 in the reference genome, while these two transcripts were predictably derived from an integrated amplicon (named as candidate No2_3 hereafter) according to the BSR-seq data (Fig. S3). Moreover, the original genomic sequence of candidate No3 (two exons and one intron) was transcribed as an entire exon in re-annotated No2_3. Similarly, the independent candidates No4 (Cla97C06G126730, two exons) and No5 (Cla97C06G126740, two exons) were also presumed to transcribed from an approximately 8 Kb transcript (designated as No4_5 afterward). Meanwhile, an extra exon was detected in the re-annotated candidate No4_5, positioning in the intergenic region between No4 and No5. Regarding as candidates No6 and No8, no visible variation was observed in their gene structures (Fig. S3). It is worth to note that no transcriptional read was mapped on candidate No7 (Cla97C06G127070) based on the BSR-seq data of stripe and non-stripe pools.
To confirm the gene structures described above, coding sequence (CDS) for each candidate was amplified from two bulks (primer combinations as shown in Fig. S3). Sequence analysis revealed that gene structures of candidate No6 and No8 in two pools were identical to their reference genome annotations (data not shown), consistent with the BSR-seq results. Meanwhile, no PCR product was cloned for candidate No7. However, referring to candidate No1, three alternative transcripts were detected in two pools and defined as No1.1, No1.2, and No1.3 respectively (Fig. 4a). Compared to the original structure of candidate No1, only one intron was validated with ‘CT..AC’ as splice junction in all three transcripts; meanwhile, the first exon was in different length and the second was short for 91 bp at its 5’ part. Notably, transcripts No1.1 and No1.2 were detected in both stripe and non-stripe pools, but the No1.3 was specifically transcribed in the former pool. Additionally, transcripts No1.2 and No1.3 were predicted with premature stop codons leading to truncated proteins. Similarly, three different transcripts of candidate No2_3 were also detected in both stripe and non-stripe pools, which contained two introns with ‘CT..AC’ as intron splice site (Fig. 4b). The second exon varied in length among these three transcripts, and the longest one was 501 bp in transcript No2_3.1. As mentioned above, the original candidate No3 (two exons and one intron) together with additional 31bp nucleotides were confirmed to be transcribed as a longer exon in re-annotated candidate No2_3. Due to candidates No2 and No3 encoding WD40 and MYB transcription factors respectively (Fig. 3; Table S2), we tried to translate all the three transcripts of candidate No2_3 into amino acids for function prediction. Unfortunately, no intact protein was obtained (Fig. S4, No2_3.2 and No2_3.3 not shown). According to the BSR-seq data, the re-annotated candidate N04_5 was hypothesized to consist of No4, No5 and an extra exon. As expected, only one transcript was detected in both stripe and non-stripe pools (Fig. 4c), with the third extra exon harboring 115 bp nucleotides. Furthermore, the candidate No4_5 corresponding to a 326 amino acid protein is a member of class I of Knotted related homeobox transcription family (Fig. S5), exhibiting high sequence identity with gene AtKAT6 in Arabidopsis (Table S3).
Sequence polymorphic variations of candidate genes
According to our re-sequencing data (DNA-seq of M08 and N21 lines, BSR-seq of stripe and non-stripe bulks), sequence variations of the candidate genes were analyzed among stripe and non-stripe pattern genotypes. As described above (Fig. S2), candidates No1, No2_3, and No4_5 were located in a diverse haplotype block which was homozygous in reference genome 97013, M08, N21, and non-stripe pool, but heterozygous in stripe pool. Consistently, the polymorphic sites in the CDS of these three candidates were identical in non-stripe pool and N21 contrast to the stripe lines 97103 and M08, which were heterozygous in stripe pool (Fig. 5). For both candidate No1 and No4_5, two out of four SNP sites led to amino acid changes. Notably, the third mutant at No4_5 from base C to T resulted in the 247th amino acid glutamine (Q) instead by a premature stop codon (TGA) in non-stripe plants. The subsequent amino acid substitution analysis of corresponding nucleotide mutants in candidate No2_3 was ignored, because of no intact protein predicted with its CDS (Fig. S4). In terms of candidate No6, polymorphic sites in CDS displayed an identical haplotype among two pools and the non-stripe N21, compared to that in stripe lines 97103 and M08 (Fig. 5). Similarly, sequence variations of candidate No7 displayed a consistent haplotype in stripe lines 97103 and M08, while no transcript was detected in two pools (stripe and non-stripe pools). As the ortholog of gene ist responsible for the irregularly stripped rind phenotype in cucumber (Song et al. 2019), candidate No8 contained three mutant sites in N21 (non-stripe phenotype) contrast to other four genotypes. Moreover, another three heterozygous sites were also confirmed in stripe pool by BSR-seq data and gene cloning assay, with the third variation causing amino acid substitution. In contrast to sequence variations in coding sequence, much more polymorphisms were detected in the introns of all the six candidates, especially in re-annotated No4_5 (Fig. 5).
Expression analysis of candidate genes
To examine the transcriptional accumulation of each candidate during stripe formation, we randomly chose a 23-DAP fruit with typical stripes, to precisely sample peel from stripe and non-stripe rind, respectively (Fig. 6a). Consistent with the BSR-seq data and gene cloning assay, no transcriptional product of candidate No7 was detected in both stripe and non-stripe peels. Additionally, only the transcription accumulation of No4_5 was significantly reduced in non-stripe peel, compared to the fluctuant expression levels of No1, No2_3, No6, and No8 (Fig. 6b), inferring the potential function of candidate No4_5 involved in stripe formation.
Putative mechanisms underlying stripe development
To explore the possible molecular mechanism responsible for stripe formation, transcriptome profiles were performed using peels precisely sampled from stripe and non-stripe rind of two striped fruits (Fig. 6a). After filtering the raw data, approximately 4.41, 4.59, 4.41, and 4.19 Gb clean data were obtained respectively, with Q30 values higher than 92.00% and more than 96.00% reads successfully mapped on reference genome 97103 (Table 2). Following p-adjust value < 0.05 and fold change >= 2 as thresholds, 213 up-regulated and 143 down-regulated differentially expressed genes (DEGs) were identified in non-striped rind (Table S4). Notably, only three DEGs were located in the ClSP locus, including the HD transcription factor Cla97C06G126740 (original candidate No4), a part of re-annotated candidate No4_5.
The GO categorization analysis revealed that 141 DEGs (86 up-regulated and 55 down-regulated) were significantly enriched in 38 GO terms (Fig. 7a; Table S5), belonging to three major categories biological process (BP, 7), cellular component (CC, 25), and molecular function (MF, 6). As the most prevalently enriched category, the majority of GO terms (21 out of 25) in cellular component (CC) finally pointed to three GO terms, ‘photosystem I’ (GO:0009522), ‘photosystem II’ (GO:0009523), and ‘chloroplast thylakoid membrane’ (GO:0009535) (Figs. 7b and S6). To better understand the metabolic pathways involved in stripe development, KEGG enrichment analysis was performed using p-value < 0.05 as the significance cut-off, obtaining six mainly concentrated metabolic pathways: photosynthesis (ko00195) and photosynthesis-antenna proteins (ko00196), metabolic pathways (ko01100), plant-pathogen interaction (ko04626), phenylalanine metabolism (ko00360), and porphyrin and chlorophyll metabolism (ko00860) (Table S6). Interestingly, we noticed that all the DEGs enriched in energy metabolism photosynthesis (ko00195) and photosynthesis-antenna proteins (ko00196) were down-regulated in non-striped rind, including 14 genes encoding Chlorophyll a_b-binding protein and 15 genes related to photosynthesis (Table S7). Moreover, twelve genes mediating calcium-signal pathway were significantly up-regulated in non-stripe peel relative to stripes, inferring the possibly uneven distribution of calcium during stripe development.
Transcriptional factors involved in stripe formation
Based on the plant transcription factor database PlantTFDB (http://planttfdb.gao-lab.org/index.php), 38 DEGs belonging to 13 families were annotated as transcription factors, including ERF, WRKY, C2H2, and NAC (Fig. 7c). Among them, the majority (31 out of 38 DEGs) were up-regulated in non-striped peels (Table S8). As the largest detected TF family, all the 9 ERF genes were significantly increased at transcriptional level, presumably suggested the vital function of plant hormone ethylene in stripe development. Additionally, six WRKY genes, four C2H2 and NAC transcription genes, as well as other 15 differently expressed TFs demonstrated the complex regulatory network of stripe formation in watermelon.
In addition, to verify the RNA-seq results, 20 DEGs were selected to quantify their relative expression profiles using qRT-PCR (Table S9). The results confirmed that genes involved in chlorophyll a_b-binding and photosynthesis processes were down-regulated in non-stripes relative to stripes (Fig. 8). However, the five genes related to calcium ion binding and ERF TFs were up-regulated in non-stripe peel. The highly consistent expression trends of 20 DEGs between RNA-seq and qRT-PCR validated the reliability of the RNA-seq data.