Materials and sample collection
The CMS-D8 system, a sterile line (A), maintainer line (B) and restorer line (R) were provided by the Institute of Cotton Research (ICR), Chinese Academy of Agricultural Science, Anyang, Henan, China. A BC1F1 ((A line ×R line) ×B line) segregation population was constructed, and all materials were grown at the Cotton Research Farm at the ICR. Fresh leaves were obtained from the parent lines and BC1F1 population. Anthers from buds1-2 mm, 3 mm, and 4 mm in length were collected and combined from 100 plants. All harvested samples were snap-frozen in liquid nitrogen and stored at −80°C before use.
Fertility segregation analysis
During anthesis, visual fertility surveys were conducted for 1623 individuals of the BC1F1 population of CMS-D8 under field trial conditions, three times per plant. The presence of pollen in a plant indicated fertility and was determined by squeezing the anthers between the fingers because the male sterility of CMS-D8 occurs during meiosis, the S(rf) gametes are sterile, and S(rf2rf2) produces no pollen. So, the observed values of fertile and sterile plants were obtained, the fertility trait segregation ratio compatibility test was carried out using Excel software, and the Chi-square value was obtained to determine the genetic model of Rf2.
DNA extraction, library construction and Illumina sequencing
DNA was extracted by the CTAB method [80], and the quality of DNA was assessed by 1.2% agarose gel electrophoresis. The purity of DNA was examined using an Agilent Technologies 2100 Bioanalyzer. The DNA concentration was estimated using a Qubit® DNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, CA, USA). Equal amounts of DNA (1.5 μg/sample) from 100 BC1F1 plants with fertile phenotypes were mixed to form the fertile sample (F-pool), and those from another 100 plants with sterile phenotypes were mixed to form the sterile sample (S-pool). Sequencing libraries were generated using the VAHTSTM Universal DNA Library Prep Kit for Illumina® V3 (Vazyme Biotech) according to the manufacturer’s recommendations. Briefly, the DNA samples were fragmented by sonication to a size of 300-500 bp. Then, the DNA fragments were end-polished, A-tailed, and ligated with the full-length adapter for Illumina sequencing by PCR amplification. Consequently, the PCR products were purified (VAHTSTM DNA Clean Beads (Vazyme #N411)), and libraries were analysed for size distribution by an Agilent 2100 Bioanalyzer and quantified by real-time PCR. The libraries constructed above were sequenced by the Illumina HiSeq platform, and 150 bp paired-end reads were generated with an insert size of approximately 350 bp.
Data analysis, data filtering, and alignment
The Fast x-toolkit (v 0.0.14–1) was used to filter out the low-quality reads such as reads with ≥10% unidentified nucleotides (N), reads with > 50% bases having phred quality < 5, r reads with > 10 nt aligned to the adapter, allowing ≤10% mismatches, and putative PCR duplicates generated by PCR amplification in the library construction process (read 1 and read 2 of two paired-end reads were completely identical). The released genome of Gossypium hirsutum was downloaded from the Cotton Research Institute (CRI) of Nanjing Agricultural University of China (http://mascotton.njau.edu.cn/Data.htm, v1.1) and used as a reference genome [49]. For mapping to the reference genome, BWA (Burrows-Wheeler Aligner) [81] was used to align the clean reads of each sample against the reference genome (settings: mem -t 4 -k 32 –M -R). Alignment files were converted to BAM files using SAMtools software [82] (settings: –bS –t). In addition, potential PCR duplications were removed using the SAMtools command “rmdup”. If multiple read pairs had identical external coordinates, only the pair with the highest mapping quality was retained.
SNP/InDel detection and annotation
Variant calling was performed for all samples by using the unified genotyper function in GATK [83] software. SNP was used as the variant filtration parameter in GATK (settings: --filterExpression "QD < 4.0 || FS > 60.0 || MQ < 40.0 ", -G_filter "GQ <20", --cluster WindowSize 4). InDel was filtered by Variant Filtration parameter (settings: --filter Expression "QD < 4.0 || FS > 200.0||Read PosRankSum < -20.0 || Inbreeding Coeff < -0.8 "). ANNOVAR [84], an efficient software tool, was used to annotate the SNPs and InDels based on the GFF3 files for the reference genome.
Determination of a candidate interval using SNP index and G’ values
To obtain a highly accurate SNP set, a range of filters were also employed [85]. The homozygous SNPs between two parents were extracted from the VCF files for SNPs. The read depth information for homozygous SNPs in the above offspring pools was obtained to calculate the SNP index [86]. The SNP index method was used for the analysis, and the SNP-index dot matched curve was obtained by regression fitting as described by Abe et al. [87]. The G’ values are used for noise reduction while also addressing linkage disequilibrium (LD) between SNPs [88]. To rule out the effects of unreliable markers, we screened markers based on the SNP index using the following conditions: 1) the sequencing depth of both parents was greater than 8; 2) both pools had a sequencing depth greater than 10; 3) the SNP index values of the two pools were not greater than 0.8 or less than 0.2 at the same time; and 4) the SNP index value difference was greater than 0.8. Sliding window methods were used to present the SNP index of the whole genome. Usually, we used a window size of 2 Mb as the default settings, and used QTL-seqr software to calculate Δ (SNPs-index) and G', based on the markers that were scanned [89].
Fine mapping of Rf2 using high-throughput SNP genotyping and InDel makers
For the analysis, DNA was isolated from the two parental lines and the BC1F1 population. The informative molecular markers were used for genotyping each plant of the BC1F1 population, various recombinants in the target region were identified, and the linkage relationship between markers and the Rf2 locus was analysed for gene mapping. According to the results of sequencing mutation detection and BSA of sequencing candidate intervals, one SNP was selected for approximately every 1.5 Mb of physical distance. The selected SNPs met the requirement of having a variation index close to 0.5 in the F-pool and 0 in the S-pool. A subset of 24 selected SNPs was used for genotyping by the Illumina HiSeq PE150 sequence. Individual BC1F1 population (except 200 plants that formed pools) plants were genotyped using high-throughput SNP genotyping. Based on the SNP locus exchange of individual plants, we narrowed down the range in which the target gene was located.
The InDel markers were developed based on the insertion deletion mutation within the selection interval. Primers were designed using Oligo7 software [90] and synthesized commercially (TSINGKE Biological Technology, Zhengzhou, China). The PCR system consisted of 20 μL of PCR mixture that contained 1× reaction buffer, 2.0 mM MgCl2, 0.2 mM dNTPs, 0.5 mM each primer, 1 U Taq DNA polymerase (Takara, Japan), and 50 ng of DNA template. The PCR amplification conditions were as follows: 35 cycles of denaturation at 94°C for 30 s, annealing at 56°C-58°C for 30 s, and extension at 72°C for 30 s. Then, the reaction was held at 4°C. The PCR products were visualized by 3.5% agarose gel electrophoresis. Based on the difference between the genotypes as assessed using polymorphic markers, recombinants were identified in the BC1F1 population and used to fine map Rf2.
Marker-assisted breeding of restorer lines
The utility of the InDel markers for marker-assisted selection was determined in a segregating population. First, the restorer line [N(Rf2Rf2)] of CMS-D8 was crossed with the recurrent parent [N(rf2rf2)], which has excellent agronomic characteristics. Beginning in the BC1F1 generation, the codominant InDel marker 1327 was used to track the restorer gene in each generation, and the other markers were used for further verification. Only those individuals verified by the markers were chosen as the female parent for successive backcrosses. In the BC4F1 population, 120 individuals were randomly selected, and then InDel 1327 was used to perform segregation analysis. The individuals verified by the markers as homozygous at the restorer gene locus were test-crossed with the sterile line [S(rf2rf2)] to determine the segregation of the fertility phenotype in the offspring under field conditions. The atpA SCAR marker distinguishes the CMS cytoplasm from other types of cytoplasm [2]. Here, the InDel 1327 marker was combined with the atpA SCAR marker to identify hybrids in the CMS-D8 system.
Real-time quantitative PCR (qRT-PCR) analysis
The annotated genes in the interval were analysed, and real-time quantitative PCR was performed to identify PPR family genes and differentially expressed genes that were selected based on anther transcriptome data of the CMS-D8 system (unpublished data). Total RNA was isolated using the Sigma Spectrum Plant Total RNA Kit (Sigma-Aldrich, USA) according to the manufacturer’s protocol. Reverse transcription was conducted using the PrimeScriptTM RT Reagent Kit (TaKaRa, Beijing, China). Trans StartR Top Green qPCR Super Mix (Trans gen, Beijing, China) was used according to the manufacturer's instructions to conduct qRT-PCR of the genes. The internal control gene used for qRT-PCR was the cotton His3 gene (i.e., histone 3) with primers of GhHIS3F and GhHIS3R, and relative gene expression levels were calculated using the 2-△△CT method [91], as described in detail in previous studies [92-94]. Each gene in each sample was analyzed with three replicates and two technical replicates. All primers are listed in Table S1.