Plant material, DNA isolation and RNA isolation
Detailed information on the four hybrid rice accessions with high-coverage sequence data is given in a previous study (Huang et al. 2012). The DNA extraction, library preparation and genome sequencing on the Illumina platform are also described in that publication.
Short read alignment and SNP calling
BWA is used to align all short reads to the rice reference genome (IRGSP version 4) with default parameters for paired-end data, producing SAM files. Samtools converts the SAM files to BAM format, and the BAM files are sorted by genome coordinate. All BAM files of the wild rice and hybrid rice accessions are also submitted to the SNPtools pipeline and processed as described in its manual. In this study we first parse each BAM file with the Samtools API and search for bases in the sequencing reads that differ from the reference genome. Reads with mapping quality lower than 60 or a read matching ratio smaller than 0.82 are discarded. HetMap only outputs polymorphic sites whose non-reference allele appears at least twice in the population. All bases covering the polymorphic sites detected in the previous step, together with their sequencing quality scores, are extracted from the BAM files. HetMap summarizes the extracted bases at each polymorphic site to obtain the missing rate and minor allele frequency. Only polymorphic sites with minor allele frequency higher than 0.05, missing rate lower than 0.4 and physical distance to neighboring SNPs larger than 10 bp are kept. For the chosen polymorphic sites we calculate genotype likelihoods from the read depth and base quality information, and Beagle is used to infer genotypes from the resulting genotype likelihood file.
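The site-level filtering can be sketched as follows. This is a minimal illustration assuming each candidate site carries per-accession allele counts and that the spacing filter is applied greedily against the previously kept site; the struct and function names are assumptions for illustration, not HetMap's internal types, while the thresholds match the defaults above.

```cpp
#include <algorithm>
#include <vector>

// Illustrative per-site summary; field names are assumptions, not HetMap's API.
struct CandidateSite {
    int chrom = 0;
    long pos = 0;          // 1-based genome coordinate
    int ref_count = 0;     // accessions carrying the reference allele
    int alt_count = 0;     // accessions carrying the alternative allele
    int missing = 0;       // accessions without usable coverage
};

// Keep sites with MAF > 0.05, missing rate < 0.4 and > 10 bp to the previous kept site.
std::vector<CandidateSite> filter_sites(const std::vector<CandidateSite>& sites,
                                        int n_accessions,
                                        double min_maf = 0.05,
                                        double max_missing = 0.4,
                                        long min_distance = 10) {
    std::vector<CandidateSite> kept;
    for (const CandidateSite& s : sites) {             // sites assumed sorted by position
        double called = s.ref_count + s.alt_count;
        if (called == 0) continue;
        double maf = std::min(s.ref_count, s.alt_count) / called;
        double missing_rate = static_cast<double>(s.missing) / n_accessions;
        bool far_enough = kept.empty() || kept.back().chrom != s.chrom ||
                          s.pos - kept.back().pos > min_distance;
        if (maf > min_maf && missing_rate < max_missing && far_enough)
            kept.push_back(s);
    }
    return kept;
}
```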
SNP validation
HetMap uses the four hybrid rice and two wild rice accessions with high-coverage sequence data to validate the SNP inference accuracy for hybrid rice and wild rice. These accessions are also included in the low-coverage sequence data. We compare the genotypes called from the low-coverage data against the genotypes called from the high-coverage data to assess the inference accuracy.
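A concordance check of this kind can be sketched as follows, assuming both call sets are encoded as integer genotypes (0 = RR, 1 = RA, 2 = AA, -1 = missing) at the same ordered sites; the encoding and function name are illustrative rather than part of HetMap.

```cpp
#include <cstddef>
#include <vector>

// Returns the fraction of sites where the low-coverage call matches the
// high-coverage call, counting only sites called in both data sets.
double genotype_concordance(const std::vector<int>& low_cov,
                            const std::vector<int>& high_cov) {
    std::size_t compared = 0, matched = 0;
    for (std::size_t i = 0; i < low_cov.size() && i < high_cov.size(); ++i) {
        if (low_cov[i] < 0 || high_cov[i] < 0) continue;  // skip missing calls
        ++compared;
        if (low_cov[i] == high_cov[i]) ++matched;
    }
    return compared ? static_cast<double>(matched) / compared : 0.0;
}
```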
Genotype likelihood calculation
Genotype likelihoods are calculated for the three possible genotypes: homozygous reference allele, homozygous alternative allele and heterozygous. The base coverage information for the different alleles is integrated with an improved binomial model. Because second-generation sequence data have a high sequencing error rate, a penalty based on the base phred score is applied when integrating read coverage into the genotype likelihood calculation. After the genotype likelihoods of all three genotypes are calculated, they are normalized to obtain the final probabilities.
\(\mathrm{Pr}_{\mathrm{ref}}=\prod_{i=1}^{m}\mathrm{Pr}_{\mathrm{ref}}\times \prod_{i=1}^{n}\mathrm{Er}_{\mathrm{alt}}\) (1)
\(\mathrm{Pr}_{\mathrm{alt}}=\prod_{i=1}^{m}\mathrm{Pr}_{\mathrm{alt}}\times \prod_{i=1}^{n}\mathrm{Er}_{\mathrm{ref}}\) (2)
\(\mathrm{Pr}_{\mathrm{het}}=\prod_{i=1}^{m}\mathrm{Pr}_{\mathrm{het}}\times \prod_{i=1}^{n}\mathrm{Pr}_{\mathrm{het}}\) (3)
\(\mathrm{Prob}_{AA}=\mathrm{Pr}_{\mathrm{ref}}/(\mathrm{Pr}_{\mathrm{ref}}+\mathrm{Pr}_{\mathrm{alt}}+\mathrm{Pr}_{\mathrm{het}})\) (4)
\(\mathrm{Prob}_{AB}=\mathrm{Pr}_{\mathrm{het}}/(\mathrm{Pr}_{\mathrm{ref}}+\mathrm{Pr}_{\mathrm{alt}}+\mathrm{Pr}_{\mathrm{het}})\) (5)
\(\mathrm{Prob}_{BB}=\mathrm{Pr}_{\mathrm{alt}}/(\mathrm{Pr}_{\mathrm{ref}}+\mathrm{Pr}_{\mathrm{alt}}+\mathrm{Pr}_{\mathrm{het}})\) (6)
ProbAA, ProbAB and ProbBB represent the final normalized values of the three genotype likelihoods.
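A minimal numeric sketch of Equations 1-6 is given below, assuming m base calls support the reference allele and n support the alternative allele, with per-base correctness probabilities derived from phred scores (\(1-10^{-Q/10}\)); the per-base heterozygous term is taken as 0.5 here, which is one common choice and an assumption rather than HetMap's exact penalty scheme.

```cpp
#include <cmath>
#include <vector>

struct GenotypeProbs { double AA, AB, BB; };  // RR, RA, AA after normalization

// ref_quals / alt_quals: phred scores of bases supporting the reference and
// alternative allele at one polymorphic site.
GenotypeProbs genotype_likelihood(const std::vector<int>& ref_quals,
                                  const std::vector<int>& alt_quals) {
    double pr_ref = 1.0, pr_alt = 1.0, pr_het = 1.0;
    for (int q : ref_quals) {
        double err = std::pow(10.0, -q / 10.0);   // phred -> error probability
        pr_ref *= 1.0 - err;                      // base agrees with RR
        pr_alt *= err;                            // base is an error under AA
        pr_het *= 0.5;                            // assumed per-base het term
    }
    for (int q : alt_quals) {
        double err = std::pow(10.0, -q / 10.0);
        pr_ref *= err;                            // Equation 1: error term for alt bases
        pr_alt *= 1.0 - err;                      // Equation 2
        pr_het *= 0.5;                            // Equation 3
    }
    double total = pr_ref + pr_alt + pr_het;
    return { pr_ref / total, pr_het / total, pr_alt / total };  // Equations 4-6
}
```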
Cultivation of the wild rice and the phenotyping of the wild rice traits
In this study we planted and phenotyped the 446 wild rice accessions in Lingshui County, Hainan Province, from 2012 to 2014. Only a few cultivated rice accessions have short awns and most cultivated rice accessions have no awn, whereas most wild rice accessions have long awns. The awn is stripped from the seed before the awn length and seed length are measured. The awn is pulled straight and its length measured with a ruler. We measure the awn length of five seeds and use their average as the final awn length for each wild rice accession.
Data simulation
We simulate sequence coverages ranging from 0.25-fold to 30-fold by randomly sampling reads from the high-coverage sequence data. The sampled reads are aligned against the reference genome to generate SAM files. Samtools converts the SAM files to BAM format and produces sorted BAM files. HetMap and SNPtools are used to generate genotype likelihood files from these sorted BAM files, and the genotype likelihood files are submitted to Beagle for imputation. The genotypes called from the simulated data are compared against those called from the high-coverage sequence data to obtain the accuracy.
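One way to perform the down-sampling is sketched below, assuming reads are drawn independently with probability target coverage divided by original coverage; in practice read pairs would be kept or dropped together, and the record type and function name are illustrative, not part of HetMap.

```cpp
#include <random>
#include <string>
#include <vector>

// Illustrative FASTQ record for the sketch.
struct FastqRecord { std::string name, seq, qual; };

// Keep each read with probability target_cov / original_cov to simulate lower coverage.
std::vector<FastqRecord> downsample_reads(const std::vector<FastqRecord>& reads,
                                          double original_cov, double target_cov,
                                          unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution keep(target_cov / original_cov);  // e.g. 0.25 / 30
    std::vector<FastqRecord> sampled;
    for (const FastqRecord& r : reads)
        if (keep(rng)) sampled.push_back(r);
    return sampled;
}
```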
General workflow of the HetMap pipeline
The HetMap pipeline is implemented in C++. It takes standard BAM files as input, and the output file is in VCF format. The workflow from raw sequence reads to the final imputed genotypes is outlined in Figure 1. Many genotype imputation software packages can handle the output file of HetMap; here Beagle is chosen as an example.
The whole pipeline can be divided into five sections. The HetMap pipeline starts with a sorted BAM file; in general, many sequence alignment software packages can be used to align the short reads to the reference genome and generate this input. Firstly, the “Call_snp” program detects the genetic variants in the sequence data that differ from the reference genome. Secondly, the “Sum_snp” program summarizes the genetic variants across the whole population and applies filtering parameters to generate the candidate polymorphic sites. Thirdly, the “Chose_pile” program extracts all bases that are aligned to the candidate polymorphic sites. Fourthly, the “filter_snp” program uses the full base information aligned to the candidate polymorphic sites to filter them. Fifthly, the “Call_prob” program calculates the three genotype likelihoods for every polymorphic site across the whole population. The output file, which is in VCF format (version 4.0), can be used as input for many popular genotype imputation software packages.
Besides research on genome sequence data, detecting variants from RNA-sequencing data is also of great importance, and HetMap has an extended function to detect variants from RNA sequence data. As the HetMap pipeline is compatible with many popular software packages, its individual sections can also be used separately in the analysis process.
Detection of polymorphic sites
The HetMap pipeline uses the application programming interface (API) provided by Samtools to parse the compressed BAM file. Before detecting genetic variants from the next-generation sequence data, HetMap first filters out sequence reads with a low mapping score (default < 60) or a small read matching ratio (default < 0.82). All sites in the sequence data that differ from the reference genome are detected for each accession. Because some variant sites in the sequence reads may be caused by sequencing errors or read mismapping, only non-reference alleles that appear at least twice in the whole population are kept to generate the primary polymorphic sites. Because only the genetic variants are considered when determining the primary polymorphic sites, the information for all four bases is needed to obtain the final polymorphic sites. All sequenced bases aligned to the primary polymorphic sites are therefore pulled down from the sequence alignment files, and the base and phred score information are both included in the output file.
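A minimal sketch of the read-level filter is shown below, using the htslib API that Samtools is built on. The matching ratio is computed here as aligned (M/=/X) CIGAR bases over read length, which is one plausible definition and an assumption, as is the wrapper function itself; a full caller would iterate records with sam_read1() and apply such a predicate before piling up bases.

```cpp
#include <htslib/sam.h>

// Returns true if the read passes HetMap-style filters:
// mapping quality >= 60 and matching ratio >= 0.82 (thresholds from the text;
// the matching-ratio definition below is an assumption for illustration).
bool read_passes_filters(const bam1_t* b, int min_mapq = 60, double min_ratio = 0.82) {
    if (b->core.flag & (BAM_FUNMAP | BAM_FSECONDARY | BAM_FSUPPLEMENTARY)) return false;
    if (b->core.qual < min_mapq) return false;

    const uint32_t* cigar = bam_get_cigar(b);
    int aligned = 0;
    for (uint32_t i = 0; i < b->core.n_cigar; ++i) {
        int op = bam_cigar_op(cigar[i]);
        if (op == BAM_CMATCH || op == BAM_CEQUAL || op == BAM_CDIFF)
            aligned += bam_cigar_oplen(cigar[i]);
    }
    return b->core.l_qseq > 0 &&
           static_cast<double>(aligned) / b->core.l_qseq >= min_ratio;
}
```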
Hash functions are often used in many software packages to look up selected polymorphic sites in the sequence reads. However, scanning the whole sequence data with a hash function to collect all bases aligned to the candidate polymorphic sites requires a large amount of computation, and this becomes a more serious problem because the sample numbers used in population genetics analyses are often large. Sort-and-compare is another method to pull down chosen polymorphic sites from massive sequence data: by comparing the sorted read mapping positions with the polymorphic site positions, the bases aligned to the polymorphic sites can be extracted with a simple comparison of the input BAM file instead of a hash lookup. However, because each read spans a range of mapping positions that can overlap several sites, a simple sort-and-compare is impractical. The HetMap pipeline therefore uses an improved sort-and-compare method with dual sorting when comparing the read mapping positions with the polymorphic sites, so that the window of polymorphic sites under consideration grows and shrinks dynamically as the read mapping range changes. After the information for all four bases has been collected for each candidate polymorphic site, the polymorphic sites can be filtered with multiple criteria to obtain the final polymorphic sites.
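Such a sort-and-compare scheme with a sliding window of active sites can be sketched as follows, assuming both the reads and the polymorphic sites are sorted by position and ignoring indels and CIGAR handling for brevity; the data structures are illustrative and not HetMap's internal types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative types: a mapped read and a polymorphic site, both sorted by position.
struct MappedRead { long start = 0; std::string seq; };   // start is 0-based
struct Site { long pos = 0; std::string bases; };         // bases piled up at this site

// Collect the bases covering each sorted site from sorted reads, using a sliding
// window (two indices) instead of a per-base hash lookup.
void pile_up_bases(const std::vector<MappedRead>& reads, std::vector<Site>& sites) {
    std::size_t window_start = 0;                   // first site not yet left behind
    for (const MappedRead& r : reads) {
        long read_end = r.start + static_cast<long>(r.seq.size());
        // Drop sites that lie before this read starts (front of the window advances).
        while (window_start < sites.size() && sites[window_start].pos < r.start)
            ++window_start;
        // Extend over every site covered by this read (back of the window advances).
        for (std::size_t i = window_start;
             i < sites.size() && sites[i].pos < read_end; ++i)
            sites[i].bases.push_back(r.seq[sites[i].pos - r.start]);
    }
}
```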
Genotype likelihood calculation using an improved binomial model
For each polymorphic site that consists of two alleles, there are three possible genotypes: homozygous reference (RR), homozygous alternative (AA) and heterozygous (RA). HetMap does not take tri-allelic polymorphic sites into consideration here, as they occupy only a small part of the genome; however, HetMap outputs the information for all four bases at each polymorphic site, so tri-allelic sites can still be detected from the output file with some simple customization. HetMap calculates genotype likelihoods for all three possible genotypes (Equations 1-6, Methods). Because short-read sequence data have a high sequencing error rate, bases with a low phred quality score (< 20) are often discarded in genotype likelihood calculations. However, the number of reads covering a specific polymorphic site is small in low-coverage sequence data, so HetMap instead applies a penalty based on the phred score and keeps the low-quality bases in the calculation. If both an alternative allele and a reference allele are detected at a polymorphic site, the likelihood of the homozygous reference genotype combines the probabilities that the reference bases are correct with the error probabilities of the alternative bases (Equation 1). HetMap outputs the genotype likelihood file in VCF format, and multiple software packages can be used for downstream analysis.
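As a hedged numeric illustration of this penalty idea (with assumed phred scores and an assumed per-base heterozygous term of 0.5, which may differ from HetMap's exact scheme), consider a site covered by two reference bases with phred score 30 (error probability 0.001) and one alternative base with phred score 10 (error probability 0.1). Then \(\mathrm{Pr}_{\mathrm{ref}}=0.999\times 0.999\times 0.1\approx 0.0998\), \(\mathrm{Pr}_{\mathrm{alt}}=0.001\times 0.001\times 0.9=9\times 10^{-7}\) and \(\mathrm{Pr}_{\mathrm{het}}=0.5^{3}=0.125\); after normalization, \(\mathrm{Prob}_{AA}\approx 0.44\), \(\mathrm{Prob}_{AB}\approx 0.56\) and \(\mathrm{Prob}_{BB}\approx 4\times 10^{-6}\). The low-quality alternative base is thus down-weighted through its error probability rather than being discarded outright.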