In the current study, we re-examined the genomes of 99 individuals from the Wellderly project [13]. The wellderly phenotype describes individuals over the age of 80 that present without any known chronic diseases and do not take any chronic medication. Each individual had already been sequenced with three different next-generation sequencing platforms and variant calling had already been performed via three different bioinformatic pipelines (Fig 1a). Therefore, the raw data in our current study were the resulting vcf files. We only considered variants, i.e. single nucleotide polymorphisms (SNP) and insertion/deletion polymorphisms (InDel) with filter tag ‘PASS’, as defined by the variant calling pipeline, without equalizing the ‘PASS’ criteria between different setups. We believe that this is the most realistic approach when comparing datasets generated by different methods. Additionally, we left-aligned the variants and split multiallelic variants in consecutive blocks to equalize the variant annotation between the different sequencing cohorts.
Throughout the study, we use the name of the sequencing platform (HSX, MOL or CG) to describe the three different experimental setups consisting of the sequencing technology and the respective mapping and variant calling algorithm. The three different genomic data sets obtained in this way are referred to as sequencing cohorts.
Our comparative analysis consisted of three different investigations. First, we determined the absolute number of variants in the respective sequencing cohort and all of their converging sets. Subsequently, we focused on the subset of variants within different genomic regions, such as introns and exons or repetitive elements. Finally, we focused on variant sites with highly differing allele frequencies between the sequencing cohorts in order to investigate systematic “miscalls” at the cohort-level.
Comparison of concordant variants between the three platforms
In the first step of the analysis, we estimated the number of variants predicted for each experimental setup and assessed the concordance between the three methods (pipeline details are summarized in Fig. 1b).
HSX was associated with the highest average number of variants, followed by MOL and then CG (Fig. 2a). Altogether, an average of 3,332,799 variants were detected per individual by all three platforms which corresponds to 79.4% of HSX variants, 88.2% of MOL variants and 89.4% of CG variants (Fig. 2a, b). We tested for statistical over- and underrepresentation of observed variant calls in the different sets of the Venn-diagram (Fig. 2b) with a Monte Carlo simulation approach. The intersection between all sequencing techniques was slightly higher than expected by chance (observed mean 3,332,789 vs. expected mean 2,921,107, p < 0.001, Fig 2c), increasing the confidence in these calls. However, the variants unique to each platform were highly overrepresented (MOL: 2.34, HSX: 2.98 and CG: 4.37 times more unique variants than expected by chance) at the expense of the observed number of variants called by exactly two platforms (Fig 2c). This indicated the presence of platform specific systematic sequencing errors.
Next, we investigated the concordance between different platforms more thoroughly by determining the average number and composition of variants detected by all three, exactly two, at least two, or in just one experimental setup. The proportion of InDels in the intersection of all platforms (6.84%) was lower compared to each individual platform (Fig. 2d). We observed a fairly similar pattern in the distribution of variants found by at least two platforms where the proportion of InDels varied between ~7-9% (Fig. 2e). The analysis of variants detected under only one experimental setup revealed that HSX was associated with the highest number of unique variants, followed by CG and then MOL (Fig. 2f). Interestingly, the proportion of InDels was higher than SNPs in the case of unique variants detected by HSX only. Finally, we observed a similar pattern for the concordant variants detected by exactly two platforms (Fig. 2g). Specifically, the proportion of concordant InDels between HSX and CG was higher than the proportion of SNPs. Furthermore, the general overlap between HSX and the other two sequencers was greater than the overlap between CG and MOL (Fig. 2g).
Apart from the average number of variants per individual, we were also interested in investigating the total number of variant sites as a function of varying cohort size. For this, we randomly selected a subset of individuals while increasing the sample size from 1 to 99 and estimated the total number of variants in each resulting cohort. The results for all three platforms were similar and are shown in S. Fig 1. As expected, the number of called variants steadily increased with increasing sample size. However, the function quickly plateaued after applying the MAF filter. Consequently, increasing the sample size of the cohort above 30 individuals was not associated with a considerable increase in the number of detected variant sites.
Distribution of variants along the genome
One way of estimating the reliability of predicted variants is to correlate them to evolutionary highly conserved elements in the genome – such as regions with high PhastCons scores. These scores, ranging between zero and one, were calculated for the human genome (hg19) on the basis of 99 vertebrates, and stand for the probability of conservation of a nucleotide in the genome. A high score indicates a high rate of conservation[14]. These genomic positions are known to be under strong purifying selection, which makes the occurrence of a variant improbable. Therefore, variants detected within such conserved regions are most likely erroneous. Since no recommendations about an appropriate cut-off value have been described in the literature, we chose a threshold of PhastCons score > 0.8. This allowed us to retain conserved regions with high confidence without being overly restrictive.
Interestingly, relative proportions of PhastCons variants were virtually the same for each experimental setup (~2.4-2.6%). In the case of variants called by solely one platform, this remained the same for MOL and CG with ~1.7%, whereas HSX found 4.15% of variants in PhastCons regions. We consider these variants to be candidates for false positives, since we would expect less variation in highly conserved genomic regions. For details, see Supplemental Table S Table 1.
Additionally, we were interested in regional effects of variant calling across the genome. Therefore, we compared the genome-wide distribution of variants in particular areas, such as exonic, intergenic and intronic areas. The proportional difference in the distribution of variants between the platforms was negligible with ~2% of variants found in exonic, ~56% variants in intergenic and ~42% of variants found in intronic regions by all machines. The observed number of variants significantly differed from the expected value for all three platforms based on the relative length of exons, introns and intergenic regions and assuming an equal distribution of variants along the genome (p<0.0001, chi-squared test). Notably, the observed number of variant sites in exons was lower than the expected quantity (observed mean 84867 vs. expected mean 126748 for HSX; observed mean 79071 vs. expected mean 114137 for MOL; observed mean 79551 vs. expected mean 112639 for CG).
With regard to the mean number of variants found only by one platform, the HSX-cohort had a mean number of 197865 unique variants in intergenic regions, whereas MOL had 55679 unique variants and CG had 109274 unique variants in intergenic regions. For intronic regions, a mean count of 142209 variants per sample was found by HSX only, MOL had 37160 mean unique variants and CG had 78368 mean unique variants in introns. For exons there was a mean number of 2762 of variants detected solely by HSX, whereas MOL had on average 436 unique variants and CG had 962 unique variants in exonic regions per sample. For details refer to Supplemental Table S Table 1.
Distribution of variants in repetitive genomic regions
Regions that are frequently excluded from the analysis of genomic variants are repetitive elements and GC rich regions, because they are known to accumulate sequencing errors[1]. In order to calculate the impact of such regions, we proceeded as follows (Fig. 1c): We obtained the RepeatMasker annotation for the following classes of repetitive elements, Alu elements, long interspersed nuclear elements (LINE), low complexity regions, long terminal repeats (LTR) and simple repeats. Then, we calculated the ratio of variants detected exclusively by one platform to the total number of variants detected by the same platform for each type of repetitive element. We refer to this as the observed value of the unique variant rate in the respective RepeatMasker region. The unique variant rate can generally indicate either higher sensitivity or increased false positive rate. However, we believe that this measure more likely corresponds to the false positive rate of the variant calling method in repetitive low complexity regions. In order to estimate the expected value of the unique variant rate, we randomly selected genomic regions of the same length for each type of RepeatMasker repetitive element and calculated the rate of unique variants as described above.
Fig. 3 depicts the results for all three pipelines. The expected unique variant rate was ~12.5% for HSX, ~4.7% for MOL and ~19.2% for CG. The observed unique variant rate in low complexity regions and simple repeats was considerably higher than the expected value for all three platforms. The observed unique variant rate was around 48% in low complexity regions and approximately 54% in simple repeats for both HSX and CG (Fig. 3a, 3c). These values were lower for MOL with an observed rate of 19% in low complexity regions and 25.6% in simple repeats (Fig. 3b). This finding is however likely due to the fact that MOL generally detected the lowest number of unique variants compared to the other platforms (cf. Fig. 2f). Interestingly, however, the absolute number of variants detected in low complexity regions and simple repeats was higher for MOL than both HSX and CG (Fig. 3d).
Furthermore, we examined the general length distribution of InDels and their proportion overlapping the repetitive regions defined by RepeatMasker. Notably, CG included the highest number of 1-bp insertions and deletions while MOL was the only sequencer with more 1-bp deletions than 1-bp insertions (Fig. 4a-c). Additionally, the proportion of InDels in simple repeats and low complexity regions was well above 0.5 for all sequencers, whereas a value of ~0.2 would be expected. In contrast, InDels were less abundant than expected in LINE and LTR regions (Fig. 4d-f).
Factors with a significant impact on the concordance of called variants
We performed a Poisson regression analysis in order to investigate factors which potentially influence the concordance between the different pipelines. The response variable was the number of pipelines which detected each variant. The model included the four categorical predictors genomic region, repetitive element, MAF for the European cohort from the 1000 Genomes Project and positional conservation. Genomic regions included intergenic, introns and exons (reference category). Repetitive elements were obtained from the RepeatMasker annotation including simple repeats, LINE, LTR, low complexity repeats and non-repetitive elements (reference category). Variants were classified as common if the MAF in the 1000 Genomes Project was >5% (reference category), low if MAF was between 0.5% and 5%, rare if MAF was below 0.5% and not detected if the variant was not present in the 1000 Genomes Project. Positions were considered conserved if the PhastCons scores was above 0.8 (reference category) and non-conserved otherwise.
Results from the regression analysis for SNPs are depicted in Fig. 5a. The factors which showed the strongest negative effect on the concordance between different pipelines were simple repeats (p<2x10-16), low complexity regions (p<2x10-16) and novel variants not detected in the 1000 Genomes Project (p<2x10-16). The concordance of detected variants was also significantly worse for LINEs relative to non-repetitive elements as well as variants with low and rare MAF compared to common variants. In contrast, LTRs and introns were associated with a significantly improved concordance relative to non-repetitive elements and exons, respectively. The impact of intergenic regions and positional conservation was not statistically significant.
All factors that we included in the regression analysis significantly influenced the concordance between the pipelines for InDels (Fig. 5b). Novel variants not detected in the 1000 Genomes Project were associated with the strongest negative impact on the concordance between the three pipelines. Furthermore, intergenic regions and introns relative to exons, simple repeats and low complexity regions relative to non-repetitive regions, rare variants and non-conserved genomic positions all significantly contributed to reduced concordance. Conversely, the agreement between the pipelines was significantly better for variants with a low MAF relative to common variants as well as in LINE and LTR elements compared to non-repetitive elements. Interestingly, this correlates with the observed rate of InDels in LINE and LTR regions being lower than the expected average (cf. Fig. 4d-f).
Comparison of allele frequency estimates between different platforms
In the final step of the analysis, we investigated whether the estimated allele frequency of variants was comparable between each pair of platforms for variants detected by both respective platforms. Workflow details are summarized in Fig. 1d.
Fig. 6a-c depicts the allele frequency correlation between the different experimental setups for chromosome 22, while the results for the remaining chromosomes are shown in Supplemental Figure S. Fig. 2. Generally, each pair of sequencing technologies demonstrated a strong consensus in estimated allele frequency indicated by the high density of data points around the diagonal line of the plot. However, numerous variants were associated with a very high allele frequency for one platform while simultaneously having a very low occurrence in the other platform.
In order to quantify the variants with strong disagreement in estimated allele frequency between the different setups, we created a subset with variants having an allele frequency difference >0.8 for each pair of sequencers. We chose this threshold to be able to focus on variants with a very high occurrence in one platform and a very low occurrence in the other. Such extreme discrepancies could point to systematic rather than random errors. We observed an average of 14746.26 variants with allele frequency difference >0.8 for HSX and CG, closely followed by the average for CG and MOL, whereas this value was much lower for HSX and MOL (Fig. 6d).
Next, we examined how many of the variants with highly diverging estimated allele frequency correspond to variants annotated in GWA studies which could illustrate potential problems in downstream analysis. The average number of such variants for the comparison of HSX and CG as well as CG and MOL was around 80 (Fig. 6e). Importantly, we could reduce this number to 0 by applying both the MAF and HWE filter to the data (Fig. 6f).The majority of these annotated GWAS vairants with very divergent estimated allele frequency were filtered out already after applying only the MAF filter (Fig 6g), whereas the impact of the HWE filter alone was much more moderate (Fig. 6h). Consequently, these results indicate that is seems prudent to combine both filters as neither one of them managed to remove these highly discordant variants completely when applied individually.