Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
Background
Next-generation sequencing (NGS) underpins a wide range of studies in biology and medicine. However, integrating data generated under different experimental conditions can introduce strong biases. To methodically investigate the magnitude of systematic errors in single nucleotide variant calls, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced on (i) Illumina HiSeq X, (ii) Illumina HiSeq, and (iii) Complete Genomics, and processed with the respective bioinformatic pipeline. We also repeated variant calling for the Illumina cohorts with GATK, which allowed us to separate the effect of the bioinformatic analysis strategy from that of the sequencing platform.
Results
The number of detected variants and variant classes per individual depended strongly on the experimental setup. We observed a statistically significant overrepresentation of variants called uniquely by a single setup, indicating potential systematic biases. Insertion/deletion polymorphisms (indels) showed lower concordance than single nucleotide polymorphisms (SNPs). Discrepancies in absolute indel counts were particularly prominent in introns, Alu elements, simple repeats, and regions with medium GC content. Notably, reprocessing the sequencing data according to the GATK best practice recommendations considerably improved concordance between the respective setups.
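As an illustration of what concordance and setup-unique calls mean in practice, the following is a minimal sketch for one individual. It assumes each call set is a set of hypothetical (chromosome, position, reference allele, alternate allele) tuples and uses the Jaccard index as the concordance measure; the abstract does not state which metric was used, so the metric, the function names, and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def jaccard_concordance(calls_a, calls_b):
    """Shared calls divided by all calls seen in either set (Jaccard index)."""
    union = calls_a | calls_b
    if not union:
        return 1.0  # two empty call sets are treated as fully concordant
    return len(calls_a & calls_b) / len(union)


def pairwise_concordance(call_sets):
    """Concordance for every pair of setups, keyed by the sorted name pair."""
    return {
        (x, y): jaccard_concordance(call_sets[x], call_sets[y])
        for x, y in combinations(sorted(call_sets), 2)
    }


def unique_calls(call_sets):
    """Variants reported by exactly one setup and by none of the others."""
    result = {}
    for name, calls in call_sets.items():
        others = set().union(
            *(c for other, c in call_sets.items() if other != name)
        )
        result[name] = calls - others
    return result


# Toy data (invented): one individual, three setups.
call_sets = {
    "hiseq_x": {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
        ("chr2", 50, "G", "GA"),  # an insertion seen only by this setup
    },
    "hiseq": {
        ("chr1", 100, "A", "G"),
        ("chr1", 200, "C", "T"),
    },
    "complete_genomics": {
        ("chr1", 100, "A", "G"),
        ("chr3", 10, "T", "C"),
    },
}

concordance = pairwise_concordance(call_sets)
unique = unique_calls(call_sets)
```

In this toy case the two Illumina call sets share two of three distinct variants, while the single insertion appears only in one setup, which mirrors the kind of setup-specific discrepancy described above. Real comparisons would additionally need variant normalization (for example, left-alignment of indels) before sets are compared.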
Conclusion
We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Furthermore, our results demonstrate the benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Posted 11 Jan, 2021
Posted 21 Dec, 2020