Reliability of genomic variants across different next-generation sequencing platforms and bioinformatic processing pipelines
Background
Next-generation sequencing (NGS) is the foundation of many studies that address questions in biology and medicine. However, integrating data from different experimental backgrounds can introduce strong biases. To investigate the magnitude of such systematic errors, we performed a cross-sectional observational study on a genomic cohort of 99 subjects, each sequenced via (i) Illumina HiSeq X, (ii) Illumina HiSeq and (iii) Complete Genomics. We then systematically analyzed the heterogeneity between the sequencing cohorts with respect to genomic annotation and common filter criteria such as minor allele frequency (MAF).
Results
The number of detected variants and variant classes per individual depended strongly on the sequencing technology. We observed a statistically significant overrepresentation of variants called uniquely by a single platform, which indicates potential systematic biases. These variants were enriched in low-complexity genomic regions and simple repeats. Furthermore, allele frequency estimates were highly discrepant for a subset of variants in pairwise comparisons between sequencing platforms. Applying common filters, such as a MAF threshold of 5% and Hardy-Weinberg equilibrium (HWE), greatly reduced the heterogeneity between cohorts but still left discrepancies of several thousand variants.
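For readers unfamiliar with the two filters named above, the following is a minimal sketch of how a MAF and an HWE filter could be applied to per-variant genotype counts. It is not the pipeline used in this study: the function names, the choice to keep variants with MAF of at least 5%, the chi-square goodness-of-fit test (rather than an exact test), and the HWE p-value cutoff of 1e-6 are all assumptions made for illustration.

```python
import math


def minor_allele_frequency(n_ref_hom, n_het, n_alt_hom):
    """Return the frequency of the rarer allele, given genotype counts."""
    n_samples = n_ref_hom + n_het + n_alt_hom
    if n_samples == 0:
        raise ValueError("no genotyped samples")
    alt_freq = (n_het + 2 * n_alt_hom) / (2 * n_samples)
    return min(alt_freq, 1.0 - alt_freq)


def hwe_chi_square_p(n_ref_hom, n_het, n_alt_hom):
    """P-value of a 1-df chi-square goodness-of-fit test against HWE."""
    n_samples = n_ref_hom + n_het + n_alt_hom
    if n_samples == 0:
        raise ValueError("no genotyped samples")
    q = (n_het + 2 * n_alt_hom) / (2 * n_samples)  # alternate allele frequency
    p = 1.0 - q
    expected = (n_samples * p * p, 2 * n_samples * p * q, n_samples * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    if min(expected) == 0:
        return 1.0  # monomorphic site: no deviation can be tested
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_filters(counts, maf_min=0.05, hwe_p_min=1e-6):
    """Assumed filter: keep variants with MAF >= maf_min and HWE p >= hwe_p_min."""
    return (minor_allele_frequency(*counts) >= maf_min
            and hwe_chi_square_p(*counts) >= hwe_p_min)
```

Applied separately to each sequencing cohort, a filter like this can retain a variant in one cohort and drop it in another when the platforms disagree on genotype counts, which is one way discrepancies can persist after filtering.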
Conclusion
We provide empirical evidence of systematic heterogeneity in variant calls between alternative experimental and data analysis setups. Our results highlight the potential benefit of reprocessing genomic data with harmonized pipelines when integrating data from different studies.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Posted 21 Dec, 2020
Posted 07 Aug, 2020