The WES consensus pipeline yielded a significant number of variants
The pipelines were executed on a server equipped with a 40-core CPU, 128 GB of RAM and L3 cache. The execution time for one exome sequenced sample was approximately 6 h 30 mins for GATK and 5 h 30 m for the consensus pipeline. The time required for variant calling was approximately 3 h 30 mins for GATK and 1 h 30 mins for consensus pipeline wherein on average 3,00,000 variants were identified in each sample. We observed that the GATK called more SNVs and indels, but the consensus pipeline gave a higher average transition-to-transversion (Ti/Tv) ratio (5.57 for consensus and 3.45 for GATK). The Ti/Tv ratio is an important criterion for assessing the quality of SNV calling, for exome sequencing, the expected Ti/Tv ratio is approximately 3.011. A higher Ti/Tv ratio usually indicates a higher accuracy of SNV calling.
Table 2: Variant counts and Ti/Tv Ratio from the two pipelines
|
SNVs
|
Ti/Tv ratio
|
Samples
|
Consensus
|
GATK
|
Consensus
|
GATK
|
HG00448
|
81156
|
266818
|
5.73645
|
2.56075
|
HG00284
|
73988
|
267929
|
5.82906
|
2.82882
|
PCa
|
68552
|
1043400
|
5.46942
|
5.33035
|
The genetic analysis of three samples, HG00448, HG00284, and PCa revealed significant variations in SNVs and Ti/Tv ratios (Table 2; Fig. 3). The GATK pipeline consistently yielded higher SNV counts across all samples. Ti/Tv ratios indicated a higher prevalence of transitions in consensus SNVs, whereas the GATK SNVs showed a more balanced distribution between transitions and transversions. While we observed that the consensus pipeline yielded a large number of variants with conflicting pathogenicity, the variants called from both pipelines were further evaluated for pathogenicity based on the presence of variants in the ClinVar database and the concordance check of the identified variation in clinical significance (Table 3).
Table 3: Clinical significance of variants called by the consensus and GATK pipeline
|
HG00448
|
HG00284
|
PCa
|
|
GATK
|
Consensus
|
GATK
|
Consensus
|
GATK
|
Consensus
|
Pathogenic
|
6
|
3
|
1
|
3
|
3
|
11
|
Likely Pathogenic
|
4
|
2
|
2
|
1
|
5
|
3
|
Pathogenic and likely pathogenic
|
0
|
0
|
1
|
1
|
7
|
2
|
Uncertain significance variants
|
30
|
33
|
30
|
28
|
134
|
109
|
Variants of conflicting pathogenicity
|
70
|
51
|
89
|
64
|
85
|
66
|
Total Clinvar Validated
|
17842
|
17862
|
18649
|
17398
|
14343
|
10677
|
The counts of variants for each category of clinical significance, viz. pathogenic, likely pathogenic, VoUS and conflicting pathogenicity was tabulated wherein our analysis showed an almost comparable number of ClinVar validated records from both pipelines. This asserts that there was a significant difference in the initial number of SNPs called by each pipeline where the final analysis resulted in a similar performance of both pipelines. We argue that the consensus pipeline appeared to be more sensitive in detecting pathogenic and likely pathogenic variants in the PCa sample, while GATK showed better performance in uncertain significance variants and ClinVar-validated variants in certain samples. The consensus pipeline also demonstrated lower numbers of variants of conflicting pathogenicity, indicating more consistent results. On the other hand, the SNPs called by GATK and the consensus pipeline were comparable and fewer in the consensus pipeline can be attributed to the default variant filtering applied by each variant caller and the algorithms used to call the variants. So the variants obtained were deemed the most confident ones among the four variant callers. The high SNP count for the truth set of NA12878 can be due to the various samples used for making the truth set with no filtrations applied.
Benchmarking yielded variant calls from both the pipelines with ClinVar validation
Benchmarking workflows include statistical metrics such as sensitivity, precision, F1 score and specificity which provide an insight into the pipeline performance. Here both the GATK pipeline metrics and the consensus pipelines are compared against the truth set variants of benchmark NA12878 dataset. The variant calls obtained from both pipelines were further annotated with the ClinVar database for the clinical significance of the genetic variants providing clinical interpretations for variants (Table 4). For the current analysis, this annotation will help identify the various metrics associated with the benchmark workflow to assess the pipeline’s ability to call clinically relevant variants as well as concordance with the established pipelines.
Table 4: Pathogenicity analyses for GATK, consensus and truth set
|
NA12878
|
|
GATK
|
Consensus
|
Truth set
|
Pathogenic
|
6
|
5
|
9
|
Uncertain significance variants
|
113
|
78
|
112
|
Variants of conflicting pathogenicity
|
110
|
95
|
116
|
Clinvar Validated
|
22062
|
14809
|
34074
|
The consensus pipeline had a total of 14809 ClinVar validated variants wherein 5 of them were pathogenic, 4 were identified as true positives and 1 as a false positive. While specificity was found to be 0.999971, the sensitivity of the consensus pipeline was 0.444, and the F1 score was around 0.571. On the other hand, the GATK pipeline, which processed a larger dataset of 22062 ClinVar validated variants, identified 6 pathogenic variants. It achieved 5 true positives and 4 false positives, yielding a higher specificity of approximately 0.999882. The GATK pipeline demonstrated a higher sensitivity of around 0.833 and an F1 score of approximately 0.667.
Comparison of trio-exome pipelines yielded distinct variants
One of the three trio-exome datasets is a 1KGP benchmark dataset that is a female and her father and mother were taken as a trio for this analysis. The other two families are CPC families(Z12 and Z19) in which only the proband is affected with the disease with the parents unaffected. The GATK trio pipeline took approximately 17 hours to finish one trio set and the VarScan trio pipeline took approximately 8 hours to complete the trio workflow. Both the pipelines included various steps of joint variant calling of trio samples and that is probably the reason for the increased time for the pipeline completion. GATK also employed additional genotype refinement as well as some joint calling specific subcommands which increases the accuracy and specificity of the variants called but also increases the total time taken to finish the pipeline.
Table 5: Total variants called by VarScan and GATK trio pipelines
|
Total SNVs
|
After filtration SNVs
|
Samples
|
GATK Trio
|
VarScan Trio
|
GATK Trio
|
VarScan Trio
|
NA12878_family
|
1788766
|
70064
|
677150
|
49051
|
Z12_family
|
1326375
|
145238
|
277160
|
102124
|
Z19_family
|
1408848
|
152132
|
332480
|
108207
|
The GATK and VarScan pipelines called distinct variants even as GATK relatively called significantly more variants than VarScan probably because of its sensitivity-focused algorithm (Table 5; Fig. 4). Nonetheless, both pipelines require a more stringent filtering approach which was carried out in the subsequent steps. On the other hand, these pipelines may have a false positive rate due to potential filtering artefacts. GATK also employs local realignment and haplotype assembly techniques to handle regions with complex variation, such as indels accurately. This approach enables GATK to detect and call variants that might perhaps be missed by other callers. The variant annotation of the filtered variants was done with the ClinVar database variant record vcf file (Table 6). The annotations were determined to be present in ClinVar records by concordance check with chromosome number, chromosome position, reference and alternate allele in both the pipelines vcf files and ClinVar records. As expected, the counts for pathogenic and likely pathogenic variants were very less as two of the samples are rare diseases.
Table 6: Clinical significance of variants from Trio pipelines
Samples
|
Pathogenic
|
Likely Pathogenic
|
Uncertain significance variants
|
Variants of conflicting pathogenicity
|
NA12878_family
|
8
|
6
|
198
|
217
|
Z12_family
|
10
|
6
|
245
|
264
|
Z19_family
|
13
|
5
|
236
|
223
|
Functional annotation of selected variants
The variants were analysed using The Genome Aggregation Database (gnomAD), gave us additional information on MAF besides Combined Annotation Dependent Depletion (CADD) scores which are key factors when analysing variants in a genome, particularly in the context of rare diseases. The accepted MAF threshold for defining rare variants in the context of rare diseases is generally considered to be less than 1% and we deemed it to identify extremely rare variants (MAF < = 0.01%). Nonetheless, CADD scores are generally calculated as phred-like scores and higher scores may indicate more deleterious mutations ( CADD > = 10; Table 7). We couldn’t consider Genomic Evolutionary Rate Profiling (GERP) scores keeping in view of the already existing stringent threshold we set.
Table 7
Functional annotation of selected trio variants
Gene
|
Chromosome Position (hg38) and Human Genome Variation Society (HGVS) Nomenclature
|
Ref/Alt Allele
|
Variant_Annotation
|
MAF/GnomAD
|
Phred_CADD
|
Proband
|
Variant type
|
LAMB3
|
NC_000001.11:g.209650092G > A
|
G/A
|
nonsense
|
0.000006571
|
37
|
Z12
|
Extremely rare
|
B3GLCT
|
NC_000013.11:g.31269278G > A
|
G/A
|
splice_donor_variant
|
0.0008022
|
34
|
Z12
|
Extremely rare
|
SOD3
|
NC_000004.12:g.24800212C > G
|
C/G
|
missense_variant
|
0.0127
|
23.4
|
Z19
|
Rare
|
MRE11A
|
NC_000011.10:g.94467821G > A
|
G/A
|
nonsense
|
0.00008
|
|
Z19
|
Extremely rare
|
MAP2K3
|
NC_000017.11:g.21300880C > T
|
C/T
|
missense_variant
|
NA
|
|
Z12,Z19
|
NA
|
Our analyses revealed the presence of novel variants involved in key cellular pathways, viz. cell proliferation, differentiation, migration and tumorigenesis. As anticipated, the occurrence of pathogenic and likely pathogenic variants was relatively limited, considering the rarity of conditions like CPC, a disease characterised by inflammation, fibrosis and hyperplasia, even as several genes encompassing a range of functional roles were identified. For example, SOD3 participates in repairing oxidative damage, MRE11A is involved in DNA double-strand break (DSB) repair, and LAMB3 plays a role in cellular organisation and migration. MAP2K3 regulates cell proliferation and invasion and was found to be coexpressed with other MAP kinase pathway proteins, viz. MAP3K3. Two SNPs in the MAP3K3 gene were candidates for association with colon and rectal cancers while LAMB3 is known to mediate attachment, migration and organisation of cells into tissues during embryonic development, in addition it is involved in the invasive and metastatic abilities of some types of cancer, including colon cancer. However, none of the mutations from our erstwhile exome-trio analysis which reported the significant/extremely rare variants were identified in our pipeline.
The present study attempts to benchmark the consensus exome pipeline by comparing it with a common and widely used gold standard GATK, which comes with standard best practices guidelines which help in benchmarking custom pipelines. Four datasets were taken for evaluation, three 1KGP datasets and one PCa sample. The high confidence variant call set from NA12878 was mainly used for comparing and benchmarking the consensus pipeline along with two other 1KGP samples and one PCa sample. This was performed because different variant callers may have varying strengths and weaknesses in detecting specific types of variants, such as SNVs, insertions, deletions, or structural variants. By integrating multiple variant callers in a consensus pipeline, we can improve the overall sensitivity of variant detection. Variants called by multiple callers are more likely to be true positives, reducing the risk of missing important variants. On the other hand, the consensus pipeline used four open-source variant callers at its core and all four of them contributed to the variant calls made. In the benchmarking studies, although it called fewer variants than the gold standard GATK pipeline, the Ti/Tv ratio was found to be consistently greater than what GATK achieved. This shows that with less number of calls, the consensus pipeline was able to capture a greater quality of SNVs and have a better ability to differentiate between true genetic variations and sequencing artefacts, particularly in the context of exome sequencing. On the other hand, GATK called more SNVs and indels, which could be attributed to differences in variant calling algorithms and default variant filtering applied by each pipeline. Analysis of ClinVar validated variants showed that both pipelines yielded a similar number of validated records, indicating a comparable performance in identifying clinically relevant variants. The consensus pipeline demonstrated higher sensitivity in detecting pathogenic and likely pathogenic variants in the PCa sample, while GATK performed better in uncertain significance variants. The consensus pipeline also showed lower numbers of variants of conflicting pathogenicity, suggesting more consistent results. To validate the consensus pipeline against the established gold standard GATK pipeline, both pipelines were run on a widely used benchmarking dataset (NA12878 exome sequencing data) with a truth set for variant calls indicating an unbiased evaluation and comparison of the consensus pipeline's performance. The results showed comparable SNP calls between the pipelines, with fewer SNPs in the consensus pipeline possibly due to default variant filtering and the selection of the most confident variants. Benchmarking workflows, including sensitivity, precision, F1 score, and specificity, were used to assess the pipeline's performance, and both pipelines were compared against the truth set variants. The clinical significance annotation with the ClinVar database further aided in evaluating the pipeline's ability to call clinically relevant variants.
Our work, however, has inherent limitations. We could optimise the use of additional variant callers or incorporate new algorithms that specialise in detecting specific types of variants, such as indels or structural variants. Fine-tuning the variant filtering step within the consensus pipeline may also enhance its ability to differentiate true genetic variations from sequencing artefacts, thereby increasing the accuracy of variant calling. To further validate the consensus pipeline, it would be beneficial to run it on a diverse range of benchmarked datasets and compare its performance against other established gold standard pipelines. This would provide a comprehensive evaluation of its strengths and weaknesses and establish its reliability across different datasets and variant calling scenarios. Incorporating machine learning and artificial intelligence techniques into the consensus pipeline could also enhance its performance. By leveraging these approaches, the pipeline could learn from large-scale datasets and improve its ability to accurately identify and annotate clinically relevant variants. Additionally, integrating external databases and resources beyond ClinVar, such as other disease-specific databases or functional annotations, would further enrich the pipeline's clinical significance assessment and provide more comprehensive information for variant interpretation. Nonetheless, we have a persistent dearth of trio-exome samples in the public domain supporting this limitation. As systematic benchmarking of such variant calling pipelines is underway for variant discovery[24], we have a firm hope that there is a dawn of a new beginning in the trio-exome analysis.