Comparison of multiple pipelines for analyzing WGS data from different origin
Eight pipelines were compared to determine their performance when using human WGS data (Fig. 2). When the performance of each pipeline was compared between the Caucasian data NA12878 and the data from Korean subject1, using the hg38 reference genome, there were no significant differences (Fig. 3).
The Caucasian data, NA12878, had corresponding VCF data provided by Illumina, which facilitated performance comparison. The recall value was the highest when Novoalign + GATK4 was used, and the precision value was highest with BWA-mem + Strelka2. The best performance, based on both precision and recall, was achieved with the Novoalign + GATK4 pipeline (Fig. 3).
For the Korean subject1 data, we analyzed using each pipeline excluding Samtools, which had the longest runtime among the eight pipelines used for comparing performance (Supplementary Figure S1). Unlike the NA12878 data, the Korean subject1 data did not have a gold standard reference; therefore, we used microarray data for the performance comparison. The results indicated no significant differences in pipeline performance but showed that precision was high for the BWA-mem + Strelka2 pipeline while recall was high for the Novoalign + GATK4 pipeline. The Novoalign + GATK4 pipeline showed the best performance when both precision and recall were considered (Fig. 3B). When Novoalign was used as an alignment tool, more variants were identified than those using BWA-mem; in particular, the Novoalign + GATK4 pipeline detected more variants compared to other pipelines. Similar results were obtained when data from the other two Korean subjects were used (Supplementary Table S1).
Among the variant call tools, the runtime for BWA-mem was shorter than that for Novoalign, and was fastest for Strelka2 (Supplementary Figure S1). BWA-mem took approximately 54 minutes to index the reference file and 1 hour 22 minutes for alignment. In contrast, Novoalign took 1 minute to index the reference file and approximately 5 hours for alignment. Since alignment, switching to the bam file, and the sorting process took approximately 20 minutes using Samtools, we set the thread to 64. To obtain a variant call using a BWA-mem mapped file, GATK4 took 5 hours, Strelka2 took 33 minutes, and DeepVariant took approximately 7 hours. When proceeding with a variant call using a file mapped in Novoalign, GATK4 took 4 hours, Strelka2 took 30 minutes, and DeepVariant took 7 hours and 30 minutes. In the case of Strelka2, the required time differed depending on the presence or absence of the BED file, but with the NA12878 data, if the BED file was not included, a variant call of the file mapped with BWA-mem took 40 minutes; in contrast, Novoalign took 5 hours 45 minutes (Supplementary Figure S2). Therefore, when using Strelka2, including a BED file can facilitate faster runtimes.
To verify the results of the pipeline performance analyses, Sanger sequencing was performed with the Korean subject1 data. The results showed that the concordance rate was higher when Novoalign was used as the alignment tool than the BWA-mem and was the highest when GATK4 was used as the variant caller (Table 1).
Table 1
Sanger sequencing data of Korean subject1
Pipeline
|
Number of variants
|
Genotype concordance
|
Concordance ratio
|
Alignment
|
Variant Caller
|
BWA-mem
|
DeepVariant
|
14
|
10
|
0.7143
|
GATK4
|
11
|
9
|
0.8182
|
Strelka2
|
15
|
12
|
0.8
|
NovoAlign
|
DeepVariant
|
19
|
15
|
0.7895
|
GATK4
|
37
|
32
|
0.8649
|
Strelka2
|
21
|
18
|
0.8571
|
Variant calling with an alternative Korean reference
The reference genome mostly built with a Caucasian. Thus, we proceeded with the analysis using a Korean reference genome as well as ‘the standard reference genome’, or hg38. We assessed the differences when the Korean genome was set as the reference in the Caucasian and Korean WGS analysis versus when the hg38 reference genome was used.
When using the Korean reference genome, the SNP precision and recall of the BWA-mem + Strelka2 pipeline with Caucasian WGS data were 0.936296 and 0.490368, respectively; the recall was very low, while in INDELs, the precision and recall were 0.38198 and 0.425591, respectively.
For the Korean WGS data, we compared the precision and recall with the microarray data. In a previous study using a general reference genome, among the 142,037 variants that matched the rsIDs of the WGS VCF file and microarray data, 141,885 variants showed exactly matched genotypes. When using the Korean reference genome, there were 77,679 variants with concordant rsIDs and 77,499 variants with concordant genotypes (Supplementary Table S2). In total, 440,188 SNPs were identified in the microarray data, with precision and recall values of 0.999691 and 0.986326, respectively, for the general reference and 0.999762 and 0.8399762, respectively, for the Korean reference (Fig. 4).
We compared variants called using the same pipeline and different reference genomes and inferred that the general reference genome was better than the Korean reference genome in the context of performance. When using a general reference genome rather than a Korean reference genome, the pipeline could call more variants. Furthermore, 1,811,832 variants were found to overlap. When checking whether chip data and genotype showed concordance among non-overlapping variants, we found a 2.8% match rate when using the general reference (Fig. 5).
Comparison with or without markduplicate
The markduplicate process is necessary to remove PCR duplicates that are introduced during NGS library construction. However, since PCR-free libraries are often used for subsequent sequencing, we assessed whether the markduplicate step is still necessary.
We compared the performance of Korean WGS data sequenced using the PCR-free library with and without the markduplicate step. In our computation environment, it respectively took 270 and 540 minutes (respectively corresponding to 44 ± 15% and 51 ± 10% of all anlysis time) running the markduplicate process on a BAM file mapped using BWA-mem and Novoalign when using Korean data. Therefore, this step increases the overall analysis time (Supplementary Figure S3).
In addition, the time was measured using NA12878 data to compare runtime by option. The runtimes when the Java option (-Xmx, -Xms) was given at 8g and 32g were compared. When given at 8g, the results were somewhat clearer[28]. It took 269 and 233 minutes to run markduplicate process with a BAM file created by Novoalign and BWA-mem, respectively. The 32g option took 283 and 288 minutes, respectively. These results demonstrated that the markduplicate process takes up a significant amount of time during preprocessing and identified that it is appropriate to use the 8g option when using the markduplicate process.
We then compared the performance of the pipeline with or without the markduplicate step, using Korean WGS data sequenced using the PCR-free library (Fig. 6). Comparing the match rate of the genotypes from the microarray chip and WGS VCF data confirmed that there were no significant differences based on whether markduplicate process was performed. These results agreed with those obtained using NA12878 (Supplementary Figure S4).
Best pipeline for MHC region analysis
We confirmed that using Novoalign as an alignment tool calls more variants than when BWA-mem is used. Based on the number of variants that appear in each pipeline alone, we found that when using standard references, the Novoalign + GATK4 pipeline calls far greater number of variants than other pipelines (Fig. 7A). When we identified the distribution of variants that appeared only in this pipeline, we confirmed that among the 2014 variants, 1568 variants originated from chromosome 6 (Fig. 7B). Among these, 1564 variants were found in the major histocompatibility complex (MHC) region (28,477,797–33,448,354, hg19) (Fig. 7C). This MHC region demonstrated a 99.87% match to the microarray chip and WGS genotypes (mismatches were detected only in two out of 1564 variants). The results were similar for the other two Korean subjects and for Korean subject1. Similarly, when the Novoalign + GATK4 pipeline was used, a large number of variants was detected (Supplementary Table S3); approximately 80% of the detected variants were in chromosome 6, and the majority belonged to the MHC region (Supplementary Figure S5). Additionally, the concordance ratio with the chip data was 99.83% and 99.94%, respectively, for Korean subject2 and subject3. Therefore, to analyze variants in the MHC region, we recommend using the Novoalign-GATK4 pipeline.