Results are presented for sequencing quality, duplicate reads, probe analysis of genes involved in CPHD or pituitary development, GC regions, coverage, variant calling, and public databases in the three studied groups: Group 1, comprising 2 Japanese HapMap samples sequenced by Shigemizu et al5; Group 2, comprising 2 patients with hypopituitarism and their mothers; and Group 3, comprising 109 random Brazilian samples (Figure 1).
Sequencing Quality
Sequencing quality was analyzed to check whether all samples had comparable Phred scores and numbers of sequenced reads. FastQC10 showed that sequencing quality was adequate and that mean read counts in Group 1 were similar to each other. Group 2, however, differed greatly in the number of sequenced reads, with SureSelect attaining 97 million reads and NimbleGen only 69 million. Considering only these parameters, raw base depth was 93.95x for SureSelect and 54.25x for NimbleGen. Group 3 also showed large variation among its sequenced reads: SureSelect had 89 million reads (raw base depth of 99.39x), Nextera 91 million (95.18x), and NimbleGen 70 million (55.75x) (Table 1).
Despite the divergent raw coverage among the technologies used, SureSelect captured a larger fraction of the intended target in all 3 groups even at low depths of 1x and 5x, and this tendency became clearer at higher depths. The fairest comparison is Group 1, which had the least variation in raw sequenced reads and mean coverage >100x for all three technologies; there, SureSelect covered a higher percentage of bases at 50x (90.02%) than Nextera (69.77%) or NimbleGen (86.25%). These data suggest that the SureSelect approach captures the target more homogeneously than the other two technologies.
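The percentage of target bases covered at a given depth, as compared above, can be illustrated with a minimal sketch. It assumes that a per-base depth value is already available for every target position; the function name `breadth_at_thresholds` and the toy depth values are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Sequence


def breadth_at_thresholds(
    depths: Sequence[int], thresholds: Iterable[int]
) -> Dict[int, float]:
    """Return the percentage of target bases with depth >= each threshold."""
    total = len(depths)
    if total == 0:
        raise ValueError("no target bases supplied")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


# Hypothetical per-base depths for a 10-base target (illustrative only).
toy_depths = [0, 3, 7, 22, 51, 64, 80, 12, 5, 49]
breadth = breadth_at_thresholds(toy_depths, [1, 5, 20, 50])
```

A kit with more homogeneous capture keeps these percentages high as the threshold rises, whereas uneven capture makes them drop quickly even when mean depth is similar.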
Duplicates
Estimating the number of PCR duplicates for each technology is an important step in checking for bias in the covered target regions. Duplicated reads are a byproduct of the PCR step applied during library construction in all technologies studied here, and they can lead to erroneous variant calls. Using this parameter to evaluate the technologies can indicate whether good horizontal coverage of the target regions is achieved at a low sequencing cost. In CPHD genes, NimbleGen had the highest duplication rate of all kits, while SureSelect had the lowest, although its duplication rate varied more among samples (Figure 2). This result reinforces that the SureSelect approach showed less uneven coverage than NimbleGen or Nextera.
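As a rough illustration of what a duplication rate measures, the sketch below treats reads that share chromosome, start position, and strand as duplicates of the first read seen at that position. This is a deliberate simplification and not necessarily the procedure used in this study; dedicated duplicate-marking tools consider more information, such as mate positions. The `Read` tuple layout and the toy reads are hypothetical.

```python
from typing import Iterable, Tuple

# Hypothetical read representation: (chromosome, 5' start position, strand).
Read = Tuple[str, int, str]


def duplication_rate(reads: Iterable[Read]) -> float:
    """Fraction of reads whose (chromosome, start, strand) was already seen."""
    seen = set()
    total = 0
    duplicates = 0
    for read in reads:
        total += 1
        if read in seen:
            duplicates += 1
        else:
            seen.add(read)
    return duplicates / total if total else 0.0


# Toy reads (illustrative only).
toy_reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # same position and strand: counted as duplicate
    ("chr1", 100, "-"),  # opposite strand: not a duplicate
    ("chr2", 250, "+"),
    ("chr2", 250, "+"),  # counted as duplicate
]
rate = duplication_rate(toy_reads)
```

In this toy set, 2 of the 5 reads are counted as duplicates.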
Probe Analysis of 76 genes involved in CPHD or pituitary development during embryogenesis
It is important to check whether each methodology included probes designed for our region of interest, which spans 161,022 bp, as this may result in better coverage of some genes. Although not all regions had probes specifically designed for them, the entire region was covered by nearby probes. The overlapping probes in these regions are shown in Figure 3.
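The distinction between regions with their own probes and regions reached only by nearby probes can be sketched as an interval comparison. The example assumes half-open coordinates, and the function names, the `flank` distance, and the toy intervals are hypothetical; the study does not state what distance counts as "nearby".

```python
from typing import Dict, List, Tuple

# Hypothetical interval layout: (chromosome, start, end), half-open [start, end).
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval, flank: int = 0) -> bool:
    """True if a and b share a chromosome and the gap between them is < flank.

    With flank=0 this is a plain overlap test.
    """
    return a[0] == b[0] and a[1] < b[2] + flank and b[1] < a[2] + flank


def classify_targets(
    targets: List[Interval], probes: List[Interval], flank: int
) -> Dict[Interval, str]:
    """Label each target 'direct', 'nearby', or 'none' by probe support."""
    labels = {}
    for target in targets:
        if any(overlaps(target, p) for p in probes):
            labels[target] = "direct"
        elif any(overlaps(target, p, flank) for p in probes):
            labels[target] = "nearby"
        else:
            labels[target] = "none"
    return labels


# Toy coordinates (illustrative only, not real probe designs).
toy_targets = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
toy_probes = [("chr1", 150, 270), ("chr1", 620, 740)]
labels = classify_targets(toy_targets, toy_probes, flank=50)
```

Here the first target is overlapped by a probe, the second lies 20 bp from the nearest probe and is labeled as nearby, and the third has no probe on its chromosome.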
GC regions
Analysis of coverage as a function of GC content shows that all methodologies are biased in regions of extreme GC content, whether high (>80%) or low (<20%). Outside GC-rich regions, NimbleGen shows less depth than the other methodologies, which could be an effect of its lower mean coverage overall (Figure 4). Notably, Nextera preferentially covered regions of lower GC content over those of higher GC content. This matters in our context because, compared with 76 random genes from the genome, the 76 genes of interest in this study have a higher frequency of GC-rich regions, although the difference from the random gene group is not statistically significant (Figure 5).
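The GC classes used above can be illustrated with a short sketch that computes the GC fraction of a sequence and applies the >80% and <20% cutoffs. The function names and toy sequences are hypothetical, and the window size over which GC content was computed in the study is not stated, so each sequence is treated here as one window.

```python
def gc_fraction(seq: str) -> float:
    """GC fraction of a sequence, ignoring characters other than A, C, G, T."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / called


def gc_class(seq: str, low: float = 0.20, high: float = 0.80) -> str:
    """Classify a window as GC-poor (<low), GC-rich (>high), or intermediate."""
    frac = gc_fraction(seq)
    if frac > high:
        return "GC-rich"
    if frac < low:
        return "GC-poor"
    return "intermediate"


# Toy windows (illustrative only).
classes = [gc_class(s) for s in ("GCGCGCGCGC", "ATATATATAT", "ACGTACGTAC")]
```

Grouping per-window depth by these classes is one way to compare how each kit behaves at the GC extremes.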
Coverage
Regarding overall coverage, the best of the three kits was Agilent’s SureSelect, which had good coverage of both the entire exonic region and our region of interest, as shown in the graphs in Figures 6 and 7. At 20x, the depth considered reliable for variant calling, SureSelect shows the highest percentage coverage across all comparisons; most importantly, this is also observed in our region of interest targeting 76 genes. Illumina’s Nextera maintained lower coverage in both gene groups, whereas Roche’s NimbleGen showed a slight drop in coverage for our regions of interest but was comparable to SureSelect across the whole WES region.
Variant Calling
The main goal of a researcher when using WES is to find variants that can explain the patient’s phenotype. Usually, the focus lies on exonic or splice site regions, as variants there have a higher probability of affecting the resulting protein and thus of being deleterious.
All technologies called a rather similar number of variants in all regions, in both the whole exome and the specific CPHD genes, although Nextera seems to call more variants in out-of-exon regions, such as intronic, downstream, and upstream regions (Table 2; Table 3).
Public databases
ClinVar is a public archive of the relationships between human phenotypes and genomic variations, with supporting evidence, that facilitates the association between human variation and clinical findings11. To this end, users submitting new evidence must include the clinical significance according to ACMG criteria12. We assessed whether these technologies can cover every known pathogenic variant in hypopituitarism genes, so as not to miss any probable cause of the studied phenotype.
ClinVar lists 1808 pathogenic or likely pathogenic variants in the 76 genes studied here. The mean coverage at each locus was calculated to check which sequencing kit covered the most of these variants at a depth of at least 20x. In all 3 groups, the SureSelect library had the lowest number of uncovered variants (22 of 1808). By comparison, the NimbleGen library had 80 variants not covered in any of the sample groups. This information is summarized in Table 4, and further detail on these variants and their loci can be found in Supplementary Information Table S1.
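The per-locus check described above can be sketched as follows, assuming the depth at each variant position has already been extracted for every sample of a given kit. The function name `uncovered_variants`, the variant identifiers, and the depth values are hypothetical; only the 20x cutoff comes from the text.

```python
from statistics import mean
from typing import Dict, List, Sequence


def uncovered_variants(
    depth_by_variant: Dict[str, Sequence[float]], min_depth: float = 20.0
) -> List[str]:
    """Return variants whose mean depth across samples is below min_depth."""
    return sorted(
        variant
        for variant, depths in depth_by_variant.items()
        if mean(depths) < min_depth
    )


# Hypothetical depths per variant across two samples (illustrative only).
toy_depths = {
    "var_A": [35, 41],  # mean 38.0, covered
    "var_B": [12, 30],  # mean 21.0, covered
    "var_C": [4, 9],    # mean 6.5, not covered
    "var_D": [20, 20],  # mean 20.0, covered (at least 20x)
}
missed = uncovered_variants(toy_depths)
```

Running the same function on the depth table of each kit and comparing the lengths of the returned lists gives the kind of per-kit count of uncovered variants reported here.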
ABraOM is a variant repository that reports the frequency of variants found in a normal Brazilian population. Currently, it consists of whole-genome sequencing data from 1,171 unrelated elderly individuals13. In an earlier version, composed of WES data from 609 elderly individuals, 207,621 variants appeared only in this repository and are therefore believed to be exclusive to the Brazilian population14.
Since our goal concerns the efficiency of different sequencing library preparation kits specifically in a Brazilian population, we analyzed the variants found in ABraOM individuals in the 76 genes studied, of which 175 were exonic and exclusive to the Brazilian population (Figure 8). Across all 3 groups, SureSelect was the library with the lowest number of uncovered variants (Table 5; Supplementary Information Table S2).