Comparative Exome Capture Methods to Investigate Genes Involved in Hypopituitarism in a Brazilian Population

Whole Exome Sequencing (WES) has been a useful tool for improving molecular diagnosis in hypopituitarism, leading to the discovery of at least 8 new genes in the last 7 years. However, some genes associated with hypopituitarism show low coverage with this methodology, limiting its use for molecular diagnosis. Our objective was to compare three library preparation kits, NimbleGen (Roche), SureSelect (Agilent) and Nextera (Illumina), to determine which performs best in terms of sequencing quality, exon extension coverage (≥ 98%) and base read depth (≥ 20x) for 44 genes associated with hypopituitarism and 32 genes involved in pituitary development. Three groups, composed of 2 HapMap samples (Group 1), 2 Brazilian patients with hypopituitarism and their respective mothers (Group 2), and 109 random Brazilian samples (Group 3), were sequenced on the Illumina platform. Groups 1 and 3 were prepared with all three library preparation kits, while Group 2 was prepared with NimbleGen and SureSelect. Although all technologies covered the selected genes with similar efficiency in GC-poor (less than 20%) and GC-rich (more than 80%) areas, SureSelect reached the most uniform coverage in the selected region, with a lower level of duplicate reads and a higher number of identified pathogenic variants.


Introduction
Combined Pituitary Hormone Deficiency (CPHD) is the deficiency of one or more pituitary hormones, affecting 1:8000 births worldwide 1 . It may lead to short stature, weight gain and infertility, among other problems, depending on which hormones are not being produced. It can be either idiopathic or congenital, and either non-syndromic, leading only to pituitary hormonal deficiencies, or syndromic, with extra-pituitary phenotypes such as septo-optic dysplasia and holoprosencephaly 2 .
Several genes have been reported to carry mutations leading to CPHD, and most of them were described through the candidate gene approach using Sanger sequencing. However, all the genes described so far explain only a small percentage (around 15%) of the patients' clinical features, as both Fang et al 2 and DeRienzo et al 3 have pointed out. Nowadays, the method of choice is Whole Exome Sequencing (WES), which is capable of sequencing every coding region of the genome. It allows the researcher to investigate known genes as well as to find novel variants in as yet unknown ones, increasing the possibility of reaching a diagnosis for these patients.
We used the NimbleGen kit Ez v3 to prepare DNA samples for WES of 23 patients with idiopathic hypopituitarism (11 isolated cases, 12 families). However, some genes had lower coverage than expected for trustworthy variant calling, making it impossible to analyze these regions. The lack of proper coverage in these genes may be due to high GC content, a known source of bias originating from the PCR step required in library preparation 4 . This uneven coverage can also be observed in databases such as gnomAD, which provides graphs showing the median coverage of genes in large sample groups.
Previous studies have already investigated different methodologies involving library preparation kits [5][6][7][8] . Therefore, to elucidate which technology better covers the regions important for proper molecular diagnosis of CPHD patients, we decided to compare different WES library preparation kits for a set of known genes. This set comprises the genes already identified in published studies as causing CPHD, along with 32 genes with no known pathogenic mutations related to hormone deficiency but with a role in pituitary development during embryogenesis 9 .

Results
Results on sequencing quality, duplicate reads, probe analysis of genes involved in CPHD or pituitary development, GC regions, coverage, variant calling and public databases are presented for the three studied groups: Group 1, comprising 2 Japanese HapMap samples sequenced by Shigemizu et al 5 ; Group 2, comprising 2 patients with hypopituitarism and their mothers; and Group 3, comprising 109 random Brazilian samples (Figure 1).

Sequencing Quality
Sequencing quality was analyzed to check whether all samples had comparable Phred scores and number of sequenced reads.
Although FASTQC 10 showed that sequencing quality was adequate and that average read counts in Group 1 were similar across kits, Group 2 showed a large difference in the number of sequenced reads, with SureSelect attaining 97 million reads and NimbleGen only 69 million. Considering only these parameters, the raw base depth was 93.95x for SureSelect and 54.25x for NimbleGen. Group 3 also showed large variation in sequenced reads: SureSelect had 89 million reads (raw base depth of 99.39x), Nextera 91 million (95.18x) and NimbleGen 70 million (55.75x) (Table 1).
Despite the divergent raw coverage among the technologies used, even at low depths of 1x and 5x SureSelect captured a higher proportion of the intended target in all 3 groups, and this tendency was clearer at higher depths. A fairer comparison can be made in Group 1, which had the least variation in raw sequenced reads and mean coverages >100x for all three technologies: at higher depths, SureSelect (90.02% at 50x) covered a higher percentage of bases than Nextera (69.77% at 50x) or NimbleGen (86.25% at 50x). These data suggest that the SureSelect approach captures the target more homogeneously than the other two technologies.
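The percentages quoted at 1x, 5x and 50x are breadth-of-coverage values, that is, the share of target bases reaching a given depth. The following is a minimal sketch of that calculation, assuming per-base depths have already been extracted for the target region. The function name, the threshold list and the toy depths are illustrative assumptions and do not come from the study's pipeline or data.

```python
from typing import Dict, Iterable, Sequence


def breadth_of_coverage(
    depths: Sequence[int],
    thresholds: Iterable[int] = (1, 5, 20, 50),
) -> Dict[int, float]:
    """Return the percentage of target bases with depth >= each threshold."""
    total = len(depths)
    if total == 0:
        raise ValueError("no target bases supplied")
    return {
        t: 100.0 * sum(1 for d in depths if d >= t) / total
        for t in thresholds
    }


# Toy per-base depths for a hypothetical 10 bp target (not study data).
toy_depths = [0, 3, 8, 19, 20, 25, 49, 50, 75, 120]
result = breadth_of_coverage(toy_depths)

for threshold, percent in result.items():
    print(f">= {threshold}x: {percent:.1f}% of target bases")
```

A kit with more uniform capture keeps these percentages high as the threshold rises, which is the pattern described above for SureSelect in Group 1.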

Duplicates
Estimating the number of PCR duplicates for each technology is an important step in checking for bias in the covered target regions.
PCR duplicate reads, a byproduct of the PCR step applied during library construction in all technologies studied here, can lead to erroneous variant calls. Using this parameter to evaluate the technologies can indicate good horizontal coverage of regions at a low sequencing cost. In CPHD genes, NimbleGen's duplication rate was the highest of all kits, while SureSelect's was the lowest, although it showed a larger variance in duplication rate among samples (Figure 2). This result reinforces that the SureSelect approach showed less uneven coverage than NimbleGen or Nextera.
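As an illustration of how duplication can be compared across kits, the sketch below computes a per-sample duplication rate (duplicate reads divided by total reads) and then the mean and spread per kit. The kit labels, read counts and function names are hypothetical and chosen only to show that a kit can have the lowest mean rate while varying more between samples. The tool actually used to mark duplicates is not stated in this section.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def duplication_rate(duplicate_reads: int, total_reads: int) -> float:
    """Fraction of reads flagged as PCR duplicates in one sample."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return duplicate_reads / total_reads


def summarize(
    samples: Dict[str, List[Tuple[int, int]]]
) -> Dict[str, Tuple[float, float]]:
    """Map each kit to (mean rate, population standard deviation)."""
    summary = {}
    for kit, counts in samples.items():
        rates = [duplication_rate(dup, total) for dup, total in counts]
        summary[kit] = (mean(rates), pstdev(rates))
    return summary


# Hypothetical (duplicate_reads, total_reads) pairs per sample, not study data.
toy_samples = {
    "kit_A": [(12, 100), (8, 100)],
    "kit_B": [(2, 100), (14, 100)],
}
summary = summarize(toy_samples)

for kit, (avg, spread) in summary.items():
    print(f"{kit}: mean duplication {avg:.2%}, spread {spread:.2%}")
```

In this toy input, kit_B has the lower mean rate but the larger spread, mirroring the kind of pattern reported for SureSelect.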

Probe Analysis of 76 genes involved in CPHD or pituitary development during embryogenesis
It is important to check whether all the methodologies used had probes designed for our region of interest, which spans 161,022 bp, as this may result in better coverage of some genes. Although not all regions had probes specifically designed for them, the entire region was covered by nearby probes. The overlapping probes in these regions can be seen in Figure 3.

GC regions
Investigation of coverage with respect to GC content shows that all methodologies are biased in regions that are GC-rich (>80%) or GC-poor (<20%). Outside GC-rich regions, NimbleGen shows less depth than the other methodologies, which could be an effect of its lower overall mean coverage (Figure 4). It is important to note, however, that Nextera showed a preference for covering GC-poor over GC-rich regions. This matters in our context because, when the 76 genes of interest in this study are compared to 76 random genes from the genome, our chosen group does show a higher frequency of high-GC areas. However, the difference from the random gene group is not statistically significant (Figure 5).
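The GC classes used in this comparison can be expressed as a simple calculation on a sequence window. The sketch below is an assumption-laden illustration: it applies the 20% and 80% cut-offs stated above to short toy sequences, ignores ambiguous bases such as N, and uses hypothetical function names. The window size and the tool used in the study are not specified in this section.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        raise ValueError("sequence contains no called bases")
    return sum(base in "GC" for base in called) / len(called)


def gc_class(sequence: str, low: float = 0.20, high: float = 0.80) -> str:
    """Label a window as GC-poor (<20%), GC-rich (>80%) or intermediate."""
    fraction = gc_fraction(sequence)
    if fraction < low:
        return "GC-poor"
    if fraction > high:
        return "GC-rich"
    return "intermediate"


# Toy windows (not taken from any gene in the study).
for window in ("ATATATATAT", "ATGCATGCAT", "GGCCGGCCGC"):
    print(window, f"{gc_fraction(window):.0%}", gc_class(window))
```

Mean depth can then be grouped by these labels to compare how each kit behaves in GC-poor and GC-rich windows.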

Coverage
Regarding overall coverage, the best of the three kits was Agilent's SureSelect, which had good coverage of both the entire exonic region and our region of interest, as seen in the graphs in Figures 6 and 7. The graphs show that at 20x, the depth considered reliable for variant calling, SureSelect had the highest percentage coverage across all comparisons. Most importantly, this was also observed in our region of interest, which targeted 76 genes. While Illumina's Nextera maintained a lower coverage in both gene groups, Roche's NimbleGen had a slight drop in coverage for our region of interest. For the whole WES region, however, it was comparable to SureSelect.

Variant Calling
The main goal of a researcher using WES is to find variants that can explain the patient's phenotype. Usually, the focus lies on exonic or splice site regions, as variants there have a higher probability of affecting the resulting protein and thus of being deleterious.
All technologies are rather similar in the number of called variants in all regions, both in the whole exome region and in the specific CPHD genes, although Nextera seems to call a higher number of variants in out-of-exon regions, such as intronic, downstream and upstream regions (Table 2; Table 3).

Public databases
ClinVar is a public archive of the relationships between human phenotypes and genomic variations, with supporting evidence, which facilitates the association between human variation and clinical findings 11 . To that end, when submitting new evidence, users must include the clinical significance according to ACMG criteria 12 . We determined whether these technologies can cover every known pathogenic variant in hypopituitarism genes, so as not to miss any probable cause of the studied phenotype.
ClinVar lists 1808 pathogenic or likely pathogenic variants in the 76 genes studied here. The mean coverage at each locus was calculated to check which sequencing kit covered the most of these variants at a depth of at least 20x. In all 3 groups, the SureSelect library had the lowest number of uncovered variants (22 variants out of 1808). In comparison, the NimbleGen library had 80 variants not covered in any of the sample groups. A summary of this information can be seen in Table 4, and further detail on these variants and their loci can be found in Supplementary Information Table S1 of this paper.
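The check described here, flagging known pathogenic loci whose mean depth falls below 20x, can be sketched as follows. The locus labels, depths and function name are hypothetical placeholders. Only the 20x threshold comes from the text, and the way mean depth per locus was obtained in the study is not detailed in this section.

```python
from typing import Dict, List


def uncovered_loci(
    mean_depth_by_locus: Dict[str, float],
    min_depth: float = 20.0,
) -> List[str]:
    """Return, in sorted order, the loci whose mean depth is below min_depth."""
    return sorted(
        locus
        for locus, depth in mean_depth_by_locus.items()
        if depth < min_depth
    )


# Hypothetical mean depths per locus for one kit (not study data).
toy_depths = {
    "locus_1": 35.2,
    "locus_2": 19.9,
    "locus_3": 20.0,
    "locus_4": 4.1,
}
missed = uncovered_loci(toy_depths)

print(f"{len(missed)} of {len(toy_depths)} loci below 20x: {missed}")
```

Running the same function on each kit's depths and comparing the lengths of the returned lists gives the kind of per-kit count of uncovered variants summarized above.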
ABraOM is a variant repository with the frequency of variants found in a normal Brazilian population. Currently, it consists of Whole Genome Sequencing data from 1,171 unrelated elderly individuals 13 . In an earlier version, composed of WES data from 609 elderly individuals, 207,621 variants appeared only in this repository and are therefore believed to be exclusive to the Brazilian population 14 .
Since our goal involves the efficiency of different sequencing library preparation kits in a Brazilian population specifically, we analyzed the variants reported in ABraOM in the 76 genes studied, of which 175 were exonic and found to be exclusive to the Brazilian population (Figure 8). Across all 3 groups, SureSelect was the library with the lowest number of uncovered variants (

Discussion
When using high-throughput sequencing technologies, it is necessary to perform quality and coverage analysis before variant filtering to ensure reliable results. Generally, the coverage of genes known to cause the phenotype is not discussed in published articles that report new variant findings in hypopituitarism. This fact, along with our experience of low WES coverage in some of our samples prepared with the NimbleGen kit, which had to be repeated, led us to compare the efficiency of other kits available to the general market and to us. As many other comparisons of these kits have already been made 5-7 , we decided to focus our comparison on regions important to the disease we have been studying, in order to show researchers in this field which approach is better suited to their cohort. For that, we selected genes that are important to pituitary development during embryogenesis, as well as genes that have been associated with hypopituitarism [15][16][17][18][19][20][21][22][23][24][25][26][27][28][29][30] .
The use of simpler technologies, such as gene panels, can be tempting for tricky parts of the genome such as those discussed in this study, but according to the literature few molecular diagnoses have been reached this way. Nakaguma et al. had a 4% success rate in diagnosing 117 patients using a custom gene panel with 26 genes previously related to hypopituitarism 31 . However, as stated by the authors, this cohort had been previously screened, and gene panels may return a higher success rate (closer to 15%) if used in a cohort not previously subjected to any diagnostic approach, similar to the overall diagnostic rate for CPHD patients 3,31 . Even so, WES is perhaps a better option, since it opens the way to the discovery of new genetic causes 2,3 .
We also opted to broaden our sample groups and, unlike previous comparisons, added different sample groups, such as patients with the disease in question and random Brazilian samples. This was done to ensure that known biases common to WES, such as sample, run or laboratory bias, were not a major factor in the results obtained 4 . Unlike for other populations, little information about the Brazilian population is available in the literature, as evidenced by the only two existing databases containing samples from this group, ABraOM and SELA 14,32 .
As all the technologies studied are of great quality and achieve their goals, the answer to the question of which is best and should be used comes down to specific parameters and depends on the researcher's targets 7 . For most investigators using WES in the medical sciences, even a small difference in coverage of coding regions is of great importance, as it directly reflects the ability to identify rare variants 5 . This is also the case for most researchers trying to obtain a molecular diagnosis for CPHD patients.
Our results contrast with the findings of Clark et al, who reported that the densely packed and overlapping baits of Roche's NimbleGen granted a higher coverage of targeted regions, with a slight edge in sensitivity for SNPs and indels 7 . However, different versions of the library preparation kits were used: theirs was v2.0 of the kit while ours was v3.0, which may explain this difference. This is further exemplified by Asan et al, who concluded that, between NimbleGen v1.0 and SureSelect All Human Exon, the latter had a higher number of SNPs 8 . Both studies noted that NimbleGen needed a lower number of reads to reach the expected coverage, which is corroborated by our results, as it reached coverage comparable to the other kits even with a lower number of sequenced reads 7,8 .
Other studies have also noted that Illumina's Nextera showed an increase in read depth in areas with 40 to 60% GC content 5,6 . This, however, did not translate into higher coverage of genes implicated in CPHD with a high GC content, such as SOX3. In fact, Nextera had the lowest coverage among the kits. This may be because its fragmentation is enzymatic, which introduces a greater fragment bias than the random shearing of mechanical fragmentation. Therefore, kits that use mechanical shearing for library preparation may achieve more adequate coverage in these regions.
Lastly, the higher coverage of coding regions by Agilent's SureSelect All Human Exon v5 is seen across different comparison studies, as well as here 5,6 . Not only did it reach a higher expected coverage in the whole exonic region, but it also did so in our specific region of hypopituitarism genes and genes involved in pituitary development. We compared the coverage across kits of known pathogenic or likely pathogenic loci in our region of interest found in ClinVar, as well as of loci related to Brazilian polymorphisms according to ABraOM. In both cases, SureSelect had the highest number of loci covered. Therefore, it is a strong contender for the best kit of the three.
In conclusion, when comparing library preparation kits for WES in studies seeking a molecular diagnosis for CPHD patients, Agilent's SureSelect kit had the best performance. Moreover, regardless of the methodology used, it is of utmost importance to verify that every known causative gene has been properly covered in the sequenced samples, so as not to miss variants.
University of São Paulo's SELA (Sequenciamento em Larga Escala), following manufacturer's protocols specific for each kit. A breakdown of the sample groups analyzed can be found in Figure 1.

Chosen region
From a search of the literature, we selected 76 genes, shown in Table 6, that are either involved in pituitary embryogenesis or have mutations found in CPHD patients, regardless of the level of evidence obtained when their variants were classified using ACMG criteria, as shown in Table 7. This region is referred to as "our region of interest" in the text. Meanwhile, the whole exome region that each kit targets for sequencing may be referred to as the "global region".