NGS4THAL, a one-stop molecular diagnosis and carrier screening tool for thalassemia and other hemoglobinopathies by next-generation sequencing

Background Thalassemia is one of the most common genetic diseases and a major health threat worldwide. Accurate, ecient and scalable genetic testing methodology is much needed for its molecular diagnosis and carrier screening. Results We developed NGS4THAL, a bioinformatics analysis pipeline analyzing next generation sequencing (NGS) data to detect pathogenic variants for thalassemia and other hemoglobinopathies. NGS4THAL recovers and realigns ambiguously mapped NGS reads derived from the homologous hemoglobin gene clusters to achieve accurate detection of point mutations and small insertion/deletions (InDels). And it uses several structural variant (SV) detection tools with complementary algorithms, and an inhouse database with control data on a number of known SVs to achieve accurate detection of hemoglobin SVs. Detected variants are matched with those in HbVar, allowing recognition of known pathogenic variants, including disease modiers. Tested on simulation data, NGS4THAL achieved high sensitivity and specicity. For targeted NGS sequencing data from samples with laboratory-conrmed pathogenic hemoglobin variants, it achieved 100% detection accuracy. Application of NGS4THAL on whole genome sequencing data from unrelated studies detected thalassemia mutation carrier rates for Hong Kong Chinese and Northern Vietnamese that were consistent with those from epidemiological studies. detection of point InDels. on simulation gene


Abstract
Background Thalassemia is one of the most common genetic diseases and a major health threat worldwide. Accurate, e cient and scalable genetic testing methodology is much needed for its molecular diagnosis and carrier screening. Results We developed NGS4THAL, a bioinformatics analysis pipeline analyzing next generation sequencing (NGS) data to detect pathogenic variants for thalassemia and other hemoglobinopathies.
NGS4THAL recovers and realigns ambiguously mapped NGS reads derived from the homologous hemoglobin gene clusters to achieve accurate detection of point mutations and small insertion/deletions (InDels). And it uses several structural variant (SV) detection tools with complementary algorithms, and an inhouse database with control data on a number of known SVs to achieve accurate detection of hemoglobin SVs. Detected variants are matched with those in HbVar, allowing recognition of known pathogenic variants, including disease modi ers. Tested on simulation data, NGS4THAL achieved high sensitivity and speci city. For targeted NGS sequencing data from samples with laboratory-con rmed pathogenic hemoglobin variants, it achieved 100% detection accuracy. Application of NGS4THAL on whole genome sequencing data from unrelated studies detected thalassemia mutation carrier rates for Hong Kong Chinese and Northern Vietnamese that were consistent with those from epidemiological studies.
Conclusions NGS4THAL is a highly accurate and e cient molecular diagnosis tool for thalassemia and other hemoglobinopathies based on tailored analysis of NGS data, and is potentially scalable for carrier screening purposes.

Background
Thalassemia is one of the most common genetic diseases, with very high prevalence and carrier rates in many tropical and subtropical regions (1,2). It is a group of heterogeneous, mostly recessively inherited disorders with a broad spectrum of mutations that often differs among geographical regions and ethnical groups (3). More than 400 thalassemia pathogenic variants have been documented in HbVar (4) up to date, which is a comprehensive database on pathogenic and benign hemoglobin variants.
Deletions in the α-globin gene locus are the underlying causes of most of the α-thalassemia (5), while point mutations and small insertions/deletions (InDels) in the β-globin gene region explain most βthalassemia(6). Although most of the pathological mutations involve the coding regions of the hemoglobin genes, there are pathogenic variants involving the regulatory elements in intergenic regions or introns, affecting expression or splicing of the hemoglobin genes (7). In addition, some disease modi ers(8-10), both in the hemoglobin regions and in other regions of the genome, impact on disease severity of β-thalassemia and sickle cell disease (MIM: 603903), mainly through regulation of the fetal hemoglobin levels.
Traditionally, molecular diagnosis of thalassemia (as well as other hemoglobinopathies) starts with hematological tests and hemoglobin electrophoresis, followed by genetic tests to identify the exact pathogenic variant(s) (11,12). This requires cumbersome laboratory work and step-by-step decisionmaking based on the circumstances, which makes it di cult to scale up. Certain disease mutations also may not show positive hematological features and might be missed by hematological screening (12). Thus, an accurate, e cient, and universally applicable thalassemia and hemoglobin variant detection platform is greatly needed in order to screen relevant globin gene variants and disease modi ers for this group of diseases without having to consider speci c circumstances.
Next generation sequencing (NGS) technology provides a possibility of detecting almost all the pathological variants and disease modi ers in a single experimental setting and has been used to various degrees in the molecular diagnosis of thalassemia and other hemoglobinopathies (13)(14)(15). However, none of these attempts have resulted in an accurate, e cient, and scalable solution for globin gene variant detection, and as a result, NGS technology has not been widely applied in clinical laboratories and population carrier screening for hemoglobin diseases up to date.
The α-globin gene clusters are located in 16p13.3, a region features with homologous sequences, interspersed repeats, and various types of genetic variants(16). Short NGS reads derived from this region are often ambiguously mapped to multiple positions due to sequence homology. This not only causes missed detections of point mutations and InDels, but also affects read depth estimation, which is one of the features required for detection of structural variants (SVs). Furthermore, compound forms of hemoglobin SVs are common in thalassemia epidemic regions. For example, --SEA /-α 3.7 , a compound heterozygous genotype that causes hemoglobin H (HbH) disease, is common in Hong Kong. Being able to detect complex forms of genotypes is essential for analysis tools aiming at comprehensive thalassemia molecular diagnosis and carrier screening.
Here, we present NGS4THAL, a one-stop bioinformatics analysis pipeline aiming at detecting hemoglobin variants through analyzing various forms of NGS data. NGS4THAL recovers and realigns NGS reads with multiple alignments in the hemoglobin regions for accurate detection of point mutations and small InDels. It uses SV callers with complementary algorithms, tailored parameters, and an in-house database of relevant controls of speci c SVs to improve SV detection. Validation of NGS4THAL on simulation data and targeted sequencing data on samples with globin gene variants previously con rmed from clinical laboratories achieved high sensitivity and speci city. It can also detect thalassemia pathogenic variants located in regulatory regions, as well as known disease modi ers. NGS4THAL is implemented in python and can be accessed from https://github.com/JavenCao/Realign_BAM.

Implementation
Detection of point mutations and small InDels A schematic illustration of NGS4THAL is presented in Figure 1. NGS4THAL tries to improve the detection of SNVs and small InDels through recovery and realignment of NGS sequencing reads with multiple alignments (RMAs), which are common in the two hemoglobin genomic regions, namely16p13.3 and 11p15.4.
NGS4THAL rstly recovers informative RMAs in the hemoglobin regions from the alignment le in BAM format. The following criteria are applied for recovering a RMA: (1) it maps to one of the homologous regions, namely HBZ-HBZP1, HBA2-HBA1-HBAP1 or HBG1-HBG2; and (2) it has a bwa(17)-based mapping quality score (MAQ) of zero due to having an alternative mapping position with an identical mapping pro le as that of the primary mapping position, or (3) it has a MAQ < 30 and the difference of mismatches between the primary mapping hit and the alternative mapping hit is less than three base pairs under the short-read settings (e.g. 100bps -300bps).
After recovering these reads, one read of the read pair, designated as R1, is realigned to either the primary mapping position or the alternative mapping position at random. The mate read, designated as R2, is mapped nearby, within ve standard deviation of the mean insert size of the NGS library from R1. The MAQs of these realigned RAMs are adjusted to 60 in a new BAM le for downstream analysis. SNVs/InDels are detected from the new BAM le by GATK-HaplotypeCaller(18).
NGS4THAL was designed to detect point mutations and small InDels in globin gene clusters, regulatory elements, and certain thalassemia disease modi ers known to date (Additional le 3: Table S1) and this is accomplished by looking the detected variants up in the HbVar database. The detected variants are classi ed into known pathogenic SNVs/InDels, benign Hb variants, or novel variants, with the latter also being reported for manual annotation. Variants located in equivalent positions within the homologous genes, such as between HBA2 and HBA1 (e.g, HBA2:c.64G>C or HBA1:c.64G>C), cannot be distinguished based on available short reads NGS data. They will be reported as such for annotation and further test using traditional laboratory methods. Additional technical details can be found in Note 1 in additional le 1.

Detection of hemoglobin structural variants
Depending on the relative positions between the breakpoints of the SVs and the homologous sequences, we classi ed hemoglobin SVs into four subtypes: Subtype-A SVs, whose breakpoints are both located in non-homologous regions; Subtype-B and Subtype-C SVs, for which one of the two breakpoints is located in a homologous region; and Subtype-D SVs, whose breakpoints are both located in homologous regions.
NGS4THAL uses complementary tools analyzing the mapped NGS reads for SV detection, based on features of insert size estimate, mapping positions of fragments upon splitting a read, and read-depth, respectively (Additional le 2: Figure S3).
BreakDancer (19) relies on mapping information of respective reads from a read pair, while Pindel (20) uses the respective mapping positions upon splitting a single NGS read into fragments. CoNIFER (21) makes use of read depth information to infer CNVs (Copy Number Variants), initially developed for whole exome sequencing (WES) data. Some modi cations were adopted to tailor the analysis process to better t our purpose. Brie y, requirement on minimum alternative MAQ was lowered to 0 (-q 0), allowing BreakDancer to take RMAs into consideration to better detect SVs located in homologous sequences. An SV may cause insert size deviation from the norm of the library as well as mapping location aberrations when reads are split and respectively mapped. BreakDancer and Pindel are complementary to each other and combined use of the two may increase sensitivity (20) . Thus, we included results from BreakDancer as a source of input for Pindel, which is a practice that has been shown to increase SV detection accuracy (20) .
To make CoNIFER suited for analyzing sequencing data with continuous coverage, a set of probes covering the two hemoglobin loci were generated in-silico. Each probe is designed at 80 bps in length, with 25 bp overlap between adjacent probes. This modi cation allows the software to have more detailed and ne-tuned SV detection. To increase sensitivity, the detection threshold was also lowered from 1.0 to 0.5, effectively allowing detection of CNVs with 50% read-depth change relative to the baseline average, instead of the default threshold of 100% change. Finally, variants detected from a comprehensive use of BreakDancer, Pindel and CoNIFER were matched with those in the HbVar database. SVs matching the known hemoglobin SVs in the database were reported as such, while candidate novel SVs were reported for further manual annotation for their functional implications. More technical details of the algorithm can be found in Note 1 in additional le 1.
In addition, a stage-wise process was implemented in NGS4THAL in order to detect compound SVs (such as --SEA /-a 3.7 ), which is a di cult task for any single algorithm. The principle adopted was to ne tune the background baseline to make it in line with one type of the compound SVs, so that the coexistence of the SV from the other haplotype would be easier to detect. Speci cally, in the screening stage, all the samples are analyzed using BreakDancer, Pindel and CoNIFER to determine the presence of SVs/CNVs. Then in the ne-pro ling stage, samples with suspected large SVs covering both the HBA2 and HBA1 genes (such as --SEA ) are further tested for the possibility of compound heterozygous SVs (such as co-existence of -α 3.7 in the form of --SEA /-α 3.7 ).
Brie y, in the ne-pro ling stage, samples are grouped by the type of large SVs they are suspected to carry, for example, samples suspected of carrying --SEA are grouped altogether. The read-depth for HBA genes is re-examined by CoNIFER for each group of samples, with reference to the baselines estimated by the controls carrying the large SVs (such as --SEA ). NGS4THAL incorporated an in-house sample database with NGS sequencing data on wildtype and pre-de ned heterozygous SVs of various types, from which control samples of a particular heterozygous SV can be drawn to be analyzed together with the samples in question. The sample size is set at 30 in minimum in both the screening-stage and the ne-pro ling stage, ensuring accurate estimate of read depth for SV detection.
Datasets used to evaluate NGS4THAL 1. Simulated data Point mutations and small InDels were simulated to cover HBA2, HBA1, HBB, HBD genes, on both exons and introns. The simulated SVs included the aforementioned hemoglobin SV subtypes, as well as some complex combinations -compound hemoglobin SVs. A total of 69 point mutations (including 66 thalassemia pathogenic variants, three α-globin variants causing other hemoglobinopathies), 19 small InDels, and 25 SVs spanning HBA2 and HBA1 genes (including 16 deletions, 3 duplications, and 6 compound hemoglobin SVs) were simulated (Additional le 3: Table S2).
To test the robustness of NGS4THAL, we simulated ten independent heterozygous carriers for each simulated variant. We also generated ten cases with homozygous -α 3.7 . An additional 870 independent individuals without any polymorphisms were simulated as negative controls, summing up to a simulated cohort of 2,000 samples. VarSim (22) and ART (23) were used sequentially to generate NGS reads with continuous coverage of the HBA (hg19: chr16:132,744-279,999, size = ~147kbps) and the HBB regions (hg19: chr11:5,221,230-5,321,230, size = 100kbps). The simulated NGS reads were similar to those generated by Illumina HiSeq 2500 platform in terms of base pair quality scores. Paired-end reads with read length of 150bps were produced at around 100-fold coverage depth. Enrichment probes were designed by online tools from Roche NimbleDesign. To maximize coverage and continuous capture of sequences from the two hemoglobin gene loci, including the regulatory elements, the designed probes were allowed to have no more than three matching positions in human genome. Besides, a number of genes and polymorphic sites known to modify disease severity were also included (Note 2 in additional le 1). The probes were manufactured by Roche NimbleGen. NGS library preparation, including fragment enrichment, were prepared based on SeqCap EZ HyperCap Work ow Version 1.2. Paired-end sequencing of 100bps read length was generated on Illumina HiSeq 1500, performed by the Centre of Genomic Sciences in the University of Hong Kong. Around 1500-fold coverage depth was achieved for these samples.

Whole exome sequencing (WES) and whole genome sequencing (WGS) datasets
Whole exome sequencing (WES) data on 185 individuals were provided by the Asian Primary Immunode ciency (PID) Network(24) (referred to as Hong Kong WES cohort below). Exomes were captured by probes from various sources, including Agilent SureSelect, Illumina TruSeq and xGen Exome Research Panel from Integrated DNA Technologies. Paired-end sequencing with read length of 100bps or 150bps and sequencing depth of 50-150 folds were applicable to these data.
Whole genome sequencing (WGS) data on a total of 954 samples from a Hirschsprung's disease study (25,26) conducted in The University of Hong Kong was used to test the performance of NGS4THAL, including 447 Hirschsprung's disease cases and 507 healthy controls. The ethnical origins of the samples include Hong Kong Chinese, Mainland Chinese, and Northern Vietnamese (Additional le 3: Table S3). The data were 150bps in NGS read length with paired-end sequencing and an average of 30-fold coverage. Different types of the detected variants from this dataset were rendered for orthogonal validation by traditional experimental methods, including Sanger sequencing after PCR ampli cation and Multiplex Ligation Dependent Probe Ampli cation (MLPA) (Note 3 in Additional le 1).

Haplotype analysis
To de ne haplotype(s) that might be in linkage disequilibrium (LD) with the --SEA deletion, we used 40 heterozygotes --SEA samples, con rmed by NGS4THAL from the aforementioned cohorts with targeted sequencing and WGS data, together with 3, 516 wildtype individuals on the --SEA locus with genome-wide SNP genotyping data and WGS data, all of whom were Hong Kong Chinese in ethnic origin (27). The wildtype genotype was con rmed either by report from NGS4THAL or by heterozygosity on a high quality SNP (rs2974771) located within the range of --SEA deletion. These data were used as the discovery cohort for LD analysis between --SEA and haplotypes in this region (Note 4 in Additional le 1).
A replication cohort was collected from the 1000 Genomes Project with phased genotypes(28), including 21 --SEA heterozygotes and 381 wildtype samples from matching ethnicities (Additional le 3: Table S6).
Recombination rate was derived from the HapMap project (29), and the LD patterns surrounding --SEA were based on SNP genotyping data of Hong Kong individuals, analyzed by Haploview (30). PHASE (version 2.1)(31) was used to determine the haplotypes of the genotype data and their frequencies. χ 2 test was used to calculate the association P values and odds ratios of associations between -SEA and haplotype in linkage.

Detection of point mutations and small InDels
RMAs (NGS reads with multiple alignment) occur within HBZ-HBZP1 and within HBAP1-HBA2-HBA1 (Additional le 2: Figure S4) in the α-globin gene locus, and between HBG1 and HBG2 in the β-globin gene locus (Additional le 2: Figure S5). NGS4THAL identi es these RMAs and realigns the variant-informative RMAs to equally likely positions in the reference genome at random. This process was shown to signi cantly improve variant detection located in the homologous regions (Additional le 2: Figure S6).
Testing of NGS4THAL on simulation data showed that, without the process of RMA recovery and realignment, hemoglobin SNVs located in homologous regions were often missed, including variants located in exon 1 and exon 2 of HBA1 and HBA2 genes (Figure 2 -A and B). This was because NGS reads derived from these regions often failed to meet mapping quality threshold and were not contributing to the variant detection process (Additional le 2: Figure S7). In contrast, upon recovery and realignment of the informative RMAs by NGS4THAL, detection of variants in these regions was signi cantly improved. For SNVs in the simulation data, NGS4THAL achieved a 97.87% sensitivity, compared to 68.09% in the absence of NGS reads recovery and realignment process. Considering SNVs with small InDels altogether, NGS4THAL achieved a sensitivity of 98.48%, compared to a 77.27% sensitivity without the RMA realignment process.
We didn't treat variants detected on equivalent positions within the homologous genes (such as HBA2:c.64G>C or HBA1:c.64G>C) as false positives, as under the short reads NGS sequencing scheme (e.g., 100-300bps read length and insert size around 300-500bps), the exact position of the point mutations between HBA2 and HBA1 cannot be determined with certainty, particularly for exon 1 -2 of these two genes, which share nearly 2kb sequences of complete identity. Acknowledging this caveat, a 100% speci city was achieved as pathogenic variants were concerned with the realignment process.
Testing of NGS4THAL on the 71 samples sequenced by the targeted sequencing platform, 100% sensitivity was achieved for the pathogenic hemoglobin point mutations and small InDels (including variants causing thalassemia and other hemoglobinopathies), which were previously con rmed by traditional laboratory methods (Additional le 2: Figure S8). Specially, Hb Q-Thailand and Hb_Luton, located in exon 1 and exon 2 of HBA2 and HBA1, would have been missed without the RMA realignment process.
Applying NGS4THAL on whole genome sequencing (WGS) data from the Hirschsprung's disease study (25,26), a number of thalassemia pathogenic point mutations and small InDels common to local populations were detected, including Hb CS, Hb Westmead, CD41/42 (-CTTT) and -28 (A>G) for Hong Kong Chinese, and HbE, CD 17 (AAAG>TAG) and -50 G>A for Northern Vietnamese. The detected carrier rates for these variants were consistent with those from previous epidemiological reports(32) , (33) (Additional le 3: Table S7 and Additional le 3: Table S8). Furthermore, all the unique mutations detected above were orthogonally con rmed by PCR ampli cation followed by Sanger sequencing (Additional le 3: Table S9).
Commonly used WES capture protocols in the market can achieve complete coverage of HBA2, HBA1, HBM genes, including introns through off-target capture and sequencing, and coverage on the exonic regions of HBZ, HBB, HBD, HBG1 and HBG2. Hemoglobin variants located in these regions could be readily detected by NGS4THAL. A number of locally common thalassemia pathogenic point mutations and small InDels were detected from the Hong Kong WES cohort with high con dence, including Hb Westmead (carrier frequency was 1/185=0.5%) and Hb CS (1/185=0.5%) from exon 3 of HBA2, Codons 41/42 (-TTCT) (2/185=1.1%), -28 (A->G) (1/185=0.5%) and IVS-II-654 (C->T) (2/185=1.1%) from HBB (Additional le 3: Table S10). The carrier rates for these pathogenic variants detected in the WES cohort were consistent with the epidemiology reports, although this is a small sample size for accurate estimation (32). On the other hand, intronic mutations in HBB, HBG1 and HBG2, as well as intergenic variants that may affect expression of the hemoglobin genes might have been missed by WES, highlighting the shortcomings of the methodology on coverage.

Detection of structural variants
We tested NGS4THAL on simulation data composed of the four subtypes of hemoglobin structural variants (SVs) de ned earlier, including heterozygous, homozygous, and compound heterozygous SV genotypes. Subtype-A deletions (such as -α 2.7 ), presenting with de nitive characteristics from paired end sequencing, could be detected without ambiguity. These SVs present characteristics such as aberrant insert size inferred from mapping positions of paired reads, breakpoints revealed by mapping of reads upon being split, and reduced/increased coverage depth, as detected by BreakDancer, Pindel and/or CoNIFER, respectively (Figure 3 -A).
Subtype-B (such as -α 5.3 ) and Subtype-C (such as -α 3.5 ) deletions, with one of the two breakpoints located in homologous sequences, can be detected by BreakDancer when analyzing changes in insert size from paired-end reads near the breakpoints, which works well especially with RMAs being recovered and used.
Using -α 5.3 , a Subtype-B deletion, as an example (Figure 3 -B), tailored software BreakDancer detected two candidate deletions: one was consistent exactly with the simulated -α 5.3 and the other detection was a false, longer SV with one of the ends incorrectly determined from ambiguous mapping. For NGS4THAL, this issue of breakpoints ambiguity can be resolved by matching the results with HbVar database and the detection matching with known variants in the database was reported as the most likely variant.
For algorithms based on mapping of split-reads, such as Pindel, the 5' breakpoint could be determined by analyzing NGS reads that span the breakpoint. However, the split-fragments of these reads could also be mapped to the paralogous HBA1 gene, causing uncertainty in de ning the 3' breakpoint. In this case, NGS4THAL rstly detects all the plausible SVs supported by the data, and makes the call by matching all the candidate results with the known hemoglobin SVs from the HbVar database. In addition, simultaneously applying CoNIFER also detected Subtype-B and Subtype-C deletions through changes in read depth.
The -a 3.7 is a Subtype-D deletion, caused by crossing-over between misaligned homologous X,Y, Z boxes during meiosis (34). As both breakpoints are located in homologous sequences, Subtype-D variants cannot be detected by evaluating insert size aberrations that BreakDancer is based on (Figure 3 -D).
Similarly, they cannot be detected by split-read mapping methods such as Pindel either. Nevertheless, Subtype-D deletions could still be detected by read depth-based algorithms, such as CoNIFER when parameters are adjusted. In the case of -α 3.7 (hg19: chr16:223,299-227,103), NGS4THAL detected a deletion spanning hg19:chr16:222,879-226,799. Matching the result with all variants in HbVar database reveals that this is likely the -α 3.7 deletion (Figure 3 -D).
Detecting compound heterozygous forms of SVs, which are not uncommon in epidemic regions for thalassemia, can be more complicated. Taking --SEA /-α 3.7 as an example, we found that in the screening stage, although the --SEA deletion could be identi ed by both BreakDancer and Pindel (Figure 4 -A), the loss of another copy of HBA2 on top of --SEA (e.g., -α 3.7 ) can't be correctly determined, even by CoNIFER with modi ed parameters (Figure 4 -B).
Considering these scenarios, in the ne-pro ling stage, as --SEA deletion has already been detected, NGS4THAL groups these samples with a number of heterozygous --SEA controls with similar sequencing depth for further evaluation by CoNIFER. In so doing, the loss of the second copy in HBA genes from the --SEA /-α 3.7 genotype could be clearly detected (Figure 4 -C). The same principle holds true for other compound heterozygous SVs with position overlaps, such as --CANT /-α 3.7 and --SPAN /-α 3.7 (Additional le 2: Figure S9).
From simulated data, NGS4THAL achieved 98.40% sensitivity and 100% speci city for the detection of the hemoglobin SVs including deletions, duplications and compound rearrangements (Additional le 3: Table S11). NGS4THAL determines whether a detected SV is known in the HbVar database by checking the coordinates of the SV boundaries. Due to sequencing depth uctuations and issues with RMAs, the boundaries of SVs determined by read-depth method are not always accurate. As a result, a few SVs were not accurately reported by NGS4THAL, accounting for the 1.6% of missed detections.
When NGS4THAL was applied to the WES data (Additional le 3: Table S10), a limited number of offtarget NGS reads upstream of the HBM gene allowed the detection of the --SEA deletion. These reads, which were near the --SEA breakpoints, allowed BreakDancer and Pindel to make detection of three --SEA heterozygotes, for a carrier frequency of 1.6%. In addition, NGS4THAL detected one -α 3.7 heterozygote deletion (1/185=0.5%) with some uncertainty on the boundaries by CoNIFER. Notably, the detected rates of --SEA and -α 3.7 were both much lower than expected, which probably were caused by lack of adequate and continuous coverage by the WES data.
And the detected carrier rates for αα/ααα anti3.7 was 1.57% (4/254), suggesting that these mutations are Impact of different sequencing platforms on thalassemia mutation detection It was revealed that both targeted sequencing data of the hemoglobin regions with in-depth and continuous coverage and WGS with the widely adopted 30-fold coverage depth lead to mutation detections that are accurate or in line with expectations from epidemiology reports, for both point mutation/small InDels and SVs. WES with 50-150 folds sequencing depth did lead to detection of point mutations and small InDels located in well covered coding regions (Additional le 3: Table S10). However, WES has its intrinsic shortcomings in SV detection due to the coverage discontinuity and lack of coverage in introns and intergenic regions. Although off-targeted NGS reads might lead to identi cation of --SEA deletion, the detected rate is lower than expected. The carrier rate detected from Hong Kong WES dataset (5.9%,11/185) is probably a re ection of missing detections of SVs and non-coding SNVs (Additional le 3: Table S10).
Haplotype(s) in linkage disequilibrium with the --SEA deletion All the detected --SEA deletions from our datasets shared the same breakpoints with the coordinates of hg19:chr16: 215395-234699, consistent with its nature as a founder mutation(38). Population substructure analysis of the WGS cohorts by principal component analysis showed that --SEA deletions were only detected from samples genetically similar to and clustered with Southeast Asian populations, and in cases often defying self-reported geographical origins(Additional le 2: Figure S14).
Local LD structure showed that --SEA deletion located in a LD-block with recombination hotspots located immediately upstream (the telomere end of 16p13.3) and some distance downstream (Additional le 2: Figure S15). We selected six SNPs from this interval to de ne haplotypes that might be associated with the --SEA deletion. From the discovery cohort, a haplotype with G(rs2685118) -G(rs1203957) -A(rs3918352) -T(rs11642609) -G(rs1203974) -A(rs8045291) was found in high LD with --SEA (Additional le 3: Table S13). This haplotype was detected in all but one (39 of 40, 97.5%) --SEA carriers and in only 25.63% of the non --SEA individuals, with an estimated OR of 113.19 and a P-value < 2.2e-16. Using this haplotype as a surrogate for --SEA deletion achieves a sensitivity of 97.50% and speci city of 74.37% (Additional le 3: Table S14). In the replication cohort, this risk haplotype (GGATGA) was associated with --SEA with similar strength (OR of 114.61, P-value < 2.2e-16), which would achieve a sensitivity of 95.45% and a speci city of 84.51%.

Discussion
Thalassemia is a major public health burden in many countries, especially in tropical and subtropical regions. Global population migration, however, has led to the spread of the disease to historically nonepidemic regions, such as Northern Europe and the Americas (39). It has been shown that establishing nation-wide or territory-wide thalassemia screening programs can signi cantly reduce the incidence rates of thalassemia or other hemoglobinopathies, moving forward towards the goal of controlling these ancient diseases (40). It is commonly believed that pre-conception knowledge of carrier status, which can be achieved by population screening or screening for couples who are planning to have children, could offer initiatives to at-risk subjects to take necessary measures in avoiding having children with thalassemia. This is especially valuable in regions where termination of pregnancies is not acceptable for religious, cultural or legal reasons.
The current laboratory diagnosis methods for thalassemia starts with detection of reduced levels of mean corpuscular volume/mean corpuscular hemoglobin (MCV/MCH), which could miss certain mutations due to various reasons. For example, silent heterozygous α-thalassemia and β-thalassemia carriers, compound heterozygotes with both an α-globin gene deletion and a duplication (α-/ααα), heterozygous Hb E carriers, and double heterozygotes of β-thalassemia and α-thalassemia mutations, may not show reduction in MCV/MCH levels. Diagnosis methods based on hemoglobin electrophoresis could also be problematic, especially for some silent heterozygous mutation carriers. For example, coexistence of β-thalassemia mutations may signi cantly reduce HbH levels. Furthermore, downstream genetic tests, such as Gap-PCR and MLPA, which are the gold standard for the diagnosis of certain mutations, are often cumbersome and are targeted towards certain genetic mutations. Many of these techniques are only available to some tertiary hospitals, preventing their wide-spread applications.
Although NGS sequencing of hemoglobin regions has the potential to reveal nearly the complete spectrum of genetic mutations, data analysis poses a big challenge due to sequence complexity of the hemoglobin regions. This is aggravated by the fact that bioinformatic expertise is still not widely available in clinical laboratories. Thus, it is necessary to develop an accurate, comprehensive, and e cient platform to aid the molecular diagnosis of thalassemia and other hemoglobinopathies. NGS4THAL, a NGS data analysis pipeline, aims at serving this purpose. NGS4THAL particularly deals with issues caused by homologous sequences among the hemoglobin genes and pseudogenes for accurate detection of point mutations and small InDels. Using a combination of complementary SV callers with tailored parameters, and aided by speci c in-house controls on one of the two types of heterozygous mutations, NGS4THAL achieved excellent performance in hemoglobin SV detections as well. Tested by both simulated and real-world thalassemia patient data, NGS4THAL achieved high sensitivity and speci city.
Homologous DNA sequences in human genome cause challenges for mutation detection by NGS featuring short reads and small insert sizes. Different approaches have been proposed to deal with the alignment issues, namely 'unique' strategy, 'best match' strategy and 'all matches' strategy (41). They refer to keeping only the uniquely mapped reads, reads with the best matches, or all reads including those with lower mapping quality scores, respectively. To our experience, most of the analysis tools by default removes RMAs before variants calling, akin to the 'unique' strategy. In a genome-wide perspective, this approach balances sensitivity and speci city. However, when it comes to a given region, such as the HBA gene locus, this approach leads to the removal of most of the NGS reads derived from HBA1 and HBA2, resulting in failure of detection of mutations in or across these regions.
NGS4THAL makes use of a combination of three approaches: (1) the reads with MAQ > 30 were used as they were, akin to the 'unique' strategy; (2) the reads with MAQ = 0 were reassigned to one of the equally likely mapping positions at random, which follows the 'best match' strategy; (3) for the reads with 0 < MAQ < 30 and with differences of mismatches < 3bp between the primary and the alternative hits, they are realigned with the consideration of them being variants-informative, which is akin to the 'all matches' strategy. Tested by both simulated data and real-world sequencing data, the use of a combination of different strategies was shown to handle the multiple alignments issue well from homologous hemoglobin genes and pseudogenes, signi cantly increasing sensitivity for point mutation detection. Many 'false' detections are caused by the simultaneous reporting of mutations from HBA1 and HBA2, an issue cannot be solved by short read sequencing. Some other false positive callings caused by RMA realignments are from upstream of HBA2 and between HBA2 and HBA1. As they are not involved in pathogenesis for thalassemia and other hemoglobinopathies, they were treated as likely false detections and were not reported.
Currently, there is no widely accepted best practice for SV detection. Most SV detection tools are specialized for a certain type of SVs by making use of limited sequencing signatures. Comparing mapping positions/orientations of read pair and split read, evaluation of read depth and de novo assembly of relevant reads are among the most used approaches. Integrating results from complementary SV callers has been suggested to improve SV detection (42). Classifying the hemoglobin SVs into four subtypes helped us select relevant and complementary tools to increase sensitivity.
Compound SVs (such as --SEA /-α 3.7 ) are di cult to handle for any available tools based on a single algorithm. Making use of heterozygous controls on one type of the SVs allows the software to re-estimate read depth baselines and to facilitate the detection of the co-existing overlapping SV. Compound SVs are common in thalassemia epidemic regions, but fortunately, types of compound SVs are predictable according to local epidemiology data, allowing us to prepare this "speci c control" dataset to aid the detection process, such as --SEA in Southeast Asia and --THAI in Thailand.
For targeted sequencing data, it is likely that certain deletions are even larger than the entire targeted region, which applies to some known thalassemia deletions. In this case, we suggest examination of run of homozygosity (ROH) on high quality SNVs in the region and violation of Mendelian inheritance if parental data are also available. NGS4THAL not only detects known thalassemia mutations, it is also capable of detecting novel hemoglobin variants, disease modi ers, and variants causal to other hemoglobinopathies. Looking up the detected variants in the HbVar database allows easy reporting of known causal mutations. We categorized all the variants into "Known Thalassemia Causal", "Known Hemoglobinopathy Causal", "Known non-Causal Hb variants" and "Novel variants" for downstream follow ups.
NGS4THAL is implemented in python and can be easily accessed in https://github.com/JavenCao/Realign_BAM, with detailed speci cations. The RMA realignment steps implemented in NGS4THAL can be completed in minutes. The major time-consuming steps for NGSTHAL are variant callings by GATK-HaplotypeCaller and Pindel. Future improvements are needed to ramp up the e ciency when processing large amount of sequencing data, such as in a scenario of population screening.

Conclusions
In summary, taking the DNA sequence speci city of hemoglobin into consideration, we developed a tailored molecular diagnosis and carrier screening pipeline named NGS4THAL, which achieved high sensitivity and speci city for thalassaemia (and other haemoglobinopathies) pathogenic variants detection.

Availability And Requirements
Project name: NGS4THAL