Study design and population:
We performed a cross-sectional study involving the Dar-es-Salaam (Tanzania) Muhimbili National Hospital SCD cohort, which consisted of 1,725 SCD patients, recruited between 2004 and 2009, for prospective surveillance, with three monthly interval visits for routine check-up [3]. These patients were subjected to folic acid (5mg/day) and penicillin. Different hematological factors, including complete blood counts and foetal haemoglobin (HbF) quantifications, were measured during hospital visits. Written informed consent was obtained for each adult patient (>18 years) and ethical approval given by the Muhimbili University Research and Publications Committee (MU/RP/AEC/VOLX1/33 and 2017-03-06/AEC/Vol X11/65). Informed and written consent was obtained from parents or guardians for all minor patients (≤18 years) with an oral assent for children older than 7 years. The study involved 14 individuals confirmed to have SCD (HbSS or S-β°thalassaemia), over five years old, with extreme HbF levels. Excluded were individuals confirmed to be AS or AA following Hb electrophoresis and HPLC, those with HbF measured at an age of less than 5 years, with inconclusive SCD laboratory diagnosis where a repeat test for confirmation could not be performed, and individuals who were on hydroxyurea therapy.
Phenotyping
Individuals were selected using previously collected HbF data. In this population, the median HbF was 4.6 [Interquartile range (IQR): 2.5-7.7)] [17] and therefore 0-2.5% was considered a low HbF level while 7.7% and above was considered a high HbF level.
Sequencing
DNA was extracted from archived buffy coat samples using the Nucleon BACC II system (GE Healthcare, Little Chalfont, UK). The sequencing panel was adopted from a research panel at King’s College London and customized using Illumina DesignStudio (https://designstudio.illumina.com/). Targeted sequencing covered exons and non-coding regions around validated and candidate fetal hemoglobin-influencing loci, including B-cell lymphoma/leukemia 11A (BCL11A), proto-oncogene, transcription factor (MYB), homeobox A9 (HOXA9), hemoglobin subunit beta (HBB), hemoglobin subunit gamma 1 (HBG1), hemoglobin subunit gamma 2 (HBG2), chromodomain helicase DNA binding protein 4 (CHD4), Kruppel like factor 1 (KLF1), methyl-CpG binding domain protein 3 (MBD3), zinc finger and BTB domain containing 7A (ZBTB7A), peptidoglycan recognition protein 1 (PGLYRP1) on chromosomes 2, 6, 7, 11, 12 and 19, respectively (Table 2). Selection of target regions was based on previous associated known and novel loci in the studied population and those reported recently in other populations. Sequencing was performed on the Illumina MiSeq platform at the Kilimanjaro Clinical Research Institute (KCRI), Tanzania, following TruSeq Custom Amplicon Low Input Kit protocol.
Reads Mapping, Alignment, Variant Calling and Variant Calling Quality Control
Figure 1 illustrates and summarizes the pipeline used from alignment to prioritization of mutation. We reconstructed the reads by realigning them to the complete reference genome build hg38 using BWA [25]. The Picard tool kit [26] was used to sort and mark reads duplication, after alignment. We used an ensemble approach implemented in VariantMetaCaller [27] that may find a call consensus in detecting SNPs and short indels [28]. The best practice specific to each caller were adopted [29]. We combined information generated from two independent variant caller pipelines: (1) An incremental joint variant discovery implemented in GATK 3.0 HaplotypeCaller [26], which calls samples independently to produce gVCF files and leverages the information from the independent gVCF file to produce a final call-set at the genotyping step; (2) bcftools via mpileup [30, 31] variant callers (Figure 1). The final call-set from each subject group was produced from VariantMetaCaller [27].
Annotation, In Silico Prediction of Mutation and Prioritization
High confidence variants were called using VariantMetaCaller [27] from the dataset including 14 Tanzanian SCD patients (nine with high HbF and five with low HbF levels). We used ANNOVAR [32] to perform gene-based annotation to detect whether SNPs cause protein coding changes and to produce a list of the amino acids that are affected. ANNOVAR contains up to 21 different functional scores including SIFT [33, 34], LRT [35], MutationTaster, MutationAssessor [36], FATHMM [37], fathmm-MKL [38], RadialSVM, LR, PROVEAN, MetaSVM, MetaLR, DANN, M-CAP, Eigen, GenoCanyon [39], CADD [40], GERP++ [41], Polyphen2 HVAR, Polyphen2 HDIV[42] and PhyloP, SiPhy [43].
From the resulting functional annotated dataset, we first filtered variants for rarity, exonic variants, non-synonymous, stop codons, predicted functional significance and deleteriousness [33, 34]. First, the resulting functional annotated data set was independently filtered for predicted functional status (of which each predicted functional status is "deleterious" (D), "probably damaging" (D), "disease_causing_automatic" (A) or "disease_causing" (D) [44, 45, 46] from these 21 in silico prediction mutation tools. Recent evaluation of in silico prediction tools for mutation effects suggested these tools are quite similar [47]. However, the evaluation of these tools was conducted mostly in non-African populations. Here we opted for an extreme casting vote approach to retain only a variant if it had at least 17 predicted functional status “D” or “A” out of 21, as one can expect a true in silico mutant variant to similarly be reported from most of these tools. Second, the retained variants were further filtered for rarity, exonic variants, nonsynonymous mutations, yielding a final candidate list of predicted mutant and genetic modifier variants.
Network and Enrichment Analysis
To find out how predicted in silico mutant and modifier genes interact with others at the systems level, we analyzed how the set of all interactive genes from knowledge-based Protein-Protein Interaction (PPI) interacted with our identified in silico mutant genes and the rest of targeted genes, respectively. This has enabled the identification of potential biological pathways in which these genes participate. To achieve this, we first mapped the identified mutant SNPs to their closest genes. We mapped genes to a comprehensive human Protein-Protein Interaction (PPI) network [48, 49] to identify sub-networks containing mutant and genetics modifier variant genes and their interactions. Using the Enrichr software [50], we examined how closely these genes within the extracted sub-networks are associated with human phenotypes and elucidate biological processes and pathways in which these genes participate, molecular functions and association with potential human phenotypes. The most significant pathway enriched for genes in the networks were selected from KEGG [51], Panther [52], Biocarta [53] and Reactome [54]. Gene ontologies, including molecular functions and biological processes, from the Gene Ontology database [55].