Rare Variants Analyses Suggest Novel Cleft Genes in the African Population

Non-syndromic orofacial clefts (NSOFCs) are common birth defects with a complex etiology. While over 60 common risk loci have been identified, they explain only a small proportion of the heritability for NSOFC. Rare variants have been implicated in the missing heritability. Thus, our study aimed to identify genes enriched with nonsynonymous rare coding variants associated with NSOFCs. Our sample included 814 non-syndromic cleft lip with or without palate (NSCL/P), 205 non-syndromic cleft palate only (NSCPO), and 2150 unrelated control children from Nigeria, Ghana, and Ethiopia. We conducted a gene-based analysis separately for each phenotype using three rare-variants collapsing models: (1) protein-altering (PA), (2) missense variants only (MO); and (3) loss of function variants only (LOFO). Subsequently, we utilized relevant transcriptomics data to evaluate associated gene expression and examined their mutation constraint using the gnomeAD database. In total, 13 genes showed suggestive associations (p = E-04). Among them, eight genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) were consistently expressed in relevant mouse and human craniofacial tissues during the formation of the face, and three genes (ABCB1, TTC28, and PDZD8) showed statistically significant mutation constraint. These findings underscore the role of rare variants in identifying candidate genes for NSOFCs.


INTRODUCTION
Nonsyndromic orofacial clefts (NSOFCs) constitute the largest proportion of orofacial clefts (OFCs) and are estimated to affect 1.25 in every 1000 live births worldwide 1 .Due to distinct embryological origins and epidemiological patterns, NSOFCs can be broadly classi ed into nonsyndromic cleft lip with or without palate (NSCL/P) and nonsyndromic cleft palate only (NCSPO) 2 .The primary treatment for NSOFCs is surgical to correct structural defects.However, restoring optimal function in affected children requires a multidisciplinary team of orthodontists, maxillofacial surgeons, prosthodontists, otolaryngologists, geneticists, and pediatricians, among others 3 .In the United States, the annual cost of hospital stays for children with OFCs was over 400 million in 2013 4 .Moreover, the team of experts required for OFC care is often unavailable in resource-limited settings, leading to signi cant inequalities in cleft management 5 Developing preventive or improved therapeutic strategies for NSOFCs requires a thorough understanding of their etiology.As with many complex traits, the etiology of NSOFCs is multifactorial, with genetic factors playing a considerable role 6 .According to a recent review, over 60 (> 40 associated with NSCL/P) risk loci have been implicated mainly through common variant association studies 7 .However, all identi ed loci/genes are estimated to explain only a small fraction (~ 25% for NSCL/P and even less for NSCPO) of the estimated heritability 8 .Low frequency/rare variants, gene-gene interactions, and geneenvironmental interaction effects may likely explain the missing heritability 9 .
The role of rare variants in complex traits is well documented 10 , and a recent study found rare variants to be responsible for a larger proportion of the missing heritability in complex traits etiology 11 .However, studies evaluating the role of rare variants for NSOFCs have been restricted largely to the resequencing of known cleft candidate genes, limiting the discovery of novel genes.3][14] .A more e cient approach to rare variant analysis is to select a xed MAF threshold and conduct an aggregate test on all variants with a MAF below this threshold within a speci ed region (e.g.gene), either by assigning equal weight to all variants identi ed or by varying weights based on the estimated variance of each variant under the null hypothesis of no association 13 .This approach has been applied successfully in discovering genes associated with complex traits like blood pressure, myocardial infarction, and schizophrenia [15][16][17] .
Attempts at leveraging rare variant aggregation to identify novel genes for NSOFCs are relatively new and have been limited to samples of individuals of European ancestry 8,18,19 .Additionally, these studies included all the identi ed rare variants within a gene, an approach that has been shown to be less sensitive due to the inclusion of synonymous variants, which often do not contribute to disease etiology 20 .To address these limitations in previous studies, we conducted gene-based rare variant aggregation tests, including only the protein-altering rare variants using our African genome-wide association study (GWAS) data.We hypothesized that genes enriched for rare protein-altering variants associated with NSOFCs would contribute to the etiology of NSOFCs.Further, we utilized transcriptomics data to provide additional evidence for the associated genes.The African population, the ancestral origin of modern humans, harbors the greatest number of genetic variations and, thus, provides a considerable opportunity for genetic discoveries 21 .

DISCUSSION
We conducted gene-based rare variant aggregation tests to identify novel candidate genes that could explain the missing heritability for NSOFCs.In total, we identi ed 13 genes with suggestive associations primarily driven by rare missense variations.Seven genes (ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) showed consistent expression in relevant mouse and human craniofacial tissues during the formation of the face and one gene (ABCB1) without a mouse ortholog showed expression in human craniofacial tissues.Further, three genes (ABCB1, TTC28, and PDZD8) were predicted to be intolerant to missense variations.
Using biological plausibility to prioritize loci/genes with suggestive association in GWAS has been previously shown as a valid approach to bypassing the large sample size requirement needed to achieve signi cant association 22 .While gene mutation constraint is a good metric for identifying pathogenic genes, it is more commonly seen with a dominant disease-causing gene and may not be informative in other disease models (e.g., recessive).Thus, we prioritized the 8 associated genes (ABCB1, ALKBH8, CENPF, CSAD, EXPH5, PDZD8, SLC16A9, and TTC28) with consistent expression in relevant craniofacial tissues during human or mouse face development.Majority of these genes (TTC28, CSAD, EXPH5, SLC16A9, ALKBH8, and CENPF) showed biased expression towards mesenchymal cells subtypes from human craniofacial tissues.This could point to the importance of mesenchymal cells in palate formation since these genes were associated with either NSCLP or NSCPO and not NSCLO.Moreover, this could also be due to the stage (CS17) of embryonic development at which the facial prominences were harvested.
The CS17 stage (~ 7 weeks post fertilization), a period that coincides with the later stages of lip formation and the beginning of palate formation.
Five genes (ABCB1, TTC28, PDZD8, CENPF, and ALKBH8) have been previously implicated in NSOFCs or diseases presenting with cleft phenotypes.The ABCB1 and TTC28 were associated with NSCL/P while the PDZD8, CENPF, ALKBH8 genes are associated with NSCPO.These genes except the ABCB1 without a mouse ortholog showed consistently high expression and enrichment in mouse craniofacial tissues especially the palate.The ABCB1 gene is an ATB binding cassette gene that functions to regulate fetal exposure to xenobiotics through the placenta 23 .Single nucleotide variations in the ABCB1 gene have been reported to increase the risk of NSCL/P 23 .The TTC28 gene is in the 22q12.2region and previous case report on patients with microdeletion of this region implicate this gene as potential candidate for pierre robin sequence-a condition that presents with cleft palate 24 .Furthermore, copy number variations overlapping this gene have been reported in cleft palate patients 25 .The PDZD8 gene assists with lipid transfer from the endoplasmic reticulum to the endosomes and lysosomes 26,27 .Burden of variations though not statistically signi cant have been previously reported in this gene among patients with cleft lip with or without palate 28 .CENPF is a kinetochore associated protein that colocalizes with the IFT88 (a ciliopathy gene) and compound heterozygous mutations in the CEPNF gene were reported in a human fetus with ciliopathic malformations, including cleft palate 29 .Mutations in ALKBH8 cause intellectual developmental disorder, autosomal recessive 71, MRT71 (OMIM #618504) in humans.This condition presents with craniofacial dysmorphic features, which include long lips with V-shaped upper lip, macrostomia, and retruded mandible 30 .Macrostomia and retruded mandible may cause cleft palate by impeding the elevation of palatal shelves during palate formation 31 .
We identi ed three potential novel genes (CSAD, EXPH5, SLC16A9) for NSOFCs.The burden of variants in CSAD and EXPH5 were associated with NSCLP while those in the SLC16A9 gene were associated with NSCPO.Although these genes lack previous reports of direct association with NSOFCs phenotypes, they have been implicated in processes crucial for craniofacial development.The EXPH5 gene, for instance, has been shown to play a role in cell-cell adhesion 32 ; a critical process in craniofacial morphogenesis.
The SLC16A9 gene is linked to lipid metabolic traits 33 ; a functionally relevant downstream target of the non-canonical transforming growth factor beta (TGFbeta) signaling -a signaling mechanism involved in face formation.The CSAD gene functions in the biosynthesis of taurine and the role of taurine in organogenesis has been demonstrated in mice 34 .
In the current study, we replicated the ABCB1 gene association which was previously reported for NSCL/P in a common variants' association study.While some reports have demonstrated the accumulation of rare variants in genes previously identi ed through common variants association analyses 35 , suggesting a convergence of common and rare variants in loci/genes associated with NSOFCs phenotypes, other reports suggest that common and rare variants may act through separate loci/genes 28 .For instance, targeted sequencing of regions surrounding genome-wide signi cant loci for NSOFCs showed no evidence of rare variants burden in genes/regulatory regions proximal to these loci 28 .Additionally, rare variants are population-dependent, which could explain why we did not replicate previously reported genes/loci from rare variant association studies in other populations.Furthermore, our tests of association require that we exclude any gene with only one rare variant, even if the same gene harbored common variants.This might have resulted in the omission of genes with contributions from both rare and common variants if participants in our cohort only harbor one rare variant in the gene.Therefore, future studies should consider a model that allows for the incorporation of both common and rare variants.
Our study has some limitations.First, we used array-based genotype data for discovery.This approach means some rare variants were not examined.Second, we controlled for population strati cation by adding the top 10 genotype PCs as covariates in our gene-based association regression models.Adjusting for top PCs has been shown to prevent p-value in ation and reduce false positive rates in common variant analysis, but its performance for rare variant analysis remains controversial 36 .The reason is that rare variants, being newer mutations, re ect a more granular population substructure compared to common variants 37 .Finally, we restricted our analysis to only the coding (missense and Lof) and splice-altering variants.Rare variants including insertions/deletions in non-coding regulatory regions also contribute to the etiology of NSOFCs; however, de ning these regions' analytical units and their functional characterization remains a challenge.Hence, future studies should use WGS data for the gene-based analysis to capture more rare variants and leverage annotated craniofacial enhancers regions to analyze rare variants in non-coding regions.
In summary, we identi ed 13 genes with suggestive associations in our GWAS data.Among the 13 genes, 3 genes (ABCB1, TTC28 and PDZD8) were predicted to be intolerant to missense variations.Human and mouse transcriptomics data further supported the association of 8 genes.Of the 8 genes, ve genes (ABCB1, TTC28, PDZD8, CENPF, ALKBH8) were previously associated with NSOFCs or diseases presenting with cleft phenotypes.The remaining three genes (CSAD, EXPH5, SLC16A9) are potentially novel candidate genes for NSOFCs.

GWAS Study Participants
The details of the GWAS participants have been previously published 38 .Brie y, NSOFC case children were recruited during surgical repair at cleft clinics and free surgical missions sponsored by Smile Train in Nigeria, Ghana, and Ethiopia.The cleft surgeons at each participating site used a standardized phenotyping protocol (physical examination and clinical photographs) to con rm NSOFC status.Additionally, echocardiography was used to rule out the presence of congenital heart defects to ensure nonsyndromic status.Control children were those without a birth defect diagnosis attending immunization/welfare clinics at the same center where the case children were recruited.To be eligible to participate in the study, case and control children must have biological parents of African ancestry who reside in Africa.Our sample included 1019 NSOFC case children and 2150 unrelated control children.Among the cases, 810 had non syndromic cleft lip with or without palate (NSCL/P) − 394 non-syndromic cleft lip only (NSCLO), 420 non-syndromic cleft lip and palate (NSCLP), and 205 had non-syndromic cleft palate only (NSCPO).The distribution of the GWAS study participants by cleft status and country of origin is shown in Supplementary Table 1.
Data Collection, DNA Extraction, and Genotyping Demographic information (age, sex, and residential location) and limited exposure information were obtained.Saliva specimens were collected using the Oragene saliva kit, de-identi ed and shipped to the Butali laboratory at the University of Iowa.DNA was extracted from the saliva using the standard Oragene saliva DNA extraction protocol and quanti ed using Qubit (http://www.invitrogen.com/site/us/en/home/brands/Product-Brand/Qubit.html;ThermoFisher Scienti c, Grand Island, NY).As part of internal QC, Taqman XY genotyping was used for sex con rmation.Subsequently, aliquoted DNA was sent to the Center for Inherited Disease Research (Baltimore, Maryland, USA) for genotyping.The Illumina Multi-Ethnic Genotyping Array MEGA v2 15070954 A2 (genome build 37), which has over 2 million variants including over 60 000 rare variants selected from populations of African origin, was used for the genotyping.Details on genotyping and QC measures have been previously published 38 .
Data Analysis

Gene-Based Analyses
Variant predication was performed using annotate variation (ANNOVAR) to identify the functional consequences (synonymous, nonsynonymous, and splice-altering) of each variant.Splice altering and non-synonymous variants (variations resulting in either amino acid change or premature termination of the protein) were ltered against the 1000 genome (1KG) population databases (http://www.1000genomes.org/)to identify rare variants (found in < = 1% of Africans included in the database).Additionally, we ltered for only the variants with a MAF < = 1% in the controls included in our sample (Supplementary table 2).Subsequently, genes with two or more rare non-synonymous variants in the data were ltered to satisfy the aggregation requirement for the proposed analyses (Fig. 1).For the analysis, both non-synonymous and splice-altering variants were included and referred to as "proteinaltering".Three gene-based collapsing models were used: (  39 .The CMC test, being a burden test, assumes the same direction of effects for all the variants within a gene, whereas SKAT is a variance component test that allows for an opposing direction of effects 39 .The SKAT-O test is an omnibus test that is more robust and e cient across different scenarios 39 .In all three tests, population strati cation and sex were controlled for by adding the top 18 principal components (PC) and child sex into the regression-based models.The lack of prior knowledge about the underlying biology of these variants precluded the ability to select one optimal test.Hence, the decision to conduct the three tests where the CMC and SKAT will show rigor and SKAT-O will con rm reproducibility.The Bonferroni correction was used to adjust for multiple testing and set the cutoff for statistical signi cance at a 5% error rate to 0.05 divided by the number of genes tested and a 10 2 higher threshold for suggestive signi cance.A gene was considered signi cant if it showed a statistically signi cant or suggestive signi cant association in either CMC and SKATO or SKAT and SKATO.Rare variant aggregation tests were conducted on our GWAS data using the SKAT R package (https://cran.rproject.org/web/packages/SKAT/index.html) or rare variant association tests implemented under the case-control study design.Further, we used the genomeAD gene mutation constraint prediction tool (https://gnomad.broadinstitute.org/help/constraint) to identify associated genes level of intolerance to mutational changes.

Gene Expression Analyses
To provide biological insight, expression of the associated genes during organogenesis of the human and mouse faces was examined.The human craniofacial gene expression dataset was generated from single-nuclei RNA-seq of craniofacial prominences of human CS17 embryos.The CS17 stage corresponds to ~ 7 weeks post-fertilization which coincides with the later stage of lip formation and the early stage of palate formation.Additional details on the RNA seq data quality control and analysis was reported by Yankee et al. 2023 27 .
SysFACE analysis was performed as previously described 40 using GSE7759, GSE22989, GSE31004, and GSE11400 microarray data (Affymetrix Mouse Genome 430 2.0 Array) and GSE55965 microarray data (Affymetrix Mouse Gene 1.0 ST Array).The datasets were analyzed using affy package in R. Multiple probesets representing individual genes were normalized and the probeset with highest median expression was considered representative of gene expression.WB data generated on Affymetrix Mouse Genome 430 2.0 Array platform as previously described was used for enriched expression analysis.

Figures
Figure 1 The process of sorting for variants within a gene of interest.The left side of the image shows all variants identi ed within the gene after sequencing.The variants of interest on the right were predicted to result in a change in amino acid or affect splicing by in-silico predictive tools and were included in our analysis.
Color codes; Red-Non-synonymous variants in protein-coding regions; Yellow-Synonymous variants (not selected for analysis); Blue-Splice-site variants.nasal eminence/prominence, and mandibular and maxillary arch, (C) frontonasal and mandible, (D) maxilla and palate.The intensity of the color in the heat-map (row-wise) is representative of the extent of candidate gene expression based on the average uorescence signal intensity in the speci c tissue.Note that for palate, there were independent datasets for E13.5 and E14.5 and these are denoted as E13.5a, E13.5b, etc. FaceBase datasets generated on Affymetrix Mouse Gene 1.0 ST Array microarray were meta-

Table 1
Gene-based results for the protein-altering and missense only models.
• Positive Z-scores indicate more constraint (fewer observed variants than expected), and negative scores indicate less constraint (more observed variants than expected) as observed in the genomeAD dataset.*Statistically signi cant missense constraint Z score.• Positive Z-scores indicate more constraint (fewer observed variants than expected), and negative scores indicate less constraint (more observed variants than expected) as observed in the genomeAD dataset.*Statistically signi cant missense constraint Z score.
1) protein-altering (PA), (2) missense only (MO), and (3) loss of function only (LOFO).The different phenotypes (NSCL/P and NSCPO) were analyzed independently and separately.NSCL/P was further subdivided into nonsyndromic cleft lip only (NSCLO) and nonsyndromic cleft lip and palate (NSCLP).Gene-based rare variant aggregate tests were used to identify genes enriched in rare variants associated with NSOFCs.Three rare variant aggregation tests were conducted: the combined multivariate and collapsing (CMC) test, the sequence kernel association test (SKAT), and the SKAT-O test.The rst two tests are complementary, operating under different assumptions