We achieved a mean read depth of 80 in the targeted regions for an average depth of coverage 110x. We generated an average of 840667 total variants comprising 775262 SNPs and 65405 indels per exome in probands compared to 736820 SNPs and 63418 indels in unaffected samples from a total of 777216 variants (Figure 2; see methods). Missense variants constituted the largest variation followed by loss-of-function variants in cases when compared to unaffected samples, with an overall 0.35% missense, 0.02% frameshift, 0.004% stop gain and 0.0009% stoplost. The downstream analysis leading to variant calling was performed carefully to yield the list of the final number of variants. This was also checked with gene density and high linkage disequilibrium (LD) regions, even as MAF <=0.01 was sought for with SNPsnap having no candidate matches, thereby confirming that these variants are extremely rare. Given the rare phenotype, the frequency of the prioritized variants was checked for in agreement with that of 1000 Genomes, GnomAD and ExAC databases. From the reported familial history, relatedness tests were not felt necessary, as the pedigree confirmed correct parenthood for all affected/unaffected samples (Supplementary Table 1). From the final list of segregated variants (Table 1), we identified three mutations in theMUC5B, FRG1, andTAF1B genes, which we deem to be extremely rare variants (Table 2a). Furthermore, we also found an AK9 copy number variant (CNV) that was run through SNPsnap36to assess whether rare allelic variation was enriched for particular biological annotations (Supplementary Table 6). Finally, a set of candidate variants inferred from all samples was checked for validation using Sequenom array/plex (Table 3).
We found two independent lines of evidence from various tools and reached consensus in detecting variants from each sample with all filtering steps. For example, the pathogenic mutations contributing to each relevant proband were screened first wherein three variants, viz. FRG1 (4-189957414), TAF1B (2-9904885) and MUC5B (11-1238987) were confirmed through ClinVar and GnomAD. While the FRG1 mutation happens to be a CNV-associated missense seen in Z15 and Z99 samples, it is further augmented by the fact that low expression of FRG1 is associated with tumor progression in the colon55. The TAF1B mutation is highly associated with colorectal tissues, as we found these stopgain/missense mutations to be highly prolific in some of the aggressive CPC probands, viz. Z15, Z19, Z34, Z42, Z46, Z54, Z62, Z66, Z74. TAF1B is known to be the second largest subunit of the TATA box-binding protein (TBP)-containing promoter selectivity factor TIF-1B/SL1 and is connected with ribosomal transcription51. We found that the number of detected variants was associated with colon, rectal and genitourinary tissues and the presence of vivid disease occurrence. The density of these pouch-related tissues with missense/stop gain mutations can be attributed to aggressive CPC type IV.
Dissecting the genetic architecture of a rare disease is certainly an arduous task. Our CPC exome analysis was intended to fill this gap by detecting SNVs and CNVs affecting the focal genes/loci. Assessing these variants in CPC has provided ample evidence of strata coherent to CPC traits. In our study, at least three other genes from trio-exome analysis were reported in colon-related ailments, viz. DLC1, HAVCR1 and GBA3, but their allele frequencies are not bona fide and comparable. While we aimed to characterize the variants by employing different approaches, a polygenic model was assumed. This could be compounded with two assumptions: (a) capturing exomes and identifying deleterious mutations from a high depth of coverage exomes and (b) identifying large cohorts of mutations that fall in a low depth of coverage exomes. Although we observed both classes of genetic variation contributing to the etiology of the disease, inferring proband-parent trios and detecting de novo and transmitted genetic variants is quite a challenge. By considering extremely rare variants and adopting a strategy of identifying them in high-depth exomes, we validated all 16 trios. Nevertheless, we could not compare the detection yield inherent to this spectrum of patients owing to lack of CPC phenotype and trio-exome studies of similar design. Although previous studies have shown relatively similar methods, they detected medically relevant variants in the majority of the diseased phenotypes11. Our findings are in agreement with a large number of reports for rare variants, suggesting that the cumulative contribution of variants across different genes is associated with distinct phenotypes. In addition, an important challenge for researchers and clinicians currently in investigating rare disorders involves predicting pathogenicity for VUS. With several guidelines mentioned for predicting the pathogenicity of variants8,18,39, molecular investigators face a daunting task in considering a rare variant as benign or pathogenic and inferring them to be pathogenic. In explaining the germline/heritability of complex variants, based on the rare-variant hypothesis, we argue that the extremely rare variants are associated with phenotype sampling54. Next, we showed how we can influence and prioritize these extremely rare variants and further proposed an optimization procedure to check the variants called between MAF <0.01 and MAF 0.01%. To address this, we discarded many variants that have MAF<0.01 and finally expanded the current annotation and prioritization to accommodate the CPC framework.
In rare diseases, where only a minority of the population is affected and prevalent in a specific geographical location, identifying and considering VUS will require thoughtful consideration. A notable among them, viz. The sequestome (SQSTM1 or P62) gene encodes a multifunctional scaffolding protein involved in multiple cellular processes40–41 in addition to showing mitochondrial integrity, import and dynamics as a discriminating autophagy receptor42. In addition, p62/SQSTM1 is ubiquitously expressed in various cell types, such as the cytoplasm, nucleus and lysosomes43, and is known to be overexpressed in various human genitourinary diseases, including colon cancer44, hepatocellular carcinoma45 and prostate cancer46 (Supplementary Table 2). While KCNJ12 was also among the bona fide variants, our variant classification did not compel us to be considered extremely rare (not shown). It is known to initiate transcription by RNA polymerase I and acts as a channel for regulatory signals, while KCNJ12 encodes ATP-sensitive inward rectifier potassium channel 12 and is subtly associated with repolarization of channels52–53. A list of segregated variants and those that were not extremely rare variants in the form of SQSTM1 and KCNJ12 were also considered for validation. The SQSTM1 mutation is invariably inherited from the father in the case of the Z12 index case and from the mother in the case of Z54. As the aforementioned variants were considered extremely rare variants, we also found AK9 (6-109528998.109528999 C|-) to be consistently seen across all probands. This deletion is invariably associated with nucleotide metabolism pathways and maintains homeostasis of cellular nucleotides38. Although AK9 (deletion) was phenomenally seen in all probands, we considered it a rare variant candidate for validation. Although multiple lines of evidence suggest that apart from VUS, mutations yielding somatic copy number alterations (SCNA) could not be ruled, as we found some uncommon missense mutations with MAF <0.05 in C7orf57, C9orf84, ORF5AR1, FGFR4, HLA-DRB5, NOTCH2NLA, and MUC5B genes that could be nonpathogenic/causal to CPC.
One of the interesting findings that emerged from our study is the role of the known unknowns or hypothetical genes that could be predecessors of noncoding; hence, their establishing roles in diseases such as CPC is limited47. Notably, the C10orf120 gene harbors CTCF binding sites, as these mutations remain undefined for most disease types, including cancer48. We observed that there was significant enrichment of CNVs and indels associated with intestinal/colon-related-specific genes, as they are widespread in tissues showing chromosomal instability, cooccur with neighboring chromosomal aberrations and are frequent in colon, rectum and gastrointestinal tumors but rare in other diseases. We argue that this mutational disruption associated with CTCF binding sites could be associated with pathogenesis, as it appears to be conserved in a majority of CPC probands (Supplementary Table 3). Another orphan ORF, viz. C7orf31 also harbors CTCF binding sites, which in fact showed significant enrichment for biological processes associated with regulatory, cellular and metabolic pathways (Table 4). Another C10orf120 has somatic variants subtly contributing to CPC pathogenesis and the maximum number of gains and losses observed in the form of CNVs in the CDC27, HLA-DRB5 and MST1 L genes (Table 2b). Although some of these ORFs' maternal association and pathogenicity cannot be ruled out, we construe that there are candidate genes that could be promising biomarkers as precursors of CPC, which is beyond the scope of this canonical hypothesis (Supplementary Table 4). In addition, we screened our variants from the Indian Genome Variant Database (IGVDB) and found that they are already reported in the Indian subpopulation49, and this stratification allowed us to review the patterns influencing common and rare variants. In principle, rare variants were found to have stronger patterns than common variants. Thus, there is an inherent need to study the mutations in the known unknown regions, which would possibly delve into understanding rare diseases.
To gain insights into the role of lncRNAs, we revisited our hypothesis from our previous study16 and reconfirmed whether NONHSAT002007 was inferred in WES samples with predictions from the NONCODE database. While we did not find mutations in lncRNAs from trio-exome analyses, we argue that the mutations in essential genes tend to be causal for rare diseases, paving the way for driver mutations with the mutations in noncoding genes suppressed for selective pressure. On a different note, we aimed to evaluate exome enrichment to that of the transcriptome and anticipated that any of the variants seen from our WES study could be associated with the differentially expressed genes (DEGs). From the transcriptome pair of CPC type-4 (proband and its unaffected parent), we observed several transcripts, alternative splice variants and fusion genes, but none of them could be associated with the causal genes inferred from the exome study (Supplementary Table 5). Although we found RGPD2 and RGPD4 genes known to be significantly associated with bowel/colon as among the top enriched, nevertheless, it is hypothetical to infer global gene expression from just a pair of datasets. This approach, if studied on all samples, we believe, could identify transcripts present at low levels, which in fact could be associated with the pathogenesis of CPC. As CPC cases emerge, it is difficult to classify the clinical significance of pathogenic variants without trio analysis, especially the interpretation of de novo variants. The trio exome data could provide insights into whether the variant could be inherited and whether the carrier mutation is transcended to the progeny. In conclusion, we argue that the genetic variability in CPC has had remarkable significance not only with the parents but also within each proband. With genetic diseases leading causes of death in infants, rapid clinical/trio exome sequencing could provide early diagnoses that could impact decision making in critically ill pediatric patients. Although more efficient early diagnosis could be made with intervention using chromosomal microarray (CMA), it was shown that the clinical utility of trio/exome/whole genome sequencing is more than the CMA50. The discovery of causal mutations could provide insights into developmental disorders/anorectal malformations, such as CPC, and their etiology, which closes the gaps of surgery, taking forward to precision therapy.