Primary hyperammonemia as a model
We selected a model disease according to two requirements, [A] and [B], while taking condition [C] into account.
[A] Sample number: Substantial numbers of nonsense mutations and/or variations are present.
[B] Severity of the disease: Either the natural history or the current prognosis of the disease is poor. Such diseases are candidates for site-specific treatment targeting nonsense mutations.
[C] Conditions of data: Accuracy, transparency, and sustainability of the data. These conditions are satisfied by using publicly available data.
The AF of disease-causing alleles of monogenic diseases differs among populations. Table 1 lists the epidemiological incidence rates of PAH (phenylalanine hydroxylase), ASS1, ASL, and citrin/SLC25A13 deficiencies as revealed by mass screening in Japan and Germany [15], together with the AF in the corresponding populations available in two databases.
Values in the row for PAH indicate that the AF of nonsense alleles is not always correlated with that of all disease-causing alleles. Because no single gene universally satisfies requirement [A], we selected primary hyperammonemia, which clearly satisfies requirement [B] and is associated with eight different genes involved in the urea cycle (Fig. S1). The prevalence of urea cycle disorders is generally recognized as high. Even if the prevalence of one gene deficiency is low in a specific population, other deficiencies with high prevalence can compensate for the number of samples. In addition, as the row for SLC25A13 shows, the prevalence of citrin deficiency, a form of primary hyperammonemia caused by SLC25A13 deficiency [20], is high in Japan. The AF value from 38KJPN (a panel of approximately 38,000 Japanese individuals in the jMorp database) exceeded that of PAH, satisfying requirement [A].
Overview of patient mutations and clinical symptoms
In total, 156 sites have been reported as 383 independent events in 140 references (Fig. S2 and Table S4). The inheritance of these 383 independent patient mutations, referred to as “occurrences” in this work (see the “occurrence” column in Dataset 2), was classified according to the parent(s)’ diagnosis and family history (Fig. 1A and Table S3-1). Of the 289 occurrences in the seven autosomal recessive gene deficiencies, all 25 diagnosed occurrences were inherited from the respective parents; no de novo mutation was diagnosed. In contrast, of the 94 occurrences in X-linked dominant OTC deficiency, 10 were diagnosed as de novo and 12 as inherited. Thus, the cohort collected on the basis of nonsense mutations retains the characteristics of Mendelian inheritance.
Onset, outcome, and the selection coefficient (s) of the cases in Dataset 2 were evaluated (Fig. 1B and 1C, Tables S3-2 and S3-3) for the seven gene deficiencies in which penetrance is hypothesized to be complete. The value of s (0 ≤ s ≤ 1), calculated from equation (1), indicates how efficiently the disease allele is eliminated before transmission to the next generation; it equals 1 when the allele is not transmitted to the next generation at all [21,22].
Except for female OTC deficiency, most cases were of neonatal or late onset. Cases diagnosed after the neonatal period are listed in Table S5. Onset after the neonatal period was not associated with a good prognosis. For example, death was reported in a case of p.Trp265*_amber OTC deficiency diagnosed at 4.3 y [23], and a symptomatic outcome in a case of p.Arg1262*_opal CPS1 deficiency diagnosed at 13 y [24]. These mutations are located in the middle of the OTC and CPS1 genes, respectively (Fig. 2A). Such alleles are expected to severely impair enzymatic function and lead to a poor prognosis. Thus, the natural clinical courses of patients with nonsense mutations are generally severe (Fig. 1B and 1C). However, it should be noted that early intervention followed by liver transplantation markedly increases the number of cured cases [16,25–28] (Fig. 1C; see Supplementary Results for detailed clinical evaluation). Successful treatment decreases the value of s, which in turn increases the allele frequency (AF) of disease-causing variations. The AF of nonsense variations in general populations is evaluated using s (see Discussion).
Patient mutations (Fig. 2A)
In Fig. 2A, the number of occurrences is shown for each affected gene. Mutations forming TAA, TAG, and TGA were observed at 40 (26%), 61 (39%), and 55 (35%) sites, respectively. Fifty sites (32%) were in the X-linked dominant OTC gene. The autosomal CPS1, SLC25A13, ASL, and ASS1 genes accommodated 31, 26, 19, and 13 mutation sites, corresponding to 20%, 17%, 12%, and 8.3% of the total, respectively. Five sites (3%) were reported in the NAGS gene, and six sites (4%) each in the ARG1 and SLC25A15 genes.
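The percentages above follow directly from the site counts. As a minimal check, the sketch below recomputes them in Python; the counts are those stated in the text, while the dictionary and function names are introduced here for illustration only.

```python
# Site counts as stated in the text (156 nonsense mutation sites in total).
sites_by_stop = {"TAA": 40, "TAG": 61, "TGA": 55}
sites_by_gene = {
    "OTC": 50, "CPS1": 31, "SLC25A13": 26, "ASL": 19,
    "ASS1": 13, "ARG1": 6, "SLC25A15": 6, "NAGS": 5,
}

def shares(counts):
    """Return each count as a percentage of the total of all counts."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

stop_pct = shares(sites_by_stop)
gene_pct = shares(sites_by_gene)

for name, pct in {**stop_pct, **gene_pct}.items():
    print(f"{name}: {pct:.1f}%")
```

Both breakdowns sum to the same 156 sites, which is a quick consistency check on the per-gene and per-stop-codon tallies.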
All 23 possible single-nucleotide changes that create a stop codon were reported among the eight genes. The ratio of base transitions to transversions was 1.29. CAG/Gln>TAG and CGA/Arg>TGA were the two most preponderant patterns, accounting for 31 (20%) and 27 (17%) sites, respectively. Three other changes caused by base transition were also prominent: CAA/Gln>TAA, TGG/Trp>TAG, and TGG/Trp>TGA accounted for 10 (6%), 9 (6%), and 11 (7%) sites, respectively. Among base transversions, the two most prominent patterns started from codons for glutamate: GAA>TAA and GAG>TAG accounted for 11 (7%) and 9 (6%) sites, respectively. Because evaluating 23 different values in a single table is cumbersome, we developed a graphical representation (Fig. 3A), adopting the octagon representation employed by Garen for the description of nonsense triplets [29]. The overall tendency was similar to that in preceding database searches [30,31] (Fig. 3E and 3F), but CGA>TGA, previously reported as the most frequent pattern (Fig. 3E and 3F), ranked second after CAG>TAG in hyperammonemia (Fig. 3A). This could be due to the limited number of CGA codons within the eight genes: only 30 CGA codons are present, in contrast to 132 CAG codons (Fig. S3). We therefore counted each independent mutation event as an “occurrence” (see Dataset 2).
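The figure of 23 patterns and the transition/transversion ratio can be reproduced from first principles. The following Python sketch enumerates every single-base substitution that converts a sense codon into TAA, TAG, or TGA, flags the transitions, and recomputes the ratio from the site counts given in the text. The transversion total is obtained by subtraction from the 156 sites, which assumes that the five transition counts quoted above are complete; the function and variable names are illustrative.

```python
STOPS = ("TAA", "TAG", "TGA")
BASES = "ACGT"
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def nonsense_patterns():
    """List (sense codon, stop codon, is_transition) for every single-base
    substitution that turns a sense codon into a stop codon."""
    patterns = []
    for stop in STOPS:
        for pos in range(3):
            for base in BASES:
                if base == stop[pos]:
                    continue
                codon = stop[:pos] + base + stop[pos + 1:]
                if codon in STOPS:
                    continue  # stop-to-stop changes are not nonsense mutations
                patterns.append((codon, stop, (base, stop[pos]) in TRANSITIONS))
    return patterns

patterns = nonsense_patterns()
transition_patterns = sorted(f"{c}>{s}" for c, s, is_ti in patterns if is_ti)

# Site counts for the five transition patterns, as given in the text.
transition_sites = {
    "CAG>TAG": 31, "CGA>TGA": 27, "CAA>TAA": 10, "TGG>TAG": 9, "TGG>TGA": 11,
}
TOTAL_SITES = 156
n_transitions = sum(transition_sites.values())
n_transversions = TOTAL_SITES - n_transitions  # derived by subtraction
ti_tv_ratio = n_transitions / n_transversions

print(len(patterns), transition_patterns, round(ti_tv_ratio, 2))
```

Only five of the 23 patterns are transitions, so the 88 transition sites against 68 transversion sites reflect a strong per-pattern excess of transitions rather than a larger number of transition routes.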
When counted by occurrence, CGA>TGA and CAG>TAG numbered 156 (41%) and 69 (18%), respectively (Fig. 3B). Thus, CGA>TGA was the hottest spot among the mutations causing primary hyperammonemia, being reported approximately 2.3 times as frequently as the CAG>TAG mutation.
Most of the 156 sites were sporadic. Overall, 107 (69%) and 20 (13%) sites were reported only once and twice, respectively. Meanwhile, 29 sites were recurrent, that is, reported three or more times, among which 20 were CGA>TGA (marked with an at sign, @, in Fig. S3). The occurrences at these recurrent CpG sites summed to 146 independent mutation events (38.1% of the 383 occurrences overall and 93.6% of the 156 CGA>TGA occurrences), making these sites the major source of nonsense mutations causing hyperammonemia.
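The recurrence tallies can be verified with a few lines of arithmetic. The sketch below is a minimal Python check using only numbers stated in the text; note that 156 appears twice with different meanings (the total number of sites and the number of CGA>TGA occurrences), so the two are kept as separate, illustratively named constants.

```python
TOTAL_SITES = 156           # all nonsense mutation sites
TOTAL_OCCURRENCES = 383     # all independent mutation events
CGA_TGA_OCCURRENCES = 156   # occurrences of CGA>TGA (coincidentally also 156)

# Number of sites by how often each site was reported.
sites_by_recurrence = {"once": 107, "twice": 20, "three_or_more": 29}
# Occurrences summed over the 20 recurrent CGA>TGA sites.
recurrent_cga_occurrences = 146

site_pct = {k: 100.0 * v / TOTAL_SITES for k, v in sites_by_recurrence.items()}
pct_of_all = 100.0 * recurrent_cga_occurrences / TOTAL_OCCURRENCES
pct_of_cga = 100.0 * recurrent_cga_occurrences / CGA_TGA_OCCURRENCES

print({k: round(v) for k, v in site_pct.items()})
print(f"{pct_of_all:.1f}% of all occurrences, {pct_of_cga:.1f}% of CGA>TGA")
```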
The mutation with the highest occurrence per site, however, was not at a CpG site. Among the nine non-CGA sites reported more than twice, two prominent sites were each derived from a single ethnic population. These were CAG>TAG (ASL p.Gln354*_amber) [32] and TCG>TAG (SLC25A13 p.Ser225*_amber) [17,20], each with 30 occurrences, in Saudi Arabian and Japanese populations, respectively (Fig. 2A and Dataset 1). The AF of the latter in 38KJPN was 0.000426. Thus, while CGA>TGA is a hot spot, some nonsense mutations derived from non-CGA codons can become more prevalent than CGA>TGA.
Nonsense variations in general populations (Fig. 2B)
The number of variation sites was limited to 60 within the MANE Select CDS (CDS of MANE Select transcripts, Table S6-1). Because two variations in the OTC gene, p.Glu271*_ochre and p.Glu273*_amber, did not pass the filter, no nonsense variations were present in the X-linked OTC gene. Of the 23 possible patterns, 16 were observed. CGA>TGA and CAG>TAG were again the two preponderant patterns, numbering 19 (32%) and 9 (15%) in total, respectively (Fig. 3C).
The count of alleles in the seven autosomal genes was 217 (Table S6-2). Nearly half were in the form of CGA>TGA (Fig. 3D). Nucleotide change and codon usage are two known biases that affect nucleotide replacement [33]. To confirm that CGA is indeed a hotspot for nucleotide changes, we adjusted for the effect of codon usage by dividing the obtained frequency values (Tables S4-2 and S6-2) by the codon frequency (Table S2-2). The usage of CGA codons in all eight genes was never high (1.9% of 23 patterns, Table S2-2). After this adjustment, the occupancy of CGA among all eight genes reached 57.4% and 58.1%, as shown in Tables S4-4 and S6-4, respectively, indicating that CGA is clearly the hottest spot of nonsense nucleotide change. However, the highest contributor was once again p.Ser225*_amber, because the codon usage of UCG/Ser was lower than that of CGA/Arg in the SLC25A13 gene. Because this variation has been identified solely in Japanese individuals [34,35], inspecting variants by gene and by population is important.
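The principle of the codon-usage adjustment can be illustrated with the only two patterns for which both an occurrence count and a codon count are quoted in this section (CGA>TGA: 156 occurrences, 30 codons; CAG>TAG: 69 occurrences, 132 codons). The Python sketch below divides each count by the number of available codons and renormalizes. It is an illustration under stated assumptions, not a reproduction of Tables S4-4 or S6-4: it covers two patterns rather than 23, it uses raw codon counts in place of the codon frequencies of Table S2-2, and the function name is hypothetical.

```python
def usage_adjusted_shares(counts, codon_counts):
    """Divide each pattern's count by the number of source codons,
    then renormalize so that the adjusted shares sum to 1."""
    rates = {p: counts[p] / codon_counts[p] for p in counts}
    total = sum(rates.values())
    return {p: r / total for p, r in rates.items()}

# Occurrence and codon counts quoted in the text for two patterns only.
occurrences = {"CGA>TGA": 156, "CAG>TAG": 69}
codon_counts = {"CGA>TGA": 30, "CAG>TAG": 132}

raw_total = sum(occurrences.values())
raw_shares = {p: n / raw_total for p, n in occurrences.items()}
adjusted_shares = usage_adjusted_shares(occurrences, codon_counts)

for p in occurrences:
    print(f"{p}: raw {raw_shares[p]:.2f}, adjusted {adjusted_shares[p]:.2f}")
```

Because CGA codons are far scarcer than CAG codons in these genes, the adjustment raises the relative weight of CGA>TGA, which is the direction of the effect described above.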
Notable variations coincident with patient mutations
Overall, 30 of the 60 variants were located at reported patient mutation sites. To delineate the relationship between a patient mutation and the frequency of nonsense alleles in a population, the 60 variants were dissected according to the population from which each variant was derived (Table S7). AF values in a population were then grouped according to the recurrence of a mutation (Fig. 4A). Two variants near the 3′ end of the CDS, CPS1-Gln1413_EAS and SLC25A13_Trp606_JPN, were classified into group “0” (not occurring). This may be because the penetrance of a nonsense mutation becomes incomplete near the carboxyl terminus in some genes, especially when the truncation is too small to affect the function of the polypeptide, although the details remain to be clarified. Some variants derived from an African (AFR) or American (AMR) population were outliers in group “≤2” or “0,” which could be attributed to limited opportunities for genetic diagnosis in these populations. We therefore focused on the two largest populations, 38KJPN and NFE (Fig. 4B). AFs differed significantly according to whether the mutation was recurrent, sporadic, or not occurring (P < 0.0001 between “≥3” and “0” and P = 0.0087 between “≥3” and “≤2”). The significance was further confirmed when variants were grouped by whether a mutation had occurred or not (Fig. S4A, P = 0.00060). Thus, the value of AF is a factor determining the recurrence of a mutation.
From the viewpoint of patient mutations, among the 289 occurrences in the seven autosomal genes, 119 (41%; Fig. 4C, rightmost columns) were coincident with the 21 ethnicity-matched variation sites (Fig. 2B, leftmost column in the table; population names underlined in the figure). Of these, 14 were recurrent sites (Fig. 4B, group “≥3”), at which 109 occurrences were reported (92% of 119). The calculated correlation coefficient (r) between occurrence number and AF was 0.776, indicating a positive correlation. One of the most prominent sites among these 14 notable variations is Arg179 in the SLC25A15 gene, at which eight independent occurrences from five different ethnicities were reported, including three from Japanese and two from Chinese patients. Consistent with these occurrences, 28 nonsense alleles were reported from five different populations, indicating that multiple populations can be involved at a single site.
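The proportions above, and the kind of correlation coefficient reported, can be sketched as follows. The Python code recomputes the two percentages from the stated counts and implements a Pearson product-moment r. The text does not state which correlation coefficient was used, so Pearson is an assumption here, and the paired values passed to the function are hypothetical toy data, not the values underlying the reported r = 0.776.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Proportions recomputed from counts stated in the text.
coincident_pct = 100.0 * 119 / 289   # occurrences at ethnicity-matched sites
recurrent_pct = 100.0 * 109 / 119    # of those, occurrences at recurrent sites

# HYPOTHETICAL toy data (occurrences per site, AF per site), for illustration only.
toy_occurrences = [1, 2, 3, 8, 30]
toy_af = [1e-5, 3e-5, 2e-5, 9e-5, 4e-4]
toy_r = pearson_r(toy_occurrences, toy_af)

print(f"{coincident_pct:.0f}% coincident, {recurrent_pct:.0f}% at recurrent sites")
print(f"toy r = {toy_r:.3f}")
```

With the actual per-site occurrence numbers and AF values (Table S7), the same function would return the coefficient for the 21 ethnicity-matched sites, provided Pearson's r is indeed the statistic that was used.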