The pre-1970 triplet genetic code has controls for prokaryotes and viruses but lacks control for eukaryotes.
The triplet genetic code originated after the DNA structure was established1,2. The 1962 Nobel Prize in Physiology or Medicine was awarded to Crick et al.3 for establishing the DNA structure with four bases (T, A, C, and G), in which T pairs with A and C pairs with G, naturally forming complementary pairs known as Watson-Crick (WC) pairs. Crick subsequently introduced the central dogma of biology, in which DNA is the hereditary material and protein synthesis proceeds from DNA to mRNA to protein. Codons translate the genetic information in mRNA into proteins.
In 1961, Crick described the general nature of the genetic code for proteins4. In 1963 5,6, he proposed a triplet genetic code for protein synthesis and the one gene-one protein rule, a part of the central dogma of biology. The 1968 Nobel Prize in Physiology or Medicine7 was awarded to Robert W. Holley, Har Gobind Khorana, and Marshall W. Nirenberg for verifying the triplet code: of the 64 triplet codons, 61 encode the twenty amino acids (one of them, AUG, also serves as the START signal) and 3 are STOP signals. Khorana used chemical synthesis8, while Nirenberg used an enzymatic binding process9,10. Holley et al.11 established the structure of tRNA with its attached amino acid and anticodon. At the ribosome, tRNA anticodons form WC pairs with mRNA triplet bases, promoting peptide bond formation and yielding the protein.
Because the triplet code itself lacks a gene control mechanism, Jacob and Monod12 introduced the operon model, in which gene expression and enzyme synthesis are controlled by an operator, a promoter, a regulator gene, and a repressor. The regulator gene initiates the process. During transcription, the regulator and repressor switch the operator on or off. Two widely studied operons are lac and trp. The lac operon is off by default (negative, inducible control), as demonstrated by lactose digestion. When lactose is absent, no action occurs; when lactose is present, the lac operon directs the genes to synthesize the enzymes that digest it.
The trp operon, on the other hand, has the opposite default: it is on unless tryptophan is present. When tryptophan is present, nothing happens; when it is absent, the trp operon directs the synthesis of tryptophan. The 1965 Nobel Prize in Physiology or Medicine13 was awarded to François Jacob, André Lwoff, and Jacques Monod "for their discoveries concerning genetic control of enzyme and virus synthesis."
Shortcomings of the pre-1970 triplet genetic code
The triplet genetic code is nonoptimal and degenerate. It invokes the wobble hypothesis and lacks control for eukaryotes. Additionally, since the number of corresponding tRNAs is insufficient for twenty amino acids, iso-tRNA was proposed to decode multiple amino acids.
The triplet code could be more optimal. Crick proposed that four DNA (T, A, C, and G) bases encode 20 amino acids. According to Shannon's information coding theory, the optimal number of required bits to encode N objects is log2 N. Thus, for N=20 amino acids, the optimal number of bits necessary will be log2 20 = 4.32 bits. However, the triplet code has 64 codons, requiring 6 bits. Therefore, it is nonoptimal and degenerate.
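The arithmetic in this argument can be checked with a short sketch. The snippet is an illustration only; the function name `bits_needed` is an assumption introduced here and does not come from the original work.

```python
import math


def bits_needed(n_objects: int) -> float:
    """Minimum number of bits needed to label n_objects distinct items."""
    return math.log2(n_objects)


# Twenty canonical amino acids need about 4.32 bits.
amino_acid_bits = bits_needed(20)

# Three positions with four possible bases give 4**3 = 64 codons, i.e. 6 bits.
triplet_bits = bits_needed(4 ** 3)

print(f"20 amino acids: {amino_acid_bits:.2f} bits")
print(f"64 triplet codons: {triplet_bits:.0f} bits")
```

The gap between the two values (about 1.68 bits per codon) is one way to quantify the degeneracy discussed above.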
Both rules of the central dogma of biology have been violated. Some viruses violate the rule that information starts at DNA because they start from RNA, and the post-1970 discovery of split genes showed that one gene can yield multiple proteins.
Post-1970 DNA code for eukaryotes
Post-1970 developments in molecular and cellular biology called for new regulation and control mechanisms. Eukaryotic cells require transcription, splicing, and various regulatory and control processes, including epigenetics. Around 1977 14,15, it was shown that less than 2% of DNA bases encode proteins, while the remaining noncoding bases regulate and control protein synthesis. Genes were found not to be continuous; instead, coding portions (exons) are separated by noncoding portions (introns). The splicing process separates exons from introns. Richard J. Roberts and Phillip A. Sharp were awarded the 1993 Nobel Prize in Physiology or Medicine16 for discovering "split genes." Several distinct proteins can be synthesized from a single gene through alternative splicing17,18, thus breaking the one gene-one protein rule of the central dogma of biology.
Eukaryote Transcription
In eukaryotes, DNA transcription yields pre-mRNA, and its splicing generates mRNA for protein synthesis. Roger Kornberg elucidated the detailed transcription process using baker's yeast as a eukaryotic model. He demonstrated that eukaryotic transcription starts at the TATA box and involves several transcription factors, mediators, promoters, activators, and other controlling factors. DNA transcription (RNA polymerization) yields rRNA via Pol I, mRNA via Pol II, and tRNA via Pol III. Ribosomes are assembled using Pol I rRNA, and tRNAs are produced by Pol III. At the ribosome, Pol II mRNA codons are translated into protein. Kornberg19 was awarded the 2006 Nobel Prize in Chemistry for his "fundamental studies of the molecular basis of eukaryotic transcription."
Eukaryote Splicing
The splicing process separates exons from introns, and the exon-intron interface allows alternative splicing, which makes it possible for one gene to yield multiple proteins.
Errors in transcription and splicing cause many human diseases: errors in transcriptional regulatory elements and control cause several diseases20, and splicing errors cause others21,22. Learning how to control these errors may enable the development of drugs to cure these diseases.
The ribosome protein-making factory and the gene decoding
In the post-1970 era, understanding the structure of the ribosome became critical because proteins are synthesized there. In 1955, Palade23 first identified this organelle "as a small particulate of the cytoplasm," for which the name ribosome was later adopted. The concentrated efforts of Venkatraman Ramakrishnan, Thomas A. Steitz, and Ada Yonath revealed the detailed ribosome structure, and the 2009 Nobel Prize in Chemistry was awarded to them "for studies of the structure and function of the ribosome"24. The ribosome has two subunits, a large subunit and a small subunit, each consisting of ribosomal RNA and ribosomal proteins. Eukaryotic, bacterial, and archaeal ribosomes have similar structures but differ in size and in their ribosomal protein ratios. Protein synthesis at the ribosome has been illustrated using the ribosome structure25-27.
Later, V. Ramakrishnan described the race to decipher the secret of ribosomes in his book "Gene Machine"28. The ribosome has three characteristic sites: A, P, and E. The mRNA is held in place by the small subunit and read in the 5' to 3' direction, and an aminoacyl-tRNA (aa-tRNA) with the cognate anticodon carries an amino acid to site A. The ribosome performs decoding to ensure that the codon and anticodon match. When a match occurs, the aa-tRNA moves to site P, after which the next mRNA codon is read, and an aa-tRNA with a new anticodon carries another amino acid to site A. When the codon and anticodon match again, a peptide bond is formed between the first and second amino acids, the second aa-tRNA moves to site P, and the first tRNA is released through site E. This cycle repeats, and the nascent protein is threaded through an exit tunnel. The ribosomal decoding of the codon at the third wobble position29,30 is flexible enough that it can even accommodate a codon base at the fourth position. The ribosome's decoding, translocation, and extension activities ensure the proper synthesis of a protein.
Ribosome structure is equally critical in controlling bacterial-antibiotic interactions31. Antibiotics disrupt bacterial protein synthesis by interrupting the decoding and translocation roles of ribosomes and by blocking the exit tunnel of nascent proteins. Thus, antibiotics inhibit bacterial function rather than the host cell's protein production ability.
In the post-1970 era, alternative synthetic orthogonally expanded quadruplet 32-35, sextuplet36, and octuplet37 genetic codes were tested. These codes were developed to overcome the limitation of 20 available canonical amino acids and inadequate triplet code regulation.
The orthogonally expanded codons have yet to be successfully employed to synthesize proteins using canonical amino acids.
The QED coding model
Quadruplet expanded DNA (QED) genetic code for eukaryotes
The QED genetic code was developed by reviewing chemical reactions and limitations encountered during the triplet code verification7.
- Khorana observed8 that self-complementary AU, poly-rAU, and CG, poly-rCG do not promote polypeptide formation.
- The synthesis8 of poly r-GUA and poly r-GAU was a total success, but the triplet repeats (AUG)n and (UAG)n (where n is the number of repeats) yielded no polypeptides. UAG was referred to as a "chain terminator" (later called a STOP codon).
- The triplet codon table includes two STOP codons, UGA and UAG (corresponding DNA bases: TGA and TAG), in which G appears to be position independent and symmetric; i.e., U(GA):U(AG), with no sensitivity at the third base position.
Consequently, the QED genetic code is developed on the following assumptions:
- (1) All four DNA bases (A, T, C, and G) are involved; in mRNA, T is replaced by U.
- (2) Base positions are independent; i.e., for any two bases X and Y, XY and YX are equivalent.
- (3) Base positions are symmetric; i.e., for any two bases X and Y, (XY) and (YX) are synonymous.
- (4) Adjacent bases forming a self-complementary pair do not promote polypeptide formation. Instead, they control the process; i.e., an adjacent AT or CG with any two NN bases (N = any of A, T, C, or G) is noncoding and regulates the coding process. Following assumption (3), (AT)(NN) and (NN)(AT) are synonymous; likewise, (CG)(NN) and (NN)(CG) are synonymous. A(NN)T and C(NN)G yield additional flexibility for transitioning from noncoding to coding functions.
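The effect of assumptions (2) and (3) can be illustrated with a minimal sketch that lists every quadruplet treated as synonymous with a given one. The reading used here is an assumption: the two bases inside each pair may be swapped, and the two pairs may be swapped with each other. The function name `iso_codons` is hypothetical.

```python
def iso_codons(first: str, second: str) -> set:
    """Return all quadruplets synonymous with (first)(second).

    Assumed reading of assumptions (2) and (3): each two-base pair can be
    reversed, and the order of the two pairs can be exchanged.
    """
    variants = set()
    for a in (first, first[::-1]):        # pair as written, and reversed
        for b in (second, second[::-1]):
            variants.add(a + b)           # original pair order
            variants.add(b + a)           # pairs exchanged
    return variants


print(sorted(iso_codons("UC", "CC")))  # four synonymous forms
print(sorted(iso_codons("AA", "CC")))  # only two, because each pair is uniform
```

Under this reading a quadruplet has one, two, or four distinct written forms, depending on how many of its bases repeat.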
The four bases generate 4 x 4 x 4 x 4 = 256 possible quadruplets. Under the constraints of assumptions (2) to (4), these quadruplets fall into coding codons and noncoding codons for regulation and control.
The detailed methods for generating QED codons
Under assumptions (1) to (3), codons are arranged in a symmetric square matrix. Any N x N symmetric matrix has N x (N+1)/2 independent elements, and element M(I, J) is synonymous with M(J, I), where I and J are the row and column indices of the matrix, respectively.
The two hundred fifty-six possible combinations of 4 bases can be arranged in a 16 x16 square symmetric matrix. The same result is obtained by starting with a 4x4 square symmetric matrix and then expanding it to a higher-order square symmetric matrix. A 4 x 4 square matrix will have 4 x (4+1)/2 =10 independent elements. Arranging these ten elements in a 10 x 10 square matrix yields 10 x (10+1)/2 = 55 independent elements. Under the 4th QED assumption, these fifty-five elements result in 20 independent coding elements required for encoding proteins and thirty-five independent noncoding elements for gene regulation, including transcription and splicing.
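The counts of 10, 55, 20, and 35 can be reproduced with a short enumeration. This is a sketch under a stated assumption: the 4th assumption is read here as "an element is noncoding if its four bases include both members of a complementary pair (A with T, or C with G)", which, combined with position independence, yields the 20/35 split reported in the text. The variable and function names are illustrative.

```python
from itertools import combinations_with_replacement

BASES = "TCAG"
COMPLEMENT_PAIRS = [{"A", "T"}, {"C", "G"}]

# Step 1: 4 x (4+1)/2 = 10 position-independent two-base elements.
dinucs = ["".join(p) for p in combinations_with_replacement(BASES, 2)]

# Step 2: 10 x (10+1)/2 = 55 independent pairs of those elements.
quads = list(combinations_with_replacement(dinucs, 2))


def is_noncoding(quad) -> bool:
    """Assumed reading of assumption 4: noncoding if A and T, or C and G,
    both occur among the four bases."""
    bases = set(quad[0] + quad[1])
    return any(pair <= bases for pair in COMPLEMENT_PAIRS)


coding = [q for q in quads if not is_noncoding(q)]
noncoding = [q for q in quads if is_noncoding(q)]

print(len(dinucs), len(quads), len(coding), len(noncoding))
```

Each coding element found this way draws its bases from one member of each complementary pair only, for example T and C, or A and G.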
Table 1
(a) Four DNA (T, C, A, and G) bases arranged in a 4x4 square symmetric matrix.
|  | T | C | A | G |
|---|---|---|---|---|
| T | TT | (TC) | (TA) | (TG) |
| C |  | CC | (CA) | (CG) |
| A |  |  | AA | (AG) |
| G |  |  |  | GG |
Only the upper 10 symmetric independent elements of matrix M(I, J) are shown. The lower elements M(J, I) can be generated using M(I, J) = M(J, I), where row I = 1, 2, 3, 4 and column J = 1, 2, 3, 4. Additionally, elements M(I, J) and M(J, I) are synonymous. Thus, (TC):(CT), (TG):(GT), (CA):(AC), and (AG):(GA), as well as (TA):(AT) and (CG):(GC) (in red), are synonymous with each other. Applying the 4th QED codon assumption, the 8 (bold) elements are coding, and the last two elements, (TA) and (CG) (in red), are noncoding with regulatory functions.
Next, the 10 symmetric and independent elements of Table 1(a) are arranged in Table 1(b). The coding elements are shown in bold, while noncoding elements with regulatory functions are shown in red.
Table 1
(b). Ten symmetric and independent elements of Table 1(a) arranged in a 10x10 square symmetric matrix.
|  | TT | CC | AA | GG | (CT) | (AC) | (TG) | (AG) | (TA) | (CG) |
|---|---|---|---|---|---|---|---|---|---|---|
| TT | TTTT | (TT)(CC) | (TT)(AA) | (TT)(GG) | TT(CT) | TT(AC) | TT(TG) | TT(AG) | TT(TA) | TT(CG) |
| CC |  | CCCC | (CC)(AA) | (CC)(GG) | CC(CT) | CC(AC) | CC(TG) | CC(AG) | CC(TA) | CC(CG) |
| AA |  |  | AAAA | (AA)(GG) | AA(CT) | AA(AC) | AA(TG) | AA(AG) | AA(TA) | AA(CG) |
| GG |  |  |  | GGGG | GG(CT) | GG(AC) | GG(TG) | GG(AG) | GG(TA) | GG(CG) |
| (CT) |  |  |  |  | (CT)(CT) | (CT)(AC) | (CT)(TG) | (CT)(AG) | (CT)(TA) | (CT)(CG) |
| (AC) |  |  |  |  |  | (AC)(AC) | (AC)(TG) | (AC)(AG) | (AC)(TA) | (AC)(CG) |
| (TG) |  |  |  |  |  |  | (TG)(TG) | (GT)(AG) | (GT)(TA) | (GT)(CG) |
| (AG) |  |  |  |  |  |  |  | (AG)(AG) | (AG)(TA) | (AG)(CG) |
| (TA) | (TA)TT | (TA)CC | (TA)AA | (TA)GG | (TA)(CT) | (TA)(AC) | (TA)(GT) | (TA)(AG) | (TA)(TA) | (TA)(CG) |
| (CG) | (CG)TT | (CG)CC | (CG)AA | (CG)(GG) | (CG)(CT) | (CG)(AC) | (CG)(TG) | (CG)(AG) | (CG)(TA) | (CG)(CG) |
Only the upper half of the symmetric and independent coding (bold) and noncoding (red) elements of square matrix M(I, J) is shown. Under the 4th QED assumption, any combination of (AT)NN or (CG)NN (where N is any of A, T, C, or G), shown in red, is noncoding. The lower half of the symmetric matrix, M(J, I), can be generated using M(J, I) = M(I, J) (where I = 1, 2, 3, ..., 10 and J = 1, 2, 3, ..., 10). The iso-codons can be generated from these elements, as illustrated in rows 9 and 10 for columns 9 and 10, respectively.
The twenty bold independent protein-coding codons from Table 1(b) (replacing T with U for mRNA) and the corresponding isocodons are shown in Table 2 (a). In Table 2 (b), the thirty-five unique, independent noncoding codons (retaining DNA bases) with regulatory functions are shown in (red) font.
Table 2,
(a) Twenty protein-coding QED codons and their synonymous isocodons. For protein synthesis, T in Table 1(b) has been replaced by U for mRNA. H.B.: number of hydrogen bonds.
QUADRUPLET EXPANDED DNA (QED) Codons

| No. | Codons | Synonymous iso-codons (T>U) |  |  | H.B. |
|---|---|---|---|---|---|
| 1 | UUUU | UUUU |  |  | 8 |
| 2 | CCCC | CCCC |  |  | 12 |
| 3 | AAAA | AAAA |  |  | 8 |
| 4 | GGGG | GGGG |  |  | 12 |
| 5 | (AA)(CC) | (CC)(AA) |  |  | 10 |
| 6 | (UC)CC | (CU)CC | CC(UC) | CC(CU) | 11 |
| 7 | (UG)UU | (GU)UU | UU(UG) | UU(GU) | 9 |
| 8 | (UG)GG | (GU)GG | GG(UG) | GG(GU) | 11 |
| 9 | (CA)CC | (AC)CC | CC(CA) | CC(AC) | 11 |
| 10 | (UU)(GG) | (GG)(UU) |  |  | 10 |
| 11 | (AC)(CA) | (AC)(AC) | (CA)(CA) | (CA)(AC) | 10 |
| 12 | (GA)(GA) | (GA)(AG) | (AG)(GA) | (AG)(AG) | 10 |
| 13 | (GU)(GU) | (GU)(UG) | (UG)(UG) | (UG)(GU) | 10 |
| 14 | (GA)GG | GG(GA) | GG(AG) | (AG)GG | 11 |
| 15 | (CA)AA | (AC)AA | AA(CA) | AA(AC) | 9 |
| 16 | UU(UC) | UU(CU) | (UC)UU | (CU)UU | 9 |
| 17 | (AG)AA | AA(GA) | AA(AG) | (GA)AA | 9 |
| 18 | (AA)(GG) | (GG)(AA) |  |  | 10 |
| 19 | (CU)(CU) | (CU)(UC) | (UC)(UC) | (UC)(CU) | 10 |
| 20 | (UU)(CC) | (CC)(UU) |  |  | 10 |
Table 2,
(b) Thirty-five QED noncoding regulatory codons from Table 1 (b)
| No. | Noncoding codons | Iso-noncoding codons |  |  | H.B. |
|---|---|---|---|---|---|
| 1 | (TA)(TA) | (TA)(AT) | (AT)(TA) | (AT)(AT) | 8 |
| 2 | (CG)(CG) | (GC)(GC) | (GC)(CG) | (GC)(GC) | 12 |
| 3 | (AU)GG | GG(U) | GG(U) | (U)GG | 10 |
| 4 | (UG)(AC) | (AC)(UG) | (UG)(CA) | (AC)(GU) | 10 |
| 5 | (UG)(AG) | (GU)(AG) | (UG)(GA) | (GU)(AG) | 10 |
| 6 | (UG)AA | AA(UG) | (GU)AA | AA(GU) | 9 |
| 7 | (UA)(GU) | (GU)(UA) | (UA)(UG) | (GU)(AU) | 9 |
| 8 | (UA)(GA) | (AG)(UA) | (UA)(AG) | (GA)(AU) | 9 |
| 9 | (UA)(GC) | (UA)(CG) | (CG)(UA) | (CG)(AU) | 10 |
| 10 | (UA)AA | AA(UA) | (AU)AA |  | 8 |
| 11 | (UA)(AC) | (AC)(UA) | (UA)(CA) | (AC)(AU) | 9 |
| 12 | (TT)(AA) | (AA)(TT) |  |  | 8 |
| 13 | (CC)(GG) | (GG)(CC) |  |  | 12 |
| 14 | TT(TA) | (TA)TT | (AT)TT | TT(AT) | 8 |
| 15 | TT(AC) | (AC)TT | (CA)TT | TT(CA) | 9 |
| 16 | TT(AG) | (GA)TT | (AG)TT | TT(GA) | 9 |
| 17 | TT(CG) | (CG)TT | TT(GC) | (GC)TT | 10 |
| 18 | CC(TA) | (TA)CC | (AT)CC | CC(AT) | 10 |
| 19 | CC(TG) | (TG)CC | (GT)CC | CC(GT) | 11 |
| 20 | CC(AG) | (AG)CC | (GA)CC | CC(GA) | 11 |
| 21 | CC(CG) | (CG)CC | (GC)CC | CC(GC) | 12 |
| 22 | AA(CT) | (CT)AA | (TC)AA | AA(TC) | 9 |
| 23 | AA(CG) | (GC)AA | (CG)AA | AA(GC) | 10 |
| 24 | GG(CT) | (CT)GG | (TC)GG | GG(TC) | 11 |
| 25 | GG(CG) | (CG)GG | (GC)GG | GG(GC) | 12 |
| 26 | GG(AC) | (AC)GG | (CA)GG | GG(CA) | 11 |
| 27 | (AC)(CG) | (CA)(CG) | (CA)(GC) | (AC)(GC) | 11 |
| 28 | (AC)(AG) | (AC)(GA) | (CA)(GA) | (CA)(AG) | 10 |
| 29 | (AG)(CG) | (GA)(CG) | (AG)(GC) | (GA)(GC) | 11 |
| 30 | (CT)(TA) | (TC)(TA) | (CT)(AT) | (TC)(AT) | 9 |
| 31 | (CT)(CG) | (TC)(CG) | (CT)(GC) | (TC)(CG) | 11 |
| 32 | (CT)(AC) | (TC)(AC) | (CT)(CA) | (CT)(AC) | 10 |
| 33 | (CT)(AG) | (TC)(AG) | (CT)(GA) | (TC)(GA) | 10 |
| 34 | (CT)(TG) | (TC)(TG) | (CT)(GT) | (TC)(GT) | 10 |
| 35 | (GT)(CG) | (TG)(CG) | (GT)(GC) | (TG)(GC) | 11 |
From Table 2(a) and 2(b), the numbers of hydrogen bonds forming in coding and noncoding codons are respectively shown in Fig. 1 (a) and (b).
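The H.B. columns in Tables 2(a) and 2(b) are consistent with counting two hydrogen bonds for each A, T, or U and three for each C or G, as in standard Watson-Crick pairing. The sketch below applies that counting rule; the rule is inferred from the tabulated values rather than stated in the text, and the names `HYDROGEN_BONDS` and `hydrogen_bonds` are illustrative.

```python
# Assumed per-base hydrogen bond counts (A-T/A-U pairs: 2, C-G pairs: 3).
HYDROGEN_BONDS = {"A": 2, "T": 2, "U": 2, "C": 3, "G": 3}


def hydrogen_bonds(codon: str) -> int:
    """Sum the per-base hydrogen bond counts, ignoring parentheses."""
    bases = [ch for ch in codon.upper() if ch in HYDROGEN_BONDS]
    return sum(HYDROGEN_BONDS[base] for base in bases)


for codon in ["UUUU", "CCCC", "(UC)CC", "(UG)UU", "(UA)(GU)"]:
    print(codon, hydrogen_bonds(codon))
```

With this rule, every quadruplet falls between 8 (all A/T/U) and 12 (all C/G) hydrogen bonds, which matches the range of the H.B. columns.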
QED encoding codon assignments
The QED codons encode proteins and regulate processes in eukaryotes and prokaryotes. The protein-coding process is similar in prokaryotic and eukaryotic cells. Therefore, the tentative QED protein-coding codon assignment can use the already verified triplet code, based on at least the first two bases and ignoring the degeneracy due to the wobble third base. To that end, the triplet codon table was rearranged in Table 3 with amino acids, degenerate codons, and corresponding tRNAs by imposing the 4th QED codon assumption.
Table 3, Amino acids, triplet mRNA codons and tRNA anticodons, and stricken out disallowed triplet codons under the 4th QED codon assumptions.
Triplet mRNA codons under QED constraint and tRNA anticodons

| Amino acid | Triplet codon/QED | Compressed form | tRNA-anticodon (38,39) |
|---|---|---|---|
| Ala/A | GCU, GCC, GCA, GCG | GCN, GCA? | UGC |
| Arg/R | CGU, CGC, CGA, CGG, AGA, AGG | AGR | CCG, ACG |
| Asn/N | AAU, AAC | AAC | GUU |
| Asp/D | GAU, GAC | GAY, GAC? | GUC |
| Cys/C | UGU, UGC | UGU | GCA |
| Gln/Q | CAA, CAG | CAA | UUG |
| Glu/E | GAA, GAG | GAR | YUC |
| Gly/G | GGU, GGC, GGA, GGG | GGD | NCC |
| His/H | CAU, CAC | CAC | GUG |
| Ile/I | AUU, AUC, AUA | AUH, AUC? | GAU |
| Leu/L | UUA, UUG, CUU, CUC, CUA, CUG | UUG, CUY | YAA |
| Lys/K | AAA, AAG | AAR | YUU |
| Met/M | AUG* | D, AUG? | CAU |
| Phe/F | UUU, UUC | UUY | RAA |
| Pro/P | CCU, CCC, CCA, CCG | CCH | KGG |
| Ser/S | UCU, UCC, UCA, UCG, AGU, AGC | UCY | GGA |
| Thr/T | ACU, ACC, ACA, ACG | ACM | NGU |
| Trp/W | UGG | UGG | CCA |
| Tyr/Y | UAU, UAC | UAY, UAC? | GUA |
| Val/V | GUU, GUC, GUA, GUG | GUK | NAC |
| START | AUG | AUG |  |
| STOP | UAA, UAG, UGA | UAR, UGA |  |
N: any of U, C, A, or G; purine: R = A or G; pyrimidine: Y = T (U) or C; ?: matching tRNA
D: not C; H: not G; K: G or U; M: A or C
QED protein-coding codons are assigned using Tables 2 (a) and 3.
In Table 3, Nirenberg showed9,10 that poly-U, poly-A, and poly-C encode the amino acids Phe, Lys, and Pro, respectively. This established a direct link among mRNAs, tRNAs, amino acids, codons, and anticodons in protein synthesis at ribosomes. Additionally, in refs. 9,10, oligonucleotides of chain lengths 3 and 4, (oU)3 and (oU)4, showed nearly the same activities. Therefore, it is reasonable to assume that if triplet UUU can encode Phe, quadruplet UUUU could also encode Phe. Following this reasoning, AAAA-Lys and CCCC-Pro have been assigned. Since GGG in Table 3 encodes Gly, GGGG-Gly has also been assigned. Thus, four QED codons have been assigned as follows:
QED: UUUU-Phe; AAAA-Lys; CCCC-Pro; and GGGG-Gly are listed in Table 4 (a).
Next, sixteen QED codons are assigned following the Table 3 triplet codon assignments. In Crick’s original proposal, codons of only two bases could encode only sixteen amino acids. Hence, he added a third base, creating codon degeneracy and allowing the third base to form a dangling bond with the first base of the tRNA anticodon. For QED codon assignments, the first two bases of the triplet codon of each amino acid in Table 3 are compared with the first two bases of the QED protein-coding codons in Table 2(a). When a match occurs, the matching QED codon is assigned to that amino acid. Following this method, the QED codons are assigned as follows:
Table 3, Arg/R-AGA, AGG: In this case, if G is added to AGA and A is added to AGG, then under QED assumptions 2 and 3, (AG)(GA) will represent both. In Table 2 (a), element #12 (AG)(GA) matches this outcome. Thus, in Table 4 (a), QED (AG)(GA)-Arg/R is assigned.
Table 3, Asn/N-AAC: Under the 4th QED coding assumption, only C can be added at the fourth position, resulting in AA(CC). Element #5 of Table 2 (a) matches this outcome. Thus, in Table 4 (a), AA(CC)-Asn/N is assigned.
Table 3, Cys/C-UGU: Under the QED coding constraint, only U can be added, resulting in UGUU. Element #7 of Table 2 (a) matches this outcome. Thus, in Table 4 (a), (UG)UU-Cys/C is assigned.
Table 3, Gln/Q-CAA: Under the QED rules, U and G are not allowed. Only A can be added, resulting in (CA)AA. Element #15 of Table 2 (a) matches this outcome, and (CA)AA-Gln/Q is assigned in Table 4 (a).
Table 3, Glu/E-GAA, GAG: Here, either A or G can be added to either codon, but adding A to GAA results in a lower, preferred bonding energy. Thus, GAAA is preferred. Isoform element #17 of Table 2 (a) matches this outcome and is assigned as (GA)AA-Glu/E in Table 4 (a).
Table 3, His/H-CAC: Under the QED rules, only C can be added at the fourth position, resulting in CACC. Element #9 of Table 2 (a), (CA)CC, matches this outcome and is assigned as (CA)CC-His/H in Table 4 (a).
Table 3, Leu/L-UUG, CUU, and CUC: Here, the third position holds one purine and two pyrimidines. Thus, a pyrimidine (U or C) is preferred. Since U requires a lower bonding energy than C, U is selected for the fourth position, leading to (CU)(CU). In Table 2 (a), element #19, (CU)(CU), matches this and is assigned as (CU)(CU)-Leu/L in Table 4 (a).
Table 3, Ser/S-UCU, UCC: As in the previous case, either U or C can be added at the fourth position. Adding U to UCU results in a lower energy, giving (UC)UU. Element #16 of Table 2 (a) matches this outcome and is assigned as (UC)UU-Ser/S in Table 4 (a).
Table 3, Thr/T-ACC, ACA: Following the previous reasoning, A is added to ACC and C to ACA, transforming these two codons into the same codon, (AC)(CA). Element #11 of Table 2 (a) matches this outcome. Therefore, (AC)(CA)-Thr/T is assigned in Table 4 (a).
Table 3, Trp/W-UGG: It is safe to simply add G at the fourth position, resulting in UGGG. Element #8 of Table 2 (a), (UG)GG, matches this outcome and is assigned as (UG)GG-Trp/W in Table 4 (a).
Table 3, Val/V-GUU, GUG: As in the two previous cases, G is added to GUU and U is added to GUG, resulting in the same codon, (GU)(UG). Element #13 of Table 2 (a) matches this and is assigned as (GU)(UG)-Val/V in Table 4 (a).
Table 4,
(a) QED encoding codon assignments
| Amino Acids | mRNA under QED | QED codons | Ref./comm. |
|---|---|---|---|
| Arg/R | AGA, AGG | (AG)AA | 38 |
| Asn/N | AAC | (AA)(CC) | 38 |
| Cys/C | UGU | (UG)UU | 38 |
| Gln/Q | CAA | (CA)AA | 38 |
| Glu/E | GAA, GAG | (GA)(GA) | 38 |
| Gly/G | GGU, GGA, GGG | GGGG | 9,10 |
| His/H | CAC | (CA)CC | 38 |
| Leu/L | UUG, CUU, CUC | (CU)(CU) | 38 |
| Lys/K | AAA, AAG | AAAA | 9,10 |
| Phe/F | UUU, UUC | UUUU | 9,10 |
| Pro/P | CCU, CCC, CCA | CCCC | 9,10 |
| Ser/S | UCU, UCC | (UC)UU | 38 |
| Thr/T | ACC, ACA | (AC)(CA) | 38 |
| Trp/W | UGG | (UG)GG | 38 |
| Val/V | GUU, GUG | (GU)(GU) | 38 |
| Ala/A | GCN? | (GG)(AA)** |  |
| Asp/D | GAY? | (GA)(GG)** |  |
| Ile/I | AUH? | UU(GG)** |  |
| Met/M | AUG? | (UC)CC** |  |
| Tyr/Y | UAY? | (UU)(CC)** |  |
| START | AUG | Noncoding | Regulatory |
| STOP | UAA, UAG, UGA | Noncoding | Regulatory |
**To be assigned; (?): to be determined
Table 4 (b) summarizes the QED protein-encoding codon assignments of Table 4 (a), together with the number of hydrogen bonds, as the new QED protein-encoding codon table.
Table 4,
(b) The QED protein encoding codon table arranged in H.B. ascending order.
| Amino acids | QED-Codons | H.B. | QED-Codons | Amino acids |
|---|---|---|---|---|
| Phe | UUUU | 8 | AAAA | Lys |
| Cys | (UG)UU | 9 | (CA)AA | Gln |
| Ser | (UC)UU | 9 | (AG)AA | Arg |
| Thr | (AC)(CA) | 10 | (GU)(GU) | Val |
| Asn | (AA)(CC) | 10 | (UU)(GG) | * Met |
| ** | (UU)(CC) | 10 | (GG)(AA) | ** |
| Glu | (GA)(GA) | 10 | (CU)(CU) | Leu |
| Trp | (UG)GG | 11 | (CA)CC | His |
| ** | (UC)CC | 11 | (GA)GG | ** |
| Pro | CCCC | 12 | GGGG | Gly |
*Met, to be verified; ** to be assigned: Ala/A, Asp/D, Ile/I, and Tyr/Y.
The amino acids and encoding QED codons in Table 4 (b) have some notable features. At each H.B. level, the anticodon of the QED codon encoding one amino acid is the QED codon encoding the other amino acid in the same row. For example, UUUU encodes Phe, and its anticodon AAAA encodes Lys. Based on this, it is possible that only ten tRNAs may be needed to synthesize proteins using canonical amino acids.
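The pairing pattern in Table 4 (b) can be checked mechanically. The sketch below complements each left-hand codon base by base and compares it with the right-hand codon after reducing both to a canonical, position-independent form. The canonical form (sort the bases within each pair, then sort the two pairs) is an assumption that follows QED assumptions (2) and (3); all names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

# Left and right QED codons of each row of Table 4 (b).
PAIRED_ROWS = [
    ("UUUU", "AAAA"), ("(UG)UU", "(CA)AA"), ("(UC)UU", "(AG)AA"),
    ("(AC)(CA)", "(GU)(GU)"), ("(AA)(CC)", "(UU)(GG)"),
    ("(UU)(CC)", "(GG)(AA)"), ("(GA)(GA)", "(CU)(CU)"),
    ("(UG)GG", "(CA)CC"), ("(UC)CC", "(GA)GG"), ("CCCC", "GGGG"),
]


def bases_of(codon: str) -> str:
    """Drop parentheses and keep only the four bases."""
    return "".join(ch for ch in codon if ch in COMPLEMENT)


def canonical(codon: str) -> tuple:
    """Position-independent form: sorted bases per pair, then sorted pairs."""
    seq = bases_of(codon)
    first = "".join(sorted(seq[:2]))
    second = "".join(sorted(seq[2:]))
    return tuple(sorted((first, second)))


def complement(codon: str) -> str:
    """Base-by-base Watson-Crick complement (A<->U, C<->G)."""
    return "".join(COMPLEMENT[base] for base in bases_of(codon))


mismatches = [
    (left, right)
    for left, right in PAIRED_ROWS
    if canonical(complement(left)) != canonical(right)
]
print(mismatches)  # an empty list means every row is a complementary pair
```

Because the QED model treats base positions as independent, the sketch does not reverse the complemented sequence, as would be done for a conventional anticodon.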
Multiple triplet codons code for the same amino acid, but one tRNA decodes many amino acids. However, AUG encodes both the control signal START and the amino acid Met. What gives rise to this dual role? Also, Met is not found as the first amino acid in every protein. If Met were the first amino acid but was later clipped, what is the mechanism?
It has been reported40 that the triplets GUG and UUG encode Met. Thus, following the prior procedure, if U is added to GUG and G to UUG, then the QED codon (UU)(GG) will cover both codons. Element #10 of Table 2(a) matches this outcome, and *(UU)(GG)-Met is tentatively assigned. Since AUG has been assigned as the noncoding START codon in QED, this dual-role dilemma will not arise.
Noncoding QED codon assignment
The thirty-five noncoding QED codons from Table 2 (b) regulate protein synthesis, transcription,
and splicing processes. Following the protein-coding assignment procedure of QED codons
described above, the verified triplet START and STOP codons are used to assign the
corresponding QED codons. Since the information in Table 2 (b) is provided in DNA bases, T
has been replaced by U in the QED START and STOP codons. The assigned noncoding QED
codons are listed in Table 5.
Table 2 (b), first element: (TA)(TA)
In eukaryotes, transcription and splicing are the critical pre-mRNA processing steps to produce
rRNA, tRNA, and mRNA for protein synthesis. Transcription always starts at the TATA box.
Element #1 of Table 2(b), (TA)(TA), is assigned to initiate the transcription process and is listed in Table 5.
Table 2(b), 2nd element: (CpG)(CpG)
The splicing process separates protein-coding exons from noncoding introns and is unique to
eukaryotes. The (CG)(CG) element and G + C-rich bases are used to locate exon–intron
interfaces, and splicing then separates them. Furthermore, alternative splicing makes
it possible for one gene to encode multiple proteins. Therefore, the (CG)(CG) element of Table
2(b) is assigned for controlling splicing processes and is listed in Table 5.
Table 2 (b), assignments of the following 3 to 11 elements.
In Table 1 (b), among the 35 noncoding codons, 10 are (AT)NN and 10 are (CG)NN (where N is A, T, C, or G), as listed in Table 2(b). The remaining fifteen noncoding codons are mixed combinations.
For the QED START and STOP codon assignments, the triplet START and STOP codons of Table 3 are used as guides. Additionally, the T bases of these nine elements in Table 2(b) have been replaced by U.
START
START–AUG triplet matches the first two bases of the third element in Table 2 (b). Thus,
QED START-(AU) GG is assigned and listed in Table 5.
STOP
In Table 3, STOP triplets include three codons: UGA, UAG, and UAA.
QED STOP: The first two bases of elements 4 to 6 of Table 2 (b) match the first two bases of the
UGA triplet. Thus, QED STOP-(UG)(AC), -(UG)(AG), and -(UG)(AA) are assigned in Table 5.
Since (UG) AA has lower bonding energy, it is assigned STOP. The other two are assigned
as Regulatory or STOP.
The first two bases of elements 7 to 9 of Table 2 (b) match the first two bases of the triplet UAG.
Thus, QED STOP–(UA)(GU), -(UA)(GA), and -(UA)(GC) are assigned in Table 5.
Following the previous procedure, (UA)(GA) is assigned STOP and the other two as Regulatory
or STOP.
Table 2 (b), 10th, and 11th elements
The two bases of the 10th and 11th elements match the first two bases of triplet UAA. Thus,
QED STOP-(UA)AA and -(UA)(AC) are assigned in Table 5. Following the previous
procedure, (UA) AA is assigned STOP and (UA)(AC) as Regulatory or STOP.
The assignment of the remaining twenty-four G+C- and T+A-rich QED regulatory
noncoding codons will require further work.
Table 5, QED regulatory noncoding codon assignments, ** Table 2 (b), numbers
| ** | Triplet Codons | Noncoding QED Codons | QED Regulatory & Control | Comments |
|---|---|---|---|---|
| 1 | Absent | (TA)(TA) | TATA Box - Transcription start |  |
| 2 | Absent | (CG)(CG) | (CG)(CG), Exon/Intron Interface |  |
| 3 | START-AUG | (AU)GG | START |  |
| 5 | STOP-UGA (OPAL) | (UG)(AG) | STOP |  |
| 8 | STOP-UAG (AMBER) | (UA)(GA) | STOP |  |
| 10 | STOP-UAA (OCHER) | (UA)AA | STOP |  |
| 4 |  | (UG)(AC) | Regulatory | * STOP |
| 6 |  | (UG)AA | Regulatory | * STOP |
| 9 |  | (UA)(GC) | Regulatory | * STOP |
| 7 |  | (UA)(GU) | Regulatory | * STOP |
| 11 |  | (UA)(AC) | Regulatory | * STOP |
| 12 |  | (TT)(AA) | Regulatory | * |
| 13 |  | (CC)(GG) | Regulatory | * |
| 14 |  | TT(TA) | Regulatory | * |
| 15 |  | TT(AC) | Regulatory | * |
| 16 |  | TT(AG) | Regulatory | * |
| 17 |  | TT(CG) | Regulatory | * |
| 18 |  | CC(TA) | Regulatory | * |
| 19 |  | CC(TG) | Regulatory | * |
| 20 |  | CC(AG) | Regulatory | * |
| 21 |  | CC(CG) | Regulatory | * |
| 22 |  | AA(CT) | Regulatory | * |
| 23 |  | AA(CG) | Regulatory | * |
| 24 |  | GG(CT) | Regulatory | * |
| 25 |  | GG(CG) | Regulatory | * |
| 26 |  | GG(AC) | Regulatory | * |
| 27 |  | (AC)(CG) | Regulatory | * |
| 28 |  | (AC)(AG) | Regulatory | * |
| 29 |  | (AG)(CG) | Regulatory | * |
| 30 |  | (CT)(TA) | Regulatory | * |
| 31 |  | (CT)(CG) | Regulatory | * |
| 32 |  | (CT)(AC) | Regulatory | * |
| 33 |  | (CT)(AG) | Regulatory | * |
| 34 |  | (GT)(CG) | Regulatory | * |
| 35 |  | (GT)(AG) | Regulatory | * |

*To be assigned
Digital representation
Bioinformatics and NGS make extensive use of digital techniques for DNA sequencing, analysis, and interpretation of the results. The four canonical DNA bases can each be represented by two bits: T: 11, A: 10, C: 01, and G: 00. Thus, each quadruplet QED coding or noncoding codon is represented by eight binary digits (one byte). For example,
TTTT: 11111111; CCCC: 01010101; AAAA: 10101010; GGGG: 00000000
Accordingly, each of the twenty protein-coding and thirty-five regulatory codons can be expressed by one byte. Thus, the digital representation will allow developing compatible applications to capitalize on bioinformatics and cybersecurity tools.
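The two-bit mapping described above can be sketched as a pair of helper functions. The mapping itself (T: 11, A: 10, C: 01, G: 00) comes from the text; the function names `encode_codon` and `decode_byte` and the handling of parentheses are assumptions made for illustration.

```python
BASE_BITS = {"T": "11", "A": "10", "C": "01", "G": "00"}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_codon(codon: str) -> str:
    """Encode a DNA quadruplet such as 'TTTT' or '(TA)(TA)' as 8 bits."""
    bases = [ch for ch in codon.upper() if ch in BASE_BITS]
    if len(bases) != 4:
        raise ValueError(f"expected four DNA bases, got {codon!r}")
    return "".join(BASE_BITS[base] for base in bases)


def decode_byte(bits: str) -> str:
    """Decode an 8-bit string back into its four DNA bases."""
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError(f"expected eight binary digits, got {bits!r}")
    return "".join(BITS_BASE[bits[i:i + 2]] for i in range(0, 8, 2))


for codon in ["TTTT", "CCCC", "AAAA", "GGGG", "(TA)(TA)"]:
    bits = encode_codon(codon)
    print(codon, bits, int(bits, 2), decode_byte(bits))
```

The integer value printed for each codon shows that every quadruplet fits in a single unsigned byte (0 to 255).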
The HIPAA rule limits access to eHealth data. However, digitally encrypted codons and security codes could help overcome this limitation. Furthermore, a digital presentation of DNA data would make it easier to develop and certify diagnostic tools for use at the point of care (POC) and would offer a path toward personalized medicine.
Incurable rare monogenic diseases, multigenic cancers and vaccines
Gene variants and errors in transcription and splicing produce dysfunctional proteins that cause disease. More than 7,000 rare monogenic diseases have no cure, only management of symptoms.
A similar situation is observed for multigenic cancers. Over the last five decades since the establishment of the NCI (1970), cancer treatments have not changed considerably. Once cancer is detected, treatment begins with surgery, followed by radiation and chemotherapy. The goal has been to extend life by five years. Once metastasis or recurrence occurs, no further treatment is available.
In rare diseases, dysfunctional protein correction is possible at the protein or DNA levels. At the protein level, this requires the replacement of incorrect amino acids with the correct ones. However, the triplet codon is degenerate, which makes selecting a unique codon among the degenerate ones a foremost hurdle. The nondegenerate protein-coding QED codons have no such limitation. At the DNA level, variant genes are first corrected with CRISPR gene editing tools. Normal proteins are generated to replace dysfunctional proteins.
No biological technique exists for selectively accessing cancerous cells, a foremost hurdle that must be overcome to find a cure for cancer. The limited applicability of the triplet code to eukaryotes might have prevented the development of such a technique. The eukaryotic QED code has the potential to enable one. Thus, the combination of the QED code, dysfunctional protein correction techniques, the Human Cell Atlas41,42, and direct cell RNA sequencing43,44 is anticipated to open the possibility of finding cures for multigenic cancers.
Vaccines and antibiotics are the best preventive tools for controlling some diseases. Antibiotics kill bacteria (prokaryotes) by disrupting their protein production ability. Viruses, on the other hand, take over the (eukaryotic) cell's protein production machinery and speed up cellular protein production, leading to cell death. One way to prevent cell death is to produce antibodies that destroy the virus's protective proteins and the virus itself. With a known virus genome, antibody synthesis is relatively straightforward. An effective vaccine was produced using the mRNA of a COVID-19 virus protective protein. The eukaryotic QED codons offer a distinct possibility for developing a targeted universal vaccine.
Protein synthesis to correct dysfunctional proteins and a step toward curing diseases
QED codons translate the genetic information carried in mRNA into proteins at the ribosome. The translation process is the same in eukaryotes, prokaryotes and viruses, but the starting and intervening steps differ, as shown in Fig. 2a-c. The different roles of the QED codons in control and translation are shown in bold.
Dysfunctional proteins causing diseases could be corrected either at the protein level or the DNA level. The steps are illustrated in Figs. 3 and 4.