Quadruplet expanded DNA (QED) genetic code for eukaryotic cells

doi:10.21203/rs.3.rs-2159747/v3


Research Article



This work is licensed under a CC BY 4.0 License

Version 3


The QED genetic code for eukaryotic cells is developed by analyzing triplet gene encoding and overcoming its lack of transcription and splicing controls. While verifying the triplet genetic code, Nobel laureate H.G. Khorana avoided synthesizing poly-rAU and poly-rCG, which do not promote polypeptide formation. The QED codon is developed using these attributes. Here, the QED codon is assumed to comprise all four DNA bases (T, C, A, and G); the code is position-independent and symmetric. Adjacent bases forming the complementary pairs (A:U) and (C:G) do not promote polypeptide formation; instead, they control the synthesis process, transcription, and splicing. Under these constraints, the 256 (4x4x4x4) quadruplets fall into two groups: 20 independent codons encoding the 20 canonical amino acids and 35 independent noncoding codons regulating the process, including transcription and splicing. Since gene variants lead to dysfunctional proteins that cause diseases, steps to correct dysfunctional proteins are described, anticipating a strategy for developing cures for rare diseases and multigenic cancers.

Molecular Genetics

Eukaryote

quadruplet

expanded

genetic coding

nondegenerate

prokaryote

viruses

Gene encoding is the most critical step in translating mRNA gene information into proteins at the ribosome to maintain the homeostatic state of cells. Genetic code development occurred in two distinct periods: pre-1970 and post-1970. The pre-1970 triplet genetic code has limited START and STOP control. The post-1970 discovery of split genes in eukaryotic cells requires control of transcription and splicing, which the triplet code lacks. Additionally, understanding the ribosome's structure became equally critical in protein synthesis. Expanded orthogonal gene codons were also developed during this period to overcome the limited number of available canonical amino acids and triplet codons. These advances are briefly discussed. Next, the newly developed quadruplet expanded DNA (QED) genetic code for eukaryotes is presented, challenging the status quo. The QED genetic codon tables for protein-coding codons and for noncoding regulatory controls are generated. A digital representation of the QED codons is also created in anticipation of applying currently available biotechnology tools. Steps are described to develop cures for rare monogenic diseases and cancers by correcting the dysfunctional proteins causing these diseases.

The pre-1970 triplet genetic code has controls for prokaryotes and viruses but lacks control for eukaryotes.

The origin of the triplet genetic code lies in the period after the DNA structure was established^1,2. The 1962 Nobel Prize in Physiology or Medicine was awarded to Crick et al.³ for establishing the DNA structure with four bases (T, A, C, and G), in which the T:A and C:G bases naturally form complementary pairs, known as Watson-Crick (WC) pairs. Consequently, Crick introduced the central dogma of biology, in which DNA is considered the hereditary material and protein synthesis proceeds from DNA to mRNA to protein. The codons translate mRNA genetic information into proteins.

In 1961, Crick described the general nature of the genetic code for proteins⁴. In 1963^5,6, he proposed a triplet genetic code for protein synthesis and the one gene-one protein rule, a part of the central dogma of biology. The 1968 Nobel Prize in Physiology or Medicine⁷ was awarded to Robert W. Holley, Har Gobind Khorana, and Marshall W. Nirenberg for verifying the triplet code: of the 64 triplet codons, 61 encode the twenty amino acids (one of which, AUG, also serves as the START signal) and 3 are STOP signals. Khorana used a synthesis process⁸, while Nirenberg used an enzymatic binding process^9,10. Holley et al.¹¹ established the tRNA structure with attached amino acids and anticodons. At the ribosome, tRNA anticodons form WC pairs with mRNA triplet bases, promoting peptide bond formation and resulting in protein.

Since the triplet encoding lacks a gene control mechanism, Jacob and Monod¹² developed the operon model to control gene and enzyme synthesis via an operator, promoter, regulator, and repressor. The regulator gene initiates the process. In the transcription process, the regulator and repressor set the operator on or off. Two widely used operons are lac and trp. The lac operon has a negative default control and is demonstrated by the digestion of lactose. When lactose is absent, no action occurs, but when it is present, the lac operon directs the gene to synthesize the enzymes that digest lactose.

On the other hand, the trp operon has a positive default control. When tryptophan is present, nothing happens, but when it is absent, the trp operon controls the synthesis of tryptophan. The 1965 Nobel Prize in Physiology or Medicine¹³ was awarded to François Jacob, André Lwoff, and Jacques Monod "for their discoveries concerning genetic control of enzyme and virus synthesis."

Shortcomings of the pre-70 triplet genetic code

The triplet genetic code is nonoptimal and degenerate. It invokes the wobble hypothesis and lacks control for eukaryotes. Additionally, since the number of corresponding tRNAs is insufficient for twenty amino acids, iso-tRNA was proposed to decode multiple amino acids.

The triplet code could be more optimal. Crick proposed that four DNA (T, A, C, and G) bases encode 20 amino acids. According to Shannon's information coding theory, the optimal number of required bits to encode N objects is log₂N. Thus, for N=20 amino acids, the optimal number of bits necessary will be log₂20 = 4.32 bits. However, the triplet code has 64 codons, requiring 6 bits. Therefore, it is nonoptimal and degenerate.
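The arithmetic behind this comparison can be checked with a short sketch. It only evaluates log₂N for the two values quoted above; the function name `min_bits` is a hypothetical helper introduced for illustration, and the calculation assumes equally likely symbols.

```python
import math


def min_bits(n_symbols: int) -> float:
    """Lower bound, in bits, for distinguishing n equally likely symbols."""
    return math.log2(n_symbols)


AMINO_ACIDS = 20          # canonical amino acids to be encoded
TRIPLET_CODONS = 4 ** 3   # 64 possible triplets from four bases

bits_needed = min_bits(AMINO_ACIDS)       # about 4.32 bits
bits_triplet = min_bits(TRIPLET_CODONS)   # exactly 6 bits

print(f"20 amino acids need about {bits_needed:.2f} bits")
print(f"64 triplet codons carry {bits_triplet:.0f} bits")
print(f"surplus per codon: about {bits_triplet - bits_needed:.2f} bits")
```

The surplus of roughly 1.68 bits per codon is what the text above refers to as degeneracy.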

Both rules of the central dogma of biology have been violated. Viruses violate the rule that synthesis starts from DNA by starting from RNA, and the post-1970 discovery of split genes showed that one gene can yield multiple proteins.

Post-1970 DNA code for eukaryotes

Post-1970 developments in molecular and cellular biology necessitated new regulation and control. Eukaryotic cells require transcription, splicing, and various regulatory and control processes, including epigenetics. Around 1977^14,15, it was shown that less than 2% of DNA bases encode proteins; the remaining noncoding bases regulate and control the protein synthesis process. Genes are not continuous: their coding portions (exons) are separated by noncoding parts (introns). The splicing process separates exons from introns. Richard J. Roberts and Phillip A. Sharp were awarded the 1993 Nobel Prize in Physiology or Medicine¹⁶ for discovering "split genes." Several unique proteins can be synthesized from a single gene using alternative splicing^17,18, thus breaking the one gene-one protein rule of the central dogma of biology.

Eukaryote Transcription

In eukaryotes, DNA transcription yields pre-mRNA, and its splicing generates mRNA for protein synthesis. Roger Kornberg elucidated the detailed transcription process using baker's yeast as a eukaryotic model. He demonstrated that the eukaryotic transcription process starts at the TATA box and involves several transcription factor-binding proteins, mediators, promoters, activators, and other controlling factors. DNA transcription (RNA polymerization) yields Pol-I rRNA, Pol-II mRNA, and Pol-III tRNA. Ribosomes are synthesized using Pol-I rRNA, and tRNAs are synthesized using Pol-III. At the ribosome, Pol-II mRNA codons are translated into protein. Kornberg¹⁹ was awarded the 2006 Nobel Prize in Chemistry for his "fundamental studies of the molecular basis of eukaryotic transcription."

Eukaryote Splicing

The splicing process separates exons from introns, and the exon-intron interface allows alternative splicing, making it possible for one gene to yield multiple proteins.

Transcription and splicing errors cause many human diseases. Errors in transcriptional regulatory elements and control cause several human diseases²⁰. Splicing errors also cause diseases^21,22. Learning how to control these errors may enable the development of drugs to cure these diseases.

The ribosome protein-making factory and the gene decoding

In the post-1970 era, understanding the structure of the ribosome became critical because proteins are synthesized there. In 1955, Palade²³ first identified this organelle "as a small particulate of the cytoplasm," for which the name ribosome was later adopted. The concentrated efforts of Venkatraman Ramakrishnan, Thomas A. Steitz, and Ada Yonath revealed the detailed ribosome structure. The 2009 Nobel Prize in Chemistry was awarded to them "for studies of the structure and function of the ribosome"²⁴. The ribosome has two subunits, a large subunit and a small subunit, consisting of ribosomal RNA and ribosomal proteins. Eukaryotes, prokaryotes, and archaea have similar ribosome structures but differ in size and in the ratio of ribosomal proteins. Protein synthesis at the ribosome was illustrated using the ribosome structure^25-27.

Later, V. Ramakrishnan described the race to decipher the secret of ribosomes in his book "Gene Machine"²⁸. The ribosomal structure has three characteristic sites: A, P, and E. The mRNA is held in place by the small subunit and read in the 5' to 3' direction, and an aminoacyl-tRNA (aa-tRNA) with the cognate anticodon carries an amino acid to site A. The ribosome performs decoding to ensure that the codon and anticodon match. When a match occurs, the aa-tRNA moves to site P, after which the next mRNA codon is read, and an aa-tRNA with a new anticodon carries another amino acid to be loaded. When the codon and anticodon match again, this aa-tRNA moves to site P. Then, a peptide bond is formed between the first and second amino acids, and the first tRNA is released. This cycle repeats, and the nascent protein is threaded through a tunnel. The ribosomal decoding of the codon at the third wobble position^29,30 is flexible enough that it can even accommodate a codon base at the fourth position. The ribosome's decoding, translocation, and extension activities ensure the proper synthesis of a protein.

Ribosome structure is equally critical in controlling bacterial-antibiotic interactions³¹. Antibiotics disrupt bacterial protein synthesis by interrupting the decoding and translocation roles of ribosomes and blocking the exit tunnel of nascent proteins. Thus, antibiotics inhibit bacterial function rather than the cell's protein production ability.

In the post-1970 era, alternative synthetic orthogonally expanded quadruplet ^32-35, sextuplet³⁶, and octuplet³⁷ genetic codes were tested. These codes were developed to overcome the limitation of 20 available canonical amino acids and inadequate triplet code regulation.

The orthogonally expanded codons have yet to be successfully employed to synthesize proteins using canonical amino acids.

The QED coding model

Quadruplet expanded DNA (QED) genetic code for eukaryotes

The QED genetic code was developed by reviewing chemical reactions and limitations encountered during the triplet code verification⁷.

Khorana observed⁸ that the self-complementary AU (poly-rAU) and CG (poly-rCG) do not promote polypeptide formation.
The synthesis⁸ of poly-rGUA and poly-rGAU was a total success, but the triplet combinations (AUG)_n and (UAG)_n (where n represents repeated sequences) yielded no polypeptides. UAG was referred to as a "chain terminator" (later called a STOP codon).
The triplet codon table includes two STOP codons, UGA and UAG (corresponding DNA bases: TGA and TAG), in which the G position seems to be position independent and symmetric; i.e., U(GA): U(AG), with no sensitivity at the third base position.

Consequently, the QED genetic code is developed on the following assumptions:

(1) All four DNA (A, T, C, and G) bases are involved; in mRNA, T is replaced by U.
(2) Base positions are independent; i.e., for any A and B, AB and BA will be equivalent.
(3) Base positions are symmetric; i.e., for any A and B, (AB) and (BA) will be synonymous.
(4) Adjacent bases forming self-complementary pairs do not promote polypeptide formation. Instead, they control the process; i.e., an adjacent AT or CG with any two NN bases (N = any of A, T, C, or G) is noncoding and regulates the coding process. Following assumption (3), (AT)(NN) and (NN)(AT) are synonymous; likewise, (CG)(NN) and (NN)(CG) are synonymous. A(NN)T and C(NN)G yield additional flexibility for transitioning from noncoding to coding functions.

The four bases generate (4x4x4x4 = 256) two hundred fifty-six possible quadruplets. Following the constraints of assumptions (2) to (4), these quadruplets fall into encoding codons and noncoding codons for regulation and control.

The detailed methods for generating QED codons

Under assumptions (1) to (3), codons are arranged in a symmetric square matrix. Any N x N symmetric matrix has N x (N+1)/2 independent elements, and element M(I, J) is synonymous with M(J, I), where I and J are the row and column of the matrix, respectively.

The two hundred fifty-six possible combinations of 4 bases can be arranged in a 16 x 16 square symmetric matrix. The same result is obtained by starting with a 4 x 4 square symmetric matrix and then expanding it to a higher-order square symmetric matrix. A 4 x 4 symmetric matrix has 4 x (4+1)/2 = 10 independent elements. Arranging these ten elements in a 10 x 10 symmetric matrix yields 10 x (10+1)/2 = 55 independent elements. Under the fourth QED assumption, these fifty-five elements divide into 20 independent coding elements, as required for encoding proteins, and thirty-five independent noncoding elements for gene regulation, including transcription and splicing.
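The counting in this procedure can be reproduced with a minimal sketch. It rests on one assumption that is not stated in this form in the text: an element is treated as noncoding whenever its four bases include both members of a complementary pair (A with T, or C with G), regardless of position. This position-independent reading of the fourth assumption is an interpretation chosen because it reproduces the 20/35 split; the names `pairs`, `elements`, and `is_noncoding` are hypothetical.

```python
from itertools import combinations_with_replacement

BASES = "TCAG"
# Complementary (Watson-Crick) base pairs used by the fourth assumption.
COMPLEMENTARY = [{"A", "T"}, {"C", "G"}]

# Step 1: upper triangle of the 4 x 4 symmetric matrix -> 10 unordered pairs.
pairs = ["".join(p) for p in combinations_with_replacement(BASES, 2)]

# Step 2: upper triangle of the 10 x 10 symmetric matrix -> 55 elements.
elements = list(combinations_with_replacement(pairs, 2))


def is_noncoding(element) -> bool:
    """Assumed reading of assumption (4): the element contains both bases
    of a complementary pair somewhere among its four bases."""
    bases = set(element[0] + element[1])
    return any(pair <= bases for pair in COMPLEMENTARY)


coding = [e for e in elements if not is_noncoding(e)]
noncoding = [e for e in elements if is_noncoding(e)]

print(len(pairs), len(elements), len(coding), len(noncoding))
```

Under this reading, the coding elements are exactly those built from a two-base alphabet with no complementary pair: {T, C}, {T, G}, {A, C}, or {A, G}.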

Table 1

(a) Four DNA (T, C, A, and G) bases arranged in a 4x4 square symmetric matrix.

	T	C	A	G
T	TT	(TC)	(TA)	(TG)
C		CC	(CA)	(CG)
A			AA	(AG)
G				GG

Only the upper 10 symmetric independent elements of matrix M(I, J) are shown. The lower elements M(J, I) can be generated using M(I, J) = M(J, I), where row I = 1, 2, 3, and 4, and column J = 1, 2, 3, and 4. Additionally, elements M(I, J) and M(J, I) are synonymous. Thus, (TC):(CT), (TG):(GT), (CA):(AC), and (AG):(GA), as well as (TA):(AT) and (CG):(GC) (in red), are synonymous with each other. Applying the fourth QED assumption, the 8 (bold) elements are part of the coding set, and the last two elements, (TA) and (CG) (in red), are part of the noncoding set with regulatory functions.

Next, the 10 symmetric and independent elements of Table 1(a) are arranged in Table 1(b). The coding elements are shown in bold, while noncoding elements with regulatory functions in (red).

Table 1

(b). Ten symmetric and independent elements of Table 1(a) arranged in a 10x10 square symmetric matrix.

	TT	CC	AA	GG	(CT)	(AC)	(TG)	(AG)	(TA)	(CG)
TT	TTTT	(TT)(CC)	(TT)(AA)	(TT)(GG)	TT(CT)	TT(AC)	TT(TG)	TT(AG)	TT(TA)	TT(CG)
CC		CCCC	(CC)(AA)	(CC)(GG)	CC(CT)	CC(AC)	CC(TG)	CC(AG)	CC(TA)	CC(CG)
AA			AAAA	(AA)(GG)	AA(CT)	AA(AC)	AA(TG)	AA(AG)	AA(TA)	AA(CG)
GG				GGGG	GG(CT)	GG(AC)	GG(TG)	GG(AG)	GG(TA)	GG(CG)
(CT)					(CT)(CT)	(CT)(AC)	(CT)(TG)	(CT)(AG)	(CT)(TA)	(CT)(CG)
(AC)						(AC)(AC)	(AC)(TG)	(AC)(AG)	(AC)(TA)	(AC)(CG)
(TG)							(TG)(TG)	(TG)(AG)	(TG)(TA)	(TG)(CG)
(AG)								(AG)(AG)	(AG)(TA)	(AG)(CG)
(TA)	(TA)TT	(TA)CC	(TA)AA	(TA)GG	(TA)(CT)	(TA)(AC)	(TA)(TG)	(TA)(AG)	(TA)(TA)	(TA)(CG)
(CG)	(CG)TT	(CG)CC	(CG)AA	(CG)GG	(CG)(CT)	(CG)(AC)	(CG)(TG)	(CG)(AG)	(CG)(TA)	(CG)(CG)

Only the upper half of the symmetric and independent coding (bold) and noncoding (red) elements of the square matrix M(I, J) is shown. Under the fourth QED assumption, any combinations of (AT)NN and (CG)NN (where N is any of A, T, C, or G), shown in red, are noncoding. The lower half of the symmetric matrix M(J, I) can be generated using M(J, I) = M(I, J) (where I = 1, 2, 3, ..., 10 and J = 1, 2, 3, ..., 10). The iso-codons can be generated using these elements, as illustrated in rows 9 and 10 for columns 9 and 10, respectively.

The twenty bold independent protein-coding codons from Table 1(b) (replacing T with U for mRNA) and the corresponding isocodons are shown in Table 2 (a). In Table 2 (b), the thirty-five unique, independent noncoding codons (retaining DNA bases) with regulatory functions are shown in (red) font.

Table 2,

(a) Twenty protein-coding QED codons and their synonymous isocodons. For protein synthesis, T in Table 1(b) has been replaced by U for mRNA. H.B.: number of hydrogen bonds.

	QUADRUPLET EXPANDED DNA (QED) Codons
	Codons	Synonymous Iso-codons, ( T>U)			H. B.
1	UUUU	UUUU			8
2	CCCC	CCCC			12
3	AAAA	AAAA			8
4	GGGG	GGGG			12
5	(AA)(CC)	(CC)(AA)			10
6	(UC)CC	(CU)CC	CC(UC)	CC(CU)	11
7	(UG)UU	(GU)UU	UU(UG)	UU(GU)	9
8	(UG)GG	(GU)GG	GG(UG)	GG(GU)	11
9	(CA)CC	(AC)CC	CC(CA)	CC(AC)	11
10	(UU)(GG)	(GG)(UU)			10
11	(AC)(CA)	(AC)(AC)	(CA)(CA)	(CA)(AC)	10
12	(GA)(GA)	(GA)(AG)	(AG)(GA)	(AG)(AG)	10
13	(GU)(GU)	(GU)(UG)	(UG)(UG)	(UG)(GU)	10
14	(GA)GG	GG(GA)	GG(AG)	(AG)GG	11
15	(CA)AA	(AC)AA	AA(CA)	AA(AC)	9
16	UU(UC)	UU(CU)	(UC)UU	(CU)UU	9
17	(AG)AA	AA(GA)	AA(AG)	(GA)AA	9
18	(AA)(GG)	(GG)(AA)			10
19	(CU)(CU)	(CU)(UC)	(UC)(UC)	(UC)(CU)	10
20	(UU)(CC)	(CC)(UU)			10

Table 2,

(b) Thirty-five QED noncoding regulatory codons from Table 1 (b)

	Noncoding codons	Iso-noncoding codons			H.B.
1	(TA)(TA)	(TA)(AT)	(AT)(TA)	(AT)(AT)	8
2	(CG)(CG)	(CG)(GC)	(GC)(CG)	(GC)(GC)	12
3	(AU)GG	GG(AU)	GG(UA)	(UA)GG	10
4	(UG)(AC)	(AC)(UG)	(UG)(CA)	(AC)(GU)	10
5	(UG)(AG)	(GU)(AG)	(UG)(GA)	(GU)(GA)	10
6	(UG)AA	AA(UG)	(GU)AA	AA(GU)	9
7	(UA)(GU)	(GU)(UA)	(UA)(UG)	(GU)(AU)	9
8	(UA)(GA)	(AG)(UA)	(UA)(AG)	(GA)(AU)	9
9	(UA)(GC)	(UA)(CG)	(CG)(UA)	(CG)(AU)	10
10	(UA)AA	AA(UA)	(AU)AA	AA(AU)	8
11	(UA)(AC)	(AC)(UA)	(UA)(CA)	(AC)(AU)	9
12	(TT)(AA)	(AA)(TT)			8
13	(CC)(GG)	(GG)(CC)			12
14	TT(TA)	(TA)TT	(AT)TT	TT(AT)	8
15	TT(AC)	(AC)TT	(CA)TT	TT(CA)	9
16	TT(AG)	(GA)TT	(AG)TT	TT(GA)	9
17	TT(CG)	(CG)TT	TT(GC)	(GC)TT	10
18	CC(TA)	(TA)CC	(AT)CC	CC(AT)	10
19	CC(TG)	(TG)CC	(GT)CC	CC(GT)	11
20	CC(AG)	(AG)CC	(GA)CC	CC(GA)	11
21	CC(CG)	(CG)CC	(GC)CC	CC(GC)	12
22	AA(CT)	(CT)AA	(TC)AA	AA(TC)	9
23	AA(CG)	(GC)AA	(CG)AA	AA(GC)	10
24	GG(CT)	(CT)GG	(TC)GG	GG(TC)	11
25	GG(CG)	(CG)GG	(GC)GG	GG(GC)	12
26	GG(AC)	(AC)GG	(CA)GG	GG(CA)	11
27	(AC)(CG)	(CA)(CG)	(CA)(GC)	(AC)(GC)	11
28	(AC)(AG)	(AC)(GA)	(CA)(GA)	(CA)(AG)	10
29	(AG)(CG)	(GA)(CG)	(AG)(GC)	(GA)(GC)	11
30	(CT)(TA)	(TC)(TA)	(CT)(AT)	(TC)(AT)	9
31	(CT)(CG)	(TC)(CG)	(CT)(GC)	(TC)(GC)	11
32	(CT)(AC)	(TC)(AC)	(CT)(CA)	(TC)(CA)	10
33	(CT)(AG)	(TC)(AG)	(CT)(GA)	(TC)(GA)	10
34	(CT)(TG)	(TC)(TG)	(CT)(GT)	(TC)(GT)	10
35	(GT)(CG)	(TG)(CG)	(GT)(GC)	(TG)(GC)	11

From Table 2(a) and 2(b), the numbers of hydrogen bonds forming in coding and noncoding codons are respectively shown in Fig. 1 (a) and (b).
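The H.B. columns of Tables 2(a) and 2(b) can be recomputed with a small sketch, assuming the standard Watson-Crick counts of two hydrogen bonds for A, T, or U and three for C or G when each base pairs with its complement. The function name `hydrogen_bonds` is hypothetical; parentheses in the codon notation are ignored.

```python
# Hydrogen bonds each base forms with its Watson-Crick partner (assumed values).
HBONDS = {"A": 2, "T": 2, "U": 2, "C": 3, "G": 3}


def hydrogen_bonds(codon: str) -> int:
    """Sum the hydrogen bonds over the four bases of a QED codon.

    Accepts the bracketed notation used in the tables, e.g. "(UC)CC".
    """
    bases = [b for b in codon.upper() if b in HBONDS]
    if len(bases) != 4:
        raise ValueError(f"expected four bases, got {len(bases)} in {codon!r}")
    return sum(HBONDS[b] for b in bases)


for codon in ["UUUU", "CCCC", "(AA)(CC)", "(UC)CC", "(TA)(TA)", "CC(CG)"]:
    print(codon, hydrogen_bonds(codon))
```

The totals range from 8 (all A/T/U) to 12 (all C/G), which matches the span of values listed in the tables.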

QED encoding codon assignments

The QED codons encode proteins and regulate processes in eukaryotes and prokaryotes. The protein-coding process is similar in prokaryotic and eukaryotic cells. Therefore, the tentative QED protein-coding codon assignment can use the already verified triplet code based on at least the first two bases, ignoring the degeneracy due to the wobble third base. Accordingly, the triplet codon table was rearranged with amino acids, degenerate codons, and corresponding tRNAs by imposing the fourth QED assumption, as shown in Table 3.

Table 3, Amino acids, triplet mRNA codons and tRNA anticodons, and struck-out triplet codons disallowed under the fourth QED assumption.

Amino acid	Triplet mRNA codons under QED constraint and tRNA anticodons
Amino acid	Triplet codon/QED	Compressed form	tRNA-anticodon (38,39)
Ala/A	~~GCU, GCC, GCA, GCG~~	~~GCN~~, GCA?	UGC
Arg/R	~~CGU, CGC, CGA, CGG,~~ AGA, AGG	AGR	CCG, ACG
Asn/N	~~AAU~~, AAC	AAC	GUU
Asp/D	~~GAU, GAC~~	~~GAY~~, GAC?	GUC
Cys/C	UGU, ~~UGC~~	UGU	GCA
Gln/Q	CAA, ~~CAG~~	CAA	UUG
Glu/E	GAA, GAG	GAR	YUC
Gly/G	GGU, ~~GGC~~, GGA, GGG	GGD	NCC
His/H	~~CAU~~, CAC	CAC	GUG
Ile/I	~~AUU, AUC, AUA~~	~~AUH~~, AUC?	GAU
Leu/L	~~UUA~~, UUG, CUU, CUC, ~~CUA, CUG~~	UUG, CUY	YAA
Lys/K	AAA, AAG	AAR	YUU
Met/M	AUG*	D, AUG?	CAU
Phe/F	UUU, UUC	UUY	RAA
Pro/P	CCU, CCC, CCA, ~~CCG~~	CCH	KGG
Ser/S	UCU, UCC, ~~UCA, UCG, AGU, AGC~~	UCY	GGA
Thr/T	~~ACU~~, ACC, ACA, ~~ACG~~	ACM	NGU
Trp/W	UGG	UGG	CCA
Tyr/Y	~~UAU, UAC~~	~~UAY~~, UAC?	GUA
Val/V	GUU, ~~GUC,~~ GUA, GUG	GUK	NAC
START	AUG	AUG
STOP	UAA, UAG, UGA	UAR, UGA

N: Any U, C, A or G; Purine: R = A or G; Pyrimidine: Y = T (U) or C;?: matching tRNA

D: not C; H: not G; K: G or U; M: A or C

QED protein-coding codons are assigned using Tables 2 (a) and 3.

Nirenberg showed^9,10 that polyU, polyA and polyC encode the amino acids Phe, Lys and Pro, respectively (Table 3). This established a direct link among mRNAs, tRNAs, amino acids, codons and anticodons in protein synthesis at ribosomes. Additionally, in^9,10, oligo chains of lengths 3 and 4, (oU)₃ and (oU)₄, showed nearly the same activities. Therefore, it is reasonable to assume that if triplet UUU can encode Phe, quadruplet UUUU could also encode Phe. Following this reasoning, AAAA-Lys and CCCC-Pro have been assigned. Since GGG in Table 3 encodes Gly, GGGG-Gly has also been assigned. Thus, four QED codons have been assigned as follows:

QED: UUUU – Phe; AAAA –Lys; CCCC- Pro; and GGGG-Gly are listed in Table 4 (a).

Next, sixteen QED codons are assigned following the Table 3 triplet codon assignments. In Crick’s original proposal, codons of only two bases could encode only sixteen amino acids. Hence, he added a third base, creating codon degeneracy and allowing the third base to form a dangling bond with the first base of the tRNA anticodon. For QED codon assignments, the first two bases of the triplet codon of each amino acid in Table 3 are compared with the first two bases of the QED protein-coding codons in Table 2(a). When a match occurs, the matching QED codon is assigned to that amino acid. Following this method, the QED codons are assigned as follows:

Table 3, Arg/R–AGA, AGG: In this case, if G is added to AGA and A is added to AGG, then under QED assumptions 2 and 3, (AG)(GA) will represent both. In Table 2 (a), element #12 (AG)(GA) matches this outcome. Thus, in Table 4(a), QED (AG)(GA)-Arg/R is assigned.

Table 3, Asn/N-AAC: Under the fourth QED assumption, only C can be added at the fourth position, resulting in (AA)(CC). Element #5 of Table 2 (a) matches this outcome. Thus, in Table 4 (a), (AA)(CC)–Asn/N is assigned.

Table 3, Cys/C-UGU: Under the QED coding constraint, only U can be added, resulting in UGUU. Element #7 of Table 2(a) matches this outcome. Thus, in Table 4 (a), (UG) UU-Cys/C is assigned.

Table 3, Gln/Q-CAA: Under the QED rules, U and G are not allowed. Only A can be added, resulting in (CA) AA. Element #15 of Table 2 (a) matches this outcome and (CA) AA-Gln/Q is assigned in Table 4 (a).

Table 3, Glu/E–GAA, GAG: Here, either A or G can be added to either codon, but adding A to GAA will result in a lower preferred bonding energy. Thus, GAAA is preferred. Isoform element #17 of Table 2 (a) matches this outcome, and (GA)AA-Glu/E is assigned in Table 4 (a).

Table 3, His/H-CAC: under the QED rules, only C can be added in the fourth position, resulting in CACC. Element # 9 of Table 2 (a), (CA) CC matches this outcome and is assigned (CA) CC–His/H in Table 4 (a).

Table 3, Leu/L-UUG, CUU, and CUC: here at the third position, there are one purine and two pyrimidines. Thus, a pyrimidine (U or C) will be preferred. Since U will require a lower bonding energy than C, U is selected for the fourth position, leading to (CU)(CU). In Table 2 (a), element # 19, (CU)(CU) matches this and is assigned (CU)(CU)-Leu/L in Table 4 (a).

Table 3, Ser/S-UCU, UCC: as in the previous case, either U or C can be added at the fourth position. Adding U to UCU will result in a lower energy, (UC)UU. Element #16 of Table 2 (a) matches this outcome and is assigned (UC)UU-Ser/S in Table 4 (a).

Table 3, Thr/T-ACC, ACA: Following the previous reasoning, A is added to ACC and C to ACA, transforming these two codons into the same codon (AC)(CA). Element #11 of Table 2 (a) matches this outcome. Therefore, (AC)(CA)-Thr/T is assigned in Table 4 (a).

Table 3, Trp/W-UGG: It is safe to just add G at the fourth position, resulting in UGGG. Element # 8 of Table 2 (a), (UG) GG matches this outcome and is assigned as (UG) GG–Trp/W in Table 4 (a).

Table 3, Val/V-GUU, GUG: As in the two previous cases, G is added to GUU, and U is added to GUG, resulting in the same codon (GU)(UG). Element #13 of Table 2 (a) matches this and is assigned as (GU)(UG)–Val/V in Table 4 (a).

Table 4,

(a) QED encoding codon assignments

Amino Acids	mRNA under QED	QED codons	Ref./comm.
Arg/R	AGA, AGG	(AG)AA	38
Asn/N	AAC	(AA)(CC)	38
Cys/C	UGU	(UG)UU	38
Gln/Q	CAA	(CA)AA	38
Glu/E	GAA, GAG	(GA)(GA)	38
Gly/G	GGU, GGA, GGG	GGGG	9,10
His/H	CAC	(CA)CC	38
Leu/L	UUG, CUU, CUC	(CU)(CU)	38
Lys/K	AAA, AAG	AAAA	9,10
Phe/F	UUU, UUC	UUUU	9,10
Pro/P	CCU, CCC, CCA	CCCC	9,10
Ser/S	UCU,UCC	(UC)UU	38
Thr/T	ACC, ACA	(AC)(CA)	38
Trp/W	UGG	(UG)GG	38
Val/V	GUU, GUG	(GU)(GU)	38
Ala/A	~~GCN~~?	(GG)(AA)**
Asp/D	~~GAY~~?	(GA)(GG)**
Ile/I	~~AUH~~?	UU(GG)**
Met/M	~~AUG~~?	(UC)CC**
Tyr/Y	~~UAY~~?	(UU)(CC)**
START	AUG	Noncoding	Regulatory
STOP	UAA, UAG, UGA	Noncoding	Regulatory

**To be assigned; (?): to be determined

Table 4 (b) summarizes the QED protein-encoding codon assignments of Table 4 (a), together with the number of hydrogen bonds, as the new QED protein-encoding codon table.

Table 4,

(b) The QED protein encoding codon table arranged in H.B. ascending order.

Amino acids	QED-Codons	H.B.	QED-Codons	Amino acids
Phe	UUUU	8	AAAA	Lys
Cys	(UG)UU	9	(CA)AA	Gln
Ser	(UC)UU	9	(AG)AA	Arg
Thr	(AC)(CA)	10	(GU)(GU)	Val
Asn	(AA)(CC)	10	(UU)(GG)	* Met
**	(UU)(CC)	10	(GG)(AA)	**
Glu	(GA)(GA)	10	(CU)(CU)	Leu
Trp	(UG)GG	11	(CA)CC	His
**	(UC)CC	11	(GA)GG	**
Pro	CCCC	12	GGGG	Gly

*Met, to be verified; ** to be assigned: Ala/A, Asp/D, Ile/I, and Tyr/Y.

Amino acids and encoding QED codons in Table 4 (b) have some exciting features. For each H.B. value, the anticodon of the QED codon encoding one amino acid is the QED codon encoding the other amino acid in the same row. For example, UUUU encodes Phe, and its anticodon AAAA encodes Lys. Based on this, a possibility exists that only ten tRNAs may be needed to synthesize proteins using canonical amino acids.
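This pairing can be checked mechanically for the codons already assigned in Table 4 (b). The sketch below is illustrative only: it takes the base-wise complement of each codon and reduces it to a canonical form under the position-independence and symmetry assumptions (sorting within each base pair and then sorting the two pairs). The helper names `canonical` and `partner`, and the choice of canonical form, are assumptions made for this example. The entry for Met is the tentative assignment marked with * in the table.

```python
# Assigned rows of Table 4 (b); the Met entry is tentative in the source.
QED_CODONS = {
    "UUUU": "Phe", "AAAA": "Lys",
    "(UG)UU": "Cys", "(CA)AA": "Gln",
    "(UC)UU": "Ser", "(AG)AA": "Arg",
    "(AC)(CA)": "Thr", "(GU)(GU)": "Val",
    "(AA)(CC)": "Asn", "(UU)(GG)": "Met",
    "(GA)(GA)": "Glu", "(CU)(CU)": "Leu",
    "(UG)GG": "Trp", "(CA)CC": "His",
    "CCCC": "Pro", "GGGG": "Gly",
}

# Base-wise Watson-Crick complement for mRNA bases.
COMPLEMENT = str.maketrans("UACG", "AUGC")


def canonical(codon: str) -> tuple:
    """Order-independent key: sort inside each base pair, then sort the pairs."""
    bases = [b for b in codon.upper() if b in "UACG"]
    if len(bases) != 4:
        raise ValueError(f"expected four bases in {codon!r}")
    first = "".join(sorted(bases[:2]))
    second = "".join(sorted(bases[2:]))
    return tuple(sorted((first, second)))


# Look-up from canonical key to amino acid.
BY_KEY = {canonical(c): aa for c, aa in QED_CODONS.items()}


def partner(codon: str):
    """Amino acid whose codon is the complement of `codon`, or None."""
    return BY_KEY.get(canonical(codon.translate(COMPLEMENT)))


for codon, amino_acid in QED_CODONS.items():
    print(f"{amino_acid:>3} {codon:<9} <-> {partner(codon)}")
```

Because the canonical form ignores base order, reading the anticodon in the reverse direction gives the same key, so the check does not depend on strand orientation.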

Multiple triplet codons code for the same amino acid, but one tRNA decodes many amino acids. However, AUG encodes both the control signal START and the amino acid Met. What makes this dual role possible? Also, Met is not found as the first amino acid in every protein. If Met were the first amino acid but was later clipped, then what is the mechanism?

It has been reported⁴⁰ that triplets GUG and UUG encode Met. Thus, following the prior procedure, if U is added to GUG and G to UUG, then the QED codon (UU)(GG) will cover both codons. Element #10 of Table 2(a) matches the outcome, and *(UU)(GG)–Met is tentatively assigned. Since AUG has been assigned as the noncoding START codon in QED, this dual-role dilemma will not arise.

Noncoding QED codon assignment

The thirty-five noncoding QED codons from Table 2 (b) regulate protein synthesis, transcription, and splicing processes. Following the protein-coding assignment procedure of QED codons described above, the verified triplet START and STOP codons are used to assign the corresponding QED codons. Since the information in Table 2 (b) is provided in DNA bases, T has been replaced by U in the QED START and STOP codons. The assigned noncoding QED codons are listed in Table 5.

Table 2 (b), first element: (TA)(TA)

In eukaryotes, transcription and splicing are the critical pre-mRNA processing steps to produce rRNA, tRNA, and mRNA for protein synthesis. Transcription always starts at the TATA box. Element #1 of Table 2(b), (TA)(TA), is therefore assigned to initiate the transcription process and is listed in Table 5.

Table 2(b), 2nd element: (CpG)(CpG)

The splicing process separates protein-coding exons from noncoding introns and is unique to eukaryotes. The (CG)(CG) element and G + C-rich bases are used to locate exon–intron interfaces, and splicing then separates them. Furthermore, alternative splicing makes it possible for one gene to encode multiple proteins. Therefore, the (CG)(CG) element of Table 2(b) is assigned for controlling splicing processes and is listed in Table 5.

Table 2 (b), assignments of the following 3 to 11 elements.

Among the 35 noncoding codons of Table 1 (b), which are listed in Table 2(b), 10 are (AU)NN and 10 are (CG)NN (where N is A, T, C, or G). The remaining fifteen noncoding codons are mixed combinations.

For the QED START and STOP codon assignments, the triplet START and STOP codons of Table 3 are used as guides. Additionally, the T bases of these nine elements in Table 2(b) have been replaced by U.

START

The START triplet AUG matches the first two bases of the third element in Table 2 (b). Thus, QED START-(AU)GG is assigned and listed in Table 5.

STOP

In Table 3, the STOP triplets include three codons: UGA, UAG, and UAA.

QED STOP: The first two bases of elements 4 to 6 of Table 2 (b) match the first two bases of the UGA triplet. Thus, QED STOP-(UG)(AC), -(UG)(AG), and -(UG)AA are assigned in Table 5. Since (UG)AA has lower bonding energy, it is assigned STOP. The other two are assigned as Regulatory or STOP.

The first two bases of elements 7 to 9 of Table 2 (b) match the first two bases of the triplet UAG. Thus, QED STOP-(UA)(GU), -(UA)(GA), and -(UA)(GC) are assigned in Table 5. Following the previous procedure, (UA)(GA) is assigned STOP and the other two as Regulatory or STOP.

Table 2 (b), 10th, and 11th elements

The first two bases of the 10th and 11th elements match the first two bases of triplet UAA. Thus, QED STOP-(UA)AA and -(UA)(AC) are assigned in Table 5. Following the previous procedure, (UA)AA is assigned STOP and (UA)(AC) as Regulatory or STOP.

The assignment of the remaining twenty-four G+C- and T+A-rich QED regulatory noncoding codons will require further work.

Table 5, QED regulatory noncoding codon assignments. ** Element numbers from Table 2 (b).

**	Triplet Codons	Noncoding QED Codons	QED Regulatory & Control	Comments
1	Absent	(TA)(TA)	TATA Box - Transcription start
2	Absent	(CG)(CG)	(CG)(CG), Exon/Intron Interface
3	START-AUG	(AU)GG	START
5	STOP-UGA (OPAL)	(UG)(AG)	STOP
8	STOP-UAG (AMBER)	(UA)(GA)	STOP
10	STOP-UAA (OCHRE)	(UA)AA	STOP
4		(UG)(AC)	Regulatory	*	STOP
6		(UG)AA	Regulatory	*	STOP
9		(UA)(GC)	Regulatory	*	STOP
7		(UA)(GU)	Regulatory	*	STOP
11		(UA)(AC)	Regulatory	*	STOP
12		(TT)(AA)	Regulatory	*
13		(CC)(GG)	Regulatory	*
14		TT(TA)	Regulatory	*
15		TT(AC)	Regulatory	*
16		TT(AG)	Regulatory	*
17		TT(CG)	Regulatory	*
18		CC(TA)	Regulatory	*
19		CC(TG)	Regulatory	*
20		CC(AG)	Regulatory	*
21		CC(CG)	Regulatory	*
22		AA(CT)	Regulatory	*
23		AA(CG)	Regulatory	*
24		GG(CT)	Regulatory	*
25		GG(CG)	Regulatory	*
26		GG(AC)	Regulatory	*
27		(AC)(CG)	Regulatory	*
28		(AC)(AG)	Regulatory	*
29		(AG)(CG)	Regulatory	*
30		(CT)(TA)	Regulatory	*
31		(CT)(CG)	Regulatory	*
32		(CT)(AC)	Regulatory	*
33		(CT)(AG)	Regulatory	*
34		(CT)(TG)	Regulatory	*
35		(GT)(CG)	Regulatory	*

* To be assigned

Digital representation

Bioinformatics and next-generation sequencing (NGS) use digital techniques extensively for DNA sequencing and for the analysis and interpretation of the results. Each of the four canonical DNA bases is represented by two bits: T: 11, A: 10, C: 01, and G: 00. Thus, each quadruplet QED encoding or noncoding codon is represented by eight binary digits (one byte). For example,

TTTT: 11111111; CCCC: 01010101, AAAA: 10101010; GGGG: 00000000

Accordingly, each of the twenty protein-coding and thirty-five regulatory codons can be expressed by one byte. Thus, the digital representation will allow developing compatible applications to capitalize on bioinformatics and cybersecurity tools.
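A minimal sketch of this one-byte representation follows, using the bit assignments given above. The function names `encode_codon` and `decode_codon` are hypothetical, and mapping U to the same bits as T (so that mRNA codons can be encoded as well) is an added assumption.

```python
# Two-bit code per base, as given in the text.
BITS = {"T": "11", "A": "10", "C": "01", "G": "00"}
BASE_FOR_BITS = {bits: base for base, bits in BITS.items()}


def encode_codon(codon: str) -> str:
    """Return the 8-character bit string for a quadruplet codon.

    Parentheses are ignored; U is treated as T (assumption, see lead-in).
    """
    bases = [b for b in codon.upper().replace("U", "T") if b in BITS]
    if len(bases) != 4:
        raise ValueError(f"expected four bases in {codon!r}")
    return "".join(BITS[b] for b in bases)


def decode_codon(bit_string: str) -> str:
    """Inverse of encode_codon: 8 bits back to four DNA bases."""
    if len(bit_string) != 8 or set(bit_string) - {"0", "1"}:
        raise ValueError("expected an 8-character string of 0s and 1s")
    return "".join(BASE_FOR_BITS[bit_string[i:i + 2]] for i in range(0, 8, 2))


for codon in ["TTTT", "CCCC", "AAAA", "GGGG", "(TA)(TA)"]:
    bits = encode_codon(codon)
    # int(bits, 2) gives the same codon as a single byte value (0-255).
    print(codon, bits, int(bits, 2))
```

Storing each codon as an integer between 0 and 255 is one way such byte values could be passed to existing binary-oriented tools.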

The HIPAA rule limits access to eHealth data. However, the digitally encrypted codons and security codes will overcome this limitation. Furthermore, the digital representation of DNA data will make it easier to develop and certify diagnostic tools for use at the point of care (POC) and will provide a path toward personalized medicine.

Incurable rare monogenic diseases, multigenic cancers and vaccines

Gene variants and errors in transcription and splicing produce dysfunctional proteins that cause disease. More than 7,000 rare monogenic diseases have no cure; only their symptoms can be managed.

A similar situation is observed for multigenic cancers. In the five decades since the National Cancer Act of 1971, cancer treatment has not changed considerably. Once cancer is detected, treatment is initiated with surgery, followed by radiation and chemotherapy. The goal has been to extend life by five years. Once metastasis or recurrence occurs, few further treatment options are available.

In rare diseases, dysfunctional proteins can be corrected at either the protein or the DNA level. At the protein level, this requires replacing the incorrect amino acids with the correct ones. However, the triplet code is degenerate, and selecting a unique codon among the synonymous ones is a foremost hurdle. The nondegenerate protein-coding QED codons have no such limitation. At the DNA level, variant genes are first corrected with CRISPR gene-editing tools, and normal proteins are then generated to replace the dysfunctional ones.
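The degeneracy of the triplet code mentioned here can be made concrete with a short sketch. The Python snippet below groups the 64 codons of the standard triplet code by the amino acid they encode. The compact table string follows the widely used TCAG ordering of the standard code; the function name `synonymous_codons` is illustrative and not taken from this work.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (third base varies fastest); '*' = stop.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each DNA triplet to its one-letter amino acid code.
CODON_TABLE = {
    "".join(triplet): amino_acid
    for triplet, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons() -> dict[str, list[str]]:
    """Group triplet codons by the amino acid (or stop signal) they encode."""
    groups: dict[str, list[str]] = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        groups[amino_acid].append(codon)
    return dict(groups)


if __name__ == "__main__":
    for amino_acid, codons in sorted(synonymous_codons().items()):
        print(amino_acid, len(codons), " ".join(codons))
```

In the standard code, leucine (`L`), serine (`S`) and arginine (`R`) each map to six triplets, whereas methionine (`M`) and tryptophan (`W`) each map to one. A nondegenerate code, as proposed for the QED coding codons, would by definition place exactly one codon in each group.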

No biological technique currently exists for accessing cancerous cells selectively, a foremost hurdle that must be overcome to find a cure for cancer. The limited applicability of the triplet code to eukaryotes might have prevented the development of such a technique. The eukaryotic QED code has the potential to support one. Thus, the combination of the QED code, dysfunctional protein correction techniques, the Human Cell Atlas ^{41, 42} and direct cell RNA sequencing ^{43, 44} is anticipated to open the possibility of finding cures for multigenic cancers.

Vaccines and antibiotics are the best preventive tools for controlling some diseases. Many antibiotics kill bacteria (prokaryotes) by disrupting their ability to produce proteins. Viruses, on the other hand, take over the protein production machinery of the (eukaryotic) cell and speed up cellular protein production, leading to cell death. One way to prevent cell death is to produce antibodies that destroy the virus's protective proteins and the virus itself. With a known virus genome, antibody synthesis is relatively straightforward. An effective COVID-19 vaccine was produced using mRNA encoding a viral protective protein. The eukaryotic QED codons offer a distinct possibility of developing a targeted universal vaccine.

Protein synthesis to correct dysfunctional proteins: a step toward curing diseases

QED codons specify how the genetic information carried in mRNA is translated into proteins at the ribosome. The translation process is the same in eukaryotes, prokaryotes and viruses, but the starting and intervening steps differ, as shown in Fig. 2a-c. The different roles of the QED codons in control and translation are shown in bold.

Dysfunctional proteins causing diseases could be corrected either at the protein level or the DNA level. The steps are illustrated in Figs. 3 and 4.

The QED genetic coding developed for eukaryotic cells also applies to prokaryotes and viruses.

The QED code has a new protein-encoding codon table and a noncoding regulatory codon table. The QED encoding overcomes the triplet codon limitations. Furthermore, steps for correcting dysfunctional proteins are described, anticipating an approach for identifying cures for rare diseases and cancers.

Data availability: N/A

Code availability: N/A

Acknowledgements

I am grateful to Dr. Nawin Mishra, Distinguished Professor Emeritus of Genetics & Genomics, University of South Carolina, Columbia, SC 29208, for his encouragement and valuable comments regarding this work. The incurable rare disease of my daughter, Usha Singh, was the catalyst for a lifelong journey to find a gene-disease causality relationship. The final outcome is the development of QED genetic coding for eukaryotic cells. This work would never have materialized without the unwavering support of my family: my wife, Amrita Singh, and my children, Asha Singh and Om Prakash Singh (a.k.a. Tom Singh).

Author contribution - Rama Shankar Singh - 100%

Competing interest – No competing interest

Additional information

Correspondence and requests for materials should be addressed to Rama Shankar Singh.

Reprints and permissions information is available by contacting the author.

1. Watson, J. D. & Crick, F. H. The structure of DNA. Cold Spring Harb. Symp. Quant. Biol. 18, 123-131 (1953).

2. Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738 (1953).

3. Crick, F. H. C., Watson, J. D. & Wilkins, M. H. F. The Nobel Prize in physiology or medicine 1962. NobelPrize.org. Nobel Prize outreach AB 2022 https://www.nobelprize.org/prizes/medicine/1962/summary (2022).

4. Crick, F. H., Barnett, L., Brenner, S. & Watts-Tobin, R. J. General nature of the genetic code for proteins. Nature 192, 1227-1232 (1961).

5. Crick, F. H. On the genetic code. Science 139, 461-464 (1963).

6. Crick, F. H. Codon-anticodon pairing: the wobble hypothesis. J. Mol. Biol. 19, 548-555 (1966).

7. Holley, R. W., Khorana, H. G. & Nirenberg, M. W. The Nobel Prize in physiology or medicine 1968. NobelPrize.org. Nobel Prize outreach AB 2022 https://www.nobelprize.org/prizes/medicine/1968/summary (2022).

8. Morgan, A. R., Wells, R. D. & Khorana, H. G. Studies on polynucleotides, LIX. Further codon assignments from amino acid incorporations directed by ribopolynucleotides containing repeating trinucleotide sequences. Proc. Natl. Acad. Sci. U. S. A. 56, 1899-1906 (1966).

9. Nirenberg, M. & Leder, P. RNA codewords and protein synthesis. The effect of trinucleotides upon the binding of sRNA to ribosomes. Science 145, 1399-1407 (1964).

10. Jones, O. W. & Nirenberg, M. W. Qualitative survey of RNA codewords. Proc. Natl. Acad. Sci. U. S. A. 48, 2115-2123 (1962).

11. Holley, R. W. et al. Structure of a ribonucleic acid. Science 147, 1462-1465 (1965).

12. Jacob, F. & Monod, J. Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 318-356 (1961).

13. Jacob, F., Lwoff, A. & Monod, J. The Nobel Prize in physiology or medicine 1965. NobelPrize.org. Nobel Prize outreach AB 2022 https://www.nobelprize.org/prizes/medicine/1965/summary (2022).

14. Berget, S. M., Moore, C. & Sharp, P. A. Spliced segments at the 5' terminus of adenovirus 2 late mRNA. Proc. Natl. Acad. Sci. U. S. A. 74, 3171-3175 (1977).

15. Manley, J. L., Fire, A., Cano, A., Sharp, P. A. & Gefter, M. L. DNA-dependent transcription of adenovirus genes in a soluble whole-cell extract. Proc. Natl. Acad. Sci. U. S. A. 77, 3855-3859 (1980).

16. Roberts, R. J. & Sharp, P. A. "For their discoveries of split genes" The Nobel Prize in physiology or medicine 1993. NobelPrize.org. Nobel Prize outreach AB 2022 https://www.nobelprize.org/prizes/medicine/1993/summary (2022).

17. Nilsen, T. W. & Graveley, B. R. Expansion of the eukaryotic proteome by alternative splicing. Nature 463, 457-463 (2010).

18. McManus, C. J. & Graveley, B. R. RNA structure and the mechanisms of alternative splicing. Curr. Opin. Genet. Dev. 21, 373-379 (2011).

19. Kornberg, R. D. The Nobel Prize in chemistry 2006. NobelPrize.org. Nobel Prize outreach AB 2022 https://www.nobelprize.org/prizes/chemistry/2006/summary (2022).

20. Maston, G. A., Evans, S. K. & Green, M. R. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7, 29-59 (2006).

21. Novoyatleva, T., Tang, Y., Rafalska, I. & Stamm, S. Pre-mRNA missplicing as a cause of human disease. Prog. Mol. Subcell. Biol. 44, 27-46 (2006).

22. Ward, A. J. & Cooper, T. A. The pathobiology of splicing. J. Pathol. 220, 152-163 (2010).

23. Palade, G. E. A small particulate component of the cytoplasm. J. Biophys. Biochem. Cytol. 1, 59-68 (1955).

24. Ramakrishnan, V., Steitz, T. A. & Yonath, A. Nobel Prize in Chemistry "for studies of the structure and function of the ribosome." The Nobel Prize in Chemistry 2009, NobelPrize.org https://www.nobelprize.org/prizes/chemistry/2009/summary (2009).

25. Wimberly, B. T. et al. Structure of the 30S ribosomal subunit. Nature 407, 327-339 (2000).

26. Selmer, M. et al. Structure of the 70S ribosome complexed with mRNA and tRNA. Science 313, 1935-1942 (2006).

27. Ogle, J. M. & Ramakrishnan, V. Structural insights into translational fidelity. Annu. Rev. Biochem. 74, 129-177 (2005).

28. Ramakrishnan, V. Gene Machine: The Race to Decipher the Secrets of the Ribosome (Hachette Book Group, 2018).

29. Demeshkina, N., Jenner, L., Westhof, E., Yusupov, M. & Yusupova, G. A new understanding of the decoding principle on the ribosome. Nature 484, 256-259 (2012).

30. Rozov, A. et al. Novel base-pairing interactions at the tRNA wobble position crucial for accurate reading of the genetic code. Nat. Commun. 7, 10457 (2016).

31. Carter, A. P. et al. Functional insights from the structure of the 30S ribosomal subunit and its interactions with antibiotics. Nature 407, 340-348 (2000).

32. Liu, C. C. & Schultz, P. G. Adding new chemistries to the genetic code. Annu. Rev. Biochem. 79, 413-444 (2010).

33. DeBenedictis, E. A., Carver, G. D., Chung, C. Z., Söll, D. & Badran, A. H. Multiplex suppression of four quadruplet codons via tRNA directed evolution. Nat. Commun. 12, 5706 (2021).

34. de la Torre, D. & Chin, J. W. Reprogramming the genetic code. Nat. Rev. Genet. 22, 169-184 (2021).

35. Kolber, N. S., Fattal, R., Bratulic, S., Carver, G. D. & Badran, A. H. Orthogonal translation enables heterologous ribosome engineering in E. coli. Nat. Commun. 12, 599 (2021).

36. Malyshev, D. A. et al. Efficient and sequence-independent replication of DNA containing a third base pair establishes a functional six-letter genetic alphabet. Proc. Natl. Acad. Sci. U. S. A. 109, 12005-12010 (2012).

37. Hoshika, S. et al. Hachimoji DNA and RNA: a genetic system with eight building blocks. Science 363, 884-887 (2019).

38. Saks, M. E. et al. The transfer RNA identity problem: a search for rules. Science 263, 191-197 (1994).

39. Agris, P. F. Decoding the genome: a modified view. Nucleic Acids Res. 32, 223-238 (2004).

40. Peabody, D. S. Translation initiation at non-AUG triplets in mammalian cells. J. Biol. Chem. 264, 5031-5035 (1989).

41. Travaglini, K. J. et al. A molecular cell atlas of the human lung from single-cell RNA sequencing. Nature 587, 619-649 (2020).

42. The Cancer Genome Atlas Data Portal. https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga

43. Garalde, D. R. et al. Highly parallel direct RNA sequencing on an array of nanopores. Nat. Methods 15, 201-206 (2018).

44. ENCODE Data Matrix. https://genome.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html
