The physicochemical and phylogenetic properties of twelve spike glycoprotein models for SARS-CoV-2 from China, Iran, and Tunisia

The coronavirus spike glycoprotein is a trimeric structural surface protein that facilitates viral adhesion by attaching to receptors on the human cell surface. This study aims to analyse and compare the genomic and phylogenetic properties of spike glycoproteins from China, Iran, and Tunisia. This is a descriptive cross-sectional comparative study of the properties of the S glycoprotein from 12 SARS-CoV-2 specimens from GenBank. Clustal Omega was used to study sequence alignment, residue conservation, phylogeny, and the identity matrix. SWISS-MODEL was used to build and validate 3D models for the three protein sequences with the highest model quality. The physicochemical characteristics of the models were assessed with the ExPASy proteomics server. The sequences are monophyletic, with varying degrees of evolutionary divergence. There are six fully, three highly, and five lowly conserved residues across the sequences. The three 3D models were highly reliable but of different quality, with the Tunisian model scoring lowest. All the models are highly hydrophilic. The Tunisian models were unstable in comparison to the relatively stable other models, which had different physicochemical characteristics.

This study highlighted the genomic and phylogenetic properties of SARS-CoV-2 spike glycoprotein sequences from three countries: China, Iran, and Tunisia.

Materials And Methods
This is a descriptive cross-sectional comparative study analyzing the genomic and phylogenetic properties of the S glycoprotein from 12 SARS-CoV-2 specimens from China (seven isolates), Iran (two isolates), and Tunisia (three isolates). The genomic coding and amino acid sequences of SARS-CoV-2 are available in GenBank (https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/) [11]. Identical proteins were identified with the Identical Protein Group resource, which simplified the evaluation by allowing one descriptive representative sequence to be chosen from each country.
The Multiple Sequence Alignment tool Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) of the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) [12] was used to align the amino acid sequences of the 12 specimens. Clustal Omega can also identify any conserved residues across the 12 sequences.
The Phylogenetic Tree and the Percent Identity Matrix were used to study the nature and the percentage of similarity between the 12 sequences. The phylogenetic tree (phylogeny) is a diagram that depicts the lines of evolutionary descent among the protein sequences. It portrays the branching history of common ancestry, and the branching pattern can be drawn as a cladogram or a phylogram. The branching pattern reflects different evolutionary lineages [13,14].
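The idea behind a percent identity matrix can be sketched in a few lines of Python. This is a minimal illustration, not the Clustal implementation: the function names and the toy sequences are hypothetical, and the choice to skip gapped columns while counting an unknown residue (X) as a mismatch is an assumption made here for clarity.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are skipped,
    and an unknown residue ('X') counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Symmetric matrix of pairwise percent identities, keyed by name pairs."""
    names = list(aligned)
    matrix = {(n, n): 100.0 for n in names}
    for p, q in combinations(names, 2):
        pid = percent_identity(aligned[p], aligned[q])
        matrix[(p, q)] = matrix[(q, p)] = pid
    return matrix


# Toy aligned fragments (hypothetical, not taken from the GenBank records).
toy = {
    "seq_A": "MFVFLVLLPL",
    "seq_B": "MFVFLVLLPL",
    "seq_C": "MFV-LXLLAL",
}
m = identity_matrix(toy)
print(m[("seq_A", "seq_B")], round(m[("seq_A", "seq_C")], 2))
```

Under these assumptions a single unknown residue lowers the identity between otherwise identical sequences slightly below 100%.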
The Tunisian protein sequences (QIZ14987.1 and QIZ14988.1) are identical. Still, we could not include either of them because they contain unidentified amino acids at an ambiguous position, which makes them unsuitable for modeling by SWISS-MODEL [15].
The sequence modeling for the chosen proteins was evaluated according to the following local and global quality estimates:
Global Model Quality Estimation (GMQE): reflects modeling accuracy and reliability. The score is expressed as a number between 0 and 1; the higher the number, the more reliable the model [18].
Qualitative Model Energy Analysis (QMEAN): a composite estimator of the different geometrical properties of the protein sequence that provides both global (for the entire structure) and local (per residue) absolute quality estimates based on one single model [18,19].
QMEAN Z-score: estimates the "degree of nativeness" of the structural features observed in the model on a global scale. It compares the model's QMEAN score to the scores expected for experimental structures of similar size, using Cβ atoms only, all atoms, the solvation potential, and the torsion angle potential [18,19].
QMEANDisCo: enhances the accuracy of the QMEAN local scores by assessing the interatomic distances of the target model against ensemble information from experimentally determined protein structures that are homologous to the target model [19].
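The notion of a Z-score can be illustrated with a generic calculation: the distance of a model score from the mean of a reference distribution, in units of standard deviation. The sketch below is not the SWISS-MODEL implementation. The reference scores are hypothetical placeholders, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def z_score(model_score: float, reference_scores: list[float]) -> float:
    """Number of standard deviations between a model score and the reference mean."""
    return (model_score - mean(reference_scores)) / stdev(reference_scores)


# Hypothetical reference scores standing in for experimental structures
# of similar size; these are not real QMEAN values.
reference = [0.70, 0.75, 0.80, 0.85, 0.90]
z = z_score(0.60, reference)
print(round(z, 2))
```

A negative value means the model scores below what is typical for the reference set.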
The ProtParam tool at the ExPASy proteomics server (https://web.expasy.org/protparam/) was used to compute various physicochemical properties of each protein sequence [20]. It provides information about different indices and scores:
Molecular weight.
The number and charges of different amino acid residues.
The chemical formula.
Tryptophan residues availability.
Site and type of N-terminal residues.
Extinction coefficients, which indicate how much light a protein absorbs at a particular wavelength and are useful during the purification process.
Optical density (absorbance), which is the extinction coefficient divided by the molecular weight [21].
In vivo half-life.
Instability index (II), which provides an estimate of the stability of a protein in a test tube. The II of stable proteins is < 40, while the II of unstable proteins is > 40.
The aliphatic index is the relative volume occupied by aliphatic side chains (alanine, valine, isoleucine, and leucine) [22].
Grand Average of Hydropathy (GRAVY) marks the protein as hydrophilic (negative values) or hydrophobic (positive values) [23].
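Several of the indices listed above follow simple published formulas, which the Python sketch below illustrates. It is not ProtParam's code. The function names are hypothetical, skipping non-standard letters such as X is an assumption, and treating an instability index of exactly 40 as unstable is an arbitrary choice, since only the < 40 and > 40 cases are defined above. The hydropathy values are those of the Kyte-Doolittle scale, and the aliphatic index uses the standard coefficients 2.9 for valine and 3.9 for isoleucine and leucine.

```python
from collections import Counter

# Kyte-Doolittle hydropathy value for each of the 20 standard residues.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standard_residues(seq: str) -> list[str]:
    """Keep only the 20 standard residues (assumption: unknowns like X are skipped)."""
    return [r for r in seq.upper() if r in KD]


def gravy(seq: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    residues = standard_residues(seq)
    return sum(KD[r] for r in residues) / len(residues)


def aliphatic_index(seq: str) -> float:
    """Aliphatic index from the mole percent of Ala, Val, Ile, and Leu."""
    residues = standard_residues(seq)
    counts = Counter(residues)

    def pct(r: str) -> float:
        return 100.0 * counts[r] / len(residues)

    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


def stability_class(instability_index: float) -> str:
    """Stable below 40; values of 40 or more are treated as unstable here."""
    return "stable" if instability_index < 40 else "unstable"


def absorbance(extinction_coefficient: float, molecular_weight: float) -> float:
    """Absorbance of a 1 g/l solution: extinction coefficient / molecular weight."""
    return extinction_coefficient / molecular_weight


fragment = "AVILXKD"  # hypothetical fragment with one unknown residue
print(round(gravy(fragment), 3), round(aliphatic_index(fragment), 1))
```

A negative GRAVY value would mark the fragment as hydrophilic and a positive one as hydrophobic, matching the interpretation given in the list.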
The Protein Molecular Weight tool of the Sequence Manipulation Suite at Bioinformatics.org (https://www.bioinformatics.org/sms/prot_mw.html) was used to compare the actual molecular weight of the different protein sequences, without including any unknown amino acids [24].

Results
Table 1 lists the GenBank accession numbers of the amino acid sequences of the 12 S glycoproteins of SARS-CoV-2 from China, Iran, and Tunisia (these represent all the sequences plotted worldwide up to April 20th, 2020). The Tunisian sequences had the largest number of amino acids in comparison to the Iranian and Chinese sequences. The sequences (QIZ14987.1 and QIZ14988.1) are identical, and they contain an abnormal amino acid at an ambiguous position. The Identical Protein Group provides information about the identical protein sequences from each country. All the Chinese sequences shown in Table 1 were isolated on the same day (February 11th, 2020) from the (Wuhan seafood market pneumonia virus), from nasopharyngeal swab, serum, throat swab, and sputum specimens. All other sequences, from Iran and Tunisia, were taken from nasopharyngeal swabs.
Figure 1 shows the multiple sequence alignment of the different sequences. There are six fully, three highly, and five lowly conserved residues.
The Chinese and the Iranian sequences share 100% similarity, while the Tunisian sequences had only 25% (QIZ14987.1 and QIZ14988.1) and 29.27% (QIV64962.1) similarity to the Chinese and the Iranian sequences. The first Tunisian sequence described (QIV64962.1) had 99.39% similarity to the other two sequences (QIZ14987.1 and QIZ14988.1) because of the unidentified amino acid at the ambiguous position (Table 2 and Figure 1).
The first divergence event separated the lineage that gave rise to the (China-Shenzhen 7) sequence from a lineage of the other six Chinese sequences and a lineage that gave rise to the Iranian and Tunisian sequences, i.e., the 12 sequences share a common ancestry (monophyletic group). The (China-Shenzhen 1 and 2) sequences share a more recent common ancestor than either shares with the other sequences, i.e., they are more closely related to each other than either is to the other sequences. The (Tunisia-Tunis 2 and 3) sequences underwent a recent evolutionary divergence (Supplementary Figure 1 A and B).
Because all the Chinese and Iranian sequences are of equal distance (in terms of branch arrangement) from their original ancestor, we could say that these sequences are equally related to each other. The three Tunisian sequences underwent the most recent divergence from the original evolutionary lineage.
The 3D models of the S glycoproteins built from the three chosen sequences, one from each country, were highly reliable, with very high GMQE scores. Still, their global quality differed: the Tunisian S glycoprotein model, which had three chains, showed very low global quality and a very low QMEAN score, the Chinese model showed medium global quality, and the Iranian model showed the highest global quality of the three. The Z-scores of all three sequences fall within the range 0.5-0.7 (Figure 2 and Supplementary Figure 2).
The Tunisian sequences had the highest molecular weight, extinction coefficient, and instability index (II) among the chosen sequences, which renders the Tunisian sequences unstable in comparison to both the Chinese and the Iranian sequences, which are relatively stable to different degrees. All the sequences are hydrophilic and share a high aliphatic index (Table 3).
The N-terminal residues, which signify the beginning of the S1 subunit of the S glycoprotein in SARS-CoV-2, differ in their chemical profiles:
For the Chinese sequences, it is N (Asparagine), which is polar with a positively charged side group.
For the Iranian sequences, it is P (Proline), which is nonpolar with an uncharged side group.
For the Tunisian sequences, it is Q (Glutamine), which is polar with an uncharged side group.
These biophysical differences affect the half-life of the protein models in vitro and in vivo (Table 3).

Discussion
The coronavirus S glycoprotein is an important structural protein that is responsible for the crown-like shape of the viral particles, from which the original name "coronavirus" was coined [4].
GenBank provides unrestricted access to all types of genomic and polypeptide sequences related to the novel coronavirus epidemic caused by SARS-CoV-2 [11]. Up to the preparation of this paper (April 20th, 2020), only the Chinese sequences of the S glycoprotein of SARS-CoV-2 had been assessed and compared in the literature [25][26][27]. The Iranian and the Tunisian sequences had not been assessed after being uploaded to GenBank.
The three Tunisian models represent the largest S glycoprotein unit, in comparison to the Chinese and Iranian S glycoproteins, with their higher number of base pairs and a consequently higher number of amino acid residues. The assessment of the Percent Identity Matrix by Clustal 2.1 revealed 100% similarity or identity between the Chinese and the Iranian sequences, although their numbers of amino acid residues are different. Two of the three Tunisian sequences (QIZ14987.1 and QIZ14988.1) were 100% identical to each other but share only 25% identity with the Chinese and Iranian sequences. The presence of an unknown amino acid (X) at an ambiguous position in these sequences minimally decreased the identity of the sequences (QIZ14987.1 and QIZ14988.1) to the third Tunisian sequence (QIV64962.1), by 0.61% to 99.39%. On the other hand, the Tunisian sequence (QIV64962.1) shared 29.27% identity with the Chinese and Iranian sequences. Shanker and colleagues studied the amino acid sequences of the seven S glycoprotein units from China and reached the same conclusion [27].
The N-terminal identifies the start of the S1 subunit in the three modeled sequences. The N-terminal residues differed in polarity, being polar with a positively charged side group in the Chinese model, nonpolar with an uncharged side group in the Iranian model, and polar with an uncharged side group in the Tunisian model [4,6].
The S glycoprotein remains uncleaved at the S1/S2 site during virus packaging in cells. The trimeric S protein is processed at the S1/S2 furin-like cleavage site by host cell proteases during infection [28]. We combined the SWISS-MODEL and the Clustal Omega results to find the possible furin-like S1/S2 cleavage site that was thoroughly described by Coutard and colleagues at the monobasic R residue [4], as marked in Fig. 1.
Following cleavage or priming, the protein is divided into an N-terminal S1-ectodomain that recognizes a cognate cell surface receptor to aid trafficking into and hijacking the host cell, and a C-terminal S2 membrane-anchored protein involved in viral entry [4,6]. The cleavage at the furin-like cleavage site occurs during virus egress for S-protein priming and may provide a gain-of-function to the virus for efficient pathogenesis compared to other betacoronavirus lineages [4].
SWISS-MODEL provides global and local quality assessment for the modeled sequences. The models were highly reliable, with very high GMQE scores. Still, the global and local quality scores differ even after normalization (Fig. 2), which further affects the biochemical profiles of the models (Table 3).
The highly descriptive ProtParam tool of ExPASy provides some answers to such different quality profiles between models. All the models had no tryptophan residues that could result in more than 10% error in the computed extinction coefficient. The extinction coefficients of the Tunisian and Iranian models were much higher than those of the Chinese models, but with declining absorbance scores that are related to their molecular weight. The extinction coefficient is important when studying protein-protein and protein-ligand interactions [20].
The Iranian models had a half-life of more than 20 hours both in mammalian reticulocytes in vitro and in yeast in vivo, in comparison to the other models (Table 3), which may be related to the nonpolar N-terminal residue (Proline) with its uncharged side group. The in vivo half-life predicts the time it takes for half of the amount of protein in a cell to disappear after its synthesis. ProtParam relies on the (N-end rule), which relates the half-life of a protein to the identity of its N-terminal residue, which affects the stability of the protein in vivo; the prediction is given for 3 model organisms (human, yeast and E. coli) [20].
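The N-end rule amounts to a lookup keyed by the first residue of the sequence, which the following Python sketch illustrates. The table is deliberately partial and covers only the three N-terminal residues discussed here. The proline entry matches the value reported above, while the asparagine and glutamine entries are recalled from the ProtParam documentation and are an assumption that should be checked against the tool's output and Table 3. The names `N_END_RULE` and `predicted_half_life` are hypothetical.

```python
from typing import Optional

# Partial N-end rule table: residue -> (mammalian reticulocytes in vitro, yeast in vivo).
# Assumption: the N and Q values are recalled from the ProtParam documentation
# and are not taken from this study; verify them before relying on them.
N_END_RULE = {
    "N": ("1.4 hours", "3 min"),
    "P": (">20 hours", ">20 hours"),
    "Q": ("0.8 hours", "10 min"),
}


def predicted_half_life(seq: str) -> Optional[tuple[str, str]]:
    """Look up the half-life estimates for the N-terminal residue of a sequence.

    Returns None when the residue is not covered by this partial table.
    """
    if not seq:
        return None
    return N_END_RULE.get(seq[0].upper())


# Hypothetical fragments that differ only in their N-terminal residue.
for fragment in ("NLTTR", "PAYTN", "QCVNL"):
    print(fragment[0], predicted_half_life(fragment))
```

Only the identity of the first residue enters the estimate, so two models that differ solely at the N-terminus can receive very different predicted half-lives.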
All models share a high aliphatic index that may contribute as a positive factor to increase the relative thermostability of the globular proteins [22]. All the models were hydrophilic, which is reflected by the highly negative GRAVY score [23], with the Tunisian model being the most hydrophilic.

Conclusions
The S glycoproteins are neither identical nor uniform in model structure or physicochemical profiles in different parts of the world, with a special emphasis on the Tunisian S glycoprotein models, which showed drastic diversity from the Chinese and Iranian models. Understanding spike protein models is fundamental to an understanding of viral pathogenesis, which will allow for additional protein-engineering efforts that could improve antigenicity and protein expression for pharmaceutical development.

Declarations
Availability of data and materials
The datasets used during the current study are available from the corresponding author on reasonable request.

Ethics declarations
Ethics approval and consent to participate: Not applicable.

Figure 1
Multiple sequence alignment of the amino acid sequences for 12 S glycoproteins of SARS-CoV-2 from China (seven sequences), Iran (two sequences), and Tunisia (three sequences) using Clustal Omega. The red arrow marks the ambiguous sites of the unidentified amino acid of the Tunisian sequences. Asterisks represent fully conserved residues, colons represent highly conserved residues, and periods represent lowly conserved residues. The black arrow marks the possible S1/S2 furin-like cleavage site.
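The conservation symbols described in this caption can be tallied directly from the consensus line of an alignment. The Python sketch below is a minimal illustration: the consensus string is a hypothetical toy, not the line from Figure 1, and the function name is an assumption.

```python
from collections import Counter

# Meaning of each symbol in the consensus line, as described in the caption.
LABELS = {
    "*": "fully conserved",
    ":": "highly conserved",
    ".": "lowly conserved",
}


def count_conservation(consensus_line: str) -> dict[str, int]:
    """Count conservation symbols in a consensus line; blanks are ignored."""
    counts = Counter(ch for ch in consensus_line if ch in LABELS)
    return {label: counts[symbol] for symbol, label in LABELS.items()}


toy_consensus = "*: .* :*"  # hypothetical consensus line
tally = count_conservation(toy_consensus)
print(tally)
```

Applied to the consensus line of a real alignment, the same tally gives the number of fully, highly, and lowly conserved columns.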