CagA sequence differences and proteome profiles between East Asian and Western H. pylori strains

The cytotoxin-associated gene A protein (CagA), an effector protein of Helicobacter pylori (H. pylori), was the first identified bacterial oncoprotein. Based on its sequence characteristics, H. pylori has been classified into East Asian and Western strains. We hypothesized that the differences in CagA structure and proteomic profiles between East Asian and Western H. pylori strains are the primary cause of the differential clinical outcomes of H. pylori infection. Proteomic analysis showed that CagA, a protein related to cytotoxicity, was highly expressed in Western strains, while the proteins UreA and UreH, flagellin-associated proteins, and cell division proteins were highly expressed in East Asian strains.


Background
Helicobacter pylori (H. pylori), a spiral-shaped, flagellated, microaerophilic Gram-negative bacillus infecting over 50% of the world's population, was classified as a Group I human carcinogen by an International Agency for Research on Cancer (IARC) Working Group in 1994. This classification was reconfirmed in 2009 by a new Working Group, which found sufficient evidence that chronic infection with H. pylori causes non-cardia gastric carcinoma [1]. Experimental studies in Mongolian gerbils also provided strong evidence for the carcinogenicity of H. pylori infection [2]. Although the prevalence of H. pylori infection appears to have decreased in certain parts of the world, it remains high in many regions and countries, particularly in Africa and Asia. It has recently been estimated that 79% of new gastric cancers diagnosed each year globally can be attributed to H. pylori infection, making it a primary risk factor for gastric cancer [3,4].
sequences) was detected for the first time in all East Asian strains but not in Western strains (Fig. 1A). An eight amino acid difference between the EPIYA-C/D site and CM motifs between East Asian and Western strains was also detected (Fig. 1B).

Functional domain diversity of CagA between East Asian and Western strains
Studies on the crystal structure of CagA revealed that CagA consists of an N-terminal ordered region (residues 1-829), including domains I, II, and III, and a C-terminal disordered segment (residues 830-1186) [11].
CagA contains several important functional domains and segments, including the CagA-phosphatidylserine (PS) interaction domain, EPIYA motifs, and CM motifs (Fig. 2A). The PS segment in domain II mediates the attachment of CagA to the cytoplasmic membrane post-translocation, while the N- and C-terminal binding sequences (NBS and CBS) are associated with CagA dimerization. EPIYA motifs (Glu-Pro-Ile-Tyr-Ala) have several important functions, such as CagA phosphorylation, binding of CagA to SHP-2, and activation of multiple intracellular signaling pathways. CM motifs, by binding to partitioning-defective 1 kinase b (PAR1b, a serine-threonine kinase), participate in the maintenance of gastric epithelial cell polarity. In brief, these domains play a crucial role in H. pylori-induced gastric pathogenesis. Interestingly, we found significant sequence variability in these important domains and their flanking regions between East Asian and Western CagA (Fig. 2B).

Construction of molecular phylogenic tree of CagA
One hundred and fifty sequences of H. pylori-CagA, including our own isolates, were randomly obtained from the GenBank database to construct the CagA-based molecular phylogenic tree with MEGA software (Fig. 3). In the phylogenic tree, the 150 strains were clustered into two large groups: an East Asian group characterized by EPIYA-ABD (80 strains) and a Western group characterized by EPIYA-ABC (70 strains). The Western group was further clustered into three subgroups, named the Western group, the East Asian type of the Western group, and the South America type of the Western group according to geographical origin. The East Asian group primarily originated from China (20/80), Japan (
The sequences flanking the EPIYA-C/D sites affect tyrosine phosphorylation of CagA and are defined as the left and right CM domains, respectively. We found three sets of different sequences on both sides of the EPIYA-C/D sites. FLLKRHDKVDDLSKVG is a typical sequence located on both sides of EPIYA-C in the Western group and the East Asian type of the Western group; it represents the classical CM motif. SSLKRYAKVDDLSKVG is located on both sides of EPIYA-C in the South America type of the Western group, whose strains were reported to be less virulent. The third set of sequences is found on both sides of EPIYA-D in the East Asian group. The KIASAGKGVGGFSGVG segment to the left of EPIYA-D replaces the classical CM motif on the left of EPIYA-C, and the FPLRRSAAVNDLSKVG segment to the right of EPIYA-D is partly identical to the sequence flanking EPIYA-C (Additional file 2: Fig. S2). In addition, consistent with the results from our isolates, Western CagA shows more variation: conversion of EPIYA to EPIYT at the EPIYA-B site and duplication of EPIYA-C appear in 25 (36%) and 31 (44%) of the 70 Western strains, respectively, while the same change at the EPIYA-B site was observed in only 4 (5%) of the 80 East Asian strains. The variations in the EPIYA motifs of CagA from the 150 strains are listed in Table 2.
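The flanking-sequence comparison above can be illustrated with a short script. This is only a minimal sketch, assuming exact string matching against the four 16-residue sequences quoted in the text; the function names, the descriptive labels, and the toy sequence in the usage note are hypothetical and are not taken from the study's workflow.

```python
# Flanking sequences quoted in the text. The labels are descriptive names
# chosen for this sketch, not terminology from the study.
FLANKING_VARIANTS = {
    "FLLKRHDKVDDLSKVG": "classical CM motif (Western group / East Asian type of Western group)",
    "SSLKRYAKVDDLSKVG": "South America type of the Western group",
    "KIASAGKGVGGFSGVG": "East Asian group, left of EPIYA-D",
    "FPLRRSAAVNDLSKVG": "East Asian group, right of EPIYA-D",
}


def find_flanking_variants(sequence):
    """Return (position, motif, label) for every exact occurrence, sorted by position."""
    seq = sequence.upper()
    hits = []
    for motif, label in FLANKING_VARIANTS.items():
        start = seq.find(motif)
        while start != -1:
            hits.append((start, motif, label))
            start = seq.find(motif, start + 1)
    return sorted(hits)


def count_epiya_like(sequence):
    """Count EPIYA and EPIYT pentapeptides in a protein sequence."""
    seq = sequence.upper()
    return {"EPIYA": seq.count("EPIYA"), "EPIYT": seq.count("EPIYT")}
```

As a usage example, a made-up toy string such as `"MKK" + "FLLKRHDKVDDLSKVG" + "EPIYATIDD" + "FLLKRHDKVDDLSKVG" + "EPIYT"` (not a real CagA sequence) yields two classical CM hits and one each of EPIYA and EPIYT. Exact matching will miss variants with point substitutions, so a real analysis would rely on the multiple sequence alignment instead.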
We also analyzed the composition of 20 amino acids in East Asian and Western CagA from 150 H. pylori strains and found that certain amino acids, such as Glu, Leu, Thr, Arg, and

iTRAQ-based quantitative proteomics of East Asian and Western H. pylori strains
CagA sequence polymorphisms between East Asian and Western H. pylori strains cannot completely explain their pathogenic differences. Therefore, we sought to further define the proteomic differences between the two strain groups. Six H. pylori strains, including three East Asian strains (GZ1, GZ3, and GZ7) with EPIYA-ABD motifs and three Western strains (NCTC11639, 26695, and GZ5) with EPIYA-ABC motifs, were used to conduct iTRAQ-based absolute quantitative proteomics. Proteomic analysis quantified a total of 2084 proteins and identified 108 differentially expressed proteins between Western and East Asian H. pylori strains according to the criteria of Bonferroni-corrected P < 0.01 and fold change ≥1.2 (up-regulation) or ≤0.8 (down-regulation) [22]. After exclusion of hypothetical, duplicate, and unidentified proteins, 70 differential proteins were mapped to the standard strain H. pylori 26695 in the UniProt database (https://www.uniprot.org/), of which 26 proteins were up-regulated and 44 proteins were down-regulated in the Western group compared to the East Asian group (Fig. 4A and 4B). Using hierarchical clustering analysis, we found that, among the differential proteins, CagA was highly expressed in the Western group. In contrast, the urease subunit alpha (UreA) and urease accessory protein (UreH), both of which are related to the colonization of H. pylori in the human stomach, were highly expressed in the East Asian group. We further observed that flagellin-associated proteins (flagellin FlaA, flagellar hook protein FlgE, and flagellar biosynthesis protein FlhA) and cell division proteins (FtsZ and FtsI) were abundant in the East Asian group (Fig. 4C).
More details of the differential proteins are presented in Additional file 4: Table S1, and the identification of differential proteins by MS/MS is presented in Additional file 5: Table S2.
Because of the high heterogeneity among different H. pylori strains, the biological repeatability of the three strains in the same group was analyzed by Principal Component Analysis (PCA), and correlation coefficients of normalized protein intensity between two strains of the same group were measured. The results indicated good clustering and clear distinction for both groups (Fig. 4D). The correlation coefficients ranged from 0.45 to 0.63 (Additional file 6: Fig. S4). The standard deviation and coefficient of variation of the abundance of the 2084 proteins and the 70 differential proteins were also calculated and are shown in Additional file 7: Table S3 and Additional file 8: Table S4, respectively. Finally, the mRNA expression levels of five differentially expressed proteins (UreA, FlaA, FlgE, CagA, and FlhA) were measured by RT-qPCR, and the results were consistent with the proteomic results (Fig. 4E).

Functional annotation and protein interaction networks of differential proteins
To obtain functional information and interaction networks, the 70 differentially expressed proteins were annotated with gene ontology (GO) terms by DAVID 6.8. KEGG pathway enrichment and interaction network analysis were conducted with the KOBAS 3.0 and STRING online tools, respectively. The results showed that these differential proteins are mainly associated with biosynthetic processes, metabolism, translation, and gene expression (Fig. 5A), and are enriched in nine important pathways, five of which are significantly enriched (FDR-corrected P value < 0.05) (Fig. 5B and 5C). Protein-protein interaction analysis indicated that the highly expressed proteins in East Asian strains are clustered into two significant networks with UreA and FtsI as core nodes, while the highly expressed proteins in Western strains are clustered into an important network with GroEL and CagA as nodes (Fig. 5D).

Discussion
Evidence from epidemiological, clinical, and experimental studies, transgenic models, and H. pylori infection of Mongolian gerbils indicates that chronic infection with H. pylori cagA-positive strains is the strongest risk factor for gastric cancer [11,23,24]. A meta-analysis also determined that eradication of H. pylori is associated with a reduction of gastric cancer risk [25]. However, H. pylori prevalence, virulent strains, and gastric cancer incidence are highly variable among countries and regions throughout the world. The incidence of gastric cancer in East Asian countries such as Japan, China, and Korea is almost ten times higher than that in the United States [26].
CagA is the first bacterial oncoprotein identified in human cancer, and its sequences determine the classification and geographical origin of H. pylori strains. In this study, we found that 23% (6/26) of the strains isolated from the gastric mucosa of Chinese patients are Western strains, suggesting that Western H. pylori is prevalent in China. We also detected co-infection with East Asian and Western strains in one individual. Moreover, we found that all isolates of both East Asian and Western strains are cagA-positive. This result is consistent with another report in which H. pylori strains isolated from East Asian countries have a higher cagA-positive rate (90%~100%) than strains from a Western population (~60%) [27].
EPIYA motifs of CagA are important tyrosine-phosphorylation (pY) sites. Accumulated evidence has shown that the sequence polymorphism and variable alignment of EPIYA segments are linked to the pathobiological action of individual CagA proteins [26]. We found that 5 of the 10 Western strains in this study have a deletion of the EPIYA-C site and the other five strains have an A→T variation at the EPIYA-B site. However, among the 20 East Asian strains, only two strains have an A→V conversion at the EPIYA-D site, one strain has a P→S conversion at the EPIYA-B site, and one strain lacks the EPIYA-A site. The data suggest that the EPIYA motifs of Western strains vary more than those of East Asian strains and that the variation patterns of the two groups differ. These results are further confirmed by the phylogenic tree based on 150 H. pylori-CagA amino acid sequences from the GenBank database, in which 36% (25/70) and 44% (31/70) of Western strains carry the A→T conversion at the EPIYA-B site and duplication of EPIYA-C, respectively. In contrast, the A→T conversion at the EPIYA-B site was observed in only 5% (4/80) of East Asian strains. The A/T polymorphism in the EPIYA-B motif was reported to influence the function of CagA: CagA with an EPIYT-B motif has a higher affinity for PI3K than CagA with an EPIYA-B motif, which subsequently leads to increased secretion of IL-8 [28]. CagA with multiple EPIYA-C motifs was also confirmed to have a stronger pro-carcinogenic potential [28]. Although deletion of EPIYA-C is less documented, our results are supported by two Western strains, NCTC 11639 (AB015416) and IND 07 (LC339379), submitted to GenBank from Japan and Indonesia, respectively, which have the same loss of EPIYA-C. The EPIYA-D motif polymorphism of East Asian CagA is rarely reported. We detected the A→V conversion at the EPIYA-D site in two clinical isolates, a change that has not been found in any other H. pylori strains, Western or East Asian. More importantly, we identified, for the first time, the deletion or partial deletion of 13 amino acids downstream of the N-terminal binding sequence of CagA in all East Asian H. pylori strains studied.
Apart from the above-mentioned sequence differences between Western and East Asian strains, other unreported variations were also detected in this study, such as those in the CM motifs. We note that the eight amino acid difference located between EPIYA-C/D and the right CM motif in the two kinds of strains has also not been explored. We also found differences in the amino acid composition of CagA between the two groups of strains. These findings suggest that there are potentially more valuable differences between East Asian and Western strains that demand further investigation.
The proteome difference between East Asian and Western H. pylori strains has not been reported previously. Through iTRAQ-based absolute quantitative proteomic analysis, we quantified a total of 2084 proteins and identified 70 differentially expressed proteins with functional annotation between East Asian and Western strains; 26 of these proteins were up-regulated and 44 were down-regulated in Western strains compared to East Asian strains. Importantly, the comparison of proteome profiles indicated that CagA protein was more abundant in Western strains. This finding may partly explain why, in Western populations, cagA-positive strains are associated with enhanced induction of gastritis, gastric ulcers, and a higher risk of gastric cancer, whereas in East Asian populations, where almost all strains are cagA-positive, the cagA gene is not associated with an increased risk of gastric diseases [29].
Our study is the first to report that the proteins involved in host colonization (UreA and UreH), cell division (FtsZ and FtsI), and cell movement (flagellin FlaA, flagellar hook protein FlgE, and flagellar biosynthesis protein FlhA) were more abundant in East Asian strains. Among these proteins, FlaA allows H. pylori to migrate into the host gastric epithelium to survive in a comparatively higher pH niche, since H. pylori is not an acidophile [30,31]. FlhA and FlgE are reported to help flagellar biosynthesis and functioning [32]. UreA matures to form an active enzyme by combining with nickel (Ni) to transform urea to ammonia with the help of the Ni-binding protein HypA, FtsZ, and UreH [33]. The ammonia can raise the pH of the gastric mucosal layer, ultimately promoting the colonization and persistent infection of H. pylori. A recent study revealed that urea emanating from the gastric epithelium can attract H. pylori by binding to the chemotaxis protein TlpB on the bacterium's surface in the presence of a powerful urease and direct the bacterium's movement toward the gastric epithelium [34]. Because suitable antibodies were lacking, we verified the proteomic results by measuring the mRNA levels of UreA, FlaA, FlgE, CagA, and FlhA by RT-qPCR and obtained consistent results, suggesting that the East Asian strains of H. pylori may have stronger colonization and motility in the human stomach than Western strains.
Further analysis indicated that the 70 differential proteins are mapped to five significant pathways, including microbial metabolism in diverse environments, glyoxylate and dicarboxylate metabolism, and DNA replication. GO enrichment analysis showed that these proteins were primarily associated with the biosynthetic process, metabolism, translation, and gene expression of H. pylori. More importantly, using the proteomic data, we found that the proteins highly expressed in East Asian strains were enriched in two key protein-protein interaction (PPI) networks with UreA and FtsI as core nodes, while the proteins highly expressed in Western strains were enriched in one important PPI network with CagA and GroEL as nodes. However, more experiments are required to confirm these results.
To analyze the influence of the in vitro passage number of H. pylori on proteome expression, the PCA, the correlation coefficient between two strains, the standard deviation (SD), and the coefficient of variation (CV) of the original data set were computed. PCA results indicated good clustering and clear distinction between East Asian and Western strains. The correlation coefficient between the low-passage-number clinical isolate (GZ5) and the high-passage-number strains (11639 or 26695) was higher than that between the two high-passage-number strains 11639 and 26695 in the Western group. The SD of the abundance of the 2084 proteins ranged from 0.0092 to 4.3676 in the three East Asian strains GZ1, GZ3, and GZ7, and from 0.0047 to 3.9080 in the Western strains GZ5, 11639, and 26695. Similarly, 98% of the 2084 proteins in both the East Asian and Western groups had a CV of less than 10%. These results indicated good reproducibility of the methods and low dispersion of the data set, which suggests that the effect of in vitro passage on protein expression of H. pylori is small in our study.
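The dispersion measures described above can be written out explicitly. The following is a minimal sketch, assuming the sample standard deviation (n - 1 denominator) and CV expressed as a percentage of the mean; the text does not state which SD estimator was used, and the function names and the data layout (a dictionary from protein ID to per-strain abundances) are hypothetical.

```python
import statistics


def sd_and_cv(abundances):
    """Return (SD, CV in percent) for one protein across the strains of a group.

    Assumption: sample standard deviation (n - 1 denominator).
    """
    mean = statistics.mean(abundances)
    sd = statistics.stdev(abundances)
    cv_percent = sd / mean * 100.0
    return sd, cv_percent


def fraction_with_cv_below(protein_abundances, threshold_percent=10.0):
    """Fraction of proteins whose CV is strictly below the threshold.

    protein_abundances maps a protein ID to its abundances in the strains of
    one group (hypothetical layout for this sketch).
    """
    below = 0
    for values in protein_abundances.values():
        _, cv_percent = sd_and_cv(values)
        if cv_percent < threshold_percent:
            below += 1
    return below / len(protein_abundances)
```

Applied to one group at a time, `fraction_with_cv_below` gives the kind of summary reported above (the proportion of proteins with CV under 10%).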

Conclusions
This study compared the amino acid sequences of CagA and the proteomic profiles of East Asian and Western H. pylori strains and found many significant and previously unreported differences. The results provide significant evidence of new differential CagA sequences and proteomic profiles between East Asian and Western strains, which may serve as new study targets to determine the pathogenesis of H. pylori and to elucidate the mechanism underlying the development and progression of H. pylori-induced gastric diseases.
identified isolates were used from the 3rd to 5th passage after purification.

Methods
Acquisition of full-length sequences of H. pylori cagA
DNA was extracted from identified H. pylori strains. The full-length cagA sequence was amplified by PCR (sense primer: 5′-AACAATGACTAACGAAACCA-3′, antisense primer: 5′-TAAAGAATGGCTCAAATTGT-3′, about 4000 bp) and cloned into the pMD18-T plasmid to construct the pMD18-T/cagA vector, which was verified by sequencing. The cagA sequences successfully obtained were submitted to the GenBank database.

Sequence alignment and clustering analysis of CagA
Sequences of the cagA gene were translated into amino acid sequences with DNAstar software, and multiple sequence alignment and cluster analysis were performed with Clustal Omega software based on the amino acid sequences of CagA.
Next, 150 amino acid sequences of H. pylori CagA, including our isolates, were randomly obtained from the GenBank database. MEGA 4.0 software was used to construct CagA-based molecular phylogenetic trees with the Neighbor-Joining and p-distance methods.
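For readers unfamiliar with the p-distance, it is the proportion of aligned positions at which two sequences differ. The sketch below computes it for pre-aligned sequences of equal length; it assumes that positions with a gap in either sequence are skipped (pairwise deletion), which may differ from the gap setting used in MEGA for this study, and it does not implement the Neighbor-Joining step. The function names are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites between two aligned sequences.

    Assumption: positions where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def distance_matrix(aligned):
    """Pairwise p-distance matrix for a dict mapping name -> aligned sequence."""
    names = list(aligned)
    return {
        a: {b: p_distance(aligned[a], aligned[b]) for b in names}
        for a in names
    }
```

For example, `p_distance("EPIYA", "EPIYT")` is 0.2, because one of the five compared residues differs. A matrix of such distances is the input that a Neighbor-Joining implementation would cluster into a tree.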

Isobaric tags for relative and absolute quantitation (iTRAQ)
Six H. pylori strains, including three East Asian strains and three Western strains, were taken out of liquid nitrogen stocks at the same time and resuscitated by culturing them for 3-5 days on Columbia agar plates containing 10% sheep blood and H. pylori selective supplement. The bacteria were then cultured overnight in liquid medium under microaerobic conditions and harvested by centrifugation. The six samples were ground in liquid nitrogen. Bacterial protein was extracted by sonication (60 s, 0.2 s on, 2 s off, amplitude 22%) in lysis buffer (7 M urea, 2 M thiourea, 4% CHAPS, and 50 ml protease inhibitor cocktail) and quantified by Bradford assay. From each sample, 200 μg of protein was taken and reduced with 1 μl of 25 mM DTT at 60 °C for 1 hour. Then, 0.5 μl of 50 mM iodoacetamide was added for alkylation and incubated for another 10 min at room temperature. Subsequently, the protein samples were centrifuged at 12,000 g for 20 min in a 10K ultrafilter device (Millipore), then washed three times with 100 μl dissolution buffer from the iTRAQ Reagents 8-plex kit (AB Sciex, Framingham, MA, 8390812) and centrifuged for 20 min at 12,000 g. Finally, 50 μl containing 4 μg trypsin (Promega) was added to the ultrafilter device and incubated overnight at 37 °C to digest the proteins.
On the following day, the digested peptide solution was collected by centrifugation at 12,000 g for 20 min, and peptides were labeled with iTRAQ reagents according to the manufacturer's instructions. For each 100 μg of protein, one unit of labeling reagent was used; labeling was performed for two hours and stopped by adding 100 μl of water. The following labels were used for the samples: 113, 115, and 116 were used to label East Asian strains GZ1, GZ2, and GZ3, while 118, 119, and 121 were used to label Western strains NCTC11639, GZ5, and 26695. iTRAQ-labeled samples were mixed and dried by vacuum freeze centrifugation. The dried samples were frozen until further use.
[35,36], the probability computed by the Protein Prophet algorithm was introduced to protein identification by MS/MS. Peptide identifications were accepted if they could be established at greater than 91% probability to achieve an FDR (false discovery rate) of less than 1.0% by the Scaffold Local FDR algorithm. Protein identifications were accepted if they could be established at greater than 85% probability to achieve an FDR of less than 10% and contained at least 1 identified peptide [37]. Protein probabilities were assigned by the Protein Prophet algorithm. Proteins that contained similar peptides and could not be differentiated based on MS/MS analysis alone were grouped to satisfy the principles of parsimony. Normalization was performed iteratively (across samples and spectra) on intensities. Medians were used for averaging.
Spectra data were log-transformed, pruned of those matched to multiple proteins, and weighted by an adaptive intensity weighting algorithm. Of 21287 spectra in the experiment at the given thresholds, 7668 (36%) were included in quantitation.

Identification of differentially expressed proteins
Six H. pylori strains were grouped into East Asian (3 strains) and Western (3 strains) groups. The normalized intensity of each protein in the two groups was acquired from the above-described quantitative data analysis. For each protein, the average normalized intensity was calculated separately for the East Asian group and the Western group, and the fold change (FC) of the protein was defined as the ratio of the Western group to the East Asian group. The differentially expressed proteins of the Western group versus the East Asian group were identified according to the criteria of Bonferroni-corrected P ≤ 0.01 and fold change ≥1.2 (up-regulation) or ≤0.8 (down-regulation) [38,39].
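The selection rule above can be sketched in a few lines. This is an illustrative sketch only: it assumes that a raw P value per protein is already available (the statistical test is not specified here), applies the Bonferroni correction by multiplying each raw P value by the number of tested proteins, and uses hypothetical function and variable names.

```python
from statistics import mean


def differential_proteins(western, east_asian, raw_p,
                          alpha=0.01, up=1.2, down=0.8):
    """Select proteins by Bonferroni-corrected P value and fold change.

    western / east_asian: dict of protein ID -> normalized intensities of the
    strains in that group. raw_p: dict of protein ID -> uncorrected P value
    (assumed to be supplied by the caller).
    """
    n_tests = len(raw_p)
    selected = {}
    for protein, p in raw_p.items():
        # Fold change is defined as Western mean / East Asian mean.
        fold_change = mean(western[protein]) / mean(east_asian[protein])
        p_corrected = min(1.0, p * n_tests)  # Bonferroni correction
        if p_corrected > alpha:
            continue
        if fold_change >= up:
            selected[protein] = ("up", fold_change, p_corrected)
        elif fold_change <= down:
            selected[protein] = ("down", fold_change, p_corrected)
    return selected
```

Proteins whose fold change lies strictly between 0.8 and 1.2 are dropped regardless of their P value, which mirrors the two-part criterion stated above.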

Gene ontology (GO), KEGG pathway enrichment, and interaction network analysis
Gene ontology analysis of the differentially expressed proteins was conducted using the UniProt database and the DAVID 6.8 online analysis tool (https://david.ncifcrf.gov/) and visualized with the GOplot R package. KEGG pathway enrichment was carried out with the KEGG pathway database and the KOBAS 3.0 online tool (http://kobas.cbi.pku.edu.cn/); the protein-protein interaction (PPI) network was constructed using STRING 10.0 (https://string-db.org). Pathways and networks were visualized with Cytoscape software (version 3.7.1). GO items/pathways with an FDR-corrected P value < 0.05 were considered significantly different [40].
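The text does not state which FDR procedure the online tools applied. As an illustration only, the sketch below implements the Benjamini-Hochberg adjustment, a common choice for enrichment analysis; treating it as the method used here is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order.

    Assumption: BH is used as the FDR procedure; the source only says
    "FDR-corrected".
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A term or pathway would then be called significant when its adjusted value is below 0.05, matching the threshold quoted above.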

RT-qPCR
mRNA levels of five differential proteins were determined by RT-qPCR. The 16S rDNA gene of H. pylori was used to normalize the expression level of the target genes. The primer sequences used in this study are listed in Table 3.

Declarations
ZJJ contributed to study concept and design, analysis, review and drafting of the manuscript. ZY and XY were responsible for the data collection of the manuscript. XL, QXY and ZQF were responsible for the data collection, analysis, and interpretation of the manuscript. WQR, WWL and ZQF helped with data collection, and review of the manuscript. LAZ and ZL contributed to study concept, critical review and revision of the manuscript. All authors have read and approved the final manuscript.

Figure 3
Phylogenic tree of 150 H. pylori strains based on CagA sequences. One hundred and fifty sequences of H. pylori-CagA were obtained from the GenBank database to construct the CagA-based molecular phylogenic tree.

Figure 4
Differentially expressed proteins between East Asian and Western H. pylori strains. Three East Asian and three Western strains of H. pylori were selected to conduct iTRAQ-based quantitative proteomics analysis.
a Differentially expressed proteins between Western and East Asian groups. b Volcano plot of differential proteins in the Western group compared to the East Asian group. Red and green dots represent up-regulation and down-regulation, respectively, in the Western group compared to the East Asian group. c Hierarchical clustering graph of differential proteins between the Western and East Asian groups. Six H. pylori strains, including three East Asian strains (GZ3, GZ1 and GZ7) and three Western strains (GZ5, Hp11639 and Hp26695), were simultaneously collected for iTRAQ-based LC-MS/MS analysis. The proteomics data were derived from three independent biological experiments. Red represents up-regulation and blue represents down-regulation. d Principal Component Analysis (PCA) of six samples. e mRNA levels of five differential proteins were determined by RT-qPCR. Total RNA of 10 East Asian strains and 10 Western strains was extracted and reverse-transcribed into cDNA using the PrimeScript RT Reagent Kit with gDNA Eraser. qPCR was performed using the SYBR Green I real-time PCR method with two-step reactions. The 2^(-ΔΔCt) method was used to calculate the relative expression level of the target genes, with the East Asian group set as 1. The RT-qPCR analysis was performed with 10 biological replicates, and each sample had three technical replicates. The data are presented as the mean ± SD, and a two-sided Student's t-test was used for statistical analysis in GraphPad Prism 8.
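The 2^(-ΔΔCt) calculation named for panel e can be sketched as follows. This is a minimal illustration, assuming that each sample contributes one (target Ct, reference Ct) pair after averaging its technical replicates, that the reference is the 16S rDNA gene, and that the calibrator is the mean ΔCt of the East Asian group; the function names are hypothetical.

```python
from statistics import mean


def delta_ct(ct_target, ct_reference):
    """ΔCt: target gene Ct minus reference gene Ct for one sample."""
    return ct_target - ct_reference


def relative_expression(sample_pairs, calibrator_pairs):
    """Relative expression 2^(-ΔΔCt) for each sample.

    Each pair is (target Ct, reference Ct). The calibrator is the mean ΔCt of
    the calibrator group (assumed here to be the East Asian group).
    """
    calibrator = mean([delta_ct(t, r) for t, r in calibrator_pairs])
    return [2.0 ** -(delta_ct(t, r) - calibrator) for t, r in sample_pairs]
```

With this convention, a sample whose ΔCt is one cycle lower than the calibrator mean has a relative expression of 2, that is, twice the calibrator level.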