Subjects and initial bioinformatics data of Helicobacter pylori
Blood samples for DNA extraction and gastric biopsies were taken from 10 patients with a mean age of ~40 years from the Andean zone of Tuquerres, and 9 samples were taken from patients from the Pacific coast city of Tumaco, Colombia. The sequences were annotated with the NCBI Prokaryotic Genome Annotation Pipeline. The genomes were sequenced by our group in collaboration with Vanderbilt University and are available in the NCBI database (Table 1) (29,30).
Table 1. Complete genome description of H. pylori strains in NCBI from two regions of the department of Nariño, Colombia
| Risk | Isolate ID | Access code | Diagnosis | Genome size (bp) | No. of contigs |
|---|---|---|---|---|---|
| HR^a | SV328_2 | MTWO00000000 | Dys^c | 1,645,479 | 56 |
| HR | SV340_2 | MTWP00000000 | NAG^d | 1,633,298 | 53 |
| HR | SV355_2 | MTWQ00000000 | IM^e | 1,635,304 | 39 |
| HR | SV376_1 | MTWR00000000 | IM | 1,691,791 | 207 |
| HR | SV380_1 | MTWU00000000 | IM | 1,631,819 | 40 |
| HR | SV397_2 | MTWS00000000 | NAG | 1,668,205 | 47 |
| HR | SV449_1 | MTWT00000000 | IM | 1,654,884 | 41 |
| HR | PZ5056* | ASYU00000000 | NAG | 1,578,164 | 335 |
| HR | PZ5080* | ASYV00000000 | IM | 1,597,127 | 283 |
| HR | PZ5086* | ASYW00000000 | NAG | 1,547,845 | 295 |
| LR^b | PZ5005_3A3 | MTWJ00000000 | NAG | 1,672,956 | 51 |
| LR | PZ5006_3A3 | MTWK00000000 | NAG | 1,643,170 | 53 |
| LR | PZ5009_3A2 | MSYO00000000 | NAG | 1,677,035 | 53 |
| LR | PZ5016_3A3 | MTWL00000000 | MAG^f | 1,644,424 | 44 |
| LR | PZ5019_3A3 | MTWM00000000 | IM | 1,681,561 | 44 |
| LR | PZ5033_3A2 | MTWN00000000 | IM | 1,656,908 | 60 |
| LR | PZ5004* | ASZF00000000 | NAG | 1,569,902 | 303 |
| LR | PZ5024* | ASYS00000000 | NAG | 1,496,849 | 413 |
| LR | PZ5026* | ASYT00000000 | NAG | 1,604,992 | 253 |
^a HR, high risk: host is a resident of Tuquerres, where the risk of gastric cancer is high.
^b LR, low risk: host is a resident of Tumaco, where the risk of gastric cancer is low.
^c Dys, dysplasia.
^d NAG, nonatrophic gastritis.
^e IM, intestinal metaplasia.
^f MAG, multifocal atrophic gastritis.
* Data from previous work by Sheh et al., 2013 (29).
The blood samples and gastric biopsies were coded as follows: SV328_2, SV340_2, SV355_2, SV376_1, SV380_1, SV397_2 and SV449_1 for Tuquerres, and PZ5005_3A3, PZ5006_3A3, PZ5009_3A2, PZ5016_3A3, PZ5019_3A3 and PZ5033_3A2 for Tumaco (29). All participants provided informed consent, and the study was approved by the institutional and local hospital review boards. The bioinformatics analyses were performed during January and February 2021. The human samples were genotyped using a previously reported Immunochip (25), which identifies around 196 × 10³ SNPs in genes involved in immune disorders. Human ancestry was estimated with the admixture model of STRUCTURE, assuming correlated allele frequencies (50,000 iterations after a burn-in of 50,000 iterations).
The reference populations used in this study were previously published by the Human Genome Diversity Project and comprise European, Amerindian and African ancestries (26,27). The number of tentative populations (K) was set from 1 to 3, and 10 runs were executed for each K. The STRUCTURE results (admixture model) showed that the model probability was maximized at K = 3 (14). CLUMPP was used to collate replicate runs and to calculate the mean individual ancestry (28).
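The selection of K and the averaging of replicate runs can be sketched as follows. This is a minimal illustration and not the STRUCTURE or CLUMPP implementation: the function names `best_k` and `mean_ancestry` are hypothetical, the numbers are placeholders rather than results from this study, and the averaging step assumes that cluster labels have already been matched across runs (the task CLUMPP performs).

```python
from statistics import mean


def best_k(ln_prob_by_k):
    """Return the K whose replicate runs have the highest mean ln P(D)."""
    return max(ln_prob_by_k, key=lambda k: mean(ln_prob_by_k[k]))


def mean_ancestry(q_runs):
    """Average ancestry proportions over replicate runs.

    q_runs is a list of runs; each run is a list with one entry per
    individual, holding that individual's K ancestry proportions.
    Assumption: cluster labels are already aligned across runs.
    """
    n_runs = len(q_runs)
    n_individuals = len(q_runs[0])
    n_clusters = len(q_runs[0][0])
    return [
        [sum(run[i][k] for run in q_runs) / n_runs for k in range(n_clusters)]
        for i in range(n_individuals)
    ]


# Placeholder values for illustration only (not data from this study).
example_ln_prob = {
    1: [-5200.0, -5201.5],
    2: [-5050.2, -5049.8],
    3: [-4980.1, -4979.9],
}
example_q_runs = [
    [[0.50, 0.25, 0.25], [0.10, 0.80, 0.10]],
    [[0.25, 0.50, 0.25], [0.20, 0.60, 0.20]],
]

if __name__ == "__main__":
    print("K with highest mean ln P(D):", best_k(example_ln_prob))
    print("Mean ancestry per individual:", mean_ancestry(example_q_runs))
```

With real output, each K would have 10 entries (one per run), and each Q matrix would have one row per genotyped individual.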
For the phylogenetic modelling we used African, Asian, European and Native American reference genomes (21). We also included genomes from populations of Managua (Nicaragua) and Mexico City (Mexico) from a previous study (23). The genomes from Colombia comprised 7 isolates from Bogota (CG22366, CA22327, CA22311, CA22339, CA22312, CM22360, CM22351), 10 from Cundinamarca department (CC22402, CC26084, CC26093, CM22346, CA22337, CM22341, CG22389, CG22322, CA22393, CM22388), 14 from Boyaca department (CG22025, CG22087, CG22023, CM22046, CM22013, CG22367, CM22021, CG22370, CM22331, CM22368, CA22020, CM22315, CA22335, CM22347), one from Caldas department (CM22390), one from Caqueta department (CC26100), one from Meta department (CA26024), four from Santander department (CG22385, CG22378, CA22019, CA22095) and two from Tolima department (CA22362, CA24004) (22).
Multilocus Sequence Typing (MLST) analysis based on genomes
The housekeeping genes atpA, efp, mutY, ppa, trpC, ureI, and yphC were annotated using PubMLST (https://pubmlst.org/helicobacter/), and the sequences were selected, downloaded and concatenated. The concatenated sequences were aligned using MUSCLE (31). The phylogenetic tree was inferred by neighbor-joining (32) under the T92+G+I evolutionary model (Tamura 3-parameter with gamma-distributed rate variation and invariable sites). The bootstrap analysis was done with 1,000 replicates, and the phylogenetic tree was edited in iTOL v3.
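The concatenation step can be illustrated with a short sketch. It assumes that the allele sequence of each locus has already been retrieved for every isolate; the function names `concatenate_loci` and `to_fasta` are hypothetical, and the locus order simply follows the order in which the genes are listed above.

```python
# Locus order as listed in the text; any fixed order works as long as it
# is the same for every isolate.
MLST_LOCI = ["atpA", "efp", "mutY", "ppa", "trpC", "ureI", "yphC"]


def concatenate_loci(alleles, loci=MLST_LOCI):
    """Join the sequences of one isolate in a fixed locus order.

    alleles maps locus name -> nucleotide sequence. A KeyError is raised
    if any locus is missing, so incomplete isolates are not silently
    concatenated into shorter sequences.
    """
    missing = [locus for locus in loci if locus not in alleles]
    if missing:
        raise KeyError("missing loci: " + ", ".join(missing))
    return "".join(alleles[locus].upper() for locus in loci)


def to_fasta(records, width=60):
    """Format a dict of name -> sequence as FASTA text."""
    lines = []
    for name, sequence in records.items():
        lines.append(">" + name)
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n"
```

The resulting FASTA text, with one concatenated record per isolate, is the kind of input that would then be passed to the aligner.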
Helicobacter pylori phylogenomic analysis
For the core genome analysis, all sequences were imported into the Bacterial Isolate Genome Sequence Database (BIGSdb) (34). A gene-by-gene alignment was then performed using the coding sequences (CDS) of the African H. pylori strain J99 as reference, and the alignment was exported from the database. The output matrix of the genome comparison obtained with BIGSdb was used to build the phylogenomic tree in MEGA v7 (35).
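The idea behind a gene-by-gene core genome comparison can be sketched in a few lines. This is a simplified stand-in and not the BIGSdb implementation: it assumes a table of allele calls per isolate and locus (with `None` for a missing locus), and the names `core_loci` and `allele_differences` are hypothetical.

```python
def core_loci(allele_table):
    """Return the sorted loci that have an allele call in every isolate.

    allele_table maps isolate name -> {locus name: allele id or None}.
    """
    per_isolate = [
        {locus for locus, allele in calls.items() if allele is not None}
        for calls in allele_table.values()
    ]
    if not per_isolate:
        return []
    return sorted(set.intersection(*per_isolate))


def allele_differences(calls_a, calls_b, loci):
    """Count the loci at which two isolates carry different alleles."""
    return sum(calls_a[locus] != calls_b[locus] for locus in loci)
```

Restricting the comparison to loci present in all isolates keeps missing genes in draft assemblies from being counted as differences.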
The SNP-based phylogenomic analysis was carried out using CSI Phylogeny (36) with the default parameters. The genome assemblies were analyzed with the following parameters: minimum depth at SNP positions of 10; relative depth at SNP positions of 10; minimum distance between SNPs (prune) of 10; minimum SNP quality of 30; minimum read mapping quality of 25; and minimum Z-score of 1.96, corresponding to p < 0.05. The reads were mapped to the reference genome J99 with BWA mem, and the SNPs were called with the mpileup tool from SAMtools (37). The SNPs were filtered according to these parameters to obtain a high-quality matrix. The SNP matrix was built by evaluating all positions for each genome; the resulting sequences were concatenated into a multiple FASTA file used for the maximum-likelihood phylogenetic analysis, in which we found 175,856 SNPs. The results were visualized and edited with FigTree v1.4.0 (33).
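The filtering thresholds listed above can be expressed as a small sketch. It is not the CSI Phylogeny code: the `SnpCall` record and the function names are hypothetical, the relative-depth criterion is omitted, and the pruning rule used here (keep the first passing SNP and drop any later one closer than the prune distance) is an assumption that may differ from the tool's actual behavior. The helper `two_sided_p` shows why a Z-score of 1.96 corresponds to p < 0.05 under a standard normal distribution.

```python
import math
from dataclasses import dataclass


@dataclass
class SnpCall:
    position: int           # coordinate on the reference genome
    depth: int              # read depth at the position
    quality: float          # SNP quality score
    mapping_quality: float  # read mapping quality
    z_score: float          # Z-score of the call


def two_sided_p(z):
    """Two-sided p-value of a Z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def filter_snps(calls, min_depth=10, min_quality=30.0, min_mapq=25.0,
                min_z=1.96, prune=10):
    """Apply the thresholds, then prune SNPs that lie too close together."""
    passing = [
        call for call in sorted(calls, key=lambda c: c.position)
        if call.depth >= min_depth
        and call.quality >= min_quality
        and call.mapping_quality >= min_mapq
        and call.z_score >= min_z
    ]
    kept = []
    for call in passing:
        # Assumed pruning rule: skip a SNP closer than `prune` bases
        # to the previously kept one.
        if kept and call.position - kept[-1].position < prune:
            continue
        kept.append(call)
    return kept
```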
VacA cytotoxin and AlpA adhesin phylogenetic analysis
Phylogenetic analyses of the virulence gene vacA and the adhesin gene alpA were performed. The sequences were curated and aligned using MUSCLE (31). We used Gblocks (38) to determine the parsimony-informative sites, given the high diversity of the genes. For vacA, the evolutionary model that best fitted the alignment was the General Time Reversible model GTR+G+I, with a Bayesian information criterion BIC = 125767.110 and lnL = -60531.321. Rate variation among sites was modeled with a gamma distribution (parameter = 0.57). The analysis involved 196 DNA sequences with a length of 2,948 bp. The vacA phylogenetic tree was inferred by maximum likelihood with 1,000 bootstrap replicates in PhyML v3.0 (39).
For alpA, the best-fitting evolutionary model was also the General Time Reversible model GTR+G+I, with BIC = 43727.101 and lnL = -19161.291. Rate variation among sites was modeled with a gamma distribution (parameter = 0.66). The analysis involved 215 DNA sequences with a length of 1,093 bp. The alpA phylogenetic tree was inferred by maximum likelihood with 1,000 bootstrap replicates in PhyML v3.0 (39). The phylogenetic trees were visualized and edited with FigTree v1.4.0 (33).
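Model selection by BIC, as used above for vacA and alpA, follows the standard definition BIC = k ln(n) - 2 lnL, where k is the number of free parameters and n the sample size. The sketch below is generic: the function names `bic` and `best_model` are hypothetical, and because the text does not report k or n, it is not meant to reproduce the BIC values given for the two genes.

```python
import math


def bic(ln_likelihood, n_parameters, n_observations):
    """Bayesian information criterion: k * ln(n) - 2 * lnL."""
    return n_parameters * math.log(n_observations) - 2.0 * ln_likelihood


def best_model(candidates, n_observations):
    """Return the name of the candidate model with the lowest BIC.

    candidates maps model name -> (lnL, number of free parameters).
    """
    return min(
        candidates,
        key=lambda name: bic(
            candidates[name][0], candidates[name][1], n_observations
        ),
    )
```

A model with more parameters is preferred only when its gain in log-likelihood outweighs the k ln(n) penalty.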