Development and Implementation of PmGT
Previous studies have reported primers specific for P. multocida (KMT1) and its five capsular genotypes (A, B, D, E, F) [14], eight LPS genotypes (L1 to L8) [15], and 23 virulence factor-encoding genes (VFGs) commonly detected in epidemiological studies (ptfA, fimA, hsf-1, hsf-2, pfhA, tadD, toxA, exbB, exbD, tonB, hgbA, hgbB, fur, tbpA, nanB, nanH, pmHAS, ompA, ompH, oma87, plpB, sodA and sodC) [5]. We therefore extracted the nucleotide sequences targeted by these primers (Supplementary Txt 1) from the complete genome sequences of P. multocida strains that we had sequenced previously, including HB01 (serogroup A) [24], HN04 (serogroup B) [23], HN06 (serogroup D, toxin-producing) [25], and HN07 (serogroup F) [26]. These nucleotide sequences were stored on a CentOS server. We then downloaded the BLAST package from the NCBI website (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/) and installed it on the same server. A PHP program was developed to call the BLAST algorithm and obtain the genotyping results for capsular and LPS genotype queries. When a user selects the MLST genotyping service, PHP cURL (Client URL Library) functions are used to request and retrieve the results through the RESTful service interface provided by the Public Database for Molecular Typing (PubMLST, https://pubmlst.org). The general genotyping process is as follows: a query sequence submitted through the web user interface is sent to the CentOS server over HTTP. The sequence is then checked by the PHP program, and a sequence that passes this check is searched with BLAST against the genotype database; the result is returned to the webpage through the PHP program (Figs. 1A and 1B). Through these procedures, the genotyping module of PmGT (http://liulab.hzau.edu.cn/PM/) was developed (Fig. 1).
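The step in which BLAST hits are turned into a genotype call can be illustrated with a minimal Python sketch. It is not the PHP code used by PmGT. It assumes BLAST tabular output (`-outfmt 6`, 12 standard columns), subject identifiers of the hypothetical form "scheme|type" (for example "capsule|A"), and illustrative identity and length cut-offs, none of which are specified in the text above. The example hit lines are invented.

```python
# Hypothetical cut-offs; the thresholds actually applied by PmGT are not stated.
MIN_IDENTITY = 90.0   # percent identity
MIN_LENGTH = 100      # aligned length in bp


def best_hits(blast_tsv, min_identity=MIN_IDENTITY, min_length=MIN_LENGTH):
    """Return the best-scoring type per scheme from BLAST tabular output.

    Subject IDs are assumed to look like "<scheme>|<type>", e.g. "lps|L3".
    """
    best = {}
    for line in blast_tsv.strip().splitlines():
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # skip malformed lines
        sseqid = cols[1]
        pident, length, bitscore = float(cols[2]), int(cols[3]), float(cols[11])
        if pident < min_identity or length < min_length:
            continue  # hit too weak to support a genotype call
        scheme, _, genotype = sseqid.partition("|")
        if scheme not in best or bitscore > best[scheme][1]:
            best[scheme] = (genotype, bitscore)
    return {scheme: genotype for scheme, (genotype, _) in best.items()}


# Invented hits for one query genome (tab-separated, 12 columns each).
example = "\n".join([
    "contig1\tcapsule|A\t99.5\t1044\t5\t0\t1\t1044\t1\t1044\t0.0\t1900",
    "contig1\tcapsule|F\t82.0\t300\t54\t0\t1\t300\t1\t300\t1e-50\t200",
    "contig2\tlps|L3\t98.9\t1300\t14\t0\t1\t1300\t1\t1300\t0.0\t2300",
    "contig2\tlps|L6\t95.0\t400\t20\t0\t1\t400\t1\t400\t1e-150\t600",
])
print(best_hits(example))  # {'capsule': 'A', 'lps': 'L3'}
```

Keeping only the highest bit score per scheme is one simple decision rule; other rules, such as requiring a minimum coverage of the marker sequence, would fit the same structure.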
Because P. multocida strains colonize a wide spectrum of host species, we intended to develop a model to predict the host tropism of emerging P. multocida strains from their genome sequences. Many epidemiological and genomic studies on P. multocida have revealed that VFGs correlate with host species [5, 19, 27]. These findings make it possible to infer the host tropism of a P. multocida strain by analyzing the VFGs carried in its whole genome sequence. As of 31 May 2020, 262 P. multocida genome sequences were publicly available through the NCBI genome database. These sequences originate from strains isolated from pigs (n = 66), poultry and wild birds (n = 39), cattle and other bovine species (n = 106), Canis lupus (n = 3), cats (n = 2), humans (n = 13), horses (n = 2), rabbits and other leporine species (n = 20), rodents (n = 2), sheep and other ovine species (n = 6), and alpacas (Vicugna pacos; n = 2) (Supplementary Table S1). In addition, there is one synthetic DNA sequence. Because few genome sequences of P. multocida isolates from hosts other than porcine, bovine, and avian species are publicly available in NCBI, we used the genome sequences of porcine, bovine, and avian isolates to develop and test the host tropism prediction model. Approximately 70% of the genome sequences of P. multocida of porcine (n = 16), bovine (n = 65), and avian origin (n = 19) were aligned against the nucleotide sequences of the 23 VFGs described above using BLAST. The BLAST scores were then used as input features for six machine learning models (Entropy Decision Tree [eDTA], Gini DTA [gDTA], Brute K-Nearest Neighbor [bKNN], Ball Tree KNN [btKNN], Gaussian Naive Bayes [GNB], and Complement Naive Bayes [CNB]), and the precision, recall, and F1 score of each model were calculated as described in the Methods section.
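As an illustration of how BLAST scores could serve as model features, the following Python sketch builds one fixed-length vector per genome, with one entry for each of the 23 VFGs. It assumes that the "BLAST score" is the bit score of the best hit and that an absent gene is encoded as 0; both are assumptions, and the example hits are invented.

```python
# The 23 VFGs named in the text, in a fixed order that defines the feature columns.
VFGS = ["ptfA", "fimA", "hsf-1", "hsf-2", "pfhA", "tadD", "toxA", "exbB",
        "exbD", "tonB", "hgbA", "hgbB", "fur", "tbpA", "nanB", "nanH",
        "pmHAS", "ompA", "ompH", "oma87", "plpB", "sodA", "sodC"]
VFG_SET = set(VFGS)


def feature_vector(hits):
    """Build a 23-element feature vector from (gene, bit_score) pairs.

    Assumption: the best bit score per gene is the feature, and a gene
    with no hit is encoded as 0.0.
    """
    best = {}
    for gene, score in hits:
        if gene in VFG_SET:
            best[gene] = max(score, best.get(gene, 0.0))
    return [best.get(gene, 0.0) for gene in VFGS]


# Invented hits for one genome; "unknown" is ignored because it is not a VFG.
hits = [("toxA", 2100.0), ("ptfA", 850.0), ("ptfA", 790.0), ("unknown", 50.0)]
vec = feature_vector(hits)
print(len(vec), vec[VFGS.index("ptfA")], vec[VFGS.index("toxA")])
```

Stacking such vectors row by row yields a genomes-by-23 matrix that can be passed to any of the classifiers listed above.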
For each model, three metrics (micro-, macro-, and weighted-average F1 scores) were calculated through 10-fold cross-validation. The results revealed that the Decision Tree models showed higher average F1 scores overall than the KNN and the Bayesian models (Fig. 2A and Supplementary Figures S1A, S1B, S1C). Therefore, the Decision Tree algorithm was chosen to construct the host tropism prediction model. Scikit-learn (sklearn) and NumPy in Python were used to implement these findings as an automated machine learning model, which is available at http://liulab.hzau.edu.cn/PM/model.php.
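The three averaging schemes differ only in how per-class results are combined. The following standard-library Python sketch shows the arithmetic on a single set of predictions; it is a worked illustration with invented labels, not the evaluation code used in this study, and it does not perform the cross-validation itself.

```python
from collections import Counter


def f1_averages(y_true, y_pred):
    """Return micro-, macro-, and weighted-average F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    # Micro: pool all counts first; macro: unweighted mean of class scores;
    # weighted: mean of class scores weighted by the number of true samples.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Invented example: 8 genomes, 6 predicted correctly.
y_true = ["bovine"] * 4 + ["porcine"] * 2 + ["avian"] * 2
y_pred = ["bovine", "bovine", "bovine", "porcine",
          "porcine", "porcine", "avian", "bovine"]
scores = f1_averages(y_true, y_pred)
print({k: round(v, 3) for k, v in scores.items()})
# {'micro': 0.75, 'macro': 0.739, 'weighted': 0.742}
```

For single-label classification the micro-average F1 equals the overall accuracy (here 6/8), whereas the macro average gives each host class equal weight regardless of how many genomes it contains.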
Currently, PmGT provides the above services through six menus: (1) the “Home” page gives a brief introduction to the etiological characteristics of P. multocida to help users understand the bacterium; (2) the “Organisms” page displays the genotypes of P. multocida strains whose whole genome sequences are publicly available in NCBI, and provides links for users to download these genomes from NCBI; (3) the “Genotyping” page enables users to determine whether a putative isolate is P. multocida and to genotype P. multocida strains using the whole genome sequence assembled from sequencing reads (Fig. 1C); (4) the “Host Prediction” page enables users to predict the host tropism of P. multocida isolates by submitting their whole genome sequences (currently, prediction is available only for porcine, bovine, avian, and/or human isolates because of the limited number of genome sequences of P. multocida from other hosts in NCBI) (Fig. 1D); (5) the “About” page summarizes the guidelines for using this web tool; and (6) the “Contact” page provides the contact information of the developers.
PmGT shows the same accuracy as PCR methods in genotyping P. multocida strains
To test the accuracy of PmGT, we used two methods to type 52 P. multocida isolates (HB01, HB02, HB03, HN04, HN05, HN06, HN07, HNA01 to HNA22, HND01 to HND21, HNF01, and HNF02) from our laboratory collection [23]. First, we submitted their whole genome sequences to PmGT for genotyping. For comparison, we also determined the capsular genotypes, LPS genotypes, sequence types, and the profile of the 23 virulence genes mentioned above by PCR assays. All 52 strains were genotyped through this online platform (Table 1). Genotyping by PCR assays confirmed the capsular, LPS, and MLST genotypes. The PCR results for capsular and LPS genotypes are provided in Supplementary Figures S2 and S3.
Table 1
Genotypes of 52 Pasteurella multocida strains determined via the PmGT Platform
Strain | Capsular genotype | LPS genotype | MLST genotype (Sequence type) | GenBank accession numbers |
HB01 | A | L3 | ST1 | CP006976 |
HB02 | A | L1 | ST128 | LYOX00000000 |
HB03 | A | L3 | ST3 | CP003328 |
HN04 | B | L2 | ST44 | PPVE00000000 |
HN05 | D | L6 | ST11 | PPVF00000000 |
HN06 | D | L6 | ST11 | CP003313 |
HN07 | F | L3 | ST12 | CP007040 |
HNA01 | A | L3 | ST133 | PPVG00000000 |
HNA02 | A | L6 | ST10 | PPVH00000000 |
HNA03 | A | L3 | ST3 | PPVI00000000 |
HNA04 | A | L6 | ST10 | PPVJ00000000 |
HNA05 | A | L6 | ST10 | PPVK00000000 |
HNA06 | A | L6 | ST10 | PPVL00000000 |
HNA07 | A | L6 | ST10 | PPVM00000000 |
HNA08 | A | L3 | ST3 | PPVN00000000 |
HNA09 | A | L3 | ST3 | PPVO00000000 |
HNA10 | A | L6 | ST10 | PPVP00000000 |
HNA11 | A | L6 | ST10 | PPVQ00000000 |
HNA12 | A | L6 | ST10 | PPVR00000000 |
HNA13 | A | L3 | ST3 | PPVS00000000 |
HNA14 | A | L3 | ST3 | PPVT00000000 |
HNA15 | A | L3 | ST3 | PPVU00000000 |
HNA16 | A | L6 | ST10 | PPVV00000000 |
HNA17 | A | L3 | ST3 | PPVW00000000 |
HNA18 | A | L3 | ST3 | PPVX00000000 |
HNA19 | A | L3 | ST3 | PPVY00000000 |
HNA20 | A | L3 | ST3 | PPVZ00000000 |
HNA21 | A | L6 | ST10 | PPWA00000000 |
HNA22 | A | L6 | ST10 | PPWB00000000 |
HND01 | D | L6 | ST11 | PPWC00000000 |
HND02 | D | L6 | ST134 | PPWD00000000 |
HND03 | D | L6 | ST11 | PPWE00000000 |
HND04 | D | L6 | ST11 | PPWF00000000 |
HND05 | D | L6 | ST11 | PPWG00000000 |
HND06 | D | L6 | ST11 | PPWH00000000 |
HND07 | D | L6 | ST11 | PPWI00000000 |
HND08 | D | L6 | ST11 | PPWJ00000000 |
HND09 | D | L6 | ST11 | PPWK00000000 |
HND10 | D | L6 | ST11 | PPWL00000000 |
HND11 | D | L6 | ST11 | PPWN00000000 |
HND12 | D | L6 | ST134 | PPWM00000000 |
HND13 | D | L6 | ST134 | PPWO00000000 |
HND14 | D | L6 | ST11 | PPWP00000000 |
HND15 | D | L6 | ST11 | PPWQ00000000 |
HND16 | D | L6 | ST11 | PPWR00000000 |
HND17 | D | L6 | ST11 | PPWS00000000 |
HND18 | D | L6 | ST11 | PPWT00000000 |
HND19 | D | L6 | ST11 | PPWU00000000 |
HND20 | D | L6 | ST11 | PPWV00000000 |
HND21 | D | L6 | ST11 | PPWW00000000 |
HNF01 | F | L3 | ST12 | PPWX00000000 |
HNF02 | F | L3 | ST12 | PPWY00000000 |
Determination of the 23 virulence genes in each of the 52 strains using this online system revealed that several genes (ptfA, fimA, oma87, and sodC) were broadly present in the genome sequences genotyped (Fig. 3). In contrast, several genes (hsf-1, hsf-2, pfhA, and tadD) were heterogeneously distributed, and notably, none of the 52 sequences genotyped carried the toxA or tbpA genes (Fig. 3). These results were also confirmed by PCR assays (Supplementary Table S1).
Genotypes of P. multocida from different hosts
To understand the genotypes of P. multocida strains circulating in different host species, the 262 whole genome sequences of P. multocida strains that were publicly available through the NCBI genome database as of 31 May 2020 were downloaded and genotyped with PmGT. The results revealed that P. multocida strains isolated from different host species showed a preference for specific capsular genotypes, LPS genotypes, and/or sequence types (Fig. 4). For example, most of the porcine strains were assigned to capsular genotypes A (52%) and D (39%), LPS genotypes L3 (36%) and L6 (61%), and sequence types ST3 (29%), ST11 (22%), and ST10 (34%), whereas most of the bovine strains were assigned to capsular genotypes A (72%) and B (28%), LPS genotypes L3 (67%) and L2 (27%), and sequence types ST1 (59%) and ST44 (25%) (Fig. 4). When the capsular and LPS genotypes were combined, most of the avian P. multocida were typed as A:L1 and A:L3, and most of the bovine P. multocida as A:L3 and B:L2; the porcine P. multocida mainly belonged to D:L6, A:L3, and A:L6, the leporine P. multocida mainly to A:L3, and most of the human P. multocida were typed as A:L3 and A:L1 (Fig. 5A). When the capsular, LPS, and MLST genotypes were combined, most of the avian P. multocida were typed as A:L1:ST128 (Fig. 5B), and most of the bovine P. multocida as A:L3:ST1 and B:L2:ST44 (Fig. 5C); the porcine P. multocida mainly belonged to D:L6:ST11, A:L3:ST3, and A:L6:ST10 (Fig. 5I), and the leporine P. multocida mainly to A:L3:ST12 (Fig. 5H).
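The combined genotypes above are simply the capsular genotype, LPS genotype, and sequence type joined with colons and tallied per host. A minimal Python sketch of that tabulation follows; the function names are illustrative, and the four example records are the strains named elsewhere in this section (Pm70, HN07, HN04, and ATTK), not the full 262-genome dataset.

```python
from collections import Counter, defaultdict


def composite_counts(records):
    """Count capsule:LPS:ST combinations per host.

    records: iterable of (host, capsule, lps, sequence_type) tuples.
    """
    by_host = defaultdict(Counter)
    for host, capsule, lps, st in records:
        by_host[host][f"{capsule}:{lps}:{st}"] += 1
    return by_host


def shares(counter):
    """Convert counts into percentages of the host's total."""
    total = sum(counter.values())
    return {k: round(100 * n / total, 1) for k, n in counter.items()}


records = [
    ("avian", "F", "L3", "ST25"),    # Pm70
    ("porcine", "F", "L3", "ST12"),  # HN07
    ("porcine", "B", "L2", "ST44"),  # HN04
    ("bovine", "B", "L2", "ST44"),   # ATTK
]
counts = composite_counts(records)
print(shares(counts["porcine"]))  # {'F:L3:ST12': 50.0, 'B:L2:ST44': 50.0}
```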
Virulence genotyping with the system developed herein revealed that the presence of multiple VFGs, including ptfA, fimA, hsf-2, exbB, exbD, tonB, hgbA, hgbB, fur, nanB, nanH, ompA, ompH, oma87, plpB, sodA, and sodC, was a broad characteristic of P. multocida strains from multiple host species (Fig. 6). However, several VFGs were detected only in the genome sequences of P. multocida from certain hosts. For example, toxA, a gene encoding a dermonecrotic toxin, was found only in strains from pigs, sheep, and alpacas, whereas tbpA, a gene encoding a transferrin-binding protein, was found only in strains from cattle, sheep, and alpacas (Fig. 6).
PmGT is able to predict the host tropism of P. multocida
Using the Entropy Decision Tree algorithm, the correlation between P. multocida VFGs and host species was revealed (Fig. 2B). We then used the remaining 30% of the genome sequences of P. multocida of porcine (n = 3), bovine (n = 31), and avian origin (n = 10) to test the host tropism prediction model developed herein. The average micro-F1 score reached 0.898, indicating that the model could predict the host species of the tested strains. In particular, it correctly determined the host species of closely related P. multocida strains possessing the same genotypes (compare the result for the avian F:L3:ST25 isolate Pm70 with that for the porcine F:L3:ST12 isolate HN07 [Figs. 7A vs. 7B], and the result for the porcine B:L2:ST44 isolate HN04 with that for the bovine B:L2:ST44 isolate ATTK [Figs. 7C vs. 7D]).
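An entropy-based decision tree chooses, at each node, the feature and threshold that most reduce the Shannon entropy of the host labels. The following Python sketch computes that reduction (the information gain) for one candidate split. The scores and labels are invented for illustration, and the threshold of 700 is arbitrary; it is a worked example of the splitting criterion, not the trained PmGT model.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting samples at value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Weighted entropy of the two child nodes; empty children contribute 0.
    child = sum(len(part) / n * entropy(part) for part in (left, right) if part)
    return entropy(labels) - child


# Invented scores for one VFG in eight genomes (0 means no BLAST hit).
scores = [0, 0, 0, 0, 1500, 1400, 1600, 1450]
hosts = ["porcine"] * 4 + ["bovine"] * 4
print(entropy(hosts))                        # 1.0
print(information_gain(scores, hosts, 700))  # 1.0
```

Here the gene separates the two host classes perfectly, so the split removes all of the one bit of uncertainty; a gene present in every genome would yield a gain of 0 and would not be selected.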
P. multocida strains are also frequently recovered in human clinical settings [7]. To facilitate rapid diagnosis, we therefore also implemented host prediction for putative P. multocida strains from humans using the same principles, even though the publicly available genome sequences of P. multocida of human origin are still limited (only 13 sequences as of 31 May 2020). We used 9 sequences to develop the model and the remaining 4 sequences to test it. Despite this small sample size, the results showed that the model was still applicable to P. multocida of human origin (Fig. 7E).