Most bacteria exist in very large populations, and the combination of high growth rates, short generation times, extensive horizontal gene transfer (HGT), and strong selection can lead to very high diversity along with variable levels of clonality1. For many applications, notably infectious disease epidemiology, robust classification systems that pragmatically and reproducibly differentiate variants at high resolution are essential. Multi-locus sequence typing (MLST) was developed to solve this problem, indexing sequence variation in a limited number of housekeeping gene fragments, often as few as seven, without explicitly classifying them phylogenetically2. The sequence variation of these fragments is recorded as alleles, and combinations of alleles are recorded as sequence types (STs), which can be organised into groups or clonal complexes (ccs), sometimes referred to as 'eBurst groups' (eBGs)3. As sequencing capacity has increased, additional schemes with more loci have been introduced, including ribosomal MLST (rMLST, indexing the 53 ribosomal protein genes) and core genome MLST (cgMLST, indexing all shared genes in a particular population); however, seven-locus classifications remain widely understood and used as a cornerstone for bacterial typing4,5.
Whilst defining alleles and sequence types is straightforward, as they are effectively summaries of sequence variation, representing higher-level groups, such as clonal complexes, is more problematic. In addition to HGT confounding purely phylogenetic approaches, the existence of intermediate variants can result in all variants merging into a single group. These problems are less severe for schemes with very large numbers of loci, but for seven-locus MLST schemes, pragmatic solutions have been adopted, such as defining clonal complexes around a central genotype5. However, while these approaches establish a stable classification system, they can assign STs to incorrect clonal complexes, as they rely on assumptions about the representativeness of the data set being analysed, which may or may not be correct. They can also be unstable when new data are added.
We have addressed this problem by leveraging the availability of large numbers of whole genome sequences and machine learning techniques. First, cgMLST data are analysed using the Neighbour Joining tree reconstruction method to establish clusters or 'Neighbour Groups' based on the similarity of their cgMLST profiles. Then, a supervised machine learning algorithm is used to optimally predict the membership of these clusters from fewer loci, such as the MLST loci. The trained algorithm enables a robust probabilistic assignment of a seven-locus genotype to a cluster defined with cgMLST data (Figure 1). This is especially helpful when whole genome sequence (WGS) data are unavailable, for example, for clinical specimens to which WGS technology cannot be applied, or for legacy data. The algorithm is available as a command line tool accessible from https://github.com/bgrdessislava/NeighbourGroups.
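The second step, assigning a seven-locus genotype to a cgMLST-defined group with an associated probability, can be illustrated with a minimal sketch. The text does not specify which supervised algorithm the tool uses, so a simple k-nearest-neighbour vote on allele mismatches is substituted here purely for illustration. The function names, the group labels (`NG1`, `NG2`) and the allele profiles are hypothetical toy values, not data from the study or part of the NeighbourGroups tool.

```python
from collections import Counter


def allele_mismatches(profile_a, profile_b):
    """Count the loci at which two allelic profiles differ (Hamming distance)."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def predict_group_probabilities(profile, training, k=3):
    """Estimate group membership probabilities for a seven-locus profile.

    `training` is a list of (profile, group_label) pairs, where the group
    labels are assumed to come from a prior cgMLST-based clustering.
    The k training profiles with the fewest allele mismatches vote, and
    the vote shares are returned as probabilities.
    NOTE: this k-NN vote is a stand-in, not the classifier used by the tool.
    """
    ranked = sorted(training, key=lambda item: allele_mismatches(profile, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    total = sum(votes.values())
    return {label: count / total for label, count in votes.items()}


# Hypothetical training set: seven-locus profiles with cgMLST-derived groups.
training = [
    ((2, 1, 54, 3, 4, 1, 5), "NG1"),
    ((2, 1, 54, 3, 4, 1, 7), "NG1"),
    ((2, 1, 12, 3, 4, 1, 5), "NG1"),
    ((8, 10, 2, 2, 11, 12, 6), "NG2"),
    ((8, 10, 2, 2, 14, 12, 6), "NG2"),
    ((8, 17, 2, 2, 11, 12, 6), "NG2"),
]

# A query genotype with no whole genome sequence available.
query = (2, 1, 54, 3, 4, 1, 9)
print(predict_group_probabilities(query, training, k=3))
```

Returning vote shares rather than a single label mirrors the probabilistic assignment described above: a genotype lying between two groups would receive split probabilities instead of a confident call.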
An essential parameter for the NeighbourGroups model is the number of classification groups, which is user-defined and can be established empirically. For example, with a Campylobacter dataset of >10,000 isolates for which cgMLST data were available6, we performed a grid search to assess model performance for two to 100 classification groups (Figure 2). Model performance was evaluated with an adjusted Rand score, which measured how similar the clusters were between the 'testing tree' and the 'true tree'. An adjusted Rand score of >0.90 was defined as an excellent prediction, 0.80-0.90 as good recovery of groups, 0.65-0.80 as moderate recovery, and <0.65 as poor recovery, with low confidence in the reproducibility of the classification. This analysis indicated that, for this dataset, 20 Neighbour Groups gave optimum performance, with an adjusted Rand score of 0.895, showing high agreement between the 20 groups assigned from cgMLST data and those assigned from the seven-locus MLST data.
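To make the evaluation metric concrete, the following sketch computes the adjusted Rand score from two lists of group labels using the standard pair-counting formula, and maps a score onto the interpretation bands given above. It is a worked illustration only, not the code used in the study. The function names and the toy label lists are hypothetical, and the treatment of scores falling exactly on a band boundary (0.90, 0.80, 0.65) is an assumption, since the text does not state it.

```python
from collections import Counter
from math import comb


def adjusted_rand_score(labels_true, labels_pred):
    """Adjusted Rand score between two groupings of the same isolates.

    Counts the pairs of isolates grouped together in both labelings and
    corrects for the agreement expected by chance. Identical groupings
    score 1.0 regardless of how the groups are named.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same isolates")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two isolates: trivially identical
    together_in_both = sum(
        comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    together_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every isolate in a single group
    return (together_in_both - expected) / (maximum - expected)


def interpret(score):
    """Map a score onto the bands used in the text (boundary handling assumed)."""
    if score > 0.90:
        return "excellent"
    if score >= 0.80:
        return "good"
    if score >= 0.65:
        return "moderate"
    return "poor"


# Toy example: groups from cgMLST versus groups predicted from seven loci.
cgmlst_groups = [0, 0, 0, 1, 1, 1, 2, 2]
predicted_groups = [5, 5, 5, 7, 7, 9, 9, 9]  # one isolate placed differently
score = adjusted_rand_score(cgmlst_groups, predicted_groups)
print(round(score, 3), interpret(score))
```

In a grid search of the kind described, a score like this would be computed once for each candidate number of groups, and the number giving the highest score retained.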
At the time of writing, there were more than 150 MLST schemes available for a wide range of microbial species, primarily bacteria, with hundreds of thousands of isolates typed to the level of seven-locus MLST and, in many cases, also with cgMLST. Most of these can be found on the PubMLST website (https://www.pubmlst.org)7. For some MLST databases, notably those for Neisseria species8, Campylobacter jejuni and Campylobacter coli6, and Salmonella enterica3, clonal complexes (or eBurst groups) have been defined using a variety of approaches, but for most data collections, such groups have not been rigorously defined or maintained. Given the variability of bacterial population structures, the number of different schemes, and the number of isolates available, there is a need for a rational and automated approach to defining groups that can be applied to whole genome and MLST data. This is especially the case for pathogens for which it may not be possible to generate reliable whole genome sequence data from clinical specimens but where this information is beneficial. In addition, the Neighbour Group approach is easily implemented, and its assumptions are easily understood, providing a pragmatic complement to other analysis approaches, many of which require whole genome sequences and high-capacity computing9,10. A final advantage is that the approach indicates the confidence with which seven-locus data can be assigned to whole genome sequence groups.