Invasive disease caused by Streptococcus pneumoniae (pneumococcus) is a serious infection. It can produce a wide spectrum of clinical manifestations including sepsis, meningitis, bacteremic pneumonia, arthritis, osteomyelitis, cellulitis, and endocarditis.
Those at highest risk of this disease are children under 2 years of age, the elderly, and people with immune disorders or certain respiratory, cardiac, or renal conditions, among others.
Pneumococcal vaccination is intended to reduce the incidence, complications, sequelae, and mortality from pneumonia and invasive pneumococcal disease. Two types of vaccines are available: a polysaccharide vaccine against 23 serotypes (PPSV23) and a conjugate vaccine against 13 serotypes (PCV13) (13).
Currently, serotype identification is performed by the Quellung technique, which requires expensive reagents and specialized staff, is very laborious, and has a long turnaround time. In our country, the capsular type is determined only at the NRL, which makes it difficult to obtain results in real time. For these reasons, the search for reliable, fast, and inexpensive alternative methods for Spn serotyping is of great interest to public health.
In this sense, in recent years MALDI-TOF mass spectrometry has revolutionized the field of clinical microbiology (38). Although its application is fundamentally related to microbiological identification (39), applications combined with other bioinformatic analysis platforms have recently emerged (40, 41). In line with this, artificial intelligence has entered the health field disruptively and has become a new tool to consider for the diagnosis of different pathologies (42–44). The first attempts to discriminate viridans group streptococci by mass spectrometry were encouraging, although the published results were based on a small number of isolates or on a database created specifically for this group of microorganisms (45, 46). However, it was later found that Bruker’s Biotyper 3.0 system could not resolve the low specificity in the identification of this genus (47–50). For this reason, several authors have explored different approaches to overcome this limitation. In 2012, Werno et al. (51) proposed the use of specific peak analysis to confirm identification, which improved the typing of different species of the Streptococcus mitis group, including Streptococcus pneumoniae.
Subsequently, Ikryannikova et al. (52) addressed the difficulty of discriminating pneumococcus from Streptococcus mitis by mass spectrometry and proposed using artificial intelligence classifier algorithms in addition to biomarker peaks. However, other authors (53–55) who tried to replicate this methodology were unable to find the peaks described.
It is important to note that this limitation occasionally arose during the creation of the spectral database and during the "real-time" classification of the unknown isolates performed in this manuscript.
In this context, an inclusion criterion was established for the isolates incorporated in this work, whether in the training set or in the validation set. Only isolates that showed no discrepancies among the top ten results obtained by MS and that also had a score > 2.0 were included. This ensured that the spectra had the quality required for this approach.
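The inclusion criterion can be written as a simple filter. The following Python sketch is only an illustration and not the workflow used in this study. The record layout, the function name `meets_inclusion_criterion`, and the reading of "no discrepancies" as "all ten best matches name the expected species" are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IdentificationResult:
    isolate_id: str      # hypothetical isolate label
    top_ten: List[str]   # species names of the ten best database matches
    score: float         # identification score of the best match


def meets_inclusion_criterion(result, expected="Streptococcus pneumoniae",
                              min_score=2.0):
    """Return True when all top-ten matches agree and the score exceeds 2.0."""
    # Assumption: "no discrepancies" means every top-ten match is the expected species.
    concordant = bool(result.top_ten) and all(
        name == expected for name in result.top_ten
    )
    return concordant and result.score > min_score


# Toy records, invented for illustration only.
spn = "Streptococcus pneumoniae"
results = [
    IdentificationResult("A", [spn] * 10, 2.31),
    IdentificationResult("B", [spn] * 9 + ["Streptococcus mitis"], 2.25),
    IdentificationResult("C", [spn] * 10, 1.95),
]

included = [r.isolate_id for r in results if meets_inclusion_criterion(r)]
print(included)
```

In this toy example, isolate B is excluded because of a discrepant match in the top ten, and isolate C because its score does not exceed 2.0.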
The objective of this work was to evaluate the application of MALDI-TOF MS in combination with artificial intelligence algorithms as a screening method in the serotyping of Streptococcus pneumoniae.
To this end, two-class models were proposed to differentiate PCV13 vaccine serotypes from non-PCV13 serotypes. Combined with knowledge of the most prevalent circulating serotypes in our country, this would allow Quellung serotyping to be performed in a more targeted manner, substantially saving both reagents and man-hours.
First, a calibration set comprising isolates of the most frequent vaccine and non-vaccine serotypes in the local epidemiology was used, and a principal component analysis (PCA) was performed on the spectra obtained by MS. This unsupervised analysis was carried out to explore the behavior of these objects (Spn isolates) as a function of the new variables defined by the analysis. The first three components explained the largest share of the variance in the data, with a cumulative variance of 82%.
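The cumulative variance is the running sum of the eigenvalues of the principal components divided by their total. The sketch below shows only that arithmetic. The eigenvalues are invented for illustration and are not the values obtained in this study; they were chosen so that the first three components add up to 82%, mirroring the figure reported above. The function name is hypothetical.

```python
def cumulative_explained_variance(eigenvalues):
    """Return the cumulative fraction of variance explained by each component.

    Eigenvalues are sorted in decreasing order first, as is conventional in PCA.
    """
    if any(value < 0 for value in eigenvalues):
        raise ValueError("eigenvalues of a covariance matrix cannot be negative")
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("total variance must be positive")

    cumulative = []
    running = 0.0
    for value in sorted(eigenvalues, reverse=True):
        running += value
        cumulative.append(running / total)
    return cumulative


# Illustrative eigenvalues only (NOT measured in this study).
example_eigenvalues = [5.0, 2.0, 1.2, 0.8, 0.6, 0.4]
fractions = cumulative_explained_variance(example_eigenvalues)

for index, fraction in enumerate(fractions, start=1):
    print(f"PC1..PC{index}: {fraction:.0%}")
```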
In the PC1 vs PC2 score graph (Fig. 4), a homogeneous distribution along the PC1 axis can be observed. This trend allowed us to discriminate two clusters that can be correlated with vaccine and non-vaccine isolates.
The loadings graph shows the loading value of each peak, obtained during the calculation of the principal components. The peaks that contribute the greatest weight to the principal components can thus be associated with the distribution of the samples observed in the score graph. This tool is useful for explaining the outliers that appear in score graphs. The analysis can then be recalculated, including or excluding the influential peaks according to the objective, in order to decide which variables should be given more or less weight to obtain the desired distribution.
In the unsupervised hierarchical clustering analysis (Fig. 5), the spectra were compared with each other in pairs, and the value obtained from each comparison allowed them to be grouped hierarchically into the branches of a taxonomic tree according to their proximity. The resulting dendrogram shows a high degree of discrimination between the two classes, with only one isolate (PCV13) grouped outside its expected cluster.
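The pairwise comparison and successive merging behind a dendrogram can be sketched as follows. This is a minimal agglomerative clustering in plain Python, assuming Euclidean distance and average linkage. The distance measure and linkage rule actually used by the software in this study are not stated here, and the toy "spectra" (two peak intensities per isolate) are invented.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two intensity vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(cluster_a, cluster_b, spectra):
    """Mean pairwise distance between the members of two clusters."""
    distances = [
        euclidean(spectra[i], spectra[j]) for i in cluster_a for j in cluster_b
    ]
    return sum(distances) / len(distances)


def agglomerate(spectra, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    if not 1 <= n_clusters <= len(spectra):
        raise ValueError("n_clusters must be between 1 and the number of spectra")

    clusters = [[i] for i in range(len(spectra))]  # start with one cluster per spectrum
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], spectra)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(cluster) for cluster in clusters]


# Toy intensity vectors, invented for illustration only.
toy_spectra = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.1, 1.0], [0.2, 0.9]]
groups = agglomerate(toy_spectra, n_clusters=2)
print(groups)
```

Cutting the tree at two clusters corresponds to asking whether the isolates separate into the two expected classes.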
Once the behavior of the study objects had been explored in an unsupervised manner, supervised training was implemented to give the analysis a predictive approach. Supervised learning is a subdomain of artificial intelligence in which the computer uses algorithms to learn from a set of past data in order to make predictions about new data. In this way, isolates can be classified according to a criterion previously established in the training or calibration phase.
First, the best discriminatory peaks for each class were identified, according to the previously established parameters, which yielded two biomarker peaks for each class (4703.69 Da and 10785.95 Da). The performance values observed in the detection of these peaks were acceptable, since two well-defined groupings were observed in the two-dimensional plot of classes (Fig. 6).
Nakano et al. (56) used ClinPro Tools software to create prediction models for the ten most prevalent serotypes in Japan, but were only able to validate the assay for three serotypes (3, 15A, and 19A). Subsequently, Pinto (57), using biomarker peaks and the BioNumerics 7.6 software, found encouraging results, but only for serotypes 6A, 6B, 6C, 9N, 9V, and 14. However, both authors conclude that external validation is necessary to corroborate the true reproducibility of the evaluated approaches.
In this work, five classifier models were calibrated using the ClinPro Tools software; the recognition capacity and cross-validation values showed the greatest efficiency for the GA/k-NN and SNN algorithms. Notably, and consistent with the unsupervised analysis, the peaks selected to build the two-dimensional figure agree with the peaks used by some of the classifier models, which lends further robustness to the results obtained.
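To illustrate what a k-NN classifier and its cross-validation involve, the sketch below implements a plain k-nearest-neighbours vote with leave-one-out cross-validation. It is not the GA/k-NN model of ClinPro Tools: the genetic-algorithm peak selection is omitted, leave-one-out is assumed as a simple validation scheme, and the peak intensities and class labels are invented.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training spectra."""
    ranked = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(x, y, k=3):
    """Hold out each spectrum in turn and predict it from all the others."""
    correct = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_x, train_y, x[i], k) == y[i]:
            correct += 1
    return correct / len(x)


# Toy intensities at two hypothetical peaks (invented, not study data).
toy_x = [
    [0.90, 0.20], [0.80, 0.30], [1.00, 0.10], [0.85, 0.25],  # PCV13
    [0.20, 0.90], [0.30, 0.80], [0.10, 1.00], [0.25, 0.85],  # NON_PCV13
]
toy_y = ["PCV13"] * 4 + ["NON_PCV13"] * 4

accuracy = leave_one_out_accuracy(toy_x, toy_y, k=3)
prediction = knn_predict(toy_x, toy_y, [0.7, 0.3], k=3)
print(accuracy, prediction)
```

The toy classes are deliberately well separated, so the accuracy here says nothing about the performance obtained with real spectra.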
When these models were applied independently, low sensitivity and specificity values were observed, making them unsuitable on their own as a screening technique for serotyping unknown isolates. When both algorithms were applied in parallel and combined, specificity improved notably (82%), and with it the positive predictive value (81%). However, sensitivity and negative predictive value remained around 60%, and results were inconclusive for 37/100 isolates.
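One way to combine two classifiers in parallel is to accept a call only when both agree and to report the isolate as inconclusive otherwise. The sketch below shows that rule together with the usual definitions of sensitivity, specificity, and predictive values. The combination rule and the exclusion of inconclusive isolates from the metrics are assumptions, since the exact procedure is not detailed here, and the calls are invented.

```python
def consensus_call(call_a, call_b):
    """Accept a class only when both models agree; otherwise flag as inconclusive."""
    return call_a if call_a == call_b else "inconclusive"


def _ratio(numerator, denominator):
    """Return None instead of dividing by zero."""
    return numerator / denominator if denominator else None


def screening_metrics(truth, calls, positive="PCV13"):
    """Compute performance metrics, leaving inconclusive calls out of the counts."""
    tp = fp = tn = fn = inconclusive = 0
    for actual, called in zip(truth, calls):
        if called == "inconclusive":
            inconclusive += 1
        elif called == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "inconclusive": inconclusive,
    }


# Toy reference serotyping and model outputs (invented, not study data).
truth = ["PCV13", "PCV13", "PCV13", "NON_PCV13", "NON_PCV13", "NON_PCV13"]
model_a = ["PCV13", "PCV13", "NON_PCV13", "NON_PCV13", "NON_PCV13", "PCV13"]
model_b = ["PCV13", "NON_PCV13", "NON_PCV13", "NON_PCV13", "PCV13", "PCV13"]

combined = [consensus_call(a, b) for a, b in zip(model_a, model_b)]
metrics = screening_metrics(truth, combined)
print(combined)
print(metrics)
```

Under this rule, requiring agreement tends to trade coverage for confidence: fewer isolates receive a call, but each call is backed by both models.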
One way to improve the predictive parameters would be to increase the number of isolates, both in the training set and in the set used to challenge the models.
Although the results obtained in this work were not as expected, using the developed models as a screening step reduced the use of antisera by 10.2% compared with the blind Quellung technique.
Undoubtedly, the development and application of MS has helped meet the demands of microbiological diagnosis owing to its important advantages over more traditional methodologies. These include its versatility (it can be applied to bacterial, fungal, viral, and parasite cultures, and even directly to clinical samples). Additionally, it requires minimal sample preparation and is a fast, easy-to-use technique that allows parameters to be evaluated in real time (58).
In this work it was possible to demonstrate that the combination of MALDI-TOF MS and multivariate analysis allows the development of new strategies for the identification and characterization of Spn isolates of clinical importance. MALDI-TOF mass spectra generate a large amount of data, which requires appropriate analysis methods to make the most of the information contained in them. In this sense, multivariate analysis models (both supervised and unsupervised) allow extracting information from the spectra (multivariate data set) and correlating it with different properties of the samples.
The results of this work provide a basis for continuing to explore the combination of MALDI-TOF MS with multivariate analysis, with the prospect of improving the predictive parameters so that the developed models become robust and reliable. In addition, this work provided a solid background in multivariate analysis as a tool to extract useful information and draw inferences from a large amount of data.
Finally, the possibility of projecting these methodologies to the study of other pathogens constitutes an added value as a future perspective.