The spectra-based surveillance system relies upon a microbiological surveillance system associated with a MALDI-TOF MS database. The overall process flow is described in Figure 1.
Microbiological surveillance system: BALYSES subsystem
The IHU-MI is part of the Assistance Publique – Hôpitaux de Marseille (AP-HM) in Marseille, France, and is the sole clinical bacteriology laboratory serving its 4 public and university hospitals. Laboratory activity has been monitored since February 2014 by an automated surveillance system named BALYSES (Bacterial real-time Laboratory-based Surveillance System), which is one of the five MIDaS subsystems. Connected to the laboratory information system, this surveillance system is based on a dedicated data warehouse gathering microbiological analysis results (sample id, requesting department, date, sampling, analysis, result, antibiotic susceptibility testing and antibiotic resistance phenotype where available, bacterial co-identifications) and patient-related information (anonymized patient id, age, sex, home postal code, anonymized hospital stay id, department stay date, death). It allows systematic weekly detection of outbreaks for all bacterial species included in the database using CUSUM algorithms, monitoring of trends in the sampling activity of the 15 most frequent bacterial species, and tracking of rare or new bacterial species.
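As an illustration of how CUSUM-based weekly detection can work, the following Python sketch applies a one-sided standardized CUSUM to weekly isolate counts for one species. It is not the BALYSES implementation: the function name, the length of the baseline period, and the reference value `k` and decision threshold `h` are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def cusum_alarms(weekly_counts, baseline_weeks=8, k=0.5, h=3.0):
    """Return the indices of weeks where an upper one-sided CUSUM exceeds h.

    baseline_weeks, k and h are illustrative values, not those of the study.
    """
    baseline = weekly_counts[:baseline_weeks]
    mu = mean(baseline)
    sigma = stdev(baseline) or 1.0  # fall back to 1.0 when the baseline is flat
    cumulative = 0.0
    alarms = []
    for week in range(baseline_weeks, len(weekly_counts)):
        z = (weekly_counts[week] - mu) / sigma   # standardized weekly count
        cumulative = max(0.0, cumulative + z - k)
        if cumulative > h:
            alarms.append(week)
            cumulative = 0.0                     # restart after an alarm
    return alarms


# Hypothetical weekly counts with a jump in weeks 10 and 11.
counts = [2, 3, 2, 4, 3, 2, 3, 3, 3, 2, 9, 10, 3]
print(cusum_alarms(counts))
```

Resetting the cumulative sum after each alarm is one possible design choice; leaving it running would instead keep signalling for as long as the excess persists.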
MALDI-TOF MS database from IHU-MI
Since February 1, 2014, the MALDI-TOF MS database has gathered around 900,000 spectra acquired at the IHU-MI during routine bacterial identification for all patients hospitalized in the AP-HM.
Routine bacterial identification in the laboratory relies on culturing strains on blood or chocolate agar, depending on the species (BioMérieux’s Columbia agar with 5% sheep blood, and Becton, Dickinson and Co’s Chocolate Agar GC II Agar with IsoVitaleX™ enrichment), with growth stopped in the middle of the log phase. From each culture, a single colony is applied directly onto 2 or 4 spots of a ground steel target, air dried, overlaid with α-cyano-4-hydroxycinnamic acid matrix solution in 50% acetonitrile and 2.5% trifluoroacetic acid, and air dried again following the agreed protocol. All bacterial spectra are acquired on 3 Bruker Daltonics Microflex MALDI-TOF MS instruments with FlexControl software, using the default settings (positive linear mode within the mass range of 2 to 20 kDa; laser frequency, 60 Hz; ion source 1 voltage, 20 kV; ion source 2 voltage, 16.7 kV; lens voltage, 7.0 kV) and 240 laser shots per spectrum. Culture standardization is required for spectra to be comparable within a species. Bruker BioTyper® software compares each spectrum with a reference database and yields the routine species identification when the score is ≥ 2.0. The Bacterial Test Standard (BTS), a solution of Escherichia coli DH5 alpha with two additional proteins, is used as a positive control, and the matrix solution as a negative control for identification. Instrument calibration is performed regularly with the BTS, as described in Bruker’s protocol.
All MALDI-TOF MS spectra (‘fid’ files) and their parameter files (‘acqu’ files) produced during the identification process are extracted from the instruments and saved in dedicated file system storage linked to the surveillance data warehouse.
MALDI-TOF MS analysis
Our spectra processing platform is based on an in-house program written in R that mainly uses the following packages: MALDIquant v1.16.2 for spectra reading and quantitative analysis, seriation v1.2-2 for dendrogram ordering, and BinDA v1.0.3 for the protein peak discriminant analysis using binary predictors.
To investigate an alarm, the related surveillance database records and their associated spectra are selected using the appropriate query (e.g. based on species, dates, antibiotic susceptibility, home location, hospital department…), with the time window extended (up to a maximum of 4 months) to include sufficient pre-alarm material for contrast. This window may be shortened if the number of spectra is too large for the clustering to remain readable, as for Escherichia coli or Staphylococcus aureus; the limit usually applied in this case is about 1,500 spectra. During the selection process, spectra quality is taken into account: only spectra of sufficient quality in terms of saturation and noise, and with the plate controls (BTS) required for spectra deviation correction (as described below), are included in the analysis.
The selected spectra are imported into the analysis platform and are then fed into a 4-step workflow, which is described below. Spectra processing includes normalization, double alignment of spectra, and the building of Main Spectrum Profiles (MSP) and of the intensity matrix. Throughout these steps, a signal-to-noise ratio (SNR) of 2 is used as the peak detection threshold, peaks with an SNR < 2 being considered as noise.
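To make the SNR threshold concrete, the Python sketch below keeps local maxima whose intensity exceeds SNR times a noise estimate. It is a simplified illustration, not the platform's R code: the choice of the median absolute deviation (MAD) as noise estimator, the `half_window` value and the function names are assumptions, and the spectrum is assumed to be already baseline-corrected.

```python
from statistics import median


def estimate_noise_mad(intensities):
    """Noise level as the scaled median absolute deviation (an assumed estimator)."""
    centre = median(intensities)
    return 1.4826 * median([abs(v - centre) for v in intensities])


def detect_peaks(mz, intensities, snr=2.0, half_window=2):
    """Return (m/z, intensity) pairs for local maxima above snr * noise."""
    threshold = snr * estimate_noise_mad(intensities)
    peaks = []
    for i in range(half_window, len(intensities) - half_window):
        window = intensities[i - half_window:i + half_window + 1]
        # A peak is the highest point of its window and must clear the threshold.
        if intensities[i] == max(window) and intensities[i] > threshold:
            peaks.append((mz[i], intensities[i]))
    return peaks


# Hypothetical baseline-corrected intensities with a single clear peak.
mz_axis = [2000 + 10 * i for i in range(11)]
signal = [0.1, 0.3, 0.2, 0.4, 0.1, 6.0, 0.2, 0.3, 0.1, 0.4, 0.2]
print(detect_peaks(mz_axis, signal))
```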
As described by Gibb and Strimmer, normalization consists of intensity transformation (square root method), smoothing (moving average with a half window size of 12), baseline correction (Statistics-sensitive Non-linear Iterative Peak-clipping algorithm, 100 iterations) and intensity recalibration (on the peak of maximal intensity).
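The four normalization operations can be sketched in Python as follows. This is a simplified illustration rather than the MALDIquant routines used by the platform: the edge handling of the moving average, the decreasing clipping window in the SNIP loop and all function names are assumptions.

```python
import math


def sqrt_transform(intensities):
    """Variance-stabilizing square root transformation."""
    return [math.sqrt(max(v, 0.0)) for v in intensities]


def moving_average(values, half_window=12):
    """Moving average; windows are truncated at the spectrum edges (assumption)."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def snip_baseline(values, iterations=100):
    """Simplified SNIP: clip each point to the mean of its neighbours at distance p."""
    baseline = list(values)
    n = len(baseline)
    for p in range(iterations, 0, -1):          # decreasing clipping window
        previous = list(baseline)
        for i in range(p, n - p):
            neighbours = (previous[i - p] + previous[i + p]) / 2.0
            baseline[i] = min(previous[i], neighbours)
    return baseline


def normalize(intensities, half_window=12, iterations=100):
    """Square root, smoothing, baseline removal, then rescaling to the highest peak."""
    smoothed = moving_average(sqrt_transform(intensities), half_window)
    baseline = snip_baseline(smoothed, iterations)
    corrected = [max(s - b, 0.0) for s, b in zip(smoothed, baseline)]
    top = max(corrected)
    return [c / top for c in corrected] if top > 0 else corrected
```

After this step, the highest peak of each spectrum has intensity 1, so intensities become comparable between spectra.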
The 8 reference peaks of the BTS (3637.8, 5096.8, 5381.4, 6255.4, 7274.5, 10300.1, 13683.2 and 16952.3 Da), required for each target plate, are used for a first alignment (quadratic warping function) aimed at controlling instrument-dependent drift. Spectra with reference peaks outside the built-in Microflex tolerance window (300 ppm) are dropped. Using the typical species peak composition described in our panspectrome database, a second alignment of the spectra based on their species-specific common peaks is then performed (quadratic warping function with 0.005 tolerance).
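The 300 ppm tolerance check on the BTS reference peaks can be illustrated with the short Python sketch below. It covers only the tolerance test, not the quadratic warping itself; the nearest-peak pairing rule and the function names are assumptions made for the example.

```python
BTS_REFERENCE_PEAKS = [3637.8, 5096.8, 5381.4, 6255.4,
                       7274.5, 10300.1, 13683.2, 16952.3]  # Da, from the text


def ppm_error(observed, reference):
    """Relative mass deviation in parts per million."""
    return abs(observed - reference) / reference * 1e6


def match_reference_peaks(observed_peaks, references=BTS_REFERENCE_PEAKS,
                          tolerance_ppm=300.0):
    """Pair each reference mass with its nearest observed peak, or None if too far."""
    if not observed_peaks:
        return [None] * len(references)
    matches = []
    for ref in references:
        nearest = min(observed_peaks, key=lambda m: abs(m - ref))
        within = ppm_error(nearest, ref) <= tolerance_ppm
        matches.append(nearest if within else None)
    return matches


def control_within_tolerance(observed_peaks, tolerance_ppm=300.0):
    """True when every reference peak has a match inside the tolerance window."""
    matches = match_reference_peaks(observed_peaks, tolerance_ppm=tolerance_ppm)
    return all(m is not None for m in matches)
```

For example, a control whose peaks are all shifted by 100 ppm passes the check, whereas a single peak shifted by 500 ppm makes it fail.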
Technical replicates are averaged into main spectrum profiles (MSP), and species-specific common peaks are removed in order to increase the contrast between these spectra, which belong to the same bacterial species.
An intensity matrix, describing the intensity of spectra peaks for each MSP and built as recommended by S. Gibb, is the final deliverable of this process.
The next step is the hierarchical clustering of the intensity matrix using Bray-Curtis distance and Ward agglomeration with ordination (or seriation). The ordination is based on the Gruvaeus-Wainer method, which orders the leaves at each merging step such that the leaves at the edges of each cluster are placed beside the most similar ones, ensuring the uniqueness of the dendrogram. Time distances between dendrogram leaves are also calculated during this step.
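The two distances involved in this step can be sketched in Python as follows: the Bray-Curtis dissimilarity between rows of the intensity matrix, and the time distance in days between isolates. Ward agglomeration and Gruvaeus-Wainer ordering are not reproduced here; the function names and the convention of returning 0 for two empty profiles are assumptions.

```python
from datetime import date


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative intensity vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def pairwise(items, distance):
    """Symmetric distance matrix as a list of lists."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(items[i], items[j])
    return matrix


def days_between(d1, d2):
    """Time distance between two sampling dates, in days."""
    return abs((d1 - d2).days)


# Hypothetical MSP intensity rows and sampling dates.
msp_rows = [[1.0, 0.0, 0.5], [0.9, 0.1, 0.5], [0.0, 1.0, 0.2]]
dates = [date(2018, 1, 1), date(2018, 1, 4), date(2018, 3, 1)]
protein_distances = pairwise(msp_rows, bray_curtis)
time_distances = pairwise(dates, days_between)
```

The two matrices share the same row order, so any leaf ordering obtained from the protein distances can be applied directly to the time distances.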
The results of this clustering step are presented using 2 specific graphics: a time-heated dendrogram and a time-protein double proximity heatmap. Their aim is to support epidemiological inference based on the closeness of MSP in terms of protein and temporal distances, which suggests possible epidemiological relations between isolates, as elaborated by Sintchenko et al. In the time-heated dendrogram, each leaf label is coloured on a heat scale according to the time of case occurrence: the more recent the case, the “hotter” the colour, from blue to red. Isolates possibly belonging to the same epidemiological event are represented in the dendrogram by subtrees whose labels share the same colour. The time-protein double proximity heatmap combines a first half-matrix showing protein distances with a second half-matrix coloured according to the time distance between MSP (Figure 4, cluster A). The double heatmap is a possible alternative illustration in which groups of MSP with close protein and temporal distances appear as hot-coloured squares along the matrix diagonal.
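The colouring of leaf labels by case occurrence time can be illustrated with a small Python sketch. The linear blue-to-red interpolation and the hexadecimal output are assumptions made for the example; the actual palette used for the figures is not specified in the text.

```python
from datetime import date


def heat_colour(day, oldest, newest):
    """Map a date to a hex colour, from blue (oldest) to red (most recent)."""
    span = (newest - oldest).days
    position = (day - oldest).days / span if span else 1.0
    position = min(max(position, 0.0), 1.0)     # clamp dates outside the range
    red = round(255 * position)
    blue = round(255 * (1.0 - position))
    return f"#{red:02x}00{blue:02x}"


# Hypothetical sampling dates for three dendrogram leaves.
sampling_dates = [date(2018, 1, 1), date(2018, 2, 15), date(2018, 4, 1)]
oldest, newest = min(sampling_dates), max(sampling_dates)
leaf_colours = [heat_colour(d, oldest, newest) for d in sampling_dates]
```

With such a scale, a subtree whose labels are all close to red would point to a group of recent isolates with similar spectra.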
An MSP group is characterized by contrasting it against the other MSP with a discriminant analysis on protein peaks. For this purpose, we rely on Gibb and Strimmer’s method for differential protein expression and prediction based on binary discriminant analysis (BinDA). This method dichotomizes the intensity vector of each peak by maximising the Kullback-Leibler divergence, then ranks the peaks according to their discriminating power. All top-ranked peaks are automatically checked against the UniProt database (http://www.uniprot.org/) using its representational state transfer (REST) programmatic access. A mass deviation of ± 2 Da is allowed for the matching. For each top-ranked peak, prediction errors for group separation are estimated using cross-validation procedures.
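The ± 2 Da matching rule can be sketched in Python as below. The sketch deliberately leaves out the UniProt REST query and works on a list of candidate proteins that is assumed to have been retrieved beforehand; the function name and the candidate entries, which are placeholders with made-up masses, are hypothetical.

```python
def match_peak_to_proteins(peak_mass, candidates, tolerance_da=2.0):
    """Return candidates within the mass tolerance, closest first.

    candidates is an iterable of (name, mass_in_da) pairs, assumed to have
    been retrieved from a protein database beforehand.
    """
    hits = [(name, mass) for name, mass in candidates
            if abs(mass - peak_mass) <= tolerance_da]
    return sorted(hits, key=lambda hit: abs(hit[1] - peak_mass))


# Placeholder candidates with made-up masses, for illustration only.
candidates = [("placeholder_A", 5096.0),
              ("placeholder_B", 5099.5),
              ("placeholder_C", 5097.9)]
print(match_peak_to_proteins(5096.8, candidates))
```

Sorting the hits by mass difference puts the most plausible assignment first when several proteins fall inside the tolerance window.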
This study was authorized by the French Data Protection Authority (CNIL decision DR-2018-177) and registered on the ClinicalTrials.gov Protocol Registration and Results System (id: NCT03626987).