Bacterial Typing and Identification Based on Fourier Transform Infrared Spectroscopy


Fourier transform infrared (FT-IR) spectroscopy is a label-free and highly sensitive technique that provides comprehensive information on the chemical composition of biological samples. Bacterial FT-IR signals are highly specific and reproducible fingerprint-like patterns, making FT-IR an efficient tool for bacterial typing and identification. Owing to its low cost and high throughput, FT-IR has been widely used in hospital hygiene management for infection control, epidemiological studies, and routine bacterial determination in clinical laboratories. However, typing and identification accuracy can be affected by many factors, and bacterial FT-IR data from different laboratories are usually not comparable. A standard protocol is required to improve the accuracy of FT-IR-based typing and identification. Here, we detail the principles and procedures of bacterial typing and identification based on FT-IR spectroscopy, including bacterial culture, sample preparation, instrument operation, spectra collection, spectra preprocessing, and mathematical data analysis. Excluding bacterial culture, a typical experiment generally takes <2 h.

interpretation and analysis, as bacterial FT-IR spectra are very similar with subtle biochemical differences [25][26][27][28] . The processing of bacterial FT-IR spectra includes spectral preprocessing and multivariate analysis. Baseline correction, spectral normalization, and derivative transformation of bacterial FT-IR spectra are fundamental steps in spectral preprocessing and can increase the accuracy of bacterial typing and identification 26,29 . The multivariate statistical approaches can be divided into two types: unsupervised methods (e.g., principal component analysis [PCA], hierarchical cluster analysis [HCA]) and supervised methods (e.g., PCA-linear discriminant analysis, artificial neural networks [ANNs]). The supervised methods require prior knowledge of the bacterial identity 26,30 . With a set of spectra from known bacteria, a model can be trained to identify unknown bacteria.
The aim of this study was to describe the detailed procedures for bacterial identification and typing based on FT-IR spectroscopy, including sample preparation, spectral acquisition, and data analysis.

Development of the protocol
The use of infrared spectroscopy in microbiological analysis was first reported in the 1950s 42-44 .
However, the number of applications of FT-IR in bacteriology increased markedly only in the late 1980s and 1990s, after the development of the FT technique and powerful computers [45][46][47][48][49] . Modern multivariate statistical analyses, such as factor analysis and ANNs, have contributed greatly to the utilization of this methodology for bacterial identification and typing at different taxonomic levels 26 .
The revival of IR spectroscopy as a means of characterizing microbial samples has been the subject of several reviews in the past 20 years 10,27,29,36,50 . Recently, the future perspective of FT-IR for healthcare and clinical applications has been discussed 51 . The potential of the methodology as a quick, inexpensive, and high-throughput tool for bacterial identification and typing is widely accepted 51 .

Application of the method
Bacterial typing and identification by FT-IR can be applied in general microbiology 21,52,53 , rapid identification of life-threatening pathogens 29,54 , epidemiological investigations and pathogen screening 14,55,56 , characterization and screening of microorganisms from the environment 9,57,58 , and maintenance of strain collections 59 . The intended audience includes investigators and operators in microbiology and related fields. Recently, a dedicated instrument (IR Biotyper) was marketed by Bruker Daltonik GmbH (Germany), which is likely to attract a wider audience to this technique.

Comparison with other methods
Many bacterial identification and typing methods have been developed for specific purposes, but many traditional methods are based on multistep culture-based assays, which are time-consuming and costly. Therefore, multistep culture-based assays are progressively being replaced by molecular biology methods [60][61][62][63][64] . One of the most well-developed molecular biology methods is polymerase chain reaction (PCR) 61 . It has high discriminatory power based on variation in the components, structure, and sequence of bacterial genetic material. Some molecular methods, such as pulsed-field gel electrophoresis and whole-genome sequencing, are regarded as the "gold standard", though some of these methods are only semi-automated, and their high cost and long analysis time impair routine implementation 65 . Moreover, typing based on genetic information is not always straightforward, and a gap exists between genotype and phenotype 50 .  [71][72][73] and fungi [74][75][76] , with an accuracy of 90% at the species level. Although promising, the robustness and discriminatory power need to be improved by unifying processes such as sample preparation and data analysis.
Unfortunately, inherent differences in peak intensity or peak location may be present in mass spectra that were acquired at different times or on different devices or in different laboratories.
Moreover, many investigations have reported that the accuracy of MALDI-TOF-MS is limited for some highly similar species, including Escherichia coli and Shigella species, two important bacteria that cause different clinical diseases [77][78][79][80] .

Advantages and limitations of the method
The notable advantages of the FT-IR method are 17 that it is uniformly applicable to virtually all microorganisms that can be grown in culture; the results are available within minutes of obtaining adequate samples of the pure culture; IR spectroscopy can classify microorganisms at different levels of taxonomic discrimination without any preselection of strains by other taxonomic criteria; the specificity of the method is generally extremely high, allowing differentiation at the strain and serotype level; it can be used in epidemiological investigations, pathogen screening, hygiene, and therapy control; and it offers rapid, cost-effective, non-destructive, in situ detection.
The limitations of FT-IR are that only microorganisms that can be grown in culture and are available as pure cultures can be analyzed; stable results can only be obtained when the microbiological parameters (culture medium, cultivation time, and temperature) are rigorously controlled; the spectral databases from different laboratories are not compatible unless all conditions (instrument, culture media, etc.) are the same; and only a few highly specialized databases have been set up.
Many general applications require a broad and comprehensive database, as in clinical, food, or environmental microbiology, as reference databases need to deal with the diversity in species. Furthermore, multiple studies have shown promising results, but standardization is lacking 81 . Random variation between studies can originate from differences in instrumentation, operators, and environmental conditions. For example, before FT-IR studies, instrument settings, sample preparation, and operation mode should be optimized in order to improve the spectral quality and molecular sensitivity 19 .

Experimental design
The main experimental steps are bacterial culture, sample preparation, instrument settings, acquisition of spectra, spectral preprocessing, and computational analysis (Fig. 3).

The strains can be from strain collections or clinical isolates. Only a single colony in culture medium should be used for FT-IR analysis. The parameters, such as age of culture, cultivation medium, incubation time, and the preparation protocol, should be the same. Differences or changes in any of these parameters may lead to false negative results.
Bacterial sample cultivation is started from bacteria of the same age. Pure single colonies are selected for inoculation (a pure culture is essential; if only small colonies are present, an additional streak and incubation is required to increase cell mass). The cell material is spread using an inoculation loop in a way that generates confluent growth. In certain cases, subculturing (on the same medium) may be necessary in order to adapt the organism to the growth medium. The cultivation conditions, including the agar medium used, the temperature, and the incubation time, depend on the organism. These conditions must be as standardized as possible to take advantage of the high selectivity of the IR method. Anaerobic and microaerophilic bacteria (e.g., lactic acid bacteria) are grown in a sealed container in the presence of an oxygen-sorbent (e.g., Anaerocult, Merck; sorbent system should be standardized).

Instrumental parameters
For FT-IR spectrometry, water vapor rotational bands are of low intrinsic half-width, and their precise wavenumber positions and intensities depend on parameters such as partial pressure and temperature. FT-IR spectra can therefore be affected by the environmental conditions around the instrument. To collect high-quality FT-IR spectra of bacteria, the FT-IR instrument should be placed in an enclosed space that avoids disturbance from air flow. In addition, vapor contributions should be reduced by purging the instrument with dry air or nitrogen during spectra collection. Furthermore, the particular instrument will influence the signal-to-noise ratio (SNR); thus, user-specific conditions and some instrumental parameters need to be optimized. Normally, the resolution of the instrument is set to 4 cm -1 or lower and the scan number to 62 scans or more. Background scans should be acquired with the same parameters as the sample scans. The sample spectra then undergo a background correction to yield the final spectrum.
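The background correction mentioned above is commonly expressed as an absorbance calculation from the single-beam sample and background intensities. The following is a minimal sketch under that assumption; the function name and the intensity values are hypothetical and are not taken from any instrument software.

```python
import math

def absorbance(sample, background):
    """Absorbance A = -log10(I_sample / I_background) at each wavenumber."""
    if len(sample) != len(background):
        raise ValueError("sample and background must share the same wavenumber grid")
    return [-math.log10(s / b) for s, b in zip(sample, background)]

# Hypothetical single-beam intensities at three wavenumbers
background = [100.0, 100.0, 100.0]
sample = [100.0, 10.0, 50.0]
spectrum = absorbance(sample, background)
```

Because the ratio is taken point by point, the sample and background scans must be recorded on the same wavenumber grid, which is one reason the acquisition parameters have to match.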

Spectral acquisition
Further details regarding spectral acquisition using FT-IR spectroscopy can be found in our previous protocols 33 .

Spectra preprocessing
Pre-processing of the spectrum is a significant issue, as the application and effectiveness of preprocessing can greatly affect the results of the computational analysis. The pre-processing of spectra can correct variation related to spectrum acquisition, improve the robustness and accuracy of subsequent multivariate analyses, remove outliers, reduce the dimensionality of the data, and increase data interpretability by correcting issues associated with spectral data acquisition 82 . Preprocessing includes the following processes: spectrum subtracting, baseline correction, derivatization, normalization, and spectrum window selection 83 . An illustration of the recommended preprocessing workflow based on FT-IR spectra is given in Fig. 4.
Normalization: Normalization has been identified as one of the most important pre-processing methods and is commonly applied to minimize the effects of varying optical path lengths on the data, or to compensate for intensity variations in the source. The result of normalization is a spectrum that is simultaneously scaled and offset-corrected 84 . The most popular normalization method is vector normalization. Rather than a simple scaling operation, it ensures that all spectra within a dataset have the same vector length of 1.
The vector of absorption intensities $\mathbf{x}$ can be converted to its corresponding unit vector by dividing each element, $x_i$, by the vector length, $\lVert \mathbf{x} \rVert$ 86 :
$$\hat{x}_i = \frac{x_i}{\lVert \mathbf{x} \rVert} = \frac{x_i}{\sqrt{\sum_{j} x_j^{2}}} \qquad [1]$$
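As an illustration of vector normalization, the sketch below divides each intensity by the Euclidean length of the spectrum. The function name and the two-point example are hypothetical; some implementations also mean-center the spectrum first, which is omitted here.

```python
import math

def vector_normalize(spectrum):
    """Scale a spectrum so that its Euclidean vector length equals 1."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return [v / norm for v in spectrum]

x = [3.0, 4.0]                      # hypothetical two-point "spectrum"
x_hat = vector_normalize(x)         # vector length of x is 5
length = math.sqrt(sum(v * v for v in x_hat))
```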
Baseline correction: In transmission or transflection type IR spectroscopy, spectral baselines can be distorted as a result of scattering, absorption by the supporting substrate, changing conditions during data collection, or variance due to instrumental factors. Such baseline distortions are critical when the absorbance values are systematically evaluated. Although baseline correction methods may rely on distinct principles and algorithms, they have the common objective of minimizing unwanted spectral offsets, broad baseline distortions, positive or negative slopes, and other baseline effects in vibrational spectra. Subtracting an estimate of the background from the unprocessed spectrum leads to a more interpretable signal, allowing spectral parameters to be determined more accurately 85 . A popular baseline technique available with Bruker Opus software is rubber-band baseline correction, which stretches the spectra down so that the minima in the spectral region of interest can be used to fit a convex polygonal line (i.e., the rubber band), which is then subtracted from the original spectrum.
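The rubber-band idea can be sketched as the lower convex hull of the spectrum, interpolated linearly between hull points and then subtracted. This is an illustrative stand-alone version, not the Opus implementation; it assumes at least two points with strictly increasing x values, and the data are hypothetical.

```python
def rubber_band_baseline(x, y):
    """Lower convex hull of (x, y), evaluated at every x (x strictly increasing)."""
    hull = []  # indices of the points that lie on the lower hull
    for i in range(len(x)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (x[b] - x[a]) * (y[i] - y[a]) - (y[b] - y[a]) * (x[i] - x[a])
            if cross <= 0:      # point b lies on or above the line a -> i
                hull.pop()
            else:
                break
        hull.append(i)
    baseline, seg = [], 0
    for i in range(len(x)):
        # advance to the hull segment that contains x[i]
        while seg < len(hull) - 2 and x[i] > x[hull[seg + 1]]:
            seg += 1
        a, b = hull[seg], hull[seg + 1]
        t = (x[i] - x[a]) / (x[b] - x[a])
        baseline.append(y[a] + t * (y[b] - y[a]))
    return baseline

x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 5.0, 3.0]       # two peaks on a sloping baseline
baseline = rubber_band_baseline(x, y)
corrected = [yi - bi for yi, bi in zip(y, baseline)]
```

FT-IR files are often stored with decreasing wavenumber, so the arrays may need to be reversed before calling such a function.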
Polynomial baseline correction is intended to be used when distortion of the spectra due to differentiation is not desired 86 . The method is based on a least-squares polynomial curve-fitting function. The degree of the polynomial, k, is set by user input. Equation 2 shows the generalized polynomial formula, where M is the number of terms associated with the polynomial at a particular degree. The measured spectrum $y$ with number of data points n becomes the fitted spectrum $\hat{y}$; a represents the constant, and x represents the indeterminate of the polynomial. The residual sum-of-squares (RSS) between these two is given by Equation (3).
$$\hat{y}_i = \sum_{m=0}^{M-1} a_m x_i^{m} \qquad [2]$$
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^{2} \qquad [3]$$
These equations describe a standard polynomial least-squares fit. However, the implementation becomes a more useful baseline correction approach by iterating the smoothing 100 times (by default), with a tolerance for the change in RSS between iterations of 0.001 by default.
During this process, the small peak from 2,000 to 1,750 cm -1 resulting from water vapor can be diminished manually. The peak can be leveled by connecting lines through the selected points and subtracting them from the trace. This method can remove the baseline slope and offset by an iterative fitting process. The algorithm should be used before resolution enhancement techniques.
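A minimal sketch of an iterative polynomial baseline fit follows, using the defaults stated above (100 iterations, RSS tolerance of 0.001). The text does not specify what changes between iterations; clipping points that lie above the current fit is an assumption borrowed from commonly used modified polynomial fitting, and all function names are hypothetical.

```python
import math

def polyfit(x, y, degree):
    """Least-squares coefficients a_0..a_degree via the normal equations."""
    m = degree + 1
    A = [[sum(xi ** (r + c) for xi in x) for c in range(m)] for r in range(m)]
    b = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(m)]
    for col in range(m):  # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    a = [0.0] * m
    for r in range(m - 1, -1, -1):  # back substitution
        a[r] = (b[r] - sum(A[r][c] * a[c] for c in range(r + 1, m))) / A[r][r]
    return a

def polynomial_baseline(x, y, degree=2, max_iter=100, tol=0.001):
    """Iteratively fit a polynomial, clipping points above the fit (assumption)."""
    lo, hi = min(x), max(x)
    u = [2.0 * (xi - lo) / (hi - lo) - 1.0 for xi in x]  # rescale x to [-1, 1]
    work, prev_rss, fit = list(y), None, list(y)
    for _ in range(max_iter):
        a = polyfit(u, work, degree)
        fit = [sum(c * ui ** j for j, c in enumerate(a)) for ui in u]
        rss = sum((w - f) ** 2 for w, f in zip(work, fit))
        work = [min(w, f) for w, f in zip(work, fit)]  # peaks stop pulling the fit up
        if prev_rss is not None and abs(prev_rss - rss) < tol:
            break
        prev_rss = rss
    return fit

# Hypothetical spectrum: linear baseline plus one Gaussian band at x = 10
x = [float(i) for i in range(21)]
y = [0.5 + 0.1 * xi + math.exp(-0.5 * (xi - 10.0) ** 2) for xi in x]
baseline = polynomial_baseline(x, y, degree=1)
corrected = [yi - bi for yi, bi in zip(y, baseline)]
```

Rescaling x before fitting keeps the normal equations well conditioned when wavenumbers in the thousands are raised to higher powers.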
Analytical window selection: Bands known to be in the 900-1200 cm -1 region include the P-O-C and C-O-C stretches of mostly oligo- and polysaccharides 40,87 . The multiplet of peaks in this cluttered spectral region contains many polysaccharide or sugar C-O-C bands. This is important, as the relative amplitudes of these bands are what give this multiplet of peaks subtle, but very reproducible, band shapes for differentiating species. The polysaccharide and nucleic acid region (900-1200 cm -1 ) is the only region that is not disturbed by H 2 O under certain conditions 17,22 . The typing results based on this region are much better than with any other region or combination of regions 18 .
Derivative analysis: Recently, mathematical tools have been applied to minimize confounding effects and to control optical or distorting influences 27,[88][89][90] . First or second derivative transformations can resolve broad overlapping bands in raw spectra, reduce replicate variability, correct baseline shift, and amplify spectral variations [91][92][93][94][95][96][97] . Second derivative spectra are most commonly used for bacterial classification 98 , typing, and identification 91,95,99,100 . Third or higher order derivative spectra tend to distort the spectrum and are not recommended.
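Window selection and a second derivative can be sketched as follows. This is a simplified illustration: it uses a plain central finite difference on an evenly spaced grid, whereas smoothing derivative filters are often preferred in practice for noisy spectra. Function names and data are hypothetical.

```python
def select_window(wavenumbers, absorbances, low=900.0, high=1200.0):
    """Keep only the points whose wavenumber lies inside [low, high] cm^-1."""
    pairs = [(w, a) for w, a in zip(wavenumbers, absorbances) if low <= w <= high]
    return [w for w, _ in pairs], [a for _, a in pairs]

def second_derivative(y, step):
    """Central finite-difference second derivative (drops the two end points)."""
    return [(y[i - 1] - 2.0 * y[i] + y[i + 1]) / step ** 2
            for i in range(1, len(y) - 1)]

wn, ab = select_window([800.0, 900.0, 1000.0, 1200.0, 1300.0],
                       [1.0, 2.0, 3.0, 4.0, 5.0])
d2 = second_derivative([0.0, 1.0, 4.0, 9.0, 16.0], step=1.0)  # y = x**2
```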

Multivariate data analysis for bacterial typing and identification
Multivariate data analysis involves the application of several statistical methods to determine the similarities and differences between bacterial strains, and to identify the spectral ranges that relate most strongly to these similarities and differences 101 . Different illustrations of multivariate data analysis can be found in Fig. 5. The two-dimensional projection score plots of 7 standard E. coli strains analyzed by PCA are shown in Fig. 5A. The relationship between these strains analyzed by HCA is shown in Fig. 5B. A heat-map overview allows one to visualize the data set in a clear way, and the segregation between bacterial classes can be seen in Fig. 5C. Users can choose any of these visualizations.
Multivariate data analysis can be divided into supervised and unsupervised methods 25,27,102 . In unsupervised methods, such as PCA and HCA, groups are established according to the similarity between spectra 9 . Typing and differentiation of groups of related isolates does not necessarily rely on existing spectral databases or prior knowledge, but the identification of unknown isolates depends on the comparison of spectra with a reference database. Supervised methods are usually used to identify samples, as they use the category membership of samples to define classes 28 . Partial least squares discriminant analysis (PLSDA) or ANNs are powerful for extracting specific spectral signatures 27,97,103 .

The only commercial database created for bacterial species identification is that of Bruker Optik, and most studies have relied on in-house databases 104 .
Cluster analysis: In cluster analysis, a matrix is calculated that expresses the similarity, or "distance", between each spectrum and all other spectra of the data. There are several ways to evaluate the distance between or similarity of two spectra. The simplest way is to calculate the EUCLIDEAN distance between vectors. Other methods, such as CORRELATION and COSINE, calculate the Pearson correlation coefficient and the angle between the vectors, respectively. For convenience, the results are also translated into a distance value.
For two spectra, S and R, in the data hypercube, this distance is defined as the correlation coefficient
$$C_{SR} = \frac{\sum_{i=1}^{M} (S_i - \bar{S})(R_i - \bar{R})}{\sqrt{\sum_{i=1}^{M} (S_i - \bar{S})^{2} \, \sum_{i=1}^{M} (R_i - \bar{R})^{2}}} \qquad [4]$$
Here, the spectra S and R are represented by one-dimensional vectors of M absorbance values, and $\bar{S}$ and $\bar{R}$ are the mean values for each vector. The resulting $C_{SR}$ matrix contains $N^{2}$ entries, where N is the total number of spectra within the data set. As the matrix is symmetric, only N(N-1)/2 spectral distance elements $C_{SR}$ need to be computed. Subsequently, the two most similar spectra in the hypercube are merged into a "cluster", and a new distance matrix column is calculated for the new cluster and all existing spectra. The process of merging spectra or clusters into new clusters is repeated, and $C_{SR}$ is recalculated until all spectra have been combined into a few clusters. The mean spectra are extracted for all clusters and used for interpreting the chemical or biochemical differences between clusters. Reasonable noise in the spectral data does not affect the clustering process. In this respect, cluster analysis is much more stable than other methods of multivariate analysis in which an increasing amount of noise accumulates in the less relevant clusters.
Hierarchical cluster analysis: HCA and dendrograms are used to show analogies between bacterial spectra. Items closely related are clustered together on a branch with short twigs. The lines parallel to the item axis always connect two clusters (or single items) at a height that indicates at which distance the two clusters were merged into one. Therefore, in dendrograms, the left vertical axis describes the increasing variance or heterogeneity. The magnitude of this heterogeneity is based on the number of spectra in a cluster and the analogies between them.
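The spectral distances used in cluster analysis can be sketched as below. Translating the Pearson correlation into a distance as 1 - r is an assumption made for illustration (software packages may scale it differently); function names and data are hypothetical.

```python
import math

def pearson(s, r):
    """Pearson correlation coefficient between two equally long spectra."""
    n = len(s)
    ms, mr = sum(s) / n, sum(r) / n
    num = sum((a - ms) * (b - mr) for a, b in zip(s, r))
    den = math.sqrt(sum((a - ms) ** 2 for a in s) * sum((b - mr) ** 2 for b in r))
    return num / den

def euclidean(s, r):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, r)))

def correlation_distance(s, r):
    return 1.0 - pearson(s, r)   # assumed conversion: 0 = identical shape

def distance_matrix(spectra, metric):
    """Symmetric N x N matrix; only N(N-1)/2 distances are actually computed."""
    n = len(spectra)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = metric(spectra[i], spectra[j])
    return d

spectra = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
D = distance_matrix(spectra, correlation_distance)
```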
When starting an agglomerative HCA, the distances calculated for all N items are sorted, and the smallest distance between two single items leads to their merging into the first cluster. The second smallest distance is then processed. This is easy as long as only single items are involved. For the distance between a cluster and a single item, or between two clusters, a linkage rule is needed; the two most useful linkage types are the AVERAGE linkage and WARD's algorithm, which are recommended for a first approach.
Though WARD often results in very nice and biologically meaningful clustering, one of its disadvantages is the distance scale, which varies with the number of items analyzed. In contrast, AVERAGE shows a scale that is independent of the number of items. This allows comparisons of completely different data sets and defining cut-off values for the distance to decide whether isolates are indistinguishable, closely related, or unrelated. Cluster analysis was performed by importing data hypercubes in the Bruker OPUS 3.0 format software package 105 .
With regard to isolates and spectra, the most important goal for exploration of the presented data is to find a reasonable cut-off value for the distance to decide which isolates fall into the same cluster and are considered indistinguishable. There are at least two requirements to consider: the cut-off value should be as low as possible to achieve high discriminatory power, and it is only truly discriminatory if spectra that belong together (e.g., the technical replicate measurements) are not scattered over different clusters but stay together in one cluster. When looking at a data set that contains only isolates that were measured once (only technical replicates present), the correct cut-off value will likely be underestimated. Therefore, we advise the user to add at least three independent measurements of some well-known and different isolates.
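Agglomerative clustering with AVERAGE linkage and a distance cut-off can be sketched as follows. This is a simple quadratic-time illustration rather than the algorithm of any particular software package; the distance matrix and the cut-off value are hypothetical.

```python
def average_linkage_clusters(dist, cutoff):
    """Merge the closest clusters until the smallest average distance exceeds cutoff."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # AVERAGE linkage: mean of all pairwise distances between members
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > cutoff:
            break                      # remaining clusters are considered distinct
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]

# Hypothetical distances: items 0/1 and items 2/3 are close, the pairs are far apart
dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.2],
    [1.0, 1.0, 0.2, 0.0],
]
groups = average_linkage_clusters(dist, cutoff=0.5)
```

Isolates that end up in the same group at the chosen cut-off would be treated as indistinguishable, so replicate measurements of one isolate should fall into a single group.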
Principal components analysis: PCA is one of the most common techniques for exploratory analysis. The original data are decomposed into a few principal components (PCs) responsible for most of the variance within the original dataset. The PCs are orthogonal to each other and are generated in a decreasing order of explained variance, so that the first PC represents most of the original data variance, followed by the second PC, and so forth 106 . Mathematically, the decomposition takes the form:
$$X = T P^{\mathrm{T}} + E \qquad [5]$$
where X represents the preprocessed data (e.g., preprocessed sample spectra), T is the score, P is the loading, and E is the residual.
For standardization applications, PCA is a fast and reliable tool for determining whether differences exist between the spectra acquired by different systems. The PCA loadings represent the variance in the variable (e.g., wavenumber) direction and are used to detect the variables with the highest importance for the pattern observed in the scores. The PCA scores represent the variance in the sample direction, and they are used to assess similarities/dissimilarities among the samples, detecting clustering patterns.
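To illustrate the decomposition into scores, loadings, and residual, the sketch below extracts only the first principal component of mean-centered data by power iteration. It is a teaching example under that simplification, not a full PCA; names and data are hypothetical.

```python
import math

def first_principal_component(X, n_iter=200):
    """Return scores t, loading p, and residual E for the first PC of centered X."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - means[j] for j in range(m)] for row in X]
    p = [1.0 / math.sqrt(m)] * m                       # unit-length start vector
    for _ in range(n_iter):
        t = [sum(r[j] * p[j] for j in range(m)) for r in Xc]                # t = Xc p
        w = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]      # w = Xc^T t
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        p = [v / norm for v in w]
    t = [sum(r[j] * p[j] for j in range(m)) for r in Xc]
    E = [[Xc[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
    return t, p, E

# Hypothetical data: three "spectra" with two variables lying on one line
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
scores, loading, residual = first_principal_component(X)
```

In this example all variance lies along one direction, so the residual is essentially zero after the first component.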
Feature extraction: Feature extraction can be split into two distinct approaches: feature construction and feature selection 26 . Briefly, feature construction can be defined as the creation of new features in a data set that can infer otherwise obscured information, such as the previously mentioned linear methods PCA and PLS. This can be especially important for diagnostics, biomarker extraction, and pattern recognition in otherwise homogeneous data sets, and it plays an important role in hyperspectral imaging, as individual pixels can be reduced to single values related to spectral intensity or variance 19 . Feature selection approaches extrapolate existing features from the data set, such as specific wavenumbers, that can be used to determine spectral biomarkers and/or feed into diagnostic frameworks 107 . Techniques such as genetic algorithm, multivariate curve resolution, and successive projection algorithm are particularly popular as feature-extraction methods, as only informative variables are included in the resultant model 108 .
Identification: Identification methods can also be separated into two types: unsupervised and supervised. In unsupervised methods, a similarity comparison algorithm is usually directly used to calculate the spectral distances between unknown bacteria and known bacteria included in a pre-constructed library, or by introducing it in the HCA 10 , without artificially correcting parameters and results in the calculation process. By judging the similarity distance, bacteria can be identified at the genus, species, or subspecies level. The Pearson correlation coefficient and Euclidean distance are frequently used as distance measures, and Ward's linkage algorithm is most commonly employed in HCA to construct clusters 109 . The Pearson correlation coefficient is defined as
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2}} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}} \qquad [6]$$
in which $x_i$ and $y_i$ represent the intensity values of features in two profiles, $x$ and $y$, with means $\bar{x}$ and $\bar{y}$, and $n$ represents the number of features in each profile 110 .
In supervised methods, each sample is initially assigned to a definite class (genus, species, strain, etc.) according to prior knowledge, and the relationship between the data and the considered class is then built using methods such as discriminant analysis (DA) and canonical variate analysis (CVA) 10 , with the purpose of forming weighted linear combinations of the data that minimize within-group variance and maximize between-group variance. The unknown bacteria can then be identified by recognizing the class assigned to them after a projection transformation using this relationship with their feature data as the input. Compared to the supervised methods, the unsupervised methods have higher efficiency and lower expertise requirements, making them, particularly library-based methods, widely used in various fields for routine bacterial identification 110 .
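Library-based identification by similarity comparison can be sketched as a search for the reference spectrum with the highest Pearson correlation to the unknown. The library labels and spectra below are hypothetical placeholders, and a real workflow would also apply a minimum-similarity threshold before reporting a match.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)) * math.sqrt(sum((b - my) ** 2 for b in y))
    return num / den

def identify(unknown, library):
    """Return (label, r) of the most similar reference in library = {label: spectrum}."""
    best = max(library, key=lambda label: pearson(unknown, library[label]))
    return best, pearson(unknown, library[best])

# Hypothetical preprocessed reference spectra
library = {
    "reference A": [0.1, 0.5, 0.9, 0.4],
    "reference B": [0.9, 0.4, 0.1, 0.6],
}
unknown = [0.2, 0.6, 1.0, 0.5]      # same shape as reference A, shifted by 0.1
label, r = identify(unknown, library)
```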
The precise feature-extraction mentioned above is key for high-quality library construction and accurate identification, and the lower experimental standardization requirements (e.g., bacterial growth and culture media) of the features result in more stable and reliable identification. A reference strain (e.g., E. coli 8739) or standard reference material is often used as quality control to test the accuracy of the model 111 .
ANN analysis: An alternative to creating a classification model that represents the hierarchy is a supervised bioinformatics approach, such as ANN analysis 112 . Compared to cluster analysis, ANN analysis has the advantage of objective validation of the results. ANNs can be used in identification by analyzing raw data and work well for managing overlapping data because they are a pattern analysis method of advanced multivariate data processing in which large amounts of information are analyzed by training a pattern recognition algorithm to recognize the particular combination of variables in a subset of data 113 . The general strategy of ANN analysis includes teaching and optimizing the network models, followed by testing the classifiers with independent (external) validation data sets 114 . Teaching and internal validation are carried out on the basis of IR spectra with known class assignment using spectra from the database. External validation of the classifier is performed by generating ANN images from infrared spectral maps. Thus, the classifiers are created with database spectra, whereas the model is assessed by comparing the ANN images 81 . The primary advantage of this method is the capability to separately train and validate small and flexible networks that can be combined to build large modular ANN systems 115,116 .

• Standard strains (China Center of Industrial Culture Collection [CICC] and American Type Culture
• Isolation strains (Huashan Hospital of Shanghai, Shanghai, China).
• Autoclave (Shanghai Boxun Industry & Commerce Co., Ltd.).
Our protocol for the preparation of bacterial samples can also be applied using other FT-IR spectrometers from well-known companies (e.g., Shimadzu [Japan]). Alternative software from commercial or academic sources, as described in Box 1, can also be used.

▲ CRITICAL
• FT-IR spectroscopy equipment: for a list of available commercial instruments, please refer to Table 2.
• CaF2 liquid cells: a demountable transmission cell with spacer can be purchased from Specac.
• Silicon microtiter plates (Bruker Part No. I23258P) and microtiter plate frame.

Reagent setup
• TSA medium: 38 g TSA medium powder added to 1 liter of double distilled water. The autoclaved mixture can be stored at room temperature.
• LB broth medium: 25 g LB broth medium powder added to 1 liter of double distilled water. The autoclaved mixture can be stored at 4°C until use.
▲ CRITICAL Fresh liquid media or autoclaved liquid media can be stored at 4°C until use (several weeks). For plating, heat TSA solid media until dissolved, cool to 50°C, and plate in Petri dish. Store the agar plates in a refrigerator at 4°C up to a month.
• 0.9% (wt/vol) NaCl solution: 9 g NaCl added to 1 liter of double distilled water. The autoclaved mixture can be stored at 4°C until use.
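When a volume other than 1 liter is prepared, the powder masses listed above scale linearly. The helper below is a small illustrative sketch of that arithmetic; the dictionary keys and the function name are hypothetical.

```python
# Grams of powder per liter of double distilled water, as listed above
RECIPES_G_PER_L = {
    "TSA medium": 38.0,
    "LB broth medium": 25.0,
    "0.9% NaCl": 9.0,
}

def grams_needed(recipe, volume_ml):
    """Mass of powder (g) required for the requested volume (mL)."""
    return RECIPES_G_PER_L[recipe] * volume_ml / 1000.0

tsa_250 = grams_needed("TSA medium", 250)       # 9.5 g
lb_500 = grams_needed("LB broth medium", 500)   # 12.5 g
nacl_100 = grams_needed("0.9% NaCl", 100)       # 0.9 g
```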

Equipment setup
Hardware
• Infrared spectroscopy system. For a list of available commercial instruments, please refer to Table 2.

Software
• Spectral acquisition software normally provided by the instrument manufacturer (Table 2).

2| Sample prepared for FT-IR spectroscopy
Transfer the collected bacterial cells to appropriate FT-IR slides.
(ii) Sample drying. Dry the bacterial suspension at 40°C for 30 min. The spectra of the samples on these slides can be collected without interference from coating. ! CAUTION Control the exposure temperature and exposure time. If the sample is exposed to a higher temperature or for a longer time, the bacterial film will swell, crack, and flake off.
(iii) Turn on the IR source.

? TROUBLESHOOTING
(iv) Turn on the FT-IR spectrometer.

▲ CRITICAL STEP The IR Biotyper should be permanently switched on.
▲ CRITICAL STEP Each time after switching on the instrument, you must wait for ~1 h after powerup for the IR source to stabilize.

? TROUBLESHOOTING
(v) Turn on the computer and monitor, and then start the FT-IR collection program.
(vi) Set the data path to store your data.
(iii) Set the data path to store your data.
(iv) Clean the ATR internal reflection element (diamond, zinc selenide, germanium, or silicon) with distilled water and dry it with tissue. ▲ CRITICAL STEP Make sure that the crystal is thoroughly cleaned and dried before background acquisition.

▲ CRITICAL STEP
A prepared and dried sample plate can be stored for several days in a dry and dust-free place at room temperature without loss of sample quality.
(v) Place the slide in contact with the ATR internal reflection element. ▲ CRITICAL STEP Ensure that the ATR internal reflection element is completely covered by the sample and that the minimum sample thickness is 3~4 times the depth of penetration to ensure that there is no interference from the substrate.

(C) Diffuse reflectance mode
(i) To start the measurement, click Start Acquisition; the drawer will open and a dialog will request insertion of the microtiter plate.
(ii) Insert the frame with the microtiter plate in the IR Biotyper instrument. After the drawer at the right side of the IR Biotyper instrument opens, put the frame with the inserted microtiter plate in the holding fixture. Make sure that the frame corner labeled A1 is placed in the marked corner of the X/Ystage (recess with red point). Otherwise, when you evaluate the measurement results afterwards, the sample position and spectrum will not correspond with each other.
(iii) Click Start on the graphical user interface to close the drawer and start the measurement.
! CAUTION During spectrum acquisition, the IR instrument can apply a quality control that checks different spectral properties, e.g., absorbance intensity, SNR, and water vapor disturbance. The control sample can be a standard sample of inactivated bacteria. Only spots whose spectra pass this quality test are used; failed spectra are not taken into account in further calculations. (Optional) Bruker Infrared Test Standards (Bruker Part No. 1851760) can be purchased and used for quality control. Two standard samples are usually used (IRTS1 and IRTS2).

! CAUTION
The cell should be delivered in a clean, ready-to-use state and packed in a reclosable plastic bag. After use they can be cleaned. For the microtiter plate, if enough free spots are still available, it can be used for another measurement run. You can clean the microtiter plate by covering the surface with deionized water and carefully rubbing the plate with a sponge or cleaning tissue. To remove any fatty residues, repeat the cleaning with 60% (v/v) isopropanol. To dry the microtiter plate, use clean pressurized air (oil-free) or a clean and dust-free cleaning cloth.

? TROUBLESHOOTING
(iv) Turn on the drying equipment for the instrument. ▲ CRITICAL STEP Dry air or nitrogen should completely displace the water vapor.
(v) Turn on the IR source.

? TROUBLESHOOTING
(vi) Turn on the FT-IR spectrometer.

▲ CRITICAL STEP
The IR Biotyper should be permanently switched on. ▲ CRITICAL STEP Each time after switching on the instrument, you must wait for ~1 h after powerup for the IR source to stabilize.

? TROUBLESHOOTING
(vii) Turn on the computer and monitor, and then start the FT-IR collection program.
(viii) Set the data path to store your data.

4| Collecting FT-IR spectra •TIMING 1-3 h
Set the spectrum collection parameters based on the chosen FT-IR instrument. The options are transmission mode (option A), ATR mode (option B), Biotyper (option C), or A/R (option D).

(C) Diffuse reflectance mode
Spectra are collected over the wavenumber range of 4,000 to 600 cm -1 at a rate of 20 scans per second 24 with a resolution of 4 cm -1 . To improve the SNR, 128 spectra are co-added and averaged.

5| Collect a background spectrum without the IR cell.
This spectrum is used to remove spectral signals that originate from air, moisture (water vapor), and coating materials on the reflecting mirrors along the IR radiation path from the spectra of the protein and buffer in order to subtract the background noise. ▲ CRITICAL STEP If the water vapor is not purged completely by dry air, there will be peaks in the regions of 1,500-1,200 cm -1 and 4,000-3,500 cm -1 . If these peaks appear, you should check the tubing and ensure no air leakage is occurring. You should also check the liquid desiccant system and drying equipment, and then purge for a longer period of time.
▲ CRITICAL STEP Air bubbles and/or empty areas will result in poor absorption spectra caused by fringing from channel spectra, dispersion of the IR beam by bubbles, and/or incorrect absorbance from lack of sample.
! CAUTION A background spectrum should be taken if atmospheric changes occur (e.g., if a door is suddenly opened).
? TROUBLESHOOTING

6| Collect a subtractive spectrum.
▲ CRITICAL STEP It is important to acquire the spectrum for the buffer before acquiring the spectrum for the sample.

7| Collect a spectrum for one type of bacteria.
▲ CRITICAL STEP The sample cell must be handled with care. Make sure the bacterial film does not flake off.
8| Save the raw data on the hard drive or other media. ▲ CRITICAL STEP Always save the data immediately to prevent loss if there is a power failure.

9| Clean the cell when finished.
Wash the sample cell with ethyl alcohol (~5 mL volume), and then with an excess of water (~10 mL volume). Finally, gently wipe the windows with lens cleaning paper to keep them clean.

! CAUTION
The overriding principle of experimental disposal is that all waste should be decontaminated, autoclaved, or incinerated within the laboratory to guarantee "zero leaking" of infectious biohazards.
10| Upon completion of all measurements, use the function in the software to compute the absorbance spectra.
11| Subtract the reference spectra from the bacterial spectra to remove the background signals (signals from water vapor, CO2 gas, NaCl, etc.).

▲ CRITICAL STEP
To obtain high-quality bacterial IR spectra, the background spectra must be carefully and adequately subtracted from the obtained sample spectrum. Any uncompensated absorption bands in the original spectra will be enhanced by the derivative analysis due to degradation of the SNR.
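The absorbance computation and reference subtraction described above can be sketched in Python as follows. This is a minimal illustration and not the routine of any instrument software: it assumes single-beam intensity spectra recorded on a common wavenumber grid, uses the standard definition A = -log10(I_sample / I_background), and introduces a hypothetical `factor` argument for scaling the reference before subtraction.

```python
import numpy as np

def absorbance(sample_single_beam, background_single_beam):
    """Absorbance A = -log10(I_sample / I_background), point by point."""
    sample = np.asarray(sample_single_beam, dtype=float)
    background = np.asarray(background_single_beam, dtype=float)
    if np.any(sample <= 0) or np.any(background <= 0):
        raise ValueError("single-beam intensities must be positive")
    return -np.log10(sample / background)

def subtract_reference(sample_absorbance, reference_absorbance, factor=1.0):
    """Subtract a scaled reference spectrum (factor is a hypothetical knob)."""
    sample = np.asarray(sample_absorbance, dtype=float)
    reference = np.asarray(reference_absorbance, dtype=float)
    return sample - factor * reference
```

A `factor` of 1.0 corresponds to plain subtraction; any other value would have to be justified by inspecting the residual water vapor or buffer bands.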

12| Perform baseline correction.
This is a pre-processing step that accounts and corrects for noise and sloping baseline effects. Rubber band baseline correction fits a convex polygonal line whose vertices are the ''troughs'' of the spectrum and subtracts it from the spectrum. In manual point baseline correction, the user picks the wavenumber locations of the vertices of the polygonal line, and the area under this line is subtracted from the absorption spectrum.
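As an illustration of the rubber band idea, the following Python sketch takes the lower part of the convex hull of the spectrum as the baseline. It is an assumption-laden example rather than the algorithm of any particular vendor software: the function name `rubber_band_baseline` is hypothetical, and the hull is computed with SciPy's `ConvexHull`.

```python
import numpy as np
from scipy.spatial import ConvexHull

def rubber_band_baseline(wavenumbers, absorbance):
    """Return the lower convex-hull ("rubber band") baseline of a spectrum."""
    x = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(absorbance, dtype=float)
    order = np.argsort(x)              # work on ascending wavenumbers
    xs, ys = x[order], y[order]
    hull = ConvexHull(np.column_stack([xs, ys]))
    v = hull.vertices                  # counter-clockwise order in 2D
    # Start the walk at the left-most point; the vertices up to the
    # right-most point then trace the lower hull from left to right.
    v = np.roll(v, -np.argmin(v))
    lower = v[: np.argmax(v) + 1]
    baseline_sorted = np.interp(xs, xs[lower], ys[lower])
    baseline = np.empty_like(baseline_sorted)
    baseline[order] = baseline_sorted  # restore the input ordering
    return baseline
```

The corrected spectrum is then `absorbance - rubber_band_baseline(wavenumbers, absorbance)`, which is non-negative up to rounding because the lower hull never lies above the data.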

13| Savitzky-Golay differentiation
(i) From the applications menu, select the derivative function (first or second differentiation).
(iii) Save the second-derivative spectrum.

▲CRITICAL STEP
The derivative spectra show the details much better, allowing the differences between different strains to be visualized. In particular, second-derivative spectra (which, for ease of visual inspection, are often inverted around the frequency axis) are well suited for identifying component bands in a complex spectral region.

▲CRITICAL STEP
The sharp and narrow absorption peaks from residual water vapor would be greatly enhanced by the second-derivative analysis and distort the protein amide I bands. Discard points below a threshold, and then fit the remaining points with a straight line.
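For readers who work outside the vendor software, the Savitzky-Golay second derivative can be sketched with SciPy's `savgol_filter`. The window length of 9 points and the polynomial order of 3 are illustrative assumptions, not values prescribed by this protocol, and `second_derivative` is a hypothetical helper name.

```python
import numpy as np
from scipy.signal import savgol_filter

def second_derivative(absorbance, spacing, window_length=9, polyorder=3):
    """Savitzky-Golay second derivative of an evenly sampled spectrum.

    spacing is the wavenumber step between neighbouring points (cm^-1).
    window_length and polyorder are assumed defaults; tune them per dataset.
    """
    return savgol_filter(
        np.asarray(absorbance, dtype=float),
        window_length=window_length,
        polyorder=polyorder,
        deriv=2,
        delta=spacing,
    )
```

A band maximum in the absorbance spectrum appears as a minimum in the second derivative, which is why such spectra are often plotted inverted.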

14| Smoothing and data normalization.
This step can be done using min-max normalization or vector normalization.
(Optional) Scale the variables: this can be done by standardization (normalization of variables to zero mean and unit s.d.) or by normalization to a 0-1 range.
▲ CRITICAL STEP Prior to conducting any kind of data analysis, it is important to assess the overall data quality and determine whether there are any obvious outliers. Smooth and normalize the amide I band area. Smoothing removes the white noise that can result from applying the second-derivative function. Because spectra collected on different instruments can have quite different SNRs even at the same protein concentration, especially at lower concentrations, smoothing and normalization are necessary.
! CAUTION As excessive smoothing can build up side lobes and periodic noise, which may be confused with true spectral features, the amount of smoothing should be kept to a minimum.
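The two normalization options named in this step can be sketched as follows. The definitions are assumptions for illustration: vector normalization is taken here as division by the Euclidean (L2) norm (some software packages also mean-center the spectrum first), and min-max normalization rescales the spectrum to the 0-1 range.

```python
import numpy as np

def vector_normalize(spectrum):
    """Scale a spectrum to unit Euclidean length (assumed definition)."""
    s = np.asarray(spectrum, dtype=float)
    norm = np.linalg.norm(s)
    if norm == 0:
        raise ValueError("cannot normalize an all-zero spectrum")
    return s / norm

def min_max_normalize(spectrum):
    """Rescale a spectrum so that its minimum is 0 and its maximum is 1."""
    s = np.asarray(spectrum, dtype=float)
    span = s.max() - s.min()
    if span == 0:
        raise ValueError("cannot normalize a constant spectrum")
    return (s - s.min()) / span
```

Whichever option is chosen, it should be applied identically to every spectrum in the data set, including the reference spectra of a library.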

15| After second-derivative calculation.
Truncate the second-derivative spectrum to the 1,200-900 cm -1 region by selecting the Utilities function from the applications menu and then selecting the ZAP function.
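Outside the vendor software, the same spectral window selection amounts to a simple mask on the wavenumber axis. The sketch below is illustrative only; `select_window` is a hypothetical helper, and the default limits correspond to the 1,200-900 cm -1 region used in this step.

```python
import numpy as np

def select_window(wavenumbers, values, low=900.0, high=1200.0):
    """Keep only the points whose wavenumber lies within [low, high]."""
    w = np.asarray(wavenumbers, dtype=float)
    v = np.asarray(values, dtype=float)
    mask = (w >= low) & (w <= high)
    return w[mask], v[mask]
```

The function works for ascending or descending wavenumber axes because it does not rely on the ordering of the points.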

16| Baseline.
A baseline can be easily obtained by connecting the two most positive points within the selected region.

17| Save the final results.
Save the results as a comma-separated values (CSV) file (.csv) or a text file (.txt).

Bacterial typing, classification, and identification •TIMING 2-5 h
18| Create a new folder in your file system. This will contain the preprocessed files.

19| Assemble the data set.
(1) Execute the following steps in Excel:
(i) Open the text file created in the previous step. This will appear as a table in Excel. The last column of this table is filled with zeros, which means that all spectra are initially assigned to the same class.
(ii) Next, enter the class labels in this column (take care not to miss any cells). Use a different number for each class, starting from 0 or 1. Use numbers only.
(2) (Optional) Use the software IR-data-assemble, which was programmed by our lab and is provided freely.
(ii) Find the file 'IR-data-assemble' and then click 'Open'.
(iii) In the 'Path' field, enter the complete path for your raw files.
(iv) In the 'Path destination' field, enter the complete path for the 'dat' folder created. Click 'OK'. This will convert all raw files into preprocessed 'Pirouette .dat' files.
(v) Click 'Go!', then specify a location and file name to store your merged data set, and click 'Save'.
▲ CRITICAL STEP At least five replicates are required for each group.
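The data set layout described in this step (one spectrum per row, numeric class label in the last column, at least five replicates per group) can also be assembled programmatically. The following sketch is not the IR-data-assemble software; `assemble_dataset` and the dictionary input format are hypothetical and only illustrate the layout.

```python
import numpy as np

MIN_REPLICATES = 5  # "at least five replicates are required for each group"

def assemble_dataset(spectra_by_class):
    """Stack spectra row-wise and append a numeric class label column.

    spectra_by_class maps a group name to a 2D array (replicates x points).
    Groups are numbered from 0 in alphabetical order of their names.
    """
    blocks = []
    for label, name in enumerate(sorted(spectra_by_class)):
        spectra = np.atleast_2d(np.asarray(spectra_by_class[name], dtype=float))
        if spectra.shape[0] < MIN_REPLICATES:
            raise ValueError(f"group {name!r} has fewer than {MIN_REPLICATES} replicates")
        labels = np.full((spectra.shape[0], 1), float(label))
        blocks.append(np.hstack([spectra, labels]))
    return np.vstack(blocks)
```

The resulting table can be written with `np.savetxt(path, table, delimiter=",")` to obtain a CSV file in the same layout.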

20| Multivariate analysis
Analysis of the raw spectra produced by the procedure shown above can be performed with bioinformatics tools (Box 1). We use hierarchical clustering as a simple and powerful tool (Box 1).
We describe two options for achieving this: using a home-made tool (A) or MetaboAnalyst (B).
(A) Home-made tool: The ''home-made tool'' is used to establish a library that can be used to identify individual bacterial samples at different levels. We provide this home-made tool as supporting information. First, several individual spectra are averaged to generate the reference spectrum of each strain and thereby establish the library. The Pearson correlation coefficient is used to compare spectral profiles 117 , and the score derived from the distance is used to indicate the similarity between the unknown and known bacteria. According to the experimental results, the matching scores of these spectra can be divided into ranges: highly probable species identification, probable species identification, reliable genus identification, and unreliable identification. Finally, a confusion matrix can be used to evaluate the typing performance of the library 118 .
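The library workflow described for option A (average replicate spectra per strain, compare by Pearson correlation, evaluate with a confusion matrix) can be sketched as below. This is not the home-made tool itself; all function names are hypothetical, and the raw Pearson coefficient is used directly as the matching score, which is an assumption.

```python
import numpy as np

def build_library(spectra, strain_labels):
    """Average the replicate spectra of each strain into one reference."""
    spectra = np.asarray(spectra, dtype=float)
    library = {}
    for strain in sorted(set(strain_labels)):
        rows = [i for i, s in enumerate(strain_labels) if s == strain]
        library[strain] = spectra[rows].mean(axis=0)
    return library

def identify(unknown, library):
    """Return the best-matching strain and all Pearson correlation scores."""
    scores = {
        strain: float(np.corrcoef(unknown, reference)[0, 1])
        for strain, reference in library.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t], index[p]] += 1
    return matrix
```

The diagonal of the confusion matrix counts correct identifications, so the overall accuracy is its trace divided by the total number of test spectra.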
(B) MetaboAnalyst: MetaboAnalyst uses a step-wise data processing pipeline that guides users through all of the major data processing steps, beginning with data-type selection, formatting, "cleansing," and normalization. It then guides users through a number of data analysis options, ranging from basic univariate (e.g., t tests, analysis of variance [ANOVA]) and multivariate (e.g., PCA) methods to advanced machine learning approaches, such as random forest and support vector machine (SVM) classification.
(iii) Data integrity checking is done to ensure that the data meet the basic requirements for meaningful downstream analysis.
▲ CRITICAL STEP MetaboAnalyst (http://www.metaboanalyst.ca) is a comprehensive Web application for metabolomic data analysis and interpretation. It provides a variety of data processing and normalization procedures and supports a number of data analysis and data visualization tasks using a range of univariate and multivariate methods, such as PCA, heatmap clustering, and machine learning methods. The software was carefully designed to enable researchers with little statistical or computational background to perform complex data analysis procedures commonly used in metabolomic studies.
(iv) Low-quality data filtering. Select the default "Interquantile range (IQR)" and click the Process button.
(v) Select "Normalization by sum" for sample normalization. After normalization, click the Proceed button.
▲ CRITICAL STEP Differences in thickness or concentration can sometimes be the most prominent source of spectral variation between samples, often masking the biochemical differences. To minimize these effects, the spectra in the dataset are scaled to match a specific criterion. Normalization to a particular peak can be applied when such a peak is consistently present in all spectra in the dataset.
(vii) Click Download to generate a .pdf report (Analysis_Report.pdf).
You have the choice between two views: distance matrix and dendrogram. The distance matrix shows all distances of the spectra to each other sorted according to the chosen HCA, whereas the dendrogram gives a more condensed, tree-like overview of the spectral relations. In general, we recommend using the dendrogram view first.

Box 1 COMPUTATIONAL TOOLS FOR BACTERIAL FT-IR PATTERN ANALYSIS
For bacterial data analysis, we use the IR software (ABB). Using this software, we pre-process spectra with the default parameters, applying FT-IR spectral compression, Savitzky-Golay smoothing and second-derivative calculation, and rubber band baseline correction. For data normalization, we use 'maximum norm'.
For species and subspecies typing, we hierarchically cluster the FT-IR spectra of bacterial species and subspecies. Each FT-IR spectrum in a dataset is compared to the other spectra. On the basis of the distance values, we produce a dendrogram using the appropriate function of the statistical toolbox of Biotyper. Taking a list of FT-IR signals and their intensities into consideration, dendrograms are produced by similarity scoring of a set of FT-IR spectra. A correlation function is applied to calculate distance values. In general, species with distance levels <0.15 are reliably classified. However, this value can be changed by the user.
Clearly, the FT-IR spectra analysis for bacterial samples is not restricted to the described software package that is applied in our laboratory. Instead of hierarchical clustering, model construction can also be applied. The user may also test alternative classification procedures and software from other companies. Statistical tools, such as Matlab, are described by Al-Holy et al. 98 . Moreover, specific software for bacterial identification and analysis has been developed by a number of academic research groups.
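As one example of an alternative to the software packages named in this box, hierarchical clustering with a correlation-based distance can be sketched with SciPy. The sketch makes several assumptions that are not taken from the Biotyper software: the distance is 1 minus the Pearson correlation, average linkage is used, and the 0.15 cut-off mentioned above is applied directly to that distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_spectra(spectra, cutoff=0.15, method="average"):
    """Cluster spectra (rows) by correlation distance.

    Returns flat cluster labels and the linkage matrix, which can be
    passed to scipy.cluster.hierarchy.dendrogram for plotting.
    """
    data = np.asarray(spectra, dtype=float)
    distances = pdist(data, metric="correlation")  # 1 - Pearson r
    tree = linkage(distances, method=method)
    labels = fcluster(tree, t=cutoff, criterion="distance")
    return labels, tree
```

Spectra that end up with the same label are joined below the cut-off distance; changing `cutoff` or `method` will change the grouping, so both should be reported with any result.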

Box 2 COMPUTATIONAL TOOLS FOR BACTERIAL IDENTIFICATION
Computational tools for bacterial identification include five main steps: spectral preprocessing, feature-extraction, library construction, distance measuring, and score and prediction.
Spectral preprocessing: Raw spectral profiles always present complex features. In addition to the true signal, two main disturbances can be seen in the spectra: baseline drift, and noise that comes from electrical and/or chemical interference. Spectra obtained under different conditions, such as different instrument parameters or different amounts of sample, can have different spectral intensities, and these differences should be removed for an equally weighted analysis. The valuable features are usually expressed as peak values or as derivatives of the raw data. Consequently, the raw profile data need to be preprocessed to reduce redundant information and noise and to obtain valuable signals.
Baseline correction, smoothing, normalization, first or second derivative, and peak picking are commonly used in this step before mathematization 109 .
Feature-extraction: As mentioned above, feature construction methods, such as PCA and PLS, or feature selection methods, such as genetic algorithm, multivariate curve resolution, and successive projection algorithm, can be employed in this step to extract the most informative variables that will be used later in the resultant model. In this way, the amount of data is greatly reduced and model execution becomes more efficient.
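Of the feature construction methods listed, PCA is the simplest to illustrate. The sketch below computes PCA scores with a singular value decomposition in NumPy; `pca_scores` is a hypothetical helper, and the choice of two components is an assumption for illustration.

```python
import numpy as np

def pca_scores(spectra, n_components=2):
    """Project mean-centered spectra (rows) onto their principal components.

    Returns the scores, the loadings, and the fraction of variance
    explained by each retained component.
    """
    data = np.asarray(spectra, dtype=float)
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, vt[:n_components], explained[:n_components]
```

The score matrix has one row per spectrum and far fewer columns than the original spectrum has data points, which is the data reduction referred to above.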
Library construction: Library construction can also be described as the creation of reference spectra for known bacteria of interest. In general, a number of spectra are collected with the aim of accurately describing the known bacteria, and the reference spectrum is created from the features extracted from these spectra by averaging or by another linear combination algorithm 119 . Precise feature extraction, meaning that high-quality features and just the right number of features are selected in this step, plays an important role in increasing the discriminatory power of library-based approaches, and the lower experimental standardization requirements (e.g., bacterial growth and culture media) of the features allow more stable and reliable identification 10 . According to the importance of these features in distinguishing one strain from another or in categorizing one strain within a group, different weights can be assigned to these features and recorded in the database to achieve more accurate identification.
Distance measure: The similarity between the unknown and known bacteria can be calculated through a distance measure algorithm, most commonly the Pearson correlation coefficient 17,109,110 .
In this algorithm, the values of the unknown and reference spectra at the same feature form a pair, and the two spectra are compared across all such pairs. A higher score means a shorter distance and is obtained when the values in each pair are closer. Bacterial identification often requires analysis of the entire spectrum rather than one or a few biomarker features, and the different weights assigned to the spectrum features allow a more accurate comparison. In addition, strain identification is often facilitated with software employing advanced statistical analyses 110 .
Score and prediction: A score derived from the distance is always used to indicate the similarity between the unknown and known bacteria. In general, a higher score indicates greater similarity, and the highest matching scores and related spectra in the library will be selected as the final results among all of the comparisons between the unknown spectrum and the spectra in the library. Next, a prediction is made according to the score, with a higher score resulting in a more precise result. The highest matching scores of these comparisons with the Biotyper TM database were divided into ranges: highly probable species identification (>2.3), probable species identification (2.0-2.3), reliable genus identification (1.7-2.0), and unreliable identification (<1.7) 120 .
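The score ranges quoted above can be turned into a small lookup, sketched below. Only the thresholds come from the text; the function name is hypothetical, the handling of scores that fall exactly on a boundary (2.0 or 1.7) is an assumption, and the way a score is derived from the spectral distance is not specified here and is therefore not modeled.

```python
def interpret_score(score):
    """Map a matching score to the identification ranges quoted in the text."""
    if score > 2.3:
        return "highly probable species identification"
    if score >= 2.0:  # boundary handling is an assumption
        return "probable species identification"
    if score >= 1.7:  # boundary handling is an assumption
        return "reliable genus identification"
    return "unreliable identification"
```

For example, `interpret_score(2.1)` falls in the 2.0-2.3 range and is reported as a probable species identification.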

Troubleshooting
Troubleshooting advice can be found in Table 6.

Timing
Sample preparation (cell culture on solid medium or in liquid medium, and CaF 2 slide preparation): 1~2 h. Steps 1-5 can be done in advance.

FT-IR spectrum pre-processing 1~2 h
Bacterial typing and classification 2-5 h

Anticipated Results

Typing of E. coli
Hierarchical clustering based on the spectral information contained in the spectral range of 1,200-900 cm -1 , the polysaccharide region, was used for typing. The results are shown as a 2D score plot in Fig. 6 and as a dendrogram in Fig. 7. A clear discrimination of different species can be observed in Fig. 6, and 12 distinct clusters are produced in Fig. 7 for the discrimination of 12 strains of E. coli.

Identification of E. coli and Shigella
An FT-IR spectral library can be employed to identify individual bacteria at different levels. Usually, several individual bacterial spectra are averaged to generate the reference spectrum of each strain and thereby establish the library, and the Pearson correlation coefficient is used for spectral comparison. The score derived from the spectral distance is used to indicate the similarity between the unknown and known bacteria. After comparison, the reference strain with the highest correlation score is regarded as the result of the identification. The library-based identification of 207 FT-IR spectra of E. coli and Shigella strains is shown in Fig. 8. The identification accuracy was 95.2% for distinguishing between E. coli and Shigella at the genus level.
At the species and strain levels, the identification accuracies were 91.8% and 81.2% for all selected strains, respectively.

Fig. 4 Combination of pre-processing steps. The raw spectral data were processed using the following sequence of pre-processing steps: (i) subtract the background spectra and obtain the sample absorbance spectra; (ii) baseline adjustment; (iii) derivative; (iv) normalization; (v) spectral window selection.

References

Supplementary Files
This is a list of supplementary files associated with this preprint.
Equation 1 to 6.docx
Table 1 Band assignment in bacterial FT-IR spectra.docx
Table 3 Troubleshooting.docx
Table 2 Instruments, holders, and corresponding data acquisition software.docx
Bacterial Typing Protocol.docx