LC-MS Peak Assignment Based on Unanimous Selection by Six Machine Learning Algorithms

doi:10.21203/rs.3.rs-845859/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 01 Dec, 2021

Read the published version in Scientific Reports →

You are reading this latest preprint version

Recent mass spectrometry (MS)-based techniques enable deep proteome coverage with relative quantitative analysis, resulting in increased identification of very weak signals accompanied by increased data size of liquid chromatography (LC)–MS/MS spectra. However, identifying weak signals with a poorly performing assignment strategy results in imperfect quantification, with misidentified peaks and ratio distortions. Manually annotating a large number of signals within a very large dataset is not a realistic approach. In this study, therefore, we used machine learning algorithms to extract a higher number of peptide peaks with high accuracy and precision. Our strategy evaluated each peak using six different algorithms; peaks identified as peptide peaks by all six algorithms (i.e., unanimously selected) were assigned as true peaks, which reduced the false-positive rate. Hence, exact and highly quantitative peptide peaks were obtained, providing better performance than that obtained by applying the conventional criteria or using a single machine learning algorithm.

Mass Spectrometry

Biological Chemistry

Mass spectrometry

Proteomics

Machine learning

Peak assignment

Liquid chromatography–mass spectrometry (LC–MS) has advanced remarkably in recent years, and LC-MS–based shotgun proteomics techniques enable the comprehensive identification and quantification of tryptic peptides. Further developments in high-resolution MS capabilities have enabled MS1-based quantitative comparisons of objective and control peptides from extracted ion chromatograms (XICs) ¹,². Shotgun proteomics techniques are thus commonly used in biological research (e.g., identification of disease-specific biomarkers) ³⁻⁵. Although the isotope dot product (idotP) and mass error (ΔM), calculated from MS1 spectra using Skyline ⁶,⁷, were adopted as comparative quantification criteria for peptide pairs in some other studies ⁸⁻¹², these conventional criteria are not sufficient to distinguish peptide peaks from noise. As such, the presence of noise peaks becomes a greater problem as the size of the dataset increases. Consequently, validating all extracted peaks requires manual inspection to eliminate noise peaks in the dataset.

Some recent investigations have used machine learning techniques to identify peptide peaks from large datasets ¹³⁻¹⁵. A mass precision algorithm was developed to extract the signal from the noise, thus improving quantitation using a random forest (RF) classifier and heuristic score ¹³. Another algorithm identifies quantitative peaks among interfering peaks or poor chromatograms in targeted proteomics using a supervised machine learning approach ¹⁴. Supervised machine learning approaches developed using quantitative results annotated by experts enable beginners to easily extract quantitative peak pairs with high accuracy. Despite this advantage, false-positive and false-negative results can still occur in datasets classified using supervised machine learning; consequently, false-positive peaks may reduce accuracy and introduce ratio distortion.

In this study, we adopted idotP and ΔM in addition to seven other informative features of chromatographic peaks. We examined these features using six different types of supervised machine learning algorithms, each of which extracted the peptide peaks individually. Each peak was evaluated by all six algorithms, and peaks identified as peptide peaks by all six (i.e., unanimously selected) were assigned as true peaks. Because unanimous agreement among all six algorithms reduces the false-positive rate, this system enables extraction of more exact and highly quantitative peptide peaks than a single supervised machine learning algorithm or the conventional criteria. Here, we report an example of such quantitative comparisons using our unanimous peak assignment procedure.

Sample preparation

C57BL/6 male mice were purchased from CLEA Japan, Inc. (Tokyo, Japan). All procedures involving animals complied with the guidelines of the National Institutes of Health and were approved by the Animal Experimentation and Ethics Committee of Kitasato University School of Medicine. The whole liver of each mouse was homogenized on ice using a BioMasher II (Nippi, Tokyo, Japan) for 3 min with 1 mL of phase-transfer surfactant (PTS; 12 mM sodium deoxycholate, 12 mM sodium lauryl sulfate, and 200 mM triethylammonium bicarbonate [TEAB]) ²³. Aliquots of the homogenate were sonicated in a Bioruptor sonicator (SONIC Bio Co., Kanagawa, Japan) for 30 min (30 s on/30 s off, high setting) while on ice water. Insoluble materials were removed by centrifugation at 19,000g for 15 min at 4 °C. The protein concentration was measured using a NanoDrop spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and adjusted to 1 µg/µL with PTS. Protein extraction samples were flash-frozen using liquid nitrogen and then stored at −80 °C until use.

For evaluation of the machine learning algorithms, proteins extracted from 20 µg of mouse liver were resuspended in 20 µL of PTS and incubated with the addition of 2 µL of 200 mM Bond-Breaker TCEP solution (TCEP, Thermo Fisher Scientific) for 30 min at 50 °C to cleave the disulfide bonds, and then the solution was further incubated on ice for 10 min. The reduced proteins were then alkylated with 2 µL of 375 mM iodoacetamide and 200 mM TEAB in the dark at room temperature for 30 min. The alkylation reaction was quenched by addition of 2 µL of 400 mM L-cysteine and incubation in the dark for 10 min at room temperature. The sample was digested with 200 ng each of trypsin and lysylendopeptidase for 18 h at 37 °C. The reaction mixture was then mixed with a 1.5× volume of 1.7% trifluoroacetic acid (TFA) and subsequently centrifuged at 19,000g for 15 min at 4 °C. The supernatant was desalted using StageTips with a C18 Empore disk membrane, as described previously ²³. The fraction was eluted using 50% acetonitrile (ACN) and 0.1% TFA and then freeze-dried. The freeze-dried sample was resuspended with 20 µL of 3% ACN and 0.1% formic acid (FA) using a combination of vortexing and ultrasonic agitation in a Bioruptor sonicator (30 s on/30 s off, high setting) for 10 min each while on ice water. The sample was analyzed using a quadrupole Orbitrap benchtop mass spectrometer (Q-Exactive, Thermo Fisher Scientific) equipped with an EASY-nLC 1000 system (Thermo Fisher Scientific). Tryptic peptides were injected directly onto an analytical column (C18, particle diameter 3 µm, 0.075 mm × 125 mm; Nikkyo Technos, Japan). Tryptic peptides were separated with a gradient of solvents A (0.1% FA) and B (0.1% FA and 90% ACN) (0-1 min, 5-10% B; 1-20 min, 10-25% B; 20-26 min, 25-50% B; 26-27 min, 50-80% B) at a flow rate of 300 nL/min using the EASY-nLC 1000. Peptides were introduced from the chromatography column to the Q-Exactive.
Some parameters of the MS spectrum were as described previously ⁹. MS1 spectra were collected over the scan range 350-900 m/z at 70,000 resolution to hit an automatic gain control (AGC) target of 1 × 10⁶. The AGC target value for fragment spectra was set at 1 × 10⁵. The 20 most-intense ions with charge states of 2⁺ to 4⁺ that exceeded an intensity of 2.0 × 10³ were fragmented.
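
The gradient program described above can be read as a piecewise-linear function of time. The following is a minimal sketch, assuming linear ramps between the listed breakpoints; the name `percent_b` and the breakpoint table are illustrative and are not part of the instrument method.

```python
# Breakpoints (time in min, %B) transcribed from the gradient described above.
GRADIENT = [(0.0, 5.0), (1.0, 10.0), (20.0, 25.0), (26.0, 50.0), (27.0, 80.0)]


def percent_b(t_min, gradient=GRADIENT):
    """Return %B at time t_min, assuming linear ramps between breakpoints."""
    if t_min <= gradient[0][0]:
        return gradient[0][1]
    for (t0, b0), (t1, b1) in zip(gradient, gradient[1:]):
        if t_min <= t1:
            # Linear interpolation within the current segment.
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    # After the last breakpoint, hold the final composition (assumption).
    return gradient[-1][1]
```

For example, `percent_b(10.5)` falls halfway through the 1-20 min segment and evaluates to 17.5% B.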

For quantitative comparisons, proteins extracted from 20 µg of mouse liver dissolved in 20 µL of PTS were dimethylated with 8 µL of 0.6 M NaBH₃CN and 16 µL of 4% ¹²CH₂O (light-labeled) or 4% ¹³CH₂O (heavy-labeled) for 10 min at room temperature. The dimethylation reaction was quenched by addition of 8 µL of 1% NH₃ and incubation for 1 min, and then the light- and heavy-labeled samples were mixed. A total of 58 µL of the mixed sample was precipitated by the addition of 700 µL of ACN followed by the addition of 25 µL of 5% TFA. After centrifugation at 19,000g for 15 min at 4 °C, the supernatant was discarded and the precipitate was collected. The precipitate was dissolved in 20 µL of PTS, and the subsequent alkylation, digestion, and LC-MS analysis were performed as described above for evaluation of the machine learning algorithms.

Peptides were introduced to the Q-Exactive from an analytical column (C18, particle diameter 3 µm, 0.075 mm × 125 mm; Nikkyo Technos). Tryptic peptides were separated with a gradient of solvents A and B (0-29 min, 5-30% B; 29-37 min, 30-55% B; 37-38 min, 55-80% B) at a flow rate of 300 nL/min using the EASY-nLC 1000. MS1 spectra were collected over the scan range 350-1400 m/z at 140,000 resolution to hit an AGC target of 3 × 10⁶. The two most-intense ions with charge states of 2⁺ to 4⁺ that exceeded an intensity of 2.0 × 10⁵ were fragmented. Other parameters were set as described for evaluation of the machine learning algorithms.

All raw data files obtained in the LC-MS/MS analyses were deposited in the ProteomeXchange Consortium (http://proteomecentral.proteomexchange.org) via the jPOST partner repository (http://jpostdb.org) ²⁴ with the dataset identifiers PXD027824 for ProteomeXchange and JPST001287 for jPOST.

Protein identification

LC-MS/MS data were searched against the mouse UniProt sequence database (release 2018; 25,131 entries, reviewed). Database searches were performed using the SEQUEST algorithm incorporated into Proteome Discoverer 1.4.0.288 software (Thermo Scientific) with the following parameters: enzyme, trypsin; maximum missed cleavage sites, 3 for evaluation of machine learning or 2 for quantitative comparisons; precursor mass tolerance, 6 ppm; fragment mass tolerance, 0.02 Da; fixed modification, cysteine carbamidomethylation; variable modification, methionine oxidation. For quantitative comparisons, light-labeled dimethylation (+28 Da) at lysine and heavy-isotope-labeled dimethylation (+34 Da) at lysine were included as search parameters. Peptide identifications were filtered to a false discovery rate (FDR) of <1%.
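
As a worked illustration of the precursor mass tolerance, the sketch below converts an observed and a theoretical m/z into a mass error in ppm and checks it against a tolerance. The function names are hypothetical, and the calculation is the standard ppm definition rather than code taken from Proteome Discoverer.

```python
def ppm_error(observed_mz, theoretical_mz):
    """Mass error in parts per million relative to the theoretical m/z."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def within_tolerance(observed_mz, theoretical_mz, tol_ppm=6.0):
    """True if the absolute mass error does not exceed tol_ppm (default 6 ppm)."""
    return abs(ppm_error(observed_mz, theoretical_mz)) <= tol_ppm
```

At m/z 500, a 6 ppm tolerance corresponds to a window of about ±0.003 m/z, so an observed value of 500.0025 passes and 500.0050 does not.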

XICs for precursor ions were obtained using Skyline 20.1.0 (http://proteome.gs.washington.edu/software/skyline) ⁶,⁷ based on the identified peptide library. The spectrum library was imported from the msf file generated by Proteome Discoverer with a cutoff score of FDR = 0.99. Peptide settings were as follows: enzyme, trypsin KR/P; maximum missed cleavages, 2; minimal length of peptide, 7; maximal length, 30; modifications, carbamidomethyl (Cys), oxidation (Met); maximum variable mods, 5. Transition settings were as follows: precursor charges, 2⁺ to 4⁺; type, p (precursor); ion mass tolerance, 0.02 m/z; isotope peaks included, count 3; mass analyzer, Orbitrap; resolution, 70,000 at 200 m/z; use only scans within 5 min of predicted retention time; isotope labeling enrichment, default.

Extraction of informative features from chromatographic peaks

Nine types of informative features of the chromatographic peaks were extracted using Skyline: idotP, average mass error, signal-to-noise ratio, jagging score, standard deviation of the intensity of FWHM of isotope peaks, average retention time, intensity at chromatographic peak boundary, shape similarity, and co-elution score (Supplementary Table S1 and Supplementary Figure S7). The jagging score was defined as the number of data points lower than the FWHM within the integration interval of the peak. The shape similarity score was defined as the Pearson product-moment correlation coefficient between the chromatographic peak shapes of the isotopes. The co-elution score was defined as the average shift in the cross-correlation function for each pair of isotopic peak traces within the window of the selected peak, as described in a previous report ²⁵. For all features, missing values were replaced with zero.
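
To make the shape similarity feature concrete, the following is a minimal sketch of a Pearson-based score over isotope traces. It is not the authors' FeatureExtract.py: the function names are hypothetical, and averaging the coefficient over all isotope pairs and returning zero for flat traces (mirroring the zero-fill rule for missing values) are assumptions.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length intensity traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for a flat trace; treated as a missing value
    return sxy / math.sqrt(sxx * syy)


def shape_similarity(traces):
    """Mean pairwise Pearson coefficient over isotope traces (assumed aggregation)."""
    pairs = [(i, j) for i in range(len(traces)) for j in range(i + 1, len(traces))]
    if not pairs:
        return 0.0
    return sum(pearson(traces[i], traces[j]) for i, j in pairs) / len(pairs)
```

Isotope traces that rise and fall together, such as `[0, 2, 10, 2, 0]` and `[0, 1, 5, 1, 0]`, give a score close to 1, whereas an interfering signal with a different elution profile lowers it.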

Peak extraction and assignment

All values of each feature were scaled using min-max normalization. Subsequently, the dimensionality of the original feature space was reduced using principal component analysis (PCA). The selected principal components were used as inputs for the support vector machine (SVM) ¹⁹, artificial neural network (ANN) ²⁰, k-nearest neighbor (KNN) ¹⁸, and Gaussian naïve Bayes (GNB) ²¹ algorithms. The feature values without min-max normalization or PCA were used as inputs for the other two machine learning algorithms, random forest (RF) ¹⁶ and extreme gradient boosting (XGB) ¹⁷. k-fold cross-validation (k = 5) was used to avoid overfitting, and the hyper-parameters were optimized as described previously ²⁶. Cross-validation and hyper-parameter optimization were applied to the five machine learning algorithms other than GNB. True peaks and noise peaks in the training example were annotated manually.
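
The two preprocessing steps named above, min-max normalization and 5-fold cross-validation, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' MachineLearning.py (which used scikit-learn): the function names are hypothetical, constant columns are mapped to zero, and folds are contiguous, so the data should be shuffled beforehand.

```python
def min_max_scale(column):
    """Scale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant feature: assumption, map to 0
    return [(v - lo) / (hi - lo) for v in column]


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, valid_idx) pairs for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size
```

Every sample appears in exactly one validation fold, so each model is scored on data it was not fitted to. In this scheme only the PCA-based classifiers receive the scaled features; RF and XGB take the raw feature values, as described above.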

Quantification

Peptide pairs for which both the light- and heavy-labeled peptides were identified were chosen for comparative quantification. The peak area of each peptide was defined as the sum of the XIC areas of its three precursor ions (monoisotopic mass [M] and isotopic masses [M+1 and M+2]).
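
The peak-area definition above amounts to a simple sum followed by a ratio between the labeled forms. The sketch below uses hypothetical function names, and the heavy-to-light direction of the ratio is an assumption, since the text does not state which label is the numerator.

```python
def peak_area(isotope_areas):
    """Sum the XIC areas of the M, M+1, and M+2 precursor ions."""
    if len(isotope_areas) != 3:
        raise ValueError("expected areas for M, M+1, and M+2")
    return sum(isotope_areas)


def heavy_to_light_ratio(light_areas, heavy_areas):
    """Ratio of summed heavy to summed light peak areas (direction assumed)."""
    return peak_area(heavy_areas) / peak_area(light_areas)
```

For a pair with light areas `[100, 60, 20]` and heavy areas `[90, 54, 18]`, the summed areas are 180 and 162, giving a ratio of 0.9.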

Coding environment

Python 3.7.7 was used to perform the machine learning analyses with the following imported libraries: numpy 1.19.1, pandas 1.1.0, scikit-learn 0.23.1, xgboost 0.9, matplotlib 3.2.2, and seaborn 0.10.1. Figures were prepared using matplotlib and seaborn. FeatureExtract.py and MachineLearning.py were used for extraction of chromatographic features and execution of the machine learning algorithms, respectively. Both Python scripts are shown in the Supplementary Materials.

Evaluation of training example

Tryptic peptides equivalent to 0.1 µg of homogenized total protein were analyzed using nanoLC-MS/MS. Peptide identification using Proteome Discoverer 1.4 yielded 5,842 peptide fragments derived from 1,214 proteins. The training example contained 380 peptide peaks and 357 noise peaks. A total of 380 peaks with idotP ≥0.85 and |ΔM| ≤10.0 ppm were manually annotated as peptide peaks (Supplementary Figure S1A-S1D). Randomly selected signals were also evaluated manually, and 357 of them, including peaks that satisfied these criteria, were assigned as noise peaks (Supplementary Figure S1E-S1H). The distributions of m/z, retention time, and intensity for all annotated peaks are summarized in Supplementary Figure S2, and the data suggest that there was no bias between peptide and noise peaks, except with regard to peak intensity.
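
The conventional criteria used here reduce to a two-threshold filter. A minimal sketch follows, with a hypothetical function name and the thresholds from this section as defaults.

```python
def passes_conventional_criteria(idotp, mass_error_ppm,
                                 idotp_min=0.85, max_abs_ppm=10.0):
    """True if a peak satisfies idotP >= idotp_min and |mass error| <= max_abs_ppm."""
    return idotp >= idotp_min and abs(mass_error_ppm) <= max_abs_ppm
```

A peak with idotP 0.90 and a mass error of -3.2 ppm passes this filter even if it is noise, which is why the filter alone cannot separate the two classes.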

The distributions of the nine features for the 380 peptide peaks and 357 noise peaks were compared using violin plots (Figure 1). Descriptions of the nine informative features are given in Supplementary Table S1. With regard to idotP, the median value for the noise peaks was 0.90; as a result, more than half of the noise peaks would remain in a dataset extracted with a threshold of 0.85. Furthermore, the distributions of average mass error, jagging score, and standard deviation of full width at half maximum (FWHM) of the peptide peaks overlapped substantially with those of the noise peaks. These plots suggested that no single parameter clearly distinguished the peptide and noise peaks. The peak distributions were also projected in two-dimensional plots generated from pairs of the nine features (Figure 2). These two-dimensional plots did not enable the discrimination of peptide peaks from noise peaks either.

The first principal component (PC1) in the principal component analysis (PCA) explained most of the variation in the original features and correlated with the idotP variable in the initial space with an eigenvalue of 0.93. The eigenvectors of PC2 and PC3 represented the peak shape similarities among isotope peaks. The eigenvector of PC4 was also related to the peak shape similarities and peak co-elution scores. The cumulative contribution ratio from PC1 to PC4 was approximately 1.0. Therefore, all peaks described by the nine features in the original space were represented in the reduced space spanned by these four eigenvectors. The peak distributions in the three-dimensional projections of PC1-PC2-PC3, PC1-PC2-PC4, and PC2-PC3-PC4 are displayed in Supplementary Figure S3. These projections revealed no threshold for the separation between peptide and noise peaks in the PCA space.
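
The cumulative contribution ratio mentioned above is the running sum of the PCA eigenvalues divided by their total. The sketch below shows that arithmetic with hypothetical function names; the eigenvalues in the usage note are made up for illustration and are not the values from this study.

```python
def cumulative_contribution(eigenvalues):
    """Cumulative fraction of variance explained, largest eigenvalue first."""
    total = sum(eigenvalues)
    running, ratios = 0.0, []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        ratios.append(running / total)
    return ratios


def n_components_for(eigenvalues, target):
    """Smallest number of components whose cumulative ratio reaches target."""
    for count, ratio in enumerate(cumulative_contribution(eigenvalues), start=1):
        if ratio >= target:
            return count
    return len(eigenvalues)
```

With illustrative eigenvalues `[4.0, 3.0, 2.0, 1.0]`, the cumulative ratios are 0.4, 0.7, 0.9, and 1.0, so three components are needed to reach a target of 0.85.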

Building and evaluating the machine learning algorithms

We concluded that no suitable threshold could be found in the dataset subjected to PCA and therefore applied machine learning algorithms to classify the peptide peaks in the dataset. A total of 737 peaks, comprising 380 peptide and 357 noise peaks, were divided into a training set (418 peaks; 219 peptide and 199 noise peaks) and a cross-validation set (319 peaks; 161 peptide and 158 noise peaks), and the six supervised machine learning algorithms were trained and evaluated independently on these sets. Analysis of the learning curves for the six individual machine learning algorithms revealed that both the training and cross-validation scores converged to a value >0.8 (Supplementary Figure S4). Increasing the number of training peaks to 200 for RF ¹⁶, extreme gradient boosting (XGB) ¹⁷, and k-nearest neighbor (KNN) ¹⁸, and to 300 for a linear support vector machine (SVM) ¹⁹ and artificial neural network (ANN) ²⁰, improved the training of the machine learning algorithms. Thus, the size of the training dataset was sufficient to build the machine learning algorithms without introducing overfitting problems.

Using the cross-validation set, the peak labels generated by the six machine learning algorithms were compared with the manually annotated peak assignments (Table 1). The Gaussian naïve Bayes (GNB) ²¹ classifier exhibited the lowest accuracy (91%), where accuracy was determined by dividing the number of correct predictions by the total number of peaks. In contrast, the ANN and XGB exhibited the highest accuracy (96%), and the ANN predicted 154 true peptide peaks and 150 true noise peaks in the cross-validation dataset. The precision, defined as the number of true peptide peaks divided by the number of peaks predicted as peptides, indicated that the SVM, RF, XGB, and ANN achieved a high precision (approximately 96%). However, the KNN and GNB exhibited relatively poor precision, with 9% and 12.5% of their predicted peptide peaks being false-positives, respectively. With the conventional criteria based on idotP and ΔM, 35% of the peaks classified as peptide peaks were false-positives. Although the machine learning algorithms were better classification tools in terms of identifying peptides as true peaks, the nearly 4% of false-positives derived from even the best-performing algorithms may lead to inaccurate quantitative results. Therefore, in this study, only peaks unanimously selected as true peaks by the six machine learning algorithms were assigned as peptide peaks. Approximately 98.6% of the peaks identified in this way were consistent with the manually annotated peptide peaks, and this unanimity reduced the number of false-positives to the lowest level among the approaches compared (Table 1). Although the total number of identified peptides was reduced slightly using our strategy rather than a single machine learning algorithm or the previous criteria, our strategy for peak identification exhibited quite high fidelity.
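
The unanimous selection rule itself is a logical AND over the six classifier outputs. A minimal sketch follows; the function name and the dictionary layout keyed by algorithm abbreviation are illustrative assumptions, not the authors' implementation.

```python
ALGORITHMS = ("SVM", "RF", "XGB", "ANN", "KNN", "GNB")


def unanimous_selection(votes):
    """Return True only if every algorithm in ALGORITHMS labeled the peak a peptide.

    votes maps an algorithm abbreviation to True (peptide) or False (noise).
    A missing vote counts as noise, so an incomplete record is never accepted.
    """
    return all(votes.get(name, False) for name in ALGORITHMS)
```

A single dissenting classifier is enough to reject a peak, which is what trades a small loss of true peptide peaks for the lower false-positive rate reported in Table 1.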

Quantification analysis based on unanimous predictions

The unanimous prediction strategy was applied to the analysis of a mixture of equivalent quantities of proteins labeled with light and heavy dimethylations as an example of quantitative comparison. The peptide identification workflows using this and the previous approach are shown in Supplementary Figure S5. In the previous approach, peptide peaks were identified based on the criteria idotP ≥0.9 and |ΔM| ≤6 ppm, resulting in the prediction of 939 peak pairs. The unanimous selection approach predicted 893 peak pairs (Supplementary Table S2). The distributions of retention time, intensity, and m/z for these peak pairs are shown in Supplementary Figure S6. No deviation in retention time was observed with either approach. The frequency distribution of peak intensities was Gaussian-like, converging around 1 × 10⁸. Analysis of the variations in m/z showed that an increase in m/z was related to a decrease in the number of identified peaks. Although the unanimous algorithm prediction approach identified fewer peaks than were identified using the previous criteria, no significant differences in distribution were found between the approaches. Therefore, the unanimous algorithm prediction approach identifies peptide peaks from a search space similar to that annotated using our previous criteria. Log ratio versus mean average (MA) scatter plots were generated to visualize the distributions of peak pairs extracted using both approaches (Figure 3). The average of the log-ratios for the unanimous algorithm prediction approach was 0.030, with a deviation of 0.012. In contrast, the former criteria yielded a slightly lower average of 0.021 but a higher deviation of 0.022. In the MA plots, the peptides assigned solely by the former criteria appeared across a broad range of intensity, with inaccurate ratios (Figure 3).
Focusing on pairs assigned solely by the former criteria, some of the peaks were assigned as peptides by these criteria but confirmed as noise peaks by visual inspection (Figure 4A). Some peaks were located outside the integration interval (Figure 4B). In other cases, the chromatogram of one isotope differed in shape from the chromatograms of the other isotopes (Figure 4C). In contrast to the previous approach, the unanimous algorithm prediction approach produced fewer peptide peaks with inaccurate ratios (Figure 3). Convergence of deviations was observed, suggesting that the unanimous algorithm prediction approach is useful for comparative quantification of peptides with high accuracy.
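
The coordinates of an MA plot can be computed from each light/heavy peak-area pair as sketched below. The text does not state the logarithm base, the direction of the ratio, or whether the reported deviation is a sample standard deviation, so base 2, heavy over light, and `statistics.stdev` are all assumptions here, and the function names are hypothetical.

```python
import math
import statistics


def ma_values(light_area, heavy_area):
    """Return (M, A): the log2 ratio and the mean log2 intensity of a peak pair."""
    m = math.log2(heavy_area / light_area)
    a = 0.5 * (math.log2(heavy_area) + math.log2(light_area))
    return m, a


def summarize_log_ratios(m_values):
    """Return the mean and sample standard deviation of the log-ratios."""
    return statistics.mean(m_values), statistics.stdev(m_values)
```

For a 1:1 mixture, M should scatter around zero at every intensity A; peak pairs far from that line are candidates for misassigned peaks.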

Shotgun proteomics using high-resolution MS enables us to conduct MS1-based quantitative comparisons of objective and control peptides from the XIC ¹,². To identify the proteins involved in physiologic and/or pathologic processes based on abundance using shotgun proteomics, poor chromatographic peaks must first be excluded from complex LC-MS/MS spectra using conventional criteria, such as idotP and ΔM ⁸⁻¹², and then quantifications based on the areas of extracted peaks of identified proteins can be compared. Recently, deep proteomic techniques have been developed that are capable of detecting weak peptide signals, thus introducing the problem of how to treat these weak signals for deeper quantification of protein/peptide abundance. However, manually annotating a large number of signals from a vast dataset is impractical, and it is difficult for an investigator to apply judgment criteria consistently during peak annotation. In this study, we introduced six machine learning algorithms to extract a higher number of peptide peaks with high accuracy and precision. Although machine learning algorithms are reliable classification tools for identifying true peaks, single machine learning algorithms were associated with false-positive rates ranging from approximately 3% to 13% in our evaluation (Table 1), which can lead to inaccurate quantitative results due to ratio distortion. In contrast to the use of a single machine learning algorithm, our strategy evaluated each peak with six different algorithms, and only those peaks selected unanimously as true were assigned as peptide peaks, reducing the false-positive rate to 1.4%. Hence, more exact and highly quantitative peptide peak identification is possible compared with the use of a single machine learning algorithm or the application of conventional criteria.
Furthermore, our strategy also recorded how many of the six machine learning algorithms selected each peak as true. Consequently, an ambiguous peak can be identified from its score, calculated as the fraction of algorithms that selected it (0, 0.17, 0.33, 0.5, 0.67, 0.83, or 1), enabling re-assignment of the peak as a target peak.
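
The selection score described above is the fraction of algorithms that labeled a peak as a peptide. A minimal sketch follows; the function names and the idea of flagging any non-unanimous peak for manual review are illustrative assumptions.

```python
def selection_score(votes):
    """Fraction of algorithms that selected the peak as a peptide (0 to 1)."""
    votes = list(votes)
    return sum(1 for vote in votes if vote) / len(votes)


def needs_review(votes):
    """True for peaks with a split vote, i.e. a score strictly between 0 and 1."""
    return 0.0 < selection_score(votes) < 1.0
```

With six algorithms the possible scores, rounded to two decimals, are 0, 0.17, 0.33, 0.5, 0.67, 0.83, and 1; a peak selected by five of six algorithms scores 0.83 and would be a natural candidate for re-inspection.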

Recent data-independent acquisition (DIA) MS-based techniques enable deep proteome coverage with relative quantitative analysis ²², resulting in an increase in the identification of very weak signals from very large LC-MS/MS spectral datasets. Identifying weak signals with a poorly performing assignment strategy results in inaccurate quantification and misidentification of peaks, along with ratio distortion. In this study, we developed a new peak assignment strategy based on unanimous selection by multiple machine learning algorithms to enable highly sensitive peak annotation with a significantly lower false-positive rate. When coupled with DIA techniques, this strategy could enable determination of trace differences in protein abundance in cells and/or tissues, thereby providing new insights into physiologic and pathologic mechanisms in the near future.

Author contributions

H.I., T.M., and Y.K. designed the experiments. H.I., R.K., and M.I. performed the experiments. H.I., T.M., and Y.K. analyzed the data. H.I., T.M., and Y.K. wrote the paper. All authors reviewed the results and approved the final version of the manuscript.

Acknowledgment

This work was supported by Grants-in-Aid for Scientific Research from the Ministry of Education, Culture, Sports, Science and Technology, Japan (grant numbers 17KK0141, 17K19926, 17H02206, 18K06467, and 21K06036), and by research support from the All Kitasato Project Study (AKPS).

Conflict of Interest

The authors declare that they have no competing financial interests.

Supplementary material

The Python scripts are described in the Supplementary Material, which is available free of charge via the Internet at (URL).

References

1. Franke, A. A., Li, X., Dabalos, C. & Lai, J. F. Improved oxytocin analysis from human serum and urine by orbitrap ESI-LC-HRAM-MS. Drug Testing and Analysis, 12, 846–852 (2020).
2. Masaki, T. et al. GIP_HUMAN[22–51] is a new proatherogenic peptide identified by native plasma peptidomics. Sci. Rep., 11, 14470 (2021).
3. Wijasa, T. S. et al. Quantitative proteomics of synaptosome S-nitrosylation in Alzheimer’s disease. Journal of Neurochemistry, 152, 710–726 (2020).
4. Coles, G. L. et al. Unbiased Proteomic Profiling Uncovers a Targetable GNAS/PKA/PP2A Axis in Small Cell Lung Cancer Stem Cells. Cancer Cell, 38, 129–143.e7 (2020).
5. Rotunno, M. S. et al. Cerebrospinal fluid proteomics implicates the granin family in Parkinson’s disease. Sci. Rep., 10, 1–11 (2020).
6. MacLean, B. et al. Skyline: An open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics, 26, 966–968 (2010).
7. Schilling, B. et al. Platform-independent and label-free quantitation of proteomic data using MS1 extracted ion chromatograms in Skyline: Application to protein acetylation and phosphorylation. Molecular and Cellular Proteomics, 11, 202–214 (2012).
8. Nakagawa, Y., Matsui, T., Konno, R. & Kawashima, Y. A highly efficient method for extracting peptides from a single mouse hypothalamus. Biochemical and Biophysical Research Communications, 548, 155–160 (2021).
9. Konno, R. et al. Highly accurate and precise quantification strategy using stable isotope dimethyl labeling coupled with GeLC-MS/MS. Biochemical and Biophysical Research Communications, 550, 37–42 (2021).
10. Streng, A. S. et al. Development of a targeted selected ion monitoring assay for the elucidation of protease induced structural changes in cardiac troponin T. Journal of Proteomics, 136, 123–132 (2016).
11. Tannous, A. et al. Comparative Analysis of Quantitative Mass Spectrometric Methods for Subcellular Proteomics. Journal of Proteome Research, https://doi.org/10.1021/acs.jproteome.9b00862 (2020).
12. Dallas, D. C. et al. Peptidomic analysis reveals proteolytic activity of kefir microorganisms on bovine milk proteins. Food Chem., 197, 273–284 (2016).
13. Bakalarski, C. E. et al. The impact of peptide abundance and dynamic range on stable-isotope-based quantitative proteomic analyses. Journal of Proteome Research, 7, 4756–4765 (2008).
14. Toghi Eshghi, S., Auger, P. & Mathews, W. R. Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clin. Proteomics, 15, 1–13 (2018).
15. Deeb, S. J. et al. Machine learning-based classification of diffuse large B-cell lymphoma patients by their protein expression profiles. Molecular and Cellular Proteomics, 14, 2947–2960 (2015).
16. Breiman, L. Random forests. Mach. Learn., 45, 5–32 (2001).
17. Chen, T. & Guestrin, C. XGBoost. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining vol. 42 785–794 (ACM, 2016).
18. Laaksonen, J. & Oja, E. Classification with learning k-nearest neighbors. in Proceedings of International Conference on Neural Networks (ICNN’96) vol. 3 1480–1483 (IEEE, 1996).
19. Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn., 20, 273–297 (1995).
20. Zhang, G. P. Neural networks for classification: a survey. IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews), 30, 451–462 (2000).
21. Karthika, S. & Sairam, N. A Naïve Bayesian Classifier for Educational Qualification. Indian Journal of Science and Technology, 8 (2015).
22. Kawashima, Y. et al. Optimization of data-independent acquisition mass spectrometry for deep and highly sensitive proteomic analysis. International Journal of Molecular Sciences, 20, 1–14 (2019).
23. Masuda, T., Tomita, M. & Ishihama, Y. Phase transfer surfactant-aided trypsin digestion for membrane proteome analysis. Journal of Proteome Research, 7, 731–740 (2008).
24. Okuda, S. et al. jPOSTrepo: An international standard data repository for proteomes. Nucleic Acids Res., 45, D1107–D1111 (2017).
25. Reiter, L. et al. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nature Methods, 8, 430–435 (2011).
26. Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305 (2012).

Table 1. Confusion matrices used in this study. Precision rate was defined as the number of true peptides divided by the number of peaks predicted as peptides.

Method	Predicted as peptide: true peptide	Predicted as peptide: false-positive	Predicted as noise: false-negative	Predicted as noise: true noise	Precision rate (%)
Manual annotation	161	0	0	158	-
idotP and ΔM	160	87	1	71	64.8
SVM	151	7	10	151	95.6
RF	151	6	10	152	96.2
XGB	151	5	10	153	96.8
ANN	154	8	7	150	95.1
KNN	152	15	9	143	91.0
GNB	147	21	14	137	87.5
Unanimous selection	140	2	21	156	98.6
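
The precision rates in Table 1 follow directly from the confusion-matrix counts. The sketch below recomputes them from the counts transcribed from the table; the recall function is an added illustration (it is not reported in the table), and all names are hypothetical.

```python
# Counts from Table 1: (true peptide, false-positive, false-negative, true noise).
TABLE_1 = {
    "idotP and ΔM": (160, 87, 1, 71),
    "SVM": (151, 7, 10, 151),
    "RF": (151, 6, 10, 152),
    "XGB": (151, 5, 10, 153),
    "ANN": (154, 8, 7, 150),
    "KNN": (152, 15, 9, 143),
    "GNB": (147, 21, 14, 137),
    "Unanimous selection": (140, 2, 21, 156),
}


def precision(tp, fp, fn, tn):
    """True peptides divided by all peaks predicted as peptides."""
    return tp / (tp + fp)


def recall(tp, fp, fn, tn):
    """True peptides divided by all manually annotated peptides (not in Table 1)."""
    return tp / (tp + fn)


def precision_percent(method):
    """Precision for one row of TABLE_1, as a percentage rounded to one decimal."""
    return round(100 * precision(*TABLE_1[method]), 1)
```

For unanimous selection this gives 140 / (140 + 2) = 98.6%, matching the table, while the corresponding recall of 140 / 161 (about 87%) quantifies the peptide peaks given up in exchange.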


Editorial decision: Major revision
13 Oct, 2021
Reviews received at journal
11 Sep, 2021
Reviewers agreed at journal
08 Sep, 2021
Reviewers invited by journal
08 Sep, 2021
Editor assigned by journal
08 Sep, 2021
Editor invited by journal
26 Aug, 2021
Submission checks completed at journal
26 Aug, 2021
First submitted to journal
25 Aug, 2021
