Using ProteoCombiner to integrate bottom-up and top-down proteomics data to improve proteoform identification

Here we present ProteoCombiner, a high-performance software tool for proteome analysis that combines different mass spectrometric approaches, such as top-down for intact protein analysis and bottom-up for proteolytic fragment characterization. ProteoCombiner capitalizes on the data arising from different experiments and proteomics search engines and presents the results in a user-friendly manner. Our tool also provides rapid and easy visualization, manual validation and comparison of the identified proteoform sequences, including characterization of post-translational modifications (PTMs). Thus, ProteoCombiner is recommended for studies that use different proteomics strategies, in order to increase confidence in proteoform identification, including PTMs.


Introduction
Proteoforms describe all combinatorial sources of variation from a single gene, including genetic variations, alternative splicing, and post-translational modifications [1]. Proteoform analysis is of crucial importance, since proteoforms have been shown to play a key role in biological systems [2] [3]. Such analysis includes both precise determination of the expressed protein sequence and characterization of all post-translational modifications (PTMs), in particular their precise localization and their combinations (combinatorial PTMs). Several proteomics strategies, including bottom-up, middle-down and top-down proteomics, have been developed to analyze PTMs at the peptide level or to target proteoforms directly. Nevertheless, there is a lack of bioinformatics tools able to combine such proteomics results, and it is still difficult to map the PTMs identified in bottom-up data onto proteoforms obtained by top-down proteomics.
Software installation: Download ProteoCombiner by clicking on the Download button at https://proteocombiner.pasteur.fr.

Workflow
The following workflow demonstrates how to combine proteomics data using ProteoCombiner.
3.1.1.6. Clicking the Filter button applies the filter using all of the parameters described above.

Combined results
All identified proteins that contain a valid sequence* will be displayed on this tab, sorted by Score and then by Sequence Coverage. The protein score is represented by the best proteoform score.
*Protein sequence present in the database.
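The ordering described above can be sketched in a few lines of Python. This is only an illustration, not ProteoCombiner's own code: the `Protein` record and its field names are hypothetical, and both sort keys are assumed to be in descending order (best first), which the text does not state explicitly.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    # Hypothetical record; the field names are illustrative only.
    accession: str
    sequence_coverage: float  # fraction of residues covered, 0..1
    proteoform_scores: list = field(default_factory=list)

    @property
    def score(self) -> float:
        # The protein score is represented by its best proteoform score.
        # Assumption: a protein without proteoforms gets a score of 0.
        return max(self.proteoform_scores, default=0.0)


def sort_proteins(proteins):
    # Sort by Score, then by Sequence Coverage (descending is an assumption).
    return sorted(
        proteins,
        key=lambda p: (p.score, p.sequence_coverage),
        reverse=True,
    )


proteins = [
    Protein("P1", 0.40, [0.9, 1.2]),
    Protein("P2", 0.75, [1.2]),
    Protein("P3", 0.90, [0.5, 0.7]),
]
ranked = [p.accession for p in sort_proteins(proteins)]
print(ranked)
```

In this toy data, P1 and P2 tie on Score, so Sequence Coverage decides their relative order.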
3.2.1. By clicking on an identified protein, all of its identified proteoforms will be displayed* below the protein table, sorted by CombScore**.
*If there is no identified proteoform for a given protein, all of its identified peptides will be displayed instead.
**CombScore is calculated by summing two scores: i) a score from the top-down proteomics (TDP) identification software, normalized between 0 and 1; and ii) the sequence coverage of the proteoform, based on the peptides obtained in the bottom-up proteomics (BUP) approach that match this proteoform. This coverage score also ranges between 0 and 1.
3.2.1.1. By double-clicking on an identified peptide, the tandem mass spectrum with the best identification score will be displayed in the Spectrum Viewer (Figure 3).
3.2.2. By double-clicking on an identified protein, a new window will be opened that shows the Protein Coverage (section 3.5).
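As a rough illustration of the CombScore definition, the following Python sketch adds an already-normalized TDP score to the fraction of proteoform residues covered by bottom-up peptides. Several points are assumptions rather than documented behavior: the function names are hypothetical, peptides are matched to the proteoform by exact substring search (modifications are ignored), and the way ProteoCombiner normalizes the search-engine score is not specified here, so the sketch simply takes a value in [0, 1] as input. Read literally as a plain sum, CombScore would then lie between 0 and 2.

```python
def sequence_coverage(proteoform: str, peptides: list) -> float:
    """Fraction of proteoform residues covered by at least one matching peptide."""
    if not proteoform:
        return 0.0
    covered = [False] * len(proteoform)
    for peptide in peptides:
        if not peptide:
            continue  # skip empty strings, which would match everywhere
        start = proteoform.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Look for further (possibly overlapping) occurrences.
            start = proteoform.find(peptide, start + 1)
    return sum(covered) / len(proteoform)


def comb_score(normalized_tdp_score: float, proteoform: str, peptides: list) -> float:
    """Sum of the normalized TDP score and the BUP-based sequence coverage."""
    if not 0.0 <= normalized_tdp_score <= 1.0:
        raise ValueError("the TDP score is expected to be normalized to [0, 1]")
    return normalized_tdp_score + sequence_coverage(proteoform, peptides)


proteoform = "MKTAYIAKQR"              # toy sequence, 10 residues
peptides = ["MKTAY", "AKQR", "XXXX"]   # the last peptide does not match
coverage = sequence_coverage(proteoform, peptides)
score = comb_score(0.8, proteoform, peptides)
print(round(coverage, 2), round(score, 2))
```

Here nine of the ten residues are covered, so the coverage term contributes 0.9 to the total.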

Bottom-up proteomics results
All identified proteins and their corresponding peptides will be displayed on this tab.
3.3.1. By clicking on an identified protein, all of its identified peptides will be displayed below the protein table.
3.3.1.1. By clicking on an identified peptide, all of its identified tandem mass spectra will be displayed below the peptide table.
3.3.1.1.1. By double-clicking on an identified tandem mass spectrum, the Spectrum Viewer will be opened (Figure 3).
3.3.2. By double-clicking on an identified protein, a new window will be opened that shows the Protein Coverage (section 3.5).
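The protein, peptide and spectrum drill-down on this tab can be pictured as a simple nested data structure. The sketch below is only a mental model: the class and field names are hypothetical, it does not reflect ProteoCombiner's internal data model, and it assumes that a higher identification score is better.

```python
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    scan_number: int
    score: float


@dataclass
class Peptide:
    sequence: str
    spectra: list = field(default_factory=list)

    def best_spectrum(self):
        # Assumption: a higher identification score is better.
        # Returns None when the peptide has no spectra.
        return max(self.spectra, key=lambda s: s.score, default=None)


@dataclass
class ProteinEntry:
    accession: str
    peptides: list = field(default_factory=list)


protein = ProteinEntry(
    "P1",
    [
        Peptide("AKQR", [Spectrum(101, 12.5), Spectrum(207, 30.1)]),
        Peptide("MKTAY", [Spectrum(55, 8.0)]),
    ],
)

# Protein -> peptides -> spectra, mirroring the three stacked tables.
for peptide in protein.peptides:
    best = peptide.best_spectrum()
    print(peptide.sequence, len(peptide.spectra), best.scan_number)
```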

Top-down proteomics results
All identified proteoforms will be displayed on this tab, grouped by Theoretical Mass (in Da).
3.4.1. By double-clicking on the Scan Number column of an identified proteoform, the Spectrum Viewer will be opened (Figure 3). By clicking on any other column, a new window will be opened that shows the Protein Coverage (section 3.5).
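Grouping by theoretical mass can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary keys are hypothetical, masses are treated as belonging to the same group when they agree after rounding to two decimal places (the actual grouping tolerance is not given here), and the groups are returned in ascending mass order.

```python
from collections import defaultdict


def group_by_theoretical_mass(proteoforms, decimals=2):
    """Group proteoform records by theoretical mass (Da), rounded to `decimals`."""
    groups = defaultdict(list)
    for proteoform in proteoforms:
        key = round(proteoform["theoretical_mass"], decimals)
        groups[key].append(proteoform)
    # Ascending mass order is an assumption made for readability.
    return dict(sorted(groups.items()))


proteoforms = [
    {"scan": 412, "theoretical_mass": 10234.50},
    {"scan": 118, "theoretical_mass": 8564.631},
    {"scan": 120, "theoretical_mass": 8564.6312},
]
groups = group_by_theoretical_mass(proteoforms)
for mass, members in groups.items():
    print(mass, [m["scan"] for m in members])
```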

Protein Coverage
3.5.1. Once the window is opened, the following information will be displayed on the top tab: Protein description, Monoisotopic and Average protein mass*, and Sequence coverage (Figure 4).
*If the Protein Coverage window is opened by clicking on a specific proteoform, the monoisotopic and average mass of that proteoform will be displayed instead of the protein mass.
3.5.2. The left box will display all approaches used to identify the proteoforms and/or peptides. By clicking on each item, all corresponding lines will be highlighted in the right box.
3.5.2.1. Bottom-up and middle-down approaches are represented in blue.
3.5.2.2. The top-down approach is represented in three different colors: expected proteoforms in orange, which are all proteoforms identified with the full theoretical mass; non-expected proteoforms in cyan, which are all identified truncated proteoforms; and tagged proteoforms in red, which are all proteoforms identified from a part of the protein sequence.
3.5.3. The right box will display all identified proteoforms and/or peptides, each represented by a separate line and sorted by CombScore. The full protein sequence and all theoretical modifications (present in the database) are displayed at the top, as can be seen in Figure 4. All theoretical chains are shown below the protein sequence as gray dashed lines. The user can check all information about a modification or theoretical chain by hovering the mouse over the line or the modified amino acid (Figure 4).
3.5.3.1. Each proteoform will be displayed according to its classification: expected, non-expected or tagged proteoform. By right-clicking on each line, the user can assess the proteoform identification (valid or invalid); in addition, it is possible to highlight only the peptides that fit into the proteoform (Figure 5). Hovering over each line shows the following information: Proteoform sequence, Score, Search engine that identified this specific sequence, and Start and End positions.
3.5.3.2. All identified PTMs will also be displayed; hovering over each one shows its position and description.
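One way to think about highlighting only the peptides that fit into a proteoform is as a positional containment test on the Start and End positions. The sketch below works under that assumption: the `Span` record is hypothetical, positions are taken as 1-based and inclusive, and agreement of modifications between peptide and proteoform is not checked.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical record for a proteoform or peptide mapped onto the protein.
    label: str
    start: int  # 1-based, inclusive (assumption)
    end: int    # 1-based, inclusive (assumption)


def peptides_fitting(proteoform: Span, peptides):
    """Return the peptides that lie entirely within the proteoform boundaries."""
    return [
        p for p in peptides
        if proteoform.start <= p.start and p.end <= proteoform.end
    ]


proteoform = Span("truncated proteoform", 20, 120)
peptides = [
    Span("pep1", 25, 40),     # inside
    Span("pep2", 10, 30),     # starts before the proteoform
    Span("pep3", 100, 120),   # inside, ends exactly at the C-terminus
    Span("pep4", 115, 130),   # extends past the proteoform
]
fitting = [p.label for p in peptides_fitting(proteoform, peptides)]
print(fitting)
```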

Loading results
3.6.1. ProteoCombiner loads results in its own format (*.pcmb). This can be accomplished in three ways. The easiest is to double-click on a ProteoCombiner results file. If the Results Browser window is open, another way is to click the Load option in the File menu or press CTRL + O, as seen in Figure 6. Otherwise, if the main window is open, select Load Results from the File menu (or press CTRL + O), as seen in Figure 7.

Exporting results
3.7.1. ProteoCombiner can also export all combined results to an Excel® (*.xlsx) or PDF® file. This is done by selecting Excel file from File menu ⇒ Export results (or by pressing ALT + E), or by selecting PDF file from File menu ⇒ Export results (or by pressing ALT + P).

Filter parameters responsible for selecting results.
Protein coverage window: All identified proteoforms and/or peptides are displayed here on different lines. The theoretical sequences (proteoforms) are displayed as gray dashed lines and the theoretical PTMs above the protein sequence. Information about the PTMs and theoretical sequences is shown when the mouse is hovered over them.

Figure 5
An assessment can be applied to each identified proteoform in order to increase the confidence of the evaluation. The user can also highlight only the peptides that fit a specific proteoform.
Save or Load ProteoCombiner results from the Results Browser window.
The user can also export the results as an Excel® or PDF file.