Direct mapping of Peptide-to-Spectra-Matches to genome information facilitates qualifying proteomics information

Background: Small proteins have received increasing attention in recent years. In particular, they have been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins, a large fraction of which are false-positive predictions.
Results: We observe that the number and quality of the Peptide-to-Spectra-Matches (PSMs) that map to a candidate ORF can be highly informative for distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false-positive, i.e., most likely untranslated, ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that previously had been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. Half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions: The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. In particular, the workflow is also capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration with transcriptomics data and other sources of genome-level information.


Background
Small proteins (sProteins) with a size below 100 amino acids have received increasing attention, in particular in prokaryotes. Recent studies have revealed indispensable biological functions of some sProteins. CydX (37 AA), for instance, regulates the activity of cytochrome oxidase and thus ATP production in E. coli [1], and SgrT (43 AA) is an inhibitor of the EIICBGlc glucose transporter regulating glucose uptake [2]. Systematic surveys keep identifying large numbers of sProteins in prokaryotes, see e.g. [3,4]; hence it has become clear that sProteins are not rare peculiarities. The human gut microbiome, for instance, features thousands of sProteins, many of which are predicted to function in cell-cell communication [5]. Nevertheless, the available information has remained comparatively sparse due to the technical difficulties of detecting and identifying them with both computational and experimental methods.
* Correspondence: john@bioinf.uni-leipzig.de. 1 Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig, Härtelstraße 16-18, D-04107 Leipzig, Germany. Full list of author information is available at the end of the article. † Corresponding author
The annotation of newly sequenced genomes is primarily based on homology, using already existing gene annotations from related species as a basis. By definition, this approach is limited to homologs of genes that have been described already in at least one species. The method is also susceptible to incorrect entries in protein databases. Complementarily, putative coding sequences can be recognized with the help of Markov models that classify open reading frames (ORFs). To obtain a reliable signal, usually a minimum length of 100 codons is required in genome annotation [6]. These methods become unreliable for shorter ORFs, including those compiled in the BactPepDB [7], which surveys the complete prokaryotic genomes available for peptides with a length between 10 and 80 amino acids. Comparative approaches, in particular methods such as RNAcode [8] that evaluate sequence alignments rather than single sequences, can reliably recognize even very short coding sequences. They lose their power, however, if not enough genomes at a suitable genetic distance are available. To date, the computational prediction of sProteins is thus by no means an easy routine task. Ribosome profiling [9] also provides information on translated regions and thus constitutes an alternative way to identify putative novel proteins.
The gold standard for detecting sProteins is their direct identification in bottom-up proteomics. This technique relies on proteolytically cleaved proteins and subsequent analysis by LC-MS/MS [10]. Classic bottom-up proteomics protocols, however, tend to identify few sProteins since the small size implies that sProteins often yield only a single proteotypic peptide [11][12][13]. This issue is aggravated by the fact that peptide identification itself depends on underlying databases of predicted polypeptides and corresponding decoys. Tools such as Mascot [14], Comet [15], MS-GF+ [16] and many others therefore cannot identify peptides that are not in the set of protein annotations provided a priori. A peptide identified in this manner is referred to as a Peptide-to-Spectra-Match (PSM).
Proteogenomics approaches typically make use of a conceptual translation of the genome into all six reading frames as the basis for peptide identification. This results in much larger 6frame databases and thus a (moderate) reduction of sensitivity, but completely avoids all annotation-related biases [17][18][19]. With a focus on sProteins, it is also possible to extend annotations with additional predictions of (short) ORFs with high coding potential [20]. Already two decades ago, EST data were used to predict novel isoforms and thus to allow the identification of proteins arising from splice variants [21]. More recently, the same idea has been used with hypothetical splice variants to identify missense SNPs, short indels, chimeric proteins, and intron retention [18,19]. Metaproteomics [22], i.e., the application of proteogenomics to entire communities, incurs an additional layer of complexity for data analysis due to the need to disentangle different, but often closely related species [23,24].
The focus of this study was the discovery of novel, unannotated proteins, in particular those that have not been flagged as likely candidates by homology-based genome annotation. This problem is more difficult than just verifying an annotated protein candidate because the overlooked cases are often short, have no or only poorly described homologs in other species, harbor unusual features such as frameshifts, or overlap incorrect annotations. As a consequence, the sensitivity needs to be increased, which necessarily leads to a rapidly growing number of false positive predictions. Here we describe a workflow to prioritize the candidates based on aggregated quality measures of the PSMs that map to a candidate and on the translational status of overlapping annotation items.

Accuracy of identifying candidate proteins
Our goal is to call protein candidates with high sensitivity and to tightly control the false positive rate at the same time. Since we are primarily interested in novel sProteins, we want to do this in a manner that does not rely on any existing annotation. Thus we start by mapping all PSMs of sufficient quality (see Methods) to the genome and use the genomic map of PSMs to determine candidate proteins using a set of rules (see Methods). A candidate extends downstream to the closest stop codon, while the upstream end is determined by the first start codon upstream of the most upstream PSM mapped to the candidate.
In order to determine how well true proteins can be discriminated from false candidates on the basis of properties of mapped PSMs, we use an extensive data set [25] for E. coli. E. coli is nearly perfectly annotated, hence unannotated candidates most likely are false positive calls. In addition, we compare candidate calls using a 6frame database with calls based on a database of annotated proteins (proteome). While we expect that the sensitivity of 6frame is reduced compared to proteome, we can use candidates found only with the 6frame but not with the proteome database to estimate the false positive rate.
If only a single PSM is required to identify a protein, we observe that the majority of the annotated E. coli proteins are called using both the 6frame database and the proteome database. The consistency increases rapidly if more (not necessarily distinct) PSMs are required. Low-confidence PSMs, furthermore, are strongly enriched in proteins to which only very few PSMs are mapped, Fig. 1 (d-f). This matches the observation that false positive PSMs accumulate among unannotated ORFs [26]. These observations suggest that it is possible to devise an aggregate statistic of PSM quality scores that is capable of assessing candidate proteins even in the absence of multiple, distinct peptides.
Figure 1: Top: Comparison of 6frame and proteome databases for peptide identification using the Comet search algorithm. Three sets of proteins are displayed. Blue is used for the set of all proteins in the NCBI annotation. The green set represents all proteins that are identified by at least n PSMs using the NCBI annotation as a search database. The red set comprises the proteins that are identified by at least n PSMs using the 6frame database. Below: The score distributions of the PSMs for each intersection are shown. The score is given as a common logarithm. Red shows the set of proteins that are only detected with the 6frame database. The green set comprises the proteins that are only detected using the NCBI annotation as the search database. Grey is the intersection of proteins that are identified with both databases. In (a) only a single PSM is required, and there are noticeable differences between annotation-based and 6frame-based searches. The differences quickly decrease when larger numbers of PSMs per candidate are required. In (b) and (c) we require six and ten not necessarily distinct PSMs. While recovery is reduced by about 15% and 20%, resp., the two methods yield nearly identical results. In the 6frame-based data, very few unannotated candidates remain. Discrepancies between the databases are related almost exclusively to PSMs with poor scores. The number of such PSMs decreases drastically if only PSMs mapping to confidently identified proteins are considered (e,f). Candidate proteins that are only detected with one database show an accumulation of poorly scoring PSMs; we thus have to interpret these as mostly false positives.
As a second measure of how well we can replicate the original annotation by using a 6frame database, we quantify the differences between start sites predicted with the 6frame database and start sites reported in the original annotation. Their 3'-ends match perfectly as they are determined by the stop codons. For most of the annotated candidates, we recover the original length of the annotated protein (dominating peak at 0 in Fig. 2). For a fraction of the data we predict shorter candidates as compared to the annotation, presumably due to a lack of PSM coverage on the N-terminal part of the candidate. In a small number of cases our candidates begin upstream of the given annotation. This concerns 24 proteins with 6 PSMs. Fig. 2 shows one example, the fatty acyl-CoA synthetase FadD. Here, PSM evidence clearly shows that the true start codon is located upstream of the annotated coding sequence (CDS). Similar arguments can be made for 4 of the 24 cases with extended N-termini; the full list can be found on the result web page in the supplementary material. This indicates that, despite the outstanding quality of the annotation of the E. coli K12 reference genome, it is still not perfect, and proteogenomics data are able to correct some of the remaining inaccuracies.
This observation prompted us to also inspect the 11 "false positives" that are supported by 10 or more PSMs. It turns out that two of them correspond to two parts of the formate dehydrogenase O subunit alpha, which our pipeline did not recognize due to a (presumably erroneous) stop codon in the genomic sequence. Two candidates are the two parts of the peptide chain release factor RF2, which has long been known to contain an obligatory frameshift [27]. Its peptides thus appear in two distinct predicted ORFs, neither of which completely matches the annotation. Several mRNAs in E. coli are known to produce minor variants that include a frameshift [28]. Two additional candidates are an IS5 transposase, for which a frameshift has also been reported [29], and the transcriptional regulator GlpR, which, according to the UniProt annotation, also harbors a frameshift.
This leaves only 5 ORFs as likely false positives. Surprisingly, these candidates are well distinguished by the distribution of PSM scores: while the frameshift proteins harbor mostly well-scoring PSMs, the remaining likely false positives are matched only by PSMs with poor scores. This observation further supports the [...]. Increasing the sensitivity by requiring only 6 PSMs per candidate moderately increases the number of candidate proteins to 29. Counting the candidates predicted with the 6frame proteogenomics approach that do not match the annotation (or are not called using a proteome database) as false positives shows that the FDR drops quickly with the number of PSMs that are required to call a candidate, Additional File 1: Figure 2. Our analysis of the E. coli data suggests that a coverage of 6-10 PSMs is sufficient to identify likely candidate proteins. Notably, these PSMs may correspond to the same peptide. It is unlikely that the E. coli genome harbors many undiscovered candidates. We therefore analyse a larger, much less well annotated data set next.
Table 1: Ambiguous mapping of PSMs in the SIHUMIx dataset with annotation-based and 6frame databases. More than 95% of the PSMs are unique, and thus can be unambiguously assigned to one of the species of the consortium, and the majority of the remaining PSMs match only two positions on the metagenome (multiplicity = 2).

Metaproteogenomics of SIHUMIx
The proteomics data for SIHUMIx were analyzed using a combined 6frame database for the eight species. In order to verify that this approach can properly separate the spectra from the different species, we determined the number of PSMs mapped to more than one species, Table 1. For this model system we also analysed extensive RNA-seq data as a means of supporting proteogenomics-based predictions. It is not unexpected that there is only moderate agreement between protein and RNA abundances in Fig. 4, since RNA/protein ratios are known to vary considerably between organisms [30].
The rate of detection of known and hypothetical proteins in the eight SIHUMIx species, as expected, correlates with the relative abundance in the mixture, see Table 2. There is near perfect congruence between 6frame and proteome database, see Additional File 1: Table 1.
The distribution of known and hypothetical proteins differs dramatically across the eight SIHUMIx species. In most species, the majority of the proteins are annotated as hypothetical based on the level of evidence available in the data sources. Since the confidence levels are unlikely to be truly consistent between species due to differences in the efforts that have been expended for their annotation, these numbers have to be interpreted with caution. They do, however, at least reflect qualitative trends.
Figure 2: Comparison between the length difference of candidates from the 6frame search and the NCBI annotation, expressed as a fraction of the total protein length. A value of 0% indicates that candidate and annotation are identical. Negative values mean that the candidate is shorter than the annotated protein due to lack of coverage by PSMs towards the N-terminus. Positive values imply that the candidate has a longer N-terminal sequence that extends the annotation. Below, the N-terminus of the fatty acyl-CoA synthetase FadD is shown. It is misannotated and extends to a canonical start codon. The extension is also supported by evidence of conserved coding sequences determined by RNAcode [8].

Novel proteins in SIHUMIx
We discovered a total of 419 unannotated protein candidates supported by at least 6 PSMs in SIHUMIx. Since these initial candidates also include all those predictions that overlap annotated proteins in a different reading frame, we expect a priori that most of them are false positives. While it is manageable to manually evaluate a few hundred candidate proteins in a data set of particular interest, this is not feasible for routine applications and thus requires computational support. In order to better understand this candidate set we systematically gathered all information on them that is readily accessible by computational means. This leads to a natural workflow for prioritizing and validating the candidates.
The relative frequency is given by the number of PSMs per species divided by the total number of PSMs for the proteogenomics composition, and by the number of reads for the transcriptome composition. The mapped reads and the PSMs were normalized to the complete genomic size and the proteogenomic (6frame DB) size, respectively.
Table 2: Summary of the number of proteins detected with at least 10 and 6 PSMs in SIHUMIx proteomics data using the 6frame translation. Novel (nov) proteins are not contained in the annotation, hypothetical (hyp) proteins are annotated but tagged with low confidence (see Methods for details), and known refers to all proteins for which higher levels of confidence are associated with the available annotation. The eight species are ordered by decreasing abundance. The last column gives the fraction of the annotated proteins that were detected.
Homology search against the NCBI protein database identifies 60 of 419 candidates with extensive similarity to proteins with a functional annotation in related species. These cases are clearly shortcomings in the available annotations; they constitute a positive control for our approach and help to establish the criteria that can be applied to the remaining candidates. We exclude these 60 proteins from further analysis because we are interested here in those candidate proteins that cannot be found trivially by homology-based methods. In addition to these homologs of known proteins, another 47 of 419 candidates are homologs of hypothetical proteins. We first considered the distribution of the e-values of the PSMs that contribute to each candidate protein. Fig. 3 already strongly suggests that this is a reliable predictor. We use the average ŝ of the scores s = −log(e-value) for the three best PSMs as an aggregate descriptor. Fig. 7 summarizes the data with at least 6 supporting PSMs. Almost all candidates with ŝ > 3.5 have homologous known proteins in other species. As an example, the B. producta candidate nov 57 is shown in Fig. 6 (top). It has a probable length of 72 amino acids and shows recognizable homology with an adenylate kinase of similar length from Listeria.
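The aggregate descriptor ŝ, i.e. the average of s = −log(e-value) over the three best PSMs of a candidate, can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the study: the function name `s_hat` is hypothetical, the base-10 logarithm is assumed from the "common logarithm" mentioned in the caption of Figure 1, and averaging over fewer than three values when a candidate has fewer PSMs is our own choice.

```python
import math

def s_hat(e_values, n_best=3):
    """Average of s = -log10(e-value) over the n_best best-scoring PSMs.

    Assumptions (not stated in the text): base-10 logarithm, and if fewer
    than n_best PSMs are available the average is taken over all of them.
    """
    if not e_values:
        raise ValueError("candidate has no PSMs")
    # A smaller e-value means a better match, i.e. a larger score s.
    scores = sorted((-math.log10(e) for e in e_values), reverse=True)
    best = scores[:n_best]
    return sum(best) / len(best)

# Hypothetical candidate with four PSMs; only the three best contribute.
print(s_hat([1e-5, 1e-4, 1e-3, 1e-1]))
```

Because only the best few PSMs enter the average, a candidate that is hit many times by the same well-scoring peptide still receives a high ŝ, which matches the observation that the supporting PSMs need not be distinct.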
In total, we find 47 of 419 candidates with ŝ > 3.5. We first inspect all candidates with more than 10 high-scoring PSMs. Interspersed among these known genes are three novel proteins (B. producta nov 5, B. theta. nov 59 and nov 131). Nov 5 is clearly a complete protein, while nov 59 and nov 131 may be associated with frameshifts and constitute only parts of a protein. The most prominent candidate, B. producta nov 5, is shown in Fig. 5, lower panel. It has a likely length of 62 amino acids, judging from both the observed PSMs and the transcriptome data. Most but not all of these high-confidence candidates show evidence of transcription. Low RNA levels do not necessarily imply that the predicted protein is a false positive. In fact, detection limits for RNA and protein may be vastly different. The typically much longer half-life of proteins may also contribute to explaining the presence of proteins with low or undetectable RNA levels. [...] there is also an annotated protein in the same reading direction. Owing to our definition of the candidates, which extend to the nearest in-frame stop and the nearest in-frame start codon, this kind of overlap is indicative of an annotation error or a frameshift. Inspection shows that for B. producta nov 1 the available annotation of a TetR family transcriptional regulator extends across the stop codon. The remaining signals likely pertain to frameshifts. For B. producta nov 126 there is some weak evidence for translation of the annotated gene on the opposite strand, and convincing evidence for translation of a Cna B-type domain-containing protein corresponding to nov 126 that has been left unannotated.
Only three candidates with 6-9 PSMs have ŝ ≥ 3.5: B. producta nov 216, an IS66 family transposase; nov 307, a hypothetical protein without functional annotation; and E. coli nov 302, the frameshift fragment of peptide chain release factor RF2 already discussed above.
The analysis of the remaining candidates with ŝ < 3.5 is much less straightforward. Although the overwhelming majority of them show no homology to a known or hypothetical protein, this set contains at least a small number of proteins with known homologs and convincing proteomics evidence: B. producta nov 174 (ŝ = 3.3), B. producta nov 215 (ŝ = 2.9), B. theta. nov 180 (ŝ = 2.8), and possibly E. coli nov 122 (ŝ = 2.3). Some others, such as B. producta nov 84 (ŝ = 3.0) and nov 28 (ŝ = 2.4), however, are almost certainly false positives. A few curious cases, such as E. coli nov 123 (ŝ = 2.1), are indicative of incorrect stop codons or read-through; here the candidate sequence matches a GntR family protein from related species whose sequences extend beyond the stop codon of the annotated E. coli GntR gene immediately upstream of nov 123.
Protein expression from the opposite strand is a good indication that a candidate is a false positive: while overlapping ORFs are not uncommon in bacteria, long overlaps of coding regions are very rare [31,32]. There are, however, a handful of exceptions: As already mentioned above, B. producta nov 126 is much more plausible than the potentially expressed ORF on the opposite strand. A few additional cases are supported by many good PSMs mapping to two or more distinct peptides. The best example in our data is L. plantarum nov 19 (ŝ = 3.28). It deserves a more detailed follow-up.
For moderate values of ŝ < 3.5, therefore, we need additional criteria to distinguish between bona fide protein detections, likely parts of other proteins that should prompt an update of a known protein, and false positives. We therefore inspected additional descriptors. First, we consider the number of distinct peptides corresponding to the PSMs belonging to a given candidate. Supporting the use of ŝ as a valuable indicator, we find that, with few exceptions, the candidates with large ŝ values have multiple peptides, while for small ŝ, most candidates are supported only by a single peptide. The few notable exceptions (nov 174, nov 215, nov 180) with more than 3 distinct peptides have already been noted above as proteins with known homologs.

Workflow for Identifying and Prioritizing Candidate Proteins
The detailed evaluation of both the E. coli and the SIHUMIx metaproteomics data reported above informs the workflow for the identification of novel proteins shown in Fig. 8. It primarily relies on the number of PSMs mapped to an ORF and the distribution of their e-values, irrespective of whether or not there are multiple distinct peptides. The initial decision is based on the number of PSMs, followed by a cut-off on the average score of the three best PSMs. Together the two values ensure reproducibility of good matches in the data set. For values of ŝ ≥ 3.5, unlikely candidates are only those without distinct peptide matches and no evidence for transcription. For values 2.5 ≤ ŝ < 3.5, multiple distinct peptides may rescue an initial negative decision. Here, transcriptomics data are not helpful, since prokaryotic genomes produce diverse non-coding transcripts [33][34][35], so that transcription in itself cannot be used as a reliable predictor of translation.
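The decision rules described in this paragraph can be written down as a small function. The sketch below is one possible reading of the text only, since Fig. 8 itself is not reproduced here: the function name `prioritize`, the labels it returns, the interpretation of "multiple distinct peptides" as at least two, and the treatment of ŝ < 2.5 as "unlikely" are assumptions of this illustration. The thresholds 6, 3.5 and 2.5 are the values quoted in the text.

```python
def prioritize(n_psms, s_hat, n_distinct_peptides, transcribed,
               min_psms=6, high=3.5, low=2.5):
    """Classify a candidate ORF as 'rejected', 'unlikely' or 'likely'.

    n_psms              -- number of (not necessarily distinct) PSMs
    s_hat               -- average -log10(e-value) of the three best PSMs
    n_distinct_peptides -- number of distinct peptides among the PSMs
    transcribed         -- True if there is evidence for transcription
    """
    # Step 1: require a minimum number of supporting PSMs.
    if n_psms < min_psms:
        return "rejected"
    multiple = n_distinct_peptides >= 2  # assumption: "multiple" means >= 2
    # Step 2: high scores are accepted unless both supporting lines
    # of evidence (distinct peptides, transcription) are missing.
    if s_hat >= high:
        if not multiple and not transcribed:
            return "unlikely"
        return "likely"
    # Step 3: moderate scores can be rescued by multiple distinct
    # peptides; transcription is deliberately ignored here.
    if s_hat >= low:
        return "likely" if multiple else "unlikely"
    # Assumption: everything below the lower threshold is not followed up.
    return "unlikely"

print(prioritize(12, 4.0, 1, True))   # high score, transcribed
print(prioritize(8, 3.0, 3, False))   # moderate score, several peptides
```

Keeping the thresholds as keyword arguments makes it easy to explore how the number of prioritized candidates changes when a stricter PSM count, such as 10, is required.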

Discussion
We have shown here that prokaryotic proteins can be identified with high reliability by considering the PSMs that map to the corresponding genomic location. Using SIHUMIx as an example, we found that ŝ (the average negative logarithm of the e-values of the best few PSMs that map to a candidate ORF) is an excellent discriminator between bona fide proteins and false positive signals. In conjunction with the number of PSMs, it is sufficient to identify nearly all of the ORFs in the SIHUMIx data that have functionally annotated homologs in related species and thus are most likely true proteins. In a fine-grained analysis, the number of distinct peptides helps to distinguish likely candidates from background noise in the case of moderate values of ŝ. Manual inspection also revealed that translation products involving frameshifts can be detected even if the frame-shifted part contains only a single detectable peptide. Somewhat surprisingly, RNA expression data add very little to the task of identifying novel proteins. Taken together, our observations lead us to the workflow summarized in Fig. 8. It is designed to efficiently identify previously unannotated candidates. It can also be employed to validate previously annotated proteins using the same decision criteria, since it accurately reproduces the annotation for known proteins from the PSM data and in some cases can help to correct annotation errors such as incorrect start codons.
Since PSMs are mapped directly to the genomic sequence in our workflow, standard genome browsers can be used to visualize the data. This also facilitates the integration with other data sources, including transcriptome data and information on sequence conservation. The presentation of the data in a genome browser supports the manual evaluation of protein candidates in their genomic context, since information on overlapping features, including predicted proteins and PSM data mapping to other reading frames, is directly accessible.
Candidates identified as (likely) novel proteins can be followed up on by further computational steps. Most importantly, a homology search is likely to identify a large fraction of candidates as homologs of proteins that have been described already in other species. As in the case of the SIHUMIx example, we expect this to leave only a small fraction of novel proteins and homologs that so far have appeared only as "hypothetical proteins".
Figure 8: Rules to prioritize candidate proteins for further investigation. A candidate is classified as transcribed if more than 70% of its length is above the median RNA expression level of the organism. Annotated genes that overlap a candidate are classified as translated if they are identified by more than 6 unique PSMs.
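The "transcribed" criterion from the caption of Figure 8 (more than 70% of the candidate's length above the organism's median RNA expression level) can be sketched as follows. The function name `is_transcribed` and the representation of expression as one coverage value per nucleotide of the candidate are assumptions made for this illustration; how the median expression level is computed is not specified here and is passed in as a number.

```python
def is_transcribed(coverage, median_expression, min_fraction=0.7):
    """Return True if more than min_fraction of the candidate's positions
    have RNA coverage above the organism-wide median expression level.

    coverage          -- per-nucleotide RNA-seq coverage of the candidate
                         (assumed input format)
    median_expression -- median expression level of the organism
    """
    if not coverage:
        return False
    above = sum(1 for c in coverage if c > median_expression)
    return above / len(coverage) > min_fraction

# Hypothetical 10-nt candidates, median expression level 5.
print(is_transcribed([10] * 8 + [0] * 2, 5))  # 80% above the median
print(is_transcribed([10] * 7 + [0] * 3, 5))  # exactly 70%, not "more than"
```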
The workflow of Fig. 8 provides a robust way to identify novel proteins, including sProteins, from large mass spectrometry data. The method is applicable not only to a single species but also to metaproteomics data, provided the species composition of the sample is known. In the artificial gut community SIHUMIx we found 37 non-annotated novel proteins, among them six sProteins. Applications to microbial communities, however, are likely to be limited to the most abundant species, since the probability to identify a protein depends on its relative abundance in the sample.

Materials and Methods
Proteomics data sets. For our analysis we used two different tandem mass spectrometry data sets. One is a data set from a single strain, E. coli K-12, grown under standard conditions. The data set consists of seven experimental replicates and is part of the publication [25]. The SIHUMIx data sets are described in detail in [36,37]. They comprise 166 independent measurements, of which 90 used a standard protein preparation protocol and the remaining 76 cover different enrichment protocols to elevate the level of small proteins in solution.
More than 5.7 million PSMs were analysed for the project. The search of the SIHUMIx data sets against the 6frame database (the main analysis to find new protein candidates) resulted in over 2.5 million PSMs. Besides different protein enrichment protocols, both Trypsin and Asp-N were used as cleavage enzymes.
Peptide identification. We used getorf [38] to retrieve all open reading frames between two stop codons from the genomic DNA sequence without any length constraints. For each ORF we store its amino acid sequence as well as its genomic start and end coordinates. The reading frame is defined from the start coordinate k as k mod 3 in forward direction and (k mod 3) − 3 in reverse direction. We then used Comet [15,39] to search tandem mass spectra against protein sequence databases. Standard search parameters were used for both the 6frame and the annotated protein databases, with the following exceptions: (i) we allowed semi-digestion at the N-terminus to accommodate fragmentation at the start codon, (ii) we conducted a concatenated search against a decoy database, and (iii) we used the full resolution of the MS/MS spectra.
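The reading-frame convention and the stop-to-stop ORF definition can be illustrated with a short Python sketch. It is not a replacement for getorf and handles the forward strand only; the function names, the use of 1-based inclusive coordinates, and the standard stop codons TAA, TAG and TGA are assumptions of this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard stop codons

def reading_frame(start, strand):
    """Frame label from the start coordinate k: k mod 3 on the forward
    strand (0, 1, 2) and (k mod 3) - 3 on the reverse strand (-3, -2, -1)."""
    if strand == "+":
        return start % 3
    if strand == "-":
        return (start % 3) - 3
    raise ValueError("strand must be '+' or '-'")

def forward_orfs(seq):
    """Yield (start, end, frame) for every stop-to-stop stretch on the
    forward strand, without any length constraint. Coordinates are
    1-based and inclusive; the stop codon itself is excluded."""
    seq = seq.upper()
    for offset in range(3):
        orf_start = offset  # 0-based index of the first codon of the ORF
        for i in range(offset, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i > orf_start:
                    yield (orf_start + 1, i, reading_frame(orf_start + 1, "+"))
                orf_start = i + 3
        # Open stretch at the sequence end, up to the last complete codon.
        last = offset + 3 * ((len(seq) - offset) // 3)
        if last > orf_start:
            yield (orf_start + 1, last, reading_frame(orf_start + 1, "+"))

for orf in forward_orfs("ATGAAATAACCC"):
    print(orf)
```

With this convention all ORFs that share a frame on the same strand receive the same label, so PSMs and annotation items can be compared by frame without re-translating the sequence.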
Estimation of false discovery rates for PSMs. In addition to calculating the FDR of a PSM by using a decoy database within the Comet software, two alternative approaches to estimate the FDR have been proposed [26,40]. In the first approach, we assume that a false positive PSM is mapped with equal rate to a translated and a non-translated locus. Ignoring the possibility of overlapping proteins in different frames, we interpret all n PSMs mapping to one of the five incorrect reading frames of an annotated protein as false positives, resulting in an estimated number of (6/5)n false positives. Of the N PSMs mapping in the correct reading frame, we expect N − (1/5)n to be true positives. We can therefore estimate the false discovery rate as
$$\mathrm{FDR}_{\mathrm{ann}} = \frac{6}{5}\,\frac{n}{n+N},$$
where n + N is the total number of PSMs mapped to an annotated locus irrespective of the frame. Alternatively, we make the assumption that the protein annotation is complete and assume that a fraction α of the genome is covered by annotated proteins. All n PSMs mapped outside this annotation are counted as false positives. Thus we have
$$\mathrm{FDR}_{\mathrm{genome}} = \frac{1}{1-\alpha}\,\frac{n}{N_{\mathrm{tot}}},$$
where $N_{\mathrm{tot}} = N + n$ is the total number of mapped spectra. The prefactor extrapolates the same FDR to the annotated part of the genomes. In order to account for very short ORFs to which no PSMs can be mapped by construction, the factor α can be estimated more accurately by estimating the chance that a randomly drawn PSM from the 6-frame annotation falls into an annotated region. For E. coli this yields α = 0.293. We note that $\mathrm{FDR}_{\mathrm{ann}}$ is robust against incomplete annotation and also will not change substantially if many wrong genes are falsely annotated. In contrast, $\mathrm{FDR}_{\mathrm{genome}}$ will only work well for genomes with reasonably complete annotations [26]. We checked the consistency of the PSM estimates for the E. coli data.
Among the N = 180059 mapped PSMs we observed n = 829 hits to incorrect reading frames and obtain $\mathrm{FDR}_{\mathrm{ann}}$ = 0.55%, i.e., a slight improvement over Comet's internal estimate of 1% from hits in the decoy database. Alternatively, at least in a well-annotated genome such as E. coli, we may use PSMs mapped to unannotated regions as an estimator. This yields $\mathrm{FDR}_{\mathrm{genome}}$ = 0.52%. We also validated that, as expected [26], the genome-based FDR estimates are proportional to the FDRs estimated for the decoy database (Additional File 1: Figure 3).
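As a worked example, the two annotation-based estimators can be evaluated numerically. The sketch below follows our reading of the description: the frame-based estimate is (6/5) times the number of PSMs in incorrect frames divided by the total number of PSMs mapped to annotated loci, and the genome-based estimate is the number of PSMs in unannotated regions divided by (1 − α) times the total number of mapped PSMs. The function names are hypothetical, and treating 180059 as the total count in the first formula is an assumption.

```python
def fdr_ann(n_wrong_frame, n_total):
    """Frame-based estimate: PSMs hitting one of the five incorrect frames
    of an annotated locus are false positives; a further fifth of that
    number is expected to hide in the correct frame (factor 6/5)."""
    return (6.0 / 5.0) * n_wrong_frame / n_total

def fdr_genome(n_unannotated, n_total, alpha):
    """Genome-based estimate: PSMs outside the annotation are false
    positives; dividing by (1 - alpha) extrapolates to the annotated
    fraction alpha of the 6-frame search space."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return n_unannotated / ((1.0 - alpha) * n_total)

# E. coli numbers quoted in the text: 829 wrong-frame hits among 180059 PSMs.
print(f"FDR_ann = {100 * fdr_ann(829, 180059):.2f}%")
```

The first call reproduces the reported value of about 0.55%. The count of PSMs in unannotated regions is not given in the text, so the genome-based estimate is not evaluated for the E. coli data here.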
Mapping PSMs to the genome. To map a PSM to the genome, we determine its relative position in the ORF or ORFs of the protein or 6frame database. This position is then directly translated to genomic coordinates using the known genomic coordinates of the ORFs/proteins. Peptides may map to multiple ORFs/proteins. If this is the case, the multiplicity of the mapping is stored and can be accessed in the genome browser.
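The translation from a peptide's position within an ORF to genomic coordinates amounts to a small piece of arithmetic. The following sketch assumes 1-based inclusive genomic coordinates with orf_start ≤ orf_end on both strands, a 0-based residue offset within the translated ORF, and that translation of a reverse-strand ORF begins at orf_end; the function name `psm_to_genome` is hypothetical.

```python
def psm_to_genome(orf_start, orf_end, strand, aa_offset, peptide_length):
    """Return the genomic interval (start, end) covered by a peptide.

    orf_start, orf_end -- 1-based inclusive genomic coordinates of the ORF
    strand             -- '+' or '-'
    aa_offset          -- 0-based index of the peptide's first residue
                          in the translated ORF
    peptide_length     -- peptide length in amino acids
    """
    nt_length = 3 * peptide_length
    if strand == "+":
        start = orf_start + 3 * aa_offset
        end = start + nt_length - 1
    elif strand == "-":
        # On the reverse strand the first residue sits at the high end.
        end = orf_end - 3 * aa_offset
        start = end - nt_length + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    if start < orf_start or end > orf_end:
        raise ValueError("peptide does not fit into the ORF")
    return start, end

# Hypothetical ORF of 100 codons at positions 101..400.
print(psm_to_genome(101, 400, "+", 10, 8))
print(psm_to_genome(101, 400, "-", 10, 8))
```

A peptide that occurs in several ORFs would simply be passed through this function once per ORF, which yields the multiplicity mentioned above.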
Construction and annotation of candidate proteins. We start from the collection of ORFs for a genome. For each ORF, we determine all PSMs that map inside it. The C-terminus of the candidate is determined by the stop codon of the ORF. The N-terminus is the closest canonical start codon before the first mapped PSM, or if no such start codon exists within the ORF, the first position of the ORF.
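The rule for the N-terminus can be sketched at the codon level. This is a minimal illustration under stated assumptions: the ORF is given as an in-frame nucleotide string in coding orientation, the position of the first mapped PSM is given as a 0-based codon index, and ATG, GTG and TTG are taken as the canonical start codons, a choice the text does not spell out. The function name `candidate_start_codon` is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG")  # assumption: common bacterial starts

def candidate_start_codon(orf_nt, first_psm_codon, start_codons=START_CODONS):
    """Return the 0-based codon index at which the candidate begins.

    Scans upstream from the codon preceding the first mapped PSM and
    returns the closest start codon; if the ORF contains none upstream
    of the PSM, the first position of the ORF (index 0) is returned.
    """
    orf_nt = orf_nt.upper()
    for idx in range(first_psm_codon - 1, -1, -1):
        if orf_nt[3 * idx:3 * idx + 3] in start_codons:
            return idx
    return 0

# Codons: CCC ATG AAA GTG CCC GGG AAA; the first PSM starts at codon 5.
orf = "CCCATGAAAGTGCCCGGGAAA"
print(candidate_start_codon(orf, 5))            # closest start is GTG
print(candidate_start_codon(orf, 5, ("ATG",)))  # ATG only
```

Choosing the closest upstream start yields the shortest candidate that is consistent with the PSM evidence, which is why candidates can come out shorter than the annotated protein when N-terminal peptides are not observed.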
The candidate proteins are then compared to the protein annotation that is available for the genome in question. A candidate is considered annotated if it overlaps an annotation item in the correct reading frame and reading direction. In each case, we record the difference between the genomic start positions of annotation and candidate.
Proteins contained in the available annotations are classified as known unless they are tagged with validation level 1 ("protein uncertain") or 2 ("protein predicted") in UniProt (i.e., lacking evidence from experiment or homology), or carry the annotations "frameshifted", "internal stop", "hypothetical", "Putative", or "pseudogene". All of these are interpreted as hypothetical in Table 2.
Transcriptome data. The transcriptome data were taken from [25] and mapped with segemehl [41] to an index comprising the eight SIHUMIx species as separate chromosomes. Default parameters were used. Annotation files were generated with samtools (http://www.htslib.org/). Total expression per species was averaged over all replicates.
Visualization. We display the data using the UCSC genome browser [42], which makes it easy to integrate them with other data, including transcriptome data, available annotations, as well as custom annotations.

Availability of data and material
The transcriptomics data are available under the bioproject PRJNA655119, https://www.ncbi.nlm.nih.gov/bioproject/655119. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE [43] partner repository with the dataset identifier PXD023243. The genomes and corresponding annotations used for the project are all publicly available from the NCBI Assembly database [44]; a full list can be found in Additional File 1: Table 2. The following material is available for download from www.bioinf.uni-leipzig.de/Publications/SUPPLEMENTS/20-002/:
• SIHUMIx track hub (track hub for the UCSC genome browser)
• Result web page (full list of candidates, E. coli annotation errors and interactive plots)
• Validation hash map (maps each annotated protein in SIHUMIx to a validation level as of the time of the publication)
All scripts that were used to generate the data for this publication are available under https://github.com/studla/PROTMAP.

List of Abbreviations
PSM: peptide-to-spectra match
FDR: false discovery rate
AA: amino acid

Declarations