Process for selecting 7,444 measurement runs
The HeLa cell lines were repeatedly measured as maintenance (MNT) and quality control (QC) samples for the mass spectrometers at the Novo Nordisk Foundation Center for Protein Research (Copenhagen, Denmark) and the Max Planck Institute of Biochemistry (Martinsried, Germany). MNT samples were run after instrument cleaning or repair, and QC samples were run during the measurement of larger cohorts to monitor instrument performance over time using HeLa as a reference sample. As the data was collected over many years, it contains acquisitions using many machines, various liquid chromatography methods, column lengths and injection methods. The machine identifier is stored in the raw data, but most information with respect to chromatography was not recorded. The most widely used instrument for providing gradients for chromatographic separation was a Thermo EASY-nLC 1200, although EVOSEP ONE systems were also used. In total, 50,521 raw files were collected, consisting mostly of single-shot DDA runs. A few DIA or fractionated measurements were also available.
Of the 50,521 raw files, we were able to process 11,062 one by one with MaxQuant 1.6.12 (ref. 8), yielding between 0 and 54,316 identified peptides per run. We then selected runs with at least 15,000 identified peptides based on the information in the summary.txt, which gave us 7,484 runs with a minimum raw file size of 0.618 GB. We removed duplicated runs using the creation date and machine identifier, leaving a total of 7,444 unique runs, which were uploaded to PRIDE. We extracted metadata using ThermoRawFileParser and provide it for the selected 7,444 samples.
Sample preparation protocol
The cells were lysed by different labs and researchers, so the exact procedure might differ slightly over time from the reference protocol uploaded to a protocol preprint server9. Although the exact protocol could not be recovered on a per-sample basis, all proteins are expected to have been digested using trypsin. The injection volume ranges from one to seven microlitres.
Processing steps of a single raw file
We processed raw files using a Snakemake10 workflow, each as a single run in MaxQuant 1.6.12 (ref. 8), yielding single abundances for precursor, aggregated peptide and protein group intensities using LFQ. For the DDA searches with MaxQuant we used the UniProt human reference proteome database, 2018 release, containing 21,007 canonical and 72,792 additional sequences. Contaminants were controlled using the default contaminants fasta shipped with MaxQuant. From the MaxQuant search of each file, we uploaded the tab-separated output files to PRIDE alongside the associated raw file. In a companion paper we used the “evidence.txt” for precursor quantifications, “peptides.txt” for aggregated peptides and “proteinGroups.txt” for protein groups, referencing them by their gene group3.
Ready to use development dataset
We created a curated, ready-to-use development dataset at three levels of the data for convenience: protein groups, peptides and precursors. The filtered aggregation is provided as zipped folders for these three levels. In brief, filtering was done using a feature completeness cutoff of 25 percent, i.e. a feature had to be observed in at least 25% of the 7,444 samples. Zero or missing intensity entries were discarded, and depending on the file type further filtering was performed, as extensively described in Webel et al. (2023)3. First, for protein groups, i.e. proteinGroups.txt files, we dropped “Only_identified_by_site”, “Reverse” and “Potential_contaminant” entries. Then we dropped entries without a “Gene name”, used “Gene name” as identifier, and selected the entry with the maximum “Intensity” if one gene set was used for more than one protein group. Second, for aggregated peptides, i.e. peptides.txt files, we used the “Intensity” column for LFQ intensities and “Sequence” as unique identifier. Last, for precursors, i.e. evidence.txt files, we dropped potential contaminant entries and zero-intensity entries as they provided no quantification for an identified feature, used the “Intensity” column for the label-free quantitation intensities, used “Sequence” and “Charge” as identifiers, and finally selected the entry with the maximum intensity for non-unique combinations of “Sequence” and “Charge”, as this normally corresponds to the best Andromeda score. We did this as we were not interested in modifications and therefore neglected the number of available modified peptides.
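As an illustration, the protein-group filtering steps described above can be sketched with pandas. The column names follow the MaxQuant proteinGroups.txt conventions; the table itself is a made-up miniature stand-in for demonstration only:

```python
import pandas as pd

# Synthetic stand-in for a proteinGroups.txt table (real files have many more columns).
pg = pd.DataFrame({
    "Gene names": ["ACTB", "ACTB", None, "GAPDH"],
    "Intensity": [2.0e9, 1.0e9, 5.0e8, 3.0e9],
    "Reverse": [None, None, None, "+"],
    "Potential contaminant": [None, None, None, None],
    "Only identified by site": [None, None, None, None],
})

# Drop decoy, contaminant and site-only entries (flagged with "+").
for col in ["Reverse", "Potential contaminant", "Only identified by site"]:
    pg = pg[pg[col].isna()]

# Drop entries without a gene name, then keep only the highest-intensity
# protein group when one gene name maps to several groups.
pg = pg.dropna(subset=["Gene names"])
pg = pg.sort_values("Intensity", ascending=False).drop_duplicates("Gene names")

print(pg[["Gene names", "Intensity"]])  # one ACTB entry remains
```

The same max-intensity deduplication pattern applies at the precursor level, with "Sequence" and "Charge" as the grouping keys instead of "Gene names".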
Data Records
Each uploaded raw file has a MaxQuant search output associated with it: a set of standard text files as described on the MaxQuant website. Each run is identified by its recording date and the machine identifier as stored in the raw file. The MaxQuant text output is stored as a zipped folder with the same identifier as the raw file. These files contain the three data levels used in our original publication3. First, the proteinGroups.txt file contains all protein groups found by searching the associated raw file. Second, the peptides.txt file contains aggregated evidence for each identified peptide. Third, the evidence.txt file contains all aggregated evidence for each precursor, i.e. each peptide uniquely identified by its charge and modification. The data was then further filtered, e.g. by removing contaminants, non-quantified entities and rarely identified precursors, peptides or protein groups.
The aggregated and filtered data dumps proteinGroups_curated.zip, peptides_curated.zip and precursors_curated.zip contain the data with features in columns and samples in rows, as well as transposed, which is the more classically used view in proteomics: samples in columns and features in rows. Note that a feature here corresponds to the output of MaxQuant at the evidence (i.e. precursor) level, the peptide level or the protein group level. Each zip file contains a brief Python script indicating how to load the data into a pandas DataFrame11. For further processing, it is highly recommended to save the desired selection in a binary format for faster loading. Selected metadata is provided in a separate csv file called pride_metadata.csv. It combines ThermoRawFileParser12 metadata from the raw files with the raw file sizes as well as the summary information of the MaxQuant runs. This file can be used to select data of interest, e.g. files associated with certain instruments, see Table 1 and Fig. 2a. The aggregated summary information from MaxQuant is additionally provided separately in mq_summaries.csv.
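Selecting runs from one instrument via the metadata file could look like the following sketch. The column names ("filename", "instrument_model") are illustrative assumptions and should be checked against the actual header of pride_metadata.csv; a synthetic frame stands in for the real file:

```python
import pandas as pd

# Synthetic stand-in for pride_metadata.csv; in practice use
# pd.read_csv("pride_metadata.csv"). Column names here are assumed,
# not taken from the actual file.
meta = pd.DataFrame({
    "filename": ["run_a.raw", "run_b.raw", "run_c.raw"],
    "instrument_model": [
        "Q Exactive HF-X Orbitrap",
        "Q Exactive HF Orbitrap",
        "Q Exactive HF-X Orbitrap",
    ],
})

# Restrict to files acquired on one instrument model (cf. Table 1)
# before downloading or loading the associated data.
hfx = meta[meta["instrument_model"] == "Q Exactive HF-X Orbitrap"]
print(hfx["filename"].tolist())
```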
Table 1
Instrument information from ThermoRawFileParser and associated counts in 7,444 raw files.
Instrument model | Instrument label | Count |
Q Exactive Plus Orbitrap | Q-Exactive-Plus-Orbitrap_1 | 9 |
Q Exactive Plus Orbitrap | Q-Exactive-Plus-Orbitrap_143 | 3 |
Q Exactive Orbitrap | Q-Exactive-Orbitrap_1 | 353 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6070 | 564 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6071 | 542 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6075 | 515 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6101 | 458 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6096 | 395 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6078 | 393 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6011 | 277 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6073 | 260 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6016 | 219 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6004 | 208 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6028 | 95 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6025 | 69 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6044 | 69 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6324 | 68 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6022 | 64 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6013 | 55 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6043 | 55 |
Q Exactive HF-X Orbitrap | Q-Exactive-HF-X-Orbitrap_6023 | 38 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_207 | 412 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_147 | 383 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_143 | 332 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_204 | 319 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_206 | 274 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_148 | 228 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_1 | 156 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_1 | 99 |
Q Exactive HF Orbitrap | Q-Exactive-HF-Orbitrap_2612 | 30 |
Orbitrap Fusion Lumos | Orbitrap-Fusion-Lumos_FSN20115 | 226 |
Orbitrap Exploris Slot #134 | Orbitrap-Exploris-480_MA10134C | 67 |
Orbitrap Exploris Slot #130 | Orbitrap-Exploris-480_MA10130C | 2 |
Orbitrap Exploris Slot #0215 | Orbitrap-Exploris-480_MA10215C | 32 |
Orbitrap Exploris Slot #0132 | Orbitrap-Exploris-480_MA10132C | 105 |
MA-MP9 | Orbitrap-Exploris-480_Invalid_SN_0001 | 34 |
Exactive Series Orbitrap | Exactive-Series-Orbitrap_6004 | 36 |
The mass spectrometry proteomics data and aggregated dumps have been deposited to the ProteomeXchange Consortium via the PRIDE13 partner repository with the dataset identifier PXD042233.
Technical Validation
We validated that the metadata associated with a raw file is unique in the uploaded dataset based on its creation date and machine identifier. Initially, some files were duplicated and reprocessed with differing file names. Using the information from the summary.txt of the MaxQuant output, we filtered runs by including only those with a minimum of 15,000 identified peptides. The samples can be further selected using their ratio of identified to MS1 or MS2 spectra (Fig. 2b), their ratio of MS1 to MS2 spectra recorded (Fig. 2c), or their retention time (Fig. 2d).
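The peptide threshold and spectra-based ratios can be applied with pandas along these lines. The column names ("Peptide Sequences Identified", "MS/MS") follow typical MaxQuant summary.txt headers but should be verified against mq_summaries.csv; the frame below is synthetic:

```python
import pandas as pd

# Synthetic stand-in for mq_summaries.csv; in practice use
# pd.read_csv("mq_summaries.csv"). Column names mirror common
# MaxQuant summary.txt headers and are an assumption here.
summaries = pd.DataFrame({
    "run": ["r1", "r2", "r3"],
    "Peptide Sequences Identified": [32000, 9000, 18000],
    "MS": [11000, 12000, 10000],
    "MS/MS": [60000, 65000, 40000],
})

# Keep only runs passing the 15,000-peptide threshold used for the dataset,
# then compute the ratio of identified peptides to recorded MS2 spectra
# as an additional quality measure (cf. Fig. 2b).
selected = summaries[summaries["Peptide Sequences Identified"] >= 15000].copy()
selected["id_rate"] = (
    selected["Peptide Sequences Identified"] / selected["MS/MS"]
)
print(selected[["run", "id_rate"]])
```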
Based on the clustering of the missing-value pattern, we observed that low-abundance protein groups have a higher percentage of missing values, as expected (Fig. 3a). Most samples had more than 3,000 protein groups quantified, and over 1,500 protein groups were present in nearly all samples (Fig. 3b). Most samples and most protein groups were positively correlated with each other (Pearson correlation, Fig. 3c,d). Clustering of samples and protein groups using Pearson correlation on normalised intensities indicated that protein groups cluster by their normalised intensities and that samples visually form groups (Fig. 4). This means that large clusters of the data provide a relatively homogeneous intensity distribution between samples.
Usage Notes
The curated data dumps proteinGroups_curated.zip, peptides_curated.zip and precursors_curated.zip are ready to use and each contain a brief Python script showing how to load the data into memory using pandas11. We recommend selecting the data of interest using the metadata provided in pride_metadata.csv and storing the processed data in a binary format for improved file reading speed. The protein group, peptide and precursor datasets are provided in wide format, dumped both with samples in rows and features in columns, and transposed.
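The recommended save-to-binary step might look like the following sketch, using pandas' pickle round trip; the file name is hypothetical and a small synthetic frame stands in for a loaded curated table:

```python
import pandas as pd

# Stand-in for a wide samples-by-features table loaded from a curated dump;
# in practice this would come from reading the unzipped CSV.
df = pd.DataFrame(
    {"P1": [1.0e9, 2.0e9], "P2": [3.0e9, None]},
    index=["sample_1", "sample_2"],
)

# Persist once in a binary format so repeated loading is much faster
# than re-parsing the text dump ("protein_groups.pkl" is a made-up name).
df.to_pickle("protein_groups.pkl")
reloaded = pd.read_pickle("protein_groups.pkl")
```

Other binary formats such as Parquet or HDF5 work equally well if the corresponding optional dependencies are installed.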
To work with the original MaxQuant zipped output folders, one needs to download and unzip them, and then read in the tab-separated files, which have a `txt` file extension instead of `tsv`, for further processing.
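Reading such a tab-separated `txt` member directly from a zip archive can be done with the standard library plus pandas. The archive here is built in memory as a tiny stand-in; with a downloaded dump one would pass the archive path to `zipfile.ZipFile` instead:

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory stand-in for a downloaded MaxQuant output archive
# (a real one would contain peptides.txt, proteinGroups.txt, evidence.txt, ...).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("peptides.txt", "Sequence\tIntensity\nPEPTIDER\t1000\n")

# The txt files are tab-separated despite the extension, so sep="\t" is needed.
with zipfile.ZipFile(buf) as zf, zf.open("peptides.txt") as fh:
    peptides = pd.read_csv(fh, sep="\t")

print(peptides)
```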
Searching samples in a distributed way
A Snakemake workflow10 for processing the uploaded HeLa samples is available. It can be used as a skeleton to re-analyse the uploaded HeLa samples with a search engine other than MaxQuant, or to process your own files. It is based on the assumptions that the files are located on a remote server and that each file is processed individually.
Reading metadata for raw files
A Snakemake workflow10 for reading the file metadata of the Thermo Fisher instruments is provided. It downloads the data from an FTP server (or finds the files locally) and then uses the ThermoRawFileParser12 to extract the metadata from each file.