Data import
Metabolite AutoPlotter reads quantified data containing the identified metabolites and their intensities, generated elsewhere, e.g. with manufacturers’ software or other software tools for peak detection and metabolite identification. Processed data need to include at least the metabolite names, unique sample IDs (as filenames) and the intensities (areas or heights), and can be imported in Microsoft Excel format (xls/xlsx), as comma-separated files (csv) or in other text formats. We implemented direct imports for the two most common outputs generated in our institute for LC-MS data (TraceFinder and Compound Discoverer, both from Thermo Fisher Scientific). Alternatively, data can be imported in a more generalised way, either as a list or as a matrix, to allow the import of data generated with other software tools. A description of the data structures and expected columns is shown in Table 1.
Adding sample descriptions
Figure 1-top illustrates the AutoPlotter workflow. Once the data are imported, they are converted to the internal data structure. A template for the sample-table is generated and downloaded from within the app in order to define the conditions and the main properties used for plotting. These include the names of the conditions shown on the plots, the order of the conditions and their colours. Samples can be normalised at this stage by cell counts or protein content. AutoPlotter offers two different strategies to do so. For the “absolute correction”, the metabolite intensities (peak areas) are divided by the supplied correction value. This can be used to express the results as “peak area per million cells” or “peak area per microgram of protein”, and is particularly useful when absolute quantities are used. Alternatively, a “relative correction” can be used. Here the metabolite intensities are divided by the given correction value and then multiplied by the average of all correction values. As a result, the dimension of the metabolite intensities is maintained and the samples are equalised to similar levels. Consequently, this strategy is not recommended when differences between samples should be maintained, for example when different cell lines are used that differ in size and therefore in the harvested cell counts.
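The two correction strategies can be summarised in a minimal sketch. This is an illustration of the arithmetic described above, not AutoPlotter’s actual source; all names and values are hypothetical.

```python
# Sketch of the two normalisation strategies: "absolute" divides by the
# correction value, "relative" additionally rescales by the mean of all
# correction values so intensities keep their original dimension.

def absolute_correction(intensities, correction):
    """Divide each sample's intensities by its correction value
    (e.g. cell count in millions), yielding 'peak area per million cells'."""
    return {s: {m: v / correction[s] for m, v in mets.items()}
            for s, mets in intensities.items()}

def relative_correction(intensities, correction):
    """Divide by the correction value, then multiply by the mean of all
    correction values, keeping the intensities on their original scale."""
    mean_corr = sum(correction.values()) / len(correction)
    return {s: {m: v / correction[s] * mean_corr for m, v in mets.items()}
            for s, mets in intensities.items()}

# Hypothetical example: sample s2 has twice the cells and twice the signal.
intensities = {"s1": {"lactate": 1000.0}, "s2": {"lactate": 2000.0}}
cells = {"s1": 1.0, "s2": 2.0}  # million cells per sample

print(absolute_correction(intensities, cells))  # lactate: 1000.0 in both
print(relative_correction(intensities, cells))  # lactate: 1500.0 in both
```

Both strategies equalise the two samples; only the relative correction keeps the result on the scale of the original peak areas.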
Further, different levels of replicates, such as measurement (technical) and experimental (biological) replicates, can be specified in the sample descriptions. In the latter case, the measurement replicates for each experimental replicate are averaged and the averages are used to present the data. Once the edited sample table is uploaded, it is saved together with the cleaned input data in the folder “Inputs” contained in the result folder. With these two files the processing can be repeated at any time in the future.
Data processing
The philosophy of the tool is to represent the uploaded data as closely to the original as possible. Missing values can occur in metabolomics experiments when metabolites are absent in some samples or present at a concentration below the detection limit. Consequently, missing values occur more frequently for low-abundance compounds. Nevertheless, missing values can also appear independently of the concentration due to problems with the peak integration, such as m/z values not matching within the defined mass error because of coeluting compounds, or retention time shifts outside the allowed window. Missing values are explicitly accepted by the tool and are treated as “missing”; we do not impute missing values or fill them with zeros. Inputs such as empty cells, “NA”, “NaN”, “N/F” and “N/A” are treated as missing. For the “Compounds in Columns” input, zeros can additionally be treated as “missing” or kept as explicit zero values. For “Compound Discoverer” or “Matrix” inputs, a threshold can be defined that needs to be reached; values below that threshold are treated as missing. In line with this, we do not normalise, scale or transform the data. Although this is required for some further analyses (principal component analysis (PCA), independent component analysis (ICA), partial least squares discriminant analysis (PLS-DA) or clustering), it is not needed to show the data. The presence of missing values helps to identify poor measurements or noisy metabolites, while missing value imputation may create false confidence.
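The missing-value rules above can be sketched as a small parsing function. The token list and the threshold handling mirror the text; the function name and structure are illustrative, not taken from AutoPlotter’s code.

```python
# Minimal sketch of the missing-value rules: certain text tokens are
# always missing, zeros are optionally missing, and values below an
# optional threshold are treated as missing.
MISSING_TOKENS = {"", "NA", "NaN", "N/F", "N/A"}

def parse_value(raw, zero_is_missing=False, threshold=None):
    """Return a float intensity, or None for a 'missing' value."""
    text = str(raw).strip()
    if text in MISSING_TOKENS:
        return None
    value = float(text)
    if zero_is_missing and value == 0.0:
        return None  # optional rule for the "Compounds in Columns" input
    if threshold is not None and value < threshold:
        return None  # below-threshold values are treated as missing
    return value

print(parse_value("N/F"))                      # None
print(parse_value("0", zero_is_missing=True))  # None
print(parse_value("4200", threshold=5000))     # None
print(parse_value("12345.6"))                  # 12345.6
```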
At the data processing step, metabolites can be normalised with internal standards. This normalisation can be performed before or after the external normalisation. The first case is suited for standards added during the extraction step, to compensate for variations during sample preparation and measurement. The latter case is useful when the intensities should be normalised to a detected metabolite. Similar to the external normalisation, values can be normalised absolutely or relatively. Multiple metabolites can be selected and will be summed up; consequently, the normalisation is more strongly affected by highly abundant metabolites. When there is no single metabolite serving as a representative for the other compounds, samples can additionally be normalised with the sum of all peak areas (“Total Peak Sum”).
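A hypothetical sketch of this step, assuming the description above: the selected standards are summed per sample and all intensities are divided by that sum; with no standards given, the sum over all metabolites serves as the “Total Peak Sum”. The function names and example metabolites are invented for illustration.

```python
# Sketch of internal-standard normalisation; highly abundant standards
# dominate the summed factor, as noted in the text.

def standard_factor(sample, standards):
    """Sum the peak areas of the selected standard metabolites,
    skipping missing (None) values."""
    return sum(sample[m] for m in standards if sample.get(m) is not None)

def normalise(samples, standards=None):
    """Divide each metabolite by the per-sample standard sum; when no
    standards are given, fall back to the 'Total Peak Sum'."""
    out = {}
    for name, mets in samples.items():
        keys = standards if standards else list(mets)
        factor = standard_factor(mets, keys)
        out[name] = {m: (None if v is None else v / factor)
                     for m, v in mets.items()}
    return out

samples = {"s1": {"citrate": 50.0, "IS_d4-succinate": 10.0},
           "s2": {"citrate": 80.0, "IS_d4-succinate": 20.0}}
print(normalise(samples, standards=["IS_d4-succinate"]))
# citrate becomes 5.0 in s1 and 4.0 in s2
```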
To further evaluate the quality of the data, we also include a quality control feature that assesses the performance of the replicate measurements. For this, the relative fold-changes for each metabolite between the replicates of the same condition are calculated and shown for all metabolites together. The means over all metabolites should be centred around 1 when the replicates perform comparably. This strategy evaluates the quality of the samples based on all detected metabolites, is independent of outliers in single metabolites and of missing values, and works well for a small number of replicates. Samples that differ dramatically at this stage can be identified and removed conveniently before performing further analysis.
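One way to read this QC step is sketched below, assuming the fold-change of each replicate is taken against the condition mean for each metabolite; this interpretation and all names are ours, not AutoPlotter’s exact implementation.

```python
from statistics import mean

# Illustrative replicate QC: each replicate's intensity is divided by
# the condition mean for that metabolite, so a well-behaved replicate
# shows fold-changes centred around 1 across all metabolites.

def replicate_fold_changes(replicates):
    """replicates: {rep_name: {metabolite: intensity}} for one condition.
    Returns {rep_name: [fold-changes over all shared metabolites]}."""
    mets = set.intersection(*(set(r) for r in replicates.values()))
    cond_mean = {m: mean(r[m] for r in replicates.values()) for m in mets}
    return {rep: [vals[m] / cond_mean[m] for m in sorted(mets)]
            for rep, vals in replicates.items()}

reps = {"r1": {"ala": 100.0, "glu": 210.0},
        "r2": {"ala": 100.0, "glu": 190.0}}
fc = replicate_fold_changes(reps)
for rep, values in fc.items():
    print(rep, mean(values))  # both means close to 1
```

A replicate whose mean fold-change deviates strongly from 1 would stand out here and could be removed before further analysis.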
After the data are processed, structured tables are exported together with the plots. These tables can be used for further processing, statistical analysis or to import the data into other visualisation tools, e.g. GraphPad Prism. A detailed explanation of the exports is included in Supplement File 1, and is exported together with the results.
Application to quantitative experiments
The most important and unique feature of the tool is the automatic generation of a single graph for every compound in the dataset. As the design highly depends on user preferences, we included an interface that allows an interactive customisation of the plot design (Fig. 1-bottom).
Four different plot-types are currently included in the application: bar-plots, violin-plots, box-plots and univariate scatterplots (Fig. 2, A-C+E). The bar-plots show the mean and the standard deviation; the same applies to the scatterplots, which additionally show the individual measurements. The violin-plots show the density distribution and the median; the box-plots show the median, the 25th and 75th percentiles and the lowest/highest value within 1.5 times the inter-quartile range from the box. Violin-plots and box-plots only make sense when enough data points are present, so they are only available when at least 5 and 10 replicate measurements per condition, respectively, are present in the dataset. When technical and experimental replicates are present, the user can additionally generate two different plots that allow for the comparison of the performance within the replicates (Fig. 2D+F).
Plot personalisation
All plots can be further modified beyond the default settings (Fig. 2G-K). Already with the sample descriptions, the order of the conditions and their colours can be defined (Fig. 2-right vs. Fig. 2-left). Bar-, box- and violin-plots can be generated with outlined instead of filled colours to reduce their weight and the amount of ink used (Fig. 2H+I). Additionally, the individual data points can be overlaid on each of the plots (Fig. 2H). Furthermore, the dimensions of the plots (width and height), their resolution, as well as the output format (jpg, pdf, png, svg) can be defined. The text elements can also be altered; this includes their sizes, the direction of the text along the X-axis, and the axis titles for the X- and Y-axis. Additional text can be added on the left or right side of the compound names to add experimental descriptions. Simple statistics can be added as well; these can be either multiple pairwise comparisons or comparisons against one reference group, as t-test (assuming equal variances) or Wilcoxon test. The results of the statistics can be shown as symbols or as numeric values (Fig. 2I+J).
Once the design is defined, the user can click on “Generate all plots!” and a plot for every metabolite is saved as an image file with the metabolite name as filename. Characters not allowed in filenames (e.g. :, ;, <, >) are converted automatically during this step. All results (the plots and the tables) are packed into a zip-file that can be downloaded by the user.
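The filename clean-up mentioned above can be sketched in a few lines; the exact replacement character and forbidden set are assumptions for illustration, not AutoPlotter’s definitive behaviour.

```python
import re

# Hypothetical sketch of the filename clean-up: characters that are not
# allowed in filenames are replaced before each plot is saved.
FORBIDDEN = r'[:;<>/\\|?*"]'

def safe_filename(metabolite, extension="png"):
    """Replace forbidden characters with underscores and append the
    chosen image extension."""
    cleaned = re.sub(FORBIDDEN, "_", metabolite).strip()
    return f"{cleaned}.{extension}"

print(safe_filename("Glutathione; oxidized"))  # Glutathione_ oxidized.png
print(safe_filename("NAD+", "svg"))            # NAD+.svg
```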
Bar-plots belong to the standard repertoire for data visualisation, as they are easily understood and can be used effectively to screen for differences due to the visual weight of the different colours. However, we want to point out that they are probably not the best representation in all cases. First, the data usually do not show continuous counts from zero to the end value, but multiple values at a similar distance from zero. Second, they highly compress the data, hiding some underlying features of the data sets [10, 11]. Metabolite AutoPlotter includes different features to overcome these limitations, e.g. by overlaying the individual data points or by using violin-plots to better represent the distributions [12].
Application to tracing experiments
Stable isotope tracing experiments can be processed and visualised as well. Figure 3 shows the results of a 13C-tracing experiment performed exclusively to demonstrate this feature. For this, HCT116 cells were cultivated in DMEM supplemented with either U-13C-glucose or U-13C-glutamine for 1 to 24 hours, and intracellular metabolites were extracted and measured by LC-MS. The isotopologues for tracing experiments should be supplied as “metabolite +1”, “metabolite +2”, and so on. Metabolites up to M+30 can be processed with the application, even though so many masses cannot be shown efficiently. During data processing, the isotopic information is extracted from the metabolite names and all isotopologues are grouped and shown as stacked bars for each condition, indicated by M+0 for the unlabelled species and M+1, M+2, etc. for isotopologues with a mass shift of one or two daltons, respectively. These plots can be shown either in absolute intensities (peak areas, Fig. 3A) or in relative intensities (summed up to 100%, Fig. 3B). Both representations have advantages regarding data interpretation, and they can be generated in a single run by repeating the plotting step. Additionally, two different colour scales can be used. The short scale ranging to M+7 is sufficient for showing the central carbon metabolism (Fig. 3A). The longer gradient is based on a stepped rainbow with 5 colour blocks in 3 shades and can show up to M+15, allowing an instant overview of the number of isotopologues (Fig. 3B). As an example, the high number of isotopologues in NAD+ cannot be resolved with the short colour scale, while the long scale reveals that it contains mainly M+5 and M+10, from the incorporation of ribose-phosphate. Figure 3 further shows what can be expected from the biology: some metabolites are labelled exclusively from glucose or glutamine, whereas both tracers contribute to the labelling of TCA-cycle intermediates. As the time course progresses, the labelling in most metabolites increases as well.
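The extraction of the isotopic information from names of the form “metabolite +n” and the conversion to a relative (100%) distribution can be sketched as follows; the regular expression and function names are illustrative assumptions.

```python
import re
from collections import defaultdict

# Sketch: parse isotopologue labels like "citrate +2", group them per
# metabolite and scale each group to 100 % for the relative view.

def parse_isotopologue(name):
    """Split 'metabolite +n' into (metabolite, n); plain names are M+0."""
    match = re.match(r"^(.*?)\s*\+(\d+)$", name)
    if match:
        return match.group(1), int(match.group(2))
    return name, 0

def relative_distribution(intensities):
    """Group isotopologues per metabolite and scale each group to 100 %."""
    groups = defaultdict(dict)
    for name, value in intensities.items():
        base, shift = parse_isotopologue(name)
        groups[base][shift] = value
    return {base: {shift: 100.0 * v / sum(isos.values())
                   for shift, v in isos.items()}
            for base, isos in groups.items()}

data = {"citrate": 60.0, "citrate +2": 40.0}
print(relative_distribution(data))  # {'citrate': {0: 60.0, 2: 40.0}}
```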
Natural isotopic abundance correction
The presence of multiple stable isotopes (such as 13C, 15N, 34S, 18O or 2H) in metabolites can make the interpretation of stable isotope tracing experiments cumbersome. The heavy isotopes introduced by the tracing experiment need to be separated from isotopes that are present naturally. The carbon-13 isotope occurs with a frequency of approximately 1.1%. For a 3-carbon molecule like pyruvate, the naturally occurring M+1 isotopologue has a frequency of about 3% relative to M+0, so the natural abundance of 13C makes little difference. However, for a 10-carbon molecule such as adenosine triphosphate (ATP), the M+1 isotopologue will be found with an intensity of about 10% relative to M+0 and should not be neglected. It is therefore very important that researchers correct for the natural abundance of stable isotopes when performing stable isotope tracing experiments, particularly for large molecules. Often, researchers perform experiments with no tracer to capture the naturally occurring isotopologues. This is, however, an unnecessary burden, given that the natural abundance of stable isotopes can be corrected computationally. We would argue that the computational correction is actually more precise, given that the natural abundance of stable isotopes is very consistent across samples.
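The numbers above follow from a binomial distribution: with n carbons each carrying 13C with probability p ≈ 1.1%, the M+1/M+0 ratio is n·p/(1−p). A quick check:

```python
from math import comb

# Worked check of the text's estimates: the natural M+1 fraction of a
# molecule with n carbons follows a binomial distribution with p = 1.1 %.

def natural_m1_vs_m0(n_carbons, p13c=0.011):
    """Return the natural M+1 intensity relative to M+0 (in percent)."""
    m0 = (1 - p13c) ** n_carbons
    m1 = comb(n_carbons, 1) * p13c * (1 - p13c) ** (n_carbons - 1)
    return 100.0 * m1 / m0

print(round(natural_m1_vs_m0(3), 1))   # pyruvate (3 C): ~3.3 %
print(round(natural_m1_vs_m0(10), 1))  # ATP (10 C): ~11.1 %
```

This confirms the roughly 3% and 10% figures quoted above and shows why the correction matters more for larger molecules.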
AutoPlotter provides an option for natural abundance correction by integrating the AccuCor package [13]. The user needs to supply the sum-formulae for the unlabelled metabolites, the type of tracer used (13C, 15N or 2H) and its isotopic purity. If the sum-formulae are not already included in the inputs, they can be uploaded later. Figure 4 shows a comparison before and after the correction for ATP (C10H16N5O13P3) and glutathione (GSH) (C10H17N3O6S). In both metabolites there is a high proportion of M+1 visible (dark blue), and also M+4 (yellow) in ATP. This complicates the interpretation, as the reader needs to subtract these contributions in their head, which is impossible without an unlabelled reference sample. After the correction (Fig. 4-right panels) both contributions disappear, revealing that ATP remains unlabelled after 13C-glutamine tracing and GSH during the first 6 hours of 13C-glucose labelling.
Data safety
The application is hosted at shinyapps.io to benefit from the high quality standards of the RStudio team. Users should be aware that the environment is not HIPAA-compliant, so confidential data (e.g. non-anonymised patient information) should be removed from the data prior to submission. Beyond that, we make sure that the uploaded data are safe. Intermediary data are stored transiently in a temporary folder on the server and are deleted after the session closes; this can also be triggered by the user pressing a button. We do not have access to the temporary folders; neither do we have the interest or time to check other researchers’ results. Once the source code is published, users can run the tool locally or set up their own servers.
Limitations and performance
Whilst we worked hard to design the tool to be as open and flexible as possible, we are aware that it cannot be used to address all potential questions. To avoid excessive memory usage, the maximum size of the uploaded data is limited to 25 megabytes, which should be sufficient for most experiments. Only 3 essential levels of information are needed: the compounds, the samples and their measured intensities. When a compound name is not unique, the retention time (RT) should additionally be supplied so that name and RT together form a unique identifier, and for natural abundance correction the sum-formulae are needed. The complete requirements for the inputs are reported in Table 1.
The generated graphs only allow a discrete X-axis, so it is not possible to plot scaled numerical axes such as times or concentrations. Also, the names of the conditions shown in the plots need to be different, so grouped plots are not supported. This could be realised with more complex or multiple input files for the sample descriptions, but would also increase the risk of user errors.
Furthermore, the natural abundance correction as currently implemented is limited to the elements C, H, N, O and S, so it cannot be used to correct GC-MS experiments containing Si, Cl or Br.
Beyond that, there are no limitations regarding the number of compounds, replicates or conditions, as long as memory allows. We have processed untargeted experiments with over 1000 compounds flawlessly. The performance depends on the number of conditions, the number of replicates and the type of plots being generated. Nevertheless, in most situations creating the sample-table and adjusting the design is the most time-consuming step and typically takes longer than generating the plots. Plotting 250 compounds (5 conditions and 3 replicates) takes roughly 1 minute, which nearly doubles when multi-plot overview pages are generated. This is a dramatic performance boost compared to manual plotting, where one could probably generate at most 3 or 4 plots per minute.