Community development of GECKO
To ensure wide application and enable future development by the research community, we established the GECKO toolbox as open-source software, mostly encoded in MATLAB. It integrates modules for enhancement of GEMs with kinetic and proteomics constraints, automated retrieval of kinetic parameters from the BRENDA database (python module), as well as simulation utilities and export of ecModel files compatible with both the COBRA toolbox44 and the COBRApy package45. The development of GECKO has been continuously tracked in a public repository (https://github.com/SysBioChalmers/GECKO) since 2017, providing a platform for open and collaborative development. The generation of output model files in .txt and SBML L3V1 FBC246 formats enabled the utilization of ecYeastGEM1 structure as a standard test to track the effects of any modifications in the toolbox algorithm through the use of the Git version control system, contributing to reproducibility of results and backwards compatibility of code.
Interaction with users of the GECKO toolbox and the ecYeastGEM model has also been facilitated through the use of the GECKO repository, allowing users to raise issues related with the programming of the toolbox or even about conceptual assumptions of the method, which has guided cumulative enhancements. Additionally, technical support for installation and utilization of the toolbox and ecYeastGEM is now provided through an open community chat room (available at: https://gitter.im/SysBioChalmers/GECKO), reinforcing transparent and continuous communication between users and developers.
New additions to the GECKO toolbox
The first implementation of the GECKO method significantly improved phenotype predictions for S. cerevisiae’s metabolism under a wide variety of genetic and environmental perturbations38. However, its development underscored some issues, in particular that quantitative prediction of the critical dilution rate and exchange fluxes at fermentative conditions are highly sensitive to the distribution of incorporated kinetic parameters. Although S. cerevisiae is one of the most studied eukaryal organisms, not all reactions included in its model have been kinetically characterized. Therefore, a large number of kcat numbers measured for other organisms (48.35%), or even non-specific to their reaction mechanism (56.03% of kcat values found by introduction of wild cards into E.C. numbers) were needed to be incorporated, in order to fill the gaps in the available data for the reconstruction of the first S. cerevisiae ecModel, ecYeast7. Moreover, detailed manual curation of kcat numbers was needed for several key enzymes in order to achieve biologically meaningful predictions.
As the BRENDA database47 is the main source of kinetic parameters for GECKO, all of the available kcat and specific activity entries for non-mutant enzymes were retrieved. In total, 38,280 entries for 4,130 unique E.C. numbers were obtained and classified according to biochemical mechanisms, phylogeny of host organisms and metabolic context (Supp. file 1), in order to assess significant differences in distributions of kinetic parameters. This analysis showed that not all organisms have been equally studied. Whilst entries for H. sapiens, E. coli, R. norvegicus and S. cerevisiae account for 24.02% of the total, very few kinetic parameters are available for most of the thousands of organisms present in the database, showing a median of 2 entries per organism (Fig. 1A, Supp. file 1). The analysis also showed that kinetic activity can differ drastically, spanning several orders of magnitude even for families of enzymes with closely related biochemical mechanisms (Fig. 1B). Finally, it was also observed that kcat distributions for enzymes in the central carbon and energy metabolism differ significantly from those in other metabolic contexts across phylogenetic groups of host organisms (life kingdoms, according to the KEGG phylogenetic tree48), even without filtering the dataset for entries reported exclusively for natural substrates, as previously done by other studies49 (Fig. 1C).
As kcat numbers depend on biochemical mechanisms, metabolic context and phylogeny of host organisms, a modified set of hierarchical kcat matching criteria was implemented as part of GECKO 2.0. The modified parameterization procedure enables the incorporation of kinetic parameters that have been reported as specific activities in BRENDA when no kcat is found for a given query (as the specific activity of an enzyme is defined as its kcat over its molecular weight), adding 8,118 new entries to the catalogue of kinetic parameters in the toolbox. A phylogenetic distance-based criterion, based on the phylogenetic tree available in the KEGG database48, was introduced for cases in which no organism-specific entries are available for a given query in the kinetic parameters dataset. A comparison of the new kcat matching criteria with their predecessor set is shown in Supp. file 2.
In order to assess the impact of the modified kcat assignment algorithm on an ecModel, ecYeast7 was reconstructed using both the first and the new version of the GECKO toolbox (GECKO 2.0). A classification of the matched kcat numbers according to the different levels of the new matching algorithm is provided in Fig. 1D. The incorporation of specific activity values in the parameter catalogue increased the number of kinetic parameters matched to complete E.C. numbers (no added wild cards) from 1432 to 2696 (Fig. 1E). Moreover, the implementation of the phylogenetic distance-based criterion yielded a distribution of kinetic parameters that showed no significant differences when compared to the values reported in BRENDA for all fungi species, in contrast to the kinetic profile matched by the previous algorithm (p-values <10-10 and <10-7, when compared to the BRENDA fungi and S. cerevisiae distributions, respectively, under a Kolmogorov-Smirnov test) (Fig. 1F). The quality of phenotype predictions for the ecYeast7 model enhanced by GECKO2.0 was evaluated by simulation of batch growth in 19 different environments, with an average relative error of 23.97% when compared to experimental data (Fig. 1G), in contrast, its GECKO1.0 counterpart yielded an average relative error of 32.07%.
The introduction of manually curated kcat numbers in a metabolic network has been proven to increase the quality of phenotype predictions for S. cerevisiae22,25,38; nevertheless, this is an intensive and time consuming procedure that is hard to ensure for a large number of models subject to continuous modifications. In order to ensure applicability of the GECKO method to any standard GEM, a unified procedure for curation of kinetic parameters was developed based on parameter sensitivity analysis. For automatically generated ecModels that are not able to reach the provided experimental value for maximum batch growth rate, an automatic module performs a series of steps in which the top enzymatic limitation on growth rate is identified through the quantification of enzyme control coefficients. For such enzymes, the E.C. number is obtained and then its correspondent kcat value is substituted by the highest one available in BRENDA for the given enzyme class. This procedure iterates until the specific growth rate predicted by the model reaches the provided experimental value.
Finally, as the first version of the toolbox relied on the structure and nomenclature of the model Yeast7, its applicability to other reconstructions was not possible in a straightforward way. In order to provide compatibility with any other GEM, based on COBRA44 or RAVEN50 formats, all of the organism-specific parameters required by the method (experimental growth rate, total protein content, organism name, names and identifiers for some key reactions, etc.) can be provided in a single MATLAB initialization script, minimizing the modifications needed for the generation of a new ecModel.
ecModels container: an automatically updated repository
Several GEMs that have been published are still subject to continuous development and maintenance1–3,5,6, this renders GEMs to be dynamic structures that can change rapidly. In order to integrate such continuous updates into the enzyme constrained version of a model in an organized way, an automated pipeline named ecModels container was developed.
The ecModels container is a continuous integration implementation whose main functionality is to provide a catalogue of ecModels for several relevant organisms that are automatically updated every time a modification is detected either in the original GEM source repository or in the GECKO toolbox, i.e. new releases in their respective repositories. The pipeline generates ecModels in different formats, including the standard SBML and MATLAB files, and stores them in a container repository (https://github.com/SysBioChalmers/ecModels) in a version controlled way, requiring minimal human interaction and maintenance. The GECKO toolbox ensures the creation of functional and calibrated ecModels that are compatible with the provided experimental data (maximum batch growth rate, total protein content of cells and exchange fluxes at different dilution rates as an optional input). This whole computational pipeline is illustrated in Fig. 2. Further description of the ecModels container pipeline functioning is included in in the Materials and Methods section.
A catalogue of new ecModels
Following the aforementioned additions to the GECKO toolbox, that have allowed its generalization, we used the toolbox for the reconstruction of four new ecModels from previously existing high-quality metabolic network reconstructions: iYali4, for the oleaginous yeast Yarrowia lipolytica5; iSM996, for the thermotolerant yeast Kluyveromyces marxianus6; iML1515, for the widely studied bacterium E. coli4; and Human1, being the latest and largest network reconstruction available for studying H. sapiens metabolism2. For the microbial models, all model parameters were calibrated according to the provided experimental data, generated by independent studies4,51–53, yielding functional ecModels ready for simulations. Size metrics for these models can be seen in Table 1.
These ecModels, together with ecYeastGEM, are hosted in the ecModels container repository for their continuous and automated update every time that a version change is detected either in the original model source or in the GECKO repository. In the case of microbial species, two different model structures are provided: ecModel, which has unbounded individual enzyme usage reactions ready for incorporation of proteomics data; and ecModel_batch in which all enzyme usage reactions are connected to a shared protein pool. This pool is then constrained by experimental values of total protein content, and calibrated for batch simulations using experimental measurements of maximum batch growth rates on minimal glucose media, thus providing a functional ecModel structure ready for simulations.
For ecHumanGEM just the unbounded ecModel files are provided, as this is a general network of human metabolism, containing all reactions from any kind of human tissue or cell type for which evidence is available, and therefore not suitable for numerical simulation. As H. sapiens is the most represented organism in the BRENDA database, accounting for 11% of the total number of available kcat values (Supp. file 1), kinetic parameters from other organisms were not taken into account for its enhancement with enzyme constraints. ecHuman1 provides the research community with an extensive knowledge base that represents a complete and direct link between genes, proteins, kinetic parameters, reactions and metabolites for human cells in a single model structure, subject to automated continuous update by the ecModels container pipeline.
Table 1.- Size metrics summary for the ecModels catalogue.
Original GEMs
|
Organism
|
S. cerevisiae
|
Y. lipolytica
|
K. marxianus
|
E. coli
|
H. sapiens
|
Model ID
|
yeastGEM_8.3.3
|
iYali4
|
iSM996
|
iML1515
|
Human1
|
Reactions
|
3963
|
1924
|
1913
|
2711
|
13101
|
Metabolites
|
2691
|
1671
|
1531
|
1877
|
8400
|
Genes
|
1139
|
847
|
996
|
1516
|
3628
|
Enzyme constrained GEMs
|
Model ID
|
ecYeastGEM
|
eciYali
|
eciSM996
|
eciML1515
|
ecHumanGEM
|
Reactions
|
8028
|
3881
|
5334
|
6084
|
46259
|
Metabolites
|
4153
|
1880
|
2064
|
2334
|
12191
|
Enzymes
|
965
|
647
|
716
|
1259
|
3224
|
Enzyme coverage
|
84.72%
|
76.39%
|
71.89%
|
83.05%
|
88.86%
|
Reactions w/ kcat
|
3771
|
1586
|
2891
|
2562
|
27014
|
Reactions w/ Isoenzymes
|
504
|
205
|
532
|
456
|
3791
|
Promiscuous Enzymes
|
572
|
324
|
469
|
673
|
2184
|
Enzyme complexes
|
252
|
75
|
27
|
383
|
756
|
Visualization of GECKO simulations in the Caffeine platform
We implemented simulations with ecModels in Caffeine, an open-source software platform for cell factory design. Caffeine, publicly available at http://caffeine.dd-decaf.eu, allows user-friendly simulation and visualization of flux predictions made by genome-scale metabolic models. Several standard modelling methods are already included in the platform, such as 13C fluxomics data integration, and simulation of gene deletion and/or overexpression, to interactively explore strain engineering strategies. In order to allow for GECKO simulations, we added a new feature to the platform for uploading enzyme-constrained models and absolute proteomics data. Additionally, we added a simulation algorithm that recognizes said models, and overlays the selected proteomics data on them, leaving out data that makes the model unable to grow at a pre-specified growth rate. After these inclusions to the platform, enzyme usage can now be computed on the fly and visualized on metabolic maps (Fig. 2B), to identify potential metabolic bottlenecks in a given condition. The original proteomics data can be visualized as well, to identify if the specific bottleneck is due to a lack of enzyme availability, or instead due to an inefficient kinetic property. This will suggest different metabolic engineering strategies to the user: if the problem lies in the intracellular enzyme levels, the user can interpret this as a recommendation for overexpressing the corresponding gene, whereas if the problem lies in the enzyme efficiency, the user could assess introducing a heterologous enzyme as an alternative.
GECKO simulation utilities
As ecModels are defined in an irreversible format and incorporate additional elements such as enzymes (as new pseudo-metabolites) and their usages (represented as pseudo-reactions), they might sometimes not be directly compatible with all of the functionalities offered by currently available constraint-based simulation software44,45,50,54,55. We therefore added several new features to the GECKO toolbox that allow the exploration and exploitation of ecModels. These include utilities for: 1) basic simulation and analysis purposes, 2) accessible retrieval of kinetic parameters, 3) automated generation of condition-dependent ecModels with proteomic abundance constraints, 4) comparative flux variability analysis between a GEM and its ecModel counterpart, and 5) prediction of metabolic engineering targets for enhanced production with an implementation of the FSEOF method56 for ecModels. Detailed information about the inputs and outputs for each utility can be found on their respective documentation, available at: https://github.com/SysBioChalmers/GECKO/tree/master/geckomat/utilities. All of these utilities were developed in MATLAB due to their dependency on some RAVEN toolbox functions50.
Predicting microbial proteome allocation in multiple environments
In order to test the quality of the phenotype predictions of an ecModel automatically generated by the ecModels container pipeline, batch growth under 11 different carbon sources was simulated with eciML1515 for E. coli. Figure 3A shows that, for all carbon sources, growth rates were predicted at the same order of magnitude as their corresponding experimental measurements, with the most accurate predictions obtained for growth on D-glucose, mannose and D-glucosamine. Furthermore, batch growth rate and protein allocation predictions, using no exchange flux constraints, were compared between eciML1515 and the iJL1678 ME-model32, the latter accounting for both metabolism and macromolecular expression processes. The sum squared error (SSE) for batch growth rate predictions across the 11 carbon sources using eciML1515 was 0.27, a drastic improvement when compared to the 1.21 SSE of iJL1678 ME-model predictions32. Figure 3B shows the predicted total proteome needed by cells to sustain the provided experimental growth rates for the same 11 environments. Notably eciML1515 predicts values that lie within the range of predictions of the iJL1678 ME-model (from the optimal to the generalist case) for 10 out of the 11 carbon sources (see Materiales and Methods for simulation details). This shows that the new version of the GECKO toolbox ensures the generation of functional ecModels that can be readily used for simulation of metabolism, due to its systematic parameter flexibilization step which reduces the need of extensive manual curation for new ecModels. Furthermore, iML1515 is a model available as a static file at the BiGG models repository57; therefore, its integration to the ecModels container for continuous update demonstrates the flexibility of our pipeline, regarding compatibility with original GEM sources, which can be provided as a link to their git-based repositories or even as static URLs.
Proteomics constraints refine phenotype predictions for multiple organisms and conditions
The previously mentioned module for integration of proteomics data generates a condition-dependent ecModel with proteomics constraints for each condition/replicate in a provided dataset of absolute protein abundances [mmol/gDw]. Even though absolute quantification of proteins is becoming more accessible and integrated into systems biology studies58–62, a major caveat of using proteomics data as constraints for quantitative models is their intrinsic high biological and technical variability63, therefore some of the incorporated data constraints need to be loosened in order to obtain functional ecModels. When needed, additional condition-dependent exchange fluxes of byproducts can also be used as constraints in order to limit the feasible solution space. A detailed description of the proteomics integration algorithm implemented in GECKO is given in Supp. file 2.
The new proteomics integration module was tested on the three ecModels for budding yeasts available in ecModels container (ecYeastGEM, eciYali, eciSM996). We measured absolute protein abundances for S. cerevisiae, Y. lipolytica and K. marxianus, grown in chemostats at 0.1 h-1 dilution rate and subject to several experimental conditions (high temperature, low pH and osmotic stress with KCl)64, and incorporated these data into the ecModels as upper bounds for individual enzyme usage pseudo-reactions. Then, exchange fluxes for CO2 and oxygen corresponding to the same chemostat experiments were used as a comparison basis to evaluate quality of phenotype predictions. For each organism-condition pair, 3 models were generated and compared in terms of predictions: a pure stoichiometric metabolic model, an enzyme-constrained model with a limited shared protein pool, and an enzyme-constrained model with proteomics constraints. It was found that the addition of the enzyme pool constraint enables major reduction of the relative error in prediction of gaseous exchange fluxes in some of the studied conditions. Additionally, the incorporation of individual protein abundance constraints improves even further the predictive accuracy of gaseous exchanges, for 10 out of the 11 evaluated cases (Fig. 4A-C).
The impact of incorporating enzyme and proteomics constraints on intracellular flux predictions was further assessed by mapping all condition-dependent flux distributions from the tested ecModels to their corresponding reactions in the original GEMs. In general, metabolic flux distributions showed high similarity when comparing ecModel to GEM predictions (Fig. S1), as 70-90% of the active fluxes were predicted within the interval of 0.5 < fold-change < 2 across all conditions (Fig. S2 A-C, Supp. file 3). In addition, principal component analysis on absolute enzyme usage profiles predicted by ecModels revealed that, at low dilution rates, predictions of enzyme demands are mostly defined by the selected set of imposed constraints (shared protein pool vs. proteomics constraints) rather than by environmental condition, i.e. exchange fluxes (Fig. S2 D-F). However, more straightfroward comparison of the models’ predictions, by pairwise comparison of predicted absolute enzyme usage profiles, showed that 60 – 80% of the predicted enzyme usages lie within a range of 0.5 < fold-change < 2, when comparing ecModels predictions with and without proteomics constraints, across organisms and conditions (Fig. 4D, Fig. S2 G-I and Supp. file 3). It was observed that the incorporation of proteomics constraints induces a drastic differential use for a considerable amount of enzymes, as 12-21% of enzyme usages were predicted as either enabled or disabled by these constraints across all the simulated conditions, showing slight enrichment for enabled alternative isoenzymes for already active reactions (Supp. file 3). This suggests that upper bounds on enzyme usages induce differentiated utilization of isoenzymes, reflecting well why isoenzymes have been maintained throughout evolution.
The explicit inclusion of enzymes into GEMs by the GECKO method enables prediction of enzyme demands at the protein, reaction and pathway levels. Total protein burden values predicted by ecModels for several relevant metabolic superpathways (central carbon and energy metabolism, amino acid metabolism, lipid and fatty acid metabolism, cofactor and vitamin metabolism and nucleotide metabolism, according to the KEGG metabolic subsystems48), showed that central carbon and energy metabolism is the most affected sector in the ecYeastGEM network by integration of proteomics constraints, as protein burden predictions were higher, at least by 20%, for 3 out the 4 simulated conditions when compared with predictions of the ecYeastGEM without proteomics data (Fig. 4E).
Relative enzyme usages, estimated as predicted absolute enzyme usage over enzyme abundance for all of the measured enzymes in an ecModel , can be understood as the saturation level of enzymes in a given condition. In order to analyze the metabolic mechanisms underlying long-term adaptation to stress in budding yeasts, relative enzyme usage profiles were computed from all the previous simulations of ecModels with proteomics constraints. Enzymes that display fold-changes higher than 1 for both absolute abundance and their saturation level, when comparing predicted usage profiles between stress and reference conditions, suggest regulatory mechanisms on individual proteins that contribute to cell growth on the anlyzed stress condition. Figure 4F shows all of the enzymes that were identified as responsive to environmental stress in this study, displaying enrichment for enzymes involved in biosynthesis of diverse amino acids and folate metabolism.
A further mapping of all enzymes in these ecModels to a list of 2,959 single copy protein-coding gene orthologs across the three yeast species64 found 310 core proteins across these ecModels. Principal component analysis revealed that variance on absolute enzyme usages and abundance profiles for these core proteins is mostly explained by differences in the metabolic networks of the different species rather than by environmental conditions (Fig. S3 B-C), reinforcing previous results suggesting that, despite being phylogenetically related, their long-term stress responses at the molecular level have evolved independently after their divergence in evolutionary history64.
Exploring the solution space reduction
A major limitation in the use of GEMs is the high variability of flux distributions for a given cellular objective when implementing flux balance analysis, as this requires solving largely underdetermined linear systems through optimization algorithms15,65. This limitation has usually been overcome with incorporation of measured exchange fluxes as constraints. However, these data are typically sparse in the literature. Previous studies explored the drastic reduction in flux variability ranges of ecModels for S. cerevisiae and 11 human cell-lines when compared to their original GEMs due to the addition of enzyme constraints1,2,38. However, the irreversible format of ecModels (forward and backwards reactions are split in order to account for enzyme demands of both directions) hinders their compatibility with the flux variability analysis (FVA) functions already available in COBRA44 and RAVEN50 toolboxes. As a solution to this, an FVA module was integrated to the utilities repertoire in GECKO, whose applicability has been previously tested on studies with ecModels for S. cerevisiae1 and human cell lines2. This module contains the necessary functions to perform FVA on any set of reactions of an ecModel, enabling also a direct comparison of flux variability ranges between an ecModel and its GEM counterpart in a consistent way (see Supp. file 2).
The FVA utility was applied on three different ecModels of microbial metabolism and their correspondent GEMs (iML1515, iYali4 and iSM996). In all cases the FVA comparisons were carried out for both chemostat and batch growth conditions in order to span different degrees of constraining of the metabolic networks (0.1 h-1 dilution rate and minimal glucose uptake rate fixed for chemostat conditions; biomass production fixed to experimental measurements of and unconstrained uptake of minimal media components, for batch conditions). Cumulative distributions for flux variability ranges for all explored ecModels and GEMs are shown in Figure 5, in which it can be seen that median flux variability ranges are much reduced for all ecModels and conditions, especially at high growth rates where enzyme constraints reduce the variability range 5-6 orders of magnitude when compared to pure GEMs. The cumulative distributions also show a major reduction in the amount of totally variable fluxes (reactions that can carry any flux between -1000 to 1000 mmol/gDw h), which are an indicator of undesirable futile cycles present in the network due to lack of thermodynamic and enzyme cost information66–68. For high growth rates, the amount of totally variable fluxes accounts for 3-12% of the active reactions in the analysed GEMs, in contrast to their corresponding ecModels in which such extreme variability ranges are completely absent.
Further analysis of the FVA results revealed that a reduction of at least 95% of the variability range was achieved for more than 90% of all active fluxes at high growth rates in all ecModel. Interestingly, the aforementioned flux variability metrics were overall improved even for the chemostat conditions, despite a higher degree of constraining (fixed low growth rate and optimal uptake rate), which restrains these models to an energy efficient respiratory mode (Supp. file 4).