Machine Learning-assisted Identification of Bioindicators Predicts Medium-chain Carboxylate Production Performance of an Anaerobic Mixed Culture
Background: The ability to quantitatively predict ecophysiological functions of microbial communities is an important step towards engineering microbiota for desired functions related to specific biochemical conversions. Here, we present the quantitative prediction of medium-chain carboxylate production in two continuous anaerobic bioreactors from 16S rRNA gene dynamics in enrichment cultures.
Results: By progressively shortening the hydraulic retention time from 8 days to 2 days with different temporal schemes in both bioreactors operated for 211 days, we achieved higher productivities and yields of the target products n-caproate and n-caprylate. The datasets generated from each bioreactor were used independently for training and testing in machine learning. A predictive model was generated by employing the random forest algorithm using 16S rRNA amplicon sequencing data. More than 90% accuracy in the prediction of n-caproate and n-caprylate productivities was achieved. Four bioindicators were inferred, belonging to the genera Olsenella, Lactobacillus, Syntrophococcus and Clostridium IV, which suggests that these taxa are relevant to the higher carboxylate productivity at shorter hydraulic retention time. The recovery of metagenome-assembled genomes of these bioindicators confirmed their genetic potential to perform key steps of medium-chain carboxylate production.
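The link between hydraulic retention time (HRT) and productivity can be illustrated with the standard steady-state relation for a continuous reactor, in which volumetric productivity equals product concentration multiplied by the dilution rate (1/HRT). The sketch below is illustrative only: the function names and the concentration values are hypothetical and are not taken from this study.

```python
def dilution_rate(hrt_days: float) -> float:
    """Dilution rate (1/day) is the reciprocal of the hydraulic retention time."""
    if hrt_days <= 0:
        raise ValueError("HRT must be positive")
    return 1.0 / hrt_days


def volumetric_productivity(concentration_g_per_l: float, hrt_days: float) -> float:
    """Steady-state productivity (g/L/day) = concentration (g/L) * dilution rate (1/day)."""
    return concentration_g_per_l * dilution_rate(hrt_days)


if __name__ == "__main__":
    # Hypothetical effluent concentrations, chosen only for illustration.
    scenarios = [(8.0, 4.0), (2.0, 3.0)]  # (HRT in days, concentration in g/L)
    for hrt, conc in scenarios:
        rate = volumetric_productivity(conc, hrt)
        print(f"HRT {hrt:g} d, {conc:g} g/L -> {rate:g} g/L/d")
```

Under these assumed numbers, productivity rises at the shorter HRT even though the concentration falls, because the reactor volume is exchanged four times as often.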
Conclusions: Shortening the hydraulic retention time of the continuous bioreactor systems makes it possible to shape the communities with desired chain elongation functions. Using machine learning, we demonstrated that 16S rRNA amplicon sequencing data can be used to predict bioreactor process performance quantitatively and accurately. Characterising and harnessing bioindicators holds promise to manage reactor microbiota towards selection of the target processes. Our mathematical framework is transferable to other ecosystem processes and microbial systems where community dynamics are linked to key functions. The general methodology can be adapted to data types of other functional categories such as genes, transcripts, proteins or metabolites.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Supplementary files associated with this preprint:
Additional file 1:
Figure S1. Alpha rarefaction curves.
Figure S2. Workflow of the random forest classification analysis.
Figure S3. Workflow of a two-step random forest regression analysis.
Figure S4. Gas production of bioreactors.
Figure S5. Biomass production of bioreactors.
Figure S6. Microbial community composition profiles of bioreactors.
Figure S7. Alpha diversity metrics of bioreactor communities.
Figure S8. Prediction results of C6 and C8 productivities using non-HRT bioindicators for considering community assembly caused by time.
Figure S9. Prediction results of C6 and C8 productivities for all samples in the four HRT phases using HRT bioindicators.
Figure S10. Prediction results of C6 and C8 productivities for all samples in the four HRT phases using non-HRT bioindicators for considering community assembly caused by time.
Figure S11. Random forest feature importance of A-HRT bioindicators and B-HRT bioindicators used to predict C6 and C8 productivities.
Figure S12. Random forest feature importance of the non-HRT bioindicators used to predict C6 and C8 productivities.
Figure S13. Metabolic pathways involved in converting lactate and xylan to n-caproate and n-caprylate.
Figure S14. Correlation network of environmental factors, process performance and microbial community.
Figure S15. Prediction results of C6 and C8 productivities for all samples in the four HRT phases using the four ASVs of HRT bioindicators irrespective of time.
Figure S16. Reducing HRT increases abundances of HRT bioindicators driving the catabolism of xylan and lactate to n-caproate and n-caprylate.
Table S1. Growth medium used for the reactor operation.
Table S2. Daily feeding of bioreactors A and B during the four HRT phases.
Table S3. Gini scores of all ASVs in the classification-based prediction of HRT phases.
Table S4. Mean carboxylate yields (i.e. C mole product to substrate ratios) at HRTs of 8 days and 2 days (stable production period).
Table S5. Explained variances of the training set in the regression-based prediction of process parameters using A-HRT bioindicators and B-HRT bioindicators.
Table S6. Explained variances of the training set in the regression-based prediction of process parameters using non-HRT bioindicators for considering community assembly caused by time.
Additional file 2: Dataset S1. MAGs taxonomy and genome metrics.
Additional file 3: Dataset S2. Functional annotations of xylose fermentation for MAGs with the same taxonomy as HRT bioindicators.
Additional file 4: Dataset S3. Functional annotations of chain elongation for MAGs with the same taxonomy as HRT bioindicators.
Additional file 5: Dataset S4. Functional annotations of xylose fermentation for all MAGs.
Additional file 6: Dataset S5. Functional annotations of chain elongation for all MAGs.
Posted 22 Sep, 2020