This section gives a brief overview of the data and analysis pipeline, the data assembly and processing performed prior to model training, and the machine learning model-building workflow used in this work. We also detail the statistical techniques applied to improve model performance and interpretation, including feature selection, oversampling, and feature importance. The principles demonstrated in this work can be readily applied to other clinical outcomes and/or disease indications. All methods were carried out in accordance with relevant guidelines and regulations.
Data and analysis pipeline. Patient data were extracted from the United Kingdom (UK) Biobank (https://biobank.ctsu.ox.ac.uk), a large-scale biomedical database and research resource of 500,000 participants aged 40 to 69 years recruited throughout the UK between 2006 and 2010. The database includes participants with a wide range of serious and life-threatening illnesses. Participants have undergone measures, provided blood, urine and saliva samples and detailed information about themselves, and agreed to have their health followed.
Data were extracted under UK Biobank application #26041. Participants with a diagnosis of any liver disease according to ICD codes were selected for study. Participants with a diagnosis of AATD-LD (identified by ICD code or questionnaire) and the PiZZ genotype (SNP rs28929474) were subsequently identified. The data were pre-processed before modeling (e.g., centering and scaling the predictors, and imputing missing predictor values via multiple imputation). The process flow for data assembly, processing, and analysis is shown in Fig. 5.
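The pre-processing step can be illustrated with a minimal sketch. It assumes scikit-learn, which the source does not name, and uses IterativeImputer with posterior sampling as a stand-in for multiple imputation; the synthetic data, the missingness rate, and the choice of five imputed datasets are all hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor matrix with roughly 10% of values missing
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan

# Draw several imputed datasets (assumed m = 5), then center and scale each one
imputed_sets = []
for seed in range(5):
    pipe = make_pipeline(
        IterativeImputer(sample_posterior=True, random_state=seed),
        StandardScaler(),  # subtract the column mean, divide by the column SD
    )
    imputed_sets.append(pipe.fit_transform(X))
```

In a cross-validated workflow, such a pipeline would normally be fitted on the training portion only and then applied to the held-out portion.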
Feature engineering. To keep overfitting and multicollinearity from hampering the modeling, redundant features were eliminated through feature selection prior to model training. The final set of predictor variables for model training was selected through the joint application of seven feature selection methods including: (1) filter methods, such as Pearson correlation and the Chi-squared test; (2) wrapper methods, such as recursive feature elimination; and (3) embedded methods, such as Lasso and three tree-based models (Fig. 6). Predictor variables selected by at least 4 of the 7 selection methods were assigned to one of four predictor blocks (Fig. 5).
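The "at least 4 of 7" voting rule can be sketched as follows. The function name, the feature names, and the vote matrix are hypothetical and serve only to illustrate the counting step; they are not taken from the study data.

```python
import numpy as np

def select_by_votes(votes, feature_names, min_votes=4):
    """Keep features chosen by at least `min_votes` selection methods.

    votes[m, j] is truthy when selection method m picked feature j.
    """
    counts = np.asarray(votes, dtype=bool).sum(axis=0)
    return [name for name, c in zip(feature_names, counts) if c >= min_votes]

# Hypothetical votes: 7 selection methods (rows) over 4 candidate features (columns)
feature_names = ["age", "bmi", "alt", "smoking"]
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
])
selected = select_by_votes(votes, feature_names)  # -> ["age", "alt"]
```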
Oversampling technique. Imbalanced classification arises when a minority class has too few records for the model to learn from effectively. To address this challenge and improve model performance on the minority class, the synthetic minority oversampling technique (SMOTE)16 was applied to the clinical outcomes with data imbalance, including liver-related death and liver transplant. New synthetic records were generated by linear interpolation between existing samples of the minority class. AUPRC was used as the performance measure under data imbalance.
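The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in this work: the function name, the neighbor count k, and the example data are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a minority sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise squared distances within the minority class
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)                    # seed samples
    nb = neighbors[base, rng.integers(0, k, size=n_new)]     # one neighbor each
    gap = rng.random((n_new, 1))                             # position along the segment
    return X_min[base] + gap * (X_min[nb] - X_min[base])

# Hypothetical minority-class records: 10 samples, 3 predictors
X_min = np.random.default_rng(1).normal(size=(10, 3))
synthetic = smote_sketch(X_min, n_new=20)
```

Because each synthetic record lies on a segment between two real minority records, it always stays within the range spanned by the observed minority class.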
Machine learning model building. The stacking ensemble learning algorithm17 was applied in this work to improve prediction performance. Stacking is a meta-learning algorithm that combines the predictions of multiple well-performing machine learning models, including classification tree and/or regression methods, with the aim of making the final model perform better than any single model in the ensemble. We applied and combined the learning from RF, GB, ENRR, and ANN-MLP in this work.
- RF is an ML algorithm for classification that consists of a large number of individual decision trees and uses bagging and feature randomness during training to create an uncorrelated forest of trees. The final prediction of the random forest model is the class selected by most trees.
- GB is an ML algorithm that uses boosting and grows trees in a stage-wise, gradual, additive and sequential manner. Two GB algorithms were applied in this work: eXtreme Gradient Boosting (XGBoost), which splits the tree level-wise, and LightGBM, which has faster training speed and higher efficiency.
- ENRR is an application of regularized regression with penalties that avoid extreme parameter values, which could cause overfitting. ENRR combines two commonly used regularization techniques (Lasso and Ridge) into a hybrid penalized model.
- ANN is one of the deep-learning algorithms inspired by the structure and function of the human brain. MLP is a class of feedforward ANN. We applied multiple-input single-output neural network forecasting in this work.
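A stacking ensemble over base learners of the kinds listed above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the configuration used in this work: GradientBoostingClassifier stands in for XGBoost and LightGBM, an elastic-net-penalized logistic regression stands in for ENRR, MLPClassifier stands in for the Keras MLP, and the logistic-regression meta-learner, the hyperparameters, and the synthetic data are all hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary outcome (placeholder for a clinical outcome)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("enet", make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.5, max_iter=5000))),
    ("mlp", make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                      random_state=0))),
]

# The meta-learner is trained on out-of-fold predictions of the base learners
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=5, stack_method="predict_proba")
stack.fit(X_train, y_train)
proba = stack.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
```

Training the meta-learner on out-of-fold predictions (the `cv=5` argument) keeps it from simply rewarding base learners that overfit the training data.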
To improve the stability of the prediction results, a nested five-fold cross-validation with independent random partitions was conducted with 100 repetitions. Nested cross-validation places an inner cross-validation loop inside an outer one: the inner loop was used for model selection and hyperparameter tuning, and the outer loop was used for model performance evaluation. The ML model was trained using the Training Set and evaluated using the Test Set through the nested five-fold cross-validation. Of note, the SMOTE oversampling technique was applied to the Training Set for liver-related death and liver transplant. Because a single run of nested five-fold cross-validation can give a noisy estimate of model performance, we repeated the procedure 100 times with different splits of Training and Test data. Model performance was evaluated by prediction accuracy, AUROC, and AUPRC, and the mean and standard deviation across all iterations were reported. The mean is considered a more accurate and stable estimate of the underlying prediction performance than the result of any single run.
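The repeated nested cross-validation can be sketched with scikit-learn as below. The sketch makes several assumptions: a single random forest replaces the full ensemble, the tuning grid is hypothetical, only 2 repetitions are run instead of 100 to keep the example fast, `average_precision` is used as scikit-learn's summary of the precision-recall curve in place of AUPRC, and the SMOTE step on the training folds is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_validate)

# Synthetic imbalanced outcome (about 20% positives)
X, y = make_classification(n_samples=200, n_features=8, weights=[0.8, 0.2],
                           random_state=0)

# Inner loop: hyperparameter tuning on each outer training split
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
tuned_model = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [2, 4]},   # hypothetical grid
    scoring="average_precision",
    cv=inner_cv,
)

# Outer loop: performance evaluation, repeated with fresh random partitions
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
metrics = ["accuracy", "roc_auc", "average_precision"]
scores = cross_validate(tuned_model, X, y, cv=outer_cv, scoring=metrics)

# Mean and standard deviation across all outer iterations
summary = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
           for m in metrics}
```

Any resampling such as SMOTE would need to be fitted inside each training split only, so that synthetic records never leak into the data used for evaluation.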
This analysis was carried out using Python 3.8 and Keras 2.5.0. Figure 7 presents the workflow of the stacking ensemble learning algorithm in this work.
Feature importance. Feature importance refers to a class of techniques that assign scores to the input features of a predictive model, indicating the relative importance of each feature when making a prediction. These scores can provide insight into, and a better understanding of, both the data and the ML prediction model. We applied permutation importance14 to each of the five ML models to obtain permutation importance scores and calculated the final feature importance score by summing these scores (Appendix C). The important predictors were identified and ranked based on the final importance score.
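Summing permutation importance scores across models can be sketched as follows. The example assumes scikit-learn's `permutation_importance`, uses three stand-in models rather than the five models of this work, and scores on AUROC; the data, the models, and the scoring choice are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    LogisticRegression(max_iter=1000),
]

total_importance = np.zeros(X.shape[1])
for model in models:
    model.fit(X_train, y_train)
    # Importance = mean drop in AUROC when one feature column is shuffled
    result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                    random_state=0, scoring="roc_auc")
    total_importance += result.importances_mean

# Feature indices ordered from most to least important
ranking = np.argsort(total_importance)[::-1]
```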
Data and code availability
The data underlying this article are part of the UK Biobank dataset (application #26041) and are not publicly available. The data and the data processing, feature extraction, machine learning, and analysis code will be shared by the corresponding author upon reasonable request.