Resampled dimensional reduction for feature representation in machine learning

We aimed to provide a resampling protocol for dimensional reduction that yields a few latent variables. Its application focuses on, but is not limited to, developing a machine learning prediction model, where it improves the sample size relative to the number of candidate predictors. With this feature representation technique, one can improve generalization by preventing the latent variables from overfitting the data used to conduct the dimensional reduction. However, this technique may require more computational capacity and time. The key stages consist of deriving latent variables from multiple resampling subsets, estimating the parameters of the latent variables in the population, and selecting the latent variables transformed by the estimated parameters.


Introduction
Adoption of insurance-based healthcare is emerging worldwide, which calls for better prevention in order to improve both patient outcomes and healthcare efficiency. To achieve these goals, machine learning algorithms are widely applied to develop clinical prediction models with satisfactory predictive performance. 1-6 However, machine learning models, including those applying multivariable logistic regression, were at high risk of bias, especially because of a low sample size relative to the number of candidate predictors. 7 To deal with this problem, dimensional reduction can be applied to represent many candidate predictors as fewer latent variables. However, most prediction models that used these variables, if not all, conducted the dimensional reduction without either resampling or data partition, which exposes them to a risk of optimistic bias and makes them not robust for samples beyond the training set, one of the main problems in current machine learning practice. 8 This is because resampling and data partition are better known in predictive modeling and supervised machine learning, whereas dimensional reduction is typically used for statistical inference and unsupervised machine learning.
We applied this protocol in multiple studies in a human-machine learning project. This project compared our human-machine learning algorithm with those applying human learning and other machine learning algorithms to predict a variety of health outcomes. Ethical clearance was waived by the Taipei Medical University Joint Institutional Review Board (TMU-JIRB number: N202106025). We followed two guidelines for developing and reporting machine learning predictive models in biomedical research, 9,10 specific to multivariable prediction models rather than those identifying risk or prognostic factors. 10 This protocol aimed to provide a resampling protocol for dimensional reduction that yields a few latent variables, focusing on, but not limited to, developing a machine learning prediction model.

Reagents

Equipment
The data analysis was conducted using the R 4.0.2 programming language (R Foundation, Vienna, Austria) in RStudio 1.3.959 (RStudio PBC, Boston, MA, USA). The R packages were all kept in sync by using Bioconductor 3.11. 11 Since machine learning predictive modeling was the main context of this protocol, the R package caret 6.0.86 was used for data partition. To facilitate the main steps of this protocol, we created the R package rsdr 0.1.0. We also created medhist 0.1.0 to preprocess the sample dataset. Both are available for download from https://github.com/herdiantrisufriyana. Details on other R package versions and all of the source code (vignette) for the data analysis are available at https://github.com/herdiantrisufriyana/resdimer.
Reproducing our work may require specific hardware. We used a single machine with 8 logical processors on a 3.40 GHz central processing unit (CPU) (Core (TM) i7-4770, Intel®, Santa Clara, CA, USA) and 16 GB of RAM. However, if the sample size is smaller than that of the dataset we used in this protocol, a machine with only 4 logical processors and 4 GB of RAM can also be used.

Procedure
1. Split a dataset randomly into a derivation set and a validation set
Only the derivation set was used to estimate latent variables at the population level. This set was also used later as the training set of a prediction model. This keeps the model blinded to the distribution of the feature representation weights in any external validation set.
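The random split can be sketched as follows. The protocol itself used caret 6.0.86 in R for data partition; this Python sketch is only an illustration, and the function name `split_derivation_validation`, the 80/20 proportion, and the seed are assumptions that do not come from the protocol.

```python
import random

def split_derivation_validation(n_instances, derivation_fraction=0.8, seed=42):
    """Randomly assign row indices to a derivation set and a validation set.

    The 0.8 fraction and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    n_derivation = round(n_instances * derivation_fraction)
    derivation = sorted(indices[:n_derivation])
    validation = sorted(indices[n_derivation:])
    return derivation, validation

derivation_idx, validation_idx = split_derivation_validation(100)
print(len(derivation_idx), len(validation_idx))  # prints: 80 20
```

Only the indices in `derivation_idx` would feed the later estimation steps, so the validation rows never influence the estimated means, standard deviations, or weights.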

2. Choose resampling and dimensional reduction methods
We created rsdr 0.1.0, an R package that allows future investigators to conduct a principal component (PC) analysis or singular value decomposition using resampling methods, as described in this protocol.
As an example, instead of computing singular values by bootstrapping, we computed PCs by k-fold cross-validation for simplicity, given its simpler theoretical framework and achievable computational requirements. To compute PCs by k-fold cross-validation, each of $\beta_l$, $\mu_j$, and $\sigma_j$ was inferred from the derivation set only, of which a $(K-k_m)/K$ part of the $n$ instances for $m=[1,2,\cdots,K]$ (equation in Figure 1) was used each time to compute the variance.
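A minimal sketch of how the K resampling subsets might be formed is given below. It assumes that one fold is left out for each subset, so that each subset holds about (K-1)/K of the derivation instances; the function name `kfold_training_subsets` and the seed are hypothetical, and the rsdr package may construct the subsets differently.

```python
import random

def kfold_training_subsets(indices, K=10, seed=1):
    """Return K subsets of the derivation indices, each omitting one random fold.

    Assumption: exactly one fold is left out per subset.
    """
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[m::K] for m in range(K)]  # K nearly equal folds
    subsets = []
    for m in range(K):
        held_out = set(folds[m])
        subsets.append([i for i in shuffled if i not in held_out])
    return subsets

subsets = kfold_training_subsets(range(200), K=10)
print(len(subsets), len(subsets[0]))  # prints: 10 180
```

Each of the K subsets then goes through standardization and PC computation separately, which yields K sets of parameter estimates.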

3. Standardize each variable with the variable-wise average and standard deviation
For every resampling subset, an $X$ matrix of $n \times p$ dimensions was constructed for $i=[1,2,\cdots,n]$ instances and $j=[1,2,\cdots,p]$ candidate predictors. Each column vector was standardized with the column-wise mean $\mu_j$ and standard deviation $\sigma_j$ of all instances for each candidate predictor.
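The standardization of one resampling subset can be illustrated with the sketch below. It assumes the sample standard deviation (denominator n-1), which the protocol does not state explicitly; the function names and the toy matrix are hypothetical.

```python
from statistics import mean, stdev

def column_parameters(X):
    """Column-wise mean and sample standard deviation of X (rows are instances)."""
    columns = list(zip(*X))
    mu = [mean(col) for col in columns]
    sigma = [stdev(col) for col in columns]  # assumption: n - 1 denominator
    return mu, sigma

def standardize(X, mu, sigma):
    """Subtract the column mean and divide by the column standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, mu, sigma)] for row in X]

# Toy subset with n = 3 instances and p = 2 candidate predictors.
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
mu, sigma = column_parameters(X)
Z = standardize(X, mu, sigma)
print(mu)  # prints: [2.0, 30.0]
```

A candidate predictor with zero variance in a subset would cause a division by zero here, so such columns would need to be dropped or handled beforehand.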

4. Map from higher to lower dimension by finding weights that maximize variances of new dimensions
For every resampling subset, we mapped each vector $x_{(i)} \in X$ onto a new vector of PC scores $t_{l(i)} = x_{(i)} \cdot \beta_l$ for $l=[1,2,\cdots,q]$ by a matrix $\beta$ of weight vectors, where $q$ ranged up to $p$. Mapping was used to find estimates of weight vectors that maximized the variance (equation in Figure 1). The $l$th PC was calculated by subtracting the $(l-1)$th PC from $X$, then finding the estimate of the $l$th PC in the same way as for the $(l-1)$th PC.
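The sequential search for weight vectors can be sketched as below. This is an illustration under stated assumptions rather than the rsdr implementation: it uses power iteration on the sample covariance matrix of already standardized data to find the unit-length weight vector with maximum score variance, then subtracts that PC from the data before searching for the next one. All function names, the random starting vector, and the toy data are hypothetical.

```python
import math
import random

def leading_weight_vector(Z, rng, iterations=1000, tol=1e-12):
    """Power iteration for the unit weight vector that maximizes score variance."""
    n, p = len(Z), len(Z[0])
    # Sample covariance of the (already centered) columns of Z.
    C = [[sum(row[a] * row[b] for row in Z) / (n - 1) for b in range(p)]
         for a in range(p)]
    w = [rng.uniform(-1.0, 1.0) for _ in range(p)]  # random starting direction
    norm = math.sqrt(sum(x * x for x in w))
    w = [x / norm for x in w]
    for _ in range(iterations):
        v = [sum(C[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm < 1e-15:  # no variance left to explain
            break
        v = [x / norm for x in v]
        change = max(abs(x - y) for x, y in zip(v, w))
        w = v
        if change < tol:
            break
    variance = sum(w[a] * C[a][b] * w[b] for a in range(p) for b in range(p))
    return w, variance

def principal_components(Z, q, seed=0):
    """First q weight vectors and their variances, found one at a time."""
    rng = random.Random(seed)
    residual = [row[:] for row in Z]
    weights, variances = [], []
    for _ in range(q):
        w, variance = leading_weight_vector(residual, rng)
        weights.append(w)
        variances.append(variance)
        # Subtract this PC from the data before searching for the next one.
        scores = [sum(x * b for x, b in zip(row, w)) for row in residual]
        residual = [[x - t * b for x, b in zip(row, w)]
                    for row, t in zip(residual, scores)]
    return weights, variances

# Toy centered data: most of the variance lies along the first column.
Z = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
weights, variances = principal_components(Z, q=2)
print([round(v, 3) for v in variances])  # prints: [2.667, 0.667]
```

The sign of each weight vector is arbitrary in this sketch, since a vector and its negation give the same score variance.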

5. Estimate the variable-wise average and standard deviation and the weights of the transformation at the population level
Population-level estimates of $\beta_l$, $\mu_j$, and $\sigma_j$ were calculated by averaging each of them over all $K=10$ of the $(K-k_m)/K$ parts. The maximum variance is commonly known to be the largest eigenvalue of the matrix $X^T X$, which is achieved with $\beta$ as the corresponding eigenvector. For each PC, one can find the original variables that it represents by keeping those whose estimated weights of the transformation for that PC reach a minimum absolute value.
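The averaging over resampling subsets can be sketched as follows. The sign alignment step is an assumption added for this illustration: a PC weight vector is only defined up to sign, so vectors from different subsets could otherwise cancel each other when averaged. The protocol text does not describe how, or whether, this is handled, and the function names and numbers below are hypothetical.

```python
def average_vectors(vectors):
    """Element-wise average of equally long parameter vectors from K subsets."""
    K = len(vectors)
    return [sum(values) / K for values in zip(*vectors)]

def align_sign(weights, reference):
    """Flip a weight vector if it points opposite to the reference vector.

    Assumption: signs are aligned before averaging (not stated in the protocol).
    """
    dot = sum(a * b for a, b in zip(weights, reference))
    return [-a for a in weights] if dot < 0 else list(weights)

# Hypothetical estimates from K = 3 subsets (the protocol uses K = 10).
mu_per_subset = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
beta_1_per_subset = [[0.6, 0.8], [-0.6, -0.8], [0.8, 0.6]]

mu_hat = average_vectors(mu_per_subset)
aligned = [align_sign(b, beta_1_per_subset[0]) for b in beta_1_per_subset]
beta_1_hat = average_vectors(aligned)
print(mu_hat)  # prints: [2.0, 12.0]
print([round(b, 3) for b in beta_1_hat])  # prints: [0.667, 0.733]
```

Without the alignment, the first weight in this toy example would average to about 0.267 instead of 0.667, because the second subset returned the same direction with the opposite sign.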

6. Apply the estimated values to standardize and transform original variables into new dimensions
Each original variable in either the derivation or validation set was standardized by subtracting the variable-wise estimated average from it and then dividing the result by the variable-wise estimated standard deviation. All of the standardized variables were mapped to each PC by multiplying each of these variables by the estimated weights. Each PC score is the dot product, i.e., the sum of all the multiplication results.
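Applying the estimated parameters to a new instance can be illustrated with the sketch below. The function name `transform` and all numeric values are hypothetical; the point is only that the same estimated averages, standard deviations, and weights are reused for both the derivation and validation sets.

```python
def transform(X, mu_hat, sigma_hat, beta_hat):
    """Standardize each row of X with the estimated parameters, then compute
    one PC score per weight vector in beta_hat as a dot product."""
    scores = []
    for row in X:
        z = [(x - m) / s for x, m, s in zip(row, mu_hat, sigma_hat)]
        scores.append([sum(zj * bj for zj, bj in zip(z, b)) for b in beta_hat])
    return scores

# Hypothetical population-level estimates for p = 2 predictors and q = 2 PCs.
mu_hat = [2.0, 30.0]
sigma_hat = [1.0, 10.0]
beta_hat = [[0.6, 0.8], [0.8, -0.6]]

validation_rows = [[3.0, 40.0]]
pc_scores = transform(validation_rows, mu_hat, sigma_hat, beta_hat)
print([round(t, 3) for t in pc_scores[0]])  # prints: [1.4, 0.2]
```

Here the instance standardizes to (1, 1), so the first PC score is 0.6 + 0.8 and the second is 0.8 - 0.6.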

7. Select a particular number of new dimensions with the highest proportions of variance
This step is for predictive modeling. The recommended sample size relative to the number of candidate predictors was determined for a specific algorithm (e.g., 200 events per variable for random forest). 12 The maximum number of candidate predictors was calculated by dividing the number of events by that number. The PCs were sorted by proportion of variance explained from the highest to the lowest. We selected as many top PCs as the maximum number of candidate predictors. Only these top PCs were used for predictive modeling.
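The selection arithmetic can be sketched as below. Rounding the maximum number of candidate predictors down to a whole number is an assumption, and the function names, the number of events, and the variances are hypothetical.

```python
def max_candidate_predictors(n_events, events_per_variable):
    """Number of events divided by the recommended events per variable,
    rounded down (the rounding rule is an assumption)."""
    return n_events // events_per_variable

def select_top_pcs(variances, n_events, events_per_variable):
    """Indices of the PCs with the highest proportions of variance explained,
    limited to the maximum number of candidate predictors."""
    total = sum(variances)
    proportions = [v / total for v in variances]
    ranked = sorted(range(len(variances)), key=lambda l: proportions[l], reverse=True)
    limit = max_candidate_predictors(n_events, events_per_variable)
    return ranked[:limit], proportions

# Hypothetical example: 600 events and 200 events per variable allow 3 PCs.
selected, proportions = select_top_pcs([0.5, 2.0, 1.0, 0.5], 600, 200)
print(selected)     # prints: [1, 2, 0]
print(proportions)  # prints: [0.125, 0.5, 0.25, 0.125]
```

With fewer events than the recommended events per variable, the limit is zero and no PC is selected, which signals that the sample is too small for the chosen algorithm under this rule.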

Troubleshooting

Anticipated Results
The number of latent variables, i.e., PCs, is at most the same as that of the original predictors. The PCs differ in composition, i.e., in the weights of the original predictors in each PC. Original predictors with larger weights may describe what a PC represents semantically. One may assign each PC a term that describes the original predictors with larger weights in that PC. This will also improve interpretation if a PC is considered important for a prediction. Derivation of PCs with resampling may provide a better estimate of these latent variables in the population. A latent variable may also imply a novel factor of a disease.

Figures