Comparative assessment and outlook on methods for imputing proteomics data

Background: Missing values are a major issue in quantitative proteomics data analysis. While many methods have been developed for imputing missing values in high-throughput proteomics data, comparative assessment of the accuracy of existing methods remains inconclusive, mainly because the true missing mechanisms are complex and the existing evaluation methodologies are imperfect. Moreover, few studies have provided an outlook on current and future development.

Results: We first report an assessment of eight representative methods collectively targeting three typical missing mechanisms. The selected methods are compared on both realistic simulation and real proteomics datasets, and the performance is evaluated using three quantitative measures. We then discuss fused regularization matrix factorization, a popular low-rank matrix factorization framework with similarity and/or biological regularization, which is extendable to integrating multi-omics data such as gene expression or clinical variables. We further explore the potential application of convex analysis of mixtures, a biologically-inspired latent variable modeling strategy, to missing value imputation. The preliminary results on proteomics data are provided together with an outlook into future development directions.

for high-throughput identification and quantification of thousands of proteins in a single analysis [1,2]. The LC-MS signals can be displayed in a three-dimensional space with the mass-to-charge ratios, retention times and intensities for the observed peptides. However, this approach suffers from many missing values at the peptide or protein level, which significantly reduces the number of quantifiable proteins, with an average of 44% missing values [3][4][5].

While there are multiple causes for this missingness, three typical missing mechanisms are widely acknowledged. Low-abundance proteins may be missing because their concentration is below the lower limit of detection (LLD), while poorly ionizing peptides may cause proteins to be Missing Not at Random (MNAR) [6]. However, missingness may also extend to mid- and even high-range intensities, statistically categorized into Missing at Random (MAR) and Missing Completely at Random (MCAR) [7]. MAR means missing conditionally at random, given observed, known covariates or even unknown covariates. MCAR depends on neither the observed nor the missing data; thus the incomplete data are representative of the entire data. Practically, MAR and MNAR cannot be distinguished because by definition missing values are unknown [8]. More importantly, missing values in reality can originate from a mix of both known and unknown missing mechanisms [7,9].
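The difference between value-independent (MCAR) and detection-limit (MNAR) missingness can be illustrated with a minimal Python sketch. It is not the simulation code used in this study: the function names `mask_mcar` and `mask_below_lld`, the toy matrix, and the threshold and rate values are all hypothetical, and missing entries are represented by `None`.

```python
import random

def mask_mcar(matrix, rate, rng):
    """MCAR: each entry becomes missing with probability `rate`,
    regardless of its value or of any other entry."""
    return [[None if rng.random() < rate else v for v in row] for row in matrix]

def mask_below_lld(matrix, lld):
    """MNAR (left-censoring): entries below the detection limit `lld`
    become missing, so missingness depends on the unobserved value."""
    return [[None if v is None or v < lld else v for v in row] for row in matrix]

# Hypothetical log2 intensities: 4 samples (rows) x 3 proteins (columns).
X = [
    [20.1, 15.3, 11.2],
    [19.8, 14.9, 12.5],
    [21.0, 15.8, 10.7],
    [20.4, 15.1, 11.9],
]

rng = random.Random(0)
# A mixed mechanism: censor low intensities first, then add random dropouts.
mixed = mask_mcar(mask_below_lld(X, lld=11.5), rate=0.1, rng=rng)
for row in mixed:
    print(row)
```

In this sketch the censored entries are concentrated in the low-abundance protein, whereas the random dropouts can fall anywhere in the matrix, which is why a method tuned for one mechanism may distort values lost through the other.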
A common solution for missingness is to impute the missing values based on assumed missing mechanisms. However, this comes at the expense of potentially introducing profound changes in the distribution of protein-level intensities, because most existing methods are designed specifically for a single missing mechanism. This can have unpredictable effects on downstream differential analyses. Moreover, while many imputation methods have been adopted for imputing missing values in proteomics data, comparative evaluation of their relative performance remains largely inconclusive, and few studies provide an outlook addressing unresolved problems or future development directions [4,9,10].

To gain first-hand insight into the strengths and limitations of both imputation methods and assessment designs, we conduct a collective assessment of eight representative methods involving three typical missing mechanisms in conjunction with authentic missing values. Compared on a set of realistic, distribution-preserving simulations derived from real proteomics data sets, the performance of the selected methods is measured by three criteria: root-mean-square error (RMSE), normalized root-mean-square error (NRMSE), and Sum of Ranks (SOR). There are several important observations from this comparison study. First, while imputation methods perform differently under various missing mechanisms, algorithmic parameter settings, and preprocessing procedures, a few methods consistently outperformed peer methods across a range of realistic simulation studies. Second, the quality of performance assessment depends on the efficacy of the simulation design, and a more realistic simulation design should include authentic missing values and preserve the original overall data distribution. Third, existing assessment methodology is imperfect in that performance is indirectly assessed on imputing either artificial or masked, but not authentic, missing values (see Discussion section).

To explore a more integrative strategy for improving imputation performance, we discuss a low-rank matrix factorization framework with fused regularization on sparsity and similarity, Fused Regularization Matrix Factorization (FRMF) [11][12][13], which can naturally integrate other omics data such as gene expression or clinical variables. We also introduce a biologically-inspired latent variable modeling strategy, Convex Analysis of Mixtures (CAM) [13,14], which performs data imputation using the original intensity data (before log-transformation). The preliminary results on real proteomics data are provided together with an outlook into future development directions.
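The three criteria named above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the evaluation code of this study: NRMSE is taken here as RMSE divided by the standard deviation of the true values (one common convention, assumed because the normalization is not spelled out in this section), ties in the ranking are broken by method order, and the function names and numbers are hypothetical.

```python
import math

def rmse(true, imputed):
    """Root-mean-square error between true and imputed values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(true, imputed)) / len(true))

def nrmse(true, imputed):
    """RMSE normalized by the standard deviation of the true values
    (assumed normalization; other conventions exist)."""
    mean = sum(true) / len(true)
    var = sum((t - mean) ** 2 for t in true) / len(true)
    return rmse(true, imputed) / math.sqrt(var)

def sum_of_ranks(nrmse_by_method):
    """nrmse_by_method maps a method name to its list of protein-wise NRMSEs.
    For each protein, methods are ranked (1 = lowest error); ranks are summed."""
    methods = list(nrmse_by_method)
    n_proteins = len(nrmse_by_method[methods[0]])
    totals = {m: 0 for m in methods}
    for i in range(n_proteins):
        ordered = sorted(methods, key=lambda m: nrmse_by_method[m][i])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return totals

# Hypothetical protein-wise NRMSEs for two methods over three proteins.
scores = {"method_a": [0.10, 0.40, 0.20], "method_b": [0.30, 0.35, 0.50]}
print(sum_of_ranks(scores))
```

A lower SOR indicates a method that ranks well across many proteins, which makes the criterion less sensitive to a few proteins with very large errors than a pooled RMSE.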

Experimental design and protocol
We selected eight representative methods for comparative assessment, based on their intended missing mechanism(s) and imputation principles, summarized in  [7,9,16,17,18]. We then explored and tested several variants of FRMF and CAM, where local similarity information is obtained from baseline or other data acquired from the same samples.

We conducted the comparative assessments in two complementary simulation settings. First, the realistic simulation data were generated from the observed data portion (no authentic missing values) of a real proteomics dataset, where artificial missing values were introduced involving two typical missing mechanisms and used for performance assessment. Second, the realistic simulation data were generated from the complete data matrix (including authentic missing values) of a real proteomics dataset, where a small percentage of data points were randomly set aside (masked values) and used solely for performance assessment. The preprocessing eliminates those proteins whose missing rates are higher than 80% and then performs log2 transformation [19]. The parameters were optimized for each imputation method by parameter sweeping over a wide range of settings at each missing rate. The overall experimental workflow is given in Figure 2.
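The preprocessing step described above (dropping proteins with more than 80% missing values, then log2 transformation) can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function name `preprocess`, the sample-by-protein list layout, the use of `None` for missing values, and the toy intensities are assumptions.

```python
import math

def preprocess(matrix, max_missing_rate=0.8):
    """matrix: list of sample rows of raw intensities; None marks a missing value.
    Returns the indices of the retained proteins and the log2-transformed matrix."""
    n_samples = len(matrix)
    n_proteins = len(matrix[0])
    keep = []
    for j in range(n_proteins):
        n_missing = sum(1 for row in matrix if row[j] is None)
        # Eliminate proteins whose missing rate is higher than the threshold.
        if n_missing / n_samples <= max_missing_rate:
            keep.append(j)
    logged = [
        [None if row[j] is None else math.log2(row[j]) for j in keep]
        for row in matrix
    ]
    return keep, logged

# Hypothetical raw intensities: 5 samples x 3 proteins.
raw = [
    [8.0, None, None],
    [4.0, None, None],
    [2.0, None, None],
    [16.0, None, None],
    [1.0, 32.0, None],
]
kept, logged = preprocess(raw)
print(kept)
print(logged)
```

In this toy input the second protein has a missing rate of exactly 80% and is retained, while the third protein is entirely missing and is removed.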

The real LC-MS proteomics data form the base from which the simulation data sets were produced [6]. The data were acquired using a data-independent acquisition (DIA) protocol, and protein-level output was generated by mapDIA [20]. The dataset contains 200 samples associated with 2,682 proteins measured in human left anterior descending (LAD) coronary arteries collected as part of a study of coronary and aortic atherosclerosis [21]. The data were produced in three separate batches, indexed by A, B, and C, and all have passed quality control and preprocessing procedures, summarized in

In this simulation setting, we used the full data matrix (including both observed and authentic missing values) from the human coronary proteomics data set. To preserve the original patterns of both observed and authentic missing data, for each protein, a small percentage of data points in the complete data matrix were randomly set aside as 'NA' (masked values) with the masking rate(s) proportional to the authentic missing rate(s). This procedure was repeated for all proteins, and the masked values were considered a mix of MNAR and MAR conditioned on the observed missing rates and data patterns (Figure 3, Supplementary Information).
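The masking procedure can be sketched as below. This is a hedged illustration, not the script used in this study: the function name `mask_proportional`, the proportionality constant `scale`, and the rule that at least one observed value is kept per protein are assumptions introduced for the example.

```python
import random

def mask_proportional(matrix, scale, rng):
    """For each protein (column), set aside observed values at a rate of
    scale * (authentic missing rate of that protein). None marks missing.
    Returns the masked matrix and a dict of held-out true values."""
    n_samples = len(matrix)
    masked = [row[:] for row in matrix]
    held_out = {}  # (sample index, protein index) -> true value
    for j in range(len(matrix[0])):
        observed = [i for i in range(n_samples) if matrix[i][j] is not None]
        missing_rate = 1 - len(observed) / n_samples
        n_mask = round(scale * missing_rate * n_samples)
        # Assumed safeguard: always leave at least one observed value.
        n_mask = max(0, min(len(observed) - 1, n_mask))
        for i in rng.sample(observed, n_mask):
            held_out[(i, j)] = matrix[i][j]
            masked[i][j] = None
    return masked, held_out

# Hypothetical data: 10 samples x 2 proteins; the second protein has
# 4 authentic missing values, the first has none.
X = [[float(i + 1), None if i < 4 else float(i)] for i in range(10)]
masked, held_out = mask_proportional(X, scale=0.5, rng=random.Random(0))
print(sorted(held_out))
```

Because the number of masked values scales with each protein's authentic missing rate, proteins that are already sparsely observed contribute more held-out values, which keeps the masked set close to the authentic missingness pattern.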

The imputation performance of the eight methods under the MCAR mechanism is shown in

Based on biologically-inspired latent variable modeling of complex tissues, CAM [13,14], we proposed and evaluated three variants of the CAM-based imputation method. CAM_complete performs CAM-based imputation using the non-missing portion of the full data matrix; CAM_SVT and CAM_NIPALS perform CAM-based imputation using the full data matrix, initialized by SVT and NIPALS, respectively. The experimental results are shown in Figure 8

$$\mathrm{SOR} = \sum_{i=1}^{N} \mathrm{rank}(\mathrm{NRMSE}_i)$$

where $N$ is the number of proteins containing at least one missing value, $i$ is the protein index in this protein subset, and $\mathrm{rank}(\mathrm{NRMSE}_i)$ is the rank of the protein-wise NRMSE across different imputation methods.

Introduction to FRMF method
As noted earlier, low-rank matrix factorization is a popular and effective approach for missing data imputation [12]. For imputing proteomics data, the assumption is that only a small number of biological processes determine the expression profiles. Consider an $m \times n$ complete data matrix $X$ describing $m$ samples and $n$ proteins. A low-rank matrix factorization approach seeks to approximate $X$ containing missing values by a linear latent variable model,

$$X \approx A S \tag{1}$$

where $A_{m \times K}$ and $S_{K \times n}$ are the low-rank factor matrices, and $K \ll \min(m, n)$. In order to prevent overfitting, the solution is often formulated as a regularized sparse SVD minimization problem on the observed values

$$\min_{A,S} \; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{n} I(x_{ij}) \left( x_{ij} - a_i s_j \right)^2 + \frac{\lambda_A}{2} \|A\|_F^2 + \frac{\lambda_S}{2} \|S\|_F^2 \tag{2}$$

where $a_i$ is the $i$-th row of $A$, $s_j$ is the $j$-th column of $S$, $\|\cdot\|_F^2$ denotes the squared Frobenius norm, $I(\cdot)$ is the indicator function (1 if $x_{ij}$ is observed, 0 otherwise), and $\lambda_A, \lambda_S > 0$ are the regularization parameters. When local similarity information is available, FRMF can be formulated by adding a fused regularization term

$$\min_{A,S} \; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{n} I(x_{ij}) \left( x_{ij} - a_i s_j \right)^2 + \frac{\lambda_A}{2} \|A\|_F^2 + \frac{\lambda_S}{2} \|S\|_F^2 + \frac{\lambda_F}{2} \sum_{i=1}^{m} \sum_{f \in \mathcal{F}(i)} \| a_i - a_f \|_2^2 \tag{3}$$

where $\lambda_F$ is the fused regularization parameter, and $\mathcal{F}(i)$ denotes the neighborhood sample subset of sample $i$, which can be determined using baseline data or other relevant measurements, e.g. gene expression or pathological score. In our study, $\mathcal{F}(i)$ is determined by the between-sample cosine similarity $\cos(x_i, x_f)$ based on data matrix $X$ in FRMF_self, or $\cos(p_i, p_f)$ based on pathological scores in FRMF_cross_patho.
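To make the fused regularization idea concrete, the following Python sketch fits a low-rank model to the observed entries by stochastic gradient descent and pulls each sample's factors toward those of its most similar samples. It is a toy illustration, not the FRMF implementation evaluated in this study: the unweighted pairwise penalty, the optimizer, the hyperparameter defaults, and the names `cosine`, `neighborhoods` and `frmf_impute` are all assumptions.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity over entries observed in both vectors (None = missing)."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    num = sum(a * b for a, b in pairs)
    den = math.sqrt(sum(a * a for a, _ in pairs)) * math.sqrt(sum(b * b for _, b in pairs))
    return num / den if den else 0.0

def neighborhoods(rows, size):
    """For each sample, the indices of its `size` most similar other samples."""
    m = len(rows)
    sim = [[cosine(rows[i], rows[f]) for f in range(m)] for i in range(m)]
    return [sorted((f for f in range(m) if f != i), key=lambda f: -sim[i][f])[:size]
            for i in range(m)]

def frmf_impute(X, nbrs, k=1, lam_a=0.01, lam_s=0.01, lam_f=0.01,
                lr=0.01, epochs=2000, seed=0):
    """Fit X ~ A S on observed entries with L2 and fused penalties (SGD sketch),
    then fill the missing entries with the model predictions."""
    rng = random.Random(seed)
    m, n = len(X), len(X[0])
    A = [[rng.uniform(0, 1) for _ in range(k)] for _ in range(m)]
    S = [[rng.uniform(0, 1) for _ in range(n)] for _ in range(k)]
    for _ in range(epochs):
        for i in range(m):
            for j in range(n):
                if X[i][j] is None:
                    continue  # only observed values enter the data-fit term
                err = X[i][j] - sum(A[i][r] * S[r][j] for r in range(k))
                for r in range(k):
                    a, s = A[i][r], S[r][j]
                    A[i][r] += lr * (err * s - lam_a * a)
                    S[r][j] += lr * (err * a - lam_s * s)
            # Fused term: shrink the gap between sample i and its neighbors.
            for f in nbrs[i]:
                for r in range(k):
                    A[i][r] -= lr * lam_f * (A[i][r] - A[f][r])
    return [[X[i][j] if X[i][j] is not None
             else sum(A[i][r] * S[r][j] for r in range(k))
             for j in range(n)] for i in range(m)]

# Hypothetical rank-1 matrix (4 samples x 3 proteins) with one missing entry.
X = [[1.0, 2.0, 3.0], [2.0, 4.0, None], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
nbrs = neighborhoods(X, size=2)        # neighborhoods from the data itself
completed = frmf_impute(X, nbrs)
print(round(completed[1][2], 2))
```

Here the neighborhoods are computed from the data matrix itself, analogous in spirit to FRMF_self; passing rows of a different measurement (for example pathological scores) to `neighborhoods` would mimic the cross-data variant.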

CAM is a latent variable modeling and deconvolution technique previously used for identifying biologically-interpretable cell subtypes $S_{K \times n}$ and their composition $A_{m \times K}$ in complex tissues [6,13,14,21]. We adopt the CAM framework into (1) and demonstrate that the hybrid CAM_SVT and CAM_NIPALS variants can handle missing values naturally, and that this combination leads to a novel and biologically-plausible imputation strategy. The workflow of the CAM-based method with three variants is given in Figure 9.

While FRMF is a promising and novel imputation approach, its effectiveness for improving classic low-rank methods would depend on diversity among samples, the discriminatory power of the similarity measure, and the complementary nature of additional and relevant measurements. The newly proposed CAM method represents an interesting direction for further development. More importantly, CAM performs missing value imputation using original intensity rather than log-transformed data, and this is mathematically more rigorous because log-transformation violates the linear nature of

The scripts used in the paper are available in the R script ProImput. Code for all experiments can be found in the vignette at https://github.com/MinjieSh/ProImput. Any operating system that supports the R language can be used.

Competing interests
The authors declare that they have no competing interests.