Enhancing Classification of liquid chromatography mass spectrometry data with Batch Effect Removal Neural Networks (BERNN)

doi:10.21203/rs.3.rs-3112514/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 06 May, 2024

Read the published version in Nature Communications →

Version 1

Liquid Chromatography Mass Spectrometry (LC-MS) is a powerful method for profiling complex biological samples. However, batch effects typically arise from differences in sample processing protocols, experimental conditions and data acquisition techniques, significantly impacting the interpretability of results. Correcting batch effects is crucial for the reproducibility of proteomics research, but current methods are not optimal for removal of batch effects without compressing the genuine biological variation under study. We propose a suite of Batch Effect Removal Neural Networks (BERNN) to remove batch effects in large LC-MS experiments, with the goal of maximizing sample classification performance between conditions. More importantly, these models must efficiently generalize to batches not seen during training. Comparison of batch effect correction methods across three diverse datasets demonstrated that BERNN models consistently showed the strongest sample classification performance. However, the model producing the greatest classification improvements did not always perform best in terms of batch effect removal. Finally, we show that overcorrection of batch effects resulted in the loss of some essential biological variability. These findings highlight the importance of balancing batch effect removal against the preservation of valuable biological diversity in large-scale LC-MS experiments.

Biological sciences/Computational biology and bioinformatics/Machine learning

Biological sciences/Computational biology and bioinformatics/Proteome informatics

Biological sciences/Computational biology and bioinformatics/Classification and taxonomy

Biological sciences/Biotechnology/Proteomics

Biological sciences/Biotechnology/Metabolomics

Liquid chromatography-mass spectrometry (LC-MS) has become an essential analytical technique in molecular biology because of its ability to accurately and simultaneously quantify thousands of compounds in biological samples. It is a powerful tool for large-scale screening of potential biomarkers, which enables the identification of specific measurable indicators that can aid in disease diagnosis ¹, prognosis ² and treatment selection ³, leading to improved patient outcomes and personalized medicine. Despite the power of LC-MS, the utility of large-scale experiments remains compromised due to the omnipresence of confounding factors. Confounders can be divided into biological confounders, such as age or gender, and non-biological confounders, such as batch effects. The latter are practically unavoidable in large-scale studies due to limitations in instrument availability and the timeline of sample collection, and ideally would be removed from the final biological quantification value ⁴. It can be difficult, even impossible, to completely remove batch effects without affecting the quality of the biological signal. By assessing the classification improvement of machine learning models, we can determine if the batch effect correction method successfully removes the technical variations and restores the underlying biological patterns ^5,6. Furthermore, classification models enable personalized medicine by identifying patterns and biomarkers that can discriminate between different subjects. In highly heterogeneous disorders, such as Alzheimer's Disease, this enables the discovery of biomarkers that are not common to every person affected by the disorder ⁷.

Batch effects are a systematic variation that arises from experimental differences introduced unintentionally during data collection. They typically emerge from differences in sample processing protocols (e.g., variations in technicians, reagents, or equipment), experimental conditions (e.g., discrepancies in temperature, humidity), and data acquisition techniques (e.g., variations in sequencing platforms, microarray scanners). They are a critical problem in many high-throughput bioassays, such as microarrays ⁸, RNA-Seq ⁹ or LC-MS-based proteomics and metabolomics ¹⁰, that involve processing samples in different batches or on different platforms, which can result in differences in data quality and analysis performance. Batch effects in proteomics occur due to variability in sample preparation, instrument condition and performance, or environmental factors present during the sample preparation and analysis workflow. These technical variations may lead to false positive or false negative protein identifications, as well as inconsistencies in the quantification of protein abundances across batches, which can hamper the reproducibility and validity of the study ¹¹.

Common approaches to correct batch effects in LC-MS include Quality Control (QC) based methods (e.g., qc-rlsc ¹²), location-based methods (e.g., Combat ⁸) and matrix refactorization methods (e.g., harmony ¹³). These methods assume that an accurate new representation can be obtained using a generalized linear model, an assumption that does not necessarily hold. This issue can be addressed using Deep Neural Networks (DNN), which use a succession of nonlinear transformations ¹⁴. In DNN-based approaches, batch information is incorporated either as an input feature ^6,15,16, as a regularization term in the objective function ¹⁷, or as a vector added to the encoded representation of the input features, as in NormAE ^12–14. However, because of their overparameterization, DNNs usually require large amounts of data to be generalizable. Given the high number of parameters to train and the high number of possible architectures (there is no one-size-fits-all solution), DNNs can be resource-intensive and time-consuming to train.

To determine whether a biological signal has been preserved after batch effect correction, one can use classification performance: the higher the classification scores obtained on the corrected data, the more effective the correction. There are several metrics available to evaluate the extent to which the batch effect has been removed, with the goal being to determine an optimal balance between preserving biological variation and removing the batch effect ⁶. The success of batch effect removal is often evaluated based on the number of significant features using differential analysis methods ^11,18 or solely on the reduction of batch-mixing metrics ^10,19. Discrimination between classes under study using DNNs might be discredited because they are considered black boxes ²⁰ and the purpose is usually to find biomarkers for disorders that can already be diagnosed otherwise. Today, the problem of interpretability of DNNs has been thoroughly investigated and many options are available to break into the black box ^21,22, including methods specific to deep learning models, such as attribution methods (e.g., Integrated Gradients ²³ and DeepLIFT ²⁴), and model-agnostic approaches (such as LIME ²⁵ and SHAP ²⁶), all of which provide insight into how important each feature is for the model’s decision-making process. Not only can these methods indicate the average importance of each feature for a complete classification task, but they can also provide interpretable classification explanations at the sample level ²⁷.

In this context, we propose a suite of models to overcome these various batch correction problems, all based on autoencoders executable in parallel. Our approach to countering batch effects is different from most other solutions, as we do not rely on a single solution that we claim to be superior to all others. Instead, we acknowledge that not all problems require the same solution and propose multiple potential solutions to address batch effects. Thus, we aim to empower researchers to easily try multiple methods simultaneously and pick the optimal approach for their dataset and scientific question. Amongst this suite of models, we present the first use of Variational Autoencoders (VAE), Domain Adversarial Neural Networks (DANN) and Domain Inverse Triplet Loss (invTriplet) for batch correction in LC-MS. Additionally, in contrast to other batch correction methods, we do not recommend using the corrected output of the autoencoder for biomarker discovery through downstream analysis. Instead, we demonstrate how Shapley values ²⁶ can be used for biomarker discovery. Finally, our method simultaneously corrects batch effects and performs sample classification, making the method an end-to-end solution which ensures that batch effect removal improves classification.

3.1 Model descriptions

Our models, which we call Batch Effect Removal Neural Networks (BERNN), are composed of different modules, each with a different objective: the autoencoder, the batch classifier and the label classifier (Fig. 1; for more details, see Fig. S1). The autoencoder aims at finding representations, usually smaller than the inputs, that can be used to reconstruct the inputs, as can be seen in Fig. 1A and 1B. The autoencoder, by itself, can improve classification generalization by removing noise and by providing a smaller representation. To further remove batch effects, we use either an adversarial loss or a modified version of the triplet loss to find feature representations that cannot discriminate between batches. The adversarial strategy is already used by NormAE ¹⁸, but we use the Gradient Reversal Layer (GRL) to make the training more straightforward ²⁸. To our knowledge, all methods in the literature that use DNNs to deal with batch effects use an autoencoder ^5,15,16,18, but it is not mandatory (the original model in ²⁸ does not use one). The label classifier is not required to obtain a representation free of batch effects, but it is essential for model selection, as we define the objective of the model as obtaining the best classification scores in batches never seen during training. This objective ensures that the maximum biological information is preserved and should therefore always be used to improve the reliability of downstream analysis. To ensure generalizability across batches, we always validate our results on batches that were not used during training.
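The behaviour of a gradient reversal layer can be illustrated with a small framework-free sketch. This is not the BERNN implementation, which is not shown here; the class name `GradientReversal` and the strength parameter `lam` are hypothetical. It only shows the principle: the layer is the identity in the forward pass and flips the sign of the gradient in the backward pass.

```python
class GradientReversal:
    """Toy gradient reversal layer (illustrative names, not the BERNN code)."""

    def __init__(self, lam=1.0):
        # lam is a hypothetical name for the strength of the reversal.
        self.lam = lam

    def forward(self, x):
        # Encoded features reach the batch classifier unchanged.
        return list(x)

    def backward(self, grad_output):
        # The gradient of the batch classification loss is multiplied by -lam
        # before it reaches the encoder, so the encoder is updated to make
        # batches harder to predict while the batch classifier still learns.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
grad_from_batch_classifier = [0.1, -0.4, 2.0]

forward_out = grl.forward(features)
reversed_grad = grl.backward(grad_from_batch_classifier)
```

In an automatic differentiation framework the same idea would typically be written as a custom operation with its own backward rule, which lets a single optimizer step train both the encoder and the batch classifier.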

In addition to the methods used to obtain batch-free representations, we also used Variational Autoencoders (VAE) (see methods section 5.4.4). With recent technologies, datasets tend to have a very high number of features, but a much smaller number of samples. VAE enables the model to perform data augmentation in datasets that are highly dimensional, but with few samples. VAEs can reduce overfitting in overparameterized neural networks ²⁹, which is important for optimal performance.

Most batch effect correction methods return the corrected expression data, but the method of ³⁰ does not provide the corrected features: it creates an integrated embedding. All our methods also create a new embedded representation, but can also return a vector of the corrected data, obtained by reconstructing the original input from the new embedding using the decoder (the purple module in Fig. 1). The utility of these corrected data is not guaranteed, however, because the reconstruction loss (i.e., a measure of the error between the predicted and true values) is only one of multiple losses that constitute the final loss, and each individual loss can have its importance significantly reduced by a hyperparameter. When used in combination with a classification task, it is possible to get the features that had the most importance for the classification. Although BERNN is designed to optimize classification, it can also be used for biomarker discovery with SHAP ²⁶, which quantifies the contribution of each feature to the model predictions and thus identifies the most influential features as candidate biomarkers. BERNN provides insights into the relative importance of different features, enabling the identification of biomarkers for improved understanding and diagnosis in various biomedical applications. We ran a SHAP analysis (see Fig S2) as an example of how the features contributing the most to the decision of the models can be identified without the corrected original features.

3.2 Evaluation and selection

To get an accurate picture of the model performances benchmarked in this study, we used three datasets with different characteristics (Fig. 2). They are different in terms of number of features (889, 6461 and 17887), number of batches (3, 7, and 21) and the degree of batch effect (Adjusted Mutual Information (AMI) from 0.13 to 1.0). The differences in initial batch effects are also visually evident (Fig. 2B). These datasets, which have very different characteristics, achieved top performances with different models, supporting the concept that different problems might require different solutions. Models were evaluated on their classification performances using the accuracy (Fig S3) and Matthews Correlation Coefficient (MCC) (Fig. 3A, 4A, 5A). Because some datasets were highly imbalanced (more samples in one class than the other), we selected the top performing models based on their MCC scores. The MCC produces high scores only if predictions are good across all four confusion matrix categories (true positives, false negatives, true negatives and false positives) ³¹. Class imbalance was strongest in the Adenocarcinoma dataset, where 87.5% of the samples in the dataset were of the dominant class. As can be seen in Figure S3B, the best accuracy for this dataset is obtained using the raw data, because the model always predicts the dominant class. The main drawback of using MCC is its high sensitivity to misclassification of the samples in the minority class. When the imbalance is very high, misclassification of a single sample can have a big impact on the score. This can partly explain why the error bars for MCC were sometimes quite large. In comparison, error bars are smaller for accuracies (Fig S3).
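The difference between accuracy and MCC under class imbalance can be checked with a few lines of arithmetic. The sketch below uses the class counts reported for the Adenocarcinoma dataset (497 and 71 patients) and a hypothetical classifier that always predicts the dominant class. Returning 0 when the MCC denominator is zero is a common convention and is an assumption here.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is 0 when a row or column of the matrix is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# 1 = dominant class (497 samples), 0 = minority class (71 samples).
y_true = [1] * 497 + [0] * 71
always_dominant = [1] * len(y_true)

counts = confusion_counts(y_true, always_dominant)
acc_value = accuracy(*counts)
mcc_value = mcc(*counts)
```

Here the accuracy equals the proportion of the dominant class (497/568 = 0.875), while the MCC is 0, which is why model selection based on MCC penalizes a classifier that ignores the minority class.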

To evaluate the performance of batch effect correction, we used two main categories of metrics: batch mixing metrics and quality control (QC) metrics. To assess the batch mixing performance, we used Batch Entropy (BE), Adjusted Rand Index (ARI) and AMI (see Methods section 5.2.2). QC metrics measure how much the QC samples are different from each other. Because QC samples are the same for a given dataset, they should in theory be at the same position in a Euclidean space, and each of the features should have perfect Pearson Correlation Coefficients (PCC). The metrics nMED and aPCC are derived from these assumptions, respectively (see methods section 5.2.3). Note that in results presented in Figs. 3–5, for BERNN models, all the batch effect correction metrics are for models where the MCC was maximized. They could all possibly reach much better batch effect correction performance, but it is not the objective of the study.

3.3 Training scenarios

As explained in Fig. 1, there are two scenarios to train any BERNN. Both scenarios start with a warm-up phase (step 1), which uses the complete dataset (including the validation and test sets) to train all unsupervised components of the model, namely the autoencoder and the batch classifier (if an adversarial loss is used). In the first scenario, when the warm-up is over, the autoencoder is frozen and only the label classifier is updated using the training set (step 2a). In the second scenario, after the warm-up, the training alternates between step 1 and step 2b, which is the same as step 2a, except that the backpropagation is allowed to flow through the encoder.
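The two scenarios can be summarized as a schedule of which modules are updated at each step and on which data. The sketch below is only an illustration of the description above: the function name, the module names and the choice of one step 1 followed by one step 2b per epoch are assumptions, not details taken from the BERNN code.

```python
def training_schedule(scenario, n_warmup, n_epochs):
    """Return a list of (step, data, trainable modules), one entry per step."""
    if scenario not in (1, 2):
        raise ValueError("scenario must be 1 or 2")

    # The batch classifier is only present when an adversarial loss is used.
    unsupervised = ("encoder", "decoder", "batch_classifier")

    # Step 1 (warm-up): unsupervised modules, trained on all samples.
    schedule = [("1", "all", unsupervised)] * n_warmup

    for _ in range(n_epochs):
        if scenario == 1:
            # Step 2a: autoencoder frozen, only the label classifier learns.
            schedule.append(("2a", "train", ("label_classifier",)))
        else:
            # Step 2b alternates with step 1 (assumed: once each per epoch);
            # the classification gradient also reaches the encoder.
            schedule.append(("1", "all", unsupervised))
            schedule.append(("2b", "train", ("encoder", "label_classifier")))
    return schedule


first = training_schedule(scenario=1, n_warmup=2, n_epochs=2)
second = training_schedule(scenario=2, n_warmup=2, n_epochs=2)
```

Listing the schedule this way makes the key difference explicit: in the first scenario the encoder never appears among the trainable modules once the warm-up is over.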

The first scenario was used only for the Alzheimer dataset; the two others used the second scenario. When the second scenario is used with the Alzheimer dataset, overfitting is so severe that only random predictions are made for the validation and test sets. However, because the label classifier cannot influence the learned representation in the first scenario, it probably causes an underfitting of the data. It may be because of the data augmentation they provide that VAEs, particularly VAE-DANN, perform better on this dataset (Fig. 3A). On the Adenocarcinoma and Aging Mice datasets, the second scenario performed best. The first scenario caused too much underfitting, so the second scenario performed much better (Fig. 4A). In this particular case, the AE-based models performed much better than the VAE-based models, especially AE-invTriplet. For the Aging Mice dataset, all BERNN models were better than the other methods, but their performances were almost exactly the same, with only negligible differences (Fig. 5A). In this case, it is possible that the maximum performance was obtained even with the simplest AEs, leaving no possibility for improvement with more advanced methods. Thus, we found that no single model can claim to perform optimally for all three datasets analyzed, with some models being the best for a certain dataset while performing poorly on others.

3.4 Reducing batch effects can improve classification

We define the best model as the one with the best average classification performances over all repetitive holdout iterations (see methods section 5.5.1). Because we are interested in studying if the model learned generalizes to new batches, all the samples from a given batch must be contained within the same split. In standard cross-validation practices, the test set always remains the same. However, some batches are easier to classify in any given training set, which could explain the sometimes-large error bars in Fig. 3A, Fig. 4A and Fig. 5A. When using cross-validation, if the test set randomly comprises these “easy” classification batches, it may lead to much better performance in the test set than in the validation set. To avoid this potential confusion, we used a repetitive holdout method so that the test set is resampled for each holdout iteration.
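A minimal sketch of a repeated holdout in which whole batches are assigned to a single split is shown below. The function names, the number of held-out batches and the toy batch layout are assumptions made for illustration; only the two constraints come from the text above (a batch never spans two splits, and the test set is resampled at every iteration).

```python
import random


def batch_holdout_split(batch_ids, n_valid, n_test, rng):
    """Assign every sample index to train/valid/test, keeping batches intact."""
    batches = sorted(set(batch_ids))
    rng.shuffle(batches)
    test_batches = set(batches[:n_test])
    valid_batches = set(batches[n_test:n_test + n_valid])

    split = {"train": [], "valid": [], "test": []}
    for index, batch in enumerate(batch_ids):
        if batch in test_batches:
            split["test"].append(index)
        elif batch in valid_batches:
            split["valid"].append(index)
        else:
            split["train"].append(index)
    return split


def repeated_holdout(batch_ids, n_repeats, n_valid=1, n_test=1, seed=0):
    """Draw a fresh batch-level split, including the test set, per iteration."""
    rng = random.Random(seed)
    return [batch_holdout_split(batch_ids, n_valid, n_test, rng)
            for _ in range(n_repeats)]


# Toy layout: 7 batches of 4 samples each (hypothetical sizes).
batch_ids = [b for b in range(7) for _ in range(4)]
splits = repeated_holdout(batch_ids, n_repeats=5)
```

Averaging the scores over such iterations reduces the influence of a single "easy" or "hard" test batch on the reported performance.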

Most batch correction methods improve (decrease) batch entropy (BE). AMI and ARI are also consistently improved (decreased in value) in datasets with the greatest initial batch effects, but not in the Alzheimer dataset, which had moderate batch effects in the raw dataset. For every dataset, the best validation MCC score always used a transformation that improved BE (Fig. 3D, Fig. 4D & Fig. 5C). Although in the case of the adenocarcinoma dataset many VAE-based networks performed worse than a classification using the raw data (see methods section 5.5.4), VAE and AE networks are both the most efficient in at least one dataset. In all three datasets, all the best classification performances were reached using neural networks models that also significantly improved BE. It is also true that some of the methods that led to some of the best BE improvements had some of the worst MCC scores, sometimes far worse than when using the raw data. The best example of this is the Aging Mice dataset, where the classification scores using the raw data reached high MCC scores, despite having a very bad BE (Fig. 5). On the other hand, Combat and harmony greatly reduced batch effects at the cost of very bad MCC scores. This reduction can lead to the loss of important biological information, which could have negative consequences even in situations where classification is not a primary concern, such as differential quantification/expression analysis. When evaluating differences between samples from two conditions, classification performance provides evidence of the significance of a set of biomarkers identified to discriminate between the conditions.

In both datasets that used replicate QC samples (Alzheimer and Adenocarcinoma), the networks with the best classification performances had better nMED than the uncorrected data, but the networks that did best on these metrics did not necessarily result in the best classification metrics. Note that in Figs. 3–5, only the models with the best classification performances were retained. Many models that performed even better on every batch effect metric than the models kept were discarded due to poor classification performances.

3.5 AE outperforms all other methods

While it is not possible to confirm a single model as the best choice to improve classification, all the best MCCs were obtained using a version of our BERNNs. All of them performed almost identically on the Aging Mice dataset. On the Alzheimer dataset, VAE-DANN performed the best, followed closely by NormVAE and invTriplet-VAE. It is possible that the reverse triplet loss (revTriplet; for definition see supplementary methods), which did not perform as well as the other models developed in this study, would perform best in other datasets (For the complete results, see Fig S4-S7). In all datasets, one of revTriplet or invTriplet was always part of the best of the BERNN models for batch correction according to the batch entropy (Fig S4). Combat or harmony were often the best according to the three batch mixing metrics, but always at the cost of very poor classification performances. The triplet losses require an additional hyperparameter (called margin), which makes hyperparameter optimization more complex. It has been shown to have important consequences on optimization in scRNAseq where inverse triplet loss was also used to overcome batch effects ³⁰. It is possible they might require more hyperparameters optimization to achieve even better performances. Nevertheless, one of the invTriplet models was either the top performer (Fig. 4) or was always very close to the best performance (Fig. 3 and Fig. 5).

When batch effects hindered classification generalization, using one of BERNN’s models was always beneficial in all three datasets. However, when a batch effect is evident, but the model can generalize predictions in new batches, the course of action is less clear. The AgingMice dataset provides a good example of this scenario, as LinearSVC achieved very good classification scores with unnormalized data. Normalization methods that reduce batch effects or batch correction methods applied prior to training the model classifications often had some of the best batch mixing scores (Fig S7). However, they either reduced the efficiency of the classification tasks or barely made any difference in performance. The classification was significantly reduced by some methods, such as harmony or combat (excluding pycombat), indicating that the biological signal crucial for distinguishing between conditions was completely eliminated.

It is well known in machine learning that no single model can claim to be the best at solving every task. In this study, we propose a suite of deep learning architectures to enable users to find the optimal solution for different problems. This suite introduces batch correction in LC-MS using VAE-based models, the use of GRL for implementing a DANN and triplet losses, all of which were part of the best model in at least one dataset.

The inverse triplet loss is particularly interesting because it is the only batch-related loss that is effectively minimized. The other losses, which use the GRL, attempt to minimize the batch classification performance, but the loss itself increases until it reaches the value corresponding to a random classification of the batches. This property is particularly useful in the context of multitask learning ³², which requires all losses to be minimized to function properly. In this study, some models must optimize multiple losses simultaneously: the autoencoder reconstruction loss (1), the batch classification loss (2), the classification loss (3) and, for VAEs, the Kullback-Leibler loss (4). Each of these losses needs a hyperparameter to weight its importance for the model to be optimal. Adversarial models, for example, are notoriously difficult to train because of the fragile balance between the training of the discriminator and the generator ³³.
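The way the individual losses are combined can be sketched as a weighted sum. The weight names and the numerical values below are purely illustrative and are not the hyperparameters used in this study. For GRL-based models, the reversal of the batch term happens in the gradient, which this sketch does not represent.

```python
def total_loss(losses, weights):
    """Weighted sum of the individual loss terms (illustrative only)."""
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical loss values for one training step.
losses = {
    "reconstruction": 0.8,   # (1) autoencoder reconstruction loss
    "batch": 1.2,            # (2) batch classification loss
    "classification": 0.5,   # (3) label classification loss
    "kl": 2.0,               # (4) Kullback-Leibler loss, VAEs only
}

# Hypothetical weights; each one is a hyperparameter to tune.
weights = {"reconstruction": 1.0, "batch": 0.5, "classification": 1.0, "kl": 0.01}

combined = total_loss(losses, weights)
```

Setting a weight close to zero effectively removes the corresponding objective, which is why a low reconstruction weight can make the decoded output less reliable than the embedding.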

The bottleneck representation should be preferred to the reconstructed inputs, because the decoder cannot improve the representations. The reconstructions can only be as good as the bottleneck representations or, more likely, worse. Reconstruction is usually chosen because it is easier for biologists to interpret, as the goal of such studies is to identify features that can be used as biomarkers. If the end goal is not the classification in itself, but to identify biomarkers that can be used to search for new therapeutic compounds, then it makes sense to focus on denoised representations. We argue, however, that the bottleneck should be used in combination with other methods, like SHAP ²⁶ or LIME ³⁴ (Fig S2), that can identify the most useful features.

We found that the original NormAE available online had poor classification results, especially on the Alzheimer dataset (unpublished data). Instead of using the version available online, we reimplemented it in our suite of models, as it is very similar to the DANN method, except without the GRL (see methods 5.4.2 for more details), which we believe to be a crucial component of the success of the methods that use DANN. Because all models are trained with the same training scripts, we can make fair comparisons between the methods. Indeed, the original NormAE had more layers and a different training strategy than the ones we used. All methods should be trained using the same architectures so that the only difference between each method is the method itself, not the architecture. We were able to surpass the classification performance of the default configuration, which was developed on the adenocarcinoma dataset (we obtained the dataset from ³⁵, as ¹⁸ did not have its dataset available, but the dataset descriptions suggest they are the same dataset).

In future developments, the reasons why some models are better suited for a particular dataset will be studied. For example, the VAE-based models performed much better than our other models in the dataset that had the least batch effect and the most batches, whereas it was the opposite in the dataset with the most batch effect and least number of batches. This aspect will be further developed to get improved guidelines that could be used to reduce the amount of time needed to explore the architectures and have an idea from the start of what should work in a particular framework.

In conclusion, we propose a new tool, BERNN, that integrates multiple solutions to remove batch effects in LC-MS analyses while allowing an optimal classification of the biological samples. Using three different proteomic and metabolomic datasets, we benchmarked BERNN models against six other tools available in the literature and found that they outperformed them in all cases when considering not only the reduction of batch effects but also classification performances. Finally, unlike most batch correction tools, which provide a corrected version of the data, we rely on the encoded version for data classification. However, we demonstrated that approaches such as SHAP can be combined with BERNN to retrieve important features, enabling the discovery of potential biomarkers.

5.1 Datasets description

We are using three datasets with different levels of batch mixing heterogeneity and different numbers of batches to demonstrate how BERNN can apply to different scenarios. A summary of the three datasets is available in Fig. 2A.

5.1.1 Alzheimer Disease dataset

Cerebrospinal fluid (CSF) samples were obtained according to standardized collection and processing protocols through the Mass General Institute for Neurodegenerative Disease (MIND) biorepository, following written informed consent for research biobanking (IRB: 2015P000221). This cohort represents a clinically complex cohort spanning a wide variety of neurological disorders, which closely aligns with a real-life diagnostic situation. The raw data for this dataset is available on ProteomeXchange with accession number PXD043216.

CSF proteins were trypsin-digested prior to LC-MS/MS analysis on an Orbitrap Fusion Tribrid (Thermo Fisher Scientific) mass spectrometer operating in Data Independent Acquisition (DIA) mode. The 923 samples were injected in duplicate across 22 different batches. A QC sample was also generated by mixing a small aliquot of all CSF samples and analyzed in the same conditions at the beginning and at the end of each batch. This QC sample was also used to generate a Gas Phase Fractionation (GPF) library. The raw files were then processed with DIA-NN ³⁶ software (version 1.8.1) for protein identification and quantification. DIA-NN was used in two steps: 1) Library-free search on the GPF files using a Uniprot Reference Homo Sapiens database to generate a spectral library; 2) Library-based search on the sample and QC sample files using the spectral library generated in step one. The main report generated by DIA-NN was used with the DIA-NN R package ³⁷ to get quantifications of proteins corresponding to unique genes. Common contaminants were removed from the dataset to avoid any bias in the classification or in any subsequent analysis. A complete list of removed protein IDs is available in the supplementary content (Table S1).

The dataset consists of a total of 923 samples, with 84 QC samples and 839 samples that were obtained from 408 subjects, with 22 subjects having more than one sample from repeat clinic visits. The samples were distributed into 22 batches with an average of approximately 41 samples per batch (mean = 41.25, std = 15.18). The cohort was subdivided into 6 different disease classes: cognitively unimpaired patients (CU), Alzheimer’s Disease with dementia (DEM-AD), Alzheimer’s Disease with mild cognitive impairment (MCI-AD), dementia with causes other than Alzheimer’s disease (DEM-other), mild cognitive impairment with causes other than Alzheimer’s disease (MCI-other) and patients with Normal Pressure Hydrocephalus (NPH). Importantly, CU patients are not healthy controls, but all had clinical indication for lumbar puncture, and span a variety of other non-dementia diagnoses ⁵.

The batches are very heterogeneous with a similar total number of samples per batch, but different class composition across each batch (Fig S8). Classes are not fully balanced. The complete preprocessed dataset is available at https://github.com/spell00/BERNN_MSMS.

5.1.2 Adenocarcinoma dataset

The dataset is composed of a total of 642 samples, with 74 QC samples and 568 patients, comprising 497 colorectal cancer and 71 chronic enteritis patients. There were 192, 192, and 184 subject samples with 25, 25, and 24 QCs in three batches, respectively. The raw MS files were converted to mzXML using ProteoWizard ³⁸, then preprocessed using the R package XCMS. After data processing, the final dataset has 6461 metabolite peaks. More details on this dataset are available in the original paper ³⁵. The dataset is available at https://github.com/dengkuistat/WaveICA_2.0/tree/master/data .

5.1.3 Aging Mice dataset

This dataset was introduced by ³⁹. The dataset is made of a total of 372 samples, of which 171 received a high fat diet and 201 received a chow diet. There were only 3 QC samples, all in the same batch, and thus these samples were discarded. The samples were distributed into 7 batches with an average of 53 samples per batch (mean = 53.14, std = 26.91). Each sample has 17887 features that represent the peptides' precursors. The raw data for this Aging Mice dataset is available with ProteomeXchange accession number PXD009160.

To reconstruct the data matrix used in this study, first download the GitHub repository https://github.com/symbioticMe/batch_effects_workflow_code. Then, download the file http://ftp.pride.ebi.ac.uk/pride/data/archive/2021/11/PXD009160/E1801171630_feature_alignment_requant.tsv.gz and place it in the folder AgingMice_study/data_AgingMice/1_original_data of the batch_effects_workflow_code repository. Then, run the scripts 1a_prepare_sample_annotation.R, 1b_prepare_raw_proteome.R and 4b_peptide_correlation_raw_data.R to get the matrix of log precursor values used in this study.

5.2 Tools for evaluating batch effects

The tools we used to evaluate the presence of batch effect can be divided into 3 main categories: visual diagnostic using a dimensionality reduction technique, batch mixing metrics and quality control metrics. It is important to note that batch effects can be subtle and difficult to detect, and that different methods may identify different sources of variation in the data. Therefore, it is often recommended to use multiple methods and to carefully validate and interpret the results.

5.2.1 Visual diagnostic

The first category for evaluating batch effects is visual diagnostics (Fig. 2). This is usually done with methods such as PCA, UMAP ⁴⁰ or t-SNE ⁴¹. It is often how a batch effect is first noticed. A visually observable batch effect means that it is a major source of variance in the data. However, these visualizations can be incomplete, so the absence of a visually observable batch effect does not mean that it is nonexistent. For example, for the Alzheimer dataset, the batch effect is not as obvious as in other datasets. If only visual diagnostics are used, batch effects might go unnoticed.

5.2.2 Batch mixing metrics

All batch mixing tests use a classifier trained to predict which batch each sample is from. We used a k-nearest neighbors classifier with 20 neighbors to calculate the probability of a sample belonging to each batch. The batch with the highest probability was used as the prediction to calculate ARI and AMI. The probabilities themselves were used to calculate the batch entropy.

5.2.1.1 Batch Entropy

We can say there is no batch effect when it is impossible to accurately predict which batch a sample is drawn from. If the best prediction is random, there is no batch effect. To reach the maximum batch entropy (BE), the batch classifier should predict all possible batches as equally probable. For example, if there are 4 batches, the batch effect is at its lowest when the highest entropy is reached, which is when the batch classifier returns the vector [0.25, 0.25, 0.25, 0.25]. To calculate batch entropy, the probability of a given sample belonging to each of the possible batches is obtained using the relative frequency of each batch among its N nearest neighbors. The BE is given by Shannon’s entropy:

$$BE = H\left(B\right) = \sum_{k=1}^{K} p\left(B=k\right)\,\log\left(\frac{1}{p\left(B=k\right)}\right)$$

where $K$ is the number of batches and $p\left(B=k\right)$ is the predicted probability that the sample belongs to batch $k$. For BE, higher values mean better batch mixing. In order for the metric to be easily comparable to the other two batch mixing metrics, for which decreasing values indicate better batch mixing, we defined a metric we called normalized Batch Entropy (nBE), which is the maximum entropy (ME) value possible minus the BE, divided by ME. The maximum and minimum values of nBE are 1 and 0, respectively. In an experiment with K batches, the entropy is at its maximum, $\log\left(K\right)$, when $p\left(B=k\right) = 1/K$ for every batch, thus nBE is defined as:

$$nBE = \frac{\log\left(K\right) - BE}{\log\left(K\right)}$$
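The batch entropy computation described above can be sketched as follows. This is a minimal illustration, not the BERNN implementation: the function name `batch_entropy_metrics`, the averaging of the per-sample entropy over all samples, and the fact that each sample counts among its own neighbors are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def batch_entropy_metrics(X, batches, n_neighbors=20):
    """Return (BE, nBE) averaged over samples, from kNN batch frequencies."""
    batches = np.asarray(batches)
    n_batches = len(np.unique(batches))
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, batches)
    # Relative frequency of each batch among the nearest neighbors.
    probs = knn.predict_proba(X)
    # Shannon entropy per sample; terms with p = 0 contribute 0.
    safe = np.where(probs > 0, probs, 1.0)
    entropy = -(probs * np.log(safe)).sum(axis=1)
    be = float(entropy.mean())  # assumption: average over samples
    nbe = float((np.log(n_batches) - be) / np.log(n_batches))
    return be, nbe


# Toy data (hypothetical): two batches, either far apart or fully mixed.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
separated = rng.normal(size=(200, 5)) + 100.0 * labels[:, None]
mixed = rng.normal(size=(200, 5))
be_sep, nbe_sep = batch_entropy_metrics(separated, labels)
be_mix, nbe_mix = batch_entropy_metrics(mixed, labels)
```

On the separated toy data every neighbor comes from the sample's own batch, so the entropy is zero and nBE reaches its worst value; on the mixed data nBE should be much lower.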

5.2.1.2 Adjusted Rand index (ARI)

The Rand Index measures the similarity between two data clusterings: it is the fraction of sample pairs on which the two clusterings agree, i.e. pairs that are grouped together in both or separated in both. Values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than a random prediction). We use the Adjusted Rand Index (ARI), which is adjusted for chance. The variables compared are the batch predictions and the batches' true values. It is defined as follows:

$$ARI=\frac{\sum_{ij}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{a_{i}}{2}+\sum_{j}\binom{b_{j}}{2}\right]-\left[\sum_{i}\binom{a_{i}}{2}\sum_{j}\binom{b_{j}}{2}\right]/\binom{n}{2}}$$

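Where ${n}_{ij}$ is an entry of the contingency table between the true and predicted batches, ${a}_{i}$ and ${b}_{j}$ are its row and column sums, and $n$ is the total number of samples.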

5.2.1.3 Adjusted Mutual Information (AMI)

Mutual Information measures the information shared between two variables, i.e. their dependence; in this case, the two variables are discrete. As for ARI, values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than a random prediction). The variables compared are the batch predictions and the batches' true values. AMI adjusts the mutual information for chance using its expected value between two random clusterings $U$ and $V$:

$$E\left[MI\left(U,V\right)\right]={\sum }_{i=1}^{R}{\sum }_{j=1}^{C}{\sum }_{{n}_{ij}={\left({a}_{i}+{b}_{j}-N\right)}^{+}}^{\text{min}\left({a}_{i},{b}_{j}\right)}\frac{{n}_{ij}}{N}\text{log}\left(\frac{N\cdot {n}_{ij}}{{a}_{i}{b}_{j}}\right)\times \frac{{a}_{i}!{b}_{j}!\left(N-{a}_{i}\right)!\left(N-{b}_{j}\right)!}{N!{n}_{ij}!\left({a}_{i}-{n}_{ij}\right)!\left({b}_{j}-{n}_{ij}\right)!\left(N-{a}_{i}-{b}_{j}+{n}_{ij}\right)!}$$

Where ${n}_{ij}$, ${a}_{i}$, ${b}_{j}$ are values from the contingency table. $U$ and $V$ are the two clusterings being compared, with $R$ and $C$ clusters respectively, both over $N$ elements.
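As an illustration of how ARI and AMI can be obtained from the batch predictions of a KNN classifier, here is a short sketch using scikit-learn. The helper name `batch_clustering_scores` and the toy data are hypothetical, and predicting on the same samples used to fit the classifier is an assumption of the example.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier


def batch_clustering_scores(X, batches, n_neighbors=20):
    """Return (ARI, AMI) between true batches and kNN batch predictions."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X, batches)
    predicted = knn.predict(X)  # batch with the highest neighbor frequency
    ari = adjusted_rand_score(batches, predicted)
    ami = adjusted_mutual_info_score(batches, predicted)
    return ari, ami


# Toy data (hypothetical): two batches, either far apart or fully mixed.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
separated = rng.normal(size=(200, 5)) + 100.0 * labels[:, None]
mixed = rng.normal(size=(200, 5))
ari_sep, ami_sep = batch_clustering_scores(separated, labels)
ari_mix, ami_mix = batch_clustering_scores(mixed, labels)
```

Both scores should be close to 1 for the separated batches and close to 0 when the batch label carries no information about the features.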

5.2.3 Quality control metrics

Two of the datasets used in this study contain QC samples that were systematically analyzed with each batch of analyses. Because the QC is always the same sample, its features should be identical across batches, so we can measure how much they diverge and use these metrics to quantify the magnitude of the batch effect. These metrics were (to our knowledge) introduced by ¹⁸.

5.2.1.4 Average Pearson Correlation Coefficient (aPCC)

Because the QC is always exactly the same sample, all QC measurements should theoretically be perfectly correlated, which means a Pearson Correlation Coefficient (PCC) of 1. All pairs of QC samples are compared, and the final value is the average over all pairwise PCCs.
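A minimal sketch of the aPCC computation, assuming the QC samples are stored as the rows of a matrix; the function name `average_pcc` and the toy values are hypothetical.

```python
import numpy as np


def average_pcc(qc):
    """Average Pearson correlation over all pairs of QC samples (rows)."""
    qc = np.asarray(qc, dtype=float)
    corr = np.corrcoef(qc)  # rows are treated as variables, i.e. samples
    upper = np.triu_indices_from(corr, k=1)  # each pair counted once
    return float(corr[upper].mean())


# Toy QC matrix: three runs that are affine rescalings of the same profile.
base = np.arange(10, dtype=float)
qc = np.stack([base, 2.0 * base + 1.0, 3.0 * base])
apcc = average_pcc(qc)
```

Because the three toy runs differ only by a scale and an offset, every pairwise PCC equals 1 and so does their average.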

5.2.1.5 Normalized Median Euclidean Distance (nMED)

If batch correction is efficient, all QC samples should be very close to each other. The Euclidean distance is used to measure how far the samples of each pair are from one another. Instead of using the average, like in ¹⁸, the median is used because it is less affected by aberrant values. Unlike ¹⁸, we also normalize the value by dividing it by the median Euclidean distance of all non-QC samples. If a transformation makes the QC samples very close to each other, but non-QC samples are equally close to each other, then the transformation did not actually alleviate the batch effect. For this reason, nMED should be preferred to the average Euclidean distance proposed in ²⁶.
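The nMED described above can be sketched as the ratio of two medians of pairwise distances. The function name `normalized_med` and the toy coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist


def normalized_med(qc, non_qc):
    """Median pairwise Euclidean distance of QCs over that of non-QC samples."""
    qc_median = np.median(pdist(qc))  # pdist defaults to Euclidean distance
    non_qc_median = np.median(pdist(non_qc))
    return float(qc_median / non_qc_median)


# Toy example: QC runs are tightly grouped, study samples are spread out.
qc = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])        # distances 1, 2, 1
non_qc = np.array([[0.0, 0.0], [0.0, 10.0], [0.0, 20.0]])  # distances 10, 20, 10
nmed = normalized_med(qc, non_qc)
```

Here the median QC distance is 1 and the median non-QC distance is 10, so the ratio is 0.1; a transformation that shrinks all distances equally would leave this ratio unchanged.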

5.3 Batch Effect Removal methods

We tried normalizing each dataset using three different methods: minmax, standard and robust standardization. The first batch effect correction methods we tried were the same three normalization methods applied to each batch individually, which we named minmax_per_batch, standard_per_batch and robust_per_batch. The latter two have more potential to remove batch effects, as they standardize the values within each batch (standard_per_batch yields z-scores, forcing each batch to have a mean of 0 and unit variance). We also used ComBat ⁸ and Harmony ¹³, as they are popular methods in microarrays/RNAseq and scRNAseq respectively. ComBat is also used to remove batch effects from LC-MS datasets ⁴². We used two implementations of ComBat: an R version (https://rdrr.io/bioc/sva/man/ComBat.html) and a Python version (https://github.com/epigenelabs/pyComBat), which we named pycombat in this manuscript. Intriguingly, the two implementations had very different outcomes.
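To make the per-batch normalization concrete, here is a sketch of what standard_per_batch could look like. It is an illustration under assumptions rather than the code used in the study: the function body, the handling of constant features and the toy data are all hypothetical.

```python
import numpy as np


def standard_per_batch(X, batches):
    """Z-score every feature separately within each batch."""
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    out = np.empty_like(X)
    for b in np.unique(batches):
        mask = batches == b
        mean = X[mask].mean(axis=0)
        std = X[mask].std(axis=0)
        std[std == 0] = 1.0  # assumption: leave constant features unscaled
        out[mask] = (X[mask] - mean) / std
    return out


# Toy data: the second batch is shifted and rescaled relative to the first.
rng = np.random.default_rng(0)
batches = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 4))
X[batches == 1] = 3.0 * X[batches == 1] + 10.0
Z = standard_per_batch(X, batches)
```

After the transformation, each batch has a mean of 0 and unit variance for every feature, which removes the shift and scale differences of the toy example but not more complex batch effects.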

We also used WaveICA ²⁵ and NormAE ¹⁸, as they are state-of-the-art methods in batch effect correction of LC-MS data. We used our own implementation of NormAE, which has a slightly different architecture. We reduced the number of layers to a single hidden layer before and after the bottleneck to make it comparable with our own models. Unlike NormAE, we consider the number of neurons in each layer to be hyperparameters that are optimized. To give it a fair chance to outperform our own methods, we also used the same hyperparameter optimization as for our models.

5.4 Autoencoders

All the models in BERNN are implemented using PyTorch and are based on autoencoders. In short, autoencoders are composed of an encoder and a decoder. The encoder turns the inputs into embeddings (also referred to as the bottleneck of the autoencoder), which are usually smaller than the inputs, but not necessarily. The objective of the autoencoder is to obtain new representations and to reconstruct the original inputs from them as well as possible. The embeddings should then contain as much information as possible from the inputs, without the unnecessary noise. To find the best performing model on a given dataset, a total of 10 models can be trained using BERNN. For a complete representation of all the possible models that can be trained using BERNN, see Fig S1. Not all models are shown in the main results, to keep them readable, but the complete results are available in Figures S4-S6.

5.4.1 Reconstruction with batch mapping

The first method that is implemented to obtain batch-free representations is to add to the embedding of the autoencoder a vector of the same size representing the batch ID (Fig S9). This was implemented in NormAE, although not mentioned in the paper ¹⁸. It makes it possible to obtain better reconstruction loss by adding the batch information into the vector for the reconstruction. Because the batch ID is contained in this vector added to the embedding, the latter does not need to contain information about the batch. A similar method is used to get batch-free representations in scRNAseq, such as in ¹⁴. In this case, the batch ID is directly appended to the bottleneck representation of the variational autoencoder.
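The idea of adding a batch vector to the embedding before decoding can be sketched in PyTorch as below. This is a hypothetical minimal model, not the BERNN architecture: the class name, the layer sizes and the use of `nn.Embedding` to map a batch ID to a vector of the embedding's size are assumptions.

```python
import torch
from torch import nn


class BatchMappedAutoencoder(nn.Module):
    """Autoencoder whose decoder also receives a learned batch vector."""

    def __init__(self, n_features, n_batches, n_hidden=64, n_latent=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_latent)
        )
        # One learned vector per batch, same size as the embedding (assumption).
        self.batch_map = nn.Embedding(n_batches, n_latent)
        self.decoder = nn.Sequential(
            nn.Linear(n_latent, n_hidden), nn.ReLU(), nn.Linear(n_hidden, n_features)
        )

    def forward(self, x, batch_ids):
        z = self.encoder(x)  # embedding, meant to be batch-free
        x_hat = self.decoder(z + self.batch_map(batch_ids))  # batch info added here
        return z, x_hat


model = BatchMappedAutoencoder(n_features=100, n_batches=3)
x = torch.randn(8, 100)
batch_ids = torch.randint(0, 3, (8,))
z, x_hat = model(x, batch_ids)
loss_rec = nn.functional.mse_loss(x_hat, x)
```

Only the decoder sees the batch vector, so the reconstruction can use batch information while the embedding `z` is the representation passed on to downstream classifiers.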

5.4.2 Domain Adversarial Neural Network (DANN)

The Domain-Adversarial Neural Network (DANN) is a type of deep learning algorithm that enables a model trained on one domain to be adapted to another related domain with different characteristics, allowing it to perform better on the target domain. DANN achieves this by learning to extract domain-invariant features from the input data ²⁸. In this case, we define batches as different domains. The original DANN was developed to adapt the learning from a single domain to another one, so our work is more akin to ⁴³, which extends domain adaptation to multiple domains. Our AE-DANN model is represented in Fig S1. The loss of AE-DANN is the following:

$$\min_{D,E}\,\min_{F_{b}} V\left(D,E,F_{b}\right)=\text{loss}_{\text{rec}}\left(x,D\left(E\left(x\right),y^{b}\right)\right)+\lambda^{b}\,\text{loss}_{\text{disc}\_\text{b}}\left(F_{b}\left(E\left(x\right)\right),y^{b}\right)$$

The batch discriminator loss, loss_disc_b, is minimized, but the training is adversarial because the Gradient Reversal Layer (GRL) reverses the gradient of this loss before it reaches the encoder.
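A Gradient Reversal Layer is commonly written in PyTorch as a custom autograd function that is the identity in the forward pass and flips the sign of the gradient in the backward pass. The sketch below follows that common pattern; the class name and the scaling factor `lambda_b` are illustrative, and this is not necessarily how BERNN implements it.

```python
import torch


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambda_b):
        ctx.lambda_b = lambda_b
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; lambda_b itself receives none.
        return -ctx.lambda_b * grad_output, None


# The embedding would normally come from the encoder; ones are used here.
embedding = torch.ones(4, 3, requires_grad=True)
reversed_embedding = GradientReversal.apply(embedding, 0.5)
reversed_embedding.sum().backward()
```

Placing this layer between the encoder and the batch discriminator lets a single optimizer minimize the sum of all losses: the discriminator learns to predict the batch, while the encoder receives the opposite gradient and learns to hide it.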

NormAE is also similar in nature to a DANN, except it does not use the GRL. Using the GRL is advantageous, because the total loss is composed of losses that are all minimized and added together. When not using the GRL, as with NormAE, the adversarial loss is maximized, and it is subtracted from the other losses being optimized simultaneously.

$$\min_{D,E}\,\max_{F_{b}} V\left(D,E,F_{b}\right)=\text{loss}_{\text{rec}}\left(x,D\left(E\left(x\right),y^{b}\right)\right)-\lambda^{b}\,\text{loss}_{\text{disc}\_\text{b}}\left(F_{b}\left(E\left(x\right)\right),y^{b}\right)$$

Using this definition, if the second term of the equation becomes too large, the loss could become negative, which should not be allowed to happen.

5.4.3 Inverse Triplet Loss

The inverse triplet loss is like the normal Triplet Loss (defined in the section Reverse Triplet Loss of the supplementary material), but the positive and negative samples are swapped: the negative samples take the place of the positive samples in the triplet loss equation, and vice versa.

$$\mathcal{L}_{invTriplet}\left(A,P,N\right)=\max\left(\lVert f\left(A\right)-f\left(N\right)\rVert_{2}-\lVert f\left(A\right)-f\left(P\right)\rVert_{2}+\alpha ,0\right)$$

A is the anchor input, P is any positive input from the same batch as A, N is any negative input from a different batch than A, α is the margin between positive and negative pairs and f is the embedding obtained by passing the inputs through the encoder of the autoencoder. Using the normal triplet loss would result in samples from the same batch clustering together and samples from different batches being far away from each other. The distance between the clusters is controlled by the hyperparameter α. The inverse triplet loss does the opposite by swapping the positive and negative samples in the equation, which encourages batch-free representations. The samples from different batches get closer, while samples from the same batch are pushed further apart. The latter objective is used to prevent all samples from collapsing: if the samples from the same batch were not pushed apart, the loss would be optimal if all samples were transformed into the exact same value, which is not the desired outcome. The distance minimized in this case is the Euclidean distance, but any distance could be used.
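The inverse triplet loss can be written directly from its equation. The following is a small sketch that assumes the embeddings have already been computed by the encoder; the function name, the default margin and the averaging over the mini-batch are assumptions.

```python
import torch


def inverse_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss with the positive and negative distances swapped."""
    d_neg = torch.linalg.norm(anchor - negative, dim=1)  # other-batch sample
    d_pos = torch.linalg.norm(anchor - positive, dim=1)  # same-batch sample
    # Low when other-batch samples are closer to the anchor than same-batch ones.
    return torch.clamp(d_neg - d_pos + margin, min=0).mean()


anchor = torch.zeros(1, 2)
far = torch.tensor([[3.0, 4.0]])  # Euclidean distance 5 from the anchor

# Other-batch sample on top of the anchor, same-batch sample far away.
loss_low = inverse_triplet_loss(anchor, positive=far, negative=anchor)
# Same-batch sample on top of the anchor, other-batch sample far away.
loss_high = inverse_triplet_loss(anchor, positive=anchor, negative=far)
```

In the first configuration the loss is max(0 - 5 + 1, 0) = 0, and in the second it is max(5 - 0 + 1, 0) = 6, which shows that the loss penalizes embeddings in which batches are separated.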

5.4.4 Variational Autoencoders

The variational autoencoder (VAE) is a probabilistic generative model based on the variational Bayes approach. To train a VAE, we want to optimize the lower bound defined in Eq. 3 of ⁴⁴:

$$\mathcal{L}\left(\theta ,\varphi ;x\right)=-D_{KL}\left(q_{\varphi}\left(z\mid x\right)\,\|\,p_{\theta}\left(z\right)\right)+\mathbb{E}_{q_{\varphi}\left(z\mid x\right)}\left[\log p_{\theta}\left(x\mid z\right)\right]$$

Where D_KL is the Kullback-Leibler divergence, φ represents the variational parameters (encoder parameters) and θ the generative parameters (decoder parameters). The Kullback-Leibler divergence pushes the variational posterior $q_{\varphi}\left(z\mid x\right)$ to resemble the prior $p_{\theta}\left(z\right)$, which is the unit normal distribution. Both the label and batch classifiers use the reparametrized variable z as input. The VAE is also optionally combined with a DANN, which is also trained on z to make sure the new representations are free of batch effects. All the different AEs listed above have also been implemented as VAEs, including NormAE.
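For a diagonal Gaussian posterior and a unit normal prior, the Kullback-Leibler term has a closed form and z is obtained with the reparametrization trick. The sketch below assumes that the encoder outputs a mean `mu` and a log-variance `log_var` per latent dimension, which is the usual parametrization but is not detailed in the text; the function names are hypothetical.

```python
import torch


def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps drawn from a unit normal."""
    std = torch.exp(0.5 * log_var)
    return mu + std * torch.randn_like(std)


def kl_to_unit_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), one value per sample."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)


mu = torch.tensor([[1.0, 0.0]])
log_var = torch.zeros(1, 2)  # unit variance in both latent dimensions
z = reparameterize(mu, log_var)
kl = kl_to_unit_normal(mu, log_var)
kl_at_prior = kl_to_unit_normal(torch.zeros(1, 2), torch.zeros(1, 2))
```

With unit variance, the divergence reduces to half the squared norm of the mean, so it is 0.5 for the example above and 0 when the posterior equals the prior.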

5.5 Training strategies

5.5.1 Repetitive holdout

Repetitive holdout is a method to evaluate the performance of a model on a dataset. It is similar to cross-validation; however, each split is random, there is no limit to the number of times it can be repeated on a dataset, and the test set is resampled at every holdout iteration. Resampling the test set is particularly important, because some batches can be much easier to classify than others, which can make the test set classification much better or much worse than that of the validation set. Using repetitive holdout makes classification on the validation and test sets comparable.

When splitting the dataset, all the samples from a given batch must be contained in the same split. This lets us assess how well a model generalizes when making predictions in a new batch.
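One way to draw repeated random splits that keep each batch within a single split is scikit-learn's `GroupShuffleSplit`, with the batch used as the group. The sketch below is an assumption about how such a split could be done, not the BERNN code; the generator name and the choice of one held-out batch for the validation and test sets are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit


def repetitive_batch_holdout(batches, n_repeats=5, n_holdout_batches=1, seed=0):
    """Yield (train, valid, test) indices; a batch never spans two splits."""
    batches = np.asarray(batches)
    indices = np.arange(len(batches))
    outer = GroupShuffleSplit(
        n_splits=n_repeats, test_size=n_holdout_batches, random_state=seed
    )
    for i, (rest, test) in enumerate(outer.split(indices, groups=batches)):
        # Split the remaining batches again to get a validation set.
        inner = GroupShuffleSplit(
            n_splits=1, test_size=n_holdout_batches, random_state=seed + i
        )
        train, valid = next(inner.split(rest, groups=batches[rest]))
        yield rest[train], rest[valid], test


# Toy layout: 7 batches of 10 samples each.
batches = np.repeat(np.arange(7), 10)
splits = list(repetitive_batch_holdout(batches))
```

Each repetition resamples which batches form the validation and test sets, so both are evaluated on batches that the model did not see during training.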

5.5.2 Class imbalance

Class imbalance has a negative impact on the predictive abilities of machine learning models. If nothing is done about it, the model might learn to only predict the majority class. This is especially true if the imbalance is very large. It is also a concern if the problem is very hard to model. PyTorch’s WeightedRandomSampler is used to counter class imbalances in datasets by giving more weight to samples from minority classes during training.
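The following sketch shows one way to build such a sampler. Weighting each sample by the inverse frequency of its class is an assumption (the text only states that minority classes receive more weight), and the helper name `make_balanced_sampler` is hypothetical.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(labels):
    """Sampler drawing each class with roughly equal probability."""
    labels = np.asarray(labels)
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    # Assumption: weight of a sample = 1 / (number of samples in its class).
    sample_weights = 1.0 / counts[inverse]
    return WeightedRandomSampler(
        sample_weights.tolist(), num_samples=len(labels), replacement=True
    )


# Toy labels: 90 samples of the majority class, 10 of the minority class.
labels = [0] * 90 + [1] * 10
sampler = make_balanced_sampler(labels)
drawn = list(sampler)  # indices for one epoch, sampled with replacement
```

The sampler would then be passed to a `DataLoader` through its `sampler` argument, so that minority-class samples are drawn about as often as majority-class samples in each epoch.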

5.5.3 BERNN Hyperparameter Optimization

We used the function optimize from the package ax-platform (https://pypi.org/project/ax-platform/) to perform a Bayesian optimization of the hyperparameters for each of the BERNN models (all implemented in PyTorch). We used 20 combinations of hyperparameters to optimize each model. The hyperparameters optimized are the following: learning rate, weight decay, dropout, number of warmup epochs, layer1, layer2, label smoothing parameter, triplet loss margin (when it applies), beta (controls the strength of KL divergence, when it applies), gamma (controls the strength of the adversarial or triplet loss, when it applies). The batch size is set to 32 and the number of epochs after warmup is set to 1000, but the training is stopped if no improvements are made in the last 100 epochs. Models were trained on Nvidia RTX3090 GPUs.

5.5.4 Classification with non-BERNN representations

For the classification of raw data, or of data corrected by any batch correction method that does not involve a DNN, we used either a Random Forest Classifier (RFC) or a Linear Support Vector Machine (LinearSVM) from the Python package scikit-learn. To deal with class imbalance in some datasets, the parameter class_weight was set to “balanced” for both models, which automatically adjusts the weights inversely proportional to class frequencies. The hyperparameters optimized for the RFC were min_samples_split, min_samples_leaf, n_estimators, criterion, and oob_score. For the LinearSVM, we optimized tol, max_iter, penalty and C. The hyperparameters for the RFC and LinearSVM models were optimized using the package scikit-optimize (https://scikit-optimize.github.io/stable/) to execute a Bayesian Optimization.
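As a brief illustration of the balanced class weighting in scikit-learn, the sketch below instantiates both classifiers and prints nothing; it only exposes the weights that the "balanced" mode would assign to a toy label vector. The toy labels and the remaining default hyperparameters are assumptions, and `LinearSVC` is used here as the scikit-learn class for a linear SVM.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 90 samples of the majority class, 10 of the minority class.
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)

# Both models accept the same option; other hyperparameters are left at defaults.
rfc = RandomForestClassifier(class_weight="balanced", random_state=0)
svm = LinearSVC(class_weight="balanced")
```

For these toy labels the weights are 100 / (2 × 90) for the majority class and 100 / (2 × 10) = 5 for the minority class, so errors on the minority class count nine times more.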

5.5.5 Model interpretability

For model explanation purposes and to identify the most important features for the classification, we used SHAP²⁶ to produce an example analysis. The advantage of that approach is that it makes it possible to explain the decision made on individual samples and could be used for precision medicine. It would allow identification of complex patterns that apply only to a subset of the samples that could not be identified by differential analysis.

Data availability

All the data necessary to reproduce the experiments can be found in the repository https://github.com/spell00/BERNN_MSMS.

Code availability

Notebooks and python scripts used for data visualization, batch effect correction and classification are available at https://github.com/spell00/BERNN_MSMS

Author contributions

SP conceived the study, conducted literature analysis, conceived the models, designed the methodology, wrote all the code, conducted all the experiments, analyzed the results and wrote the draft manuscript.

SP, FP and ML contributed to the design of the methodology.

SP, ML, FRD, FP, BC, MBG, AD reviewed, commented, and revised the final manuscript.

SL, BC, MBG, WW, TTL generated the LC-MS/MS data for the Alzheimer Dataset.

FRD processed the raw data for the Alzheimer dataset.

ACN, SEA, BC acquired the funding, designed the study, and supervised acquisition of the Alzheimer dataset.

AD, ML, FRD and FP supervised the study.

Acknowledgments

AD research laboratory is supported by Research and Innovation Chair L’Oréal in Digital Biology. The Orbitrap Fusion mass spectrometer utilized was supported in part by NIH SIG grants S10OD018034 and Yale School of Medicine. Funding for this project came from NIH awards AG062421 to SEA, AG062306 to SEA, BCC & ACN, and AG066508 to ACN. BCC is supported by an Alzheimer's Research UK Senior Research Fellowship, and by the Bright Focus Foundation.

Conflicts of interest

S. Arnold has received honoraria and/or travel expenses for lectures from Abbvie, Eisai, and Biogen and has served on scientific advisory boards of Corte, has received consulting fees from Athira, Cassava, Cognito Therapeutics, EIP Pharma and Orthogonal Neuroscience, and has received research grant support from NIH, Alzheimer’s Association, Alzheimer’s Drug Discovery Foundation, Abbvie, Amylyx, EIP Pharma, Merck, Janssen/Johnson & Johnson, Novartis, and vTv. S.N. Leslie is a current employee of Janssen Pharmaceuticals. B. Carlyle has received grant funding from Ono Pharmaceutical. Other authors report no conflicts of interest.

Banerjee, S. Empowering Clinical Diagnostics with Mass Spectrometry. ACS Omega 5, 2041–2048 (2020).
de Fátima Cobre, A. et al. Diagnosis and prognosis of COVID-19 employing analysis of patients’ plasma and serum via LC-MS and machine learning. Comput Biol Med 146, (2022).
Califf, R. M. Biomarker definitions and their applications. Exp Biol Med 243, 213 (2018).
Han, W. & Li, L. Evaluating and minimizing batch effects in metabolomics. Mass Spectrom Rev 41, 421–442 (2022).
Niu, J., Yang, J., Guo, Y., Qian, K. & Wang, Q. Joint deep learning for batch effect removal and classification toward MALDI MS based metabolomics. BMC Bioinformatics 23, 1–19 (2022).
Li, H., McCarthy, D. J., Shim, H. & Wei, S. Trade-off between conservation of biological variation and batch effect removal in deep generative modeling for single-cell transcriptomics. BMC Bioinformatics 23, 1–22 (2022).
Zheng, H., Petrella, J. R., Doraiswamy, P. M., Lin, G. & Hao, W. Data-driven causal model discovery and personalized prediction in Alzheimer’s disease. npj Digital Medicine 5, 1–12 (2022).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology 36, 421–427 (2018).
Čuklina, J. et al. Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial. Mol Syst Biol 17, e10240 (2021).
Liu, Q. et al. Addressing the batch effect issue for LC/MS metabolomics data in data preprocessing. Scientific Reports 10, 1–13 (2020).
Dunn, W. B. et al. Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry. Nature Protocols 6, 1060–1083 (2011).
Korsunsky, I. et al. Fast, sensitive, and flexible integration of single cell data with Harmony. bioRxiv 461954 (2018) doi:10.1101/461954.
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat Methods 15, 1053–1058 (2018).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nature Methods 15, 1053–1058 (2018).
Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol Syst Biol 17, e9620 (2021).
Amodio, M. et al. Exploring single-cell data with deep multitasking neural networks. Nat Methods 16, 1139–1145 (2019).
Rong, Z. et al. NormAE: Deep Adversarial Learning Model to Remove Batch Effects in Liquid Chromatography Mass Spectrometry-Based Metabolomics Data. Anal Chem 92, 5082–5090 (2020).
Sánchez-Illana, Á. et al. Evaluation of batch effect elimination using quality control replicates in LC-MS metabolite profiling. Anal Chim Acta 1019, 38–48 (2018).
Kang, Y., Vijay, S. & Gujral, T. S. Deep neural network modeling identifies biomarkers of response to immune-checkpoint therapy. iScience 25, 104228 (2022).
Savage, N. Breaking into the black box of artificial intelligence. Nature (2022) doi:10.1038/D41586-022-00858-1.
Sheu, Y. H. Illuminating the Black Box: Interpreting Deep Neural Network Models for Psychiatric Research. Front Psychiatry 11, 1091 (2020).
Sundararajan, M., Taly, A. & Yan, Q. Axiomatic Attribution for Deep Networks. (2017).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning Important Features Through Propagating Activation Differences.
Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why should i trust you?’ Explaining the predictions of any classifier. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 13-17-August-2016, 1135–1144 (2016).
Lundberg, S. M., Allen, P. G. & Lee, S.-I. A Unified Approach to Interpreting Model Predictions.
Roder, J., Maguire, L., Georgantas, R. & Roder, H. Explaining multivariate molecular diagnostic tests via Shapley values. BMC Med Inform Decis Mak 21, 1–18 (2021).
Ganin, Y. et al. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research 17, 1–35 (2016).
Huang, Q., Qiao, C., Jing, K., Zhu, X. & Ren, K. Biomarkers identification for Schizophrenia via VAE and GSDAE-based data augmentation. Comput Biol Med 146, (2022).
Simon, L. M., Wang, Y. Y. & Zhao, Z. Integration of millions of transcriptomes using batch-aware triplet neural networks. Nat Mach Intell 3, 705–715 (2021).
Chicco, D. & Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 1–13 (2020).
Kendall, A., Gal, Y. & Cipolla, R. Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics.
Saxena, D. & Cao, J. Generative Adversarial Networks (GANs). ACM Computing Surveys (CSUR) 54, (2021).
Ribeiro, M. T., Singh, S. & Guestrin, C. ‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier.
Deng, K. et al. WaveICA 2.0: a novel batch effect removal method for untargeted metabolomics data without using batch information. Metabolomics 17, (2021).
Demichev, V., Messner, C. B., Vernardis, S. I., Lilley, K. S. & Ralser, M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nature Methods 17, 41–44 (2019).
GitHub - vdemichev/diann-rpackage: Report processing and protein quantification for MS-based proteomics. https://github.com/vdemichev/diann-rpackage.
Adusumilli, R. & Mallick, P. Data Conversion with ProteoWizard msConvert. Methods Mol Biol 1550, 339–368 (2017).
Williams, E. G. et al. Multiomic profiling of the liver across diets and age in a diverse mouse population. Cell Syst 13, 43–57.e6 (2022).
McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. (2018) doi:10.48550/arxiv.1802.03426.
van der Maaten, L. & Hinton, G. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 2579–2605 (2008).
Čuklina, J. et al. Diagnostics and correction of batch effects in large-scale proteomic studies: a tutorial. Mol Syst Biol 17, (2021).
Sebag, A. S. et al. MULTI-DOMAIN ADVERSARIAL LEARNING.
Kingma, D. P. & Welling, M. Auto-Encoding Variational Bayes. 2nd International Conference on Learning Representations, ICLR 2014 - Conference Track Proceedings (2013).

