Enhancing Classification of Liquid Chromatography Mass Spectrometry Data with Batch Effect Removal Neural Networks (BERNN)

Liquid Chromatography Mass Spectrometry (LC-MS) is a powerful method for profiling complex biological samples. However, batch effects typically arise from differences in sample processing protocols, experimental conditions and data acquisition techniques, significantly impacting the interpretability of results. Correcting batch effects is crucial for the reproducibility of proteomics research, but current methods are not optimal for removal of batch effects without compressing the genuine biological variation under study. We propose a suite of Batch Effect Removal Neural Networks (BERNN) to remove batch effects in large LC-MS experiments, with the goal of maximizing sample classification performance between conditions. Importantly, these models must generalize efficiently to batches not seen during training. Comparison of batch effect correction methods across three diverse datasets demonstrated that BERNN models consistently showed the strongest sample classification performance. However, the model producing the greatest classification improvements did not always perform best in terms of batch effect removal. Finally, we show that overcorrection of batch effects resulted in the loss of some essential biological variability. These findings highlight the importance of balancing batch effect removal with the preservation of valuable biological diversity in large-scale LC-MS experiments.

To determine whether a biological signal has been preserved after batch effect correction, one can use classification performance: the highest scores indicate the most effective correction. Several metrics are available to evaluate the extent to which the batch effect has been removed, the goal being to determine an optimal balance between preserving biological variation and removing the batch effect 6 . The success of batch effect removal is often evaluated based on the number of significant features found by differential analysis methods 11,18 or solely on the reduction of batch-mixing metrics 10,19 . Discrimination between classes under study using DNNs might be discredited because they are considered black boxes 20 and the purpose is usually to find biomarkers for disorders that can already be diagnosed otherwise. Today, the problem of interpretability of DNNs has been thoroughly investigated and many options are available to break into the black box 21,22 , including methods specific to deep learning models, such as attribution methods (e.g., Integrated Gradients 23 and DeepLIFT 24 ), and model-agnostic approaches (such as LIME 25 and SHAP 26 ), all of which provide insight into how important each feature is for the model's decision-making process. Not only can these methods indicate the average importance of each feature for a complete classification task, but they can also provide interpretable classification explanations at the sample level 27 .
In this context, we propose a suite of models to overcome these various batch correction problems, all based on autoencoders and executable in parallel. Our approach to countering batch effects differs from most other solutions, as we do not rely on a single solution that we claim to be superior to all others. Instead, we acknowledge that not all problems require the same solution and propose multiple potential solutions to address batch effects. Thus, we aim to empower researchers to easily try multiple methods simultaneously and pick the optimal approach for their dataset and scientific question. Within this suite of models, we present the first use of Variational Autoencoders (VAE), Domain Adversarial Neural Networks (DANN) and the Domain Inverse Triplet Loss (invTriplet) for batch correction in LC-MS. Additionally, in contrast to other batch correction methods, we do not recommend using the corrected output of the autoencoder for biomarker discovery through downstream analysis. Instead, we demonstrate how Shapley values 26 can be used for biomarker discovery. Finally, our method simultaneously corrects batch effects and performs sample classification, making it an end-to-end solution that ensures batch effect removal improves classification.

Model descriptions
Our models, which we call Batch Effect Removal Neural Networks (BERNN), are composed of different modules, each with a different objective: the autoencoder, the batch classifier and the label classifier (Fig. 1; for more details see Fig S1). The autoencoder aims at finding representations, usually smaller than the inputs, that can be used to reconstruct the inputs, as can be seen in Fig. 1A and 1B. The autoencoder, by itself, can improve classification generalization by removing noise and by providing a smaller representation. To further remove batch effects, we use either an adversarial loss or a modified version of the triplet loss to find feature representations that cannot discriminate between batches. The adversarial strategy is already used by NormAE 18 , but we use the Gradient Reversal Layer (GRL) to make the training more straightforward 28 . To our knowledge, all methods in the literature using DNNs to deal with batch effects use an autoencoder 5,15,16,18 , but it is not mandatory (as in the original model of 28 ). The label classifier is optional for obtaining a representation free of batch effects, but is essential for model selection, as we define the objective of the model as obtaining the best classification scores on batches never seen during training. This objective ensures that the maximum biological information is preserved and should therefore always be used to improve the reliability of downstream analysis. To ensure generalizability across batches, we always validate our results on batches that were not used during training.
In addition to the methods used to obtain batch-free representations, we also used Variational Autoencoders (VAE) (see methods section 5.4.4). With recent technologies, datasets tend to have a very high number of features, but a much smaller number of samples. VAEs enable the model to perform data augmentation in datasets that are highly dimensional but have few samples. VAEs can reduce overfitting in overparameterized neural networks 29 , which is important for optimal performance.
Most batch effect correction methods return the corrected expression data, but the method of 30 does not provide the corrected features: it creates an integrated embedding. All our methods also create a new embedded representation, but they can additionally return a vector of the corrected data, although the utility of these data is not guaranteed. The corrected data is obtained by reconstructing the original input from the new embedding using the decoder (the purple module in Fig. 1). Its utility is not guaranteed because the reconstruction loss (i.e., the measure of the error between the predicted and true values) is only one of multiple losses that constitute the final loss, and each individual loss can have its importance significantly reduced by a hyperparameter. When used in combination with a classification task, it is possible to get the features that had the most importance for the classification. Although BERNN is designed to optimize classification, it can also be used for biomarker discovery using SHAP 26 , which quantifies the contribution of each feature to the model's predictions, thus identifying the most influential features as biomarkers. BERNN provides insights into the relative importance of different features, enabling the identification of biomarkers for improved understanding and diagnosis in various biomedical applications. We ran a SHAP analysis (see Fig S2) as an example of how the features contributing the most to the models' decisions can be identified without the corrected original features.
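As a minimal illustration (not the exact BERNN implementation), SHAP's DeepExplainer can be applied to a trained encoder plus label classifier to rank input features; the toy model and data below are assumptions made only for this sketch.

```python
import numpy as np
import shap
import torch
import torch.nn as nn

n_features = 100
X_train = np.random.randn(200, n_features).astype(np.float32)
X_test = np.random.randn(50, n_features).astype(np.float32)

encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU())
label_classifier = nn.Linear(32, 2)
model = nn.Sequential(encoder, label_classifier)  # assumed already trained

explainer = shap.DeepExplainer(model, torch.from_numpy(X_train[:100]))
shap_values = explainer.shap_values(torch.from_numpy(X_test))

# shap_values: per-sample, per-feature scores, one set per output class.
sv = shap_values[0] if isinstance(shap_values, list) else shap_values[..., 0]
importance = np.abs(sv).mean(axis=0)        # global importance per feature
top_features = np.argsort(importance)[::-1][:20]
print(top_features)
```

The per-sample SHAP values (before averaging) are what enables the sample-level explanations mentioned above.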

Evaluation and selection
To get an accurate picture of the performances of the models benchmarked in this study, we used three datasets with different characteristics (Fig. 2). They differ in number of features (889, 6461 and 17887), number of batches (3, 7, and 21) and degree of batch effect (Adjusted Mutual Information (AMI) from 0.13 to 1.0). The differences in initial batch effects are also visually evident (Fig. 2B). These datasets, which have very different characteristics, achieved top performances with different models, supporting the concept that different problems might require different solutions. Models were evaluated on their classification performances using accuracy (Fig S3) and the Matthews Correlation Coefficient (MCC) (Fig. 3A, 4A, 5A). Because some datasets were highly imbalanced (more samples in one class than the other), we selected the top performing models based on their MCC scores. The MCC produces high scores only if predictions are good across all four confusion matrix categories (true positives, false negatives, true negatives and false positives) 31 . Class imbalance was strongest in the Adenocarcinoma dataset, where 87.5% of the samples were of the dominant class. As can be seen in Figure S3B, the best accuracy for this dataset is obtained using the raw data, because the model always predicts the dominant class. The main drawback of the MCC is its high sensitivity to misclassification of samples in the minority class. When the imbalance is very high, misclassification of a single sample can have a big impact on the score. This partly explains why the error bars for MCC were sometimes quite large. In comparison, error bars are smaller for accuracies (Fig S3).
To evaluate the performance of batch effect correction, we used two main categories of metrics: batch mixing metrics and quality control (QC) metrics.

Training scenarios
As explained in Fig. 1, there are two scenarios for training any BERNN. Both scenarios start with a warmup phase (step 1), which uses the complete dataset (including the validation and test sets) to train all unsupervised components of the model, namely the autoencoder and the batch classifier (if an adversarial loss is used). In the first scenario, when the warmup is over, the autoencoder is frozen and only the label classifier is updated using the training set (step 2a). In the second scenario, after the warmup, the training alternates between step 1 and step 2b, which is the same as step 2a except that the backpropagation is allowed to flow through the encoder.
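A minimal PyTorch sketch of the two scenarios (module names are illustrative stand-ins, not BERNN's actual code):

```python
import torch
import torch.nn as nn

encoder = nn.Linear(1000, 64)   # stand-ins for the real BERNN modules
label_clf = nn.Linear(64, 2)

def configure_scenario_1():
    # Step 2a: freeze the encoder; only the label classifier is updated.
    for p in encoder.parameters():
        p.requires_grad = False
    return torch.optim.Adam(label_clf.parameters())

def configure_scenario_2():
    # Step 2b: gradients also flow through the encoder; training then
    # alternates between the unsupervised step 1 and this supervised step.
    return torch.optim.Adam(list(encoder.parameters()) +
                            list(label_clf.parameters()))
```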
The first scenario was used only for the Alzheimer dataset; the other two used the second scenario. When the second scenario is used with the Alzheimer dataset, overfitting is too severe and only random predictions are made for the validation and test sets. However, because in the first scenario the label classifier cannot influence the representation learned, it probably causes underfitting of the data. It may be because of the data augmentation they provide that VAEs, particularly VAE-DANN, perform better on this dataset (Fig. 3A). On the Adenocarcinoma and Aging Mice datasets, the second scenario performed best. The first scenario made the model underfit too much, thus the second scenario performed much better (Fig. 4A). In this particular case, the AE-based models performed much better than the VAE-based models, especially AE-invTriplet. For the Aging Mice dataset, all BERNN models were better than the other methods, but their performances were almost exactly the same, with only negligible differences (Fig. 5A). In this case, it is possible that the maximum performance was obtained even with the simplest AEs, leaving no possibility for improvement with more advanced methods. Thus, we found that no single model can claim to perform optimally on all three datasets analyzed, with some models being the best for a certain dataset while performing poorly on others.

Reducing batch effects can improve classi cation
We define the best model as the one with the best average classification performances over all repetitive holdout iterations (see methods section 5.5.1). Because we are interested in studying whether the learned model generalizes to new batches, all the samples from a given batch must be contained within the same split. In standard cross-validation practices, the test set always remains the same.
However, some batches are easier to classify with any given training set, which could explain the sometimes large error bars in Fig. 3A, Fig. 4A and Fig. 5A. When using cross-validation, if the test set randomly comprises these "easy" batches, it may lead to much better performance on the test set than on the validation set. To avoid this potential confusion, we used a repetitive holdout method so that the test set is resampled for each holdout iteration.
Most batch correction methods improve (decrease) batch entropy (BE). AMI and ARI are also consistently improved (decreased in value) in datasets with the greatest initial batch effects, but not in the Alzheimer dataset, which had moderate batch effects in the raw data. For every dataset, the best validation MCC score was always obtained with a transformation that improved BE (Fig. 3D, Fig. 4D & Fig. 5C). Although in the case of the Adenocarcinoma dataset many VAE-based networks performed worse than a classification using the raw data (see methods section 5.5.4), VAE and AE networks were each the most efficient in at least one dataset. In all three datasets, all the best classification performances were reached using neural network models that also significantly improved BE. It is also true that some of the methods that led to some of the best BE improvements had some of the worst MCC scores, sometimes far worse than when using the raw data. The best example of this is the Aging Mice dataset, where the classification using the raw data reached high MCC scores despite a very bad BE (Fig. 5). On the other hand, combat and harmony greatly reduced batch effects at the cost of very bad MCC scores. This reduction can lead to the loss of important biological information, which could have negative consequences even in situations where classification is not a primary concern, such as differential quantification/expression analysis. When evaluating differences between samples from two conditions, classification performance provides evidence of the significance of a set of biomarkers identified to discriminate between the conditions.
In both datasets that used replicate QC samples (Alzheimer and Adenocarcinoma), the networks with the best classification performances had better nMED than the uncorrected data, but the networks that did best on these metrics did not necessarily result in the best classification metrics. Note that in Figs. 3-5, only the models with the best classification performances were retained. Many models that performed even better than the retained models on every batch effect metric were discarded due to poor classification performances.

AE outperforms all other methods
While it is not possible to confirm a single model as the best choice to improve classification, all the best MCCs were obtained using a version of our BERNNs. All of them performed almost identically on the Aging Mice dataset. On the Alzheimer dataset, VAE-DANN performed best, followed closely by NormVAE and invTriplet-VAE. It is possible that the reverse triplet loss (revTriplet; for a definition see supplementary methods), which did not perform as well as the other models developed in this study, would perform best on other datasets (for the complete results, see Fig S4-S7). In all datasets, either revTriplet or invTriplet was always among the best of the BERNN models for batch correction according to the batch entropy (Fig S4). Combat or harmony were often the best according to the three batch mixing metrics, but always at the cost of very poor classification performances. The triplet losses require an additional hyperparameter (called the margin), which makes hyperparameter optimization more complex. This margin has been shown to have important consequences on optimization in scRNAseq, where an inverse triplet loss was also used to overcome batch effects 30 . It is possible that these models might require more hyperparameter optimization to achieve even better performances. Nevertheless, one of the invTriplet models was either the top performer (Fig. 4) or was always very close to the best performance (Fig. 3 and Fig. 5).
When batch effects hindered classification generalization, using one of the BERNN models was always beneficial in all three datasets. However, when a batch effect is evident but the model can generalize predictions to new batches, the course of action is less clear.
The Aging Mice dataset provides a good example of this scenario, as LinearSVC achieved very good classification scores with unnormalized data. Normalization methods that reduce batch effects, or batch correction methods applied prior to training the classifiers, often had some of the best batch mixing scores (Fig S7). However, they either reduced the efficiency of the classification tasks or barely made any difference in performance. Classification was significantly degraded by some methods, such as harmony or combat (excluding pycombat), indicating that the biological signal crucial for distinguishing between conditions was completely eliminated.

Discussion
It is well known in machine learning that no single model can claim to be the best at solving any task. In this study, we propose a suite of deep learning architectures to enable users to find the optimal solution for different problems. This suite introduces batch correction in LC-MS using VAE-based models, the use of the GRL for implementing a DANN, and triplet losses, all of which were part of the best model in at least one dataset.
The inverse triplet loss is particularly interesting because it is the only loss that is effectively minimized. The other losses that use the GRL attempt to minimize the batch classification, but the loss increases until it reaches the value corresponding to a random classification of the batches. This property is particularly useful in the context of multitask learning 32 , which requires all losses to be minimized to function properly. In this study, some models require simultaneously training multiple losses: the autoencoder reconstruction loss (1), the batch classification loss (2), the classification loss (3) and, for VAEs, the Kullback–Leibler loss (4). Each of these losses needs hyperparameters to weight its importance for the model to be optimal. Adversarial models, for example, are notoriously difficult to train because of the fragile balance between the training of the discriminator and the generator 33 .
The bottleneck representation should be preferred to the reconstructed inputs, because the decoder cannot improve the representations: the reconstructions can only be as good as the bottleneck representations or, more likely, worse. Reconstruction is usually chosen because it is easier to interpret for biologists, since the goal of the studies is to identify features that can be used as biomarkers. If the end goal is not the classification in itself, but to identify biomarkers that can be used to search for new therapeutic compounds, then it makes sense to focus on denoised representations. We argue, however, that the bottleneck should be used in combination with other methods, like SHAP 26 or LIME 34 (Fig S2), that can identify the most useful features.
We found that the original NormAE available online had poor classification results, especially on the Alzheimer dataset (unpublished data). Instead of using the implementation available online, we reimplemented it in our suite of models, as it is very similar to the DANN method, except without the GRL (see methods 5.4.2 for more details), which we believe to be a crucial component of the success of the methods that use a DANN. Because they are all trained with the same training scripts, we can make fair comparisons between the methods. Indeed, the original NormAE had more layers and a different training strategy than we used. All methods should be trained using the same architectures so that the only difference between each method is the method itself, not the architecture. We were able to surpass the classification performance of the default configuration, which was developed on the adenocarcinoma dataset (we obtained the dataset from 35 , as 18 did not make its dataset available, but the dataset descriptions suggest they are the same dataset).
In future developments, the reasons why some models are better suited to a particular dataset will be studied. For example, the VAE-based models performed much better than our other models on the dataset that had the least batch effect and the most batches, whereas the opposite was true on the dataset with the strongest batch effect and fewest batches. This aspect will be further developed to derive improved guidelines that could reduce the amount of time needed to explore the architectures and give an idea from the start of what should work in a particular setting.
In conclusion, we propose a new tool, BERNN, that integrates multiple solutions to remove batch effects in LC-MS analyses while allowing an optimal classification of the biological samples. Using three different proteomic and metabolomic datasets, we benchmarked BERNN models against six other tools available in the literature and found that they outperformed them in all cases when considering not only reduction of batch effects but also classification performances. Finally, unlike most batch correction tools, which provide a corrected version of the data, here we rely on the encoded version for data classification. However, we demonstrated that combining approaches such as SHAP with BERNN can retrieve important features, enabling the discovery of potential biomarkers.

Datasets description
We use three datasets with different levels of batch mixing heterogeneity and different numbers of batches to demonstrate how BERNN applies to different scenarios. A summary of the three datasets is available in Fig. 2A.

Alzheimer Disease dataset
Cerebrospinal fluid (CSF) samples were obtained according to standardized collection and processing protocols through the Mass General Institute for Neurodegenerative Disease (MIND) biorepository, following written informed consent for research biobanking (IRB: 2015P000221). This cohort is clinically complex, spanning a wide variety of neurological disorders, which closely aligns with a real-life diagnostic situation. The raw data for this dataset is available on ProteomeXchange with accession number PXD043216. CSF proteins were trypsin-digested prior to LC-MS/MS analysis on an Orbitrap Fusion Tribrid (Thermo Fisher Scientific) mass spectrometer operating in Data Independent Acquisition (DIA) mode. The 923 samples were injected in duplicate across 22 different batches. A QC sample was also generated by mixing a small aliquot of all CSF samples and analyzed in the same conditions at the beginning and at the end of each batch. This QC sample was also used to generate a Gas Phase Fractionation (GPF) library. The raw files were then processed with the DIA-NN 36 software (version 1.8.1) for protein identification and quantification. DIA-NN was used in two steps: 1) library-free search on the GPF files using a Uniprot reference Homo sapiens database to generate a spectral library; 2) library-based search on the sample and QC sample files using the spectral library generated in step one. The main report generated by DIA-NN was used with the DIA-NN R package 37 to get quantifications of proteins corresponding to unique genes. Common contaminants were removed from the dataset to avoid any bias in the classification or in any subsequent analysis. A complete list of removed protein IDs is available in the supplementary content (Table S1). The batches are very heterogeneous, with a similar total number of samples per batch but different class composition across batches (Fig S8). Classes are not fully balanced. The complete preprocessed dataset is available at https://github.com/spell00/BERNN_MSMS.

Adenocarcinoma dataset
The dataset is composed of a total of 642 samples, with 74 QC samples and 568 patients, comprising 497 colorectal cancer and 71 chronic enteritis patients. There were 192, 192, and 184 subject samples with 25, 25, and 24 QCs in three batches, respectively. The raw MS files were converted to mzXML using ProteoWizard 38 , then preprocessed using the R package XCMS. After data processing, the final dataset has 6461 metabolite peaks. More details on this dataset can be found in the original paper 35 . The dataset is available at https://github.com/dengkuistat/WaveICA_2.0/tree/master/data.

Aging Mice dataset
This dataset was introduced by 39 . It is made of a total of 372 samples, of which 171 received a high-fat diet and 201 a chow diet. There were only 3 QC samples, all in the same batch, and thus these samples were discarded. The samples were distributed into 7 batches with an average of 53 samples per batch (mean = 53.14, std = 26.91). Each sample has 17887 features that represent the peptides' precursors. The raw data for this Aging Mice dataset is available with ProteomeXchange accession number PXD009160.
To reconstruct the data matrix used in this study, first clone the GitHub repository https://github.com/symbioticMe/batch_effects_workflow_code. Then, download the file http://ftp.pride.ebi.ac.uk/pride/data/archive/2021/11/PXD009160/E1801171630_feature_alignment_requant.tsv.gz and place it in the folder AgingMice_study/data_AgingMice/1_original_data of the batch_effects_workflow_code repository. Then, run the scripts 1a_prepare_sample_annotation.R, 1b_prepare_raw_proteome.R and 4b_peptide_correlation_raw_data.R to get the matrix of log precursor values used in this study.

Tools for evaluating batch effects
The tools we used to evaluate the presence of batch effects can be divided into three main categories: visual diagnostics using a dimensionality reduction technique, batch mixing metrics and quality control metrics. It is important to note that batch effects can be subtle and difficult to detect, and that different methods may identify different sources of variation in the data. Therefore, it is often recommended to use multiple methods and to carefully validate and interpret the results.

Visual diagnostic
The first category for evaluating batch effects is visual diagnostics (Fig. 2). This is usually done with methods such as PCA, UMAP 40 or t-SNE 41 , and it is often how a batch effect is first noticed. The presence of a visually observable batch effect means it is a large source of variance in the data. However, these visualizations can be incomplete, so the absence of a visually observable batch effect does not mean it is nonexistent. For example, in the Alzheimer dataset the batch effect is not as obvious as in other datasets; if only visual diagnostics were done, batch effects might go unnoticed.

Batch mixing metrics
All batch mixing tests use a classifier trained to predict which batch each sample is from. We used a k-nearest neighbors classifier with 20 neighbors to calculate the probability of a sample belonging to each batch. The highest probabilities were used as predictions to calculate ARI and AMI. The probabilities themselves are used to calculate the batch entropy.

Batch Entropy
We can say there is no batch effect when it is impossible to accurately predict which batch a sample is drawn from: if the best prediction is random, there is no batch effect. To reach the maximum batch entropy (BE), the batch classifier should predict all possible batches as equally probable. For example, if there are 4 batches, the batch effect is at its lowest when the highest entropy is reached, which is when the batch classifier returns the vector [0.25, 0.25, 0.25, 0.25]. To calculate batch entropy, the probability p(b) of a given sample belonging to each of the possible batches b is obtained using the relative frequency of its N nearest neighbors. The BE is given by Shannon's entropy:

$$BE = -\sum_{b=1}^{K} p(b) \log p(b)$$

For BE, higher values mean better batch mixing. In order for the metric to be easily comparable to the other two batch mixing metrics, for which decreasing values indicate better batch mixing, we defined a metric we call the normalized Batch Entropy (nBE), which is the maximum entropy (ME) value possible minus the BE, divided by ME. The maximum and minimum values of nBE are 1 and 0, respectively. In an experiment with K batches, the entropy is at its maximum when p(b) = 1/K, so ME = log K and nBE is defined as:

$$nBE = \frac{\log K - BE}{\log K}$$
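Under these definitions, BE and nBE can be computed from the k-NN probabilities as follows (a sketch; function and variable names are ours):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def batch_entropy_metrics(embeddings, batch_ids, n_neighbors=20):
    """Return (BE, nBE) from k-NN batch membership probabilities."""
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(embeddings, batch_ids)
    probs = knn.predict_proba(embeddings)            # p(b) for each sample
    eps = 1e-12                                      # avoid log(0)
    be = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    max_entropy = np.log(len(np.unique(batch_ids)))  # log K, reached at p = 1/K
    nbe = (max_entropy - be) / max_entropy           # 0 = perfect mixing, 1 = none
    return be, nbe
```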

Adjusted Rand index (ARI)
The Rand Index measures the similarity between two clusterings: it is the fraction of sample pairs on which the two clusterings agree. Values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than a random prediction). We use the Adjusted Rand Index (ARI), which is adjusted for chance. The variables compared are the batch predictions and the true batch labels. It is defined as follows:

$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \big/ \binom{n}{2}}$$

where $n_{ij}$, $a_i$ and $b_j$ are values from the contingency table ($a_i$ and $b_j$ are its row and column sums) and $n$ is the total number of samples.

Adjusted Mutual Information (AMI)
Mutual Information measures the entropy shared between two variables; it measures the dependence between two variables, in this case two discrete variables. As for the ARI, values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than a random prediction). The variables compared are the batch predictions and the true batch labels. The Adjusted Mutual Information (AMI) corrects the Mutual Information for chance:
$$AMI(U, V) = \frac{MI(U, V) - \mathbb{E}\{MI(U, V)\}}{\operatorname{mean}\left(H(U), H(V)\right) - \mathbb{E}\{MI(U, V)\}}$$

where $U$ and $V$ are the two sets of clusters being compared, both with $N$ elements, $H$ denotes the entropy, $MI$ the mutual information and $\mathbb{E}\{MI\}$ its expected value, computed from the values of the contingency table.
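Both chance-adjusted scores can be computed with scikit-learn, comparing the k-NN batch predictions with the true batch labels (a sketch with toy data):

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score
from sklearn.neighbors import KNeighborsClassifier

embeddings = np.random.randn(300, 16)            # toy data
batch_ids = np.random.randint(0, 7, size=300)

knn = KNeighborsClassifier(n_neighbors=20).fit(embeddings, batch_ids)
pred = knn.predict(embeddings)                   # highest-probability batch

ari = adjusted_rand_score(batch_ids, pred)       # ~0 means good batch mixing
ami = adjusted_mutual_info_score(batch_ids, pred)
print(ari, ami)
```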

Quality control metrics
Two of the datasets used in this study contain QC samples that were systematically analyzed with each batch. The features of this sample should always be the same, so we can calculate how much they diverge and use these metrics to measure the importance of the batch effect. These metrics were (to our knowledge) introduced by 18 .

Average Pearson Correlation Coe cient (aPCC)
Because it is always exactly the same sample, we know that all QCs should theoretically be perfectly correlated, which means a perfect Pearson Correlation Coefficient (PCC) of 1. All samples are compared in pairs, so the final value is an average over all PCCs.
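A sketch of the computation:

```python
import numpy as np

def average_pcc(qc_matrix):
    """qc_matrix: array of shape (n_qc_samples, n_features)."""
    corr = np.corrcoef(qc_matrix)         # pairwise PCCs between QC samples
    iu = np.triu_indices_from(corr, k=1)  # each pair counted once
    return corr[iu].mean()                # 1.0 means identical QC profiles
```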

Normalized Median Euclidean Distance (nMED)
If batch correction is efficient, all QC samples should be very close to each other. The Euclidean distance is used to measure how far each pair of samples is from one another. Instead of using the average, as in 18 , the median is used because it is less affected by aberrant values. Unlike 18 , we also normalize the value by dividing it by the median Euclidean distance between all non-QC samples. If a transformation makes the QC samples very close to each other, but non-QC samples are equally close to each other, then the transformation did not actually alleviate the batch effect. For this reason, nMED should be preferred to the average Euclidean distance proposed in 18 .
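A sketch of the computation using SciPy's pairwise distances:

```python
import numpy as np
from scipy.spatial.distance import pdist

def nmed(qc_matrix, sample_matrix):
    """Median pairwise QC distance, normalized by the non-QC median."""
    qc_median = np.median(pdist(qc_matrix))        # all QC pairs
    ref_median = np.median(pdist(sample_matrix))   # all non-QC pairs
    return qc_median / ref_median                  # lower = tighter QCs vs samples
```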

Batch Effect Removal methods
We tried normalizing each dataset using three different methods: minmax, standard and robust standardization. The first methods we tried to counter batch effects were the same three normalization methods, but applied individually to each batch, which we named minmax_per_batch, standard_per_batch and robust_per_batch. The latter two have more potential to remove batch effects, as they transform the values into z-scores, thus forcing each batch to have a mean of 0 and unit variance. We also used combat 8 and Harmony 13 , as they are popular methods in microarrays/RNAseq and scRNAseq, respectively. Combat is also used to remove batch effects from LC-MS datasets 42 . We used two implementations of combat: an R version (https://rdrr.io/bioc/sva/man/ComBat.html) and a Python version (https://github.com/epigenelabs/pyComBat), which we named pycombat in this manuscript. Intriguingly, the two implementations had very different outcomes.
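A sketch of the per-batch idea (here with standardization; the function name is ours):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler / RobustScaler

def standard_per_batch(X, batch_ids):
    """Fit a separate scaler on each batch (the *_per_batch variants)."""
    X_corrected = np.empty_like(X, dtype=float)
    for b in np.unique(batch_ids):
        mask = batch_ids == b
        X_corrected[mask] = StandardScaler().fit_transform(X[mask])
    return X_corrected  # each batch now has zero mean and unit variance
```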
We also used WaveICA 25 and NormAE 18 , as they are state-of-the-art methods in batch effect correction of LC-MS data. We used our own implementation of NormAE, which has a slightly different architecture: we reduced the number of layers to a single hidden layer before and after the bottleneck to make the comparison with our own models fair. Unlike NormAE, we consider the number of neurons in each layer to be hyperparameters that are optimized. To give it a fair chance to outperform our own methods, we also use the same hyperparameter optimization as for our models.

Autoencoders
All the models in BERNN are implemented using PyTorch and are based on autoencoders. In short, autoencoders are composed of an encoder and a decoder. The encoder turns the inputs into embeddings (also referred to as the bottleneck of the autoencoder), which are usually, but not necessarily, smaller than the inputs. The objective of the autoencoder is to obtain new representations and reconstruct the original inputs from them as well as it can. The embeddings should then contain as much information as possible from the inputs, without the unnecessary noise. To find the best performing model on a given dataset, a total of 10 models can be trained using BERNN. For a complete representation of all the possible models that can be trained using BERNN, see Fig S1. Not all models are represented in the main results, to ease reading, but the complete results are available in Figures S4-S6.

Reconstruction with batch mapping
The first method implemented to obtain batch-free representations is to add to the embedding of the autoencoder a vector of the same size representing the batch ID (Fig S9). This was implemented in NormAE, although not mentioned in the paper 18 . It makes it possible to obtain a better reconstruction loss by adding the batch information into the vector used for the reconstruction. Because the batch ID is contained in this vector added to the embedding, the latter does not need to contain information about the batch. A similar method is used to get batch-free representations in scRNAseq, such as in 14 ; in that case, the batch ID is directly appended to the bottleneck representation of the variational autoencoder.
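A minimal sketch of this batch-mapping idea, with illustrative module names (the real architecture differs; see Fig S9):

```python
import torch
import torch.nn as nn

class BatchMappingAE(nn.Module):
    def __init__(self, n_features, n_latent, n_batches):
        super().__init__()
        self.encoder = nn.Linear(n_features, n_latent)
        self.batch_embed = nn.Embedding(n_batches, n_latent)  # batch-ID vector
        self.decoder = nn.Linear(n_latent, n_features)

    def forward(self, x, batch_id):
        z = self.encoder(x)  # bottleneck, intended to be batch-free
        # The batch vector carries the batch information needed to reconstruct,
        # so z itself does not have to encode the batch.
        recon = self.decoder(z + self.batch_embed(batch_id))
        return z, recon
```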

Domain Adversarial Neural Network (DANN)
The Domain Adversarial Neural Network (DANN) is a type of deep learning algorithm that enables a model trained on one domain to be adapted to another related domain with different characteristics, allowing it to perform better on the target domain. DANN achieves this by learning to extract domain-invariant features from the input data 28 . In our case, we define batches as different domains. The original DANN was developed to adapt the learning from a single domain to another one, so our work is more akin to 43 , which extends domain adaptation to multiple domains. Our AE-DANN model is represented in Fig S1. The loss of AE-DANN is the following:

$$\mathcal{L}_{AE\text{-}DANN} = \mathcal{L}_{rec} + \mathcal{L}_{class} + \gamma\, \mathcal{L}_{disc\text{-}b}$$

where $\mathcal{L}_{rec}$ is the reconstruction loss, $\mathcal{L}_{class}$ the label classification loss and $\gamma$ weights the batch discriminator loss $\mathcal{L}_{disc\text{-}b}$. The loss $\mathcal{L}_{disc\text{-}b}$ is minimized, but it is adversarial because of the Gradient Reversal Layer (GRL).
NormAE is also similar in nature to a DANN, except it does not use the GRL. Using the GRL is advantageous, because the total loss is then composed of losses that are all minimized and added together. When not using the GRL, as with NormAE, the adversarial loss is maximized and subtracted from the other losses being optimized simultaneously.
With that formulation, if the subtracted adversarial term becomes too large, the total loss can become negative, which should not be allowed to happen.
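A common PyTorch implementation of the GRL (a generic sketch, not necessarily BERNN's exact code): the layer is the identity on the forward pass, and multiplies the gradient by -lambda on the backward pass.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)        # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed, scaled gradient flows back into the encoder.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: z = encoder(x); batch_logits = batch_classifier(grad_reverse(z)).
# The batch classifier minimizes its own loss while the encoder receives the
# reversed gradient, pushing it toward batch-invariant representations.
```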

Inverse Triplet Loss
The inverse triplet loss is like the normal triplet loss (defined in the section Reverse Triplet Loss of the supplementary material), but the positive and negative samples are swapped: the negative samples take the place of the positive samples in the triplet loss equation, and vice versa.
$$\mathcal{L}_{invTriplet} = \max\left(\lVert f(A) - f(N) \rVert_2 - \lVert f(A) - f(P) \rVert_2 + \alpha,\ 0\right)$$

where A is the anchor input, P is any positive input from the same batch as A, N is any negative sample from a different batch than A, α is the margin between positive and negative pairs, and f is the embedding obtained by passing the inputs through the encoder of the autoencoder. Using the normal triplet loss would result in samples from the same batch clustering together and different batches being far away from each other, with the distance between the clusters controlled by the hyperparameter α. The inverse triplet loss does the opposite by swapping the positive and negative samples in the equation, which encourages batch-free representations: samples from different batches get closer, while samples from the same batch are pushed apart. The latter objective is used to prevent all samples from collapsing; if samples from the same batch were not pushed apart, the loss would be optimal when all samples are mapped to exactly the same value, which is not the desired outcome. The distance minimized here is the Euclidean distance, but any distance could be used.
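A sketch of the inverse triplet loss using PyTorch's built-in triplet loss, simply swapping the roles of the positive and negative embeddings:

```python
import torch
import torch.nn.functional as F

def inverse_triplet_loss(anchor, positive, negative, margin=1.0):
    """positive: embeddings from the same batch as the anchor;
    negative: embeddings from a different batch.
    F.triplet_margin_loss computes max(d(a,p) - d(a,n) + margin, 0),
    so passing (anchor, negative, positive) pulls different batches
    together and pushes same-batch samples apart (preventing collapse)."""
    return F.triplet_margin_loss(anchor, negative, positive, margin=margin)
```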

Variational Autoencoders
The variational autoencoder (VAE) is a probabilistic generative model based on the variational Bayes approach. To train a VAE, we want to optimize the lower bound defined in Eq. 3 of 44 :

$$\mathcal{L}(\theta, \phi; x) = -D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) + \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]$$

where $D_{KL}$ is the Kullback–Leibler divergence, φ represents the variational parameters (encoder parameters) and θ the generative parameters (decoder parameters). The Kullback–Leibler divergence pushes the variational posterior $q_\phi(z|x)$ to resemble the prior $p_\theta(z)$, which is the unit normal distribution. Both the label and batch classifiers use the reparametrized variable z as inputs. The VAE is also optionally combined with a DANN, which is also trained on z to make sure the new representations are free of batch effects. All the different AEs listed above have also been implemented as VAEs, including NormAE.
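A sketch of the reparameterization and of the two terms of this bound as a PyTorch loss, assuming a diagonal Gaussian posterior, a unit-normal prior and a mean-squared-error reconstruction term (beta is the KL weight listed among the hyperparameters below):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """z = mu + sigma * epsilon; z is fed to the label and batch classifiers."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x, recon, mu, logvar, beta=1.0):
    """Negative ELBO: reconstruction term + beta-weighted KL term."""
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl
```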

Repetitive holdout
Repetitive holdout is a method to evaluate the performance of a model on a dataset. It is similar to cross-validation, except that each split is random, there is no limit to the number of times it can be done on a dataset, and the test set is resampled for every holdout iteration. Resampling the test set is particularly important, because some batches can be much easier to classify than others, which can make the classification on the test set much better or much worse than on the validation set. Using repetitive holdout makes classification on the validation and test sets comparable.
When splitting the dataset, the samples from a given batch must all be contained in the same split. We do this to assess the generalization abilities of a model when making predictions on a new batch.
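This batch-wise splitting corresponds to scikit-learn's GroupShuffleSplit with batch IDs as groups; a sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.randn(300, 50)                 # toy data
batch_ids = np.random.randint(0, 7, size=300)

gss = GroupShuffleSplit(n_splits=20, test_size=0.2)
for train_idx, test_idx in gss.split(X, groups=batch_ids):
    # No batch appears in both splits; the test set is resampled each time.
    assert set(batch_ids[train_idx]).isdisjoint(batch_ids[test_idx])
```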

Class imbalance
Class imbalance has a negative impact on the predictive abilities of machine learning models. If nothing is done about it, the model might learn to only predict the majority class. This is especially true if the imbalance is very large, and it is also a concern if the problem is very hard to model. PyTorch's WeightedRandomSampler is used to counter class imbalance by giving more weight to samples from minority classes during training.
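A minimal usage sketch with toy labels:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = np.array([0] * 90 + [1] * 10)          # imbalanced toy labels
class_counts = np.bincount(labels)
sample_weights = 1.0 / class_counts[labels]     # rarer class -> larger weight

sampler = WeightedRandomSampler(
    torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(labels), replacement=True)

dataset = TensorDataset(torch.randn(100, 5),
                        torch.as_tensor(labels, dtype=torch.long))
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```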

BERNN Hyperparameter Optimization
We used the function optimize from the package ax-platform (https://pypi.org/project/ax-platform/) to perform a Bayesian optimization of the hyperparameters for each of the BERNN models (all implemented in PyTorch). We evaluated 20 combinations of hyperparameters for each model. The hyperparameters optimized are the following: learning rate, weight decay, dropout, number of warmup epochs, layer1, layer2, label smoothing parameter, triplet loss margin (when it applies), beta (controls the strength of the KL divergence, when it applies), and gamma (controls the strength of the adversarial or triplet loss, when it applies). The batch size is set to 32 and the number of epochs after warmup is set to 1000, but the training is stopped if no improvement is made in the last 100 epochs. Models were trained on Nvidia RTX3090 GPUs.
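A sketch of what such a call looks like; the search space shown is partial and `train_and_eval`, which should return the validation metric, is a hypothetical placeholder:

```python
from ax import optimize

best_parameters, best_values, experiment, model = optimize(
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-5, 1e-2], "log_scale": True},
        {"name": "wd", "type": "range", "bounds": [1e-8, 1e-2], "log_scale": True},
        {"name": "dropout", "type": "range", "bounds": [0.0, 0.5]},
        {"name": "gamma", "type": "range", "bounds": [0.0, 10.0]},
    ],
    # train_and_eval is a hypothetical function training a BERNN model and
    # returning the validation MCC for the given hyperparameters.
    evaluation_function=lambda p: train_and_eval(**p),
    minimize=False,      # maximize the classification metric
    total_trials=20,     # 20 hyperparameter combinations, as stated above
)
```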

Classi cation with non-BERNN representations
For the classification of raw data, or of data corrected by any method that does not involve a DNN, we used either a Random Forest Classifier (RFC) or a Linear Support Vector Machine (LinearSVM), both from the Python package scikit-learn. To deal with class imbalance in some datasets, the parameter class_weight was set to "balanced" for both models, which automatically adjusts weights inversely proportional to class frequencies. The hyperparameters optimized for the RFC were min_samples_split, min_samples_leaf, n_estimators, criterion, and oob_score. For the LinearSVM, we optimized tol, max_iter, penalty and C. The hyperparameters for the RFC and LinearSVM models were optimized using the package scikit-optimize (https://scikit-optimize.github.io/stable/) to execute a Bayesian optimization.
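A sketch of the LinearSVM search with scikit-optimize's BayesSearchCV (the exact search spaces used in the study are not reproduced here):

```python
import numpy as np
from skopt import BayesSearchCV
from sklearn.svm import LinearSVC

X = np.random.randn(200, 50)                 # toy data
y = np.random.randint(0, 2, size=200)

search = BayesSearchCV(
    LinearSVC(class_weight="balanced"),      # reweights minority classes
    {"C": (1e-3, 1e3, "log-uniform"),
     "tol": (1e-5, 1e-2, "log-uniform"),
     "max_iter": (500, 5000)},
    n_iter=20,                               # Bayesian optimization budget
    scoring="matthews_corrcoef",             # select on MCC, as in the study
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```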

Model interpretability
For model explanation purposes and to identify the most important features for the classification, we used SHAP 26 to produce an example analysis. The advantage of this approach is that it makes it possible to explain the decision made for individual samples, and it could thus be used for precision medicine. It allows the identification of complex patterns that apply only to a subset of the samples and that could not be identified by differential analysis.

Declarations
Data availability: All the data necessary to reproduce the experiments can be found in the repository https://github.com/spell00/BERNN_MSMS.

Figure 1

Model architectures.
A) VAE-DANN is a variational autoencoder with a domain classifier trained with a gradient reversal layer (GRL). B) AE-invTriplet is an autoencoder that uses the inverse triplet loss to make the new representation batch-free. The encoders for the positive, negative and anchor samples all share the same weights. In both models, step 1 is the warmup and uses the whole dataset (including the train, validation and test sets). Once the warmup is done, two scenarios are possible. In the first, step 2a, only the label classifier is trained using the training set; the rest of the model is frozen, not allowing backpropagation to go through the encoder. In the second scenario, step 2b, the training alternates between step 1 and step 2a, with the exception that backpropagation is allowed to flow through the encoder. The two models share some of the same modules: the encoder (blue), the decoder (purple) and the label classifier (green). Some modules are not shared between the two models represented here but could be used to make new models: the domain classifier (pink), the inverse triplet loss (yellow) and sampling from the parameters of a Gaussian distribution (beige).

Figure 2
Datasets description and overview of batch effect correction.
A) Table summarizing the three datasets used in the study. For each dataset, two conditions are classified by the BERNNs. B) PCA visualisation of the raw data for all three datasets. The datasets are, from left to right, ordered by strength of the initial batch effect. For each dataset, the middle row of images represents the transformation resulting in the best validation MCC. The images in the last row are from representations that were among the top methods purely for batch correction but performed badly for classification.

Supplementary Files
The following supplementary file is associated with this preprint: BERNNSupplementary.docx