## 5.1 Datasets description

We use three datasets with different levels of batch mixing heterogeneity and different numbers of batches to demonstrate how BERNN applies to different scenarios. A summary of the three datasets is available in Fig. 2A.

**5.1.1 Alzheimer Disease dataset**

Cerebrospinal fluid (CSF) samples were obtained according to standardized collection and processing protocols through the Mass General Institute for Neurodegenerative Disease (MIND) biorepository, following written informed consent for research biobanking (IRB: 2015P000221). This is a clinically complex cohort spanning a wide variety of neurological disorders, which closely mirrors a real-life diagnostic situation. The raw data for this dataset is available on ProteomeXchange with accession number PXD043216.

CSF proteins were trypsin-digested prior to LC-MS/MS analysis on an Orbitrap Fusion Tribrid (Thermo Fisher Scientific) mass spectrometer operating in Data Independent Acquisition (DIA) mode. The 923 samples were injected in duplicate across 22 different batches. A QC sample was also generated by mixing a small aliquot of all CSF samples and analyzed under the same conditions at the beginning and at the end of each batch. This QC sample was also used to generate a Gas Phase Fractionation (GPF) library. The raw files were then processed with the DIA-NN 36 software (version 1.8.1) for protein identification and quantification. DIA-NN was used in two steps: 1) a library-free search on the GPF files using a Uniprot reference Homo sapiens database to generate a spectral library; 2) a library-based search on the sample and QC sample files using the spectral library generated in step 1. The main report generated by DIA-NN was used with the DIA-NN R package 37 to obtain quantifications of proteins corresponding to unique genes. Common contaminants were removed from the dataset to avoid any bias in the classification or in any subsequent analysis. A complete list of removed protein IDs is available in the supplementary content (Table S1).

The dataset consists of a total of 923 samples: 84 QC samples and 839 samples obtained from 408 subjects, with 22 subjects having more than one sample from repeat clinic visits. The samples were distributed into 22 batches with an average of approximately 41 samples per batch (mean = 41.25, std = 15.18). The cohort was subdivided into 6 disease classes: cognitively unimpaired patients (CU), Alzheimer's disease with dementia (DEM-AD), Alzheimer's disease with mild cognitive impairment (MCI-AD), dementia with causes other than Alzheimer's disease (DEM-other), mild cognitive impairment with causes other than Alzheimer's disease (MCI-other) and patients with Normal Pressure Hydrocephalus (NPH). Importantly, CU patients are not healthy controls: all had a clinical indication for lumbar puncture and span a variety of other non-dementia diagnoses 5.

The batches are heterogeneous: the total number of samples per batch is similar, but the class composition differs across batches (Fig S8). Classes are not fully balanced. The complete preprocessed dataset is available at https://github.com/spell00/BERNN_MSMS.

**5.1.2 Adenocarcinoma dataset**

The dataset is composed of a total of 642 samples: 74 QC samples and 568 patient samples, comprising 497 colorectal cancer and 71 chronic enteritis patients. There were 192, 192, and 184 subject samples with 25, 25, and 24 QCs in the three batches, respectively. The raw MS files were converted to mzXML using ProteoWizard 38, then preprocessed using the R package XCMS. After data processing, the final dataset has 6461 metabolite peaks. More details on this dataset are available in the original paper 35. The dataset is available at https://github.com/dengkuistat/WaveICA_2.0/tree/master/data.

**5.1.3 Aging Mice dataset**

This dataset was introduced by 39. It is made of a total of 372 samples, of which 171 received a high fat diet and 201 a chow diet. There were only 3 QC samples, all in the same batch, and thus these samples were discarded. The samples were distributed into 7 batches with an average of 53 samples per batch (mean = 53.14, std = 26.91). Each sample has 17887 features that represent the peptides' precursors. The raw data for this AgingMice dataset is available with ProteomeXchange accession number PXD009160.

To reconstruct the data matrix used in this study, first clone the GitHub repository https://github.com/symbioticMe/batch_effects_workflow_code. Then, download the file http://ftp.pride.ebi.ac.uk/pride/data/archive/2021/11/PXD009160/E1801171630_feature_alignment_requant.tsv.gz and place it in the folder *AgingMice_study/data_AgingMice/1_original_data* of the **batch_effects_workflow_code** repository. Finally, run the scripts *1a_prepare_sample_annotation.R*, *1b_prepare_raw_proteome.R* and *4b_peptide_correlation_raw_data.R* to obtain the matrix of log precursor values used in this study.

## 5.2 Tools for evaluating batch effects

The tools we used to evaluate the presence of batch effect can be divided into 3 main categories: visual diagnostic using a dimensionality reduction technique, batch mixing metrics and quality control metrics. It is important to note that batch effects can be subtle and difficult to detect, and that different methods may identify different sources of variation in the data. Therefore, it is often recommended to use multiple methods and to carefully validate and interpret the results.

**5.2.1 Visual diagnostic**

The first category for evaluating batch effects is visual diagnostics (Fig. 2). This is usually done with methods such as PCA, UMAP 40 or t-SNE 41, and it is often how a batch effect is first noticed. The presence of a visually observable batch effect means it is a major source of variance in the data. However, these visualizations can be incomplete, so the absence of a visually observable batch effect does not mean it is nonexistent. For example, in the Alzheimer dataset, the batch effect is not as obvious as in the other datasets; if only visual diagnostics were performed, it might go unnoticed.

**5.2.2 Batch mixing metrics**

All batch mixing tests use a classifier trained to predict which batch each sample is from. We used a k-nearest neighbors classifier with 20 neighbors to calculate the probability of a sample belonging to each batch. The highest probabilities were used as predictions to calculate ARI and AMI, while the probabilities themselves are used to calculate the batch entropy.
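As a sketch of this procedure, the per-sample batch probabilities can be obtained from a k-nearest-neighbours vote. This is a minimal pure-Python version with illustrative data layouts (`embeddings` as a list of feature vectors, `batches` as batch labels); the paper's exact implementation and distance tie-breaking are not assumed here.

```python
import math
from collections import Counter

def knn_batch_probabilities(embeddings, batches, k=20):
    """For each sample, the relative frequency of each batch among its k nearest neighbours."""
    n = len(embeddings)
    all_batches = sorted(set(batches))
    probs = []
    for i in range(n):
        # Euclidean distance to every other sample (the sample itself is excluded)
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j) for j in range(n) if j != i
        )
        votes = Counter(batches[j] for _, j in dists[:k])
        probs.append([votes.get(b, 0) / k for b in all_batches])
    return probs
```

Taking the argmax of each probability vector gives the batch predictions used for ARI and AMI, while the vectors themselves feed the batch entropy.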

## 5.2.2.1 Batch Entropy

We can say there is no batch effect when it is impossible to accurately predict from which batch a sample is drawn. If the best prediction is random, there is no batch effect. To reach the maximum batch entropy (BE), the batch classifier should predict all possible batches as equally probable. For example, if there are 4 batches, the batch effect is at its lowest when the highest entropy is reached, which is when the batch classifier returns the vector [0.25, 0.25, 0.25, 0.25]. To calculate batch entropy, the probability of a given sample belonging to each of the possible batches is obtained using the relative frequency of its N nearest neighbors. The BE is given by Shannon's entropy:

$$BE= I\left(B\right) = {log}\left(\frac{1}{p\left(B\right)}\right)$$

For BE, higher values mean better batch mixing. To make the metric easily comparable to the other two batch mixing metrics, for which decreasing values indicate better batch mixing, we defined a metric called normalized Batch Entropy (nBE): the maximum possible entropy (ME) minus the BE, divided by ME. The maximum and minimum values of nBE are 1 and 0, respectively. In an experiment with K batches, the entropy is at its maximum when p(B) = 1/K, thus nBE is defined as:

$$nBE = \frac{{log}\left(\text{K}\right) - {log}\left(\frac{1}{p\left(B\right)}\right)}{{log}\left(\text{K}\right)}$$
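A minimal sketch of the nBE formula above, assuming p(B) is read as the k-NN probability assigned to a sample's true batch (our reading of the notation; the function name and signature are illustrative):

```python
import math

def nbe(p_true_batch, n_batches):
    """nBE = (log K - log(1/p)) / log K; 0 = perfect mixing, 1 = fully clustered."""
    max_entropy = math.log(n_batches)   # ME = log K
    be = math.log(1.0 / p_true_batch)   # BE = log(1/p(B))
    return (max_entropy - be) / max_entropy
```

With 4 batches, `nbe(0.25, 4)` gives 0 (maximum entropy, best mixing) and `nbe(1.0, 4)` gives 1 (the batch is perfectly predictable).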

## 5.2.2.2 Adjusted Rand index (ARI)

The Rand Index is the number of samples correctly identified divided by the total number of samples; it measures the similarity between two data clusterings. Values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than random). We use the Adjusted Rand Index (ARI), which is adjusted for chance. The variables compared are the batch predictions and the batches' true values. It is defined as follows:

$$ARI=\frac{{\sum }_{ij}\binom{{n}_{ij}}{2}-\left[{\sum }_{i}\binom{{a}_{i}}{2}{\sum }_{j}\binom{{b}_{j}}{2}\right]/\binom{n}{2}}{\frac{1}{2}\left[{\sum }_{i}\binom{{a}_{i}}{2}+{\sum }_{j}\binom{{b}_{j}}{2}\right]-\left[{\sum }_{i}\binom{{a}_{i}}{2}{\sum }_{j}\binom{{b}_{j}}{2}\right]/\binom{n}{2}}$$

Where \({n}_{ij}\), \({a}_{i}\), \({b}_{j}\) are values from the contingency table.
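As a sanity check, the ARI formula above can be re-implemented compactly from the contingency table. This is a sketch using `math.comb` for the binomial coefficients; in practice `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))  # n_ij
    a = Counter(labels_true)                              # row sums a_i
    b = Counter(labels_pred)                              # column sums b_j
    index = sum(comb(nij, 2) for nij in contingency.values())
    sum_a = sum(comb(ai, 2) for ai in a.values())
    sum_b = sum(comb(bj, 2) for bj in b.values())
    expected = sum_a * sum_b / comb(n, 2)       # chance-adjustment term
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Identical clusterings (up to label permutation) score 1, while chance-level agreement scores around 0.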

## 5.2.2.3 Adjusted Mutual Information (AMI)

Mutual Information measures the information shared between two variables, in this case two discrete variables, and thus their dependence. As for ARI, values close to 1 indicate high batch clustering (a KNN classifier perfectly predicts the batch each sample belongs to), so high batch mixing is represented by values close to 0 (the batch predictions of a KNN classifier are no better than random). The variables compared are the batch predictions and the batches' true values.

$$E\left[MI\left(U,V\right)\right]={\sum }_{i=1}^{R}{\sum }_{j=1}^{C}{\sum }_{{n}_{ij}={\left({a}_{i}+{b}_{j}-N\right)}^{+}}^{\text{min}\left({a}_{i},{b}_{j}\right)}\frac{{n}_{ij}}{N}\text{log}\left(\frac{N\cdot {n}_{ij}}{{a}_{i}{b}_{j}}\right)\times \frac{{a}_{i}!{b}_{j}!\left(N-{a}_{i}\right)!\left(N-{b}_{j}\right)!}{N!{n}_{ij}!\left({a}_{i}-{n}_{ij}\right)!\left({b}_{j}-{n}_{ij}\right)!\left(N-{a}_{i}-{b}_{j}+{n}_{ij}\right)!}$$

Where \({n}_{ij}\), \({a}_{i}\), \({b}_{j}\) are values from the contingency table. \(R\) and \(C\) are the numbers of clusters in the two clusterings being compared, both over the same \(N\) elements.

**5.2.3 Quality control metrics**

Two of the datasets used in this study contain QC samples that were systematically analyzed with each batch of analyses. The features of that sample should always be the same, so we can calculate how much they diverge and use these metrics to measure the importance of the batch effect. These metrics were (to our knowledge) introduced by 18.

## 5.2.3.1 Average Pearson Correlation Coefficient (aPCC)

Because all QCs are aliquots of exactly the same sample, they should theoretically be perfectly correlated, i.e., have a perfect Pearson Correlation Coefficient (PCC) of 1. All QC samples are compared in pairs, so the final value is an average over all PCCs.
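A minimal stdlib-only sketch of aPCC over a list of QC feature vectors (the function names are illustrative, not the paper's code):

```python
from itertools import combinations
from statistics import mean, stdev

def pearson(x, y):
    """Sample Pearson correlation between two feature vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def apcc(qcs):
    """Average Pearson correlation over all pairs of QC samples."""
    return mean(pearson(x, y) for x, y in combinations(qcs, 2))
```

If batch correction works well, `apcc` should approach 1 on the QC samples.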

## 5.2.3.2 Normalized Median Euclidean Distance (nMED)

If batch correction is efficient, all QC samples should be very close to each other. The Euclidean distance is used to measure how far each pair of samples is from one another. Instead of using the average, as in 18, the median is used because it is less affected by aberrant values. Unlike 18, we also normalize the value by dividing it by the median Euclidean distance of all non-QC samples. If a transformation makes the QC samples very close to each other, but non-QC samples are equally close to each other, then the transformation did not actually alleviate the batch effect. For this reason, nMED should be preferred to the average Euclidean distance proposed in 26.
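The nMED described above can be sketched as follows (pure Python; the QC/non-QC split is assumed to be known in advance):

```python
import math
from itertools import combinations
from statistics import median

def med(samples):
    """Median pairwise Euclidean distance within a set of samples."""
    return median(math.dist(x, y) for x, y in combinations(samples, 2))

def nmed(qcs, non_qcs):
    """Median QC distance normalized by the median non-QC distance."""
    return med(qcs) / med(non_qcs)
```

Values well below 1 indicate QC samples are tighter than the rest of the data; a value near 1 means the correction compressed everything equally and did not specifically reduce batch effects.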

## 5.3 Batch Effect Removal methods

We tried normalizing each dataset using three different methods: minmax, standard and robust standardization. The first approach we tried to counter batch effects was to apply the same three normalization methods, but individually to each batch, which we named *minmax_per_batch*, *standard_per_batch* and *robust_per_batch*. The latter two have more potential to remove batch effects, as they transform the values into z-scores, forcing each batch to have a mean of 0 and unit variance. We also used ComBat 8 and Harmony 13, as they are popular methods for microarrays/RNAseq and scRNAseq, respectively. ComBat is also used to remove batch effects from LC-MS datasets 42. We used two implementations of ComBat: an R version https://rdrr.io/bioc/sva/man/ComBat.html and a Python version https://github.com/epigenelabs/pyComBat, which we named pycombat in this manuscript. Intriguingly, the two implementations had very different outcomes.
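As an illustration of the per-batch variants, here is a minimal sketch of *standard_per_batch* for a single feature (pure Python; the actual implementation operates on full feature matrices):

```python
from statistics import mean, pstdev

def standard_per_batch(values, batches):
    """z-score each batch independently, so every batch gets mean 0 and unit variance."""
    stats = {}
    for b in set(batches):
        col = [v for v, bb in zip(values, batches) if bb == b]
        stats[b] = (mean(col), pstdev(col))
    return [(v - stats[b][0]) / stats[b][1] for v, b in zip(values, batches)]
```

The minmax and robust variants follow the same pattern, substituting min/max scaling or median/IQR scaling for the mean/standard deviation.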

We also used WaveICA 25 and NormAE 18, as they are state-of-the-art methods in batch effect correction of LC-MS data. We used our own implementation of NormAE, which has a slightly different architecture: we reduced the number of layers to a single hidden layer before and after the bottleneck to make the comparison with our own models fair. Unlike NormAE, we consider the number of neurons in each layer to be hyperparameters that are optimized. To give it a fair chance to outperform our own methods, we also applied the same hyperparameter optimization as for our models.

## 5.4 Autoencoders

All the models in BERNN are implemented using PyTorch and are based on autoencoders. In short, autoencoders are composed of an encoder and a decoder. The encoder turns the inputs into embeddings (also referred to as the bottleneck of the autoencoder), which are usually, but not necessarily, smaller than the inputs. The objective of the autoencoder is to obtain new representations and reconstruct the original inputs from them as well as it can. The embeddings should then contain as much information as possible from the inputs, without the unnecessary noise. To find the best performing model on a given dataset, a total of 10 models can be trained using BERNN. For a complete representation of all the possible models that can be trained using BERNN, see Fig S1. Not all models are represented in the main results, for readability, but the complete results are available in Figures S4-S6.

**5.4.1 Reconstruction with batch mapping**

The first method that is implemented to obtain batch-free representations is to add to the embedding of the autoencoder a vector of the same size representing the batch ID (Fig S9). This was implemented in NormAE, although not mentioned in the paper 18. It makes it possible to obtain better reconstruction loss by adding the batch information into the vector for the reconstruction. Because the batch ID is contained in this vector added to the embedding, the latter does not need to contain information about the batch. A similar method is used to get batch-free representations in scRNAseq, such as in 14. In this case, the batch ID is directly appended to the bottleneck representation of the variational autoencoder.
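A minimal PyTorch sketch of this batch-mapping idea (class name and layer sizes are illustrative, not the paper's): a one-hot batch vector is projected to the embedding size and added to the bottleneck before decoding, so the embedding itself does not need to carry batch information.

```python
import torch
import torch.nn as nn

class BatchMappingAE(nn.Module):
    def __init__(self, n_features=100, n_batches=22, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, latent), nn.ReLU())
        # learned mapping from batch ID to a latent-sized vector
        self.batch_map = nn.Linear(n_batches, latent, bias=False)
        self.decoder = nn.Linear(latent, n_features)

    def forward(self, x, batch_onehot):
        z = self.encoder(x)                     # intended batch-free embedding
        z_b = z + self.batch_map(batch_onehot)  # batch info added back for reconstruction only
        return self.decoder(z_b), z
```

Downstream classifiers consume `z`, while the reconstruction loss is computed on the decoder output that had the batch vector added back in.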

**5.4.2 Domain Adversarial Neural Network (DANN)**

A Domain-Adversarial Neural Network (DANN) is a type of deep learning algorithm that enables a model trained on one domain to be adapted to another related domain with different characteristics, allowing it to perform better on the target domain. DANN achieves this by learning to extract domain-invariant features from the input data 28. In our case, we treat batches as different domains. The original DANN was developed to adapt the learning from a single domain to another one, so our work is more akin to 43, which extends domain adaptation to multiple domains. Our AE-DANN model is represented in Fig S1. The loss of AE-DANN is the following:

$$\min_{D,E}\min_{{F}_{b}}V\left(D,E,{F}_{b}\right)={loss}_{\text{rec}}\left(x,D\left(E\left(x\right),{y}^{\text{b}}\right)\right)+{\lambda }^{\text{b}}{loss}_{\text{disc}\_\text{b}}\left({F}_{\text{b}}\left(E\left(x\right)\right),{y}^{\text{b}}\right)$$

The \({loss}_{\text{disc}\_\text{b}}\) term is minimized, but it acts adversarially because of the Gradient Reversal Layer (GRL).

NormAE is also similar in nature to a DANN, except it does not use the GRL. Using the GRL is advantageous because the total loss is then composed of losses that are all minimized and added together. When not using the GRL, as in NormAE, the adversarial loss is maximized and subtracted from the other losses being optimized simultaneously.

$$\min_{D,E}\max_{{F}_{b}}V\left(D,E,{F}_{b}\right)={loss}_{\text{rec}}\left(x,D\left(E\left(x\right),{y}^{\text{b}}\right)\right)-{\lambda }^{\text{b}}{loss}_{\text{disc}\_\text{b}}\left({F}_{\text{b}}\left(E\left(x\right)\right),{y}^{\text{b}}\right)$$

Using this definition, if the second term of the equation becomes too large, the loss could become negative, which should not be allowed to happen.
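A standard GRL can be implemented in a few lines of PyTorch (a generic sketch, not the exact BERNN code): the layer is the identity on the forward pass and multiplies the gradient by -λ on the backward pass, so the batch-discriminator loss can simply be added to the total loss.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)  # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # reverse (and scale) the gradient flowing back into the encoder
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```

The batch discriminator is then trained on `grad_reverse(E(x))`: its own weights descend the discrimination loss, while the encoder receives the reversed gradient and learns batch-invariant embeddings.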

**5.4.3 Inverse Triplet Loss**

The inverse triplet loss is like the standard triplet loss (defined in the section Reverse Triplet Loss of the supplementary material), but the positive and negative samples are swapped: the negative samples take the place of the positive samples in the triplet loss equation, and vice versa.

$${\mathcal{L}}_{invTriplet}\left(A,P,N\right)=\max\left(\left\Vert f\left(A\right)-f\left(N\right)\right\Vert_{2}-\left\Vert f\left(A\right)-f\left(P\right)\right\Vert_{2}+\alpha ,0\right)$$

*A* is the anchor input, *P* is any positive input from the same batch as *A*, *N* is any negative sample from a different batch than *A*, *α* is the margin between positive and negative pairs and *f* is the embedding obtained by passing the inputs through the encoder of the autoencoder. Using the standard triplet loss would make samples from the same batch cluster together and different batches move far apart, with the distance between clusters controlled by the hyperparameter *α*. The inverse triplet loss does the opposite by swapping the *Positive* and *Negative* samples in the equation, which encourages batch-free representations: samples from different batches get closer, while samples from the same batch are pushed further apart. The latter objective is used to prevent all samples from collapsing; if samples from the same batch were not pushed apart, the loss would be optimal when all samples were mapped to the exact same value, which is not the desired outcome. The distance minimized in this case is the Euclidean distance, but any distance could be used.
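A minimal PyTorch sketch of this loss, assuming batched embedding tensors for the anchor, same-batch (P) and different-batch (N) samples (not the exact BERNN implementation):

```python
import torch
import torch.nn.functional as F

def inverse_triplet_loss(f_a, f_p, f_n, alpha=1.0):
    """Pull the anchor toward other-batch samples (N), push it from same-batch samples (P)."""
    d_an = F.pairwise_distance(f_a, f_n)  # anchor vs different-batch sample
    d_ap = F.pairwise_distance(f_a, f_p)  # anchor vs same-batch sample
    return torch.clamp(d_an - d_ap + alpha, min=0).mean()
```

The loss is zero once every anchor is at least `alpha` closer to its different-batch sample than to its same-batch sample, i.e., batches are well mixed in the embedding space.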

**5.4.4 Variational Autoencoders**

The variational autoencoder (VAE) is a probabilistic generative model based on the variational Bayes approach. To train a VAE, we want to optimize the lower bound defined in Eq. 3 of 44:

$$\mathcal{L}\left(\theta ,\varphi ;x\right)=-{D}_{KL}\left({q}_{\varphi }\left(z\mid x\right)\parallel {p}_{\theta }\left(z\right)\right)+{E}_{{q}_{\varphi }\left(z\mid x\right)}\left[\text{log}\,{p}_{\theta }\left(x\mid z\right)\right]$$

Where \({D}_{KL}\) is the Kullback-Leibler divergence, φ represents the variational parameters (encoder parameters) and θ the generative parameters (decoder parameters). The Kullback-Leibler divergence pushes the variational posterior \({q}_{{\upvarphi }}\left(z∣x\right)\) to resemble the prior \({p}_{{\theta }}\left(z\right)\), which is the unit normal distribution. Both the labels and batch classifiers use the reparametrized variable z as inputs. The VAE is also optionally combined with a DANN, which is likewise trained on z to make sure the new representations are free of batch effects. All the different AEs listed above have also been implemented as VAEs, including NormAE.

## 5.5 Training strategies

**5.5.1 Repetitive holdout**

Repetitive holdout is a method to evaluate the performance of a model on a dataset. It is similar to cross-validation; however, each split is random, there is no limit to the number of times it can be repeated on a dataset, and the test set is resampled for every holdout iteration. Resampling the test set is particularly important because some batches can be much easier to classify than others, which can make classification on the test set much better or much worse than on the validation set. Using repetitive holdout makes classification on the validation and test sets comparable.
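One holdout iteration with batch-level splitting can be sketched as follows (the function name and split sizes are illustrative; the paper's exact split proportions are not assumed here):

```python
import random

def batch_holdout_split(batches, n_valid=1, n_test=1, seed=None):
    """Assign whole batches to train/valid/test so no batch spans two splits."""
    rng = random.Random(seed)
    unique = sorted(set(batches))
    rng.shuffle(unique)
    test_b = set(unique[:n_test])
    valid_b = set(unique[n_test:n_test + n_valid])
    split = {"train": [], "valid": [], "test": []}
    for i, b in enumerate(batches):
        key = "test" if b in test_b else "valid" if b in valid_b else "train"
        split[key].append(i)
    return split
```

Repeating this with different seeds produces the repetitive-holdout iterations, each with a freshly resampled test set.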

When splitting the dataset, the samples from a given batch must all be contained in the same split. We do this to assess a model's ability to generalize and make predictions in a new batch.

**5.5.2 Class imbalance**

Class imbalance has a negative impact on the predictive abilities of machine learning models. If nothing is done about it, the model might learn to only predict the majority class, especially if the imbalance is very large or the problem is very hard to model. PyTorch's *WeightedRandomSampler* is used to counter class imbalance by giving more weight to samples from minority classes during training.
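A minimal sketch of this weighting scheme: each sample is weighted by the inverse frequency of its class before being handed to *WeightedRandomSampler*, so minority classes are drawn as often as majority ones in expectation (the helper name is illustrative).

```python
import torch
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels):
    """Inverse class-frequency weights for balanced minibatch sampling."""
    counts = Counter(labels)
    weights = torch.tensor([1.0 / counts[y] for y in labels], dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)
```

The returned sampler is passed to a `DataLoader` via its `sampler` argument in place of shuffling.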

**5.5.3 BERNN Hyperparameter Optimization**

We used the function *optimize* from the package *ax-platform* (https://pypi.org/project/ax-platform/) to perform a Bayesian optimization of the hyperparameters for each of the BERNN models (all implemented in PyTorch). We used 20 combinations of hyperparameters to optimize each model. The hyperparameters optimized are the following: learning rate, weight decay, dropout, number of warmup epochs, layer1, layer2, label smoothing parameter, triplet loss margin (when it applies), beta (controls the strength of KL divergence, when it applies), gamma (controls the strength of the adversarial or triplet loss, when it applies). The batch size is set to 32 and the number of epochs after warmup is set to 1000, but the training is stopped if no improvements are made in the last 100 epochs. Models were trained on Nvidia RTX3090 GPUs.

**5.5.4 Classification with non-BERNN representations**

For the classification of raw data, or any batch corrections that do not involve a DNN, we used either a Random Forest Classifier (RFC) or a Linear Support Vector Machine (LinearSVM), both from the python package *scikit-learn*. To deal with class imbalance in some datasets, the parameter *class_weight* was set to "balanced" for both models, which automatically adjusts weights inversely proportionally to class frequencies. The hyperparameters optimized for the RFC were *min_samples_split*, *min_samples_leaf*, *n_estimators*, *criterion*, and *oob_score*. For the LinearSVM, we optimized *tol*, *max_iter*, *penalty* and *C*. The hyperparameters for both models were optimized using the package *scikit-optimize* (https://scikit-optimize.github.io/stable/) to execute a Bayesian optimization.

**5.5.5 Model interpretability**

For model explanation purposes and to identify the most important features for the classification, we used SHAP 26 to produce an example analysis. The advantage of this approach is that it makes it possible to explain the decision made on individual samples, which could be used for precision medicine. It would allow the identification of complex patterns that apply only to a subset of the samples and could not be identified by differential analysis.