## Data acquisition and data preprocessing

## An extensive compendium of transcriptomic profiles (GPL570 dataset)

Publicly available raw gene expression data generated with the Affymetrix HG-U133 Plus 2.0 microarray platform were obtained from the Gene Expression Omnibus (GEO; platform accession GPL570) [10]. Preprocessing and aggregation of the raw data were conducted with the robust multi-array average algorithm available in Analysis Power Tools (release 2.11.3). The mapping of probesets to genes and the quality control procedures are described in the Supplementary Note. This comprehensive collection of transcriptomic profiles is referred to as the GPL570 dataset.

For predictive analysis, we selected transcriptomic datasets with available biological phenotypes from GEO. All selected datasets were preprocessed in the same manner as the GPL570 dataset.

## Training of the Autoencoder

An autoencoder (AE) is a type of unsupervised deep neural network used for various tasks, including feature learning, dimensionality reduction, data denoising, and classification [5]. The encoder network learns to reduce the dimensionality of the input data by mapping it to a latent space, thereby generating a new representation with a limited number of variables, referred to as the latent representation. The decoder network learns to reconstruct the samples in the input data from their latent representations with minimal loss of information.

In our study, the encoder maps the GPL570 dataset, consisting of 139,786 samples and 19,863 genes, to a latent representation with 1,024 latent variables. The gene expression levels serve as the input variables for the encoder. The decoder attempts to reconstruct the samples' gene expression levels from their latent representations. A schematic representation of the AE can be found in Supplementary Fig. 1, and a detailed description of hyperparameters and layers is provided in Supplementary Table 1.

After randomly shuffling the samples in the GPL570 dataset, we divided it into training (70%, n = 97,850), validation (15%, n = 20,968), and test (15%, n = 20,968) sets. AE training minimizes the difference between the input and the reconstructed data; we used the mean squared error (MSE) loss to train the encoder and decoder parameters, employing the Ranger optimizer [11]. The MSE quantifies reconstruction performance from the sample perspective. In addition, we calculated a metric that gauges reconstruction performance from the gene perspective. First, we computed the Pearson correlation between all gene pairs in the input data, yielding a *p*-by-*p* correlation matrix of which we retained the upper triangle. Second, we computed the same triangular correlation matrix for the reconstructed data. We then took the absolute difference between the input and reconstructed correlations for each gene pair and, after sorting these absolute differences in ascending order, obtained the 95th percentile, referred to as the R-difference95th. The closer the R-difference95th is to zero, the better the reconstructed data capture the gene-by-gene correlation structure present in the input data.
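For illustration, the R-difference95th can be computed as follows (a minimal Python/NumPy sketch; the function name is ours, and both input arrays are assumed to be samples-by-genes):

```python
import numpy as np

def r_difference_95th(x, x_hat):
    """Gene-perspective reconstruction metric: the 95th percentile of absolute
    differences between gene-gene Pearson correlations computed on the input
    (x) and reconstructed (x_hat) data. Both arrays are samples x genes."""
    # Gene-by-gene correlation matrices (genes are the columns)
    r_in = np.corrcoef(x, rowvar=False)
    r_rec = np.corrcoef(x_hat, rowvar=False)
    # Keep each gene pair once (strict upper triangle)
    iu = np.triu_indices_from(r_in, k=1)
    abs_diff = np.abs(r_in[iu] - r_rec[iu])
    return np.percentile(abs_diff, 95)
```

A perfect reconstruction yields an R-difference95th of exactly zero.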

## Training of the Adversarial Variational Autoencoder

An Adversarial Variational Autoencoder (AVAE) is a type of deep neural network that enhances the capabilities of a standard AE by incorporating adversarial training and imposing a constraint on the latent distribution [12]. This approach can result in more biologically meaningful representations in the latent space and may allow, for example, the generation of new transcriptomic profiles with similar properties by adding noise to the latent representation [13]. A schematic representation of the AVAE and a detailed description of its hyperparameters and layers are provided in Supplementary Fig. 2 and Supplementary Table 2, respectively.

To train and evaluate the AVAE, we used the same GPL570 training, validation, and test sets as with the AE. During the training process, the AVAE aims to encode the samples' gene expression levels into a latent representation and to reconstruct them from this latent representation. In our AVAE, we imposed constraints that force the latent variables to conform to a Gaussian prior distribution. We employed the Ranger optimizer, using parameters identical to those in the AE training, to minimize both the MSE for reconstruction performance and the Kullback-Leibler divergence (KL) to regulate the distribution of the latent variables [14].

The calculation of loss for the encoder and decoder is as follows: The encoder loss is the sum of the MSE loss of the reconstructed data and the KL loss of the generated latent representation. The decoder loss is the sum of the discriminator loss for the reconstructed data, the discriminator loss for the generated data, and the MSE loss of the reconstructed data.
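As an illustration of these loss terms, the following NumPy sketch assumes a diagonal-Gaussian latent representation parameterized by a mean and a log-variance per latent variable, and discriminator outputs expressed as probabilities; the function names are ours, and the actual implementation may differ in detail:

```python
import numpy as np

def mse_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def kl_loss(mu, log_var):
    # KL divergence between N(mu, sigma^2) and the standard-normal prior,
    # summed over latent variables and averaged over samples
    return np.mean(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1))

def encoder_loss(x, x_hat, mu, log_var):
    # Encoder: reconstruction MSE + KL regularization of the latent space
    return mse_loss(x, x_hat) + kl_loss(mu, log_var)

def decoder_loss(x, x_hat, d_rec, d_gen):
    # Decoder: discriminator losses on reconstructed and generated profiles
    # (here taken as -log of the discriminator's "real" probability)
    # plus the reconstruction MSE
    eps = 1e-8
    return (-np.mean(np.log(d_rec + eps))
            - np.mean(np.log(d_gen + eps))
            + mse_loss(x, x_hat))
```

When the latent representation matches the prior exactly (zero mean, unit variance), the KL term vanishes.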

## Consensus Independent Component Analysis on the GPL570 dataset

Consensus Independent Component Analysis (c-ICA) was performed to segregate bulk transcriptomic profiles into statistically independent transcriptional components (TCs), as previously described in more detail [15].

In brief, applying c-ICA to a transcriptomic dataset with *p* genes and *n* samples yields *i* transcriptional components, each of dimension 1×*p*. Each TC captures the transcriptional pattern of an underlying process (e.g., a biological process or cellular state). A TC comprises *p* scalars, representing both the direction and magnitude of the effect of the underlying process on each gene's expression level. In addition to the TCs, c-ICA also outputs a mixing matrix (MM) of dimension *i*×*n*, containing a coefficient for each TC-sample pair. Taking the inner product between an individual sample's vector of coefficients in the MM and each gene's vector of scalars across all TCs reconstructs a transcriptomic profile that closely resembles the input profile. Thus, the coefficients in the MM can be interpreted in a similar manner to the latent representations derived from the AE and AVAE.
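The relationship between the TCs, the MM, and the reconstructed profiles can be sketched as follows (toy dimensions; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, i = 200, 30, 5               # genes, samples, components (toy sizes)

S = rng.normal(size=(i, p))         # TCs: i components x p genes
M = rng.normal(size=(i, n))         # mixing matrix: i components x n samples

# Each reconstructed profile is a weighted sum of the TCs, with the
# weights taken from that sample's column of the mixing matrix
X_hat = S.T @ M                     # p genes x n samples

# A sample's column of MM coefficients acts as its latent representation
latent_sample0 = M[:, 0]
```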

Initially, a preprocessing technique called whitening is applied to the input dataset to accelerate the convergence of the ICA algorithm. This involves conducting PCA on the sample-by-sample covariance matrix. The number of principal components capturing at least 90% of the total variance in the GPL570 dataset serves as the input for the c-ICA. Because the optimization algorithm is initialized randomly, 25 ICA runs were conducted, and only TCs that were identified consistently across multiple runs were selected using a consensus approach. For parameters and details on performing the c-ICA algorithm on the GPL570 dataset, please refer to the Supplementary Note. The c-ICA yields TCs, each representing a robust, statistically independent, and distinct transcriptional pattern of an underlying process, along with a mixing matrix that describes the latent representations of the transcriptomes.
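The whitening step can be sketched as follows (a simplified SVD-based implementation with a 90% variance cutoff; the actual c-ICA preprocessing is described in the Supplementary Note):

```python
import numpy as np

def whiten(X, var_explained=0.90):
    """PCA whitening of a genes x samples matrix, keeping the fewest
    components that together explain at least `var_explained` of the
    total variance. Returns the whitened data and the component count."""
    Xc = X - X.mean(axis=1, keepdims=True)            # center each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / np.sum(s**2)                         # variance per component
    k = int(np.searchsorted(np.cumsum(var), var_explained) + 1)
    # Project onto the k components and rescale to unit variance
    Z = (U[:, :k].T @ Xc) / s[:k, None] * np.sqrt(Xc.shape[1])
    return Z, k
```

After whitening, the retained components are uncorrelated with unit variance, which is what speeds up ICA convergence.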

## Dimensionality reduction methods applied to single datasets

Various dimensionality reduction methods were applied to each gene expression dataset. For reference comparison, all datasets were also included without any dimensionality reduction. These datasets, which retain the full set of genes and thus a high feature dimensionality, are hereafter referred to as gene-level datasets. The gene-level datasets served as input for the dimensionality reduction methods described below.

## Principal Component Analysis

PCA is a widely used method for reducing the dimensionality of gene expression data [3]. In each gene-level dataset, each gene's expression levels were standardized to a mean of 0 and a standard deviation of 1. The sample-by-sample correlation matrix was then calculated, and its eigenvectors and eigenvalues were determined. The pseudo-inverse of the eigenvector matrix yields activity scores for each principal component (PC) in each sample. These activity scores, capturing 100% of the variance observed in the gene-level dataset, are referred to as the single PCA latent representation.
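One possible NumPy rendering of this procedure is sketched below (the exact conventions of the authors' implementation may differ):

```python
import numpy as np

def single_pca_latent(X):
    """X: genes x samples. Returns activity scores (PCs x samples),
    a sketch of the single PCA latent representation."""
    # Standardize each gene to mean 0 and standard deviation 1
    Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    # Sample-by-sample correlation matrix and its eigendecomposition
    C = np.corrcoef(Z, rowvar=False)          # n x n
    w, V = np.linalg.eigh(C)                  # eigenvectors in columns of V
    V = V[:, np.argsort(w)[::-1]]             # order PCs by decreasing eigenvalue
    # Pseudo-inverse of the eigenvector matrix -> per-sample PC activity scores
    return np.linalg.pinv(V)
```

Because all eigenvectors are retained, the scores capture 100% of the variance in the dataset.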

## Consensus Independent Component Analysis

c-ICA was trained independently on each of the gene-level datasets. For every dataset, the number of principal components that accounted for 100% of the dataset variance during the whitening step was used as the input for ICA. Only TCs that were identified multiple times across 50 runs were selected, following a consensus approach (see the Supplementary Note for details). The resulting mixing matrix, which comprises the activity scores for each TC across all samples, is referred to as the single ICA latent representation.

## Autoencoder

We utilized each gene-level dataset as input for the AE network that had been pre-trained on the GPL570 dataset. Our objective in applying transfer learning was dimensionality reduction rather than the transfer of specific phenotypic information. To achieve a compact representation, we passed the datasets through the encoder layers of the AE. The resulting representation, comprising 1,024 latent variables, is referred to as the AE latent representation.

## Adversarial Variational Autoencoder

Each gene-level dataset was individually used as input for the trained AVAE network. The datasets were processed through both the shared and the specific encoder networks, resulting in a representation with 256 latent variables (refer to Supplementary Fig. 2 for details), consisting of vectors representing the mean and the standard deviation. The mean vector, comprising 128 latent variables per sample, is referred to as the AVAE latent representation.

## c-ICA transformation obtained with the GPL570 dataset

Each gene-level dataset was projected into its latent representation using the c-ICA model that had been trained on the GPL570 dataset. This involved calculating the pseudo-inverse of the independent component matrix and multiplying it by the matrix of the gene-level dataset [16]. The resulting projected mixing matrix for each dataset contains activity scores for 3,286 independent components per sample, which serve as latent variables, and is referred to as the GPL570 ICA latent representation.
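The projection can be sketched as follows (toy dimensions; with noise-free mixtures the original mixing coefficients are recovered exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, i = 500, 20, 10                 # genes, samples, components (toy sizes)
S = rng.normal(size=(i, p))            # TCs estimated on the reference dataset

# Toy gene-level dataset whose profiles are exact mixtures of the TCs
M_true = rng.normal(size=(i, n))
X = S.T @ M_true                       # p genes x n samples

# Projection: pseudo-inverse of the component matrix times the data matrix
M_proj = np.linalg.pinv(S.T) @ X       # i x n projected mixing matrix
```

For real data the profiles are not exact mixtures, and the projection yields the least-squares mixing coefficients instead.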

In summary, we generated five latent representations for each gene-level dataset. First, we applied PCA and c-ICA individually to each dataset to create the 'single PCA' and 'single ICA' latent spaces. Second, we leveraged a transfer learning approach using models trained on the GPL570 dataset (an AE, an AVAE, and a c-ICA model), yielding three additional latent spaces: 'AE', 'AVAE', and 'GPL570 ICA'. Together with the original gene-level data, these six representations served as the foundation for subsequent predictive modeling.

## Predictive modeling

## Regularization techniques

In predictive modeling, we used logistic regression combined with three regularization techniques (Lasso, Ridge, and Elastic Net) for phenotype prediction [7–9, 17]. These techniques are designed to handle correlated predictors and yield more stable models. Lasso produces sparse models by setting the coefficients of some predictors exactly to zero. Ridge shrinks coefficients toward zero, reducing the impact of multicollinearity. Elastic Net combines the strengths of both. These techniques help identify informative input predictors with minimal inter-correlation for effective prediction. Detailed information, including all parameters, can be found in the Supplementary Note.
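The three techniques differ only in the penalty term added to the logistic loss, which can be sketched as follows (λ denotes the regularization strength and α the Lasso/Ridge mixing weight; both names are illustrative):

```python
import numpy as np

def lasso_penalty(w, lam):
    # L1: drives some coefficients exactly to zero -> sparse models
    return lam * np.sum(np.abs(w))

def ridge_penalty(w, lam):
    # L2: shrinks coefficients toward (but not to) zero
    return lam * np.sum(w ** 2)

def elastic_net_penalty(w, lam, alpha):
    # Convex combination of L1 and L2 (alpha = 1 -> Lasso, alpha = 0 -> Ridge)
    return lam * (alpha * np.sum(np.abs(w)) + (1 - alpha) * np.sum(w ** 2))
```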

## Evaluation metrics for performance

The Matthews correlation coefficient (MCC) was selected as the performance metric for the models [18]. For binary labels, the MCC is a discrete version of the Pearson correlation coefficient and can be interpreted in a similar manner [19]. For each dataset and representation type (i.e., gene-level data and the five latent representations), the highest MCC value obtained across the three regularization techniques (Lasso, Ridge, and Elastic Net) was chosen as the final MCC score.
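For binary labels, the MCC can be computed from the confusion matrix as follows (a minimal sketch; here an undefined denominator is mapped to 0):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```

Like the Pearson correlation, the MCC ranges from −1 (total disagreement) through 0 (chance-level prediction) to 1 (perfect prediction).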

## Cross-validation

To validate the performance of the predictive models, *k*-fold cross-validation (CV) was used. First, the samples in a dataset were divided into *k* folds in a stratified manner based on the phenotypic labels. Subsequently, *k*−1 of these folds were used for training, while the remaining fold served as the test set. After completing the CV, the overall model performance was calculated using the aggregated predicted labels from all folds. CV was conducted in two ways: first with *k* = 10, resulting in a 90% training and 10% testing split; and second, in a reversed scheme with *k* = 5 (training on one fold and testing on the remaining four), leading to a 20% training and 80% testing split, to examine the models' predictive behavior when trained on a smaller sample size.
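A stratified fold assignment can be sketched as follows (a simplified round-robin scheme; the authors' exact splitting procedure may differ):

```python
import numpy as np

def stratified_folds(labels, k, seed=0):
    """Assign each sample to one of k folds while keeping the label
    proportions approximately equal across folds."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = rng.permutation(np.where(labels == cls)[0])
        # Deal the shuffled within-class indices round-robin over the folds
        fold[idx] = np.arange(len(idx)) % k
    return fold
```

With *k* = 10, nine folds train and one tests; in the reversed *k* = 5 scheme, one fold trains and the remaining four test.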

## Permutation test

A phenotype permutation test with 200 permutations was conducted to evaluate the significance of a predictive model's performance. Prior to carrying out the CV, the phenotypic labels were randomly shuffled. For each reshuffling, a cross-validated MCC was calculated with the predictive model, thereby generating a null distribution of the model's predictive performance. The statistical significance for testing the null hypothesis that there is no association between the input predictors and the phenotypic label was indicated by a p-value, defined as the proportion of permutations yielding an MCC equal to or better than the MCC obtained from the actual data. We refer to this combination of CV and permutation testing as the CV-permutation test.
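The permutation p-value as defined above can be sketched as follows (`score_fn` stands in for the full CV procedure returning a cross-validated MCC; the name is ours):

```python
import numpy as np

def cv_permutation_p(observed_mcc, labels, score_fn, n_perm=200, seed=0):
    """score_fn(labels) -> cross-validated MCC for a given label vector.
    Returns the proportion of permutations whose MCC is at least as high
    as the observed one."""
    rng = np.random.default_rng(seed)
    null = np.array([score_fn(rng.permutation(labels)) for _ in range(n_perm)])
    return float(np.mean(null >= observed_mcc))
```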

## Regularization technique comparison

We conducted a paired-samples Wilcoxon test to assess whether one regularization technique significantly outperforms the others when applied with a specific dimensionality reduction method. For each dimensionality reduction method, the performance differences between regularization techniques were tested across 30 datasets. The resulting p-values were adjusted for multiple testing using the Bonferroni correction.

## Robustness of input predictors

To evaluate the robustness of predictor selection within the prediction models, we used a cross-validation approach focused specifically on Lasso regularization, since Lasso, unlike Ridge and Elastic Net, can set to zero the coefficients of predictors that do not improve predictive performance. For each dataset, the samples were split in a stratified manner into ten folds, nine of which were used as training data for Lasso. Predictors with non-zero coefficients in the trained Lasso model were selected. This process was repeated in 20 CV runs, tallying the occurrences of each input predictor, which could occur a maximum of 200 times. Because the number of predictors varies across representations, we used the proportion of occurrence for each predictor, calculated as the number of times a predictor had a non-zero coefficient divided by the total number of predictors that had a non-zero coefficient in at least one CV run. The robustness of the predictor selection was then quantified by calculating the area under the curve for the proportion of predictors relative to the number of runs (AUC of proportion). A higher AUC of proportion indicates greater robustness in predictor selection.
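One possible reading of the AUC-of-proportion computation is sketched below (the exact definition of the curve may differ from the authors' implementation; here the curve tracks the proportion of ever-selected predictors chosen in at least a given number of runs):

```python
import numpy as np

def auc_of_proportion(selected):
    """selected: boolean matrix (n_runs x n_predictors), True when a predictor
    had a non-zero Lasso coefficient in that CV run. Returns an AUC in [0, 1];
    higher values indicate more consistent predictor selection."""
    n_runs = selected.shape[0]
    counts = selected.sum(axis=0)                 # occurrences per predictor
    ever = counts > 0                             # selected in at least one run
    # Proportion of ever-selected predictors chosen in at least r runs
    curve = np.array([(counts[ever] >= r).mean() for r in range(1, n_runs + 1)])
    # Trapezoidal area, normalized so that "always selected" gives 1.0
    return float(np.sum((curve[:-1] + curve[1:]) / 2) / (n_runs - 1))
```

A predictor set that is re-selected in every run yields an AUC of 1.0, whereas unstable, one-off selections pull the AUC toward 0.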