Training of the denoising autoencoder
We used a one-layer denoising autoencoder model comprised of an encoder and a decoder (Fig. 1). The encoder embeds the original input data into a lower-dimensional space, the hidden layer, and the decoder reconstructs the input from the values of the hidden layer. We tuned the parameters of the model using cross-validation by training on 70% of the sample and testing on the left-out 30%. Both the training and testing loss showed proper convergence (Fig S1), indicating that we largely avoided overfitting. We then projected the input data from all non-control input samples to the embedding space of the trained model, obtaining a set of 50-dimensional vectors.
Hidden units associate with TEA clusters
By encoding the original data into the hidden vector space, the model produced a sparse embedding space; moreover, we could observe distinct patterns related to clinical traits (Fig. 2a). Some hidden units showed approximately complementary behaviors (e.g., H26 and H38), indicating their distinct relevance to key molecular mechanisms and clinical subgroups of patients.
To better interpret the learned patterns, we evaluated the embeddings of all input data for their correlations with identified clinical traits. We first removed hidden units with variance < 0.001. Using the embeddings of the highly-variable hidden units, the samples formed several small clusters corresponding to the previously defined TEA clusters, with high homogeneity (Fig. 2a). It was found that TEA1 containing larger proportion of severe patients than TEA3 (Fig S2). TEA2 was relatively harder to distinguish because it was somewhat of an intermediate between TEA1 and TEA3, as some of the TEA2 samples were spread across several clusters.
Generally, the values of hidden units showed gradual monotonic changes, either increasing or decreasing, from TEA1 to TEA3 (Fig S3), indicating associations of the hidden units with clinical traits of the samples. We selected five hidden units (H26, H27, H36, H38, H45) that were significantly correlated with TEA cluster labels (Spearman correlation > 0.65), namely Hsig. Based on the performance of these five hidden units, we could further categorize them into two major classes: one that was negatively correlated with TEA cluster labels (H26, H27; Fig. 3b), and one that was positively correlated (H36, H38, H45; Fig. 3c).
Annotation of relevant hidden units
To further understand the biological significance of the hidden units, we tried to associate them with functional enrichment based on the weights of the encoder network that mapped the input to the hidden unit space. This weight represents the contribution of the gene to the value of the hidden unit, and can be considered the component of the hidden unit. Thus, we could infer biological significance of the hidden units from the weights of the encoder.
The weight distribution of the encoder layer showed similar patterns for hidden units belonging to the same set (Fig. 3a). Comparatively, the negative set showed a nearly symmetric distribution with a slight negative skew, whereas the positive set showed a highly positively-skewed shape.
Using encoder weights for all 22,148 genes as the ranking score, we performed gene set enrichment analysis using KEGG pathway gene sets to obtain functional enrichments of Hsig. Similar to the weight distribution, hidden units from the same set showed strong resemblance with respect to enrichment of functional terms (Fig S4). Notably, many of the enriched functional terms were related to molecular functions and signaling pathways associated with disease pathogeny and autoimmune response, whereas deficient terms included “gene expression machinery” and “metabolism”. Specifically, the enrichment of the term “asthma” showed high significance in all five hidden units (Fig. 3b, Fig S5).
We then extracted the top-weighted genes of Hsig as potential clinical markers for downstream analysis. For each hidden unit, we extracted the genes with top 200 highest weights; we removed ribosome-related genes from the analysis to prevent ribosome functional roles from dominating the selected gene set. Concordant with previous observations, the gene sets for hidden units belonging to the same class were highly similar. By merging the top-weighted genes from all five hidden units, we obtained a list of 330 genes. We identified several genes that showed distinct differential patterns of weights between the positive and negative set (Fig. 3c). GO analysis revealed an enrichment of disease-related terms like “antigen processing and presentation”, “immune response” and “interferon-gamma signaling pathway” (Fig S6).
Prediction of asthma severity
In order to assess the clinical relevance of the learned model, we investigated the association between hidden unit values and clinical data. Generally, the performance of hidden units showed distinct correlations with various clinical traits (Fig S7). We then tested the predictive performance of some clinical traits using the embedding values of Hsig and the expression of the combined list of their top-weighted genes.
We first used the value of all hidden units, Hsig and the top-weighted genes to predict the asthma severity levels of the patients. We only used samples labeled as “mild” or “severe” for the classification analysis. Given each training dataset, we trained a random forest classifier and assessed the predictive accuracy on the test data, represented by the AUROC value (Fig. 4a).
From the top-weighted genes, we then performed feature selection to obtain the most relevant genes. We considered 50 genes with the top average importance as the most relevant to the prediction of severity (Fig. 4b, Table S1). Expression of these genes reached an average AUROC of 0.8066 (0.6194 for random selection of 50 genes from top-weighted gene list) in predicting severity level. Comparatively, this value is higher than for a set of differentially expressed genes between asthma patients and control samples identified by Yan et al. (average AUROC of 0.5448; Fig. 4a). We hypothesize that asthma-associated genes, identified from case-control differential expression analysis, may represent a collective set of genes in general pathogenic pathways. As pathogenicity and transition to severe phenotypes may involve different mechanisms, they may not be sufficient to fully explain the causes of severe asthma for different subgroups.
Our list of selected genes included genes related to autoimmune responses, such as antigen processing and presentation, T-cell toxicity, interferon signaling and cell adhesion (Fig. 4c). In the context of a protein-protein interaction network, we were able to identify genes residing in various functional modules (Fig. 4d). Specifically, several genes, such as human leukocyte antigen genes and cathepsin L1, belonged to a module related to immune response. In addition, the list included genes with potential relevance to asthma pathogeny, such as immunoglobulin heavy chain genes (IGHG1, IGHA1 and IGHV3-48), inflammation (S100A9) (13) and iron response (IREB2) (14, 15).
Prediction of clinical measurements
To evaluate an individual’s lung function, clinicians use the FEV1/FVC ratio, which is the ratio of the volume forcefully exhaled in a second versus the maximum volume of a forceful exhale (16). A significantly reduced FEV1/FVC ratio is a sign of airflow obstruction, and is considered a criterion for asthma severity.
We found that the hidden units in Hsig generally showed a higher absolute value of correlation with the FEV1/FVC ratio, in both the positively and negatively correlated subsets, compared to other hidden units, indicating clinical relevance of these hidden units (Fig. 5a).
Finally, we tried to predict the pre- and post-treatment FEV1/FVC ratios with the value of hidden units and expression of their top-weighted genes using a support vector machine and LASSO (Fig. 5b, c, Table S2, S3 and Fig S8, S9). The selected genes achieved the highest predictive performance, in terms of both mean squared error and explained variance. Together, these results show that the selected hidden units and genes are clinically relevant and could be used as markers for asthma severity.