Training the denoising autoencoder
We used a one-layer denoising autoencoder model comprised of an encoder and a decoder (Fig 1). The encoder embeds the original input data into a lower-dimensional space, the hidden layer, and the decoder reconstructs the input from the values of the hidden layer. We tuned the parameters of the model using cross-validation by training on 90% of the randomly selected sample and testing on the remaining 10%, and repeated this process 10 times. Both the training and testing loss showed proper convergence (Fig S1, S2), indicating that we largely avoided overfitting. We then projected the input data from all non-control input samples to the embedding space of the trained model, obtaining a set of 50-dimensional vectors.
Hidden units associate with TEA clusters
By encoding the original data into the hidden vector space, the model produced a sparse embedding space; moreover, we could observe distinct patterns related to clinical traits (Fig 2a). Some hidden units showed approximately complementary behaviors (e.g., H26 and H38), indicating their distinct relevance to key molecular mechanisms and clinical subgroups of patients.
To better interpret the learned patterns, we evaluated the embeddings of all input data for their correlations with identified clinical traits. We first removed hidden units with variance < 0.001. Using the embeddings of the highly variable hidden units, the samples formed several small clusters corresponding to the previously defined TEA clusters, with high homogeneity (Fig 2a). We found that TEA1 contained a larger proportion of severe patients than TEA3 (Fig S4). TEA2 was relatively harder to distinguish because it was somewhat of an intermediate between TEA1 and TEA3, as some of the TEA2 samples were spread across several clusters.
Generally, the values of hidden units showed gradual monotonic changes, either increasing or decreasing, from TEA1 to TEA3 (Fig S5), indicating associations of the hidden units with clinical traits of the samples. We selected five hidden units (H26, H27, H36, H38, H45) that were significantly correlated with TEA cluster labels (Spearman correlation > 0.65), namely Hsig. Based on the performance of these five hidden units, we further categorized them into two major classes: one that was negatively correlated with TEA cluster labels (H26, H27; Fig 3b) and one that was positively correlated (H36, H38, H45; Fig 3c)
Annotation of relevant hidden units
To further understand the biological significance of the hidden units, we tried to associate them with functional enrichment based on the weights of the encoder network that mapped the input to the hidden unit space. This weight represents the contribution of the gene to the value of the hidden unit, and can be considered the component of the hidden unit. Thus, we could infer biological significance of the hidden units from the weights of the encoder.
The weight distribution of the encoder layer showed similar patterns for hidden units belonging to the same set (Fig 3a). The negative set showed a nearly symmetric distribution with a slight negative skew, whereas the positive set showed a highly positively skewed shape.
Using encoder weights for all 22,148 genes as the ranking score, we performed gene set enrichment analysis using KEGG pathway gene sets to obtain functional enrichments of Hsig. Similar to the weight distribution, hidden units from the same set showed strong resemblance with respect to enrichment of functional terms (Fig S6). Notably, many of the enriched functional terms were related to molecular functions and signaling pathways associated with disease pathogenesis and autoimmune response, whereas deficient terms included “gene expression machinery” and “metabolism”. Specifically, the enrichment of the term “asthma” showed high significance in all five hidden units (Fig 3b, Fig S7).
We then extracted the top-weighted genes of Hsig as potential clinical markers for downstream analysis. For each hidden unit, we extracted the top 200 genes with the highest weights; we removed ribosome-related genes from the analysis to prevent ribosome functional roles from dominating the selected gene set. Concordant with previous observations, the gene sets for hidden units belonging to the same class were highly similar. By merging the top-weighted genes from all five hidden units, we obtained a list of 330 genes. We identified several genes that showed distinct differential patterns of weights between the positive and negative set (Fig 3c). Gene ontology (GO) analysis revealed an enrichment of disease-related terms like “antigen processing and presentation”, “immune response” and “interferon-gamma signaling pathway” (Fig S8).
Prediction of asthma severity
In order to assess the clinical relevance of the learned model, we investigated the association between hidden unit values and clinical data. Generally, the performance of hidden units showed distinct correlations with various clinical traits (Fig S9). We then tested the predictive performance of some clinical traits using the embedding values of Hsig and the expression of the combined list of their top-weighted genes.
We first used the value of all hidden units, Hsig and the top-weighted genes to predict the asthma severity levels of the patients. We only used samples labeled as “mild” or “severe” for the classification analysis. Given each training dataset, we trained a random forest classifier and assessed the predictive accuracy on the test data, represented by the area under the receiver operating characteristic (AUROC) value (Fig 4a).
From the top-weighted genes, we then performed feature selection to obtain the most relevant genes. We considered 50 genes with the top average importance as the most relevant to the prediction of severity (Fig 4b, Table S1). Expression of these genes reached an average AUROC of 0.8066 (0.6194 for random selection of 50 genes from the top-weighted gene list) in predicting severity level. This value is higher than for the set of differentially expressed genes between asthma patients and control samples identified by Yan et al. (average AUROC of 0.5448; Fig 4a). We hypothesize that asthma-associated genes, identified from case-control differential expression analysis, may represent a collective set of genes in general pathogenic pathways. As pathogenicity and transition to severe phenotypes may involve different mechanisms, these factors may not be sufficient to fully explain the causes of severe asthma for different subgroups.
We also compared our selected genes with differentially expressed genes (DEGs) related to severity (see Methods for details). In terms of predicting asthma severity, the DEGs showed higher performance than our selected genes (average AUROC of 0.9632; Fig 4a). There is no overlapping between our selected genes with top DEGs, but direct connections exist between these two groups in the gene network from STRING (string-db.org) (Fig S10). Moreover, network analysis showed that our selected genes have higher interconnectivity and higher centrality in the network (Wilcoxon test, p-value = 0.018) compared to the DEGs, indicating their more central roles (Fig S10, Fig S11). Thus, as these genes are selected from the hidden units corresponding to functional gene groups, we believe that they may be more functionally important in the regulatory network despite performing lower for predicting severity than severity-related genes.
Our list of selected genes included genes related to autoimmune responses, such as antigen processing and presentation, T-cell toxicity, interferon signaling and cell adhesion (Fig 4c). In the context of a protein-protein interaction network, we identified genes residing in various functional modules (Fig 4d). Specifically, several genes, such as human leukocyte antigen genes and cathepsin L1, belonged to a module related to immune response. In addition, the list included genes with potential relevance to asthma pathogenesis, such as immunoglobulin heavy chain (IGHG1, IGHA1 and IGHV3-48), inflammation (S100A9) (13) and iron response (IREB2) (14, 15) genes.
Prediction of clinical measurements
To evaluate an individual’s lung function, clinicians use the ratio of the volume forcefully exhaled in one second versus the maximum volume of a forceful exhale (i.e., the FEV1/FVC ratio) (16). A significantly reduced FEV1/FVC ratio is a sign of airflow obstruction, and is considered a criterion for asthma severity.
We found that the hidden units in Hsig generally showed a higher absolute value of correlation with the FEV1/FVC ratio, in both the positively and negatively correlated subsets, compared to other hidden units, indicating clinical relevance of these hidden units (Fig 5a).
Finally, we tried to predict the pre- and post-treatment FEV1/FVC ratios with the value of hidden units and expression of their top-weighted genes using a support vector machine and LASSO (Fig 5b, c, Table S2, S3 and Fig S12, S13). The selected genes achieved the highest predictive performance, in terms of both mean squared error and explained variance. Together, these results suggest that the selected hidden units and genes are clinically relevant and could be used as markers for asthma severity.