More robust machine learning strategy improves biomarker performance
In this study, we implemented a new machine learning strategy for IA biomarker discovery, which consisted of a larger dataset (94 training, 40 testing), LASSO for feature selection, and more robust algorithms, K-Nearest Neighbor, Random Forest, and Support Vector Machine with cubic and Gaussian kernels. Our larger dataset and LASSO feature selection led to a new panel of 37 genes to use in IA predictive models. Two genes of these 37 genes, C1QL1 and TGS1, were also in our previously-discovered 26-gene panel. The new learning algorithms trained using the 37 genes all performed very well in the testing cohort with accuracies of 0.83–0.90 and AUCs of 0.95–0.99, a marked increase over our previous algorithms. Interestingly, all 4 new models had an NPV of 1, indicating that in the testing dataset there were no false negatives. This may be important for future applications of these biomarkers as a prescreen, since false negatives would be particularly deleterious.
To examine how the increased sample size and improved algorithms affected model performance, we retrained the previous 26-gene panel using the new algorithms in the current, larger dataset. The performance of the retrained models in the testing set (n = 40) using the 26-gene panel improved from our previous study with accuracies ranging from 0.83–0.93 and AUCs of 0.84–0.97. Despite this increase in performance using the new algorithms, models using the previously identified 26 genes still fell short of those using the newly identified 37 genes; the average testing AUC using 26 genes was 0.91 compared to 0.97 when using 37 genes. This suggests that the 37 features identified by LASSO are more reliable for IA prediction than the 26 selected by filtering in our last study.
We believe that improved IA prediction can be attributed to our increased sample size, which afforded several advantages. First, it allowed us to use LASSO to identify features instead of simple filtering methods. Thresholding filters like we used in our last study consider each gene independently, which can neglect groups of genes that function together in pathophysiologic mechanisms and could be useful as a biomarker. Filtering methods can also select highly correlated, redundant genes, which can increase the number of features required to make accurate predictions. HSIC LASSO, a nonlinear feature selection method, overcomes these issues and identifies combinations of non-redundant genes with strong dependence on disease status. Implementing LASSO in the training dataset identified 37 unique IA-associated genes, two of which (C1QL1, TGS1) had also been identified as part of the 26-gene panel in our past expression profiling study.17 The identification of non-redundant features may be one reason why the biomarkers created in this study outperform our past efforts, as some of the 26 features (with the exception of C1QL1 and TGS1) may have ultimately been uninformative for classification.
Secondly, a larger sample size also enabled us to leverage more complex machine learning models, namely Support Vector Machine and Random Forests which perform better in larger datasets.45 In our previous effort we did implement Support Vector Machine, but only achieved a testing accuracy of 0.70, possibly because the training dataset contained only 30 patients.17 In this this larger study we were able to achieve an accuracy of 0.85 for Support Vector Machine (Gaussian kernel). Nevertheless, we found that in our data Random Forest consistently performed the best, with a testing accuracy of 0.85 and AUC of 0.99. Both Random Forests and K-Nearest Neighbors are weighted neighbors schemes. However, the K-Nearest Neighbors algorithm may have had poorer performance because this classifier simply uses the training data for prediction instead of learning a discriminative rule. The performance of the K-Nearest Neighbors classifier is reliant on the quality of the training data, which in the case of transcriptomes derived from human samples may be noisy. However, this problem is well-solved in Random Forest. Through the random sampling process, Random Forest handles outliers by binning them. Also, by averaging the decision trees, the Random Forest method provides a low bias and moderate variance model, which improves the generalizability of the output model. In other words, Random Forest not only attains a good performance in the training data but also performs well in unknown (testing) data. And while Support Vector Machine performed well here, Random Forest likely surpassed Support Vector Machine by avoiding overfitting and achieving better predictive power.
We note that increasing sample size may have introduced more variability in our data due to a larger, heterogeneous population that was not cohort-controlled. For example, in our entire population we found smoking was significantly higher in patients with IA (χ = 0.017), which may be because smoking is a well-known risk factor for IA formation and rupture.46−48 Indeed two genes in our model, LRRN3 and GPR15, are among the top differentially expressed genes in blood between current and never smokers according to a meta-analysis by Huan et al.49 Their presence in our predictive model may be because of the higher proportion of smokers in IA group or because these genes are capturing biological mechanisms related to smoking that are important in IA pathogenesis, such as endothelial dysfunction.50−52 Still, when we performed covariate analysis using MatchIt to create subgroups with similar distributions of covariates between IA and control groups, we found that no one subgroup had significantly higher misclassification rates. For instance, 61% of all subjects in “Subclass 5” were smokers, and this subgroup had a misclassification rate of 13%. Yet, “Subclass 1”, which had 0% smokers, had a misclassification rate of 14%. These results suggest that our prediction models may not be affected greatly by covariate imbalance, albeit testing this in even larger cohorts will be needed to confirm these results.
Complex role of circulating neutrophils in intracranial aneurysm
Inflammation is widely-recognized to play a central role in the pathophysiology of IA.53−55 It is commonly thought that in IA neutrophils are recruited to the sac, where they infiltrate the wall and coordinate the inflammatory responses.53, 56, 57 In this study, gene ontology enrichment analysis showed that genes with higher expression in IA identified by edgeR in the entire dataset were related to cell migration and lymphocyte migration ontologies. These processes, which were also observed in neutrophils from patients with IAs in our previous studies16, 17, increase upon peripheral activation and prompt inflammatory cell migration and infiltration of diseased tissue.53, 58 IPA analysis mirrored these results, showing 3 significant networks, 2 of which were involved in activation-related processes: cell-to-cell signaling and interaction, and inflammatory disease function. Interestingly, one of the largest nodes of gene connectivity in all the networks was TNF, a proinflammatory cytokine with many functions including regulation of cell proliferation and apoptosis. TNF has been shown to have a mechanistic role in IA formation in animal models,59 and an increased presence in human IA tissue compared to superficial temporal artery control tissue.60 In this network, TNF has a predicted connection to DEFA1, which was significantly elevated in neutrophils of IA patients. Higher levels of this cytotoxic defensin protein that is contained within neutrophil granules have been reported in IA tissue, suggesting that production of this protein may occur peripherally before neutrophils enter the IA wall.61 Here the molecules CCR4 and CCR6 (receptors of MIP-3 alpha and MIP-1, RANTES, CCL17, and MCP-1, respectively) were also related to the TNF node. We suspect that these receptors, which play a role in dendritic and T cell migration and recruitment during inflammation,62 may coordinate inflammatory cell migration once expressed in aneurysm tissue.
We also observed the dysregulation of inflammation and a potential role of TNF in our bioinformatics analyses of model genes selected by LASSO in the training dataset. TNF was a hub of connectivity in networks created using the LASSO genes. In these networks, we observed an indirect relationship between TNF and the complement system (i.e. C1QTNF1), which is also associated with C1QL1 (one of 37 model genes). This may be because complement activation plays a critical role in the inflammatory response,63 has been implicated in IA wall degradation and rupture,64 and involves proteins that are increased in human IA tissue (including CFB, CFH, C1Q, and C3AR165). We suspect that the complement alternative pathway may be one mechanism through which neutrophils become activated as it can amplify activation through a positive feedback mechanism.66 In addition to complement members, the TNF node was also related to CD44, a cell surface glycoprotein critical to neutrophil recruitment during inflammation. Because neutrophils interact with CD44, PSGL-1, and E-selectin ligand 1 as they roll along activated endothelial cells, this result may reflect neutrophils transmigrating into inflamed endothelium.67 Our data shows TNF may also interact with the transcription factor TP53, a node with connections to numerous molecules, many of which have decreased expression. TP53 plays a variety of roles in inflammation, such as acting on the NF- κB pathway.68 Overall, our bioinformatics analyses of genes selected by LASSO, while not overlapping greatly with the differentially expressed genes selected by edgeR in the entire dataset (with the exception of C1QL1, GRP15), show that the biology of neutrophil activation and inflammation responses are captured by the IA prediction model gene panel.
In addition to neutrophil activation and heightened inflammatory signaling, we observed other aberrant neutrophil functions not specifically characterized in IA, including our previous studies.16, 17 In genes identified in the whole dataset by edgeR, gene ontology enrichment analysis showed that the differentially expressed genes with decreased expression in IA had functions related to sodium channel activity, ion channel activity, and gated channel activity, as well as signaling and regulation of membrane potential processes (ASIC2, GRIK3, SCN5A). GRIK3, glutamate receptor 7, is particularly interesting as glutamate is a chemotactic factor for neutrophils after injury or infection.69 Glutamate binding to its receptors can trigger release of cytokines and MMPs and can activate immune responses, all critical processes in IA.70, 71 Future studies are needed to better understand how these channel activities impact IA pathogenesis.
New ontologies were also captured using the genes identified by LASSO in the training dataset. Using the LASSO genes with lower expression in IA, we found dysregulation of apoptosis as gene ontology enrichment analysis reported both negative regulation of execution phase of apoptosis and regulation of execution phase of apoptosis. These were associated with MTRNR2L1, a neuroprotective and antiapoptotic factor72, and RFFL, which is related to TNF signaling.73 Dysregulated MTRNR2L1 expression may be responsible for increasing the lifespan of neutrophils. Increased lifespan provides further evidence of neutrophil activation in IA. We note that TP53, previously discussed, also induces apoptosis.74, 75 These results are echoed in the blood profiling study of IA published by Jin et al.76 They reported hsa-miR-21, an upregulated miRNA in IA serum, induces apoptosis by extracellular signals, potentially triggering more apoptotic reactions to facilitate the medial thinning and destructive remodeling, a hallmark of IA pathogenesis.77−80 Overall, we suspect that capturing neutrophil activation and inflammation responses involved in IA is the reason why the 37-gene panel was able to detect IA.
Limitations
In this study, we increased sample size from our previous study by adding 94 samples to 40 samples we previously analyzed. However, these two batches used different versions of the Illumina kit for library preparation, which necessitated the implementation of batch effect correction that could potentially have introduced bias or skewed our dataset.81 Secondly, all samples were recruited from patients receiving cerebral imaging at a single center, which may introduce selection bias. Future studies are needed to validate our predictive models using broader patient populations from multiple centers. Thirdly, inflammatory or vascular diseases other than IA could affect model prediction. Larger studies with multiple control groups of individuals with other vascular and inflammatory conditions are needed to refine our model.