Study enrollment
This study was approved by the University at Buffalo Institutional Review Board (study no. 030-474433). All methods followed the approved protocol, and written informed consent was obtained from all subjects prior to sample collection. Patients receiving cerebral DSA at the Gates Vascular Institute, Buffalo, NY, with and without a diagnosis of IA were enrolled in this study. Indications for DSA included confirmation of noninvasive imaging findings of IAs or other cerebrovascular conditions, follow-up of noninvasive imaging performed for headache or visual disturbance, and follow-up of previously identified IAs. All patients who consented to participate were over 18 years of age, English speaking, and had not previously been treated for IA. Patients with potentially altered immune systems were excluded, including, for example, patients who had recent invasive surgery, were receiving chemotherapy, had a fever (>100°F), had received solid organ transplants, had autoimmune diseases, or were taking prednisone or other immunomodulating drugs.
Between December 2013 and September 2018, we collected 232 blood samples from cerebral DSA patients at the Gates Vascular Institute (103 from patients with IA, and 129 from IA-free controls). Forty-three of these samples had been sequenced as part of our previous studies.22, 23 In all cases, IA diagnosis was confirmed by DSA images. Patient medical record data were also collected to characterize demographics and comorbidities.
Neutrophil isolation
During DSA, 16 mL of blood was drawn from the access catheter in the femoral artery and transferred into two 8 mL, citrated cell preparation tubes (BD, Franklin Lakes, NJ). Neutrophils were isolated within 1 hour of blood collection, as described elsewhere.7 Briefly, cell preparation tubes were centrifuged at 1,700×g for 25 minutes to separate erythrocytes and neutrophils from mononuclear cells and plasma in the peripheral blood samples via a Ficoll density gradient. Erythrocytes and neutrophils were collected into a 3 mL syringe. Following hypotonic lysis of red blood cells, neutrophils were isolated by centrifugation at 400×g for 10 min, disrupted, and stored in TRIzol reagent (Life Technologies, Carlsbad, CA) at -80°C until further processing. Neutrophils isolated in this fashion are more than 98% CD66b+ by flow cytometry and contain no contaminating CD14+ monocytes.24
RNA preparation
Neutrophil RNA was extracted as described previously22 using TRIzol, according to the manufacturer’s instructions. Trace DNA was removed by DNase I (Life Technologies, Carlsbad, CA) treatment. RNA was purified using the RNeasy MinElute Cleanup Kit (Qiagen, Venlo, Limburg, Netherlands) and suspended in RNase-free water. The purity and concentration of RNA in each sample were measured by absorbance at 260 nm and 280 nm on a NanoDrop 2000 spectrophotometer (Thermo Scientific, Waltham, MA), and 200 ng to 400 ng of RNA was reserved for sequencing. Precise RNA concentration was measured via the Quant-iT RiboGreen Assay (Invitrogen, Carlsbad, CA) with a TBS-380 Fluorometer (Promega, Madison, WI), and the quality of the RNA samples was measured with an Agilent 2100 BioAnalyzer RNA 6000 Pico Chip (Agilent, Las Vegas, NV). RNA samples to be sequenced had acceptable purity (260/280 ratio of ~1.8 or greater, range: 1.76-2.12) and integrity (RIN of ~5 or greater, range: 4.5-9.1) prior to RNA sequencing.
RNA sequencing
For newly processed samples, the Illumina TruSeq Stranded Total RNA Gold Library Preparation Kit (Illumina, San Diego, CA) was used for library preparation. Samples were subjected to 50-cycle, single-read sequencing on a HiSeq2500 system (Illumina). To increase sample size, we combined reads from these new samples with reads from our previous samples22, 23 that were sequenced in the same manner, but for which libraries were constructed using the Illumina TruSeq RNA Library Prep Kit v2 (Illumina, San Diego, CA). For all data, per-cycle base-call (BCL) files generated by the HiSeq2500 were demultiplexed and converted to per-read FASTQ files using bcl2fastq version 2.20.0.422 with default parameters. Sequencing quality was reviewed using FastQC version 0.11.5, and potential contamination was screened for using FastQ Screen version 0.11.1. FastQC and FastQ Screen quality reports were summarized using MultiQC version 1.5. No adapter sequences were detected, so no trimming was performed. Genomic alignments were performed using HISAT2 version 2.1.0 with default parameters, using the NCBI GRCh38 reference genome and gene annotation set. Sequence alignments were compressed and sorted into binary alignment map (BAM) files using samtools version 1.3. Mapped reads were counted per genomic feature using Subread featureCounts version 1.6.2 with the parameters -s 2 -g gene_id -t exon -Q 60; the annotation file specified with -a was the NCBI GRCh38 reference from Illumina iGenomes. Aggregate quality control data (i.e., alignment statistics and feature assignment statistics) were again summarized using MultiQC.
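As an illustration of the read-counting step, the featureCounts invocation described above can be assembled as follows. This is only a sketch: the file names are hypothetical placeholders, and only the flags are taken from the text.

```python
def featurecounts_command(bam_files, gtf, output):
    """Assemble the Subread featureCounts call with the parameters
    reported in the text: reverse-stranded counting (-s 2), reads
    assigned to exon features (-t exon) summarized per gene_id (-g),
    keeping only alignments with MAPQ >= 60 (-Q 60)."""
    cmd = [
        "featureCounts",
        "-s", "2",          # reverse-stranded library (TruSeq Stranded)
        "-g", "gene_id",    # summarize counts per gene
        "-t", "exon",       # count reads overlapping exons
        "-Q", "60",         # uniquely mapped reads only (HISAT2 MAPQ 60)
        "-a", gtf,          # GRCh38 annotation (Illumina iGenomes)
        "-o", output,
    ] + list(bam_files)
    return cmd

# Hypothetical file names, for illustration only:
cmd = featurecounts_command(["sample1.bam"], "genes.gtf", "counts.txt")
print(" ".join(cmd))
```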
Differential expression analysis and data exploration
Before implementing our machine learning pipeline, we performed differential expression analysis on the whole dataset to identify transcripts that were significantly differentially expressed in IA, using the Bioconductor package edgeR version 3.24.0. After estimating dispersion, edgeR fit negative binomial generalized linear models and identified differentially expressed genes using a quasi-likelihood F-test.25, 26 We incorporated the two sequencing batches into the design matrix to correct for any potential batch effects due to the different library preparation kits. Genes with a counts sum >0 across all samples were used as input. Multiple hypothesis testing correction was performed using Benjamini-Hochberg false discovery rate (FDR) correction.27 Transcripts with an FDR-corrected p-value (q-value) <0.05 were considered significantly differentially expressed. To explore how transcriptomes separated patients with and without IA on a broad scale, we performed hierarchical clustering using the hclust function in R under default settings (complete linkage).
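The Benjamini-Hochberg step can be illustrated with a minimal pure-Python sketch. The actual correction was performed inside edgeR; this only shows the computation that yields the q-values.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg FDR correction: the i-th smallest p-value
    is scaled by m/i, then monotonicity is enforced from the largest
    rank downward. Returns q-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # largest rank first
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues

# A transcript is called significant when its q-value is < 0.05.
```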
Next-generation sequencing is typically performed on high-quality RNA (RIN >7) when possible.28 However, clinical samples, particularly human neutrophils, which contain high levels of endonucleases as part of the host defense response to bacteria, rarely yield RNA without some degree of degradation.29 There was no statistically significant difference in RIN between the IA and control groups in our study (p=0.18, Student's t-test). Nevertheless, because RIN varied across samples, we performed covariate correlation analysis as described by Xiong et al.30 to determine whether RNA quality could have affected the expression levels of differentially expressed transcripts. Gene expression was considered correlated with RIN if it had both a Pearson correlation coefficient r>0.80 and a p-value<0.01.
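This covariate screen amounts to computing, for each transcript, the Pearson correlation between its expression vector and the per-sample RIN values. A minimal sketch of the r computation follows; the accompanying p-value would come from a t-distribution, which a statistics library would supply.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors,
    e.g. one transcript's expression across samples vs. per-sample RIN."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# A transcript is flagged only if r > 0.80 (and, separately, p < 0.01).
```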
Verification of expression differences by qPCR in a sub-cohort
To verify expression differences in differentially expressed genes, quantitative polymerase chain reaction (qPCR) was performed. Due to limitations in RNA quantity, qPCR was performed on 10 transcripts in a subset of 50 of the 134 samples (20 IA and 30 control). We followed the protocol described previously.22 In brief, oligonucleotide primers were designed using Primer3 software (Primer3Web 0.4.0) and Primer-BLAST (NCBI, Bethesda, MD) to have a 60°C melting temperature, a length of 15-25 nucleotides, a product of 50-250 base pairs (with at least one primer spanning an exon-exon junction), and an estimated efficiency >0.8 (actual range: 0.82-1.1, average: 0.96, median: 1.0).31 Primer sequences, annealing temperatures, efficiencies, and product lengths are reported in Supplemental Table 1. For reverse transcription, first-strand cDNA was generated from total RNA using the qScript cDNA Synthesis Kit (Quantabio, Beverly, MA) according to the manufacturer's directions. qPCR was run with 5 ng of cDNA in 20 µL reactions in triplicate on a Bio-Rad CFX Connect (Bio-Rad, Hercules, CA) using the qScript One-Step SYBR Green Master Mix kit (Quantabio, Beverly, MA) and gene-specific primers at a concentration of 0.02 μM each. The temperature profile consisted of an initial step of 95°C for 1 min, followed by 40 cycles of 95°C for 15 seconds and 60°C for 1 min, and then a final melting curve analysis from 60°C to 95°C over 20 min.
Gene-specific amplification was demonstrated by a single peak in the Bio-Rad dissociation melt curve. Samples were normalized based on the expression of HPRT1, GAPDH, and GPI (housekeeping genes32-34), which were run in reactions parallel to the genes of interest. Supplemental Figure 1 shows the stability of the expression (from RNA sequencing, Supplemental Figure 1A) and Ct values (from qPCR, Supplemental Figure 1B) of the housekeeping genes in the control and IA groups. All had similar coefficients of variation, and there was no statistically significant difference in their variation between the two groups for sequencing or qPCR (all p-values>0.01, F-test). Housekeeping Ct values were used to calculate gene-specific average fold-change between the two groups using the 2^-ΔΔCt method,35 and these values were then averaged across all housekeeping genes. For comparison with the qPCR data, RNA sequencing data from the same samples were used to calculate the average fold-change in gene expression between the IA and control groups. This fold-change value (presented as an absolute fold-change in the positive direction, or in the negative direction for fold-change<1) was then compared to the same metric calculated by the 2^-ΔΔCt method from the qPCR data to determine whether the absolute fold-change in expression was in the same direction and statistically significant (Student's t-test, significance at p-value<0.05).
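The fold-change arithmetic above follows the standard Livak method; a minimal sketch, with Ct values invented purely for illustration:

```python
def ddct_fold_change(ct_target_ia, ct_hk_ia, ct_target_ctrl, ct_hk_ctrl):
    """2^-ΔΔCt method: ΔCt = Ct(target) - Ct(housekeeping) within each
    group; ΔΔCt = ΔCt(IA) - ΔCt(control); fold-change = 2 ** -ΔΔCt."""
    ddct = (ct_target_ia - ct_hk_ia) - (ct_target_ctrl - ct_hk_ctrl)
    return 2 ** -ddct

def signed_fold_change(fc):
    """Present fold-changes < 1 as absolute fold-changes in the
    negative direction, as done for the qPCR/RNA-seq comparison."""
    return fc if fc >= 1 else -1.0 / fc

# e.g. a target detected 2 cycles earlier (relative to housekeeping)
# in IA than in control corresponds to a 4-fold increase:
print(ddct_fold_change(24.0, 20.0, 26.0, 20.0))  # 4.0
```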
Feature selection for classification model development
To build predictive models for IA, we began with raw counts to eliminate the bias and uncertainty associated with the distribution modeling incorporated in edgeR. Raw counts were normalized to transcripts per million (TPM) values, which adjust for both gene length and sequencing depth, to facilitate comparison of expression between samples. We then applied abundance filtering, retaining only protein-coding genes with average TPM>1 across all samples, which reduced the set of potential transcripts to 18,833. To account for the two sequencing batches in our study design (as done in the edgeR analyses), we performed batch effect correction using ComBat under the default settings in R.36, 37 Then, 70% of samples were randomly allocated to a training cohort (n=39 IA and n=55 control) and 30% to a testing cohort (n=16 IA and n=24 control), maintaining the proportions of IA and control samples.
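TPM normalization as described can be sketched as follows: within one sample, scale each gene's count by its length, then rescale so the sample totals one million.

```python
def counts_to_tpm(counts, gene_lengths_kb):
    """Convert one sample's raw counts to transcripts per million:
    divide each gene's count by its length in kilobases (reads per
    kilobase), then scale so the per-sample total equals 1e6."""
    per_kb = [c / l for c, l in zip(counts, gene_lengths_kb)]
    total = sum(per_kb)
    return [r / total * 1e6 for r in per_kb]
```

Because every sample sums to the same total, TPM values are comparable across samples in a way raw counts are not.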
For feature selection in the training cohort, we performed supervised feature selection using the Hilbert-Schmidt Independence Criterion Least Absolute Shrinkage and Selection Operator (HSIC LASSO) method. HSIC LASSO was applied to the ComBat-corrected dataset to select features for the model. To visualize how the selected transcripts separated samples from patients with and without IAs, we performed principal component analysis (PCA) using the prcomp function in R under the default settings.38 Covariate correlation analysis following Xiong et al.30 was performed again to determine whether RNA quality (i.e., RIN) could have affected the expression levels of the TPM-normalized transcripts selected by LASSO. Gene expression was considered correlated with RIN if it had both a Pearson correlation coefficient r>0.80 and a p-value<0.01.
Model training
We used the MATLAB Statistics and Machine Learning Toolbox (MathWorks, Natick, MA) to train 4 popular algorithms (K-Nearest Neighbors, Random Forest, and Support Vector Machines with Gaussian and cubic kernels) on 2 different gene panels: the 37 transcripts identified by LASSO in this study and the 26 transcripts identified by filtering in our previous study. While these algorithms have been used in other disease classification applications39-44, we implemented all 4 algorithms to determine which best suited our data. Specific parameters for each algorithm are as follows:
- For K-Nearest Neighbors, we used a Euclidean metric and 10 neighbors (k). The resulting model classified test samples by calculating their distance to each training sample and the test sample labels were predicted by choosing the class that was most common among their 10 nearest neighbors.
- For Random Forest, we set the number of trees to 1000. The forest was built by constructing a multitude of decision trees from subsets of the training data generated by random sampling with replacement,45 and the resulting model classified testing samples by the majority vote of the decision trees.
- For Support Vector Machines, we used two different kernels46, Gaussian and cubic. To separate binary-labeled samples, the Support Vector Machine transforms them into a higher-dimensional space using the kernel and then establishes a hyperplane that maximizes the distance to samples of either class. The resulting model classified testing samples by transforming them into the higher-dimensional space with the corresponding kernel and making decisions based on their signed distance to the hyperplane.
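As a concrete illustration of the simplest of these classifiers, K-Nearest Neighbors with a Euclidean metric can be sketched in a few lines. The study used the MATLAB toolbox implementations; the data in the usage example are invented.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python >= 3.8)

def knn_predict(train_X, train_y, x, k=10):
    """Label x by majority vote among its k nearest training samples,
    with distance measured by the Euclidean metric."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

For instance, with three "control" points clustered near the origin and three "IA" points near (5, 5), a query at (5, 5.5) with k=3 would be voted "IA".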
Model assessment
We estimated the performance of each model, for the new LASSO-identified features as well as our previously identified 26 features, by leave-one-out (LOO) cross-validation within the training cohort. Model predictions were compared to each patient's clinical diagnosis from imaging, and the true positives, true negatives, false positives, and false negatives were tallied. We then calculated each model's sensitivity, specificity, and accuracy, as described elsewhere.23 Based on model predictions, we created receiver operating characteristic (ROC) curves and calculated the area under the ROC curve (AUC) to assess model performance. Additionally, to gauge the predictive value of the models, we determined positive predictive value (PPV) and negative predictive value (NPV). PPV and NPV were estimated using formulas based on Bayes' theorem as previously described23 with 5% aneurysm prevalence, which is within the range of IA prevalence reported in the literature (3.2-7%47-50). The classification models were then independently evaluated in the testing cohort (n=40), and classification results were compared to clinical diagnoses to calculate the true sensitivity, specificity, and accuracy for each model. ROC curves were constructed, and AUCs, along with PPV and NPV, were used to assess the performance of each classifier. This was performed for algorithms trained on LASSO-selected features and on the previous 26 features.
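The tally-based metrics and the prevalence-adjusted predictive values follow standard formulas; a sketch, using the 5% prevalence assumed in the text:

```python
def sensitivity_specificity_accuracy(tp, tn, fp, fn):
    """Classification metrics from the confusion-matrix tallies."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sens, spec, acc

def ppv_npv(sens, spec, prevalence=0.05):
    """Predictive values via Bayes' theorem at an assumed disease
    prevalence (5% IA prevalence, per the cited literature)."""
    ppv = sens * prevalence / (
        sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = spec * (1 - prevalence) / (
        spec * (1 - prevalence) + (1 - sens) * prevalence)
    return ppv, npv
```

Note that even a model with 90% sensitivity and 90% specificity has a PPV of only about 32% at 5% prevalence, which is why prevalence-adjusted values are reported alongside accuracy.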
Testing influence of clinical covariates on gene expression differences
While we randomly assigned samples to the training and testing cohorts, this study was not cohort-controlled and used a large, heterogeneous population. Consequently, it is possible that factors other than IA status, such as demographics or comorbidities, could affect differential expression and model performance. To determine whether patient characteristics influenced model performance, we first performed chi-square tests to determine whether the rates of these characteristics differed between the aneurysm and control populations. We examined gender, hypertension, heart disease, stroke, high cholesterol, cancer, diabetes, arthritis, asthma, smoking status, and age. Additionally, we performed covariate matching with the MatchIt package in R under default settings to create 6 subclasses with similar distributions of covariates (age [60 and under vs. over 60], sex, smoking status [non-smoker vs. current smoker], hypertension, heart disease, stroke history, high cholesterol, cancer, diabetes, arthritis, asthma, and IA family history) for the aneurysm and control populations.51, 52 To create subclasses, we used a distance measure determined by a logistic regression model to estimate the propensity score. We then examined the misclassification rate in each subclass to determine whether any group with a specific "covariate profile" was associated with greater misclassification.
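For a single binary characteristic (e.g. hypertension present/absent by IA/control), the chi-square test reduces to the familiar 2×2 statistic; a minimal sketch (the analysis itself was run in R):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 df, no continuity correction)
    for the 2x2 table [[a, b], [c, d]]: rows = IA/control groups,
    columns = characteristic present/absent."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```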
Bioinformatics
Gene ontology enrichment analysis was performed using the Gene Ontology enRIchment anaLysis and visuaLizAtion tool (GOrilla).53 A background list of neutrophil expression from 3 healthy individuals (average fragments per kilobase of transcript per million mapped reads >0.5) was used to compute hypergeometric statistics and assign significance to GO terms.7 GO functions and processes with a p-value<0.001 were reported. Ingenuity Pathway Analysis (IPA) software (Qiagen Inc., https://www.qiagenbioinformatics.com/products/ingenuity-pathway-analysis) was used to investigate networks associated with the differentially expressed genes identified by edgeR (q<0.05, fold-change>2) and those selected by LASSO during feature selection. Each gene identifier was mapped to its corresponding gene object in the Ingenuity Knowledge Base and overlaid onto a molecular network derived from information accumulated in the Knowledge Base. Gene networks were algorithmically generated based on their "connectivity," derived from known interactions between the products of these genes. Networks were considered significant if their p-scores were ≥15; the network score is calculated as p-score=-log10(p-value), so a score of 15 corresponds to a p-value of 1E-15.54
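The hypergeometric enrichment statistic underlying such GO analyses can be sketched as an upper-tail probability; GOrilla computes a refined variant internally, so this is illustrative only.

```python
from math import comb, log10

def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k): of N background genes, K carry the GO term; a target
    list of n genes contains k of them. A small p indicates enrichment."""
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

def p_score(p):
    """IPA-style network score: p-score = -log10(p-value)."""
    return -log10(p)
```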