Patients
Data from 200 consecutive adult patients who had undergone surgery for vestibular schwannoma (VS) between July 2006 and August 2016 were selected retrospectively and anonymized. This study was performed in line with the principles of the Declaration of Helsinki. Approval was granted by the Ethics Committee of the University Hospital Halle (Saale) (Ref. Number 2018-138). All patients whose data were included had given written informed consent for the use of their data in scientific studies. Inclusion criteria were first surgery for VS, availability of complete continuous intraoperative electromyography (EMG) recordings from clinical routine, and facial nerve outcome data from follow-up after at least 6 months. Exclusion criteria were previous irradiation and neurofibromatosis. Mean age was 51 years (range 21 to 80 years). 109 patients were women. Tumor size was Koos 1 in 18 patients, Koos 2 in 57, Koos 3 in 70 and Koos 4 in 55 patients [13]. Median preoperative facial nerve function was House-Brackmann (HB) grade 1 (range 1–3, 3 patients with HB 3) [12]. A separate intermediate nerve was observed intraoperatively in 99 patients.
Recordings
Continuous EMG was recorded during the complete surgical procedure as described previously [3, 4]. In short, 15 mm long non-insulated needle electrodes were placed in parallel in the facial muscles with an interelectrode distance of 5 mm. For each of the 3 main branches of the facial nerve, 4 electrodes were positioned on the operated side. By referencing neighboring electrodes, the setup yielded 3 bipolar channels per branch. The ground electrode was placed in the contralateral upper arm. Data were recorded with a Grass-Telefactor 15LT biosignal amplifier (West Warwick, RI, USA) at approximately 7 kHz using a 5 Hz high-pass filter.
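The referencing of neighboring electrodes can be illustrated with a minimal sketch. The data layout (one list of samples per electrode) and the polarity (first electrode minus its neighbor) are assumptions made for illustration and are not taken from the recording setup.

```python
def bipolar_channels(electrodes):
    """Derive bipolar channels by referencing each electrode to its neighbor.

    `electrodes` is a list of equal-length sample lists, one per needle
    electrode of a single facial nerve branch (hypothetical data layout).
    Four electrodes yield three bipolar channels.
    """
    return [
        [a - b for a, b in zip(first, second)]
        for first, second in zip(electrodes, electrodes[1:])
    ]


# Four electrodes with three samples each (made-up values).
branch = [[0, 1, 2], [1, 1, 1], [2, 0, 2], [2, 2, 2]]
channels = bipolar_channels(branch)
```

Applied to each of the 3 branches, this yields the 9 channels evaluated in the following steps.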
EMG processing
The recorded 9-channel data were evaluated postoperatively by computer-assisted visual inspection using in-house software. Extending automated marking [3], onsets and offsets of individual A-train patterns were marked. In addition, A-train clusters, defined as A-trains occurring in the majority of recording channels within the same time segment of a few seconds [11], were identified visually. Subsequently, the durations of all A-train events were summed per channel, yielding a total of 9 traintime values for each patient.
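The summation of A-train durations per channel can be sketched as follows. The event format (channel index, onset and offset in seconds) is a hypothetical representation of the marked patterns and not the format of the in-house software.

```python
def traintime_per_channel(events, n_channels=9):
    """Sum A-train durations per channel.

    `events` is an iterable of (channel, onset_s, offset_s) tuples,
    a hypothetical representation of the marked A-train patterns.
    """
    totals = [0.0] * n_channels
    for channel, onset, offset in events:
        if offset < onset:
            raise ValueError("offset must not precede onset")
        totals[channel] += offset - onset
    return totals


# Made-up example: two A-trains in channel 0, one in channel 4.
marked = [(0, 1.0, 1.5), (0, 3.0, 3.25), (4, 10.0, 12.0)]
traintime = traintime_per_channel(marked)
```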
Clinical data
Clinical data were extracted from clinical documentation and included preoperative and immediate postoperative facial nerve function as well as facial nerve function at follow-up after 6 months, graded according to House-Brackmann [12]. HB grades were checked and, if necessary, corrected by a single experienced evaluator (author JP) to reduce issues of limited interrater reliability of HB grading [14]. Intraoperative observation of a separate intermediate nerve was taken from the surgeon’s documentation.
Relationship to postoperative outcome
The relationship of traintime, tumor size and output of neural networks with postoperative outcome was evaluated using partial correlation as applied previously [4]. Due to the ordinal scaling of HB grades, Spearman rank correlations were used. Partial correlations quantify an association between two parameters while controlling for one or more covariates. A statistically significant partial correlation suggests an association that is not explained by the covariates. For example, in a previous study [4], traintime showed a significant partial correlation with postoperative outcome when controlling for tumor size, which suggests that the association of traintime with outcome does not primarily depend on tumor size. In the context of the diagnostic value of the evaluated neural networks regarding postoperative facial nerve function, a significant partial correlation while controlling for both raw traintime and tumor size would suggest that the networks provide complementary information in comparison to traintime or tumor size alone.
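As an illustration of the principle, a partial Spearman correlation with a single covariate can be sketched as below. This is a simplified sketch using the standard first-order partial correlation formula on average ranks. It does not cover several covariates or significance testing, and the function names are chosen for illustration only.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based tied positions
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, controlling for one covariate z."""
    rx, ry, rz = average_ranks(x), average_ranks(y), average_ranks(z)
    r_xy, r_xz, r_yz = pearson(rx, ry), pearson(rx, rz), pearson(ry, rz)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

In the setting described above, `x` could hold traintime, `y` the postoperative HB grade and `z` the Koos tumor size. Controlling for more than one covariate requires a recursive or regression-based extension that is not shown here.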
Construction and evaluation of neural networks
Feed-forward networks with different input parameters, a single hidden layer and postoperative and follow-up HB grades as simultaneous outputs were constructed using the feedforward function of the Matlab Deep Learning Toolbox (Matlab R2021a, The Mathworks, Natick, MA, USA). The number of neurons in the hidden layer was chosen according to the number of inputs, e.g. 11 for 11 input parameters (9 channels of traintime, Koos tumor size and preoperative HB grade).
Training used a Levenberg-Marquardt training function with mean squared error (MSE) for performance evaluation. Available data were randomly separated into a 75% training and a 25% validation split, i.e., 150 randomly selected datasets served to train the network and 50 datasets were used to evaluate the performance. After training was completed, the resulting performance was evaluated in the validation split only by calculating the chi2 statistic between network output, i.e., estimated HB grades, and the clinical postoperative and follow-up HB grades. For more intuitive interpretation, chi2 values were transformed into Cramér’s V effect sizes. For 5x5 tables (estimated and clinical HB grades each ranging from 1 to 5), values below 0.05 are considered negligible, 0.05–0.13 small, 0.13–0.22 medium and above 0.22 large [15].
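The transformation of chi2 into Cramér’s V follows V = sqrt(chi2 / (n · (min(rows, columns) − 1))). A minimal sketch is given below. It assumes that estimated HB grades are already available as integers from 1 to 5, since the mapping of continuous network output to grades is not detailed here. Counting only non-empty rows and columns is a choice made for this sketch.

```python
import math


def crosstab(estimated, clinical, grades=(1, 2, 3, 4, 5)):
    """Contingency table of estimated (rows) versus clinical (columns) grades."""
    index = {grade: k for k, grade in enumerate(grades)}
    table = [[0] * len(grades) for _ in grades]
    for est, clin in zip(estimated, clinical):
        table[index[est]][index[clin]] += 1
    return table


def chi2_statistic(table):
    """Pearson chi2 statistic; cells with zero expected count are skipped."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramers_v(table):
    """Cramér's V based on the non-empty rows and columns of the table."""
    n = sum(sum(row) for row in table)
    rows = sum(1 for row in table if sum(row) > 0)
    cols = sum(1 for col in zip(*table) if sum(col) > 0)
    k = min(rows, cols) - 1
    if n == 0 or k < 1:
        return 0.0
    return math.sqrt(chi2_statistic(table) / (n * k))
```

For perfectly concordant grades V equals 1, and for a table with independent rows and columns V equals 0.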
Statistical evaluation of network performance
Training and consequently performance of neural networks depend on the random choice of training and validation splits as well as on the random initialization of synaptic weights between layers. To better estimate the overall performance of neural networks, we applied a bootstrapping technique to sample the distribution of performance observed with many networks. The approach repeated a single run of calculations 1000 times, yielding 1000 performance estimates, i.e., chi2 values of the comparison between network output and postoperative/follow-up outcome. Each run randomly assigned 150 datasets to the training split and 50 datasets to the validation split. A neural network was then trained on the training split. Chi2 values were then calculated using only the validation split.
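The repeated random assignment of cases to training and validation splits can be sketched as follows. The function name and the fixed seed are illustrative assumptions. Network training and scoring are deliberately left out.

```python
import random


def repeated_splits(n_cases=200, n_train=150, n_runs=1000, seed=0):
    """Yield (training_indices, validation_indices) for each run.

    Every run draws a fresh random permutation of all case indices and
    assigns the first `n_train` to training and the rest to validation.
    """
    rng = random.Random(seed)
    indices = list(range(n_cases))
    for _ in range(n_runs):
        shuffled = rng.sample(indices, k=n_cases)
        yield shuffled[:n_train], shuffled[n_train:]


# Three runs for illustration; the study used 1000.
splits = list(repeated_splits(n_runs=3))
```

In each run, a network would be trained on the training indices and a chi2 value would be computed on the validation indices, so that the collected values form the performance distribution.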
The mean and 95% confidence interval of the resulting distribution were taken as overall performance. For calculation of significance, the distribution was compared to a surrogate distribution using a Kolmogorov-Smirnov (KS) test. The surrogate distribution was constructed by shuffling the input data of the validation split with respect to the outcome values. Chi2 values were then calculated from the network output obtained with the surrogate data. This procedure was also repeated 1000 times, yielding the surrogate distribution. The underlying hypothesis of this procedure is that the ordered inputs should predict the outcome in the validation split better than the randomly shuffled surrogate.
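The surrogate construction and the comparison of distributions can be sketched as below. The sketch computes only the two-sample KS statistic (the maximum distance between the empirical distribution functions) and no p-value. Summarizing the distribution with a percentile interval is an assumption, since the interval type is not specified here.

```python
import random
import statistics


def shuffle_inputs(validation_inputs, rng):
    """Return the validation inputs in random order.

    Outcomes keep their order, so the pairing of inputs and outcomes is broken.
    """
    surrogate = list(validation_inputs)
    rng.shuffle(surrogate)
    return surrogate


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    distance = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / len(a) - j / len(b)))
    return distance


def mean_and_interval(values):
    """Mean and 2.5th/97.5th percentiles (percentile interval is an assumption)."""
    cuts = statistics.quantiles(values, n=40)  # cut points in steps of 2.5%
    return statistics.fmean(values), cuts[0], cuts[-1]
```

With 1000 chi2 values from the ordered inputs and 1000 from the shuffled inputs, `ks_statistic` quantifies how far the two distributions are apart. A complete test additionally requires the corresponding p-value.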
Comparison of different input sets
The primary endpoint of our study was the performance of neural networks with traintime, tumor size and preoperative facial nerve function as inputs. Additionally, we evaluated performance when adding the information that a separate intermedius nerve and/or A-train clusters were observed. Performance differences are discussed based on 95% confidence intervals (CI). Overlapping CIs were interpreted as a lack of significant differences, which is considered conservative [16].
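The overlap criterion is simple to state in code. The following sketch uses made-up interval bounds and a hypothetical function name.

```python
def intervals_overlap(ci_a, ci_b):
    """True if two (low, high) confidence intervals share at least one value."""
    low_a, high_a = ci_a
    low_b, high_b = ci_b
    return low_a <= high_b and low_b <= high_a


# Made-up intervals for two input sets.
overlapping = intervals_overlap((0.10, 0.20), (0.15, 0.30))
separated = intervals_overlap((0.10, 0.20), (0.25, 0.30))
```

Under this criterion, only non-overlapping intervals, as in the second made-up case, would be interpreted as a difference.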
Evaluation of tumor size
The networks trained on traintime, tumor size and preoperative facial nerve function were further analyzed to study the influence of tumor size. To this end, the complete dataset was subdivided into groups according to the Koos tumor size. Chi2 values from the comparison of outcomes with estimates of the previously trained networks were then calculated for each group individually. The rationale of this step was that tumor size is constant within each group, so that performance within a group cannot be explained by tumor size. Because preoperative HB grades were comparable in most patients and therefore also within the tumor size subgroups, the observed correlations must then depend on traintime. Mean correlations and 95% confidence intervals (CI) are reported over all 1000 randomizations. For evaluation of differences between tumor size categories, a general linear regression model (GLM) was fitted to the network estimates, taking Koos tumor size and sample size in the groups into account. This approach was used to control for the rather different patient numbers in the tumor size groups, ranging from 18 with Koos 1 to 70 with Koos 3.
Influence of a separate intermedius nerve
Performance of neural networks was investigated regarding the influence of a separate intermedius nerve. To this end, estimates of networks were categorized into groups according to intraoperative observation of a separate intermedius nerve or lack thereof. Based on all 200 patients, chi2 concordance statistics with clinical HB grades were calculated for each group and each of the 1000 randomizations. Differences were again statistically evaluated with the KS test. Unlike the remaining analyses, this evaluation was performed in the complete sample rather than in the validation split only. With 50 randomly selected cases in each randomization, restriction to the validation split would have led to varying and frequently unbalanced percentages of cases with a separate intermedius nerve. The main goal of this analysis step was to evaluate whether the impact of a separate intermedius nerve could be compensated by the best method or whether such an influence would still be present. Since chi2 statistics and, to some degree, Cramér’s V are sensitive to the sample size, comparability with the performance of other neural networks evaluated in only the smaller validation split is limited.