Neural-signature methods for structured EHR prediction

Models that can effectively represent structured Electronic Healthcare Records (EHR) are central to an increasing range of applications in healthcare. Due to the sequential nature of health data, Recurrent Neural Networks have emerged as the dominant component within state-of-the-art architectures. The signature transform represents an alternative modelling paradigm for sequential data. This transform provides a non-learnt approach to creating a fixed vector representation of temporal features and has shown strong performance across an increasing number of domains, including medical data. However, the signature method has not yet been applied to structured EHR data. To this end, we follow recent work that enables the signature to be used as a differentiable layer within a neural architecture, enabling application in high dimensional domains where calculation would have previously been intractable. Using a heart failure prediction task as an exemplar, we provide an empirical evaluation of different variations of the signature method and compare against state-of-the-art baselines. This first application of neural-signature methods in real-world healthcare data shows competitive performance when compared to strong baselines and thus warrants further investigation within the health domain.


Introduction
Prediction tasks defined on structured EHR data are a key focus for applications of Machine Learning in Healthcare, with the potential to improve patient outcomes through faster and more accurate diagnoses. Due to the rapidly increasing quantity and availability of EHR data, methods in deep learning are increasingly being utilised to model the complex interactions in a range of healthcare-related predictive tasks. Due to the sequential nature of EHR data, RNNs have emerged as a key component in many recent state-of-the-art methods. This paper introduces signature methods as a theoretically well-grounded method of extracting features from sequential structured EHR data. We provide an empirical evaluation of signature methods as a novel alternative to RNNs for disease prediction using data collected during routine healthcare encounters.
The signature transform maps a path (for example, a time series) onto an infinite sequence of summary statistics. It is known that these terms completely characterise the path (up to translation) and that any function on the path can be modelled arbitrarily well by a linear function on the signature [21,34]. In a machine learning context, this makes the signature a useful feature set with which to learn. The signature has been successful across a range of predictive tasks involving time series data [37], in particular in the medical domain [38]. However, the signature method has not yet been applied to structured EHR data, most likely due to its high dimensionality posing computational challenges. To this end, we follow recent work that enables the signature to be used as a differentiable layer within a neural architecture, enabling application in high dimensional domains where calculation would have previously been intractable [7]. In this paper we perform an empirical evaluation of signature methods as a novel alternative to RNNs for disease prediction using EHR data. We create a 90 day HF prediction task with data from the UK Biobank [9] to compare neural-signature models with various augmentations against RNN and bag of words baselines. The key results can be summarised as follows:
• Neural-signature methods are able to produce a competitive predictive performance when compared to RNN models, returning results over two separate corpora and metrics within one standard deviation
• Log-signature and lead-lag variants improve results from those similar to basic bag of words models to those comparable with RNNs
• Adding time-augmentations does not significantly affect model performance for either neural-signature or RNN models

Related work
The methods previously used to address temporality in EHR can be roughly separated into three main areas. Discretization: this consists of splitting continuous-time variables into discrete bins. Features are then calculated from the sub-sequences within each time period. For categorical data, the most common approach is to count the number of events.
Neural approaches: neural network approaches attempt to automatically learn a feature set that best describes the underlying data for a specific prediction task. [41] and [12] applied RNN variants and reported improved performance over existing state-of-the-art methods.
RNN variants continue to play a role in more recent papers [1,13,14,20,35,42,47,51]. Modifications include bidirectional RNNs to reduce the number of steps between dependencies, attention mechanisms to improve interpretability and facilitate combination with Convolutional Neural Network (CNN) models, and improved embeddings for visits with graph-based attention models. In all such papers, RNNs are used to handle the sequential aspect of structured EHR.
While it is clear that RNNs perform comparatively well in deep learning applications, an alternative set of methods is also worth discussing.
Sequential feature extraction: this encompasses methods that are able to extract flat features from sequential data while retaining information relating to ordering. Despite being more popular in higher-frequency data modalities such as streams data, previous works using structured EHR have explored shapelets [52] and symbolic aggregate approximation [4] for adverse drug reaction prediction. More broadly, this category includes methods such as the discrete Fourier transform [43].
It is to this category that signature methods can be considered to belong. A key advantage of signature methods is a strong theoretical groundwork showing the signature's usefulness in non-parametric hypothesis testing [11] and algebraic geometry [40]. Machine learning applications have also been demonstrated in a growing variety of domains [10] including: healthcare [3,27,28,38], finance [24,39], action recognition [32,49] and handwriting recognition [50].

Data preprocessing and cohort
The UK Biobank [9] is a national population-based study comprising 502,629 individuals. We extract a retrospective heart failure (HF) cohort using the same methodology as [19], which uses the previously-validated phenotyping algorithm in the CALIBER resource [18].
To form the sequential input data required for our predictive model, we extract primary and secondary diagnosis terms (ICD10), procedure terms (OPCS4) and timestamps ("epistart") from the UK Biobank inpatient dataset. Patient events are extracted with a buffer period of 90 days before HF diagnosis (for controls this is the HF diagnosis of its matched case) to exclude highly correlated events such as end of life care [29]. Events that occur at the same time are randomly ordered.
We create two separate corpora for each patient: PRIMDX is a corpus that only contains primary diagnosis terms, and PRIMDX-SECDX-PROC also includes secondary diagnoses and procedure terms. Since the number of events in each sequence is greater for the PRIMDX-SECDX-PROC cohort, this allows us to compare each of our methods' ability to handle longer sequences with more complex and redundant information.
In Table 1 we provide a breakdown of the demographics of the matched cohort used in this study. In Appendix A we provide further details on the HF cohort extracted and the tokenization process of healthcare terms.

Methods
Let each patient record be denoted by the path $x = (x_1, x_2, \dots, x_n)$, where each event $x_i \in \mathbb{R}^d$. Our objective is to classify each sequence with a binary variable which indicates whether the patient will develop heart failure within 90 days. The dimension of the path, $d$, is determined by the maximum number of unique tokens, as we represent each token with a one-hot vector, such that only the dimension corresponding with the index of the token in the vocabulary is one, with zeros everywhere else.
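As an illustrative sketch, this one-hot path construction might look as follows (token indices and the vocabulary size are hypothetical; in the actual pipeline these vectors are later projected to a lower dimension before the signature is taken):

```python
def one_hot_path(token_ids, d):
    """Represent a patient record as a path in R^d: each event in the
    sequence becomes a one-hot vector over the vocabulary of size d."""
    path = []
    for t in token_ids:
        point = [0.0] * d
        point[t] = 1.0
        path.append(point)
    return path
```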

Signature methods
The definition of the signature transform is as follows. Let $T > 0$ and $0 < t_1 < t_2 < \cdots < t_{n-1} < t_n = T$. Let $f_x : [0, T] \to \mathbb{R}^d$ be the unique continuous function such that $f_x(t_i) = x_i$ and which is affine on the intervals between them. The signature is the infinite collection of iterated integrals

$$\mathrm{Sig}(x)^{i_1, \dots, i_k} = \int_{0 < u_1 < \cdots < u_k < T} \mathrm{d}f_x^{i_1}(u_1) \cdots \mathrm{d}f_x^{i_k}(u_k), \qquad k \geq 1, \; i_1, \dots, i_k \in \{1, \dots, d\}. \tag{1}$$

The form of the signature in Eq. 1 can be broken down to help give the reader a better understanding, as done in [10]. We can start by simplifying to a single index $i \in \{1, \dots, d\}$. This reduces Eq. 1 to

$$\mathrm{Sig}(x)^i = \int_{0 < u < T} \mathrm{d}f_x^i(u). \tag{2}$$

As this is a single integral and $f_x$ is affine, the equation simply resolves to the increment of the $i$-th coordinate of the path:

$$\mathrm{Sig}(x)^i = f_x^i(T) - f_x^i(0) = x_n^i - x_1^i. \tag{3}$$

The double-iterated integral considers any pair of coordinates $i, j \in \{1, \dots, d\}$, such that

$$\mathrm{Sig}(x)^{i,j} = \int_{0 < u_1 < u_2 < T} \mathrm{d}f_x^i(u_1) \, \mathrm{d}f_x^j(u_2), \tag{6}$$

where we have used Eq. 2 and the ordered variables $u_1 < u_2$ to denote the integration limits. Notice that the integration limits in Eq. 6 correspond to integration over a triangle. Going further, this process can continue recursively and be interpreted as integrating over an increasingly high-dimensional simplex. This real number is known as the $k$-fold iterated integral seen in Eq. 1, where the superscripts are members of the set $\{1, \dots, d\}$.

We can further simplify this form and remove the need for the integral when we consider the path as a series of linear segments in a piecewise-linear path. For a single segment, the signature can be expressed by the product of its increments:

$$\mathrm{Sig}(x)^{i_1, \dots, i_k} = \frac{1}{k!} \prod_{m=1}^{k} \left( x_2^{i_m} - x_1^{i_m} \right).$$

To calculate each signature term of the full path, we can use Chen's identity, which states that the signature of the entire path can be calculated from the signatures of its segments [30]:

$$\mathrm{Sig}(x * y)^{i_1, \dots, i_k} = \sum_{m=0}^{k} \mathrm{Sig}(x)^{i_1, \dots, i_m} \, \mathrm{Sig}(y)^{i_{m+1}, \dots, i_k},$$

where $x * y$ denotes the concatenation of the two paths.

Using the signature as an infinite series in a machine learning pipeline would not be tractable. Instead, it is common to truncate the series at the $N$-th level; this is also known as the depth of the signature. This results in the finite collection of terms $\mathrm{Sig}(x)^{i_1, \dots, i_k}$ where the multi-index is restricted to length $k \leq N$. For example, a signature of depth 1 is the collection of $d$ real numbers $\mathrm{Sig}(x)^1, \dots, \mathrm{Sig}(x)^d$, and a signature of depth 2 is the collection of $d + d^2$ real numbers obtained by also including all pairs of indices. The number of terms $\tau$, for any truncated signature of depth $N$ of a $d$-dimensional path, where $d > 1$, is the geometric series

$$\tau = \sum_{k=1}^{N} d^k = \frac{d(d^N - 1)}{d - 1}. \tag{12}$$

For structured EHR data with hundreds or thousands of unique terms, this poses a significant computational issue. In the next section, we highlight a number of variations that can be used to encourage information into lower-order signature terms.
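To make the definitions above concrete, the following sketch computes the depth-2 truncated signature of a piecewise-linear path by applying the single-segment formula and Chen's identity segment by segment. This is an illustrative pure-Python implementation, not the optimised Signatory [25] routine used in the experiments:

```python
from itertools import product

def signature_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path.

    `path` is a list of points (tuples) in R^d. Level 1 accumulates
    coordinate increments; level 2 is updated per segment via Chen's
    identity, using Sig^{i,j} = delta_i * delta_j / 2 for a linear piece.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for a, b in zip(path, path[1:]):
        delta = [bi - ai for ai, bi in zip(a, b)]
        # Chen: Sig(x*y)^{i,j} = Sig(x)^{i,j} + Sig(x)^i Sig(y)^j + Sig(y)^{i,j}
        for i, j in product(range(d), repeat=2):
            level2[i][j] += level1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            level1[i] += delta[i]
    return level1, level2
```

For the two-segment path (0,0) → (1,0) → (1,1) this yields Sig¹ = Sig² = 1, Sig^{1,1} = Sig^{2,2} = 0.5, Sig^{1,2} = 1 and Sig^{2,1} = 0; traversing the two increments in the opposite order swaps the cross terms, showing that the second level captures ordering, not just increments.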
In Appendix B, we provide a further breakdown of the definitions provided here and explore an example in toy data to show how the signature terms describe sequential data. Theoretically, the signature terms are proven to uniquely describe any path up to translations (Proposition 1) and act as a universal non-linearity (Proposition 2). This latter property is shared with neural networks and allows us to reduce potentially complicated non-linear relationships between variables into linear ones.

Signature variations
A body of variations on the standard signature transform has been developed. Each can tailor the properties of the signature to be more suited to a certain task. [37] provides an overview of possible variations of the signature together with an empirical evaluation on streams data. Given the substantially greater dimensionality of structured EHR data, we restrict our investigation to the augmentations in Table 3 and the log-signature (Table 2).

Augmentations
An augmentation transforms our sequence of patient events $x \in \mathbb{R}^d$ into one or several ($p$) new sequences, whilst potentially changing the dimensionality of each path to $a$. In general, this can be described by the map

$$\phi : \mathcal{S}(\mathbb{R}^d) \to \mathcal{S}(\mathbb{R}^a)^p,$$

where $\mathcal{S}(\mathbb{R}^d)$ denotes the space of sequences over $\mathbb{R}^d$.

The time augmentation consists of the concatenation of an extra dimension. As shown in Proposition 1, this can be used in the absence of any actual timestamps by simply using the index of the event in the sequence. In both cases, this removes the time-parameterisation invariance of the signature [31]. We also investigate applying actual time differences from the prediction date to account for the irregularly sampled nature of the data. We follow [1] by applying the parameterised scaling function $f(\Delta T) = T_{\text{scale}} \log(\Delta T)$, capped at a maximum $T_{\max}$. $T_{\text{scale}}$ and $T_{\max}$ control extreme time-deltas and are optimised as hyperparameters.
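A sketch of this time-delta scaling, assuming the parameterisation $f(\Delta T) = T_{\text{scale}} \log(\Delta T)$ capped at $T_{\max}$ (the function name and default values here are illustrative, not the tuned hyperparameters):

```python
import math

def scaled_time_delta(delta, t_scale=1.0, t_max=10.0):
    """Compress a time difference from the prediction date so that extreme
    deltas do not dominate: t_scale * log(delta), capped at t_max."""
    return min(t_scale * math.log(delta), t_max)
```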
The basepoint [25] is used to remove the property of translational invariance. This property means that the signature of two paths separated by a constant translation will be the same. The basepoint also has a significant advantage for our pipeline as ∼ 20% of pathways in the dataset used in this study have only a single event. Basepoint introduces an origin point at the start of each path and thus ensures each path has at least two points which is a requirement for calculating the signature.
The lead-lag augmentation [10,22] adds shifted copies of the path as new coordinates. This augmentation explicitly captures the quadratic variation of the underlying process, an important concept for our data, where the covariance between medical concepts is known to be highly important to the underlying pathology of disease [16,42]. A lag of a single timestep is described by the following augmentation:

$$\phi(x) = \big((x_1, x_1), (x_2, x_1), (x_2, x_2), \dots, (x_n, x_{n-1}), (x_n, x_n)\big),$$

where the first coordinate of each pair is the lead and the second is the lagged copy.

The learnt projection can be described by the affine transformation, or embedding, $\theta_A \in \mathbb{R}^{a \times d}$, such that

$$\phi(x_i) = \theta_A x_i. \tag{15}$$

This reduces the dimensionality of the path to make the calculation of the signature tractable.

The log-signature corresponds with taking the formal logarithm of the signature in the algebra of formal power series [10]. Both the signature and its logarithm uniquely define a path (Proposition 1), but the log-signature does not hold the same universality property (Proposition 2) [32]. The log-signature maps to a smaller number of terms at each truncation level, determined by Witt's formula, which is shown in Appendix B.3.2 along with an example.
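The lead-lag augmentation can be sketched directly; points are tuples, so concatenation doubles the dimension and the output interleaves lead and lag steps (an illustrative implementation, not the library routine used in the experiments):

```python
def lead_lag(path):
    """Lead-lag augmentation: each output point pairs the current (lead)
    value with a one-step lagged copy, interleaved so only one component
    changes per step. Dimension doubles; length becomes 2n - 1."""
    out = [path[0] + path[0]]
    for prev, cur in zip(path, path[1:]):
        out.append(cur + prev)  # lead advances first
        out.append(cur + cur)   # lag catches up
    return out
```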

Deep signatures
For the affine transformations discussed in Equation 15, we briefly described a learning process. As detailed in recent works from [7], it is possible to train the affine transformations together with the signature transform through an end-to-end neural network architecture.
Here, the signature acts as a non-parametric pooling function able to extract provably useful information from sequential data. It is possible to calculate the gradient needed in this method as the signature can be formulated as a calculation tree of differentiable operations [25,45].
The generalised function of the neural-signature model used in this work can be written as

$$\hat{y} = \sigma\left( f_{\theta_{\text{fc}}}\left( \mathrm{Sig}^N\left( \phi\left( \theta_A x \right) \right) \right) \right),$$

where the learnt parameters are the weights of a fully connected neural network classifier, $\theta_{\text{fc}}$, and the elements of the affine transformation augmentation, $\theta_A$. The sigmoid function $\sigma$ is used to map the output activation to a $[0,1]$ score.
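A minimal sketch of this forward pass, keeping only a depth-1 signature (the path increment from a basepoint at the origin) and a single linear output layer for brevity; the actual model uses deeper signatures and a fully connected classifier, so the names and shapes here are purely illustrative:

```python
import math

def neural_signature_forward(tokens, theta_A, theta_fc, bias=0.0):
    """Hypothetical minimal forward pass: one-hot tokens are embedded by the
    affine map theta_A (a x d), a basepoint at the origin is prepended, the
    depth-1 signature (the path increment) is taken, and a linear layer with
    a sigmoid produces the [0, 1] risk score."""
    a = len(theta_A)
    # one-hot embedding: token index t selects column t of theta_A
    path = [[0.0] * a] + [[theta_A[r][t] for r in range(a)] for t in tokens]
    # depth-1 signature: increment between last point and the basepoint
    sig = [path[-1][r] - path[0][r] for r in range(a)]
    z = sum(w * s for w, s in zip(theta_fc, sig)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```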

Experiments
As baselines, we consider a bag of words model with logistic regression, a commonly used basic model, along with a GRU model, which is comparable with the state-of-the-art RETAIN [14,48]. We also include a GRU variation that incorporates the time-delta augmentation.
Additionally, we consider the following signature models: the standard signature (S) provides the baseline for further variations; the log-signature (LS) removes the universality property (the fully connected neural network classifier still guarantees this overall) but greatly reduces the number of signature terms; the lead-lag (LL) augmentation encourages information about the quadratic variation into lower-order signature terms; the add time index augmentation (ATI) provides sensitivity to parameterization; and the time delta (ATD) version goes further to account for non-uniform sampling rates. We limited the exploration of augmentations to the above after initial testing on validation data found the lead-lag augmentation to be most influential.
We use two metrics for evaluation; area under the receiver operator curve (AUROC) and area under the precision-recall curve (AUPRC).
Previous studies have shown that the AUROC can provide misleading results when there is considerable data imbalance, particularly when the number of negative examples is high and we have a preference for identifying true positive examples [46]. This issue exists in our task due to the 1:9 case-control split and the increased benefit of correctly identifying HF cases over correctly identifying controls. This class imbalance can cause the AUROC to become inflated by a high number of true negative cases. AUPRC is an alternative metric that captures the trade-off between precision and recall. Crucially, it ignores the number of true negatives, allowing changes in performance to be seen without being diluted as in the AUROC.
The signature variations explored are summarised together with the baselines in Table 3. Common to each model is the architecture shown in Fig. 1. Further details on implementation, including initialisation, activation functions, optimisation, hyperparameters, regularisation, and other such related details are found in Appendix C.

Results
From Table 4, we observe similar predictive performance across signature models using lead-lag augmentations and GRU models over all corpora, with all metrics from the two sets of models remaining within one standard deviation of performance seen on the validation data. All models perform the same or better on the larger PRIMDX-SECDX-PROC cohort, but more complex models gain a more significant benefit from the added data.
The addition of time augmentations does not show a consistent improvement in performance over applying the lead-lag augmentation alone, and there is no consistent difference between adding a time index and a time delta. Increasing the depth of the signature to three also shows no significant increase in performance. Without the lead-lag transform, signature models perform similarly to the bag of words baselines.

Data ablation study
Our final set of experiments evaluates how the models perform as the volume of data is reduced. For the data ablation study, each trial randomly samples a proportion of the training and validation dataset. For each proportion a new set of hyperparameters is found for each model.
The model parameters and hyperparameters are trained using 5-fold cross validation on the sub-sample while the remaining data is unused. The ablation study test data remains the same as the main experiment.
Again, results are broadly similar for both models except for three points where the two models produce performance outside of one standard deviation of validation performance. Notably, at 20% data ablation for AUPRC, the signature model has a 21.0% higher score with 0.283 versus 0.237 for the GRU. Overall, both models' performance begins to saturate at ∼ 20% for both metrics, and the results show no conclusive trend as to which model performs best as the amount of training data is reduced.

Discussion
Given the properties that the two methods share, these results might not come as a surprise. However, without the lead-lag augmentation, the performance of the signature models drops significantly. This confirms the prior belief that the quadratic variation of the path plays an important role in structured EHR HF prediction. This could correspond with encouraging features that describe changing comorbidities to be present in the lower-order terms of the signature. For the data ablation study, we expected the signature model to outperform the GRU baseline; however, results for both methods are similar. Our prior hypothesis was partly motivated by the success of signature methods in previous shallow machine learning tasks [38]. A key difference in our task could be the high dimensionality and reliance on embeddings to make the signature tractable. The need to train these embeddings is likely data-intensive but could benefit from initialisation with pretrained word2vec embeddings, as has been shown for RNNs [12].

Comparison to previous literature
Comparing the results in this paper directly to previous work is challenging due to the use of different underlying study designs, populations, and incomplete definitions of cohorts and outcomes [17]. We note that previous works investigating sequential models for predicting HF on structured EHR data have found greater performance [14,44,48]. In particular, these works also show a more significant performance difference between bag of words baselines and RNN-based architectures. Again, differences in the features and data sources used make comparisons difficult. For example, we could compare against Solares et al. [48], which also uses data from a multicenter UK EHR data source and achieves 0.951 AUROC using the RNN-based model RETAIN. However, we must consider that the authors also include primary care and demographics data, which could influence prediction performance independent of model choice. The large US multicenter study by [44] shows RETAIN achieving a more comparable AUROC of 0.769 on US healthcare data with a balanced cohort of 14,500 cases and with only diagnosis codes provided as prediction input. The same model achieved an AUROC of 0.822 when trained and tested on the full cohort of 152,790 cases and 1,152,517 controls with diagnosis, demographic, medication, and surgery data.
In this work, we have restricted our focus to prediction on high dimensional structured EHR data. Signature methods have shown success in related health prediction applications, but in lower dimensional, high frequency data domains including: mood ratings for Bipolar and Borderline Personality Disorder [3], brain imaging data for Alzheimer's disease [36] and physiological data for sepsis prediction [38]. Future work could look to expand signature method applications within similar domains such as ECG signal diagnosis [6] and prediction systems for biogas production [8].

Conclusion
Given the prevalence of RNNs in current structured EHR architectures, any improvement in this fundamental component is likely to influence future work significantly. A substantial body of theory motivates the use of signature transforms to represent sequential data, and previous works have shown them to have strong empirical performance. In particular, recent works on neural-signature architectures have enabled their application to high-dimensional data.
This work is the first to show that neural-signature methods with dimensional reduction before the transform are competitive on high dimensional structured EHR data. Using an HF prediction task, we evaluated the signature transforms as an alternative to RNNs that provide a predictive and compact representation of sequential structured EHR data. We show that the signature achieves comparable performance to RNNs and that the performance of both models saturates with a similar number of training examples. While the signature originates from perhaps abstract theory, empirically, it can successfully compete with the current state-of-the-art architectures.

Appendix A: Cohort
The HF cohort excludes: self-reported prevalent HF cases, patients who died during the study period, and heart failure diagnoses outside the study period of 1997 to 2015. The cases are matched to controls on assessment center, year of recruitment, sex, and year of birth. Controls use an index date that is assigned to the date of HF diagnosis of the matched case. We randomly sampled the total potential control population of 496,892 with the above matching criteria to create a cohort with a ratio of 1:9 cases to controls. This resulted in 5722 cases and 51,498 controls, where 15 cases could not be matched.

A.1 Tokenization
We follow a similar practice to NLP tokenization, using tags to identify whether a code comes from a particular corpus. This allows us to distinguish between primary and secondary codes with the same ICD10 term and to introduce additional tokens to indicate if there is no data present. The tokenization process also applies two measures to reduce the vocabulary size: limiting the length of all terms to 4 characters and requiring a minimum count of 5. This is possible because many ICD10 and OPCS4 terms contain extension codes or additional sub-chapter codes which are not commonly used across healthcare providers and often indicate only minor differences between terms. The tokenizer vocabulary is fitted on the training data, and any subsequent code that is not present in the vocabulary is assigned an out-of-vocabulary token in its place.
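These two vocabulary-reduction measures can be sketched as follows (the function names, the `<OOV>` token string, and the ordering of truncation before counting are assumptions for illustration; corpus tags are taken to be applied upstream):

```python
from collections import Counter

def build_vocab(train_docs, max_len=4, min_count=5):
    """Fit a vocabulary on training sequences of codes: truncate each code
    to max_len characters and keep it only if the truncated form occurs at
    least min_count times. Index 0 is reserved for out-of-vocabulary."""
    counts = Counter(code[:max_len] for doc in train_docs for code in doc)
    vocab = {"<OOV>": 0}
    for code, n in counts.items():
        if n >= min_count:
            vocab[code] = len(vocab)
    return vocab

def encode(doc, vocab, max_len=4):
    """Map a sequence of codes to token ids, falling back to <OOV>."""
    return [vocab.get(code[:max_len], vocab["<OOV>"]) for code in doc]
```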

B.1 A geometric intuition and exploration in toy data
Previously, in Eq. 3, we showed that the first level resolves to the increment of the path in a single dimension. It is also possible to find a geometric interpretation for the 2nd-level terms in Eq. 6. For 2nd-level terms where $i = j$ we find $\mathrm{Sig}(x)^{i,i} = (x_T^i - x_0^i)^2 / 2$, which is simply the area of a triangle whose sides are the increment in that coordinate. If $i \neq j$, another geometric comparison can be made by considering the Lévy area, which is the signed area shown in Fig. 2. This area illustrates the 2nd-order terms through the following equation:

$$A = \frac{1}{2} \left( \mathrm{Sig}(x)^{i,j} - \mathrm{Sig}(x)^{j,i} \right). \tag{17}$$

Fig. 2 can be related to our task if we consider the line (path 2) and its chord (path 1) to be two separate patient pathways with three unique events (1, 2, 3), but for path 2 the order of events 1 and 2 is swapped. When classifying a patient pathway for HF prediction, the order of events may hold vital information. However, when we calculate the terms of the signature in Table 5, we see that the two paths are only separable when taking a signature of depth 3.
Taking the geometric interpretation to higher orders becomes more complicated, but for additional interpretable relations we refer the reader to [10], which expands on the ideas here to include relations to statistical moments. It is clear that these signature terms extract information about ordering and that higher-order signature terms can be used to better discriminate between paths. It is also possible to observe how the terms of the signature change by drawing a path using an online tool.

B.2 Key properties and caveats for signature methods
A natural extension to our toy example would be to ask if the signature can be used to discriminate between any path. In fact, it has been shown that a path is essentially defined by its signature and that almost no information is lost. There are two key properties of the signature that make its use interesting for path-like data.
For the purpose of this statement we have introduced the two concepts of time-augmenting a path and of adding a basepoint. In other words, we know that we can find a linear function that, when applied to the signature of a path, can approximate any function applied to that same path. This powerful property is shared with neural network approaches and allows us to reduce potentially complicated non-linear relationships between variables into linear ones.
We refer the interested reader to Appendix A of [7] for an expanded summary.

B.3 Variations on the signature transform
There is a growing body of variations on the standard signature transform. Each can tailor the properties of the signature to be more suited to a certain task. [37] gives an overview of the variations on the signature transform and proposes a generalised framework for extracting signature features from time-series data:

$$y^{i,j} = \left( \rho_{\text{post}} \circ \mathrm{Sig}^N \circ W^{i,j} \circ \rho_{\text{pre}} \circ \phi \right)(x).$$

Here we have introduced a number of new terms, each of which will be explained in more detail and discussed in relation to our data. Briefly: $\phi$ is known as an augmentation; $W$ describes a windowing operation; $\rho$ describes a re-scaling operation applied both pre and post calculating the signature. Finally, $\mathrm{Sig}^N$ represents either the now-familiar signature operation up to depth $N$ or the log-signature variation, which will be introduced later in this section. The indices $(i, j)$ refer to the signature terms of windowed operations, with $y^{i,j,k}$ reducing to the more familiar signature terms up to level $k$, $y^k$, when a global window is used. For our work, we will focus on only two aspects of this generalised framework, augmentations and the log-signature, leaving the exploration of the additional steps to future work.

B.3.1 Augmentation
An augmentation transforms our sequence of patient events $x \in \mathbb{R}^d$ into one or several ($p$) new sequences, whilst potentially changing each path's dimensionality to $a$. In general, this can be described by the map

$$\phi : \mathcal{S}(\mathbb{R}^d) \to \mathcal{S}(\mathbb{R}^a)^p.$$

A number of different augmentations have been used for continuous time-series data.
The augmentations in Table 2 are broadly separated into two categories. Fixed augmentations use a transformation that does not vary after initialisation, while learnt augmentations include learnable parameter weights that are trained as part of the model fitting process. We start by explaining the two previously mentioned augmentations: time and basepoint.
The time augmentation is the addition of an extra dimension that is dependent on the order of the sequence. As shown in Proposition 1, this can be used in the absence of any actual timestamps by simply using the index of the event in the sequence, $\phi(x)_i = (i, x_i)$, or, if timestamps $t_i$ are available, $\phi(x)_i = (t_i, x_i)$. In both cases this removes the time-parameterisation invariance of the signature [31], meaning that without time augmentation the signature only encodes the order in which events arrive and does not consider when each event arrives. This could potentially be a significant factor for our task. For example, consider two patient records with the same sequence of HF-related inpatient admissions; if in one sequence the frequency of these admissions was much higher, it might suggest an increased rate of disease progression. Using the actual timestamps allows the signature to account for the irregularly sampled nature of the data.
The basepoint [25] and invisibility [49] augmentations are both created with the goal of removing the property of translational invariance, which means that the signatures of two paths separated by a constant translation will be the same. In [37] a comparison shows that the invisibility augmentation has essentially the same performance as the basepoint but increases the dimensionality of the path, which is of concern since the signature scales with $O(d^N)$, as seen in Eq. 12. The basepoint also has a significant advantage for our pipeline, as ∼20% of pathways in the dataset used in this study have only a single event. Basepoint introduces an origin point at the start of each path and thus ensures each path has at least two points, which is a requirement for calculating the signature.

The lead-lag augmentation [10,22] adds shifted copies of the path as new coordinates. This explicitly captures the quadratic variation of the underlying process, an important concept for our data, where the covariance between medical concepts, otherwise described as comorbidities, is known to be highly important to the underlying pathology of disease [16,42]. A lag of a single timestep is described by the following augmentation:

$$\phi(x) = \big((x_1, x_1), (x_2, x_1), (x_2, x_2), \dots, (x_n, x_{n-1}), (x_n, x_n)\big).$$

Lead-lag comes at a high cost, as the dimensionality and length of the path are doubled.

The remaining augmentations provide a potential remedy for this, as they propose reducing the dimensionality of the path before calculating the signature. We only consider stream-preserving neural networks that take one projection, $p = 1$, as the dimensionality of our paths is orders of magnitude higher than that considered in [37]. Stream-preserving neural networks have also been used as an initial stage in previous literature on structured EHR HF prediction tasks and provide a strong basis for comparison [15,19,48]. In its most basic form, a stream-preserving neural network can be created by introducing a learnable affine transformation, $\theta_A \in \mathbb{R}^{a \times d}$, such that $\phi(x_i) = \theta_A x_i$.

B.3.2 The log-signature
The log-signature is a compressed version of the signature that only retains the terms indexed by the Lyndon words of the index series [2]. As an example, the terms of the log-signature up to depth 2 are

$$\mathrm{LogSig}(x) = \left( \mathrm{Sig}(x)^1, \dots, \mathrm{Sig}(x)^d, \; \left\{ \tfrac{1}{2} \left( \mathrm{Sig}(x)^{i,j} - \mathrm{Sig}(x)^{j,i} \right) \right\}_{i < j} \right).$$

Here we can notice that the 2nd-order terms of the log-signature directly correspond with the Lévy area in Eq. 17.
The number of terms in this more compact form of the signature can be generalised with Witt's formula:

$$\tau_{\log} = \sum_{k=1}^{N} \frac{1}{k} \sum_{\ell \mid k} \mu(\ell) \, d^{k/\ell},$$

where $\mu$ is the Möbius function.
This results in roughly a third of the number of output terms when calculating the signature for a path with $d = 50$ at depth 3 ($\tau_{\log} = 42{,}925$ vs. $\tau = 127{,}550$).
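The counts quoted above can be checked with a short script implementing the geometric series in Eq. 12 and Witt's formula via the Möbius function (a pure-Python sketch):

```python
def mobius(n):
    """Möbius function by trial division: 0 if n has a squared prime
    factor, otherwise (-1)^(number of distinct prime factors)."""
    result, p = 1, 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:
                return 0
            result = -result
        p += 1
    if n > 1:
        result = -result
    return result

def sig_terms(d, depth):
    """Number of truncated signature terms (geometric series, Eq. 12)."""
    return sum(d ** k for k in range(1, depth + 1))

def logsig_terms(d, depth):
    """Number of log-signature terms: Lyndon words of length <= depth
    over d letters, counted level by level with Witt's formula."""
    total = 0
    for k in range(1, depth + 1):
        total += sum(mobius(l) * d ** (k // l)
                     for l in range(1, k + 1) if k % l == 0) // k
    return total
```

For d = 50 at depth 3 this reproduces τ = 127,550 and τ_log = 42,925.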

C.1 Augmentations
After tokenization, we project the sequence into a lower-dimensional representation with an embedding, which is the learnable affine transformation described in Equation 15. For neural-signature and GRU baselines, this is initialized with a Xavier distribution, trained in an end-to-end fashion, and updated during gradient descent.
We experiment with applying different combinations of the previously discussed augmentations: lead-lag, index, or time-delta parameterization. If both lead-lag and time parameterization are applied, then time parameterization is always done first, such that $\phi(x) \in \mathbb{R}^{2(a+1)}$. This sequence of embedded codes is used as input to the encoder stage.
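The resulting dimensionality can be checked with a toy composition (the helper names and the embedded dimension a = 2 are illustrative):

```python
def add_index(path):
    """Index augmentation: prepend the event's position as a time coordinate."""
    return [(float(i),) + p for i, p in enumerate(path)]

def lead_lag(path):
    """Lead-lag augmentation over tuple-valued points (doubles dimension,
    length becomes 2n - 1)."""
    out = [path[0] + path[0]]
    for prev, cur in zip(path, path[1:]):
        out.append(cur + prev)
        out.append(cur + cur)
    return out

# an embedded sequence of n = 3 events with a = 2 dimensions
embedded = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
augmented = lead_lag(add_index(embedded))  # time first, then lead-lag
```

Each augmented point then has 2(a+1) = 6 coordinates.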

C.2 Classifier
Dropout is applied to the outputs of fully connected hidden layers and determined by hyperparameter optimization.
The bag of words models use a logistic regression classifier with the sklearn implementation of L-BFGS-B and L2 regularisation.

C.3 Implementation
The experiments were implemented using PyTorch and Signatory [25]. Models were trained on various GPU cards available through the UCL Department of Computer Science High Performance Computing Cluster, using distributed asynchronous hyperparameter optimization and MongoDB. Code used for the project, including full details on hyperparameter bounds, is available at https://github.com/andre-vauvelle/doctor-signature.

C.4 Training and validation schema
All models are trained by minimizing binary cross-entropy loss with Adam [26]. We also stop training after five epochs if validation loss does not improve. The best model is then taken for evaluation on the test dataset.
Nested cross-validation is used to evaluate our models and baselines. The data is split into training, validation, and held-out test sets with a 90% training and validation, 10% test data split. Training and validation sets are used in an inner loop with 5-fold cross-validation to tune model hyperparameters. Each fold is stratified such that we preserve the percentage of samples for each class. We use a Bayesian hyperparameter optimization algorithm, the Tree-structured Parzen Estimator, to reduce the number of evaluations needed. We fix the number of trials for each model at 100 [5]. We then select the hyperparameters with the highest average AUROC score across the 5 folds. Test data is unseen until the final evaluation, where we use it to make predictions using models trained with optimal hyperparameters from the inner validation loop on one fold of data.