Predicting Outcomes of Business Process Executions based on LSTM Neural Networks and Attention Mechanism

Effectively predicting the outcome of an ongoing process instance supports early decision making, which plays an important role in so-called predictive process monitoring. Existing methods in this field rely on empirically tailored operations such as prefix extraction, clustering, and encoding, so their relative accuracy is highly sensitive to the dataset. Moreover, their lengthy prediction time limits their applicability to real-time prediction. Since the Long Short-Term Memory (LSTM) neural network provides high precision in the prediction of sequential data in several areas, this paper investigates LSTM and its enhancements and proposes three different approaches to build more effective and efficient models for outcome prediction. The first enhancement combines the original LSTM network in two directions, forward and backward, to capture more features from the completed cases. The second enhancement adds an attention mechanism after the feature extraction in the hidden layer of the LSTM network to distinguish features by their attention weights. A series of extensive experiments is conducted on twelve real datasets, comparing with other approaches. The results show that our approaches outperform the state-of-the-art ones in terms of both prediction effectiveness and time performance.


Introduction
Process mining refers to discovering, detecting, and improving behavior patterns from event logs produced by Process-Aware Information Systems (PAIS) during business process execution (Van der Aalst 2016). Such logs contain knowledge that can be used to predict new process executions. For example, on the basis of these historical event logs, we can predict the next activity to be executed (Pauwels and Calders 2020; Tama et al. 2020; Mehdiyev et al. 2020; Tax et al. 2017), the remaining execution time (Sun et al. 2020; Verenich et al. 2019; Tax et al. 2017; Rogge-Solti and Weske 2013), and the final outcome (positive vs. negative) (Kratsch et al. 2020; Maggi et al. 2014; Metzger et al. 2015) of an ongoing process instance (case). Predictive tasks that monitor process execution in real time by predicting its execution information belong to Predictive Process Monitoring (PPM) techniques. In particular, at the process execution (monitoring) stage of the Business Process Management (BPM) lifecycle, accurately predicting the outcome of an ongoing process instance helps make decisions at the right time. In this paper, we focus on how to construct a predictive model that makes outcome predictions for running cases based on their executed events as well as the knowledge mined from the historical event log. Usually, the final outcome of a process is determined by its business goal, such as normal vs. deviant for a production process, or accepted, declined, or canceled for an application process. A multi-class outcome prediction task can be translated into multiple binary classification tasks (Teinemaa et al. 2018) so as to compare with other research (Metzger et al. 2015).
Up to now, many approaches based on traditional machine learning techniques have been proposed for the outcome prediction of business processes (Teinemaa et al. 2018). Among them, the most representative are Random Forest (RF)-based and XGBoost-based methods. For a given business process, their goal is to construct one or more predictive models (i.e., classifiers) based on the hidden knowledge in the historical process log, and then predict the outcome of a running case. Here, a running case means a new process instance that is being executed but not yet completed. Specifically, such methods can be divided into two phases. During the offline phase, some customized operations, such as prefix extraction, clustering, encoding, and classification algorithms, need to be selected empirically to construct one or more classification models, whereas during the online phase, it is necessary to determine which cluster a running case belongs to and then select the corresponding encoding strategy and classification model to predict its outcome. Consequently, an inappropriate selection during the offline phase may lead to unstable performance of the prediction model. In addition, during the online prediction phase, the operations for determining the right predictive model take a long time, which limits the efficiency of these methods.
With the development of neural networks, the construction of prediction models has become more and more automated, and the effectiveness as well as efficiency of prediction has improved significantly. In particular, the Recurrent Neural Network (RNN) and its improvements have demonstrated their capabilities for classification and prediction. To capture long-range context dependencies in a sequence, the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997; Graves 2012) was proposed on the basis of RNN and has proven more effective in similar predictive tasks (Zhou et al. 2016). Moreover, to enrich the features from two directions (forward and backward), the Bidirectional LSTM network (BiLSTM) has been developed for such prediction tasks. Besides, the attention mechanism, which originates from the human visual system (Rensink 2000), has been applied in neural network research (Yang et al. 2016). This mechanism assigns more weight to the most relevant parts of the input data instead of attending to all available information equally, and it is often integrated with RNN, LSTM, BiLSTM, and so on (Zhou et al. 2016; Shang et al. 2015).
Since a running case consists of a sequence of events, the outcome prediction task can be treated as a sequential event prediction problem. Motivated by these observations, we leverage neural networks, especially bidirectional feature extraction and the attention mechanism, for the outcome prediction problem and propose three approaches, Bi-LSTM, Att-LSTM, and Att-Bi-LSTM, based on the original LSTM network. Compared with traditional machine learning techniques, our approaches automatically build a predictive model without prior empirical knowledge by extracting the features that affect the process outcome from historical process instances. Therefore, our approaches have significant advantages in effectiveness and efficiency for real-time prediction. The difference among our approaches lies in the way features are extracted: the Bi-LSTM approach adopts bidirectional feature extraction based on the LSTM network, the Att-LSTM approach extracts features with an LSTM network equipped with the attention mechanism, and the Att-Bi-LSTM approach adopts a hybrid of bidirectional LSTM feature extraction and the attention mechanism. Accordingly, the prediction models constructed with these different feature extraction strategies perform differently. Therefore, in this paper, we study the prediction effectiveness, time performance, and stability of these approaches and compare them with traditional techniques.
The preliminary work of this paper has been published in (Wang et al. 2019). The major extensions concern model construction, experiments, and discussion. Regarding the construction of the predictive model, we present two further approaches, Bi-LSTM and Att-LSTM, based on the two enhancements of bidirectional feature extraction and the attention mechanism, respectively. Moreover, we describe a complete process for the outcome prediction of business processes, including how to train a prediction model offline and how to predict the final outcome of an on-going case online. As for the experiments, we introduce a new indicator of prediction stability to evaluate the performance of our proposed approaches and then compare the influence of bidirectional feature extraction and the attention mechanism on outcome prediction. Finally, we further discuss the threats to validity of our experimental evaluation.
In summary, the contributions of this paper are as follows.
- We propose three approaches, Bi-LSTM, Att-LSTM, and Att-Bi-LSTM, to predict the outcome of a running case, showing how to automatically build an effective and efficient prediction model by using bidirectional feature extraction alone, the attention mechanism alone, and a mixture of the two on the basis of the LSTM network.
- We introduce an indicator of temporal stability in terms of prediction effectiveness to evaluate our approaches and compare them with approaches based on traditional machine learning techniques.
- We conduct a series of extensive experiments to investigate the effectiveness of prediction model construction as well as the performance of online outcome prediction.
The rest of the paper is organized as follows. Section 2 briefly discusses related studies on outcome prediction and deep learning techniques. Section 3 introduces basic definitions and states the problem. The detailed solution and the evaluation of its effectiveness and efficiency are presented in Sections 4 and 5, respectively. Finally, Section 6 discusses the threats to validity of our study, and Section 7 concludes the paper and outlines directions for future work.

Outcome Prediction of Business Processes
The main issue addressed in this paper is predicting the outcome class of an on-going case in real time. The key is to first train a prediction model on historical cases labeled with their outcome class, and then predict the outcome class of a running case with this model. Consequently, solutions can generally be divided into two phases: an offline phase in which the prediction model is trained, and an online phase in which the outcome class of a running case is forecast from its executed events and the trained prediction model. With traditional machine learning techniques, the former phase usually includes four main operations: prefix extraction for all historical cases, clustering, encoding, and training of the prediction models. Based on the clusters (buckets) and classifiers constructed in the offline phase, the outcome class of a running case can be predicted in the online phase. Many methods have been proposed for these operations. For example, Verenich et al. (2016) presented a clustering-classification method, where the extracted prefixes of historical cases are clustered first and a classifier is then established for each cluster to predict the outcome. Di Francescomarino et al. (2016) developed a framework that constructs classifiers (i.e., prediction models) offline, with one cluster for each prefix length of the historical cases, and makes predictions online, where a running case is first assigned to a cluster and then handled by the corresponding classifier. Similarly, De Leoni et al. (2016) implemented a generic framework that obtains a decision tree or a regression tree by correlating business process characteristics and then uses this tree to cluster event logs in PPM.
In terms of feature encoding, many methods predict the most likely outcome of an ongoing case by extracting features from the historical cases of an event log rather than from the events of each case. For example, Leontjeva et al. (2016) proposed two feature encoding methods: one is based on the index of events in a case, and the other combines the index-based encoding with hidden Markov models (HMMs). Similarly, Verenich et al. (2016) used the index-based encoding method. However, this method can only be applied when all traces have the same length, because the length of the encoded feature vector grows with every executed event. Besides, the commonly used last-state encoding technique, which keeps only the latest available value of each data attribute after each executed activity in a case (Di Francescomarino et al. 2016), has been shown to be disadvantageous because it ignores the information of all events that occurred in the past. In addition, Senderovich et al. (2017) investigated a feature encoding method for process cases that relies on a bi-dimensional state-space representation of intra-case and inter-case dependencies in predictive process monitoring.
In terms of constructing classifiers, the commonly used methods mainly include decision trees (DT) (De Leoni et al. 2016), random forests (RF) (Bishop 2006), support vector machines (SVM), and gradient boosted trees (XGBoost) (Leontjeva et al. 2016; Friedman 2001). Among them, DT explains its results well, RF offers higher prediction accuracy, and XGBoost usually achieves even higher accuracy than RF. Besides, Teinemaa et al. (2016) proposed a method that trains a classifier for each possible prediction point, such as after the execution of a certain event. Likewise, Conforti et al. (2015) developed an approach that obtains a multi-classification model by using a decision tree method to train a classifier at each decision point of a process. Some process cases include not only the activities themselves but also the documents created by these activities. To handle prediction for such cases, Lakshmanan et al. (2010) adopted an ant-colony optimization (ACO) based method to extract a probabilistic activity graph from process traces, and then used this graph to identify key decision points in a given process.
The above classification approaches and others such as clustering-based and regression-based methods (Kang et al. 2012) usually require complex operations to build classification models. Fortunately, deep learning techniques can usually solve sequential data prediction tasks effectively and efficiently by automatically constructing a classifier from historical data. Recently, some approaches based on deep learning have been proposed for the next-activity and remaining-time prediction tasks in PPM. For example, Tax et al. (2017) used an LSTM network to predict the next activity of a running case and its possible timestamp, which was shown to outperform existing techniques. Viewing the next-event prediction task as a multi-classification problem, Mehdiyev et al. (2020) presented a multi-stage deep learning method based on a deep feed-forward multilayer neural network. However, to the best of our knowledge, similar research for predicting the outcome of a process has not yet been undertaken. As indicated in (Kratsch et al. 2020; Teinemaa et al. 2019), it is very meaningful to study LSTM for process outcome prediction. Therefore, in this paper, we are dedicated to proposing neural network-based approaches for the outcome prediction problem.

Recurrent Neural Networks and Attention Mechanism
For an on-going case, the final outcome can usually be predicted from the executed part of the case and a classification model trained on the completed cases of the event log. Here, each completed historical case consists of a sequence of events, and its final outcome can be determined by features extracted from the execution of these events. To obtain a classifier for outcome prediction, these features need to be captured automatically from historical cases. Up to now, there has been no research using neural networks to predict the outcome of a running process case. Fortunately, the essence of outcome-oriented prediction is similar to the text relation classification task in Natural Language Processing (NLP), in which each text consists of a sequence of words and the features extracted from these words determine the text relation. For the problem of learning hidden features automatically, some Deep Neural Network (DNN) techniques have been successfully employed. For instance, in text relation classification, Zeng et al. (2014) utilized Convolutional Neural Networks (CNN) to extract lexical and sentence-level features without complex pre-processing. However, since CNNs are not suitable for learning long-distance dependencies, Zhang and Wang (2018) employed bidirectional RNNs to learn patterns from raw data by extracting information from both past and future context. To counter the vanishing gradient of RNNs and the limitation of context scope, LSTM (Hochreiter and Schmidhuber 1997) was introduced for text relation classification, and several variants were subsequently proposed. As demonstrated in (Zhou et al. 2016), bidirectional LSTM can not only learn hidden features from both past and future context, but also maintain long-distance dependency memory. Therefore, in this paper, we focus on bidirectional feature extraction based on LSTM networks for the outcome prediction task.
In addition, the attention mechanism has proven able to optimize the extracted features of neural networks in different fields. For example, in question answering, Nie et al. (2017) applied the attention mechanism in a bidirectional LSTM network to optimize the features by focusing on specific parts of a candidate answer for a question; their experimental results showed that the proposed model outperforms existing approaches. Similarly, in machine translation, the attention mechanism was first used in an encoder-decoder framework to optimize the features by letting the network revisit all parts of a source sentence instead of encoding all features of the sentence at once. Especially in text relation classification, Zhou et al. (2016) employed the attention mechanism in a bidirectional LSTM to optimize the features by automatically assigning more weight to the most important semantic features. Likewise, on the basis of CNN, Lin et al. (2016) used a sentence-level attention mechanism to optimize features by dynamically reducing the weights of noisy instances for distantly supervised relation extraction; their model achieved significant and consistent improvements over state-of-the-art methods. Besides, attention-based models have been applied to various areas such as semantic relation extraction (Geng et al. 2020), image classification (Mnih et al. 2014; Zhu et al. 2020), speech recognition (Chorowski et al. 2015), and image caption generation (Xu et al. 2015). In short, the attention mechanism is effective at extracting the most distinguishing features. Hence, in this paper, we apply the attention mechanism to outcome prediction so as to identify the features that have a decisive effect on the outcome of a case.

Preliminaries and Problem Statement
For ease of understanding, this section gives the background knowledge, some basic concepts, and a detailed statement of the outcome prediction problem.

Recurrent Neural Networks and Attention Mechanism
An Artificial Neural Network (ANN) is a mathematical model that simulates the way the human brain's nervous system processes complex information, based on the basic principles of biological neural networks. It is characterized by parallel distributed processing, high fault tolerance, intelligence, and self-learning ability (Hopfield 1982). Especially with the development of deep learning, neural networks are widely used for predictive classification. In general, a neural network consists of a layer of input cells, multiple layers of hidden cells, and a layer of output cells. Cells in each layer are connected by weighted connections to cells in the previous and following layers in various forms, depending on the architecture of the network. The output of each cell is a specific "activation" function of the weighted sum of its inputs.
For the outcome prediction problem, an on-going case consisting of a sequence of executed events can be viewed as the input of a neural network, while the outcome class label can be viewed as its output. In this way, we can construct a neural network with a specific structure, called a classifier, by training on executed traces with labeled outcome classes from an event log. In general, neural networks with special structures can memorize information from inputs at earlier moments (viewing the order of sequential data as different moments). The RNN is such a network: it is composed of a series of repeating modules, where each repeating module is a single hidden layer with typical activation functions such as tanh and sigmoid. The general RNN is shown in Fig. 1(a), and its variant with multiple inputs and a single output, which is more applicable to the outcome prediction problem, is shown in Fig. 1(b). Here, $x_t$, $y_t$, and $h_t$ denote the values of the input layer, the output layer, and the hidden layer at moment $t$, respectively, and $h_{t-1}$ records the state of the hidden layer at moment $t-1$. $W_{xh}(t)$, $W_{hh}(t)$, and $W_{oh}(t)$ denote the input-hidden, hidden-hidden, and hidden-output weight matrices, respectively. The state $h_t$ of the hidden layer and the output $y_t$ can be calculated by:

$$h_t = F_1(W_{xh}(t) \cdot x_t + W_{hh}(t) \cdot h_{t-1}) \qquad (1)$$

$$y_t = F_2(W_{oh}(t) \cdot h_t) \qquad (2)$$

where $F_1$ and $F_2$ represent the activation functions of the hidden layer and the output layer, respectively; usually, $F_1$ is the tanh function and $F_2$ is the softmax function. From these equations, we can see that input information from early moments becomes lossy as it is transmitted to the output layer through the recurrence $h_t = W_{hh}(t-1) \cdot h_{t-1}$. When the hidden state $h_0$ of the start moment is transmitted to moment $t$ as $h_t = W_{hh}^{t} \cdot h_0$, the matrix $W_{hh}$ is multiplied many times. If $W_{hh}$ admits an eigenvalue decomposition, we get $h_t = Q \Lambda^{t} Q^{T} \cdot h_0$. If an eigenvalue $\Lambda < 1$, the corresponding component of $h_t$ diminishes to 0 (gradient vanishing); if $\Lambda > 1$, it expands to infinity (gradient explosion). Hence, the RNN cannot maintain the memory of inputs that are far away from the output.
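For concreteness, the recurrence of Eqs. (1)-(2) can be sketched in a few lines of Python; this is a minimal illustration with bias terms omitted, not the implementation evaluated in this paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_oh):
    """One unfolded RNN step following Eqs. (1)-(2); biases omitted for brevity."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)        # F1 = tanh
    logits = W_oh @ h_t
    y_t = np.exp(logits) / np.exp(logits).sum()      # F2 = softmax
    return h_t, y_t
```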
As a variant of the RNN, the LSTM (Hochreiter and Schmidhuber 1997) can learn long-term dependencies by replacing the hidden layer with a memory cell $c_t$, which tackles the gradient vanishing issue. Fig. 2 shows a module of a general RNN and a memory cell of an LSTM, respectively. In this memory cell there are three gates: the input gate $i_t$, which determines how much information can flow into the cell; the forget gate $f_t$, which determines how much information is forgotten in the cell; and the output gate $o_t$, which determines how much information is output from the cell. Each gate represents a way to control the flow of information. In Fig. 2(b), $h_{t-1}$ and $c_{t-1}$ denote the output and the state of the memory cell at moment $t-1$, respectively. Based on them, we first take the input $x_t$ and $h_{t-1}$ into consideration and determine what information to throw away from the cell state with a sigmoid activation function (i.e., $\sigma$ in Fig. 2(b)). Then, we determine what information to store in the cell state via the input gate and the cell state update: the input gate $i_t$ decides which states are to be updated, and a new candidate value $g_t$ is obtained by a tanh activation of the input $x_t$ and $h_{t-1}$. Based on $i_t$ and $g_t$, we update the old cell state $c_{t-1}$ to a new cell state $c_t$ according to the information to be forgotten and the new candidate values. Finally, the information to be output is determined by the output gate, computed with a sigmoid activation, multiplied with the cell state $c_t$ passed through a tanh activation. In this way, the LSTM units keep the information of the previous state (i.e., $c_{t-1}$ and $h_{t-1}$) and memorize the extracted features of the current input $x_t$.
Thus, it can be used for sequential data prediction and usually achieves good performance.
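As an illustration, the memory-cell update described above can be sketched as follows; the dictionary-based weight names are hypothetical, and the sketch omits batching and framework-specific details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM memory-cell update; each gate sees the concatenation [x_t, h_prev]."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(W['f'] @ z + b['f'])   # forget gate: what to discard from c_prev
    i_t = sigmoid(W['i'] @ z + b['i'])   # input gate: which states to update
    g_t = np.tanh(W['g'] @ z + b['g'])   # candidate values for the cell state
    c_t = f_t * c_prev + i_t * g_t       # new cell state
    o_t = sigmoid(W['o'] @ z + b['o'])   # output gate: what to emit
    h_t = o_t * np.tanh(c_t)             # new hidden output
    return h_t, c_t
```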
Inspired by human cognition, the attention mechanism was first presented by Mnih et al. and applied to the image field (Mnih et al. 2014). It has a significant optimization effect on traditional neural networks. In general, human attention refers to the fact that humans do not regard all observed things as a whole but tend to selectively acquire the important parts according to their needs. Such a mechanism greatly improves the efficiency and accuracy of visual information processing. For example, when people observe an image, they do not actually take in the pixels at every position at once; instead, they focus on specific parts of the image according to their needs, processing the pixels of the focused part determined by previous learning and the current input rather than all pixels of the image. Thus the attention mechanism can extract the most critical information, and it has gained popularity in DNNs, especially RNNs and CNNs (Wu et al. 2018). In general, the features obtained from the input of a neural network have different influences on the output. The attention mechanism assigns different attention (weights) to these features (local parts) according to their effect on the output and then composes a global representation of the features. The higher the weight of a feature, the more important it is for the output. Therefore, using the attention mechanism gives a neural network the ability to distinguish the significance of different features and thus make more accurate judgments.

Basic Concepts
For a business process, the executed process instances of an event log are required for the outcome prediction of an on-going case. The event log is composed of many completed cases, each of which consists of a series of event records. In general, each event record has several attributes: the event class attribute (i.e., activity name) specifies which activity the event refers to, the timestamp attribute specifies when the event starts and ends, and the case ID attribute specifies which case of the process generated the event. Usually, these three attributes exist in all event records. In addition, event records may have other attributes. If the value may change from event to event, such an attribute is called an event attribute. For instance, the amount of a bill in an order-payment process can be recorded as an event attribute of the activity Create Bill. In contrast, a case attribute is an attribute whose value is shared by all events in a case. For example, in the order-payment process, the customer ID is a typical case attribute because all events in a case share it; in other words, its value is identical for all events of the case. Different from an event attribute, such an attribute is static and its value does not change throughout the lifetime of a case.
Definition 3.1 (Event, Attribute). An event (record) $e$ can be represented as a tuple $e = (a, c, t_{start}, t_{end}, d_1, \ldots, d_m)$, in which $a$ is the process activity attribute related to this event, $c$ is the case ID attribute, $t_{start}$ and $t_{end}$ are the start and end timestamp attributes, and $d_1, \ldots, d_m$ is a list of additional attributes (where $\forall i \in [1, m]$, $d_i \in D_i$ with $D_i$ being the domain of such an attribute value). Here, $a$, $t_{start}$, and $t_{end}$ are event attributes and $c$ is a case attribute. For an event log, the collection of all events is denoted as $A$.
One execution of a process is usually called process instance or case, which consists of a series of events. For a certain process instance, the sequence of involved events can form a trace.
Definition 3.2 (Trace, Prefix trace). Each case corresponds to a trace consisting of a non-empty finite sequence of events $\sigma = \langle e_1, e_2, \ldots, e_{|\sigma|} \rangle$ such that $\forall i, j \in [1, |\sigma|]$: $e_i \in A$, $e_j \in A$, and $e_i.c = e_j.c$. For an event log, the collection of all traces is denoted as $S$. For a trace $\sigma$, a prefix trace of length $l$ ($l \leq |\sigma|$) is defined as $\sigma^l = \langle e_1, e_2, \ldots, e_l \rangle$, which comprises the first $l$ executed events of this case. In an event log, all prefix traces of length $l$ are collected in $S^l$.
The outcome-oriented prediction for a given running case aims at predicting its outcome class, which indicates the final outcome according to some business goals. In order to construct a classifier (i.e., classification model) from the event log of a business process, each completed case needs to be labeled with its outcome class firstly.
Definition 3.3 (Outcome class labeling). A single outcome class label $y(\sigma)$ with domain $\{0, 1\}$ can be assigned to a trace $\sigma$ for binary classification, in which 1 denotes that the final outcome of the case is consistent with the goal of the business process, while 0 denotes the opposite.
After labeling the outcome class, it is necessary to encode these historical traces of an event log for training a classification model automatically.
Definition 3.4 (Event encoding, Trace encoding). Event encoding $f: A \to \mathbb{R}^p$ maps the attribute values of each event $e$ into a vector of fixed length, in which $p$ denotes the total dimension of the encoded vectors of all attribute values of an event. Based on it, trace encoding $g: S \to X^{|S| \times p}$ maps a trace $\sigma$ into a matrix, where $|S|$ denotes the maximum length of a trace. In particular, zero-padding is applied when the length of a trace is less than this maximum.
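A minimal sketch of these two mappings follows; `encode_event` (the event encoding $f$) and its helpers are hypothetical illustrations, with numerical attributes normalized and categorical ones one-hot encoded as described in Section 4.

```python
import numpy as np

def one_hot(value, vocabulary):
    """One-hot encode a categorical attribute value over a fixed vocabulary."""
    vec = np.zeros(len(vocabulary), dtype=np.float32)
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_trace(trace, encode_event, p, max_len):
    """Trace encoding g: map a trace to a (max_len x p) matrix, zero-padding
    traces shorter than max_len; encode_event implements f: A -> R^p."""
    X = np.zeros((max_len, p), dtype=np.float32)
    for i, event in enumerate(trace[:max_len]):
        X[i] = encode_event(event)
    return X
```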
After case encoding, we take these matrices as inputs of neural networks so as to construct a classification model regarding the outcome prediction automatically. Actually, the essence of a classification model reveals a mapping relationship between the encoded vector of a case and its outcome class label.
Definition 3.5 (Prediction model). A prediction model (i.e., classification model or classifier) $y: S \to \{0, 1\}$, where each (prefix) trace $\sigma \in S$ is encoded event-wise by $f: A \to \mathbb{R}^p$, estimates the outcome class label of an encoded (prefix) trace; 1 denotes that the final outcome is consistent with the goal of the business process, while 0 denotes the opposite.
According to the constructed classification model, the outcome class of an on-going case can be predicted by taking as input the encoded matrix of the case, which consists of the encoded vectors of its executed events plus zero-padded events.

Problem Definition
In this paper, the problem to be solved is the outcome prediction of a case at run time, which involves training a classifier from the historical event log of a business process in the offline phase. In this phase, the executed cases of an event log, labeled with their outcome classes, are required. Based on the constructed classifier, the outcome class label of a running case can then be predicted in the online phase. In other words, the problem can be formalized as follows.
As shown in Fig. 3, the input of this problem is an event log composed of historical cases $\sigma_1, \sigma_2, \ldots, \sigma_s$ with labeled outcome classes such as Yes (denoted as 1) and No (denoted as 0). Based on this labeled log, a classification model can be trained by neural networks in the offline phase. Then, in the online phase, the outcome prediction, i.e., the outcome class Yes or No, of a given case $\sigma'$ can be determined based on this model as well as the executed part of $\sigma'$.

The Approaches
The key to predicting the outcome of an on-going case is to construct an effective classifier based on the historical cases in the event log. Such a classifier reveals the relationship between a case and its outcome class, and it requires the ability to learn long-term dependencies within a sequence, because each historical case used in training is a sequence of events. Thus, we can employ a neural network based on LSTM to obtain a classifier for the outcome prediction problem. The entire process is shown in Algorithm 1.
Algorithm 1 contains two parts: the offline phase, in which a classifier is trained from the historical cases of an event log (lines 1-8), and the online phase, in which the outcome of an on-going case is predicted with this classifier (line 9). In the offline phase, starting from random initial values for all parameters of the neural network (model), we compute the output of the model for each trace (case) in the event log and measure the loss between this output and the labeled outcome class (lines 1-3). Based on this loss, we update all the weights and biases of the model by using the error Back-Propagation (BP) algorithm to propagate the error through the network (line 4). These operations are repeated for each case until the end condition is satisfied (lines 5-6). Finally, we obtain a classifier for the event log based on the finally determined parameters. In the online phase, the outcome class of an on-going case is predicted from the executed part of this case.

Offline Training
As described above, the LSTM network can automatically capture the decisive hidden features from the historical cases in an event log. During the offline phase of training a classifier, several approaches based on the original LSTM network are available for discovering the relationship between cases and outcome classes.

Algorithm 1: Outcome prediction with neural networks
Input: the outcome-class-labeled event log $L = \{\langle \sigma_1, y_1 \rangle, \langle \sigma_2, y_2 \rangle, \cdots, \langle \sigma_s, y_s \rangle\}$ ($y_t \in \{0, 1\}$); an on-going case $\sigma'$
Output: the predicted outcome class $\hat{y}'$ of case $\sigma'$
1: Randomly initialize all weights $W$ and biases $b$ of a neural network;
2: for $t = 1, 2, \cdots, s$ do
3:    Compute the predicted outcome $\hat{y}_t$ of $\sigma_t$ based on the current values of $W$, $b$ and Eq. (10), and measure the loss function $Loss(\hat{y}_t, y_t)$ (Eq. (11));
4:    Update all $W$ and $b$ by back-propagating this loss;
5: end for
6: Repeat lines 2-5 until the end condition is satisfied;
7: Construct the classifier from the finally determined $W$ and $b$;
8: return the classifier;
9: Predict and return the outcome class $\hat{y}'$ of the on-going case $\sigma'$ with this classifier.

Thus, in this paper, we first propose an approach based on the original LSTM network, called LSTM, to construct a classifier for the outcome prediction problem. However, the LSTM network only considers the relationship between the preceding events and the current event, and ignores the events that occur after it when extracting features. Moreover, the outcome of a case is related not only to the events of this case but also to the context information of these events, especially for complex processes; usually, the more complex a business process is, the more complex the decisive factors of its final outcome are. Furthermore, the original LSTM network treats all extracted features as having the same influence on the outcome when transforming them into the classification result, whereas in fact each feature extracted from the cases influences the outcome differently. Therefore, we make two enhancements to the original LSTM network.
On the one hand, in order to capture more context features, we propose an approach called Bi-LSTM, based on an adapted bidirectional LSTM network that combines the original LSTM network in forward and backward directions. On the other hand, in order to distinguish and optimize the extracted features, we propose an approach called Att-LSTM, which adds the attention mechanism to the original LSTM network. In addition, on the basis of these enhancements, we propose a hybrid approach called Att-Bi-LSTM that combines the advantages of the bidirectional LSTM network and the attention mechanism; the resulting classifier can not only capture rich features but also optimize them further. The four approaches mentioned above can all be used to construct an effective model for predicting the outcome of a process. Each of them can be divided into three parts: the vectorization of the input events of a case, the feature extraction from these events, and the classification of the output. The details are as follows.

LSTM approach
Here, we describe how to use the original LSTM network to train a classifier from an event log, taking a completed case in event log $L$ as an example. First, the vectorized representations of executed cases are obtained by encoding the attributes of their events in different ways according to the attribute value types. Then, the LSTM network is used to extract key features from these encoded cases, exploiting the fact that the outcome of a case is determined by certain events and their attributes. Finally, the probability of the outcome class is calculated from the extracted feature vectors. The architecture of the LSTM approach for process outcome prediction is shown in Fig. 4.
Input layer. In this layer, an event log $L = \{\sigma_1, \sigma_2, \ldots, \sigma_s\}$ with $s$ cases is viewed as the input of the LSTM neural network for training a classifier. The $t$-th trace in $L$ is denoted as $\sigma_t = \langle e_{t1}, e_{t2}, \ldots, e_{tn} \rangle$ ($n = |\sigma_t|$), in which $e_{ti}$ ($i \in \{1, 2, \cdots, n\}$) is the $i$-th event (record) of this trace. In addition, each trace is labeled with its outcome class according to the business goal. Each such labeled trace serves as one training sample for the classifier.
Encoding Layer. In this layer, all event attributes in a trace are encoded according to the type of attribute value. If the type is numerical, the attribute value is normalized according to the range of all possible values of this attribute in event log $L$. If the type is categorical, one-hot encoding is used to convert the value into a vector of 0s and 1s. According to these rules, we obtain the vector of each event in all traces by mapping it into a $p$-dimensional vector $x_{ti} = [x_{ti,1}, x_{ti,2}, \ldots, x_{ti,p}]$, in which $p$ is the total dimension of the encoded vector of all attributes of an event, determined by the recorded attributes of the event log. Afterwards, the encoded sequence of vectors $\langle x_{t1}, x_{t2}, \cdots, x_{tn} \rangle$ is fed into the next layer for feature extraction.
Feature Extraction Layer (LSTM Layer). For the $t$-th trace in the event log, the input of this layer is the sequence of event vectors $\langle x_{t1}, x_{t2}, \ldots, x_{tn} \rangle$. We then obtain the outputs $\langle h_{t1}, h_{t2}, \ldots, h_{tn} \rangle$ of this (hidden) layer for each event, in which the output $h_{ti}$ of the $i$-th event is calculated from the output state $h_{t,i-1}$ of the previous event and the current input event vector $x_{ti}$. The obtained hidden feature vector $h_{ti}$ denotes the learned information between the current event and its previous events, which can be defined as:

$$h_{ti} = \mathrm{LSTM}(x_{ti}, h_{t,i-1}) \qquad (3)$$

Each cell of this layer is equipped with three gates to control the flow of information, as shown in Fig. 2(b): the input gate $input_{ti}$ (similar to $i_t$ in Fig. 2(b)), which determines how much information flows into the memory cell; the forget gate $forget_{ti}$ (similar to $f_t$ in Fig. 2(b)), which determines how much information is forgotten; and the output gate $output_{ti}$ (similar to $o_t$ in Fig. 2(b)), which determines how much information is output from the current cell state $c_{ti}$ (similar to $c_t$ in Fig. 2(b)). For each cell of this layer, let $h_{t,i-1}$ and $c_{t,i-1}$ be the output vector and cell state from the prior unfolded cell on the same level. To extract the features of the current event $e_{ti}$, we take the input $x_{ti}$ of the current state and the previous state information $h_{t,i-1}$ and $c_{t,i-1}$ into consideration. First, we determine what information (i.e., which features) should be thrown away from $x_{ti}$ and $h_{t,i-1}$ by the sigmoid activation function $F_1$:

$$forget_{ti} = F_1(W_f \cdot [h_{t,i-1}, x_{ti}] + b_f) \qquad (4)$$

where $W_f$ and $b_f$ are trainable weights and biases. Then, we determine the information to be stored by using the input gate and an update operation. The input gate decides which state is to be updated by Eq. (5):

$$input_{ti} = F_1(W_i \cdot [h_{t,i-1}, x_{ti}] + b_i) \qquad (5)$$

where $F_1$ is the sigmoid activation function, and $W_i$ and $b_i$ are also parameters to be trained. Afterwards, a new candidate value $g_{ti}$ (similar to $g_t$ in Fig. 2(b)) is computed by:

$$g_{ti} = F_2(W_g \cdot [h_{t,i-1}, x_{ti}] + b_g) \qquad (6)$$

where $F_2$ is the tanh activation function, and $W_g$ and $b_g$ are the corresponding parameters to be trained. Once $input_{ti}$ and $g_{ti}$ are obtained, we update the previous state $c_{t,i-1}$ to a new state $c_{ti}$ by:

$$c_{ti} = forget_{ti} \odot c_{t,i-1} + input_{ti} \odot g_{ti} \qquad (7)$$

where the left term represents the information to be forgotten from the previous state $c_{t,i-1}$, and the right term represents the new information added to the current cell state $c_{ti}$. At last, which information of this cell state flows to the final output is computed by:

$$output_{ti} = F_1(W_o \cdot [h_{t,i-1}, x_{ti}] + b_o) \qquad (8)$$

where $F_1$ is the sigmoid activation function, and $W_o$ and $b_o$ are trainable parameters. Then the final output of this cell, $o_{ti}$ to the following layer and $h_{ti}$ to the subsequent cell, is obtained by passing the information of $c_{ti}$ through the tanh activation function and multiplying it with the output gate:

$$o_{ti} = h_{ti} = output_{ti} \odot F_2(c_{ti}) \qquad (9)$$

Output Layer. Let $o_t = o_{t1} \oplus o_{t2} \oplus \cdots \oplus o_{tn}$ ($\oplus$ represents vector concatenation) be the input of the sigmoid function to estimate the outcome class of trace $\sigma_t$. The computed probability $\hat{y}_t$ of outcome class 1 (i.e., positive) for this trace is calculated as Eq. (10):

$$\hat{y}_t = F_1(W_c \cdot o_t + b_c) \qquad (10)$$

where $W_c$ and $b_c$ are the trainable parameters in this layer and the value of $\hat{y}_t$ is in the range (0, 1).
At last, we use a binary cross-entropy function to measure the loss of this classifier based on the labeled outcome class $y_t$ and the predicted probability $\hat{y}_t$ (i.e., the actual output of the network) for trace $\sigma_t$:

$$Loss(\hat{y}_t, y_t) = -\left[ y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \right] \qquad (11)$$
In order to determine the parameters that minimize the loss function, we use different optimized BP (Back-Propagation) algorithms in training, such as RMSProp (Root Mean Square Prop) (Tieleman and Hinton 2012) and Adam (Adaptive Moment Estimation) (Kingma and Ba 2015). Based on these optimized gradient descent algorithms, the above parameters, all $W$ and $b$ in the LSTM network, are adjusted by propagating the loss $Loss(\hat{y}_t, y_t)$ (i.e., the error). In this way, all cases in event log $L$ are processed until a set of parameters is determined that minimizes the total loss over these cases. A classification model for the event log is then constructed from the LSTM network with these determined parameters.
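Putting the layers together, the LSTM approach can be sketched in Keras as follows; the layer sizes, optimizer choice, and the use of the final hidden state in place of the full concatenation of Eq. (10) are illustrative assumptions, not the tuned configuration used in this paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

max_len, p = 40, 64                         # assumed maximum trace length and event-vector size

model = models.Sequential([
    layers.Input(shape=(max_len, p)),       # encoded, zero-padded traces (Encoding Layer)
    layers.Masking(mask_value=0.0),         # skip zero-padded events
    layers.LSTM(100),                       # Feature Extraction Layer, Eqs. (3)-(9)
    layers.Dense(1, activation='sigmoid'),  # Output Layer: probability as in Eq. (10)
])
model.compile(optimizer='rmsprop',          # or 'adam', as discussed above
              loss='binary_crossentropy',   # Eq. (11)
              metrics=['accuracy'])
# model.fit(X_train, y_train, epochs=20, batch_size=32)   # offline training
```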

Bi-LSTM approach
In different business processes, the relationship between cases and their outcomes can be diverse. In general, the outcome of a case is related not only to the events in this case but also to their context, especially for long cases. If the contextual information between these events can be captured, the extracted features can be further enriched to make the trained classifier more effective. As the first enhancement of the LSTM approach, Bi-LSTM adds a backward propagation layer to the original LSTM network. For the event sequence of a case, the forward propagation layer (i.e., the LSTM Layer of the LSTM network) captures the relationship between the preceding events and the current event, while the backward propagation layer captures the relationship between the current event and the subsequent events. To extract the hidden features of the events in a case with Bi-LSTM, we obtain the contextual information of each event by combining the outputs of the forward and backward hidden layers, into which the events are fed forward and backward simultaneously. Similar to LSTM, the Bi-LSTM approach also has four layers, as shown in Fig. 5. The difference between them lies in the feature extraction layer (LSTM Layer vs. BiLSTM Layer); here, we only describe the BiLSTM Layer.
Feature Extraction Layer (BiLSTM Layer). In this (hidden) layer, each unit combines the original LSTM network in two directions (forward and backward), as shown in Fig. 5. Taking the sequence of event vectors $\langle x_{t1}, x_{t2}, \ldots, x_{tn} \rangle$ as the input of this layer, each unit computes a forward hidden state from the previous event $e_{t,i-1}$ and a backward hidden state from the next event $e_{t,i+1}$, respectively. The obtained vector $o_{ti} \in \mathbb{R}^p$ for event $e_{ti}$ denotes the learned hidden information (i.e., features) based on the current event $e_{ti}$ and the hidden information from the neighboring cells on the same level, which is calculated by:

$$o_{ti} = \overrightarrow{h_{ti}} \oplus \overleftarrow{h_{ti}}$$

where $\overrightarrow{h_{ti}}$ denotes the output of the cell based on the vector $x_{ti}$ and the output $\overrightarrow{h_{t,i-1}}$ of the previous cell in the forward direction, and $\overleftarrow{h_{ti}}$ denotes the output of the cell based on the vector $x_{ti}$ and the output $\overleftarrow{h_{t,i+1}}$ of the previous cell in the backward direction.
$\overrightarrow{h_{ti}}$ and $\overleftarrow{h_{ti}}$ can be calculated according to Eqs. (4)-(9). After that, the obtained $\langle o_{t1}, o_{t2}, \cdots, o_{tn} \rangle$ serves as the input of the Output Layer for calculating the estimated probability. Thus, a classification model can be obtained based on the bidirectional LSTM network with the determined parameters.
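In Keras, this bidirectional combination is available as a wrapper around the LSTM layer; a one-line sketch, replacing the feature extraction layer of the earlier model (the number of units is an illustrative assumption):

```python
from tensorflow.keras import layers

# Concatenates the forward and backward hidden states of two LSTMs,
# mirroring the BiLSTM Layer described above.
bilstm_layer = layers.Bidirectional(layers.LSTM(100))
```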

Att-LSTM approach
The outcome of a case is related not only to the features that have a decisive effect on the outcome, but also to the degree of their influence. Although the LSTM approach can capture the decisive features from historical cases, it cannot determine the extent of their influence.
As the second enhancement of the LSTM approach, the Att-LSTM approach adds an attention mechanism layer on top of the LSTM network, which distinguishes the weight of each feature. The architecture of the Att-LSTM approach is shown in Fig. 6; it has five layers, one more (the Attention Layer) than the LSTM approach. Besides, the Output Layer also differs from that of the LSTM approach. Here, we only describe the Attention Layer and the Output Layer.

Attention Layer. After extracting features in the previous hidden layer (LSTM Layer), we obtain the output $o_{ti}$ for each event vector $x_{ti}$. For each hidden output $o_{ti}$, we first compute its attention score $u_{ti}$ for the outcome class. Then, we translate the attention scores into a probability distribution by the softmax activation function. Finally, a weighted sum of these hidden outputs, i.e., a global representation of these features, is obtained based on this distribution. Fig. 7 shows how the attention mechanism works. On the basis of the outputs $\langle o_{t1}, o_{t2}, \ldots, o_{tn} \rangle$ from the previous hidden layer, the final global representation $v_t$ can be computed with the attention mechanism. Here, $o_t \in \mathbb{R}^{p \times n}$ denotes the matrix $[o_{t1}, o_{t2}, \ldots, o_{tn}]$, in which $p$ is the dimension of each vector $o_{ti}$ and $n$ is the number of events of case $\sigma_t$. In this layer, each vector $o_{ti}$ is nonlinearly transformed to its implicit representation $u_{ti}$ by:

$$u_{ti} = \tanh(W_h \cdot o_{ti} + b_h)$$

where $u_{ti}$ denotes the attention score of feature $o_{ti}$, and $W_h$ and $b_h$ are the parameters of this layer. Similarly, we obtain the matrix $u_t = [u_{t1}, u_{t2}, \cdots, u_{tn}]$ corresponding to the hidden feature output $o_t = [o_{t1}, o_{t2}, \cdots, o_{tn}]$. To obtain the weight of each feature, we then use the softmax activation function to normalize the dot product of $u_t$ and the initialized attention distribution vector $\mu_t$:

$$\alpha_t = \mathrm{softmax}(\mu_t^T u_t)$$

Here, $\mu_t$ is a parameter vector obtained by training ($\mu_t^T$ is its transpose), and each $\alpha_{ti} \in \alpha_t$ denotes the attention weight of the feature vector $o_{ti}$, which is computed by:

$$\alpha_{ti} = \frac{\exp(\mu_t^T u_{ti})}{\sum_{j=1}^{n} \exp(\mu_t^T u_{tj})}$$

Finally, as shown in Fig. 7, the output $v_t$ denotes the global representation of the features $o_t$, computed as their weighted sum:

$$v_t = \sum_{i=1}^{n} \alpha_{ti} \, o_{ti}$$

The obtained $v_t$ involves the features that are significantly correlated with the outcome class (label) of trace $\sigma_t$.

Output Layer. The obtained $v_t$ serves as the input of the sigmoid function to estimate the outcome class (label) of trace $\sigma_t$. The estimated probability $\hat{y}_t$ of outcome class 1 for trace $\sigma_t$ is calculated as follows:

$$\hat{y}_t = F_1(W_c \cdot v_t + b_c)$$

where $W_c$ and $b_c$ are the trainable parameters in this layer and the value of $\hat{y}_t$ is in the range (0, 1). Similarly, we also use the binary cross-entropy function to measure the loss of this model and then use the optimized BP algorithms to adjust the relevant parameters.
Finally, a classification model can be obtained based on the determined parameters of this Att-LSTM network.
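The Attention Layer described above can be sketched as a custom Keras layer; this is an illustrative implementation of the scoring, softmax weighting, and weighted sum, with padding positions left unmasked for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Attention(layers.Layer):
    """Score each hidden output o_ti, softmax-normalize the scores into
    attention weights alpha_ti, and return the weighted sum v_t."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W_h = self.add_weight(name='W_h', shape=(d, d), initializer='glorot_uniform')
        self.b_h = self.add_weight(name='b_h', shape=(d,), initializer='zeros')
        self.mu = self.add_weight(name='mu', shape=(d, 1), initializer='glorot_uniform')

    def call(self, o):                                                   # o: (batch, n, d)
        u = tf.tanh(tf.einsum('bnd,de->bne', o, self.W_h) + self.b_h)    # scores u_ti
        alpha = tf.nn.softmax(tf.einsum('bnd,dk->bnk', u, self.mu), axis=1)  # weights alpha_ti
        return tf.reduce_sum(alpha * o, axis=1)                          # v_t = sum_i alpha_ti * o_ti
```

Note that the preceding LSTM layer must return the full sequence of hidden outputs (return_sequences=True in Keras) so that the attention layer can weight every event.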

Att-Bi-LSTM approach
In order to construct a more effective classifier, we combine the above two enhancements and propose the Att-Bi-LSTM approach, which can not only capture more features but also distinguish the extent of their influence on the outcome of a case. As shown in Fig. 8, Att-Bi-LSTM also has five layers, in which the BiLSTM Layer is the same as that of the Bi-LSTM approach, the Attention Layer and the Output Layer are the same as those of the Att-LSTM approach, and the first two layers are identical to those of the LSTM (Bi-LSTM or Att-LSTM) approach. As before, we first obtain the predicted probability $\hat{y}_t$ of the outcome class for trace $\sigma_t$, and then use the binary cross-entropy function to measure the loss between the actual output $\hat{y}_t$ of the network and the expected (labeled) output $y_t$. By updating all involved parameters with BP algorithms over all traces, the classifier is obtained by combining this network with the finally determined parameters.
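Combining the pieces, the Att-Bi-LSTM architecture can be sketched as follows, reusing the Attention layer above; sizes are illustrative and padding handling is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

max_len, p = 40, 64                                    # assumed input dimensions
inputs = layers.Input(shape=(max_len, p))              # encoded traces
h = layers.Bidirectional(
    layers.LSTM(100, return_sequences=True))(inputs)   # BiLSTM Layer
v = Attention()(h)                                     # Attention Layer (sketched above)
outputs = layers.Dense(1, activation='sigmoid')(v)     # Output Layer
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
```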

Online Predicting
To predict the outcome of an on-going case, we use the classifier trained in the offline phase on the executed events of this case. For an on-going case, there are some already executed events, each with its attributes. Taking the encoded vectors of these events and attributes as the input of the trained classifier, the probability of the outcome class of this running case can be calculated. In general, an on-going case keeps changing as new events arrive during its execution. Therefore, the predicted results (i.e., the probabilities of a positive outcome) at different stages of an on-going case are not completely consistent, because the inputs to the classifier differ. Under such circumstances, prefix traces are used to simulate an executing case. In particular, in the evaluation experiments of Section 5, we first extract the distinct cases from an event log and then obtain all their prefix traces of different lengths. Taking these prefix traces as the input of the classifiers trained by the different approaches, we predict the outcome class to evaluate the prediction accuracy of these approaches. For each distinct case in an event log, the accuracy of its prediction can be calculated from the predictive results of the extracted prefix traces of different lengths. The details are discussed in the next section.
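A small sketch of this online phase follows; `encode_trace` and `encode_event` refer to the hypothetical encoding sketch in Section 3, and `model` to a classifier trained as above.

```python
import numpy as np

def prefix_traces(trace, min_len=1):
    """All prefix traces of a completed case, used to simulate a running case."""
    return [trace[:l] for l in range(min_len, len(trace) + 1)]

# Online prediction: encode the executed events of the running case with the
# same encoding as in training, zero-pad to max_len, and query the classifier.
X = encode_trace(executed_events, encode_event, p, max_len)
prob_positive = float(model.predict(X[None, ...])[0, 0])   # estimated P(outcome = 1)
predicted_class = int(prob_positive >= 0.5)
```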

Experimental Settings
In this section, to evaluate the effectiveness and efficiency of our proposed approaches, i.e., Bi-LSTM, Att-LSTM, and Att-Bi-LSTM, we compare them with LSTM and the conventional machine learning methods XGBoost (Friedman 2001) and RF (Random Forest) (Bishop 2006). RF and XGBoost are selected because they are more effective than other traditional learning methods (Fernández-Delgado et al. 2014). Moreover, RF-based and XGBoost-based respectively denote families of approaches that use the Random Forest and XGBoost algorithms to build one or more classifiers with different encoding methods. To make the comparison more convincing, we apply the above six approaches to twelve real datasets from different business processes. Besides, we select prediction accuracy, earliness, stability, and time performance for a comprehensive comparison. The approaches are implemented in Python using the scikit-learn library and run on a server with six E5-2620 2.00GHz cores, 64GB of memory, and Windows 7.

Datasets
The datasets used in our experiments originate from six real-life event logs: BPIC 2012, BPIC 2017, Sepsis cases, Production, Road Traffic Fines, and Hospital Billing, all from the public 4TU Centre for Research Data. They are selected because they contain both case attributes (static) and event attributes (dynamic). Moreover, the outcomes of the cases in these logs can be derived easily, and the order between the events in these logs is clearly defined without ambiguity.
BPIC 2012. This event log was provided by the Business Process Intelligence Challenge in 2012 (BPIC 2012) and records the execution history of a loan application process in a Dutch financial institution. To facilitate classification, we define the class labels based on the final outcome of each case, i.e., whether the loan application is accepted, rejected, or cancelled. As such, the log gives rise to a multi-class classification problem. To facilitate comparison with existing methods, we transform this event log into three event logs, bpic2012 accepted, bpic2012 declined, and bpic2012 cancelled, each posing an independent binary classification problem.
BPIC 2017. This event log originates from the same loan application process in BPIC 2012 and then has been improved by cleaning noise data. Similarly, we define three separate binary classification problems with event logs of bpic2017 accepted, bpic2017 declined and bpic2017 cancelled from the event log BPIC 2017.

Sepsis cases. This log comes from a Dutch hospital and records the diagnostic trajectories of patients who enter the hospital with life-threatening sepsis. According to the patients' conditions after leaving, such as returning to the emergency department within 28 days, being admitted to intensive care, or being discharged from the hospital, the log is split into three event logs, sepsis cases 1, sepsis cases 2, and sepsis cases 3, posing three separate binary classification problems. Thus, we obtain three datasets from the Sepsis cases log.
Production. This log is derived from a manufacturing process and records information on the activities, performers, and/or machines involved in producing goods. The outcome of each case is labeled according to whether rejected products exist: if there are no rejected products in the manufacturing process, the case is labeled as normal; otherwise, it is labeled as abnormal.
Road Traffic Fines. The log comes from a police station in Italy. It mainly includes activities of paying fines as well as some information related to individual cases, such as the reason and the total amount paid for each fine. In terms of this log, we label these cases according to whether the fine is repaid in full or is sent for credit collection.
Hospital Billing. This log is derived from the ERP system of a hospital, where each case represents an execution of a billing procedure for medical services. We define the two class labels based on whether or not the case was re-edited.
Thus, we have collected 12 datasets in total.

Preprocessing
The preprocessing includes deleting noise data, filling in attribute values so as to align all attributes of historical cases, encoding and extending additional attributes, reducing the feature space to avoid dimension explosion, and under-sampling against sample imbalance, as described below.
Delete noise records. All cases are checked for completion by comparing the last event of each case with a specific end event. If a case does not end with this specific event, it is viewed as an incomplete case and is deleted. In addition, some other methods are used to check whether a historical case was completely executed, by aligning the cases with the corresponding process model.
Fill attribute values. In general, there is very little missing data because the event log is generated automatically by information systems. However, missing values may occur for the following two reasons. First, an event usually only records an attribute when its value changes. Therefore, to determine the value of such an attribute when an event occurs, we search the (prefix) trace for the latest event where the value of the attribute in question changed (if no change point is found, we search for the first event). For instance, the resource name involved in the execution of an activity is usually only logged when the resource has changed due to a previously occurring event. In this case, we search for the most recent previous event under the same circumstances and record its resource as the resource of the current event. Second, different activities can produce different types of data. For example, in a loan application process, the order information of a customer only exists after the order event is created. Similarly, during ticket processing, the payment amount is only available after a payment event has occurred. The absence of such values is called legal missing data or missing data out of range (Schafer and Graham 2002). In this experiment, padding is done by adding additional attributes: if an event does not have an attribute, the attribute value is set to 0.
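A pandas sketch of this filling rule follows; the file and column names are hypothetical.

```python
import pandas as pd

# Within each case, carry the last observed attribute value forward, then pad
# attributes that have not occurred yet with 0.
log = pd.read_csv('event_log.csv')
log = log.sort_values(['case_id', 'timestamp'])
log['resource'] = log.groupby('case_id')['resource'].ffill()
log = log.fillna(0)
```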
Extend additional attributes. To improve the prediction accuracy of a classification model, we derive new attributes from the original ones in these datasets. From the timestamp attribute, we compute event-level attributes such as hour, weekday, month, time since case start, and time since last event. These inter-case attributes help predict whether the outcome of a running process will be violated, following (Senderovich et al. 2017; Conforti et al. 2015). Besides, the waiting time of an event depends strongly on all concurrently active cases, and it can greatly affect the outcome when the outcome is determined by waiting time or customer satisfaction. Therefore, we add an attribute open cases that records the number of on-going cases at the moment an event occurs. In addition, we add a case attribute event number that records how many events have been performed in the case up to the current event.
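The derivation of these attributes can be sketched as follows. The column names case_id and timestamp are assumptions; open_cases counts, for each event, the cases that have already started but not yet finished at its timestamp.

```python
import numpy as np
import pandas as pd

def extend_attributes(log: pd.DataFrame) -> pd.DataFrame:
    """Derive the temporal attributes described above from the raw timestamp."""
    log = log.sort_values(["case_id", "timestamp"]).copy()
    ts = log["timestamp"]
    log["hour"], log["weekday"], log["month"] = ts.dt.hour, ts.dt.weekday, ts.dt.month
    grp = log.groupby("case_id")["timestamp"]
    log["time_since_case_start"] = (ts - grp.transform("min")).dt.total_seconds()
    log["time_since_last_event"] = grp.diff().dt.total_seconds().fillna(0)
    log["event_number"] = log.groupby("case_id").cumcount() + 1
    # open_cases: cases already started but not yet finished at this timestamp
    bounds = log.groupby("case_id")["timestamp"].agg(["min", "max"])
    starts, ends = np.sort(bounds["min"].values), np.sort(bounds["max"].values)
    log["open_cases"] = (np.searchsorted(starts, ts.values, side="right")
                         - np.searchsorted(ends, ts.values, side="left"))
    return log
```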
Reduce feature space. To avoid dimension explosion after attribute encoding, we compress the value domains of some attributes before training a classifier. For instance, for a categorical event attribute with many possible values, we divide the values into several intervals. In addition, the sparse values of a categorical attribute are grouped together and labeled as other.
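As an illustration, grouping the sparse values of a categorical attribute might look like the following sketch, where the cutoff min_count is an assumed parameter.

```python
import pandas as pd

def group_rare_values(col: pd.Series, min_count: int = 10) -> pd.Series:
    """Replace sparse categorical values with a shared 'other' label."""
    counts = col.value_counts()
    rare = counts[counts < min_count].index
    return col.where(~col.isin(rare), "other")
```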
Under-sample for sample imbalance. Very long traces can decrease the performance of classifiers (Teinemaa et al. 2018). To avoid sample imbalance, we truncate traces that exceed a truncated length. For each dataset, the truncated length is determined independently as the length by which 90% of the traces in the minority outcome class (i.e., positive vs. negative) have already completed. In other words, we first choose the outcome class with fewer traces, sort its traces in ascending order of length, and take the length at the 90% point. The truncated parts are no longer used for training or evaluation, because the few very long traces would decrease the performance of classifiers. Table 2 gives the statistics of the datasets after preprocessing. As it indicates, the sizes of the datasets vary significantly, ranging from 220 (production) to 129,615 (traffic fines). The ratio of the positive class also differs: the most imbalanced dataset is hospital billing, where only 5% of cases are labeled as positive, whereas the classes are almost balanced in the production, bpic2012 accepted, bpic2012 cancelled, and traffic fines datasets. In terms of trace length, the most heterogeneous dataset is sepsis cases 3, where the longest case consists of 185 events but the shortest of only 4. After the under-sampling operation, we obtain the truncated length for each dataset. For the datasets originating from the BPIC 2012 and BPIC 2017 event logs, the truncated lengths are set to 40 and 20 events, respectively, because the prediction signal starts to converge around these lengths (Teinemaa et al. 2018).
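A sketch of how the truncated length could be derived per dataset; the case_id, activity, and label column names are assumptions.

```python
import pandas as pd

def truncated_length(log: pd.DataFrame) -> int:
    """Length by which 90% of the traces in the minority outcome class
    have already completed."""
    case_info = log.groupby("case_id").agg(length=("activity", "size"),
                                           label=("label", "first"))
    minority = case_info["label"].value_counts().idxmin()
    lengths = case_info.loc[case_info["label"] == minority, "length"]
    return int(lengths.quantile(0.90))
```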

Evaluation metrics
In predictive process monitoring, a good prediction should be accurate in the early stage of an on-going case. In addition, the prediction should be made continuously for a running case as each new event arrives, so it is important to make stable and consistent predictions. Therefore, we select accuracy, earliness, and stability to measure the effectiveness of our proposed approaches, and execution time to measure their efficiency.
Accuracy. In general, the output of a classification model gives the probability of a class rather than a certain class. To determine the predicted class, a threshold has to be set manually first so that the probability can be converted into a class label: only when the predicted probability of a sample is greater than this threshold is the sample labeled as positive; otherwise, it is labeled as negative. Thus, the choice of threshold greatly affects the calculation of accuracy. Here, we use the Area Under the ROC Curve (AUC) as the measurement of prediction accuracy, following (Bradley 1997). Each point on the ROC curve represents a pair of FPR (False Positive Rate, on the X-axis) and TPR (True Positive Rate, on the Y-axis) for a given threshold. The AUC thus measures the quality of the classifier independently of any single threshold: even if the sample distribution is not uniform, the accuracy measured by AUC remains unbiased.
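For illustration, the AUC can be computed with scikit-learn without fixing any threshold; the labels and scores below are toy values.

```python
from sklearn.metrics import roc_auc_score

# y_true: 0/1 outcome labels; y_score: positive-class probabilities from the model.
# AUC is threshold-free, so no manual cut-off has to be chosen.
y_true  = [0, 0, 1, 1, 1]
y_score = [0.12, 0.48, 0.35, 0.81, 0.90]
print(roc_auc_score(y_true, y_score))  # 0.8333..., i.e. 5 of 6 pos/neg pairs ranked correctly
```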
Earliness. The outcome prediction of a running case is a continuous prediction problem, in which the same case passes through several predictive stages, one per incoming event. Hence, if the accuracy of a classification model reaches a certain level at an early stage, the model is considered to perform well. Inspired by (Leontjeva et al. 2016), we keep applying the trained classifiers of the above six approaches to the subsets of prefix traces with different lengths until the outcome prediction of a running case reaches an acceptable level of accuracy. The minimum prefix length at which the classifier reaches a specified accuracy threshold can then be used to measure its earliness, as indicated in Eq. (20):

$$\mathit{earliness} = \frac{100}{N} \sum_{i=1}^{N} \frac{l_i}{L} \tag{20}$$
where N is the number of different prefix lengths in the test set, L denotes the maximum length of the cases in the test set, and l_i indicates the length of the prefix traces at which the average AUC obtained from a classifier reaches a given threshold δ. In particular, if the AUC of the outcome prediction for a running case never reaches the threshold, we set its earliness to 100, i.e., the worst value. The smaller the value of earliness, the more predictive the prediction model is.

Stability. A classification model is considered stable if its outputs for successive prefix traces of the same case (i.e., the different prediction stages of the same case) are similar. Inspired by (Teinemaa et al. 2018), we define the temporal stability of a classification model as one minus the average absolute difference between any two consecutive prediction scores:

$$TS = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|\sigma_i| - 1} \sum_{t=2}^{|\sigma_i|} \left| \hat{y}_t^{\,i} - \hat{y}_{t-1}^{\,i} \right| \tag{21}$$

where N is the number of traces to be predicted, |σ_i| denotes the length of trace σ_i, and |ŷ_t^i − ŷ_{t−1}^i| gives the difference between two successive predictions, ŷ_t^i and ŷ_{t−1}^i, of σ_i. Here, ŷ_t^i denotes the prediction after the t-th event of the i-th case, and ŷ_{t−1}^i the prediction after its (t−1)-th event. As shown in Eq. (21), in order to eliminate the bias against long traces, this metric first averages the absolute differences between successive predictions within each case and then averages these values over all cases.
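A small numpy sketch of both metrics, following the reconstructed Eqs. (20) and (21); reach_lengths collects the l_i values and predictions collects one score sequence ŷ^i per trace.

```python
import numpy as np

def earliness(reach_lengths, max_length):
    """Eq. (20): mean relative prefix length at which the AUC threshold
    delta is first reached, scaled to [0, 100]."""
    return 100.0 * np.mean(np.asarray(reach_lengths) / max_length)

def temporal_stability(predictions):
    """Eq. (21): one minus the mean absolute difference between successive
    prediction scores of each trace, averaged over all traces."""
    diffs = [np.mean(np.abs(np.diff(p))) for p in predictions if len(p) > 1]
    return 1.0 - float(np.mean(diffs))
```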
Running time. A prediction is of no practical significance if the time required to make it approaches the remaining execution time of the running case. Here, we select two time metrics, i.e., offline time and online time, to evaluate the time performance of the above approaches. Offline time refers to the total time required to build a prediction model from a historical event log, and online time refers to the average time needed to predict the outcome of a running case.

Parameter settings
To simulate outcome prediction in real scenarios, we divide each dataset into an 80% training set and a 20% test set based on temporal order. In other words, all cases in an event log are sorted according to their start time; the first 80% are used for training and the remaining 20% for testing. At the same time, we remove cases whose events in the training set overlap in time with events in the test set, to avoid interference in the evaluation. In order to find the best performance of each approach under hyper-parameter optimization, each training set is further randomly divided into 80% training data and 20% validation data, both composed of historical cases. Specifically, we first train classifiers by applying the LSTM-based approaches (i.e., LSTM, Bi-LSTM, Att-LSTM, and Att-Bi-LSTM) to the training data with different parameter settings and then choose the setting that best fits the validation data. We determine the hyper-parameter values using the random search method (Friedman 2001) rather than grid search, because the latter is infeasible when many parameters are involved in the LSTM-based approaches (Teinemaa et al. 2018). The detailed hyper-parameters and the distributions used in optimization are shown in Table 3. For our neural-network-based approaches, the hyper-parameters mainly involve the number of hidden layers, the number of units in these layers, the initial learning rate, the batch size, the dropout rate, and the optimization algorithm used in gradient descent. Based on these, 16 parameter combinations are determined randomly for each approach, and the number of epochs for the LSTM-based approaches is fixed to 50. Similarly, for the RF and XGBoost approaches, the Tree-structured Parzen Estimator (TPE) algorithm (Bergstra et al. 2011) is used for each combination of a bucketing method and a sequence encoding method in each dataset. Among these parameter combinations, we choose the one with the highest AUC on the validation data. Besides, in order to calculate the earliness, we set the AUC threshold δ to 0.7.
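A sketch of randomly sampling the 16 parameter combinations; the search space below only mirrors the kinds of hyper-parameters listed above, not the concrete ranges of Table 3.

```python
import random

# Hypothetical search space; the actual value ranges are given in Table 3.
SPACE = {
    "hidden_layers": [1, 2, 3],
    "units":         [50, 100, 150, 200],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size":    [8, 16, 32, 64],
    "dropout":       [0.0, 0.1, 0.2],
    "optimizer":     ["adam", "rmsprop", "sgd"],
}

random.seed(42)
combinations = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(16)]
# Train one model per combination (50 epochs each) and keep the one with the
# highest AUC on the validation data.
```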

Experimental Results
In order to evaluate the effectiveness and efficiency of the above approaches, we apply them to the training set of each dataset to build classification models. In this offline training phase, for each dataset, we first construct a prediction model from the training data and then optimize all hyper-parameters based on the validation data. In the online phase, to simulate on-going cases, we first extract all prefix traces of different lengths from each complete trace in the test set and then make a prediction for each of them. After that, for each dataset, we calculate the prediction accuracy in terms of AUC for each prefix trace independently with the different approaches. Based on these, we finally compute the earliness, temporal stability, offline time, and online time of the predictions.
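The prefix extraction used to simulate on-going cases can be sketched as follows; max_len would be the truncated length of the dataset.

```python
def extract_prefixes(trace, max_len=None):
    """All prefixes of a completed trace, each simulating the state of an
    on-going case after its first k events."""
    n = len(trace) if max_len is None else min(len(trace), max_len)
    return [trace[:k] for k in range(1, n + 1)]

# e.g. a trace of 4 events yields 4 running-case snapshots
print(extract_prefixes(["A", "B", "C", "D"]))
```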

Accuracy comparison
To compare prediction accuracy, Table 4 shows the overall AUC and its standard deviation for each dataset, as well as the mean overall AUC of each approach. Here, the overall AUC of RF and XGBoost denotes the best value obtained among the RF-based and XGBoost-based approaches, respectively. The overall AUC for a dataset is a weighted average of the AUC values calculated from all prefix traces of different lengths, where the weights are assigned according to the number of prefix traces. For example, given m different prefix lengths {l_1, l_2, ..., l_m} and the set of prefix traces L_{l_i} = {σ_1, σ_2, ..., σ_{n_i}} of each length l_i in a dataset, the AUC value of each prefix trace σ_j, i.e., σ_j.score, is obtained first when predicting with a classifier. Then the overall AUC is calculated by:

$$\text{overall AUC} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \sigma_j.\mathit{score}$$

The overall AUC is therefore different from the plain average AUC over all prefix traces. The benefit of this weighting is that the overall AUC is affected equally by the different prefix lengths in the test set rather than being biased toward longer prefix traces. As Table 4 indicates, Att-Bi-LSTM achieves the best performance, with the highest overall AUC on 11 out of 12 datasets. In terms of the mean overall AUC over all datasets, Att-Bi-LSTM performs best, followed by Bi-LSTM; LSTM, Att-LSTM, and XGBoost perform similarly, while RF performs worst. Thus, we can conclude that the enhancement of bidirectional feature extraction in Bi-LSTM is more effective than the attention mechanism in Att-LSTM, and that the prediction accuracy of Att-Bi-LSTM, which combines both, is greatly improved.
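A sketch of this weighting, following the formula as reconstructed above: every prefix length contributes its per-length mean score, and the lengths are then averaged with equal weight.

```python
import numpy as np

def overall_auc(scores_by_length):
    """scores_by_length: {prefix_length: [sigma_j.score, ...]}.
    Averages the per-length mean scores so each length counts equally."""
    return float(np.mean([np.mean(s) for s in scores_by_length.values()]))
```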
For further comparison, Fig. 9 presents the AUC values on test prefix traces (i.e., running cases) of different lengths, with each subfigure showing the detailed AUC values for one dataset. Note that the number of prefix traces of a certain length decreases monotonically as the prefix length increases. On the bpic2012 cancelled and sepsis cases 2 datasets, the prediction accuracy in terms of AUC drops once the prefix length reaches a certain value, namely 35 and 10, respectively. Generally, prediction accuracy should increase with the prefix length because the classifier obtains more useful information. Similar to (Teinemaa et al. 2018), the reason may lie in that many cases in these datasets are relatively short and have already finished at a short (prefix) length; therefore, the accuracy of prediction decreases once the prefix length grows beyond a certain extent.
As shown in Fig. 9, for the bpic2012 variants, bpic2017 variants, and hospital billing datasets, the AUC values of the different approaches keep increasing normally with the prefix length but rise dramatically after the prefix length reaches a specific value. Furthermore, on the sepsis cases variants, the AUC values of the different approaches are similar when the prefix length is small but differ significantly once the length reaches a specific value. This may indicate that the occurrence of a certain key activity in the business process (not the last executed event) has a decisive effect on the outcome. Among our proposed approaches, the prediction accuracy (AUC) of Att-Bi-LSTM increases more markedly with the prefix length than that of the other approaches on most datasets. Nevertheless, this advantage gradually disappears as the prefix length grows on the sepsis cases 1 and traffic fines datasets. The reason may be that the key events affecting the process outcome occur early in process execution.

Fig. 9 The AUC comparison of outcome prediction with different approaches on different datasets

Earliness comparison
Table 5 shows the earliness of all approaches on each dataset, which varies dramatically. As defined in Eq. (20), the value of earliness ranges in [0, 100]. In particular, we find that the earliness of XGBoost on the sepsis cases 2 dataset and of LSTM on the sepsis cases 3 dataset reaches 100, which means that these on-going cases can only be predicted accurately when they are almost finished. Furthermore, as shown in Fig. 9, the AUC of XGBoost on the sepsis cases 2 dataset decreases greatly as the prefix length increases and never reaches the AUC threshold of 0.7. Similarly, the AUC of LSTM on the sepsis cases 3 dataset also decreases and fails to reach 0.7 at the maximum prefix length. On the sepsis cases 2 dataset, however, the earliness of the other approaches is relatively small. In addition, the earliness values of the LSTM-based approaches are similar, especially on datasets such as bpic2012 accepted, bpic2012 cancelled, bpic2017 accepted, and sepsis cases 2. In terms of average earliness, the enhancement of the attention mechanism in Att-LSTM proves more effective than the bidirectional feature extraction in Bi-LSTM, and both perform better than the basic LSTM. Finally, Att-Bi-LSTM, which combines the two enhancements, achieves the best earliness, followed by Att-LSTM and Bi-LSTM, while RF has the worst earliness.

Stability comparison
To further evaluate the effectiveness of the above approaches, Table 6 gives their temporal stabilities on each dataset. For comparison with our proposed approaches, we show the range of stability for the RF-based and XGBoost-based approaches, respectively, since the stability of the approaches based on the same classification algorithm (RF or XGBoost) varies significantly. For our LSTM-based approaches, we compute the stabilities according to Eq. (21) based on the prediction scores. As shown in Table 6, in most cases the stability of the XGBoost-based approaches varies, but their maximum can be the best among all approaches. In particular, on the sepsis cases 1 dataset the stability of the XGBoost-based approaches reaches 1, meaning that the predictions for successive prefixes of an on-going case remain identical, although this is not the usual situation. Comparing the two feature-extraction enhancements (i.e., bidirectional feature extraction and the attention mechanism), we find that Bi-LSTM achieves better stability on 9 out of 12 datasets; moreover, in terms of the mean stability over the 12 datasets, Bi-LSTM is more stable than Att-LSTM. Among the LSTM-based approaches, Att-Bi-LSTM, which combines both enhancements, performs best in terms of mean stability. However, on 6 out of 12 datasets, the predictions of Att-Bi-LSTM are less stable than those of Att-LSTM or Bi-LSTM, which can also be observed in Fig. 9.

Time performance comparison
In order to evaluate the efficiency of the above approaches, Table 7 shows the offline total time (in seconds) and the online average time (in milliseconds) obtained by making predictions for all prefix traces extracted from the test set of each dataset. Here, offline total denotes the time required to train a classifier, while online avg denotes the average time required to predict a test sample (i.e., a prefix trace). Table 7 also gives the number of test samples for each dataset.
For a fair comparison, we present the ranges of offline total and online avg for RF and XGBoost, respectively, since the time of these two approaches varies significantly depending on the selected operations. As shown in Table 7, the online avg of the LSTM-based approaches stays within 10 milliseconds (mostly within 5 milliseconds) and is significantly lower than that of RF and XGBoost. Meanwhile, the offline total of our proposed approaches mostly lies within the ranges of RF and XGBoost. These results show that the LSTM-based approaches have clear advantages for real-time prediction. In addition, we find that the two enhancements of bidirectional feature extraction and attention mechanism make Bi-LSTM, Att-LSTM, and Att-Bi-LSTM slightly less efficient in terms of offline total and online avg compared with the basic LSTM approach. However, compared with the improved prediction accuracy, this small difference in time performance is negligible.

Threats to Validity
In this section, we discuss some threats that may affect the validity of our study. One potential threat comes from the selected case attributes and event attributes, as well as the ways of encoding them when training classifiers, since these choices may interfere with the predicted outcome of a case. Therefore, in our experiment we extended the event logs with as many additional attributes as possible so as to enrich them for training classifiers. Moreover, some of the available encoding ways based on aggregation are lossy, which may cause some information to be discarded. In our experiment, however, only the RF and XGBoost approaches use such encodings based on aggregation functions; similarly, they also use encodings based on the last events of each case. In contrast, the encoding strategies employed in the LSTM-based approaches, such as one-hot encoding, are lossless. In addition, we choose AUC as the accuracy evaluation criterion to deal with the problem of sample imbalance. The AUC measurement may also introduce some other bias that we have not yet found; however, the threats from the other existing accuracy measurements are much more serious than those from AUC.

Table 5 The earliness comparison of different approaches on each dataset

Dataset             RF    XGBoost  LSTM  Bi-LSTM  Att-LSTM  Att-Bi-LSTM
bpic2012 accepted   50    48       50    50       50        50
bpic2012 declined   85    75       65    65       65        70
bpic2012 cancelled  50    48       50    50       50        50
bpic2017 accepted   30    30       30    30       30        30
bpic2017 declined   30    40       30    40       30        30
bpic2017 cancelled  30    30       30    30       40        30
sepsis cases 1      62    81       79    79       79        69
sepsis cases 2      50    100      8     8        8         8
sepsis cases 3      58    48       100   48       58        48
production          41    31       69    55       29

Table 7 The comparison of execution time of different approaches
Besides, some other threats to the effectiveness of our study may result from the incompleteness of the experiment. First, we only tried one attention mechanism, while many other variants have been proposed, such as multi-dimensional attention, soft vs. hard attention, and global vs. local attention. In our opinion, the basic attention mechanism has already been shown to be effective, and its variants could only perform better. Second, although the hyper-parameters are optimized using a state-of-the-art hyper-parameter optimization technique, a different optimization algorithm might find better parameter settings. In our experiment, we conducted an additional validation experiment to determine an optimized hyper-parameter setting, which reduces this threat to some extent. Finally, the generalizability of our study may be limited because the experiments were conducted on only 12 datasets from six event logs; however, such a limitation is common to empirical research.

Conclusion and Future Work
In this paper, we proposed three approaches (Bi-LSTM, Att-LSTM, and Att-Bi-LSTM) based on two enhancements of LSTM feature extraction (bidirectional feature extraction and the attention mechanism) to predict the outcome of on-going process instances. In particular, the proposed Bi-LSTM approach adopts the bidirectional enhancement by combining the original LSTM network from two directions, i.e., forward and backward. The proposed Att-LSTM approach utilizes the attention mechanism to optimize and enhance the features captured by the LSTM network. Finally, Att-Bi-LSTM combines these two enhancements of feature extraction. To investigate the effectiveness of the enhancements and to compare with conventional machine learning techniques, we conducted a series of extensive experiments on twelve real datasets and analyzed the experimental results in terms of accuracy, earliness, stability, and time performance. The experiments demonstrate that our proposed LSTM-based approaches can predict the outcome of running cases more effectively and efficiently than the other approaches, because they can assign different weights to the bidirectionally captured features that affect the outcome.
Regarding the characteristics of the enhancement strategies, in the near future we will study "concept drift" in event logs, i.e., the phenomenon that the features affecting the process outcome change over time. Besides, we plan to investigate incremental outcome prediction based on neural networks, to avoid repeatedly retraining offline on the samples (i.e., the historical process instances) as a process continues to execute. Last but not least, we plan to further study outcome prediction that can provide interpretable guidance.