## 3.1 Feature extraction

For feature extraction, the AE events are sorted in sequential order. The input dataset begins with the events of the first hydraulic fracturing test at 1.6 m depth (149 events), followed by the events of the second test at 2.5 m depth (166 events), the third test at 3.4 m depth (88 events), the fourth test at 5.8 m depth (129 events), the fifth test at 7.2 m depth (150 events), and finally the sixth test at 9.0 m depth (83 events), giving 765 events in total.

Figure 3 shows the waveforms (left-hand side) of AE events with the calculated arrival times of the longitudinal (green lines) and transverse (red lines) waves. The dashed black vertical line marks the origin time of the event. The green and red dashed lines mark the mean value of the L- and T-arrival times, respectively. The selected signals are from representative events located during the hydraulic fracturing tests HF1 to HF6 at well depths of 1.6 m, 2.5 m, 3.4 m, 5.8 m, 7.2 m, and 9.0 m (Figs. 4a to 4f, respectively). The corresponding arrival time profiles (ATPs) are shown as horizontal bars on the right-hand side. A comparison of Figs. 4a and 4b indicates similar ATP patterns for events at 1.6 m and 2.5 m depth. This is because the waves travel along almost the same paths to the sensors; with constant propagation velocities of the longitudinal and transverse waves, the arrival times are therefore approximately the same. At greater borehole depths, the ATPs differ more clearly from each other.
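Because the events are stored test by test, the class of each event follows from its position in the dataset. The following minimal sketch illustrates this bookkeeping using the event counts given above; the function and variable names are illustrative assumptions and not part of the original processing chain.

```python
from itertools import accumulate

# Event counts of the tests HF1 to HF6, as listed in the text.
EVENTS_PER_TEST = [149, 166, 88, 129, 150, 83]


def sequential_labels(counts):
    """Return one 1-based class index per event, in registration order."""
    labels = []
    for class_index, count in enumerate(counts, start=1):
        labels.extend([class_index] * count)
    return labels


labels = sequential_labels(EVENTS_PER_TEST)

# Position of the first event of each test within the sorted dataset.
first_index = [0] + list(accumulate(EVENTS_PER_TEST))[:-1]

print(len(labels))   # total number of events
print(first_index)   # start offsets of HF1 to HF6
```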

## 3.2 Architecture of neural network

There is currently no universal strategy for developing powerful neural networks, so many solutions are still found experimentally [3, 10–12]. Nevertheless, three points should be considered during development: first, the selection of an appropriate network model; second, the specification of the network topology, which includes the number of units and their connections; and third, the specification of the learning parameters. The latter include the weights of the input information and the threshold of a neuron, which enter indirectly through the activation function. Figure 5 schematically displays the architecture of the neural network with input and output (green squares) and the hidden and output layers (blue squares). W and b are the network weights and biases, respectively.

The number of neurons in the input layer is determined by the data, and the number in the output layer by the number of classes. For the hidden layer, it depends on the data and the structures within the data. In this work, a standard pattern recognition network is used: a two-layer feedforward network with a sigmoid transfer function in the hidden layer and a softmax transfer function in the output layer. The number of hidden neurons is set to 10. This number should be increased if the network does not perform as well as expected. A manual hyperparameter optimization was carried out, and this configuration gave the best results. It was taken into account that overfitting occurs with too many layers and neurons. The number of output neurons is equal to the number of elements in the target vector.

Specifically, in this application the input data for each event is a vector with 16 elements. These elements are the arrival time profile \({p}_{i}\) (see Eq. 1) of the longitudinal and transverse waves, which are equal due to the normalization of the time scale. As mentioned, the arrival time profile is independent of time scale and material, and it is more robust to input errors. By subtracting the arrival time averages, the error is evenly distributed over all inputs. The target data are the six classes \(C1\) to \(C6\), which can be assigned to the six fracturing tests HF1 to HF6 as seen in Fig. 2. In this context, scattered events that cannot be spatially unambiguously assigned to one of these six clusters are classified in the order in which they were registered. Each target is a vector of zeros except for a 1 in the element that represents the class.
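The forward pass of such a network can be sketched in a few lines. The layer sizes (16 inputs, 10 hidden neurons, 6 outputs) are taken from the text. The logistic form of the sigmoid, the random initial weights, and the placeholder input vector are assumptions made only for illustration; a tanh-type sigmoid would be a drop-in alternative.

```python
import math
import random

N_INPUT, N_HIDDEN, N_CLASSES = 16, 10, 6  # sizes stated in the text


def sigmoid(z):
    """Logistic sigmoid (assumed form of the hidden-layer transfer function)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """Softmax with the maximum subtracted for numerical stability."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(x, W1, b1, W2, b2):
    """Two-layer feedforward pass: sigmoid hidden layer, softmax output layer."""
    hidden = [sigmoid(v) for v in affine(W1, b1, x)]
    return softmax(affine(W2, b2, hidden))


def one_hot(class_index, n_classes=N_CLASSES):
    """Target vector: all zeros except a 1 at the (1-based) class position."""
    return [1.0 if k == class_index else 0.0 for k in range(1, n_classes + 1)]


rng = random.Random(0)


def random_matrix(rows, cols):
    """Small random weights, used here only as untrained placeholders."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W1, b1 = random_matrix(N_HIDDEN, N_INPUT), [0.0] * N_HIDDEN
W2, b2 = random_matrix(N_CLASSES, N_HIDDEN), [0.0] * N_CLASSES

x = [rng.uniform(-1.0, 1.0) for _ in range(N_INPUT)]  # placeholder profile
y = forward(x, W1, b1, W2, b2)  # six outputs that sum to one
```

Since the weights are untrained, the output values themselves carry no meaning; the sketch only shows the shapes and the two transfer functions.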

## 3.3 Training of neural network

When training multilayer networks, the general practice is to divide the data into three subsets. The first subset is the training set, which is used for computing the gradient and updating the network weights and biases. The second subset is the validation set. The error on the validation set is monitored during the training process. The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. The network weights and biases are saved at the minimum of the validation set error. The third subset is the test set, which is not used during training and serves as an independent check of network generalization. Here, the data are split randomly into 70% for training, 15% for validation (to verify that the network is generalizing and to stop training before overfitting), and 15% for testing. Figure 6 shows the result of applying the neural network, after 27 training epochs, to the 765 AE events of the six classes C1 to C6.
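The random division and the selection of the stopping point can be illustrated as follows. This is a sketch under stated assumptions: the rounding of the subset sizes, the fixed seed, and the example error values are illustrative and may differ from the tool actually used.

```python
import random


def split_indices(n_events, train_fraction=0.70, val_fraction=0.15, seed=0):
    """Randomly divide event indices into training, validation and test sets.

    Subset sizes are truncated to integers (an assumption); the remainder
    goes to the test set.
    """
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    n_train = int(train_fraction * n_events)
    n_val = int(val_fraction * n_events)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def best_epoch(validation_errors):
    """0-based epoch with the lowest validation error (weights are saved there)."""
    return min(range(len(validation_errors)), key=validation_errors.__getitem__)


train, val, test = split_indices(765)
print(len(train), len(val), len(test))

# Hypothetical validation errors: they fall at first, then rise with overfitting.
print(best_epoch([0.9, 0.5, 0.3, 0.35, 0.4]))
```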

On the left-hand side of this figure, the elements of the output vector (values between 0 and 1) are plotted as green bars indicating the probability with which an event is assigned to a class. The locations of the related AE events are shown on the right-hand side, projected onto the x-y plane. Overall, about 91% of the events are assigned to the correct cluster. The mean output value of all AE events in Class C1 is about 0.8, and in Classes C2 to C6 it is 0.86, 0.89, 0.93, 0.89, and 0.87, respectively.

## 3.4 Confusion matrix

For discrete class mapping, the largest value of the output vector is used, so that only the predicted class is specified for an event. A confusion matrix is commonly used to describe the performance of this discrete classification. The confusion matrix itself is relatively simple to understand, but the associated terminology can be confusing. With respect to a given class, each event is either positive "p" (the event belongs to the class) or negative "n" (it does not). To distinguish between the actual class and the predicted class, the labels "Y" and "N" are used for the class predictions: "Y" means that the classifier assigns the event to the class (predicted positive), and "N" means that it does not (predicted negative). Thus, there are four possibilities. If the case is positive and classified as positive, it is counted as a true positive (TP); if it is classified as negative, it is counted as a false negative (FN). If the case is negative and classified as negative, it is counted as a true negative (TN); if it is classified as positive, it is counted as a false positive (FP). Given six classes, the confusion matrix has six-by-six elements. The equations for calculating the elements of the confusion matrix are given by Fawcett (2005) [13].
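The steps above can be sketched as follows: the predicted class is the index of the largest output element, the matrix counts pairs of true and predicted classes, and the four counts for one class follow from its row and column. The three-class example data are hypothetical and serve only to keep the sketch short; rows are taken as true classes and columns as predicted classes.

```python
def predicted_class(output_vector):
    """0-based index of the largest element of the network output."""
    return max(range(len(output_vector)), key=output_vector.__getitem__)


def confusion_matrix(true_classes, predicted_classes, n_classes):
    """Count events per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix


def one_vs_rest_counts(matrix, k):
    """TP, FN, FP and TN for class k, treating all other classes as negative."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # rest of the true-class row
    fp = sum(row[k] for row in matrix) - tp   # rest of the predicted-class column
    tn = total - tp - fn - fp
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical example with three classes and six events.
true_classes = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]

matrix = confusion_matrix(true_classes, predicted, n_classes=3)
counts = one_vs_rest_counts(matrix, k=0)
print(matrix)
print(counts)
```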

Figure 7 shows the confusion matrices for the training, validation, and testing data sets. The sum of these matrices is shown in the lower right corner of the figure. The rows of the confusion matrix correspond to the true class and the columns to the predicted class. The sum of a column gives the number of events assigned to that class. Diagonal and off-diagonal elements correspond to correctly and incorrectly classified observations, respectively. In addition, Fig. 7 displays the number of correctly and incorrectly classified events for each true and predicted class as percentages of the number of events in the corresponding true and predicted class.

Considering the confusion matrix for all data, 141 events (94.6%) of Class 1 are correctly classified. The remaining 5.4% are incorrectly assigned to Class 2 (4 events), Class 3 (1 event), Class 5 (1 event), and Class 6 (2 events). In Class 2, 95.2% of the events are classified correctly; six and two events are misclassified as Class 1 and Class 6, respectively. For Class 3, the predicted class matches the true class in 97.5% of the cases, and only two events are not correctly classified. Of the 129 events in Class 4, 94.3% are attributed to the true class. A similar result is obtained for Classes 5 and 6, with 89.3% and 94% correctly classified events, respectively. Most of their misclassified events (9 and 5, respectively) are assigned to Class 1. Only 5 events of Class 5 are incorrectly assigned to Class 6.

## 3.5 Receiver operating characteristic

One method for graphically representing the performance of a classifier is the receiver operating characteristic (ROC) diagram. ROC diagrams are commonly used in medical decision making and, in recent years, have been increasingly used in machine learning and data mining research. An ROC diagram shows the relative trade-off between benefits (true positives) and costs (false positives). ROC diagrams are two-dimensional graphs in which the TP rate is plotted on the y axis and the FP rate on the x axis. To create an ROC diagram, the data are sorted in descending order by score and processed sequentially, updating the TP and FP counts at each step.
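The construction described above can be sketched for a single class treated as positive against all others. The scores and labels in the example are hypothetical; the handling of tied scores (one point per distinct score) is a common convention and an assumption here.

```python
def roc_points(scores, is_positive):
    """Return (FP rate, TP rate) points of the ROC curve for one class.

    Events are sorted by descending score and processed one by one while the
    TP and FP counts are updated. Assumes at least one positive and one
    negative event.
    """
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)

    points = []
    tp = fp = 0
    previous_score = None
    for score, positive in ranked:
        # Emit a point only when the score changes, so ties share one point.
        if score != previous_score:
            points.append((fp / n_neg, tp / n_pos))
            previous_score = score
        if positive:
            tp += 1
        else:
            fp += 1
    points.append((fp / n_neg, tp / n_pos))  # final point at (1, 1)
    return points


# Hypothetical scores for four events; True marks members of the class.
points = roc_points([0.9, 0.8, 0.7, 0.6], [True, True, False, True])
print(points)
```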

Figure 8 shows the ROC diagram of the six classes for the testing dataset. The diagonal line indicates a random process: values near the diagonal mean that the hit rate equals the false positive rate, which corresponds to the expected hit frequency of a random process. An ROC curve that remains significantly below the diagonal indicates that the values have been misinterpreted.

At the beginning, the curves, especially those of Classes 3 and 6, rise vertically and then turn horizontal at an almost constant value near one. As mentioned, the ROC curve is a two-dimensional representation of classifier performance. However, it may be useful to reduce classifier performance to a single scalar value. A common method is to calculate the area under the ROC curve, abbreviated as AUC [14, 15]. AUC is scale invariant: it measures how well the predictions are ranked, rather than their absolute values. Since the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. Larger AUC values indicate better classifier performance. Since random processes are characterized by the diagonal, no realistic classifier should have an AUC of less than 0.5. The AUC of the six classes ranges from 0.92 (Class 5) to 1 (Classes 3 and 6), indicating a near-perfect prediction.
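The area under an ROC curve given as a list of (FP rate, TP rate) points can be approximated with the trapezoidal rule, as in the following minimal sketch. The two example curves are idealized cases chosen for illustration, not results from this study.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear ROC curve given as (FP rate, TP rate) points."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # vertical rise, then horizontal
chance = [(0.0, 0.0), (1.0, 1.0)]               # the diagonal of a random process

print(auc_trapezoid(perfect))
print(auc_trapezoid(chance))
```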

## 3.6 Bootstrap analysis

Since the test data were randomly selected, it cannot be assumed that all classes are equally represented. To determine the confidence interval for the whole data set of 765 classified events, a bootstrap analysis is performed. The advantage of bootstrapping is that this method makes no distributional assumption [16]. Bootstrapping is based on resampling, which means that samples are repeatedly drawn with replacement from the given test data. Figure 9 shows the bootstrap analysis of the classification score obtained for 300 resamples.

The green vertical dashed lines in the probability distribution indicate the lower and upper boundaries of the 95% confidence interval (approximately ±1.96 standard deviations). The red vertical dashed line shows the mean value of the output score. In this figure, the density distribution of the classification score is overlaid with a normal distribution curve (black line) to illustrate normality. The mean value of the dataset is 0.952. The bootstrap distribution appears to be normal; therefore, the bootstrap results can be trusted.
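A bootstrap of this kind can be sketched as follows. The number of resamples (300) and the factor 1.96 come from the text. The per-event scores are hypothetical placeholders (1 for a correctly classified event, 0 otherwise), and using the mean of each resample as the statistic is an assumption about how the classification score is aggregated.

```python
import random
import statistics


def bootstrap_means(values, n_resamples=300, seed=0):
    """Mean of each resample, drawn with replacement at the original size."""
    rng = random.Random(seed)
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)]


def normal_interval(samples, z=1.96):
    """Lower bound, mean and upper bound assuming a normal distribution."""
    mean = statistics.fmean(samples)
    spread = statistics.stdev(samples)
    return mean - z * spread, mean, mean + z * spread


# Hypothetical scores: 95 correctly and 5 incorrectly classified events.
scores = [1.0] * 95 + [0.0] * 5

means = bootstrap_means(scores)
lower, mean, upper = normal_interval(means)
print(round(lower, 3), round(mean, 3), round(upper, 3))
```

The normal interval is only appropriate if the bootstrap distribution is approximately normal, which is the check made with the superimposed curve in Fig. 9; otherwise the empirical 2.5% and 97.5% percentiles of the resampled values could be used instead.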

## 3.7 Visualizing using t-SNE

Figure 10 displays the so-called t-SNE plot. With t-SNE, the analysis starts with a high-dimensional data set of classified AE events. A distance in Euclidean space is defined between each pair of data points. Subsequently, t-SNE searches via gradient descent for a mapping of these objects to a low-dimensional space (usually two-dimensional), so that the distances between the objects are preserved as well as possible.
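For reference, the quantity minimized by gradient descent can be written out in the standard formulation of t-SNE. The symbols are not defined elsewhere in this paper and are introduced here only for illustration: \(x_i\) are the high-dimensional data points, \(y_i\) their low-dimensional images, \(N\) the number of points, and \(\sigma_i\) a per-point bandwidth set through the perplexity parameter.

\[
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}
\]

\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]

In this form, the Euclidean distances enter through the pairwise similarities \(p_{ij}\) and \(q_{ij}\), and the mapping is chosen so that the two similarity distributions agree as closely as possible. Neighbourhood relations are therefore preserved more faithfully than large distances.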

The low-dimensional data then allow direct representation in the form of a graph, in which structures can often be identified by visual inspection [17, 18]. While pure clustering methods provide clusters of data, it is far from trivial to graphically represent or analyze the relationships between clusters. In this context, t-SNE is also suitable as a downstream processing step to graphically represent clusters found by other algorithms. The figure shows clearly that the clusters are tightly constrained within their groups. However, it is also noticeable that some individual events are assigned to the wrong cluster.