In this study, we first used the trained PhaseNet model (model0) built by Zhu and Beroza (2018). Using the same architecture and approximately 220,000 seismic waveforms from about 30,000 events recorded at Hakone volcano from 1999 to 2020, we then created a model trained from scratch (model1) and a model fine-tuned from model0 with the same Hakone data (model2), and evaluated the performance of all three. The performance of model1 and model2 was significantly improved over that of model0, especially for S-wave detection, while model1 performed slightly better than model2, although the difference was small. This suggests that the amount of Hakone data used in this study is sufficient for the PhaseNet architecture to learn the seismic characteristics of this region. When the batch size and learning rate were varied, a maximum F1 value could be found for model1, but not for model2, whose performance kept improving as the batch size decreased and the learning rate increased. If the learning rate is too small, the model risks becoming stuck in a local minimum and performing poorly, but if it is too large, the model may jump over the minimum. Likewise, if the batch size is too large, the features of individual data samples may be averaged out and lost; on the other hand, a large batch size speeds up learning and helps avoid local minima, although performance has generally been shown to decrease as the batch size increases (e.g., Keskar et al., 2017). Smith et al. (2018), however, argue that there is an optimal batch size for a given learning rate, and that what matters is not reducing the batch size but choosing a learning rate appropriate for it. In any case, there is no way to determine the optimal batch size or learning rate without benchmark testing.
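The benchmark testing described above amounts to a grid search over batch size and learning rate, selecting the pair that maximizes F1 on validation data. In the sketch below, the `evaluate` callback is a hypothetical stand-in for training a model with the given hyperparameters and measuring validation precision and recall; the candidate values shown in the usage example are illustrative, not those used in the study.

```python
from itertools import product

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def grid_search(evaluate, batch_sizes, learning_rates):
    """Return (best_f1, batch_size, learning_rate) over the full grid.

    `evaluate(batch_size, lr)` is assumed to train a model and return
    (precision, recall) measured on held-out validation data.
    """
    best = None
    for bs, lr in product(batch_sizes, learning_rates):
        precision, recall = evaluate(bs, lr)
        f1 = f1_score(precision, recall)
        if best is None or f1 > best[0]:
            best = (f1, bs, lr)
    return best
```

A model1-style search would show F1 peaking at an interior grid point, whereas a model2-style search would keep improving toward the smallest batch size tested.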
Although detailed hyperparameter testing is computationally demanding, the Adam optimizer used in PhaseNet is known to require little hyperparameter tuning (Kingma and Ba, 2014). Since the values chosen for model1 are not very different from those used in the original model, we believe the test results are reliable. For model2, a smaller batch size gave better performance, which means that the features of each data sample were captured and learned sensitively; in other words, this is the result of starting from model0 and learning to fit the Hakone data more closely. However, model2 is unstable because its loss function does not converge when the batch size exceeds 64, so model1 is more suitable for the Hakone data. Although model1 performed best among the three models, when multiple waveforms with different amplitudes appeared in the same time window, the probability values for the small-amplitude arrivals fell below the picking threshold, and the prediction accuracy for the large-amplitude arrivals was often also lower. Model0, in contrast, did not respond at all to the small amplitudes in such cases and tended to detect only the large ones. The training data contain only one seismic waveform per sample, normalized by the maximum amplitude. Since the data used for phase picking are likewise normalized by the maximum amplitude, small-amplitude arrivals are treated like noise and are unlikely to be detected. The probability values increased somewhat in model1 and model2, presumably because the Hakone seismic waveforms have characteristics different from those of the California crustal earthquakes used to build model0, and the models captured those characteristics even at small amplitudes. In this study, to overcome the problem observed in model1, we created new training samples by randomly taking two waveforms from the existing validation data and superimposing them in a single time window. With such data, the small-amplitude waveforms in windows normalized by the maximum amplitude can be labeled and trained as P and S waves.
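The augmentation step just described, combining two single-event windows so that the smaller arrival survives maximum-amplitude normalization with a known label, might look like the following sketch. The function name and the amplitude-scale range are assumptions for illustration, not the exact procedure or values used in the study.

```python
import numpy as np

def superpose_waveforms(w1, w2, scale_low=0.05, scale_high=0.5, rng=None):
    """Combine two single-event windows into one multi-event window.

    Both inputs are assumed already normalized to unit maximum amplitude.
    The second waveform is rescaled by a random factor so the composite
    window contains one large- and one small-amplitude event, then the
    sum is renormalized by its maximum absolute amplitude, as the
    training pipeline would do.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = rng.uniform(scale_low, scale_high)
    mixed = w1 + scale * w2
    return mixed / np.max(np.abs(mixed))
```

The P and S labels of both source windows are kept, so the small-amplitude arrivals are explicitly labeled rather than being left to look like noise.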
We evaluated the performance of model3, trained with all weights initialized, and model4, trained with the weights of model1 as initial values, and found that both models improved the ability to detect the smaller-amplitude waveform when multiple waveforms with different amplitudes were present in the same time window. Model4 also outperformed model1 in the number of events detected, whereas model3 did not match model1 in this respect. This may be because the number of seismic waveforms used for training (100,000) is half the number used to train model1. Deep learning models for seismic phase picking usually use amplitude-normalized data for training. The amplitude of each seismic record varies with the epicentral distance and the magnitude of the earthquake, so when waveforms acquired over a wide area are used as training data, the amplitudes span several orders of magnitude, and the model may not converge if trained without normalization. However, for a group of earthquakes whose magnitudes fall within a narrow range, including the amplitude fluctuations within that range in the training data is thought to have improved accuracy.
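A minimal sketch of the maximum-amplitude normalization discussed above; it illustrates why windows whose raw amplitudes differ by orders of magnitude become indistinguishable after normalization, so the relative amplitude information survives only through augmentation such as waveform superposition. The function name and window shape are hypothetical.

```python
import numpy as np

def normalize_by_max(window):
    """Normalize a (samples, components) waveform window by its maximum
    absolute amplitude across all components, the common preprocessing
    for phase-picking training data. Every window ends up with a unit
    peak regardless of its raw amplitude level.
    """
    peak = np.max(np.abs(window))
    if peak == 0.0:
        return window.copy()
    return window / peak
```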
Model4 detects events in the continuous seismic waveforms more accurately than model1, but some waveforms are still missed. Although this may be improved by increasing the amount of training data, a detailed look at the results shows that many of the missed waveforms still produced a slight increase in probability value. We therefore ran predictions on the same continuous waveform data with the detection threshold of model4 set to 0.1 and 0.3 and applied REAL to the results. We further applied VELEST (Kissling et al., 1995) for earthquake location and finally relocated the events using HypoDD (Waldhauser and Ellsworth, 2000). The relocated events numbered 2094, 1296, and 1091 at thresholds of 0.1, 0.3, and 0.6, respectively (Fig. 7, Table 3). This indicates that it is not necessary to set a high threshold at the waveform-detection stage; it is more effective to set it low and filter out noise in the subsequent phase association, hypocenter location, and relocation steps.
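The effect of the detection threshold can be illustrated with a simple peak-picking routine over a phase-probability trace: lowering the threshold keeps weak candidate picks that the later association and relocation steps can filter out. This is a sketch of generic trigger logic, not PhaseNet's exact implementation; the function name and the `min_gap` parameter are assumptions.

```python
import numpy as np

def pick_phases(prob, threshold=0.3, min_gap=100):
    """Extract pick indices from a phase-probability time series.

    A pick is declared at each local maximum exceeding `threshold`,
    with successive picks at least `min_gap` samples apart.
    """
    picks = []
    i = 1
    n = len(prob)
    while i < n - 1:
        if prob[i] >= threshold and prob[i] >= prob[i - 1] and prob[i] > prob[i + 1]:
            picks.append(i)
            i += min_gap  # skip ahead to avoid duplicate picks on one peak
        else:
            i += 1
    return picks
```

With a trace containing one strong and one weak probability peak, a threshold of 0.1 returns both candidate picks while 0.3 returns only the strong one, mirroring the trade-off examined in Table 3.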
Table 3
Number of event detections for each algorithm relative to the detection threshold
Threshold | REAL | VELEST | HypoDD
0.1 | 7381 | 6909 | 2094
0.3 | 1937 | 1865 | 1296
0.6 | 1311 | 1302 | 1091
Regarding the earthquakes newly detected by the PhaseNet model, a few epicenters lie outside the main cluster, but most fall within it. Although the magnitudes in the original catalogs are calculated in different ways and cannot be compared directly, a comparison of the magnitude-frequency histograms for the earthquakes detected at different thresholds shows that the smaller the threshold, the more small earthquakes are picked up (Fig. 8). However, as the threshold decreases, the number of earthquakes outside the cluster also increases; hence, whether these newly detected small earthquakes are real needs to be examined in the future. Note that the matched filter method (MF) (Gibbons and Ringdal, 2006) was used to detect earthquakes in this earthquake swarm (Yukutake et al., 2022a), and the number of earthquakes detected was 2600, more than the number detected by model4 with a threshold of 0.1. A comparison of the events detected by model4 and by MF shows that many of the MF-detected earthquakes are shallower than those in the original catalog, whereas the model4-detected events show a depth distribution similar to that of the catalog events (Fig. 9). However, since the hypocenter relocation in this study is not based on waveform correlation, the depths may change depending on the processing.
Furthermore, station corrections are applied in MF but not in our relocation. As for processing time, the performance of both methods could be improved by optimizing the code. The advantage of PhaseNet over MF is that it does not require the template earthquakes that MF depends on, and it generalizes to some extent even when trained on data from other regions. For example, when model1 was applied to seismic data from other volcanoes in Japan, it detected eight times more earthquakes than listed in the catalog (Yukutake and Kim, 2022b). Since not all volcanoes have been monitored with high accuracy for many years as Hakone volcano has, it is worth improving the accuracy of such machine learning models in the future, because a model specialized to a particular region might be created by fine-tuning the model developed at Hakone volcano with a small amount of local training data. In addition, many seismic waveforms may overlap during a very active swarm period. Using the approach developed in this study for training data containing multiple seismic waveforms, such overlapping data can be created artificially and added to the training set, allowing the model to learn their characteristics and potentially improving event detection capability.