Comparative analysis of machine learning models for solar flare prediction

In this paper, we develop five machine learning models, a neural network (NN), a long short-term memory (LSTM) network, an LSTM with an attention mechanism (LSTM-A), a bidirectional LSTM (BLSTM), and a BLSTM with an attention mechanism (BLSTM-A), for predicting whether a ≥C class or ≥M class flare will occur in an active region in the next 24 hr. We use the database provided by the Space-weather Helioseismic and Magnetic Imager Active Region Patches of the Solar Dynamics Observatory, including 10 magnetic field features of active regions from 2010 May 1 to 2018 September 13. The samples are labeled with flare information (i.e., No-flare/C/M/X) using the solar flare event catalogue provided by the Geostationary Operational Environmental Satellite and the Solar Geophysical Data solar event reports. In addition, we generate 10 cross-validation sets from these data. After training, validating, and testing our models, we compare the results using the true skill statistic (TSS) as the assessment metric. The main results are as follows. (1) The TSS scores for ≥C class are 0.5472 ± 0.0809, 0.6425 ± 0.0685, 0.6904 ± 0.0575, 0.6681 ± 0.0573, and 0.6833 ± 0.0531 for NN, LSTM, LSTM-A, BLSTM, and BLSTM-A, respectively. The TSS scores for ≥M class are 0.5723 ± 0.1139, 0.6579 ± 0.0758, 0.5943 ± 0.0712, 0.6493 ± 0.0826, and 0.5932 ± 0.0723, respectively. (2) For the first time, we add an attention mechanism to a BLSTM for flare prediction, which improves the performance of the model for the ≥C class. (3) Among the five models, the prediction models based on deep learning algorithms are generally superior to the model based on the traditional machine learning algorithm (NN). The performance of the LSTM models is comparable to that of the BLSTM models. In general, LSTM-A performs better than the other models for the ≥C class.
In addition, we discuss the influence of the 10 features on LSTM-A and find that removing the least significant feature yields better performance than using all 10 features together, improving the TSS score of the model to 0.7059 ± 0.0440.


Introduction
Solar flares are sudden bursts of electromagnetic radiation from the solar atmosphere (Sinha et al. 2022). The phenomenon is powered by a rapid release of energy in the corona, which is considered to be triggered by magnetic activity (Fleishman et al. 2020). Solar flares have an immediate impact in the form of enhanced radiation and energetic particles in as little as 8 min after the start of the event, causing severe space weather disturbances that pose a hazard to astronauts and technological systems in space and on the ground (Veronig 2020). Therefore, much effort is being invested in solar flare research, including prediction and mitigation programs.
In recent years, some progress has been made in predicting solar flares based on magnetic field observations. The methods employed for solar flare prediction include statistical assessment of magnetic-feature models (Song et al. 2009; Mason and Hoeksema 2010; Wheatland et al. 2016) and machine learning algorithms based on magnetic features, such as artificial neural networks (Li and Zhu 2013), k-nearest neighbors (Huang et al. 2013), support vector machines (Yuan et al. 2010; Bobra and Ilonidis 2016; Sadykov and Kosovichev 2017; Nishizuka et al. 2018), and random forests (Liu et al. 2017; Florios et al. 2018; Cinto et al. 2020). It is worth noting that none of the above methods take into account the time dependence of the solar magnetic features. With the accumulation of solar magnetic field data and the development of new machine learning algorithms, time dependence has been taken into account in solar flare prediction. A long short-term memory (LSTM) model was proposed to predict solar flares within the next 24 hr and achieved better performance after an attention mechanism was added. However, it is unknown whether such a model can achieve good performance across multiple data sets. Given the important influence of solar flare prediction on space weather, it is therefore very important to study how to improve both the prediction performance and the general applicability of solar flare prediction models.
In this paper, we compare the classification performance of five solar flare prediction models on 10 cross-validation data sets, with four of the models taking time dependence into account. We adopt the method of shuffle-and-split cross-validation by ARs to divide the data sets. Compared to splitting by years, this approach has the advantage that the ARs in each split are randomly dispersed over different phases of a solar cycle, removing the bias introduced by artificially specifying splits (Sun et al. 2022). The bidirectional LSTM (BLSTM) model is believed to have potential in time-sequence processing. BLSTMs have been widely used in a variety of applications, such as total electron content forecasting (Sun et al. 2017), text classification (Liu and Guo 2019), and wind speed forecasting (Jaseena and Kovoor 2021). We are interested in the performance of the BLSTM in flare prediction, as well as its performance after an attention mechanism is added. We compare the performance of five models, a neural network (NN), an LSTM, an LSTM with an attention mechanism (LSTM-A), a BLSTM, and a BLSTM with an attention mechanism (BLSTM-A), in predicting the occurrence of solar flares within 24 hr. To our knowledge, this is the first time that BLSTMs have been used for solar flare prediction.
The rest of this paper is organized as follows. Section 2 describes the data, and Sect. 3 presents the method. Section 4 presents the results, and Sect. 5 provides discussion and conclusions.

Data
The Helioseismic and Magnetic Imager (HMI; Schou et al. 2012) on board the Solar Dynamics Observatory (SDO; Pesnell et al. 2012) began routine observations on 2010 April 30. In this study, we use a data product called the Space-Weather HMI Active Region Patches (SHARP; Bobra et al. 2014), created by the SDO/HMI team. Magnetic maps of ARs can be obtained from SHARP, making it easy to obtain the data and the desired features. We select flare events between 2010 May 1 and 2018 September 13, a period including the peak of solar cycle 24; the data are collected continuously at an average cadence of 12 minutes. To avoid projection effects (Ahmed et al. 2013; Bobra et al. 2014), we select magnetic map data of ARs located within ±45° of the central meridian, and a maximum of 120 images are collected for each AR.
In previous studies, several magnetic field features have been associated with enhanced flare productivity (Bobra et al. 2014). The studies of Leka and Barnes (2003) and Bobra and Couvidat (2015) show that the three SHARP magnetic features TOTUSJZ, TOTUSJH, and TOTPOT are important for solar flare prediction. The SHARP magnetic features ABSNJZH, AREA_ACR, MEANPOT, R_VALUE, and SHRGT45 are considered important in Liu et al. (2017). In addition, the other two SHARP features, SAVNCPP and USFLUX, have shown strong predictive power in previous work. Therefore, we choose these 10 features as our predictive factors, as shown in Table 1, together with other non-feature information (T_REC: image time; NOAA_AR: AR number; NOAA_NUM: number of ARs; CLASS: AR category).
We build data sets to train, validate, and test the models. The data sets are prepared in the following way. (1) For each magnetogram sample of an AR, the No-flare (weaker than C1.0) label is attached to the magnetogram if the AR does not flare within 24 hr after this magnetogram sample.
(2) For each magnetogram sample of an AR, the corresponding flare label (i.e., C, M, or X) is attached to the magnetogram if a C/M/X class flare occurs within 24 hr after this magnetogram sample. It is worth noting that several ARs produce recurring flares of different classes within 24 hr. For the first flare of an AR, the corresponding flare label is assigned to the magnetogram samples within 24 hr after the observation time. For subsequent flares of the same AR, the corresponding labels are assigned to the magnetogram samples in the period from the end of the prior flare to the end of the current flare. (3) Our AR classification system is based on the maximum class of GOES flares produced by an AR and follows a four-class scheme (Yuan et al. 2010; Liu et al. 2017; Zheng et al. 2019; Li et al. 2020). In other words, ARs are categorized into four classes (i.e., No-flare, C, M, and X) according to the highest GOES class they produce: "Class = X" means that the AR produces at least one X-class flare; "Class = M" means at least one M-class flare but no X-class flares; "Class = C" means at least one C-class flare but no M/X-class flares; and "Class = No-flare" means that the AR only produces microflares weaker than C1.0. Finally, we collect 818 ARs, including 401 No-flare, 324 C-class, 82 M-class, and 11 X-class ARs. For the ≥M class task, samples of M/X-class ARs are defined as the positive class, while all the others are defined as the negative class. For the ≥C class task, samples of C/M/X-class ARs are defined as the positive class, while all the others are defined as the negative class.
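The AR-level classification and the binary positive/negative assignment above can be sketched in a few lines of Python; the function and data names are hypothetical, and the sketch deliberately omits the recurring-flare bookkeeping:

```python
# Sketch of the AR classification rule described above (hypothetical data layout).
# Each AR is assigned the class of the strongest GOES flare it produced,
# or "No-flare" if it only produced events weaker than C1.0.

CLASS_ORDER = ["No-flare", "C", "M", "X"]

def classify_ar(flare_classes):
    """flare_classes: GOES classes ('C', 'M', 'X') of all flares from one AR."""
    if not flare_classes:
        return "No-flare"
    # The AR class is the strongest flare class it produced.
    return max(flare_classes, key=CLASS_ORDER.index)

def binary_label(ar_class, threshold):
    """Positive if the AR class meets the >=C or >=M threshold."""
    return CLASS_ORDER.index(ar_class) >= CLASS_ORDER.index(threshold)
```

For example, an AR with flares [C, M, C] is a "Class = M" AR: positive for the ≥C task and also positive for the ≥M task.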
We adopt the method of shuffle-and-split cross-validation by ARs to divide the data sets, as suggested by Zheng et al. (2019). First, we randomly shuffle the AR numbers in the different classes (No-flare/C/M/X) and split them at a ratio of approximately 64%, 16%, and 20%. Next, we allocate the magnetogram samples belonging to 64% of the ARs to the training data set, those belonging to 16% of the ARs to the validating data set, and those belonging to 20% of the ARs to the testing data set. This ensures that the ARs and samples in the testing data set do not appear in the training or validating data sets, effectively verifying the validity and stability of the model. Finally, we construct a single data set containing the training, validating, and testing data sets. By repeating this process 10 times in total, we obtain 10 separate cross-validation data sets, each consisting of training, validating, and testing data sets. As indicated in Table 2, each of the 10 data sets has the same number of ARs and samples for each category used for training, validating, and testing. We note that the number of No-flare/C ARs is much higher than that of M/X ARs. By using the same data sets, we can also make a fair performance comparison of the different flare prediction models used in this paper.
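A minimal sketch of this shuffle-and-split procedure, assuming only that AR numbers have been grouped by class beforehand (the function name and seed handling are our own):

```python
import random

def split_ars(ars_by_class, seed=0):
    """Shuffle AR numbers within each class and split them ~64%/16%/20% into
    training, validating, and testing sets. The split is by AR, so no AR (and
    hence no magnetogram sample) appears in more than one of the three sets."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls, ar_numbers in ars_by_class.items():
        ars = list(ar_numbers)
        rng.shuffle(ars)               # shuffle within each class (stratified)
        n_train = int(0.64 * len(ars))
        n_valid = int(0.16 * len(ars))
        train += ars[:n_train]
        valid += ars[n_train:n_train + n_valid]
        test += ars[n_train + n_valid:]
    return train, valid, test
```

Calling this 10 times with different seeds yields the 10 cross-validation data sets; the magnetogram samples are then gathered per AR list.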

LSTM and BLSTM
Recurrent neural networks (RNNs) are neural networks with recurrent hidden states activated by the states at previous time steps. Thus, RNNs can dynamically model contextual information and can handle variable-length sequences. The LSTM is an RNN architecture that is now the dominant structure for RNNs. The LSTM alleviates the vanishing-gradient problem by replacing self-connected hidden units with memory blocks. Memory blocks use dedicated memory cells to store information and are better at finding and exploiting long-range context. Memory cells enable the network to learn when to acquire new information and when to forget old information. LSTM units consist of four components, as shown in Fig. 1: i is an input gate, which controls how much new memory content is added to the memory; f is a forget gate, which determines the amount of memory to be forgotten; o is an output gate, which modulates the amount of output memory content; and C is the cell state, which consists of a partially forgotten previous memory C_{t-1} and a modulated new memory C̃_t. The subscript t denotes the t-th time step. For the LSTM, the hidden state h_t given input x_t is computed as follows:

f_t = σ(W_f [h_{t-1}, x_t] + b_f),
i_t = σ(W_i [h_{t-1}, x_t] + b_i),
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C),
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t,
o_t = σ(W_o [h_{t-1}, x_t] + b_o),
h_t = o_t ⊙ tanh(C_t),

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, and the W and b terms are learnable weights and biases.

Fig. 1 The illustration of an LSTM unit.
Here, f_t is the forget gate, i_t is the input gate, o_t is the output gate, C_t is the cell state, x_t is the input vector to the LSTM unit, and h_t is the output vector of the LSTM unit. A standard LSTM layer is illustrated in Fig. 2(a). Such a layer can only make use of the historical context; the lack of future context may lead to an incomplete understanding of the problem. Therefore, the BLSTM was proposed to access both the forward and backward contexts by combining a forward hidden layer and a backward hidden layer, as shown in Fig. 2(b). The forward and backward passes on the unfolded network operate in a similar fashion to those on a regular network, with the difference that the BLSTM needs to unfold both the forward and backward hidden states over all time steps. The BLSTM layer is trained using backpropagation through time (BPTT) (Graves and Schmidhuber 2005).
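As an illustration, a single LSTM step following the standard gate equations can be written in NumPy as below; the stacked weight layout is an implementation choice for compactness, not something specified in the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H), stacking the weights for the
    input, forget, and output gates and the candidate memory; b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[:H])          # input gate i_t
    f = sigmoid(z[H:2*H])       # forget gate f_t
    o = sigmoid(z[2*H:3*H])     # output gate o_t
    g = np.tanh(z[3*H:])        # candidate memory C~_t
    c = f * c_prev + i * g      # new cell state C_t
    h = o * np.tanh(c)          # hidden state h_t
    return h, c
```

A BLSTM simply runs one such recurrence forward over the sequence and a second one backward, concatenating the two hidden states at each time step.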

Model composition
We use five models for flare prediction, as shown in Fig. 3, each consisting of an input layer, a feature extraction layer, and an output layer. The networks take the sequence (x_{t-m+1}, x_{t-m+2}, ..., x_t) as input and produce a two-dimensional vector [y_0, y_1] with a value of [1, 0] or [0, 1] as output, which is determined by the probability calculated by the softmax function in the output layer.

Fig. 3 The architecture of the models. Each model consists of an input layer, a feature extraction layer, and an output layer.

Input layer
Let x t represent the data sample collected at time point t. For each time point t, we take m consecutive data samples (x t−m+1 , x t−m+2 , . . . , x t ) from the training set and let the m consecutive data samples be the input to the network (in the study presented here, m is set to 120). Since these m samples originate from the same AR, the labels of the samples are all the same. Thus, if x t belongs to the positive class, the input sequence is defined as positive; otherwise, the sequence is defined as negative.
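The construction of input sequences can be sketched as follows; the helper name is hypothetical, and m defaults to the value used in this study:

```python
import numpy as np

def make_sequences(samples, m=120):
    """Stack m consecutive samples (x_{t-m+1}, ..., x_t) from one AR into
    model inputs of shape (m, n_features), one sequence per valid end time t.
    All sequences from an AR share that AR's label."""
    samples = np.asarray(samples)
    if len(samples) < m:
        # Not enough samples in this AR to form a single sequence.
        return np.empty((0, m, samples.shape[1]))
    return np.stack([samples[t - m + 1:t + 1]
                     for t in range(m - 1, len(samples))])
```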

Feature extraction layer
We extract features with five machine learning architectures. In the NN model, the input sequence is fed to a fully connected layer with 128 neurons. The LSTM and BLSTM models feed the sequence into an LSTM layer and a BLSTM layer, respectively, for feature extraction, as shown in Fig. 2.
The LSTM-A model feeds the data into the LSTM cells to obtain the hidden-layer outputs at different positions, as shown in Fig. 4. Following Luong et al. (2015), the weight of the hidden output h_i at position i is calculated as

e_i = W h_i,

where W contains learnable parameters. The softmax function is then used to normalize the weights, giving normalized weights

α_i = exp(e_i) / Σ_j exp(e_j).

The hidden outputs are weighted and summed with the normalized weights α_i to obtain a context vector c_t:

c_t = Σ_i α_i h_i.

The final attention vector v of the input sequence is derived by concatenating the context vector c_t with the last hidden state h_t and activating the result with a hyperbolic tangent layer:

v = tanh(W_c [c_t; h_t]),

where W_c also contains learnable parameters. This attention vector v is then passed through the layers shown in Fig. 3 to derive the predicted values.
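A simplified NumPy sketch of this attention computation is given below; for brevity it scores each hidden state with a single weight vector and omits the final learnable projection before the tanh activation, so it is an illustration of the mechanism rather than the exact trainable layer:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

def attention(H_states, w):
    """Attention over LSTM hidden states, as described in the text.
    H_states: (T, H) hidden outputs h_1..h_T; w: (H,) score weights.
    Returns the attention vector v and the normalized weights alpha."""
    e = H_states @ w                 # unnormalized scores e_i = w . h_i
    alpha = softmax(e)               # normalized weights alpha_i
    c_t = alpha @ H_states           # context vector c_t = sum_i alpha_i h_i
    h_t = H_states[-1]               # last hidden state
    v = np.tanh(np.concatenate([c_t, h_t]))  # simplified: no learnable W_c
    return v, alpha
```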
The BLSTM-A model replaces the LSTM in Fig. 3 with a BLSTM to extract features.

Output layer
There are two neurons in the output layer, activated by the softmax function. The output of the network, i.e., the predicted result, is a two-dimensional vector [y_0, y_1] with a value of [1, 0] or [0, 1], indicating that x_t is positive (i.e., the AR will produce a ≥C class or ≥M class flare, depending on the task, within the next 24 hr) or negative (i.e., it will not).

Training process
We train and validate the five models on the training and validating data sets, tracking learning performance during each epoch. Approximately 64% of the samples in each data set are used to train each prediction model, and approximately 16% are used as the validating data set. In total, we obtain 10 separate training and validating data sets. At the end of each epoch, we compute the loss function on the validating data set and select the model that minimizes the validating loss as the best trained model, similar to previous studies (Huang et al. 2018; Park et al. 2018; Zheng et al. 2019). Figures 5 and 6 show the learning curves of the five prediction models for ≥C class and ≥M class flares, respectively, giving the training loss and validating loss with respect to the number of epochs. The 10 differently colored curves indicate the variation of the training and validating loss with epochs for models trained and validated on the 10 different data sets. In our work, we choose the learning-rate parameters experimentally to make the curves as smooth as possible. From Figs. 5 and 6, we find that for all prediction models, the training loss and validating loss usually decrease as the number of epochs increases. By examining the learning curves, we find that none of the models encounter poor learning or severe overfitting.
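The model-selection rule above (keep the parameters with the lowest validating loss) can be sketched as a framework-agnostic loop; `train_step` and `validate` are hypothetical callables standing in for one epoch of training and a validation pass:

```python
def train_with_checkpointing(train_step, validate, n_epochs):
    """Minimal training-loop skeleton: after each epoch, compute the
    validating loss and keep the parameters with the lowest value."""
    best = {"epoch": None, "val_loss": float("inf"), "params": None}
    for epoch in range(n_epochs):
        params = train_step(epoch)          # one epoch of training
        val_loss = validate(params)         # loss on the validating set
        if val_loss < best["val_loss"]:     # checkpoint the best model so far
            best = {"epoch": epoch, "val_loss": val_loss, "params": params}
    return best
```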

Results
To evaluate these models, we describe the prediction results in terms of a confusion matrix, also known as a contingency table, for each AR class k. ARs that belong to class k and are correctly predicted as class k are true positives (TP), while they are false negatives (FN) if predicted as non-class k. ARs that do not belong to class k and are correctly predicted as non-class k are true negatives (TN), while they are false positives (FP) if predicted as class k. Many categorical metrics in the literature are calculated from these four quantities. We prefer the true skill statistic (TSS) to all the other metrics because it is insensitive to the class-imbalance ratio and thus best suited for comparison with other groups (Bobra and Ilonidis 2016). The TSS is defined as

TSS = TP/(TP + FN) − FP/(FP + TN).

The TSS ranges from −1 to 1, with 1 representing perfect prediction, 0 representing no skill, and −1 representing the worst score. Since the TSS is unbiased with respect to the class-imbalance rate, recent studies suggest it as the main metric for evaluating flare prediction models (Bloomfield et al. 2012; Nishizuka et al. 2018); we follow the suggestion of Bloomfield et al. (2012) and use the TSS as our assessment metric.
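The TSS computation from the four confusion-matrix counts is straightforward:

```python
def tss(tp, fn, fp, tn):
    """True skill statistic: TSS = TP/(TP+FN) - FP/(FP+TN),
    i.e., sensitivity + specificity - 1; insensitive to the
    class-imbalance ratio."""
    return tp / (tp + fn) - fp / (fp + tn)
```

A perfect classifier (no FN, no FP) scores 1, and a classifier that predicts positives and negatives at the same rate regardless of the true class scores 0.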
In this paper, we build 10 separate training, validating, and testing data sets using the shuffle-and-split cross-validation method. We develop five solar flare models, namely NN, LSTM, LSTM-A, BLSTM, and BLSTM-A, to predict whether a flare of ≥C class or ≥M class will occur in an active region in the next 24 hr. The mean and standard deviation of the TSS for these five frameworks at time step 120 on the 10 validating and testing data sets are shown in Table 3. As can be seen from Table 3, the four LSTM-based models significantly outperform the NN for ≥C class flare prediction. The average TSS of LSTM-A and BLSTM-A are 0.6904 and 0.6833, respectively. Therefore, we conclude that LSTM-A has the best performance in predicting ≥C class flares. When predicting ≥M class flares, we find that adding the attention mechanism to the LSTM and BLSTM models does not improve the performance. The average TSS of LSTM and BLSTM are 0.6579 and 0.6453, respectively, which are higher than those of the other three models. Figures 7 and 8 compare the TSS scores achieved by the five methods on each data set for the ≥C class and ≥M class, respectively. In Fig. 7, we can see that for the ≥C class, the four LSTM-based models significantly outperform the NN, and adding the attention mechanism to the LSTM and BLSTM models improves the performance. In Fig. 8, we can see that for the ≥M class, LSTM and BLSTM perform better than the other models on most data sets.
In this study, we analyze the importance of each of the 10 features based on the best-performing models: the LSTM-A model for the ≥C class and the LSTM model for the ≥M class. Figures 9 and 10 show the feature-importance plots of the LSTM-A model for ≥C class flare prediction and the LSTM model for ≥M class flare prediction, respectively. In each figure, the blue bars represent the average individual TSS score for each of the 10 features, and the red and yellow lines respectively represent the cumulative TSS scores after removing the j most and least important features. Error bars, representing standard deviations, are also plotted. In Fig. 9(a), we find that for ≥C class flare prediction, R_VALUE, USFLUX, TOTUSJH, and TOTUSJZ are the more important features, while AREA_ACR and SHRGT45 are the two least important features. As can be seen from Fig. 9(b), although TOTPOT is a less important feature, the TSS drops steeply after removing TOTPOT. When removing the least important features in turn, we can see in Fig. 9(c) that the TSS does not change significantly. Even when only the single feature USFLUX is available, the TSS is 0.6413, which shows that USFLUX is an extremely important feature. SHRGT45 is considered important in Bobra and Couvidat (2015) and Liu et al. (2017), but it is ranked lowest in our list. Using all 10 features together does not produce the highest average TSS score; instead, the highest TSS score is produced when the least important feature, SHRGT45, is removed. This may be because SHRGT45 becomes a noisy feature when combined with the other nine features in this paper, and using it may deteriorate the performance of the model. For ≥M class flare prediction, we can see from Fig. 10(a) that R_VALUE is the most important feature and SHRGT45 is the least important. As can be seen from Fig. 10(b), although TOTUSJH is a less important feature, the TSS drops steeply after removing TOTUSJH.
When removing the least important features in turn, we can see that the TSS does not change significantly, as in Fig. 9(c). It is evident that the feature R_VALUE holds significant importance, because the resulting model achieves a remarkably high TSS score of 0.6109 using only this single feature. By comparing Figs. 9 and 10, we find that TOTUSJZ, ABSNJZH, and R_VALUE are top-scoring features for both ≥C class and ≥M class prediction, and these three features are also considered important in Bobra and Couvidat (2015) and Liu et al. (2017). When the most important features are removed successively, the performance of the model does not change much over the first seven steps. This is because these top features are highly correlated physically. Sinha et al. (2022) confirmed the correlation between these physical features and suggested that one possible reason behind the high correlation could be that they are extrinsic features.

Discussion and conclusions
We develop five models to predict whether a flare of ≥C class or ≥M class will occur within 24 hr. We use the SHARP LOS images provided by the SDO/HMI team (Schou et al. 2012), making it convenient to obtain the data and the desired features therein. The samples are labeled with flare information (i.e., No-flare/C/M/X) using the solar flare event catalogue provided by the Geostationary Operational Environmental Satellite and the Solar Geophysical Data solar event reports. In addition, we use the cross-validation method to generate 10 cross-validation sets from the data. In each of the 10 data sets, we segregate the entire data set into training, validating, and testing data sets by AR number to simulate real-time flare prediction, which is better suited to model evaluation and flare prediction. The proportions of the training, validating, and testing data sets are 64%, 16%, and 20%, respectively. Through extensive experiments, we use the TSS to evaluate the performance of the five models and compare them with closely related machine learning methods. The main results are summarized as follows.
(1) The TSS scores for ≥C class prediction are 0.5472 ± 0.0809, 0.6425 ± 0.0685, 0.6904 ± 0.0575, 0.6681 ± 0.0573, and 0.6833 ± 0.0531 for NN, LSTM, LSTM-A, BLSTM, and BLSTM-A, respectively, and the corresponding scores for ≥M class prediction are 0.5723 ± 0.1139, 0.6579 ± 0.0758, 0.5943 ± 0.0712, 0.6493 ± 0.0826, and 0.5932 ± 0.0723. The deep learning models generally outperform the NN, and LSTM-A performs best for the ≥C class.
(2) For ≥C class flare prediction, the BLSTM model achieves better results than the LSTM model, and adding attention improves the model performance to 0.6833 ± 0.0531, only slightly lower than that of the best model (i.e., LSTM-A). For ≥M class flare prediction, the BLSTM model yields slightly lower performance than the LSTM model, and adding the attention mechanism does not improve the performance, probably because the reinforcement focuses on the MEANPOT and SHRGT45 features, which have the lowest individual TSS scores in the prediction.
(3) We assess the significance of all 10 features used in this study. Our experimental results show that the importance of SHRGT45 is much lower than that of the other features. For predicting whether an AR will generate a ≥C class flare, using only 9 features achieves better performance than using all 10.
In this paper, we evaluate various machine learning models for predicting solar flares and find that some of them achieve good performance, but there is still much room for improvement. First, the insufficient number of X-class ARs and samples leads to a serious class-imbalance problem, which makes learning of the X class poor. Generative adversarial networks (GANs; Kim et al. 2019) could be used to obtain more X-class AR samples. Second, the data set in this paper is based on 10 physical features extracted from images. Convolutional neural networks (CNNs) can be used to extract multidimensional data to improve the generalization ability of the model. Our next step is to develop models that process time-series images using CNN and LSTM networks to further improve the performance of flare prediction.