This study plotted an IRC map for Chungcheongbuk-do using the FR and machine learning approaches, including the group method of data handling (GMDH), convolutional neural network (CNN), and long short-term memory (LSTM). The model prediction accuracy was evaluated using the area under the receiver operating characteristic curve (AUROC). Figure 6 shows the methodological steps used in this study.
2.2.1 Identifying influencing factors of IRC variation using the FR
The FR is used to identify potential statistical relationships between a given phenomenon and its associated variables (Lee and Talib 2005). This study determined the FR of each variable related to IRC and categorized the variables by class. In the correlation analysis, FR expresses the proportion of the study area represented by IRC. The presence or absence of high radon must first be defined: areas were assigned 1 or 0 depending on whether the IRC exceeded a threshold of 148 Bq/m3, indicating high and low radon concentrations, respectively. FR was then calculated using the areas assigned 1 for the indoor radon index in Chungcheongbuk-do. The FR approach was used to evaluate the correlation between the indoor radon index and various factors, and was calculated by dividing the proportion of radon occurrences within each subclass by the proportion of radon occurrences across the entire study area, as shown in Eq. 1 (Huang et al. 2020).
$$FR=\frac{{N}_{\left({I}_{i}\right)}/{N}_{\left({F}_{i}\right)}}{{N}_{\left(I\right)}/{N}_{\left(A\right)}}$$
1
where \({N}_{\left({I}_{i}\right)}\) is the number of IRC pixels in subclass i of the factor; \({N}_{\left({F}_{i}\right)}\) is the total number of pixels in subclass i; \({N}_{\left(I\right)}\) is the total number of IRC pixels for the factor; and \({N}_{\left(A\right)}\) is the total number of pixels in the study area.
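As an illustrative sketch (not the authors' implementation), Eq. 1 can be computed per subclass from a grid of cells, each carrying a hypothetical factor subclass label and a binary radon flag (1 if IRC ≥ 148 Bq/m3, 0 otherwise):

```python
from collections import Counter

def frequency_ratio(subclasses, radon_flags):
    """Return FR per subclass: (N_Ii / N_Fi) / (N_I / N_A)."""
    n_a = len(subclasses)                  # N_(A): total cells in study area
    n_i = sum(radon_flags)                 # N_(I): total high-radon cells
    n_fi = Counter(subclasses)             # N_(Fi): cells per subclass
    n_ii = Counter(s for s, r in zip(subclasses, radon_flags) if r == 1)
    overall = n_i / n_a
    return {s: (n_ii.get(s, 0) / n_fi[s]) / overall for s in n_fi}

# Hypothetical 10-cell grid with two lithology subclasses
subs = ["granite"] * 4 + ["shale"] * 6
flags = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
fr = frequency_ratio(subs, flags)  # granite is radon-prone here, FR > 1
```

An FR above 1 indicates that a subclass is over-represented among high-radon cells relative to its share of the study area.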
2.2.2 Model description
Frequency ratio
The FR model is a statistical model used to analyze the frequency of events or patterns in a dataset. The IRC map was developed using the FR approach, with the weighted sum tool applied in an integrated analysis to generate the FR map (Jana et al. 2019). The weighted sum tool allows strategic weighting and amalgamation of various factors to produce an IRC map.
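A minimal sketch of the weighted-sum overlay: each factor contributes an FR value per cell, and the IRC index is the weighted sum of those layers. The layer names and weights below are illustrative only, not the values used in the study.

```python
def weighted_sum(layers, weights):
    """layers: dict of factor name -> per-cell FR values (equal length)."""
    n = len(next(iter(layers.values())))
    index = [0.0] * n
    for name, values in layers.items():
        w = weights[name]
        for i, v in enumerate(values):
            index[i] += w * v  # accumulate weighted FR per cell
    return index

# Hypothetical three-cell map with two factor layers
layers = {
    "lithology": [1.9, 0.4, 1.2],
    "soil":      [0.8, 1.5, 1.0],
}
weights = {"lithology": 0.6, "soil": 0.4}
irc_index = weighted_sum(layers, weights)
```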
Group method of data handling
The GMDH is a robust approach to mathematical modeling and data analysis developed by Alexey G. Ivakhnenko in the 1970s that has been applied in various fields (Ivakhnenko 1970). The GMDH algorithm uses a self-organization principle to identify the optimal model complexity by systematically evaluating numerous candidate models against specified criteria (Ivakhnenko 1978). The algorithm supports multiple reference functions, including linear, polynomial, and ratio-polynomial variants, which allow it to handle a range of problems and improve prediction accuracy (Ivakhnenko and Ivakhnenko 2000). The relationship between input and output variables can be described by a complex discrete form of the Volterra functional series, commonly referred to as the Kolmogorov-Gabor polynomial (Farlow 1984); Eq. 2 gives this relationship between the input and output variables of the model. GMDH starts with a set of input data in which each variable represents a feature or attribute of the dataset. The variables are grouped into sets or layers, and within each layer, mathematical models are developed to represent the relationships among variables. The process continues until a predefined stopping criterion is met or the best possible model is achieved, and the resulting model can be used to make predictions on new data.
$$y={a}_{0}+\sum _{i=1}^{n}{a}_{i}{x}_{i}+\sum _{i=1}^{n}\sum _{j=1}^{n}{a}_{ij}{x}_{i}{x}_{j}+\sum _{i=1}^{n}\sum _{j=1}^{n}\sum _{k=1}^{n}{a}_{ijk}{x}_{i}{x}_{j}{x}_{k}+\dots$$
2
where \(y\) is the prediction result; \(x\) is the vector of input variables; \(a\) denotes the coefficients calculated using the least-squares error approach; and \(n\) is the number of input variables.
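As an illustrative sketch (not the authors' implementation), a single GMDH "neuron" can be modeled as a second-order Kolmogorov-Gabor polynomial in two inputs, fitted by least squares via the normal equations; the full GMDH would build many such neurons layer by layer and retain the best by an external validation criterion. The data below are synthetic.

```python
def design_row(x1, x2):
    # Second-order Kolmogorov-Gabor terms for two inputs
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def solve(A, b):
    """Gaussian elimination with partial pivoting (A is n x n)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_neuron(X, y):
    """Least squares: solve (Phi^T Phi) a = Phi^T y for the coefficients a."""
    Phi = [design_row(x1, x2) for x1, x2 in X]
    k = 6
    AtA = [[sum(row[r] * row[c] for row in Phi) for c in range(k)] for r in range(k)]
    Atb = [sum(Phi[i][r] * y[i] for i in range(len(Phi))) for r in range(k)]
    return solve(AtA, Atb)

def predict(a, x1, x2):
    return sum(ai * pi for ai, pi in zip(a, design_row(x1, x2)))

# Synthetic data generated from a known polynomial; the fit recovers it
X = [(float(i), float(j)) for i in range(4) for j in range(4)]
y = [2 + x1 - 0.5 * x2 + 0.1 * x1 * x1 for x1, x2 in X]
a = fit_neuron(X, y)
```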
Convolutional neural network
A CNN is a deep-learning algorithm that belongs to the broader category of machine learning approaches. Deep learning is a specialized area of machine learning that emphasizes the use of multi-layered artificial neural networks; a CNN is distinguished by layers dedicated to the convolution operation. The fundamental architecture of the CNN model consists of convolutional, pooling, and fully connected layers (Lecun et al. 1998; Yamashita et al. 2018).
The input layer serves as the initial interface for raw data, converting it into numerical form, often expressed as a tensor (a multi-dimensional data array). The core part of a CNN is the convolutional layer, where convolution operations are performed on the input data using a set of adaptable filters, also known as kernels (Thi Ngo et al. 2021). After the convolution operations, an activation function is applied element-wise to the output of the convolutional layer. This function injects non-linearity into the model, which is essential for the network's ability to capture and represent complex data relationships (Berkani et al. 2023).
Pooling operations reduce computational demand and bolster the model's invariance to shifts in position. By downsampling the feature maps from the convolutional layers, the pooling layer condenses the dimensionality of the data (Barata et al. 2019). Common techniques include max pooling, which isolates the highest value within a designated subsection, and average pooling, which computes the sectional average (Chawshin et al. 2022).
The fully connected layer plays a crucial role in rendering predictions by integrating the high-level features derived from antecedent layers (Ma et al. 2021). Meanwhile, the output layer in a CNN, responsible for producing the ultimate output, generally comprises neurons that align with each of the defined categories.
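A minimal sketch, in pure Python, of the core CNN operations described above: a valid 2-D convolution (implemented as cross-correlation, as in most deep-learning libraries), a ReLU activation applied element-wise, and 2x2 max pooling. This is illustrative only, not the network used in the study.

```python
def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def relu(fmap):
    # Element-wise non-linearity: negative activations become zero
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    # Keep the maximum of each non-overlapping 2x2 window
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# Toy 4x4 input; this kernel simply picks out the center pixel
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ker = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
fm = relu(conv2d(img, ker))     # 2x2 feature map
pooled = max_pool2x2(fm)        # 1x1 after pooling
```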
Long short-term memory
LSTM is another type of deep neural network algorithm in which the network output is fed back into the network as subsequent input (Kong et al. 2019). The structural framework of LSTM networks excels in identifying intricate spatial configurations and temporal trends across diverse settings. LSTM networks are structured around unique memory units that incorporate gating mechanisms, which enable the model to preserve and modify data over extended sequences, ensuring crucial information is not discarded.
The key components of an LSTM cell are the input gate, forget gate, cell state, output gate, and cell output (Graves 2012). The input gate manages how new data are assimilated into the cell state, effectively serving as a filter for incoming information. The forget gate assesses which parts of the stored information are no longer pertinent and should be removed, taking into account both the new input and the preceding hidden state. The candidate cell state, which holds potential information for addition to the cell state, depends on the decisions made by both the input and forget gates. The cell state is then updated with this vetted information, encapsulating the essence of the input sequence at that time step. The output gate dictates how much of the updated cell state is conveyed to the hidden state, modulating the balance between memory retention and forgetting across time steps. This selective information management within the LSTM's memory cells maintains a continuous and relevant data stream across prolonged sequences, mitigating information degradation over extended periods.
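The gate mechanics described above can be sketched as a single scalar LSTM step (an illustration of the standard formulation, not the study's trained model; all weights are illustrative placeholders):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM time step; W maps gate name -> (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = W[name]
        return squash(w_x * x + w_h * h_prev + b)
    i = gate("input", sigmoid)    # input gate: how much new info to admit
    f = gate("forget", sigmoid)   # forget gate: how much old state to keep
    g = gate("cand", math.tanh)   # candidate cell state
    o = gate("output", sigmoid)   # output gate: how much state to expose
    c = f * c_prev + i * g        # updated cell state
    h = o * math.tanh(c)          # cell output / hidden state
    return h, c

# With all-zero weights each sigmoid gate opens halfway, so the cell
# retains exactly half of its previous state
W = {g: (0.0, 0.0, 0.0) for g in ("input", "forget", "cand", "output")}
h, c = lstm_step(1.0, 0.0, 2.0, W)
```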
2.2.3 Model performance assessment using AUROC
The AUROC is a powerful metric for evaluating the accuracy of predictions made by machine learning models, and it was used here to validate the indoor radon index potential maps. For validation, the radon data were split randomly into training (70%) and testing (30%) sets, and the radon index maps were generated using the training data. In the training and testing phases, the AUROC serves as a performance indicator for evaluating the efficacy of the FR and machine learning algorithms (Bradley 1997). The ROC curve expresses the predictive index values from the prediction map as the ratio of radon data locations to the total area: the x-axis shows the cumulative percentage of areas ranked from the highest indoor radon index, and the y-axis shows the cumulative percentage of training or testing radon data captured (Pencina et al. 2008). The model's predictive performance is categorized into five ranges based on the AUC: fail (0.5–0.6), poor (0.6–0.7), fair (0.7–0.8), good (0.8–0.9), and excellent (0.9–1.0) (Carter et al. 2016). The AUROC is used to assess model performance in many fields, such as flood susceptibility maps (Dodangeh et al. 2020), groundwater potential (Panahi et al. 2020), habitat potential maps (Widya et al. 2023), and radon distribution maps (Rezaie et al. 2022).
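A minimal sketch of computing the AUROC via its rank (Mann-Whitney) formulation: the probability that a randomly chosen positive location scores higher than a randomly chosen negative one, with ties counted as half. The scores and labels below are illustrative.

```python
def auroc(scores, labels):
    """AUROC for binary labels (1 = radon present, 0 = absent)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Pairwise comparisons: a full win counts 1, a tie counts 0.5
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking gives 1.0; a model no better than chance gives about 0.5, matching the "fail" band of the classification above.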