Background subtraction via regional multi-feature-frequency model in complex scenes

Although a wide variety of background subtraction methods has been proposed in recent years, none has fully addressed multi-scale moving objects and dynamic backgrounds in real surveillance tasks. In this paper, a novel and effective background subtraction method, named regional multi-feature-frequency (RMFF), is proposed to detect multi-scale moving objects under dynamic backgrounds. Unlike many existing methods that construct background models from simple multi-feature combinations, RMFF exploits the spatiotemporal cues of multiple features as well as superpixels at each scale, allowing more robust information to be exploited for background modeling. Specifically, the spatial relationship between pixels in a neighborhood and the frequencies of features over time are first exploited, enabling accurate detection of moving objects while ignoring most dynamic background changes. Then, multi-scale superpixels are used to exploit the structural information present in real-world scenes, further enhancing robustness to multi-scale objects and environmental variations. Finally, an adaptive strategy is employed to dynamically adjust the foreground/background segmentation threshold for each region without user intervention. This adaptive threshold is defined for each region separately and adjusts dynamically based on continuous monitoring of background changes, thereby effectively reducing potential segmentation noise. Experiments on the 2014 version of the ChangeDetection.net dataset demonstrate that the proposed method outperforms 12 state-of-the-art algorithms in terms of overall F-Measure and performs effectively in many complex scenes. These results verify that the developed approach is feasible and useful for robust application in practical video surveillance.


Introduction
Background subtraction (BS) is a fundamental research topic in computer vision, with applications in a broad range of domains, such as moving object detection (Mahalingam et al. 2019), object tracking (Sudha et al. 2020), object classification, medical image enhancement (Song et al. 2019), and video surveillance (Chen et al. 2019a, b), to name a few (Sobral et al. 2014; Garcia-Garcia et al. 2020). BS has been studied extensively, and many ingenious algorithms have been proposed in the past decade. Here, we broadly divide the related methods into two categories according to their representations: pixel-level methods and region-level methods.
Pixel-level methods are the most popular in practical applications, since they have been shown to accurately extract the shape of moving objects and efficiently distinguish between foreground and background pixels. The landmark work in this category models the distribution of each pixel value over time as a Gaussian mixture model (GMM) (Stauffer et al. 1999). Inspired by the effectiveness of this approach in sustaining background variations, its probabilistic model, and its online updating scheme, a plethora of improved methods have been proposed over the last decade (Akilan et al. 2018; Boulmerka et al. 2018). Although these GMM-based algorithms have obtained some impressive quantitative results, they have proved to be computationally expensive and may fail to address high-frequency variations. To deal with the limitations of algorithms that construct background models using Gaussian parameters, non-parametric models using kernel density estimation (KDE) have been proposed to estimate the multimodal distribution of the pixels (Elgammal et al. 2000; Chen et al. 2019a, b; Liao et al. 2010; Panda et al. 2021). Unlike parametric models, these methods estimate background probability density functions at individual pixel locations based on recent intensity observations. However, KDE-based algorithms are computationally expensive and thus unsuitable for real-time operation. Instead of choosing a background KDE model, other non-parametric methods, such as SACON (Wang et al. 2007) and the visual background extractor (ViBe) (Barnich et al. 2011), compute a sample consensus over the background samples of each pixel, and have proven to be robust in different types of background scenes. While these methods and their variants (St-Charles et al. 2015; Jiang et al. 2018; He et al. 2019) have shown effective performance in terms of speed and memory, they are sensitive to sudden illumination variations and frequent pixel changes.
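To make the per-pixel GMM idea concrete, the following is a minimal sketch in the spirit of the Stauffer-Grimson scheme, not the cited implementations: each pixel keeps a small set of Gaussians, a new intensity either updates a matched Gaussian or replaces the weakest one. The parameter names (`alpha`, `match_sigmas`) and the single-channel simplification are our own assumptions.

```python
# Illustrative per-pixel Gaussian mixture update (Stauffer-Grimson style);
# a simplified sketch, not the authors' or any cited paper's code.
from dataclasses import dataclass

@dataclass
class Gaussian:
    mean: float
    var: float
    weight: float

def update_pixel_model(models, x, alpha=0.05, match_sigmas=2.5):
    """Update one pixel's mixture with a new intensity x; return True if
    x matched an existing Gaussian (i.e., the pixel looks like background)."""
    for g in models:
        if (x - g.mean) ** 2 < (match_sigmas ** 2) * g.var:
            rho = alpha  # simplified: reuse alpha as the second learning rate
            g.weight = (1 - alpha) * g.weight + alpha
            g.mean = (1 - rho) * g.mean + rho * x
            g.var = (1 - rho) * g.var + rho * (x - g.mean) ** 2
            matched = True
            break
    else:
        # no match: replace the lowest-weight Gaussian with a new one on x
        weakest = min(models, key=lambda g: g.weight)
        weakest.mean, weakest.var, weakest.weight = x, 100.0, alpha
        matched = False
    total = sum(g.weight for g in models)  # renormalize the mixture weights
    for g in models:
        g.weight /= total
    return matched
```

A pixel observing a steady value repeatedly reinforces one Gaussian; an outlier value fails the 2.5-sigma test and spawns a low-weight candidate, which is exactly the behavior the GMM-based methods above exploit.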
Along with these algorithms, several other pixel-level approaches, such as histogram-based methods (Roy et al. 2018; He et al. 2021) and pixel distribution-based methods (Lin et al. 2020; Zhao et al. 2022; He et al. 2022), have been shown to perform well in the face of shadow disturbance and brightness changes.
While these pixel-level approaches can distinguish foreground and background pixels accurately, they may fail under drastic background changes, especially highly dynamic backgrounds, because they ignore the spatial correlations between pixels. Researchers have therefore focused on region-level methods, which take inter-pixel relations into consideration and show reasonable results on different datasets. A group of weighted local binary pattern (LBP) histograms is used to construct a background model for each region (Heikkila et al. 2004). A region-level mixture of Gaussians (RMoG) model was proposed to exploit the relationship between pixels in a neighborhood (Varadarajan et al. 2013, 2015). As a typical fuzzy-based BS method, FCDH (Panda et al. 2016) measures the color difference between a pixel and its neighbors in a small local neighborhood. These regional methods usually divide an image into blocks and have produced good results on real-world videos, but one drawback is that they may overlook the structural information present in real scenes. The shape of foreground objects is, therefore, often inaccurate, limiting the practical application of these methods. To overcome this problem, other region-level methods based on superpixels were presented, effectively increasing spatial coherency. Chen et al. (2018) took advantage of superpixel segmentation, spanning trees, and optical flow, and combined them with a GMM scheme to detect moving objects.
A superpixel-based density estimation algorithm (Lim et al. 2014) was introduced for captured videos, which combines appearance and motion models of the foreground and background, and determines pixelwise labels using binary belief propagation. Other superpixel-based background models, such as a superpixel-based matrix decomposition algorithm (Javed et al. 2015), a multi-scale superpixel framework (Zhao et al. 2016), and an effective superpixel background model (Chen et al. 2020), were proposed to alleviate the effects of noise, adverse weather, and dynamic scenes. To date, many other ingenious region-level methods have also been put forward to model the spatial structures of scenes, such as spatiotemporal frameworks (Cores et al. 2023), graph learning methods (Giraldo et al. 2022), and frame-based methods (Liu et al. 2022).
Although BS has been widely discussed and many ingenious theories and techniques have been proposed in the last few decades, achieving high foreground detection accuracy at low computational complexity is still difficult, especially in complex scenarios, due to challenges such as dynamic backgrounds, illumination changes, and bad weather. One important reason for the poor practical performance in such scenes is that most current approaches are pixel-level, and construct background models of observed scenes as sets of independent pixel processes, thereby overlooking the spatial correlations between pixels. To alleviate this problem, many region-level methods have been proposed to model the spatial structures of scenes. Generally, methods in this category take into account the spatial connections between neighboring pixels to refine the raw pixel-level classification. However, these methods may fail in highly dynamic scenes, especially when coping with change events at various speeds.
To address the aforementioned problems, Yang et al. (2018) presented a background subtraction algorithm, named stability of adaptive feature (SoAF), which exploits both pixel-level and region-level features to create background models. While SoAF obtained impressive performance in some controlled environments, its performance in challenging situations, such as bad weather, low framerate, night videos, thermal imagery, and turbulence, is still far from satisfactory. Three main issues require discussion. First, SoAF treats each pixel independently, so few spatial relationships between neighboring pixels are used in background modeling, which makes it sensitive to dynamic backgrounds, illumination variations, and scene noise. Second, SoAF does not take into account much of the structural information present in real scenes, which limits its performance in complex scenes. Third, the complicated ''double-threshold'' strategy of SoAF involves tedious threshold tuning and cannot fully adapt to complex scenes; it was defined under the assumption that all observations behave similarly throughout the analyzed sequences, which is rarely true.
In this paper, the above three issues are considered in a unified framework, and a new foreground detection model named regional multi-feature-frequency (RMFF) is proposed. The main contributions of this paper can be summarized as follows.
First, spatial correlations between neighboring pixels are taken into consideration to solve the first problem. A real observed scene often has a non-static background due to illumination changes, noise, or the motion of the background itself. Background motion tends to repeat periodically over time within a neighborhood region, owing to the presence of objects such as raindrops, snowflakes, moving escalators, swaying trees, or water ripples. Neighboring pixels, therefore, often share similar temporal distributions. Unlike SoAF, which treats each pixel individually, the proposed method explicitly considers the spatiotemporal coherence between neighboring pixels, so that more robust local information can be exploited for background representation, making this approach more robust and accurate than SoAF.
Second, multi-scale superpixels are adopted to exploit structural information and cope with the second issue. The multi-scale superpixel strategy combines information from neighboring pixels and ensures that uniform regions are assigned homogeneous labels, further improving the accuracy of foreground segmentation. As discussed in Sect. 3.2, with the advantages of multi-scale superpixels, we achieved encouraging performance in detecting multi-scale moving objects in complex scenes such as bad weather, dynamic backgrounds, and low framerate.
Third, to handle the third issue, we developed an adaptive threshold tuning mechanism based on multi-feature-frequency to distinguish background from foreground. In SoAF, the ''double-threshold'' strategy used to determine whether an observed pixel belongs to the foreground or background involves tedious threshold tuning and cannot fully adapt to different scenes. Moreover, it assumes that all observations behave similarly throughout the analyzed sequences, which is rarely the case. Instead of a fixed double threshold, the third contribution of this paper is an adaptive threshold tuning mechanism. As discussed in Sect. 3.2, the proposed adaptive threshold strategy ultimately produces more robust and accurate results than a fixed global threshold. This paper is organized as follows: Sect. 2 introduces the proposed method, which includes feature extraction, region-level background modeling, foreground detection, the adaptive threshold, and the multi-scale superpixel technique. Section 3 describes the experiments and discussions. Finally, conclusions are drawn in Sect. 4.
Modeling the background with multi-feature-frequency
While pixel-level methods have achieved impressive performance, there are three major issues that we address in this work. The first is that these methods ignore pixel dependency when creating a background model, which does not reflect most practical situations. By developing a region-level background model, in which pixels are aggregated into r × r blocks, we address this issue by taking spatial connections into consideration. The proposed RMFF also contains the pixel-level approach as a special case, obtained when a region degenerates to a single pixel. The second issue we address is that many algorithms classify individual pixels as background or foreground using a simple fixed decision threshold, which is ineffective in most practical situations. Given the diversity of scene backgrounds and applications, the decision threshold for dynamic regions should ideally be increased, so as not to include background motion in the foreground; in contrast, for static regions, the decision threshold should be low, so that slight but interesting changes lead to a foreground decision. An adaptive threshold strategy is, therefore, required, which changes dynamically depending on the characteristics of the scene changes. The essential and novel idea of the proposed RMFF is that the per-region threshold changes dynamically based on an estimate of the background dynamics. The third issue is that many approaches do not take structural information into consideration, and are thus not robust to noise. By integrating multi-scale superpixels, we exploit structural information and spatial coherency to model background changes. The pipeline of the proposed approach is illustrated in Fig. 1.
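As a toy illustration of the r × r aggregation just described, the sketch below (our own, not the authors' code) averages each non-overlapping block of a 2-D image; for simplicity the image height and width are assumed to be divisible by r.

```python
# Aggregate a 2-D pixel grid into non-overlapping r x r regions by their
# mean value; an illustrative sketch of the region-level representation.
def block_means(img, r):
    """Average each non-overlapping r x r block of a 2-D image
    (a list of equal-length rows); dimensions assumed divisible by r."""
    h, w = len(img), len(img[0])
    out = []
    for by in range(0, h, r):
        row = []
        for bx in range(0, w, r):
            vals = [img[y][x]
                    for y in range(by, by + r)
                    for x in range(bx, bx + r)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out
```

With r = 1 this degenerates to the original pixel grid, matching the remark that the pixel-level method is a special case of RMFF.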
In this section, we present the methodology of RMFF from the following aspects: First, we introduce in Sect. 2.1 the way in which individual regions are characterized using multi-features; then, we show in Sect. 2.2 how these features can be used in a frequency-based model; next, we present in Sect. 2.3 how the background model can be used and updated, resulting in effective foreground detection; next, we discuss in Sect. 2.4 how the spatial coherency and the overall quality of foreground are increased using multi-scale superpixels technique; and finally, we present in Sect. 2.5 how model sensitivity and adaptation is improved using the adaptive threshold strategy.

Feature extraction
Unlike pixel-level methods that process each pixel independently, RMFF exploits interpixel relationships to deal with specific background changes such as dynamic backgrounds, illumination changes, and adverse weather. Our background model is described by color, texture, and gray difference features.
Algorithm 1: main steps of the RMFF background model update.
  for each region
    calculate the mean value of each feature
    for each feature in the feature vector
      seek the matched background model by Equation (5)
      if a match is found, calculate the frequency using Equation (9)
      update the model using Equation (6)

The selection of the color space is crucial to the accuracy and robustness of foreground extraction. RGB is one of the most widely used color spaces in background modeling. Perhaps its most important properties are its robustness to both environmental and camera noise and its direct usage without the computational cost of color conversion (Sajid et al. 2017). Based on its good performance in background modeling, RGB is a natural choice for foreground/background classification in the RMFF model. However, the RGB space has some potential problems: it is sensitive to illumination changes and to the shadows of moving objects. To deal with these problems, we incorporate scale-invariant local ternary patterns (SILTP) (Liao et al. 2010) into the feature vector. Although SILTP has been shown to be very robust to illumination variations and moving soft shadows, it is not robust to frequent pixel changes or to scenes lacking texture. By combining RGB with SILTP in the proposed background model, we bring together the advantages of both features and compensate for their defects. In addition, inspired by GraphMOS (Giraldo et al. 2022), we further add the absolute value of the gray difference between the current frame and the temporal median image of the video sequence to our feature vector, to enhance contrast and reduce background noise. Here, each pixel value of the temporal median image is the middle value, in sorted order, of all the pixel values at the corresponding position of the video sequence, obtained using a temporal median filter. The idea of choosing the gray difference value arose from the observation that illumination variations and irrelevant dynamic backgrounds can be attenuated efficiently by considering the difference between the current frame and the median image. As shown in Fig. 2, the gray difference feature (GDF) and the gray feature (GF) achieve almost the same precision scores (third row) on the three videos, but the robustness improvement of GDF over GF is validated in both the recall (second row) and F-Measure (fourth row) plots. Overall, the GDF outperforms the GF on these video sequences.
In summary, as mentioned above, the proposed method chooses L (L = 5) features: red, green, blue, SILTP, and GDF, to construct the feature vector, as follows:

F = {f_red, f_green, f_blue, f_SILTP, f_GDF}  (1)
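The two non-color features can be sketched as follows. This is an illustrative simplification, not the authors' code: the 4-neighborhood, the base-3 encoding, and the `tau` value are our own choices (Liao et al.'s SILTP uses a configurable neighborhood and a two-bit encoding per neighbor, which this digit scheme mirrors).

```python
# Minimal sketches of SILTP (texture) and the gray-difference feature (GDF).
def siltp_code(img, y, x, tau=0.05):
    """Encode the 4-neighborhood of pixel (y, x) as a base-3 SILTP code:
    digit 2 if the neighbor exceeds (1 + tau) * center,
    digit 1 if it falls below (1 - tau) * center, digit 0 otherwise."""
    c = img[y][x]
    code = 0
    for dy, dx in ((-1, 0), (0, 1), (1, 0), (0, -1)):
        n = img[y + dy][x + dx]
        if n > (1 + tau) * c:
            digit = 2
        elif n < (1 - tau) * c:
            digit = 1
        else:
            digit = 0
        code = code * 3 + digit
    return code

def gray_difference(frame, median_img):
    """GDF: absolute difference between the current frame and the
    temporal median image of the sequence (both as 2-D lists)."""
    return [[abs(a - b) for a, b in zip(ra, rb)]
            for ra, rb in zip(frame, median_img)]
```

Because SILTP thresholds each neighbor relative to the center value, a global illumination scaling leaves the code unchanged, which is the scale-invariance property motivating its use here.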

Region-level background modeling
Background modeling plays the most important role in the background subtraction process. In this work, a regional background model is constructed using multi-feature-frequency, which takes inter-pixel dependencies into consideration. The multi-scale superpixel technique exploits structural information, thereby preserving the spatial homogeneity of the foreground. An adaptive foreground/background decision threshold strategy, which can be dynamically adjusted for each region over time, is also introduced in this model. These strategies enable our method to achieve high detection accuracy in complex scenes. In SoAF, the authors presented a pixel-level background subtraction method, which uses the stabilities of multiple features to detect moving foreground objects. In contrast, we create a background model with region-level frequencies rather than pixel-level stabilities, offering a way to adapt to complex scenes with dynamic backgrounds, illumination changes, or bad weather, while also achieving competitive performance in all other scenes. In the proposed method, the importance of each feature to the background model is evaluated quantitatively using the notion of ''frequency.'' The underlying idea is that the cumulative frequency of each feature over time characterizes the importance of that feature in background modeling. A simple example for a particular feature of a region is illustrated in Fig. 3. As shown in Fig. 3, each distribution of the background model is described by a feature value f, a bin width d, and a peak frequency p. In the following discussion, we cover the proposed background modeling procedure for one region, but the procedure is identical for every region. The RMFF background model for a region consists of a group of features, each of which is composed of K models M = {m_1, m_2, ..., m_K} (the three boxes in Fig. 3), where K is selected by the user.
Each model has a peak frequency, which is the cumulative total frequency over time. For each region in an image, we first fix its size as r × r, where r is another user-selected parameter. Then, we calculate the mean of the pixel values in each region. In this way, we obtain the RMFF feature vector F defined in Eq. (1). Next, the absolute distance between the current feature r_j and its corresponding feature value f_{i,j} in the background model is compared to a threshold d_j. If this distance is less than d_j, we consider this feature entry of the current region to be matched with the j-th feature component of the i-th background model. Finally, the proposed background model is described by a feature vector F, as defined in Eq. (1), a bin width vector D, and a peak frequency vector P. D and P are defined as follows:

D = {d_red, d_green, d_blue, d_SILTP, d_GDF}  (2)
P = {p_red, p_green, p_blue, p_SILTP, p_GDF}  (3)

where d_red, d_green, d_blue, d_SILTP, and d_GDF are the bin widths of the red, green, blue, SILTP, and GDF features, respectively, and p_red, p_green, p_blue, p_SILTP, and p_GDF are the peak frequencies of the corresponding models. As illustrated above, the first frame is utilized to initialize the background model, which is formally represented as follows:

m_i = {F_i, D_i, P_i} = {f_{i,j}, d_{i,j}, p_{i,j}}, for i ∈ {1, ..., K} and j ∈ {1, ..., L}  (4)

where m_i is the i-th background model, K is the number of background models, and L is the number of features. Therefore, the length of the full feature vector is K × L.

Fig. 1 The pipeline of the RMFF algorithm. (1) Feature extraction: the median image of each video sequence is obtained by the temporal median filter, and several features (texture, color, and gray difference) are extracted and fused into a five-dimensional feature image. (2) Background modeling: the feature image is divided into feature blocks, analyzed by a histogram, and the frequencies of each feature are utilized to create the background models. (3) Foreground detection: multi-scale superpixels and an adaptive threshold are used to detect moving objects.
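The per-region bookkeeping described above — K candidate models per feature, each holding a value f, a bin width d, and a peak frequency p — can be sketched as follows. The container layout and function names are our own illustration, not the authors' implementation; the initial bin widths follow the [8, 8, 8, 4, 2] values reported later in the text.

```python
# Sketch of the RMFF per-region model: K candidate models per feature,
# each a dict with a feature value 'f', bin width 'd', and frequency 'p'.
K = 3  # number of models per feature (user-selected in the paper)
FEATURES = ("red", "green", "blue", "SILTP", "GDF")
INIT_WIDTH = {"red": 8, "green": 8, "blue": 8, "SILTP": 4, "GDF": 2}

def init_region_model(first_frame_features):
    """Initialize K models per feature from the first frame's region means
    (first_frame_features maps each feature name to its region mean)."""
    return {
        j: [{"f": first_frame_features[j], "d": INIT_WIDTH[j], "p": 1.0}
            for _ in range(K)]
        for j in FEATURES
    }

def find_match(models, j, r_j):
    """Return the index of the first model of feature j whose value lies
    within its bin width of the observation r_j, or None if unmatched."""
    for i, m in enumerate(models[j]):
        if abs(r_j - m["f"]) < m["d"]:
            return i
    return None
```

The match test here is the |r_j − f_{i,j}| < d_{i,j} comparison of the modeling procedure; everything else (frequency accumulation, replacement of stale models) happens during foreground detection.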

Foreground detection
Foreground detection is driven by the difference between the regional background models and the local observations. Let a particular region in the new video frame be denoted by R = {r_red, r_green, r_blue, r_SILTP, r_GDF}. In the foreground detection procedure, the absolute difference between the j-th feature r_j and the j-th feature of the i-th background model f_{i,j} is first calculated. If it is below the corresponding bin width d_{i,j}, the feature r_j is considered to match the j-th feature of the i-th background model and is labeled as ''matched''; otherwise, it is labeled as ''unmatched'':

K_{i,j} = 1 if |r_j − f_{i,j}| < d_{i,j}, and K_{i,j} = 0 otherwise  (5)

K_{i,j} = 1 means that the j-th feature r_j of the new frame matches the j-th feature f_{i,j} of the i-th background model, which is then updated with the new data, where a_f and a_d are user-settable learning rates for updating f_{i,j} and d_{i,j}, respectively, and e_j is a small value ensuring that the bin width is nonzero. For each feature, the peak frequencies of the matched models are summed to give H_K, the sum of the peak frequencies of the features in all models matched with the current region, where i ∈ {1, ..., K} and j ∈ {red, green, blue, SILTP, GDF}. We then define H as the sum of the peak frequencies over all background models. Finally, the probability of a new region r_t being background is estimated from H_K and H. Conversely, K_{i,j} = 0 means that r_j is unmatched with f_{i,j}. If no feature model matches r_j, the feature value with the lowest frequency is replaced with the current value r_j, and the new feature model is given a low initial frequency p_{i,j} and an initial bin width d_{i,j}. In our experiments, the initial values 1 and [8, 8, 8, 4, 2], respectively, were used.
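The frequency-based background score can be sketched as below. The reading of the elided probability equation as the ratio H_K / H is our assumption (it is the natural normalization of the two sums defined above), and the function names are ours, not the authors'.

```python
# Hedged sketch of the frequency-based background probability: sum the
# peak frequencies of matched models (H_K) and of all models (H), then
# take their ratio as the background score of the region.
def background_probability(models, region):
    """models: {feature: [{'f','d','p'}, ...]}; region: {feature: value}.
    Returns H_K / H, interpreted as the probability of background."""
    h_matched = 0.0
    h_total = 0.0
    for j, ms in models.items():
        for m in ms:
            h_total += m["p"]
            if abs(region[j] - m["f"]) < m["d"]:  # match test of Eq. (5)
                h_matched += m["p"]
    return h_matched / h_total if h_total else 0.0

def replace_weakest(models, j, r_j, init_p=1.0, init_d=8):
    """If no model of feature j matches, overwrite the lowest-frequency
    model with the new observation, giving it a low initial frequency."""
    weakest = min(models[j], key=lambda m: m["p"])
    weakest.update(f=r_j, p=init_p, d=init_d)
```

A region whose features repeatedly match high-frequency models scores near 1 (stable background); a region matching only rare or no models scores near 0 and is a foreground candidate.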

Multi-scale superpixels
As discussed previously, region-level methods cannot always accurately extract the shape of moving objects. Following an approach similar to SPMS (Zhao et al. 2016), a multi-scale superpixel technique is introduced to address this problem, and it significantly improves the accuracy of foreground segmentation. In our method, the Simple Linear Iterative Clustering (SLIC) algorithm (Achanta et al. 2012) is used to extract superpixels from each input frame at multiple scales. Given any pixel v_{i,j} belonging to the n-th superpixel of the s-th scale, where i and j are the pixel coordinates, the final probability of this pixel is a weighted combination over scales, where s and n are the indices of the scale and superpixel, respectively, S is the number of scales, Sp_{s,n} is the n-th superpixel of the s-th scale, N_{s,n} is the number of pixels belonging to Sp_{s,n}, p(v_{s,n}) is the proximity computed using Eq. (11), and ω_s is the weight of the proximity value of each pixel in a superpixel of the s-th scale.
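The multi-scale fusion can be sketched as follows: each pixel's raw probability is replaced by the mean over its superpixel at every scale, and the per-scale values are then combined. Real SLIC segmentation is assumed to be computed elsewhere (the label maps are inputs here), and uniform scale weights are our simplifying assumption — the paper defines its own weight ω_s.

```python
# Sketch of multi-scale superpixel smoothing of a probability map.
# label_maps: one HxW superpixel label map per scale (e.g., from SLIC);
# uniform scale weights are an illustrative simplification.
def fuse_scales(prob, label_maps):
    """prob: HxW list of per-pixel background probabilities.
    Returns the probabilities averaged within each superpixel at every
    scale, then averaged uniformly across scales."""
    h, w = len(prob), len(prob[0])
    fused = [[0.0] * w for _ in range(h)]
    for labels in label_maps:
        sums, counts = {}, {}
        for y in range(h):          # mean probability per superpixel
            for x in range(w):
                n = labels[y][x]
                sums[n] = sums.get(n, 0.0) + prob[y][x]
                counts[n] = counts.get(n, 0) + 1
        for y in range(h):
            for x in range(w):
                fused[y][x] += sums[labels[y][x]] / counts[labels[y][x]]
    s = len(label_maps)
    return [[v / s for v in row] for row in fused]
```

Averaging within superpixels is what forces uniform image regions toward homogeneous labels, which is the spatial-coherency benefit claimed for this step.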

Adaptive threshold for segmenting the foreground
The segmentation threshold essentially controls the precision of the proposed method. Many conventional background subtraction methods use fixed decision thresholds. However, these are not always compatible with real-world videos, as they rely on the assumption that the background is static, or will present identical behavior throughout the analyzed video sequence. In fact, this assumption is invalid, since the background varies over time. Therefore, we treat decision thresholds as adaptive state variables, which can be dynamically adjusted over time for each pixel separately. In the foreground/background classification process, at a location (i, j), P(v_{i,j}) is compared with a segmentation threshold T(i, j) to decide whether the pixel is foreground or background. T(i, j) is defined as follows:

T(i, j) = (l + mp(i, j)^2) / (1 + mp(i, j)^2)  (14)

where l is a user-settable parameter ensuring that the segmentation threshold is never smaller than l, and mp(i, j) is the temporal mean of H_K over the frames observed so far, with Q the order of the current frame (for the first frame Q = 1, for the second frame Q = 2, ..., for the 100th frame Q = 100). The foreground/background classification at location (i, j) is then obtained by comparing P(v_{i,j}) with T(i, j), where B(i, j) is a binary value: ''1'' indicates the background, and ''0'' indicates the foreground.
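Eq. (14) appears to define T as (l + mp²)/(1 + mp²); under that reading, a short sketch of the threshold update, with the incremental running mean as our own reconstruction of the elided mp formula:

```python
# Sketch of the adaptive threshold: T = (l + mp^2) / (1 + mp^2), where mp
# is the running temporal mean of H_K at a location. The incremental-mean
# form of the update is our reconstruction, not the paper's exact formula.
def update_temporal_mean(mp, h_k, q):
    """Running mean of H_K after frame q (q = 1 for the first frame)."""
    return mp + (h_k - mp) / q

def adaptive_threshold(mp, l=0.2):
    """T stays at its floor l for static regions (mp near 0) and rises
    toward 1 as the region's background activity mp grows."""
    return (l + mp ** 2) / (1 + mp ** 2)
```

The point of the squared form is its saturation: the threshold reacts gently to small fluctuations but approaches its upper bound for persistently dynamic regions, with no per-scene tuning required.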
Algorithm 1 shows the main steps of the proposed RMFF for background subtraction. Note that the proposed method exploits the properties of a region-level framework and superpixel hierarchies to model background changes.

Experimental results
We compared the proposed method with 12 state-of-the-art algorithms on a real dataset, ChangeDetection.net (CDnet2014, available at http://www.changedetection.net/) (Wang et al. 2014), which was released at the 2014 Computer Vision and Pattern Recognition Workshops (CVPRW). We chose this dataset for its myriad of real-world indoor and outdoor scenes, each accompanied by corresponding ground-truth data. Furthermore, it is claimed to be the largest video dataset for change detection, encompassing a total of 53 video sequences grouped into 11 categories: bad weather (BW), baseline (BL), camera jitter (CJ), dynamic background (DB), intermittent object motion (IOM), low framerate (LFR), night videos (NV), pan-tilt-zoom (PTZ), shadow (SHD), thermal (TH), and turbulence (TB). This dataset has been extensively used for comparative experiments by numerous background subtraction algorithms in recent years, such as RMoG (Varadarajan et al. 2015), SoAF (Yang et al. 2018), and GraphMOS (Giraldo et al. 2022), to name a few. In addition, the dataset provides detailed detection results for directly comparing newly proposed algorithms with others.
The official metrics used to rank methods are also reported. The three used below are recall, Re = TP/(TP + FN); precision, Pr = TP/(TP + FP); and F-Measure, FM = 2 · Pr · Re/(Pr + Re), where TP, FP, and FN denote the numbers of true positive, false positive, and false negative pixels, respectively.
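The Re, Pr, and FM scores reported in the tables below reduce to the standard recall, precision, and F-measure over pixel counts; a minimal helper (our own illustration):

```python
# Standard recall / precision / F-measure from pixel-level confusion counts,
# as used for CDnet-style evaluation.
def cdnet_scores(tp, fp, fn):
    """tp, fp, fn: counts of true-positive, false-positive, and
    false-negative foreground pixels. Returns (Re, Pr, FM)."""
    re = tp / (tp + fn) if tp + fn else 0.0          # Recall
    pr = tp / (tp + fp) if tp + fp else 0.0          # Precision
    fm = 2 * pr * re / (pr + re) if pr + re else 0.0  # F-Measure
    return re, pr, fm
```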

Investigation into the effect of region size variation
In the first experiment, we varied the region size r = 1, 2, 4, and 8 on eight typical video sequences with complex backgrounds. An evaluation of the proposed method with different region sizes is depicted in Fig. 4. In general, the FM score improved as the region size increased, especially for the ''canoe'' and ''cubicle'' sequences, as observed in Fig. 4. However, the FM score did not increase monotonically with region size, and could even decrease, because region-level features make RMFF less discriminative. This is further illustrated in Fig. 5, which presents segmentation results obtained from these sequences with different region sizes. As can be seen, in the bad weather sequences, the snow accumulation (first column) and the wet snow (second column) are almost completely removed with r = 4, while the cars are still detected entirely as foreground. This can be attributed to the fact that region-level features are more robust to background noise, and the region size r = 4 is better suited to these kinds of bad weather scenes than the other sizes. Next, for the dynamic backgrounds, one can also see that as the region size increased, the number of false positives arising from background motion decreased. However, owing to the region-level features, the detected foreground boundaries were not accurate, and this worsened as the size increased. In the boat sequence (3rd column) and the canoe sequence (4th column), the dynamic background from the shimmering water was handled well as the region size increased. Similar performance can be seen in the fountain02 sequence (5th column) and the overpass sequence (6th column), which contain a fountain and a swaying tree, respectively. Another advantage of this method is its tolerance to both strong and faint shadows. For the backdoor sequence (7th column) and the cubicle sequence (8th column), RMFF coped with these shadows very well.
However, since RMFF is a region-level model, the boundary accuracy of the detected objects degraded slowly as the region size increased.

Effect of multi-scale superpixels technique and adaptive thresholds
In this sub-section, we present the detection results obtained with and without our other two strategies, multi-scale superpixels and adaptive thresholds, while keeping the region size constant. As shown in Fig. 6, these two strategies were critical to the accuracy of foreground detection in terms of FM scores. As illustrated, RMFF combines the advantages of the two strategies to obtain high FM scores on the eight sequences in Fig. 6. Furthermore, as shown in Fig. 6, without the multi-scale superpixel strategy, the FM deteriorated sharply under dynamic backgrounds (''boats,'' ''fall,'' and ''fountain02''). This supports the claim that the proposed method can effectively address multi-scale moving objects under dynamic backgrounds in real surveillance tasks. Figure 7 further supports the assertion that multi-scale superpixels and adaptive thresholds are critical to segmentation accuracy. From Fig. 7, it is apparent that in the configuration without superpixels (second row), much of the foreground was absorbed into the background (1st, 2nd, 7th, and 8th columns of the second row) or many false positives were produced (3rd, 4th, 5th, and 6th columns). The third row of Fig. 7 presents the detection results without adaptive thresholds; it is not difficult to observe that many false negatives (2nd, 4th, and 7th columns of the third row) and false positives (5th column of the third row) were generated. When adaptive thresholds and multi-scale superpixels are combined in RMFF, as shown in the last row of Fig. 7, the segmentation results are much more reliable and accurate. This experiment demonstrates how important multi-scale superpixels and adaptive thresholds are in the proposed RMFF. The parameter settings used in our experiments are listed in Table 1. These parameters were found empirically and remained the same for all video sequences. The results produced by our method on the CDnet 2014 dataset are shown in Table 2.
As depicted in Table 2, the Re, Pr, and FM scores of the proposed method in the bad weather, baseline, dynamic background, shadow, and thermal categories achieve promising performance (all scores greater than 0.7). These categories contain many challenges, such as low-visibility winter storm conditions, strong background motion, variations in illumination, strong and faint shadows, thermal artifacts, and multi-scale objects. These results indicate that the proposed method is very flexible and can adapt to even the most complex scenes. The video sequences in the PTZ category posed the biggest challenge and yielded the worst results. They were captured by a moving camera, which causes the pixels in each frame to shift rapidly over time, making it difficult for our method to obtain a stable multi-feature-frequency model. However, unlike many of the other methods, the proposed RMFF achieved good performance across all the categories, not just one. Table 3 shows the comparison between the FM achieved by our RMFF and those produced by the 12 state-of-the-art algorithms. RMFF not only had the highest overall FM, but was also among the top two in five of the 11 categories. RMFF obtained the best FM value in DB, because the relationship between pixels in a neighborhood is taken into consideration, which effectively removes most background motion. It is also important to note that in the BW category, RMFF ranked second, with an FM of 0.8439. This is an encouraging performance, about 30 and 53 percentage points higher than the scores of the similar algorithms SoAF_1 and SoAF_2, respectively. This result also shows that RMFF can effectively deal with the interference caused by bad weather. In the LFR, TH, and TB categories, RMFF was also ranked second, with FMs slightly lower than those of the best methods. Although RMFF was not among the top three in BL, CJ, IOM, NV, and SHD, it achieved competitive results.
In the BL category, RMFF obtained an FM of 0.8294, which is a good result. In the CJ and IOM categories, which involve camera jitter and objects being placed and removed intermittently, our innovative model allowed RMFF to obtain FM values of 0.6154 and 0.4656, respectively. In the NV category, the many halos and reflections caused by strong headlights posed great difficulties to the detection methods, and all methods produced relatively poor results. In the PTZ category, most of the state-of-the-art algorithms produced poor FM values due to the assumption that the camera is static over time. Unfortunately, RMFF also failed to deal with this category, producing our worst FM of 0.0678. For a moving camera, RMFF may produce many false positives or lose the foreground. In the SHD category, RMFF was ranked fourth, with a good result of approximately 0.8 FM, indicating that our method can effectively suppress the interference caused by both strong and faint shadows.

[Fig. 7 caption: Qualitative results for the eight typical video sequences with complex backgrounds. First row: input frames; second row: adaptive thresholds but without multi-scale superpixels; third row: multi-scale superpixels but without adaptive thresholds; and fourth row: both.]

Overall, the proposed RMFF performed favorably against the other methods in terms of overall F-Measure. Some qualitative comparison results for our RMFF and other state-of-the-art methods on ten real sequences from the CDnet 2014 dataset are shown in Fig. 8. The first four rows show challenging outdoor weather involving low-visibility winter storms, and only our method performed relatively well. Specifically, the first row depicts two cars running in blizzard weather; as can be seen, only our RMFF removed the blizzard effectively while detecting the foreground accurately.
SoAF1, SoAF2, RMoG, CP3-online, SC-SOBS, and Euclidean distance failed to suppress the snowstorm and lost most of the foreground regions, while ViBe, MST, GMM, IGMM, and IUTIS-2 produced slightly better results. GraphCutDiff could detect most of the foreground, but it also produced some false positives. The second row shows people skating in the snow. Only our method produced accurate segmentation, while the other methods lost part of the foreground. The third row of Fig. 8 depicts a traffic scene on a snowy day. Only our method could obtain a complete foreground object while handling the snow and the dark tire tracks. SoAF1 produced many false positives, while SoAF2 failed to detect the car. CP3-online and GraphCutDiff could detect most of the foreground, but failed to suppress the moving shadows, resulting in inaccurate boundaries. The fourth row presents cars and pedestrians at the corner of a street. Our RMFF produced the best results. (In each column of Table 3, red font indicates the best result and blue font the second best; out of the 12 categories, RMFF ranks first in two and second in four.) Of the other methods, SoAF1 obtained the most foreground pixels, but produced many false positives, while the remaining methods failed to detect the foreground. The last six rows of Fig. 8 show outdoor scenes with dynamic background motion. The purpose of these experiments was to demonstrate the ability of our method to handle complex dynamic scenes.
Our method was one of the best in reducing false positives and false negatives. This is because our RMFF method models regions rather than individual pixels, thus ensuring that the useful correlation existing in a neighborhood region is exploited to improve detection accuracy in complex scenes. In particular, the fifth and sixth rows of Fig. 8 show boats on shimmering water. Only our method and SoAF2 could effectively cope with the water rippling and produce accurate results compared to the ground truth. SoAF1, ViBe, and GraphCutDiff did not deal with the water movement very well, and generated many false positives. The other methods lost some parts of the boat bodies, and produced inaccurate segmentation. In the seventh row of Fig. 8, a car passes in front of a tree waving in the wind. The proposed method and GraphCutDiff were the two best methods, removing background motion almost perfectly and detecting the car quite well in such a dynamic scene; our segmentation result was the more accurate of the two. CP3-online failed to capture the moving car, while RMoG tended to produce blob shapes in which many foreground regions were absorbed into the background. The other methods performed poorly in suppressing the motion of the shaking tree, and produced many false positives. The next two rows of Fig. 8 show cars passing next to fountains. The proposed method, SoAF2, and RMoG were the best methods, suppressing background motion and producing accurate foreground results. ViBe, MST, GMM, IGMM, CL-VID, SC-SOBS, and GraphCutDiff could not always eliminate the dynamic background of the fountains and water waves, and produced many false alarms. SoAF1 and Euclidean distance could not deal with the dynamic background motions, and generated many false positives, while CP3-online lost many true foreground pixels and produced many false negatives. The last row shows a person passing in front of a tree swaying in the wind. Our method and GraphCutDiff achieved performance superior to that of the other methods.
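The benefit of modeling feature frequencies per region rather than per pixel can be sketched with a toy model; the class below is a simplified stand-in that assumes quantized feature values, and is not the exact RMFF model.

```python
from collections import Counter

class RegionFrequencyModel:
    """Toy per-region feature-frequency model (illustrative only).

    Each region accumulates frequency counts of its quantized feature
    values over time; a new observation is classified as background if
    its value has been seen sufficiently often."""
    def __init__(self, bg_ratio=0.2):
        self.counts = Counter()
        self.total = 0
        self.bg_ratio = bg_ratio  # minimum relative frequency for background

    def update(self, feature):
        self.counts[feature] += 1
        self.total += 1

    def is_background(self, feature):
        if self.total == 0:
            return False
        return self.counts[feature] / self.total >= self.bg_ratio

# A region alternating between two dominant appearances (e.g. rippling water):
# both frequent values are absorbed into the background model.
model = RegionFrequencyModel()
for f in [3, 3, 4, 3, 4, 3]:
    model.update(f)
bg = model.is_background(3)   # frequent value -> background
fg = model.is_background(9)   # unseen value -> foreground
```

Because the counts are pooled over a whole region, a few pixels flickering inside the region barely change the dominant frequencies, which is one way to see why region-level models are more robust to background motion than per-pixel ones.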
SoAF2, RMoG, CP3-online, MST, GMM, IGMM, SC-SOBS, IUTIS-2, and Euclidean distance could not handle the disturbance caused by the tree swaying, resulting in some true foreground parts being absorbed into the background. In contrast, SoAF1 and CL-VID produced many false positives, while ViBe performed poorly, with many false positives and false negatives. Overall, the results of our method were better than those of the others, particularly with respect to complex background suppression and accurate foreground extraction.

Processing speed
The proposed method was implemented in MATLAB on a personal computer with an i9 CPU and 64 GB of RAM. For a typical image resolution of 320 × 240, the proposed method could reach about 4 fps; approximately 90% of the processing time was consumed by the multi-scale superpixel computation.

Conclusion
In this paper, we introduce an effective method named RMFF for background subtraction, to address the challenges posed by complex scenes. To handle dynamic background changes, a region-level background model is proposed, in which the spatial correlation between neighboring pixels is taken into consideration. The frequencies of features are then used to model the background of each region over time. Finally, multi-scale superpixels and per-region segmentation thresholds are exploited to greatly alleviate the impact of multi-scale moving objects and interference from background noise, motions, and shadows. A comprehensive evaluation of the proposed method on a standard dataset containing 53 video sequences demonstrated the capability and flexibility of RMFF over a wide variety of environmental conditions. Exhaustive experimental results show that the proposed RMFF method offers an effective solution for multi-scale moving object detection under dynamic background. However, there is still room for further improvement. For example, several algorithmic hyper-parameters are introduced in the proposed method, resulting in a multi-parameter joint optimization problem. How to use advanced optimization algorithms to address this problem is one of the main challenges to be solved in the future. Recently, with the continuous development of optimization theory and techniques, advanced optimization algorithms have taken center stage in solving a range of complex practical problems because of their flexibility and simplicity. Advanced optimization algorithms, such as customized heuristics and metaheuristics, hybrid algorithms, island algorithms, particle swarm optimization, the sine cosine algorithm, and gray wolf optimization, have been widely used to deal with challenging real-world decision problems in engineering (Xue et al. 2022), online learning, scheduling (Kavoosi et al. 2019; Dulebenets et al. 2021; Kavoosi et al. 2020), economic power dispatch, transportation (Pasha et al. 2022), medicine (Rabbani et al. 2022), and so on. These advanced optimization approaches play a critical role and have achieved excellent performance in the aforementioned applications. Background subtraction can also be described as a combinatorial optimization problem. How to use advanced optimization algorithms to improve the performance of background subtraction has attracted increasing interest, and many background subtraction methods have introduced optimization algorithms and shown encouraging performance, such as GraphMOS (Giraldo et al. 2022), HMAO (Li et al. 2019), and RMoG (Varadarajan et al. 2015). Hence, as part of potential future research, the following extensions can be pursued: (1) consideration of advanced optimization algorithms for background modeling; (2) development of new strategies for spatiotemporal cue extraction; (3) consideration of new approaches for automatically adjusting the region size, the number of superpixels, and the learning rates to fit a wide range of background transformations; (4) development of multi-parameter joint optimization frameworks for hyper-parameter selection; and (5) design of new strategies for algorithm acceleration to meet the real-time requirements of practical video processing tasks.
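As one concrete instance of extensions (1) and (4), particle swarm optimization, one of the algorithms named above, could be used to tune RMFF's hyper-parameters jointly against a validation objective such as 1 − FM. The sketch below is generic and illustrative; all constants, names, and the toy objective are our own assumptions.

```python
import random

def pso_minimize(objective, dim, bounds, n_particles=20, iters=50,
                 w=0.7, c1=1.5, c2=1.5):
    """Minimal particle swarm optimization sketch for hyper-parameter tuning.
    objective maps a parameter vector to a loss; bounds = (lo, hi) applies
    to every dimension. Constants w, c1, c2 are common textbook defaults."""
    lo, hi = bounds
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                     # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = pbest[min(range(n_particles), key=lambda i: pbest_val[i])][:]
    g_val = min(pbest_val)                          # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (g[d] - pos[i][d]))
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < g_val:
                    g, g_val = pos[i][:], val
    return g, g_val

# Toy objective standing in for a validation loss; its minimum is at (1, 2).
random.seed(0)
best, best_val = pso_minimize(lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2,
                              dim=2, bounds=(-5, 5))
```

In practice, each objective evaluation would run the background subtraction pipeline on a validation set, so surrogate models or reduced sequences would be needed to keep the search tractable.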