A real-time fire and flame detection method for electric vehicle charging station based on machine vision

In the charging process of electric vehicles (EVs), high-voltage and high-current charging is widely used to reduce charging time, resulting in severe battery heating and an increased risk of fire. To improve fire detection efficiency, this paper proposes a real-time fire and smoke detection method for EV charging stations based on machine vision. The algorithm introduces the Kmeans++ algorithm into the GhostNet-YOLOv4 model to re-screen anchor boxes for fire and smoke targets, improving the clustering quality for the complex and variable features of these targets, and introduces the coordinate attention (CA) module after the lightweight backbone network GhostNet to improve the classification quality. EV charging station monitoring video is used as the model's detection input source to achieve real-time detection of multiple sites simultaneously. The experimental results demonstrate that the improved algorithm has 11.436 M model parameters, an mAP of 87.70%, and a video detection rate of 75 FPS; it offers good continuous target tracking, satisfies the demand for real-time monitoring, and is crucial for the safe operation of EV charging stations and the emergency extinguishing of fires.


Introduction
EVs have drastically transformed the global automobile industry's development pattern, and this trend will continue to fuel the industry's future expansion [1]. By the end of 2021, China had 7.84 million new energy vehicles, of which 6.4 million were fully electric. The number of fires caused by charging EVs is rising annually. During the operation of charging piles and onboard batteries, issues such as line overload, short circuits, poor contact, improper charging, and heat-dissipation failure under high-temperature conditions can easily cause fire accidents. With the increasing commercialization of EVs, charging safety has become the most significant barrier to marketing them [2].
Researchers have proposed various fire and smoke detection systems for speedy responses to fire events. The earliest methods for identifying fire and smoke were based on static fire and smoke properties, such as texture, wavelets, color, and edge direction histograms [3][4][5][6]. These algorithms are computationally intensive, making real-time detection difficult to achieve. In recent years, YOLO series algorithms have achieved outstanding results in the field of target detection [7][8][9][10] and have become increasingly prevalent. Experimental results demonstrate that the YOLO series algorithms offer good real-time performance and high accuracy and can satisfy the requirements of target detection. Applying lightweight YOLO network models to real-time target detection has become one of the main areas of research [11][12][13][14], because lightweight target detection models reduce the size of the network model and the amount of computation while improving detection accuracy and real-time performance.
The lightweight YOLO model has widespread application in crop detection [15][16][17][18], industrial electronic device detection [19], and other sectors. For instance, Reference [20] presents an enhanced YOLOv3 target detection approach: the Kmeans clustering algorithm and the Squeeze-and-Excitation Networks (SE) module are added to the original algorithm based on the features of the detection target. Reference [21] proposes an improved gesture recognition algorithm for YOLOv4 by introducing the spatial pyramid pooling (SPP) module and the Kmeans++ algorithm; deployed on Android mobile phones, it achieves real-time gesture detection and recognition and has significant implications for the advancement of human-computer interaction.
This study presents a real-time fire and smoke detection method for EV charging stations based on machine vision. Considering the complex and changeable color and shape characteristics of fire and smoke at EV charging stations, it adds the GhostNet backbone and the Kmeans++ clustering algorithm to YOLOv4 [22], together with the CA module [23, 24]. Building on a charging station's existing monitoring equipment, the neural network model is trained on fire and smoke data sets, discovers fire information in real time, and provides dynamic early warning and fire monitoring.
The following are the primary contributions of this paper:

• The high-performance, lightweight neural network GhostNet is chosen as the backbone feature extraction network. Based on linear transformations, richer multichannel target feature maps are generated, the number of parameters and the computational complexity of the network model are reduced, detection accuracy is maintained, and detection speed is increased.

• The Kmeans++ clustering algorithm and the CA module are added to the original algorithm to strengthen the extraction of fire and smoke characteristics, address missed and false detections, and improve the model's generalizability, robustness, training efficiency, and detection accuracy.

• Monitoring video of EV charging stations is employed as the input video source for neural network detection, enabling simultaneous monitoring of many charging stations and drastically reducing the response time to fire emergencies. The network model is deployable on mobile embedded platforms, such as inspection robots, and has promising prospects in video-based mobile target recognition applications.

Anchor box optimization based on Kmeans + + clustering algorithm
As depicted in Fig. 1, clustering algorithms divide objects into clusters that minimize the distance between objects within the same cluster, so that similar objects can be grouped. When a fire erupts at an EV charging station, the fire and smoke are rapidly influenced by external factors, resulting in complicated and variable shapes. With the original anchor box values, the proportion of positive and negative samples generated on the fire data set is imbalanced, which delays model convergence. The Kmeans++ clustering method is therefore used to group the fire data set, and nine new anchor values are derived from the distribution of the cluster-center data boxes. The basic idea of the Kmeans++ clustering algorithm is that the initial cluster centers should be as far apart as possible. The placement of the initialized centroids affects both the clustering outcome and the execution time, and purely random selection can make the algorithm converge very slowly; hence, it is vital to choose K appropriate centroids. Kmeans++ optimizes the random centroid initialization of Kmeans and selects K cluster centers based on the following principle: the first cluster center (n = 1) is selected by a random method; then, assuming n initial cluster centers (0 < n < K) have been selected, points farther from the current n cluster centers have a higher probability of being selected as the (n + 1)st cluster center. The Kmeans++ clustering algorithm is used to re-cluster the fire and smoke data set, and Table 1 shows the resulting anchor box parameters. Figure 2 depicts the network structure of YOLOv4, using an input picture size of 416 × 416 as an illustration. The target detection algorithm must be improved to meet the requirements of real-time detection on low-power mobile platforms in industrial applications; this study therefore replaces the CSPDarkNet-53 backbone network with the GhostNet network.
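The anchor re-screening step described above can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: it uses Kmeans++ seeding with the common detection-clustering distance d = 1 − IoU over (width, height) pairs, followed by standard Lloyd iterations; the function names and toy boxes are hypothetical.

```python
import random

def iou_wh(box, centroid):
    """IoU between two (w, h) boxes, assuming shared top-left corners."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k, seed=0, iters=100):
    """Kmeans++ seeding with d = 1 - IoU, then Lloyd iterations."""
    rng = random.Random(seed)
    centroids = [rng.choice(boxes)]
    while len(centroids) < k:
        # Distance of each box to its nearest chosen centroid.
        d = [min(1 - iou_wh(b, c) for c in centroids) for b in boxes]
        # Sample the next centroid with probability proportional to distance.
        r, acc = rng.uniform(0, sum(d)), 0.0
        for b, di in zip(boxes, d):
            acc += di
            if acc >= r:
                centroids.append(b)
                break
    for _ in range(iters):
        # Assign each box to the centroid with the highest IoU.
        clusters = [[] for _ in range(k)]
        for b in boxes:
            j = max(range(k), key=lambda i: iou_wh(b, centroids[i]))
            clusters[j].append(b)
        # Update centroids as the per-cluster mean width and height.
        centroids = [
            (sum(w for w, _ in cl) / len(cl), sum(h for _, h in cl) / len(cl))
            if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return sorted(centroids, key=lambda c: c[0] * c[1])

boxes = [(10, 10), (12, 11), (11, 9), (100, 100), (110, 95), (90, 105)]
print(kmeans_pp_anchors(boxes, 2))
```

In the paper's setting, the input boxes would be the labeled fire and smoke bounding boxes and k = 9, yielding the nine anchor values of Table 1.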
Standard convolution is separated into two steps: a portion of the feature maps is generated with fewer convolution kernels, and the remaining portion is obtained by cheap computations on the generated portion. The two sets of feature maps are then combined, reducing the required computational resources. After batch-normalization-layer optimization, the GhostNet-YOLOv4 approach makes the model much simpler, speeding up detection without sacrificing accuracy. GhostNet is a network topology proposed by Huawei's research lab. Its fundamental concept is a staged convolution module that executes cheap linear convolutions on a small number of intrinsic feature maps to generate additional feature maps, reducing redundant computation and producing a lighter model. As shown in Fig. 3a, the conventional techniques it builds on are pointwise convolution for reducing the number of dimensions and depthwise convolution for extracting features.

Fire detection model based on GhostNet-YOLOv4
GhostNet combines linear operations with ordinary convolution: the feature maps produced by normal convolution are linearly transformed to generate additional, similar ("redundant") feature maps. GhostNet greatly improves convolutional processing by building the Ghost module, as shown in Fig. 3b. The Ghost module is the core of the GhostNet feature extractor. Compared with common convolutional neural networks, this module does not change the size of the output feature map, yet it greatly reduces the parameters and computational complexity of the network model. In addition, the Ghost module is plug-and-play and easily portable.
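The parameter savings of the Ghost module can be illustrated with simple counting. The sketch below assumes a ratio s = 2 (half the output maps are "intrinsic", half are generated by 3 × 3 depthwise linear operations) and ignores bias and batch-norm terms; the exact constants are illustrative assumptions, not the paper's measurements.

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (no bias/BN)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, s=2, d=3):
    """Ghost module: a primary conv makes c_out/s intrinsic maps,
    then cheap d x d depthwise ops generate the remaining maps."""
    m = c_out // s               # intrinsic feature maps
    primary = c_in * m * k * k   # ordinary convolution
    cheap = (s - 1) * m * d * d  # depthwise linear transforms
    return primary + cheap

std = conv_params(256, 256, 3)     # 589,824 parameters
ghost = ghost_params(256, 256, 3)  # 294,912 + 1,152 = 296,064
print(std, ghost, round(std / ghost, 2))
```

For s = 2 the compression approaches a factor of s, which is why replacing CSPDarkNet-53 with a Ghost-module backbone roughly halves the convolutional parameter cost at each layer.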
Since the feature points of small targets are easily overlooked during network feature extraction and fusion, the CA module is introduced to enhance the fire and smoke feature extraction capability and improve the model's precision; the upgraded network model is depicted in Fig. 4.

Feature extraction network optimization based on CA module
Some photos in the data set have low resolution, the targets in them are small, and overlapping fire and smoke targets blur the edges, so the network quickly loses features during target feature extraction and fusion. Figure 5 illustrates how this paper adds the CA module to the feature extraction network. Through the CA module, the model can pay greater attention to the characteristics of fire and smoke, suppress unwanted background elements such as lights and clouds, and decrease the number of missed detections of tiny targets, which enhances the model's precision (the module's structure is shown in Fig. 6).
The CA module embeds location information within the channel attention module, allowing the network to attend to a larger area. A conventional channel attention module uses two-dimensional global pooling to transform the input into a single feature vector per channel. As shown in Fig. 6, the CA module instead decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along different directions. The CA module can thereby acquire long-range relevant information in one spatial direction while keeping accurate position information in the other. The two resulting feature maps are individually encoded to form a pair of direction-aware and position-sensitive attention maps that can be applied complementarily to the input feature map to strengthen the representation of the objects of interest.

Coordinate information embedding
Global pooling is typically utilized for the global encoding of spatial information in channel attention, but it is difficult to preserve position information, because the channel descriptors compress the global spatial information. To enable the attention module to capture long-range spatial interactions with exact position information, the global pooling is factorized into a pair of one-dimensional feature encoding operations, using the formulas below.
Specifically, given an input $X$, each channel is first encoded along the horizontal and vertical coordinates using pooling kernels of size $(H, 1)$ and $(1, W)$, respectively. The output of channel $c$ at height $h$ can be stated as

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$

Likewise, the output of channel $c$ at width $w$ can be written as

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$

The two transformations above aggregate features along the two spatial directions to produce a pair of direction-aware feature maps. This differs from the SE module, whose channel attention produces only a single feature vector. These two transformations also enable the attention module to capture long-range dependencies along one spatial direction while maintaining accurate position information along the other, which improves the network's ability to locate objects of interest precisely.

CA generation
Using the transformations described above, the method can capture a global receptive field and encode precise position information. Utilizing the resulting representation requires a second transformation, termed CA generation. After the information-embedding step, the two aggregated feature maps are concatenated and passed through a shared $1 \times 1$ convolution transformation $F_1$:

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$

where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension, $\delta$ is a nonlinear activation function, and the intermediate feature map $f \in \mathbb{R}^{C/r \times (H+W)}$ encodes spatial information in the horizontal and vertical directions. Here, $r$ controls the size reduction rate, as in the SE module. Then, $f$ is decomposed along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$. Two further $1 \times 1$ convolution transformations, $F_h$ and $F_w$, transform $f^h$ and $f^w$ into tensors with the same number of channels as the input $X$:

$$g^h = \sigma\left(F_h\left(f^h\right)\right), \qquad g^w = \sigma\left(F_w\left(f^w\right)\right)$$

where $\sigma$ is the sigmoid activation function. To reduce the complexity and computational overhead of the model, the number of channels of $f$ is reduced by choosing an appropriate reduction ratio $r$. The outputs $g^h$ and $g^w$ are then expanded and used as attention weights.
Finally, the output $Y$ of the CA module can be written as

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
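The full CA forward pass can be sketched in NumPy. This is an illustrative sketch only: the learned $1 \times 1$ convolutions $F_1$, $F_h$, $F_w$ are stood in for by fixed random channel-mixing matrices, whereas a real implementation would learn them during training.

```python
import numpy as np

def coordinate_attention(x, r=8, seed=0):
    """Minimal NumPy sketch of a CA forward pass for one sample (C, H, W).
    Random matrices stand in for the learned 1x1 convolutions."""
    rng = np.random.default_rng(seed)
    c, h, w = x.shape
    cr = max(c // r, 1)                      # reduced channel count C/r
    # Directional pooling with (H, 1) and (1, W) kernels.
    zh = x.mean(axis=2)                      # z^h: (C, H)
    zw = x.mean(axis=1)                      # z^w: (C, W)
    # Concatenate along the spatial dimension, apply F1 and a ReLU.
    f1 = rng.standard_normal((cr, c)) * 0.1
    f = np.maximum(f1 @ np.concatenate([zh, zw], axis=1), 0)  # (C/r, H+W)
    fh, fw = f[:, :h], f[:, h:]              # split back into f^h, f^w
    # Fh, Fw restore the channel count; sigmoid yields attention weights.
    fh_mix = rng.standard_normal((c, cr)) * 0.1
    fw_mix = rng.standard_normal((c, cr)) * 0.1
    gh = 1 / (1 + np.exp(-(fh_mix @ fh)))    # g^h: (C, H)
    gw = 1 / (1 + np.exp(-(fw_mix @ fw)))    # g^w: (C, W)
    # Reweight the input along both spatial directions.
    return x * gh[:, :, None] * gw[:, None, :]

y = coordinate_attention(np.ones((16, 8, 10)))
print(y.shape)  # (16, 8, 10)
```

The output keeps the input's shape, matching the plug-and-play property noted above: the module can be inserted after GhostNet without altering downstream layer sizes.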

Experimental data set construction
When the electrical equipment of an EV charging station combusts, a great deal of gray smoke is produced, and some of this smoke combines with the fire, altering its brightness and shape. Smoke and fire are easily influenced by external causes, resulting in complex and mutable properties such as forms and hues. In data set selection and creation, it is therefore necessary to increase the number of data set images, label as much dynamic target feature information across multiple stages as feasible, and enhance detection precision. The accuracy of a deep learning model's predictions depends on the quality of the training set, and insufficient data often results in overfitting during training. Data augmentation can both tackle the issue of insufficient data and alleviate overfitting. As shown in Fig. 7, this paper employs two geometric transformation processes: rotation and mirror inversion. YOLOv4 also employs Mosaic data augmentation. As depicted in Fig. 8, the idea is to randomly crop four images and then combine them into one image as training data. Splicing four images together effectively increases the batch size, and all four images contribute to the batch normalization statistics. Owing to the robust augmentation capability of the Mosaic method, the model parameters can be initialized arbitrarily. Because some of the training images generated by Mosaic deviate from the actual distribution of natural images, it is applied only during the first 70% of training epochs, with each stage having a 50% probability of employing Mosaic augmentation together with mixup processing.
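A minimal Mosaic sketch is given below, assuming a random split point and a simple top-left crop of each of the four source images; real implementations also remap the bounding-box labels into the combined image, which is omitted here for brevity.

```python
import numpy as np

def mosaic(images, out_size=416, seed=0):
    """Tile four images into one training image around a random split point."""
    rng = np.random.default_rng(seed)
    # Random split point, kept away from the borders.
    cx = int(rng.uniform(0.3, 0.7) * out_size)
    cy = int(rng.uniform(0.3, 0.7) * out_size)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    regions = [(0, cy, 0, cx), (0, cy, cx, out_size),
               (cy, out_size, 0, cx), (cy, out_size, cx, out_size)]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        h, w = y1 - y0, x1 - x0
        canvas[y0:y1, x0:x1] = img[:h, :w]  # top-left crop of each image
    return canvas

# Four solid-color stand-ins for training images.
imgs = [np.full((416, 416, 3), v, dtype=np.uint8) for v in (50, 100, 150, 200)]
m = mosaic(imgs)
print(m.shape)  # (416, 416, 3)
```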
During the construction phase, the data set is assembled. To ensure image quality, the width or height is required to be at least 600 pixels, and duplicated, blurry, and light-polluted photographs are eliminated. Finally, 16,862 photos containing 9658 fire targets and 10,720 smoke targets are obtained. The software LabelImg is used to label the fire and smoke regions in the experimental images, generating an XML file containing the coordinates of the fire and smoke, and the files are organized in the VOC data set format. The training:validation:test split ratio is 0.81:0.09:0.10.
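The split above can be reproduced with a short script; the file names here are hypothetical placeholders for the LabelImg annotation files.

```python
import random

def split_dataset(files, ratios=(0.81, 0.09, 0.10), seed=42):
    """Shuffle and split file names into train/val/test by the given ratios."""
    files = list(files)
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

train, val, test = split_dataset([f"img_{i:05d}.xml" for i in range(16862)])
print(len(train), len(val), len(test))  # 13658 1517 1687
```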

Experimental environment configuration
This paper's experimental setting is a Windows 10 machine with an Intel i7-11800H CPU, 16 GB of RAM, and an RTX 3070 GPU. Each model is written in PyTorch. The local computer's configuration is displayed in Table 2.
The training of the model consists of two phases. The network parameters are initialized with the pre-training weights of the YOLOv4 network trained on the VOC-2007 data set. First, the backbone network is frozen and trained with a batch size of 64 and an initial learning rate of 1 × 10⁻² for 50 epochs; then the backbone is unfrozen and training continues with a batch size of 32 and an initial learning rate of 1 × 10⁻⁴ for 250 epochs. The optimizer used throughout training is SGD with default parameters, and after each epoch the learning rate is multiplied by a decay factor of 0.90. This greatly reduces the time and resources needed for training.
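The two-phase schedule can be sketched as below; interpreting the 0.90 value as a per-epoch multiplicative decay factor within each phase is an assumption made for illustration.

```python
def lr_schedule(epoch, freeze_epochs=50, lr_freeze=1e-2,
                lr_unfreeze=1e-4, decay=0.90):
    """Two-phase learning rate: frozen backbone, then unfrozen fine-tuning.
    Within each phase the rate decays multiplicatively per epoch."""
    if epoch < freeze_epochs:
        return lr_freeze * decay ** epoch
    return lr_unfreeze * decay ** (epoch - freeze_epochs)

print(lr_schedule(0))   # 0.01  (start of frozen phase)
print(lr_schedule(50))  # 0.0001 (start of unfrozen phase)
```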

Model performance evaluation metrics
In machine learning (ML), natural language processing (NLP), information retrieval (IR), and other domains, evaluation is required, and common evaluation indicators include AP, mAP, Precision, Recall, and F1-Measure.
According to the combination of a sample's actual category and the model's predicted category, the binary classification problem can be separated into four cases: TP, FP, TN, and FN, as depicted in Table 3. Precision is the ratio of true positives to all samples predicted positive. Recall is the fraction of true positives among the actual positive samples. The AP value for each class is the area under the Precision-Recall curve, and the mAP value is the mean of all classes' AP values. The F1 score is the harmonic mean of Precision and Recall.
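The definitions above translate directly into code; the counts in the usage line are illustrative, not the paper's results.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from binary detection counts."""
    precision = tp / (tp + fp)          # true positives / predicted positives
    recall = tp / (tp + fn)             # true positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = prf1(tp=80, fp=20, fn=20)
print(p, r, round(f1, 3))  # 0.8 0.8 0.8
```

AP extends this by integrating precision over recall as the confidence threshold is swept, and mAP averages AP over the fire and smoke classes.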

Analysis of results
The loss curve depicts the error on the training set during model training; it drops fast throughout the first ten training epochs and then stabilizes, with the error fluctuating around 0.03. To evaluate the detection performance of the updated model, 3371 images from the test set were chosen for testing and evaluation. The experimental test outcomes are depicted in Table 4 and Fig. 9.
Some EV fire images are selected for real-time detection with the GhostNet-YOLOv4 model. Figure 10 demonstrates the detection outcomes. The fire and smoke prediction boxes correspond well to the target regions. The revised model is more sensitive to black smoke and has greater detection precision, consistent with the experimental hypotheses.
In this research, a video of an EV charging station on fire is used as the video input of simulated monitoring equipment to the algorithm model for real-time detection. Figure 11 demonstrates the detection outcomes. The model's detection speed of 75 FPS is sufficient for continuous real-time detection with high precision, and its strong continuous tracking capability can be applied to remote dynamic video monitoring to meet EV charging stations' real-time monitoring requirements.

Model comparison analysis
To demonstrate the superiority of the GhostNet-YOLOv4 model proposed in this paper, the backbone of YOLOv4 is replaced with the traditional lightweight networks CSPDarkNet53-tiny and MobileNet-v3 for comparison. In addition, the proposed algorithm is compared with YOLOv5-L and YOLOv5-S. Throughout the experiments, the same test data set is used and the settings remain constant. For horizontal comparison, some EV fire images are input to the YOLOv4, GhostNet-YOLOv4-CA, and YOLOv4-Tiny models. The input image dimensions are 608 × 608, and the confidence threshold is set to 0.5. Figures 12, 13 and 14 demonstrate the detection outcomes. As shown in Figs. 12 and 13, the detection accuracy of the improved algorithm proposed in this paper is slightly lower than that of the YOLOv4 algorithm, but the missed detections of the YOLOv4 algorithm have been resolved, as shown in Fig. 12b, c; compared with the YOLOv4-Tiny model, the detection accuracy of the improved model is the same. After integrating the CA module, the missed-detection problem is fixed and the detection box fits the target's location better, enabling real-time detection on mobile systems, as shown in Figs. 13 and 14.

Conclusion
In this paper, a machine vision-based target detection algorithm is proposed for the safety monitoring of EV charging stations to achieve real-time fire and smoke detection. The experimental comparison shows that the improved GhostNet-YOLOv4-CA model, with an mAP of 87.70%, a video detection rate of 75 FPS, and 11.436 M parameters, is comprehensively better than the YOLOv4 model; it can be applied to fire detection on low-compute mobile platforms and supports the safe operation of EV charging stations.
In future work, this study will be extended: first, expand the EV charging station fire data set, increase the number of images of burning EVs and charging piles, raise the complexity of the fire scenes, and enhance detection accuracy; second, further optimize the network model, prune the