In this section, we evaluate model performance in more depth, considering not only the standard classification metrics (accuracy, precision, recall, and F1 score) but also inference time, which is of paramount importance for real-time wildfire detection and classification.
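These four metrics can all be derived from the confusion-matrix counts. The following minimal sketch (a hypothetical helper for a binary fire / no-fire task, not code from our pipeline) shows the definitions used throughout this section:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary task
    where 1 = fire and 0 = no fire."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: six test images with one missed fire and one false alarm
acc, prec, rec, f1 = classification_metrics([1, 1, 1, 0, 0, 0],
                                            [1, 1, 0, 0, 0, 1])
```

In the example, precision and recall are both 2/3 (one false positive, one false negative), so the F1 score is also 2/3.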
Figure 5. Comparison of model metrics on the Fire, DeepFire, and FLAME datasets
Figure 5 presents a comparison of the performance of the various models across the three datasets. On the first two large-scale forest fire datasets, DINOV2 consistently leads on all metrics; on the smaller FLAME dataset it also performs comparably to the other models. The following sections describe each model's performance on the individual datasets in more detail.
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| ViT-B-16 | 93.9394% | 90.37% | 94.58% | 92.18% | 54.965 |
| ResNet-50 | 76.7677% | 88.27% | 52.08% | 47.35% | 0.904 |
| VGG-16 | 83.8384% | 91.21% | 66.67% | 70.18% | 0.265 |
| VGG-19 | 80.8081% | 89.89% | 60.42% | 61.62% | 0.014 |
| DINOV2 | 98.6050% | 98.99% | 99.34% | 97.92% | 0.111 |
Table 1. Results on the Fire dataset
On the Fire dataset, as Table 1 shows, ViT-B-16 produced promising results, with an accuracy of 93.9394% and a precision of 90.3672%. However, its inference time of 54.965 ms, though relatively swift, fell short of the efficiency needed for the most time-sensitive applications. Conversely, VGG-19, despite a more modest accuracy of 80.8081% and precision of 89.8936%, ran in just 0.014 ms per image, underscoring its potential for real-time applications. It was DINOV2, however, that emerged as the unequivocal leader on this dataset: with an outstanding accuracy of 98.605%, a precision of 98.9898%, and an inference time only 0.097 ms slower than VGG-19's, DINOV2 bridged the gap between high accuracy and high-speed inference, setting a new standard for efficiency and reliability.
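Per-image inference time can be estimated by timing repeated forward passes after a warm-up phase, as in the sketch below; `predict` is a stand-in for any model's forward pass, and the warm-up and repeat counts are illustrative rather than the settings used in our experiments:

```python
import time

def mean_inference_ms(predict, inputs, warmup=3, repeats=50):
    """Average per-input inference time in milliseconds.

    Warm-up calls are excluded so one-time costs (weight loading,
    caching, JIT compilation) do not skew the average."""
    for x in inputs[:warmup]:
        predict(x)
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict(x)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(inputs)) * 1000.0
```

Averaging over many repeats matters because a single forward pass is often shorter than the timer's resolution, especially for the sub-millisecond models in Table 1.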
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Ali et al.10 | 95.00% | 95.72% | 94.21% | 94.96% | - |
| Sousa et al.9 | 93.60% | 94.12% | 93.13% | 93.57% | - |
| Govil et al.18 | 91.20% | 94.16% | 86.00% | 89.00% | - |
| Tang et al.19 | 92.00% | - | - | - | - |
| Sun et al.20 | 94.10% | 96.98% | 90.63% | 93.70% | - |
| ViT-B-16 | 97.36% | 97.37% | 97.37% | 97.37% | 13.134 |
| ResNet-50 | 83.10% | 82.89% | 82.87% | 82.89% | 0.137 |
| VGG-16 | 97.45% | 97.37% | 97.37% | 97.37% | 0.050 |
| VGG-19 | 97.69% | 97.63% | 97.63% | 97.63% | 0.013 |
| DINOV2 | 99.22% | 99.21% | 99.21% | 100.00% | 0.097 |
Table 2. Results on the DeepFire dataset
The DeepFire dataset, known for its complex imagery and potential for misclassification, is where the prowess of DINOV2 truly shone through. As Table 2 shows, while VGG-16 and VGG-19 achieved near-perfect accuracies (97.45% and 97.69%, respectively), DINOV2 led all models with an accuracy of 99.22%; at the same time, its F1 score of 100.00%, the highest of any model, reflects an excellent balance of precision and recall, with the lowest rates of missed detections and false alarms.
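The F1 score referred to here is the harmonic mean of precision and recall, so it rewards models that keep both high rather than trading one for the other. A small sketch with illustrative values (not taken from the tables):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; low if either is low."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with balanced precision and recall scores higher than one
# with the same arithmetic mean but weak recall:
balanced = f1_score(0.70, 0.70)  # 0.70
skewed = f1_score(0.90, 0.50)    # ~0.643
```

This is why a high accuracy alone can be misleading for fire detection: a model can score well on accuracy while still missing a disproportionate share of actual fires, and the F1 score exposes that imbalance.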
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Ghali et al.12 | 85.12% | - | - | 84.77% | 0.018 |
| Xception12 | 78.41% | - | - | 78.12% | 0.002 |
| Xception11 | 76.23% | - | - | 73.90% | 0.010 |
| EfficientNet-B512 | 75.82% | - | - | 73.90% | 0.010 |
| EfficientNet-B412 | 69.93% | - | - | 65.51% | 0.008 |
| EfficientNet-B312 | 65.81% | - | - | 64.02% | 0.004 |
| EfficientNet-B212 | 66.04% | - | - | 60.71% | 0.002 |
| InceptionV312 | 80.88% | - | - | 79.53% | 0.002 |
| DenseNet16912 | 80.62% | - | - | 79.40% | 0.003 |
| MobileNetV3-Small12 | 51.64% | - | - | 44.97% | 0.001 |
| MobileNetV3-Large12 | 65.10% | - | - | 60.91% | 0.001 |
| ViT-B-16 | 99.30% | 99.28% | 99.23% | 99.25% | 3.385 |
| ResNet-50 | 99.85% | 99.82% | 99.82% | 99.83% | 0.036 |
| VGG-16 | 99.84% | 99.86% | 99.87% | 99.87% | 0.011 |
| VGG-19 | 99.89% | 99.08% | 99.90% | 99.90% | 0.012 |
| DINOV2 | 98.00% | 98.89% | 98.35% | 98.61% | 0.090 |
Table 3. Results on the FLAME dataset
The FLAME dataset rounds out the comparison. As Table 3 shows, models such as VGG-19 and ResNet-50 achieved remarkable accuracies (99.89% and 99.85%, respectively). Although DINOV2 does not match the VGG series on this dataset, most models perform very well on the comparatively simple FLAME data; what distinguishes DINOV2 is that it exhibits superior performance on the two more complex datasets while still performing well here. From the collective insights gathered across all datasets, it becomes clear that DINOV2 is not just another model in the landscape of wildfire detection. Its exceptional balance of high accuracy, precision, recall, and F1 score, coupled with its swift inference time, sets it apart. This blend of speed and accuracy positions DINOV2 as a highly capable tool for fire detection, raising the benchmarks for performance and efficiency and paving the way for real-time, reliable fire detection that can significantly mitigate risks and preserve natural and human resources.
| Dataset | Head | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- |
| Fire | MLP | 98.6% | 98.90% | 99.30% | 97.90% |
| Fire | KNN | 98.5% | 99.00% | 96.90% | 97.90% |
| DeepFire | MLP | 99.2% | 99.20% | 99.20% | 99.20% |
| DeepFire | KNN | 99.0% | 99.30% | 98.00% | 98.60% |
| FLAME | MLP | 98.0% | 98.80% | 98.30% | 98.60% |
| FLAME | KNN | 99.6% | 99.50% | 99.50% | 99.50% |
Table 4. Classification Head comparison
In the ablation experiments, we focus on the F1 score and recall as the primary metrics for analysis. Given the practical application scenarios of fire classification, it is crucial to balance the model's false alarm rate and miss rate; in particular, a high recall reduces the number of missed fires, thereby saving human resources. The results in Table 4 indicate that, in the balanced data environment of Fire, the MLP head's recall (99.30%) significantly surpasses that of the KNN head (96.90%). On the imbalanced DeepFire dataset, the MLP's recall is again higher than KNN's: it correctly detected more positive samples and reduced false negatives, a balance that is well reflected in its F1 score. On the FLAME dataset, however, KNN outperforms MLP.
Additionally, because KNN is a lazy learner, its predictions can be unduly dominated by the majority class when dealing with imbalanced datasets. In contrast, MLP can mitigate the issues arising from imbalanced data through techniques such as oversampling or cost-sensitive learning. KNN also requires choosing k and a distance metric, and the value of k significantly influences the model's performance. Overall, MLP should be our preferred choice.
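The majority-class effect of a lazy k-NN head can be seen in a toy example: with a small k the local neighbourhood decides, while a large k lets the majority class swamp the minority. The sketch below is purely illustrative and is not the classification head used in our experiments:

```python
from collections import Counter

def knn_predict(train, query, k):
    """Minimal k-NN on 1-D features: majority vote among the k
    training points nearest to the query."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Imbalanced toy set: 2 "fire" points near 1.0, 7 "no-fire" points near 0.0
train = ([(1.0, "fire"), (1.1, "fire")]
         + [(x / 10, "no-fire") for x in range(7)])

# Small k: the two nearby fire neighbours win the vote
small_k = knn_predict(train, 1.05, k=2)   # "fire"
# Large k: every training point votes, so the majority class wins
large_k = knn_predict(train, 1.05, k=9)   # "no-fire"
```

An MLP head, by contrast, learns a decision boundary at training time, so class imbalance can be counteracted before inference via resampling or loss weighting.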