In this section, we evaluate model performance in more depth, considering not only the standard classification metrics (accuracy, precision, recall, and F1 score) but also inference time, which is of paramount importance for real-time wildfire detection and classification.
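These four metrics can all be derived from the confusion-matrix counts. The following minimal sketch (a hypothetical helper for a binary fire / no-fire task, not code from our pipeline) shows the definitions used throughout this section:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for a binary task
    where 1 = fire and 0 = no fire."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Toy example: six test images with one missed fire and one false alarm
acc, prec, rec, f1 = classification_metrics([1, 1, 1, 0, 0, 0],
                                            [1, 1, 0, 0, 0, 1])
```

In the example, precision and recall are both 2/3 (one false positive, one false negative), so the F1 score is also 2/3.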
Figure 5. Comparison of model metrics on the Fire, DeepFire, and FLAME datasets
Figure 5 presents a comparison of the performance of the various models across the three datasets. On the first two large-scale forest fire datasets, DINOV2 consistently leads on all metrics; on the smaller FLAME dataset it also performs comparably to the other models. The following sections describe each model's performance on the individual datasets in more detail.
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| ViT-B-16 | 93.9394% | 90.37% | 94.58% | 92.18% | 54.965 |
| ResNet-50 | 76.7677% | 88.27% | 52.08% | 47.35% | 0.904 |
| VGG-16 | 83.8384% | 91.21% | 66.67% | 70.18% | 0.265 |
| VGG-19 | 80.8081% | 89.89% | 60.42% | 61.62% | 0.014 |
| DINOV2 | 98.6050% | 98.99% | 99.34% | 97.92% | 0.111 |
Table 1. Results on the Fire dataset
On the Fire dataset, as Table 1 shows, ViT-B-16 produced promising results, with an accuracy of 93.9394% and a precision of 90.3672%. However, its inference time of 54.965 ms, though relatively swift, fell short of the efficiency needed for the most time-sensitive applications. Conversely, VGG-19, despite a more modest accuracy of 80.8081% and precision of 89.8936%, ran in just 0.014 ms per image, underscoring its potential for real-time applications. It was DINOV2, however, that emerged as the unequivocal leader on this dataset: with an outstanding accuracy of 98.605%, a precision of 98.9898%, and an inference time only 0.097 ms slower than VGG-19's, DINOV2 bridged the gap between high accuracy and high-speed inference, setting a new standard for efficiency and reliability.
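Per-image inference time can be estimated by timing repeated forward passes after a warm-up phase, as in the sketch below; `predict` is a stand-in for any model's forward pass, and the warm-up and repeat counts are illustrative rather than the settings used in our experiments:

```python
import time

def mean_inference_ms(predict, inputs, warmup=3, repeats=50):
    """Average per-input inference time in milliseconds.

    Warm-up calls are excluded so one-time costs (weight loading,
    caching, JIT compilation) do not skew the average."""
    for x in inputs[:warmup]:
        predict(x)
    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            predict(x)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(inputs)) * 1000.0
```

Averaging over many repeats matters because a single forward pass is often shorter than the timer's resolution, especially for the sub-millisecond models in Table 1.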
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Ali et al.10 | 95.00% | 95.72% | 94.21% | 94.96% | - |
| Sousa et al.9 | 93.60% | 94.12% | 93.13% | 93.57% | - |
| Govil et al.18 | 91.20% | 94.16% | 86.00% | 89.00% | - |
| Tang et al.19 | 92.00% | - | - | - | - |
| Sun et al.20 | 94.10% | 96.98% | 90.63% | 93.70% | - |
| ViT-B-16 | 97.36% | 97.37% | 97.37% | 97.37% | 13.134 |
| ResNet-50 | 83.10% | 82.89% | 82.87% | 82.89% | 0.137 |
| VGG-16 | 97.45% | 97.37% | 97.37% | 97.37% | 0.050 |
| VGG-19 | 97.69% | 97.63% | 97.63% | 97.63% | 0.013 |
| DINOV2 | 99.22% | 99.21% | 99.21% | 100.00% | 0.097 |
Table 2. Results on the DeepFire dataset
The DeepFire dataset, known for its complex imagery and potential for misclassification, is where the prowess of DINOV2 truly shone through. As Table 2 shows, while VGG-16 and VGG-19 achieved near-perfect accuracies (97.45% and 97.69%, respectively), DINOV2 led all models with an accuracy of 99.22%; at the same time, its F1 score of 100.00%, the highest of any model, reflects an excellent balance of precision and recall, with the lowest rates of missed detections and false alarms.
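The F1 score referred to here is the harmonic mean of precision and recall, so it rewards models that keep both high rather than trading one for the other. A small sketch with illustrative values (not taken from the tables):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; low if either is low."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model with balanced precision and recall scores higher than one
# with the same arithmetic mean but weak recall:
balanced = f1_score(0.70, 0.70)  # 0.70
skewed = f1_score(0.90, 0.50)    # ~0.643
```

This is why a high accuracy alone can be misleading for fire detection: a model can score well on accuracy while still missing a disproportionate share of actual fires, and the F1 score exposes that imbalance.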
| Model | Accuracy | Precision | Recall | F1 Score | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| Ghali et al.12 | 85.12% | - | - | 84.77% | 0.018 |
| Xception12 | 78.41% | - | - | 78.12% | 0.002 |
| Xception11 | 76.23% | - | - | 73.90% | 0.010 |
| EfficientNet-B512 | 75.82% | - | - | 73.90% | 0.010 |
| EfficientNet-B412 | 69.93% | - | - | 65.51% | 0.008 |
| EfficientNet-B312 | 65.81% | - | - | 64.02% | 0.004 |
| EfficientNet-B212 | 66.04% | - | - | 60.71% | 0.002 |
| InceptionV312 | 80.88% | - | - | 79.53% | 0.002 |
| DenseNet16912 | 80.62% | - | - | 79.40% | 0.003 |
| MobileNetV3-Small12 | 51.64% | - | - | 44.97% | 0.001 |
| MobileNetV3-Large12 | 65.10% | - | - | 60.91% | 0.001 |
| ViT-B-16 | 99.30% | 99.28% | 99.23% | 99.25% | 3.385 |
| ResNet-50 | 99.85% | 99.82% | 99.82% | 99.83% | 0.036 |
| VGG-16 | 99.84% | 99.86% | 99.87% | 99.87% | 0.011 |
| VGG-19 | 99.89% | 99.08% | 99.90% | 99.90% | 0.012 |
| DINOV2 | 98.00% | 98.89% | 98.35% | 98.61% | 0.090 |
Table 3. Results on the FLAME dataset
The FLAME dataset rounds out the comparison. As Table 3 shows, models such as VGG-19 and ResNet-50 achieved remarkable accuracies (99.89% and 99.85%, respectively). Although DINOV2 does not match the VGG series on this dataset, most models perform very well on the comparatively simple FLAME data; what distinguishes DINOV2 is that it exhibits superior performance on the two more complex datasets while still performing well here. From the collective insights gathered across all datasets, it becomes clear that DINOV2 is not just another model in the landscape of wildfire detection. Its exceptional balance of high accuracy, precision, recall, and F1 score, coupled with its swift inference time, sets it apart. This blend of speed and accuracy positions DINOV2 as a highly capable tool for fire detection, raising the benchmarks for performance and efficiency and paving the way for real-time, reliable fire detection that can significantly mitigate risks and preserve natural and human resources.
| Dataset | Head | Accuracy | Precision | Recall | F1 Score |
| --- | --- | --- | --- | --- | --- |
| Fire | MLP | 98.6% | 98.90% | 99.30% | 97.90% |
| Fire | KNN | 98.5% | 99.00% | 96.90% | 97.90% |
| DeepFire | MLP | 99.2% | 99.20% | 99.20% | 99.20% |
| DeepFire | KNN | 99.0% | 99.30% | 98.00% | 98.60% |
| FLAME | MLP | 98.0% | 98.80% | 98.30% | 98.60% |
| FLAME | KNN | 99.6% | 99.50% | 99.50% | 99.50% |
Table 4. Classification Head comparison
In the ablation experiments, we focus on the F1 score and recall as the primary metrics for analysis. Given the practical application scenarios of fire classification, it is crucial to balance the model's false alarm rate and miss rate; in particular, a high recall reduces the number of missed fires, thereby saving human resources. The results in Table 4 indicate that, in the balanced data environment of Fire, the MLP head's recall (99.30%) significantly surpasses that of the KNN head (96.90%). On the imbalanced DeepFire dataset, the MLP's recall is again higher than KNN's: it correctly detected more positive samples and reduced false negatives, a balance that is well reflected in its F1 score. On the FLAME dataset, however, KNN outperforms MLP.
Additionally, because KNN is a lazy learner, its predictions can be unduly dominated by the majority class when dealing with imbalanced datasets. In contrast, MLP can mitigate the issues arising from imbalanced data through techniques such as oversampling or cost-sensitive learning. KNN also requires choosing k and a distance metric, and the value of k significantly influences the model's performance. Overall, MLP should be our preferred choice.
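The majority-class effect of a lazy k-NN head can be seen in a toy example: with a small k the local neighbourhood decides, while a large k lets the majority class swamp the minority. The sketch below is purely illustrative and is not the classification head used in our experiments:

```python
from collections import Counter

def knn_predict(train, query, k):
    """Minimal k-NN on 1-D features: majority vote among the k
    training points nearest to the query."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Imbalanced toy set: 2 "fire" points near 1.0, 7 "no-fire" points near 0.0
train = ([(1.0, "fire"), (1.1, "fire")]
         + [(x / 10, "no-fire") for x in range(7)])

# Small k: the two nearby fire neighbours win the vote
small_k = knn_predict(train, 1.05, k=2)   # "fire"
# Large k: every training point votes, so the majority class wins
large_k = knn_predict(train, 1.05, k=9)   # "no-fire"
```

An MLP head, by contrast, learns a decision boundary at training time, so class imbalance can be counteracted before inference via resampling or loss weighting.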