4.3 Training procedure
The accuracy and speed of deep learning models can be improved during training through a variety of hyperparameters and techniques. Two of the most effective are data augmentation and transfer learning. Model performance is also influenced by other factors, including input image size, batch size, number of epochs, optimization method, learning rate, weight regularization, decay rate, and augmentation repetition.
To ensure consistency and improve performance, all models in this study used core data augmentation techniques such as scaling, smoothing, shuffling, color jittering, and flipping. Transfer learning with ImageNet-pretrained weights was applied to speed up convergence and boost accuracy. Most models were trained and validated at the default input resolution of 224x224. All models were run with the following fixed parameters: lr = 1.0e-06, base lr = 0.1, momentum = 0.9, optimizer = SGD, weight decay = 2.0e-05, warmup epochs = 5, warmup lr = 1.0e-05.
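The fixed configuration above can be collected into a small sketch. Note two assumptions that go beyond what is stated: we interpret "lr" as the final (minimum) learning rate reached by the decay schedule, and we assume a linear warmup shape, which is the common choice; the helper name `lr_at_epoch` is illustrative.

```python
# Fixed training settings as reported in Section 4.3. Only the values
# themselves are given in the text; the meaning of "lr" as the final
# learning rate and the linear warmup shape are assumptions.
CONFIG = {
    "input_size": 224,        # default input resolution (224x224)
    "optimizer": "SGD",
    "momentum": 0.9,
    "lr": 1.0e-06,            # assumed: final (minimum) learning rate
    "lr_base": 0.1,           # base learning rate after warmup
    "weight_decay": 2.0e-05,
    "warmup_epochs": 5,
    "warmup_lr": 1.0e-05,
}

def lr_at_epoch(epoch: int, cfg: dict = CONFIG) -> float:
    """Linear warmup from warmup_lr to lr_base over the warmup epochs."""
    if epoch < cfg["warmup_epochs"]:
        frac = epoch / cfg["warmup_epochs"]
        return cfg["warmup_lr"] + frac * (cfg["lr_base"] - cfg["warmup_lr"])
    return cfg["lr_base"]
```

After the warmup epochs, the learning rate would decay from `lr_base` toward `lr` according to the (unstated) decay schedule.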
4.4 Results
In this section, we present the results obtained from both CNN and vision transformer models on two distinct datasets: PlantVillage and Grapevine. These datasets encompass a wide range of grape leaf diseases and varieties, providing a comprehensive evaluation of the models' performance in different agricultural contexts.
The CNN models were trained and fine-tuned on the PlantVillage dataset, which contains four grape leaf classes: Black Rot, Esca, Leaf Blight, and Healthy. Evaluation on this dataset demonstrated the models' ability to distinguish accurately between grape diseases, with high accuracy, precision, recall, and F1-scores, showcasing the effectiveness of CNN models in detecting and classifying grape leaf diseases and providing valuable insights for disease management and prevention in vineyards.
In addition to the CNN models, we evaluated vision transformer models on the Grapevine dataset, which covers five grape leaf varieties: Ak, Ala_Idris, Buzgulu, Dimnit, and Nazli. Leveraging self-attention mechanisms, the vision transformer models recognized and classified the grape leaf varieties with remarkable accuracy, precision, recall, and F1-scores, indicating their potential for reliable grape leaf recognition and characterization.
4.4.1 Results for the PlantVillage dataset
In this section, we conducted a thorough comparison between the widely used Convolutional Neural Network (CNN) architecture and the emerging Vision Transformer (ViT)-based models. The objective was to assess their performance and effectiveness in plant disease recognition and classification using the PlantVillage dataset. The comparison results are presented in Table 3 for the CNN models and Table 4 for the ViT models.
Table 3
Results for CNN models on PlantVillage dataset
Model | Accuracy | Precision | Recall | F1-Score |
VGG-13 | 0.9967 | 0.9974 | 0.9974 | 0.9974 |
VGG-16 | 1 | 1 | 1 | 1 |
VGG-19 | 1 | 1 | 1 | 1 |
ResNet-18 | 0.9902 | 0.9922 | 0.9922 | 0.9922 |
ResNet-34 | 0.9918 | 0.9936 | 0.9934 | 0.9935 |
ResNet-50 | 0.9951 | 0.9960 | 0.9962 | 0.9961 |
ResNet-101 | 0.9951 | 0.9960 | 0.9962 | 0.9960 |
Xception | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
Inception-v4 | 0.9976 | 0.9971 | 0.9976 | 0.9973 |
EfficientNetV2-S | 0.9951 | 0.9962 | 0.9960 | 0.9961 |
EfficientNetV2-M | 0.9967 | 0.9950 | 0.9972 | 0.9960 |
EfficientNetV2-L | 1 | 1 | 1 | 1 |
DenseNet-121 | 1 | 1 | 1 | 1 |
DenseNet-169 | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
The CNN models evaluated on the PlantVillage dataset identified and classified grape leaf diseases with impressive accuracy. The VGG, ResNet, Xception, Inception-v4, EfficientNetV2, and DenseNet variants all achieved accuracy scores between 99.02% and 100%. For example, VGG-13, VGG-16, and VGG-19 reached 99.67%, 100%, and 100% respectively, while ResNet-18, ResNet-34, and ResNet-50 reached 99.02%, 99.18%, and 99.51%. Xception achieved an outstanding 99.84%, and Inception-v4 reached 99.76%.
Furthermore, the EfficientNetV2 models (S, M, and L) consistently scored between 99.51% and 100%, and the DenseNet models, DenseNet-121 and DenseNet-169, reached 100% and 99.84% respectively. These results underline the capability of CNN models to classify grape leaf diseases accurately and indicate their potential in automated systems for disease detection and classification, supporting efficient disease management in agriculture.
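The four metrics reported in Tables 3-6 can be reproduced from a model's test-set predictions as follows. This is a minimal sketch assuming macro averaging over classes (the averaging mode is not stated in the text); the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1-score.
    Macro averaging (unweighted mean over classes) is an assumption."""
    labels = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    k = len(labels)
    precision = sum(p for p, _, _ in per_class) / k
    recall = sum(r for _, r, _ in per_class) / k
    f1 = sum(f for _, _, f in per_class) / k
    return accuracy, precision, recall, f1
```

Note that under macro averaging the mean F1-score is not in general the harmonic mean of the macro precision and recall, which is consistent with rows of the tables where F1 differs slightly from both.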
Table 4
Results for vision transformer models on PlantVillage dataset
Model | Accuracy | Precision | Recall | F1-Score |
SwinV2-tiny-win8 | 0.9935 | 0.9946 | 0.9946 | 0.9946 |
SwinV2-small-win8 | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
SwinV2-base-win8 | 1 | 1 | 1 | 1 |
MobileViT-xxs | 0.9935 | 0.9941 | 0.9948 | 0.9944 |
MobileViT-xs | 0.9935 | 0.9945 | 0.9952 | 0.9948 |
MobileViT-s | 0.9967 | 0.9972 | 0.9976 | 0.9974 |
ViT-tiny-patch16 | 0.9918 | 0.9931 | 0.9931 | 0.9931 |
ViT-small-patch32 | 0.9918 | 0.9929 | 0.9938 | 0.9933 |
ViT-base-patch16 | 0.9967 | 0.9974 | 0.9974 | 0.9974 |
ViT-large-patch16 | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
DeiT3-small | 0.9935 | 0.9946 | 0.9946 | 0.9946 |
DeiT3-medium | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
DeiT3-base | 1 | 1 | 1 | 1 |
MaxViT-tiny | 0.9967 | 0.9972 | 0.9976 | 0.9974 |
MaxViT-small | 1 | 1 | 1 | 1 |
MaxViT-base | 0.9984 | 0.9986 | 0.9988 | 0.9987 |
MaxViT-large | 1 | 1 | 1 | 1 |
The results for the vision transformer models on the PlantVillage dataset, shown in Table 4, reveal impressive performance in plant disease detection and classification. Several models attained perfect scores on every metric: SwinV2-base-win8, DeiT3-base, MaxViT-small, and MaxViT-large all reached 100% accuracy, precision, recall, and F1-score, demonstrating the robust capabilities of vision transformers in identifying and classifying plant diseases. Other models, such as SwinV2-small-win8, MobileViT-s, ViT-base-patch16, and ViT-large-patch16, achieved accuracy scores between 99.67% and 99.84%, with comparably strong precision, recall, and F1-scores. The consistently high accuracy of the vision transformers highlights their effectiveness at handling complex visual patterns and capturing important dependencies within the plant images.
The comparative analysis on the PlantVillage dataset suggests that vision transformer models are highly suitable for plant disease recognition, surpassing 99% accuracy in every case. Their self-attention mechanisms capture salient visual features, enabling accurate identification and classification of plant diseases. These findings have practical implications for agriculture: such models can support early disease detection, prompt intervention, and efficient disease management, helping farmers and agricultural professionals improve crop productivity and minimize losses. Further research is warranted to explore the generalizability of vision transformers across different datasets and plant species, paving the way for their adoption in precision agriculture and sustainable crop management. The confusion matrices of the models with the highest and lowest accuracy are provided in Fig. 7, with the high-accuracy models shown at the top in blue and the low-accuracy models at the bottom in red.
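The self-attention mechanism referred to above can be illustrated with a minimal single-head sketch. The weight matrices here are hypothetical placeholders; real ViT blocks add multi-head projections, residual connections, and layer normalization.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal single-head scaled dot-product self-attention over a
    sequence of patch embeddings X (illustrative sketch only)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # pairwise patch affinities
    # Numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                        # attention-weighted mixture
```

Each output row is a convex combination of all value vectors, which is how a transformer lets every patch attend to every other patch in the leaf image.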
The VGG16 model achieved 177 correct classifications in the black_rot class, 208 in esca, 64 in healthy, and 162 in leaf_blight. The ResNet18 model achieved 174 correct classifications in black_rot, 205 in esca, 64 in healthy, and 162 in leaf_blight. Both models therefore identify plant diseases with high reliability.
The MaxViT-Small-TF-224 model likewise achieved 177 correct classifications in black_rot, 208 in esca, 64 in healthy, and 162 in leaf_blight, demonstrating its effectiveness in classifying plant diseases. Finally, the ViT-Tiny-Patch16-224 model achieved 173 correct classifications in black_rot, 207 in esca, 64 in healthy, and 162 in leaf_blight, confirming that it too classifies plant diseases accurately.
All of these models achieved high accuracy rates on the PlantVillage dataset and have shown that they can classify plant diseases reliably. In conclusion, both CNN and ViT-based models perform strongly on this dataset and prove to be effective deep learning tools for plant disease recognition.
4.4.2 Results for the Grapevine dataset
In this section, we conducted a comprehensive comparison between widely used convolutional neural network (CNN) architectures and emerging vision transformer (ViT)-based models on the Grapevine dataset, focusing on the detection and classification of grape leaves. We trained and tested a range of CNN and ViT models on this dataset, which comprises five leaf classes: Ak, Ala_Idris, Buzgulu, Dimnit, and Nazli. The models were fine-tuned and evaluated to assess their accuracy and effectiveness in identifying the different grape leaf varieties. The detailed results are given in Table 5 for the CNN models and Table 6 for the ViT models, covering accuracy, precision, recall, and F1-score for each model.
Table 5
Grapevine Dataset CNN Models Results
Model | Accuracy | Precision | Recall | F1-Score |
VGG-13 | 0.9600 | 0.9631 | 0.9600 | 0.9604 |
VGG-16 | 0.9733 | 0.9733 | 0.9733 | 0.9733 |
VGG-19 | 0.9733 | 0.9765 | 0.9733 | 0.9737 |
ResNet-18 | 0.9467 | 0.9467 | 0.9467 | 0.9452 |
ResNet-34 | 0.8800 | 0.8836 | 0.8800 | 0.8815 |
ResNet-50 | 0.8533 | 0.8720 | 0.8533 | 0.8540 |
ResNet-101 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
Xception | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
Inception-v4 | 1 | 1 | 1 | 1 |
EfficientNetV2-S | 0.9733 | 0.9750 | 0.9733 | 0.9733 |
EfficientNetV2-M | 0.9733 | 0.9750 | 0.9733 | 0.9733 |
EfficientNetV2-L | 0.9733 | 0.9750 | 0.9733 | 0.9733 |
DenseNet-121 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
DenseNet-169 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
Among the CNN models, VGG-13, VGG-16, and VGG-19 performed competitively, with accuracy between 96.00% and 97.33%. The ResNet models varied widely: ResNet-101 led at 98.67%, while ResNet-34 and ResNet-50 fell to 88.00% and 85.33%. Xception and Inception-v4 also performed strongly, at 98.67% and 100% respectively. The EfficientNetV2 models all reached 97.33%, and the DenseNet models 98.67%.
These findings suggest that the choice of architecture significantly affects grape leaf recognition performance. Models able to capture intricate features, such as Xception, Inception-v4, EfficientNetV2, and DenseNet, performed best, while ResNet-101 showed that deeper ResNet variants can also achieve exceptional accuracy.
Table 6
Grapevine Dataset ViT Models Results
Model | Accuracy | Precision | Recall | F1-Score |
SwinV2-tiny-win8 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
SwinV2-small-win8 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
SwinV2-base-win8 | 1 | 1 | 1 | 1 |
MobileViT-xxs | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
MobileViT-xs | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
MobileViT-s | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
ViT-tiny-patch16 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
ViT-small-patch32 | 0.9733 | 0.9765 | 0.9733 | 0.9737 |
ViT-base-patch16 | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
ViT-large-patch16 | 0.9600 | 0.9640 | 0.9600 | 0.9599 |
DeiT3-small | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
DeiT3-medium | 0.9867 | 0.9875 | 0.9867 | 0.9867 |
DeiT3-base | 0.9733 | 0.9750 | 0.9733 | 0.9733 |
MaxViT-tiny | 0.9467 | 0.9489 | 0.9467 | 0.9466 |
MaxViT-small | 0.9600 | 0.9631 | 0.9600 | 0.9599 |
MaxViT-base | 0.9600 | 0.9607 | 0.9600 | 0.9595 |
MaxViT-large | 0.9600 | 0.9640 | 0.9600 | 0.9540 |
Among the evaluated ViT models, SwinV2-base-win8 achieved a perfect 100% on every metric, while SwinV2-tiny-win8, SwinV2-small-win8, the three MobileViT variants, ViT-tiny-patch16, ViT-base-patch16, DeiT3-small, and DeiT3-medium all reached 98.67% accuracy, consistently demonstrating strong capabilities in identifying and classifying grape leaf varieties.
ViT-small-patch32 and DeiT3-base achieved slightly lower accuracy of 97.33%, though still with competitive precision, recall, and F1-scores, highlighting their effectiveness in grape leaf recognition tasks. ViT-large-patch16 and the MaxViT variants were weaker, ranging from 94.67% (MaxViT-tiny) to 96.00%, with moderate precision, recall, and F1-scores, indicating potential for grape leaf recognition but with room for improvement.
Overall, these models achieved high accuracy and strong classification performance on the Grapevine dataset, although computational efficiency and model complexity should also be weighed when selecting a model for a specific application. The confusion matrices of the models with the highest and lowest accuracy on the Grapevine dataset are provided in Fig. 8, with the high-accuracy models shown at the top in blue and the low-accuracy models at the bottom in red.
Among the CNN-based models, only Inception-v4 achieved 15 correct classifications in each of the Ak, Ala_Idris, Buzgulu, Dimnit, and Nazli classes; that is, every class was recognized perfectly. ResNet-50, the weakest CNN-based model, achieved 13 correct classifications for Ak, 11 for Ala_Idris, 13 for Buzgulu, 15 for Dimnit, and 12 for Nazli, with the remaining samples misclassified.
Among the ViT-based models, only SwinV2-base-win8 classified all 15 samples of every class correctly. MaxViT-tiny, the weakest ViT-based model, achieved 15 correct classifications for Ak, 14 for Ala_Idris, 14 for Buzgulu, 15 for Dimnit, and 13 for Nazli, with the remainder misclassified.
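The per-class counts quoted above are the diagonal entries of the confusion matrices in Fig. 8. A minimal sketch of how such a matrix and its diagonal are computed (the function names are illustrative):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the true class, columns the predicted class;
    diagonal entries are the per-class correct counts."""
    index = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[index[t]][index[p]] += 1
    return m

def correct_per_class(m, labels):
    """Extract the diagonal as a {class: correct-count} mapping."""
    return {c: m[i][i] for i, c in enumerate(labels)}
```

With the Grapevine test split of 15 samples per class, a perfectly diagonal matrix (15 in every diagonal cell) corresponds to the 100% accuracy reported for Inception-v4 and SwinV2-base-win8.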
These results demonstrate that the ViT-based models generally performed well on the Grapevine dataset. SwinV2-base-win8 achieved the highest correct classification rates among the models examined; notably, it also produced excellent results on the PlantVillage dataset. The other models likewise achieved successful results overall.
4.5 Comparison with SOTA methods
We compared the performance of the proposed deep learning-based strategy for grape leaf disease detection and classification against state-of-the-art (SOTA) approaches in the field. As Table 7 shows, our approach achieves competitive results and advances the current understanding of grape leaf recognition.
Table 7
Proposed approach over state-of-the-art methods
Author | Year | Method | Dataset | Accuracy (%) | F1-score (%) |
Koklu et al. | 2022 | CNN and SVM | Grapevine | 97.60 | 97.60 |
Proposed approach | 2023 | CNN + ViT | Grapevine | 100 | 100 |
Rao et al. | 2021 | CNN | PlantVillage | 99.03 | N/A |
Adeel et al. | 2020 | SVM | PlantVillage | 97.80 | 97.62 |
Yeswanth et al. | 2023 | CNN | PlantVillage | 99.37 | N/A |
Tang et al. | 2020 | CNN | PlantVillage | 99.01 | N/A |
Proposed approach | 2023 | CNN + ViT | PlantVillage | 100 | 100 |
Table 7 provides a comparison of the proposed approach with state-of-the-art (SOTA) methods in grape leaf recognition and classification. The comparison includes the authors, year of publication, method used, dataset employed, and the accuracy and F1-score achieved by each method.
Koklu et al. (2022) combined CNN and SVM techniques on the Grapevine dataset, achieving an accuracy and F1-score of 97.60%, one of the SOTA results for grape leaf recognition on this dataset. In contrast, the proposed approach (2023), combining CNN and ViT models on the same Grapevine dataset, achieved a perfect accuracy and F1-score of 100%, a significant advance over previous methods.
On the PlantVillage dataset, Rao et al. (2021) used CNN-based models and achieved 99.03% accuracy. Adeel et al. (2020) applied SVM techniques, reaching an accuracy of 97.80% and an F1-score of 97.62%. Yeswanth et al. (2023) and Tang et al. (2020) both used CNN models, with accuracies of 99.37% and 99.01% respectively. In comparison, the proposed approach (2023), using CNN and ViT models on the PlantVillage dataset, achieved a perfect accuracy and F1-score of 100%, outperforming all previous SOTA methods on this dataset. Overall, the combination of CNN and ViT models achieves perfect accuracy and F1-score on both the Grapevine and PlantVillage datasets, demonstrating the effectiveness of the proposed approach over existing state-of-the-art methods in grape leaf recognition tasks.