2.1 Dataset
The tomato leaf disease dataset used in this study was obtained from the Kaggle platform. It comprises ten categories in total: nine tomato leaf diseases, namely bacterial spot, early blight, late blight, leaf mold, Septoria leaf spot, two-spotted spider mite, target spot, yellow leaf curl virus, and mosaic virus, together with a class of healthy leaves. The ten categories of tomato leaves are shown in Figure 1.
The dataset consists of 11,000 JPG images, each 256x256 pixels. The images were randomly partitioned into a training set and a test set at a ratio of 10:1. Table 1 shows the distribution of the dataset. The training set has 10 classes of 1,000 images each, for a total of 10,000 images; each class in the test set contains 100 images, for a total of 1,000 images.
Table 1 The distribution of ten different types of tomato leaf diseases in the training and testing datasets.
| Disease | Number of images for training | Number of images for testing |
| --- | --- | --- |
| Bacterial spot | 1000 | 100 |
| Early blight | 1000 | 100 |
| Late blight | 1000 | 100 |
| Leaf mold | 1000 | 100 |
| Septoria leaf spot | 1000 | 100 |
| Two-spotted spider mite | 1000 | 100 |
| Target spot | 1000 | 100 |
| Yellow leaf curl virus | 1000 | 100 |
| Mosaic virus | 1000 | 100 |
| Healthy | 1000 | 100 |
| Total | 10000 | 1000 |
2.2 Data augmentation
Data augmentation generates new training samples by applying a series of random transformations to the original data. The goal of this strategy is to increase the amount and variety of the dataset, thereby improving the model's generalization and robustness. Images collected in practical applications frequently contain complicated backgrounds and many kinds of interference, so maintaining high accuracy under such challenging conditions is critical for a model. To this end, data augmentation can be used to boost the dataset's diversity and difficulty. This study employed a range of augmentation techniques, including the addition of Gaussian noise, adjustment of image brightness, application of motion blur, horizontal flipping, and random occlusion. These operations replicate the complexity and variety observed in real-world conditions and strengthen the model's ability to adapt to complex environments and diverse disturbances. Figure 2 and Figure 3 illustrate the dataset before and after applying data augmentation.
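Several of these augmentations can be sketched directly on image arrays. The following is a minimal illustration (assuming 256x256 uint8 RGB arrays; the function names, parameter values, and use of NumPy are illustrative, not the study's actual pipeline, and motion blur is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(img, sigma=10.0):
    """Add zero-mean Gaussian noise, clipping back to the valid [0, 255] range."""
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def adjust_brightness(img, factor=1.2):
    """Scale pixel intensities to brighten (factor > 1) or darken (factor < 1)."""
    return np.clip(img.astype(np.float64) * factor, 0, 255).astype(np.uint8)

def horizontal_flip(img):
    """Mirror the image left-to-right."""
    return img[:, ::-1].copy()

def random_occlusion(img, size=32):
    """Zero out a randomly placed square patch to simulate occlusion."""
    h, w = img.shape[:2]
    y = int(rng.integers(0, h - size))
    x = int(rng.integers(0, w - size))
    out = img.copy()
    out[y:y + size, x:x + size] = 0
    return out

# Apply each augmentation to a synthetic test image.
img = rng.integers(0, 256, (256, 256, 3), dtype=np.uint8)
augmented = [f(img) for f in (add_gaussian_noise, adjust_brightness,
                              horizontal_flip, random_occlusion)]
```

Each transform preserves the image shape, so augmented samples can be mixed freely with the originals during training.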
2.3 Network model
In 2015, Kaiming He et al. of Microsoft Research proposed ResNet [19], a deep neural network architecture. Residual blocks are incorporated to ease training and allow greater depth in the architecture. Shortcut connections that span several convolutional layers within the residual blocks enable the network to efficiently learn the residual between a layer's output and its input. This mechanism effectively mitigates the vanishing- and exploding-gradient problems commonly seen in excessively deep networks. Hence, ResNet can significantly enhance the precision and efficiency of leaf disease image identification. For this study, the ResNet-50 model was chosen as the backbone network due to its suitable depth and complexity. The network architecture of ResNet-50 is shown in Figure 4. It comprises a stem module, followed by four consecutive bottleneck groups, and culminates in a fully connected layer.
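The residual idea can be illustrated with a toy fully-connected block (a minimal sketch, not the actual convolutional bottleneck of ResNet-50; the weight shapes are hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x): the block learns only the residual F(x),
    while the shortcut carries the input x forward unchanged."""
    f = relu(x @ w1) @ w2   # residual branch F(x)
    return relu(f + x)      # shortcut connection adds the input back

# With zero weights the residual branch vanishes and the block
# reduces to the identity mapping (for non-negative inputs) --
# this is why very deep stacks of such blocks remain trainable.
x = np.array([1.0, 2.0, 3.0])
w_zero = np.zeros((3, 3))
y = residual_block(x, w_zero, w_zero)
```

The shortcut also gives gradients a direct path to earlier layers, which is the mechanism behind the mitigation of vanishing gradients described above.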
2.4 Replacement of the stem structure
The stem module forms the beginning of a network model. In ResNet-50, the stem consists of a 7x7 convolutional layer followed by a max pooling layer. In images of tomato leaf disease, however, the affected areas tend to occupy only a small fraction of the overall image, and the expansive receptive field of a 7x7 kernel may discard important fine-grained information during feature extraction. In this study, we replace the original stem with two stacked 3x3 convolutional kernels and an additional reduction module [20]. The smaller kernels offer stronger local feature extraction and detail capture, and stacking several small kernels introduces more non-linear responses, increasing the model's expressive ability. Within the reduction module, the extracted features are fed into two parallel branches. One branch consists of convolutional layers, which extract local features from the input feature maps; the other passes through pooling layers, extracting higher-level global features via downsampling. The two branches are then combined by concatenation along the channel dimension, yielding a richer and more varied representation. At the same time, the new stem compresses the input feature maps to a reduced size while preserving abundant feature information, making the model more lightweight without sacrificing its classification performance. The revised structure is depicted in Figure 5.
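The spatial bookkeeping of the two stems can be checked with the standard output-size formula. The sketch below assumes a 224x224 input, stride-2 first convolution, and stride-2 reduction branches; these settings are illustrative assumptions, since the paper does not list the exact strides and paddings:

```python
def conv_out(size, kernel, stride, pad=0):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

# Original ResNet-50 stem: 7x7 conv (stride 2) then 3x3 max pool (stride 2).
s_orig = conv_out(224, 7, 2, pad=3)        # -> 112
s_orig = conv_out(s_orig, 3, 2, pad=1)     # -> 56

# Replacement stem: two stacked 3x3 convs, then the reduction module.
s = conv_out(224, 3, 2, pad=1)             # first 3x3 conv, stride 2 -> 112
s = conv_out(s, 3, 1, pad=1)               # second 3x3 conv, stride 1 -> 112
branch_conv = conv_out(s, 3, 2, pad=1)     # reduction branch A: 3x3 conv, stride 2
branch_pool = conv_out(s, 3, 2, pad=1)     # reduction branch B: 3x3 max pool, stride 2

# Both branches produce maps of the same spatial size, so they can be
# concatenated on the channel axis; e.g. 32 + 32 channels -> 64 channels.
channels = 32 + 32
```

Under these assumptions both stems deliver 56x56 feature maps to the first bottleneck group, so the replacement is a drop-in change.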
2.5 Attention mechanism and depth-wise separable convolution
Disease spots on tomato leaves have irregular and varied shapes. Furthermore, different diseases can present visually similar symptoms on the leaf surface. These difficulties make recognition challenging for the model. Attention mechanisms ease this by assigning different weights to input features, allowing the model to selectively attend to important information while ignoring irrelevant details. Notably, this method adds few parameters, which improves the model's performance and efficiency.
During the 2019 Conference on Computer Vision and Pattern Recognition (CVPR), Wang et al. proposed the ECA module [21]. The approach employs one-dimensional convolution to enable the exchange of information between channels. The size of the convolutional kernel k is determined adaptively from the channel dimension C by the following formula:

k = ψ(C) = |log₂(C)/γ + b/γ|_odd

where |·|_odd denotes rounding to the nearest odd number, and γ and b are hyperparameters (set to 2 and 1, respectively, in [21]).
The ECA module applies global average pooling (GAP) to the input feature map, producing a one-dimensional vector. This vector is then processed by a one-dimensional convolutional layer whose kernel size is chosen by the adaptive function. The resulting weights are multiplied element-wise with the input feature map, yielding a new feature map with channel-specific weighting.
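The full ECA pipeline can be sketched in a few lines (a minimal NumPy version for a single feature map; the adaptive kernel-size rule follows the reference implementation of [21] with γ=2, b=1, while the convolution weights here are placeholders rather than learned values):

```python
import numpy as np

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive kernel size: k = |log2(C)/gamma + b/gamma|, rounded up to odd."""
    t = int(abs((np.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def eca(x, weights):
    """x: (C, H, W) feature map; weights: 1D conv kernel of odd length k."""
    c = x.shape[0]
    y = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    k = len(weights)
    yp = np.pad(y, k // 2)                  # zero-pad so output stays length C
    conv = np.array([yp[i:i + k] @ weights  # 1D conv: local cross-channel interaction
                     for i in range(c)])
    w = sigmoid(conv)                       # per-channel attention weights in (0, 1)
    return x * w[:, None, None]             # re-weight each channel of the input

x = np.ones((64, 8, 8))
k = eca_kernel_size(64)                     # 64 channels -> k = 3
out = eca(x, np.full(k, 0.1))               # placeholder (untrained) kernel
```

Because the 1D convolution only mixes each channel with its k neighbors, the module costs just k parameters, which is why ECA is considered nearly free in parameter terms.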
Depth-wise separable convolution [22] comprises two distinct components: depth-wise convolution and point-wise convolution. In depth-wise convolution, each channel is convolved with its own kernel, and the resulting outputs are then concatenated. Point-wise convolution applies a 1x1 convolutional kernel across the channels of the input. The former is better at capturing the spatial properties of the feature map, whereas the latter is more proficient at capturing features specific to individual channels. Compared with standard convolution, depth-wise separable convolution extracts intricate features and captures correlations within the feature space while using far fewer parameters.
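The parameter savings of the factorization are easy to quantify (a general-case sketch with an illustrative 512-channel, 3x3 example; biases are ignored for simplicity):

```python
def standard_conv_params(c_in, c_out, k):
    """A standard conv learns c_out kernels, each spanning all c_in channels."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise: one k x k kernel per input channel.
    Point-wise: a 1x1 conv that mixes channels into c_out outputs."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Example: 512 -> 512 channels with a 3x3 kernel.
std = standard_conv_params(512, 512, 3)        # 2,359,296 parameters
dsc = depthwise_separable_params(512, 512, 3)  #   266,752 parameters
ratio = std / dsc                              # roughly 8.8x fewer parameters
```

In general the saving approaches a factor of k² for large channel counts, which is what makes the substitution in the bottleneck lightweight.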
In this study, the final 1x1 standard convolution within the bottleneck architecture was replaced with a depth-wise separable convolution, after which an ECA module was added. The enhanced bottleneck configuration is shown in Figure 6.