Data Preparation
First, prepare the data by dividing it into training and validation sets. Then, use the Keras ImageDataGenerator to augment the images by rescaling, shearing, zooming, and flipping them. This increases the diversity of the training set and helps prevent overfitting.

Base Model Selection
Next, select the base model to use for transfer learning. In this case, we will be using two popular architectures, InceptionV3 and VGG16. Both of these models are trained on the ImageNet dataset and have shown strong performance in various computer vision tasks. The base model is pre-trained, and we will use its pre-trained weights to extract relevant features from the images.
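In practice the augmentation above is configured through Keras's ImageDataGenerator; as a minimal, library-free sketch of what two of those transforms do to a raw image array (rescaling pixel values and horizontal flipping), consider the following. The toy image and factor are illustrative only.

```python
import numpy as np

def rescale(img, factor=1.0 / 255):
    """Map pixel values from [0, 255] into [0, 1], as the rescale argument does."""
    return img * factor

def horizontal_flip(img):
    """Mirror an (H, W, C) image left-to-right, one of the augmentations listed above."""
    return img[:, ::-1, :]

# Toy 2x2 RGB image with pixel values in [0, 255].
img = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=np.float64)

scaled = rescale(img)
flipped = horizontal_flip(img)
print(scaled.max())       # 1.0
print(flipped[0, 0, 1])   # 255.0 (the green pixel has moved to the left)
```

Shearing and zooming are affine warps of the pixel grid and are easiest left to the library.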
Freeze Layers
To speed up the training process and keep the pre-trained weights from being overwritten, we freeze the layers of the base model. This ensures that their weights are not updated during training.

Model Modification
After the base model is selected and its layers are frozen, we add additional layers on top of the base model to form a new model. We add a GlobalAveragePooling2D layer to convert the 2D feature maps from the convolutional layers into a 1D feature vector. We then add a fully connected layer with ReLU activation ($f(x) = \max(0, x)$), followed by a dropout layer to prevent overfitting, and finally a fully connected layer with softmax activation for classification.
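The classification head described above (global average pooling, then ReLU, then softmax) can be sketched numerically in NumPy. This is a conceptual sketch, not the Keras layers themselves; the 7×7×512 feature-map size, the weight matrix, and the four-class output are illustrative assumptions.

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - x.max())
    return e / e.sum()

def global_average_pooling(feature_maps):
    # (H, W, C) feature maps -> length-C vector by averaging each map.
    return feature_maps.mean(axis=(0, 1))

rng = np.random.default_rng(0)
features = rng.standard_normal((7, 7, 512))   # stand-in for the base model's conv output
vec = global_average_pooling(features)        # 1D feature vector, shape (512,)
W = rng.standard_normal((512, 4)) * 0.01      # illustrative weights; 4 apple leaf classes
probs = softmax(relu(vec) @ W)
print(vec.shape, probs.shape)   # (512,) (4,)
```

Softmax guarantees the four outputs form a probability distribution (non-negative, summing to 1), which is what makes the final layer suitable for classification.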
Compile the Model
We compile the model by specifying the optimizer, loss function, and evaluation metric. We use the Adam optimizer with a learning rate of 0.0001, categorical_crossentropy as the loss function, and accuracy as the evaluation metric.

Training
We train the model on the training set using the fit_generator method. We set an early stopping criterion to prevent overfitting, and save the best model weights based on the validation accuracy.

Evaluation
After the model is trained, we evaluate it on the validation set and report the accuracy. We repeat the same process for both InceptionV3 and VGG16 and compare their accuracies to determine which model performs better for the given task.
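The categorical_crossentropy loss named at compile time has a simple closed form, which a short NumPy sketch makes concrete. The two sample predictions over the four apple leaf classes are made up for illustration.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted class probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# One-hot labels for 2 samples over the 4 apple leaf classes.
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0]], dtype=np.float64)
y_pred = np.array([[0.90, 0.05, 0.03, 0.02],
                   [0.10, 0.10, 0.70, 0.10]])
loss = categorical_crossentropy(y_true, y_pred)
print(round(loss, 4))   # 0.231, i.e. -(ln 0.9 + ln 0.7) / 2
```

Only the probability assigned to the true class enters the loss, so confident correct predictions drive it toward zero.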
A. Dataset
The dataset used in this study is the New Plant Diseases Dataset, which contains images of various plant diseases, including apple leaf diseases. The dataset was downloaded from Kaggle, a popular platform for data science competitions and projects. The apple leaf disease class consists of four subclasses: apple scab, cedar apple rust, healthy, and black rot. The dataset contains a total of 9705 apple leaf images, with each class containing a different number of images. Specifically, the healthy class has the largest number of images, accounting for 56.67% of the dataset, followed by apple scab with 24.64%, black rot with 12.56%, and cedar apple rust with 6.13%. The dataset was split into training and validation sets with a ratio of 80:20, resulting in 7764 images in the training set and 1941 images in the validation set.
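The 80:20 split counts reported above follow directly from the dataset size, as a two-line check confirms:

```python
total = 9705                 # apple leaf images in the dataset
train = round(total * 0.8)   # 80% for training
val = total - train          # remaining 20% for validation
print(train, val)            # 7764 1941
```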
B. Models
We utilized two state-of-the-art deep learning models, InceptionV3 and VGG16, to detect plant diseases in apple tree leaves. Both models are pre-trained on a large dataset, making them capable of extracting high-level features from images.
1) VGG16
VGG16 is a widely recognized deep convolutional neural network (CNN) architecture developed by the Visual Geometry Group (VGG) at the University of Oxford. It is characterized by its deep structure and simplicity, making it a popular choice for various computer vision tasks. The VGG16 network consists of 16 convolutional layers, which are responsible for learning and extracting features from the input data. These convolutional layers use small filters, typically 3x3 in size, with a stride of one. By using small filters, the network can capture local patterns and details in the input images. The choice of multiple stacked convolutional layers allows the network to learn increasingly complex and abstract features as information passes through the network.

After the convolutional layers, VGG16 has three fully connected layers, which are responsible for the final classification. The fully connected layers utilize the rectified linear unit (ReLU) activation function, which helps address the vanishing gradient problem by allowing efficient gradient propagation during training. The final layer of the network typically employs the Softmax function, producing a probability distribution over the different classes for classification purposes.
One characteristic of VGG16 is the use of max pooling after each set of convolutional layers. Max pooling reduces the spatial dimensions of the feature maps while preserving the most salient information, enabling the network to achieve translation invariance and a degree of spatial robustness. VGG16 has achieved state-of-the-art performance on various benchmark datasets, demonstrating its ability to learn highly discriminative representations. Its capability to capture both local and global features from input data has contributed to its success in tasks such as image classification, object detection, and image segmentation.
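The 2x2 max pooling VGG16 applies after each block of convolutional layers halves the spatial dimensions while keeping the largest activation in each window; a small NumPy sketch on a made-up 4x4 feature map shows the effect:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) feature map (H and W even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=np.float64)
pooled = max_pool_2x2(fmap)
print(pooled)   # [[4. 2.]
                #  [2. 8.]]
```

Each output value is the maximum of one non-overlapping 2x2 window, so small translations of a feature within a window leave the output unchanged, which is the translation invariance noted above.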
2) InceptionV3
InceptionV3 is a deep convolutional neural network (CNN) architecture developed by Google, specifically designed to efficiently extract features of various scales and aspects from input data. The network introduces the concept of "Inception modules" to achieve this goal. The InceptionV3 network consists of multiple convolutional layers, followed by several Inception modules. Each Inception module consists of a set of convolutional layers with different filter sizes, allowing the network to capture features at multiple scales. The outputs from each filter size are concatenated and fed into the next layer, enabling the network to capture features of various sizes and dimensions effectively. This design helps the network extract diverse and complex features more efficiently compared to traditional CNN architectures.
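The defining operation of an Inception module, concatenating the outputs of parallel branches with different receptive fields along the channel axis, can be sketched with shapes alone. The branch channel counts below are illustrative, not InceptionV3's actual configuration, and random arrays stand in for the branch outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8   # spatial size shared by all branches of the module

# Stand-ins for the outputs of parallel branches (e.g. 1x1, 3x3, 5x5
# convolutions and a pooling branch), differing only in channel count.
branch_1x1 = rng.standard_normal((H, W, 64))
branch_3x3 = rng.standard_normal((H, W, 128))
branch_5x5 = rng.standard_normal((H, W, 32))
branch_pool = rng.standard_normal((H, W, 32))

# The module's output is the channel-wise concatenation of all branches.
module_out = np.concatenate(
    [branch_1x1, branch_3x3, branch_5x5, branch_pool], axis=-1)
print(module_out.shape)   # (8, 8, 256)
```

Because the branches agree on spatial size, the next layer sees a single tensor in which features from every filter size sit side by side.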
After the Inception modules, the network includes a global average pooling layer, which reduces the spatial dimensions of the feature maps to a single value by taking the average of each feature map. This pooling operation helps to further condense the information while retaining the most relevant features. Finally, a fully connected layer maps the output of the pooling layer to the number of classes in the dataset for classification purposes. The InceptionV3 architecture's strength lies in its ability to capture features at different scales and aspects, making it particularly suitable for image classification tasks. It excels in capturing complex and multi-level representations in the input data. As a result, InceptionV3 has become a popular choice for various computer vision applications, including object detection and segmentation.
The InceptionV3 network has achieved state-of-the-art performance on several benchmark datasets, demonstrating its effectiveness in extracting informative features and achieving high classification accuracy. Its versatility and superior performance have established it as a standard architecture for image classification tasks in the field of computer vision.
C. Approach
In this study, we adopted a transfer learning approach to train both VGG16 and InceptionV3 models on our dataset. Specifically, we utilised pre-trained models that had been previously trained on large datasets, and then fine-tuned the final few layers of the networks on our dataset. This approach allowed us to benefit from the knowledge that had already been captured by the pre-trained models, while also tailoring the models to our specific dataset.
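The freeze-then-fine-tune step above amounts to disabling weight updates for all but the last few layers (in Keras this is done by setting each layer's trainable attribute to False). A minimal pure-Python sketch of that bookkeeping, with a hypothetical Layer stand-in and made-up layer names:

```python
class Layer:
    """Minimal stand-in for a network layer with a trainable flag."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def freeze_all_but_last(layers, k):
    """Freeze every layer except the last k, mirroring the fine-tuning setup."""
    for layer in (layers[:-k] if k > 0 else layers):
        layer.trainable = False
    return layers

model = [Layer(f"block{i}") for i in range(1, 6)] + [Layer("head")]
freeze_all_but_last(model, 2)
print([(l.name, l.trainable) for l in model])
# Only "block5" and "head" remain trainable.
```

Only the unfrozen tail adapts to the apple leaf dataset, which is what lets the pre-trained feature extractor's knowledge carry over intact.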
Table 1
Per-class precision, recall, and F1-score on the apple tree disease validation set.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Apple_Apple_scab | 0.98 | 0.96 | 0.98 | 504 |
| Apple_Black_rot | 0.98 | 1.00 | 0.99 | 491 |
| Apple_Cedar_apple_rust | 0.99 | 0.99 | 0.99 | 440 |
| Apple_healthy | 0.97 | 0.97 | 0.97 | 502 |