All experiments are carried out on a platform with the following specifications: Intel(R) Core i3-4005 CPU @ 1.70 GHz processor, 8.00 GB RAM, and a 64-bit Windows 10 operating system, with Google Colab Pro and Python 3.6.9 as the coding environment. Fig. 1 depicts the flowchart of our proposed research. In this article, a CNN motivated by the AlexNet architecture [11] is proposed for the classification of fresh and diseased cotton leaves and plants.
Dataset Description
In the proposed work, a dataset of 2293 real-time images of cotton leaves and cotton plants is used. The dataset was collected from Kaggle, the free online database platform. The images are categorized into four classes: diseased cotton leaf (DCL), diseased cotton plant (DCP), fresh cotton leaf (FCL), and fresh cotton plant (FCP). Each image is labelled with its appropriate class. Fig. 2 shows sample images from the dataset taken under actual field conditions. Table 1 shows the number of images belonging to each class.
Table 1 Dataset Description

| Categories | Number of Images |
|------------|------------------|
| DCL        | 346              |
| DCP        | 922              |
| FCL        | 511              |
| FCP        | 514              |
| Total      | 2293             |
Data Augmentation
Data augmentation refers to the creation of new samples from an existing dataset. This procedure increases the size of the available data, providing a larger and more varied dataset while avoiding additional data collection costs. To improve the training process, we introduce random transformations: rescaling, rotation by up to 40°, width and height shifts of up to 20% each, shear and zoom of up to 20% each, and horizontal flipping, with fill mode set to nearest. These strategies enlarge the dataset, which aids in preventing overfitting during the training stage [26]. Fig. 3 shows image samples after the data augmentation technique is applied.
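The transformations listed above map directly onto the parameters of Keras's `ImageDataGenerator`. The following is a minimal sketch, assuming the TensorFlow/Keras stack implied by the Colab environment; the paper does not give the actual code:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation pipeline matching the transformations described in the text.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # rescale pixel values to [0, 1]
    rotation_range=40,       # random rotation by up to 40 degrees
    width_shift_range=0.2,   # horizontal shift by up to 20% of the width
    height_shift_range=0.2,  # vertical shift by up to 20% of the height
    shear_range=0.2,         # shear by up to 20%
    zoom_range=0.2,          # zoom in/out by up to 20%
    horizontal_flip=True,    # random horizontal flipping
    fill_mode="nearest",     # fill newly exposed pixels with the nearest value
)
```

In practice such a generator would be pointed at the image folders (e.g. via `flow_from_directory`) so that each training batch is randomly transformed on the fly.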
The architecture of the Convolutional Neural Network
The architecture of the proposed CNN model is shown in Fig. 4. It consists of four convolution layers, four pooling layers, and fully connected layers. The final layer, with a Softmax activation function, serves as the output layer. A flatten layer, acting as a hidden layer, converts the feature maps into a 1D array, resulting in improved performance and more accessible data handling. Each convolution kernel has a size of 3x3, and each max-pooling window has a size of 2x2. The input image size and the resulting feature maps vary as shown in Fig. 4.
Details of the proposed Convolutional Neural Network model
The proposed model is a sequential model with a series of convolution and max-pooling layers that converts the input image into a feature set for further processing. In the first layer, the input image of size 224x224 is convolved with 32 filters of size 3x3, resulting in an output of dimension 222x222x32. The second layer is a max-pooling layer with a filter size of 2x2, which halves the spatial dimensions of the convolved image to 111x111x32. Similarly, the third layer performs a convolution with 64 filters of size 3x3, followed by a fourth max-pooling layer with a filter size of 2x2; as a result, the output is reduced to 54x54x64. The fifth layer again performs a convolution with 128 filters of size 3x3, followed by a sixth pooling layer that reduces the output to 26x26x128. The seventh and eighth layers are another combination of a convolution layer with 256 filters of size 3x3 and a max-pooling layer with a filter of size 2x2, reducing the output to 12x12x256. A dropout of 0.5 is applied after the last max-pooling layer. A flattening layer then converts the previous layer's output into a feature vector for each image. The model comprises two dense layers that serve as the hidden layers of the artificial neural network. These layers are fully connected, with every input neuron linked to every hidden neuron, and have 128 and 256 neurons, respectively, with ReLU as the activation function. Dropouts of 0.1 and 0.25 are applied after the dense layers. The last layer of the model is the output layer with Softmax as the activation function: a fully connected layer with four output neurons that predicts the class labels, from which the proposed model's performance (accuracy, precision, recall, and F1 score) is evaluated.
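The layer stack described above can be sketched in Keras as follows. This is an assumed reconstruction, not the authors' code: the framework, padding mode (valid, inferred from the 224 → 222 reduction), and the placement of the 0.1 and 0.25 dropouts after the first and second dense layers respectively are our reading of the text:

```python
from tensorflow.keras import layers, models

# Sequential CNN matching the dimensions reported in the text.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu",
                  input_shape=(224, 224, 3)),   # -> 222x222x32
    layers.MaxPooling2D((2, 2)),                # -> 111x111x32
    layers.Conv2D(64, (3, 3), activation="relu"),   # -> 109x109x64
    layers.MaxPooling2D((2, 2)),                    # -> 54x54x64
    layers.Conv2D(128, (3, 3), activation="relu"),  # -> 52x52x128
    layers.MaxPooling2D((2, 2)),                    # -> 26x26x128
    layers.Conv2D(256, (3, 3), activation="relu"),  # -> 24x24x256
    layers.MaxPooling2D((2, 2)),                    # -> 12x12x256
    layers.Dropout(0.5),        # dropout after the last max-pooling layer
    layers.Flatten(),           # feature vector per image
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.1),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.25),
    layers.Dense(4, activation="softmax"),  # DCL, DCP, FCL, FCP
])
```

Each max-pooling layer halves the spatial size (integer division), which reproduces the 111, 54, 26, and 12 feature-map widths quoted in the text.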
Training and Testing
The database was randomly split into three datasets: training, validation, and testing. For training and testing the proposed CNN model, the 2293 image samples of fresh and infected cotton leaves and plants, taken under actual field conditions, were used. The training dataset is further divided into training and validation subsets to detect model overfitting. Of the 2293 image samples, 2095 were used for the training and validation process, and the remaining 198 were held out to test the model's performance in classifying new images. These images are complex in several respects, including the appearance of numerous leaves and other plant parts and irrelevant objects in the background, such as shoes. The Adam optimizer with a learning rate of 0.0001, a batch size of 32, and categorical cross-entropy as the loss function is used to train the network. In addition, the model is trained using three different train/validate/test ratios, four different max-pooling layer configurations, and three different numbers of epochs.
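The split arithmetic and the training configuration above can be sketched as follows. Keras is assumed, and the tiny stand-in model is hypothetical, included only so the compile settings are runnable; the real model is the four-block CNN described earlier:

```python
import tensorflow as tf

# Reported split: 2293 images total, 198 held out for testing,
# the remaining 2095 used for training and validation.
total = 2293
test_count = 198
train_val_count = total - test_count  # 2095

# Hypothetical stand-in model; only the compile settings below are from the text.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Training configuration from the text: Adam, lr = 0.0001,
# categorical cross-entropy loss (batch size 32 would be passed to fit()).
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```

With one-hot-encoded labels, training would then be launched via `model.fit(..., batch_size=32)` for the chosen number of epochs.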
Visualization approaches are used to expose CNN feature maps in order to identify how CNNs learn features for discriminating between classes. This experiment helps in appreciating the differences among the feature maps produced from several diseased cotton leaf and plant images. Figs. 5 and 6 show an example image from the dataset, together with the visualization results of different convolution layers and different max-pooling layers.
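One common way to obtain such feature-map visualizations in Keras is to build a second model whose output is an intermediate layer's activation. This is a sketch under the assumption that Keras is used; the small model and the layer names are hypothetical:

```python
import numpy as np
import tensorflow as tf

# Small illustrative model; the idea transfers directly to the full CNN.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu", name="conv1")(inputs)
x = tf.keras.layers.MaxPooling2D((2, 2), name="pool1")(x)
model = tf.keras.Model(inputs, x)

# Auxiliary model whose output is the first convolution layer's activation.
feature_extractor = tf.keras.Model(inputs, model.get_layer("conv1").output)

# Stand-in for a cotton-leaf image; feature_maps has shape (1, 222, 222, 32),
# i.e. 32 feature maps that can be plotted as a grid of grayscale images.
image = np.random.rand(1, 224, 224, 3).astype("float32")
feature_maps = feature_extractor(image)
```

Repeating this for each convolution and max-pooling layer yields the per-layer visualizations of the kind shown in Figs. 5 and 6.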
Performance Parameter
The performance of our model was evaluated using the metrics accuracy, precision, recall, and F1 score. These metrics range from 0 to 1, with 1 being the best and 0 being the worst.
Accuracy: Accuracy is the proportion of accurately predicted images (TP+TN) to the total number of predictions (TP+TN+FP+FN). Accuracy can be evaluated using Equ. (1)
Precision: Precision is the ratio of correctly predicted positive instances (TP) to the total number of instances predicted as positive (TP+FP). Mathematically, it can be calculated using Equ. (2).
Recall: Recall, also known as sensitivity, is the ratio of the instances correctly predicted (TP) to the total number of actual cases (TP+FN). Recall can be computed using Equ. (3).
F1 Score: The F1 score combines precision and recall into a single measure. Mathematically, it is the harmonic mean of precision and recall and is computed using Equ. (4).
where TP = true positives, FP = false positives, TN = true negatives, and FN = false negatives.
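The four definitions above reduce to a few lines of arithmetic. The counts in the example call are hypothetical, for illustration only:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 as defined in Equ. (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equ. (1)
    precision = tp / (tp + fp)                            # Equ. (2)
    recall = tp / (tp + fn)                               # Equ. (3)
    f1 = 2 * precision * recall / (precision + recall)    # Equ. (4): harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical confusion counts for one class:
acc, prec, rec, f1 = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

For the multi-class case here, these quantities would be computed per class (one-vs-rest) and then averaged.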