Leaf Image-based Plant Disease Identification using Color and Texture Features

Plant disease is usually identified through visual inspection or laboratory examination, which causes delays and results in yield loss by the time identification is complete. Complex deep learning models, on the other hand, perform the task with reasonable performance, but their large size and high computational requirements make them unsuited to mobile and handheld devices. Our proposed approach automates the identification of plant diseases through a sequence of steps: pre-processing, segmentation of the diseased leaf area, calculation of features based on the Gray-Level Co-occurrence Matrix (GLCM), feature selection and classification. In this study, six color features and twenty-two texture features are calculated. A support vector machine is used to perform one-vs-one classification of plant disease. The proposed model of disease identification provides an accuracy of 98.79% with a standard deviation of 0.57 on 10-fold cross-validation. The accuracy on a self-collected dataset is 82.47% for disease identification and 91.40% for healthy versus diseased classification. The reported performance measures are better than or comparable to those of existing approaches and the highest among the feature-based methods, presenting it as the most suitable method for automated leaf-based plant disease identification. This prototype system can be extended by adding more disease categories or by targeting specific crops or disease categories.


I. INTRODUCTION
Inconsistency and delay in the identification of plant diseases reduce the quantity and quality of yield. Losses due to plant diseases and other pests account for 20 to 40% of global annual productivity [1]. Studies have been carried out to assess the estimated loss caused by different diseases [1]. Yield loss also contributes to increased consumer prices and a drop in earnings for crop producers. Accurate and timely identification of plant diseases is crucial for ensuring maximum yield and is especially beneficial for farmlands in remote areas.
Advancements in machine vision have made visual identification tasks feasible, and these visual recognition methods can be employed for successful identification of plant diseases [2]. Image-based disease management and surveys have a history of more than 90 years, dating back to the use of aerial images to study crop disease [3]. Disease detection and identification have improved since then, and informative and sophisticated analysis is now being carried out [4]. Image-based disease identification is under continuous development and recent [...]
There are two approaches to leaf image-based plant disease identification: (i) deep learning based, which uses complex architectures to learn features automatically, and (ii) feature based, which extracts hand-crafted features such as color and texture features to train a conventional machine learning algorithm. The deep learning based approaches have provided higher accuracies, but they require more computation and are therefore not suitable for mobile or handheld devices with limited memory and computational power. Some of the designed systems target different diseases of a specific plant, whereas other approaches target multiple plant diseases. Phadikar et al. [18] have presented a feature-based approach to disease identification of the rice plant. They used a Fermi energy based method for segmentation, followed by color, contour and locality mapping. Rough set theory is used for selection of important features, and rule mining with 10-fold cross-validation is used for system testing. Baquero et al. [19] have presented a Content-Based Image Retrieval (CBIR) system which uses color structure descriptors and nearest neighbors to classify important diseases or disease symptoms such as chlorosis, sooty molds and early blight. Similarly, Patil et al. [20] have presented a CBIR system that extracts color, shape and texture based features.
Sandika et al. [21] have proposed a feature-based approach for disease identification of grape leaves. They have also compared the performance of texture features. Oberti et al. [22] have targeted a fungal disease of the grapevine (powdery mildew) due to its adverse effects on crop yield and quality of produce. They used multi-spectral imaging and captured grapevine leaf images at a range of angles (0 to 75 degrees). They also highlighted that detection sensitivity increases with the angle, with the highest value obtained at 60 degrees; for early and middle stages, the sensitivity improves from 9% to 75% as the angle changes from 0 to 60 degrees. Similarly, Zhang et al. [15] have presented a feature-based approach which transforms the image into a superpixel representation, segments the desired region using k-means and extracts pyramid of histograms of orientation gradients (PHOG) features. Sharif et al. [23] have presented a feature-based approach for citrus plant disease. They used a hybrid feature selection technique based on principal component analysis and feature statistics. Singh et al. [24] have also presented a feature-based approach for pine trees. Bai et al. [25] have targeted cucumber plant disease and proposed an improved fuzzy c-means based clustering technique to segment the diseased leaf area. Hlaing [...] rust. Ferentinos et al. [12] have also presented a deep learning based solution using the extended PlantVillage dataset with 58 classes. They used a pre-trained VGG for transfer learning and reported an accuracy of 99.53%. The cross-dataset evaluation showed a sharp decrease of 25-35% in accuracy, indicating poor generalization.
Mohanty et al. [14] and Yuan et al. [27] have proposed deep learning based solutions using the PlantVillage dataset. They used pre-trained CNNs and applied transfer learning to classify plants into 38 classes. Zeng et al. [28] have presented a high-order residual CNN architecture which extracts low-level details and high-level abstract representations simultaneously to improve classification performance, and reported 91.3% classification accuracy with good generalization performance.
The proposed approach addresses the problem in two steps. The PlantVillage dataset with 38 classes is used to demonstrate the proposed approach; an extended version of the dataset has been reported by some researchers, but it was not publicly available at the time of this research. The first step identifies the leaf as healthy or diseased. This step is performed with a bag-of-features approach, which is an effective method of visual classification. Further processing is performed only on diseased leaves to identify the type of disease affecting the leaf. This identification is performed by segmenting the diseased leaf region and extracting color and texture features. Note that we extract a comprehensive set of texture features which has not been used in the literature for the disease identification task. Feature normalization and selection are performed to obtain the most discriminatory feature set for classification. Five classification algorithms are tried in the final stage, and a Support Vector Machine (SVM) with a cubic kernel is used as the final classifier. The proposed approach provides results comparable to the state-of-the-art algorithms and demonstrates the effectiveness of texture-based visual classification algorithms for the task of disease identification.

III. MATERIALS AND METHODS
Hughes et al. [8] [...] Figure 1 shows some sample leaf images drawn randomly from the dataset to show the raw form of the images. These images are captured by trained staff in a regularized process, and therefore contrast adjustment, color-cast removal and background removal are necessary to remove any possible bias. The proposed approach analyzes visual leaf features in a stepwise manner to construct a classifier; the steps are described below:

A. Background Removal
Background removal is crucial in the approach because a retained background may reduce the quality of the features extracted from the leaf image. This step removes the background to avoid any potential bias in the extracted features and the trained framework. Color cast removal is performed by normalizing the gray values of the three color channels separately. Moreover, background removal needs to be automated and free from human influence to increase usability. Background removal can be performed in two ways: i) pixel clustering, and ii) edge detection. We transform the RGB image to the HSV (Hue, Saturation and Value) color space to separate the background from the leaf more easily. Figure 2 shows the true color image and its channels in RGB and HSV form, demonstrating that it is easier to transform the image to HSV and use the Hue layer for background removal than to perform background removal on the true color image itself.
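The two operations named above, per-channel normalization and extraction of the Hue layer, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the min-max stretch used for `normalize_channels` is an assumption (the text does not say which normalization is applied), and both function names are hypothetical.

```python
import numpy as np

def normalize_channels(rgb):
    """Stretch each color channel independently to the [0, 1] range.

    Assumption: min-max stretching stands in for the unspecified
    per-channel normalization used for color cast removal.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    out = np.empty_like(rgb)
    for c in range(3):
        ch = rgb[..., c]
        lo, hi = ch.min(), ch.max()
        # A constant channel carries no contrast; map it to zeros.
        out[..., c] = (ch - lo) / (hi - lo) if hi > lo else 0.0
    return out

def hue_channel(rgb01):
    """Hue in [0, 1) for an RGB image whose values lie in [0, 1]."""
    r, g, b = rgb01[..., 0], rgb01[..., 1], rgb01[..., 2]
    mx = rgb01.max(axis=-1)
    mn = rgb01.min(axis=-1)
    delta = mx - mn
    safe = np.where(delta > 0, delta, 1.0)  # avoid division by zero
    hue = np.zeros_like(mx)                 # gray pixels keep hue 0
    # Hue is defined piecewise, depending on which channel is largest.
    r_max = (delta > 0) & (mx == r)
    g_max = (delta > 0) & (mx == g) & ~r_max
    b_max = (delta > 0) & ~r_max & ~g_max
    hue[r_max] = (((g - b) / safe) % 6)[r_max]
    hue[g_max] = (((b - r) / safe) + 2)[g_max]
    hue[b_max] = (((r - g) / safe) + 4)[b_max]
    return hue / 6.0
```

Because hue depends only on the ratio between channels, a shadowed and a brightly lit patch of the same leaf map to nearly the same value, which is why the Hue layer is the more convenient input for background removal.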

Nisar Ahmed, Department of Computer Engineering, University of Engineering and Technology Lahore, Pakistan. E-Mail: nisarahmedrana@yahoo.com Contact: +92-300-7272402
[Figure 2: True Color image with its Red, Green and Blue channels and its Hue, Saturation and Value channels]
The true color image and its color channels contain shadows and light reflections which make them less suitable for background removal. The true color image is transformed to the HSV color space, and its three layers are displayed along with the three RGB layers in Figure 2. It is clearly visible that the Hue image is a better candidate for segmentation as it discards all the intensity-related information. Edge-preserving blurring is applied to the Hue image using bilateral filtering, and then the Watershed algorithm is applied [29]. It finds watershed ridges in an image by treating light pixels as high elevations and dark pixels as low elevations. The eight-neighborhood principle is used to segment adjoining regions. At the final stage, small pixel groups are removed by morphological processing, which improves the mask. The stages of preprocessing and segmentation are provided in Figure 3.
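The final clean-up of the mask can be illustrated with a small sketch. The exact morphological operation is not specified in the text, so the code below substitutes a plain connected-component size filter with eight-neighborhood connectivity; the function name and the `min_size` parameter are hypothetical.

```python
from collections import deque

import numpy as np

def remove_small_regions(mask, min_size):
    """Drop 8-connected foreground regions smaller than min_size pixels."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    out = np.zeros_like(mask)
    # Offsets to the eight neighbours of a pixel.
    neighbours = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                  if (dy, dx) != (0, 0)]
    for y in range(h):
        for x in range(w):
            if not mask[y, x] or seen[y, x]:
                continue
            # Breadth-first flood fill collects one connected region.
            region = [(y, x)]
            seen[y, x] = True
            queue = deque(region)
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in neighbours:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny, nx] and not seen[ny, nx]):
                        seen[ny, nx] = True
                        region.append((ny, nx))
                        queue.append((ny, nx))
            # Keep the region only if it is large enough.
            if len(region) >= min_size:
                for ry, rx in region:
                    out[ry, rx] = True
    return out
```

With eight-neighborhood connectivity, pixels that touch only diagonally belong to the same region, so thin diagonal structures of the leaf outline are not broken apart before the size test.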

[Figure 3: Stages of preprocessing and segmentation: Unprocessed, Color Adjusted, Hue Layer, Watershed Transformed, Segmented Mask, Morphological Processed]

C. Segmentation of Diseased Region
Segmentation of the diseased region is useful for extracting discriminative features. In this study, the diseased region is extracted for texture feature calculation as it contains the most discriminative portion of the leaf. Extraction of diseased segments is therefore a requirement, and it is described here. Among the few options for diseased region segmentation, we have opted for Otsu's algorithm [30], which is computationally inexpensive and a reasonable choice overall. Otsu's algorithm is a simple and quick approach to segmenting the diseased region of the leaf. It works by calculating a threshold from the gray-level histogram. It is assumed that the foreground and background pixels belong to two different Gaussian distributions with different mean and variance values. It finds the valley between the two peaks by maximizing the variance between the two classes, and the resulting point in the gray-level histogram is used as a threshold to segment the image.
The diseased region segmentation is not exceptionally accurate; however, it is satisfactory. Figure 6 demonstrates the results of segmentation for corn affected by common rust, grape affected by leaf blight and potato affected by late blight.
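The threshold selection described above can be written compactly. The following is a minimal sketch of Otsu's method for an 8-bit grayscale image, not the implementation used in this study; the function name and the `levels` parameter are illustrative.

```python
import numpy as np

def otsu_threshold(gray, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class and pixels > t the other.
    `gray` must contain non-negative integers below `levels`.
    """
    hist = np.bincount(np.asarray(gray).ravel(), minlength=levels).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                   # probability of the lower class
    mu = np.cumsum(prob * np.arange(levels))  # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    # Between-class variance; left at zero where one class is empty.
    sigma_b = np.zeros(levels)
    valid = denom > 0
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))
```

Applying `gray > otsu_threshold(gray)` then yields a binary mask. Which side of the threshold corresponds to the diseased tissue depends on the image and is not fixed by the method itself.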

D. Feature Extraction
Feature extraction is an important step, and the accuracy of a classification algorithm depends heavily on the feature set. Feature extraction is in effect dimensionality reduction, performed to represent the interesting parts of the diseased region in a compact form. Shape, color and texture based features represent distinct leaf-based classification features [31,32]. Shape features can be used to identify healthy or diseased leaves, but in disease identification the segmented diseased regions have different inter- and intra-class variabilities, so shape would be of less use. The color of the diseased region is distinct from that of the healthy region; it stays similar across intra-class samples and varies across inter-class samples. Similarly, the texture of the diseased region depends heavily on the type of disease and can be used as a major predictor. The parameters related to the calculation of these two sets of features are discussed below:

Color Features
There are two main types of color features: color histograms and color moments. In the present scenario, color is of lesser importance as a predictor of disease categories. Color histograms provide a larger feature set, which is not greatly required in this scenario. Color moments are therefore used, as they characterize color information in a compact way.
I. Mean: The mean value of the two-dimensional image matrix represents the first color moment. The color image is separated into its three RGB layers and the mean value is calculated for each layer.
II. Standard Deviation: The second color moment is the standard deviation, which represents the distribution of color information around the mean. It is calculated over the two-dimensional image matrix of each of the three RGB layers.
There is a total of six features representing color information in the diseased image region: three for the mean and three for the standard deviation of the red, green and blue layers of the RGB image.
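As a brief illustration, the six color moments can be computed as follows. The sketch assumes that the moments are taken only over pixels inside a boolean mask of the diseased region and that the population standard deviation is used; neither detail is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def color_moments(rgb, mask):
    """Mean and standard deviation of R, G and B inside a boolean mask.

    Returns six values: [mean_R, mean_G, mean_B, std_R, std_G, std_B].
    """
    # Boolean indexing flattens the selected pixels to shape (n, 3).
    pixels = np.asarray(rgb, dtype=np.float64)[np.asarray(mask, dtype=bool)]
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])
```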

Texture Features
Texture is the more important cue for disease identification as it carries more information related to the diseased region. Here, 22 texture features are extracted from the Gray-Level Co-occurrence Matrix (GLCM) of the grayscale leaf image, as the color information is already encoded using color moments. The GLCM is calculated over the grayscale image, whose range is resampled into eight gray-levels. It counts how frequently a pixel with gray-level $i$ is situated adjacent to a pixel with the value $j$. The element $(i, j)$ of the $8 \times 8$ GLCM represents the number of times that a pixel with value $i$ occurred adjacent to a pixel with value $j$. The details of these texture features are provided as follows:
Inverse Difference Moment (Homogeneity) [33,36], where $k$ is the amount of shift
Cluster Shade (CS) [33,34]
Cluster Prominence (CP) [33,34]
Maximum Probability (MP) [33,34]: $MP = \max_{i,j} p(i, j)$, with $p(i, j)$ denoting the GLCM element
Sum Average (SA) [...]
Difference Entropy (DE) [35,36]
Maximal Correlation Coefficient (MCC), defined using the second largest eigenvalue
Inverse Difference Normalized (IDN) [33]
There is a total of 22 texture features which represent the texture information of the segmented diseased region of the leaf image.
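To make the construction concrete, the sketch below builds a normalized GLCM with eight gray-levels and evaluates a handful of features from it. It is an illustration under stated assumptions rather than the feature extractor used in this study: the pixel offset (one pixel to the right), the symmetric counting and the function names are all assumptions, and the formulas are the commonly used textbook definitions.

```python
import numpy as np

def glcm(gray, levels=8, offset=(0, 1), symmetric=True):
    """Normalized gray-level co-occurrence matrix for one pixel offset."""
    gray = np.asarray(gray, dtype=np.float64)
    # Resample the 0..255 range into `levels` bins numbered 0..levels-1.
    q = np.minimum((gray / 256.0 * levels).astype(int), levels - 1)
    dy, dx = offset
    h, w = q.shape
    # Pair every reference pixel with its neighbour at (dy, dx).
    ref = q[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    nbr = q[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1)
    if symmetric:
        counts = counts + counts.T  # count each pair in both directions
    return counts / counts.sum()

def texture_features(p):
    """A few GLCM features computed from a normalized matrix p."""
    i, j = np.indices(p.shape)
    return {
        "contrast": float(np.sum(p * (i - j) ** 2)),
        "dissimilarity": float(np.sum(p * np.abs(i - j))),
        "homogeneity": float(np.sum(p / (1.0 + (i - j) ** 2))),
        "energy": float(np.sum(p ** 2)),
        "max_probability": float(p.max()),
    }
```

Quantizing to eight levels keeps the matrix at a fixed 8 × 8 size regardless of image resolution, so every feature is a cheap sum over 64 entries.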

E. Feature Scaling
Feature scaling is a method to scale the range of values of a feature vector. It is also known as feature normalization or standardization. Feature values are computed in different units or may represent different parameters, so their ranges vary widely. Note that many classifiers use some distance measure, such as the Euclidean distance, between points. If one feature has a wide range of values, the distance measure will be largely governed by this feature. In this work, standardization is chosen, which scales the features such that they have zero mean and unit variance. This is a widely used scaling method for SVM, ANN and linear regression [37].
The formula in eq. 2 is used for standardization of features:
$$x' = \frac{x - \mu}{\sigma} \qquad (2)$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the feature.
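A minimal sketch of standardization to zero mean and unit variance is given below. Estimating the statistics once and reusing them on new samples is a common practice assumed here, not a detail taken from the text; both function names are hypothetical.

```python
import numpy as np

def fit_standardizer(features):
    """Per-feature mean and standard deviation of a (samples x features) array."""
    features = np.asarray(features, dtype=np.float64)
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    # Constant features would divide by zero; leave them centered only.
    sigma = np.where(sigma == 0, 1.0, sigma)
    return mu, sigma

def standardize(features, mu, sigma):
    """Shift and scale each feature using previously estimated statistics."""
    return (np.asarray(features, dtype=np.float64) - mu) / sigma
```

Reusing `mu` and `sigma` from the training folds on the held-out fold keeps the cross-validation estimate free of information from the test samples.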

F. Feature Selection
Feature selection is employed to select a subset of features for model construction. Including all the features may degrade performance and slows training; feature selection is therefore crucial. There are different approaches to feature selection, and we have explored two of them for our problem: i) sequential feature selection and ii) the ReliefF algorithm. Sequential feature selection can be used [...] Table 2 provides the list of these features and the CV accuracy.

G. Classifier Selection
The final stage of the work is the selection of a suitable classification algorithm to assign each leaf disease to the category it belongs to. We have chosen five major classification algorithms to check their suitability. The algorithm with the best performance is then optimized over its hyper-parameters to form the final model. This study has used multi-SVM, K-NN, Naïve Bayes, Random Forest and Artificial Neural Networks (ANN).
Multi-SVM: The SVM is a supervised learning algorithm that is used for classification, regression, and clustering or outlier detection. At its core, the SVM is a binary linear classifier which separates two classes using a maximum margin hyperplane. A good separation is provided by the hyperplane which has the largest distance to the nearest data points, hence the name maximum margin hyperplane; the larger the margin, the lower the generalization error. The feature set may exist in a finite dimensional space without being linearly separable in that space. To make the data linearly separable, it is transformed to a higher dimensional space using a kernel function. The present study has used the one-vs-one multi-SVM and tried linear, quadratic, cubic and Gaussian kernels, of which the cubic SVM provided the best cross-validation performance. The kernel functions used for SVM are provided in Table 4.
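The kernels and the one-vs-one scheme mentioned above can be illustrated as follows. The exact parameterization of the kernels in this study is not reproduced here, so the forms below (for example the offset of 1 in the polynomial kernel and the `gamma` parameter) are common conventions assumed for illustration, and all function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def polynomial_kernel(x, y, degree):
    # (1 + x.y)^degree; degree 2 is "quadratic" and degree 3 is "cubic".
    return float((1.0 + np.dot(x, y)) ** degree)

def gaussian_kernel(x, y, gamma=1.0):
    diff = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def one_vs_one_pairs(classes):
    """All unordered class pairs; one binary SVM is trained per pair."""
    return list(combinations(classes, 2))

def one_vs_one_vote(pair_winners, n_classes):
    """Return the class that wins the most pairwise contests."""
    votes = np.bincount(pair_winners, minlength=n_classes)
    return int(np.argmax(votes))
```

For K classes the one-vs-one scheme trains K(K-1)/2 binary classifiers, so `len(one_vs_one_pairs(range(26)))` gives 325 for the 26 diseased classes considered in this work.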
K-NN: It is a non-parametric algorithm which is used for classification, regression or outlier detection. It is a lazy learning algorithm which does not try to construct a model, but simply stores the training examples. Classification is performed through simple majority voting of the k nearest neighbors. This is a preferred algorithm for noisy or large training data and is easy to implement. The difficulty with K-NN is the selection of the value of K. A smaller value of K makes a finer decision boundary, resulting in overfitting, whereas a larger value of K results in a smoother boundary and poorer classification accuracy due to higher bias. Determining a suitable value of K is computationally expensive as it requires computing the distance of each example to all training samples.
Linear Naïve Bayes: It is a probabilistic classification algorithm based on Bayes' theorem with the simplistic, and apparently wrong, assumption that each feature is independent of every other feature in the feature set. This assumption also helps to alleviate the problems arising from the curse of dimensionality. Nevertheless, Naïve Bayes performs well in real-world situations, requires a small amount of training data and is fast to compute. Naïve Bayes is not a single algorithm but a family of classification algorithms which share the common assumption that a specific feature is independent of any other feature. It can be trained efficiently using a supervised learning approach, and in many settings parameter estimation is based on maximum likelihood.
Random Forest: Overfitting is an inherent problem of decision trees, which is overcome using different bagging or boosting approaches. Random forest addresses overfitting by fitting a number of decision trees on several sub-samples of the training dataset and averaging their predictions to improve the predictive accuracy. The sub-sample size is always the same as the original sample size, but the samples are drawn with replacement.
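The sampling and averaging steps of this scheme can be sketched independently of the trees themselves. The snippet below is a hedged illustration only: it assumes each tree outputs class probabilities, and the function names are hypothetical.

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bagging sub-sample)."""
    return rng.integers(0, n_samples, size=n_samples)

def average_trees(tree_probabilities):
    """Average class probabilities over trees and pick the best class.

    tree_probabilities has shape (n_trees, n_samples, n_classes).
    """
    return np.mean(tree_probabilities, axis=0).argmax(axis=1)
```

Because the indices are drawn with replacement, each sub-sample repeats some training examples and omits others, which is what makes the individual trees differ from one another.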
The principal advantage, reduced overfitting, makes random forest more accurate than simple decision trees in most cases. However, it is a complex algorithm and its prediction speed is slow. [...] Precision, recall, accuracy and F1-score of the healthy and diseased plant classification are provided in Table 5.
The results of the classification are reasonable, but existing approaches have not performed classification between healthy and diseased leaves for this dataset. The second phase involves the segmentation of the diseased leaf region of 39,226 images. Six color and twenty-two texture features were extracted and a feature vector with a length of 29 features is formed. This feature vector is standardized so that each feature has zero mean and unit variance. Standardization is applied to remove the effect of feature range on distance-based classifiers, reduce model complexity and improve the gradient calculation. It has been observed that the classification accuracy of the subset selected using the ReliefF algorithm is significantly lower than that of the complete feature set, whereas the subset selected with FFS provided slightly better classification accuracy than the complete feature set. Feature selection with FFS is therefore a favorable choice which provides better performance in terms of both time and classification accuracy. The results of classification evaluation with the three feature sets are provided in Table 7. The feature subset selected at the feature selection stage is used for fine-tuning the Multi-SVM and ANN, as both provided the highest accuracy in the initial stage.
The ANN was configured with one hidden layer of 10 to 100 neurons and with two hidden layers of 10 to 50 neurons each, and evaluated on the basis of Cross-Entropy (CE). The network with the least cross-entropy was tested for classification accuracy, which turned out to be 93.4% with the neural net of Figure 7. [...] Table 8 [...] performance parameters are provided in Table 9. Note that the accuracy provided by the SVM with cubic kernel is higher than the accuracy provided by the artificial neural network with 50 neurons in the first hidden layer and 50 neurons in the second hidden layer. It can be observed that the SVM with cubic kernel has provided impressive performance but has taken a longer training time (almost 53 minutes). Its prediction speed, however, is slower only than that of the linear kernel. Since training is a one-time job and can be done on a faster machine, prediction speed is the relevant parameter for practical use. Therefore, the SVM with cubic kernel was selected as the final model. The confusion matrix for the final model is provided in Table 10.
Table 11 provides the results of classification evaluation for the task of plant disease identification. Note that precision is the macro-average precision, averaged across the 26 classes. The recall is likewise the average of the recall for each class. The proposed model has provided an accuracy of 99.31% and an F1-score of 99.33%, which is quite high and comparable to existing approaches. The high accuracy can be explained by the rich set of texture and color features and by the use of feature selection, which provides a subset of the most discriminatory features.
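The macro-averaged measures can be derived directly from a confusion matrix, as the following sketch shows. It assumes that rows hold the true classes and columns the predicted classes, and that the F1-score is the mean of the per-class F1 values; the text does not state either convention, and the function name is hypothetical.

```python
import numpy as np

def macro_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F1.

    Assumption: rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(confusion, dtype=np.float64)
    tp = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: samples predicted per class
    actual = cm.sum(axis=1)     # row sums: true samples per class
    # Classes with no predictions or no samples score zero instead of NaN.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
    }
```

Macro-averaging gives every class the same weight regardless of how many images it contains, so a rare disease class influences the reported score as much as a common one.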
Note that the results of Table 10 are for the 26 diseased classes only, whereas the complete dataset contains 38 classes. In the literature, only the leaf category is classified, so the results of Table 10 cannot be compared with them. However, we have calculated the aggregate classification performance by accumulating the classification scores of both stages, which is provided in Table 12. Note that the final classification score is slightly lower than the disease classification module performance provided in Table 11.
The difference in classifier performance is due to their limitations: Naïve Bayes performs poorly due to the lack of all posterior probabilities, and random forest cannot perform well due to the limited performance of individual decision trees, whose averaging over the number of trees gives intermediate accuracy. K-NN can perform well in cases where features can be discriminated based on some distance metric. The artificial neural network has provided accuracy comparable to SVM, and the slightly lower accuracy is due to the difficulty of finding the global minimum. The artificial neural network has been trained numerous times, and there is a slight change in its accuracy each time, which is due to the initialization of its weights. The ANN is iterated several times with different numbers of neurons in one and two hidden layers. The iterations are limited due to computational complexity and can be repeated more times if a good solution is not achieved using other classifiers. [...]
It can be noted from Table 13 that seven existing works have claimed higher accuracy than the proposed scheme, but four of them have used a subset of the PlantVillage dataset and targeted a specific plant for disease identification. [...]
b. Specialized models for high-value crops, or for specific crops with a larger number of diseases, can be designed to identify the disease and categorize its severity based on fine-grained recognition.

IV.
Dissimilarity: Dissimilarity or inverse difference moment normalized is closely related to contrast and represents local gray-level variation in the segmented diseased region. It is almost inversely related to homogeneity and is calculated with the formula below: Inverse