Early detection and control of anthracnose disease in cashew leaves to improve crop yield using image processing and machine learning techniques

Agriculture is one of the primary pillars of India's economy, and it is alarming that the country's agricultural output is declining steeply. Climate change, environmental pollution, and soil erosion are well-known factors affecting crop productivity, and the increasing prevalence of plant diseases is another significant factor. Early disease detection, followed by mitigation based on the conditions identified in the plants, is critical to increasing crop productivity. This study considers machine learning models for detecting disease in cashew leaves. It concentrates on Anthracnose disease, which leads to severe yield loss when it affects the cashew plant. Cashew leaf images are collected and used to train various machine learning classifiers to identify and classify the disease. The work covers both the segmentation and the classification of leaves. Basic segmentation approaches (Global Threshold, Adaptive Gaussian, Adaptive Mean, Otsu, Canny, Sobel, and K-Means) are combined with machine learning models (Random Forest, Decision Tree, KNN, Logistic Regression, and Gaussian Naive Bayes). The final classification compares Random Forest with Hard and Soft voting ensembles of the Decision Tree, KNN, Logistic Regression, and Gaussian Naive Bayes classifiers. K-Means segmentation with Random Forest outperforms the other combinations: the Random Forest classifier reaches an accuracy of 96.7% on the CCDDB dataset and 99.7% on the PDDB dataset.


Introduction
Computer vision is a broad area in which machines are trained to acquire, process, and analyze images automatically with the help of machine learning and artificial intelligence, just as a human would. Machine learning is a subset of artificial intelligence that includes computational techniques for predicting and analyzing the patterns in given data, including images. Typically, machine learning demands data preprocessing and classification. Image processing is the manipulation of an image into a preferred form. As plants are affected by various diseases, it is difficult for humans to detect them early. Intelligent systems are therefore expected to fill in the knowledge gaps and make disease detection seamless [1].
P. Sudha (viswasudha@gmail.com), P. Kumaran (kumaran.0991@gmail.com), Department of Computer Science and Engineering, National Institute of Technology Puducherry, Karaikal, India
In the past, cashew disease was thought to be of minimal concern. Crop disease impacts the production of nuts. Today several conditions are severe enough to cause cashew trees to suffer significant losses. This farm crop is reported to be attacked by several fungi, which reduces productivity [2].
More than 12 diseases have been identified as affecting cashew trees globally. In countries that produce cashews, anthracnose, foliar blight, fruit rot, and gummosis of twigs and trunks are frequently regarded as the diseases most likely to cause significant harm [3]. So, this work concentrates on identifying Anthracnose disease, which causes severe yield loss when it affects the cashew plant.
The digital image processing system could be a powerful tool for diagnosing challenging symptoms far sooner than the naked eye [4]. It allows farmers to act appropriately and quickly to protect the crop and obtain the required quality and production of agricultural products [5,6].
This work evaluates different image segmentation techniques, namely Global Threshold, Adaptive Mean, Adaptive Gaussian, Otsu threshold, Canny edge detection, Sobel edge detection, and K-Means, which are applied to highlight the infected area. The output of each segmentation is fed to the Random Forest model and to ensemble classifiers built from Decision Tree, Naive Bayes, KNN, and Logistic Regression. For comparison, the same model was applied to the Image Database of Plant Disease Symptoms (PDDB) dataset, where it reached an accuracy of 99.7%, against 96.7% on the CCDDB dataset. The paper is structured as follows: Section 2 provides a literature review. Section 3 describes the methodology used for segmentation and classification. Section 4 presents the experiment and result analysis. Section 5 concludes the paper.

Literature review
Many researchers have previously worked on diagnosing plant diseases automatically and accurately using various classification approaches. This section surveys the papers related to our work.
The literature survey led to the central question of this work: which basic segmentation works best with which classifier on our CCDDB dataset. In the surveyed work, the basic segmentations (Global Threshold, Adaptive Gaussian, Adaptive Mean, Otsu, Canny, Sobel, and K-Means) have not all been applied to a single dataset. This paper applies all of them and evaluates how well each classifier performs on each, in order to find the combination that achieves the best accuracy. Table 1 shows a few recent papers supporting our literature review.

Methodology of the proposed system
This section describes the proposed model for cashew leaf disease detection. The primary goal of disease identification is to determine what type of illness the cashew crop is experiencing. The purpose of disease management is to predict how an illness will progress. Accuracy is one of the many aspects of identifying and categorizing plant leaf diseases.
Steps involved in image classification include Cashew Leaf image dataset collection, image preprocessing, segmentation, and classification. Figure 1 shows the proposed model for cashew leaf disease detection.

CCDDB dataset
This part describes the CCDDB and PDDB datasets and the image acquisition. The first step in plant leaf classification is data collection. Here, cashew leaf images are collected from a cashew orchard with a digital camera and used as input for the classification model.
The sample images are gathered from cashew orchards in Konnakavali, Varichikudi, Karaikal, recognized as one of the areas with the highest cashew output in Karaikal. The samples comprise healthy and diseased leaves, including anthracnose, bacterial leaf spot, red rust, grey blight, minor, shooty, and vein necrosis of cashew leaf. The main focus is on anthracnose disease.
The photos are captured with a digital camera at 3020 × 3020 pixels and then resized to 100 × 100 pixels. The data were collected over about one month. The maximum values, recorded from 2.20 pm to 3.35 pm, were a temperature of 35.0-39.8 °C, a humidity of 50%-56%, and a wind speed of 18-21 kmph. The minimum values, recorded from 11.00 am to 12.10 pm, were a temperature of 33.1-36.2 °C, a humidity of 38%-51%, and a wind speed of 13-18 kmph. All experiments are performed in Python 3 (Jupyter Notebook) on the DESKTOP_USVG06J computer with an Intel i5-8400 CPU @ 2.80 GHz and 16 GB RAM, running the Windows 10 Pro operating system.

PDDB dataset
The second dataset was built by Embrapa as a representative plant disease database to support the development of effective automatic plant disease detection and recognition technologies. This collection, known as PDDB, contained 2326 pictures depicting 171 diseases and other disorders that harm 21 different plant species [10].

Image preprocessing
This section describes the image preprocessing used in this work, which includes normalization, augmentation, and resizing. The pictures are taken with digital cameras. Cashew leaf images are normalized by bringing every image in the dataset to a fixed size for processing.
In augmentation, flipping and rotation were applied to expand the CCDDB dataset. Flipping includes horizontal and vertical flips. Rotation includes rotating by 90, 180, and 270 degrees clockwise.
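The flips and rotations described above can be sketched with NumPy alone. This is a minimal illustration rather than the authors' code: the function name `augment` and the random stand-in image are assumptions, and only the 100 × 100 size and the list of transformations come from the text.

```python
import numpy as np

def augment(image):
    """Return the flipped and rotated variants of one image array (H x W x C)."""
    return {
        "flip_horizontal": np.fliplr(image),
        "flip_vertical": np.flipud(image),
        # np.rot90 rotates counter-clockwise, so a negative k rotates clockwise.
        "rot_90": np.rot90(image, k=-1),
        "rot_180": np.rot90(image, k=-2),
        "rot_270": np.rot90(image, k=-3),
    }

# Hypothetical stand-in for one resized 100 x 100 RGB leaf image.
leaf = np.random.default_rng(0).integers(0, 256, size=(100, 100, 3), dtype=np.uint8)
variants = augment(leaf)
```

Each original image then contributes five additional samples to the training set.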

Image segmentation
This section describes the segmentation used in our work. Image segmentation aims to identify what an image contains at the pixel level.

Threshold-based segmentation
Thresholding segments an image by assigning all pixels whose intensity is greater than a threshold value to the foreground and the remaining pixels to the background.
Global threshold. Thresholding is the primary option for segmenting images in digital image processing. It can be used to convert grayscale images into binary images. For an input image f(x,y), the thresholded image g(x,y) is defined as

$$g(x,y) = \begin{cases} 1, & f(x,y) > ThValue \\ 0, & \text{otherwise} \end{cases}$$

where 1 represents the object, 0 represents the background, and ThValue is the threshold value [11].
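As a minimal sketch of global thresholding, the snippet below marks every pixel brighter than the threshold as object (1) and the rest as background (0). The function name `global_threshold`, the tiny example array, and the value 127 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def global_threshold(gray, th_value):
    """Binary image: 1 where the intensity exceeds th_value, 0 elsewhere."""
    return (gray > th_value).astype(np.uint8)

# Hypothetical 2 x 2 grayscale patch.
gray = np.array([[10, 200], [130, 90]], dtype=np.uint8)
mask = global_threshold(gray, th_value=127)
```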
Otsu threshold. The algorithm's most basic form yields a single intensity level that separates pixels into the background and foreground classes [12,13].
The following equation can be used to calculate the within-class variance at any threshold th:

$$\vartheta_w^2(th) = \phi_{bg}(th)\,\vartheta_{bg}^2(th) + \phi_{fg}(th)\,\vartheta_{fg}^2(th)$$

where ϕ_bg(th) and ϕ_fg(th) represent the probability (fraction) of pixels falling in each class at threshold th, and ϑ² represents the variance of the color values in a class. The variance is computed from the deviation of each pixel value PV_i in the class from the class's average pixel value PV (bg or fg), over the N pixels in the class.
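The search over thresholds can be sketched as follows, as an illustration rather than the authors' implementation. It assumes an 8-bit grayscale image, treats pixels below `th` as background and the rest as foreground, and uses the population variance (division by the class size) for each class; these conventions and the name `otsu_threshold` are assumptions made for the example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold th that minimises the weighted within-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()           # probability of each intensity level
    levels = np.arange(256)
    best_th, best_var = 0, np.inf
    for th in range(1, 256):
        w_bg, w_fg = prob[:th].sum(), prob[th:].sum()   # class probabilities
        if w_bg == 0 or w_fg == 0:
            continue                   # one class is empty, nothing to split
        mean_bg = (levels[:th] * prob[:th]).sum() / w_bg
        mean_fg = (levels[th:] * prob[th:]).sum() / w_fg
        var_bg = (((levels[:th] - mean_bg) ** 2) * prob[:th]).sum() / w_bg
        var_fg = (((levels[th:] - mean_fg) ** 2) * prob[th:]).sum() / w_fg
        within = w_bg * var_bg + w_fg * var_fg
        if within < best_var:
            best_th, best_var = th, within
    return best_th

# Hypothetical bimodal image: a dark half and a bright half.
gray = np.zeros((10, 10), dtype=np.uint8)
gray[:, :5], gray[:, 5:] = 50, 200
th = otsu_threshold(gray)
mask = (gray >= th).astype(np.uint8)
```

On a cleanly bimodal image like this one, any threshold between the two intensity values separates the classes perfectly, so the returned value falls in that range.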
Adaptive thresholding. In adaptive thresholding, the threshold value changes dynamically across the image, so each small region has its own threshold value.
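A minimal sketch of the Adaptive Mean variant is given below: each pixel is compared against the mean of its own neighbourhood minus a small constant. The parameter names `block_size` and `c`, their default values, and the edge padding are assumptions for illustration. The Adaptive Gaussian variant would replace the plain neighbourhood mean with a Gaussian-weighted one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def adaptive_mean_threshold(gray, block_size=11, c=2):
    """Binary image using a per-pixel threshold: local mean minus c.

    block_size must be odd so that the window is centred on each pixel.
    """
    pad = block_size // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    windows = sliding_window_view(padded, (block_size, block_size))
    local_mean = windows.mean(axis=(-2, -1))      # one mean per pixel
    return (gray > local_mean - c).astype(np.uint8)

# Hypothetical uniform patch with a single dark spot in the centre.
gray = np.full((21, 21), 100, dtype=np.uint8)
gray[10, 10] = 0
mask = adaptive_mean_threshold(gray)
```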

Edge-based segmentation
Sobel edge detection. The image is processed in the x and y directions separately, and the two gradient responses are then combined into a new image:

$$G = \sqrt{G_x^2 + G_y^2}$$

G_x and G_y are the gradient components in the x and y directions. Typically, the operator estimates the approximate gradient magnitude at every position in an input grayscale image [14].
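The combination of the x and y responses can be sketched with the standard 3 × 3 Sobel kernels. This is an illustrative NumPy version, not the authors' code: the edge padding and the names `KX`, `KY`, and `sobel_magnitude` are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Standard 3 x 3 Sobel kernels for the x and y directions.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel_magnitude(gray):
    """Gradient magnitude sqrt(Gx^2 + Gy^2) at every pixel."""
    padded = np.pad(gray.astype(float), 1, mode="edge")
    windows = sliding_window_view(padded, (3, 3))
    gx = (windows * KX).sum(axis=(-2, -1))   # response to horizontal change
    gy = (windows * KY).sum(axis=(-2, -1))   # response to vertical change
    return np.sqrt(gx ** 2 + gy ** 2)

# Hypothetical image with one vertical step edge in the middle.
gray = np.zeros((6, 6), dtype=np.uint8)
gray[:, 3:] = 255
magnitude = sobel_magnitude(gray)
```

The magnitude is zero in the flat regions and large in the two columns next to the step, which is where a threshold on `magnitude` would mark an edge.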

Canny edge detection
Canny edge detection first convolves the image with a Gaussian function to obtain a smooth image. A difference (gradient) operator is then applied to determine edge strength, from which the edge magnitude and direction are computed as usual. Non-maximal suppression is applied to the gradient magnitude, and finally a threshold is applied to the non-maximally suppressed image [15].

Clustering-based segmentation
In K-means segmentation, the pixels of a given image are grouped into multiple segments, which are then passed to a classifier to find the boundaries that describe the location of an object [16,17]. The K-means approach is composed of two distinct phases. Phase 1 involves calculating the k centroids, and phase 2 involves assigning each point to the cluster whose centroid is closest to it.
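The two phases can be sketched as a loop over pixel colours. The snippet is a minimal NumPy illustration under stated assumptions: the function name `kmeans_segment`, the default `k`, the fixed iteration count, and the initialisation from distinct colours in the image are choices made for the example, not details given in the paper.

```python
import numpy as np

def kmeans_segment(image, k=3, n_iter=20, seed=0):
    """Cluster pixel colours; return a label map (H x W) and the k centroids."""
    pixels = image.reshape(-1, image.shape[-1]).astype(float)
    rng = np.random.default_rng(seed)
    # Start from k distinct colours present in the image (needs >= k of them).
    unique = np.unique(pixels, axis=0)
    centroids = unique[rng.choice(len(unique), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every pixel to the cluster with the nearest centroid.
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean colour of its members.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels.reshape(image.shape[:2]), centroids

# Hypothetical two-colour image: green "leaf" on the left, brown "lesion" on the right.
image = np.zeros((20, 20, 3), dtype=np.uint8)
image[:, :10] = (40, 160, 40)
image[:, 10:] = (120, 70, 20)
labels, centroids = kmeans_segment(image, k=2)
```

Because the clusters are formed in colour space, an irregularly shaped patch ends up in one cluster as long as its pixels have similar colours.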

Classifier selection
This section details the classifiers used in this work. The work compares the Random Forest classifier with an ensemble model built from Decision Tree, KNN, Logistic Regression, and Gaussian Naive Bayes classifiers.
Ensemble learning starts from several base classifiers and builds a new classifier that can work better than any of its component classifiers. In hard voting, the output class is the one that receives the largest number of votes from the individual classifiers. In soft voting, the output class is chosen from the overall probability that the classifiers assign to each class.
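The difference between the two voting schemes can be shown with a small NumPy sketch. The function names and the probabilities are made up for illustration: they stand for the outputs of four base classifiers on a single leaf, with class 0 as healthy and class 1 as diseased, and soft voting is taken here as a plain average of the class probabilities.

```python
import numpy as np

def hard_vote(predictions):
    """predictions: (n_classifiers, n_samples) integer labels -> majority label."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

def soft_vote(probabilities):
    """probabilities: (n_classifiers, n_samples, n_classes) -> label with the
    highest probability averaged over classifiers."""
    return np.mean(probabilities, axis=0).argmax(axis=1)

# Hypothetical probability of class 1 from four classifiers for one sample.
p_diseased = np.array([0.9, 0.45, 0.45, 0.45])
probabilities = np.stack([1 - p_diseased, p_diseased], axis=-1)[:, None, :]

hard = hard_vote(probabilities.argmax(axis=2))   # three of four vote for class 0
soft = soft_vote(probabilities)                  # mean probability favours class 1
```

In this example one very confident classifier outweighs three mildly uncertain ones under soft voting, while hard voting follows the majority.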
Many traditional machine learning methods, including Random Forests, Bagging, and Boosted Decision Trees, are built on decision trees. A Random Forest is a collection of decision trees [6]. KNN is a simple algorithm that stores all available cases and classifies a new instance by the majority vote of its k nearest neighbors. K-Means is an unsupervised learning technique used to group unlabelled data. The Naive Bayes [18] model is simple to construct and effective for large datasets. Logistic regression [19], one of the most basic machine learning techniques, models the probability that the output equals 1 as a function of the input.
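A comparison of this kind could be set up with scikit-learn roughly as follows. This is a sketch under assumptions, not the authors' pipeline: the synthetic features from `make_classification` stand in for flattened segmented leaf images, and every hyperparameter shown (number of trees, `n_neighbors`, test split) is an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for feature vectors extracted from segmented leaf images.
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

def base_estimators():
    """Fresh copies of the four base classifiers used in the voting ensembles."""
    return [
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("gnb", GaussianNB()),
    ]

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "hard_voting": VotingClassifier(base_estimators(), voting="hard"),
    "soft_voting": VotingClassifier(base_estimators(), voting="soft"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

Repeating the loop once per segmentation method would give one accuracy per segmentation and classifier pair.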

Experiment and result analysis
This section presents the experimental findings for the proposed framework. Tables 2 and 3 show that K-Means segmentation with the Random Forest classifier outperforms the other classifiers on the CCDDB dataset, with an accuracy of 96.76%. The same model is applied to the PDDB dataset for comparison, where K-Means segmentation with the Random Forest classifier again outperforms the others with an accuracy of 99.7%. As Random Forest outperforms all the other classifiers, its accuracy is highlighted in bold.
Table 4 shows a sample input and the segmented leaf produced by each segmentation technique (Global Threshold, Otsu, Adaptive Mean, Adaptive Gaussian, Canny Edge Detection, Sobel Edge Detection, and K-means clustering) applied to the CCDDB and PDDB datasets.
Fig. 2a and b show the overall accuracy of the different classifiers for each segmentation method: Global Threshold, Otsu, Adaptive Mean, Adaptive Gaussian, Canny Edge Detection, Sobel Edge Detection, and K-means clustering. K-means with the Random Forest classifier outperforms the other segmentation methods and classifiers on the CCDDB dataset. Based on the results of the test runs, it is clear that K-Means performs well compared to the other segmentation methods. Threshold-based image segmentation assigns pixels to the foreground by intensity alone. Because the positions of pixels are disregarded, pixels allotted to the same class are not required to cluster together, and consequently the clustering performance is low. Edge-based segmentation generally works poorly on images with smooth transitions, and its sensitivity to background noise is another significant drawback. Edge detection methods are inherently best at detecting vertical and horizontal edges, whereas disease spots are generally radial or irregular in shape. K-means, however, has certain advantages that help it segment the diseased leaves better. K-means clustering can efficiently separate the area of interest from its surroundings, and it can group irregularly shaped patches in an image into a separate cluster. K-means with Random Forest also outperforms the other segmentation methods on the PDDB dataset. For CCDDB, K-Means with Random Forest reaches 96.7% accuracy, shown in red in Fig. 2a, and for PDDB, Random Forest reaches a better accuracy of 99.7%, also shown in red.
Table 4 Sample input and the segmented leaf of different segmentation techniques applied to the CCDDB and PDDB datasets
Fig. 2 a Accuracy graph for different segmentation methods on CCDDB. b Accuracy graph for different segmentation methods on PDDB

Conclusion and future work
This work has analyzed different methods for cashew leaf image segmentation and classification. The machine learning models used are Random Forest, Decision Tree, Gaussian Naive Bayes, KNN, Logistic Regression, and Ensemble Classifiers with Hard and Soft Voting. The result analysis presents a comparison chart of the classifiers and their performance measures. Based on this work, the conclusion is that K-Means segmentation with the Random Forest classifier outperforms the other classifiers, with an accuracy of 96.7% on the CCDDB dataset. K-means with Random Forest also outperformed the other segmentation methods on the PDDB dataset, with 99.7% accuracy. In future, the plan is to incorporate multiclass classifiers and feature extraction techniques to classify cashew leaf diseases.