In the literature, various studies discuss plant disease detection systems as decision-support tools for farming operations. This section comprises two main parts: the first presents the general architecture of plant disease detection systems, and the second evaluates previous related studies that utilize single- and multi-camera inputs.
3.1 A General Architecture of Plant Disease/Nutrient Deficiency Detection Models
Machine Learning-based plant disease/nutrient deficiency detection systems comprise two main sub-systems: an image processing system and a classification system [6]. Image processing further comprises four main steps, while four classification techniques are the most frequently cited. Table 1 summarizes the image processing steps and the four popular Machine Learning classification models utilized in plant disease/nutrient deficiency detection systems.
Table 1
Image processing steps and different classification techniques in plant disease detection
| Image processing steps | Classification techniques |
| --- | --- |
| 1. Image acquisition | 1. SVM classifier |
| 2. Image pre-processing | 2. ANN classifier |
| 3. Image segmentation | 3. kNN classifier |
| 4. Feature extraction | 4. Fuzzy classifier |
| 5. Machine learning classification | |
Image processing steps.
Image acquisition.
Acquiring an image of the sample to be classified is the starting point of Machine Learning-based plant disease/nutrient deficiency detection systems [15, 6]. Image-acquiring devices such as image sensors, scanners, and unmanned aerial vehicles (UAVs) are commonly utilized for this purpose [7]. These image sensors utilize two main sensing technologies: charge-coupled device (CCD) and complementary metal oxide semiconductor (CMOS) technology [7]. Both camera technologies convert light signals (photons) into digital data, which is then formed into image data [7, 8]. However, their methods of forming image data vary [8]. In a CCD camera, the light signals are transmitted through a string of neighboring pixels, where they are then amplified and transformed into image data [9]. This makes it possible for CCD cameras to acquire images with little degradation [9]; they produce images that are sharp and have minimal distortion [10]. In contrast, each pixel of a CMOS image sensor collects, amplifies, and converts light signals locally [10–12]. Because each pixel can convert light signals into image data on its own, CMOS devices can produce images more quickly than CCD devices [11–15].
Since they are less expensive than CCD devices, use less power, and can capture high-quality images faster, CMOS devices are typically used in low-budget projects [15, 16]. Figure 2 depicts the serial pixel conversion of CCD image sensors and the localized pixel conversion of CMOS image sensors.
Image Pre-processing.
Pre-processing an image involves adjusting the image contrast appropriately and filtering interference signals that cause noise and, subsequently, fuzzy images [17, 18]. This process can improve feature extraction precision and, more generally, disease identification accuracy [18]. Pre-processing often entails simple operations such as deblurring, clipping, cropping, filtering, and trimming, to mention a few [17, 18]. According to Khalili [19], the typical image preparation steps used in image-based detection systems include image acquisition, grayscaling, filtering, binarization, and edge filtering.
A colored image is converted into a gray image as the first stage of the process shown in Fig. 3 [19]. This step is important since processing an image in grayscale format is much easier and quicker [17–19]; it may, however, be skipped in applications where color information is important. The second stage entails denoising the specimen image because, in most cases, noise interference impacts the visibility of features in specimen images [19]. The next phase is image segmentation, which is covered in more detail in the coming section. The final stage entails creating an outline image, which can be done by preserving the outer connected region while masking the leafstalk and holes [19].
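The grayscale-conversion and denoising stages described above can be sketched in a few lines of pure NumPy. This is a minimal illustration, not the exact operations of the cited studies: the luminance weights follow the common ITU-R BT.601 convention, and a simple 3x3 mean filter stands in for whichever denoising filter a given system uses.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale using the
    standard luminance weights (ITU-R BT.601)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def mean_filter3(img):
    """Suppress noise with a simple 3x3 mean filter (borders kept as-is)."""
    out = img.astype(float).copy()
    for i in range(1, img.shape[0] - 1):
        for j in range(1, img.shape[1] - 1):
            out[i, j] = img[i - 1:i + 2, j - 1:j + 2].mean()
    return out

# A tiny 4 x 4 x 3 synthetic "image": pure red everywhere.
rgb = np.zeros((4, 4, 3))
rgb[..., 0] = 100.0
gray = to_grayscale(rgb)
smooth = mean_filter3(gray)
print(gray[0, 0])  # about 29.9: only the red channel contributes
```

In a real pipeline these two steps would be followed by the segmentation and outline-extraction stages discussed next.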
Patil [20] suggested a similar process to that shown in Fig. 3 for applications involving the recognition of plant-leaf features in a variety of illumination environments. The process involves converting a sample image to grayscale, noise suppression, smoothing, and edge filtering to create the image outline [20]. In comparative research by Orchi [21], histogram equalization was shown to be the most efficient way to improve the quality of grayscale versions of color images. In contrast, Zhang [22] found that, when used to identify diseases on plant leaves, RGB camera images provide more useful enhancement than those converted to grayscale.
Image Segmentation.
Image segmentation is an essential component of image-based plant feature recognition and phenotyping systems [23]. To segment an image, the foreground and background must be separated [23], which entails isolating the feature of interest and masking the irrelevant areas of the image [24]. Features of interest are often found by comparing nearby pixels for similarity based on three major criteria: texture, color, and shape [24]. Thresholding is one of the most well-known image segmentation techniques [25]. To make feature classification simpler, threshold segmentation transforms a color or grayscale image into a binary image, as seen in Fig. 4 [24, 25]. The output binary images consist of black and white pixels corresponding to the background and foreground, respectively [25].
Threshold segmentation is mathematically defined in Eq. (1), where T refers to a certain threshold intensity, g is the black or white pixel of the output binary image, and f is the grey level of the input image [26].
$$g\left(x,y\right)=\left\{\begin{array}{ll}0, & \text{if } f\left(x,y\right)<T\\ 1, & \text{if } f\left(x,y\right)\ge T\end{array}\right.$$ (1)
Threshold segmentation is classified into three types: global, local, and adaptive thresholding [26]. When there is a sufficient difference between the intensity distributions of the foreground and the background, global thresholding is used [26]: a single threshold value discriminates between the features of interest and the background [26]. When there is no obvious difference between the intensity distributions of the background and the foreground, it is difficult to choose a single threshold value, so local thresholding is used [26]. In this scenario, the image is divided into smaller sub-images and a separate threshold value is chosen for each [15]. When a threshold value is calculated for each individual pixel in an image with an uneven intensity distribution, this is called adaptive thresholding [26, 27].
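The contrast between global and adaptive thresholding can be sketched as follows. This is a minimal pure-NumPy illustration, where the per-pixel local-mean rule and the block size are illustrative choices (practical systems typically use optimized routines such as those in OpenCV):

```python
import numpy as np

def global_threshold(img, T):
    """Eq.-(1)-style global thresholding: 1 where intensity exceeds T."""
    return (img > T).astype(np.uint8)

def adaptive_threshold(img, block=3, c=0):
    """Adaptive thresholding sketch: each pixel is compared against the
    mean of its local neighbourhood rather than a single global value."""
    h, w = img.shape
    out = np.zeros_like(img, dtype=np.uint8)
    r = block // 2
    for i in range(h):
        for j in range(w):
            window = img[max(0, i - r):i + r + 1, max(0, j - r):j + r + 1]
            out[i, j] = 1 if img[i, j] > window.mean() - c else 0
    return out

# Toy image: dark background on the left, bright foreground on the right.
img = np.array([[10, 10, 200],
                [10, 200, 200],
                [10, 10, 200]], dtype=float)
print(global_threshold(img, 100))
```

With a clearly bimodal image like this one, both approaches agree; adaptive thresholding earns its keep when illumination varies across the image.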
Another thresholding technique used for image segmentation is the Otsu thresholding approach [27]. By iteratively looping over all possible threshold values, this method calculates the spread of pixel intensity levels on either side of the threshold [27]. The goal is to find the threshold for which the combined spread of the foreground and background diminishes to the smallest amount [26, 27]. The Otsu method's key characteristic is that the threshold value is determined automatically rather than being preselected by the user [27]. The Otsu thresholding method is defined mathematically in Eq. (2).
$$g\left(x,y\right)=\left\{\begin{array}{ll}1, & \text{if } f\left(x,y\right)>T\\ 0, & \text{otherwise}\end{array}\right.$$ (2)
Watershed transformation is another segmentation technique used in image processing [28]. The watershed is a transformation applied to a grayscale image [28]. The name alludes, metaphorically speaking, to a geologic drainage divide that separates adjacent catchments [28]. The watershed transformation treats the image it operates on as a topographic map, with the luminosity of each pixel designating its elevation, and locates the lines that follow the tops of ridges [28]. Figure 5 shows an illustration of a watershed-segmented image, where the background is shown by black pixels, the features to be extracted are indicated by grey pixels, and the watershed lines are indicated by white pixels [28].
On the other hand, GrabCut is a popular and ground-breaking segmentation method that takes into account an image's boundary and textural conditions [29]. This technique is based on the iterative graph-cut method, where the background and foreground are modeled using a mathematical function [29]. Each pixel in the image is then determined to belong to either the background or the foreground [29]. GrabCut is popular in many applications since it requires little human involvement in its use, but it has limitations as well [30]. The intricacy of the thresholding equation makes the GrabCut iteration cycles hard to construct [30]. Segmentation quality is also inferior when the background is complicated and there is little contrast between the features of interest and the background [30]. There are numerous other segmentation methods and algorithms in the literature. This study cannot rule out specific segmentation methods or declare some better than others, because the applicability of each method depends on the specific application.
Image Feature Extraction.
Feature extraction is one of the key components of computer vision-based image recognition [31, 32]. A feature is an element of a raw image that is used to solve a specific computer vision problem [31, 32]. The features extracted from an image are collected into feature vectors [32]. A wide variety of approaches are employed to identify the objects in an image and create feature vectors [31, 32]. The main feature types are edges, pixel intensities, geometry, texture, image transforms such as Fourier or Wavelet, and combinations of pixels from different color channels [31, 32]. Extracted features are mainly used as inputs to classifiers and machine learning algorithms [31, 32]. Feature extraction in plant leaf disease monitoring systems is classified into three spheres: texture, color, and shape [32].
Shape is a fundamental aspect of a leaf/fruit used in feature extraction from leaf/fruit images [33]. The primary shape parameters are the length (L) and width (W), which represent the displacement between the two extreme points along the longest axis and the axis perpendicular to it; the diameter (D), which represents the greatest distance between any two points; the area (A), which represents the surface area of all the pixels within the margin of a leaf image; and the perimeter (P), which represents the cumulative length of the pixels along the margin of a leaf image [33–35]. From these five primary shape characteristics, several secondary characteristics can be formed, such as circularity, rectangularity, and aspect ratio [34, 35].
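The area, perimeter, and a secondary characteristic such as circularity can be computed directly from a binary leaf mask. The following pure-NumPy sketch uses one common convention, counting a foreground pixel as perimeter if it has at least one 4-connected background neighbour; other perimeter definitions exist, so this is illustrative rather than canonical:

```python
import numpy as np

def shape_features(mask):
    """Basic shape descriptors from a binary leaf mask.
    Area A  = number of foreground pixels.
    Perimeter P = foreground pixels with a 4-connected background neighbour.
    Circularity = 4*pi*A / P^2 (1.0 for an ideal continuous disc)."""
    padded = np.pad(mask, 1)  # zero border so edge pixels count as boundary
    area = int(mask.sum())
    perimeter = 0
    for i in range(1, padded.shape[0] - 1):
        for j in range(1, padded.shape[1] - 1):
            if padded[i, j] and (padded[i - 1, j] == 0 or padded[i + 1, j] == 0
                                 or padded[i, j - 1] == 0 or padded[i, j + 1] == 0):
                perimeter += 1
    circularity = 4 * np.pi * area / perimeter ** 2
    return area, perimeter, circularity

# A 4 x 4 solid square standing in for a segmented leaf.
mask = np.ones((4, 4), dtype=np.uint8)
a, p, c = shape_features(mask)
print(a, p)  # 16 12
```

Aspect ratio (L/W) and rectangularity (A divided by the bounding-box area) follow similarly once the bounding box is known.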
Some academics and researchers use color features as the key factors in the extraction process [35]. Examples of color features frequently discussed in the literature for leaf/fruit feature extraction include the color mean, color skewness, and color kurtosis. Several textural characteristics are also mentioned by scholars, including Sharif [36], Hayit [37], and Jasim [38]; these textural features include correlation, entropy, and contrast [36, 38].
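The color moments named above (mean, skewness, kurtosis) reduce each color channel to a handful of numbers. A minimal sketch using the population-moment formulas on a toy channel follows; real systems compute these per channel (e.g. R, G, B or H, S, V):

```python
import numpy as np

def color_moments(channel):
    """Mean, skewness, and kurtosis of one color channel.
    Skewness is the third standardized moment; kurtosis the fourth
    (population formulas, no bias correction)."""
    x = channel.ravel().astype(float)
    mu = x.mean()
    sigma = x.std()
    skew = ((x - mu) ** 3).mean() / sigma ** 3
    kurt = ((x - mu) ** 4).mean() / sigma ** 4
    return mu, skew, kurt

# Toy 2 x 2 channel with values symmetric around 20.
channel = np.array([[10.0, 20.0], [20.0, 30.0]])
mu, skew, kurt = color_moments(channel)
print(mu, skew)  # 20.0 0.0 (symmetric values, so zero skewness)
```

These three numbers per channel form a compact, illumination-tolerant slice of the feature vector alongside the shape and texture features.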
After the image processing part is complete, from image acquisition to feature extraction, the chosen features are used to train a machine learning classification model of choice, completing the design phase of a plant disease/nutrient deficiency detection model. The coming section discusses the most cited machine learning classification algorithms for this purpose.
Machine Learning Classification Algorithms.
Machine learning classification algorithms are used to categorize input sample data into several classes or groups of membership [39]. During training, these classifiers may use supervised, unsupervised, or reinforcement learning techniques [39, 40]. Supervised learning takes place when a human trainer carries out the training using pre-labeled data sets [39, 40]. Unsupervised learning takes place when no labeled training data are available; the algorithm must therefore train itself and increase its classification accuracy by iterative adjudication [39, 40]. Reinforcement learning takes place when the algorithm makes classification decisions based on the feedback the environment provides [39, 40]. The most frequently cited classification algorithms for vision-based plant disease monitoring systems are Support Vector Machines (SVM), Artificial Neural Networks, k-Nearest Neighbor classifiers, and Fuzzy classifiers. These classification strategies are covered in the ensuing subsections.
Support Vector Machines.
A Support Vector Machine, also referred to as an SVM, is a predictive model that may be used to handle both classification and regression problems [41]. It is a supervised learning model that can tackle both linear and non-linear tasks and performs well for many real-world problems [41]. The notion behind the SVM technique is very straightforward: it generates a vector or hyperplane that divides the data into groups [42].
Figure 6 shows two classes of data (blue squares and green circles) separated by the best hyperplane. The positive and negative marginal planes, which pass through the nearest data points on either side of the optimal hyperplane, are the two planes (dashed lines) parallel to it [42]. The support vectors are the data points closest to the optimal hyperplane and are used to pinpoint its precise location [42]. There may be several potential hyperplanes, but the one with the largest marginal distance, i.e., the distance between the two marginal planes, is the best option [41, 42]. Compared to smaller margins, the maximal margin yields a more general solution; an algorithm with a smaller margin will face accuracy problems if the training data change [42]. Data classes are sometimes not separable by a clear line, unlike in Fig. 6. When data classes exhibit nonlinearity, the Kernel method is used to transform the low-dimension (typically 2-dimensional) space where these data classes occur into a higher-dimension (commonly 3-dimensional) space.
In the new high-dimension space, the Kernel approach computes the dot product of the dimensions [41, 42]. Where \(\overrightarrow{x}\) is any data point or support vector, \(\overrightarrow{\omega }\) is the weight vector that applies to the support vectors, and \({\omega }_{0}\) is a constant bias term [42], the generic solution of a hyperplane is

$$\overrightarrow{\omega }\cdot \overrightarrow{x}+{\omega }_{0}=0$$
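The hyperplane decision rule and the marginal distance can be illustrated concretely. In the sketch below the weight vector and bias are hand-picked for illustration, not learned from data; a real SVM would fit them by maximizing the margin over the support vectors:

```python
import numpy as np

# Illustrative (hand-picked, not learned) hyperplane parameters.
w = np.array([1.0, 1.0])   # weight vector, normal to the hyperplane
w0 = -3.0                  # bias term

def classify(x):
    """The sign of w.x + w0 decides which side of the hyperplane x lies on."""
    return 1 if np.dot(w, x) + w0 > 0 else -1

def margin_width():
    """Distance between the two marginal planes is 2 / ||w||."""
    return 2.0 / np.linalg.norm(w)

print(classify(np.array([2.0, 2.0])))  # 1   (2 + 2 - 3 > 0)
print(classify(np.array([1.0, 1.0])))  # -1  (1 + 1 - 3 < 0)
```

Training amounts to choosing w and w0 so that this margin width is as large as possible while the support vectors sit exactly on the marginal planes.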
Neural Network.
A neural network (NN) is a supervised learning model that consists of a network of connected input and output nodes, where each link has a weighted bias value [43]. A general NN has a single input layer, one or more intermediate layers, often referred to as hidden layers, and one or more output layers [43]. When the network runs, the weight of each connection is altered to aid neural network learning [44]; the network's performance is improved by continuously adjusting the weights [44]. By connection type, NNs can be split into two groups: feed-forward networks and recurrent networks [43, 44]. Feed-forward neural networks do not have cycle-forming connections between units, in contrast to recurrent neural networks [43, 44]. A neural network's behavior is influenced by its architecture, transfer function, and learning rule [44]. The neurons in a neural network are activated by the weighted sum of their inputs [43, 44]. A generalized NN model comprising the input layer, the hidden intermediate layer (blue), and the output layer is shown in Fig. 7.
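A single forward pass of a feed-forward network like the one in Fig. 7 can be written in a few lines. This is a sketch only: the sigmoid transfer function, layer sizes, and random (untrained) weights are illustrative assumptions, and no learning rule is shown.

```python
import numpy as np

def sigmoid(z):
    """A common transfer (activation) function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """One forward pass: input -> hidden layer -> output layer.
    Each layer computes the weighted sum of its inputs plus a bias,
    then applies the transfer function."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2])                        # 2 input features
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # 3 hidden neurons
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)    # 1 output neuron
y = forward(x, W1, b1, W2, b2)
print(y.shape)  # (1,)
```

Training would repeatedly run this pass, compare y against a target, and adjust W1, b1, W2, b2, which is the weight-updating behaviour described above.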
K-Nearest Neighbors.
The k-Nearest Neighbors algorithm, also referred to as kNN, is one of the simplest machine learning methods [45]. It is a non-parametric method applied to regression and classification problems [45, 46]. Non-parametric means that the method makes no assumptions about the underlying data distribution [45, 46]; as a result, kNN avoids the use of any such presumptions [46]. For both classification and regression tasks, the k closest training instances in the feature space serve as the input [45, 46]. The outcome depends on whether kNN is used for classification or regression [45, 46]. The kNN classifier outputs a class of membership [46]: the given data point is categorized according to the prevailing class in its neighborhood [46], i.e., the input point is assigned to the category with the highest frequency among its neighbors [46].
Figure 8 shows a space with numerous data points or vectors that can be classified into two classes: class A and class B. To classify the red vector, the kNN classifier computes the spatial distance between the red vector and its neighbors for a given constant k, and assigns the unknown red vector the class into which the majority of its neighboring vectors fall. In Fig. 8, for k equal to 3, the red vector is classified as class B. The most cited method of computing the spatial distance between the data point p to be classified and its neighbors q is the Euclidean formula (3) [45, 46].
$$\text{d}\left(\text{p},\text{q}\right)=\text{d}\left(\text{q},\text{p}\right)=\sqrt{{\left({q}_{1}-{p}_{1}\right)}^{2}+{\left({q}_{2}-{p}_{2}\right)}^{2}+\dots +{\left({q}_{n}-{p}_{n}\right)}^{2}}=\sqrt{\sum _{i=1}^{n}{\left({q}_{i}-{p}_{i}\right)}^{2}}$$ (3)
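The majority-vote procedure described for Fig. 8 can be sketched with the Euclidean distance of Eq. (3). The toy data set below is hypothetical, standing in for the class A and class B clusters of the figure; `math.dist` computes exactly the Euclidean formula above:

```python
import math
from collections import Counter

def knn_classify(point, data, k=3):
    """Classify `point` by majority vote among its k Euclidean-nearest
    labelled neighbours; `data` is a list of ((features...), label) pairs."""
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data set: class A clustered near the origin, class B near (5, 5).
data = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
        ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify((4.5, 5.0), data, k=3))  # B
```

As in Fig. 8, the query point near the B cluster collects three B votes out of three and is classified accordingly.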
Fuzzy Classifier.
A supervised learning model called the fuzzy classifier system allows computational variables, outputs, and inputs to take on a spectrum of values over predetermined bands [46]. The fuzzy classifier system is trained by creating fuzzy rules that link the values of the input variables to internal or output variables [46]. Typical fuzzy classifier systems are combined with methods for credit assignment and conflict resolution [47]. The fuzzy classifier system creates suitable fuzzy rules using a genetic algorithm [47].
As seen in Fig. 9, fuzzy sets display continuous membership, and the degree (µ) to which a data point belongs to a given fuzzy set can be used to classify its membership. For instance, the degree of membership of the 690 mm data point in the "close" fuzzy set in Fig. 9 is 0.7. Figure 9 further demonstrates that a data point may belong to more than one fuzzy set, with varying degrees of membership to each set at the intersection regions, since some fuzzy sets overlap.
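A membership degree like the one read off Fig. 9 can be computed from a membership function. The sketch below uses a triangular membership function, one common shape for fuzzy sets; the 0/500/1000 mm band values are hypothetical and are not taken from Fig. 9:

```python
def triangular_membership(x, a, b, c):
    """Degree of membership in a triangular fuzzy set that rises from
    0 at `a` to 1 at `b` and falls back to 0 at `c`."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical 'close' distance set: fully 'close' at 500 mm,
# no longer 'close' at all beyond 1000 mm.
print(triangular_membership(500, 0, 500, 1000))  # 1.0
print(triangular_membership(750, 0, 500, 1000))  # 0.5
```

Overlapping sets are modeled simply by evaluating a point against several such functions, giving the multiple, differing membership degrees seen at the intersection regions of Fig. 9.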
Over and above these most cited machine learning classifiers, this study has also noted that several authors have used deep learning in plant disease/nutrient deficiency classification models. Deep learning has an advantage over the other machine learning algorithms in that feature learning and classification both occur within a single deep learning model, whereas they occur separately in the above-listed classification algorithms [46, 47].