Classification and yield prediction in smart agriculture system using IoT

The modern agriculture industry is data-centred, precise and smarter than ever. Advances in Internet-of-Things (IoT) based systems have reshaped “smart agriculture”. These innovative farming systems gradually increase crop yields, reduce irrigation wastage and make farming more profitable. Machine learning (ML) methods meet the need to scale the learning performance of the model. This paper introduces a hybrid ML model with IoT for yield prediction. The work involves three phases: pre-processing, feature selection (FS) and classification. Initially, the dataset is pre-processed and FS is performed using Correlation based FS (CBFS) and the Variance Inflation Factor (VIF) algorithm. Then, a two-tier ML model for an IoT based smart agriculture system is proposed. In the first tier, the Adaptive k-Nearest Centroid Neighbour Classifier (aKNCN) estimates the soil quality and classifies the soil samples into different classes based on the input soil properties. In the second tier, the crop yield is predicted using the Extreme Learning Machine (ELM) algorithm. The ELM weights are updated using a modified Butterfly Optimization Algorithm (mBOA) to improve the prediction accuracy of ELM with minimum error values. The proposed system is implemented and evaluated in Python. A soil dataset is used for performance evaluation of the proposed prediction model. Metrics such as accuracy, RMSE, R², MSE, MedAE, MAE, MSLE, MAPE and Explained Variance Score (EVS) are considered for the performance evaluation.


Introduction
IoT is an advanced technology for monitoring and controlling devices anywhere in the world. It has made a remarkable mark in many fields due to its easy accessibility. Technologies enabled by IoT, such as remote sensors, drones and robots, have made people's lives easier and more productive. Moreover, these technologies have been applied to fundamental needs such as food, which is obtained from agriculture. According to a recent World Bank survey, food production is estimated to need to grow by more than 50% before 2050 at the present rate of population growth. However, producing crops on such a scale is challenging because of current climatic changes. In such cases, a smart agriculture system plays a vital role in increasing the yield by monitoring and predicting crop production. IoT based crop yield prediction enables farmers to enhance productivity. In general, an IoT based smart farming system is deployed in an agricultural field to monitor the crop with the help of sensors, namely DHT11 (temperature and humidity sensor), TOC (Total Organic Carbon) and nitrogen, phosphorus and potassium (NPK) sensors. Using this setup, farmers can monitor the field conditions from anywhere. Gateways receive the data from the crop area and forward them to the storage unit. The prediction engine predicts the results and sends the information to the notification server. Agricultural supervision, particularly crop yield observation, is essential for examining the food security of a region. Manually predicting crop yield is challenging because of several complicating factors. The crop yield may vary with water quality and availability, pest infestations, genotype, landscape, soil quality, climatic conditions, etc. The underlying processes are non-linear, intricate and vary with time because of external and correlated factors (Elavarasan and Vincent 2020).
Recently, several studies have illustrated that ML approaches such as support vector regression and multilayer perceptron (MLP) have comparatively greater potential than traditional techniques. These approaches are able to model both linear and non-linear agricultural systems, and they derive their models from a learning process within the ML agricultural framework (Dang et al. 2020; Van Klompenburg et al. 2020; Bhojani and Bhatt 2020).
Among the most prominent frameworks in agriculture, artificial and deep neural networks are the most commonly used models (Gopal and Bhargavi 2019). Some of the ML models used to predict crop yield are neural architecture search (NAS) (Ren et al. 2020), linear discriminant analysis (LDA) (Yan et al. 2020), spectral clustering (Li et al. 2018a, b; Li et al. 2019) and artificial neural network (ANN) (Gopal 2019). ANN is a network model that produces an approximation by optimizing the biases and weights of a node-link structure comprising input, hidden and output layers. Deep learning (DL) is a subgroup of ML that predicts crop yield from varying arrangements of raw data via an intensive learning process in a deep network (Shook et al. 2020). Moreover, DL algorithms are able to build a probability model from field data. In addition, DL approaches provide information on plant performance under various climatic changes (Nevavuori et al. 2020). Reinforcement learning, for example, is one of the major areas of artificial intelligence. It trains ML models for sequences of decisions and is a significant class of algorithms used to streamline the logic of dynamic programming (Elavarasan and Vincent 2021). Besides, the extreme learning machine (ELM) is an ML approach that enables neural network training for crop yield prediction. It accelerates the learning process and provides better outcomes. However, these approaches have several disadvantages, such as low sustainability, high computational cost, high complexity and false predictions (Suchithra and Pai 2020). To overcome these challenges, an efficient ML based crop yield prediction model is proposed in this work. Agriculture is a major contributor to the economy of India.
To overcome the high cost and complex management of conventional agricultural planting, IoT is applied to realize real-time detection and intelligent management of crop growth, and to change the conventional agricultural planting mode. Various mathematical and empirical yield models have been evaluated for several crops. These models require an enormous amount of knowledge about soil and crop, which makes them hard to apply across different localities. Many satellite based remote sensing methods have also been developed for yield modelling, but these approaches cannot provide enough spatial detail for small farms to optimize crops. Recent developments in ML models enable researchers to understand and solve complex prediction problems. Motivated by this, aKNCN is used in this work for soil quality estimation and ELM-mBOA is used for crop yield prediction, thereby achieving better results. The major contributions of the proposed work are:
• The proposed work provides an IoT based farming system that supports the deployment of an effective crop yield prediction model. The work involves pre-processing, feature selection and classification. The data is pre-processed and the features are selected by feature selection algorithms.
• An IoT based smart agriculture system using a two-tier ML model is proposed for better prediction of crop yield.
• The ML based classification model, the Adaptive k-Nearest Centroid Neighbour Classifier (aKNCN), is used to classify the soil samples into different classes by considering the properties in the soil dataset.
• The Extreme Learning Machine (ELM) model is proposed for predicting the crop yield, and the weights of the model are updated using the modified Butterfly Optimization Algorithm (mBOA) to improve the prediction with lower error values.
Paper outline: Sect. 1 presents the introduction and briefly highlights smart agriculture. Recent related works (2019-2021) are discussed in Sect. 2. Section 3 presents the proposed framework along with the important measures. Section 4 discusses the experimental analysis and results. Section 5 provides the applications of the proposed methodology and Sect. 6 concludes the presented work.

Related works
Abbas et al. (2020) predicted crop yield via proximal sensing and ML algorithms. The objective was to extract the significant data that control the yield of the crop. The properties of the potato tuber crop and the soil data were gathered by proximal sensing. A large dataset was used to assess the prediction performance. Support vector regression (SVR), k-nearest neighbour (KNN), linear regression and elastic net ML algorithms were used for the classification and prediction of crop yield. Metrics such as R², MAE and RMSE were determined for the performance evaluation. The performance achieved by KNN was poor due to the increased number of features used in predicting crop yield. Rezk et al. (2021) presented an IoT based smart agriculture system using an ML algorithm. Drought and crop productivity were predicted by WPART, a combination of the wrapper and PART techniques. Feature selection and classification were the two important phases of the prediction process. The wrapper feature selection technique selected the optimal features for further classification. PART is a partial decision tree approach used for classification and prediction. Accuracy, precision, sensitivity and F1 score were considered in the evaluation of WPART. The crops taken for the experiment were sugarcane, jowar, bajra and soybean. Some samples in the dataset were mislabelled, so the false prediction rate was high.
Bu and Wang (2019) developed a deep reinforcement learning based ML technique for a smart farming IoT system. Cloud computing and artificial intelligence were combined to classify and predict the crop yield. The key goal of this research was to minimize resource consumption and maximize food production. A hierarchical Bayesian based multitask reinforcement learning method was used to model the Markov decision process. Then, the Q-value regression function was examined using policy distillation. However, computational complexity was one of the major drawbacks of this approach. Also, human-level performance was not achieved in solving complex tasks and adapting to dynamic environments. Nevavuori et al. (2019) proposed a deep learning technique for crop yield prediction. The key objectives of this research were crop yield prediction, biomass evaluation, and crop and weed detection. A Convolutional Neural Network (CNN) was modelled, covering feature extraction, training, hyperparameter tuning and regularization, to predict the yield of wheat and barley crops. MAE and MAPE were the evaluation metrics used for the simulation analysis. However, the presented CNN did not perform well on large datasets, and the performance efficiency of the method was not good.
Dos Santos et al. (2019) introduced the AgriPrediction model for IoT based smart agriculture systems. It was an end-to-end model for predicting agricultural crops that integrated prediction with a short and medium range wireless network system. The components of the AgriPrediction model were designed around the ARIMA prediction model and LoRa IoT technology. Initially, the data were gathered using sensors, and then a discrete moving average based prediction was performed. If the prediction indicated a problem with the crop, a notification was sent to the farmer's mobile phone. This model was computationally expensive and less sustainable. Moreover, the accuracy of the AgriPrediction model was not evaluated in that research. Its future scope was to build a mobile application for real-time crop monitoring.
Saranya and Nagarajan (2020) presented a neural network with population based incremental learning (NN-PBIL) method to improve predictive performance. The neural network was used to classify and predict the crop yield, and its weights were updated by the PBIL approach. The Hadoop framework was used to run the prediction. The neural network, along with ANN and multiple linear regression (MLR), was implemented for crop yield prediction. Slow convergence and getting stuck in local minima were the major drawbacks of this model. Filippi et al. (2019) proposed an empirical modelling scheme for predicting the yield of barley, wheat and canola crops. In this research, several fields were considered together for the prediction instead of a single field in isolation. Random forest models were combined with publicly available data and with temporal and spatial data collected on-farm for the yield prediction of canola, barley and wheat. The experimental results showed that the accuracy obtained by this predictive model was low. Sun et al. (2019) predicted soybean yield both during the season and at the end of the season using a deep CNN-LSTM based on remote sensing data. The training data, such as MODIS surface reflectance (SR) data, MODIS land surface temperature (LST) data and weather data, were combined and transformed into histogram based tensors using Google Earth Engine (GEE). The performance of crop yield prediction at large scale was not evaluated in this research. Time and computational complexity were high, and feeding the raw remote sensing data into the DL model was a complex task. Overall, these approaches have limitations such as computational complexity, high cost, dependencies between target and input variables, the need for proper model representation, limited accuracy and sensitivity to data quality.
Due to these drawbacks, the proposed model introduces two classification models with optimal weight selection for crop yield prediction.

Proposed framework
Soil constituents such as phosphorus, potassium and nitrogen, together with crop rotation, atmospheric temperature, etc., play a vital role in cultivation. ML methods are an essential decision support tool for crop yield prediction, for example in supporting decisions on which crops to grow. Many ML algorithms are employed to support the prediction of crop yield. In the proposed smart agricultural framework, pre-processing, FS and a two-tier model are implemented for crop yield prediction. Here, aKNCN is proposed, which is an improved version of KNCN. ELM is generally an influential model with faster learning, higher performance and lower training error than other algorithms. Therefore these two algorithms are proposed in this work to improve the accuracy of the system and provide better results than the existing models. Figure 1 illustrates the architecture of the proposed crop yield prediction model. Initially, pre-processing is done to remove the noise in the data, and the features are selected using feature selection methods such as CBFA and VIF. Finally, the classification uses a two-tier system. In the first tier, the proposed aKNCN model is used to classify the soil quality based on the soil nutrients collected by the IoT system. In the second tier, ELM-mBOA is used for crop yield prediction, and the accuracy is improved by optimal weight selection using mBOA. This model improves the accuracy of the system with minimum error values.

Pre-processing
The data is gathered from various sources and pre-processed. Pre-processing is a necessary phase in ML, since ML models cannot handle noisy data, that is, data containing errors and outliers. Before the data can be used for classification, it must be pre-processed to fill in missing values, remove unwanted data, extract features and keep the data in an appropriate range. In this work, the isnull() method is used to check for null values, and then the label encoder is used to convert categorical data (string format) into numerical data (numeric format). Since the ML algorithms used in Python require numeric inputs, categorical data must be converted into numeric format. Once the data is converted to numeric format, it is passed on to feature selection.
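The two pre-processing steps can be illustrated with a minimal, dependency-free sketch. It mirrors what a null check such as pandas `isnull()` and a label encoder such as scikit-learn's `LabelEncoder` do, but it is not the authors' code: the records, the column names and the choice to drop incomplete rows (rather than fill them in) are assumptions made for illustration only.

```python
# Hypothetical soil records; column names and values are illustrative only.
records = [
    {"N": 90.0, "P": 42.0, "K": 43.0, "soil_type": "loamy"},
    {"N": 85.0, "P": None, "K": 41.0, "soil_type": "clay"},
    {"N": 60.0, "P": 55.0, "K": 44.0, "soil_type": "sandy"},
    {"N": 74.0, "P": 35.0, "K": 40.0, "soil_type": "clay"},
]


def count_missing(rows):
    """Count missing (None) entries per column, similar to isnull().sum()."""
    counts = {col: 0 for col in rows[0]}
    for row in rows:
        for col, value in row.items():
            if value is None:
                counts[col] += 1
    return counts


def label_encode(values):
    """Map each distinct string to an integer code, classes in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


missing = count_missing(records)
# Assumption: incomplete rows are simply dropped in this sketch.
complete = [row for row in records if None not in row.values()]
codes, classes = label_encode([row["soil_type"] for row in complete])

print(missing)
print(classes, codes)
```

After this step every column is numeric, which is the form the feature selection stage expects.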

Feature selection (FS)
ML is a computational learning model that works on the prediction of statistical values. The FS model is applied to identify the necessary features, namely those strongly correlated with crop production. The main reason to employ FS is that it enables the ML algorithm to train faster, reduces the model complexity and makes the model easier to interpret. It also increases the system accuracy when a proper subset is selected, and it reduces overfitting. For feature sets of normal size, the computation time of the algorithm matters less than its classification performance, but FS is necessary for large datasets. Various statistical approaches can be employed in FS, such as filter, embedded and wrapper methods. Filter methods select features by their intrinsic characteristics, computed with univariate statistics, instead of by cross-validation performance. These methods are faster and less computationally expensive than wrapper methods, so filter methods are computationally cheaper when dealing with high-dimensional data. Hence, in this work, filter based FS methods, namely the CBFA algorithm and the VIF algorithm, are used. CBFA chooses the feature set that is most correlated with yield. VIF checks for multicollinearity among the independent features and thereby eliminates the multicollinear features.

Correlation based feature selection algorithm (CBFA)
CBFS ranks feature subsets using a correlation based heuristic evaluation function. This function favours subsets whose features are highly correlated with the class and uncorrelated with each other. Features that are not relevant must be removed, since they have low correlation with the class or high correlation with other features. A feature is accepted to the extent that it identifies classes in areas not already identified by other features. The CBFS merit is computed as

$$\mathrm{CBFS} = \frac{N\, r_c}{\sqrt{N + N(N-1)\, r_f}}$$

where $N$ is the number of features, $r_c$ is the average feature-class correlation and $r_f$ is the average pairwise feature-feature correlation.
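A small sketch of the correlation based merit may help make the heuristic concrete. It assumes the standard merit form N·rc / sqrt(N + N(N−1)·rf) with absolute Pearson correlations; the toy feature columns are invented for illustration and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cfs_merit(features, target):
    """Merit = N*rc / sqrt(N + N*(N-1)*rf) for a list of feature columns."""
    n = len(features)
    rc = sum(abs(pearson(f, target)) for f in features) / n
    pairs = list(combinations(features, 2))
    rf = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return n * rc / sqrt(n + n * (n - 1) * rf)


# Toy data (assumed): f2 is a rescaled copy of f1, f3 is uncorrelated with f1.
f1 = [1.0, 2.0, 3.0, 4.0, 5.0]
f2 = [2.0, 4.0, 6.0, 8.0, 10.0]
f3 = [1.0, -1.0, 1.0, -1.0, 1.0]
y = [3.0, 0.0, 5.0, 2.0, 7.0]      # y = f1 + 2 * f3

merit_single = cfs_merit([f1], y)
merit_redundant = cfs_merit([f1, f2], y)
merit_complementary = cfs_merit([f1, f3], y)
print(merit_single, merit_redundant, merit_complementary)
```

In this toy case, adding the redundant column f2 leaves the merit unchanged, while adding the complementary column f3 raises it, which is the behaviour the heuristic is meant to reward.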

Variance inflation factor algorithm (VIF)
VIF measures the strength of multicollinearity in least squares regression analysis. It gives an index of how much the variance of an estimated regression coefficient is inflated due to collinearity. The VIF model is employed to remove correlated independent features. The method is fast and uses a one-pass search over the predictors. In addition, it is computationally efficient in testing every predictor of the model and it helps avoid overfitting. It is computed by regressing each independent variable, say X, on the remaining independent variables (Y and Z) and checking how much of X is explained by these variables. VIF is measured by

$$\mathrm{VIF} = \frac{1}{1 - R^2}$$

From this expression, the higher the R², the higher the VIF, which means that the variable X is collinear with the variables Y and Z. If all the variables are completely orthogonal, R² is 0, resulting in a VIF of 1.
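The computation can be sketched with NumPy as follows. This is an illustrative implementation of VIF = 1 / (1 − R²), not the authors' code; the synthetic columns and the helper name `vif` are assumptions, and no removal threshold is taken from the paper (cut-offs such as 5 or 10 are common rules of thumb).

```python
import numpy as np


def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    values = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])  # intercept term
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        centred = target - target.mean()
        ss_tot = centred @ centred
        r_squared = 1.0 - ss_res / ss_tot
        values.append(1.0 / (1.0 - r_squared))
    return np.array(values)


# Synthetic example (assumed): column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
scores = vif(np.column_stack([a, b, c]))
print(scores)
```

Columns a and c receive large VIF values because each is well explained by the other, whereas the independent column b stays close to 1.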

Tier 1-classification
In this work, aKNCN (Rosdi et al. 2021) is used to classify the soil samples into classes from the different parameters. The proposed aKNCN overcomes the challenges of conventional KNCN and enhances its classification performance. KNCN is a non-parametric classifier that depends on the centroid distance. It requires the nearest neighbours of a test sample to satisfy two criteria: they should be close to the test sample, and they should be distributed symmetrically around it. However, it is difficult to determine neighbours in the feature space that satisfy both properties. Although KNCN achieves good accuracy, it lags in classification time. Hence aKNCN is developed to improve the classification time by adaptively adjusting the nearest centroid neighbours for every input sample, while enhancing the classification accuracy. Two properties of aKNCN are given as follows:

Property 1 The aKNCN method satisfies a stable searching phase only when the distance of the j-th nearest centroid neighbour exceeds a pre-defined limit, which is the product of the multiplier $k_l$ and the distance $d(y, z_{ncn,1})$ from the first nearest centroid neighbour $z_{ncn,1}$ to the test sample $y$, that is, $d(y, z_{ncn,j}) > k_l \cdot d(y, z_{ncn,1})$. The size of the neighbourhood is expressed in terms of $d(y, z_{s_i})$, the nearest centroid distance between the sample $z_{s_i}$ and the test sample $y$. The multiplier $k_l$ is greater than or equal to 1, and the first centroid distance is $d(y, z_{ncn,1})$.

Property 2 The aKNCN method satisfies the searching phase only when all the samples of a class, $M_i$, are found amongst the j nearest neighbours and the number of samples of every competing class is less than $M_i - 1$. The property is defined over a subset of $V'$ from class $w_i$ with $M_i$ training samples.

Finally, the soil quality is classified and the crop yield is predicted based on the classified properties for the different classes. The yield prediction is performed using ELM, which is discussed in the next subsection.
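To make the nearest centroid neighbour idea concrete, the following is a simplified sketch rather than the aKNCN algorithm of Rosdi et al. (2021). It selects neighbours so that the centroid of the chosen set stays close to the test sample, and it stops using one reading of Property 1: the search ends once a newly selected neighbour lies farther than `k_l` times the first neighbour's distance. Property 2 is not implemented, and the data, the default `k_l = 3.0` and the function name are assumptions.

```python
from collections import Counter
from math import dist


def ncn_classify(train, labels, y, k_max=5, k_l=3.0):
    """Vote among nearest centroid neighbours of y with an adaptive stop.

    Each new neighbour is the training point whose inclusion brings the
    centroid of the chosen set closest to y. The search stops after k_max
    neighbours, or when the new neighbour is farther from y than k_l times
    the distance of the first neighbour (simplified reading of Property 1).
    """
    remaining = list(range(len(train)))
    chosen = []
    d_first = None

    def centroid_distance(i):
        points = [train[j] for j in chosen] + [train[i]]
        centroid = [sum(coords) / len(points) for coords in zip(*points)]
        return dist(centroid, y)

    while remaining and len(chosen) < k_max:
        best = min(remaining, key=centroid_distance)
        d_best = dist(train[best], y)
        if d_first is None:
            d_first = d_best
        elif d_best > k_l * d_first:
            break
        chosen.append(best)
        remaining.remove(best)
    votes = Counter(labels[i] for i in chosen)
    return votes.most_common(1)[0][0], chosen


# Illustrative two-feature soil samples with made-up quality labels.
train = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
labels = ["low", "low", "low", "high", "high", "high"]

label, neighbours = ncn_classify(train, labels, (1.1, 1.1))
print(label, neighbours)
```

Here the search stops after three neighbours instead of the maximum of five, because the fourth candidate belongs to the distant cluster; this adaptive cut-off is what shortens the search compared with a fixed k.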

Tier 2-prediction
In this phase, ELM is used to predict the crop yield based on the classified soil properties of the different classes along with other parameters such as rainfall and temperature. A new metaheuristic algorithm, mBOA, is hybridized with ELM to tune the ELM parameters, such as thresholds and weights, so as to enhance the prediction accuracy with fast convergence. The mBOA optimization is used to solve convergence problems and to provide robustness. ELM has a high learning speed and better generalization because there is no need to tune the initial parameters of the hidden layer. The single hidden layer Feed Forward Network (FFN) is converted into a linear system that is solved by minimum norm least squares. The aim of ELM is to reduce the norm of the output weights and the training error at the same time.
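For the samples $\{(Z_i, T_i) \mid Z_i \in S^m, T_i \in S^n, i = 1, 2, \ldots, N\}$, the output function of the hidden layer with P neurons is

$$g(Z) = \sum_{i=1}^{P} \beta_i\, r_i(Z) = r(Z)\,\beta$$

where $\beta = [\beta_1, \beta_2, \ldots, \beta_P]$ is the output weight vector between the P hidden neurons and the output neuron. The hidden layer output vector for the input Z is given by

$$r(Z) = [r_1(Z), r_2(Z), \ldots, r_P(Z)]$$

To enhance the generalization and reduce the training error of the neural network, both the output weights and the training error must be minimized at the same time:

$$\min_{\beta}\ \frac{1}{2}\|\beta\|^2 + \frac{R}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{subject to} \quad r(Z_i)\,\beta = T_i - \xi_i,\quad i = 1, \ldots, N$$

where $\xi_i$ is the training error of the i-th sample.

According to the Karush-Kuhn-Tucker conditions, Eq. (19) can be written as

$$\beta = h^{T}\left(\frac{I}{R} + h h^{T}\right)^{-1} T$$

where h is the output matrix of the hidden layer, R is the regularization coefficient and T is the matrix of expected sample outputs. The output function of the ELM algorithm is

$$g(Z) = r(Z)\, h^{T}\left(\frac{I}{R} + h h^{T}\right)^{-1} T$$

When the feature mapping function r(Z) is unknown, the ELM kernel matrix on the basis of Mercer's condition is given by

$$M = h h^{T}, \qquad M_{ij} = r(Z_i) \cdot r(Z_j) = L(Z_i, Z_j)$$

The output function g(Z) on the basis of KOELM is given by

$$g(Z) = \left[L(Z, Z_1), \ldots, L(Z, Z_N)\right]\left(\frac{I}{R} + M\right)^{-1} T$$

where $L(Z, Z_1)$ is the kernel function of the hidden neurons of the single hidden layer FFN and $M = h h^{T}$ is the kernel matrix. Kernel functions such as the polynomial, exponential, linear and Gaussian kernels satisfy Mercer's condition.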
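The closed-form training step of a basic ELM can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the hidden weights are drawn at random (whereas the paper tunes them with mBOA), a sigmoid activation is assumed, the class name `ELMRegressor` is hypothetical, and the data are synthetic. The output weights use the regularized solution beta = hᵀ(I/R + h hᵀ)⁻¹T.

```python
import numpy as np


class ELMRegressor:
    """Single hidden layer ELM with regularized least-squares output weights."""

    def __init__(self, n_hidden=20, reg=1e3, seed=0):
        self.n_hidden = n_hidden
        self.reg = reg                      # regularization coefficient R
        self.rng = np.random.default_rng(seed)

    def _hidden(self, Z):
        # Sigmoid hidden layer output (activation choice is an assumption).
        return 1.0 / (1.0 + np.exp(-(Z @ self.W + self.b)))

    def fit(self, Z, T):
        Z, T = np.asarray(Z, dtype=float), np.asarray(T, dtype=float)
        # Random, untrained input weights and biases of the hidden layer.
        self.W = self.rng.normal(size=(Z.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        h = self._hidden(Z)
        # beta = h^T (I/R + h h^T)^{-1} T
        gram = np.eye(len(Z)) / self.reg + h @ h.T
        self.beta = h.T @ np.linalg.solve(gram, T)
        return self

    def predict(self, Z):
        return self._hidden(np.asarray(Z, dtype=float)) @ self.beta


# Synthetic regression problem standing in for (soil class, rainfall) -> yield.
rng = np.random.default_rng(1)
Z = rng.uniform(-1.0, 1.0, size=(100, 2))
T = np.sin(Z[:, 0]) + 0.5 * Z[:, 1]

model = ELMRegressor().fit(Z, T)
pred = model.predict(Z)
rmse = float(np.sqrt(np.mean((pred - T) ** 2)))
print(f"training RMSE: {rmse:.4f}")
```

Only the output weights are solved for; no iterative back-propagation is needed, which is where the learning speed of ELM comes from.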
BOA (Arora and Singh 2019) is a nature-inspired metaheuristic approach that mimics the mating and food foraging behaviour of butterflies. One of the main properties that distinguishes BOA from other optimization approaches is that every butterfly has its own scent. The fragrance is expressed as

$$f_r = s\, I_b^{a}$$

where $f_r$ represents the perceived magnitude of the fragrance, $s$ is the sensory modality, $I_b$ is the stimulus intensity and the exponent $a$ accounts for the degree of absorption of the fragrance.
The value of s lies in the range [0, ∞), but in practice it is determined by the particular optimization problem during the BOA iterative procedure. The value of s is updated over the iterations as a function of the maximum iteration count $T_{max}$, starting from an initial value of 0.01. Further, there are two stages in the process, global search and local search. The global search is calculated as

$$x_j^{t+1} = x_j^{t} + \left(r_i^{2}\, g_b - x_j^{t}\right) f_r \tag{14}$$

where $x_j^{t}$ is the solution vector of the j-th butterfly in iteration t, $r_i$ is a random number in the range [0, 1] and $g_b$ is the best solution found among all the solutions in the present iteration. The local search is expressed as

$$x_j^{t+1} = x_j^{t} + \left(r_i^{2}\, x_m^{t} - x_k^{t}\right) f_r \tag{15}$$

where $x_m^{t}$ and $x_k^{t}$ are the m-th and k-th butterflies selected randomly. When $x_m^{t}$ and $x_k^{t}$ are taken from the same iteration, the update becomes a local random walk; otherwise, the random walk may diversify the solution.
In nature, butterflies search for food and mating partners at both the local and the global scale. Hence a switch probability is used to switch between the normal global search and the intensive local search. It is seen from Eqs. (14) and (15) that choosing between local and global search at random makes BOA prone to being trapped in local optima. Further, the capacity of the parameter $r_i$ to balance local and global search is limited. Therefore some modification is needed, and the new optimal solution is obtained by the following equations.
The new equation for the global search, Eq. (16), and the new equation for the local search, Eq. (17), introduce a weighting coefficient w. Compared with Eqs. (14) and (15), the updated Eqs. (16) and (17) use this weighting coefficient to adjust the balance between local and global search better than the original BOA. The best solution is updated by either Eq. (16) or Eq. (17). Because of the weighting coefficient, these two equations provide a better convergence speed and help avoid local optima. Therefore mBOA provides better results owing to the optimal value. The pseudo-code of the modified BOA is given in Algorithm 1.
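The search loop can be sketched as follows. This is an illustrative implementation of the standard BOA update rules with a greedy acceptance step, not the authors' Algorithm 1. The placement of the weighting coefficient `w` (multiplying the current position) is a hypothetical choice, since the exact form of the modified equations is not reproduced here; with `w = 1.0` the sketch reduces to the original BOA updates. The sensory modality `s` is kept constant for simplicity, and the objective must be non-negative because the fragrance raises it to the power `a`.

```python
import random


def boa_minimize(objective, dim, bounds, n_butterflies=20, t_max=100,
                 s=0.01, a=0.1, p_switch=0.8, w=1.0, seed=0):
    """Minimize a non-negative objective with a butterfly-style search.

    w is a hypothetical weighting coefficient on the current position;
    w = 1.0 gives the standard BOA global and local updates.
    """
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(value):
        return min(hi, max(lo, value))

    pop = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_butterflies)]
    fit = [objective(x) for x in pop]
    best_index = min(range(n_butterflies), key=fit.__getitem__)
    g_b, g_fit = pop[best_index][:], fit[best_index]
    history = [g_fit]

    for _ in range(t_max):
        for j in range(n_butterflies):
            f_r = s * (fit[j] ** a)          # fragrance f_r = s * I_b^a
            r = rng.random()
            if rng.random() < p_switch:      # global search towards g_b
                cand = [clip(w * xj + (r * r * g - xj) * f_r)
                        for xj, g in zip(pop[j], g_b)]
            else:                            # local random walk
                m, k = rng.sample(range(n_butterflies), 2)
                cand = [clip(w * xj + (r * r * xm - xk) * f_r)
                        for xj, xm, xk in zip(pop[j], pop[m], pop[k])]
            cand_fit = objective(cand)
            if cand_fit <= fit[j]:           # greedy acceptance
                pop[j], fit[j] = cand, cand_fit
                if cand_fit < g_fit:
                    g_b, g_fit = cand[:], cand_fit
        history.append(g_fit)
    return g_b, g_fit, history


def sphere(x):
    """Toy non-negative objective standing in for the ELM prediction error."""
    return sum(v * v for v in x)


best_x, best_f, history = boa_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_f)
```

To tune ELM in this way, the objective would map a candidate vector of hidden weights and thresholds to a validation error such as RMSE.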

Experimental results and discussion
This section presents the performance analysis and discussion of the developed scheme. The entire implementation was run on a system with 8 GB RAM and an Intel Core i5 CPU at 3.0 GHz. The proposed scheme is implemented in Python 3.8. The dataset used for the experiments is a soil dataset. The performance of the developed approach is evaluated using accuracy and metrics such as RMSE, R², MSE, MedAE, MAE, MSLE, MAPE and EVS.
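For reference, the regression metrics listed above can be computed with the Python standard library alone. The sketch below uses their usual definitions; the sample values are invented and are not results from the paper.

```python
from math import log1p, sqrt
from statistics import mean, median, pvariance


def regression_metrics(y_true, y_pred):
    """Return the regression metrics named above as a dictionary."""
    err = [t - p for t, p in zip(y_true, y_pred)]
    abs_err = [abs(e) for e in err]
    mse = mean([e * e for e in err])
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return {
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": mean(abs_err),
        "MedAE": median(abs_err),
        # MSLE assumes non-negative values; MAPE assumes non-zero targets.
        "MSLE": mean([(log1p(t) - log1p(p)) ** 2
                      for t, p in zip(y_true, y_pred)]),
        "MAPE": mean([abs(e) / abs(t) for e, t in zip(err, y_true)]),
        "R2": 1.0 - sum(e * e for e in err) / ss_tot,
        "EVS": 1.0 - pvariance(err) / pvariance(y_true),
    }


y_true = [3.0, 5.0, 2.5, 7.0]   # illustrative values, not from the paper
y_pred = [2.5, 5.0, 4.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

R² and EVS differ only in that EVS ignores a constant bias in the errors, so EVS is never smaller than R².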

Comparative analysis of aKNCN-ELM-mBOA and existing approaches
The performance of the proposed aKNCN-ELM-mBOA is compared with state-of-the-art techniques such as ELM, artificial neural network (ANN), support vector machine (SVM), gradient boost (GB) and random forest (RF). The simulation is performed on aKNCN-ELM-mBOA and these existing methods using the error metrics to determine the prediction efficiency of each method. The R² and EVS values attained by the proposed aKNCN-ELM-mBOA are higher than those of the other strategies.
Figure 2 represents the accuracy measure for the proposed aKNCN-ELM-mBOA and the existing approaches. It depicts the actual data and the predicted data of the different techniques. With respect to accuracy, a system is considered efficient for crop yield prediction when the actual and the predicted data are nearly the same. From the graph, the proposed aKNCN-ELM-mBOA predicts the result more accurately than the existing methods. Its predictions come close to the actual data, whereas the other techniques do not attain better accuracy. aKNCN-GB achieves lower accuracy than the other strategies. Therefore, the proposed aKNCN-ELM-mBOA is more effective than the existing techniques.
Figure 3 illustrates the error measures for the various approaches. From the graphical representation, the MAE of aKNCN-ELM-mBOA is 0.064, while the MAE values of aKNCN-ELM-BOA, aKNCN-ELM, aKNCN-ANN, aKNCN-SVM, aKNCN-GB and aKNCN-RF are 0.067, 0.097, 0.130, 0.165, 0.231 and 0.293 respectively. Similarly, the proposed model achieves better values of MSLE and MedAE. In general, a prediction system is considered effective when its error is small. On this basis, the proposed model performs better in all the cases.
Table 2 presents the running time of the various approaches: aKNCN-ELM-mBOA, aKNCN-ELM-BOA, aKNCN-ELM, aKNCN-ANN, aKNCN-SVM, aKNCN-GB and aKNCN-RF. Comparing all the approaches, the running time of the proposed aKNCN-ELM-mBOA is found to be 58 ms, so it takes less time to complete the process. Hence the proposed methodology can be used efficiently in IoT based smart agriculture.

Applications of the proposed methodology
Agriculture based on IoT with ML is the next emerging development in smart agriculture and farming. IoT is involved in several real-life applications. In smart agriculture, various sensors can be used to monitor agricultural activities such as plant growth and irrigation. Compared with traditional farming methods, IoT based farming is much more productive. IoT sensor systems need to be simple to use so that farmers can take advantage of them. Figure 4 depicts the applications of the proposed methodology. Some of the applications of the proposed methodology are listed below:
Plant management: The proposed ML model with IoT offers a suitable and controllable environment for growing crops using greenhouse technology.
Livestock monitoring: ML with IoT is used to collect data on the location and health of cattle. This data is further used to identify sick animals. Such monitoring minimizes the labour cost and also prevents diseases from spreading to other animals.
Crop and yield: This ML model can be applied on farm lands on the basis of data gathered over IoT, with yield monitoring data gathered using GPS. It is also used in crop production.
Soil monitoring: The soil data can be obtained using wireless sensor nodes and then given to the proposed ML algorithm for predicting and analysing the properties of the soil and classifying the soil types.
Disease monitoring: Combining this ML model with IoT helps identify and manage diseases on farm lands. The methodology can further be used to select the proper pesticide for protecting crops from infections, and hence it reduces the work of farmers.
Animal monitoring: Tracking animals on farm land is essential, and much research has been carried out on tracking animals with the help of IoT based sensors. The ML model with IoT can address this problem, since the presence of animals can be detected by IoT sensors.

Conclusion
This work focuses on predicting crop yield with a two-tier ML approach consisting of aKNCN and ELM-mBOA. In the first tier, the proposed aKNCN model is used to estimate the soil quality based on the soil nutrients collected by the IoT system. In the second tier, the soil quality score, along with other crop yield related parameters such as temperature and rainfall, is taken as the input of the ELM model to predict the crop yield. The hyperparameter tuning of the ELM prediction model is performed by mBOA to enhance its prediction performance. The proposed system is implemented in Python. A soil dataset is used for performance evaluation of the proposed prediction model. The proposed scheme attains better results than the other classification models on the basis of accuracy, RMSE, R², MSE, MedAE, MAE, MSLE, MAPE and EVS. The RMSE and MAE of aKNCN-ELM-mBOA are found to be 0.301 and 0.064 respectively. At present, ML models overcome the problems of smart agriculture in numerous ways; however, this ML-IoT approach requires a continuous internet connection, which is not always available in developing countries. Further, the model requires a large amount of training data, and planning, building and managing the broad technology behind IoT is very complex. In future, time-series based analysis will be carried out to predict future values. Different parameters such as soil nutrients, soil quality, irrigated area and agricultural points can be used to extend the scope of the research as well as to improve the accuracy of the system. In addition, deep learning based smart agriculture can be used with the IoT system in order to enhance the production quality.