Novel hybrid ARIMA–BiLSTM model for forecasting of rice blast disease outbreaks for sustainable rice production

In recent years, the application of artificial intelligence (AI) in agriculture has become an important research domain. The proposed work focuses on forecasting rice blast disease outbreaks in paddy crops. Disease management in farm fields is difficult for several reasons: first, farmers often lack experience in diagnosing diseases; second, detection depends on experts identifying diseases visually; and third, climatic conditions are often unfavorable. Researchers have recently applied a variety of time-series techniques in different applications. This study brings time-series techniques to agriculture by forecasting rice blast disease outbreaks in the paddy crop of the Davangere region based on daily weather data obtained from KSNDMC. The statistical time-series model ARIMA is trained on real blast disease outbreak data from the Davangere region for the period 2015–2019, while the deep BiLSTM model is trained on real weather data and blast disease outbreaks from the same region. Both models are evaluated using mean squared error and mean absolute error. The proposed hybrid model, ARIMA–BiLSTM, combines the statistical ARIMA model and the deep BiLSTM model: the seasonal component of the rice blast disease outbreak feature is extracted by the additive decomposition used with the ARIMA model and fed as a dependent feature to the BiLSTM model. According to the results, the hybrid approach forecasts blast disease outbreaks in paddy crops with a mean squared error of 0.037 and a mean absolute error of 0.028, lower than the errors of the statistical ARIMA and deep BiLSTM models.


Introduction
Rice is one of the most significant crops in many parts of the world and is a staple food for over half of the world's population [1]. In India, the average diet depends directly on rice. According to the World Economic Forum, rice consumption in most countries is likely to outpace supply [2]. Because sustaining rice production levels is essential to food production, particularly in India, any loss in rice production is considered unacceptable. Compared with other crops [3], rice requires less manpower and mechanization, so the area under rice cultivation has been increasing. Rice blast disease is caused by the fungus Pyricularia grisea [4]. In Karnataka state, blast disease has become a serious problem. Controlling it is difficult because the pathogen evolves rapidly and is region-dependent. The pathogen can rapidly break down varietal resistance; hence, superior cultivars that are resistant to blast disease become vulnerable after widespread sowing for two to three successive seasons [5]. External environmental, or climatic, factors [6] are among the drivers of blast disease. Wind speed, for example, is critical for the spread of the micro-organism, and light intensity affects the infection penetration process.
Author contact: M. Varsha (mvarsha16@gmail.com). Affiliations: 1 Bapuji Institute of Engineering and Technology, Davangere, India; 2 Jawaharlal Nehru National College of Engineering, Shimoga, India; 3 G. M. Institute of Technology, Davangere, India.
Rice cultivators and agricultural scientists depend on forecasting models in the early stages of rice production, because such models help prevent the rapid spread of rice blast disease. Forecast models help farmers and other users decide strategically on the amount and timing of fungicide applications. These forecasting models use machine learning approaches to extract relevant information from datasets with a certain number of features. Earlier prediction and forecasting systems were based on assumptions about the interactions among the disease, the host, and the climate, which are commonly referred to as the disease triangle. Dependable early detection systems would allow the explosive spread of the pathogen to be contained through prompt and effective preventive action. Hence, it is necessary to develop an accurate forecasting model that helps farmers take critical actions before the severity of rice blast disease increases.
Artificial intelligence and machine learning approaches offer new ways to solve existing problems in disease management. Researchers have recently studied this field; however, there remains a large gap in the features selected to train machine learning models, such as features extracted from images and host varieties, and a gap in achieving acceptable forecasting accuracy. Therefore, we propose a novel approach that combines a statistical time-series ARIMA model with a deep BiLSTM model to reduce the error rate and improve accuracy. To evaluate the proposed model, a compiled and authenticated dataset was acquired from the Karnataka State Natural Disaster Monitoring Centre (KSNDMC), Bengaluru. The dataset contains 9130 instances of daily weather data, of which 7304 are used for training and 1826 for testing. We obtained a lower error rate and better accuracy than the standalone statistical and deep learning models.
The main contribution of this paper is a novel hybrid ARIMA-BiLSTM model for forecasting rice blast disease in the Davangere region. The proposed method is a supervised approach that combines the statistical ARIMA model and the deep BiLSTM model. Moreover, this study investigates the effectiveness of the hybrid model for building a rice blast disease forecasting system by considering performance evaluation metrics. The evaluation results indicate that the proposed model performs better than the other deep learning models in terms of error values.
The remainder of this article is structured as follows. Section 2 reviews the significant literature on disease prediction and the forecasting of rice blast disease outbreaks. Section 3 presents the technical details of the proposed methodology. Section 4 reports comprehensive experiments, a comparative analysis, and a performance evaluation, followed by future work. Conclusions are stated in Sect. 5.

Related work
Because of the severe impact of rice blast on paddy cultivation, several researchers have devised predictive models for the early detection of rice blast. Researchers [7] examined 52 rice blast forecasting methods and determined that the parameters used were ambient temperature (T, 67.3%), humidity (RH, 57.7%), rainfall (55.8%), leaf temperature (34.6%), sunshine (30.8%), wind velocity (30.8%), and dew point (15.4%). The variables most frequently combined were minimum and maximum temperature (T) and humidity (RH). Furthermore, disease emergence was found to be positively related to surface temperature under the phase structure of the rice blast disease. Abe [8] reported that rice blast occurrence was lowest when the minimum temperature was 27.8 °C and highest when the minimum temperature was 20 °C. According to [9], resistance increases as both atmospheric and soil temperatures increase. These forecasting models may be employed to determine which periods are favorable for the disease and whether fungicide use is cost-effective in those scenarios. Several nations have created both empirical and explanatory simulation models for rice blast forecasting using regression analysis [10]. However, because climate change has a significant impact on the rice field atmosphere, traditional forecasting models may lose predictive accuracy [11].
In recent decades, very few researchers have treated the rice blast framework as a dynamic system and developed rice blast prediction models using machine learning (ML), particularly artificial neural networks (ANNs), which are considered intelligent problem-solving methodologies. In [12], weekly weather data were collected in India to develop a cross-location and cross-year forecasting model using a neural network, a regression approach, and a support vector machine. In the feature analysis, rainfall was found to be the most influential parameter for disease occurrence. Kim et al. [13] developed early rice blast occurrence prediction models for four Korean regions using a long short-term memory (LSTM) network. Climatic data, such as temperature, humidity, and sunlight, were collected in June and July from 2003 to 2016, and the LSTM model achieved an accuracy of about 79%. Nettleton [14] compared process-based models such as WARM and YOSHINO with machine learning models such as M5Rules and RNN. The results showed that the LSTM model obtained an accuracy of 70%.
Compared with the conventional regression (REG) model, the above ML-based rice blast forecasting frameworks, such as BPNN, GRNN, SVM, and RNN, have produced more accurate forecasts based on climate factors such as temperature, relative humidity, and sunlight. However, rainfall has been found to affect the development and spread of the blast pathogen, and blast occurrence is known to be related to heavy rainfall, which has yet to be demonstrated in applied research. Therefore, this study examines rainfall as one of the factors in developing the prediction model.
Many researchers have used weather data as input features for their models, because it has been shown experimentally to help achieve higher predictive accuracy. Such models, however, have proven efficient mainly for forecasting short-term time-series. This study investigates the performance of the hybrid ARIMA-BiLSTM model against the statistical ARIMA and deep BiLSTM models for different kinds of weather data in forecasting rice blast disease in the Davangere region.

Proposed framework
In the proposed study, forecasting models for blast disease outbreaks are designed using a statistical time-series model, deep BiLSTM model, and novel hybrid ARIMA-BiLSTM model. The diagrammatic representation of the proposed model is shown in Fig. 1.

Time-series dataset
The selection of a dataset for disease management in precision agriculture is very important, because the performance of the models depends on the accuracy of the dataset. Since no suitable dataset is publicly available, this work uses a dataset of real field observations. To forecast rice blast disease, it is important to understand the relationship between climate parameters and the occurrence of the disease. The literature survey shows that the computerized forecasting model EPIBLA [15] was developed in India to analyze rice cultivars such as IR50 and IR20 for the prediction of disease incidence; this simulation model suggests that minimum temperature, higher rainfall, and maximum relative humidity influence disease incidence. Another experiment [16] was conducted in the Kangra District of Himachal Pradesh to find the weather features that determine rice blast disease severity. In that experiment, minimum temperature and high relative humidity influenced disease development. Many experiments have been conducted in different countries [17][18][19][20] to identify the most important weather parameters that influence rice blast disease occurrence and progression. The results indicate that minimum and maximum temperature, relative humidity, and rainfall are the important factors that influence disease occurrence, and that wind speed is a further factor that influences disease progress. Because the proposed work focuses on forecasting rice blast disease occurrence, the climate dataset was collected for the Davangere district (Karnataka, India), which is considered a high rice-producing district in the state of Karnataka. The compiled and authenticated data were acquired from the Karnataka State Natural Disaster Monitoring Centre (KSNDMC), Bengaluru, which operates robust telemetric weather stations across the Davangere district.
The dataset contains 9130 instances of daily weather data from 2015 to 2019. An extensive literature survey was conducted to decide which weather features are important for rice blast disease occurrence in the Davangere district. The survey reveals that minimum temperature, maximum temperature, relative humidity, and rainfall are the important features that determine the occurrence of rice blast disease in a specific region. Hence, the proposed dataset has seven input climate attributes and one numerical outcome variable. The input attributes are minimum temperature, maximum temperature, temperature difference, maximum relative humidity, minimum relative humidity, humidity difference, and rainfall.
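As a rough illustration of the seven input attributes listed above, the following sketch derives the two difference features from daily minima and maxima. The field names (`t_min`, `rh_max`, and so on) and the sample values are hypothetical, since the dataset schema is not published here, and it is an assumption that each difference is computed as maximum minus minimum.

```python
def build_feature_row(t_min, t_max, rh_min, rh_max, rainfall):
    """Return the seven climate attributes for one day as a dict.

    Assumption: both difference features are max - min; names are illustrative.
    """
    return {
        "t_min": t_min,
        "t_max": t_max,
        "t_diff": t_max - t_min,
        "rh_max": rh_max,
        "rh_min": rh_min,
        "rh_diff": rh_max - rh_min,
        "rainfall": rainfall,
    }


# Made-up example day (not taken from the KSNDMC data).
row = build_feature_row(t_min=21.5, t_max=31.0, rh_min=48.0, rh_max=92.0, rainfall=4.2)
print(row)
```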
Rice blast disease occurrence in the study area is influenced by the aforementioned climate variables. However, the dataset poses a considerable challenge for researchers, who must extract subtle information from it to improve the accuracy and effectiveness of the statistical, deep learning, and hybrid models.

ARIMA statistical time-series model
The statistical approaches used to analyze the time-series data and construct the ARIMA model are presented in the following subsections.

ADF test
The Augmented Dickey-Fuller (ADF) test is performed to test the null hypothesis that a unit root exists in the time-series samples [21]. The Dickey-Fuller test evaluates whether or not a time-series follows a random walk, as given by Eq. (1). A stationary time-series [22] has no trend or seasonal influences, which makes prediction easier. The augmented Dickey-Fuller test extends the Dickey-Fuller test, as shown in Eq. (2), by allowing higher-order lagged terms of the form $y_{t-p}$, where $1 \le p < t$. Under the null hypothesis of the ADF test, the p value is examined to determine whether the acquired time-series data are stationary. If the p value is less than 0.05, the null hypothesis is rejected and the data are said to be stationary.
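To make the unit-root idea concrete, the sketch below computes the basic Dickey-Fuller statistic for the simplest regression, $\Delta y_t = \gamma y_{t-1} + e_t$, with no constant, no trend, and no augmentation lags. This is a simplification of the ADF test described above, not the procedure used in the study: it compares the statistic against the approximate 5% critical value of -1.95 for this no-constant case instead of computing a p value. In practice a library routine such as `adfuller` in `statsmodels.tsa.stattools` would normally be used. The two test series are synthetic.

```python
import math
import random


def dickey_fuller_stat(y):
    """t-statistic of gamma in  dy_t = gamma * y_{t-1} + e_t  (no constant, no lags)."""
    lagged = y[:-1]
    diffs = [b - a for a, b in zip(y[:-1], y[1:])]
    sxx = sum(v * v for v in lagged)
    gamma = sum(l * d for l, d in zip(lagged, diffs)) / sxx
    residuals = [d - gamma * l for l, d in zip(lagged, diffs)]
    sigma2 = sum(r * r for r in residuals) / (len(diffs) - 1)
    return gamma / math.sqrt(sigma2 / sxx)


# Approximate 5% critical value for the no-constant Dickey-Fuller regression.
DF_CRITICAL_5PCT = -1.95

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]   # stationary white noise
trend = [float(t) for t in range(1, 201)]           # steadily trending series

for name, series in (("noise", noise), ("trend", trend)):
    stat = dickey_fuller_stat(series)
    verdict = "stationary" if stat < DF_CRITICAL_5PCT else "unit root not rejected"
    print(f"{name}: statistic={stat:.2f} -> {verdict}")
```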

Seasonal decomposition
The statistical process of decomposing a time-series into trend, seasonal, and residual components is known as time-series decomposition. The trend is the general movement of the data over time. The seasonal component is the behavior of the data observed within individual seasonal periods. Finally, the residual is the part of the data that is not explained by the trend and seasonal components.
There are two techniques for the decomposition of time-series data, namely, additive decomposition and multiplicative decomposition.
A time-series that follows an additive or a multiplicative model is described mathematically in Eqs. (3) and (4):

$$y_t = s_t + c_t + \tau_t + \epsilon_t \tag{3}$$

$$y_t = s_t \times c_t \times \tau_t \times \epsilon_t \tag{4}$$

where $c_t$, $s_t$, $\epsilon_t$, and $\tau_t$ are the cyclical, trend, irregular (noise), and seasonal components, respectively.
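The additive case can be sketched with a classical moving-average decomposition: the trend is estimated with a centered moving average, the seasonal component is the average detrended value at each position in the period, and the residual is what remains. This is a minimal stdlib sketch on synthetic data, not the routine used in the study; a library function such as `seasonal_decompose` in `statsmodels.tsa.seasonal` offers the same kind of decomposition. The cyclical component is folded into the trend here.

```python
def additive_decompose(y, period):
    """Classical additive decomposition: returns (trend, seasonal, residual).

    Trend and residual are None at the edges where the centered window does not fit.
    """
    n = len(y)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = y[t - half : t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # 2 x m moving average: the two end points get half weight.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period

    # Average the detrended values for each position within the period.
    sums = [0.0] * period
    counts = [0] * period
    for t in range(n):
        if trend[t] is not None:
            sums[t % period] += y[t] - trend[t]
            counts[t % period] += 1
    means = [s / c for s, c in zip(sums, counts)]
    offset = sum(means) / period  # center the seasonal effects on zero
    seasonal = [means[t % period] - offset for t in range(n)]

    residual = [
        None if trend[t] is None else y[t] - trend[t] - seasonal[t] for t in range(n)
    ]
    return trend, seasonal, residual


# Synthetic series: linear trend plus a zero-mean pattern that repeats every 4 steps.
pattern = [2.0, -1.0, 0.0, -1.0]
series = [0.5 * t + pattern[t % 4] for t in range(40)]
trend, seasonal, residual = additive_decompose(series, period=4)
print(seasonal[:4])
```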

Autocorrelation and partial autocorrelation
The correlation between two values of a time-series is known as autocorrelation. The autocorrelation function (ACF) is used to check the correlation of time-series attributes, and lags (the number of intervals between two time points) describe the relationship between observations. The observations $y_t$ and $y_{t-k}$ are separated by $k$ units of time, where $k$ denotes the lag. Depending on the nature of the data, the lag may be measured in days, weeks, or years. When $k = 1$, adjacent observations are compared. The autocorrelation function at lag $k$ is given in Eq. (5). The partial autocorrelation function (PACF) is similar to the ACF, but it shows the correlation between two data points that is not explained by shorter lags. For example, the partial autocorrelation at lag 3 is the association that is not explained by lag 1 and lag 2. In other words, the partial autocorrelation is the distinct correlation between two observations at each lag. The ACF aids in determining the properties of time-series data, whereas the PACF is useful in the specification phase of the regression model.
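The lag-$k$ autocorrelation can be illustrated with the usual sample estimator, which divides the lagged covariance around the series mean by the total variance. The formula below is the standard textbook form and is assumed, not copied, since Eq. (5) itself is not reproduced in this text; the alternating series is synthetic.

```python
def autocorrelation(y, k):
    """Sample autocorrelation at lag k:
    sum_{t=k}^{n-1} (y_t - mean)(y_{t-k} - mean) / sum_t (y_t - mean)^2
    """
    n = len(y)
    mean = sum(y) / n
    denominator = sum((v - mean) ** 2 for v in y)
    numerator = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n))
    return numerator / denominator


# A series that flips sign every step is negatively correlated at odd lags
# and positively correlated at even lags.
alternating = [1.0, -1.0] * 5
for lag in range(3):
    print(lag, autocorrelation(alternating, lag))
```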

ARIMA model for blast disease outbreaks
The ARMA algorithms stated in Sect. 3.2 can only be used with stationary time-series data. In practice, most time-series, such as socioeconomic and industrial data, exhibit non-stationary behavior. Time-series data that include trend and seasonal patterns are called non-stationary. As a result, ARMA models are insufficient for adequately describing non-stationary time-series. Thus, the ARIMA model is used, which is an extension of the ARMA model that handles non-stationary time-series data.
The Auto-Regressive Integrated Moving Average (ARIMA) model was proposed by Box and Jenkins. It is an iterative and successful method for analyzing time-series data. An ARIMA model expresses future values as a linear combination of previous values and past errors. It is a strong traditional method that combines the Auto-Regressive and Moving Average models through a differencing process that renders the series stationary. A non-stationary time-series can be converted to a stationary one by applying a finite number of differencing operations to the time-series samples.
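The differencing step that makes a series stationary can be shown in a few lines. This is a generic sketch on toy numbers; the function name `difference` is illustrative.

```python
def difference(series, d=1):
    """Apply first-order differencing d times; d=0 returns the series unchanged."""
    for _ in range(d):
        series = [b - a for a, b in zip(series[:-1], series[1:])]
    return series


linear = [3, 5, 7, 9, 11]
quadratic = [t * t for t in range(6)]

print(difference(linear))          # one difference removes a linear trend
print(difference(quadratic, d=2))  # two differences remove a quadratic trend
```

Each pass shortens the series by one value, and the number of passes corresponds to the order d in ARIMA(p, d, q).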
Auto-regression uses a regression equation to predict the value at the next time step from the values at prior time steps. To render the sequence stationary, the model employs differencing of the raw observations; differencing is a method for turning a non-stationary time-series into a stationary one. A pure auto-regressive model is one in which $X_t$ depends solely on its own lags, as described in Eq. (6), where $X_{t-1}$ is the first lag of the series, $\beta_1$ is the lag-1 coefficient estimated by the model, and $\alpha$ is the intercept.
A pure moving average model, on the other hand, is one in which $X_t$ is determined by the lagged forecast errors, as described in Eq. (7). The error terms are the errors of the auto-regressive models at the respective lags; $\epsilon_t$ and $\epsilon_{t-1}$ are the errors from Eqs. (8) and (9). Combining the auto-regressive and moving average models gives Eq. (10), which represents the ARIMA model and is a stationary time-series model. The usual notation is ARIMA(p, d, q), where the bracketed parameters are replaced with integer values that specify the model: p is the number of lag observations in the model; d is the number of differencing operations required to convert non-stationary data into stationary time-series data; and q is the size of the moving average window. The ARIMA orders can be estimated to some extent from three things: the time-series plot, the autocorrelation function (ACF), and the partial autocorrelation function (PACF). The experimental results of these are presented in Sect. 4.
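For reference, the auto-regressive, moving average, and combined models referred to in Eqs. (6), (7), and (10) are commonly written as below. These are the standard textbook forms, given here as an assumption about what the numbered equations contain: $\alpha$ and $\beta_i$ follow the symbols named in the text, while $\phi_j$ for the moving average coefficients and $\epsilon_t$ for the error term are assumed notation. For an ARIMA(p, d, q) model, $X_t$ stands for the series after d differencing operations.

```latex
% Pure auto-regressive model of order p
X_t = \alpha + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \dots + \beta_p X_{t-p} + \epsilon_t

% Pure moving average model of order q
X_t = \alpha + \epsilon_t + \phi_1 \epsilon_{t-1} + \phi_2 \epsilon_{t-2} + \dots + \phi_q \epsilon_{t-q}

% Combined auto-regressive moving average model
X_t = \alpha + \beta_1 X_{t-1} + \dots + \beta_p X_{t-p}
      + \epsilon_t + \phi_1 \epsilon_{t-1} + \dots + \phi_q \epsilon_{t-q}
```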
Algorithm 1 provides the pseudo-code of the ARIMA model used to forecast blast disease outbreaks. Historical blast disease score data for 5 years in the Davangere region are given as input to the algorithm. The performance of the statistical model is evaluated in terms of mean absolute error and mean squared error. The main goal of the ARIMA algorithm is to minimize these errors.
Initialization is crucial for machine learning and statistical algorithms. 80% of the dataset is used as training data and the remaining 20% is used for testing. The Augmented Dickey-Fuller test is conducted to find the differencing order d; since the input dataset is stationary, d is set to 0. Based on the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF), p and q are initialized to 3 and 1, as described in line 7 of Algorithm 1. The ARIMA model is fitted with fit() and used with forecast() to forecast the occurrence of rice blast disease for the testing data, and its performance is assessed using the mean squared error and mean absolute error.
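The split, fit, and forecast workflow can be sketched as follows. To stay self-contained, the sketch fits only the auto-regressive part (p = 3, d = 0) by ordinary least squares and omits the moving average term, so it is a simplified stand-in for the ARIMA(3, 0, 1) model described above and not the study's implementation; with statsmodels, the corresponding steps would typically be `ARIMA(train, order=(3, 0, 1)).fit()` followed by `forecast(steps=...)`. The series is synthetic, and all function names are illustrative.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ar(series, p):
    """Least-squares fit of x_t = alpha + beta_1 x_{t-1} + ... + beta_p x_{t-p}."""
    rows = [[1.0] + [series[t - k] for k in range(1, p + 1)] for t in range(p, len(series))]
    targets = series[p:]
    size = p + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    coefficients = solve_linear(xtx, xty)
    return coefficients[0], coefficients[1:]


def forecast_ar(history, alpha, betas, steps):
    """Recursive multi-step forecast: each prediction feeds the next one."""
    values = list(history)
    for _ in range(steps):
        values.append(alpha + sum(b * values[-k] for k, b in enumerate(betas, start=1)))
    return values[len(history):]


# Synthetic stationary AR(3) series (not the blast disease data).
rng = random.Random(42)
series = [2.5, 2.5, 2.5]
for _ in range(3000):
    series.append(1.0 + 0.5 * series[-1] + 0.2 * series[-2] - 0.1 * series[-3] + rng.gauss(0.0, 1.0))

split = int(len(series) * 0.8)           # chronological 80/20 split
train, test = series[:split], series[split:]
alpha, betas = fit_ar(train, p=3)
predictions = forecast_ar(train, alpha, betas, steps=len(test))
print("estimated lag coefficients:", [round(b, 3) for b in betas])
```

The split is chronological, never shuffled, so that the test period always follows the training period.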

Bidirectional LSTM for blast disease outbreaks
Memory cells with regulating gates help LSTMs overcome the long-term dependency and vanishing gradient problems associated with RNNs. Backpropagation is used in the learning process to determine and update the weights. The authors of [23] proposed an LSTM variant in which each layer consists of two LSTM blocks that process the input sequence in opposite directions at the same time. The output at each time step is the concatenation of the outputs of the two LSTM blocks. Figure 2 shows an overview of the BiLSTM architecture used in the proposed study.
We use the BiLSTM model to learn long-term dependencies because it mitigates the vanishing and exploding gradient problems. The input weather features $x_1, x_2, \ldots, x_n$ are fed into a forward LSTM network to generate a sequence of forward hidden states $\overrightarrow{h}_t \in \mathbb{R}^{n \times d_h}$, where $d_h$ is the dimension of the hidden state, as given in Eq. (11). The same sequence $x_1, x_2, \ldots, x_n$ is fed to a backward LSTM network to generate the backward hidden states $\overleftarrow{h}_t \in \mathbb{R}^{n \times d_h}$, as shown in Eq. (12). The final output of the BiLSTM network, $h_t \in \mathbb{R}^{n \times 2d_h}$, is obtained by combining $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ and represents the hidden-state output for the input features, as shown in Eq. (13). Algorithm 2 gives the pseudo-code for fitting the BiLSTM model, forecasting rice blast disease outbreaks, and evaluating the model using the mean squared error and mean absolute error.
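The forward, backward, and combined hidden states referred to in Eqs. (11) to (13) are usually written as follows. This is the standard bidirectional formulation, offered as an assumption about the form of the numbered equations; the operator names $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ and the concatenation notation $[\,\cdot\,;\,\cdot\,]$ are not taken from the text.

```latex
% Forward pass: reads the sequence from x_1 to x_n
\overrightarrow{h}_t = \overrightarrow{\mathrm{LSTM}}\left(x_t, \overrightarrow{h}_{t-1}\right)

% Backward pass: reads the sequence from x_n to x_1
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\left(x_t, \overleftarrow{h}_{t+1}\right)

% Output at step t: concatenation of both directions, of size 2 d_h
h_t = \left[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\right]
```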

Proposed novel hybrid ARIMA-BiLSTM model
So far, Sect. 3.3 has presented two primary components, the trend and the cyclical components, which are critical for analyzing the behavior of time-series data. The first denotes continuous movement, whereas the second denotes periodic fluctuation. Figure 3 depicts the architecture of the hybrid model. The initial phase of the ARIMA model is an additive decomposition, stated as $y_t = s_t + \tau_t + \epsilon_t$, where $s_t$ is the trend, $\tau_t$ is the seasonal component, and $\epsilon_t$ is the noise. $s_t$ and $\epsilon_t$ are retained to reproduce the trend and noise. The seasonal component $\tau_t$ is used in the data sequence of the BiLSTM model to achieve better performance.
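One possible reading of this coupling is that the seasonal component from the additive decomposition is appended to each day's weather features before the sequence is passed to the BiLSTM. The sketch below shows only that data-assembly step under this assumption; the text could also be read as using the seasonal component on the target side, and the function name, column order, and values are hypothetical.

```python
def add_seasonal_feature(weather_rows, seasonal):
    """Append the seasonal value for each day as an extra column.

    Assumption: the seasonal component is used as an additional input feature.
    """
    if len(weather_rows) != len(seasonal):
        raise ValueError("one seasonal value is needed per daily row")
    return [list(row) + [s] for row, s in zip(weather_rows, seasonal)]


# Two made-up days with seven weather attributes each, plus made-up seasonal values.
weather_rows = [
    [21.5, 31.0, 9.5, 92.0, 48.0, 44.0, 4.2],
    [22.0, 30.5, 8.5, 90.0, 50.0, 40.0, 0.0],
]
seasonal = [0.12, -0.05]
hybrid_rows = add_seasonal_feature(weather_rows, seasonal)
print(hybrid_rows[0])
```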

Experimental results
Here, we evaluate two hypotheses: (1) whether the statistical ARIMA and deep BiLSTM models help to decrease the error rate and correctly forecast the occurrence of rice blast disease outbreaks; and (2) whether extracting the seasonal component from the additive decomposition and injecting it as the dependent variable into the BiLSTM model further decreases the error rate.

Performance evaluation metrics
To examine the performance of the statistical ARIMA model, the deep BiLSTM model, and the novel hybrid ARIMA-BiLSTM model, we use the mean absolute error and the mean squared error, as shown in Eqs. (14) and (15).

The Mean Absolute Error (MAE)
The Mean Absolute Error is defined as

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left| e_t \right| \tag{14}$$

where $e_t$ is the difference between the actual and the forecasted value at time $t$, and $n$ is the number of forecasts.

The Mean Squared Error (MSE)
The mathematical definition of this measure is

$$\mathrm{MSE} = \frac{1}{n}\sum_{t=1}^{n} e_t^{2} \tag{15}$$

Table 2 shows the hyperparameter settings used to carry out the experiments. For the experiments, 20% of the training data are held out as a validation dataset, which is used to tune the hyperparameters. To build the model, the hidden layers are set to 64 and 32 neurons and the dropout rate is set to 0.25; to train the model, the batch size is set to 32, the learning rate is set to 0.001, and the training data are three-dimensional. The number of epochs is set to 30.
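Both metrics follow directly from their definitions. The short sketch below computes them on made-up numbers, taking the error as actual minus forecast; the function names are illustrative.

```python
def mean_absolute_error(actual, forecast):
    """Average of the absolute forecast errors."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_squared_error(actual, forecast):
    """Average of the squared forecast errors."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)


actual = [0.0, 1.0, 2.0, 3.0]
forecast = [0.5, 1.0, 1.0, 3.5]
print(mean_absolute_error(actual, forecast))  # 0.5
print(mean_squared_error(actual, forecast))   # 0.375
```

Because the errors are squared, MSE penalizes a few large misses more heavily than MAE does, which is why the two are reported together.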

Experimental results
The experimental results of the statistical ARIMA model, the deep BiLSTM model, and the novel hybrid ARIMA-BiLSTM model for forecasting rice blast disease outbreaks are organized into three parts: stationarity testing, seasonal decomposition, and the forecasting of blast disease outbreaks. Each part is described in detail below.

Testing stationarity
This section presents the stationarity testing of the data. The Augmented Dickey-Fuller (ADF) test, a statistical unit root test, is used to check for stationarity. The unit root test determines how strongly a time-series is influenced by a trend. From Table 3, it is observed that the p value is less than 0.05; hence, the null hypothesis is rejected. Since the data are found to be stationary, no differencing is applied.

Seasonal decompose
The seasonal_decompose() function in Python is used to estimate the trend and seasonal components of the time-series data, which may be modeled using an additive or a multiplicative decomposition. The occurrence of the blast disease feature in the dataset is seasonal, reaching a peak during the Kharif season (July-November) of crop cultivation and a trough every summer. The estimated residual and trend components of the data sequences are shown in Figs. 4 and 5. The seasonal pattern of the blast disease data sequence is identical from the beginning of the series to the end. The seasonal values recovered from the additive seasonal decomposition are used to build the novel ARIMA-BiLSTM model (see Table 4).
We used the open-source Keras deep learning package with TensorFlow as the backend to develop the proposed BiLSTM and novel hybrid ARIMA-BiLSTM models in Python. The dataset is first screened for null values. The data are then normalized to the range 0 to 1 using min-max normalization.
For forecasting future values with the ARIMA model, three values are important: p, d, and q. Because the features in the dataset are stationary, d is set to 0. The values of p and q are selected from the autocorrelation and partial autocorrelation plots. Figure 6 shows that after lag 3, successive values cut off and stay near 0; hence, p is set to 3. Similarly, Fig. 7 shows that the values from lag 1 to lag 25 are almost the same, so q is set to 1.
The forecast results are shown in Fig. 10, where the portion shaded in green represents the training data, the portion shaded in red represents the forecasted values, and the portion shaded in blue represents the actual values. The forecasted values are neither equal nor close to the actual values. As shown in Figs. 12 and 13, the hybrid ARIMA-BiLSTM model has the lowest error values, and its forecasted values are almost equal to the actual values. Thus, the forecast values obtained from the novel hybrid ARIMA-BiLSTM model serve as an indicator of upcoming blast disease outbreaks in rice crops.
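The min-max normalization step, together with the reshaping of daily rows into the three-dimensional input (samples, time steps, features) that recurrent layers expect, can be sketched as below. The window length of 2 and all values are hypothetical, since the number of time steps per sample is not stated in the text; in practice the nested lists would be converted to arrays before being passed to Keras.

```python
def min_max_scale(column):
    """Scale a list of numbers to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def make_windows(rows, targets, timesteps):
    """Build X with shape (samples, timesteps, features) and the matching targets y."""
    X, y = [], []
    for end in range(timesteps, len(rows)):
        X.append([list(r) for r in rows[end - timesteps:end]])
        y.append(targets[end])
    return X, y


# Made-up columns: a temperature, a rainfall amount, and a disease score per day.
temperature = min_max_scale([20.0, 25.0, 30.0, 35.0, 40.0])
rainfall = min_max_scale([0.0, 10.0, 5.0, 0.0, 20.0])
scores = [0, 1, 0, 1, 1]

rows = list(zip(temperature, rainfall))
X, y = make_windows(rows, scores, timesteps=2)
print(len(X), len(X[0]), len(X[0][0]))  # samples, time steps, features
```

To avoid leaking information from the test period, the minimum and maximum would normally be computed on the training portion only and then applied to the test portion.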

Comparative analysis and performance evaluation
The optimal model is found by comparing MAE and MSE values. As indicated in Table 5, the MAE and MSE values of all three models are compared to choose the optimal model for forecasting the occurrence of rice blast disease. The results listed in Table 5 and shown in Fig. 14 also reveal that the novel ARIMA-BiLSTM model is better suited to forecasting the occurrence of rice blast disease than the deep BiLSTM model.

Conclusion
Rice blast disease is spreading rapidly in paddy-growing regions because of varying climate conditions and the agricultural systems adopted in some highly impacted regions of various states, such as West Bengal, Tamil Nadu, Madhya Pradesh, and Karnataka. Accurately forecasting the occurrence of rice blast disease provides governments and agricultural scientists with pertinent information about the expected situation and the measures that need to be imposed. Forecasting information can also motivate the wider farming community to adopt the measures imposed to slow the spread of the disease. In the proposed study, the statistical time-series model ARIMA and the NN-based models BiLSTM and ARIMA-BiLSTM were applied to real weather data to forecast the daily occurrence of rice blast disease in four regions of the Davangere district. This choice is motivated by the capacity of deep learning models to capture nonlinearity and by their flexibility in modeling time-dependent data. The performance of each model was verified in terms of MAE and MSE. The results demonstrate that the novel hybrid ARIMA-BiLSTM model achieves better forecasting performance than the ARIMA and BiLSTM models.

Author contributions
MV collected the dataset, conceived the manuscript, designed and planned the study, wrote the main manuscript text, including all the figures, and prepared all Python programs. BP and MPPK prepared the documentation and methodology part of the article. SB prepared the document as per the journal format and also helped in collecting the data. All authors reviewed the manuscript.

Funding
The authors have no financial or proprietary interests in any material discussed in this article.

Data availability
The authors confirm that the data supporting the findings of this study are available within the article.

Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical approval
Not applicable.