Post-processing is the processing of numerical weather forecast model output, either to obtain more reliable predictions or to obtain predictions in areas the model does not cover. In this research, the goal is to obtain predictions in areas that the model does not cover.
2-1- Material
The material section is divided into the CFSV2 model and the case study, which are presented in the following.
2-1-1- CFSV2 model
Climate Forecast System Version 2 (CFSV2) is a numerical weather prediction model that predicts a wide range of weather variables (Saha et al., 2014). The variables fall into several groups: a) surface and radiative flux variables, b) 3-D pressure-level variables, c) 3-D ocean data variables, and d) 3-D isentropic variables. CFSV2 is an ensemble prediction system executed 16 times per day: four runs produce monthly predictions for the next nine months, three runs produce seasonal forecasts, and nine runs produce 45-day forecasts.
2-1-2- Case study
The research is conducted on CFSV2 precipitation predictions over Iran. CFSV2 provides monthly forecasts; the data used here span 1982 to 2017. The predictions used belong to the surface and radiative flux group, which contains 107 variables; only the 90 variables with numerical values were used as input variables. The output variable is the precipitation observed at weather stations, with observation data from 274 weather stations across Iran. Figure 1 shows CFSV2 precipitation predictions for different regions of Iran compared with precipitation observations.
Each CFSV2 precipitation prediction is for a specific year and month. Because the model is executed several times each day and on different days, there are multiple predictions for each month. Each of these predictions, with its 90 variables, is matched with the observation for the same year and month, producing the dataset for post-processing.
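The matching step described above can be sketched as a join on year, month, and station. This is an illustrative sketch only: the column names and toy values below are hypothetical, not taken from the CFSV2 files.

```python
# Sketch of building the post-processing dataset: each CFSV2 run's
# 90-variable prediction for a given (year, month, station) is matched
# with the observed precipitation for the same year and month.
# Column names here are hypothetical, not from the CFSV2 data.
import pandas as pd

predictions = pd.DataFrame({
    "year":    [2010, 2010, 2011],
    "month":   [1, 1, 2],
    "station": ["S1", "S1", "S1"],
    "var_01":  [3.2, 2.9, 0.4],   # ... up to var_90 in the real data
})
observations = pd.DataFrame({
    "year":    [2010, 2011],
    "month":   [1, 2],
    "station": ["S1", "S1"],
    "obs_precip": [5.1, 0.0],
})

# Inner join: every run of the model for a month is paired with the
# single observation for that month, so multiple rows share one target.
dataset = predictions.merge(observations, on=["year", "month", "station"])
print(len(dataset))  # 3 rows: two runs for 2010-01, one for 2011-02
```

Because several runs exist per month, several training rows share the same observed target, which is exactly the dataset shape the post-processing models are trained on.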
2-2- Methods
In this section, the methods used in the research are described.
2-2-1- Post-processing
Post-processing is performed on numerical weather predictions for different purposes. One purpose arises because some models do not provide predictions in some areas due to scalability limitations; post-processing makes predictions available everywhere. Another goal of post-processing is enhancing the predictions.
2-2-2- Preprocessing methods
In machine learning, preprocessing refers to the tasks performed on data before the learning task (García, Luengo, & Herrera, 2015). Preprocessing makes the data ready for the learning operations. Investigation of the data revealed two main challenges: imbalanced data and missing values. These are detailed next.
2-2-2-1- Imbalanced data
Imbalanced data is an important challenge in machine learning (He & Garcia, 2009). It usually occurs in classification tasks, where one class has far more examples than the others. Imbalance can also occur in regression (Torgo, Ribeiro, Pfahringer, & Branco, 2013), where it means that some output values occur much more frequently than others.
Here, the output variable is the precipitation observed at the weather stations. Investigation showed that most of the observations are zero, so the data are imbalanced. In (Torgo et al., 2013), a preprocessing algorithm based on SMOTE (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) was proposed to handle imbalance in regression, and an R software package implementing it is available (Branco, 2013).
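The imbalance can be illustrated on synthetic data. The sketch below is a crude SMOTE-style stand-in, not the full SmoteR algorithm of Torgo et al.: it only shows the idea of generating synthetic examples of the rare (nonzero-precipitation) cases by interpolating between pairs of them.

```python
# Synthetic illustration of target imbalance: most precipitation
# observations are zero, so rainy cases are under-represented.
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(900), rng.uniform(1, 50, 100)])  # 90% zeros

zero_fraction = np.mean(y == 0)
print(zero_fraction)  # 0.9

# Crude stand-in for SmoteR: create synthetic rare cases by SMOTE-style
# interpolation between random pairs of the nonzero observations.
rare = y[y > 0]
pairs = rng.integers(0, len(rare), size=(800, 2))
t = rng.uniform(0, 1, 800)
synthetic = rare[pairs[:, 0]] + t * (rare[pairs[:, 1]] - rare[pairs[:, 0]])

y_balanced = np.concatenate([y, synthetic])
print(np.mean(y_balanced == 0))  # 0.5 after oversampling
```

The real SmoteR algorithm also interpolates the input features and uses a relevance function over the target range; this sketch only conveys the resampling idea.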
2-2-2-2- Missing values
Missing values are another challenge in machine learning, in which some features lack values due to problems in data acquisition (Lin & Tsai, 2020). Missing values can be handled by different methods. Here, chained equations are used to impute the missing values (van Buuren & Groothuis-Oudshoorn, 2011), using the R software package developed in that work.
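The chained-equations idea can be sketched in Python as well: each incomplete variable is modeled as a regression on the others, iteratively. scikit-learn's experimental IterativeImputer is used here as an analogue of the R "mice" package; this is an illustration, not the implementation used in the research.

```python
# Hedged sketch of chained-equations imputation. The toy second column
# is roughly twice the first, so the imputed value should be near 6.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, np.nan],   # missing value to be imputed
              [4.0, 8.0]])

# Each feature with missing entries is regressed on the other features,
# cycling until the imputations stabilize (the "chained equations" idea).
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # close to 6.0
```

In the real dataset, the 90 CFSV2 input variables would play the role of the columns here.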
2-2-2-3- Feature selection
Feature selection is one of the most important preprocessing tasks in machine learning (Alpaydın, 2010). It aims to reduce the dimensionality of the learning problem. Among the different feature selection methods, a filter method based on the Pearson correlation between each variable and the observation was used here; variables with low correlation were omitted.
As mentioned earlier, the CFSV2 data contain 90 variables. After feature selection, they were reduced to 47, which in turn reduced the training time.
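The correlation filter can be sketched as follows. The data and the 0.3 threshold are illustrative assumptions; the paper does not state the cutoff that reduced the 90 variables to 47.

```python
# Pearson-correlation filter: keep only variables whose absolute
# correlation with the observed precipitation exceeds a threshold.
import numpy as np

rng = np.random.default_rng(0)
n = 200
y = rng.normal(size=n)                       # stand-in for observations
X = np.column_stack([
    y + 0.5 * rng.normal(size=n),            # informative variable
    rng.normal(size=n),                      # pure noise
    -y + 0.5 * rng.normal(size=n),           # informative, negative correlation
])

threshold = 0.3                              # illustrative cutoff
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
selected = np.where(np.abs(corr) > threshold)[0]
print(selected)  # columns 0 and 2 survive; the noise column is dropped
```

Note that the absolute value of the correlation is used, so variables negatively correlated with precipitation are kept as well.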
2-2-3- Regression methods
Numerical weather predictions are usually continuous values, and post-processing aims to map these forecasts to other continuous values. Regression methods, in which the predicted variable is continuous (Alpaydın, 2010), are therefore a suitable mechanism for post-processing. The following sections explain the regression methods used in this research.
2-2-3-1- General Regression Neural Network (GRNN)
GRNN is a memory-based neural network suitable for linear and non-linear regression tasks (Specht, 1991). It consists of three layers: a pattern layer, a summation layer, and an output layer. In the pattern layer, each neuron is a cluster center, and the similarity of the input to each cluster is computed. The summation layer sums the results of the pattern layer, and the output layer gives the final prediction.
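A minimal GRNN sketch follows. Assuming every training sample serves as its own pattern-layer cluster, the network reduces to Gaussian kernel regression: the summation layer computes weighted sums and the output layer their ratio. The smoothing parameter sigma is an illustrative choice.

```python
# Minimal GRNN sketch (in the spirit of Specht, 1991) with each
# training sample as a pattern-layer cluster center.
import numpy as np

def grnn_predict(X_train, y_train, x, sigma=0.5):
    # Pattern layer: Gaussian similarity of the input to each sample.
    d2 = np.sum((X_train - x) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Summation layer: weighted target sum and weight sum.
    # Output layer: their ratio is the final prediction.
    return np.sum(w * y_train) / np.sum(w)

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
pred = grnn_predict(X_train, y_train, np.array([1.5]))
print(pred)  # 1.5 by symmetry of the training points
```

Being memory-based, prediction requires the whole training set at inference time; there is no iterative weight training.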
2-2-3-2- Extreme Learning Machine (ELM)
ELM is a type of neural network in which the hidden-layer weights are not trained and keep random values (Huang, Zhu, & Siew, 2006). ELM can have multiple hidden layers. Only the output-layer weights are trained, which allows them to be estimated from a closed-form equation without the backpropagation algorithm. As a result, ELM trains faster and does not get stuck in local minima. ELM can be used for regression.
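The closed-form training step can be sketched directly: random hidden weights are fixed, and the output weights are obtained by least squares. Network sizes and the toy target here are illustrative assumptions.

```python
# Sketch of a single-hidden-layer ELM: random, untrained hidden weights
# and output weights solved in one least-squares step (no backprop).
import numpy as np

rng = np.random.default_rng(0)

# Toy regression target: y = 2x on [0, 1].
X = rng.uniform(0, 1, size=(100, 1))
y = 2.0 * X[:, 0]

n_hidden = 20
W = rng.normal(size=(1, n_hidden))   # random input->hidden weights (never trained)
b = rng.normal(size=n_hidden)        # random hidden biases

H = np.tanh(X @ W + b)               # hidden-layer activations
beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # output weights via one equation

y_hat = float(np.tanh(np.array([[0.5]]) @ W + b) @ beta)
print(y_hat)  # close to 1.0
```

The single `lstsq` call replaces the entire iterative training loop of a conventional network, which is the source of ELM's speed advantage.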
2-2-3-3- Neural Network (NN)
Neural networks are a popular learning algorithm (McCulloch & Pitts, 1943). Here, a Multi-Layer Perceptron (MLP) is used for regression. The hidden layer has 50 neurons with tangent-sigmoid activation, and the output layer has one neuron with linear activation, which gives the final prediction of the network. Backpropagation is used to train the MLP.
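The architecture described above can be reproduced with scikit-learn as one possible implementation (the paper does not state which software was used); the synthetic data are illustrative.

```python
# Sketch of the MLP described above: one hidden layer of 50 neurons with
# tangent-sigmoid (tanh) activation, a linear output neuron, trained by
# backpropagation-based gradient descent.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
y = X[:, 0] - X[:, 1]                # simple synthetic target

mlp = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))               # training R^2, near 1 on this easy target
```

MLPRegressor's output unit is linear by construction, matching the description of the output layer above.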
2-2-3-4- Binary Regression Tree (BRT)
Binary regression trees are a type of decision tree for regression (Breiman, Friedman, Olshen, & Stone, 1984). In this decision tree, nodes are split based on thresholds on feature values, with the split feature chosen by an impurity criterion (the Gini index for classification; variance or squared-error reduction for regression). The learning function is recursive, and the operation performed at each node is the same. Training stops when there are no more nodes to extend and all leaves hold output values rather than features.
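On a toy dataset the recursive splitting is easy to see: a single threshold separates the two value groups, and each leaf predicts the mean of its samples. scikit-learn's DecisionTreeRegressor is used here as a CART-style stand-in.

```python
# Sketch of a binary regression tree: thresholds on feature values split
# the nodes, chosen by squared-error reduction; leaves predict means.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0.0, 0.0, 0.0, 5.0, 5.0, 5.0])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# One split between 3 and 10 separates the data perfectly, so each
# side of the tree predicts its group's mean.
print(tree.predict([[2.5]]), tree.predict([[11.0]]))  # [0.] [5.]
```

Prediction is therefore piecewise constant, one constant per leaf region.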
2-2-3-5- Random Forest (RF)
Random forest is an ensemble of decision trees combined with the bagging approach (Breiman, 2001). In bagging, each learner gives a prediction or vote, and the final prediction is the majority of the votes, or their average in regression (Kuncheva, 2004). When building each tree, random forest has a special strategy: at each split it considers only a random subset of the attributes. That is where the word "random" comes from.
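Both ingredients — bootstrap sampling of the data and random feature subsets at each split — are exposed as parameters in scikit-learn, used here as an illustrative implementation on synthetic data.

```python
# Sketch of a random forest regressor: trees trained on bootstrap
# samples (bagging), each split drawn from a random feature subset,
# with per-tree predictions averaged for regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=50,
                           max_features="sqrt",  # random attribute subset per split
                           random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training R^2, near 1
```

The `max_features` parameter is exactly the "special strategy" described above: restricting each split to a random subset of attributes decorrelates the trees so their average has lower variance.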
2-2-3-6- Lasso Boosting (LB)
Lasso Boosting is an ensemble method that combines boosting with the Lasso (Zhao & Yu, 2004). It belongs to the large family of learners called "gradient boosting" methods. In boosting, the general idea is to start from a weak learner and enhance it iteratively based on the error at each iteration (Kuncheva, 2004). The Lasso is a regression method with an L1 penalty that yields sparse solutions. In Lasso Boosting, the Lasso is combined with boosting to regularize the training procedure.
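The boosting/Lasso connection can be illustrated with forward stagewise regression: tiny boosting-style steps on one coordinate at a time produce a solution path that closely tracks the Lasso path. This is a simplified sketch of that connection on synthetic data, not the full BLasso algorithm of Zhao and Yu (which also includes backward steps).

```python
# Forward stagewise regression: repeatedly take a tiny step on the
# coefficient most correlated with the current residual. The resulting
# path approximates the Lasso path, linking boosting and the Lasso.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

beta = np.zeros(p)
step = 0.01
for _ in range(1000):
    r = y - X @ beta                      # current residual (the "error")
    corr = X.T @ r                        # alignment of each feature with it
    j = np.argmax(np.abs(corr))           # weak learner: one coordinate
    beta[j] += step * np.sign(corr[j])    # tiny boosting step on it

print(np.round(beta, 1))  # approximately [2., -1., 0., 0., 0.]
```

The irrelevant coefficients stay near zero without ever being explicitly penalized, which is the sparsity behavior the Lasso's L1 penalty enforces.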