Comparing artificial neural network algorithms for prediction of higher heating value for different types of biomass

A new set of software tools for the prediction of the higher heating values (HHV) of arbitrarily chosen biomass species is presented. A comparative qualitative and quantitative analysis of 12 algorithms for training artificial neural networks (ANN) which predict the HHV of biomass from the proximate analysis is given. Fixed carbon, volatile matter and ash percentages were utilized as inputs. Each ANN had the same structure but a different training algorithm (BFGS Quasi Newton, Bayesian Regularization, Conjugate Gradient with Powell/Beale Restarts, Fletcher–Powell Conjugate Gradient, Polak–Ribière Conjugate Gradient, Gradient Descent, Gradient Descent with Momentum, Variable Learning Rate Gradient Descent, Levenberg–Marquardt, One Step Secant, Resilient Backpropagation, Scaled Conjugate Gradient). To ensure an extended applicability of our results to a wide range of different biomass species, the data conditioning was based on diverse experimental data gathered from the literature, 447 samples overall. Out of these, 301 datasets were used for the training, validation and testing by the MathWorks MATLAB Neural Network Fitting Application and by custom-designed codes, and the 146 remaining datasets were used for the independent evaluation of all training algorithms. The HHV predictions of the ANN-based fitting functions were thoroughly tested and intercompared, for which purpose we developed a test suite applying the mean squared error, coefficient of determination, mean Poisson deviance, mean Gamma deviance and the Friedman test. The comparative analysis showed that several algorithms resulted in ANN-based fitting functions whose outputs correlated well with measured values of the HHV. All programming codes are freely downloadable.


Introduction
Biomass is usually either agricultural waste or waste from the food or wood industry. It includes plant pits, shells, seeds, cobs, prunings, stalks, leaves, husks and grass. In the paradigm of a circular economy, it is of high importance for the benefit of the environment to explore the possibilities to use renewable and sustainable sources like biomass as alternative, non-toxic and environmentally friendly energy sources. The main benefit from the transition from fossil fuels to renewable energy sources is the reduction of greenhouse gas emission, the most important goal in sustainable development strategies (Klass 1998). The number of potential plant species proposed for biomass-based energy sources is growing every day (Alagić et al. 2015; Mijailovic et al. 2014; Merckel and Heydenrych 2017; Priya and Setty 2019; Liang et al. 2019; Channiwala and Parikh 2002; Demirbaş and Demirbaş 2004; Krishnan et al. 2018).
Common biomass species for briquette and pellet production like corn, wheat, sunflower, barley, oats, etc., are accompanied by such species as tobacco (Mijailovic et al. 2014). Most of these resources can also be used to produce gaseous fuels like methane. Besides methane production, biomass is also employed in the production of bio-adsorbents, composts, insecticides, land recovery from nicotine, chlorogenic acid and various agents (Klass 1998), as well as in bioremediation (Alagić et al. 2015). The use of agricultural waste is versatile. Besides the mentioned use as an alternative fuel, it may serve other alternative energy sources as well; for instance, cashew apple juice and many other kinds of agricultural waste may be used as a substrate for microbial fuel cells (Priya and Setty 2019). However, the use of biomass as an alternative energy source has its disadvantages. Besides the lack of standardization, the main drawback of alternative fuels based on biomass is their low calorific value.
The main figure of merit for the calorific value of biomass is its higher heating value (HHV), defined as the amount of heat released by a given quantity of fuel. In general, regarding the production of any fuel on an industrial scale, for system design and analysis, and for efficient generation of heat and power, knowledge of the HHV is crucial. Experimental determination of the HHV is the most reliable; however, measuring the HHV is not always an option and there are many efforts to determine HHV in other ways. There is a correlation between HHV and the composition of the fuel. For instance, it has been shown that the fraction of oxygen can be a predictor of the HHV of gaseous, liquid and solid fuels (Merckel and Heydenrych 2017). Because HHV is correlated with the composition of the raw material used as fuel, there are numerous mathematical models for the estimation of calorific values of fuels based on ultimate analysis, proximate analysis, physical composition, chemical composition or structural analysis (Liang et al. 2019; Channiwala and Parikh 2002; Demirbaş and Demirbaş 2004; Krishnan et al. 2018; Cordero et al. 2001; Boumanchar et al. 2019). In the quoted papers, as well as in the references cited therein, it is shown that HHV can be determined based on the proximate analysis, i.e., using the known percentages of the fixed carbon (FC), volatile matter (VM) and ash (ASH). Naturally, these fractions of the residues from the combustion of the raw material must sum to unity (FC + VM + ASH = 1). The process starts with the moisture extraction, and the published results on FC, VM and ASH may actually refer to calculations performed on four terms (FC, VM, ASH and moisture), as described in Conag et al. (2019).
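The reduction from a four-term proximate analysis (including moisture) to the three-term form can be sketched as follows (a Python illustration; the function name and the example percentages are our own, not taken from Conag et al. (2019)):

```python
def to_dry_basis(fc, vm, ash, moisture):
    """Renormalize a four-term proximate analysis (percentages that sum
    to 100 including moisture) to a dry, three-term basis in which
    FC + VM + ASH = 100."""
    dry = fc + vm + ash          # equals 100 - moisture
    scale = 100.0 / dry
    return fc * scale, vm * scale, ash * scale

# Example: 15% moisture removed, remaining components rescaled so that
# the dry-basis triplet sums to 100 (up to floating-point rounding).
fc, vm, ash = to_dry_basis(17.0, 64.0, 4.0, 15.0)
```

The rescaled triplet can then be used directly as an ANN input.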
Either way, the proximate analysis can be performed with relatively simple laboratory equipment, in comparison with the ultimate (elemental) analysis, and hence facilitates fast determination of HHV (Channiwala and Parikh 2002;Demirbaş and Demirbaş 2004;Krishnan et al. 2018;Cordero et al. 2001;Boumanchar et al. 2019;Conag et al. 2019;Wahid et al. 2017;García et al. 2014;Pattanayak et al. 2020;Yu and Chen 2014;Abdulsalam et al. 2020;Dashti et al. 2019;Ghugare et al. 2014;Nhuchhen and Salam 2012;Alaba et al. 2020;Akkaya 2009;Lakovic et al. 2021;Tan et al. 2015;Uzun et al. 2017;Keybondorian et al. 2017;Qian et al. 2018;Samadi et al. 2021). HHV prediction is important for early-stage decisions in the industry as well as for simple and fast estimations in the research on biomass-related alternative energy sources.
Apart from the analytical determination of HHV based on the proximate analysis (Channiwala and Parikh 2002;Demirbaş and Demirbaş 2004;Krishnan et al. 2018;Cordero et al. 2001;Boumanchar et al. 2019;Conag et al. 2019;Wahid et al. 2017;García et al. 2014), an active field of research (Pattanayak et al. 2020;Yu and Chen 2014;Abdulsalam et al. 2020;Dashti et al. 2019;Ghugare et al. 2014;Nhuchhen and Salam 2012;Alaba et al. 2020;Akkaya 2009;Lakovic et al. 2021;Tan et al. 2015;Uzun et al. 2017;Keybondorian et al. 2017;Qian et al. 2018;Samadi et al. 2021) is the numerical characterization of the biomass in terms of its HHV, using artificial neural networks or other soft computing techniques. A nonlinear correlation between the higher heating value and the proximate and ultimate analyses has been proven in the above quoted works. ANNs trained on a relatively small set of examples (25, referring to rice husks) (Yu and Chen 2014) outperformed empirical equations when compared against experimental HHV data. Compared to other soft computing techniques (multilinear regression and gene expression programming), ANNs showed better results in predicting the HHV of hydrothermally carbonized biomass (Abdulsalam et al. 2020). While the predictions of biochar HHV in the quoted reference were calculated using the hydrothermal carbonization temperature, the biomass residence time in the reactor and the composition of biomass as inputs, the majority of empirical and software-assisted correlations for the estimation of biomass fuel HHV reported in the literature were based on proximate, ultimate and chemical analysis.
In the aforementioned literature (Pattanayak et al. 2020;Yu and Chen 2014;Abdulsalam et al. 2020;Dashti et al. 2019;Ghugare et al. 2014;Nhuchhen and Salam 2012;Alaba et al. 2020;Akkaya 2009;Lakovic et al. 2021;Tan et al. 2015;Uzun et al. 2017;Keybondorian et al. 2017;Qian et al. 2018;Samadi et al. 2021), it has been shown that software-assisted solutions, particularly ANNs, often outperform empirical correlations given in analytical form, owing to the ability of ANNs to successfully fit nonlinear behaviors. However, in spite of a significant body of published results on the subject, there is no unique approach that would be justified by a comparative analysis and supported by quantitative test results. It is often the case that a paper shows that the ANN-assisted prediction of HHV based on the proximate analysis is possible, but does not give enough details or tools for other researchers to employ that particular solution.
Here, we explore 12 different algorithms for training ANNs aimed at the prediction of HHV from known results of the proximate analysis. Our objectives are:
(1) to develop generalized models suitable for predicting HHV based on the three-component proximate analysis (FC, VM and ASH) by simultaneously using a larger number of diverse datasets, thus ensuring an extended applicability of our prediction procedure to a wide range of different types of biomass or biomass mixtures;
(2) to perform a qualitative and quantitative comparative analysis of the ANN-assisted prediction models, to support their ranking with different criteria by developing adequate comparative analysis tools, and to test and intercompare these prediction models thoroughly;
(3) to improve the scientific soundness and the reproducibility of results related to the subject, i.e., to provide readers with tools for predicting HHV based on the proximate analysis and with a small test suite for assessing the quality of their fits.
To attain our first objective, we start by performing an exploratory data analysis of 538 raw datasets from the literature, from which we choose and condition 447 datasets, and give the descriptive statistics of the examples selected for further processing. We perform pilot training of ANNs to select the structure of the ANN and the division of the data. Starting with the Neural Network Fitting Application of the MathWorks MATLAB environment, we develop custom codes for training ANNs based on 12 different regression models. We feed our codes with 301 examples and collect the default parameters for the quality assessment: the MSE (mean squared error) and R² (coefficient of determination).
We attain our second objective in two different ways. First, we complement the testing results given in MSE and R² with three additional criteria (mean Poisson deviance, mean Gamma deviance and the Friedman test). We then apply all five criteria to the assessment of ANN-based fitting functions fed with data ''never seen before'': 146 examples randomly selected from the starting pool of data.
We attain our third objective by storing our work on an open online data repository (Jakšić 2021). We give conditioned data in the form of MATLAB workspaces, developed codes, developed ANN-based fitting functions and a small test suite aimed for ranking and assessment of the obtained ANN-based predicting functions.

Methodology
This section briefly introduces ANN-based modeling, the paradigm of machine learning, types of training algorithms and figures of merit of created ANNs that will be used for exemplary data in subsequent sections.
In the context of Tom Mitchell's definition of machine learning (''A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E''), this method is based on the following interpretation.
E: the experience of repeated determination of HHV based on given percentages of FC, VM and ASH, as per the proximate analysis of the biomass.
T: the task of computing one HHV value based on one triplet of input data (FC, VM and ASH).
P: a statistical performance metric based on minimization of the error of the computed value with respect to the true value obtained by experiment.
Typical ANN learning tasks are function approximation, selection and clustering (Sekh et al. 2020a, 2020b). The ANN fitting functions addressed here aim to solve nonlinear regression fitting problems. In supervised learning, an ANN is fed with datasets containing examples of proper input-output pairs. We first explore the datasets from the literature.

Exploratory analysis of data
Our starting dataset is a compilation of examples from Demirbaş and Demirbaş (2004); Krishnan et al. (2018); Cordero et al. (2001); Conag et al. (2019); Pattanayak et al. (2020); Nhuchhen and Salam (2012); Alaba et al. (2020); Lakovic et al. (2021) and Uzun et al. (2017). Out of 538 examples in total, we omitted 88 examples where the proximate analysis was represented by quadruplets (FC, VM, ASH and moisture). A total of 301 examples were used for feeding the ANNs in the process of the development of the predicting functions, for training, validation and testing. The remaining 146 examples were used for examining the behavior of the predicting functions when fed with data ''never seen before.'' The complete utilized datasets, together with their corresponding source references, are available in spreadsheet form upon request from the authors. The datasets used for this work are accessible from Jakšić (2021). The descriptive statistics of the 301 examples used for the model development are given in Table 1, and those of the 146 examples used for subsequent assessment of the quality of fits are given in Table 2. The data in Table 1 are from Demirbaş and Demirbaş (2004); Krishnan et al. (2018) and Nhuchhen and Salam (2012). The data in Table 2 are from Cordero et al. (2001); Alaba et al. (2020) and Lakovic et al. (2021). The datasets in Tables 1 and 2 are from disjoint sources. We later use them for the estimation of the generalization capabilities of the developed predictors.
Graphical interpretation of the data is given in Fig. 1. The marker size is proportional to the numerical value of the higher heating value. All points are located on the plane ASH + FC + VM = 100%. The ASH, FC and VM are not independent, but keeping all three of them as inputs gives better results, as reported in Akkaya (2009).
We now create multiple-input single-output (MISO), feedforward, backpropagation-trained multilayer shallow artificial neural networks, using algorithms with supervised learning (in contrast to deep neural networks, which are predominantly used for image classification). Supervised learning was realized with datasets based on experimental data from the literature, used as examples of proper network behavior (proper network behavior being the combination of the proximate analysis results at the input of the ANN and the corresponding measured HHV value at its output). In all cases, the network was fed (supervised) with the same sets of input/output data.
Forward feed refers to the direction of data processing: the input data feed the first hidden layer, where each neuron is calculated by passing the sum of the weighted input data and a bias through an activation function. Every subsequent hidden layer is fed with the neuron outputs of the previous one, and the output layer is fed with the neuron outputs of the last hidden layer. The obtained output of the ANN is evaluated with regard to the target values in the datasets used for supervision. Backward propagation refers to the error computation throughout the network: errors are calculated with respect to the weights and then used for the optimization of the network performance, which iteratively adjusts the weights throughout the network. The weights are updated after all the inputs in the training set are applied, i.e., the batch training mode was chosen (as opposed to the incremental training mode, where the weights are updated after each individual input triplet is applied).
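The forward pass and a single batch-mode weight update described above can be sketched with NumPy (a minimal illustration of the mechanics, not the MATLAB implementation used in the paper; the layer sizes follow the network described here, 3 inputs, 20 tanh hidden neurons and 1 linear output, while all data values are toy numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs -> 20 hidden neurons -> 1 output, as in the paper's network.
W1, b1 = rng.normal(size=(20, 3)) * 0.1, np.zeros((20, 1))
W2, b2 = rng.normal(size=(1, 20)) * 0.1, np.zeros((1, 1))

def forward(X):
    """Forward feed: weighted sums plus bias through the activations."""
    h = np.tanh(W1 @ X + b1)       # hidden layer, sigmoid-symmetric
    return (W2 @ h + b2), h        # output layer, linear

def batch_update(X, t, lr=1e-3):
    """One batch-mode backpropagation step: errors are accumulated over
    the whole training set before the weights are updated."""
    global W1, b1, W2, b2
    a, h = forward(X)
    e = a - t                      # error w.r.t. the targets
    n = X.shape[1]
    dW2 = e @ h.T / n
    db2 = e.mean(axis=1, keepdims=True)
    dh = (W2.T @ e) * (1 - h**2)   # backpropagate through tanh
    dW1 = dh @ X.T / n
    db1 = dh.mean(axis=1, keepdims=True)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    return float(np.mean(e**2))    # MSE before the update

# Toy batch: columns are (FC, VM, ASH) triplets; targets are HHVs.
X = np.array([[20., 15.], [70., 75.], [10., 10.]])
t = np.array([[18.5, 17.9]])
mse0 = batch_update(X, t)
mse1 = batch_update(X, t)          # the batch MSE decreases
```

Gradient-based training functions differ mainly in how this raw gradient step is replaced by a more sophisticated update rule.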
The quality of the results was evaluated through MSE and regression values. MSE (mean squared error) is the average squared difference between the outputs generated by the MATLAB function and the targets (measured HHV data that correspond to the inputs given to the MATLAB functions). Lower values are obviously better; the algorithm with the minimal MSE was considered the most appropriate. The equation for the calculation of MSE is

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (t_i - a_i)^2 = \frac{1}{N}\sum_{i=1}^{N} e_i^2 \qquad (1)$$

where N is the number of samples (input-output pairs) used for training the network, $t_i$ is the target value (here the measured value of the biomass HHV), $a_i$ is the value calculated by the ANN, and $e_i = t_i - a_i$ is the error, i.e., the difference between the target and the calculated value.
The regression values R measure the correlation between the obtained ANN outputs and the targets: an R value of 1 means a close relationship, 0 a random relationship. R is the square root of the coefficient of determination

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(t_i - a_i)^2}{\sum_{i=1}^{N}(t_i - \bar{t})^2} \qquad (2)$$

where the notation is the same as for the calculation of MSE and $\bar{t}$ is the arithmetic mean of the target values.
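Both figures of merit can be reproduced directly; a minimal Python sketch follows (the helper names and the sample target/output values are ours, for illustration only):

```python
import numpy as np

def mse(t, a):
    """Mean squared error: average squared difference between the
    targets t and the ANN outputs a."""
    e = np.asarray(t, float) - np.asarray(a, float)
    return float(np.mean(e ** 2))

def r_squared(t, a):
    """Coefficient of determination: one minus the ratio of the residual
    sum of squares to the total variance of the targets."""
    t, a = np.asarray(t, float), np.asarray(a, float)
    ss_res = np.sum((t - a) ** 2)
    ss_tot = np.sum((t - t.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative measured HHVs and ANN outputs (not data from the paper).
t = [18.2, 19.5, 17.1, 20.3]
a = [18.0, 19.9, 17.0, 20.1]
r = r_squared(t, a) ** 0.5   # the regression value R for a good fit
```

For a perfect fit the MSE is 0 and R is 1; for a fit no better than predicting the mean target, R approaches 0.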
After training multiple ANNs and performing the comparative analysis, the algorithm with the highest correlation between the ANN output and the target values was considered to be the most appropriate for the estimation of HHV based on the known results of the proximate analysis of the biomass. All networks were cross-validated with the set of new data, previously unknown to them.

The structure of artificial neural networks
For the comparative analysis of different algorithms for training the ANNs, the same network structure (shown in Fig. 2) was used but with different training function for each algorithm.
The ANNs use backpropagation as the learning algorithm. All networks have the same three inputs, the results of the proximate analysis: the percentages of fixed carbon (FC), volatile matter (VM) and ash (ASH).
The networks have one hidden layer with 20 neurons and one output layer with a single neuron, producing one output: the predicted HHV of the biomass. The number of neurons in the ANN is a parameter of interest in the process of choosing the network structure. If it is too small, the generated outputs do not converge well to the desired targets. If it is too large, the network predicts well within the framework of the given training set of data but may not be sufficiently good at predicting the output when fed with new, unknown data. The number of neurons in the hidden layer was chosen here as the one that gave the best results with regard to the outputs of the ANN fed with both known and unknown data.
The neuron values in both the hidden and the output layer are calculated in a similar manner: a sum of the weighted values provided by the previous layer and a bias term is passed through an activation function. Among the common activation functions are the simple linear function and two sigmoid functions (the logistic function and the hyperbolic tangent). These three functions are defined by expressions (3), (4) and (5), respectively:

$$f(n) = n \qquad (3)$$

$$f(n) = \frac{1}{1 + e^{-n}} \qquad (4)$$

$$f(n) = \frac{e^{n} - e^{-n}}{e^{n} + e^{-n}} \qquad (5)$$
In the algorithms investigated in this paper, the following sigmoid function (the sigmoid symmetric transfer function) was used:

$$f(n) = \frac{2}{1 + e^{-2n}} - 1 \qquad (6)$$

The sigmoid activation function is used in the hidden layer and the linear activation function in the output layer. Figure 3 shows the diagram of the training process, which includes loading the collected data, creating and configuring the network, initializing the weights and biases, training and validating the network, and storing the outputs of interest for future network usage. The step ''Random division of dataset'' implies that 70% of the data are intended for the ANN training, 15% serve for the validation and 15% are used for the testing. In supervised learning, the examples in the training set of data show the network the proper input-output pairs.
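The random division step can be sketched as follows (a Python illustration with our own helper name; only the 70%-15%-15% ratios are taken from the workflow above):

```python
import numpy as np

def split_70_15_15(n_examples, seed=0):
    """Randomly divide example indices into 70% training, 15% validation
    and 15% testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(round(0.70 * n_examples))
    n_val = int(round(0.15 * n_examples))
    return (idx[:n_train],                    # training subset
            idx[n_train:n_train + n_val],     # validation subset
            idx[n_train + n_val:])            # testing subset

# Dividing the 301 development examples gives 211/45/45 indices.
train, val, test = split_70_15_15(301)
```

Because the permutation is random, each training run operates on a different particular division, which is why repeated trainings of the same network can yield different results.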

Training algorithms
The weights and biases in the network (as per the structure in Fig. 2), calculated to match the input-output training examples, are validated against the data from the dataset intended for validation. Errors propagate backwards; based on the errors, the parameters of the network are updated, and the process repeats iteratively until the maximum number of iterations is reached or the required prediction quality is achieved. The validation dataset is used in every iteration of the training process. Contrary to that, the dataset intended for testing is used only after the training is done; it is utilized for the final estimation of the goodness of fit and of the quality of the ANN response to input. Before the actual training, we performed a pilot training aimed to help us determine the network structure (data division and the number of nodes in the hidden layer). We performed sweeps over different network parameters and chose the set of parameters that gave the best results: the 70%-15%-15% division and 20 nodes in the hidden layer. The data division was applied to the set of 301 examples whose statistics are given in Table 1 and whose original references are Demirbaş and Demirbaş (2004); Krishnan et al. (2018) and Nhuchhen and Salam (2012). The other dataset of 146 samples, with descriptive statistics given in Table 2 and references of origin Cordero et al. (2001); Alaba et al. (2020) and Lakovic et al. (2021), is used for an independent testing aimed at assessing the generality of the ANN predictors. Excerpts of the datasets are given in subsequent sections, and the complete datasets used in this work are accessible from Jakšić (2021) in the form of MATLAB workspaces.
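A pilot sweep of this kind can be illustrated with scikit-learn as a stand-in for the MATLAB environment used in the paper (synthetic data, a sweep over hidden-layer sizes only, and all names are ours):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for (FC, VM, ASH) -> HHV examples: each row is a
# percentage triplet summing to 100, with a noisy linear target.
X = rng.dirichlet(np.ones(3), size=200) * 100.0
y = 0.2 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0.0, 0.5, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            random_state=0)
best = None
for n_hidden in (5, 10, 20, 40):          # candidate hidden-layer sizes
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation='tanh',
                       max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    val_mse = float(np.mean((net.predict(X_val) - y_val) ** 2))
    if best is None or val_mse < best[1]:
        best = (n_hidden, val_mse)        # keep the best-validating size
```

The structure retained for the final training is then the candidate with the smallest validation error, mirroring the selection of 20 hidden nodes in the paper.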
The step ''Train ANN'' refers to training by using one of the 12 training algorithms available in the MathWorks MATLAB or Octave environment. The short names and acronyms of the training functions are given in Table 3. There is no simple rule for choosing the best training algorithm: some are better or faster at solving regression problems (finding an optimal function approximation) and some show better results in pattern recognition. The speed of a training algorithm depends on the complexity of the network structure, on the complexity of the problem represented by the datasets, on the amount of data in the datasets, etc. The same workflow presented in Fig. 3 produces different results if different training functions are employed. Moreover, due to the nature of the training process itself, the same workflow produces different results if employed with the same training function repeatedly. Multiple trainings can therefore be performed in search of the best possible outcome of each separate algorithm. In this work, shallow learning is performed and the multiple trainings are performed manually.
The step ''Evaluate the fitting function'' can be related to various performance criteria, for instance, to a predefined maximum number of iterations (epochs in the MATLAB notation), or to a specified goal the error should reach (MSE or SSE, the sum squared error). In the MathWorks MATLAB Neural Network Fitting Application, the criteria are also related to validation (validation stop). In this study, the training keeps running until the validation error fails to decrease for six consecutive iterations.
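The validation-stop criterion (training halts once the validation error has failed to decrease for six consecutive iterations) can be sketched generically (Python; the helper name and the error trace are hypothetical):

```python
def train_with_validation_stop(step, max_iter=1000, patience=6):
    """Call the training `step` callback until the validation error has
    not improved for `patience` consecutive iterations, or until
    `max_iter` is reached. `step` returns the current validation error."""
    best, stale, history = float('inf'), 0, []
    for i in range(max_iter):
        val_err = step(i)
        history.append(val_err)
        if val_err < best:
            best, stale = val_err, 0
        else:
            stale += 1
            if stale >= patience:   # validation stop after six failures
                break
    return best, history

# Hypothetical error trace: improves, then stagnates for six iterations,
# so training stops before ever reaching the final 1.0 entry.
trace = [5.0, 4.0, 3.0, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 1.0]
best, history = train_with_validation_stop(lambda i: trace[i])
```

This early stopping guards against overfitting: weights from the iteration with the lowest validation error are the ones retained.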
Special attention was given here to generating the output after training the network. In this work, the output block in Fig. 3 does not refer solely to the results related to network performances. It also refers to the functions that can be used for future ANN usage on new datasets, and to scripts for recreating the ANN functions, which may lead to new versions with better performances.

Input data for training ANNs
Datasets were formed on the basis of literature data on measured biomass HHV characterized by proximate analysis. The final set of data consisted of 318 training input records and the corresponding 318 outputs. All input data used in the process of training, testing and validation of the ANNs were obtained from Demirbaş and Demirbaş (2004), Krishnan et al. (2018) and Nhuchhen and Salam (2012). The conditioned final dataset, together with the custom-programmed standalone MATLAB functions, is freely available from Jakšić (2021).

HHVannPRO: code collection for ANN training and reliability assessment

Our codes, freely available from Jakšić (2021), are grouped into four sets. The first set aims at the preprocessing of the data: the exploratory data analysis of the datasets with the examples, the descriptive statistics of the datasets, the visual analysis of the data, and the pilot training that helps in structuring the network (the number of nodes in the hidden layer) and in dividing the data.
The second set aims at the model development, i.e., the creation of ANN-based fitting functions with respect to 12 different nonlinear regression models: BFGS Quasi Newton, Bayesian Regularization, Conjugate Gradient with Powell/Beale Restarts, Fletcher-Powell Conjugate Gradient, Polak-Ribière Conjugate Gradient, Gradient Descent, Gradient Descent with Momentum, Variable Learning Rate Gradient Descent, Levenberg-Marquardt, One Step Secant, Resilient Backpropagation and Scaled Conjugate Gradient.
The third set aims at the employment of the developed models. In this paper, the aim is to apply the developed predictors to completely unknown data, to assess the capability of the developed models for generalization, and to examine the behavior of the predictors with respect to overfitting.
The fourth set of HHVannPRO is related to the metrics and scoring techniques used for quantifying the quality of the predictions. The metrics for the quantitative analysis of the regression models were based on multiple scoring tools: mean squared error, coefficient of determination, mean Poisson deviance, mean Gamma deviance and the Friedman test. Three of these belong to the Tweedie family (MSE, mean Poisson deviance and mean Gamma deviance). The mean Tweedie deviance is a parametric metric with the power parameter p, a homogeneous function of degree (2 - p), as per the expression

$$D_p = \frac{1}{N}\sum_{i=1}^{N} 2\left(\frac{\max(t_i, 0)^{2-p}}{(1-p)(2-p)} - \frac{t_i\, a_i^{1-p}}{1-p} + \frac{a_i^{2-p}}{2-p}\right) \qquad (7)$$

The notation is the same as before: N is the number of samples, $t_i$ refers to the target values and $a_i$ to the values assumed by the predictor function. If the power parameter p is set to zero, expression (7) reduces to the expression for MSE, which is a mean normal deviance. If p is set to one, expression (7) becomes the mean Poisson deviance, used here in the log form met in practice:

$$D_1 = \frac{1}{N}\sum_{i=1}^{N} 2\left(t_i \ln\frac{t_i}{a_i} - t_i + a_i\right) \qquad (8)$$

If p is set to two, expression (7) becomes the mean Gamma deviance, also used in practice in its log form:

$$D_2 = \frac{1}{N}\sum_{i=1}^{N} 2\left(\ln\frac{a_i}{t_i} + \frac{t_i}{a_i} - 1\right) \qquad (9)$$

As per the above expressions, MSE scales quadratically with the divergence between the target and predicted values, the Poisson deviance scales linearly, and the Gamma deviance is invariant to their scaling: it measures relative errors. As the power p increases, the measure becomes less sensitive to extreme deviations between the target and predicted values. Mindful of the outliers of our predictors, we use the mean Gamma deviance as the measure least sensitive to outliers.
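The three Tweedie-family measures can be sketched in Python (our own helper, written from the standard limiting log forms: for p = 0, 1 and 2 it gives the MSE, the mean Poisson deviance and the mean Gamma deviance, respectively):

```python
import numpy as np

def mean_tweedie_deviance(t, a, p):
    """Mean Tweedie deviance between targets t and predictions a, in its
    limiting forms for p = 0 (MSE), p = 1 (Poisson) and p = 2 (Gamma)."""
    t, a = np.asarray(t, float), np.asarray(a, float)
    if p == 0:
        d = (t - a) ** 2                     # mean normal deviance (MSE)
    elif p == 1:
        d = 2 * (t * np.log(t / a) - t + a)  # Poisson deviance, log form
    elif p == 2:
        d = 2 * (np.log(a / t) + t / a - 1)  # Gamma deviance, log form
    else:
        raise ValueError("only p in {0, 1, 2} is sketched here")
    return float(np.mean(d))

# Illustrative targets and predictions, scored with all three powers.
scores = {p: mean_tweedie_deviance([18.2, 19.5], [18.0, 19.9], p)
          for p in (0, 1, 2)}
```

A useful sanity check of the scale invariance of the Gamma deviance: multiplying both targets and predictions by a common factor leaves its value unchanged, while MSE grows quadratically.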
Friedman's test is a nonparametric test performed to rank all the algorithms in terms of their performance. The basis of Friedman's test is the null hypothesis that there is no variation in the performance of the algorithms; the test detects significant differences between the performances of two or more algorithms. In Friedman's test, as in all the aforementioned measures based on the Tweedie deviance (MSE, mean Poisson deviance, mean Gamma deviance), the algorithm that performs best gets the lowest mean rank, while the algorithm that performs worst gets the highest rank. Contrariwise, for the R² score (the coefficient of determination as per Eq. (2)) the ranking is the opposite: the best possible score is 1.0. However, R² can also be negative, because the prediction can be arbitrarily bad. Contrary to the four previously described measures, this measure is not based solely on the deviations between two different functions; it is related to the variance of the target values, as the denominator of the fraction in (2) implies. This measure is therefore dataset-dependent, and hence R² may not be relevant for comparisons across different datasets.
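A minimal illustration of the test with SciPy follows (the per-dataset error scores of the three hypothetical algorithms are made up, not the paper's measurements):

```python
from scipy.stats import friedmanchisquare

# Error scores of three hypothetical algorithms, each measured on the
# same ten datasets (the blocks of the Friedman test).
base = [0.50, 0.62, 0.41, 0.55, 0.48, 0.70, 0.66, 0.53, 0.44, 0.59]
alg_a = base                          # consistently the best (rank 1)
alg_b = [x + 0.10 for x in base]      # consistently second (rank 2)
alg_c = [x + 0.20 for x in base]      # consistently the worst (rank 3)

stat, p_value = friedmanchisquare(alg_a, alg_b, alg_c)
# A p-value well below 0.05 rejects the null hypothesis of equal
# performance; the best algorithm then has the lowest mean rank.
```

With a perfectly consistent ordering across all blocks, as above, the test statistic is maximal and the null hypothesis is rejected decisively.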
We first analyze the data obtained by the different predictors, and then apply all five tests separately to the data related to the model development as well as to the completely new data.

Results and discussion
The results related to network performances are useful for a quantitative comparative analysis of the utilized training algorithms. Table 8 presents the results related to the mean squared error, and Table 9 the results related to the coefficient of determination. The values are calculated by the MATLAB functions in the course of the model development for all three sets of data separately (training, validation and testing). As per the workflow diagram in Fig. 3, the data division is performed randomly, before the model development, and these datasets are different for each of the 12 algorithms. Their metrics are therefore calculated on different subsets (equal in proportions but with different particular set elements).
It is often convenient to normalize the input data to the (-1, 1) interval in order to avoid overfitting and to ensure dimensional uniformity. Since the input data in our case are of the same order of magnitude, and the network contains only one hidden layer with 20 neurons, normalization of the data was not performed. That is why the MSE values collected in Table 8 reach as high as 278.7862 for some training functions (even higher values were obtained during the training process). The quantitative comparative analysis of all the training functions was made so that each step in the workflow shown in Fig. 3 is the same for all the training processes (loading the same set of collected data, creating and configuring the network with the identical structure for all, etc.). All of the used algorithms are ranked in Table 10. The ranking is based on their comparative analysis as shown in Tables 8 and 9. In terms of MSE calculated on the set of training data, the best performance (minimal MSE) was shown by the Levenberg-Marquardt (LM) training function. For MSE calculated on the sets of validation and testing data, LM ranked second. For R calculated on the set of training data, the LM training function ranked fourth, after BR, SCG and CGB. For R calculated on the set of validation data, the LM training function ranked third, after the BR and BFG training functions, and for R calculated on the set of testing data, the LM training function also ranked third, after the BR and SCG training functions.
These results are in accord with and complement the MathWorks MATLAB documentation about Deep Learning Toolbox on how to choose a multilayer neural network training function (Deep Learning Toolbox Documentation: Choose a Multilayer Neural Network Training Function 2021). The comparative analysis presented in Deep Learning Toolbox Documentation: Choose a Multilayer Neural Network Training Function (2021) focused on the algorithm speed relative to the amount of input data and the network complexity (activation function, number of hidden layers, number of weights and biases, etc.). The recommendations related to the algorithm speed and the memory usage favor LM training function for small datasets, emphasizing that its performances are relatively poor on pattern recognition problems which can be solved faster by using Resilient Backpropagation (RP) algorithm.
Since the determination of HHV based on the proximate analysis of biomass is a regression problem, not related to classification or pattern recognition, the relatively poor result of the training method based on the RP algorithm is actually better than expected.
A recommended training function, besides LM, regarding speed and lower storage usage, is the BFGS Quasi Newton training function (BFG); a caveat is that its implementation gets slower with increasing network complexity. The functions BFG, RP, OSS, GDX, GDM, GD, CGP, CGF and CGB are less common in the literature on the prediction of HHV based on the proximate analysis of biomass. This is probably at least partly due to the fact that they are not part of the built-in application with the graphical user interface in the MATLAB application for training ANNs. Our results point out that the network performances may differ depending on many factors and that it is advisable to explore the training functions alternative to the three most commonly met in practical situations: LM, BR and SCG.
As for these three most commonly used training functions, LM, BR and SCG, the general recommendations are that LM is the first choice for fast solving of function fitting problems (nonlinear regression) on small datasets, and that SCG is favorable for pattern recognition problems because the SCG algorithm is the least memory demanding. Training with SCG stops automatically when generalization stops improving, as indicated by an increase in the mean squared error of the validation samples. The BR algorithm is more time-consuming, but can result in good generalization for difficult, small or noisy datasets. Its training stops according to adaptive weight minimization (regularization).
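The network structure used throughout this study (three inputs, one hidden layer of 20 neurons, one output) can be approximated outside MATLAB. The sketch below uses scikit-learn's `MLPRegressor` on synthetic stand-in data; the `lbfgs` solver is a quasi-Newton method loosely comparable to BFG, since scikit-learn provides no Levenberg-Marquardt or Bayesian Regularization solver. The data and coefficients are invented for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in for (FC, VM, ASH) triplets and HHV targets;
# real inputs would come from proximate-analysis measurements.
X = rng.uniform(0, 80, size=(301, 3))
y = 0.35 * X[:, 0] + 0.18 * X[:, 1] - 0.02 * X[:, 2] + rng.normal(0, 0.5, 301)

# One hidden layer of 20 neurons, mirroring the structure in the text.
# 'lbfgs' is a quasi-Newton solver, not MATLAB's trainlm or trainbr.
model = MLPRegressor(hidden_layer_sizes=(20,), solver="lbfgs",
                     max_iter=2000, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data
```

A faithful reproduction of the paper's results still requires the published MATLAB codes, since the training functions compared here are specific to the Deep Learning Toolbox.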
The quantitative comparative analysis of the performances of the 12 networks, identical in all aspects except the training function, resulted in the following conclusions. For the dataset used to train all 12 networks, the best performance was shown by the Bayesian Regularization training function in terms of R in general (calculated for the training, validation and testing data) and in terms of MSE calculated for the validation and testing data. The LM training function had the smallest MSE on the training dataset.
However, the above discussion is based on just two criteria, namely MSE and the square root of the coefficient of determination. Their values are calculated on subsets of the dataset used for the model development, and these subsets differ between algorithms: the division into training, validation and testing data uses the same proportions but is performed randomly, so the particular subsets used for the development of each model are different. To further examine the conclusions of the comparative analysis, we performed additional tests, calculating MSE, mean Poisson deviance, mean Gamma deviance and the Friedman test on the same set of data (all 301 samples, undivided into subsets). The results of the tests are given in Table 11, the corresponding ranking in Table 12, and a visual interpretation of the ranking in Fig. 6.
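The additional tests can be computed with standard library functions, as in the following sketch. The measured values and model predictions are synthetic placeholders for the real 301-sample dataset; the metric calls themselves (scikit-learn's deviance metrics and SciPy's Friedman test) are the ones a Python reimplementation of this test suite would likely use.

```python
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.metrics import (mean_squared_error, r2_score,
                             mean_poisson_deviance, mean_gamma_deviance)

rng = np.random.default_rng(1)
# Hypothetical measured HHV values (MJ/kg) and three models' predictions
y_true = rng.uniform(14, 22, size=301)
preds = {name: y_true + rng.normal(0, s, y_true.size)
         for name, s in [("LM", 0.4), ("BR", 0.5), ("SCG", 0.7)]}

for name, y_pred in preds.items():
    print(name,
          mean_squared_error(y_true, y_pred),
          r2_score(y_true, y_pred),
          mean_poisson_deviance(y_true, y_pred),  # requires positive values
          mean_gamma_deviance(y_true, y_pred))

# Friedman test: do the models' absolute errors differ systematically?
stat, p = friedmanchisquare(*(np.abs(y_true - yp) for yp in preds.values()))
print(f"Friedman chi2={stat:.2f}, p={p:.3g}")
```

Note that the Poisson and Gamma deviances require strictly positive predictions, which holds for HHV values expressed in MJ/kg.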

Generalization capabilities
A very important figure of merit of any ANN is its ability to generalize the knowledge acquired in the training process to new, completely unknown data.
The developed ANN fitting functions were fed with a completely new set of input data in order to estimate the ability of the obtained networks to generalize the acquired knowledge. This new dataset consisted of 146 examples taken from the literature reporting experimental values of FC, VM, ASH and HHV (Cordero et al. 2001; Alaba et al. 2020; Lakovic et al. 2021). The complete dataset is available upon request, and a chosen subset is shown in Table 13.
The results presented in Fig. 7 show the HHV values calculated by the developed predicting functions for the 146 input triplets (FC, VM and ASH values), plotted against the corresponding measured values of HHV. The solid red line represents the ideal response; symbols represent the predicted data.
The visual analysis of the results shown in Fig. 7 favors LM, SCG and BR as the models that showed good generalization capabilities and reliably predicted the HHV of completely new input triplets corresponding to various types of biomass. To further examine these visual observations, we performed additional tests, calculating MSE, mean Poisson deviance, mean Gamma deviance and the Friedman test on the new set of data (all 146 samples, undivided into subsets). The results of the tests are given in Table 14, the corresponding ranking in Table 15, and a visual interpretation of the ranking in Fig. 8.
The graphical presentation of HHV predicted by the ANN-based fitting functions, shown in Fig. 7, is supported by the results of the five tests shown in Tables 14 and 15 and in Fig. 8, which quantitatively justify the choice of the favorable training methods.
The ranking results of the different tests are not identical but agree very well, as shown in Figs. 6 and 8. The best ranked algorithm in Fig. 6 (tests on the dataset used for the model development) is BR, based on the Bayesian Regularization regression model. In Fig. 8 (tests on a dataset unseen during the model development), it was outperformed by the LM algorithm (based on the Levenberg-Marquardt regression model), which implies better generalization capabilities of the LM model over the BR model. Overall, the best three predictions were obtained by LM, BR and SCG. The developed models can be used for the HHV prediction of a very diverse range of biomass types. The only precaution is related to the input data: the proximate analysis is sometimes given in the form of quadruplets (moisture, ASH, FC and VM), as in the example of raw and torrefied sugarcane residues in Conag et al. (2019). Samples of torrefied biomass have higher HHV than raw biomass, and the ANN training performed for this study did not include such examples.
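The combination of per-test rankings into an overall ordering, as done in Tables 12 and 15, can be sketched as an average-rank aggregation. The metric values below are invented placeholders, not the paper's results; only the ranking mechanics are illustrated.

```python
import numpy as np

# Hypothetical metric values (lower is better) for three tests and
# four algorithms; in the paper these come from Tables 11 and 14.
algos = ["LM", "BR", "SCG", "RP"]
metrics = np.array([
    [0.95, 0.90, 1.10, 2.40],     # MSE
    [0.05, 0.04, 0.06, 0.12],     # mean Poisson deviance
    [0.003, 0.002, 0.004, 0.009]  # mean Gamma deviance
])

# Rank algorithms within each test (1 = best), then average the ranks.
ranks = metrics.argsort(axis=1).argsort(axis=1) + 1
avg_rank = ranks.mean(axis=0)
order = [algos[i] for i in np.argsort(avg_rank)]
print(order)  # -> ['BR', 'LM', 'SCG', 'RP']
```

The double `argsort` converts metric values into within-row ranks; averaging the rows then yields one consensus ordering across all tests.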

Conclusion
In this work, we presented a comparative qualitative and quantitative analysis of 12 algorithms for training artificial neural networks which predict the higher heating value (HHV) of arbitrary biomass species/biomass mixtures using the proximate analysis based on the percentages of the fixed carbon, volatile matter and ash.
Contrary to a lot of published works which focus on either a single type of biomass or a small number of similar ones, we applied a generalized approach by simultaneously using a larger number of diverse datasets. In this way, we ensured an extended applicability of our prediction procedure to a wide range of different biomass species and biomass mixtures.
The prediction models developed in this study were based on 12 different training functions: Levenberg-Marquardt, Scaled Conjugate Gradient, BFGS Quasi Newton, Bayesian Regularization, Resilient Backpropagation, Variable Learning Rate Gradient Descent, Fletcher-Powell Conjugate Gradient, Polak-Ribiére Conjugate Gradient, One Step Secant, Conjugate Gradient with Powell/Beale Restarts, Gradient Descent and Gradient Descent with Momentum.
The datasets used for the development of the models and for the testing of models' performances included data from very diverse literature sources on a number of biomass types.
This study also provided a test suite based on five criteria: mean squared error, coefficient of determination, mean Poisson deviance, mean Gamma deviance and the Friedman test. With it, the developed models were thoroughly tested and intercompared. According to all tests, the top three models, evaluated on datasets from a vast range of data sources, are the ones based on the Levenberg-Marquardt, Bayesian Regularization and Scaled Conjugate Gradient regression models.
The presented results may prove useful in further research on the AI-assisted HHV prediction based on the proximate analysis of a wide range of biomass species, including the novel ones.
To better serve the scientific community, all codes developed within this work are freely available for download. This enables potential users to apply the tools directly and to perform their own customizations of the code whenever necessary.