Influence of data scaling and normalization on overall neural network performances in photoacoustics

In this paper, the influence of input and output data scaling and normalization on overall neural network performance is investigated, aimed at inverse problem-solving in the photoacoustics of semiconductors. Logarithmic scaling of the photoacoustic signal amplitudes (input data) and simple numerical scaling of the sample thermal parameters (output data) are presented as useful tools for reaching maximal network precision. Max and min-max normalizations of the input data are applied to bring their numerical values to common scales without distorting differences. It is demonstrated in theory that the largest network prediction error, for all targeted parameters, is obtained by a network with non-scaled output data, and that the best network prediction is achieved with min-max normalization of the input data and output data scaled to the range [1, 10]. Analysis of network training and prediction performance with experimental input data shows that the benefits of input and output scaling and normalization are not guaranteed but depend strongly on the specific problem to be solved.


Introduction
Data scaling and normalization in neural networks are techniques usually applied as part of the data preparation process. Their ultimate goal is to bring the values in the dataset to a common scale without distorting input differences. The main distinction, followed in this article, is that scaling changes the data range, helping one to compare different variables in the same range, whereas normalization changes the data range from the original so that all values lie between 0 and 1, without distorting differences in the ranges of values or losing information. In general, scaling helps one to compare different variables on an equal footing, while normalization is used to improve the network model through the numerical stability and quality of the training process. Both scaling and normalization can be applied in neural networks not only to input data (for example, arbitrary signals) but to output data (predicted parameters) as well (Balderas-López 2006; Balderas-López et al. 2002; Govorkov et al. 1997; Melo and Faria 1995).
In photoacoustics (PA), scaling and normalization are usually used in experimental data analysis for accurate and reliable characterization of the investigated materials, preferably in two cases: (1) to eliminate the influence of the measuring system and (2) to find differences in photoacoustic signal behavior by detecting small changes that correspond to possible variations of thermal, mechanical and/or electronic parameters. In both cases, photoacoustics uses a dataset containing two features: amplitudes and phases in the modulation frequency range of 20 Hz-20 kHz. In this frequency range the phase varies from 0 to 360 deg, while the amplitude varies within the range of 10⁻³-10⁻⁸ a.u. and lower. It is obvious that these two features span very different ranges (Balderas-López 2006; Balderas-López et al. 2002; Calderón et al. 1998; Dramićanin et al. 1995; Govorkov et al. 1997; Melo and Faria 1995; Todorović et al. 2014, 1995). Within the framework of artificial neural networks (ANNs), defining the main feature used as an input is a very important issue (Djordjevic et al. 2020, 2019; Jin et al. 2015; Kim 1999). Many authors regard the amplitudes as the main feature and a more important predictor in photoacoustic signal analysis than the phases (El-Brolossy and Ibrahim 2010; Velasco et al. 2011). But the phases are a constitutive part of the signals too, and are also very sensitive to changes in the sample parameters (Pichardo-Molina and Alvarado-Gil 2004). This means that both amplitudes and phases have to be considered as predictors of equal importance (Ordóñez-Miranda and Alvarado-Gil 2009). Based on our experience with the application of ANNs in photoacoustics, the equal importance of these predictors allows us to use, for simplicity, only one of them to characterize the sample, without significant changes in the quality of sample parameter prediction (Arridge et al. 2019; Yahyaoui et al. 2018).
This is the reason why, in this article, we use only the amplitudes of the photoacoustic signal as the ANN input.
In our previous articles we have shown that the application of ANNs in photoacoustics can improve experimental procedures in many ways: better accuracy and precision in the prediction of the investigated sample parameters, better control of the experimental conditions together with an approach to real-time characterization of the investigated sample, etc. (Djordjevic et al. 2020, 2019). Here, we try to show why different types of scaling and normalization of the input and output data can be beneficial to the accuracy, precision and numerical stability of the network-predicted parameters, and to the acceleration of network training. To that end, logarithmic scaling and min-max and max normalizations are applied to the input data used in the ANN training process. At the same time, simple numerical scaling is used for the network output data (predicted sample thermal and geometric parameters such as thermal diffusivity, linear coefficient of thermal expansion, thickness, etc.) to find possible benefits to ANN performance. Our analysis of the training, stability and accuracy of network prediction relies on ANNs trained with or without scaling and/or normalization of input and output data, aiming to find their influence on overall network performance. All analyses are done on a theoretical model first. Then special attention is given to the benefits of scaling and normalization in the application of ANNs to experimental data.

Photoacoustic theory and experimental response
The photoacoustic signals are generated as a consequence of thermal state changes in semiconductors due to the absorption of a monochromatic modulated light source of intensity I = I₀(1 + e^(iωt)), where I₀ is the incident light amplitude and ω = 2πf, with f the modulation frequency. Using the theoretical model of the composite piston, the photoacoustic signal in semiconductors, the so-called total PA signal p_total(f) generated by the investigated sample, can be presented in the form of Eq. (1) (Calderón et al. 1998; Dramićanin et al. 1995; Mandelis 1999; Markushev et al. 2018, 2015, 2019; Todorović et al. 2014, 1995), where K_i, i = 1, 2, 3 are constants dependent on the thermodynamic state of the gas in the photoacoustic cell and its geometrical properties, R_s and l are the radius and thickness of the sample, respectively, α_T is the coefficient of linear expansion of the sample, d_n is the coefficient of electronic deformation, ñ_p(x, f) is the concentration of minority excess charge carriers and T_s(x, f) is the temperature distribution along the sample. Knowing that the solutions for T_s(x, f) and ñ_p(x, f) can be found using complex analysis, Eq. (1) can also be presented in the simplified complex form

p_total(f) = A_total(f) · e^(iφ_total(f)),   (2)

where A_total(f) is the total signal amplitude and φ_total(f) is the total signal phase. Usually, the measured experimental signals S^exp_total(f) are combinations of p_total(f) and the influence of the measuring system H_total(f) (the deviations caused by the microphone and the accompanying electronics), written as

S^exp_total(f) = H_total(f) · p_total(f).   (3)

To eliminate the influence H_total(f) of the measuring system, a correction procedure for S^exp_total(f) was established, aiming to find the characteristic parameters of the used instruments (microphone and accompanying electronics) and remove H_total(f) from Eq. (3). In that way only p_total(f) remains in Eq. (3), suitable for fitting with Eq. (1).
The mentioned fitting procedure allows one to extract all thermal (for example D_T, α_T, T_s(x, f)) and electronic (for example ñ_p(x, f), d_n) parameters needed for sample material characterization.
In this article, theoretically obtained p_total(f) signal amplitudes (Eqs. (1), (2)) are used to form the network training database. Measured experimental signals S^exp_total(f) are corrected (Eq. (3)) to obtain p^exp_total(f) amplitudes, which are presented to the networks and used for the "intelligent characterization" of semiconductors (ANN prediction of the sample parameters).
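The decomposition in Eq. (2) can be sketched in a few lines of code. This is a minimal illustration, not the composite-piston model itself: the complex vector p_total below is a synthetic placeholder, and the 72-point frequency grid merely mirrors the experimental setup described later.

```python
import numpy as np

# Hedged sketch: extracting amplitude and phase from a complex PA signal,
# as in Eq. (2): p_total(f) = A_total(f) * exp(i*phi_total(f)).
# The synthetic p_total below is illustrative only.
f = np.logspace(np.log10(20), np.log10(20e3), 72)       # 72 modulation frequencies [Hz]
p_total = (1e-4 / np.sqrt(f)) * np.exp(1j * np.pi / 4)  # placeholder complex signal

A_total = np.abs(p_total)                 # total signal amplitude
phi_total = np.angle(p_total, deg=True)   # total signal phase [deg]
```

In practice the amplitude vector A_total is what is fed to the networks, as stated above.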

Neural networks model design
As explained in (Djordjevic et al. 2020, 2019), we used the simplest model in creating our network, the so-called feed-forward network, having an input layer, one hidden layer and an output layer. The input layer consists of 72 neurons (amplitudes), following the standard number of modulation frequencies used in our experiments. The hidden layer consists of 50 neurons, following the criterion that the number of hidden neurons should lie between the sizes of the input and output layers, roughly 2/3 of the combined input and output layer size. The output layer consists of 3 neurons, corresponding to the number of predictions we want to make (the typical investigated sample thermal parameters D_T and α_T and the geometric parameter l). The logistic sigmoid activation function is used to convert the input into a more useful output with values between 0 and 1.
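The 72-50-3 architecture described above can be sketched as a plain forward pass. This is a minimal illustration with random placeholder weights, not the trained network; the weight initialization and the all-ones input vector are assumptions for demonstration only.

```python
import numpy as np

# Sketch of the 72-50-3 feed-forward network with logistic sigmoid activation.
# Weights are random placeholders, not the result of training.
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = 0.1 * rng.normal(size=(50, 72)), np.zeros(50)  # input -> hidden (50 neurons)
W2, b2 = 0.1 * rng.normal(size=(3, 50)), np.zeros(3)    # hidden -> output (3 parameters)

def forward(amplitudes):
    """amplitudes: 72 PA signal amplitudes -> 3 predicted sample parameters."""
    h = sigmoid(W1 @ amplitudes + b1)
    return sigmoid(W2 @ h + b2)   # stand-ins for D_T, alpha_T and l (scaled)

y = forward(np.ones(72))
```

The sigmoid at the output keeps the raw network outputs in (0, 1); the scaling of targets discussed below maps them back to physical parameter ranges.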
The network database consists of 5491 amplitudes (Fig. 1a) of the theoretical PA signals p_total(f) obtained with Eqs. (1), (2), changing the sample parameters in the following ranges: thermal diffusivity D_T = (8.1−9.9)·10⁻⁵ m²s⁻¹ and linear expansion α_T = (2.34−2.86)·10⁻⁶ K⁻¹, both with a step of 1.25%, and the sample thickness l = (1−10)·10⁻⁴ m with a step of 50 μm. From this database of 5491 signals, every 50th signal (110 in total) was taken for the independent test, and the remaining 5381 amplitudes were retained for the training base. In this way a satisfying variety, density and volume of relevant data is achieved. Sometimes, simple amplitude scaling to unity (scaled on 1) is performed (Fig. 1b) to emphasize differences in the frequency domain (Djordjevic et al. 2020, 2019). Typical supervised learning is used here for network training, applying a regression technique for output data prediction together with the back-propagation algorithm, which assures satisfying network prediction accuracy by finding optimal values of the weights. The training, testing and validation procedure was performed with normalization or scaling of the input signal amplitudes and output parameters (Figs. 4 and 5).

Fig. 1 (a) Non-scaled amplitude characteristics of the theoretical photoacoustic signals (Djordjevic et al. 2020, 2019) that form the network training database; (b) simple amplitude scaling to unity to emphasize differences in the frequency domain. The arrows indicate an increase in the thickness l.
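The "every 50th signal" hold-out split described above can be sketched as follows; the index arithmetic reproduces the stated counts (110 test, 5381 training).

```python
import numpy as np

# Sketch of the train/test split described above: from 5491 signals,
# every 50th is held out for the independent test, the rest form the training base.
n_signals = 5491
idx = np.arange(n_signals)

test_idx = idx[::50]                      # every 50th signal -> 110 signals
train_idx = np.setdiff1d(idx, test_idx)   # remaining signals -> 5381 signals
```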
To evaluate our network model, its prediction accuracy was tested on data from the training set that were not used at all during training. Networks trained in this way are used to predict the thermal parameters of experimentally obtained PA signals from three n-type silicon samples with different thicknesses, because the training set of photoacoustic signal amplitudes corresponds to the measured signal amplitudes p^exp_total(f) of all possible experimental settings of an open photoacoustic cell (Djordjevic et al. 2019, 2020; Jordovic-Pavlovic et al. 2020a).

Input and output data scaling and normalization
It is obvious from Fig. 1 that scaling is needed due to the large changes in amplitude values (a few orders of magnitude). The scaling of the PA amplitudes A(f) from the training base (Fig. 1a) applied here is performed using the logarithmic function (log scaled), based on the Bode plot, having the form

A_log(f) = 20 · log₁₀ A(f),   (4)

and the results based on Eq. (4) are presented in Fig. 2a. This type of scaling is usually performed to change the amplitude values to a common scale, within the range of the corresponding phase values. At the same time, it is expected that the scaling will map the amplitude-frequency variances from a log-log (Fig. 1a) to a lin-log scale (Fig. 2a).
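The Bode-style logarithmic scaling of Eq. (4) amounts to one line of code; the amplitude values below are illustrative placeholders spanning the orders of magnitude mentioned earlier.

```python
import numpy as np

# Logarithmic (Bode-plot) scaling of PA amplitudes, Eq. (4):
# amplitudes spanning several orders of magnitude map to a common linear scale.
A = np.array([1e-3, 1e-5, 1e-8])   # illustrative amplitude values [a.u.]
A_log = 20.0 * np.log10(A)         # log-scaled amplitudes [dB]
```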
Generally speaking, max and min-max normalizations are used to rescale the amplitudes to the ranges (0, 1] and [0, 1], respectively, and to equalize the amplitude-frequency variances as much as possible. The first type of normalization applied to A(f) is the normalization to the maximum absolute value (max norm) of the base frequency vectors, defined as (Jordovic-Pavlovic et al. 2020a, b):

A_max(f_i) = A(f_i) / max_i |A(f_i)|.   (5)

This type of normalization gives amplitude values within the range 0 < A_max(f_i) ≤ 1 and changes (equalizes) the amplitude-frequency variances to some extent in the given modulation frequency range of 20 Hz-20 kHz (see Fig. 2b).
The second normalization applied to A(f) is the min-max normalization (min-max norm), which rescales the range of the amplitudes to [0, 1] (Fig. 2c), using (Furundzic et al. 2017, 1998):

A_minmax(f_i) = (A(f_i) − A_min) / (A_max − A_min).   (6)

This type of normalization, depicted in Fig. 2c, simultaneously changes (equalizes) the amplitude-frequency variances to some extent, slightly differently from the max normalization, in the investigated modulation frequency range.
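The two input normalizations, Eqs. (5) and (6), can be compared side by side; the short vector A is an illustrative stand-in for a 72-point amplitude vector.

```python
import numpy as np

# Max norm (Eq. (5)) rescales to (0, 1]; min-max norm (Eq. (6)) rescales to [0, 1].
A = np.array([2.0, 5.0, 10.0])   # illustrative amplitude vector

A_maxnorm = A / np.max(np.abs(A))                # maximum maps to 1, range (0, 1]
A_minmax = (A - A.min()) / (A.max() - A.min())   # minimum maps to 0, range [0, 1]
```

Note the difference visible even here: max norm preserves the ratios between amplitudes, while min-max norm shifts the minimum to exactly 0, which is the "slight difference" between Figs. 2b and 2c discussed in the text.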
It is assumed that both types of input data normalization could lead to better network predictions in the critical areas (> 1 kHz) where some signal overlapping exists. It can also be said that databases in the range 0-1 (Fig. 2b and c) are more acceptable for machine learning, because it is easier to form a weight matrix using numerical values from a smaller range. The slight difference between the max and min-max normalizations arises because the max normalization divides by the maximum value of the signal at a certain frequency, while the min-max normalization rescales the signal to the full range [0, 1].
Two ANNs with different ways of scaling the output data are formed to analyze the influence of scaling on the network training performance. As input for both ANNs, the unchanged (non-scaled and non-normalized) PA signal amplitudes (Fig. 1a) are used. As stated earlier, the output vectors consist of three parameters of the sample, given in the following ranges: D_T = (8.1−9.9)·10⁻⁵ m²s⁻¹, α_T = (2.34−2.86)·10⁻⁶ K⁻¹, and l = (1−10)·10⁻⁴ m. For the first neural network (NN1), with non-scaled output, the thermal diffusivity appears as the smallest data in the output vector, in the range (0.81−0.99); the linear expansion in the range (2.34−2.86); and the thickness in the range (100−1000). For the second neural network (NN2), we scaled the output data for D_T, α_T and l to the range from 1 to 10: (8.10−9.90), (2.34−2.86) and (1−10), respectively.
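The NN2 output scaling described above comes down to multiplying each target parameter by a power of ten so that all three land in the common range [1, 10]; the specific multipliers below are inferred from the stated ranges, and the parameter values are illustrative.

```python
import numpy as np

# Sketch of the NN2 output scaling: per-parameter power-of-ten multipliers
# bring D_T, alpha_T and l into the common range [1, 10].
D_T = 8.5e-5       # thermal diffusivity [m^2 s^-1]
alpha_T = 2.5e-6   # linear expansion coefficient [K^-1]
l = 5.0e-4         # sample thickness [m]

scale = np.array([1e5, 1e6, 1e4])              # multipliers inferred from the ranges
targets = np.array([D_T, alpha_T, l]) * scale  # scaled training targets
params = targets / scale                       # inverse transform after prediction
```

The inverse transform is applied to the network outputs at prediction time to recover the physical parameter values.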

Neural networks with non-scaled and scaled output
To test the network training performance (a plot of mean squared error (MSE) vs. epochs), we trained the NN1 and NN2 networks. The obtained training performances are given in Table 1 and Fig. 4. It is obvious that the NN2 network shows much better training performance, having a larger number of epochs but the same training time as NN1.
Network prediction accuracy is tested using the maximal and average relative errors obtained by comparing the network predictions with the parameter values used to form the theoretical amplitudes. The tests are performed on signals not presented to the network during the training: 110 randomly selected signals from the training database, and 24 randomly selected signals outside the training base (with random parameters within the range of parameter changes). The results presented in Tables 1, 2 and 3 indicate that the scaling of output data is beneficial not only for network training performance but for network prediction performance as well. Only the results obtained for the sample thickness prediction deviate from the expected ones (Tables 2 and 3). This could have been expected, knowing that in the theoretical model (Eq. (1)) the thickness is a parameter obtained by fitting with a larger error, so the thickness prediction results can vary on a much larger scale than D_T and α_T.
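The two accuracy metrics used above, maximal and average relative error, can be written as a small helper; the predicted and target vectors below are illustrative placeholders, not values from the tables.

```python
import numpy as np

# Maximal and average relative errors (%) between network predictions
# and the true parameter values used to generate the theoretical amplitudes.
def relative_errors(predicted, target):
    rel = 100.0 * np.abs(predicted - target) / np.abs(target)
    return rel.max(), rel.mean()

# Illustrative values only (not from Tables 1-3):
max_err, avg_err = relative_errors(np.array([8.6, 2.4]), np.array([8.5, 2.5]))
```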

Neural networks with non-scaled, scaled and normalized input
Based on the results presented in the previous section, the NN2 network seems more acceptable than NN1, so our choice for further theoretical analysis is NN2. Its training performance is analyzed using different forms of the PA signal amplitudes A(f) as input: non-scaled (Fig. 1a), scaled on 1 (Fig. 1b), logarithmically scaled (Fig. 2a), max normalized (Fig. 2b) and min-max normalized (Fig. 2c). The results of this analysis are shown in Table 4 and Fig. 5.
It is obvious that the worst result in Table 4 is network training with data scaled on unity. Logarithmic scaling significantly improves the network performance, max normalization gives better results, while min-max normalization shows the best performance. In general, the presented analysis shows that input data normalization is beneficial in terms of network training performance. The standard NN2 network prediction accuracy test is performed with independent signals extracted from the amplitude base before training (Djordjevic et al. 2020, 2019). The results presented in Table 5 show small relative errors (%) for all types of normalization. This test confirms that the best prediction accuracy is achieved with min-max input normalization.

Application on experimental signals
All previous considerations and tests of network training performance and prediction accuracy were done on theoretical signals used for the precise determination of changes in D_T, α_T and l. Interesting results are obtained by analyzing the network prediction accuracy for experimental samples with different thicknesses (Table 6). Unfortunately, the experimental signal S^exp_total(f) amplitudes (Eq. (3)) measured by the standard experimental setup cannot be directly presented to the network as input data, because they are distorted under the influence of H_total(f). As mentioned in Sect. 2, an additional procedure (Fig. 3), explained in detail in (Aleksić et al. 2016; Markushev et al. 2015; Popovic et al. 2016), is needed to detect the instrumental characteristics, calculate H_total(f) and find the undistorted signal p_total(f) (Eq. (3)) which can be recognized by the ANNs. The obtained p_total(f) for each sample is fitted using Eq. (1) to find the D_T, α_T and l parameters (see Sect. 4). After the estimation of these parameters, the amplitudes A(f) (Eq. (2), non-scaled input) are presented to the different ANNs (NN1, NN2) to find the network-predicted parameters D_T^ANN, α_T^ANN and l^ANN. The relative errors (%) are calculated by comparing the fitted and predicted parameters. The results calculated for the experimental data are presented in Tables 6 and 7. The predictions for the thinnest sample are most accurate with the NN1 network, while for the thicker samples the better predictions are obtained with NN2. Obviously, the NN2 output scaling matches the scale of precise network prediction. This logic does not work with the thinner samples, where the NN1 non-scaled output variables result in a more precise parameter prediction. As a conclusion based on Table 6, one can say that scaling the output data is useful when processing PA signal amplitudes originating from thicker samples; approaching the thinner ones, output scaling loses its importance. In our case (Fig. 1), since we work in most cases with thick samples, the choice of the NN2 network for further experimental amplitude analysis proves to be a rational solution.

Fig. 3 a Measured experimental signal (Eq. (3)); b signal corrected due to instrumental influence; c theoretical fit (Eq. (1)) and network prediction parameters; d relative error calculation.

The NN2 networks with scaled outputs, but with differently rescaled and normalized inputs, were tested on adjusted (Eqs. (4), (5) and (6)) experimental signals as well. The prediction accuracy test results are presented in Table 7. The benefits of input data scaling and normalization for the NN2 network can be seen here as well, although not as clearly as with the theoretical signals. The reason lies in the fact that the experimental signals follow neither the frequency steps nor the parameter-changing steps of the theoretical signals. This is why one can expect larger errors and more diversity in the NN2 parameter prediction accuracy obtained with experimental signals. The largest errors are obtained mostly with non-scaled inputs.
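The correction step that removes the instrumental influence, i.e. inverting Eq. (3), can be sketched as a complex division; the transfer function and measured signal below are illustrative placeholders, since the actual H_total(f) must be determined from the instrumental characteristics as described above.

```python
import numpy as np

# Sketch of the correction step: S_exp_total(f) = H_total(f) * p_total(f) (Eq. (3)),
# so dividing by the separately determined transfer function recovers p_total(f).
H_total = np.array([0.9 + 0.1j, 0.8 + 0.2j])   # illustrative instrumental transfer fn
S_exp = np.array([1e-4 + 0.0j, 5e-5 + 1e-5j])  # illustrative measured complex signal

p_total = S_exp / H_total       # undistorted signal, suitable for fitting with Eq. (1)
A_corrected = np.abs(p_total)   # amplitudes presented to the ANNs as input
```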

Conclusion
In this article, the influence of different input and output data scaling and normalization on overall neural network performances in photoacoustics is presented.
In theory, simple numerical scaling applied to the network prediction parameters (diffusivity, coefficient of thermal expansion, thickness) as output data was found to be beneficial in terms of network training and prediction, and it was kept throughout the whole analysis. Various scaling and normalization methods (logarithmic scaling, scaling on unity, max and min-max normalization) were applied to the photoacoustic signal amplitudes, used as input predictors of equal importance compared with the signal phase. It was found that min-max amplitude normalization gives the best results in network training and prediction accuracy.
In the experiment, the results of network input scaling and/or normalization are not unambiguous. In general, each method has pros and cons, and one has to decide, based on the specific problem to be solved, which scaling or normalization method is best suited to it. The results obtained here with silicon samples of different thicknesses suggest that normalization of the input data together with scaled output data is the best choice to reach the highest overall network performance in the case of thicker samples. In the case of thinner samples, the various scaling and normalization methods can be only partially beneficial to the overall network performance. As a principal conclusion, there is no universal input data normalization method that can be chosen in advance to improve network training performance and prediction accuracy.

Appendix II Training, validation and test results of neural networks with scaled output and scaled or normalized input
Training, validation and test results of the amplitude neural network NN2 with (a) non-scaled input, (b) input scaled on 1, (c) log-scaled input, (d) max-normalized input and (e) min-max-normalized input, obtained in order to achieve the best network performance (see Table 4).