Generalization enhancement of artificial neural network for turbulence closure by feature selection

Feature selection targets for selecting relevant and useful features, and is a vital challenge in turbulence modeling by machine learning methods. In this paper, a new posterior feature selection method based on validation dataset is proposed, which is an efficient and universal method for complex systems including turbulence. Different from the priori feature importance ranking of the filter method and the exhaustive search for feature subset of the wrapper method, the proposed method ranks the features according to the model performance on the validation dataset, and generates the feature subsets in the order of feature importance. Using the features from the proposed method, a black-box model is built by artificial neural network (ANN) to reproduce the behavior of Spalart-Allmaras (S-A) turbulence model for high Reynolds number (Re) airfoil flows in aeronautical engineering. The results show that compared with the model without feature selection, the generalization ability of the model after feature selection is significantly improved. To some extent, it is also demonstrated that although the feature importance can be reflected by the model parameters during the training process, artificial feature selection is still very necessary.


Introduction
In recent years, there has been a surge in research on the combination of machine learning and turbulence. These researches can be divided into two categories according to the model form: classifiers for identifying uncertainty regions and regression models for calculating turbulence-related variables. For the classifier, researchers can use machine learning to recognize the uncertain regions calculated by the Reynolds-Averaged Navier-Stokes (RANS) turbulence model based on experimental or high-fidelity data [1,2]. The uncertainty of turbulence calculation comes from the ensemble averaging operation of Navier-Stokes (N-S) equation, the functional form of a model, the representation of Reynolds stress and the empirical parameters of the model [3]. Then, a classifier can be constructed to predict the uncertain regions in different flow cases and a better method with higher accuracy can be used to improve the accuracy for these uncertain regions. For the regression, machine learning is mainly used to improve or substitute the conventional RANS model and the subgrid model in large eddy simulation (LES). The linear eddy viscosity models are based on the Boussinesq assumption, which have high efficiency and good robustness, but are eclipsed by anisotropic turbulence, such as large separation and secondary flows. Thus, researchers used machine learning to decrease the discrepancies between RANS results and high-fidelity or experimental data. There are mainly two corresponding solutions. One is to change the governing equation form of the RANS model, such as introducing the correction coefficient or adding the source term to the transport equation [4][5][6][7]. The other is to construct the deviation function by machine learning and then superimpose the results of RANS model and the output of proposed model as the Reynolds stress [8,9]. Different from the above studies, some studies directly construct the surrogate models of some turbulence variables [10][11][12]. Overall, the achievement is encouraging. In the future, machine learning may play a key role in turbulent modeling of complex flows [13]. With the research deeper and deeper, there are still many problems to be solved, such as feature selection (FS) and insufficient generalization ability, etc. [14,15] The feature means each component of the input vector and describes the sample property, which is similar to the independent variable of the function. Thus, the feature plays an important role in the model performance. It is difficult to characterize the data space thoroughly and comprehensively by inadequate features, which leads to low accuracy. On the contrary, redundant features may result in over-fitting of the model and reduce its generalization ability. Generalization generally refers to the performance of a model on unseen datasets. Feature selection is to select relevant and useful features to reduce the computation cost and improve the model performance [16]. For complex problems with unclear mechanism, it is challenging for researchers to extract features straightforwardly. A common way is selecting some related features empirically according to physical knowledge and the specific problem. The result of this way might be accidental. Therefore, an efficient and reliable feature selection algorithm is needed by researchers.
In data-driven turbulence modeling, feature construction and selection are large obstacles for researchers. Many researchers select features according to existing governing equations and theorems or their own physical views. A common recommendation is that the features should be invariant. Invariance property means that values do not change with the translation, reflection and rotation of the coordinate system. Ling et al. [17] compared two methods of integrating invariant property into machine learning models and found that embedding the invariance into the features is more efficient and the corresponding model has better performance. Based on the conclusion, they took five tensor invariants as features and constructed the tensor basis neural network for scalar coefficients of the nonlinear eddy viscosity model. Besides the integrity basis derived by Pope [18], Wu et al. [19] introduced the gradient of turbulent kinetic energy and pressure to consider the effect of strong pressure changes and nonequilibrium. Yin et al. [20] further summarized feature selection criteria based on tensor analysis and flow structure identification. Although some researchers emphasize the invariance property, it is not clear whether the features without this property will definitely reduce the model performance. Wang et al. [21] studied the closure of the subgrid stress in LES and the adopted features are the filtered velocity and first and second order derivatives. Since the geometry and coordinate system of test cases are the same as those of training cases, Cruz et al. [22] took each component of the tensors and vectors as the input feature. Singh et al. [6] adopted the features mainly based on the governing equation of Spalart-Allmaras (S-A) model. The features used in most of the current work are based on physics and belong to the result of feature construction rather than feature selection to a large extent.
There are some common feature selection methods, like filter method, wrapper method and embedded method or some derivation methods based on these three methods [16]. Filter methods rank the features according to the correlation between features and output, which is efficient and priori. One of the common ways to measure the correlation is the Pearson correlation coefficient. Other correlation measurement can be found in the review of Saúl et al. [23] The wrapper methods evaluate the feature subset according to the prediction performance. The exhaustive search of this method becomes an NP-hard problem since there are 2 N feature subsets, where N is the candidate feature number. The embedded methods insert the feature selection into the learning process. The difference of embedded method from the previous two methods is that the learning process and the feature selection process cannot be separated, thus the feature subset applicable to a specific model architecture might not be applicable to other model architectures. In recent years, some hybrid methods of these three methods have been proposed [24,25]. More information about the feature selection method can be found in many studies [26][27][28][29]. In data-driven turbulence modeling, feature selection methods stated above are not widely used yet.
The principle target of this work is to enhance the model generalization ability through the proposed feature selection algorithm. For high Reynolds number airfoil flows in aerospace, the constructed model in this paper is essentially a black-box model without solving the transport equation, aiming to reproduce the behavior of RANS model. Driven by the data from only one flow case, the constructed model is expected to be effective for various cases with different freestream conditions and airfoils. The rest of this paper is summarized as follows. The second section introduces the methods used in this work, including workflow, modeling strategy, artificial neural network and the feature selection algorithm. The training process is illustrated in the third section, including the datasets and the training method. The fourth section is the results and discussions, which are addressed from the aspects of accuracy and convergence. Conclusions and outlook lie in the fifth section.

Workflow
The workflow includes two parts: neural network training and coupling of the N-S equation solver with the constructed model. The first part aims to obtain a neural network model, of which the performance should be satisfying and convergent. The word "convergent" means that the loss function is hard to be lower. This part mainly consists of three aspects: data acquisition, data preprocessing and model training. Model performance is improved mainly by adjusting model parameters and hyperparameters during the training process. If the performance is not adequate, the input features and training samples can be further optimized in data preprocessing. The target of the second part is to substitute the conventional RANS model with the constructed surrogate model and obtain the results that agree well with those calculated by S-A model. The surrogate model acts as a turbulence solver, as shown in the dot box below. The entire process is shown in Fig. 1, and some details will be addressed later.

Modeling strategy
According to the dimensional analysis, the kinetic eddy viscosity can be expressed as the product of mixing velocity u mix and mixing length l mix .
The "mixing length" derives from the molecular mean free path in gas dynamics, which was proposed by Prandtl [30]. The algebraic model with zero equation can express turbulent eddy viscosity explicitly in the form of Eq. (1). Generally, two length scales are used to characterize the turbulence from wall to the outer edge of the boundary layer, namely the wall friction scale δ υ and the boundary layer thickness δ.
The δ υ and δ are corresponding to the inner and outer layers, respectively, where the boundary is around the outer edge of the logarithmic layer. The mixing length of the inner layer is related to the distance to the wall. There are many researches on the scale analysis of the inner layer, and the universality is good in many cases. However, the outer flow variables are only slightly influenced by the wall, and are more disorderly and irregular [31]. Therefore, the scale analysis of the outer layer is more Fig. 1 Flow chart for building the turbulence surrogate model challenging and of less universality. More information about the mixing length is detailed in Granville's work [32]. The mixing length adopted in this paper is based on the Prandtl-van Driest formulation [33] and limited by the boundary layer thickness of flat plane, where A + = 26 and κ = 0.4 in this work, d y þ ≈ 1 denotes the estimated wall friction scale, calculated by the function at the following address: https://www.cfd-online.com/Tools/ yplus.php. Because it is found that the real wall friction scale leads to worse convergence during the iteration process. Referring to Clauser's constant eddy viscosity hypothesis [34], the mixing length of the outer layer is the same order with the boundary layer thickness. For convenience, the boundary layer thickness here is approximate to that of the flat plane, since the curvature of the airfoil is small. It should be noted that Eq. (2) is effective qualitatively rather than quantitatively, which is a compromise to the lack of robustness of the constructed neural network model. Once the mixing length is determined, the mixing velocity can be deduced according to Eq. (1). The constructed model in this paper is a nonlinear mapping between mean flow variables and the mixing velocity. Since the modeling strategy includes the distance to the wall, which can be less effective in separated flows, the flow cases in this work are all attached flows.

Artificial neural network
The term "neural network" comes from neurology and is often referred to as "artificial neural network" in machine learning. Neuron is the basic unit of a neural network. The number of neurons in each layer is called width and the number of layers is called depth, which are two hyperparameters of a neural network. For a fully-connected neural network, only the neurons in adjacent layers are connected to each other, and the output of the neurons in the previous layer is the input of the neurons in the current layer. The neural network used in this work is shown in Fig. 2. The first layer is the input layer while the last layer is the output layer, and the rest is the hidden layers. The parameters of the neuron are called weight and bias. Each neuron contains two operations. The first one is the linear addition between the product of input vector X and weight vector W and bias b, and the second one is the nonlinear activation operation.
The common activation functions f(·) are Tanh, Sigmoid, Relu, LeakRelu and so on. In this work, Tanh is adopted, see Fig. 3. Since the grid search method is timeconsuming, the structure is determined by a trial-and-error approach. Firstly, we fixed the width and tested the effect of depth on the model performance. Then, we fixed the optimal depth and tested the effect of width on the model performance. It was found that the structure (20,20,20,20,20,1) is enough for reaching the optimal solution. More neurons do not significantly improve model performances but increase computation time.
The error back propagation (BP) algorithm based on gradient descent is used to train the neural network. The gradient is calculated by the chain rule to update the parameter values of the neurons. Because the gradient descent algorithm is prone to stop at the local minimum, the global optimum is generally approximated by several training with different initial values. More information about neural networks can be found in many works [35][36][37].
According to the universal approximation theory, neural networks with enough neurons can approximate any measurable functions [38]. The complexity of the neural network can be adjusted by changing the depth and width. The strength of mapping complex nonlinear functions makes neural networks popular with researchers. It has already been used in turbulence computation [10,39].

Feature selection
We do not intend to delve into feature construction that depends on specific problems or researchers' physical view, but rather try to select some features that are helpful to model performance from existing candidate features. Although we assume that the Fig. 3 The Tanh activation function neural network itself can play a role in feature selection more or less in the training process, the important role played by artificial feature selection is irreplaceable according to our own experience.
The proposed feature selection method includes two parts. The first one is measuring the numerical space of the features and kicking off those with obvious extrapolation according to the mean extrapolation distance (MED). The second one is ranking the feature and determining the feature subset according to the model performance. The whole process is shown in Fig. 4 and to be more specifically, it includes the following steps.
Step 1: Extract the remaining features and the output from D and T, and reshape the datasets respectively.
MED ¼ where N e is the number of the extrapolation samples. Fig. 4 The procedure of feature selection Step 2: Divide the dataset D into training set D train and validation set D valid randomly, where D train ¼ fðq 1 ; q 2 ; …; q n ; y train Þ; q i ∈ℝ l train Â1 ; y train ∈ℝ l train Â1 ; i ¼ 1; 2; …; ng.
Construct a neural network M according to D train and D valid . Create a new datasetT ¼ fð e q 1 ; e q 2 ; …; f q n ; y test Þ; e q i ∈ℝ l train Â1 ; y test ∈ℝ l train Â1 ; i ¼ 1; 2; …; ng by selecting l train samples from T. Replace the feature of D train by the same feature ofT one by one and combine the output ofT, thus, we can get n datasets Q i , where i = 1, 2…, n.
Test the model by datasets Q i and get the test errors. Rank the features according to the test errors. The smaller the error, the more important the feature and the higher the rank.
Step 3: Adjust the features of D train andT according to the feature importance and shape them as the following form: Feature Importance : 1; 2; …; n D train ¼ fq 1 ; q 2 ; …; q n ; y train g T ¼ f e q 1 ; e q 2 ; …; f q n ; y test g Create the training set g D train and validation set g D valid by selecting the first j features from D train andT, respectively. Then, construct a neural network according to g D train and g D valid , and record the validation error Err j on g D valid . The feature number j increases from min _ Feature _ NO set by the user to n one by one. Finally, select the feature subset with the minimal validation error.
In this paper, datasets D and T are composed of 2525 and 5437 samples, respectively. These samples are obtained from different flow cases (see Table 1) using the sample selection algorithm described below. The flow cases are sampled through the Latin hypercube sampling method (LHS), and the sampling space is Ma ∈ [0.1, 0.5], Re ∈ [2 × 10 6 , 8 × 10 6 ], α ∈ [0°, 5°]. The total candidate features before feature selection are shown in Table 2, where p denotes the pressure, x and y denote the streamline and normal direction of the freestream respectively and the subscripts denote the partial derivative in the direction. The sign function sgn() is defined as follows.
It can be found that the candidate features are not limited to those with invariant property. Some features have invariant property, such as u i ∂p ∂x i and S ' /Ma 2 , while others don't, such as velocity vector and its derivatives, which are invariant to translation but not to rotation. According to step 1, the MED of each feature can be calculated (see Fig. 5) and the space domain formed by each feature and the output of dataset D and T can also be plotted. For example, the space domain of some features can be seen in Fig. 6. After removing the features with large MED, the remaining features are (q 5 , q 8 , q 10 , q 11 , q 12 , q 13 , q 14 ). For step 2, the mean square error (MSE) of 7 datasets Q 1 −Q 7 is shown in Fig. 7. Thus, the feature importance is q 8 > q 13 > q 5 > q 12 > q 14 > q 10 > q 11 according to the MSE. In step 3, the min _ Feature _ NO is set to be 3, which means that the features q 8 , q 13 , q 5 are necessary.
Starting from these three features, an individual feature subset can be generated as the feature number increases by one in the order of the feature importance and a corresponding neural network model can be constructed. The best feature subset can be found by the model performance on validation set. In order to avoid the influence of occasionality, the model is trained three times for each feature subset, and the model performance is measured according to the mean value of the errors on validation set. As is shown in Fig. 8 where the horizontal axis means the feature number of the feature subset, the model performance is the best when the first four features are adopted. Therefore, the features used to construct the model hereinafter are (q 8 , q 13 , q 5 , q 12 ).

Dataset
The source data used in this paper is computed by the CFD solver coupled with the S-A model [40], where the transport equation of the S-A model is as follows.
The governing equations are solved by the pseudo-time-marching method. The cellcentered finite volume method is used and the spatial discretization is the AUSM+UP scheme with second-order accuracy. The governing equations are nondimensionalized by the mean aerodynamic chord c, speed of sound a ∞ and freestream density ρ ∞ . The  airfoil surface is set to be wall with no-slip condition and the far field is pressurefar-field without reflection. When the variation of drag coefficient between 1000 iterations is less than 1 × 10 −5 , the flow field is set to be converged. Hybrid meshes are used. The grid near the wall is quadrilateral with a growth rate of 1.2 and the outer grid is triangular. The height of the first layer at the wall satisfies y + < 1. The benchmarks of NACA0012 and RAE2822 airfoils are adopted to validate the code and the corresponding freestream condition is shown in Table 3. The result of the CFL3D and experiment can be referred to [41][42][43]. Both of the agreement of C p and C f are good, see Figs. 9 and 10. In this paper, there is only one training case, and its freestream condition is Ma = 0.27, AOA = 2.45°, Re = 3 × 10 6 . Two thousand five hundred and twenty-five data pairs were generated by the sample selection algorithm, and 1/5 of them were selected randomly as the validation dataset, while the rest were used as the training dataset. A total of 68 flow cases were used as the test cases, 60 of which are shown in Fig. 11 and the rest cases will be illustrated later. The reason for this design of test cases is that we attempt to find whose change of the freestream condition (Ma, Re, AOA) is more challenging for the model performance, which might be helpful for future work.

Data process 3.2.1 Sample selection
The goal of sample selection, in our opinion, is to select as few samples as possible to represent as much of the original sample space as possible. In other words, the diversity of physical behaviors contained in the original sample space should be retained as much as possible. The sample selection method used in this work is a combination of the law of the wall and recursive algorithm. The whole process can be achieved by two steps: Step 1: Divide the flow field into several subzones according to lny + , and the size of each subzone is a set value Δx.
Step 2: Set the standard correlation value cc _ std. Take the first data in the subzone as the first sample. For each subzone, starting loop from the second data, calculate the correlation coefficient cc between each data and all samples of the current subzone, and denote the maximal cc as cc _ maximum. If cc _ maximum < cc _ std, then the corresponding data is a new sample; otherwise, the data is redundant and should be removed. Loop the current step for all subzones.  . 9 The comparison of surface pressure C p and skin friction coefficient C f of NACA0012 airfoil The formula of correlation coefficient is, where g denotes the vector composed by the input features and output. The specific algorithm is shown in Table 4. It should be noted that before the sample selection and during the coupling process later, the data of cells whose Reynolds stress is near zero in the far field is eliminated according to the redefined entropy S' < 2 × 10 −5 . In this work, we applied the sample selection to the training case. And we set Δx = 0.2 and cc _ std = 0.999737. The number of subzone is 59. The data number before and after sample selection is 8328 and 2525, respectively. The comparison of data distribution before and after sample selection is shown in Fig. 12 when this method is applied to the training case. It can be found from the figure that, among the selected samples, many of them lie in the leading edge and trailing edge where the flow field changes rapidly. And the data at middle airfoil is obviously sparse, while there is only a few data in the far field and wake region.

Postprocess
For the samples at the edge of model performance, the model outputs are probably outliers, resulting in negative eddy viscosity. In this case, eddy viscosity is forced to be non-negative. In addition, there may also be the phenomenon of "burr", that is the small jump of the outputs among adjacent cells. In order to obtain a smooth flow field, spatial smoothing is introduced to smooth the eddy viscosity field calculated by the model at each iteration.
where i denotes the i th neighboring cell, N the total number of neighboring cells. β is the weight of neighboring cells, thus, the larger the β, the better the smoothness. But the eddy viscosity in the boundary layer changes sharply and non-linearly along the normal direction of the wall. Since Eq. (10) is linear, the error can be evitable. And the larger the β, the larger the error. It is a trade-off between smoothness and accuracy. We recommend small values for test cases which are similar to the training case and large values for test cases which deviate a lot from the training case. In this work, we set β ∈ [0.2, 1]. All the flow cases involved in this paper are steady. Therefore, the final turbulent field should be convergent. During the iteration process, we found that the turbulence field calculated by the model is convergent on the whole, but there are slight oscillations of the eddy viscosity of some cells from time to time. The slight oscillation in turn causes the residual oscillation of the N-S solver. In order to restrain such unreasonable oscillation, when the residual value of N-S solver is less than 1 × 10 -5 , the mean variation of eddy viscosity ∑Δμ t /M between adjacent iteration steps is limited to be monotonically non-increasing, where M is the number of the total cells. More specifically, denote the Table 4 The sample selection algorithm value of ∑Δμ t /M at current step k as ∑Δμ t k /M, where Δμ t = |μ t k − μ t k − 1 |. At each iteration step, set ∑Δμ t /M = min(∑Δμ t /M, ∑Δμ t k /M) and cycle the whole grid cells. If the variation of eddy viscosity between current and forward steps satisfies Δμ t > ∑ Δμ t /M, then set μ t k = μ t k − 1 + sgn(μ t k − μ t k − 1 ) × ∑ Δμ t /M. It is found that the influence of this limitation on the final results can be ignored.

Training process
If the loss function is simply defined as mean square error, then the relative error of the model will be large for samples with small output value. As far as this work is concerned, the above trouble happens in the near wall region, and thus the model is prone to perform poorly. Furthermore, the high shear strain rate here will lead to unreasonable results of Reynolds stress and skin friction distribution. Therefore, in the training process, the accuracy of Reynolds shear stress is considered as a constraint for the region y + < 30. In other words, the accuracy of the samples in near wall region is improved through increasing the weight in the loss function by the constraint term. The constructed model needs to be coupled with the N-S solver like conventional turbulence models. Therefore, the robustness of the model should be enhanced during the training process to avoid the high sensitivity of output to the slight change of input. Two tricks are used. One is that we included the L1-norm and L2-norm regulation in the loss function. The other is that we introduced the stability training (ST) after the normal training process. The principle of stability training is to enhance the model robustness through adding some random noise to the original samples. The random noise used in this work is uniformly distributed with a magnitude less than ±5% of the original sample.
whereX is called random perturbation copy. The model output ofX is denoted asỸ . The stability loss term is defined as, The loss function of ST is: More details on stability training can be referred to the work of Zheng et al. [44] The model training is implemented on PyTorch [45], and the related parameters are shown in Table 5. There are five components in the loss function and the magnitude of each one can be different during the training process. If the magnitude of one component is much larger than that of the others, then the effect of the other components will be suppressed. Thus, the multiplier λ 1 − λ 4 is introduced to make the magnitude of each component on the same order. It should be noted that the error function of the validation set does not include the stability loss term during normal training, which is defined as: During the stability training, the error function is redefined as: The training process is over when the error function no longer drops within 500 steps.

Results and discussions
Since C d and C f are sensitive to the eddy viscosity field, this section mainly assesses the accuracy of the constructed model from these two aspects. The model performance is expected to be good in the test space Ma ∈ [0.1, 0.5], Re ∈ [2 × 10 6 , 10 × 10 6 ], α ∈ [0°, 5°]. First of all, we calculate the relative error of C d for all the test cases in Fig. 11. The error contour of the test space can be obtained by interpolation of the scattered cases, as shown in Fig. 13. It can be found that the relative error varies regularly with the angle of attack and Ma number for cases with fixed Re number. The relative error is smaller with the increase of angle of attack and larger with the increase of Ma number. However, the variation of the relative error is ambiguous with angle of attack and Re number for cases with fixed Ma number. On the whole, the relative error increases with the increase of angle of attack and Re number. The above analysis is qualitative and rough. More detailed and quantitative information is shown in Fig. 14. When α < 3°, the influence of angle of attack on model performance is not obvious. When α > 3°, the C d calculated by the constructed model tends to be small with the increase of angle of attack. In addition, the C d is smaller than that calculated by the S-A model as the Re number increases. It can be found from the above results that the model performance tends to be poor at some boundaries of the test space, which is helpful to speculate the model performance in the whole test space. The freestream conditions of the eight boundary point cases and the corresponding relative errors of C d are listed in Table 6. Only case C6 slightly exceeds 5%, while the other cases are all lower than 4%. The corresponding skin friction coefficient distribution is shown in Fig. 15. The cases C7 and C6 correspond to the best and worst model performance, respectively.
In order to validate the proposed sample selection and feature selection algorithms, two models were constructed respectively under the same model architecture: one is without sample selection and feature selection and the other is only without sample selection. Take the two cases of NACA0012 airfoil as examples, and the comparison of C f calculated by the three models is shown in Fig. 16. The result is obviously wrong without feature selection, although the accuracy of the lower surface at the leading edge is better for case C6. This indicates that more features do not mean better model performance, and the ability of neural network to select features automatically is limited. Elaborate feature selection is necessary. Sample selection has negligible influence on the result, except for slight deviation at the leading edge of the airfoil, which indicates that the proposed sample selection algorithm is feasible and the selected samples preserve the data space and diversity of the original dataset. Thus, what should be focused on is the diversity rather than the number of samples. In terms of these two cases, the C f agreement is very good at the upper surface, which is obviously better than that of the lower surface. The deficiency mainly manifests in two aspects. One is around the peak value of C f at the leading edge, where the flow variables change dramatically. The other  Fig. 17. When the constructed models can be coupled with N-S solvers like conventional models, the research is of more practical significance. However, the error propagation between the Reynolds stress and the time-averaged flow field is a challenge for most of the research. Wu et al. [47] explained the stability of different closure models according to the local condition number. Cruz et al. [22] suggested using the Reynolds force vector instead of the Reynolds stress tensor as the model output. For the steady cases in this work, we expect that the convergent solution can be obtained when coupling the model with the N-S solver. Although we enhance the model robustness and the smoothness of the eddy viscosity field by stability training and space-time smoothing and the eddy viscosity model has its own stability advantage, these are still not enough to counteract the high sensitivity of time-averaged field to Reynolds stress, which causes the residual oscillation and the slower convergence.
Since the convergence process of all cases is similar, we take the case C6 of NACA0012 airfoil as an example to make further analysis for the sake of convenience. In each iteration step, we tracked the Reynolds shear stress variation between the current step and the previous step of the cell with maximal residual value. In this Fig. 13 The relative error contour of C d of the test cases regard, the change of S-A model decays fast and smoothly, while that of the ML model decays slower and has large oscillations frequently, as is shown in Fig. 18. The oscillations of Reynolds stress variation derive from those of the model output, which disturb the convergence of the mean flow field and lead to the residual oscillations. The location of the cells with maximum residual was documented after 6 × 10 4 pseudo steps, see Fig. 19(a). Since many symbols overlap each other, we further count the percentage of the symbols in the same location in all symbols, see Fig. 19(b). It can be found that  most of the cells with maximum residual are in the buffer layer and near the inner edge of the log layer and the majority lie in the back part of the airfoil.

Conclusion and outlook
In this paper, a feature selection method is developed to improve the generalization ability of machine learning models in turbulent flows. The method selects the features by three steps: firstly, measure the numerical distribution characteristics of the training and validation datasets according to the mean extrapolation distance and eliminate the features with obvious numerical extrapolation; secondly, construct a model with all features, then test and rank the effect of each feature on the model performance; finally, generate the feature subsets according to the feature importance and evaluate the model performance that each feature subset can achieve, then select the feature subset with optimal model performance. The proposed method avoids the shortcoming of priori of feature importance ranking and low efficiency of exhaustive research. Using the features from the proposed method, a turbulence surrogate model is constructed with artificial neural networks driven by the flow field data calculated by N-S solver coupled with S-A model. The training data is selected from only one flow case, and 70 flow cases with different freestream conditions and airfoils are tested. The results indicate that the generalization ability of the constructed model is amazing. The model driven by only one flow case performs well in the whole designed test space after the feature selection. The relative error of the drag coefficient is within 5% for almost all the test cases. This work mainly validates the proposed feature selection method. In future work, it can be compared with some existing feature selection methods for effect and efficiency.  Fig. 18 The residual evolution for case C6 of NACA0012 airfoil