Ensemble of hybrid neural networks to compensate for epistemic uncertainties: a case study in system prognosis

Despite the growing popularity of deep learning, even with recent progresses, many existing methods struggle with applications challenged by non-ideal datasets and partial understanding of input–output relationships. This is the case of prognosis and health management of industrial assets, where operating conditions might be recorded, but failure of equipment is only partially observed and partially understood. In this contribution, we address the issue of incomplete knowledge in models used to estimate time-dependent damage propagation. Specifically, unaccounted corrosion-fatigue of aircraft fuselage panels is presented as a case study. While degradation resulting from mechanical fatigue is accounted by damage models, unforeseen corrosion effects are not, yielding in large discrepancies between predicted and observed crack lengths. To address this model-form uncertainty, we propose using hybrid neural networks. Hybrid models are composed of physics-informed layers addressing well-understood aspects of damage accumulation, while data-driven layers are trained to act as correction terms. To improve overall accuracy, ensemble techniques are adapted to merge resulting predictions. Optimal ensemble weights are derived to help with the task of defining safe operation in the fleet. The proposed case study encompasses highly imbalanced data sets with uncertainties acting asynchronously. Main contributions of the proposed work are: (i) in a modeling perspective, an approach capable of compensating for model-form uncertainty by formulating the numerical integration of governing ordinary differential equations as hybrid recurrent neural network models; (ii) from a prognosis perspective, the proposed methods can be used to rank damage severity, allowing to prioritize aircraft for inspection and/or route reassignment.


Introduction
In industrial segments, as civil aviation, in which degradation is intrinsically related to safety, successful implementation of maintenance programs, requires the ability to properly account for unexpected and/or unforeseen factors. There are several published reports in which failure to identify and account for uncertainties introduced during aircraft service life lead to serious incidents. An illustrative example can be found in NTSB Contributors (1988). Hence, prognosis A. Dourado and F. Viana have contributed equally to this work. and health management (PHM) analyses (i.e., the ability to model and predict the system degradation state) under these uncertain conditions require an understanding of the complex stochastic interactions between operating conditions and system capability. In a way, under such complex and some times unforeseen scenarios, that can lead to large discrepancies between predicted and observed useful life, much of the work resumes on building reliable models for unexpected failure modes and/or machine degradation rates, i.e., models that can compensate for the unaccounted uncertainties effects. Additionally, prognosis models in segments as civil aviation, are used in decision-making scenarios, regarding asset allocation (which aircraft will operate in a given route) or maintenance planning. Such analyses require the ability to properly integrate and forecast in time the rate of degradation for quantities of interest (i.e., extrapolate components accumulated damage state), and are crucial for properly managing aircraft fleets. Thus, the ability to identify and compensate for uncertainties effects, especially epistemic uncertainties (i.e., uncertainties due to incomplete or lack of knowledge), are of utmost importance for prognosis analyses.
Prognosis and health management in itself already encompasses several challenges involving an array of distinct aspects with a rich literature devoted to such topics (Jardine et al. 2006;Sikorska et al. 2011;An et al. 2015;Lei et al. 2018;Diez-Olivan et al. 2019;Guo et al. 2019;Fink et al. 2020). Concerning aviation industry applications, prognosis analysis reflects the complexity of the aircraft as a system, with the modeling task usually focusing on critical sub-systems (e.g., aircraft engines, airframe, actuators, landing gear, and avionics). With regard to airframe components, although there might be many reasons for failure, in terms of prognosis, the two most important forms are fatigue and corrosion (Findlay and Harrison 2002;Bhaumik et al. 2008). Discussions considering crack propagation issue can be found in DuQuesnay et al. (2003); Nagaraja et al. (2007); Xie et al. (2016); Li et al. (2017). Other critical elements include engine components such as bearings and blades. The interested reader is referred to DeGiorgi et al. (2018); Cui et al. (2021a, b) for further discussion on diagnosis and prognosis of such components.
A crucial aspect in prognosis and health management is asset remaining useful life (RUL) estimation. Here the goal is to properly characterize the damage state of any given component, and more importantly accurately estimate is future state. Thus, in the task of RUL estimation for prognosis analysis the key challenge is the extrapolation of damage accumulation models under the influence of varying levels of uncertainty. There is a vast literature concerning uncertainty quantification and estimation in PHM analyses and RUL estimation (Baraldi et al. 2010;Sankararaman 2015;Dewey et al. 2019;Li et al. 2021;Caceres et al. 2021). In this contribution we focus specifically on the effects of epistemic uncertainty (i.e., incomplete or lack of knowledge) in damage extrapolation for prognosis analysis. A case study considering an unaccounted corrosion-fatigue crack propagation issue in a fleet, is used to address the issue of model form uncertainty compensation. For the considered case study, component degradation for low cycle mechanical fatigue is well understood and accounted in the damage accumulation model. However, corrosion effects due to an unexpected exposure to corrosive environments are not accounted by in the damage integration, leading to a large discrepancy between predicted and observed crack length values. Additionally, the consider application is plagued by highly imbalanced data sets, with very sparse observations of accumulated damage, i.e., outputs in the form of crack length values are only available for portion of the fleet during inspection campaigns, resulting in very few crack observations, further hindering the task of deriving accurate prediction models.
As illustrated by Jiang et al. (2018); Hüllermeier and Waegeman (2021), epistemic uncertainties effects in model predictions are usually consider by means of probabilistic frameworks. Even in contributions where non-probabilistic frameworks are proposed as in Clavreul et al. (2013); Tang et al. (2017); Zhang et al. (2018), in which elements of evidence and fuzzy theories are applied; employed approaches are often still closely related to probability theory. In this contribution, we departure from probabilistic frameworks and propose a deterministic approach capable of compensating for incomplete knowledge of the degradation mechanism in the prognosis analysis. Given that damage accumulation can be interpreted as an initial value problem, where the current damage state is estimated by the integration of ordinary differential equations, hybrid recurrent neural network cells are formulated to account for the missing-physics issue in the damage models. In the proposed hybrid recurrent cells, physics-informed layers address well-understood aspects of the damage accumulation, while data-driven layers are trained to account for the epistemic effects, acting as correction terms. Due to the large imbalance in the data sets for the considered application (i.e., severely sparse label set), in this contribution we opted to adapt ensemble learning techniques to improve the overall accuracy of the final model prediction. The key idea is to merge predictions of hybrid physics-informed neural networks, each derived from distinct physical considerations, to account for the uncertain factors, yielding more reliable predictions.
Several contributions reports the use of ensemble learning methods to increase accuracy and/or reliability and robustness in models predictions. In Che et al. (2019), a prognosis model for health assessment of aircraft systems is derived by combining multiple deep learning algorithms. Studies, as shown in Hu et al. (2012), in which the focus lies on the weighting scheme impacts on the overall remaining useful life prediction; or as in Li et al. (2019b, a) and Shi et al. (2021), in which different aspects of ensemble learning are evaluated for the remaining useful life estimation of aircraft bearings and engines; illustrates the main capabilities and contributions of ensemble learning on the scope of prognosis and health management. In Guo et al. (2020a), ensemble learning is used to address class-imbalance problems, while Guo et al. (2020b) and Yousefnezhad et al. (2021) employed ensemble learning to increase prediction accuracy and robustness in classification problems. Chou et al. (2020) proposed an ensemble learning framework to predict shear strength of reinforced concrete beams. Obtained results illustrated the ability of the proposed frameworks to outperformed other machine learning frameworks potentially serving as a better alternative to help in the design of such structures.
As in the aforementioned contributions, here we employ ensemble methods to improve the accuracy of the considered damage accumulation model, but here the considered damage model is inherently biased due to the partial knowledge of the degradation mechanism acting on the fleet. Hence, each base model on the ensemble needs first to learn how to compensate for the epistemic uncertainty effects in its predictions. Additionally, the adapted ensemble technique can be employed to help in the decision-making task of defining a time window for safe operation of each aircraft based on its forecasted damage and predefined safety thresholds. To do so, in this contribution a heuristic based on optimal ensemble weights is presented.
The remaining of the paper is organized as follows. Section 2 presents the considered numerical experiment and the synthetic fleet data generation. Section 3 details the proposed hybrid physics-informed neural network cells alongside the physical aspects considered in each of these cells. The main aspects concerning ensemble learning used in this contribution are briefly discussed in Sect. 4. In Sect. 5, the main results of this contribution are presented and discussed. Finally, Sect. 6 closes the paper recapitulating salient points and presenting conclusions and future work.

Case study-unaccounted corrosion-fatigue in aircraft fleet
For the discussions presented in this paper, we considered a scenario in which due to unforeseen air-carrier operational variations, an aircraft fleet is mainly operating in coastal routes. Such conditions significantly increase aircraft exposure to corrosive environments, conditions that are not accounted in the fleet prognosis analysis, leading to an unforeseen corrosion-fatigue problem. Root cause analysis on inspected aircraft would potentially identify the corrosion effects, and in the long-term manufacturers would be able to update fleet prognosis models with high-fidelity degradation models accounting for such effects. However, in the meantime, operators would still need to reliably run the fleet, take decisions concerning aircraft route structures, and inspection campaigns. Hence, tools capable of rapidly and accurately compensating for the discrepancies introduced by the unaccounted corrosion effects would be critical. Mathematically, such issues consist of an epistemic uncertainty problem, i.e., uncertainty due to incomplete knowledge of the problem, and in a simplified interpretation can be seen as an unknown bias that is plaguing the prognosis model. Different hybrid physics-informed neural networks will be proposed in the following sections to compensate for such epistemic uncertainty.
To emulate such operational conditions, we used a numerical experiment yielding in a synthetic aircraft fleet. The choice for synthetic data, in this study, is due to the proprietary nature and safety concerns involved in sharing failure information in civil aviation, limiting public access and dissemination of such type of data. However, it is worth mentioning that no prior knowledge concerning data generation is used to guide the formulation or training of the correction terms in the proposed hybrid models. Furthermore, details of said terms will be discussed in Sect. 3, but in general, the considered bias terms were derived either as generic additive correction terms to the damage model (see Sects. 3.1 and 3.2), or as correction terms for model parameters/coefficients (Sect. 3.3) and model inputs (Sect. 3.4). Arguably, the key concepts behind each correction term could be extended and adapted to similar damage models, given that no prior knowledge of plaguing bias is used in its formulation, but rather fundamental knowledge of the damage model itself. Lastly, interested readers are referred to Sect. 5.4 in which they can download data sets used in this study and replicate our main findings.

Fatigue crack growth modeling
The previously discussed corrosion-fatigue problem is addressed in the synthetic fleet by predicting crack propagation of an Al 2024-T3 alloy on critical points of aircraft wing panels under varying loading and environmental conditions. Hence, panels are subjected to cyclic mechanical loading alongside saline corrosion. Crack propagation in the fleet is addressed by a cumulative damage model (Fatemi and Yang 1998;Frangopol et al. 2004) describing the irreversible accumulation of damage throughout the useful life of components. For that matter, fatigue crack propagation is usually modeled through Paris law (Paris and Erdogan 1963): where a is the fatigue crack length; t is time; K is the stress intensity range related with the severity of a loading cycle; and C and m are material properties commonly experimentally determined through coupon data, and many engineering materials have these properties documented in handbooks such as MMPDS collaborators (2017). Coupon data can be plotted on a logarithmic scale, as illustrated in Fig. 1, where accumulated fatigue damage in a cycle da/d N can be visualized as a function of the loading cycle severity, expressed here by the stress intensity range K . These type of plots are used to estimate the material properties C and m. The Paris law can be modified to take into account the effect of mean stress, expressed in terms of the stress ratio between the minimum and maximum stress levels R = S min /S max . We use the Walker model (Dowling 2012) to incorporate the stress ratio. The numerical integration of Eq.
(1) leads to its discrete form: where a is the fatigue crack length; the subscripts t and t − 1 indicate the time steps; C 0 and m are properties that depend Fig. 1 Corrosion-fatigue coupon data (adapted from Menan and Henaft (2010)). "Air" curve refers to coupon data obtained from purely mechanical loading, i.e., without corrosive effects. It express accumulated fatigue damage in a cycle da/d N, as a function of the cycle stress intensity range K , when the stress ratio R = 0. For this curve, material coefficients can be estimated as C = 1.132 × 10 −10 and m = 3.859. "NaCl 3.5 %" curve presents data for when the aluminum alloy was previously exposed to a sodium chloride solution, i.e., with corrosive effects. For this curve C = 2.241 × 10 −8 and m = 1.853, and it represents the threshold for environment corrosivity in this contribution. Depending of the environment corrosivity level, accumulated corrosionfatigue damage would yield unknown intermediate curves, resulting in an unknown bias ( a C O R R ) when considered damage model only accounts for mechanical fatigue (i.e., pure air curve) of environmental conditions and γ depends of material and loading conditions. Stress intensity range, K comes from engineering models that use information about geometry, crack length, and internal loads (stress amplitude). Assuming that fatigue damage accumulates under mode I loading condition 1 (Irwin 1957), the stress intensity range K t can be expressed as K t = F S t √ πa t−1 , where F is a geometry factor and S t are far-field stresses.
Corrosion-fatigue is a very complex phenomenon involving pit nucleation, pit growth, fatigue crack nucleation, short crack growth, transition from short crack to long crack, and long crack growth. Modeling corrosion-fatigue has proven to be challenging (Goswami and Hoeppner 1995;Shi and Mahadevan 2001), and it is not our objective to propose a new method (nor defend any preferred method) for it. Instead, we will focus on how a deep neural network based on an Euler integration cell  can be designed and trained to compensate for the corrosion effects in crack propagation.
In any given flight, the vast majority of fatigue cycles are purely mechanical. However, take-off, landing, and taxing cycles are subjected to a mix of mechanical loading and saline corrosion. The level of saline corrosion-fatigue depends on the airport and weather related aspects (seasonality). As shown by empirical studies, such as the one found in Menan and Henaft (2010), corrosion effects drastically increase the damage accumulation rate (shifting the da/d N vs K curve up, when compared to the rates in purely mechanical-fatigue, see Fig. 1). To address such effects, in this paper, for the synthetic fleet data generation, we define a corrosivity index c t , which essentially controls the interpolation between the pure air and 3.5% NaCl curves (see Fig.  1). Pure air represents a purely mechanical-fatigue condition, while degradation in a saline condition of NaCl 3.5% is the most aggressive corrosive environment here considered. The exact degradation curve will depend on the corrosivity index value. We use the curve for air when c t = 0 (and thus the crack propagation is purely mechanical and a function of the coefficients for pure air shown in Fig. 1), while the curve for NaCl at 3.5% is used when c t = 1 (crack propagation is of a corrosion-fatigue condition based then on the coefficients for NaCl shown in Fig. 1). Values of c t in the interval 0 < c t < 1 represent the unknown environmental conditions and are related to the unknown bias that the proposed hybrid neural networks are trained to compensate.
The control points of interest are consider to have an initial damage condition that can be represented as an initial crack length value (a 0 = 0.5 mm). Here, we arbitrarily considered that the crack length information derived from inspection is initially available for only part of the fleet after 9,000 flights-this value was selected to approximated what would be the second "C" inspection cycle (Vieira and Loures 2016). In this case study, we consider that input conditions (i.e., far-field stresses S t , stress ratio R t , and corrosivity index c t ) are fully observed. In real life, S t and R t , i.e., loading conditions, could be estimated using engineering analysis with models relating maneuvers to loads. For the analysis presented in this contribution, the corrosivity index c t is estimated using meteorological data (available in an airport-by-airport basis) by adapting a modified version of the PACER LIME model (Summitt and Fink 1980;Roberge et al. 2002). Further details on the data generation for this case study are discussed in the Appendix section.

Hybrid physics-informed neural network cells for corrosion-fatigue damage accumulation
There are many data-driven approaches (Zhang et al. 2021;Ran et al. 2021) for time-series problems such as damage accumulation. In this work, we focus on recurrent neural networks (Goodfellow et al. 2016), which repeatedly apply transformations to given states in a sequence: where t ∈ [0, . . . , T ] represent the time discretization, y ∈ R n y are the states representing the sequence, x ∈ R n x are input variables, and f (.) defines the transition between time steps (function of input variables and previous states).
In the recurrent neural network terminology, different implementations of f (.) are referred to as cells (Yu et al. 2019). A cell can be as simple as a single-layer perceptron, but it can also assume sophisticated forms such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and the gated recurrent unit (GRU) (Chung et al. 2015). In this paper, we use the Euler integration cell ). This cell was conceived to implement numerical integration of ordinary differential equations using recurrent neural networks. Nascimento and Viana (2020), Dourado and Viana (2020), and Yucesan and Viana (2020) demonstrated that the Euler integration cell can be implemented as a hybrid model where data-driven nodes compensate for model-form uncertainty.
The case study presented in this paper consists of an initial value problem that can be numerically integrated as: where y is the output state representing the sequence (fatigue crack length a, in the considered case study); the subscripts define the time step; f (x t , y t−1 ) combines physics of fatigue damage accumulation and a data-driven node for estimation of corrosion-fatigue damage increment; and x are the inputs to the recurrent neural network.
At this point one might be tempted to ask why use a hybrid recurrent neural network cell for the damage integration instead of employing well-established frameworks as the LSTM or GRU cells given that such cells have reports of successful implementation in seemly more complex applications? The answer lies in the niche of application we target in this paper. Our application of cumulative damage models involves full observation of input conditions (i.e., loads and environmental parameters), and partial observation of damage (crack length). This scenario leads to highly imbalanced data sets plagued by asynchronous uncertainties (i.e., only few cycles in the damage accumulation are subject to corrosive damage) for training and validation. As shown by Nascimento and Viana (2020) and Yucesan and Viana (2021), under these circumstances established recurrent neural network cells (e.g., LSTM or GRU) do not achieve the same performance as the considered hybrid cells, unless the number of observations are increased, which is not representative for the considered case study, given that this would imply performing unscheduled inspection campaigns.

General correction term: bias model
In this study, we assume that fleet prognosis model only considers the failure mechanism of mechanical fatigue. Therefore, damage would be estimated integrating Eq. (2) using C 0 and m for a non-corrosive environment (pure air). However, as previously discussed in Sect. 2 due to operational variations the fleet is exposed to the combined failure mechanism of corrosion-fatigue. Hence, our proposed approach is to train hybrid physics-informed neural networks to model the damage accumulation and compensate for the corrosion effects.
While the base damage increment of mechanical fatigue is given by the physics-based model of Eq. (2), here, a datadriven node adjusts the base damage increment to account for the missing physics of corrosion. Thus, the damage accumulation model in this case can be expressed as: where a is fatigue crack length; the subscripts define the time step; C 0 , m, and γ are material/environmental properties; K is the stress intensity range; g(.) is the unknown functional related to the missing contribution of the corrosion effects in the crack propagation; R is the ratio between the minimum and maximum stress levels; and c is the corro-sivity index estimated using meteorological data (available in an airport-by-airport basis). An illustration of the physical interpretation of Eq. (5) is given in Fig. 2 Here, we use a multi-layer perceptron (MLP) to model corrosion-fatigue damage increment, i.e., a C,t = M L P( S t , R t , c t , a t−1 ).
The multi-layer perceptron takes S t , R t , c t , a t−1 as inputs and returns a C,t , which is never observed (only observed output is the accumulated damage in the form of crack length at specific inspection points). Figure 2b illustrates the resulting recurrent neural network cell in which, its physics-informed portion (highlighted in blue) implements Eq. (2), while its data-driven portion (the MLP highlighted in green) is trained to approximate the unknown functional g(.) of Eq. (5). The multi-layer perceptron is adjusting the damage increment outputted by the physics-informed layers so that it compensates for the effects of corrosion. Therefore, it works as a discrepancy-correction term, or simply as an additive term. (2) for pure air coefficients) and are not subject to training. Here the trainable data-driven nodes (on the bottom in green) try to learn the unknown additive term a C,t , taking as inputs loading conditions (R t and S t ), corrosivity index c t , and accumulated damage a t−1

Revising the inputs for the bias term: bias 1K model
This functional form is a direct review of the general bias cell discussed throughout Sect. 3.1. The bias estimator present in Sect. 3.1 consists of a general corrector term of functional form g( S t , R t , c t , a t−1 ). Due to the underlying functional form of this MLP it acts as a simple additive term, impacted by any unaccounted effects manifested in its inputs (loading condition and environmental effects). In practice, as stated, the data-driven model detailed in Sect. 3.1 acts as a "black box", proposing correction terms without focusing in its "root cause".
Here, we aim to minimize the contribution of the loading condition in the bias term by revising its inputs, as shown in Eq. (6). This revised functional it is still derived as a bias term, proposing correction values to compensate for the missingphysics in the mechanical fatigue damage increment term ( a F,t ) as in the previous model: where a C,t is the bias term related to the missing-physics issue that the data-driven portion of the cell (i.e., the MLP) aims to learn during training. Despite seemly looking like a minor modification to the model detailed in Sect. 3.1, the physical interpretation, and numerical repercussion are very significant. While in the previous cell the data-driven layer is acting as a generalized bias term for the damage accumulation process, here the estimated bias values are more constrained in form. The revised functional form takes the stress intensity range K t as input, instead of observing directly the far-field stresses S t values. This means that the contributions of these stresses to the unknown bias term are not directly inferred, but instead will Fig. 3 Euler integration cell for the proposed Bias K model. Here, modifications concern the considered inputs to the proposed data-driven node tasked in estimating the unknown additive bias term a C,t (bottom portion of the figure highlighted in green). In this implementation, datadriven nodes take as inputs corrosivity index c t , accumulated damage a t−1 , and indirect effects of loading conditions in the form of the stress intensity range K t . Reasoning here is that this modification should minimize the contribution of the loading information in the estimation of the additive term (that for the considered case study is solely due to environmental factors) be indirectly manifested in the stress intensity range K t . Such features may hinder the ability of the proposed cell to identify potential impacts of the loading conditions in the correction terms. The final result potentially is a cell more focused on the environmental effects (represented in form by c t ) in the crack propagation. Thus, we expect that for the case study considered, this hybrid cell will yield the most accurate predictions. In Fig. 3, the related recurrent neural network cell is presented and discussed. As in the previous case, physics-informed nodes are highlighted in blue, and implements Eq. (2), while the data-driven portion of the cell is highlighted in green and address the MLP described by Eq. (6).

Estimating the unknown degradation curve: log model
In this section, the functional form described by Eq. (7) is derived under distinct physical assumptions. As in the previously described hybrid cells, it still formulated as a correction term for the mechanical fatigue damage increment, but contrary to the previous cells, the bias term ( a C,t ) is being directly calculated based on estimates of the unknown degradation curve (see Fig. 4a). Hence, instead of learning to propose correction terms a C,t , the data-driven portion of the hybrid cell focus in providing approximations of equivalent material coefficients C and m, so that the unknown degradation curve can be directly estimated, and the contribution of the missing-physics can be calculated as: where a F,t is the mechanical fatigue damage increment given by Eq.
(2) considering C 0 and m values for pure air, M L P C and M L P m are the multi-layer perceptrons tasked with estimating values for the equivalent coefficients C and m, respectively, as illustrated in Fig. 4b. This is a significant departure from the previously discussed cells. Here the proposed multi-layer perceptrons observe only the stress ratio (R t ) and the corrosivity index (c t ) at any given cycle in order to estimate values for the material coefficients. This implies that this model will potentially be heavily influenced by environmental effects. Lastly, despite being more focused on the environmental effects in the unknown bias related to the missing-physics problem, this model is still capable of accounting for any unforeseen effects (e.g., loading issues, faulty sensors, etc.) by proposing equivalent material coefficient values capable of compensating for such factors. Here we aim to estimate the material coefficients C and m, associated with the unknown corrosion-fatigue degradation curve. After the unknown curve is estimated by means of the inferred material coefficients, the bias in a cycle due to corrosion effects ( a C,t ) is simply calculated and added to the mechanical damage a F,t . For simplicity these operations are carried on a logarithmic basis due to the nonlinear relationship between material coefficients and the stress intensity range (see Eqs. (1) and (2)). (b) In this proposed Euler integration cell, physics-informed nodes (on the top in blue) implement the known aspects of mechanical fatigue and calculation of the corrosion effects contribution. But here we have two trainable data-driven nodes (on the bottom in green) each tasked in estimating one of the unknown material coefficients

Equivalent stress intensity range: 1K model
Lastly, in this section, we propose a functional form more focused on the loading conditions impacts on the missingphysics problem. As detailed in Eq. (8) in this hybrid model, the data-driven layers no longer act as bias estimators for the mechanical fatigue damage increment but rather aims to compensate for missing-physics effects on the stress intensity range ( K t ) by proposing correction terms (bias values) for its value K t = K F,t + K C,t , and where K F,t is the stress intensity range considering purely mechanical fatigue; and K C,t is an additive term in the stress intensity value, related to the missing-physics, that the data-driven portion of the cell (i.e., the MLP) aims to learn how to compensate for during training. Figure 5a presents the main physical aspects of this model, while an illustration of the resulting recurrent neural network cell can be seen in Fig. 5b. This is a more extreme departure from the modeling approaches previously discussed. Here, the focus is on the stress intensity range rather than on the damage increment itself. This is the least representative model for the considered case study of this paper, given that in the proposed case study the source of the epistemic uncertainty is the missing-physics of corrosion, directly related to environmental effects and not significantly impacted by loading conditions. However, the proposed cell should be capable of yielding accurate predic-

Fig. 5 Physical interpretation and Euler integration cell for the K model. (a)
Here the effects of the unknown corrosion bias ( a C,t ), related with the unknown degradation curve, are accounted by in the damage accumulation model, by estimating an equivalent shift in the stress intensity range K . This shift needs to be estimated in such a way that the damage in the pure air curve would be equal to the unknown total damage (total damage in the unknown degradation curve a F,t + a C,t ). (b) In this cell, physics-informed nodes (on the top in blue) still implements the known aspects of mechanical fatigue (as expressed in Eq. (2)). However, here the data-driven node (on the bottom in green) is tasked in estimating an additive correction term to the stress intensity range K t , capable of yielding the required shift the mechanical degradation curve tions for the considered case study, by proposing equivalent stress intensity values capable of compensating for the corrosion effects. Albeit this is not representative of the actual physics of degradation in the case study, the numerical results might be the same, as long as the hybrid model learns to stipulate a valid equivalency between the corrosion effects and the stress intensity range correction terms.

Ensemble learning: weighted average ensembles
In this section, we focus on discussions concerning aggregation techniques to combine the previously described neural networks predictions and increase the overall accuracy. By adapting the weighted average surrogate procedure detailed in Viana et al. (2009), weighted average ensembles can be derived using cross-validation errors, merging thus predictions of these pre-trained neural networks. Hence, a weighted average ensemble intends to take advantage of n pre-trained neural networks (that in this contribution are the previously detailed base models, namely, Bias, Bias K , Log, and K models) aiming to canceling errors in prediction through proper weighting selection in the linear combination of the models and thus yielding better predictions. Mathematically, this can be summarized as: where a Ensemble is the predicted crack length by the resulting ensemble, ω i is the weight associated with the ith base model, and a i is the predicted response by the respective base model (Bias, Bias K , Log, and K models). As detailed by Viana et al. (2009), given a C matrix of base models cross-validation errors, the optimal weighted ensemble can be found through the minimization problem: where c i j = 1 p e T i e j , p is the number of data points in the cross-validation, and e i and e j are the cross-validation errors of base models i and j, respectively.
The solution of the optimization detailed in Eq. (10) is obtained through Lagrange multipliers as presented by Eq. (11). To avoid issues with negative weights as well as weights larger than one, Eq. (11) is solved considering only the diagonal terms of matrix C.
Thus, by adopting such procedure to build the neural networks ensemble brings the key advantage of using closed-form solution to determine the ensemble weights, not requiring thus, additional computational costs of performing the ensemble weights optimization. In the following sections, optimal weights for the considered case study will be presented and discussed. Additionally, a brief discussion concerning the impact of the base models discrepancy (i.e., the variability in the base models prediction trends) on the ensemble accuracy will be addressed.
Lastly, we devise a heuristic metric based on the ensemble weights to guide the decision-making task of defining a time-window for safe operation of the fleet. Here, the goal is to define a time range for prediction extrapolation (fleet prognosis) in which we are confident that the ensemble is still accurate enough to determine aircraft operability. To do so, we considered the ensemble weights as metrics of confidence in each of its base models, i.e., a weight of 90% for instance, indicates that we have a 90% confidence in the respective base model prediction. Hence, a heuristic metric can be derived based on the amount of flights predicted by each base model to achieve a given crack length threshold (a T H ) defined on a operational basis: where t Forecast Ensemble is the total number of flights predicted by the ensemble to achieve the given threshold (actively defining a time-window for damage forecast and fleet prognosis analysis), t i a→a T H are base models prediction for the number of flights to achieve the crack length threshold, and ω i are the base models respective weights.

Results and discussions
In this section, we will present and discuss results related to the fundamental questions of this paper, mainly, focusing if it is advantageous to combine several neural networks into an ensemble or if is better to simply use the best base model. In other words, if there is a gain, in prediction terms, of using the proposed ensemble approach. We start by discussing the use of cross-validation analysis to properly rank the base models and guide the selection of the best base model. Then, we shift our discussion to the ensemble responses, in terms of optimal weights and a gain analysis in comparison with the best base model. Lastly, we discuss the use of the ensemble weights to determine a time-window for damage forecast in the fleet. Parameters * 51 * Differences in the total number of parameters are due to differences in the number of inputs

Physics-informed neural network design
In order to build the previously detailed hybrid physicsinformed neural network cells, the first step is to define the architecture for the multi-layer perceptrons of each cell. Dourado and Viana (2020) presented a discussion on multi-layer perceptron architecture selection based on crossvalidation. Here, for simplicity, we will use the architectures presented in Table 1. Training of the recurrent neural network cells was performed using RMSProp 2 as an optimizer, set up with varying values of learning rate (from 10 −9 up to 10 −3 ) considering 25 epochs. Mean-squared error (MSE) loss function was adopted for all the recurrent neural network cells training.
The optimization of recurrent neural network hyperparameters can be challenging, and it is conducted through a gradient-descent algorithm. An initial guess for the hyperparameters that is far away from the optimum might cause divergence of the optimization process or at least significantly increase in training time. In this paper, we also used the auxiliary plane approach proposed in Dourado and Viana (2020) to help in the initialization of multi-layer perceptron hyperparameters. This procedure was used only to yield an initial point for the recurrent neural network cells training, and it does not interfere with the cells overall performance.

Cross-validation analysis and model ranking
In this section, we evaluate the use of cross-validation errors to directly infer the ranking of the base models and guide the selection of the "best" base model in the pool. To carry such an analysis, we first need to define a cross-validation procedure from which we will extract the cross-validation errors. We adopted cross-validation procedures as presented in Kohavi (1995). To perform this analysis, we applied the following measures: we split the fleet into portions, from which, 20 sample sets containing each 20 aircraft derived from combinations of the same 40 aircraft were defined. On each sample set, a cross-validation analysis, considering a "leave-one-out" strategy was carried. This procedure is repeated until all aircraft in the sample set were removed once. Figure 6a illustrates the root mean-squared error (RMSE) distribution of the cross-validation errors for each base model considering all the sample sets. This cross-validation analysis seems to point out that in the vast majority of the samples the "best" model was the Bias K model. The ranking from the remaining models varies on a sample basis, but considering the average behavior shown in Fig. 6a, the cross-validation analysis indicates a ranking in which the best models are the Bias models, with the Log model being the third in the pool, while the K model being ranked as last. This ranking is in agreement with the expected behavior for the considered corrosion-fatigue case-study given the aspects of each base model discussed in Sect. 3. Figure 6b brings a comparison of the ranking considering the cross-validation analysis and the true ranking on each sample. In all the considered sample sets, the Bias K model was the most accurate model, while the K model was the least accurate. Albeit not identifying the true ranking on all the sample sets, the considered crossvalidation procedure can be safely used to segregate the base models, and identify the overall ranking of the base models.

Merging the predictions: ensemble analysis
In this section, we present and discuss the main findings considering the previously discussed weighted average ensemble approach (see Sect. 4) to merge the base neural networks predictions. Figure 7a illustrates a possible set of weights for the neural network ensemble considering the cross-validation errors of a given sample set. For simplicity presented weights were rounded up. As we can see, the derived weights highly favor the Bias K model and hardly differentiate the remaining base models. This a direct reflection of the behavior illustrated in the cross-validation error distribution shown in Fig. 6a. A possible explanation for the ensemble weights  Fig. 7a, despite the K model clearly overestimation of larger crack lengths, is most likely due to the fact that the best model in this case (the Bias K model) slightly underestimate such cracks. It is worth highlighting that Fig. 7a presents a single example of ensemble weights. Many other combinations of weights can be found considering the other sample sets in the cross-validation analysis.
The results illustrated by Fig. 7a helps demonstrating the task of merging the base models predictions. However, in Fig.  7a, we can observe that the ensemble responses are very close to the predictions of the best base model (Bias K model) in the pool (due to the high weight value associated with such model). Albeit such behavior may not be observed in other sample sets, it raises an important question. There is any gain of using such ensemble predictions instead of simply using the best base model? Or simply putting what is the advantage of using such ensembles in comparison to just using the best base model predictions? To address such questions, we perform a gain analysis for the ensemble predictions based on the metric define as: where R M S E Best is always related to the true best base model on a sample set basis. Therefore, it changes with the considered data set and it is an idealization, in the sense that it would not be easily define in practical applications given that is derived from actual errors (based on true crack length for all aircraft in the sample). R M S E Ensemble is related to the ensemble prediction errors. The results of this gain analysis are shown in Fig. 7b with varying number of base models in the ensemble. Except for two outliers we can see that when the number of base models is equal to 1 (i.e., the ensemble is composed only of the Fig. 7 Ensemble predictions for a given sample and ensemble gain evaluation best base model) the described metric is zero. This illustrates once again how the ranking based on the cross-validation errors analysis can safely and accurately identify the best base model in the pool (R M S E Best ).
When the number of base models in the ensemble is larger than 1, the defined metric can either be positive or negative. Negative values indicate that for the considered weight combination in the ensemble, the predictions will be worse than the ones generated by the best base model, meaning thus that at this condition, it does not pay off to combine the base models. Positive values indicate that the ensemble predictions are more accurate than the ones generated by the best base model, and thus, it is worth it to combine the model predictions. In Fig. 7b, we can observe that as the number of base models in the ensemble increases, the gain associated with such ensembles also increases, meaning that the ensem- Fig. 8 Examples of sets resulting in positive and negative gains, respectively. Diversity in the base models predictions helps yielding ensembles that outperform the best base model. Ensembles are incapable of overcoming underlying under or over prediction trends by all base models ble accuracy surpasses the accuracy of best base model. This is an indicative that the ensembles gain is related to its base model diversity.
To better illustrate the role that the base models diversity plays on the ensemble performance, consider the scenario presented in Fig. 8b. In this example, the resulting ensemble presents a negative gain value, indicating that its predictions are worse than the best base model in the pool. But a closer inspection on the base models behavior for this scenario indicates an underlying bias from the base models, i.e., for this scenario all base models have an underestimation trend. In other words, this scenario represents a lack of diversity condition for the ensemble, and at such conditions the resulting ensemble will invariably yield worsen predictions than the best base model. Figure 8a illustrates a scenario in which a clear spread concerning the base model predictions can be observed, i.e., there is a "disagreement" among the base model predictions. In such conditions, the inherent diversity in the base models prediction will help yielding an ensemble that can outperform the best base model. These results further collaborates with the observation that even relatively "poor" base models actively contributes to the ensemble accuracy, and suggests that there is no gain in removing the "worst" base model from the ensemble.
Lastly, we focus our discussions on the performance of the proposed metric to determine "confiability" of damage forecast. Fig. 9a illustrates the key ideas of proposed heuristic metric for a given aircraft in a given sample set. Each base model yields different predictions for the number of flights to achieve the arbitrary crack length threshold (a T H ) of 20 mm, depending of its underlying conservativeness and accuracy. The proposed metric weight the base models estimates considering the ensemble optimal weights into an average prediction. It is worth mentioning an interesting contribution of the proposed heuristic metric. The ensemble prediction for the number of flights to achieve the defined threshold will not coincide with the metric estimate. This is due the accumulated effect of the base models over or underestimation in the ensemble forecast. The ensemble prediction is given by the linear combination of the base models estimates for crack length on every loading cycle, while the heuristic is the linear combination of a single value (i.e., the number of flights until the threshold). This observation highlights the value of the proposed heuristic. In Fig. 9b, we present the comparison between the predicted number of flights to achieve the predefined crack threshold given by the proposed metric, and the actual number of flights required to achieve this crack value. The comparison considers only the aircraft that surpassed the defined threshold within 13,000 flights or less in a given sample set (i.e., we considered here 4,000 flights of damage forecast). The proposed metric predictions are relatively close to the actual number of flights, with observed differences in the range of few hundreds flights. Additionally, for the results shown in Fig. 9 we can see that the proposed metric predictions has an underlying conservatism concerning the actual number of flights. Albeit this observed conservatism is highly related to the underlying behavior of the base models in the ensemble, in a safety related application as in aviation, a given level of conservatism in prognosis is certainly desired. The presented results are a strong indicative of the accuracy of the proposed heuristic, indicating that the proposed approach could safely be adopted in fleet prognosis analysis, helping ranking aircraft damage in the fleet and helping prioritizing inspection.

Replication of results
Simulations were conducted using a laptop configured with an Intel Core i7-7820HQ CPU at 2.90GHz, 32GB of RAM, and NVIDIA Quadro M620 graphical processing unit running Windows 10. Our implementation is all done in Tensor-Flow 3 using the Python application programming interface (version 2.0). Links to required data sets and Python scripts demonstrating the proposed framework can be found in Dourado and Viana (2019) 6 Summary and closing remarks In this paper, we focus on the development of weighted ensembles of hybrid neural networks capable of minimizing the impact of epistemic uncertainties in model prediction and extrapolation. To illustrate the key concepts of the proposed formulation, a numerical study considering an unaccounted exposure of an aircraft fleet to highly corrosive environments leading to an unforeseen corrosion-fatigue crack propagation issue was presented. The fleet prognosis model only accounts for mechanical fatigue in its damage accumulation analysis, ultimately yielding in an epistemic uncertainty due to the missing physics of the corrosive damage.
Firstly, we presented distinct approaches to derived hybrid physics-informed recurrent neural network cells to accounted for the epistemic uncertainties effects on damage accumulation. Each neural network cell was defined based on specific physical interpretations of the corrosion-fatigue problem, proposing each, unique ways to account for the missing physics problem. The major achievement of such hybrid models is the ability to use data-driven layers to compensate for the missing physics in reduced-order models and accurately estimate damage accumulation. Additionally, the physics-informed layers reduce the need for larger data sets found in purely data-driven approaches. Presented results indicate that despite varying levels of conservativeness in predictions, all proposed hybrid cells are able to capture the overall trend of fleet crack propagation, and in its own way, compensate for the corrosion effects.
Then, we evaluated the use of weighed average ensembles, derived from the linear combination of the proposed hybrid cells. Ensemble weights were defined based on cross-validation analysis, and, the key idea of the proposed approach, is to take advantage of existing pre-trained neural networks, aiming to canceling errors in prediction through proper weighting selection, ultimately yielding better predictions. Discussions concerning the advantage of using the proposed neural network ensemble over the best base model were addressed. Results here presented shown the impact of the base models diversity in the ensemble gain. Here results point out to the conclusion that increasing the number of base models in the ensemble, associated to an increment in the "discordance" of the base models predictions, significantly 3 www.tensorflow.org. increases the ensemble gain in accuracy. Lastly, we devised an heuristic metric based on the ensemble optimal weights to define a time-window for safe operation of the fleet. As illustrated by the results presented in this contribution, the proposed heuristic is capable of accurately defining a valid time-window for forecast, indicating that the proposed metric could safely be adopted in fleet prognosis analysis.
Despite the promising results, the proposed approach can be further developed. Concerning future directions we suggest: evaluating a possible convergence of the ensemble accuracy concerning the number of base models; investigate the ensemble performance when multiple sources of uncertainty plague the damage accumulation process; extend the proposed framework to consider complex combined failure modes simultaneously acting on the fleet; and evaluate if the proposed neural network ensemble is capable of automatically learn to identify the root cause of the unexpected observations and properly detect such unforeseen factors. We can also further study how to refine or tune the parameters of the first-principle models. This is a challenge problem by itself and could entail substantial enhancements in our framework as well as reveal the need of differentiated sources of data (such as material testing and component testing data). Finally, in our work, we focused on a single output (fatigue crack length); however, there might be cases in which fleet optimization requires handling multiple outputs (damage, performance, safety, etc.). In such cases, we could expand the framework to a multi-objective formulation, as discussed in Deng et al. (2022).  Table 2. Based on the load magnitude, each loading cycle of the flight can be translated into far-field stress ( S = S max − S min ) and stress ratio (R = S min /S max ) values. Each flight type is characterized by a load frequency distribution. Therefore, flights present dif- IX 0 0 0 0 1 7 X 0 0 0 0 0 3 ferent severity levels in terms of mechanical loads. In order to balance out the exposure of the fleet to the severe flights, we considered the 10 arbitrarily designed mission mixes. In our study, each aircraft in the fleet is assigned to only one specific mission mix. Within a given mission mix, flights types are assigned to the aircraft randomly following the probability distribution detailed in Table 3.
In this paper, we assumed that corrosive cycles represent, on average, 5% of the total number of cycles in one mission (loads of type E and F are related to take-off and landing,  respectively, and may be subjected to corrosion-fatigue). We then proceed to penalize this cycles by accumulating damage using curves that are between air and NaCl at 3.5% (see Fig.  1). In our model, the exact curve will depend of a corrosivity index c t . While the actual relationship between the corrosivity index and the C 0 and m constants is unknown; here, we arbitrarily used the following equation to generate synthetic data: C 0 = − 2.275 × 10 −8 (c t ) 2 + 4.505 × 10 −8 c t + 1.132 × 10 −10 , and m = 2.046(c t ) 2 − 4.052c t + 3.859.
Several factors need to be considered while determining the corrosivity index c t . Here, we use a modified version of the PACER LIME model Summitt and Fink (1980); Roberge et al. (2002), illustrated in Fig. 10b. In this case study, we considered corrosion index values coming from a truncated normal distribution (i.e., 0 < c t < 1), with scale σ = 0.05, and location (μ c t ) defined by the modified PACER LIME model severity levels (see Fig. 10b). In our modified model, this levels are: "AA", very severe corrosivity level (c t = 1); "A", severe corrosivity level (μ c t = 0.844); "B", moderate corrosivity level (μ c t = 0.608); and "C", mild corrosivity level (μ c t = 0.361). On each flight mission two airports must be associated to a flight. To help structure this step in the synthetic simulation, we arbitrarily defined 10 routes. Hence, at each flight mission an aircraft is randomly assigned to a given route, and therefore, to specific corrosivity indexes.
Following all the previously discussed steps, the fleet simulation can be summarized as: an aircraft is randomly assigned to a route; at the beginning of each flight mission random draws are performed to define the corrosivity index (based on the route structure and the severity levels presented in Fig. 10b), and the mission mix configuration (see Table  3); the specific flight type for each mission is then randomly assigned following the mission mix probability distribution (see Table 3); based on the selected flight type, loads (i.e., S and R values) are applied following the distributions detailed in Table 2.