The Research Ethics Committee of the Pedro Ernesto University Hospital (HUPE) approved the study, which was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all volunteers before inclusion in the study.
5.1. Studied subjects
The data used in this work were obtained through the FOT. The examinations were carried out at the Biomedical Instrumentation Laboratory of the Rio de Janeiro State University. The exam was repeated three times with each volunteer, and each value used in this work is the average of these three measurements. Seventy-two individuals took part in the study: 25 were healthy volunteers, representing the control group, and 47 were patients with sarcoidosis. Among the latter, spirometry showed that 24 had normal results, representing the normal spirometry group, and 23 had respiratory changes, representing the altered spirometry group.
5.2. Forced oscillation measurements and features
The FOT comprises applying oscillations with a low-pressure amplitude to an individual's respiratory system using an external device. While the individual remains seated, wearing a nose clip, and breathing spontaneously, pressure signals at frequencies that are multiples of 2 in the 4-32 Hz range are applied at the entrance of the respiratory system. We measured the applied pressure (P) and the airflow (V') induced by it. Then, the Fourier transform (F) was used to estimate the respiratory impedance (Zrs = F(P)/F(V')), from which we can generate resistance and reactance curves as a function of frequency.
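The estimation Zrs = F(P)/F(V') can be sketched with NumPy. This is a minimal illustration, assuming P and V' are sampled at a common rate and the excitation frequencies fall close to FFT bins; the actual acquisition and signal-processing details of the FOT device are not specified here.

```python
import numpy as np

def respiratory_impedance(p, v_flow, fs, freqs):
    """Estimate Zrs = F(P)/F(V') at the excitation frequencies
    (multiples of 2 Hz in the 4-32 Hz band in this study)."""
    n = len(p)
    fft_p = np.fft.rfft(p)
    fft_v = np.fft.rfft(v_flow)
    bin_freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    z = {}
    for f in freqs:
        k = int(np.argmin(np.abs(bin_freqs - f)))  # nearest FFT bin
        z[f] = fft_p[k] / fft_v[k]                 # complex impedance at f
    return z
```

The real part of each Zrs value gives the resistance curve and the imaginary part gives the reactance curve.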
To interpret the resistance data, we used a linear regression in the 4-16 Hz range to estimate the resistance at the intercept (R0), the slope of this curve (S), and the average resistance in this range (Rm). R0 and S are related to the respiratory system's total resistance and to ventilation inhomogeneity, respectively, while Rm is related to the central airways' resistance [11].
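The three resistance features can be derived from the resistance curve as follows; this is a sketch of the description above, assuming the resistance values are given per excitation frequency.

```python
import numpy as np

def resistance_features(freqs, resistance):
    """Derive R0 (intercept), S (slope) and Rm (mean) from the
    resistance curve restricted to the 4-16 Hz range."""
    freqs = np.asarray(freqs, dtype=float)
    r = np.asarray(resistance, dtype=float)
    mask = (freqs >= 4) & (freqs <= 16)
    # First-degree polynomial fit: linear regression of R over frequency
    slope, intercept = np.polyfit(freqs[mask], r[mask], 1)
    return intercept, slope, r[mask].mean()
```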
The resistance measured at low frequencies is associated with the airways' total resistance, while at high frequencies it is related to the central airways' resistance. The difference between them is usually interpreted as an index of small airway obstruction and heterogeneity of ventilation [12]. Accordingly, the other features analyzed are the resistance at 4 Hz (R4), the resistance at 20 Hz (R20), and the difference between them (R4-R20).
To interpret the reactance results, we calculated the dynamic compliance (Cdyn) from the reactance obtained at 4 Hz [13]. At this same frequency, we calculated the absolute value of the respiratory impedance (Z4), a feature associated with the work the respiratory muscles must perform to overcome resistive and elastic loads and allow airflow in the respiratory system [11]. The average reactance (Xm) is also associated with the inhomogeneity of the respiratory system, and we calculated it from the reactance curve over the entire frequency range studied (4-32 Hz) [14]. We also evaluated the resonant frequency (Fr), at which respiratory elastance and inertance make equal and opposite contributions, resulting in a reactance of zero. Finally, we measured the area under the negative part of the reactance curve (Ax) between 4 Hz and Fr, which reflects the elastic properties and ventilation heterogeneity of the respiratory system [15].
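The reactance-derived features can be sketched as below. The Cdyn formula assumes the common convention X(f) = -1/(2*pi*f*Cdyn) at low frequency, and Fr is located by linear interpolation of the zero crossing; both are illustrative assumptions, not the paper's exact computation.

```python
import numpy as np

def reactance_features(freqs, x):
    """Cdyn from X at 4 Hz, resonant frequency Fr (zero crossing of X),
    and Ax (area under the negative reactance curve from 4 Hz to Fr)."""
    freqs = np.asarray(freqs, dtype=float)
    x = np.asarray(x, dtype=float)
    x4 = x[freqs == 4][0]
    cdyn = -1.0 / (2 * np.pi * 4 * x4)  # assumes X(4) = -1/(2*pi*4*Cdyn)
    # Fr: interpolate the first negative-to-positive crossing of X
    i = np.where((x[:-1] < 0) & (x[1:] >= 0))[0][0]
    fr = freqs[i] - x[i] * (freqs[i + 1] - freqs[i]) / (x[i + 1] - x[i])
    # Ax: trapezoidal integration of -X from 4 Hz up to Fr
    sel = (freqs >= 4) & (freqs < fr)
    fa = np.append(freqs[sel], fr)
    xa = np.append(x[sel], 0.0)
    ax = np.sum((fa[1:] - fa[:-1]) * -(xa[1:] + xa[:-1]) / 2)
    return cdyn, fr, ax
```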
5.3. Extended RIC model features
The impedance curves provided by FOT may be interpreted using engineering concepts, correlating them with models composed of electrical components analogous to the resistance, inertance, and compliance of the respiratory system. The extended RIC (eRIC) model used (Figure 1) contains a peripheral resistance (Rp) in parallel with the respiratory compliance (C), in series with the central resistance (R) and the respiratory inertance (I) [12]. We define the total resistance (Rt) as the sum of R and Rp.
Several studies have already been carried out using this model, such as associating model features with abnormalities in silicosis [16], showing that the model can aid in the early diagnosis of chronic obstructive pulmonary disease (COPD) [17], and using these features to detect mild obstruction in asthma [18]. We can calculate the impedance equivalent to the eRIC circuit according to Equation 1.
Thus, it is necessary to find the feature values that minimize the error between the impedance measured at discrete frequencies and its respective analytical result. We estimated them using the ModeLIB program developed in our laboratory, which estimates the model parameters using the Levenberg-Marquardt algorithm to determine the set of coefficients of the nonlinear model that best represents the input data set in the least-squares sense.
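The fitting step can be sketched with SciPy, whose `least_squares` routine offers the Levenberg-Marquardt method (`method="lm"`). The impedance expression used below, Z = R + jwI + Rp/(1 + jwRpC), is a common formulation of the eRIC circuit and should be checked against the paper's Equation 1; the starting values are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def eric_impedance(f, r, i, rp, c):
    """eRIC impedance: Z = R + jwI + Rp/(1 + jwRpC) (common formulation)."""
    w = 2 * np.pi * np.asarray(f, dtype=float)
    return r + 1j * w * i + rp / (1 + 1j * w * rp * c)

def fit_eric(freqs, z_measured, x0=(1.0, 0.01, 1.0, 0.01)):
    """Least-squares fit of (R, I, Rp, C) with Levenberg-Marquardt,
    stacking real and imaginary residuals."""
    def residuals(p):
        d = eric_impedance(freqs, *p) - z_measured
        return np.concatenate([d.real, d.imag])
    return least_squares(residuals, x0, method="lm").x
```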
5.4. Datasets
This study carried out the experiments in a dataset with 16 input features (11 FOT indexes and five eRIC model components) from 72 exams. The measurements were performed in 25 healthy volunteers and 47 patients with sarcoidosis: 24 with normal conditions according to the spirometry and 23 with respiratory changes.
5.5. Machine Learning Algorithms
Machine Learning (ML) is a field of Artificial Intelligence that gives computers the ability to learn without being explicitly programmed to do so [19]. Its methodologies are mainly used in problems with no deterministic solution, using data so that the algorithms automatically discover the relationships within them. Artificial intelligence/machine learning methods have been developed to improve pulmonary function analysis since the 1980s [20]. Previous works have reported that it is feasible to apply ML algorithms to the features obtained by FOT to improve the diagnosis of respiratory diseases [13][21][22][23][24][25].

Besides providing accurate results, the explanation of a classifier is relevant in the study of respiratory diseases. Knowing how the classification is performed and which features are most important can enhance our knowledge about the diagnosis and contribute to our understanding of the underlying pathophysiology. The development of a set of interpretable models and methodologies that result in more understandable models while maintaining excellent prediction performance is the major goal of a new topic of study called Explainable Artificial Intelligence (XAI) [26]. Regrettably, there is no universally accepted definition of explainable. Some researchers use the terms interpretability and explainability interchangeably, while others distinguish between the two. The authors in [27] define interpret as "to explain or present in language that humans can understand." Other authors in [28] define interpretation as the translation of abstract concepts into a domain humans can understand, whereas explanation is the collection of the features of the interpretable domain that have led to the production of a decision in a specific example. The notion of explanation and interpretation in this work is aligned with [28].
Therefore, in this study, we explore Genetic Programming (GP), because its classification is performed by intelligible expressions that can be interpreted, and we also study the subsets of optimal features selected by the feature selection methods to explain which FOT parameters are the most discriminative.
GP is a method used to build programs, which fits into the family of evolutionary algorithms. Each program is an individual whose fitness depends on the execution of that program. The most common representation for a GP individual is a tree [29]. The terminal nodes (leaves) represent the features, and the internal nodes represent the functions that operate on the leaves. Figure 2 shows the tree representation of the program y=ln(x1)+5*x2 as parent 1, and of the program y=sin(x1)-x2/2 as parent 2. However, other forms of representation have become popular, such as graphs, lists, and grammars [30]. In each case, the genotype is the computational representation of the program, and the phenotype is its interpretation, more understandable to the user. Some of the most important characteristics of genetic programming are that it requires no, or only minimal, pre-processing of inputs and post-processing of outputs, and that it has a built-in feature selection mechanism that allows GP to select only the most useful features from the dataset. The evolutionary process takes place in the problem domain. Because the outputs are already expressed in this problem domain, there is no need for translation or mapping processes [29].
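A GP tree can be represented and evaluated with a few lines of code. The nested-tuple encoding and function set below are a toy illustration, not the representation used by gplearn.

```python
import math

# Function set: internal nodes hold a function name, leaves hold a
# feature name or a constant (illustrative toy representation).
FUNCS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
         "mul": lambda a, b: a * b, "ln": math.log, "sin": math.sin}

def evaluate(node, features):
    """Recursively execute the program tree on a dict of feature values."""
    if isinstance(node, str):            # terminal: a feature name
        return features[node]
    if isinstance(node, (int, float)):   # terminal: a constant
        return node
    op, *args = node                     # internal node: (function, children...)
    return FUNCS[op](*(evaluate(a, features) for a in args))

# y = ln(x1) + 5*x2, the "parent 1" program from Figure 2
parent1 = ("add", ("ln", "x1"), ("mul", 5, "x2"))
```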
The procedure followed by GP comprises randomly generating the first population and evolving it through generations until a stop criterion is reached, for example, when an optimal individual is found or a maximum number of generations has been reached. Each generation consists of evaluating each individual's fitness and selecting some individuals to which genetic operators are applied, generating offspring. Individuals are chosen probabilistically based on their fitness; individuals with higher fitness therefore have a better chance of being chosen. The tournament method is the most commonly used selection method in genetic programming. It involves selecting a subset of individuals at random from the population; they are compared, and the best individual of this group is chosen to be a parent. In terms of evolutionary operators, genetic programming favors the crossover operator, and the subtree crossover is the most commonly used. In this method, a crossing point (node) is chosen at random and independently in each of two parents. The offspring are formed by removing from the parents the subtrees whose roots are the chosen crossing points and recombining the remaining trees at these points. Figure 2 shows an example of this process, where the crossing points and the corresponding subtrees are highlighted: parents 1 and 2 are combined to generate offspring 1 and 2. This process is done with copies of the selected parents, so the parents are not eliminated in the process. The most frequently used mutation operator is the subtree mutation, in which a mutation point is chosen randomly and the subtree whose root is the mutation point is replaced by a randomly generated subtree.
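Tournament selection, as described above, can be sketched as follows (a minimal illustration, assuming higher fitness is better).

```python
import random

def tournament_select(population, fitnesses, k=3, rng=random):
    """Pick k individuals at random; the fittest of the group
    becomes a parent. Parents are not removed from the population."""
    contenders = rng.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]
```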
The Grammatical Evolution (GE) algorithm [30][31][32] is based on both the biological process of producing a protein from genetic material and the broader genetic evolutionary process. The genome is composed of DNA that is transcribed into RNA as a string of building blocks. After that, the RNA codons are translated into amino acid sequences and used in the protein; the phenotype is the protein's response to its surroundings. In GE, the phenotype is a computer program derived from a binary-string genome. The genome is decoded into a series of integers that are then mapped onto the program's pre-defined rules, known as the grammar, which are defined in Backus-Naur Form (BNF). To map genotype to phenotype, a one-to-many process with a wrapping feature is used. This is analogous to the biological process that occurs in many bacteria, viruses, and mitochondria, where the same genetic material is used to express multiple genes. The mapping increases the robustness of the process, both in terms of being able to use structure-agnostic genetic operators on the subsymbolic representation during the evolutionary process and of being able to generate well-formed executable programs from the representation. Thus, even if the fundamentals are the same, using a different grammar can cause a model to produce significantly different results. This adaptability allows the grammar to be tailored to a wide range of problems, making GE extremely useful.
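The genotype-to-phenotype mapping with wrapping can be illustrated with a toy BNF grammar. The grammar and integer codons below are invented for illustration; the grammar actually used in this work is the one in Figure 5.

```python
GRAMMAR = {  # toy BNF grammar, illustrative only
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["-"]],
    "<var>":  [["x1"], ["x2"]],
}

def ge_map(genome, start="<expr>", max_wraps=2):
    """Decode an integer genome into a phenotype: each codon (mod the
    number of productions) picks the rule for the left-most non-terminal;
    the genome is reused ('wrapped') if it runs out."""
    seq, i, wraps = [start], 0, 0
    while any(s in GRAMMAR for s in seq):
        if i == len(genome):             # wrapping feature
            i, wraps = 0, wraps + 1
            if wraps > max_wraps:
                raise ValueError("mapping did not finish: invalid individual")
        nt = next(j for j, s in enumerate(seq) if s in GRAMMAR)
        choices = GRAMMAR[seq[nt]]
        seq[nt:nt + 1] = choices[genome[i] % len(choices)]
        i += 1
    return "".join(seq)
```

Because the choice is taken modulo the number of productions, many genotypes map to the same phenotype (a one-to-many, redundant encoding), which is what lets structure-agnostic operators act on the genome while the output stays a well-formed program.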
We used GE and tree-based GP as interpretable classifiers. They can derive a mathematical expression to compute a score that indicates the probability that a patient belongs to a specific class, or they can synthesize Fuzzy Pattern Trees [33].
Because it allows knowledge in the data to be expressed in a comprehensible form, similar to natural language, fuzzy set theory has provided a framework for developing interpretable models [34][35], giving the model a higher degree of interpretability. The majority of fuzzy models developed are fuzzy rule-based systems (FRBS), which can handle both classification and regression. It may be difficult to obtain fuzzy models based on easily interpretable rules because, depending on the application, many rules with many antecedents may be required, making the model difficult to understand. A system with fewer rules, on the other hand, is easier to understand, but its predictive accuracy suffers as a result. Therefore, we decided to employ the Fuzzy Pattern Trees (FPT) method, which is based on the theory of fuzzy sets and is not rule-based but hierarchical.
Terminal nodes in FPTs hold fuzzy features, and internal nodes hold fuzzy operators. FPTs can employ a variety of operators. There are aggregation operators, which can be t-norms or t-conorms: the former includes operators with the logical connector AND, such as the minimum operator, and the latter includes those with the connector OR, such as the maximum operator. Average operators, such as the WA (weighted average) and the OWA (ordered weighted average), are another type. There are also concentration and dilution operators, which take only one input and decrease or increase its membership value. The square of the input value is the simplest concentrator, while the square root of the input value is the simplest dilator. Table 2 summarizes the expressions for the fuzzy operators used in this work, where a and b are their inputs and 0 < r < 1.
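These operators can be written out directly. The definitions below follow common conventions consistent with the description above (the two-input OWA weights the larger input by r); the exact expressions used in this work are those in Table 2.

```python
def t_norm_min(a, b):                 # fuzzy AND: minimum t-norm
    return min(a, b)

def t_conorm_max(a, b):               # fuzzy OR: maximum t-conorm
    return max(a, b)

def wa(a, b, r):                      # weighted average, 0 < r < 1
    return r * a + (1 - r) * b

def owa(a, b, r):                     # ordered weighted average (two inputs)
    return r * max(a, b) + (1 - r) * min(a, b)

def concentrate(a):                   # simplest concentrator: square
    return a ** 2

def dilate(a):                        # simplest dilator: square root
    return a ** 0.5
```

All of them map membership values in [0,1] back into [0,1], which is what keeps the partial results of an FPT interpretable at every node.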
Fuzzy logic is used to build more meaningful trees in order to improve the interpretability of the evolved models. To that end, we adopted the straightforward fuzzification scheme presented in Figure 3, where X is any feature, X_max is the highest X value in the dataset, and X_min is the lowest. The membership functions are triangular, and there are three fuzzy sets for X, defined as shown in Table 1.
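A triangular three-set fuzzification over [X_min, X_max] can be sketched as below. The breakpoints (Low peaking at X_min, Medium at the midpoint, High at X_max) are an assumption for illustration; the scheme actually used is defined in Figure 3 and Table 1.

```python
def fuzzify(x, x_min, x_max):
    """Map a crisp feature value onto three triangular fuzzy sets
    (Low, Medium, High) spanning [x_min, x_max]."""
    mid = (x_min + x_max) / 2
    half = (x_max - x_min) / 2
    low = min(max(0.0, (mid - x) / half), 1.0)     # peaks at x_min
    high = min(max(0.0, (x - mid) / half), 1.0)    # peaks at x_max
    medium = max(0.0, 1.0 - abs(x - mid) / half)   # peaks at the midpoint
    return {"Low": low, "Medium": medium, "High": high}
```

Applied to all 16 features, this scheme yields the 48 fuzzy inputs used later in the feature-engineered experiments.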
Figure 4 shows an FPT example in which the tree represents the class "high-quality wine." The alcohol content, acidity, and concentrations of sulfur dioxide and sulfates are the input attributes. Each is associated with a fuzzy term that represents a range in the attribute's universe of discourse. In Figure 4, for example, the fuzzy term Alcohol_Low represents the fuzzy set that indicates a low alcohol content. The membership values of the fuzzy sets are aggregated by operators that keep the partial results in the range [0,1]. If the attributes presented at the bottom of the tree accurately represent the class, the value obtained at the output after all feature aggregations must be close to 1.
In our previous research [13][21][22][23][24][25], we have described and experimented with a wide diversity of algorithms, such as K-Nearest Neighbors (KNN) [36], Support Vector Machine (SVM) [37], AdaBoost [38], Random Forest (RF) [39], Light Gradient Boosting Machine (LGBM) [40], Extreme Gradient Boosting (XGB) [41], and Logistic Regression (LR) [42]. Here, we compared the results obtained by these algorithms with those achieved by classifiers synthesized by GP and GE to check whether the results of the interpretable classifiers are competitive.
In addition, the fuzzification scheme employed in the FPTs is also employed as a feature engineering step to generate another representation of the original attributes (FOT parameters). The main motivation to perform the fuzzification is to verify if the fuzzy terms can emphasize the differences between the groups. Besides, the newly generated features can also be used to train the algorithms from previous works to check if it is possible to improve the diagnostic accuracy.
5.6. Performance analysis
In medical diagnosis, the area under the receiver operating characteristic curve (AUC) measures a model's ability to discriminate whether a condition is present or not, so it is an appropriate metric for this work [43]. Generalization is what makes learning worthwhile. To assess the generalization capacity, we must test a classifier on a set different from the one used for its training. Usually, we want to use as much data as possible to train the model and the largest amount available to test its generalizability. However, because our dataset is small, we must use a practical approach, such as the k-fold cross-validation technique [44], to estimate generalization performance and perform hyperparameter tuning. Unfortunately, because the performance estimate is directly optimized while tuning the hyperparameters, using a single k-fold cross-validation to complete both tasks may introduce an optimistic bias into the performance estimate. As a result, in our experimental approach, we employ nested cross-validation. This procedure uses an outer cross-validation process to generate the performance estimate that is used to select the best model, while the model's hyperparameters are tuned independently in each fold of the outer cross-validation by minimizing an inner cross-validation estimate of generalization performance. The outer cross-validation thus measures the performance of a method for fitting a model. Since the test data in each iteration of the outer cross-validation have not been used to optimize the performance of the model in any manner, this avoids the bias produced by the flat cross-validation technique and may thus provide a more trustworthy criterion for selecting the best model.
Thus, we divided the dataset into ten folds with the same proportion of classes, enabling ten sub-experiments, each using nine folds for training and one for testing. All algorithms use the same training and test sets so that we can compare their results. First, we specify a set of hyperparameter options for each algorithm. An exhaustive search is then made using the inner cross-validation to find the best hyperparameters in each sub-experiment, which are applied to the respective test fold. After repeating this ten times, we pool all test-set results, build a single ROC curve, and compute the AUC for that algorithm.
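The nested cross-validation loop described above can be sketched with Scikit-learn. The synthetic dataset, the KNN classifier, and the hyperparameter grid below are illustrative stand-ins, not the actual data or configuration of this study; the structure of the loop (stratified 10-fold outer split, inner grid search, pooled ROC/AUC) follows the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 72-exam, 16-feature dataset
X, y = make_classification(n_samples=72, n_features=16, random_state=0)

outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_true, y_score = [], []
for train_idx, test_idx in outer.split(X, y):
    # Inner CV: exhaustive hyperparameter search on the training folds only
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": [1, 3, 5, 7]},
                          cv=StratifiedKFold(5, shuffle=True, random_state=0),
                          scoring="roc_auc")
    search.fit(X[train_idx], y[train_idx])
    # Apply the tuned model to the held-out outer fold
    y_true.extend(y[test_idx])
    y_score.extend(search.predict_proba(X[test_idx])[:, 1])

# Pool all outer test predictions into a single ROC curve / AUC
auc = roc_auc_score(y_true, y_score)
```

Because every outer test fold is scored by a model whose hyperparameters were chosen without seeing it, the pooled AUC is free of the optimistic bias of flat cross-validation.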
6. Experimental scheme
We performed three experiments, each considering two distinct analyses with the dataset: Control group versus individuals with sarcoidosis and normal spirometry, and control group versus individuals with sarcoidosis and altered spirometry.
Experiment 1 consisted of assessing each FOT feature's ability to correctly diagnose the respiratory changes associated with sarcoidosis.
In the second experiment, we evaluated the accuracy of several classifiers in the diagnosis. We evaluated both interpretable methods and other ML algorithms to compare their results. We investigated all techniques using the original dataset with z-score normalization and a fuzzy dataset built with the fuzzification scheme from Figure 3. We implemented the KNN, SVM, AdaBoost (using Decision Trees as the base estimator), RF, LGBM, XGB, and LR classifiers with the Scikit-Learn library [45], whose GridSearchCV function performs a grid search to find a model's best hyperparameters. The options provided for the search are in Table 3.
We implemented the GP classifiers with the library gplearn 0.4.1, which is compatible with Scikit-learn, so we could perform the grid search with the previously mentioned function. Finally, we used PonyGE2 0.2.0 to implement the GE classifiers; since that library is not compatible with Scikit-learn, we developed a new interface that allows us to use Scikit-learn functions [46]. Table 3 also shows the options provided for the GP and GE hyperparameters.
We used arithmetic functions when performing GP with normalized data. In this case, the model's output results from the tree transformed through a sigmoid function. When performing it with fuzzy data, we used the functions shown in Table 2, and the output of the model is directly the result of the tree. Finally, we defined the grammar shown in Figure 5 for the use of GE, in which rules (I) to (IV) are used in experiments with normalized data and rules (V) to (X) in those with fuzzy data.
Thirdly, we included a feature selection technique and reran every procedure of experiment 2. We used recursive feature elimination to select the optimal subset of features. It is a backward method, in which the search starts with all features and, at each iteration, eliminates the one whose removal causes the smallest loss of information. We used the same hyperparameter grid from Table 3 plus an additional hyperparameter, the number of features to select. There are 16 FOT indexes in total, so we provided options 1 to 15 for this hyperparameter, except in the GP and GE experiments, for which we provided only three alternatives (4, 8, 12) due to their execution time. In the experiments with fuzzy data, there are 48 features, so we provided options 1 to 47, again except in the GP and GE experiments, in which only three alternatives were provided.
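Recursive feature elimination is available in Scikit-learn as `RFE`. The sketch below uses a synthetic stand-in dataset and a Logistic Regression estimator for illustration; in the actual experiments, the number of features to select is itself a hyperparameter tuned in the inner cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 16-feature dataset
X, y = make_classification(n_samples=72, n_features=16, random_state=0)

# Backward elimination: drop the least informative feature per iteration
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
```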
Employing feature selection to obtain a subset of optimal features helps avoid overfitting, especially in studies with a small dataset like ours. Since reducing the number of features simplifies the model, our main interest in feature selection is to achieve better classification performance. However, experiment 3 can also contribute to explaining the results, by observing which features are selected most often. Each experiment consists of ten sub-experiments, and since we use nine algorithms, each analysis yields 90 feature selection results. From these results, we can identify the essential features. We generated 3D plots with the three most frequent features in each analysis to evaluate the visual separation between the classes.
6.1. Statistics
Initially, the sample distribution characteristics were assessed using the Shapiro-Wilk test. Since the data were non-normally distributed, non-parametric analyses (Mann-Whitney test) were performed. Differences with p≤0.05 were considered statistically significant. These analyses were performed using R version 4.0.5 (R Foundation for Statistical Computing, Vienna, Austria).