Chronic obstructive pulmonary disease (COPD) is a common disease characterized by persistent respiratory symptoms and airflow limitation. Its clinical manifestations include cough, expectoration, chest tightness, air hunger and dyspnea. In severe cases, it can progress to respiratory failure and cor pulmonale, seriously damaging patients’ health and quality of life. COPD has emerged as the fifth-largest burden on the global economy[1, 2], especially in developing countries, where its morbidity and mortality are high. One study has shown that about 1 million people die of COPD in China every year, accounting for 30% of COPD deaths worldwide. COPD is also the third and fourth leading cause of death in China’s rural and urban areas, respectively. COPD has thus become an important public health problem. It is imperative to analyze its related factors and the complex relationships among them comprehensively, and to take effective measures to prevent COPD and reduce its occurrence as early as possible.
Previous studies generally explored COPD risk factors using logistic regression, which requires the independent variables to be mutually independent and expresses the association between each independent variable and the outcome as an odds ratio. For example, in 2018, Wang Chen studied the risk factors of COPD using logistic regression and found that gender, age, years of smoking and severe exposure to PM2.5 were significantly associated with COPD. In practice, however, the factors influencing a disease are often correlated with one another, so the independence assumption of logistic regression is not met. In addition, logistic regression cannot distinguish direct from indirect factors of COPD.
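For reference, the logistic model and the odds ratio referred to above take the standard form below. The symbols (p, x_j, beta_j, k) are notation introduced here for illustration and are not taken from the cited studies.

```latex
\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j,
\qquad
\mathrm{OR}_j = e^{\beta_j}
```

Here p is the probability of COPD, and OR_j is the multiplicative change in the odds of COPD for a one-unit increase in x_j with the other variables held fixed.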
Bayesian networks (BNs) were first proposed by Judea Pearl in 1987 and have since been widely used. Without strict statistical assumptions, a BN uses a directed acyclic graph (DAG) to show the potential relationships among influencing factors and conditional probability tables (CPTs) to quantify the strength of association among variables. As such, BNs can directly display the complex network of relationships between a disease and its influencing factors and overcome the limitations of traditional logistic regression. In addition, BNs can infer the probability of unknown nodes from the information in known nodes and flexibly show how the relevant risk factors affect the risk of COPD[11, 12].
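As a minimal sketch of how a DAG plus CPTs supports inference, consider a hypothetical three-node network, Smoking -> COPD <- PM25. The structure, the variable names and every probability below are invented for illustration and are not results of this study. The query function sums the factorized joint distribution over the unobserved nodes.

```python
from itertools import product

# Hypothetical toy network: Smoking -> COPD <- PM25.
# All probabilities are made-up illustration values, not study estimates.
p_smoking = {1: 0.3, 0: 0.7}
p_pm25 = {1: 0.4, 0: 0.6}
# CPT for COPD: P(COPD = 1 | Smoking, PM25), keyed by (smoking, pm25).
p_copd_given = {(0, 0): 0.02, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.20}


def joint(smoking, pm25, copd):
    """Joint probability from the DAG factorization P(S) * P(E) * P(C | S, E)."""
    p_c1 = p_copd_given[(smoking, pm25)]
    p_c = p_c1 if copd == 1 else 1.0 - p_c1
    return p_smoking[smoking] * p_pm25[pm25] * p_c


def query(target, evidence):
    """Return P(target = 1 | evidence) by enumerating all joint assignments."""
    names = ("smoking", "pm25", "copd")
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=len(names)):
        world = dict(zip(names, values))
        # Skip assignments that contradict the observed evidence.
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target] == 1:
            numerator += p
    return numerator / denominator


# Predictive query: risk of COPD for a smoker, PM2.5 exposure unknown.
print(round(query("copd", {"smoking": 1}), 3))   # 0.14
# Diagnostic query: probability of smoking given observed COPD.
print(round(query("smoking", {"copd": 1}), 3))   # 0.652
```

The same function answers both predictive and diagnostic queries, which is the property the paragraph above describes: known nodes are fixed as evidence and unknown nodes are summed out.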
Bayesian network learning refers to obtaining a complete Bayesian network by analyzing existing information. It comprises two tasks: parameter learning and structure learning. Parameter learning assumes that the network structure is known and determines the parameters of the network. This study focused on structure learning, which is more commonly used. Generally speaking, structure learning algorithms fall into two categories: score-based search and constraint-based algorithms.
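To make parameter learning concrete, the sketch below estimates one CPT from data under a structure that is assumed known (a single edge Smoking -> COPD). The records, the function name estimate_cpt and the smoothing parameter alpha are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical (smoking, copd) records, invented for illustration.
records = [(1, 1), (1, 0), (1, 1), (1, 0),
           (0, 0), (0, 0), (0, 1), (0, 0), (0, 0), (0, 0)]


def estimate_cpt(rows, alpha=0.0):
    """Estimate P(child = 1 | parent) from (parent, child) pairs.

    alpha = 0 gives the maximum-likelihood estimate (relative frequencies);
    alpha > 0 adds pseudo-counts to each of the two child states.
    """
    pair_counts = Counter(rows)
    parent_counts = Counter(parent for parent, _ in rows)
    return {
        parent: (pair_counts[(parent, 1)] + alpha) / (parent_counts[parent] + 2 * alpha)
        for parent in sorted(parent_counts)
    }


mle = estimate_cpt(records)
smoothed = estimate_cpt(records, alpha=1.0)
print(mle)
print(smoothed)
```

With the structure fixed, parameter learning reduces to counting within each parent configuration. Structure learning, the focus of this study, is the harder problem of deciding which parent sets to count over.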
The essence of score-based search is to find the Bayesian network structure that maximizes a score function. However, it is hard to obtain an optimal structure when the structure space becomes very large. Constraint-based algorithms offer high learning efficiency and can obtain the globally optimal solution, but they also have shortcomings. First, determining the independence of nodes is complicated, and the number of independence tests grows exponentially with the number of nodes. Second, the results of high-order conditional independence tests are unreliable. Considering the limitations of both kinds of algorithm, some scholars have proposed hybrid algorithms. Max-Min Hill-Climbing (MMHC), adopted in this study, is a widely used hybrid algorithm with two phases. In the first phase, it uses a constraint-based algorithm to build an undirected network skeleton, which reduces the search space. In the second phase, a score-based search adds, deletes and reverses edges within the constrained space to find the network with the highest score. MMHC thus combines the two approaches and mitigates the shortcomings of each.
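The second, score-based phase can be sketched as a greedy hill climb restricted to a candidate skeleton. This is an illustration under stated assumptions, not the implementation used in this study: the first (constraint-based) phase is not implemented and the skeleton is supplied by hand, the score is BIC for binary variables, and the data, node names and function names are all hypothetical.

```python
import math
from collections import Counter


def bic(data, parents):
    """BIC score of a DAG on binary data; parents maps node -> tuple of parents."""
    n = len(data)
    score = 0.0
    for node, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[node]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        loglik = sum(c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items())
        n_params = 2 ** len(pa)  # one free parameter per parent configuration
        score += loglik - 0.5 * math.log(n) * n_params
    return score


def is_acyclic(parents):
    """Repeatedly strip nodes without remaining parents; a leftover means a cycle."""
    remaining = {node: set(pa) for node, pa in parents.items()}
    while remaining:
        roots = [node for node, pa in remaining.items() if not pa]
        if not roots:
            return False
        for root in roots:
            del remaining[root]
        for pa in remaining.values():
            pa.difference_update(roots)
    return True


def freeze(graph):
    return {node: tuple(sorted(pa)) for node, pa in graph.items()}


def neighbours(parents, skeleton):
    """Yield DAGs one add/delete/reverse away, using only skeleton edges."""
    for a, b in skeleton:
        for x, y in ((a, b), (b, a)):
            graph = {node: set(pa) for node, pa in parents.items()}
            if x in graph[y]:                # edge x -> y exists
                graph[y].discard(x)
                yield freeze(graph)          # delete x -> y
                graph[x].add(y)
                if is_acyclic(graph):
                    yield freeze(graph)      # reverse to y -> x
            elif y not in graph[x]:          # no edge between x and y yet
                graph[y].add(x)
                if is_acyclic(graph):
                    yield freeze(graph)      # add x -> y


def hill_climb(data, nodes, skeleton):
    """Start from the empty graph and take the best-scoring move until none helps."""
    current = {node: () for node in nodes}
    current_score = bic(data, current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbours(current, skeleton):
            s = bic(data, candidate)
            if s > best_score + 1e-9:
                best, best_score = candidate, s
        if best is None:
            return current, current_score
        current, current_score = best, best_score


# Hypothetical data: COPD is associated with smoking; age is independent of both.
pattern = [((1, 1), 20), ((1, 0), 5), ((0, 1), 5), ((0, 0), 20)]
data = [{"smoking": s, "copd": c, "age": a}
        for a in (0, 1) for (s, c), k in pattern for _ in range(k)]
nodes = ["smoking", "copd", "age"]
skeleton = [("smoking", "copd"), ("age", "copd")]  # assumed output of phase one

dag, score = hill_climb(data, nodes, skeleton)
print(dag, round(score, 2))
```

On this toy data the search keeps an edge between smoking and COPD and rejects the age edge, because the BIC penalty outweighs a zero gain in likelihood. Both orientations of the smoking and COPD edge receive the same BIC score, so the direction returned here depends only on iteration order and should not be read causally.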
Many factors are related to COPD. If all of them were incorporated into the BN, the network would become intricate, and weakly correlated factors would reduce the accuracy of the model. Lasso regression does not take the correlation between features into account, so it is not well suited to multicollinear variables. Ridge regression cannot perform variable selection, because it does not shrink any coefficient exactly to zero. Elastic Net combines the two penalties and performs feature selection, with its tuning parameters chosen by cross-validation. By adopting Elastic Net, a sparse model can be obtained while compensating for the correlation among the observed variables.
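For a binary outcome such as COPD status, the Elastic Net estimate can be written as a penalized logistic regression. The form below follows the widely used glmnet-style parameterization, which is an assumption here, since the text does not state which parameterization or software was used. The symbols (n, y_i, x_i, beta, lambda, alpha) are notation introduced for illustration.

```latex
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\, \beta}
\left\{
  -\frac{1}{n} \sum_{i=1}^{n}
  \left[ y_i \left( \beta_0 + x_i^{\top} \beta \right)
       - \ln\!\left( 1 + e^{\beta_0 + x_i^{\top} \beta} \right) \right]
  + \lambda \left[ \alpha \lVert \beta \rVert_1
       + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 \right]
\right\}
```

Setting alpha = 1 recovers the Lasso penalty and alpha = 0 the ridge penalty. Intermediate values keep the L1 term, which sets some coefficients exactly to zero, together with the L2 term, which stabilizes the estimates of correlated predictors. The penalty strength lambda is typically chosen by cross-validation.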
Hence, we aimed to employ Elastic Net to screen influencing factors in the original data and select those strongly associated with COPD, and then to use the MMHC algorithm to build a network of COPD and its related factors. The goal was to explore the potential relationships between COPD and these factors, so as to provide a theoretical basis for preventing COPD and reducing its occurrence in clinical practice.