History- and symptom-based prediction of pulmonary function and the presence of chronic obstructive pulmonary disease: development and validation of a nomogram

Background: Early suspicion followed by assessing lung function with spirometry could decrease the underdiagnosis of chronic obstructive pulmonary disease (COPD) in primary care. We aimed to develop a nomogram to predict the FEV1/FVC ratio and the presence of COPD.
Methods: We retrospectively reviewed data of 4,241 adult patients who underwent spirometry between 2013 and 2019. By linear regression analysis, variables associated with FEV1/FVC were identified in the training cohort (n=2,969). Using the variables as predictors, a nomogram was created to predict the FEV1/FVC ratio and validated in the test cohort (n=1,272).
Results: Older age (Odds ratio [OR], 0.858; 95% confidence interval [CI], 0.833–0.885), male sex (OR, 0.149; 95% CI, 0.064–0.348), current or past smoking history (OR, 0.036; 95% CI, 0.015–0.086), the presence of dyspnea (OR, 0.086; 95% CI, 0.027–0.275), and the absence of overweight (OR, 2.445; 95% CI, 1.210–4.942) were significantly associated with the FEV1/FVC ratio. In the final testing, the developed nomogram showed a mean absolute error of 0.822 (95% CI, 0.789–0.854) between the predicted and actual FEV1/FVC ratios. The overall performance was best when FEV1/FVC < 70% was used as a diagnostic criterion for COPD; the sensitivity, specificity, and balanced accuracy were 82.3%, 68.6%, and 75.5%, respectively.
Conclusion: The developed nomogram could be used to identify potential patients at risk of COPD who need further evaluation, in the primary care

models was based on information available in the primary care setting alone in predicting spirometry results or the presence of COPD.
Therefore, the purpose of this study was to develop and validate a history- and symptom-based nomogram that can be conveniently used to predict a spirometry result, namely the ratio of forced expiratory volume in 1 second (FEV1) to forced vital capacity (FVC), and the presence or absence of COPD.

Study population
We searched our electronic medical record database and found 6,322 adult (≥40 years) patients who underwent pulmonary function tests, including spirometry, at a single medical institution in South Korea (hereafter, Korea) between January 2012 and December 2019. Of these, 1,703 patients who were already diagnosed with COPD in 2012 were excluded. Thus, patients included in this study were either first diagnosed with COPD between 2013 and 2019 or were not diagnosed with COPD during the study period.
When a patient underwent multiple spirometry measurements, the first test result was used to exclude a possible treatment effect on spirometry results. Of these patients, 378 were excluded because their smoking history was missing. Other respiratory ailments, such as asthma, bronchiectasis, and interstitial lung disease, were not included in this study. The final study cohort was randomly split into training and test cohorts at a ratio of 7:3 while preserving the same proportion of COPD patients (Fig. 1).
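The 7:3 split that preserves the proportion of COPD patients can be illustrated with a small stratified-sampling sketch. This is not the authors' code: the function name `stratified_split`, the seed, and the toy labels are assumptions made for illustration only.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random assignment within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy cohort (made-up labels): 1 = COPD, 0 = no COPD, 10% COPD overall.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
train_copd = sum(labels[i] for i in train_idx)
test_copd = sum(labels[i] for i in test_idx)
```

Because each class is shuffled and cut separately, both cohorts keep the same COPD proportion as the full cohort, up to rounding.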

Measurements of lung function and definition of COPD
To measure lung function, spirometry was performed according to the American Thoracic Society/European Respiratory Society (ATS/ERS) standards by trained research assistants [13]. A dry rolling-seal spirometer (Model 2130; SensorMedics, Yorba Linda, CA, USA) was used for all subjects. All spirometry traces were reviewed by a lung function specialist to determine whether they fulfilled the reproducibility and acceptability criteria of the ATS/ERS Task Force.
The normal predictive values for spirometry data were calculated using a reference equation derived from Korea's general population [14]. A fixed criterion of predicted forced expiratory volume in 1 second per forced vital capacity (i.e., FEV1/FVC <0.7) was used to diagnose patients with COPD in accordance with the Global Initiative for Chronic Obstructive Lung Disease (GOLD) guidelines [15].

Variables
Outcome variables were the FEV1/FVC ratio and the presence or absence of COPD. Predictors for the outcome were age, sex, overweight (defined as body mass index [BMI] > 25 kg/m²), smoking history, symptoms of dyspnea, cough, or sputum, the presence or absence of underlying hypertension, diabetes, congestive heart failure, coronary vascular disease, stroke, or anemia, and the prior use of salbutamol or antibiotics.
Smoking history could be obtained from both our medical records and the national health screening results because, in Korea, all adults over 40 years old are mandated to undergo the biannual national health screening, which contains a questionnaire about smoking habits. However, there were approximately 5 times more missing values in the health screening records than in the medical records. Therefore, we mainly used smoking history from the medical records; only when smoking history was missing from the medical records did we use the value from the health screening database, if present.
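The source-priority rule for smoking history amounts to a simple fallback. The sketch below assumes hypothetical field names (`emr`, `screening`) and made-up patient records; it is meant only to make the rule concrete.

```python
def resolve_smoking_history(medical_record, health_screening):
    """Prefer the medical record; fall back to screening data only if it is missing."""
    if medical_record is not None:
        return medical_record
    return health_screening  # may still be None if both sources are missing


# Made-up example records; None marks a missing value.
patients = [
    {"id": "A", "emr": "ever", "screening": "never"},
    {"id": "B", "emr": None, "screening": "ever"},
    {"id": "C", "emr": None, "screening": None},
]

resolved = {p["id"]: resolve_smoking_history(p["emr"], p["screening"]) for p in patients}

# Patients with no smoking history in either source are excluded from the cohort.
excluded = [pid for pid, value in resolved.items() if value is None]
```

Patient A keeps the medical-record value even though the screening record disagrees, which mirrors the stated preference for the source with fewer missing values.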

Statistical analysis and prediction model
Continuous or categorical variables were compared between the training and test cohorts using t-test or chi-square tests, respectively. Univariable and multivariable linear regression was performed to determine the association between the risk factors and FEV1/FVC ratio and find independent predictors for our prediction model. In the multivariable regression, only variables with a significant association with the FEV1/FVC ratio in the univariable regression were used. A linear regression model for predicting the FEV1/FVC ratio was fit in the training cohort and validated in the test cohort using mean absolute error (MAE) as an evaluation metric. In addition, the agreement between the predicted and actual FEV1/FVC values was graphically assessed using the Bland-Altman plot. A nomogram to predict FEV1/FVC was created based on the prediction model fitted in the training cohort.
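For readers unfamiliar with the two agreement measures, the following sketch computes the MAE and the Bland-Altman bias with 95% limits of agreement (bias ± 1.96 standard deviations of the differences). The function names and the four toy value pairs are illustrative assumptions, not study data.

```python
from statistics import mean, stdev


def mean_absolute_error(actual, predicted):
    """Average of |predicted - actual| over all patients."""
    return mean(abs(p - a) for a, p in zip(actual, predicted))


def bland_altman_limits(actual, predicted):
    """Return (bias, lower, upper); limits of agreement are bias +/- 1.96 SD."""
    diffs = [p - a for a, p in zip(actual, predicted)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Made-up FEV1/FVC values in percent.
actual = [60.0, 70.0, 80.0, 90.0]
predicted = [68.0, 72.0, 78.0, 86.0]

mae = mean_absolute_error(actual, predicted)
bias, lower, upper = bland_altman_limits(actual, predicted)
```

A positive difference for a patient with a low actual ratio, as in the first toy pair, corresponds to the kind of overestimation a Bland-Altman plot makes visible.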
Using predicted FEV1/FVC values as a diagnostic criterion, the area under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and balanced accuracy were calculated for discriminating between patients with and without COPD.
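The threshold-based metrics can be sketched as follows, assuming that COPD status is defined by an actual ratio below the cutoff and that a patient is flagged when the predicted ratio falls below the same cutoff. The function name, the default cutoff argument, and the toy values are assumptions; AUC is omitted because it requires sweeping over all cutoffs, and the sketch assumes every denominator is non-zero.

```python
def diagnostic_metrics(actual_ratio, predicted_ratio, cutoff=70.0):
    """Compare predicted vs. actual COPD status, both defined as ratio < cutoff."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(actual_ratio, predicted_ratio):
        has_copd = actual < cutoff
        flagged = predicted < cutoff
        if has_copd and flagged:
            tp += 1
        elif has_copd:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Made-up FEV1/FVC values in percent.
actual = [60, 65, 68, 72, 75, 80, 85, 90]
predicted = [66, 69, 72, 68, 74, 78, 83, 88]
metrics = diagnostic_metrics(actual, predicted)
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it less sensitive to class imbalance than plain accuracy; for the reported sensitivity of 82.3% and specificity of 68.6%, the mean is 75.45%, consistent with the reported 75.5%.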
Our study cohort was imbalanced, with approximately 9 times more patients in the non-COPD group than in the COPD group. In an imbalanced cohort, it is highly likely that predicted outcome values are biased towards the majority group (i.e., non-COPD group or higher FEV1/FVC ratio in this study). Therefore, when training the model, we used the synthetic minority over-sampling technique (SMOTE) algorithm to create synthetic minority class cases to balance the two classes [16].
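The core idea of SMOTE is to create each synthetic case by interpolating between a minority-class case and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea, not the implementation used in the study: the function name, the random choice of base cases, and the two continuous toy features are assumptions. It also handles continuous features only; binary predictors such as sex or smoking status would need a categorical-aware variant.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority cases by squared Euclidean distance to `base`.
        others = [x for x in minority if x is not base]
        others.sort(key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Made-up minority-class cases described by [age, BMI].
minority = [[70.0, 22.1], [75.0, 20.5], [68.0, 23.0], [81.0, 19.8]]
new_cases = smote(minority, n_synthetic=6, k=2)
```

Every synthetic case lies on a line segment between two real minority cases, so oversampling densifies the minority region without duplicating records exactly.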

Patient characteristics
The final study cohort comprised 4,241 patients (2,204 men and 2,037 women) with a mean age of 67 (range, 39-98) years. The mean or frequency of all the variables was not significantly different between the training and test cohorts (

Prediction model
The mean absolute difference between the predicted and actual FEV1/FVC values (i.e., MAE) was 8.858 in the training cohort and 8.721 in the test cohort. For FEV1/FVC in the range between 65 and 75, the MAE was 6.324 in the training cohort and 6.490 in the test cohort.
The Bland-Altman plot revealed a trend that our model overestimated FEV1/FVC when the actual FEV1/FVC value was less than 65; in this range, many cases were observed above the upper 95% limit of agreement (Fig. 2). Hence, the effective range of the FEV1/FVC ratio predicted by our nomogram was from 65 to 90; a predicted FEV1/FVC value less than 65 or larger than 90 must be interpreted as 'less than 65' or 'larger than 90', respectively, instead of the value itself (Fig. 3).
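The interpretation rule for predictions outside the effective range can be written as a small helper. The function name and the one-decimal output format are assumptions chosen for illustration.

```python
def interpret_prediction(predicted_ratio, lower=65.0, upper=90.0):
    """Report a predicted FEV1/FVC (percent) only inside the effective range."""
    if predicted_ratio < lower:
        return f"less than {lower:g}"
    if predicted_ratio > upper:
        return f"larger than {upper:g}"
    return f"{predicted_ratio:.1f}"


readings = [interpret_prediction(value) for value in (58.2, 72.4, 93.0)]
```

Reporting a label instead of a number outside the range avoids conveying precision that the model does not have for low actual ratios.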

Discussion
In this study, we developed a multivariable model to identify patients who are expected to have decreased pulmonary function and thus are at risk for COPD. In developing this prediction model, we aimed to create an easy-to-use tool that can help primary care providers decide whether to refer a patient suspected of having COPD to a facility where spirometry is available. Thus, we examined variables that are obtainable from simple physical examination and history taking as potential predictors. In our study, the five variables associated with airflow limitation (i.e., decreased FEV1/FVC) were older age, male gender, the absence of overweight, the presence of dyspnea, and ever-smoking history.
Old age, male gender, and smoking are well-known risk factors for COPD. Historically, COPD has been considered a disease of elderly male smokers, although evidence suggests that this historical view is slowly changing [17]. The prevalence and mortality of COPD have increased more rapidly in women than in men during the past two decades, attributed to the changing smoking trends during the past 50 years [18]. Hence, reevaluation of risk stratification by gender is warranted in the future.
Tobacco smoking is the most powerful risk factor for COPD. Although accurate information on actual smoking habits (duration, amount, and type of cigarette) is of utmost importance, the information in electronic medical records is often quite variable depending on the timing of data entry, the visit route (i.e., outpatient, emergency room, or general ward), or the medical staff who entered the data. Thus, we processed the smoking data as a binary variable: non-smoker and ever-smoker (current or past).
In this study, the presence of overweight showed a protective effect, which is in line with previous studies. A study of Asian COPD patients reported that COPD patients with a high BMI have better pulmonary function [19]. In another study, while underweight was associated with poor survival in COPD, overweight and obesity had a protective effect on mortality in COPD patients [20].
GOLD guidelines also support the use of multivariable prediction models to assess the prognostic profile and facilitate follow-up of patients, instead of single predictors such as spirometry or history of exacerbations [15]. Since the occurrence and manifestation of COPD are unique to each race and country [21,22], we believe that our model could screen more undiagnosed COPD patients in Korea. We hope to improve our model as more data are obtained in the future and eventually develop a robust, reliable prediction model that can be used nationwide.
This study has some limitations. First, further external validation is needed, because this model was developed from a retrospective study in a single institution. Second, detailed smoking history (the type of smoking, amount, and duration) was not used in our analysis. In addition to conventional tobacco smoking, various electronic cigarettes using new nicotine delivery technologies have recently gained popularity among the public. Although recent national health screening questionnaires are changing to reflect recent smoking behavior, the data used in this study did not contain this information.

Conclusions
In conclusion, we developed a nomogram to predict the FEV1/FVC ratio and the presence of COPD based on age, gender, weight, the presence of dyspnea, and smoking history. This nomogram could be used conveniently to screen potential high-risk patients, especially in the primary care setting, where spirometry is not available.
Abbreviations
COPD, chronic obstructive pulmonary disease; FEV1, forced expiratory volume in 1 second; FVC, forced vital capacity; ATS/ERS, American Thoracic Society/European Respiratory Society; GOLD, Global Initiative for Chronic Obstructive Lung Disease; BMI, body mass index; MAE, mean absolute error; AUC, area under curve; PPV, positive predictive value; NPV, negative predictive value; OR, odds ratio; CI, confidence interval

Declarations

Ethics approval and consent to participate
The Institutional Review Board of National Health Insurance Service Ilsan Hospital (NHIMC 2020-06-005) approved this Health Insurance Portability and Accountability Act-compliant retrospective study and waived the informed consent. All methods were performed in accordance with relevant guidelines and regulations.

Consent for publication
Not applicable.

Availability of data and materials
Due to institutional policy, data can be made available only to researchers who are subject to a nondisclosure agreement, upon reasonable request.

Flowchart of the study population. COPD, chronic obstructive pulmonary disease.

Nomogram predicting FEV1/FVC ratio. The nomogram is used by first giving each variable a score on the 'Points' scale. The scores for all variables are then added to obtain the total score and a vertical line is drawn from the 'Total Points' row to estimate the FEV1/FVC ratio. A predicted FEV1/FVC value <65 or > 90 must be interpreted as 'less than 65' or 'larger than 90', respectively.
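The point-summing procedure described for the nomogram is a graphical form of evaluating a linear model. The sketch below shows one common way such a scale can be built: the predictor with the widest contribution range spans 0 to 100 points, and the total maps linearly back to the predicted ratio. All coefficients, the intercept, and the predictor ranges here are hypothetical values invented for illustration; they are NOT the fitted model from this study, and the published nomogram may use a different scaling convention.

```python
# Hypothetical linear model for illustration only; not the study's coefficients.
INTERCEPT = 95.0
COEFS = {"age": -0.25, "male": -3.0, "ever_smoker": -4.0, "dyspnea": -3.0, "overweight": 2.0}
RANGES = {"age": (40, 100), "male": (0, 1), "ever_smoker": (0, 1),
          "dyspnea": (0, 1), "overweight": (0, 1)}


def contribution_bounds(name):
    """Smallest and largest contribution of one predictor to the linear predictor."""
    low, high = RANGES[name]
    ends = (COEFS[name] * low, COEFS[name] * high)
    return min(ends), max(ends)


# The predictor with the widest contribution range spans the full 0-100 scale.
MAX_SPAN = max(high - low for low, high in map(contribution_bounds, COEFS))
POINTS_PER_UNIT = 100.0 / MAX_SPAN
# Predicted ratio when every predictor sits at its 0-point value.
BASELINE = INTERCEPT + sum(contribution_bounds(name)[0] for name in COEFS)


def points(name, value):
    """Score of one predictor on the 'Points' scale."""
    return (COEFS[name] * value - contribution_bounds(name)[0]) * POINTS_PER_UNIT


def predict_from_points(total_points):
    """Map 'Total Points' back to the predicted FEV1/FVC (percent)."""
    return BASELINE + total_points / POINTS_PER_UNIT


patient = {"age": 60, "male": 1, "ever_smoker": 1, "dyspnea": 0, "overweight": 0}
total = sum(points(name, value) for name, value in patient.items())
ratio = predict_from_points(total)
direct = INTERCEPT + sum(COEFS[name] * value for name, value in patient.items())
```

Reading the total off the points scale gives the same value as evaluating the linear model directly, which is why a nomogram can replace a calculator in a setting without software support.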