Study Design
A retrospective cohort design was chosen to analyze patients diagnosed with COVID-19. The study included patients admitted to Baqiyatallah Hospital between October 2020 and May 2021, during which all individuals admitted with a confirmed COVID-19 diagnosis were eligible for inclusion.
Patient data were obtained from hospital records, focusing on adults aged 18 and older. The primary inclusion criteria were a positive COVID-19 diagnosis confirmed by RT-PCR, admission to the hospital for COVID-19 treatment, and a chest CT examination within one day of admission. Patients with incomplete clinical data, whether due to transfer from other hospitals or other reasons, were excluded. After applying these criteria, the final cohort consisted of 1008 of 1744 patients. The study received ethical approval from the hospital's ethics committee (IR.BMSU.BAQ.REC.1400.079), and all patient data were anonymized to ensure confidentiality and privacy. Included patients were divided into two groups based on oxygen saturation, using a cutoff of 90%. The cohort was then split into training and testing sets for model development, with approximately 70% of the data used for training and the remaining 30% for testing, as sketched below.
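As an illustration of this grouping and stratified split, the following minimal sketch assumes a pandas DataFrame `df` of the anonymized records containing an `spo2` column; the file name, column names, and random seed are hypothetical.

```python
# Minimal sketch of the oxygen-saturation grouping and 70/30 split described above.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("cohort.csv")                      # anonymized patient records (assumed file)
df["low_spo2"] = (df["spo2"] < 90).astype(int)      # group label: oxygen saturation below 90%

X = df.drop(columns=["spo2", "low_spo2"])           # predictors (SpO2 itself excluded)
y = df["low_spo2"]

# Stratified 70/30 split so both sets keep the same outcome proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
```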
Data Collection
Clinical data were extracted from electronic health records (EHR) and hospital databases. This process involved identifying patients who met the study's inclusion criteria and retrieving their complete medical history, clinical assessments, and treatment records. Demographic information, including age and gender, was collected at admission. Contact history, documenting potential exposure to COVID-19, was obtained through patient interviews and travel history reports. Medical history, detailing pre-existing conditions such as hypertension, diabetes, coronary heart disease, surgery, and hepatitis B, was also extracted from hospital records and patient self-reports. Clinical symptoms such as fever, cough, chills, dizziness, fatigue, and body ache were documented based on patient self-reporting and clinical observations by medical staff. Laboratory data were collected through routine blood tests and arterial blood gas tests conducted upon patient admission. Routine blood tests provided values for white blood cell counts, lymphocytes, eosinophils, neutrophils, and C-Reactive Protein (CRP). Additional biomarkers were measured to assess tissue and organ function, including D-Dimer, lactate dehydrogenase (LDH), creatine kinase isoenzyme, Alanine Aminotransferase (ALT), Aspartate Aminotransferase (AST), blood creatinine, blood urea nitrogen, and procalcitonin. After collection, data were integrated into a single dataset and cleaned to remove incomplete or redundant records.
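The integration and cleaning step could, for example, take the form of the sketch below; the file names, the `patient_id` key, and the 80% completeness threshold are assumptions made only for illustration.

```python
# Illustrative merge of the clinical, laboratory, and CT record tables, followed by
# removal of duplicated and largely incomplete rows.
import pandas as pd

clinical = pd.read_csv("clinical.csv")
labs = pd.read_csv("laboratory.csv")
ct = pd.read_csv("ct_findings.csv")

dataset = (
    clinical.merge(labs, on="patient_id", how="inner")
            .merge(ct, on="patient_id", how="inner")
)
dataset = dataset.drop_duplicates(subset="patient_id")            # remove redundant records
dataset = dataset.dropna(thresh=int(0.8 * dataset.shape[1]))      # drop rows missing >20% of fields (assumed cutoff)
```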
CT Image Acquisition
The CT images were acquired using a multi-detector CT scanner, a GE Revolution EVO 64-slice CT (GE Healthcare, Milwaukee, WI), within one day of hospital admission to ensure consistent imaging conditions and minimize external factors that could affect image quality. Patients were scanned in the supine position with 120 kVp tube voltage, automatic tube current modulation, 0.725 mm collimation, and reconstruction intervals of 1 mm and 5 mm. The scanning range spanned from the thoracic inlet to the upper abdomen, ensuring comprehensive lung coverage. The 12-bit CT images were mapped to 8-bit images with lung-specific window settings (window width: 1000 HU; window level: -500 HU).
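This window mapping corresponds to a simple linear rescaling of the HU range; the sketch below shows one possible implementation on a NumPy array of HU values.

```python
# Lung-window mapping from HU values to 8-bit intensities
# (window width 1000 HU, window level -500 HU).
import numpy as np

def apply_lung_window(hu_image: np.ndarray, level: float = -500.0, width: float = 1000.0) -> np.ndarray:
    """Clip the HU range [level - width/2, level + width/2] and rescale it to 0-255."""
    lower, upper = level - width / 2, level + width / 2   # -1000 HU to 0 HU
    windowed = np.clip(hu_image, lower, upper)
    return ((windowed - lower) / (upper - lower) * 255).astype(np.uint8)
```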
Image Segmentation and Processing
A segmentation process was applied to the CT images using two-dimensional U-Net models to delineate lung and lesion regions. Each convolutional layer used a 3 × 3 kernel, followed by batch normalization and rectified linear unit (ReLU) activation. The final segmentation layers determined the boundaries of the lung and lesion regions. Data augmentation techniques included random rotations of up to 20 degrees, random shear transformations, and zoom factors ranging from 0.9 to 1.1 to improve the model's ability to generalize across variations in CT images. After automated segmentation, experienced radiologists manually reviewed the results. The manual review involved four radiologists with a minimum of two years of experience in chest imaging; final confirmation and correction were made by two senior radiologists with 6 and more than 11 years of experience, respectively.
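The augmentation settings can be expressed, for instance, with a Keras `ImageDataGenerator`; the shear magnitude is not stated in the text and is chosen here purely for illustration.

```python
# Augmentation settings matching the ranges reported above.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,        # random rotations up to 20 degrees
    zoom_range=[0.9, 1.1],    # zoom between 0.9x and 1.1x
    shear_range=10,           # assumed shear angle in degrees (not stated in the text)
    fill_mode="nearest",
)
# augmenter.flow(ct_slices, batch_size=8, seed=1) yields augmented slices;
# a second generator with the same seed can be applied to the corresponding masks.
```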
CT-Based Features
Following the segmentation process, a variety of CT variables were quantified to assess lung structure and disease progression. The lung volume, lesion volume, non-lesion lung volume (NLLV), and the fraction of NLLV (%NLLV) were calculated to determine the extent of disease involvement. Additionally, a comprehensive set of histogram texture features, including mean, standard deviation, skewness, kurtosis, energy, and entropy, was calculated for both lesions and NLLV. This quantification provided a robust dataset from which key CT variables were extracted for further analysis and modeling.
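The first-order histogram features listed above can be computed from the voxel intensities inside a segmented mask, for example as in the sketch below; the 256-bin histogram is an assumption.

```python
# First-order histogram features for a segmented region; `region` is assumed to be
# a 1-D array of voxel intensities inside the lesion or NLLV mask.
import numpy as np
from scipy.stats import skew, kurtosis

def histogram_features(region: np.ndarray, bins: int = 256) -> dict:
    hist, _ = np.histogram(region, bins=bins)
    p = hist / hist.sum()                          # normalized histogram probabilities
    p_nonzero = p[p > 0]
    return {
        "mean": float(np.mean(region)),
        "std": float(np.std(region)),
        "skewness": float(skew(region)),
        "kurtosis": float(kurtosis(region)),
        "energy": float(np.sum(p ** 2)),                               # histogram uniformity
        "entropy": float(-np.sum(p_nonzero * np.log2(p_nonzero))),     # histogram randomness
    }
```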
The CT variables were encoded to create a dataset suitable for machine learning modeling. Binary variables, representing the presence or absence of specific features, were encoded using binary encoding where a value of '1' indicated the feature's presence, and '0' indicated its absence. This approach was applied to a range of CT features, including Ground Glass Opacities (GGO), Consolidation, Crazy Paving Pattern, Halo Sign, Reverse Halo Sign, Peripheral Distribution, Lower Zone Predominance, Traction Bronchiectasis, Vascular Thickening, Subpleural Lines, Air Bronchograms, Pleural Effusion, Interstitial Thickening, Lymphadenopathy, Cavitation, and Fibrotic Bands.
Data Integration
One-hot encoding was used for categorical variables, converting each distinct category into a separate binary feature and thereby capturing the information without imposing an artificial ordinal relationship. This technique was applied to variables such as nodules (size, number, and location) and architectural distortion (various types). Continuous variables, which span a range of potential values, were standardized using the z-score method to provide a uniform scale across the dataset. Variables such as lung volume, mosaic attenuation pattern, and white blood cell count required this approach to prevent features with larger numerical ranges from exerting disproportionate influence.
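A possible implementation of this encoding and standardization scheme is sketched below; the column names and groupings are illustrative and reuse the `df` frame from the earlier sketch.

```python
# Illustrative encoding pipeline: binary CT findings mapped to 0/1, categorical
# variables one-hot encoded, and continuous variables z-scored.
import pandas as pd
from sklearn.preprocessing import StandardScaler

binary_cols = ["ggo", "consolidation", "crazy_paving", "pleural_effusion"]       # subset shown
categorical_cols = ["nodule_size", "nodule_location", "architectural_distortion"]
continuous_cols = ["lung_volume", "lesion_volume", "wbc_count"]

df[binary_cols] = df[binary_cols].astype(int)          # presence = 1, absence = 0
df = pd.get_dummies(df, columns=categorical_cols)      # one binary column per category

scaler = StandardScaler()                              # z-score: (x - mean) / std
df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
```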
Feature Selection
To train machine learning (ML) classifiers—including Linear Support Vector Machine (SVM), SVM with Radial Basis Function kernels (SVMRBF), Logistic Regression, Random Forests, Naïve Bayes, and Extreme Gradient Boosting (XGBoost)—an optimal feature count range of 8 to 22 was identified. This range varied based on each classifier's feature-handling properties and aimed to optimize training efficacy and generalizability (15–21), given the training (n ≈ 700) and validation (n ≈ 300) sample sizes. The selection aimed to capture the central tendency, dispersion, complexity, and heterogeneity of the lung CT images. Ten-fold cross-validation was used to validate the feature selection process, ensuring consistency across different subsets of data.
Recursive Feature Elimination (RFE) was employed to select the most relevant features by iteratively training a model, ranking features by importance, and removing the least significant ones. This process continued until an optimal subset was obtained, particularly for SVM with RBF kernels and Linear SVM classifiers, which are sensitive to feature selection. This technique typically identified 7 to 13 optimal features, depending on the classifier.
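A minimal example of this RFE procedure for the Linear SVM is shown below, assuming a complete (already imputed) feature matrix; the target of 10 features is one point within the reported 7 to 13 range.

```python
# Recursive Feature Elimination with a linear SVM, as described above.
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator=linear_svm, n_features_to_select=10, step=1)
rfe.fit(X_train, y_train)

selected_features = X_train.columns[rfe.support_]   # ranked subset retained for training
```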
For Random Forest and XGBoost classifiers, feature importance was calculated from the frequency with which a feature was used in decision-tree splits and its impact on reducing impurity. This approach led to the selection of the most impactful features, generally between 13 and 22, based on their contribution to the overall model's accuracy and robustness. Logistic Regression and Naïve Bayes classifiers used a statistical approach for feature selection, examining the correlation between each feature and the target variable: Pearson's correlation was used for continuous variables, while chi-square tests were used for categorical variables, and features with statistically significant correlations (p < 0.05) were retained. Additionally, the Minimum Redundancy Maximum Relevance (mRMR) method was applied to minimize feature redundancy, typically yielding a set of 10 to 15 significant features.
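The statistical filtering for Logistic Regression and Naïve Bayes could be implemented along the following lines; the column lists reuse the hypothetical names from the encoding sketch and assume complete data.

```python
# Univariate filtering: Pearson correlation for continuous features,
# chi-square tests for categorical ones, keeping features with p < 0.05.
import pandas as pd
from scipy.stats import pearsonr, chi2_contingency

retained = []
for col in continuous_cols:
    _, p_value = pearsonr(X_train[col], y_train)
    if p_value < 0.05:
        retained.append(col)

for col in binary_cols:
    contingency = pd.crosstab(X_train[col], y_train)
    _, p_value, _, _ = chi2_contingency(contingency)
    if p_value < 0.05:
        retained.append(col)
```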
ML Classifiers
We implemented multiple imputation using chained equations instead of median imputation and designed two data preprocessing pipelines: one used StandardScaler for feature normalization, while the other retained the original feature scales. The multiple imputation process involved 20 iterations, following Rubin's rules. We chose the ML models based on their proven effectiveness with similar datasets. The selected models were Linear SVM (with a linear kernel and a tuned regularization parameter C), SVMRBF (optimized for C and gamma), Logistic Regression (with C-controlled regularization using both L1 and L2 penalties), Random Forests (fine-tuning max_depth, min_samples_split, and min_samples_leaf), Naïve Bayes (optimized var_smoothing), and XGBoost (optimized learning_rate, max_depth, n_estimators, and min_child_weight).
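The two preprocessing pipelines can be sketched with scikit-learn's `IterativeImputer`, which implements a chained-equations scheme; a single imputation run is shown here, and the pooling of several imputed datasets under Rubin's rules is omitted for brevity. Logistic Regression stands in for the other classifiers.

```python
# Two preprocessing pipelines: iterative (chained-equation) imputation
# with and without StandardScaler, feeding one of the candidate classifiers.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

scaled_pipeline = Pipeline([
    ("impute", IterativeImputer(max_iter=20, random_state=42)),   # 20 iterations, as reported
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])

unscaled_pipeline = Pipeline([
    ("impute", IterativeImputer(max_iter=20, random_state=42)),
    ("clf", LogisticRegression(penalty="l2", C=1.0, max_iter=1000)),
])
```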
Bayesian optimization was used for hyperparameter tuning, replacing grid search to improve efficiency. This approach used Gaussian processes as priors to explore the hyperparameter space with 4-fold cross-validation to determine optimal settings based on accuracy. To address class imbalances, we conducted exploratory data analysis to identify and quantify imbalances. We used Adaptive Synthetic Sampling (ADASYN) and Synthetic Minority Over-sampling Technique (SMOTE) for resampling. Additionally, cost-sensitive learning adjusted misclassification costs to favor correct predictions of the minority class. Balanced accuracy and the F1 score were chosen as performance metrics due to their sensitivity to class imbalances.
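One way to combine the Bayesian search, 4-fold cross-validation, and SMOTE resampling is sketched below with scikit-optimize and imbalanced-learn; the search ranges and iteration count are illustrative, XGBoost stands in for the other classifiers, and a complete (imputed) feature matrix is assumed.

```python
# Bayesian hyperparameter search (Gaussian-process surrogate by default) over an
# imbalanced-learn pipeline that applies SMOTE before fitting XGBoost.
from skopt import BayesSearchCV
from skopt.space import Real, Integer
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),                 # oversample the minority class
    ("clf", XGBClassifier(eval_metric="logloss")),
])

search = BayesSearchCV(
    pipeline,
    {
        "clf__learning_rate": Real(0.01, 0.3, prior="log-uniform"),
        "clf__max_depth": Integer(2, 8),
        "clf__n_estimators": Integer(100, 500),
        "clf__min_child_weight": Integer(1, 10),
    },
    n_iter=32,                 # illustrative budget
    cv=4,                      # 4-fold cross-validation, as reported
    scoring="accuracy",
    random_state=42,
)
search.fit(X_train, y_train)
```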
Validation Strategy
Our validation strategy employed 10-fold cross-validation with stratification to maintain class distribution across folds. We assessed classifier performance using adjusted sensitivity, specificity, and precision metrics, focusing on accurate classification of the minority class. Model performance was analyzed using the area under the curve (AUC), considering the AUC range across folds and the overall AUC for the validation dataset. This approach provided insights into model stability and generalizability.
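The fold-wise AUC analysis can be reproduced along the following lines, where `pipeline` is any of the preprocessing-plus-classifier pipelines defined earlier.

```python
# Stratified 10-fold cross-validation with per-fold AUC.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_aucs = cross_val_score(pipeline, X_train, y_train, cv=cv, scoring="roc_auc")

print(f"AUC per fold: {np.round(fold_aucs, 3)}")
print(f"Mean AUC: {fold_aucs.mean():.3f} (range {fold_aucs.min():.3f}-{fold_aucs.max():.3f})")
```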
Feature Importance and Stability
To assess feature importance, we used normalized coefficients in linear models such as Linear SVM and Logistic Regression. Ensemble methods, such as Random Forests and XGBoost, used built-in feature importance metrics, complemented by permutation importance for a more comprehensive evaluation. RFE was applied for SVMRBF to dynamically evaluate feature contributions, and Naïve Bayes determined feature significance through variance ranking. SHapley Additive exPlanations (SHAP) values were used across all models to offer interpretable, model-agnostic insights into feature importance. Stability selection techniques used subsampling and aggregation to calculate stability percentages for each feature across various data subsets.
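A sketch of the model-agnostic part of this analysis (permutation importance and SHAP values) is given below, assuming the fitted XGBoost pipeline from the earlier tuning sketch.

```python
# Permutation importance on the held-out set and SHAP values for the fitted tree model.
import shap
from sklearn.inspection import permutation_importance

fitted_model = search.best_estimator_.named_steps["clf"]          # fitted XGBoost (assumed)

perm = permutation_importance(search.best_estimator_, X_test, y_test,
                              n_repeats=30, random_state=42, scoring="roc_auc")

explainer = shap.TreeExplainer(fitted_model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)   # global ranking of feature contributions
```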
Unsupervised Learning
For unsupervised learning, Principal Component Analysis (PCA) was used to reduce each model's feature set to two components for plotting the SVMRBF decision boundary. Principal components were retained based on an eigenvalue > 1 criterion and scree plot analysis, ensuring that substantial variance was captured while avoiding overfitting.
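The PCA step might be implemented as follows, using the feature subset retained earlier; both the two-component projection and the eigenvalue inspection are shown.

```python
# PCA on standardized features: eigenvalue inspection plus a 2-D projection
# for plotting the SVMRBF decision boundary.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X_train[selected_features])

pca_full = PCA().fit(X_scaled)
eigenvalues = pca_full.explained_variance_      # components with eigenvalue > 1 retained

pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)           # 2-D projection for the decision-boundary plot
```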
Configurations of the Generative Pre-trained Transformer (GPT)
This study used specialized configurations of the GPT-4 model, designed through prompt engineering and schema modifications to meet specific research requirements. The tasks included optimizing machine learning classifier parameters, conducting comprehensive literature reviews, and enhancing manuscript language quality.