Study population
We included 811,244 individuals who attended the Health Management Center & Physical Examination Center between 2013 and 2019. Participants were enrolled from Sichuan province, most of them from Chengdu city, and account for about 1% of the population of Sichuan province and 5% of the population of Chengdu city. The participants represented 35 health states, defined by either a healthy status or the presence of an underlying disease condition (unhealthy status). Specifically, the study population included 711,928 healthy participants, 46,981 patients with hypertension, 11,745 patients with diabetes, and 32,960 participants with other unhealthy statuses (mainly chronic diseases) (Table 1). In addition, 7,630 participants with 12 diseases were enrolled in 2019 as a separate dataset for replication of the predictions. We included 221 PEIs in our analyses, which included patient demographic information (age and sex) and lifestyle indicators (alcohol consumption, tobacco use, etc.).
PEI correlations in participants with a healthy physical status
We first explored the PEI correlations in participants with a healthy status to provide an overall landscape. Among the 221 PEIs, we found 7,662 significant correlations out of all 24,322 PEI pairs (31.5%; P<0.05/24,322 PEI pairs=2×10⁻⁶) (Table 1, Table S1) in those with a healthy physical status (N=711,928, mean age 41.4, female=45.7%). This finding suggests a wide range of correlations between PEIs (Fig. 1). The top 50 correlated PEIs included sex, age, red blood cell count, prealbumin (PAB), history of alcohol intake (alcohol consumption, drinking), alkaline phosphatase level (ALP), tobacco use (smoking), and so on (Fig. 1(a)). The number of significantly correlated PEIs for each of the 221 PEIs also suggested rich correlations between PEIs (Fig. 1(b)). Some of the correlations identified in the healthy status are consistent with the reported literature, but most are newly discovered in this study.
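The significance threshold above is a Bonferroni correction: the nominal level of 0.05 divided by the number of PEI pairs tested. A minimal Python sketch of that arithmetic follows, together with an illustrative pairwise test. The text does not state which correlation statistic or test was used, so the Pearson coefficient with a Fisher z normal approximation, and all function names, are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_p_value(x, y):
    """Return (r, two-sided p) via the Fisher z approximation.

    Assumed test, not necessarily the one used in the study; needs n > 3.
    """
    r = pearson_r(x, y)
    if abs(r) >= 1.0:  # perfect correlation: atanh is undefined
        return r, 0.0
    z = math.atanh(r) * math.sqrt(len(x) - 3)
    return r, 2.0 * (1.0 - NormalDist().cdf(abs(z)))

# Numbers reported in the text: 24,322 pairs tested, 7,662 significant.
threshold = bonferroni_threshold(0.05, 24_322)
share = 7_662 / 24_322
print(f"threshold = {threshold:.2e}, significant share = {share:.1%}")
```

With the reported counts, this reproduces the threshold of roughly 2×10⁻⁶ and the 31.5% share of significant pairs.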
General inspection PEIs showed rich correlations with each other and with other PEIs. For example, sex showed the richest PEI correlations (151 PEI pairs, males vs. females), including hemoglobin (Hb), creatinine, uric acid (UA), drinking, smoking, body mass index (BMI), etc., which reflect the differences in body shape, physique, and living habits between males and females (Fig. 1, Fig. 2, Table S1). Age also showed strong PEI correlations (125 PEI pairs), such as estimated glomerular filtration rate (eGFR), systolic pressure (SBP), diastolic pressure (DBP), albumin (Alb), and low-density lipoprotein (LDL-C). These findings suggest that body functions change systematically with increasing age (Fig. 1, Fig. 2, Table S1). We also found 124 PEI correlations with BMI, including UA, high-density lipoprotein (HDL-C), SBP, and DBP, which reflects the strong influence of body shape on PEIs (Fig. 1, Fig. 2, Table S1). For blood pressure (BP), which has many physiological meanings, we identified a set of correlated PEIs, including 125 PEIs for DBP and 124 PEIs for SBP (Fig. 1, Fig. 2, Table S1). Intraocular pressure (IOP) is an important factor for the diagnosis of glaucoma [12]. We found 79 PEIs that were weakly correlated with IOP of the left eye (IOP-L), including IOP of the right eye (IOP-R), SBP, DBP, Alb, BMI, triglyceride (TG), apolipoprotein B (ApoB), drinking, and total cholesterol (TC). Similar to IOP-L, 73 PEIs were weakly correlated with IOP-R (Fig. 1, Fig. 2, Table S1).
As expected, blood lipid PEIs displayed many correlations. For example, 119 PEIs correlated with triglyceride (TG) (Fig. 1, Fig. 3, Table S1). We found 122 PEIs that correlated with HDL-C, with many negative correlations, including TG, UA, and BMI (Fig. 1, Fig. 2, Table S2). The correlation patterns of LDL-C and HDL-C showed opposite trends (Fig. 1, Fig. 2, Table S1). Living habits also have a profound impact on the body. Consistent with this, we detected 130 PEIs that correlated with drinking, such as sex, smoking, Hb, and UA (Fig. 1, Fig. 2, Table S1). Similarly, 128 PEIs were correlated with smoking, including drinking, sex, and age (Fig. 1, Fig. 2, Table S1). We also detected 58 PEIs that weakly correlated with exercise habits (e-habits), including age, eGFR, and SBP (Fig. 1, Fig. 2, Table S1). Tumor marker expression can indicate the occurrence and development of tumors. We detected weak correlations between several tumor markers and PEIs. For example, 88 PEIs were correlated with cytokeratin 19 fragment (CYFRA 21-1); 83 PEIs were correlated with tumor-supplied group factors (TSGF); 64 PEIs were correlated with neuron-specific enolase (NSE); and 64 PEIs were correlated with complexed prostate-specific antigen (C-PSA) (Fig. 1, Fig. 2, Table S1).
PEI correlations in individuals with an unhealthy physical status
Next, we examined the PEI correlations in the 34 unhealthy physical states and again identified rich correlations (Table 1). Compared with the healthy physical state, we found fewer significant PEI correlations in those with an unhealthy physical status, which might be caused by a sample size effect (Table 1, Table S2-S35). Each unhealthy physical state has its own correlation spectrum, and most of these correlations are newly discovered in this study. For example, in the hypertension population, we found 4,413 significant correlations among the 24,322 PEI pairs of the 221 PEIs (18.3%) (Table S2). The PEIs with an increased number of correlations included monocytes (MON) (70 in hypertension vs 6 in the healthy physical state, the same below), quantitative detection of hepatitis B virus DNA (HBV-DNA) (76 vs 33), quantitative detection of hepatitis C virus RNA (HCV-RNA) (49 vs 8), etc. (Table S2). Those with both hypertension and coronary heart disease (hypertension+coronary) had an increased number of correlations for Rh blood group compared with the healthy cohort (41 vs 9). Conversely, the number of correlations for homocysteine (Hcy) was greatly reduced in unhealthy versus healthy participants (2 vs 120). In diabetes, the number of correlations increased for 10 PEIs and decreased for the remaining 195 PEIs; the PEIs with increases included MON (41 vs 6), HCV-RNA (42 vs 8), anti-Scl-70 (59 vs 31), and HCV-cAg (35 vs 10) (Table S17). These results suggest that PEIs change systematically under an unhealthy status and that each disease has its own specific PEI spectrum.
Next, we explored the correlation networks among the PEIs using qgraph [13], which shows how the PEIs are linked to one another. In the healthy status, the PEIs showed rich interactions in both positive and negative directions (Fig. 3). Each unhealthy physical state showed its own unique PEI interaction network (Fig. 4 shows the networks for hypertension and diabetes). These results show that multiple indicators depend on one another in each physical state and can therefore be used in combination to assess physical health.
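A correlation network of this kind can be viewed as a graph whose nodes are PEIs and whose edges are the significant correlations, signed by direction. The sketch below is not qgraph (which is an R package); it is only a small Python illustration of the underlying data structure. The input format, the function name, and the correlation values in the usage lines are hypothetical.

```python
from collections import defaultdict

def build_network(pair_stats, threshold):
    """Build a signed edge list and node degrees from pairwise results.

    pair_stats maps (pei_a, pei_b) -> (r, p); this layout is an assumption.
    An edge is kept only when p is below the significance threshold.
    """
    edges = []
    degree = defaultdict(int)
    for (a, b), (r, p) in pair_stats.items():
        if p < threshold:
            sign = "positive" if r > 0 else "negative"
            edges.append((a, b, sign))
            degree[a] += 1
            degree[b] += 1
    return edges, dict(degree)

# Hypothetical values, for illustration only.
example = {
    ("BMI", "HDL-C"): (-0.30, 1e-20),
    ("BMI", "SBP"): (0.25, 1e-15),
    ("SBP", "NSE"): (0.001, 0.4),
}
edges, degree = build_network(example, threshold=0.05 / 24_322)
```

The degree of a node in this structure corresponds to the number of PEIs significantly correlated with that PEI.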
Candidate PEI markers for unhealthy physical status
To verify and discover candidate biomarkers, and to assess the impact of living habits, for early disease diagnosis, we next calculated the difference in each of the 221 PEIs between the healthy and unhealthy physical states. In total, we found 1,239 significant PEI differences between the healthy status and the 34 unhealthy physical statuses (P<0.05/34=0.0014, adjusted for 34 unhealthy physical statuses) (Table 1, Fig. 5, Table S36). For example, 112 PEIs were significantly different between patients with hypertension and healthy people, 100 PEIs were different between hypertension+diabetes and healthy people, and 91 PEIs were different between diabetes and healthy people. Some of these differences are consistent with previous findings, and the rest are newly discovered.
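The comparison described here can be sketched as one two-group test per PEI and per unhealthy status, with the threshold divided by the 34 statuses. The text does not name the statistical test, so the Welch statistic with a normal approximation used below is an assumption, and the function names and example values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

N_UNHEALTHY = 34
THRESHOLD = 0.05 / N_UNHEALTHY  # Bonferroni over the unhealthy statuses

def welch_z_p_value(a, b):
    """Two-sided p-value for a difference in means (large-sample approximation).

    Assumes each group has at least two values and non-zero variance overall.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def differing_statuses(healthy, by_status):
    """Names of the statuses in which one PEI differs from the healthy group."""
    return [name for name, values in by_status.items()
            if welch_z_p_value(values, healthy) < THRESHOLD]

# Hypothetical BMI-like values, for illustration only.
healthy = [22.0 + 0.1 * (i % 10) for i in range(200)]
by_status = {
    "status_a": [27.0 + 0.1 * (i % 10) for i in range(50)],
    "status_b": [22.0 + 0.1 * (i % 10) for i in range(50)],
}
print(differing_statuses(healthy, by_status))
```

For a very large test statistic, the computed p-value underflows to 0.0 in double-precision arithmetic, which is one possible reading of a reported P=0.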
For many of the 221 PEIs, we detected a difference between the healthy and unhealthy physical statuses, especially in PEIs related to physique, lifestyle, and blood lipids (Fig. 5, Table S36). For example, BMI differed between the healthy status and 16 of the 34 unhealthy physical statuses, including in patients with hypertension (P=0) and gout (P=6.48×10⁻⁹⁰). Exercise habits (E-habits) differed in 19 unhealthy statuses, including hyperlipidemia (P=1.28×10⁻²⁷⁷) and diabetes (P=4.20×10⁻²⁹). Dietary habits also differed in 10 unhealthy statuses, including chronic pharyngitis (P=2.59×10⁻¹⁹) and cholecystolithiasis (P=9.43×10⁻¹⁸). We detected differences in alcohol intake habits in 20 unhealthy statuses, including hyperlipidemia (P=0), coronary heart disease (P=4.06×10⁻²⁴), diabetes (P=1.09×10⁻²²), and Parkinson's syndrome (P=1.43×10⁻¹⁷). We also observed differences in smoking habits in 18 unhealthy statuses compared with the healthy condition, including hypertension (P=2.74×10⁻¹¹⁴), hyperlipidemia (P=2.69×10⁻⁶²), and Parkinson's syndrome (P=5.12×10⁻²⁹). We found differences for IOP-R in five unhealthy statuses compared with the healthy status, including hypertension (P=3.63×10⁻⁸⁵) and diabetes (P=2.01×10⁻⁷³); similar findings were obtained for IOP-L (Fig. 5, Table S36). For lipid PEIs, we also observed differences between the 34 unhealthy statuses and the healthy status. For example, LDL-C differed in 21 unhealthy statuses, including hypertension (P=0) and diabetes (P=2.95×10⁻²¹²). HDL-C differed in 17 unhealthy statuses, including diabetes (P=1.92×10⁻¹⁷⁷) (Fig. 5, Table S36). We further conducted a detailed analysis of HDL-C and diabetes and found that those with low HDL-C showed a significantly higher risk of developing diabetes than those with average values (1.26-1.75 mmol/L) in this population. Of note, those with high HDL-C levels also showed an elevated risk of developing diabetes (Fig. 6).
Tumor-associated antigens also displayed significant differences between the healthy and unhealthy statuses. For example, CYFRA 21-1 differed in 10 unhealthy statuses, including hypertension+diabetes (P=3.71×10⁻⁹⁷) and diabetes (P=4.52×10⁻⁷⁰). CEA1 differed in 12 unhealthy statuses, including hypertension+coronary (P=9.59×10⁻²⁹) and diabetes (P=1.73×10⁻¹⁸). Alpha-fetoprotein (AFP) differed in hepatopathy (P=1.08×10⁻²⁸). C-PSA differed in hypertension+coronary (P=8.38×10⁻²⁰). Finally, the carbohydrate antigen CA724 (CA 72-4) differed in asthma (P=9.92×10⁻¹³), gout (P=3.53×10⁻⁷), and coronary+diabetes (P=4.06×10⁻⁵) (Fig. 5, Table S36). We also detected significant differences between the healthy and unhealthy statuses among other PEIs. For example, we found differences in urine sugar levels (U-GLU) in nine unhealthy statuses, including diabetes and its associated diseases. The eosinophil rate (eo%) differed in five unhealthy statuses, including asthma (P=1.38×10⁻¹²⁹) and rhinallergosis (P=4.05×10⁻¹⁸). Whole blood iron levels (WB-Fe) differed in 11 unhealthy statuses, including hypertension (P=2.52×10⁻⁶⁹). pH differed in 11 unhealthy statuses, including diabetes (P=1.97×10⁻²³⁹), hypertension (P=2.41×10⁻¹⁶⁶), hypertension+diabetes (P=9.90×10⁻³²), and gout (P=9.82×10⁻¹⁵). Potassium (K⁺) differed in five unhealthy statuses, including hypertension (P=1.98×10⁻¹¹⁹) and hepatitis B (P=3.13×10⁻¹⁰). We also detected differences in magnesium (Mg²⁺) in hypertension+diabetes (P=3.14×10⁻⁵⁸) and diabetes (P=5.10×10⁻⁵²). Hcy (an indicator of cardiovascular disease) differed in eight unhealthy statuses, including hypertension (P=1.97×10⁻¹³⁶) and Parkinson's syndrome (P=1.76×10⁻⁷) (Fig. 5, Table S36). These results provide a set of candidate markers for the early diagnosis of chronic diseases.
Machine learning to predict healthy and unhealthy physical status from PEIs
A key objective of this study was to apply PEI data and machine learning to develop algorithms that can predict common diseases from a general physical examination. We tried three machine learning models: kernelized support vector machine (SVM), multilayer perceptron (MLP), and random forest. The MLP models yielded a low F1 score, recall, and precision on our initial training data, and the SVM model took tens of hours for a single binary classification, so we excluded both from further training. Random forest was more suitable for our data: a binary classification took only 2–3 minutes, and its predictive performance was much better than that of MLP and SVM. However, the random forest did not perform well in the multi-class classification of all the physical statuses. We therefore used binary classification for each pair of healthy and unhealthy physical status (e.g. hypertension vs. healthy people; Parkinson's syndrome vs. healthy people), which performed better than the multi-class classification. We then optimized this prediction algorithm. Because the data were severely class-imbalanced, we adopted random under-sampling, which balances the data by randomly selecting a subset of the over-represented class. For each physical status, the top 15% or 16% most representative PEIs were extracted for prediction by feature extraction. The advantage of this method is that it is usually very fast and completely independent of the model applied after feature selection.
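The two preprocessing steps, random under-sampling and model-independent selection of the top-scoring features, can be sketched with the Python standard library as follows. This is an illustration under stated assumptions, not the study's implementation: the scoring rule (absolute standardized mean difference), the function names, and the synthetic data are all hypothetical. The balanced rows and selected columns would then be passed to a random forest classifier, for instance scikit-learn's `RandomForestClassifier`, which is not shown here.

```python
import random
from statistics import mean, pstdev

def undersample(rows, labels, seed=0):
    """Randomly drop rows of the larger class until both classes are equal."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    keep = sorted(rng.sample(pos, k) + rng.sample(neg, k))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def top_percent_features(rows, labels, percent=15):
    """Indices of the top `percent` of features by a univariate score.

    Score (an assumed choice): |mean difference| / overall standard deviation.
    The ranking uses no downstream model, so it is fast and model-independent.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        a = [r[j] for r, y in zip(rows, labels) if y == 1]
        b = [r[j] for r, y in zip(rows, labels) if y == 0]
        spread = pstdev(a + b) or 1.0  # avoid division by zero
        scores.append(abs(mean(a) - mean(b)) / spread)
    n_keep = max(1, round(n_features * percent / 100))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# Synthetic data: 90 "healthy" (0) and 10 "unhealthy" (1) rows, 20 features,
# with feature 3 made informative on purpose.
rng = random.Random(1)
labels = [0] * 90 + [1] * 10
rows = [[rng.gauss(0, 1) + (5 * y if j == 3 else 0) for j in range(20)]
        for y in labels]

balanced_rows, balanced_labels = undersample(rows, labels)
selected = top_percent_features(balanced_rows, balanced_labels, percent=15)
```

Under-sampling is applied before feature scoring here so that the large healthy group does not dominate the score; whether the study used this order is not stated.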
Finally, in the random forest prediction for each pair of healthy and unhealthy physical status, the area under the receiver operating characteristic curve (AUC) reached 66%-99% depending on the unhealthy physical status (average 87.6%) (Fig. 7, Table 2, Tables S37 and S38). For classification, AUC values above 90% indicate excellent performance, and values from 80% to 90% indicate good performance. Our algorithm provided excellent predictions for 18 of the 34 unhealthy physical statuses (AUC>90%) and good performance for another 9 (80%<AUC<90%). Heart-related diseases were predicted particularly well. For example, using 30 extracted PEI features (age, leukocyte count, monocytes, Mon%, mean corpuscular volume, red blood cell count, red cell distribution width, lymphocyte rate, platelet count, low-density lipoprotein, high-density lipoprotein, total cholesterol, carcinoembryonic antigen 1, albumin, albumin-globulin, cystatin C, glucose, urine sugar, urine creatinine, estimated glomerular filtration rate, creatinine, urea, waistline, waist-hip ratio, body mass index, operation history, systolic pressure, height, neck size, and anamnesis), the model for hypertension+diabetes+coronary heart disease reached an AUC of 99% using just 909 training samples and 387 validation samples (f1-score (95%CI): 0.96 (0.95-0.96); accuracy (95%CI): 0.95 (0.94-0.97); specificity (95%CI): 0.95 (0.94-0.95); recall (sensitivity) (95%CI): 0.95 (0.94-0.97)). The model for Parkinson's syndrome reached an AUC of 97% using 192 training samples and 83 validation samples (f1-score (95%CI): 0.91 (0.90-0.91); accuracy (95%CI): 0.90 (0.89-0.90); specificity (95%CI): 0.87 (0.79-0.94); recall (95%CI): 0.90 (0.89-0.91)).
For hepatic adipose infiltration, our algorithm also provided good prediction performance using 803 training samples and 115 validation samples (f1-score (95%CI): 0.82 (0.78-0.87); accuracy (95%CI): 0.81 (0.76-0.86); specificity (95%CI): 0.75 (0.67-0.82); recall (95%CI): 0.82 (0.77-0.87); AUC (95%CI): 0.92 (0.89-0.94)). Chronic rhinitis had the lowest prediction performance in this study (AUC (95%CI): 0.66 (0.60-0.72)). When all unhealthy physical statuses were grouped into a single “unhealthy” status, our algorithm also provided good predictions: f1-score (95%CI): 0.83 (0.83-0.83); accuracy (95%CI): 0.82 (0.82-0.82); specificity (95%CI): 0.81 (0.81-0.81); sensitivity (95%CI): 0.84 (0.84-0.84); and AUC (95%CI): 0.90 (0.90-0.90). These results suggest that, with feature extraction of the PEIs (15-16% of all 221 PEIs) and only a small number of samples, our random forest algorithms performed well for the majority of unhealthy physical statuses.
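For reference, the reported metrics follow from the confusion matrix of a binary classifier, and the AUC can be read as the probability that a randomly chosen unhealthy participant receives a higher score than a randomly chosen healthy one. The sketch below shows these definitions in plain Python; the function names and toy inputs are illustrative, and the confidence intervals are not covered because the text does not say how they were computed.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), specificity and F1 for 0/1 labels.

    Assumes both classes occur and at least one positive is predicted.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in sample size, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: scores thresholded at 0.5 to obtain predicted labels.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```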
To further validate our random forest prediction model, we performed a replication analysis of 12 diseases in a new, separate dataset. The results are presented in Fig. 8 and Table 3. The AUC in the replication data ranged from 0.63 to 0.98 (average 0.90) (Fig. 8), suggesting good predictive performance given the limited samples. For the remaining diseases, we did not obtain enough samples in the new dataset for replication (<100 samples).
In this study, the top 15% or 16% representative PEIs were extracted by feature extraction for random forest prediction in each physical status (Table S38), which reached an AUC of 66%–99%, depending on the physical state. In total, 161 PEIs were used for the random forest prediction of 35 pairs of health statuses. Some PEIs were used more frequently than others, suggesting their important physiological value for the human body. The top 20 used PEIs included monocyte counts (used in 36 health statuses, 100%; the same format below), anamnesis (33, 92%), age (32, 89%), albumin (31, 86%), estimated glomerular filtration rate (30, 83%), systolic pressure (27, 75%), waistline (27, 75%), red cell distribution width (26, 72%), creatinine (23, 64%), neck size (23, 64%), operation history (23, 64%), red blood cell count (23, 64%), urea (22, 61%), waist-hip ratio (22, 61%), BMI (21, 61%), gender (21, 61%), height (20, 56%), glucose (19, 53%), hemoglobin (19, 53%), and platelet count (19, 53%). Other PEIs were rarely used, suggesting that they are specific indicators of a certain disease. For example, sodium was selected only for cholecystolithiasis prediction, and cholinesterase was selected only for rhinallergosis prediction. Our results provide proof of concept for predicting health conditions using only a set of PEIs.
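The usage counts and percentages listed above amount to counting, for each PEI, how many of the per-status models selected it and dividing by the number of models. A minimal sketch of that tally is given below; the function name and the example feature lists are hypothetical and do not reproduce Table S38.

```python
from collections import Counter

def feature_usage(selected_by_status):
    """Return (feature, n_models, percent) tuples, most used first.

    selected_by_status maps a status name to the features its model used.
    """
    n_models = len(selected_by_status)
    counts = Counter(
        feature
        for features in selected_by_status.values()
        for feature in dict.fromkeys(features)  # de-duplicate, keep order
    )
    return [(name, n, round(100 * n / n_models))
            for name, n in counts.most_common()]

# Hypothetical selections, for illustration only.
example = {
    "hypertension": ["age", "monocytes", "SBP"],
    "diabetes": ["age", "monocytes", "glucose"],
    "gout": ["monocytes", "urea"],
}
usage = feature_usage(example)
```

A feature that appears in a single model, such as "urea" in this toy input, corresponds to the rarely used PEIs described in the text.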