The Study of Key Tongue Features for Prediabetes Risk and Diabetes Risk using Principal Component Analysis and Logistic Regression

Background: Routine diabetes risk assessment and prediction helps to control the incidence of diabetes. Previous studies have confirmed that tongue features can reflect changes in glucose metabolism. We aimed to find the key tongue features related to glucose metabolism for early warning of prediabetes and diabetes, which would benefit the prevention and control of diabetes. Methods: We investigated 719 subjects in Shuguang Hospital affiliated with Shanghai University of TCM in Shanghai, China, from August 2018 to December 2019. We used PCA to reconstruct 25 new features on the basis of the original features, and used logistic regression to analyze these new features. Using the factor loading method, we associated the new features with the original features. Finally, we determined the key tongue features and change patterns related to prediabetes risk and diabetes risk. Results: For prediabetes, TB-ASM, TC-ASM and TB-Cr are protective factors, and TB-CON, TC-CON, TB-MEAN, TC-MEAN, TB-ENT and TC-ENT are risk factors. For diabetes, TB-a, TB-S, TB-Cr, TB-b and TC-b are protective factors, and Per-all and TB-Cb are risk factors. Conclusion: PCA eliminates the redundancy of the original features and refines them. PCA combined with logistic regression identified the key tongue features reflecting glucose metabolism and clarified the significance of the changes in these tongue features in the prediabetes and diabetes populations. This study provides a research foundation for using tongue features to implement early warning of diabetes risk.


Introduction
Tongue diagnosis is a rapid, objective, simple, noninvasive and valuable method that gives TCM doctors access to physiological and pathological information. The doctor observes the color, texture and moisture of the tongue body and tongue coating to identify the health status of an individual. With the development of standardized tongue diagnosis equipment, objective tongue information continues to accumulate. Researchers use tongue diagnosis equipment to collect and analyze subjects' tongue images and to calculate and extract tongue features.
According to the WHO, prediabetes is associated with a high risk of developing diabetes. In addition, prediabetes is related to early atherosclerosis, subclinical inflammation, and cardiovascular disease (CVD). Prediabetes results from the joint action of insulin resistance and β-cell dysfunction [1,2]. The WHO has set the prevention of type 2 diabetes mellitus (T2DM) as a priority goal. At the same time, the 2018 Berlin Declaration called on the world to take early measures to prevent T2DM. Individuals at high risk of diabetes should actively implement lifestyle modification [3]. Many patients with diabetes go undiagnosed because of its insidious onset [4]. Delay in diagnosis and treatment not only affects the patient's prognosis, but also incurs high medical costs [5,6]. Therefore, it is necessary to establish a routine diabetes risk screening mechanism. However, routine blood glucose monitoring is an invasive examination and incurs a cost. A low-cost, convenient, non-invasive and accurate diabetes risk screening system is therefore needed. For example, the American Diabetes Association (ADA) recommends a noninvasive assessment tool to assess diabetes risk [7].
Our previous studies have confirmed that tongue features have a potential role in predicting the risk of diabetes [8]. However, which of the many tongue features are most important for the early warning of hyperglycemia still needs to be explored. Using PCA and logistic regression analysis, we aimed to reconstruct new features on the basis of the original features to resolve the collinearity and redundancy of the original features, and to find features that can accurately distinguish between normal blood glucose and hyperglycemia, especially the key tongue features for risk warning of prediabetes and diabetes. This will benefit the prevention and treatment of diabetes.

Subjects
In our study, we investigated 719 subjects in Shuguang Hospital affiliated with Shanghai University of TCM in Shanghai, China, from August 2018 to December 2019. Subjects were divided into a normal group (3.9 mmol/L < fasting blood glucose (FPG) < 6.1 mmol/L and HbA1c < 5.7%), a prediabetes group (FPG 6.1-6.9 mmol/L or HbA1c 5.7-6.4%), and a diabetes group (FPG ≥ 7.0 mmol/L or HbA1c ≥ 6.5%). We confirm that all experiments were performed in accordance with relevant guidelines and regulations. All subjects signed an informed consent form. The study was approved by the IRB of Shuguang Hospital affiliated with Shanghai University of TCM.
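The grouping rule above can be written as a small function. This is a minimal sketch, not the authors' code: the function name `classify_glycemic_status` is hypothetical, and it assumes that the diabetes criteria take precedence over the prediabetes criteria, that values falling between the stated ranges (for example FPG between 6.9 and 7.0 mmol/L) go to the lower category, and that subjects with FPG at or below 3.9 mmol/L who meet no other criterion are left unclassified, since the text does not say how they were handled.

```python
def classify_glycemic_status(fpg_mmol_l, hba1c_percent):
    """Assign a subject to a study group from FPG (mmol/L) and HbA1c (%).

    Assumptions (not stated in the text): diabetes is checked first, then
    prediabetes; low FPG (<= 3.9 mmol/L) without other criteria is returned
    as "unclassified".
    """
    # Diabetes: FPG >= 7.0 mmol/L or HbA1c >= 6.5%
    if fpg_mmol_l >= 7.0 or hba1c_percent >= 6.5:
        return "diabetes"
    # Prediabetes: FPG 6.1-6.9 mmol/L or HbA1c 5.7-6.4%
    if fpg_mmol_l >= 6.1 or hba1c_percent >= 5.7:
        return "prediabetes"
    # Normal: 3.9 < FPG < 6.1 mmol/L and HbA1c < 5.7%
    # (the upper bounds are already guaranteed by the checks above)
    if fpg_mmol_l > 3.9:
        return "normal"
    return "unclassified"


# Hypothetical example subjects, given as (FPG, HbA1c) pairs.
examples = [(5.0, 5.2), (6.5, 5.0), (5.0, 6.0), (7.2, 5.0), (3.5, 5.0)]
groups = [classify_glycemic_status(fpg, a1c) for fpg, a1c in examples]
```

Because the prediabetes and diabetes criteria are joined by "or", a single abnormal marker is enough to move a subject out of the normal group.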

Collection of clinical data
We collected basic information, blood samples and tongue images from the subjects on the same day. Basic information included gender, age, height, weight, systolic blood pressure (SBP) and diastolic blood pressure (DBP).

Collection and analysis of tongue features
Tongue image acquisition equipment (type TFDA-1 tongue diagnosis instrument) (Fig. 1) and analysis software (Tongue diagnosis analysis system, TDAS) were designed and developed by the intelligent diagnosis technology research team in Shanghai University of TCM [9] . TFDA-1 is equipped with a high-definition camera and standard light source to ensure a stable shooting environment. The tongue images were recorded as JPEG files (5568 x 3712 pixels).
Based on the color image segmentation method [10], TDAS first performed tongue segmentation. According to the difference between the pixel values of the tongue coating and the tongue body, the tongue coating was separated from the tongue body, and the color features of both in the RGB, Lab, HSI and YCrCb color spaces were calculated. According to the difference statistical method [11], TDAS automatically calculated the texture features of the tongue body and tongue coating. Two tongue coating ratio features were calculated: Per-all, the tongue coating area calculated from pixel values divided by the full tongue area, and Per-part, the tongue coating area calculated from pixel values divided by the tongue coating area calculated from pixel positions.
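The two coating ratios can be illustrated with boolean masks. The sketch below is only an illustration and does not reproduce TDAS: the function names, the representation of masks as nested lists, and in particular the way the position-based coating region is supplied are all assumptions, because the text does not describe how that region is obtained.

```python
def count_true(mask):
    """Count the True pixels in a 2-D boolean mask given as nested lists."""
    return sum(1 for row in mask for pixel in row if pixel)


def coating_ratios(tongue_mask, coating_by_value, coating_by_position):
    """Return (Per-all, Per-part) from three masks of equal shape.

    tongue_mask:          pixels belonging to the whole tongue
    coating_by_value:     coating pixels found from pixel values
    coating_by_position:  coating region defined by pixel position (assumed input)
    """
    tongue_area = count_true(tongue_mask)
    value_area = count_true(coating_by_value)
    position_area = count_true(coating_by_position)
    if tongue_area == 0 or position_area == 0:
        raise ValueError("tongue mask and position-based coating mask must be non-empty")
    per_all = value_area / tongue_area
    per_part = value_area / position_area
    return per_all, per_part


# Hypothetical 4 x 4 example: 16 tongue pixels, an 8-pixel position-based
# coating region, and 6 pixels classified as coating by value.
tongue = [[True] * 4 for _ in range(4)]
by_position = [[True] * 4, [True] * 4, [False] * 4, [False] * 4]
by_value = [[True, True, True, False], [True, True, True, False],
            [False] * 4, [False] * 4]
per_all, per_part = coating_ratios(tongue, by_value, by_position)
```

In this toy case the value-based coating covers 6 of the 16 tongue pixels and 6 of the 8 position-based coating pixels.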

Statistical analysis and machine learning modeling
Measurement data are presented as mean ± standard deviation (x̄ ± s) and median (interquartile range). For measurement data that conformed to a normal distribution and homogeneity of variance, the independent-samples t-test was used; for measurement data that did not conform to a normal distribution, the Mann-Whitney rank sum test was used. Enumeration data were analyzed by the Chi-square test. Statistical analyses were performed using Python (version 3.7.4).
We used principal component analysis (PCA) to reduce data dimensionality. While retaining as much of the information in the original data as possible, new features (principal components, PCs) were reconstructed on the basis of the original features to reduce data redundancy and to prevent multicollinearity among the original features from affecting the reliability of the logistic regression analysis. The PCA calculation involves five steps: 1) Standardization was performed on each feature (Eq. 1).
2) The covariance matrix (Eq. 5) was calculated based on the primary data (Eq. 2).

Results
For the prediabetes group, PC_0, PC_4, PC_8, PC_15 and PC_20 are risk factors, and PC_1 and PC_5 are protective factors (Table 5). The original features were associated with the PCs through the factor loading method (Table 6). Figure 3 shows the original features highly related to PC_0.
For the diabetes group, the PCs identified as risk factors and protective factors are listed in Table 7. The original features were associated with the PCs through the factor loading method (Table 8). Figure 9 shows the original features highly related to PC_0, Figure 10 shows the original features highly related to PC_2, and Figure 11 shows the original features highly related to PC_3. Figures 12, 13 and 14 show the biplots of PC_0 and PC_2, PC_0 and PC_3, and PC_2 and PC_3, respectively.

Discussion
Individuals with prediabetes, especially those with impaired glucose tolerance (IGT) or abnormal HbA1c (6.0%-6.4%), not only are at increased risk of developing diabetes, but also have an increased risk of cardiovascular disease [12]. Among individuals with prediabetes, those later diagnosed with diabetes had medical expenditures nearly one third higher than those who were not later diagnosed [13]. The best way to reverse the epidemic of diabetes is to intervene in prediabetes as early as possible [14]. For patients with undiagnosed diabetes, the insidious onset and lack of awareness of the disease lead to missed diagnosis and poor prognosis [15,16]. Therefore, it is very important to develop reliable diabetes risk warning tools.
Tongue features contain rich physiological and pathological information, can be obtained non-invasively, and are closely related to glucose metabolism, which gives them potential advantages [17]. Therefore, finding the key tongue features related to diabetes risk and clarifying the clinical significance of the changes in these tongue features will provide a basis for applying tongue features to diabetes risk prediction.
Logistic regression analysis requires that the sample size be more than 15 times the number of independent variables and that there be no multicollinearity among the independent variables. By introducing PCA, we reduced the data dimension to 25 features (PCs) and eliminated the collinearity of the original features. PCA makes full use of the original data and helps ensure the reliability of the analysis conclusions. We used PCA combined with logistic regression analysis to find the key tongue features reflecting glucose metabolism, and clarified the significance of the changes in these tongue features. However, tongue features are continuous measurement data, and classification standards need to be established to facilitate future clinical practice. In addition, this study did not consider the simultaneous increase of fasting blood glucose and HbA1c, which may indicate that the patient's glucose metabolism is more disordered.
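The role PCA plays here can be illustrated with a short sketch. The code below is not the authors' implementation: it assumes NumPy, uses synthetic data, and the helper name `pca_with_loadings` is hypothetical. It follows the usual textbook sequence (standardization, covariance matrix, eigen-decomposition, component selection, projection) and computes factor loadings as eigenvector entries scaled by the square root of the eigenvalue, which for standardized data equals the correlation between an original feature and a PC.

```python
import numpy as np


def pca_with_loadings(X, variance_to_keep=0.90):
    """Return (scores, loadings, explained_ratio) for the retained PCs.

    Constant (zero-variance) features are not handled in this sketch.
    """
    X = np.asarray(X, dtype=float)
    # 1) Standardize each feature (z-score).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2) Covariance matrix of the standardized data.
    C = np.cov(Z, rowvar=False)
    # 3) Eigen-decomposition; eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard tiny negative values
    eigvecs = eigvecs[:, order]
    # 4) Keep the smallest number of PCs reaching the variance threshold.
    ratio = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(ratio), variance_to_keep)) + 1
    k = min(k, len(eigvals))
    # 5) Project onto the retained PCs and compute factor loadings.
    scores = Z @ eigvecs[:, :k]
    loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
    return scores, loadings, ratio[:k]


# Synthetic example with deliberately collinear columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=200),  # near-duplicate of column 0
    base[:, 1],
    base[:, 2],
    base[:, 1] + base[:, 2],                   # exact linear combination
])
scores, loadings, ratio = pca_with_loadings(X)
```

Because the component scores are mutually uncorrelated, they can enter a logistic regression without the multicollinearity present among the original features, and each row of `loadings` shows how strongly an original feature is tied to each retained PC.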

Conclusion
In this study, PCA was first used to reduce the dimensionality of the original data while retaining more than 90% of the original information; 25 new features were reconstructed from the 76 original features. Logistic regression was then used to calculate the OR value of each new feature, and the factor loading method was used to associate the original features with the new features. Through this study, the tongue features that are highly related to the risk of diabetes have been determined, and the clinical significance of the changes in these tongue features has been clarified. This lays the foundation for constructing an early warning model of diabetes risk based on tongue features.
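The way an OR value and a factor loading combine into a reading of an original feature can be sketched as a sign rule. This is an illustrative simplification rather than the authors' procedure: the function names are hypothetical, the OR values in the example are made up (only the loadings 0.308 and -0.435 come from the figure notes), and statistical significance is ignored.

```python
import math


def role_from_odds_ratio(odds_ratio):
    """Classify a PC from its odds ratio: OR > 1 is risk, OR < 1 is protective."""
    if odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    if odds_ratio > 1.0:
        return "risk"
    if odds_ratio < 1.0:
        return "protective"
    return "neutral"


def feature_role(pc_odds_ratio, loading):
    """Infer the role of an original feature from the PC's OR and its loading.

    A feature raises risk when its loading has the same sign as log(OR),
    and lowers risk when the signs differ.
    """
    if pc_odds_ratio <= 0:
        raise ValueError("odds ratio must be positive")
    effect = math.log(pc_odds_ratio) * loading
    if effect > 0:
        return "risk"
    if effect < 0:
        return "protective"
    return "neutral"


# Hypothetical OR values; loadings for TB-a and TB-b are those reported
# in the figure notes, the third loading is invented for illustration.
tb_a = feature_role(pc_odds_ratio=0.8, loading=0.308)    # protective PC, positive loading
tb_b = feature_role(pc_odds_ratio=1.2, loading=-0.435)   # risk PC, negative loading
tb_ent = feature_role(pc_odds_ratio=0.8, loading=-0.5)   # protective PC, negative loading
```

Under this rule a feature that loads positively on a protective PC, or negatively on a risk PC, is read as protective, and a feature that loads negatively on a protective PC is read as a risk factor, which matches the interpretations given in the figure notes.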

Declarations
Ethics approval and consent to participate We confirm that all experiments were performed in accordance with relevant guidelines and regulations. All subjects signed an informed consent form. The study was approved by the IRB of Shuguang Hospital affiliated with Shanghai University of TCM.
Consent for publication Not applicable.
Availability of data and materials The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Biplot of PC_1 and PC_4 in prediabetes group Note: PC_1 is a protective factor, and TB-ENT mainly acts on PC_1 (the vector that is sub-parallel to the x-axis). TB-ENT plays a negative role on PC_1 and is a risk factor.

Figure 10
Factor loadings of PC_2 in diabetes group Note: PC_2 is a protective factor. TB-a is positively correlated with PC_2, and the correlation coefficient is 0.308.

Figure 11
Factor loadings of PC_3 in diabetes group Note: PC_3 is a risk factor. TB-b and TC-b are negatively correlated with PC_3, and the correlation coefficients are -0.435 and -0.351, respectively.

Figure 12
Biplot of PC_0 and PC_2 in diabetes group Note: PC_0 is a risk factor. TB-Cr, TB-b and TC-b mainly act on PC_0 (the vectors that are sub-parallel to the x-axis); they play a negative role on PC_0 and are protective factors. PC_2 is a protective factor. TB-a and TB-S mainly act on PC_2 (the vectors that are sub-parallel to the y-axis); they act positively on PC_2 and are protective factors.

Figure 13
Biplot of PC_0 and PC_3 in diabetes group Note: PC_3 is a risk factor. Per-all mainly acts on PC_3 (the vector that is sub-parallel to the y-axis); it has a positive effect on PC_3 and is a risk factor.

Figure 14
Biplot of PC_2 and PC_3 in diabetes group Note: PC_3 is a risk factor. TB-Cb mainly acts on PC_3 (the vector that is sub-parallel to the y-axis); it has a positive effect on PC_3 and is a risk factor.