Dietary Protein Is a Strong Predictor of Coronary Artery Disease: A Data Mining Approach

Background and aims: Coronary artery disease (CAD) is the major cause of mortality and morbidity globally. Diet is known to contribute to CAD risk, and the dietary intake of specific macro- or micronutrients might predict CAD risk. Machine learning methods may help in analyzing the contribution of several dietary parameters, including macro- and micronutrients, to CAD risk. Here we aimed to determine the most important dietary factors for predicting CAD. Methods: A total of 273 cases with more than 50% obstruction in at least one coronary artery and 443 healthy controls who completed a food frequency questionnaire (FFQ) were entered into the study. All dietary intakes were adjusted for energy intake. The QUEST method was applied to determine the diagnosis pattern of CAD. Results: A total of 34 dietary variables obtained from the FFQ were entered into the study, of which 23 were significantly associated with CAD according to the t-test. Of these 23 dietary input variables, adjusted protein, manganese, biotin, zinc and cholesterol remained in the model. According to our tree, protein intake alone could distinguish patients with angiographically confirmed coronary artery stenosis from healthy participants in up to 80% of cases. Dietary manganese intake was the second most important variable after protein. The accuracy of the tree was 84.36% for the training dataset and 82.94% for the testing dataset. Conclusion: Among the different dietary macro- and micronutrients, a combination of protein, manganese, biotin, zinc and cholesterol could predict the presence of CAD.


Introduction
Coronary artery disease (CAD) is the major cause of morbidity and mortality worldwide (1). CAD is also very prevalent in Iran compared to other countries. Therefore, finding a national program for reducing the lifestyle-related risk factors of CAD is fundamental (2). The most prevalent CAD risk factors are smoking, male gender, age, ethnicity, family history of the disease, high blood pressure, high blood cholesterol, diabetes, poor diet, lack of exercise, obesity, stress and blood vessel inflammation. These factors affect each patient differently (3).
The gold standard of CAD diagnosis remains invasive coronary angiography, but this procedure is associated with a risk of serious complications (4). Finding an appropriate, safe and non-invasive method of diagnosis is the aim of current diagnostic approaches (4). Evidence indicates a significant association of a limited number of dietary factors and dietary patterns with CAD (5). Previous association studies reported the dietary intake of minerals such as sodium, potassium, magnesium and zinc as risk factors associated with CAD (6-11). However, many previous studies have shown that vitamin C, vitamin E and selenium interventions do not reduce the risk of CAD (12,13). Furthermore, a study in China indicated that low fat and high fiber intake decrease CAD mortality (14). With regard to protein intake and the risk of CAD, a follow-up study of health professionals found a significant relationship between total protein and increased risk of CAD (15). However, in a review article, Pedersen et al. concluded that there was no significant relationship between protein intake and stroke or coronary heart disease (16).
Hence, applying novel algorithms to dietary intake patterns to predict CAD remains an important approach to risk stratification (17).
Machine learning is increasingly used to predict healthcare outcomes such as cost, utilization, and status. In machine learning, the purpose is to train an algorithm to learn how to map input features to an output.
Generally, any machine learning method applies the following steps: data preparation, algorithm selection, training, regularization, and evaluation (18). Different machine learning models for coronary artery disease have previously been built and analyzed (19-21). Nevertheless, the circumstances may vary with different situations, lifestyles, and accessible data and features. Thus, we believe that by constructing and validating prediction models, it becomes possible to distinguish patients at high risk of disease from those at low risk. Consequently, a diagnostic model for predicting CAD is necessary.
In the present paper, among the different methods of machine learning (artificial neural networks, deep learning, etc.), we employed a well-known technique called the decision tree (DT). A DT model is a graphical model whose structure resembles a tree. One advantage of a DT is that the resulting model is more interpretable. QUEST is a binary-split decision tree method of machine learning. In QUEST, the association between each input feature and the target is assessed by the ANOVA F-test (ordinal or continuous features) or Pearson's chi-square test (nominal features). The feature most strongly associated with the target is chosen to split the node (22). This method computes faster than other algorithms, and it has the further benefit of avoiding the bias that exists in other classification methods (23).
In the current study, QUEST was applied to construct a model that identifies the importance of factors related to the incidence of CAD and assesses dietary intake as a major CAD risk factor.
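The variable-selection step described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: it covers only the ANOVA F-test ranking of continuous features, it picks the largest F statistic as a stand-in for the smallest p-value, and it does not compute the split point. The function names (`anova_f`, `pick_split_feature`) and the toy values are hypothetical.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a continuous feature across classes."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = mean(values)
    group_means = {label: mean(g) for label, g in groups.items()}
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (group_means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    ss_within = sum((v - group_means[label]) ** 2
                    for label, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def pick_split_feature(features, labels):
    """Return the feature with the largest F statistic, plus all scores."""
    scores = {name: anova_f(vals, labels) for name, vals in features.items()}
    return max(scores, key=scores.get), scores


# Hypothetical energy-adjusted intakes for six participants (not study data).
labels = ["CAD", "CAD", "CAD", "healthy", "healthy", "healthy"]
features = {
    "protein": [12.0, 15.0, 14.0, -10.0, -13.0, -11.0],
    "zinc": [0.5, -0.2, 0.1, 0.3, -0.4, 0.0],
}
best, scores = pick_split_feature(features, labels)
print(best, scores)
```

With two classes and features tested on the same sample, ranking by the F statistic gives the same order as ranking by p-value, which is why the sketch skips the p-value computation.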

Subjects
The data were extracted from our previous case-control study, conducted between September 2011 and May 2013 (17). Of 1187 patients who underwent coronary angiography, 273 cases who had more than 50% obstruction in at least one coronary artery and whose food frequency questionnaire (FFQ) was available entered the study.
Healthy controls were selected from the same study. The healthy subjects had no signs or symptoms of CAD. Furthermore, they did not have any of the traditional risk factors of CAD. A total of 443 healthy controls with a completed FFQ were chosen.

FFQ
The dietary intake data of the current study population were collected with a semi-quantitative food frequency questionnaire (FFQ) that was validated in an Iranian population (19). This FFQ comprises 65 food items, and each food item consists of intake frequency (per day, per week, per month, seldom, and never) and portion size. After experienced nutritionists completed the FFQ, the dietary intake data were analyzed with Diet Plan 7 software. In this way, the dietary intake of micronutrients and macronutrients was obtained for all subjects.

Data adjustment
We applied the residual method of energy adjustment to each input attribute. In this method, the energy-adjusted intake measure is the residual from a regression model in which total energy intake is the independent variable and absolute nutrient intake is the dependent variable (2,24,25).
All variables that differed significantly between participants with positive angiography and healthy participants were considered as input variables. The input variables were adjusted protein, carbohydrate, sugar, fiber, total fat, cholesterol, monounsaturated fat, sodium, potassium, phosphorus, calcium, magnesium, iodine, manganese, zinc, selenium, carotene, folate, vitamin C, thiamin, retinol, niacin and biotin, as shown in Table 1. The model evaluated in this study had 23 input variables and one target variable. The target variable consisted of 2 classes: healthy and positive angiography.
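The residual method can be sketched as follows. This is a minimal illustration assuming a simple linear regression of nutrient intake on total energy intake fitted by ordinary least squares; the function name `energy_adjusted_residuals` and the sample values are hypothetical and are not taken from the study data.

```python
from statistics import mean


def energy_adjusted_residuals(nutrient, energy):
    """Residuals from regressing nutrient intake on total energy intake."""
    if len(nutrient) != len(energy) or len(nutrient) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    energy_mean = mean(energy)
    nutrient_mean = mean(nutrient)
    sxx = sum((e - energy_mean) ** 2 for e in energy)
    if sxx == 0:
        raise ValueError("energy intake must vary between participants")
    sxy = sum((e - energy_mean) * (n - nutrient_mean)
              for e, n in zip(energy, nutrient))
    slope = sxy / sxx
    intercept = nutrient_mean - slope * energy_mean
    # Residual = observed intake minus the intake predicted from energy.
    return [n - (intercept + slope * e) for e, n in zip(energy, nutrient)]


# Hypothetical intakes for five participants (not study data).
energy_kcal = [1800.0, 2200.0, 2500.0, 3000.0, 2000.0]
protein_g = [60.0, 85.0, 80.0, 110.0, 75.0]
adjusted_protein = energy_adjusted_residuals(protein_g, energy_kcal)
print(adjusted_protein)
```

By construction the residuals average to zero and are uncorrelated with energy intake, so a positive value means a participant consumes more of the nutrient than expected for their energy intake.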

Model
In this study, the QUEST method was used to analyze the data and build a diagnosis pattern for patients with coronary artery disease. A total of 716 participants were included. As is common practice with decision trees, the data were divided into training and testing groups: 70% of the participants (505 subjects) were randomly selected as the training group for constructing the decision tree, and the remaining 30% (211 subjects) formed the testing group for evaluating its performance.
A confusion matrix was used to evaluate the performance of the decision tree in classifying participants. The accuracy, sensitivity, specificity and the receiver operating characteristic (ROC) curve were measured for comparison.
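A random partition of this size can be sketched as follows. This is an illustration only: the text does not state how the random selection was performed or whether it was stratified by class, so the sketch assumes a simple unstratified shuffle. The function name `random_split`, the seed and the integer participant IDs are hypothetical.

```python
import random


def random_split(ids, n_train, seed=0):
    """Shuffle the IDs reproducibly and cut them into training and testing sets."""
    if not 0 < n_train < len(ids):
        raise ValueError("n_train must be between 1 and len(ids) - 1")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# 716 placeholder participant IDs; 505 go to training, as reported in the text.
participant_ids = list(range(716))
train_ids, test_ids = random_split(participant_ids, n_train=505, seed=42)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the partition reproducible, which matters when the training and testing accuracies are to be compared across runs.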
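The confusion-matrix measures named above can be computed as in the following sketch. The labels and predictions are hypothetical and do not reproduce the study's confusion matrices; the ROC curve and AUC are omitted because they require predicted class probabilities rather than hard labels. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="CAD"):
    """Return (tp, fp, tn, fn) for a two-class problem."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of CAD cases detected
        "specificity": tn / (tn + fp),   # share of healthy subjects cleared
    }


# Hypothetical true classes and tree predictions for ten participants.
y_true = ["CAD"] * 4 + ["healthy"] * 6
y_pred = ["CAD", "CAD", "CAD", "healthy", "healthy",
          "healthy", "healthy", "healthy", "CAD", "healthy"]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```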

Results
The micro- and macronutrient intakes obtained from the FFQ for all 716 subjects, in the angiogram-positive and healthy groups, are shown in Table 1. Of the 23 input variables, adjusted protein, manganese, biotin, zinc and cholesterol remained in the model. The final decision tree, with 12 leaves and 4 layers, is shown in Fig. 1. The if-then rules are shown in Table 2. The confusion matrices used to evaluate the decision tree are given in Table 3 for the training and testing datasets. The accuracy of the tree was 84.36% for the training dataset and 82.94% for the testing dataset. Other performance measures of the tree, including sensitivity, specificity and AUC, are shown in Table 4.

Discussion
This retrospective study was designed to create a tree that identifies the dietary risk factors for CAD. The decision tree is a data mining algorithm that is widely used for predicting medical conditions such as coronary artery disease (26). We observed that adjusted protein sat at the apex of the tree, indicating that a high level of protein intake was the most important risk factor for CAD. According to our tree, protein intake alone could distinguish patients with angiographically confirmed coronary artery stenosis from healthy participants in up to 80% of cases. Higher protein intake was associated with CAD. Dietary manganese was the second most important variable after protein. Interestingly, as shown in Table 2
There are a few studies investigating the risk factors of CAD using data mining. Taye et al. carried out a data mining study in 2346 subjects using a decision tree algorithm. They entered 10 variables into the decision tree model: sex, age, triglyceride (TG), total cholesterol (TC), low density lipoprotein (LDL), high density lipoprotein (HDL), fasting blood glucose (FBG), high sensitivity C-reactive protein (hs-CRP), systolic blood pressure (SBP) and diastolic blood pressure (DBP). They concluded that hs-CRP was the most important risk factor for CAD, and they also found that FBG, sex and age were other risk factors. They reported an accuracy of 95.3% for their tree (17). Moreover, Xing et al. evaluated the effect of several variables, including tumor necrosis factor-α (TNF-α), interleukin-6 (IL-6), interleukin-8 (IL-8), hs-CRP, methylputrescine oxidase-1 (MPO1), troponin I-2 (TNI2), sex, age, smoking, hypertension, and diabetes, on the prediction of CAD survival using three algorithms, one of which was the decision tree. They found that decision tree models had an accuracy of 89.6% (27).
To the best of our knowledge, this is the only study using data mining algorithms for risk stratification of angiographic results with dietary intake as a potential factor. However, many studies have examined the effects of dietary intake on CAD prediction using other methodologies. Nazeminezhad et al. divided their study population into three groups: 1) those with considerable disease (> 50% occlusion), 2) individuals with < 50% coronary artery occlusion, and 3) a control group. After evaluating dietary intake using a 24-h dietary recall method and dietary analysis, they found that those in the control group had lower dietary protein intake and higher manganese intake than those in the other two groups (2). In an 18-year follow-up study whose results are in line with our findings, the researchers used a validated food frequency questionnaire at 4 time points to assess nutrient intake. They observed that higher dietary vegetable protein significantly reduced the risk of fatal ischemic heart disease. They also found that intake of animal protein was associated with the occurrence of ischemic heart disease in healthy men (28). Furthermore, it has previously been shown that low consumption of protein and minerals (e.g. manganese) and high consumption of carbohydrate and fat are associated with more severe CAD (29). Moreover, the data of a previous study of 36 adults showed that high dietary protein/meat intake (more than 0.8 g protein/kg body weight/24 hours) induced CAD progression by increasing lipid deposition and activating inflammation and coagulation pathways.
Regarding the association between manganese, a micronutrient, and CAD, manganese induces the synthesis of cholesterol and fatty acids in the liver (30). Manganese is also a component of the enzymes superoxide dismutase and adenylyl cyclase, which are involved in antioxidant mechanisms (31,32).
There are several possible explanations for the differences between studies. The methodology, the type of protein intake (vegetable or animal) and the questionnaire used to assess dietary intake are the most important factors responsible for this diversity.
Because of the increased prevalence of CAD and the consequent heavy financial pressure on society, finding ways to predict this disease effectively is a major goal of healthcare communities (33). Data mining might be used to inform individualized preventive actions and also to define the impact of each variable on the studied association. However, data mining has some limitations. It is a complicated method that requires specific knowledge and skills. In addition, each application creates many rules, and selecting the meaningful ones requires experience.

Conclusion
Machine learning could be a powerful tool for risk stratification of diseases including CAD. Here we showed that the dietary intake of protein and manganese, along with zinc, biotin and cholesterol, could predict CAD with an accuracy of almost 85%.

Ethics approval and consent of participant: The study protocol was approved by the Ethics Committee of Mashhad University of Medical Sciences, and written informed consent was obtained from participants.

Consent of publication:
Not applicable.
Availability of data and materials: Not applicable.

Fig. 1. Final decision tree with 12 leaves and 4 layers