This section describes the recruitment of participants, body composition analysis, dataset, evaluation metrics, and the proposed method in detail. The abbreviations used in this research are listed in Table 1.
Table 1
Abbreviations used in this study.
Terms | Abbreviation
Support Vector Machine | SVM
Stochastic Gradient Descent | SGD
K-Nearest Neighbors | KNN
Multilayer Perceptron | MLP
Adaptive Boosting | Adaboost
Ensemble Diabetes Network | EDINet
2.1 Recruitment of participants
The Fasa cohort study was designed to assess the relationship between fat in different body areas and diabetes among residents of the rural region of Fasa, a city in the eastern part of Fars province in southwest Iran with a population of around 250,000 (12). The rural health care worker (Behvarz), who represents the primary health care system in each health home in villages and small towns, encouraged the public to participate in the survey (13). All individuals were chosen from a group of qualified volunteers. The dataset used in this study is a subset of the Fasa cohort containing the records of 4661 participants. Participants who reported having diabetes (571 diabetic participants) were separated from those who did not (4090 non-diabetic participants). This separation was based on the participants' self-declaration.
Furthermore, the dataset covers 2155 males and 2506 females. Individuals were aged between 35 and 70 years. This age group was chosen as the cohort's target group because its members are of an appropriate age to have been exposed to health themes.
2.2 The field office
A field office was established in the region's principal town. All necessary equipment, including computers and the body composition analyzer, is available at the field office.
2.3 Personnel resources
A field staff member works at the field office every day. A field supervisor, two physicians, interviewers, nurses, and an office assistant complete the team. The field supervisor is responsible for supervising ongoing interviews and ensuring that data collection regulations are adhered to.
2.4 Registration
Each individual's national ID code is used in the registration procedure. The field supervisor registers participants by scanning their identification cards and assigns each one a unique Personal Computer ID (PCID) number, which is visible on all forms from that point on. During registration, one of the evaluators assists the field supervisor. Throughout the several stages of the cohort study, the PCID number can be looked up using a search engine. At the same time, a cohort ID card bearing a digital bar code, a picture, and the cohort's name is issued. A separate notebook is also used to keep a registry, in which every person must be registered. The Behvarz's list of invited participants is compared with the previously compiled list. In each case, a written informed consent form is completed before registration. All surveys are electronic and are administered over a secure web portal (a web-based application). We took advantage of the electronic format: variable types are specified, and data input is halted unless the registrar performs a recheck and the submission is accepted.
2.5 Body composition analysis
Body composition analysis was performed on all participants. Participants stood on the device barefoot and in light clothing while holding its handles.
The device was the Tanita Segmental Body Composition Analyzer BC-418 MA (Tanita Corp, Japan). It is one of the first segmental body composition analyzers that provide measurements for the trunk and each limb, using 8 electrodes.
The device measures weight, Basal Metabolic Rate (BMR), Fat Percentage (FATP), defined as \(\frac{\text{Fat Mass}}{\text{Weight}} \times 100\), Fat Mass (FATM), Fat-Free Mass (FFM), Total Body Water (TBW), desirable body fat ranges, and segmental body fat information.
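As a quick illustration of the fat percentage definition, the sketch below computes FATP from fat mass and weight. The function names and the sample values are hypothetical and are not taken from the cohort data or the device firmware; the relation FFM = weight − fat mass is the standard definition of fat-free mass and is assumed here.

```python
def fat_percentage(fat_mass_kg: float, weight_kg: float) -> float:
    """Return FATP = fat mass / weight * 100."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    return fat_mass_kg / weight_kg * 100.0


def fat_free_mass(fat_mass_kg: float, weight_kg: float) -> float:
    """Return FFM as weight minus fat mass (standard definition, assumed)."""
    return weight_kg - fat_mass_kg


# Hypothetical participant: 80 kg body weight, 20 kg fat mass.
print(fat_percentage(20.0, 80.0))  # 25.0
print(fat_free_mass(20.0, 80.0))   # 60.0
```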
Segmental readings are divided into:
- Fat percentage (FATP): Right Leg Fat Percentage (RLFATP), Left Leg Fat Percentage (LLFATP), Right Arm Fat Percentage (RAFATP), Left Arm Fat Percentage (LAFATP), and Trunk Fat Percentage (TRFATP).
- Fat mass (FATM): Right Leg Fat Mass (RLFATM), Left Leg Fat Mass (LLFATM), Right Arm Fat Mass (RAFATM), Left Arm Fat Mass (LAFATM), and Trunk Fat Mass (TRFATM).
- Fat-free mass (FFM): Right Leg Fat-Free Mass (RLFFM), Left Leg Fat-Free Mass (LLFFM), Right Arm Fat-Free Mass (RAFFM), Left Arm Fat-Free Mass (LAFFM), and Trunk Fat-Free Mass (TRFFM).
- Predicted muscle mass for the right arm, right leg, left arm, and left leg.
The device has a Goal Setter (GS) function, which analyses the fat mass to be lost to attain a particular target. GS focuses attention on actual fat mass rather than weight.
2.6 Quality control
Surveillance systems are essential to data quality. At the first enrollment stage, the quality control team uses an empirically tested checklist to address data collection quality indicators. The team supervisor is a non-affiliated epidemiologist. The team takes into account the three dimensions of data and anthropometric measurements. Any flaws that are discovered are corrected and considered in the subsequent data cleaning process. The quality control manager, who is also one of the principal investigators (PIs), has access to the database and classified content. If data cleaning was required, another team member retrieved the coded data and performed the cleaning. The FACS software is designed to conduct routine checks and to detect outliers in data fields automatically. The investigators established the criteria, and the software alerts the operators to validate the flagged values. These values must be validated before the final registration is completed. Incomplete data, in terms of type, length, and scale of value, are also detected, and final approval is contingent on the stages of the PI validation process.
2.7 Dataset
The dataset, drawn from the Fasa cohort, included 4661 participants (571 diabetic and 4090 non-diabetic), of whom 2155 were male and 2506 were female, and 22 input features: age (between 35 and 70 years), gender ID (1: male, 2: female), Basal Metabolic Rate (BMR), Fat Mass (FATM), Fat Percentage (FATP), defined as (Fat Mass)/Weight × 100, Fat-Free Mass (FFM), Total Body Water (TBW), Right Leg Fat Percentage (RLFATP), Right Leg Fat Mass (RLFATM), Right Leg Fat-Free Mass (RLFFM), Left Leg Fat Percentage (LLFATP), Left Leg Fat Mass (LLFATM), Left Leg Fat-Free Mass (LLFFM), Right Arm Fat Percentage (RAFATP), Right Arm Fat Mass (RAFATM), Right Arm Fat-Free Mass (RAFFM), Left Arm Fat Percentage (LAFATP), Left Arm Fat Mass (LAFATM), Left Arm Fat-Free Mass (LAFFM), Trunk Fat Percentage (TRFATP), Trunk Fat Mass (TRFATM), and Trunk Fat-Free Mass (TRFFM).
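The feature list and the counts above can be cross-checked with a short sketch. The abbreviations come from the text; the Python variable names and the column names "AGE" and "GENDER" are illustrative assumptions, not names used in the cohort database.

```python
# Whole-body features; "AGE" and "GENDER" are hypothetical column names.
GLOBAL_FEATURES = ["AGE", "GENDER", "BMR", "FATM", "FATP", "FFM", "TBW"]

# Segment prefixes (right/left leg, right/left arm, trunk) and measures.
SEGMENTS = ["RL", "LL", "RA", "LA", "TR"]
MEASURES = ["FATP", "FATM", "FFM"]
SEGMENT_FEATURES = [seg + m for seg in SEGMENTS for m in MEASURES]

FEATURES = GLOBAL_FEATURES + SEGMENT_FEATURES

# Counts reported in the text.
N_TOTAL = 4661
N_DIABETIC, N_NONDIABETIC = 571, 4090
N_MALE, N_FEMALE = 2155, 2506

# Consistency checks: 22 features, and both breakdowns sum to the total.
assert len(FEATURES) == 22
assert N_DIABETIC + N_NONDIABETIC == N_TOTAL
assert N_MALE + N_FEMALE == N_TOTAL

# Class imbalance: number of non-diabetic participants per diabetic one.
imbalance_ratio = N_NONDIABETIC / N_DIABETIC
print(f"{N_DIABETIC / N_TOTAL:.1%} diabetic, ratio {imbalance_ratio:.1f} to 1")
```

The ratio of roughly seven non-diabetic participants per diabetic participant is the imbalance that motivates oversampling later in the pipeline.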
Institutional approval was granted for the use of the patient datasets in research studies for diagnostic and therapeutic purposes. Approval was granted on the grounds of existing datasets. Informed consent was obtained from all of the patients in this study. All methods were carried out in accordance with relevant guidelines and regulations. All experimental protocols were approved by Research Ethical Committees of School of Medicine-Shiraz University of Medical Science.
2.8 Evaluation metrics
Following previous studies in the literature (14–17), the performance of the machine learning algorithms was evaluated using accuracy, precision, recall, and F1-score, according to the following equations:
Accuracy = \(\frac{TP + TN}{TP + TN + FP + FN}\)
Weighted Average Precision = \(\frac{\sum_{i=0}^{n-1} \left|\text{number of elements in class } i\right| \frac{TP_i}{TP_i + FP_i}}{\sum_{i=0}^{n-1} \left|\text{number of elements in class } i\right|}\)
Recall = \(\frac{TP}{TP + FN}\)
F1-score = \(\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{TP}{TP + \frac{1}{2}\left(FP + FN\right)}\)
Weighted Average F1-score = \(\frac{\sum_{i=0}^{n-1} \left|\text{number of elements in class } i\right| \text{F1-score}_i}{\sum_{i=0}^{n-1} \left|\text{number of elements in class } i\right|}\)
where TP is the number of test results correctly classified as positive, TN is the number correctly classified as negative, FP is the number incorrectly classified as positive, and FN is the number incorrectly classified as negative. In the weighted averages, n is the number of classes and the subscript i denotes a quantity computed with class i treated as the positive class.
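The metrics defined above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the labels are toy values rather than cohort data, and recall is averaged with the same class-size weights as precision and F1-score, which the text does not state explicitly.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, label):
    """Return (TP, FP, FN) when `label` is treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def evaluate(y_true, y_pred):
    """Accuracy plus class-size-weighted precision, recall, and F1-score."""
    support = Counter(y_true)  # number of elements in each class
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    w_precision = w_recall = w_f1 = 0.0
    for label, n_label in support.items():
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        weight = n_label / total
        w_precision += weight * safe_div(tp, tp + fp)
        w_recall += weight * safe_div(tp, tp + fn)
        w_f1 += weight * safe_div(tp, tp + 0.5 * (fp + fn))
    return {"accuracy": accuracy, "precision": w_precision,
            "recall": w_recall, "f1": w_f1}


# Toy labels (1 = diabetic, 0 = non-diabetic); not cohort data.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(evaluate(y_true, y_pred))
```

With class-size weights, weighted recall always equals accuracy, so on imbalanced data the per-class values for the diabetic class are often more informative than the weighted averages alone.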
2.9 Proposed Model
In this section, we present an overview of the proposed methodology. First, we split the dataset into training (80%) and testing (20%) sets. Since the dataset was imbalanced, we applied an oversampling technique, SVMSMOTE, to the training set. Afterward, we trained the base classifiers: linear SVM, decision tree, Stochastic Gradient Descent (SGD), logistic regression, Gaussian naïve Bayes, K-Nearest Neighbors (k = 3 and k = 4), and a Multilayer Perceptron neural network.
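The order of operations described above (split first, then oversample only the training part) can be sketched as follows. This is a simplified stand-in written in plain Python, not the authors' pipeline. The split is assumed to be stratified, which the text does not state. The oversampling step interpolates between random pairs of minority rows in the spirit of SMOTE, whereas SVMSMOTE additionally uses an SVM to concentrate synthetic samples near the class boundary. All function names are hypothetical, and the feature values are random numbers rather than cohort data.

```python
import random
from collections import Counter


def take(rows, indices):
    """Return the rows at the given indices, in order."""
    return [rows[i] for i in indices]


def stratified_split(X, y, test_fraction=0.2, seed=0):
    """Split per class so both parts keep the class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, t in enumerate(y) if t == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return (take(X, train_idx), take(y, train_idx),
            take(X, test_idx), take(y, test_idx))


def interpolate_minority(X, y, minority=1, seed=0):
    """Add synthetic minority rows until the two classes are the same size.

    Each synthetic row lies on the segment between two randomly chosen
    minority rows (a simplification of SMOTE, assumed for illustration).
    """
    rng = random.Random(seed)
    minority_rows = [x for x, t in zip(X, y) if t == minority]
    n_needed = (len(y) - len(minority_rows)) - len(minority_rows)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        a, b = rng.sample(minority_rows, 2)
        gap = rng.random()
        X_new.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
        y_new.append(minority)
    return X_new, y_new


# Synthetic two-feature data with the class counts reported in the text.
rng = random.Random(42)
y = [1] * 571 + [0] * 4090
X = [[rng.gauss(t, 1.0), rng.gauss(0.0, 1.0)] for t in y]

X_train, y_train, X_test, y_test = stratified_split(X, y)
X_bal, y_bal = interpolate_minority(X_train, y_train)
print(Counter(y_train), Counter(y_test), Counter(y_bal))
```

Oversampling after the split keeps synthetic rows out of the test set, so the reported metrics reflect performance on real, still imbalanced data.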
Finally, to improve the performance of our models, we applied several ensemble learning algorithms: gradient boosting, Adaboost, stacking (for the top 3 and top 4), and voting (for the top 3 and top 4). A framework of the proposed models for the diagnosis of diabetes is shown in Fig. 1.
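To make the voting idea concrete, the sketch below combines the label predictions of three classifiers by hard (majority) voting. It assumes that "top 3" refers to the three best-performing base classifiers and that voting is done on predicted labels; the text does not say whether hard or soft voting was used. The function name and the predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine label predictions from several models by hard voting.

    `predictions_per_model` holds one list of predicted labels per model;
    all lists must be aligned on the same samples.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions from three base classifiers on five samples.
model_a = [1, 0, 0, 1, 0]
model_b = [1, 1, 0, 0, 0]
model_c = [0, 1, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # [1, 1, 0, 1, 0]
```

With three models and two classes a tie cannot occur. With four models a two-against-two tie is possible, and this sketch resolves it in favor of the label that appears first among the votes.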