The overall methodology of the research comprised knowledge acquisition, data collection, data analysis, identification of risk factor management techniques, expert system development, and performance evaluation. Figure (1) shows the general methodology of the study.
3.1 Knowledge Acquisition
The variables needed for this study were obtained from expert knowledge. Candidate variables were identified from CKD guidelines and from articles on risk factor identification and diagnosis methods. Several studies have examined CKD risk factors, and some nephrologists have published lists of common risk factors. In this study, research articles and guidelines were reviewed, and the variables applicable to Ethiopia were identified. The identified risk factors were older age, sex, diabetes mellitus, hypertension, body mass index, injury to the kidney, family history of the disease, cigarette use, alcohol consumption, previous hospitalization, and related kidney diseases [5, 17–32].
3.2 Data Collection
To identify the risk factors in the general population of Ethiopia, data were collected from a selected population using a questionnaire prepared from the variables identified through expert knowledge, with the format of previous studies as a reference [20, 23]. Data collection involved several steps; Figure (2) shows the methods followed to complete it.
3.3 Sample Size Determination
The required sample size was calculated using equation (1) [33, 34], the single proportion formula, with a 95% confidence level, standard deviation = 5, and a 5% margin of error. The values were selected to fulfil the criteria for performing logistic regression analysis.
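As a rough illustration of the single proportion formula, the sketch below computes n = Z²p(1 − p)/d². The equation itself is not reproduced in this section, so the form of the formula, the assumed proportion p = 0.5, and the function name are assumptions rather than the authors' stated inputs. With Z = 1.96 (95% confidence) and d = 0.05, the result is close to the 384 patients included in the study.

```python
def single_proportion_sample_size(z, p, d):
    """Sample size for estimating a single proportion.

    z: standard normal value for the confidence level (1.96 for 95%)
    p: anticipated proportion (assumed 0.5 here, the most conservative choice)
    d: margin of error expressed as a proportion (0.05 for 5%)
    """
    return (z ** 2) * p * (1.0 - p) / (d ** 2)


# Assumed inputs: 95% confidence, p = 0.5, 5% margin of error.
n = single_proportion_sample_size(z=1.96, p=0.5, d=0.05)
print(f"Required sample size (before rounding): {n:.2f}")
```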
3.4 Ethical Consideration
Data collection was approved by the Institutional Review Board of the College of Medicine and Health Sciences, Jimma University, and of St. Paul's Hospital Millennium Medical College. The purpose of the study was explained to the study participants.
3.5 Setting and Population
Data collection was conducted from March 4 to June 14, 2021, in the inpatient settings of Jimma University Medical Center, St. Paul's Hospital Millennium Medical College, and Mizan-Tepi University Teaching Hospital. The hospitals were selected using purposive sampling, in which the researcher decides which particular groups to select. Purposive sampling is used when it is difficult to reach every area, household, or individual member of the population and dependable information about population locations and numbers is not available [35]. It is also used when there is insufficient time to visit the number of households or individuals needed. This sampling method was used to select the hospitals because of limited time and budget.
As humans were involved in the study, the study protocol was performed in accordance with the relevant guidelines. All subjects were invited to participate on a voluntary basis, and written informed consent was obtained from all participants. Patients aged 18 years and above who were suspected of having CKD and were admitted to the hospital were eligible for the study. The age limit of 18 was chosen because younger patients cannot give consent or provide the required information without the help of their parents, and because some of the questions relate to addiction (smoking and chat consumption), which they may not answer accurately [36, 37]. A total of 384 patients who fulfilled these criteria were consecutively included in the final analysis. Socio-demographic and some risk factor variables were collected by five nurses using a structured questionnaire. Patient histories were reviewed to identify the presence or absence of CKD, HTN, DM, kidney-related disease, and other diseases. Creatinine data were obtained from patient histories, and the glomerular filtration rate (GFR) was estimated using the Modification of Diet in Renal Disease (MDRD) study equation shown in equation (2) [6]; the stage of CKD was identified using the KDIGO guideline [2].
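The GFR estimation and staging step can be sketched as follows. Equation (2) is not reproduced in this section, so the sketch assumes the four-variable MDRD form with the constant 175 (for IDMS-traceable creatinine; the original form uses 186), creatinine in mg/dL, and the KDIGO GFR categories G1 to G5. The function names and the optional race factor are illustrative assumptions, not the authors' implementation.

```python
def mdrd_egfr(serum_creatinine_mg_dl, age_years, female, black=False):
    """Estimated GFR in mL/min/1.73 m^2 (four-variable MDRD, assumed form)."""
    egfr = 175.0 * serum_creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742
    if black:
        # Race coefficient of the published equation; whether the study
        # applied it is not stated, so it is off by default here.
        egfr *= 1.212
    return egfr


def kdigo_gfr_category(egfr):
    """Map an eGFR value to a KDIGO GFR category."""
    if egfr >= 90:
        return "G1"
    if egfr >= 60:
        return "G2"
    if egfr >= 45:
        return "G3a"
    if egfr >= 30:
        return "G3b"
    if egfr >= 15:
        return "G4"
    return "G5"


# Hypothetical patient: 50-year-old man with serum creatinine 1.0 mg/dL.
example_egfr = mdrd_egfr(1.0, 50, female=False)
print(round(example_egfr, 1), kdigo_gfr_category(example_egfr))
```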
3.6 Measures
Height and weight were measured at the time of the interview and used to calculate the BMI. Participants were categorized by BMI as underweight (<18.5), normal (18.5–24.9), overweight (25.0–29.9), obese (30.0–39.9), or morbidly obese (≥40.0) according to guidelines [38]. Participants were considered to have diabetes mellitus if they had previously been diagnosed with DM by a doctor, had any documents indicating DM, reported taking insulin or an oral anti-diabetic drug, or had a random plasma glucose ≥11.1 mmol/L with symptoms. Blood pressure readings were obtained by a qualified nurse using an electronic sphygmomanometer. Hypertension was defined as systolic BP ≥ 140 mmHg, diastolic BP ≥ 90 mmHg, or use of medication for hypertension irrespective of the blood pressure.
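The BMI and hypertension definitions above can be expressed as a short sketch. The function names are hypothetical, and BMI is computed as weight in kilograms divided by height in metres squared; the category cut-offs and blood pressure thresholds are those given in the text.

```python
def bmi_category(weight_kg, height_m):
    """Classify BMI using the cut-offs listed in the Measures section."""
    bmi = weight_kg / (height_m ** 2)
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    if bmi < 30.0:
        return "overweight"
    if bmi < 40.0:
        return "obese"
    return "morbidly obese"


def has_hypertension(systolic_mmhg, diastolic_mmhg, on_bp_medication):
    """Hypertension: SBP >= 140, DBP >= 90, or on antihypertensive medication."""
    return systolic_mmhg >= 140 or diastolic_mmhg >= 90 or on_bp_medication


# Hypothetical participant: 80 kg, 1.75 m, BP 128/92, no medication.
print(bmi_category(80, 1.75), has_hypertension(128, 92, False))
```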
3.7 Statistical Analysis
Statistics is the set of methods used to analyze data. It is present in all areas of science that involve collecting, handling, and sorting data, giving insight into a specific phenomenon and making it possible to infer new results from that knowledge [39]. One goal of statistics is to extract information from data in order to better understand the situations they represent; statistics can therefore be thought of as the science of learning from data. Put another way, statistics, grounded in probability theory, provides techniques and methods for data analysis that support decision-making in problems involving uncertainty [39].
As shown in Figure (3), statistical analysis was performed using SPSS version 26. The first step before analysis was entering and editing the data. Data were entered as numeric and string values. Missing values were identified through a missing value analysis based on frequency analysis. Patient records containing incomplete information were excluded or corrected before the analysis. After data entry and missing value analysis, descriptive analysis, bivariate analysis, and multivariable logistic regression were performed. Descriptive analysis was used to summarize the participants' socio-demographic status and CKD stage.
Bivariate analysis using the chi-square test was performed to identify differences in patient characteristics and risk factors for CKD. The chi-square test is used to determine whether the association between two qualitative variables is statistically significant [40]. In addition, multivariable logistic regression was performed to estimate the unique relationship between each included variable and CKD status. The multivariable logistic analysis (MVLA) model was selected because it is an efficient method for analyses with one outcome (dependent) variable and multiple independent variables [41]. In this study, CKD status was taken as the dependent variable and the other factors as independent variables. The P value was used to assess the significance of each independent variable. P stands for probability and measures how likely it is that an observed difference between groups is due to chance. Being a probability, P can take any value between 0 and 1. Values close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests no difference between the groups other than that due to chance. The smaller the P value, the stronger the evidence of a real difference; a result is conventionally considered statistically significant when P < 0.05 (5%) [42]. In this work, P < 0.05 was used as the significance threshold for both the MVLA and the bivariate analysis.
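For a single yes/no risk factor, the bivariate step reduces to a 2×2 table. The sketch below computes the Pearson chi-square statistic (without continuity correction), its P value for one degree of freedom, and a crude odds ratio with a Woolf-type 95% confidence interval. The analysis in this study was run in SPSS, so this is only an illustration of the underlying arithmetic; the counts are invented for the example and are not study data, and the function names are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square and P value (1 df) for the table [[a, b], [c, d]].

    a: exposed with CKD, b: exposed without CKD,
    c: unexposed with CKD, d: unexposed without CKD.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a log-based (Woolf) confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Illustrative counts only (not taken from the study).
chi2, p_value = chi_square_2x2(20, 10, 10, 20)
odds_ratio, lower, upper = odds_ratio_ci(20, 10, 10, 20)
print(f"chi2={chi2:.3f}, P={p_value:.4f}, OR={odds_ratio:.2f} ({lower:.2f}, {upper:.2f})")
```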
In multivariable analysis, male gender (AOR = 2.297; 95% CI: 1.407–3.753), hypertension (AOR = 3.095; 95% CI: 1.882–5.089), family history of kidney disease (AOR = 4.128; 95% CI: 2.302–7.402), diabetes duration of ten years or more (AOR = 30.986; 95% CI: 3.972–241.744), diabetes duration below ten years (AOR = 3.011; 95% CI: 1.212–7.483), smoking for more than 4 years (AOR = 2.226; 95% CI: 1.014–4.883), being overweight (AOR = 1.904; 95% CI: 1.119–3.239), and injury to the kidney (AOR = 2.362; 95% CI: 1.016–5.491) were independently associated with the presence of CKD. Tables (1) and (2) show the relationships between the variables from the bivariate and multivariable logistic regression analyses.
Table 1: Crude odds ratios of factors associated with CKD among respondents, from bivariate analysis
| Variable | Category | Frequency | CA (%) | NCA (%) | COR (95% CI) | P value |
|---|---|---|---|---|---|---|
| History of HTN | Yes | 151 | 51 | 49 | 1.861 (1.419, 2.440) | 0.000 |
| | No | 233 | 27 | 72 | 0.675 (0.570, 0.798) | |
| Diabetes duration ≥10 years | Yes | 30 | 96.7 | 3.3 | 27.528 (3.788, 200.057) | 0.000 |
| | No | 354 | 47.5 | 52.5 | 0.857 (0.808, 0.909) | |
| Diabetes duration <10 years | Yes | 38 | 78.9 | 21.1 | 3.559 (1.675, 7.561) | 0.00 |
| | No | 346 | 48.3 | 51.7 | 0.885 (0.828, 0.946) | |
| Sex | Female | 169 | 38.5 | 61.5 | 0.590 (0.464, 0.749) | 0.00 |
| | Male | 215 | 61.4 | 38.6 | 1.509 (1.251, 1.820) | |
| Age | ≥60 | 64 | 54 | 46 | 1.113 (0.895, 1.385) | 0.335 |
| | <60 | 320 | 49 | 51 | 0.913 (0.760, 1.098) | |
| History of smoking | Duration >4 years | 51 | 74.5 | 25.5 | 2.775 (1.527, 5.041) | 0.001 |
| | Duration <4 years | 333 | 47.7 | 53.3 | 0.867 (0.802, 0.938) | |
| Chat consumption | Yes | 54 | 46.3 | 53.7 | 0.818 (0.498, 1.344) | 0.427 |
| | No | 330 | 52.1 | 47.9 | 1.033 (0.953, 1.121) | |
| Hospitalized before | Yes | 115 | 44.3 | 55.7 | 0.756 (0.556, 1.030) | 0.075 |
| | No | 269 | 54.3 | 45.8 | 1.127 (0.987, 1.286) | |
| Experienced injury | Yes | 40 | 72.5 | 27.5 | 2.503 (1.288, 4.864) | 0.005 |
| | No | 344 | 48.8 | 52.2 | 0.906 (0.846, 0.970) | |
| Presence of family with kidney disease | Yes | 92 | 69.9 | 30.4 | 2.170 (1.460, 3.225) | 0.00 |
| | No | 292 | 45.5 | 54.5 | 0.794 (0.708, 0.890) | |
| Affected with CVD | Yes | 49 | 40.8 | 59.2 | 0.655 (0.384, 1.116) | 0.116 |
| | No | 335 | 42.8 | 42.8 | 0.670 (0.566, 0.793) | |
| BMI: Normal | Yes | 85 | 44.7 | 55.3 | 0.767 (0.526, 1.120) | 0.167 |
| | No | 299 | 52.2 | 44.8 | 1.078 (0.968, 1.200) | |
| BMI: Underweight | Yes | 48 | 54.2 | 36.8 | 1.122 (0.659, 1.908) | 0.671 |
| | No | 336 | 50.9 | 49.1 | 0.984 (0.912, 1.061) | |
| BMI: Overweight | Yes | 127 | 61.1 | 38.9 | 1.875 (1.372, 2.563) | 0.00 |
| | No | 257 | 44 | 56 | 0.745 (0.646, 0.859) | |
| BMI: Obese | Yes | 6 | 50 | 50 | 0.949 (0.194, 4.644) | 0.886 |
| | No | 378 | 51.4 | 48.6 | 1.001 (0.949, 1.462) | |
| Other kidney disease presence | Yes | 8 | 25 | 75 | 0.316 (0.065, 1.548) | 0.133 |
| | No | 376 | 51.9 | 48.1 | 1.023 (0.993, 1.053) | |
| Alcohol consumption | Yes | 18 | 72.5 | 27.5 | 2.468 (0.897, 6.788) | 0.069 |
| | No | 366 | 50.3 | 49.7 | 0.960 (0.918, 1.003) | |
Table 2: Adjusted odds ratios and P values of factors associated with CKD among respondents, from multivariable logistic regression analysis
| Risk factor | Significance (P value) | Final model AOR (95% CI) | Final model B coefficient (intercept B0 = -1.836) | Low risk | High risk |
|---|---|---|---|---|---|
| Diabetes duration ≥10 years | 0.00 | 30.986 (3.972, 241.744) | 3.434 | Has no diabetes | Diabetes above 9 years |
| Presence of family with kidney disease | 0.000 | 4.128 (2.302, 7.402) | 1.14 | No family member with kidney disease | Has family member with kidney disease |
| Hypertension | 0.000 | 3.095 (1.882, 5.089) | 1.13 | Has no hypertension | Has hypertension |
| Diabetes duration <10 years | 0.018 | 3.011 (1.212, 7.483) | 1.102 | Has no diabetes | Diabetes between 0 and 9 years |
| Experienced injury around kidney | 0.046 | 2.362 (1.016, 5.491) | 0.86 | Has not experienced injury | Experienced injury |
| Sex | 0.001 | 2.297 (1.407, 3.753) | 0.832 | Female | Male |
| Smoking | 0.046 | 2.226 (1.014, 4.883) | 0.800 | Below 4 years | Above 4 years |
| Overweight | 0.018 | 1.904 (1.119, 3.239) | 0.644 | Underweight and normal | Overweight |
|
3.8 Expert System Development
An expert system (ES) is a computer system that emulates the decision-making ability of a human expert in a limited domain. Expert systems are one of the leading artificial intelligence (AI) techniques adopted for such tasks. They provide powerful and flexible means of solving a variety of problems that often cannot be handled by more traditional methods [43]. In this research, a rule-based expert system was developed to predict the CKD risk of individuals and to suggest ways of managing it. Figure (4) shows the general framework of the developed expert system.
3.9 Rule Based Expert System
Rule-based expert systems use rules as a knowledge representation technique. Rules are expressed as if-then statements: the "if" part is called the premise, and the "then" part is called the conclusion. The data and associated conditions are the fact elements. Facts interact with data directly to determine whether an event is of interest. The rule component of the expert system relates facts to actions; in other words, it constructs an if-then rule by putting the facts under the "if" part and the set of actions under the "then" part. Complex rules can be formed by joining rules with logical operators: the AND and OR operators are used to form the premise part of a rule. A rule can also activate multiple sets of actions, which can likewise be joined by logical operators when there are multiple sets of facts to be checked individually [44].
The basic structure of an expert system consists of a knowledge base, an inference engine, and a user interface [45]. The knowledge base contains domain-specific, high-quality knowledge. The inference engine retrieves and uses the knowledge in the knowledge base to reach a specific solution. It applies rules repeatedly to the facts, including those obtained from earlier rule applications, and adds new knowledge to the knowledge base if required. It also resolves conflicts when multiple rules apply to a particular case. Efficient procedures and rules in the inference engine are essential for deducing a correct solution. To recommend a solution, the inference engine uses forward chaining and backward chaining [46].
Forward chaining systems are data-driven rule-based systems that trigger actions based on the facts in the premise part of a rule. They start from the known data and add a new fact to the knowledge base if it is not already there. The disadvantage of forward chaining is that many rules may be executed even when they have nothing to do with the established goal, so it is not efficient when only one fact is to be inferred. Forward chaining systems perform well when the goal is not known, and they can trigger sound actions once adequate information has been gathered [43]. The user interface provides the interaction between the user and the ES itself [47].
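A minimal sketch of forward chaining, assuming rules are stored as (premises, conclusion) pairs, may help make the data-driven behavior concrete. The premises of one rule are joined by AND, and OR is expressed by giving two rules the same conclusion. The fact names and rules below are hypothetical and are not the rule base of the developed system.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no rule can add a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire the rule only if all premises hold and the fact is new.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rule base: (premises joined by AND, conclusion).
rules = [
    (("has_hypertension",), "has_risk_factor"),
    (("smokes", "smoking_over_4_years"), "has_risk_factor"),
    (("has_risk_factor",), "high_risk"),
    (("has_hypertension", "high_risk"), "suggest_bp_control"),
]

derived = forward_chain({"has_hypertension"}, rules)
print(sorted(derived))
```

Starting from a single known fact, the engine keeps firing rules whose premises are satisfied, so facts derived by one rule can trigger later rules.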
3.10 Knowledge Acquisition for Expert system
3.10.1 Risk prediction knowledge acquisition
To build a system that can predict the risk of CKD, knowledge must be acquired and stored as a set of rules. For risk prediction, the knowledge gathered from the literature and from experts was tested against the patient data to identify the relationship between the disease and the significant factors. The risk factors identified by the MVLA are used for risk prediction. These risk factors can be used to estimate the probability of disease with the logistic regression equation [48] shown in equation (3) and to identify the risk level. For a factor that increases risk, the probability of disease when the factor is present exceeds the probability when it is absent. Logistic regression models can account for the joint effects of multiple factors on the occurrence of disease, and the multivariable logistic model provides an estimate of the risk of subsequent disease [48, 49]. The risk level is determined by the presence or absence of risk factors and is classified as low risk or high risk [10].
$$P = \frac{1}{1 + e^{-(B_0 + B_1 x_1 + B_2 x_2 + \dots + B_n x_n)}} \qquad (3)$$
where P is the probability of CKD during a stipulated period of observation, B0 is the intercept, B1 is the regression coefficient for the first independent variable (x1), B2 is the regression coefficient for the second independent variable (x2), and so forth for each of the variables.
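To show how the logistic formula turns coefficients into a probability, the sketch below uses the intercept and B coefficients as reported in Table 2, with each x set to 1 when the factor is present and 0 otherwise. The dictionary keys and the function name are hypothetical labels chosen for this example, not the authors' code.

```python
import math

INTERCEPT = -1.836  # B0 as reported in Table 2

# B coefficients as reported in Table 2 (keys are illustrative labels).
COEFFICIENTS = {
    "diabetes_10_years_or_more": 3.434,
    "family_history": 1.14,
    "hypertension": 1.13,
    "diabetes_under_10_years": 1.102,
    "kidney_injury": 0.86,
    "male": 0.832,
    "smoking_over_4_years": 0.800,
    "overweight": 0.644,
}


def ckd_probability(present_factors):
    """Logistic probability given the names of the risk factors present."""
    z = INTERCEPT + sum(COEFFICIENTS[name] for name in present_factors)
    return 1.0 / (1.0 + math.exp(-z))


baseline = ckd_probability([])
with_htn = ckd_probability(["hypertension"])
print(f"no factors: {baseline:.3f}, hypertension only: {with_htn:.3f}")
```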
3.10.2 Risk factor management technique identification
Identifiable risk factors for CKD can be classified as modifiable and non-modifiable [50]. Some of the risk factors found in this study are modifiable. Hypertension, diabetes, and BMI are components of metabolic syndrome. Although cigarette smoking is not a component of metabolic syndrome, it is also a known modifiable risk factor. Interventions that delay or prevent the onset of diabetes mellitus, reduce overweight, support smoking cessation, and control hypertension should be considered in order to prevent or delay CKD [51]. Genetic factors, gender, and injury are not modifiable [50, 51]. After the relevant risk factors had been identified from the statistical analysis, management approaches for the modifiable risk factors (DM, HTN, smoking, and being overweight) were sought in different guidelines. The non-modifiable risk factors (male gender, experience of injury, and family history of kidney disease) were excluded from this step because they cannot be modified. For the modifiable risk factors, different guidelines were reviewed and management techniques were identified [28, 39, 52–64].
3.10.3 Knowledge Representation
In this research work, a rule-based expert system was developed. The knowledge gained from different guidelines, books, and the statistical analysis was stored as facts and implemented as rules using "if" and "then" cases. The proposed expert system reasons from health conditions, socio-demographic status, and habits, and generates three types of results. Taking the answers to its questions as input, the expert system triggers the inference engine to produce the probability of the disease, the risk level, and a management suggestion. The questions cover the health conditions, socio-demographic status, and habits identified by the multivariable logistic regression analysis. As shown in Figure (5), Q0 represents age, Q1–Q9 represent the questions asked of the user, and B represents the value assigned to each question. The questions cover presence of DM, duration of diabetes, presence of HTN, habit of smoking, duration of smoking, height, weight, family history of CKD, and experience of injury. For questions that involve a duration, the system asks for the duration only if the person has the condition; otherwise the answer is taken as "No". If the answer to a question is "Yes", the B value is set to the corresponding positive coefficient from the logistic regression; otherwise the B value is set to zero. The probability for the person is then calculated using the logistic regression formula. If the person has at least one risk factor, the system classifies the person as a high-risk individual; if no risk factor is found, it classifies the person as low risk. In addition, if the person has a modifiable risk factor, the system reports which risk factor is present and suggests how it should be managed.
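The question flow described above can be sketched as a small set of if-then rules that turn raw answers into risk factor flags, a risk level, and a list of modifiable factors to manage. All names are hypothetical. The duration cut-offs follow Table 2; treating only BMI 25.0 to 29.9 as overweight is an assumption, since the handling of obese respondents is not stated. Management advice text is deliberately left out because it comes from the reviewed guidelines.

```python
MODIFIABLE = {"diabetes", "hypertension", "smoking_over_4_years", "overweight"}


def encode_answers(has_dm, dm_years, has_htn, smokes, smoke_years,
                   height_m, weight_kg, family_history, injury):
    """Turn questionnaire answers into yes/no risk factor flags."""
    bmi = weight_kg / (height_m ** 2)
    return {
        # Duration is considered only when the condition is present.
        "diabetes": has_dm,
        "diabetes_10_years_or_more": has_dm and dm_years >= 10,
        "hypertension": has_htn,
        "smoking_over_4_years": smokes and smoke_years > 4,
        "overweight": 25.0 <= bmi < 30.0,  # assumed cut-offs
        "family_history": family_history,
        "kidney_injury": injury,
    }


def assess(flags):
    """Return (risk level, modifiable factors that need management)."""
    present = sorted(name for name, yes in flags.items() if yes)
    level = "high risk" if present else "low risk"
    to_manage = [name for name in present if name in MODIFIABLE]
    return level, to_manage


# Hypothetical user: hypertensive, non-smoker, BMI about 22.9.
flags = encode_answers(False, 0, True, False, 0, 1.75, 70, False, False)
level, to_manage = assess(flags)
print(level, to_manage)
```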
3.10.4 Graphical User Interface (GUI)
The Tkinter library with the Azure theme (GUI styling) was used to create the application's windows and the other elements of the graphical user interface. The code was written in Python.
3.11 Performance Evaluation Metrics
After the system was built, its performance had to be evaluated to determine how well it actually works. The system was evaluated using data extracted from the medical records of patients at JUMC. Medical records of 120 patients, 60 with CKD and 60 without CKD, were collected for the evaluation, and the patients were interviewed. To evaluate the system, all data were organized and entered into the developed system, and the system output was compared with the diagnosis recorded in the medical records. The system was evaluated using a confusion matrix. The confusion matrix is a square table used to describe the performance of a classification model on a test dataset by representing the actual (column) and predicted (row) dimensions. It makes it easy to see the performance of the designed model clearly. A number of performance metrics can be derived from the confusion matrix; the most common are accuracy, precision, and recall [65]. The evaluation was performed at three cut-off percentages.
A TP (true positive) indicates that a case predicted as positive is actually positive. A TN (true negative) indicates that a case predicted as negative is actually negative. An FP (false positive) indicates that a case is predicted as belonging to the class when it does not, and an FN (false negative) indicates that a case is predicted as not belonging to the class when it does [66].
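The metrics can be written out from the four counts using their standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), and recall = TP / (TP + FN). The sketch below applies them at several cut-offs. The probabilities, labels, and cut-off values are invented for illustration and are not the study's evaluation data; the three cut-offs actually used are not stated in this section.

```python
def confusion_counts(actual, probabilities, cutoff):
    """Count TP, TN, FP, FN when probability >= cutoff is predicted as CKD."""
    tp = tn = fp = fn = 0
    for is_ckd, probability in zip(actual, probabilities):
        predicted_ckd = probability >= cutoff
        if predicted_ckd and is_ckd:
            tp += 1
        elif predicted_ckd and not is_ckd:
            fp += 1
        elif not predicted_ckd and is_ckd:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall); empty denominators give 0.0."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Illustrative labels and predicted probabilities (not study data).
actual = [True, True, True, False, False, False]
probabilities = [0.9, 0.6, 0.3, 0.7, 0.2, 0.1]

for cutoff in (0.3, 0.5, 0.7):  # hypothetical cut-offs
    counts = confusion_counts(actual, probabilities, cutoff)
    print(cutoff, counts, metrics(*counts))
```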