Health informatics is currently a vital research topic. We propose a novel framework, SEMLHI, for health informatics systems that use machine learning. The next section describes the proposed framework in detail.
3.2 SEMLHI framework
The SEMLHI framework is specifically geared toward facilitating the development of software applications and includes components that facilitate the analysis of health data sets. Figure 1 summarizes the proposed framework as a conceptual framework, together with the mechanism it uses to interact with the operating system and hardware. Users work with the framework either directly, as developers or system analysts, or indirectly, by using its results. From a software engineering perspective, the proposed framework interacts with the operating system components it uses, and the figure further shows how the software manages the hardware devices, including the main system devices used by the framework.
Our framework is composed of four components, or modules: software, machine learning model, machine learning algorithms, and health informatics data. Figure 2 illustrates how each module interacts with the other modules to work as a framework.
3.2.1 ML Algorithms
Machine learning algorithms are used to compute the parameters that define a model [21], optimize its network topology, and improve system convergence without losing information. As a supervised learning method, k-Nearest Neighbors (KNN) [22] can be used for classification and prediction problems. KNN makes decisions based on the dominant category among k objects rather than on a single object category. Figure 3 identifies the machine learning algorithms most often used for health classification.
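The majority-vote idea behind KNN can be illustrated with a minimal sketch. It assumes numeric feature vectors and Euclidean distance; the toy data, the label names, and the function name `knn_predict` are hypothetical and are not taken from the SEMLHI implementation.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The dominant category among the k neighbours is the prediction
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per case, two labels
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = ["normal", "normal", "normal", "abnormal", "abnormal"]
prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)
```

With k = 1 the decision depends on a single object; larger values of k make the vote less sensitive to one mislabeled neighbour.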
Because all the data in our sample dataset are prepared using the SEMLHI framework, the output is labeled data and the method is supervised. For this reason we use the KNN algorithm with multiple labels to evaluate our results (KNN is used for supervised learning, whereas k-means is unsupervised). k-means can be used for datasets that include millions of labeled cases, as can approximate nearest neighbours (ANN), which is usually 10x to 100x faster than KNN. Support-vector machines (SVM) work differently; they are a good and fast solution for many problems and will almost always outperform k-NN. Figure 4 shows that Logistic Regression has high accuracy when expected and real predictions are compared.
Based on original data collected from a government hospital in Palestine over the past three years, five algorithms were used to predict the lab test using the Machine Learning Algorithm (MLA) component of the SEMLHI framework. ML approaches and algorithms [23] can achieve better performance than expert-selected ones. The MLA module uses two types of techniques: supervised learning, in which labeled input data are used to predict future outputs, and unsupervised learning, in which the input data are unlabeled. The MLA module first decides which technique to use and then selects the most suitable algorithms from the list of candidates, based on a mathematical selection against the criteria. Table 2 shows the accuracy obtained with the different algorithms applied (KNeighbors Classifier, Linear SVC, Logistic Regression, Multinomial NB, and Random Forest Classifier). For 750 cases, Linear SVC reaches an accuracy of about 0.57, the highest among the algorithms compared. For clustering, the distance d between two objects x and y is calculated by comparing the values of their n features using the Minkowski metric.
Table 3: Accuracy of the evaluated machine learning models
Algorithm Name | Accuracy
KNeighbors Classifier | 0.487694
Linear SVC | 0.564566
Logistic Regression | 0.560412
Multinomial NB | 0.517013
Random Forest Classifier | 0.488955
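The Minkowski metric mentioned before the table has the standard form d(x, y) = (Σ |xi − yi|^p)^(1/p), summed over the n features, with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance. The sketch below shows this computation; the text does not state which value of p is used, so the values of p and the example vectors are assumptions for illustration.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    # Sum the p-th powers of the per-feature differences, then take the p-th root
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Illustrative vectors with n = 2 features (not taken from the study data)
d_manhattan = minkowski_distance((0, 0), (3, 4), p=1)
d_euclidean = minkowski_distance((0, 0), (3, 4), p=2)
print(d_manhattan, d_euclidean)
```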
In supervised learning, the data set contains n rows (cases). Each case is evaluated using a function f: A → B, which assigns it to label A or label B; f is learned from the training set of n cases by evaluating E and comparing the output with the known label. f has a set of n(d). In unsupervised learning, the data are not labeled; it is applied in analysis and dimensionality reduction. This type [24] uses a training set t that includes n objects, t = {xi ∈ A : 1 ≤ i ≤ n}, which can be categorized into k classes C1, …, Ck ⊆ A by applying an algorithm f in the evaluation phase to determine the class Cj, 1 ≤ j ≤ k, to which an input x belongs.
3.2.2 Machine Algorithm Model
Machine learning helps us extract useful features from a dataset in order to solve or predict health-related events [25]. The Machine Algorithm Model (MAM) component includes five sub-modules: read the data, prepare the data, train the model, test and evaluate the model, and predict new data. Figure 5 describes the sequence of these stages.
The challenge for this component is to use the right type of algorithm, one that optimally fits the dataset while avoiding high bias or high variance. The main role of the MAM component is to analyze the dataset against a set of conditions. If the dataset has more than 50 samples and is labeled, classification algorithms are selected; if it is not labeled, clustering algorithms are applied. If the dataset requires predicting a quantity, regression algorithms are used; if not, dimensionality reduction is applied. The next diagram shows in detail which algorithms are selected for the MAM component.
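The selection conditions above can be read as a small decision function. The sketch below is one possible reading, not the MAM implementation: the `task` parameter ("category", "quantity", or anything else for exploration) is a hypothetical way to express what the dataset needs to predict, and raising an error for 50 samples or fewer is an assumption, since the text does not say what happens in that case.

```python
def select_algorithm_family(n_samples, has_labels, task="category", min_samples=50):
    """Return the family of algorithms suggested by the selection conditions."""
    if n_samples <= min_samples:
        # Assumption: the text only covers datasets with more than 50 samples
        raise ValueError("dataset too small for the selection conditions")
    if task == "category":
        # Labeled data -> classification, unlabeled data -> clustering
        return "classification" if has_labels else "clustering"
    if task == "quantity":
        return "regression"
    # Neither a category nor a quantity is being predicted
    return "dimensionality reduction"


print(select_algorithm_family(720, has_labels=True))
print(select_algorithm_family(720, has_labels=False))
print(select_algorithm_family(720, has_labels=True, task="quantity"))
```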
3.2.3 Health Informatics Data
In ML, the data are essential, while the methods applied to present and visualize knowledge are the most important step. Our dataset sample contains ten columns and 50k rows (cases). To be used by the Health Informatics Data (HID) algorithms, the dataset must be transformed into numerical features. Data that contain missing, duplicated, or invalid values, such as negative ages or extremely large integers, could negatively affect the performance of our MLA. Figure 4 describes the main rules for determining which methodology to use: classification, clustering, regression, or reduction.
HID uses the data source and a translation dictionary for label encoding, converting each value in a column to a number in order to reduce the misinterpreted data used by Bayesian inference. The Node Identifier analyzes the data as a common process and determines patterns using a patient-specific research identifier, because a dataset usually requires multiple records from the same patient to be identified as related in the de-identified database. For outliers, HID uses a set of methods that find hidden groups in order to remove outliers; in an advanced step, the outlier values in the data, which appear to be erroneous, are calculated and cropped from the dataset. Figure 8 shows that logistic regression has high accuracy when expected and real predictions are compared.
To predict disease, we use ICD-10 with multiple labels, as each patient has an ICD code in the health records. The disease can affect all regions of the retina, though there is currently no classification system [23] to distinguish anterior (peripheral) from posterior (macular) regions. We hypothesize that these classifications are characterized by D and refractive features, highlighting the disparity in types of disease.
We used collected electrocardiograph data and focused on D, the most common diagnosis cases in the Diagnosis Code database: D = {d1, d2, d3, …, dn}, where d is a disease applicable to a diagnosis code and n is the number of disease classes, using the k-means algorithm with multiple labels. Figure 7 describes the pseudocode of the k-nearest neighbor algorithm for Multilevel Learning.
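Label encoding and the removal of erroneous values can be sketched as follows. This is an illustration under stated assumptions: the record fields, the age bounds of 0 to 120, and the helper names `label_encode` and `drop_out_of_range` are hypothetical, and a fixed range check stands in for the outlier methods of HID, which the text does not specify.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    dictionary = {}
    codes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        codes.append(dictionary[value])
    return codes, dictionary


def drop_out_of_range(rows, column, low, high):
    """Keep only rows whose `column` value lies in [low, high]."""
    return [row for row in rows if low <= row[column] <= high]


# Hypothetical records; field names and bounds are illustrative only
records = [
    {"age": 34, "test": "glucose"},
    {"age": -2, "test": "hemoglobin"},
    {"age": 51, "test": "creatinine"},
    {"age": 100000, "test": "glucose"},
    {"age": 47, "test": "glucose"},
]
clean = drop_out_of_range(records, "age", low=0, high=120)
codes, dictionary = label_encode([row["test"] for row in clean])
print(codes, dictionary)
```

Keeping the dictionary alongside the codes allows the numeric values to be translated back to the original labels when results are reported.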
Read the data
In summary, to read the data from a data source such as a CSV file or another available source, the algorithm automatically removes missing values, cleans the data to remove noise and other data in the text description, and adds a category column that encodes the test result name. As applied in our research, the result is (720, 27): we have 720 case categories represented by 27 laboratory tests. This step generates a new dataset with 18 columns and 720 rows. Figure 8 summarizes age against the category feature.
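The reading step can be sketched with the standard library alone. The example assumes a CSV source with hypothetical column names (`age`, `test_name`, `result`) and treats an empty field as a missing value; it is not the code used in the study, and the in-memory sample stands in for a real file.

```python
import csv
import io


def read_clean_rows(handle, required):
    """Read CSV rows, dropping any row with a missing required field."""
    rows = []
    for row in csv.DictReader(handle):
        if all((row.get(name) or "").strip() for name in required):
            rows.append(row)
    return rows


def add_category_column(rows, source, target="category"):
    """Add an integer column that encodes the values of the `source` column."""
    codes = {}
    for row in rows:
        # A new value receives the next unused integer code
        row[target] = codes.setdefault(row[source], len(codes))
    return rows


# Hypothetical sample standing in for a CSV file on disk
sample_csv = (
    "age,test_name,result\n"
    "34,glucose,5.4\n"
    "51,,7.1\n"
    "47,hemoglobin,13.2\n"
    "29,glucose,\n"
    "60,glucose,6.0\n"
)
rows = read_clean_rows(io.StringIO(sample_csv), required=["age", "test_name", "result"])
rows = add_category_column(rows, source="test_name")
shape = (len(rows), len(rows[0]) if rows else 0)
print(shape)
```

The `shape` tuple plays the same role as the (rows, columns) result reported in the text.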
Select ML algorithm and domain
Because the medical data pose a multi-class problem, supervised classifiers are needed: since the labels of all input data in the training set are known, we can train supervised classifiers and apply them to unseen patient description text to predict the “Diagnose Code” category by applying a set of algorithms. This phase includes automatically selecting the algorithm based on a comparison of the accuracy of all candidate algorithms.
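The automatic selection by accuracy can be illustrated with the values reported in the accuracy table earlier in this section. The function name `select_best` is hypothetical, and picking the single highest score is an assumed reading of the comparison.

```python
# Accuracies as reported in the accuracy table of this section
reported_accuracy = {
    "KNeighbors Classifier": 0.487694,
    "Linear SVC": 0.564566,
    "Logistic Regression": 0.560412,
    "Multinomial NB": 0.517013,
    "Random Forest Classifier": 0.488955,
}


def select_best(scores):
    """Return the (name, accuracy) pair with the highest accuracy."""
    return max(scores.items(), key=lambda item: item[1])


best_name, best_score = select_best(reported_accuracy)
print(best_name, best_score)
```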
3.2.4 Software
The software module includes sub-classes for reuse, performance, testing, privacy, and security. The main point of software testing is to verify that the code runs correctly by exercising it under known conditions and checking that the results are as expected [26]. This class is also used to test the application's use of resources such as memory and CPU. Performance issues are addressed by first measuring them, then profiling the code, and then optimizing it; using a benchmark is the best way to compare results and confirm that an optimization improves performance. A study of code smells [27] found that Genetic Algorithms were the most commonly used machine learning technique, at 22.22%.
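The measure-then-optimize cycle described above can be sketched with Python's standard `timeit` and `tracemalloc` modules. The two functions being compared are hypothetical stand-ins for a baseline and an optimized version of the same computation; they are not part of the SEMLHI software module.

```python
import timeit
import tracemalloc


def sum_of_squares_loop(n):
    """Baseline: build a list of squares, then sum it."""
    squares = [i * i for i in range(n)]
    return sum(squares)


def sum_of_squares_formula(n):
    """Optimized candidate: closed form for 0^2 + 1^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def benchmark(func, arg, repeats=20):
    """Return (seconds for `repeats` calls, peak bytes traced during one call)."""
    seconds = timeit.timeit(lambda: func(arg), number=repeats)
    tracemalloc.start()
    func(arg)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return seconds, peak


n = 10_000
# Check correctness under known conditions before comparing performance
assert sum_of_squares_loop(n) == sum_of_squares_formula(n)
loop_seconds, loop_peak = benchmark(sum_of_squares_loop, n)
formula_seconds, formula_peak = benchmark(sum_of_squares_formula, n)
print(f"loop:    {loop_seconds:.6f} s, peak {loop_peak} bytes")
print(f"formula: {formula_seconds:.6f} s, peak {formula_peak} bytes")
```

Timings vary between machines and runs, so the comparison is meaningful only when both versions are benchmarked under the same conditions.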
[Due to technical limitations, the formula could not be displayed here. Please see the supplementary files section to access the formula.] (1)
In multi-label classification, a prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them; for example, predicting two of the three labels correctly is better than predicting no labels at all. The misclassification of a multi-class classifier is measured using micro-averaging and macro-averaging [16].
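The difference between the two averaging schemes can be shown on precision as an example. Micro-averaging pools the counts of all classes before dividing, whereas macro-averaging computes the score per class and then takes the unweighted mean. The labels d1 to d3 and the predictions below are hypothetical, and scoring a class with no predictions as 0.0 is an assumed convention.

```python
def precision_averages(y_true, y_pred):
    """Return (micro, macro) averaged precision for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in classes}  # true positives per class
    fp = {c: 0 for c in classes}  # false positives per class
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[guess] += 1
        else:
            fp[guess] += 1
    # Macro: average the per-class precision (0.0 when a class is never predicted)
    per_class = [
        tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes
    ]
    macro = sum(per_class) / len(per_class)
    # Micro: pool all counts, then compute a single precision
    micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    return micro, macro


# Hypothetical diagnosis labels and predictions
y_true = ["d1", "d1", "d1", "d1", "d2", "d3"]
y_pred = ["d1", "d1", "d1", "d2", "d2", "d1"]
micro, macro = precision_averages(y_true, y_pred)
print(round(micro, 4), round(macro, 4))
```

Because macro-averaging weights every class equally, a rare class that is never predicted correctly lowers the macro score more than the micro score.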