Software Engineering for Machine Learning in Health Informatics

Background We propose a novel framework and methodology for health informatics: Software Engineering for Machine Learning in Health Informatics (SEMLHI). The framework allows users to study and analyze the requirements, determine the function of the objects related to the system, and determine the machine learning algorithms that will be used for the dataset.
Methods The work is based on original data collected from Palestinian government hospitals over the past three years. The data were first validated and all outliers removed, then analyzed using the developed framework in order to compare ML methods that serve patients in real time. Three systems engineering methods were compared: Vee, agile, and SEMLHI. The comparison was carried out by implementing a prototype system that requires a machine learning algorithm; after the development phase, a questionnaire was delivered to the developers to record the results of using the three methodologies. The SEMLHI framework is composed of four components: software, machine learning model, machine learning algorithms, and health informatics data. The machine learning algorithms component uses five algorithms to evaluate the accuracy of the machine learning models.
Results We compare our approach with previously published systems in terms of performance, evaluating the accuracy of the machine learning models. With the different algorithms applied to 750 cases, linear SVC reached an accuracy of about 0.57, compared with the KNeighbors classifier, logistic regression, multinomial NB, and random forest classifier. This research investigates the interaction between SE and ML within the context of health informatics. The proposed framework defines a methodology for developers to analyze and develop software for the health informatics model, and creates a space in which software engineering and ML experts can work on the ML model lifecycle at the disease level and the subtype level.
Conclusions This article is an ongoing effort towards defining and translating an existing research pipeline into four integrated modules, as a framework system that uses healthcare datasets to reduce cost estimation through a newly suggested methodology. The framework is available as open source software, licensed under the GNU General Public License Version 3, to encourage others to contribute to the future development of the SEMLHI framework.

Background
The field of Health Informatics (HI) aims at providing a large-scale linkage of disparate data. Healthcare datasets are normally incomplete and noisy; as a result, reading data from linked datasets has traditionally failed within the discipline of software engineering. Machine learning (ML) is a rapidly maturing branch of computer science, helped by the ability to store data at large scale. Many ML tools can be used to analyze the data and extract knowledge that can improve the quality of work for both staff and doctors; for developers, however, no methodology has been available until now. In software engineering, there is a lack of approaches to evaluate which software engineering tasks are better performed by automation and which require human involvement or human-in-the-loop approaches [1].
Recently, a set of frameworks has been used to develop the analysis of data, such as Win-CASE [2] and SAM [3], while the market offers a vast range of data analysis tools [4] that can discover interesting patterns and hidden relationships to support decision makers. BKMR [5] is an R package that takes a statistical approach to health effects, estimating the multivariable exposure-response function.
Augmentor [6] is a Python image library for augmentation, while CareVis [7] is used for the visualization of medical treatment plans and patient data, as it is designed for this task. Applications that require a visual interface have used COQUITO [8], and for health-care data analytics the widely known 3P tools [9] are used. There are also many simple applications, such as WEKA [10], which provides a GUI to many machine learning algorithms, and Apache Spark [11], which is used as a cluster computing framework. Table 1 summarizes the main tools used for big data analytics according to task.
Software Engineering for Machine Learning Applications (SEMLA) [12] discusses challenges, new insights, and practical ideas regarding the engineering of ML and artificial intelligence (AI). From the software development perspective, ML algorithms in clinical genomics generally come in three main forms: supervised, unsupervised, and semi-supervised [13]. Interflow system requirement analysis (ISRA) [14] is used to determine system requirements.

Dataset and preprocessing
The delivery of applied Machine Learning (ML) models in healthcare is often hampered by isolated product deployments with poorly developed architectures and limited or non-existent maintenance plans. The "Translating Research into Agile Development" (TRIAD) method [17] presents a five-step method for designing a tailored EHR tool, while the SEMLHI framework will help developers use a standard methodology for developing health systems and software supported by machine learning. At this stage, the dataset used comes from Palestinian hospitals.
In such cases, a patient is required to take more than one test. In this article, we focus on helping patients and doctors complete their treatment tasks by predicting test results based on ICD-10 [18], and on helping hospitals save the time and effort spent on medical tests.
SEMLHI models and methodology are presented through new software systems that connect to a real data set, present knowledge from the data, and use ML algorithms. The dataset case studies discussed in this article are set within the context of Palestinian hospitals and centers. Three hospitals and nine medical centers were used for our data set. Data collection was conducted during the last two years, and 458k patients were identified by patient number. Overall, the pmc dataset included 141k patients with 1.63% missing, mean 1.08m, std dev 554k, min 10000, max […].
Table 2 illustrates the results of the comparison between our methodology and the Vee [19] and agile [20] methodologies. The SEMLHI methodology describes in detail the process used when developing health software and the mechanism for integrating and using ML algorithms with the developed software. It gives developers a new road map for designing health applications with system functions and for implementing the software. The framework includes 10 stages, starting from defining the problem, continuing through the development stage, and ending with the results, which are explained in the next section.
To develop an HI system, the first step is to design (encode data, define outliers, and clean up the data), followed by implementing (verification and validation), maintaining defined workflows, structuring information, providing security and privacy, testing the performance, and then reusing software applications. Records in most HI data sets are weakly structured and non-standardized. To apply ML to an HI system, the algorithm needs a set of patterns in order to predict, visualize, and generate knowledge. The main patterns used in our framework are: geographic location, patient records, departments and hospitals, surgical history, obstetric history, family history, habits, immunization, assessment and plan, and test results.

Results And Discussion
Health informatics is currently a vital research topic. We propose a novel framework, SEMLHI, for health informatics systems with machine learning. The next section describes the details of the proposed framework.

SEMLHI framework
The SEMLHI framework is specifically geared toward facilitating the development of software applications and includes components that facilitate the analysis of the health data set. Figure 1 summarizes the proposed framework as a conceptual framework, together with the mechanism used to interact with the operating system and hardware. Many users will work with the framework directly, as developers or system analysts, or indirectly, by using its results. For software engineering users, the proposed framework interacts with the operating system components that it uses and, further, shows how the software manages the device hardware together with the main system devices used by the framework.
Our framework is composed of four components or modules: software, machine learning model, machine learning algorithms, and health informatics data. Figure 2 shows how each module interacts with all the other modules to work as a framework.

ML Algorithms
Machine learning algorithms are used to compute the parameters that define a model [21], optimize its network topology, and improve the system convergence without losing information. As a supervised learning method, k-Nearest Neighbors (KNN) [22] can be used for classification and predictive problems; KNN makes decisions based on the dominant category among k objects, rather than on a single object category. Figure 3 identifies the machine learning algorithms most used for health classification.
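The KNN decision rule described above, a vote over the dominant category among the k nearest objects, can be illustrated with a minimal sketch. The function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for illustration; they are not taken from the SEMLHI implementation.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Euclidean distance from the query to every training point (an assumption;
    # the paper does not state which distance is used).
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count how often each label occurs.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    # The dominant category among the k neighbours is the prediction.
    return votes.most_common(1)[0][0]


# Made-up toy data: two numeric features per case, one label per case.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.8)]
labels = ["A", "A", "A", "B", "B"]
prediction = knn_predict(points, labels, (1.1, 1.0), k=3)
```

With k = 1 the rule reduces to copying the label of the single closest case, which is the "single object category" behaviour that KNN with larger k is meant to avoid.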
Because all the data in our sample dataset were prepared using the SEMLHI framework, the output is labeled and the method is supervised; for this reason we use the KNN algorithm with multi-label and evaluate our results (KNN is used for supervised learning, k-means for unsupervised learning). k-means can be used for datasets that include millions of labeled cases, as can approximate nearest neighbours (ANN), which is usually 10x-100x faster than KNN. Support-vector machines (SVM) work differently; they are a good and fast solution for many problems and will outperform k-NN in most cases. Figure 4 shows that logistic regression has high accuracy when expected and real predictions are compared.
Based on original data collected from Palestinian government hospitals over the past three years, five algorithms were used to predict the lab test using the Machine Learning Algorithm (MLA) component of the SEMLHI framework. ML approaches and algorithms [23] can achieve better performance than expert-selected ones. MLA uses two types of techniques: supervised learning (labeled input data are used to predict future outputs) and unsupervised learning (the input data are unlabeled). In the MLA module, the first step is to decide which technique to use, and the next is to select, from the list of algorithms, the most suitable ones based on mathematical selection criteria.
In supervised learning, the data set contains n rows (cases). Each case is evaluated using a function f: A → B, and its output is compared with label A or label B; f is learned from the training set of n cases by evaluating E, and f has a set of n(d). In unsupervised learning, the data are not labeled, and the technique is applied in analysis and dimensionality reduction. This type [24] uses a training set t that includes n objects, t = {xi ∈ A : 1 ≤ i ≤ n}, which can be categorized into k classes C1, …, Ck ⊆ A by applying the algorithm f in the evaluation phase to determine the class Cj, 1 ≤ j ≤ k, to which an input x belongs (x ∈ Cj).

Machine Algorithm Model:
Machine learning helps us extract useful features from the dataset to solve or predict health-related events [25]. The MAM component includes five sub-modules: read the data, prepare the data, train the model, test and evaluate the model, and predict new data. Figure 5 describes the sequence of these stages.
The challenge for this component is to use the right type of algorithm, one that optimally fits the dataset while avoiding high bias or high variance. The main role of the MAM component is to analyze the dataset against a set of conditions: if the dataset has more than 50 samples and is labeled, classification algorithms are selected; if not, clustering algorithms are applied. If the dataset is needed to predict a quantity, regression algorithms are used; if not, dimensionality reduction is applied. The next diagram shows in detail which algorithms to select for the MAM component.
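The selection conditions above can be written as a small decision function. This is only one possible reading of the rules: the text does not fix the order in which the conditions are checked, nor what happens for 50 samples or fewer, so the ordering, the `ValueError`, the function name `select_algorithm_family`, and its parameters are all assumptions made for this sketch.

```python
def select_algorithm_family(n_samples, has_labels, wants_category, wants_quantity):
    """Pick an algorithm family from simple dataset properties.

    Hypothetical helper, not part of the SEMLHI code base.
    """
    # The rules in the text are stated for datasets with more than 50 samples;
    # the behaviour for smaller datasets is not specified, so we refuse here.
    if n_samples <= 50:
        raise ValueError("rules are only stated for more than 50 samples")
    if wants_category:
        # Labeled data -> classification, unlabeled data -> clustering.
        return "classification" if has_labels else "clustering"
    # No category to predict: a quantity calls for regression,
    # otherwise dimensionality reduction is applied.
    return "regression" if wants_quantity else "dimensionality reduction"


# Example: a labeled dataset of 720 cases whose target is a category.
family = select_algorithm_family(720, has_labels=True,
                                 wants_category=True, wants_quantity=False)
```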

Health Informatics Data
In ML, the data are essential, while the methods applied to present and visualize knowledge are the most important step. Our dataset sample contains ten columns with 50k rows (cases); to use the dataset with Health Informatics Data (HID) algorithms, it must be transformed into numerical features. Data that contain missing, duplicated, or null values, or invalid values such as negative ages or extremely large integers, could negatively affect the performance of our MLA. Figure 4 describes the main rules for deciding which methodology to use: classification, clustering, regression, or reduction.
HID uses the data source and a translation dictionary for label encoding, converting each value in a column to a number in order to reduce the misinterpreted data used by Bayesian inference. The Node Identifier analyzes the data as a common process and determines patterns using a patient-specific research identifier, since the dataset usually requires multiple records from the same patient to be identified as related in the de-identified database. For outliers, HID uses a set of methods that find hidden groups in order to remove outliers, and at an advanced step it calculates the outlier values, which appear to be erroneous, and crops them from the dataset. Figure 8 shows that logistic regression has high accuracy when expected and real predictions are compared.
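The label-encoding and outlier-cropping steps described for HID can be sketched as follows. The sketch assumes a simple fixed valid range for ages (0 to 120) in place of the group-finding outlier methods the framework uses, and the function names, column names, and rows are hypothetical.

```python
def build_label_dictionary(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    dictionary = {}
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
    return dictionary


def encode_column(values, dictionary):
    """Replace every value in a column by its numeric code."""
    return [dictionary[value] for value in values]


def drop_out_of_range(rows, column, low, high):
    """Keep only rows whose value in `column` lies in [low, high]."""
    return [row for row in rows if low <= row[column] <= high]


# Made-up records containing a negative age and an extremely large integer.
rows = [
    {"age": 34, "gender": "F"},
    {"age": -2, "gender": "M"},
    {"age": 61, "gender": "M"},
    {"age": 999999, "gender": "F"},
]
# Crop erroneous ages first (the 0-120 range is an assumption), then encode.
clean_rows = drop_out_of_range(rows, "age", low=0, high=120)
gender_codes = build_label_dictionary(row["gender"] for row in clean_rows)
encoded = encode_column([row["gender"] for row in clean_rows], gender_codes)
```

Keeping the dictionary separate from the encoded column makes it possible to translate predictions back to the original labels later.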
To predict disease, we use ICD-10 with multi-label, as each patient has an ICD code in the health records. A disease can affect all regions of the retina, though there is currently no classification system [23] to distinguish anterior (peripheral) from posterior (macular). We hypothesize that these classifications are characterized by D and refractive features, highlighting the disparity in types of disease.
We used collected electrocardiograph data to focus on the D most common diagnosis cases in the Diagnosis Code database: D = {d1, d2, d3, …, dn}, where d is a disease applicable to a diagnosis code and n is the number of disease classes, using the k-means algorithm with multi-label. Figure 7 describes pseudo code for the k-nearest neighbor algorithm for Multilevel Learning.

Read the data
In summary, to read the data from a data source such as a CSV file or another available source, the algorithm automatically removes missing values, cleans the data to remove noise and other data in the text description, and adds a category column to encode the test result name. As applied in our research, the result has shape (720, 27), as we have 720 case categories represented by 27 laboratory tests; this step generates a new dataset that includes 18 columns and 720 rows. Figure 8 summarizes the age and category features.
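A minimal sketch of this reading step, using only the Python standard library, is given here. The column names (`patient_no`, `age`, `test_name`, `result`), the sample rows, and the helper names are invented for illustration and do not reflect the actual hospital schema.

```python
import csv
import io


def read_and_clean(csv_text, required_columns):
    """Parse CSV text and drop rows with a missing value in any required column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        # A value counts as missing when it is absent, empty, or whitespace only.
        if any(not (row.get(col) or "").strip() for col in required_columns):
            continue
        rows.append({col: row[col].strip() for col in required_columns})
    return rows


def add_category_column(rows, source_column, target_column="category"):
    """Encode `source_column` as integer codes stored in `target_column`."""
    codes = {}
    for row in rows:
        name = row[source_column]
        codes.setdefault(name, len(codes))  # new names get the next free code
        row[target_column] = codes[name]
    return codes


# Made-up sample with two incomplete rows.
SAMPLE = """patient_no,age,test_name,result
1001,34,CBC,normal
1002,,CBC,high
1003,52,Glucose,high
1004,47,,normal
1005,29,CBC,normal
"""

records = read_and_clean(SAMPLE, ["patient_no", "age", "test_name", "result"])
test_codes = add_category_column(records, "test_name")
```

For a file on disk, the same `csv.DictReader` can be given an open file object instead of the `io.StringIO` wrapper.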

Select ML algorithm and domain
The medical data pose a multi-class problem, which calls for supervised classifiers: because the labels of all input data in the training set are known, we can train supervised classifiers on them and then apply them to unseen patient description text to predict the "Diagnose Code" category using a set of algorithms. This phase includes automatically selecting the algorithm by comparing the accuracy of all candidate algorithms.
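The automatic selection step, choosing the algorithm with the highest accuracy among all candidates, can be sketched as below. The two candidates here (a majority-class baseline and a 1-nearest-neighbour rule) are simple stand-ins chosen so the example runs with the standard library alone; they are not the five algorithms evaluated in this work, and the data and function names are made up.

```python
from collections import Counter
import math


def train_majority(features, labels):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda x: most_common


def train_nearest_neighbour(features, labels):
    """1-NN: predict the label of the closest training case."""
    def predict(x):
        index = min(range(len(features)), key=lambda i: math.dist(features[i], x))
        return labels[index]
    return predict


def accuracy(predict, features, labels):
    """Fraction of cases whose predicted label equals the true label."""
    correct = sum(predict(x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def select_best(candidates, train, test):
    """Train every candidate, score it on held-out data, return the best name."""
    train_x, train_y = train
    test_x, test_y = test
    scores = {
        name: accuracy(fit(train_x, train_y), test_x, test_y)
        for name, fit in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Made-up training and held-out cases.
train = ([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)], ["A", "A", "A", "B", "B"])
test = ([(0.2, 0.1), (5.1, 5.4), (4.8, 5.9), (0.9, 0.2)], ["A", "B", "B", "A"])
candidates = {"majority": train_majority, "nearest_neighbour": train_nearest_neighbour}
best_name, scores = select_best(candidates, train, test)
```

Scoring on held-out cases rather than on the training set is what keeps the comparison from simply rewarding the candidate that memorizes the training data.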

Software
The software module includes the sub-classes reuse, performance, test, privacy, and security. The main point of software testing is to verify that the code runs correctly by exercising it under known conditions and checking that the results are as expected [26]. This class is used to test the application's use of resources such as memory and CPU. Performance issues are addressed by first measuring them, then profiling the code, and then optimizing; using a benchmark is the best way to compare results and confirm that the optimization improves performance. A study of code smells [27] found Genetic Algorithms, used by 22.22%, to be the most commonly used machine learning technique.
[Due to technical limitations, the formula could not be displayed here. Please see the supplementary files section to access the formula.] (1) In multi-labelled classification, a prediction containing a subset of the actual classes should be considered better than a prediction that contains none of them; for example, predicting two of the three labels correctly is better than predicting no labels at all. Misclassification by a multi-class classifier is measured using micro-averaging and macro-averaging [16].
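The difference between micro-averaging and macro-averaging can be illustrated on precision. The sketch below assumes one true label and one predicted label per case and treats a class that is never predicted as having precision 0; both choices, as well as the function names and the toy labels, are assumptions made for illustration and are not a reconstruction of formula (1).

```python
def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for truth, predicted in zip(y_true, y_pred):
        if truth == predicted:
            counts[truth]["tp"] += 1
        else:
            counts[predicted]["fp"] += 1  # predicted class gained a wrong case
            counts[truth]["fn"] += 1      # true class lost a case
    return counts


def macro_precision(counts):
    """Average the per-class precisions, so every class weighs the same."""
    values = []
    for c in counts.values():
        denominator = c["tp"] + c["fp"]
        values.append(c["tp"] / denominator if denominator else 0.0)
    return sum(values) / len(values)


def micro_precision(counts):
    """Pool the counts over all classes first, so every case weighs the same."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    return tp / (tp + fp) if tp + fp else 0.0


# Made-up labels with one frequent class and two rare ones.
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
counts = per_class_counts(y_true, y_pred)
macro = macro_precision(counts)
micro = micro_precision(counts)
```

On imbalanced data such as this, the macro average is pulled down by the rare classes that are predicted poorly, while the micro average is dominated by the frequent class.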

Acknowledgment
We thank the Health Minister of the State of Palestine, Dr. Jawad Awwad, for the efforts made to give us access to the Palestinian patient data set, and all the teams that supported us during the last two years with feedback that greatly improved the manuscript.

Authors' contributions
Moreb and Ata contributed equally to the work. Moreb designed the study, prepared and analyzed the data, and developed the framework using Python. Ata tested the framework and analyzed the methodology steps and the framework evaluation. All authors have read and approved the final manuscript.

Funding
Not applicable.

Availability of data and materials
The data are available upon request to the corresponding author after signing appropriate documents in line with ethical application and the decision of the Ethics Committee.

Ethics approval and consent to participate
The research meets all applicable standards with regard to the ethics of experimentation and research integrity, and the following is being certified/declared true. As an expert scientist and along with co-authors of concerned field, the paper has been submitted with full responsibility, following

Figure 3
Machine learning algorithms that are used for health classification.

Figure 7
Pseudo code of the k-nearest neighbor algorithm for Multilevel Learning.

Figure 8
Summary of the age and category features with the other variables (test result, gender, ICD-10).