Enhanced Diagnosis of the COVID-19 Behaviour Using the Rough Set Theory and Genetic Algorithms

The outbreak of the 2019 coronavirus disease (COVID-19) has created a major challenge for healthcare systems worldwide. One of the most critical aspects of this challenge is the management of COVID-19 patients needing acute and/or critical respiratory care. The main objective of applying data mining to a COVID-19 dataset is to propel learning by empowering data-oriented decision making to improve existing clinical practices and learning materials. Current data mining techniques offer patient data analysis for achieving an automated diagnosis of diseases; however, the results are neither very accurate nor reliable, especially for a virus as dynamic as COVID-19. In this paper, we propose a multi-stage diagnostic (MSD-Covid19) model to enhance the diagnosis of COVID-19 and to provide a sustainable automated system that improves healthcare systems and patient outcomes. The first stage selects a classification model with no attribute reduction. The tested classification algorithms include deep learning, Multilayer Perceptron, KNN, Bayesian Auto Regression, Logistic Model Trees (LMT), the Hoeffding tree (VFDT), and the Fuzzy Unordered Rule Induction Algorithm. In the second stage, a rough set reduction algorithm based on genetic algorithms is employed, and finally, the classification is re-optimized using the reduced attribute set. The proposed model is evaluated on a global COVID-19 dataset. Experimental results demonstrate that the proposed MSD-Covid19 model substantially increases the diagnostic accuracy of COVID-19 disease behaviour.


Introduction
A coronavirus is a member of a large family of viruses that includes the common cold. COVID-19 is a new respiratory disease that emerged in late 2019 and early 2020 in the Chinese city of Wuhan, Hubei Province, killing many people. The virus is known as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), and the World Health Organization (WHO) named the disease COVID-19 after more than 1,000 people had died [1]. In any emergency, the most significant way to overcome it is to understand the problem itself and its effects. To respond better, it is essential to learn about this dangerous virus. This knowledge has many dimensions; one of the most critical is awareness of the extent of harm, both human and social. One reason some governments can perform better than others is that they have the knowledge, data, models and abilities to manage the emergency.
There are many ways to analyze the COVID-19 data, such as time-series analysis, simulation and modelling, benchmarking, data mining, etc. Data mining is a process of analyzing information to discover patterns within datasets in a human-understandable structure. More generally, data mining analyzes data from different perspectives and summarizes it into useful information that can be used to increase revenue, cut costs, or enhance a system's performance. Data analysis is an approach to resolving problems or events from a data perspective, in order to understand information at a deeper level and form new insights about it.
To perform a data analysis task, a suitable machine learning method must be selected. Some machine learning methods are appropriate for certain data structures and not for others. Therefore, a machine learning technique should be chosen for the data based on a particular purpose, and each technique may report different accuracy and performance. The set of features in a dataset represents a fingerprint of that dataset. These features may be continuous, discrete, or categorical. All data mining models, whether they predict the spread of opinion, explain the behaviour of networks, or study the spread of viruses, contain factors and features that should be considered and studied carefully.
The main objective of applying data mining to the COVID-19 dataset is to propel learning by empowering data-oriented decision making to improve existing clinical practices and learning materials. Several data mining studies [2] have been developed over the last few months to monitor, trace, and diagnose the spread of the deadly virus; however, none of these studies has provided a deep and accurate analysis of the models adopted, the features, and the data types. In addition, current data mining techniques offer patient data analysis, but the results are not very accurate, are limited by the shape of the data, and are unreliable due to the dynamic change of the virus. Thus, we propose a multi-stage model to improve the critical care of COVID-19 patients by providing a sustainable automated model that delivers an efficient diagnosis of COVID-19 patterns and discovers associated factors or features that are highly correlated with COVID-19. This paper focuses on finding the key factors affecting the performance of COVID-19 clinical diagnosis. The findings of this paper will positively impact future decisions about the progress of patients' outcomes, the quality of the clinical process, and the future of the clinical provider.
The first stage of the model discovers the impact of all the features of the COVID-19 dataset by selecting a proper classification model; the second is the reduction step. The third stage assesses the effect of the selected attributes after reduction, which increases the accuracy.
The rest of this paper is organized as follows. Section 2 describes the related work and background on the COVID-19 pandemic. Section 3 describes the rough set theory and attribute reduction. Section 4 introduces the proposed multi-stage diagnosis algorithm. Sections 5 and 6 discuss the experimental dataset and the framework used to evaluate the performance of the proposed algorithm. Finally, Sect. 7 presents conclusions and future directions.

Related Work And Background
Data mining is utilised to trace and predict how the COVID-19 disease may spread and reshape over time. For example, following a past outbreak, that of the 2015 Zika virus, Akhtar et al. set up a neural network to predict its spread; such methods would need to be re-trained using data from the COVID-19 pandemic [3].
Since the outbreak of COVID-19, researchers have tried to develop learning methods for COVID-19 screening based on medical images such as CT scans using deep learning. Huang et al. assessed the severity of pulmonary manifestations of COVID-19 through chest CT using a deep learning method. According to the deep learning algorithm, there was a significant difference in the percentage of lung opacity between patients of different clinical severity. This automated tool for quantifying lung involvement may be used to monitor disease progression and understand the temporal evolution of COVID-19. Patients with COVID-19 who underwent chest CT between 1 January 2020 and 3 February 2020 were screened. Patients were divided into mild, moderate, severe and critical types according to the initial clinical, laboratory and CT findings. The percentage of opacity in the whole lung and the five lobes was automatically measured by deep learning and compared with CT scan follow-ups. Longitudinal changes in the quantitative CT parameter were also compared among the four clinical types [4].
After computed tomography (CT) was shown to be a useful way to diagnose COVID-19 patients, Yang et al. found that publicly available COVID-19 CT datasets were incomplete due to privacy issues: they contain only a small number of samples, which hinders the research and development of CT-based COVID-19 diagnostic methods [5].
Zhang et al. conducted a retrospective analysis of CT findings in patients infected with coronavirus 2019 (COVID-19). The collection included thirty-four cases, 15 women and 19 men, ranging in age from 7 to 88 years, all confirmed by reverse transcriptase-polymerase chain reaction (RT-PCR), and thin-section lung CT scans were performed in all patients. Clinical, laboratory and CT imaging data were available for evaluation in all patients. Preliminary CT scans were shown to be important for early detection and assessment of COVID-19 disease progression [6]. Young et al. developed a deep learning-based CT diagnostic system to help physicians identify patients with COVID-19 [7]. They developed a deep learning algorithm by modifying a transfer-learning model for the clinical setting [8]. Shi et al. proposed an infection size-aware random forest approach (iSARF) that could automatically classify individuals into groups with different ranges [9]. Narin et al. proposed three different deep learning models, ResNet50, InceptionV3, and Inception-ResNetV2, to detect COVID-19 infection from X-ray images. Notably, the COVID-19 dataset used, which consisted of X-ray images of 50 COVID-19 patients and 50 normal chest images, is available on Kaggle.
Evaluation results show that the ResNet50 model was reasonably accurate [10].
Xu et al., in response to the question "Can artificial intelligence technology be used for early detection of COVID-19 patients from computed tomography (CT) images, and what is its diagnostic accuracy?", conducted a case study aiming to establish an initial screening model to distinguish COVID-19 pneumonia from influenza-A viral pneumonia and healthy cases using pulmonary CT images and deep learning. Through this research, they showed that this fully automated method could be a promising complementary diagnostic tool for front-line clinicians [11].
Hosseini et al., to control the epidemic and prevent COVID-19 infection, developed an efficient optimization algorithm that can solve NP-hard as well as applied optimization problems. They first proposed the COVID-19 optimization algorithm (CVA) to cover almost all operational areas of optimization problems. They also simulated the distribution of the coronavirus in several countries around the world. They then modeled the coronavirus distribution process as an optimization problem to minimize the number of countries infected with COVID-19 and thus slow the spread of the epidemic. In addition, they proposed three scenarios to solve the optimization problem using the factors influencing the distribution process. The simulation results show that one of the control scenarios performs better than the others. Extensive simulations on several optimization problems show that CVA outperforms the volcano eruption algorithm (VEA), grey wolf optimizer (GWO), and particle swarm optimization (PSO), with a maximum of 15%, 37%, 53% and 59% improvement in best performance [12].
Data mining technology can be applied to healthcare in order to build predictive models that provide predictions in real environments using real clinical data. The 2019 novel coronavirus (COVID-19) presents several unique features [14, 15]. While the diagnosis is confirmed using polymerase chain reaction (PCR), infected patients with pneumonia may present on chest X-ray and computed tomography (CT) images with a pattern that is only moderately characteristic to the human eye [16]. In late January, a Chinese team published a paper detailing the clinical and paraclinical features of COVID-19. They reported that patients present abnormalities in chest CT images, with most having bilateral involvement [17]. Bilateral multiple-lobe and subsegmental areas of consolidation are the typical findings in chest CT images of intensive care unit (ICU) patients on admission [18]. In comparison, non-ICU patients show bilateral ground-glass opacity and subsegmental areas of consolidation in their chest CT images [19]. In these patients, later chest CT images show bilateral ground-glass opacity with resolved consolidation [20]. Table 1 compares the reviewed methods; the remarks it records include high diagnostic accuracy; failure of the algorithms to generalize to another or larger data set; the need for radiologist supervision; the need for a better model to train and improve results; datasets that were not publicly available; the COVID-19 optimizer algorithm (CVA) [12] (2020), which creates an efficient optimization algorithm able to solve NP-hard as well as applied optimization problems; and, for further large-scale validation, the need to repeat studies in several hospitals and several districts.

The Rough Set Theory
The primary gain of rough set theory is to reach an approximate meaning of the given data. The theory is a powerful mathematical tool for handling ambiguity and uncertainty, with methods to remove attributes that are inapplicable or unnecessary for a dataset. This feature reduction preserves the main functionality in the data without losing the basic structure of the dataset. As a result of data reduction, a set of concise and meaningful rules is obtained that facilitates the decision-making process [28]. In an information table (dataset), each subset of attributes induces an indiscernibility relation: two objects u and v are indistinguishable by an attribute set B if a_i(u) = a_i(v) for every attribute a_i ∈ B of the dataset. The symbol ≡ denotes this equivalence relation, and [x]_P denotes the equivalence class of an object x under the attribute set P. The P-lower approximation of a set X ⊆ U is then given by Eq. 4.

PX = {x | [x]_P ⊆ X} (4)
The positive region of the partition U/Q with respect to P on U is defined as in Eq. 5.

POS_P(Q) = ∪ {PX : X ∈ U/Q} (5)
Finally, the degree of dependence of the attributes can be represented by Eq. 6, which indicates that Q depends on P with degree of dependence k (0 ≤ k ≤ 1):

k = γ(P, Q) = |POS_P(Q)| / |U| (6)
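A minimal sketch of these notions in plain Python, on a hypothetical toy decision table (the function names and the table are illustrative, not from the paper):

```python
from collections import defaultdict

def partition(table, attrs):
    """Equivalence classes of the indiscernibility relation induced
    by the attribute set `attrs` (objects are row indices)."""
    classes = defaultdict(list)
    for i, row in enumerate(table):
        classes[tuple(row[a] for a in attrs)].append(i)
    return list(classes.values())

def lower_approximation(table, attrs, target):
    """P-lower approximation (Eq. 4): union of the equivalence
    classes that are fully contained in `target`."""
    return {i for cls in partition(table, attrs)
            if set(cls) <= target for i in cls}

def dependency(table, cond_attrs, dec_attr):
    """Degree of dependence k = |POS_P(Q)| / |U| (Eq. 6)."""
    pos = set()
    for dec_cls in partition(table, [dec_attr]):
        pos |= lower_approximation(table, cond_attrs, set(dec_cls))
    return len(pos) / len(table)

# Toy decision table: columns 0-1 are condition attributes, column 2 the decision.
U = [(0, 0, 'n'), (0, 1, 'y'), (1, 1, 'y'), (1, 1, 'y'), (0, 0, 'n')]
```

On this table, both condition attributes together determine the decision (k = 1), while attribute 0 alone only covers part of the universe.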

Reduction Attributes
Searching for an optimal subset involves finding features that are strongly related to the decision but independent of each other. The choice of the optimal subset varies depending on the problem being considered. Usually, feature reduction algorithms rely on heuristic methods or a random search to reduce the degree of complexity and ultimately produce a reduced subset of features [29]. The algorithm begins by initializing the reduction candidate to an empty set. Then each attribute appearing in the discernibility function is evaluated by a heuristic measure. The reduction algorithm generally counts the number of clauses in which a feature appears and considers the attribute that appears most often to be the most important. The feature with the highest heuristic score is added to the reduction candidate, and every clause of the discernibility function that contains an included feature is removed. As soon as all the clauses are removed, the algorithm terminates and returns the reduction candidate [30].
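One common heuristic variant of this scheme is a greedy forward search that grows the reduction candidate by the attribute giving the largest dependency gain (a QuickReduct-style sketch under stated assumptions; the table and names are illustrative):

```python
from collections import defaultdict

def dependency(table, attrs, dec):
    """Rough-set dependency degree: fraction of objects whose
    `attrs`-equivalence class is pure in the decision `dec`."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    pure = sum(len(b) for b in blocks.values()
               if len({table[i][dec] for i in b}) == 1)
    return pure / len(table)

def quick_reduct(table, cond_attrs, dec):
    """Greedy forward search: start from an empty candidate and add,
    at each step, the attribute that raises the dependency the most."""
    reduct, best = [], 0.0
    full = dependency(table, cond_attrs, dec)
    while best < full:
        gains = {a: dependency(table, reduct + [a], dec)
                 for a in cond_attrs if a not in reduct}
        a_star = max(gains, key=gains.get)
        if gains[a_star] <= best:
            break          # no remaining attribute helps; stop
        reduct.append(a_star)
        best = gains[a_star]
    return reduct

# Attribute 1 alone determines the decision; attributes 0 and 2 are redundant.
T = [(0, 0, 0, 'n'), (0, 1, 0, 'y'), (1, 1, 1, 'y'), (1, 0, 1, 'n')]
```

Greedy search of this kind is fast but may miss the globally minimal reduct, which is the motivation for the genetic search introduced below.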

The Genetic Algorithms
Genetic algorithms are stochastic, evolutionary search techniques based on the principles of biological evolution, natural selection, and genetic recombination. They simulate the principle of 'survival of the fittest' in a population of potential solutions referred to as chromosomes. Each chromosome represents one possible solution to the problem, or one classification rule [31]. Genetic algorithms cannot directly handle data in the solution space; it must be expressed as genotype string-shaped data in the genetic space through encoding. A fixed-length string of binary symbols is then used to represent individuals [32].
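As a concrete illustration of this encoding, here is a minimal GA over fixed-length binary chromosomes; the population size, mutation rate, and the one-max objective are illustrative choices, not taken from the paper:

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=60,
                   p_mut=0.05, seed=0):
    """Minimal GA sketch: binary chromosomes, binary tournament
    selection, one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]

    def tourney():
        a, b = rng.sample(pop, 2)           # pick two, keep the fitter
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = tourney(), tourney()
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [b ^ (rng.random() < p_mut) for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# 'Survival of the fittest' on a toy objective: maximise the number of 1-bits.
best = genetic_search(fitness=sum, n_bits=12)
```

The fitness function is the only problem-specific component; Sect. 5 plugs the rough-set dependency degree into this slot.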

The Proposed Multi-stage Covid-19 Diagnosis (Msd-covid19) Algorithm
In this paper, we employ a data mining-based model to improve the critical care of COVID-19 patients. The proposed method uses feature reduction to increase the accuracy of the model and to identify factors affecting COVID-19 diagnosis. The multi-stage model consists of three steps, which are fully explained below.

Classification Model with No Reduction Attributes
The first step is to select a classifier that performs best in diagnosing COVID-19. Eight classification models are employed on the COVID-19 dataset with no attribute reduction. We have used deep learning, Multilayer Perceptron (MLP), k-Nearest Neighbours (KNN), Bayesian Auto Regressions (BARs), Logistic Model Trees (LMT), the Hoeffding Tree (VFDT), and the Fuzzy Unordered Rule Induction Algorithm (FURIA). Each model was implemented separately on the COVID-19 dataset with 10-fold cross-validation, and the following results were obtained. Some of the classification algorithms adopted in the paper are described next.
Deep learning is a branch of machine learning that generates multi-layered representations of data, normally using artificial neural networks. It has improved the state of the art in numerous machine learning tasks. This study uses a deep learning framework [21] that supports fully connected feedforward networks, convolutional networks, and recurrent networks.
The Multilayer Perceptron (MLP), as a nonlinear learning and modelling tool, has been successfully employed in a broad range of applications. In theory, such applications fall into three basic categories: function estimation (regression), pattern classification, and distribution estimation. The standard learning rules for MLPs are based on the use of i.i.d. (independent and identically distributed) training data to minimize an empirical loss/risk function defined according to the application class [22].
Bayesian Auto Regressions (BARs) are linear multivariate time-series models able to capture the joint dynamics of multiple time series. Bayesian inference treats the model parameters as random variables. It provides a framework to estimate the 'posterior' probability distribution of the model parameters by combining information provided by a sample of observed data with prior knowledge derived from a variety of sources, such as other macro or micro datasets, theoretical models, other economic phenomena, or introspection [23]. The Fuzzy Unordered Rule Induction Algorithm (FURIA) was proposed by Huhn and Hullermeier; since it is based on fuzzy rules, it models the decision boundaries with fuzzy intervals, making them more flexible, and generates an unordered rule set in place of the regular rule list [24].
The Hoeffding tree (VFDT) is an incremental decision tree induction algorithm capable of learning from massive data streams, assuming that the distribution generating the examples does not change over time. This idea is supported mathematically by the Hoeffding bound, which quantifies the number of observations needed to estimate a statistic within a prescribed precision. Using the Hoeffding bound, one can show that its output is asymptotically nearly identical to that of a non-incremental learner using infinitely many examples [25].
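The Hoeffding bound referenced above can be stated explicitly: after n independent observations of a real-valued random variable with range R, the true mean differs from the sample mean by more than ε with probability at most δ, where

```latex
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}}
```

VFDT uses this ε to decide, from the examples seen so far, whether the best split attribute at a node is reliably better than the second best before committing to the split.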
The Logistic Model Tree (LMT) is a classification method with a supervised learning algorithm that combines logistic regression and decision tree learning. Logistic model trees are based on the earlier idea of a model tree [26].
The k-Nearest Neighbours algorithm (KNN) is a non-parametric method proposed by Cover, used for classification and regression. In both cases, the input consists of the k nearest instances in the feature space, and the output of k-NN is used for classification or regression [27].
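The 10-fold cross-validation protocol used in this stage can be sketched in plain Python. Here a 1-nearest-neighbour rule and a two-cluster toy dataset stand in for the paper's classifiers and for the COVID-19 data (all names and data are illustrative):

```python
import random

def kfold_accuracy(X, y, classify, k=10, seed=0):
    """k-fold cross-validation: shuffle, split into k folds, train on
    k-1 folds, test on the held-out fold, and average the accuracies."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accs = []
    for fold in folds:
        train = [i for i in idx if i not in fold]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        hits = sum(classify(Xtr, ytr, X[j]) == y[j] for j in fold)
        accs.append(hits / len(fold))
    return sum(accs) / k

def nn1(Xtr, ytr, q):
    """1-nearest-neighbour rule (standing in for any of the classifiers)."""
    return min(zip(Xtr, ytr),
               key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], q)))[1]

# Two well-separated toy clusters as placeholder data.
X = [(float(i % 5), 0.0) for i in range(20)] + \
    [(float(i % 5) + 10.0, 0.0) for i in range(20)]
y = [0] * 20 + [1] * 20
acc = kfold_accuracy(X, y, nn1)
```

Any classifier with the same `classify(Xtr, ytr, q)` signature can be dropped into this harness, which mirrors how each of the eight models is evaluated separately.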

Stage 2 (Feature Reduction Using Rough Set Based on GA)
To overcome the challenges of feature reduction with a large number of features, a genetic algorithm is employed to reach the minimal feature set of the dataset, exploiting its exceptional global search capability in combination with rough set theory. The genetic algorithm is an evolutionary model for function optimization based on biological evolution. Existing reduction algorithms start from rough set principles to obtain a reduced feature set. S = (U, R) is a data table, where U is the set of objects and R is the set of condition features. Decision information is represented by S = (U, R ∪ {D}), where D ∉ R is called the decision feature. The decision feature D depends on the set of condition features R, where R ⇒ D, with degree k (0 ≤ k ≤ 1) as follows:

k = γ(R, D) = |POS_R(D)| / |U|. (7)

Stage 3 (Reclassification Analysis with Reduction Attributes and Genetic Algorithm)

This step constructs the discernibility matrix of attributes based on the genetic algorithm. The method encodes each attribute into the initial strings and searches for a minimal attribute set as the reduction subset. The rough set dependency degree is used as the fitness function.
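The combination of the two stages can be sketched as follows: chromosomes are bit masks over the condition attributes, and the fitness is the rough-set dependency of Eq. 7; the small penalty on mask size is an illustrative tie-breaker favouring minimal reducts, not a detail stated in the paper, and all names and parameters are illustrative:

```python
import random
from collections import defaultdict

def dependency(table, attrs, dec):
    """Rough-set degree of dependence k = |POS_R(D)| / |U| (Eq. 7)."""
    blocks = defaultdict(list)
    for i, row in enumerate(table):
        blocks[tuple(row[a] for a in attrs)].append(i)
    pos = sum(len(b) for b in blocks.values()
              if len({table[i][dec] for i in b}) == 1)
    return pos / len(table)

def ga_reduct(table, n_attrs, dec, pop_size=30, gens=80, p_mut=0.1, seed=1):
    """GA over attribute bit masks: keep the fitter half (elitism) and
    refill with one-point crossover plus bit-flip mutation."""
    rng = random.Random(seed)

    def fit(mask):
        attrs = [a for a in range(n_attrs) if mask[a]]
        if not attrs:
            return -1.0
        return dependency(table, attrs, dec) - 0.01 * len(attrs)

    P = [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(pop_size)]
    for _ in range(gens):
        P.sort(key=fit, reverse=True)
        elite = P[:pop_size // 2]
        kids = []
        while len(elite) + len(kids) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            cut = rng.randrange(1, n_attrs)
            kids.append([b ^ (rng.random() < p_mut)
                         for b in p1[:cut] + p2[cut:]])
        P = elite + kids
    best = max(P, key=fit)
    return [a for a in range(n_attrs) if best[a]]

# Toy table: attribute 1 alone determines the decision (columns 0-2 are
# condition attributes, column 3 the decision).
T = [(0, 0, 0, 'n'), (0, 1, 0, 'y'), (1, 1, 1, 'y'), (1, 0, 1, 'n')]
reduct = ga_reduct(T, n_attrs=3, dec=3)
```

On this toy table the search settles on a mask containing attribute 1, whose dependency degree is already 1.0, dropping the redundant attributes.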
The pseudocode of the proposed algorithm is given in Fig. 1. A sample X-ray is shown in Fig. 2.

Experimental Analysis
In the first stage of the algorithm, several classification algorithms are adopted, including deep learning, Multilayer Perceptron (22 hidden layers), Multilayer Perceptron classifier (one hidden layer), Bayesian Auto Regression, KNN, FURIA, VFDT, and LMT. Table 3 shows the classification accuracy with no attribute reduction. We have used the True Positive (TP) Rate, False Positive (FP) Rate, Precision, Recall, and F-Measure [12] as accuracy metrics. Results using the rough set model and genetic algorithms to reduce the number of redundant features, thus improving classifier accuracy, are shown in Table 4, and the resulting metrics are summarized in Table 5.
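These metrics follow directly from the confusion-matrix counts. A generic sketch of their computation (not the paper's evaluation code; it assumes a binary labelling in which neither class is empty, so no zero-division guards are included):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """TP rate (recall), FP rate, precision, recall, and F-measure
    from actual vs. predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)               # identical to the TP rate
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"tp_rate": recall, "fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Illustrative labels: 2 true positives, 1 false negative, 1 false positive.
m = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```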
As can be seen in Table 5, the proposed method improves the classification metrics in all cases examined. The improvement is more pronounced for some methods and less for others, but in all of them the proposed method offers an acceptable improvement, indicating that the classification error is reduced. Therefore, the proposed method can be useful for increasing classification accuracy. For example, Fig. 3a shows 6 classification errors, whereas Fig. 3b shows 5, and similar reductions hold for the other figures. This indicates that the proposed method reduces the number of classification errors.

Conclusion
In this paper, the analysis of the COVID-19 dataset was performed in three stages. The first stage includes the selection of a classification model with no reduction attributes. Tested classification algorithms include