The literature review process helps to understand and analyze existing techniques in a comprehensive manner. The review assists in establishing the research methodology and the basic pillars behind the whole study. After a detailed literature review, several research questions (RQs) are drawn on the basis of previous techniques and information. The RQs help to explain the proposed work and methodology in detail. The research questions for the proposed study are as follows.
RQ-1: What are the current techniques for heart disease prediction in machine learning? (answered in the literature review)
RQ-2: Can SVM and CNN be combined to outperform existing ML algorithms for heart disease prediction?
RQ-3: How can we evaluate and validate the criteria of the SVM-CNN model?
RQ-4: How can we compare the performance of the proposed SVM-CNN model with existing models?
RQ2, RQ3, and RQ4 are discussed in the results and discussion section. Detailed information about the research process, the research methods used, and the data-processing methods is given in the methodology section. The approach used in scientific research is composed of a few basic steps, and each study differs in some way due to the period of time, conditions, and location in which it is conducted. Because the phases of the research process are interconnected, a modification in one phase also affects the other stages; after introducing a change in one phase, researchers must re-evaluate the remaining stages to ensure that the change is reflected throughout.
The proposed model mainly focuses on identifying heart disease with better accuracy using selected features. The performance of the proposed model has been tested on three datasets: the UCI Dataset, the Z-Alizadeh Sani Dataset, and the Cardiovascular Disease Dataset. The workflow of the model is implemented in the three stages shown in Fig. 1: the input phase, feature selection, and finally classification and performance evaluation of heart disease. These steps are discussed in the remainder of this section.
Input Phase
In the input phase, the following data pre-processing techniques are applied to each dataset to enhance the proposed model's capabilities and performance (a minimal code sketch of these steps is given after the list):
- Replace NaN values with the mean value of the corresponding column.
- Remove duplicate entries.
- Handle outliers.
- Apply label encoding, a common encoding technique for categorical information. This approach assigns a unique number to every label based on its alphabetical order, giving each category in the chosen datasets a unique identification number.
- Balance the classes of Dataset 1, which suffers from an imbalanced class distribution, using the SMOTE technique to improve model performance.
- Split the data in the ratio of 70:30: 70% is used for model training, while the remaining 30% is used for testing. The test set is used to evaluate the model's performance on unseen data.
- Apply feature scaling to the selected datasets using a standard scaler, which transforms the training data so that each feature has zero mean and unit variance.
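The following is a minimal sketch of these pre-processing steps, assuming pandas, scikit-learn, and imbalanced-learn are available; the file name heart.csv and the target column name are placeholders, and outlier handling is omitted for brevity.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

# Hypothetical heart-disease dataset with a "target" column.
df = pd.read_csv("heart.csv")

# Replace NaN values with the column mean and drop duplicate rows.
df = df.fillna(df.mean(numeric_only=True)).drop_duplicates()

# Label-encode categorical columns (unique integer per category).
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

X, y = df.drop(columns="target"), df["target"]

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# SMOTE to balance the training classes (used for the imbalanced UCI dataset).
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Standard scaling: fit on the training data only, then apply to both splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```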
Feature Selection Phase
After pre-processing, the pre-processed datasets are passed to the genetic algorithm, one of the evolutionary algorithms. The genetic algorithm, combined with the CNN, finds the optimal solution based on the F1-score criterion.
Genetic Algorithm
The genetic algorithm (GA) is a reliable search and optimization strategy. It was inspired by the natural selection that nature uses to produce species: in determining the most suitable and effective solution to a given computational problem, a genetic algorithm is comparable to natural evolution. The biological ideas of evolution and the survival of the fittest serve as its foundation. Because it offers a solid and reliable answer that is graded against fitness criteria, the algorithm is significantly more effective and powerful than an exhaustive search; the fitness function measures how closely a particular solution comes to being optimal. In this work the GA is used for feature selection, which aims to reduce the dimensionality of the data by identifying the subset of features that are most predictive or influential for the problem at hand. Focusing on the most significant features and reducing noise or irrelevant information improves both the efficiency and the accuracy of the solution.
The possible solutions to the problem are represented by a set of chromosomes. A chromosome is a string of binary digits, and each digit is called a gene; the initial population can be created randomly. The GA iteratively evolves this population of candidate solutions by mimicking the processes of selection, reproduction, and mutation, following the steps below (a minimal code sketch of this loop is given after the list).
- Initialization: Start by creating an initial population of potential solutions to the problem. Each solution is represented as a set of parameters called a chromosome, whose elements are genes. The population is typically generated randomly or using some heuristic; we use a population size of 50 and run the algorithm for 50 generations.
- Evaluation: Evaluate the fitness of each individual in the population. The fitness function measures how well each solution performs with respect to the problem at hand; it can be problem-specific and aims to quantify the quality or suitability of a solution.
- Selection: Select two individuals from the population at random and compare their fitness values. The selection process is based on these fitness values: solutions with higher fitness have a higher probability of being selected, mimicking the concept of survival of the fittest.
- Reproduction: Create offspring by combining the genetic material of the selected parents. This is typically done through genetic operators such as crossover and mutation. Crossover exchanges genetic information between two parents at a chosen crossover point to produce new individuals, while mutation introduces small random changes to chosen genes to promote exploration of the search space. The crossover probability is 0.8 and the mutation probability is 0.2.
- Replacement: Replace some individuals in the current population with the newly created offspring. The replacement strategy can vary, but commonly the offspring replace the least fit individuals, ensuring that the population maintains diversity and potential improvements.
- Termination: Repeat steps 2–5 for a certain number of generations or until a termination criterion is met. Termination criteria can be based on the number of generations, reaching a satisfactory fitness level, or a predefined time limit; the process is also stopped if continuing would begin to overfit the data. Finally, the best individuals are selected.
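The following is a minimal sketch of this feature-selection loop, assuming NumPy, scikit-learn, numeric feature arrays, and binary labels. It uses the settings stated above (population size 50, 50 generations, crossover probability 0.8, mutation probability 0.2); the fitness classifier, tournament-style selection, and elitist replacement shown here are illustrative choices rather than the exact configuration used in this work.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)

def fitness(mask, X, y):
    """F1 score (3-fold CV) of a classifier trained on the features kept by the binary mask."""
    if mask.sum() == 0:
        return 0.0
    clf = SVC(kernel="rbf")  # placeholder fitness classifier
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3, scoring="f1").mean()

def ga_feature_selection(X, y, pop_size=50, generations=50, p_cross=0.8, p_mut=0.2):
    n_features = X.shape[1]
    # Initialization: random binary chromosomes, one gene per feature.
    pop = rng.integers(0, 2, size=(pop_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        best = pop[scores.argmax()].copy()
        children = []
        while len(children) < pop_size - 1:
            # Selection: compare two random individuals twice, keep the fitter of each pair.
            i, j = rng.integers(0, pop_size, size=2)
            p1 = pop[i] if scores[i] >= scores[j] else pop[j]
            i, j = rng.integers(0, pop_size, size=2)
            p2 = pop[i] if scores[i] >= scores[j] else pop[j]
            child = p1.copy()
            # Crossover (probability 0.8) at a randomly chosen crossover point.
            if rng.random() < p_cross:
                point = rng.integers(1, n_features)
                child[point:] = p2[point:]
            # Mutation (probability 0.2): flip one randomly chosen gene.
            if rng.random() < p_mut:
                g = rng.integers(0, n_features)
                child[g] = 1 - child[g]
            children.append(child)
        # Replacement: keep the best individual (elitism) and replace the rest.
        pop = np.vstack([best] + children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[scores.argmax()].astype(bool)  # mask of selected features
```

Calling ga_feature_selection(X_train, y_train) would return a Boolean mask that keeps only the selected columns, e.g. X_train[:, mask].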
Now, selected features are used for classification and prediction.
Classification Phase Using Hybrid Model SVM-CNN
Set up the SVM model, including the choice of kernel function (e.g., linear, polynomial, or radial basis function) and hyperparameter selection, and train it on the training dataset. Define the architecture of the CNN model, including the arrangement and number of convolutional, pooling, and fully connected layers, and train it on the training dataset, optimizing the model's weights through forward and backward propagation. Then determine how the output features or representations from the SVM and CNN models are combined or integrated to create the hybrid model. SVM is a machine learning algorithm used for supervised classification, while CNN is a deep learning technique commonly used for classification. By combining the strengths of both SVM and CNN, the hybrid approach aims to leverage the discriminative power of SVM and the ability of CNN to learn hierarchical features automatically from the input data. Together, these components provide an effective means of classifying the input data based on the selected features, utilizing the power of both traditional machine learning and deep learning techniques.
Support Vector Machine
Support Vector Machine (SVM) is a classification and regression prediction model that makes use of machine learning theory to achieve the best accuracy and to prevent model overfitting. The goal of the support vector machine is to locate a hyperplane in N-dimensional space. For the separation of two classes there are numerous potential hyperplanes to choose from; the goal is to draw the hyperplane with the greatest possible margin so that upcoming data points can be classified reliably. Hyperplanes are decision boundaries that aid in categorizing the data points. Support vectors are the data points that lie closest to the hyperplane and indicate the characteristics of the classes; they influence the hyperplane's position and orientation. If there are two input features, the hyperplane is a line separating the attributes that belong to different classes; if there are three input features, the hyperplane is a two-dimensional plane; beyond three features, the hyperplane becomes difficult to visualize. In logistic regression, the sigmoid function compresses the input into the range [0, 1], and an input is assigned a label of 1 if it exceeds the threshold value (0.5). In SVM, the output of the linear function is considered when determining the class label: one class label is assigned if the output is 1, and the other class label is assigned if the output is -1, with a margin between the threshold values, which range from 1 to -1. The margin between the data points and the hyperplane should be maximized so that future data points can be classified more confidently. The kernel's goal is to make it possible for operations to be carried out in the input space: the data is transformed using kernel functions, and based on these transformations an optimal boundary between the possible outputs is found. For linearly separable data, the binary classes are separated using the linear kernel. For non-linearly separable data, a non-linear SVM is used, and the boundary determined by the non-linear kernels is not a straight line. Depending on the nature of the dataset, kernels such as Gaussian (RBF), polynomial, or sigmoid are employed in non-linear SVM.
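As an illustration of the kernel choice discussed above, the sketch below (assuming scikit-learn, a pre-processed feature matrix X, and binary labels y; the hyperparameter values are placeholders) trains an SVM with each kernel and reports its F1 score.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical pre-processed data: 10 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear kernel for linearly separable data; non-linear kernels otherwise.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    svm = SVC(kernel=kernel, C=1.0, gamma="scale")  # placeholder hyperparameters
    svm.fit(X_tr, y_tr)
    print(kernel, f1_score(y_te, svm.predict(X_te)))
```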
Convolutional Neural Network
A convolutional neural network (CNN) is an artificial neural network with numerous hidden layers; using more hidden layers allows a CNN to capture higher-order relationships. The input and output layers of a CNN are separated by these intermediate levels. Whatever their size or structure, neural networks consist of neurons, connections, weights, biases, and activation functions. Below is a detailed explanation of how CNNs work in our model and of their components.
Input Layer
The input layer of the CNN receives the output of the SVM model.
Convolutional Layer
The convolutional layer is the core component of a CNN. It consists of a set of learnable filters (also called kernels) that convolve over the input data. Each filter detects specific features by performing element-wise multiplication and summation. The outcome is a feature map that highlights where those features are present in the input.
Activation Function
After each convolution operation, an activation function is applied element-wise to introduce non-linearity. Common activation functions used in CNNs include the Rectified Linear Unit (ReLU), sigmoid, and hyperbolic tangent. ReLU is the most popular choice due to its simplicity and effectiveness in mitigating the vanishing gradient problem. We use ReLU in our model.
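For reference, these activation functions have the standard definitions (not specific to this work):

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$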
Pooling Layer
The pooling layer reduces the spatial dimensions of the feature maps while retaining the most salient information. It achieves this by applying operations like max pooling or average pooling within a localized region. Pooling helps in reducing computational complexity and making the learned features more invariant to small spatial variations.
Fully Connected Layer
The fully connected layer, also known as the dense layer, takes the output from the previous layers and connects every neuron to every neuron in the subsequent layer. It learns high-level representations by combining the learned features from previous layers. These layers are similar to those in a traditional neural network and perform classification or regression tasks.
Dropout
Dropout is a regularization technique commonly used in CNNs to prevent overfitting. During training, a certain percentage of neurons in the fully connected layer are randomly ignored or "dropped out." This helps in reducing interdependent learning between neurons and forces the network to learn more robust features.
Output Layer
The output layer of the CNN provides the final predictions or outputs. The activation function in the output layer depends on the nature of the problem being solved. For binary classification, a sigmoid function is typically used, while for multi-class classification, a softmax function is commonly used.
The proposed solution for heart disease classification involves utilizing a Genetic Algorithm (GA) for feature selection and a combination of Support Vector Machine (SVM) and Convolutional Neural Network (CNN) models (referred to as SVM-CNN) for prediction. This hybrid approach aims to enhance the accuracy and robustness of heart disease classification by selecting relevant features and leveraging the strengths of both SVM and CNN algorithms.
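A minimal end-to-end sketch of this pipeline is given below, assuming scikit-learn and Keras are available, that GA feature selection has already produced a reduced feature matrix X_sel with binary labels y, and that the SVM's decision scores are appended to the selected features before being passed to the CNN. The exact integration of the two models is described only at a high level above, so the handoff shown here is one plausible realization rather than the definitive implementation; all layer sizes and hyperparameters are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical GA-selected features X_sel and binary labels y.
rng = np.random.default_rng(0)
X_sel = rng.normal(size=(500, 12))
y = (X_sel[:, 0] - X_sel[:, 1] > 0).astype(int)

# 70:30 split and standard scaling, as in the input phase.
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=42)
scaler = StandardScaler()
X_tr, X_te = scaler.fit_transform(X_tr), scaler.transform(X_te)

# Stage 1: the SVM produces a discriminative score for each sample.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")  # placeholder hyperparameters
svm.fit(X_tr, y_tr)
svm_tr = svm.decision_function(X_tr).reshape(-1, 1)
svm_te = svm.decision_function(X_te).reshape(-1, 1)

# Stage 2: append the SVM score to the selected features and reshape to
# (samples, features, 1) so the 1-D CNN can convolve over them.
cnn_tr = np.concatenate([X_tr, svm_tr], axis=1)[..., np.newaxis]
cnn_te = np.concatenate([X_te, svm_te], axis=1)[..., np.newaxis]

# CNN mirroring the layer stack described above:
# convolution + ReLU, pooling, dense layer, dropout, sigmoid output.
model = keras.Sequential([
    layers.Input(shape=cnn_tr.shape[1:]),
    layers.Conv1D(32, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # binary heart-disease prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(cnn_tr, y_tr, epochs=10, batch_size=32, validation_split=0.1, verbose=0)

print("Test accuracy:", model.evaluate(cnn_te, y_te, verbose=0)[1])
```

Concatenating the SVM decision score with the GA-selected features is only one way to realize the SVM-to-CNN handoff described above; feeding the SVM's probability outputs to the CNN instead would follow the same pattern.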
Datasets
We use three different datasets and perform three experiments, one on each dataset. In experiment 1 we use dataset 1, the UCI Dataset, with 920 instances and 76 features, shown in Fig. 2. In experiment 2 we use dataset 2, the Z-Alizadeh Sani Dataset, with 303 instances and 56 attributes, shown in Fig. 3. In experiment 3 we use the Cardiovascular Disease Dataset, shown in Fig. 4, which contains 70,000 instances and 11 attributes.
First of all, the pre-processing techniques already discussed in the methodology are applied to clean the data. We start by preparing the datasets one by one, ensuring each instance is labeled with the corresponding class label: the UCI dataset has 4 labels, while the Z-Alizadeh Sani and Cardiovascular datasets each have 2. Categorical variables are transformed into numerical data using a label encoder. The data is then split into training and testing sets in the ratio of 70:30. Because the UCI data is highly imbalanced, the SMOTE technique is used to balance it. After balancing, feature scaling is applied using a standard scaler, which standardizes each feature to zero mean and unit variance. The pre-processing techniques are essentially the same for the three datasets, except for class-imbalance handling, which is used only for the UCI dataset.
The experiment consists of three phases: the input phase, the feature selection phase, and the classification phase. The input phase consists of the datasets, which are pre-processed to ensure the accuracy and efficiency of the proposed model. The three datasets have different numbers of instances and attributes, so the proposed model is trained and tested on datasets of different sizes. These datasets are sufficient and helpful for heart disease prediction for several reasons: they provide a diverse set of features, enabling the development of comprehensive prediction models; they offer large and diverse samples of patient data, ensuring that the models are trained on representative populations; they allow researchers to identify and investigate various risk factors associated with heart disease; and they can serve as baselines for evaluating new models and comparing their performance against established methods.
These datasets also have some limitations. They may have missing values or incomplete data for certain features, which can affect the model's performance and accuracy; we address this by employing imputation techniques that fill in missing values based on statistical methods or machine learning algorithms. The data might also contain errors, outliers, or noisy measurements due to factors such as data collection errors or inconsistent recording practices; data cleaning and pre-processing steps are applied to remove outliers and correct errors, and feature selection further reduces the impact of noisy data on the model's performance.