Human existence evolves day by day, while the health of each generation either improves or deteriorates. Life always carries uncertainties, and large numbers of individuals occasionally face fatal health problems because their diseases were detected too late. Among adults, chronic liver disease affects more than 50 million individuals worldwide. If the disease is diagnosed early, however, its progression can be stopped. Machine learning-based disease prediction can be used to identify common diseases at an earlier stage. Health is currently treated as a secondary concern, which has led to numerous problems: many patients cannot afford to see a doctor, and others are extremely busy and on tight schedules, yet ignoring recurring symptoms for an extended length of time can have significant health repercussions [1].
A medical diagnosis is a form of problem-solving and a crucial, significant issue in the real world. Illness diagnosis is the process of translating observational evidence into disease names. The evidence comprises data obtained from evaluating a patient and substances derived from the patient; illnesses are conceptual medical entities that explain anomalies in the observed evidence [2].
Diseases are a global issue, so medical specialists and researchers are exerting their utmost efforts to reduce disease-related mortality. In recent years, predictive analytic models have come to play a pivotal role in the medical profession as a result of the increasing volume of healthcare data from a wide range of disparate and incompatible sources. Nonetheless, processing, storing, and analyzing the massive amount of historical data, along with the constant inflow of streaming data created by healthcare services, has become an unprecedented challenge for traditional database storage [3, 4, 5].
The concept of medical care is used to stress the organization and administration of curative care, a subset of healthcare [6]. The ecology of medical care was first introduced by White in 1961. White also proposed a framework for perceiving patterns of health in relation to the symptoms experienced by particular populations of interest, along with individuals' choices in seeking medical treatment. This framework makes it possible to calculate the proportion of a population that used medical services over a specific time period. The "ecology of medical care" theory has become widely accepted in academic circles over the past few decades [7].
Healthcare is the collective effort of society to ensure, provide, finance, and promote health. In the 20th century, there was a significant shift toward the ideal of wellness and the prevention of sickness and incapacity. The delivery of health care services entails organized public or private efforts to aid persons in regaining health and preventing disease and impairment [8]. Healthcare can be described as standardized rules that help evaluate actions or situations that affect decision-making [9].
Healthcare is a multidimensional system. Its basic goal is to diagnose and treat illnesses or disabilities. The key components of a healthcare system are health experts (physicians and nurses), health facilities (clinics and hospitals that provide medications and other diagnostic services), and a funding institution to support the first two [10].
With the introduction of computer-based systems, the digitalization of all medical records and the evaluation of clinical data have become widespread routine practice in healthcare systems. The phrase "electronic health records" was chosen in 2003 by the Institute of Medicine, a division of the National Academies of Sciences, Engineering, and Medicine, to describe records that continue to enhance the healthcare sector for the benefit of both patients and physicians. Electronic Health Records (EHR) are "computerized medical records for patients that include all information in an individual's past, present, or future which occur in an electronic system used to capture, store, retrieve, and link data primarily to offer healthcare and health-related services," according to Murphy, Hanken, and Waters [10].
Healthcare services produce an enormous amount of data daily, making it increasingly complicated to analyze and handle by "conventional ways." Using machine learning and deep learning, this data can be properly analyzed to generate actionable insights. In addition, healthcare data can be supplemented with genomics, medical data, social media data, environmental data, and other data sources; Figure 1 provides a visual picture of these sources. The four key healthcare applications that can benefit from machine learning are prognosis, diagnosis, therapy, and clinical workflow, as outlined in the following section [11].
The increased interest in predictive analytics for healthcare is reflected in the long-term investment in novel technologies based on machine learning and deep learning techniques that aim to improve individuals' health by predicting future events. Clinical predictive models, as they were formerly called, assisted in diagnosing persons with an increased probability of disease. These prediction algorithms are used to make clinical treatment decisions and to counsel patients based on their characteristics [12].
Artificial Intelligence (AI) is a scientific field that successfully integrates computer science and large datasets to solve problems. It requires an understanding of computing to build tools and devices that offer desired behavior [13]. Figure 2 depicts machine learning and deep learning as subsets of AI.
Medical personnel usually face new problems, changing tasks, and frequent interruptions as a result of the system's dynamism and scalability. This variability often makes disease recognition a secondary concern for medical experts. Moreover, the clinical interpretation of medical data is challenging from an epistemological point of view, not only for professionals with extensive experience but also for those, such as young physician assistants, with varied or little experience. The limited time available to medical personnel, the speedy progression of diseases, and constantly fluctuating patient dynamics make diagnosis a particularly complex process. Yet a precise method of diagnosis is critical to ensuring speedy treatment and thus patient safety [14].
1.1 Machine Learning
Machine learning (ML) is a subfield of AI that aims to develop predictive algorithms based on the idea that machines should have the capability to access data and learn on their own [15]. ML utilizes algorithms, methods, and processes to detect basic correlations within data and create descriptive and predictive tools that process those correlations. ML is usually associated with data mining, pattern recognition, and deep learning. Although there are no clear boundaries between these areas and they often overlap, it is generally accepted that deep learning is a relatively new subfield of ML that uses extensive computational algorithms and large amounts of data to define complex relationships within data. As shown in Figure 3, ML algorithms can be divided into three categories: supervised learning, unsupervised learning, and reinforcement learning [16].
1.1.1 Supervised Learning
Supervised learning is an ML model for investigating the input-output relationship of a system based on a given set of paired input-output training examples [17]. The model is trained with a labeled dataset, much as a student learns fundamental math from a teacher. This kind of learning requires labeled data with known correct answers against which the algorithm's output can be compared [18]. The most widely used supervised learning techniques include K-Nearest Neighbor, Naive Bayes, Support Vector Machines, Decision Trees, Random Forests, and Logistic Regression.
Linear Regression
Linear regression is a statistical method commonly used in predictive investigations. It forecasts the dependent (output) variable Y from the independent (input) variable X. Assuming continuous, real, numeric parameters, the relationship between X and Y is represented as shown in eq. 1:
Y = mX + c. (1)
where m indicates the slope and c the intercept. According to eq. 1, the association between the independent parameter X and the dependent parameter Y can be inferred [19].
The advantages of linear regression are that it is straightforward to learn and that overfitting is easy to eliminate through regularization. A drawback is that it is not suitable for non-linear relationships, so it is not recommended for most practical applications, since it greatly oversimplifies real-world problems [20]. Implementation tools for linear regression include Python, R, MATLAB, and Excel.
As shown in Figure 4, the observations (red) deviate randomly (green) from the underlying relationship (blue) between the independent variable x and the dependent variable y [21].
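As a concrete illustration, the following minimal sketch fits eq. 1 with Scikit-Learn, one of the implementation tools listed above. The built-in diabetes dataset stands in for clinical data; the split ratio and variable names are illustrative assumptions, not prescriptions.

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)   # 10 clinical features, continuous target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()              # fits Y = mX + c by least squares
model.fit(X_train, y_train)

print("slopes (m):", model.coef_)       # one slope per input feature
print("intercept (c):", model.intercept_)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))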
Logistic Regression
Logistic regression, also known as the logistic model, investigates the relationship between several independent variables and a categorical dependent variable, and calculates the probability of an event by fitting the data to a logistic curve [22]. The dependent variable must be binary, i.e., have only two outcomes: true or false, 0 or 1, yes or no. Logistic regression is used to predict categorical variables and solve classification problems. It can be implemented using various tools such as R, Python, Java, and MATLAB [19]. Among its benefits, logistic regression captures the relationship between the dependent and independent variables well and is simple to understand. On the other hand, it can only predict discrete outcomes, is not suitable for non-linear data, and is sensitive to outliers [23].
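The following minimal sketch, assuming Scikit-Learn and its built-in breast-cancer dataset (a binary outcome, as the text requires), shows how logistic regression returns an event probability; the solver settings are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)   # y is binary: malignant/benign
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)      # fits the data to a logistic curve
clf.fit(X_train, y_train)

# predict_proba yields the event probability described above
print("P(class=1) for first test case:", clf.predict_proba(X_test[:1])[0, 1])
print("accuracy:", clf.score(X_test, y_test))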
Decision Tree
The Decision Tree (DT) is among the most popular supervised learning methods used for classification. It splits on attribute values, ordered either ascending or descending [24]. As a tree-based strategy, DT defines each path from the root by a sequence of data separations until a Boolean conclusion is reached at a leaf node [25, 26]. DT is a hierarchical representation of knowledge interactions that contains nodes and links: when the tree is used to classify, nodes represent the attributes being tested and links their outcomes [27, 28]. An example of a DT is presented in Figure 5.
DTs have various drawbacks: complexity grows as the nomenclature grows, small modifications may lead to a different architecture, and training the data takes more processing time [19]. Implementation tools used for DT include Python (Scikit-Learn), R Studio, Orange, KNIME, and Weka [23].
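A minimal sketch with Python (Scikit-Learn), one of the tools listed above; the depth limit is an illustrative choice to keep the printed tree small, and export_text shows each root-to-leaf separating sequence.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Each printed path is a data-separating sequence ending at a leaf decision
print(export_text(tree, feature_names=list(data.feature_names)))
print("accuracy:", tree.score(X_test, y_test))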
Random Forest
Random Forest (RF) is a simple yet widely used algorithm that produces correct results most of the time. It can be used for both classification and regression. The algorithm produces an ensemble of DTs and blends them [29].
In the RF classifier, the more trees in the forest, the more accurate the results. RF generates a collection of DTs, called the forest, and combines them to achieve more accurate prediction results. Each DT in the forest is built on only a part of the given dataset, and the RF brings the trees together to reach the optimal decision [19].
As indicated in Figure 6, RF randomly selects subsets of features from the data and from each subset generates n random trees [21]. RF combines the results from all DTs and provides them in the final output.
Two parameters are used for tuning RF models: mtry, the number of randomly selected features considered at each split, and ntree, the number of trees in the model. The mtry parameter presents a trade-off: large values increase the correlation between trees but improve per-tree accuracy [30].
The RF works with a labeled dataset to do predictions and build a model. The final model is utilized to classify unlabeled data. The model integrates the concept of bagging with a random selection of traits to build variance-controlled DTs [31].
RF offers significant benefits. First, it can be used to determine the relevance of variables in regression and classification tasks [32, 33]. This relevance is measured on a scale based on the impurity drop at each node used for data segmentation [34]. Second, it automatically handles missing values in the data and resolves the overfitting problem of DTs. Finally, RF can efficiently handle huge datasets. On the other side, RF has drawbacks: it needs more computation and resources to generate the output, and it requires more training effort because of the multiple DTs involved. Implementation tools used for RF include Python Scikit-Learn and R [19].
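A minimal Scikit-Learn sketch: n_estimators plays the role of ntree and max_features the role of mtry from the tuning discussion above; the values shown are illustrative, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200,    # ntree: trees in the forest
                            max_features="sqrt", # mtry: features tried per split
                            random_state=0)
rf.fit(X_train, y_train)

# Impurity-based variable relevance, as described above
print("most relevant feature index:", rf.feature_importances_.argmax())
print("accuracy:", rf.score(X_test, y_test))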
Support Vector Machine
The most popular supervised ML algorithm for classification problems and regression models is the Support Vector Machine (SVM). SVM is a linear model that offers solutions to both linear and non-linear problems, as shown in Figure 7. Its foundation is the idea of margin calculation: the dataset is divided into several groups so that relations between them can be established [19].
SVM is a statistical learning method that follows the principle of structural risk minimization and aims to locate decision boundaries, also known as hyperplanes, that optimally separate classes by finding a hyperplane in an N-dimensional space that explicitly classifies data points [35, 36, 37]. SVM defines the decision boundary between two classes by the value of each data point, in particular the support vector points placed on the boundary between the respective classes [38].
SVM has several advantages: it works well with both semi-structured and unstructured data; the kernel trick is a strong point, so it can handle complex problems given the right kernel function; it copes with high-dimensional data; and its generalization carries less risk of overfitting. On the other hand, SVM has downsides: training time grows on large datasets, choosing the right kernel function is difficult, and it does not work well with noisy data. Implementation tools for SVM include SVMlight with C, LibSVM with Python, MATLAB or Ruby, SAS, Kernlab, Scikit-Learn, and Weka [23].
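A minimal sketch using Scikit-Learn, one of the listed tools. The RBF kernel illustrates the kernel trick for non-linear boundaries; the scaling step and kernel choice are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for margin-based methods; the RBF kernel lets the
# maximum-margin hyperplane separate classes that are not linearly separable.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_train, y_train)
print("accuracy:", svm.score(X_test, y_test))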
K-Nearest Neighbor
K-Nearest Neighbor (KNN) is an "instance-based" or non-generalizing learner, often known as a "lazy learning" algorithm [39]. KNN is used for solving classification problems. To predict the target label of new test data, KNN computes the distances from the test point to the labeled training points and, given a value K, finds the K nearest training points, as shown in Figure 8. The label of the new test point is then determined by a majority vote among those K nearest neighbors [23].
KNN has many benefits: it is sufficiently powerful when the training set is large, it is simple and flexible with respect to attributes and distance functions, and it can handle multi-class datasets. Its drawbacks include the difficulty of choosing an appropriate K value, the tedium of choosing a distance function suited to a particular dataset, and a fairly high computation cost due to measuring the distance to all training data points [31]. Implementation tools used for KNN include Python (Scikit-Learn), WEKA, R, KNIME, and Orange [23].
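A minimal Scikit-Learn sketch; K=5 and the Euclidean distance are common illustrative choices, not values taken from the text.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)   # "lazy" learning: fit essentially stores the data

# Each prediction computes distances to all stored training points and takes
# a majority vote among the K nearest labels.
print("accuracy:", knn.score(X_test, y_test))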
Naïve Bayes
Naive Bayes (NB) builds on the probabilistic model of Bayes' theorem and is simple to set up, since no complex iterative parameter estimation is required, making it suitable for huge datasets [40]. NB determines the degree of class membership based on a given class designation [41]. It scans the data once, so classification is easy [42]. Simply put, the NB classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is mainly used for text classification [43].
NB has notable benefits: it is easy to implement, can provide good results even with little training data, can manage both continuous and discrete data, is well suited to multiclass prediction problems, and is unaffected by irrelevant features. On the other hand, NB has drawbacks: it assumes that all features are independent, which is rarely true in real-world problems; it suffers from the zero-frequency problem; and its predictions are not always accurate. Implementation tools include WEKA, Python, R Studio, and Mahout [19].
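A minimal sketch with Gaussian NB for continuous features (Scikit-Learn also provides MultinomialNB, the variant commonly used for the text-classification setting mentioned above).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()           # assumes features are conditionally independent
nb.fit(X_train, y_train)    # a single pass over the data suffices
print("accuracy:", nb.score(X_test, y_test))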
1.1.2 Unsupervised Learning
Unlike supervised learning, unsupervised learning has no correct answers and no teacher [43]. It follows the idea that a machine can learn to understand complex processes and patterns on its own, without external guidance. This approach is particularly useful when experts do not know what to look for in the data and the data itself does not include objectives. The machine predicts the outcome based on past experiences and learns to predict the real-valued outcome from the information previously provided, as shown in Figure 9.
Unsupervised learning is widely used in the processing of multimedia content, as clustering and partitioning of data in the lack of class labels is often a requirement [44]. Some of the most popular unsupervised learning-based approaches are k-means, Principal Component Analysis (PCA), and Apriori Algorithm.
k-means
The k-means algorithm is the most common partitioning method [45] and one of the most popular unsupervised learning algorithms, addressing the well-known clustering problem. The procedure classifies a given dataset into a certain number of preselected clusters (assume k clusters) [46]. The pseudocode of the k-means algorithm is shown in Pseudocode 1.
Pseudocode 1: k-means pseudocode

1. Place K points in the space represented by the items being clustered; these points are the initial cluster centroids.
2. Assign each item to the group whose centroid is nearest.
3. When all items have been assigned, recalculate the coordinates of the K centroids.
4. Repeat steps 2 and 3 until the centroids stop moving.
K-means has several benefits: it is more computationally efficient than hierarchical clustering when there are many variables; with small k, it produces tighter clusters than hierarchical clustering; and the implementation and the interpretation of the clustering results are straightforward. However, k-means also has disadvantages: the value of K is difficult to predict, and performance suffers because different starting partitions lead to different final clusters. Since the algorithm only reaches a local optimum and there is no single solution for a given K, it must be run multiple times (20-100 times) and the result with the minimum objective J selected [20].
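The steps of Pseudocode 1 map directly onto Scikit-Learn's KMeans, sketched below on synthetic data; n_init=20 reruns the algorithm from different starting centroids and keeps the solution with the lowest objective J (inertia), as recommended above.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # unlabeled points

km = KMeans(n_clusters=3, n_init=20, random_state=0)
labels = km.fit_predict(X)   # steps 2-4: assign, recalculate, repeat

print("final centroids:\n", km.cluster_centers_)
print("objective J (within-cluster sum of squares):", km.inertia_)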
Principal Component Analysis
In modern data analysis, Principal component analysis (PCA) is an essential tool as it provides a guide for extracting the most important information from a dataset, compressing the data size by keeping only those important features without losing much information, and simplifying the description of a data set [47, 48].
PCA is frequently used to reduce data dimensions before applying classification models. Moreover, unsupervised methods such as dimensionality reduction and clustering algorithms are commonly used for data visualization, detection of common trends or behaviors, and decreasing the data quantity, to name just a few applications [49].
PCA converts 2D data into 1D data by transforming the set of variables into new variables, known as principal components (PCs), which are orthogonal [24]. In PCA, data dimensions are reduced to make calculations faster and easier. To illustrate how PCA works, consider 2D data: plotted on a graph, it takes two axes; applying PCA turns the data into 1D. This process is illustrated in Figure 10 [50].
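A minimal sketch of the 2D-to-1D example above, assuming synthetic correlated data; the first principal component captures most of the variance, so one orthogonal axis suffices.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])   # 2D, correlated

pca = PCA(n_components=1)    # keep only the first principal component
X_1d = pca.fit_transform(X)  # orthogonal projection onto that component

print("variance explained by PC1:", pca.explained_variance_ratio_[0])
print("shape before/after:", X.shape, X_1d.shape)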
Apriori
The Apriori algorithm is an important algorithm first introduced by R. Agrawal and R. Srikant and published in [51, 52].
The principle of the Apriori algorithm is its candidate generation strategy: it creates candidate (k+1)-itemsets from frequent k-itemsets. Apriori uses an iterative, level-wise search in which frequent k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is produced by scanning the dataset to count each item and keeping those that meet minimum support; the resulting set is called L1. Then L1 is used to find L2, the set of frequent 2-itemsets, L2 is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires a full scan of the dataset. To improve the efficiency of the level-wise generation of frequent itemsets, a key property called the Apriori property is used to reduce the search space: all non-empty subsets of a frequent itemset must also be frequent. A two-step technique, consisting of join and prune actions, is used to identify the frequent itemsets [53].
Despite its simplicity, the Apriori algorithm suffers from several drawbacks. The main limitation is the time wasted handling a large number of candidate sets with many redundant itemsets. It also performs poorly at low minimum support or with large itemsets, and multiple passes over the data are needed, which often yields irrelevant items and makes it difficult to discover individual elements of events [54, 55].
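The level-wise search with join and prune steps can be sketched in a few lines of pure Python; the toy transactions and minimum support below are illustrative assumptions.

from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "diapers", "beer"},
                {"bread", "diapers"}, {"milk", "bread", "diapers"}]
min_support = 2          # minimum number of transactions containing an itemset

def support(itemset):
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets, found by one full scan
items = {i for t in transactions for i in t}
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 1
while Lk:
    print(f"L{k}:", [set(s) for s in Lk])
    # Join: build candidate (k+1)-itemsets from pairs of frequent k-itemsets
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    # Prune: by the Apriori property, every k-subset must itself be frequent
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k))}
    Lk = [c for c in candidates if support(c) >= min_support]   # one scan per level
    k += 1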
1.1.3 Reinforcement Learning
Reinforcement learning (RL) differs from both supervised and unsupervised learning. It is a goal-oriented learning approach closely tied to an agent (controller) that takes responsibility for the learning process in order to achieve a goal. The agent chooses actions, and in response the environment changes its state and returns rewards, which are positive or negative numerical values. The agent's goal is to maximize the rewards accumulated over time. A task is a complete specification of an environment that identifies how rewards are generated [56]. Some of the most popular reinforcement learning algorithms are Q-Learning and Monte Carlo Tree Search (MCTS).
Q-Learning
Q-Learning is a type of model-free RL and can be considered an asynchronous dynamic programming approach. It enables agents to learn how to act optimally in Markovian domains by experiencing the effects of actions, without needing to build maps of the domains [57]. It offers an incremental form of dynamic programming with low computing requirements, working through successive improvements of the assessment of the quality of individual actions in particular states [58].
Q-learning is heavily employed in information theory, and related investigations are ongoing. Recently, Q-learning combined with information theory has been applied in different disciplines such as Natural Language Processing (NLP), pattern recognition, anomaly detection, and image classification [61, 62, 63, 64]. Moreover, a framework has been created to provide satisfying responses to user utterances using RL in a voice interaction system [65]. Furthermore, a high-resolution deep learning-based prediction system for local rainfall has been constructed [66].
The advantage of Q-learning is that the reward value can be identified effectively in a multi-agent environment, as the agents in ant Q-learning interact with one another. The problem with Q-learning is that its output can get stuck in a local minimum when agents simply take the shortest path [67].
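A minimal sketch of tabular Q-learning on a toy one-dimensional corridor (states 0-4, action 0 moves left, action 1 moves right, reward 1 only at the rightmost state). The environment, learning rate, and discount factor are illustrative assumptions, not part of the cited work.

import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:            # episode ends at the goal state
        # epsilon-greedy: explore occasionally, otherwise act greedily
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: bootstrap from the best action in the next state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print("learned Q-table:\n", Q.round(2))  # 'move right' comes to dominate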
Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) is an effective technique for solving sequential selection problems. Its strategy is based on a smart tree search that balances exploration and exploitation. MCTS performs random sampling in the form of simulations and keeps statistics of actions to make better-informed choices in each subsequent iteration. It is a decision-making algorithm employed for searching tree-like, huge, complex spaces. In such trees, each node represents a state, also referred to as a problem configuration, while edges represent transitions from one state to another [68].
The MCTS is related directly to cases that can be represented by a Markov decision process (MDP), which is a type of discrete-time random control process. Some modifications of the MCTS make it possible to apply it to Partially Observable Markov Decision Processes (POMDP) [69]. Recently, MCTS coupled with deep RL became the base of AlphaGo developed by Google DeepMind and documented in [70]. The basic MCTS method is conceptually simple, as shown in Figure 11.
The tree is constructed progressively and unevenly. In each iteration of the method, the tree policy is used to find the most critical node of the current tree, seeking to strike a balance between exploration and exploitation. A simulation is then run from the selected node, and the search tree is updated according to the result. This comprises adding a child node matching the action taken from the selected node and updating the statistics of its ancestors. During the simulation, moves are made according to some default policy, which in the simplest case is to make uniform random moves. A major benefit of MCTS is that there is no need to evaluate the values of intermediate states, which significantly reduces the amount of domain knowledge required [72].
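The four phases just described (selection via the tree policy, expansion, simulation via the default policy, and backpropagation) are sketched below for the toy game of Nim (take 1-3 stones; whoever takes the last stone wins). The game, exploration constant, and uniform-random default policy are illustrative assumptions.

import math, random

class Node:
    # `wins` is counted from the perspective of the player who made the
    # move leading into this node (standard UCT bookkeeping).
    def __init__(self, stones, parent=None):
        self.stones, self.parent = stones, parent
        self.children = {}                                # move -> child Node
        self.untried = [m for m in (1, 2, 3) if m <= stones]
        self.visits, self.wins = 0, 0.0

def uct_child(node, c=1.4):
    # Tree policy: balance exploitation (win rate) and exploration
    return max(node.children.values(),
               key=lambda ch: ch.wins / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(stones):
    # Default policy: uniform random moves; True if the player to move
    # from `stones` eventually takes the last stone.
    my_turn = True
    while stones:
        stones -= random.choice([m for m in (1, 2, 3) if m <= stones])
        if stones == 0:
            return my_turn
        my_turn = not my_turn
    return False   # nothing left to take: the player to move has lost

def mcts(root_stones, iterations=2000):
    root = Node(root_stones)
    for _ in range(iterations):
        node = root
        # 1) Selection: descend while fully expanded and non-terminal
        while not node.untried and node.children:
            node = uct_child(node)
        # 2) Expansion: add one child for an untried move
        if node.untried:
            move = node.untried.pop()
            node.children[move] = Node(node.stones - move, parent=node)
            node = node.children[move]
        # 3) Simulation from the new node's state
        mover_wins = not rollout(node.stones)
        # 4) Backpropagation: update statistics, flipping perspective per level
        while node:
            node.visits += 1
            node.wins += 1.0 if mover_wins else 0.0
            mover_wins = not mover_wins
            node = node.parent
    return max(root.children, key=lambda m: root.children[m].visits)

print("suggested first move from 10 stones:", mcts(10))   # optimal play takes 2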
1.2 Deep Learning
Over the past decades, ML has had a significant impact on our daily lives, with examples including efficient computer vision, web search, and optical character recognition. ML approaches have also advanced human-level AI [73, 74, 75]. However, when it comes to human information-processing mechanisms such as sound and vision, the performance of traditional ML algorithms is far from satisfactory. The idea of Deep Learning (DL) was formed in the late 20th century, inspired by the deep hierarchical structures of human speech perception and production systems. The DL breakthrough came in 2006, when Hinton introduced a deep structured learning architecture called the Deep Belief Network (DBN) [76].
With an increased amount of data, the performance of DL classifiers improves extensively compared with classical learning methods. Figure 12 compares the performance of classic ML algorithms and DL methods [77]. The performance of typical ML algorithms plateaus once they reach a training data threshold, whereas DL continues to improve as the amount of data increases [78].
DL (deep ML, or deep structured learning) is a subset of ML that involves a collection of algorithms attempting to represent high-level abstractions in data through models composed of numerous non-linear transformations. The most important characteristic of DL is the depth of the network. Another essential aspect is the ability to replace handcrafted features with features generated by efficient algorithms for unsupervised or semi-supervised feature learning and hierarchical feature extraction [79].
DL has significantly advanced the latest technologies in a variety of applications, including machine translation, speech, and visual object recognition, NLP, and text automation, through the use of multi-layer Artificial Neural Networks (ANNs) [16].
Different DL architectures developed over the past two decades offer enormous potential for employment in various sectors such as automatic voice recognition, computer vision, NLP, and bioinformatics. This section discusses the most common DL architectures: Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Recurrent Convolutional Neural Networks (RCNNs) [80].
Convolutional Neural Network
CNNs are special types of neural networks inspired by the human visual cortex and used in computer vision. A CNN is a feed-forward neural network in which information moves exclusively in the forward direction [81]. CNNs are frequently applied to face recognition, human organ localization, text analysis, and biological image recognition [82].
Since CNNs were first created in 1989, they have performed well in disease diagnosis over the past three decades [83]. Figure 13 depicts the general architecture of a CNN, composed of feature extractors and a classifier. In the feature extraction layers, each layer of the network accepts the output of the previous layer as input and passes its output on to the next layer. A typical CNN architecture consists of three types of layers: convolution, pooling, and classification. The network's low and middle levels alternate between convolutional and pooling layers: even-numbered layers perform convolutions, while odd-numbered layers perform pooling operations. The output nodes of the convolution and pooling layers are arranged in a two-dimensional plane called a feature map. Each layer level is typically generated by combining one or more previous layers [84].
CNNs have many benefits: a structure resembling the human visual processing system, an architecture highly optimized for processing 2D and 3D images, and effectiveness in learning and extracting abstractions from 2D data. The max-pooling layer in a CNN is effective at absorbing shape variations. Furthermore, CNNs are built from sparse connections with tied weights and contain far fewer parameters than a fully connected network of similar size. They are trained with a gradient-based learning algorithm and suffer less from the vanishing gradient problem, because the gradient-based approach trains the whole network to directly minimize the error criterion, yielding highly optimized weights [84].
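The convolution, pooling, and classification layer pattern described above can be sketched with the Keras API; the layer sizes and the 28x28 single-channel input shape are illustrative assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),           # 2D image input
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # convolution: extract features
    tf.keras.layers.MaxPooling2D(),                     # pooling: downsample feature maps
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),    # classification layer
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # gradient-based training
              metrics=["accuracy"])
model.summary()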
Long Short-Term Memory
LSTM is a special type of Recurrent Neural Network (RNN) with internal memory and multiplicative gates. Since the original LSTM was introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, a variety of LSTM cell configurations have been described [92].
LSTM has contributed to the development of well-known software such as Alexa, Siri, Cortana, Google Translate, and Google voice assistant [93]. LSTM is an implementation of RNN with a special connection between nodes. The special components within the LSTM unit include the input, output, and forget gates. Figure 14 depicts a single LSTM cell.
LSTM is an RNN module that handles the vanishing gradient problem; in general, RNNs use LSTM to eliminate propagation errors, which allows them to learn over multiple time steps. LSTM is characterized by cells that hold information outside the recurrent network. The cell is similar to computer memory: it decides when data should be stored, written, read, or erased via the LSTM gates [94]. Many network architectures use LSTM, such as bidirectional LSTM, hierarchical and attention-based LSTM, convolutional LSTM, autoencoder LSTM, grid LSTM, and cross-modal and associative LSTM [95].
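A minimal Keras sketch of a single LSTM layer on sequence data; the gates and cell state are handled inside the cell, and the sequence length, feature count, and layer sizes are illustrative assumptions.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(30, 8)),           # 30 time steps, 8 features each
    tf.keras.layers.LSTM(32),                       # input/forget/output gates + cell state
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()

# The bidirectional variant discussed below simply wraps the layer:
# tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))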
Bidirectional LSTM networks propagate the state vector in both the forward and backward directions, which means that dependencies are taken into account in both temporal directions. As a result of inverse state propagation, expected future correlations can be included in the network's current output [96]. Bidirectional LSTM networks encapsulate spatially and temporally scattered information and can tolerate incomplete inputs via a flexible cell-state vector propagation communication mechanism; based on the detected gaps in the data, this filtering mechanism reidentifies the connections between cells for each data sequence. Figure 15 depicts the architecture. A bidirectional network is used to process properties from multiple dimensions in a parallel, integrated architecture [95].
Hierarchical LSTM networks solve multidimensional problems by breaking them down into sub-problems and organizing them in a hierarchical structure, with the advantage of being able to focus on one or several sub-problems. This is accomplished by adjusting the weights within the network to give certain sub-problems more attention [95]. A weighting-based attention mechanism that analyses and filters input sequences is also used in hierarchical LSTM networks for long-term dependency prediction [97].
Convolutional LSTM reduces and filters input data collected over a longer period using convolution operations applied either within LSTM networks or directly in the LSTM cell architecture. Owing to these characteristics, convolutional LSTM networks are useful for modelling quantities with spatially and temporally distributed relationships, and many quantities can be predicted jointly in terms of a reduced feature representation. Decoding or decoherence layers are then required to predict the different output quantities not as features but in terms of their parent units [95].
The LSTM autoencoder solves the problem of predicting high-dimensional parameters by shrinking and expanding the network [98]. The autoencoder architecture is trained separately with the aim of accurately reconstructing the input data, as reported in [99]. Only the encoder is used during testing and commissioning to extract the low-dimensional features that are passed to the LSTM. Using this strategy, the LSTM was extended to multimodal prediction. To compress the input data and cell states, the encoder and decoder are integrated directly into the LSTM cell architecture. This combined reduction improves the flow of information in the cell and results in an improved cell-state update mechanism for both short-term and long-term dependencies [95].
Grid Long Short-Term Memory (Grid LSTM) is a network of LSTM cells organized into a multidimensional grid that can be applied to sequences, vectors, or higher-dimensional data such as images [100]. Grid LSTM has connections to, e.g., the spatial or temporal dimensions of input sequences; thus, connections of different dimensions within cells extend the normal flow of information. As a result, Grid LSTM is appropriate for the parallel prediction of several output quantities that may be independent, linear, or non-linear. The dimensions and structure of the network are determined by the nature of the input data and the goal of the prediction [101].
A novel method for the collaborative prediction of numerous quantities is the cross-modal and associative LSTM. It uses a number of standard LSTMs to separately model different quantities. To calculate the dependencies of the quantities, these LSTM streams communicate with one another via recursive connections. The chosen layers' outputs are added as new inputs to the layers before and after them in other streams. Consequently, a multimodal forecast can be made. The benefit of this approach is that the correlation vectors that are produced have the same dimensions as the input vectors. As a result, neither the parameter space nor the computation time increase [102].
Recurrent Convolutional Neural Network
CNNs are a key method for handling various computer vision challenges. In recent years, a new generation of CNNs has been developed: the Recurrent Convolutional Neural Network (RCNN), inspired by the abundant recurrent connections in the visual systems of animals. The Recurrent Convolutional Layer (RCL) is the main feature of RCNN; it integrates recurrent connections among neurons in the normal convolutional layer. As the number of recurrent computations increases, the receptive fields (RFs) of neurons in the RCL expand without bound, which is contrary to biological facts [103].
The RCNN prototype was proposed by Ming Liang and Xiaolin Hu [104, 105]; the structure is illustrated in Figure 16, in which both the feed-forward and recurrent connections have local connectivity and weights shared between distinct sites. This design is quite similar to the Recurrent Multi-Layer Perceptron (RMLP) concept often used for dynamic control [106, 107] (Figure 17, middle). As with the distinction between MLP and CNN, the primary difference is that full connections are replaced with shared local connections; for this reason, the proposed model is known as RCNN [108].
The main unit of RCNN is the RCL, which develops through discrete time steps. RCNN offers three basic advantages. First, it allows each unit to incorporate context from an arbitrarily large region in the current layer. Second, recurrent connections increase the depth of the network while keeping the number of adjustable parameters constant through weight sharing; this is consistent with the trend in modern CNN architectures toward greater depth with relatively few parameters. Third, unfolded in time, an RCNN is a CNN with multiple paths between the input layer and the output layer, which facilitates learning: longer paths make it possible to learn very complex features, while shorter paths improve gradient backpropagation during training [103].
The primary goals of this work are to present a comprehensive overview of the key machine learning and deep learning techniques employed in healthcare prediction, and to identify the obstacles that machine learning and deep learning face in healthcare prediction.
The rest of this paper is structured as follows:
• Section 2 presents the survey methodology.
• Section 3 gives a literature survey of the machine learning and deep learning techniques used in healthcare prediction.
• Section 4 summarizes the advantages and limitations of the techniques discussed in Section 3.
• Finally, Section 6 outlines the conclusions.