A scalable and real-time system for disease prediction using big data processing

The growing number of patients with chronic diseases and the centralization of medical resources cause significant economic impact, resulting in hospital visits, hospital readmissions and other healthcare costs. This paper proposes a scalable and real-time system for disease prediction from medical data streams, built by integrating Twitter, Apache Kafka, Apache Spark and Apache Cassandra. Twitter users tweet attributes related to their health; Kafka streaming receives all desired tweet attributes and ingests them into Spark streaming, where a machine learning algorithm is applied to predict the health status and send back a response message through Kafka. The heart disease dataset obtained from the UCI repository was used for the experiments. In order to enhance prediction accuracy, the Relief algorithm is used for feature selection. We compared six relevant machine learning algorithms implemented in Spark MLlib, namely Random Forest (RF), Naive Bayes, Support Vector Machine, Multilayer Perceptron, Decision Tree and Logistic Regression, with the full feature set as well as with the selected features. The highest classification accuracy of 92.05% was obtained using RF with the selected features. The scalability of RF using Spark MLlib and the WEKA framework was measured for both the training and application stages. The results show significantly better performance of Spark in terms of scalability and computing time.


Introduction
Soft computing techniques are widely used for disease classification in the medical field, especially data mining, which is the computational process of discovering patterns and useful knowledge in databases [3,22]. Indeed, digital data is becoming increasingly important in many domains like healthcare, technology and society. The amount of data generated in real time is growing rapidly, which raises a number of problems, the main one being the processing and prediction of streaming data events arriving at a rapid rate. Solving these problems with traditional technologies requires substantial hardware resources and is time-consuming for the analysis, especially for machine learning. To deal with these challenges, powerful distributed computing platforms are widely used.
The importance and effectiveness of big data tools in the healthcare field are explained in [38]. The authors demonstrate that the effective way to reduce healthcare delivery costs and achieve good healthcare outcomes is to integrate big data tools with data mining, big data analysis and medical informatics, while effective systems support is still lacking for already established machine learning use cases in the big data context [16].
Hadoop MapReduce [17] constitutes the first-generation processing engine for big data and a powerful solution for one-pass computations; it requires one Map phase and one Reduce phase for any data processing. The main drawback of Hadoop MapReduce is that it supports neither real-time stream processing nor in-memory computation. Also, it is not always easy to implement the MapReduce paradigm for all use cases.
Apache Spark [9] is a second-generation processing framework designed for fast computation, real-time analytics and data processing workloads; it is easy to use and optimized to run in memory. With its in-memory computation, its performance can be several times faster than other big data frameworks, especially on problems involving iterative machine learning [27]. Spark provides a distributed machine learning framework called MLlib, a library that implements a set of commonly used machine learning and statistical algorithms. Spark streaming is built on top of the core Spark API and focuses on processing data streams in real time from various sources such as Twitter and Kafka. Data analytics has revolutionized healthcare by transforming data into valuable information in order to predict epidemics, avoid preventable deaths and improve quality of life [14]. But the growth of data volume, complexity and speed drives the need for scalable big data analytic algorithms and systems. Due to the fast growth of social networks and their role in the daily life of millions of people around the world, social media and mobile applications have opened up new pathways for healthcare delivery [1].
Based on the challenges facing healthcare systems, we have proposed and developed a healthcare solution with a real-time health status prediction use case. This solution is based on Twitter streaming, Kafka streaming, Spark streaming, Spark MLlib and NoSQL Cassandra, with Twitter for real-time data transfer. The system first preprocesses the available healthcare data and analyzes it to create an offline learning model; the model is then deployed on the system and used in real time to predict health status. To balance the incoming load, multiple streams of user data related to health are tweeted from Twitter in a predefined format, then ingested and filtered by Kafka streaming. The health attributes are extracted and processed by Spark streaming, where the machine learning model is applied. The health status result is sent back to the user and then stored in NoSQL-based distributed storage for data visualization and analytics. Efficient processing of data in healthcare increases the quality of patient monitoring.
The rest of this paper is organized as follows. Section 2 gives the related works. The design and architecture of our system are explained in Section 3. Section 4 gives the implementation of each module, where experimental results are presented and discussed. Finally, Section 5 concludes the paper.

Related works
In the past few years, much research has centered on the use of machine learning for predicting outcomes, especially in the healthcare field. Machine learning combined with data mining tools has shown its power in predicting diseases, extracting patterns and making decisions [32]. But in the big data context, machine learning has been handled in only a few works.
In [45], a predictive model related to the risk of diabetes is built using a scalable RF classification algorithm. A Hadoop-based intelligent care system is proposed in [46] that illustrates IoT-based big data contextual sharing across all devices in a health system. Using Hadoop, a novel method based on the k-Nearest Neighbors algorithm (KNN) to efficiently detect outliers in large-scale healthcare data has been proposed in [56]. This method outperformed KNN and the Local Outlier Factor (LOF) in terms of accuracy and processing efficiency. The authors in [59] proposed an automated method that is able to detect abnormal patterns in the entering and exiting behaviors of the elderly living alone, collected from simple sensors equipped in a home-based setting. A convolutional neural network based multimodal disease risk prediction algorithm for disease prediction by machine learning over big data from healthcare communities is presented in [15]. In [12], a hybrid fuzzy-based decision tree algorithm for early detection of heart disease using a continuous and remote patient monitoring system was proposed. The proposed system uses parameters related to heart disease obtained from a wearable sensor attached to the human body. An alert message is sent to the respective physician and the caretaker when the obtained value exceeds the threshold value.
Using Hadoop technology, the authors in [4] built an application platform to load and visualize the data collected from raw sensors, a wireless network of wearable computing devices for monitoring the condition of a human body. A new scalable IoT-based architecture has been proposed in [35]. This approach is designed to handle big data from sensors and identify the most significant parameters of heart disease. There are three major components in this architecture: the first level consists of data collection from medical devices; the second level stores the huge amount of data in cloud computing using HBase. Linear regression is used as the prediction model for heart disease, using Apache Mahout based machine learning libraries. A new framework for a lambda-based cardiovascular disease prediction system is proposed in [24]. The framework uses big data technology, notably Hadoop MapReduce, to solve the problems associated with the real-time analysis of big data. This system can be used to assist, predict and diagnose diseases such as cardiovascular disease. Due to growing digitalization, it is necessary to move from paper-based medical records to digital ones; managing the large volume of health data for analytical purposes and using it for effective treatment will be a crucial issue. To deal with this situation, an approach based on the Hadoop MapReduce framework was proposed by [48]. This method uses a big data predictive analysis algorithm to predict the complications of diabetes mellitus and the type of treatment to adopt. Using the Spark framework, the authors in [53] proposed a Naive Bayes approach for constructing a classifier based on a big data predictive analytics model to predict the future health condition from heart disease data taken from the UCI machine learning repository.
In [40], a new architecture has been proposed that can support the implementation, storage and processing of scalable sensor data for healthcare applications. The proposed architecture is divided into two sub-architectures: Meta Fog-Redirection (MF-R) and Grouping and Choosing Architecture (GC). The first uses Apache Pig and Apache HBase to collect and store the big data generated by different devices. The second architecture is used for securely integrating fog computing with cloud computing. A MapReduce-based prediction model is used to predict heart disease.
On the other hand, stream computing over big data has been handled in only a few works. In [18], a real-time health status prediction system is proposed; this work focuses on applying machine learning, especially decision trees, on data streams received from socket streams, with a breast cancer use case, using the Spark streaming framework. An overview of big data architectures and machine learning algorithms for processing big data in healthcare and other applications is given in [39]. Furthermore, a generic architecture for healthcare analytics has been proposed in [50]. A real-time heart disease prediction system based on a big data framework is proposed in [19]. The proposed work is based on Apache Spark, which stands as a strong large-scale distributed computing platform on which machine learning can successfully be applied to streaming data events through in-memory computations.
The combination of streaming big data and machine learning is a revolutionary technology that can have a significant impact in the field of healthcare, especially for the real-time detection of heart diseases. In [21], authors proposed a novel heart disease monitoring system based on a new classification approach that combines distributed machine learning and real-time predictive analytics in the Spark environment to predict heart disease. First, the traditional decision tree algorithm was transformed into a parallel, distributed, scalable and fast decision tree. Then, this model was applied to real-time data coming from distributed sources to predict heart disease in real time. The results as well as the data streams were stored in a distributed database for real-time reporting and monitoring. This system is limited in terms of user interaction, which is a necessity for prediction and monitoring systems. On the other hand, the model's classification accuracy is quite low; often, a single tree is not sufficient for producing effective results. In addition, decision trees are highly prone to being affected by outliers and often overfit the training data. Moreover, there is no comparison with other types of machine learning algorithms in order to achieve high accuracy. With the increase of data sources and users, balancing the incoming load streams in this system remains challenging, as Spark itself is not designed for data stream management. For this, integrating another framework designed specifically for data stream management is more efficient for real-time data processing.
There are studies showing the power of social media, in particular Twitter data, in monitoring and transferring data, such as a real-time flu and cancer surveillance system mining Twitter [37], finding patterns related to health events [58], and an earthquake reporting system using Twitter as a social sensor for detecting an event in real time. In [54], a workflow for data ingestion and data management of Twitter streaming data is developed, where space-time activities are retrieved from geotagged tweets and stored in a single MongoDB cluster. Finding trending topics is discussed in [23]. In [51], a system is developed for collecting data which uses Twitter to transfer data to a clinician to provide follow-up for cardiovascular patients. This work involves healthcare professionals who analyze the data and send appropriate messages. The author in [49] offers the design of a healthcare system for monitoring patients in real time. In this system, different parameters influencing heart disease are captured from the patient through sensors and then sent to the patient's Android mobile phone application. The RF algorithm is used to predict heart disease. This system also assists the physician in visualizing the records of other patients with the same medical report as a selected patient, using the KNN classification algorithm. In [20], authors proposed a new and general architecture for a real-time health status prediction and analytics system using big data technologies. The system focuses on applying a distributed machine learning model on streaming health data events ingested into Spark streaming through Kafka topics.
Spark is faster than Hadoop and has better performance, especially on problems involving iterative machine learning [26]. Furthermore, the authors in [34] found that the proposed methods in the literature were limited to batch processing, while big data stream computing has not been widely adopted and remains open to future research. They concluded their work with a recommendation to direct future research towards big data stream processing: research efforts should focus on the development of scalable frameworks and algorithms adapted to real-time data analysis. On the other hand, there is a gap in the capability to perform more complicated analytical tasks in real time, such as managing and analyzing event streams, machine learning algorithms, data storage, data visualization and transforming healthcare data into valuable information.
Reviewing related work in this field showed that healthcare analytics solutions involving big data are mainly focused on Hadoop. Hadoop can process large volumes of diverse data sources in batch-oriented computing, which is not sufficient for analyzing real-time application scenarios; it is limited for real-time computing [42]. On the other hand, these works consider specific healthcare data sources or focus only on batch computing. However, healthcare data sources continuously generate huge data at a high rate. In addition, they either consider powerful tools for data analysis such as machine learning and data mining, or focus only on data storage and visualization. Therefore, real-time healthcare analytics, which includes continuous data collection, real-time processing and powerful tools for distributed machine learning, distributed data storage and real-time analytics, is necessary to build an effective system for handling distributed healthcare data streams. Other studies in the field of heart disease prediction have focused only on predicting heart disease with traditional machine learning algorithms using the full feature set. These studies cannot predict heart disease from real-time streaming data. In addition, data from social media platforms is not considered as a source for solving disease prediction problems. Still other studies use a single algorithm to build the model, without trying any other type of machine learning algorithm in order to achieve high accuracy.

System architecture
The purpose of this study is to develop a real-time data processing and monitoring system combining Twitter streaming, Kafka streaming and Spark streaming. It consists of a multi-tier architecture. Tier-1 focuses on collecting data from Twitter statuses. Tier-2 uses Spark MLlib to develop the Random Forest model for heart disease prediction. Tier-3 uses Apache Cassandra to store the huge volume of Twitter status data. Figure 1 shows the architecture of the proposed system. First, the user tweets the health attributes; they are sent to the Spark streaming application through Kafka streaming, where the real-time processing is performed. Spark streaming receives the health attributes from Kafka streams along with the Twitter user's name and applies the machine learning model to predict health status. After that, an appropriate message is sent back to the user based on the Twitter username. The results as well as the data streams are stored in a distributed database for historical data analysis and real-time monitoring. This system has some characteristics which distinguish it from other traditional data analytics approaches. The main idea here is that methods are needed to manage and analyze thousands of messages coming from different sources each second in a short amount of time. Also, the system should be independent of the imported data volume. We notice that most state-of-the-art approaches are useful for predicting health status, but real-time data stream management, data storage, user interaction and data visualization are not covered. With high populations, covering all patients with the available doctors is challenging [31]. On the other hand, one of the most important technological challenges of big data analytics is exploring ways to effectively obtain valuable information for different types of users and to discover knowledge in the big data generated by all users.

Data ingestion
Improvements in technology have increased the availability and use of smartphones, personal computers and many other devices. Due to the rapid growth of social networks and their role in the daily life of millions of people around the world, the number of users and the amount of data generated on these media are increasing exponentially. Twitter is one of the most popular social network sites; it is a micro-blogging site where people can share their opinions and send and receive ultra-short messages of up to 140 characters called tweets [47]. Twitter has become an inseparable part of human life and the fastest way to get real-time information from around the world, and it provides free and public access to its stream of tweets. Therefore, it is considered a rich and inexpensive source of real-time data. As it is supported by Spark streaming and by smartphone environments where memory, bandwidth and display size are limited, Twitter can be used as an effective and free real-time communication channel. Therefore, it has been integrated into our system. Instead of Twitter, streaming big data could be generated from other data sources such as medical IoT technology and wearable sensor devices, or sent to and received from the system via SMS or a mobile application, but this requires additional infrastructure to establish the communication, resulting in an additional investment of time and money.
In this module, the first in our system, we capture tweet streams using a specific keyword related to heart disease, followed by the attributes in the same format as the feature vector in the testing set, separated by spaces:

#rtbigdhdsparkt 1 1 2 0 2.3 3 0 6

For authentication requests to the Twitter platform, a Twitter account and a Twitter application are required, with the access level set to read and write. We also need the application's Consumer Key and Consumer Secret, as well as the Access Token and Access Token Secret, to make streaming API requests on an account without sharing its password. The Twitter streams are then ingested into Spark through Kafka [8]. To balance the incoming load streams in our system, Kafka is used, which is well suited for routing real-time streaming data. In this module, real-time data is streamed from Twitter through a Kafka producer.
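As a minimal illustration of this module's parsing step, the sketch below extracts the feature vector from a tweet in the format shown above (plain Python with the hashtag keyword as an assumption; not the actual system code):

```python
def parse_health_tweet(text, keyword="#rtbigdhdsparkt"):
    """Extract the space-separated health attributes that follow the keyword,
    returning them as a numeric feature vector, or None for unrelated tweets."""
    if not text.startswith(keyword):
        return None  # not a health tweet; the Kafka filter would drop it
    return [float(v) for v in text[len(keyword):].split()]

vector = parse_health_tweet("#rtbigdhdsparkt 1 1 2 0 2.3 3 0 8")
# vector holds the 8 numeric attributes ready for the prediction model
```

In the deployed system this filtering happens on the Kafka side, so only tweets carrying the keyword reach Spark streaming.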

Distributed computing
MapReduce is a programming model introduced in 2004 by Google for large-scale processing across clusters [17]. It is the core component of the Apache Hadoop framework, which enables the resilient and distributed processing of massive and unstructured data across clusters where each node has its own storage space. Internally, the framework offers two main features: it distributes the tasks to the individual nodes in the cluster (Map), then organizes and reduces the results provided by each node into a single consistent response to a query (Reduce). This is made possible by its distributed file system (HDFS).
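The Map and Reduce phases can be sketched in plain Python; this illustrates only the programming model, not Hadoop's distributed implementation, with the classic word-count example:

```python
from functools import reduce
from itertools import groupby

def map_phase(records, mapper):
    """Map: each input record is turned into zero or more (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]

def reduce_phase(pairs, reducer):
    """Reduce: pairs are grouped by key and each group's values are combined."""
    pairs = sorted(pairs, key=lambda kv: kv[0])  # stand-in for shuffle/sort
    return {key: reduce(reducer, (v for _, v in group))
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Classic word count: Map emits (word, 1), Reduce sums the counts per word.
counts = reduce_phase(
    map_phase(["a b a", "b a"], lambda line: [(w, 1) for w in line.split()]),
    lambda x, y: x + y)
```

In Hadoop, the sort-and-group step between the two phases is performed by the framework's distributed shuffle rather than an in-memory sort.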
Apache Spark adopts the MapReduce model and has several advantages over other big data and MapReduce technologies such as Hadoop and Storm. First of all, Spark offers a complete and unified framework to meet big data processing needs for various datasets, diverse in nature (text, graph, etc.) as well as in source type (batch or real-time stream). Spark uses the concept of Resilient Distributed Datasets (RDDs), immutable distributed collections of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. Spark allows applications on Hadoop clusters to be executed up to 100 times faster in memory and 10 times faster on disk [52]. It follows a master-worker architecture: for every Spark application, it creates one master process and multiple workers (Fig. 2). Spark streaming is a module that supports scalable, fault-tolerant processing of live data streams; it has higher latency but provides higher throughput than the most popular streaming frameworks such as Storm and Flink [43]. Spark streaming can ingest data from many sources such as Twitter, Apache Kafka and TCP sockets. The incoming data stream is grouped into batches with an interval of less than a second and processed by the batch-processing Spark engine; the batches can be processed with machine learning algorithms and high-level functions such as Map and Reduce. Finally, processed data may be pushed out to databases, file systems and live dashboards for visualization and historical data analysis. Spark streaming provides a high-level abstraction called discretized stream, or DStream [57], which represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDDs, where each RDD in the sequence is a micro-batch of input data.
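The micro-batch idea behind DStreams can be illustrated without Spark: timestamped records are grouped into fixed-length intervals, and each resulting batch is handed to an ordinary batch computation. A plain-Python sketch (illustrative only, not the Spark API):

```python
from collections import defaultdict

def discretize_stream(records, batch_interval=0.5):
    """Group (timestamp, value) records into micro-batches of fixed length,
    mimicking how a DStream is a sequence of RDDs, one per interval."""
    batches = defaultdict(list)
    for ts, value in records:
        batches[int(ts / batch_interval)].append(value)
    return [batches[k] for k in sorted(batches)]

# Each micro-batch is then processed by the batch engine, e.g. a per-batch sum.
stream = [(0.1, 2), (0.3, 3), (0.6, 5), (1.2, 7)]
batch_sums = [sum(batch) for batch in discretize_stream(stream)]
```

The latency/throughput trade-off mentioned above follows directly from this design: results are only produced once an interval closes, but each batch is processed with the full efficiency of the batch engine.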
In addition to other Spark API libraries, Spark provides another major library called Spark MLlib [41], which is a toolkit of distributed machine learning and data mining models that is under heavy development and already contains scalable, high-quality and efficient algorithms for many common machine learning tasks.
Spark SQL [11] is a popular and prominent feature of Apache Spark. It provides support for interacting with and manipulating data in Spark via SQL statements within a Spark program. It represents database tables as Spark RDDs and translates SQL queries into Spark operations to perform parallel queries.

Reasons to choose Spark
There are several reasons for choosing Spark, but three are key:
- Spark uses the concept of RDD, which allows us to keep data in memory and persist it as required. This greatly increases the performance of batch jobs, up to 10 to 100 times faster than traditional MapReduce.
- Spark also enables us to cache the data in memory, which is especially beneficial for iterative algorithms, in which the computation requires multiple passes over the data, such as those used in machine learning.
- In particular, MapReduce is ineffective for multi-pass applications that require real-time computing and low-latency processing with massively parallel processing architectures.

Use case dataset
In this study, the processed.cleveland.data file of the heart disease database [28] was used for training and testing the machine learning algorithm that predicts heart disease. It has been used in many machine learning research works. For each disease observation, we constructed a labelled dataset whose class label attribute takes two classes, presence or absence of heart disease. The class label attribute values were modified to just 0 and 1, where value 1 indicates presence of heart disease, replacing the original values 1, 2, 3 and 4, while value 0 indicates absence of heart disease, turning it into a binary class dataset. The remaining 13 features are described in Table 1 while Table 2
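The relabeling of the class attribute can be sketched as follows (the attribute name `num` is the one used in the UCI heart disease files; the mapping itself is as described above):

```python
def binarize_label(num):
    """Map the original class values 0-4 to a binary label:
    1, 2, 3 and 4 all indicate presence of heart disease and become 1;
    0 (absence) stays 0."""
    return 1 if num > 0 else 0

binary_labels = [binarize_label(v) for v in [0, 1, 2, 3, 4]]
```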

Parallel and scalable random forest based on Spark
The prediction of health status from Twitter streams requires building a classification model capable of classifying the attributes of each stream as presence or absence of heart disease. There are many methods for classification in the multivariate approach, including discriminant analysis, artificial neural networks, and regression models, especially logistic regression and fuzzy logistic regression [29,44]. Decision Tree (DT) [13] is one of the most important tools in data mining and can process even large sets of data. It is a powerful method for pattern categorization that is widely used in the fields of medicine, science and technology. DTs are relatively easy to understand and interpret, can handle categorical and numerical features, and do not require input data to be scaled or standardized. One important advantage of DT, which separates it from other algorithms, is its structural information, making it a reliable and simple variable selection tool for clinical practice. The DT algorithm is a top-down approach that begins at a root node and then, at each step, selects the feature that gives the best split of the dataset based on the information gain of the split, computed from the node impurity. For classification tasks, the best split can be selected using a measure such as the Gini or Entropy impurity, given by formulas (1) and (2) respectively:

Gini(s) = Σ_{i=1}^{c} f_i (1 − f_i)   (1)

Entropy(s) = − Σ_{j=1}^{c} p(s, j) log2 p(s, j)   (2)

where f_i is the frequency of label i at a node, c is the number of unique labels, and p(s, j) is the proportion of instances in s that are assigned to the j-th class [6]. RF [30] is an ensemble learning method and one of the most successful and powerful supervised machine learning algorithms, capable of performing both classification and regression tasks. As the name suggests, RF builds a forest from several DTs. Each tree casts a vote to classify a new element, and RF chooses the class that receives the most votes.
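Both impurity measures are simple to compute from the label frequencies at a node; a plain-Python sketch:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: sum over classes of f_i * (1 - f_i)."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def entropy(labels):
    """Entropy impurity: -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure node has impurity 0; a 50/50 binary node is maximally impure.
```

A split's quality is then the parent node's impurity minus the weighted impurity of its children; the split that maximizes this gain is chosen.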
Let S be a dataset of samples formalized as S = {(x_i, y_i) | i = 1, ..., N}, where x_i is a sample and y_i is its class label. Namely, the original training dataset contains N samples, and there are M feature variables in each sample. The steps involved in the construction of the RF algorithm are as follows.

Step 1. Create a bootstrapped dataset
In this step, k subsets are randomly selected with replacement from the original dataset S. Finally, k training subsets are constructed as a collection of training subsets.

Step 2. DT construction
In this step, each DT is created using a bootstrapped dataset, but only considers a random subset of variables (or columns) at each split, based on the split criterion discussed previously. The main process of the RF algorithm construction is presented in Fig. 3.
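Steps 1 and 2 and the final voting can be sketched in plain Python; the stub "trees" below are stand-ins for trained DTs (any callable mapping a feature vector to a class), not a full tree learner:

```python
import random
from collections import Counter

def bootstrap(dataset, rng):
    """Step 1: draw len(dataset) samples from the original data with replacement."""
    return [rng.choice(dataset) for _ in dataset]

def forest_predict(trees, x):
    """Each tree casts a vote for a class; the majority class is the RF output."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Illustrative stand-in trees: decision stumps thresholding single features.
stump_trees = [lambda x: int(x[0] > 2), lambda x: int(x[1] > 1), lambda x: 1]
```

Because each tree sees a different bootstrap sample and a different random feature subset, the trees' errors are decorrelated, which is why the ensemble vote is typically more accurate and more robust to outliers than a single DT.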
In order to improve the performance of the simple RF algorithm and reduce the data handling cost of large-scale data in a parallel and distributed environment, we propose a Spark Parallel Random Forest (SPRF) algorithm. The SPRF algorithm is optimized using a hybrid parallel approach that combines data-parallel and task-parallel optimization. From the point of view of data-parallel optimization, a vertical data partitioning method is applied. This method reduces the amount of data and the number of data transmission operations in the distributed environment without reducing the accuracy of the algorithm. From the point of view of task-parallel optimization, a parallel approach is applied in the learning process of the SPRF algorithm, and a Directed Acyclic Graph (DAG) is built over the data based on the dependencies of the RDD objects. Then, different task schedulers are invoked to execute the tasks of the DAG. The parallel training approach maximizes the parallelization of SPRF and improves its performance, and the task schedulers further minimize the cost of data communication within the Spark cluster and achieve a better workload balance and fast computation. Hence, an adequate and parallel model for predicting health status in the big data context using Spark is needed, and a RF model adaptation is therefore important. In this work, Spark streaming handles the Kafka topic data streams using the Spark streaming library, while the parallelization of RF is performed using Spark MLlib.
Algorithm 1 represents the steps to train and test the RF on Spark based distributed environment. Figure 4 shows the flowchart of the RF model based Spark.
In this part, real-time health status prediction has two phases: the first involves analysis of the healthcare dataset to build the machine learning model, while the second uses the model in production to make predictions on live health data streams. The working process is given in Fig. 5.

Features selection
When developing a machine learning model, only a few variables in the dataset are useful for building the model; the other features are either redundant or irrelevant. If we feed all these redundant and irrelevant features into the model, they can have a negative impact and reduce the overall performance and accuracy of the model. Therefore, it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done using feature selection in machine learning. There are many feature selection techniques; for that purpose, a comparative study in terms of accuracy has been made of well-known techniques such as Relief, Correlation-based feature selection, Chi-squared, Filtered subset, Info gain, Gain ratio, One attribute based, Consistency subset, Filtered attribute, Genetic Algorithm (GA) and GA with SVM. We find that feature selection using the Relief algorithm achieves the highest accuracy on the heart disease dataset. The Relief algorithm was invented in 1992 by Kenji Kira and Larry A. Rendell [33]; its goal is to distinguish between essential and non-essential features while taking into account the interactions between different features. The algorithm relies on measuring the similarities and dissimilarities between the feature values of neighbouring instances, and thus makes it possible to estimate the relevance (or irrelevance) of the various features with a global score. To evaluate the relevance of a feature, the algorithm examines how the feature varies within a class and between classes.
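The core Relief update can be sketched as follows: for each instance, find its nearest hit (same class) and nearest miss (other class), then reward features that differ on the miss and penalize features that differ on the hit. This is a simplified illustration for binary-labelled numeric data, not the implementation used in the experiments:

```python
def relief_scores(X, y):
    """Simplified Relief: returns one relevance weight per feature.
    X is a list of numeric feature vectors, y the matching class labels."""
    n, m = len(X), len(X[0])
    # normalize per-feature differences to [0, 1]
    spans = [(max(r[f] for r in X) - min(r[f] for r in X)) or 1.0 for f in range(m)]
    diff = lambda f, a, b: abs(a[f] - b[f]) / spans[f]
    dist = lambda a, b: sum(diff(f, a, b) for f in range(m))
    w = [0.0] * m
    for i in range(n):
        xi, yi = X[i], y[i]
        hit = min((X[j] for j in range(n) if j != i and y[j] == yi),
                  key=lambda z: dist(xi, z))   # nearest same-class neighbour
        miss = min((X[j] for j in range(n) if y[j] != yi),
                   key=lambda z: dist(xi, z))  # nearest other-class neighbour
        for f in range(m):
            w[f] += (diff(f, xi, miss) - diff(f, xi, hit)) / n
    return w

# Feature 0 separates the two classes; feature 1 is constant and scores ~0.
scores = relief_scores([[0.0, 1.0], [0.1, 1.0], [1.0, 1.0], [0.9, 1.0]], [0, 0, 1, 1])
```

A feature that varies little within a class but strongly between classes accumulates a high positive weight, which is exactly the global score used for ranking below.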

Selected features
The Relief method assigns each feature a ranking score. All features are ranked in descending order of their scores. In this part, the 8 most important features, those with the highest ranks, were chosen. Figure 6 represents the ranking of all features. In order to choose the attributes that give the best accuracy, we test all the attributes with the different machine learning methods and then remove the attributes one by one, starting with those that have a low score. The best accuracy is reached with the 8 attributes (Table 3) that have the highest ranking scores.
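The ranking-and-selection step reduces to sorting the scores and keeping the top entries; a sketch with hypothetical scores:

```python
def top_k_features(scores, k=8):
    """Rank feature indices by score in descending order and keep the top k."""
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return ranked[:k]

# With 3 hypothetical scores, the two best-ranked feature indices are kept.
selected = top_k_features([0.1, 0.5, 0.3], k=2)
```

The backward elimination described above simply calls this with decreasing k and keeps the k that gives the best cross-validated accuracy.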

Distributed database
Extracting meaningful patterns [5] from the big data generated by all users, such as patient diagnostic information, is another essential problem, so the predicted results must be stored in a distributed way to ensure data availability with no single point of failure. Apache Cassandra [7] is a powerful distributed database system that is particularly effective at handling large volumes of records across multiple servers. It can easily be scaled to support a sudden increase in demand simply by deploying a multinode Cassandra cluster. In addition, Cassandra is highly available and has no single point of failure [36], and it provides extremely fast write and read speeds when used with Spark [25].

Visualization
Data visualization has become an integral discipline, necessary for making sense of the multitude of data to which we have access and for communicating the most relevant information. In professional environments, it mainly answers the need to access information quickly: it allows, among other things, key information extracted from databases to be explained and highlighted [5]. Data visualization is the graphical representation of information and lies at the intersection of communication, information science and design. Its main advantage is not that it makes data more attractive, but that it provides insight into complex datasets by communicating their key aspects; it lets decision makers see analytics presented visually, which plays a key role in decision-making.
Storage and visualization of electronic health records can help in identifying patterns for a disease prediction system, assist healthcare providers in responding accurately to patient needs, and support better financial and healthcare decisions based on the predictions made by the system [2]. With the data records available, we can also extract the relationship between each attribute (independent variable) and the outcome (target attribute). Furthermore, clustering similar patient records can save doctors time when diagnosing diseases, since each group of patients can undergo the same type of treatment. Finally, useful information about diseases can be derived from data statistics.
Here, data visualization is performed using Zeppelin [10], a web-based, multipurpose notebook that enables interactive data analytics; it is an open-source data analysis environment that runs on top of Apache Spark. Using the DataFrame concept with Spark SQL, the database can be queried in different ways, for example for the number of instances or the number of distinct cases, as well as for more sophisticated, higher-level analytics. Data querying with Spark SQL proceeds as follows:
- Import the Spark SQL SQLContext.
- Load the data from the Cassandra table to construct an RDD and execute the action commands.
- Apply a map transformation to map the RDD into a heart disease case class.
- Convert the case class to a DataFrame and create a temporary table.
- Define and execute the Spark SQL query.
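The steps above require a running Spark and Cassandra cluster, so they cannot be reproduced in isolation; as a stand-in, this sketch shows the same load-register-query pattern using Python's built-in sqlite3 module. The table name, columns and sample rows are purely illustrative:

```python
import sqlite3

# Stand-in for the Spark SQL flow: load records, expose them as a
# queryable table, then run SQL over it.
rows = [  # (age, sex, prediction) -- a tiny sample of predicted instances
    (63, 1, 1), (41, 0, 0), (57, 1, 1), (49, 0, 0),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heart (age INT, sex INT, prediction INT)")
conn.executemany("INSERT INTO heart VALUES (?, ?, ?)", rows)

# Queries analogous to "number of instances" and "number of positive cases".
total = conn.execute("SELECT COUNT(*) FROM heart").fetchone()[0]
positives = conn.execute(
    "SELECT COUNT(*) FROM heart WHERE prediction = 1").fetchone()[0]
print(total, positives)  # 4 2
```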
In this step, we studied the retrieval time of the following queries (Table 4) with different record counts on a multinode cluster.

Implementation
The proposed system was written in the Scala programming language with Zeppelin as a development platform, which supports many interpreters such as Scala, Spark and Cassandra. Algorithm 2 describes the main steps used to implement our system.

Algorithm 2
Processing steps of the proposed framework.
Firstly, the proposed system was deployed on a single-node cluster created on a standalone machine with a Core i7 processor and 8 GB of RAM running Ubuntu 16.04, using the Spark platform, which integrates MLlib (in particular the RF model) with Kafka streaming data handling. Table 5 shows the characteristics of our master and worker nodes. After establishing a connection to the Twitter stream through Kafka streaming, as detailed in Fig. 1, the application continuously receives data streams from multiple Kafka producers; when it encounters a health status check stream, it extracts the attribute values from the data events sent by Kafka streaming and applies the RF model to predict the health status. The details of each predicted instance are persisted in a Cassandra database table for later querying. Once the proposed system was successfully tested on the single-node cluster, a multinode cluster with one master and two workers was created.
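The per-event logic can be sketched as follows. The header keyword and comma-separated attribute format are assumptions based on the paper's description of a "predefined format preceded by a keyword", and the stub model merely stands in for the trained RF:

```python
def handle_event(message, predict, keyword="heartcheck"):
    """Filter on the header keyword, extract attribute values,
    and predict. Message format is assumed '<keyword> v1,v2,...';
    the real system's format may differ."""
    head, _, payload = message.partition(" ")
    if head != keyword:
        return None                      # not a health-status check
    values = [float(v) for v in payload.split(",")]
    return predict(values)

# Stub standing in for the trained RF model: flags high 'age' values.
stub_rf = lambda v: 1 if v[0] > 55 else 0
print(handle_event("heartcheck 63,1,3,145", stub_rf))  # 1
print(handle_event("weather sunny", stub_rf))          # None
```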

Performance evaluation of machine learning models
The heart disease dataset was randomly split, with a fixed split seed, into a training set (70% of the data) and a test set (30%). The models were trained on the training set with hyperparameter tuning; different models were tested and the classification accuracy was calculated in each case.
Accuracy, sensitivity, specificity, positive likelihood ratio (PLR), negative likelihood ratio (NLR), disease prevalence (DP), positive predictive value (PPV) and negative predictive value (NPV) are calculated for model evaluation. With TP = true positives, TN = true negatives, FP = false positives and FN = false negatives, the validation metrics are defined by:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
PLR = Sensitivity / (1 − Specificity)
NLR = (1 − Sensitivity) / Specificity
DP = (TP + FN) / (TP + TN + FP + FN)
PPV = TP / (TP + FP)
NPV = TN / (TN + FN)
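These metrics follow directly from the confusion counts. A minimal sketch, here applied to the counts reported later for RF with selected features (TP = 37, TN = 44, FP = 2, FN = 5), which reproduces the 92.05% accuracy:

```python
def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics used for model evaluation."""
    total = tp + tn + fp + fn
    sens = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                      # specificity
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sens,
        "specificity": spec,
        "PLR": sens / (1 - spec),              # positive likelihood ratio
        "NLR": (1 - sens) / spec,              # negative likelihood ratio
        "DP": (tp + fn) / total,               # disease prevalence
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }

m = metrics(tp=37, tn=44, fp=2, fn=5)
print(round(m["accuracy"] * 100, 2))  # 92.05
```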

Accuracy using all features
Using all features, the diagnosis accuracy reaches 87.50% with RF. Table 6 presents a comparative analysis of the RF algorithm against the other machine learning algorithms implemented in Spark MLlib: Naive Bayes (NB), Support Vector Machine (SVM), Multilayer Perceptron (MLP), Decision Tree (DT) and Logistic Regression (LG). Figure 7 gives a graphical visualization of the accuracies, showing a clear advantage for the RF classifier over the other well-known classifiers in terms of detection accuracy.

Accuracy using selected features
In the second test, we first applied Relief-based feature selection to select the best features and then used the same machine learning models as in the previous section. The experimental results show that the highest classification performance is achieved when RF is used as the classifier. Table 7 shows the parameter values contributing to the best accuracy.
At the best accuracy of 92.05%, obtained with tuned maxBins, maxDepth and numTrees parameters and the Gini impurity, the total test sample size is 88, with TP = 37, TN = 44, FP = 2 and FN = 5. Table 8 shows a comparative analysis of the six classifiers: NB, SVM, DT, MLP, LG and RF. RF achieved the highest accuracy at 92.05%. Figure 8 gives a graphical visualization of the accuracies, again showing a clear advantage for the RF classifier over the other well-known classifiers in terms of detection accuracy. Figure 9 compares the accuracies with and without feature selection; the accuracy is improved, especially with the RF classifier.
Based on the results obtained, the proposed approach to classifying heart disease not only provides the best accuracy compared to previous experimental results with a minimal set of features, but also reduces the feature set in a simple way, saving time during the training phase, which is especially valuable in the context of big data, where training cost has become a major challenge.

Spark based RF scalability
In order to show the scalability of our approach and to have a sufficiently large database, we generated additional data records: the original dataset was enriched up to 4 million rows.
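The exact generation procedure is not detailed here, so the sketch below shows only one plausible scheme: growing the dataset by resampling existing records until the target size is reached.

```python
import random

def enrich(rows, target_size, seed=0):
    """Grow a dataset to `target_size` rows by resampling existing
    records. This is an assumed scheme for illustration; the paper's
    actual record-generation method may differ."""
    rng = random.Random(seed)
    out = list(rows)
    while len(out) < target_size:
        out.append(rng.choice(rows))
    return out

base = [(63, 1), (41, 0), (57, 1)]
big = enrich(base, 10)   # in the paper the target is ~4 million rows
print(len(big))  # 10
```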

Throughput
Initially, the heart disease dataset was processed to construct a labelled dataset. The RF model was built and tested separately while varying parameters such as maxDepth, maxBins and numTrees, retaining the configuration with the minimum model error based on classification accuracy (Table 8). An offline model with selected features was created and saved for use in real-time, as shown in Fig. 5. For testing purposes, simulator applications acting as Kafka data producers were created; the producers consist of two simulator applications for heart disease streams. Each producer sends approximately 280,200 (or more) events per second per node in a predefined format preceded by a keyword as a header, which Kafka in turn serves to consumers. We conducted three scenarios with stream intervals of 1, 2 and 3 seconds. Table 9 presents the conducted experiments, and Fig. 16 presents the overall performance evaluation of the proposed system.
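Events-per-second throughput can be measured as sketched below, with a list append standing in for the Kafka send; the real producers, message format and rates differ:

```python
import time

def measure_throughput(produce, n_events):
    """Send n_events through `produce` and report events per second,
    the metric used to evaluate the streaming pipeline."""
    start = time.perf_counter()
    for i in range(n_events):
        produce(f"heartcheck event-{i}")  # stand-in for a Kafka send
    elapsed = time.perf_counter() - start
    return n_events / elapsed if elapsed > 0 else float("inf")

sink = []
rate = measure_throughput(sink.append, 100_000)
print(rate > 0 and len(sink) == 100_000)  # True
```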

Query throughput
The retrieval time of records was measured for a fixed data size (3 million records). Figure 17 shows the query time for each query, comparing a traditional relational database management system (RDBMS), MySQL, with Spark SQL. For the heart disease dataset, the query time grows significantly faster with the traditional RDBMS than with Spark SQL; Spark thus executes queries faster than a traditional RDBMS such as MySQL. In this context, Spark can help speed up slow report requests and add scalability to long-running queries. Using Zeppelin, a real-time data dashboard was created that retrieves data from the Cassandra database and displays it in charts and tables. This application uses Spark SQL and AngularJS to push the data to the web page at fixed intervals, so the data is refreshed automatically.

After testing the system with a simulator application, three Twitter accounts were created. Selected features related to heart disease were tweeted in a predefined format preceded by a keyword. All tweeted attributes, together with the user name, are captured and filtered in real-time by Kafka streaming, which serves them to Spark streaming as a Kafka producer. In Spark streaming, the attribute values and user name are extracted, RF is applied to the extracted attributes to predict the health status, and an appropriate message indicating the absence or presence of disease is sent back. The results are saved into the Cassandra database for historical data analysis and visualization.

The proposed system is able to collect, process, analyze and store health status data in real-time. It focuses on applying distributed, real-time machine learning to streaming health data events using big data technologies, namely Apache Spark, instead of traditional frameworks, which are too limited for real-time computing. Using Apache Kafka for big data stream management, the system can predict different types of diseases at the same time, based on the concept of Kafka topics and multiple machine learning models built on the appropriate datasets. An online, distributed and fast prediction model predicts health status from users' streaming tweets, which makes the system scalable. With a simple tweet, Twitter users receive instant information about their health status; the system offers users real-time remote monitoring at no extra cost.

Connecting and interacting with users would normally require investing in a costly infrastructure in terms of programming skills, time and money; using Twitter as a free channel greatly simplified this communication. Table 10 compares state-of-the-art architectures for health status prediction with the proposed system:

System | Real-time | Disease | Accuracy | Remarks
Ed-daoudy and Maalmi [21] | No | Heart disease | 82.40% | Interaction between system and users is not covered; more general architecture; classification accuracy is not sufficient; stream data management is not covered
Ed-daoudy and Maalmi [20] | No | Heart disease and diabetes | 82.40% | Interaction between system and users is not covered; more general architecture; classification accuracy is not sufficient
Proposed system | Yes | Heart disease | 92.05% | Real-time big data analysis is the main goal; big data stream management is covered; prediction is made before data storage; real-time machine learning is performed; data storage and visualization are covered; user interaction is considered and is a main objective; suitable for real-time use

Based on these findings, the proposed system can be applied to real-time big data analysis jobs for medical IoT.

Conclusion
In this paper, we have presented a real-time system for health status prediction in a big data context based on Apache Spark and Apache Kafka. The system is designed to collect, filter, analyze, store and visualize streams of health status data. First, an offline machine learning model was developed; to achieve high accuracy, important features were selected using the Relief algorithm. Using both the full and the selected feature sets, different machine learning algorithms implemented in Spark MLlib, namely RF, DT, SVM, NB, MLP and LG, were applied to the heart disease dataset from the UCI repository. RF with selected features achieved the highest accuracy and was deployed in our system as the online machine learning model. With Twitter serving as a free communication channel, users tweet their attributes related to heart disease; Kafka streaming receives the desired tweet attributes and serves them to Spark streaming, where the RF model is applied to the data streams to predict the health status and send an appropriate message back. The streamed data is stored in the distributed Cassandra database, and the results are displayed on a dashboard that supports more sophisticated, higher-level data analytics.
Developing a real-time data prediction and analytics system using traditional analytics tools requires a variety of skills, intensive and expensive programs, and a considerable amount of time and money, and traditional data processing platforms and techniques struggle to process the enormous amounts of data generated. Using open-source big data tools, especially Spark, significantly improved the performance and effectiveness of the analytics system, particularly in terms of development time, program complexity and execution speed. In future work, we aim to integrate other data sources, such as mobile applications and IoT devices, into our system, along with other classification models to improve prediction accuracy.

Data Availability
The dataset analysed during the current study is available at https://archive.ics.uci.edu/ml/datasets/heart+disease

Declarations

Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors.

Conflict of Interests
The authors declare that they have no conflict of interest.