Toward a Smart Health: Reality Mining and Big Data Analytics for Real-Time Disease Prediction

doi:10.21203/rs.3.rs-1621912/v1

Research Article

https://doi.org/10.21203/rs.3.rs-1621912/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 14 Mar, 2023

Published version: Journal of Big Data

You are reading this latest preprint version

Background

We live in an age where data is everywhere and grows at a rapid pace. Thanks to sensors, mobile phones and social networks, we can gather a huge amount of information to understand human behavior and individual lives. In healthcare systems, big data analytics and machine learning algorithms have proven effective and efficient in saving lives and predicting new diseases. This triggered the idea of taking advantage of those tools and algorithms to create systems that involve both doctors and patients in the treatment of disease, predict outcomes and use real-time risk factors from sensors and mobile phones.

Methods

We distinguish three types of data: data from sensors, data from mobile phones and data registered or updated by the patient in a mobile app we created. We take advantage of IoT systems such as the Raspberry Pi to collect and process data coming from sensors. All collected data is sent to a NoSQL server and then analyzed and processed in Databricks Spark. The centroid-based K-means clustering algorithm is used to build the predictive model, create partitions and make predictions. To validate results in terms of efficiency and effectiveness, we used clustering validation techniques: Random K, Silhouette and Elbow methods.

Results

The advantage of the proposed system is that it can be applied to several disease prediction studies. As a case study, we create an e-monitoring real-time miscarriage prediction system to save babies’ lives and help pregnant women. Doctors receive the results of clustering and track their patients through our mobile app so that they can react to a possible miscarriage and avoid adverse outcomes, while pregnant women receive only advice based on their behaviors. The system uses 15 real-time risk factors and our dataset contains more than 1,000,000 JSON files. The Elbow method confirms three as the optimal number of clusters, and we reach a Silhouette value of 0.99, which is a good sign that clusters are well separated and matched.

Big Data

Miscarriage prediction

Predictive analytics

K-means

Clustering

Big data and reality mining have spread from the sciences to several fields, including healthcare, industry, education and agriculture, among others. Nowadays, sensors, mobile phones and machines generate a huge amount of data at very short intervals. In this context, Artificial Intelligence helps us make decisions, react in advance to avoid adverse outcomes and predict things that were impossible to predict before [1].

Several researchers are creating new models, new algorithms and new knowledge to make machines act like humans and make decisions in real time. For instance, Google can propose answers in a short time, Amazon can suggest books, and our smartphones suggest and add words to their dictionaries based on context. Another famous example is Google Flu Trends, which estimates flu activity based on Google Search queries. In fact, Google compares people who search for the flu with people who have flu symptoms [2]–[4].

At their core, AI and Big Data are about making predictions and reacting at the right moment. Today, big data has become synonymous with data mining and predictive analytics, and has shifted the process from reporting and decision-making to predicting results. We also think that the challenge in using Big Data is not just collecting and storing data, but creating value and making machines act and think like humans to perform complex tasks and make people’s lives easier [5].

After all, reality mining and data mining techniques are able to capture our individual and collective behaviors. In this context, and with respect to the studies in the background section, we created a new e-monitoring system for real-time disease prediction [6]. This system can be applied to any disease prediction case study by using the appropriate data inputs. The proposed system benefits from big data tools, machine learning algorithms and the Internet of Things in order to offer a high-performing model with accurate results. As a case study, we applied the proposed system to real-time miscarriage prediction, where we used the centroid-based K-means algorithm to cluster data and obtain accurate results (Miscarriage / No Miscarriage / Probable Miscarriage) [7]. That model used a dataset of 15 attributes, of which only 11 are real-time risk factors of miscarriage [8]. The present study enhances the previous one by:

Adding more risk factors to the training process: the present dataset adds four features and contains 15 attributes that are all real-time risk factors of miscarriage.
Using only the Raspberry Pi, instead of both the Arduino Uno and the Raspberry Pi, to collect data and process the prediction model, in order to minimize response time for doctors and patients.
Clustering data into three clusters: Miscarriage (M=0), Probable Miscarriage (PM=1) and No Miscarriage (NM=2).

Data is the key asset for researchers, since good and accurate results depend on the quality of the inputs [9]. In healthcare, data is used in many studies to propose new systems, algorithms and methodologies in order to make predictions, help patients and make people’s lives healthier. In addition, machine learning algorithms, IoT tools and reality mining have shown their power in prediction and decision-making [10], [11].

The rest of this paper is organized as follows:

The Background section highlights studies of predictive analytics in several areas such as education, agriculture and healthcare, and discusses important works on miscarriage prediction.
The Method section discusses the architecture of the proposed model, the methodology and the implementation.
The Experiment section presents the experiment environment, the dataset and the use of the K-means predictive model in Databricks. In the same section, we discuss results in terms of efficiency, performance and clustering accuracy.
The Evaluation section presents how we evaluate the model using clustering methods including Random K, Relative Clustering Validation (RCV), Internal Clustering Validation (ICV) and External Clustering Validation (ECV).
The Conclusion and future work section concludes the paper and highlights future work.

The authors of [12] present a literature review on the use of predictive analytics with data mining algorithms in the medical industry and business. They argue that predictive analytics using data mining methods is a powerful tool for predicting future outcomes. They also mention that the prediction process starts from the analysis and evaluation of historical data to obtain prediction results.

In [13], the authors compare four machine learning classification algorithms: Support Vector Machine (SVM), Decision Tree (C4.5), Naïve Bayes (NB) and K-Nearest Neighbor (K-NN). They use these algorithms to predict breast cancer, training the models on the Wisconsin Breast Cancer (WBC) dataset. SVM proves the most effective, with an accuracy of 97.13%. The authors extend this study by proposing a hybrid data mining classifier for breast cancer prediction using the same WBC dataset. Experimental results show that classification using a fusion of SVM, NB and C4.5 reaches the highest accuracy, 97.31% [14].

In [15], the authors offer an interesting view of the role of data mining and predictive analytics in healthcare services. They affirm that data mining is useful in some areas of the healthcare system, such as the determination of clinical pathways and quality of care, while it is still an emerging line of research in others. First, they examine existing work associated with healthcare delivery. Second, they provide a multi-layered framework for analyzing data mining algorithms in healthcare service management. Third, they use two approaches, deductive and inductive fitting, to classify the research. Furthermore, they reveal that data mining succeeds in three main healthcare applications: disease pathways, healthcare capacity and enhancing the quality of care.

In another study [16], the authors propose a new algorithm that predicts the next inter-cell movement of a mobile user in a personal communication systems network. The prediction process includes three main phases: extracting user mobility patterns from mobile user trajectories, deducing rules from the gathered patterns and, lastly, predicting outcomes. Experiments show that the new algorithm outperforms other methods in terms of precision and recall.

The authors of [17] develop a process model, “Actionable Mining and Predictive Analysis”, for public safety and security. This process includes multiple steps: question or challenge, data collection and fusion, operational preprocessing (recoding and selection), identification and modeling, evaluation of public safety and security, and finally actionable output.

In [18], the authors assert that Body Mass Index (BMI) is an important risk factor for miscarriage during pregnancy: pregnant women with a high or low BMI (obesity or underweight) have a higher risk of miscarriage than women with a normal BMI.

Another study [19] demonstrates that vaginal blood loss, uterine adenomyosis, C-reactive protein and premature rupture of membranes are significant and independent risk factors during the second trimester of pregnancy. The study uses data collected from patients under miscarriage treatment in hospital, which are then analyzed using logistic regression to determine the most significant risk factors.

In [20], the authors conduct a study to estimate the burden of miscarriage in the Norwegian population and to evaluate the dependency between miscarriage, maternal age and miscarriage recurrence. The study covered 421,201 pregnant women. As a result, women aged 25–29 have the lowest risk of miscarriage (10%) compared with pregnant women aged 30 and over. In addition, the authors notice that the recurrence of miscarriage increases among women aged 45 and over. Lastly, they conclude that maternal age and history of miscarriage are strong risk factors for adverse pregnancy outcomes.

3.1. Real-Time Disease Prediction: A Proposed System

Nowadays, people suffer from several diseases that they could avoid if they received specific treatment earlier. Individuals usually go for a checkup only when symptoms are severe, by which point it is often too late to treat the disease. In addition, people do not have an overview of their health until they attend a consultation or have a radiograph taken. Therefore, patients are unaware of their health problems and are not involved in the treatment of diseases. This triggered the idea of creating an e-monitoring system for real-time disease prediction (see Fig. 1). This system can make decisions in real time to make people’s lives healthier and safer. Thanks to this system, we can react in advance, because the aim is not only to predict diseases but also to prevent them. The main contributions of this system are:

Applying the system to many case studies: heart attack prediction, obesity prediction, miscarriage prediction, skin cancer prediction, and others.
Involving doctors and patients in the treatment of the disease.
Using healthcare sensors to collect real-time data: activity, heart rate, body temperature, alcohol consumption, and others.
Using the mobile phone to collect real-time data from the user’s profile: age, previous diseases, BMI, weight, and others.
Using IoT tools to collect and process the gathered data.
Using a Big Data platform to train, analyze and create the predictive model.
Building a mobile phone application with multiple profiles (doctor’s profile and pregnant woman’s profile) to communicate and react in advance in critical cases.
Building a smart bracelet that incorporates the system for ease of use.

3.2. Case study: Real-Time Miscarriage Prediction

Miscarriage is one of the outcomes that most hurt families, especially pregnant women [21]. During pregnancy, the woman must attend regular consultations to see how her baby is growing and how healthy it is. In some cases, the pregnant woman is shocked by the death of her baby, whereas she could have saved its life by reacting in advance. Saving lives therefore depends on having accurate information at the right moment. This pushed us to use the system described above for real-time miscarriage prediction.

To use the proposed model for miscarriage prediction, pregnant women are equipped with several healthcare sensors that collect real-time health data. Additional data are gathered from the mobile phone through the mobile application that we created. All those data, in JSON format, are then sent to a database server and retrieved by the Big Data platform in real time. The collected data is transformed, used to train the predictive model and then passed to it to obtain miscarriage results. Before results are sent, the predictive model must be evaluated using appropriate clustering evaluation methods.

A miscarriage outcome is critical information. Thus, we choose to send miscarriage results (M, PM or NM) to the doctor, who will ask for an urgent consultation in critical cases to save the baby’s life as well as the life of the mother. Pregnant women, in contrast, receive recommendations based on their actual behavior during pregnancy. Recommendations relate to well-being, food quality, activity, and so on.

3.3. Reality mining for gathering data

Reality mining and Big Data are powerful means of helping researchers and developers understand individual human behavior. Thanks to advances in technologies and tools, we can now collect data that were impossible to gather earlier. From sensors and mobile phones, we can obtain a huge amount of data about a person every second. Our system thus takes advantage of those tools to collect all possible information about pregnant women. We distinguish multiple sources of data (see Fig. 2):

Data from healthcare sensors.
Data from mobile phone sensors.
Data from the patient’s profile.

All the above data, which contain risk factors related to disease prediction, are written to JSON files every minute and then sent to a server to be analyzed and processed. We used Couchbase Server to store the files in clusters [22].
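To make the data format concrete, the following minimal Python sketch shows how one per-minute JSON document combining the three data sources might be assembled. The function name `build_record` and all field names are hypothetical; the source does not specify the actual schema, and the upload to Couchbase is not shown.

```python
import json
from datetime import datetime, timezone

def build_record(patient_id, sensors, phone, profile, timestamp=None):
    """Merge the three data sources into one JSON document.

    All field names are hypothetical; the paper does not give its schema.
    """
    ts = timestamp or datetime.now(timezone.utc)
    record = {
        "patient_id": patient_id,
        "timestamp": ts.isoformat(),
        "sensors": sensors,   # healthcare sensors (pulse, temperature, ...)
        "phone": phone,       # mobile phone sensors (activity, location, ...)
        "profile": profile,   # data entered by the patient in the app
    }
    return json.dumps(record, sort_keys=True)

doc = build_record(
    "p-001",
    sensors={"bpm": 82, "temp": 37.1, "alcohol": 120},
    phone={"activity": "walking", "location": 2},
    profile={"age": 29, "bmi": 23.4, "nmisc": 0},
    timestamp=datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(doc)
```

In a deployment such as the one described, a document like this would be produced once per minute and sent to the database server.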

3.4. IoT implementation for Healthcare Data

Sensors are the source of data in IoT and have become so advanced that they can collect as much data as needed. In healthcare systems, having accurate information is crucial because we are dealing with a person’s life and death. In our case, we used multiple healthcare sensors: temperature sensor, pulse sensor, acceleration sensor and alcohol/drunk sensor (see Table 1).

To manage and collect data from sensors, multiple IoT tools exist such as:

Arduino Uno [23]: a small electronic board equipped with a microcontroller and an open source platform for gathering data.
Raspberry Pi 3 [24]: a single-board computer for programming and processing.

In this study, we use only the Raspberry Pi for collecting and processing data. Previously, we used the Arduino Uno for collecting data and the Raspberry Pi for processing, but we noticed that this consumed too much time, whereas we need to reduce the response time. Lastly, we transfer all collected data to the mobile phone application “e-preg monitoring”, as shown in Fig. 3.

Table 1

List of Healthcare sensors.
Sensor	Feature collected
Pulse sensor	Heart rate variability, blood pressure and emotion
Temperature sensor	Temperature variation
Acceleration sensor	Activity degree
Alcohol / Drunk sensor	Alcohol consumption

3.5 Predictive model

a. Prediction process

To build our predictive model, we used the Databricks Big Data management tool, which is based on Spark [25], [26]. We chose Spark over other ecosystems such as Hadoop because we need to process streaming data. With Hadoop, we can only do batch processing, which does not meet our needs since we need results in real time to save lives [27]. To create the model and use the K-means clustering algorithm to group our data, we used the Spark Machine Learning Library (MLlib) [28], following the six steps below:

1. Collect and store data: Data are collected from different sources: healthcare sensors, mobile phone and patient’s profile. The collected data is then stored on the database server in real time as JSON files.

2. Extract data from the database: In the Databricks platform, we extract data from the Couchbase server using a listener that triggers extraction when new data arrives.

3. Make transformations:

a. In Spark, data must be numeric for analysis.

b. In Spark, we use an RDD (Resilient Distributed Dataset) to pass data to the K-means clustering algorithm.

4. Train on the dataset: the function “train” is used to train the model, with the following parameters: number of clusters, data and number of iterations.

5. Make predictions: determine the appropriate cluster for each sample in the dataset.

6. Evaluate the model: Evaluation is an essential step in prediction, since the accuracy of the model depends on how well it predicts outcomes.
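The transformation in step 3 (making every attribute numeric) can be illustrated with a minimal Python sketch. This is not the authors' Scala code on Databricks: the field names, the `STRESS_CODES` mapping and the helper `to_numeric_row` are assumptions made purely for illustration.

```python
# Hypothetical integer codes for a categorical field; the paper does not
# list the codes it actually uses.
STRESS_CODES = {"none": 0, "low": 1, "medium": 2, "high": 3}

def to_numeric_row(record):
    """Turn one decoded JSON record into a list of floats for clustering."""
    return [
        float(record["age"]),
        float(record["bmi"]),
        float(record["bpm"]),
        float(STRESS_CODES[record["stress"]]),  # category -> integer code
        1.0 if record["walking"] else 0.0,      # boolean -> 0/1 flag
    ]

row = to_numeric_row(
    {"age": 29, "bmi": 23.4, "bpm": 82, "stress": "low", "walking": True}
)
print(row)  # [29.0, 23.4, 82.0, 1.0, 1.0]
```

Rows of this shape are what a clustering routine consumes; in Spark they would be wrapped in vectors and distributed as an RDD.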

b. Kmeans Clustering algorithm

Machine learning algorithms, which are a part of AI, are high-performing models for learning and predicting outcomes with high accuracy. Several families of machine learning exist: supervised learning, unsupervised learning and reinforcement learning. Many applications in domains such as education, healthcare and agriculture use data mining techniques in the prediction process [29].

Among unsupervised machine learning algorithms, we distinguish clustering algorithms such as K-means [30], Hierarchical Clustering (HCA) [31] and Expectation Maximization (EM) [32], among others. The challenge is choosing the right algorithm. According to [33], multiple parameters must be considered: size of the data, number of clusters, type of dataset, and so on. For large datasets, K-means and EM show good performance and efficiency, while HCA and Self-Organizing Maps (SOM) perform very well when the dataset is small.

In our study, we opt for the centroid-based K-means algorithm because our dataset is large and because the algorithm is simple to understand and widely used. K-means assigns each data point to the cluster whose center is nearest according to the Euclidean distance, then recomputes each center as the mean of its assigned points. K is the number of clusters into which the data is grouped. The main goal of K-means is to minimize the squared error objective function given by:

$$J\left(V\right)=\sum_{i=1}^{c}\sum_{j=1}^{c_i}\left(\left\|x_j-v_i\right\|\right)^{2}$$

Where:

“||x_j − v_i||” is the Euclidean distance between a data point x_j assigned to cluster i and the center v_i; c_i is the number of data points in cluster i and c is the number of cluster centers.
X= {x₁, x₂, x₃… x_n} is the set of data points and V= {v₁, v₂… v_c} is the set of centers.
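As an illustration of how this objective is minimized, the sketch below implements the standard Lloyd iteration for K-means in plain Python, together with the objective J. It is a toy stand-in for Spark MLlib rather than the authors' implementation; the function names and the two-dimensional sample points are made up for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Assignment step: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k sampled data points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Update step: mean of the members (an empty cluster keeps its center).
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[i]
            )
        if new_centers == centers:   # no center moved: converged
            break
        centers = new_centers
    return centers, assign(points, centers)

def objective(points, centers, labels):
    """J(V): sum of squared distances of each point to its cluster center."""
    return sum(sq_dist(p, centers[label]) for p, label in zip(points, labels))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels, objective(points, centers, labels))
```

On this toy data the two visually obvious groups are recovered; real data with 15 features works the same way, only with longer tuples.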

4.1. Experiment Environment

Table 2 presents the experiment environment of our study. We used Databricks Spark as a big data management tool for processing and building the predictive model, Couchbase for storing data, and IoT tools and healthcare sensors for collecting and processing real-time data. The mobile application is developed for Android.

Table 2

Experiment environment.
Big Data platform	Databricks Spark (programming language: Scala)
Database server	Couchbase Server hosted with a public IP address
IoT tools	Raspberry Pi (Linux OS; programming language: Python)
Sensors	Pulse sensor, temperature sensor, alcohol sensor and acceleration sensor
Mobile tools	Android Studio, used to create our mobile application “e-preg monitoring”

4.2. Experiment dataset

Table 3

Miscarriage Dataset description.
	Reference	Risk factor	Description	Source
1	[34]	Heart rate variability (HR)	A marker of stress. A high HR value is a sign of hypertension and elevated blood pressure.	Healthcare sensors
2	[35]	Stress and blood pressure (BP)	Values defined based on the value of HRmax.	Healthcare sensors
3	[36]	Temperature variation (TP)	Spontaneous miscarriage or premature delivery is associated with any viral infection. A higher body temperature and flu increase the risk of miscarriage.	Healthcare sensors
4	[36]	Physical activity	Extreme physical activity is associated with an elevated risk of miscarriage.	Healthcare sensors
5	[37]	Alcohol consumption	The level of alcohol consumption: ethyl alcohol value in the human body.	Healthcare sensors
6	[37]	Drunk	Level of drunkenness.	Healthcare sensors
7	[38]	BMI	A sign of obesity or underweight. BMI = W (kg) / H² (m²), where W and H are respectively the weight and the height of the pregnant woman.	Mobile phone
8	[39]	Number of previous miscarriages	Miscarriage risk is categorized by the number of previous pregnancy losses.	Mobile phone
9	[20]	Maternal age	Increased maternal age is associated with an increased chance of miscarriage.	Mobile phone
10	[40]	Location	Food safety and eating well matter during pregnancy to avoid illness.	Mobile phone
11 to 15	[36]	Current activity (running, walking, biking, …)	Activities such as running, walking and biking, among others, can increase the risk of miscarriage.	Mobile phone

In this study, we used sensors and mobile phones to collect data about pregnant women. The current system uses a dataset that contains more than 1,000,000 JSON files of pregnant women. It contains 15 features that are all miscarriage risk factors, gathered from mobile phones and sensors between 2019 and 2021 (see Table 3).
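As a worked example for the BMI risk factor of Table 3, the sketch below computes BMI as weight in kilograms divided by the square of height in metres and maps it to the general WHO adult categories. The function names are illustrative, and the paper does not state which cut-offs its mobile application uses, so the category thresholds are an assumption.

```python
def bmi(weight_kg, height_m):
    """Body Mass Index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

def bmi_category(value):
    """General WHO adult categories (assumed here, not taken from the paper)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"

value = bmi(65, 1.70)
print(round(value, 1), bmi_category(value))  # 22.5 normal
```

For a woman of 65 kg and 1.70 m, BMI = 65 / 2.89 ≈ 22.5, which falls in the normal range.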

4.3. Prediction process steps in Databricks Spark

As shown in Fig. 2 and Fig. 3, we send all gathered data to a database server for storage, from which it is retrieved by the Big Data platform in real time. Once data is available in the Databricks Spark cloud, we build our predictive model to predict whether a pregnant woman is at risk of miscarriage. To build the K-means model using Spark, we follow the steps below:

a. Transformation

We have to convert the data to an RDD to pass it to the K-means algorithm, as shown in Fig. 4.

AllDF represents the dataset.
r.getInt(1) … r.getInt(15) represent the following attributes: Age: Int, BMI: Int, Nmisc: Int, Activity: Int, Biking: Int, Walking: Int, Driving: Int, Sitting: Int, Location: Int, temp: Int, bpm: Int, stress: Int, bp: Int, alcohol: Int, Drunk: Int

b. Train the model

To train the K-means algorithm (see Fig. 5), we pass several parameters: data: “Vectors”, number of clusters: “3”, number of iterations: “400”.

c. Make predictions

We predict results using the predict function, which takes dense vectors of attributes, as described in Fig. 6. r._1 to r._15 represent the following attributes: Age: Int, BMI: Int, Nmisc: Int, Activity: Int, Biking: Int, Walking: Int, Driving: Int, Sitting: Int, Location: Int, temp: Int, bpm: Int, stress: Int, bp: Int, alcohol: Int, Drunk: Int.
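Conceptually, prediction with a trained K-means model amounts to finding the nearest cluster center. The Python sketch below illustrates this under loudly stated assumptions: the three centers are invented values over only three features (temp, bpm, stress), and the mapping of cluster indices to M, PM and NM is assumed to have been established beforehand, since K-means itself numbers clusters arbitrarily.

```python
import math

# Assumed mapping of cluster index to outcome (M=0, PM=1, NM=2 as in the text).
LABELS = {0: "Miscarriage", 1: "Probable Miscarriage", 2: "No Miscarriage"}

def predict(centers, vector):
    """Return the index of the center closest to the vector (Euclidean)."""
    distances = [math.dist(vector, center) for center in centers]
    return distances.index(min(distances))

# Hypothetical centers over (temp, bpm, stress); not taken from the paper.
centers = [(39.5, 180.0, 3.0), (38.0, 120.0, 2.0), (36.8, 75.0, 0.0)]

cluster = predict(centers, (37.0, 80.0, 1.0))
print(cluster, LABELS[cluster])  # 2 No Miscarriage
```

The real model works on all 15 attributes listed above; only the dimensionality of the vectors changes.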

d. Send results

Spark sends prediction results to doctors through the mobile application “e-preg monitoring” so that they can act at the right time in case of a probable miscarriage, while we send recommendations to patients to avoid psychological problems.

4.4. Experiment Results and Discussion

a. Model efficiency

In healthcare systems, response time is a crucial metric because we are dealing with a person’s life or death. In our study, we train our model with different values of K to compare model efficiency in terms of time to train the model, time to create clusters, time to define centroids and, finally, time to evaluate the model (see Table 4).

Table 4

Performance parameters of k-means.
Parameters	K = 1	K = 2	K = 3	K = 4
Time to build the model (s)	1.28	1.69	2.85	3.80
Time for cluster Distribution (s)	0.69	0.96	0.95	1.08
Time for center Definition (s)	0.35	0.62	0.40	0.49
Time for Model evaluation (s)	47.65	48.52	49.89	48.28

c. K-means scatter plot

A scatter plot is a useful graph that shows the distribution of clusters and the correlation between features, and helps foresee trends. Because of the size of the dataset and the large number of features, the full scatter plot cannot be visualized: Databricks Community Platform only offers a scatter plot of the first 1000 rows. Thus, not all clusters are visible in the plot, because of the small variation in the first rows. Figure 7 illustrates a scatter plot from a previous study with two clusters. There, we can notice variation between clusters, since the clusters already vary within the first 1000 rows. Table 5 and Table 6 present, respectively, the meaning of the features and a summary of the features.

Table 5

Features meaning
Feature	Meaning	Feature	Meaning
Feature 0	Age	Feature 7	Sitting
Feature 1	BMI	Feature 8	Location
Feature 2	Nmisc	Feature 9	temp
Feature 3	Activity	Feature 10	bpm
Feature 4	Biking	Feature 11	stress
Feature 5	Walking	Feature 12	bp
Feature 6	Driving	Feature 13	alcohol
		Feature 14	Drunk

Table 6

Summary of features
Feature	Mean	Stddev	Min	25%	50%	75%	Max
Age	22.000	0.000	22	22	22	22	22
BMI	17.000	0.000	17	17	17	17	17
Nmisc	1.558	1.134	0	1	2	3	3
Activity	2.000	1.500	0	1	2	3	4
Biking	0.510	0.500	0	0	1	1	1
Walking	0.485	0.500	0	0	0	1	1
Driving	0.005	0.070	0	0	0	0	1
Sitting	0.512	0.500	0	0	1	1	1
Location	2.005	1.409	0	1	2	3	4
Temp	37.782	1.788	35	36	38	39	41
Bpm	129.157	56.093	43	79	119	167	238
Stress	1.134	1.114	0	0	1	2	3
Bp	1.522	1.099	0	0	2	2	3
Alcohol	476.539	216.925	100	295	475	659	854
Drunk	1.891	0.413	0	2	2	2	2
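A summary of this kind (mean, standard deviation, minimum, quartiles, maximum) can be reproduced for any single feature with the Python standard library, as in the sketch below. The helper `summarize` and the sample values are illustrative only, and exact quartiles are computed here, whereas Spark may report approximate percentiles on large datasets.

```python
import statistics

def summarize(values):
    """Mean, sample standard deviation, min, quartiles and max of a feature."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }

# Made-up values for one integer-coded feature.
summary = summarize([0, 1, 2, 3, 4])
print(summary)
```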

d. Results transmission for doctors

Once the predictive model determines clusters, results are transmitted to doctors in the mobile application we created. In their profile, doctors list their patients, track each pregnant woman and receive miscarriage results in real time (see Fig. 8). Thus, in case of a high likelihood of miscarriage, doctors can notify the woman to come for a consultation to avoid unwanted outcomes.

e. Recommendations for patients

During pregnancy, the woman experiences emotional imbalance due to hormonal and bodily changes. Sending bad news is not appropriate since it can cause psychological problems. Therefore, we preferred to send only recommendations based on her behavior during pregnancy, as shown in Fig. 8. For example, if we notice that the pregnant woman is spending most of her time in snack bars and restaurants, the system will notify her with a message: “Please eat healthy foods! You are pregnant and you should take care of your future baby nutrition”. Pregnant women should eat healthy foods, preferably those made at home.

The main goal of the K-means algorithm is to minimize the sum of squared distances (Euclidean distance, for example) between the points of each cluster and its center. It remains important to evaluate the obtained model and check whether it represents the data correctly. In this study, we evaluate and validate the model through a number of techniques:

Within Set Sum of Squared Errors (WSSSE): Evaluation with a random K
Clustering validation techniques:
- Relative Clustering Validation (RCV)
- Internal Clustering Validation (ICV)
- External Clustering Validation (ECV)

5.1. Within Set Sum of Squared Errors (WSSSE): Evaluation with a random K

When using K-means, we set “K” to three. However, we have to validate this value and check whether it is appropriate in our case. The Within Set Sum of Squared Errors is a good metric for evaluating models, since it is the sum of the squared distances between each point and the center of its cluster, over all “K” clusters.

Normally, WSSSE decreases as the number of clusters grows. Nevertheless, a high value of K is not always an indicator of accurate outcomes: the context of the case study and the meaning of the clusters must also be considered. To evaluate with a random “K”, we calculate the WSSSE of K-means for four values of “K”, as presented in Table 7. We reach the highest values of WSSSE when k = 1 and k = 2, and the lowest when k = 3 and k = 4. Therefore, we need to determine, in the following sections, whether to group our data into three or four clusters.

Table 7

Values of WSSSE for different values of K.
Value of K	Value of WSSSE (Thousand)
1	1.01E+24
2	3.89E+23
3	1.01E+23
4	1.24E+23

5.2. Relative Clustering Validation (RCV)

RCV evaluates the general structure of the clustering and determines the optimal number of clusters K through the “Elbow” method. We plot the value of “K” against its WSSSE value. The optimal K is the point at which the graph shows a bend (knee).

In Spark, to obtain the result of the Elbow method, we called the function “fviz_nbclust” (provided by the R package factoextra):

fviz_nbclust(…, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2) + labs(subtitle = "Elbow method")

From the plot in Fig. 9, we notice two bends. The first is at k = 2, with a high value of WSSSE (3.89E+23); this case is to be avoided since the error is large. The second is at k = 3, with the lowest error. At this stage, the Elbow method indicates that clusters are well structured when we group data into three partitions. For us, this is a good sign, since our goal is to group our data into three outcomes: M, PM and NM. Choosing k = 1 (highest WSSSE) is not appropriate, while choosing k = 4 does not reduce WSSSE further (see Table 7) and does not suit the aims of our case study.
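The elbow is read visually from the plot, but the same reasoning can be approximated numerically. The sketch below applies one simple heuristic, which is an assumption of this example and not the procedure behind fviz_nbclust: pick the smallest K for which adding one more cluster reduces WSSSE by less than a chosen fraction (`min_gain`, an illustrative threshold). The input values are the WSSSE figures reported in Table 7.

```python
def pick_k(wssse_by_k, min_gain=0.10):
    """Smallest K whose successor lowers WSSSE by less than min_gain (relative)."""
    ks = sorted(wssse_by_k)
    for k, next_k in zip(ks, ks[1:]):
        gain = (wssse_by_k[k] - wssse_by_k[next_k]) / wssse_by_k[k]
        if gain < min_gain:
            return k
    return ks[-1]  # WSSSE kept dropping markedly: take the largest K tried

# WSSSE values reported in Table 7 for K = 1..4.
table7 = {1: 1.01e24, 2: 3.89e23, 3: 1.01e23, 4: 1.24e23}
best_k = pick_k(table7)
print(best_k)  # 3
```

Going from K = 1 to 2 and from 2 to 3 each cuts WSSSE by more than half, whereas going from 3 to 4 does not reduce it at all, so the heuristic stops at K = 3.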

5.3. Internal Clustering Validation (ICV)

In contrast to RCV, Internal Clustering Validation (ICV) evaluates the density of clusters and the separation between them. We focus on how a cluster is separated from other partitions and how points are distributed within the same cluster. The majority of ICV methods combine both density (compactness) and separation to calculate indexes (α and β are weights):

$$Index=\frac{\alpha \times Separation}{\beta \times Density}$$

The Silhouette method is one of the most valuable evaluation methods in ICV because of its robustness. It tells how well a point is grouped by comparing its average distance to the points of its own cluster with its average distance to the points of the nearest other cluster. For each observation i, the silhouette width S_i is calculated as follows:

1. For each point i, we calculate the average dissimilarity a_i between i and all other observations in the same cluster as i.

2. We calculate the average dissimilarity d(i, C) between i and all points of each cluster C to which i does not belong. The lowest of these values, b_i = min_C d(i, C), represents the dissimilarity between i and its nearest neighboring cluster.

3. Finally, we obtain the silhouette width of observation i (a number between −1 and 1) through the formula below. The meanings of silhouette width values are described in Table 8.

$$S_i=\frac{b_i-a_i}{\max\left(a_i, b_i\right)}$$

Table 8

Different values of Silhouette width in clustering models.
Value of Silhouette width S_i	Meaning
Almost 1	Observations are well grouped.
Around 0	Too few or too many clusters; points are not well grouped in their own cluster
Negative value	Observations are not in the appropriate partition

From the plot in Fig. 10, we notice that the silhouette width reaches almost 1 with three clusters. The Silhouette method thus confirms the Elbow method and demonstrates the compactness of the groups when k = 3. We used the clustering evaluator of Spark MLlib to compute the silhouette width (see Fig. 11).
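The three-step silhouette computation described above can be written directly in plain Python, as in the sketch below. It is an illustration on made-up two-dimensional points, not the Spark evaluator: it uses the plain Euclidean distance, so its values need not coincide with those Spark reports, and it assumes at least two clusters. Assigning a width of 0 to a point that is alone in its cluster is a common convention adopted here.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width S_i of every point (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for a singleton cluster
            continue
        # a_i: average distance to the other points of the same cluster
        # (the point's zero distance to itself does not affect the sum).
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b_i: lowest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b))
    return widths

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
widths = silhouette_widths(points, labels)
print(sum(widths) / len(widths))  # average silhouette width, close to 1
```

Averaging the widths over all points gives a single score of the kind reported for the model; values near 1 indicate compact, well-separated clusters.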

5.4. External Clustering Validation (ECV)

External clustering validation relies on comparing the experimental results with other studies working on the same prediction case study. As mentioned earlier, miscarriage prediction studies in the literature use either maternal factors or ultrasound (echography) to predict a possible miscarriage. However, it is then often too late to react and save the baby’s life. Compared with previous studies, we obtain good results. Table 9 and Table 10 present a comparison with other research studies on miscarriage prediction.

Table 9

External Clustering Validation: Comparison with previous studies.
Reference	Real-time risk factors	Maternal risk factors	Medical risk factors
[41]		x	x
[39]			x
[42]		x	x
[43]			x
Experiment	x	x	x

Table 10

Comparison with a previous study.
Reference	Number of risk factors	Silhouette width	Elbow method	WSSSE value
Previous study	11	0.95	K = 2	978.24
Current study	15	0.99	K = 3	1.01E+23

6. Conclusion And Future Work

Nowadays, sensors and mobile phones remain important sources of data about human behavior. Taking advantage of these technologies and benefiting from reality mining and big data analytics remains a challenge in disease prediction. The present paper proposes an e-monitoring system for real-time disease prediction that can be applied to several case studies. As a proof of concept, we propose a real-time miscarriage prediction system that benefits from IoT (healthcare sensors, mobile phones and IoT tools), data mining algorithms and big data predictive analytics. The proposed system involves both pregnant women and doctors in the treatment of the disease through a mobile app that we created.

The K-means centroid clustering algorithm demonstrates its performance by grouping the data into three clusters: Miscarriage, Probable Miscarriage and Non-Miscarriage. To validate the results, we used several methods including WSSSE, RCV, ICV and ECV. We achieve the lowest value of WSSSE when k = 3, and the Elbow method of RCV confirms this value of k through its graph. Thus, the optimal number of clusters is 3, and splitting the dataset into more partitions is not warranted. In addition, the silhouette width is 0.99, which is almost 1. The clusters of the K-means model are therefore compact and well separated.

Future work consists of enhancing the proposed system by including more healthcare sensors to collect health data about a person, adding risk factors collected from social networks, texts, calls and images, among others, and developing the predictive model under a faster real-time framework such as Flink. In addition, the proposed system is part of an ongoing project on humanoid healthcare robots. The future system will include humanoids, the proposed prediction system and an assistance system. HIYAM, the first part created, is a new Moroccan humanoid robot that will serve as a nurse robot in hospitals. Several functions have been developed, and we are improving HIYAM in terms of decision-making and prediction.

Abbreviations

AI: Artificial Intelligence; IoT: Internet of Things; RCV: Relative Clustering Validation; ICV: Internal Clustering Validation; ECV: External Clustering Validation; SVM: Support Vector Machine; NB: Naive Bayes; K-NN: K-Nearest Neighbor; C4.5: Decision Tree; WBC: Wisconsin Breast Cancer; BMI: Body Mass Index; MLlib: Machine Learning Library; RDD: Resilient Distributed Dataset; HCA: Hierarchical Clustering Algorithm; EM: Expectation Maximization; HR: Heart Rate; WSSSE: Within Set Sum of Squared Errors.

Declarations

Ethical approval and consent to participate

Not applicable

Consent for publication

Not applicable.

Availability of data and materials

The dataset is published and available through the link: http://dx.doi.org/10.17632/5sbmhh6t3r.1, under the Mendeley Repository name: HIBA ASRI_ Miscarriage Prediction Risk Factors

Competing interests

The authors declare that they have no competing interests.

Funding

Not applicable.

Authors’ contributions

Author HA defined the study methodology, developed the e-monitoring miscarriage prediction system and wrote the main manuscript. Author ZJ contributed his expertise to the workflow of the system. All authors reviewed and corrected the manuscript. All authors agreed to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. All authors read and approved the final manuscript.

Acknowledgements

Not applicable.

Authors’ information

Pr Hiba Asri is an assistant professor in the computer science department of the Faculty of Sciences Semlalia at Cadi Ayyad University. She is a member of the LISI Laboratory at Cadi Ayyad University. She holds a doctorate in computer science from Cadi Ayyad University for her work on Big Data, Predictive / Preventive Analysis, Machine Learning and Reality Mining applied to the field of health. She obtained an engineering degree in Computer Networks and Information System in 2014. Her main research interests include Big Data, Machine Learning, IoT, Robotics, Data Mining algorithms, human-machine interaction and new generation internet technologies, in various fields of application such as health, education and e-learning. In addition to her academic experience, she has chaired the program committee of many international conferences. She is a keynote speaker and session chair in various conferences and international workshops. She is a co-researcher in several international projects such as the Challenge AI-BioDiv Project and the Climate Change Project. Hiba Asri is the creator of a smart bracelet for real-time miscarriage prediction, which has been submitted for a patent. She also chairs the HIYAM project, a Moroccan Artificial Intelligence humanoid robot.

Pr Zahi Jarir received his postgraduate degree in computer science in 1997, on Natural Language Processing, at the Faculty of Sciences in Rabat, Morocco. From 1997 to 2006, he was an assistant professor at the Faculty of Sciences, Cadi Ayyad University in Marrakech, Morocco. In 2006, he received academic accreditation from Cadi Ayyad University in the field of Personalization of Telecommunication Services and Web applications. Currently, he is a full professor of Computer Science at the Faculty of Sciences of Cadi Ayyad University. He has participated actively in several research projects (RNTL, Volubilis, CSPT, PMARS, etc.). His research interests include distributed systems, adaptive and reflective middleware, ubiquitous computing, service-oriented computing, cloud computing, information security, M2M and IoT coordination, artificial intelligence techniques and blockchain. He is a member of the editorial boards of several international journals and a program committee member for multiple international conferences. He has published several papers in international conferences and journals, and has chaired and organized several international scientific events.

References

1. H. Asri, H. Mousannif, H. Al Moatassime, and T. Noel, ‘Big data in healthcare: Challenges and opportunities’, in 2015 International Conference on Cloud Technologies and Applications (CloudTech), Jun. 2015, pp. 1–7. doi: 10.1109/CloudTech.2015.7337020.
2. D. Lazer, R. Kennedy, G. King, and A. Vespignani, ‘The Parable of Google Flu: Traps in Big Data Analysis’, Science, vol. 343, no. 6176, pp. 1203–1205, Mar. 2014, doi: 10.1126/science.1248506.
3. K. Tsuji et al., ‘Book Recommendation Based on Library Loan Records and Bibliographic Information’, Procedia - Social and Behavioral Sciences, vol. 147, pp. 478–486, Aug. 2014, doi: 10.1016/j.sbspro.2014.07.142.
4. H. Asri, ‘IoT and Reality Mining for Real-Time Disease Prediction’, in IoT and Smart Devices for Sustainable Environment, M. Azrour, A. Irshad, and R. Chaganti, Eds. Cham: Springer International Publishing, 2022, pp. 85–102. doi: 10.1007/978-3-030-90083-0_7.
5. H. Asri, H. Mousannif, and H. Al Moatassime, ‘Reality mining and predictive analytics for building smart applications’, J Big Data, vol. 6, no. 1, p. 66, Jul. 2019, doi: 10.1186/s40537-019-0227-y.
6. H. Asri, H. Mousannif, and H. A. Moatassime, ‘Big Data Analytics in Healthcare: Case Study - Miscarriage Prediction’, IJDST, vol. 10, no. 4, pp. 45–58, Oct. 2019, doi: 10.4018/IJDST.2019100104.
7. H. Asri, H. Mousannif, and H. Al Moatassime, ‘Comprehensive miscarriage dataset for an early miscarriage prediction’, Data in Brief, vol. 19, pp. 240–243, Aug. 2018, doi: 10.1016/j.dib.2018.05.012.
8. H. Asri, ‘HIBA ASRI_ Miscarriage Prediction Risk Factors’, vol. 1, May 2021, doi: 10.17632/5sbmhh6t3r.1.
9. ‘Comparing different supervised machine learning algorithms for disease prediction | SpringerLink’. https://link.springer.com/article/10.1186/s12911-019-1004-8 (accessed Apr. 28, 2022).
10. ‘Automated machine learning: Review of the state-of-the-art and opportunities for healthcare - ScienceDirect’. https://www.sciencedirect.com/science/article/pii/S0933365719310437 (accessed Apr. 28, 2022).
11. S. Poornima and M. Pushpalatha, ‘A survey of predictive analytics using big data with data mining’, International Journal of Bioinformatics Research and Applications, vol. 14, no. 3, pp. 269–282, Jan. 2018, doi: 10.1504/IJBRA.2018.092697.
12. H. Asri, H. Mousannif, H. A. Moatassime, and T. Noel, ‘Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis’, Procedia Computer Science, vol. 83, pp. 1064–1069, Jan. 2016, doi: 10.1016/j.procs.2016.04.224.
13. H. Asri, H. Mousannif, and H. Al Moatassim, ‘A Hybrid Data Mining Classifier for Breast Cancer Prediction’, in Advanced Intelligent Systems for Sustainable Development (AI2SD’2019), Cham, 2020, pp. 9–16. doi: 10.1007/978-3-030-36664-3_2.
14. M. M. Malik, S. Abdallah, and M. Ala’raj, ‘Data mining and predictive analytics applications for the delivery of healthcare services: a systematic literature review’, Ann Oper Res, vol. 270, no. 1, pp. 287–312, Nov. 2018, doi: 10.1007/s10479-016-2393-z.
15. G. Yavaş, D. Katsaros, Ö. Ulusoy, and Y. Manolopoulos, ‘A data mining approach for location prediction in mobile environments’, Data & Knowledge Engineering, vol. 54, no. 2, pp. 121–146, Aug. 2005, doi: 10.1016/j.datak.2004.09.004.
16. C. McCue, ‘Data Mining and Predictive Analytics in Public Safety and Security’, IT Professional, vol. 8, no. 4, pp. 12–18, Jul. 2006, doi: 10.1109/MITP.2006.84.
17. R. Pasquali, ‘Obesity, fat distribution and infertility’, Maturitas, vol. 54, no. 4, pp. 363–371, Jul. 2006, doi: 10.1016/j.maturitas.2006.04.018.
18. Z. Li, Y.-D. He, and Q. Chen, ‘A risk-prediction nomogram for patients with second-trimester threatened miscarriage associated with adverse outcomes’, In Review, preprint, Nov. 2020. doi: 10.21203/rs.3.rs-111117/v1.
19. M. C. Magnus, A. J. Wilcox, N.-H. Morken, C. R. Weinberg, and S. E. Håberg, ‘Role of maternal age and pregnancy history in risk of miscarriage: prospective register based study’, BMJ, vol. 364, p. l869, Mar. 2019, doi: 10.1136/bmj.l869.
20. G. Kong, T. Chung, B. Lai, and I. Lok, ‘Gender comparison of psychological reaction after miscarriage—a 1-year longitudinal study’, BJOG: An International Journal of Obstetrics & Gynaecology, vol. 117, no. 10, pp. 1211–1219, 2010, doi: 10.1111/j.1471-0528.2010.02653.x.
21. M. A. Hubail et al., ‘Couchbase analytics: NoETL for scalable NoSQL data analysis’, Proc. VLDB Endow., vol. 12, no. 12, pp. 2275–2286, Aug. 2019, doi: 10.14778/3352063.3352143.
22. Y. A. Badamasi, ‘The working principle of an Arduino’, in 2014 11th International Conference on Electronics, Computer and Computation (ICECCO), Sep. 2014, pp. 1–4. doi: 10.1109/ICECCO.2014.6997578.
23. E. Upton and G. Halfacree, Raspberry Pi User Guide. John Wiley & Sons, 2014.
24. R. C. L’Esteve, ‘Machine Learning in Databricks’, in The Definitive Guide to Azure Data Engineering: Modern ELT, DevOps, and Analytics on the Azure Cloud Platform, R. C. L’Esteve, Ed. Berkeley, CA: Apress, 2021, pp. 543–559. doi: 10.1007/978-1-4842-7182-7_23.
25. S. Salloum, R. Dautov, X. Chen, P. X. Peng, and J. Z. Huang, ‘Big data analytics on Apache Spark’, Int J Data Sci Anal, vol. 1, no. 3, pp. 145–164, Nov. 2016, doi: 10.1007/s41060-016-0027-9.
26. K. Aziz, D. Zaidouni, and M. Bellafkih, ‘Real-time data analysis using Spark and Hadoop’, in 2018 4th International Conference on Optimization and Applications (ICOA), Apr. 2018, pp. 1–6. doi: 10.1109/ICOA.2018.8370593.
27. X. Meng et al., ‘MLlib: Machine Learning in Apache Spark’, p. 7.
28. G. Bonaccorso, Machine Learning Algorithms. Packt Publishing Ltd, 2017.
29. K. P. Sinaga and M.-S. Yang, ‘Unsupervised K-Means Clustering Algorithm’, IEEE Access, vol. 8, pp. 80716–80727, 2020, doi: 10.1109/ACCESS.2020.2988796.
30. J. Fan, ‘OPE-HCA: an optimal probabilistic estimation approach for hierarchical clustering algorithm’, Neural Comput & Applic, vol. 31, no. 7, pp. 2095–2105, Jul. 2019, doi: 10.1007/s00521-015-1998-5.
31. M. Hamidi, M. Sheikhalishahi, and F. Martinelli, ‘Privacy Preserving Expectation Maximization (EM) Clustering Construction’, in Distributed Computing and Artificial Intelligence, 15th International Conference, Cham, 2019, pp. 255–263. doi: 10.1007/978-3-319-94649-8_31.
32. C. J. Gilmore, G. Barr, and W. Dong, ‘Choice of clustering method’, urn:isbn:978-1-118-41628-0, 2019. https://onlinelibrary.wiley.com/iucr/itc/Ha/ch3o8v0001/sec3o8o3o5/ (accessed Apr. 29, 2022).
33. J. F. Thayer, F. Åhs, M. Fredrikson, J. J. Sollers, and T. D. Wager, ‘A meta-analysis of heart rate variability and neuroimaging studies: Implications for heart rate variability as a marker of stress and health’, Neuroscience & Biobehavioral Reviews, vol. 36, no. 2, pp. 747–756, Feb. 2012, doi: 10.1016/j.neubiorev.2011.11.009.
34. O. Anselem, D. Floret, V. Tsatsaris, F. Goffinet, and O. Launay, ‘[Influenza infection and pregnancy]’, Presse Med, vol. 42, no. 11, pp. 1453–1460, Nov. 2013, doi: 10.1016/j.lpm.2013.01.064.
35. E. Y. Wong et al., ‘Physical activity, physical exertion, and miscarriage risk in women textile workers in Shanghai, China’, American Journal of Industrial Medicine, vol. 53, no. 5, pp. 497–505, 2010, doi: 10.1002/ajim.20812.
36. J. Nizard et al., ‘Pathologies maternelles chroniques et pertes de grossesse. Recommandations françaises’, Journal de Gynécologie Obstétrique et Biologie de la Reproduction, vol. 43, no. 10, pp. 865–882, Dec. 2014, doi: 10.1016/j.jgyn.2014.09.017.
37. Z. Veleva et al., ‘High and low BMI increase the risk of miscarriage after IVF/ICSI and FET’, Human Reproduction, vol. 23, no. 4, pp. 878–884, Apr. 2008, doi: 10.1093/humrep/den017.
38. N. Stamatopoulos et al., ‘Prediction of subsequent miscarriage risk in women who present with a viable pregnancy at the first early pregnancy scan’, Australian and New Zealand Journal of Obstetrics and Gynaecology, vol. 55, no. 5, pp. 464–472, 2015, doi: 10.1111/ajo.12395.
39. S. Chakrabarti and A. Chakrabarti, ‘Food taboos in pregnancy and early lactation among women living in a rural area of West Bengal’, J Family Med Prim Care, vol. 8, no. 1, pp. 86–90, Jan. 2019, doi: 10.4103/jfmpc.jfmpc_53_17.
40. C. Bottomley and T. Bourne, ‘Diagnosing miscarriage’, Best Practice & Research Clinical Obstetrics & Gynaecology, vol. 23, no. 4, pp. 463–477, Aug. 2009, doi: 10.1016/j.bpobgyn.2009.02.004.
41. S. Mastrodima, R. Akolekar, G. Yerlikaya, T. Tzelepis, and K. H. Nicolaides, ‘Prediction of stillbirth from biochemical and biophysical markers at 11–13 weeks’, Ultrasound in Obstetrics & Gynecology, vol. 48, no. 5, pp. 613–617, 2016, doi: 10.1002/uog.17289.
42. S. Kumari, J. Roychowdhury, and S. Biswas, ‘Prediction of early pregnancy failure by use of first trimester ultrasound screening’, International Journal of Reproduction, Contraception, Obstetrics and Gynecology, vol. 5, no. 7, pp. 2135–2141, Jul. 2016.
43. S. Kumari, J. Roychowdhury, and S. Biswas, ‘Prediction of early pregnancy failure by use of first trimester ultrasound screening’, International Journal of Reproduction, Contraception, Obstetrics and Gynecology, vol. 5, no. 7, pp. 2135–2141, Jul. 2016.

No competing interests reported.


Editorial decision: Major revision
14 Aug, 2022
Reviews received at journal
04 Jul, 2022
Reviewers agreed at journal
04 Jul, 2022
Reviewers agreed at journal
02 Jul, 2022
Reviewers invited by journal
30 Jun, 2022
Editor assigned by journal
07 May, 2022
Submission checks completed at journal
07 May, 2022
First submitted to journal
04 May, 2022
