4.3. Prediction process steps in Databricks Spark
As shown in Fig. 2 and Fig. 3, all gathered data are sent to a database server for storage and are then retrieved by the Big Data platform in real time. Once the data are available in Databricks Spark Cloud, we build a predictive model to predict whether a pregnant woman is likely to have a miscarriage or not. To build the K-means model using Spark, we follow the steps below:
a. Transformation
We first convert the data to an RDD of vectors so that it can be passed to the K-means algorithm, as shown in Fig. 4.
- AllDF represents the dataset.
- r.getInt(1) … r.getInt(15) represent the following attributes: Age: Int, BMI: Int, Nmisc: Int, Activity: Int, Biking: Int, Walking: Int, Driving: Int, Sitting: Int, Location: Int, temp: Int, bpm: Int, stress: Int, bp: Int, alcohol: Int. Table 5 lists Drunk as the fifteenth feature.
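The transformation step can be illustrated with a minimal sketch in plain Python rather than Spark. It assumes that each record has an extra leading column (index 0, treated here as a record identifier, which is an assumption) followed by the fifteen integer attributes, and that the fifteenth attribute is Drunk as listed in Table 5. The sample rows and the name row_to_vector are hypothetical.

```python
from typing import List, Sequence

# Attribute order as given in the text; "Drunk" is taken from Table 5.
FEATURES = [
    "Age", "BMI", "Nmisc", "Activity", "Biking", "Walking", "Driving",
    "Sitting", "Location", "temp", "bpm", "stress", "bp", "alcohol", "Drunk",
]


def row_to_vector(row: Sequence[int]) -> List[float]:
    """Skip column 0 and return columns 1..15 as a dense list of floats."""
    if len(row) != len(FEATURES) + 1:
        raise ValueError(f"expected {len(FEATURES) + 1} columns, got {len(row)}")
    return [float(value) for value in row[1:]]


# Hypothetical records: column 0 is assumed to be an identifier.
rows = [
    (101, 22, 17, 1, 2, 0, 1, 0, 0, 3, 37, 119, 1, 2, 475, 2),
    (102, 22, 17, 3, 4, 1, 0, 0, 1, 0, 39, 167, 3, 3, 659, 2),
]

vectors = [row_to_vector(r) for r in rows]
```

Each resulting vector has one position per feature, in the same order as Table 5, which is the shape a clustering algorithm expects.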
b. Train the model
To train the K-means algorithm (see Fig. 5), we pass several parameters: Data: “Vectors”, Number of clusters: “3”, Number of iterations: “400”.
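To make the role of the three parameters concrete, the following is a minimal sketch of K-means training (Lloyd's algorithm) in plain Python. It is not the Spark implementation used in the study; the function names, the seed, and the two-feature sample data are hypothetical and chosen only for illustration.

```python
import random
from typing import List

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point: Vector, centers: List[Vector]) -> int:
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def train_kmeans(data: List[Vector], k: int, max_iterations: int, seed: int = 0) -> List[Vector]:
    """Return k cluster centers after at most max_iterations refinement passes."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]  # random initial centers
    for _ in range(max_iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for point in data:
            groups[closest(point, centers)].append(point)
        # Each new center is the mean of its group; an empty group keeps its old center.
        new_centers = [
            [sum(col) / len(group) for col in zip(*group)] if group else centers[i]
            for i, group in enumerate(groups)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return centers


# Hypothetical two-feature vectors (temp, bpm), for illustration only.
data = [
    [36.0, 70.0], [36.0, 75.0], [37.0, 72.0],
    [38.0, 120.0], [38.0, 125.0], [39.0, 118.0],
    [40.0, 200.0], [41.0, 210.0], [40.0, 205.0],
]

centers = train_kmeans(data, k=3, max_iterations=400)
labels = [closest(p, centers) for p in data]
```

The iteration limit acts as an upper bound: the loop stops earlier once the centers no longer move.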
c. Make predictions
We predict the result using the predict function, which takes a dense vector of attributes, as described in Fig. 6. r._1 to r._15 represent the following attributes: Age: Int, BMI: Int, Nmisc: Int, Activity: Int, Biking: Int, Walking: Int, Driving: Int, Sitting: Int, Location: Int, temp: Int, bpm: Int, stress: Int, bp: Int, alcohol: Int.
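For a K-means model, prediction amounts to assigning a vector to its nearest cluster center. The sketch below shows this in plain Python under stated assumptions: the centers are hypothetical values over two features (temp, bpm) rather than the fifteen attributes, and the function name predict only mirrors the step described above.

```python
from typing import List

Vector = List[float]


def predict(point: Vector, centers: List[Vector]) -> int:
    """Return the index of the cluster whose center is nearest to the point."""
    def squared_distance(center: Vector) -> float:
        return sum((x - c) ** 2 for x, c in zip(point, center))

    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))


# Hypothetical cluster centers over (temp, bpm); not values from the study.
centers = [[36.0, 80.0], [38.0, 120.0], [40.0, 200.0]]

cluster_a = predict([37.0, 119.0], centers)  # nearest to the second center
cluster_b = predict([36.0, 79.0], centers)   # nearest to the first center
```

The returned value is only a cluster index; which cluster corresponds to a probable miscarriage has to be decided separately from the cluster contents.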
d. Send results
Spark sends the prediction results to doctors through the mobile application “e-preg monitoring” so that they can act in time in case of a probable miscarriage. Patients, in contrast, receive only recommendations, to avoid psychological problems.
4.4. Experiment Results and Discussion
a. Model efficiency
In a healthcare system, response time is a crucial metric because a person's life may be at stake. In our study, we train the model with different values of K to compare its efficiency in terms of the time to train the model, the time to create clusters, the time to define centroids and, finally, the time to evaluate the model (see Table 4).
Table 4
Performance parameters of k-means.
| Parameters | K = 1 | K = 2 | K = 3 | K = 4 |
|---|---|---|---|---|
| Time to build the model (s) | 1.28 | 1.69 | 2.85 | 3.80 |
| Time for cluster distribution (s) | 0.69 | 0.96 | 0.95 | 1.08 |
| Time for center definition (s) | 0.35 | 0.62 | 0.40 | 0.49 |
| Time for model evaluation (s) | 47.65 | 48.52 | 49.89 | 48.28 |
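Per-stage timings of this kind can be collected with a small wall-clock helper. The sketch below is an illustration in plain Python, not the measurement code used in the study: it times only a cluster-assignment pass for K = 1 to 4 on synthetic points, and the names timed and assign as well as the data are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def timed(fn: Callable, *args) -> Tuple[object, float]:
    """Run fn(*args) and return its result together with the elapsed seconds."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def assign(points: List[Vector], centers: List[Vector]) -> List[int]:
    """Assign every point to the index of its nearest center."""
    def nearest(point: Vector) -> int:
        return min(
            range(len(centers)),
            key=lambda i: sum((x - c) ** 2 for x, c in zip(point, centers[i])),
        )

    return [nearest(p) for p in points]


# Synthetic data, for illustration only.
points = [[float(i % 50), float(i % 7)] for i in range(2000)]

timings: Dict[int, float] = {}
for k in range(1, 5):
    centers = points[:k]  # placeholder centers: the first k points
    _, timings[k] = timed(assign, points, centers)

for k, seconds in timings.items():
    print(f"K = {k}: {seconds:.4f} s")
```

The same helper can wrap any other stage (model building, center definition, evaluation) to fill in the remaining rows of such a table.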
c. K-means scatter plot
A scatter plot is a useful graph that shows the distribution of clusters and the correlation between features, and helps foresee trends. Because of the size of the dataset and the large number of features, the full scatter plot cannot be visualized: Databricks Community Platform only offers a scatter plot of the first 1000 rows. Thus, not all clusters are visible in the plot, because the first rows vary little. Figure 7 shows a scatter plot from a previous study with two clusters. In that plot, variation between clusters is visible because the dataset is large and its first 1000 rows already vary across clusters. Table 5 and Table 6 give, respectively, the meaning of the features and a summary of the features.
Table 5
| Feature | Meaning | Feature | Meaning |
|---|---|---|---|
| Feature 0 | Age | Feature 7 | Sitting |
| Feature 1 | BMI | Feature 8 | Location |
| Feature 2 | Nmisc | Feature 9 | temp |
| Feature 3 | Activity | Feature 10 | bpm |
| Feature 4 | Biking | Feature 11 | stress |
| Feature 5 | Walking | Feature 12 | bp |
| Feature 6 | Driving | Feature 13 | alcohol |
| | | Feature 14 | Drunk |
Table 6
| Features / Summary | Mean | Stddev | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| Age | 22.000 | 0.000 | 22 | 22 | 22 | 22 | 22 |
| BMI | 17.000 | 0.000 | 17 | 17 | 17 | 17 | 17 |
| Nmisc | 1.558 | 1.134 | 0 | 1 | 2 | 3 | 3 |
| Activity | 2.000 | 1.500 | 0 | 1 | 2 | 3 | 4 |
| Biking | 0.510 | 0.500 | 0 | 0 | 1 | 1 | 1 |
| Walking | 0.485 | 0.500 | 0 | 0 | 0 | 1 | 1 |
| Driving | 0.005 | 0.070 | 0 | 0 | 0 | 0 | 1 |
| Sitting | 0.512 | 0.500 | 0 | 0 | 1 | 1 | 1 |
| Location | 2.005 | 1.409 | 0 | 1 | 2 | 3 | 4 |
| Temp | 37.782 | 1.788 | 35 | 36 | 38 | 39 | 41 |
| Bpm | 129.157 | 56.093 | 43 | 79 | 119 | 167 | 238 |
| Stress | 1.134 | 1.114 | 0 | 0 | 1 | 2 | 3 |
| Bp | 1.522 | 1.099 | 0 | 0 | 2 | 2 | 3 |
| Alcohol | 476.539 | 216.925 | 100 | 295 | 475 | 659 | 854 |
| Drunk | 1.891 | 0.413 | 0 | 2 | 2 | 2 | 2 |
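The columns of such a summary (mean, standard deviation, minimum, quartiles, maximum) can be computed for one feature as in the sketch below. It is plain Python under stated assumptions: the sample standard deviation and nearest-rank percentiles are used, which may differ from how the platform computes them, and the sample values are hypothetical rather than taken from the dataset.

```python
import math
import statistics
from typing import Dict, List


def nearest_rank(ordered: List[float], percent: float) -> float:
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(values: List[float]) -> Dict[str, float]:
    """Summary statistics for one feature; needs at least two values."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "stddev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "25%": nearest_rank(ordered, 25),
        "50%": nearest_rank(ordered, 50),
        "75%": nearest_rank(ordered, 75),
        "max": ordered[-1],
    }


# Hypothetical bpm readings, for illustration only.
bpm = [95.0, 60.0, 110.0, 72.0, 80.0]
summary = summarize(bpm)
```

A standard deviation of zero, as for Age and BMI above, indicates that the feature takes a single value across the dataset and therefore cannot help separate clusters.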
d. Results transmission for doctors
Once the predictive model determines the clusters, the results are transmitted to doctors through a mobile application we created. In their profile, doctors can list their patients, track each pregnant woman and receive the miscarriage prediction results in real time (see Fig. 8). Thus, in case of a high probability of miscarriage, doctors can ask the woman to come for a consultation to avoid unwanted outcomes.
e. Recommendations for patients
During pregnancy, a woman experiences emotional imbalance due to changes in hormones and in her body. Sending bad news is not appropriate, since it can cause psychological problems. Therefore, we chose to send only recommendations based on her behavior during pregnancy, as shown in Fig. 8. For example, if we notice that the pregnant woman spends most of her time in snack bars and restaurants, the system notifies her with a message: “Please eat healthy foods! You are pregnant and you should take care of your future baby nutrition”. Pregnant women should eat healthy foods, preferably those made at home.
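A recommendation rule of this kind can be sketched as a simple threshold on observed locations. The example below is an illustration in plain Python, not the system's actual logic: the location code for snack bars and restaurants, the 50% threshold standing in for "most of her time", and the function name recommend are all assumptions. Only the message text comes from the description above.

```python
from typing import Optional, Sequence

RESTAURANT = 3   # hypothetical Location code for snack bars and restaurants
MAJORITY = 0.5   # assumed threshold for "most of her time"

MESSAGE = (
    "Please eat healthy foods! You are pregnant and you should take care "
    "of your future baby nutrition"
)


def recommend(locations: Sequence[int]) -> Optional[str]:
    """Return the nutrition message if most observations are at restaurants."""
    if not locations:
        return None
    share = sum(1 for loc in locations if loc == RESTAURANT) / len(locations)
    return MESSAGE if share > MAJORITY else None


# Hypothetical location readings for two patients.
mostly_restaurants = recommend([3, 3, 3, 1])  # 75% of readings at restaurants
mostly_elsewhere = recommend([0, 1, 3, 2])    # 25% of readings at restaurants
```

Returning no message when the threshold is not met keeps the patient-facing channel limited to recommendations, in line with the choice not to send predictions to patients.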