Descriptive statistical analysis of preprocessed data
Data preprocessing yielded 16440 data elements, each representing one patient. Among these 16440 AIDS patients, the mean recent consultation month was 14.99 months (median, 3 months). The mean total medical cost was 11315.29 RMB (China’s currency; median, 8740.51 RMB). The mean medical cost per visit was 918.01 RMB (median, 644.47 RMB). The mean consultation frequency was 15.65 visits (median, 13 visits). All four variables were positively skewed: each mean exceeded its median, and every skewness value was positive. The results are shown in Table 1. Because adherence (good versus poor) did not differ significantly by gender, place of residence, or age (see Supplemental Table 6), only descriptive statistical analysis was conducted.
Table 1. Consultation status of AIDS patients in 2009–2019
| Variable¹ | Samples | Minimum (M) | Maximum (X) | Mean (E) | Standard deviation | Median | Skewness | Kurtosis | Quartile(0) | Quartile(25) | Quartile(50) | Quartile(75) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Recent consultation month (month)² | 16440 | 1 | 125 | 14.99 | 23.58 | 3 | 2.111 | 4.24 | 0 | 1 | 3 | 20 |
| Total medical cost (RMB)³ | 16440 | 1 | 666737.98 | 11315.29 | 17503.54 | 8740.51 | 16.88 | 496.76 | 0 | 2807.5475 | 8740.51 | 16973.0025 |
| Average medical cost per visit (RMB)⁴ | 16440 | 1 | 28270 | 918.01 | 1107.12 | 644.47 | 4.98 | 57.29 | 0 | 444.4842 | 644.4745 | 831.4121 |
| Consultation frequency (visits)⁵ | 16440 | 1 | 397 | 15.65 | 14.90 | 13 | 3.58 | 52.43 | 0 | 3 | 13 | 25 |
Note:
1. Variable: RFM(m) model variables.
2. Recent consultation month (month): Recency (R).
3. Total medical cost (RMB): Monetary (M).
4. Average medical cost per visit (RMB): Monetary (m).
5. Consultation frequency (visits): Frequency (F).
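The skewness check behind Table 1 can be illustrated with a minimal sketch. The `describe` helper and the sample values below are hypothetical and are not taken from the study data. The moment-based skewness formula is an assumption and may differ from the bias-adjusted estimator that a statistical package reports.

```python
import statistics

def describe(values):
    """Summarize a sample with statistics like those in Table 1."""
    n = len(values)
    mean = statistics.fmean(values)
    q25, q50, q75 = statistics.quantiles(values, n=4, method="inclusive")
    # Population central moments; skewness = m3 / m2**1.5 (assumed formula).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return {
        "mean": mean,
        "sd": statistics.stdev(values),      # sample standard deviation
        "median": statistics.median(values),
        "quartiles": (q25, q50, q75),
        "skewness": m3 / m2 ** 1.5,
    }

# Hypothetical recency values in months: most patients are recent, a few lapsed.
recency_months = [1, 1, 1, 2, 2, 3, 3, 4, 20, 60]
summary = describe(recency_months)
print(summary)
```

A few large values pull the mean well above the median and give a positive skewness, which is the same pattern the four variables show in Table 1.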
The optimal RFM(m) model, clustering analysis and decision algorithm
The three markers of the RFM model were first used as variables to construct 13 models (5 clustering models and 8 prediction models); however, the predictor variables were unstable. The three markers of the RFm model were then used to construct 27 models (7 clustering models and 20 prediction models); here the predictor variables were stable and the model was robust. Across both variable sets, clustering analysis was used to construct 12 models, of which k-means clustering was the most robust, and the decision algorithms were used to construct 28 models, of which the C5.0 algorithm was robust and had high prediction accuracy. The RFm model, k-means clustering analysis, and the C5.0 algorithm were therefore selected as optimal. The results are shown in Table 2.
Table 2. Preliminary experiment on RFM(m) models, clustering analysis, and decision algorithms in 16440 valid datasets obtained after cleaning
| Model type | Clustering type | Number of models constructed | Model quality⁵ | Predictor variable importance (clustering analysis) | Predictor variable importance: C5.0 | Predictor variable importance: CHAID | Predictor variable importance: CART | Predictor variable importance: QUEST | Prediction model accuracy (%): C5.0⁶ | Prediction model accuracy (%): CHAID | Prediction model accuracy (%): CART | Prediction model accuracy (%): QUEST |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RFM model¹ | K-Means | 1st round | 0.8 | R=1, M=1, F=1 | R=0.9865, M=0.0067, F=0.0067 | R=0.7402, M=0.2547, F=0.0052 | R=0.9697, M=0.0152, F=0.0152 | R=0.9689, M=0.0156, F=0.0156 | 99.96 | 93.55 | 99.77 | 97.07 |
| | | 2nd round | 0.9 | R=1, M=1, F=1 | M=1 | M=0.7039, F=0.2961 | _ | _ | 99.98 | 99.9 | _ | _ |
| | | 3rd round | 0.7 | R=1, M=1, F=1 | M=0.5, F=0.5 | M=1 | _ | _ | 99.98 | 99.93 | _ | _ |
| | Two-step clustering | 1st round | 0.4 | R=1, M=1, F=1 | _ | _ | _ | _ | _ | _ | _ | _ |
| | Kohonen | 1st round | 0.4 | R=1, M=1, F=1 | _ | _ | _ | _ | _ | _ | _ | _ |
| RFM model² | K-Means³ | 1st round | 0.5 | R=1, M=1, F=1 | R=0.6735, M=0.017, F=0.3095 | R=0.6661, M=0.0429, F=0.2910 | R=0.7172, M=0.0020, F=0.2808 | R=0.7479, M=0.0017, F=0.2503 | 99.88 | 90.38 | 97.43 | 95.86 |
| | | 2nd round | 0.8 | R=1, M=1, F=1 | R=0.5918, M=0.4043, F=0.0039 | R=0.5938, M=0.1768, F=0.2293 | R=0.5814, M=0.4129, F=0.0057 | R=0.5962, M=0.3999, F=0.0039 | 99.96 | 96.66 | 98.06 | 98.33 |
| | | 3rd round | 0.6 | R=1, M=1, F=0.37 | R=0.7258, M=0.0451, F=0.1841 | R=0.7708, M=0.2728, F=0.0014 | R=0.971, M=0.0268, F=0.0022 | R=0.6749, M=0.3245, F=0.0007 | 99.82 | 97.48 | 97.48 | 98.25 |
| | Two-step clustering⁵ | 1st round | 0.7 | R=1, M=1, F=1 | R=0.6864, M=0.2090, F=0.1046 | R=0.6607, M=0.2962, F=0.0431 | R=0.6283, M=0.2582, F=0.1135 | R=0.5797, M=0.2624, F=0.1579 | 99.87 | 96.9 | 98.41 | 97.32 |
| | | 2nd round | 0.7 | R=1, M=0.15, F=0.01 | R=0.7058, M=0.2942 | R=0.8428, M=0.1572 | R=0.7351, M=0.2649 | R=0.7398, M=0.2602 | 98.61 | 95.45 | 98.04 | 97.7 |
| | | 3rd round | 0.4 | R=1, M=1, F=1 | _ | _ | _ | _ | _ | _ | _ | _ |
| | Kohonen | 1st round | 0.4 | R=1, M=1, F=1 | _ | _ | _ | _ | _ | _ | _ | _ |
Note: R, Recency; F, Frequency; M, Monetary. Values lie in the 0–1 range.
1. The predictor variables of the RFM model were either unstable or could not be used for modeling in the clustering analysis and decision tree algorithms. M: total medical cost.
2. The predictor variables of the RFm model were stable in the decision algorithms. m: average medical cost per visit.
3. The k-means clustering model was robust.
4. With the C5.0 algorithm, the accuracy of two-step clustering was lower than that of the k-means clustering model, and the model quality in the third round was low.
5. "Model quality" in the model output is an indicator of the quality of the model built.
6. The C5.0 algorithm prediction model had an accuracy of approximately 99%.
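The difference between M (total medical cost) and m (average medical cost per visit) can be made concrete with a small sketch that derives R, F, M and m from raw visit records. The record layout `(patient_id, months_since_visit, cost_rmb)`, the function name `rfm_features`, and the sample visits are all hypothetical; the source does not describe how the variables were extracted.

```python
from collections import defaultdict

def rfm_features(visits):
    """Compute R, F, M and m per patient.

    visits: iterable of (patient_id, months_since_visit, cost_rmb) tuples
    (an assumed layout, used only for illustration).
    """
    grouped = defaultdict(list)
    for patient_id, months_ago, cost in visits:
        grouped[patient_id].append((months_ago, cost))

    features = {}
    for patient_id, rows in grouped.items():
        frequency = len(rows)                      # F: number of visits
        total_cost = sum(cost for _, cost in rows)  # M: total medical cost
        features[patient_id] = {
            "R": min(months for months, _ in rows),  # months since latest visit
            "F": frequency,
            "M": total_cost,
            "m": total_cost / frequency,             # average cost per visit
        }
    return features

# Hypothetical visit records for two patients.
visits = [
    ("p1", 1, 600.0), ("p1", 4, 700.0), ("p1", 7, 650.0),
    ("p2", 30, 5000.0),
]
features = rfm_features(visits)
print(features)
```

By construction M accumulates with every visit, whereas m is normalized by the visit count, so m describes cost intensity independently of how often a patient attends.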
The adherence prediction model and variables
After selecting the optimal model, clustering algorithm, and decision algorithm, we used R, F, and m from the RFm model as the variables for three rounds of k-means clustering analysis and then used the C5.0 algorithm to construct and validate the adherence prediction model. The first round of k-means clustering on the 16440 data elements was used to remove the data of patients who had not visited within 24 months, leaving 11585 elements. The second round removed the data of patients who had not visited within eight months, leaving 10614 elements. The third round gave the best model quality (0.8) and produced 5 clusters representing 5 types of patients. Among these 10614 elements, 9803 (recent consultation month ≤ 3 months) were patients with good adherence and 811 (recent consultation month > 3 months) were patients with poor adherence. Two important predictor variables were also obtained: the recent consultation month and the average medical cost per visit. The results are shown in Table 3 and Table 4.
Table 3. Results of cleaned 16440 datasets after three rounds of k-means clustering analysis
| Samples | Number of constructed models¹ | Clustering number (type)² | Model quality³ | Predictor variable importance |
| --- | --- | --- | --- | --- |
| 16440 | 1st round | 6 | 0.5 | R=1, M=1, F=1 |
| 11585 | 2nd round | 6 | 0.7 | R=1, M=1, F=1 |
| 10614 | 3rd round | 5 | 0.8 | R=1, M=1, F=0.0563 |
Note: R, Recency; F, Frequency; M, Monetary. Values lie in the 0–1 range.
1. Model quality was highest after three rounds of modeling.
2. The best number of clusters was 6 in the first round, 6 in the second round, and 5 in the third round.
3. "Model quality" in the model output is an indicator of the quality of the model built.
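The round-by-round filtering described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it uses a plain Lloyd's k-means with a deterministic start, and it assumes that a round "removes" patients by dropping every cluster whose mean recency exceeds a threshold (24 months in the first round, 8 months in the second). The helper names `standardize`, `kmeans`, and `drop_lapsed`, and the toy patients, are hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column so no single RFm variable dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [tuple((v - mu) / sd for v, mu, sd in zip(r, means, sds)) for r in rows]

def kmeans(points, k, iters=50):
    """Lloyd's algorithm; initial centers are evenly spaced in sorted order (k >= 2)."""
    ordered = sorted(points)
    centers = [ordered[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster becomes empty.
            new_centers.append(
                tuple(statistics.fmean(col) for col in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def drop_lapsed(patients, k, max_recency):
    """One round: cluster (R, F, m) rows, then drop clusters with mean R above max_recency."""
    labels = kmeans(standardize(patients), k)
    kept = []
    for c in sorted(set(labels)):
        members = [p for p, lab in zip(patients, labels) if lab == c]
        if statistics.fmean([r for r, _, _ in members]) <= max_recency:
            kept.extend(members)
    return kept

# Hypothetical patients as (R in months, F in visits, m in RMB).
recent = [(1, 20, 650.0), (2, 22, 680.0), (3, 21, 690.0), (2, 19, 660.0)]
lapsed = [(40, 2, 900.0), (60, 1, 1200.0), (50, 3, 1000.0)]
kept = drop_lapsed(recent + lapsed, k=2, max_recency=24)
print(len(kept))
```

Repeating `drop_lapsed` with a tighter threshold on the surviving patients mirrors the shrinking sample sizes in Table 3, although the cluster counts and quality scores there come from the authors' own software.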
Table 4. Clustering map of five types of patients (10614 datasets) after three rounds of k-means clustering analysis
| Cluster | cluster-4 | cluster-1 | cluster-3 | cluster-2 | cluster-5 |
| --- | --- | --- | --- | --- | --- |
| Label | good adherence | good adherence | good adherence | poor adherence | poor adherence |
| Sample (patient ratio, %) | 4313 (40.6%) | 2966 (27.9%) | 2406 (22.7%) | 811 (7.6%) | 118 (1.1%) |
| Input: average medical costs per visit | 651.76 | 678.68 | 682.59 | 831.28 | 3733.77 |
| Input: Recency | 1.00 | 2.00 | 3.00 | 4.62 | 1.47 |
| Input: Frequency | 20.75 | 21.63 | 20.93 | 16.36 | 16.67 |
AIDS patients in the study unit can collect drugs free of charge from the designated hospital once every three months, and Tarokh MJ [29] recommends treating three months as one consultation cycle. We therefore classified patients as having good adherence if the recent consultation month was between one and three months. Accordingly, the patients in Clusters 1, 3, 4 and 5 were classified as having good adherence. The patients in Cluster 5 had other underlying diseases, so their average medical cost per visit was relatively high. The patients in Cluster 2, who had not attended a consultation for more than four months, were classified as having poor adherence. The C5.0 algorithm was then applied with adherence (good or poor) as the target and the recent consultation month and average medical cost per visit as the input variables. Validation of the adherence prediction model showed that the recent consultation month was the model's only split node: if recency ≤ 3 months, the patient falls under Model 1 (good adherence); if recency > 3 months, the patient falls under Model 2 (poor adherence). Thus, the recent consultation month was the only important predictor variable. The accuracy of the prediction model was 100%. The results are shown in Table 5.
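To illustrate how a decision tree can end up with recency as its single node, the sketch below searches for the best binary split by information gain. This is a simplified stand-in, assumed for illustration only: it is not the C5.0 implementation (which uses a gain-ratio style criterion plus pruning and boosting options), and the six patients are hypothetical. Labels are assigned with the three-month rule stated above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the best split x <= threshold."""
    base = entropy(labels)
    best = (0.0, None, None)
    for f in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if gain > best[0]:
                best = (gain, f, t)
    return best

# Hypothetical patients: (recency in months, average cost per visit in RMB).
rows = [(1, 650.0), (2, 680.0), (3, 685.0), (1, 3700.0), (4, 830.0), (6, 820.0)]
labels = ["good" if recency <= 3 else "poor" for recency, _ in rows]

gain, feature, threshold = best_split(rows, labels)
print(feature, threshold, round(gain, 3))
```

On this toy data the labels are derived from recency alone, so the split on feature 0 at a threshold of 3 removes all uncertainty, while no threshold on cost separates the classes (the high-cost patient is still adherent).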
Table 5. C5.0 algorithm analysis results
| Item | Data binning | Node/model¹ | Predictor variable importance | Prediction model accuracy (%) | Amount of data (sets) indicating good/poor adherence |
| --- | --- | --- | --- | --- | --- |
| Training set | 90% data | R ≤ 3 / [Model: 1]; R > 3 / [Model: 2] | R=1 | Correct: 9582 (100%); Wrong: 0 (0); Total: 9582 (-) | Good adherence: 8846, 0; Poor adherence: 0, 736; Total: 9582 |
| Test set | 10% data | R ≤ 3 / [Model: 1]; R > 3 / [Model: 2] | R=1 | Correct: 1032 (100%); Wrong: 0 (0); Total: 1032 (-) | Good adherence: 957, 0; Poor adherence: 0, 75; Total: 1074 |
Note: 90% of the data was used as the training set to construct the adherence prediction model, and the remaining 10% was used as the test set to validate the model.
1. Node: R denotes recency. Three months was used as the node that divides R into two categories: Model 1 (R ≤ 3 months), indicating good adherence, and Model 2 (R > 3 months), indicating poor adherence.
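The 90%/10% validation scheme and the accuracy figures can be sketched as below. The helper names, the random seed, and the toy records are hypothetical. The `table5_train` counts are read from the training-set row of Table 5, on the assumption that each "good/poor adherence" entry lists correctly and incorrectly classified patients.

```python
import random

def split_train_test(items, test_fraction=0.1, seed=0):
    """Shuffle a copy of items and hold out test_fraction of them."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def predict(recency):
    """Single-node rule from Table 5: R <= 3 is Model 1, R > 3 is Model 2."""
    return "good" if recency <= 3 else "poor"

def confusion(records):
    """Count (true_label, predicted_label) pairs for (recency, true_label) records."""
    counts = {}
    for recency, truth in records:
        key = (truth, predict(recency))
        counts[key] = counts.get(key, 0) + 1
    return counts

def accuracy(counts):
    """Share of records whose predicted label equals the true label."""
    correct = sum(n for (truth, pred), n in counts.items() if truth == pred)
    return correct / sum(counts.values())

# Hypothetical records labelled with the three-month rule.
records = [(r, "good" if r <= 3 else "poor") for r in [1, 2, 3, 1, 2, 3, 1, 2, 4, 5] * 10]
train, test = split_train_test(records)
test_accuracy = accuracy(confusion(test))

# Training-set counts as read from Table 5 (assumed interpretation).
table5_train = {("good", "good"): 8846, ("poor", "poor"): 736}
train_accuracy = accuracy(table5_train)
print(len(train), len(test), test_accuracy, train_accuracy)
```

With no off-diagonal counts, the accuracy is 8846 + 736 = 9582 correct out of 9582, which matches the 100% reported for the training set.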