This is an empirical study of data exported from the hospital information system (HIS) of the only designated hospital in Shanghai for treating AIDS patients, covering August 2009 to December 2019. The data were used to train and test an RFM model in order to identify predictor variables for adherence in AIDS patients.
Data Extraction And Preprocessing
The data of AIDS outpatients from August 1, 2009, to December 31, 2019, were exported from the HIS of the research unit following the methods described in the literature [25–26, 29–30]. The fields included the consultation time, patient’s identification card number, gender, age, place of residence (local/non-local) and medical costs, for a total of 257305 data elements (16440 patients). Public hospitals in Shanghai, China, implement a real-name system in which the actual name of the patient is used during consultation, so the identification card number of the patient is an essential field. SPSS 22.0 and SPSS Modeler 18.0 were used for data analysis.
The data were cleaned, and the following fields were expanded by the methods used in the literature [25–26, 30]: (1) the consultation time field was expanded to “recent consultation month”, with December 2019 as the first month, November 2019 as the second month and so on until August 2009 as the 125th month; (2) the costs accumulated under each patient’s identification card number were used to calculate the “total medical costs” field; (3) the consultation times accumulated under each patient’s identification card number were counted to obtain the “consultation frequency” field; (4) the “total medical costs” field of each patient was divided by the “consultation frequency” field to obtain the “average medical costs per visit” field. After this expansion, each data element represents one patient.
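The four expansion steps can be illustrated with a minimal Python sketch. The study itself used SPSS, so this is not the authors' procedure. The tuple layout `(patient_id, visit_date, cost)`, the function names and the sample values are all hypothetical, and the "recent consultation month" is assumed to be the month index of each patient's latest visit.

```python
from collections import defaultdict
from datetime import date

# December 2019 is counted as month 1, as stated in the text.
REFERENCE_YEAR, REFERENCE_MONTH = 2019, 12

def month_index(visit: date) -> int:
    """Count months backwards: December 2019 -> 1, August 2009 -> 125."""
    return (REFERENCE_YEAR - visit.year) * 12 + (REFERENCE_MONTH - visit.month) + 1

def build_patient_records(visits):
    """Collapse visit-level rows into one record per patient.

    `visits` is an iterable of (patient_id, visit_date, cost) tuples.
    This row layout is an assumption, not the actual HIS export format.
    """
    grouped = defaultdict(list)
    for patient_id, visit_date, cost in visits:
        grouped[patient_id].append((visit_date, cost))

    records = {}
    for patient_id, rows in grouped.items():
        frequency = len(rows)                       # consultation frequency
        total_cost = sum(cost for _, cost in rows)  # total medical costs
        records[patient_id] = {
            # Assumed: the smallest index marks the most recent visit.
            "recent_month": min(month_index(d) for d, _ in rows),
            "frequency": frequency,
            "total_cost": total_cost,
            "avg_cost_per_visit": total_cost / frequency,
        }
    return records

# Made-up visits for two hypothetical patients.
sample_visits = [
    ("P001", date(2019, 12, 3), 1200.0),
    ("P001", date(2019, 6, 15), 800.0),
    ("P002", date(2009, 8, 20), 500.0),
]
patient_records = build_patient_records(sample_visits)
```

Grouping by the identification number first keeps the calculation to a single pass over the visit rows, with one output record per patient.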
Variable Generation And Descriptive Statistical Analysis
Seven variables (recent consultation month, gender, age, total medical costs, consultation frequency, average medical costs per visit and place of residence) were generated. The factors of gender, age and place of residence were analyzed statistically to investigate good or poor acceptability. The remaining four variables (recent consultation month, total medical costs, consultation frequency and average medical costs per visit) were then used to describe the 16440 data elements, as described in the literature [25–26, 29–30].
Finding The Optimal RFM Or RFm Model, Clustering Analysis And Decision Algorithm
In this experiment, we tested the RFM and RFm models with several clustering methods and decision algorithms to determine the best components for constructing and evaluating the adherence prediction model. We followed the methods in the literature [25–26, 29] and applied the RFM model theory as follows: (1) the three fields of recent consultation month, consultation frequency and total medical costs were used for the RFM model; (2) the three fields of recent consultation month, consultation frequency and average medical costs per visit were used for the RFm model [25, 29]. Three clustering methods (K-means, Kohonen and two-step clustering) were used to construct the clustering models, and four decision algorithms (C5.0, classification and regression tree (CART), Chi-square Automatic Interaction Detector (CHAID) and Quick, Unbiased, Efficient, Statistical Tree (QUEST)) were applied to each clustering model to construct several preliminary prediction models. From these models, we determined the optimal RFM(m) model, clustering method and decision algorithm based on the quality of the model and the stability of the important predictor variables; these components were then used in the adherence prediction model experiment.
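To make the difference between the RFM and RFm inputs concrete, the sketch below clusters the same patients on each set of three fields. It is an illustration under assumptions, not the study's workflow: the study used SPSS Modeler, whereas this is a plain-Python version of Lloyd's K-means with z-score standardization added. The patient values, the choice of two clusters and all names are hypothetical. Kohonen clustering, two-step clustering and the four decision algorithms are not shown.

```python
import math
import random

def standardize(rows):
    """Z-score each column so no single field dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Made-up patients: (recent_month, frequency, total_cost, avg_cost_per_visit)
patients = [
    (1, 40, 40000.0, 1000.0),
    (2, 36, 39600.0, 1100.0),
    (1, 44, 41800.0, 950.0),
    (90, 2, 2400.0, 1200.0),
    (100, 3, 3000.0, 1000.0),
    (110, 1, 900.0, 900.0),
]
# RFM uses total costs as the monetary field; RFm uses average cost per visit.
rfm_features = [[r, f, total] for r, f, total, _ in patients]
rfm_small_features = [[r, f, avg] for r, f, _, avg in patients]

rfm_labels = kmeans(standardize(rfm_features), k=2)
rfm_small_labels = kmeans(standardize(rfm_small_features), k=2)
```

The two label lists can then be compared to see how the choice of monetary field changes the grouping before any decision algorithm is fitted.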
Validating The Adherence Prediction Model And Obtaining The Variables
In this experiment, we used the optimal RFM(m) model and the methods identified in the previous experiment to construct the best clustering model, separate patients with good adherence from those with poor adherence and identify important variables for adherence prediction. The methods from the literature were used as a reference, and the optimal decision algorithm was employed with good and poor adherence as the targets. The important predictor variables in the best clustering model were used as the input variables, and the data underwent randomization and binning. We used 90% of the data as the training set to construct the adherence prediction model and the remaining 10% as the test set to validate the model and obtain the final adherence predictor variables.
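The 90%/10% partition can be sketched as a seeded shuffle followed by a cut. This covers only the random split, not the binning step or the decision algorithm, and it is not how SPSS Modeler partitions data internally. The function name, the seed and the placeholder patient identifiers are hypothetical.

```python
import random

def split_train_test(records, train_fraction=0.9, seed=42):
    """Shuffle a copy of `records` and cut it into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

# Placeholder identifiers standing in for the 16440 patient-level records.
patient_ids = [f"P{n:05d}" for n in range(16440)]
train_ids, test_ids = split_train_test(patient_ids)
```

With 16440 records, this yields 14796 training records and 1644 test records, and no record appears in both sets.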