Data source
This study used data from the Cambodia Socio-Economic Survey (CSES) conducted between 2010 and 2017 (29-36), which are publicly accessible upon request. The CSES is a nationally representative cluster sample survey conducted annually by the National Institute of Statistics (NIS). It uses systematic sampling with probability proportional to size, based on the number of households per village retrieved from the public information source (33). The country’s 24 provinces and municipalities, at the time of the surveys, were first divided into 19 separate groups. Each group was further divided into urban and rural strata, giving a total of 38 strata. The CSES follows a three-stage sampling design, with selection at the level of primary sampling units (PSUs), enumeration areas (EAs), and households (33). Interviews were conducted with the household head, his/her spouse, or another adult household member if the head and spouse were both absent.
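As an illustration of systematic sampling with probability proportional to size, the following minimal Python sketch selects units along a cumulative list of household counts. The function name `pps_systematic_sample`, the village labels, and the household counts are hypothetical; this is not the NIS sampling procedure itself, only a generic instance of the technique.

```python
import random
from itertools import accumulate

def pps_systematic_sample(sizes, n_select, rng=None):
    """Select n_select unit indices by systematic PPS sampling.

    sizes: positive size measures (e.g., households per village).
    A unit larger than the sampling interval may be selected more than once.
    """
    rng = rng or random.Random()
    total = sum(sizes)
    interval = total / n_select            # sampling interval
    start = rng.random() * interval        # random start within the first interval
    cum = list(accumulate(sizes))          # cumulative size totals
    selected, unit = [], 0
    for k in range(n_select):
        point = start + k * interval       # k-th selection point
        # Advance to the unit whose cumulative range contains the point.
        while unit < len(cum) - 1 and cum[unit] <= point:
            unit += 1
        selected.append(unit)
    return selected

# Hypothetical villages with made-up household counts.
villages = {"A": 120, "B": 80, "C": 300, "D": 150, "E": 50}
picked = pps_systematic_sample(list(villages.values()), n_select=2,
                               rng=random.Random(1))
print([list(villages)[i] for i in picked])
```

Larger villages cover a longer stretch of the cumulative list, so a selection point is more likely to fall inside them.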
For this study, we used pooled data on 38472 households covered in the CSES 2010–2017: 3592 each in 2010 and 2011, 3840 each in 2012 and 2013, 12090 in 2014, 3839 each in 2015 and 2016, and 3840 in 2017 (29-36). The large data pool increased precision and power. Table 1 shows descriptive statistics and socioeconomic characteristics of the survey respondents.
Table 1 Descriptive statistics and socioeconomic characteristics of the survey respondents
| Year | No. of HHs | No. of individuals | HH size¹ | HH head age¹ | F-headed HHs² | HH annual consumption³ | CPI⁴ |
|---|---|---|---|---|---|---|---|
| 2010 | 3592 | 16510 | 4.6 (1.9) | 46.2 (13.9) | 22 (21-24) | 678 (666) | 100.000 |
| 2011 | 3592 | 16327 | 4.5 (1.9) | 46.8 (14.0) | 23 (21-25) | 707 (628) | 105.479 |
| 2012 | 3840 | 17644 | 4.6 (1.8) | 47.3 (13.8) | 22 (20-23) | 814 (710) | 108.572 |
| 2013 | 3840 | 17225 | 4.5 (1.8) | 47.5 (13.6) | 21 (20-23) | 895 (695) | 111.767 |
| 2014 | 12090 | 53968 | 4.5 (1.8) | 47.8 (13.8) | 22 (21-23) | 894 (718) | 116.076 |
| 2015 | 3839 | 17301 | 4.5 (1.7) | 49.2 (13.7) | 24 (22-25) | 1036 (843) | 117.493 |
| 2016 | 3839 | 16985 | 4.4 (1.8) | 49.3 (13.9) | 23 (21-24) | 1178 (910) | 121.071 |
| 2017 | 3840 | 16090 | 4.4 (1.7) | 49.2 (13.7) | 23 (21-25) | 1179 (912) | 124.572 |
| Total | 38472 | 172050 | | | | | |
Source: Cambodia Socio-Economic Survey 2010-2017 (29-36)
Notes: No: number, HH: household, F-headed: female-headed, CPI: consumer price index, 1. Mean (standard deviation), 2. Percentage (95% confidence interval), 3. Median (interquartile range) in current US dollars (1 USD = 4065.02 riels as of 23 June, 2020), 4. The 2010 base CPI (37) was used to adjust household consumption data for inflation in the analyses. The annual household consumptions in this table are unadjusted.
Analyses
An equitable health insurance contribution should be determined by one’s ability to pay, which is not simply a function of current income. Rather, it is more precisely defined as non-subsistence effective income (38). Effective income is in turn defined as the income that households behave as if they have when making consumption decisions (38). Households tend to smooth consumption over time by saving and borrowing (39), taking into account expected variations in income over the year, their assets, and future earning potential (38). Additionally, a policy paper suggested that a consumption-based measure is more relevant in lower-income settings where many households are borrowers rather than savers (40). Therefore, we used annual household consumption as the basis for estimating a household’s ability to pay. To adjust for inflation over the eight-year study period, household consumption in each year was converted to 2010 values using the consumer price index (37).
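The inflation adjustment amounts to rescaling each year's consumption by the ratio of the base-year CPI to that year's CPI. The sketch below is a minimal Python illustration using the CPI values listed in Table 1; the function name `to_2010_value` is hypothetical, and applying it to a published median is only for illustration, since the study adjusted household-level data.

```python
# CPI values (2010 = 100) as listed in Table 1.
CPI = {
    2010: 100.000, 2011: 105.479, 2012: 108.572, 2013: 111.767,
    2014: 116.076, 2015: 117.493, 2016: 121.071, 2017: 124.572,
}

def to_2010_value(nominal, year, cpi=CPI, base_year=2010):
    """Convert a nominal amount observed in `year` to base-year prices."""
    return nominal * cpi[base_year] / cpi[year]

# Illustration: the unadjusted 2017 median of 1179 USD from Table 1
# corresponds to roughly 946 USD in 2010 prices.
print(round(to_2010_value(1179, 2017), 1))
```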
Table 2 shows the household consumption aggregates, which comprise food, non-food, and housing consumption items. The CSES household questionnaire is designed to collect consumption data on cash purchases, consumption of own production, and consumption of items received in kind. We aggregated the data following the World Bank’s guideline (41, 42), the most widely referenced guideline on household consumption aggregates, but excluded consumer durables because of insufficient information.
Table 2 Composition of household consumption aggregates
| Consumption category | Items included |
|---|---|
| 1. Food consumption (20 items) | Rice/cereals, meats, dairy products, vegetables, fruits, seasonings, non-alcoholic beverages, food taken away from home, purchased meals, etc. |
| 2. Non-food consumption (18 items) | Clothing and footwear, personal care, communication, transportation, household equipment, recreation, education, domestic salaries, etc. |
| 3. Housing consumption (12 items) | Utility, house rent, maintenance of dwelling, etc. |
Source: Cambodia Socio-Economic Survey 2014 (33), Guidance for Constructing Consumption Aggregates for Welfare Analysis (41) and User's Manual for Handling Resampled Micro Data of CSES 2009 (42)
Based on the previous discussions in similar studies (11-23), 369 predictor variables were created with the CSES data. Table 3 shows a summary of the predictor variables.
Table 3 Summary of predictor variables
| Category | Predictor variables |
|---|---|
| Residential area | Province; urban/rural settings |
| Household members’ characteristics | Sex, age, ethnicity and educational level of household members; household size; dependent rate; total working hours |
| Real estate property | Number, area and use of own land; number, area, use and price value of own buildings; investment on buildings |
| Housing/living conditions | Size and construction materials of the dwelling; source of lighting; source of drinking water; type of toilet; utility charges; consumption of luxury food |
| Land use | Number, area and use of land parcels operated |
| Farming activities | Harvested land area; production; type of livestock, fishery and forestry activities |
| Durable goods | Possession, number and newness of durable goods in both urban and rural settings |
| Work | Type of employer; employment status; occupation; type of industry |
| Income and liabilities | Type of income; number and amount of loans |
| Survey year | |
The data were randomly divided into a training set and a test set at a ratio of 8:2. Subsequently, the analyses were conducted in two steps.
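A random 8:2 split can be sketched in a few lines of Python. The function name `split_households`, the seed, and the rounding rule are assumptions made for this illustration; the resulting set sizes are those of the sketch, not figures reported by the study.

```python
import random

def split_households(ids, test_fraction=0.2, seed=2020):
    """Randomly split household IDs into training and test lists."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Illustration with the pooled sample size of 38472 households.
train_ids, test_ids = split_households(range(38472))
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the partition reproducible, so every model is trained and evaluated on the same households.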
In the first step, using the training set, we constructed linear regression models that related a set of predictor variables (X) to observed household consumption (y), the value reported in the CSES, as follows:

$$y_i = \beta_0 + \sum_{k=1}^{K} \beta_k X_{ik} + \varepsilon_i$$

where $y_i$ is the observed consumption of household $i$, $X_{ik}$ is the value of the $k$-th predictor variable for that household, $\beta_0$ is the intercept, $\beta_k$ is a coefficient parameter to be estimated using the ordinary least squares (OLS) method, and $\varepsilon_i$ is the error term, which is assumed to follow the normal distribution. With new information on the predictor variables at time t + 1, the corresponding household consumption can be predicted by plugging the estimated parameters into the above equation.
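The fit-then-predict logic of the first step can be sketched with NumPy as follows. The data are synthetic and the helper names `fit_ols` and `predict_ols` are hypothetical; the study itself used Stata, so this is only a small illustration of OLS estimation followed by prediction on new predictor values.

```python
import numpy as np

def fit_ols(X, y):
    """Estimate intercept and slopes by ordinary least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])     # prepend intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def predict_ols(beta, X_new):
    """Plug estimated parameters into the linear equation."""
    return beta[0] + X_new @ beta[1:]

# Synthetic example: three predictors with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([500.0, 120.0, -40.0, 15.0])  # intercept first
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=5.0, size=200)

beta_hat = fit_ols(X, y)
X_next = rng.normal(size=(5, 3))                   # "new" predictor values
print(predict_ols(beta_hat, X_next))
```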
For the linear and mixed-effects models, we first screened all variables using partial correlation coefficients, with significance at the 90% level or higher as the cut-off (23). We then manually selected predictor variables using backward elimination to construct Model A (Manually-selected Linear Model). We also applied backward selection within a stepwise regression framework, using a significance level of 0.1 as the cut-off for removing variables (23), to construct Model B (Stepwise Linear Model). Subsequently, we constructed Model C (Mixed-effects Linear Model) with the variables retained by the stepwise selection, adding a random effect for province. To avoid overfitting, we constructed Model D with elastic net regression, which in the end operated with only the L1 penalty on the regression coefficients, that is, as least absolute shrinkage and selection operator (LASSO) regression. In addition, we made it an adaptive LASSO by adding data-dependent weights to the penalty to obtain less biased estimates. Ten-fold cross-validation was used to select the regularization parameter in the LASSO model (43, 44). We used all the available predictor variables for Model D because adaptive LASSO performs variable selection automatically, improving the prediction performance and interpretability of the model while keeping it parsimonious.
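To make the adaptive LASSO idea concrete, the following NumPy sketch fits a weighted L1-penalized regression by coordinate descent and picks the regularization parameter by ten-fold cross-validation. Several choices here are assumptions not stated in the text: the pilot estimates come from OLS, the weight exponent `gamma` is 1, no intercept is fitted, the lambda grid is arbitrary, and the data are synthetic. All function names are hypothetical, and the study's own estimation was done in Stata.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t (the L1 proximal step)."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def weighted_lasso(X, y, lam, weights, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j weights[j]*|b_j|."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float)                  # equals y - X @ b since b starts at 0
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]          # remove predictor j's contribution
            rho = X[:, j] @ resid / n
            b[j] = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            resid -= X[:, j] * b[j]          # add the updated contribution back
    return b

def adaptive_lasso(X, y, lam, gamma=1.0):
    """LASSO with data-dependent weights from OLS pilot estimates (assumed)."""
    b_init, *_ = np.linalg.lstsq(X, y, rcond=None)
    weights = 1.0 / (np.abs(b_init) ** gamma + 1e-8)
    return weighted_lasso(X, y, lam, weights)

def cv_select_lambda(X, y, lambdas, k=10, seed=0):
    """Return the lambda with the lowest mean squared error over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    cv_err = []
    for lam in lambdas:
        errs = []
        for fold in folds:
            mask = np.ones(len(y), dtype=bool)
            mask[fold] = False               # hold out this fold
            b = adaptive_lasso(X[mask], y[mask], lam)
            errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
        cv_err.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_err))]

# Synthetic data: only predictors 0, 1 and 5 truly matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
true_b = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 1.5, 0.0, 0.0])
y = X @ true_b + rng.normal(scale=0.5, size=300)

lambdas = [0.001, 0.01, 0.05, 0.1]
best_lam = cv_select_lambda(X, y, lambdas)
b_hat = adaptive_lasso(X, y, best_lam)
print(best_lam, np.round(b_hat, 3))
```

Because the weights are inversely proportional to the pilot estimates, predictors with small initial coefficients are penalized heavily and tend to be dropped, while strong predictors are shrunk only slightly.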
In the second step, the trained models were applied to the test data. With this subset, we predicted the household consumption values, and the results were compared with the values reported by the CSES, which used the full-length questionnaires.
Finally, prediction performance was evaluated with three measures: mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). We used MAE because it evaluates a model’s prediction performance most simply, by averaging the absolute differences between the actual and predicted values, as follows (45):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the actual household consumption reported in the CSES, $\hat{y}_i$ is the predicted value, and $n$ is the number of households in the test set.
RMSE squares the differences, averages the squares, and then takes the square root, as shown below. RMSE was additionally used because squaring the differences makes it more sensitive to larger errors (45).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

While MAE and RMSE are useful for comparing the prediction performance of different models on the same dataset, they do not indicate the relative performance of a prediction model itself. MAPE expresses the error as a percentage of the actual value according to the following equation (46), which provides more context for interpreting the model’s average performance.

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
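The three measures can be computed as in the following minimal Python sketch. The function names and the three example values are made up for illustration and are not results from the study.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Made-up example: errors of 10, 20 and 0.
actual = [100, 200, 400]
predicted = [110, 180, 400]
print(mae(actual, predicted))              # 10.0
print(round(rmse(actual, predicted), 2))   # 12.91
print(round(mape(actual, predicted), 2))   # 6.67
```

In this example RMSE exceeds MAE because the single error of 20 is weighted more heavily once squared.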
All analyses were conducted in Stata 16.0. A P-value <0.05 was considered statistically significant. The protocol of this study has been published elsewhere (47).