Developing and Validating Regression Models for Predicting Household Consumption to Introduce an Equitable and Sustainable Health Insurance System in Cambodia

Abstract


Introduction
Universal health access and protection of the population against catastrophic health expenditures and impoverishment are the key aims of universal health coverage (UHC), target 3.8 of Sustainable Development Goal (SDG) 3 (1). Despite extensive efforts of the global community, however, the population incurring catastrophic health spending increased by 3.6% a year between 2000 and 2015 at the 10% threshold and by 5.3% a year at the 25% threshold (2). During the period, the largest concentration of the world population with catastrophic health spending shifted from low-income countries to middle-income countries, while around 70% was persistently concentrated in Asia (2). Evidence has suggested that financial protection can only be universally available if backed by funds from prepaid and pooled sources with subsidies for the indigent. The entitlement to guaranteed services should not be linked to employment status, but instead, it should be universal (3).
Two major prepaid and pooled health financing approaches have been introduced. One is through taxation (the Beveridge model), and the other is through contributions collected for social insurance from the insured (the Bismarck model) (4). The former poses a challenge for low- and middle-income countries (LMICs), where the general tax revenue is limited. The latter is an alternative method, as financial discipline could be maintained by establishing a contribution level to balance revenues and expenditures (5). The revenues contributed by the insured are, in fact, more secure than the general tax, which is not guaranteed to be allocated to the health sector. Nevertheless, it is also a challenge for LMICs, where most of the population is engaged in the informal economy, making it difficult to estimate their income levels (6-9). Under such conditions, a flat-rate contribution can be collected. However, it often lacks equity (8) and endangers the financial sustainability of the insurance fund. This is because the contribution rate is usually set at a level that the lowest-income group can afford, and thus limits the total contribution revenue (6). Some countries, however, namely Japan, Korea, and Taiwan, successfully achieved UHC by introducing social insurance for those engaged in the informal economy. Those countries collect insurance contributions based on the household income level (10). Although a clear understanding of household income levels alone would not solve the issue, this seems to be a key and an absolute requirement for a successful introduction of universal social insurance.
A national survey often estimates household income or consumption in LMICs. However, it is usually composed of lengthy questionnaires that are not likely to be utilized regularly by local administrative staff to determine health insurance contributions. Studies have attempted to develop efficient scales to measure households' welfare or poverty status, mainly for social assistance programs (11-18), or have applied a singular value decomposition, such as principal component analysis, for research purposes (19-21). Nonetheless, these tools merely identified poor households or ranked households by their welfare status.
A couple of studies have attempted to predict household income or consumption. However, one dichotomized the households at a certain income level as a cut-off point (22). The other predicted national average household income at a country level (23). These attempts implied the possibility of estimating household economic status using a limited number of indices. However, no study has so far focused on predicting household income or consumption on a monetary basis to be applied for health insurance contribution determination.
The present study aims to develop and validate efficient regression models to predict annual household consumption in Cambodia, a lower-middle-income country in Asia (24), using the national survey data. In Cambodia, formal sector workers are enrolled in the National Social Security Fund (NSSF) health insurance, and poor households are covered by the fully subsidized health protection program, the Health Equity Fund (HEF) (25,26), but nearly 70% of the population remains uninsured (27). Cambodia's government plans to extend the NSSF health insurance to the currently uninsured population (28), although an additional budget is not guaranteed due to limited fiscal space. This study will help Cambodia implement a financially sustainable health insurance system that allows the insured to pay contributions according to their ability, and the state to redistribute wealth, since larger contributions are collected from households with higher ability to pay than from those with lower ability. The findings of this study will also contribute to ensuring equity in access to healthcare for the Cambodian population.

Data source
This study used the data of the Cambodia Socio-Economic Survey (CSES) conducted between 2010 and 2017 (29-36), publicly accessible upon request. The CSES is a nationally representative cluster sample survey, conducted annually by the National Institute of Statistics (NIS). The CSES uses systematic sampling with probabilities proportional to size, based on the number of households per village retrieved from the public information source (33). The country's 24 provinces and municipalities, at the time of the surveys, were first divided into 19 separate groups. Each group was further divided into urban and rural strata, and a total of 38 strata were formed. The CSES uses a three-stage sampling design, with primary sampling units (PSUs), enumeration areas (EAs), and households as the successive stages (33). The interview was conducted with the household head, his/her spouse, or any other adult household member if the head and spouse were both absent.
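As a rough illustration of systematic sampling with probability proportional to size, the sketch below selects units along a cumulative size scale at a fixed interval. The function name `pps_systematic_sample`, the village sizes, and the fixed start point are hypothetical; this is not the NIS sampling procedure, only a minimal instance of the general technique.

```python
import bisect
import itertools
import random

def pps_systematic_sample(sizes, n_select, start=None):
    """Return indices of units chosen by systematic PPS sampling.

    sizes    : measure of size per unit (e.g. households per village)
    n_select : number of selections to make
    start    : point in [0, interval]; drawn at random when omitted
    """
    total = sum(sizes)
    interval = total / n_select
    if start is None:
        start = random.uniform(0, interval)
    cumulative = list(itertools.accumulate(sizes))
    selected = []
    for k in range(n_select):
        point = start + k * interval
        # The unit whose cumulative range contains the point is selected.
        index = bisect.bisect_left(cumulative, point)
        selected.append(min(index, len(sizes) - 1))  # guard against rounding
    return selected

# Hypothetical villages with 10, 20, 30 and 40 households.
chosen = pps_systematic_sample([10, 20, 30, 40], n_select=2, start=25)
print(chosen)
```

Larger units cover a longer stretch of the cumulative scale and are therefore more likely to be hit; a unit larger than the sampling interval can be selected more than once.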
For this study, we used pooled data of 38472 households covered in the CSES 2010-2017: 3592 each in 2010 and 2011, 3840 each in 2012 and 2013, 12090 in 2014, 3839 each in 2015 and 2016, and 3840 in 2017 (29-36). The large data pool increased precision and power. Table 1 shows descriptive statistics and socioeconomic characteristics of the survey respondents.

Analyses
Equitable health insurance contribution should be determined based on one's ability to pay, which is not simply a function of current income but is more precisely defined as non-subsistence effective income (38). Effective income is further defined as the income that households behave as if they have when making consumption decisions (38). Households tend to smooth consumption over time by saving and borrowing (39), taking into account expected variations in income over the year, their assets, and future earning potential (38). Additionally, a policy paper suggested that a consumption-based measure is more relevant in a lower-income setting where many households are borrowers rather than savers (40). Therefore, we used annual household consumption as the basis to estimate a household's ability to pay. The household consumption in each year was converted to 2010 values based on the consumer price index (37) to adjust for inflation over the eight-year study period.
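The inflation adjustment described above amounts to rescaling each year's nominal consumption by the ratio of the base-year price index to that year's index. A minimal sketch follows; the CPI values and the function name `to_base_year_value` are hypothetical and are not the figures used in the study.

```python
# Hypothetical CPI series with 2010 as the base year (2010 = 100).
# These numbers are illustrative only, not the index cited in the text.
CPI = {2010: 100.0, 2011: 105.0, 2012: 108.0}
BASE_YEAR = 2010

def to_base_year_value(nominal, year, cpi=CPI, base_year=BASE_YEAR):
    """Express a nominal amount observed in `year` in base-year prices."""
    return nominal * cpi[base_year] / cpi[year]

# A household reporting 21,000 (thousand riels) in 2011 under the assumed CPI.
real_2011 = to_base_year_value(21000.0, 2011)
print(real_2011)
```

Amounts reported in the base year are left unchanged, so pooled observations from different survey rounds become comparable in 2010 prices.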
Table 2 shows the household consumption aggregates, including food, non-food, and housing consumption items. The CSES household questionnaire is designed to collect consumption data on purchases in cash, consumption of own production, and consumption of items received in kind. We aggregated the data following the World Bank's guideline (41,42), the most widely referenced guideline for household consumption aggregates, albeit excluding consumer durables due to insufficient information. Based on the previous discussions in similar studies (11-23), 369 predictor variables were created with the CSES data. Table 3 shows a summary of the predictor variables. The data were randomly divided into a training set and a test set at an 8:2 ratio. Subsequently, the analyses were conducted in two steps.
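A random 8:2 split of this kind can be sketched as below. The helper `split_train_test`, the seed, and the use of integer household identifiers are assumptions made for illustration; the text does not describe how the split was implemented in Stata.

```python
import random

def split_train_test(records, train_fraction=0.8, seed=0):
    """Shuffle records reproducibly and cut them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in identifiers for the 38472 pooled households.
household_ids = list(range(38472))
train, test = split_train_test(household_ids)
print(len(train), len(test))
```

Fixing the seed keeps the partition reproducible, so that every candidate model is trained and evaluated on exactly the same households.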
In the first step, using the training set, we constructed linear regression models that related a set of predictor variables (X) to observed household consumption (y), the value reported in the CSES, as follows:

$$y_{t} = \beta_0 + \sum_{k=1}^{K} \beta_k X_{k,t} + \varepsilon_{t}$$

where $\beta_0$ is the intercept, $\beta_k$ is a coefficient parameter to be estimated using the ordinary least squares (OLS) method, and $\varepsilon_t$ is the error term, which is assumed to follow the normal distribution. With the new information on the predictor variables in time t + 1, the corresponding household consumption can be predicted by plugging the estimated parameters into the above equation.
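The fit-then-predict logic of this step can be sketched with a small OLS routine that solves the normal equations. Everything here is illustrative: the two predictors, the toy data, and the helper names `fit_ols` and `predict` are hypothetical, and the study itself used Stata rather than this code.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Return [intercept, beta_1, ..., beta_K] minimising squared error."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predict(beta, x_new):
    """Plug estimated coefficients into the linear equation."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], x_new))

# Toy training data: two hypothetical predictors, y generated as 1 + 2*x1 + 0.5*x2.
X_train = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y_train = [1.0, 3.0, 1.5, 3.5, 5.5]
beta_hat = fit_ols(X_train, y_train)

# Prediction for a new observation (the "time t + 1" case in the text).
y_next = predict(beta_hat, [2, 0])
print(beta_hat, y_next)
```

Once the coefficients are estimated, prediction only needs the new predictor values, which is what makes a short questionnaire sufficient at the point of use.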
For the linear and mixed-effects models, we screened all the variables using a partial correlation coefficient with significance at the 90% or higher level as the cut-off point (23). We manually selected predictor variables using a backward-elimination technique to construct Model A (Manually-selected Linear Model). We also used the backward-selection technique within a stepwise regression analytical framework with a 0.1 level of significance as the cut-off point for removing variables (23) to construct Model B (Stepwise Linear Model). Subsequently, we constructed Model C (Mixed-effects Linear Model) with the variables remaining after the stepwise selection, adding a random effect for province. To avoid overfitting, we constructed Model D with elastic net regression, which ultimately reduced to a model with only the L1 penalty on the regression coefficients, known as least absolute shrinkage and selection operator (LASSO) regression. In addition, we made it an adaptive LASSO by adding data-dependent weights to obtain less biased estimates. Ten-fold cross-validation was used to select the regularization parameter in the LASSO model (43,44). We used all the available predictor variables for Model D since adaptive LASSO automatically performs variable selection, improving the prediction performance and interpretability of the statistical model while ensuring model parsimony.
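To make the role of the data-dependent weights concrete, the sketch below fits a plain and an adaptive LASSO by coordinate descent on a tiny centred dataset. It rests on several assumptions not taken from the study: the weights are the inverse absolute values of an unpenalised pilot fit, the regularization parameter is fixed at 0.05 rather than chosen by ten-fold cross-validation, there is no intercept because the toy data are centred, and the function names are hypothetical.

```python
def soft_threshold(rho, threshold):
    """Shrink rho toward zero by threshold; exactly zero inside the band."""
    if rho > threshold:
        return rho - threshold
    if rho < -threshold:
        return rho + threshold
    return 0.0

def weighted_lasso(X, y, lam, weights, n_sweeps=100):
    """Minimise (1/2n) * sum((y - X b)^2) + lam * sum(w_j * |b_j|)
    by cyclic coordinate descent (no intercept; data assumed centred)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam * weights[j]) / (z / n)
    return beta

# Toy centred data: y = 2*x1 + 0.1*x2, so x2 is a weak predictor.
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
y = [-2.1, -1.9, 1.9, 2.1]

# Pilot estimate without penalty (lam = 0 reduces to least squares).
pilot = weighted_lasso(X, y, lam=0.0, weights=[1.0, 1.0])
# Adaptive weights: weak pilot coefficients receive a heavier penalty.
adaptive_weights = [1.0 / abs(b) for b in pilot]

plain = weighted_lasso(X, y, lam=0.05, weights=[1.0, 1.0])
adaptive = weighted_lasso(X, y, lam=0.05, weights=adaptive_weights)
print(plain, adaptive)
```

In this toy case the plain LASSO shrinks both coefficients by the same amount and keeps the weak predictor, whereas the adaptive version removes the weak predictor entirely and shrinks the strong one less, which is the sense in which the weights reduce bias while still selecting variables. A pilot coefficient of exactly zero would need special handling, which this sketch omits.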
In the second step, the trained models were applied to the test data. With this subset, we predicted the household consumption values, and the results were compared with the values reported by the CSES, which used the full-length questionnaires.
Finally, the prediction performance was evaluated with three measures, namely mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). We used MAE because it evaluates the prediction performance of the model most simply, by taking the absolute difference between the observed and predicted values and averaging it, as follows (45):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the observed value and $\hat{y}_i$ the predicted value for household $i$, and $n$ is the number of households evaluated. RMSE squares the differences, averages the squares, and then takes the square root, as shown below:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

RMSE was additionally used because it is more sensitive to larger errors: squaring the differences gives disproportionately greater weight to large deviations (45).
While MAE and RMSE are useful for comparing the prediction performance of different models on the same dataset, they do not indicate the relative performance of the prediction model itself. MAPE expresses the error as a percentage of the observed value, according to the following equation (46), and thus provides more context for interpreting the model's average performance:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
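The three measures can be computed directly from paired observed and predicted values. The sketch below uses made-up numbers purely to show the calculations; the study reports MAE and RMSE on log-transformed consumption, which this toy example does not attempt to reproduce.

```python
import math

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean squared error."""
    mean_sq = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    return math.sqrt(mean_sq)

def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    ratios = (abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * sum(ratios) / len(observed)

# Hypothetical values: two small errors (2 and 2) and one large error (10).
observed = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 30.0]

print(mae(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```

With these numbers the RMSE exceeds the MAE because the single large error dominates once the differences are squared, which illustrates why RMSE is reported alongside MAE.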
All analyses were conducted in Stata 16.0. A P-value <0.05 was considered statistically significant. The protocol of this study has been published elsewhere (47).

Results
Figure 1 shows the conceptual framework of predictor variable selection. Out of 369 predictor variables, 98 remained after removing variables whose partial correlation coefficients had P-values of 0.1 or greater. Subsequently, 51 predictor variables were selected for Model A, 86 for Models B and C, and 162 predictor variables remained for Model D. Supplementary Table 1 shows details of the remaining predictor variables and the coefficients in each model. Overall, a positive linear relationship between observed and predicted household annual consumption was found in all four models, with the data points concentrated along the regression fit lines. The relationship was stronger for households with lower consumption, and it weakened as the level of household consumption increased. There was a subtle trend for the consumption of middle-class households to be underestimated. In contrast, the consumption of high-class households tended to be overestimated in all four models, although the trend was less noticeable in Model D.
(Please insert Figure 2 here)

Discussion
This study found that it is possible to predict household consumption at a reasonable level with a pool of highly predictive indices. The final product of the study will be an automated tool with selected predictor variables and respective regression coefficients, which will further determine the optimal amount of health insurance contribution for each household. Moreover, our approach suggests the possibility of equitable contribution collection from all socioeconomic groups of society, while ensuring the feasibility of the insurance fund by allowing informed planning through an accurate estimation of the revenue pool. Incorporating our predictive model into the existing social insurance system in Cambodia will enhance the country's current efforts to prevent catastrophic health expenditure and achieve UHC targets.
While the four alternative prediction models had different functions, there was no significant difference in the results, particularly among Models B, C, and D. The regularization technique with data-dependent weights in Model D and the inclusion of random effects in Model C were not particularly effective in this environment. Among the three models with better predictability, Models B and C were more parsimonious, with 86 predictors, as compared to Model D with 162 predictors. Parsimony is an important criterion in model selection because the number of covariates determines the length of the questionnaire. These findings suggest that Model B would best suit the situation in Cambodia. Although Model B was not the most parsimonious, the number of questions could be curtailed because multiple variables are derived from one information source. For example, the question about the floor material of one's dwelling was used to create five predictor variables. It is expected that the number of questions required in Model B could be reduced to as few as 56.
In this study, household consumption was expressed on the logarithmic scale. When back-transformed from the logarithmic scale, the MAE of Model B corresponds to 4151 thousand Cambodian riels, equivalent to USD 1021.15; that is, a mean error of about USD 1021 against an average annual household consumption of USD 4231.22. This error can be further interpreted in the context of insurance contribution. If the contribution rate were 3% of the non-subsistence household consumption, the MAE in the annual household contribution would be USD 30.63, or USD 0.57 per person per month. Also, the model's predictive performance is generally better for poorer households, who are the most vulnerable to an overestimation of their ability to pay. Therefore, the negative impacts of using this tool for insurance contribution determination are not expected to be large.
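The arithmetic behind these contribution figures can be retraced as follows. The dollar amounts and the 3% rate are taken from the text; the household size is not stated there, so the sketch only derives the size implied by the reported per-person figure rather than assuming one.

```python
# Figures quoted in the text.
MAE_USD = 1021.15              # back-transformed MAE of Model B
MEAN_CONSUMPTION_USD = 4231.22 # average annual household consumption
CONTRIBUTION_RATE = 0.03       # illustrative contribution rate
PER_PERSON_MONTHLY_USD = 0.57  # reported error per person per month

# Error in the annual household contribution.
annual_error = MAE_USD * CONTRIBUTION_RATE

# The same error expressed per household per month.
monthly_error_per_household = annual_error / 12

# Household size implied by the reported per-person figure (a derived value,
# not a number stated in the text).
implied_household_size = monthly_error_per_household / PER_PERSON_MONTHLY_USD

# MAE relative to average consumption, on the back-transformed scale.
relative_error = MAE_USD / MEAN_CONSUMPTION_USD

print(round(annual_error, 2), round(implied_household_size, 2), round(relative_error, 3))
```

The annual figure reproduces the USD 30.63 quoted above, and the per-person value is consistent with an average household of roughly four to five members.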
The proxy means test has been practiced in Cambodia to identify poor households as beneficiaries of social assistance programs, including the above-mentioned Health Equity Fund (17). The proxy means test is carried out based on questionnaires that consist of scoring and non-scoring proxy indicators that differentiate poor households from non-poor households (17). The household consumption prediction model developed in this study is essentially different from the proxy means test. While the former predicts household consumption in monetary form, the latter merely assesses the level of poverty by scoring households. In addition, the results of the proxy means test are verified through discussions in the community (26), but the prediction performance of the tool has not been regularly evaluated. Therefore, the proxy means test cannot be used for insurance contribution determination because it does not provide reliable information on how much a household earns or spends, which is necessary for the insurance contribution to be collected equitably. On the other hand, the reverse might be possible. That is, the household consumption prediction model could be used for both poor household identification and insurance contribution determination. If the feasibility of this model is proved, it is worth trying to use the model for both purposes to make the Cambodian social security system more efficient.
Despite our innovative methodology to estimate household consumption on a monetary basis, there are some practical limitations. First, this study compared the predicted household consumption with the observed values. However, the observed values were not real household consumption, but estimated

Figure 2. Observed vs. predicted household annual consumption in Cambodia in 2010-2017

Table 3
Summary of predictor variables

Table 4 shows the MAE, RMSE, and MAPE values of the four alternative prediction models. It should be noted that MAE and RMSE are expressed in log-transformed Cambodian riels. For all three measures, smaller values are preferred. The smallest MAE, 0.227, was obtained for Model B, followed by Model C with 0.228, Model D with 0.230, and Model A with 0.242. The trend was no different for RMSE, which reacts more strongly to larger errors, with values of 0.301 for Model B, 0.302 for Model C, 0.305 for Model D, and 0.320 for Model A. MAPE, the percentage of the predictive error relative to the observed value, was 1.376% for Model B, 1.380% for Model C, 1.394% for Model D, and 1.469% for Model A. The ranking was consistent across all three measures.

Table 4
Prediction performance of alternative predictive models (95% confidence intervals)