Weighting National Survey Data: why, how and which weight?

Weighting of national survey data is a procedure that makes the sample more representative of the target population. It is a thorough exercise that yields several types of weights. However, authors differ considerably on which weight to use, which leaves researchers uncertain. In this article we share our experience of weighting a few recent national surveys in Bangladesh. We generated four weights: the base weight calculated from the probabilities of selection, the non-response adjusted weight, the population calibrated weight, and the trimmed weight. Finally, we checked weighted means, medians, ranges, standard errors, confidence intervals, variances, multiplicative effects, design effects and the prevalence of a key variable of the survey to decide which weight to use. It yielded


Background
A sample survey is one of the most important methods of collecting health data from which conclusions can be drawn about a reference population (1). However, inference cannot be drawn without first treating the sample data. Weighting corrects imperfections in the sample and thereby prevents bias and other departures between the sample and the reference population. In complex sample surveys, imperfections arise from four sources: unequal probabilities of selection, multistage selection, stratification of the sample into reporting domains, and non-response and population noncoverage (2). Ignoring these leads to incorrect inferences from a survey.
Although a sample survey can support conclusions about a reference population (1), the results may be influenced by sampling and non-sampling errors (3). Among the non-sampling errors, non-response (both unit non-response and item non-response) is addressed rigorously through weighting. The adjustment for non-sampling error can differ depending on the data collection technique of the survey: digital or paper-and-pencil. Digital data collection offers a more well-ordered method of adjusting for non-sampling error than paper-based collection (4). Several recent national household (HH) surveys in Bangladesh (5)(6)(7) used digital data collection tools.
In addition, weighting adjusts the weighted sample distribution for key variables of interest (for example, age, race, and sex) to make it conform to a known population distribution (8).
Design-unbiased estimates of the parameters of interest can be produced by applying proper weights (9). The weighting procedure is therefore a critical step once the survey data have been collected and all essential data processing steps have been completed (10).
Testing the variability of the calculated non-response rates and weights is an important step in generating acceptable weights (5). High variation in weights can give some observations too much importance, which might distort the survey results. In addition, if the sampling design is not informative, using the weights should not introduce any significant difference in the estimates, whereas if the sampling design turns out to be informative, weighted estimators will produce "better" results (9).
To some researchers, weighting is like a black box (10). Handling different types of weights and deciding which one to use sometimes leaves researchers confused. As a result, survey data are often analysed without the weights, leading to erroneous conclusions (9). Given the influence weights have on survey results, it is important that researchers understand enough about the weighting process to be discerning users of survey data (10). In this article we describe the weighting process and our approach to identifying which weight to use, and explain the reasons behind our selection. To our knowledge, no scientific article on the weighting of survey data has been published by authors from Bangladesh, and ours is the first of its kind in the context of health survey data in Bangladesh.

A. Overview of the sampling and weighting procedure
Several recent national household (HH) surveys conducted in Bangladesh, which share a similar design (geographically clustered, multistage, probability based) (5)(6)(7) and sampling frame (11), used digital data collection tools. The results shown in this article are drawn from the National Mental Health Survey Bangladesh 2019, which was ethically approved by the Bangladesh Medical and Research Council.
The estimated sample size was 8 928, selected in three stages. In the first stage, primary sampling units (PSUs) were randomly selected from a frame of PSUs covering 95.85% of the total population of Bangladesh (11). The 496 PSUs were allocated equally across the eight divisions, each divided into urban and rural strata. In the second stage, an equal number (18) of HHs was selected from each selected PSU by systematic sampling. These HHs were designated in equal numbers as either a man HH or a woman HH (sex randomization). In the third stage, one individual was randomly selected from a list of the eligible members of each HH (Figure 1).
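As a quick consistency check on the design described above, the stated counts can be multiplied out. The per-stratum figure assumes that the 496 PSUs were divided evenly over the 16 division-by-residence strata, as the text indicates; with one individual selected per HH, the HH total equals the estimated sample size.

```latex
\[
\frac{496 \text{ PSUs}}{8 \text{ divisions} \times 2 \text{ residence strata}} = 31 \text{ PSUs per stratum},
\qquad
496 \text{ PSUs} \times 18 \text{ HHs per PSU} = 8\,928 \text{ HHs}.
\]
```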
The overall response rate of the survey was 90.4% (12) (Table 1).
There is no universally held protocol for calculating weights (5). The aim of the weighting procedure is to calculate the 'final weight'. We used Microsoft Excel from the Microsoft Office 365 bundle for this exercise. The first step was to calculate the 'base weight'. The second step was to adjust the base weights for non-response. The third step was to adjust the weights to a known population total. In the fourth step, a trimming exercise was applied.

Base weight calculation:
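The base weight can be calculated soon after the sampling exercise and the HH listing activity. It involves the probabilities of selection of PSUs (p1), households (p2), sex randomization (p3) and individuals (p4).
iv. Individual selection probability (p4): This is the probability of selecting one eligible individual from among the eligible members of the HH. It is obtained from the survey response data and is applicable to all respondents of the survey.

The overall probability of selection is the product of the four probabilities,
$$p = p_1 \times p_2 \times p_3 \times p_4 \qquad (1)$$
The base weight is calculated as the inverse of the overall probability of selection,
$$w_{base} = \frac{1}{p} = \frac{1}{p_1 \times p_2 \times p_3 \times p_4}$$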
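As a minimal sketch of this step, the Python snippet below assumes that the base weight is the inverse of the product of the four selection probabilities (p1 to p4). The function name and the probability values are hypothetical and are not taken from the survey, where the calculation was done in Microsoft Excel.

```python
def base_weight(p1, p2, p3, p4):
    """Return the inverse of the overall selection probability.

    Assumes the overall probability is the product of the PSU (p1),
    household (p2), sex randomization (p3) and individual (p4) probabilities.
    """
    probs = (p1, p2, p3, p4)
    if not all(0 < p <= 1 for p in probs):
        raise ValueError("each selection probability must lie in (0, 1]")
    overall = p1 * p2 * p3 * p4
    return 1.0 / overall


# Hypothetical respondent: PSU selected with p1 = 0.02, household with
# p2 = 18/120, sex randomization p3 = 0.5, one of 3 eligible members (p4 = 1/3).
example_weight = base_weight(0.02, 18 / 120, 0.5, 1 / 3)
```

In this illustration one respondent would stand for roughly 2 000 people in the target population.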

Non-response weight:
In this step, we estimate the probability of response using information available for both respondents and non-respondents. This information is recorded as 'disposition codes' (Table 1) in the survey dataset.
The non-response factor is calculated by taking the inverse of the response rate for each subset of the survey. The non-response adjustment is undertaken at the PSU, HH and individual levels.
i. PSU-level non-response factor: This is applied within 16 strata (8 divisions x 2 residence strata).
ii. Household non-response factor: This is calculated within each PSU, so there are 496 adjustment cells.
iii. Person non-response factor: This is calculated by gender (e.g. two groups), age group (e.g. four groups) and a core variable of interest (e.g. two groups) of the survey concerned. The person non-response factor is therefore calculated in 16 (2 x 4 x 2) adjustment cells.
The non-response adjusted base weight is calculated by multiplying the base weight successively by the three non-response factors.
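The cell-wise adjustment can be sketched as follows. This is an illustration only: it assumes an unweighted response rate within each adjustment cell (eligible units divided by responding units), and the cell labels, counts and the other factor values are hypothetical.

```python
from collections import defaultdict


def nonresponse_factors(records):
    """Return {cell: eligible / responded} for each adjustment cell.

    `records` is an iterable of (cell, is_respondent) pairs, one per eligible unit.
    """
    eligible = defaultdict(int)
    responded = defaultdict(int)
    for cell, is_respondent in records:
        eligible[cell] += 1
        if is_respondent:
            responded[cell] += 1
    factors = {}
    for cell, n_eligible in eligible.items():
        if responded[cell] == 0:
            # The inverse response rate is undefined; such cells need collapsing.
            raise ValueError(f"cell {cell!r} has no respondents")
        factors[cell] = n_eligible / responded[cell]
    return factors


# Hypothetical household-level cells: 15 of 18 HHs respond in PSU-A, all 18 in PSU-B.
records = [("PSU-A", i < 15) for i in range(18)] + [("PSU-B", True)] * 18
hh_factors = nonresponse_factors(records)

# Successive multiplication: base weight x PSU factor x HH factor x person factor.
adjusted_weight = 2000.0 * 1.0 * hh_factors["PSU-A"] * 1.05
```

The same function can be reused at the PSU, household and person levels by changing what defines a cell.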

Population weight/ Calibration:
If census information is available, it can be used to correct over- or under-representation of the targeted age-sex groups in the sample. In principle, the goal of a calibration weight adjustment is to bring weighted sums of the sample data in line with the corresponding counts in the target population (13) and to compensate for frame deficiencies (14).
First the projected population is estimated, and then the population calibration factor is calculated.

Estimating the projected population from census data
The steps of population estimation are as follows:
1. Summarizing the census population by domain of residence (2 strata), sex (2 strata) and age group (5 strata) for each division (8 strata) (11).
2. Projecting the population of each stratum from the census month to the data collection month using an exponential growth rate.
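The projection step can be illustrated with the standard exponential growth formula, P_t = P_0 x exp(g x t). The sketch below is not the authors' spreadsheet; the function name, the growth rate and the elapsed time are hypothetical placeholders.

```python
import math


def project_population(census_count, annual_growth_rate, months_elapsed):
    """Project a stratum's census count forward under exponential growth."""
    years = months_elapsed / 12.0
    return census_count * math.exp(annual_growth_rate * years)


# Hypothetical stratum: 1 000 000 people at the census, an assumed growth
# rate of 1.2% per year, and 100 months between census and data collection.
projected = project_population(1_000_000, 0.012, 100)
```

Applying the function to every residence, sex, age group and division cell gives the projected totals used for calibration.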

Calculating post stratification adjustment factor (r)
The population calibration factor is calculated by division, residence, gender and the five age groups.
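A minimal sketch of the calibration factor follows, assuming that within each cell r is the projected population divided by the sum of the non-response adjusted base weights. The cell keys, age-group labels and numbers are hypothetical.

```python
from collections import defaultdict


def calibration_factors(weighted_records, projected_population):
    """Return {cell: projected population / sum of adjusted weights in the cell}.

    `weighted_records` is an iterable of (cell, adjusted_weight) pairs,
    one per respondent; `projected_population` maps each cell to its total.
    """
    weight_sums = defaultdict(float)
    for cell, weight in weighted_records:
        weight_sums[cell] += weight
    return {
        cell: projected_population[cell] / total
        for cell, total in weight_sums.items()
    }


# Hypothetical cells keyed by (division, residence, sex, age group).
cell_a = ("Dhaka", "urban", "female", "18-29")
cell_b = ("Dhaka", "rural", "male", "30-39")
respondents = [(cell_a, 1000.0), (cell_a, 1500.0), (cell_b, 2000.0)]
r = calibration_factors(respondents, {cell_a: 5000.0, cell_b: 3000.0})

# Calibrated weight = adjusted weight x r for the respondent's cell.
calibrated = [(cell, weight * r[cell]) for cell, weight in respondents]
```

After this step the calibrated weights in each cell sum to the projected population of that cell.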

Trimming of weight
The trimming procedure varies between researchers (15)(16). Trimming can be applied at any stage of the weighting process (10). In our exercise we followed a trimming procedure based on the assumed distribution of the sampling weights (16). The process started with identifying the extreme weights and fixing all weights beyond the set cut-off points at the cut-off value. The weight thus lost was distributed equally among the non-trimmed weights. This procedure was repeated until no weights were above the cut-off point (15).
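The iterative procedure can be sketched as below. Several choices here are assumptions made for illustration: only an upper cut-off is applied, the cut-off is a fixed multiple of the median of the untrimmed weights, and the excess is shared in equal absolute amounts among the weights that have not been trimmed.

```python
import statistics


def trim_weights(weights, multiple=3.5):
    """Cap weights at multiple x median and redistribute the excess.

    Returns (trimmed_weights, cutoff). Repeats until no weight exceeds the cut-off.
    """
    cutoff = multiple * statistics.median(weights)
    trimmed = list(weights)
    capped = set()  # positions already fixed at the cut-off
    while True:
        excess = 0.0
        for i, w in enumerate(trimmed):
            if w > cutoff:
                excess += w - cutoff
                trimmed[i] = cutoff
                capped.add(i)
        if excess == 0.0:
            break  # nothing left above the cut-off
        free = [i for i in range(len(trimmed)) if i not in capped]
        if not free:
            break  # every weight is capped, so the excess cannot be redistributed
        share = excess / len(free)
        for i in free:
            trimmed[i] += share
    return trimmed, cutoff


# Hypothetical weights: nine equal weights and one extreme value.
example_trimmed, example_cutoff = trim_weights([10.0] * 9 + [100.0])
```

Redistributing the excess keeps the sum of the weights unchanged as long as at least one weight remains below the cut-off.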

B. Checking the calculated weights
We thoroughly checked every calculation step of the weighting process. In addition, we examined the distribution of the weights, taking special notice of extreme values and back-tracking these for possible errors. We also checked the weights against standard statistical parameters: mean, median, range, standard error, confidence interval, variance, multiplicative effect, design effect and the prevalence of a key variable of the survey.
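Two of these checks can be sketched in a few lines. The weighted prevalence is a weighted mean of a 0/1 indicator. The second function is Kish's approximate variance inflation due to unequal weights, n x sum(w^2) / (sum(w))^2, which is offered here only as one common diagnostic; it is an assumption that it resembles the 'multiplicative effect' reported in this article, and it is not the full design effect, which also reflects clustering and stratification.

```python
def weighted_prevalence(outcomes, weights):
    """Weighted proportion of a 0/1 outcome."""
    return sum(w * y for y, w in zip(outcomes, weights)) / sum(weights)


def kish_weighting_effect(weights):
    """Kish's approximate variance inflation from unequal weights (1 + CV^2)."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


# Hypothetical data: four respondents, one carrying a much larger weight.
outcomes = [1, 0, 0, 0]
weights = [3.0, 1.0, 1.0, 1.0]
prevalence = weighted_prevalence(outcomes, weights)
inflation = kish_weighting_effect(weights)
```

With equal weights the inflation factor is exactly 1, so values well above 1 flag unstable weights.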

Results
We calculated the four weights using the formulas described in the methods section.

'Base weight' calculation:
The probabilities of selection of PSUs and HHs, of sex randomization and of individual selection were taken into consideration. The calculation was applied to 16 strata comprising eight divisions and two residence strata. This procedure yielded a total base weight of 79 422 102 (Table 2).

Non-response factor calculation:
i. PSU-level non-response factor: This was also calculated for the 16 domains. The value of the PSU non-response factor is '1' for 15 domains; in the remaining domain one PSU was dropped. The mean PSU non-response factor is 1.0008.
ii. Household non-response factor: This was calculated in all 496 PSUs. The mean household non-response factor is 1.1002.
iii. Person non-response factor: The mean person non-response factor is 1.1002.
When the base weight is adjusted with the non-response factors, the total adjusted base weight stands at 92 569 866 (Table 2).

Calculating the projected population from census data: (Table 2)
In each domain, the projected population of that domain is divided by the sum of the non-response adjusted base weights of that domain. The mean 'r' was 1.328213 (Table 2).

Calculating population calibration and non-response adjusted weight:
This weight is the product of the base weight; the PSU, household and individual non-response factors; and the population calibration factor. The total of our calculated adjusted weights was 102 948 678 (Table 2).

Trimming of weight:
In our exercise we trimmed the non-response and population calibration adjusted base weight. We identified the median of the non-response adjusted and population calibrated weights to be 9 091.9. Any weight above 3.5 times the median (15)(16), that is, above 31 821.7, was set at that value (Table 3).

Testing the calculated weights
First, the distribution of the projected population was compared with that of the unweighted sample to show the differences in distribution by age and residence (Figure 2A and B). The unweighted sample distribution is not similar to the population distribution.
However, when the same comparison is made using sample distributions weighted with any of the four calculated weights, the distribution closely matches that of the population. The best match was achieved by the sample distribution weighted with the population calibrated and trimmed weights (Figure 2). All the weights except the trimmed weight show a wide range, denoting instability of the calculated weights. The sum of the calculated weights gradually increased from the base weight to the trimmed weight. The population calibrated weights and the trimmed weights each sum to 100.8% of the projected population (Table 3). The distribution of the trimmed weight is more centrally oriented than that of the other weights, as shown by the smaller difference between maximum and minimum and the narrower standard error and confidence interval. The multiplicative effect for the trimmed weight is 1.5, and it is the only weight for which the effect is less than 2.
We also checked the effect of the different weights on the prevalence of mental disorders according to the National Mental Health Survey 2019 (7). The unweighted prevalence is 17.3%, which is very close to the weighted prevalence (15.8%-16.8%). We also calculated the prevalence in the urban-rural and male-female domains and found no notable difference.
Overall, the prevalence of mental disorders tends to decrease as we progress from unweighted to weighted results, although we consider this difference negligible. We calculated the design effect for the unweighted and weighted estimates. Though it is somewhat higher in the weighted results, the lowest design effects among the weights are observed for the base weights (1.7) and the trimmed weights (1.7) (Table 4).

Discussion
We calculated four weights (base, non-response adjusted, population calibrated and trimmed) using Microsoft Excel. As population counts were not available for Bangladesh for the survey year (2019), we estimated the projected population from the 2011 census (11). In addition, we had to extract population data for one division separately, as the number of divisions increased from seven to eight after the 2011 census. We provide evidence that weighting yields favorable results compared with unweighted inferences. In addition, the weighting adjustments add precision while maintaining the validity of the results. Weighting of national surveys requires sound knowledge of the survey sampling procedure and of non-response. Though the process is quite straightforward in theory, the calculation is not. Considerable variation exists among authors on whether or not to adjust, how to form weighting classes and post-stratification cells, whether and how to trim the weights, and so forth (10). We have presented the weighting process from the recently conducted National Mental Health Survey 2019 (7) and tested the weights for quality (5).
It is claimed that weighting with the base weight only is an efficient method, as the base weight is simple to construct (10). It can be completed after the mapping and listing activity, before data collection. It avoids meticulous non-response calculations and the need for population projection and calibration. Thus the base weight can be used as the final weight for a survey when the response rate is 90% or more (17). Otherwise, calculating a non-response and population calibration adjusted base weight is recommended (5). However, we generated all four weights even though the survey response rate was acceptable and fresh census data were unavailable.
In the recent surveys, as data were collected through handheld computers, item non-response was completely absent (4). However, the weighting procedure corrected the sample distribution for unit non-response. Compared with the unweighted sample distribution, which did not meet this requirement, the weighted distributions were more reflective of the population distribution and size.
Despite a sampling design with equal allocation to urban-rural and male-female strata, the calculated weights corrected the sample distribution for variables such as sex and residence to make it conform to the population in distribution and size (8). This achieves one of the prime objectives of the weighting exercise (15). The bias induced by the sampling design is also limited, as shown by the small design effect in the weighted results. A small design effect helps to keep the required sample size smaller, which is much needed in a low-resource country.
In addition, we calculated trimmed weights (15). There are arguments both for and against trimming of weights (17)(15). Some authors do not recommend the procedure, as it might induce inaccurate results (18). Trimming survey weights introduces a small bias into estimates but greatly reduces standard errors (17). The cost of weighting data is some reduction in accuracy or precision. Some survey practitioners worry about dealing with highly unequal weights, which trimming might address, thus improving precision (15). We encountered this situation in the generated weights. In our study the trimmed weight was stable and provided favorable results.
We tested (10)(5) these weights for compensation of non-sampling errors, match with the population distribution, variability and accuracy. It has been argued that even if weights reduce bias, they might greatly inflate the variance of estimates (19). Though we encountered a small overall loss of precision when the base weight is used, this loss is gradually removed when the other weights are used. The results calculated using the trimmed weight have the most precision.
Except for the trimmed weight, the calculated weights showed a large range of values, denoting instability of the weights. In our data, the trimming procedure generated a weight that struck a good balance between stability and accuracy.

Conclusion
Weighting compensated for non-sampling errors, corrected the imperfections in the sample and prevented bias between the sample and the reference population, in contrast to the unweighted sample. We found the trimmed weight to be the most acceptable of the four weights. The results generated using the trimmed weights are more nationally representative and precise, and are comparable with other national data.
theme, and reviewed the data and manuscript critically. All authors approved the final version of the manuscript to be submitted for publication.