A general framework in three dimensions
In our proposed framework, we classify selection processes along three main dimensions. Firstly, we distinguish between selection processes at the population level, occurring independently of study decisions, and study-specific selection processes, occurring only because of the study. Secondly, we distinguish between selection in exposure, i.e. selection processes that cause changes in exposure, and selection in population composition, i.e. processes that cause changes in the composition of the general population, source or study population. Thirdly, we organize the selection processes with respect to their timing relative to the exposure and the outcome, for example selection occurring at exposure entry, during exposure (but prior to outcome) or post-outcome. These three dimensions are described in more detail in the following paragraphs.
1. Population vs. study-specific selection processes
Population selection processes occur in general or source populations independently of study decisions, whereas study-specific selection processes occur only because of the study. Epidemiologists are often familiar with non-participation, self-selection into studies, losses to follow-up and other types of study-specific selection processes that can be serious concerns in empirical research. However, there are also selection processes that result in non-random groupings or changes in the composition of the underlying populations. These are continuously ongoing at the population level irrespective of whether they are subject to sampling in empirical studies [6]. Population selection occurs both within general and specific populations, such as specific patient populations. It includes phenomena that may lead to confounding in observational studies, i.e. confusion of effects or lack of exchangeability between exposed and unexposed with respect to background risks for the disease outcome [6, 7]. The confounding resulting from population selection processes is sometimes apparent, such as differences in health determinants across groups defined by age, sex or socioeconomic characteristics, and therefore possible to adjust for in statistical analysis. However, population selection processes often lead to subtle differences across groups that are more difficult to account for, for example if personal ambitions make individuals seek higher education, if self-interest drives individuals to choose the occupation that produces the highest utility for them, if physicians select patients into treatments, and if health-conscious individuals select themselves into preventive screening programs. Additionally, population selection includes phenomena that are distinct from confounding, for instance if an index event must occur in order for someone to enter the population at risk [8].
As examples, disease progression can only be observed among people with the disease, and being unemployed is often a prerequisite for taking part in a job training program. Population selection may also act during exposure, for example through survival of the fittest and depletion of susceptibles over time [9]. A general characteristic of population selection effects is that they tend to persist also in “perfect” observational study settings, including register-based studies on entire populations [10].
2. Selection in exposure vs. selection in population composition
In the second dimension we make a mechanistic distinction between i) selection that causes changes in exposure (selection in exposure) and ii) selection, for example through migration, disease events or deaths, that causes changes in the composition of the population (selection in population composition; Figure 1). Differences in mode of action between these two types of selection mechanisms, which can be of either unifactorial or bi-/multifactorial origin, can be depicted more formally using causal diagrams (Figure 2A-D).
Selection in exposure (unifactorial case)
Each arrow between two nodes in a causal diagram represents a direct causal effect, but also a directed selection process. To see why, consider the direct effect C0 → E1 in Figure 2A, which here means that C-positivity (C0 = 1) increases the likelihood of becoming exposed (E1 = 1). As a consequence, C-positivity will be more common among exposed individuals and less common among unexposed individuals than in the total population. Thus, the direct causal link between C and E is a selection process that leads to two selected groups in the population, exposed and unexposed, that have different compositions with respect to C. We will refer to any arrow that ends in initiated, continued or terminated exposure as selection in exposure. This selection process does not affect the boundaries of the population as such (i.e. it does not select in or out from the combined population of unexposed and exposed). In occupational epidemiology, the healthy worker hire effect is a well-known example of selection in exposure that may lead to confounding bias in the estimated exposure effect on a disease outcome D, if healthy individuals (C0 = 1) in a population are more likely to become employed and thereby occupationally exposed (E1 = 1) [21]. If the origin of the selection C only causes E and not D, then confounding bias would not occur. However, external validity of associational measures can still be compromised by the causal link between C and E if there is heterogeneity in the E – D association across levels of C [10]. This would for example occur if individuals who are less susceptible to the exposure effect (e.g. more stress tolerant) are more prone to become exposed (by applying for positions with high job demands). The estimated E – D association would still be internally valid but would not generalize to the general population with a different distribution of susceptible individuals.
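The compositional consequence of a single arrow C0 → E1 can be checked with a few lines of arithmetic. The following is a minimal sketch using Bayes' rule; the function name and all input probabilities are hypothetical values chosen for illustration and are not taken from Figure 2A or any data set.

```python
def composition_by_exposure(p_c, p_e_given_c1, p_e_given_c0):
    """Return (P(C=1 | E=1), P(C=1 | E=0)) for a binary cause C of exposure E."""
    # Marginal probability of becoming exposed
    p_e = p_c * p_e_given_c1 + (1 - p_c) * p_e_given_c0
    # Bayes' rule within the exposed and within the unexposed
    c_among_exposed = p_c * p_e_given_c1 / p_e
    c_among_unexposed = p_c * (1 - p_e_given_c1) / (1 - p_e)
    return c_among_exposed, c_among_unexposed


# Hypothetical inputs: half the population is C-positive, and C-positivity
# raises the probability of exposure from 0.2 to 0.6.
p_c = 0.5
c_exposed, c_unexposed = composition_by_exposure(p_c, 0.6, 0.2)
print(f"P(C=1) in total population: {p_c:.3f}")
print(f"P(C=1) among exposed:       {c_exposed:.3f}")
print(f"P(C=1) among unexposed:     {c_unexposed:.3f}")
```

Under these assumed inputs, C-positivity is more common among the exposed and less common among the unexposed than in the total population, while the combined population of exposed and unexposed is unchanged.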
Selection in population composition (unifactorial case)
The other fundamental selection mechanism, selection in population composition, acts on the general, source or study population, for example by selecting individuals in or out from the population eligible for the exposure (e.g. only survivors until a certain age will be able to retire). It may also select individuals in or out from the population at risk during exposure (e.g. elevated premature mortality in a drug addict population filters the population at risk of diseases typically occurring at older ages). As a result of this selection mechanism, there will be differences in the composition of subpopulations that are selected and filtered out, for example non-random differences between eligible and non-eligible or between those who remain at risk and those who do not. Selection in population composition is depicted in Figure 2B, where the directed selection mechanism C0 → S1 here means that C-positivity (C0 = 1) will be more common among those who remain in the population at risk (S1 = 1) than among those who do not (S1 = 0). At the population level, the boxed S represents what we refer to as conditioning by nature, which means that any causal action in this system downstream from this time window will only act on the selected population (S1 = 1). As with unifactorial selection in exposure, the change of the population composition with respect to C may compromise external validity if the magnitude of the exposure effect depends on C [10].
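How a shift in the distribution of C can compromise external validity may be illustrated numerically. The sketch below assumes, purely for illustration, that the exposure effect is expressed as a risk difference that varies with C, and that C-positive individuals are more likely to remain in the population at risk; all names and numbers are hypothetical.

```python
def c_prevalence_after_selection(p_c, p_s_given_c1, p_s_given_c0):
    """P(C=1 | S=1) when C affects the probability of being selected (S=1)."""
    p_s = p_c * p_s_given_c1 + (1 - p_c) * p_s_given_c0
    return p_c * p_s_given_c1 / p_s


def average_risk_difference(p_c, rd_c1, rd_c0):
    """Exposure effect averaged over the distribution of C."""
    return p_c * rd_c1 + (1 - p_c) * rd_c0


# Hypothetical inputs: C-positive individuals are less susceptible to the
# exposure (risk difference 0.02 versus 0.10) and more likely to be selected.
p_c_total = 0.5
p_c_selected = c_prevalence_after_selection(p_c_total, 0.9, 0.3)

rd_total = average_risk_difference(p_c_total, 0.02, 0.10)
rd_selected = average_risk_difference(p_c_selected, 0.02, 0.10)
print(f"P(C=1): total {p_c_total:.2f}, selected {p_c_selected:.2f}")
print(f"Average risk difference: total {rd_total:.3f}, selected {rd_selected:.3f}")
```

In this sketch the effect within each level of C is identical in both populations; only the weights differ, which is why the average effect in the selected population would not carry over to the total population.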
Selection in exposure (multifactorial case)
Exposures are in reality often caused by several underlying selection mechanisms, either acting independently or in concert. Importantly, the origin of some of these selections may be unknown. A selection process with two independent, directed selections into a common effect E is depicted in Figure 2C, one originating from a known source C and one from an unknown source U. This combined selection process will generally induce associations between C and U within strata of E that are different from those of the underlying population. As an example, suppose heavy alcohol consumption (E) is caused by low socioeconomic status (C) but also by a genetically determined tolerance for alcohol that allows escalation of drinking, here assumed to be unknown (U). The induced inverse association within strata of alcohol consumption will imply that individuals with low socioeconomic status will have a worse tolerance on average than individuals with high socioeconomic status within the same stratum. Confounding bias in the estimated effect of E on a disease outcome D may occur also after adjustment for C, if the unknown exposure cause U is a direct or indirect cause of D [8]. Bias may occur even if U is only causing D in interaction with C. A hypothetical example is presented in Table 1, where the relative risk (RR) of E on D is constant (2.0) across strata of C and U. If the effect of C on D depends on U, then the causal effect of E can only be correctly identified in individuals where C is absent (so-called partial exchangeability [26]). Bias will occur in analyses adjusted for C since exposed and unexposed individuals with the same level of C differ in U. Stratification on C may for this reason lead to false conclusions regarding the heterogeneity in the exposure effect.
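The induced association between C and U within strata of E, and the residual bias after adjustment for C, can be reproduced by exact calculation. The sketch below does not reproduce Table 1; it uses its own hypothetical selection and outcome models (additive selection into E from C and U, a threefold risk increase from U, and a true RR of 2.0 for E in every stratum), all of which are assumptions made for illustration.

```python
P_U = 0.5       # hypothetical prevalence of the unknown cause U (independent of C)
TRUE_RR = 2.0   # assumed causal risk ratio of E on D, constant in every stratum


def p_exposed(c, u):
    # Assumed additive selection into exposure from C and U
    return 0.1 + 0.3 * c + 0.3 * u


def p_disease(e, u):
    # Assumed outcome model: U triples the risk, E multiplies it by TRUE_RR
    return 0.05 * (TRUE_RR if e else 1.0) * (3.0 if u else 1.0)


def stratum_summary(c):
    """Within C = c, return {e: (P(U=1 | E=e, C=c), risk of D)} for e in 0, 1."""
    summary = {}
    for e in (0, 1):
        weight = {}
        for u in (0, 1):
            p_u = P_U if u else 1 - P_U
            p_e = p_exposed(c, u) if e else 1 - p_exposed(c, u)
            weight[u] = p_u * p_e  # joint probability of (U=u, E=e) given C=c
        total = weight[0] + weight[1]
        risk = sum(weight[u] * p_disease(e, u) for u in (0, 1)) / total
        summary[e] = (weight[1] / total, risk)
    return summary


u_among_exposed = {}
observed_rr = {}
for c in (0, 1):
    s = stratum_summary(c)
    u_among_exposed[c] = s[1][0]
    observed_rr[c] = s[1][1] / s[0][1]
    print(f"C={c}: P(U=1 | E=1) = {s[1][0]:.3f}, "
          f"P(U=1 | E=0) = {s[0][0]:.3f}, observed RR = {observed_rr[c]:.3f}")
```

With these assumed inputs, U is less common among exposed individuals with C = 1 than among exposed individuals with C = 0, although C and U are independent in the total population. The C-specific risk ratios both deviate from the true value of 2.0 and differ from each other, which illustrates how stratification on C alone could suggest heterogeneity that is not present in the causal effect.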
Selection in population composition (multifactorial case)
Multifactorial selection processes also frequently operate on the population composition. An example is depicted in Figure 2D, where the composition of an index population S (for example all cases of a certain disease) has an exposure E as a known origin but also an unknown origin U. Any subsequent causal action on disease progression can only be observed among people with the disease (S = 1), which in the causal diagram is represented by the boxed S. Such conditioning by nature on an index event necessary for entering the diseased population creates associations between the underlying determinants E and U that differ among people with and without the disease. This can lead to a bias commonly referred to as index event bias, or more generally collider stratification bias (or simply collider bias) in studies where the collider S (i.e. the event where the two directed selection mechanisms collide) constitutes the study population [16, 22]. As an example, smoking is a well-established risk factor for the development of rheumatoid arthritis (RA) [16]. Nevertheless, null or even inverse associations between smoking and disease progression among patients with RA have been observed. The anti-inflammatory role of nicotine has been put forward as a possible explanation of the lower systemic inflammation and structural disease progression in current smokers with RA [27]. However, collider bias stemming from a selection mechanism of the type depicted in Figure 2D is a compelling alternative explanation, as the necessary conditioning on the index event S (incident RA) induces spurious inverse associations between the exposure E (smoking) and other risk factors U that may also cause disease progression [16]. Another example of such paradoxical results that could be explained by collider bias is the apparent protective effect of obesity on mortality among end-stage renal disease patients [28].
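Index event bias of this kind can be demonstrated with a small exact calculation. In the sketch below, E and U are independent in the general population, both raise the probability of the index event S, and only U affects progression; the exposure has no effect on progression at all. The probabilities are hypothetical and are not estimates for smoking or RA.

```python
from itertools import product

P_E = 0.5  # hypothetical prevalence of the exposure E
P_U = 0.5  # hypothetical prevalence of the unmeasured risk factor U


def p_index_event(e, u):
    # Assumed: both E and U raise the probability of the index event S
    return 0.05 + 0.10 * e + 0.20 * u


def p_progression(u):
    # Assumed: only U affects progression; E has no causal effect on it
    return 0.10 + 0.30 * u


# Joint probability of (E=e, U=u, S=1) in the general population
joint_s1 = {}
for e, u in product((0, 1), repeat=2):
    p_e = P_E if e else 1 - P_E
    p_u = P_U if u else 1 - P_U
    joint_s1[(e, u)] = p_e * p_u * p_index_event(e, u)

u_given_e = {}
progression_risk = {}
for e in (0, 1):
    total = joint_s1[(e, 0)] + joint_s1[(e, 1)]
    u_given_e[e] = joint_s1[(e, 1)] / total
    progression_risk[e] = sum(
        joint_s1[(e, u)] * p_progression(u) for u in (0, 1)
    ) / total

rr_among_cases = progression_risk[1] / progression_risk[0]
print(f"P(U=1 | E=1, S=1) = {u_given_e[1]:.3f}")
print(f"P(U=1 | E=0, S=1) = {u_given_e[0]:.3f}")
print(f"Progression RR for E among S=1: {rr_among_cases:.3f}")
```

Under these assumptions, exposed cases carry U less often than unexposed cases, so the exposure appears protective against progression among cases even though its causal effect on progression is null by construction.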
3. Timing of the selection – at exposure entry, during exposure or post-outcome
Figure 2A-D all represent selection mechanisms occurring at exposure entry, prior to any exposure or treatment effect in the downstream population. At the study level, selection occurring at exposure entry corresponds to non-random sample selection at study entry in surveys and baseline selection before follow-up starts in cohort studies [6, 29]. Selection may also act during exposure (Figure 3A-D), and affect continuation and termination of exposure or follow-up. Selection during exposure would for example occur if individuals less susceptible to side effects are more prone to stay exposed or continue with a treatment (see additional example in next section). Similarly, selection post-outcome would occur if subclinical symptoms lead to loss to follow-up in a study, or if a disease outcome leads to changes in exposure. The latter would be an example of reverse causality, meaning that the outcome causes exposure changes rather than the opposite (Figure 4A). This situation would lead to bias unless data allow separation of the timing of the outcome events and exposure changes. Selection post-outcome would also lead to bias if the composition of the general, source or study population is dependent both on disease outcome and exposure (Figure 4B). A particular example is selective participation and nonresponse bias in studies conditioning on variables affected by the outcome and exposure [2].
Selection processes at the population-level – additional examples
We now illustrate how the proposed framework with three dimensions (selection level, type of mechanism and timing of the selection) can be used to classify selection processes commonly described in the epidemiological literature (Table 2). In this section we focus on population selection, whereas the next section covers study-specific selection processes.
Selection in exposure – at exposure entry, during exposure or post-outcome
Self-selection into a screening program for breast cancer implemented in the general population may result in important differences between exposed women (i.e. those attending the screening) and unexposed women (those who do not attend), including social, demographic, and health factors that can independently influence outcomes [30]. Self-selection at the population level (Table 2, cell 1.1.1) has several context-specific synonyms, such as confounding by indication in clinical research, which occurs when the clinical indication for selecting a particular treatment (for example severity of the illness) also affects the outcome [31], and the above-mentioned healthy worker hire effect in occupational epidemiology [21].
Selection in exposure at the population level may also occur during exposure (Table 2, cell 1.1.2), even when exposure initiation was not subject to selection. As an example, there was no apparent socioeconomic gradient (C) in the propensity to start smoking (E0) among young adults in the 1950s [32]. However, the propensity to continue smoking (represented by E1 in Figure 3A) despite growing evidence of serious health hazards exhibited a clear inverse socioeconomic gradient [33]. The selection from exposure initiation to exposure continuation can be mediated by a side effect or an adverse event caused by the exposure. This situation would for example occur if initial smokers (E0 = 1 in Figure 3B) who experience bad cough (A1 = 1) are more likely to quit smoking than others. Because of this selection, continuing smokers (E2 = 1) would then have experienced bad cough to a lesser degree than those who have given up smoking (E2 = 0). The healthy worker survivor effect is a similar selection process, which implies that employees who can tolerate the physical or psychosocial working conditions are more prone to stay in the workplace and thereby remain occupationally exposed [21].
Selection may also cause changes in exposure post-outcome (Table 2, cell 1.1.3), for example if a certain exposure may relieve subclinical symptoms. As an example, it has been suggested that smoking in schizophrenia may begin in the prodromal phase of the disorder [34].
Selection in population composition – at exposure entry, during exposure or post-outcome
As already discussed in relation to Figure 2D, selection in population composition occurring prior to exposure entry often implies that an index event (for example attaining a certain age, becoming unemployed or falling ill [22]) must have occurred in order to make an individual eligible for the exposure (Table 2, cell 1.2.1). Self-selection may also affect population composition, for example if people motivated to exercise and eat well choose to live in neighborhoods that support this lifestyle [35]. Changes in population composition may also occur during exposure (Table 2, cell 1.2.2). Figure 3C represents a situation where an exposure E causes an adverse event S (with death as an extreme example) that precludes continued stay in the population at risk with respect to the disease outcome of interest D. Such mutually exclusive outcomes are commonly referred to as competing events (competing risks) [15]. Continued causal actions beyond a certain time window can only occur if the adverse event has not occurred until then. This implies that the exposure in the population at risk will gradually become more and more inversely related to other determinants of the adverse event (represented by U in Figure 3C). As an example, suppose we want to study the effect of sedentary lifestyle (E) on the risk of dementia (D) at older ages. People will then have to have survived to older ages in order to become part of the exposed population at risk. This necessary conditioning on survival (S = 1) implies that surviving individuals with sedentary lifestyle can be expected to have a more favorable risk profile with respect to other determinants (U) of survival than surviving individuals with a physically active lifestyle. Biased effect estimates would be obtained if these determinants are also related to the risk of dementia and are not accounted for.
Depletion of susceptibles, which is depicted in Figure 3D, is a selection process that is similar to the competing event situation (Figure 3C), in that the continued exposure E gradually causes selection in the population at risk (Table 2, cell 1.2.2). However, here the outcome at a later time point (represented by D2 in Figure 3D) does not get competition from other outcomes but from the same outcome at earlier time points (D1). Thus early events in the outcome of interest caused by the exposure may lead to depletion of susceptibles in the population at risk over time [24]. As an example, an effect of smoking on a particular disease outcome may seemingly decrease over time, or even change direction, as early disease events “eat” away at the causal components required for the disease to manifest at later time points. It has been suggested that the hazardous effect of smoking on mortality may disappear for ages 85 and above [36]. However, an alternative explanation is that the harmful effect of smoking is disguised by depletion of susceptibles, i.e. smokers among the oldest are likely to be a more selected group of survivors than non-smokers at the same age with respect to other determinants of survival. Bias will occur in the estimated smoking effect unless susceptibility can be measured and accounted for.
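The apparent attenuation, and even reversal, of an exposure effect through depletion of susceptibles can be sketched with a simple discrete-time calculation. The example assumes an unmeasured binary susceptibility, a per-period outcome risk that the exposure doubles for every individual in every period, and no other exits from the population at risk. The function name and all parameter values are hypothetical.

```python
def period_risk_ratios(periods, frac_susceptible, base_risk, rate_ratio):
    """Crude exposed/unexposed risk ratio per period among those still at risk.

    base_risk maps susceptibility (0 or 1) to the per-period outcome risk
    in the unexposed; exposure multiplies that risk by rate_ratio.
    """
    # Share of each exposure group still at risk, by (exposure, susceptibility)
    at_risk = {
        (e, s): (frac_susceptible if s else 1 - frac_susceptible)
        for e in (0, 1)
        for s in (0, 1)
    }
    ratios = []
    for _ in range(periods):
        crude_risk = {}
        for e in (0, 1):
            multiplier = rate_ratio if e else 1.0
            events = sum(at_risk[(e, s)] * base_risk[s] * multiplier for s in (0, 1))
            crude_risk[e] = events / (at_risk[(e, 0)] + at_risk[(e, 1)])
        ratios.append(crude_risk[1] / crude_risk[0])
        # Individuals with an event leave the population at risk
        for e in (0, 1):
            multiplier = rate_ratio if e else 1.0
            for s in (0, 1):
                at_risk[(e, s)] *= 1 - base_risk[s] * multiplier
    return ratios


# Hypothetical inputs: half the population is susceptible (risk 0.20 per
# period versus 0.01), and exposure doubles the risk for everyone.
ratios = period_risk_ratios(10, 0.5, {0: 0.01, 1: 0.20}, 2.0)
for period, ratio in enumerate(ratios, start=1):
    print(f"period {period:2d}: crude risk ratio = {ratio:.3f}")
```

Under these assumptions the crude ratio equals the individual-level ratio only in the first period. It then declines as susceptible individuals are depleted faster among the exposed, and falls below 1 in the later periods although the exposure remains harmful for every individual. Stratifying on susceptibility would recover the ratio of 2.0 in every period, which is not possible when susceptibility is unmeasured.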
Selection processes may also alter the composition of the population post-outcome (Table 2, cell 1.2.3). A well-known example is Berkson’s fallacy, where the population of patients who come to the hospital is structurally different from patients with the same disease who for various reasons do not come [18]. The selected population may be dissimilar from the unselected population with respect to a single determinant of the selection, but also with respect to associations between different determinants (Figure 4B with S as selected population). Post-outcome selection of the Berkson type may for example affect validity in studies of malformations in live births [37], as malformations often increase the risk of miscarriages and these are often impossible to observe.
Study-specific selection processes – additional examples
Selection in exposure – at study entry, during follow-up, or post-outcome
Study-specific selection in exposure may occur at study entry (Table 2, cell 2.1.1) in studies of medical or social interventions, where exposure or treatment is not randomized but assigned by the investigator or chosen by the study participant (e.g. as a prerequisite for participation). Such self-selection may lead to bias, and can also occur post-randomization in randomized studies, in particular in trials where treatment assignment cannot be blinded and participants are free to refuse treatment [25]. Non-compliance with the treatment protocol caused by side effects (during follow-up; Table 2, cell 2.1.2) or by subclinical symptoms (post-outcome; Table 2, cell 2.1.3) may also affect the validity of interventional studies.
Selection in population composition – at study entry, during follow-up or post-outcome
Selection in population composition at study entry may either be ‘intentional’ or ‘unintentional’ from the researcher’s perspective [6]. Intentional selection occurs based on criteria established by the researcher, for example when defining the source population for a cohort study or the study base for a case-control study (Table 2, cell 2.2.1). Intentional selection may have consequences for validity. As an example, randomized clinical trials and observational cohort studies on effects of hormone-replacement therapy on coronary heart disease have yielded conflicting results. A major reason for the differences was the inappropriate definition of the source population in several of the cohort studies [19]. Long-term current users of estrogen/progestin were included at baseline, which led to selection of individuals less susceptible to adverse effects and, as a consequence, to biased treatment effects. Unintentional self-selection affecting population composition at study entry occurs when there are differences between the source population (those who are eligible to participate) and the study population (those who actually participate). A particular example is the healthy volunteer effect (i.e. participants being healthier than non-participants [15]), which may hamper the possibilities to generalize results beyond the study sample. Both intentional and unintentional baseline selection (at study entry) may in cohort studies lead to collider bias of the type depicted in Figure 3C, for example if study participation (S) is caused both by the exposure (E) of interest and another outcome risk factor (U) [6].
Loss to follow-up in cohort studies before or after outcome has occurred (Table 2, cells 2.2.2 and 2.2.3) can be critical if outcome ascertainment is dependent on continued study participation (for example by an additional visit to the study center) [38], but is generally of less concern if outcomes are ascertained by health care registers [39]. Selection occurring post-outcome is of particular concern in case-control studies if enrolment is done retrospectively after ascertainment of case/control-status [40].