Systematic review conduct
The Evidence Review and Synthesis Centre at the University of Alberta (JP, SR, LH) will conduct the systematic reviews on behalf of the task force, following the research methods outlined in the task force methods manual (49). We report this protocol, and will conduct and report the review, in accordance with current standards (50–54). During protocol development, a working group was formed consisting of task force members (SK, DR, GT), with input from clinical experts (CF, NL, MM) and scientific support from the Global Health and Guidelines Division at the Public Health Agency of Canada (LAT, GTr, CG). The working group contributed to the development of the Key Questions (KQs) and PICOTS (population, intervention(s) or exposure(s), comparator(s), outcomes, timing, setting, and study design) elements.
Task force members made the final decisions with regard to the KQs and PICOTS. Task force members and clinical experts rated the proposed outcomes based on their importance for clinical decision-making, according to the methods of Grading of Recommendations Assessment, Development and Evaluation (GRADE) (55). Ratings by the clinical experts were solicited to ensure acceptable alignment with the views of task force working group members, but task force members determined the final ratings. The final critical outcomes (rated 7 or above on a 9-point scale) are lung cancer mortality, all-cause mortality, and overdiagnosis. The final important outcomes (rated 4–6) for inclusion are health-related quality of life (HRQoL), false positives, incidental findings, major complications and death from invasive procedures undertaken as a result of screening, and psychosocial harms from the screening process. The section on data extraction details how each of these outcomes will be defined for this review. Further, because our calculations of overdiagnosis require estimates of cancer incidence, we will extract and rate the certainty of data on incidence, but incidence will not be an outcome considered during decision-making about overall recommendations for, or against, screening. We anticipate evidence on the potential benefits of screening, namely all-cause mortality and lung cancer mortality, and on the potential harms, including overdiagnosis of lung cancer, false positives, incidental findings, major complications and death from invasive procedures undertaken as a result of screening, and psychosocial harms. HRQoL may represent a benefit or a harm of screening, depending on the direction of the effect. The final classification of benefit or harm for all outcomes will be based on the effects observed for different comparisons. Measures of values and preferences related to the critical and important outcomes were based on GRADE methodology (41).
This version of the protocol was reviewed by the entire task force. Stakeholders (n = pending) reviewed a draft version of this protocol, and all comments were considered. Throughout the conduct of the systematic reviews, we will document any changes to the protocol and report these in the final review report.
Key Questions
1. What are the benefits and harms of screening for lung cancer in adults aged 18 years and older?
2. What is the relative importance people place on the potential benefits and harms from screening for lung cancer?
3. a) What are the comparative benefits and harms of risk prediction models compared with trial-based criteria to identify eligibility for lung cancer screening?
   b) What are the comparative benefits and harms of alternate nodule classification systems compared with nodule classification systems used in lung cancer screening trials?
Eligibility criteria
Tables 1 to 3 outline the eligibility criteria in terms of the population, intervention/exposure, comparators, outcomes, timing, setting, and study designs (PICOTS) for each KQ. For KQ1 we will include studies of adults in whom lung cancer is not suspected. Studies may enroll a general adult population or people meeting eligibility criteria associated with increased risk for lung cancer, as defined by authors. The population may include current, former, and second-hand smokers, as well as those with exposures to substances that may affect risk and other identified factors that may increase risk. Although low-dose CT, with or without any other lung cancer screening interventions (e.g., biomarkers), is the primary intervention of focus, we will include new RCTs published in 2015 or later of other screening modalities, including chest radiography. Comparators can be no screening, an alternative screening modality (e.g., chest radiography), or an alternative strategy (e.g., different eligibility criteria, classification of findings, or screening interval). For benefits and overdiagnosis, we will only include RCTs because of the rigor of this design for overdiagnosis and the known availability of data with several years' follow-up (56). For harms apart from overdiagnosis, where the effects are either rare (requiring large studies) or occur/are reported only in the screening arm, we will also include nonrandomized and uncontrolled studies, with specific requirements as outlined in Table 1. We will proceed with examining evidence on harms for a particular screening modality when there is at least low certainty of benefit for one or more benefit outcomes from an RCT.
Table 1
Eligibility criteria for key question 1 on benefits and harms of screening for lung cancer with computed tomography
| | Inclusion Criteria | Exclusion Criteria |
| --- | --- | --- |
Population | Adults ≥ 18 years old and not suspected to have lung cancer Specific populations of interest may include: • Age • Sex • Gender • Smoking history (e.g., duration, pack-years) • Race and ethnicity (e.g., First Nations, Inuit and Métis) | • Studies of people seeking care for symptoms of lung cancer or who are suspected of having lung cancer • Studies of people who have been previously diagnosed with lung cancer • Studies of people younger than 18 years old • Study population includes > 25% individuals with a recent abnormal screening result
Intervention | Computed tomography (CT), with or without any other lung cancer screening interventions (e.g., biomarkers). The focus is on screening interventions involving CT, but we will include new RCTs (published in 2015 or later) of other screening modalities. Screening modalities examined in the last review (i.e., chest x-ray) will only be included if new RCTs are found. |
Comparator | • No screening/placebo/minimally or non-active intervention (e.g., smoking cessation) • An alternative screening strategy/protocol (e.g., pre-screening with risk model or biomarkers, different screening interval or method to classify nodules) • None (for harms except overdiagnosis where a no screening arm is required) | |
Outcomes (by importance as rated by the WG) | Benefits: 1. All-cause mortality* 2. Lung cancer mortality* 3. Health-related quality of life If there is trial evidence showing at least low certainty of benefit for 1 + of the above outcomes, then: Harms: 4. Overdiagnosis of lung cancer* 5. Major complications or morbidity from invasive testing as a result of screening 6. Death from invasive testing as a result of screening 7. False positives 8. Psychosocial consequences of the screening process (i.e., before or right after screening, receiving a false positive result) 9. Incidental findings *Rated as critical outcomes; others rated as important | |
Timing | No limitation on the duration of follow-up, except for false positives, where ≥ 12 months of follow-up after the screening result is required in ≥ 80% of participants (if no evidence meeting this criterion is found, we will accept ≥ 6 months of follow-up). The longest follow-up will be used for mortality, quality of life, and overdiagnosis, unless there has been substantial (> 20%) contamination by the control group receiving screening. |
Delivery Setting | Any setting relevant to primary care Countries rated as very high on the Human Development Index 2019 (57) | |
Study design & publication type | • RCTs (parallel designs using individual or cluster randomization) • For harms other than overdiagnosis, nonrandomized studies are eligible. For false positives, the study must use similar selection criteria and methods of determining a positive result (e.g., diameter, volumetric) as used in an RCT. If a method used in an RCT is compared with another method in a nonrandomized study, it will be considered in KQ3. For psychosocial harms there must be at least a within-group comparison relevant to the outcome (e.g., before and after the screening test, or between people with a negative vs. a positive test). Note: nonrandomized studies comparing the benefits of the screening selection criteria or the nodule classification systems used in the RCTs with use of a risk prediction model or a different nodule classification system will be considered in KQ3. • Journal articles • Letters, abstracts and grey literature (e.g., government reports, results in trial registries) if information on study design (e.g., eligibility criteria, participant characteristics, presentation of scenarios) is sufficient to assess the study and results are confirmed as final (accessible online or via author contact). | • Studies using modeling or simulation, or retrospectively applying differing screening criteria • Editorials, commentaries • Case reports and series (i.e., all participants have the outcome)
Language of full text | English or French | |
Dates of publication | Any date | |
Table 2
Eligibility criteria for key question 2 on the relative importance people place on the potential benefits and harms from screening for lung cancer
| | Inclusion | Exclusion |
| --- | --- | --- |
Population | Adults aged 18 years and older Specific populations of interest may include: – Age – Sex – Smoking history – Race or ethnicity (e.g. First Nations, Inuit and Métis) – For health-state utility studies: cancer detection via screening vs. clinical presentation; stage/severity of cancer (i.e., curative vs. palliative/non-metastatic vs. metastatic) | Expert or healthcare providers (doctor, nurse) acting as proxy for patients and public |
Exposures | a) Direct measurements: Utility-based measurements (e.g., health state utilities, trade-offs between outcomes) i. Experience with outcome/health state (as per KQ1, adding different stages/severities of cancer) ii. Exposure to clinical scenarios about the outcome(s) iii. Exposure to choice sets or other risk exercises (e.g. trade-offs, balance sheet, ranking) with differing risks/magnitudes of effects on benefits versus harms from screening (must contain 1 + benefit and 1 + harm) Quantitative non-utility studies (e.g. simple ratings, rankings or trade-offs between 1 + benefit and 1 + harm) i. Any exposure b) Indirect measurements (allowing inferences about how many people perceive benefits as more important than harms & acceptability of screening; not only in context of critical outcomes): i. Exposure to estimates of effect from screening for 1 + benefit and 1 + harm (e.g. decision aids). | |
Comparisons | a) Utility-based measurements: i. Healthy/usual state without outcome (may include screen-negative patients) ii. Different outcome/health state (e.g. false positives vs. overdiagnosis; includes studies comparing different stages/severity of cancers) iii. No comparison, if no other studies for the outcome comparison b) Quantitative non-utility studies & indirect studies i. No comparison ii. Comparison with another screening strategy (e.g., having different magnitude of effects; based on comparisons evaluated in evidence found for KQ1) | Studies measuring utilities during cancer treatments that are not considered a standard first-line of care for a representative sample, by stage of cancer (will gather clinical input as needed) |
Outcomes | a) Direct measurements: i. Health-state utility values; health states may include different stages/severities of lung cancer and nodules to help estimate the utility of an overdiagnosed case ii. Other utility scores (e.g. trade-offs, magnitude of coefficients/utility weights in models for each outcome) iii. Non-utility, quantitative information about relative importance of different benefits and harms (ratings, rankings) b) Indirect measurements: i. Inferences about how many people perceive benefits as greater than harms ii. Preference for or against screening (screening attendance, intentions, or acceptance) or preferred screening strategy based on different outcome risk descriptions (e.g. using decision aids) c) Measures of variability for all of above (e.g. 95% CIs, proportion unwilling to trade any benefit for harm, proportion undecided from decision aids vs those confident in decision for vs. against) | |
Timing | Measured immediately after experiencing outcome/diagnosis (to 3 months) and longer-term (e.g., after investigations and treatment) | |
Setting | Any setting in Very High Human Development Index countries (57) | |
Study Design and Publication Status | Any quantitative study design Journal articles Letters, abstracts and grey literature (e.g., government reports, results in trial registries) if information on study design (e.g., eligibility criteria, participant characteristics, presentation of scenarios) is sufficient to assess study and results are confirmed as final (accessible online or via author contact). | Editorials. |
Language | English or French | |
Publication date | 2012 – present | |
Table 3
Eligibility criteria for key question 3 on the comparative effects between a) trial-based selection criteria and use of risk prediction models, and b) trial-based nodule classification and different nodule classification systems
| | Inclusion | Exclusion |
| --- | --- | --- |
Population | Adults aged 18 years and older not suspected to have lung cancer. | Studies that focused on people under age 18 or that targeted adults 18 years and older who were either suspected of having lung cancer or were previously diagnosed with lung cancer. |
Intervention(s) | For risk prediction models: externally validated risk prediction models intended for identifying persons who may benefit from screening. The subsequent screening process must be similar to trials included in KQ1 (e.g., LDCT, nodule classification). Note: the purpose of the study may be to provide external validation. For nodule classification systems: any nodule classification system. Other elements of screening must be similar to trials included in KQ1. | Risk prediction models that include factors (laboratory tests, other assessments) that are not available or feasible in primary care. Risk prediction models that incorporate nodule characteristics.
Comparison(s) | For risk prediction models: Trial-based eligibility criteria (similar to that used by trials included in KQ1 (e.g. NLST, NELSON)). For nodule classification systems: Nodule classification system similar to those used by trials included in KQ1 (e.g. NLST, NELSON). | |
Outcomes | Benefits: 1. All-cause mortality* 2. Lung cancer mortality* 3. Health-related quality of life Harms: 4. Overdiagnosis of lung cancer* 5. Major complications or morbidity from invasive testing as a result of screening 6. Death from invasive testing as a result of screening 7. False positives 8. Psychosocial consequences of the screening process (i.e., before or right after screening, receiving a false positive result) 9. Incidental findings *Rated as critical outcomes; others rated as important Note: outcome may be reported as a relative measure between benefit(s) and harm, e.g. number of overdiagnosed cancers per prevented death | False positive selections (individual is positive for “increased risk” using model [i.e. is selected for screening with CT or other modality] but does not get cancer or die from lung cancer during follow up) |
Delivery Setting | Any setting relevant to primary care | |
Study design(s) & Publication status | • Nonrandomized clinical trials, prospective or retrospective controlled observational studies (including analyses using data from screening arms in RCTs), and modelling studies • Journal articles • Letters, abstracts and grey literature (e.g., government reports, results in trial registries) if information on study design (e.g., eligibility criteria, participant characteristics, presentation of scenarios) is sufficient to assess the study and results are confirmed as final (accessible online or via author contact). | • RCTs (captured in KQ1) • Editorials • Commentaries • Case reports and series
Countries | Studies conducted on populations from countries categorized as “Very High” on the 2016 Human Development Index (as defined by the United Nations Development Programme). Priority will be given to studies relevant to the Canadian context | Studies conducted on populations from countries NOT categorized as “Very High” on the 2016 Human Development Index (as defined by the United Nations Development Programme). |
Language of full text | English or French | Languages other than English and French. |
Dates of publication | 2012 onwards (the first RCT showing a benefit of lung cancer screening was published in August 2011) |
For KQ2 on values and preferences, individuals may or may not have experienced lung cancer or one or more of the critical or important outcomes of interest. Study designs may be any quantitative design measuring preferences for outcomes either directly such as health-state utilities or trade-offs, or indirectly, hence allowing inferences about relative values based on the degree of acceptance of screening given scenarios with estimates of the expected benefits and harms.
For KQ3, we will include nonrandomized studies comparing the benefits and harms between the selection criteria or nodule classification systems of the randomized trials included in KQ1 and selection based on (externally validated) risk prediction models or different nodule classification systems. The selection criteria should be similar to, but need not be identical to, those used in a trial included in KQ1; for instance, if the minimum age differs by 2–3 years, the review team will use clinical input to decide eligibility. RCTs of these comparisons will be included in KQ1.
For the most relevance to Canada, we will only include studies conducted in countries listed as very high on the Human Development Index (57) and having full texts in English or French. For KQs 2 and 3, we will limit studies to those published during or after 2012. Based on clinical input and working group discussion, utility-based outcomes are expected to have changed over time because treatments and their impacts have changed considerably, and the best indirect measurements (e.g., those based on decision aids) would rely on contemporary estimates of the effects of CT screening available since publication of the NLST trial in 2011. This date also coincides with publication of the major risk prediction models and the emergence of studies comparing different screening protocols.
Searching the literature
For KQ1 we will locate all full texts from the previous task force review (48). The previous review's final search date for studies was March 31, 2015 (48), so for this KQ we will search Medline and Embase, via Ovid, and the Cochrane Central Register of Controlled Trials from 2015 onwards. The previous searches for benefits and harms were modified slightly to increase their sensitivity, with a more sensitive filter applied for RCTs, and to broaden their scope, such as adding controlled vocabulary and keywords for incidental findings and psychosocial harms to the harms search. For KQ2 we will search Ovid Medline (1946-), Scopus, and EconLit from 2012 using two searches: one for utility-based studies (focusing on relevant preference-based instrument/methodology terms and the relevant outcomes, as well as lung cancers and nodules, to help estimate the utility of an overdiagnosed case) and another for decision-making/acceptance/attitudes about lung cancer screening. For KQ3, we will search Medline and Embase; for Medline we will rely on the 2019 search performed by the authors of a review on this question to inform the United States Preventive Services Task Force (and screen all of their studies for eligibility) (58) and update the search to the present using a de novo search strategy, and for Embase we will run the de novo search from 2012. Searches were developed in collaboration with an information specialist and peer-reviewed by another using the PRESS 2015 checklist (59). The final Medline searches are located in Supplementary File 1. We will scan reference lists of included studies and relevant reviews. We will search ClinicalTrials.gov and the World Health Organization International Clinical Trials Registry Platform for results data for published and unpublished trials (from the past two years) of lung cancer screening. Where studies are only reported in conference abstracts or trial registries, first authors will be contacted by email, with two reminders over 1 month, to confirm results are final and to ask whether full study reports are available. Any unpublished data will be subject to sensitivity analysis, if included.
We will export the results of database searches to an EndNote library (Clarivate Analytics, Philadelphia, US, 2018) for record-keeping and will remove duplicates. We will document our supplementary search process for any study not originating from the database searches and enter these citations into EndNote individually. We will update the electronic database searches for all KQs within 12 months of the task force guideline publication.
Selecting studies
Records retrieved from the database searches will be uploaded to DistillerSR (Evidence Partners Inc., Ottawa, Canada) for screening. For all citations retrieved from the database searches, two reviewers will independently screen all titles and abstracts using broad inclusion criteria. Full texts of any citation from the search considered potentially relevant by either reviewer will be retrieved. Two reviewers will independently review all full texts, including the studies from the previous reviews, against a structured eligibility form, and a consensus process will be used for any full text not included by both reviewers. If necessary, a third reviewer with methods or clinical expertise and/or author contact will be used to arbitrate decisions. The screening and full-text forms will be pilot-tested with a sample of at least 100 abstracts and 20 full texts, respectively. Screening of studies located from reference lists, trial registries, and websites will be conducted by one experienced reviewer, with two reviewers reviewing the full texts. We will document the flow of records through the selection process, with reasons provided for all full-text exclusions, and present these in a PRISMA flow diagram (53) and an appended list of excluded studies. When data from multiple reports of the same trial are used in the review for results of mortality, HRQoL, or cancer incidence (for estimating overdiagnosis), we will consider the report from which we collected lung cancer mortality data to be the primary (included) publication for citation and will cite the others as companion papers. When we use data on harms from only the screening arm within a trial, we will treat these data as a separate, uncontrolled study and cite the report from which the data were taken, unless it is the primary publication.
Data extraction
We will rely on data extraction from the previous review team, where possible and suitable. For these data and for all data from new studies, one reviewer will extract data and another will verify all data for accuracy and completeness. Results data for the critical outcomes in KQ1 will be extracted in duplicate, with decisions based on consensus or arbitration by a third reviewer. Each data extraction form will be piloted with a sample of at least five studies.
Sufficient data will be collected to allow examination of the homogeneity and similarity assumptions for meta-analysis, for description and possible analyses on specific populations (see Tables 1 and 2), and for assessment of the risk of bias. The main data items include the study characteristics (i.e., year and country of conduct, eligibility criteria, sample size eligible and enrolled, setting of recruitment, trial/study design, methods for randomization, concealment and blinding); population (i.e., age, sex, gender, race or ethnicity, personal history of lung disease, family history of lung cancer, smoking history [past, current, pack-years], health state including diagnosis, stage of cancer and treatment received [for KQ2 preference studies]); intervention and comparator (e.g., interval, rounds, dose, classification of nodules, description of usual care and any adjuvant therapies including smoking cessation advice etc.) (for effectiveness) or exposure (e.g., instrument, measurement of tariffs, scenarios used, estimates of effects of screening, any specified durations of health states) (for KQ2 preference studies); outcomes (definitions, ascertainment, methods to determine cause of death, timing of data collection, tool with range of values for patient-reported outcome measures); number screened at each round or during usual care; cumulative number of cancers diagnosed (including the proportion diagnosed based on screening results); results (numerator and denominator for each outcome; see details below); funding source; data supporting missing outcomes or analyses.
For most of the outcomes in KQ1, the denominators will be the population enrolled in the relevant arm/group(s) in the study (i.e., intention-to-treat). One exception is psychosocial harms, where sub-populations (e.g., those receiving a positive screening result) will be considered. Another exception is overdiagnosis. We will calculate estimates of overdiagnosis from the relative (56) and absolute differences in cumulative lung cancer incidence through follow-up between the screening and no-screening groups, and from the excess incidence of cancer from screening expressed as a proportion of i) those having cancer diagnosed in the screening arm, and ii) those having cancer diagnosed through screening in the screening arm.
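To make these calculations concrete, the following is a minimal sketch in Python using entirely hypothetical counts; the variable names and numbers are illustrative assumptions, not data from any included trial, and the review's actual calculations will follow the cited methods (56).

```python
# Illustrative overdiagnosis calculations with hypothetical counts
# (all numbers below are placeholders, not data from any included trial).

n_screen, n_control = 10000, 10000          # randomized per arm (hypothetical)
cancers_screen, cancers_control = 300, 240  # cumulative lung cancers at longest follow-up
screen_detected = 200                        # cancers diagnosed through screening itself

# Relative and absolute comparison of cumulative incidence
risk_screen = cancers_screen / n_screen
risk_control = cancers_control / n_control
relative_risk = risk_screen / risk_control
absolute_difference = risk_screen - risk_control

# Excess incidence attributable to screening (scaled to the screening-arm size)
excess = cancers_screen - cancers_control * (n_screen / n_control)

# Overdiagnosis expressed two ways, per the protocol text
overdx_per_cancer_in_screen_arm = excess / cancers_screen
overdx_per_screen_detected_cancer = excess / screen_detected

print(f"RR = {relative_risk:.2f}, absolute difference = {absolute_difference:.4f}")
print(f"Overdiagnosis: {overdx_per_cancer_in_screen_arm:.1%} of cancers in the screening arm, "
      f"{overdx_per_screen_detected_cancer:.1%} of screen-detected cancers")
```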
For mortality outcomes and overdiagnosis (using cancer incidence), we will use crude data on the cumulative number of events from the longest follow-up time point, unless there has been substantial contamination after a previous time point (> 20% of the no-screening group receiving screening). For incidence rates used when calculating overdiagnosis, there must be follow-up beyond the active phase of screening. For HRQoL, we will extract the mean baseline and endpoint or change scores (at the longest follow-up without substantial contamination), standard deviations (SDs) or other measures of variability, and the number analyzed in each group. For the outcomes of major complications or morbidity (requiring hospitalization or medical intervention), and mortality, from invasive testing as a result of screening, we will use counts of the number of people having one or more events (not the total number of events) among those who later receive a negative diagnosis (false positives) and among anyone receiving the invasive testing (those with cancer and false positives). For incidental findings, we will extract all data on the number of people with an incidental finding (unless only the number of incidental findings is reported) and details of the incidental findings, that is, the organ system involved and whether it resulted in referral for additional testing. For false positives and psychosocial harms from screening, we will examine results for anyone receiving a recommendation for early recall (e.g., an indeterminate result with repeat CT screening at 3 or 6 months) or for diagnostic follow-up (e.g., a result suspicious of lung cancer), as well as only for those recommended to have diagnostic follow-up. Other definitions of false positives will be considered. We will record the proportion of people receiving at least one false positive result over all screening rounds and the average number of false positives during the active screening phase of the study. Results considered consistent with the outcome of psychosocial harms include data from patient-reported outcome measurement tools/questionnaires on symptoms of anxiety, depression, distress, and concern about lung cancer; if composite scores covering these concepts are available, we will use these rather than subscales. Single-question items are not eligible. Subscales of overall HRQoL scales (e.g., mental health) will be considered to measure psychosocial harm if other tools measuring the same symptoms are not reported. For this outcome we will extract data at all measured time points during the active phase of screening, primarily to capture harms from undertaking screening itself and from receiving a false positive result.
When only relative effects/ratios between groups are reported instead of raw counts and intention-to-treat is not used, we will rely on results from last observation carried forward or, if necessary, per-protocol/completer approaches, as reported. For missing results data for any outcome, including measures of variance, we will contact authors by email with two reminders over one month. If the data are not received, where possible we will compute missing SDs or standard errors (SEs) from other study data or, as a last resort, impute them based on other studies in the review. When computing SDs for change-from-baseline values, we will assume a correlation of 0.5 (60), unless other information in the study allows us to compute it more precisely. We will use available software (i.e., Plot Digitizer, http://plotdigitizer.sourceforge.net/) to estimate effects from figures if no numerical values are provided. If cross-over trials are included, we will limit data extraction to the first period of the study, prior to the cross-over.
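As an illustration of these computations, the short sketch below (Python; the function names and numbers are hypothetical, and the default correlation of 0.5 reflects the assumption stated above) shows how an SD for change from baseline can be derived from baseline and endpoint SDs, and how an SD can be recovered from a reported SE.

```python
import math

def sd_from_se(se: float, n: int) -> float:
    """Recover a standard deviation from a reported standard error."""
    return se * math.sqrt(n)

def sd_change(sd_baseline: float, sd_final: float, corr: float = 0.5) -> float:
    """SD of change-from-baseline scores, assuming a correlation `corr`
    between baseline and final measurements (0.5 unless the study provides
    enough information to compute it more precisely)."""
    return math.sqrt(sd_baseline ** 2 + sd_final ** 2
                     - 2 * corr * sd_baseline * sd_final)

# Hypothetical example: baseline SD 9.0, endpoint SD 10.0, default r = 0.5
print(round(sd_change(9.0, 10.0), 2))   # ~9.54
print(round(sd_from_se(0.8, 150), 2))   # ~9.8
```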
For KQ2 health-state utilities, data using the most commonly used measurement tool (e.g., EuroQol 5 Dimensions [EQ-5D]), using tariffs from the same country, at the earliest time point after baseline (or diagnosis) will be prioritized for analysis (61), though we will extract all data meeting our criteria in Table 2.
For data from non-randomized studies we will rely when possible on results adjusted for potential confounders.
Risk of bias
For RCTs we will use the Cochrane risk of bias (ROB 2.0) tool, assessing the effect of assignment to the intervention for each relevant outcome (cancer-specific and all-cause mortality, cancer incidence [to calculate overdiagnosis], and HRQoL) (62). For nonrandomized studies (including single-arm data on harms from RCTs), we will use the checklists, as applicable, from the Joanna Briggs Institute (63), with the exception of preference-based studies, where we will use items based on GRADE guidance: the choice/selection of representative participants; appropriate administration and choice of instrument; analysis and presentation of methods and results; presentation of the health states described by the instrument, covering all relevant outcomes and valid with respect to the health state; patient understanding; and subgroup analysis to explore heterogeneity (41).
Two reviewers will independently assess the studies and come to a consensus on the final risk of bias assessment for each question using a third reviewer where necessary. Each risk of bias tool will be piloted with a sample of at least five studies, using multiple rounds until agreement on all elements is high. These assessments will be incorporated into our assessment of the risk of bias across studies when rating the certainty of the evidence for each outcome using GRADE.
Data analysis
When two or more outcome-comparisons are sufficiently similar, we will pool their data. The decision to pool studies will not be based solely on statistical heterogeneity; the I2 statistic will be reported, but it is recognized that I2 is influenced by the number of studies and the magnitude and direction of effects (64). Rather, we will rely on interpretations of the clinical (related to our PICOTS, e.g., definition of a positive screening result) and methodological differences between studies. For pairwise meta-analysis, when there are large differences in trial sizes and potential publication bias or within-study bias in smaller studies (65), our main analyses will employ a fixed-effects model in Stata. If these factors are not apparent, we will use a random-effects model. We will use the DerSimonian-Laird method unless events are rare (< 1%), in which case we will pool odds ratios (ORs) using Peto's method or (if there are zero events) the reciprocal of the opposite treatment arm size correction. For dichotomous outcomes, we will analyze and report data using risk ratios (RRs) and their 95% confidence intervals (95% CIs), unless ORs are used with the Peto method, in which case we will convert ORs to RRs using control event rates. For continuous outcomes, we will report a pooled mean difference using change scores when a single measurement tool is used. We will use a standardized mean difference when combining two or more outcome scales measuring similar constructs, based on clinical input. If suitable, we will transform the results to either a mean difference or a ratio to assist interpretation (66). For pooling proportions, which we anticipate for most harms, we will apply a suitable transformation (logit or arcsine) depending on the proportions of events and use a random-effects model (67). Pooling of mean health-state utilities will use a random-effects model with inverse-variance weighting. If we are not able to use a study's data in a meta-analysis (e.g., only p values are reported), we will comment on these findings and compare them with the results of the meta-analysis. Analyses will be performed using Microsoft Excel, Review Manager (version 5.3), and Stata (version 14.2 or higher). Relative effects will be transformed to absolute effects using the pooled control event rates across the included studies. Based on clinical input, we may also assume one or more different control/baseline rates for estimating absolute effects in a lower- and/or higher-risk population. For mortality outcomes having statistically significant effects, we will calculate the number needed to screen and its 95% CI.
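To illustrate the random-effects pooling and the translation of relative into absolute effects described above, the sketch below (Python, standard library only) uses hypothetical trial counts; the function name and numbers are illustrative assumptions, the Peto method and OR-to-RR conversion are not shown, and the review's analyses will instead be run in Stata and Review Manager as stated.

```python
import math

def pool_log_rr_dl(events_t, n_t, events_c, n_c):
    """DerSimonian-Laird random-effects pooling of log risk ratios.
    Inputs are per-study counts; returns the pooled RR and its 95% CI.
    A simplified sketch (no zero-cell handling), not the protocol's software."""
    yi, vi = [], []
    for a, n1, c, n0 in zip(events_t, n_t, events_c, n_c):
        yi.append(math.log((a / n1) / (c / n0)))        # log risk ratio
        vi.append(1 / a - 1 / n1 + 1 / c - 1 / n0)      # variance of log RR
    wi = [1 / v for v in vi]                            # fixed-effect weights
    fixed = sum(w * y for w, y in zip(wi, yi)) / sum(wi)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(wi, yi))
    df = len(yi) - 1
    c_val = sum(wi) - sum(w ** 2 for w in wi) / sum(wi)
    tau2 = max(0.0, (q - df) / c_val)                   # DL between-study variance
    wi_re = [1 / (v + tau2) for v in vi]                # random-effects weights
    mu = sum(w * y for w, y in zip(wi_re, yi)) / sum(wi_re)
    se = math.sqrt(1 / sum(wi_re))
    return math.exp(mu), (math.exp(mu - 1.96 * se), math.exp(mu + 1.96 * se))

# Hypothetical counts (deaths / randomized) for three trials
deaths_screen, n_screen = [80, 45, 120], [10000, 6000, 15000]
deaths_control, n_control = [100, 60, 150], [10000, 6000, 15000]
rr, (lo, hi) = pool_log_rr_dl(deaths_screen, n_screen, deaths_control, n_control)

# Translate the pooled relative effect into absolute terms using the pooled
# control event rate, and derive a number needed to screen (NNS)
cer = sum(deaths_control) / sum(n_control)
arr = cer * (1 - rr)            # absolute risk reduction
nns = 1 / arr
print(f"RR {rr:.2f} (95% CI {lo:.2f} to {hi:.2f}); ARR {arr:.4f}; NNS {round(nns)}")
```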
If meta-analysis is not undertaken, we will synthesize the data descriptively. We will use various techniques as described for narrative syntheses, such as creating an overall synopsis of each study, including their characteristics and reported findings, and describing relationships within and between studies focusing on our exposure subgroups and outcome comparisons of interest and other factors such as methodological quality (68).
Unit of analysis issues
In the event of the inclusion of cluster-randomized trials, we will take appropriate measures to avoid unit-of-analysis errors when reporting their findings and/or incorporating them into meta-analysis (69). When available, we will use the intracluster correlation coefficient reported in the trial to apply a design effect to the sample size and number of events in each of the treatment and control groups (70). If not reported, we will use an external estimate from similar studies. We will clearly identify cluster-randomized trial data when it is included in meta-analysis with individually randomized trials.
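As a brief illustration of the design-effect adjustment described above, the sketch below (Python; the function name, counts, average cluster size, and ICC are hypothetical assumptions) shows how an arm's sample size and event count can be deflated before entry into a standard meta-analysis.

```python
def effective_counts(events: int, n: int, cluster_size: float, icc: float):
    """Adjust one arm of a cluster-randomized trial by the design effect
    DE = 1 + (m - 1) * ICC, returning an effective event count and sample
    size for use in an individual-level meta-analysis (hypothetical inputs)."""
    design_effect = 1 + (cluster_size - 1) * icc
    return events / design_effect, n / design_effect

# Hypothetical arm: 120 events among 8000 participants,
# average cluster size 40, assumed ICC 0.02 -> design effect 1.78
eff_events, eff_n = effective_counts(120, 8000, 40, 0.02)
print(round(eff_events, 1), round(eff_n, 1))  # ~67.4 events, ~4494.4 participants
```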
Assessment of heterogeneity
When statistical heterogeneity in the direction of effects is seen across studies, we will conduct subgroup (stratified) analyses using variables associated with the population (the specific populations in Tables 1–3), the intervention (e.g., screening interval) or exposure (e.g., scale measuring utilities), or the follow-up duration. We will also consider sensitivity analyses removing studies at high risk of bias, data from unpublished studies, or studies for which we needed to impute measures of variance or adjust for clustering. Subgroups will be tested for statistical significance and the credibility of the results interpreted using available guidance (71). We will also extract results from within-study analyses related to our specified variables of interest.
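One common way to test subgroup differences is a fixed-effect Q test across the pooled subgroup estimates; the sketch below (Python, with hypothetical subgroup labels and numbers) illustrates that test only, and the actual analyses and credibility assessment will follow the cited guidance (71).

```python
def subgroup_difference_test(estimates, ses):
    """Fixed-effect test for subgroup differences: Cochran's Q across the
    pooled subgroup estimates (e.g., log RRs), compared against a chi-square
    distribution with G - 1 degrees of freedom. Hypothetical inputs only."""
    w = [1 / s ** 2 for s in ses]                     # inverse-variance weights
    overall = sum(wi * est for wi, est in zip(w, estimates)) / sum(w)
    q_between = sum(wi * (est - overall) ** 2 for wi, est in zip(w, estimates))
    return q_between, len(estimates) - 1              # Q statistic and df

# Hypothetical pooled log RRs and SEs for two subgroups
# (e.g., current vs. former smokers)
print(subgroup_difference_test([-0.25, -0.10], [0.07, 0.11]))
```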
Small study bias
When meta-analyses of trials contain at least 10 studies of varying size, we will test for small study bias visually by inspecting funnel plots for asymmetry and statistically via Egger's test (continuous outcomes) (72) or Harbord's test (dichotomous outcomes) (73).
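For illustration, the sketch below implements the classical Egger regression asymmetry test with plain least squares on hypothetical effect estimates and standard errors; the review itself will use established implementations (e.g., in Stata), and Harbord's modified test for dichotomous outcomes is not shown.

```python
import math

def egger_test(effects, ses):
    """Egger's regression asymmetry test: regress the standardized effect
    (effect / SE) on precision (1 / SE) and examine the intercept.
    A plain-OLS sketch with hypothetical data, for illustration only."""
    y = [e / s for e, s in zip(effects, ses)]   # standardized effects
    x = [1 / s for s in ses]                    # precisions
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_int = math.sqrt(s2 * (1 / n + xbar ** 2 / sxx))
    t = intercept / se_int   # compare against a t distribution with n - 2 df
    return intercept, se_int, t

# Hypothetical log risk ratios and standard errors from 10 trials
effects = [-0.22, -0.18, -0.30, -0.05, -0.45, -0.10, -0.60, -0.02, -0.35, -0.15]
ses = [0.06, 0.09, 0.12, 0.15, 0.20, 0.22, 0.28, 0.30, 0.33, 0.35]
print(egger_test(effects, ses))
```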
Rating the certainty
We will use GRADE methods to assess the certainty of evidence for all outcomes (41, 43, 44, 74). In cases where studies of interventions cannot be pooled in meta-analysis, we will use guidance for rating the certainty of evidence in the absence of a single estimate of effect (75). Two reviewers will independently assess the certainty of evidence for each outcome and agree on the final assessments. A third reviewer will arbitrate if necessary.
We will assess the certainty of evidence (very low, low, moderate, or high) based on five domains: study limitations (risk of bias), inconsistency of results, indirectness of evidence, imprecision, and reporting biases (small study bias or missing outcome data). For cancer incidence, mortality, and HRQoL, RCTs will start at high certainty and nonrandomized studies will start at low certainty. False positives, incidental findings, and complications and mortality from invasive procedures resulting from screening are most relevant to the screening group, and therefore certainty will start at high for the studies reporting these outcomes, except when a comparison between different screening approaches is the focus. For psychosocial outcomes, findings from people in a control group would be beneficial for interpreting data and attributing the findings to the screening intervention, such that findings from nonrandomized studies will start at low certainty. Rating up will be considered when relying on nonrandomized studies if there are no serious concerns about the other domains (76). Studies measuring preferences and health-state utilities will start at high certainty. Unless an outcome is measured using an instrument (e.g., for HRQoL) that has a known minimally important difference around which to base our conclusions and certainty, we will initially apply a minimally contextualized approach, whereby we will rate our certainty in the direction of effect (i.e., relative to the null effect) rather than in a particular magnitude of effect (74). Rather than relying on statistical significance, a threshold for a minimal effect (i.e., to determine whether results very close to the null represent little-to-no difference) may be chosen before the task force reviews the results. Upon examining the findings, the task force may decide to adopt a partially or fully contextualized approach, using one or more thresholds (e.g., for small and moderate magnitudes of effect) and considering multiple outcomes simultaneously. In such a case, the assessment of heterogeneity (i.e., by magnitude) and the certainty ratings will be revised accordingly.
We will prepare GRADE Summary of Findings tables, by outcome for each comparison, including explanations for all decisions.
Task force involvement
The task force and clinical experts will not be involved in the selection of studies, extraction of data, appraisal of risk of bias, nor synthesis of data, but will contribute to the interpretation of the findings and comment on the draft report. Clinical experts and/or task force members may be called upon to contribute to the identification of thresholds and the certainty of evidence appraisals, e.g., to interpret directness (applicability) of included studies to the population of interest for the recommendation.