Systematic Review Conduct
The Evidence Review and Synthesis Centre (ERSC) at the University of Alberta (AG, JP, DK-L, BV, LH) will conduct the systematic reviews on behalf of the Task Force following the research methods outlined in the Task Force methods manual [83]. We will follow a pre-defined protocol, reported in accordance with current standards (Supplementary File 1) [84], as documented herein. During protocol development, a working group was formed consisting of Task Force members (DR, CK, AM, GTh, BDT), with input from clinical experts (JL, CP, DvN), and scientific support from the Global Health and Guidelines Division at the Public Health Agency of Canada (RS, GTr). The working group contributed to the development of the Key Questions (KQs) and PICOTS (population, intervention(s) or exposure(s), comparator(s), outcomes, timing, setting, and study design) elements.
Task Force members made the final decisions with regard to the KQs and PICOTS. Task Force members and clinical experts rated the proposed outcomes based on their importance for clinical decision-making, according to the methods of Grading of Recommendations Assessment, Development and Evaluation (GRADE) [85]. Ratings by the clinical experts were solicited to ensure acceptable alignment with the views of Task Force working members (clinical decision-makers), but Task Force members determined the final ratings. Final critical outcomes (rated 7 or above on a 9-point scale) pertaining to the effectiveness and comparative effectiveness of screening included: the rate of ICC, cervical cancer mortality, all-cause mortality, the rate of CIN 2 and CIN 3, and overdiagnosis of CIN 2, CIN 3, and ICC. Final important outcomes (rated 4–6) for inclusion were: the number and rate of colposcopy and/or biopsy (or referral rate), adverse pregnancy-related outcomes from conservative management of CIN, and the false-positive rates for detecting CIN 2, CIN 3, and ICC. These outcomes are defined in Supplementary File 2. Other outcomes relevant to comparative accuracy, values, preferences, and the effectiveness of interventions to improve screening rates were selected by the Task Force working members in collaboration with the ERSC. The classification of benefit or harm for all outcomes will be based on the effects observed for different comparisons.
This version of the protocol was reviewed by the entire Task Force. Stakeholders (n = 17) reviewed a draft version of this protocol, and all comments were considered (comments available at: [url pending]). Throughout the conduct of the systematic reviews, we will document any changes to the protocol (including timing), with justification. We will report on these within the final report. We will report our findings in accordance with the standards available at the time of writing (i.e., the 2009 version of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Statement [86], or an updated version should one become available prior to submission of the final report).
Key Questions and Analytical Framework
The Task Force has delineated five KQs to inform their recommendations, as follows:
KQ 1: What are the effectiveness (benefits and harms) and comparative effectiveness of different screening strategies for the prevention and early detection of cervical cancer?
KQ 1a: Do the effectiveness and comparative effectiveness of different screening strategies for the prevention and early detection of cervical cancer differ by age or by other population subgroups?
KQ 2: What is the comparative accuracy of screening tests for the prevention and early detection of cervical cancer?
KQ 2a: Does the comparative accuracy of screening tests differ by age or by HPV vaccination status?
KQ 3: What are the adverse pregnancy outcomes associated with conservative management of CIN? (NB. will not require a new or updated systematic review)
KQ 4: What is the relative importance individuals place on the potential outcomes from screening for the prevention and early detection of cervical cancer?
KQ 5: What is the effectiveness of primary care-based interventions to increase rates of screening for the prevention and early detection of cervical cancer for under- and never-screened individuals?
For the purpose of these reviews, we will consider effectiveness to include both benefits and harms. The analytical framework in Figure 2 shows the population (and population subgroups), KQs, and outcomes in the context of the screening, diagnosis, management, and treatment modalities under consideration.
The systematic reviews for KQs 1 and 2 focus on the effectiveness and comparative effectiveness (KQ 1) and the comparative accuracy (KQ 2) of various screening strategies. The intent for KQ 2 is to fill gaps for the outcomes from KQ 1. The main goal is to compare detection rates and harms (i.e., false positives, false negatives) between different screening strategies, and to provide indirect evidence for KQ 1 with respect to false-positive rates, as we expect evidence for this outcome to be of low or very low certainty from studies contributing to KQ 1. It may also provide information about the comparative accuracy of screening tests not studied in KQ 1 to help determine if these may be appropriate to use in practice in the absence of KQ 1 evidence.
KQ 3 focuses on the adverse pregnancy outcomes (the only direct treatment or management harm rated as important by the working group) associated with conservative management of CIN 2 and CIN 3. The intent of this KQ is to fill gaps for adverse pregnancy outcomes identified in the studies for KQ 1. The rationale for a separate KQ is that adverse pregnancy outcomes are unlikely to be reported in studies focusing primarily on screening effectiveness. In the United States Preventive Services Task Force (USPSTF) 2018 review of screening for cervical cancer with hrHPV testing [65], none of the included screening trials (n = 8) [52-59, 87-97] reported on adverse pregnancy outcomes.
The ERSC will not undertake de novo searches or syntheses for KQ 3. During protocol development, a research librarian undertook a comprehensive search of existing systematic reviews published between 2014 and March 2019. These systematic reviews were scrutinized for suitability, with careful consideration for the comprehensiveness of their searches, scope (i.e., ability to capture the studies of interest), and reporting quality. We identified two Cochrane systematic reviews, published in 2015 [98] and 2017 [99], that answer our KQ 3. The Cochrane Review Group has confirmed that both reviews are presently being updated to incorporate the latest evidence, and these reviews will be used by the Task Force.
Of the two Cochrane reviews, the review by Kyrgiou et al. published in 2015 [98] synthesized evidence on fertility and early pregnancy outcomes (i.e., pregnancy rates, miscarriage rates, ectopic pregnancies) following conservative excisional or ablative management of CIN. Fifteen observational studies (>2 million participants) were included. The review by Kyrgiou et al. published in 2017 [99] synthesized evidence on obstetric outcomes (i.e., preterm birth, low birth weight, cervical cerclage) following conservative excisional or ablative management of CIN. Sixty-nine observational studies (>6 million participants) were included. Given the observational design of the available evidence, both reviews reported very low to low certainty evidence for the effects of the interventions on our outcomes of interest. Because it would be unethical to conduct RCTs to address this question (i.e., randomizing women with CIN to a non-treatment control group), the probability of identifying a newly published trial that will improve the certainty of evidence for the outcomes of interest is virtually zero. Additional observational evidence is also unlikely to improve the certainty of evidence, but could affect the pooled effect estimates. As the two systematic reviews are presently undergoing updates, the Task Force will rely on them to inform KQ 3 and avoid duplication of research effort. The ERSC will review, contextualize, and summarize the available evidence (i.e., in text, tables, and figures) from the two Cochrane systematic reviews to facilitate interpretation by the Task Force during guideline development.
The review for KQ 4 will synthesize evidence of the relative importance individuals place on the outcomes from cervical screening (independent of the screening strategy) [100, 101], including all critical and important outcomes as defined for KQ 1 (Table 2). It will also provide information to the Task Force on whether there is important uncertainty about or variability in how much people value the main outcomes [100].
Table 2. Eligibility criteria for Key Question 1 (effectiveness and comparative effectiveness)
Criterion
|
Inclusion
|
Exclusion
|
Population
|
Individuals with a cervix, 15 years of age and older, who have been sexually active and who have no symptoms of cervical cancer*
*We will include studies where up to 25% of the participants had a recent abnormal screening result.
Population subgroups:
- By age group (15–19, 20–24, 25–29, 30–69, 70+)
- Risk groups: immunocompromised (e.g., HIV, organ transplantation, chemotherapy or chronic use of corticosteroids, use of disease-modifying anti-rheumatic drugs or biologics); risk behaviours (e.g., early sexual debut, women who have sex with women, individuals who have multiple sexual partners, smoking); under- or never-screened (e.g., transgender individuals, individuals with a history of trauma or abuse); Indigenous peoples; rural populations; immigrants; race or ethnicity; low socio-economic status; pregnant individuals; HPV vaccinated populations
|
Study population includes >25% individuals with recent abnormal screening result
|
Intervention
|
Any screening strategy using hrHPV tests and/or cytology with subsequent follow-up of abnormal tests:
- Primary screening with cytology (conventional or liquid-based)
- Primary screening with hrHPV testing
- Cytology screening, which if abnormal may be followed by triage with an hrHPV test
- hrHPV screening, which if positive may be followed by triage with cytology or other hrHPV test (e.g., full genotyping)
- Other combinations will be considered
|
HPV test using in-situ hybridization, p16 immunostaining or HPV viral load
Urine for sample collection
Point of care tests
Co-testing as a strategy (although we will include relevant data for the individual strategies where suitable)
|
Comparator
|
Effectiveness:
No routine screening
Comparative effectiveness:
Any screening strategy differing by one or more of the following factors:
- Screening test strategy
- Screening interval
- Universal vs. selective/targeted (e.g., starting age)
- Method of sample collection (e.g., self-collection** (at home vs. in clinic) vs. health provider collection)
- Protocol for evaluation of abnormal screening results (e.g., criteria for immediate colposcopy)
**Different samples or methods of sample collection
|
|
Outcomes
|
Critical outcomes:
- Incidence of invasive cervical cancer (squamous and adenocarcinoma)
- Incidence of cervical intraepithelial neoplasia (CIN) 2 and CIN 3***
- Cervical cancer mortality
- All-cause mortality
- Overdiagnosis of CIN 2, CIN 3, and invasive cervical cancer***
Important outcomes:
- Number and rates of colposcopy and/or biopsy, including LEEP and other treatments provided during colposcopy (or referral rate) (for comparative effectiveness)
- Adverse pregnancy outcomes from conservative, local management of CIN
- False-positive rate for detecting CIN 2 and CIN 3 and invasive cancer***
***The ability to report and analyze findings by CIN 2, CIN 3, and invasive cervical cancer will be determined after reviewing the outcomes used in the identified studies (e.g., CIN 2+ and CIN 3+ will be considered if necessary, and may be considered indirect)
|
|
Timing
|
No limitation on the duration of follow-up; results will be reported by screening round and longest follow-up
|
|
Setting
|
Studies from Very High Human Development Index countries
|
|
Study design
|
- Randomized controlled trials
- If insufficient data from randomized controlled trials (by comparison and outcome): non-randomized studies (controlled trials, before-after studies, interrupted time series, individual patient data meta-analysis, cohort studies, case control studies)
|
Conference proceedings; government reports; systematic reviews; case reports; editorials
|
Language
|
English or French
|
|
Publication date
|
1995–present
|
|
Abbreviations: CIN: cervical intraepithelial neoplasia; HIV: human immunodeficiency virus; HPV: human papillomavirus; LEEP: loop electrosurgical excisional procedure
Given that certain Canadian sub-populations remain under-screened or never-screened despite recommendations for cervical screening, the review for KQ 5 will inform primary care interventions that may improve screening rates.
Eligibility Criteria
Tables 2 to 5 show the PICOTS elements for KQs 1, 2, 4, and 5. These are described in detail in Supplementary File 3. Given that we will not undertake de novo synthesis for KQ 3, we have not included PICOTS elements for this KQ.
Table 3. Eligibility criteria for Key Question 2 (comparative diagnostic accuracy)
Criterion
|
Inclusion
|
Exclusion
|
Population
|
Individuals with a cervix, 15 years of age and older, who have been sexually active and who have no symptoms of cervical cancer*
*We will include studies where up to 25% of the participants had a recent abnormal screening result.
Population subgroups:
- By age group (15–19, 20–24, 25–29, 30–69, 70+)
- HPV vaccinated populations
|
Study population includes >25% individuals with recent abnormal screening results
|
Index screening test
|
- Primary high-risk HPV testing with HPV nucleic acid tests** alone
- High-risk HPV testing with HPV nucleic acid tests, followed by some form of triage (e.g., cytology or HPV testing with partial genotyping for HPV 16 or 18, sequential partial genotyping for HPV 16 or 18 followed by cytology to further triage those positive for HPV 16 or 18).
Subgroups:
- Method of sample collection for high-risk HPV testing (i.e., self-collected (home vs. in clinic) vs. clinician-collected)
- Type of assay (i.e., generic, partial genotyping, full genotyping)
- HPV test threshold for a positive result (e.g., 1 pg/mL, 2 pg/mL)
**Eligible HPV tests include generic assays, as well as partial and full genotyping assays able to detect at least some high-risk HPV genotypes (e.g., HPV 16, 18, 31, 33, 35, 39, 45, 51, 52, 56, 58, 59, 68) and available commercially in Canada or reasonably perceived to potentially be available in Canada. Examples of eligible high-risk HPV tests include the Cobas 4800 HPV Amplification/Detection Kit (Roche Molecular Systems, Inc.), Linear Array HPV Genotyping Test (Roche), Aptima HPV assay (Hologic, Inc.), Aptima HPV 16 18/45 Genotype Assay (Hologic), Cervista HPV HR assay (Hologic), Abbott RealTime High-Risk HPV (Abbott Molecular), Digene Hybrid Capture 2 HPV Test (Qiagen Sciences LLC), Xpert HPV test (Cepheid).
|
HPV test using in-situ hybridization, p16 immunostaining or HPV viral load
Earlier versions of commercial tests that have been replaced (e.g., Hybrid Capture 1)
Urine for sample collection
Point of care tests
|
Comparator screening test
|
- Conventional or liquid-based cytology, with or without follow-up by high-risk HPV testing
- High-risk HPV testing with HPV nucleic acid tests, followed by different form of triage than in the index test
- hrHPV testing with HPV nucleic acid tests, using a different method of sample collection (i.e., self-sampled (home vs. clinic) vs. clinician-sampled)
|
Visual inspection with acetic acid or visual inspection with Lugol’s iodine
|
Reference standards
|
- Colposcopy with histologic examination of tissue specimens, when indicated.
- Study protocol stipulates that reference standard is applied to:
- All patients, or
- All screening test-positive patients and a subset (e.g. random 10%) of screening test-negative patients
|
Reference standard only applied to screen-positive patients
|
Outcomes & target conditions
|
Diagnostic test accuracy:
Number and proportion of people positive and negative on each test (TP, FP, TN, FN), sensitivity and specificity to screen for high-grade cervical lesions (CIN 2, CIN 3, HSIL) and/or invasive cervical cancer (squamous cell carcinoma or adenocarcinoma)
|
|
Timing of reference standard
|
Reference standard test performed before any management based on the index test result
|
|
Setting
|
Studies from Very High Human Development Index countries
|
|
Study design
|
- Observational studies (e.g., prospective or retrospective cohorts, or cross sectional studies) in which all participants receive both the index and comparator screening test, followed by verification of disease status using the reference standard in all patients or in all screening test-positive patients and a subset (e.g., random 10%) of screening test-negative patients
- Randomized controlled trials where participants are randomized to different screening tests but all receive the same reference standard
|
Conference proceedings; government reports; systematic reviews; case reports; case-control studies; editorials
|
Language
|
English or French
|
|
Publication date
|
1995–present
|
|
Abbreviations: AGC: atypical glandular cells; AIS: adenocarcinoma in situ; CIN: cervical intraepithelial neoplasia; FN: false negative; FP: false positive; HPV: human papillomavirus; HSIL: high-grade squamous intraepithelial lesions; TN: true negative; TP: true positive
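To make the accuracy outcomes in Table 3 concrete, the sketch below shows how sensitivity, specificity, and the false-positive rate derive from the 2×2 counts (TP, FP, TN, FN); the counts are hypothetical, chosen only for illustration:

```python
# Illustrative only: how the accuracy outcomes in Table 3 (TP, FP, TN, FN,
# sensitivity, specificity) relate to one another. Counts are hypothetical.

def accuracy_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return basic diagnostic accuracy measures from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),          # proportion with disease who test positive
        "specificity": tn / (tn + fp),          # proportion without disease who test negative
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
    }

# Hypothetical cohort: 1000 screened, 50 with CIN 2+ on the reference standard
m = accuracy_measures(tp=45, fp=95, tn=855, fn=5)
print(m["sensitivity"], m["specificity"])  # 0.9 0.9
```

Note that verification bias is why Table 3 requires the reference standard in all (or a random subset of) test-negative participants: without the TN and FN cells, neither measure can be computed.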
Table 4. Eligibility criteria for Key Question 4 (patient values and preferences)
Criteria
|
Inclusion
|
Exclusion
|
Population
|
Individuals with a cervix, or who have had their cervix removed as part of treatment for cervical cancer, 15 years of age and older (patients and the general public)
Population subgroups:
- Age (15–19, 20–24, 25–29, 30–69, 70+)
- Risk groups: immunocompromised (e.g., HIV, organ transplantation, chemotherapy or chronic use of corticosteroids, use of disease-modifying anti-rheumatic drugs or biologics), risk behaviours (e.g., early sexual debut, women who have sex with women, individuals who have multiple sexual partners, smoking), Indigenous peoples, rural populations, immigrants, low socio-economic status (SES), pregnant individuals, HPV vaccinated populations
- Previous screening history (regular as per guidance vs. not regular (under-screened) vs. never-screened)
|
|
Exposures
|
- Experience with critical outcome(s) related to screening or,
- Exposure to clinical scenarios or information about potential critical outcomes and/or estimates of effect on outcome risks from screening, or
- No experience or exposure to information about outcomes, but authors are soliciting probability trade-offs or ratings of different potential critical outcomes (e.g., number of biopsies acceptable to prevent one early diagnosis of invasive cervical cancer)
- Focus of study is on consideration of possible, or assessment of experienced, outcomes from screening.
Exposure moderators: differing descriptions or experience of outcomes in terms of stage, treatments received, severity, time since diagnosis (immediate vs first year vs later years); number of outcomes considered; differing estimates of magnitudes of effect from screening (if applicable)
|
Apart from studies with direct (e.g., time trade-off) or indirect (e.g., based on EQ-5D) measurement of health state utilities, participants need to consider at least one outcome that may be a harm from screening (e.g., false positives, overdiagnosis [e.g., hrHPV+ but will never develop cancer], increased CIN 2+ detection).
Focus on the harms from management of lesions or cancer.
|
Comparisons
|
- Different critical outcome or groups of outcomes (e.g., critical benefits vs harms)
- Healthy state without outcome (for utility studies)
- No comparison (for utility studies)
- No or another intervention, if applicable for interpreting outcome importance, i.e., no screening, another screening strategy (e.g., having different magnitude of effects), no information (e.g., in studies using decision aids).
When only one arm (e.g., receiving decision aid) of a comparative study is used for interpreting data on patient preferences, the study will be classified as a non-comparative study.
|
|
Outcomes
|
a) Utility values/weights for the potential outcomes from screening
b) Non-utility, quantitative information about the relative importance of different outcomes (e.g., rating scales using ordinal or interval variables, ranking; preference for or against screening [screening attendance, intentions, or acceptance] or preferred screening strategy based on different outcome risk descriptions; strength of associations between outcome ratings and screening behaviours or intentions)
c) Qualitative information indicating relative importance between outcomes
- Rank-order of importance of outcomes, based on data from a) to c) above, as applicable.
Data must relate to the outcomes considered critical to the Task Force. Outcome groupings a) to c) above will be included in a hierarchical manner for each critical screening outcome.
|
|
Timing
|
Follow-up duration: any or none
|
|
Setting
|
Any setting in Very High Human Development Index countries
|
|
Study Design and Publication Status
|
Any quantitative or qualitative study design using the methods described below:
- Utility values/weights measured directly using time trade-off, standard gamble**, visual analogue scales, conjoint analysis with choice experiments* or probability trade-offs
- Utility values/weights measured or estimated indirectly, e.g., from transforming several health state domains from multi-attribute utility indexes such as EQ-5D to utilities using general population preferences, including mapping from generic or disease-specific health-related quality of life instruments
- Non-utility, quantitative information about relative importance of different outcomes, e.g., rating scales using ordinal or interval variables, ranking; preference for or against screening (screening attendance, intentions, or acceptance) or preferred screening strategy based on different outcome risk descriptions; strength of associations between outcome ratings and screening behaviours or intentions
- Qualitative information indicating relative importance between benefits and harms
- Rank-order of importance of outcomes
*Conjoint analysis with choice experiments measures the value placed on attributes of a commodity by requiring individuals to choose between different scenarios, where in each scenario the commodity in question has varying levels of different attributes.
**Standard gamble approaches require that respondents choose between a lifetime in a certain health state or a gamble between different health states, whereas time trade-off requires respondents to choose between living for a period in less than perfect health, as opposed to a shorter period in perfect health.
|
Conference proceedings; government reports; editorials
|
Language
|
English or French
|
|
Publication date
|
2000–present
|
|
Abbreviations: HIV: human immunodeficiency virus; HPV: human papillomavirus; SES: socio-economic status
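As background to the elicitation methods footnoted in Table 4, the sketch below shows the standard textbook formulations of time trade-off and standard gamble utilities; all numeric values are hypothetical and not drawn from the protocol:

```python
# Illustrative only: standard formulations of two utility elicitation methods
# named in Table 4. All numbers are hypothetical.

def tto_utility(time_in_state: float, equivalent_time_healthy: float) -> float:
    """Time trade-off: utility = x / t, where the respondent is indifferent
    between t years in the health state and x years in full health (x <= t)."""
    return equivalent_time_healthy / time_in_state

def sg_utility(indifference_probability: float) -> float:
    """Standard gamble: utility equals the probability p of full health at which
    the respondent is indifferent between the certain health state and a gamble
    (full health with probability p, death with probability 1 - p)."""
    return indifference_probability

# Indifferent between 10 years in the post-treatment state and 8 years healthy:
print(tto_utility(10, 8))  # 0.8
# The same state valued via standard gamble at an indifference probability of 0.85:
print(sg_utility(0.85))    # 0.85
```

Both methods anchor utilities on a 0 (death) to 1 (full health) scale, which is what allows the review to compare values across studies and instruments.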
Table 5. Eligibility criteria for Key Question 5 (interventions to increase screening rates)
Criterion
|
Inclusion
|
Exclusion
|
Population
|
Individuals with a cervix who would meet the criteria for cervical cancer screening, but who have never been screened or who have been under-screened, as defined by the study authors, when assessed against current screening recommendations (e.g., for screening interval).
Population subgroups:
- Indigenous peoples
- Immigrant groups
- Rural populations
- Low socioeconomic status populations
|
Individuals with symptoms of cervical cancer or previous abnormal test results on cervical screening (unless cleared to return to normal screening)
Individuals who have had complete surgical removal of the cervix
|
Intervention
|
- Mail-out or opt-in (invitation to request) self-sampling for hrHPV screening
- Other interventions aimed at individuals or primary care providers with the intent to increase acceptability of screening (e.g., screening reminders, education, counselling, provider recommendation, addressing cultural practices and beliefs, patient-provider communication)
|
Interventions not targeted to primary care providers or feasible for primary care to deliver to their patients (e.g., community or lay health workers, community distribution of HPV self-sampling kits)
|
Comparator
|
- No intervention
- Routine care (could include reminders or invitations to screen, or other forms of minimal intervention like pamphlets, posters)
|
|
Outcomes
|
Screening rate
|
|
Timing
|
No limitation on the duration of follow-up
|
|
Setting
|
- Primary care settings or settings available through primary care referral (note we will not exclude primary care interventions that are implemented alongside or in support of broader public health initiatives (e.g., reminders))
- Studies involving populations from Very High Development Index countries
|
|
Study design
|
- Randomized controlled trials
- Non-randomized trials and cohort studies (will only be considered if there are no data available from randomized controlled trials)
|
Conference proceedings; government reports; case series; case reports; case-control studies; editorials
|
Language
|
English or French
|
|
Publication date
|
2000–present
|
|
Abbreviations: HPV: human papillomavirus
Efficiencies by Integrating or Using Existing Systematic Reviews
Where possible, we will either update one or more existing systematic reviews or, if we are not aware of systematic reviews that are good candidates for an update, integrate studies from existing systematic reviews [102]. When available, we may use existing high-quality, up-to-date systematic reviews as is, without de novo searches or syntheses, if they align well with the scope of our KQs and PICOTS elements (fully or in part, i.e., for one of multiple eligible comparisons, as noted above for KQ 3). In this case, we will contextualize and summarize the available evidence and perform certainty of evidence appraisals (based on information reported in the review) as needed to facilitate interpretation by the Task Force during guideline development. For the integration approach (detailed in Supplementary File 3), we will identify relevant studies in multiple previously published systematic reviews, and develop and run update searches to the present to identify contemporary studies not included in earlier reviews. The existing reviews will be used primarily to locate primary studies; although we may rely on the reviews' reporting for some data extraction or risk of bias assessments, we will re-analyze the data using the primary studies and assess the overall certainty of the evidence in all cases. To identify potential candidate reviews, we undertook a comprehensive search for relevant systematic reviews published between 2014 and March 2019 and scrutinized each for suitability. Important considerations included the comprehensiveness of the original searches, the scope of the review (i.e., ability to capture the studies of interest), and the reporting quality. Details of the reviews that we will use as a source of studies are in Supplementary File 4.
Literature Searches
We developed all database searches in collaboration with our research librarian. The searches, available in Supplementary File 5 for KQs 1, 2, and 4, have been peer reviewed by an external librarian according to PRESS (Peer Review of Electronic Search Strategies) guidance [103]. The searches for KQ 5 will be updated from previous reviews [61, 104, 105], with adaptations as needed. Unless otherwise indicated, all searches will be limited to studies published in English or French. We will not apply geographic filters to any of the searches. For KQ 1, we will contact five content experts by e-mail to inquire about their knowledge of additional relevant studies. We will contact each expert twice, two weeks apart, before ceasing contact if we do not receive a reply. In all cases we will also search the reference lists of the included studies and of relevant systematic reviews identified during screening for additional records. We will search ClinicalTrials.gov and the World Health Organization International Clinical Trials Registry Platform for ongoing trials. Although we will exclude studies available only as conference proceedings, letters, or abstracts, we will contact the corresponding authors twice, two weeks apart, to ask about relevant full reports before ceasing contact if we do not receive a reply. The following are details of the strategies specific to each KQ. The results of the electronic database searches for all KQs will ultimately be combined into a single database (removing duplicates) to create efficiencies in screening (due to inevitable overlap across the searches).
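The final step above (combining the database search results into a single de-duplicated set before screening) can be sketched as follows; the matching key (normalized title plus publication year) is a hypothetical heuristic for illustration, not the protocol's specified method:

```python
# Illustrative only: merging search results from several databases into one
# de-duplicated set. The normalized-title + year key is a hypothetical heuristic.
import re

def normalize(title: str) -> str:
    """Lowercase and strip non-alphanumerics so formatting differences match."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def merge_searches(*result_sets):
    seen, merged = set(), []
    for records in result_sets:  # one list of records per database search
        for rec in records:
            key = (normalize(rec["title"]), rec["year"])
            if key not in seen:  # keep only the first copy of each record
                seen.add(key)
                merged.append(rec)
    return merged

medline = [{"title": "HPV screening trial", "year": 2015}]
embase = [{"title": "HPV Screening Trial.", "year": 2015},
          {"title": "Cytology follow-up study", "year": 2018}]
print(len(merge_searches(medline, embase)))  # 2
```

In practice this step is done in reference-management software (EndNote, as described below), but the logic is the same: one screening database, each record counted once despite overlap across the searches.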
For KQ 1, we will search Ovid Medline (1946-), Ovid Embase (1996-), and Cochrane Central (1996-) from 1995 onward using MeSH terms and keywords for cervical cancer and screening, and study design filters for RCTs and observational studies. We have chosen to develop and run de novo searches rather than updating the searches from the 2013 CTFPHC guideline review because that review did not include the incidence of CIN as an outcome, nor screening with hrHPV.
For KQ 2, we will integrate studies from the 2019 health technology assessment (HTA) on HPV testing for primary screening for the prevention and early detection of cervical cancer by the Canadian Agency for Drugs and Technologies in Health (CADTH) [61], and the 2018 systematic review by Arbyn et al. [104] on the comparative accuracy of self- versus clinician-sampled hrHPV tests. We will update the searches for the CADTH review in Ovid Medline (1946-), Ovid Embase (1996-), and Cochrane Central (1996-) from 2016 onward to identify studies published after the last date searched (March 2017 for the full search), undertaking edits to the searches as necessary (e.g., removing concepts that are not relevant to our KQ 2). We will update the searches for the Arbyn et al. (2018) review in the same databases from 2017 onward (last date searched, April 2018). We anticipate the possibility that an update to the systematic review by Arbyn et al. (2018) may become available before we undertake our review for KQ 2. If such is the case, we will use the updated review as is without de novo searches or syntheses for the comparison of self- and clinician-sampled hrHPV testing.
CADTH sought to include systematic reviews and subsequently searched for primary studies published after the most recent systematic review. The inclusion of systematic reviews is not consistent with standard Task Force procedures for evidence synthesis [83]. Thus, we will supplement the updated database searches by screening the reference lists of the systematic reviews included in the CADTH HTA to identify the primary studies published prior to 2016.
For KQ 4, we will search Ovid Medline (1946-), Scopus (2004-), and EconLit (1886-) from 2000 onward using MeSH terms and keywords for cervical cancer, preferences and preference-based methods (e.g., conjoint analysis, trade-off), decision making, and attitudes.
For KQ 5, we will integrate studies (eligible for our review) from the 2011 Cochrane systematic review by Everett et al. on interventions to encourage cervical screening uptake [105], and the 2018 systematic review by Arbyn et al. on hrHPV self-sampling compared with reminders to encourage cervical screening rates [104]. The Cochrane review by Everett et al. included studies of interventions targeted at women to improve cervical screening rates, compared with no intervention or routine care [105]. We will update the Ovid Medline (1946-), Ovid Embase (1996-), and Cochrane Central (1996-) searches from 2008 onward to identify contemporary studies not included in the Cochrane review, undertaking edits to the searches as necessary. We expect the update search to capture studies of hrHPV self-sampling compared with reminders (as per Arbyn et al.’s (2018) review), and other effectiveness studies published since the last date searched in the review by Everett et al. As per KQ 2, we anticipate the possibility that an update to the systematic review by Arbyn et al. (2018) may become available before we undertake our review for KQ 5. If such is the case, we will use the updated review as is without de novo searches or syntheses for the comparison of self- and clinician-sampled hrHPV testing.
Study Selection
Electronic database searches
We will upload the results of the electronic searches to EndNote (v.X7, Clarivate Analytics, Philadelphia, Pennsylvania) and remove duplicates. We will transfer the titles and abstracts to DistillerSR (Evidence Partners, Ottawa, Canada) for screening. Two reviewers will independently screen the studies for eligibility in two stages (titles and abstracts, then full texts) following the pre-defined selection criteria (Tables 2 to 5) and mark each record as include/unsure or exclude. At the title and abstract screening stage, we will use the liberal-accelerated approach [106, 107], whereby any record marked as include/unsure by either of two reviewers will be considered eligible for full text screening. Records excluded by one reviewer will be screened by the second reviewer to confirm or refute their exclusion. At the full text screening stage, the reviewers will agree upon the included studies, with arbitration by a third reviewer if necessary. We will record the reasons for excluding full texts and illustrate the study selection process via a flow diagram. We will append a detailed list of the excluded studies, with full text exclusion reasons, to the final report. Before each screening stage, we will undertake a pilot round (200 titles and abstracts, or 10 full texts), or as many records as needed to achieve a mutual understanding of the selection criteria. To create efficiencies, we will screen for studies meeting the eligibility criteria for all KQs simultaneously (the searches for all KQs will ultimately be combined into one database).
When inadequate detail is reported in a study to confirm or refute its eligibility we will contact the corresponding author by e-mail to request the additional information required. We will contact authors twice, two weeks apart, before ceasing contact if we do not receive a reply.
Studies identified via other sources
Studies identified via content experts and reference lists (i.e., of known systematic reviews that we are using as sources of studies, systematic reviews identified during screening, included studies) will be uploaded to separate folders in EndNote for storage and management. These will be screened following the same procedures as described for those identified via the electronic database searches. The selection process for these studies will be incorporated into the aforementioned flow diagram.
Data Extraction
For all KQs we will develop standard forms in Excel v. 2016 (Microsoft Corporation, Redmond, Washington) to guide data extraction. We will pilot test the forms on a random sample of 3 to 5 included studies for each KQ to ensure the complete and accurate extraction of all relevant data. Supplementary File 6 outlines the data extraction items for each KQ.
Data for the studies included in the review for each KQ will be extracted by one reviewer with verification by another, with the exception of results data (i.e., findings for the outcomes of interest) which will be independently extracted by two reviewers with consensus. A third reviewer will arbitrate if agreement on the extracted data cannot be reached. For qualitative studies (KQ 4), one reviewer will copy the relevant ‘Results’ or ‘Findings’ texts and paste them into a Word (Microsoft Corporation, Redmond, Washington) document for analysis [108]. A second reviewer will verify the completeness of the extraction.
To create efficiencies, we will rely on the study characteristics and results data for primary studies reported in earlier systematic reviews, where feasible. In the case of reviews with high quality conduct and reporting (e.g., Cochrane systematic reviews), one reviewer will perform a quality check of 10% of the data specific to the outcomes of interest, and unless substantial errors or omissions are noted, we will rely on the reported data without further re-extraction from the primary study. When the data of interest are incompletely reported, one reviewer will extract data from the primary study and compare data specific to the outcomes of interest (as previously described) to that reported in the earlier review(s) for consistency. A second reviewer will provide input only in cases where discrepancies between the extracted data and that reported in the earlier systematic review(s) cannot be resolved.
Specific to KQ 2, we expect heterogeneity in the criteria (thresholds) used to define a positive test result across studies. Differences in the criteria for test-positivity across studies could affect whether and how we pool and interpret their results. We are not able to judge a priori the possible array of reported definitions. Thus, to inform our analyses we will extract the definition of a positive test reported in the individual studies and present the range of definitions (without further study details) to clinical experts supporting the working group. Based on clinical expert judgment and improved familiarity with the range of definitions reported across studies, we will finalize our data analysis plan (i.e., which types of studies we may be able to pool). Only after the clinical experts have deliberated on the consistency and compatibility of available definitions and we have developed a suitable analysis plan will we move forward with the extraction of results data. Because we will finalize the analysis details prior to the extraction of data, and based on the input of clinicians who will not be aware of study details, the risk of biasing the analyses will be minimal.
Risk of Bias Assessment
Considering the array of available risk of bias tools [109-111], we will use study design-specific tools that we believe best account for potential sources of bias [100, 112-116]. The planned methods are described in detail in Supplemental File 3. For all KQs we will develop standard forms in Excel to guide risk of bias appraisal. We will pilot test the forms on 3 to 5 included studies for each KQ to ensure a mutual understanding of the requirements of each tool. We will report domain-specific risk of bias ratings for each included study, with justification for each rating, in an appendix to the final report. Two reviewers will independently appraise the risk of bias of each included study and reach consensus. A third reviewer will be consulted if an agreement cannot be reached. We will extract and use risk of bias appraisals reported in available systematic reviews where possible, to create efficiencies.
Data Synthesis
Key Question 1: Effectiveness and Comparative Effectiveness
Where appropriate, we will pool studies reporting on mortality from cervical cancer, all-cause mortality, the incidence of ICC, the incidence of CIN 2 and 3, the number and rate of colposcopy and/or biopsy, and/or adverse pregnancy outcomes, per outcome-comparison. The measure of effect will be the relative risk (RR) or odds ratio (OR) with 95% confidence intervals (CIs). These will be calculated in Review Manager version 5.3 (The Nordic Cochrane Centre, The Cochrane Collaboration, Copenhagen, Denmark) from raw data reported in the studies or, if raw data are not provided, we will use the reported relative measures. When available, we will use adjusted ORs from observational studies, as these usually reduce the impact of confounding [117]. We will pool data using DerSimonian and Laird random effects models [118] to account for expected clinical and methodological heterogeneity across studies [119]. For rare events, we will use the Peto one-step odds ratio method to provide a less biased effect estimate [120], unless control groups are of unequal sizes, a large magnitude of effect is observed, or events become more frequent (5% to 10%); in these cases, the reciprocal of the opposite treatment arm size correction will be used [120]. We will pool data from RCTs and controlled clinical trials separately from observational studies. We will present separate analyses for each comparison. In some cases, we may also deem it appropriate to combine intervention groups (e.g., for comparisons of any screening versus no screening) using standard methods to avoid unit of analysis issues [117]. We will transform the pooled RR for each outcome to the absolute risk reduction (ARR) via standard methods [121]. We will calculate the number needed to screen for an additional beneficial outcome for outcomes with statistically significant results.
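As an illustrative sketch only (the analyses themselves will be run in Review Manager), the following Python code implements the DerSimonian and Laird random-effects pooling of log RRs and the transformation of the pooled RR to an ARR and number needed to screen. All inputs, including the assumed control-group risk, are hypothetical and do not come from any included study:

```python
import math

def dersimonian_laird(log_rrs, ses):
    """Pool log relative risks with a DerSimonian-Laird random-effects model."""
    w = [1 / se ** 2 for se in ses]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * y for wi, y in zip(w, log_rrs)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_rrs))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(log_rrs) - 1)) / c)  # between-study variance
    w_re = [1 / (se ** 2 + tau2) for se in ses]  # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_re, log_rrs)) / sum(w_re)
    se_pooled = math.sqrt(1 / sum(w_re))
    ci = (math.exp(pooled - 1.96 * se_pooled), math.exp(pooled + 1.96 * se_pooled))
    return math.exp(pooled), ci, tau2

# Hypothetical log RRs and standard errors from three screening studies
rr, ci, tau2 = dersimonian_laird([-0.35, -0.22, -0.51], [0.12, 0.18, 0.25])

# Transforming the pooled RR to an ARR and number needed to screen (NNS)
# requires an assumed control-group (unscreened) risk of the outcome:
assumed_control_risk = 0.004  # hypothetical baseline risk
arr = assumed_control_risk * (1 - rr)  # absolute risk reduction
nns = 1 / arr  # number needed to screen for one additional beneficial outcome
```

The sketch only makes the pooling arithmetic explicit; it does not reproduce Review Manager's handling of rare events (the Peto method) or continuity corrections.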
We will consider false-positives to be cervical screening tests that are positive (according to the primary testing strategy used in the individual studies, recognizing that definitions of test positivity will differ across studies) and lead to diagnostic follow-up testing, but that are not histologically confirmed as CIN 2, CIN 3, or more severe disease. We will calculate the false-positive rate using available data in the individual studies, as follows: (# of individuals with a positive screening test result who are not histologically diagnosed with the relevant condition / # of individuals not diagnosed with the relevant condition, regardless of screening test result). This calculation necessitates histological examination for pre-cancerous lesions of all participants. Should this information not be available (in the published report and following attempts to contact the study authors), we will report the number of positive tests and the total number of tests, as reported by the authors. The range of false-positive rates across studies will be reported narratively and in tables, per test.
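With hypothetical counts (not drawn from any study), the false-positive rate calculation above amounts to:

```python
def false_positive_rate(screen_pos_without_condition, total_without_condition):
    """FP rate = screen-positives without the condition / all without the condition."""
    return screen_pos_without_condition / total_without_condition

# Hypothetical counts: 480 screen-positive participants without histologically
# confirmed CIN 2+ out of 9,600 participants without CIN 2+ overall
fpr = false_positive_rate(480, 9600)  # → 0.05
```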
We are not aware of a standard formula for estimating overdiagnosis in the context of cervical screening. Thus, we expect studies reporting on overdiagnosis to be highly methodologically heterogeneous. For this reason, we will synthesize data on this outcome narratively and in tables, including the method (formula) used to derive each estimate.
Key Question 2: Comparative Accuracy
We will populate 2 x 2 tables with the TP, FP, TN, and FN for each screening test used in each study. If we identify more than three studies that we deem suitable for statistical pooling, we will compare accuracy data per outcome and per comparison using the Rutter and Gatsonis hierarchical summary receiver operating characteristic (HSROC) model [122], as recommended by Cochrane [123]. This model allows for the exploration of heterogeneity in test positivity (threshold for a positive test), position of the HSROC curve (accuracy of the test), and the shape of the HSROC curve [123]. Compared to the binomial regression model, the HSROC model also more fully accounts for within- and between-study variability in TP and FP rates [122]. We will investigate whether test strategies are associated with the shape and position of the summary ROC curve by fitting a binary covariate to the model representing the type of test that informed each 2 x 2 table [123]. In the event that preliminary plots of the study level estimates of sensitivity and specificity in ROC space reveal substantial differences in heterogeneity between studies for the two tests being investigated, we will assess whether the assumption of equal variances of the random effects of the two tests is reasonable by comparing the fit of the alternative models (i.e., where variances do or do not depend on the covariate for test strategy) [123]. For each screening strategy, we will report the pooled relative sensitivity and specificity across studies, with 95% CIs. In the event that the data are not suitable for statistical pooling, we will report the findings narratively and in tables.
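Fitting the HSROC model itself requires specialized statistical software; as a simpler sketch, the study-level quantities that feed into it, sensitivity, specificity, and relative accuracy, can be derived from hypothetical 2 x 2 tables as follows:

```python
def accuracy_from_2x2(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2 x 2 table of screening test results."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical paired 2 x 2 tables for two screening strategies in one study
sens_hpv, spec_hpv = accuracy_from_2x2(tp=90, fp=380, fn=10, tn=9520)
sens_cyt, spec_cyt = accuracy_from_2x2(tp=70, fp=190, fn=30, tn=9710)

rel_sens = sens_hpv / sens_cyt  # relative sensitivity of test 1 vs. test 2
rel_spec = spec_hpv / spec_cyt  # relative specificity of test 1 vs. test 2
```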
Key Question 4: Relative Importance of Potential Outcomes from Screening
We will synthesize the quantitative data separately from the qualitative data. For the quantitative data, we expect to undertake a narrative synthesis given the likely heterogeneity in study designs, exposures, comparisons, and outcomes reported across studies. We will synthesize the included studies and draw conclusions based on the body of evidence using standard methods for narrative syntheses, as described by Popay et al. (2006) [124]. Adaptations to standard methodology may be necessary because our review aims to investigate people's values and preferences, so its outcomes differ somewhat from those of intervention or implementation reviews. We will first present an overall synthesis of each included study, including their characteristics and reported findings. We will then describe relationships within and between studies, focusing on our exposure subgroups and comparators of interest and other factors such as methodological quality. As much as possible, we intend to report a best estimate of values and preferences for various exposures, and potential moderating factors.
We will analyze the qualitative data following standard procedures for thematic analysis [108, 125]. One reviewer will initially read through the data to familiarize themselves with the prevailing ideas. Next, the reviewer will use line-by-line coding in Microsoft Word to apply one or more codes to each line of text. The reviewer will then compare codes across the data, combine similar codes, categorize common codes into themes, and develop memos for each theme. To reduce the risk of interpretive biases, a second reviewer will review the codes and themes for differences in interpretation. The two reviewers will agree upon the final themes, with the input of a third reviewer if necessary. We will report on each theme narratively.
Key Question 5: Effectiveness and Comparative Effectiveness of Interventions to Increase Screening Rates
We will incorporate newly identified studies into the analyses previously reported in the Cochrane systematic review by Everett et al. (2011) [105]. Additional studies extracted from the review by Arbyn et al. (2018) [104] will be pooled via the same methods. In some cases, we may also deem it appropriate to combine intervention groups from multi-arm trials using standard methods to avoid unit of analysis issues [117]. We will transform the pooled RR for each outcome to the absolute values via standard methods [121]. We will calculate the number needed to treat for an additional beneficial outcome (i.e., participation) for outcomes with statistically significant results. We will report on studies that are not appropriate for statistical pooling narratively.
Dealing with Missing Data
When data required for statistical pooling are not reported by the individual studies, we will contact the corresponding author via e-mail to inquire about the availability of the data. We will contact authors twice, two weeks apart, before ceasing contact if we do not receive a response.
For randomized trials, we anticipate that many will report their findings based on a “number of individuals screened” denominator, rather than intention-to-screen calculations using all individuals randomized. Our primary analysis will use outcome data derived by analyzing all individuals randomized (i.e., intention-to-screen), extracting data from the individual studies with the number randomized as the denominator for each arm. We will also conduct analyses based on the findings as reported in the individual studies, undertaking separate analyses for studies reporting only the number of individuals screened and for those reporting on all individuals randomized.
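Using hypothetical arm counts, the difference between the two denominators can be illustrated as:

```python
def risk_ratio(events_screen, denom_screen, events_control, denom_control):
    """Relative risk of an outcome in the screening vs. control arm."""
    return (events_screen / denom_screen) / (events_control / denom_control)

# Hypothetical trial: 5,000 randomized per arm, but only some attended screening
# Intention-to-screen: denominators are all individuals randomized
rr_itt = risk_ratio(30, 5000, 50, 5000)
# As reported in some studies: denominators are only individuals screened
rr_screened = risk_ratio(30, 3500, 50, 3600)
```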
Unit of Analysis Issues
In the event of the inclusion of cluster-randomized trials, we will take appropriate measures to avoid unit-of-analysis errors when reporting their findings and/or incorporating them into meta-analysis [126]. When available, we will use the intracluster correlation coefficient reported in the trial to apply a design effect to the sample size and number of events in each of the treatment and control groups [127]. If not reported, we will use an external estimate from similar studies. We will clearly identify cluster-randomized trial data when they are included in meta-analyses with individually randomized trials. Decisions about whether it is reasonable to pool data from cluster-randomized and individually randomized trials will be undertaken on a case-by-case basis. We will investigate the robustness of the conclusions from any meta-analysis including cluster-randomized trials via sensitivity analysis.
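As a sketch of the design-effect adjustment, with a hypothetical trial arm and an assumed intracluster correlation coefficient:

```python
def effective_counts(n, events, avg_cluster_size, icc):
    """Deflate sample size and events by the design effect 1 + (m - 1) * ICC."""
    design_effect = 1 + (avg_cluster_size - 1) * icc
    return n / design_effect, events / design_effect

# Hypothetical trial arm: 40 clusters of 25 participants, intracluster
# correlation coefficient of 0.02 (design effect = 1 + 24 * 0.02 = 1.48)
eff_n, eff_events = effective_counts(n=1000, events=150, avg_cluster_size=25, icc=0.02)
```

The deflated counts can then enter a meta-analysis alongside individually randomized trials as if they came from a trial of the effective size.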
Assessment of Heterogeneity
We will explore heterogeneity via subgroup analyses. First, we will report within-study subgroup data from our pre-specified subgroups of interest (Tables 2 to 5). We will also stratify the meta-analyses by subgroups (between-study analysis), or use other relevant statistical techniques such as meta-regression, to investigate heterogeneity. For population subgroups, we will use a large majority (i.e., >80% of participants) to decide the relevant subgroup for each study. We will interpret the plausibility of subgroup differences cautiously using available guidance [128, 129]. Should within- or between-study subgroup analysis not be available or possible for some subgroups, we will note studies of individuals or populations that may warrant equity (e.g., Indigenous peoples, trauma-affected, low-income) or other considerations by the Task Force, and assess the applicability of the interventions to these populations.
Small Study Bias
When meta-analyses of trials contain at least eight studies of varying size, we will test for small study bias visually, by inspecting funnel plots for asymmetry, and statistically, via Egger's test [130].
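Egger's test will be run in standard statistical software; purely as an illustration of the underlying regression (standardized effects on precision, with the intercept gauging funnel plot asymmetry), a minimal sketch with hypothetical inputs is:

```python
import math

def egger_intercept(effects, ses):
    """Regress standardized effects on precision; the intercept gauges asymmetry."""
    y = [e / s for e, s in zip(effects, ses)]  # standardized effect sizes
    x = [1 / s for s in ses]                   # precisions
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)  # residual variance
    se_int = math.sqrt(s2 * (1 / n + mx ** 2 / sxx))
    return intercept, intercept / se_int  # intercept and its t-statistic

# Hypothetical log RRs and standard errors from eight trials
b0, t_stat = egger_intercept(
    [-0.40, -0.35, -0.30, -0.28, -0.20, -0.15, -0.55, -0.10],
    [0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.30, 0.08],
)
```

An intercept significantly different from zero (judged against a t distribution with n - 2 degrees of freedom) suggests funnel plot asymmetry.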
Certainty in the Body of Evidence
We will use GRADE methods [131] to assess the certainty of evidence for all outcomes, without relying on the appraisals reported in earlier systematic reviews. In the event that we use one or multiple systematic reviews as is to answer a KQ (e.g., the Kyrgiou et al. [98, 99] reviews for KQ 3), we will review the reported certainty of evidence appraisals and undertake amendments as necessary to ensure that the appraisals are appropriately contextualized. In cases where studies of interventions cannot be pooled in meta-analysis, we will use GRADE guidance for rating the certainty of evidence in the absence of a single estimate of effect [132]. Two reviewers will independently assess the certainty of evidence for each outcome and agree on the final assessments. A third reviewer will arbitrate if necessary.
We will assess the certainty of evidence (very low, low, moderate, or high) based on five considerations: study limitations (risk of bias), inconsistency of results, indirectness of evidence, imprecision, and publication (small study) bias [133-138]. We will assess the certainty of evidence from trials and observational studies separately, for each outcome. For KQs of intervention effects (KQs 1 and 5), data from RCTs will begin at high certainty, and be downgraded for flaws in each of the aforementioned domains (or, rarely, upgraded for strengths) [139], whereas observational studies will begin at low certainty. For KQ 2 on diagnostic accuracy, all studies will begin at high certainty [140, 141]. For KQ 4, we will adhere to GRADE methods for assessing the certainty of evidence in the importance of outcomes or values and preferences [101, 115]. We will report our appraisals comprehensively and transparently, including justification for downgrading on any of the considered domains. We will use a partially contextualized approach; thus we will express our certainty that the true estimate lies within a range of magnitudes for each outcome. We will not account for other outcomes when assessing the magnitude of effect for individual outcomes, nor consider the certainty of any one outcome versus another [142].
For each KQ we will create a separate GRADE summary of findings table [134]. Justifications for rating up or down in any of the considered domains will be explained. We will also note where differences were observed between the data from trials and that from observational studies, or when we have relied solely on either the trial or observational evidence. The certainty of evidence assessments for each outcome will be incorporated into the Task Force’s evidence-to-decision framework [143]. The Task Force may choose to fully contextualize the range of possible effects on all outcomes (including benefits and harms). The Task Force will consider the net benefits and harms of screening and other elements (e.g., costs, feasibility, patient values and preferences) to develop updated recommendations for screening for the prevention of cervical cancer [143].
Task Force Involvement
The Task Force and clinical experts will not be involved in the selection of studies, extraction of data, appraisal of risk of bias (or methodological quality), nor synthesis of data, but will contribute to the interpretation of the findings and comment on the draft report. Clinical experts and/or Task Force members may be called upon to contribute to the certainty of evidence appraisals, e.g., to interpret directness (applicability) of included studies to the population of interest for the recommendation.