Clinical Quality in Cancer Research: Strategy to Assess Data Integrity of Germline Variants Inferred from Tumor-Only Testing Sequencing Data

In the majority of cancers, pathogenic variants are only found at the level of the tumor; however, an unusual number of cancers and/or diagnoses at an early age in a single family may suggest a genetic predisposition. Predisposition plays a major role in about 5–10% of adult cancers and in certain childhood tumors. As access to genomic testing for cancer patients continues to expand, the identification of potential germline pathogenic variants (PGPVs) through tumor-DNA sequencing is also increasing. Statistical methods have been developed to infer the presence of a PGPV without the need for a matched normal sample. These methods are mainly used for exploratory research, for example in real-world clinico-genomic databases/platforms (CGDB). These databases are being developed to support many applications, such as targeted drug development, clinical trial optimization, and postmarketing studies. To ensure the integrity of data used for research, a quality management system should be established, and quality oversight activities should be conducted to assess and mitigate clinical quality risks (for patient safety and data integrity). As opposed to well-defined ‘good practice’ quality guidelines (GxP) areas such as good clinical practice, there are no comprehensive instructions on how to assess the clinical quality of statistically derived variables from sequencing data such as PGPVs. In this article, we share our strategy and propose a set of tactics to assess PGPV quality and to ensure data integrity in exploratory research.


Background
Cancer is a genetic disease. In the majority of cancers, pathogenic variants are only found at the level of the tumor (somatic variants); however, an unusual number of cancers and/or diagnoses at an early age in a single family may suggest a genetic predisposition. Predispositions play a major role in about 5–10% of adult cancers and in certain childhood tumors [1]. For example, BRCA1 and BRCA2 are involved in homologous recombination and DNA repair, and are germline cancer predisposition genes (CPGs) that result in a syndrome of hereditary breast and ovarian cancer [2].
Access to genomic testing in oncology continues to expand for treatment recommendations, disease monitoring, and early detection [2]. Identification of potential germline pathogenic variants (PGPVs) through tumor-DNA sequencing is also increasing, with both the European Society for Medical Oncology [3] and the American College of Medical Genetics and Genomics [4] having issued recommendations on how to report and analyze PGPVs derived from tumor-only sequencing data.
In routine care, testing for PGPVs is triggered by patient medical and family history and follows local guidelines and recommendations [5]. Testing is performed on two independent normal tissue samples (whole blood and buccal swab). Of note, there is a clear regulatory framework around testing (e.g., informed consent requirements), patient and patient family follow-up, as well as patient care. For cancer patients who undergo tumor testing, there are different strategies: tumor-only testing, tumor-normal paired testing with germline variant subtraction, and tumor-normal paired testing with analysis of genes associated with germline cancer predisposition [4].
In tumor-only testing, sequencing data may be used to infer germline presence of a specific variant from the variant allele frequency (VAF). On average, germline variants have higher VAFs than somatic variants; for example, a typical heterozygous germline variant should have a VAF of 50%. VAF in tumors is highly dependent on tumor purity and heterogeneity, and experimental thresholds need to be determined to classify germline variants [3]. Recently, generalizable statistical modeling techniques have been developed to classify variants from tumor-only testing that do not require setting individual VAF thresholds for each gene [6][7][8][9]. When a PGPV is inferred by tumor-only testing, the results must be confirmed on a matched normal tissue sample, genetic counseling is advised, and recommendations are provided to the patient and their physicians. Of note, these follow-up steps might not always be reimbursed, and a PGPV could generate anxiety and stress for the patient while awaiting confirmation [4].
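As an illustration of this VAF-based reasoning, the sketch below compares the binomial likelihood of the observed read counts under an expected heterozygous germline VAF of 50% versus an expected clonal somatic VAF of purity/2. This is a deliberately simplified toy, not one of the published methods [6][7][8][9]; the function names and the diploid, copy-number-neutral assumptions are ours.

```python
import math

def binom_loglik(alt, total, vaf):
    # Log-likelihood of observing `alt` variant reads out of `total` under a
    # binomial model with success probability `vaf`. The binomial coefficient
    # is omitted because it is identical for both hypotheses being compared.
    return alt * math.log(vaf) + (total - alt) * math.log(1 - vaf)

def classify_variant(alt_reads, total_reads, tumor_purity):
    """Toy germline-vs-somatic call, assuming a diploid, copy-number-neutral
    locus: heterozygous germline variants expect VAF ~0.5 regardless of
    purity, while clonal somatic variants expect VAF ~purity/2."""
    ll_germline = binom_loglik(alt_reads, total_reads, 0.5)
    ll_somatic = binom_loglik(alt_reads, total_reads, tumor_purity / 2)
    return "germline" if ll_germline > ll_somatic else "somatic"
```

At 40% tumor purity, a variant seen in 48 of 100 reads is far more consistent with the germline expectation, whereas one seen in 22 of 100 reads fits the somatic expectation; real methods additionally model copy number, subclonality, and sequencing error, which this sketch ignores.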
In clinical practice, gatekeepers (e.g., normal match testing, genetic counseling, review by molecular tumor boards) must be in place to mitigate the risks to patient well-being and to data integrity. PGPVs inferred from tumor-only testing can also be used for exploratory research, for example in real-world clinico-genomic databases/platforms (CGDB). These databases are being developed by pharmaceutical and diagnostics companies [10,11] to support many applications, such as targeted drug development, clinical trial optimization, and postmarketing studies. To ensure the integrity of data used for research, a quality management system should be established, in which quality oversight activities should be conducted to assess and mitigate clinical quality risks (for patient safety and data integrity) [12].
As opposed to well-defined 'good practice' quality guidelines (GxP) areas such as good clinical practice [12], there are no comprehensive instructions on how to assess the clinical quality of derived variables from sequencing data such as PGPVs. In this report, we share our strategy and a set of tactics to assess the quality of PGPVs and to ensure data integrity in exploratory research. Our approach serves as a framework for quality professionals to partner with researchers and can be applied to statistical inference methods that analyze clinico-genomics data.

Prerequisites
The primary objective of this research was to provide a strategy and a set of tactics for clinical quality professionals to assess the quality of PGPVs used for exploratory research. They are not guidelines for analytical and/or clinical validation of methods used to infer PGPVs from sequencing data, nor do they relate to clinical quality assurance regulatory requirements. The scope was tumor-only testing, where academic institutions and/or diagnostic companies do not use a normal match to verify that the variants found at the somatic level are also present in the germline. Several methods have been developed by diagnostic companies [10,11]; however, they are usually not disclosed (for intellectual property reasons). We selected the methods [6][7][8][9] that had been published in peer-reviewed journals (see Table 1) and used them as examples to illustrate our strategy. Of note, we did not review further inference methods that set individual VAF thresholds, as the American College of Medical Genetics and Genomics and the European Society for Medical Oncology [3,4] guidelines already provided recommendations on how to assess their performance and limitations.

Problem Statement
To design a fit-for-purpose quality review, the problem statement and the scope should be clearly defined: getting assurance that PGPV data can be used with confidence to address the respective research questions, while being transparent on the data limitations and their impact on the analysis.
The proposed quality strategy to address the above problem statement followed a two-step approach. First, the methods used should be assessed to understand the potential impact on the quality of the PGPV data, how the methods were built, and how they were evaluated. In a second phase, quality checks can be established if PGPV data are available with other clinico-genomics data.

Table 1 (excerpt). Published inference methods: Hiltemann et al. [7], Virtual Normal (virtual dataset as a normal match); Park et al. [8], ALFRED (statistical model); Sun et al. [9], SGZ (statistical model).

Assessing the Method Used to Infer Potential Germline Pathogenic Variants
Details on the methods used in exploratory research are sometimes available through peer-reviewed scientific publications [6][7][8][9]. After an initial review step, we rejected the ALFRED method by Park et al. [8]. This method was designed to discover new CPGs (i.e., never reported as such previously) from tumor-only sequencing data, and used a statistical model to test the Knudson two-hits hypothesis [13].
In order to assess the methods used to infer PGPVs from tumor-only testing, the following areas should be considered for review.
(a) Any model used for PGPV classification needs to undergo an expert review that addresses the model's applicability to the current dataset, considering the original sample population and the model validation techniques. The published methodologies for PGPV assessment have been evaluated on a limited selection of tumor types and in restricted population samples. It is important that these limitations are disclosed and that a documented expert opinion permits these methodologies to be applied to the specific tumor type and patient population at hand. The expert review should include a brief tabular summary, as exemplified in Table 2.
(b) Any PGPV classification modeling technique needs to be adequately validated. The validation strategy depends on the statistical method and should be included in the expert review. Classification metrics such as accuracy can be misleading for imbalanced datasets [14]; we therefore recommend disclosing the following performance metrics: true positive rate, false positive rate, true negative rate, and false negative rate, which allow reconstruction of the entire confusion matrix [14]. It also needs to be clear whether the classification estimate addresses the pathogenicity, the germline presence, or both of these PGPV characteristics.
(c) The requirements to ensure optimal performance of the inference method should be identified: for example, which breadth and depth of sequencing coverage are required, and under which tumor purity level the model performs best [3,4].
(d) Potential pitfalls of the inference method should be documented, along with how they could impact the quality of PGPV data and what risk mitigations could be taken (e.g., discarding PGPVs identified outside of the inference method's specifications).
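To make point (b) concrete, the snippet below derives the four recommended rates from raw confusion-matrix counts and shows why accuracy alone can mislead on an imbalanced dataset; the counts used in the usage note are invented for illustration.

```python
def confusion_rates(tp, fp, tn, fn):
    """The four disclosed rates; together with the total counts they fix
    the entire confusion matrix."""
    return {
        "tpr": tp / (tp + fn),  # true positive rate (sensitivity)
        "fpr": fp / (fp + tn),  # false positive rate
        "tnr": tn / (fp + tn),  # true negative rate (specificity)
        "fnr": fn / (tp + fn),  # false negative rate
    }

def accuracy(tp, fp, tn, fn):
    # Overall accuracy, dominated by the majority class on imbalanced data
    return (tp + tn) / (tp + fp + tn + fn)
```

For example, with 20 true germline variants among 1,000 (tp=5, fn=15, tn=980, fp=0), accuracy is 0.985 even though the true positive rate is only 0.25, i.e., three out of four germline variants are missed.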

Quality Checks Using Clinico-Genomics Data
For clinical quality professionals and researchers having access to clinico-genomics data (e.g., through CGDBs [10,11]), there are a number of tactics that can be applied to further assess the quality of PGPVs. For the purpose of this report, we considered alterations, histopathology, and biomarker data. These quality checks can also be implemented for inference methods that set individual VAF thresholds [3,4]. When biomarker data are available for the same patient, a concordance analysis can be performed. For example, data on germline testing of a normal blood sample for BRCA1/2 can be compared with the inferred PGPV for BRCA1/2.
To discard possible false positive/false negative results and to identify other anomalies, data quality checks can be implemented. These checks should be based on the latest medical theory and clinical practice, reflecting empirical results. The tactics suggested below are examples and can be complemented with additional data and variables if available:
- Retrieve all data reported as PGPVs across all tumor types and compare them with a list of known CPGs [1,15]. A PGPV associated with an unknown CPG would likely be a false positive.
- If a founder variant (e.g., for BRCA1/2, APC, MSH2, MSH6, CHEK2, or MUTYH) is identified at the somatic level, it is likely to be found in the germline [16]. This can help identify potential false negatives.
- When a pathogenic BRCA1/2 variant is present at the somatic level, it is expected to be found also at the germline level in approximately 70-80% of cases in breast cancer and in approximately 60% of cases in ovarian cancer [17,18]. The proportion of somatic BRCA1/2 variants also flagged as PGPVs can be calculated and compared against these ratios.
- Germline CDH1 pathogenic variants are associated with an increased risk of lobular breast cancer. In CDH1-associated breast cancer, malignant cells show loss of adhesion (as CDH1 codes for E-cadherin) [19]. Hence, by querying the associated histopathological data, we can identify false positive PGPVs, i.e., CDH1-inferred PGPVs associated with ductal breast cancer.
- If available and accessible (e.g., despite privacy constraints), ancestry data could be used to verify pathogenicity, as many pathogenic variants are found to be ancestry-specific and trigger somatic effects [20].
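A minimal sketch of the first and third checks, assuming a hypothetical record schema (`gene`, `somatic_pathogenic`, `pgpv` fields) and an illustrative, non-exhaustive CPG list; a real review would use a curated list from the literature [1,15]:

```python
# Illustrative subset of known cancer predisposition genes (not exhaustive)
KNOWN_CPGS = {"BRCA1", "BRCA2", "TP53", "APC", "MSH2",
              "MSH6", "CHEK2", "MUTYH", "CDH1"}

def flag_unknown_cpg(pgpv_calls):
    """Return PGPV calls in genes not on the CPG list (likely false positives)."""
    return [v for v in pgpv_calls if v["gene"] not in KNOWN_CPGS]

def somatic_brca_pgpv_fraction(variants):
    """Fraction of pathogenic somatic BRCA1/2 variants also flagged as PGPVs,
    to be compared against the published ratios [17,18]."""
    somatic = [v for v in variants
               if v["gene"] in {"BRCA1", "BRCA2"} and v["somatic_pathogenic"]]
    if not somatic:
        return None
    return sum(1 for v in somatic if v["pgpv"]) / len(somatic)
```

For a breast-cancer cohort, a `somatic_brca_pgpv_fraction` far below the expected 70-80% range would prompt a false-negative investigation; a gene name returned by `flag_unknown_cpg` (e.g., the placeholder `GENEX` in the usage below) would prompt a false-positive investigation.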

Inference Methods
To ensure the integrity of PGPV data used for research, it is important to understand how the method to infer PGPVs was designed, i.e., what type of model was fitted and on which dataset, the requirements to ensure the model's optimal performance (e.g., tumor specimen quality requirements), and how it was validated. Furthermore, reviewing available performance metrics and disclosed limitations can help in tailoring the analysis further (see Table 2). For example, the method published by Hiltemann et al. [7] cannot adequately account for rare germline variants that are private to a family or small population. Furthermore, to avoid false positives and false negatives, the analysis should be adjusted to the method parameters and the correct filters should be applied, for example:
- if the model performs best on a defined range of tumor purity (e.g., for the SGZ [somatic-germline-zygosity] method by Sun et al., a high level of accuracy is maintained from 10% through 75% tumor purity [9]), PGPVs inferred from sequencing of specimens with a tumor purity outside of that range can be excluded, reducing false positives and false negatives;
- if the model requires a minimum sequencing depth (e.g., the method published by Khiabanian et al. [6] requires high-depth sequencing coverage, at least over 500 reads), PGPVs inferred below that depth can be excluded;
- filter for specimens that passed quality checks (the so-called 'qualified' or 'valid' specimens).
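The filters above can be combined into a single specification check. In the sketch below, the record schema and function name are hypothetical; the 10-75% purity range comes from the SGZ method [9] and the 500-read depth requirement from Khiabanian et al. [6], used here purely as example defaults.

```python
def passes_method_specs(record,
                        purity_range=(0.10, 0.75),  # SGZ accuracy range [9]
                        min_depth=500,              # depth requirement in [6]
                        require_qualified=True):
    """Return True only if the PGPV was inferred within the method's stated
    operating range: tumor purity in range, sufficient sequencing depth,
    and (optionally) a specimen that passed quality checks."""
    lo, hi = purity_range
    return (lo <= record["tumor_purity"] <= hi
            and record["sequencing_depth"] >= min_depth
            and (record["specimen_qualified"] or not require_qualified))
```

PGPVs failing this check would be excluded from (or at least flagged in) the downstream analysis, with the exclusion criteria documented alongside the results.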

Data Quality Checks
Descriptive analytics can be used to facilitate the interpretation of the quality review; for example, PGPV classification estimates can be plotted against expected population frequencies, making discrepancies easy to spot (Fig. 1).
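The same comparison can also be automated as a screening rule. In the sketch below, the fold-change criterion and all frequencies are illustrative placeholders, not literature values:

```python
def frequency_discrepancies(observed, expected, fold=3.0):
    """Flag genes whose observed PGPV frequency deviates from the expected
    population frequency by more than `fold` in either direction.
    `observed` and `expected` map gene name -> frequency (0..1)."""
    flagged = []
    for gene, exp_freq in expected.items():
        obs_freq = observed.get(gene, 0.0)
        if obs_freq > exp_freq * fold or obs_freq < exp_freq / fold:
            flagged.append(gene)
    return flagged
```

Flagged genes would then be taken through the interpretation grid of Table 3 rather than excluded automatically, since a genuine cohort enrichment (e.g., a hereditary-cancer referral population) can also produce large deviations.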
The possible outcomes and their interpretations for the quality checks proposed in the Quality Strategy section are detailed in Table 3.

Quality Assurance Application
The quality strategy is summarized in a flow chart (Fig. 2) that can be used by clinical quality professionals to partner with researchers for reviewing and documenting the quality of PGPV data.
It is up to the researchers, depending on the questions they need to answer, to follow up on the proposed actions. In certain situations, some PGPVs can be excluded from the analysis, but a thorough root cause investigation is recommended: it could not only help fix the underlying issue that generates incorrect PGPV flagging but also prevent the recurrence of similar quality issues moving forward. Alternatively, if the root cause cannot be identified and/or the quality issues cannot be solved, we advise that the risks of using potentially biased data are taken into account and mitigated where possible. We also advise full disclosure of the data limitations for transparency. Finally, we would like to reiterate that this quality strategy is not intended to be used in a clinical setting, but rather in exploratory research.
There are currently no binding regulatory requirements for clinical quality data management in exploratory research, but implementing this strategy could help accelerate research by addressing some of the root causes of the data quality issues. Furthermore, should the decision be made to use PGPV data inferred from tumor-only testing for a regulatory filing, having assessed the method and applied quality checks could avoid having to implement quality assurance measures retrospectively, and therefore save significant resources. Of note, if PGPV data are intended to be used for filing, all regulatory and quality requirements pertaining to processes, systems, and computer system validation must be met [21].

Challenges
One of the key challenges with real-world data (RWD) is that source data verification cannot always be performed. Therefore, when a potential data quality issue is identified, it might be challenging to understand the root cause. Implementing methods for reviewing and assessing the quality of RWD, together with embedding automated quality checks, can reduce the risk of the integrity of the data being compromised. Thorough documentation of the quality checks implemented, their review, and the decisions made to keep or discard RWD for analysis is highly recommended. When using clinico-genomics data to verify the quality of PGPV data, researchers should first assess and understand how the database they use has been designed and set up. It is very likely that the data do not represent the general cancer population. For example:
• The database might be composed mainly of patients with late-stage disease and/or older patients.
• The samples for a particular cancer type might be small and therefore not representative.
Furthermore, as genomic testing is not widely accessible to all cancer patients, it is highly probable that there is an underlying selection bias. For example, the demographics and frequency of PGPVs in a cohort selected from a CGDB for breast cancer, ovarian cancer, and colorectal cancer (the cancers most frequently associated with CPGs) [1] might not be representative of what is found in the real world. Likewise, a younger than typical age at onset may suggest an underlying cancer predisposition, such as very early onset breast cancer [22], but this might not be reflected in a cohort built from a CGDB.
Last but not least, should new methods be developed to infer PGPVs from tumor-only testing, we encourage researchers to clearly describe their specifications and limitations to avoid false expectations or misunderstandings. For example, in the four methods available in scientific publications [6][7][8][9], not all relevant performance metrics were explicitly disclosed, and some additional features (e.g., confidence intervals, accuracy thresholds) could have been useful.

Conclusion
In this report, we defined a strategy and a set of tactics to assess the quality of PGPV data inferred from tumor-only testing to ensure data integrity in exploratory research activities. Our method can be easily tailored to the specifics of the models used for PGPV inference. It can help researchers gain further insight into the quality of PGPV data and implement quality by design for RWD [23]. Our strategy could be expanded to other machine-learning-derived biomarkers from sequencing data (e.g., microsatellite instability status), although a deep understanding of the topic is always required in order to design the quality review appropriately. Our strategy will continue to be improved, and further tactics for quality checks can be performed by using additional variables/data, such as patient family history [24]. Finally, this project is part of a broader effort to deliver quality assurance effectively, by leveraging analytics and implementing quality by design to accelerate patient access to innovative healthcare solutions [25][26][27][28]. Application of a tailored data quality strategy to identify and address quality issues will aid researchers in reproducing results, demonstrating integrity of the data collected/generated, and ultimately giving stakeholders confidence in the research outcome.