Study objectives
This study aims to develop and validate a novel blood-based whole methylome sequencing (WMS) method, followed by a multidimensional model that analyses several features of cfDNA, for early GC detection. We expect this noninvasive liquid biopsy assay to help identify individuals with a high probability of GC, who will receive a positive result and subsequently undergo gastroscopy to confirm the diagnosis. Individuals with a negative result, indicating a low probability of GC, could be advised to defer gastroscopy and possibly repeat the test annually.
The primary objective is to validate the diagnostic sensitivity and specificity of the model, both overall and for each clinical stage of GC.
The secondary objective is to investigate the specificity and sensitivity of the model in combination with possible biomarkers such as PG, G17, and/or Hp levels for early GC detection.
Timeline
The study is planned to run for 18 months, from November 2022 to April 2024.
Study Population
This study will enroll individuals who receive gastroscopy at Xijing Hospital of Air Force Military Medical University, Xi'an, China. Two groups will be formed based on gastroscopy results: a malignant group and a nonmalignant group. The malignant group will include patients diagnosed with high-grade intraepithelial neoplasia or GC (> 50% of patients in stages I and II), while the nonmalignant group will include healthy individuals and patients with gastritis, gastric ulcers, gastric polyps or other benign gastric diseases. Clinical parameters such as personal history, family history of digestive tract cancer, gastroscopic results and gastric cancer staging will be collected for each subject. Blood samples will be collected for WMS analysis, the results of which will be used to establish a GC diagnostic model.
Inclusion criteria
Subjects will be enrolled after they have understood and signed the informed consent form, provided they are eligible according to the inclusion and exclusion criteria.
All of the following general inclusion criteria must be fulfilled: (1) aged 18 years or older; (2) complete clinical information; and (3) voluntary agreement to join the study, with signed informed consent and good compliance.
Subjects in the malignant group must additionally fulfil the following criteria: (1) gastric adenocarcinoma confirmed by histopathology, at pathological stage I-IV as defined by the AJCC Cancer Staging Manual, 8th Edition, including adenocarcinoma of the oesophagogastric junction (EGJ); and (2) no previous local or systemic antitumour treatment.
Exclusion criteria
Subjects will be ineligible if any of the following exclusion criteria are met: (1) previous diagnosis of any malignant tumour; (2) previous total or partial gastrectomy; (3) previous bone marrow, organ or stem cell transplantation; (4) blood transfusion in the past 6 months; (5) ongoing fever, or anti-inflammatory therapy within 14 days before the study blood draw; (6) pregnancy or lactation; or (7) incomplete clinical information or otherwise unqualified to participate in the study.
Discontinuing Study Interventions And Patient Withdrawal
If the study is later considered infeasible, the principal investigator (PI) shall submit a termination application to the Institutional Review Committee (IRC), which will decide on the termination. The PI shall then provide a clinical trial termination report to the IRC.
Subjects may be withdrawn from the study for the following reasons:
- Subjects request to leave the trial and withdraw informed consent.
- Subjects refuse to have blood samples collected after giving informed consent.
- Sufficient blood sample collection is difficult for the subjects.
- Subjects are enrolled by mistake and later judged inappropriate for the study.
Sample Collection
Twenty millilitres of peripheral blood will be collected from each subject and stored in a Cell-Free DNA Storage Tube (Cwbiotech). Each tube will be clearly labelled with the subject’s information, such as index, name, sample type and collection time. Blood samples shall be delivered to the laboratory within 72 hours for plasma extraction. If the samples cannot be delivered on time, plasma shall be extracted on site as follows: (1) blood samples are centrifuged at 4,000 × g for 15 min at 4°C; and (2) the upper layer containing plasma is transferred to a new tube. Approximately 4 ml of plasma will be obtained and stored at −80°C until use, or at −20°C for no more than one month before transportation.
cfDNA Extraction
Plasma samples prepared in the previous step will be processed for cfDNA extraction by using the MagMAX Cell-Free DNA Isolation Kit (Thermo Fisher Scientific) per the manufacturer’s instructions. The quantity and quality of extracted cfDNA will be assessed with a Bioanalyzer 2100 (Agilent).
WMS Library Construction
The entire amount of plasma cfDNA extracted from each sample (capped at 15 ng if more is extracted) will be used to generate WMS libraries with the NEBNext Enzymatic Methyl-seq Kit (New England Biolabs) according to the manufacturer’s instructions, with one modification: 100 ng of carrier RNA (TIANGEN) will be added before denaturation with sodium hydroxide. Libraries will be amplified with 9 cycles of PCR and quantified with a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific). The constructed libraries will then be sequenced on a NovaSeq 6000 (Illumina) with a paired-end read length of 100 bp.
Model Establishment
The estimation dataset will be used to establish a machine learning model that predicts cancer probability. Plasma cfDNA features, including fragment characteristics and methylation status, will undergo data preprocessing, feature filtering and model selection, and a separate classifier model will be constructed for each feature. We previously defined these features as the methylated fragment ratio (MFR), chromosomal aneuploidy of featured fragments (CAFF), fragment size index (FSI) and fragment end motif (FEM).
MFR
Genomic methylation is extensively altered from the early stages of cancer development, and ctDNA released by cancer cells can change the methylation levels of cfDNA in plasma. The whole genome will be tiled into nonoverlapping 1 Mb windows, and the methylation level of each window will be quantified as the fraction of fragments whose CpGs are all methylated, giving the MFR.
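The windowing step can be illustrated with a minimal Python sketch. The input layout (chromosome, start coordinate, number of CpGs covered, number of methylated CpGs), the assignment of a fragment to a window by its start coordinate, and the exclusion of CpG-free fragments from the denominator are all illustrative assumptions rather than the protocol's actual pipeline.

```python
from collections import defaultdict

WINDOW_SIZE = 1_000_000  # nonoverlapping 1 Mb windows, as in the protocol


def methylated_fragment_ratio(fragments, window_size=WINDOW_SIZE):
    """Return {(chrom, window_index): MFR} over fragments covering >= 1 CpG.

    Each fragment is a hypothetical tuple:
    (chrom, start, n_cpgs_covered, n_cpgs_methylated).
    """
    total = defaultdict(int)
    fully_methylated = defaultdict(int)
    for chrom, start, n_cpg, n_meth in fragments:
        if n_cpg == 0:
            continue  # assumption: CpG-free fragments carry no methylation signal
        key = (chrom, start // window_size)
        total[key] += 1
        if n_meth == n_cpg:
            fully_methylated[key] += 1
    return {key: fully_methylated[key] / total[key] for key in total}


# Toy data, invented for illustration only
fragments = [
    ("chr1", 10_500, 4, 4),     # fully methylated, window 0
    ("chr1", 250_000, 3, 1),    # partially methylated, window 0
    ("chr1", 1_200_000, 2, 2),  # fully methylated, window 1
    ("chr1", 1_900_000, 0, 0),  # no CpG covered, skipped
]
mfr = methylated_fragment_ratio(fragments)
print(mfr)
```

In practice the per-fragment CpG calls would come from the aligned WMS reads; the sketch only shows how the ratio is formed once those calls exist.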
CAFF
Cancer cells often exhibit chromosomal instability, in which partial or whole chromosome arms are amplified or lost (29, 30). A previously described plasma aneuploidy (PA) score will be calculated for each sample from the copy number alteration levels of the five chromosome arms that show the most dramatic copy number alterations relative to baseline samples (20).
FSI
The size distribution of ctDNA fragments released from cancer cells differs from that of other cfDNA (33). The whole genome will be tiled into nonoverlapping 1 Mb windows, and within each window the ratio of short-fragment reads (100–166 bp) to long-fragment reads (169–240 bp) will be calculated and defined as the FSI. The FSI of each sample will be compared with the healthy population baseline to estimate the cancer risk of each subject.
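The short-to-long ratio per window can be sketched in Python as follows. The size ranges are those stated above; the input tuples, the window assignment by start coordinate and the handling of windows without long fragments are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SIZE = 1_000_000   # nonoverlapping 1 Mb windows
SHORT_RANGE = (100, 166)  # bp, inclusive
LONG_RANGE = (169, 240)   # bp, inclusive


def fragment_size_index(fragments, window_size=WINDOW_SIZE):
    """Return {(chrom, window_index): short/long fragment ratio}.

    Each fragment is a hypothetical tuple (chrom, start, length_bp).
    Windows with no long fragments are omitted to avoid division by zero.
    """
    short = defaultdict(int)
    long_ = defaultdict(int)
    for chrom, start, length in fragments:
        key = (chrom, start // window_size)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            short[key] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            long_[key] += 1
    return {key: short[key] / n_long for key, n_long in long_.items()}


# Toy data, invented for illustration only
fragments = [
    ("chr2", 5_000, 120),
    ("chr2", 6_000, 150),
    ("chr2", 7_000, 166),
    ("chr2", 8_000, 180),
    ("chr2", 9_000, 200),
    ("chr2", 9_500, 167),      # falls between the two ranges, ignored
    ("chr2", 2_000_000, 130),  # window without long fragments, omitted
]
fsi = fragment_size_index(fragments)
print(fsi)
```

Comparison against a healthy baseline (for example, as a per-window deviation) would follow this step; how that comparison is scored is not specified here and is left out of the sketch.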
FEM
The preferred FEMs of ctDNA differ from those of cfDNA owing to the downregulation of DNA nucleases involved in cfDNA fragmentation (31). The frequencies of all 256 possible 4-nucleotide (4-mer) motifs at fragment 5’ ends will be analysed (refer) with the following modifications: (1) only fragments shorter than 171 bp will be selected for analysis; and (2) only reads mapped to the Crick strand will be used for calculation.
An ensemble classifier named Through Epigenetic Marker Integration Solution (THEMIS), based on a generalized linear model (GLM) with elastic-net penalization, will be developed by integrating the four abovementioned features: the cancer probability scores predicted by the MFR, FSI and FEM classifiers, together with the PA score from CAFF. The model will be built with R packages such as caret under 20-fold cross-validation.
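A minimal Python sketch of the motif counting is given below. It assumes that each fragment has already been reduced to its 5' 4-mer, its length and a strand label, and that "-" denotes the Crick strand; both the input layout and the strand convention are assumptions, as is the decision to discard motifs containing ambiguous bases.

```python
from collections import Counter
from itertools import product

MAX_LENGTH = 171  # only fragments shorter than 171 bp are analysed
ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]  # 4^4 = 256 motifs


def end_motif_frequencies(fragments, strand="-"):
    """Return {motif: frequency} for all 256 4-mer 5' end motifs.

    Each fragment is a hypothetical tuple (five_prime_4mer, length_bp, strand).
    Assumption: "-" marks reads mapped to the Crick strand.
    """
    counts = Counter()
    for motif, length, frag_strand in fragments:
        motif = motif.upper()
        if length >= MAX_LENGTH or frag_strand != strand:
            continue
        if len(motif) != 4 or not set(motif) <= set("ACGT"):
            continue  # assumption: skip motifs with ambiguous bases such as N
        counts[motif] += 1
    n = sum(counts.values())
    return {m: (counts[m] / n if n else 0.0) for m in ALL_MOTIFS}


# Toy data, invented for illustration only
fragments = [
    ("CCCA", 150, "-"),
    ("CCCA", 160, "-"),
    ("CCTG", 165, "-"),
    ("AAAA", 140, "-"),
    ("CCCA", 180, "-"),  # too long, excluded
    ("CCCA", 150, "+"),  # other strand, excluded
    ("NCCA", 150, "-"),  # ambiguous base, excluded
]
freqs = end_motif_frequencies(fragments)
print(freqs["CCCA"], freqs["CCTG"])
```

The resulting 256-element frequency vector is the kind of input a per-feature classifier could consume.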
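To make the integration step concrete, the following Python sketch fits an elastic-net-penalized logistic regression on four per-sample scores by proximal gradient descent. It is not the caret-based implementation named in the protocol: the binomial family, the penalty parameterization, the hyperparameters, the absence of cross-validation and the toy data are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 (lasso) penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict_proba(row, weights, intercept):
    return sigmoid(intercept + sum(w * x for w, x in zip(weights, row)))


def fit_elastic_net_logistic(X, y, lam=0.01, alpha=0.5, lr=0.5, n_iter=2000):
    """Binomial GLM with penalty lam * (alpha*|w|_1 + (1-alpha)/2 * |w|_2^2)."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        errors = [predict_proba(row, weights, intercept) - label
                  for row, label in zip(X, y)]
        for j in range(p):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n
            grad += lam * (1 - alpha) * weights[j]         # ridge gradient
            weights[j] = soft_threshold(weights[j] - lr * grad,
                                        lr * lam * alpha)  # lasso shrinkage
        intercept -= lr * sum(errors) / n                  # intercept is unpenalized
    return weights, intercept


# Hypothetical per-sample inputs: [MFR score, FSI score, FEM score, PA score]
X = [
    [0.9, 0.8, 0.7, 2.1],
    [0.8, 0.9, 0.6, 1.8],
    [0.7, 0.6, 0.8, 1.5],
    [0.2, 0.3, 0.1, 0.2],
    [0.1, 0.2, 0.3, 0.1],
    [0.3, 0.1, 0.2, 0.4],
]
y = [1, 1, 1, 0, 0, 0]  # 1 = cancer, 0 = control

weights, intercept = fit_elastic_net_logistic(X, y)
probs = [predict_proba(row, weights, intercept) for row in X]
print([round(p, 2) for p in probs])
```

In a real analysis the penalty strength and mixing parameter would be tuned by cross-validation, and the scores fed to the ensemble would come from held-out predictions of the individual classifiers to avoid leakage.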
Sample Size Estimate And Statistical Analysis
The sample size was estimated using the strategy described in (32). In the abovementioned pilot study, the ensemble THEMIS model reached a sensitivity of 94% (to be confirmed by our bioinformatics team) at 95% specificity (Fig. 1). We expect the new model, which integrates features similar to those of THEMIS, to reach a sensitivity of at least 90%. Allowing a margin of error of 0.08 (half-width of the two-sided 95% confidence interval, alpha = 0.05), at least 54 cancer cases are needed. Allowing for a drop-out rate of 10%, no fewer than 60 cases will be needed. The same number of subjects will be enrolled in the nonmalignant control group in this case-control study. Subjects in both groups will be allocated to the training and testing datasets at the same 2:1 ratio. A total of 360 subjects (180 cancer cases and 180 controls) will therefore be enrolled to ensure at least 60 cases and 60 controls in the testing dataset; that is, each group shall contain no fewer than 180 subjects. The malignant and control groups will be matched by age and sex to minimize possible confounding factors (Fig. 2).
The area under the receiver operating characteristic (ROC) curve (AUC), calculated with the pROC package, will be used to evaluate the performance of the individual classifier models and the ensemble model in differentiating cancer from noncancer. Clinical diagnosis results, including but not limited to gastroscopy and/or pathological determinations, are considered the “gold standard”. The cancer risk threshold (cut-off value) will be determined on the training dataset. The prediction performance of the established model will then be evaluated in the testing dataset in terms of sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV). These measures and their 95% confidence intervals will be calculated using the epiR package. Cohen’s kappa will be calculated with the vcd R package to assess the concordance between the predicted results and the pathological diagnoses. A single-blind analysis will be applied: the bioinformatics analysts will be blinded to whether subjects have gastric cancer. Results will be considered significant at p < 0.05. All statistical analyses will be performed in R (v.3.6.3).
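The arithmetic behind these numbers can be checked with a short Python sketch. It assumes the standard normal-approximation formula for estimating a proportion, n = z² · p · (1 − p) / d², with d = 0.08 taken as the half-width of the confidence interval; the protocol cites its method without stating the formula, so this is a plausible reconstruction rather than the authors' confirmed calculation.

```python
import math
from fractions import Fraction
from statistics import NormalDist


def cases_for_sensitivity(expected_sens, margin, alpha=0.05):
    """Normal-approximation sample size: n = z^2 * p * (1 - p) / d^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z ** 2 * expected_sens * (1 - expected_sens) / margin ** 2


n_raw = cases_for_sensitivity(0.90, 0.08)
n_cases = round(n_raw)  # nearest integer, matching the protocol's 54

# Inflate for 10% drop-out; Fraction keeps the division exact
n_with_dropout = math.ceil(n_cases / (1 - Fraction(1, 10)))

# A 2:1 training:testing split leaves one third of each group for testing
test_fraction = Fraction(1, 3)
n_per_group = math.ceil(n_with_dropout / test_fraction)

print(n_cases, n_with_dropout, n_per_group, 2 * n_per_group)
```

Under these assumptions the raw value is just above 54, so rounding up strictly would give 55 cases; the enrolment target of 180 per group is reached either way only if the protocol's figure of 54 is used before the drop-out adjustment.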
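The performance measures named above can be illustrated with a dependency-free Python sketch. The protocol will use pROC, epiR and vcd in R; the functions below are stand-ins that compute point estimates only (no confidence intervals), and the scores, labels and the 0.5 cut-off are invented for illustration.

```python
def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV and Cohen's kappa from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the row and column marginals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


def auc(y_true, scores):
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# Toy data, invented for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
cutoff = 0.5  # hypothetical; the protocol fixes the cut-off on the training set
y_pred = [1 if s >= cutoff else 0 for s in scores]

metrics = diagnostic_metrics(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(metrics, roc_auc)
```

The sketch does not guard against empty cells (for example, no predicted positives), which a production analysis would need to handle.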
Patient And Public Involvement Statement
Patients, healthy subjects and the public will not be involved in the design, conduct or reporting of this study. We do not plan to inform study participants of the results unless they request them.