Refining the serum miR-371a-3p test for viable germ cell tumor detection: identification and definition of an indeterminate range

Circulating miR-371a-3p has excellent performance in the detection of viable (non-teratoma) GCT pre-orchiectomy; however, its ability to detect occult disease is understudied. To refine the serum miR-371a-3p assay in the minimal residual disease setting, we compared performance of raw (Cq) and normalized (ΔCq, RQ) values from prior assays, and validated interlaboratory concordance by aliquot swapping. Revised assay performance was determined in a cohort of 32 patients suspected of occult retroperitoneal disease. Assay superiority was determined by comparing the resulting receiver operating characteristic (ROC) curves using the DeLong method. Pairwise t-tests were used to test for interlaboratory concordance. Performance was comparable when thresholding based on raw Cq vs. normalized values. Interlaboratory concordance of miR-371a-3p was high, but reference genes miR-30b-5p and cel-miR-39-3p were discordant. Introduction of an indeterminate range of Cq 28–35 with a repeat run for any indeterminate result improved assay accuracy from 0.84 to 0.92 in a group of patients suspected of occult GCT. We recommend that serum miR-371a-3p test protocols are updated to a) utilize threshold-based approaches using raw Cq values, b) continue to include an endogenous (e.g., miR-30b-5p) and exogenous non-human spike-in (e.g., cel-miR-39-3p) microRNA for quality control, and c) re-run any sample with an indeterminate result.


Introduction
Correct staging in early-stage germ cell tumor (GCT) patients is critical for identifying patients best served with surveillance versus primary management with retroperitoneal lymph node dissection (RPLND), chemotherapy, or radiotherapy [1,2]. In patients with clinical stage I (CS I) GCT, up to 97% of seminomas and 60% of non-seminomas recur on surveillance without marker elevation [3][4][5]. Additionally, 26% of patients with negative serum tumor markers (STM) and negative cross-sectional imaging undergoing RPLND are found to have viable tumor [6]. Consequently, the performance characteristics of current STM introduce substantial risk of under- and over-treatment.
The superior performance of circulating microRNAs (miRNAs), particularly miR-371a-3p, to detect GCT is well documented. However, an agreed, protocolized standard for the definition of positive and negative miR-371a-3p results is lacking. The absence of a standard protocol, in combination with the inherent sensitivity of the test, has contributed to interlaboratory heterogeneity, making comparisons difficult and limiting widespread clinical adoption [7].
We address these issues by performing interlaboratory sample exchange experiments and re-evaluating analytic pipelines for calling results. In addition to positive and negative calls, we identify an indeterminate range, which we then validate in an independent patient cohort undergoing primary RPLND.
These changes improve assay performance, particularly specificity and negative predictive value (NPV), which upon clinical implementation will reduce potential over-treatment of patients without true minimal residual disease.

Methods
Patient Population
Thirty-two chemotherapy-naïve patients underwent primary RPLND for clinical stage I or II GCT. Serum was obtained immediately prior to RPLND. Bilateral full-template or extended modified template nerve-sparing RPLND was performed at the surgeon's discretion. Baseline clinicopathologic data were collected (Table 1). Samples were classified as either 'Control' (pure teratoma or no GCT), or 'Viable GCT' [seminoma or nonseminomatous GCT (NSGCT)].
All experimental protocols were approved by an Institutional Review Board at The University of Texas Southwestern Medical Center (STU 102010-051). Informed consent was obtained from all subjects and/or their legal guardians prior to their inclusion in the study. The authors confirm that all methods described in this manuscript were performed in accordance with the relevant guidelines and regulations.
MiRNA isolation and quantification
RNA extraction and serum miRNA quantification were performed as described [8]. Primers and probes used are detailed in Supplementary Table 1. To calculate relative quantification (RQ), the ∆∆Cq method was used, with the mean of four normal control human male serum samples (donors aged 18-45 years) used as reference.
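As a rough illustration of the ∆∆Cq calculation, the sketch below uses the standard form RQ = 2^(−∆∆Cq), with ∆Cq taken as the target Cq minus the reference Cq. The function names and all Cq values are hypothetical, and the spike-in correction step is omitted because its exact form is not given here; this is not the authors' pipeline.

```python
from statistics import mean


def delta_cq(target_cq: float, reference_cq: float) -> float:
    """dCq: target Cq (e.g. miR-371a-3p) minus reference Cq (e.g. miR-30b-5p)."""
    return target_cq - reference_cq


def relative_quantification(sample_dcq: float, control_dcqs: list) -> float:
    """RQ = 2 ** -(sample dCq - mean dCq of the normal control sera)."""
    ddcq = sample_dcq - mean(control_dcqs)
    return 2.0 ** (-ddcq)


# Made-up Cq values, for illustration only (not data from the study).
control_dcqs = [
    delta_cq(38.0, 20.0),
    delta_cq(39.0, 21.0),
    delta_cq(40.0, 22.0),
    delta_cq(39.0, 21.0),
]
sample_dcq = delta_cq(28.0, 20.0)
rq = relative_quantification(sample_dcq, control_dcqs)
print(rq)
```

Because every RQ depends on the control sera run alongside the sample, any variation in those controls propagates into the reported value.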

Concordance studies
Serum aliquots were shipped between the two research laboratories (Cambridge, UK, and University of Texas Southwestern, US) by priority overnight courier on dry ice. Upon receipt, sample inspection confirmed that none had thawed. Each site followed an identical protocol to yield raw Cq and normalized (∆Cq and RQ) values, which were then compared against one another.
Cq vs. RQ performance
Raw (Cq values) and normalized (∆Cq and RQ values) data from two studies previously published by our group were utilized [9,10]. Optimal thresholds were calculated for each metric using the Youden index [11], and sensitivity, specificity, and area under the receiver-operating characteristic curve (AUC) were calculated.
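A minimal sketch of Youden-index threshold selection on raw Cq values follows, assuming a sample is called positive when its Cq falls below the cutoff (lower Cq meaning more target). The function name, the choice of midpoints as candidate cutoffs, and the example values are illustrative assumptions rather than the authors' code.

```python
def youden_threshold(control_cqs, tumor_cqs):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A sample is called positive when Cq < cutoff.
    Returns None if fewer than two distinct Cq values are supplied.
    """
    values = sorted(set(control_cqs) | set(tumor_cqs))
    # Candidate cutoffs: midpoints between adjacent observed Cq values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for cutoff in candidates:
        sensitivity = sum(cq < cutoff for cq in tumor_cqs) / len(tumor_cqs)
        specificity = sum(cq >= cutoff for cq in control_cqs) / len(control_cqs)
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best


# Made-up Cq values, for illustration only.
controls = [40.0, 40.0, 39.0, 33.0]
tumors = [22.0, 25.0, 27.0, 31.0]
best = youden_threshold(controls, tumors)
print(best)
```

The same routine applies to ∆Cq once the direction of the comparison is checked; RQ runs in the opposite direction (higher values indicate more target), so the inequalities would need to be reversed.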
Establishment and assessment of an indeterminate range
All runs included in our two previous reports [9,10], including any technical replicate PCR runs undertaken, were pooled and grouped based on histology (Control or Viable GCT). An indeterminate range was defined as the 95% confidence interval of the distribution of the first (lower Cq, higher apparent abundance) raw Cq peak, rounded to whole numbers (down at the lower bound and up at the upper bound). The range was subsequently formally assessed for its effect on assay performance.

Statistical analysis
Statistical significance for intergroup differences of clinicopathologic data was determined using the Kruskal-Wallis test with Dunn's post-hoc test. Concordance was assessed by a pairwise t-test. Performance characteristics, including sensitivity, specificity, NPV, positive predictive value (PPV), accuracy, and AUC, were calculated using R version 4.1.2 with the pROC package (version 1.18.0) and tidyverse metapackage (version 1.3.1) [12][13][14]. AUC values were compared using the roc.test function in pROC with default parameters. Two-tailed p < 0.05 was considered statistically significant.
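For readers less familiar with the performance characteristics listed above, the sketch below shows how each follows from a 2×2 table of calls against histology. The analysis itself was done in R with pROC; this Python function, its name, and the counts are hypothetical and only illustrate the definitions.

```python
def performance(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard diagnostic metrics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),  # viable GCT correctly called positive
        "specificity": tn / (tn + fp),  # controls correctly called negative
        "ppv": tp / (tp + fp),          # positive calls that are viable GCT
        "npv": tn / (tn + fn),          # negative calls that are controls
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, for illustration only (not results from the study).
metrics = performance(tp=18, fp=2, tn=8, fn=2)
print(metrics)
```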

Results
Thresholding on Cq simplifies the serum miR-371a-3p test without affecting assay performance
The requirement for a normal control serum sample in each assay run for normalization is costly and adds another potential source of variation. To determine if assay normalization is required, we examined our previously published data from samples taken pre-orchiectomy [10] and pre-RPLND [9]. We examined four metrics with varying levels of normalization: Cq (raw value), ∆Cq (Cq normalized to internal control miR-30b-5p), corrected ∆Cq (∆Cq corrected with an external control cel-miR-39-3p), and RQ (corrected ∆Cq of sample normalized to corrected ∆Cq of normal serum).
Calculated sensitivity and specificity were both greater than 0.9 in all cases and did not change appreciably across any of the metrics tested (Table 1). AUC was 0.97-0.99 for all four metrics, and none were statistically different from one another (all p > 0.05). These results suggest that normalization to endogenous or exogenous controls, or normal healthy serum, does not impact the performance of the serum miR-371a-3p assay.
To examine interlaboratory variation, we conducted a concordance study between the two laboratories. Aliquots of the same serum sample collection were exchanged, and both sites ran identical protocols. miR-371a-3p Cq was highly concordant, with a mean difference of < 0.5 cycles between sites (p = 0.251), Fig. 1. The exogenous non-human spike-in control cel-miR-39-3p was discordant (p = 0.002), likely due to separate preparations of highly concentrated standards. Surprisingly, the endogenous control, miR-30b-5p, was also discordant (p < 0.001). These results suggest that this normalization process introduces additional variation and contributes to interlaboratory heterogeneity. We therefore recommend use of raw Cq values for cutoffs for the serum miR-371a-3p test going forwards.
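The paired comparison behind these concordance p-values can be sketched as follows. This is an illustration with made-up Cq values and a hypothetical function name; it computes only the mean difference, t statistic, and degrees of freedom. A p-value would then be read from the t distribution, for example with a library routine such as SciPy's `scipy.stats.ttest_rel`, which is an assumption about tooling and not what the study used.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(site_a, site_b):
    """Return (mean difference, t statistic, degrees of freedom) for paired Cq values."""
    if len(site_a) != len(site_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(site_a, site_b)]
    n = len(diffs)
    d_bar = mean(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # sample SD of the differences
    return d_bar, d_bar / standard_error, n - 1


# Made-up Cq values for the same four aliquots measured at two sites.
site_a = [30.0, 25.0, 35.0, 28.0]
site_b = [29.0, 25.5, 34.0, 28.5]
d_bar, t_stat, df = paired_t(site_a, site_b)
print(d_bar, t_stat, df)
```

A small mean difference with a small t statistic, as for miR-371a-3p above, indicates no systematic offset between sites; a large t statistic, as for the two reference miRNAs, indicates one.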

Identification and establishment of an indeterminate range
The serum miR-371a-3p test is extremely sensitive, due in part to the pre-amplification step used prior to qPCR, which also exposes the assay to a risk of false positives. This risk is already heightened by the need to open PCR tubes following pre-amplification to set up the qPCR, which may inadvertently spread amplification products. The inclusion of a water ('no template') control (NTC) sample initiated at the reverse transcription step is recommended to combat this; a positive qPCR result on the NTC suggests such upstream contamination. However, we noted occasional cases where known control samples would yield an inconsistent/stochastic positive result despite a negative NTC sample result on the same qPCR run.
Repeating these samples from the reverse transcription step usually yielded the anticipated negative result. In contrast, repeating runs on samples from patients with pathologically verified disease typically returned similar Cq values. Examples of repeated runs for pathologic negative and positive samples are presented in Supplementary Fig. 1.
To investigate the above observation, we aggregated a total of 150 runs from our previously published studies [9,10]. We examined the distribution of Cq values split by group (Control vs. Viable GCT; Supplementary Fig. 2A). Individual sample Cq values are displayed in Supplementary Fig. 2B. The samples in the Viable GCT group show a broad distribution with a mean Cq and standard deviation (SD) of 26.4 ± 4.33. This wide distribution is expected given the heterogeneous population with differing amounts of disease burden. However, the distribution of Cq values in the Control group appeared to be bimodal, with the mean Cq of the first peak at 32.2 ± 1.53, and the mean Cq of the second peak at 39.8 ± 0.7. The mean of the second peak is anticipated, as undetected samples are assigned a Cq of 40. We were surprised that approximately 25% of all runs in the Control group fell into the first peak. Two separate research laboratories (Cambridge, UK; UTSW, Dallas, US) and one clinical laboratory (Department of Pathology, UTSW, Dallas, US) all independently reported this observation, indicating that this is unlikely to be due to technical errors. We have not found any reliable predictor for this assay behavior; it appears to be an entirely stochastic and non-predictable event. This suggests that, as currently applied, the qPCR-based serum miR-371a-3p assay has an approximately 25% chance of misclassifying any true negative as positive.
Mitigation of this misclassification is critical prior to clinical implementation of the test. We reasoned that defining an 'indeterminate' range based on the first distribution and repeating the qPCR for any sample that fell into that range would reduce misclassification from ~25% to ~6% (0.25 × 0.25 = 0.0625). Based on our established assay pipeline, we defined the indeterminate range as Cq 28-35, which approximates the mean of the first Cq peak ± 2 SDs in the controls. We then interrogated our aggregated data again to simulate how application of this revised methodology might improve viable GCT classification. To simulate the original methodology, the first chronological run per sample was selected. To simulate our revised methodology, the first chronological run per sample was selected unless its result fell into the indeterminate range (28 < Cq < 35). If so, the second chronological run was selected. Any sample that remained indeterminate after the second run was classified 'indeterminate' and removed from performance calculations. With this model, the original method had 81 runs. In the revised method, nine samples (11.1%) had two indeterminate results and were classified as truly indeterminate, leaving 72 runs. Two of these nine samples were in the Control group, and the remaining seven were in the Viable GCT group. We then compared the resulting Cq distributions, Fig. 2A.

Application of revised methodology to an updated primary RPLND dataset
Improved performance of the serum miR-371a-3p test would allow for both early detection of recurrence and avoidance of unnecessary treatment. The detection of minimal residual disease (MRD) therefore carries great clinical significance in this context. As serum miR-371a-3p Cq is correlated with tumor burden, detection of MRD demands the greatest performance of this test. We therefore expanded a cohort of chemotherapy-naïve patients receiving primary RPLND and compared the performance of the original and revised methodology.
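The ~6% estimate for the repeat-run strategy can be written out explicitly. It assumes that spurious results on the two runs are independent events, each occurring with probability p ≈ 0.25 for a true negative sample; independence is an assumption suggested by the stochastic behavior described, not a measured property.

```latex
\[
P(\text{spurious result on both runs}) = p^{2} = 0.25 \times 0.25 = 0.0625 \approx 6\%
\]
```

Under the revised method, a sample falling in the indeterminate range on both runs is reported as indeterminate rather than positive.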
Patient characteristics are summarized in Table 2.

Discussion
We report the use of raw circulating miR-371a-3p Cq values, instead of normalized data, for optimal assay performance with excellent interlaboratory concordance. qPCR assays are extensively and routinely used in clinical laboratories and often report results using raw Cq. Introduction of a normalization procedure increases costs and hampers translation into routine clinical testing. Due to the very high sensitivity of the circulating miRNA assay for viable GCT, we initially believed that additional normalization would be necessary to control for variation between runs. However, results from identical samples run in two independent laboratories suggest normalization may be harmful. The addition of these normalization procedures introduces additional technical variation due to the discordance of reference genes (cel-miR-39-3p and miR-30b-5p) without performance benefits.
Other groups used raw data in their assessments and retained high performance [15,16]. However, assays used by these groups differ materially (e.g., the use of plasma extracts, detection by droplet digital PCR (ddPCR), and/or no pre-amplification). Since the largest miRNA studies to date, including a commercially available assay (miRdetect), were conducted with a serum qPCR-based method with pre-amplification, we felt it important to replicate these studies using this particular methodology.
Critically, we have identified and established an indeterminate range to maintain assay performance of the circulating miR-371a-3p test. This arises from the observation in three separate laboratories that any given negative sample has an approximately 25% random or stochastic chance to return a spurious positive result. The existence of this reproducibility issue is further supported by an independent study reporting the existence of an indeterminate range in normalized values [17]. Additionally, Christiansen et al. recently reported that the inclusion of the pre-amplification step improved sensitivity but also led to more false positives [18]. Dropping the assay cutoff below the first distribution would lead to an unacceptable drop in sensitivity. Instead, we elect to define an indeterminate range and rerun any indeterminate extract (Fig. 4). We have observed that upon repeat, most true positive samples will maintain a Cq value very close to the first run, while most true negative samples will yield a negative result. Because outcomes for viable GCT tend to be favorable even in the case of recurrence, we recommend classification of any sample that returns an indeterminate result twice as a true indeterminate. In this clinical scenario, the cost to the patient of over-treatment is comparatively greater than that of under-treatment. Application of our revised method to an expanded cohort of patients with MRD improved specificity and PPV, demonstrating that these changes could prevent over-treatment.
Because many groups use a similar or identical protocol for this test, the question arises as to why this indeterminate range has not previously been described in detail. One contributing factor may be that larger retrospective non-blinded studies using this serum qPCR-based assay have focused on testicular GCT rather than retroperitoneal disease. Because circulating miR-371a-3p levels are dependent upon tumor burden, circulating miR-371a-3p is anticipated to be weakly positive in the context of MRD, rendering cutoff selection difficult. For example, the median Cq value for Viable GCT patients in our orchiectomy cohort [10] was 26.6, below the indeterminate range. However, the median Cq for our original primary RPLND cohort [9] was 29.3, within the indeterminate range. Additionally, a small number of spurious positive results in a control group may be written off as technical error and/or potential contamination, and the qPCR run repeated several times, subsequently yielding negative results. This reinforces the utility of blinding technicians and analysts when conducting assays.

Conclusion
We recommend three important modifications to serum miR-371a-3p assay protocols going forward: 1) revise the test by applying cutoffs to raw Cq values instead of normalized values; 2) include endogenous (e.g., miR-30b-5p) and exogenous (e.g., cel-miR-39-3p) controls for quality control purposes; 3) include an indeterminate range to enhance specificity. These changes reduce the complexity and cost of the test while improving performance, particularly with regard to the detection of MRD. We believe the present work regarding reproducibility and thresholding provides a substantial step towards the clinical implementation of the serum miR-371a-3p assay for management of patients with viable GCT disease.

Decision-making flowchart for revised serum miR-371a-3p method. First, results of miR-30b-5p are evaluated for quality control. If Cq > 30, insufficient RNA may have been isolated from the serum sample, and the sample should be re-extracted. If Cq < 30, proceed to evaluation of miR-371a-3p. If Cq < 28, accept positive result. If Cq > 35, accept negative result. If 28 < Cq < 35, the assay should be repeated from the reverse transcription (RT) step. If Cq < 28 or Cq > 35, accept results as above. If 28 < Cq < 35 again, report indeterminate and recommend short interval follow-up.
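The decision flowchart described above can be expressed as a short function. This is a sketch under stated assumptions: the function and constant names are hypothetical, and because the flowchart does not say how values exactly equal to 28, 30, or 35 are handled, the code treats Cq 28 and 35 as indeterminate and a miR-30b-5p Cq of exactly 30 as passing quality control.

```python
from typing import Optional

QC_MAX_CQ = 30.0       # miR-30b-5p Cq above this suggests insufficient RNA
POSITIVE_BELOW = 28.0  # miR-371a-3p Cq below this is accepted as positive
NEGATIVE_ABOVE = 35.0  # miR-371a-3p Cq above this is accepted as negative


def call_run(mir371_cq: float) -> str:
    """Call a single miR-371a-3p run (boundary values assumed indeterminate)."""
    if mir371_cq < POSITIVE_BELOW:
        return "positive"
    if mir371_cq > NEGATIVE_ABOVE:
        return "negative"
    return "indeterminate"


def classify(mir30b_cq: float, first_cq: float,
             repeat_cq: Optional[float] = None) -> str:
    """Apply the flowchart: QC check, first run, then one repeat if needed."""
    if mir30b_cq > QC_MAX_CQ:
        return "re-extract"
    first_call = call_run(first_cq)
    if first_call != "indeterminate":
        return first_call
    if repeat_cq is None:
        return "repeat from RT"
    # A second indeterminate result is reported as indeterminate.
    return call_run(repeat_cq)


print(classify(22.0, 31.0, 39.0))
```

Returning an explicit "repeat from RT" state keeps the first-run call separate from the final report, so a sample is never reported as positive or negative on the strength of a single indeterminate run.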
