Optimizing risk-based breast cancer screening policies with reinforcement learning

Screening programs must balance the benefit of early detection with the cost of overscreening. Here, we introduce Tempo, a novel reinforcement learning-based framework for personalized screening, and demonstrate its efficacy in the context of breast cancer. We trained our risk-based screening policies on a large screening mammography dataset from Massachusetts General Hospital (MGH; USA) and validated the resulting policies in held-out patients from MGH and in external datasets from Emory University (Emory; USA), Karolinska Institute (Karolinska; Sweden) and Chang Gung Memorial Hospital (CGMH; Taiwan). Across all test sets, we find that the Tempo policy combined with an image-based artificial intelligence (AI) risk model is significantly more efficient than current regimens used in clinical practice in terms of simulated early detection per screening frequency. Moreover, we show that the same Tempo policy can be easily adapted to a wide range of possible screening preferences, allowing clinicians to select their desired trade-off between early detection and screening costs without training new policies. Finally, we demonstrate that Tempo policies based on AI risk models outperform Tempo policies based on less accurate clinical risk models. Altogether, our results show that pairing AI-based risk models with agile AI-designed screening policies has the potential to improve screening programs by advancing early detection while reducing overscreening. A reinforcement learning model can predict risk-based follow-up recommendations to improve early detection and reduce screening costs in breast cancer across diverse patient populations.

For multiple diseases, early detection significantly improves patient outcomes 1,2 . This motivates considerable investments in population-wide screening programs 3,4 such as mammography for breast cancer. To be effective and economically viable, these programs must find the right balance between early detection and overscreening. This capacity builds on two complementary technologies: (1) the ability to accurately assess patient risk at a given time point and (2) the ability to design screening regimens based on this risk. With recent advances in deep learning, imaging and genetics, risk assessment technologies are rapidly improving 5,6 . However, our ability to utilize these predictions to personalize screening regimens lags behind. This deficiency is particularly apparent when the screening system has limited throughput.
In this paper, we focus on the design of screening regimens attuned to the increased capacity of the modern risk assessment models. The need for new methods to personalize screening is motivated by a substantial change in risk assessment algorithms. Traditional risk assessment models rely on a number of categorical variables encoding patient demographics and clinical history combined with traditional statistical models to predict risk 7,8 . These scores are relatively static throughout a patient's lifetime, with changes typically driven by the patient's age. Moreover, the limited predictive capacity of these risk models restricts the scope of recommendations they support and, consequently, their impact on the screening regimen. Current guidelines divide the population into a few large groups, most often discriminating predicted high-risk patients from the rest, and recommend the same screening frequency to all the members of that cohort 9-11 . As a result, there remain large opportunities to further personalize care.
The power of novel, AI-driven risk models 6,12-14 has given us an opportunity to fundamentally transform population screening. Deep learning algorithms enable these risk models to operate over raw patient data, such as imaging, in addition to traditional expert-specified categorical variables. Moreover, these models can detect highly complex dependencies, which further strengthens their predictive capacity relative to traditional methods. One distinctive feature of these risk models is that their predictions may fluctuate over time as the patient's raw data evolve. This feature suggests that screening regimens need to be flexibly adjusted with changes in risk and optimized over a patient's lifetime. We hypothesize that by pairing AI-based risk models and agile AI-based screening regimens, we can improve early detection while lowering the overall cost of screening. This article presents empirical findings that support this hypothesis in the area of breast cancer screening. The core methodology is applicable to other disease areas and other types of risk models beyond imaging.

Our goal is to design screening policies that maximize each patient's chance for early detection while minimizing screening costs. Intuitively, such a policy should recommend infrequent screenings for low-risk patients while prescribing a higher frequency of screenings for patients at increased risk. The question is how to personalize screening intervals based on a patient's risk profile. More formally, we can cast the screening problem as a Markov decision process, where a patient's state is their risk assessment, the possible actions are different follow-up recommendations (e.g., 6 months or 2 years) and rewards are a combination of expected early detection benefits and screening costs. This formulation enables us to find the best possible policy for this Markov decision process with reinforcement learning (RL) algorithms 15,16 .
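As a schematic illustration of this Markov decision process, the sketch below encodes the state, action space and scalarized reward described above. All names, the default weights and the threshold rule are our own expository assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Follow-up options in 6-month steps, matching the action space described later.
ACTIONS_MONTHS = [6, 12, 18, 24, 30, 36]

@dataclass
class State:
    risk_assessment: float  # model-predicted cancer risk at the current exam

def reward(early_detection_months: float, screens_per_year: float,
           w_detect: float = 1.0, w_cost: float = 0.5) -> float:
    """Scalarized reward: weighted early detection benefit minus screening cost."""
    return w_detect * early_detection_months - w_cost * screens_per_year

def toy_policy(state: State) -> int:
    """A policy maps a state to a follow-up action; this threshold rule is a
    placeholder for the learned neural network policy."""
    return ACTIONS_MONTHS[0] if state.risk_assessment > 0.1 else ACTIONS_MONTHS[-1]
```

Under this toy formulation, a high-risk patient is recalled at 6 months while a low-risk patient returns in 3 years, which is the intuition the trained policy refines.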
RL algorithms train policies (i.e., machine learning models) to make a sequence of decisions that maximize future reward (e.g., early detection benefits) without explicit guidance on the right decision at intermediate steps.
Policies are initialized randomly, and through a mix of random exploration and utilization of current knowledge, RL algorithms iteratively improve policies. We show how to leverage RL methods to determine effective cancer screening policies from retrospective screening data.
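The iterative improvement loop described above can be caricatured with tabular Q-learning; Tempo itself uses deep multi-objective RL, so the update rule and epsilon-greedy exploration below are a minimal stand-in for exposition only.

```python
import random

def q_update(Q, state, action, r, next_state, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: nudge Q(s, a) toward the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (r + gamma * best_next - old)

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Mix random exploration with exploitation of the current Q estimates."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

Repeating these two steps over many simulated trajectories, with epsilon decaying over time, is the essence of how an initially random policy improves without per-step supervision.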
Applying RL in this context poses a unique challenge, namely the estimation of patient trajectories from retrospective data. The training data pertaining to individual patients only contain information about their risk at the time points when the mammogram was taken. However, to determine whether the algorithm makes the correct recommendation, we need to know the risk assessment at intermediate points. Therefore, we design an algorithm that learns to extrapolate a patient's risk at unobserved time points from the observed screenings. This estimation evolves as new mammograms of the patient become available. With access to these predictions, we can guide our reinforcement learner to adjust its actions according to the estimated risk. Using the retrospective trajectories as our simulation environment, we train screening policies to maximize the future reward given the patient's evolving risk assessments, as illustrated in Fig. 1. In doing so, our trained screening policies are specialized to the dynamics and subtleties of the underlying risk model.
Our full framework, named Tempo, is depicted in Fig. 2. As described above, we first train a risk progression neural network to predict future risk assessments given previous assessments. This model is then used to estimate patient risk at unobserved time points, and it enables us to simulate risk-based screening policies.
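The role of the risk progression network can be sketched as a recurrent update: each observed assessment refreshes a hidden state, and unobserved 6-month steps are filled by rolling that state forward. The single-scalar cell below is a stdlib-only caricature of the trained RNN; its weights and update rule are assumptions for illustration.

```python
import math

def rnn_step(hidden, observed_risk, w_h=0.6, w_x=0.4):
    """One recurrent update: blend the carried hidden state with an input."""
    return math.tanh(w_h * hidden + w_x * observed_risk)

def extrapolate_risk(observed, horizon):
    """Fill in a risk estimate at every 6-month step up to `horizon` steps.
    `observed` maps step index -> risk assessment from an actual mammogram;
    at unobserved steps the hidden state is rolled forward on itself."""
    hidden, filled = 0.0, []
    for t in range(horizon):
        if t in observed:          # exam available: update on real data
            hidden = rnn_step(hidden, observed[t])
        else:                      # unobserved: extrapolate from prior state
            hidden = rnn_step(hidden, hidden)
        filled.append(hidden)
    return filled
```

As in the paper's design, the extrapolation is revised whenever a new mammogram (a new entry in `observed`) becomes available.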
Next, we train our screening policy, which is implemented as a neural network, to maximize the reward (i.e., a combination of early detection and screening costs) on our retrospective training set. We train our screening policy to support all possible early detection versus screening cost trade-offs using Envelope Q-learning 17 , an RL algorithm designed to balance multiple objectives. The input of our screening policy is the patient's risk assessment and the desired weighting between rewards (i.e., screening preference). The output of the policy is a recommendation for when to return for the next screen, ranging from 6 months to 3 years in the future, in multiples of 6 months. Our reward balances two contrasting aspects, one reflecting the imaging cost (i.e., the average number of mammograms per year recommended by the policy) and one modeling the early detection benefit relative to the retrospective screening trajectory. Our early detection reward measures the time difference in months between each patient's recommended screening date, if it was after their last negative mammogram, and the actual diagnosis date. We evaluate screening policies by simulating their recommendations for held-out patients. The exact reward details and the neural network architectures used are elaborated in Methods.

As illustrated in Fig. 1, for each recommended trajectory we can compute the screening cost and early detection benefit relative to the historical screening. We measure the early detection benefit of a policy by comparing its recommended screening dates with the last known negative date and the known cancer date. In the example of Fig. 1, Tempo-Mirai, annual screening and biennial screening obtained early detection benefits of 6 months, 0 months and −12 months, recommending an average of 1.0, 1.0 and 0.5 mammograms per year for the illustrated patient, respectively.
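To make the two reward components concrete, the sketch below computes them for a simulated trajectory. The month-based common timeline, the capping behavior and the function names are our simplifying assumptions, not the exact implementation detailed in Methods.

```python
def early_detection_benefit(simulated_detection, historical_detection,
                            cap_months=18):
    """Early detection benefit in months relative to historical screening
    (positive = earlier), capped at `cap_months`. Both arguments are dates
    expressed in months on a shared timeline."""
    return min(historical_detection - simulated_detection, cap_months)

def screens_per_year(recommended_dates, years_simulated):
    """Screening cost: average number of recommended mammograms per year."""
    return len(recommended_dates) / years_simulated
```

Under this reading, a policy whose recommended screen falls 6 months before the historical diagnosis earns +6, one that matches it earns 0, and one that arrives 12 months late earns −12, in line with the example trajectories discussed for Fig. 1.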
Our primary objective was to develop personalized screening policies that would outperform current guidelines, improving early detection while reducing screening costs. To this end, we developed Tempo-Mirai, an RL-trained screening policy that operates on Mirai risk assessments. This policy takes as input a patient's Mirai risk assessment and outputs a follow-up recommendation as illustrated in Fig. 2. We implemented our risk progression model, which extrapolates unobserved Mirai risk assessments from prior risk assessments, as a recurrent neural network (RNN). This method is described in detail in Methods and validated in Supplementary Table 6. Sample risk progression predictions are shown in Extended Data Fig. 1.
We compared Tempo-Mirai with existing screening guidelines, including annual screening, biennial screening and a hybrid screening strategy recommended by the US Preventive Services Task Force (USPSTF) 11 , which switches from annual screening to biennial screening at age 55 years. To assess the benefit of leveraging Mirai, a mammography-based AI risk model, over a traditional clinical risk model in the Tempo framework, we also developed Tempo-TCv8, a Tempo policy that operates on TCv8 risk assessments. We utilized a deterministic model, static risk, to estimate risk progression for TCv8. This model is detailed in Methods. To quantify the benefit of using our RL approach to develop risk-based screening policies (i.e., Tempo) over a supervised learning approach, we also developed Supervised-Mirai and Supervised-TCv8. Instead of maximizing the overall reward with RL (without supervision for intermediate decisions), our supervised learning approach trains policies to predict the optimal follow-up recommendation at each time step. These baselines are detailed in Methods.
For each policy (e.g., Tempo-Mirai or annual screening), we measure its screening cost in terms of the average number of mammograms it recommends per year and its early detection benefit in months relative to historical screening. Our early detection metric assumed that early screening, following a patient's last negative mammogram, could offer a maximum early detection benefit of 18 months. We note that our early detection benefit metric is local and institution specific, as different institutions have different screening patterns. To directly compare policies that recommend differing numbers of mammograms, we also evaluated the efficiency of each policy, as measured by the early detection benefit in months divided by the number of mammograms per year recommended. Our efficiency metric is best suited to compare policies that obtain positive early detection benefits.
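The three evaluation quantities above (cost, benefit and efficiency) can be aggregated over a simulated cohort as follows; the helper name and the plain averaging are illustrative assumptions.

```python
def policy_summary(benefits_months, screens_per_year_list):
    """Aggregate a policy's simulated metrics over a patient cohort: mean
    early detection benefit (months), mean screens per year, and their
    ratio, i.e., the screening efficiency. As noted above, efficiency is
    only a meaningful comparator when the mean benefit is positive."""
    n = len(benefits_months)
    mean_benefit = sum(benefits_months) / n
    mean_cost = sum(screens_per_year_list) / n
    return mean_benefit, mean_cost, mean_benefit / mean_cost
```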
Evaluating personalized screening policies. The results of all screening policies across the MGH, Emory, Karolinska and CGMH test sets are illustrated in Table 2. We utilized the same Tempo-Mirai operating point across all test sets. We illustrate the performance of Tempo across different operating points (i.e., screening preferences) in all test sets in Fig. 3.
In addition to overall performance on the test sets, we also studied the histogram of early detection benefits in Extended Data Fig. 2, and the histogram of recommended screening frequencies in Fig. 4 and Extended Data Fig. 3. We note that all trained policies (for example, Tempo-Mirai and Supervised-Mirai) have the same set of possible recommendations, ranging from a 6-month to a 3-year screening follow-up, but we found that Supervised-Mirai only selected two options, recommending either 6 months or 3 years of follow-up. In contrast, Tempo-Mirai at our chosen operating point leveraged follow-up recommendations of 6 months, 1 year and 2 years. As shown in Fig. 4, we found that Tempo-Mirai offered a wider range of recommended frequencies than other methods, reflecting a larger degree of personalization. This reflects the optimization differences between the two policies. Tempo-Mirai is optimized to maximize the overall reward across patient trajectories, as measured by early detection and screening cost, and does not receive any explicit guidance on the correct recommendation given a specific risk assessment. As a result, Tempo-Mirai has the flexibility to explore a wide range of possible recommendations during training to identify high-performing policies. In contrast, Supervised-Mirai has a rigid modeling objective; it is instead trained to predict the optimal (i.e., correct in hindsight) screening recommendation from each risk assessment, which is difficult given the uncertainty of real-world risk models.
To understand the flexibility of Tempo-based policies, we plotted the performance of each policy in Fig. 3 while varying the screening preference (i.e., operating point), which specifies the desired balance between early detection and screening cost. Across a wide range of possible operating points, Tempo-Mirai outperformed other policies in increasing early detection and reducing screening costs, demonstrating that the policy can be easily adapted to suit clinical requirements without retraining.
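Adapting one trained policy to many operating points works because, in multi-objective Q-learning of this kind, each action carries a vector-valued Q estimate that is scalarized by the chosen preference weighting only at decision time. The sketch below illustrates the idea; the vector values and two-objective layout are our assumptions.

```python
def select_action(vector_q, preference):
    """Preference-conditioned action selection: `vector_q` maps each
    follow-up action to a vector Q-value (early detection benefit,
    negated screening cost); `preference` weights the objectives. One
    trained policy can thus serve many operating points without retraining."""
    def scalarize(qvec):
        return sum(w * q for w, q in zip(preference, qvec))
    return max(vector_q, key=lambda a: scalarize(vector_q[a]))
```

For example, a detection-heavy preference favors the short 6-month follow-up, while a cost-heavy preference flips the same Q estimates toward the 3-year follow-up.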
While the above results show that Tempo-Mirai consistently improved over alternate policies in screening efficiency, we also observed that the absolute magnitude of early detection varied substantially across different datasets; for instance, annual screening obtained early detection benefits of 1.58 months (95% CI, 0.54, 2.58) and 3.21 months in different test sets. In Table 2, for each policy, we report the average number of mammograms per year, the early detection benefit in months relative to historical screening (where a higher positive number means earlier) and the screening efficiency (where a higher positive number is better). We defined screening efficiency as the early detection benefit divided by the average number of mammograms per year. All metrics are followed by their 95% CI.

Robustness to assumptions. Our empirical results across the different test sets depend on the exact choice of assumptions of our early detection metric. As illustrated in Fig. 1, our early detection metric measured the time difference in months between each patient's recommended screening date and the diagnosis date. Our metric assumed that the maximum early detection benefit obtained through earlier screening was 18 months. To test our model's robustness to this assumption, we also evaluated Tempo-Mirai, Supervised-Mirai and annual screening across all test sets when setting our maximum early detection benefit assumption to 6, 12, 18 and 24 months. We note that we did not retrain Tempo-Mirai for this analysis and that Tempo-Mirai was originally trained using the 18-month assumption. For each policy, we measured its screening efficiency (i.e., the early detection benefit divided by the number of mammograms recommended per year) to enable a head-to-head comparison between policies that recommend different screening volumes. As shown in Extended Data Fig. 4, Tempo-Mirai is more efficient than annual screening across all datasets and assumptions.
This result is further supported by the histogram of early detection benefits shown in Extended Data Fig. 2.

Discussion
We developed an RL framework for personalized screening, Tempo, to predict follow-up recommendations from patient risk assessments. We demonstrated that a Tempo policy based on Mirai risk assessments was significantly more efficient than annual screening, achieving earlier detection per screening cost. Moreover, we showed that the same Tempo policy can be adapted to a wide range of possible screening preferences and that policies that leverage more accurate risk models (i.e., Mirai) outperform those based on less accurate risk models (i.e., Tyrer-Cuzick). We found that policies developed using data from MGH generalized to held-out test sets in Emory, Karolinska and CGMH and significantly outperformed both annual screening and our supervised learning baselines. Finally, we demonstrated our results were robust across a range of possible assumptions for our early detection metric.
Our screening policies can be easily implemented in any screening clinic where Mirai risk assessments are collected. Clinicians can retrospectively validate our trained screening policies on their own screening population and choose an operating point to achieve the desired balance between screening volume and early detection benefit. The installed policy can then offer clinicians suggested risk-based follow-up intervals immediately after a patient's risk assessment. Depending on clinical requirements, Tempo can be utilized to significantly reduce the volume of screening for a fixed early detection target or improve early detection for a fixed screening budget. For instance, we showed that Tempo-Mirai could obtain better early detection than annual screening at Karolinska while reducing screening by 25%. Given the scale and cost of breast cancer screening, even modest improvements in screening guidelines have the potential to benefit a wide patient population.
Our study is complementary to a rich body of work surrounding risk-based screening 7,18-20 . Several guidelines already recommend supplemental imaging or chemoprevention based on risk assessments 19,21,22 , and recent results from the DENSE trial 23 have shown that a breast density-based screening strategy could significantly reduce interval cancers compared to current screening. Our work is most closely related to the MyPeBS trial 24 , which prospectively compares a personalized screening follow-up strategy based on either Tyrer-Cuzick 8 or MammoRisk 25 risk assessments with current national recommendations. These studies point to substantial clinical interest in risk-based screening; however, current methods for devising screening policies rely on categorizing patients into a few coarse categories (e.g., low and high risk), limiting personalization. Our study provides a data-driven alternative for clinical decision-making and can be easily integrated into a screening trial or routine patient care. Our work is also complementary to ongoing efforts to improve mammography reading; Tempo screening policies can be deployed in tandem with new technologies aimed at improving breast cancer detection at the time of screening (i.e., computer-aided detection 26,27 or triage systems 6,28 ).
Our work is also related to a large volume of modeling studies focused on breast cancer 29-35 . Typically, these approaches operate over a model of disease progression that characterizes how patients transition between healthy and disease states. The transitions are informed by patient features and impact the likelihood of different observations, such as a palpable lump. Their probabilities can be estimated from retrospective data or retrieved from the literature. The approaches then work to identify the optimal screening policy under the specified disease progression model. While these approaches were the first to demonstrate the feasibility of developing personalized screening policies, they have several limitations that restrict their practical use in clinical settings. First, the postulated disease progression model does not capture the full complexity and uncertainty of cancer. Second, the methods generally assume that a patient's features are fixed and do not evolve over their lifetime 35 . This assumption does not hold in general and is not applicable to modern AI-based risk models that are sensitive to changes in patient health. In contrast, our framework does not assume a complete disease progression model; instead, it assumes access to a risk model (rather than a discrete set of states) and a reward function that measures the performance of a screening trajectory given observational data. This relaxed assumption allows us to optimize screening policies directly on observed patient trajectories, which contain the full diversity of cancer diagnoses, as well as validate our policies on held-out patient populations, which may differ in their cancer characteristics, such as Emory, Karolinska and CGMH.
This study focuses on breast cancer screening using image-based risk models. However, our framework is flexible and can be readily utilized for other diseases, other forms of risk models and other definitions of early detection benefit. For instance, it can easily incorporate richer representations of the cancer outcomes. Recent work has highlighted concerns about the potential overtreatment of ductal carcinoma in situ 36 . Tempo policies can take these differences into account by leveraging separate reward metrics for the early detection of invasive and in situ cancers. In this scenario, Tempo policies would be trained using three reward metrics (early detection of invasive cancers, early detection of in situ cancers and screening cost), and clinicians would select a Tempo operating point (i.e., screening preference) that achieves the desired balance among the three metrics. In a similar fashion, our framework can be used to optimize more refined definitions of early detection benefit that account for properties of the cancer (e.g., tumor size and grade) at the time of diagnosis. For instance, given access to a patient's tumor properties, a cancer mortality model and a cancer growth model, a sophisticated early detection metric could directly estimate the reduced mortality risk if the cancer had been diagnosed at an earlier time point. Given a patient's age, this metric could also be directly tied to quality-adjusted life years. Similarly, more sophisticated measures of screening cost that take into account varying false-positive risks depending on patient characteristics (e.g., breast density) could be used to further refine screening policies. In this sense, prior work in modeling cancer mortality and screening benefits 29-34 is complementary to our own. We expect that the utility of Tempo, which is agnostic to the underlying choice of screening metrics and risk model, will increase as risk models and outcome metrics are further refined across more diseases.
There are multiple future directions that can further improve personalized screening algorithms. While our method focused on predicting follow-up recommendations given risk estimates from established risk models, one could instead directly input rich patient information, such as a patient's mammograms and family history, into the screening policy. Directly learning to interpret this information for the purpose of personalized screening in an end-to-end fashion may result in more accurate policies. Moreover, the action space of our method could be expanded to include different types of screening recommendations, such as leveraging magnetic resonance imaging or mammograms, and future work could separately model the costs and benefits of each modality. Finally, given improved screening policies, future work could also recalculate the earliest and latest age such that screening is still cost-effective for a patient.
This study has several limitations. Our early detection metric assumed that cancer is detectable up to a fixed time (18 months) before diagnosis. While we found that the trends reported in our study were robust to different values of this assumption (ranging from 6 to 24 months), none of these assumptions are individually correct across all cancers, as the early detection potential of a tumor depends on that tumor's characteristics at the time of diagnosis. Moreover, our screening cost metric, recommended mammography volume, does not provide a full analysis of screening cost; it does not quantify false-positive risks or additional screening harms.
Our simulations also did not account for the sensitivity of screening mammography or the probability of a patient entering the clinic with a palpable lump if their diagnosis is overly delayed. While our framework is agnostic to the specifics of how the rewards are formulated, further research using more refined early detection metrics, such as quality-adjusted life years, that explicitly model tumor characteristics at the time of detection and tumor growth is needed. While Tempo can be applied with any risk model, Tempo-Mirai inherits the limitations of Mirai. Mirai has only been validated using Hologic full-field digital mammograms, and future work is needed to adapt the risk model to more mammography vendors and tomosynthesis images. Finally, prospective trials are necessary to assess the efficacy of these models in clinical care before widespread adoption.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41591-021-01599-w.

Methods
Study design. The primary objective of this study was to develop personalized screening policies that could improve early detection while reducing screening costs. To this end, we developed Tempo, an RL framework for personalized screening that can be paired with any risk model. As illustrated in Fig. 2, Tempo policies are neural networks that take as input a risk assessment and output a screening follow-up recommendation. In this study, we focused our attention on breast cancer screening, and we hypothesized that our Tempo policies could offer improved early detection benefits over annual screening without requiring more screening. Moreover, we hypothesized that these policies would generalize to new institutions. We developed Tempo-Mirai, an RL-based policy that operates on Mirai (version 0.4.0) risk assessments, and compared this policy to existing guidelines, including annual and age-based screening. Mirai is a deep learning-based risk model that predicts a patient's future risk directly from their mammogram. To assess the benefit of leveraging Mirai risk assessments over a traditional risk assessment model (i.e., Tyrer-Cuzick), we also developed Tempo-TCv8, an RL-based policy that operates on Tyrer-Cuzick risk assessments. To evaluate the benefit of using our RL approach for creating personalized risk policies (i.e., Tempo), we also developed models based on a supervised learning approach, Supervised-Mirai and Supervised-TCv8. An RNN was used to estimate risk progression for Mirai, and a deterministic model, static risk, was used to estimate risk progression for TCv8. All models were trained on the MGH dataset and tested at MGH, Emory, Karolinska and CGMH.

Dataset description.
To develop Tempo, we collected consecutive full-field screening mammograms and detailed risk information at the time of mammography from 80,134 patients screened between 1 January 2009 and 31 December 2016 at MGH under approval of MGH's institutional review board (IRB) with a waiver for written informed consent and in compliance with the Health Insurance Portability and Accountability Act. We obtained outcomes through linkage to a local five-hospital registry in the Massachusetts General Brigham healthcare system, alongside pathology findings from MGH's mammography electronic medical record. We collected detailed risk factors, including those used by the Tyrer-Cuzick model, from provider-entered information and patient-entered questionnaires in the electronic medical record. We associated each mammogram with patient risk factors as present at the time of mammography. We excluded patients who were diagnosed with other cancers (e.g., sarcoma) in the breast or did not have all four views (left craniocaudal (CC), left mediolateral oblique (MLO), right CC and right MLO). For patients who developed cancer, we excluded exams made within 6 months of diagnosis. For patients who did not develop cancer, we excluded exams made within 3 years of the last follow-up screen. We note that 6 months and 3 years are the minimum and maximum follow-up recommendations for Tempo, so this exclusion enabled us to ensure that simulations always occur within the bounds of observed data. This exclusion resulted in 54,673 patients who were randomly split into groups for training (43,749), development (5,399) and testing (5,525). We note that this dataset was also used to develop Mirai 12 , so we used the same training, development and testing splits. Because each patient had multiple exams, this resulted in 137,682, 16,634 and 17,119 exams for training, development and testing, respectively. All mammograms were acquired on Hologic machines.
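The exam-level exclusion windows above can be sketched as a single predicate; the month-based timeline, the function name and the boundary handling (>=) are our assumptions rather than the study's exact filtering code.

```python
def include_exam(exam_month, diagnosis_month, last_followup_month,
                 developed_cancer):
    """Keep an exam only if the full range of possible recommendations
    (6 months to 3 years out) stays within observed data: case exams must
    be at least 6 months before diagnosis, and control exams at least
    3 years before the last follow-up screen. Times are in months."""
    if developed_cancer:
        return diagnosis_month - exam_month >= 6
    return last_followup_month - exam_month >= 36
```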
For each exam, we obtained Mirai 12 risk assessments, as well as TCv8 risk assessments. Detailed demographic information for this dataset is available in Supplementary Table 1, and the dataset construction procedure is shown in Extended Data Fig. 5. We report detailed demographics across all risk factors, including the rate of missing values, in Supplementary Table 5.
To evaluate the ability of Tempo policies to generalize to new populations, we collected the Emory, Karolinska and CGMH datasets under approval of the relevant IRBs with a waiver for written informed consent. To create the Emory test set, which contains a large representation of African American women, we extracted 8 years of full-field mammograms from an institutional database of all comers for screening mammography from 2013 to 2020 and randomly selected 30% of women (28,994 patients). All mammograms were acquired on Hologic machines. We collected outcomes from pathology findings from Emory's mammography electronic medical record. We obtained Mirai risk assessments for each exam. As with the MGH dataset, we excluded exams within 6 months of diagnosis. For patients who did not develop cancer, we excluded exams within 3 years of the last follow-up screen. This resulted in a total of 22,030 exams from 10,340 patients. Detailed demographics of this dataset are shown in Supplementary Table 2, and the dataset construction procedure is shown in Extended Data Fig. 5.
The Karolinska test set was extracted from the cohort of screen-aged women 37 . All women aged 40-74 years within the Karolinska University uptake area who had attended screening and were diagnosed with breast cancer, without implants or prior breast cancer, from 2008 to 2016 were included, as well as a random sample of controls with at least 2 years of follow-up data from the same time period. The full Karolinska case-control and validation datasets included 11,301 and 2,580 women, respectively. A random subset of 9,484 patients in total was selected for inclusion in this study. We included all full-field mammograms, acquired on Hologic machines, from 2008 to 2016 for the included women that contained all four views (left CC, left MLO, right CC and right MLO), resulting in 14,362 exams from 7,193 patients. We excluded exams within 6 months of a cancer diagnosis. For patients who did not develop breast cancer, we excluded exams within 3 years of the last screening follow-up. Because of the case-control dataset design, this dataset has a much higher ratio of patients who developed cancer, relative to the 1.9% incidence reported in the cohort of screen-aged women 37 . To take this into account, we randomly resampled patients who did not develop cancer from our cohort to produce a larger dataset with a 1.9% cancer incidence, resulting in a total of 93,052 exams from 7,193 patients. Detailed demographics are shown in Supplementary Table 3 with the dataset construction procedure in Extended Data Fig. 5. Given the 1.9% patient-level cancer rate and the length of the collection period, we estimated that the 5-year cancer incidence in the Karolinska population was 1.2%. For each exam, we obtained Mirai 12 risk assessments.
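The incidence adjustment above can be illustrated with a small resampling sketch: controls are oversampled with replacement until cases make up the target fraction of the cohort. The function name, rounding and seeding are our assumptions; the study's exact resampling procedure may differ.

```python
import random

def resample_to_incidence(cases, controls, target_incidence, seed=0):
    """Oversample controls with replacement so that cases make up roughly
    `target_incidence` of the combined cohort (e.g., 0.019 for the 1.9%
    incidence described above)."""
    rng = random.Random(seed)
    n_controls = round(len(cases) * (1 - target_incidence) / target_incidence)
    return list(cases) + [rng.choice(controls) for _ in range(n_controls)]
```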
To create the CGMH test set, which consisted of 12,280 exams from 12,280 patients, we randomly selected women undergoing full-field screening mammography at CGMH between 2010 and 2011 who were aged 45-70 years. Women aged 40-44 years were also included if they had a family history of breast cancer, following local screening guidelines. All mammograms were acquired on Hologic machines. Cancer outcomes were obtained from the national cancer registry. Demographics for this dataset are available in Supplementary Table 4 and Extended Data Fig. 5. For all patients, we collected the date of the last screening follow-up, and we excluded patients with unknown age. For each patient who developed cancer, we also manually collected all dates of their future screenings from 2010 to 2020 through chart review, which allowed us to estimate early detection benefits relative to historical screening. We did not collect future screening dates for patients who did not develop cancer. For patients who developed cancer, we excluded exams within 6 months of diagnosis, while for patients who did not develop cancer, we excluded exams within 3 years of the last follow-up screen. For each exam, we obtained Mirai 12 risk assessments. The CGMH test set only included one Mirai risk assessment per patient; as a result, our ability to estimate risk progression at CGMH is more limited than in the other test sets, where the risk progression model benefited from multiple prior observations and made predictions across shorter time intervals. This limitation makes it more difficult to estimate the quality of Mirai-based policies (i.e., Tempo-Mirai and Supervised-Mirai) on this dataset.
For patients with multiple exams in a dataset, we considered each exam in their trajectory as a possible simulation starting point and evaluated screening policies across all starting points. For instance, consider a patient who was screened in years 1, 2 and 3. For training and evaluation, we consider the scenarios in which the patient started to follow the Tempo-Mirai policy at years 1, 2 and 3. Simulating policies from multiple starting points offers more information about the behavior of a policy. To account for these correlated simulations in computing our CIs, we used a clustered bootstrap procedure with 5,000 samples. We note that our risk progression model always had access to all prior observations and was not affected by the choice of simulation starting point.
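The enumeration of simulation starting points described above can be sketched as follows. This is an illustrative sketch, not the released Tempo code: the `Exam` class and `simulation_starting_points` function are our own names.

```python
from dataclasses import dataclass

@dataclass
class Exam:
    step: int    # 6-month time step at which the exam occurred
    risk: float  # risk assessment produced at this exam

def simulation_starting_points(trajectory):
    """Treat every exam in a patient's trajectory as a possible
    simulation starting point. The risk progression model always
    receives the full history of prior observations, regardless of
    which starting point is being simulated."""
    simulations = []
    for i, start in enumerate(trajectory):
        history = trajectory[: i + 1]  # all observations up to the start
        simulations.append({"start_step": start.step, "history": history})
    return simulations

# A patient screened in years 1, 2 and 3 (steps 2, 4 and 6) yields
# three simulations, one starting at each of her exams.
traj = [Exam(step=2, risk=0.01), Exam(step=4, risk=0.02), Exam(step=6, risk=0.05)]
sims = simulation_starting_points(traj)
```

Evaluating a policy over all of these correlated per-patient simulations is what motivates the clustered bootstrap for the CIs.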
For each trajectory, we considered its censor time as either the date of cancer diagnosis via biopsy or the date of the last screening follow-up. We designed our screening policies to offer a minimum follow-up recommendation of 6 months and a maximum follow-up recommendation of 3 years. Because our follow-up intervals were in increments of 6 months, we discretized time across all trajectories into 6-month time steps. This was done by subtracting the first date in the trajectory from all dates and then dividing the date difference by 6 months using integer division (i.e., rounding down). As a result, an exam 9 months after time step 0 was considered step 1. This design decision simplified our simulation code.
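This discretization can be sketched in a few lines; the function name is ours, and approximating 6 months as 182 days is an assumption for illustration:

```python
from datetime import date

def to_time_step(exam_date: date, first_date: date) -> int:
    """Map a date to a 6-month time step: subtract the first date in
    the trajectory, then integer-divide (round down) the difference,
    approximating 6 months as 182 days."""
    return (exam_date - first_date).days // 182

t0 = date(2015, 1, 1)
to_time_step(date(2015, 10, 1), t0)  # an exam ~9 months later falls in step 1
```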
To ensure that our simulations always occurred within the time frame of the observed data, we excluded starting points where cancer was diagnosed in less time than the minimum action (6 months). For screening trajectories without a cancer diagnosis, we excluded starting points where the time to the last screening follow-up was less than the maximum action (3 years). To understand the latter exclusion, consider a patient with no known future cancer date who was screened at year 1 and had her last screening follow-up at year 2. If a Tempo policy recommended follow-up in 3 years (i.e., return at year 4), then we could not assess whether that recommendation would result in a diagnosis delay, as that time point (year 4) is unobserved. To avoid this scenario, we excluded exams where a Tempo policy could not be evaluated (i.e., exams within 3 years of the last follow-up date for patients who did not develop cancer).
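These exclusion rules can be expressed as a simple filter over starting points, in 6-month steps; the function and argument names below are illustrative:

```python
MIN_ACTION_STEPS = 1  # minimum follow-up recommendation: 6 months
MAX_ACTION_STEPS = 6  # maximum follow-up recommendation: 3 years

def is_valid_starting_point(start_step, cancer_step=None, last_followup_step=None):
    """Keep only starting points whose simulation stays inside the
    observed data: a cancer diagnosis must be at least the minimum
    action away, and cancer-free trajectories must have follow-up
    covering the maximum action."""
    if cancer_step is not None:
        return cancer_step - start_step >= MIN_ACTION_STEPS
    return last_followup_step - start_step >= MAX_ACTION_STEPS
```

For the worked example in the text, a cancer-free patient screened at year 1 (step 2) with her last follow-up at year 2 (step 4) is excluded, because 4 - 2 = 2 steps is less than the 6-step maximum action.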
Mammograms were converted to the PNG16 format using the dcmj2pnm command of the DCMTK toolkit (version 3.6.1, 2015). Torchvision (version 0.2.1) and Pillow (version 5.2.0) Python libraries were used for image preprocessing and data augmentations.
Reward design. We considered two rewards in our simulation environment: one measuring imaging cost and one measuring early detection benefit. We modeled our imaging cost reward as the negative of the number of mammograms per year recommended by a policy. To model early detection benefit, we measured the time difference, in 6-month time steps, between each patient's recommended screening date (if it was after their last negative mammogram) and the actual diagnosis date, and then converted this value into months. We defined a patient's diagnosis date as the date of their positive biopsy result. Negative values of this reward imply a delayed diagnosis, and positive values imply a relative screening benefit over the retrospective trajectory. We capped the maximum early detection benefit for any patient at 18 months and did not cap the possible screening delay. As a result, if a patient's last negative mammogram was 3 years before their cancer diagnosis and a screening policy recommended mammograms 2 years and 1 year before the diagnosis, then we assigned this trajectory an early detection benefit of 18 months. We provide additional analysis for different possible assumptions for the maximum screening benefit in Extended Data Fig. 4. We also considered an alternative definition of early detection benefit, where a policy can only offer early detection if it recommends an additional screen within 18 months of the diagnosis date (Supplementary Table 9). In the above example, where a patient is screened 2 years and 1 year before diagnosis, this definition would yield an early detection benefit of 12 months instead of 18 months. Across both definitions (Table 2 and Supplementary Table 9), Tempo-Mirai obtains better efficiency than other guidelines (e.g., annual screening).

Risk progression models. As shown in Fig. 2, our risk progression models take as input a sequence of prior risk assessments and predict a risk assessment at the next time step. We considered two possible methods to estimate risk progression: Static Risk, which always predicted that a patient's risk at the next time step would be the same as at the last time step, and an RNN. Our RNN estimated risk progression iteratively; at each step, it took as input a single risk assessment and output a single risk assessment for the next time step. We implemented our RNN as a gated recurrent unit 38 with an additive hazard layer 12 and trained the model to minimize the Kullback-Leibler divergence between predicted risk assessments and the risk assessments observed in the MGH training set.
We experimented with different learning rates, hidden sizes, numbers of layers and dropout rates, and we chose the model that obtained the lowest Kullback-Leibler divergence on the MGH validation set. Our final risk progression RNN had two layers, a hidden dimension of 100 and a dropout of 0.25, and it was trained for 30 epochs with a learning rate of 1 × 10 −3 using the Adam optimizer. The outputs of our risk progression model for Tempo-Mirai are visualized in Extended Data Fig. 1. Given a trained risk progression model, we can estimate unobserved risk assessments autoregressively: at each time step, the model takes as input the previous risk assessment and the prior hidden state (using the previous predicted assessment when the real one is not available) and predicts the risk assessment at the next time step. We validated our risk progression network on the MGH, Emory and Karolinska test sets (Supplementary Table 6) and note that our RNN outperformed the Static Risk baseline on all datasets. Because we collected only one exam for each patient in the CGMH test set, we could not validate the risk progression network on that test set. Information regarding the implementation of each risk progression model and the hyperparameter search is available in our code release.
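The autoregressive rollout described above can be sketched as follows, with the GRU replaced by an arbitrary one-step model for illustration (hidden-state bookkeeping is omitted); the Static Risk baseline is simply the identity step:

```python
def rollout_risk(observed, n_future, step_model):
    """Autoregressively extend a sequence of risk assessments.
    `observed` holds the real assessments; `step_model` maps the
    previous assessment to the next one. When a real assessment is
    unavailable, the previous prediction is fed back in."""
    history = list(observed)
    for _ in range(n_future):
        history.append(step_model(history[-1]))  # feed back last prediction
    return history[len(observed):]

static_risk = lambda prev: prev  # the Static Risk baseline
rollout_risk([0.02, 0.03], n_future=3, step_model=static_risk)
# the Static Risk baseline predicts [0.03, 0.03, 0.03]
```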
Personalized screening models. We implemented our personalized screening policy as a multilayer perceptron, which took as input a risk assessment and a weighting between rewards and predicted the Q-value of each action (i.e., follow-up recommendation) for each reward. This network was trained using Envelope Q-Learning 17 . Following recent work in deep RL 39,40 , we used an experience replay buffer to reduce correlation between training batches and a target Q-network 39 to stabilize training updates.
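At inference time, a multi-objective Q-network of this kind picks the follow-up whose vector of per-reward Q-values, scalarized by the chosen reward weighting, is largest. A minimal numpy sketch of that selection step (the Q-values are assumed to come from a trained network, which is not shown; the function name is ours):

```python
import numpy as np

# Follow-up recommendations in years: 6-month increments from the
# minimum action (6 months) to the maximum action (3 years).
ACTIONS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

def select_followup(q_values, reward_weights):
    """q_values: (n_actions, n_rewards) array of per-action, per-reward
    Q-value estimates; reward_weights: (n_rewards,) preference vector,
    e.g. over screening cost and early detection. Returns the follow-up
    maximizing the scalarized value w . Q(s, a)."""
    scalarized = q_values @ reward_weights
    return ACTIONS[int(np.argmax(scalarized))]
```

Because the weighting is an input to the network, the same trained policy can be queried under different trade-offs between early detection and screening cost without retraining.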
We experimented with different numbers of layers, hidden dimension sizes, learning rates, dropouts, exploration epsilons, target network reset rates and weight decay rates. We note that we conducted the same grid searches for Tempo-Mirai and Tempo-TCv8 and chose each model to maximize the average reward on the MGH validation set. Our final Tempo-Mirai model had six layers, each with 256 hidden units, followed by rectified linear unit (ReLU) nonlinearities. It was trained for 30 epochs using a learning rate of 1 × 10 −3 , a dropout of 0.25 and a weight decay of 0.01 using the Adam optimizer, and the target network was reset every 1,000 batches. Our final Tempo-TCv8 model had four layers, each with 256 hidden units, followed by ReLU nonlinearities. It was trained for 30 epochs using a learning rate of 1 × 10 −3 , a dropout of 0.25 and a weight decay of 0 using the Adam optimizer, and the target network was reset every 1,000 batches. Information regarding the implementation of each risk policy, the training code and our hyperparameter searches is available in our code release. For both Tempo-Mirai and Tempo-TCv8, we chose a reward weighting to approximately match the screening cost of annual screening on the MGH development set and used this reward weighting across all test sets. Tempo-Mirai used a reward weight of 0.5 and 3.0 for screening cost and early detection, respectively. Tempo-TCv8 used a reward weight of 0.77 and 3.0 for screening cost and early detection, respectively.
Supervised learning baseline. We implemented our supervised learning baselines, Supervised-Mirai and Supervised-TCv8, as multilayer perceptrons, which took as input a risk assessment and predicted a probability distribution over follow-up recommendations. This network was trained to minimize the cross-entropy loss between its actions and the optimal sequence of actions, where the optimal actions for each patient were computed to maximize our reward metrics. For patients who did not develop cancer within the period of the maximum follow-up recommendation, the optimal action was the maximum follow-up recommendation of 3 years. For patients who developed cancer, the optimal action was to recommend a screening follow-up in the time step following the last negative mammogram. Unlike Tempo-Mirai, which is trained with RL to maximize trajectory-level rewards, Supervised-Mirai is trained to maximize the likelihood of the optimal action sequence and therefore does not benefit from observing how its own errors compound across the trajectory at training time.
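The optimal-action labels used to supervise this baseline follow directly from each trajectory. A sketch in the 6-month-step convention of our simulation, treating the current exam as the last negative mammogram for labeling purposes (the function name is ours):

```python
MAX_ACTION = 6  # maximum follow-up recommendation: 3 years in 6-month steps

def optimal_action(current_step, cancer_step=None):
    """Supervision label for the baseline: patients who stay cancer-free
    over the maximum follow-up horizon get the maximum recommendation;
    patients who develop cancer within that horizon should be re-screened
    in the time step following the last negative mammogram."""
    if cancer_step is None or cancer_step - current_step > MAX_ACTION:
        return MAX_ACTION
    return 1  # one 6-month step after the last negative mammogram
```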
For each supervised model, we experimented with different numbers of layers, hidden dimension sizes, learning rates, dropouts and weight decays. To enable fair comparison against Tempo models, we searched the same space of hyperparameters and selected those that achieved the best average reward on the MGH validation set. Our final Supervised-Mirai model had eight layers, each with 512 hidden units, followed by ReLU nonlinearities. It was trained for 30 epochs using a learning rate of 1 × 10 −3 , a dropout of 0.25, a weight decay of 0.1 and the Adam optimizer. Our final Supervised-TCv8 model also had eight layers, each with 512 hidden units, followed by ReLU nonlinearities. It was trained for 30 epochs using a learning rate of 1 × 10 −4 , a dropout of 0.25, a weight decay of 0.1 and the Adam optimizer. Information regarding the implementation of each risk policy, the training code and our hyperparameter searches is available in our code release.

Statistical analysis. To calculate CIs while accounting for patients with multiple simulations, we used a clustered bootstrap approach with 5,000 samples. To assess the significance of the difference between two metrics, we used a two-tailed t test with a predefined significance threshold of P < 0.05.
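The clustered bootstrap resamples whole patients (clusters) rather than individual simulations, so the correlated starting points of one patient always stay together. A minimal numpy sketch under that assumption (function name is ours):

```python
import numpy as np

def clustered_bootstrap_ci(values_by_patient, n_boot=5000, alpha=0.05, seed=0):
    """values_by_patient: list of per-patient lists of simulation-level
    metric values. Resample patients with replacement, recompute the
    mean over all simulations of the sampled patients, and return a
    percentile confidence interval for the mean."""
    rng = np.random.default_rng(seed)
    n = len(values_by_patient)
    means = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.integers(0, n, size=n)  # patient indices, with replacement
        pooled = np.concatenate([np.asarray(values_by_patient[i]) for i in sample])
        means[b] = pooled.mean()
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```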
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
All datasets were used under license to the respective hospital system for the current study and are not publicly available. To access the MGH dataset, investigators should contact C.L. to apply for an IRB-approved research collaboration and obtain an appropriate data use agreement. To access the Karolinska dataset, investigators should contact F.S. to apply for an approved research collaboration and sign a data use agreement. To access the CGMH dataset, investigators should contact G.L. to apply for an IRB-approved research collaboration. To access the Emory dataset, investigators should contact H.T. to apply for an approved collaboration.

Extended Data Fig. 4 | Our early detection metric assumed that a cancer could be caught up to 18 months before diagnosis. To test the robustness of our results to this assumption, we also evaluated our screening policies when changing this assumption to 6, 12 and 24 months. For each policy, we report its screening efficiency, defined as its early detection benefit in months divided by the number of mammograms it recommends per year. The asterisk denotes the policy with the highest screening efficiency.