Malaria Early Detection in a Declining Transmission Setting in the Amhara Region of Ethiopia


Background: Despite remarkable progress in reducing malaria incidence, the disease remains a public health threat to a significant portion of the world's population. Surveillance, combined with early detection algorithms, can be an effective intervention strategy that informs timely public health responses to potential outbreaks. Our main objective was to compare the potential of selected event detection methods for detecting malaria outbreaks.

Methods: We used historical surveillance data with weekly counts of confirmed Plasmodium falciparum (including mixed) cases from the Amhara region of Ethiopia, where a resurgence of malaria occurred in 2019 following several years of declining cases. We evaluated three early detection methods against the 2019 malaria events: 1) the Centers for Disease Control and Prevention (CDC) Early Aberration Reporting System (EARS), 2) methods based on weekly statistical thresholds, including the WHO and Cullen methods, and 3) the Farrington algorithms.

Results: All of the methods and parameters evaluated performed better than a naïve random alarm generator. We also found distinct trade-offs between the percent of events detected and the percent of true positive alarms. CDC EARS and weekly statistical threshold methods had high event sensitivities (80%-100% CDC; 57%-100% weekly statistical) and low to moderate alarm specificities (25%-40% CDC; 16%-61% weekly statistical). Farrington variants had a wide range of scores (20%-100% sensitivities; 16%-100% specificities) and could achieve various balances between sensitivity and specificity.

Conclusions: Of the methods tested, we found that the Farrington improved method was most effective at maximizing both the percent of events detected and true positive alarms for our dataset (83% sensitivity, 51% specificity).
This method uses statistical models to establish thresholds while controlling for seasonality and multi-year trends, and we suggest that it and other model-based approaches should be considered more broadly for malaria early detection.

Malaria surveillance as a core intervention strategy is one of the pillars of the Global Technical Strategy for malaria [3,4]. Information from surveillance systems can be used to optimize interventions to interrupt disease transmission and ultimately accelerate elimination. Timely detection allows officials to intensify control measures as needed to manage epidemics [4][5][6][7][8][9][10]. Many early-detection algorithms exist, and there is a need to quantitatively evaluate the performance of these algorithms for different diseases and locations [11][12][13][14][15][16][17]. The central idea behind outbreak detection is to identify when the case volume exceeds a baseline threshold, and to use this information in a prospective (not retrospective) manner to identify epidemics in their early stages [4,15]. Various algorithms are used to calculate these thresholds, with different assumptions about the pattern of disease transmission, including the speed of outbreak development, seasonality, and trends.
Early detection algorithms that have been proposed for malaria include the Cullen, WHO quartile, and cumulative sum (CUSUM) methods [4,5,10,12,17,18]. These techniques define thresholds based on statistical summaries of historical data. WHO recommends that the Cullen and quartile methods use at least five years of past data to generate reliable threshold estimates [5,12]. The Cullen method calculates the mean for the current time period (e.g. week or month of year) over the past five years, excluding values from any past outbreak periods. Case volumes over the mean plus two standard deviations are considered outbreaks [5,12,19]. The WHO quartile method defines an outbreak by calculating quartile values for the current seasonal time period over the past five years. An outbreak is identified when cases exceed the upper third quartile. This approach may be sensitive to slight increases in case volume during time periods when there have never been spikes or outbreaks of cases, but it is less affected by abnormal years than the Cullen method [5,12]. Several variations of these statistical methods have been evaluated using data from selected health centers in Ethiopia, and weekly percentile measures were found to perform as well as methods with more complex calculations [17]. There are many variations of the cumulative sum (CUSUM) approach, a type of control chart that tracks cumulative differences between observed values and expected values and indicates an outbreak when these cumulative differences exceed a threshold [5,12,[20][21][22][23]].
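The two threshold rules can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and variable names are ours, and the Cullen sketch assumes outbreak years have already been removed from the history.

```python
import statistics

def cullen_threshold(history):
    """Cullen rule: mean of the same week of year over past years
    (outbreak years excluded) plus two standard deviations."""
    return statistics.mean(history) + 2 * statistics.stdev(history)

def who_quartile_threshold(history):
    """WHO quartile rule: the upper (third) quartile of the same
    week of year over past years."""
    return statistics.quantiles(history, n=4)[2]

# Hypothetical counts for one week of year over five past years:
history = [10, 12, 14, 20, 30]
current_week_cases = 35

# An outbreak is flagged when the current count exceeds the threshold.
print(current_week_cases > who_quartile_threshold(history))  # True
print(current_week_cases > cullen_threshold(history))        # True
```

Note that with the Cullen rule a single abnormal year inflates both the mean and the standard deviation, which is why excluding past outbreak periods matters; the quartile rule is less affected by such years.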
In many situations, sufficient historical data may not be available to implement these approaches. Even when historical data are available, older data may be less applicable or relevant to the current situation of malaria transmission [10]. In places undergoing intensive malaria intervention efforts, incidence in recent years may be significantly reduced compared to only a few years in the past, or may exhibit different seasonal patterns [24]. Thresholds based on previous years may then fail to capture the new patterns and intensities of current outbreaks. However, surveillance and outbreak detection are still crucial in areas of low or unstable transmission: immunity to malaria declines along with transmission intensity, leaving the population highly vulnerable to malaria outbreaks [5].
Other early detection algorithms use different approaches for the calculation of the thresholds and may be more applicable in regions undergoing rapid change in malaria transmission patterns. The CDC Early Aberration Reporting System (EARS) has been used as a drop-in technique for syndromic surveillance after major incidents that could precipitate disease outbreaks [16,22,[25][26][27]]. This suite of methods is similar to quality control charts, relies on only very recent data to create a baseline, and is therefore useful when long-term data are not available or not relevant to the current situation. The EARS system is actively used by U.S. state and local public health offices [25]. Syndromic surveillance using school-based absenteeism has been investigated for potentially identifying localized malaria outbreaks in Ethiopia [28]. A family of methods developed by Farrington and, later, Noufaily has been implemented at several European infectious disease control centers [29,30]. Farrington methods are based on quasi-Poisson regression and can take advantage of historical information while accounting for seasonality, long-term trends, and previous outbreaks [30][31][32].
While previous research has compared various detection algorithms, many of these studies have used simulated datasets [e.g. 16,22,33], and it is unclear to what extent these are representative of real-world outbreaks, especially in the context of public health interventions. Therefore, in this paper, we used a 7.5-year weekly surveillance dataset of malaria cases to test the suitability of the EARS, WHO quartile (and other statistical threshold), and Farrington methods for detecting malaria outbreaks. To develop a baseline dataset of malaria outbreaks, we applied a novel method to identify malaria events of interest to use as retrospective test cases. This research was conducted in the Amhara region of Ethiopia, which has been the subject of intense malaria interventions and has experienced a general decline in malaria cases [34]. In 2019 there was a resurgence of malaria cases in the region, and we used this year as the basis for testing the outbreak detection algorithms. Our main objective was to compare the sensitivity and true positive rates of these event detection methods when applied to malaria outbreak detection.

Study Area and Data
The Amhara region is located in northwest Ethiopia [ Figure 1]. Most of the terrain is mountainous, with lowlands along the northwestern edge of the region. Rainfall is highly seasonal, with the heaviest rains from June through September. There are two major seasons for malaria transmission: the main transmission season after the end of the rainy season, between September and December, and a secondary peak at the beginning of the rainy season, from May through August [35,36]. The population of the Amhara region is over 21 million; people primarily live in rural areas and practice subsistence farming [37]. There is widespread transmission of Plasmodium falciparum and P. vivax malaria, with a P. falciparum to P. vivax ratio of 1.2, as observed in blood film tests from a cross-sectional survey [38]. A national malaria control program targets the Ethiopian population at risk, including the Amhara region.
The program comprises four main interventions: distribution of free long-lasting insecticidal nets (LLINs), indoor residual spraying (IRS), rapid diagnostic tests (RDTs) available at all health facilities, and treatment with artemisinin combination therapy [37,38]. Areas with low transmission rates due to declining malaria incidence and unstable transmission patterns are being targeted for elimination [37,39,40]. Administratively, the region is divided into twelve zones and three administered towns, which are further divided into between four and 24 woredas, or districts [ Figure 1]. Woredas are subdivided into kebeles (villages). In the Amhara region, there are 162 woredas (containing 3543 kebeles), and 47 of the most malaria-prone woredas are included in the Epidemic Prognosis Incorporating Disease and Environmental Monitoring for Integrated Assessment (EPIDEMIA) project [37]. The health care system is organized into three tiers: primary, secondary, and tertiary levels [41]. The primary level in rural areas includes health posts, health centers, and a primary hospital; primary health care units (PHCUs) contain five health posts (satellite facilities located in kebeles) and one referral health center. Secondary and tertiary levels are referral general and specialized hospitals, respectively.
Public health surveillance data on patients seeking care at health posts or health centers are collected and aggregated by the Amhara Regional State Health Bureau (ARHB). Among the data collected are the number of malaria cases confirmed by rapid-diagnostic tests (RDT) or blood film screening, and these counts are grouped as Plasmodium falciparum (including mixed infections) and P. vivax (only) malaria. These data are summarized by the week of the year (based on the ISO 8601 standard used by WHO) and reported to the woreda health office. This office aggregates a complete woreda report before sending the summarized data to the zonal health office, which compiles all the woreda reports within the zone and sends the reports to the regional ARHB office, where they were uploaded into the EPIDEMIA system [37].
This study analyzed data from the 47 EPIDEMIA pilot woredas, which included weekly case counts of P. falciparum (or mixed) and P. vivax malaria from ISO week 28 of 2012 through week 52 of 2019. These woredas saw great public health successes in reducing the malaria burden from 2012 through 2018, but experienced a resurgence in 2019 [ Figure 2].
Between 2013 and 2018, there was a steady decrease from 349,523 P. falciparum or mixed malaria cases to 104,947 cases, a 70% reduction. However, in 2019 there were 210,194 cases, a volume that had not been seen since 2015 [ Table 1]. We focused our analysis on P. falciparum (including mixed infections with P. vivax), which is the predominant parasite species, is of greatest concern from a public health standpoint, and had the strongest resurgence in 2019.

Event Identification

Prior to evaluating event detection algorithms, specific events of interest must be defined for each woreda to serve as the baseline testing dataset. Here, for research purposes, we developed an objective approach named trend weighted seasonal thresholds (TWST) for identifying events as anomalous increases in the number of reported malaria cases. The approach was designed to identify events retrospectively in the context of seasonal patterns and decreasing long-term trends in disease transmission, while allowing for variation in patterns across woredas as well as slight time shifts in seasonal peaks between years.
The TWST approach identified two thresholds, weekly and yearly, for each woreda. This combination of weekly and yearly thresholds has been used in other work for defining malaria epidemics [13]. In preparation, the raw weekly time series were smoothed using a centered 5-week triangular moving average. The yearly threshold was calculated as the harmonic mean of the entire year plus the standard deviation multiplied by a factor (1.5 for P. falciparum and mixed species).
The weekly threshold was calculated in a three-step process. In the first step, the raw threshold value for a given week was the harmonic mean of that week of the year, over the five years of data, plus the standard deviation multiplied by a factor (1 for P. falciparum and mixed). In the second step, the raw thresholds were optionally trend weighted based on the yearly harmonic mean. If there was a declining trend (from the previous year), then the weekly threshold values were weighted proportionally to the difference between the current year harmonic mean and the highest (max) harmonic mean using a weighting factor (0.5 for P. falciparum and mixed): (max - weighting factor * (max - current)) / max. If there was no declining trend, the weekly thresholds were weighted based on the previous year mean instead of the current year. In the third and final step, allowances were made to prevent minor time shifts in increasing and decreasing case counts between years from triggering alerts [28,42], by inflating weekly thresholds that were not near in time to peaks. Peak areas were identified using a percentile cut-off per year (85% for P. falciparum and mixed), plus short stretches (up to 8 weeks) between these high rates. The inflation was based on the average of the year and week harmonic means multiplied by an expansion factor (1.2 for P. falciparum and mixed), which was then added to the trend weighted threshold of the previous step to arrive at the final TWST weekly threshold. Anomalies were identified when cases exceeded both the yearly and weekly thresholds, and events were identified when anomalies lasted for four or more consecutive weeks. Events that were separated by only one or two weeks were merged into one event.
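The yearly threshold and the trend-weighting step above can be sketched as follows. This is a simplified illustration with names of our own choosing, not the full TWST procedure (the smoothing and peak-proximity inflation steps are omitted); note that the harmonic mean is zero whenever any week has zero cases.

```python
import statistics

def yearly_threshold(weekly_cases, sd_multiplier=1.5):
    """Yearly threshold: harmonic mean of the year's weekly counts plus
    the standard deviation times a multiplier (1.5 for P. falciparum)."""
    hm = statistics.harmonic_mean(weekly_cases)
    return hm + sd_multiplier * statistics.stdev(weekly_cases)

def trend_weight(current_year_hm, max_year_hm, weighting_factor=0.5):
    """Step 2 of the weekly threshold: under a declining trend, scale
    the raw weekly thresholds by (max - w * (max - current)) / max."""
    return (max_year_hm
            - weighting_factor * (max_year_hm - current_year_hm)) / max_year_hm

# Hypothetical harmonic means: the highest historical year vs the current year.
w = trend_weight(current_year_hm=60.0, max_year_hm=100.0)
print(w)  # 0.8: weekly thresholds are scaled down by 20%
```

The weight approaches 1 when the current year resembles the historical maximum and shrinks as incidence declines, which is how the thresholds adapt downward in woredas with falling case counts.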

Detection Algorithms
The previous step, event identification, was based on a retrospective analysis with full knowledge of the entire 7.5-year time span, yielding specific spikes or abnormal increases in malaria case counts to be used as a baseline testing set for the detection algorithms. In contrast, event detection algorithms were forward-looking, running in step with the data and using only values up to a given week, which mimics real-time surveillance efforts to detect outbreaks as early as possible and mount timely public health responses. For this study, three types of event detection algorithms were used: 1) CDC EARS, 2) weekly statistical summaries that included the commonly used WHO and Cullen methods, and 3) Farrington methods [4,26,31,32].
For EARS, the three variations C1-Mild, C2-Medium, and C3-High were tested using the default alpha values (0.001 for C1 and C2, 0.025 for C3) with four different baseline periods: the default 7 periods (weeks, here), plus 14, 28, and 56 weeks.
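The C1 variant can be sketched as a simple control-chart rule. This is our own simplified rendering, not the surveillance-package implementation: C2 additionally lags the baseline window by two periods and C3 accumulates recent C2 residuals, both omitted here.

```python
import statistics

def ears_c1_alarm(counts, t, alpha=0.001, baseline=7):
    """Alarm at week t if the count exceeds the mean of the previous
    `baseline` weeks by more than z_(1-alpha) standard deviations."""
    window = counts[t - baseline:t]
    mu = statistics.mean(window)
    sd = max(statistics.stdev(window), 1e-6)  # guard against a flat baseline
    z = statistics.NormalDist().inv_cdf(1 - alpha)
    return counts[t] > mu + z * sd

# Hypothetical weekly counts: a stable baseline followed by a spike.
weeks = [10, 12, 11, 9, 10, 11, 12, 40]
print(ears_c1_alarm(weeks, t=7))  # True: 40 far exceeds the recent baseline
```

Because the baseline is only the immediately preceding weeks, the rule needs no multi-year history, but, as discussed later, it also cannot distinguish a seasonal rise from an outbreak.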
For the weekly statistical summaries, thresholds were calculated from the week of the year median, mean plus two standard deviations (without removing past outbreaks), and 75th and 85th percentiles for three historical time periods: 5 years, 6 years, or weekly maximum of 6 or 7 years depending on the week of year.
The Farrington algorithm offers parameters to control various model settings, such as the number of time points to include in the historical window through a specified number of years, the inclusion of an optional long-term trend, the number of periods used to account for seasonality, and the number of weeks to exclude at the beginning of the evaluation period (for events that may already be in progress). For the Farrington algorithm, we ran 204 variations in a parameter sensitivity analysis. There were four basic settings: 1) the original method with all package defaults, 2) the original method with four periods for seasonality, 3) the improved method with all package defaults, and 4) the improved method with four periods for seasonality. The other 200 runs utilized the population offset option and an exhaustive set of combinations of selected parameters and values: window half size (3, 5), years of historical data to include (3, 4, 5, 6, or maximum adaptive), long-term trend inclusion (trend or no trend), seasonality periods (1, 2, 4, 8, 12), and past weeks to exclude at the beginning for spin-up time (26, or set equal to window half size). All parameter combinations can be found in the supplemental materials [Additional File 2, Supplemental Tables S1 and S2]. All methods were implemented in R, and the surveillance package was used for the EARS and Farrington methods [26,43].
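The 200 population-offset runs correspond to an exhaustive crossing of the parameter values listed above, which can be enumerated as follows (the labels are ours, not the surveillance package's argument names):

```python
from itertools import product

# Exhaustive crossing of the five parameter settings described in the text.
grid = list(product(
    [3, 5],                 # window half size (w)
    [3, 4, 5, 6, "max"],    # years of historical data (b)
    [True, False],          # include long-term trend
    [1, 2, 4, 8, 12],       # periods for seasonality
    [26, "half-window"],    # weeks excluded at the start for spin-up
))
print(len(grid))  # 2 * 5 * 2 * 5 * 2 = 200 variants
```

Adding the four basic runs (original and improved methods, each with defaults or four seasonal periods) gives the 204 total variations.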

Skill Comparison Test
As a skill comparison test for the real detection algorithms, six sets of random alarms were also generated. Any algorithm that produces alarms will, by chance, occasionally trigger during an event, and the more alarms triggered, the more likely events will seem to be detected. This skill test checked that the event detection methods performed better than a null model and provided context for the comparison between methods. The random algorithm produced alarms between one and five weeks long, with a minimum buffer of four weeks between runs. The probability per week of an alarm was varied to create different total numbers of alarms.
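A sketch of such a null alarm generator, following the description above (our own construction; the paper's exact implementation may differ):

```python
import random

def random_alarms(n_weeks, p_start, seed=0):
    """Generate random alarm runs of 1-5 weeks, with at least a
    4-week quiet buffer between consecutive runs."""
    rng = random.Random(seed)
    alarms = [False] * n_weeks
    t = 0
    while t < n_weeks:
        if rng.random() < p_start:
            run = rng.randint(1, 5)  # run length of 1 to 5 weeks
            for k in range(run):
                if t + k < n_weeks:
                    alarms[t + k] = True
            t += run + 4  # skip ahead to enforce the quiet buffer
        else:
            t += 1
    return alarms

# Roughly 7.5 years of weeks; varying p_start varies the total alarm count.
series = random_alarms(390, p_start=0.05)
```

Varying `p_start` reproduces the paper's design of null variants with different total numbers of alarms.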

Metrics
Metrics of detection effectiveness were event based, because using events as the unit of analysis is relevant to how these algorithms would be used in public health surveillance to find outbreaks before or as they are occurring. Two main indicators were used: the percent of events that were caught, and the percent of alarms that were associated with events. An alarm and event were considered associated if the alarm was triggered during, or up to two weeks prior to, the event. Percent of events caught was an indicator of how well the algorithm detected events, with a higher percentage meaning that fewer events were missed. Percent of alarms associated with events was the true positive rate of the alarms (the percentage of alarms that overlapped with, or occurred up to two weeks prior to, an event). A high percentage on this metric indicated that the algorithm was more likely to trigger alarms when an event was actually happening and less likely to generate false alarms. Ideally, event detection algorithms would trigger alarms for all events (100% events detected) and never when there was not an event (100% alarms true positive). In addition to events caught, we also considered whether the alarm for an event was timely, which was defined as an alarm between two weeks prior to and including the start week of the event.
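These event-based metrics can be computed as in the sketch below, where events are (start, end) week ranges and alarms are week indices (the function and key names are ours, for illustration only):

```python
def evaluate(events, alarm_weeks, lead=2):
    """Score alarms against events. An alarm is associated with an event
    if it fires during the event or up to `lead` weeks before its start;
    it is timely if it fires between `lead` weeks before and the start."""
    caught = sum(
        any(s - lead <= a <= e for a in alarm_weeks) for s, e in events)
    timely = sum(
        any(s - lead <= a <= s for a in alarm_weeks) for s, e in events)
    true_pos = sum(
        any(s - lead <= a <= e for s, e in events) for a in alarm_weeks)
    return {
        "events_caught_pct": 100 * caught / len(events),
        "timely_pct": 100 * timely / len(events),
        "alarm_true_positive_pct": 100 * true_pos / len(alarm_weeks),
    }

# Hypothetical toy example: two events and four alarms.
scores = evaluate(events=[(10, 15), (30, 34)], alarm_weeks=[8, 12, 25, 31])
print(scores)  # 100% caught, 50% timely, 75% of alarms true positive
```

In the toy example the alarm at week 8 is timely for the first event, the alarm at week 31 catches the second event but not in a timely fashion, and the alarm at week 25 is a false positive.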

Identified Events
The TWST algorithm, developed to identify time periods of excess malaria case counts that were considered of potential public health interest, found a total of 255 events for P. falciparum and mixed species. The number of events declined from 2013 to 2018; however, in 2019 the number of events greatly increased. Also during 2019, the average number of cases in events was the highest since 2012 [ Table 2, all events shown over time in Additional File 1, Supplemental Fig. 1]. The TWST algorithm was specifically designed to account for seasonality, and not to identify every seasonal peak as an event, in the context of overall declining trends in malaria transmission. However, different woredas in the region exhibited various patterns in incidence, including decreasing trends, increases in the middle or end of the time period, clear single seasonal peaks, dual seasonal peaks, and various combinations of these patterns. The TWST algorithm was flexible enough to appropriately identify events across these patterns [ Figure 3]. Mecha and Baso Liben both had decreasing incidence and a resurgence in 2019, but Baso Liben had maintained seasonal peaks while Mecha did not. Seasonal patterns also varied from clear single or dual peaks to more jagged patterns, such as in Jawi. Observed incidence is marked in light grey and the smoothed incidence in black.
Week and year thresholds from the TWST algorithm are shown as dot-dashed lines in green and blue, respectively. Any identified events are marked with red circles at the appropriate weeks at the top of the graphs.
The algorithm was able to identify peaks that would have been overshadowed by peaks in much earlier years but are important relative to more recent patterns. For example, the woreda Abargelie had high peaks in 2013 and, to a lesser extent, in 2014. During 2015, however, the season was very quiet, with no large peaks. In the fall of 2016 a moderate seasonal peak returned, with larger fall peaks in 2017 and 2018; but if the thresholds had not more strongly considered the 2015 season (trend weighting), the 2017 or 2018 peaks would not have been identified as events [ Figure 4]. The time-shift allowance in TWST was also needed to prevent notifications of events where the peak simply declined more slowly than in other years [ Figure 4].

As expected, random alerts performed poorly and had the lowest percentages of true positive alarms across the variants (Table 3, Fig. 5). Variants with higher probabilities created more alarms and saw higher event caught scores: the more alarms present, the more likely they are to randomly overlap with an event.
The CDC EARS methods generated large numbers of alarms (98 to 152), with an associated high percentage of events caught (80-100%) and variable events caught timely (43-87%), but also had low to midrange percentages (25-40%) of true positive alarms (selected items in Table 3, full listing in Supplemental materials). Of the weekly statistical summaries, the Cullen mean plus two standard deviations variant produced the highest true positive rates (51 to 61%, depending on the number of years of historical data included), but the lowest event caught scores (57-80%) and lowest events caught timely scores (13-37%). The WHO 75th percentile with 5 years of data, a commonly used algorithm, produced 200 alarms with a 97% event caught rate (93% timely) but only a 26% true positive rate ( Table 3). The 85th percentile variants produced somewhat fewer alarms with higher true positive rates, and with similar or slightly reduced event caught and timely percentages.
Examining the Farrington results (orange hollow circles in Fig. 5), there was a trade-off between events caught and true positives. The Farrington variants were based on a sensitivity analysis of five parameters: window half size (w), years of historical data included (b), number of periods for seasonality, long-term trend inclusion, and the exclusion period for spin-up time. Not all parameters influenced the outcomes; window half size and the exclusion period did not greatly affect the results, although the 26-week exclusion period seemed slightly preferable. The number of historical years of data, the number of periods for seasonality, and trend inclusion had the greatest impacts on the outcome metrics.
Of the 200 variants with population offset, the event caught rate was highest when the trend was included and there were 4 to 12 periods for seasonality (Fig. 6). The event caught rate fell as more years of historical data were included, especially in variants that did not include a trend.

Figure 6. Plot of event caught percentages from the Farrington event detection variants. Scores were higher when a long-term trend was included (filled circles) than when no trend was included (hollow triangles). The event caught rate fell as more years of historical data were included (x-axis), especially in variants that did not include a trend. Within each trend set, scores were higher with 4 to 12 periods for seasonality (blue to green colors), and lowest with one period, i.e. no seasonality (dark purple). The number of alarms generated is indicated by the size of the marker and decreases as more years of historical data are included.
Of the 200 variants that included population offsets, the true positive percentages were highest when no trend was included and two to four periods for seasonality were included, and increased as more years of historical data were included (Fig. 7). The number of alarms generated decreased with additional years of historical data (size of the marker in Figs. 6, 7).

Figure 7. Plot of true positive alarm percentages from the Farrington event detection variants. Scores were higher when the long-term trend was not included (hollow triangles) than in variants where the trend was included (filled circles). More historical data (x-axis) increased the alarm true positive score and decreased the total number of alarms generated (size of marker). Scores were highest with two to four periods for seasonality (blues), and lowest with no seasonality (one period, dark purple).
The Farrington original and improved methods with default values (and with seasonality) and no population offset were compared against the 200 parameter sensitivity runs using the improved method with population offsets (original A1 and A2, base improved B1 and B2 in Table 3 and Table 4). As seen in Fig. 6 and Fig. 7, there were large trade-offs in the 200-variant set between events caught and true positive rates. Some Farrington runs reached 100% events caught, but the highest true positive rate in that set was only 26% (Farrington C1 in Table 3). Other variants reached a 100% alarm true positive rate, but the highest event caught score in that set was 40% (Farrington C2 in Table 3). Taking a balanced approach, one variant with reasonable trade-offs scored 73% events caught and 74% alarm true positive, but only 37% events caught timely (Farrington C3 in Table 3 and Table 4). Another, our selected balanced variant, had 83% events caught and 53% events caught timely, with 51% alarms true positive (Farrington C4 in Table 3 and Table 4).

Table 3. Results for selected event detection algorithms for P. falciparum and mixed malaria events in the 2019 evaluation time period. The percent of events caught, percent of events caught timely, percent of true positive alarms, and the total number of alarms generated are reported. Farrington parameter details can be found in Table 4.

Discussion
The TWST algorithm that we developed succeeded in identifying malaria transmission events in the presence of changing expectations due to decreasing incidence trends. Using thresholds defined from time periods with high disease transmission may mask important events in less active years; events which would be considered abnormal if compared to more recent activity. This approach is essential in areas like the Amhara region, where malaria incidence is declining in many woredas because of public health interventions. With regard to malaria surveillance, the WHO specifically notes that the normal or expected patterns of malaria, from which outbreak thresholds are derived, do change over time in areas that see sharp decreases in incidence after intensive control efforts [4]. As woredas approach elimination, the sizes of malaria events become smaller, but it will still be necessary to detect and respond quickly to these outbreaks. In the context of resurgence, having dynamic thresholds that adapt to changing conditions is crucial for identifying malaria peaks that are smaller than historical outbreaks but still significantly larger than malaria case numbers in recent years.
The operational activities of detecting and responding to outbreaks are enabled by and integral to malaria surveillance systems. Surveillance as an intervention is the third pillar of the WHO global technical strategy for malaria elimination, with differing key aspects as disease control transitions to pre-elimination, elimination, and prevention of reintroduction phases [7][8][9][44][45][46][47][48][49]. More recent frameworks focus on transitions and the evolving approaches needed in settings with changing epidemiological patterns [7,44]. The Amhara region, as mentioned previously, is in a transition period marked by declining and changing trends in malaria transmission due to disease interventions, plus a resurgence in 2019.

In the event detection comparison, the randomly generated alarms produced the worst results, indicating that all the algorithms we tested performed better than the naïve assumption of random outbreaks. CDC EARS is designed to be used even when historical data are lacking, as it creates thresholds from only recent data (the 7 most recent time steps for C1 and C2, reaching back up to 11 time steps for C3, with a baseline length of 7). A drawback is that this approach cannot effectively account for seasonality and tends to trigger alarms at every seasonal peak.
However, the results indicate that the EARS algorithms have a high sensitivity to increases in malaria cases. Thresholds based on weekly statistical summaries also produced high event caught scores and moderately higher alarm true positive rates as compared to CDC EARS methods. Both EARS and WHO methods tended to produce a high total number of alarms generated.
The suite of Farrington methods, especially the improved versions, allows adjustments for long-term trends and seasonal patterns. However, as with the weekly statistical summaries, this method requires several years of historical data, which may not always be available. As expected with the highly seasonal patterns we observed in the Amhara region, including enough seasonal periods was important; accuracy suffered when no or too few periods were included. A substantial trade-off was found with the inclusion of the long-term trend between the percent of events caught and the percentage of true positive alarms.
Including the long-term trend as implemented in the Farrington algorithm increased the events caught rate; however, it also decreased the true positive alarm rate. In the context of declining malaria incidence, setting thresholds based on historical data tends to result in a high threshold that cannot detect smaller, more recent events. Adjusting the threshold using the recent trend of declining malaria cases therefore increases the sensitivity of outbreak detection, but can result in large numbers of false alarms if the resulting threshold is too low. These results show that accounting for annual cycles and inter-annual trends is essential for calibrating malaria early detection parameters in settings characterized by seasonal transmission and declining malaria trends caused by public health interventions.
One of our motivations for comparing early detection algorithms was to guide the selection of methods for a malaria early warning system in the Amhara region as part of the EPIDEMIA project [37]. Following discussions among project partners and in consideration of the public health applications of the early detection results, we opted to give the true positive metric slightly more importance in the evaluation of algorithm performance. We did not want to induce alert fatigue with an algorithm that had lower specificity, and we were cognizant that false alarms could cause ineffective and costly unnecessary mobilizations of resources. However, we balanced this desire to avoid false positives with the need to capture important events accurately and maintain credibility. In this analysis, we quantified the trade-off between events caught and true positive scores by testing a range of methods and parameterizations, and we found that variations of the Farrington method were usually best for maximizing both events caught and true positives.
Depending on the intended public health use of the event detection alarms, other implementations may choose to prioritize sensitivity over specificity if identifying all potential malaria outbreaks is more important than minimizing false positives. Methods and variants with high sensitivity could be useful for generating a 'watch list' of places that may be seeing the beginning of an outbreak or a spike in cases. However, due to the high false alarm rate (low true positive percentage), warnings based on algorithm variants with low alarm true positive scores run the risk of causing alert fatigue, where public health officials may be overwhelmed by alerts that are not meaningful. Such warnings would also not be suitable for prompting costly interventions, and would perhaps better serve as lists of places to monitor more closely.
Many of the early detection algorithms recommended for malaria use five full years to create the baseline. We tested five to 6.5 years in the weekly statistical summary methods, and three to 6.5 years in the Farrington variants. However, given ongoing changes in malaria transmission environments resulting from continuing interventions, social and demographic changes, and climate change, it may not be reasonable to expect historical malaria data more than a few years old to provide a suitable baseline for detecting future outbreaks [4, 24, 50–54]. It is therefore important to continue exploring new approaches for malaria outbreak detection that can be used with data covering shorter time periods. Future studies evaluating other algorithms will likely also prove insightful, as will investigations of the performance of the EARS and Farrington methods in other locations with different patterns of malaria incidence.

Conclusion
We compared the effectiveness of established methods for disease outbreak detection, the CDC EARS methods, the WHO and statistically based options, and the Farrington methods, using 7.5 years of malaria surveillance data from the Amhara region of Ethiopia. To our knowledge, this is the first study to assess the potential application of the EARS and Farrington methods for malaria outbreak detection. The EARS methods by design use a very short historical window that cannot account for seasonal trends in malaria occurrence. As a result, they could not effectively distinguish seasonal increases from outbreaks, although they were very sensitive to increases in cases. The WHO and statistical methods were also quite sensitive to detecting outbreaks, but with only moderate specificity in alarms (true positive scores). Variations of the Farrington method had a wide range of trade-offs between events caught and true positive scores. Farrington variants that accounted for seasonality had much higher true positive rates than the EARS and WHO methods and could achieve a better balance between true positives and the percentage of malaria events caught. We determined that the Farrington method was the most flexible and useful approach for operational early detection of malaria outbreaks in the Amhara region, and we suggest that this approach is more generally useful for detecting infectious disease outbreaks in transitional environments with strong seasonality and declining trends. The intended use of the early detection results will drive the choice of algorithm and parameter settings to optimize the sensitivity and specificity of alarms for particular applications.

Declarations
Availability of data and material The data that support the findings of this study are not publicly available because they were used under a data-sharing agreement with the Amhara Regional Health Bureau that does not permit their redistribution, but they are available from the Amhara Regional Health Bureau on reasonable request.

Ethics approval and consent to participate
Ethical approval for this research was provided by the Amhara Regional Health Bureau. The research did not involve human subjects, as it used only non-identifiable data provided as aggregated summaries.

Competing interests
The authors declare that they have no competing interests.

Funding
This work is supported by Grant Number R01-AI079411 from the National Institute of Allergy and Infectious Diseases.