Study area
Our study area covers four woredas in the Amhara region of Ethiopia (Figure 1). The Amhara region is located in northwestern Ethiopian and is geographically very diverse. Elevation ranges from 500 m to 4500 m. Climate is very seasonal with a pronounced rainy season and dry season. The four woredas are located in two geographically separated locations. The first study area, encompassing Mecha and Bahir Dar Zuryia, is located south of Lake Tana and the city of Bahir Dar in central Amhara. In these woredas, elevations range from 1500 m to 3200 m, mean daily temperatures range from 16 °C to 26 °C, and annual precipitation is between 1000 mm and 2000 mm. The northern section is relatively flat with large expanses of mixed agriculture and extensive wetlands, transitioning to the foothills of the Choke Mountains in the southern section. This study area contains a substantial area of irrigated agriculture within the Koga irrigation development project. The Koga project encompasses two dams with respective water detention ponds and 11 night water storage reservoirs. It irrigates 7004 ha of farmland via a system of canals [37].
The second study area, encompassing Aneded and Awabel, is located on the southern edge of the Amhara region between the Choke Mountains to the north and the Blue Nile River to the south. In these woredas, elevations range from 1000 m to 3600 m, mean daily temperatures range from 15 °C to 28 °C, and annual precipitation is between 900 mm and 1300 mm. This northern section has gradually sloping terrain and is covered by a mosaic of croplands and wetlands, while the southern section drops steeply into the Blue Nile Gorge and has some croplands with large areas of bare soil and sparse vegetation. There is no large-scale irrigation scheme in this area.
The Amhara region has unstable malaria transmission that results in sporadic localized and regional outbreaks [38]. Outbreaks occur primarily between September and December. The malaria parasites in this region are Plasmodium falciparum and Plasmodium vivax [35] and the primary vector species in this region is the mosquito Anopheles arabiensis [39]. The Ethiopian government has committed to move towards nationwide malaria elimination by 2030 [40]. Due to successes of malaria control efforts, malaria incidence and mortality rates have been declining in the Amhara region [29] and in other parts of Ethiopia [3]. In the Amhara region, confirmed malaria cases are usually reported from health facilities and then aggregated by woredas under a new malaria elimination pilot project, malaria data from seven woredas were aggregated at kebele level. Each kebele has at least one health post that serves up to 5,000 people. Larger kebeles have health centers that serve 20,000 people. Our study area includes a total of 122 kebeles from the four study woredas: Mecha, Bahir Dar Zuryia, Aneded and Awabel.
Malaria data
Malaria case data were collected at local health posts and health centers, summarized by week for each kebele, and reported to the Amhara Regional Health Bureau. The data included malaria cases of patients who sought care at a health post or health center and were confirmed by rapid diagnostic test or blood film screening. Weekly summaries included information on the malaria-causing pathogen (total P. falciparum cases, P. vivax cases, and mixed infections), age of the patients (above or below 5 years of age), the number of malaria patients with a travel history, and the number of total outpatients seeking care during a given week. Since no recent population data were available at the kebele level, we calculated the proportion of outpatients diagnosed with malaria (hereafter referred to as malaria proportion). This ratio is considered a reliable indicator of malaria burden because it controls for temporal variation in health facility attendance and can be calculated in situations where accurate population data are not available [41,42]. The case data ranged from September 2013 to July 2018. However, we only used data from 2014 – 2017, as these were available for the entire year. Out of 472 reporting health posts or health centers, we did not have reliable location data for six. Together, these six reported only 23 malaria cases for the entire time frame and their exclusion was not expected to influence the results.
Epidemiological data were summarized for all health posts within a kebele and for each year, to produce a kebele-wide annual tally of total malaria cases. We summarized total malaria cases, as well as for P. falciparum + mixed infections combined and P. vivax individually. To detect statistically significant spatial clusters in malaria occurrence, we performed a scan statistic using SaTScan software version 9.6 [43]. We ran a retrospective purely spatial discrete Poisson model with total outpatients as the population at risk. The scan statistic was performed with an elliptical window for each year and each malaria pathogen group (total malaria cases, P. falciparum + mixed cases, and P. vivax cases) separately. After determining the spatial clusters for each year, we then identified stable hot spots (areas with recurring clusters in three or four years), unstable hot spots (clusters in one or two years), and areas that were never identified as clusters.
Environmental variables
Table 1: List of static and dynamic variables used to predict malaria proportion.
Type
|
Description
|
Source
|
Name
|
Units
|
Dynamic variables
|
Daytime temperature
|
Terra MODIS
|
LST
|
°C
|
Annual Rainfall
|
IMERG
|
PREC
|
mm
|
Normalized Difference Vegetation Index
|
MODIS NBAR
|
NDVI
|
index
|
Normalized Difference Moisture Index
|
MODIS NBAR
|
NDMI
|
index
|
Static variables
|
Settlement mean density
|
PlanetScope
|
SETME
|
index
|
Settlement max density
|
PlanetScope
|
SETMX
|
index
|
Area below 2 m above nearest drainage
|
DEM, Stream network
|
HAND
|
%
|
Wetland cover
|
Midekisa et al. (2014)
|
WETL
|
%
|
Woody vegetation cover
|
Midekisa et al. (2014)
|
WOODY
|
%
|
Cropland cover
|
Midekisa et al. (2014)
|
CROP
|
%
|
Open water cover
|
Midekisa et al. (2014)
|
WATER
|
%
|
Sparse vegetation cover
|
Midekisa et al. (2014)
|
SPVEG
|
%
|
Irriagtion cover
|
Digitized from Google Earth
|
IRRI
|
%
|
Distance to Seasonal Waterbodies
|
Landsat OLI
|
DISTSW
|
m
|
Dynamic variables are summarized for each year for the dry season (_dr), rainy season (_rn), and transition season (_tr).
Two types of explanatory environmental variables were used to investigate malaria case patterns (Table 1). Dynamic variables included environmental conditions that were expected to vary between and within years, such as land surface temperature and remotely-sensed greenness and moisture indices. Temperature data were derived from the MODIS Terra Land Surface Temperature and Emissivity Product (MOD11A2) [44]. These data have a spatial resolution of 1km and are provided as 8-day composites. We used daytime observations to reduce the problem of missing data from nighttime clouds. Data were filtered using the quality assurance flags to only include observations with an average LST error of below 2 °K. Temperature values were then converted to °C.
Spectral indices measuring vegetation greenness and surface moisture were derived from daily 500 m spatial resolution MODIS Nadir BRDF-Adjusted Reflectance data (MCD43A4) [45]. We applied a data quality filter using the BRDF/Albedo quality product (MCD43A2) and only included observations that were flagged as land and were “good” or “best” quality. We then calculated the Normalized Difference Vegetation Index (NDVI) [46], as well as a Normalized Difference Moisture Index (NDMI) [47].
We imputed missing values and replaced outliers from imperfect cloud-screening using a robust linear regression model from the R MASS library [48]. We fitted a robust linear regression model on our temporal data for NDVI, NDMI, and LST using cyclical splines and estimated outliers with a z-score above 3 or below -3. We then removed observations that were identified as outliers and replaced them, as well as missing values, with predicted values from the robust linear regression model.
Precipitation data was derived from the Integrated Multi-satellitE Retrievals for Global Precipitation Measurement product (IMERG) [49]. The IMERG product has a spatial resolution of 0.1° and a temporal resolution of 30 minutes. Thee dynamic variables for land surface temperature, spectral indices, and precipitation were summarized for each kebele and for each season of the year: dry season (January – April), rainy season (May – August), and the transition season (September – December).
To identify potential mosquito breeding habitats, we mapped seasonally flooded areas for each year using 30 m Landsat 8 Operational Land Imager (OLI) imagery for the dry season and the rainy season of each year. We removed pixels that were flagged as cloud or cloud shadow and calculated the Normalized Difference Water Index (NDWI) [50] for each Landsat scene. To identify seasonally flooded areas, we extracted areas where the median NDWI during the end of the wet season (September - October) was above zero and the median NDWI during dry season (January – April) was below zero. We then calculated the distance of each pixel to the nearest seasonally flooded pixel, expressed as cumulative cost distance of the shortest path measured in meters from the neatest water source. This was done for each year separately to account for inter-annual variation in flooding extents. Additionally, we added “year” as a continuous variable, to capture inter-annual trends of malaria proportion that are not related to environmental conditions.
Static variables measured environmental conditions that were not expected to change substantially between years. These included data on settlement structures, land cover and the hydrological setting of the landscape. We created two settlement density indices from 3 m spatial resolution PlanetScope imagery [51]. We acquired images from November 2016 and classified each tile into building and non-building areas with a Random Forest model using a bag fraction of 0.67 over 500 classification trees. The over-all accuracy of the building classification based on out-of-bag data was 0.98 for both areas. In Mecha and Bahir Dar Zuryia we reached a sensitivity of 0.97 and a specificity of 0.98. In Aneded and Awabel sensitivity and specificity were both 0.98. To create variables for settlement density, we resampled the classification to 100 m pixels via a majority filter and then applied a Kernel density estimator using a gaussian kernel with a radius of 100 m and a sigma of 50 m. For each kebele we summarized the mean and maximum settlement density. Settlement classification and density estimation were performed in Google Earth Engine [52].
Land cover variables were derived from a 30 m resolution Landsat-based classification done by Midekisa et al. [24]. We calculated the percentage of each kebele covered by the following land cover classes: open water, herbaceous wetlands, woody vegetation, cropland, and sparse vegetation.
To map areas that are likely be flooded during larger rain events, we calculated the height above nearest drainage (HAND). The height above nearest drainage measures the vertical distance between a given point and the nearest stream. Low values indicate floodplains and other low-lying areas that are likely to be inundated during and after the rainy season when flow rates are high. We used topographic and stream network data to calculate the HAND using Topography Tools for ArcGIS [53]. We then identified areas less than 2 m above the nearest drainage and calculated the proportion of each kebele that falls into this category.
Statistical analysis
We used Boosted Regression Trees (BRT) to model the relationships between malaria cases and environmental variables. BRT models are commonly used in species distribution modeling [54,55], and have also been used to study mosquito borne diseases [18,56–58]. Advantages of the BRT method are that it performs well under moderate collinearity of predictor variables [59], captures non-linear relationships between predictor and response variables, and allows for interactions between predictor variables. BRT models were created using the R library gbm [60].
For each study area, separate models were fitted for P. falciparum + mixed cases, P. vivax cases, and total cases. Outpatient numbers were included as offset term. The models were fitted with a learning rate of 0.01, a tree complexity of 3, and a bag fraction of 0.5. Several parameter combinations were tested and the combination that yielded the best R2 from a 5-fold cross validation was selected.
Relative importance of predictor variables was estimated to determine which variables had the strongest influences on malaria patterns. Variable importance is a measure of how often a variable is used to create a split, normalized by the improvement in squared error resulting from the corresponding split. We then ranked the variables by their importance value and identified the top five variables for each model. To visualize the relationships between these variables and malaria proportion, we fitted partial dependence plots that show how a response variable depends on the predictor variable after taking the average effects of all other variables into account.