2.1. Study area
We conducted our study in Arrah district, Bihar, India (25.47◦N, 84.52◦E), which is in the eastern portion of India’s main grain belt, the Indo-Gangetic Plains (IGP) (Fig. 1). Our analyses are focused on an 8 × 16 km² area where we had access to SkySat and PlanetScope imagery over multiple years. The region is dominated by smallholder farms (< 0.3 ha) (Jain et al. 2016) and over 80% of the land area is under agriculture. Most farmers in the region follow a rice-wheat cropping system, with rice planted during the monsoon (kharif) growing season and wheat planted in the winter (rabi) growing season. Our analyses focus on wheat as previous studies have shown that yield gaps for wheat are large in this region11,17 and are expected to increase given the negative impacts of warming temperatures on wheat yield18,19. The wheat growing season spans from early November to mid-April, and wheat management varies widely across farms resulting in significant across-farm heterogeneity in yield13. For example, wheat sowing dates vary from early November to early January and farmers also vary the number of irrigations applied throughout the growing season (ranging from 1 to 3 irrigations13).
2.2. Field data collection and processing
We collected crop cut data from 271 wheat fields from 2014-15 to 2018-19, with the number of fields ranging from 36 to 79 in a given year (details provided in Table S1). It is important to note that different fields were visited in different years, meaning that repeat samples were not collected for the same field through time. To collect crop cut data, the field team visited each of the 271 fields at the time of crop harvest and randomly selected subplots to harvest from each of the farmer’s fields (details in Table S1). The team harvested the crop from these subplots, threshed the sample, and weighed the grain on site. We then averaged the yields from each subplot for each field to estimate the average yield per hectare for the full field. In addition, the field team collected five GPS points, one from the field’s center and four from the corners of each field. For years 2014-15 and 2015-16, we also conducted a survey about farmers’ management practices in the field for that year. The survey included questions about management factors shown to be important for explaining wheat yields in the previous literature, including sow date and the number of irrigations throughout the season. This work was considered exempt for Human Subjects Research by the University of Michigan Institutional Review Board (HUM00156479, HUM00128955, HUM00120778).
To derive field polygon boundaries from the five GPS coordinates for each field, we used the rgeos20 and sp21 packages in R Project Software Version 4.1.022. We overlaid all field polygons over high-resolution imagery from Google Earth (https://www.google.com/earth/), and adjusted field polygons to match field boundaries that were visible in the high-resolution imagery23,24. We then linked all associated yield and management factors with each polygon, and used the resulting shapefile for all subsequent analyses.
2.3. Satellite yield estimation
We used SkySat (2 m) imagery for the 2014–2015 and 2015–2016 growing seasons, and PlanetScope (3 m) imagery for the 2016–2017 to 2018–2019 growing seasons to estimate wheat yield. The number of images and specific dates used varied across years depending on image availability and cloud cover (details provided in Table S2). We assessed image availability by searching the PlanetScope API (www.planet.com) for all available images for our study site (Fig. 1) from November 1 to April 15 for each year’s growing season. We then visually inspected all available images and selected only those images that were cloud free. Since multiple tiles encompassed our study area, we mosaicked tiles into one image that covered the full study area extent using histogram-matching of overlapping areas in ENVI Software. SkySat imagery were provided as top of the atmosphere reflectance, so we corrected imagery to surface reflectance by stretching histograms to match distributions of each band as seen in cloud-free, surface reflectance corrected Landsat 7 and 8 imagery obtained from Google Earth Engine25 (GEE). Specifically, images were matched to cloud-free Landsat scenes from the closest available image date, and if a cloud-free Landsat scene was not available within two days of a given SkySat scene, we used a date-weighted average of the histograms from the two closest Landsat scenes before and after each available SkySat date (more details provided in Jain et al.13). All PlanetScope imagery were provided as surface reflectance corrected data, and thus all images were used directly without additional corrections.
SkySat and PlanetScope have blue (450– 515nm, 450– 515nm), green (515–595nm, 500–590nm), red (605–695nm, 590–670nm), and near infrared (NIR, 740–900nm, 780–860nm) bands. Using these bands, we calculated the green chlorophyll vegetation index for each image (GCVI) (Eq. 1) as previous studies have shown that GCVI has a linear relationship with the leaf area index for wheat26.
GCVI = (NIR/green) − 1 (1)
We then extracted the mean GCVI for each field polygon for each image date in all years, and these mean GCVI values were used to create our yield estimation model. To estimate yield, we used a two-step approach. First, we predicted yield using random forest linear regressions, where each year’s crop cut data were used to train individual random forest models for each year (Eq. 2)
Crop cut yield ∼ β0 + β1GCVI1 + β2GCVI2… + βnGCVIn + ε (2)
where crop cut yield (kg/ha) is the observed yield estimated using crop cuts for each polygon, GCVI1 to GCVIn are the mean GCVI values for each polygon for each image date (n) within a given growing season, and ε represents the residual error. For each year, the coefficients derived from the random forest model were applied to the stacked GCVI raster layer for the respective year. We found that random forest yield estimation resulted in models with high correlation between estimated and observed yields, however, the results did not fall consistently on the one-to-one line, suggesting systematic over or underestimation of yields in each year (Figure S1). Thus, to correct for this systematic bias, we conducted a second step, where we regressed the observed crop cut yields on the random forest estimated yields (Eq. 3).
Crop cut yield ∼ β0 + β1 RF estimated yield + ε (3)
where crop cut yield (kg/ha) is the observed yield estimated using crop cuts for each polygon, RF estimated yield (kg/ha) represents the mean predicted yield for each polygon derived from the random forest model from the first step (Eq. 2), and ε represents the residual error. To calculate mean satellite yield, we took the mean value of all pixels within each field’s polygon. We then applied the coefficients from Eq. 3 to the full raster stack to correct predicted random forest yields at the pixel scale across the study site. It is important to note that this correction (Eq. 3) was conducted separately for each of the five years. We validated our satellite yield estimates at the polygon scale by comparing estimated yields after the two-step approach with observed crop cut yields at the field scale. Accuracy was evaluated using R2 and root mean squared error (RMSE).
2.4. Yield gap estimation and persistence
To estimate yield gaps (YG), we subtracted the mean yield (Ymean ) for each polygon from yield potential (Yp) for each year (y; Eq. 4).
YGy = Ypy – Ymeany (4)
Where YGy equals the yield gap (kg/ha) for each polygon in year y, Ypy represents the yield potential (kg/ha) for all polygons in a given year y, and Ymeany represents the mean yield (kg/ha) for each polygon in a given year y. We defined Ypy as the 95th percentile yield value found in each year’s satellite estimated yield raster for the study region after masking out non-cropland pixels using land cover classes from the Global Land Cover product27. Previous studies have suggested that such empirically estimated Yp better captures realistic economically-achievable yields that consider infrastructural, management, and economic constraints, which are not well accounted for in modelled estimates of Yp10. We calculated Ymean using the mean satellite estimated yield for each polygon for each year, as crop cut yield values were not available for each field in each year.
We measured yield persistence in two ways. First, we estimated how consistently fields were relatively high or low yielding by conducting a decile analysis developed in Lobell10. Specifically, all 271 fields were categorized into one of ten deciles based on their yield rank using 2014-15 satellite estimated yields. Keeping each field within its original categorized decile from 2014-15, we plotted boxplots of all yields seen for all fields across all remaining years (2015-16 to 2018-19). If yields are persistent, we would expect there to be little overlap between boxplots across decile values, as the lowest yielding fields would always be low yielding and the highest yielding fields would always be high yielding. However, if there is a large amount of boxplot overlap across deciles, this suggests that there is variability in yield through time. Second, we quantified the percent of yield variation that was persistent through time using methods from Lobell et al.28. Specifically, for the highest yielding decile of fields found in 2014-15, we calculated the fields’ average anomaly from the study site mean yield in 2014-15. For these same fields, we then calculated the fields’ average anomaly compared to the study site mean for all subsequent years (2015-16 to 2018-19). By comparing yield anomalies from 2014-15 with yield anomalies for all remaining years, we gain an understanding of the amount of yield persistence from 2014-15 across later time periods.
2.5. Drivers of yield gaps and the ability to close yield gaps
To understand which factors most influence yield gaps, we conducted random forest regressions where we regressed yield gap estimates for each year on a suite of management, weather, and biophysical variables that have been suggested to be important drivers of yield gaps in the previous literature (Eq. 5). Specifically, for management variables we considered wheat sowing date and the number of irrigations applied, for weather variables we considered average temperature and total rainfall within each winter season, and for biophysical variables we considered soil nitrogen and soil organic carbon.
YGy ~ β0 + β1 DOSy + β2 Irrigationy + β3 AvgTempy + β4 Tot_Rainy + β5 Nitrogen + β6Soil_Org_C + β7 Plotarea + ε (5)
where YGy represents the yield gap (kg/ha) calculated for each year for each field (from Eq. 4), DOSy represents the sowing date of wheat (days since November 1) for each field in each year, Irrigationy represents the number of irrigations (ranging from 1–3) applied to each field during the wheat growing season in each year, AvgTempy represents the average temperature (°C) for each polygon in each year, Tot_Rainy represents the total amount of rainfall (mm) for each polygon in each year, Nitrogen represents mean soil nitrogen (cg/kg) for each field across all years, Soil_Org_C (dg/kg) represents mean soil organic carbon (SOC) for each field across all years, Plotarea represents area of the field, and ε represents error. We calculated variable importance for our random forest regression (Eq. 5) by examining the mean decrease in accuracy (%IncMSE) over all out-of-bag cross validated predictions when each variable was permuted.
We obtained sowing date and irrigation information from management surveys that were conducted in 2014-15 and 2015-16. Given that we only had management variables available for these two years, we restricted our analyses (Eq. 5) to only these two years. We calculated average temperature using temperature data from Terra Climate29; specifically we calculated mean temperature for each month (November to April) for each year (2014-15 to 2018-19) using the mean of monthly maximum and minimum temperature. We calculated total rainfall as the sum of monthly rainfall from November to April using monthly precipitation data from Terra Climate29. Finally, we calculated soil nitrogen and SOC using World Soil Information Service (WoSIS) global raster data30. Weather and soil raster data were extracted as the mean value for each polygon using the raster package31 in R Project Software 4.1.022. More details about each dataset, including their source and resolution are included in Table S3.
Finally, we ran simulations to quantify how much yield gaps could be closed if all farmers adopted optimal management strategies. For this analysis, we focused on the two management variables considered in our analyses (Eq. 5), sowing date and number of irrigations applied. To identify what management values were optimal, we examined the partial dependence plots of sow date and irrigation and identified which values were associated with the smallest yield gaps. Partial dependence plots show the marginal effect of each feature on the predicted outcome from our random forest analysis (Eq. 5). Based on the partial dependence plots, we found that a sowing date of November 12 and three irrigations were associated with the lowest yield gaps. In our scenario analysis, we therefore altered all sowing dates to be November 12 and all irrigations to equal 3, and we predicted what yields would be for each field using our random forest model (Eq. 5). To estimate how much yield gain could be achieved, we quantified the difference between this predicted yield value under optimal management and Ymean y. All analyses were done using the Random Forest32 and partial dependence plot33 packages in R Project Software 4.1.022.