Spatio-temporal Evaluation of Gridded Precipitation and Evapotranspiration Products Over Balochistan Province in Pakistan


 This study evaluated the performance of four state-of-the-art precipitation (P; CHIRPSv2, MSWEPv2.2, PERSIANN-CDR, and ERA5) and three state-of-the-art evaporation (Ea; GLEAMv3.3a, SSEBopv4.0, and ERA5) products in the data-scarce province of Balochistan, Pakistan. The P products were evaluated through a point-to-pixel comparison at the monthly, annual, and seasonal temporal scales using data from 17 gauges. The modified Kling-Gupta efficiency (KGE’) and the root mean square error (RMSE) were used as statistical indices to evaluate the performance of the P products. We used the Budyko framework to evaluate the Ea component due to unavailability of ground-based Ea data. For this purpose, we calculated the multi-annual mean Ea — defined as Ea_WB — for 12 defined catchments for a period of 16 years (2003–2018). Subsequently, the estimated Ea_WB was used as the reference to evaluate the performance of the Ea products. The results of the study revealed that among the P products generally MSWEPv2.2 closely followed by ERA5 presented the best spatio-temporal KGE’ performance over most of the stations and across the province. CHIRPSv2 and PERSIANN-CDR relatively showed poor KGE’ performance. Whereas, for the Ea component, ERA5 generally outperformed the other two products — i.e., GLEAMv3.3a and SSEBopv4.0 — in most of the evaluated catchments except for those located over the north-eastern region where GLEAMv3.3a performed the best. The results of this study can be beneficial to shift towards a proactive water management approach that can serve as a basis to improve the decision-making process in the region.

However, P and Ea estimates are always subject to errors and uncertainties; and therefore, they must be evaluated before any hydrological application (Oliveira et al., 2014;Zambrano-Bigiarini et al., 2017). For example, Duan et al. (2012) evaluated two TRMM products (3B42v6 and 3B43v6) over the Caspian Sea region and found that both overestimated low P events and underestimated high P events. Similarly, Maggioni et al. (2016) evaluated the accuracy of different satellite P products including TRMM 3B42v7, the Global Satellite Mapping of Precipitation (GSMaP), the Climate Prediction Center Morphing method (CMORPH), and the PERSIANN-Cloud Classification System (PERSIANN-CCS). The results of this study showed that topography, seasonality, and climatological patterns influenced the P products' performance. Long et al. (2014) studied uncertainties in two remote-sensing based Ea products, including MOD16 and Advanced Very-High-Resolution Radiometer (AVHRR), in varying climate and land cover conditions in the south-central United States. They found that there was a trade-off between the spatial resolution of the Ea products and the level of uncertainty. That is to say, the higher was the resolution of the product the greater was the level of uncertainty; and lower was the resolution of the product the less was the level of uncertainty. In a recent study, Hugo et al. (2019) evaluated the performance of satellite-based Ea products, including GLEAM, SSEBop, Atmospheric Land Exchange Inverse (ALEXI), MOD16, Surface Energy Balance System (SEBS), and CMRSET, in the complex environment of the Amazon.
They found that the uncertainty of the products was very low and sometimes negligible at the monthly temporal scale, but as they evaluated the products at higher temporal scales (e.g., more than one month), the uncertainty of the products increased It's, therefore, important to study the uncertainties embedded in these products when used to account for P and Ea; especially, over the arid and semi-arid regions due to the reduced performance of these products over arid and semi-arid areas (Emam et al., 2015). The arid region of Balochistan province in Pakistan is not an exception but is considered as one of the most water-scarce and drought-prone regions in the country (Ahmed et al., 2017). It has been observed that on average the province experiences four years of droughts out of ten years (Anjum et al., 2012) exerting severe economic impacts on the livelihood of the people who majorly depend on agro-pastoral activities (Ahmed et al., 2015b). For example, the prolonged drought of 1998-2002 resulted in a 60-80% reduction in crop yield and caused around two million livestock death in the province (Sarwar, 2008). In addition, the projections for the upcoming decades suggest an increase in drought frequency and a greater P variability due to climate change (Ahmed et al., 2005), which will place unprecedented pressure on the already scarce water resources of the area.
There have been many studies in Balochistan Region to understand the water cycle components and different climatic variables employing a variety of approaches. For example, Ahmed et al. (2015a) developed a model based on multilayer perceptron neural networks to downscale P in Balochistan Region using reanalysis datasets over the study area and calibrated and validated the data using monthly ground-based meteorological stations.
They found a good correlation between observed and downscaled time series of P during both the calibration and the verification periods. However, the downscaled product underestimated the P variance during both the calibration and verification periods. In a recent study, Iqbal and Athar (2018) evaluated the performance of TRMM's Multi-satellite Precipitation Analysis (TMPA) using ground-based P data over the entire country of Pakistan. They found that TMPA underestimates P in high elevations of the country, whereas the results were relatively satisfactory in medium elevated regions and lowlands. However, they do not discuss the performance of TMPA over specifically Balochistan province. Furthermore, in a recent comparative study, Ahmed et al. (2019) evaluated the performance of gauged-based gridded P products against ground-based meteorological station data in the region. They evaluated four products including the Global Precipitation Climatology Centre (GPCC), Center for Climatic Research-University of Delaware (UDel), Asian Precipitation Highly Resolved Observational Data Integration towards Evaluation (APHRODITE), and Climatic Research Unit (CRU). Their study concluded that the performance of P products varied in different climatic regions. However, the agreement between observed P and GPCC was much better in all climatic regions under all the statistical indices used in the study.
Howbeit, no study have evaluated the key meteorological components of the water balance (P and Ea) using the state-of-the-art satellite-based and reanalysis datasets over the Balochistan Region. Thus, the general objective of this study is to assess the ability of four P products and three Ea products of representing the spatio-temporal patterns of these key water balance components over the data-scarce Balochistan Region with a methodology that can be replicated in other data-scarce settings. The specific objectives of the study are given below.
1. To evaluate the performance of P and Ea products using ground-based measurements and to select the products that represent the best the spatio-temporal patterns of these variables.
2. To evaluate, whether the Budyko framework can be used to assess the performance of actual evaporation products in data-scarce regions.
Additionally, this work aims to provide the necessary information and knowledge about the water resources of the region and their distribution to relevant stakeholders and decision-makers. This knowledge can be particularly beneficial to shift towards a proactive water management approach that can serve as a basis to improve the decision-making process.

Study area
Balochistan is the largest province of Pakistan in terms of area (44% of the total land surface of the country) and is by far the smallest in terms of population (5% of the total population of the country; P&D-GoB, 2020). It covers a total land surface area of 347,190 km 2 with a population of 12.3 million inhabitants. The province lies on the south-eastern part of the Iranian plateau between the latitudes 25º and 32º N and between the longitudes 61º and 71º E ( Fig. 1 [a]). It borders Iran to the west, Afghanistan to the north, the province of Khyber Pakhtunkhwa to the north-east, provinces of the Sindh and Punjab to the east, and the Arabian Sea to the south. The majority of the population in the province lives in the rural areas (85%) and depends on agriculture and livestock raising for their livelihood (Shafiq, 2008;Ashraf et al., 2014). These interdependent agro-pastoral activities provide around 67% of the employment to the total workforce in the region (Baloch & Tanik, 2008). Rangelands are the main source (90%) of the livestock feed requirements (Shafiq & Kakar, 2008). The landscape of the province is composed of barren mountains, deserts, floodplains, and coastal plains (PDMA, 2020). As shown in Figure 1(a), the elevation ranges from 0 m.a.s.l. (in the southern coastal areas) to 3,558 m a.s.l (in the northern highlands). There is a pronounced gradient of elevation which has a huge impact on the climate of the region (Ahmed et al., 2014). The northern highlands have very cold winters and warm summers while the plains in the east and lowlands in the south have very hot summers and mild winters (GoB, 2020).
The climate of the province is classified as arid-desert and arid-steppe based on the most updated Köppen-Geiger climate classification (Beck et al., 2018). The P is very low and unevenly distributed throughout the region (Adnan et al., 2009). The annual mean P ranges from 30 mm in the south-west desert to around 397 mm in the northeastern highlands (Ahmed et al., 2018). Ea losses are huge (e.g., in Quetta valley varies from 1 to 7.5 mm/day) across the province (Rasul & Mahmood, 1993). The only perennial surface water source comes from the Indus Basin Irrigation System (IBIS; 3.75 billion m 3 /yr.) that constitutes around 14% of total accessible water which irrigates around 1% of the total land surface area of the province (Aftab et al., 2018). The rest of the agriculture (around 6% of the total surface area of the province) is practiced at valley bottoms through floodwater harvesting, spate irrigation, dug wells, qanats, and the tubewells (Mustafa & Qazi, 2007).
Rainfall regimes are predominantly controlled by western depressions and the monsoon (Hussain & Lee, 2014).
According to Ahmed et al. (2015a), winter P (December-March) is caused by the western disturbances that originate in the Mediterranean Sea and are responsible for around 58% of the total rainfall in the region. On the other hand, the summer monsoon winds, originated in the Bay of Bengal, are responsible for around 31% of the total rainfall that occurs from June to September. The south-eastern part of the province receives most of the P during the monsoon season, while the rest of the region receives most of the P during winter (Ahmed et al., 2018).
As the monsoon winds advance from east to west, the moisture content in the air decreases and as a result, the P gradually reduces (Ahmed et al., 2019).
The major catchments in the study area are shown in figure 1(b). The eastern catchments (i.e., Zhob, Nari, Bolan, Mula, and Guj) flow to the Indus River basin in the east, and the southern catchments (i.e., Karradi, Porali, Hingol, Basol, Shadikor, and Kech) empty into the Arabian Sea in the south. Whereas the western and north-western Balochistan is entirely covered by the Kharan-Pashin-Lora (K.P.L.) closed basin which flows into numerous large (e.g., Hamun-e-Mashkel, Hamun-e-Lora, and Zangi Nawar lakes) and small desert lakes.

Ground-based P data
Time series of monthly P data for 22 stations were obtained from the Pakistan Meteorological Department (PMD).
The temporal coverage of the data ranges from 15 to more than 30 years. This study used 17 stations that had at least 18 years of record (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017). The selection of the study period for the P analysis was selected because of two main reasons: (i) to include the maximum number of stations in the study, and (ii) to include a period in which all the selected P products were available. Figure 1(a) shows the selected stations. Four stations (vis. Karachi, Larkana, Jacobabad, and D. I. Khan) out of the seventeen selected stations are located outside of the province of Balochistan (closer than 100 kilometers from the border of the province). These stations were included to have the maximum number of stations available for this study assuming that useful information can be obtained from stations lying in the proximity of the province.
The Expectation-maximization (EM) method (Dempster et al., 1977) was used to fill the monthly missing values a total of four monthsin Jacobabad, Jewani, Lasbela, and Quetta (S. Manda) stations (one month in each station). The EM method has been widely used in the literature to compute the missing P values (e.g., First et al., 2010;Ahmed et al., 2015a;Alamgir et al., 2015). A complete description of the EM algorithm can be found in Dempster et al. (1977).
This study evaluated three state-of-the-art satellite-based P products (i.e., CHIRPSv2, MSWEPv2.2, and PERSIANN-CDR) and one reanalysis dataset (i.e., ERA5) by comparing their estimates with ground-based measurements. The selection of these products was based on: (i) their long period of record, which is crucial in water balance applications, and (ii) their relatively high spatial resolution. Table 1 summarizes the main characteristics of the selected products.

CHIRPSv2
CHIRPSv2 is a satellite-based P product developed by the collaboration of the United States Geological Survey (USGS) and the University of California, Santa Barbara, and is available at https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/tifs/. This product has a quasi-global coverage (50ºS-50ºN), high spatial (0.05º) and temporal (daily) resolution, providing a long-term historical record (1981present;Funk et al., 2015). This product was developed based on the integration of various other datasets including Climate Hazards group Precipitation climatology (CHPclim), TRMM 3B42, global Cold Cloud Duration (CCD) estimates, atmospheric model rainfall fields from the National Oceanic and Atmospheric Administration's (NOAA) Climate Forecast System (CFS) and ground-based measurements from multiple sources . This product has a latency of 2 days. Although the principal objective of developing this product was to monitor droughts, climate extremes, and environmental changes over land (Katsanos et al., 2016), it has been successfully used in a variety of hydrological applications (e.g., Katsanos et al., 2016;Tuo et al., 2016;Xian et al., 2019).

MSWEPv2.2
MSWEP version 2.2 is a global gridded P product spanning over more than 40 years (1979-present;Beck et al., 2019b) and is available at http://www.gloh2o.org/. MSWEPv2.2 features high spatial (0.1º) and temporal (3hourly) resolutions (Beck et al., 2019b). This product was developed by combining a wide range of datasets such as rain gauges, satellite, and reanalysis P products aiming to provide an improved dataset worldwide (Beck et al., 2019b). This product was specifically designed for hydrological applications (Beck et al., 2017a); taking into consideration the existing limitations of the P products to produce accurate estimates over mountainous, and snowdominated regions around the globe (Zambrano-Bigiarini et al., 2017;Beck et al., 2020). It has been used in a range of hydrological studies such as hydrological modelling (Xiang et al., 2021), and P variability analysis (Chen & Dirmeyer, 2016;Zhang et al., 2017).

PERSIANN-CDR
PERSIANN-CDR is a quasi-global (60°S-60°N) satellite-based P product developed by the University of California, Irvine (UCI) to facilitate continuous monitoring of changing trends in daily P, particularly P extremes, caused by natural climate variability and climate change (Ashouri et al., 2015). PERSIANN-CDR is an opensource product available at http://chrsdata.eng.uci.edu/. The product features a relatively high spatial (0.25º) and temporal (daily) resolution covering more than 35 years (1983-present) of record. The dataset is generated by applying the PERSIANN algorithm on GridSat-B1 infrared data from NOAA's National Centers for Environmental Prediction (NCEP; Ashouri et al., 2015). The product is bias-corrected using monthly 2.5º data from the Global Precipitation Climatology Project (GPCP; Nguyen et al., 2019).

ERA5
ERA5 is an atmospheric reanalysis product produced by the European Centre for Medium-Range Weather Forecast (ECMWF) to replace ERA-Interim, which was released in 2006 (Hersbach et al., 2018). The product has global coverage with a spatial resolution of approximately 31km and is produced at 1-hourly temporal resolution with a coverage of more than 40 years (1979-present). This product is based on the Integrated Forecasting System (IFS) Cycle 41r2 and is available after five days (Hersbach et al., 2018). Furthermore, the dataset is based on an advanced numerical model and a vast variety of observation inputs and forcing datasets (Nogueira, 2020). Many recent studies have used ERA5 for many hydrological applications. These include modeling of global mean P (Hersbach et al., 2018), evaluating daily P products over the continental United States (Beck et al., 2019a), modelling of surface runoff (Matveeva & Sidorchuk, 2020), and catchment scale hydrological modelling (Tarek et al., 2020). The dataset can be obtained at https://cds.climate.copernicus.eu/.

E products
As ground-based Ea information is not available in the study area, we used the Budyko framework that uses potential evaporation (hereafter defined as Ep) information to calculate Ea which was then used as a reference to evaluate the Ea products (see section 2.3.2). Thus, this study used both Ea and Ep products. An overview of these products is given below. Table 2 summarizes the E products.

GLEAMv3.3
The GLEAMv3.3 is a satellite-based E and root-zone soil moisture dataset developed in 2011 . This dataset features a relatively high spatial (0.25°) and temporal (daily) resolution with global coverage of 39 years   (Martens et al., 2016). This study used two datasets from the GLEAMv3.3: i.e., Ep product and Ea product. The datasets are available in two latest versions i.e., GLEAMv3.3a and GLEAMv3.3b, which are available at www.gleam.eu. We used both Ep and Ea datasets from GLEAMv3.3a because this version uses a variety of observation inputs including gauged-based, satellite-based, and reanalysis datasets while the GLEAMv3.3b largely depends on satellite datasets alone (Martens et al., 2017).

SSEBopv4.0
The operational Simplified Surface Energy Balance (SSEBopv4.0; Senay et al., 2013), is one of the promising Ea products available in recent times (de Paula et al., 2019). The model is based on the Simplified Surface Energy Balance (SSEB) approach (Senay et al., 2011) with a unique simplified parameterization (Senay et al., 2013). The novelty of the SSEBopv4.0 parameterization is that for each pixel, the difference between the hot and the cold reference values are pre-defined and seasonally dynamic. It uses a thermal index approach to combine the local reference evaporation (Eo) with evaporation fractions obtained from remotely sensed thermal imagery from MODIS datasets (Senay et al., 2013). The pre-defined temperature difference is added to the cold reference value to obtain the hot reference value, whereas the cold reference value is obtained as a fraction of the air temperature (Senay, 2018). The SSEBopv4.0 product has been used in a variety of hydrological studies in data-scarce regions for famine early warning, drought monitoring, mapping Ea, and water accounting in different land-use types (Senay et al., 2011;Senay et al., 2016;Senay et al., 2013). The model is available at 10-day, monthly and annual temporal resolutions, and at a spatial resolution of 1 km. The SSEBopv4.0 covers more than 18 years of record (2003-present) and is available at https://earlywarning.usgs.gov/fews/datadownloads.

ERA5
A general overview of ERA5 has already been discussed in section 2.2.2. However, besides the P product of the ERA5, this study also used Ea and Ep products from the same dataset. Ep parameter represents the extent to which the near-surface conditions are conducive to facilitate maximum possible Ea in a given time (ECMWF, 2020).
This parameter is developed based on the surface energy balance calculations under certain underlying assumptions (ECMWF, 2020). One major assumption that may limit its performance in arid and semi-arid regions is that it assumes that the agricultural land use is well watered and that the atmospheric conditions are not affected by this artificial surface condition (ECMWF, 2020). Ea parameter represents the total evaporated water in a given time from the surface of the earth including the transpired water from the vegetation (ECMWF, 2020).

Evaluation of P products
Four P products (CHIRPSv2, PERSIAN-CDR, MSWEPv2.2, and ERA5) were evaluated against 17 ground-based stations for 18 years (2000-2017) through a point-to-pixel approach (Duan et al., 2012;dos Reis et al., 2017;Paredes-Trejo et al., 2017;Zambrano-Bigiarini et al., 2017;Baez-Villanueva et al., 2018), which assumes that the ground-based measurements are representative of the P of the respective grid-cells of the products. The evaluation was performed on different temporal scales (i.e., monthly, seasonal, and annual) to analyse the performance of the P products for different hydrological applications (dos Reis et al., 2017). The P products were accumulated to monthly, seasonal, and annual totals for comparison. Following Zambrano-Bigiarini et al. (2017), all the products with a higher spatial resolution than 0.25º were unified to a grid cell of 0.25º by using the bilinear interpolation method to ensure a consistent evaluation. Bilinear interpolation technique averages the nearest four input cells to obtain the output resampled cell (Wu et al., 2008). According to Baez-Villanueva et al. (2018), resampling of the products using bilinear interpolation might impact the performance of the P products. However, this study assumes that the resampling of the products had no impact on the obtained results.
The ability of the selected products to represent the P distribution over the study area and at different temporal scales (monthly, seasonal and annual) was evaluated using five continuous statistical indices: the modified Kling-Gupta efficiency (KGE´, Eq. 1; Gupta et al., 2009;Kling et al., 2012) along with its three individual components, i.e., the linear correlation (r; Eq. 2), bias ratio (β; Eq. 3) and variability ratio (γ; Eq. 4), and the root mean square error (RMSE; Eq. 5). Following Zambrano-Bigiarini et al. (2017) and Muthoni et al. (2019), this study selected the KGE' because P products are generally expected to be able to reproduce temporal dynamics (measured by r) of P while also preserving the total volume (measured by β) and distribution (measured by γ; Zambrano-Bigiarini et al., 2017). The KGE´ and its components can be defined as follows: where n is the number of observations, Oi and Si are the observed and estimated values at day i respectively; and are the arithmetic mean of the observed and estimated P, respectively; μ is the distribution mean; σ is the standard deviation and CV is the coefficient of variation.
The KGE´ and its components have their optimum value at one (Kling et al., 2012). According to Baez-Villanueva et al. (2018), a value equal to 1 represents perfect linear correlation, a value of -1 represents a perfect negative linear correlation, while a 0 indicates the absence of linear correlation. β greater than 1 represents overestimation and β less than 1 represents underestimation. γ shows whether the dispersion of the estimated P values is higher or lower than the observed values at gauge stations.
Furthermore, the RMSE was applied to estimate the average error value between observed and estimated P (Prasetia et al., 2013). This index has widely been used in the evaluation of P products (e.g., dos Reis et al., 2017;Iqbal and Athar, 2018;Nicholson et al., 2019). RMSE value of 0 represents perfect agreement between observed and estimated datasets (Chen & Liu, 2012). The RMSE can be defined as follows: where n is the number of observations; y is the observed value and ŷ is the corresponding P estimated value at day i.

Evaluation of Ea products
Due to the unavailability of ground-based Ea data over the study area, this study firstly estimated Ea based on the simplified water balance approach (Cheng et al., 2011) and subsequently compared the results with the Ea products to evaluate their performance (for other examples of this approach the reader is referred to Cheng et al., 2011;Mianabadim et al., 2017). The simplified water balance approach is based on the water budget of a catchment and for any time can be expressed as: where Ea is actual evaporationalso referred as actual evapotranspiration (Miralles et al., 2020) -P is precipitation; R is surface runoff, interflow, and baseflow; and is the change in water storage. Commonly, over a long period of time, e.g., ≥ 5-year (Tekleab et al., 2010), groundwater recharge and water accumulation in a natural catchment can be assumed to be negligible (Tekleab et al., 2010;Xiong & Guo, 2011;Du et al., 2016;Gunkel & Lange, 2017). As a result, a long-term steady-state water balance in a catchment can be expressed as: The long-term mean runoff R in a catchment can be calculated by the difference between the long-term mean and the long-term mean Ea assuming that water accumulation at the catchment scale is negligible (Caracciolo et al., 2017), and can be expressed as: The steady-state water balance is controlled by water availability (P) and atmospheric water demand (Ep). The Ep in turn is controlled by available energy (Zhang et al., 2008). This relationship between water availability, atmospheric water demand, and energy availability is well established by the widely employed Budyko framework (Budyko, 1974). This framework assumes that over a large timescale under natural conditions water accumulation at the basin scale can be neglected and the long-term mean Ea is determined by long-term mean P and the atmospheric water demand (Tekleab et al., 2010). Under the given assumptions, the Budyko framework establishes a functional link (ᴪ) that relates the ratio between the long-term mean annual actual evaporation Ea and the long-term mean annual precipitation P with the aridity index; defined as the ratio between the long-term mean annual potential evaporation Ep and the long-term mean annual precipitation P (Caracciolo et al., 2017), which can be expressed as: The original Budyko framework has been widely used in different regions around the world to estimate the longterm water balance in different catchments (e.g., Porporato et al., 2004;Istanbulluoglu et al., 2012). However, it has always been observed that the approximation of the estimated mean Ea varies from catchment to catchment and in different regions due to several physical processes such as vegetation, soil characteristics, and seasonality in climate which are not taken into account in the original Budyko framework (Xiong & Guo, 2011;Xu et al., 2013). As a result, many researchers (e.g., Choudhury, 1999;Fu, 1981;Zhang et al., 2004) have proposed different analytical solutions to this limitation (e.g., ignoring the catchment characteristics) to the original Budyko framework. Fu's (1981) equation is one of the most widely used analytical solutions to the original Budyko framework in the literature and can be expressed as: where ⍵, known as Fu's parameter, has no analytical solution but it has been related to different measured data (e.g., vegetation, soil characteristics, seasonality in climate, and other catchment characteristics) or it can be adopted from other regional and global studies that have assessed ⍵ (Caracciolo et al., 2017). Xu et al. (2013) developed a model to predict the spatial distribution of ⍵ globally. They developed a neural network (NN) model using 224 small (100 -10,000 km 2 ) U.S. basins and 32 large (230,000 -600,000 km 2 ) global basins considering both local (i.e., slope and normalized difference vegetation index [NDVI]) and global (location) factors and mapped the ⍵ globally. They also validated the predicted Ea over 36,600 large and small global basins with an independent satellite-based Ea product and found good agreement (R 2 =0.72). Therefore, this study employed Fu's equation to estimate and adopted ⍵ values from Xu et al. (2013) for the selected catchments (see Figure 1(b). Xu et al. (2013) suggest an average ⍵ value of 1.8 for the catchments in the study area. The Fu's ⍵ parameter, however, was used without any regional validation or calibration which may lead to certain uncertainties in estimating the Ea at catchment scale (Gunkel & Lange, 2017). Moreover, most of the catchments in the study area are not natural and experience at least some degree of anthropogenic influence (Baloch & Tanik, 2008) which may also lead to certain uncertainties.
To estimate mean Ea from Eq. 10, mean annual Ep (2003-2018) were obtained from ERA5 and GLEAMv3.3a products for the major catchments in the study area. For each catchment two mean Ea were calculated; one from ERA5 and the other from GLEAMv3.3a, adopting a ⍵ value of 1.8. The obtained mean Ea values were then compared against three Ea products, namely ERA5, GLEAMv3.3a, and SSEBopv4.0 to assess their performance based on the assumption that the less the difference between mean Ea obtained from Eq. 10 and mean Ea obtained from Ea products the better has the Ea product performed. The selection of the Ep products, i.e., ERA5 and GLEAMv3.3a, and Ea products, i.e., ERA5, GLEAMv3.3a, and SSEBopv4.0, was based on the maximum possible temporal coverage of the study period governed by P (ground-based and products) and Ea (products) data availability.

Spatial performance of the P products
The spatial distribution was computed for the selected continuous indices (r, β, y, KGE' and RMSE) at a monthly scale showing all the 17 stations. Figure 2( The spatial distribution of r (related to the temporal dynamics of P) is shown in Figure 2 The spatial distribution of the β component (related to the total volume of P) of the KGE' at the monthly temporal scale is shown in Figure 3(a). CHIRPSv2 presented the best performance for the β component exhibiting values ranging from 0.87 to 1.35 for all the stations located within the study area and nearly a perfect agreement (1 ≤ β ≤ 1.09) between observed and estimated P (in terms of volume) for almost all the stations located in the southern part of the province. MSWEPv2.2 and ERA5 were the second and third best-performing products, respectively; with slightly different performances among each other. MSWEPv2.2 underestimated more stations than ERA5.
PERSIANN-CDR was the product with the lowest performance and overestimated most of the stations particularly the ones located in the western part of the study area.    CHIRPSv2 also underestimated some stations over the eastern part of the study area.
Finally, the spatial distribution of the RMSE at the monthly scale is shown in Figure 4. Generally, RMSE performance for all the products was better in the drier stations compared to the wetter ones. As a result, the performance of the products was better in the stations located in the western part, which have less P, while the performance decreased towards the east where the stations receive relatively more P. However, MSWEPv2.2 followed by ERA5 had a relatively better performance for the stations located in the western part of the province compared to PERSIANN-CDR and CHIRPSv2, while PERSIANN-CDR followed by CHIRPSv2 had a relatively better performance for the stations located in the eastern part of the province compared to MSWEPv2.2 and ERA5.

Temporal performance of the P products
This study evaluated the performance of P products at the monthly, seasonal, and annual temporal scales.
Following Iqbal and Athar (2018), and Sadiq (2013)   None of the P products performed the best in all of the temporal scales. However, MSWEPv2.2 was the best performing product at the monthly, monsoon, and post-monsoon temporal scales, while PERSIANN-CDR was the best performing product in winter and pre-monsoon seasons. ERA5 performed the best at the annual timescale and CHIRPSv2 was the least performing product in all the temporal scales except monsoon where it performed better than ERA5 and PERSIANN-CDR.
In the case of β, MSWEPv2.2 was almost unbiased (median value = 1) for all the temporal scales. CHIRPSv2 was also almost unbiased for monthly, annual, and monsoon temporal scales. ERA5 also showed its median value near to 1 for the winter season. In general, all the products performed well over the winter season and worst over the post-monsoon. In overall temporal bias ratio performance, MSWEPv2.2 showed the best median performance followed by CHIRPSv2 and ERA5. PERSIANN-CDR was the worst-performing product and overestimated P at all temporal scales.
The evaluation of (γ) at different temporal scales revealed that the distribution of P at all the temporal scales was underestimated by all products except for MSWEPv2.2, which presented a median value close to 1 for the premonsoon season. However, the monthly timescale (0.59 ≤ median ≤ 0.9) followed by the annual timescale (0.51 ≤ median ≤ 0.9) was best represented by the P products compared to the rest of the temporal scales (i.e., winter, pre-monsoon, monsoon, and post-monsoon). In general, for all the temporal scales, MSWEPv2.2 was the best performing product followed by ERA5, PERSIANN-CDR, and CHIRPSv2.
The overall performance of the r, β, and γ is summarized in the KGE' as shown in Figure 5. All the P products performed the best during the monthly temporal scale except PERSIANN-CDR, which showed its best performance during winter. The monthly median KGE' values for MSWEPv2.2 and ERA5 varied from 0.68 to 0.62 respectively and for PERSIANN-CDR and CHIRPSv2 from 0.36 to 0.39, respectively. After the monthly timescale, the best performance of all P products was observed during the winter season followed by the annual temporal scale. All the products performed worst during post-monsoon followed by monsoon and pre-monsoon seasons. MSWEPv2.2 was the best performing product during monthly timescale and pre-monsoon, monsoon, and post-monsoon seasons; while ERA5 performed the best during the annual and winter temporal scales. In general, after MSWEPv2.2, ERA5 was the best performing product followed by CHIRPSv2 and PERSIANN-CDR. The performance of PERSIANN-CDR was particularly poor in pre-monsoon, monsoon, and post-monsoon where it showed negative KGE' values.
The best RMSE performance was observed in the post-monsoon (6.5 ≤ median values ≤ 10.7) followed by the monthly (

Performance of the Ea products
Due to the unavailability of ground-based Ea information in the study area, this study first estimated mean Ea based on the simplified water balance approach, here defined as Ea_WB, (see section 2.3.2), and subsequently compared the results with the Ea products to evaluate their performance (as in Cheng et al., 2011;Mianabadi et al., 2017).

Estimation of Ea_WB
This study used Fu's (1981)  Fu's equation required P, Ep, and ⍵ to calculate Ea_WB. data were used from the best performing P product at the multi-annual temporal scale (i.e., ERA5), while Ep data were extracted from two different products, i.e., GLEAMv3.3a and ERA5; this study used two Ep products so that two aridity indices (AI) could be estimated and compared for each catchment, and employed a ⍵ = 1.8 adopted from Xu et al. (2013). The two estimated AIs were then compared to the climate classification of Köppen-Geiger (Beck et al, 2018) and the AI information available in the literature to select the Ep that made more sense in terms of climate and previous literature. Finally, the Ep product that was best representing the AI in the study area was used in the Fu's equation to calculate the mean annual Ea_WB for the selected catchments during 2003-2018. Figure 6 shows the evaporation index (EI; Ea/P) versus aridity index (AI; Ep/P) and the Budyko curve calculated from both the GLEAMv3.3a and ERA5 Ep products. All the catchments fell within the water and energy limit boundaries (water and energy availability); and therefore, fulfill the Budyko assumptions. For the AI, ERA5 exhibited drier results (4 < IA < 20) compared to the GLEAMv3.3a (1 < IA < 9) for most of the catchments. However, the majority of the catchments lay in the range of 4 < AI ≤ 6 calculated from both the GLEAMv3.3a and ERA5.
In order to decide which product, i.e., GLEAMv3.3a or ERA5, more accurately depicted the extent of the aridity in the study area, this study evaluated the climate classification results based on UNEP (1992) obtained from GLEAMv3.3a (arid, semi-arid, and dry sub-humid) and ERA5 (arid and semi-arid) in the study area in two ways; (i) the obtained climate results were compared to the Köppen-Geiger climate classification (Beck et al, 2018).
According to this classification, most of the study area lies in the arid-desert (arid) climate zone except the northeastern part which exhibits an arid-steppe (semi-arid) climate (Beck et al., 2018). (ii) the obtained climate results were compared to the results of the previous studies which have specifically analysed the aridity of the study area.
These studies included Ahmed et al. (2015a), Ahmed et al. (2017), and Haider and Adnan (2014). In this sense, Haider and Adnan (2014) classified the Balochistan province as arid except the north-eastern part which was classified as semi-arid. Ahmed et al. (2017) and Ahmed et al. (2015a) had similar results as Haider and Adnan (2014) except with a difference that they classified extreme western Balochistan as hyper-arid.
The results obtained from ERA5 (i.e., arid and semi-arid) are in better agreement with the climate classification of the Köppen-Geiger (arid-desert and arid-steppe) and also to the results in the regional studies (arid and semiarid) compared to the climate classification results obtained from the GLEAMv3.3a (i.e., arid, semi-arid and dry sub-humid). Therefore, we selected ERA5 as the product to calculate the Ea_WB. These results suggest that ERA5 Ep estimates are of better quality to evaluate the performance of the Ea products, compared to the Ea_WB estimated with GLEAMv3.3a.
Considering the performance of the products based on the area instead of the number of catchments, SSEBopv4.0 was the best performing product (125879.5 km 2 within the province as K.P.L. closed basin is a transboundary basin) followed by ERA5 (867250 km 2 ) and GLEAMv3.3a (57474 km 2 ). However, the performance of the

Discussion
This work aims to evaluate for the first time the key water balance components over the Balochistan province in Pakistan. The present study evaluated the spatio-temporal performance of 4 state-of-the-art P products (CHIRPSv2, MSWEPv2.2, PERSIANN-CDR, and ERA5) using 17 ground-based station data, and 3 state-of-theart Ea products (GLEAMv3.3a, ERA5, and SSEBopv4.0) by applying the widely used Budyko framework over the data-scarce Balochistan province.

Performance of P products
The spatial distribution of the performance of the products evaluated with KGE' at the monthly timescale ( Figure   2) revealed that MSWEPv2.2 was the best performing product followed by ERA5, CHIRPSv2, and PERSIANN-CDR, respectively. The performance of MSWEPv2.2 and ERA5 was relatively satisfactory (0.6 < KGE' ≤ 0.8) for most of the stations and consistent across the province with a few exceptions. However, CHIRPSv2 and PERSIANN-CDR presented relatively poor performance (< 6). Their performance was particularly poor over the western part of the province, which receives most of the P during winter, compared to the stations located in the eastern part of the province that receives most of the P during the summer monsoon. Wu et al. (2019) obtained similar results for the performance of P products during winter and summer P over Yunnan province, China. This can be attributed to the fact that satellite P products generally tend to detect strong and convective P events (e.g., monsoon) more effectively than a drizzle or frozen P during the winter (Guo et al., 2015).
Similar results were obtained at annual and seasonal (winter, pre-monsoon, monsoon, and post-monsoon) temporal scales ( Figure 5) for the KGE'. MSWEPv2.2 was the best performing product during the monthly, premonsoon, monsoon, and post-monsoon temporal scales, while ERA5 performed the best during annual and winter temporal scales. CHIRPSv2 also presented relatively good performance during monsoon and post-monsoon.
However, PERSIANN-CDR presented poor performance, scoring negative KGE' values during pre-monsoon, monsoon, and post-monsoon. The poor performance of PERSIANN-CDR might be attributed to the fact that unlike the other satellite-based products (MSWEPv2.2 and CHIRPSv2) used in this study, that directly incorporate temporally finer gauge data for corrections, PERSIANN-CDR relies on the GPCP dataset to correct the monthly P bias (Beck et al., 2017b).
In general, the merged products (which combine gauges, reanalysis, and satellite-based data), that directly incorporate daily gauge data (e.g., MSWEPv2.2) performed the best, followed by ERA5, which is a reanalysis product. The ERA5 was then followed by CHIRPSv2, which also incorporates reanalysis data, and temporally coarser gauge data compared to MSWEPv2.2. Whereas Gauge-satellite-based product that indirectly incorporates gauge data (e.g., PERSIANN-CDR) through other multi-source datasets (e.g., PERSIANN-CDR is adjusted using gauge-based GPCP datasets) performed the worst. These findings are in agreement with Beck et al. (2017b), who studied the performance of 22 P products on a global scale and concluded that the products that directly incorporate daily gauge data perform better than the products that incorporate temporally coarser gauge data. In turn, these products outperform the products that indirectly incorporate gauge data.
Moreover, the products generally showed their best KGE' performance during the monthly timescale compared to the annual which is in agreement with Duan et al. (2016) and Zambrano-Bigiarini et al. (2017). The better performance of the products at monthly timescale is also due to the seasonality of the P at the monthly timescale compared to the annual which results in better r performance of the KGE'.
For the seasonal evaluation, the best KGE' performance was observed during the winter season followed by premonsoon and monsoon seasons. The poorest performance was observed during the post-monsoon season.
Generally, the wetter the season was, the better the performance of the P products (e.g., best performance during winter). These findings are in agreement with the findings of Zambrano-Bigiarini et al. (2017), who concluded that the P products generally performed better in the wetter seasons.
RMSE results were different than those of the KGE' results which is in agreement with Baez-Villanueva et al.
(2018). For example, contrary to the KGE' results where MSWEPv2.2 and ERA5 generally were the best performing products across the province and during all the temporal scales, in the case of the RMSE, PERSIANN-CDR followed by CHIRPSv2 had a relatively better performance. The RMSE performance was particularly better in the stations located in the eastern part of the province while MSWEPv2.2 followed by ERA5 showed a relatively better performance for the stations located over the western part of the study area.
According to Baez-Villanueva et al. (2018), the RMSE places more importance on mismatches of high P values (i.e., the rainy season) compared to the dry season. On the contrary, KGE' appears to be a better index to evaluate the performance of the P products due to two main reasons; (i) because its ability to reduce the excessive weight of the high P events; and (ii) because it decomposes the total performance of the P products into r (reproduces temporal dynamics), β (preserves the volume) and γ (preserves the distribution; Zambrano-Bigiarini et al., 2017).
The results of this study confirm that it is difficult to point out a single best-performing P product that performs the best across the province and during all temporal scales. However, MSWEPv2.2 performs the best, followed closely by ERA5. MSWEPv2.2 performed the best during the monthly, pre-monsoon, monsoon, and postmonsoon temporal scales, while ERA5 was the best performing product during the annual and winter temporal scales.

The capability of Budyko framework to estimate Ea_WB
As shown in figure 7, the Ea_WB were obtained using Fu's equation. Most of the Ea_WB results were not adequately comparable to those of the Ea products in most of the catchments. Although this study takes the Ea_WB as the reference Ea to evaluate the performance of the Ea products, the possibility that Fu's equation was not able to adequately estimate the Ea_WB in the defined catchments cannot be exempted. In other words, the poor agreement between Ea_WB and Ea products could be attributed either to the inherent uncertainties and poor performance of the Fu's equation or the poor performance of the Ea products. This could be investigated in future studies for the study area. However, the inadequate agreement between Ea_WB and Ea products is in agreement with the results obtained by Tekleab et al. (2010).
Furthermore, Tekleab et al. (2010) argue that poor performance of the Fu's equation could be attributed to many uncertainties including; ignoring the soil water storage in the catchments, catchment P and Ep, seasonality of the P, and non-stationary characteristics (e.g., NDVI, slope, land cover, etc.) of the catchments themselves measured by the Fu's ⍵ parameter. Gunkel and Lange (2017) Tekleab et al. (2010), Cheng et al. (2011), andGunkel andLange (2017.

Possible biases in estimating the AI and the Ea_WB
Forcing datasets have a large influence over the accuracy of the Ea estimates (Mueller et al., 2013). Water input (measured by P) and energy availability (measured by Ep ) are the two main variables affecting the Ea estimates (Zhang et al., 2016). In this study, annual P data from ERA5 (it performed the best at annual timescale) were considered the best after point-to-pixel evaluation of the P products. However, For the Ep component, this study evaluated the performance of two Ep products including ERA5 and GLEAMv3.3a based on their ability to capture the aridity in the study area. The aridity results were compared to the climate classification of Köppen-Geiger (Beck et al., 2018) and also to the literature that has documented the aridity in the study area. The Ep product that best captured the aridity appeared to be ERA5; and therefore, this study used this product to calculate the Ea_WB for the major catchments within the study area. Finally, the calculated Ea_WB was used to evaluate the performance of the Ea products. As a result, once again ERA5 Ea product appeared to outperform the GLEAMv3.3a and the SSEBopv4.0 products. In other words, this study calculated aridity using Ep and P both from ERA5 and calculated Ea_WB. The estimated Ea_WB was then used to evaluate the performance of Ea products where once again ERA5 was the best performing Ea product. Thus, the evaluation of the Ea products could be biased towards ERA5 because both P and Ep were used to evaluate the Ea products. However, there are no ground-based measurements to evaluate these products more straightforwardly.

Overall performance of the Ea products
The results of the Ea products evaluation (Figure 7) revealed that generally, the products failed to reproduce the Ea_WB, as they tended to overestimate the Ea_WB in the majority of the catchments (except for GLEAM3.3a for a few catchments which were underestimated). Bai and Liu (2018), also reported an overestimation of the Ea products across the continental China. They attributed the relative failure of the Ea products to close the water balance to inherent uncertainties in the water balance approach itself, uncertainties in the algorithms of the Ea products, and the input variables used to develop the products.
However, in line with the results obtained by Mahto and Mishra (2019), the evaluation of the Ea products revealed that ERA5 outperformed the other two products (GLEAMv3.3a and SSEBopv4.0) in most of the catchments.
Either it was the best performing product, or it showed comparable results in most of the catchments except for those located in the north-eastern part of the province where GLEAMv3.3a outperformed ERA5. The better performance of GLEAMv3.3a over these catchments could be attributed to the fact that these catchments receive relatively higher P compared to the rest of the catchments in the study area which was also observed by Bai and Liu (2018) who reported better performance of GLEAMv3.2 over the wetter catchments in China. In this way, SSEBopv4.0 also presented a relatively good performance in the north-western region. The improved performance of SSEBopv4.0 over this region could be attributed to its less complex topography as noted by Saxe et al. (2020) who found that SSEBopv4.0 performed better in regions with less pronounced topography across the continental United States. Therefore, our results suggest that it is difficult to point out a single best-performing product across the province; however, ERA5 performed the best, followed by SSEBopv4.0 and GLEAMv3.3a in most of the catchments in the study area.

Conclusions
P and Ea are the two major components of the water cycle. An accurate spatio-temporal evaluation of these components remains a global challenge. However, since the availability of satellite remote sensing data and the reanalysis datasets, it has been relatively easy to account for these climatological variables in near-real-time and make observations even in the most underdeveloped and difficult to access regions around the world. This study evaluated the performance of 4 state-of-the-art P (CHIRPSv2, MSWEPv2.2, PERSIANN-CDR, and ERA5) and 3 state-of-the-art Ea (GLEAMv3.3a, ERA5, and SSEBopv4.0) products in the arid and semi-arid province of Balochistan, Pakistan. This study evaluated the performance of these P products employing a point-to-pixel approach at the monthly, annual, and seasonal (winter, pre-monsoon, monsoon, and post-monsoon) temporal scales for 2000-2017. The KGE' along with its three individual components, i.e., r (linear correlation), β, (bias ratio) and γ (variability ratio), and the RMSE, were used as statistical indices to measure the performance of the products. Due to the unavailability of ground-based Ea data, this study employed the widely used Budyko framework and calculated Ea_WB for 12 major catchments in the study area. Subsequently, the Ea_WB was used as the reference to evaluate the performance of the Ea products.
The following paragraphs summarize the key findings of the study: -MSWEPv2.2 closely followed by ERA5 presented the best spatio-temporal KGE' performance for most of the stations and across the province. CHIRPSv2 and PERSIANN-CDR relatively showed poor KGE' performance.
-ERA5 showed the best KGE' performance at the annual temporal scale and therefore was used as the precipitation product for calculating Ea_WB in the Fu's equation.
-RMSE results were different than those of the KGE'. Generally, CHIRPSv2 and PERSIANN-CDR outperformed MSWEPv2.2 and ERA5 at the temporal scales. Similar results were obtained for the spatial scale at the wetter stations that are located in eastern Balochistan. However, MSWEPv2.2 and ERA5 presented better spatial RMSE performance in the western drier stations.
-In general, the product that directly incorporate satellite, reanalysis, and daily gauge data (i.e.,

MSWEPv2
.2) performed the best followed by the reanalysis product (i.e., ERA5), and the product that incorporate satellite, reanalysis, and monthly gauge data (i.e., CHIRPSv2). Whereas the satellite-based product that indirectly incorporate gauge data through other datasets (i.e., PERSIANN-CDR) performed the worst.
-Generally, all the P products performed the best at the monthly temporal scale compared to the annual.
-For the seasonal evaluation, the P products performed the best during winter followed by pre-monsoon, monsoon, and post-monsoon seasons.
-The results of this study confirm that it is difficult to point out a single best-performing P product that performs the best across regions with different climates and topographical gradients and at different temporal scales. However, MSWEPv2.2 performed the best followed closely by ERA5.
-The KGE' along with its three individual components (i.e., r, β, and γ) appears to be an improved performance index for evaluation of the P products compared to the RMSE due to its ability to decompose the total performance of the products into r (reproduces temporal dynamics), β (preserves the volume) and γ (preserves the distribution).
-The results for evaluation of the Ea products revealed that generally, the products failed to reproduce the Ea_WB of the selected catchments. However, ERA5 outperformed the other two productsi.e., GLEAMv3.3a and SSEBopv4.0in most of the catchments of the study area except for the northeastern wetter catchments, where GLEAMv3.3a was the best performing product. SSEBopv4.0 presented the best performance over the north-western catchment of the K.P.L. closed basin.
-The Ea_WB values obtained using Fu's equation were not in good agreement with those of the Ea products in most of the catchments in the study area. This could be attributed to two main reasons: (i) the uncertainties from the Fu's equation, i.e., the soil water storage in the catchments is negligible, the catchments are natural, and the Fu's ⍵ parameter represents the non-stationary characteristics (e.g., NDVI, slope, land cover, etc.) of the catchments, and (ii) the Ea products are not able to represent the Ea values over these data-scarce and arid catchments.
Finally, this study concludes that although there are certain underlying limitations in the evaluation results in the present study including the various assumptions discussed in the methodology sectionsparsely distributed ground P measurements and unavailability of ground Ea measurements in the study area are the major obstacles to study the water resources in the study area (Ahmed et al., 2019)the following concrete findings of this study can be withdrawn and considered as important to the concerned decision-makers or might contribute in management of the water resources in the study area: -MSWEPv2.2 and ERA5 presented a relatively good performance (0.6 < KGE' ≤ 0.8) across the province of Balochistan and during most of the timescales with a few exceptions. However, a site-specific evaluation of these products is still recommended for any hydrological applications.
-For the Ea products, ERA5 generally outperformed the other two productsi.e., GLEAMv3.3a and SSEBopv4.0in most of the catchments except the north-eastern wetter catchments, where GLEAMv3.3a performed the best, and the north-western catchment of the K.P.L. closed basin, where SSEBopv4.0 presented the best performance. However, we recommend an evaluation of these products for specific applications in catchments within the study area.
Moreover, this study recommends that future studies might calibrate the P, Ea, and Ep products with local observations to achieve improved performance of these products Dembélé et al., 2020). Furthermore, Fu's ⍵ parameter could also be estimated and calibrated locally for better performance of Budyko framework as observed by Gunkel and Lange (2017) who reported improved estimation of renewable water resources in Jordan River Basin after estimating and calibrating the ⍵ values locally. Similarly, future studies could also take into account the change in water storage ( at catchment scale (relying on remotelysensed data) as done by Bastiaanssen et al. (2014) to estimate improved multi-annual water balance at the catchment scale.

Availability of data and material
The dataset generated during and/or analyzed during this study can be categorized into two sections

Gauge data
The gauge data are available from Pakistan Meteorological Department, Meteorological Complex University Road, Karachi-75270, Pakistan but restrictions apply to the availability of these data. For the current study, the data were requested from the aforementioned department and are not publicly available.

Open source remotely sensed and reanalysis datasets
The following repositories were used to download the precipitation and evaporation data

Ethics approval
Not applicable

Consent to participate
Not applicable

Consent for publication
All Authors agree to publish this   (a) Bias ratio (β) and (b) variability ratio (γ) evaluation between monthly P products and ground-based observations at the corresponding grid cells of the P products.

Figure 4
Root Mean Square Error (RMSE) evaluation between monthly P products and ground-based observations at the corresponding grid cell.

Figure 5
Modi ed Kling-Gupta E ciency (KGE') along with its three individual components (r, β, and γ) and the Root Mean Square Error (RMSE) between satellite products and ground observations at the monthly, annual, and seasonal (winter, pre-monsoon, monsoon and post-monsoon) temporal scales. The horizontal green line indicates the optimum value of the performance indices.

Figure 7
Comparison of multi-annual means (2003-2018) of estimated EaWB from Fu's equation to Ea products.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Suplimentarymaterial.docx