Modelling Green Volume Using Sentinel-1, -2, PALSAR-2 Satellite Data and Machine Learning for Urban and Semi-Urban Areas in Germany

Urban Green Infrastructure (UGI) provides ecosystem services such as temperature cooling and is of major importance for climate change adaptation. Green Volume (GV) describes the 3-D space occupied by vegetation and is highly useful for the assessment of UGI. This research uses Sentinel-2 (S-2) optical data, vegetation indices (VIs), Sentinel-1 (S-1) and PALSAR-2 (P-2) radar data to build machine learning models for yearly GV estimation on large scales. Our study compares random and stratified sampling of reference data, assesses the performance of different machine learning algorithms and tests model transferability by independent validation. The results indicate that stratified sampling of training data leads to improved accuracies when compared to random sampling. While the Gradient Tree Boost (GTB) and Random Forest (RF) algorithms show generally similar performance, Support Vector Machine (SVM) exhibits considerably greater model error. The results suggest RF to be the most robust classifier overall, achieving the highest accuracies for independent and inter-annual validation. Furthermore, modelling GV based on S-2 features considerably outperforms using only S-1 or P-2 based features. Moreover, the study finds that underestimation of large GV magnitudes in urban forests constitutes the biggest source of model error. Overall, modelled GV explains around 79% of the variability in reference GV at 10 m resolution and over 90% when aggregated to 100 m resolution. The research shows that accurately modelling GV is possible using openly available satellite data. Resulting GV predictions can be useful for environmental management by providing valuable information for climate change adaptation, environmental monitoring and change detection.


Introduction
Urbanisation and climate change are considered global megatrends that will continue to affect life on this planet (Retief et al. 2016). The United Nations suggest that already today, 55% of the world's population live in urban areas and that this number is estimated to rise to 68% by 2050 (United Nations, Department of Economic and Social Affairs, Population Division 2018). Human induced climate change leads to continuously rising average temperatures and poses risks through increased climate and weather extremes including floods, heatwaves, and droughts (IPCC 2021). Both phenomena further increase pressures on the natural environment, including biodiversity and ecosystem resilience. Thus, urban planning and environmental management need to consider these megatrends and their interconnected effects (Retief et al. 2016;Gill et al. 2007;Mathey et al. 2011).
Looking at climate change adaptation in urban contexts, green spaces function as urban green infrastructure (UGI) that provides a variety of ecosystem services (Gill et al. 2007; Mathey et al. 2011; Frick et al. 2020; Palliwoda et al. 2020; Matzarakis 2001). Studies show that greater abundance of UGI, including increased green volume and number of green roofs, has strong positive effects on reducing peak summer temperatures in cities (Frick et al. 2020; Gill et al. 2007). Reasons for this include cooling effects through evapotranspiration (Rocha et al. 2022) as well as shading and lower heat storage when compared to built-up areas (Mathey et al. 2011). Furthermore, UGI is relevant for redirecting air flow in cities, providing the opportunity to optimise local climate, for example by directing cool air flows to residential areas in the summer (Wende et al. 2010).
Understanding the 3D space that vegetation occupies within cities is essential for evaluating these ecosystem services. However, as Casalegno et al. (2017) point out, much of the research on urban green has focused on its presence, without considering volumetric information. Shedding light on the space that vegetation occupies within cities can enhance urban planning by improving visual representations of urban green, enabling detailed assessments of environmental effects (such as air temperature and quality), and facilitating the monitoring of green infrastructure. Cooling effects of urban vegetation, for example, greatly depend on its spatial distribution, height and leaf area and are therefore best assessed using volumetric data (Rocha et al. 2022). Combining volumetric vegetation data with additional indicators, such as thermal stress, can help identify areas in urgent need of adaptation measures (Frick et al. 2020). Therefore, researching the potential of remote sensing to accurately estimate 3D data on urban vegetation on a yearly basis can provide important insights for the field of environmental planning and management.
Furthermore, according to Wolch et al. (2014), access to and use of urban green infrastructure (UGI) in Germany is unevenly distributed across different social groups. To promote social sustainability in urban planning, Kabisch and Haase (2014) suggest that city planning measures should take into account UGI access for less affluent communities. Therefore, it is crucial to consider detailed information on the distribution of UGI in urban planning.
One such indicator is the green volume (GV) in m³/m² which describes the 3-D space that vegetation objects such as trees and shrubs occupy (Großmann et al. 1983). Meinel et al. (2006) propose GV to be a useful indicator for application in urban ecology. Moreover, Frick et al. (2020) establish a relationship between GV, surface sealing degree and surface temperature and identify areas where additional GV is crucial for reducing peak summer temperatures.
Using remote sensing, GV can be estimated using digital orthophotos and surface models. Such information is highly valuable. However, data availability may be low and processing costs high. The aim of this study is to make use of openly available satellite data and machine learning algorithms to model GV and allow for yearly predictions on large scales. Considering the benefits of volumetric UGI information, annual GV predictions constitute highly valuable information for planning and monitoring.
Over the past decade, estimation of GV at small to medium scales has become more feasible, specifically through normalised digital surface models (nDSM) derived from Light Detection and Ranging (LiDAR) data or image-based stereo matching algorithms. Looking at existing research, Hecht et al. (2008) use last pulse only LiDAR data to estimate GV in the city of Dresden, Germany. Furthermore, Huang et al. (2013) use LiDAR data and high-resolution aerial images to identify trees and fit pseudo cylinders to estimate GV in the Lujiazui region in China. Casalegno et al. (2017) model GV using waveform LiDAR and, in line with Anderson et al. (2016), point to the advantage of using waveform LiDAR for GV estimation as it is better able to capture understory vegetation. This gain, however, comes at the expense of higher data processing efforts. Frick and Tervooren (2019) use high-resolution LiDAR data for GV determination in the city of Potsdam, Germany. In their study, a cuboid representation of vegetation is assumed and certain percentages of GV are subtracted to account for missing volume around tree stems. Furthermore, GV is predicted remarkably well by building a model using parameters derived from Sentinel-2. This supports the general feasibility of satellite-based modelling of GV (Frick and Tervooren 2019).
Considering additional usability of GV data, Frick et al. (2020) use satellite derived data on surface temperatures and sealing degree to identify GV deficits in highly sealed urban areas. Furthermore, Lehmann et al. (2014) use 3D urban vegetation information to provide detailed classifications of urban vegetation structure types. From such data, valuable information on ecosystem services of urban green spaces can be derived (Mathey et al. 2021). Other studies use object-based image analysis approaches and high-resolution LiDAR data and orthophotos to map categories of greenspaces within urban areas (Degerickx et al. 2020;Banzhaf et al. 2020). Lee et al. (2021) show that high resolution UAV imagery can be used to create DSMs to retrieve 3D information on urban green with high accuracies. While it is shown that GV can be estimated well using digital surface models and high resolution orthophotos, such data is not always available and acquisition costs may be high. Therefore, this study aims to model GV using machine learning algorithms and openly available satellite imagery to allow for GV estimation on large scales.
While no other studies have attempted large scale modelling of GV, modelling of related biophysical parameters such as aboveground biomass (AGB) and tree height has been undertaken. For modelling environmental indicators, the application of decision tree based methods using bagging and boosting approaches has been widespread (Abdi 2020; Li et al. 2020;Pham et al. 2020;Wagle et al. 2020). Navarro et al. (2019), for example, use regression modelling to estimate AGB using Sentinel-1 radar and Sentinel-2 optical data. Li et al. (2020) model AGB in a Chinese national forest using Landsat 8 optical data in combination with Sentinel-1A data using Linear Regression (LR), Random Forest and Extreme Gradient Boosting (XGBoost) algorithms. Similarly, Pham et al. (2020) use extreme gradient boosting for AGB estimation in North Vietnam drawing on Sentinel-2, Sentinel-1 and ALOS-2 PALSAR-2 satellite data. Both studies find boosting approaches to outperform RF in modelling accuracy. Further, combining features from all three data sources outperforms modelling using only some of them. This highlights the potential of radar data in increasing performance in AGB estimation.
Other studies assess the possibility of using remote sensing data to model forest variables including tree height, an important indicator for GV estimation. Antropov et al. (2018) use TanDEM-X and ALOS-PALSAR radar data to estimate forest tree height with an RMSE of around 2.8 m using interferometric SAR images. Astola et al. (2019) compare Sentinel-2 and Landsat-8 for forest variable prediction, including tree height. In their study, Sentinel-2 outperforms Landsat-8 for forest parameter estimation. Similarly, Lang et al. (2019) construct S-2 based tree height models for Switzerland and Gabon, finding RMSEs of 3.4 m and 5.6 m, respectively. Despite varying error magnitudes, existing literature suggests considerable potential of satellite data for the estimation of biophysical parameters such as AGB and tree height.
This study aims to model GV using machine learning in conjunction with openly available remote sensing data. Such data may be used as a standardised indicator to facilitate environmental planning in urban settings and beyond. Advantages of the applied methodology include the large-scale applicability, cost efficiency, yearly repeatability, and methodological consistency of GV estimation. As part of our ongoing research on urban climate change adaptation, the generated data is planned to be made available in an online mapping tool to provide low-barrier public access. To the best of our knowledge, the research fills a relevant gap since no previous study has attempted large scale modelling of GV.
More specifically, we assess the possibility of modelling Green Volume (GV) in urban and semi urban areas based on Sentinel-1, Sentinel-2 and ALOS-2/PALSAR-2 data and derived indices using different machine learning algorithms.
The following sub-questions will be answered:
How do different sampling strategies of reference data influence the model performance?
Which of the satellite sensors is most suitable for modelling green volume?
Is geographical and temporal transferability of the model predictions possible?

Study Area
For this study, five reference areas were selected due to their distribution throughout Germany, structural heterogeneity, and data availability. The GV reference data originates from the year 2018 and was partially generated as part of previous research (Frick and Tervooren 2019;Frick et al. 2020). Figure 1 shows the collection of areas. This collection constitutes the pool for training data generation for model building. Areas include: the municipality of Leipzig, Saxony; the municipality of Potsdam, Brandenburg; the municipality of Saalfeld, Thuringia; a rural area around the Schmalkalden-Meiningen district, Thuringia; the municipality of Schwäbisch Gmünd, Baden-Württemberg; and the municipality of Solingen, North Rhine-Westphalia. Furthermore, GV for the city-state of Berlin for the year 2020 was used as an independent validation dataset to test geographical and inter-annual transferability of the model.
GV reference data is generated by classifying areas into different land use classes using CIR aerial images, the NDVI and nDSMs. After classification, GV constants are set for pixels classified as sealed surface/water (i.e. 0 m³/m²), grassland (i.e. 0.5 m³/m²) and cropland (i.e. 1 m³/m²). Constants are chosen for agricultural cropland and grassland to represent average vegetation height, since vegetation heights change throughout the year. For GV estimation of the classes shrubs, shrubs and trees, and trees, the pixel size is multiplied by the height value represented in the nDSM. For larger trees, fixed percentages of GV are subtracted to account for lower vegetation volume around stems. The described method and parts of the reference data stem from previous research further described in Frick and Tervooren (2019).
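The per-pixel rule described above can be illustrated with a minimal Python sketch. The class labels and the `stem_reduction` parameter are hypothetical names introduced for illustration; the actual reduction percentages for larger trees are not given here and are documented in Frick and Tervooren (2019).

```python
# Fixed GV constants (m³/m²) for classes without height-based estimation,
# as stated in the text. The dictionary keys are illustrative labels.
GV_CONSTANTS = {"sealed_or_water": 0.0, "grassland": 0.5, "cropland": 1.0}

# Classes whose GV is derived from the nDSM height (illustrative labels).
HEIGHT_BASED = {"shrubs", "shrubs_and_trees", "trees"}


def reference_gv(land_class, ndsm_height_m=0.0, stem_reduction=0.0):
    """Return green volume per unit area (m³/m²) for one pixel.

    stem_reduction is a HYPOTHETICAL fraction in [0, 1) removed for larger
    trees; the real percentages are not specified in this text.
    """
    if land_class in GV_CONSTANTS:
        return GV_CONSTANTS[land_class]
    if land_class in HEIGHT_BASED:
        if not 0.0 <= stem_reduction < 1.0:
            raise ValueError("stem_reduction must be in [0, 1)")
        # Volume = pixel area * height, so volume per m² equals the height.
        return max(ndsm_height_m, 0.0) * (1.0 - stem_reduction)
    raise ValueError(f"unknown land-use class: {land_class}")
```

Because GV is expressed in m³/m², the pixel area cancels out and the height-based value reduces to the nDSM height, scaled by the assumed reduction for larger trees.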

Data
This study uses optical satellite data from Sentinel-2 (S-2) and derived vegetation indices as well as radar satellite data from C-band Sentinel-1 (S-1) and L-band PALSAR-2 (P-2) SAR sensors. For S-2, all atmospherically corrected surface reflectance images between the months of April and September 2018 with a cloud percentage below 60% were selected, resulting in a collection of 3572 images. A cloud mask was applied to each image using the S2_CLOUD_PROBABILITY image collection and workflow provided by Google Earth Engine (GEE) (Gorelick et al. 2017). Furthermore, several vegetation indices (VIs) were calculated for each image and added as new bands. VIs were chosen according to literature on estimating biophysical parameters (Navarro et al. 2019; Pham et al. 2020, see Appendix: Table 5). Furthermore, two image composites were built containing the median values of all bands and vegetation indices for two timeframes. The first timeframe for median building ranges from 15th of April to 1st of July 2018. For much of the temperate vegetation considered, it captures the phenological stages of budding and leaf growth up until very high levels of photosynthetic activity. The second timeframe from 1st of July to 15th of September 2018 captures the phenological state of slow continuous growth up until leaf senescence, colour change and harvest. The timeframes were chosen to include varying stages of phenological development while at the same time allowing for the presence of sufficient cloud-free S-2 data, which is necessary for consistent feature generation.
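To make the compositing step concrete, the following Python sketch computes a cloud-masked median of a vegetation index for a single pixel. It is only an illustration under stated assumptions: NDVI serves as a familiar example index (the indices actually used are listed in the appendix), and the masking threshold `max_cloud_prob` is a hypothetical value, not one reported in this study.

```python
from statistics import median


def ndvi(nir, red):
    """Normalised Difference Vegetation Index for one pixel."""
    total = nir + red
    return (nir - red) / total if total != 0 else 0.0


def median_composite(observations, max_cloud_prob=50):
    """Median NDVI of the cloud-free observations of one pixel.

    observations: list of (nir, red, cloud_probability_percent) tuples,
        one per image acquired within the timeframe.
    max_cloud_prob: HYPOTHETICAL masking threshold, not from the paper.
    Returns None if every observation is masked.
    """
    clear = [ndvi(nir, red) for nir, red, cloud in observations
             if cloud < max_cloud_prob]
    return median(clear) if clear else None
```

Using the median rather than the mean keeps residual cloud or shadow contamination in individual scenes from dominating the composite value.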
For S-1, image composites for the same timeframes were built. The S1 SAR GRD collection provided in GEE was used, as it contains radiometrically calibrated, and terrain corrected scenes. First, an image collection containing all available S-1 scenes between the months of April and September 2018 that were captured in Interferometric Wide Swath mode, containing both VV and VH polarisations at 10 m resolution was created. A total of 1938 S-1 scenes were collected. An additional mask for edge noise removal was applied. Then, a ratio index of VV and VH polarisation was calculated (VH/VV) and added as a band to each scene. Next, images were split according to ascending vs. descending orbit. Finally, multi-temporal image stacks were built. In the case of S-1, mean values were calculated for the two previously mentioned timeframes. Radar backscatter images are less influenced by cloudy conditions which reduces the risk of outliers when compared to S-2 imagery and makes compositing based on mean values feasible. The generated features were added as bands to the previously created image stack.
For ALOS-2/PALSAR-2 (P-2) feature generation, the 25 m 2018 PALSAR global mosaic provided in GEE was used. In addition to the available HH like-polarised and HV cross-polarised bands, a ratio of both was built and added as an HV/HH band. Again, each feature was added as a band to the image stacks of S-2 and S-1 features.
Overall, a number of 27 features are generated, containing eight S-2 bands, ten derived VIs, six S-1 features and three P-2 features. Since S-1 and S-2 features are generated for two timeframes, the total number of features available for model building is 51. See the appendix for a list of all features.
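The feature counts above can be checked with a few lines of Python. The variable names are illustrative; only the counts are taken from the text.

```python
# Base feature counts per source, as given in the text.
s2_bands, s2_indices, s1_features, p2_features = 8, 10, 6, 3

base_features = s2_bands + s2_indices + s1_features + p2_features

# S-2 and S-1 features exist once per timeframe; the annual P-2 mosaic
# contributes its features only once.
timeframes = 2
total_features = (s2_bands + s2_indices + s1_features) * timeframes + p2_features

print(base_features, total_features)  # 27 51
```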

Methodology
In this study we compared three machine learning (ML) algorithms to predict Green Volume (GV), namely Random Forest (Breiman 2001), Gradient Tree Boost (Friedman 2001, 2002), and Support Vector Machines (Cortes and Vapnik 1995; Drucker et al. 1996). All three approaches have been used successfully to monitor environmental variables with remote sensing data. However, since one of the algorithms might be more accurate in solving this specific task, we assessed the performance of each algorithm.
Random Forest is a popular ensemble learning algorithm that creates multiple decision trees and combines their results to improve accuracy and reduce overfitting. We used Random Forest as it is widely applied in the realm of environmental modelling using remote sensing data and frequently reported to be a robust classifier for multidimensional data (Abdi 2020; Li et al. 2020; Wagle et al. 2020). GTB is another ensemble learning algorithm that builds decision trees sequentially to improve the model's prediction accuracy. We included the algorithm since various researchers have found GTB to outperform RF on tasks related to modelling of biophysical parameters (Li et al. 2020; Pham et al. 2020). SVMs solve problems by separating the input data using hyperplanes in a multidimensional space. They have proven effective in high-dimensional datasets and have been used in various applications related to modelling of biophysical parameters using remote sensing data (Navarro et al. 2019). For most of the feature generation and model building, the cloud computing application GEE (Gorelick et al. 2017) was used.
The precise derivation of GV values is dependent on a suitable selection of training data, as models aim to create a good, yet generalised fit to the given data. Therefore, this study aimed to compare different sampling strategies and their effects on model performance. High resolution GV rasters were aggregated to 10 m resolution to fit the size of Sentinel-2 pixels. From this, a selection of different samples was built to assess the influence of sampling strategies on model performance. Besides random vs. stratified sampling, taking samples only from areas classified as urban in the Corine Land Cover (CLC) 2018 dataset (Copernicus 2018) vs. from all areas was tested.
First, a random sample was drawn from all reference areas. A second random sample was taken only from urban areas as defined in the CLC 2018 dataset. The dataset contains land cover classifications based on data from the Copernicus Land Monitoring Service. All polygons of group '1', containing artificial surfaces such as urban fabric, industrial and artificially vegetated areas within cities, were used (Copernicus 2018). Next, two stratified random samples were drawn, again one from all GV areas and the other from CLC 2018 urban areas. Stratified sampling was applied because the distribution of the reference data is highly skewed towards low amounts of GV: the reference data includes large sealed surfaces, grassland and agricultural areas. Stratification allows for greater inclusion of shrubs, trees and larger trees in the training data. As depicted in Fig. 2, GV rasters were classified into eight GV classes, representing the strata from which equal numbers of points were sampled.
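The idea of drawing equal numbers of points from GV strata can be sketched as follows. This is a simplified illustration, not the workflow used in GEE: the class limits in `CLASS_LIMITS` are hypothetical placeholders (the actual eight classes are shown in Fig. 2), and pixels are represented as plain `(pixel_id, gv)` tuples.

```python
import random
from bisect import bisect_right

# HYPOTHETICAL upper class limits in m³/m²; seven limits give eight strata.
CLASS_LIMITS = [0.5, 1, 2, 4, 6, 8, 12]


def stratified_sample(pixels, n_per_stratum, seed=0):
    """Draw up to n_per_stratum pixels from each GV stratum.

    pixels: list of (pixel_id, gv) tuples.
    """
    strata = {i: [] for i in range(len(CLASS_LIMITS) + 1)}
    for pixel in pixels:
        # bisect_right returns the index of the stratum the GV value falls in.
        strata[bisect_right(CLASS_LIMITS, pixel[1])].append(pixel)

    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        k = min(n_per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

With a skewed input dominated by low GV values, a purely random draw would contain few tree pixels, whereas this scheme gives every stratum the same weight in the training data as long as enough pixels are available.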
Finally, drawing on insights gained throughout the research, a spatially 'harmonised stratified sample' was generated. For this sample, two adjustments were made. First, reference GV rasters were spatially harmonised with S-2 pixels, so that pixel alignment of reference data and S-2 images is given. This was done to potentially reduce mixed pixel effects and study the effect of harmonised pixel alignment on model performance. Second, two reference areas were excluded. The reference area of Thuringia was excluded to assess whether the omission of large, homogeneous coniferous forests in training data would aid model performance. Further, Schwaebisch Gmuend was excluded due to the presence of minor errors found in the GV reference data.
In total, as shown in Fig. 2, five sets of sample points (with n ≈ 11,000 each) were created: a 'random sample of all areas', a 'random sample of urban areas', a 'stratified sample of all areas', a 'stratified sample of urban areas' and a spatially 'harmonised stratified sample'. Each sample was split into 70% training and 30% validation points. For internal model validation, a set of mixed validation points (n = 4442) containing an equal number of points from each set of validation points was created.
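A minimal sketch of the 70/30 split and of pooling an equal number of validation points from several samples is given below. The function names and the fixed random seed are assumptions made for illustration only.

```python
import random


def split_sample(points, train_fraction=0.7, seed=0):
    """Shuffle a copy of points and split it into training and validation."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mixed_validation(validation_sets, n_per_set, seed=0):
    """Combine an equal number of points from each validation set."""
    rng = random.Random(seed)
    mixed = []
    for points in validation_sets:
        pool = list(points)
        mixed.extend(rng.sample(pool, min(n_per_set, len(pool))))
    return mixed
```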
For each of the five different training samples, a Random Forest (RF) model was built. Each training sample contained all 51 features and reference GV values. RF models were built with 500 trees and additional hyperparameters were left at standard values provided in the GEE classifier (variables per split = square root of the number of variables; min leaf population = 1; bag fraction = 0.5; max nodes = no limit). After model building, GV was predicted for the whole area of Germany and GV predictions extracted by location to the validation points.
For evaluating accuracies, the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Bias Error (MBE) and Coefficient of determination (R²) were calculated. Each error metric applies a different approach to quantify how well the predicted values fit the actual ground truth data. The RMSE, MAE and MBE draw on the difference between model predictions and the actual ground truth. The RMSE is very commonly used and, compared to the MAE, penalises large errors more heavily than smaller ones. While being more sensitive to the influence of large errors, it is also more heavily influenced by potential outliers. While neither RMSE nor MAE considers the direction of error, the MBE shows the mean bias of the model and indicates whether, on average, the model over- or underestimates the target variable. R² measures the proportion of variance in the ground truth that is explained by the predictions. It ranges from 0 to 1 and is a commonly used metric to evaluate how well the model fits the data.
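The four metrics can be written down compactly in Python. The sketch below rests on two assumptions that the text does not state explicitly: the residual is taken as predicted minus reference, so that a negative MBE indicates underestimation, and R² is computed as one minus the ratio of residual to total sum of squares (another common definition is the squared correlation coefficient).

```python
from math import sqrt


def error_metrics(reference, predicted):
    """Return RMSE, MAE, MBE and R² for paired reference/predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    # ASSUMED sign convention: residual = predicted - reference.
    residuals = [p - r for r, p in zip(reference, predicted)]
    ss_res = sum(e * e for e in residuals)
    rmse = sqrt(ss_res / n)
    mae = sum(abs(e) for e in residuals) / n
    mbe = sum(residuals) / n
    mean_ref = sum(reference) / n
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    # ASSUMED definition: R² = 1 - SS_res / SS_tot.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "mbe": mbe, "r2": r2}
```

For example, with reference values `[0, 2, 4, 6]` and predictions `[1, 2, 3, 2]`, the single large miss of 4 m³/m² dominates the RMSE (about 2.12) far more than the MAE (1.5), while the MBE of -1.0 shows the overall underestimation.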
The best performing sample was then used to build models with the GTB and SVM algorithms and to compare their performance. The GTB classifier was built with 500 trees and most hyperparameter values were again left at default (shrinkage = 0.005; sampling rate = 0.7; max leaf nodes = no limit). The 'huber' loss function was chosen for cost optimisation as it provided the best modelling results. For SVM, the SVM type 'EPSILON_SVR' was used, which is suitable for regression problems. Other hyperparameters were left at standard (kernel type = linear; shrinking heuristics = true; cost parameter = 1; termination epsilon criterion = 0.001; loss function epsilon = 0.1). Additionally, RF models were built using different sets of features to allow for comparison of S-2, S-1 and P-2 modelling performances.
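For readers unfamiliar with the Huber loss, the following sketch shows its textbook form for a single residual. The transition parameter `delta` and its default value are assumptions for illustration; how the GEE implementation parameterises the loss is not stated here.

```python
def huber_loss(residual, delta=1.0):
    """Textbook Huber loss for a single residual.

    delta marks the transition between the quadratic and the linear
    regime; its value here is an ASSUMPTION, not taken from the paper.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a          # quadratic for small residuals
    return delta * (a - 0.5 * delta)  # linear for large residuals
```

Since the loss grows only linearly for large residuals, individual outliers in the training data influence the fit less than they would under a squared error loss.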
For all applied algorithms, preliminary testing of different sets of hyperparameters was undertaken. Differences in outcomes were generally small, with the above-described parameters yielding the best results. The number of trees for random forest was defined at 500, as an increase in trees yielded similar results while being computationally more expensive and holding more potential for overfitting.
To assess geographical and temporal transferability of the model, an independent validation was done using GV reference data from the city state of Berlin for 2020. The previously best performing samples and classifiers were used to predict GV for a 2020 composite image containing all S-1, S-2 and PALSAR-2 features of the city of Berlin. For validation, 5,000 pixels were randomly sampled in the area of Berlin and GV values were extracted by location and error metrics calculated. To assess prediction accuracies at different spatial resolutions, GV reference and predictions were aggregated to 100 m*100 m and the same validation steps were undertaken. In addition, model performance at planning block level was assessed. For the city of Berlin, the ISU5 statistical blocks represent a basis for spatial and The sum of GV was calculated for each ISU5 statistical block and the difference between mean values for reference vs. prediction differences were calculated and visualised.
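The aggregation from 10 m to 100 m pixels can be illustrated with a simple block average. This is a sketch under the assumption that aggregation is done by taking the mean of non-overlapping 10 x 10 pixel blocks, which keeps the unit m³/m²; the text does not specify the aggregation function.

```python
def aggregate_mean(grid, factor=10):
    """Average non-overlapping factor x factor blocks of a 2-D grid.

    grid: list of equally long rows of GV values (m³/m²) at 10 m resolution.
    Rows and columns that do not fill a complete block are dropped.
    """
    n_rows = len(grid) // factor
    n_cols = len(grid[0]) // factor if grid else 0
    out = []
    for br in range(n_rows):
        row_out = []
        for bc in range(n_cols):
            block = [grid[r][c]
                     for r in range(br * factor, (br + 1) * factor)
                     for c in range(bc * factor, (bc + 1) * factor)]
            row_out.append(sum(block) / len(block))
        out.append(row_out)
    return out
```

Applying the same function to the reference and the predicted grids yields paired 100 m values that can be passed to the same error metrics as the 10 m data.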

Results
Depending on the utilised sampling strategy, sampling type and sensor, achieved results range from an RMSE of 2.54 to 5.24 m³/m² with R² between 0.181 and 0.775. This indicates that a careful selection of sampling, sensor and algorithm is of major importance. When the best approaches are implemented, estimation of GV with a high accuracy is possible. Table 1 shows the performance of the different training samples using RF. Considering the error metrics, we see that the samples taken from all available areas generally outperform samples from CLC2018 urban areas. Furthermore, stratified sampling outperforms random sampling for both urban and all areas. Overall, the harmonised stratified sample performs best with an RMSE of 2.54 m³/m², followed by the stratified sample of all areas with an RMSE of 2.83 m³/m² and the random sample of all areas with an RMSE of 3.03 m³/m². With regard to the MAE, however, the stratified sample performs only slightly better than the random sample taken from all areas. RMSE increases by approximately 0.3 m³/m² when comparing samples taken from all areas to samples taken from only urban areas. Looking at the MBE, there is an average overestimation of GV for three of the five samples, with errors ranging from −0.75 m³/m² for the stratified sample of urban areas to 0.70 m³/m² for the random sample of urban areas. Considering the coefficient of determination, the harmonised stratified sample performs highest with R² of 0.773 and the random sample of urban areas lowest with R² of 0.699.
Table 2 shows the performance of different algorithms applied to the harmonised stratified sample. We observe overall best performance of the Gradient Tree Boost (GTB) and Random Forest (RF) algorithms. RF exhibits the lowest RMSE with 2.54 m³/m² vs. 2.59 m³/m² for GTB. The GTB based model, however, exhibits the highest overall R² of 0.774 vs. R² of 0.773 for RF. Generally, both algorithms predict GV similarly well. The SVM model exhibits considerably greater errors with an RMSE of 4.16 m³/m². All models show a slight negative mean bias of −0.19 m³/m² to −0.75 m³/m².
Table 3 depicts the modelling results for different combinations of sensors and features. The best model fit is achieved by S-2 bands and VIs, with an RMSE of 2.54 m³/m², followed by S-2 bands only with 2.56 m³/m² and S-2 VIs with 2.58 m³/m². All these combinations exhibit comparable performance and, interestingly, achieved accuracies are very similar to the ones using all available sensors. When considering the R² value, excluding S-1 and P-2 features even achieves a slightly higher R² of 0.774 vs. R² of 0.773 for the RF model using all features. Furthermore, using S-2 bands slightly outperforms using S-2 based VIs with an R² of 0.770 vs. R² of 0.766. When looking at radar satellite features, a noticeable decrease in predictive performance can be observed, with an RMSE of 5.24 m³/m² for P-2 features and an RMSE of 4.02 m³/m² for S-1. While combining both sensors slightly increases modelling performance, the achieved error, with an RMSE of 3.95 m³/m², is still considerably greater than when using S-2 based features.
Looking at the independent validation, Table 4 shows that the models achieve good accuracies when transferred to a new year and geographical context. Using RF and the harmonised stratified sample achieves the best overall model fit with an RMSE of 2.64 m³/m² and R² of 0.775, followed by GTB and the harmonised stratified sample with an RMSE of 2.72 m³/m² and R² of 0.774. As for internal validation, greater errors are present when modelling is based on the stratified sample of all areas. Using this sample, interestingly, GTB shows the highest error magnitudes with an RMSE of 3.20 m³/m².
Generally, RF shows a slightly better model performance when compared to GTB, especially when looking at models built from the stratified sample of all areas. Overall, error magnitudes for RF are comparable to the ones found in internal validation.
In Fig. 3 a scatter plot of reference vs. predicted GV for independent validation at 100 m*100 m spatial resolution is presented. Towards higher magnitudes of GV, the prediction error generally increases, with reference values being greater than predicted ones. This mostly affects greenspaces, e.g., urban forests, with large amounts of green. While reference GV per pixel peaks around 20 m³/m², the maximum predicted values only reach around 15 m³/m². Furthermore, there is a noticeable spread in prediction error around middle GV magnitudes with validation points within green spaces showing a greater spread than those in residential areas. The lowest errors appear to be present in low magnitudes of GV, where residential areas make up most of the displayed data.
When spatially mapping the prediction error of the model, as shown in Fig. 4, an underestimation of GV in large forest areas and forested urban parks can be observed. This is consistent with the general underestimation of large magnitudes of GV presented in Fig. 3. Overestimation of GV appears to occur mainly in highly vegetated residential areas towards the periphery of the city as well as in allotment gardens. The most exact estimations are found in residential areas with overall lower magnitudes of GV surrounding the center of Berlin. Further, differences are rather small for cropland and grassland, with cropland appearing to exhibit the most accurate estimations.

Sampling Strategies
The presented study confirms the possibility of modelling green volume (GV) using freely available Sentinel-1, -2 and ALOS-2-PALSAR-2 satellite data. In the first step, different reference data sampling strategies were compared. The results indicate that stratified sampling of reference data leads to higher accuracies in GV models when compared to random sampling. This may be explained by the fact that higher magnitudes of GV, i.e., shrubs and trees, are better represented in training data obtained by stratified sampling. Further, samples taken only from urban areas were compared with samples taken from all reference areas considering their resulting model performance. The results indicate that models built with samples taken from all areas outperform ones taken only from urban areas. This may again be attributed to the inclusion of more reference data with high magnitudes of GV.
Furthermore, the effect of spatially harmonising reference GV rasters with Sentinel-2 pixels was studied. It was found that the overall best performing sample was a stratified sample drawn from reference areas that were aggregated to fit S-2 pixels. The performance increase may be brought about by reducing mixed pixel effects, which is certainly relevant when looking at urban greenery at 10 m spatial resolution. Looking further into spatial accuracy of multi-temporal S-2 data, errors and offsets in co-registration need to be considered (Stumpf et al. 2018). Improving multitemporal geometric consistency by applying enhanced co-registration approaches may further improve modelling accuracies and allow for more consistent predictions over time (Rufin et al. 2021). Generally, we find adaptation of sampling and spatial harmonisation of reference data to be cost effective ways of considerably improving predictive performance of GV models.
Fig. 3 Reference vs. predicted GV at 100 m*100 m using RF and the harmonised stratified sample

Machine Learning Algorithms
When looking at the comparison of machine learning algorithms for modelling GV, this research shows similar performance of the boosting and bagging approaches, i.e. Gradient Tree Boost (GTB) and Random Forest (RF). While GTB achieved the highest R² in internal validation, RF exhibits the lowest RMSE. Overall, the best models achieve an R² of around 0.774 and an RMSE of 2.54 m³/m². Looking at error metrics from independent validation, where models were applied to a completely new city and year, RF slightly outperforms GTB. While again, both algorithms can predict GV reasonably well with R² of around 0.775, GTB shows slightly lower consistency in modelling results. The fact that GTB achieves the highest R² in internal validation, yet shows slightly less consistent results in independent validation, may suggest that the algorithm marginally overfits the training data when compared with RF. The issue of GTB being able to achieve high accuracies, yet running the risk of overfitting, is also reported in other literature (Li et al. 2020). Looking at the performance of Support Vector Machine (SVM), a substantially lower model performance with R² of 0.558 was found. This may be explained by the fact that noisy training data and outliers can negatively influence SVM's performance for regression. Several authors suggest remedies such as cleaning of training data and novel approaches such as robust least squares SVMs (Sabzekar and Hasheminejad 2021; Yang et al. 2004, 2014; You et al. 2011). It is possible that the training data includes a considerable number of outliers and noise due to, e.g., remaining mixed pixels, shadows, small errors in the reference data themselves and the general diversity of land cover in the reference areas. Furthermore, noisy data may have been present in radar-based features, as is exemplified by the overall lower performance of radar-based models. This may have resulted in overall increased difficulties for the SVM algorithm to properly separate the data. To further investigate this issue, selection and pre-processing of training data and SVM hyperparameter tuning may be advisable for future research.
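For readers less familiar with the two error metrics discussed here, the following minimal sketch shows how RMSE and R² can be computed from paired reference and predicted values. It assumes R² is the coefficient of determination (1 minus the ratio of residual to total sum of squares); the text does not state which definition was used, and the four sample values are made up for illustration.

```python
import numpy as np


def rmse(reference, predicted):
    """Root mean squared error, in the units of the inputs (here m³/m²)."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - reference) ** 2)))


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    reference = np.asarray(reference, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((reference - predicted) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Made-up values: the lowest GV is overestimated, the highest underestimated
reference = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 4.0, 5.0]

print(f"RMSE = {rmse(reference, predicted):.3f} m³/m²")
print(f"R²   = {r_squared(reference, predicted):.3f}")
```

Because RMSE is expressed in the units of GV while R² is relative to the variance of the reference data, two models can rank differently on the two metrics, as observed for GTB and RF in internal validation.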
While no other studies have attempted large-scale modelling of GV, modelling of related biophysical parameters such as aboveground biomass (AGB) and tree height has been undertaken. Looking at research on AGB estimation, Li et al. (2020) and Pham et al. (2020) use boosting and bagging algorithms to model AGB based on optical and radar satellite data, achieving accuracies ranging from R² of 0.488 to R² of 0.75. In contrast to this study, the applied boosting algorithms outperform RF by a considerable margin. Castillo et al. (2017) use Sentinel-1 and -2 data to model AGB and report an R value of 0.75. They find that modelling AGB using only S-1 achieves results similar to those obtained with S-2. In contrast, this study found considerably lower modelling performance of S-1 for GV when compared to S-2.
Considering large-scale tree canopy height modelling, Astola et al. (2019) find an R² of 0.54 using a regression tree method and R² of 0.73 using a neural network for S-2 based tree height estimation. Interestingly, the neural network applied in the mentioned study clearly outperforms the regression tree approach, pointing to the potential of deep learning for vegetation variable prediction. Lang et al. (2019) report RMSEs from 3.4 m to 5.6 m in their study on country-wide tree height modelling using S-2. The error can be considered similar in magnitude to the RMSE of 2.54 m³/m² found in this study. Finally, looking at small-scale modelling of GV, Frick and Tervooren (2019) model GV on city level using multi-temporal S-2 NDVI parameters and report an R² of 0.85.
Overall, studies on modelling AGB, tree height and GV report accuracies of R² ~0.7, with a range of R² of ~0.5 to 0.85 (Astola et al. 2019; Castillo et al. 2017; Lang et al. 2019; Li et al. 2020; Pham et al. 2020). This study lies within the reported range with a comparably high R² value of 0.774. In the reviewed literature, boosting algorithms are mostly reported to outperform RF. This clear advantage of boosting algorithms is not supported by this study, where RF was found to be the overall slightly more robust classifier. While SVM tends to exhibit lower performance metrics in other studies as well, the difference found in this study is among the largest, potentially being related to noise present in training data.

Sensor Performance
When comparing modelling performance for different satellite sensors, this study finds S-2 bands and derived vegetation indices (VIs) to exhibit the highest predictive performance. In fact, excluding Sentinel-1 (S-1) and PALSAR-2 (P-2) data did not decrease overall modelling performance when compared to using all features. Using S-1 features alone led to substantially greater error when compared to S-2, with RMSE of 4.02 m³/m² vs. 2.54 m³/m², and P-2 achieved the overall lowest modelling performance with RMSE of 5.24 m³/m². While combining both radar sensors slightly improves modelling performance, errors are still higher than for S-2 based modelling. Looking at studies on modelling AGB, Pham et al. (2020) find a greater positive effect of radar sensors on modelling performance. Navarro et al. (2019) find S-1 to slightly outperform S-2, and Castillo et al. (2017) find overall similar performance of S-1 and S-2 in their studies of mapping AGB. Furthermore, various studies point to the capabilities of L-band radar data, such as P-2, for predicting AGB (Huang et al. 2018; Carreiras et al. 2013). This appears not to be the case for GV estimation, which may be explained by the penetration depth of L-band radar being helpful for AGB estimation, yet less so for GV, where information on canopy structure is of great importance. Generally, radar sensors show lower predictive performance than in studies on AGB, which may be related to penetration depth and the fact that most AGB studies were conducted in a non-urban context, where backscatter signals are influenced less by built-up structures in close proximity to vegetation.
To sum up, we find that S-2 bands and VIs show highest predictive performance in modelling GV by a considerable margin, followed by S-1 and P-2. Further research may be necessary to support these results as comparison to AGB modelling studies may not be as viable in this case. Additionally, future research may consider the potential of modelling GV based on radar interferometry.
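The logic of such a sensor comparison can be sketched as a feature-group ablation: the same model is fitted on different subsets of features and scored on held-out data. The example below is a sketch under strong assumptions. It swaps in ordinary least squares for the tree-based models used in this study, and it relies on synthetic data in which the three feature groups are increasingly noisy views of one underlying signal, so the ranking of the groups is built in by construction and does not reproduce the study's results. All names, noise levels and column counts are hypothetical.

```python
import numpy as np


def holdout_rmse(X, y, n_train):
    """Fit ordinary least squares on the first n_train rows, score on the rest."""
    A = np.column_stack([X, np.ones(len(X))])  # append an intercept column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    residuals = A[n_train:] @ coef - y[n_train:]
    return float(np.sqrt(np.mean(residuals ** 2)))


rng = np.random.default_rng(42)
n = 4_000
latent = rng.normal(size=n)  # synthetic stand-in for the vegetation signal


def noisy_view(noise_sd, n_features):
    """Features that observe the latent signal with Gaussian noise."""
    return latent[:, None] + rng.normal(scale=noise_sd, size=(n, n_features))


# Hypothetical feature groups; noise levels are chosen for illustration only
features = {
    "S-2": noisy_view(0.3, 3),
    "S-1": noisy_view(1.5, 2),
    "P-2": noisy_view(4.0, 2),
}
gv = 4.0 * latent + rng.normal(scale=1.0, size=n)  # synthetic target

subsets = {
    "S-2 only": ["S-2"],
    "S-1 only": ["S-1"],
    "P-2 only": ["P-2"],
    "S-1 + P-2": ["S-1", "P-2"],
    "all": ["S-2", "S-1", "P-2"],
}
scores = {
    name: holdout_rmse(np.hstack([features[g] for g in groups]), gv, n_train=2_000)
    for name, groups in subsets.items()
}
for name, score in scores.items():
    print(f"{name:10s} RMSE = {score:.2f}")
```

When the radar groups carry no information beyond what the optical group already provides, adding them changes the held-out error very little, which is the pattern this kind of ablation is designed to reveal.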

Model Transferability
Considering independent validation with reference data of the city of Berlin for the year 2020, the model showed good transferability to a new spatial and temporal context. The Random Forest algorithm paired with the harmonised stratified sample performed best with R² of 0.775, which is slightly higher than in internal validation. Interestingly, the stratified sample of all areas performed worse when compared to internal validation, especially when looking at the GTB algorithm with R² = 0.744. It appears that RF, in this case, is the slightly more robust classifier, exhibiting greater inter-annual prediction accuracy. This may be explained by the fact that the models differ in which features are valued most in prediction. Some features may be more consistent over the years than others. VIs, for example, may be more consistent across years than band values. This may partly explain the greater performance decrease for GTB, which tunes the model towards the most predictive features in the boosting process. In this case, certain GTB models may have slightly overfitted on the 2018 training data, leading to worse predictions for 2020. Additionally, changes in weather and climate between the years can lead to differing amounts of available satellite images and differing phenological cycles. Specifically, the reference year of 2018 exhibited a very hot and dry growing season for Germany (Zscheischler and Fischer 2020). This may also affect inter-annual model transferability. Overall, the independent validation supports good model transferability across space and reasonably good inter-annual transferability.
To sum up, the applied algorithms can predict GV reasonably well for an unseen spatial context and year. Here, RF appears to be slightly more robust when compared to GTB. The results support the feasibility of using predictions for environmental monitoring. Further research may diversify reference data to include multiple years and extend the existing inter-annual validation to find the most robust set of features and algorithms for monitoring GV.

Planning Block Scale and Relevance for Environmental Management
For the validation at 100 m*100 m spatial resolution, the study finds high agreement between predicted and reference GV with R² of 0.905 for the 2020 Berlin prediction. This suggests high usability of data at building block scale and for planning applications. Validation was further undertaken at ISU5 planning block level to identify spatial variability in model errors. We find that underestimation of GV tends to be most pronounced in urban forests. A reason for the underestimation of high GV magnitudes may be the saturation of S-2 band values and VIs that is frequently observed in large, dense vegetation (Sun et al. 2020). In line with other literature, RF variable importance suggests that some vegetation indices, such as the specific leaf area vegetation index (SLAVI), saturate less and therefore exhibit higher predictive performance (Mutanga and Skidmore 2004; Kaplan and Rozenstein 2021; Pham et al. 2020).
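The higher agreement at 100 m compared with 10 m is what one would expect when part of the pixel-level error averages out within a block. The following sketch illustrates this effect on synthetic rasters. It assumes prediction errors that are spatially uncorrelated, which is a simplification: systematic errors such as the underestimation in urban forests do not cancel by aggregation. The grid size, value ranges and noise levels are arbitrary assumptions.

```python
import numpy as np


def block_mean(raster, factor):
    """Average non-overlapping factor x factor blocks of a 2-D raster."""
    rows, cols = raster.shape
    blocks = raster.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.mean(axis=(1, 3))


def r_squared(reference, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((reference - predicted) ** 2)
    ss_tot = np.sum((reference - reference.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


rng = np.random.default_rng(1)
# Synthetic 10 m reference GV: every 100 m block has its own mean level
block_level = rng.uniform(0.0, 10.0, size=(30, 30))
reference_10m = np.kron(block_level, np.ones((10, 10)))
reference_10m += rng.normal(scale=1.0, size=reference_10m.shape)
# Synthetic prediction: reference plus spatially uncorrelated pixel error
predicted_10m = reference_10m + rng.normal(scale=2.5, size=reference_10m.shape)

r2_10m = r_squared(reference_10m, predicted_10m)
r2_100m = r_squared(block_mean(reference_10m, 10), block_mean(predicted_10m, 10))
print(f"R² at 10 m: {r2_10m:.3f}, R² at 100 m: {r2_100m:.3f}")
```

In this idealised setting the random error largely cancels within each block of 100 pixels while the variation between blocks is preserved, so R² rises sharply after aggregation.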
Overestimation tends to occur in allotment gardens and very green residential areas, often in the periphery of the city. This may tentatively be attributed to a great abundance of vegetation of rather low height. The model may be less capable of capturing actual vegetation height in such areas and may therefore overestimate GV for highly photosynthetically active areas with low vegetation. Best model fit is observed in inner-city areas and on cropland. The variability of prediction error is generally lower in residential areas when compared to urban green spaces, which shows that the model is generally well able to predict GV for inner-city areas with a high degree of surface sealing. We see that achieved accuracies at planning block scale are considerably high and the derived data may therefore provide valuable and consistent information for urban and environmental planning. It is to be noted, however, that higher variability is observed in urban green spaces and allotment gardens, with an underestimation of GV in urban forests.
The fact that high accuracies are achieved for inner city areas and at planning block scale points towards good usability of our results for environmental planning. At planning block level, yearly GV predictions may help track changes in GV over time and identify areas where decreasing vegetation cover may lead to heat or air quality related hazards. The generated data may be useful as a basis for local climate modelling, where 3D data on vegetation structures is of major importance. Additionally, the monitoring of vegetation cover that is not located on public land may be improved as planners often lack sufficient data in these areas. Considering planning for socially sustainable cities, the generated data may be used to assess the supply of green spaces and vegetation related benefits to citizens.
Future research may look for ways of improving modelling performance in high magnitudes of GV. Furthermore, the use of deep learning architectures such as convolutional neural networks and time-series based approaches for modelling GV may provide valuable insights. Considering additional data sources, global LiDAR datasets such as Global Ecosystem Dynamics Investigation (GEDI) (Adam et al. 2020;Chen et al. 2021) and surface models derived from InSAR (Braun 2021;Solberg et al. 2017) may offer valuable sources of information for modelling GV.

Conclusion
This research successfully uses optical and radar satellite data from Sentinel-2 (S-2), Sentinel-1 (S-1) and PALSAR-2 (P-2) and machine learning algorithms to model Green Volume (GV) in urban and semi-urban areas. We find that stratified sampling of reference data considerably outperformed random sampling in terms of model building. Furthermore, spatial harmonisation of reference data pixels with S-2 pixels led to considerable improvements in modelling performance, which may be explained by the reduction in mixed pixel effects on model building. Overall, the results suggest that noticeable performance improvements can be brought about by optimising the sampling process when ample reference data is available.
Considering performance of different machine learning algorithms, Gradient Tree Boost (GTB) and Random Forest (RF) exhibit similar results in internal validation. Support Vector Machine (SVM) shows considerably lower modelling performance. When modelling GV based on single sensors, this study finds that excluding S-1 and P-2 data did not decrease modelling performance. Hence, in our study, radar data appears to have no positive influence on overall modelling performance. Looking at the independent validation and transferability of the model, high transferability is given when applied to the city of Berlin for the year 2020. Interestingly, RF slightly outperforms GTB and may therefore be the overall more robust classifier for the issue at hand. The results further suggest that a considerable amount of model error stems from the underestimation of high magnitudes of GV as found in e.g., urban forests.
Overall, the modelled GV explains around 79% of the variability in reference GV at 10 m resolution and over 90% when aggregated to 100 m resolution. With the achieved accuracies, especially at building block level, annual GV modelling may provide useful information for environmental monitoring and change detection. The generated data may be used in conjunction with other indicators to identify areas where climate change adaptation measures are needed most. Incorporating the insights of this research, the Urban Green Eye project makes satellite-derived indicators for urban climate change adaptation available online for free use. Germany-wide GV estimations will soon be part of the published data.