Since the 1960s, spatial information technology, supported by satellite positioning systems, geographic information systems and remote sensing, has gradually developed, and a large volume of data with spatial locations has been collected, processed and applied (Li and Shao, 2019). Unlike other data, spatial data exhibit spatio-temporal correlation, so classical statistical methods that assume independent variables are difficult to apply. Newton interpolation and other methods defined in geometric space are not applicable either. Tobler (1970) proposed the "First Law of Geography", which provides a theoretical basis for the analysis and application of spatial data. Spatial prediction (SP) has since been developed and improved.
At present, SP methods can be roughly divided into four categories: (1) deterministic prediction, such as inverse distance weighting (IDW) (Willmott et al., 1985); (2) geostatistical methods, such as kriging (Matheron, 1963); (3) combination methods, such as regression kriging (RK) (Mohanasundaram et al., 2020); and (4) machine learning (ML). As practical problems have grown more complex, these basic methods can no longer meet the requirements, and many improved methods have been proposed on their basis. For instance, Yan et al. (2021) apply a novel IDW algorithm with synchronized optimization of multiple parameters that accounts for anisotropy (PIDW) to two spatial data sets of different scales, and show that the method effectively improves accuracy. Kriging has also been extended, for example to Universal Kriging (UK) (Xuan Thanh et al., 2015) and Kriging with External Drift (KED) (Berndt et al., 2014). Wu et al. (2021) use geographical map spots as the basic mapping units and achieve higher accuracy than traditional regular grid-based methods. By taking spatial information into account, random forest (RF) has been developed into Random Forest for spatial data (RFsp) (Hengl et al., 2018) and Random Forest Spatial Interpolation (RFSI) (Sekulic et al., 2020). These methods have been applied in many fields, such as soil water quality, marine environment, geological exploration and air quality, but they have not been discussed for the small sample problem. Because of limitations of terrain, financial resources and other conditions, the number of meteorological stations in a region is often insufficient to study the situation of the entire region, so ML models may underfit when samples are scarce. Meanwhile, as productivity advances, social and economic life increasingly demands fine-grained and timely geospatial information. It is therefore of great practical significance to develop spatial prediction models and improve the mapping level in the case of small samples.
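To make the deterministic category above concrete, the following is a minimal sketch of basic IDW prediction. It is only an illustration of the general technique, not the implementation used in any of the cited works; the function name `idw_predict`, the default `power=2.0` and the small constant `eps` are assumptions introduced here.

```python
import numpy as np

def idw_predict(xy_obs, z_obs, xy_new, power=2.0, eps=1e-12):
    """Basic inverse distance weighting (illustrative sketch).

    xy_obs : (n_obs, 2) coordinates of observed stations
    z_obs  : (n_obs,)   observed values
    xy_new : (n_new, 2) coordinates where predictions are wanted
    power  : distance exponent (assumed default; 2 is a common choice)
    eps    : hypothetical guard against division by zero at coincident points
    """
    xy_obs = np.asarray(xy_obs, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)
    xy_new = np.asarray(xy_new, dtype=float)

    # Pairwise Euclidean distances, shape (n_new, n_obs)
    d = np.linalg.norm(xy_new[:, None, :] - xy_obs[None, :, :], axis=2)

    # Weights decay with distance, so nearer stations contribute more
    w = 1.0 / np.maximum(d, eps) ** power

    # Weighted average of the observed values for each new location
    return (w @ z_obs) / w.sum(axis=1)
```

Every station contributes to every prediction in this sketch, so with only a few stations the predicted surface is driven by a handful of values.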
Methods for small sample learning can be roughly divided into three categories: those based on data augmentation, those based on metric learning, and those based on meta-learning (Wang, 2022). Data augmentation adds new data to the original data set; the new data can be unlabeled data or synthesized labeled data. Supervised data augmentation is divided into single-sample and multi-sample data augmentation. Single-sample data augmentation operates on one sample itself, whereas multi-sample data augmentation uses multiple samples to generate new samples, as in the synthetic minority over-sampling technique (SMOTE) (Chawla et al., 2002) and mixup (Zhang et al., 2017). Combined with ML, these data augmentation methods have given rise to many methods for small sample learning.
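The two multi-sample strategies named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation of either method: the function names `mixup_pair` and `smote_like`, the default `alpha=0.2` and the default `k=3` are chosen here for illustration, and `smote_like` assumes that `X` already contains only the samples of the class to be over-sampled and that `k` is smaller than the number of samples.

```python
import numpy as np

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=None):
    """mixup-style sample: convex combination of two labeled samples."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)  # mixing coefficient in [0, 1]
    x_new = lam * np.asarray(x_i, dtype=float) + (1.0 - lam) * np.asarray(x_j, dtype=float)
    y_new = lam * y_i + (1.0 - lam) * y_j
    return x_new, y_new

def smote_like(X, k=3, n_new=5, rng=None):
    """SMOTE-style samples: interpolate between a sample and one of its
    k nearest neighbors (assumes X holds a single class and k < len(X))."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)

    # Pairwise distances; a sample must not be its own neighbor
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    neighbors = np.argsort(d, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors at random
    base = rng.integers(0, len(X), size=n_new)
    pick = neighbors[base, rng.integers(0, k, size=n_new)]

    # New point lies on the segment between base sample and neighbor
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[pick] - X[base])
```

Both functions create new samples only from existing ones, so every generated point lies on a segment between two original samples. For station data this means the synthetic points stay inside the range already covered by the observations.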
In the field of geoscience, neural networks (NN) (Lawrence et al., 1997) have been integrated with data augmentation and widely applied to the classification of hyperspectral images (HSIs). For example, Li et al. (2019) use a deep convolutional neural network (CNN) to extract pixel-block pair (PBP) features and apply decision fusion for the final label assignment. Their results demonstrate that this method outperforms the support vector machine with composite kernel (SVM-CK) (Li et al., 2019) and the multiple classifier system-based SVM with random feature selection (SVM-RFS) (Waske et al., 2010). Generative adversarial networks (GAN) have proved practical and effective in HSI classification (Zhu et al., 2018), and an improved Wasserstein GAN is more capable of generating similar radar images while achieving higher structural similarity (Lee et al., 2020). Accion et al. (2020) introduce Dual-Window Superpixel (DWS) data augmentation on the basis of CNN, and their experimental results show that the method is effective in HSI classification. For prediction, Li et al. (2022) combine window offset, scaling and rotation data augmentation with a deep CNN to predict subsurface mineral deposits. This method can efficiently predict mineral prospective areas where there are few ore deposits, but its data augmentation generates samples by observing from different angles and distances, which is not applicable to station data. Yang et al. (2022) adopt cropping operations to generate sufficient training samples and use LeNet, AlexNet and VggNet to predict mineral deposits, with LeNet outperforming the other networks. However, the cropping augmentation operates on images. Huang et al. (2020) propose spatial autocorrelation-based mixture interpolation (SABAMIN), which improves accuracy compared with traditional ML. However, it uses kriging prediction to create reliable pseudo data. In general, kriging has demanding theoretical requirements, and it is relatively difficult to fit the variogram.
To sum up, the combination of data augmentation and ML has been applied to both classification and prediction in the geosciences, but far more often to classification than to prediction. In addition, generating pseudo data by kriging interpolation is relatively complex in practice, and data augmentation methods such as cropping are not applicable to station data. To address these problems, this paper proposes Random Forest Spatial Interpolation-Modified Unsampling (RFSI-MUS). It augments the observation sample points within the RFSI model through the modified unsampling method, so as to address underfitting in the RFSI model. The modification is mainly reflected in two aspects. First, when selecting the nearest points, points are chosen from the same category after classification, so that they share similar geographic information in some respects. Second, the difference is selected separately for each category. To verify the effectiveness of the proposed method, the precipitation data set of Chongqing is used to compare RFSI-MUS with Random Forest (RF), RFSI and RFSI-Mixup.