Minimizing uncertainty in complete automation of Douglas-Peucker Algorithm for geospatial mapping

Abstract


In practical cartography, different cartographers often select different maps as the generalization result, and in most situations they must supply one or more parameters when executing an algorithm. This constant human intervention prevents generalization algorithms from being fully automated, which decreases the efficiency of map generalization and increases the uncertainty of the resulting maps (Yan 2014; Makris et al. 2021). Linear data such as contours, road networks, and rivers are the main spatial data types widely used in different fields. When producing smaller-scale maps or constructing vector map databases, it is indispensable to generalize polylines from a larger scale (e.g., 1:5000) to their generalized counterparts at a smaller scale (e.g., 1:50000). It is therefore essential to realize the completely automatic generalization of polylines.

Numerous scholars have made considerable efforts to realize automatic polyline generalization (Ai et al. 2016; Shen et al. 2018; Kronenfeld et al. 2020), and various methods have been presented (McMaster 1987), such as the Douglas-Peucker algorithm (DP) (Douglas and Peucker 1973), the Li-Openshaw algorithm (Li and Openshaw 1992), and the Bend Group algorithm (Qian HZ 2017). However, the problems are only partially solved. Among these approaches, DP remains the most well-known and effective polyline simplification algorithm in map generalization (Ramer 1972; Hershberger and Snoeyink 1992; Saalfeld 1999; Yan 2014; Sandu et al. 2015). Evaluating and minimizing the uncertainties in polyline simplification methods might validate these algorithms' acceptability, consistency, and advanced application for automation.

[...] in-depth investigations on integrating spatial similarity degree into automatic map generalization are available.
Only a few studies have addressed this (Yan 2014); that is, it remains unclear how to automatically calculate the transition-condition threshold of DP using the spatial similarity degree, which leads to constant human intervention in the execution of DP. Secondly, very little is known about the specific quantitative functional relationship between the optimal distance threshold and map scale change. Is it linear or nonlinear? Is it a linear equation with one variable or a quadratic [...] process of DP.

The optimal distance threshold determines the number of points to be deleted and the degree to which a polyline is generalized. Hence, the current research aims to find an appropriate approach for determining the quantitative relationship between the optimal distance threshold T and the map scale change C. To this end, the order in which points are deleted is first determined by reversing the order in which DP selects points. When the similarity between the intermediate result and the standard target-scale polyline reaches its maximum, the corresponding threshold λk is taken as the optimal distance threshold for generalizing the polyline from scale M1 to scale M2. Finally, the point pairs (C, T) used to fit the quantitative relation between them are recorded after multiple iterations (Figure 1).

Note to Figure 1: [...] is the initial distance threshold. ∆ is a gradually changing value that is inversely correlated with the similarity degree between LM2 and Li. For example, suppose the polyline is generalized from 1:50000 to 1:250000: if the similarity degree (S) between LM2 and Li is less than 0.6, ∆ is 0.02 km; if S is more than 0.6 and less than 0.8, ∆ is 0.01 km; and if S is more than 0.8, ∆ is 0.005 km.
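The adaptive step ∆ and the threshold search described above can be sketched as follows. This is only one possible reading of the procedure in Figure 1, not the authors' implementation: the function names, the `simplify` and `similarity` callables, the upper bound `t_max`, and the handling of S exactly equal to 0.6 or 0.8 (assigned here to the smaller step) are all assumptions, and the step values are the ones quoted for the 1:50000 to 1:250000 example.

```python
def step_for_similarity(s):
    """Return the threshold increment (km) for similarity degree s.

    Values follow the 1:50000 -> 1:250000 example; the treatment of
    s == 0.6 and s == 0.8 is an assumption (the text leaves it open).
    """
    if s < 0.6:
        return 0.02
    if s < 0.8:
        return 0.01
    return 0.005


def search_threshold(line, target, simplify, similarity, t0=0.0, t_max=1.0):
    """Scan thresholds from t0 to t_max and keep the most similar result.

    simplify(line, t) and similarity(result, target) are hypothetical
    callables supplied by the caller (e.g. DP and a similarity model).
    """
    t = t0
    best_t, best_s = t0, float("-inf")
    while t <= t_max:
        s = similarity(simplify(line, t), target)
        if s > best_s:
            best_t, best_s = t, s
        # The step shrinks as the intermediate result approaches the target.
        t += step_for_similarity(s)
    return best_t, best_s
```

The returned threshold, paired with the map scale change C of the dataset, would give one (C, T) point for the later curve fitting.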

Selection of evaluation indices and determination of their weights
The judgment of spatial similarity degree is the essence of map generalization. However, people are often uncertain about the similarity between two objects when judging the spatial similarity relationship between multi-scale maps, because it is unclear which properties should be considered in similarity assessments (Yan 2010). Therefore, among the many polyline properties, including length, distance, complexity, and sinuosity, it is necessary first to extract the major properties that influence people's judgment and then to determine their corresponding weights in order to construct a spatial similarity evaluation model.

According to previous research, 12 properties can be used to evaluate the spatial similarity between multi-scale polylines, including position (Olteanu-Raimond et al. [...] Hughes effect (Hughes 1968). The first problem to solve is how to express as much polyline information as possible using as few indicators as possible. Therefore, the absolute distance matrix D between factors is employed to overcome this problem and to measure the closeness of the indicators based on the results of the genetic algorithm.

[...] (k = 1, 2, …, 12; i = 1, 2, …, 9) represents the kth factor that affects the ith data group in Table 5 [...] generally not allowed to move polylines on maps in the process of map generalization, i.e., the orientation between polylines is not changed after map generalization. Therefore, Sinuosity, Distance, Shape, and Buffer-overlapped area are finally chosen as the evaluation indices.
Distance is a fundamental concept in geospatial science. Previous research shows that, compared with Euclidean distance (Peuquet 1992), Bottleneck distance (Efrat et al. 2001), and Fréchet distance (Nayyeri et al. 2015), Hausdorff distance is one of the most widely used distances for spatial objects in GIS, but it is sensitive to the shape of the objects, especially to outliers. Besides, it does not satisfy the change law of similarity (Li et al. 2018). Therefore, the Mean-Hausdorff distance (MHD) is selected as the distance similarity metric to obtain more stable and accurate results (Deng et al. 2007).

Shape, on the other hand, is viewed as the most crucial geometric factor describing planar curves. The essence of shape similarity is to judge the degree of coincidence between polylines. Therefore, Li (2018) [...] which is more suitable for the comparison between multi-scale polylines of the same target.

[...] inversely correlated with the similarity degree between L1 and L2. If the angle is clockwise, αi is positive, and if the angle is counterclockwise, [...]′ is negative. The relation between [...] and [...]′ is that [...]
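To illustrate why a mean-based variant is less sensitive to outliers than the classical Hausdorff distance, the following minimal sketch compares the two on polyline vertices. It uses one common discrete definition (the mean of nearest-vertex distances, taken in both directions); the exact MHD formulation of Deng et al. (2007) may differ, and the function names and sample coordinates are hypothetical.

```python
import math


def _nearest(p, pts):
    """Distance from point p to its nearest vertex in pts."""
    return min(math.dist(p, q) for q in pts)


def hausdorff(a, b):
    """Classical discrete Hausdorff distance between two vertex lists."""
    return max(max(_nearest(p, b) for p in a),
               max(_nearest(q, a) for q in b))


def mean_hausdorff(a, b):
    """Mean-based variant: average nearest distance, worst of both directions.

    Assumption: vertices only, no densification along segments.
    """
    mean_ab = sum(_nearest(p, b) for p in a) / len(a)
    mean_ba = sum(_nearest(q, a) for q in b) / len(b)
    return max(mean_ab, mean_ba)


# Two polylines that agree except for one outlying vertex.
line_a = [(0, 0), (1, 0), (2, 0), (3, 0)]
line_b = [(0, 0), (1, 0), (2, 0), (3, 4)]
print(hausdorff(line_a, line_b), mean_hausdorff(line_a, line_b))
```

In this toy case the single outlying vertex fully determines the classical value, while the mean-based value is damped by the three coincident vertices.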

The buffer-overlapped area can reflect both the horizontal and the vertical distance between two polyline entities. However, the choice of buffer radius is critical. According to the visual perceptibility of position deviation on the map, the buffer radius is generally set to 1~2 times the minimum distance (0.2 mm) between two points on the target-scale map; e.g., if the target scale is 1:250000, the buffer radius is 50 m (0.2 mm × 250000).

However, to study how spatial similarity varies with map scale change, map scale change must be the only independent variable, and the same set of weights should be used for different groups of datasets to eliminate the influence of weight variation, which may adversely affect the results. Therefore, the nine groups of weights of the four stability factors are normalized to a single group of weights (Table 1) [...] The weights of sinuosity, distance, shape, and buffer-overlapped area are 0.23, 0.32, 0.11, and 0.34, respectively; their ranking is consistent with the weights obtained from expert opinions based on spatial cognition experiments or well-adopted criteria (Chehreghan et al. 2016). A large weight indicates that the corresponding factor strongly influences the judgment of spatial similarity relations, and vice versa. Thus, according to the adopted weights, the buffer-overlapped area has the greatest impact, followed by distance.
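The buffer-radius rule quoted above is a simple unit conversion from map distance to ground distance. A minimal sketch, with a hypothetical function name and a `factor` parameter standing for the 1~2 multiplier:

```python
def buffer_radius_m(scale_denominator, min_distance_mm=0.2, factor=1.0):
    """Ground buffer radius in metres for a target map scale 1:scale_denominator.

    min_distance_mm is the minimum map distance between two points (0.2 mm
    in the text); factor is the 1~2 multiplier (an illustrative parameter).
    """
    # Map millimetres times the scale denominator gives ground millimetres.
    return factor * min_distance_mm * scale_denominator / 1000.0


print(buffer_radius_m(250000))              # target scale 1:250000
print(buffer_radius_m(250000, factor=2.0))  # upper end of the 1~2 range
```

With the default factor this reproduces the 50 m radius given for the 1:250000 target scale.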

According to the definition and description of each index in Section 2.2, some factors are positive indicators, such as distance, shape, and sinuosity, while others are negative, i.e., a larger [...] indicates less similarity between polylines, while [...] behaves in the opposite way. Moreover, the units and dimensions of the factors differ and their values vary greatly, which also strongly affects the results. Therefore, to eliminate the effects of negative factors and different units, the similarity degree of each factor should be normalized to [0, 1] before determining the spatial similarity evaluation model. This is achieved using range standardization, as follows:

[...]

where Sim_{Li,Lj}^{Pk} is the similarity degree of the factor Pk after normalization, a larger Sim_{Li,Lj}^{Pk} indicates higher spatial similarity, and N is the number of groups of datasets. Finally, Equation (3) (Yan 2014) can be employed to calculate the spatial similarity degree between polylines based on the four factors.

[...] (3)

where Sim(L1, L2) ∈ (0, 1], Wp ∈ [0, 1], and n is the total number of impact factors, which is 4.

[...] ArcGIS 10.6. The GF-2 image is obtained by fusing the panchromatic and multispectral images, whose spatial resolutions are 1 m and 4 m, respectively. Multi-scale datasets of rivers and roads at 1:50000, 1:100000, 1:250000, 1:500000, and 1:1000000 were derived from current products of the National Geomatics Center of China (NGCC).

It is qualitatively evident that the greater the map scale change, the smaller the spatial similarity degree between multi-scale polylines. However, no specific quantitative relationship between them is known so far, which hampers the complete automation of the algorithm.
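The range standardization and the weighted combination of the four factors described above can be sketched as follows. The exact forms of the standardization equation and of Equation (3) are not legible in this text, so the code assumes the usual min-max form and a plain weighted sum; only the four weights (0.23, 0.32, 0.11, 0.34) come from the paper, and all names and the handling of a zero range are hypothetical.

```python
# Weights reported in the text for the four selected factors.
WEIGHTS = {"sinuosity": 0.23, "distance": 0.32,
           "shape": 0.11, "buffer_overlap": 0.34}


def range_standardize(values, positive=True):
    """Min-max scale raw factor values to [0, 1].

    For a negative indicator (larger raw value means less similar) the
    scale is flipped so that 1 always means "most similar".
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: identical values are treated as fully similar.
        return [1.0] * len(values)
    if positive:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]


def weighted_similarity(factor_sims, weights=WEIGHTS):
    """Assumed form of Equation (3): sum of weight times factor similarity."""
    return sum(weights[name] * factor_sims[name] for name in weights)


sims = {"sinuosity": 0.5, "distance": 1.0, "shape": 0.0, "buffer_overlap": 0.5}
print(weighted_similarity(sims))
```

Because the four weights sum to 1, the combined value stays in [0, 1] whenever each standardized factor does.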
Five candidate functions can fill this gap by describing such a trend: linear, polynomial, power, logarithmic, and exponential functions. A polynomial of order three or higher can have up to n−2 inflection points (where n is the order), so the curve may have more than one monotonically decreasing interval. Ultimately, among the polynomials, only second-order polynomials are considered.

R², as a goodness-of-fit indicator, is often used to compare candidate functions; a larger R² indicates a better-fitting curve. As can be seen from Figure 5, the R² of the power function is closest to 1 (R² = 0.8152), i.e., it achieves the best fit among all candidates. Therefore, Equation (5) [...] where a > 0, −1 < b < 0; Sim(Li, Lj) ∈ (0, 1]; C(Li, Lj) ∈ [1, +∞).
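As a hedged illustration of how a power function could be fitted and scored with R², the sketch below assumes that Equation (5) has the form Sim = a·C^b (consistent with the stated constraints a > 0 and −1 < b < 0, but not confirmed by the legible text). It fits by linear least squares in log-log space, which is a common shortcut and not necessarily the fitting procedure used in the paper; the data are synthetic.

```python
import math


def fit_power(cs, sims):
    """Fit sim = a * c**b by ordinary least squares on (ln c, ln sim)."""
    xs = [math.log(c) for c in cs]
    ys = [math.log(s) for s in sims]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Synthetic (C, Sim) pairs generated from a known power law.
cs = [1, 2, 5, 10, 20]
sims = [0.9 * c ** -0.3 for c in cs]
a, b = fit_power(cs, sims)
r2 = r_squared(sims, [a * c ** b for c in cs])
```

Fitting in log space weights relative rather than absolute errors, so the resulting R² can differ slightly from that of a direct nonlinear fit on real data.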

To validate the reliability of the proposed model, this paper selected five groups of sampled multi-scale vector polylines from the previous research of Yan (2015) and Chehreghan (2016). Figure 6 shows the fitting results of the proposed model, and Table 2 compares its accuracy with that of the existing evaluation model (Yan 2014). Compared with the previous model, the fitting accuracy of the evaluation model proposed in this paper is improved by 4.16%~11.52%. However, in contrast to Chehreghan's conclusion, there is a nonlinear power-function relationship between spatial similarity and map scale change (Figure 6(B), (D)). Although the trend of the three groups of multi-scale polylines (Figure 6(A), (C), (E)) is the same, they cannot be described using the same power function, whether they are the same ground objects of different types or different ground objects of the same type, which differs somewhat from the conclusions of Yan and Chehreghan.

To further verify whether these quantitative relationships can be fitted using the same function with the same coefficients, this paper selects 38 groups of datasets from different plains; the curve fitting results are shown in Figure 7. The comparative analysis of the fitting results in Figure 7 clearly shows that the relationship between spatial similarity degree and map scale change for different groups of datasets from the same geographical feature (plain) can be described using the same power function curve with the same parameters, e.g., in Figure [...]

The essence of the DP algorithm is to compare the distance from the intermediate points to the straight line connecting the start point and endpoint with the distance threshold. Therefore, this part aims to find the optimal distance threshold, which determines the degree to which the polyline is simplified, corresponding to map scale change, in order to realize the completely automatic generalization of polylines.

It is qualitatively evident that the greater the map scale change, the greater the optimal distance threshold of DP. Hence, three candidate functions can be employed to fit the trend between λ and C: polynomial, linear, and logarithmic functions. Higher-order polynomials, e.g., cubic polynomials, can have up to n−1 turning points, which means that the curve is not necessarily monotonic. Hence, among the polynomials, only the second-order polynomial (x ∈ [0, −b/(2a)]) satisfies the tendency that the dependent variable increases with the independent variable. Ultimately, only the quadratic polynomial, the linear equation, and the logarithmic function are considered (Equation (6)).

According to the principle described in Section 2.1, the point pairs (C, T) of optimal distance threshold and map scale change are recorded continuously as the polyline is gradually generalized. The fitting results are shown in Figure 8.
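For readers unfamiliar with DP, the comparison against the distance threshold can be sketched as below. This is a minimal, generic implementation of the classical algorithm (Douglas and Peucker 1973) in planar coordinates, written iteratively with an explicit stack; it is not the authors' code, and the function names and sample coordinates are illustrative.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the straight line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    length = math.hypot(dx, dy)
    if length == 0.0:
        # Degenerate baseline: fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / length


def douglas_peucker(points, threshold):
    """Return the vertices kept by DP for the given distance threshold."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        first, last = stack.pop()
        max_d, idx = 0.0, None
        for i in range(first + 1, last):
            d = perpendicular_distance(points[i], points[first], points[last])
            if d > max_d:
                max_d, idx = d, i
        # Keep the farthest vertex only if it exceeds the threshold,
        # then process both halves of the split polyline.
        if idx is not None and max_d > threshold:
            keep[idx] = True
            stack.append((first, idx))
            stack.append((idx, last))
    return [p for p, k in zip(points, keep) if k]


line = [(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(line, 0.5))
```

A larger threshold removes more vertices, which is why the threshold has to grow with the map scale change.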

As can be seen from Figure 8, the R² of the unary quadratic function is closest to 1 (R² = 0.9585), i.e., it achieves the best fit among all the candidates. Therefore, the unary quadratic function (Equation (7)) is chosen as the quantitative model describing the relationship between T and C.

\[ T = \begin{cases} 0, & C = 1 \\ aC^{2} + bC + c, & a < 0,\ C \in (1, \max\{C_i\}] \end{cases} \qquad (7) \]

To further verify whether the quantitative relationships between T and C can be fitted using the same function with the same coefficients, the experimental datasets of this part are divided into two groups: control and experimental. The former consists of multi-scale datasets from different geographical feature areas, and the quadratic functions are shown in Figure 9(a). The latter consists of multi-scale datasets from the same geographical feature area, and the quadratic functions are shown in Figure 9(b).

[...] Consequently, these results demonstrate that it is unreasonable to describe all five groups of datasets using a single quadratic equation with the same coefficients; R² is only 0.4636. Therefore, it is impossible to realize the completely automatic generalization of all polylines simultaneously using the same optimal distance threshold.

In Figure 9(b), the first group of experimental data consists of four groups of multi-scale polylines from Shanxi Mountain (Figure 9(b1)), and the other consists of 19 groups of multi-scale polylines from the Lower Yangtze River Plain (Figure 9(b2)). The R² values of the two groups of datasets are not less than 0.8521, with the best in the first group (R² = 0.9012). Hence, the fitting accuracy is satisfactory, i.e., it is feasible to realize the completely automatic generalization of each of the two groups of sampling datasets.
These results demonstrate that it is reasonable to realize the complete automation of the DP generalization algorithm for polylines from the same geographical feature area. Therefore, polylines from the same geographical feature area are the objects of the following research.
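The quadratic model of Equation (7) can be fitted to recorded (C, T) pairs by ordinary least squares. The sketch below solves the 3×3 normal equations directly; it is an illustration only, the coefficients and data are synthetic (the fitted values of the paper are not reproduced here), and the function names, as well as defining C as the ratio of the target to the original scale denominator, are assumptions.

```python
def scale_change(original_denominator, target_denominator):
    """Map scale change C, assumed to be the ratio of scale denominators."""
    return target_denominator / original_denominator


def fit_quadratic(cs, ts):
    """Least-squares fit of t = a*c**2 + b*c + c0 via the normal equations."""
    s = [sum(c ** k for c in cs) for k in range(5)]  # s[k] = sum of c**k
    m = [
        [s[4], s[3], s[2], sum(t * c * c for c, t in zip(cs, ts))],
        [s[3], s[2], s[1], sum(t * c for c, t in zip(cs, ts))],
        [s[2], s[1], s[0], sum(ts)],
    ]
    n = 3
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * coef[k] for k in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return tuple(coef)


def optimal_threshold(c, a, b, c0):
    """Evaluate Equation (7): zero at C = 1, quadratic otherwise."""
    return 0.0 if c == 1 else a * c * c + b * c + c0


# Synthetic (C, T) pairs from a known quadratic with a < 0.
cs = [2, 4, 6, 8, 10, 12]
ts = [-0.001 * c * c + 0.03 * c - 0.02 for c in cs]
a, b, c0 = fit_quadratic(cs, ts)
```

Once coefficients have been fitted for a geographical feature area, `optimal_threshold(scale_change(m1, m2), a, b, c0)` would give the threshold to pass to DP without manual input.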

Taking the multi-scale polylines from Shanxi Mountain as an example, the theoretical optimal distance threshold is determined once the original and target map scales are given (Table 3). For example, if the original and target map scales are 1:50000 and 1:250000, respectively, the map scale change C is 5, and the optimal distance threshold T is 0.0095 km. Afterward, six groups of experimental datasets are automatically generalized using DP with the corresponding theoretical optimal distance threshold.

[...] Table 4 also reveals that, compared with LM2, the △LR of the samples is less than zero and the △SR is mostly greater than zero; the errors of both the compression amount (△LR) and the sinuosity degree change rate (△SR) of the roads and rivers are the smallest. This indicates that the compression amount (LR) and the sinuosity degree change rate (SR) of L are consistent with those of LM2, that the proposed approach better maintains the geometric features of roads and rivers, and that its data compression effect is superior.

As shown in Table 4, the SMD is mostly greater than zero, but the maximum SMD of L is only 0.0408, i.e., the SMD index also indicates that the simplification results of the proposed method better maintain the local position accuracy of polylines. In summary, the proposed method effectively maintains the local and global shape characteristics of roads and rivers.

[...] affecting map generalization: scale and regional geographic characteristics. This conclusion illustrates that regional geographic characteristics influence the determination of the quantitative relationship between the optimal distance threshold and map scale change. Therefore, it is reasonable to generalize polylines from the same geographical area using the same single optimal distance threshold.
[...] proposed method is better than the standard target-scale polyline, i.e., the geometric feature retention of the proposed method is superior. This conclusion is consistent with the conclusions drawn by Wu (2008), who compared the Douglas-Peucker, Li-Openshaw, Circle, and asymptotic algorithms. Therefore, it can be concluded that the proposed method better maintains the geometric features and shape characteristics of polylines overall, in terms of both geometric and position accuracy. In this paper, multi-scale individual polylines are taken as the research objects.

Discussions
However, the generalization of polyline group objects depends not only on the characteristics of the features themselves but also on the surrounding ground features. Because of this limitation of the DP algorithm, the simplification results may self-intersect or intersect with adjacent features that are close by; e.g., two adjacent contours separated by a small distance may intersect after being simplified by the DP algorithm. Future work should conduct extensive experiments on polyline group objects, e.g., contour clusters, road networks, and reticulate drainage.

This paper proposed a model that describes the quantitative relationship between the optimal distance threshold and map scale change in order to minimize the uncertainties in the complete automation of the Douglas-Peucker algorithm. It can calculate the theoretical optimal distance threshold of polylines from the same geographical feature area taking map scale change as the only independent variable, and vice versa. The results indicate that the complete automation of the Douglas-Peucker algorithm is feasible for polylines from the same geographical feature area. The findings of the current study not only provide an idea and a method for realizing the complete automation of map generalization by seeking the quantitative relationship between the algorithm parameter and map scale change, but also facilitate the automation of polyline generalization algorithms and systems and improve the accuracy of generalized maps. The proposed model optimizes algorithm threshold settings, reducing the inevitable differences between automatically simplified results and those ratified by cartographers. Reduced uncertainty in automating the Douglas-Peucker algorithm would support improved spatial data matching and the establishment of vector map databases.