In this study, we propose a set of metrics to evaluate the quality of online geocoding services available in China: the number of lost addresses, the number of addresses geocoded in incorrect units at multiple scales, and the number of addresses geocoded in correct units. To interpret the geocoding errors, each geocoded location is paired with the true location represented by its address. These point pairs are then classified into more detailed types according to their topological relationships with the mapping units (Table 1).
Table 1. Metrics used for geocoding evaluation and error types
| Match type | Topological relationship with mapping units | Metrics used to measure the error |
| --- | --- | --- |
| Geocoding error: matched with building | A, B | Match rate |
| Geocoding error: misplaced in another unit | C, D, E, F, G | Match rate |
| Lost | None | Loss rate |
Loss rate
The loss rate is the percentage of lost addresses among all input addresses. In a geocoding process, input addresses must follow a certain form so that the geocoding platform can recognize them correctly. Data loss can be caused by a nonstandard input address, an incomplete reference database, or an imperfect matching algorithm.
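The definition above can be sketched as a short function. This is a minimal illustration only; the function name `loss_rate` and its parameters are hypothetical and do not come from the study.

```python
def loss_rate(n_input: int, n_lost: int) -> float:
    """Return the percentage of input addresses for which no location was returned.

    Both argument names are illustrative assumptions, not the study's notation.
    """
    if n_input <= 0:
        raise ValueError("n_input must be a positive count")
    if not 0 <= n_lost <= n_input:
        raise ValueError("n_lost must be between 0 and n_input")
    return 100.0 * n_lost / n_input


# Made-up counts, for illustration only.
example = loss_rate(n_input=1000, n_lost=25)  # 2.5 (percent)
```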
Matched data
Matched data are input addresses that were matched with an address in the reference database. In a geocoding process, the most similar address is selected from the reference database, and its coordinates are assigned to the input address. However, the matched coordinates are often not the same as the true coordinates, meaning that some finite distance exists between the geocoded location and the true location.
For commercial addresses, geocoding errors can be separated into five categories according to their spatial relationships with the mapping units. If the true location and geocoded location are within the same building, the error does not affect the mapping accuracy. These errors are represented by types ‘A’ and ‘B’, as shown in Figure 2. The remaining geocoding errors can be separated into three cases, represented by types ‘C’, ‘D’ and ‘E’ in Figure 2. C represents cases in which the true and geocoded locations belong to two different road areas. Such an error is often caused by misidentification of a road name in the geocoding process. D represents cases in which the geocoded and true locations are in two different buildings but within the same roadside area. E represents cases in which the geocoded and true locations are in two different buildings located on opposite sides of the same road. Whether these three cases affect the mapping accuracy depends on their spatial relationships with the mapping units. If the solid line polygon in the figure is the mapping unit, none of the three error types will affect the mapping accuracy. However, if the dashed line polygon is used for analysis, the results of spatial aggregation will change.
For residential addresses, geocoding errors can be classified into four categories according to their relationships with the mapping units. If the geocoded point lies within the correct community, the address is matched at the community level. If the geocoded point lies within the correct district, the address is matched at the district level. These two cases are shown as ‘A’ and ‘B’ in Figure 3. These two errors will not affect the accuracy of building-level mapping and subsequent analysis. If the addresses are geocoded in the wrong building (‘C’ and ‘D’ errors in Figure 3), the impact on the mapping accuracy depends on their spatial relationships with the mapping units. If the mapping unit is the solid blue polygon in the figure, the error will not affect the mapping accuracy. However, if the mapping unit is the dashed line polygon, the error will change the spatial statistics at this level.
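The rule shared by both address groups, that an error matters only when the true and geocoded points fall in different mapping units, can be sketched with a plain point-in-polygon test. This is an illustrative sketch and not the study's implementation: the function names, the dictionary of units, and the planar coordinates are all assumptions, and points exactly on a polygon boundary are not handled specially.

```python
from typing import Dict, Optional, Sequence, Tuple

Point = Tuple[float, float]
Polygon = Sequence[Point]


def point_in_polygon(pt: Point, poly: Polygon) -> bool:
    """Ray-casting test: True if pt lies strictly inside the polygon."""
    x, y = pt
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through pt.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def unit_of(pt: Point, units: Dict[str, Polygon]) -> Optional[str]:
    """Return the name of the first mapping unit containing pt, or None."""
    for name, poly in units.items():
        if point_in_polygon(pt, poly):
            return name
    return None


def error_affects_mapping(true_pt: Point, geocoded_pt: Point,
                          units: Dict[str, Polygon]) -> bool:
    """True if the two points fall in different mapping units.

    Assumption: a pair lying outside every unit is treated as unaffected.
    """
    return unit_of(true_pt, units) != unit_of(geocoded_pt, units)


# Two hypothetical adjacent square mapping units.
units = {
    "u1": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "u2": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
same_unit = error_affects_mapping((0.5, 0.5), (1.5, 1.5), units)   # False
other_unit = error_affects_mapping((0.5, 0.5), (3.0, 1.0), units)  # True
```

Running the same check against a coarser or finer set of polygons reproduces the point made in the text: the same point pair can be harmless at one scale and change the aggregated counts at another.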
Acceptance level
The minimum acceptable geocoding match rate has been extensively investigated in previous research, and the thresholds reported there are adopted in this study. Ratcliffe (2004) suggested 85% as the minimum acceptable geocoding match rate. Briz-Redón et al. (2019) observed that this threshold mostly lies between 80% and 90% and suggested raising it because it is highly sensitive to the research purpose and the statistical technique used. Following these results, the minimum acceptable geocoding match rate at the community level or building level is set to 80% to 90%. If the match rate is lower than 80%, the result is considered unable to correctly reflect the spatial distribution pattern of the input data. If the match rate is between 80% and 90%, the result marginally satisfies the mapping requirement. If the match rate is higher than 90%, the result is good enough to preserve the spatial distribution pattern of the input data.
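The three bands described above can be written as a small classifier. This is a sketch under stated assumptions: the function name and the returned labels are hypothetical, and the treatment of exactly 80% and exactly 90% as "marginal" is one reading of "between 80% and 90%".

```python
def acceptance_level(match_rate_pct: float) -> str:
    """Map a match rate in percent to one of three acceptance bands.

    Assumption: the boundary values 80 and 90 are counted as "marginal".
    """
    if not 0.0 <= match_rate_pct <= 100.0:
        raise ValueError("match_rate_pct must be between 0 and 100")
    if match_rate_pct < 80.0:
        return "unacceptable"  # cannot reflect the spatial distribution pattern
    if match_rate_pct <= 90.0:
        return "marginal"      # marginally satisfies the mapping requirement
    return "good"              # preserves the spatial distribution pattern
```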
Research area and data
The built-up area of N city was selected as the research area. N city is located in the highly developed Yangtze River Delta region. Many online companies provide location-based services in this region, including navigation, food delivery and bike sharing. A large volume of addresses has been collected and stored in a reference database for public use. The highly developed online services and well-constructed database for N city make it an ideal region for geocoding research.
The research data are burglary addresses collected by the local public security bureau (PSB). In contrast to typical point of interest (POI) addresses, the addresses in the current list were reported by the victims and verified in person by law enforcement officers. This process enhanced the completeness and correctness of the data.
The dataset includes 7259 burglary addresses (Figure 4). Each address is composed of five parts: “city + district + subdistrict + community/road + building number”. These addresses are categorized into two groups. The first group includes residential addresses, represented by “community + building number”. The second group includes commercial addresses, represented by “road + building number”.
The precise coordinates of each address were collected by the PSB and manually verified using the Global Positioning System (GPS). Four online geocoding platforms, namely Baidu, Gaode, Tencent and Tianditu, were used to translate the addresses into coordinate pairs. Each coordinate pair was assigned a confidence level indicating its accuracy. WanderGIS [21] was used to transform the returned Mars coordinates into WGS84 coordinates because of its good performance in this transformation. The geocoding error of each address was calculated as the spatial distance between its geocoded coordinates and its precise coordinates.
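The text does not state which distance formula was used. As one possible illustration, the great-circle (haversine) distance between two WGS84 coordinate pairs can be computed as follows. The function name, the argument order and the spherical Earth radius are assumptions made for this sketch, and both points are assumed to have already been converted to WGS84.

```python
import math

# Mean Earth radius in metres; treating the Earth as a sphere is an
# approximation chosen for this sketch, not a detail taken from the study.
EARTH_RADIUS_M = 6371008.8


def geocoding_error_m(true_lat: float, true_lon: float,
                      geo_lat: float, geo_lon: float) -> float:
    """Haversine distance in metres between the true and geocoded points."""
    phi1 = math.radians(true_lat)
    phi2 = math.radians(geo_lat)
    dphi = phi2 - phi1
    dlam = math.radians(geo_lon - true_lon)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2.0 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))
```

For the short distances typical of geocoding errors, a projected planar distance would give nearly the same result; the haversine form is used here only because it needs no projection step.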