a) Principle
The GAP method uses the gap index, GAP, to search for a threshold of outliers. We define GAP for each adjacent data pair in a dataset whose data are sorted in ascending order of value. Let the values of a pair be A and B and assume 0 < A ≤ B. Then, GAP = B / A (GAP ≥ 1 by definition). The core idea of the GAP method is to regard the gap of the data pair with the maximum GAP in the upper range of the dataset as the threshold for outliers. The complementary idea is to repeat the removal of outliers based on the core idea until the maximum GAP falls below a certain level. Hereafter, ‘maxGAP’ denotes the maximum GAP.
b) The nature of GAP
To understand why this is effective, we will elucidate the nature of GAP.
Let us assume a dataset with ten positive integers in ascending order: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10). This dataset has nine GAPs: (2.00, 1.50, 1.33, 1.25, 1.20, 1.17, 1.14, 1.13, 1.11). Of these, maxGAP is 2.00, which lies between 1 and 2 in the original dataset. This dataset is an arithmetic sequence, in which the differences between adjacent data are the same. In an arithmetic sequence of positive numbers, maxGAP always lies between the first and second minimum values. In contrast, all GAPs in a geometric sequence, for example, (1, 2, 4, 8, 16, 32, 64, 128, 256, 512), are the same by definition. Next, let us assume a dataset with a wide range of positive real numbers in ascending order. In its bottom range, even a small value difference within an adjacent data pair can produce a large GAP; in the top range, however, a large GAP requires a relatively large difference.
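The GAPs of the two example sequences can be reproduced with a short Python sketch (the helper name `gaps` is ours for illustration, not part of the method's implementation):

```python
def gaps(sorted_values):
    """GAP = B / A for each adjacent ascending pair (A, B) with A > 0."""
    return [b / a for a, b in zip(sorted_values, sorted_values[1:])]

# arithmetic sequence 1..10: maxGAP (2.0) sits between the two smallest values
arith = gaps(list(range(1, 11)))
# geometric sequence 1, 2, 4, ..., 512: every GAP equals the common ratio
geom = gaps([2 ** k for k in range(10)])
```

Printing `arith` rounded to two decimals reproduces the nine GAPs listed above.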
The nature of GAP has three implications. First, GAP is useful as a tool to search for an outlier threshold only in the upper range of a dataset. Second, we can expect a large GAP in the top range only when that range is sparsely populated, which is precisely the situation in which outliers occur. As the range becomes lower and more densely populated, GAPs decrease and become less significant as indicators of a dichotomisation threshold. Third, because GAP is a ratio between positive values, the dataset must be on a ratio scale 15, where 0 means the ultimate minimum, not a mere partition between negative and positive values as in an interval scale.
c) Basic procedure
The basic procedure of the GAP method can be described in steps as follows:
1) Sort the data in a dataset in ascending order of values.
2) Find a data pair with maxGAP in the upper range of the dataset and regard all data above the pair's gap as outliers.
3) Remove the outliers from the dataset.
4) Return to step 2) with the dataset after removal.
Analysts decide when to stop the repetition; however, maxGAP, which quickly decreases with repetition, provides a clue for this decision. Our experience suggests that once maxGAP falls below 1.10, it is time to consider stopping the repetition (see the Application and Discussion). In implementing the procedure, we set 5% of the data size as the upper range of the datasets in which to search for maxGAP. This range was wide enough for our data because the procedure invariably ended with a much smaller percentage of removals (see the Application).
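For illustration only, the basic procedure may be sketched in Python as follows. The 1.10 stopping level and the 5% upper range come from the text; the function name and everything else are our assumptions, not the actual implementation:

```python
def gap_method(values, max_gap_floor=1.10, top_fraction=0.05):
    """Hypothetical sketch of the basic GAP procedure."""
    data = sorted(values)                      # step 1: ascending order
    while True:
        n = len(data)
        # step 2: search only the top 5% of the sorted data for maxGAP
        start = max(1, n - max(1, int(n * top_fraction)))
        best, cut = 1.0, None
        for i in range(start, n):
            a, b = data[i - 1], data[i]
            if a > 0 and b / a > best:
                best, cut = b / a, i
        if cut is None or best < max_gap_floor:
            return data                        # stop once maxGAP is small
        data = data[:cut]                      # steps 3-4: remove and repeat
```

For example, `gap_method(list(range(100, 200)) + [1000, 5000])` strips the two isolated large values in one pass and then stops, because the remaining top-range GAPs are far below 1.10.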
d) Generalised procedure
The basic procedure works only for a sequence-irrelevant univariate dataset. However, there are many cases that this procedure cannot handle. For example, when the dataset is a time series, the question may concern the temporal sequence of values rather than the individual values of data points; when the dataset is multivariate, the question may concern not the individual variate values but the multivariate composite values (i.e. the position of data points in multidimensional data space). In such cases, before applying the basic procedure, we must quantify the data points to expose the anomalies in question and thereby create a univariate ratio-scale dataset. Let us call the quantified values 'anomaly-scores'. There is more than one way of quantification; it must be devised according to the nature of the question (see Application Example 1). The generalised procedure that adapts to such cases is as follows (the differences from the basic procedure are described afterwards):
0) Create an anomaly-score dataset from the original dataset.
1) Sort the data in the anomaly-score dataset in ascending order of values.
2) Find a data pair with maxGAP in the upper range of the anomaly-score dataset and regard all the data above the pair's gap as outliers.
3) Remove the data from the original dataset that corresponds to the outliers in the anomaly-score dataset.
4) Return to step 0) with the original dataset after removal.
The difference from the basic procedure is the addition of step 0) and the inclusion of steps 0) and 1) for repetition. The creation of the anomaly-score dataset in step 0) is necessary at every repetition because the removal in step 3) causes a change in the data structure in the original dataset. In modified step 3), outliers exposed in the anomaly-score dataset in step 2) indicate which data in the original dataset to remove. Step 3) is simple only when a one-to-one correspondence between the original and the anomaly-score datasets holds. When this is not the case, step 3) is difficult to generalise. We will show a specific example of step 3) for this case in Application Example 2.
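Assuming the one-to-one case, the generalised procedure may be sketched in Python as follows (the names `generalised_gap` and `score_fn` are ours; the actual system is implemented in Excel). The key difference from the basic sketch is that the maxGAP search runs on the anomaly scores, while removal is mapped back to the original data:

```python
def generalised_gap(data, score_fn, max_gap_floor=1.10, top_fraction=0.05):
    """Hypothetical sketch, assuming score_fn returns one anomaly score
    per data point (a one-to-one correspondence)."""
    data = list(data)
    while True:
        scores = score_fn(data)                           # step 0
        order = sorted(range(len(scores)), key=scores.__getitem__)  # step 1
        ranked = [scores[i] for i in order]
        n = len(ranked)
        start = max(1, n - max(1, int(n * top_fraction)))
        best, cut = 1.0, None
        for i in range(start, n):                         # step 2: maxGAP
            a, b = ranked[i - 1], ranked[i]
            if a > 0 and b / a > best:
                best, cut = b / a, i
        if cut is None or best < max_gap_floor:
            return data
        outliers = set(order[cut:])                       # step 3: map back
        data = [v for i, v in enumerate(data) if i not in outliers]  # step 4
```

With identity scoring (`score_fn=lambda d: list(d)`), the sketch reduces to the basic procedure but preserves the original sequence of the surviving data.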
Application to wildlife GPS data
The detection of outliers in wildlife GPS data is challenging because the data often have an extraordinarily wide range of values. Highly unstable measurement precision, caused by multiple factors of a highly variable and unpredictable nature, is partly responsible. Each GPS fix usually comes with the dilution of precision (DOP), the component of precision computable from the number of GPS satellites used and their constellation 16. However, other factors also affect precision, such as the multipath of signals from satellites to receivers and the movement speed of animals. The multipath effect is non-negligible under thick forest cover 17, 18, 19, and fast movements naturally make precise measurements of location difficult. However, the effects of these factors under changing conditions in the wild are difficult to measure and are not provided. Overall, the DOP is merely the computable part of precision. In addition, low precision does not necessarily imply low accuracy. DOP-dependent data removal is therefore both insufficient and misleading.
We are currently developing an Excel system for analysing wildlife GPS data, which first checks the data to create datasets suitable for subsequent analysis. We developed the GAP method for this purpose. The GPS data are multivariate, comprising time, height, and latitude-longitude pairs. We applied the GAP method to each of these separately and combined the results afterwards. We will describe three ways to implement the method using data obtained from two terrestrial mammalian species: the sika deer (Cervus nippon) and the pale-throated three-toed sloth (Bradypus tridactylus).
Example 1: The removal of height outliers
We used two procedures of the GAP method to remove height outliers: the basic procedure for sequence-independent outliers (extreme heights) and the generalised procedure for sequence-dependent outliers (height spikes). The latter procedure mostly suffices for our data, but the former ensures the removal of runs of consecutive extreme values. The default lower bounds of maxGAP in the former and latter procedures were set at 1.5 and 1.1, respectively; the bound for the former is higher because we expect it to handle only extreme values. The two procedures are combined and run sequentially. We used the following two datasets.
Original dataset (time series of heights)
(H1, H2, H3, ..., Hi−2, Hi−1, Hi, Hi+1, Hi+2, ...)
Anomaly-score dataset (sequence-dependent quantification for height spikes)
(S1, S2, S3, ..., Si−2, Si−1, Si, Si+1, Si+2, ...)
The former procedure uses only the original dataset. Although height is an interval scale, we treated it as a ratio scale here for convenience (see Discussion). For the anomaly-score dataset in the latter procedure, we devised three variations of quantification for analysts to choose from: (A) Si = (Hi − Hi−1) × (Hi − Hi+1); (B) Si = (2Hi − Hi−1 − Hi−2) × (2Hi − Hi+1 − Hi+2); (C) Si = the maximum of the three products (Hi − Hi−1) × (Hi − Hi+1), (Hi − Hi−2) × (Hi − Hi+1), and (Hi − Hi−1) × (Hi − Hi+2). In all of these variations, a one-to-one correspondence holds between the original and anomaly-score datasets. Some of the resulting values become negative; however, these only indicate the absence of height spikes and do not affect the procedure.
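The three quantification variations can be sketched directly from their definitions (the function names are ours; boundary points without enough neighbours are simply skipped in this sketch):

```python
def scores_A(h):
    # (A): Si = (Hi - Hi-1) * (Hi - Hi+1), for interior points
    return [(h[i] - h[i - 1]) * (h[i] - h[i + 1]) for i in range(1, len(h) - 1)]

def scores_B(h):
    # (B): Si = (2Hi - Hi-1 - Hi-2) * (2Hi - Hi+1 - Hi+2)
    return [(2 * h[i] - h[i - 1] - h[i - 2]) * (2 * h[i] - h[i + 1] - h[i + 2])
            for i in range(2, len(h) - 2)]

def scores_C(h):
    # (C): the maximum of the three products
    return [max((h[i] - h[i - 1]) * (h[i] - h[i + 1]),
                (h[i] - h[i - 2]) * (h[i] - h[i + 1]),
                (h[i] - h[i - 1]) * (h[i] - h[i + 2]))
            for i in range(2, len(h) - 2)]
```

An isolated spike stands above both of its neighbours, so both factors are positive and the product is large; a point on a monotone stretch yields factors of opposite sign, giving the negative scores mentioned above.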
We used the GPS data obtained from a female sika deer living in a mountainous region in central Japan as an example (Supplementary Excel-DataFile 1). The original data span 606 days, but for ease of demonstration, we selected data covering one year, from August 2012 to July 2013, at 2-hour intervals. The initial data size was 4254. The former and latter procedures ended after 1 and 4 repetitions, with last maxGAPs of 4.765 and 1.159 and with 7 (0.16%) and 15 (0.35%) removals, respectively (Fig. 1; Table 1; Supplementary Video 1). The removals by the latter procedure included three downward spikes. We used quantification variation (B) in this example; naturally, the other variations gave partly different results.
Example 2: The removal of horizontal position outliers
We used the generalised procedure to remove horizontal position outliers. The original dataset is a time series of latitude-longitude pairs. We created an interim dataset of XY coordinate pairs from the original using a gnomonic projection and used it as a proxy for distance calculation and mapping. We created the anomaly-score dataset using Euclidean distances between sequential data pairs in the original dataset.
Original dataset (time series of latitude-longitude pairs)
((Lt1,Ln1), (Lt2,Ln2), (Lt3,Ln3), (Lt4,Ln4), (Lt5,Ln5), ...)
Interim dataset (time series of XY coordinate pairs)
((X1,Y1), (X2,Y2), (X3,Y3), (X4,Y4), (X5,Y5), ...)
Anomaly-score dataset (Euclidean distances between sequential data pairs in the original dataset)
(D12, D23, D34, D45, D56, ...)
Note that a one-to-one correspondence does not hold here. The idea behind this quantification is to use distance outliers in the anomaly-score dataset to separate the data points in the original dataset into groups. Of those separated, let us call single points and groups of a small number of points ‘minors’. Minors are likely, but not guaranteed, to be spatial outliers (see G5 in Phase 1 and G7 and G8 in Phase 4 in Fig. 2). For this reason, in implementing the procedure, we let analysts check which minors are spatial outliers. As a result, step 3) of the generalised procedure breaks down into the following sub-steps:
3-1) Separate the data points in the original dataset using distance outliers in the anomaly-score dataset.
3-2) Choose single points and groups of a small number of points as candidates for removal.
3-3) Display these candidates and the remaining groups distinctively on the map.
3-4) Let analysts choose and remove only the spatial outliers from the candidates.
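The scoring and sub-step 3-1) can be sketched as follows, using the interim XY coordinates as the proxy for distance (the names `anomaly_scores` and `separate` are ours; candidate selection, mapping, and the analyst's choice in sub-steps 3-2) to 3-4) are outside this sketch):

```python
import math

def anomaly_scores(track):
    # Euclidean distances D(i, i+1) between sequential XY pairs;
    # note the one score per *gap*, not per point (no one-to-one mapping)
    return [math.dist(track[i], track[i + 1]) for i in range(len(track) - 1)]

def separate(track, outlier_gaps):
    # sub-step 3-1): cut the track at every distance flagged as an outlier,
    # yielding groups; small groups ('minors') become removal candidates
    groups, current = [], [track[0]]
    cuts = set(outlier_gaps)          # indices i of flagged D(i, i+1)
    for i in range(1, len(track)):
        if i - 1 in cuts:
            groups.append(current)
            current = []
        current.append(track[i])
    groups.append(current)
    return groups
```

For example, flagging the single long jump in a track of two tight clusters splits it into two groups, each of which the analyst would then inspect on the map.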
As another example, we used the GPS data obtained from a pale-throated three-toed sloth living in a thick tropical forest in Manaus, Brazil. The data covered 243 days, from October 2019 to June 2020, at 15-minute intervals (Supplementary Excel-File 2). The initial data size was 22618. We, as analysts, stopped the procedure after four repetitions, with 1.088 as the last maxGAP and 14 (0.06%) removals (Fig. 2; Tables 2 & 3; Supplementary Video 2). Note that maxGAP and the spatial scale quickly decrease with repetition.