The purpose of this study was to determine whether different methods of binary seasonality classification are concordant when applied to time series derived from diagnosis codes in observational data. The results, as shown in Figure 1, indicate that the methods are generally inconsistent with each other, with discordance observed in 60–80% of time series across 10 populations. As Table 2 reveals, the methods exhibit variation both across databases and within databases, implying that the source of the variation is not the data, but the methods themselves. Ultimately, the discord stems from the different ways in which the methods assess seasonality. While similarities do exist, each method focuses on a different aspect of a time series to assess seasonality (Table 1). For instance, half the methods fit a time series with a hypothetical model and test the model for seasonality, while the other half test different aspects of a time series directly, without using a hypothesized model.
To take the discussion further and generalize where we can, we make distinctions between types of concordance and types of peaks. Regarding concordance, we define “positive concordance” to be unanimous agreement among the methods that a time series is seasonal, and “negative concordance” to be unanimous agreement that a time series is nonseasonal. Therefore, for a given time series, the methods are discordant when there is neither positive concordance nor negative concordance. Regarding peaks, we say that peaks are “persistent” if they occur year after year, and “consistent” if they occur in the same month each year. We make this distinction because peaks relate to important aspects of time series analysis relevant to seasonality: specifically, variation and autocorrelation. Peaks can, of course, come in different sizes. Time series with large peaks suggest greater variation than those with small peaks. Persistent peaks (be they small or large) suggest the possibility of underlying cyclical behavior in the time series. Consistent peaks, to the extent that they are consistent, indicate autocorrelation in the time series. We’ll use Figures 2 and 3 to navigate the remainder of the discussion. For the sake of brevity, when discussing the individual time series in Figure 3, reading from top left to bottom right, we’ll refer to them as Fig. 3.ts1, Fig. 3.ts2, …, Fig. 3.ts9.
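The definitions of positive concordance, negative concordance, and discordance above can be sketched in a few lines of Python. This is only an illustration: the function name, the series labels, and the example results are hypothetical and are not taken from the study's data or code.

```python
from collections import Counter

def concordance_type(labels):
    """Classify one time series from its per-method binary results.

    labels: iterable of booleans, True meaning a method called the series
    seasonal. (Hypothetical helper, not the study's implementation.)
    """
    labels = list(labels)
    if not labels:
        raise ValueError("need at least one method result")
    if all(labels):
        return "positive concordance"  # unanimous: seasonal
    if not any(labels):
        return "negative concordance"  # unanimous: nonseasonal
    return "discordant"                # any mixed set of results

# Made-up results for three series across eight methods.
results = {
    "series_a": [True] * 8,
    "series_b": [False] * 8,
    "series_c": [True, False, True, True, False, False, True, False],
}
summary = Counter(concordance_type(v) for v in results.values())
print(summary)
```

Under these definitions a single dissenting method is enough to make a series discordant, which is one reason unanimous agreement becomes harder to reach as more methods are compared.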
From Fig. 3.ts1 (N = 2,809) and Fig. 3.ts9 (N = 1,498), we learn that the methods exhibit concordance only 4,307/11,137 = 38.7% of the time. Figure 2 provides valuable insight into the extent of discord among the methods. Of the 40 unique combinations, we observe that some combinations occur more frequently than others, and this is due to similarities in the testing procedure (Table 1). For instance, methods that group time series data by month and test for differences among the groups are assessing seasonality differently than methods that fit a hypothetical model and then determine seasonality by minimizing forecast error. Acknowledging the differences in how the methods assess seasonality is important not only for understanding the amount of observed discord, but also for recognizing that these differences indicate a disagreement with regard to how seasonality is defined. Indeed, if the methods were highly concordant despite their contrasting approaches, we would have to concede that the contrasting approaches are ultimately just different ways of expressing the same aspect of a time series. This can be observed more clearly by exploring Figure 3. In Fig. 3.ts1, …, Fig. 3.ts4 we observe time series that to the human eye seem seasonal and very similar. Identifying such time series as seasonal is a very old idea in time series analysis, with Beveridge [26] and Yule [27] employing harmonic functions to model time series with cyclical behavior. However, despite an obvious cyclical pattern and visual similarities, Fig. 3.ts2, Fig. 3.ts3, and Fig. 3.ts4 all exhibit discord. The reason is that, except for the ED method, the methods do not test for seasonality by fitting the data with harmonic functions. Thus, the different methods of seasonality assessment ultimately result in different definitions of seasonality.
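To make the idea of modeling cyclical behavior with harmonic functions concrete, the following minimal sketch fits a mean plus a single cosine/sine pair with a 12-step period by least squares. It is a generic illustration on synthetic data, not the ED method or any other test evaluated in this study. The function name and the demo series are assumptions, and the simple closed-form coefficients hold only when the series covers a whole number of periods.

```python
import math

def harmonic_fit(series, period=12):
    """Least-squares fit of mean + one cosine/sine pair at the given period.

    Assumes len(series) is a whole number of periods (and period >= 3), so
    the regressors are orthogonal and the coefficients reduce to plain sums.
    """
    n = len(series)
    if period < 3 or n == 0 or n % period != 0:
        raise ValueError("series must cover a whole number of periods")
    mean = sum(series) / n
    b = 2.0 / n * sum(y * math.cos(2 * math.pi * t / period)
                      for t, y in enumerate(series))
    c = 2.0 / n * sum(y * math.sin(2 * math.pi * t / period)
                      for t, y in enumerate(series))
    amplitude = math.hypot(b, c)
    # Position of the fitted peak within the cycle, in time steps (0 = first step).
    peak_position = (math.atan2(c, b) * period / (2 * math.pi)) % period
    return mean, amplitude, peak_position

# Synthetic four-year monthly series peaking at position 2 of every cycle.
demo = [10 + 3 * math.cos(2 * math.pi * (t - 2) / 12) for t in range(48)]
mean, amplitude, peak_position = harmonic_fit(demo)
print(round(mean, 6), round(amplitude, 6), round(peak_position, 6))
```

A fit like this summarizes the cycle by an amplitude and a peak position. A method that looks at a different feature of the same series has no reason to reach the same verdict.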
As we’ve mentioned previously, the behavior of peaks plays an important role in concordance. We’ll use Figure 3 further to explore the relationship between peaks, variation, and discord, and provide general principles as to when a method would be more likely to classify a time series as seasonal rather than nonseasonal.
Since each method assesses seasonality differently, positive concordance is only achieved when multiple conditions are simultaneously present. Persistent and consistent peaks are most important for ED, AA, AR, and ET. Peaks will result in a seasonal classification by ED, so long as there exists a sufficient difference between the peaks and troughs in the data. However, even with persistent and consistent peaks, variation (particularly among the peaks) over time can lead to a nonseasonal classification by AA, AR, or ET (Fig. 3.ts2, Fig. 3.ts3, and Fig. 3.ts4). Indeed, we have confirmed experimentally that we can achieve positive concordance for the time series in Fig. 3.ts2, Fig. 3.ts3, and Fig. 3.ts4 by removing the data prior to 2016. Since time series with persistent and consistent peaks will have high correlation between seasonal lags, they will be classified seasonal by QS. For FR, KW, and WE, variation is most important. In the absence of the prominent peaks we see in Fig. 3.ts1, …, Fig. 3.ts4, sufficient variation in the time series data can lead FR, KW, and WE to a seasonal classification (Fig. 3.ts6). Therefore, with regard to positive concordance, we see tension among the methods in that variation may cause some methods to classify seemingly seasonal time series as nonseasonal (Fig. 3.ts2, Fig. 3.ts3, and Fig. 3.ts4) and seemingly nonseasonal time series as seasonal (Fig. 3.ts5, …, Fig. 3.ts8).
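The link between consistent peaks and correlation at seasonal lags can be illustrated with a plain sample autocorrelation. The sketch below is not the QS statistic itself. The helper name and both synthetic series are assumptions made for illustration: one series peaks in the same month every year, the other peaks once a year but in a different month each time.

```python
def lag_autocorrelation(series, lag):
    """Sample autocorrelation of a series at a positive lag."""
    n = len(series)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series: treated as uncorrelated by convention here
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

# Five years of monthly data, one peak per year (persistent in both cases).
consistent = [5.0 if t % 12 == 0 else 1.0 for t in range(60)]
peak_months = [0, 5, 10, 3, 8]  # hypothetical: the peak month drifts each year
shifting = [5.0 if t % 12 == peak_months[t // 12] else 1.0 for t in range(60)]

r_consistent = lag_autocorrelation(consistent, 12)
r_shifting = lag_autocorrelation(shifting, 12)
print(round(r_consistent, 3), round(r_shifting, 3))
```

In this toy setting the same-month peaks give a lag-12 autocorrelation of 0.8, while the drifting peaks give a value slightly below zero, even though both series have identical peak sizes and overall variance.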
The relationship between negative concordance and variation is more straightforward. The time series in Fig. 3.ts5, …, Fig. 3.ts9 are similar in that one cannot determine the results of the methods by visual inspection alone (recall that any linear trend in each of the original series has been removed prior to method application). Given the similarity of the time series in Fig. 3.ts5, …, Fig. 3.ts9, it’s reasonable to wonder why they do not all exhibit negative concordance. Ultimately, time series that are constant or stationary around a constant mean with minimal variation will result in negative concordance among the methods. However, a time series with both large peaks and variation will exhibit negative concordance if there is no monthly or yearly autocorrelation (for instance, a time series generated from N(µ, σ²)). As was noted in the Results section, the 1,498 time series for which the methods exhibit negative concordance report a mean variance of 0 to four decimal places.
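The two scenarios above can be mimicked with synthetic data: independent draws from N(µ, σ²) vary a great deal but carry no systematic correlation at the yearly lag, while a nearly flat series has variance that rounds to zero. This is a hedged sketch only. The seed, the parameters, and both series are arbitrary assumptions, and the code does not run any of the study's methods.

```python
import random
import statistics

def lag_autocorrelation(series, lag):
    """Sample autocorrelation of a series at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0:
        return 0.0  # constant series
    num = sum((series[t] - mean) * (series[t - lag] - mean)
              for t in range(lag, n))
    return num / denom

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Ten years of monthly values drawn independently from N(100, 5^2).
noise = [rng.gauss(100.0, 5.0) for _ in range(120)]
noise_variance = statistics.pvariance(noise)
noise_r12 = lag_autocorrelation(noise, 12)

# A nearly constant series: its variance is zero to four decimal places.
flat = [0.5 + 0.001 * (t % 2) for t in range(120)]
flat_variance = statistics.pvariance(flat)

print(round(noise_variance, 2), round(noise_r12, 3), round(flat_variance, 4))
```

For independent noise of this length, the lag-12 autocorrelation is expected to fall within roughly ±2/√120 ≈ ±0.18 of zero most of the time, so large variation alone does not imply seasonal structure.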
We’ve explained general scenarios in which we can expect negative and positive concordance, but further generalization is more difficult. As Figure 3 reveals, there are thousands of different combinations of discord (M = 2,168, …, 1,267) for each time series, making it difficult to predict which particular combination of discord to expect based on visual inspection of the time series alone. However, an immediate consequence of this study is that researchers using different methods are implicitly defining seasonality differently. Given the discordance between the methods, researchers relying on different methods are likely to encounter different results, thus leading to a conflicting understanding of the seasonality of a time series.
Finally, we note that the study and evaluation of methods was limited to 10 observational databases and eight methods of binary seasonality classification. Different results may have been observed by modifying one or more of the following design choices:

Construction of time series

Number and choice of databases

Number and choice of methods of binary seasonality classification