## Dataset selection

We selected population time series data from 14 different databases (**Supplementary Table 5**). These datasets came from multi-site, multi-year, multi-species abundance/density databases. We chose only databases with at least 20 consecutive years of surveyed data, and used only those population time series from each database in which abundance/density was greater than zero in every year of the time series.
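The selection criteria above can be expressed as a small filter. A minimal sketch, assuming each time series is a list of `(year, abundance)` pairs; the function and variable names are illustrative, not from the authors' pipeline:

```python
def keep_series(series, min_years=20):
    """Apply the selection criteria: at least min_years consecutive
    surveyed years, and abundance/density > 0 in every year."""
    years = [year for year, _ in series]
    consecutive = all(b - a == 1 for a, b in zip(years, years[1:]))
    all_positive = all(abundance > 0 for _, abundance in series)
    return len(series) >= min_years and consecutive and all_positive

demo = [(1990 + i, 5.0 + 0.1 * i) for i in range(20)]
print(keep_series(demo))        # True
print(keep_series(demo[:19]))   # False (fewer than 20 years)
```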

## Data description

The datasets used in this analysis were from the United States (10 of 14 datasets), Canada (1 dataset), the Netherlands (2 datasets), and the United Kingdom (1 dataset). Three datasets represented aquatic ecosystems and eleven represented terrestrial ecosystems. The organisms studied were birds, plants, zooplankton, rodents, fish, and moths. The minimum and maximum numbers of years in a time series were 20 and 30 years, respectively; thus, the maximum number of years in a training set ranged from 11 to 21 years (see section 2.2.6 below). The number of time series per dataset ranged from 11 to 7450. Datasets contained on average about 60 species (range 3–254) and on average 53 sites (range 4–283). Sampling methods included trawl surveys, visual searches, live trapping, and quadrat mapping. The units of abundance/density estimates included number of individuals per 1 \({m}^{2}\) plot, number of individuals per \({m}^{2}\), and percent cover.

## Using data to fit models

We fitted four different models to each time series in each of the 14 datasets. Each model was fit to a training set to estimate its parameters; the fitted model was then used to predict values in the test set.

## The models

Dennis and Taper (1994)24, in a classic series of papers on density dependence, used the logistic equation to identify four plausible models. We used the same four models, as follows.

1. **Mean model**

$$\ln\left(\frac{{n}_{t+1}}{{n}_{t}}\right)=b+m\,{n}_{t},\quad b=0,\ m=0\qquad \textbf{Equation 1}$$

2. **Trend model**

$$\ln\left(\frac{{n}_{t+1}}{{n}_{t}}\right)=b+m\,{n}_{t},\quad b\ne 0,\ m=0\qquad \textbf{Equation 2}$$

3. **Logistic model**

$$\ln\left(\frac{{n}_{t+1}}{{n}_{t}}\right)=b+m\,{n}_{t},\quad b\ne 0,\ m\ne 0\qquad \textbf{Equation 3}$$

4. **Gompertz model**

$$\ln\left(\frac{{n}_{t+1}}{{n}_{t}}\right)=b+m\,\ln\left({n}_{t}\right),\quad b\ne 0,\ m\ne 0\qquad \textbf{Equation 4}$$

The key difference between the density-dependent models (**Eq. 3** and **Eq. 4**) and the density-independent models (**Eq. 1** and **Eq. 2**) is that the slope m in the density-dependent models differs from zero. Each of the four models was fit to each training set for each time series from each dataset.
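Each fit reduces to a regression of the per-step log growth rate on either nothing (mean, trend), \({n}_{t}\) (logistic), or \(\ln({n}_{t})\) (Gompertz). A minimal sketch in Python, under the assumption that ordinary least squares was used; the helper names are illustrative, not the authors' code:

```python
import numpy as np

def fit_models(n):
    """Fit the four growth-rate models to a 1-D abundance series n
    (all values > 0) by least squares on the per-step log growth rate.
    Returns {model: (b, m)}."""
    n = np.asarray(n, dtype=float)
    r = np.log(n[1:] / n[:-1])               # ln(n_{t+1} / n_t)
    fits = {"mean": (0.0, 0.0)}              # Eq. 1: b = m = 0
    fits["trend"] = (r.mean(), 0.0)          # Eq. 2: intercept only
    m, b = np.polyfit(n[:-1], r, 1)          # Eq. 3: regress r on n_t
    fits["logistic"] = (b, m)
    m, b = np.polyfit(np.log(n[:-1]), r, 1)  # Eq. 4: regress r on ln(n_t)
    fits["gompertz"] = (b, m)
    return fits

def predict_next(n_t, b, m, gompertz=False):
    """One-step-ahead prediction: n_{t+1} = n_t * exp(b + m * x)."""
    x = np.log(n_t) if gompertz else n_t
    return n_t * np.exp(b + m * x)

demo = [12.0, 14.1, 15.8, 17.2, 18.1, 18.7, 19.0]
fits = fit_models(demo)
one_step = predict_next(demo[-1], *fits["gompertz"], gompertz=True)
```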

## Detailed descriptions of training sets and testing sets

To assess the predictive ability of the population models, the time series data were partitioned into training and testing sets.

**Step 1**

We selected the last nine years of each time series as the test set and the first N − 9 years as the training set. Because time-series length ranged from 20 to 30 years among datasets, training sets did not all contain the same number of years across datasets, although within each dataset the number of years in each training set was constant.
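Step 1 amounts to a simple split. A sketch, assuming a time series is a Python list ordered by year (`split_series` is a hypothetical helper):

```python
def split_series(series):
    """Last nine years form the test set; the first N - 9 years form
    the 'complete' training set."""
    return series[:-9], series[-9:]

series = list(range(1990, 2020))        # a 30-year time series
train, test = split_series(series)
print(len(train), len(test))            # 21 9
```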

**Step 2**

Because we wanted to assess the optimum number of years to include in a training set, we subset the ‘complete’ training set into a series of training sets containing 1, 3, 5, …, N − 9 years. For example, for a dataset containing time series that were 30 years long, the ‘complete’ training set would include the first 21 years. That ‘complete’ training set would then be subset into training sets of 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, and 21 years (Fig. 1). Subsets were created by excluding early years; thus, a training subset of only three years would contain years 19–21. Each of the four models was fit to each of the subsets, except where a model could not be fit to a single year of data (the trend model and both density-dependent models) or to a single year-to-year interval (both density-dependent models). In the example of a dataset with 30 years of data, this implies 10 or 11 sub-models (one for each training subset) for each of the four models.
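The subsetting in Step 2 can be sketched as follows (an illustrative helper, not the authors' code):

```python
def training_subsets(train):
    """Subset the 'complete' training set into windows of 1, 3, 5, ...
    years, keeping the most recent years (i.e., excluding early years)."""
    return {k: train[-k:] for k in range(1, len(train) + 1, 2)}

subsets = training_subsets(list(range(1, 22)))   # 21-year training set
print(sorted(subsets))   # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
print(subsets[3])        # [19, 20, 21]
```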

**Step 3**

All models were used to predict the year immediately after the last year in the training subset. So, for all subsets of a ‘complete’ training set that included year 21, year 22 would be the predicted population size.

**Step 4**: “**Rolling window**”: To maximize the number of predictions available for assessing predictive ability, the training and testing sets were chosen from each time series with a ‘rolling window’ method (Fig. 2). This gave us five times as many predicted values to compare to observed values while still ensuring that model predictions were tested on data not used to train the models. The ‘complete’ training set for each database was shifted one year forward and Steps 2–4 were repeated. For example, for a database with a 30-year time series that used years 1–21 for the initial ‘complete’ training set, the training set was shifted one year forward to include years 2–22. The test set therefore shrank to eight years, because only eight years remained to include in it. Instead of using models trained on years 1–21 to predict year 22, we now used models trained on years 2–22 to predict year 23.

**Step 5**

We repeated the one-year forward shift of the ‘complete’ training set three more times, until the available test set included only the last five years of the time series. This gave us five replicate predictions for each model/training-set combination for each time series in each dataset. For example, for a dataset containing 30 years of data, there were five replicate predictions for the Gompertz model with the 21-year training set, because we used years 1–21 to predict year 22, years 2–22 to predict year 23, years 3–23 to predict year 24, years 4–24 to predict year 25, and years 5–25 to predict year 26.
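Steps 4–5 together define a rolling window over the time series. A minimal sketch (illustrative helper names) that reproduces the five replicate windows in the 30-year example:

```python
def rolling_windows(series, train_len, n_shifts=5):
    """Yield (training window, year to predict) pairs, shifting the
    'complete' training set one year forward per replicate."""
    for s in range(n_shifts):
        train = series[s:s + train_len]
        target = series[s + train_len]   # year immediately after window
        yield train, target

years = list(range(1, 31))               # a 30-year time series
for train, target in rolling_windows(years, 21):
    print(train[0], train[-1], "->", target)
# 1 21 -> 22
# 2 22 -> 23
# 3 23 -> 24
# 4 24 -> 25
# 5 25 -> 26
```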

## Mean prediction error

**Absolute**

We calculated absolute prediction error for a single prediction as |Observed − Predicted|. Mean prediction error was calculated separately for each model/training-set-size combination. For example, for a dataset containing 30-year time series, we calculated a separate mean for the Gompertz 21-year training set, the Gompertz 19-year training set, the Gompertz 17-year training set, and so on. Each mean was calculated across all time series and all five replicate predictions per time series.

**Mean Absolute Prediction Error** =

$$\frac{1}{N\,n}\sum_{i=1}^{N}\sum_{j=1}^{n}\left|{Observed}_{ij}-{Predicted}_{ij}\right|$$

Where N = the number of time series in a dataset (e.g., 968 for the Aleutian Island Fish dataset), n = the number of replicates for each model/training-set combination (n = 5), and the prediction error is the difference between the observed and predicted values for a specified replicate of a particular model/time-series/training-set combination. For example, the Aleutian Island Fish dataset has 968 time series, and with 5 replicate predictions for each model/time-series combination there are 968 × 5 = 4840 predictions for each model/training-set combination. That is, the Gompertz model fit with 19 years in the training set has 4840 predictions and therefore 4840 prediction errors.
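The calculation can be sketched directly from the definition above; the array shapes and values here are synthetic, chosen only to match the Aleutian Island Fish example:

```python
import numpy as np

def mean_abs_prediction_error(observed, predicted):
    """Mean of |Observed - Predicted| over all N time series and all
    n replicate predictions for one model / training-set combination."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.abs(observed - predicted).mean()

# N = 968 time series x n = 5 replicates = 4840 predictions:
rng = np.random.default_rng(0)
obs = rng.uniform(1.0, 100.0, size=(968, 5))
pred = obs + 1.0                # every prediction off by exactly 1
print(mean_abs_prediction_error(obs, pred))   # 1.0
```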

## Statistical analyses

## Model effects

To test for differences in predictive ability among models (i.e., Gompertz, Logistic, Mean and Trend), we used a linear mixed model with a fixed effect for the model term and random effects for the site and species. This was done separately for each of the 14 datasets.
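A model of this form can be sketched with statsmodels' `mixedlm` on synthetic data; the column names (`prediction_error`, `model`, `site`, `species`) and the use of a variance component for species are assumptions, not the authors' specification:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the prediction-error table.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "model": np.tile(["Gompertz", "Logistic", "Mean", "Trend"], 50),
    "site": rng.integers(0, 5, 200).astype(str),
    "species": rng.integers(0, 8, 200).astype(str),
    "prediction_error": rng.gamma(2.0, 1.0, 200),
})

# Fixed effect for model; random effect for site (grouping factor) and
# a variance component for species.
md = smf.mixedlm(
    "prediction_error ~ C(model)",
    df,
    groups="site",
    vc_formula={"species": "0 + C(species)"},
)
res = md.fit()
print(res.summary())
```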

## Binomial and Chi-square test for density dependent and independent models

The model-effects test above compares mean prediction error among models, but we also tested for statistically significant differences in how frequently each model was ‘best’ (i.e., made the smallest prediction error). We used a binomial test of whether the frequency with which density-dependent models were best differed significantly from the frequency with which density-independent models were best. In addition, we used a Chi-square test for differences in the frequency of being best among the four models (Gompertz, Logistic, Mean and Trend).
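Both frequency tests are available in scipy; the counts below are invented for illustration and do not come from the paper:

```python
from scipy.stats import binomtest, chisquare

# Hypothetical counts of how often each model was 'best' (smallest
# prediction error).
best = {"Gompertz": 30, "Logistic": 22, "Mean": 28, "Trend": 20}

# Binomial test: density-dependent wins (Gompertz + Logistic) vs. the
# density-independent models, against an expected proportion of 0.5.
dd_wins = best["Gompertz"] + best["Logistic"]
total = sum(best.values())
p_binom = binomtest(dd_wins, n=total, p=0.5).pvalue

# Chi-square test for unequal 'best' frequencies among the four models.
p_chi2 = chisquare(list(best.values())).pvalue

print(p_binom, p_chi2)
```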