*Means*

Figure 1 illustrates means and standard deviations for the 30 measures at T1, T2 and for T2 change. The details of the descriptive statistics, along with descriptive statistics further broken down by gender and zygosity, are included in Supplementary Tables 1-6. These results are based on one twin randomly selected from each pair so that the data points are independent. Results for the other twin are virtually identical, as shown in Supplementary Tables 7-9.

**Figure 1**. Descriptive statistics for all measures at T1 and T2. Means and standard deviations for all the measures are presented in the panel on the left. On the right are effect sizes (Cohen’s d) for the differences between phenotypes at T1 and T2.

Almost as many changes were in a positive direction as in a negative direction. However, the effect sizes are modest as indicated by Cohen’s *d* statistic, which is the ratio of the mean difference to the standard deviation (Cohen, 1988; Figure 1). The average *d* across the 30 measures was 0.24, which accounts for less than two percent of the variance.
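As a minimal sketch of how an effect size of this kind can be computed, the snippet below divides the T2 minus T1 mean difference by a pooled standard deviation. The pooling rule, the function name `cohens_d` and the scores are illustrative assumptions, not the study's data or its exact procedure.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(t1, t2):
    """Mean difference (T2 - T1) divided by the pooled standard deviation.

    Assumption: the two standard deviations are pooled by averaging the
    variances; the text only says "the standard deviation".
    """
    pooled_sd = sqrt((stdev(t1) ** 2 + stdev(t2) ** 2) / 2)
    return (mean(t2) - mean(t1)) / pooled_sd


# Hypothetical scores for one measure (not taken from the study).
t1_scores = [3.0, 4.0, 5.0, 6.0, 7.0]
t2_scores = [2.5, 3.5, 4.5, 5.5, 6.5]

d = cohens_d(t1_scores, t2_scores)
print(f"d = {d:.2f}")  # negative sign means a decrease from T1 to T2
```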

Cohen (1988) proposed, as convention, that a large effect size is a *d* of 0.8, accounting for about 25% of the variance. Only one large negative effect emerged, decreased Volunteering (0.84), which is almost certainly due to less opportunity for volunteering during lockdown.

A *d* of 0.5, considered a medium effect size, accounts for about 9% of the variance. Medium-sized mean differences in the negative direction emerged for three variables. Prosocial Behaviour declined (0.44), which, like Volunteering, might be due in part to reduced opportunity. Achievement Motivation decreased (0.47), which is worrying because emerging adults are our next generation of workers. Hyperactivity-Inattention increased (0.42), which seems to fit with reports in the media that people feel less able to concentrate. Other effect sizes were modest (*d* = 0.20).

*Variances*

These mean differences mask a wide range of individual differences. If the COVID-19 crisis affected people in more extreme ways, we would expect to see increased variance at T2. The standard deviations (Supplementary Tables 1 and 2) do not support this hypothesis. The average standard deviation at T2 (1.71) was slightly *lower* than at T1 (1.79).

For these analyses and the following analyses of individual differences, we focused on variables that showed sufficient variability and approached normal distributions, including Achievement motivation, Alcohol, Community satisfaction, Conduct problems, Depression, Emotional problems, General anxiety, Healthcare, Hyperactivity/inattention, Importance of relationships, Love and relationships, Media use, Money attitudes, Peer problems, Physical activity, Prosocial behaviour, Purpose in life and Volunteering.

*Covariances*

If the COVID-19 crisis re-shuffled the rank order of individual differences, we would expect to see little stability from T1 to T2. Pearson correlations from T1 to T2 are shown in Figure 2 and listed in Supplementary Table 10, separately for males and females. The average correlation is 0.48 across the two-year gap. The most stable measures include Purpose in Life (0.68), Emotional Problems (0.56), Peer Problems (0.58), General Anxiety (0.57), and Depression (0.56). Stability correlations were generally similar for males and females, with average stability correlations of 0.50 and 0.47, respectively.

**Figure 2.** Phenotypic correlations (and 95% confidence intervals) between measures at T1 and T2.

Reliability of the measures represents a ceiling for stability. As part of our preparatory work for the 2018 (T1) assessment, we obtained two-week test-retest reliability from TEDS twins on most measures (Supplementary Table 11). The average test-retest reliability was 0.71, ranging from 0.47 for Importance of Healthcare to 0.84 for Volunteering. The average stability correlation of 0.48 implies that 48% of the *total* variance of the measures was stable from T1 to T2. Taking test-retest reliability into account (by dividing the stability correlation by the test-retest coefficient, 0.48/0.71) suggests that 68% of the *reliable* variance of the measures was stable from T1 to T2.
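The correction for reliability described above amounts to a single division. The sketch below reproduces it with the two averages reported in the text; the variable names are our own.

```python
# Averages reported in the text.
stability_r = 0.48          # mean T1-T2 Pearson correlation
test_retest_r = 0.71        # mean two-week test-retest reliability

# Share of the reliable variance that is stable from T1 to T2:
# the stability correlation divided by the reliability ceiling.
reliable_stability = stability_r / test_retest_r

print(f"{reliable_stability:.0%} of reliable variance is stable")
```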

Despite the substantial stability from T1 to T2, T2 change scores revealed some individuals who changed dramatically in positive as well as negative directions, as illustrated in Supplementary Figure 1. We will assess the twins on the same measures on three more occasions during 2020, which will enable analyses of the persistence and prediction of these extremes.

*Genetic and environmental aetiologies of variances and covariances*

*Twin correlations.* Figure 3 depicts intraclass correlations for identical and non-identical twins at T1 and T2 and for T2 change scores. (See Supplementary Tables 12-14 for the correlation coefficients.) We will describe the main results of the twin analysis using these twin correlations, although later we show that these results are confirmed by structural equation modelling, which also provides 95% confidence intervals for the genetic and environmental estimates.

**Figure 3**. Correlations between MZ and DZ twin pairs for all measures at T1, T2 and T2 change.

At T1, the average twin correlations for identical and non-identical twins were 0.35 and 0.16, respectively. Because identical twins are genetically identical whereas non-identical twins are only 50% similar genetically, the difference in their correlations indexes genetic influence on individual differences, called heritability. Doubling the difference between these correlations gives 38%; because heritability cannot exceed the identical twin correlation, the rough estimate of heritability at T1 is 35%. At T2, the average twin correlations for identical and non-identical twins were similar, 0.31 and 0.16, as was the average heritability of 30%, despite the COVID-19 crisis and lockdown.

Twin resemblance not explained by genetic similarity can be attributed to shared environment (C). In other words, the extent to which heritability does not account for the identical twin correlation is a rough index of C. On average, C was negligible at T1 (2%) and T2 (4%).

The rest of the variance is attributed to a residual component of variance (E) that includes non-shared environment plus unreliability of measurement. The average E was 63% at T1 and 66% at T2. Test-retest reliabilities suggest that non-shared environment accounted for about half of E at T1 and T2.
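The rough A, C and E estimates described in the preceding paragraphs can be sketched as follows. The function name `falconer_ace` is our own. Applying it to the average twin correlations is only an illustration, so its output need not match the averages reported in the text, which were presumably computed measure by measure.

```python
def falconer_ace(r_mz, r_dz):
    """Rough ACE estimates from identical (MZ) and non-identical (DZ) twin correlations."""
    # A: double the MZ-DZ difference, capped at the MZ correlation
    # and floored at zero.
    a = max(min(2 * (r_mz - r_dz), r_mz), 0.0)
    # C: MZ resemblance not accounted for by heritability.
    c = max(r_mz - a, 0.0)
    # E: everything that makes MZ twins differ (non-shared environment
    # plus measurement error).
    e = 1 - r_mz
    return a, c, e


# Average twin correlations reported in the text.
for label, r_mz, r_dz in [("T1", 0.35, 0.16), ("T2", 0.31, 0.16)]:
    a, c, e = falconer_ace(r_mz, r_dz)
    print(f"{label}: A={a:.2f} C={c:.2f} E={e:.2f}")
```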

Deducting the component of variance due to unreliability indicates that about half of the *reliable* variance at T1 and T2 can be attributed to inherited DNA differences. In other words, of the *total* variance at T1 and T2, about 40% can, on average across the measures, be attributed to genetic factors, about 30% to non-shared environmental factors, and about 30% to unreliability of measurement. Shared environmental influence has negligible impact.

T2 change scores show lower heritabilities, 16% on average. Because T2 change is a residualised score independent of scores at T1, stable genetic influence from T1 to T2 is removed from T2 change scores.

Thus, heritability of T2 change scores represents novel genetic influence at T2 that does not affect T1. Shared environment, which includes not only shared rearing environment (the twin pairs grew up together in the same family) but also shared experiences during the COVID-19 crisis, has negligible effects on T2 change, 3% on average. Most of the variance of T2 change scores is due to the E component of variance, 81% on average. We cannot separate E of T2 change scores into non-shared environment and unreliability of measurement because test-retest reliability at T1 cannot be assumed to represent the reliability of T2 change scores.
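A residualised change score of the kind described above can be sketched as the residual from regressing T2 on T1. The snippet is a minimal illustration with hypothetical scores and a hypothetical function name; any covariates or standardisation used in the actual analyses are not shown.

```python
from statistics import mean


def residualised_change(t1, t2):
    """Residuals from an ordinary least-squares regression of T2 on T1."""
    m1, m2 = mean(t1), mean(t2)
    sxy = sum((x - m1) * (y - m2) for x, y in zip(t1, t2))
    sxx = sum((x - m1) ** 2 for x in t1)
    slope = sxy / sxx
    intercept = m2 - slope * m1
    # What is left of T2 after removing the part predicted by T1.
    return [y - (intercept + slope * x) for x, y in zip(t1, t2)]


# Hypothetical scores for six individuals (not taken from the study).
t1 = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
t2 = [2.5, 2.0, 4.5, 4.0, 6.5, 6.0]
change = residualised_change(t1, t2)

# By construction the residuals are uncorrelated with T1.
cov_with_t1 = sum((x - mean(t1)) * r for x, r in zip(t1, change))
print(round(cov_with_t1, 10))
```

Because the residuals are uncorrelated with T1 by construction, any influence shared between T1 and T2 is removed from the change score.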

*Univariate model-fitting results.* These results about variance and covariance gleaned from the twin correlations are highly similar to the results of univariate model-fitting analyses of variance for T1, T2 and T2 change measures, as shown in Figure 4. (See Supplementary Table 15 for model-fit statistics, precise ACE estimates and confidence intervals.) The average model-fitting heritability estimates were 32% for T1, 32% for T2 and 15% for T2 change. Model-fitting estimates of shared environment were 3% for T1 measures, 3% for T2 measures and 2% for T2 change measures. Average model-fitting estimates of E were 66%, 65% and 82%, respectively.

**Figure 4.** Univariate model-fitting estimates.

*Bivariate model-fitting results.* The bivariate Cholesky decomposition model separates A, C and E components of variance at T2 into variance in common with variance at T1 and variance at T2 independent of variance at T1. As explained in Methods, the model yields estimates of the extent to which the phenotypic correlation between T1 and T2 is accounted for by A, C and E. The genetic correlations are shown in the top panel of Figure 5 (see Supplementary Figure 2 for shared environmental and non-shared environmental correlations). The results of the Cholesky bivariate analysis are illustrated in the bottom panel of Figure 5, with details in Supplementary Tables 16-21. Genetics accounts for 55% of the T1-T2 phenotypic correlations on average. Shared environment accounts for 4% of the phenotypic correlations on average. E influences shared at T1 and T2 are responsible for the rest of the phenotypic correlations (40%), which could be stable non-shared environmental influences or correlated error.

**Figure 5**. Bivariate model-fitting estimates. Genetic correlations are presented in the top panel. The bottom panel shows the proportion of the phenotypic correlation that is explained by A, C and E.

The Cholesky model also estimates A, C and E components of variance at T2 independent of their respective A, C and E components of variance at T1. These A, C and E estimates of T2 change (Supplementary Tables 16-21) are, as expected, similar to the A, C and E estimates for T2 change shown in Figure 4.

Figure 5 also shows the genetic correlations between T1 and T2 and compares them to the phenotypic correlations described earlier. As explained in Analyses, the Cholesky model estimates the genetic contribution to phenotypic stability from T1 to T2, which includes the genetic correlation. The genetic correlation is the correlation between genetic effects at T1 and T2 independent of the T1 and T2 heritabilities. The genetic correlations averaged 0.91, and most of their 95% confidence intervals included 1.0, indicating that genetic effects at T2 were substantially correlated with genetic effects at T1, despite the COVID-19 crisis and lockdown.
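To illustrate how a genetic correlation and the two heritabilities combine into the genetic contribution to phenotypic stability, the sketch below uses the standard relation in which the genetic part of the phenotypic correlation equals the genetic correlation weighted by the square roots of the two heritabilities. Plugging in the averages from the text is only an illustration; the result need not equal the reported 55%, which is an average of per-measure model-fitting estimates.

```python
from math import sqrt


def genetic_share_of_correlation(h2_t1, h2_t2, r_g, r_p):
    """Proportion of the phenotypic correlation r_p attributable to A."""
    # Genetic contribution: genetic correlation weighted by the square
    # roots of the heritabilities at the two occasions.
    genetic_part = sqrt(h2_t1) * r_g * sqrt(h2_t2)
    return genetic_part / r_p


# Averages reported in the text, used here purely for illustration.
share = genetic_share_of_correlation(h2_t1=0.32, h2_t2=0.32, r_g=0.91, r_p=0.48)
print(f"genetic share of the T1-T2 correlation: {share:.2f}")
```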

*Twins locked down together vs apart.* Finally, we investigated possible moderators of the univariate results. The most novel moderator is whether the twins were locked down together or living apart during lockdown. Lockdown presents a quasi-experimental test of contemporary shared environments by comparing results for the 28% of twins living together during lockdown and those living apart. If shared lockdown experiences were important, twins locked down together should be more similar than twins living apart during lockdown. On the basis of the generally weak effects of shared environment, we predicted that environmental effects due to living together during lockdown are negligible.

At first this prediction seemed wrong because the average twin correlation for twin pairs locked down together (.30) was higher than the correlation for twin pairs living apart during lockdown (.23), although this difference was not significant (p = .051). This possible effect of shared environments might, however, be a genetic effect in disguise because identical twins were locked down together more often than non-identical twins (32% vs 25%). Results of univariate model-fitting separately for twins locked down together vs apart (Figure 6) are consistent with the notion that the apparent effect of shared environments might be mediated in part genetically (see Supplementary Tables 26-27 for model-fitting results, including the 95% confidence intervals). For T2 scores, twins together yielded a slightly higher average estimate of shared environmental influence than twins apart (.07 vs .03), suggesting a very slight increase in true shared environmental influence. Twins together also yielded a slightly higher average estimate of genetic influence than twins apart (.33 vs .30), which could reflect genetically influenced selection for being locked down together, an example of gene-environment correlation. A great deal of caution is nonetheless warranted in these interpretations because the difference in phenotypic correlations for twins locked down together vs apart is not significant and our design has negligible power to detect significant differences of this magnitude for A and C.
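One common way to compare two correlations from independent groups is Fisher's z-test, sketched below. This is not necessarily the test used for the comparison above, and the numbers of twin pairs are hypothetical placeholders, so the printed p-value is illustrative only.

```python
from math import atanh, erfc, sqrt


def compare_independent_correlations(r1, n1, r2, n2):
    """Two-sided Fisher z-test for the difference between two correlations."""
    # Fisher transformation makes the sampling distribution approximately normal.
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from the normal distribution
    return z, p


# Correlations from the text; the pair counts are hypothetical assumptions.
z, p = compare_independent_correlations(r1=0.30, n1=1000, r2=0.23, n2=2500)
print(f"z = {z:.2f}, p = {p:.3f}")
```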

**Figure 6.** Univariate model-fitting estimates for twins in lockdown apart (top panel) vs. twins in lockdown together (bottom panel).

Although power to detect significant differences for such small effects is negligible, further support for the hypothesis that the apparent C effect of being locked down together is not really C comes from finding that nearly identical A and C estimates were already present at T1: A and C are .33 and .06 for twins together and .30 and .03 for twins apart. Results for T2 change scores provide additional confirmation in that a similar pattern emerged: A and C are .19 and .04, respectively, for twins together and .14 and .02 for twins apart.

*Other moderators.* We also considered other potential moderators. For example, similar to being locked down together or apart, gender is a dichotomous variable that is the same for both members of a twin pair (when opposite-sex non-identical twins are excluded). Separate univariate analyses for male and female twins yielded similar results. These model-fitting results are presented in Supplementary Tables 28 and 29.

For the continuous moderator of family SES and for moderators that can be discordant for members of a twin pair (living conditions during lockdown, COVID-19 symptoms, losing a job/financial difficulties), we corrected T2 and T2 change scores for these moderators and repeated the analyses. ACE estimates were similar before and after correction for these moderators. These model-fitting results are included in Supplementary Tables 30-36.