Verbal Intelligence, Not Level of Education, Robustly Assesses Cognitive Reserve

Background: Cognitive reserve is most commonly measured using socio-behavioural proxy variables. These variables are easy to collect, have a straightforward interpretation, and are widely associated with reduced risk of dementia and cognitive decline in epidemiological studies. However, the specific proxies vary across studies and have rarely been assessed in complete models of cognitive reserve (i.e., alongside both a measure of cognitive outcome and a measure of brain structure). Complete models can test independent associations between proxies and cognitive function in addition to the moderation effect of proxies on the brain-cognition relationship. Consequently, there is insufficient empirical evidence guiding the choice of proxy measures of cognitive reserve and poor comparability across studies.
Method: We assessed the validity of 5 common proxies (education, occupational complexity, verbal intelligence, leisure activities, and exercise) and all possible combinations of these proxies in 2 separate community-dwelling older adult cohorts: The Irish Longitudinal Study on Ageing (TILDA; N = 313, mean age = 68.9 years, range = 54-88) and the Cognitive Reserve/Reference Ability Neural Network Study (CR/RANN; N = 234, mean age = 64.49 years, range = 50-80). Fifteen models were created with 3 brain structure variables (grey matter volume, hippocampal volume, and mean cortical thickness) and 5 cognitive variables (verbal fluency, processing speed, executive function, episodic memory, and global cognition).
Results: No moderation effects were observed. There were robust positive associations with cognitive function, independent of brain structure, for 2 individual proxies (verbal intelligence and education) and 16 composites (i.e., combinations of proxies). Verbal intelligence was statistically significant in all models. Education was significant only in models with executive function as the cognitive outcome variable. Three robust composites were observed in more than two thirds of brain-cognition models: the composites of 1) occupational complexity and verbal intelligence, 2) education and verbal intelligence, and 3) education, occupational complexity and verbal intelligence. However, no composite had larger average effects or was more robust than verbal intelligence alone.
Conclusion: These results suggest that researchers using proxy variables to measure cognitive reserve should use verbal intelligence where possible.


Background
Neuropathology and measures of brain structure do not fully explain cognitive decline (1) nor age-related variation in cognitive function (2). This is evident in the finding of normal cognitive function in individuals who meet the diagnostic criteria for Alzheimer's disease (AD) based on neuropathology (3,4). This well-established gap between brain and cognition may be explained by cognitive reserve (CR), wherein the effects of brain pathology or ageing on cognitive function are moderated by an individual's ability to efficiently or flexibly use the brain's resources to cope with task demands (5).
[...] effectively considered to be reflective of (i.e., caused by) the latent CR construct (34). Composite proxies are a more appropriate formative measurement model, where the observed proxies are considered to form, or cause, CR. Moreover, this approach can reflect the unique additive contributions of individual proxies, whereas factor analytic models reflect only the shared variance across different proxies (8).
While the composite approach offers advantages over the use of single proxies, there is no agreed-upon gold-standard composite proxy (29), just as there is no gold-standard individual proxy. Similarly, it is unclear which proxy should be used when assessing candidate neuroimaging measures of CR, as face validity is assessed via their association with CR proxies (40,41). The considerable variation (42,43) and lack of coherence in the use of proxies means that there is poor comparability across studies, as an effect observed for one proxy (e.g., educational attainment) may not be observed to the same degree for another (e.g., occupational complexity), even though both putatively reflect CR. It also provides researchers in the field of CR with additional "researcher degrees of freedom" (44), such that several different proxies could be examined but only statistically significant results are reported.
There have been limited attempts to date to assess the effects of different CR proxies on cognitive function. A systematic review of reviews and a meta-analysis have found that education, occupational complexity/status and engagement in cognitively stimulating activities are individually associated with a reduced risk of dementia (42) and positively associated with cognitive function in cognitively healthy older adults (43). Composites of these proxies, also including verbal intelligence, have shown similar effects (42,43,45). Opdebeeck et al. (43) reported that education, as a single proxy, and composite proxies had moderate associations with cognitive function, with smaller associations found for occupational complexity/status and cognitively stimulating activities. Verbal intelligence and social engagement were also associated with a reduced risk of dementia, although they were less frequently used than other proxies (42).
A complete model of CR requires 3 components: a measure of CR (e.g., a proxy), a measure of brain structure/pathology, and a measure of cognitive function (8,46). A complete model enables the assessment of the cognitive benefit criterion (41). This criterion can be satisfied via the observation of 1) an "independent effect" in which the candidate measure is positively associated with cognitive function, independent of brain structure, or 2) a "moderation effect" in which the candidate measure moderates the relationship between brain structure and cognitive function (8,40). The moderation effect is considered the ideal benchmark for CR, whereas the independent effect is considered a weaker level of evidence for a CR effect (8).
The evidence reported in the systematic reviews and meta-analyses described above was obtained from incomplete models of CR (47). Chapko et al. (47) sought to rectify this problem and conducted a systematic review of studies assessing CR proxies in complete CR models. Of all models assessing education, 58% reported positive evidence for education as a CR proxy, although this dropped to 38% of models within cognitively healthy cohorts. Chapko et al. concluded that the evidence for occupational complexity/status was inconclusive. One reviewed study provided evidence that greater engagement in cognitively stimulating activities in mid- and late-life provided CR effects (48). Conflicting results were found for more general leisure activity measures, with one study finding a protective effect (49) while another reported a null effect (50).
Verbal intelligence was not considered as a CR proxy by Chapko et al. (47) in their systematic review, although it has been relatively widely used as a proxy. Negash et al. (51) reported that verbal intelligence was positively associated with cognitive function controlling for global AD neuropathology, in a mixed sample of cognitively healthy older adults and older adults with mild cognitive impairment (MCI) and dementia. In another mixed sample, a moderation effect was observed for verbal intelligence on the relationship between cognition and inferior temporal lobe tau deposition, but not global amyloid burden (52). However, this moderation effect on tau deposition was not significant when the analysis was restricted to cognitively healthy older adults. Other studies have reported positive evidence for verbal intelligence as a CR proxy in cognitively healthy older adults, including a positive association with cognitive function controlling for hippocampal atrophy (53), and a moderation effect on the relationship between cognition and fibre bundle length, an index of white matter microstructural integrity (54). In the latter study, the reported moderation effect may have been confounded by age, as age is negatively associated with both fibre bundle length (55) and cognitive function (56,57), yet the analysis did not control for age.
Chapko et al. (47) did not assess physical activity or social engagement as CR proxies, presumably because studies with complete models including these proxies were not available at the time of the research. Complete CR models assessing physical activity and social engagement have since been published. Conflicting evidence has been reported for physical activity, which was positively associated with cognition in the presence of neuropathology (58) but not hippocampal atrophy (53). Positive evidence has also been reported for social engagement, which moderated the relationship between amyloid-beta deposition and cognitive decline (59).
Mixed evidence for CR effects of composite proxies has also been published (note that composites were not assessed by Chapko et al. (47)). The composite of verbal intelligence and education has been reported to moderate the relationship of subcortical grey matter (GM) volume and cortical thickness with fluid reasoning but not memory or processing speed and attention (60). This composite has also been associated with memory controlling for GM volume (61) and global cognition controlling for a composite AD-biomarker (62). Aside from the composite of verbal intelligence and education, there is very little empirical evidence regarding the effects of different CR composites within complete models.
There is currently no conclusive evidence for the best individual or composite proxy for measuring or validating neuroimaging measures of CR, particularly with respect to cognitively healthy older adults. A methodology for solving this problem is the use of hierarchical linear moderated regressions to systematically assess standard CR proxies and their composites in complete models, an approach that enables the examination of both moderation and independent effects within the same analysis framework. This is important because, although moderation effects should ideally be observed to validate a CR proxy or measure (8), they are typically small in real-world data (63), explaining 1-3% of the variance in the outcome (64). Consequently, large sample sizes are required to detect typically small moderation effects (65). This issue is further exacerbated when measurement error is present in either variable in the interaction term (e.g., the CR proxy and measure of brain structure) used to assess the moderation effect (66) or when either variable in the interaction term is associated with the outcome variable (e.g., cognitive function; 65). Given the noted difficulties in identifying moderation effects, it is important to also consider the independent effect when assessing the validity of CR proxies.
Hierarchical linear regressions allow the robustness (i.e., frequency of effects using different measures of brain structure and cognitive function) and magnitude of both moderation and independent effects of different proxies to be compared. Here, in two separate community-dwelling older adult cohorts, we examined five common putative CR proxies (education, occupational complexity, verbal intelligence, leisure activities, and exercise) and all of their possible combinations. We included three brain structure variables, mean cortical thickness, hippocampal volume, and grey matter volume, in each model. Our primary aim was to identify the CR proxies with the most robust and largest effects across two datasets. More formally, we define effective CR proxies as those variables that have a significant independent or moderation effect on measures of cognitive function and brain structure.

Participants
The first dataset consisted of data from 313 community-dwelling adults (mean age = 68.90 years, SD = 6.75 years, range = 54-88 years; 50.48% female), a subset of The Irish Longitudinal Study on Ageing (TILDA), a nationally representative longitudinal cohort study of older adults in Ireland (67,68). This data was collected during Wave 3 of the TILDA study (69). All participants were screened for MRI contraindications and study-specific inclusion criteria included: no history of neurological conditions and available data for CR proxies and cognitive function.
The second dataset consisted of data from 234 community-dwelling adults (mean age = 64.49 years, SD = 7.42 years, range = 50-80 years; 51.28% female) selected from participants in the Cognitive Reserve/Reference Ability Neural Network (CR/RANN) studies (70)(71)(72). Participants were screened for MRI contraindications, hearing and visual impairments, medical or psychiatric conditions, and dementia or MCI. Participants selected for the current analyses were aged 50 years or older with data available for CR proxies, cognitive function and MRI.

Measures: CR Proxies
Data was available for 5 socio-behavioural proxies in both datasets: Educational attainment, Occupational complexity, Verbal intelligence, Leisure activities, and Physical activity. In TILDA, further data was available for the proxies: Cognitively stimulating activities and Social engagement.
Educational attainment was measured using years of formal education in both datasets. In TILDA, participants were asked to indicate the age at which they first left continuous full-time education. This information was missing for 4 participants in the final sample (1.28%), so it was imputed using educational qualification, father's education, age, sex, and rural residence during childhood as previously described (73).
Occupational complexity was measured using the complexity of work in the dimensions of data, people, and things (74) using ratings obtained from an online catalogue of the Dictionary of Occupational Titles (DOT: www.occupationalinfo.org). Ratings for each dimension were reversed (such that higher scores reflected greater complexity) and then summed to create a total occupational complexity score, with scores ranging from 0 (minimal complexity) to 21 (maximal complexity). This was obtained for each participant's current occupation or last occupation before retirement in TILDA and for each participant's occupation of longest duration over their lifetime in CR/RANN.
Verbal intelligence was measured using the total number of correctly pronounced words on the National Adult Reading Test (NART; Nelson & Willison, 1982) in TILDA and the American National Adult Reading Test (AMNART; Grober & Sliwinski, 1991) in CR/RANN. In TILDA, a stress/anxiety-preventative and timesaving measure (78) was employed such that participants only completed the second half of the NART if they scored greater than 20 on the first half. A correction procedure was employed whereby scores of 0-11 were retained as full scores, but scores of 12-20 in participants who did not complete the second half were corrected using a conversion table outlined by Beardsall and Brayne (79,80). Possible scores on the NART, in TILDA, ranged from 0 to 50 and on the AMNART, in CR/RANN, from 0 to 45. While the NART is often used to provide a measure of premorbid intelligence, we have labelled NART scores here as verbal intelligence in line with previous cognitive reserve studies (81,82).
Leisure activities were assessed in TILDA by participants rating their current frequency of engagement on an 8-point Likert scale (0 = Never to 7 = Daily/Almost Daily) in 9 activities: watching television, going to films/plays/concerts, travel, listening to music/radio, going to the pub, eating out, sports/exercise, visiting/talking on phone, and volunteering. In CR/RANN, participants rated their frequency of engagement over the preceding 6 months on a 3-point Likert scale (1 = Never to 3 = Often) in 17 activities: television/radio, cards/games, reading, lectures/concerts, theatre/movies, travel, walks/rides, crafts/hobbies, music, visiting, sports/dancing/exercise, cooking, group membership, collecting, religious activities, and volunteering. For both datasets, total scores were created by summing individual responses and possible scores ranged from 17 to 51.
Physical activity was assessed in TILDA by calculating the total metabolic minutes arising from self-reported physical activity over the last week using the International Physical Activity Questionnaire-Short Form (IPAQ-SF; Craig et al., 2003; Lee et al., 2011). This questionnaire assessed the time spent in 3 categories: vigorous, moderate, and walking. Responses were converted to metabolic equivalent minutes (83) and summed. In CR/RANN, physical activity was calculated using total metabolic hours arising from physical activity in an average week. The Godin leisure time exercise questionnaire (85) assessed the frequency of activity sessions lasting > 15 mins in 3 categories: strenuous, moderate, and mild exercise. Responses were then weighted by the average estimated duration of activity in each category (0.5, 0.75, 1 hr respectively) and their metabolic equivalent values [...]
Cognitively stimulating activities were assessed in TILDA with a questionnaire where participants rated their frequency of engagement on an 8-point Likert scale (0 = Never to 7 = Daily/Almost Daily) in 5 activities: attending classes and lectures, working in the garden/home or on a car, reading books/magazines, spending time on hobbies/creative activities, and playing cards/bingo/games. Total scores were created by summing individual responses and possible scores ranged from 0 to 35.
Social engagement was measured in TILDA using the Social Network Index (87) which provides a total score, ranging from 0 to 4, reflecting an individual's degree of social connection (88).
Composite proxies were created by rst standardising (z-scoring) individual proxies. Next, every unique combination of proxies was generated and the composite proxy was the average of those proxies. For TILDA, this produced 120 unique composite proxies. For CR/RANN, this resulted in 26 composite proxies.
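As an illustration of the composite construction described above, the following minimal Python sketch z-scores each proxy and averages every combination of two or more proxies. It is not the code used in the study: the function names, the toy scores, and the use of the population standard deviation for z-scoring are assumptions made for the example.

```python
from itertools import combinations
from statistics import mean, pstdev


def zscore(values):
    """Standardise scores to mean 0 and SD 1 (population SD, an assumption)."""
    mu = mean(values)
    sd = pstdev(values)
    return [(v - mu) / sd for v in values]


def build_composites(proxies):
    """Return {tuple_of_names: scores} for every combination of 2+ proxies.

    `proxies` maps a proxy name to a list of raw scores, one per participant.
    Each composite is the per-participant mean of the z-scored member proxies.
    """
    z = {name: zscore(scores) for name, scores in proxies.items()}
    names = sorted(z)
    composites = {}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            columns = [z[name] for name in combo]
            composites[combo] = [mean(row) for row in zip(*columns)]
    return composites


# Toy scores for 4 hypothetical participants (values are made up).
toy = {
    "education": [10, 12, 16, 18],
    "occupation": [5, 9, 12, 14],
    "verbal_iq": [20, 31, 35, 42],
    "leisure": [22, 30, 28, 40],
    "exercise": [3, 8, 2, 9],
}
composites = build_composites(toy)
print(len(composites))  # number of composites built from 5 proxies
```

With 5 proxies the sketch yields 26 composites, and with 7 proxies it yields 120, which agrees with the counts reported for CR/RANN and TILDA.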
To summarize, for TILDA there were 127 proxies in total (individual and composite) and 31 in total for CR/RANN. To attenuate possible effects of outliers, all proxies were Winsorized using a robust technique based on the median absolute deviation (89). Outliers were identified as values greater than a threshold of 3 median absolute deviations from the median. Identified outliers were replaced by the median ± 3 median absolute deviations.

Measures: Cognitive Function
Verbal fluency was assessed using the total score on the Animal Naming Test which measures the ability to spontaneously produce the names of animals in one minute (78). The total number of animals named was used as the total score in both datasets.
Processing speed was measured using the time to complete the Colour Trails Task 1 (CTT 1; D'Elia et al., 1996) in TILDA and the Trail Making Task A (TMT A; Reitan, 1955) in CR/RANN. The CTT is considered a cross-culturally valid form of the TMT (78). Scores were reverse coded, such that higher scores reflected greater cognitive performance.
Executive function was assessed using the CTT 2 (D'Elia et al., 1996) in TILDA and the TMT B (Reitan, 1955) in CR/RANN. Both measures reflect the multi-dimensional executive function construct (92,93), specifically visual attention and cognitive flexibility with contributions from processing speed as well (78). The time taken to complete both tasks was used as the outcome measure. Scores were reverse coded such that higher scores reflected greater cognitive performance.
Episodic memory was measured in both datasets with a composite measure created using the average of standardized and Winsorized immediate and delayed recall variables. In TILDA, immediate and delayed recall were measured using a 10-item word list (94) as used originally in the Health and Retirement Study (95). The word list was assessed over 2 trials in TILDA and the average scores for immediate and delayed recall across both trials were used. In CR/RANN, immediate and delayed recall were measured using the total and delayed recall scores from the Selective Reminding Test (SRT; Buschke & Fuld, 1974).
Global cognition was measured using a composite measure of all 5 cognitive variables in each dataset. Cognitive variables were Winsorized and standardised prior to creation of the composite. The composite variable was then Winsorized and standardised itself.

Analysis
Fifteen individual brain structure-cognitive function models were created for each combination of brain structure and cognitive function variable, where one brain structure variable was selected as an independent variable and one cognitive function variable was selected as an outcome variable (Fig. 1). A moderated hierarchical regression (Fig. 1) was conducted within each brain structure-cognitive function model (n = 15) for each unique proxy (TILDA = 127; CR/RANN = 31). In Step 1, a cognitive measure was regressed on age, sex, and a measure of brain structure. In Step 2, a proxy variable was included as an independent variable. In Step 3, the interaction term for brain structure and the proxy was added.
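The three steps can be sketched in plain Python as nested ordinary least squares fits, with the R² change between steps indexing the independent and moderation effects. This is a simplified illustration under stated assumptions, not the study's code (which used statsmodels): the helper names, the normal-equations solver, and the simulated data are all hypothetical, and significance testing is omitted.

```python
import random


def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(predictors, y):
    """R² of an OLS fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def hierarchical_r2(age, sex, brain, proxy, cognition):
    """Return R² at Steps 1-3 plus the R² change at Steps 2 and 3."""
    interaction = [b * p for b, p in zip(brain, proxy)]
    r2_1 = r_squared([age, sex, brain], cognition)                      # Step 1
    r2_2 = r_squared([age, sex, brain, proxy], cognition)               # Step 2
    r2_3 = r_squared([age, sex, brain, proxy, interaction], cognition)  # Step 3
    return {"step1": r2_1, "step2": r2_2, "step3": r2_3,
            "independent": r2_2 - r2_1, "moderation": r2_3 - r2_2}


# Simulated, standardised data (not drawn from either cohort).
rng = random.Random(0)
n = 200
age = [rng.gauss(0, 1) for _ in range(n)]
sex = [float(rng.randint(0, 1)) for _ in range(n)]
brain = [rng.gauss(0, 1) for _ in range(n)]
proxy = [rng.gauss(0, 1) for _ in range(n)]
cognition = [-0.3 * a + 0.2 * b + 0.4 * p + rng.gauss(0, 1)
             for a, b, p in zip(age, brain, proxy)]
result = hierarchical_r2(age, sex, brain, proxy, cognition)
print(result)
```

Because the simulated outcome includes a direct proxy effect but no brain-by-proxy interaction, the Step 2 increment should be clearly larger than the Step 3 increment, mirroring the distinction between an independent effect and a moderation effect.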
To protect against violations of linear regression assumptions, the analysis was repeated using a robust regression, specifically an iteratively reweighted least squares regression with Tukey's biweight function and median absolute deviation scaling. Effects within each dataset were only considered significant if they were statistically significant in both the linear regression and the robust regression. To control for multiple comparisons and to ensure generalizability of findings, effects were only considered significant if they were statistically significant across both datasets. The analysis was conducted with customized Python code (available here: https://github.com/rorytboyle/hierarchical_regression) which used the statsmodels module (100). The changes in R² (i.e., amount of variance explained) from Step 1 to Step 2, and from Step 2 to Step 3, in linear regression models were used to assess the size of the independent and moderation effects of CR proxies, respectively. Where significant effects were observed, the mean R² change across both datasets was calculated to assess the average additional variance explained by the proxy and its interaction with brain structure.
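The decision rule described above can be expressed compactly. The sketch below is illustrative only: the function names, the dictionary layout, the example p-values, and the significance threshold of .05 are assumptions, and the direction of the effect is not checked.

```python
ALPHA = 0.05  # assumed significance threshold


def is_replicated(p_values, alpha=ALPHA):
    """Return True only if the effect is significant in every required fit.

    `p_values` maps (dataset, method) pairs to p-values, covering the linear
    ("ols") and robust fits in both cohorts.
    """
    required = [(d, m) for d in ("TILDA", "CR/RANN") for m in ("ols", "robust")]
    return all(p_values[key] < alpha for key in required)


def mean_r2_change(delta_tilda, delta_crrann):
    """Average the R² change for one effect across the two datasets."""
    return (delta_tilda + delta_crrann) / 2


# Hypothetical p-values for one proxy in one brain-cognition model.
example = {
    ("TILDA", "ols"): 0.001,
    ("TILDA", "robust"): 0.004,
    ("CR/RANN", "ols"): 0.020,
    ("CR/RANN", "robust"): 0.060,
}
print(is_replicated(example))  # False: the CR/RANN robust fit exceeds alpha
print(mean_r2_change(0.12, 0.08))
```

Requiring agreement across both regression types and both cohorts is conservative: a single non-significant fit, as in the example, is enough for the effect to be discarded.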

Demographics
In TILDA, some data were missing for mean cortical thickness (N = 34) and for CTT 2 and Global Cognition (N = 2). Consequently, different Ns were used across models within TILDA (see Table 1). In CR/RANN, the same N (N = 234) was used in all models.
Step 1: Brain-Cognition Relationships
Models in Step 1 of the hierarchical regression (i.e., containing a brain structure measure, sex, and age) were significantly associated with cognitive measures across both datasets (see Tables 2 and 3), except for two models in CR/RANN (hippocampal volume-executive function, and hippocampal volume-episodic memory). Sex was positively associated with cognition in 40% of models in TILDA and negatively associated in 20% of models in CR/RANN, independent of brain structure and age. Age was negatively associated with cognitive function, independent of brain structure and sex, in 100% and 40% of models in TILDA and CR/RANN, respectively.
In TILDA, only one brain structure variable, mean cortical thickness, was independently positively associated with cognitive function (processing speed). In CR/RANN, grey matter volume was independently positively associated with all cognitive measures and cortical thickness was independently positively associated with all cognitive measures except for processing speed. Hippocampal volume was not independently associated with any measure of cognition in either dataset.
Step 2a: Independent Effects
Significant positive independent effects were observed for 18 proxies, including 2 individual proxies and 16 composites, across the 15 models in both datasets (see Additional file 1 for significant independent effects across both datasets; see Additional file 2 for all significant independent effects in TILDA; see Additional file 3 for all significant independent effects in CR/RANN). The proxy with the largest average independent effect was verbal intelligence (mean R² change = 0.10; see Fig. 3). Verbal intelligence was the most robust proxy: independent effects were replicated across both datasets in 100% of models. The largest average independent effects were observed for verbal intelligence on global cognition where it explained a mean additional 16.80% (hippocampal volume), 15.87% (grey matter volume), and 14.66% (mean cortical thickness) of the variance after accounting for age, sex, and brain structure (for scatter plots of proxies with the 10 largest average independent effects, see Additional file 3, Fig. S1). Education was the only other individual proxy with reproducible independent effects (mean R² change = 0.05), which were observed in 20% of models, all of which contained executive function.
The most robust composite proxy was comprised of occupational complexity and verbal intelligence (mean R² change = 0.07) which was replicated in 86.67% of models. The composite proxy with the largest average effect was educational attainment and verbal intelligence (mean R² change = 0.09) which was replicated in 80% of models. Only one composite with reproducible independent effects, occupational complexity and physical activity, did not include verbal intelligence. This was the least robust composite as it was replicated in a single model and had the smallest average effect (mean R² change = 0.02).
Step 2b: Additional Independent Effects
Data was only available for cognitively stimulating activities and social engagement in TILDA. Consequently, these effects could not be assessed in terms of their reproducibility. However, within TILDA, cognitively stimulating activities had significant independent effects in 100% of models and had the second largest average independent effect of all individual proxies (mean R² change = 0.065, see Fig. 4). In contrast, social engagement had significant independent effects in only 40% of models and had the second smallest average independent effect of all individual proxies (mean R² change = 0.013). The only individual proxy with smaller effects than social engagement was physical activity, which did not have significant effects in any model.
Composite proxies including verbal intelligence had the largest average effects, followed by cognitively stimulating activities, and then education (see Fig. 5). Composites including verbal intelligence had significant effects in all models in TILDA. The composite with the largest effect in TILDA was verbal intelligence and cognitively stimulating activities (mean R² change = 0.13). The only composite proxy which was not significant in any model was social engagement and physical activity.
Step 3: Moderation Effects
There were no significant moderation effects in either dataset for any proxy. 31 non-replicated negative moderation effects (i.e., consistent with the CR hypothesis) were observed in TILDA (see Additional file 4,

Discussion
The reproducibility and magnitude of moderation and independent effects of 31 CR proxies, comprising 5 standard individual proxies and all their unique combinations, were assessed across 2 datasets to investigate their validity as measures of CR. No moderation effects were observed across both datasets. Replicated independent effects (positive associations with cognitive function, independent of brain structure) were observed for 2 individual proxies (verbal intelligence and educational attainment) and 16 composites. The most robust and largest effects were found for verbal intelligence, which satisfied the independent effect criterion in all 15 brain-cognition models across both datasets. Educational attainment satisfied the independent effect criterion in 3 brain-cognition models. No composite proxy had larger or more robust independent effects than verbal intelligence alone. Our results suggest that when proxies are used to measure or adjust for CR, verbal intelligence should be used.
Verbal intelligence is a more robust proxy than educational attainment
We found that verbal intelligence had the largest and most robust independent effects on cognition. Unlike previous studies, due to the availability of two large neuroimaging datasets, we could demonstrate that the effects of verbal intelligence were present in several brain-cognition models and were replicable. This validation of verbal intelligence as a CR proxy supports previous, narrower, associations between verbal intelligence and cognitive function in the presence of hippocampal atrophy (53), a neuropathological 'residual' measure of CR (51), a functional connectivity measure of CR based on task potency (101), and a possible neuromarker of CR, locus coeruleus signal intensity (102).
Aside from verbal intelligence, the only other individual proxy with replicable independent effects on cognition was educational attainment. These replicable effects were only observed in brain-cognition models where executive function was the cognitive outcome variable. While education has been previously positively associated with executive function, without accounting for brain structure, in cognitively healthy older adults (103) and in a systematic review (43), our results show that this association is independent of brain structure. Notably, the effects of education were less robust than those of verbal intelligence, as positive associations were not seen across both datasets for verbal fluency, processing speed, episodic memory and global cognition. As such, these results suggest that educational attainment is not a reliable individual proxy of CR in cognitively healthy older adults. This conclusion is supported by previous findings including a systematic review which found positive evidence for education in only 38% of complete models with cognitively healthy samples (47) and a non-significant association between education (when considered separately from other possible CR proxies) and a neuropathological residual measure of CR (48). Based on their findings using ex-vivo neuropathological measures, Reed et al. (48) concluded that the observed effects of education on cognition should not be simply considered as reserve effects. Our results further show that this conclusion is valid when using in-vivo neuroimaging measures of brain structure.
The general finding that verbal intelligence had larger and more robust CR effects than educational attainment supports an argument favoring the use of verbal intelligence over education (104). This argument was previously broadly supported by evidence that, compared to educational attainment, verbal intelligence was a stronger predictor of cognitive function/decline (105,106) and had greater protective effects on the onset of clinical symptoms of MCI/AD (107,108). More specifically, Malek-Ahmadi et al. (30) directly compared educational attainment and verbal intelligence in a mixed autopsy sample, consisting of adults with diagnoses of no cognitive impairment, MCI and AD. In complete CR models, including neuropathological indices and measures of episodic memory and executive function, positive evidence was found for verbal intelligence, but not education, as a CR proxy, leading to the conclusion that verbal intelligence measures are superior to educational attainment as CR proxies. Here, we have shown that verbal intelligence is also a superior CR proxy when using in-vivo measures of brain structure and when assessed with respect to additional cognitive outcome measures, including verbal fluency, processing speed, and global cognition. Importantly, our results show that this conclusion holds when tested across two separate samples of cognitively healthy older adults.
The larger and more robust effects of verbal intelligence reported here and elsewhere could be explained by 2 key factors. Firstly, verbal intelligence may be a closer reflection of the quality, benefit, or outcomes of educational attainment (109) than years of education, which simply reflects the quantity of educational attainment. Quality of education can differ greatly among individuals with the same quantity of education due to various socioeconomic and systemic factors (110), such as class size (111), and also due to individual-level factors such as intrinsic learning motivation and academic self-efficacy (112).
Secondly, measures of verbal intelligence may reflect wider lifetime educational and cognitive experiences as compared to years of education, which is generally restricted to early-life formal education (54,104,113,114) and typically neglects later-life education, which has been positively associated with cognitive function (115,116). In this sense, verbal intelligence could be considered a dynamic CR proxy which can change over time (117-119), whereas years of education may be considered a static CR proxy (30). Despite the widespread use of educational attainment as an individual CR proxy, our results suggest that it should only be used as an individual proxy where verbal intelligence is not available.

Composite proxies are less robust than verbal intelligence
We found significant positive independent effects of 16 different composite proxies across both datasets. 3 of these composites had significant effects in at least two thirds of the brain-cognition models assessed: occupational complexity and verbal intelligence (86.67% of models); education and verbal intelligence (80% of models); and education, occupational complexity, and verbal intelligence (66.67% of models). This is a novel finding as the most robust composite, occupational complexity and verbal intelligence, has never (to the best of our knowledge) been used previously as a CR proxy, likely due to the predominant use of education both as an individual proxy and in composites. The next most robust composite of education and verbal intelligence has been widely used (60-62,81,82,107,108) and our results support a previous positive association between this composite and episodic memory, controlling for GM volume (61). A speculative explanation for the greater robustness of occupational complexity and verbal intelligence as a composite may be that occupational complexity and verbal intelligence are less strongly correlated with each other than educational attainment and verbal intelligence (see Fig. 2).
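The reported percentages are consistent with simple counts out of the 15 brain-cognition models (3 brain structure variables by 5 cognitive variables). The short sketch below makes that arithmetic explicit. The counts (13, 12 and 10) are inferred from the percentages and are not stated in the text, and the function name is illustrative only.

```python
N_MODELS = 15  # 3 brain structure variables x 5 cognitive variables


def percent_of_models(n_significant: int, n_models: int = N_MODELS) -> float:
    """Percentage of brain-cognition models with a significant effect (2 d.p.)."""
    return round(100 * n_significant / n_models, 2)


# Counts are inferred from the reported percentages (an assumption, not source data).
inferred_counts = {
    "occupational complexity + verbal intelligence": 13,
    "education + verbal intelligence": 12,
    "education + occupational complexity + verbal intelligence": 10,
}

for composite, count in inferred_counts.items():
    print(f"{composite}: {count}/{N_MODELS} = {percent_of_models(count)}%")
```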
While composite proxies purportedly provide advantages over individual proxies, our results show that their independent effects are less robust (i.e. less frequently observed across brain-cognition models) and smaller in magnitude than those found for verbal intelligence alone. This may be explained by the large individual effects of verbal intelligence and its strong correlation with other proxies (see Fig. 2), considering that all composite proxies with replicated effects contained verbal intelligence, except for the composite with the least robust effects, occupational complexity and physical activity. While adding another proxy to verbal intelligence to form a composite should have an additive effect, this could also add noise to an already strong proxy measure as well as shared variance in situations where the proxies are correlated. Consequently, the overall effect of the composite may then be smaller than that of verbal intelligence alone. Alternative methods of creating composites, such as principal components analysis, could potentially mitigate this issue but may not be theoretically appropriate (34), and incorporating this method within the analysis framework used here would have significantly increased the complexity of the analysis. Of all composites considered here, our results especially support the use of education and verbal intelligence as well as occupational complexity and verbal intelligence as composite proxies where multiple proxies are available. However, using composites may lead to more Type II errors than using verbal intelligence alone, given the more robust and larger effects of verbal intelligence. As such, we recommend that researchers should use, or at least repeat analyses using, verbal intelligence alone.
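The dilution argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the analysis used in the study: the composite is taken to be the mean of z-scored proxies, the second proxy is assumed to correlate with verbal intelligence (rho = 0.4) without having any effect of its own on cognition, adjustment for brain structure, age and sex is omitted, and all parameter values are hypothetical.

```python
import numpy as np


def zscore(x):
    """Standardise a variable to mean 0 and standard deviation 1."""
    return (x - x.mean()) / x.std()


def r_squared(x, y):
    """Variance in y explained by a single predictor x."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def compare_proxies(n=5000, rho=0.4, effect=0.5, seed=0):
    """Return (R2 for verbal intelligence alone, R2 for the composite)."""
    rng = np.random.default_rng(seed)
    verbal = rng.standard_normal(n)
    # Second proxy: correlated with verbal intelligence, no own effect (assumption).
    other = rho * verbal + np.sqrt(1 - rho**2) * rng.standard_normal(n)
    cognition = effect * verbal + rng.standard_normal(n)
    # Composite formed as the mean of z-scores (assumed construction).
    composite = (zscore(verbal) + zscore(other)) / 2
    return r_squared(verbal, cognition), r_squared(composite, cognition)


r2_verbal, r2_composite = compare_proxies()
print(f"verbal alone: {r2_verbal:.3f}, composite: {r2_composite:.3f}")
```

Under these assumptions the composite explains less variance than verbal intelligence alone, because the added proxy contributes only noise and shared variance.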

Occupational complexity, leisure activities, and physical activity are not robust proxies of cognitive reserve
We did not find any evidence for robust independent effects for 3 individual proxies across both datasets.
Occupational complexity was not positively associated with any domain of cognitive function, adjusting for brain structure. This suggests that the small positive associations between this proxy and cognition, as reported in a meta-analysis (43), may not be independent of brain structure. Unlike the detailed occupational complexity measure used here, occupational complexity has typically been measured using government classification codes that are effectively a socioeconomic classification of occupations (e.g., the UK's Office of Population Statistic classification as in Staff et al., 2004). As such, previously reported effects for occupational complexity may have in fact reflected the effect of socioeconomic status, which can support cognitive health via greater access to resources and healthcare, among many other mechanisms (34). While Chapko et al. (47) concluded that the evidence for this proxy in complete CR models using cognitively healthy samples was inconclusive, we can conclude, based on our results, that occupational complexity should not be used as an individual CR proxy.
As with occupational complexity, we did not find robust evidence to support the use of leisure activities as an individual CR proxy. Although it has been associated with a reduced risk of dementia and AD (Crowe et al., 2003; but cf. Sommerlad et al., 2020), few studies have rigorously tested this proxy in a complete CR model. One study found a moderation effect for midlife leisure activities but, in line with our findings, did not find evidence of either a moderation or independent effect for later life leisure activities (123).
Future research is warranted to clarify which specific leisure activities should be included in measures for this proxy given that only a few activities have been associated with cognition in mid-/old-age samples, albeit without adjusting for brain structure (115,124). Considering our results, we suggest that later life leisure activities should not be used as an individual CR proxy.
Finally, our results do not support the use of physical activity as an individual CR proxy. While this proxy has been previously associated with cognitive function in older adults without controlling for brain structure (125,126), our results show that these associations are not independent of brain structure. This supports previous findings of non-significant associations from the few complete CR models assessing this proxy adjusting for brain structure using GM volume and hippocampal atrophy (53,127). The disparity in the observed associations when brain structure is accounted for could arise because the protective effects of exercise may be exerted via improved brain maintenance, i.e. the relative preservation of brain structural health (8,128), rather than improved CR (129). This is supported by the finding that the protective effects of exercise on cognition were mediated by increases in prefrontal cortex volume (130) and also by associations of greater physical activity with lower brain-predicted age difference scores (131), which reflect better brain maintenance (132), and with greater cortical thickness (133) and regional GM volumes (134,135). Setting aside a possible contribution of physical activity to brain maintenance, our results suggest that it does not contribute to greater CR and therefore should not be used as an individual CR proxy.

Lack of evidence for moderation effects of CR proxies
Robust moderation effects were not identified for any proxy here across datasets. This lack of evidence is in line with previously reported non-significant moderation effects on the relationship between episodic memory and GM volume (61) and right hippocampal volume (136) but conflicts with previous evidence of significant moderation effects reported for CR proxies in similar brain-cognition models (60,123,137). However, the evidence for moderation is largely inconsistent, as highlighted by the finding of moderation effects reported on 1 measure, but not on 2 other measures, of episodic memory within the same study (137) and even findings of a positive moderation effect, which contradicts the CR hypothesis, on the relationship between left hippocampal volume and episodic memory (136). It is likely that our non-significant effects highlight the general difficulties in detecting CR moderation effects.
The ability to detect a moderation effect here may have been impaired because the participants were cognitively and neurologically healthy and therefore had a relatively restricted range of cognitive function and brain atrophy in comparison to cognitively and/or neurologically impaired individuals. The relatively restricted range of the predictor variable of brain structure restricts the range of the interaction term (138), which can substantially reduce statistical power to detect a moderation effect (139). This is exacerbated by the fact that neuroimaging variables explain a relatively small amount (20%) of variance in healthy older adults' cognition (2), which effectively constrains the size of the moderation effect (65). While the present study was designed using pre-existing data from two cognitively and neurologically healthy cohorts, an experimental approach in which individuals with extremely low or high scores on measures of cognitive reserve and brain structure are oversampled may be better able to detect the existence of a moderation effect for these proxies (138).
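The effect of range restriction on the detectability of a moderation effect can be sketched with a simulated moderation model, cognition = b0 + b1*brain + b2*proxy + b3*(brain x proxy) + error. This is an illustration under assumed values rather than the study's analysis: the coefficients, the sample size, and the halving of the brain structure standard deviation are all hypothetical, and covariates such as age and sex are left out.

```python
import numpy as np


def interaction_se(brain_sd, n=2000, seed=0):
    """Standard error of the brain x proxy interaction coefficient from OLS."""
    rng = np.random.default_rng(seed)
    brain = brain_sd * rng.standard_normal(n)  # spread of brain structure
    proxy = rng.standard_normal(n)
    # Hypothetical coefficients; a negative interaction means the proxy
    # weakens the brain-cognition relationship, as the CR hypothesis predicts.
    cognition = (0.4 * brain + 0.3 * proxy - 0.1 * brain * proxy
                 + rng.standard_normal(n))
    X = np.column_stack([np.ones(n), brain, proxy, brain * proxy])
    beta, *_ = np.linalg.lstsq(X, cognition, rcond=None)
    resid = cognition - X @ beta
    sigma2 = resid @ resid / (n - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # OLS coefficient covariance
    return float(np.sqrt(cov[3, 3]))


se_full = interaction_se(brain_sd=1.0)
se_restricted = interaction_se(brain_sd=0.5)
print(f"SE full range: {se_full:.4f}, SE restricted range: {se_restricted:.4f}")
```

With the sample size held constant, a narrower spread of the brain structure variable gives a larger standard error for the interaction term, so a moderation effect of the same size is harder to detect.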

Promising evidence for cognitively stimulating activities but not social engagement as proxies but replication required
We were unable to assess the reproducibility of effects for cognitively stimulating activities and social engagement across datasets as we only had sufficient data in TILDA for these proxies. Within TILDA, cognitively stimulating activities was highly robust as it was significant in all brain-cognition models, and had the largest average independent effect after verbal intelligence. This finding supports associations between this proxy and neuropathological 'residual' measures of CR (48,51) and suggests that previously reported consistent positive associations (42,43) can be observed when controlling for brain structure and across multiple brain-cognition models. Social engagement was less robust as its effect was observed in only 40% of brain-cognition models and it had the second smallest average independent effect of all individual proxies. This inconsistent evidence emphasises a need for further study of social engagement in complete CR models. While mixed evidence of moderation effects has been reported to date for this proxy controlling for neuropathology (59,140), this is the first attempt to assess it in a complete CR model including neuroimaging variables. As our focus was on replication across datasets rather than single dataset findings requiring correction for multiple comparisons, and because these proxies were only available in a single dataset, these findings remain speculative until they can be replicated. With this in mind, while we cannot make definitive conclusions, we can tentatively suggest that cognitively stimulating activities may be a reasonable choice of CR proxy where verbal intelligence is not available and that social engagement should not be used as an individual proxy.

Limitations
The present study provides empirically supported recommendations for the use of proxies to measure CR.
Nonetheless, there are some limitations which, if addressed in future research, could further strengthen these recommendations and provide additional insights. The main limitation of the present results is that they are cross-sectional. As such, we cannot make solid inferences about the causal direction of the relationships between the robust proxies and cognitive function. Similarly, while CR is supposed to protect against cognitive decline, our analysis only provides information about its association with individual differences in cognitive function, not decline. Future analyses after further waves of data collection will be necessary to assess whether the effects of these proxies are consistent when assessed in the context of cognitive decline.
Another limitation is that the CR models used here only contained brain structure variables related to grey matter and did not contain measures of WM microstructural integrity, WM hyperintensity volume, or AD-related neuropathology. As CR proxies have been previously reported to moderate the relationship between these measures and cognition (52,54,141-143), future studies could assess proxies in complete CR models containing these brain structure variables to extend the conclusions made here to a wider spectrum of brain-cognition relationships. Finally, some CR proxies, namely leisure activities and physical activity, were measured differently in the two datasets. Differences in these measures or in the specific activities included in each measure may have contributed to differing effects across the datasets. This may be particularly pertinent for leisure activities as its relationship with cognitive function can vary based on the specific leisure activities assessed (115). However, this variability across the two datasets reflects the typical variability in the measurement of CR with proxies.

Conclusions
Despite the discussed limitations, the present findings are informative for researchers using proxies as measures of CR. We built on previous meta-analyses and systematic reviews of CR proxies by assessing a wider set of standard proxies, including their composites, and evaluating their effects across complete and theoretically consistent models of CR and in multiple brain-cognition relationships. Our analysis framework enabled the comparison of the robustness and magnitude of effects. Furthermore, the reported findings are stringent, robust and replicable, as they were only considered statistically significant if they were replicated in a robust regression and across two datasets.
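The replication criterion described above can be expressed as a simple decision rule. The sketch below is a hypothetical illustration: the estimator labels ("ols", "robust"), the alpha of 0.05, and the example p-values are assumptions and are not taken from the study, and a fuller check would also confirm that the effects share the same direction.

```python
from typing import Mapping


def is_replicated(p_values: Mapping[str, Mapping[str, float]],
                  alpha: float = 0.05) -> bool:
    """True only if every estimator in every dataset gives p < alpha."""
    return all(
        p < alpha
        for per_dataset in p_values.values()
        for p in per_dataset.values()
    )


# Hypothetical p-values for one proxy in one brain-cognition model.
example = {
    "TILDA": {"ols": 0.001, "robust": 0.003},
    "CR/RANN": {"ols": 0.020, "robust": 0.060},
}

# The robust regression in the second dataset misses the threshold,
# so this effect would not be counted as significant.
print(is_replicated(example))
```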
The present study is the first systematic investigation of the validity of different proxies, and their composites, in complete CR models. Verbal intelligence was associated with better cognitive function in all variables assessed, controlling for mean cortical thickness, GM volume, and hippocampal volume. The independent effects of education and composite proxies, including verbal intelligence and occupational complexity as well as verbal intelligence and education, were smaller and less robust. Our results provide firm, data-driven recommendations for the use of verbal intelligence as a CR proxy, over other proxies including education, occupational complexity, leisure activities, exercise, and composites including all possible combinations of these proxies. While no robust moderation effects were found for any proxy here, this may be due to the considerable statistical difficulties in detecting such effects in normal healthy ageing samples. In sum, the finding of robust independent effects across all brain-cognitive domains assessed provides strong evidence for the use of verbal intelligence as a CR proxy.

Consent for publication
Not applicable

Availability of data and materials
The TILDA dataset analyzed in this study is available from TILDA upon reasonable request. The procedures to gain access to TILDA data are specified at https://tilda.tcd.ie/data/accessing-data/.

Competing interests
The authors declare that they have no competing interests.

Figure 1
Schematic of basic brain structure-cognitive function models created for analysis.

Figure 3
Mean R2 change across datasets in all models for proxies with significant effects. + indicates composite proxies (e.g. Education + Verbal IQ = composite of educational attainment and verbal intelligence). Black vertical bars represent the mean of significant R2 change values across all models for that proxy. All models were adjusted for brain structure, age, and sex.