More alike than different: Novel methods for measuring and modeling executive function development

Executive functions (EFs) are linked to positive outcomes across the lifespan. Yet, methodological challenges have prevented rigorous understanding of the precise ways EFs are organized in childhood and how they develop over time. We introduce novel methods to address these challenges for both measuring and modeling EFs using a large, accelerated longitudinal dataset from a diverse sample of students in middle childhood (approximately ages 8 to 14; N = 1,286). Adaptive assessments allowed us to equate EF challenge across ages, and a data-driven, network analytic approach revealed the evolving diversity of EFs while accounting for their unity. Our results suggest EF organization stabilizes around age 10 but continues refining through at least age 14. This approach brings new precision to EFs' development by removing interpretative ambiguities associated with previous methodologies. By improving EF measurement, the field can move toward improving EF training, to provide a strong foundation for students' success.

Figure caption: Results of network analysis and community detection for each cohort and timepoint. Community detection algorithms indicate a two-community organization for the 3rd-4th grade cohort that differentiates into a three-community structure by about 5th grade. Fluctuations in the grouping and magnitude of edge weights across older cohorts suggest continued subtle development for older students. WM, working memory; RI, response inhibition; IR, interference resolution; Backward Spatial Span; Forward Spatial Span; Sustained Attention; Impulsive Attention.

components. This approach has revealed no consistent patterns across decades of studies regarding the number of clearly distinguishable components at each age (see 9 for review). The inconsistencies in the extant literature call for a paradigm shift in both the measurement and modeling of EF performance, moving beyond reductionist views of EFs and toward treating them as a dynamic, interconnected network of skills. Next, we outline the critical factors such an approach requires and offer evidence of its promise.

Measuring EFs
To reveal the true developmental trajectory of EFs, we first need to measure EFs with assessments that are robust across developmental stages and assessment sessions. Much of the prior cross-sectional and longitudinal work has been confounded by (1) the use of the same tasks across age ranges, which results in floor or ceiling effects in performance if the challenge level is not adjusted appropriately, or (2) the use of different tasks with different age groups (as reviewed in e.g., 10 ). Not only do tasks need to be comparable across age and ability, but EF assessments also need to be repeatable over multiple timepoints so that developmental progress can be measured within subjects in ways that do not suffer from practice or ceiling effects. Adaptive methods, in which tasks dynamically adjust to an individual's appropriate challenge level on a trial-by-trial basis, present a compelling and simple solution to this pernicious problem 11,12 .
Indeed, prior work with pediatric populations suggests that highly engaging assessments with adaptive challenge algorithms can reveal phenotypic differences between clinical and neurotypical populations, even when group characteristics are highly variable 13 .
We further need multidimensional assessments to disentangle what EFs share in common from what they uniquely contribute to performance, to ensure each component is measured validly and reliably. Any single task used to assess a component of EF will necessarily involve processes not related to EFs (e.g., visual processing, motor response), or may be related to multiple EF components, both of which will result in measurement impurity 14 . To address this impurity, researchers can collect multiple measures of each hypothesized component of EF, leveraging the commonalities across tasks to extract information about EF skills, and reducing the contribution of idiosyncratic skill related to any individual task. Thus, methods that use multiple indicators to measure each hypothesized component of EF are critical for a robust and reliable understanding of how EFs develop over time.
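The value of multiple indicators can be seen in a small simulation (illustrative only, not the study's data or analysis): averaging several impure task scores tracks the underlying skill more closely than any single task does, because task-specific variance averages out while shared EF variance accumulates.

```python
# Sketch: why multiple indicators per EF component reduce measurement
# impurity. Each task score mixes the latent EF skill with task-specific
# (idiosyncratic) variance; a composite of several tasks strengthens the
# recoverable signal. All parameters below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_children, n_tasks = 1000, 3

latent_ef = rng.normal(size=n_children)              # true EF skill
task_noise = rng.normal(size=(n_children, n_tasks))  # task-specific impurity
tasks = 0.6 * latent_ef[:, None] + 0.8 * task_noise  # observed task scores

composite = tasks.mean(axis=1)                       # multi-indicator composite

r_single = np.corrcoef(tasks[:, 0], latent_ef)[0, 1]
r_composite = np.corrcoef(composite, latent_ef)[0, 1]
print(f"single task r = {r_single:.2f}, composite r = {r_composite:.2f}")
```

With these mixing weights, the single-task correlation with the latent skill is about 0.6, while the three-task composite climbs toward 0.8, illustrating the reliability gain from pooling indicators.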
Here we introduce a novel assessment tool--Adaptive Cognitive Evaluation (ACE)--that addresses these robust measurement needs. ACE is a battery of assessments that taps multiple EFs through several different tasks. Importantly, each task incorporates adaptive algorithms, allowing the repeated measurement of EFs across multiple timepoints, using the same tasks in different age groups without running into floor or ceiling limitations.

Modeling EFs
Determining the true developmental trajectory of EFs requires more than robust and reliable measurements; these measurements must also be modeled using methods that can account for the multidimensional nature of EFs. To date, modeling approaches have largely focused on identifying when EFs differentiate over development, and have not fleshed out how the commonalities among EFs develop and interact. The use of latent variable analysis in particular has been the most common approach to evaluating the changing organization of EFs over development 15 . While this modeling technique is attractive because it presents a possible solution to the measurement impurity problem, it is also reductionist, as it assumes EFs are distinct, largely non-overlapping constructs. This assumption stands in contrast with the predictions of the differentiation hypothesis, which proposes that, while EF components become increasingly separable over time, components do not become completely distinct. The persistence of the unity of EFs into adulthood has been supported by both behavioral and neural examinations demonstrating the existence of a unifying umbrella construct termed "Common EF" [16][17][18][19] . However, while latent variable modeling of EFs is appropriate to support the inclusion of a Common EF component at a single moment in time, it is unsuitable for testing the differentiation hypothesis and assessing the dynamic development of EFs over time.
To advance our understanding of how EFs evolve over development, we need to understand how EFs dynamically evolve as a connected network of abilities. In place of latent variable modeling, we suggest leveraging a powerful family of techniques that provides a data-driven method for distinguishing the number of unique cognitive mechanisms from their commonalities: network analysis. Network analysis is an approach gaining traction in the psychometric field for understanding cognitive constructs comprising complex inter-related components such as intelligence, psychopathology, and personality [20][21][22][23][24] . In network analysis, relationships between variables are determined after accounting for what is common among all variables. Data-driven methods can then be used to assess which variables are most closely related. In this way, network analysis allows us to examine the structure of EF from a holistic perspective to arrive at the organization that best reflects the data in a theory-agnostic, data-driven manner.
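As a rough illustration of this logic (not the authors' actual pipeline or software), a partial-correlation network can be estimated from the precision matrix of task scores, so each edge reflects the association between two tasks after conditioning on all others; a simple threshold-and-connected-components pass then stands in for the formal community detection algorithms used in psychometric network packages.

```python
# Sketch of a partial-correlation network over simulated task scores.
# Two clusters share a "Common EF" signal; conditioning on the other
# variables leaves strong unique ties only within each cluster.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
common = rng.normal(size=n)                      # shared Common EF signal
wm = rng.normal(size=n)                          # working-memory-specific
inh = rng.normal(size=n)                         # inhibition-specific
scores = np.column_stack([
    common + wm + 0.5 * rng.normal(size=n),      # WM task 1
    common + wm + 0.5 * rng.normal(size=n),      # WM task 2
    common + inh + 0.5 * rng.normal(size=n),     # inhibition task 1
    common + inh + 0.5 * rng.normal(size=n),     # inhibition task 2
])

# Partial correlations from the precision (inverse covariance) matrix:
# association between each pair after accounting for all other tasks.
prec = np.linalg.inv(np.cov(scores, rowvar=False))
d = np.sqrt(np.diag(prec))
pcorr = -prec / np.outer(d, d)
np.fill_diagonal(pcorr, 0.0)

# Naive "communities": connect tasks with strong unique ties, then take
# connected components of the resulting graph.
strong = np.abs(pcorr) > 0.2

def communities(adj):
    seen, groups = set(), []
    for start in range(len(adj)):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            i = stack.pop()
            if i in group:
                continue
            group.add(i)
            stack.extend(j for j in range(len(adj)) if adj[i][j])
        seen |= group
        groups.append(sorted(group))
    return groups

print(communities(strong))  # WM tasks and inhibition tasks separate cleanly
```

Because every task loads on the common signal, the raw correlations are all sizable; it is the partial correlations that isolate the component structure, which is what makes this family of methods suited to separating differentiated components from Common EF.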
Here, we capitalize on the improved interpretability of longitudinal and cross-sectional comparisons afforded by the use of the same tasks across all participants 10 with our ACE battery to shine light on the relatively understudied period of middle childhood (~7-12 years old), the developmental stage in which EFs may develop most rapidly [25][26][27] . We demonstrate the benefits of network analysis for understanding the organization of EFs across development by first testing the differentiation hypothesis with the historically popular latent variable analysis approach and then contrasting these results with those found using a network analysis approach. Specifically, we use each method to determine not only when components of EF become distinct from one another but, critically, when they become distinct from the unifying Common EF component. Finally, we leverage information generated from network analyses to gain insights into the stability of the organization of EFs across time. We show that during middle childhood, EF organization begins to stabilize, yet continues to develop in a manner suggesting EFs need continued support throughout their protracted development as children transition to adolescence. Developmental insights revealed by network analyses and missed by latent variable analyses may explain the inconsistencies in the number of components identified across development to date and lay the groundwork for new avenues of investigation to understand how to best support EFs across the lifespan.
We first show how the use of novel, adaptive assessments can robustly measure EFs longitudinally across a wide age range without floor and ceiling effects. We then demonstrate how a holistic modeling approach can enhance our current understanding of the emergence and development of EFs by testing the differentiation hypothesis using two analytic approaches, latent variable analysis and network analysis. Using a latent variable approach, we replicate the ambiguous, difficult-to-interpret results found in prior investigations. We then critically extend our understanding using a network analytic approach, revealing developmental insights missed under the latent variable approach.

Novel EF Measurement
To assess EF performance with the same measures across a wide age range, we developed a battery of tasks indexing multiple hypothesized EF components 29 . By separating inhibitory control-related tasks into those in which a response must always be made (interference resolution) and those in which a participant must decide whether to make a response or not (response inhibition), we are able to bring further specificity to the characterization of EFs in middle childhood.
ACE also included a measure of basic response time (BRT) and a screening assessment for red-green color blindness 30 at the beginning of each data collection session. Each task was developed from cognitive assessments commonly used in lab-based settings and modified for use in real-world settings by including adaptive algorithms, highly motivating trial-wise and end-of-task feedback, and a user-friendly interface (see 31 for full details on the assessment battery and implementation methods). Importantly, the adaptive algorithms enabled two critical affordances: (1) the same tasks could be used in the same students across multiple timepoints, to reveal a student's changing cognitive abilities without being confounded by ceiling or floor effects, and (2) the same tasks could be used across students of diverse ages, to reveal individual differences in cognitive abilities across development without being confounded by the use of different tasks.
This advancement in our approach to assessment enabled robust integrative data analytics within-subjects over time, and across-subjects from a wide age range.
We found each adaptive EF assessment captured predicted developmental improvements in performance.
Linear mixed effects models examining task performance for each metric of interest (see Methods), allowing random effects for participant, school, and time, showed that, across tasks, performance significantly increased with age and time after controlling for BRT, cohort, and gender (see Figure 1 and Figure S1). Beyond these predicted EF improvements with age, performance on most tasks showed a significant interaction between age and time, suggesting that younger participants tended to improve more over time compared to older participants.
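The structure of this model can be illustrated with a simplified sketch. This is not the paper's model (which included random effects for participant, school, and time, plus BRT, cohort, and gender covariates); it is a fixed-effects-only simulation, with made-up coefficients, showing how a negative age-by-time interaction captures younger participants improving more over timepoints.

```python
# Sketch (fixed effects only): fit performance ~ age + time + age:time by
# ordinary least squares on simulated data. The negative interaction
# means gains across timepoints shrink with age.
import numpy as np

rng = np.random.default_rng(2)
n = 4000
age = rng.uniform(8, 14, n)          # age in years at first assessment
time = rng.integers(0, 4, n)         # timepoint 0-3
# Hypothetical generative model: performance rises with age and time,
# but gains over time shrink with age (negative interaction).
perf = 0.5 * age + 0.8 * time - 0.05 * age * time + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), age, time, age * time])
beta, *_ = np.linalg.lstsq(X, perf, rcond=None)
print({name: round(float(b), 3) for name, b in zip(
    ["intercept", "age", "time", "age:time"], beta)})
```

A real analysis of nested longitudinal data would add random intercepts for participant and school (e.g., in a mixed-model package) so that repeated observations are not treated as independent; the fixed-effect structure above is the part that corresponds to the reported age, time, and interaction effects.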
To confirm the adaptive mechanism indeed challenged EFs similarly across the different grade levels and assessment occasions, we examined in-game accuracy. In tasks with an adaptive response window (Impulsive Attention, Sustained Attention, Tap and Trace, Stroop, Flanker, and Boxed), participants only received 'correct' feedback if they provided the correct answer within a limited time frame. All other responses resulted in feedback indicating the response was 'late' or 'incorrect'. This adaptive algorithm was designed to produce ~75% of responses resulting in 'correct' feedback for all participants, and produced an average of 72.04% across tasks. While linear models examining the effect of cohort and time on the percentage of trials with 'correct' feedback did show significant main effects of cohort and timepoint, the effect sizes of these models were small (all R 2 < 0.2; see Table 1). Such small effects of cohort and timepoint on the percentage of responses with 'correct' feedback suggest the adaptive tasks successfully presented a similar challenge across developmental stages and measurement occasions.
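A weighted up/down staircase is one standard way to hold accuracy near a target level. The sketch below (with hypothetical step sizes and response-time parameters, not the actual ACE algorithm) shows how making the 'miss' adjustment three times the 'hit' adjustment drives the hit rate toward ~75%, regardless of the participant's underlying speed.

```python
# Sketch of an adaptive response window: shorten the window after a hit,
# lengthen it after a miss. At equilibrium, p * hit_step = (1-p) * miss_step,
# so with miss_step = 3 * hit_step the hit rate settles near p = 0.75.
import numpy as np

rng = np.random.default_rng(3)

def run_staircase(true_rt_mean=0.6, true_rt_sd=0.12, n_trials=2000):
    window = 1.0                         # response window in seconds
    hit_step, miss_step = 0.01, 0.03     # miss/hit ratio of 3 -> ~75%
    hits = []
    for _ in range(n_trials):
        rt = rng.normal(true_rt_mean, true_rt_sd)  # simulated response time
        hit = rt <= window
        hits.append(hit)
        window += -hit_step if hit else miss_step
        window = max(window, 0.05)       # keep the window positive
    return np.mean(hits[500:])           # accuracy after convergence

print(f"asymptotic hit rate ~= {run_staircase():.2f}")
```

Because the window tracks each individual's own response-time distribution, the same mechanism yields comparable challenge for a fast 8th grader and a slower 3rd grader, which is the property the in-game accuracy check above was designed to verify.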

Novel EF Modeling
We next demonstrate that, even when solving for persistent challenges to measuring EFs, attempts to test the differentiation hypothesis using latent variable modeling yield largely uninterpretable results and provide little insight into how to improve such models. Instead, network analysis provides a flexible, data-driven approach to understanding how the developmental process unfolds and hypothesis-generating insights into how experience might shape this development.
Latent Variable Analysis. To directly test the differentiation hypothesis using latent variable modeling, we compared a series of models to establish the number of distinguishable EF components at each stage of development using confirmatory factor analysis (CFA; see Methods, Figure S2). In accordance with the differentiation hypothesis, we expected that the number of EF components, represented by more complex models with more unique factors, would provide better model fit for older students. However, we did not explicitly incorporate Common EF into these models, as its inclusion would prevent statistical tests comparing alternate models. Instead, we examined correlations between factors to assess when these components could be differentiated beyond the unifying Common EF factor. Correlations greater than 0.7 between factors indicate that components represent redundant information (sharing more than 49% of variance) and are therefore likely not fully differentiated from one another. Based on prior adult literature and the tasks used in the current study, the number of components could range from one to three, with the maximally-differentiated organization of EFs representing WM, IR, and RI components.
Overall, despite resolving measurement constraints in prior tests of the differentiation hypothesis, the latent variable approach revealed an indeterminate developmental progression of differentiation of EF components. Generally, results suggest that a single component best describes the organization of EF from 3rd through 4th grade, after which three distinct EF components can be identified. However, this pattern is not unequivocal, and many open questions remain. Statistical comparisons ( Table 2) and estimates of model fit (Table S1) indicate that for both the 5th-6th and 7th-8th grade cohorts, a single component was adequate to describe the structure of EF at the first timepoint. These results suggest there may be continued refinements of EF structure over time but provide little information that can be used to begin specifying the details of this potential refinement. Further, the degree of differentiation of these factors from Common EF was unclear; factor correlations suggest WM differentiates by about 5th grade (mean r between WM and RI = 0.40; mean r between WM and IR = 0.54); however, a persistently high degree of overlap between RI and IR (mean r = 0.69) leaves open the question of whether one or both of these components would be distinguishable from Common EF (see Tables S2-S6 for the full list of factor loadings and correlations). Without statistical methods to determine when components become distinct from both other EFs and Common EF, the use of latent variable models to answer questions about the differentiation hypothesis becomes even more untenable. And while researchers could instead turn to other forms of latent variable analysis (e.g., exploratory factor analysis), these methods similarly cannot account for Common EF.
As such, instead of latent variable modeling, data-driven methods that both account for what is common among EF task performance and can generate further testable hypotheses are needed to understand the complex and nuanced evolution of EF organization during middle childhood.
Network Analysis. Next, we demonstrate how using network analysis to treat EF task performances as an interconnected set of cognitive processes leads to insights into their development that were missed using the predominant modeling approach of the field. With data-driven methods for grouping task performances according to the strength of within-group compared to between-group associations, EF component construction was not restricted by theoretical assumptions about which tasks draw on each EF component.
Further, the degree of differentiation of components identi ed with this method is unambiguous; components are only identi ed if they are distinct from the unifying Common EF component.
By applying network analysis techniques, we established a clear developmental timeline of EF organization and revealed several critical insights into how EFs evolve over time. First, while both methods (latent variable and network modeling) point to EF organization stabilizing around 5th grade, network analyses revealed that a single, undifferentiated component of EF is an unlikely organization for any age in grades 3 through 8.
Second, both methods suggest great variability in the 3rd to 4th grade cohort and continued refinement from 5th to 8th grade, but network analysis reveals which EFs are developing and in what way. Finally, unlike latent variable analysis, the metrics generated from network analyses can be used to gain further insight into the development of EFs and develop new hypotheses around their trajectories. We next discuss these findings in more detail.
Both latent variable and network approaches suggest that, for younger students, there are fewer distinct EF components and that by about 5th grade, EFs can be organized into three distinct components. However, network analyses provide a clearer and more consistent developmental trajectory. Critically, the interpretation as to whether components are distinct from Common EF is straightforward in network analysis because communities are only formed if they are distinct from Common EF and from each other. We can therefore be confident that these communities represent distinct, differentiated components of EF, unlike latent variable analysis, which requires interpretation of between-factor correlations. Network analysis results were also more consistent in terms of the number of differentiated EF components compared to factor analysis results.
Factor analyses pointed to at least one timepoint per cohort where the number of factors that best fit the data differed from the other three timepoints. Network analysis, in contrast, only showed a different number of components across time for the 3rd-4th grade cohort (the cohort that showed the most variability in both analyses). Additionally, consistent with prior literature with this age range 9 , network models never indicated a single, undifferentiated component of EF. Further, network analysis revealed a more nuanced organization of components: task performances did not always cluster as would be predicted by theory, as described below.
Using factor analysis, such differences were missed, since the model using the theory-driven organization of EFs fit the data reasonably well, and there was no indication that a different organization might be a better representation of EF constructs.
While both methods indicated the organization of EF task performances was most variable early in development, unlike latent variable modeling, network graphs ( Figure 2) show that the number of communities for the 3rd-4th grade cohort was fairly consistent across timepoints, yet the composition of these communities was variable. EF organization for the older cohorts, though, was relatively stable. For both the 5th-6th grade and 7th-8th grade cohorts, the tasks almost always formed three communities with groupings consistent with those predicted by theory. However, at Timepoint 2, Impulsive Attention was grouped with WM tasks for the 5th-6th grade cohort, while Tap and Trace was grouped with IR tasks for the 7th-8th grade cohort. Thus, while EFs can be organized into three distinct components by about 5th grade, the organization of the IR and RI components in particular continues to undergo refinement across development.
See Figure S3 for estimates of all edge weights with parametric bootstrapped 95% confidence intervals.
A unique benefit of network analysis is our ability to leverage the resulting network metrics to quantify and compare the degree of network stability across cohorts. A one-way ANOVA was used to directly interrogate whether correlations between network connections across timepoints differed by cohort. These results further support the characterization of the period between 3rd and 4th grade as one in which the organization of EFs is undergoing large degrees of change, whereas development between 5th and 8th grade may be more incremental. Together, these results illustrate how a holistic examination of the EF system can reveal novel insights into how these processes develop, beginning to resolve the inconsistencies across the literature that have emerged from the use of a reductionist framework.
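For concreteness, a one-way ANOVA of this kind reduces to a ratio of between-cohort to within-cohort variance in the stability correlations. The sketch below computes that ratio by hand; the cohort values are simulated stand-ins, not the study's data.

```python
# Sketch: one-way ANOVA F statistic computed directly with numpy, applied
# to hypothetical network-stability correlations for three cohorts.
import numpy as np

def one_way_anova(groups):
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w

rng = np.random.default_rng(4)
# Hypothetical stability correlations: lower and noisier for the youngest cohort.
g34 = rng.normal(0.45, 0.10, 12)   # 3rd-4th grade cohort
g56 = rng.normal(0.70, 0.05, 12)   # 5th-6th grade cohort
g78 = rng.normal(0.72, 0.05, 12)   # 7th-8th grade cohort

f, df_b, df_w = one_way_anova([g34, g56, g78])
print(f"F({df_b}, {df_w}) = {f:.1f}")
```

A large F here reflects the pattern described in the text: edge-weight correlations across timepoints are systematically lower for the youngest cohort than for the two older cohorts.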

Limitations
This study makes significant strides in our approach to measuring and modeling EFs, improving on several critical limitations in the field. Yet, further advancements must continue to be made to build upon and address limitations of the work presented here, particularly in regard to the scope of EFs assessed and the availability of statistical methodologies to compare network models longitudinally.
Developing a novel, adaptive battery of EF tasks for all ages and abilities was not without its challenges; a design decision limited our ability to robustly measure all EF components. Specifically, the WM component was indexed by only two measures, which limited the type of latent variable model that could be constructed and tested here. While a third task hypothesized to measure WM, Filter, was originally included in the ACE battery, it required a different adaptive mechanism, which resulted in age-related differences in challenge level, leading to its exclusion from the current analysis. Consequently, we could not test certain factor configurations (see Methods) without rendering the models uninformative.
Further, as discussed previously, the list of EFs measured here is not exhaustive. In particular, the components 'updating' and 'cognitive flexibility' as popularized by Miyake and colleagues' work modeling EFs in adults 28 were not assessed here. Due to time constraints associated with in-school testing sessions, we were limited to the ten tasks discussed above, and thus chose to focus on particularly understudied aspects of EF. Using the adaptive assessment framework outlined here, it is our intent that additional EF assessments tapping other components be developed. Future work can then incorporate these additional components in the network analytic framework to expand the developmental model introduced here.
There were also methodological challenges related to comparing two analytical approaches (latent variable and network analyses). We intentionally did not explicitly model the dependency of multiple observations per student that occurs with longitudinal data in either analytic approach. While it is possible to model this dependency using factor analysis, the development of network models that can handle longitudinal data is still in its infancy (though see 32 ). To keep the general modeling strategy consistent so that inferences would be comparable, we treated all observations as independent in both approaches. However, this strategy is unlikely to have affected the results for two reasons. First, without accounting for within-person changes, within-cohort comparisons were more conservative than necessary. Second, we did not perform tests that may have been more likely to be affected by treating observations as independent, such as testing for differences in the strengths of within-cohort associations between task performances. Nonetheless, as network analytic methodology continues to advance, it will be important to also advance the methods used to reveal the evolution of EF structure across development.
Finally, neither modeling approach was able to simultaneously account for Common EF and provide statistical comparisons between models of differing complexity. With latent variable analysis, it is a straightforward process to compare whether a model with more factors fits statistically better than a model with fewer factors. These capabilities, though, are currently limited with network models (though see 33 ).
Community detection algorithms provide a likely grouping for task performances, but there is no index to determine whether, for example, a two-community network explains EFs just as well as a three-community network. However, existing methods for accounting for Common EF in the latent variable approach preclude such statistical comparisons between models, leaving unresolved the theoretical problem of how to account for Common EF in the context of differentiation of components. To date, the network analysis approach, which accounts for commonality among all EF task performances rather than treating it as a separate component entirely, presents a promising solution for accounting for Common EF. The rapidly emerging statistical approaches for testing network model complexity position this technique as the path forward in establishing the developmental trajectory of EFs.

Conclusions And Future Directions
This study presents potentially game-changing methods for understanding precisely how EFs differentiate across middle childhood. Adaptive algorithms in our EF assessments allow us to meet the learner where they are, regardless of ability, and allow for multiple assessments within-subject over time. The network analytic approach offers exciting new avenues for understanding the development of EF as a dynamic interconnected network of skills that can align behavioral and neural models. With more methodological advancements on the horizon to improve the statistical precision of network modeling results, we expect this approach will critically advance our understanding of how a learner's EF structure evolves across middle childhood and beyond.
The future potential for network analysis to understand complex cognitive constructs is bright. Researchers in related fields have already begun to capitalize on information gained from taking a network analytic perspective to understand other cognitive processes. For example, Kan and colleagues 34 demonstrated how fit statistics can be obtained for network models, allowing a direct comparison between network and latent variable models. As such, future research could directly compare a variety of configurations of EF modeled using latent variable analysis to those using network analysis to determine which organization best fits observed EF performance. While outside the scope of the current paper, researchers in the field of intelligence have used this approach to show that modeling aspects of intelligence as mutually and reciprocally related through a network framework is favored over modeling an overarching umbrella component ('g') in a latent variable framework 35 . Given this field's similar dilemma around how to quantify developmental differentiation in the presence of task commonality 36 , we anticipate such investigations in EFs will be similarly fruitful for elucidating the mechanisms through which skill changes arise.
Further, as methods for appropriately modeling longitudinal data emerge, network analysis provides an avenue for understanding the potential reciprocal relationships among EFs over time 32 . For example, in a separate study we are examining how growth in performance on individual tasks is connected. By using a network framework for investigating EF skill growth, we can evaluate whether the same communities formed when modeling contemporaneous ties between task performances also emerge when looking at their patterns of growth across time. Such evidence would reinforce the identity of the communities as distinct components of EF and allow us to answer whether components of EF emerge independently or in tandem with other components.
Such insights into the development of EFs are critical for advancing our understanding of how they influence, and can be influenced by, internal and external factors. For example, EFs are often the focus of educational interventions with the goal of improving academic-related outcomes (see e.g., [37][38][39] ). Network analysis is well poised to generate hypotheses regarding which EF tasks or components might be more likely to transfer outside a training regime, which can then guide future training studies. Indeed, the findings from the current study provide a clear set of testable hypotheses: given that the cross-sectional network models found here suggest that WM is less strongly connected to other EF components, future training studies should test the hypothesis that training a highly connected component such as IR would be more likely to result in transfer to other EFs compared to training on the less-well-connected WM component.
The findings from this study showcase how advances in assessing EFs and an increasingly popular modeling technique, network analysis, can be applied to the field of EFs to better align behavioral and neural investigations. The dual paradigm shifts to network analysis using adaptive measures provide a promising pathway for refining and specifying our understanding of how EFs develop. These insights can in turn be applied to advance our understanding of EFs' wide-reaching impact on factors related to physical and cognitive health across the lifespan 2 . Together, our improved methodological approaches to measuring EFs can lead to the development of improved methods for training EFs, providing students the proper foundation they need for learning and future educational success.

Methods
Participants in the current study were recruited through their schools as part of Project iLEAD (in-school longitudinal executive function and academic achievement database), a two-year accelerated longitudinal study of EF development in students in grades 3-8. Full details of Project iLEAD are reported in 31 and summarized here.
The study was approved by the Institutional Review Board (IRB) of the University of California, San Francisco and was conducted in accordance with the relevant guidelines and regulations. Written parental or guardian consent was obtained from all participants at the beginning of the study, and verbal assent from all participants was obtained before all in-class data collection sessions. At the end of the study, all students in participating classrooms received snacks and stickers, regardless of participation.

Participants
Nine schools (seven public, one independent, one parochial) from northern California opted to participate in this longitudinal study, which included assessments in the fall and spring of two academic years for a total of four assessment periods. In total, 1,280 students participated over the course of two years. At the beginning of each school year, teachers distributed consent forms to students to take home for parental or guardian review and signature. This first round of recruitment resulted in a total of 1,088 participating students in Year 1: 284 third graders (mean age 8.07 years old, SD = 0.35), 260 fifth graders (mean age 9.98 years old, SD = 0.41), and 544 seventh graders (mean age 11.9 years old, SD = 0.47). In the fall of Year 2, we re-opened enrollment to participating classrooms to allow new students to participate in the study, which resulted in an additional 195 students joining the study (44 fourth, 147 sixth, and 4 eighth grade students). The Year 2 sample thus included 1,106 students: 288 fourth graders (mean age of 9.03 years old, SD = 0.33), 336 sixth graders (mean age of 10.9 years old, SD = 0.39), and 482 eighth graders (mean age of 12.9 years old, SD = 0.44).
Our sample was demographically diverse; see Table 4 for demographics of participating students at each of the four timepoints. Gender category reflected the self-identified gender of the students as expressed to the district and researchers (separately). Additional demographic data (see below) were provided for students enrolled in public schools and whose parents consented to share this information (n = 1,159 of 1,280).

Procedures
We administered a series of mobile assessments of EF, math, and reading skills that took the form of digital 'games', during school hours, at the beginning and end of each academic year (fall and spring) over two school years (see Assessment Administration details below). At each of the four timepoints, EF assessments were administered during one class period, with the research team returning a little over a month later to administer the math and reading assessments (M = 5.7 weeks, SD = 2.4, min. = 1.9, max. = 10). At the end of each academic year, academic performance and other relevant data were provided by the district for students whose parents consented to share district data.
All tasks were administered in a group setting on iPads. Administration was facilitated by a large team of research staff, in groups that ranged from seven to 83 students (M = 30). A lead facilitator gave verbal instructions to the group for each task, aided by visual instructions from a 24" x 36" flipbook. Participants began each task at the same time, and instructions for the next task were not given until all participants completed the current task. Each task began with practice trials during which researchers monitored participants to ensure participants understood the task and were correctly following task instructions.
Researchers closely monitored the sessions throughout administration to provide technical assistance if necessary, answer student questions, and monitor performance. Administration sessions lasted approximately 50 minutes.

Assessments
Adaptive Cognitive Evaluation (ACE). This study used a novel mobile assessment battery, ACE, to assess EF skills. ACE was developed from cognitive assessments commonly used in lab-based settings and modified for use in real-world settings by incorporating adaptive psychometric staircase algorithms and trial-wise as well as end-of-task feedback 12. For this study, we additionally implemented a user-friendly interface and task tutorials for students in middle childhood. Importantly, the psychometric staircase algorithms allowed the same tasks to be used with the same students over time, enabling benchmarking of an individual's changing cognitive abilities without confounding by ceiling or floor effects. Further, these adaptive algorithms allowed comparison of the same tasks across students of different ages, genders, races, and cultures 12. The assessment battery consisted of a color blindness test, a response time control task, two working memory tasks, an attentional filtering task, two response inhibition tasks, three interference resolution tasks, and a cognitive flexibility task. The attentional filtering and cognitive flexibility tasks were excluded from the current analysis due to differential task challenge across grade levels (attentional filtering) or technical errors that prevented consistent data reporting across timepoints (cognitive flexibility; see 31 for more details). All other tasks are briefly described below, and example stimuli and schematics for each task are presented in Figure S4.
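The text does not specify ACE's staircase parameters, but the general logic of an adaptive staircase can be sketched. The following is a generic 1-up/1-down rule in Python; the function name, starting level, step size, and bounds are illustrative assumptions, not ACE's actual implementation:

```python
def staircase(responses, start=3, step=1, lo=1, hi=12):
    """Generic 1-up/1-down staircase sketch: difficulty (e.g., span length)
    rises after a correct response and falls after an error, converging
    near the level a participant can just manage.

    responses: iterable of booleans (True = correct).
    Returns the difficulty level in effect on each trial.
    """
    level = start
    levels = []
    for correct in responses:
        levels.append(level)
        if correct:
            level = min(hi, level + step)   # harder after a success
        else:
            level = max(lo, level - step)   # easier after a failure
    return levels

print(staircase([True, True, False, True, False, False]))  # → [3, 4, 5, 4, 5, 4]
```

Rules like this keep each student near their own threshold, which is what makes performance comparable across ages without ceiling or floor effects.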
Response Time Control Task. The first ACE task was a measure of basic response time (BRT; Figure S4A). Because improvements in EFs have also been associated with improvements in general processing speed (e.g., 40,41), BRT was designed to serve as a covariate to be regressed from performance metrics of all other ACE tasks.
Color Blindness Test. The second ACE task was a screening assessment for red-green color blindness 30 (Figure S4B).
Working Memory. Two tasks were used to measure working memory, Forward Spatial Span (Figure S4C) and Backward Spatial Span (Figure S4D). These two tasks were digital modifications based on the Corsi Block Task 42. In these tasks, students were shown an array of open circles, with a target sequence cued via circles becoming filled, in sequence, with either green (Forward Spatial Span) or blue (Backward Spatial Span) color. Sequence length increased according to performance (see Adaptivity below). Once students viewed the cued sequence, they were instructed to recreate it in the same order (Forward Spatial Span) or in reverse order (Backward Spatial Span).
Response Inhibition. Response inhibition was measured with two tasks: the Continuous Performance Task (CPT; Figure S4F), which yielded both the Sustained Attention and Impulsive Attention measures, and Tap and Trace (Figure S4G). For both tasks, students were instructed to respond to a target stimulus and withhold a response to non-target stimuli. CPT is a target detection task adapted from the Test of Variables of Attention 43 (TOVA). This task included two blocked conditions: a target-frequent condition (80% target trials) designed to assess impulse control (Impulsive Attention) and a target-infrequent condition (20% target trials) designed to test sustained attention abilities (Sustained Attention). Tap and Trace is a dual-task assessment adapted from the paradigm described by Eversheim and Bock 44. This task included three blocked conditions: one in which students used their dominant hand to tap when they detected a target stimulus, a second in which they traced a shape with their non-dominant hand, and a third in which they performed both tasks simultaneously.
Interference Resolution. Interference resolution was measured with three tasks: Stroop (Figure S4H), Flanker (Figure S4I), and Boxed (Figure S4J). Stroop is based on the computerized version of the color-word Stroop task described by Mead and colleagues 45, in which students selected the text color (e.g., green) of a centrally presented color word (e.g., BLUE). On 30% of trials the text color and word were congruent, and on 70% of trials they were incongruent. Flanker is a letter flanker task based on the paradigm described by Eriksen and Eriksen 46, in which students were instructed to indicate the middle letter of a string of five letters. On 50% of trials the central and flanking letters were congruent, and on 50% of trials they were incongruent. Finally, Boxed is a top-down/bottom-up attention task based on the visual search paradigm first described by Treisman and Gelade 47, in which students identified a target stimulus in an array of distractor stimuli. This task included four blocked conditions that varied in search type and number of distractors: in each condition, the target was identifiable either by a single feature (color) or by the conjunction of two features (the color of the target and the location of the opening in the target box), among either a low (3) or high (11) number of distractor stimuli.

Analysis Methods
For each ACE task, we defined a priori the metric that would best reflect the construct of interest, guided by the psychometrics of each assessment. We describe the rationale for each construct in turn below. For BRT, the metric of interest was mean response time collapsed across dominant and non-dominant hands. For the Color Blindness Test, we assessed whether students selected a response indicating red-green color blindness according to the scoring guidelines in Ishihara (1972). The metric of interest for Forward and Backward Spatial Span was span length, the maximum number of spatial locations held in mind in the correct sequence. For Sustained Attention, we used a metric that is sensitive to lapses in attention: the standard deviation of response times to infrequently presented targets 48. For Impulsive Attention, we used a metric that measures detection of targets while accounting for withholding prepotent responses to frequent non-targets, the signal detection metric d' 48. This metric is an index of sensitivity reflecting a student's ability to faithfully detect a target, measured as the difference between the normalized hit rate and the normalized false-alarm rate 49. For the dual-task assessment (Tap and Trace), the metric of interest was how reliably students could detect a target vs. a distractor during the dual-task portion of the task; thus, we again used d'. For tasks in which students were expected to respond on every trial, we used Rate Correct Score (RCS) to index performance on both response time and accuracy; RCS was thus used for Stroop and Flanker. Task-level RCS was calculated by dividing the number of correct responses by the product of the mean response time across all trials and the total number of trials responded to 50,51.
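The two composite metrics described above have compact definitions. A minimal Python sketch, assuming hit and false-alarm rates strictly between 0 and 1 and response times in seconds (the original analyses were conducted in R; function names here are illustrative):

```python
from statistics import NormalDist, mean

Z = NormalDist().inv_cdf  # inverse of the standard normal CDF

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: z(hit rate) - z(false-alarm rate).
    Rates of exactly 0 or 1 require a correction (e.g., log-linear),
    omitted here for brevity."""
    return Z(hit_rate) - Z(fa_rate)

def rate_correct_score(n_correct, rts):
    """Rate Correct Score: correct responses divided by
    (mean RT x number of trials responded to), i.e., correct
    responses per unit of total response time."""
    return n_correct / (mean(rts) * len(rts))
```

With RTs in seconds, RCS reads directly as correct responses per second, which is why it jointly indexes speed and accuracy.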
For the visual search task, Boxed, a technical error caused trials in which students responded outside the response window to be scored as incorrect (rather than correct but late). Because correct but late trials could not be distinguished from incorrect trials, RCS could not be calculated in the same manner as for the other tasks. Instead, we used the mean response time across all correct trials as the metric of interest for this task.
Data cleaning procedures. A very small number of students who were red-green colorblind as indicated by the colorblind screener (n = 16) were excluded from analysis, given that several tasks required students to discriminate between targets and distractors based on color. Trials where no response was given when a response was expected or that were anticipatory (response time < 200ms) were excluded from analyses (1.8% of all trials). All remaining trials were evaluated for accuracy regardless of whether or not the response was within the response window. Thus, trials were only considered incorrect if an incorrect response was made (and not if they were correct but occurred outside the response window).
Data from each student were evaluated and cleaned on a task-level basis at each timepoint. For each task, to be included in data analysis, students must have answered a minimum of five trials per condition and achieved above-chance accuracy on the easiest condition (i.e., the condition requiring the lowest cognitive load; see 31 for a full description of task conditions and how chance-level performance was determined for each task). Data from each task were then evaluated for outlier students based on performance within each cohort and timepoint. Outlier performance was defined as performance falling outside three median absolute deviations (MADs) of the median performance of the relevant cohort at a given timepoint 52. Finally, additional outlier analyses to identify influential observations in the larger regression analysis of task performance were conducted by computing Cook's distance; observations with Cook's d > 1 were removed 53. These cleaning procedures resulted in the exclusion of 1.9% of the task-level data collected.
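The MAD-based exclusion rule can be sketched as follows. Note that some implementations additionally scale the MAD by 1.4826 for consistency with the standard deviation under normality; the text does not specify this detail, so the sketch uses the raw MAD:

```python
from statistics import median

def mad_outliers(scores, n_mads=3):
    """Flag scores falling outside n_mads median absolute deviations
    of the group median (the exclusion rule described above; the
    cutoff of 3 MADs follows the cited recommendation).

    Returns a list of booleans, True where a score is an outlier.
    """
    med = median(scores)
    mad = median(abs(x - med) for x in scores)  # median absolute deviation
    return [abs(x - med) > n_mads * mad for x in scores]

print(mad_outliers([1, 2, 2, 3, 100]))  # → only the 100 is flagged
```

Unlike mean/SD cutoffs, the median and MAD are themselves robust to the outliers being screened for, which is why they are preferred for this step.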
Confirmatory Factor Analysis. While data-driven organizations of variables are possible within an exploratory factor analysis (EFA) framework, exploratory approaches do not provide a straightforward way to account for the high degree of overlap between performance on EF tasks, and assignment of a behavior to a latent variable is dubious, often resulting in uninterpretable organizations 54. Thus, we conducted separate confirmatory factor analyses (CFAs) for the three cohorts at the four timepoints to avoid assuming that the structure of EF remained the same across all timepoints for any group, and to assess the stability of these structures over the two-year measurement period. We evaluated five models of EF: the maximally differentiated structure with three distinct factors, all possible permutations of a two-factor model in which two of the three factors are collapsed into one, and the simplest structure in which all tasks load on a single, undifferentiated EF factor (see Figure S2).
All CFAs were conducted in Mplus version 8.1 55 with the robust maximum likelihood estimation method. To statistically compare nested models, we used Satorra-Bentler scaled chi-square tests with degrees of freedom equal to the difference in the number of free parameters between the comparison and nested models 56. These tests help determine whether more complex representations of EF provide a better fit to EF task performance across middle childhood. Because these statistics compare nested models, the 1-factor model was compared to each of the 2-factor models, and each of the 2-factor models was compared to the 3-factor model; the 2-factor models cannot be statistically compared to each other in this manner.
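The scaled difference statistic follows the standard Satorra-Bentler (2001) formula, computed from each model's uncorrected chi-square, scaling correction factor, and degrees of freedom. A sketch for illustration (Mplus computes this internally; the function name is an assumption):

```python
def sb_scaled_chisq_diff(T0, c0, df0, T1, c1, df1):
    """Satorra-Bentler scaled chi-square difference test.

    T0, T1:   uncorrected ML chi-square values for the nested (more
              constrained) and comparison (less constrained) models
    c0, c1:   scaling correction factors reported for each model
    df0, df1: model degrees of freedom (df0 > df1)

    Returns (scaled difference statistic, difference in df); the
    statistic is referred to a chi-square distribution with that df.
    """
    # scaling factor for the difference test
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)
    return (T0 - T1) / cd, df0 - df1
```

One known caveat: in small samples the scaled difference can come out negative, in which case a strictly positive variant of the test is recommended.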
In interpreting these results, we took a conservative approach in which a more complex model was selected over a less complex model only if the more complex model was preferred regardless of which 2-factor permutation was considered. The results of chi-square difference testing were corroborated by converging evidence from the Comparative Fit Index (CFI), root mean square error of approximation (RMSEA), Akaike Information Criterion (AIC), and sample-size adjusted Bayesian Information Criterion (BICc). CFI values > 0.90 were considered excellent model fit, with values closer to 1 indicating better fit. RMSEA values less than or equal to .06 were considered adequate model fit 57, with lower values indicating better fit.
Models explicitly incorporating a Common EF factor were not tested here. Although Common EF can be incorporated as a higher-order umbrella factor reflecting what is common among all lower-order components, such an approach is not amenable to testing the differentiation hypothesis: the earliest stages of differentiation, in which fewer than three components are differentiated from Common EF, would not be properly identified 58 and so could not provide meaningful insight into the patterns of behavior being modeled. An alternative approach is to incorporate Common EF as an additional lower-order latent variable rather than a hierarchical superordinate variable. In such 'bifactor' models, each observed variable measures two latent variables: Common EF and another differentiated component. Although these models can be easier to identify in some instances, such complex models can be difficult to converge given the historically low power and task reliability observed in extant examinations of EF 15. The differentiation hypothesis could in principle be tested with a bifactor model, but such a model would not be identified for this dataset without assuming that performance on the WM tasks contributes equally to both the WM and Common EF factors, an assumption that has not been supported in the literature 16,59.
Network Analysis. Replicating the general approach used for the latent variable models, we created separate models of EF performance for each cohort and timepoint. All network analyses were conducted in R 4.0.4 60. Network models were estimated using the bootnet package 61,62. All models were fully saturated partial correlation networks (non-regularized Gaussian Markov random fields), and missing data were handled via full information maximum likelihood. After estimating each network model, the Spinglass algorithm 63 from the igraph package 64,65 was applied separately to each network to determine communities of tasks. The Spinglass algorithm was selected over other community detection algorithms, such as the Louvain algorithm, because it can handle negative partial correlations in a network. To ensure the stability of groupings, community detection was performed 1,000 times and the most frequent grouping is reported here. Resulting network and community detection results were displayed graphically using the qgraph package 66,67. For graphing purposes, nodes were fixed to the same positions across networks, and partial correlations between -0.1 and 0.1 are not displayed. To assess network stability over time, edge weights from each network were correlated with those of each other network; because edge weights represent partial correlations, they were first Fisher transformed before being correlated.

Figure 2. Growth in performance on executive function metrics of interest for each task and cohort. With few exceptions, all participants improved over time, and younger students tended to show the greatest gains, as indicated by significant main effects. Shaded regions represent the 95% confidence interval of the linear regression of time on performance.
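The edge-weight stability comparison amounts to Fisher-transforming each network's partial correlations and then correlating them across networks. The original analyses were conducted in R; a language-agnostic sketch in Python (function names are illustrative):

```python
import math

def fisher_z(r):
    """Fisher's z-transformation of a correlation: atanh(r).
    Makes correlation values approximately normal before comparison."""
    return math.atanh(r)

def edge_weight_similarity(edges_a, edges_b):
    """Pearson correlation between two networks' Fisher-transformed
    edge weights (the cross-timepoint stability index described above).
    edges_a, edges_b: corresponding edge lists of partial correlations."""
    za = [fisher_z(r) for r in edges_a]
    zb = [fisher_z(r) for r in edges_b]
    ma, mb = sum(za) / len(za), sum(zb) / len(zb)
    num = sum((x - ma) * (y - mb) for x, y in zip(za, zb))
    den = math.sqrt(sum((x - ma) ** 2 for x in za)
                    * sum((y - mb) ** 2 for y in zb))
    return num / den
```

Two identical networks yield a similarity of 1.0; lower values indicate reorganization of edge weights between cohorts or timepoints.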