Fitting Problems: Evaluating Model Fit in Behavior Genetic Models

In behavior genetics, like many fields, researchers must decide whether their models adequately explain their data – whether their models "fit" at some satisfactory level. Well-fitting models are compelling, whereas poorly-fitting models are not (Rodgers & Rowe, 2002). Oftentimes, researchers evaluate model fit by employing "universal" rules of thumb (e.g., Hu and Bentler, 1999). However, these rules are not universal and should be treated as model-specific (Kang et al., 2016). Accordingly, we focused on developing fit criteria emulating Hu and Bentler (1999) for classic univariate models (ACE; CE; AE) by fitting simulated twin data with correctly- and incorrectly-specified models. Ideal criteria should consistently accept correct models and reject incorrect models. Classic ACE models were indistinguishable, and virtually all fit indices were non-informative because (or especially when) they were obtained from saturated models. For non-ACE models, criteria were informative. Nevertheless, every fit metric employed, except TLI, differed markedly across models and/or conditions. Universal solutions remain elusive, but promising and valid approaches include nested model comparisons, increasing degrees of freedom, and ruthless skepticism.

Although general behavior genetics findings replicate, many specific and once-classic findings have failed to do so, such as candidate genes for intelligence (Chabris et al., 2012; Sniekers et al., 2017) and stress-by-gene interactions (Risch et al., 2009). The specific reasons for those failed replications are beyond the scope of this paper, but their general causes are not. Many of the failed replications addressed methodological weaknesses from the original paper, often by following best practices. Accordingly, we should continue to implement methodological best practices in all behavior genetics studies, rather than just a select and anomalous few.
In behavior genetics, like other areas of psychology, researchers decide whether their models adequately explain their data – whether their models "fit" at some satisfactory level. Well-fitting models are compelling; poorly-fitting models are not (Rodgers & Rowe, 2002; cf. Roberts & Pashler, 2000). Often, researchers evaluate model fit by employing "universal" rules of thumb, which provide fit-index-specific criteria to declare a model well-fitting. Unsurprisingly, papers believed to provide such criteria are among the most highly-cited academic articles. For example, Hu and Bentler (1999) has been cited over 56,000 times. Although popular and listed in many guidelines (e.g., Appelbaum et al., 2018; Cumming, 2014), these rules of thumb should not be treated as universal. Rather, these model-fit criteria should be viewed as model-specific and influenced by the measurement model (Hancock & Mueller, 2011; Kang et al., 2016). Cutoff criteria in Hu and Bentler (1999) cannot be generalized beyond the single-factor model. For example, if these criteria were applied to five-factor data, the best fitting model would have four or fewer factors (Garrison, 2017). Moreover, many works, including Savalei (2012), Millsap (2012), and Marsh, Hau, and Wen (2004), highlight that incorrect models can appear to be well-fitting when universal criteria are applied. Accordingly, cutoff criteria need to be generated for the specific model of interest. Otherwise, researchers risk rejecting well-fitting models or supporting poor-fitting models, committing errors of the first and second kind, respectively. This paper aims to address two major questions: 1) How are behavior geneticists evaluating model fit, and are those methods conforming to best practices?
2) What criteria are effective to evaluate model fit for behavior genetic models?
In the next section, we provide an overview of behavior genetic models, focusing on matrix specification and model identification. Then, we review evaluating model fit for SEM, where we attend to best practices and evaluation of just-identified models. Following, we present a study of current practices in behavior genetic model fitting, followed by a simulation study. We conclude with summary recommendations.

Behavior Genetic Modeling
The aim of twin research and, more broadly, behavior genetics research is to "distinguish[…] between the effects of tendencies received at birth, and of those that were imposed by the circumstances of their after lives; in other words, between the effects of nature and of nurture" (Galton, 1876, p. 392). Such designs have been used to understand the genetic and environmental influences on virtually every construct on which people vary (Polderman et al., 2015). These individual constructs range from the benign, such as height (Holzinger, 1929; Rodgers et al., 2019; Silventoinen et al., 2003), to the potentially controversial, including intelligence, educational attainment, and economic success (Burks, 1938; Burt, 1966; Herrnstein & Murray, 1994; Jensen, 1969; Murray, 1998; Thorndike, 1905). Researchers have used a variety of kinship groups in their biometrical comparisons, including twins (Galton, 1876), different types of siblings (Thorndike, 1905), cousins (Fisher, 1919), and adoptees (Snygg, 1938); this review will focus on twin studies because they are the most popular within this discipline. Indeed, the "classic" behavior genetic study is a twin design (Liew et al., 2005; Rende et al., 1990), comparing correlations (or covariances) among monozygotic twins to correlations (or covariances) among dizygotic twins.
The "classic" way to estimate heritability from twin designs has developed nearly in lockstep with methodological advancements (Jinks & Fulker, 1970; Rende et al., 1990). Before the development of the correlation (Galton, 1896; Pearson, 1909; Stigler, 1989) and the wider acceptance of twin types (Fisher, 1919; Newman, 1917; Siemens, 1924; Smith, 1856), the method was qualitative (Galton, 1876), with nebulous categories of nature and nurture. After those developments, the method became algebraic and model-based (Falconer, 1952; Holzinger, 1929), with important underlying assumptions and less nebulous categories of genetic and environmental influences. Other developments led to incremental modeling improvements (see Jinks & Fulker, 1970 for added treatment on this topic).
Incremental algebraic gains were upended with the insight that confirmatory factor analytic models could be repurposed for twin designs (Eaves & Gale, 1974; Loehlin & Vandenberg, 1968; Martin & Eaves, 1977). This insight led behavior geneticists to embrace factor analytic models and more general structural equation models (SEM). Structural equation modeling has facilitated great advancements in behavior genetic research, ones that were impossible under the assumptions of older designs. Indeed, some assumptions, such as no gene-by-environment correlations, have been transformed into their own research areas.
Nevertheless, classic twin designs have pushed the limits of structural equation modeling in terms of model identification. In general, a covariance-based model is identified when there exists a unique set of parameters that define the maximum of the fit function (Bekker & Wansbeek, 2003). In less formal terms, a model is identified when free parameters are unique and cannot trade off with one another to produce the same covariance. We direct readers to Hunter et al. (2021) for analytic solutions and further discussion of this issue. The classic ACE twin design produces a just-identified model, whereas the fuller ACDE model is under-identified (Hunter et al., 2021; Jinks & Fulker, 1970; Neale & Maes, 2004). In the fuller ACDE model, the covariance structure for monozygotic twins reared together is

$$ \Sigma_{MZ} = \begin{bmatrix} a^2 + d^2 + c^2 + e^2 & a^2 + d^2 + c^2 \\ a^2 + d^2 + c^2 & a^2 + d^2 + c^2 + e^2 \end{bmatrix}, $$

where a² is additive genetic variance; d² is dominant genetic variance; c² is shared environmental variance; and e² is non-shared environmental variance. For dizygotic twins reared together, the covariance structure is

$$ \Sigma_{DZ} = \begin{bmatrix} a^2 + d^2 + c^2 + e^2 & \tfrac{1}{2}a^2 + \tfrac{1}{4}d^2 + c^2 \\ \tfrac{1}{2}a^2 + \tfrac{1}{4}d^2 + c^2 & a^2 + d^2 + c^2 + e^2 \end{bmatrix}. $$

In this model, there are more unknowns (4: a, d, c, and e) than knowns (3: Cov_MZ, Cov_DZ, and the phenotypic variance V_p), resulting in an under-identified model. Without more information (i.e., additional data points), these unknowns cannot be uniquely calculated. To address this problem, researchers must either add information or reduce the number of parameters. Additional information can be obtained by adding another group (such as full siblings, or twins reared apart) or by using individual-level data. In the case of the classic ACE model, the number of parameters is reduced by removing dominance from the model. Then, the covariance structure for monozygotic twins is

$$ \Sigma_{MZ} = \begin{bmatrix} a^2 + c^2 + e^2 & a^2 + c^2 \\ a^2 + c^2 & a^2 + c^2 + e^2 \end{bmatrix}, $$

whereas the covariance structure for dizygotic twins is

$$ \Sigma_{DZ} = \begin{bmatrix} a^2 + c^2 + e^2 & \tfrac{1}{2}a^2 + c^2 \\ \tfrac{1}{2}a^2 + c^2 & a^2 + c^2 + e^2 \end{bmatrix}. $$

Because we have assumed a lack of dominance in this model, the model now contains three unknowns and three knowns, and we have a just-identified model (Hunter et al., 2020). Although just-identified models can be used to estimate parameters, they cannot be used to evaluate model fit in the traditional sense. Traditional model fit relies on comparing the model of interest to a null model, in the same manner that one would evaluate any null hypothesis. The default null model is often a saturated model, where as many parameters as possible are defined from the data, reproducing the means and covariances perfectly (Widaman & Thompson, 2003). In other words, the default model is a just-identified (saturated) model with a χ² value of zero and zero residual degrees of freedom. Accordingly, if one were to report a χ² test comparing the ACE model to a saturated model, the test would indicate that the model fit perfectly. However, this perfect fit is a consequence of the classic ACE model being just-identified.
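To make the identification arithmetic concrete, here is a minimal R sketch (our illustration, not code from any cited package) that builds the ACE-implied pair covariance matrices and then recovers the three components from the three observed statistics; the existence of this closed-form solution is what makes the classic design exactly just-identified.

```r
# Implied 2x2 covariance matrix for a twin pair, given ACE variance components
ace_implied <- function(a2, c2, e2) {
  vp <- a2 + c2 + e2                                            # phenotypic variance
  list(
    mz = matrix(c(vp, a2 + c2, a2 + c2, vp), 2, 2),             # MZ twins share all A and all C
    dz = matrix(c(vp, 0.5 * a2 + c2, 0.5 * a2 + c2, vp), 2, 2)  # DZ twins share half of A, all C
  )
}

# Three knowns (Vp, Cov_MZ, Cov_DZ) solve uniquely for three unknowns (a2, c2, e2)
ace_solve <- function(vp, cov_mz, cov_dz) {
  c(a2 = 2 * (cov_mz - cov_dz),   # additive genetic variance
    c2 = 2 * cov_dz - cov_mz,     # shared environmental variance
    e2 = vp - cov_mz)             # non-shared environmental variance
}

imp <- ace_implied(a2 = 1, c2 = 1, e2 = 1)
ace_solve(vp = imp$mz[1, 1], cov_mz = imp$mz[1, 2], cov_dz = imp$dz[1, 2])
#> a2 c2 e2
#>  1  1  1
```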
Regardless, nested model comparisons are possible for just-identified models, including classic ACE models. For example, one could compare statistically a nested AE or CE model (both of which are over-identified) to the full ACE model.
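As a brief sketch of how such a nested comparison works in practice (using hypothetical deviance values rather than output from any particular package), the χ² difference test weighs the loss of fit against the difference in estimated parameters:

```r
# Likelihood-ratio (chi-square difference) test for nested models,
# e.g., an AE model (2 free variance components) nested within ACE (3).
lrt_nested <- function(minus2LL_restricted, minus2LL_full, df_diff) {
  stat <- minus2LL_restricted - minus2LL_full   # non-negative for properly nested models
  data.frame(
    chisq_diff = stat,
    df         = df_diff,
    p_value    = pchisq(stat, df = df_diff, lower.tail = FALSE)
  )
}

# Hypothetical deviances: dropping C worsens fit by 1.2 chi-square units on 1 df
lrt_nested(minus2LL_restricted = 4203.4, minus2LL_full = 4202.2, df_diff = 1)
```

As discussed below, the naive χ² reference distribution can misbehave when a variance component is tested at its lower boundary, so the resulting p-value should be interpreted with that caveat in mind.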
The classic covariance-based model (or other similar "two-kinship-category" models) is just-identified, but other models, such as twins-raised-apart-and-together or children-of-twins designs, are over-identified. In those cases, classic methods of evaluating fit ought to apply. Moreover, one can include additional information, such as group means or individual-level data, in addition to covariances. That additional information increases total degrees of freedom and results in an over-identified model, with positive residual degrees of freedom. In these non-classic cases, traditional methods of evaluating model fit can and should be used. Many behavior genetic models are specific applications of structural equation models, and can be treated as such.

Best Practices for Evaluating Model Fit
Since the advent of the replication crisis (Open Science Collaboration, 2015; but see Shrout & Rodgers, 2017, or Ioannidis, 2005), much of the behavioral and social sciences has become interested in improving research practices. Beyond an increased interest in replication, the replication crisis has encouraged social and behavioral scientists to follow best practices for their statistical analyses. These guidelines tend to be broad and flexible, aiming to give scholars enough information to evaluate the model being tested without excessively burdening the authors. Further, this flexibility recognizes that methodology advances, and that guidelines for evaluating fit may change in accordance with those advances.
Within the context of structural equation modeling, the guidelines for best practice are broad (Appelbaum et al., 2018; Lance et al., 2016). For evaluating model fit, best practice recommends that multiple indices across classes be used in conjunction with the χ² statistic. The inclusion of the χ² statistic allows the reader to calculate most other fit indices – or at least those that are log-likelihood- or χ²-based. However, these guidelines can be criticized. For example, Marsh, Hau, and Wen (2004) noted that best practices and recommendations of "golden rules" tended to deemphasize the limitations of each metric and omit caveats from the original papers. Every fit index that we discuss in the next section has limitations. Some of those limitations are obvious (such as χ² being sensitive to sample size; Gerbing & Anderson, 1985), whereas others are less obvious (such as RMSEA being insensitive to omitted cross-loadings; Savalei, 2012).

Specific Methods for Evaluating Model Fit
Because numerous methods exist to evaluate model fit for SEM, we will focus on those recommended in current best practices (Appelbaum et al., 2018; Lance et al., 2016). For a classic (and more thorough) overview, see Bollen and Long (1993).
The two primary ways to evaluate model fit are through model comparison and goodness-of-fit indices.
The classic model comparison is the χ² test (a.k.a. the likelihood ratio test), where the proposed model is compared to the saturated model. Significant values indicate less than perfect fit. Some methodologists consider the χ² test "overly strict" (Mueller & Hancock, 2010, p. 395), "almost always statistically significant" (Kenny, n.d.), and overpowered (Bentler, 1990; Hu et al., 1992). Recent work by McNeish (2018) has indeed found that the χ² test's Type I error rate is inflated at smaller samples.[1]
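For reference, one common formulation of this test statistic under maximum likelihood (stated in general SEM notation, not taken verbatim from any of the cited sources) is

$$ T = (N - 1)\,\hat{F}_{\mathrm{ML}} \;\sim\; \chi^2_{df}, \qquad df = p^{*} - q, $$

where $\hat{F}_{\mathrm{ML}}$ is the minimized maximum-likelihood discrepancy between observed and model-implied moments, $p^{*}$ is the number of non-redundant observed means and (co)variances, $q$ is the number of free parameters, and the χ² distribution holds asymptotically under the null hypothesis of exact fit.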
Nested models can also be compared using the χ² test. However, nested comparisons using the likelihood ratio test suffer inflated error rates for certain applications, such as classically-specified ACE models (Carey, 2005; Verhulst et al., 2019).[2]
Although best practice remains to report the χ² test, many goodness-of-fit indices were developed to address the limitations of the χ² test (Mulaik et al., 1989).
These fit indices can be broadly classified into three classes: absolute, parsimonious, and incremental. The three broad classes prioritize different mixtures of fit with parsimony. Absolute fit indices describe the overall discrepancy between observed and model-implied covariance values; the addition of more parameters improves fit without penalty. Examples include the SRMR (Standardized Root Mean Square Residual), where values below .08 are considered good fit (Hu & Bentler, 1999).
Although parsimonious fit indices also describe the overall discrepancy between observed and model-implied covariance values, these indices penalize added model complexity. Model fit should improve with the addition of parameters only if the benefits outweigh the additional model complexity. Examples include the RMSEA (Root Mean Square Error of Approximation), where good fit is indicated when the lower bound of the 90% confidence interval falls below .05 (Hu & Bentler, 1999). Incremental (sometimes called relative) indices compare fit to a baseline model, typically the null model. Examples include the TLI (Tucker-Lewis Index) and the CFI (Comparative Fit Index), where values above .95 are considered good fit (Hu & Bentler, 1999; Marsh, 1995).
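For readers who want the definitions behind these cutoffs, common formulations (using the fitted model's $\chi^2_M$ and $df_M$, the baseline model's $\chi^2_B$ and $df_B$, sample size $N$, $p$ observed variables, and standardized covariance residuals $r_{ij}$; exact forms vary slightly across sources and software) are

$$ \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2_M - df_M,\, 0)}{df_M\,(N-1)}}, \qquad \mathrm{SRMR} = \sqrt{\frac{\sum_{i \le j} r_{ij}^2}{p(p+1)/2}}, $$

$$ \mathrm{CFI} = 1 - \frac{\max(\chi^2_M - df_M,\, 0)}{\max(\chi^2_B - df_B,\ \chi^2_M - df_M,\ 0)}, \qquad \mathrm{TLI} = \frac{\chi^2_B/df_B - \chi^2_M/df_M}{\chi^2_B/df_B - 1}. $$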

Evaluating Just-identified Models
Many classic methods for evaluating model fit cannot be used on just-identified models, because the null comparison model is itself saturated and just-identified (Widaman & Thompson, 2003). However, some methods can be applied. We have already discussed nested model comparisons between the saturated model and the proposed model; that approach can also be used to select between nested models, including the classic ACE model and models with fewer components, such as the AE or CE model. In addition to nested model comparisons, information criteria can be used, including AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and aBIC (Adjusted BIC). These criteria are typically used to compare non-nested models, but they can also be used with nested models, provided the models are fit to the same data.
Like the fit indices discussed earlier, these criteria aim to balance parsimony and fit, using the likelihood function rather than the χ². The idea is to minimize the information lost between the data-generating process and the model used to approximate it; better fitting models lose less information. For example, AIC is a function of the number of parameters and the maximum value of the likelihood function; models with smaller AIC indicate relatively better fit, as they have lost less information using the same data. Rules of thumb have been developed for evaluating whether these model differences are meaningful, such as differences larger than 10 (Burnham & Anderson, 2002). However, such rules are problematic within SEM because of the large sampling variability of fit measures, including BIC (Preacher & Merkle, 2012). Nevertheless, non-nested models can be compared, even within the context of SEM (Merkle et al., 2016).
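Stated concretely, with $\hat{L}$ the maximized likelihood, $k$ the number of estimated parameters, and $N$ the sample size, the standard definitions are

$$ \mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln N - 2\ln\hat{L}, $$

and the sample-size-adjusted BIC (aBIC) replaces $N$ in the penalty term with a down-weighted value (commonly $(N+2)/24$). Lower values indicate relatively better fit, and only differences between models fit to the same data are interpretable.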

Study 1: Current State of the Field
How are behavior geneticists evaluating model fit?
We examined how behavior geneticists are evaluating their model fits. Work by Jackson (2009) on general reporting practices for CFA revealed that most papers followed contemporary guidelines. However, a priori, we suspected that this practice would not extend to the field of behavior genetics. In a recent review of the field's flagship journal, Behavior Genetics, the first author found that many articles were using out-of-date software (Garrison, 2018). In 2016 and 2017, 22% of SEM-based articles in Behavior Genetics used Mx. Mx was last released in 2009 (Version 1.7.03; Neale, 2009); it is no longer compatible with most operating systems; and its developers have moved on to OpenMx (Boker et al., 2011; Neale et al., 2016). Garrison observed that some articles omitted standard fit statistics (e.g., χ²), even in cases when the model was over-identified. In addition, many of those papers focused on AIC and a nested-model comparison. This method mirrors the last release of an Mx-companion behavior genetics ebook (Neale & Maes, 2004). That mirroring suggests that these papers used that book as a guide, without considering whether their own models were just- or over-identified.[3]
Accordingly, this first study aims to quantify those observations.
[1] Inflated Type I errors for ACE models might be the result of non-normal data, rather than an overpowerful test.
[2] The cause of these inflated rates again is likely more from mis-specified sampling distributions. However, the result is the same – an incorrectly performing test.
[3] Neale and Maes (2004) provides an excellent discussion on model identification for classic twin designs.

Methods
As described in Garrison (2018), the first author reviewed two years of publications from the flagship journal Behavior Genetics (2016 and 2017). That paper focused on determining which SEM programs were being employed. The number of articles identified is summarized in Table 1. Using Publish or Perish (Harzing, 2018), we identified 473 "publication units" (all publications) during those two years.

Results
Garrison (2018) focused on SEM software. In the current investigation, we used the same articles to identify which methods were being used to evaluate model fit in the flagship journal Behavior Genetics. The variability in how much attention was paid to model fit was considerable, ranging from dedicated discussion subsections (e.g., Smolkina, Morley, Rijsdijk, et al., 2017) to the total absence of any fit evaluation (n = 8; 20%). We have provided summary statistics in Table 2. Given the variety of methods used and the broad guideline for best practice, we classified each paper on its model-fitting practice. If a paper used more than one method to evaluate fit, we classified that paper as acceptable. If the paper used multiple classes of fit indices (e.g., absolute, parsimonious, incremental), we classified that paper as meeting the standard for better practice. Lastly, if the paper used multiple classes in addition to reporting either the χ² statistic or the log-likelihood, we classified that paper as meeting the standard for best practice. Accordingly, 5% met the standard for best practice (n = 2), 50% (n = 20[1]) for better practice, and 72.5% (n = 29) for acceptable practice. Further, if nested model comparisons are included as a "class", these percentages increase to 42.5% (n = 17) for best, 72.5% (n = 29) for better, and 75% (n = 30) for acceptable.
For context, 80% (n = 32) of articles reported a formal evaluation of model fit. In the remaining 20% (n = 8) of articles, models were selected for a variety of reasons, including "parsimony", visual inspection of covariances, significance of A or C parameters, and findings from previous literature. In general, if an article evaluated fit, it typically seemed to draw inspiration from Neale and Maes (2004). Further, if an article evaluated fit, it used the criteria correctly. Therefore, rather than further focusing on evaluating current practice, Study 2 will focus on developing clear criteria.

[1] This count includes the two articles that met the best practice standard, as well as 18 additional articles that met the better practice standard.

Discussion
To summarize, we evaluated whether 2016-2017 Behavior Genetics papers met the standard for best practice (5%), better practice (50%), and acceptable practice (72.5%). Very few papers met the best standard (Appelbaum et al., 2018; Lance et al., 2016), which recommends that multiple indices across classes (parsimonious, relative, and absolute) be used in conjunction with the χ² statistic. We believe that these recommendations are both generous and reasonable across all SEM models. Unfortunately, most papers published in Behavior Genetics did not meet this standard. They often did not report a second index from another class, such as a parsimonious measure like RMSEA or a relative measure like TLI. Instead, papers tended to include a nested model comparison in place of a second index. Accordingly, the "better practice" standard we developed allowed the nested model comparison to count as a class for evaluating fit. Even then, only half of the papers met this standard. Frankly, the greatest concern is that 20% of papers did not evaluate fit at all. In those cases, it would be impossible for any metric to classify them as meeting some or any standard of practice.
Nevertheless, a positive finding is that when an article evaluated fit, it typically used the criteria correctly. Our interpretation of these findings is that most researchers recognize that evaluating model fit is part of the research process. Accordingly, rather than criticizing current practice or focusing on the current weaknesses in the field, Study 2 will focus on developing clear criteria that can be used to successfully evaluate model fit. Ideally, those criteria would be simple, universal, and meet the standards for best practice.

Study 2: Evaluating Model Fit
Study 2 builds upon Study 1 by determining whether clear criteria for evaluating model fit can be developed, and whether those criteria differ from "classic" SEM criteria. Specifically, we focused on developing custom fit criteria, in the style of Hu and Bentler (1999), for classic univariate ACE models, by fitting models to data generated from both correctly and incorrectly specified models, using the classic twin design comparing MZ and DZ twin correlations/covariances.

Evaluation Plan
Methods of evaluating model fit included (when possible) RMSEA, SRMR, AIC, BIC, aBIC, χ², nested model comparison, CFI, TLI, as well as an additional method identified in Study 1 – the model was kept if it had significant A or C parameters. Our evaluation plan emulated Hu and Bentler's (1999) approach by examining the Type I and Type II error rates for correctly and incorrectly specified models. Unacceptable rates were those that deviated substantially from a 5% Type I error rate and a 20% Type II error rate. The ideal method would be one that achieves both values across all factors within the simulation designs, although it may be unlikely that any measure or combination of measures will meet all of these ideal values.

Data Generation
We examined whether the classic cutoff values for model fit criteria are effective for classifying univariate models correctly. In other words, are misspecified models rejected, and are correctly specified models accepted? In the spirit of Hu and Bentler (1999), we fit a correctly specified model as well as many incorrectly specified models. We have produced a summary table, Table 3, of the conditions outlined below; we used 5,000 replications.

Variance Components. We included one genetic component (additive) and two environmental components (shared and non-shared). Each component condition, except for the non-shared component, could be 0, 1, or 2. We excluded the 0 non-shared variance condition because practically every measure has non-shared environmental variance (Polderman et al., 2015; Turkheimer & Waldron, 2000). These variance components can be converted into classic heritability estimates by taking the proportions of total variance. For example, if the variance components were 1 additive genetic and 1 non-shared environmental, that would result in a² = .5, c² = 0, and e² = .5. Each variance component was generated separately and then combined to create a total score, using the discordsim function from the discord package in R. This function generates biometrically-informed data for up to two measures for each member of a kin pair, using multivariate normal functions for each variance component. These kin pairs can vary in their genetic and environmental similarity. Further, the two measures can have overlapping biometrical influences that are either bidirectional (e.g., correlated factors model, direct symmetric model) or unidirectional (Cholesky model).
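As a simplified illustration of this data-generating logic (not the discord package's actual implementation, and using our own function and variable names), each component can be drawn from a bivariate normal with the appropriate within-pair correlation and the components then summed:

```r
library(MASS)  # for mvrnorm

# Generate one phenotype per twin for n pairs, given ACE variance components.
# MZ pairs correlate 1.0 on A and C; DZ pairs correlate 0.5 on A and 1.0 on C.
simulate_pairs <- function(n, a2, c2, e2, r_a) {
  draw <- function(v, r) {
    if (v == 0) return(matrix(0, n, 2))
    mvrnorm(n, mu = c(0, 0), Sigma = v * matrix(c(1, r, r, 1), 2, 2))
  }
  A <- draw(a2, r_a)  # additive genetic component
  C <- draw(c2, 1)    # shared environment: identical within pairs
  E <- draw(e2, 0)    # non-shared environment: independent within pairs
  data.frame(twin1 = A[, 1] + C[, 1] + E[, 1],
             twin2 = A[, 2] + C[, 2] + E[, 2])
}

set.seed(2718)
mz <- simulate_pairs(n = 500, a2 = 1, c2 = 1, e2 = 1, r_a = 1.0)
dz <- simulate_pairs(n = 500, a2 = 1, c2 = 1, e2 = 1, r_a = 0.5)
cov(mz)  # off-diagonal should approximate a2 + c2 = 2
cov(dz)  # off-diagonal should approximate 0.5 * a2 + c2 = 1.5
```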
Sample Size. We varied the sample size from 200 to 2000 total pairs (N = 200, 500, 1000, 2000). These sample sizes are based on typical sample sizes from Polderman et al.'s (2015) meta-analysis. Pairs were balanced to 50% monozygotic twins and 50% dizygotic twins, as is typical for classic studies (Polderman et al., 2015).
Data Input Method.[1] The generated data were used directly (i.e., at the individual level) or indirectly, via covariance matrices estimated from those same generated data. Each dataset was fit using the following models: ACE, AE, and CE, using both the raw data and the covariance matrix. In each case, the data were fit to two incorrect models and the correctly specified model. For example, if the correctly specified model is an AE model, then the incorrectly specified models fit would be the ACE and CE models. This simulation is a 3 x 3 x 2 x 4 x 2 (genetic variance x shared environmental variance x non-shared environmental variance x sample size x covariance versus raw data; see Table 3) design, with 144 total cells. Data were generated using R. Models were estimated with Mplus. Select conditions were used to validate the data generation process by fitting ACE models to data generated from an E model.[2]
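To make the factorial structure explicit, the 144 cells can be enumerated directly; the sketch below uses our own labels and simply mirrors the factors summarized in Table 3:

```r
design <- expand.grid(
  a2      = c(0, 1, 2),               # additive genetic variance
  c2      = c(0, 1, 2),               # shared environmental variance
  e2      = c(1, 2),                  # non-shared variance (0 excluded)
  n_pairs = c(200, 500, 1000, 2000),  # total twin pairs
  input   = c("raw", "covariance")    # data input method
)
nrow(design)   # 144 cells
n_reps <- 5000 # replications per cell
```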

Rejection Criteria
Rejection criteria were initially planned to be guided wholly by observed methods. However, given the small number of papers that used any given fit index, we selected criteria based on a mixture of the observed practices in the survey along with common practice and popular articles, books, and websites (Hooper et al., 2008; Kenny, n.d.; Kline, 2015). Those criteria are documented in Table 4. Please note that the aBIC, AIC, and BIC, as well as the nested model comparison, are comparative: in each case the candidate model is compared to the saturated model (i.e., ACE) fit to those same data.

[1] An additional series of models were fit using correlation matrices, instead of the covariances or raw data. However, Mplus does not allow multi-group SEM models to be estimated with correlations. Preliminary attempts to work around this restriction by fitting the correlations as covariances with means of 0 led to inconclusive results.
[2] The E model describes any model where only non-shared environmental variance was generated and all other variance components (A and C) were 0.

Results
Given the number of conditions and methods of comparison, we have focused on general trends and the three most promising methods for evaluating fit: TLI, the χ² test, and the nested model comparison. We direct readers to detailed tables in the Appendix and to the accompanying data files for complete results. Results did not differ meaningfully across data input conditions (raw data or covariance-based summary statistics) and typically did not differ until the third decimal place. Accordingly, we focus on the raw-data-based method.
First, we conducted basic validity checks to determine whether the data generation and model fitting processes were performing correctly. Specifically, we examined whether correct models performed "better" than incorrect models in the expected direction for each fit statistic. We also examined how data generated from an E model – that is, a model with only non-shared environmental variance – would fit an ACE model, and whether the descriptive statistics from those models performed as expected.
Second, we present descriptive statistics for models by whether the correct or incorrect model was fit, and also by whether the fitted model had three variance components (i.e., an ACE model) versus only two variance components (i.e., AE or CE). Third, we present Type I error rates and power for model selection, using conventional benchmarks. After each subsubsection, we provide brief discussions.

Validity Checks
In this subsubsection, we verify that the simulation with the discord package worked as intended. We examined, for each fit statistic, whether correct models performed "better" than incorrect models in the expected direction, and whether general summary statistics for ACE models fit to data generated from E models performed as expected. Tables 5-12 contain descriptive statistics for each condition and can be found in the appendix.
As expected, the following measures had lower median levels when the correct model was fit relative to incorrect models: aBIC, AIC, BIC, χ², estimated RMSEA, RMSEA's upper bound of the 90% CI, and SRMR. In contrast, the following measures had higher median levels when the correct model was fit relative to incorrect models: p(χ²), CFI, TLI, and p(RMSEA < .05). All indices performed as expected, indicating that correct models are at least better fitting than those that are not correctly specified.
Next, we examined descriptive statistics for data generated from an E-only model but fit to an ACE model.
As expected, median standardized (and unstandardized) parameter estimates for A and C were 0 across all conditions. Further, median standardized parameter estimates for E ranged from 0.995 to 0.999 across conditions. These values are more extreme than one would hope; however, they are consistent with recent work by Verhulst and colleagues (2019) and reflect the fact that estimated values for C and A are at the boundary of 0 and are restricted to be positive. In addition, distributions of p(χ²) performed as expected; specifically, median values ranged from 0.331 to 0.358, and their 5th percentiles ranged from 0.02 to 0.026 across conditions.

Summary Statistics
In this subsubsection, we present descriptive statistics for models by whether the correct or incorrect model was fit, for TLI, the significance level of the χ² test of model fit, and the significance level for nested model comparisons comparing the ACE model to the selected model of interest. Table 13 in the appendix provides median values for standardized measures of fit with 1%, 2.5%, 5%, 10%, 90%, 95%, 97.5%, and 99% quantiles for correctly-specified models. These values compare favorably to the majority of classic benchmarks. For example, approximately 95% of all correct models reported p(χ²) greater than .05.
Unfortunately, however, the distributions for these measures (as well as for non-standardized methods of fit) overlapped considerably with the distributions from incorrectly specified models (see Table 14 in the Appendix).
Select Findings. Figures 1, 2, and 3 illustrate this general trend of overlapping distributions with box-and-whisker plots. For each figure, the whiskers span the 5% and 95% quantiles, and the interquartile range spans the 25% and 75% quantiles. As illustrated in Figure 1, median levels of TLI differed between correct models (1.001) and incorrect models (0.988). However, the 5% and 95% quantiles overlapped; for correct models, the quantiles ranged from 0.98 to 1.018, and for incorrect models from 0.913 to 1.015. Similarly, distributions overlapped for p(χ²), where median levels were 0.495 for correct models and 0.066 for incorrect models, as illustrated in Figure 2. The quantiles ranged from 0.049 to 0.949 for correct models and from 0 to 0.845 for incorrect models. For nested model comparisons, the trend was again similar, with differing median levels but overlapping quantiles for correct (1, 90% QI [0.2, 1]) and incorrect models (0.021, 90% QI [0, 1]), as illustrated in Figure 3. It is worth noting that although the nested model comparison distributions do overlap, a close inspection of Figure 3 reveals that the large bulk of models are being distinguished correctly.
Next, we distinguished by the type of model fit. Specifically, we compared models with all three variance components (i.e., ACE) to models with only two of the three variance components (AE and CE models).
Median levels generally differed, but correct and incorrect models continued to have overlapping distributions. As expected, incorrect ACE models tended to be indistinguishable from correct models (see Figure 4). However, median TLI values were lower when incorrect two-component models were fit (0.983) compared to correct two-component models (median TLI = 1.001). A similar pattern held for p(χ²), where median levels for incorrect ACE models (0.417) were indistinguishable from those of correct ACE models (0.494).
Similarly, median p(χ²) values were lower when incorrect two-component models were fit (0.008) compared to correct two-component models (median p(χ²) = 0.495; see Figure 5). Lastly, for nested comparisons, the trend was again similar, with indistinguishable median values for ACE models (incorrect 1; correct 1), but lower values when two-component models were evaluated (incorrect 0; correct 1; see Figure 6). Nevertheless, even when distinguished by type, all measures overlapped in their distributions.
When distinguished by both the type of model fit (ACE vs. AE or CE) and sample size, distributions tended to narrow with larger sample sizes. The reduction in spread was most drastic between 200 and 500 pairs, as one would expect. Otherwise, the same trends occurred as described previously. Incorrectly specified ACE models still resembled correct models, often with nearly indistinguishable distributions. As was the case without accounting for sample size, incorrect AE and CE models had distinguishable median levels when compared to incorrect ACE models or correct models in general.
We also examined whether these summary statistics differed across levels of total variance.[1] Total variance ranged from 1 to 6. In general, more variance did not impact model fit when correct models were fit. However, a linear relationship was observed for most statistics when incorrect models were fit, either in terms of decreasing variability (e.g., TLI; Figure 7) or declining median levels. For example, without accounting for the type of model fit, significance levels for the χ² test of overall fit did not vary as a function of total variance when the correct model was fit (e.g., median significance levels were 0.496 when total variance was 2 and 0.499 when total variance was 6; see Figure 8). However, they did vary for incorrect models: increases in total variance were associated with decreases in p-values. Lastly, nested model comparisons behaved similarly (see Figure 9); however, this behavior seems to be more the product of total variance being partially confounded with the type of model fit.
The same relationships with total variance seemed to hold even when accounting for the type of model fit (ACE vs. AE or CE), with one expected exception: incorrectly specified ACE models still tended to be indistinguishable from correct ACE models. We have included Figures 10, 11, and 12 to illustrate the trends for TLI, significance levels for the χ² test of overall fit, and nested model comparisons. For example, significance levels for the χ² test of overall fit did not vary as a function of total variance when the correct model was fit (regardless of kind) or when the incorrect ACE model was fit. Those values did drop off quickly when incorrect AE or CE models were fit (e.g., median significance levels of 0.444 when total variance was 1, 0.106 when total variance was 2, and 0.001 when total variance was 3; see Figure 11).
Brief Discussion. The main finding from the descriptive statistics revealed that although indices for correctly-fit models tended to compare favorably to classic benchmarks, there was considerable overlap in distributions between correctly and incorrectly specified models. These distribution overlaps were present even when compared across the type of model fit (ACE, CE, AE), sample size, and total variance.
These overlaps were most problematic when comparing incorrect ACE models to correct models, where the distributions were practically indistinguishable. However, those indistinguishable distributions are likely the product of the ACE model being saturated.

Power and Type I Error
Next, we examined model rejection rates, using classic benchmarks and nested model comparisons. Type I errors were defined as rejecting a correctly specified model. Power was defined as rejecting an incorrectly specified model. Type II error rates were defined as β (1 − power), or incorrectly specified models that were not rejected. These rates were calculated by condition. For the sake of brevity, we again focus on the three most promising metrics: TLI, p(χ²), and the nested model comparison. We direct readers to Table 15 in the appendix for these rejection rates, as well as for all the other tested metrics, by condition.
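As a sketch of how such rates can be tabulated (the data frame and column names below are hypothetical placeholders, not the study's actual scripts), the rejection rate for a given criterion is simply the proportion of replications in which that criterion rejects the model, split by condition and by whether the fitted model was correctly specified:

```r
library(dplyr)

# `fits` is assumed to hold one row per replication x fitted model, with columns
# condition, correct (TRUE/FALSE), tli, p_chisq, and p_nested (hypothetical names).
rejection_rates <- fits |>
  mutate(
    reject_tli    = tli < 0.95,      # classic incremental-fit cutoff
    reject_chisq  = p_chisq < 0.05,  # chi-square test of overall fit
    reject_nested = p_nested < 0.05  # nested comparison against the ACE model
  ) |>
  group_by(condition, correct) |>
  summarise(across(starts_with("reject_"), mean), .groups = "drop")

# Rows with correct == TRUE give Type I error rates; correct == FALSE give power.
```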
Select Findings. For correctly-specified models, the median rejection rate across conditions (i.e., the Type I error rate) ranged from 0 (significance of model parameters, TLI) to 0.992 for p(RMSEA) < 1. The χ² test had median rejection rates near 0.05 (0.051; min 0.043, max 0.06). In contrast, TLI and nested model comparisons had rejection rates well below .05. For TLI, the median rejection rate was 0 (min 0, max 0.222), and for nested χ² model comparisons it was 0.009 (min 0, max 0.03). Figures 13-15 illustrate variability and differences in rejection rates with box-and-whisker plots. In each figure, power and Type I error rates are benchmarked with horizontal lines at 0.80 and 0.05.
Given that distributions of fit statistics seemed to differ as a function of sample size as well as total variance, we examined whether rejection rates were associated with these values. All rates for correctly specified models were negatively associated with sample size, and virtually all of these correlations were large (absolute values > .4).
These correlations are provided in Table 16 and ranged from -0.972 for RMSEA < .06 to -0.024 for χ² nested model comparisons. Many rates for incorrectly specified models were positively associated with sample size, but this trend was not universal: p(χ²) and nested model comparisons were positively associated with sample size, whereas TLI was negatively associated.
In addition, many rates were associated with total variance for both correctly and incorrectly specified models, although not as universally as with sample size. p(χ²) was positively associated with total variance for both model specifications. TLI was negatively associated for both. However, for nested model comparisons, the association with total variance was positive for incorrectly specified models and negative for correctly specified models. For incorrectly specified models, median rejection rates (i.e., power) varied across conditions. For incorrect ACE models, rates ranged from 0 (significant model parameters) to 1 for p(RMSEA) < 1. All three focal metrics had median rejection rates below .80: TLI (0.005; min 0, max 0.286), nested χ² model comparisons (0; min 0, max 0), and p(χ²) < .05 (0.07; min 0.059, max 0.112). For incorrectly specified AE or CE models, rejection rates ranged from 0 (significant model parameters) to 1 for p(RMSEA > .05). Of our three metrics, only the nested χ² model comparison had a median rejection rate of at least .80 (0.943; min 0, max 1). TLI and p(χ²) < .05 rejection rates were below .80, at 0.045 (min 0, max 0.578) and 0.719 (min 0.057, max 1), respectively, and both had considerable variability.
[1] Total variance was not randomized and was not balanced across correct versus incorrect models, nor across the type of model fit (ACE, AE, CE). For example, no correct model had a total variance of 1.

Discussion
Brief Discussion. This stage examined power and Type I error rates for distinguishing between correctly specified and incorrectly specified models using classic criteria. These examinations revealed that many classic criteria had Type I error rates that deviated drastically (and usually conservatively) from .05. Moreover, within each metric, there was considerable variability in Type I error rates across conditions, even when grouped by type of model fit, sample size, and total variance. All metrics had inflated Type I error rates in at least one condition, except for nested χ² comparisons and RMSEA's lower bound > 0.
Similarly, the power to reject an incorrect model using classic criteria varied considerably across conditions, ranging from 0 to 1, even when grouped by type of model fit (ACE versus CE or AE), sample size, and total variance. The metrics had insufficient power (< .8) in at least one condition, with two exceptions: only p(RMSEA > .05) < 1 and the lower bound of RMSEA > 0 were consistently and sufficiently powered.

Broad Discussion
Study 1 asked how behavior geneticists evaluate model fit, by examining two years of recent publications in the flagship journal Behavior Genetics. We evaluated whether these papers met the standard for best practice (5%), better practice (50%), and acceptable practice (72.5%). Very few papers met the best standard (Appelbaum et al., 2018; Lance et al., 2016), whereas 20% of papers did not evaluate fit at all. Study 2 asked: what are effective criteria to evaluate model fit for behavior genetic models? It addressed this question by conducting a simulation study. Data were generated from a 3 x 3 x 2 x 4 x 2 design, where variance components (A, C, and E), sample size, and data input method varied. These data were then fit to both incorrectly and correctly specified ACE, AE, and CE models in a standard twin design. The analyses were broken into three stages. Stage 1 examined descriptive statistics for correct and incorrect models. Stage 2 examined model rejection rates, using classic benchmarks as well as nested model comparisons. Stage 3 identified alternative thresholds, using three common methods, with the intent of developing a universal metric.
Stage-specific results are discussed at the end of each sub-subsection in the Results section. Broadly, the answer is no: there are no universal criteria that can consistently distinguish between data fit to correct versus incorrect models with anything close to good sensitivity and specificity. This status will be discussed separately for the ACE models and for the AE and CE models.
The fundamental problem seems to be that for data fit to ACE models, it is borderline impossible to distinguish correct models from incorrect models. In the twin design used here, the ACE model is saturated, leaving no degrees of freedom to test model fit, which creates the challenge identified in this study. Further, power and Type I error rates revealed that incorrect ACE models were difficult to distinguish from correct models of any type using classic benchmarks. The ACE models were harder to correctly identify and thus disproportionately contributed to the reduced power and inflated Type I error rates.
Accordingly, these two stages indicated that criteria would likely have the greatest difficulty detecting model misspecification in saturated models, i.e., ACE models.
From the study's onset, it was clear that just-identified models like the ACE model would be challenging for fit statistics to provide useful diagnostic information about. Indeed, the fact that Neale and Maes (2004) dedicated an entire section to model identification and fit suggests that this challenge has long been apparent – and was most likely previously discovered and simply accepted as an inherent headache for modeling in the field of behavior genetics. Indeed, those metrics most used by Neale and Maes (2004) performed relatively better than those metrics not included in their text. This pattern suggests that they may have recognized the problems identified within the current study. However, there is apparently no discussion in the behavior genetics literature to support this possibility.
It is possible that some combination of indices, including both a highly specific metric and a highly sensitive metric, might be created to jointly produce model selection rules that achieve both power greater than .8 and Type I error rates at or below .05. Such a combination of indices would be consistent with best practices for evaluating fit for SEM models and would have the chance of providing universal selection criteria – a series of golden rules rather than a single golden rule. Specifying such a combination was not a goal of the current study but is a useful suggestion for fruitful future research.
Nevertheless, if we had to select a single metric, it would have to be TLI. TLI was the metric that came closest to being universally applicable, and yet it still had considerable variability in sensitivity and specificity.
Furthermore, many of the "optimized" thresholds were right at the boundary of what was possible: TLI ≥ 1. Accordingly, although this metric could be used universally, we still would not recommend it, as it is impractical, might encourage HARKing (hypothesizing after the results are known; Kerr, 1998; Rubin, 2017), and does not require researchers to think critically about the model they are fitting.
Fit statistic thresholds at the boundary are impractical because they set an impossible expectation. Perfectly-fitting models using real data can only be achieved through overfitting or selectively reporting fit indices. Researchers could be incentivized to tinker with their models to meet those thresholds, and the resulting exploratory model might be presented and/or interpreted as a confirmatory one. It is simply an impossible expectation that does more harm than good. And frankly, any model that can perfectly reproduce its covariance matrix is not useful: a model is designed to be a simplification of reality, not a reproduction (Rodgers, 2010).

Concluding Thoughts
We were tempted to provide criteria that differed depending on whether a saturated (ACE) model or an over-identified (AE or CE) model was being evaluated. However, after considering the long-reaching implications and the confidence that declaring a well-fitting model brings, we decided against giving specific criteria. Readers who insist on custom criteria can find them in Tables 17-19 in the appendix. However, we caution against eagerly accepting these benchmarks, especially for the ACE model or any saturated model. If you are evaluating a saturated model – in this paper's case, the ACE model – we encourage you to be ruthlessly skeptical when evaluating your fit statistics. Further, we recommend that you be skeptical of any model whose fit statistics noticeably deviate from perfection. Even if your model "fits," we encourage you to fit additional models to the data that are over-identified, and not merely just-identified. Researchers can either use a design that has more than two kinship categories, add covariates, or at least fit a restricted model like an AE or CE model. If your model does indeed fit well, it will survive this skepticism and will fit better relative to the other models.
More broadly, we think that combining ruthless skepticism with the general best practice of using multiple metrics from across classes of indices should improve our ability to select the better fitting model – ideally the best identifiable model. Although the optimized metrics use individual indices, future work should examine whether a combination of these metrics could achieve sufficient power and acceptable alpha levels. Indeed, if only one lesson is to be learned from these findings, it is the following: behavior genetic researchers should evaluate models beyond a "well-fitting" ACE model. A "well-fitting" ACE model does not indicate that the model is actually well-fitting, or even that the model is likely to be correct. In general, behavior genetics researchers should be ruthlessly skeptical of model fitting results.

Declarations
Funding: The authors were supported by NICHD grant R01-HD087395.
Conflicts of interest: The authors declare that they have no conflicts of interest.
Ethics approval: Not Applicable

Distribution of model rejection rates using TLI across conditions

Distribution of model rejection rates using TLI across conditions by model type

Figure 17

Distribution of model rejection rates using p(χ²) across conditions by model type

Figure 18

Distribution of model rejection rates using nested model comparisons across conditions by model type

Supplementary Files
This is a list of supplementary files associated with this preprint: 5AppendixCode.docx, AppendixTables.docx