Sample demographic and symptom characteristics

Appendix Table 1 reports the socio-economic characteristics of the sample and compares them to those of the US population. Similar to other studies, the children in our sample are from wealthier, more educated, and more urban families than the US average. Representation of minority groups, especially Hispanic and Black, is also lower than their share of the US population. The effects of various socioeconomic variables on the age of diagnosis in our sample are reported in Appendix Table 2.

Table 1 presents the percentage of children who, at the time of diagnosis, displayed each of the signs listed in the survey, together with the median severity among those who reported the sign. Figure 1 presents the distribution of severity values for each sign. As can be seen, the modal responses for most signs are either zero (indicating that a child did not exhibit the symptom) or the highest severity level (10). However, enough intermediate severity values are reported to make severity level a potentially useful input into statistical analyses.

Correlations between severity and age of diagnosis (Univariate Analysis).

Before presenting results using factor analysis and regression trees, we report some basic correlations between each sign and the age of diagnosis. Figure 2 displays graphically, for each sign, the average effect of a one unit increase in reported severity on the age of diagnosis (in months). Each horizontal line describes the 95% confidence interval for the effect, and the dot in the center displays the point estimate (see Appendix A for more details).
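As an illustration of what each line in Figure 2 summarizes, the following is a minimal sketch of a single-sign regression of age of diagnosis on severity. It assumes a plain OLS slope and a normal-approximation interval (z = 1.96). The actual specification is described in Appendix A and may differ. The function name and the data are hypothetical.

```python
import math

def slope_with_ci(severity, age_months, z=1.96):
    """OLS slope of age on severity with an approximate 95% confidence interval."""
    n = len(severity)
    mean_x = sum(severity) / n
    mean_y = sum(age_months) / n
    sxx = sum((x - mean_x) ** 2 for x in severity)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(severity, age_months))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom for the variance.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(severity, age_months))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope - z * std_err, slope + z * std_err

# Hypothetical data: severity on the 0-10 scale, age of diagnosis in months.
severity = [0, 2, 4, 6, 8, 10, 10, 0]
age = [70, 66, 55, 50, 41, 30, 36, 62]
slope, ci_low, ci_high = slope_with_ci(severity, age)
print(f"slope = {slope:.2f} months per severity unit, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval lying entirely below zero corresponds to a line on the left side of the panel: higher severity goes with earlier diagnosis.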

For most signs, higher severity was predictive of an earlier age of diagnosis (the lines are on the left side of the panel, indicating a negative relationship between severity and age of diagnosis).

Delayed speech, delay in response to own name, and lack of gesture had the strongest effects. Most regressions of skills were associated with an earlier age of diagnosis, except for loss of motor and daily living skills, whose effects were non-significant. However, given that regressive symptoms were less frequently reported by parents, their usefulness for early diagnosis might be limited.

Interestingly, signs associated with aggression as well as “need for sameness” and sensory hyperreactivity are *positively* correlated with the age of diagnosis; children exhibiting these symptoms are diagnosed later, on average. In the Discussion section we provide possible explanations for this finding.

A limitation of the univariate analysis is that the existence of some signs may be correlated with the existence of others, so that individual signs provide redundant clues about a child’s condition. We use two methodologies to estimate the joint effects of various symptoms on the age of diagnosis: factor analysis and regression trees.

A seemingly natural approach to dealing with this problem would be to estimate the effect of one sign controlling for the effect of each of the others in a multiple regression. This approach proved infeasible due to multicollinearity. The high correlations between many of the signs (see Table 2) increase the variance of the coefficient estimates, making them unstable and difficult to interpret. We therefore use factor analysis and regression trees to avoid such problems. The estimation results using a multiple regression are reported in Appendix A.
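To make the variance inflation concrete, here is a small sketch for the simplest case of two correlated signs. With two predictors, the variance inflation factor is 1 / (1 - r²), where r is their correlation. The function names and the severity values are hypothetical, and this is not a computation reported in the paper.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """Variance inflation factor when a regression has exactly two predictors."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Hypothetical severity reports for two signs that tend to move together.
sign_a = [0, 2, 4, 6, 8, 10]
sign_b = [1, 2, 5, 6, 7, 10]
vif_correlated = vif_two_predictors(sign_a, sign_b)

# Two uncorrelated sequences give a VIF of exactly 1 (no inflation).
vif_uncorrelated = vif_two_predictors([1, -1, 1, -1], [1, 1, -1, -1])
print(vif_correlated, vif_uncorrelated)
```

The closer the correlation is to 1, the larger the factor by which the coefficient variance grows relative to the uncorrelated case.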

Factor analysis

Factor analysis is most suitable when the correlation between variables of interest is relatively high. A visual examination of the correlations across signs (Table 2) shows a relatively large number of correlations with values of 0.3 and higher (35.7% of the pairwise correlations). The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy, the standard test of whether a data set is appropriate for factor analysis, yielded a value of 0.91, suggesting a high degree of suitability for factor analysis (Kaiser and Rice, 1974).
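The share of pairwise correlations at or above 0.3 can be computed as in the sketch below. It assumes a symmetric correlation matrix stored as a list of lists and counts each pair once. The helper names and the small example matrix are hypothetical. If Table 2 covers all 25 signs, there are 300 distinct pairs, so 35.7% would correspond to roughly 107 of them.

```python
def n_pairs(n_signs):
    """Number of distinct pairs among n_signs variables."""
    return n_signs * (n_signs - 1) // 2

def share_at_or_above(corr, threshold=0.3):
    """Fraction of distinct off-diagonal correlations that are >= threshold."""
    n = len(corr)
    hits = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only, so each pair counts once
        if corr[i][j] >= threshold
    )
    return hits / n_pairs(n)

# Hypothetical 3-sign correlation matrix, for illustration only.
corr = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
share = share_at_or_above(corr)
print(n_pairs(25), round(share, 3))
```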

Using varimax rotation, we identified 5 distinct factors. Table 3 reports the mapping of the 25 signs to these five factors and the labels we chose for these factors. Three of those factors mapped well onto a triad of ASD diagnostic impairments (social interaction, communication, and restrictive/repetitive behaviors). Two additional factors represented items relevant to developmental regression and aggressive behaviors.

The next step in our analysis was to use the weights from the factor analysis to generate factor scores for each child and each factor. The factor scores are then included as independent variables in a regression model in which, as before, the dependent variable is age of diagnosis. Therefore, instead of including all 25 symptoms in one regression, we use the 5 factor scores. This avoids the problem of multicollinearity, since the factors are, by construction, uncorrelated with one another, and it increases statistical power by reducing the number of independent variables.

The results of this regression are reported in Table 4, Column 1. The factor representing communication difficulties was the strongest predictor of age of diagnosis, with higher levels of severity associated with significantly lower age of diagnosis. Unfortunately, one of the drawbacks of factor analysis is that the values of the estimated coefficients have no direct interpretation.

Developmental regression and restricted-and-repetitive-behaviors (RRBs) were also predictive of earlier age at diagnosis, although much less predictive than communication difficulties. Presence of aggressive behaviors, on the other hand, was associated with a delayed diagnosis. Adding to the regression socio-economic indicators, the child’s year of birth, and an indicator for an Asperger diagnosis (Table 4, Column 2), did not much affect the results.

To test the hypothesis that individual signs play an important role in predicting age of diagnosis beyond what is captured by the overall severity of the child’s condition, we constructed a measure of overall severity by counting the number of signs that the parents reported with a positive level of severity. The results, reported in Table 4, column 3, show that, on average, a one unit increase in the number of signs reduced the age of diagnosis by almost one month. However, when the regression also includes the five factors representing the effects of the individual signs (Table 4, columns 4 and 5), the effect of overall severity becomes insignificant.
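The overall severity measure described above amounts to a simple count, sketched below. The dictionary layout and the sign names used as keys are hypothetical. Only the rule itself (count the signs with severity above zero) comes from the text.

```python
def overall_severity(sign_severities):
    """Count the signs reported with a positive severity level (0 means absent)."""
    return sum(1 for level in sign_severities.values() if level > 0)

# Hypothetical report for one child on the 0-10 scale.
child = {
    "delayed speech": 8,
    "lack of gestures": 0,
    "need for sameness": 3,
    "aggression": 0,
}
count = overall_severity(child)
print(count)
```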

Regression Trees

We conducted a regression tree analysis using a package called RPART (Therneau and Atkinson, 2017). As described earlier, the first step in constructing the tree is to find, for each sign, the level of severity (on the 0–10 scale) that best splits the sample into two groups, those who are diagnosed earlier and those who are diagnosed later. The estimation procedure is similar to running numerous Ordinary Least Squares (OLS) regressions for each sign, where the dependent variable is the age of diagnosis and the independent variable is a dummy variable indicating high vs. low severity of the sign. In each regression we use a different level of severity to construct the cutoff for the high/low dummy variable. The best cutoff point is produced by the RPART package using the “Gini Index” and is, intuitively, similar to choosing the cutoff point that produces the highest explained variation (R²) for each sign.
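The intuition in the preceding paragraph can be sketched as a search over candidate cutoffs for a single sign, keeping the one with the highest R² from the high/low split. This is an illustration of that intuition only, not RPART's internal code. The function names and the data are hypothetical, and using midpoints between adjacent observed severity levels as candidates is an assumption.

```python
def mean(values):
    return sum(values) / len(values)

def sum_sq(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_cutoff(severity, age_months):
    """Return (cutoff, r_squared) for the high/low split explaining the most variance."""
    total_ss = sum_sq(age_months)
    levels = sorted(set(severity))
    best = (None, 0.0)
    # Assumed candidates: midpoints between adjacent observed severity levels.
    for lower, upper in zip(levels, levels[1:]):
        cutoff = (lower + upper) / 2
        low = [y for x, y in zip(severity, age_months) if x < cutoff]
        high = [y for x, y in zip(severity, age_months) if x >= cutoff]
        # R-squared of a regression of age on the high/low dummy variable.
        r_squared = 1 - (sum_sq(low) + sum_sq(high)) / total_ss
        if r_squared > best[1]:
            best = (cutoff, r_squared)
    return best

# Hypothetical data for one sign.
severity = [0, 0, 2, 3, 5, 6, 8, 9, 10, 10]
age = [70, 66, 64, 60, 62, 36, 34, 30, 33, 31]
cutoff, r_squared = best_cutoff(severity, age)
print(cutoff, round(r_squared, 3))
```

Repeating this search for every sign and keeping the sign with the best split gives the top node of a tree. Applying the same step within each resulting group grows the tree downward.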

The results of this first step are presented in Table 5. For example, for “delayed speech” the severity level of 5.75 best divides the sample by age of diagnosis. For children with severity levels of 5.75 or below, the mean age of diagnosis is 63.7 months (median = 58), while for children with severity levels above 5.75 the mean age of diagnosis is 34.9 months (median = 30). The extremely low p-values indicate, with near certainty, that the population mean age of diagnosis for children above the split differs from (and is lower than) that for children with severity levels below the split.

At the bottom of the table are signs for which no level of severity could separate the sample into two groups where the age of diagnosis of one group was significantly different from that of the other.

In creating the tree (Fig. 3) we work from the top down, first picking the sign for the top of the tree that best divides the sample between children who are diagnosed earlier and those who are diagnosed later, based on the criteria discussed above. This is “delayed speech,” which, with a cutoff level of 5.8 (all cutoff numbers are rounded in the figure), splits our sample into two groups: 62% (n = 740) with a severity level of 5.8 or above have an average age of diagnosis of 35 months, and 38% of the sample (n = 463), with a severity level below 5.8, have an average age of diagnosis of 64 months.

Next, for each of these sub-groups, we again split the sample, using the same criteria. We repeat this process until we cannot split the sample into two sub-groups such that the difference in the age of diagnosis is statistically significant.

From Fig. 3, we can see that when we limit the sample to the children with high severity of delayed speech (node 2, where severity level is ≥ 5.8), the only remaining sign that further splits this sub-sample is “lack of gestures” where the cutoff level of 5.7 splits our sample into two groups (nodes 4 and 5): 42% (n = 500) with a severity level of 5.7 or above have an average age of diagnosis of 32 months, and 20% of the sample (n = 240), with a severity level below 5.7, have an average age of diagnosis of 42 months. As the graph shows, these two groups, both relatively large, cannot be further divided.

Going back up the tree, of children with relatively low severity of delayed speech (node 3), the symptom that best divides this sample is “delayed response to name.” Notice that here the cutoff level is quite low (0.75), suggesting that among children with no (or a low level of) delayed speech, any level of delayed response to name is important in predicting the age of diagnosis. The cutoff level of 0.75 splits the subsample into two groups (nodes 6 and 7): 22% (n = 264) with a severity level of 0.75 or above have an average age of diagnosis of 55 months (node 6), and 17% of the sample (n = 199), with a severity level below 0.75, have an average age of diagnosis of 75 months (node 7). Again, the rationale for such a split seems intuitive, at least after the fact. Speech could be delayed for a variety of reasons and the delay can take many forms, for example, an ability to receive and understand speech without the capacity to speak. If the child does not respond to their own name, however, the lack of response is likely to be more diagnostic, indicating a lack of attachment or lack of social awareness, “autistic aloneness.” Combined with difficulties in initiating relationships, we get the “Asperger” type and a later diagnosis (node 8).

Looking at the sample represented by node 6, we see that the sign that best splits this sample is “difficulties in initiating and/or maintaining relationships.” The cutoff level of 6.0 splits our sample into two groups (nodes 8 and 9). Fourteen percent (n = 170) with a severity level of 6 or above have an average age of diagnosis of 61 months, and 8% of the sample (n = 94), with a severity level below 6, have an average age of diagnosis of 45 months. It should be noted that here, children who exhibit the sign at higher severity are actually diagnosed at an older age.

Moving on to the sample represented by node 7 (low levels of delayed speech and low levels of delayed response to own name), the symptom that best splits the sample is “need for sameness,” and the cutoff level of 2.3 splits our sample into two groups (nodes 11 and 15). Here again, children who exhibit the sign at higher severity are diagnosed at an older age. The 14% (n = 174) with a severity level of 2.3 or above have an average age of diagnosis of 79 months, and the 2% of the sample (n = 25), with a severity level below 2.3, have an average age of diagnosis of 48 months. This smaller group cannot be further split, most likely due to its small sample size.

Going back to node 11, the sign that best splits this sample is “played with toys or objects in an unusual way.” The cutoff level of 5.6 splits our sample into two groups (nodes 10 and 13). Children with the higher level of severity have a lower age of diagnosis. The 6% of our sample (n = 67) with a severity level of 5.6 or greater have an average age of diagnosis of 68 months, and 9% of the sample (n = 107), with a severity level below 5.6, have an average age of diagnosis of 86 months. Notice that this latter sub-group has the highest age of diagnosis so far. Looking at this specific sub-group, we see that once again “delayed speech” is the sign that best splits it. Note that we have here a small sub-group of children with relatively low levels of delayed speech. However, when limiting the sample to this sub-group, those with relatively more delayed speech are diagnosed earlier. A cutoff level of 3.8 divides the group into our last two sub-groups, represented by nodes 12 and 14. Children with delayed speech levels of 3.8 or above are diagnosed at an average age of 56 months, and children with severity levels below 3.8 are diagnosed at an average age of 91 months. The group sizes are 15 (1%) and 92 (8%), respectively.
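As a reading aid, the node sizes and mean ages quoted in the text can be checked for internal consistency: the size-weighted mean of two child groups should reproduce the parent group's mean, up to rounding. The sketch below uses only the figures reported above. The helper name and the dictionary labels are ours.

```python
def pooled_mean(children):
    """Size-weighted mean age for a parent node, from (n, mean_age) child pairs."""
    total_n = sum(n for n, _ in children)
    return sum(n * mean_age for n, mean_age in children) / total_n

# parent label -> (reported parent mean in months, [(child n, child mean), ...])
splits = {
    "high delayed speech (n = 740)": (35, [(500, 32), (240, 42)]),
    "low delayed speech (n = 463)": (64, [(264, 55), (199, 75)]),
    "delayed response to name, high (n = 264)": (55, [(170, 61), (94, 45)]),
    "delayed response to name, low (n = 199)": (75, [(174, 79), (25, 48)]),
    "need for sameness, high (n = 174)": (79, [(67, 68), (107, 86)]),
    "unusual play, low (n = 107)": (86, [(15, 56), (92, 91)]),
}

gaps = {}
for label, (reported, children) in splits.items():
    gaps[label] = abs(pooled_mean(children) - reported)
    print(f"{label}: pooled {pooled_mean(children):.1f} vs reported {reported}")
```

Each pooled mean falls within one month of the reported parent mean, as expected given that the figures are rounded.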

In sum, the regression tree analysis identifies speech delay, lack of gestures, and delayed response to name (all components of factor 3 in the factor analysis) as the key signs leading to early diagnosis (the lighter the node’s color in Fig. 3, the earlier the diagnosis).