The initial search yielded 330,225 titles, of which 227,371 were duplicates. After the first screening of titles and abstracts, we assessed 154 full-text studies and finally identified 81 studies that met the inclusion criteria and underwent full data extraction. Three additional data sources, e.g. test manuals available online, were included from outside the database searches (see Fig. 1). All collected data are presented in Table 1.
Study characteristics
The studies described research from 37 countries; most studies originated from the US (N = 18), Australia (N = 5), and South Korea (N = 4). In addition, one article reported a study conducted in nine Arabic countries (Egypt, Kuwait, Jordan, Oman, Qatar, Saudi Arabia, Syria, Tunisia, and Lebanon), and one from the US conducted on a group of Nepalese refugees from Bhutan [91, 101]. The number of scientific papers published during the period under review was relatively stable, with an increase over the last five years.
Study objectives
The studies included in the review had varied purposes; however, a significant majority focused on determining the psychometric values of the tools. Reliability (defined as Cronbach’s alpha) was reported in 46 studies (one study reported only the factor analysis of the instrument), sensitivity was assessed in 53 studies, specificity in 51 studies, positive predictive value (PPV) in 47, and negative predictive value (NPV) in 36 studies. Two studies aimed to determine the cut-off points of a given tool for the study population [76, 77]. One study aimed to demonstrate the need for further research on the cultural and linguistic adaptation of screening questionnaires and on simplifying the wording used in them [101]. Finally, one study was designed to test the stability of the cross-cultural measurement, and one aimed to identify possible difficulties in translating ASD screening questionnaires and adapting them to other languages and cultures [78, 82].
Study populations
The number of participants included in the studies varied widely, ranging from 13 to 52,026 [101, 102]. Thirty-four studies included more than 1,000 children, and six had more than 10,000 participants.
Children from the general population were included in 46 studies. In eight papers, the research was based only on a group of children at risk. One study was conducted in a group of typically developing children [68]. In three publications, the characteristics of the studied population were not specified. The remaining publications concerned children at both low and high risk of ASD. Notably, the term “high-risk children” was understood differently across papers: risk groups included, for example, siblings of children diagnosed with ASD, children already diagnosed with ASD or other developmental disorders, or children with suspected developmental delay.
Tool characteristics
We identified 26 different autism spectrum disorder screening tools that met our study criteria.
While researching information about the tools, we found mixed data on the availability of the Checklist for Autism Spectrum Disorder (CASD) to professionals who are not psychologists or have not completed the appropriate training. Nevertheless, we decided to include CASD in this publication as a tool available to PCPs.
Original versions of questionnaires
The original versions of the questionnaires come from 13 countries. The largest share (35%, N = 9) were created in the US. Only two questionnaires were developed in low- and low-middle-income countries (Uganda and Sri Lanka) [51, 112, 135]. An even greater disproportion could be observed in the languages in which the original versions of the tools are available. Of the 35 original language versions (some questionnaires, such as CASD, JA-OBS, and PAAS, were prepared in two languages, and INCLEN-ASD in as many as nine), almost half (N = 17) were in English.
Number of language versions and cultural adaptations of ASD screening tools
Data from the selected publications allowed us to create 75 profiles of adaptations or original versions of ASD screening questionnaires. Most tools were prepared in one country and in one language version. In total, at least one questionnaire was tested in each of 45 different countries. The largest numbers of different questionnaires were available in the US (11), Australia and South Korea (4 each), and China, the Netherlands, and Turkey (3 each).
Some questionnaires were translated into several languages within a single study; in total, at least one tool was available in each of 35 different languages. In some countries, the questionnaires were adapted to the local dialect (e.g., the Spanish versions of M-CHAT were adapted to Spanish, Mexican, Chilean, and Argentinian respondents) [86, 88, 100, 104]. The questionnaires were most often available in English (N = 21), Spanish (N = 7), Chinese (N = 6), Dutch (N = 4), and Korean (N = 4).
Many translations of questionnaires such as M-CHAT or Q-CHAT are also available on the websites of organizations involved in developing them. For example, M-CHAT, the most popular tool, is available in 73 versions, but most of them lack research published in international journals [136, 137]. The situation is similar for the Japanese and Spanish BITSEA versions [138].
Most language versions of the individual questionnaires were translated directly into the language of the surveyed population, sometimes with minor changes. In some cases, however, the changes went further: in the Argentinian version of the M-CHAT questionnaire, the dialect was changed to better match the Spanish used in Argentina, and in the Taiwanese version of STAT, two items were changed to better suit the Taiwanese population [86, 130].
In addition, cultural changes were made in nine adaptations. For example, phonemes were adapted to the language, and the type of assessed play or the type of toy shown to children was changed to capture their interest.
Psychometric values
When searching for information on different versions of questionnaires, we focused primarily on reliability, sensitivity, specificity, PPV, and NPV. We decided not to include validity data in our review because of the considerable variation in methodology across studies (different types of validity measured by various means). We also omitted other psychometric values (such as positive or negative likelihood ratios) because few studies reported them and because we wanted to keep the table as simple as possible to facilitate its use by practitioners.
Of the 75 profiles, only 20 contained all five values sought.
Reliability. The internal reliability of a test is a measure of the consistency of the items included in a given scale, i.e., it determines to what extent the items included in a given factor or scale are similar to each other or whether they test the same phenomenon. The most common measure of reliability is Cronbach’s alpha (α) [139]. In the profiles we created, this measure ranges from 0.53 to 1.00. According to rule-of-thumb qualitative descriptors, 6 of the studies had excellent reliability (α > 0.93), 2 had strong reliability (0.91–0.93), 12 were reliable (0.84–0.90), 14 had relatively high reliability (0.70–0.83), and 13 had reliability below 0.70 [140].
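As an illustration only, the sketch below computes Cronbach’s alpha with the standard formula α = k/(k − 1) · (1 − Σ item variances / variance of total scores) and maps the result onto the descriptor bands quoted above. The function names and the example scores are hypothetical and are not taken from any of the reviewed studies; rounding to two decimals before banding is our assumption, made because the bands are stated to two decimals.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def describe_alpha(alpha):
    """Map alpha onto the descriptor bands used in the text (assumed rounding)."""
    a = round(alpha, 2)
    if a > 0.93:
        return "excellent"
    if a >= 0.91:
        return "strong"
    if a >= 0.84:
        return "reliable"
    if a >= 0.70:
        return "relatively high"
    return "below 0.70"


# Hypothetical responses: 3 respondents, 2 items.
example = [[1, 2], [2, 1], [3, 3]]
alpha = cronbach_alpha(example)
print(round(alpha, 2), describe_alpha(alpha))
```

When every item gives identical scores for each respondent, the items are perfectly consistent and the function returns 1.0, the upper end of the range reported in the profiles.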
Sensitivity. Test sensitivity is the ratio of the true positives to the sum of the true positives and the false negatives. A sensitivity of 100% would mean that all individuals with existing disorders would be identified by the test. Values of reported sensitivity in 53 profiles varied from 0.18 to 1.00. Most of the tests (N = 42) scored above 0.70. There is a considerable discrepancy in sensitivity values between linguistic adaptations of the same questionnaire (e.g., M-CHAT used in the US and Sri Lanka), potentially resulting from an inadequate cultural adaptation of the tool [106, 110].
Specificity. Test specificity is the ratio of the true negatives to the sum of the true negatives and the false positives. A specificity of 100% would mean that all healthy individuals tested would be marked as healthy. Specificity was calculated for 51 of the above-mentioned versions of questionnaires and ranged from 0.51 to 1.00. In 37, specificity exceeded 0.80.
Positive predictive value (PPV). PPV is the proportion of true positives out of all positives and determines the probability that a positive test result is accurate. The PPV of the questionnaires in the studies included in the review ranged from 0.01 to 1.00, showing considerable variation. Noteworthy is the considerable increase in PPV, from 0.11 to 0.65, after the follow-up interview was used in the American version of M-CHAT [110].
Negative predictive value (NPV). NPV is the proportion of true negatives out of all negatives; it determines the probability that a negative test result is accurate. All versions of questionnaires for which NPV was calculated, except one (DBC-ES with NPV = 0.48), had an NPV greater than 0.73 [73].
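The four definitions above can be sketched as a short calculation from the counts of true and false positives and negatives. This is a minimal illustration, not the procedure used in any reviewed study; the class name, function name, and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreeningCounts:
    tp: int  # true positives: affected children who screened positive
    fp: int  # false positives: unaffected children who screened positive
    tn: int  # true negatives: unaffected children who screened negative
    fn: int  # false negatives: affected children who screened negative


def _ratio(numerator, denominator):
    # A value is undefined (None) when its denominator is zero.
    return numerator / denominator if denominator else None


def screening_metrics(c):
    """Return sensitivity, specificity, PPV, and NPV as defined in the text."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Hypothetical screening of 1,000 children, 10 of whom have the condition.
counts = ScreeningCounts(tp=8, fp=40, tn=950, fn=2)
for name, value in screening_metrics(counts).items():
    print(f"{name}: {value:.2f}")
```

In this made-up example, sensitivity and specificity are both high, yet PPV is low because true positives are far outnumbered by false positives, which illustrates how a tool can report a low PPV alongside good sensitivity and specificity.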
Person completing the questionnaire
ASD screening questionnaires can generally be divided into those filled in by people who have constant contact with the child (parents or guardians) and those filled in by independent observers, i.e., specialists (e.g., doctors, nurses, psychologists). Most tools (15 out of 26) were intended to be filled in by parents, with specialists only resolving doubts arising while the questionnaire was being completed and calculating the test result. These tools also underwent cultural adaptation much more often than those in which a specialist assessed the child. Some instruments were by definition assigned to a given professional group, e.g., the assessment of a child’s development using the JA-OBS test is performed by nurses [84].
Time of completing the questionnaire
Most of the questionnaires listed above should not take more than 10–20 minutes for parents or specialists to complete, and some take only 5 minutes. For example, according to the authors, the shortened version of Q-CHAT (Q-CHAT-10) takes less than 5 minutes [121]. On the other hand, BeDevel can take over 40 minutes to complete, and INCLEN-ASD takes 45–60 minutes [56, 83].