Deep Learning Insights into ASD: Classifying and Unveiling Behavioural Patterns through RoBERTa and Topic Modeling on QCHAT Data.

doi:10.21203/rs.3.rs-3999158/v1

This study leverages advanced Natural Language Processing (NLP) models, including Bidirectional Encoder Representations from Transformers (BERT), A Robustly Optimized BERT Pretraining Approach (RoBERTa), and Topic Modeling, to analyze behavioral patterns in Autism Spectrum Disorder (ASD). Using the Quantitative Checklist for Autism in Toddlers (QCHAT) dataset enhanced with ASD-related behavioral terms, we demonstrate the potential of these models to improve ASD vs. Typically Developing (TD) classification accuracy and uncover key behavioral themes indicative of ASD. Our findings highlight the value of enriching clinical datasets with domain-specific knowledge and showcase the power of adapting deep learning techniques for ASD research. This work contributes to developing more accurate and informative ASD diagnostic tools.

Biological sciences/Neuroscience

Biological sciences/Psychology

Health sciences/Diseases

Health sciences/Health care

Autism Spectrum Disorder (ASD)

Predefined ASD Terms

RoBERTa (Robustly Optimized BERT Pretraining Approach)

QCHAT (Quantitative Checklist for Autism in Toddlers)

Topic Modeling

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition that presents a significant challenge for early diagnosis due to its broad spectrum of manifestations and the individualized nature of its symptoms. Despite the critical importance of early intervention in enhancing the developmental outcomes of children with ASD, traditional diagnostic methods are often cumbersome and time-intensive, potentially delaying access to vital support services. This issue underscores the urgent need for innovative diagnostic solutions to streamline the process without compromising accuracy.

Recent advancements in Deep Neural Networks (DNNs) and Natural Language Processing (NLP) have emerged as promising avenues for improving ASD diagnostics. Rahman and Subashini (2022) [2] demonstrated the potential of DNNs to improve the efficiency and accuracy of ASD detection using the Quantitative Checklist for Autism in Toddlers (QCHAT) datasets. Their work paves the way for leveraging machine learning algorithms to overcome the inherent limitations of manual screening methods.

Further enriching the diagnostic landscape, Zhao et al. (2022) [4] employed NLP techniques to extract a comprehensive set of ASD-related terms from Electronic Health Records (EHRs), establishing a detailed ontology that enhances our understanding of ASD and facilitates more nuanced and efficient diagnostic practices. Their work exemplifies the power of NLP in addressing the challenges posed by the variability of ASD symptoms and the lack of standardized diagnostic criteria, offering a novel approach to developing a "common language" for ASD diagnostics.

Building upon these advancements, our study aims to exploit the capabilities of RoBERTa (Robustly Optimized BERT Pretraining Approach) [12], a cutting-edge DNN model, alongside Latent Dirichlet Allocation (LDA) topic modeling [13], to refine ASD diagnostics further and reveal behavioural patterns indicative of ASD through the analysis of QCHAT data. By fine-tuning RoBERTa with QCHAT datasets and applying LDA for topic modelling, our objectives are twofold:

Enhance the Precision of ASD vs. TD (Typical Development) Classification: We aim to streamline the early detection process by improving the precision with which ASD can be distinguished from TD. This involves leveraging deep learning to analyze subtle patterns in the data that may elude human evaluators, thus addressing one of the key challenges in current diagnostic practices—the subjectivity and variability of assessments.

Unveil Key Behavioural Patterns Associated with ASD: By identifying distinct behavioural patterns through topic modelling, we contribute to a deeper understanding of ASD's manifestations. This not only aids in diagnosis but also in tailoring intervention strategies to the unique needs of each individual with ASD.

The potential impact of our study is substantial, promising to revolutionize ASD diagnostics by offering more personalized and timely interventions. By mitigating the limitations of current diagnostic practices through the application of advanced computational techniques, our research makes a significant contribution to the broader field of cognitive sciences and the integration of deep learning in healthcare.

Our study aims to enhance the diagnostic precision for Autism Spectrum Disorder (ASD) through the integration of advanced computational methodologies applied to the Quantitative Checklist for Autism in Toddlers (QCHAT) datasets from Kaggle. This section delineates the experimental protocols, ensuring reproducibility and clarity in describing the methods utilized for text data extraction and analysis, ASD vs. TD (Typical Development) classification, and topic modeling. As illustrated in Fig. 1, the core of our approach lies in a six-step workflow leveraging deep learning techniques.

Data Acquisition

The QCHAT is a widely used tool for the early identification of ASD, comprising questions designed to capture a broad range of ASD behaviours. Its structured format and emphasis on quantifiable behaviours make it well suited to machine learning applications, allowing for the extraction and analysis of patterns that may indicate ASD. We used two Kaggle ASD QCHAT datasets, one focusing on toddlers ([https://www.kaggle.com/datasets/fabdelja/autism-screening-for-toddlers]) and the other on children ([https://www.kaggle.com/datasets/uppulurimadhuri/dataset]). These datasets were chosen for their relevance and their potential to improve ASD diagnostic processes through deep learning. Table 1 presents a comparative analysis of demographic and clinical characteristics for toddlers, comparing the ASD group (n=728) and the TD group (n=326). Table 2 provides a similar analysis for children, comparing the ASD group (n=1,074) and the TD group (n=911).

Table 1. QCHAT Data1 for toddlers. This table presents a comparative analysis between the Autism Spectrum Disorder (ASD) group (n=728) and the Typically Developing (TD) group (n=326) across demographic and clinical variables. The variables include age (in months), sex, ethnicity (categorized into nine groups), and family history of ASD. The age groups are categorized by developmental stage: 12-18 months as the early developmental stage, 19-24 months as the middle developmental stage, 25-30 months as the late developmental stage, and 31-36 months as the transition stage. Data are presented as counts (n) and percentages (%). Chi-square tests were used to compare the distribution of age, sex, ethnicity, and family history of ASD between the ASD and TD groups. The p-values for age, sex, and ethnicity indicate statistically significant differences between the groups (p < 0.001). For family history of ASD, the p-value of 0.728 exceeds the alpha threshold (0.05), indicating no significant difference between the groups.

Variable	ASD Group n = 728	TD Group n = 326	p-value
Age Months, n (%)
12-18	99 (13.6)	77 (23.62)	< 0.001
19-24	137 (18.82)	43 (13.19)
25-30	164 (22.53)	54 (16.56)
31-36	328 (45.05)	152 (46.63)
Sex, n (%)
Male	534 (73.35)	201 (61.66)	< 0.001
Female	194 (26.65)	125 (38.34)	< 0.001
Ethnicity, n (%)
African	39 (5.36)	14 (4.29)	< 0.001
Hispanic	30 (4.12)	10 (3.07)
Latino	20 (2.75)	6 (1.84)
Middle Eastern	96 (13.19)	92 (28.22)
Other Asians	212 (29.12)	87 (26.69)
Others	34 (4.67)	9 (2.76)
Pacifica	7 (0.96)	1 (0.31)
South Asian	40 (5.49)	23 (7.06)
White European	250 (34.34)	84 (25.77)
Family History with ASD, n (%)
No	613 (84.2)	271 (83.13)	0.728
Yes	115 (15.8)	55 (16.87)	0.728

Table 2. QCHAT Data2 for children. This table presents a comparative analysis between the Autism Spectrum Disorder (ASD) group (n=1,074) and the Typically Developing (TD) group (n=911) across demographic and clinical variables for children. The variables include age (in years, ranging from 0 to 18, segmented into five developmental stages: 0-2 years as the infancy stage, 2-5 years as early childhood, 6-8 years as middle childhood, 9-12 years as late childhood, and 13-18 years as adolescence), sex, ethnicity (categorized into nine groups), and family history of ASD. Data are presented as counts (n) and percentages (%). Chi-square tests were used to compare the distribution of age, sex, ethnicity, and family history of ASD between the ASD and TD groups. The age distribution does not differ significantly between the ASD and TD groups (p = 0.076). Sex, ethnicity, and family history of ASD all show statistically significant differences between the groups (p < 0.001). An alpha threshold of 0.05 was used to determine statistical significance.

Variable	ASD Group n = 1,074	TD Group n = 911	p-value
Age Years, n (%)
0-2	20 (1.86)	28 (3.07)	0.076
2-5	224 (20.86)	152 (16.68)
6-8	275 (25.61)	236 (25.91)
9-12	147 (13.69)	124 (13.61)
13-18	408 (37.99)	371 (40.72)
Sex, n (%)
Male	963 (89.66)	484 (53.13)	< 0.001
Female	111 (10.34)	427 (46.87)	< 0.001
Ethnicity, n (%)
African	39 (3.63)	14 (1.54)	< 0.001
Hispanic	30 (2.79)	10 (1.1)
Latino	20 (1.86)	6 (0.66)
Middle Eastern	96 (8.94)	307 (33.7)
Other Asians	343 (31.94)	262 (28.76)
Others	34 (3.17)	9 (0.99)
Pacifica	7 (0.65)	1 (0.11)
South Asian	40 (3.72)	218 (23.93)
White European	465 (43.3)	84 (9.22)
Family History with ASD, n (%)
No	590 (54.93)	740 (81.23)	< 0.001
Yes	484 (45.07)	171 (18.77)	< 0.001

Text Data Extraction and Analysis

Sentence Transformer Mapping: We employed a sentence transformer (e.g., 'all-MiniLM-L6-v2') to map each QCHAT questionnaire item to ASD-specific terms, ranking candidate terms by cosine similarity. The candidates were drawn from a comprehensive set of 3,336 ASD-related terms identified by Zhao et al. (2022) [4].
Expert Review and Selection: For each questionnaire item, the term mappings with the highest cosine similarity were reviewed by an ASD clinical expert, who selected the most accurate term for that item. Supplementary Table S1 shows the predefined ASD terms mapped to the QCHAT items.
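The ranking step described above can be sketched as follows. This is a minimal illustration rather than the study's pipeline: the `embed` function is a toy bag-of-words stand-in for a sentence-transformer encoder, and the questionnaire item and candidate terms are hypothetical examples, not entries from Supplementary Table S1.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector.

    Assumption: the study used dense sentence-transformer embeddings
    (e.g. 'all-MiniLM-L6-v2'); this placeholder only keeps the sketch
    free of external dependencies.
    """
    return Counter(text.lower().replace("?", " ").replace(",", " ").split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_terms(item: str, terms: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Rank candidate ASD terms by similarity to one questionnaire item."""
    item_vec = embed(item)
    scored = [(term, cosine_similarity(item_vec, embed(term))) for term in terms]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical item and candidate terms, for illustration only.
item = "Does your child point to show interest in something?"
candidate_terms = [
    "difficulty pointing to express interest",
    "absent speech",
    "poor eye contact",
]
shortlist = top_k_terms(item, candidate_terms, k=2)
```

The shortlist plays the role of the top mappings handed to the clinical expert. Swapping `embed` for a dense encoder would change the scores but not the ranking logic.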

ASD vs. TD Classification

Fine-Tuning Process: RoBERTa models were fine-tuned on the Kaggle QCHAT dataset (toddlers) to adapt them to ASD diagnostic language and questionnaire responses. The model was initialized for a sequence classification task with two classes, with optional GPU usage controlled through environment variables. We employed the following training hyperparameters: 5 training epochs, batch sizes of 16 for training and 8 for evaluation, and a learning rate of 2e-5. These choices were informed by preliminary experiments that identified the configurations yielding the highest classification accuracy.
Transfer Learning: We employed a transfer learning approach to leverage the knowledge acquired during pretraining and optimize performance on the second ASD-related dataset. A pretrained RoBERTa model was fine-tuned on the second Kaggle dataset with the following hyperparameters, selected to facilitate the model's adaptation to the target dataset while mitigating overfitting: a batch size of 8 (training and evaluation), a learning rate of 3e-5, and a weight decay of 1e-8 for regularization.
Model Application: The fine-tuned RoBERTa models were subsequently applied to the second Kaggle QCHAT dataset (children) for ASD vs. Typically Developing (TD) classification, with the aim of improving classification performance on this new dataset.
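For reference, the two sets of reported hyperparameters can be collected in a small configuration sketch. The class name `FineTuneConfig` and its fields are hypothetical; values not stated in the text are left as `None` rather than guessed, and the step counts assume no gradient accumulation, with training-set sizes taken from Table 3.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in the text; None marks values not reported."""
    train_batch_size: int
    eval_batch_size: int
    learning_rate: float
    num_epochs: Optional[int] = None
    weight_decay: Optional[float] = None

    def steps_per_epoch(self, n_train: int) -> int:
        """Optimizer steps per epoch, assuming no gradient accumulation."""
        return math.ceil(n_train / self.train_batch_size)


# Stage 1: RoBERTa fine-tuned on the toddler data (Data1).
stage1 = FineTuneConfig(train_batch_size=16, eval_batch_size=8,
                        learning_rate=2e-5, num_epochs=5)

# Stage 2: transfer learning on the children data (Data2).
# The number of epochs for this stage is not stated, so it stays None.
stage2 = FineTuneConfig(train_batch_size=8, eval_batch_size=8,
                        learning_rate=3e-5, weight_decay=1e-8)

# Training-set sizes from Table 3: 409 + 180 = 589 and 607 + 504 = 1,111.
stage1_steps = stage1.steps_per_epoch(589) * stage1.num_epochs
stage2_steps_per_epoch = stage2.steps_per_epoch(1111)
```

Under these assumptions, stage 1 amounts to 185 optimizer steps in total, which gives a sense of how short the fine-tuning run is on a dataset of this size.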

Topic Modeling

We chose the Latent Dirichlet Allocation (LDA) algorithm for topic modeling because of its efficacy in uncovering hidden thematic structures within large text corpora. To reflect the complexity and nuances of ASD behavioural patterns in the dataset, we tuned the LDA hyperparameters through a combination of grid search and expert judgment, aiming to extract coherent and interpretable topics that offer insights into the behavioural dimensions of ASD. A perplexity score of -3.46 and a coherence score of 0.79 guided the selection of five as the optimal number of topics.

Model Performance Evaluation

Performance Measure: We carefully selected evaluation metrics to capture the multifaceted nature of ASD diagnostic models. AUROC assessed the model's discrimination ability across thresholds, while the confusion matrix, F1 score, precision, and recall provided detailed insights into accuracy, sensitivity, and specificity. Together, these metrics offer a robust evaluation framework.
Model Validation Techniques: To validate our models, we employed a common technique of splitting our dataset into training, validation, and testing sets (as detailed in Table 3). The training set was used to adjust the models' weights, the validation set facilitated hyperparameter tuning and prevented overfitting, and the testing set served as the final benchmark for model performance on unseen data. This tripartite data split ensured a rigorous validation process, bolstering the credibility of our findings and demonstrating our commitment to developing a model with practical diagnostic potential.
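The 56/14/30 proportions reported for RoBERTa are what a two-stage split produces: holding out 30% for testing, then taking 20% of the remainder for validation (0.7 x 0.2 = 0.14, 0.7 x 0.8 = 0.56). The sketch below illustrates this under stated assumptions; the fractions, the rounding rule, and the function name `two_stage_split` are not taken from the authors' code, and no stratification by class is attempted.

```python
import math
import random


def two_stage_split(n_samples: int, test_frac: float = 0.30,
                    val_frac: float = 0.20, seed: int = 0):
    """Shuffle indices, hold out a test set, then split off a validation set.

    Held-out sizes are rounded up, which is an assumption chosen to mirror
    common library behaviour.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_test = math.ceil(n_samples * test_frac)
    test, remainder = indices[:n_test], indices[n_test:]

    n_val = math.ceil(len(remainder) * val_frac)
    val, train = remainder[:n_val], remainder[n_val:]
    return train, val, test


# Data1 (toddlers): 728 ASD + 326 TD = 1,054 records.
train, val, test = two_stage_split(1054)
sizes_data1 = (len(train), len(val), len(test))

# Data2 (children): 1,074 ASD + 911 TD = 1,985 records.
sizes_data2 = tuple(len(part) for part in two_stage_split(1985))
```

With these settings the partition sizes come out as 589/148/317 for Data1 and 1,111/278/596 for Data2, which agree with the RoBERTa and Model 1 row totals in Table 3.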

Python 3.9 was used for the analysis. Partial results of the pipeline are available in a public repository (https://github.com/skwgbobf/ASD_Kaggle.git).

Table 3. Data Partitions for BERT, RoBERTa, and Model 1 in Classifying ASD versus TD Children Across Multiple Datasets. Classification of Autism Spectrum Disorder (ASD) versus Typically Developing (TD) children used BERT, RoBERTa, and Model 1, fine-tuned on Toddler Data (Data1) and applied through transfer learning on Children Data (Data2). Data are presented as counts (n) and percentages (%) to indicate the class distribution within the Test, Train, and Validation datasets for each classification model.

Condition	ASD (1) Group	TD (0) Group
ASD vs TD Classification using BERT fine-tuned on Data1 (Toddler Data), n (%)
Test Data	70 (66.04)	36 (33.96)
Train Data	579 (68.68)	264 (31.32)
Validation Data	79 (75.24)	26 (24.76)
ASD vs. TD Classification using RoBERTa fine-tuned on Data1 (Toddler Data), n (%)
Test Data	219 (69.09)	98 (30.91)
Train Data	409 (69.44)	180 (30.56)
Validation Data	100 (67.57)	48 (32.43)
ASD vs. TD Classification using Model 1, fine-tuned and applied as transfer learning on Data2 (Children Data)
Test Data	322 (54.03)	274 (45.97)
Train Data	607 (54.64)	504 (45.36)
Validation Data	145 (52.16)	133 (47.84)

Our comprehensive analysis employing Natural Language Processing (NLP) and Deep Learning techniques for the classification of Autism Spectrum Disorder (ASD) versus Typically Developing (TD) individuals, along with the utilization of topic modelling to uncover behavioural patterns indicative of ASD, has produced noteworthy outcomes. The results of the fine-tuning of Bidirectional Encoder Representations from Transformers (BERT) and Robustly Optimized BERT Pretraining Approach (RoBERTa) models on Data1, the application of transfer learning to Data2, and the insights generated through topic modelling on Data1 are elaborated below.

ASD vs. TD Classification using RoBERTa fine-tuned on Data1 (Toddlers)

Both BERT and RoBERTa were fine-tuned on the toddler dataset (Data1) to distinguish between ASD and TD individuals.

BERT Fine-Tuned on Data1: This model achieved accuracy, precision, recall, and F1 score all at the maximum value of 1.0, separating ASD and TD instances among toddlers without error within this dataset. Such results highlight the potential of deep learning models to contribute to early ASD detection, where early intervention can have profound impacts.
Figure 2 illustrates the BERT model's performance: AUC of 1.0 (ROC), zero errors (confusion matrix), and 100% accuracy, precision, recall, and F1 score.
RoBERTa Fine-Tuned on Data1: Consistent with BERT, the RoBERTa model attained a score of 1.0 across all evaluation metrics. The data were divided into 56% training, 14% validation, and 30% test sets. These results support the suitability of RoBERTa for early-stage ASD diagnostics, offering a promising tool for healthcare professionals.
As depicted in Fig. 3, the RoBERTa model mirrors BERT's results in accuracy, precision, recall, and F1 score, highlighting the model's potential for early-stage ASD diagnostics.

Transfer Learning with Model 1 on Data2 (Children):

Application of Model 1 (RoBERTa) on Data2: Applying the previously fine-tuned Model 1 (RoBERTa) to Data2 through transfer learning yielded lower performance than on the initial dataset: an accuracy of 0.82, precision of 0.86, recall of 0.83, and an F1 score of 0.82. Despite this reduction, the model remained effective at distinguishing ASD from TD cases, indicating the potential of transfer learning for ASD diagnostics across diverse datasets. The decrease could be attributed to the complexity and variability of ASD manifestations in older children compared to toddlers.
Figure 4 illustrates these results, including the accuracy of 82.05% on the children dataset.
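As a reminder of how these metrics relate to a confusion matrix, the sketch below computes accuracy together with per-class and macro-averaged precision, recall, and F1. The counts are hypothetical and are not the study's confusion matrix. Averaging across classes is one possible reason a reported F1 (0.82) can sit below the harmonic mean of the reported precision and recall (about 0.84), although the averaging scheme used in the study is not stated.

```python
def binary_report(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy plus per-class and macro-averaged precision, recall and F1."""

    def prf(true_pos: int, false_pos: int, false_neg: int):
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        return precision, recall, f1

    # Treat ASD as the positive class, then swap roles for TD.
    asd = prf(tp, fp, fn)
    td = prf(tn, fn, fp)
    macro = tuple((a + b) / 2 for a, b in zip(asd, td))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "asd": asd,
        "td": td,
        "macro": macro,
    }


# Hypothetical confusion-matrix counts, NOT the study's results.
report = binary_report(tp=80, fp=10, fn=20, tn=90)
```

Reporting the per-class values alongside the averages makes it easier to see whether errors fall mainly on ASD or on TD cases.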

Topic Modeling on Data1

LDA topic modeling on Data1 revealed five predominant topics, providing insights into the nuanced behavioural patterns associated with ASD:

Topic 1: Social Communication Impairments (34.1% of tokens): This topic emphasizes challenges with pretend play and blank staring.
Topic 2: Complex ASD with Speech Delays (29.5%): This topic highlights a combination of core ASD symptoms and delayed speech development.
Topic 3: Imitation and Gestural Difficulties (16.7%): This topic centers on challenges with imitating gestures and pointing.
Topic 4: Attention and Concentration Deficits (16.7%): This topic focuses on attentional difficulties often observed in ASD.
Topic 5: Non-ASD Classification (3%): This minor topic captures instances without an ASD diagnosis.

Table 4. Topic Modeling: Keywords by topic (k=5). This table presents the distribution of language tokens across topics related to Autism Spectrum Disorder (ASD) traits, showing the percentage of tokens, representative keywords, and the number of ASD and Typically Developing (TD) instances assigned to each topic. The analysis is based on a dataset comprising 728 instances from the ASD group and 326 from the TD group, organized into five distinct topics ranging from ASD with specific impairments to no ASD traits. Statistical significance was assessed using the Chi-square test of independence between the dominant topic and class/ASD traits, giving a Chi-square statistic of 224.885 and a p-value < 0.001. The null hypothesis (H₀) that topic distribution and ASD traits are independent is therefore rejected at an alpha level of 0.05, suggesting a significant dependency between the topics and the presence of ASD traits.

Topic	Title	% of tokens	Representative keywords	ASD Group	TD Group
T1	ASD with pretending playing impairment and blankly staring	34.1%	Play, impairment, manipulating, at, stares, rather, objects, than, them, blankly	226	202
T2	ASD with complexed symptoms and speech delay	29.5%	attention, social, pointing, gesture, difficulty, imitating, absent, speech, pretend, does	286	11
T3	ASD with difficulty imitating gesture and pointing	16.7%	pointing, interest, express, to, social, spontaneously, difficulty, gesture, imitating, reciprocity	144	33
T4	ASD with attention and concentration deficit	16.7%	attention, social, and, deficit, concentration, impairment, reciprocity, emotional, interaction, in	72	61
T5	No ASD	3%	speech, absent, pretend, does, play, shifting, attention, eye, contacts, spontaneously	0	19
			Total	728	326

A chi-square test of independence on the counts in Table 4 confirmed a statistically significant relationship between the dominant topic and ASD traits (p-value < 0.001). This rejects the null hypothesis of no relationship and supports the view that the identified topics meaningfully reflect ASD characteristics.
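The test can be reproduced directly from the ASD and TD counts in Table 4. The following is a minimal standard-library sketch, not the authors' code: the function names are hypothetical, and the p-value helper uses a closed-form expression that holds only for an even number of degrees of freedom (here 4).

```python
import math

# Dominant-topic counts per group, copied from Table 4 as (ASD, TD).
observed = {
    "T1": (226, 202),
    "T2": (286, 11),
    "T3": (144, 33),
    "T4": (72, 61),
    "T5": (0, 19),
}


def chi_square_independence(table: dict) -> tuple:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for count, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi2_sf_even_dof(x: float, dof: int) -> float:
    """Upper-tail probability of a chi-square variable; valid for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(dof // 2))
    return math.exp(-half) * series


statistic, dof = chi_square_independence(observed)
p_value = chi2_sf_even_dof(statistic, dof)
```

The statistic computed this way agrees with the reported value of 224.885 on 4 degrees of freedom, and the resulting p-value is far below 0.001.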

Our study demonstrates the transformative potential of deep learning and Natural Language Processing (NLP) in advancing Autism Spectrum Disorder (ASD) research. By fine-tuning Robustly Optimized BERT Pretraining Approach (RoBERTa) on Quantitative Checklist for Autism in Toddlers (QCHAT) datasets and applying LDA topic modeling, we achieved superior ASD classification accuracy and revealed nuanced behavioural patterns.

The success of RoBERTa highlights the importance of domain-specific adaptation for deep learning models, enhancing their sensitivity to subtle ASD-related cues. Furthermore, the successful application of transfer learning suggests potential scalability of our methodology across diverse pediatric datasets.

LDA topic modeling unveiled distinct behavioural patterns within ASD, a valuable step towards understanding potential subtypes and developing more targeted interventions. Our study establishes a robust framework for integrating deep learning, NLP, and ASD-specific datasets, making a significant contribution to the field of cognitive sciences.

Ethical Considerations and Future Research

The ethical implications of employing AI in healthcare, particularly concerning data privacy, consent, and mitigating bias, are paramount. Our commitment to upholding the highest ethical standards sets a benchmark for responsible AI use in medical diagnostics. These considerations are crucial for ensuring AI technologies' equitable and responsible deployment in healthcare settings.

This study's promising results pave the way for further exploration. Future research could:

Investigate alternative deep learning architectures for ASD diagnostics.
Integrate multimodal data sources to enrich ASD understanding.
Conduct longitudinal studies to track developmental trajectories.
Assess the applicability of these models to other neurological conditions.

Clinical and Educational Implications

Our findings have profound implications for clinical practice and education. AI-driven tools like those explored here could transform early ASD screening and support. This integration could significantly enhance early intervention strategies, improving long-term outcomes for individuals with ASD.

Study Limitations

Our study has the following limitations:

The reliance on QCHAT data may limit generalizability. Future efforts should incorporate diverse data sources, including tools like the M-CHAT (Modified Checklist for Autism in Toddlers), SCQ (Social Communication Questionnaire), and SRS (Social Responsiveness Scale).
The interpretive challenges associated with LDA topic modeling warrant consideration. Further refinement of modeling techniques could enhance clarity and practical utility.

In conclusion, our study demonstrates the impactful application of deep learning models like Robustly Optimized BERT Pretraining Approach (RoBERTa) and LDA topic modeling in advancing Autism Spectrum Disorder (ASD) diagnostics and understanding. Through the fine-tuning of advanced models on Quantitative Checklist for Autism in Toddlers (QCHAT) data and the innovative use of topic modelling, we have achieved not only high classification accuracy but also new insights into the behavioural manifestations of ASD. These advancements hold the promise of driving future innovations in cognitive sciences and offering more personalized care approaches for individuals with ASD, illustrating the pivotal role of deep learning in the evolution of ASD diagnostics and intervention strategies.

In sum, our study showcases the potential of combining a deep learning model, RoBERTa, with LDA topic modelling for ASD diagnostics and understanding. By fine-tuning RoBERTa on QCHAT data and applying topic modelling, we achieved high classification accuracy and gained insights into the behavioural patterns of ASD. These advancements may support further innovation in cognitive sciences and more personalized care strategies for individuals with ASD. Our work underscores the role of deep learning in the ongoing evolution of ASD diagnostic and intervention methodologies.

Authors’ contributions

Conceptualization: SKB; Methodology: SKB, HWK; Data acquisition: SKB, CWL; Formal Analysis: SKB, CWL; Model development and validation: SKB; Funding acquisition: HWK; Project administration and supervision: HWK; Writing & figure drawing – original draft: SKB, CWL; Writing – review & editing: SKB, CWL, HWK.

Manuscript Refinement with LLM

The authors created an initial draft of the manuscript. Subsequently, a large language model (LLM) was utilized to suggest wording improvements and ensure stylistic consistency.

Funding

This research was supported by a grant from the R&D project funded by the National Center for Mental Health (grant number: MHER22A01).

Conflicts of Interest

The authors have nothing to disclose.

Data Availability

The datasets analyzed in this study were obtained from two publicly available sources on Kaggle:

Autism Screening for Toddlers: [https://www.kaggle.com/datasets/fabdelja/autism-screening-for-toddlers]
Autism Screening Dataset for Children: [https://www.kaggle.com/datasets/uppulurimadhuri/dataset]

These datasets were selected due to their relevance and potential to advance ASD diagnostic processes using deep learning methods. The code used for data analysis was implemented in Python 3.9. Partial results of the analysis pipeline are accessible in the following public repository: https://github.com/skwgbobf/ASD_Kaggle.git

Rahman, M. M. et al. A review of machine learning methods of feature selection and classification for autism spectrum disorder. Brain Sci. 10, 949, DOI: https://doi.org/10.3390/brainsci10120949 (2020).
Rahman, K. K. M. & Subashini, M. M. A deep neural network-based model for screening autism spectrum disorder using the Quantitative Checklist for Autism in Toddlers (QCHAT). J. Autism Dev. Disord. 52, 2732-2746, DOI: https://doi.org/10.1007/s10803-021-05141-2 (2022).
Kohli, M., Kar, A. K. & Sinha, S. The role of intelligent technologies in early detection of autism spectrum disorder (ASD): A scoping review. IEEE Access 2022, 3208587, DOI: 10.1109/ACCESS.2022.3208587 (2022).
Zhao, M. et al. Development of a phenotype ontology for autism spectrum disorder by natural language processing on electronic health records. J. Neurodev. Disord. 14, 32, DOI: 10.1186/s11689-022-09442-0 (2022).
Choi, E. S. et al. Applying artificial intelligence for diagnostic classification of Korean autism spectrum disorder. Psychiatry Investig. 17, 1090-1095, DOI: 10.30773/pi.2020.0211 (2020).
Stevens, E. et al. Identification and analysis of behavioral phenotypes in autism spectrum disorder via unsupervised machine learning. Int. J. Med. Inform. 129, 29-36, DOI: 10.1016/j.ijmedinf.2019.05.006 (2019).
Alqaysi, M. E., Albahri, A. S. & Hamid, R. A. Diagnosis-based hybridization of multimedical tests and sociodemographic characteristics of autism spectrum disorder using artificial intelligence and machine learning techniques: A systematic review. Int. J. Telemed. Appl. 2022, 3551528, DOI:10.1155/2022/3551528 (2022).
Anagnostopoulou, P. et al. Artificial intelligence in autism assessment. Int. J. Emerg. Technol. Learn. 15(06), 95–107, DOI:10.3991/ijet.v15i06.11231 (2020).
Christiansz, J. A. et al. Autism spectrum disorder in the DSM-5: Diagnostic sensitivity and specificity in early childhood. J. Autism Dev. Disord. 46, 2054-2063, DOI: 10.1007/s10803-016-2734-4 (2016).
Reimers, N. & Gurevych, I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982-3992, DOI: 10.18653/v1/D19-1410 (2019).
Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186, DOI: 10.18653/v1/N19-1423 (2019).
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, DOI: https://doi.org/10.48550/arXiv.1907.11692 (2019).
Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022 (2003).

No competing interests reported.

ScientificReportsSupplementaryFile.pdf
