Screening Children at Risk for Developmental Disabilities Based on Face Landmark from Video Data of Mobile-Based Application: Preliminary Cross-Sectional Study

Background: Early detection of and intervention for developmental disabilities (DDs) are critical for improving the long-term outcomes of the afflicted children. Mobile-based applications are easily accessible and may thus help the early identification of DDs. Objective: We aimed to identify facial expressions and head poses based on face landmark data extracted from face-recording videos collected through a mobile-based application, and to examine whether a deep learning classification model can differentiate the characteristics of children with DDs from those of children without.


Introduction
Developmental disabilities (DDs) are conditions involving impairments in physical, learning, language, or behavioral development, and encompass autism spectrum disorder (ASD), language disorder (LD), and intellectual disability (ID). The prevalence of DDs has increased markedly worldwide, affecting 17.8% of children in the US from 2015 to 2017. 1 According to a recent study, the prevalence of DD in South Korea steadily increased by more than four times from 2003 to 2017. 2 DDs significantly and negatively affect the quality of life of both the afflicted individuals and their families, 3 as a large amount of medical and social support is needed for children with DD. The social costs of DD are significant and are expected to grow considering the increasing prevalence of DD. 4,5 Early identification of DDs is crucial for children to receive early evidence-based interventions, 6-8 which have been shown to be highly effective in improving the outcomes of children with DD. Although many studies have emphasized the importance of early detection and intervention, difficulties in universal screening delay the age of identification and diagnosis. 9 Currently, ASD, ID, and LD are diagnosed with standardized instruments that evaluate language, cognition, or social development, including the Bayley Scales of Infant Development (BSID), Psychoeducational Profile-Revised (PEP-R), Autism Diagnostic Observation Schedule (ADOS), and Autism Diagnostic Interview-Revised (ADI-R). 8,10,11 However, these standardized tests usually take much time and need to be conducted by trained professionals. 12 This highlights the need for user-friendly and time-efficient mobile screening tools for children with DD.
As mobile devices become widely distributed, it has been suggested that mobile-based screening tools could be useful in the early identification of DD. Mobile-based screening programs are easily accessible, are less time-consuming, and do not require trained professionals, and may thus help speed up the identification of DD. Different types of mobile-based screening measures for DD are emerging, including Cognoa ASD Screener, a platform in which analysts examine home videos, 13 iTracker, an eye-tracking algorithm, 14 and ASDTest, an application based on standardized tests such as the Autism Quotient 10 (AQ) and Quantitative Checklist for Autism in Toddlers (Q-CHAT). 15 Video-analyzing platforms 13,16 and eye-tracking algorithms 17 are the most frequently studied mobile device-based methods for identifying DDs. In contrast, screening tools based on facial emotion recognition and expression have only recently been introduced.
Facial emotion recognition and expression are indispensable in the sharing of emotions and human-to-human interactions. People with DD, especially ASD, have difficulties understanding and expressing facial emotions. 18-20 Although much research has been done on screening DD using facial emotion recognition, little research has been done on screening ASD with facial emotion production. 21 Similarly, only a few studies have examined facial emotion expression in individuals with ID, despite there being several studies on facial emotion recognition in individuals with ID. 22,23 Moreover, only a few studies have been conducted on LD and facial emotion recognition and expression.
Recently, some studies reported a significant difference in facial expression between children with DD and those without, using models that categorize facial expressions with facial landmarks. 24,25 Manfredonia and colleagues 25 recorded videos of participants and divided them into image frames for analysis.
These studies suggest that programs examining the facial expressions of children based on face-recording video data could be useful for identifying children with DD. Thus, we aimed to identify facial expressions and head poses based on face landmark data extracted from face-recording videos and to differentiate the characteristics between children with DD and those without.

Methods

Mobile-based application
This study aimed to examine the effectiveness of a deep learning classification model, built from facial video data of children collected through a mobile-based application, in identifying the facial expressions and head poses of children with developmental disabilities and those without. The mobile-based application records the faces of children staring at the screen and saves the recordings as videos. The games in the mobile application were redesigned for this study. Among them, the animated film for the eye-tracking test comes from "D-kit", previously used by the investigators; in previous studies, the D-kit animation was demonstrated to increase children's participation in training sessions. The eye-tracking test consists of animated and non-animated content. The non-animated content tracks the subject's eye movements to test the subject's attention and participation level when exposed to the content. The test shows four different pictures: friendly animal characters, blocks, different animal characters, and animal characters with blocks. This content tests whether children's eyes follow the target object and whether they are more interested in the social characteristics with which they can sympathize. The animated content shows the same four pictures as the previous test, but in animated form, and tests whether children are more engaged when objects move and whether the results agree with those of the non-animated test (Supplementary Figure A).

Participant enrollment
From May 2020 to July 2020, a total of 124 children were recruited from community-based daycare centers, kindergartens, and special education centers. The children were between 34 and 77 months of age. Children were excluded if they had (i) a history of neurologic diseases such as cerebral palsy, (ii) any sensory disturbances (i.e., vision, hearing, taste, or smell), or (iii) severe gross or fine motor problems that prevented them from participating in the psychometric tests. Of the 124 children, data from 35 children could not be analyzed because (i) facial data could not be extracted due to the use of face masks (n=25), (ii) facial data during video games were not available (n=5), or (iii) landmark points were incorrectly extracted (n=5). As a result, a total of 89 children were analyzed in this study (Figure 1). Among typically developing children, children with confusing test results or developmental concerns were seen by clinicians and confirmed as typically developing controls. The study was approved by the institutional review board of Asan Medical Center, and informed consent was obtained from the parents of each child. In addition, this study was performed in accordance with the principles of Good Clinical Practice and the Helsinki Declaration.

Data collection and preprocessing

Face landmark extraction

Face landmarks are standard reference points, such as the inner and outer corners of the eye fissure where the eyelids meet. In many cases, the landmarks used in computational face analysis are very similar to the anatomical soft-tissue landmarks used by physicians. The extracted landmark points can be applied in various fields such as human emotion recognition, gaze detection, and face conversion. In our work, we used 2D-FAN (Face Alignment Network), a convolutional neural network-based method, to recognize children's faces in the videos and extract 68 landmarks. The algorithm was trained on the LS3D-W dataset, which consists of approximately 230,000 face photographs including adult men and women as well as children, and showed higher performance in extracting face landmarks compared with other algorithms. 26 We decomposed each child's video into frame-unit images of 33 ms, recognized the child's face in each image, and extracted the face landmark points (Figure 2).
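As an illustration, the extraction step might look like the following sketch, which assumes the open-source `face-alignment` package (a public 2D-FAN implementation) and OpenCV; the function name and frame-handling details are ours, not the authors'.

```python
# A minimal sketch of the landmark-extraction step, assuming the open-source
# `face-alignment` package and OpenCV; the authors' exact pipeline is not
# published, so the names here are illustrative.
import cv2
import face_alignment

# 2D landmark detector (68-point scheme); older package versions spell the
# enum LandmarksType._2D instead of TWO_D.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D, device="cpu")

def extract_landmarks(video_path):
    """Decode a video into ~33 ms frames and return one (68, 2) array per frame.

    Frames in which no face is detected yield None and are handled later,
    during preprocessing (interpolation / abnormal-frame removal).
    """
    cap = cv2.VideoCapture(video_path)
    landmarks_per_frame = []
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        preds = fa.get_landmarks(frame_rgb)  # list of (68, 2) arrays, or None
        landmarks_per_frame.append(preds[0] if preds else None)
    cap.release()
    return landmarks_per_frame
```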

Preprocessing
The extracted face landmark points are stored as 68 coordinate values. For frames in which the extraction algorithm did not recognize a face, we interpolated the coordinate values of the landmark points by reflecting the information of the previous and subsequent frames. In the video data, when a child bows his or her head or moves out of the video screen, the face landmarks may not be properly extracted; in this case, the video frame was regarded as an abnormal frame and removed (Figure 2).
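A minimal sketch of this preprocessing, assuming NumPy and pandas; the linear interpolation rule and the `max_gap` threshold for declaring frames abnormal are our illustrative reading of the text, not the authors' code.

```python
# Interpolate short landmark gaps from neighbouring frames and drop frames
# that remain unfilled (e.g., the child bows the head or leaves the screen).
import numpy as np
import pandas as pd

def preprocess_landmarks(landmarks_per_frame, max_gap=5):
    """Each frame is a (68, 2) array or None. Returns cleaned (n, 68, 2) data
    plus a boolean mask of the frames that were kept."""
    n = len(landmarks_per_frame)
    flat = np.full((n, 68 * 2), np.nan)
    for i, lm in enumerate(landmarks_per_frame):
        if lm is not None:
            flat[i] = lm.reshape(-1)
    df = pd.DataFrame(flat)
    # Linear interpolation between previous and subsequent valid frames,
    # filling at most `max_gap` consecutive missing frames per gap.
    df = df.interpolate(method="linear", limit=max_gap, limit_area="inside")
    keep = ~df.isna().any(axis=1)  # frames still missing = abnormal, removed
    return df[keep].to_numpy().reshape(-1, 68, 2), keep.to_numpy()
```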

Feature extraction
We utilized the OpenCV library to estimate the head pose in each frame from the landmark coordinate values. We measured three head-pose angles (pitch, roll, yaw) by specifying six landmark points (i.e., the eyes, nose, chin, and left and right mouth corners) and measuring the Euler angles describing how far these points were rotated in the frame, assuming 3D coordinates with the face directed forward. Pitch was measured as the angle of nodding the head up and down, roll as the angle of tilting the head from side to side, and yaw as the angle of rotating the head from left to right; each value was measured on a scale of -90 to 90 degrees (Figure 2).
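A sketch of this estimation using OpenCV's `solvePnP`, following a widely used six-point recipe; the generic 3D face-model coordinates and the camera approximation below are illustrative assumptions, not the authors' calibrated values.

```python
# Head-pose estimation from six of the 68 landmarks via a PnP solve.
import cv2
import numpy as np

# Generic 3D reference points (nose tip, chin, eye corners, mouth corners).
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),           # nose tip
    (0.0, -330.0, -65.0),      # chin
    (-225.0, 170.0, -135.0),   # left eye outer corner
    (225.0, 170.0, -135.0),    # right eye outer corner
    (-150.0, -150.0, -125.0),  # left mouth corner
    (150.0, -150.0, -125.0),   # right mouth corner
])
# Indices of the same six points in the 68-landmark scheme.
LANDMARK_IDX = [30, 8, 36, 45, 48, 54]

def head_pose(landmarks, frame_w, frame_h):
    """Return Euler angles (pitch, yaw, roll) in degrees for one frame."""
    image_points = landmarks[LANDMARK_IDX].astype("double")
    focal = frame_w  # common approximation: focal length ~ image width
    camera_matrix = np.array([[focal, 0, frame_w / 2],
                              [0, focal, frame_h / 2],
                              [0, 0, 1]], dtype="double")
    _, rvec, _ = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix,
                              np.zeros((4, 1)))  # assume no lens distortion
    rmat, _ = cv2.Rodrigues(rvec)
    angles, *_ = cv2.RQDecomp3x3(rmat)  # Euler angles about x, y, z axes
    return angles
```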
To measure how much a child's face moved between the previous and current frames, we took the average position of the 68 landmark coordinates in a frame as the center point of the face and calculated the Euclidean distance between consecutive center points to determine the distance traveled. Since frames were captured at intervals of 33 ms, the distance the face traveled was measured reflecting this frame rate.
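This displacement feature could be computed as follows; a sketch assuming NumPy, with `face_movement` as our illustrative name.

```python
# Per-frame movement speed of the face centre, scaled by the 33 ms frame interval.
import numpy as np

def face_movement(landmarks_seq, frame_interval_s=0.033):
    """landmarks_seq: (n_frames, 68, 2). Returns speed in pixels per second."""
    centres = landmarks_seq.mean(axis=1)  # mean of the 68 points per frame
    displacement = np.linalg.norm(np.diff(centres, axis=0), axis=1)
    return displacement / frame_interval_s  # reflect the frame rate
```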
The changes between the 68 landmark points were measured across the video. Combining all landmark points pairwise yields a total of 2278 distance variables. We calculated the Euclidean distances of all these combinations and selected the frames in which each child was staring forward. Based on the landmark-combination distances of this forward-staring (normal) frame, the ratio of each remaining frame's distances to those of the normal frame was obtained. Among all the obtained proportion variables, the top 40 variables, selected for showing significant differences in distribution between children with developmental disabilities and those without, were used as derived features.
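A sketch of these pairwise-distance features, assuming NumPy; the choice of reference frame and the top-40 selection are noted in comments rather than implemented in full.

```python
# Pairwise landmark distances per frame, expressed as ratios to a
# forward-staring reference frame.
from itertools import combinations
import numpy as np

PAIRS = list(combinations(range(68), 2))  # 68 choose 2 = 2278 distance variables

def distance_ratio_features(landmarks_seq, forward_idx):
    """landmarks_seq: (n_frames, 68, 2); forward_idx: index of a frame in
    which the child stares forward. Returns an (n_frames, 2278) ratio matrix."""
    d = np.stack(
        [np.linalg.norm(landmarks_seq[:, i] - landmarks_seq[:, j], axis=1)
         for i, j in PAIRS],
        axis=1,
    )
    # Proportion of each frame's distances relative to the "normal" frame;
    # the 40 ratio variables with the largest group differences were then
    # kept as derived features (selection step not shown).
    return d / d[forward_idx]
```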

Model algorithm
The data used in the current analyses were time-series data consisting of frames of video recordings of the children's faces. Accordingly, we used a long short-term memory (LSTM) model for the binary classification of developmental disability. 27 As a recurrent neural network (RNN) variant, the LSTM model determines whether a weight value is maintained by adding a cell state to each LSTM cell. The state obtained from one LSTM cell is used as input to the next, so the state of an LSTM cell affects the operation of subsequent cells. The final target output at the end of the sequence represents the label classifying developmental disability. The LSTM model can remove or add information to the cell state, carefully regulated by structures called gates, which optionally let information through. The LSTM model retains information longer than conventional RNNs because it can control long-term memory. 28 Since the lengths of the seven videos differ, we generated an LSTM model corresponding to each video, utilized variables such as sex and age as inputs to an additional deep neural network (DNN) model, and built a model that combines the results of the seven LSTM models with the result of the one DNN model to finally predict developmental disability (Figure 3).
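A minimal sketch of this architecture in Keras (TensorFlow); the layer widths, masking scheme, and late-fusion head are our illustrative assumptions, since the paper specifies only the overall design of seven per-video LSTMs combined with one DNN.

```python
# Seven per-video LSTM branches fused with a demographic DNN branch.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(n_features, n_videos=7, n_demographics=2):
    seq_inputs, branch_outputs = [], []
    for v in range(n_videos):
        # Variable-length frame sequences; zero padding is masked out.
        x_in = layers.Input(shape=(None, n_features), name=f"video_{v}")
        x = layers.Masking(mask_value=0.0)(x_in)
        x = layers.LSTM(64)(x)  # one LSTM per video
        seq_inputs.append(x_in)
        branch_outputs.append(x)
    # DNN branch for demographic inputs such as sex and age.
    demo_in = layers.Input(shape=(n_demographics,), name="demographics")
    demo = layers.Dense(16, activation="relu")(demo_in)
    merged = layers.concatenate(branch_outputs + [demo])
    merged = layers.Dense(32, activation="relu")(merged)
    out = layers.Dense(1, activation="sigmoid")(merged)  # DD vs. non-DD
    model = Model(inputs=seq_inputs + [demo_in], outputs=out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```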
We also performed stratified K-fold cross-validation for the robustness of the model. Stratification is the process of rearranging the data to ensure that each fold is a good representative of the whole dataset, which aids parameter fine-tuning and helps the model better classify developmental disability. 29 The technique splits the dataset into K sets; the model is trained on K-1 folds and validated on the Kth fold, and this continues until every fold has been used for validation once. The K used in our study was five. For the evaluation of the trained model, the following standard machine learning metrics were used:
1. Accuracy: percentage of correctly classified data frames in the given test dataset.
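A sketch of the stratified five-fold evaluation with scikit-learn's `StratifiedKFold`, reusing the hypothetical `build_model` factory from the previous sketch; the input layout is our assumption.

```python
# Stratified K-fold cross-validation over the multi-input model.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(X_videos, X_demo, y, n_features, n_splits=5):
    """X_videos: list of 7 padded arrays, each (n_children, max_frames, n_features);
    X_demo: (n_children, 2) sex/age array; y: binary DD labels."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in skf.split(X_demo, y):
        model = build_model(n_features)  # fresh model per fold
        model.fit([x[train_idx] for x in X_videos] + [X_demo[train_idx]],
                  y[train_idx], epochs=20, verbose=0)
        _, acc = model.evaluate([x[val_idx] for x in X_videos] + [X_demo[val_idx]],
                                y[val_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))  # mean validation accuracy over the folds
```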

Statistical analysis
To compare the differences in the distributions of variables between the two groups, the normality of each variable was tested using the Shapiro-Wilk test. Variables satisfying normality were compared using Student's t-test, and those not satisfying normality were compared between groups using the Mann-Whitney U test. All statistical analyses were conducted using Python (version 3.7).
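This normality-gated comparison might be implemented with SciPy as follows; a sketch in which the function name and the 0.05 normality threshold are our illustrative choices.

```python
# Choose the group-comparison test based on Shapiro-Wilk normality.
from scipy import stats

def compare_groups(dd_values, td_values, alpha=0.05):
    """Student's t-test if both groups pass Shapiro-Wilk normality,
    otherwise the Mann-Whitney U test."""
    both_normal = (stats.shapiro(dd_values)[1] > alpha and
                   stats.shapiro(td_values)[1] > alpha)
    if both_normal:
        return stats.ttest_ind(dd_values, td_values)
    return stats.mannwhitneyu(dd_values, td_values, alternative="two-sided")
```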

Results
Overall population

Model interpretation
After checking the performance of the model through cross-validation, we sought to find the variables that significantly contribute to the prediction of DDs through model interpretation using SHAP. 30 We calculated SHAP values with the DeepExplainer of the SHAP package on the five folds of the training dataset and computed the mean of the absolute SHAP values across all folds. Among the obtained SHAP values, the head-nodding (pitch) angle variable ranked highest by a large margin, thus making the largest contribution to predicting DDs (Figure 4).
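A sketch of this attribution step, assuming the `shap` package; for simplicity it treats each fold's model as single-input, whereas the paper's model takes multiple inputs, so the bookkeeping here is illustrative.

```python
# Mean absolute SHAP value per feature, averaged over the five folds.
import numpy as np
import shap

def mean_abs_shap(fold_models, fold_backgrounds, fold_samples):
    """fold_models: one trained model per fold; fold_backgrounds/fold_samples:
    matching background and evaluation arrays. Returns per-feature scores."""
    per_fold = []
    for model, background, samples in zip(fold_models, fold_backgrounds,
                                          fold_samples):
        explainer = shap.DeepExplainer(model, background)
        # For a single sigmoid output, shap_values is a one-element list.
        values = explainer.shap_values(samples)[0]
        per_fold.append(np.abs(values).mean(axis=0))
    return np.mean(per_fold, axis=0)  # rank features by this magnitude
```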
In addition, for the top 10 variables with the highest contributions, the differences in the distributions of these variables between children with DD and those without were analyzed. The Mann-Whitney U test showed significant differences in the distributions of all these variables between the two groups (P<.05) (Figure 5) (Table 3). In the Mann-Whitney U test, the values of a variable in both groups are ordered by size to calculate their ranks; the U value is then calculated from the rank sums, the rank averages, and the number of observations in each group.
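For reference, the U statistic has the standard textbook form (not reproduced from the paper):

```latex
U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1, \qquad U = \min(U_1, U_2)
```

where \(n_1\) and \(n_2\) are the group sizes and \(R_1\) is the sum of ranks in the first group.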

Discussion
In this study, we present the utility of a deep learning classification model using mobile-based video data that predicts the presence of DD by distinguishing the facial characteristics of children with DD from those of children without, extracting 68 facial landmarks from their faces and generating derived variables such as head-pose estimates (pitch, yaw, roll) and landmark point distances. The model predicted the presence of DD with an average accuracy of 88%, and we found that, for the pitch (head-nodding) variable, children with DD had a significantly wider distribution than those without. Through the model interpretation process, we identified important predictive variables, including the pitch variables, all of which showed statistically significant differences in distribution between children with DD and those without.

Established screening questionnaires showed a sensitivity of 71-75% and a specificity of 63-65%. 31,32 The ABC was reported to have a sensitivity of 78.4%. 33 PEDS, which consists of two open-ended questions and eight yes/no questions completed by parents, showed sensitivities of 78.9% and 54.9% for severe and moderate-to-severe delays, respectively, and a specificity of 79.6%. 34 ASQ-3 showed sensitivities of 60.0% and 53.1% for severe and moderate-to-severe delays, respectively, and a specificity of 89.4%. 35 Thus, in terms of detection accuracy, our classification model (88%) seems to have performance comparable to the existing screening methods.
Several digital screening methods for DDs have been suggested in previous studies. Most web-based developmental surveillance programs are trials of online versions of established questionnaires. The Web-Based Modified Checklist for Autism in Toddlers with Follow-up Interview (M-CHAT/F) is a checklist scored by parents and implemented as a two-stage screening test, in which a positive result prompts a follow-up interview to clarify or correct the failed items. When administered by primary care pediatricians, the web-based M-CHAT/F had a sensitivity of 59% and a specificity of 71%. 36 In another study that used the digital M-CHAT-Revised with Follow-up, accurate documentation of screening results in the electronic health record increased from 54% to 92%, and appropriate action for children screening positive increased from 25% to 85%, compared with the results from the paper form of the M-CHAT. 37 In addition, the smartphone application PEDS operated by community healthcare workers was shown to correspond closely with the gold-standard paper-based PEDS tools operated by health professionals. 38 Most smartphone screening applications also focus on questionnaires answered by parents or medical professionals. 39 ASDTests is an application based on the Autism-Spectrum Quotient and the Quantitative Checklist for Autism in Toddlers that evaluates the possibility of having autistic traits. 15 Cognoa is a mobile screening application that consists of both parental questionnaires and home video recording, and has a sensitivity of 75% and a specificity of 62%. 13,39 These studies suggest that web-based or mobile-based screening tools could be reliably used for screening DD. Because web-based or mobile-based screening tools are quicker, cheaper, and more accessible, they could help improve the early identification of DD.
Some recent studies evaluated DD using digital observational methods that analyze gaze, faces, or behaviors. Eye-tracking algorithms have shown promise for screening ASD in rural areas. 16,17 Vargas-Cuentas and colleagues 16 recorded videos of participants watching social or non-social videos and analyzed the image frames from the videos. Fujioka and colleagues 17 used infrared light sources and cameras to record eye position. In one study from Bangladesh, a machine learning classifier trained on data from the ADOS and ADI-R was able to detect developmental delay and autism by analyzing the behavior portrayed in home videos, with a sensitivity and accuracy of 76%. 40 Strobl and colleagues 14 also developed a smartphone application in which the participants' gaze was analyzed by an eye-tracking algorithm. These studies show that digital methods could be used for the screening of DD.
Our study showed that, among mobile-based methods, facial landmark analysis could play a significant role in the detection of DD. In previous studies examining head pose and facial expressions, Happy and colleagues used automated face tracking (ZFace) to demonstrate the differences between typically developing children and children with ASD. 41 Their results differ from ours in that they found differences in the speed and quantity of head movement in yaw and roll, but not in pitch. In another study, children with ASD and those with ADHD were differentiated with an accuracy of 94% via a Red-Green-Blue-Depth sensor from a depth-measurement camera. 42 That study is similar to our work in that it found differences in facial expressions using FACS, but it differs from our results in that it targeted adults aged 18 and older and found differences in head movements in yaw. While those studies were computer-based programs requiring special-purpose equipment, our study used a mobile-based application and can thus be more convenient and easier to use. In one study, children watched movies on a smart tablet while the embedded camera recorded their facial expressions; computer vision analysis then automatically tracked the facial landmarks and used them to classify the facial expressions into three types (positive, neutral, and other) with a maximum sensitivity of 73%, with results differing by the type of movie shown; notably, children with ASD displayed neutral expressions more often than children without ASD. 43 This study differs from ours in that we evaluated not only children with ASD but also those with other DDs.
Based on our results, we carefully suggest that facial landmarks and head poses may be used for screening children with DD. A recent study that quantified head movement dynamics (displacement and velocity) showed that children with ASD had greater head movement dynamics than those without ASD. 41 Several papers have hypothesized that turning away may be an adaptive strategy for individuals with ASD to regulate an overwhelming amount of information, 44,45 which may explain the atypical head movement of individuals with ASD. Therefore, using facial landmarks as a screening method could aid the early identification of children with DD.
There are several limitations to this study. First, we were unable to find significant differences in facial landmarks or head pose between the social-video and non-social-video conditions. Second, our study did not analyze the results of the subgroups of DD (i.e., ASD, ID, LD). Third, since children with incorrect data were excluded, the sample size is relatively small, which limits generalizability. Fourth, we do not know whether these findings are limited to certain age groups. Fifth, our study did not consider body motion information because the videos recorded only the children's faces.
Despite these caveats, our study evaluated the utility of digital methods, especially mobile-based methods, for screening DD in community-based preschool children. Our results provide preliminary evidence that a deep learning classification model using mobile-based video data of children could be used for the early detection of DD.