Study Participants
A total of 31 individuals participated in the study, comprising a near-equal distribution of control (n = 16) and CVI (n = 15) participants.
Sixteen participants with neurotypical development, aged between 14 and 27 years old (mean age 18.75 years ± 3.47 SD), were enrolled in the study. Fifteen participants previously diagnosed with CVI, aged between 8 and 23 years old (mean age 15.73 years ± 5.09 SD), served as a comparative group. Comparison of the CVI and control groups revealed no statistically significant difference with respect to age (t(24.526) = 1.915, p = 0.067, d = 0.697).
Language ability data for the CVI cohort were also collected from available clinical records. Verbal IQ was assessed using subtests from the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV) and the Wechsler Adult Intelligence Scale, Fourth Edition (WAIS-IV) (specifically, the Digit Span, Similarities, and Vocabulary subtests of the WISC-IV and the Digit Span, Similarities, Vocabulary, and Information subtests of the WAIS-IV) to obtain an index of verbal comprehension. The mean score for the CVI participants was 93.00 ± 31.06 SD (range 44 to 148).
Control participants had normal or corrected-to-normal visual acuity and no previous history of any ophthalmic (e.g. strabismus, amblyopia) or neurodevelopmental (e.g. attention deficit hyperactivity disorder) conditions.
All participants with CVI were previously diagnosed by eyecare professionals with extensive clinical experience working with this population (see 54 for further details regarding diagnosis of CVI). Briefly, the diagnosis was based on a directed and objective assessment of visual functions (including visual acuity, contrast, visual field perimetry, color, and ocular motor functions), functional vision assessment (use of structured questionnaires, surveys, and activities), a thorough refractive and ocular examination, as well as an integrated review of medical history and available neuroimaging and electrophysiology records 22,55,56. Causes of CVI were diverse and included hypoxic-ischemic injury related to prematurity and complications occurring at childbirth, periventricular leukomalacia (PVL), as well as genetic and metabolic disorders. Five CVI participants were born prematurely (less than 37 weeks gestation). Associated neurodevelopmental comorbidities included cerebral palsy (CP). Best corrected binocular visual acuity ranged from 20/15 to 20/70 Snellen (or -0.12 to 0.54 logMAR equivalent). In this study sample, the CVI participants were predominantly categorized as category 3 (85.71%; defined as “functionally useful vision and who can work at or near the expected academic level for their age group”) based on previously defined functional criteria 19. Exclusion criteria included any evidence of oculomotor apraxia (i.e. failure of saccadic initiation), intraocular pathology (other than mild optic atrophy), uncorrected strabismus, as well as hemianopia or a visual field deficit corresponding to the area of testing (see Supplementary Table 1 for complete demographic details).
All study participants had visual acuity, intact visual field function within the area corresponding to the visual stimulus presentation, and fixation and binocular ocular motor function sufficient to complete the task requirements and eye tracking calibration (see below).
Prior to data collection, written informed consent was obtained from all participants and from a parent/legal guardian (in the case of a minor). The study was approved by the Institutional Review Board at Massachusetts Eye and Ear in Boston, MA, USA, and carried out in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki) for experiments involving humans.
Visual Image Selection and Salience Analysis
Eighty images (40 indoor, 40 outdoor scenes) were sourced from the LabelMe image database 57. The LabelMe image database is an open-source tool for labeling objects within a naturalistic visual scene. Images chosen had between 20 and 114 labeled objects (mean = 46.413 objects ± 21.704 SD) and were of similar complexity (see the Results section for confirmatory analysis). Prior to conducting the experiment, pilot testing was completed to confirm that the chosen presentation time was appropriate for all participants and for the number of images viewed (i.e. total test time). We manually reduced the noise found in the LabelMe database 39 according to the following set of criteria. First, we removed descriptor words, removed test/duplicate/nonsense labels, corrected spelling errors, and translated non-English labels. Second, because we used GloVe as the basis for our semantic predictor model (see below), and because GloVe does not handle more than one word at a time, we reduced all multi-word labels to single words and made sure all words existed in the GloVe semantic space. For this purpose, we manually edited labels by finding suitable single-word replacements (e.g. “license plate” became “numberplate”) or by combining valid words (e.g. “street light” became “streetlight”). If we were unsure of a suitable replacement, we consulted MATLAB’s “vec2word” function to find the single word closest to the multi-word label’s component words (e.g. “trash bin” became “bin”).
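As an illustration of this cleaning step, the following is a minimal MATLAB sketch, assuming a GloVe embedding in standard text format and the Text Analytics Toolbox functions readWordEmbedding, isVocabularyWord, word2vec, and vec2word; the file name and replacement policy are illustrative rather than the exact procedure used.

% Minimal sketch of the label-cleaning step (illustrative).
emb = readWordEmbedding("glove.6B.300d.txt");   % hypothetical embedding file
label = "trash bin";                            % example multi-word label
parts = split(label);                           % ["trash"; "bin"]
if all(isVocabularyWord(emb, parts))
    % Average the component-word vectors and look up the closest single
    % word in the embedding space, as an aid to choosing a replacement.
    vecs = word2vec(emb, parts);                % one row per component word
    candidate = vec2word(emb, mean(vecs, 1));   % e.g. returns "bin"
else
    candidate = "";                             % flag the label for manual editing
end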
We utilized GBVS as our image salience model. The GBVS model is readily available as an open-source toolbox and has demonstrated strong reliability and success in predicting human fixations based on low-level features associated with bottom-up processing. GBVS generates three individual feature maps based on color, orientation, and intensity, and the average of these maps represents a heatmap of the most salient image features. The GBVS model also factors in a general center bias, predicting that the center of the image will be fixated more frequently than the edges due to the nature of eye movements during visual search 3,7,58–64. The generated heatmap ranges in value from 0 to 1, where 0 represents areas with low image salience and 1 represents areas with high image salience. High image salience is characterized by areas of a scene that differ starkly from their surroundings in one of the three aforementioned feature categories. For example, high color salience would be an orange traffic cone on a gray road; high orientation salience would be the strong, straight line of a tree trunk against an empty sky; and high intensity salience would be a bright light in a dimly lit room. We applied a Gaussian blur equal to the estimated pixel error reported by the manufacturer of the eye tracker used (Tobii 4C, see below for further details) to soften boundaries and minimize errors that might occur from slight deviations in reported gaze location (see Fig. 1B for a representative example).
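For reference, a minimal sketch of this step is shown below, assuming the open-source GBVS MATLAB toolbox (its gbvs function and master_map_resized output field) and the Image Processing Toolbox; the blur width is a placeholder for the manufacturer-reported pixel error rather than the exact value used.

% Sketch: image salience map with Gaussian blur (illustrative values).
img = imread("scene_indoor_01.jpg");        % hypothetical scene image
out = gbvs(img);                            % combined color/orientation/intensity salience
salMap = out.master_map_resized;            % salience map at the image resolution
sigmaPx = 30;                               % placeholder for the tracker's pixel error
salMap = imgaussfilt(salMap, sigmaPx);      % soften boundaries against gaze jitter
salMap = mat2gray(salMap);                  % rescale to the 0-1 range described above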
For our image semantics heatmaps, we utilized GloVe in conjunction with LASS. The GloVe model is a widely used and regularly updated word embedding model, and we have previously modified the LASS model to incorporate GloVe specifically for visual search 41. Specifically, we used GloVe to quantify image semantic salience by measuring how near two word labels are in a defined “semantic space”. This semantic space is created by categorizing words across feature dimensions and placing them in a high-dimensional “web” of similarity. For example, all animal-related words would cluster together, and all reptile-related words would cluster within that cluster. In this way, the distance between two words represents the similarity between them (i.e. “frog” will be more related to “horse” than it would be to “airplane”). This similarity is quantified as the cosine similarity between two word vectors, where 0 = not similar and 1 = identical. This allows similarity values to be assigned to each object within a scene, where the comparison word is always the target object in the image. To spatially assign these values across the scene, we used LASS, which is a method of generating context labels (i.e. the word used to compare all scene objects, in this case, the target object), calculating the semantic similarity scores (GloVe), and embedding those scores within object masks defined by LabelMe. The result is a heatmap in which all objects are scored based on their similarity to the target object (the area within unrelated objects will be close to 0 and the area within the target object will equal 1). For example, in a scene where the target object was “boots”, the area labeled “floor” would have higher semantic salience than the area labeled “desk”, because boots are more often located on the floor than on a desk. As with the image salience maps, we applied the same Gaussian blur (see Fig. 1C for a representative example).
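A minimal sketch of how such a semantic heatmap can be assembled is shown below; it assumes the LabelMe polygons have already been parsed into a struct array (objects, with fields word, x, and y) and uses the Text Analytics Toolbox and poly2mask, illustrating the GloVe/LASS idea rather than reproducing the LASS implementation itself.

% Sketch: semantic salience map from GloVe similarities (illustrative).
emb = readWordEmbedding("glove.6B.300d.txt");   % hypothetical embedding file
target = "boots";                               % target object label for this scene
tVec = word2vec(emb, target);
h = 1080; w = 1920;                             % illustrative scene dimensions (pixels)
semMap = zeros(h, w);
for k = 1:numel(objects)
    oVec = word2vec(emb, objects(k).word);
    % Cosine similarity between target and object labels (0 = unrelated, 1 = identical)
    sim = dot(tVec, oVec) / (norm(tVec) * norm(oVec));
    mask = poly2mask(objects(k).x, objects(k).y, h, w);
    semMap(mask) = max(semMap(mask), sim);      % keep the higher score where objects overlap
end
semMap = imgaussfilt(semMap, 30);               % same placeholder blur as the image salience map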
Testing Procedure
Participants were seated comfortably in a quiet room, 60 cm in front of a 17" LED monitor (Alienware laptop computer, 1080p; 1920 x 1080 resolution). Eye movement patterns during visual search (i.e. X,Y coordinate positions of gaze) were captured under binocular viewing conditions using a Tobii 4C Eye Tracker system (90 Hz sampling frequency, Tobii Technology AB, Stockholm, Sweden) mounted on the lower portion of the monitor. Participants were reminded to maintain their gaze on the monitor during testing but were otherwise able to move their head freely. Prior to each experiment, eye tracking calibration was performed on each participant (Tobii Eye Tracking Software, v 2.9 calibration protocol), which took less than one minute to complete. The process included a 7-point calibration task (screen positions: top-left, top-center, top-right, bottom-left, bottom-center, bottom-right, and center-center) followed by a 9-point post-calibration verification (i.e., the same 7 calibration points plus the center-left and center-right positions). The accuracy criterion was defined as gaze fixation falling within a 2.25 arc degree radius around each of the 9 screen positions, and was confirmed by visual inspection prior to data collection.
Participants were shown 2 blocks of 40 images: one block in which all targets were presented as image cues, and one in which all targets were presented as text cues. The order of the conditions was counterbalanced across participants. On each trial, participants were shown a target for 2 sec (either as an image or text), followed by the visual scene, which was explored for 4 sec (Fig. 2). Participants were instructed to search for the target object within the scene and, once it was located, to maintain their gaze on the object until the end of the trial. To balance the design, two task variations were used: 1) Text A, Image B and 2) Text B, Image A. All targets that were text cues in one version were image cues in the other (e.g. “fire hydrant” would be a text cue for half of the participants and an image cue for the other half, for the same search scene).
Behavioral Outcomes and Statistical Analyses
Primary visual search performance outcomes based on gaze behavior were success rate (the percentage of trials in which participants successfully found and fixated on the target object) and reaction time (the time in milliseconds taken to locate and fixate on the target object from the onset of the scene presentation). The period during cue presentation (i.e. when the object image or text cue was presented) was not included in the reaction time measurement.
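As an illustration, a per-trial sketch of these two outcomes is given below, assuming gaze is an n-by-2 matrix of X,Y samples recorded at 90 Hz during the 4 sec scene presentation and targetMask is a logical image marking the target object; the success and timing criteria shown are simplified placeholders rather than the exact scoring rules.

% Sketch: per-trial success and reaction time from gaze samples (illustrative).
fs = 90;                                          % Tobii 4C sampling rate (Hz)
inTarget = false(size(gaze, 1), 1);
for i = 1:size(gaze, 1)
    x = round(gaze(i, 1));  y = round(gaze(i, 2));
    if x >= 1 && x <= size(targetMask, 2) && y >= 1 && y <= size(targetMask, 1)
        inTarget(i) = targetMask(y, x);           % gaze sample lands on the target object
    end
end
success = any(inTarget);                          % target located during the trial
if success
    reactionTimeMs = (find(inTarget, 1) - 1) / fs * 1000;  % ms from scene onset
else
    reactionTimeMs = NaN;
end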
Secondary visual search outcomes were visual search area and number of fixations. We measured the approximate area that participants searched using a kernel density analysis. We used the MATLAB function ksdensity to plot the contours containing the gaze data. ksdensity returns a probability density estimate based on a normal kernel function for all sample data. Essentially, a 3D map is plotted in which the peaks correspond to higher-density areas of gaze points. We then converted these 3D maps into 2D polygons, where each polygon traces the boundary of the plotted contours, and this area corresponds to the search area. To detect and measure the number of fixations, we used the function NonParaFixLab 65. NonParaFixLab calculates the optimum speed and duration thresholds for a given trial and evaluates each gaze point against those criteria. When a gaze point meets both the speed and duration criteria determined for a given trial, that point and subsequent qualifying points are classified as belonging to a single fixation.
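A minimal sketch of the search-area estimate is shown below, assuming gaze is an n-by-2 matrix of X,Y samples on a 1920 x 1080 screen; the grid resolution and contour level are illustrative choices rather than the exact parameters used.

% Sketch: search area from a kernel density estimate of gaze (illustrative).
[gx, gy] = meshgrid(linspace(1, 1920, 192), linspace(1, 1080, 108));
f = ksdensity(gaze, [gx(:), gy(:)]);   % bivariate normal-kernel density estimate
F = reshape(f, size(gx));
level = 0.05 * max(F(:));              % illustrative contour level
C = contourc(gx(1,:), gy(:,1), F, [level level]);
searchArea = 0;                        % total area enclosed by the contour polygons
k = 1;
while k < size(C, 2)
    n = C(2, k);                       % number of vertices in this contour segment
    xs = C(1, k+1:k+n);
    ys = C(2, k+1:k+n);
    searchArea = searchArea + polyarea(xs, ys);
    k = k + n + 1;
end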
We used an ROC analysis to quantify the predictive power of the image (GBVS) and semantic (GloVe-based) salience models. An ROC curve is created by counting the hits, misses, false alarms, and correct rejections that occur as the salience threshold applied to the heatmap is increased. For example, when testing at a threshold of 0.5, only areas of the heatmap with a value of 0.5 or higher are considered predicted fixation locations. Any gaze point that falls within a predicted area is counted as a hit, and any unpredicted area without gaze points is counted as a correct rejection. Conversely, gaze points falling in unpredicted areas are counted as misses, and predicted areas that do not contain gaze points are counted as false alarms. From this, we can calculate the true and false positive rates, where the true positive rate equals true positives / (true positives + false negatives), and the false positive rate equals 1 – (true negatives / (true negatives + false positives)). We repeated this at 100 threshold levels increasing from 0 to 1, plotting the resulting false positive rates on the X axis and true positive rates on the Y axis to generate an ROC curve. We used the MATLAB function AUC_Judd 66 to calculate the ROC curves and the area under the curve (AUC). AUC values range from 0 to 1; the higher the AUC, the greater the predictive power of the model. An AUC of 1 means a participant looked exactly where the model predicted, while 0.5 means the model predicted no better than chance. An AUC of 0 means that gaze points fell entirely outside the areas predicted by the model.
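For clarity, a simplified sketch of this computation is shown below; it is illustrative and not the AUC_Judd implementation, assuming salMap is a salience heatmap normalized to 0-1 and fixMap is a logical matrix marking fixated pixels.

% Sketch: ROC curve and AUC for a salience map against gaze (illustrative).
levels = linspace(0, 1, 100);
tpr = zeros(size(levels));
fpr = zeros(size(levels));
salAtFix    = salMap(fixMap);          % salience values at fixated pixels
salAtNonFix = salMap(~fixMap);         % salience values at non-fixated pixels
for i = 1:numel(levels)
    t = levels(i);
    tpr(i) = mean(salAtFix    >= t);   % hits / (hits + misses)
    fpr(i) = mean(salAtNonFix >= t);   % false alarms / (false alarms + correct rejections)
end
auc = -trapz(fpr, tpr);                % fpr decreases as t increases, hence the sign flip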
All statistical analyses were carried out using the SPSS Statistics package (version 28; IBM, Armonk, NY). To evaluate differences between the CVI group and the control group, as well as the effects of participants being presented with an image cue compared to a text cue, we performed separate repeated-measures analyses of variance (ANOVA) for all outcomes of interest (success rate, reaction time, visual search area, number of fixations, and ROC scores) with group as the between-subjects factor and cue as the within-subjects factor. In the case of significant group effects, independent samples t-tests were performed for each cue separately to confirm directionality. Where there were significant cue effects, paired-sample t-tests were performed for each group separately. Mann-Whitney U tests were conducted on the visual search area and number of fixations data to investigate whether these outcomes were similarly distributed between the CVI and control groups. As an ancillary analysis, we also examined whether success rates and reaction times were associated with verbal IQ scores in CVI participants. For this purpose, separate linear regression analyses were performed for the image cue and text cue conditions. Effect sizes are reported as partial eta squared. One CVI participant was only able to complete half of the experiment (image cue only). No data were omitted from the analysis.