Transformers bridge vision and language to estimate and understand scene meaning

Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language.

Figure 1. Meaning mapping and input/target preprocessing. A meaning map for each scene (a) is built by breaking each scene into circular patches at two spatial scales (b) and then having humans rate the patches. The human patch ratings are then recombined to generate a scene meaning map (c). To train DeepMeaning, each scene image (a) and meaning map (c) were broken into patches using a square grid (c). The square scene image patches served as the input to the pretrained Vision Transformer (ViT) of the Contrastive Captioner (CoCa), while the average meaning map value of each square region served as the target value to be predicted. Raincloud plots (d) show the distributions of the meaning target values, which were normally distributed for both indoor and outdoor scenes.
a multimodal vision-language representational space to estimate local scene meaning. Cognitive guidance theory is the theoretical framework anchoring our work (Henderson, 2003, 2011). Under this view, semantic knowledge stored in memory 'pushes' our attention toward scene regions that are recognizable, informative, and relevant to our current goals (Henderson & Hollingworth, 1999; Potter, 1975; Biederman, 1972; Wolfe & Horowitz, 2017; Land & Hayhoe, 2001). That is, where we look in scenes is primarily driven by semantic representations that guide our attention toward meaningful scene regions. There is a long history of evidence supporting the relationship between semantic properties and attention in scenes (Buswell, 1935; Yarbus, 1967). A limitation of this work is that it often focused on isolated object-scene semantic relationships (e.g., swapping an octopus and a tractor in an underwater and farm scene, respectively). While these discrete semantic manipulations were important in establishing a causal relationship between scene semantics and attention, they do not tell us much about the overall role of semantic guidance in scene understanding (Henderson & Hayes, 2017).

To study the effects of scene semantics globally across entire scenes, we recently introduced two different approaches: meaning maps (Henderson & Hayes, 2017) and concept maps (Hayes & Henderson, 2021). Meaning maps use human raters to estimate a given semantic feature at each location in the scene. Specifically, each scene (Fig.1a) is broken into small circular image patches at two spatial scales (Fig.1b), and then participants rate a random subset of these image patches based on a given semantic instruction (e.g., meaningful, informative, and recognizable; Henderson & Hayes, 2017). These ratings are then combined back into their respective positions to form a map of local scene meaning (Fig.1c). Local scene meaning has repeatedly been shown to be one of the strongest predictors of where people look in scenes regardless of the viewing task (for review see Henderson, Hayes, Peacock, & Rehrig, 2019). In addition to local meaning maps, we also developed a separate language-based approach using a vector space semantic model called ConceptNet Numberbatch (Hayes & Henderson, 2021). ConceptNet Numberbatch derives the semantic relationships between words based on regularities in almost a trillion words of written text and crowd-sourced basic knowledge about the world (Günther, Rinaldi, & Marelli, 2019). The semantic representations from ConceptNet can then be mapped back onto the objects in a scene to form a 'concept map' that reflects how semantically related each object is to the rest of the scene, which was also strongly associated with scene attention (Hayes & Henderson, 2021).

Therefore, meaning maps and concept maps each approach scene semantics from a different angle. Meaning maps are constructed by filtering a visual stimulus through the cognitive system of human raters to estimate semantic properties in scenes (e.g., local meaning, Henderson & Hayes, 2017; graspability, Rehrig, Peacock, Hayes, Henderson, & Ferreira, 2020), while concept maps are non-visual, building semantic representations based entirely on regularities in human-generated language. However, humans often acquire semantic knowledge through an interplay of visual and language experience (Clarke, 2015; Ralph, Jefferies, Patterson, & Rogers, 2017), so scene semantics may best be understood within a computational framework that forms a multimodal mapping between vision and language. Here we apply just such a framework, a state-of-the-art Contrastive Captioner (CoCa), which serves as a foundational vision-language representational model (Yu et al., 2022). While transformers have played a large role in natural language processing, it is only recently that transformers have been generalized to also include visual and multimodal vision-language domains (Vaswani et al., 2017; Dosovitskiy et al., 2021; Yu et al., 2022). CoCa in particular recently introduced a unique architecture that unifies many of the strengths of previous transformer architectures (i.e., single-encoder, dual-encoder, and encoder-decoder), allowing CoCa to learn aligned unimodal text and image embeddings as well as a fused multimodal image-text representational space (Yu et al., 2022).
It is this unique ability that allows CoCa to learn very general representations and achieve state-of-the-art performance across virtually every major image, language, and multimodal benchmark (Yu et al., 2022), and it is precisely this ability that we leverage here to estimate local scene meaning.
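For readers who wish to work with the same class of model, CoCa checkpoints trained on LAION-2B are distributed through the OpenCLIP library (Ilharco et al., 2021). The sketch below illustrates how the aligned image and text embeddings can be obtained; the model and pretrained tags, the file name, and the example caption are our own assumptions and may differ from the exact checkpoint used in this study.

```python
# Minimal sketch (not the authors' code): load a pretrained CoCa model via
# OpenCLIP and extract aligned image and text embeddings.
import torch
import open_clip
from PIL import Image

# Model/pretrained tags are assumptions; check open_clip.list_pretrained() for
# the checkpoints available in your OpenCLIP version.
model, _, preprocess = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="laion2B-s13B-b90k"
)
tokenizer = open_clip.get_tokenizer("coca_ViT-L-14")
model.eval()

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)   # hypothetical file
text = tokenizer(["a kitchen counter with a coffee maker"])  # hypothetical caption

with torch.no_grad():
    image_features = model.encode_image(image)   # unimodal image embedding
    text_features = model.encode_text(text)      # unimodal text embedding

# Cosine similarity in the shared vision-language space
print(torch.cosine_similarity(image_features, text_features).item())
```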

In the present study, we used the pretrained feature space of CoCa to estimate local scene meaning (Fig.1c and Fig.1d) in a model we call 'DeepMeaning'. An overview of how DeepMeaning estimates local scene meaning is shown in Fig.2a, and the procedure can be broadly split into a feature extraction stage and a leave-one-scene-out cross-validation stage. In the feature extraction stage, we take the CoCa model pretrained on more than 2 billion unique image-text pairs (Fig.2a, purple) and use it to generate CoCa features for each local scene region by breaking each scene into smaller patches using a square grid (Fig.2a, white). Then, we train one linear model (Fig.2a, red) for indoor scenes and another for outdoor scenes, using these general CoCa features for the scene patches as predictors to estimate local meaning under a leave-one-scene-out procedure (Fig.2a, grey). Indoor and outdoor scenes were modeled separately because there is evidence that indoor and outdoor scenes differ behaviorally (Torralba et al., 2006, 2007). Using this general procedure, we evaluated DeepMeaning on four criteria: meaning recovery, attention prediction, ability to detect changes in semantic content, and model interpretability (i.e., can we decode in human-interpretable terms why DeepMeaning predicts some regions as higher meaning than others).

We first tested how well DeepMeaning could recover local scene meaning compared to human raters (Fig.2). Using a leave-one-scene-out cross-validation procedure
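As a concrete illustration of the second stage, the sketch below shows a leave-one-scene-out regression loop of the kind described above, with frozen patch features as predictors. The variable names and the choice of ordinary least squares are illustrative assumptions, not a reproduction of the study's training code.

```python
# Minimal sketch (assumed details): leave-one-scene-out regression from frozen
# CoCa patch features to human meaning ratings.
import numpy as np
from sklearn.linear_model import LinearRegression

def leave_one_scene_out(features, targets, scene_ids):
    """features: (n_patches, n_dims) CoCa features for every patch,
    targets: (n_patches,) mean meaning rating per patch,
    scene_ids: (n_patches,) index of the scene each patch came from."""
    predictions = np.zeros_like(targets, dtype=float)
    for scene in np.unique(scene_ids):
        test = scene_ids == scene       # patches of the held-out scene
        train = ~test                   # patches of all remaining scenes
        model = LinearRegression().fit(features[train], targets[train])
        predictions[test] = model.predict(features[test])
    return predictions

# Hypothetical usage: fit indoor and outdoor scenes separately, then correlate
# the held-out predictions with the human meaning targets.
# preds = leave_one_scene_out(indoor_feats, indoor_targets, indoor_scene_ids)
# r = np.corrcoef(preds, indoor_targets)[0, 1]
```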

Having established that DeepMeaning successfully estimates local scene meaning and that DeepMeaning maps strongly correlate with attention, we then tested whether DeepMeaning could detect the removal of local semantic information. To do this we used an adversarial image set in which local scene meaning is removed using a diffeomorphic transformation (Stojanoski & Cusack, 2014; Hayes & Henderson, 2022). The diffeomorphic transformation (Fig.4a, 4b) preserves the basic perceptual properties of the scene region while degrading its semantic content. Previously, we have shown that human meaning maps were capable of passing this tough adversarial test, while three state-of-the-art deep saliency models failed (Hayes & Henderson, 2022). Therefore, for DeepMeaning to count as an automated method for estimating local scene meaning, it must also be able to pass this strong semantic validity test. To perform the adversarial diffeomorph test, we compared DeepMeaning's left-out-scene prediction for the critical altered region in both the original scene and the diffeomorphed scene. As can be seen (Fig.4c, 4d), DeepMeaning showed a large decrease in estimated meaning for the diffeomorphed region relative to the original unaltered scene region (t(39) = 18.24, p < .001, 95% CI [0.60, 0.75], d = 2.66). This is an important result, as it establishes that, just like human meaning maps (Hayes & Henderson, 2022), DeepMeaning is sensitive to changes in local semantic content.
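The statistical comparison reported above is a paired test over the 40 scene pairs. A sketch of that analysis is given below; the array names are placeholders, and the effect size is a standard Cohen's d for paired samples, not necessarily the exact formula used in the study.

```python
# Minimal sketch (assumed variable names): paired comparison of DeepMeaning's
# predicted meaning for the critical region in original vs. diffeomorphed scenes.
from scipy import stats

def diffeomorph_test(original_region, diffeo_region):
    """original_region, diffeo_region: (40,) predicted meaning for the critical
    region of each scene in the original and diffeomorphed conditions."""
    diff = original_region - diffeo_region
    t, p = stats.ttest_rel(original_region, diffeo_region)
    d = diff.mean() / diff.std(ddof=1)            # Cohen's d for paired samples
    ci = stats.t.interval(0.95, len(diff) - 1,
                          loc=diff.mean(),
                          scale=stats.sem(diff))  # 95% CI on the mean difference
    return t, p, d, ci
```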

Finally, we evaluated whether DeepMeaning can go beyond even human meaning maps by providing greater transparency into what underlies its predictions. Given the Con-

useful, but they leave a representational gap that makes it difficult to understand the precise mapping between visual input and semantic knowledge, either because they are filtered through the human brain or because they are only based on a single representational space without a mapping to the other. Our work here shows that bridging vision and language representational mappings not only provides an automated way to accurately estimate scene meaning and attention, but perhaps more importantly, a means to interpret the representational embeddings that underlie those predictions. More broadly, the current study serves as another piece of evidence that multimodal transformers like CoCa can serve as 'foundational' vision-language models for downstream tasks (Yu et al., 2022).

In summary, we used a state-of-the-art transformer trained on billions of image-text pairs to reveal how joint representations learned from vision and language can predict what scene regions people find meaningful and consequently where they look. We demonstrated that this computational framework successfully recovers human meaning ratings near ceiling, transfers as a strong predictor of scene attention, detects local changes in semantic content, and provides a direct route to human interpretability via multimodal image-text decoding. The ability to offer automated scene meaning and attention prediction using a joint representational space that bridges vision and language has tremendous potential for advancing our understanding of how semantic representations produce rapid scene understanding, with implications for cognitive science, computer vision, linguistics, robotics, and artificial intelligence.

Architecture. DeepMeaning is composed of two components: a pretrained Contrastive Captioner (CoCa) transformer that is used as a feature extractor, and a linear regression model that is trained to use these features to predict scene meaning. Specifically, the pretrained weights learned by the Contrastive Captioner through training on the LAION-2B dataset were frozen, and then used to extract general features from each square scene image patch. The extracted image patch features and their corresponding meaning ratings (Fig.1c and Fig.1d) were then used to train a linear regression model to predict meaning ratings for indoor and outdoor scene patches separately using a leave-one-scene-out cross-validation procedure.
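A compact sketch of this two-component design appears below. It assumes the OpenCLIP CoCa interface introduced earlier and PyTorch tensors of preprocessed scene patches; it is illustrative rather than a reproduction of the DeepMeaning code.

```python
# Minimal sketch (assumed interface): frozen CoCa features feeding a linear head.
import torch
import numpy as np
from sklearn.linear_model import LinearRegression

def extract_patch_features(model, patch_batches):
    """patch_batches: iterable of (batch, 3, H, W) tensors of preprocessed
    square scene patches. CoCa weights stay frozen (eval mode, no gradients)."""
    model.eval()
    feats = []
    with torch.no_grad():
        for batch in patch_batches:
            feats.append(model.encode_image(batch).cpu().numpy())
    return np.concatenate(feats, axis=0)

# Hypothetical usage: one linear regression per scene category.
# indoor_feats = extract_patch_features(coca_model, indoor_patch_batches)
# indoor_head = LinearRegression().fit(indoor_feats, indoor_meaning_targets)
```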

Square grid, scene patches, and meaning rating preprocessing. Each scene and its corresponding meaning map were split into 96x96 pixel square patches with 35% overlap (Fig.1c). Each square scene image patch served as an input to the vision transformer (ViT) component of CoCa for feature extraction. The meaning value for each square scene region was computed as the average across its location in the corresponding human meaning map and served as the target value to be predicted (Fig.1d).

Specifically, a meaning map was created for each scene by cutting the entire scene into a dense array of overlapping circular patches (Fig.1b) at a fine spatial scale (300 patches, diameter=87 pixels) and a coarse spatial scale (108 patches, diameter=205 pixels). Human raters then provided ratings of 300 random fine or coarse scene patches based on how informative or recognizable they thought they were on a 6-point Likert scale (Henderson & Hayes, 2017; Mackworth & Morandi, 1967). Patches were presented in random order and without scene context, so ratings were based on context-independent judgments. Each unique patch was rated by three unique raters.
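For concreteness, the square-grid preprocessing described at the start of this subsection could be implemented roughly as follows. The stride implied by 96-pixel patches with 35% overlap (about 62 pixels) and the function name are our own assumptions.

```python
# Minimal sketch (assumed details): split a scene and its meaning map into
# overlapping 96x96 square patches and compute one meaning target per patch.
import numpy as np

PATCH = 96
STRIDE = int(round(PATCH * (1 - 0.35)))   # 35% overlap -> ~62 pixel stride

def square_grid_patches(scene, meaning_map):
    """scene: (H, W, 3) image array; meaning_map: (H, W) human meaning map.
    Returns the stacked image patches and their average meaning targets."""
    patches, targets = [], []
    h, w = meaning_map.shape
    for y in range(0, h - PATCH + 1, STRIDE):
        for x in range(0, w - PATCH + 1, STRIDE):
            patches.append(scene[y:y + PATCH, x:x + PATCH])
            targets.append(meaning_map[y:y + PATCH, x:x + PATCH].mean())
    return np.stack(patches), np.array(targets)
```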

A meaning map (Fig.1c) was generated for each scene by averaging the patch rating data at each spatial scale separately, averaging the spatial scale maps together, and then smoothing the grand average rating map with a Gaussian filter (i.e., Matlab 'imgaussfilt' with σ = 10, full width at half maximum=23 pixels).

External. One hundred indoor and outdoor scenes from the CAT2000 benchmark eye tracking dataset (Borji & Itti, 2015) served as an external replication of DeepMeaning's ability to estimate local meaning that transfers to predict scene attention. Each scene in the CAT2000 dataset was freely viewed by 24 observers for 5 seconds while their eye movements were recorded using an EyeLink 1000 eye tracker (SR Research, 2010).
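The map-generation step described above (average within scale, average across scales, then blur) amounts to only a few lines. The Python version below uses scipy's Gaussian filter as a stand-in for Matlab's imgaussfilt and assumes the two per-scale rating maps have already been resampled to full scene resolution.

```python
# Minimal sketch (Python stand-in for the Matlab step): combine fine and coarse
# rating maps and smooth the result with a Gaussian filter (sigma = 10 pixels).
from scipy.ndimage import gaussian_filter

def combine_and_smooth(fine_map, coarse_map, sigma=10):
    """fine_map, coarse_map: (H, W) per-scale average rating maps already
    interpolated to scene resolution."""
    grand_average = (fine_map + coarse_map) / 2.0
    return gaussian_filter(grand_average, sigma=sigma)
```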

The diffeomorph scene set from Hayes and Henderson (2022) was used to assess whether DeepMeaning could successfully detect the local removal of semantic content from a scene. The diffeomorph dataset contained 40 scenes in two conditions: diffeomorphed and original. In the diffeomorph condition, a diffeomorphic transformation (Stojanoski & Cusack, 2014) was applied to one local region in each scene to remove the semantic content from that region while preserving its image features (Hayes & Henderson, 2022). In the original condition, the scenes were presented unaltered. Human meaning ratings were then collected for both the original scenes (N=164) and the diffeomorphed scenes (N=164) using the same Meaning Mapping Procedure described above.

Captions for the original and diffeomorphed patches were decoded from CoCa using top-quantile (top 5%) token sampling with a temperature of 1 and a repetition penalty of 2 (Ilharco et al., 2021). Each caption was evaluated based on whether it accurately described the content of the scene patch (yes or no) and on how many objects it correctly recognized in each patch.
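Caption decoding of this kind is available through OpenCLIP's CoCa generate interface. The call below is a sketch of the sampling settings named above; the exact argument names should be checked against the OpenCLIP version in use.

```python
# Minimal sketch (argument names assumed): decode a caption for one scene patch
# with OpenCLIP's CoCa generate, using top-quantile sampling as described above.
import torch
import open_clip

def caption_patch(model, patch_tensor):
    """patch_tensor: (1, 3, H, W) preprocessed scene patch."""
    with torch.no_grad():
        tokens = model.generate(
            patch_tensor,
            generation_type="top_p",   # nucleus / top-quantile sampling
            top_p=0.05,                # keep the top 5% of the token distribution
            temperature=1.0,
            repetition_penalty=2.0,
        )
    caption = open_clip.decode(tokens[0])
    return caption.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
```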