Deep Neural Networks Evolve Human-like Attention Distribution during Goal-directed Reading Comprehension

Attention is a key mechanism for information selection in both biological brains and many state-of-the-art deep neural networks (DNNs). Here, we investigate whether humans and DNNs allocate attention in comparable ways when seeking information in a text passage to answer a question. We analyze 3 transformer-based DNNs that reach human-level performance when trained to perform the reading comprehension task. We find that the DNN attention distribution quantitatively resembles the human attention distribution measured by eye tracking: Human readers fixate longer on words that are more relevant to the question-answering task, demonstrating that attention is modulated by the top-down reading goal, on top of lower-level visual layout and textual features. Further analyses reveal that the attention weights in DNNs are also influenced by both the top-down reading goal and lower-level textual features, with the shallow layers more strongly influenced by lower-level textual features and the deep layers attending more to task-relevant words. Additionally, deep layers' attention to task-relevant words gradually emerges when pre-trained DNN models are fine-tuned to perform the reading comprehension task, which coincides with the improvement in task performance. These results demonstrate that DNNs can naturally evolve human-like attention distribution through task optimization. The results suggest that human attention during goal-directed reading comprehension is a consequence of task optimization and that the attention weights in DNNs are of biological significance.

Previous studies have shown that DNN layers encode increasingly abstract information, e.g., object information in convolutional networks 42 and syntactic information in BERT 43,44. A recent study has also compared human eye movements and attention weights in the last layer of BERT, when participants evaluate whether a passage is an appropriate answer to a question.
That study demonstrated that human fixation time is more similar to the attention weights in BERT than to simple TF-IDF weights 27.

The input to the models consists of all words in the passage, an integrated option, and 3 special tokens, i.e., CLS, SEP1, and SEP2 (denoted as C, S1, and S2). The CLS token integrates information across words and is used to calculate a score that reflects how likely the option is the correct answer. The DNN model has 12 layers, with 12 attention heads in each layer. (D) Illustration of the DNN attention mechanism in a layer. In the models, each word/token is represented by a vector, and information is integrated across words/tokens through attention. Humans allocate more attention to words that are more relevant to question answering. *P < 0.05; **P < 0.01; ***P < 0.001.

We first analyzed whether textual features, e.g., word length, word frequency, and a word's position in a sentence, could predict the human attention distribution using linear regression. The prediction accuracy, i.e., the correlation coefficient between the predicted and actual attention density, was significantly above chance (P = 0.002, permutation test, FDR corrected). Furthermore, the prediction accuracy was significantly higher for global questions than for local questions (P = 1.4 × 10⁻⁴, bootstrap, FDR corrected) (Fig. 3A, left plot). We then used the same regression analysis to test whether the visual layout of a passage could also influence the attention distribution. Here, layout features referred to features induced by line changes (see Materials and Methods for details), which could be processed without word recognition.

The prediction accuracy for layout features was also statistically significant (P = 0.002, permutation test, FDR corrected).
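As a rough illustration of the textual features named above (word length, word frequency, and a word's position in a sentence), the sketch below computes them for each word of a toy passage. This is only a hedged approximation: the study presumably used corpus-based word frequencies, whereas here within-passage counts stand in for them, and the function name and feature encodings are assumptions, not the authors' code.

```python
import re
from collections import Counter

def textual_features(passage):
    """Compute simple per-word textual features: word length,
    (within-passage) frequency, and relative position in sentence.
    Within-passage counts approximate corpus word frequency here."""
    sentences = [s for s in re.split(r"[.!?]+", passage) if s.strip()]
    words = [w for s in sentences for w in s.split()]
    counts = Counter(w.lower() for w in words)
    features = []
    for s in sentences:
        toks = s.split()
        for pos, w in enumerate(toks):
            features.append({
                "word": w,
                "length": len(w),
                "frequency": counts[w.lower()],
                "position_in_sentence": pos / max(len(toks) - 1, 1),
            })
    return features

feats = textual_features("The cat sat. The cat ran fast.")
```

Each feature vector could then serve as a regressor for the attention density of the corresponding word, as in the linear regression analysis described above.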

Textual features and layout features characterized properties of the stimulus that were invariant across tasks. In the following, we investigated whether the task, i.e., answering a specific question, also modulated human attention distribution. To characterize the top-down influence of the task, we acquired annotations indicating each word's contribution to question answering, i.e., task relevance (see Materials and Methods). As shown in the left plot of Fig. 3A, we found that task relevance could indeed significantly predict the human attention distribution (P = 0.002, permutation test, FDR corrected). Since task relevance was not a well-established modulator of reading attention, we further analyzed whether the task relevance effect could be explained by the well-established textual and layout effects. In this analysis, we first regressed out the influence of textual and layout features from the human attention distribution, and found that the residual attention distribution could still be predicted by task relevance (P = 0.003, permutation test, FDR corrected) (Fig. 3A, middle plot). These results showed that the top-down reading goal, quantified by task relevance, could modulate human attention on top of lower-level stimulus features, i.e., textual and layout features.

The linear regression analyses revealed that textual features, layout features, and task relevance all modulated human attention (see Fig. 2 for examples). The prediction accuracy for different features ranged between 0.2 and 0.6, comparable to the prediction accuracy of visual saliency models when predicting human attention to images 31,32.
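The regress-out analysis described above can be sketched in a few lines of numpy: fit lower-level features to attention by least squares, take the residual, and check whether task relevance still predicts it. All data here are synthetic and the effect sizes are invented for illustration; only the two-step logic mirrors the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-word data: attention density, lower-level features,
# and task-relevance annotations (all assumed z-scored within a passage).
n_words = 200
textual = rng.standard_normal((n_words, 3))   # e.g., length, frequency, position
layout = rng.standard_normal((n_words, 2))    # e.g., line-change features
relevance = rng.standard_normal(n_words)
attention = 0.5 * relevance + textual @ np.array([0.3, -0.2, 0.1]) \
    + 0.1 * rng.standard_normal(n_words)

# Step 1: regress out textual and layout features via least squares.
X = np.column_stack([textual, layout, np.ones(n_words)])
beta, *_ = np.linalg.lstsq(X, attention, rcond=None)
residual = attention - X @ beta

# Step 2: test whether the residual is still predicted by task relevance.
r = np.corrcoef(residual, relevance)[0, 1]
```

A positive residual correlation, evaluated against a permutation null as in the paper, indicates a task-relevance effect beyond textual and layout features.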

Further analyses also revealed how these features modulated human attention. For example, we found that participants generally attended more to the beginning of a passage (Fig. 3C). Furthermore, this effect was stronger for global questions, which potentially explained why stimulus features could better predict the attention distribution for global questions. Additionally, participants attended more to words that were more relevant to the question-answering task (Fig. 3D).

We then investigated whether DNN attention was comparable to human attention. The general architecture of the models is illustrated in Fig. 1C. The input to the models included all the words in the passage, an integrated option, and 3 special tokens. One of the special tokens, i.e., CLS, was the decision variable: based on its final representation, the DNN models decided whether an option was the correct answer or not. In the following, we analyzed the attention weight between the CLS token and each word.

To further confirm that human attention received top-down modulation from the task, we conducted Study 2 as a control study. In Study 2, participants first read a passage without prior knowledge about the specific question to answer. After the first-pass passage reading, the participants read the question and were then allowed to read the passage again before answering the question. We analyzed the attention density during the first-pass reading of the passage, which was referred to as general-purpose reading.

We also analyzed DNN models that did not receive fine-tuning (Fig. 4B). It was found that the attention weights of the pre-trained DNN were sensitive to textual features in shallow layers but not sensitive to task relevance in deeper layers, suggesting that top-down attention in DNNs emerged during fine-tuning on the reading comprehension task.
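The quantity analyzed above, the attention weight between the CLS token and each word, can be illustrated with a minimal single-head scaled dot-product attention in numpy. This is a simplified sketch of the mechanism (one head, random weights), not the models' actual multi-head implementation; all array sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cls_attention(X, W_q, W_k, cls_index=0):
    """Scaled dot-product attention weights from the CLS token to every
    input token, for a single head. X: (seq_len, d_model) token vectors."""
    Q = X @ W_q
    K = X @ W_k
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (seq_len, seq_len)
    alpha = softmax(scores, axis=-1)       # each row sums to 1
    return alpha[cls_index]                # attention of CLS to each token

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 16))          # 10 tokens, 16-dim embeddings
W_q = rng.standard_normal((16, 8))
W_k = rng.standard_normal((16, 8))
w = cls_attention(X, W_q, W_k)
```

The returned vector is nonnegative and sums to 1, so it can be compared directly with a normalized human attention density over the same words.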

We then asked how the attention weights of the DNN changed during fine-tuning and whether such changes were related to question-answering performance. During fine-tuning, the structure of the DNN model remained unchanged but the parameters were adjusted.

In the following, we analyzed the properties of models that received different numbers of fine-tuning steps. Furthermore, since the fine-tuning process was stochastic, we repeated fine-tuning 10 times (see Materials and Methods). We found that, in deep layers, the properties of attention weights significantly changed during fine-tuning (Fig. S3 and Fig. S4). In the last layer, for example, the DNN attention weights clearly became more sensitive to task relevance during fine-tuning, coinciding with the improvement in task performance (Fig. 5A,D), especially for local questions (Fig. 5A). The trend is less clear for global questions; a potential explanation is that global questions concern the main topic of the passage and can be answered by paying attention to different sets of words. Deep layers' sensitivity to textual features, however, dropped during fine-tuning (Fig. 5B,E). Therefore, fine-tuning directed deep layers' attention towards task-relevant information, sacrificing sensitivity to textual features. Additionally, we found that the similarity between DNN attention weights and human attention was also boosted by fine-tuning for local questions (Fig. 5C). This result further demonstrated that fine-tuning drove DNN attention towards a human-like distribution.

It has been proposed that attention can be interpreted as a mechanism to implement optimal decision making. For example, when faced with multiple conflicting cues, the brain can use attention to modulate, i.e., weight, the neural representation of each cue. It has been proposed that the brain attends to more informative and reliable cues to make an optimal decision 45-47. The current results are generally consistent with this idea since both humans and DNNs attend to words that are relevant to task solving.

How human readers allocate attention during reading is an extensively studied topic.
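The sensitivity trajectory described above, attention becoming more aligned with task relevance across fine-tuning checkpoints, can be sketched as a per-checkpoint correlation. The checkpoint attention vectors here are synthetic (a progressively stronger mixture of relevance and noise stands in for real checkpoints), so only the analysis pattern is meaningful.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: CLS attention over 50 words at several fine-tuning
# checkpoints, plus human task-relevance annotations for the same words.
relevance = rng.random(50)
checkpoints = []
for mix in [0.0, 0.3, 0.6, 0.9]:          # stand-in for fine-tuning steps
    noise = rng.random(50)
    attn = mix * relevance + (1 - mix) * noise
    attn /= attn.sum()                     # attention weights sum to 1
    checkpoints.append(attn)

# Sensitivity to task relevance = correlation at each checkpoint.
sensitivity = [np.corrcoef(a, relevance)[0, 1] for a in checkpoints]
```

Plotting such a sensitivity curve against question-answering accuracy across checkpoints is the kind of comparison shown in Fig. 5.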

Eye tracking studies have shown that readers fixate longer on, e.g., longer words, lower-frequency words, words that are less predictable based on the context, and words at the beginning of a line 48,49. A number of models, e.g., the E-Z reader 37,50 and SWIFT 51, have been proposed to predict eye movements during reading, based either on basic oculomotor properties or on lexical processing 37. These models can generate fine-grained predictions, e.g., which letter in a word will be fixated first. A limitation of these models, however, is that they are generally developed to explain eye movements during general-purpose reading.

Goal-directed reading is also referred to as the reading-to-do task 38. Previous studies have shown that the reader's task may have heterogeneous influences on attention, depending on the task difficulty and the skill level of readers 53,54. Here, the task is demanding and the readers are highly skilled: The reading comprehension questions are selected from exams and the time to answer each question is limited, leading to about 80% question-answering accuracy (Fig. 1). The participants are skilled since all Chinese students have extensive practice with such reading comprehension questions in high school. Future work is needed to quantify how the task and reading skills modulate human attention and whether these effects can also be modeled by DNN models.

In transformer-based models, the roles self-attention plays are highly diverse. In sum, the current study demonstrates that, when DNNs and humans perform the same reading comprehension task, DNNs can evolve a human-like attention distribution through task optimization.

The experiment procedure in Study 1 is illustrated in Fig. 1A. In each trial, participants first read a question, pressed the space bar to read the corresponding passage, and then pressed it again to read the question coupled with 4 options and answer the question. The time limit for passage reading was 120 s.
To encourage the participants to read as quickly as possible, the bonus they received for a specific question would decrease linearly over time. They did not receive any bonus for the question, however, if they gave a wrong answer. Furthermore, before answering the comprehension question, the participants reported whether they were confident that they could correctly answer the question. After answering the question, they also rated their confidence about their answer on a scale of 1-4 (low to high). The confidence ratings were not analyzed.
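The incentive rule above can be written as a one-line schedule. The maximum bonus and the exact linear schedule are hypothetical values chosen for illustration; the source only specifies that the bonus decreased linearly over time and was zero for a wrong answer.

```python
def bonus(reading_time_s, correct, max_bonus=1.0, time_limit_s=120.0):
    """Sketch of the incentive rule: the bonus decreases linearly with
    reading time and is zero for a wrong answer. max_bonus and the
    schedule endpoints are hypothetical, not from the study."""
    if not correct:
        return 0.0
    return max(0.0, max_bonus * (1.0 - reading_time_s / time_limit_s))
```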

Study 2: Study 2 included 96 reading passages and questions, with 16 questions for each question type, randomly selected from the questions used in Study 1. The study was carried out over 2 days, and none of the participants had participated in Study 1.

The familiarization procedure was identical to that in Study 1.

The procedure of Study 2 was similar to that of Study 1; the main difference was that a 90-s first-pass passage reading stage was introduced at the beginning of each trial.

During the first-pass passage reading, participants had no prior information about the relevant question. The participants could press the space bar to terminate the first-pass reading stage and read the question. Then, participants read the passage for a second time with a time limit of 30 s, before proceeding to answer the question. In Study 2, a correct answer was also a prerequisite for the bonus, and the amount of bonus decreased linearly with the duration of second-pass passage reading.

Furthermore, before each trial, a 1-point validation was applied, and if the calibration error was higher than 0.5º, a recalibration was carried out. Head movements were minimized using a chin and forehead rest.

DNN models
We tested 3 popular transformer-based DNN models, i.e., BERT 17, ALBERT 18, and RoBERTa 19. ALBERT and RoBERTa were both adapted from BERT and had the same basic structure. RoBERTa differed from BERT in its pre-training procedure 19, while ALBERT applied factorized embedding parameterization and cross-layer parameter sharing to reduce memory consumption 18. Following previous works 18,19, each option was independently processed. For the i-th option (i = 1, 2, 3, or 4), the question and the option were concatenated to form an integrated option. As shown in Fig. 1C, for the i-th option, the input to the DNN was the sequence consisting of the CLS token, the passage words, the SEP1 token, the i-th integrated option, and the SEP2 token.

The answer to a question was determined as the option with the highest score, and all models were trained to maximize the logarithmic score of the correct option. All hyperparameters for fine-tuning were adopted from previous studies (Table S1).
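The training objective above, maximize the logarithmic score of the correct option, amounts to a softmax over the four per-option CLS scores followed by a negative log-likelihood loss. The sketch below shows that step in isolation with made-up scalar scores; how each score is produced from the CLS representation is not reproduced here.

```python
import numpy as np

def option_log_scores(cls_scores):
    """Convert one scalar score per option (derived from the final CLS
    representation of each passage+option input) into log-probabilities
    via a numerically stable log-softmax. Training maximizes the log
    score of the correct option (minimizes its negative log-likelihood)."""
    s = np.asarray(cls_scores, dtype=float)
    log_probs = s - np.log(np.sum(np.exp(s - s.max()))) - s.max()
    return log_probs

scores = [2.0, 0.5, -1.0, 0.3]             # hypothetical CLS scores, 4 options
log_p = option_log_scores(scores)
predicted = int(np.argmax(log_p))           # answer = option with highest score
loss = -log_p[1]                            # loss if option 2 were correct
```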

To isolate how the fine-tuning process modulated DNN attention, we also tested the pre-trained DNN that was not fine-tuned on the RACE dataset, and compared it with the fine-tuned model (Fig. 4). Furthermore, we quantified how the properties of DNN attention evolved during the fine-tuning process.

Since the final representation of the CLS token was used to decide the answer, here we analyzed the attention weights that were used to calculate the vectorial representation of CLS (illustrated in Fig. 1D). For each layer, the output of an attention head was computed using the following equations. For the sake of clarity, we denote the input words and tokens generally as Xi.
Qi = WQ Xi,  Ki = WK Xi,  Vi = WV Xi

αij = exp(Qi · Kj / √d) / Σm exp(Qi · Km / √d)

Yi = Σj αij Vj

where d is the dimension of the query/key vectors and αij is the attention weight from token i to token j.

We analyzed eye fixations during passage reading in Study 1 and the first-pass passage reading in Study 2. For each word, the total fixation time was the sum of the duration across all fixations that fell into the square area the word occupied. We averaged the total fixation time across all participants who correctly answered the question, and measured human attention using the attention density, i.e., the total fixation time divided by the area the word occupied.

The attention distribution was modeled as a weighted sum of features:

attention = Σk βk Fk + b + ε

where F and ε denoted the features being considered and the residual error, respectively.
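The attention density measure defined above can be sketched directly: accumulate fixation durations inside each word's bounding box, then divide by the box area. Coordinates, durations, and box layout below are hypothetical; averaging across participants who answered correctly is noted but omitted for brevity.

```python
import numpy as np

def attention_density(fixations, word_boxes):
    """Per-word attention density: total fixation duration falling inside
    a word's bounding box, divided by the box area.
    fixations: list of (x, y, duration_ms) tuples for one participant;
    word_boxes: list of (x0, y0, x1, y1) per word.
    The study additionally averages this over all participants who
    answered the question correctly; that step is omitted here."""
    density = np.zeros(len(word_boxes))
    for x, y, dur in fixations:
        for i, (x0, y0, x1, y1) in enumerate(word_boxes):
            if x0 <= x < x1 and y0 <= y < y1:
                area = (x1 - x0) * (y1 - y0)
                density[i] += dur / area
                break
    return density

boxes = [(0, 0, 10, 5), (10, 0, 30, 5)]    # two words on one line
fix = [(5, 2, 200), (15, 2, 300)]          # (x, y, duration) per fixation
d = attention_density(fix, boxes)
```

Dividing by area corrects for longer words naturally collecting more fixation time simply because they occupy more space.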

The parameters β and b were fitted to minimize the mean square error. Each feature and the human attention distribution were normalized within a passage by taking the z-score.

The prediction accuracy, i.e., the correlation between predicted and actual human attention, was calculated based on five-fold cross-validation. Each question type was modeled separately.
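The cross-validated accuracy measure described above can be sketched in numpy: fit the linear model with an intercept on four folds, predict the held-out fold, and correlate pooled predictions with the actual values. The synthetic data and fold-splitting details here are assumptions; only the five-fold structure and the correlation-based accuracy follow the text.

```python
import numpy as np

def cv_prediction_accuracy(X, y, n_folds=5, seed=0):
    """Five-fold cross-validated prediction accuracy: the correlation
    between predicted and actual attention across held-out words.
    Features and target are assumed z-scored within a passage."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    y_pred = np.empty_like(y)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        Xb = np.column_stack([X[train], np.ones(len(train))])  # intercept b
        beta, *_ = np.linalg.lstsq(Xb, y[train], rcond=None)
        y_pred[test] = np.column_stack([X[test], np.ones(len(test))]) @ beta
    return np.corrcoef(y_pred, y)[0, 1]

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))                      # synthetic features
y = X @ np.array([0.5, -0.3, 0.2, 0.0]) + 0.3 * rng.standard_normal(100)
acc = cv_prediction_accuracy(X, y)
```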

Statistical tests
We employed a one-sided permutation test to test whether the attention distribution
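A one-sided permutation test of the kind named above can be sketched as follows: shuffle the feature across words to build a null distribution of correlations and count how often the null meets or exceeds the observed value. The statistic (Pearson correlation) and shuffling scheme are plausible assumptions; the study's exact permutation procedure may differ.

```python
import numpy as np

def permutation_pvalue(attention, relevance, n_perm=1000, seed=0):
    """One-sided permutation test for whether attention correlates with
    a feature (e.g., task relevance) above chance. The null distribution
    is built by shuffling the feature across words."""
    rng = np.random.default_rng(seed)
    observed = np.corrcoef(attention, relevance)[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = np.corrcoef(attention, rng.permutation(relevance))[0, 1]
    # One-sided P value; +1 counts the observed statistic in the null set,
    # so P can never be exactly zero.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)

rng = np.random.default_rng(4)
relevance = rng.standard_normal(80)                     # synthetic annotations
attention = relevance + 0.5 * rng.standard_normal(80)   # correlated attention
p = permutation_pvalue(attention, relevance)
```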