This paper proposes a novel human-machine collaborative video summarization framework based on pupillary responses. Since humans are the end users and evaluators of video content, it is natural to link video features with viewers' real-time attentive responses when designing a video summarization framework. In this paper, pupil size, a pupillary response, is introduced as a real-time indicator of viewers' engagement and attention. First, we augmented the TVSum dataset by replacing its manual annotations with attention scores converted from pupil-size signals, producing a perception-driven dataset. Second, we developed a video summarization framework that uses cues from viewers' pupil-size signals to predict frame-level attention scores and then extracts key shots from videos. On the perception-driven dataset, the proposed method achieves an average F-measure of 69.71%, with precision of 69.67% and recall of 69.77%, a significant improvement over random summarization. The experimental results preliminarily demonstrate that our approach can learn viewers' dynamic attention mechanism and apply it to video summarization, validating the effectiveness of the proposed framework.