Given the long duration of untrimmed videos and the difficulties of end-to-end training, contemporary temporal action detection (TAD) methods rely heavily on pre-computed video feature sequences for subsequent analysis. However, clip features extracted directly from video encoders trained for trimmed action classification commonly lack temporal sensitivity. To overcome this limitation, we propose a framework named temporally-aggregative pretraining (TAP). TAP is built on the design principle of extracting temporally sensitive features for TAD, making them more discriminative than those produced for trimmed action classification. TAP consists of two core modules, a feature encoding module and a temporal aggregation module, which together exploit both local and global features during pretraining. The feature encoding module employs a multiscale vision transformer as the video encoder, combining the idea of a multiscale feature hierarchy with the transformer architecture to extract effective features from video clips. The temporal aggregation module introduces a temporal pyramid pooling layer that captures temporal-contextual semantic information from video feature sequences, yielding more discriminative global video representations. Extensive experiments on two commonly used datasets validate the significantly improved discriminative power of our pretrained features.
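To make the temporal aggregation module more concrete, the following is a minimal sketch of a temporal pyramid pooling layer that pools a clip-feature sequence at several temporal scales and concatenates the results into a global video representation. The pyramid levels (1, 2, 4), the feature dimension, and the use of average pooling are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of temporal pyramid pooling over a clip-feature sequence.
# Pyramid levels and feature dimension are assumptions for demonstration only.
import torch
import torch.nn as nn


class TemporalPyramidPooling(nn.Module):
    """Pools a sequence of clip features at multiple temporal scales and
    concatenates the pooled outputs into one global video representation."""

    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        # Level k splits the temporal axis into k bins and averages within each.
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool1d(k) for k in levels)

    def forward(self, x):
        # x: (batch, time, channels) -- a sequence of clip-level features.
        x = x.transpose(1, 2)                       # -> (batch, channels, time)
        pooled = [p(x).flatten(start_dim=1) for p in self.pools]
        return torch.cat(pooled, dim=1)             # global video representation


if __name__ == "__main__":
    feats = torch.randn(2, 64, 768)                 # 64 clips, 768-d features each
    tpp = TemporalPyramidPooling()
    print(tpp(feats).shape)                         # torch.Size([2, 5376]) = 768 * (1+2+4)
```

In this sketch, the coarsest level (a single bin) summarizes the whole sequence while finer levels preserve coarse temporal ordering, which is one way a pyramid pooling layer can inject temporal context into a global representation.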