Weakly supervised object detection (WSOD) has attracted wide attention in computer vision because it offers an inexpensive way to learn object detectors. However, existing WSOD methods are either limited in localization accuracy or require complex pipelines. In this paper, we propose a framework for the WSOD task that is based on training a single network. Layer-wise self-attention distillation is performed within the framework to improve the representation learning of the detection network. As a complementary component, a context addition unit is designed to prevent the detection network from focusing only on the most discriminative parts of an object. Moreover, an instance-aware mining algorithm is proposed to generate more precise pseudo-labels, thereby further enhancing the performance of the detection network. We evaluate the proposed method on two popular benchmarks, PASCAL VOC 2007 and VOC 2012, and the experimental results show that it can accurately and efficiently locate objects in images without complex data augmentation or well-annotated auxiliary datasets.