Target detection algorithms based on convolutional neural networks are constrained by the limited receptive field of the convolutional kernel, so they cannot perceive long-range semantic information in an image. Because the Transformer model is not restricted to local receptive fields, it has been introduced into target detection, and many target detection algorithms based on the Transformer and its variants have been proposed. However, when applied to target detection and recognition, the Transformer has two difficulties: it cannot extract deep feature information, and the standard self-attention mechanism has high computational complexity. To address these two core problems, we propose an encoder-decoder model consisting of a convolutional layer and a Transformer module. We first construct an efficient multi-head self-attention mechanism that captures both local and long-range contextual information in target image features. We then design a cross-window connection enhanced by an efficient convolutional module, which significantly improves the representation and global modeling capabilities of the Transformer model. In addition, we propose a convolution-enhanced Transformer learning framework that integrates a sparse sampling strategy; it improves adaptability to different datasets and significantly reduces the memory and computational requirements of large-scale image processing. Finally, we propose a target detection algorithm based on the new Transformer framework. We conduct ablation experiments and computational performance comparisons on several HRRS scene and natural scene datasets. The experimental results confirm that the proposed method achieves the best results in terms of weighted F-measure, average F-measure, and MAE. Moreover, its detection results show visibly clearer edges and more accurate target localization.
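
For readers unfamiliar with the reported metrics, two of them can be sketched as follows. This is a minimal illustration only: it assumes predictions and ground-truth masks are 2D lists with values in [0, 1], and the function names, the fixed binarization threshold of 0.5, and the weight β² = 0.3 are assumptions made here, not details stated in this abstract. The F-measure is shown at a single threshold; the weighted and average variants used in the experiments are not reproduced.

```python
def mae(pred, gt):
    """Mean absolute error between a predicted map and a ground-truth mask."""
    total = 0.0
    count = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            total += abs(p - g)
            count += 1
    return total / count


def f_measure(pred, gt, threshold=0.5, beta_sq=0.3):
    """F-measure at one threshold (threshold and beta_sq are assumed values)."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            predicted = p >= threshold  # binarize the prediction
            actual = g >= 0.5           # binarize the ground truth
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

A lower MAE and a higher F-measure both indicate closer agreement between the predicted map and the ground truth.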