With rapid socioeconomic development, the environment, form, and infrastructure of urban and rural areas have undergone dramatic changes. As gathering places for human activity, urban and rural areas play a vital role in the interaction between people and society. Traditional machine learning methods, when used to perceive changes in the intentional connotation of urban and rural areas, tend to overlook detailed information about the intentional target, and their perception accuracy leaves room for improvement. This paper therefore proposes a deep neural network framework that perceives the spatiotemporal changes of urban and rural intentional connotations from a remote sensing perspective. First, the framework uses a multi-branch DenseNet to capture multi-scale spatiotemporal information about the intentional target and to embed both high-level semantic and low-level physical appearance information. Second, a multi-branch cross-channel attention module refines and integrates the multi-level spatiotemporal information, aggregating the information from the different branches. Finally, the framework is tested and verified on two data sets from different times, where it achieves good performance.
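To make the idea of aggregating multi-branch information with channel attention more concrete, the following is a minimal sketch in plain Python. It is not the module proposed in this paper: it assumes a simplified scheme in which each branch's channel is summarized by global average pooling and the branches are then weighted per channel by a softmax over those summaries. The learned projection layers that a real attention module would contain are omitted, and the names `global_average_pool`, `softmax`, `fuse_branches`, `fine`, and `coarse` are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Reduce a (channels x pixels) feature map to one scalar per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branches):
    """Fuse equally shaped branch feature maps with per-channel attention.

    branches: list of feature maps, each a list of channels, each channel a
    flat list of pixel values. Assumption: a trained module would pass the
    pooled descriptors through learned layers before the softmax; here the
    descriptors are used directly as attention scores.
    """
    descriptors = [global_average_pool(b) for b in branches]
    num_channels = len(branches[0])
    num_pixels = len(branches[0][0])
    fused, attention = [], []
    for c in range(num_channels):
        # One weight per branch for this channel; the weights sum to 1.
        weights = softmax([d[c] for d in descriptors])
        attention.append(weights)
        fused.append([
            sum(w * b[c][p] for w, b in zip(weights, branches))
            for p in range(num_pixels)
        ])
    return fused, attention


# Two hypothetical branches (e.g. fine and coarse scale), 2 channels x 2 pixels.
fine = [[1.0, 3.0], [0.0, 0.0]]
coarse = [[2.0, 2.0], [0.0, 0.0]]
fused, attention = fuse_branches([fine, coarse])
```

In this toy input both branches have the same channel averages, so each branch receives equal weight and the fused map is the element-wise mean of the two branches. With unequal averages, the branch with the stronger response in a channel would dominate that channel of the fused output.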