As cross-domain research combining computer vision and natural language processing, the current image captioning research mainly considers how to improve the visual features, less attention has been paid to utilizing the inherent properties of language to boost captioning performance. Facing this challenge, we proposed a textual attention mechanism, which can obtain semantic relevance between words by scanning all generated words. The Retrospect Network for image captioning(RNIC) proposed in this paper aims to improve input and prediction process by using textual attention. Concretely, the textual attention mechanism is applied to the model simultaneously with the visual attention mechanism to provide the input of the model with the maximum information required for generating captions. In this way, our model can learn to collaboratively attend on both visual and textual features. Moreover, the semantic relevance between words obtained by retrospect is used as the basis for prediction, so that the decoder can simulate the human language system and better make predictions based on the already generated contents. We evaluate the effectiveness of our model on the COCO image captioning datasets and achieve superior performance overthe previous methods.extraction function to extract the hidden unit information of multiple time steps for prediction, to solve the problem of insufficient LSTM prediction information. Experiments have shown that both model significantly improved the various evaluation indicators in the AI CHALLENGER test set.