Neural image captioning (NIC) is regarded as a fundamental problem in artificial intelligence (AI) that bridges computer vision (CV) and natural language processing (NLP). However, recent attribute-based and textual-semantic-attention models for NIC still struggle with attention mechanisms that concentrate on irrelevant relationships between the extracted visual features and the textual representations of the corresponding image captions. Moreover, recent NIC models also suffer from uncertainty and noise in the latent visual features extracted from images, which sometimes prevents the captioning model from attending sufficiently to the correct visual concepts. To address these challenges, in this paper we propose an end-to-end deep fuzzy-neural network integrated with a unified attention-based, semantic-enhanced vision-language approach, called FuzzSemNIC. To alleviate noise and ambiguity in the extracted visual features, we apply a fused deep fuzzy neural network architecture to learn and generate the visual representations of images. The learned fuzzy visual embedding vectors are then combined with selected attributes/concepts of the images via a recurrent neural network (RNN) to incorporate the fused latent visual features into the captioning task. Finally, the fused visual representations are integrated into a unified vision-language encoder-decoder that handles caption generation. Extensive experiments on benchmark NIC datasets demonstrate the effectiveness of the proposed FuzzSemNIC model.
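To make the three-stage pipeline described above concrete, the following is a minimal PyTorch sketch of how such an architecture could be wired together: a fuzzy visual encoder that softens noisy CNN features through membership functions, an RNN that fuses the fuzzy visual embedding with attribute embeddings, and a small encoder-decoder that generates the caption. All class names (FuzzyVisualEncoder, AttributeFusionRNN, CaptionDecoder), dimensions, and design choices here (Gaussian membership functions, a GRU fuser, a Transformer decoder) are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a FuzzSemNIC-style pipeline; all details are assumed.
import torch
import torch.nn as nn

class FuzzyVisualEncoder(nn.Module):
    """Passes CNN visual features through Gaussian fuzzy membership functions
    to attenuate noise/uncertainty, then defuzzifies into a dense embedding."""
    def __init__(self, feat_dim: int, n_rules: int, embed_dim: int):
        super().__init__()
        # Learnable centers and widths of the Gaussian membership functions.
        self.centers = nn.Parameter(torch.randn(n_rules, feat_dim))
        self.log_sigma = nn.Parameter(torch.zeros(n_rules, feat_dim))
        self.defuzzify = nn.Linear(n_rules, embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, feat_dim) -> rule firing strengths: (B, n_rules)
        diff = feats.unsqueeze(1) - self.centers               # (B, R, D)
        sigma = self.log_sigma.exp()
        firing = torch.exp(-((diff / sigma) ** 2).mean(-1))    # (B, R)
        firing = firing / (firing.sum(-1, keepdim=True) + 1e-8)
        return self.defuzzify(firing)                          # fuzzy visual embedding

class AttributeFusionRNN(nn.Module):
    """Fuses the fuzzy visual embedding with embeddings of detected
    attributes/concepts via a GRU, as outlined in the abstract."""
    def __init__(self, n_attrs: int, embed_dim: int):
        super().__init__()
        self.attr_embed = nn.Embedding(n_attrs, embed_dim)
        self.gru = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, fuzzy_vis: torch.Tensor, attr_ids: torch.Tensor) -> torch.Tensor:
        # Condition the GRU's initial hidden state on the visual embedding.
        h0 = fuzzy_vis.unsqueeze(0)                            # (1, B, E)
        _, h = self.gru(self.attr_embed(attr_ids), h0)
        return h.squeeze(0)                                    # fused representation

class CaptionDecoder(nn.Module):
    """A small Transformer decoder standing in for the unified
    vision-language encoder-decoder that generates the caption."""
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, embed_dim)
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, tokens: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        memory = fused.unsqueeze(1)                            # (B, 1, E) visual memory
        tgt = self.tok_embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))

# Toy end-to-end pass: 512-d CNN features, 32 fuzzy rules, 256-d embeddings.
vis = FuzzyVisualEncoder(512, n_rules=32, embed_dim=256)
fuse = AttributeFusionRNN(n_attrs=1000, embed_dim=256)
dec = CaptionDecoder(vocab_size=5000, embed_dim=256)
feats = torch.randn(2, 512)                                    # extracted visual features
attrs = torch.randint(0, 1000, (2, 5))                         # detected attribute ids
caption_prefix = torch.randint(0, 5000, (2, 7))                # partial caption tokens
logits = dec(caption_prefix, fuse(vis(feats), attrs))          # (2, 7, 5000)
```

In this sketch the fuzzy membership layer acts as a learned soft-quantization of the visual features, which is one plausible way to realize the noise-alleviation role the abstract assigns to the fuzzy component.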