Visual Question Answering (VQA) is the task of answering natural language questions about the content of visual images. Most existing VQA models ignore visual appearance and attribute features, and therefore fail to answer complex questions correctly. To address this problem, we propose a new end-to-end VQA model called the Multi-modal Attribute Feature Attention Network (MAFA-Net). Firstly, a self-guided word attention module is designed to connect entity words with semantic words. Secondly, two question-adaptive visual attention modules are presented, which not only extract important region features but also focus on key attribute features (e.g., color and spatial relationships). Additionally, a combination strategy is proposed to better exploit the spatial relationships between objects and their appearance attributes. Finally, experimental results on two large-scale VQA datasets show that MAFA-Net achieves performance competitive with state-of-the-art models.
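To give a rough sense of the question-adaptive visual attention mentioned above, the Python sketch below weights image-region features by their relevance to a question vector using a generic soft-attention step. This is a minimal illustration under assumed feature shapes, not MAFA-Net's actual formulation; the function name and the parameters `W_v`, `W_q`, and `w` are hypothetical.

```python
import numpy as np

def question_guided_attention(region_feats, question_vec, W_v, W_q, w):
    """Score each image region against the question and return a
    question-weighted visual feature (generic soft attention)."""
    # Project region features and the question into a joint space.
    joint = np.tanh(region_feats @ W_v + question_vec @ W_q)  # (K, d)
    scores = joint @ w                                        # (K,)
    # Softmax over regions gives the attention weights.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Weighted sum of region features = attended visual feature.
    return alpha @ region_feats                               # (dv,)

# Toy usage: 36 regions of dim 2048, a 512-dim question vector.
rng = np.random.default_rng(0)
K, dv, dq, d = 36, 2048, 512, 256
attended = question_guided_attention(
    rng.standard_normal((K, dv)), rng.standard_normal(dq),
    rng.standard_normal((dv, d)), rng.standard_normal((dq, d)),
    rng.standard_normal(d))
print(attended.shape)  # (2048,)
```

In a full model, attribute features (e.g., color or spatial descriptors) could be attended in the same way and fused with the region features before answer prediction; the abstract does not specify the exact fusion, so this remains a sketch.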