Reconstructing gene regulatory networks (GRNs) from expression data is vital for understanding gene transcription. Although increasingly advanced algorithms, particularly deep-learning models, have been proposed to mine potential gene-regulatory interactions, insufficient effort has been invested in improving feature reliability in the presence of biological variability across expression samples. In this work, we propose a robust feature fusion method inspired by emerging attention-based techniques in computer vision. Our method captures the functional asymmetry between transcription factors (TFs) and target genes through an important adaptation: differentiated attention heads for the two roles. On three gene-expression datasets (in silico, E. coli, and S. cerevisiae), we demonstrate that our method outperforms other state-of-the-art competitors, including methods based on graph embeddings.
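To make the idea of differentiated attention heads concrete, the sketch below (not the paper's implementation; all dimensions, weight names, and the query/key assignment are illustrative assumptions) projects TF and target-gene expression features through two distinct weight matrices before computing attention scores, so the resulting TF-to-target scores are inherently asymmetric:

```python
# Conceptual sketch: role-specific ("differentiated") attention heads for
# TFs vs. target genes. Because the two roles use distinct projections,
# the attention score from a TF to a gene differs from the reverse,
# mirroring the directionality of transcriptional regulation.
# All sizes and names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_expr, d_attn = 16, 8           # expression-feature and attention dims
n_tfs, n_genes = 4, 6            # toy numbers of TFs and candidate targets

# Distinct projection weights for the two functional roles:
W_tf = rng.normal(size=(d_expr, d_attn))     # head projecting TF features
W_gene = rng.normal(size=(d_expr, d_attn))   # head projecting target features

tf_expr = rng.normal(size=(n_tfs, d_expr))     # TF expression features
gene_expr = rng.normal(size=(n_genes, d_expr)) # target expression features

Q = tf_expr @ W_tf               # TFs act as queries
K = gene_expr @ W_gene           # targets act as keys
scores = softmax(Q @ K.T / np.sqrt(d_attn), axis=1)  # shape (n_tfs, n_genes)

# Each row is one TF's attention distribution over candidate targets.
print(scores.shape)  # (4, 6)
```

Since `W_tf != W_gene`, swapping the roles of a TF/gene pair generally changes the score, which is the asymmetry a single shared head cannot express.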