Fine-grained visual classification (FGVC) involves distinguishing multiple subcategories within a single major category, a task characterized by significant intra-class variability and minimal inter-class differences. Previous methods often rely on pre-trained visual models augmented with specialized modules, typically large-scale models that are difficult to deploy in industrial settings. Moreover, image data often comes with auxiliary information (e.g., spatiotemporal priors, attributes, and text descriptions), offering opportunities to enhance FGVC accuracy. Here we propose a novel lightweight Transformer-based approach that incorporates such auxiliary information to improve classification accuracy. Our method introduces a simplified pixel-focused aggregation attention (AA) to fuse local and global features, and refines it into a separable aggregation attention (SAA) to reduce model complexity. We also present the Extra Inside Padding (EIP) method for integrating auxiliary information with minimal additional parameters. Without pre-training, our model surpasses other lightweight neural networks on fine-grained datasets such as CUB-200-2011, demonstrating a significant improvement in accuracy. Our approach offers a promising direction for FGVC tasks, highlighting the effectiveness of integrating multimodal data for enhanced performance. Our source code is available at https://github.com/yang-zzy/SAA-EIP.
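To make the idea of injecting auxiliary information into a Transformer concrete, the following is a minimal PyTorch sketch of one plausible token-padding scheme. The module name AuxTokenPadding, the number of auxiliary tokens, and the single-linear-projection design are illustrative assumptions, not the paper's actual EIP implementation; consult the linked repository for the real method.

```python
# Hypothetical sketch: embedding auxiliary information (e.g., spatiotemporal
# priors or attributes) as a few extra tokens padded into the patch-token
# sequence before the Transformer blocks, so attention can fuse both modalities.
import torch
import torch.nn as nn

class AuxTokenPadding(nn.Module):
    def __init__(self, aux_dim: int, embed_dim: int, num_aux_tokens: int = 4):
        super().__init__()
        # A single linear projection keeps the added parameter count small,
        # in the spirit of the paper's "minimal additional parameters" claim.
        self.proj = nn.Linear(aux_dim, num_aux_tokens * embed_dim)
        self.num_aux_tokens = num_aux_tokens
        self.embed_dim = embed_dim

    def forward(self, patch_tokens: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, C) image patch embeddings; aux: (B, aux_dim)
        b = patch_tokens.size(0)
        aux_tokens = self.proj(aux).view(b, self.num_aux_tokens, self.embed_dim)
        # Concatenate auxiliary tokens with patch tokens along the sequence axis.
        return torch.cat([patch_tokens, aux_tokens], dim=1)

# Usage (shapes are illustrative):
# pad = AuxTokenPadding(aux_dim=16, embed_dim=192)
# tokens = pad(patches, aux_vec)  # (B, N + 4, 192)
```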