Multimodal models that integrate visual and textual data have transformed artificial intelligence applications by providing more holistic and contextually aware responses. However, the susceptibility of these models to environmental semantic distractions poses a significant challenge, as it can compromise their performance and reliability in real-world scenarios. This article evaluates the impact of semantic distractions on ChatGPT-4o and Claude 3.5 Sonnet, revealing critical vulnerabilities in their interpretative and generative capabilities. The methodology involved creating controlled input variations and systematically introducing distractions to assess model performance. Findings indicate a substantial decline in accuracy, coherence, and consistency as the level of distraction increases, highlighting the need for targeted improvements in model robustness. Advanced attention mechanisms, adversarial training techniques, and synthetic data augmentation are proposed as potential strategies for enhancing model resilience. Despite the controlled nature of the study, the insights gained underscore the importance of developing more robust multimodal models capable of maintaining high performance amid environmental noise. Future research directions include evaluating models in more naturalistic settings and expanding the range of assessed distractions to better capture real-world complexity.