Dementia is a progressive neurological disorder that profoundly affects the daily lives of older adults, impairing abilities such as verbal communication and cognitive function. Early diagnosis is essential for improving both the lifespan and the quality of life of affected individuals. Despite this importance, diagnosing dementia is complex and often requires a multimodal approach that incorporates diverse types of clinical data. In this study, we fine-tune Wav2vec and Word2vec baseline models on two distinct data types: audio recordings and text transcripts. We experiment with four conditions: the original datasets versus datasets purged of short sentences, each with and without data augmentation. Our results indicate that synonym-based text data augmentation generally enhances model performance, underscoring the importance of data volume for generalizable performance. Additionally, models trained on text data frequently excel and can further improve the performance of other modalities when combined with them. Audio and timestamp data sometimes offer marginal improvements. We also present a qualitative error analysis of the sentence archetypes that tend to be misclassified under each condition, offering insight into the effects of data modality and augmentation choices.