In recent years, attention-based Transformers have proven instrumental in natural language processing (NLP) tasks such as sentence classification and language translation. More recently, their application has been extended to large-scale object recognition tasks. In this work, the Vision Transformer (ViT) is investigated for the detection of human falls and Activities of Daily Living (ADLs) from time-series signals. The model is trained and validated on the acceleration signals of waist-worn Inertial Measurement Unit (IMU) sensors from the SFU-IMU falls dataset [1], and additionally on the popular SisFall dataset [2]. Three configurations of patch size and number of attention heads are trained independently. A larger patch size resulted in significant performance deterioration, while a smaller patch size took longer to train and was computationally more expensive. In the best case, the model achieved an accuracy of 99.9 ± 0.1% and a true positive rate of 99.9 ± 0.1% on the SFU-IMU dataset, and an accuracy of 99.8 ± 0.25% and a true positive rate of 99.87 ± 0.3% on the SisFall dataset. Overall, the results show that Transformers are highly robust for detecting human falls versus non-falls/ADLs, provided the patch size is chosen appropriately.
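
The abstract describes a Vision Transformer applied to fixed-length windows of tri-axial acceleration, with patch size and the number of attention heads as the main hyperparameters under study. The sketch below is only an illustrative reading of that setup in PyTorch, not the authors' implementation: the window length, patch size, embedding width, depth, and head count are placeholder values rather than the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn


class TimeSeriesViT(nn.Module):
    """Minimal ViT-style encoder for windows of tri-axial acceleration.

    All hyperparameter defaults here are illustrative assumptions,
    not values taken from the paper.
    """

    def __init__(self, window_len=512, channels=3, patch_size=16,
                 dim=64, depth=4, heads=4, num_classes=2):
        super().__init__()
        assert window_len % patch_size == 0, "window must split into whole patches"
        self.patch_size = patch_size
        num_patches = window_len // patch_size
        # Each patch flattens patch_size samples x channels into one token.
        self.patch_embed = nn.Linear(patch_size * channels, dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)  # fall vs. ADL logits

    def forward(self, x):                          # x: (batch, window_len, channels)
        b, t, c = x.shape
        x = x.reshape(b, t // self.patch_size, self.patch_size * c)
        x = self.patch_embed(x)                    # (batch, num_patches, dim)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                  # classify from the [CLS] token


# Example: a batch of 8 windows of 512 samples x 3 axes -> (8, 2) logits.
model = TimeSeriesViT()
logits = model(torch.randn(8, 512, 3))
```

With this framing, the patch-size trade-off noted in the abstract corresponds to the number of tokens per window: larger patches yield fewer, coarser tokens, while smaller patches yield longer sequences and therefore higher attention cost.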