With the advent of deep learning, Text-to-Speech (TTS) research has made a great leap in producing natural speech. However, state-of-the-art TTS systems tend to generate averaged prosody, lacking the variety and expressiveness found in human speech. To avoid synthesizing monotonous speech with averaged prosody, it is desirable to explicitly model the variation in speech prosody. In this work, we explore two approaches to synthesizing highly expressive speech. The first is a feed-forward transformer model conditioned on the fundamental frequency. The second is a variational autoencoder augmented with normalizing flows and an adversarial training process. We train our models on three internal Bangla (also known as Bengali) datasets containing varying amounts of expressive speech, and we provide a comparative study of the effect of the proportion of expressive speech in the training data. Both subjective and objective evaluations confirm that the proposed models outperform the autoregressive Tacotron2 baseline.
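To make the first approach concrete, the sketch below shows one common way a feed-forward transformer encoder can be conditioned on fundamental frequency: a per-token F0 contour is projected into the model dimension and added to the encoder hidden states. This is a minimal illustration assuming an additive, FastPitch-style conditioning scheme; the module names, dimensions, and conditioning mechanism are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class PitchConditionedEncoder(nn.Module):
    """Illustrative sketch: a feed-forward transformer encoder whose
    hidden states are conditioned on per-token fundamental frequency.
    All names and hyperparameters here are assumptions, not the
    paper's actual architecture."""

    def __init__(self, vocab_size=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Project a scalar F0 value per token into the model dimension.
        self.pitch_proj = nn.Conv1d(1, d_model, kernel_size=3, padding=1)

    def forward(self, tokens, f0):
        # tokens: (batch, seq_len) phoneme ids
        # f0:     (batch, seq_len) normalized per-token F0 values
        x = self.encoder(self.embed(tokens))      # (batch, seq, d_model)
        pitch = self.pitch_proj(f0.unsqueeze(1))  # (batch, d_model, seq)
        # Condition the hidden states on F0 by simple addition.
        return x + pitch.transpose(1, 2)

# Usage: hidden states for a batch of 2 utterances of length 17.
# hidden = PitchConditionedEncoder()(torch.randint(0, 80, (2, 17)),
#                                    torch.randn(2, 17))
```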