With the advent of deep learning, Text-to-Speech (TTS) research has made a great leap in producing natural speech. However, state-of-the-art TTS systems tend to generate averaged prosody, lacking the variety and expressiveness found in human speech. To avoid synthesizing monotonous speech with averaged prosody, it is desirable to explicitly model the variation in speech prosody. In this work, we explore two approaches to synthesizing highly expressive speech. The first is a feed-forward transformer model conditioned on the fundamental frequency. The second is a variational autoencoder augmented with normalizing flows and an adversarial training process. We train our models on three internal Bangla (also known as Bengali) datasets containing varying amounts of expressive speech, and we provide a comparative study of the effect of the proportion of expressive speech in the training data. Both subjective and objective evaluations confirm that the proposed models outperform the autoregressive Tacotron2 baseline.
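To make the first approach concrete, the sketch below shows one common way a feed-forward transformer encoder can be conditioned on fundamental frequency: a per-token F0 contour is projected into the model dimension and added to the encoder hidden states. This is a minimal illustration assuming an additive, FastPitch-style conditioning scheme; the module names, dimensions, and conditioning mechanism are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn

class PitchConditionedEncoder(nn.Module):
    """Illustrative sketch: a feed-forward transformer encoder whose
    hidden states are conditioned on per-token fundamental frequency.
    All names and hyperparameters here are assumptions, not the
    paper's actual architecture."""

    def __init__(self, vocab_size=80, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Project a scalar F0 value per token into the model dimension.
        self.pitch_proj = nn.Conv1d(1, d_model, kernel_size=3, padding=1)

    def forward(self, tokens, f0):
        # tokens: (batch, seq_len) phoneme ids
        # f0:     (batch, seq_len) normalized per-token F0 values
        x = self.encoder(self.embed(tokens))      # (batch, seq, d_model)
        pitch = self.pitch_proj(f0.unsqueeze(1))  # (batch, d_model, seq)
        # Condition the hidden states on F0 by simple addition.
        return x + pitch.transpose(1, 2)

# Usage: hidden states for a batch of 2 utterances of length 17.
# hidden = PitchConditionedEncoder()(torch.randint(0, 80, (2, 17)),
#                                    torch.randn(2, 17))
```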