
Text to speech (TTS) has attracted a lot of attention recently due to advancements in deep learning. Neural network-based TTS models (such as Tacotron 2, DeepVoice 3, and Transformer TTS) have outperformed conventional concatenative and statistical parametric approaches in terms of speech quality.

These neural TTS models usually first generate a mel-scale spectrogram (mel-spectrogram) autoregressively from the text input and then synthesize speech from the mel-spectrogram using a vocoder. (Note: the mel scale is a perceptual pitch scale onto which frequencies in hertz are mapped, derived from listeners' pitch comparisons; a spectrogram is a visual representation of frequencies measured over time.)
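As an illustration of what a mel-spectrogram is in practice, the short Python sketch below computes one from a waveform with librosa. The bundled example clip and the parameter choices (80 mel bands, 1024-point FFT, hop length of 256 at 22.05 kHz) are assumptions made for this example rather than values taken from the models above, though they are common settings in neural TTS.

```python
# Illustrative sketch only: compute a mel-spectrogram from audio with librosa.
# The example clip, sample rate, and FFT/hop/mel settings are assumed values.
import librosa
import numpy as np

# Load a short waveform (librosa ships example audio clips).
y, sr = librosa.load(librosa.example("trumpet"), sr=22050)

# Short-time Fourier transform with its frequency axis warped onto the mel scale.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Convert power to decibels, the representation TTS models typically predict.
mel_db = librosa.power_to_db(mel, ref=np.max)

# Each column is one frame (~11.6 ms here), so a few seconds of audio already
# yields hundreds of frames -- the long sequence discussed below.
print(mel_db.shape)  # (80, n_frames)
```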

Due to the long mel-spectrogram sequence and the autoregressive nature of the generation, these models face several challenges:

- Slow inference speed: autoregressive mel-spectrogram generation is slow because the mel-spectrogram sequence usually has a length of hundreds or thousands of frames.
- Lack of robustness: the synthesized speech often skips or repeats words, due to error propagation and wrong attention alignments between text and speech in the encoder-attention-decoder framework.
- Lack of controllability: the generation length is determined automatically by the autoregressive generation, so the voice speed and the prosody (such as word breaks) cannot be adjusted.
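To make the first challenge concrete, the toy loop below mimics the sequential dependency of autoregressive generation: each frame is produced from the previous one, so the decoder must run once per frame with no parallelism over time. The dummy decoder step and frame count are assumptions for illustration; a real model runs a neural decoder at each step, but the frame-by-frame loop is the same.

```python
# Toy illustration (assumed, not any specific model): autoregressive generation
# emits one mel frame per step, so inference time grows with sequence length.
import numpy as np

N_MELS = 80
N_FRAMES = 800  # a few seconds of speech at roughly 12 ms per frame

def dummy_decoder_step(prev_frame: np.ndarray) -> np.ndarray:
    """Stand-in for one step of an autoregressive decoder."""
    return np.tanh(prev_frame + 0.01)

frames = [np.zeros(N_MELS)]
for _ in range(N_FRAMES):              # strictly sequential: step t needs step t-1
    frames.append(dummy_decoder_step(frames[-1]))

mel = np.stack(frames[1:], axis=1)     # (80, 800), generated one frame at a time
print(mel.shape)
```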
