FastPitch: Parallel Text-to-speech with Pitch Prediction
Abstract
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference and generates speech that can be further controlled by modifying the predicted contours. FastPitch can thus change the perceived emotional state of the speaker or put emphasis on certain lexical units. We find that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the quality of synthesized speech, making it comparable to the state of the art. It introduces no overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture of FastSpeech with a similar speed of mel-scale spectrogram synthesis, orders of magnitude faster than real-time.
FastPitch learns to model the voice according to the pitch contour.
The predicted contour may be adjusted, automatically or manually,
as shown in the video below.
The interface is used to adjust the predicted pitch vector p̂.
A single FastPitch model is used, with no additional post-processing.
Audio Samples
The samples come from our development subset of the LJSpeech-1.1 dataset.
In all cases we use the same WaveGlow vocoder.
All FastPitch samples are generated with a single model.
We present examples of automatic pitch transformations applied during
synthesis.
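The transformations named in the sample labels below (amplified, inverted, flattened, -50 Hz, +50 Hz) can be illustrated with a minimal sketch. It is not the code used to produce the samples. It assumes the contour is a list of per-frame F0 values in Hz, that unvoiced frames are marked with 0.0, and that scaling is done around the mean of the voiced frames. The function names and the amplification factor of 2.0 are hypothetical.

```python
from statistics import mean


def voiced_mean(pitch_hz):
    """Mean F0 over voiced frames (assumption: unvoiced frames are 0.0)."""
    voiced = [f for f in pitch_hz if f > 0.0]
    return mean(voiced) if voiced else 0.0


def transform_pitch(pitch_hz, scale=1.0, shift_hz=0.0):
    """Scale deviations from the voiced mean, then shift by shift_hz.

    scale > 1 amplifies the contour, scale = -1 inverts it,
    scale = 0 flattens it; unvoiced frames are left at 0.0.
    """
    center = voiced_mean(pitch_hz)
    return [
        0.0 if f <= 0.0 else center + scale * (f - center) + shift_hz
        for f in pitch_hz
    ]


# Toy contour: unvoiced, three voiced frames, unvoiced.
contour = [0.0, 200.0, 220.0, 180.0, 0.0]

amplified = transform_pitch(contour, scale=2.0)    # factor chosen arbitrarily
inverted = transform_pitch(contour, scale=-1.0)
flattened = transform_pitch(contour, scale=0.0)
lowered = transform_pitch(contour, shift_hz=-50.0)
raised = transform_pitch(contour, shift_hz=50.0)
```

In this sketch a large negative shift could push low-pitched frames to zero or below, so a real pipeline would likely clamp the result to a plausible F0 range.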
The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files
Ground truth
Ground truth mel + WaveGlow
Tacotron2
FastPitch
FastPitch (amplified)
FastPitch (inverted)
FastPitch (flattened)
FastPitch (-50 Hz)
FastPitch (+50 Hz)
Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.
Ground truth
Ground truth mel + WaveGlow
Tacotron2
FastPitch
FastPitch (amplified)
FastPitch (inverted)
FastPitch (flattened)
FastPitch (-50 Hz)
FastPitch (+50 Hz)
which he kept concealed in a hiding-place with a trap-door just under his bed.
Ground truth
Ground truth mel + WaveGlow
Tacotron2
FastPitch
FastPitch (amplified)
FastPitch (inverted)
FastPitch (flattened)
FastPitch (-50 Hz)
FastPitch (+50 Hz)
At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.
Ground truth
Ground truth mel + WaveGlow
Tacotron2
FastPitch
FastPitch (amplified)
FastPitch (inverted)
FastPitch (flattened)
FastPitch (-50 Hz)
FastPitch (+50 Hz)
He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.
Ground truth
Ground truth mel + WaveGlow
Tacotron2
FastPitch
FastPitch (amplified)
FastPitch (inverted)
FastPitch (flattened)
FastPitch (-50 Hz)
FastPitch (+50 Hz)
Audio Samples: Multi-speaker
We present samples synthesized with a multi-speaker FastPitch,
trained on a dataset with three speakers (57 hours of data).
The output of the model is conditioned on the speaker embedding.
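The page does not say how the "50% / 50%" voices listed below are produced. One plausible reading, shown here purely as an assumption, is a linear interpolation of two speaker embeddings before they condition the model. The function name and the toy two-dimensional embeddings are hypothetical.

```python
def mix_speaker_embeddings(emb_a, emb_b, weight_a=0.5):
    """Linearly interpolate two speaker embeddings of equal length.

    weight_a is the share of speaker A; speaker B gets 1 - weight_a.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(emb_a, emb_b)]


# Toy embeddings standing in for two learned speaker vectors.
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]

half_and_half = mix_speaker_embeddings(speaker_a, speaker_b)
```

With `weight_a=1.0` the function returns the first speaker unchanged, so the single-speaker samples correspond to the endpoints of the same interpolation.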
The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files
FastPitch LJS
FastPitch Sally
FastPitch Helen
FastPitch 50% LJS / 50% Sally
FastPitch 50% Sally / 50% Helen
At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.
FastPitch LJS
FastPitch Sally
FastPitch Helen
FastPitch 50% LJS / 50% Sally
FastPitch 50% Sally / 50% Helen
He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.