Hi,
Thanks for this clean and great implementation of MelNet.
I'm a beginner in speech synthesis, so please guide me through the steps for training MelNet for TTS.
What I know/assume:
- Each tier is trained separately. For TTS, we set the tier flag to 1 and the tts flag to True.
- For the subsequent tiers, we set the tier flag to 2, 3, 4, 5, and 6 in turn, with the tts flag set to False.
- Finally, we put the checkpoint for each tier in inference.yaml and pass it to the MelNet class for prediction.
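
To make my assumption concrete, here is a small Python sketch of the training schedule I have in mind. The function name and dictionary keys are hypothetical and only for illustration; they are not taken from the repository, and the sketch does not call the actual trainer.

```python
def tier_training_plan(num_tiers=6):
    """Return one planned training run per tier (illustrative only)."""
    plan = []
    for tier in range(1, num_tiers + 1):
        # Assumption: only tier 1 is trained as the TTS tier;
        # all later tiers are trained with the tts flag off.
        plan.append({"tier": tier, "tts": tier == 1})
    return plan


for run in tier_training_plan():
    print(run)
```

Please correct me if the real schedule differs from this.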
Therefore I have some questions:
- Can you provide/confirm the steps to train multiple tiers for the TTS option?
- Are we supposed to train TTS with the --tts flag set to True while keeping the tier number at 1?
- What do you mean by this in README.md?
  "The -s flag is a boolean for determining whether to train a TTS tier. Since a TTS tier only differs at tier 1, this flag is ignored when [tier number] != 0."
  - Where in the code is the condition you refer to here: [tier number] != 0?
  - I assume this means we should ignore the tts flag when the tier number > 2?
- What is the difference between the tts arg for the trainer and the tier number in the config file (YAML)? Should they be the same? If not, how do they differ?
- How do we know that the model for each tier has converged? What is the minimum train/test loss value we should achieve? How long did your training take, and on which GPU?
- Lastly, can we generate Mel outputs from different trained tier models? For example, if we have the TTS model plus some consecutive tiers, can we run inference on them to check training performance?