- Implementation of "Neural Speech Synthesis with Transformer Network"
- This implementation also serves as the teacher model for FastSpeech
- Download and extract the LJ Speech dataset
- Make a `preprocessed` folder in the LJSpeech directory and make `char_seq`, `phone_seq`, and `melspectrogram` folders in it
- Set `data_path` in `hparams.py` to the LJSpeech folder
- Using `prepare_data.ipynb`, prepare melspectrogram and text (converted into indices) tensors (a melspectrogram sketch follows this list)
- `python train.py`
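Roughly, what the preprocessing notebook computes for each clip is a melspectrogram tensor. A minimal sketch, assuming librosa features; the parameter values and save path below are illustrative assumptions, not the repository's `hparams.py` settings:

```python
import librosa
import numpy as np
import torch

def wav_to_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Load audio and compute a log-melspectrogram (parameter values assumed)
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    log_mel = np.log(np.clip(mel, 1e-5, None))  # clip to avoid log(0)
    return torch.from_numpy(log_mel)  # shape: (n_mels, n_frames)

# mel = wav_to_mel("LJSpeech-1.1/wavs/LJ001-0001.wav")
# torch.save(mel, "LJSpeech-1.1/preprocessed/melspectrogram/LJ001-0001.pt")
```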
- Encoder Alignments (figure)
You can hear the audio samples here
- Unlike the original paper, I didn't use an encoder-prenet, following espnet
- I apply an additional "guided attention loss" to the two heads of the last two layers (a sketch of this loss follows this list)
- Batch size is important, so I use gradient accumulation (a gradient-accumulation sketch follows this list)
- You can also use DataParallel. Change `n_gpus`, `batch_size`, and `accumulation` appropriately.
- Dynamic batching is used
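The guided attention loss is in the spirit of Tachibana et al. (2017): it penalizes attention mass far from the text/mel diagonal. A minimal sketch; the function name, the `(T_mel, T_text)` layout, and the `sigma` value are assumptions, not this repository's code:

```python
import torch

def guided_attention_loss(attn, text_len, mel_len, sigma=0.2):
    """attn: (T_mel, T_text) attention weights of a single head."""
    T_mel, T_text = attn.shape
    t = torch.arange(T_mel, device=attn.device).float().unsqueeze(1) / mel_len
    n = torch.arange(T_text, device=attn.device).float().unsqueeze(0) / text_len
    # Penalty grows as attention strays from the diagonal n/N ~ t/T
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2 * sigma ** 2))
    return (attn * w).mean()
```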
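Gradient accumulation delays `optimizer.step()` so that several small batches contribute to one update, simulating an effective batch of `batch_size * accumulation`. A self-contained sketch with stand-in model and data; the real loop lives in `train.py`:

```python
import torch

accumulation = 4                                  # illustrative value
model = torch.nn.Linear(80, 80)                   # stand-in for the TTS model
# model = torch.nn.DataParallel(model)            # optional multi-GPU wrapper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = [torch.randn(8, 80) for _ in range(8)]   # stand-in data loader

optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(batch).pow(2).mean() / accumulation  # scale so grads average
    loss.backward()
    if (step + 1) % accumulation == 0:  # step once per `accumulation` batches
        optimizer.step()
        optimizer.zero_grad()
```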
- For FastSpeech, the generated melspectrograms and attention matrices should be saved for later use.
  1-1. Set `teacher_path` in `hparams.py` and make `alignments` and `targets` directories there.
  1-2. Using `prepare_fastspeech.ipynb`, prepare alignments and targets (a duration-extraction sketch follows this list).
- To draw attention plots for each head, I changed the return values of `torch.nn.functional.multi_head_attention_forward()`:

  ```python
  # before
  return attn_output, attn_output_weights.sum(dim=1) / num_heads

  # after
  return attn_output, attn_output_weights
  ```

- Among the `num_layers * num_heads` attention matrices, the one with the highest focus rate is saved (a focus-rate sketch follows this list).
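For the FastSpeech targets, per-character durations can be read off a teacher attention matrix by counting, for each text position, how many mel frames attend to it most strongly. A sketch assuming the `(T_mel, T_text)` layout; names are illustrative:

```python
import torch

def durations_from_alignment(attn):
    """attn: (T_mel, T_text) teacher attention; returns (T_text,) durations."""
    frame_to_text = attn.argmax(dim=1)  # dominant text position per mel frame
    return torch.bincount(frame_to_text, minlength=attn.shape[1])
```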
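The focus rate, as defined in the FastSpeech paper, is F = (1/T) Σ_t max_s a_{t,s}: the average of each mel frame's strongest attention weight, so a more diagonal head scores higher. A minimal sketch with the same assumed layout:

```python
import torch

def focus_rate(attn):
    """attn: (T_mel, T_text); returns a scalar in (0, 1]."""
    return attn.max(dim=1).values.mean()

# Keep the sharpest head among num_layers * num_heads candidates:
# best = max(attn_matrices, key=focus_rate)
```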
1. NVIDIA/tacotron2: https://github.com/NVIDIA/tacotron2
2. espnet/espnet: https://github.com/espnet/espnet
3. soobinseo/Transformer-TTS: https://github.com/soobinseo/Transformer-TTS