Shallow diffusion model:
k_step_max=100
unit encoder: contentvec768l12
training: 600,000 steps, no pretrained model
network: 512 channels x 20 layers
speaker1: opencpop
speaker2: kiritan
Naive model:
unit encoder: contentvec768l12
training: 200,000 steps, no pretrained model
speaker1: opencpop
speaker2: kiritan
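
For scripting against these checkpoints, the settings above can be collected into one lookup table. This is only a sketch: the dictionary layout, the key names, and the speaker_id helper are hypothetical and not the training project's actual config schema. The values are copied from the list above, with the network size kept verbatim as a string.

```python
# Hypothetical summary of the listed settings. Key names are illustrative
# assumptions, not the real configuration schema of the training code.
MODELS = {
    "shallow_diffusion": {
        "k_step_max": 100,
        "unit_encoder": "contentvec768l12",
        "train_steps": 600_000,
        "pretrained": False,
        "network": "512*20",  # kept verbatim from the notes
        "speakers": {1: "opencpop", 2: "kiritan"},
    },
    "naive": {
        "unit_encoder": "contentvec768l12",
        "train_steps": 200_000,
        "pretrained": False,
        "speakers": {1: "opencpop", 2: "kiritan"},
    },
}


def speaker_id(model: str, name: str) -> int:
    """Return the numeric speaker ID for a speaker name in the given model."""
    for sid, speaker in MODELS[model]["speakers"].items():
        if speaker == name:
            return sid
    raise KeyError(f"unknown speaker {name!r} for model {model!r}")
```

For example, speaker_id("naive", "kiritan") looks up the ID listed for kiritan in the naive model.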