I've been seeing some promising results from using alternative noise methods to teach the model to adjust the lower-frequency components of an input. Pure randn noise is mostly high-frequency content, so Stable Diffusion (and possibly other diffusion models trained on randn noise) learned to create images with roughly the same average brightness and can't make noticeably brighter or darker images. At sampling time, offset and pyramid noise appear to still use normal randn noise; I'm not certain about pink noise.
With offset noise, the model learns to shift the output up or down more. It's a very small change to the noise generation for training:
noise = torch.randn_like(latents) + 0.1 * torch.randn(latents.shape[0], latents.shape[1], 1, 1)
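To make the effect of that one-liner concrete, here is a minimal sketch that wraps it in a helper and compares how much the per-channel mean moves with and without the offset term. The function name `offset_noise_like`, the `offset_strength` parameter, and the 4-D `(batch, channels, height, width)` latent shape are assumptions for illustration, not something from the article.

```python
import torch


def offset_noise_like(latents: torch.Tensor, offset_strength: float = 0.1) -> torch.Tensor:
    """Per-element Gaussian noise plus one random shift per (sample, channel).

    Assumes a 4-D tensor shaped (batch, channels, height, width).
    """
    b, c = latents.shape[:2]
    per_element = torch.randn_like(latents)
    # One scalar per sample and channel, broadcast over the spatial dimensions.
    # Created on the same device/dtype as the latents so the addition is valid.
    shift = torch.randn(b, c, 1, 1, device=latents.device, dtype=latents.dtype)
    return per_element + offset_strength * shift


torch.manual_seed(0)
latents = torch.zeros(256, 4, 64, 64)  # hypothetical latent batch

plain = torch.randn_like(latents)
offset = offset_noise_like(latents)

# Spread of the per-channel mean across the batch: this is the "brightness"
# component that plain randn noise barely perturbs.
plain_mean_std = plain.mean(dim=(2, 3)).std().item()
offset_mean_std = offset.mean(dim=(2, 3)).std().item()
print(f"plain: {plain_mean_std:.4f}  offset: {offset_mean_std:.4f}")
```

For a 64x64 map, the mean of plain unit-variance noise has a standard deviation of about 1/64, while the offset term adds roughly 0.1 on top of that, so the mean is perturbed several times more strongly.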
With pyramid noise, the input is masked more evenly across different frequencies, rather than mostly in the high-frequency content. The noise is generated by scaling low-resolution noise up by a random factor (they wanted to avoid always doing a 2x upscale), adding more noise after upsampling, and repeating. The code they use is given in the article; Ctrl+F for def pyramid_noise_like(x, discount=0.9):
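The article's own code is not reproduced here. As a rough sketch of the idea described above (not the article's implementation), the following sums noise drawn at progressively coarser resolutions, each upsampled to full size and weighted by a decaying factor. The name `multires_noise_sketch`, the `max_levels` parameter, the random shrink factor range of 2 to 4, and the final rescaling to unit standard deviation are all assumptions.

```python
import random

import torch
import torch.nn.functional as F


def multires_noise_sketch(x: torch.Tensor, discount: float = 0.9, max_levels: int = 10) -> torch.Tensor:
    """Sum of Gaussian noise drawn at several resolutions (illustrative only).

    Assumes a 4-D tensor shaped (batch, channels, height, width).
    """
    b, c, h, w = x.shape
    noise = torch.randn_like(x)  # full-resolution component
    cur_h, cur_w = h, w
    for level in range(1, max_levels + 1):
        # Shrink by a random factor rather than always halving.
        r = random.uniform(2.0, 4.0)
        cur_h, cur_w = max(1, int(cur_h / r)), max(1, int(cur_w / r))
        coarse = torch.randn(b, c, cur_h, cur_w, device=x.device, dtype=x.dtype)
        upsampled = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)
        noise = noise + (discount ** level) * upsampled
        if cur_h == 1 or cur_w == 1:
            break  # the coarsest level is a single value per channel
    # Rescale so the result has roughly unit standard deviation overall.
    return noise / noise.std()


random.seed(0)
torch.manual_seed(0)
x = torch.zeros(64, 4, 64, 64)  # hypothetical latent batch

pyramid = multires_noise_sketch(x)
white = torch.randn_like(x)

pyramid_mean_std = pyramid.mean(dim=(2, 3)).std().item()
white_mean_std = white.mean(dim=(2, 3)).std().item()
print(f"white: {white_mean_std:.4f}  pyramid: {pyramid_mean_std:.4f}")
```

Because the coarsest levels change slowly across the map, they carry much more low-frequency energy than plain randn noise, which shows up as a larger spread in the per-channel mean.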
With pink noise (EleutherAI Discord message link), I'm not 100% sure of the benefit. It's apparently closer to the noise found in images, so it seems to make sense for image generation, but perhaps it'll be good for audio too.
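For readers unfamiliar with pink noise, here is a generic sketch of one common way to produce it: shape white noise in the frequency domain so that amplitude falls off as 1/f. This is independent of the Discord snippet and may differ from it; the name `pink_noise_like`, the `alpha` exponent, the handling of the zero-frequency bin, and the final rescaling are assumptions.

```python
import torch


def pink_noise_like(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """2-D noise whose amplitude spectrum falls off as 1 / f**alpha (illustrative only).

    Operates on the last two dimensions of x.
    """
    h, w = x.shape[-2:]
    white = torch.randn_like(x)
    # Frequency grid matching the layout returned by rfft2.
    fy = torch.fft.fftfreq(h, device=x.device).view(-1, 1)
    fx = torch.fft.rfftfreq(w, device=x.device).view(1, -1)
    radius = torch.sqrt(fx ** 2 + fy ** 2)
    # Avoid dividing by zero at the DC bin: treat it like the lowest nonzero frequency.
    radius[0, 0] = 1.0 / max(h, w)
    spectrum = torch.fft.rfft2(white) * radius.pow(-alpha)
    pink = torch.fft.irfft2(spectrum, s=(h, w))
    # Rescale so the result has roughly unit standard deviation overall.
    return pink / pink.std()


torch.manual_seed(0)
x = torch.zeros(16, 4, 64, 64)  # hypothetical latent batch

pink = pink_noise_like(x)
white = torch.randn_like(x)

# Average-pooling acts as a crude low-pass filter: whatever survives it is
# low-frequency content.
pink_low = torch.nn.functional.avg_pool2d(pink, 8).std().item()
white_low = torch.nn.functional.avg_pool2d(white, 8).std().item()
print(f"white low-pass std: {white_low:.4f}  pink low-pass std: {pink_low:.4f}")
```

For 1-D audio the same idea would apply with `torch.fft.rfft` and `torch.fft.irfft` along the time axis, though that variant is not shown here.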
In case you can't open the Discord link, the code provided by crowsonkb / alstroemeria313 is
@torridgristle thanks for sharing! Do you have some results to show for audio? This is something I also wanted to try at some point. Very interested to see how the different types of noise compare
I wonder what would happen if you had separate networks specialize in different frequency bands, with the loss function computed only on the final mixed output of all of them. Perhaps focusing more on the 500 Hz to 4 kHz range where our hearing is most sensitive, like you're saying.
Ideally this will help with generating lower frequency components in audio.