UNet not predicting noise correctly in Stable Diffusion #262

DisableGraphics · 2024-08-14T11:04:55Z

DisableGraphics
Aug 14, 2024

First of all, I've decided to open a discussion because the problem isn't with ort per se. I've had the same errors in C++ with onnx. However, I'm primarily using ort as my project's library, so I'd like to have insight with ort. (BTW, this is why I opened a discussion, not an issue)

I'm trying to implement stable diffusion from scratch using ort. However, for some bizarre reason that transcends my limited knowledge of ML, the unet is not predicting noise correctly. I've tested everything three bajillion times and I know there's nothing wrong with any model (except this thing with the UNet) nor the scheduler (that's why I'm not posting it, but for the interested I'm using the scheduler from pyke's diffusers repo: https://github.com/pykeio/diffusers with some modifications (it just returns the prev_sample instead of the whole SchedulerOutput struct)).

I know it's not predicting correctly the noise because I've used onnxruntime's C# version of stable diffusion (https://github.com/cassiebreviu/StableDiffusion) to benchmark mine against and the noise prediction array is completely different from mine.

Here's my diffusion loop:

const CFG: f32 = 7.0;
	let cpu = CPUExecutionProvider::default().build();
	let mut gen = StdRng::seed_from_u64(42);
	let text_encoder = Session::builder()?//Session::builder()?
		.with_execution_providers(vec![cpu.clone()])?
		.commit_from_file("model/text_encoder/model.onnx")?;
	let tokenizer = instant_clip_tokenizer::Tokenizer::new();
	let prompt = "a sunny day";
	let mut tokens = vec![tokenizer.start_of_text()];
	tokenizer.encode(prompt, &mut tokens);
	tokens.push(tokenizer.end_of_text());
	let padlen = (tokens.len() / 77 + 1) * 77;
	
	let tokens = fpysd::pad_tokens(tokens, padlen, tokenizer.end_of_text());
	let tokens = tokens.into_iter().map(|token| token.to_u16().into()).collect::<Vec<i32>>();

	let uncond_tokens = fpysd::pad_tokens(vec![tokenizer.start_of_text()], 77, tokenizer.end_of_text());
	let uncond_tokens = uncond_tokens.into_iter().map(|token| token.to_u16().into()).collect::<Vec<i32>>();

	let tokens: Array2<_> = Array2::from_shape_vec((1, tokens.len()), tokens)?;
	let cond_tokens = text_encoder.run(ort::inputs![
		tokens
	]?)?[0].try_extract_tensor::<tp>()?.to_owned();
	let uncond_tokens = Array2::from_shape_vec((1, uncond_tokens.len()), uncond_tokens)?;
	let uncond_tokens = text_encoder.run(ort::inputs![
		uncond_tokens
	]?)?[0].try_extract_tensor::<tp>()?.to_owned();
	let cond_tokens: Array3<_> = cond_tokens.view().to_owned().into_dimensionality()?;
	let uncond_tokens: Array3<_> = uncond_tokens.view().to_owned().into_dimensionality()?;
	let context = concatenate![Axis(0), cond_tokens, uncond_tokens];
	let decoder = Session::builder()?
		.with_execution_providers(vec![cpu.clone()])?
		.commit_from_file("model/vae_decoder/model.onnx")?;
	let mut latents: Array4<_> = Array4::random_using((1, 4, 64, 64), StandardNormal, &mut gen);
	let mut sampler = samplers::ddpmsampler::DDPMScheduler::default();
	sampler.set_timesteps(5);
	latents *= sampler.init_noise_sigma();
	let unet = Session::builder()?
		.with_execution_providers(vec![cpu.clone()])?
		.commit_from_file("model/unet/model.onnx")?;
	let timesteps = sampler.get_timesteps().to_owned();
	for (i, timestep) in timesteps.iter().enumerate() {
		let timestep = *timestep;
		println!("Timestep: {}", timestep);
		std::io::stdout().flush()?;
		let latent_model_input = concatenate![Axis(0), latents, latents];
		let latent_model_input = sampler.scale_model_input(latent_model_input.view(), timestep);
		
		let timesteparr: Array1<f32> = Array1::from_iter([(timestep as u32) as f32]);
		let noise_pred = unet.run(ort::inputs![
			latent_model_input,
			timesteparr,
			context.view()
		]?)?;
		
		let noise_pred = noise_pred[0].try_extract_tensor::<tp>()?;
		let noise_pred= noise_pred.view();
		let noise_pred: ArrayView4<_> = noise_pred.clone().into_dimensionality()?;
		let noise_uncond = noise_pred.slice(s![..1, .., .., ..]).to_owned();
		let noise_text = noise_pred.slice(s![1.., .., .., ..]).to_owned();
		let noise_pred = &noise_uncond + (CFG*(&noise_text - &noise_uncond));
		latents = sampler.step(timestep, latents, noise_pred.view());
	}
	latents /= 0.18215;
	let mut images = Vec::new();
	for latent_chunk in latents.axis_iter(Axis(0)) {
		let image = decoder.run(ort::inputs![latent_chunk.insert_axis(Axis(0))]?)?;
		let image: Array4<f32> = image[0].try_extract_tensor()?.view().to_owned().into_dimensionality()?;
		let f_image = image / 2.0 + 0.5;
		let f_image = f_image.permuted_axes([0, 2, 3, 1]) ;

		let image =  DynamicImage::ImageRgb32F(
			Rgb32FImage::from_raw(f_image.shape()[2] as _, f_image.shape()[1] as _, f_image.map(|f| f.clamp(0.0, 1.0)).into_iter().collect::<Vec<_>>())
				.ok_or_else(|| anyhow::anyhow!("failed to construct image"))?
		);
		images.push(image);
	}
	for (i, image) in images.iter().enumerate() {
		let mut image = image.pixels().into_iter().map(|pix| { Rgb::from([pix.2.0[0], pix.2.0[1], pix.2.0[2]]) }).collect::<Vec<Rgb<u8>>>();
		let image = ImageBuffer::from_fn(512, 512, |x, y| {
			image.pop().unwrap()
		});
		image.save(format!("output{}.png", i))?;
	}

I'm using the latest version of ort (ort = { git = "https://github.com/pykeio/ort", branch = "main", default-features = false, features = ["ndarray", "download-binaries"] } )
The latents look like this after each iteration:
0

1

2

3

4

Here's the noise prediction after each iteration:
0

1

2

3

4