The loss easily becomes NaN when using the Stable Diffusion v2.1 checkpoint #103

JackWoo0831 · 2024-09-03T08:25:19Z

The cause is that, when computing attention, some elements of the product of Q and K become extremely large (~3e5) at line 319 in class SparseCausalAttention. After I switched to the Stable Diffusion v1.4 checkpoint, the issue went away.

        # attention, what we cannot get enough of
        if self._use_memory_efficient_attention_xformers:
            hidden_states = self._memory_efficient_attention_xformers(query, key, value, attention_mask)
            # Some versions of xformers return output in fp32, cast it back to the dtype of the input
            hidden_states = hidden_states.to(query.dtype)
        else:
            if self._slice_size is None or query.shape[0] // self._slice_size == 1:
                hidden_states = self._attention(query, key, value, attention_mask)
            else:
                hidden_states = self._sliced_attention(query, key, value, sequence_length, dim, attention_mask)

        # linear proj
        hidden_states = self.to_out[0](hidden_states)

        # dropout
        hidden_states = self.to_out[1](hidden_states)
        return hidden_states
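
To illustrate why logits of that size can end in NaN, here is a minimal NumPy sketch. It rests on an assumption the report does not confirm: that the attention scores are computed in half precision. float16 cannot represent values above 65504, so a Q·K entry near 3e5 becomes inf, and the softmax then produces NaN. The function `attention_probs`, its `upcast` flag, and the toy shapes and values are hypothetical and are not taken from SparseCausalAttention.

```python
import numpy as np


def attention_probs(query, key, scale, upcast=False):
    """Return softmax attention weights for 2-D query/key arrays.

    Hypothetical helper for illustration only. With upcast=True the
    logits and the softmax are computed in float32 and the result is
    cast back to the input dtype.
    """
    out_dtype = query.dtype
    if upcast:
        query = query.astype(np.float32)
        key = key.astype(np.float32)
    scores = (query @ key.T) * scale
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights.astype(out_dtype)


dim = 64
scale = dim ** -0.5

# Toy activations chosen so that Q.K is about 3e5, the magnitude
# mentioned in the report (64 * 70 * 70 = 313600).
query = np.full((2, dim), 70.0, dtype=np.float16)
key = np.stack([
    np.full(dim, 70.0, dtype=np.float16),
    np.full(dim, 60.0, dtype=np.float16),
])

# Assumption: logits in float16. They exceed 65504 and become inf,
# and inf - inf inside the softmax yields NaN.
with np.errstate(over="ignore", invalid="ignore"):
    probs_fp16 = attention_probs(query, key, scale)

# Same inputs, but with the logits and softmax computed in float32.
probs_fp32 = attention_probs(query, key, scale, upcast=True)

print(np.isnan(probs_fp16).any())
print(np.isnan(probs_fp32).any())
```

If half precision does turn out to be the cause here, computing the attention scores in float32 is one possible workaround. diffusers exposes an `upcast_attention` option on its UNet for this purpose, but whether this repository's SparseCausalAttention honors it is not stated in this issue.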
