Unclear documentation regarding the behaviour of tokens and captions #214
Unanswered
cookiehunter asked this question in Q&A
I understand that this technique borrows many ideas from textual inversion, but for someone without in-depth knowledge of the related techniques, the documentation is unclear in several places. Judging by other comments, I am not the only one having difficulty with the following points:
**Why two tokens?**
In all examples, there are always two tokens, `<s1>` and `<s2>`. Why? Can there also be only a single token? When training a single object or style, a single token should be the main use case, right?
Maybe a specific example of how multiple tokens can be used in practice would help. My guess is that you can have an object that you can describe with different tokens, e.g. one image with the caption

"A photo of `<character1>` wearing `<outfit1>`."

and another image with the caption

"A photo of `<character1>` wearing `<outfit2>`."

So in training, you would declare `<character1>|<outfit1>|<outfit2>`. Is that how it is supposed to work, roughly as sketched below?
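To make my guess concrete (the file names and the pipe-separated declaration are my invention, not taken from the docs):

```python
# My guess at how multiple tokens would be combined — file names and the
# pipe-separated declaration below are invented by me, not from the docs.
captions = {
    "img_001.jpg": "A photo of <character1> wearing <outfit1>.",
    "img_002.jpg": "A photo of <character1> wearing <outfit2>.",
}

# All placeholders declared together at training time:
placeholder_tokens = "<character1>|<outfit1>|<outfit2>"
print(placeholder_tokens.split("|"))  # ['<character1>', '<outfit1>', '<outfit2>']
```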
Is it OK that some captions do not contain all placeholder tokens?
Am I right that the tokens do not have to be of the form `<s#>` and that you can invent your own tokens?
Is `placeholder_token_at_data` optional? If I incorporate the right placeholder tokens in my captions to start with, I do not need `placeholder_token_at_data`, right?
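For what it is worth, my current reading of `placeholder_token_at_data` is that it is a `source|target` pair used to rewrite captions before training. This is an assumption pieced together from the README example, not confirmed behaviour:

```python
# Assumed semantics of placeholder_token_at_data="<krk>|<s1><s2>":
# every occurrence of the left-hand token in a caption is replaced by the
# right-hand tokens. This is my reading, not documented behaviour.
placeholder_token_at_data = "<krk>|<s1><s2>"
source, target = placeholder_token_at_data.split("|")

caption = "a photo of <krk> on the beach"
print(caption.replace(source, target))  # a photo of <s1><s2> on the beach
```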
**What should the dataset look like?**
It seems you can train with and without captions. What decides whether captions are used? Sometimes captions are extracted from the file names, but the training script also mentions "caption.txt". What decides whether training uses a separate txt file for captions or the file names? And what should the folder structure look like? Something like the guess below?
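For concreteness, this is the kind of layout I am imagining; every file name here is invented by me, not taken from the docs:

```
training_data/
├── img_001.jpg     # caption taken from the file name?
├── img_002.jpg
└── caption.txt     # ...or one caption per image in here?
```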
**Template?**
What is the difference between "style" and "object"? It is also mentioned that you need to use "style" if there is no caption in the filename. Why? Does that mean I also have to use "style" even if I supply captions with a "caption.txt" file?
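Coming from textual inversion, my assumption is that a template is a canned caption pattern that the placeholder gets substituted into, roughly like this (illustrative lists only; I have not checked the exact lists this repo ships):

```python
# Illustrative template lists in the spirit of textual inversion — not
# necessarily the ones this repo actually uses.
object_templates = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
]
style_templates = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
]

print(object_templates[0].format("<s1><s2>"))  # a photo of a <s1><s2>
```

Is that roughly right, and does "style" simply switch to style-flavoured patterns?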
**What happens with the tokens during merging or when monkey-patching multiple LoRA models?**
If I have a model with token `<character1>` and another with `<environment1>`, are both tokens accessible after merging?
Does merging compromise the learned concepts?
Is it even possible to load multiple LoRA models at the same time, so that tokens distributed across multiple LoRA models can be used inside the same prompt?
`patch_pipe` can technically be called multiple times. Does this add the tokens of different LoRA models? What effect do the parameters `patch_text`, `patch_ti`, and `patch_unet` have in this regard? See the sketch below for what I am imagining.
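A sketch of the multi-LoRA case I have in mind. The checkpoint file names are made up, and I am assuming the `lora_diffusion` import path and the `patch_pipe` keyword arguments from the README examples:

```python
# Sketch: can two LoRAs be patched into one pipeline so that tokens from
# both work in the same prompt? File names are hypothetical.
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import patch_pipe  # import path assumed from the README

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

patch_pipe(pipe, "character1_lora.safetensors",
           patch_text=True, patch_ti=True, patch_unet=True)
# Second call: does this add <environment1> on top of <character1>,
# or does it overwrite the first patch?
patch_pipe(pipe, "environment1_lora.safetensors",
           patch_text=True, patch_ti=True, patch_unet=True)

image = pipe("a photo of <character1> in <environment1>").images[0]
```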
Sorry for the lengthy wall of text, but since "Make a better documentation" is on the TODO list, I thought it couldn't hurt to mention everything that remains unclear to someone familiar with the Stable Diffusion codebase but not involved in this particular topic.

Replies: 1 comment

Hi, so what is the meaning of `patch_ti` in the `patch_pipe` function?