Unclear documentation regarding the behaviour of tokens and captions #214
Unanswered
cookiehunter asked this question in Q&A
I understand that this technique borrows many ideas from textual inversion, but for someone without in-depth knowledge of the related techniques, the documentation is unclear in several places. Judging by other comments, I am not the only one having difficulty with the following points:
**Why two tokens?**
In all examples, there are always two tokens, `<s1>` and `<s2>`. Why? Can there also be only a single token? When training a single object or style, a single token should be the main use case, right?
Maybe a specific example of how multiple tokens can be used in practice would help. My guess is that you can have an object that you can describe with different tokens, e.g. one image with the caption

"A photo of `<character1>` wearing `<outfit1>`."

and another image with the caption

"A photo of `<character1>` wearing `<outfit2>`."

So in training, you would declare `<character1>|<outfit1>|<outfit2>`. Is that how it is supposed to work, roughly as sketched below?
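To make my guess concrete (the file names and the pipe-separated declaration are my invention, not taken from the docs):

```python
# My guess at how multiple tokens would be combined — file names and the
# pipe-separated declaration below are invented by me, not from the docs.
captions = {
    "img_001.jpg": "A photo of <character1> wearing <outfit1>.",
    "img_002.jpg": "A photo of <character1> wearing <outfit2>.",
}

# All placeholders declared together at training time:
placeholder_tokens = "<character1>|<outfit1>|<outfit2>"
print(placeholder_tokens.split("|"))  # ['<character1>', '<outfit1>', '<outfit2>']
```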
Is it OK that some captions do not contain all placeholder tokens?
Am I right that the tokens do not have to be of the form `<s#>` and that you can invent your own tokens?
Is `placeholder_token_at_data` optional? If I incorporate the right placeholder tokens in my captions to start with, I do not need `placeholder_token_at_data`, right?
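For what it is worth, my current reading of `placeholder_token_at_data` is that it is a `source|target` pair used to rewrite captions before training. This is an assumption pieced together from the README example, not confirmed behaviour:

```python
# Assumed semantics of placeholder_token_at_data="<krk>|<s1><s2>":
# every occurrence of the left-hand token in a caption is replaced by the
# right-hand tokens. This is my reading, not documented behaviour.
placeholder_token_at_data = "<krk>|<s1><s2>"
source, target = placeholder_token_at_data.split("|")

caption = "a photo of <krk> on the beach"
print(caption.replace(source, target))  # a photo of <s1><s2> on the beach
```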
**What should the dataset look like?**
It seems you can train with and without captions. What decides whether captions are used? Sometimes captions are extracted from the file names, but the training script also mentions "caption.txt". What decides whether training uses a separate txt file for captions or the file names? And what should the folder structure look like? Something like the guess below?
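For concreteness, this is the kind of layout I am imagining; every file name here is invented by me, not taken from the docs:

```
training_data/
├── img_001.jpg     # caption taken from the file name?
├── img_002.jpg
└── caption.txt     # ...or one caption per image in here?
```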
**Template?**
What is the difference between "style" and "object"? It is also mentioned that you need to use "style" if there is no caption in the filename. Why? Does that mean I also have to use "style" even if I supply captions with a "caption.txt" file?
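Coming from textual inversion, my assumption is that a template is a canned caption pattern that the placeholder gets substituted into, roughly like this (illustrative lists only; I have not checked the exact lists this repo ships):

```python
# Illustrative template lists in the spirit of textual inversion — not
# necessarily the ones this repo actually uses.
object_templates = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
]
style_templates = [
    "a painting in the style of {}",
    "a rendering in the style of {}",
]

print(object_templates[0].format("<s1><s2>"))  # a photo of a <s1><s2>
```

Is that roughly right, and does "style" simply switch to style-flavoured patterns?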
**What happens with the tokens during merging or when monkey-patching multiple LoRA models?**
If I have a model with token `<character1>` and another with `<environment1>`, are both tokens accessible after merging?
Does merging compromise the learned concepts?
Is it even possible to load multiple LoRA models at the same time, so that tokens distributed across multiple LoRA models can be used inside the same prompt?
`patch_pipe` can technically be called multiple times. Does this add the tokens of different LoRA models? What effect do the parameters `patch_text`, `patch_ti`, and `patch_unet` have in this regard? See the sketch below for what I am imagining.
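A sketch of the multi-LoRA case I have in mind. The checkpoint file names are made up, and I am assuming the `lora_diffusion` import path and the `patch_pipe` keyword arguments from the README examples:

```python
# Sketch: can two LoRAs be patched into one pipeline so that tokens from
# both work in the same prompt? File names are hypothetical.
import torch
from diffusers import StableDiffusionPipeline
from lora_diffusion import patch_pipe  # import path assumed from the README

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

patch_pipe(pipe, "character1_lora.safetensors",
           patch_text=True, patch_ti=True, patch_unet=True)
# Second call: does this add <environment1> on top of <character1>,
# or does it overwrite the first patch?
patch_pipe(pipe, "environment1_lora.safetensors",
           patch_text=True, patch_ti=True, patch_unet=True)

image = pipe("a photo of <character1> in <environment1>").images[0]
```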
Sorry for the lengthy wall of text, but since "Make a better documentation" is on the TODO list, I thought it couldn't hurt to mention everything that remains unclear to someone familiar with the Stable Diffusion codebase but not involved in this particular topic.

Replies: 1 comment

Hi, so what is the meaning of `patch_ti` in the `patch_pipe` function?