Positional Stickiness #8
Yes, I see exactly what you mean, and I noticed this in all the models I trained (both VitGAN and mlp_mixer). This was actually the reason I started working on the diversity loss, but it does not work very well so far: quality drops, and if the coefficient is small enough the model only outputs the same objects with different palettes when the noise vector changes, while if the coefficient is bigger it just outputs random textures unrelated to the prompt. I also thought it was a kind of shortcut the model finds to optimize the loss, and the first thing I could think of to fix it was to explicitly add an additional constraint (diversity) to the loss.
It could also be related to the architectures themselves (VitGAN and mlp_mixer), I'm not sure. Another way to make the constraint even more explicit would be to add a diversity loss on the "shapes" of the objects (rather than the "style") to avoid the model sticking to one shape.
@mehdidc Interesting, so this happens with mlp-mixer too? Hm. There are of course examples which sort of defy the notion that it's a universal problem: if the captions are sufficiently different you can get it to output something quite different. For instance, the "photo of san francisco" captions tend to produce wildly different outputs, although you can sort of see that it's still getting caught up on the same position. I think in this example, rather than "animal head goes here, other animal's body goes there", it is changing to "foreground goes here, background goes there".
"For instance - the "photo of san francisco" captions tend to produce wildly different outputs " Ah okay, so what are the text prompts here where you observed different outputs, you mix "photo of san francisco" with the different attributes that you mentioned before such as "8k resolution"? |
@mehdidc Indeed - but I saw very similar results from the
Generating a video from the training samples is maybe the best way to illustrate the issue: `ffmpeg -framerate 15 -pattern_type glob -i '*.png' -c:v libx264 -pix_fmt yuv420p training_as_video.mp4`
Yes, exactly!
Another way to see it is through interpolation. Here is a video showing interpolation (of the text-encoded features) from "the sun is a vanishing point absorbing shadows" to "the moon is a vanishing point absorbing shadows": https://drive.google.com/file/d/16yreg0jajmC4qwJGmiGJypp_L_VQbhq5/view?usp=sharing
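For anyone who wants to reproduce this kind of interpolation, here is a minimal sketch: encode both prompts with CLIP, linearly interpolate the text features, and feed each intermediate vector to the trained generator. The names `net` and `model.th`, and the omitted decoding step, are placeholders standing in for this repo's models, not its exact API.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, _ = clip.load("ViT-B/32", device=device)

# Placeholder: the trained feed-forward model (CLIP text features -> VQGAN latents).
net = torch.load("model.th", map_location=device)

prompts = ["the sun is a vanishing point absorbing shadows",
           "the moon is a vanishing point absorbing shadows"]
with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    feats = perceptor.encode_text(tokens).float()      # (2, 512)
    for i, t in enumerate(torch.linspace(0, 1, 60)):
        mixed = (1 - t) * feats[0] + t * feats[1]      # linear interpolation
        latents = net(mixed.unsqueeze(0))
        # decode `latents` with VQGAN and save frame i, then assemble with ffmpeg
```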
Not an answer, but perhaps a direction for enquiry: in this repo by @nerdyrodent, the readme spells out a way to alter the weights of the words being passed in: `python generate.py -p "A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"`. For me, this coefficient tacked on to the end provided a way to direct the output with more control, e.g. `tri-x 400tx a cylinder made of coffee beans :0.5`. I need to dig into the code, but this may help transcend this blob problem.
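As far as I can tell, that syntax splits the prompt on `|` into sub-prompts and on `:` into a text/weight pair, and the CLIP loss becomes a weighted sum over the sub-prompts. This is an assumption about the convention rather than @nerdyrodent's exact code, but a parser would look roughly like:

```python
def parse_prompt(prompt):
    """Split 'text:weight' into (text, weight); weight defaults to 1.0."""
    text, _, weight = prompt.partition(":")
    return text.strip(), float(weight) if weight else 1.0

def split_prompts(arg):
    """'a | b:0.5' -> [('a', 1.0), ('b', 0.5)]"""
    return [parse_prompt(p) for p in arg.split("|")]

# The CLIP loss is then a weighted sum over sub-prompts, e.g.
# loss = sum(w * clip_distance(image, text) for text, w in split_prompts(arg))
print(split_prompts("A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"))
```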
Another possible direction: @kevinzakka has a colab notebook here for getting saliency maps out of CLIP for a specific text-image prompt.
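I haven't checked exactly what that notebook does, but a very simple gradient-based version of the idea (not necessarily @kevinzakka's implementation) looks like this: backprop the CLIP image-text similarity to the input pixels and use the gradient magnitude as a saliency map. The file name and prompt below are just examples.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()  # keep everything in fp32 so pixel gradients are straightforward

image = preprocess(Image.open("sample.png")).unsqueeze(0).to(device)
image.requires_grad_(True)
text = clip.tokenize(["a chimera with the head of a fox"]).to(device)

# Similarity between image and prompt; its gradient w.r.t. the pixels
# gives a rough per-pixel saliency map for that prompt.
sim = torch.cosine_similarity(model.encode_image(image),
                              model.encode_text(text)).sum()
sim.backward()
saliency = image.grad.abs().max(dim=1)[0]  # (1, H, W), max over RGB channels
```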
I actually like this so-called "stickiness" effect. It creates some kind of continuity over large amounts of generated images. Is there a way to influence the "sticky spatial shape"? Is it shaped during training, or can it be modified somehow?
@mesw As far as I can tell it is inherited from the biases present in CLIP, and perhaps from the specific captions used to train this repo. @mehdidc Oh yes, I forgot to mention: @crowsonkb suggested that the loss used in CLOOB, "InfoLOOB", would be appropriate for these methods and may help output more diverse generations for a given caption. The linked blog mentions the problem of "explaining away", which seems correlated with this issue, perhaps? Edit: implementation here: https://github.com/ml-jku/cloob
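For reference, my understanding of InfoLOOB (from the CLOOB paper) is that it is InfoNCE with the positive pair left out of the denominator, which is what is supposed to counteract the "explaining away" saturation. A simplified sketch is below; the full CLOOB loss also involves modern Hopfield retrieval, and the temperature value here is just a typical choice, not taken from their code.

```python
import torch
import torch.nn.functional as F

def info_loob(image_feats, text_feats, inv_tau=30.0):
    """Symmetric InfoLOOB: like InfoNCE, but the positive pair is excluded
    from the denominator of each softmax."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = inv_tau * image_feats @ text_feats.t()                     # (N, N)
    positives = logits.diag()
    mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg_img = logits.masked_fill(mask, float("-inf")).logsumexp(dim=1)  # image -> other texts
    neg_txt = logits.masked_fill(mask, float("-inf")).logsumexp(dim=0)  # text -> other images
    return ((neg_img - positives) + (neg_txt - positives)).mean() / 2
```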
I put in the whole of [...]. What I would like to see is a 'seed' input alongside the text input, which provokes the net into producing a completely different output for the same textual input. Is there a way to train that? Could you, for example, have a loss which makes it so that if the net is fed [...]? You can provoke the net into producing different images by random vectors to
It's interesting that all of the above images have a black spot in the bottom right. What's that about? :)
I think the problem might be that you aren't modeling any randomness in the output at all. Your input for a given prompt is deterministic (based on what CLIP says the vector is), and you're then feeding that into the model and optimizing it to produce a single image. Even though that image target is changing every time, the model doesn't know that it's supposed to be able to generate multiple random images for one given prompt. I think the best way to tackle the problem would be to add a latent vector input to the model, e.g. a 32-dim random normal sample. Then the model would hopefully learn that different normal samples with the same fixed prompt mean different images. That way you could explicitly change the overall structure of the image by sampling different latents.
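Something like the following, just to make the suggestion concrete; the generator here is a toy stand-in, not the repo's actual VitGAN/mlp_mixer code, and the dimensions are made up.

```python
import torch
import torch.nn as nn

class LatentConditionedGenerator(nn.Module):
    """Toy generator: the CLIP text embedding is concatenated with a small
    random latent, so the same prompt can map to different outputs."""
    def __init__(self, clip_dim=512, latent_dim=32, out_dim=256 * 16 * 16):
        super().__init__()
        self.latent_dim = latent_dim
        self.net = nn.Sequential(
            nn.Linear(clip_dim + latent_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, out_dim),   # e.g. flattened VQGAN latents
        )

    def forward(self, text_embed, z=None):
        if z is None:
            z = torch.randn(text_embed.size(0), self.latent_dim,
                            device=text_embed.device)
        return self.net(torch.cat([text_embed, z], dim=-1))

gen = LatentConditionedGenerator()
text_embed = torch.randn(1, 512)   # placeholder for a real CLIP text feature
out_a = gen(text_embed)            # same prompt, different latents ->
out_b = gen(text_embed)            # different outputs
```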
@afiaka87 Thanks, will definitely give CLOOB a try; pre-trained models seem to be available (https://ml.jku.at/research/CLOOB/downloads/checkpoints/)
@pwaller Just to be sure I understand: you computed the CLIP text features of all words in

@pwaller @JCBrouwer Yes exactly, generating a diverse set of images from text using a random seed would be a must. I made a simple attempt by concatenating a random normal vector with the CLIP features, generating a set of images from the same text, then computing a diversity loss using VGG features. This kind of diversity loss has already been tried in feed-forward texture synthesis (https://arxiv.org/pdf/1703.01664.pdf). However, it hasn't worked so well so far; it could be that a different feature space (other than VGG) is needed for the diversity term, or a fancier way of incorporating randomness into the architecture than just concatenating the CLIP text features and a random normal vector. Diversity loss + randomness is already possible in the current code (you can check the notebook for an explanation of how to do it), but as I said the results so far are not good enough: the overall quality is reduced, and the diversity loss coefficient has to be small enough, otherwise the images are not recognizable anymore and just end up resembling textures. Here is one attempt with the word 'nebula' on a model trained with diversity loss (each image is a different seed): another with 'castle in the sky':
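For anyone curious what such a diversity term looks like in practice, here is a rough sketch of the idea: generate several images from the same text with different noise vectors and push their VGG features apart. The VGG layer choice, distance, and weighting are assumptions for illustration, not the exact settings used in this repo or in the texture synthesis paper.

```python
import torch
import torchvision.models as models

vgg = models.vgg16(pretrained=True).features[:16].eval()  # up to relu3_3
for p in vgg.parameters():
    p.requires_grad_(False)

def diversity_loss(images):
    """images: (N, 3, H, W) generated from the same text but different noise.
    Returns the negative mean pairwise distance between VGG features,
    so minimizing it pushes the samples apart."""
    feats = vgg(images).flatten(1)                          # (N, D)
    dists = torch.cdist(feats, feats, p=2)                  # (N, N)
    mask = ~torch.eye(images.size(0), dtype=torch.bool, device=images.device)
    return -dists[mask].mean()

# total_loss = clip_loss + lambda_div * diversity_loss(generated_batch)
# lambda_div has to stay small, otherwise outputs degrade into textures.
```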
Yes, that's correct. Sorry I missed this message before. Is there a published model trained with the diversity loss? I don't have the time/capability to easily train one myself for the foreseeable future, and I would be very curious to take a look at outputs with the possibility of varying the seed. Edit: hold on, have I misunderstood? Is it possible to access a pretrained model with the diversity loss via omegaconf somehow?
@pwaller I have trained some models with the diversity loss, like the one above, but haven't made them publicly available yet. Here is a link for the model above.
to sample 8 independent images given the same text. Recently, I have experimented with a different way to incorporate diversity: train a probabilistic model (using normalizing flows from https://compvis.github.io/net2net/) that maps CLIP text features to CLIP image features and, separately, train a model that maps CLIP image features to the VQGAN latent space, trained exactly like the current models in this repo except that the input is image features rather than text features. Combining the two gives a way to generate different images from the same text. Example with "Castle in the sky": it seems it does not always fit the text correctly; this could probably be improved by training a better model, or at least by using CLIP-based re-ranking.
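To make the two-stage setup above concrete, the sampling loop would look roughly like this; `prior.th`, `model_image_feats.th`, and the `sample` method are assumed names for the normalizing-flow prior and the image-feature-conditioned generator, not the actual API of this repo or of net2net.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
perceptor, _ = clip.load("ViT-B/32", device=device)

# Stage 1: normalizing-flow prior, CLIP text features -> CLIP image features (stochastic).
# Stage 2: generator, CLIP image features -> VQGAN latents.
prior = torch.load("prior.th", map_location=device)
generator = torch.load("model_image_feats.th", map_location=device)

text = "castle in the sky"
with torch.no_grad():
    text_feats = perceptor.encode_text(clip.tokenize([text]).to(device)).float()
    for seed in range(8):
        torch.manual_seed(seed)
        image_feats = prior.sample(text_feats)   # assumed sampling interface
        latents = generator(image_feats)
        # decode `latents` with VQGAN; optionally re-rank the 8 candidates
        # by CLIP similarity to the original text.
```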
That looks amazing. Thanks for sharing, really excited to go and play with it. 🚀
Ah, I see I misunderstood: model_with_diversity.th did not produce the above image. Any plans to publish the above model? It looks very fun. Your work here is a lot of fun, many thanks for sharing this with the world :)
@pwaller Happy to know you find it useful and have fun with it :) I have too.
Once you download them, you can try to generate using e.g. the following:
Let me know if it does not work for you.
I get
Oh, actually I made a mistake, the branch is
Merged now into master; explanations are in the README, where I refer to these as "priors".
I've noticed during training that the VitGAN tends to get stuck on one, two, or three (I don't see four happen very often, if at all) "positional blobs", for lack of a better word.
Does this match your experience? Effectively, what I'm seeing is that the VitGAN needs to slide from one generation to the next in its latent space. In doing so, it seems to find that it's easier to just create two "spots" in the image that are highly likely to contain specific concepts from each caption.
Does this match your experience? Any idea whether this is bad or good? In my experience with the "chimera" examples, it seems to hurt things.
I hope you can see what I mean: there's a position in particular that seems designated for the "head" of the animal. But it biases the outputs from other captions as well; for instance:
`tri-x 400tx a cylinder made of coffee beans. a cylinder with the texture of coffee beans.`