Draft of feat: allow "negative" text queries #19
Conversation
Force-pushed from f1c1aef to 8af4280.
Thank you! I like your suggestion; I responded in the GitHub issue: #18 (comment)
Force-pushed from 373ee7b to 6d26173.
I updated the branch and rebased it to a single commit. I think I incorporated most of the feedback, but …
Looks great! Thank you 😄 Can you please address a few nits and change "phrase" to "query"? Then we will ship.
Do you have time to add some tests? It would be great to have one that checks that we don't break the search when introducing changes like this one. It would make sense to PR this test separately.
Force-pushed from 1959d1f to af08bd8.
I'd be happy to add some tests if there were a template/framework for me to add to, even if the initial template is just the equivalent of … I think I incorporated the recent feedback in the pull request, but I do agree with the idea of adding tests first before merging.
@ramayer, OK, let's leave tests aside for now. I'll take a look at them when I have time. It makes sense to me to store a dozen test images in the git repo; let's just make sure that they are not heavy.
Thank you! A few more comments regarding the computation of the features.
I agree with all your suggestions, and thanks for teaching me better numpy tricks. I'm too busy with work to get to them today, but I should have a cleaner pull request on the weekend.
CLIP's preprocess transform seems to scale everything to 224x224 (at least at the settings rclip uses), so it should work with pretty light test images. But hmm, that means CLIP is missing small details from high-resolution pictures, which makes me want to consider another possible feature request: maybe I want to index multiple different CLIP vectors from different crops of my larger pictures.
Sure. No rush. Thank you for doing this! :-) |
You'd be surprised how well 224x224 works. It will be interesting to experiment with different resolutions or crops, but I suspect that 224x224 should be good enough for rclip, and it is much faster than using multiple crops or a higher resolution. Usually, researchers tend to go with the smallest resolution that produces decent results.
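If the multi-crop idea is ever explored, a minimal sketch of extracting overlapping square crops is below. This is purely illustrative, not anything rclip does; the crop size and stride are arbitrary choices, and NumPy slicing stands in for real image decoding.

```python
import numpy as np

def overlapping_crops(img, size=224, stride=112):
    """Yield square (size, size, channels) crops, stepping by `stride` pixels."""
    h, w = img.shape[:2]
    for top in range(0, max(h - size, 0) + 1, stride):
        for left in range(0, max(w - size, 0) + 1, stride):
            yield img[top:top + size, left:left + size]

# Stand-in for a decoded 448x448 RGB photo; a real pipeline would then run
# each crop through CLIP's preprocess + image encoder, producing several
# vectors per source image.
image = np.zeros((448, 448, 3), dtype=np.uint8)
crops = list(overlapping_crops(image))  # 3x3 grid of crops for this input
```

Whether the extra index size and encoding time are worth it for rclip is exactly the open question discussed above.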
Force-pushed from 66da726 to f79e2de.
Updated with the most recent feedback. Now using …
rclip/model.py (outdated):

```python
similarities = (text_features @ item_features.T).squeeze(0).tolist()
positive_features = self.compute_text_features(positive_queries)
negative_features = self.compute_text_features(negative_queries)
text_features = np.add.reduce(positive_features) - np.add.reduce(negative_features)
```
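The idea in this diff can be illustrated with plain NumPy. Random vectors stand in for real CLIP embeddings here, and the batch-dimension `squeeze(0)` from the diff is omitted because the combined query vector is already 1-D; this is a sketch, not rclip's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for 512-d CLIP embeddings of query phrases and indexed images.
positive_features = rng.normal(size=(2, 512))
negative_features = rng.normal(size=(1, 512))
item_features = rng.normal(size=(10, 512))

# Sum the positive query vectors and subtract the sum of the negative ones ...
text_features = np.add.reduce(positive_features) - np.add.reduce(negative_features)

# ... then rank the indexed items by dot-product similarity.
similarities = (text_features @ item_features.T).tolist()
```

The items with the highest similarity scores are the search results.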
This is good enough for now 👍, but we could have done something like `np.add.reduce(self.compute_text_features(all_queries) * [1, 1, 1, -1, -1])` to compute all of the features at once. The array of ones and negative ones should be pre-created.
I thought about that, but I think the multiply is more expensive. If this times it correctly:

```python
import numpy as np
import timeit

npos = nneg = 1000
pos = np.random.rand(npos, 512)
neg = np.random.rand(nneg, 512)
timeit.timeit(lambda: np.add.reduce(pos) - np.add.reduce(neg), number=1000)

all_features = np.random.rand(npos + nneg, 512)
signs = np.array([1] * npos + [-1] * nneg)
timeit.timeit(lambda: np.add.reduce(signs[:, None] * all_features), number=1000)
```

I'm getting

```
>>> timeit.timeit(lambda: np.add.reduce(signs[:, None] * all_features), number=1000)
2.642763003001164
>>> timeit.timeit(lambda: np.add.reduce(pos) - np.add.reduce(neg), number=1000)
0.7372207119988161
```
I also don't find the array of signs as easy to read (but maybe that's just me).
@ramayer, thank you for the benchmark! I agree that readability suffers from the "sign array" approach. My main concern was with the performance difference of calling compute_text_features once vs. twice.
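The once-vs-twice question can be sanity-checked with a toy encoder: batch both query lists through one call, split the result, and verify that the combined vector matches the two-call version. The linear `encode` below is a stand-in for `compute_text_features`, not the real model.

```python
import numpy as np

rng = np.random.default_rng(42)
proj = rng.normal(size=(16, 512))  # toy "encoder": a fixed linear map

def encode(tokens):
    """Stand-in for compute_text_features: maps token vectors to embeddings."""
    return tokens @ proj

pos_tokens = rng.normal(size=(3, 16))
neg_tokens = rng.normal(size=(2, 16))

# Two separate encoder calls:
two_calls = np.add.reduce(encode(pos_tokens)) - np.add.reduce(encode(neg_tokens))

# One batched call, split afterwards:
feats = encode(np.concatenate([pos_tokens, neg_tokens]))
one_call = np.add.reduce(feats[:3]) - np.add.reduce(feats[3:])

assert np.allclose(one_call, two_calls)
```

For a real transformer the batched call would avoid one model invocation's worth of overhead, which is presumably the concern above; the combined result is the same either way.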
Hi! Thank you for the update. One nit.
Can you please also rebase on top of master?
Force-pushed from 27b5e5d to 4874780.
Two questions on the latest PR:
E.g., wouldn't it make more sense to get rid of the redundant -a option to simplify things for users?
I think the problem is that the current code assumes that there is always a negative query. One solution would be something like: …
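The concrete solution was not preserved in this thread, but one plausible shape for such a guard is sketched below: fall back to the plain positive sum when the negative list is empty. This is an illustration under that assumption, not the actual rclip code.

```python
import numpy as np

def combine_features(positive, negative=None):
    """Sum the positive feature vectors and, if any negative feature
    vectors were supplied, subtract their sum."""
    text_features = np.add.reduce(positive)
    if negative is not None and len(negative) > 0:
        text_features = text_features - np.add.reduce(negative)
    return text_features

rng = np.random.default_rng(1)
pos = rng.normal(size=(2, 512))
no_negatives = combine_features(pos, None)            # works without negatives
with_empty = combine_features(pos, np.zeros((0, 512)))  # empty list also fine
```

With a guard like this, running rclip without any subtractive phrases would behave exactly as before.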
Rebased with commit message: "updates based on GitHub code review feedback (GitHub issue yurijmikhalevich#18)" Co-authored-by: Yurij Mikhalevich <[email protected]>
Force-pushed from 4874780 to 33dd995.
I think I fixed the bug. I also created a branch without the explicit … I'm not sure how to re-open the pull request, though :( It seems it was closed when I was updating the branch.
Attempt at reopening using this technique: https://gist.github.com/guille-moe/cd41fdbc8969b15428a50af2543a5cfa. Sorry about my trouble with GitHub.
Force-pushed from 0aee0f7 to 33dd995.
Hi @talpay, thank you for commenting. This is where I stood initially, but I changed my opinion recently. I've checked, and … More importantly, I believe that you will almost always be able to substitute …
@ramayer, merged! Thank you! Great job! 😄 |
@yurijmikhalevich Thanks for the reply (and the awesome project). I've had a closer look and you're right: they're quite different, and it is actually a very important distinction that should be clearly communicated:
These two might still get similar results but if we shift one more word, it should become clear:
The first two queries will also return grayscale ("black and white") images of zebras, whereas the last query most likely will not, because "black and white" is not part of the embeddings (you can test this on @ramayer's web UI). After checking the code, I agree that this syntax is better, but I would suggest making this behavior clearer by adding some good examples to the README.md, e.g. like the ones above, but also showing how you can chain multiple subtractions together, like this: …
Maybe also add the fact that we can actually use quotation marks in queries, e.g. … On a final note, I could (subjectively) still imagine a single-string syntax like …
@talpay, thank you! I am happy to hear that you like rclip. And thanks for a more detailed explanation. This is exactly what I was talking about. I'll definitely add examples to the README before releasing the v2. And I agree with your outlook on a single-string syntax. I was considering something like that too and came to a similar conclusion. I also think that separate arguments just look cleaner. |
Here's a great example of the differences between some ways of processing phrases:
Also interestingly:
I'm kind of forced to do that on the web UI (unless I want a multi-field advanced-search form), but it becomes hard to remember and excruciatingly painful to communicate. I even started going down the path of crazy math for embeddings, starting with +2.5(dragon) +1.5(castle) to give different weights to the different phrases, and started looking into other operators (rotate the vector for "dragon" 30% toward the direction of the vector for "castle"). And as we're discussing, so many symbols mean something to CLIP (like the phrase ❤️🌭) that it becomes hard to express the different strings you might want without a syntax that's really robust at quoting weird characters. That path leads to madness, or JSON, or Lisp. I'm even thinking of just switching to a JSON syntax for all my web UI's math operators (which I already do in places like this where I hadn't bothered to come up with a clean syntax).
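The +2.5(dragon) +1.5(castle) idea boils down to a weighted sum of unit vectors, re-normalized before ranking. A minimal NumPy sketch, with random vectors standing in for the real "dragon" and "castle" embeddings:

```python
import numpy as np

rng = np.random.default_rng(7)

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

dragon = unit(rng.normal(size=512))  # stand-in for the "dragon" embedding
castle = unit(rng.normal(size=512))  # stand-in for the "castle" embedding

# "+2.5(dragon) +1.5(castle)": weight each phrase, sum, re-normalize.
query = unit(2.5 * dragon + 1.5 * castle)
```

The resulting unit vector can be ranked against image features with the same dot-product similarity used for plain queries; the weights only change the direction of the combined vector, not the ranking mechanism.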
@ramayer, these are great examples indeed. It's also interesting to see how CLIP "distinguishes" between different kinds of quotes. I suspect that it's an artifact/bias learned from a dataset that inconsistently used both types of quotes. |
This is a draft of a patch for something similar to issue #18, though it was not implemented exactly according to the requirements described in that issue.
Instead of a single string containing both the positive and negative clauses, I think it would be cleaner if the additive and subtractive phrases used separate command-line parameters, like: …
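The concrete command line was not preserved above, but the separate-parameters idea could look something like this argparse sketch. The flag name `--subtract` is an illustrative guess, not the option name the patch actually uses.

```python
import argparse

# Hypothetical CLI shape: positive phrases as positionals, negatives via a flag.
parser = argparse.ArgumentParser(prog="rclip")
parser.add_argument("query", nargs="+", help="positive query phrases")
parser.add_argument("--subtract", action="append", default=[],
                    help="phrase whose embedding is subtracted from the query")

args = parser.parse_args(["a photo of a zebra", "--subtract", "black and white"])
```

Keeping the phrases in separate arguments sidesteps the quoting problems a single-string mini-syntax would run into.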
More details are mentioned in a comment under #18 .
If you think this is a good direction, I could clean it up more (add examples to the docs and remove a no-longer-used method) and re-submit it.