NLLB-CLIP with SigLIP + small tokenizer fix #741

Merged
merged 13 commits into mlfoundations:main on Nov 22, 2023

Conversation

@visheratin
Contributor

Hi! I trained NLLB-CLIP models with SigLIP (ViT and loss). They perform much better than the previous version across all benchmarks.

I'm also working on integrating the multilingual benchmarks from the paper into the CLIP benchmark. To make it work with the NLLB tokenizer, I had to change the tokenizer method to batch_encode_plus because the default __call__ doesn't take language-specific prefix tokens into account.
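
For context, here is a rough sketch of the kind of encoding this enables; the checkpoint name, captions, and context length are illustrative, not the actual open_clip or CLIP benchmark code:

```python
# Illustrative only: encode multilingual captions with an NLLB tokenizer so
# that the language-specific prefix token (here "fra_Latn") is applied.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
tokenizer.src_lang = "fra_Latn"  # selects the language-specific prefix token

captions = ["un chat assis sur un canapé", "deux chiens dans un parc"]

# batch_encode_plus is the method the PR switches the tokenizer wrapper to.
encoded = tokenizer.batch_encode_plus(
    captions,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=77,
)
print(encoded["input_ids"].shape)  # each row includes the fra_Latn prefix token
```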

@rom1504
Collaborator

rom1504 commented Nov 18, 2023 via email

@visheratin
Contributor Author

Here are the results for Crossmodal-3600 and XTD10 datasets. I didn't evaluate the models on English-only datasets. I think it may make sense to add a separate benchmark CSV file for multilingual models to the docs.

@rom1504
Collaborator

rom1504 commented Nov 19, 2023 via email

@visheratin
Contributor Author

Can you tell me its model id and pretrained name? I have the testbed set up and running right now.

Regarding other benchmarks, NLLB-CLIP base and large outperform SigLIP ViT-G (page 16) on text-to-image.

@visheratin
Contributor Author

@gabrielilharco can you share the script you use to create benchmark CSV files for the repo (like this) from CLIP benchmark outputs?

@rom1504
Collaborator

rom1504 commented Nov 19, 2023

xlm-roberta-large-ViT-H-14

@visheratin
Contributor Author

Here is the file. Very impressive results! NLLB-CLIP large is a bit better on text-to-image. I'm wondering why there is such a discrepancy between t2i and i2t results for my models; maybe the text encoder is undertrained.

@rom1504
Collaborator

rom1504 commented Nov 20, 2023

Thanks, interesting indeed!

The PR LGTM.

I think adding a mention of this new model in this section https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#nllb would be good, so people have a chance to discover the model (by looking at the OpenCLIP docs).

@rom1504
Collaborator

rom1504 commented Nov 20, 2023

This is how I trained the xlm-roberta-large-ViT-H-14 model: https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#vit-h14-xlm-roberta-large

Looking at your paper https://arxiv.org/pdf/2309.01859.pdf, the freezing method seems a bit similar; I froze the image encoder but not the text encoder. However, I trained on very different data: noisy multilingual captions from Laion5B (vs. the much cleaner but smaller dataset you used).

I see you evaluate only on retrieval. I had also evaluated on ImageNet with translated class names and found that the model performs better than previous ones, but quite poorly in absolute terms (for example, 56% in Italian while the model gets 78% for the same in English). I am not sure what the cause is, but that may be of interest to you.

(just FYI)
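
As a rough illustration of the translated-class-names evaluation described above, a minimal sketch could look like the following; the pretrained tag, prompt template, Italian class list, and image file are placeholders, not the actual setup:

```python
# Hypothetical sketch: zero-shot classification with translated class names.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "xlm-roberta-large-ViT-H-14", pretrained="frozen_laion5b_s13b_b90k"
)
tokenizer = open_clip.get_tokenizer("xlm-roberta-large-ViT-H-14")

# Placeholder Italian class names and prompt template.
classnames_it = ["gatto", "cane", "aeroplano"]
text = tokenizer([f"una foto di un {c}" for c in classnames_it])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(probs)
```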

@visheratin
Contributor Author

Thanks! I added a bit more info on NLLB-CLIP to the doc. I'll add more info about evals when I figure out how to make the eval CSV file readable - it has too many dimensions (language, i2t/t2i, recall@k).
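
One possible way to flatten those dimensions into a readable table, assuming a hypothetical long-format file with model/language/direction/k/recall columns (all names here are placeholders):

```python
# Hypothetical sketch: pivot long-format retrieval results into a wide table.
import pandas as pd

results = pd.read_csv("multilingual_eval_long.csv")  # placeholder file name
wide = results.pivot_table(
    index=["model", "language"],
    columns=["direction", "k"],  # e.g. ("text_to_image", 5)
    values="recall",
)
wide.round(4).to_csv("multilingual_eval_wide.csv")
```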

@visheratin
Contributor Author

Regarding tasks, my original interest when starting the project was in multilingual retrieval. Because of that, I evaluated the model only on this task.

I'll work on compiling something like ImageNet-200 when I have time.

@gabrielilharco
Collaborator

gabrielilharco commented Nov 21, 2023

@visheratin looks good from my end too. My scripts for running evals on the 38 datasets still need some cleaning up; I plan to do that in the future and push them so everyone can run them easily. Meanwhile, I'm running evals for the 2 new models and will update here with the results once it's done.

@visheratin
Contributor Author

@gabrielilharco thank you! In the meantime, I will compile a CSV for multilingual retrieval for NLLB-CLIP and XLM-RoBERTa.

@gabrielilharco
Collaborator

@visheratin I added eval results and profiling numbers for the new models

@visheratin
Contributor Author

Thanks! The models are still far from the top of the dashboard, but they are 10% better than the first version =)

@gabrielilharco I just added a CSV with the benchmark results for NLLB-CLIP and XLM-RoBERTa. Can you please take a look?

@gabrielilharco
Collaborator

Thanks @visheratin! Can you make the numbers in the new CSV have fewer significant digits? E.g. 0.8569999933242798 becomes 0.8570. I think it's a bit easier to read this way.

It would be nice to add the other NLLB models to that table as well. Ideally all models, actually. Would this be too expensive for you to run? If so, I can try running on my end if you share the scripts.
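
A quick way to do the rounding, assuming the CSV added in this PR (the file name is a placeholder):

```python
# Round every numeric column to 4 decimal places; file name is a placeholder.
import pandas as pd

df = pd.read_csv("nllb_multilingual_retrieval.csv")
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].round(4)  # 0.8569999933242798 -> 0.857
df.to_csv("nllb_multilingual_retrieval.csv", index=False)
```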

@visheratin
Contributor Author

@gabrielilharco I updated the CSV file with the fixed numbers.

Regarding benchmarking all models, I've reached my quota on GCP, where the test bench is deployed, so I'll only be able to run the full tests in December, when the quota resets. To run the tests, you'd need the CLIP benchmark version from that PR, which depends on this PR.

I propose we hold off on the multilingual benchmark CSV until we have the results for all models. I'll remove the CSV from this PR and create a separate PR when I have all the results. What do you think?

@gabrielilharco
Collaborator

Sounds good to me. Thanks @visheratin!

@gabrielilharco gabrielilharco merged commit 29b90b8 into mlfoundations:main Nov 22, 2023
5 checks passed
@BIGBALLON

Hi @visheratin, thanks for your great work! Is there any plan to add the NLLB-CLIP models (with SigLIP) to timm?

@visheratin
Contributor Author

As far as I remember, timm is a pure CV library. The image encoder used in NLLB-CLIP with SigLIP already exists in timm, if you want to use it on its own. The best way to use the NLLB-CLIP models is via OpenCLIP (once the next version is released).
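
A short sketch of how that usage would look through OpenCLIP; the model and pretrained tags ("nllb-clip-base-siglip", "v1") are assumptions based on this PR, not confirmed release names:

```python
# Assumed model/pretrained names; adjust to whatever the release registers.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "nllb-clip-base-siglip", pretrained="v1"
)
tokenizer = open_clip.get_tokenizer("nllb-clip-base-siglip")

image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a photo of a cat", "una foto de un gato"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
```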

@BIGBALLON

@visheratin yeah, thanks, looking forward to the next version of open_clip.

@rom1504
Collaborator

rom1504 commented Nov 24, 2023 via email

@visheratin
Contributor Author

@rom1504 good to know, thank you! I'll wait until I benchmark all models on multilingual retrieval datasets and then create a release PR.

Interpause pushed a commit to Interpause/open_clip that referenced this pull request May 23, 2024
* Added configs.

* Added links to pretrained models.

* Add NLLB-CLIP base/large results

* Added new version of NLLB-CLIP.

* Added more info on NLLB-CLIP.

* add eval results and profiling

* Added file with benchmarks.

* Fixed CSV file.

* Updated CSV file.

---------

Co-authored-by: Gabriel Ilharco Magalhães <[email protected]>
Co-authored-by: Gabriel Ilharco <[email protected]>