Multilingual benchmark datasets #113
Also, this PR depends on the PR in the OpenCLIP repo because we need to change how the text is tokenized to properly support the NLLB tokenizer.
The OpenCLIP PR is merged. @mehdidc could you please check the PR?
Checking now, thanks a lot @visheratin for the PR!
@visheratin Many thanks, all the datasets work fine! I am trying to reproduce the numbers in your paper, but I see some gaps.
I got the following results:
Is there anything I am missing?
@mehdidc The changes from the PR in OpenCLIP haven't been released yet, so the tokenizer doesn't work correctly. You need to install OpenCLIP from the main branch. If you run the benchmark on the Regarding the numbers from the paper, I ran those tests on a slightly different version of the model, so the numbers for
@visheratin I was using the main branch actually, will rerun with the siglip version to compare with the CSV, thank you.
@mehdidc Looks like the tokenizer works fine. The difference is because the models are not exactly the same. For some languages, the model from the paper is better; for others, the OpenCLIP version is better. I will run the CLIP benchmark on all models from OpenCLIP this week, so we will have a full picture for multilingual retrieval.
Cool, sounds good! I ran the siglip version of the base model and it is matching the CSV.
Merging.
Hi!
When working on the paper about NLLB-CLIP models, I benchmarked the models on multiple multilingual datasets. I thought it would be a good idea to add these benchmarks to this repo to be able to test other multilingual models.
The changes in this PR:
- `set_language` method to set the target language for the NLLB tokenizer.

@mehdidc @JeniaJitsev
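
To illustrate the idea behind a `set_language` method, here is a minimal, self-contained sketch. It is not the code from this PR or from OpenCLIP: the class `ToyLanguageTokenizer`, the whitespace splitting, and the choice to prepend the language tag are all assumptions made for illustration. Only the method name `set_language` comes from the description above; the tags `eng_Latn` and `deu_Latn` follow the NLLB language-code style.

```python
from dataclasses import dataclass


@dataclass
class ToyLanguageTokenizer:
    """Hypothetical stand-in for a tokenizer that tags text with a language code."""

    language: str = "eng_Latn"

    def set_language(self, language: str) -> None:
        # Switch the language tag used by all later tokenization calls.
        self.language = language

    def __call__(self, texts):
        # Whitespace splitting stands in for real subword tokenization;
        # prepending the tag is an assumption of this toy example.
        return [[self.language] + text.split() for text in texts]


tokenizer = ToyLanguageTokenizer()
tokenizer.set_language("deu_Latn")
tokens = tokenizer(["ein Hund im Park"])
print(tokens)
```

In a benchmark loop, the language would presumably be set once per dataset language before the captions are encoded, so the same tokenizer object can be reused across languages.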