We evaluate EVE on a diverse set of vision-language benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding; we do not use beam search, so that inference is consistent with the real-time outputs of the chat demo.
Currently, we mostly rely on the official toolkits or servers for evaluation.
You can evaluate EVE on your custom datasets by converting them to EVE's JSONL format and running `eve/eval/model_vqa.py`.
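As a rough illustration of the conversion step, the sketch below writes a questions file in the JSONL layout that LLaVA-style `model_vqa.py` scripts typically expect. The field names (`question_id`, `image`, `text`) are an assumption based on that convention; check `eve/eval/model_vqa.py` for the exact keys it reads.

```python
# Sketch: convert a custom dataset to a model_vqa.py-style questions JSONL.
# Field names follow the LLaVA-style convention and are an assumption here.
import json

def write_questions_jsonl(samples, path):
    """samples: iterable of (question_id, image_filename, question_text)."""
    with open(path, "w") as f:
        for qid, image, text in samples:
            record = {"question_id": qid, "image": image, "text": text}
            f.write(json.dumps(record) + "\n")

def read_questions_jsonl(path):
    """Read a JSONL file back into a list of dicts (one per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

Each line of the output file is one standalone JSON object, which is what lets the evaluation scripts shard the file across GPUs.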
Below we provide a general guideline for evaluating datasets with some common formats.
- Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```
- Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```
- Natural QA (e.g. LLaVA-Bench, MM-Vet).
No postprocessing is needed.
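The two templates above can be assembled mechanically. A minimal sketch (helper names are ours, not part of EVE's codebase):

```python
# Sketch: build the two evaluation prompt formats described above.
def short_answer_prompt(question):
    """Short-answer format (e.g. VQAv2, MME)."""
    return f"{question}\nAnswer the question using a single word or phrase."

def multiple_choice_prompt(question, options):
    """Option-only multiple-choice format (e.g. MMBench, SEED-Bench).
    options: list of choice strings; letters A, B, C, ... are assigned in order."""
    lines = [question]
    for letter, opt in zip("ABCDEFGH", options):
        lines.append(f"{letter}. {opt}")
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)
```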
- You MUST first download EVE's playground.zip before preparing task-specific data. It contains custom annotations, scripts, and prediction files from EVE. Extract it to `./playground/`. This also provides the general directory structure for all datasets.
- You can then run `bash scripts/eve/test_all_benchmark.sh` for all tasks, or verify each task individually with the scripts below.
### VQAv2

- Download `test2015` and put it under `./playground/data/eval/vqav2`.
- Multi-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eve/eval/vqav2.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:8 bash scripts/eve/eval/vqav2.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.
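The multi-GPU scripts above typically shard the question file into one chunk per GPU and merge the per-chunk answer files afterwards. The sketch below shows the LLaVA-style chunking logic such scripts commonly use; whether EVE's scripts use exactly these helpers is an assumption.

```python
# Sketch of the chunking logic multi-GPU eval scripts typically use:
# split the question list into n near-equal contiguous chunks, one per GPU.
import math

def split_list(lst, n):
    """Split lst into n (roughly) equal-sized contiguous chunks."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return the k-th of n chunks (the shard processed by GPU k)."""
    return split_list(lst, n)[k]
```

Note that with `ceil`-based chunk sizes the last chunk can be larger or smaller than the others; concatenating all chunks always reproduces the original list.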
### GQA

- Download the data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`. You may need to modify `eval.py` due to missing assets in the GQA v1.2 release.
- Multi-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eve/eval/gqa.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:8 bash scripts/eve/eval/gqa.sh ${CKPT_NAME} ${CKPT_PATH}
```
### VizWiz

- Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
- Single-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/vizwiz.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/vizwiz.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.
### ScienceQA

- Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
- Single-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/sqa.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/sqa.sh ${CKPT_NAME} ${CKPT_PATH}
```
### TextVQA

- Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa`.
- Single-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/textvqa.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/textvqa.sh ${CKPT_NAME} ${CKPT_PATH}
```
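For context on what the TextVQA score means: like VQAv2, TextVQA uses the soft VQA accuracy, which credits a prediction in proportion to how many of the human annotators gave the same answer. The sketch below shows the simplified core of that metric; the official scorer additionally normalizes answers (case, punctuation, articles) and averages over leave-one-out subsets of the ten annotations.

```python
def vqa_soft_accuracy(prediction, gt_answers):
    """Simplified VQA accuracy: min(#matching human answers / 3, 1).
    Matching 3 or more of the (typically 10) human answers scores 1.0.
    The official scorer also applies answer normalization and subset
    averaging, which this sketch omits."""
    matches = sum(1 for a in gt_answers if a == prediction)
    return min(matches / 3.0, 1.0)
```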
### POPE

- Download the 2014 Val images (41K/6GB) and put them as `val2014` under `./playground/data/eval/pope`.
- Single-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/pope.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/pope.sh ${CKPT_NAME} ${CKPT_PATH}
```
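POPE is a binary yes/no hallucination benchmark, and its evaluation reports accuracy, precision, recall, F1, and the fraction of "yes" predictions (treating "yes" as the positive class). A self-contained sketch of those metrics, under the assumption that answers have already been normalized to the strings `"yes"`/`"no"`:

```python
def pope_metrics(predictions, labels):
    """predictions, labels: equal-length lists of 'yes'/'no' strings.
    Treats 'yes' as the positive class, as POPE does."""
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / len(labels),  # fraction of 'yes' predictions
    }
```

A well-calibrated model should have a `yes_ratio` close to the dataset's actual positive rate; a hallucinating model tends to over-answer "yes".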
### MME

- Download the data following the official instructions.
- Put the downloaded images into `MME_Benchmark_release_version`.
- Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME`.
- Single-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/mme.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/mme.sh ${CKPT_NAME} ${CKPT_PATH}
```
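For reference, the official `eval_tool` scores each MME subtask as the sum of accuracy and accuracy+ (both in percent), where accuracy+ only credits an image if both of its two yes/no questions are answered correctly. A minimal sketch of that scoring rule:

```python
def mme_score(results):
    """results: dict image_id -> list of two booleans, one per question
    (MME asks two yes/no questions per image).
    Subtask score = acc + acc+ (max 200), where acc+ counts an image
    only if both of its questions are answered correctly."""
    answers = [c for pair in results.values() for c in pair]
    acc = 100.0 * sum(answers) / len(answers)
    acc_plus = 100.0 * sum(all(pair) for pair in results.values()) / len(results)
    return acc + acc_plus
```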
### MMBench

- Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
- Multi-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eve/eval/mmbench_en.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:8 bash scripts/eve/eval/mmbench_en.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.
### MMBench-CN

- Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
- Multi-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eve/eval/mmbench_cn.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:8 bash scripts/eve/eval/mmbench_cn.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.
### SEED-Bench

- Follow the official instructions to download the images and videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
- Extract the middle frame from each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one.
- Multi-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/eve/eval/seed.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:8 bash scripts/eve/eval/seed.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Optionally, submit the results in `./playground/data/eval/seed_bench/answers_upload` to the leaderboard using the official Jupyter notebook.
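The middle-frame extraction step above can be sketched as follows. This is an OpenCV-based stand-in for `extract_video_frames.py`, not the actual script, and the function names are ours; the provided script may decode and name frames differently.

```python
# Sketch of middle-frame extraction (a stand-in for extract_video_frames.py).
def middle_frame_index(num_frames):
    """Index of the middle frame of a video with num_frames frames."""
    return num_frames // 2

def extract_middle_frame(video_path, out_path):
    """Decode the middle frame of video_path and save it as an image."""
    import cv2  # imported lazily so middle_frame_index stays dependency-free
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, middle_frame_index(n))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not read frame from {video_path}")
    cv2.imwrite(out_path, frame)
```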
### LLaVA-Bench-in-the-Wild

- Extract the contents of `llava-bench-in-the-wild` to `./playground/data/eval/llava-bench-in-the-wild`.
- Single-GPU inference and evaluation.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/llavabench.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/llavabench.sh ${CKPT_NAME} ${CKPT_PATH}
```
### MM-Vet

- Extract `mm-vet.zip` to `./playground/data/eval/mmvet`.
- Single-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/mmvet.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/mmvet.sh ${CKPT_NAME} ${CKPT_PATH}
```
- Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.
### Q-Bench

- Download `llvisionqa_dev.json` (for the `dev` subset) and `llvisionqa_test.json` (for the `test` subset). Put them under `./playground/data/eval/qbench`.
- Download and extract the images, and put all of them directly under `./playground/data/eval/qbench/images_llviqionqa`.
- Single-GPU inference.
```shell
# for single node inference
CUDA_VISIBLE_DEVICES=0 bash scripts/eve/eval/qbench.sh ${CKPT_NAME} ${CKPT_PATH}
# for slurm inference
srun -p your_partition --gres gpu:1 bash scripts/eve/eval/qbench.sh ${CKPT_NAME} ${CKPT_PATH}
```
We currently only support evaluation of the `dev` subset on your local machine.