Inference Optimal VLMs Need Only One Visual Token but Larger Models
Kevin Y. Li*, Sachin Goyal*, João D. Semedo, J. Zico Kolter
Paper: https://arxiv.org/abs/2411.03312v1
Our repo contains two components: our QueCC token compression algorithm and our scaling law fitting code. The QueCC algorithm compresses tokens via a cross-attention mechanism that utilizes query-based convolutional downsampling.
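To make the idea of query-based downsampling followed by cross-attention concrete, here is a minimal pure-Python sketch. It is not the QueCC implementation: average pooling stands in for the learned convolution, there are no learned projections, conditioning on the user's text query is omitted, and restricting each query to its own window is an assumption made for brevity. The names `mean_pool`, `softmax`, and `compress_tokens` are hypothetical.

```python
import math


def mean_pool(window):
    """Average a list of equal-length vectors into one vector."""
    dim = len(window[0])
    return [sum(vec[d] for vec in window) / len(window) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(grid, stride):
    """Compress an H x W grid of token vectors to (H/stride) * (W/stride) tokens.

    Assumption: each stride x stride window yields one output token. The
    window mean plays the role of the downsampled query, and it attends over
    the tokens of that window (keys and values are the raw tokens).
    """
    height, width = len(grid), len(grid[0])
    if height % stride or width % stride:
        raise ValueError("stride must divide both grid dimensions")
    dim = len(grid[0][0])
    compressed = []
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            window = [
                grid[r][c]
                for r in range(top, top + stride)
                for c in range(left, left + stride)
            ]
            query = mean_pool(window)
            # Scaled dot-product attention scores between the query and each key.
            scores = [
                sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                for key in window
            ]
            weights = softmax(scores)
            # Output token is the attention-weighted sum of the window's tokens.
            compressed.append(
                [sum(w * vec[d] for w, vec in zip(weights, window)) for d in range(dim)]
            )
    return compressed
```

Calling `compress_tokens(grid, 2)` on a 4 x 4 grid returns 4 tokens, and a stride of 4 returns a single token.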
Our scaling laws find that, for a range of visual understanding and reasoning tasks under a fixed VLM inference compute budget, it is optimal to use the largest LLM component possible, which is made affordable by reducing the number of visual tokens. In fact, performance on these types of tasks varies
The code we used to fit our scaling laws is in the scaling_law_code/ folder, with instructions in that folder's README.
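The trade-off between LLM size and visual token count under a fixed budget can be illustrated with simple arithmetic. The sketch below assumes that inference cost is proportional to the number of LLM parameters times the total number of input tokens (text plus visual). That cost model, the function names, and the example numbers are assumptions for illustration and are not taken from scaling_law_code/.

```python
def inference_cost(n_params, text_tokens, visual_tokens):
    """Assumed cost model: parameters times total input tokens."""
    return n_params * (text_tokens + visual_tokens)


def visual_token_budget(cost_budget, n_params, text_tokens):
    """Largest visual-token count an LLM with n_params can afford.

    Returns 0 when the text tokens alone already exhaust the budget.
    """
    return max(0, cost_budget // n_params - text_tokens)


# Illustrative numbers only: a 7B-parameter LLM, 50 text tokens, 576 visual tokens.
budget = inference_cost(7_000_000_000, text_tokens=50, visual_tokens=576)

# Doubling the LLM size at the same budget leaves room for fewer visual tokens.
tokens_for_14b = visual_token_budget(budget, 14_000_000_000, text_tokens=50)
```

Under this assumed cost model, every increase in LLM size must be paid for with fewer visual tokens, which is the regime the scaling laws examine.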
Our repo is built upon the original LLaVA repo. We thank the authors for releasing their codebase. Setup, training hyperparameters, etc., are the same as detailed in the linked repo.
The QueCC token compression module is spread across several files. These components can be copied into one's own repository:
train has been adjusted to accept additional model arguments that make token compression configurable (e.g., changing the downsampling rate) and to pass references to the LLM and tokenizer when processing images for query-based compression.
llava_arch has been adjusted to include the additional model arguments and preprocessing of user queries for compression.
clip_encoder has been modified to pass in the required features for compression.
The projector builder has been changed to include QueCC, and QueCC's implementation can be found in a new file.
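As a rough guide to what the downsampling rate controls, the sketch below computes how many visual tokens remain after downsampling a square token grid. It assumes a 24 x 24 grid (576 tokens), which is what a CLIP ViT-L/14 encoder produces at 336 px resolution, and assumes the rate divides the grid side evenly. The function name is hypothetical and is not an argument or helper from this repo.

```python
def tokens_after_downsampling(grid_side, rate):
    """Number of visual tokens left after downsampling a square token grid.

    grid_side: tokens per side before compression (assumed 24 here).
    rate: downsampling factor applied along each spatial axis.
    """
    if rate < 1 or grid_side % rate:
        raise ValueError("rate must be a positive divisor of grid_side")
    return (grid_side // rate) ** 2


# Token counts for every rate that evenly divides a 24 x 24 grid.
token_counts = {
    rate: tokens_after_downsampling(24, rate)
    for rate in (1, 2, 3, 4, 6, 8, 12, 24)
}
```

At the largest rate the whole grid collapses to a single visual token, which is the extreme setting named in the paper's title.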
Example pretraining and finetuning scripts can be found here and here, respectively.
If you find our scaling laws or compression algorithm valuable or insightful, please cite our paper:
@misc{li2024inferenceoptimalvlmsneed,
title={Inference Optimal VLMs Need Only One Visual Token but Larger Models},
author={Kevin Y. Li and Sachin Goyal and Joao D. Semedo and J. Zico Kolter},
year={2024},
eprint={2411.03312},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.03312},
}