Inference Optimal VLMs Need Only One Visual Token but Larger Models

Kevin Y. Li*, Sachin Goyal*, João D. Semedo, J. Zico Kolter
Paper: https://arxiv.org/abs/2411.03312v1

Our repo contains two components: our QueCC token compression algorithm and our scaling law fitting code. The QueCC algorithm compresses tokens via a cross-attention mechanism that utilizes query-based convolutional downsampling.

Scaling Law Findings and code

Our scaling laws find that, for various visual understanding and reasoning tasks under a fixed VLM inference compute budget, it is optimal to use the largest LLM component that fits the budget and pay for it by reducing the number of visual tokens. In fact, performance on these tasks varies $5\times$ faster when adjusting the number of LLM parameters than when adjusting the number of visual tokens. However, for OCR-like tasks, the opposite is true: the number of visual tokens is more important than the size of the LLM.
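To make the fixed-compute trade-off concrete, here is a minimal sketch. It is not taken from scaling_law_code/. It assumes a common approximation of about 2 FLOPs per LLM parameter per input token, and the model sizes, text-token count, and budget are hypothetical values chosen only for illustration.

```python
# Illustrative sketch only: the cost model and every number below are assumptions,
# not values from this repo's scaling-law code.

def inference_flops(llm_params: float, text_tokens: int, visual_tokens: int) -> float:
    """Approximate forward-pass cost: ~2 FLOPs per parameter per input token."""
    return 2.0 * llm_params * (text_tokens + visual_tokens)


def max_visual_tokens(budget_flops: float, llm_params: float, text_tokens: int) -> int:
    """Largest visual-token count that keeps an LLM of this size within the budget."""
    total_tokens = int(budget_flops // (2.0 * llm_params))
    return max(total_tokens - text_tokens, 0)


# Hypothetical budget: the cost of a 0.5B-parameter LLM reading
# 30 text tokens plus 576 visual tokens.
TEXT_TOKENS = 30
budget = inference_flops(0.5e9, TEXT_TOKENS, 576)

# Larger LLMs must use fewer visual tokens to stay within the same budget.
for params in (0.5e9, 1.8e9, 4e9, 7e9):
    v = max_visual_tokens(budget, params, TEXT_TOKENS)
    print(f"{params / 1e9:.1f}B LLM -> at most {v} visual tokens")
```

Under this cost model, each point along a fixed-budget curve trades LLM size against visual tokens. The scaling-law finding above says that, for visual understanding and reasoning tasks, the better points sit at the large-LLM, few-token end.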

The relevant code that we used to fit our scaling laws can be found in the scaling_law_code/ folder and the instructions can be found in its README.

Query Based Token Compression

Our repo is built upon the original LLaVA repo. We thank the authors for releasing their codebase. Setup, training hyperparameters, etc., are the same as detailed in the linked repo.

The QueCC token compression module is spread across several files. These components can be copied over to one's own repository:

train has been adjusted to include more model arguments that allow for token compression flexibility (e.g., changing the downsampling rate) and to pass in pointers to the LLM and tokenizer when processing images for the query-based compression.

llava_arch has been adjusted to include the additional model arguments and preprocessing of user queries for compression.

clip_encoder has been modified to pass in the required features for compression.

The projector builder has been changed to include QueCC, and QueCC's implementation can be found in a new file.
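As a rough guide to what the downsampling rate means for token count, the sketch below works through the arithmetic. It assumes the vision encoder is CLIP ViT-L/14 at 336 px as in LLaVA-1.5 (a 24 x 24 grid, 576 tokens) and that downsampling uses non-overlapping square windows. The helper name is hypothetical and is not part of this repo.

```python
# Illustrative arithmetic only; tokens_after_downsampling is a hypothetical helper.

def tokens_after_downsampling(grid_side: int, stride: int) -> int:
    """Visual tokens left after downsampling a square token grid.

    Assumes non-overlapping windows, so the stride must divide the grid side.
    """
    if stride <= 0 or grid_side % stride != 0:
        raise ValueError(f"stride {stride} must be positive and divide {grid_side}")
    side = grid_side // stride
    return side * side


# Assumed encoder: 336 px image with 14 px patches gives 336 / 14 = 24 patches per side.
GRID_SIDE = 336 // 14

for stride in (1, 2, 3, 4, 6, 12, 24):
    count = tokens_after_downsampling(GRID_SIDE, stride)
    print(f"stride {stride:>2} -> {count} visual tokens")
```

With these assumptions, a stride equal to the full grid side collapses the image to a single visual token, which is the extreme case named in the paper title.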

Example pretraining and finetuning scripts can be found here and here, respectively.

Citation

If you find our scaling laws or compression algorithm valuable or insightful, please cite our paper:

@misc{li2024inferenceoptimalvlmsneed,
      title={Inference Optimal VLMs Need Only One Visual Token but Larger Models}, 
      author={Kevin Y. Li and Sachin Goyal and Joao D. Semedo and J. Zico Kolter},
      year={2024},
      eprint={2411.03312},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03312}, 
}
