Exploring the Design Space of Visual Context Representation in Video MLLMs

📰 News

[2024.10.12] Released the inference code of Opt-Visor.

🛠️ Requirements

Python == 3.10.12
CUDA Version == 12.4

pip install -r requirements.txt
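
Before installing, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `check_python` and `detect_cuda_version` are hypothetical, and the CUDA check assumes PyTorch is among the installed requirements. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the driver version shown by `nvidia-smi`.

```python
import sys

REQUIRED_PYTHON = (3, 10, 12)  # pinned in the Requirements section
REQUIRED_CUDA = "12.4"         # pinned in the Requirements section


def check_python(version=None, required=REQUIRED_PYTHON):
    """Return True if the interpreter version equals the pinned version.

    `version` defaults to the running interpreter's (major, minor, micro).
    """
    current = tuple(version) if version is not None else tuple(sys.version_info[:3])
    return current == tuple(required)


def detect_cuda_version():
    """Return the CUDA version PyTorch was built against, or None.

    Assumption: PyTorch is installed via requirements.txt. If it is not
    importable, or it is a CPU-only build, this returns None.
    """
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Calling `check_python()` with no arguments checks the running interpreter, and comparing `detect_cuda_version()` against `REQUIRED_CUDA` flags a mismatched PyTorch build.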

🌍 Model Zoo

Model Name	Visual Encoder	Language Decoder	# Training Frames	Tokens per Frame
Opt-Visor-120frame-49token-Qwen2-7B	siglip-so400m-patch14-384	Qwen2-7B	120	49
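
As a rough illustration of what the two rightmost columns imply, the sketch below multiplies frames by tokens per frame to estimate the number of visual tokens per video. It assumes every frame contributes exactly the listed number of tokens and that no extra separator or special tokens are added, which the table does not state. The function name `visual_token_count` is hypothetical.

```python
def visual_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Estimate the visual tokens per video as frames x tokens per frame.

    Assumption: each frame yields exactly `tokens_per_frame` tokens and
    nothing else is inserted between frames.
    """
    if num_frames < 0 or tokens_per_frame < 0:
        raise ValueError("counts must be non-negative")
    return num_frames * tokens_per_frame


# Values from the Model Zoo row: 120 training frames, 49 tokens per frame.
print(visual_token_count(120, 49))  # 5880
```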

🤖 Inference

Run the following command to get the model's response to an instruction:

python inference.py \
       --model_path /path/to/Opt-Visor \
       --gpu_id 0 \
       --video_path /path/to/your/video \
       --question "Please describe the video in detail."
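
To ask several questions or process several videos, the same command can be driven from Python. This is a minimal sketch that uses only the four flags shown above. The helper names (`build_inference_command`, `format_command`, `run_inference`) are hypothetical, and it assumes `inference.py` prints the response to standard output, which the README does not confirm.

```python
import shlex
import subprocess
import sys


def build_inference_command(model_path, video_path, question, gpu_id=0,
                            script="inference.py"):
    """Build the argument list for the documented inference.py invocation."""
    return [
        sys.executable, script,
        "--model_path", str(model_path),
        "--gpu_id", str(gpu_id),
        "--video_path", str(video_path),
        "--question", question,
    ]


def format_command(cmd):
    """Render an argument list as a shell-safe string, e.g. for logging."""
    return shlex.join(cmd)


def run_inference(model_path, video_path, question, gpu_id=0):
    """Run inference.py once and return its captured standard output.

    Assumption: the script writes the model's response to stdout.
    Raises subprocess.CalledProcessError if the script exits with an error.
    """
    cmd = build_inference_command(model_path, video_path, question, gpu_id)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems when a question contains spaces or quotes. Each call to `run_inference` starts a new process and reloads the model, so this suits a handful of queries rather than large batches.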

To Do List

Release the inference code.
Release the model.
