This repository contains a Streamlit web application designed to stress test Vision-Language Models (VLMs). The application lets users add and compare multiple VLMs on multi-modal inference, facilitating the creation of datasets for question-answering (QA) tasks and beyond.
The VLM Stress Testing Web Application enables users to:
- Upload images and input queries to test different VLMs.
- Compare responses, latencies, and token usage across models.
- Select the best-performing model based on the output and save results to a CSV file.
The tool is useful for evaluating multiple VLMs in a unified interface, enabling insights into model performance for multi-modal question-answering tasks.
- Streamlit: A fast, simple UI for interacting with the VLMs.
- TuneAPI: A proxy API to connect and interact with various VLMs like Llama 3.2, Qwen 2 VL, and GPT 4o.
- Pandas: For managing and saving results to CSV files.
- ColBERT: Used in conjunction with retrieval models to retrieve relevant context (if applicable).
- Llama 3.2: `meta/llama-3.2-90b-vision`
- Qwen 2 VL: `qwen/qwen-2-vl-72b`
- GPT 4o: `openai/gpt-4o`
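The snippet below is a minimal sketch of how these identifiers might be wired into the model selector. The `MODEL_IDS` dictionary and widget labels are illustrative assumptions, not taken from `app.py`.

```python
import streamlit as st

# Hypothetical mapping of display names to the TuneAPI model identifiers
# listed above; the actual app may organize this differently.
MODEL_IDS = {
    "Llama 3.2": "meta/llama-3.2-90b-vision",
    "Qwen 2 VL": "qwen/qwen-2-vl-72b",
    "GPT 4o": "openai/gpt-4o",
}

# Let the user pick the two models to compare.
model_a = st.selectbox("Model A", list(MODEL_IDS.keys()), index=0)
model_b = st.selectbox("Model B", list(MODEL_IDS.keys()), index=1)
```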
- Multi-modal Input: Upload images and ask natural language questions to test the VLMs.
- Model Comparison: Compare multiple VLMs based on response quality, latency, and token usage.
- Dynamic Output: View responses side-by-side and analyze model metrics.
- CSV Logging: Save the best-performing model's results to a CSV file for further analysis.
- Clone the Repository: Clone this repository to your local machine:

  ```bash
  git clone https://github.com/aryankargwal/genai-tutorials.git
  cd genai-tutorials/vlm-comparison
  ```
- Install Dependencies: Install the required dependencies from the `requirements.txt` file:

  ```bash
  pip install -r requirements.txt
  ```
- Set Up API Keys: Export your TuneAPI key to connect to the VLMs:

  ```bash
  export API_KEY="your_api_key_here"
  ```
- Run the Application: Run the Streamlit app to start stress testing VLMs:

  ```bash
  streamlit run app.py
  ```
- Upload Images & Input Questions (a minimal UI sketch follows this list):
  - Upload an image (JPG, JPEG, or PNG).
  - Enter a question for the VLMs to answer.
  - Select two models to compare.
  - View and compare model responses, latencies, and token counts.
- Save Results: Once the responses are generated, select the best-performing model and save the result to a CSV file.
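The sketch below shows roughly how the upload-and-ask flow above could look in Streamlit. It is an assumption-based outline, not the repository's actual `app.py`; the widget labels and variable names are illustrative.

```python
import os

import streamlit as st

# Read the TuneAPI key exported during the setup step above.
api_key = os.environ.get("API_KEY")

st.title("VLM Stress Testing")

# Multi-modal inputs: an image plus a natural-language question.
uploaded_image = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
question = st.text_input("Enter a question for the VLMs to answer")

if uploaded_image is not None and question:
    st.image(uploaded_image, caption="Input image")
    # The querying, comparison, and logging steps described below run here.
```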
- Image Upload & Encoding: Users upload an image, which is encoded to base64 for model input.
- Model Querying: The app queries selected models with the image and question. Each model processes the image and generates a response.
- Latency Tracking: The app measures and displays the latency for each model's response.
- Token Count: The app calculates and shows the token count for each model's generated output.
- Result Logging: After selecting the best model, the app saves the responses, latencies, and token counts to a CSV file for further analysis.
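As a rough illustration of this pipeline, the sketch below encodes the uploaded image, queries one model, and tracks latency plus an approximate token count. The endpoint URL, header format, and payload shape are placeholders that assume an OpenAI-style chat completions API; consult `app.py` or the TuneAPI documentation for the exact request format.

```python
import base64
import os
import time

import requests


def encode_image(uploaded_file) -> str:
    """Encode the uploaded image to base64 so it can be embedded in the request."""
    return base64.b64encode(uploaded_file.getvalue()).decode("utf-8")


def query_model(model_id: str, image_b64: str, question: str):
    """Query one model with the image and question, returning answer, latency, tokens."""
    payload = {
        "model": model_id,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }
    # Header name/format is an assumption; check the TuneAPI docs for the real scheme.
    headers = {"Authorization": os.environ["API_KEY"]}

    start = time.perf_counter()
    response = requests.post(
        "https://example-vlm-proxy/chat/completions",  # placeholder URL, not the real endpoint
        json=payload, headers=headers, timeout=120,
    )
    latency = time.perf_counter() - start

    answer = response.json()["choices"][0]["message"]["content"]
    # Rough token count via whitespace splitting; the app may instead use the
    # provider's reported usage figures.
    token_count = len(answer.split())
    return answer, latency, token_count
```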
Users can perform inference using image inputs combined with text-based questions to test how well VLMs handle multi-modal reasoning.
Compare two models side-by-side in terms of:
- Response Quality: Generated answer to the user-provided question.
- Latency: Time taken by each model to generate a response.
- Token Count: Number of tokens generated by each model, useful for understanding efficiency.
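Reusing the hypothetical `MODEL_IDS`, `encode_image`, and `query_model` names from the sketches above, a side-by-side comparison could be rendered along these lines; the layout and metric labels are assumptions, not the app's exact code.

```python
import streamlit as st

# Query both selected models with the same image and question.
image_b64 = encode_image(uploaded_image)
answer_a, latency_a, tokens_a = query_model(MODEL_IDS[model_a], image_b64, question)
answer_b, latency_b, tokens_b = query_model(MODEL_IDS[model_b], image_b64, question)

# Show the two models side by side with their responses and metrics.
col_a, col_b = st.columns(2)
for col, name, answer, latency, tokens in [
    (col_a, model_a, answer_a, latency_a, tokens_a),
    (col_b, model_b, answer_b, latency_b, tokens_b),
]:
    with col:
        st.subheader(name)
        st.write(answer)
        st.metric("Latency (s)", f"{latency:.2f}")
        st.metric("Token count", tokens)
```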
Users can log their model comparison results, including the selected best model, to a CSV file with the following information:
- Image path
- Question
- Responses from both models
- Latency and token count for each model
- The selected best model
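A minimal logging helper along these lines could append each comparison to a CSV with pandas; the column names below are assumptions based on the fields listed above, not the app's exact schema.

```python
import os

import pandas as pd


def log_result(csv_path, image_path, question, answer_a, answer_b,
               latency_a, latency_b, tokens_a, tokens_b, best_model):
    """Append one comparison row to the results CSV."""
    row = pd.DataFrame([{
        "image_path": image_path,
        "question": question,
        "response_model_a": answer_a,
        "response_model_b": answer_b,
        "latency_model_a": latency_a,
        "latency_model_b": latency_b,
        "tokens_model_a": tokens_a,
        "tokens_model_b": tokens_b,
        "best_model": best_model,
    }])
    # Append to the CSV, writing the header only when the file does not yet exist.
    row.to_csv(csv_path, mode="a", header=not os.path.exists(csv_path), index=False)
```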
- Enhanced Dataset Creation: Expand support to automatically generate datasets from the saved model outputs.
- Fine-Tuning Scripts: Add scripts for fine-tuning models based on user data or custom datasets.
- Additional VLM Support: Include more Vision-Language Models to extend comparison options.
This project is licensed under the Apache 2.0 License. See the full license here.