AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification
AiSciVision is a general framework that enables Large Multimodal Models (LMMs) to adapt to niche image classification tasks. The framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools utilized in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. Then the LMM agent actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction.
Link to AiSciVision paper: https://arxiv.org/abs/2410.21480
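The end-to-end loop is easiest to see in code. Below is a minimal sketch in Python; every name in it (`Example`, `most_similar`, the `lmm` object's methods) is a hypothetical stand-in for illustration, not the repository's actual API. See `aiSciVision.py` for the real orchestration.

```python
"""Minimal sketch of the AiSciVision prediction loop. All names here are
illustrative stand-ins, not the repository's actual API; see aiSciVision.py,
visualRAG.py, and tools/ for the real orchestration."""
from dataclasses import dataclass

import numpy as np


@dataclass
class Example:
    embedding: np.ndarray  # image embedding, e.g. from a CLIP encoder
    label: int             # 1 = positive class, 0 = negative class


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def most_similar(query: np.ndarray, pool: list[Example]) -> Example:
    """VisRAG retrieval: the nearest labeled example by cosine similarity.

    Assumes the pool is non-empty for the requested label."""
    return max(pool, key=lambda ex: cosine(query, ex.embedding))


def classify(target: np.ndarray, pool: list[Example], lmm, tools: dict, rounds: int = 3) -> int:
    # (1) VisRAG: retrieve the most similar positive and negative labeled
    #     images and give them to the LMM as context for the target image.
    conversation = [
        ("positive_example", most_similar(target, [ex for ex in pool if ex.label == 1])),
        ("negative_example", most_similar(target, [ex for ex in pool if ex.label == 0])),
        ("target", target),
    ]

    # (2) Agentic tool use: over several rounds the LMM selects a tool to
    #     manipulate or inspect the target image, refining its analysis.
    for _ in range(rounds):
        tool_name = lmm.choose_tool(conversation, list(tools))
        conversation.append(("tool_result", tools[tool_name](target)))

    # (3) Final prediction from the accumulated conversation.
    return lmm.predict(conversation)
```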
We recommend using Python 3.9+ and a CUDA-capable GPU.
Create a conda environment using the provided `environment.yml`:

```bash
conda env create -f environment.yml
conda activate aiscivision
```
The framework requires two API keys:
- OpenAI API Key. Required for accessing GPT-4V or other OpenAI LMMs. Get your key at: https://platform.openai.com/api-keys
- Google Maps API Key. Required for the satellite imagery tooling. To obtain:
  - Create a Google Cloud project
  - Enable the Maps JavaScript API and Static Maps API
  - Create credentials at: https://console.cloud.google.com/apis/credentials
  - Enable billing (required for API access)
Set your API keys as environment variables:

```bash
export OPENAI_API_KEY=`cat openai_api_key.txt`
export GMAPS_API_KEY=`cat gmaps_api_key.txt`
```
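To confirm the keys are visible to downstream code before launching experiments, a quick sanity check from Python (the variable names follow the exports above):

```python
import os

# Both keys must be present in the environment before running experiments.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
assert os.environ.get("GMAPS_API_KEY"), "GMAPS_API_KEY is not set"
```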
Run all baseline and AiSciVision experiments for a dataset. The `solar` dataset is publicly available; for the `aquaculture` and `eelgrass` datasets, please contact the authors.

```bash
# replace <dataset> with: aquaculture, eelgrass, or solar
bash final_exps.sh <dataset>
```
The framework is designed to be modular and extensible. Take these steps to apply AiSciVision to your own dataset (a sketch of the new classes follows this list):

- Add the dataset name and tools to `config.py`, and update the argument parsing in `utils.py`
- Create a dataset class in `dataloaders/datasets.py` implementing the abstract `ImageDataset` class
- Define a prompt schema in `promptSchema.py` inheriting from `BasePromptSchema`
- Create tools in `tools/<dataset>.py` extending the `Tool` base class
- Run experiments with `bash final_exps.sh <dataset>`
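As a rough illustration, here is a hypothetical sketch of the two classes a new dataset needs. The method names and signatures below are assumptions made for the example; match them to the actual abstract interfaces in `dataloaders/datasets.py` and the `Tool` base class.

```python
from PIL import Image


class MyDataset:  # would inherit from the abstract ImageDataset class
    """Hypothetical dataset: yields (PIL image, binary label) pairs."""

    def __init__(self, root: str):
        self.root = root
        self.samples = []  # fill with (image_path, label) pairs

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        path, label = self.samples[idx]
        return Image.open(path).convert("RGB"), label


class CenterZoomTool:  # would extend the Tool base class in tools/<dataset>.py
    """Hypothetical tool: crop and enlarge the center of the target image."""

    name = "center_zoom"
    description = "Zoom into the center of the image for a closer look."

    def run(self, image: Image.Image) -> Image.Image:
        w, h = image.size
        crop = image.crop((w // 4, h // 4, 3 * w // 4, 3 * h // 4))
        return crop.resize((w, h))
```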
We welcome contributions! Please:

- Fork the repository
- Create a feature branch
- Make your changes
- Run the linting checks (`make lint` and `make fix-lint`)
- Submit a pull request
For major changes, please open an issue first to discuss the proposed changes.
The included `Makefile` provides utilities for maintaining code quality:

- `make lint`: Run code linting
- `make fix-lint`: Auto-fix linting issues
- `make find-todos`: Find TODO comments
- `make find-text SEARCH_STRING="aiscivision"`: Search the codebase for specific text
`final_exps.py` and `final_exps.sh` execute all experiments.

Core framework:

- `main.py`: Experiment runner.
- `aiSciVision.py`: AiSciVision framework. Manages conversation state and history with the LMM, and orchestrates between the LMM, the VisRAG system, and tool execution.
- `visualRAG.py`: Visual RAG system. Implements prompts for retrieval-augmented generation on visual tasks.
- `promptSchema.py`: Prompt management. Defines prompt templates (visual context, tool use, initial/final prompts) for LMM use.
- `lmm.py`: Large Multimodal Model interface. Transforms the conversation into the turn-style message format the LMM API can parse (an illustration of this format follows the list). Extensible to other APIs and models.
- `embeddingModel.py`: Embedding models. Handles image preprocessing for the Visual RAG system.
- `tools/`: Tool definitions and implementations.
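To make the turn-style format concrete, here is what a message list for the OpenAI chat API can look like. The `image_turn` helper and the prompts are hypothetical, not code from `lmm.py`; the message structure itself follows OpenAI's vision API format.

```python
import base64


def image_turn(text: str, image_path: str) -> dict:
    """Hypothetical helper: one user turn containing text plus a base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }


messages = [
    {"role": "system", "content": "You are an expert scientific image analyst."},
    image_turn("Does this satellite image contain an aquaculture pond?", "target.png"),
]
# client.chat.completions.create(model="gpt-4o", messages=messages)
```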
Baselines:

- `main_knn.py`: Experiment runner for the KNN baseline. See the model in `models/knn_classifier.py`.
- `main_clip_zero_shot.py`: Experiment runner for the CLIP zero-shot baseline (a minimal sketch of this approach follows the list). See the model in `models/clip_classifier.py`.
- `main_clip_supervised.py`: Experiment runner for the CLIP + MLP supervised baseline. See the model in `models/clip_classifier.py`.
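For reference, a minimal CLIP zero-shot sketch in the spirit of that baseline. The class prompts and image path are illustrative assumptions; `main_clip_zero_shot.py` and `models/clip_classifier.py` are the authoritative implementations.

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative binary prompts; the real baseline defines its own per dataset.
image = preprocess(Image.open("target.png")).unsqueeze(0).to(device)
prompts = clip.tokenize([
    "a satellite image of an aquaculture pond",
    "a satellite image with no aquaculture pond",
]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print("positive-class probability:", probs[0, 0].item())
```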
Other utilities:

- `config.py`: Common variables used throughout.
- `utils.py`: Experiment argument definitions, logging functions, and evaluation metric functions.
- `create_test_set_selection.py`: Helper script to save an ordering of test samples, useful for reproducing experiments.
This project is licensed under the MIT License - see the LICENSE file for details.
Please use the following citation if you find our work useful:

```bibtex
@article{hogan2024aiscivision,
  title={{AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification}},
  author={Brendan Hogan and Anmol Kabra and Felipe Siqueira Pacheco and Laura Greenstreet and Joshua Fan and Aaron Ferber and Marta Ummus and Alecsander Brito and Olivia Graham and Lillian Aoki and Drew Harvell and Alex Flecker and Carla Gomes},
  year={2024},
  journal={arXiv preprint arXiv:2410.21480},
  url={https://arxiv.org/abs/2410.21480},
}
```