AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification
AiSciVision is a general framework that enables Large Multimodal Models (LMMs) to adapt to niche image classification tasks. The framework uses two key components: (1) Visual Retrieval-Augmented Generation (VisRAG) and (2) domain-specific tools utilized in an agentic workflow. To classify a target image, AiSciVision first retrieves the most similar positive and negative labeled images as context for the LMM. Then the LMM agent actively selects and applies tools to manipulate and inspect the target image over multiple rounds, refining its analysis before making a final prediction.
Link to AiSciVision paper: https://arxiv.org/abs/2410.21480
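The end-to-end loop is easiest to see in code. Below is a minimal sketch in Python; every name in it (`Example`, `most_similar`, the `lmm` object's methods) is a hypothetical stand-in for illustration, not the repository's actual API. See `aiSciVision.py` for the real orchestration.

```python
"""Minimal sketch of the AiSciVision prediction loop. All names here are
illustrative stand-ins, not the repository's actual API; see aiSciVision.py,
visualRAG.py, and tools/ for the real orchestration."""
from dataclasses import dataclass

import numpy as np


@dataclass
class Example:
    embedding: np.ndarray  # image embedding, e.g. from a CLIP encoder
    label: int             # 1 = positive class, 0 = negative class


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def most_similar(query: np.ndarray, pool: list[Example]) -> Example:
    """VisRAG retrieval: the nearest labeled example by cosine similarity.

    Assumes the pool is non-empty for the requested label."""
    return max(pool, key=lambda ex: cosine(query, ex.embedding))


def classify(target: np.ndarray, pool: list[Example], lmm, tools: dict, rounds: int = 3) -> int:
    # (1) VisRAG: retrieve the most similar positive and negative labeled
    #     images and give them to the LMM as context for the target image.
    conversation = [
        ("positive_example", most_similar(target, [ex for ex in pool if ex.label == 1])),
        ("negative_example", most_similar(target, [ex for ex in pool if ex.label == 0])),
        ("target", target),
    ]

    # (2) Agentic tool use: over several rounds the LMM selects a tool to
    #     manipulate or inspect the target image, refining its analysis.
    for _ in range(rounds):
        tool_name = lmm.choose_tool(conversation, list(tools))
        conversation.append(("tool_result", tools[tool_name](target)))

    # (3) Final prediction from the accumulated conversation.
    return lmm.predict(conversation)
```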
We recommend using Python 3.9+ and a CUDA-capable GPU.
Create a conda environment using the provided `environment.yml`:

```bash
conda env create -f environment.yml
conda activate aiscivision
```
The framework requires two API keys:
- OpenAI API Key. Required for accessing GPT-4V or other OpenAI LMMs. Get your key at: https://platform.openai.com/api-keys
- Google Maps API Key. Required for the satellite imagery tooling. To obtain:
  - Create a Google Cloud project
  - Enable the Maps JavaScript API and Static Maps API
  - Create credentials at: https://console.cloud.google.com/apis/credentials
  - Enable billing (required for API access)
Set your API keys as environment variables:

```bash
export OPENAI_API_KEY=`cat openai_api_key.txt`
export GMAPS_API_KEY=`cat gmaps_api_key.txt`
```
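To confirm the keys are visible to downstream code before launching experiments, a quick sanity check from Python (the variable names follow the exports above):

```python
import os

# Both keys must be present in the environment before running experiments.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
assert os.environ.get("GMAPS_API_KEY"), "GMAPS_API_KEY is not set"
```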
Run all baseline and AiSciVision experiments for a dataset. The `solar` dataset is publicly available; for the `aquaculture` and `eelgrass` datasets, please contact the authors.

```bash
# replace <dataset> with: aquaculture, eelgrass, or solar
bash final_exps.sh <dataset>
```
The framework is designed to be modular and extensible. Take these steps to apply AiSciVision to your own dataset (a sketch of the new classes follows this list):

- Add the dataset name and tools to `config.py`, and update the argument parsing in `utils.py`
- Create a dataset class in `dataloaders/datasets.py` implementing the abstract `ImageDataset` class
- Define a prompt schema in `promptSchema.py` inheriting from `BasePromptSchema`
- Create tools in `tools/<dataset>.py` extending the `Tool` base class
- Run experiments with `bash final_exps.sh <dataset>`
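As a rough illustration, here is a hypothetical sketch of the two classes a new dataset needs. The method names and signatures below are assumptions made for the example; match them to the actual abstract interfaces in `dataloaders/datasets.py` and the `Tool` base class.

```python
from PIL import Image


class MyDataset:  # would inherit from the abstract ImageDataset class
    """Hypothetical dataset: yields (PIL image, binary label) pairs."""

    def __init__(self, root: str):
        self.root = root
        self.samples = []  # fill with (image_path, label) pairs

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        path, label = self.samples[idx]
        return Image.open(path).convert("RGB"), label


class CenterZoomTool:  # would extend the Tool base class in tools/<dataset>.py
    """Hypothetical tool: crop and enlarge the center of the target image."""

    name = "center_zoom"
    description = "Zoom into the center of the image for a closer look."

    def run(self, image: Image.Image) -> Image.Image:
        w, h = image.size
        crop = image.crop((w // 4, h // 4, 3 * w // 4, 3 * h // 4))
        return crop.resize((w, h))
```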
We welcome contributions! Please:

- Fork the repository
- Create a feature branch
- Make your changes
- Run the linting checks (`make lint` and `make fix-lint`)
- Submit a pull request
For major changes, please open an issue first to discuss the proposed changes.
The included `Makefile` provides utilities for maintaining code quality:

- `make lint`: Run code linting
- `make fix-lint`: Auto-fix linting issues
- `make find-todos`: Find TODO comments
- `make find-text SEARCH_STRING="aiscivision"`: Search the codebase for specific text
`final_exps.py` and `final_exps.sh` execute all experiments.

Core framework:

- `main.py`: Experiment runner.
- `aiSciVision.py`: AiSciVision framework. Manages conversation state and history with the LMM, and orchestrates between the LMM, the VisRAG system, and tool execution.
- `visualRAG.py`: Visual RAG system. Implements prompts for retrieval-augmented generation on visual tasks.
- `promptSchema.py`: Prompt management. Defines prompt templates (visual context, tool use, initial/final prompts) for LMM use.
- `lmm.py`: Large Multimodal Model interface. Transforms the conversation into the turn-style message format the LMM API can parse (an illustration of this format follows the list). Extensible to other APIs and models.
- `embeddingModel.py`: Embedding models. Handles image preprocessing for the Visual RAG system.
- `tools/`: Tool definitions and implementations.
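To make the turn-style format concrete, here is what a message list for the OpenAI chat API can look like. The `image_turn` helper and the prompts are hypothetical, not code from `lmm.py`; the message structure itself follows OpenAI's vision API format.

```python
import base64


def image_turn(text: str, image_path: str) -> dict:
    """Hypothetical helper: one user turn containing text plus a base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }


messages = [
    {"role": "system", "content": "You are an expert scientific image analyst."},
    image_turn("Does this satellite image contain an aquaculture pond?", "target.png"),
]
# client.chat.completions.create(model="gpt-4o", messages=messages)
```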
Baselines:

- `main_knn.py`: Experiment runner for the KNN baseline. See the model in `models/knn_classifier.py`.
- `main_clip_zero_shot.py`: Experiment runner for the CLIP zero-shot baseline (a minimal sketch of this approach follows the list). See the model in `models/clip_classifier.py`.
- `main_clip_supervised.py`: Experiment runner for the CLIP + MLP supervised baseline. See the model in `models/clip_classifier.py`.
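For reference, a minimal CLIP zero-shot sketch in the spirit of that baseline. The class prompts and image path are illustrative assumptions; `main_clip_zero_shot.py` and `models/clip_classifier.py` are the authoritative implementations.

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative binary prompts; the real baseline defines its own per dataset.
image = preprocess(Image.open("target.png")).unsqueeze(0).to(device)
prompts = clip.tokenize([
    "a satellite image of an aquaculture pond",
    "a satellite image with no aquaculture pond",
]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print("positive-class probability:", probs[0, 0].item())
```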
Other utilities:

- `config.py`: Common variables used throughout.
- `utils.py`: Experiment argument definitions, logging functions, and evaluation metric functions.
- `create_test_set_selection.py`: Helper script to save an ordering of test samples, useful for reproducing experiments.
This project is licensed under the MIT License - see the LICENSE file for details.
Please use the following citation if you find our work useful:

```bibtex
@article{hogan2024aiscivision,
  title={{AiSciVision: A Framework for Specializing Large Multimodal Models in Scientific Image Classification}},
  author={Brendan Hogan and Anmol Kabra and Felipe Siqueira Pacheco and Laura Greenstreet and Joshua Fan and Aaron Ferber and Marta Ummus and Alecsander Brito and Olivia Graham and Lillian Aoki and Drew Harvell and Alex Flecker and Carla Gomes},
  year={2024},
  journal={arXiv preprint arXiv:2410.21480},
  url={https://arxiv.org/abs/2410.21480},
}
```