
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models

💡Overview

Expert-Controlled Classifier-Free Guidance is a training-free, expert-in-the-loop framework designed to align medical vision-language models (MedVLMs) with clinical expertise. It integrates token-level uncertainty estimation, a BioMedCLIP-based medical multimodal Retrieval-Augmented Generation (RAG) module, and interactive expert revision with highlight-based guidance.
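The two decoding-time ingredients above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: per-token predictive entropy flags uncertain tokens for expert review, and a classifier-free-guidance-style combination steers decoding toward the expert-conditioned distribution. The function names and the guidance weight `w` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the next-token distribution per position, shape (seq_len,).

    Higher entropy marks tokens the model is less certain about --
    candidates for expert revision or highlight-based guidance.
    """
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

def cfg_logits(cond: torch.Tensor, uncond: torch.Tensor,
               w: float = 1.5) -> torch.Tensor:
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return uncond + w * (cond - uncond)

# Toy usage: flag tokens whose entropy exceeds a (hypothetical) threshold.
logits = torch.randn(5, 32000)          # fake decoder output, 5 tokens
flagged = token_entropy(logits) > 9.0
```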

🔨Setup

🔨Installation

conda create -n expert_cfg python=3.10 -y
conda activate expert_cfg
pip install -r requirements.txt

🔨Pre-trained weights

Baseline Model:

Download each to the current directory and merge the files into Phi-3-vision-128k-instruct and Phi-3.5-vision-instruct, respectively.

Medical LoRA:

Our fine-tuned Phi3V-Med and Phi3.5V-Med LoRA weights (links removed to comply with double-blind review):

  • Phi-3V-Med: [Huggingface]
  • Phi-3.5V-Med: [Huggingface]

Download them to the ./lora_weights folder.
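Conceptually, merging a downloaded LoRA adapter into the base model is the standard low-rank fold-in: the update B·A, scaled by alpha/r, is added to each adapted base weight (this is what, e.g., `peft`'s `merge_and_unload()` performs per layer). Shapes and hyperparameters below are illustrative, not the checkpoint's actual configuration.

```python
import torch

def merge_lora(W: torch.Tensor, lora_A: torch.Tensor, lora_B: torch.Tensor,
               alpha: float = 16.0, r: int = 8) -> torch.Tensor:
    """Fold a LoRA adapter into a base weight: W' = W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (lora_B @ lora_A)

W = torch.randn(32, 32)      # toy base weight
A = torch.randn(8, 32)       # (r, in_features)
B = torch.zeros(32, 8)       # (out_features, r); zero-init => no-op update
merged = merge_lora(W, A, B) # equals W until B has been trained
```

After merging, inference needs no adapter modules, so the merged checkpoint can be served like a plain base model.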

Demo

torchrun --nproc_per_node=1 demo.py \
    --bf16 \
    --use_lora \
    --input_json 'examples/input_queries.json' \
    --img_root 'examples/images' \
    --save_path 'examples/results.json' \
    --output_dir './lora_weights/logs_phi35_pubmed_instruct' 
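The schema of `examples/input_queries.json` is not documented here; see the bundled example file for the authoritative format. A purely hypothetical layout for an image-question pair (field names are assumptions) might look like:

```json
[
  {
    "image": "case_001.png",
    "question": "Is there evidence of pneumonia in this radiograph?"
  }
]
```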

Medical Image & Text Encoder for RAG (optional):

Download BiomedCLIP and place it in ./src/backbone/BiomedCLIP.

BiomedCLIP links:

Note: Downloading the weights directly from Huggingface may run into network issues. To facilitate modifications, we converted the original .bin file to PyTorch's .pth format; we recommend using the Baiduyun version.
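Since Hugging Face .bin checkpoints are themselves `torch.save` serializations, the .bin → .pth conversion mentioned above amounts to loading and re-saving the state dict. The function name and paths here are assumptions, not the repo's actual conversion script:

```python
import torch

def convert_bin_to_pth(bin_path: str, pth_path: str) -> None:
    """Re-serialize a Hugging Face .bin state dict as a .pth file."""
    state = torch.load(bin_path, map_location="cpu")
    torch.save(state, pth_path)
```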

📑Data Preparation

Our data mainly comes from the publicly available, free Pathology Education Informational Resource (PEIR) Digital Library. We test our model on:

Medical Alignment and Instruction Tuning:

Prepare BiomedCLIP Pre-extracted Image Features

Note: We recommend using our pre-extracted BioMedCLIP features. The original images can also be found in the links below:

| Dataset | Pre-extracted Features & Original Images |
| ------- | ---------------------------------------- |
| PEIR    | Removed (double-blind)                   |
| PathVQA | Removed (double-blind)                   |
| Slake   | Removed (double-blind)                   |
| RADVQA  | Removed (double-blind)                   |
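With features pre-extracted, the retrieval step of the multimodal RAG reduces to nearest-neighbour search in BiomedCLIP embedding space. A minimal sketch, in which the feature shapes, function name, and `k` are all assumptions:

```python
import torch

def retrieve_topk(query: torch.Tensor, bank: torch.Tensor,
                  k: int = 3) -> torch.Tensor:
    """Indices of the k reference cases most cosine-similar to the query.

    query: (dim,) feature of the query image.
    bank:  (num_cases, dim) matrix of pre-extracted features.
    """
    q = query / query.norm()
    b = bank / bank.norm(dim=-1, keepdim=True)
    return torch.topk(b @ q, k).indices

bank = torch.eye(5)                           # toy feature bank of 5 cases
query = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0])
idx = retrieve_topk(query, bank, k=1)         # nearest case is index 2
```

The retrieved cases (and their reports) would then be injected into the prompt as retrieval-augmented context.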

📝Acknowledgements

We also reference the excellent repos of Phi-3CookBook, HuatuoVision, and BioMedCLIP, in addition to repos specific to the baselines and datasets we examined (see paper).

📝Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper: