<h2 align="center">[AAAI 2025] UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios</h2>

[![hf_space](https://img.shields.io/badge/🤗-%20Open%20In%20HF-blue.svg)](https://opendatalab.github.io/UrBench/) [![arXiv](https://img.shields.io/badge/Arxiv-2408.17267-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2408.17267) [![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://github.com/opendatalab/UrBench?tab=Apache-2.0-1-ov-file#readme)

This repo contains the evaluation code for the paper "[UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios](https://arxiv.org/pdf/2408.17267)" (AAAI 2025).

[**🌐 Homepage**](https://opendatalab.github.io/UrBench/) | [**🤗 Dataset**](https://opendatalab.github.io/UrBench/) | [**📑 Paper**](https://arxiv.org/pdf/2408.17267) | [**💻 Code**](https://github.com/opendatalab/UrBench) | [**📖 arXiv**](https://arxiv.org/abs/2408.17267)

## 🎉 News
* **🔥[2024.12.11]** UrBench has been accepted to AAAI 2025 main track!


## Introduction

We propose <b>UrBench</b>, a multi-view benchmark designed to evaluate LMMs' performance in urban environments. The benchmark comprises 14 urban tasks organized into four dimensions, spanning both region-level evaluations that assess LMMs' capabilities in urban planning and role-level evaluations that examine LMMs' responses to daily issues.

<p align="center">
<img src="./assets/tasks.png" alt="UrBench Overview" style="width: 100%; height: auto;">
</p>


## Comparison with Existing Benchmarks
Compared to previous benchmarks, <b>UrBench</b> offers:

* <i><b>Region-level and role-level questions.</b></i> <b>UrBench</b> contains diverse questions at both the region and role level, while previous benchmarks generally focus on region-level questions.
* <i><b>Multi-view data.</b></i> <b>UrBench</b> incorporates both street-view and satellite-view data, as well as paired cross-view data, whereas prior benchmarks generally evaluate from a single view.
* <i><b>Diverse task types.</b></i> <b>UrBench</b> contains 14 task types categorized into four task dimensions, while previous benchmarks offer only a limited set of task types such as counting and object recognition.

<p align="center">
<img src="./assets/comparison_plot.png" alt="Comparison with existing benchmarks" style="max-width: 85%; height: auto;">
</p>


## Evaluation Results
UrBench poses significant challenges to current SoTA LMMs. The best-performing closed-source model, GPT-4o, and the best open-source model, VILA-1.5-40B, achieve only <b>61.2%</b> and <b>53.1%</b> accuracy, respectively. Interestingly, our findings indicate that the primary limitation of these models lies in comprehending <b>UrBench</b> questions rather than in processing multiple images: multi-image models and their single-image counterparts perform similarly, e.g., LLaVA-NeXT-8B and LLaVA-NeXT-Interleave in the table. Overall, the challenging nature of our benchmark indicates that current LMMs' strong performance on general benchmarks does not generalize to multi-view urban scenarios.

<p align="center">
<img src="./assets/evaluation_results.png" alt="Evaluation results" style="width: 100%; height: auto;">
</p>
<p align="center"><i>Performance of LMMs and human experts on the <b>UrBench</b> test set.</i></p>


## 📊 Evaluation


### 🛠️ Installation
Please clone our repository and change into the project folder:
```bash
git clone https://github.com/opendatalab/UrBench.git
cd UrBench
```

Create a new Python environment and install the requirements:
```bash
conda create -n urbench python=3.10
conda activate urbench
pip install -e .
```
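As an optional sanity check (assuming `pip install -e .` has put the `lmms_eval` module on your path), you can list the tasks registered with the harness; the UrBench task names should appear among them:
```bash
# Optional: list the tasks the harness can see. The `--tasks list` flag
# follows the lm-eval-harness convention; output format may vary by version.
python -m lmms_eval --tasks list
```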

### Start evaluating

Here's an example of running an evaluation on UrBench's test set with TinyLLaVA:
```bash
python -m accelerate.commands.launch \
    --num_processes=2 --main_process_port=10043 \
    -m lmms_eval \
    --model llava_hf \
    --model_args pretrained=bczhou/tiny-llava-v1-hf \
    --tasks citybench_test_all \
    --log_samples --log_samples_suffix tinyllava \
    --output_path ./logs
```
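After the run completes, the harness writes its outputs under `./logs`. As a minimal sketch, assuming lmms-eval's usual layout of a timestamped subdirectory containing a `results.json` summary (filenames may differ across versions), you can pretty-print the latest aggregated scores:
```bash
# Locate the most recent results file and pretty-print it; adjust the search
# pattern if your lmms-eval version nests its outputs differently.
latest=$(find ./logs -name "results.json" -print0 | xargs -0 ls -t | head -n 1)
python -m json.tool "$latest"
```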

## Citation

```bibtex
@article{zhou2024urbench,
title={UrBench: A comprehensive benchmark for evaluating large multimodal models in multi-view urban scenarios},
author={Zhou, Baichuan and Yang, Haote and Chen, Dairong and Ye, Junyan and Bai, Tianyi and Yu, Jinhua and Zhang, Songyang and Lin, Dahua and He, Conghui and Li, Weijia},
journal={arXiv preprint arXiv:2408.17267},
year={2024}
}
```