Enhancing UI Location Capabilities of Autonomous Agents

📺 Demo

output.mp4

☀️ Introduction

With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. Although multimodal large language models (MLLMs) like GPT-4V excel at tasks such as drafting emails, they struggle with GUI interactions, which limits their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements.

ClickAgent significantly outperforms other prompt-based autonomous agents (such as CogAgent, AppAgent, and Auto-UI) on the AITW benchmark. Our evaluation was conducted on both an Android smartphone emulator and an actual Android smartphone, using the task success rate as the key metric for measuring agent performance.

🚀 Getting Started

🔧 Installation

pip install -r requirements.txt

🤖 Android Environment Setup

Download the Android Debug Bridge.
Turn on the ADB debugging switch on your Android phone, it needs to be turned on in the developer options first.If it is the HyperOS system, you need to turn on USB Debugging (Security Settings) at the same time.
Connect your phone to the computer with a data cable and select "Transfer files".
Test your ADB environment as follow: /path/to/adb devices. If the connected devices are displayed, the preparation is complete.
If you are using a MAC or Linux system, make sure to turn on adb permissions as follow: sudo chmod +x /path/to/adb
If you are using Windows system, the path will be xx\xx\adb.exe

In case you need additional information about ADB, you can find it here: ADB Docs

📋 Used apps

Show list

1. Clock
2. Calendar
3. Files
4. Messages
5. Contacts
6. Calculator
7. Settings
8. Gmail
9. Google Chrome
10. Google Maps
11. Google Play
12. Google Movies
13. Google Photos
14. YouTube 
15. YouTube Music
16. Netflix
17. Spotify
18. Amazon Alexa
19. Amazon Music
20. Amazon Prime
21. X (Twitter)
22. Facebook
23. Instagram
24. Pandora
25. Yahoo
26. Yelp
27. eBay
28. Wikipedia

📱 Emulator

Download and install Android Studio
Tools -> Device Manager
Create Virtual Device
Information about device, such as AVD ID and snapshot names you can find in details of created emulator

⚙️ Configuration

All configurations are stored in config.ini. Every parameter can be overridden as a command line argument.

List of arguments:

run.py 
--config-path path to configuration file

--instruction instuction
--action-file folder name in eval-save-folder where output will be saved

--qwen IP:PORT Add port to host_api_worker.py default 21002
--internvl IP:PORT
--florence IP:PORT

--use-eval True/False use reflection module
--use-florence-only True/False use only UI Location Module

--do-stop add to prompt STOP action
--option "1/2" decision prompt choice

--adb-path path to adb
--aapt-path path to aapt
--emu-path path ot emulator
--device-type real or emu
--device-id id of device
--avd-name name of emulator AVD
--snapshot run specified snapshot
--run-apps-through-adb run applications directly with app id using adb if present

--max-steps max number of steps per instruction 
--eval-save-folder folder name where output will be saved

🔬 Run Experiments

After modifying the config to what you like, you can now run experiments with the following commands:

cd tests
python run_test.py

📐 Main Results Reproduction

To reproduce the results shown in Table 1 of our paper, you must first host TinyClick, server and InternVL (vLLM is recommended).

After this use default values of config.ini.

TinyClick server

cd api
python host_florence.py

Host server

cd api
python host_api_worker.py

InternVL2-Llama3-76B server

pip install vllm
vllm serve OpenGVLab/InternVL2-Llama3-76B --served-model-name internlm2 --tensor-parallel-size 4

🏃 Run

python run.py

📝 To-Do List

Evaluate the ClickAgent on open-source models (e.g. GPT-4V)

📖 Citation

@misc{hoscilowicz2024enhancinguilocationcapabilities,
      title={Enhancing UI Location Capabilities of Autonomous Agents},
      author={Jakub Hoscilowicz and Bartosz Maj and Bartosz Kozakiewicz and Oleksii Tymoschuk and Artur Janicki},
      year={2024},
      eprint={2410.11872},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2410.11872},
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Enhancing UI Location Capabilities of Autonomous Agents

📺 Demo

☀️ Introduction

🚀 Getting Started

🔧 Installation

🤖 Android Environment Setup

📋 Used apps

📱 Emulator

⚙️ Configuration

🔬 Run Experiments

📐 Main Results Reproduction

TinyClick server

Host server

InternVL2-Llama3-76B server

🏃 Run

📝 To-Do List

📖 Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

Enhancing UI Location Capabilities of Autonomous Agents

📺 Demo

☀️ Introduction

🚀 Getting Started

🔧 Installation

🤖 Android Environment Setup

📋 Used apps

📱 Emulator

⚙️ Configuration

🔬 Run Experiments

📐 Main Results Reproduction

TinyClick server

Host server

InternVL2-Llama3-76B server

🏃 Run

📝 To-Do List

📖 Citation