This document provides examples of fine-tuning Aria on four types of datasets: single-image data, multi-image data, video data, and code data.

Fine-tune on single-image dataset

We use a 30k subset of the RefCOCO dataset as an example. RefCOCO is a visual grounding task: given an image and a description of the referred object as input, the model is expected to output the corresponding bounding box. For each bounding box, we normalize its coordinates to [0,1000) and format it as "(x1,y1), (x2,y2)". Please refer to RefCOCO_Example for more details!
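The normalization described above can be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: the function name `format_bbox`, the pixel-space `(x1, y1, x2, y2)` input layout, truncation toward zero, and clamping to 999 are all assumptions made for this example.

```python
def format_bbox(box, width, height, scale=1000):
    """Normalize a pixel-space box to [0, scale) and render it as text.

    Assumption: `box` is (x1, y1, x2, y2) in pixels, with (x1, y1) the
    top-left corner and (x2, y2) the bottom-right corner.
    """
    def norm(value, size):
        # Scale to the target range, truncate to an integer, and clamp so
        # that a coordinate on the far image edge stays strictly below `scale`.
        return min(max(int(value * scale / size), 0), scale - 1)

    x1, y1, x2, y2 = box
    return (
        f"({norm(x1, width)},{norm(y1, height)}), "
        f"({norm(x2, width)},{norm(y2, height)})"
    )


if __name__ == "__main__":
    # A box covering part of a 640x480 image.
    print(format_bbox((64, 48, 320, 240), width=640, height=480))
```

Clamping keeps the output inside the half-open interval [0,1000) even when a box touches the right or bottom edge of the image.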

Fine-tune on multi-image dataset

We use the NLVR2 dataset as an example. NLVR2 (Natural Language for Visual Reasoning) is a task in which, given two images and a claim about them, the model must determine whether the claim is true by answering yes or no. Please refer to NLVR2_Example for details!

Fine-tune on video dataset

We use the NextQA dataset as an example. NextQA requires the model to select an answer from several options based on the input video and a question. The model is expected to output the letter of the correct option. Please refer to NextQA_Example for details!
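As a rough illustration of the multiple-choice setup, the sketch below turns a question and its options into a lettered prompt and maps the ground-truth index to the target letter. The helper names `build_prompt` and `answer_letter`, the prompt wording, and the zero-based answer index are hypothetical; the actual format used in NextQA_Example may differ.

```python
import string


def build_prompt(question, options):
    """Render a question and its options as a lettered multiple-choice prompt."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    # Assumed instruction text; not taken from the repository.
    lines.append("Answer with the option's letter.")
    return "\n".join(lines)


def answer_letter(answer_index):
    """Map a zero-based answer index to its option letter (0 -> 'A')."""
    return string.ascii_uppercase[answer_index]


if __name__ == "__main__":
    question = "Why did the boy pick up the ball?"
    options = ["to throw it", "to hide it", "to wash it"]
    print(build_prompt(question, options))
    print("Target:", answer_letter(0))
```

Training the model to emit a single letter keeps the target short and makes accuracy straightforward to compute by exact match.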

Fine-tune on code dataset

We use the Magicoder-Evol-Instruct-110k dataset as an example to further fine-tune Aria for generating high-quality code. Please refer to Code-SFT_Example for details!