Instruction-tuning dataset generation inspired by LLaVA-Instruct-158k, via any LLM. The goals:
- Become independent of LLaVA-Instruct-158k, which cannot be used commercially.
- Add more datasets, such as Open Images, to overcome the limited object class range of COCO.
- Add modalities other than images.
I currently plan to create the following datasets:
- An equivalent of LLaVA-Instruct-158k on the COCO dataset, using Llama 2 70B and Mixtral 8x7B.
- A more powerful instruction dataset on Open Images V7, including localized narratives, bounding boxes with metadata, image-level labels, and object relationships.
- Additional data sources on top of COCO and Open Images.
These improved datasets will help multimodal LLM architectures like LLaVA, which require pretraining, and even more so architectures like LaVIN, which use only an instruction-tuning step.
Roadmap:
- Improve inference speed of the OpenAI API with parallel requests (see the sketch after this list).
- Open Images V7 support: captions, boxes.
- Open Images V7 support for positive and negative image-level labels in the dataset.
- Fully process with Mistral 7B for a first commercially usable version for LLaVA training.
- Add a token count estimation tool for cost estimation.
- Add LICENSE information to generated files.
- Add support for motion data instruction dataset creation (e.g. from HumanML3D).
- Fully process with Llama 2 for a first commercially usable version for LLaVA training.
- Improve inference speed of Hugging Face models with batching.
- Improve inference speed of llama.cpp models with batching.
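Parallel requests against the OpenAI API are one of the roadmap items above; here is a minimal sketch of what that could look like with the openai Python client and asyncio. This is illustrative only, not this repo's implementation; the model name and prompts are placeholders.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def generate_one(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def generate_many(prompts: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(8)  # cap concurrency to stay under rate limits

    async def bounded(prompt: str) -> str:
        async with semaphore:
            return await generate_one(prompt)

    return await asyncio.gather(*(bounded(p) for p in prompts))

if __name__ == "__main__":
    answers = asyncio.run(generate_many(["Describe image 1 ...", "Describe image 2 ..."]))
```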
To generate a multimodal instruction dataset:
- Pick a dataset.
- Pick or set up a prompt config.
Available datasets are:

Source.COCO2014 and Source.COCO2017
- COCO has been used to generate the LLaVA-Instruct-158k dataset.
- Provides the following data for instruction dataset generation:
- captions: 5 sentences by different annotators describing the image
- object bounding boxes in the format (see the sketch below)
  category_name: [min_x, min_y, max_x, max_y]
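For illustration, a minimal sketch of how such a plain-text context could be assembled from COCO-style annotations; build_coco_context is a hypothetical name, not part of this repo:

```python
# Hypothetical helper (illustration only, not this repo's code): flatten
# COCO-style captions and boxes into the plain-text context handed to the LLM,
# using the "category_name: [min_x, min_y, max_x, max_y]" format shown above.
def build_coco_context(captions, boxes):
    lines = list(captions)  # the caption sentences from different annotators
    for category_name, (min_x, min_y, max_x, max_y) in boxes:
        lines.append(f"{category_name}: [{min_x:.2f}, {min_y:.2f}, {max_x:.2f}, {max_y:.2f}]")
    return "\n".join(lines)

print(build_coco_context(
    ["A person walks a dog across a park lawn."],
    [("person", (0.12, 0.30, 0.45, 0.88)), ("dog", (0.40, 0.60, 0.62, 0.92))],
))
```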
Source.OPENIMAGESV7
- Provides the following data for instruction dataset generation:
- captions: narratives from voice recordings of annotators describing the image in one or more sentences.
- object bounding boxes in the format (see the sketch below)
  category_name: [min_x, min_y, max_x, max_y] [confidence, is_occluded, is_truncated, is_group_of, is_depiction, is_inside]
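And a matching hedged sketch for the richer Open Images box line; format_openimages_box is likewise hypothetical, and the flag values are the 0/1 booleans shipped with the Open Images annotations:

```python
# Hypothetical helper (illustration only): render one Open Images V7 box in the
# format above, appending confidence and the five annotation flags.
def format_openimages_box(category_name, box, confidence, flags):
    min_x, min_y, max_x, max_y = box
    # flags = (is_occluded, is_truncated, is_group_of, is_depiction, is_inside)
    values = ", ".join(str(v) for v in (confidence, *flags))
    return f"{category_name}: [{min_x:.2f}, {min_y:.2f}, {max_x:.2f}, {max_y:.2f}] [{values}]"

print(format_openimages_box("Dog", (0.10, 0.25, 0.55, 0.90), 1, (0, 0, 0, 0, 0)))
# Dog: [0.10, 0.25, 0.55, 0.90] [1, 0, 0, 0, 0, 0]
```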
Example invocations:

python generate.py COCO2014 --model_source huggingface --model meta-llama/Llama-2-7b-chat-hf
python generate.py COCO2014 --model_source huggingface --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --prompt_config prompt_config_llava_smallcontext.yaml
python generate.py COCO2014 --model_source llama.cpp --model ./PATH/TO/MODEL.gguf
python generate.py COCO2014 --model_source openai --model gpt-3.5-turbo
python generate.py COCO2014 --model_source openai --model mymodel --openai_base_url BASE_URL
python generate.py OPENIMAGESV7 --prompt_config prompt_config_openimagesv7.yaml ...
- Hugging Face chat models: only models that ship a chat template in their tokenizer_config.json are supported (see the sketch below).
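A minimal sketch of what that constraint means in practice, using the transformers chat-template API; this mirrors the library API, not this repo's internals:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # a chat model from the examples above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# apply_chat_template only works if tokenizer_config.json defines a chat_template,
# which is exactly the constraint noted above.
messages = [{"role": "user", "content": "Write one question about an image of a dog in a park."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```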