
# Dataset Preparation

## Pretraining

During the image-based training stage, our dataset comprises 558K image-caption pairs from LAION-CCSBU and 708K image-caption pairs from the ALLaVA-4V-Caption dataset, for a total of 1.26M image-caption pairs used in pretraining.

## Fine-tuning

The datasets employed for instruction tuning comprise the 665K mixture dataset from LLaVA-Instruct, 692K instructions from the ALLaVA-4V-Instruction dataset, and an additional 25K instructions drawn from ShareGPT4V, DocVQA, DVQA, and AI2D, totaling more than 1.3M image-text conversations.

## Download Images for Training

Download the image data listed below and place each set in the indicated folder:

- LLaVA-1.5 pretrain images -> `data/LLaVA-Pretrain/images`
- ALLaVA-4V-LAION and ALLaVA-4V-Vision-FLAN images -> `data/allava_laion/images`, `data/allava_vflan/images`
- COCO -> `data/coco/train2017`
- GQA -> `data/gqa/images`
- OCR-VQA -> `data/ocr_vqa/images`
- TextVQA -> `data/textvqa/train_images`
- VG-Part1, VG-Part2 -> `data/vg/VG_100K`, `data/vg/VG_100K_2`
- Web-Celebrity, Web-Landmark, WikiArt, and Share-TextVQA images from ShareGPT-4V -> `data/web-celebrity/images`, `data/web-landmark/images`, `data/wikiart/images`, `data/share_textvqa/images`
- AI2D -> `data/ai2d/images`
- DocVQA -> `data/docvqa/images`
- DVQA -> `data/dvqa/images`
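Datasets hosted on Hugging Face can be fetched with `huggingface_hub`. Below is a minimal sketch for the LLaVA-1.5 pretrain images; it assumes the archive is published as `images.zip` in the `liuhaotian/LLaVA-Pretrain` dataset repo, so check each dataset's page and adjust the repo id, filename, and extraction target accordingly.

```python
# Hedged sketch: download and unpack the LLaVA-Pretrain images into the
# layout shown below. The repo id and archive name are assumptions;
# verify them on the dataset page before running.
from zipfile import ZipFile
from huggingface_hub import hf_hub_download

zip_path = hf_hub_download(
    repo_id="liuhaotian/LLaVA-Pretrain",  # assumed dataset repo
    filename="images.zip",                # assumed archive name
    repo_type="dataset",
)
with ZipFile(zip_path) as zf:
    zf.extractall("data/LLaVA-Pretrain/images")
```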

The complete structure is as follows:

```
MG-LLaVA
├── data
│   ├── LLaVA-Pretrain
│   │   ├── images
│   ├── ai2d
│   │   ├── images
│   ├── allava_laion
│   │   ├── images
│   ├── allava_vflan
│   │   ├── images
│   ├── coco
│   │   ├── train2017
│   ├── docvqa
│   │   ├── images
│   ├── dvqa
│   │   ├── images
│   ├── gqa
│   │   ├── images
│   ├── ocr_vqa
│   │   ├── images
│   ├── share_textvqa
│   │   ├── images
│   ├── textvqa
│   │   ├── train_images
│   ├── vg
│   │   ├── VG_100K
│   │   ├── VG_100K_2
│   ├── web-celebrity
│   │   ├── images
│   ├── web-landmark
│   │   ├── images
│   ├── wikiart
│   │   ├── images
```
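A quick way to catch a misplaced download before training is to check that every expected folder exists and is non-empty. This is a minimal standalone sketch against the layout above; the folder list simply mirrors the tree and is not part of MG-LLaVA's code.

```python
# Minimal layout check for the directory tree above: report any expected
# image folder that is missing or empty under data/.
from pathlib import Path

EXPECTED = [
    "LLaVA-Pretrain/images", "ai2d/images", "allava_laion/images",
    "allava_vflan/images", "coco/train2017", "docvqa/images",
    "dvqa/images", "gqa/images", "ocr_vqa/images",
    "share_textvqa/images", "textvqa/train_images",
    "vg/VG_100K", "vg/VG_100K_2", "web-celebrity/images",
    "web-landmark/images", "wikiart/images",
]

root = Path("data")
for rel in EXPECTED:
    folder = root / rel
    if not folder.is_dir():
        print(f"MISSING  {folder}")
    elif not any(folder.iterdir()):
        print(f"EMPTY    {folder}")
    else:
        print(f"OK       {folder}")
```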

## Download Annotation Files

We employ RAM-Plus and OWL-ViT to generate bounding boxes for training and evaluation. Our training annotation files and bounding-box annotation files are available on Hugging Face. Please download them and modify `data_path` and `box_json_path` in your config file.
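For reference, here is a minimal sketch of how those two fields might look, assuming the config is a plain Python module; both filenames are placeholders standing in for the files you actually download.

```python
# Hypothetical excerpt from a training config; the filenames are
# placeholders, not the repo's actual file names.
data_path = "data/json/train_annotations.json"  # placeholder: training annotations
box_json_path = "data/json/train_bboxes.json"   # placeholder: bounding-box annotations
```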

If you want to generate the bounding boxes yourself, refer to `image_offline_to_bbox.py`: download RAM and OWL-ViT2, modify `data_file`, `image_folder`, and `save_json_path`, then run the following command:

```shell
torchrun --nproc_per_node=8 mg_llava/bbox_generation/image_offline_to_bbox.py
```
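To see what that script computes per image, the fragment below runs open-vocabulary detection with OWLv2 from `transformers`, using a hand-written tag list in place of RAM-Plus output; the checkpoint name and the 0.2 score threshold are illustrative choices, not the repo's exact settings.

```python
# Single-image sketch of the tag -> box step that image_offline_to_bbox.py
# runs offline at scale. The tag list stands in for RAM-Plus output.
import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("data/coco/train2017/000000000009.jpg")  # any training image
tags = ["person", "bowl", "broccoli"]  # stand-in for RAM-Plus tags

inputs = processor(text=[tags], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert model outputs to (score, label, box) triples in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
result = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]
for score, label, box in zip(result["scores"], result["labels"], result["boxes"]):
    print(f"{tags[label]}: {score:.2f} -> {[round(v, 1) for v in box.tolist()]}")
```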

## Download Data for Evaluation

Most of the evaluation benchmarks used in our paper can be found in LLaVA.

The bounding-box annotation files for evaluation are available on Hugging Face.