Data

To make this framework easier to apply to training in different domains, we provide a script that converts training data into the expected format. The framework currently supports training on OCR, caption, interleaved image-text, and pure-text data. Because caption and interleaved datasets are large, we package them with webdataset.

A simple way to build webdataset (wds) shards is to run the provided script:

python ./data/process_wds.py
[2024-07-08 16:31:52.077279] start to write samples to shard ./tars/test-000001.tar
[2024-07-08 16:31:52.082296] complete to write samples to shard ./tars/test-000001.tar
[2024-07-08 16:31:52.072908] start to write samples to shard ./tars/test-000000.tar
[2024-07-08 16:31:52.088707] complete to write samples to shard ./tars/test-000000.tar
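
For readers who want to see what such shards contain, here is a minimal sketch that writes text-only shards with the standard library `tarfile` module. It is not the repository's `process_wds.py`; the helpers `write_shard` and `write_shards`, the sample keys, and the choice to store only text are assumptions made for illustration. It relies on the webdataset convention that a tar member's name is split at the first dot into a sample key and a field name, so a member named `s1.jpg.txt` is read back as field `jpg.txt` of sample `s1`.

```python
import io
import os
import tarfile


def write_shard(shard_path, samples):
    """Write (key, text) pairs into a single tar shard.

    Each sample is stored as a member named "<key>.jpg.txt".
    Keys must not contain dots, or the key/field split would change.
    """
    os.makedirs(os.path.dirname(shard_path) or ".", exist_ok=True)
    with tarfile.open(shard_path, "w") as tar:
        for key, text in samples:
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=f"{key}.jpg.txt")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))


def write_shards(out_dir, prefix, samples, samples_per_shard):
    """Split samples into shards named <prefix>-000000.tar, <prefix>-000001.tar, ..."""
    paths = []
    for start in range(0, len(samples), samples_per_shard):
        shard_index = start // samples_per_shard
        path = os.path.join(out_dir, f"{prefix}-{shard_index:06d}.tar")
        write_shard(path, samples[start:start + samples_per_shard])
        paths.append(path)
    return paths
```

Under these assumptions, calling `write_shards("./tars", "test", samples, 2)` with four `(key, text)` pairs would produce two files that follow the same naming pattern as the shards in the log above.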

DataLoader example

import webdataset as wds
from torch.utils.data import DataLoader

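# Brace notation expands to test-000000.tar and test-000001.tar.
dataset = wds.WebDataset("./tars/test-{000000..000001}.tar")
dataloader = DataLoader(dataset, batch_size=2, num_workers=1)

# No decoder is applied, so each field is returned as raw bytes.
for ind, row in enumerate(dataloader):
    print(ind, row["jpg.txt"])

0 [b'text_1', b'text_2']
1 [b'text_4', b'text_3']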
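
The batches printed above hold raw bytes because the dataset is read without a decoding step. A minimal sketch of converting one such batch into Python strings follows, assuming the text was stored as UTF-8. `decode_text_batch` is a hypothetical helper for illustration and is not part of webdataset or this repository.

```python
def decode_text_batch(raw_batch, encoding="utf-8"):
    """Convert a list of bytes objects (one batch field) into a list of str."""
    return [item.decode(encoding) for item in raw_batch]


# Example with the first batch from the output above.
print(decode_text_batch([b"text_1", b"text_2"]))  # prints ['text_1', 'text_2']
```

Inside the training loop this could be applied as `decode_text_batch(row["jpg.txt"])` before passing the text to a tokenizer.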