Data

To make this framework easier to apply to training in different domains, we provide a script that converts training data into the expected format. The framework currently supports training on OCR, caption, interleaved image-text, and pure-text data. Because the caption and interleaved datasets are large, we package them with webdataset.

The following command shows a simple way to build webdataset (wds) shards:

python ./data/process_wds.py
[2024-07-08 16:31:52.077279] start to write samples to shard ./tars/test-000001.tar
[2024-07-08 16:31:52.082296] complete to write samples to shard ./tars/test-000001.tar
[2024-07-08 16:31:52.072908] start to write samples to shard ./tars/test-000000.tar
[2024-07-08 16:31:52.088707] complete to write samples to shard ./tars/test-000000.tar
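
The contents of process_wds.py are not reproduced here. As a rough illustration of what such a script might do, the sketch below writes shards using only the standard-library tarfile module. It relies on the WebDataset convention that files sharing a basename (the part before the first dot) form one sample. The helper names `write_shard` and `write_shards`, the sample keys, and the choice of `jpg` and `jpg.txt` as extensions are assumptions for illustration; the `jpg.txt` extension is inferred from the key read in the DataLoader example.

```python
import io
import os
import tarfile


def write_shard(shard_path, samples):
    """Write (key, image_bytes, caption) triples into one tar shard.

    Hypothetical helper: each sample becomes two tar members,
    <key>.jpg and <key>.jpg.txt, which WebDataset-style readers
    group into a single sample because they share a basename.
    """
    os.makedirs(os.path.dirname(shard_path) or ".", exist_ok=True)
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, caption in samples:
            members = (("jpg", image_bytes), ("jpg.txt", caption.encode("utf-8")))
            for suffix, payload in members:
                info = tarfile.TarInfo(name=f"{key}.{suffix}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def write_shards(out_dir, prefix, samples, samples_per_shard):
    """Split samples into consecutive shards named <prefix>-000000.tar, ..."""
    paths = []
    for start in range(0, len(samples), samples_per_shard):
        shard_index = start // samples_per_shard
        path = os.path.join(out_dir, f"{prefix}-{shard_index:06d}.tar")
        write_shard(path, samples[start:start + samples_per_shard])
        paths.append(path)
    return paths
```

Calling `write_shards("./tars", "test", samples, 2)` with four samples would produce two shards with the same naming pattern as the log above. For large datasets, the writer classes shipped with the webdataset package are likely a better fit than this hand-rolled version.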

DataLoader example

import webdataset as wds
from torch.utils.data import DataLoader

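# Brace notation expands to test-000000.tar and test-000001.tar
dataset = wds.WebDataset("./tars/test-{000000..000001}.tar")
dataloader = DataLoader(dataset, batch_size=2, num_workers=1)

# Each batch is a dict; "jpg.txt" holds the undecoded bytes stored under the .jpg.txt extension
for ind, row in enumerate(dataloader):
    print(ind, row["jpg.txt"])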

0 [b'text_1', b'text_2']
1 [b'text_4', b'text_3']
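
The values printed above are raw bytes because the dataset is read without a decoding step. A minimal sketch of converting one batch to strings follows; `decode_captions` is a hypothetical helper, and UTF-8 is an assumed encoding.

```python
def decode_captions(batch, encoding="utf-8"):
    """Convert a batch of byte strings into Python str objects."""
    return [item.decode(encoding) for item in batch]


# Example with the first batch shown above
print(decode_captions([b"text_1", b"text_2"]))  # ['text_1', 'text_2']
```

Inside the loop, this would be applied as `decode_captions(row["jpg.txt"])`.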
