Rewrite the Dataset Class? #158

rainyl · 2022-06-01T09:55:27Z

Necessity

More clear

Currently, the Dataset class does not extend torch.utils.data.Dataset, so many functions were implemented by hand, and I think the current dataset class is not clear enough. For example, we could use torchvision to read and process images without cv2. (Reducing dependencies is also important, since it helps when deploying to the desktop.)

Easy to load

By combining torch.utils.data.Dataset and torch.utils.data.DataLoader, loading data becomes clearer and easier. For example, PyTorch's DataLoader makes it easier to load data in parallel.

Drop pickle

I don't think it is necessary to generate a dataset pickle and then load it whenever we want to train. It is easy to just provide a math.txt file containing the LaTeX code, plus an image directory, laid out as follows:

-dataset
 -math.txt
 -train
  -0.png
  -3.png
  -10.png
   ...

0.png means this image was generated from line 0 of math.txt.

This way, we don't need to regenerate the dataset's pickle file every time we modify the dataset, i.e. add new LaTeX code and images.
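
A minimal sketch of how the layout above could be indexed, assuming every image is named `<line index>.png` with zero-based line indices, as in the 0.png example. The function name `load_pairs` and the `split` parameter are illustrative and not part of the project:

```python
from pathlib import Path


def load_pairs(root, split="train"):
    """Pair each image in root/split with its LaTeX line in root/math.txt.

    Assumes images are named "<line index>.png" (zero-based). Lines without
    an image, and images without a matching line, are skipped.
    """
    root = Path(root)
    formulas = (root / "math.txt").read_text(encoding="utf-8").splitlines()
    pairs = []
    for image_path in (root / split).glob("*.png"):
        if not image_path.stem.isdigit():
            continue  # ignore files that do not follow the naming scheme
        index = int(image_path.stem)
        if index < len(formulas):
            pairs.append((index, image_path, formulas[index]))
    # Sort numerically so that 10.png comes after 3.png
    pairs.sort(key=lambda p: p[0])
    return [(path, latex) for _, path, latex in pairs]
```

A torch.utils.data.Dataset could then return the image and the tokenized formula for `pairs[i]` in `__getitem__`, so adding a new line and image would not need a regeneration step.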

lukas-blecher · 2022-06-01T10:43:44Z

lukas-blecher
Jun 1, 2022
Maintainer

Dataset

When I started the project I was aware of the advantages of the torch.utils.data module, but I had one problem to solve (that I still don't know how to handle in the best way), which is the batching of a set of images with different dimensions.
That's how the dataset/dataloader class I initially implemented came about.

Recently, @TITC has been chipping away at these problems with PRs #150 and #154. These allow the dataset to be used in a more PyTorch-like way, but underneath it's still the same dataset class.

I haven't looked into it in much detail but maybe the batch sampler and collate_fn can help here.
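
One way a batch sampler could sidestep the problem is to batch only images that share the same size. A rough sketch, assuming the (height, width) of every sample is known up front (for example from the pickle file); `SameSizeBatchSampler` is a hypothetical name, not an existing class in this repository or in PyTorch:

```python
import random
from collections import defaultdict


class SameSizeBatchSampler:
    """Yield lists of dataset indices whose images all share one (H, W).

    `sizes[i]` is the (height, width) of sample i. An instance can be passed
    to torch.utils.data.DataLoader through its `batch_sampler` argument.
    """

    def __init__(self, sizes, batch_size, shuffle=True, seed=None):
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)
        # Bucket sample indices by image size
        self.buckets = defaultdict(list)
        for index, size in enumerate(sizes):
            self.buckets[tuple(size)].append(index)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            indices = list(indices)
            if self.shuffle:
                self.rng.shuffle(indices)
            # Cut each bucket into chunks of at most batch_size
            for start in range(0, len(indices), self.batch_size):
                batches.append(indices[start:start + self.batch_size])
        if self.shuffle:
            self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        return sum(
            -(-len(indices) // self.batch_size)  # ceiling division
            for indices in self.buckets.values()
        )
```

Since every batch then contains a single image size, the default collate function can stack the images without any padding. The trade-off is that rare sizes produce small batches.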

How did you solve the problem?

cv2 is still needed for the pad function anyway, but you're right about the general goal of reducing dependencies.

Pickle

I don't see how this is an improvement. The main advantage of the pickle file is that the information about the image size is connected to each image-text pair. This way the data loading can start right away.
Once again the reason is that the images can have different dimensions.
Sorry if I misunderstood you here.

3 replies

rainyl Jun 1, 2022
Author

Yes, collate_fn will solve the dimension problem. The function's parameter is [[a1, b1], [a2, b2], ...], as returned by the dataset class, and its output should look like [[a1, a2, ...], [b1, b2, ...]], as shown below:

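import torch
from torch.nn.utils.rnn import pad_sequence
import torchvision.transforms.functional as VF

# CONF (providing pad_token) is defined elsewhere

def dl_collate_pad(batch):
    # list of B images, each 1 x H x W with varying H and W
    data = [b[0] for b in batch]
    # [B * tokenizers.Encoding]
    tgt = [b[1] for b in batch]
    # pad tgt
    ids = [torch.Tensor(t).to(torch.int32) for t in tgt]
    # B * max(t.num_tokens)
    tgt = pad_sequence(ids, batch_first=True, padding_value=CONF.pad_token)

    # pad data on the right and bottom to the largest H and W in the batch
    # (VF.pad takes padding as [left, top, right, bottom])
    max_dim_h = max(d.shape[-2] for d in data)
    max_dim_w = max(d.shape[-1] for d in data)
    data = torch.cat(
        [
            VF.pad(d, [0, 0, max_dim_w - d.shape[-1], max_dim_h - d.shape[-2]], 0)
            for d in data
        ],
        dim=0,
    ).unsqueeze(1)  # B x 1 x H x W

    return [data, tgt]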

Then pass this function to the DataLoader's collate_fn parameter.
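
For illustration, here is a self-contained sketch of that wiring on a toy dataset. Everything in it is an assumption made for the example: `ToyFormulaDataset`, `collate_pad` and `PAD_TOKEN` are made-up names, the images are random tensors, and padding uses torch.nn.functional.pad instead of torchvision so the snippet only depends on torch:

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_TOKEN = 0  # assumed padding id; the snippet above uses CONF.pad_token


class ToyFormulaDataset(Dataset):
    """Stand-in dataset: random 1 x H x W images with random token ids."""

    def __init__(self, sizes, lengths):
        self.sizes = sizes
        self.lengths = lengths

    def __len__(self):
        return len(self.sizes)

    def __getitem__(self, i):
        h, w = self.sizes[i]
        image = torch.rand(1, h, w)
        tokens = torch.randint(1, 100, (self.lengths[i],))
        return image, tokens


def collate_pad(batch):
    images, tokens = zip(*batch)
    max_h = max(img.shape[-2] for img in images)
    max_w = max(img.shape[-1] for img in images)
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    images = torch.stack([
        F.pad(img, (0, max_w - img.shape[-1], 0, max_h - img.shape[-2]))
        for img in images
    ])
    tokens = pad_sequence(list(tokens), batch_first=True, padding_value=PAD_TOKEN)
    return images, tokens


dataset = ToyFormulaDataset(
    sizes=[(32, 64), (48, 96), (16, 128), (32, 32)],
    lengths=[5, 9, 3, 7],
)
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate_pad)

for images, tokens in loader:
    print(images.shape, tokens.shape)
```

Each batch is padded only to the largest height and width it contains, so batch shapes differ from one step to the next. Raising the DataLoader's `num_workers` argument would move the loading into worker processes.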

Pickle is great, but for incremental training it won't be a great solution. For now, though, it works fine.
Anyway, it's only a suggestion to make data loading clearer and more elegant.

lukas-blecher Jun 1, 2022
Maintainer

Appreciate it, thanks.
I've not tested it but I would guess the data loading is faster with the current solution because no image padding is necessary.
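
To get a rough feel for how much extra data padding could add, one can compare the pixels in a padded batch with the pixels in the original images. This is only an illustrative calculation with made-up sizes, not a benchmark of either approach, and `padding_overhead` is a hypothetical helper:

```python
def padding_overhead(sizes):
    """Return the fraction of a padded batch that consists of padding.

    `sizes` is a list of (height, width) pairs; the batch is padded to the
    largest height and the largest width it contains.
    """
    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    real_pixels = sum(h * w for h, w in sizes)
    padded_pixels = len(sizes) * max_h * max_w
    return 1 - real_pixels / padded_pixels


# Two images of very different shape: most of the padded batch is filler
print(padding_overhead([(32, 256), (128, 64)]))  # 0.75
# Identical shapes need no padding at all
print(padding_overhead([(64, 64), (64, 64)]))  # 0.0
```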
I do like your approach better though 👍

rainyl Jun 2, 2022
Author

yes, you are right :)
