Generating Fonts/Writing from Text Tokens #339
-
16K VQGAN vs. 1024 VQGAN (f=16)
Early tests with the 1024 codebook pretrained on ImageNet suggest it's not very good at representing text. Here is an early comparison between the 16K-codebook VQGAN and the 1024-codebook VQGAN, both with the same patch size (f=16). Generally speaking, during training the 1024-sized codebook just can't seem to get past these results. I would do comparisons with the supposedly stronger Gumbel VQGAN (8k codebook, f=8), but I've had a lot of trouble getting that one to train, and the CompVis team has been sort of tough to get ahold of with regard to how to decode the thing.
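A minimal round-trip sketch for this kind of side-by-side check, assuming the VQGanVAE wrapper in dalle_pytorch.vae (get_codebook_indices / decode) and locally downloaded taming-transformers weights; the checkpoint, config and image paths are placeholders:

# Encode a text-bearing image with each pretrained VQGAN and decode it back,
# then compare how legible the reconstructed lettering is.
import torch
import torchvision.transforms as T
from PIL import Image
from dalle_pytorch.vae import VQGanVAE

preprocess = T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()])

@torch.no_grad()
def reconstruct(vae, pil_image):
    x = preprocess(pil_image).unsqueeze(0)   # (1, 3, 256, 256) in [0, 1]
    codes = vae.get_codebook_indices(x)      # (1, 256) token ids for an f=16 model
    return vae.decode(codes)                 # (1, 3, 256, 256) in [0, 1]

img = Image.open('meme_sample.png').convert('RGB')

vaes = {
    '16384': VQGanVAE(vqgan_model_path='vqgan.16384.ckpt', vqgan_config_path='vqgan.16384.yaml'),
    '1024':  VQGanVAE(vqgan_model_path='vqgan.1024.ckpt',  vqgan_config_path='vqgan.1024.yaml'),
}

for name, vae in vaes.items():
    recon = reconstruct(vae, img).squeeze(0).clamp(0, 1)
    T.ToPILImage()(recon).save(f'recon_{name}.png')  # eyeball the text in each reconstruction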
-
Code used: I'm using the new AugLy augmentation library to write the caption onto each image in dalle_pytorch/loader.py:
# in dalle_pytorch/loader.py
import PIL.Image
import torchvision.transforms as T
import augly.image as imaugs
import augly.text as textaugs
# ...
self.image_transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
    T.CenterCrop((192, 256)),  # `memeify` takes up the top 64 pixels.
])
# ...
substring = description[:20] + "\n" + description[20:40]  # wrap the caption onto two lines
pil_image = PIL.Image.open(image_file)
top_cut_image = self.image_transform(pil_image)
aug_image = imaugs.meme_format(top_cut_image, text=substring, opacity=1.0, caption_height=64)
image_tensor = T.Compose([
    # T.CenterCrop(256),  # not needed: 192 px of image + 64 px caption = 256
    T.ToTensor(),
])(aug_image)
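A quick standalone sanity check of the same augmentation (the image path and caption string are placeholders):

# Render one caption the meme_format way and confirm the result is 256x256.
import PIL.Image
import augly.image as imaugs
import torchvision.transforms as T

description = "the quick brown fox jumps over the lazy dog"
substring = description[:20] + "\n" + description[20:40]

img = PIL.Image.open('sample.jpg').convert('RGB')
img = T.CenterCrop((192, 256))(img)  # leave room for the 64-pixel caption band
meme = imaugs.meme_format(img, text=substring, opacity=1.0, caption_height=64)
meme.save('meme_preview.png')
print(T.ToTensor()(meme).shape)  # expect torch.Size([3, 256, 256])

With an f=16 VQGAN, that 64-pixel caption band should correspond to the top four rows of the 16x16 grid of image tokens, so the rendered text always occupies the same block of codes.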
-
I've been training a DALL-E with the goal of seeing whether or not a caption could be used to visualize the text itself in RGB pixels. I'm limited by my GPU, but early results are certainly interesting. I'm using the oft-ignored weights from OpenAI's dVAE under the assumption they would better represent text (since that is mentioned as an explicit goal in the DALL-E paper). These early results are promising, so I'm switching back to the pretrained VQGAN from CompVis to see whether it can represent letters graphically as well as the dVAE.
In this last example, it seems to be reusing codes found for text generation in the actual image itself, which is what I was hoping for.
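For reference, the overall setup is roughly the repo README recipe with OpenAI's discrete VAE plugged in; a sketch using dalle_pytorch's building blocks (the hyperparameters and tokenized caption below are illustrative, not the exact training configuration):

# OpenAI's pretrained dVAE supplying the image tokens for a small DALL-E.
import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()  # the "oft-ignored" OpenAI dVAE weights
# vae = VQGanVAE(...)      # swap in the CompVis VQGAN wrapper for the comparison

dalle = DALLE(
    dim = 512,
    vae = vae,               # frozen VAE that tokenizes / detokenizes images
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 4,
    heads = 8,
    dim_head = 64,
)

# Training step: caption token ids plus images from the meme-format loader.
text = torch.randint(0, 10000, (1, 256))
images = torch.randn(1, 3, 256, 256)
loss = dalle(text, images, return_loss = True)
loss.backward()

# After training, generate from a caption and inspect whether the rendered
# letters reuse the same image codes as the captions baked into the training set.
generated = dalle.generate_images(text, filter_thres = 0.9)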
@rom1504 @mehdidc @janEbert @robvanvolt @johnpaulbin @kobiso may be interested.