Hi, I notice in configs/clm_models/llm_seed_x_lora.yaml that special patch tokens are used to tag the beginning and end of patches, which represent the split original image in raster-scan order. According to the paper, image patches are added to the multi-modal inputs to enable any-resolution image generation, which is also a novelty of SEED-X.
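A minimal sketch of how I picture the layout (the helper name, the 2x2 patching, and the ordering of patch views vs. the global view are my assumptions, not taken from the repo):

```python
# Hypothetical illustration: an any-resolution image split into 2x2 patches
# (raster-scan order) plus a resized global view, each tagged with the
# special tokens from the vocabulary. Whether <patch>/</patch> wraps each
# patch individually, as shown here, is an assumption.
IMG_SLOTS = "".join(f"<img_{i:05d}>" for i in range(64))  # 64 visual token slots

def tag_image(num_patches: int = 4) -> str:
    parts = []
    for _ in range(num_patches):  # top-left, top-right, bottom-left, bottom-right
        parts.append(f"<patch>{IMG_SLOTS}</patch>")
    parts.append(f"<img>{IMG_SLOTS}</img>")  # global (resized) view of the whole image
    return "".join(parts)
```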
However, I have the same question about the 100 visual tokens. I found that 64 tokens are used to represent an image, but vocab_size is actually 32330. Maybe the extra tokens were reserved to explore whether adding more visual tokens improves performance.
Looking forward to your reply!
Yes, patch tokens are used to tag the beginning and end of image patches. @AsteriaCao's understanding is correct.
We actually use N=64 visual tokens to represent an image; the remaining 36 visual tokens are not used during training.
For the ablation study on the number of visual tokens used to represent an image, you can refer to https://github.com/geyuying/SEED-X. We find that more visual tokens lead to better image reconstruction by the image de-tokenizer, but worse regression by the MLLM.
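A minimal sketch of the 64-of-100 split, assuming (for illustration only) that the learnable visual queries live in a single embedding tensor; `query_embeds` and the 4096 hidden size are made-up names/values:

```python
import torch

NUM_RESERVED = 100  # <img_00000> ... <img_00099> ids reserved in the vocabulary
NUM_USED = 64       # N=64 tokens actually used to represent an image

# Stand-in for the learnable visual query embeddings (name is illustrative).
query_embeds = torch.randn(NUM_RESERVED, 4096)

# Only the first 64 slots are trained/regressed; the remaining 36 ids
# simply sit unused in the vocabulary.
used = query_embeds[:NUM_USED]
assert used.shape == (64, 4096)
```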
Hi, congrats on your interesting work!
I have a question about the mismatch in the number of tokens between the paper and the implementation.
As far as I understand while reading the paper:
LLaMA has 32,000 tokens, and SEED-X adds 290 new tokens for LLaMA as described in the paper (224 <loc-*> tokens, the <img> and </img> boundary tokens, and 64 visual tokens):
(Total: 290 tokens)
But as I run the code, I found that there are actually 330 added tokens:
- <loc-0> ... <loc-223> (224 tokens)
- <img> and </img> (2 tokens)
- <img_00000> ... <img_00099> (100 tokens)
- <box_start> and <box_end> (2 tokens)
- <patch> and </patch> (2 tokens)
(Total: 330 tokens!)
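For what it's worth, the 330 count can be reproduced with a quick tokenizer sketch (the checkpoint path is a placeholder; this only mirrors the list above, not the repo's actual setup code):

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")  # vocab_size == 32000

new_tokens = (
    [f"<loc-{i}>" for i in range(224)]        # 224 location tokens
    + ["<img>", "</img>"]                      # 2 image boundary tokens
    + [f"<img_{i:05d}>" for i in range(100)]   # 100 visual token ids
    + ["<box_start>", "<box_end>"]             # 2 bounding-box tokens
    + ["<patch>", "</patch>"]                  # 2 patch boundary tokens
)
print(tokenizer.add_tokens(new_tokens, special_tokens=True))  # 330
print(len(tokenizer))                                         # 32330
```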
My questions are:
1. Why does the implementation add 330 tokens rather than the 290 described in the paper?
2. What are the <patch> and </patch> tokens for, and why are 100 <img_000xx> tokens reserved when only 64 are used to represent an image?
Thank you so much!